Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for CompactTest failure #2127

Closed
wants to merge 1 commit into from
Closed

Fix for CompactTest failure #2127

wants to merge 1 commit into from

Conversation

sarveshtamba
Copy link
Contributor

@sarveshtamba sarveshtamba commented Aug 20, 2019

I debugged the failing CompactTest test case and have managed to find the root cause of the failure.

After understanding the logic of the test case and tracing the code flow, I realised that the failure happened due to the incorrect assert check at the following location:-
https://github.com/apache/couchdb/blob/master/test/elixir/test/compact_test.exs#L46

This is because the final data size after deletion & further compaction is more than the deleted data size after only deletion, but not compaction.
The opposite was being checked due to which the test case was failing consistently.

Following are the values of the variables in question that I managed to trace:-

	CompactTest
	  * test compaction reduces size of deleted docs

	Value of orig_data_size = 4436.

	Value of orig_disk_size = 103907.

	Value of deleted_data_size = 7455.

	Value of final_data_size = 11924.

	Value of final_disk_size = 218681.

	  * test compaction reduces size of deleted docs (18819.2ms)

I have made the necessary changes and submitted this PR for the above fix.

@sarveshtamba
Copy link
Contributor Author

@kocolosk can you please review this one? If you are not the right person, can you suggest who is the right person to review this? Thanks!

@sarveshtamba
Copy link
Contributor Author

@wohali @nickva @kocolosk Please suggest who can review this PR?

@iilyak
Copy link
Contributor

iilyak commented Aug 26, 2019

If I recall correctly this used to be a flaky test. The current check seem to be correct: final_data_size < deleted_data_size. The failure is observed because we cannot detect compaction completion reliably.

@sarveshtamba
Copy link
Contributor Author

Hi @iilyak ,

Thanks for the update.
I observed that the final_data_size always remained greater than deleted_data_size even after compaction, and eventually the test failed after timeout.

Also I saw that the following check tests the deleted_data_size after deletion is greater that orig_data_size, so related that final_data_size after compaction would be greater than deleted_data_size.
https://github.com/apache/couchdb/blob/master/test/elixir/test/compact_test.exs#L31

Was not aware that we cannot detect compaction completion reliably.

Please let me know me if this is incorrect.

@iilyak
Copy link
Contributor

iilyak commented Aug 26, 2019

I am still convinced that we have a bug in either compactor or code which detects that compaction is finished. Your numbers confirms that. Value of orig_disk_size = 103907 compared to Value of final_disk_size = 218681. Compaction process suppose to compact file i.e. make it smaller than the original. In your test run it seems to indicate that the file became bigger after compaction.

@iilyak
Copy link
Contributor

iilyak commented Aug 26, 2019

There is third possibility which is incorrect calculation of sizes for attachments. The current test you are trying to fix uses attachments.

I remember there were fixes (by @eiri) in the area.

The PR is not merged yet.

@sarveshtamba
Copy link
Contributor Author

@iilyak can you please suggest a way forward? This is the only test failure that we are encountering and remains a blocker for our activities. Is there a way we can skip this test temporarily for now until the root cause of compaction or calculation of attachment sizes is resolved?

@sarveshtamba
Copy link
Contributor Author

@iilyak any updates for me on this one?

@eiri
Copy link
Member

eiri commented Aug 27, 2019

I remember there were fixes (by @eiri) in the area.

fwiw my none-approved fixes were for external, not for data block that's in question here.

I thought this test is already in skip list, the tag @tag :skip_on_jenkins indicates this?

@wohali
Copy link
Member

wohali commented Aug 27, 2019

@eiri The check in test/test_helper.exs is not sufficient to get this test to skip on Travis (which uses TRAVIS_BUILD_NUMBER and won't work if you're just running make check on the command line, which is undoubtedly what @sarveshtamba is doing. :)

@eiri
Copy link
Member

eiri commented Aug 27, 2019

@wohali Ah, I see. I should admit I'm still not very familiar with elixir test suite.

The fix doesn't look right to me, tbh, I mean it's strange to call test "compaction reduces size of deleted docs" and then change test to assert an exact opposite 😄 If test is repeatedly failing we need to find the root cause of this failure and fix it.

I appreciate it's not a simple task, sizes calculation, especially in combination with compaction is convoluted, so maybe someone more proficient than me in elixir suite can disable this test for time being and we'll get back to it in version 3.x

@sansato
Copy link

sansato commented Aug 27, 2019

data_size is deprecated. Does info["sizes.file"] or info["sizes"]["file"] give a better answer?

@wohali
Copy link
Member

wohali commented Aug 28, 2019

@sansato it's the exact same data... we're just going to remove the data_size field at some point from the output.

see:

{disk_size, FileSize}, % legacy
{data_size, ExternalSize}, % legacy
{sizes, {[
{file, FileSize},
{active, ActiveSize},
{external, ExternalSize}
]}},

data_size = sizes.external
disk_size = sizes.file
nothing = sizes.active

@eiri
Copy link
Member

eiri commented Aug 28, 2019

@sansato I'd say info["sizes.active"] is the best option here, it's a size of "live" data in bytes, so it should go down when doc deleted and db then compacted.

But I agree with @iilyak, the fact that in your trace finale data and disk sizes bigger than the originals indicates that compaction not finished at the time of final test assertion, i.e. it's not the question of what to compare, but more of a question when to compare.

@sarveshtamba
Copy link
Contributor Author

@sansato I'd say info["sizes.active"] is the best option here, it's a size of "live" data in bytes, so it should go down when doc deleted and db then compacted.

But I agree with @iilyak, the fact that in your trace finale data and disk sizes bigger than the originals indicates that compaction not finished at the time of final test assertion, i.e. it's not the question of what to compare, but more of a question when to compare.

@iilyak @eiri @sansato I have run this compaction test with increased timeouts of 99999999 ms (~27 hours), yet the compaction doesn't complete and the test fails/times out.

I appreciate it's not a simple task, sizes calculation, especially in combination with compaction is convoluted, so maybe someone more proficient than me in elixir suite can disable this test for time being and we'll get back to it in version 3.x

@eiri I agree with you on this, I myself am quite new to couchdb/elixir/erlang and it was a challenging task, though I could figure above details with the limited amount of knowledge I could gather.
Could we somehow disable this test for now and revisit it later, so that we can build the entire couchdb & the test suites off top of the tree/latest master branch without having to make any changes at the users end?

@sarveshtamba
Copy link
Contributor Author

@iilyak @eiri any decision on this one?

@iilyak
Copy link
Contributor

iilyak commented Aug 29, 2019

The decision for this PR is:

  • current test checks for the right condition
  • we shouldn't inverse the condition (-1 to this PR in its current form)
  • couchdb community should investigate why compaction is not compacting (or cannot complete) in this specific case
  • the best we can do is to disable this test until we figure out the fix for the compactor (would you mind to issue a PR to disable the test?)
    • the tests can be disabled using @tag :pending

@jaydoane
Copy link
Contributor

the tests can be disabled using @tag :pending

If we do end up disabling the test, I would encourage adding a comment pointing to this issue so that future code spelunkers will understand the history behind the decision.

@sarveshtamba
Copy link
Contributor Author

Hi @iilyak @jaydoane thanks for the updates. I have opened PR #2157 to disable this test until we figure out the fix for the compactor. Added comments and reference to this issue. Please review and approve the new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants