
[FLINK-2545] add bucket member count verification while build bloom filter #1067

Closed
wants to merge 1 commit

Conversation

ChengXiangLi

Bug fix; see the detailed error message at FLINK-2545.

@zentol
Contributor

zentol commented Aug 27, 2015

Have you verified that returning at that position does not cause other issues? This is essentially just swallowing the thrown exception, hoping nothing else goes wrong.

I don't see how this actually fixes the issue. The count being negative tells us something is wrong with how the bucket count is set. Resolving that would be a fix.

@StephanEwen
Contributor

@ChengXiangLi Do you know what caused the problem initially? I was puzzled, because the count in the bucket should never be negative, and a zero sized bucket should work with your original code.

It would be great to capture that error, to see whether the root bug was actually somewhere else (not in the bloom filters) but in other parts of the hash table structure.

Hopefully Greg can help us to reproduce this...

@ChengXiangLi
Author

Thanks for the reminder, @zentol and @StephanEwen; I was too hasty in opening this PR. I had intended to fix the exception in the bloom filter in this PR and verify the other potential hash table issues behind the negative count separately, but obviously there is no need to do it that way. So let's wait for Greg's response now.

@greghogan
Contributor

I am currently running release-0.10.0-milestone-1.

Debugging with Eclipse and looking at MutableHashTable.initTable, numBuckets is computed as 16086. There are 63 memory segments with 256 buckets each = 16128 total buckets. The last 16128 - 16086 = 42 buckets are not initialized by initTable, which terminates the inner loop when bucket == numBuckets. Here is an example header dump from the last memory segment showing the crossover from initialized to uninitialized data.

offset   partition   status   count   next-pointer
26880    10          0        0       -72340172838076673
27008    11          0        0       -72340172838076673
27136    12          0        0       -72340172838076673
27264    13          0        0       -72340172838076673
27392    0           -56      9       844425030795264
27520    0           -56      9       -9191846839379296256
27648    0           -56      9       10133099245469696
27776    0           -56      9       12103424082444288

Setting a breakpoint for MutableHashTable.buildBloomFilterForBucket for count < 0, the last memory segment looked as follows (this is from a different execution, operation, and thread).

offset   partition   status   count    next-pointer
26880    10          0        9        27584547767975936
27008    11          0        9        -9208735337998712832
27136    12          0        9        4503599694479360
27264    13          0        9        -9219994337067139072
27392    0           0        -32697   1161165883580435
27520    0           3        -15328   18016855230957176
27648    0           5        1388     -33740636012148672
27776    0           6        25494    -17363350186618861

MutableHashTable.buildBloomFilterForBucketsInPartition processed offset 27392, which happened to match the partition number and bucket status even though it appears to be uninitialized.
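To make that coincidence concrete, here is a small stand-alone sketch. The header layout and field offsets are illustrative only (not Flink's actual bucket layout); the field values are taken from the offset-27392 row of the dump above:

```java
import java.nio.ByteBuffer;

// Sketch: garbage bytes in an uninitialized bucket header can pass a
// partition/status check while carrying a nonsensical (negative) count.
public class GarbageHeaderSketch {
    public static void main(String[] args) {
        // Simulate the uninitialized header at offset 27392 from the dump:
        // partition = 0, status = 0, count = -32697 (all garbage bytes).
        // Offsets 0/1/2 are hypothetical, chosen only for this illustration.
        ByteBuffer header = ByteBuffer.allocate(16);
        header.put(0, (byte) 0);            // garbage byte happens to equal the partition
        header.put(1, (byte) 0);            // garbage byte happens to look like a valid status
        header.putShort(2, (short) -32697); // garbage count field

        byte partition = header.get(0);
        byte status = header.get(1);
        short count = header.getShort(2);

        // The check passes even though the bucket was never initialized,
        // and the code downstream then sees a negative count.
        if (partition == 0 && status == 0) {
            System.out.println(count);
        }
    }
}
```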

After changing MutableHashTable.initTable to initialize all buckets in all segments I have not seen the bug reoccur.

for (int k = 0; k < bucketsPerSegment /* && bucket < numBuckets*/; k++, bucket++) {

I see at least three potential resolutions: 1) have MutableHashTable.initTable initialize all buckets, 2) have MutableHashTable.buildBloomFilterForBucket skip uninitialized buckets, or 3) have MutableHashTable.getInitialTableSize completely fill the last segment with usable buckets (I have not looked at it closely enough to know whether that is possible).

@StephanEwen
Contributor

Ah, that makes perfect sense. The last memory segment is not fully used (only until the hash index has initialized enough buckets). The bloom filter initialization loops should also skip those last buckets.

Since the memory for these buckets is not initialized, their contents (like count) are undefined.
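The fix Stephan describes can be sketched as follows. This is a hypothetical stand-alone illustration of skipping the uninitialized tail of the last segment; bucketsToProcess is an invented helper, not Flink's actual buildBloomFilterForBucketsInPartition code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: when walking a segment's buckets to build bloom filters, stop at
// numBuckets -- indices beyond it were never initialized by initTable.
public class BloomFilterSkipSketch {
    static List<Integer> bucketsToProcess(int segmentIndex, int bucketsPerSegment, int numBuckets) {
        List<Integer> result = new ArrayList<>();
        int firstBucket = segmentIndex * bucketsPerSegment;
        for (int k = 0; k < bucketsPerSegment; k++) {
            int bucket = firstBucket + k;
            if (bucket >= numBuckets) {
                break; // uninitialized tail: count/status bytes are garbage
            }
            result.add(bucket);
        }
        return result;
    }

    public static void main(String[] args) {
        // Last segment (index 62) with the numbers from the report:
        // 256 buckets per segment, 16086 usable buckets in total,
        // so only 16086 - 62*256 = 214 buckets of it should be processed.
        System.out.println(bucketsToProcess(62, 256, 16086).size());
    }
}
```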

@ChengXiangLi
Author

Nice job, @greghogan: you pointed out both the root cause and the solution. I added the logic to skip the last buckets as @StephanEwen suggested, and added a related unit test for this issue.

@StephanEwen
Contributor

Looks good, I'll merge this!

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Sep 1, 2015
@asfgit asfgit closed this in 2e6e4de Sep 2, 2015
nikste pushed a commit to nikste/flink that referenced this pull request Sep 29, 2015