DRILL-5351: Minimize bounds checking in var len vectors for Parquet #781
Closed
While this is a clever way to avoid checks, it will lead to difficulty when debugging Drill. Intentionally throwing a common exception makes it even harder to find cases where the exception indicates an error.
Let's take a step back. One of the things we need to change in Parquet is to avoid "low-density batches": vectors that have very little data. It turns out that one reason is tied to the code's assumption that it can tell when it has reached the end of a vector. (There are many bugs, but that is the key idea.)
Vectors don't have that ability today, so the code never worked.
What if we solve that problem, and yours, by changing how the DrillBuf works:
The above avoids the spurious exception and provides the means to manage variable-length vectors in Parquet.
Note that the bounds check is still done, but only inside DrillBuf. And, of course, that same check is done with the PR code: that check is what raises the exception.
The check for a DrillBuf exceeding its bounds is already done once in Netty (which also throws the exception). What we want is to avoid doing more than one bounds check for every vector write operation. Adding the suggested check simply moves the check from every vector setSafe method to every DrillBuf write, which could affect performance in other parts of the code.
I specifically changed only the var len vectors, to address a hotspot in Parquet reader performance, because, as you suggest, the right fix is to address the low-density batch problem at the same time as the vector overflow problem. A longer discussion may be required here.
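To make the trade-off concrete, here is a minimal sketch of the two write styles being compared. It is not Drill code: `VarLenSketch`, its fields, and its method names are hypothetical, and a plain `byte[]` with `System.arraycopy` stands in for the Netty-backed buffer, whose copy performs its own bounds check.

```java
import java.util.Arrays;

/** Toy stand-in for a variable-length vector; hypothetical, not Drill's classes. */
class VarLenSketch {
    private byte[] data = new byte[8];   // plays the role of the backing buffer
    private int writeOffset = 0;
    int explicitChecks = 0;              // checks performed before the copy
    int reallocs = 0;                    // how many times the buffer grew

    /** Checked style: test capacity first; the copy then checks bounds again. */
    void setSafeChecked(byte[] value) {
        explicitChecks++;
        while (writeOffset + value.length > data.length) {
            grow();
        }
        System.arraycopy(value, 0, data, writeOffset, value.length);
        writeOffset += value.length;
    }

    /** Exception style: rely on the copy's own bounds check; grow only on failure. */
    void setSafeOnException(byte[] value) {
        while (true) {
            try {
                // arraycopy leaves the destination unmodified when it throws
                System.arraycopy(value, 0, data, writeOffset, value.length);
                writeOffset += value.length;
                return;
            } catch (IndexOutOfBoundsException e) {
                grow();
            }
        }
    }

    private void grow() {
        data = Arrays.copyOf(data, data.length * 2);
        reallocs++;
    }

    int size() { return writeOffset; }
    int capacity() { return data.length; }
}
```

Both methods end up with the same contents and the same number of reallocations; the second simply skips the explicit pre-check on the common path, at the cost of using an exception for control flow, which is the debugging concern raised earlier in the thread.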
The Netty/DrillBuf code is complex, so this does boil down to details... Yes, I agree that Netty does the bounds checks -- if we call Netty code. Consider this code in DrillBuf:
We have one version that delegates to the "udle", which calls another buf, which calls PooledUnsafeDirectByteBuf, which does a bounds check.
But we have another method which cuts out the middleman and just does a direct memory copy. Given that, we could certainly add a version that does a bounds check and copies from heap into direct memory. Just use this Netty method:
So, something like this:
Of course, this assumes an implementation of the underlying direct memory, but, as we saw, we are already doing something similar.
Would this work and perform as well as the exception-based approach?
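As a rough illustration of the "bounds check, then copy from heap into direct memory" idea, the sketch below uses only `java.nio.ByteBuffer`. The class and method names (`CheckedCopySketch`, `tryCopy`) are hypothetical, and this is not the DrillBuf or Netty code referred to above. Note one limitation of the stand-in: `ByteBuffer.put` performs its own internal check, so a real implementation would pair the single explicit check with an unchecked copy.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch; not DrillBuf's actual API. */
class CheckedCopySketch {
    /**
     * Copies length bytes from a heap array into a direct buffer at destIndex.
     * Returns false, without writing anything, when the bytes do not fit,
     * so the caller can detect "vector full" without an exception.
     */
    static boolean tryCopy(ByteBuffer dest, int destIndex,
                           byte[] src, int srcIndex, int length) {
        // One explicit bounds check covering both source and destination.
        if (destIndex < 0 || srcIndex < 0 || length < 0
                || length > src.length - srcIndex
                || length > dest.capacity() - destIndex) {
            return false;
        }
        // Work on a duplicate so the caller's position and limit are untouched.
        ByteBuffer view = dest.duplicate();
        view.clear();                 // limit = capacity, position = 0
        view.position(destIndex);
        view.put(src, srcIndex, length);
        return true;
    }
}
```

A caller could then treat a `false` return as the signal to end the current batch or to reallocate, instead of catching an exception.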
By the way, if the suggested change blows up the scope of this fix, then the original proposal is fine; we can always adjust it later if needed when we solve the low-density batch problem.