New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-3874: flattening large JSON objects uses too much direct memory #181
Conversation
cwestin
commented
Oct 1, 2015
- add getBufferSizeFor() to ValueVector interface
- add implememtations of getBufferSizeFor() for all ValueVector derivatives
- add adaptive algorithm for adjusting batch size to flatten operator
Passes the unit tests, and the regression suite, except for one query in the functional suite that Rahul claimed (in email) is a known problem and that he has a fix for. |
return 0; | ||
} | ||
|
||
return offsets.getBufferSizeFor(valueCount) + vector.getBufferSizeFor(valueCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This first one should be passed valueCount +1 right? As was done in VariableLengthVectors.java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on setInitialValueCapacity(), I think you're right. Will do. (But I'm not going to go back and patch the EBF for that, because it'll just be off by one offsets value, or 4 bytes, which isn't going to kill the emergency fix.)
Overall, the changes look solid. I am thinking it may be worth trying to remove the bodies of the existing getBufferSize() methods and replace them with calls into these new methods, passing in the current valueCount that is expected to be set with a call to setValueCount(int) before getBufferSize() is used. It seems weird to have two different ways to compute the same values, although I see the reason for it, previously we were using the writerIndex() for all of the fixed length primitives, which is also set in setValueCount(). I don't consider it a must fix to check in, I just wanted to know if you had thought about this and ruled it out. |
Looks good. +1. |
Parth: |
I tried to make the getBufferSize()-calls-getBufferSizeFor() change that you suggested, Jason, but it causes a lot of unit test failures, so there must be something else going on in some places. I want to get this into 1.2, so I've undone that for now. But I've added the valueCount + 1 you spotted, and also changed the check to if (valueCount == 0) in BaseRepeatedValueVector.getBufferSizeFor() -- that was a bug nobody spotted. Testing with those changes now, will push shortly. It'd be great if one of you could merge this after that. |
Re OUTPUT_MEMORY_LIMIT: I'm fine if you leave it the way it is. |
I'm okay with merging as is, I think it is worth keeping a JIRA open for this refactoring. The fact that these behave inconsistently pretty much guarantees that one of them has a bug. |
- add getBufferSizeFor() to ValueVector interface - add implememtations of getBufferSizeFor() for all ValueVector derivatives - add adaptive algorithm for adjusting batch size to flatten operator
ca9bccb
to
93e9459
Compare
I just squashed and pushed with the valueCount + 1 for the BaseRepeatedValueVector, along with using valueCount in getBufferSizeFor() instead of getAccessor().getValueCount(). I opened https://issues.apache.org/jira/browse/DRILL-3876 for the proper long term solution, and https://issues.apache.org/jira/browse/DRILL-3889 for the getBufferSize()/getBufferSizeFor() improvement. Can one of you please merge this for 1.2 now? |
I'm running the unit tests on the changeset rebased on top of master, will be merging shortly if it passes. |
Just completed a build and unit test run successfully. I can merging this. |
Go for it |
This was merged. |
…ache#181) * prevent wrong results when there is NOT resides on AND