Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-3874: flattening large JSON objects uses too much direct memory #181

Closed
wants to merge 1 commit into from

Conversation

cwestin
Copy link
Contributor

@cwestin cwestin commented Oct 1, 2015

  • add getBufferSizeFor() to ValueVector interface
  • add implememtations of getBufferSizeFor() for all ValueVector derivatives
  • add adaptive algorithm for adjusting batch size to flatten operator

@cwestin
Copy link
Contributor Author

cwestin commented Oct 1, 2015

Passes the unit tests, and the regression suite, except for one query in the functional suite that Rahul claimed (in email) is a known problem and that he has a fix for.

return 0;
}

return offsets.getBufferSizeFor(valueCount) + vector.getBufferSizeFor(valueCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This first one should be passed valueCount +1 right? As was done in VariableLengthVectors.java

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on setInitialValueCapacity(), I think you're right. Will do. (But I'm not going to go back and patch the EBF for that, because it'll just be off by one offsets value, or 4 bytes, which isn't going to kill the emergency fix.)

@jaltekruse
Copy link
Contributor

Overall, the changes look solid.

I am thinking it may be worth trying to remove the bodies of the existing getBufferSize() methods and replace them with calls into these new methods, passing in the current valueCount that is expected to be set with a call to setValueCount(int) before getBufferSize() is used. It seems weird to have two different ways to compute the same values, although I see the reason for it, previously we were using the writerIndex() for all of the fixed length primitives, which is also set in setValueCount().

I don't consider it a must fix to check in, I just wanted to know if you had thought about this and ruled it out.

@parthchandra
Copy link
Contributor

Looks good. +1.
Some minor comments in line

@cwestin
Copy link
Contributor Author

cwestin commented Oct 2, 2015

Parth:
Re ObjectVector: I don't know what that's for. I just followed the pattern: getBufferSize() already throws that exception.
Re OUTPUT_MEMORY_LIMIT: what do you think? I tend to avoid adding more knobs, but I can easily do that if you like (with the current 512MB as the default). Let me know soon, about to kick off testing on Jason's suggested replacement of getBufferSize() implementations with calls to getBufferSizeFor(). The problem I see here is that it will affect all flatten()s, whether they need it or not. And, this isn't really the long term solution, which is really to add projection capabilities so that we're not passing through the original record like this.

@cwestin
Copy link
Contributor Author

cwestin commented Oct 2, 2015

I tried to make the getBufferSize()-calls-getBufferSizeFor() change that you suggested, Jason, but it causes a lot of unit test failures, so there must be something else going on in some places. I want to get this into 1.2, so I've undone that for now. But I've added the valueCount + 1 you spotted, and also changed the check to if (valueCount == 0) in BaseRepeatedValueVector.getBufferSizeFor() -- that was a bug nobody spotted. Testing with those changes now, will push shortly. It'd be great if one of you could merge this after that.

@parthchandra
Copy link
Contributor

Re OUTPUT_MEMORY_LIMIT: I'm fine if you leave it the way it is.

@jaltekruse
Copy link
Contributor

I'm okay with merging as is, I think it is worth keeping a JIRA open for this refactoring. The fact that these behave inconsistently pretty much guarantees that one of them has a bug.
+1

- add getBufferSizeFor() to ValueVector interface
- add implememtations of getBufferSizeFor() for all ValueVector derivatives
- add adaptive algorithm for adjusting batch size to flatten operator
@cwestin
Copy link
Contributor Author

cwestin commented Oct 2, 2015

I just squashed and pushed with the valueCount + 1 for the BaseRepeatedValueVector, along with using valueCount in getBufferSizeFor() instead of getAccessor().getValueCount(). I opened https://issues.apache.org/jira/browse/DRILL-3876 for the proper long term solution, and https://issues.apache.org/jira/browse/DRILL-3889 for the getBufferSize()/getBufferSizeFor() improvement. Can one of you please merge this for 1.2 now?

@jaltekruse
Copy link
Contributor

I'm running the unit tests on the changeset rebased on top of master, will be merging shortly if it passes.

@parthchandra
Copy link
Contributor

Just completed a build and unit test run successfully. I can merging this.

@jaltekruse
Copy link
Contributor

Go for it

@cwestin cwestin closed this Oct 5, 2015
@cwestin
Copy link
Contributor Author

cwestin commented Oct 5, 2015

This was merged.

arina-ielchiieva pushed a commit to arina-ielchiieva/drill that referenced this pull request Oct 6, 2017
…ache#181)


* prevent wrong results when there is NOT resides on AND
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants