DRILL-3874: flattening large JSON objects uses too much direct memory #181

cwestin · 2015-10-01T13:43:38Z

add getBufferSizeFor() to ValueVector interface
add implememtations of getBufferSizeFor() for all ValueVector derivatives
add adaptive algorithm for adjusting batch size to flatten operator

cwestin · 2015-10-01T13:44:37Z

Passes the unit tests, and the regression suite, except for one query in the functional suite that Rahul claimed (in email) is a known problem and that he has a fix for.

jaltekruse · 2015-10-01T20:39:43Z

exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/BaseRepeatedValueVector.java

+      return 0;
+    }
+
+    return offsets.getBufferSizeFor(valueCount) + vector.getBufferSizeFor(valueCount);


This first one should be passed valueCount +1 right? As was done in VariableLengthVectors.java

Based on setInitialValueCapacity(), I think you're right. Will do. (But I'm not going to go back and patch the EBF for that, because it'll just be off by one offsets value, or 4 bytes, which isn't going to kill the emergency fix.)

jaltekruse · 2015-10-01T20:58:23Z

Overall, the changes look solid.

I am thinking it may be worth trying to remove the bodies of the existing getBufferSize() methods and replace them with calls into these new methods, passing in the current valueCount that is expected to be set with a call to setValueCount(int) before getBufferSize() is used. It seems weird to have two different ways to compute the same values, although I see the reason for it, previously we were using the writerIndex() for all of the fixed length primitives, which is also set in setValueCount().

I don't consider it a must fix to check in, I just wanted to know if you had thought about this and ruled it out.

parthchandra · 2015-10-02T01:49:09Z

Looks good. +1.
Some minor comments in line

cwestin · 2015-10-02T02:00:12Z

Parth:
Re ObjectVector: I don't know what that's for. I just followed the pattern: getBufferSize() already throws that exception.
Re OUTPUT_MEMORY_LIMIT: what do you think? I tend to avoid adding more knobs, but I can easily do that if you like (with the current 512MB as the default). Let me know soon, about to kick off testing on Jason's suggested replacement of getBufferSize() implementations with calls to getBufferSizeFor(). The problem I see here is that it will affect all flatten()s, whether they need it or not. And, this isn't really the long term solution, which is really to add projection capabilities so that we're not passing through the original record like this.

cwestin · 2015-10-02T14:15:30Z

I tried to make the getBufferSize()-calls-getBufferSizeFor() change that you suggested, Jason, but it causes a lot of unit test failures, so there must be something else going on in some places. I want to get this into 1.2, so I've undone that for now. But I've added the valueCount + 1 you spotted, and also changed the check to if (valueCount == 0) in BaseRepeatedValueVector.getBufferSizeFor() -- that was a bug nobody spotted. Testing with those changes now, will push shortly. It'd be great if one of you could merge this after that.

parthchandra · 2015-10-02T15:41:54Z

Re OUTPUT_MEMORY_LIMIT: I'm fine if you leave it the way it is.

jaltekruse · 2015-10-02T18:34:32Z

I'm okay with merging as is, I think it is worth keeping a JIRA open for this refactoring. The fact that these behave inconsistently pretty much guarantees that one of them has a bug.
+1

- add getBufferSizeFor() to ValueVector interface - add implememtations of getBufferSizeFor() for all ValueVector derivatives - add adaptive algorithm for adjusting batch size to flatten operator

cwestin · 2015-10-02T20:38:26Z

I just squashed and pushed with the valueCount + 1 for the BaseRepeatedValueVector, along with using valueCount in getBufferSizeFor() instead of getAccessor().getValueCount(). I opened https://issues.apache.org/jira/browse/DRILL-3876 for the proper long term solution, and https://issues.apache.org/jira/browse/DRILL-3889 for the getBufferSize()/getBufferSizeFor() improvement. Can one of you please merge this for 1.2 now?

jaltekruse · 2015-10-02T21:23:38Z

I'm running the unit tests on the changeset rebased on top of master, will be merging shortly if it passes.

parthchandra · 2015-10-02T21:34:07Z

Just completed a build and unit test run successfully. I can merging this.

jaltekruse · 2015-10-02T21:37:42Z

Go for it

cwestin · 2015-10-05T19:36:19Z

This was merged.

…ache#181) * prevent wrong results when there is NOT resides on AND

jaltekruse reviewed Oct 1, 2015
View reviewed changes

DRILL-3874: flattening large JSON objects uses too much direct memory

93e9459

- add getBufferSizeFor() to ValueVector interface - add implememtations of getBufferSizeFor() for all ValueVector derivatives - add adaptive algorithm for adjusting batch size to flatten operator

cwestin force-pushed the flatten-memory branch from ca9bccb to 93e9459 Compare October 2, 2015 20:33

cwestin closed this Oct 5, 2015

arina-ielchiieva pushed a commit to arina-ielchiieva/drill that referenced this pull request Oct 6, 2017

MD-2742: MD-2162 don't try to split the filter when it is not AND (ap…

152d88c

…ache#181) * prevent wrong results when there is NOT resides on AND

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DRILL-3874: flattening large JSON objects uses too much direct memory #181

DRILL-3874: flattening large JSON objects uses too much direct memory #181

cwestin commented Oct 1, 2015

cwestin commented Oct 1, 2015

jaltekruse Oct 1, 2015

cwestin Oct 2, 2015

jaltekruse commented Oct 1, 2015

parthchandra commented Oct 2, 2015

cwestin commented Oct 2, 2015

cwestin commented Oct 2, 2015

parthchandra commented Oct 2, 2015

jaltekruse commented Oct 2, 2015

cwestin commented Oct 2, 2015

jaltekruse commented Oct 2, 2015

parthchandra commented Oct 2, 2015

jaltekruse commented Oct 2, 2015

cwestin commented Oct 5, 2015

DRILL-3874: flattening large JSON objects uses too much direct memory #181

DRILL-3874: flattening large JSON objects uses too much direct memory #181

Conversation

cwestin commented Oct 1, 2015

cwestin commented Oct 1, 2015

jaltekruse Oct 1, 2015

Choose a reason for hiding this comment

cwestin Oct 2, 2015

Choose a reason for hiding this comment

jaltekruse commented Oct 1, 2015

parthchandra commented Oct 2, 2015

cwestin commented Oct 2, 2015

cwestin commented Oct 2, 2015

parthchandra commented Oct 2, 2015

jaltekruse commented Oct 2, 2015

cwestin commented Oct 2, 2015

jaltekruse commented Oct 2, 2015

parthchandra commented Oct 2, 2015

jaltekruse commented Oct 2, 2015

cwestin commented Oct 5, 2015