[multistage] [enhancement] Split row data block when row size is too large #9485
walterddr merged 3 commits into apache:master from
Conversation
Force-pushed fc9bd66 to 5e25385
Codecov Report
@@ Coverage Diff @@
## master #9485 +/- ##
============================================
- Coverage 68.72% 63.76% -4.97%
- Complexity 4860 5189 +329
============================================
Files 1924 1873 -51
Lines 102425 100317 -2108
Branches 15542 15304 -238
============================================
- Hits 70392 63966 -6426
- Misses 26994 31645 +4651
+ Partials 5039 4706 -333
- public static void testOnlySetMaxBlockSize(int maxBlockSize) {
+ public static void testOnlySetMaxBlockSizeMB(int maxBlockSizeMB) {
it's a bit confusing whether this is rows or size, but looking at the code it appears to be size
it could be nice to make this configurable, with a 4MB default
should we fail here as well? I think it's confusing to have a function that doesn't do what it says.
nit, we can just move this down and have it fall into default
this will deserialize the entire block just to re-serialize it again later when we buildFromRows, which I suspect will be a significant performance regression for large blocks. Is there any way to do this split without going through the serde twice?
this is the first step. we will optimize this one out later
it's nice not to have randomness in tests; if an error happens due to a specific random input we won't be able to debug it. Can we make this deterministic?
we can set a random seed, but for this particular test we are fine introducing some randomness
this reminds me... it would be a good idea to add a printout of the row itself for reproducibility
why are we excluding these types from our test?
because these types are not supported in data blocks
walterddr left a comment:
lgtm. i did a minor clean up push on top and it should be good to go
siddharthteotia left a comment:
To confirm: metadata block / error block / eos block are not split, right?
} else {
  int rowSizeInBytes = ((RowDataBlock) block.getDataBlock()).getRowSizeInBytes();
  int numRowsPerChunk = maxBlockSize / rowSizeInBytes;
  Preconditions.checkState(numRowsPerChunk > 0, "row size too large for query engine to handle, abort!");
Can we include the offending rowSize in the message / exception ?
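A minimal sketch of what the reviewer is asking for, with a hypothetical helper class (`RowSizeCheck` is not Pinot's actual code): include the offending row size and the limit in the exception message so the failure is debuggable. The real code uses Guava's `Preconditions.checkState`, which accepts `%s`-style format arguments for the same effect; the sketch below uses only the standard library to stay self-contained.

```java
public class RowSizeCheck {
    // Hypothetical helper: compute how many rows fit in one block and fail
    // with a descriptive message when not even a single row fits.
    static int checkedRowsPerChunk(int maxBlockSize, int rowSizeInBytes) {
        int numRowsPerChunk = maxBlockSize / rowSizeInBytes;
        if (numRowsPerChunk <= 0) {
            throw new IllegalStateException(
                "Row size too large for query engine to handle: rowSizeInBytes=" + rowSizeInBytes
                    + ", maxBlockSize=" + maxBlockSize + ", abort!");
        }
        return numRowsPerChunk;
    }
}
```

With Guava this collapses to a one-liner: `Preconditions.checkState(numRowsPerChunk > 0, "row size too large: %s bytes (limit: %s)", rowSizeInBytes, maxBlockSize);`.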
Maybe add a TODO to split columnar block as well in the future.
Confirmed offline. Only row data block at this point.
Split data block when the size is too large (exceeds the gRPC limit).
Set the limit to 4MB for now.
The size is estimated from the row size in bytes.
Columnar block split is not supported for now.
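The description above can be sketched as follows. This is a simplified illustration, not Pinot's actual implementation: the class and method names (`RowBlockSplitter`, `split`) are hypothetical, and it splits a generic row list into fixed-size chunks using the same estimate as the PR (rows per chunk = max block size / per-row size in bytes).

```java
import java.util.ArrayList;
import java.util.List;

public class RowBlockSplitter {
    // Hypothetical default mirroring the PR's 4MB limit.
    static final int MAX_BLOCK_SIZE = 4 * 1024 * 1024;

    // Estimate how many rows fit in one chunk, assuming a fixed per-row size.
    static int numRowsPerChunk(int maxBlockSize, int rowSizeInBytes) {
        int n = maxBlockSize / rowSizeInBytes;
        if (n <= 0) {
            throw new IllegalStateException(
                "Row size too large for query engine to handle: " + rowSizeInBytes + " bytes");
        }
        return n;
    }

    // Split the rows into chunks of at most numRowsPerChunk rows each,
    // so every chunk's estimated size stays under maxBlockSize.
    static <T> List<List<T>> split(List<T> rows, int rowSizeInBytes, int maxBlockSize) {
        int chunkSize = numRowsPerChunk(maxBlockSize, rowSizeInBytes);
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += chunkSize) {
            chunks.add(rows.subList(i, Math.min(i + chunkSize, rows.size())));
        }
        return chunks;
    }
}
```

Note the estimate assumes uniform row width, which holds for Pinot's fixed-width row blocks; metadata/error/eos blocks are not split, per the discussion above.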