Materialize scan results correctly when columns are not present in the segments #16619
while (populateCursor()) { // Do till we don't have any more rows, or the next row isn't compatible with the current row
  if (!frameWriter.addSelection()) { // Add the cursor's row to the frame, till the frame is full
    break;
  }

  for (Integer columnNumber : nullTypedColumns) {
note: I wonder why a fastutil IntList is used if it gets iterated with a foreach rather than a plain get?
This could be moved into some method like validateRow. That would naturally do a CSE (common subexpression elimination) of currentRows.get(currentRowIndex), so that it is evaluated only once.
No reason to use the fastutil IntList as such. I just thought it might be faster to create than an ArrayList.
"this could be moved into some method like validateRow - that will naturally do a CSE of the currentRows.get(currentRowIndex) so that it will be only evaluated once"
It is getting evaluated once here, right? Unless I misinterpreted your comment.
This was just a note. This loop validates one row, but to access that row it has to make the function call currentRows.get(currentRowIndex), which became part of the loop body. Moving it into a method could make it clear that it works on a row, and it would naturally remove the currentRows.get(currentRowIndex) call, as that's the row :)
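The validateRow suggestion can be sketched roughly as follows. This is only an illustration, not the code that was merged: the class name NullTypedColumnValidator, the Object[] row representation, the int[] list of column numbers, and the exception type are all assumptions made for the example.

```java
// Hypothetical helper illustrating the reviewer's suggestion; not taken from the PR.
final class NullTypedColumnValidator
{
  private NullTypedColumnValidator()
  {
  }

  /**
   * Checks that every column whose type is unknown holds null in the given row.
   * The caller looks the row up once and passes it in, so the loop body no longer
   * repeats a call such as currentRows.get(currentRowIndex).
   */
  static void validateRow(Object[] row, int[] nullTypedColumns)
  {
    for (int columnNumber : nullTypedColumns) {
      if (row[columnNumber] != null) {
        throw new IllegalStateException(
            "Column " + columnNumber + " has no known type but holds a non-null value"
        );
      }
    }
  }

  public static void main(String[] args)
  {
    Object[] row = new Object[]{"a", null, 3L};
    validateRow(row, new int[]{1}); // column 1 is null, so no exception is thrown
    System.out.println("row is valid");
  }
}
```

Using a primitive int[] (or indexed access on an IntList) also avoids boxing each column number, which is one reading of the "plain get" remark.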
processing/src/main/java/org/apache/druid/query/scan/ScanResultValueFramesIterable.java
@@ -200,26 +229,33 @@ public FrameSignaturePair next()
     // start all the processing
     populateCursor();
     boolean firstRowWritten = false;
-    // While calling populateCursor() repeatedly, currentRowSignature might change. Therefore we store the signature
+    // While calling populateCursor() repeatedly, currentRowSignature might change. Therefore, we store the signature
What if the signature changes? Is that a problem? Shouldn't that be an exception?
If there are two cursors, CursorA with RowSignatureA and CursorB with RowSignatureB, and the cursor is at the last row of CursorA, the populateCursor() call will return false (i.e. the two cursors cannot be batched together) and set currentRowSignature to RowSignatureB (i.e. prepare the variables for the next write). We still want to return the old frame with the old signature, so we need to preserve the signature with which the frame was written.
Per your previous suggestion, frameWriterFactory.signature() would be sufficient and cleaner, and I will use that instead.
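The CursorA/CursorB scenario can be reproduced with a small stand-alone model. The sketch below is an assumption-laden simplification, not Druid code: Cursor, Frame, SignatureBatchingSketch, and batch are hypothetical names, signatures are modeled as plain strings, and frames never fill up. It only shows why the signature has to be captured before the write loop runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the iterator discussed in this thread.
final class SignatureBatchingSketch
{
  /** Stand-in for a cursor: a row signature (modeled as a String) and a row count. */
  static final class Cursor
  {
    final String signature;
    final int numRows;

    Cursor(String signature, int numRows)
    {
      this.signature = signature;
      this.numRows = numRows;
    }
  }

  /** Stand-in for a frame paired with the signature it was written with. */
  static final class Frame
  {
    final String signature;
    final int numRows;

    Frame(String signature, int numRows)
    {
      this.signature = signature;
      this.numRows = numRows;
    }
  }

  private final List<Cursor> cursors;
  private int cursorIndex = 0;
  private int rowIndex = 0;
  private int remainingRows = 0;
  private String currentSignature = null; // overwritten by populateCursor()

  SignatureBatchingSketch(List<Cursor> cursors)
  {
    this.cursors = cursors;
    for (Cursor cursor : cursors) {
      remainingRows += cursor.numRows;
    }
  }

  boolean hasNext()
  {
    return remainingRows > 0;
  }

  /**
   * Positions on the next unread row. Returns false when no rows are left, or when the
   * row's signature differs from the previous one; in that case currentSignature
   * already holds the new signature when this method returns.
   */
  private boolean populateCursor()
  {
    while (cursorIndex < cursors.size() && rowIndex >= cursors.get(cursorIndex).numRows) {
      cursorIndex++;
      rowIndex = 0;
    }
    if (cursorIndex >= cursors.size()) {
      return false;
    }
    String nextSignature = cursors.get(cursorIndex).signature;
    boolean compatible = nextSignature.equals(currentSignature);
    currentSignature = nextSignature;
    return compatible;
  }

  Frame next()
  {
    populateCursor(); // position on the first row of this frame
    // Capture the signature now: the loop below may leave currentSignature
    // pointing at the signature of the next frame.
    final String writtenSignature = currentSignature;
    int rowsWritten = 0;
    while (populateCursor()) {
      rowIndex++; // "write" the row and advance
      remainingRows--;
      rowsWritten++;
    }
    return new Frame(writtenSignature, rowsWritten);
  }

  static List<Frame> batch(List<Cursor> cursors)
  {
    SignatureBatchingSketch iterator = new SignatureBatchingSketch(cursors);
    List<Frame> frames = new ArrayList<>();
    while (iterator.hasNext()) {
      frames.add(iterator.next());
    }
    return frames;
  }

  public static void main(String[] args)
  {
    List<Cursor> cursors = Arrays.asList(new Cursor("A", 2), new Cursor("A", 1), new Cursor("B", 2));
    for (Frame frame : batch(cursors)) {
      System.out.println(frame.signature + ": " + frame.numRows + " rows");
    }
  }
}
```

With the three cursors in main, the two "A" cursors are batched into one frame of 3 rows and the "B" cursor yields a second frame of 2 rows. Returning currentSignature instead of writtenSignature from next() would label the first frame with "B". Reading the signature from the writer that produced the frame, as the reply proposes with frameWriterFactory.signature(), sidesteps the extra local variable.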
looks good - left some minor notes
firstRowWritten = true;
// Check that the columns with the null types are actually null before advancing
note: isn't this comment misplaced? (note: this detail is not necessary, but it could live as an apidoc of validateRow if that were around)
Cleaned up the code
Thanks for the review! @kgyrtkirk
Description
The query engine is unable to estimate the correct size in bytes of the subquery results when the scan query has columns that are missing from the segments. This is because the ScanQueryEngine receives all the columns of the scan query and populates the row signature with a null type if it is unable to find the column in the segment.
This PR modifies the materializing logic to materialize the results of the columns whose types are known, and to check that the columns whose types are unknown always have null values. This is helpful because:
a. If the type is unknown and the column contains all null values, we don't need to materialize the results.
b. If the type is unknown and the column contains non-null values in any row, we are running into the case of missing types, and we should throw an error.
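Cases (a) and (b) can be illustrated with a small sketch. It is not the implementation in this PR: NullTypedColumnSketch and materialize are hypothetical names, column types are modeled as strings with null standing for "type unknown", and the real code writes frames rather than lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the materialization rule described above.
final class NullTypedColumnSketch
{
  private NullTypedColumnSketch()
  {
  }

  /**
   * Keeps the values of columns whose type is known. For columns whose type is
   * unknown (null entry in columnTypes), the value must be null.
   */
  static List<Object> materialize(Object[] row, List<String> columnTypes)
  {
    List<Object> materialized = new ArrayList<>();
    for (int i = 0; i < row.length; i++) {
      if (columnTypes.get(i) != null) {
        materialized.add(row[i]); // type known: materialize the value
      } else if (row[i] != null) {
        // case (b): unknown type with a non-null value means a type is missing
        throw new IllegalStateException("Column " + i + " has an unknown type but a non-null value");
      }
      // case (a): unknown type and null value, nothing to materialize
    }
    return materialized;
  }

  public static void main(String[] args)
  {
    List<String> columnTypes = Arrays.asList("STRING", null, "LONG");
    System.out.println(materialize(new Object[]{"a", null, 3L}, columnTypes));
  }
}
```

Because the untyped column is skipped, a size estimate over the materialized values only ever sees columns with known types.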
Release note
Fixes a bug causing maxSubqueryBytes to not work when segments have missing columns.