Materialize scan results correctly when columns are not present in the segments #16619

LakshSingla · 2024-06-17T16:22:33Z

Description

The query engine cannot correctly estimate the size in bytes of subquery results when the scan query references columns that are missing from the segments. This is because the ScanQueryEngine receives all the columns of the scan query and populates the row signature with a null type for any column it is unable to find in the segment.

This PR modifies the materializing logic to materialize the results of the columns whose types are known, and to check that the columns whose types are unknown always have null values. This is helpful because:
a. If the type is unknown and the column contains only null values, we don't need to materialize the results.
b. If the type is unknown and the column contains a non-null value in any row, we are running into the case of missing types, and we should throw an error.
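
The rule in (a) and (b) can be sketched roughly as below. This is a minimal illustration and not the code in this PR: the class name `NullTypedColumnCheck`, both method names, the use of a `null` list entry to stand for an unknown column type, and the exception type are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class NullTypedColumnCheck {
  // Hypothetical helper: returns the positions of columns whose type is unknown.
  // An unknown type is modelled here as a null entry in the list of type names.
  static List<Integer> nullTypedColumns(List<String> columnTypes) {
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < columnTypes.size(); i++) {
      if (columnTypes.get(i) == null) {
        result.add(i);
      }
    }
    return result;
  }

  // Hypothetical helper: case (a) passes silently, case (b) throws.
  static void checkNullTypedColumns(Object[] row, List<Integer> nullTypedColumns) {
    for (int columnNumber : nullTypedColumns) {
      if (row[columnNumber] != null) {
        throw new IllegalStateException(
            "Column at position " + columnNumber + " has an unknown type but a non-null value");
      }
    }
  }
}
```

Only the columns with known types would then be handed to the frame writer, while the unknown-typed ones are checked and skipped.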

Release note

Fixes a bug causing maxSubqueryBytes to not work when segments have missing columns.

sql/src/test/java/org/apache/druid/sql/calcite/CalciteSubqueryTest.java

kgyrtkirk · 2024-06-18T13:44:29Z

processing/src/main/java/org/apache/druid/query/scan/ScanResultValueFramesIterable.java

        while (populateCursor()) { // Do till we don't have any more rows, or the next row isn't compatible with the current row
          if (!frameWriter.addSelection()) { // Add the cursor's row to the frame, till the frame is full
            break;
          }
+
+          for (Integer columnNumber : nullTypedColumns) {


note: I wonder why this uses a fastutil IntList if it gets iterated with a foreach; why not a plain get?
this could be moved into some method like validateRow - that will naturally do a CSE (common subexpression elimination) of the currentRows.get(currentRowIndex) so that it will only be evaluated once

No reason to use FastUtil IntList as such. I just thought it might be faster to create than an ArrayList.

this could be moved into some method like validateRow - that will naturally do a CSE (common subexpression elimination) of the currentRows.get(currentRowIndex) so that it will only be evaluated once

It is getting evaluated once here, right? Unless I misinterpreted your comment.

this was just a note; this loop validates one row, but to access that row it has to call currentRows.get(currentRowIndex), which became part of the loop body. Moving it into a method could make it clear that it works on a row, and it would naturally remove the currentRows.get(currentRowIndex) call, as that's the row :)
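
As a rough illustration of the suggestion in this thread, the two forms might look like the sketch below. It is not the code that was merged: the class name, the method names other than `validateRow`, the boolean return value, and the use of a plain `int[]` in place of the fastutil `IntList` are assumptions made for the example.

```java
import java.util.List;

class RowValidation {
  // Inline form: the row lookup is repeated inside the loop body.
  static boolean nullTypedColumnsAreNullInline(
      List<Object[]> currentRows, int currentRowIndex, int[] nullTypedColumns) {
    for (int columnNumber : nullTypedColumns) {
      if (currentRows.get(currentRowIndex)[columnNumber] != null) {
        return false;
      }
    }
    return true;
  }

  // Extracted form: the caller fetches the row once and the helper works on that row only.
  static boolean validateRow(Object[] row, int[] nullTypedColumns) {
    for (int columnNumber : nullTypedColumns) {
      if (row[columnNumber] != null) {
        return false;
      }
    }
    return true;
  }
}
```

Both forms return the same result for the same row; the extracted one simply makes the "works on one row" contract visible in the signature.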

kgyrtkirk · 2024-06-18T14:03:16Z

processing/src/main/java/org/apache/druid/query/scan/ScanResultValueFramesIterable.java

@@ -200,26 +229,33 @@ public FrameSignaturePair next()
      // start all the processing
      populateCursor();
      boolean firstRowWritten = false;
-      // While calling populateCursor() repeatedly, currentRowSignature might change. Therefore we store the signature
+      // While calling populateCursor() repeatedly, currentRowSignature might change. Therefore, we store the signature


...what if the signature changes - is that a problem? Shouldn't that be an exception?

If there are two cursors, CursorA with RowSignatureA and CursorB with RowSignatureB, and we are at the last row of CursorA, the populateCursor() call will return false (i.e. the two cursors cannot be batched together) and set currentRowSignature to RowSignatureB (i.e. prepare the variables for the next write). We still want to return the old frame with the old signature, so we need to preserve the signature with which the frame was written.
Per your previous suggestion, frameWriterFactory.signature() would be sufficient and cleaner, and I will use that instead.
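
The behaviour described in this reply can be sketched with a toy batcher. This is a simplified stand-in and not Druid's iterable: `SignatureBatcher`, its methods, and the representation of rows as `{signature, value}` string pairs are all hypothetical. The point it tries to show is that the signature is captured before advancing, so a batch is always returned with the signature it was written with, even after the "current" signature has moved on to the next cursor.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

class SignatureBatcher {
  private final List<String[]> rows; // each entry is {signature, value}
  private int position = 0;

  SignatureBatcher(List<String[]> rows) {
    this.rows = rows;
  }

  boolean hasNext() {
    return position < rows.size();
  }

  // Signature of the next unread row; plays the role of currentRowSignature.
  private String currentSignature() {
    return rows.get(position)[0];
  }

  Map.Entry<String, List<String>> next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    // Capture the signature the batch is written with before advancing changes it.
    final String writtenSignature = currentSignature();
    List<String> batch = new ArrayList<>();
    // Stop when rows run out or the next row has an incompatible signature.
    while (hasNext() && currentSignature().equals(writtenSignature)) {
      batch.add(rows.get(position)[1]);
      position++;
    }
    return new SimpleEntry<>(writtenSignature, batch);
  }
}
```

A signature change is therefore not an error in this sketch; it only ends the current batch and starts the next one.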

processing/src/test/java/org/apache/druid/query/scan/ScanResultValueFramesIterableTest.java

kgyrtkirk

looks good - left some minor notes

kgyrtkirk · 2024-06-20T11:03:20Z

this was just a note; this loop validates one row, but to access that row it has to call currentRows.get(currentRowIndex), which became part of the loop body. Moving it into a method could make it clear that it works on a row, and it would naturally remove the currentRows.get(currentRowIndex) call, as that's the row :)

kgyrtkirk · 2024-06-20T11:04:42Z

processing/src/main/java/org/apache/druid/query/scan/ScanResultValueFramesIterable.java

          firstRowWritten = true;
+          // Check that the columns with the null types are actually null before advancing


note: isn't this comment misplaced? (note: this detail is not necessary, but it could live as the apidoc of validateRow if that method were around)

Cleaned up the code

LakshSingla · 2024-06-21T04:20:32Z

Thanks for the review! @kgyrtkirk

LakshSingla added 4 commits June 17, 2024 21:43

init commit

5f86bea

Merge branch 'master' into missing-col-frames

886439b

more tests

a95807d

add calcite tests

e66b948

github-actions bot added the Area - Querying label Jun 18, 2024

github-advanced-security bot found potential problems Jun 18, 2024

sql/src/test/java/org/apache/druid/sql/calcite/CalciteSubqueryTest.java (dismissed)

kgyrtkirk reviewed Jun 18, 2024

review comments

591e25c

kgyrtkirk approved these changes Jun 20, 2024

review comments

180f9fe

fix tests

9d64ed4

LakshSingla merged commit 00c9643 into apache:master Jun 23, 2024
86 of 87 checks passed

LakshSingla deleted the missing-col-frames branch June 23, 2024 17:45

kfaraz added this to the 31.0.0 milestone Oct 4, 2024

kfaraz mentioned this pull request Oct 11, 2024

[DRAFT] 31.0.0 Release Notes #17332

Closed
