Fixing a data correctness issue in unnest when first row of an MVD is null #13764
somu-imply wants to merge 5 commits into apache:master
Did I understand you correctly that the previous (bad?) behavior was I.e. when it gets to the second row that has an array of 3 values, it unnests it into 3 rows. And you changed the code to do I.e. if it sees a null, it will use the null going forward and not unnest anything? If that understanding is correct, can you explain why the previous behavior is not the correct behavior? It is what I had expected, at least...
@imply-cheddar it is the other way round: the previous was Which was incorrect. This was changed to the correct one, which should now have 4 rows in the output. I have made the description clearer.
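To make the expected row count concrete, here is a minimal sketch of the intended semantics, not the actual Druid cursor code. `UnnestSketch`, `unnest`, and the values "a", "b", "c" are hypothetical; the assumption (based on the "4 rows" statement above) is that a null first row produces one null output row and has no effect on how later rows are unnested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnnestSketch {
    // Hypothetical helper: emits one output row per value of a multi-value column.
    // Assumption: a null input row yields a single null output row and does not
    // change how the rows after it are unnested.
    public static List<String> unnest(List<List<String>> column) {
        List<String> out = new ArrayList<>();
        for (List<String> row : column) {
            if (row == null) {
                out.add(null);
            } else {
                out.addAll(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // First row is null, second row holds three values.
        List<List<String>> column = Arrays.asList(null, Arrays.asList("a", "b", "c"));
        System.out.println(unnest(column)); // prints [null, a, b, c], i.e. 4 rows
    }
}
```

Under these assumptions, the incorrect behavior described in this thread corresponds to letting the leading null carry forward instead of moving on to the next row's values.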
public static final String DATASOURCE3 = "numfoo";
public static final String DATASOURCE4 = "foo4";
public static final String DATASOURCE5 = "lotsocolumns";
public static final String DATASOURCE6 = "unnestnumfoo";
Better name? nested perhaps? Also, would be cool to add a comment with the schema: I find it hard to suss that out from the code.
public static final List<InputRow> ROWS1 =
    RAW_ROWS1.stream().map(TestDataBuilder::createRow).collect(Collectors.toList());

public static final List<ImmutableMap<String, Object>> RAW_ROWS_FOR_UNNEST = ImmutableList.of(
Does this have all the interesting corner cases? Empty arrays or objects? Null values? Fields that appear in one nested object but not another (in both orders: (a,b), (a), (a,c))? And so on. To help future readers, might be handy to add a comment above each .put( call that sets up one of these cases.
Good idea, will do.
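A rough sketch of what one-comment-per-case test rows could look like, using plain `java.util` collections rather than the Guava builders in the PR. The class name, the `rows()` method, and all values are hypothetical; only the column names `dim1` and `dim3` come from the diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UnnestCornerCaseRows {
    // Hypothetical row set: each row exercises one corner case for unnest.
    public static List<Map<String, Object>> rows() {
        List<Map<String, Object>> rows = new ArrayList<>();

        // Case 1: the multi-value column is explicitly null (the first-row case from this PR).
        rows.add(row("r1", true, null));
        // Case 2: a flat list with several values.
        rows.add(row("r2", true, Arrays.asList("a", "b", "c")));
        // Case 3: an empty list.
        rows.add(row("r3", true, Collections.emptyList()));
        // Case 4: a single string instead of a list.
        rows.add(row("r4", true, "d"));
        // Case 5: the column is absent from the row altogether.
        rows.add(row("r5", false, null));

        return rows;
    }

    // Builds one row; dim3 is only put into the map when includeDim3 is true.
    private static Map<String, Object> row(String dim1, boolean includeDim3, Object dim3) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("dim1", dim1);
        if (includeDim3) {
            row.put("dim3", dim3);
        }
        return row;
    }
}
```

Keeping each case on its own commented line makes it easier to see which input shapes the expected query results depend on.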
// the column name cannot be EXPR$0 for both inner and outer. The inner one which gets executed first gets the name
// EXPR$0 and as we move up the tree we add a 0 at the end to make the top level EXPR$00.
// Ideally these names should be replaced by the alias names specified in the query. Any future developer if
// able to find these alias names should replace EXPR$0 by dim3 and EXPR$00 by dim2, i.e use the correct name from Calcite
Thanks much for the detailed explanation!
.put("f1", 1.0f)
.put("l1", 7L)
.put("dim1", "")
.put("dim3", ImmutableList.of("a", ImmutableList.of("b", "c")))
The string dimension indexer can't really handle nested arrays like this. I think you'll end up with something like "a" and then the 'toString' of ["b","c"], or maybe something even weirder...
I think you should stick to having either flat lists or single-layer strings for these tests.
// able to find these alias names should replace EXPR$0 by dim3 and EXPR$00 by dim2, i.e use the correct name from Calcite

if (druidQueryRel instanceof DruidCorrelateUnnestRel) {
  outputColName = outputColName + "0";
I'm skeptical that this is always correct. Is it really OK?
This is a hacky approach for now. I have kept a pointer to this so it can be corrected by fetching the actual names. Will do this in a follow-up PR.
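For readers following the naming scheme discussed here, a small sketch of the suffix rule as described in the code comment: the innermost unnest keeps `EXPR$0` and each level above it appends a `0`. The class and method names are hypothetical, and this is only an illustration of the rule, not the planner code.

```java
public class UnnestColumnNames {
    // Hypothetical helper: returns the output column name for an unnest that sits
    // levelsAboveInnermost levels above the innermost one in the plan tree.
    public static String outputColumnName(String baseName, int levelsAboveInnermost) {
        StringBuilder name = new StringBuilder(baseName);
        for (int i = 0; i < levelsAboveInnermost; i++) {
            name.append('0'); // one extra "0" per level, as in the diff above
        }
        return name.toString();
    }

    public static void main(String[] args) {
        System.out.println(outputColumnName("EXPR$0", 0)); // prints EXPR$0
        System.out.println(outputColumnName("EXPR$0", 1)); // prints EXPR$00
    }
}
```

The rule only tracks nesting depth, so it cannot recover the aliases written in the query, which is why the comment suggests replacing it with the names Calcite provides.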
This is fixed through #13934. Closing in favor of that.
This PR solves 2 things:
Previous:
After this change:
Before, for the query
the planner would do
After this change, the planner plans correctly as
Additionally, unit test cases for 1 and 2 have been added by creating a new data source in the CalciteTests framework.