Fixing a data correctness issue in unnest when first row of an MVD is null #13764

Closed
somu-imply wants to merge 5 commits into apache:master from somu-imply:unnest_issues

Conversation

@somu-imply (Contributor) commented Feb 7, 2023

This PR fixes two things:

  1. When the first row of an MVD is null, the previous version of unnest returned an incorrect result because the underlying array object was never updated after the null row.

Previous:

select * from mytest1, unnest(mv_to_array(c2)) as unnested(c3)

__time                    c1 c2            c3
2022-01-01T00:00:00.000Z  1  null          null
2022-01-01T00:00:00.000Z  2  ["A","B","C"] null

After this change

select * from mytest1, unnest(mv_to_array(c2)) as unnested(c3)

__time                    c1 c2            c3
2022-01-01T00:00:00.000Z  1  null          null
2022-01-01T00:00:00.000Z  2  ["A","B","C"] A
2022-01-01T00:00:00.000Z  2  ["A","B","C"] B
2022-01-01T00:00:00.000Z  2  ["A","B","C"] C
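The root cause can be illustrated with a small sketch (plain Python, not Druid's actual Java cursor code; the row shapes and the `unnested` key are illustrative): the unnest cursor must re-read the array on every advance of the base cursor, instead of caching whatever the first row held.

```python
# Conceptual sketch of the bug class: an unnest that latches onto the first
# row's (null) array never emits the later rows' elements. The fix is to
# re-read the unnest target for every base row.

def unnest_rows(rows, column):
    """Yield one output row per element of row[column]; null/empty stays null."""
    for row in rows:
        values = row.get(column)
        if not values:                       # null or empty MVD: single null row
            yield {**row, "unnested": None}
        else:                                # re-read the array for EVERY row,
            for v in values:                 # not just the first one
                yield {**row, "unnested": v}

rows = [
    {"c1": 1, "c2": None},
    {"c1": 2, "c2": ["A", "B", "C"]},
]
result = [r["unnested"] for r in unnest_rows(rows, "c2")]
# result == [None, "A", "B", "C"]
```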
  2. For nested queries that have an unnest in both the inner and the outer query, the output column name was always set to EXPR$0 for both levels. This change makes the output column names distinct.

Before this change, for the query

with t AS (select dim2, d3 from druid.numfoo, unnest(MV_TO_ARRAY(dim3)) as unnested (d3))
select d2,d3 from t, UNNEST(MV_TO_ARRAY(dim2)) as unnested(d2)

The planner would do

{
  "queryType" : "scan",
  "dataSource" : {
    "type" : "unnest",
    "base" : {
      "type" : "query",
      "query" : {
        "queryType" : "scan",
        "dataSource" : {
          "type" : "unnest",
          "base" : {
            "type" : "table",
            "name" : "numfoo"
          },
          "column" : "dim3",
          "outputName" : "EXPR$0",
          "allowList" : null
        },
        "intervals" : {
          "type" : "intervals",
          "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
        },
        "resultFormat" : "compactedList",
        "columns" : [ "EXPR$0", "__time", "cnt", "d1", "d2", "dim1", "dim2", "dim3", "dim4", "dim5", "dim6", "f1", "f2", "l1", "l2", "m1", "m2", "unique_dim1" ],
        "legacy" : false,
        "context" : {
          "defaultTimeout" : 300000,
          "maxScatterGatherBytes" : 9223372036854775807,
          "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
          "sqlQueryId" : "dummy",
          "vectorize" : "false",
          "vectorizeVirtualColumns" : "false"
        },
        "granularity" : {
          "type" : "all"
        }
      }
    },
    "column" : "dim2",
    "outputName" : "EXPR$0",
    "allowList" : null
  },
  "intervals" : {
    "type" : "intervals",
    "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
  },
  "resultFormat" : "compactedList",
  "columns" : [ "EXPR$0", "EXPR$00" ],
  "legacy" : false,
  "context" : {
    "defaultTimeout" : 300000,
    "maxScatterGatherBytes" : 9223372036854775807,
    "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
    "sqlQueryId" : "dummy",
    "vectorize" : "false",
    "vectorizeVirtualColumns" : "false"
  },
  "granularity" : {
    "type" : "all"
  }
}

After this change, the planner plans correctly as

{
  "queryType" : "scan",
  "dataSource" : {
    "type" : "unnest",
    "base" : {
      "type" : "query",
      "query" : {
        "queryType" : "scan",
        "dataSource" : {
          "type" : "unnest",
          "base" : {
            "type" : "table",
            "name" : "numfoo"
          },
          "column" : "dim3",
          "outputName" : "EXPR$0",
          "allowList" : null
        },
        "intervals" : {
          "type" : "intervals",
          "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
        },
        "resultFormat" : "compactedList",
        "columns" : [ "EXPR$0", "__time", "cnt", "d1", "d2", "dim1", "dim2", "dim3", "dim4", "dim5", "dim6", "f1", "f2", "l1", "l2", "m1", "m2", "unique_dim1" ],
        "legacy" : false,
        "context" : {
          "defaultTimeout" : 300000,
          "maxScatterGatherBytes" : 9223372036854775807,
          "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
          "sqlQueryId" : "dummy",
          "vectorize" : "false",
          "vectorizeVirtualColumns" : "false"
        },
        "granularity" : {
          "type" : "all"
        }
      }
    },
    "column" : "dim4",
    "outputName" : "EXPR$00",
    "allowList" : null
  },
  "intervals" : {
    "type" : "intervals",
    "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
  },
  "resultFormat" : "compactedList",
  "columns" : [ "EXPR$0", "EXPR$00" ],
  "legacy" : false,
  "context" : {
    "defaultTimeout" : 300000,
    "maxScatterGatherBytes" : 9223372036854775807,
    "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
    "sqlQueryId" : "dummy",
    "vectorize" : "false",
    "vectorizeVirtualColumns" : "false"
  },
  "granularity" : {
    "type" : "all"
  }
}
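The naming scheme above (the inner unnest keeps EXPR$0, and each enclosing level appends a 0) can be sketched as a tiny name-disambiguation helper. `disambiguate` is a hypothetical illustration of the scheme, not Druid's planner API:

```python
def disambiguate(taken, base="EXPR$0"):
    """Append '0' until the candidate no longer collides with output names
    already produced by inner unnest levels (sketch of the PR's scheme)."""
    name = base
    while name in taken:
        name += "0"
    return name

inner = disambiguate(set())        # inner unnest keeps the base name
outer = disambiguate({inner})      # outer unnest gets a "0" suffix
# inner == "EXPR$0", outer == "EXPR$00"
```

As the PR notes, these generated names would ideally be replaced by the aliases from the SQL query (d3, d2) rather than suffixed.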

Additionally, unit tests for both 1 and 2 have been added by creating a new data source in the CalciteTests framework.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@somu-imply somu-imply marked this pull request as ready for review February 7, 2023 19:59
@imply-cheddar (Contributor):

Did I understand you correctly that the previous (bad?) behavior was

select * from mytest1, unnest(mv_to_array(c2)) as unnested(c3)

__time c1 c2 c3
2022-01-01T00:00:00.000Z 1 null null
2022-01-01T00:00:00.000Z 2 ["A","B","C"] A
2022-01-01T00:00:00.000Z 2 ["A","B","C"] B
2022-01-01T00:00:00.000Z 2 ["A","B","C"] C

I.e. when it gets to the second row that has an array of 3 values, it unnests it into 3 rows.

And you changed the code to do

select * from mytest1, unnest(mv_to_array(c2)) as unnested(c3)

__time c1 c2 c3
2022-01-01T00:00:00.000Z 1 null null
2022-01-01T00:00:00.000Z 2 ["A","B","C"] null

I.e. if it sees a null, it will use the null going forward an not unnest anything?

If that understanding is correct, can you explain why the previous behavior is not the correct behavior? It is what I had expected at least...

@somu-imply (Contributor, Author) commented Feb 8, 2023

@imply-cheddar it is the other way around; the previous behavior was

select * from mytest1, unnest(mv_to_array(c2)) as unnested(c3)

__time c1 c2 c3
2022-01-01T00:00:00.000Z 1 null null
2022-01-01T00:00:00.000Z 2 ["A","B","C"] null

Which was incorrect. This was changed to the correct behavior, which should now produce 4 rows in the output. I have made the description clearer.

public static final String DATASOURCE3 = "numfoo";
public static final String DATASOURCE4 = "foo4";
public static final String DATASOURCE5 = "lotsocolumns";
public static final String DATASOURCE6 = "unnestnumfoo";
Review comment (Contributor):

Better name? nested perhaps? Also, would be cool to add a comment with the schema: I find it hard to suss that out from the code.

public static final List<InputRow> ROWS1 =
RAW_ROWS1.stream().map(TestDataBuilder::createRow).collect(Collectors.toList());

public static final List<ImmutableMap<String, Object>> RAW_ROWS_FOR_UNNEST = ImmutableList.of(
Review comment (Contributor):

Does this have all the interesting corner cases? Empty arrays or objects? Null values? Fields that appear in one nested object but not another (in both orders: (a,b), (a), (a,c))? And so on. To help future readers, might be handy to add a comment above each .put( call that sets up one of these cases.

Reply (Contributor, Author):

Good idea, will do.
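Along the lines of the reviewer's suggestion above, a hypothetical corner-case fixture (illustrative names and values, not the PR's actual test data) might cover nulls, empty arrays, and mixed contents, with a comment per row stating the expected behavior:

```python
# Hypothetical corner-case rows for unnest tests; each comment states the
# expected unnest output for that row.
corner_case_rows = [
    {"dim": None},              # null MVD: one output row with a null value
    {"dim": []},                # empty array: also one null output row
    {"dim": ["a"]},             # single element: one row
    {"dim": ["a", "b"]},        # two elements: two rows
    {"dim": ["a", None, "b"]},  # null element inside the array: three rows
]

# A null/empty MVD still contributes one (null) row, so the expected total
# is max(len, 1) summed over all rows: 1 + 1 + 1 + 2 + 3 = 8.
expected_total = sum(max(len(r["dim"] or []), 1) for r in corner_case_rows)
```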

// the column name cannot be EXPR$0 for both inner and outer. The inner one which gets executed first gets the name
// EXPR$0 and as we move up the tree we add a 0 at the end to make the top level EXPR$00.
// Ideally these names should be replaced by the alias names specified in the query. Any future developer if
// able to find these alias names should replace EXPR$0 by dim3 and EXPR$00 by dim2, i.e use the correct name from Calcite
Review comment (Contributor):

Thanks much for the detailed explanation!

.put("f1", 1.0f)
.put("l1", 7L)
.put("dim1", "")
.put("dim3", ImmutableList.of("a", ImmutableList.of("b", "c")))
Review comment (Member):

The string dimension indexer can't really handle nested arrays like this; I think you'll end up with something like "a" and then the 'toString' of ["b","c"], or maybe something even weirder.

I think you should stick to having either flat lists or single-layer strings for these tests.


if (druidQueryRel instanceof DruidCorrelateUnnestRel) {
outputColName = outputColName + "0";
Review comment (Member):

I'm skeptical that this is always correct; is it really cool?

Reply (Contributor, Author):

This is a hacky approach for now; I have left a pointer to this so it can be corrected by fetching the actual names. Will do that in a follow-up PR.

@somu-imply somu-imply marked this pull request as draft February 21, 2023 18:26
@somu-imply somu-imply marked this pull request as ready for review February 21, 2023 18:26
@somu-imply (Contributor, Author):

This is fixed through #13934. Closing in favor of that.

@somu-imply somu-imply closed this Mar 23, 2023
@somu-imply somu-imply deleted the unnest_issues branch March 23, 2023 23:37