Fixing a data correctness issue in unnest when first row of an MVD is null #13764

Closed
somu-imply wants to merge 5 commits into apache:master from somu-imply:unnest_issues

Conversation

@somu-imply (Contributor) commented Feb 7, 2023

This PR fixes two things:

  1. When the first row of an MVD is null, the previous version of unnest returned an incorrect result because the underlying array object was never updated after the null row.

Previous:

select * from mytest1, unnest(mv_to_array(c2)) as unnested(c3)

__time                    c1 c2            c3
2022-01-01T00:00:00.000Z  1  null          null
2022-01-01T00:00:00.000Z  2  ["A","B","C"] null

After this change

select * from mytest1, unnest(mv_to_array(c2)) as unnested(c3)

__time                    c1 c2            c3
2022-01-01T00:00:00.000Z  1  null          null
2022-01-01T00:00:00.000Z  2  ["A","B","C"] A
2022-01-01T00:00:00.000Z  2  ["A","B","C"] B
2022-01-01T00:00:00.000Z  2  ["A","B","C"] C
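The root cause can be illustrated with a small sketch (plain Python, not Druid's actual Java cursor code; the row shapes and the `unnested` key are illustrative): the unnest cursor must re-read the array on every advance of the base cursor, instead of caching whatever the first row held.

```python
# Conceptual sketch of the bug class: an unnest that latches onto the first
# row's (null) array never emits the later rows' elements. The fix is to
# re-read the unnest target for every base row.

def unnest_rows(rows, column):
    """Yield one output row per element of row[column]; null/empty stays null."""
    for row in rows:
        values = row.get(column)
        if not values:                       # null or empty MVD: single null row
            yield {**row, "unnested": None}
        else:                                # re-read the array for EVERY row,
            for v in values:                 # not just the first one
                yield {**row, "unnested": v}

rows = [
    {"c1": 1, "c2": None},
    {"c1": 2, "c2": ["A", "B", "C"]},
]
result = [r["unnested"] for r in unnest_rows(rows, "c2")]
# result == [None, "A", "B", "C"]
```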
  2. For nested queries that have an unnest in both the inner and the outer query, the output column name was always set to EXPR$0 for both levels. This change makes the output column names distinct.

Before this change, for the query

with t AS (select dim2, d3 from druid.numfoo, unnest(MV_TO_ARRAY(dim3)) as unnested (d3))
select d2,d3 from t, UNNEST(MV_TO_ARRAY(dim2)) as unnested(d2)

The planner would do

{
  "queryType" : "scan",
  "dataSource" : {
    "type" : "unnest",
    "base" : {
      "type" : "query",
      "query" : {
        "queryType" : "scan",
        "dataSource" : {
          "type" : "unnest",
          "base" : {
            "type" : "table",
            "name" : "numfoo"
          },
          "column" : "dim3",
          "outputName" : "EXPR$0",
          "allowList" : null
        },
        "intervals" : {
          "type" : "intervals",
          "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
        },
        "resultFormat" : "compactedList",
        "columns" : [ "EXPR$0", "__time", "cnt", "d1", "d2", "dim1", "dim2", "dim3", "dim4", "dim5", "dim6", "f1", "f2", "l1", "l2", "m1", "m2", "unique_dim1" ],
        "legacy" : false,
        "context" : {
          "defaultTimeout" : 300000,
          "maxScatterGatherBytes" : 9223372036854775807,
          "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
          "sqlQueryId" : "dummy",
          "vectorize" : "false",
          "vectorizeVirtualColumns" : "false"
        },
        "granularity" : {
          "type" : "all"
        }
      }
    },
    "column" : "dim2",
    "outputName" : "EXPR$0",
    "allowList" : null
  },
  "intervals" : {
    "type" : "intervals",
    "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
  },
  "resultFormat" : "compactedList",
  "columns" : [ "EXPR$0", "EXPR$00" ],
  "legacy" : false,
  "context" : {
    "defaultTimeout" : 300000,
    "maxScatterGatherBytes" : 9223372036854775807,
    "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
    "sqlQueryId" : "dummy",
    "vectorize" : "false",
    "vectorizeVirtualColumns" : "false"
  },
  "granularity" : {
    "type" : "all"
  }
}

After this change, the planner plans correctly as

{
  "queryType" : "scan",
  "dataSource" : {
    "type" : "unnest",
    "base" : {
      "type" : "query",
      "query" : {
        "queryType" : "scan",
        "dataSource" : {
          "type" : "unnest",
          "base" : {
            "type" : "table",
            "name" : "numfoo"
          },
          "column" : "dim3",
          "outputName" : "EXPR$0",
          "allowList" : null
        },
        "intervals" : {
          "type" : "intervals",
          "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
        },
        "resultFormat" : "compactedList",
        "columns" : [ "EXPR$0", "__time", "cnt", "d1", "d2", "dim1", "dim2", "dim3", "dim4", "dim5", "dim6", "f1", "f2", "l1", "l2", "m1", "m2", "unique_dim1" ],
        "legacy" : false,
        "context" : {
          "defaultTimeout" : 300000,
          "maxScatterGatherBytes" : 9223372036854775807,
          "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
          "sqlQueryId" : "dummy",
          "vectorize" : "false",
          "vectorizeVirtualColumns" : "false"
        },
        "granularity" : {
          "type" : "all"
        }
      }
    },
    "column" : "dim4",
    "outputName" : "EXPR$00",
    "allowList" : null
  },
  "intervals" : {
    "type" : "intervals",
    "intervals" : [ "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z" ]
  },
  "resultFormat" : "compactedList",
  "columns" : [ "EXPR$0", "EXPR$00" ],
  "legacy" : false,
  "context" : {
    "defaultTimeout" : 300000,
    "maxScatterGatherBytes" : 9223372036854775807,
    "sqlCurrentTimestamp" : "2000-01-01T00:00:00Z",
    "sqlQueryId" : "dummy",
    "vectorize" : "false",
    "vectorizeVirtualColumns" : "false"
  },
  "granularity" : {
    "type" : "all"
  }
}
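The naming scheme above (the inner unnest keeps EXPR$0, and each enclosing level appends a 0) can be sketched as a tiny name-disambiguation helper. `disambiguate` is a hypothetical illustration of the scheme, not Druid's planner API:

```python
def disambiguate(taken, base="EXPR$0"):
    """Append '0' until the candidate no longer collides with output names
    already produced by inner unnest levels (sketch of the PR's scheme)."""
    name = base
    while name in taken:
        name += "0"
    return name

inner = disambiguate(set())        # inner unnest keeps the base name
outer = disambiguate({inner})      # outer unnest gets a "0" suffix
# inner == "EXPR$0", outer == "EXPR$00"
```

As the PR notes, these generated names would ideally be replaced by the aliases from the SQL query (d3, d2) rather than suffixed.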

Additionally, unit tests for both 1 and 2 have been added by creating a new data source in the CalciteTests framework.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@somu-imply somu-imply marked this pull request as ready for review February 7, 2023 19:59
@imply-cheddar (Contributor):

Did I understand you correctly that the previous (bad?) behavior was

select * from mytest1, unnest(mv_to_array(c2)) as unnested(c3)

__time c1 c2 c3
2022-01-01T00:00:00.000Z 1 null null
2022-01-01T00:00:00.000Z 2 ["A","B","C"] A
2022-01-01T00:00:00.000Z 2 ["A","B","C"] B
2022-01-01T00:00:00.000Z 2 ["A","B","C"] C

I.e. when it gets to the second row that has an array of 3 values, it unnests it into 3 rows.

And you changed the code to do

select * from mytest1, unnest(mv_to_array(c2)) as unnested(c3)

__time c1 c2 c3
2022-01-01T00:00:00.000Z 1 null null
2022-01-01T00:00:00.000Z 2 ["A","B","C"] null

I.e. if it sees a null, it will use the null going forward an not unnest anything?

If that understanding is correct, can you explain why the previous behavior is not the correct behavior? It is what I had expected at least...

@somu-imply (Contributor, Author) commented Feb 8, 2023

@imply-cheddar it is the other way around; the previous behavior was

select * from mytest1, unnest(mv_to_array(c2)) as unnested(c3)

__time c1 c2 c3
2022-01-01T00:00:00.000Z 1 null null
2022-01-01T00:00:00.000Z 2 ["A","B","C"] null

Which was incorrect. This was changed to the correct behavior, which should now produce 4 rows in the output. I have made the description clearer.

public static final String DATASOURCE3 = "numfoo";
public static final String DATASOURCE4 = "foo4";
public static final String DATASOURCE5 = "lotsocolumns";
public static final String DATASOURCE6 = "unnestnumfoo";
Review comment (Contributor):

Better name? nested perhaps? Also, would be cool to add a comment with the schema: I find it hard to suss that out from the code.

public static final List<InputRow> ROWS1 =
RAW_ROWS1.stream().map(TestDataBuilder::createRow).collect(Collectors.toList());

public static final List<ImmutableMap<String, Object>> RAW_ROWS_FOR_UNNEST = ImmutableList.of(
Review comment (Contributor):

Does this have all the interesting corner cases? Empty arrays or objects? Null values? Fields that appear in one nested object but not another (in both orders: (a,b), (a), (a,c))? And so on. To help future readers, might be handy to add a comment above each .put( call that sets up one of these cases.

Reply (Contributor, Author):

Good idea, will do.
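Along the lines of the reviewer's suggestion above, a hypothetical corner-case fixture (illustrative names and values, not the PR's actual test data) might cover nulls, empty arrays, and mixed contents, with a comment per row stating the expected behavior:

```python
# Hypothetical corner-case rows for unnest tests; each comment states the
# expected unnest output for that row.
corner_case_rows = [
    {"dim": None},              # null MVD: one output row with a null value
    {"dim": []},                # empty array: also one null output row
    {"dim": ["a"]},             # single element: one row
    {"dim": ["a", "b"]},        # two elements: two rows
    {"dim": ["a", None, "b"]},  # null element inside the array: three rows
]

# A null/empty MVD still contributes one (null) row, so the expected total
# is max(len, 1) summed over all rows: 1 + 1 + 1 + 2 + 3 = 8.
expected_total = sum(max(len(r["dim"] or []), 1) for r in corner_case_rows)
```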

// the column name cannot be EXPR$0 for both inner and outer. The inner one which gets executed first gets the name
// EXPR$0 and as we move up the tree we add a 0 at the end to make the top level EXPR$00.
// Ideally these names should be replaced by the alias names specified in the query. Any future developer if
// able to find these alias names should replace EXPR$0 by dim3 and EXPR$00 by dim2, i.e use the correct name from Calcite
Review comment (Contributor):

Thanks much for the detailed explanation!

.put("f1", 1.0f)
.put("l1", 7L)
.put("dim1", "")
.put("dim3", ImmutableList.of("a", ImmutableList.of("b", "c")))
Review comment (Member):

The string dimension indexer can't really handle nested arrays like this; I think you'll end up with something like "a" and then the 'toString' of ["b","c"], or maybe something even weirder.

I think you should stick to having either flat lists or single-layer strings for these tests.


if (druidQueryRel instanceof DruidCorrelateUnnestRel) {
outputColName = outputColName + "0";
Review comment (Member):

I'm skeptical that this is always correct; is it really cool?

Reply (Contributor, Author):

This is a hacky approach for now; I have left a pointer to this so it can be corrected by fetching the actual names. Will do that in a follow-up PR.

@somu-imply somu-imply marked this pull request as draft February 21, 2023 18:26
@somu-imply somu-imply marked this pull request as ready for review February 21, 2023 18:26
@somu-imply (Contributor, Author):

This is fixed through #13934. Closing in favor of that.

@somu-imply somu-imply closed this Mar 23, 2023
@somu-imply somu-imply deleted the unnest_issues branch March 23, 2023 23:37