
Unnest functionality for Druid #13268

Merged: 38 commits merged into apache:master on Dec 3, 2022

Conversation

@somu-imply (Contributor) commented Oct 27, 2022

Implementation of Unnest.
Unnest has been created as a data source. An unnest data source has the following:

  • base (the base data source to be unnested)
  • column (the column in the base data source to be unnested)
  • outputName (the name under which the unnested column will be exposed)
  • allowList (optionally restricts unnesting to a subset of the values of a multi-value column).

Segment references and storage adapters have been created. Two different cursors have also been added: one handles dictionary-encoded columns (DimensionUnnestCursor), while the other (ColumnarValueUnnestCursor) handles column values without encoding.

Queries supported are:

1. Scan

{
  "queryType": "scan",
  "dataSource": {
    "type": "unnest",
    "base": {
      "type": "table",
      "name": "foo"
    },
    "column": "dim3",
    "outputName": "unnest-dim3"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "limit": 1000,
  "columns": [
    "__time",
    "dim1",
    "dim2",
    "dim3",
    "m1",
    "m2",
    "unnest-dim3"
  ],
  "legacy": false,
  "granularity": {
    "type": "all"
  },
  "context": {
    "debug": true,
    "useCache": false
  }
}

[screenshot: scan query results]

2. GroupBy

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "unnest",
    "base": "foo",
    "column": "dim3",
    "outputName": "unnest-dim3",
    "allowList": null
  },
  "intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"],
  "granularity": "all",
  "dimensions": [ "unnest-dim3" ],
  "limitSpec": {
    "type": "default",
    "columns": [
      { "dimension": "unnest-dim3", "direction": "descending" }
    ],
    "limit": 1001
  },
  "context": { "debug": true }
}

[screenshot: groupBy query results]

3. TopN

{
  "queryType": "topN",
  "dataSource": {
    "type": "unnest",
    "base": { "type": "table", "name": "foo" },
    "column": "dim3",
    "outputName": "unnest-dim3",
    "allowList": null
  },
  "dimension": {
    "type": "default",
    "dimension": "dim2",
    "outputName": "d0",
    "outputType": "STRING"
  },
  "metric": {
    "type": "inverted",
    "metric": { "type": "numeric", "metric": "a0" }
  },
  "threshold": 3,
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "granularity": { "type": "all" },
  "aggregations": [
    { "type": "floatMin", "name": "a0", "fieldName": "m1" }
  ],
  "context": { "debug": true }
}

[screenshot: topN query results]

Additionally, users can add an allowList:

[screenshot: query using an allowList and its results]

Filters can also be specified alongside an allowList:
[screenshot: query combining a filter with an allowList]

Unnest data sources can also be nested:

{
  "queryType": "scan",
  "dataSource": {
    "type": "unnest",
    "base": {
      "type": "unnest",
      "base": {
        "type": table,
        "name": foo1
      },
      "column": "dim3",
      "outputName": "unnest-dim3",
      "allowList": ["a"]
    },
    "column": "dim3",
    "outputName": "unnest-dim3-again",
    "allowList": ["b","d"]
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "limit": 1000,
  "columns": [
    "__time",
    "dim1",
    "dim2",
    "dim3",
    "m1",
    "m2",
    "unnest-dim3",
    "unnest-dim3-again"
  ],
  "legacy": false,
  "granularity": {
    "type": "all"
  },
  "context": {
    "debug": true,
    "useCache": false
  }
}

You can also go multiple levels deep, combining unnest with joins and query data sources:

{
  "queryType": "scan",
  "dataSource": {
    "type": "unnest",
    "base": {
        "type": "join",
        "left": {
          "type": "table",
          "name": "foo1"
        },
        "right": {
          "type": "query",
          "query": {
            "queryType": "scan",
            "dataSource": {
              "type": "table",
              "name": "foo1"
            },
            "intervals": {
              "type": "intervals",
              "intervals": [
                "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
              ]
            },
            "virtualColumns": [
              {
                "type": "expression",
                "name": "v0",
                "expression": "\"m2\"",
                "outputType": "FLOAT"
              }
            ],
            "resultFormat": "compactedList",
            "columns": [
              "__time",
              "dim1",
              "dim2",
              "dim3",
              "m1",
              "m2",
              "v0"
            ],
            "legacy": false,
            "context": {
              "queryId": "65c2cd34-2aa8-4370-9a41-01835cacc5fd",
              "sqlOuterLimit": 1001,
              "sqlQueryId": "65c2cd34-2aa8-4370-9a41-01835cacc5fd",
              "useNativeQueryExplain": true
            },
            "granularity": {
              "type": "all"
            }
          }
        },
        "rightPrefix": "j0.",
        "condition": "(\"m1\" == \"j0.v0\")",
        "joinType": "INNER"
      },
    "column": "dim3",
    "outputName": "unnest-dim3",
    "allowList": []
    },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "resultFormat": "compactedList",
  "limit": 1001,
  "columns": [
    "__time",
    "dim1",
    "dim2",
    "dim3",
    "j0.__time",
    "j0.dim1",
    "j0.dim2",
    "j0.dim3",
    "j0.m1",
    "j0.m2",
    "m1",
    "m2",
    "unnest-dim3"
  ],
  "legacy": false,
  "context": {
    "queryId": "65c2cd34-2aa8-4370-9a41-01835cacc5fd",
    "sqlOuterLimit": 1001,
    "sqlQueryId": "65c2cd34-2aa8-4370-9a41-01835cacc5fd",
    "useNativeQueryExplain": true
  },
  "granularity": {
    "type": "all"
  }
}

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@somu-imply changed the title from "Moving all unnest cursor code atop refactored code for unnest" to "Unnest functionality for Druid" on Nov 6, 2022
@somu-imply marked this pull request as ready for review on November 6, 2022, 22:45
}

/**
* Create an unnest dataSource from a string condition.
Contributor:
What is this comment trying to tell me?

Comment on lines 142 to 159
return JvmUtils.safeAccumulateThreadCpuTime(
    cpuTimeAccumulator,
    () -> {
      if (column == null) {
        return Function.identity();
      } else if (column.isEmpty()) {
        return Function.identity();
      } else {
        return baseSegment ->
            new UnnestSegmentReference(
                baseSegment,
                column,
                outputName,
                allowList
            );
      }
    }
);
Contributor:
This code doesn't seem to be delegating to its child, do you have any tests that test for, e.g. nesting of these things?

@Override
public DataSource withUpdatedDataSource(DataSource newSource)
{
return null;
Contributor:
Is this never called? If it is, my guess is that it will produce an NPE. Maybe include a comment about why it is safe to do this?

@@ -125,6 +126,11 @@ public static DataSourceAnalysis forDataSource(final DataSource dataSource)
current = subQuery.getDataSource();
}

while (current instanceof UnnestDataSource) {
Contributor:
If there is a Query of an Unnest of a Query of an Unnest, the way that you have interleaved these is not going to completely unwrap the objects as expected.

This DataSourceAnalysis thing is probably another thing to move onto the DataSource object itself... Not sure if we should do that now or leave it as something to do for later though. either way, you need both conditions (check for Query and check for Unnest) on the while loop above.
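As an illustration only, a minimal sketch of the combined unwrapping the reviewer is describing (UnnestDataSource#getBase() is assumed from the "base" field in the PR summary; this is not the merged code):

  private static DataSource unwrapToBaseDataSource(DataSource dataSource)
  {
    DataSource current = dataSource;
    // Peel both query and unnest layers in one loop so that a
    // Query-of-Unnest-of-Query-of-Unnest chain is fully unwrapped.
    while (current instanceof QueryDataSource || current instanceof UnnestDataSource) {
      if (current instanceof QueryDataSource) {
        current = ((QueryDataSource) current).getQuery().getDataSource();
      } else {
        current = ((UnnestDataSource) current).getBase();
      }
    }
    return current;
  }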

Comment on lines 46 to 51
public ColumnarValueUnnestCursor(
Cursor cursor,
String columnName,
String outputColumnName,
LinkedHashSet<String> allowSet
)
Contributor:
It should be safe to pass in the baseColumnSelectorFactory directly. Once you've made the decision to use this object, you should already have a good column selector factory to use.

Comment on lines 126 to 135
if (availableDimensions.contains(outputColumnName)) {
throw new IAE(
"Provided output name [%s] already exists in table to be unnested. Please use a different name.",
outputColumnName
);
} else {
availableDimensions.add(outputColumnName);
}
return new ListIndexed<>(Lists.newArrayList(availableDimensions));
}
Contributor:
Why is it bad for the output name to already exist?

@Override
public int getNumRows()
{
return 0;
Contributor:
I'm unsure if it's safe to return 0 from this... We should double check what uses this.

Member:
i think this is fine, only segment metadata uses it and some metrics about segment row counts


// TODO: Use placementish column in the QueryRunnerHelperTest
// and set up native queries
public class UnnestQueryRunnerTest extends InitializedNullHandlingTest
Contributor:
Is it intentional that this test is completely empty?

this.baseList = inputList;
}

void populateList()
Contributor:
I think what you want is a static method that builds a ListCursor rather than a method that can get called at any point in time to mutate and change the internals of the ListCursor object.

Comment on lines 271 to 281
Object dimSelectorVal = dimSelector.getObject();
Assert.assertNotNull(dimSelector.getRow());
Assert.assertNotNull(dimSelector.getValueCardinality());
Assert.assertNotNull(dimSelector.makeValueMatcher(OUTPUT_COLUMN_NAME));
Assert.assertNotNull(dimSelector.idLookup());
Assert.assertNotNull(dimSelector.lookupName(0));
Assert.assertNotNull(dimSelector.defaultGetObject());
Assert.assertFalse(dimSelector.isNull());
if (dimSelectorVal == null) {
Assert.assertNull(dimSelectorVal);
}
Contributor:
The assertions here should be updated. You should be able to know and validate the specific ValueCardinality, and this seems to always be looking up the value for the 0 index, which shouldn't be correct. If the test isn't actually walking through rows with different dictionary values, it's not really validating what we need it to.

return baseColumnSelectorFactory.makeDimensionSelector(dimensionSpec);
}

//final DimensionSpec actualDimensionSpec = dimensionSpec.withDimension(columnName);
Contributor Author:
I'll remove this and the other commented out line

outputName,
allowList
return
segmentMapFn.andThen(
Contributor:
This is a style thing, but this sort of fluent style tends to produce hard-to-read stack traces if there are any errors. It creates stack traces with lines from the Function class rather than from UnnestDataSource. Generally speaking, only use a fluent style when the fluency doesn't go outside of the current scope of the code. If you are returning an object that is going to be used by someone else, create a closure.
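A hedged sketch of the non-fluent alternative (names taken from the hunk above; not the exact merged code): capture the base mapping function in a local and return an explicit closure, so stack traces point at UnnestDataSource rather than at java.util.function.Function.

      final Function<SegmentReference, SegmentReference> baseMapFn = segmentMapFn;
      return baseSegment ->
          new UnnestSegmentReference(
              baseMapFn.apply(baseSegment),
              column,
              outputName,
              allowList
          );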

Contributor Author:
Addressed; creating a new object now

if (!outputName.equals(dimensionSpec.getDimension())) {
return baseColumSelectorFactory.makeDimensionSelector(dimensionSpec);
}
return baseColumSelectorFactory.makeDimensionSelector(DefaultDimensionSpec.of(columnName));
Contributor:
I'm perhaps missing something, but this seems to be just delegating to the base and returning without attempting to do any unnesting?

You likely haven't run into this because you are doing a validation ahead of time for what the column can be. As such, the correct answer here might be to throw an UnsupportedOperationException, since you are expecting the caller to use the ColumnValueSelector option instead.
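A rough sketch of that suggestion (field names taken from the hunk above; not the merged code): delegate for every other column, and refuse to hand out a DimensionSelector for the unnested output column on this cursor.

      @Override
      public DimensionSelector makeDimensionSelector(DimensionSpec dimensionSpec)
      {
        // Any column other than the unnest output is still served by the base factory.
        if (!outputName.equals(dimensionSpec.getDimension())) {
          return baseColumSelectorFactory.makeDimensionSelector(dimensionSpec);
        }
        // The unnested column is only exposed through makeColumnValueSelector on this cursor.
        throw new UnsupportedOperationException(
            "Dimension selector is not supported for unnested column [" + outputName + "]"
        );
      }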

Contributor Author:
Addressed, added a unit test for the exception as well

@abhishekagarwal87 (Contributor) left a comment:
would docs come in a follow-up PR?
I haven't yet reviewed the cursor classes though those too could use some javadocs to explain what they are doing.

@Override
public byte[] getCacheKey()
{
return null;
Contributor:
how does caching work for this data source?

Contributor Author:
I have kept this null as of now. Caching can be turned on by changing this to:

public byte[] getCacheKey()
  {
    return new byte[0];
  }

This is similar to the other data sources that are involved in caching like TableDataSource

Member:
I think the column being unnested would need to be part of the cache key. The reason table data sources can get away with an empty cache key is that the table name is part of the segmentId. Here, however, the results depend on what is being unnested, so we can't rely on just the data source name; the cache key would need to be non-empty.

Contributor:
getCacheKey is documented as

  /**
   * Compute a cache key prefix for a data source. This includes the data sources that participate in the RHS of a
   * join as well as any query specific constructs associated with join data source such as base table filter. This key prefix
   * can be used in segment level cache or result level cache. The function can return following
   * - Non-empty byte array - If there is join datasource involved and caching is possible. The result includes
   * join condition expression, join type and cache key returned by joinable factory for each {@link PreJoinableClause}
   * - NULL - There is a join but caching is not possible. It may happen if one of the participating datasource
   * in the JOIN is not cacheable.
   *
   * @return the cache key to be used as part of query cache key
   */

Meaning that a null return value should disable caching. We should likely be even more explicit and make isCacheable return false.
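For illustration, a hedged sketch of what a non-empty cache key could look like if caching were enabled later, using Druid's CacheKeyBuilder (the cache-type id byte and the exact fields included are assumptions, not the merged implementation):

  @Override
  public byte[] getCacheKey()
  {
    // Assumption: everything that changes the unnest output should feed the key,
    // not just the base data source.
    return new CacheKeyBuilder((byte) 0x7F)        // hypothetical cache-type id
        .appendString(column)                      // column being unnested
        .appendString(outputName)                  // name the unnested values are exposed under
        .appendStrings(allowList == null ? Collections.<String>emptyList() : allowList)
        .build();
  }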

import java.util.LinkedHashSet;
import java.util.List;

public class ColumnarValueUnnestCursor implements Cursor
Contributor:
can you add some javadocs here about this class?

Contributor Author:
Added javadocs

this.baseAdapter = baseAdapter;
this.dimensionToUnnest = dimension;
this.outputColumnName = outputColumnName;
this.allowSet = allowSet;
Contributor:
what is special about allowSet that it gets its own variable? Is it just a filter or something more?

Contributor Author:
An in filter on an MVD returns the entire row value in case of any match.
[screenshot: in filter on a multi-value dimension returning whole rows]

The allowSet filters inside the MVD: only the values specified in the allowList are unnested, and the others are ignored. So if we want to unnest only "a" and "b" here, we need to add them to the allowList, as below:
[screenshot: the same query with an allowList of ["a", "b"]]
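A tiny self-contained sketch (plain Java with hypothetical names, not Druid code) of the difference: the allowSet drops individual elements while unnesting, whereas a row-level in filter keeps or drops the whole multi-value row.

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class AllowSetSketch
{
  public static void main(String[] args)
  {
    LinkedHashSet<String> allowSet = new LinkedHashSet<>(Arrays.asList("a", "b"));
    List<String> multiValueRow = Arrays.asList("a", "b", "c");   // one MVD row: ['a', 'b', 'c']

    for (String element : multiValueRow) {
      // Only elements present in the allowSet become unnested rows; 'c' is skipped,
      // but the row itself is not dropped the way a row-level filter would drop it.
      if (allowSet.isEmpty() || allowSet.contains(element)) {
        System.out.println("unnested row -> " + element);
      }
    }
  }
}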

public ColumnCapabilities getColumnCapabilities(String column)
{
if (outputColumnName.equals(dimensionToUnnest)) {
return baseAdapter.getColumnCapabilities(column);
Contributor:
Should the returned column capabilities always have hasMultipleValues set to false?

Contributor Author:
This part delegates to the column capabilities of the base adapter, so the properties depend on the capabilities of the input column. I am not sure I understood this correctly, though.

@Override
public DimensionSelector makeDimensionSelector(DimensionSpec dimensionSpec)
{
throw new UnsupportedOperationException("Dimension selector not applicable for column value selector");
Contributor:
Errr, you did too much! I was only talking about the one case where something asks for the column that is being unnested. It's totally possible that one of the other columns is being accessed as a DimensionSelector and you want to still allow for that.

Contributor Author:
fixed

Comment on lines 36 to 75
/**
* The cursor to help unnest MVDs with dictionary encoding.
* Consider a segment has 2 rows
* ['a', 'b', 'c']
* ['d', 'c']
*
* Considering dictionary encoding, these are represented as
*
* 'a' -> 0
* 'b' -> 1
* 'c' -> 2
* 'd' -> 3
*
* The baseCursor points to the row of IndexedInts [0, 1, 2]
* while the unnestCursor with each call of advance() moves over individual elements.
*
* advance() -> 0 -> 'a'
* advance() -> 1 -> 'b'
* advance() -> 2 -> 'c'
* advance() -> 3 -> 'd' (advances base cursor first)
* advance() -> 2 -> 'c'
*
* Total 5 advance calls above
*
* The allowSet if available helps skip over elements which are not in the allowList by moving the cursor to
* the next available match. The hashSet is converted into a bitset (during initialization) for efficiency.
* If allowSet is ['c', 'd'] then the advance moves over to the next available match
*
* advance() -> 2 -> 'c'
* advance() -> 3 -> 'd' (advances base cursor first)
* advance() -> 2 -> 'c'
*
* Total 3 advance calls in this case
*
* The index reference points to the index of each row that the unnest cursor is accessing
* The indexedInts for each row are held in the indexedIntsForCurrentRow object
*
* The needInitialization flag sets up the initial values of indexedIntsForCurrentRow at the beginning of the segment
*
*/
Contributor:
Awesome. 👍

public void advanceUninterruptibly()
{
do {
advanceAndUpdate();
Contributor:
What happens if baseCursor doesn't have any data? advanceAndUpdate is done, and then in matchAndProceed, could indexedIntsForCurrentRow.get(index) throw an exception?

Contributor Author:
If the base cursor does not have any data, it never reaches this stage of unnest cursor creation, as the base cursor is already in an isDone==true state. Additionally, UnnestStorageAdapter ensures before cursor creation that the base cursor is non-null.

Contributor:
But the baseCursor is also advanced in advanceAndUpdate? So it's possible that baseCursor was not done before but got done during the invocation of this method. Maybe I am missing something, but advancing and then accessing the base cursor doesn't look right here.

Contributor:
Now that I think of it, it probably doesn't matter. What can happen is that index is reset to zero and indexedIntsForCurrentRow points to the last row, just before the loop is about to exit. matchAndProceed will not throw an exception, since indexedIntsForCurrentRow would always have at least one entry.

…estCursor.java

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
Comment on lines 100 to 101
throw new UnsupportedOperationException(
"Dimension selector not applicable for column value selector for column " + outputName);
Contributor:
More design nits:

  1. We have a UOE that bundles String.format into the building of the exception; use that.
  2. We encase interpolated values in [] to help differentiate things like extra spaces.

@somu-imply (Contributor Author) commented Dec 1, 2022

The ColumnarValueUnnestCursor was handling List or String values, but in the case of virtual columns the values are an array of objects. The change is made to support virtual columns and the following type of query. Thanks to @clintropolis for finding this.

{
  "queryType": "scan",
  "dataSource":{
    "type": "unnest",
    "base": {
      "type": "table",
      "name": "foo1"
    },
    "column": "v0",
    "outputName": "unnest-v0"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "virtualColumns": [
    {
      "type": "expression",
      "name": "v0",
      "expression": "array(\"m1\",\"m2\")",
      "outputType": "ARRAY<LONG>"
    }
  ],
  "resultFormat": "compactedList",
  "limit": 1001,
  "columns": [
    "unnest-v0"
  ],
  "legacy": false,
  "context": {
    "populateCache": false,
    "queryId": "d273facb-08cc-4de7-ac0b-d0b82173e531",
    "sqlOuterLimit": 1001,
    "sqlQueryId": "d273facb-08cc-4de7-ac0b-d0b82173e531",
    "useCache": false,
    "useNativeQueryExplain": true
  },
  "granularity": {
    "type": "all"
  }
}

[screenshot: scan results for the unnested array virtual column]

* The needInitialization flag sets up the initial values of unnestListForCurrentRow at the beginning of the segment
*
*/
public class ColumnarValueUnnestCursor implements Cursor
Member:
super nitpick, but why not just call this thing what it is doing, e.g. UnnestColumnValueSelectorCursor? Same thing with the other one, UnnestDimensionSelectorCursor

if (value == null) {
return 0;
}
return Double.valueOf((String) value);
Member:
I don't think you can count on casting to a string here, since it depends on the type of the underlying column value selector, same for other primitive numeric getters
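A hedged sketch of a more defensive numeric getter along the lines of this comment (plain type checks; not the merged code):

  @Override
  public double getDouble()
  {
    final Object value = getObject();
    if (value == null) {
      return 0;
    }
    // The underlying selector may already hand back numbers rather than Strings.
    if (value instanceof Number) {
      return ((Number) value).doubleValue();
    }
    return Double.parseDouble(String.valueOf(value));
  }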

if (!outputName.equals(columnName)) {
baseColumSelectorFactory.getColumnCapabilities(column);
}
return baseColumSelectorFactory.getColumnCapabilities(columnName);
Member:
I don't think you want to strictly pass through the underlying capabilities. If the underlying column is a multi-value string, you need to return capabilities that have hasMultipleValues set to false, since it is no longer a multi-value string; if the underlying capabilities are an ARRAY type, you need to return the element type of the array.

}
unnestListForCurrentRow.add(null);
} else {
if (currentVal instanceof List) {
Member:
I think you'll want to check for Object[] too, since that is the type we have been standardizing ARRAY types to deal in

Member:
ah this comment is stale now

// Helper class to help in returning
// getRow from the dimensionSelector
// This is set in the initialize method
private class SingleIndexInts implements IndexedInts
Contributor:
Because it reuses the index that is being incremented

Comment on lines 389 to 392
public int get(int idx)
{
return indexedIntsForCurrentRow.get(index);
}
Member:
It seems a bit confusing to pass this through to the underlying row's IndexedInts... size is 1, so the argument to this get method should always be 0, no?

I guess I'm worried about silent bugs made possible by having it like this instead of the other SingleIndexInts, which can only possibly expose a single value.

Contributor:
It's not passing through the idx, it's using the index that gets incremented. This is required to preserve the semantics of the dictionary.

Comment on lines 102 to 104
ColumnCapabilities capabilities = cursor.getColumnSelectorFactory().getColumnCapabilities(dimensionToUnnest);
if (capabilities.isDictionaryEncoded() == ColumnCapabilities.Capable.TRUE
&& capabilities.areDictionaryValuesUnique() == ColumnCapabilities.Capable.TRUE) {
Member:
The capabilities returned here are allowed to be null; suggest checking for nulls.

Also, the statement can be slightly simplified:
capabilities.isDictionaryEncoded().and(capabilities.areDictionaryValuesUnique()).isTrue()
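Putting both points together, the check in the storage adapter might look roughly like this (a sketch; variable names taken from the hunk above):

      final ColumnCapabilities capabilities =
          cursor.getColumnSelectorFactory().getColumnCapabilities(dimensionToUnnest);
      // Capabilities can be null, e.g. for a column the selector factory does not know about.
      if (capabilities != null
          && capabilities.isDictionaryEncoded().and(capabilities.areDictionaryValuesUnique()).isTrue()) {
        // use the dictionary-encoded (DimensionSelector-based) unnest cursor
      } else {
        // fall back to the column-value unnest cursor
      }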

import java.util.List;

/**
* The cursor to help unnest MVDs without dictionary encoding.
Member:
this isn't specific to multi-value dimensions, since this also handles ARRAY typed selectors

public ColumnCapabilities getColumnCapabilities(String column)
{
if (!outputName.equals(columnName)) {
baseColumSelectorFactory.getColumnCapabilities(column);
Member:
missing return (also on the dim cursor)

@clintropolis (Member) left a comment:
Going ahead and approving, but the column capabilities really need to be fixed up; it is OK with me if you do that as a follow-up...

It would also be nice to add native query tests to the likes of GroupByQueryRunnerTest, TopNQueryRunnerTest, TimeseriesQueryRunnerTest, and ScanQueryRunnerTest to make sure everything works as expected with the different native query types, but it is fine to do that as a follow-up too. There is a multi-value dimension in the test data these tests use ('placementish'), and I believe some numeric columns as well, so that numeric arrays can be tested with virtual columns.

return baseColumSelectorFactory.getColumnCapabilities(column);
}
// This currently returns the same type as of the column to be unnested
// This is fine for STRING types
Member:
nit: I think it's not really great for string types either, since if `hasMultipleValues` is set the engine will take less efficient paths and treat it as a multi-value string.

Btw, this is pretty easy to fix; all you need to do is something like this, but it is fine to do as a follow-up too:

      final ColumnCapabilities capabilities = baseColumSelectorFactory.getColumnCapabilities(columnName);
      if (capabilities.isArray()) {
        return ColumnCapabilitiesImpl.copyOf(capabilities).setType(capabilities.getElementType());
      }
      if (capabilities.hasMultipleValues()) {
        return ColumnCapabilitiesImpl.copyOf(capabilities).setHasMultipleValues(false);
      }
      return capabilities;

@somu-imply (Contributor Author) commented Dec 2, 2022:
This was a pretty minor change, so I'm adding it here. The rest will be addressed in the follow-up PR.

@clintropolis merged commit 9177419 into apache:master on Dec 3, 2022
@clintropolis added this to the 26.0 milestone on Apr 10, 2023