Grouping on complex columns aka unifying GroupBy strategies #16068
Conversation
@@ -413,9 +414,12 @@ public static Object convertObjectToType(
      case DOUBLE:
        return coerceToObjectArrayWithElementCoercionFunction(obj, DimensionHandlerUtils::convertObjectToDouble);
we probably want to add a case here for other array types to just recursively call this method on the element type for nested and non-primitive arrays
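A recursive version of that case might look roughly like the sketch below. The class and method names here are illustrative stand-ins, not Druid's actual DimensionHandlerUtils API:

```java
import java.util.function.Function;

// Sketch of the recursive case: coerce every element of a (possibly nested)
// object array, converting non-array leaves with the element coercion function.
// NestedArrayCoercion and coerceToObjectArray are hypothetical names.
class NestedArrayCoercion
{
  static Object coerceToObjectArray(Object obj, Function<Object, Object> leafCoercion)
  {
    if (obj instanceof Object[]) {
      Object[] arr = (Object[]) obj;
      Object[] out = new Object[arr.length];
      for (int i = 0; i < arr.length; i++) {
        // Recurse so nested arrays are coerced element by element.
        out[i] = coerceToObjectArray(arr[i], leafCoercion);
      }
      return out;
    }
    // Leaf element: apply the per-element coercion (e.g. convert to long).
    return leafCoercion.apply(obj);
  }
}
```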
  static {
    ComplexMetrics.registerSerde(ColumnType.NESTED_DATA.getComplexTypeName(), new NestedDataComplexTypeSerde());
  }
this should not be here, it happens in the module, if a test is needing this then just register manually for the test
oops, definitely didn't intend that there.
@@ -380,7 +380,7 @@ public VerifyNativeQueries(BaseExecuteQuery execStep)
  public void verify()
  {
    for (QueryResults queryResults : execStep.results()) {
-     verifyQuery(queryResults);
+     // verifyQuery(queryResults);
this i assume was for testing some stuff and needs to go away
Yep, I didn't wanna write the query to verify that it was working. That's why I have also commented it out with Intellij style comment (s.t. it'd fail during checkstyle).
      case ARRAY:
        switch (capabilities.getElementType().getType()) {
          case LONG:
          case STRING:
          case DOUBLE:
            return DictionaryBuildingGroupByColumnSelectorStrategy.forType(capabilities.toColumnType());
          case FLOAT:
            // Array<Float> not supported in expressions, ingestion
          default:
            throw new IAE("Cannot create query type helper from invalid type [%s]", capabilities.asTypeString());
        }
      case COMPLEX:
        return DictionaryBuildingGroupByColumnSelectorStrategy.forType(capabilities.toColumnType());
      default:
        throw new IAE("Cannot create query type helper from invalid type [%s]", capabilities.asTypeString());
there is no reason other array types cannot group in the same way, so I think we can just do this
Suggested change:
-      case ARRAY:
-        switch (capabilities.getElementType().getType()) {
-          case LONG:
-          case STRING:
-          case DOUBLE:
-            return DictionaryBuildingGroupByColumnSelectorStrategy.forType(capabilities.toColumnType());
-          case FLOAT:
-            // Array<Float> not supported in expressions, ingestion
-          default:
-            throw new IAE("Cannot create query type helper from invalid type [%s]", capabilities.asTypeString());
-        }
-      case COMPLEX:
-        return DictionaryBuildingGroupByColumnSelectorStrategy.forType(capabilities.toColumnType());
-      default:
-        throw new IAE("Cannot create query type helper from invalid type [%s]", capabilities.asTypeString());
+      case ARRAY:
+      case COMPLEX:
+      default:
+        return DictionaryBuildingGroupByColumnSelectorStrategy.forType(capabilities.toColumnType());
    super(keyBufferPosition);
    this.keyBufferPosition = keyBufferPosition;
    this.complexType = complexType;
    this.complexTypeName = Preconditions.checkNotNull(complexType.getComplexTypeName(), "complex type name expected");
if instead of complexType.getComplexTypeName() we just use complexType.asTypeString(), we can re-use this strategy generically for things like nested arrays or arrays of complex types too. Suggest renaming to GenericDictionaryBuildingRowBasedKeySerdeHelper or something.
I suppose we could pack dictionaries for all types into this since they are all basically the same, to have less stuff to track... but it also seems ok to leave the strings and primitive arrays handled separately in case we want to further optimize stuff
I kept it separate in case we wanna revisit using the hashmaps instead of sortedMaps for them.
    // TODO(laksh): Check if calling .getObject() on primitive selectors be problematic??
    // Convert the object to the desired type
    //noinspection unchecked
    return (T) DimensionHandlerUtils.convertObjectToType(columnValueSelector.getObject(), columnType);
i wonder if there is a nicer way we could do this by letting the thing that creates this supply a value supplier method that could be backed by whatever makes sense for the selector type, e.g. something like
public FixedWidthGroupByColumnSelectorStrategy(
int keySizeBytes,
boolean isPrimitive,
ColumnType columnType,
Function<ColumnValueSelector<?>, T> getterFn
)
and then
case LONG:
return new FixedWidthGroupByColumnSelectorStrategy<>(
Byte.BYTES + Long.BYTES,
true,
ColumnType.LONG,
ColumnValueSelector::getLong
);
and so on. It seems to work afaict and doesn't seem like it should be any worse than going through getObject and dimension handler utils conversion methods
I think it makes sense given that either we'd be having numeric primitives, or complex (future work) types, since in that case, DimensionHandlerUtils.convertObjectToType() would be a no-op, and we can remove that call.
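The getter-function pattern discussed above can be sketched in miniature as follows; Selector and FixedWidthStrategySketch are illustrative stand-ins for Druid's ColumnValueSelector and the real strategy class, not its actual API:

```java
import java.util.function.Function;

// Stand-in for Druid's ColumnValueSelector, reduced to the two accessors this sketch needs.
interface Selector
{
  long getLong();

  Object getObject();
}

// Hypothetical strategy shape: the factory supplies the getter function, so a
// primitive column never goes through getObject() plus a conversion step.
class FixedWidthStrategySketch<T>
{
  private final Function<Selector, T> getterFn;

  FixedWidthStrategySketch(Function<Selector, T> getterFn)
  {
    this.getterFn = getterFn;
  }

  T read(Selector selector)
  {
    // Delegate to whatever accessor makes sense for the column type.
    return getterFn.apply(selector);
  }
}
```

A long column would then be wired up with `new FixedWidthStrategySketch<Long>(Selector::getLong)`, mirroring the `ColumnValueSelector::getLong` reference in the suggestion above.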
 *
 * @param <DimensionHolderType> Type of the dimension holder
 */
public interface DimensionToIdConverter<DimensionHolderType>
i feel like we should special-handle multi-value dimensions instead of making new interfaces to accommodate them, though I'm still considering exactly how we should do this... maybe just making a dedicated multi-value string grouping strategy that doesn't fit any of the common patterns would be best.
Everything else should be grouping a single value and that is what we should always do going forwards. Things with multiple values like arrays should instead explicitly use UNNEST to aggregate individual elements of multiple values.
In fact, I almost wonder if we should dump all of the multi-value grouping stuff and just rewrite in GroupingEngine.process
to wrap the storage adapter in an unnest if we detect any multi-value dimensions...
In fact, I almost wonder if we should dump all of the multi-value grouping stuff and just rewrite in GroupingEngine.process to wrap the storage adapter in an unnest if we detect any multi-value dimensions...
I think that would work; we'd need something like UNNEST(MV_TO_ARRAY(...)). However, that would perhaps alter the results which equated null to []. Also, we might not be able to use the dictionaries (if present) on top of the multi-value dimensions.
well, rather than explicitly writing that function, the idea would be to automatically wrap the storageAdapter in an UnnestStorageAdapter for each multi-value string being grouped. The cursors made by UnnestStorageAdapter already use dimension selectors in the case of a multi-value string, and so can produce dictionary encoded dimension selectors from the cursor as well, so we should be able to retain grouping directly on the dimension id. If I recall correctly, because we use the dimension selector for the unnest cursor, I think it should behave the same way as the grouping engine with regards to null and [], but I would have to confirm.
This is probably too big of a change for this PR though, so I don't think this refactor needs to be done right now, but is worth pursuing in the future I think because it feels like the right way to handle things, then grouping itself has no concept of multi-value dimensions which sounds pretty amazing to me.
👍
Currently, I am refactoring the key mapping strategy into:
a. KeyMappingGroupByColumnSelectorStrategy - for arrays and complex types
b. KeyMappingMultiValueGroupByColumnSelectorStrategy - this would handle only strings. Upon thinking about it, it should be exactly the same as the StringGroupByColumnSelectorStrategy
 *
 * @param <DimensionType> Type of the dimension's values
 */
public interface IdToDimensionConverter<DimensionType>
hmm, this seems basically like IdLookup except generic, and the latter method is only used internally to the class where this thing is also used. I wonder if we can just drop this interface, make IdLookup generic, and just define canCompareIds directly on KeyMappingGroupByColumnSelectorStrategy and let stuff override it
While refactoring, I realized that IdLookup is a replacement for DimensionToIdConverter. After separating out the strategy, it should be simple to replace DimensionToIdConverter with IdLookup.
#16068 (comment) After completing this refactoring, I realized that we now only have a single implementation of the Id->Dimension and Dimension->Id converters. Keeping or removing this abstraction now depends on whether or not we want to keep things extensible for future dimensions having prebuilt dictionaries.
Also, I think we can't replace DimensionToIdConverter with IdLookup because the former requires tracking memory estimates as well.
    return columnCapabilities != null
           && columnCapabilities.hasBitmapIndexes()
           && (columnCapabilities.areDictionaryValuesSorted()
                                 .and(columnCapabilities.areDictionaryValuesUnique())).isTrue();
this isn't correct, we should be checking isDictionaryEncoded, not hasBitmapIndexes:
Suggested change:
-    return columnCapabilities != null
-           && columnCapabilities.hasBitmapIndexes()
-           && (columnCapabilities.areDictionaryValuesSorted()
-                                 .and(columnCapabilities.areDictionaryValuesUnique())).isTrue();
+    return columnCapabilities != null
+           && columnCapabilities.isDictionaryEncoded()
+                                .and(columnCapabilities.areDictionaryValuesSorted())
+                                .and(columnCapabilities.areDictionaryValuesUnique())
+                                .isTrue();
Thanks for catching this. I was also confused about the condition, but then I chose to go ahead with the pre-existing code: https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/column/StringGroupByColumnSelectorStrategy.java#L165.
I'll correct this here.
 * @see IdToDimensionConverter decoding logic for converting back dictionary to value
 */
@NotThreadSafe
class KeyMappingGroupByColumnSelectorStrategy<DimensionType, DimensionHolderType>
yeah, after looking more at this I'm really starting to feel like we should either split out multi-value string grouping into its own strategy (or do the other thing and dump it completely in favor of the unnest adapter).
Everything except for mvds will spend time doing pointless stuff when there is ever only 1 dictionary id per row, so it doesn't feel worth the complexity/cost to have this odd strategy unified with the rest of the sane ones.
This makes more sense. It'll introduce some redundancy, but it should make the code a lot cleaner (especially when we handle single values)
  @Override
  public boolean groupable()
  {
    return true;
this should check if element type strategy is groupable
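The delegation the reviewer asks for can be sketched like this; TypeStrategySketch and ArrayTypeStrategySketch are illustrative stand-ins for Druid's TypeStrategy classes, not the real API:

```java
// Sketch: an array type strategy should report groupable() based on its
// element strategy, not return true unconditionally.
interface TypeStrategySketch
{
  boolean groupable();
}

class ArrayTypeStrategySketch implements TypeStrategySketch
{
  private final TypeStrategySketch elementStrategy;

  ArrayTypeStrategySketch(TypeStrategySketch elementStrategy)
  {
    this.elementStrategy = elementStrategy;
  }

  @Override
  public boolean groupable()
  {
    // An array is groupable only if its elements are.
    return elementStrategy.groupable();
  }
}
```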
  {
    if (columnType.equals(ColumnType.STRING)) {
      // String types are handled specially because they can have multi-value dimensions
      throw DruidException.defensive("Should use special variant which handles multi-value dimensions");
this can handle regular strings probably, since if we know that a string column definitely isn't multi-value, it's probably more efficient to not have to check every value?
I haven't tried it out, but I suspect most of the work would happen while building the dictionary. On the flip side, it helped me get rid of the PreBuiltGroupBySelectorStrategy because there's no type at the moment that can exploit this, so this seemed lucrative.
I could add that back in, so adding stuff for the array dictionary becomes easier later. WDYT?
Took a first pass. Will take a second pass soon.
 * under the License.
 */

package org.apache.druid.sql.calcite;
Why is this class required?
 * Implements {@link Arrays#equals} but the element equality uses the element's type strategy
 */
@Override
public boolean equals(@Nullable Object[] a, @Nullable Object[] b)
Let's have UTs for this.
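A unit test for this method would exercise something like the sketch below, where element equality is driven by a comparator rather than Object#equals — roughly what "equality via the element's type strategy" amounts to. The names here are illustrative, not Druid's actual API:

```java
import java.util.Comparator;

// Hypothetical element-wise array equality: equal iff same length and every
// pair of elements compares equal under the supplied comparator.
class TypeStrategyArrayEquals
{
  static boolean arrayEquals(Object[] a, Object[] b, Comparator<Object> elementComparator)
  {
    if (a == b) {
      return true;
    }
    if (a == null || b == null || a.length != b.length) {
      return false;
    }
    for (int i = 0; i < a.length; i++) {
      // Delegate element equality to the comparator instead of Object#equals.
      if (elementComparator.compare(a[i], b[i]) != 0) {
        return false;
      }
    }
    return true;
  }
}
```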
@@ -82,7 +83,7 @@ public int writeKeys(

        // Use same ROUGH_OVERHEAD_PER_DICTIONARY_ENTRY as the nonvectorized version; dictionary structure is the same.
        stateFootprintIncrease +=
-           DictionaryBuilding.estimateEntryFootprint((value == null ? 0 : value.length()) * Character.BYTES);
+           DictionaryBuildingUtils.estimateEntryFootprint((value == null ? 0 : value.length()) * Character.BYTES);
Should we push this string logic inside the footprint method, or maybe have a wrapper string method inside the same class?
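The wrapper being suggested could look roughly like this. The overhead constant is an illustrative stand-in for ROUGH_OVERHEAD_PER_DICTIONARY_ENTRY, and the class name is hypothetical, not Druid's actual DictionaryBuildingUtils:

```java
// Sketch: keep the generic byte-based estimate, and add a string-specific
// wrapper that owns the null check and the char-to-byte conversion, so
// callers stop repeating it.
class DictionaryFootprint
{
  // Illustrative per-entry bookkeeping overhead, not Druid's real value.
  private static final int ROUGH_OVERHEAD_PER_ENTRY = 48;

  // Generic estimate: payload bytes plus per-entry overhead.
  static int estimateEntryFootprint(int valueBytes)
  {
    return valueBytes + ROUGH_OVERHEAD_PER_ENTRY;
  }

  // String wrapper: null-safe, converts character count to bytes.
  static int estimateStringEntryFootprint(String value)
  {
    return estimateEntryFootprint((value == null ? 0 : value.length()) * Character.BYTES);
  }
}
```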
Minor comments. Changes LGTM!!
I am cleaning up the dead todos, and adding a few tests soon! Should be good to go then.
#16068 modified DimensionHandlerUtils to accept complex types to be dimensions. This had an unintended side effect of allowing complex types to be joined upon (which wasn't guarded explicitly, it doesn't work). This PR modifies the IndexedTable to reject building the index on the complex types to prevent joining on complex types. The PR adds back the check in the same place, explicitly.
Description
User Impact
Once the core functionality is in, users can pass complex types as dimensions to the group by queries. For example:
Making grouping strategies type agnostic
The grouping engine requires a way to address a dimension and store a fixed-width representation of that dimension on a buffer. This can vary for different types, for example, longs can be represented as is, while strings require a dictionary (runtime, or precomputed) so that a variable length string can be addressed by a fixed width dictionary id.
Currently, we have a separate strategy for each type. To enable grouping on complex columns (at the engine level), we need these strategies to be type-agnostic, so that they can work with the supported complex types without knowing the types. We can also unify the pre-existing types so that the grouping strategies are classified based on the work (and dimensions) that they handle, rather than being specialized for each type. The classifications are:
1. Fixed width types - These include numeric primitives. In the future, we can optimize fixed-width complex types, like IPv4 and Geo types, to use this strategy as well, with a hint from the type strategy that the objects can be represented in fixed-width columns. Such types can be serialized into their representations without being mapped to a dictionary, as they are fixed width.
2. Variable width types with prebuilt dictionaries - These include string types with prebuilt dictionaries. In the future, once we have a way of extracting array dictionaries from the "array column selectors", we can have array types use this strategy as well. Such types can be represented using their prebuilt dictionaryId.
3. Variable width types without prebuilt dictionaries - These include any variable-width type without a prebuilt dictionary, including complex types. We need to build a dictionary + reverse dictionary corresponding to each key and represent the key using the dictionaryId. This also requires a way to estimate the size of the dictionaries and address the key on the dictionary - which requires some interface changes.
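The dictionary-building classification above can be sketched as a runtime dictionary plus reverse dictionary, so that a variable-width value is stored on the key buffer as a fixed-width int id. This is a minimal illustration under simplifying assumptions (plain JDK collections, no footprint tracking), not Druid's actual implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal runtime dictionary: assigns dense int ids to values as they are
// first seen, and can decode an id back to its value.
class RuntimeDictionary<T>
{
  private final List<T> dictionary = new ArrayList<>();              // id -> value
  private final Map<T, Integer> reverseDictionary = new HashMap<>(); // value -> id

  // Return the existing id for the value, or assign the next free id.
  int idOf(T value)
  {
    Integer existing = reverseDictionary.get(value);
    if (existing != null) {
      return existing;
    }
    int id = dictionary.size();
    dictionary.add(value);
    reverseDictionary.put(value, id);
    return id;
  }

  // Decode a buffered id back to the grouping key's value.
  T valueOf(int id)
  {
    return dictionary.get(id);
  }
}
```

The real implementation additionally has to estimate the memory footprint of both maps, which is why the interface changes mentioned above are needed.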
Implementing a hashing strategy for complex types
Group by on complex columns requires that the complex types are addressable from a reverse dictionary.
Currently, all the reverse dictionaries are of pre-known types, and hence the dictionary implementations themselves (hash maps) supply the hashCode and equals required to address the object on the dictionary. Dictionaries can be of two types, and we lay down the ins and outs of addressing an object on both:
Associative array, i.e. HashMap (Object2IntOpenHashMap)
This is currently used throughout the code.
For any type to be added to such a map, it requires a correct implementation of two methods of the Object class:
hashCode(): Required to associate a position in the hash table with the object. Equal objects must have the same hashCode, while different objects can also share a hashCode, which leads to the following point.
equals(): Required for disambiguating objects in case of collisions, i.e. when they have the same hashCode.
Sorted Set (Object2IntRBTreeMap/ Object2IntAvlTreeMap)
This would be a change in the existing dictionary-building strategy, therefore it will affect the performance of array types and string types without a prebuilt dictionary. Hence this change is benchmarked.
The benefit of using this would be that we can reuse the comparators for addressing the objects in the arrays.
However, this affects the performance of the grouping engine, therefore this design is dropped (check the benchmarks section).
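To make the HashMap requirement above concrete: a complex value only works as a reverse-dictionary key if it overrides both methods consistently. ComplexValue below is a hypothetical wrapper, not one of Druid's classes:

```java
import java.util.Arrays;

// Hypothetical complex value: without consistent hashCode()/equals(), two
// equal payloads would be assigned two different dictionary ids by a
// HashMap-backed reverse dictionary.
class ComplexValue
{
  private final byte[] payload;

  ComplexValue(byte[] payload)
  {
    this.payload = payload;
  }

  @Override
  public boolean equals(Object o)
  {
    // Disambiguates hash collisions: equal payloads are the same key.
    return o instanceof ComplexValue && Arrays.equals(payload, ((ComplexValue) o).payload);
  }

  @Override
  public int hashCode()
  {
    // Must agree with equals(): equal payloads hash to the same bucket.
    return Arrays.hashCode(payload);
  }
}
```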
Grouping on non-useful types
Should grouping on types like HLLs, Pairs, etc. (used for earliest/latest) be allowed or not? There isn't any practical use case for supporting these types; however, they can fall under the catch-all umbrella, where we serde the type and compare the byte representation. Therefore, the major question is whether to allow it or not.
Currently, grouping on these types is disallowed, because they are often counter-intuitive to what the users want to measure. For example, the user might want to group on the finalized estimated count, but would instead be grouping on the sketch itself.
Interface changes
The following interface changes have been proposed to implement the above:
Benchmarks
The performance of the current code and of the changes made in this patch is equivalent, while if we use the sorted sets it degrades; hence that approach is rejected.
*this branch + using sorted sets i.e. discarded dictionary building approach
Future work
Once the core functionality is up, we can have the following previously alluded to improvements for faster grouping of specific types:
Implement group by on complex columns and arrays for the vectorized query engine. This can be done in a follow-up PR, without requiring an interface change.
Once there's a way to extract the dictionaries from array selectors, we can reuse those dictionaries, i.e. strategy (2), rather than rebuilding them. This requires a new selector class/interface like ArrayColumnValueSelector, akin to DimensionSelector, which would be a massive code change across Druid's codebase.
We can hint to the grouping engine that a few complex types are fixed width beforehand - these include the IP types. This would prevent the need for having a dictionary associated with the type; instead, they can use strategy (1) to directly serialize the type onto the buffer. This will require adding methods to the public interface, which I want to defer to future patches.