Deserialize dimensions in group by queries to their respective types when reading from their serialized format #16511

LakshSingla · 2024-05-29T05:54:22Z

Description

Grouping on complex columns allows dimensions to be of complex types. When reading dimensions from their serialized format, we must deserialize them into their intended complex type. Otherwise, Druid deserializes the dimension into a generic Java object, equivalent to objectMapper.readValue(serializedValue, Object.class).

We read dimensions from their serialized form when they arrive over the wire (in native queries).

This wasn't discovered at the time of the original PR because only JSON columns support groupability at the moment, and they are designed to handle the generic Java object that the object mapper produces by default. However, most other complex types that need groupability require this change.
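To illustrate the failure mode, here is a minimal sketch that avoids any Jackson or Druid dependency. A LinkedHashMap stands in for what a generic deserialization of a JSON object typically yields, and PairDimension, GenericVsTypedSketch, and their methods are hypothetical names introduced only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical complex dimension type, used only for illustration (not a Druid class).
final class PairDimension implements Comparable<PairDimension>
{
  final long lhs;
  final long rhs;

  PairDimension(long lhs, long rhs)
  {
    this.lhs = lhs;
    this.rhs = rhs;
  }

  // Rebuilds the typed value from the generic map form.
  static PairDimension fromGeneric(Map<?, ?> generic)
  {
    return new PairDimension(
        ((Number) generic.get("lhs")).longValue(),
        ((Number) generic.get("rhs")).longValue()
    );
  }

  @Override
  public int compareTo(PairDimension other)
  {
    int c = Long.compare(lhs, other.lhs);
    return c != 0 ? c : Long.compare(rhs, other.rhs);
  }
}

final class GenericVsTypedSketch
{
  // Stand-in for deserializing {"lhs": 1, "rhs": 2} with Object.class as the target.
  static Object genericValue()
  {
    Map<String, Object> generic = new LinkedHashMap<>();
    generic.put("lhs", 1);
    generic.put("rhs", 2);
    return generic;
  }

  // Grouping code written against the complex type cannot use the generic form directly.
  static boolean usableAsPair(Object dimension)
  {
    return dimension instanceof PairDimension;
  }

  // Converts the generic form to the typed form; already-typed values pass through.
  static PairDimension toTyped(Object dimension)
  {
    if (dimension instanceof PairDimension) {
      return (PairDimension) dimension;
    }
    return PairDimension.fromGeneric((Map<?, ?>) dimension);
  }

  public static void main(String[] args)
  {
    Object generic = genericValue();
    System.out.println("generic usable as pair: " + usableAsPair(generic));
    System.out.println("typed usable as pair: " + usableAsPair(toTyped(generic)));
  }
}
```

In the real code path the conversion would be driven by the object mapper and the class reported by the type strategy. The hand-written fromGeneric here only shows why a typed target class is needed.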

Interface change

public interface TypeStrategy<T> extends Comparator<Object>, Hash.Strategy<T>
{
...
  default Class<?> complexDimensionType()
  {
     // To be implemented by any type supporting groupability
     // Throws exception by default
  }
}
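As a rough sketch of how a groupable type might opt in, assuming a simplified stand-in for the interface: SketchTypeStrategy, IpRange, IpRangeStrategy, NonGroupableStrategy, and targetClassOrNull are all hypothetical names, and a plain JDK exception stands in for whatever exception the real default throws.

```java
// Simplified stand-in for TypeStrategy; only the pieces discussed here are modeled.
interface SketchTypeStrategy<T>
{
  default boolean groupable()
  {
    return false;
  }

  default Class<?> complexDimensionType()
  {
    // Assumption: a JDK exception replaces the exception thrown by the real default.
    throw new UnsupportedOperationException("Not implemented. Check groupable() first");
  }
}

// Hypothetical complex type that supports grouping.
final class IpRange
{
  final long start;
  final long end;

  IpRange(long start, long end)
  {
    this.start = start;
    this.end = end;
  }
}

// A groupable type overrides both methods and reports its concrete class.
final class IpRangeStrategy implements SketchTypeStrategy<IpRange>
{
  @Override
  public boolean groupable()
  {
    return true;
  }

  @Override
  public Class<?> complexDimensionType()
  {
    return IpRange.class;
  }
}

// A type that does not support grouping keeps the defaults.
final class NonGroupableStrategy implements SketchTypeStrategy<Object>
{
}

final class ComplexDimensionTypeSketch
{
  // Returns the class to deserialize into, or null when the type is not groupable.
  static Class<?> targetClassOrNull(SketchTypeStrategy<?> strategy)
  {
    return strategy.groupable() ? strategy.complexDimensionType() : null;
  }

  public static void main(String[] args)
  {
    System.out.println(targetClassOrNull(new IpRangeStrategy()));
    System.out.println(targetClassOrNull(new NonGroupableStrategy()));
  }
}
```

Checking groupable() before asking for the class keeps callers away from the throwing default, which mirrors the "Check groupable() first" convention used by the neighboring default methods.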

Benchmarking - I ran the GroupByBenchmark.queryMultiQueryableIndexWithSerde benchmark with a limited number of iterations. The results with and without the change are below; the two scores fall within each other's error bounds.


With changes, + array
Benchmark                                           (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndexWithSerde                -1                       2              4                 all            100000           basic.A        force  avgt    6  183664.081 ± 18777.234  us/op

Without changes + array
Benchmark                                           (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndexWithSerde                -1                       2              4                 all            100000           basic.A        force  avgt    6  184934.791 ± 18924.516  us/op

Release note

(none)

Key changed/added classes in this PR

MyFoo
OurBar
TheirBaz

This PR has:

clintropolis · 2024-06-11T06:04:10Z

processing/src/main/java/org/apache/druid/segment/column/TypeStrategy.java

@@ -216,4 +216,9 @@ default boolean equals(T a, T b)
  {
    throw DruidException.defensive("Not implemented. Check groupable() first");
  }
+
+  default Class<?> complexDimensionType()


nit: maybe getComplexClass or the classic getClazz would be a better name since it probably isn't a different type when used as a dimension (and we have the separate groupable to determine if it can be used as a dimension or not)

clintropolis · 2024-06-11T06:09:18Z

processing/src/main/java/org/apache/druid/segment/column/ObjectStrategyComplexTypeStrategy.java

  }

  public ObjectStrategyComplexTypeStrategy(
      ObjectStrategy<T> objectStrategy,
      TypeSignature<?> signature,
-      @Nullable final Hash.Strategy<T> hashStrategy
+      @Nullable final Hash.Strategy<T> hashStrategy,
+      @Nullable final Class<?> complexDimensionType


should this just use getClazz of ObjectStrategy?

clintropolis · 2024-06-11T06:11:04Z

...essing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java

+      final ObjectMapper jsonMapper,
      final ObjectMapper spillMapper,


should jsonMapper and spillMapper be different? I know spillMapper is usually smile, though aren't results from historicals also smile? regardless, it's probably worth a comment somewhere explaining why we have different mappers

When writing the code, this seemed natural, since we are using spillMapper for the spilling runner, and nothing else. After going through the group by code again, I think we should still be using the jsonMapper.
The input rows which get further deserialized according to the complex types are sourced from:

(1) https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByRowProcessor.java#L90 - The input rows are from the processed subqueries.

(2) https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByMergingQueryRunner.java#L254C54-L254C55 - The input rows are passed to the mergeRunner and have already been read from the segments by the createRunner.

Retested stuff, serde in (2) is not required.

LakshSingla · 2024-06-14T08:34:15Z

Thanks for the review @clintropolis @asdf2014

…e types when reading from spilled files and cached results (#16620) Like #16511, but for keys that have been spilled or cached during the grouping process

init

bd7da78

github-actions bot added the Area - Segment Format and Ser/De label May 29, 2024

tests, pair groupable

fa76d35

github-actions bot added the Area - Querying label May 31, 2024

LakshSingla added 3 commits June 3, 2024 13:24

framework change

1ae4a94

tests

7aa5e7a

update benchmarks

c7d25fc

clintropolis reviewed Jun 11, 2024

LakshSingla added 3 commits June 12, 2024 11:42

comments

66c177d

add javadoc for the jsonMapper

f8ce547

remove extra deserialization

b3a51f5

asdf2014 approved these changes Jun 13, 2024

LakshSingla added 2 commits June 14, 2024 00:55

add special serde for map based result rows

3fc82ae

revert unnecessary change

7d70e53

LakshSingla force-pushed the dim-deserialize-2 branch from 26ef2a0 to 7d70e53 Compare June 13, 2024 19:34

clintropolis approved these changes Jun 14, 2024

Merge branch 'master' into dim-deserialize-2

959c84d

asdf2014 merged commit da1e293 into apache:master Jun 14, 2024
15 checks passed

LakshSingla deleted the dim-deserialize-2 branch June 14, 2024 08:34

LakshSingla mentioned this pull request Jun 17, 2024

Deserialize complex dimensions in group by queries to their respective types when reading from spilled files and cached results #16620

Merged

10 tasks
