Deserialize dimensions in group by queries to their respective types when reading from their serialized format #16511

Merged (11 commits) on Jun 14, 2024

@LakshSingla LakshSingla commented May 29, 2024

Description

Grouping on complex columns allows dimensions to be of complex types. When reading dimensions from their serialized format, we must deserialize them into their respective complex types. Otherwise, Druid deserializes the dimension into a generic Java object, equivalent to objectMapper.readValue(serializedValue, Object.class).

We read dimensions from their serialized form when they arrive over the wire (in native queries).

This wasn't discovered at the time of the original PR because only JSON columns support groupability at the moment, and they are designed to handle the generic Java object that the object mapper deserializes to by default. However, most other complex types that need groupability require deserialization into their own type.
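
To illustrate the failure mode, here is a minimal, self-contained sketch. It uses neither Druid nor Jackson: `readGeneric` is a hypothetical stand-in for `objectMapper.readValue(serializedValue, Object.class)` (where a JSON object comes back as a plain `Map`), `readTyped` stands in for deserializing with the dimension's own class, and `IpRange` is an invented complex type.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical complex type used as a grouping dimension (not a Druid class).
final class IpRange {
    final String start;
    final String end;

    IpRange(String start, String end) {
        this.start = start;
        this.end = end;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof IpRange
            && ((IpRange) o).start.equals(start)
            && ((IpRange) o).end.equals(end);
    }

    @Override
    public int hashCode() {
        return Objects.hash(start, end);
    }
}

final class GenericVsTypedSketch {
    // Stand-in for readValue(serialized, Object.class): a JSON object becomes a plain Map.
    static Object readGeneric(Map<String, Object> wireValue) {
        return new LinkedHashMap<>(wireValue);
    }

    // Stand-in for deserializing with the dimension's own class.
    static IpRange readTyped(Map<String, Object> wireValue) {
        return new IpRange((String) wireValue.get("start"), (String) wireValue.get("end"));
    }

    // Downstream code (comparators, hash strategies) expects the complex type itself.
    static boolean isUsableAsDimension(Object dimension) {
        return dimension instanceof IpRange;
    }

    public static void main(String[] args) {
        Map<String, Object> wireValue = new LinkedHashMap<>();
        wireValue.put("start", "10.0.0.0");
        wireValue.put("end", "10.0.0.255");

        System.out.println(isUsableAsDimension(readGeneric(wireValue))); // false: it is a Map
        System.out.println(isUsableAsDimension(readTyped(wireValue)));   // true: it is an IpRange
    }
}
```

A type like JSON tolerates the `Map` form, which is why the gap went unnoticed; a type whose strategy casts to its own class would not.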

Interface change

public interface TypeStrategy<T> extends Comparator<Object>, Hash.Strategy<T>
{
...
  default Class<?> complexDimensionType()
  {
     // To be implemented by any type supporting groupability
     // Throws exception by default
  }
} 
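
A rough sketch of how a groupable type might implement the new method follows. `SimpleTypeStrategy`, `Money`, `MoneyTypeStrategy`, and `NonGroupableStrategy` are hypothetical stand-ins written for this example; the real `TypeStrategy` has many more methods. The exception type and message are assumptions (the message is borrowed from the neighbouring default `equals`), since the description only says the default throws.

```java
// Simplified stand-in for TypeStrategy, reduced to the two methods relevant here.
interface SimpleTypeStrategy<T> {
    default boolean groupable() {
        return false;
    }

    // Assumed default: types that do not support grouping have no dimension class.
    default Class<?> complexDimensionType() {
        throw new UnsupportedOperationException("Not implemented. Check groupable() first");
    }
}

// Hypothetical complex type.
final class Money {
    final long cents;

    Money(long cents) {
        this.cents = cents;
    }
}

// A groupable type overrides both methods.
final class MoneyTypeStrategy implements SimpleTypeStrategy<Money> {
    @Override
    public boolean groupable() {
        return true;
    }

    @Override
    public Class<?> complexDimensionType() {
        return Money.class;
    }
}

// A type that does not opt in inherits the throwing default.
final class NonGroupableStrategy implements SimpleTypeStrategy<Object> {
}

final class ComplexDimensionTypeSketch {
    public static void main(String[] args) {
        System.out.println(new MoneyTypeStrategy().complexDimensionType().getSimpleName());
        try {
            new NonGroupableStrategy().complexDimensionType();
        } catch (UnsupportedOperationException e) {
            System.out.println("default threw: " + e.getMessage());
        }
    }
}
```

Callers are expected to check `groupable()` before asking for the class, so the throwing default should only surface on a programming error.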

Benchmarking - I ran the GroupByBenchmark.queryMultiQueryableIndexWithSerde benchmark with limited iterations; here are the results before and after the change.


With changes, + array
Benchmark                                           (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndexWithSerde                -1                       2              4                 all            100000           basic.A        force  avgt    6  183664.081 ± 18777.234  us/op

Without changes + array
Benchmark                                           (initialBuckets)  (numProcessingThreads)  (numSegments)  (queryGranularity)  (rowsPerSegment)  (schemaAndQuery)  (vectorize)  Mode  Cnt       Score       Error  Units
GroupByBenchmark.queryMultiQueryableIndexWithSerde                -1                       2              4                 all            100000           basic.A        force  avgt    6  184934.791 ± 18924.516  us/op
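
As a quick sanity check on these two rows, the sketch below computes the relative change in score and checks whether the reported error ranges overlap. The class and method names are invented for this example, and treating Score ± Error as a closed interval is a simplification.

```java
final class BenchmarkDeltaSketch {
    // Relative change from the baseline score to the new score (negative means faster).
    static double relativeDelta(double before, double after) {
        return (after - before) / before;
    }

    // True when [scoreA - errA, scoreA + errA] and [scoreB - errB, scoreB + errB] intersect.
    static boolean intervalsOverlap(double scoreA, double errA, double scoreB, double errB) {
        return scoreA - errA <= scoreB + errB && scoreB - errB <= scoreA + errA;
    }

    public static void main(String[] args) {
        // Values copied from the two benchmark rows (us/op).
        double withoutChanges = 184934.791;
        double withoutChangesErr = 18924.516;
        double withChanges = 183664.081;
        double withChangesErr = 18777.234;

        System.out.printf("relative change: %.2f%%%n",
            100 * relativeDelta(withoutChanges, withChanges));
        System.out.println("intervals overlap: "
            + intervalsOverlap(withChanges, withChangesErr, withoutChanges, withoutChangesErr));
    }
}
```

The difference between the two scores is under 1%, while each reported error is roughly 10% of its score, so these runs show no measurable change in either direction.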

Release note

(none)


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@@ -216,4 +216,9 @@ default boolean equals(T a, T b)
{
throw DruidException.defensive("Not implemented. Check groupable() first");
}

default Class<?> complexDimensionType()
Member

nit: maybe getComplexClass or the classic getClazz would be a better name since it probably isn't a different type when used as a dimension (and we have the separate groupable to determine if it can be used as a dimension or not)

}

public ObjectStrategyComplexTypeStrategy(
ObjectStrategy<T> objectStrategy,
TypeSignature<?> signature,
- @Nullable final Hash.Strategy<T> hashStrategy
+ @Nullable final Hash.Strategy<T> hashStrategy,
+ @Nullable final Class<?> complexDimensionType
Member

should this just use getClazz of ObjectStrategy?

Comment on lines 191 to 192
final ObjectMapper jsonMapper,
final ObjectMapper spillMapper,
Member

should jsonMapper and spillMapper be different? I know spillMapper is usually smile, but aren't results from historicals also smile? regardless, probably worth a comment somewhere explaining why we have different mappers

Contributor Author

When writing the code, this seemed natural, since we are using spillMapper for the spilling runner, and nothing else. After going through the group by code again, I think we should still be using the jsonMapper.
The input rows which get further deserialized according to the complex types are sourced from:

  1. https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByRowProcessor.java#L90 - The input rows are from the processed subqueries.
  2. https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByMergingQueryRunner.java#L254C54-L254C55 - The input rows are passed to the mergeRunner and have already been read from the segments by the createRunner.

Contributor Author

Retested stuff, serde in (2) is not required.

@asdf2014 asdf2014 merged commit da1e293 into apache:master Jun 14, 2024
15 checks passed
@LakshSingla LakshSingla deleted the dim-deserialize-2 branch June 14, 2024 08:34
@LakshSingla
Contributor Author

Thanks for the review @clintropolis @asdf2014

LakshSingla added a commit that referenced this pull request Jul 15, 2024
…e types when reading from spilled files and cached results (#16620)

Like #16511, but for keys that have been spilled or cached during the grouping process