vector group by support for string expressions #11010

Merged

@clintropolis clintropolis commented Mar 18, 2021

Description

Expands on the structures added in #10613 to add support for grouping on string expressions in the vectorized group by engine. The key addition that makes this possible is DictionaryBuildingSingleValueStringGroupByVectorColumnSelector, which is the vectorized group by engine version of DictionaryBuildingStringGroupByColumnSelectorStrategy, and allows the vector group by engine to group on strings which are not dictionary encoded.
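
As a rough illustration of the dictionary-building idea described above (not the actual Druid class), the sketch below assigns an int id to each distinct string the first time it is seen, so that a grouping engine can key on fixed-width ints. The class and method names (`OnTheFlyDictionary`, `addOrGetId`, `lookup`, `encode`) are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not the Druid selector named above.
class OnTheFlyDictionary
{
  private final Map<String, Integer> idLookup = new HashMap<>();
  private final List<String> values = new ArrayList<>();

  // Returns the existing id for a value, or assigns the next id if the value is new.
  int addOrGetId(String value)
  {
    Integer id = idLookup.get(value);
    if (id == null) {
      id = values.size();
      idLookup.put(value, id);
      values.add(value);
    }
    return id;
  }

  // Maps an id back to its string, for example when emitting result rows.
  String lookup(int id)
  {
    return values.get(id);
  }

  // Encodes one vector (batch) of strings as int ids usable as fixed-width grouping keys.
  int[] encode(String[] vector)
  {
    int[] ids = new int[vector.length];
    for (int i = 0; i < vector.length; i++) {
      ids[i] = addOrGetId(vector[i]);
    }
    return ids;
  }
}
```

Repeated values in a batch map to the same id, which is what lets strings without a pre-built dictionary be grouped the same way as dictionary-encoded columns.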

To help showcase this, I added vectorization support for the concat operator (e.g. string1 + 'foo') and the concat function (e.g. concat(string1,'-',string2,'-',long1)).
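
To make "vectorized concat" concrete, here is a minimal sketch of evaluating concat(string1,'-',string2,'-',long1) over a whole batch of rows in one call rather than row by row. `ConcatVectorSketch` and `concatBatch` are hypothetical names, and the sketch assumes the inputs contain no nulls; it does not reflect how the processors in this PR are structured.

```java
// Hypothetical sketch; names are not taken from the Druid codebase.
class ConcatVectorSketch
{
  // Evaluates concat(s1, '-', s2, '-', l1) for the first `size` rows of each input vector.
  // Assumption: inputs contain no nulls; null handling is omitted for brevity.
  static String[] concatBatch(String[] s1, String[] s2, long[] l1, int size)
  {
    String[] out = new String[size];
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < size; i++) {
      sb.setLength(0); // reuse one builder across the batch
      sb.append(s1[i]).append('-').append(s2[i]).append('-').append(l1[i]);
      out[i] = sb.toString();
    }
    return out;
  }
}
```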

It provides a decent performance increase. In the added benchmark queries, forcing vectorization cuts the average time per query by roughly 36% (query 26) and 48% (query 27):

// 26: group by string expr with non-expr agg
"SELECT CONCAT(string2, '-', long2), SUM(double1) FROM foo GROUP BY 1 ORDER BY 2",
// 27: group by string expr with expr agg
"SELECT CONCAT(string2, '-', long2), SUM(long1 * double4) FROM foo GROUP BY 1 ORDER BY 2"

Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt     Score    Error  Units
SqlExpressionBenchmark.querySql       26           5000000        false  avgt    5  1601.424 ± 22.075  ms/op
SqlExpressionBenchmark.querySql       26           5000000        force  avgt    5  1017.797 ± 18.384  ms/op
SqlExpressionBenchmark.querySql       27           5000000        false  avgt    5  2072.850 ± 46.369  ms/op
SqlExpressionBenchmark.querySql       27           5000000        force  avgt    5  1072.897 ± 19.756  ms/op
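
For readers who want the speedup as a ratio, the small sketch below divides the non-vectorized score by the vectorized one using the numbers from the table. `BenchmarkSpeedup` and `speedup` are names made up for this example; the error bars are ignored.

```java
import java.util.Locale;

// Illustrative arithmetic on the scores reported in the table above.
class BenchmarkSpeedup
{
  // Ratio of non-vectorized to vectorized average time; above 1.0 means vectorization is faster.
  static double speedup(double nonVectorizedMs, double vectorizedMs)
  {
    return nonVectorizedMs / vectorizedMs;
  }

  public static void main(String[] args)
  {
    System.out.printf(Locale.ROOT, "query 26: %.2fx%n", speedup(1601.424, 1017.797));
    System.out.printf(Locale.ROOT, "query 27: %.2fx%n", speedup(2072.850, 1072.897));
  }
}
```

This works out to roughly 1.57x for query 26 and 1.93x for query 27.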

I will save vectorizing additional string expressions for a future PR.


Key changed/added classes in this PR
  • DictionaryBuildingSingleValueStringGroupByVectorColumnSelector
  • VectorGroupByEngine
  • GroupByVectorColumnProcessorFactory
  • VectorStringProcessors
  • StringOutMultiStringInVectorProcessor

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@jihoonson jihoonson left a comment

LGTM overall.

@Override
protected String processValue(@Nullable String leftVal, @Nullable String rightVal)
{
  return leftVal + rightVal;

Maybe some comment about why it does not handle nulls unlike the other concat method below?
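
As background for the question above: in plain Java, `+` on strings renders a null operand as the text "null", so whether nulls are filtered before a method like this is called matters. The sketch below contrasts that with a null-propagating variant. `NullConcatSketch` and its methods are hypothetical, and the null-propagating rule is just one possible convention, not a statement of what this PR does.

```java
// Hypothetical sketch contrasting two ways a concat could treat null inputs.
class NullConcatSketch
{
  // Plain Java concatenation: a null operand is rendered as the text "null".
  static String plainConcat(String left, String right)
  {
    return left + right;
  }

  // Null-propagating variant: any null input makes the result null.
  // Assumption: one possible convention, not necessarily the one Druid uses.
  static String nullPropagatingConcat(String left, String right)
  {
    return (left == null || right == null) ? null : left + right;
  }
}
```

With a null left value and "foo" on the right, `plainConcat` returns "nullfoo" while `nullPropagatingConcat` returns null.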

@jihoonson jihoonson left a comment

+1 after CI

@clintropolis

thanks for review @jihoonson 👍

@clintropolis clintropolis merged commit 338886f into apache:master Apr 9, 2021
@clintropolis clintropolis deleted the vectorize-string-expr-group-by branch April 9, 2021 02:23
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021