Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove group-by v1 #14866

Merged
merged 9 commits into from
Aug 23, 2023
Merged

remove group-by v1 #14866

merged 9 commits into from
Aug 23, 2023

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Aug 18, 2023

Description

This PR removes the 'v1' grouping engine as well as the strategy selection abstraction. I found GroupByStrategyV2 to still be a relatively handy thing to keep around since it is a nice container for grabbing the resources and stuff needed and plumbing to the underlying group by v2 engine, so I have transitioned it into a new class GroupingEngine which is now used in place of prior uses of GroupByStrategySelector. Alternatively its functionality could have been distributed among the query runner and toolchest, but it seemed a bit more disruptive.

Release note

The V1 group by query has been removed in favor of always using the V2 engine. Any query context parameters associated with it will now be ignored internally.

AggregatorFactory.getRequiredColumns has been deprecated and will be removed in a future release. If you have an extension that implements AggregatorFactory then this method should be removed from your implementation (there is now a default definition that throws an exception about being deprecated).


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally!! 🎉

I looked through and thought about whether there are other things that could be updated. What are your thoughts on these two?

  • longLast, longFirst, doubleFirst, doubleLast could stop lying about their intermediate types. IIRC they only had to lie about this to make groupBy v1 work.
  • OnheapIncrementalIndex no longer needs to support multithreaded writers, so addToFacts could be simplified.

@clintropolis
Copy link
Member Author

longLast, longFirst, doubleFirst, doubleLast could stop lying about their intermediate types. IIRC they only had to lie about this to make groupBy v1 work.

I actually did this initially, but then noticed that #14462 was also making this change, so reverted for now. That said, it would only be a minor conflict with the other PR, so Im willing to add it back.

OnheapIncrementalIndex no longer needs to support multithreaded writers, so addToFacts could be simplified

Yeah, totally, i actually imagine a lot of other parts of IncrementalIndex could also be improved. Should I do that in this PR or save it for a follow-up? In the very least i can remove the group by v1 comment in addToFacts and add another comment to replace it indicating that the condition should now be impossible and we should work to simplify the code area in the future

@gianm
Copy link
Contributor

gianm commented Aug 21, 2023

I actually did this initially, but then noticed that #14462 was also making this change, so reverted for now. That said, it would only be a minor conflict with the other PR, so Im willing to add it back.

OK, fair enough. We can leave it for #14462.

Yeah, totally, i actually imagine a lot of other parts of IncrementalIndex could also be improved. Should I do that in this PR or save it for a follow-up? In the very least i can remove the group by v1 comment in addToFacts and add another comment to replace it indicating that the condition should now be impossible and we should work to simplify the code area in the future

Up to you how much you want to do in this PR. I'd at least adjust the comment in there that mentions groupBy v1. Other stuff could be done as future work.

Note: Even if cache is enabled, for [groupBy v2](../querying/groupbyquery.md#strategies) queries, segment level cache do not work on Brokers.
See [Differences between v1 and v2](../querying/groupbyquery.md#differences-between-v1-and-v2) and [Query caching](../querying/caching.md) for more information.
Note: Even if cache is enabled, for [groupBy](../querying/groupbyquery.md) queries, segment level cache does not work on Brokers.
See [query caching](../querying/caching.md) for more information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
See [query caching](../querying/caching.md) for more information.
See [Query caching](../querying/caching.md) for more information.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see lots of other links like this.... but why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean, its in the middle of a sentence, why would we uppercase it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, personally I find the suggested style to be better :)

Copy link
Member Author

@clintropolis clintropolis Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess like... I disagree, but we have a lot of other links doing that too.

Like take this example:
Screenshot 2023-08-22 at 9 08 33 PM

Why should the A in "Advance groupBy v2 configurations" and the M in "Memory tuning and resource limits" be uppercase?

Is this a matter of perspective? Like it bothers me because I view these inline links as just some text that is part of a sentence that happen to be clickable. But i guess another perspective might be that this is some proper section title and it should match as closely as possible? I still don't really personally like it, but i changed it because a handful of other links are like this on this and other pages (not all of them though...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, it seems better because upper-casing highlights that part of the text. But yes, in the end, its a matter of perspective. I don't have an objective argument for either or against.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clintropolis In this specific example, Query caching is the title of the page and must be referenced using sentence case.

Copy link
Contributor

@gianm gianm Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO @ektravel's perspective makes sense. If the link text is a title of a page and is referring to the page itself then capitalizing the first letter seems good. It emphasizes that we're referring to a documentation page rather than a concept.

To me this is correct (the text refers to the page being linked to):

See [Query caching](../querying/caching.md) for more information.

And this is also correct (the text refers to a concept, not the page being linked to):

It is important to set up [query caching](../querying/caching.md) properly for best performance.

Btw, the following is also IMO correct, but awkward and I'd avoid it in favor of the first example. Here "query caching" is a common noun acting as an adjective for "documentation"; it's not referring to the page being linked to, so it shouldn't be capitalized:

See the [query caching](../querying/caching.md) documentation for more information.

@@ -440,6 +372,7 @@ Supported query contexts:

|Key|Description|Default|
|---|-----------|-------|
|`groupByIsSingleThreaded`|Overrides the value of `druid.query.groupBy.singleThreaded` for this query.|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|`groupByIsSingleThreaded`|Overrides the value of `druid.query.groupBy.singleThreaded` for this query.|
|`groupByIsSingleThreaded`|Overrides the value of `druid.query.groupBy.singleThreaded` for this query.|None|

The entry on line 375 is missing the default.

Copy link
Contributor

@ektravel ektravel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the docs portion.

@clintropolis clintropolis removed the WIP label Aug 23, 2023
@gianm gianm merged commit 36e659a into apache:master Aug 23, 2023
74 checks passed
@clintropolis clintropolis deleted the remove-group-by-v1 branch August 23, 2023 19:48
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
gianm added a commit to gianm/druid that referenced this pull request Jan 16, 2024
Following apache#14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by apache#15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.
gianm added a commit that referenced this pull request Jan 18, 2024
* IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.
LakshSingla pushed a commit to LakshSingla/druid that referenced this pull request Jan 30, 2024
* IncrementalIndex#add is no longer thread-safe.

Following apache#14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by apache#15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.
abhishekagarwal87 pushed a commit that referenced this pull request Jan 30, 2024
* IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants