remove group-by v1 #14866

clintropolis · 2023-08-18T01:20:07Z

Description

This PR removes the 'v1' grouping engine as well as the strategy selection abstraction. I found GroupByStrategyV2 still handy to keep around, since it is a nice container for grabbing the resources needed and plumbing them to the underlying group by v2 engine, so I have transitioned it into a new class, GroupingEngine, which is now used in place of prior uses of GroupByStrategySelector. Alternatively, its functionality could have been distributed among the query runner and toolchest, but that seemed a bit more disruptive.

Release note

The V1 group by query has been removed in favor of always using the V2 engine. Any query context parameters associated with it will now be ignored internally.

AggregatorFactory.getRequiredColumns has been deprecated and will be removed in a future release. If you have an extension that implements AggregatorFactory then this method should be removed from your implementation (there is now a default definition that throws an exception about being deprecated).
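For extension authors, the deprecation pattern described in the release note might look roughly like the sketch below. Everything here is a hypothetical stand-in: `SketchAggregatorFactory` and `CountSketchFactory` are not Druid classes, and the exception type and message are assumptions, since the note only says the default definition throws an exception about being deprecated.

```java
import java.util.List;

// Hypothetical stand-in for an aggregator factory interface; NOT Druid's actual AggregatorFactory.
interface SketchAggregatorFactory {
    String getName();

    // Deprecated method with a default body, so implementations no longer need to override it.
    // The exception type and message are assumptions made for this sketch.
    @Deprecated
    default List<SketchAggregatorFactory> getRequiredColumns() {
        throw new UnsupportedOperationException("getRequiredColumns is deprecated and should not be called");
    }
}

// An extension-style implementation after cleanup: the getRequiredColumns override has been removed.
final class CountSketchFactory implements SketchAggregatorFactory {
    private final String name;

    CountSketchFactory(String name) {
        this.name = name;
    }

    @Override
    public String getName() {
        return name;
    }
}

final class DeprecatedDefaultDemo {
    // Returns true when calling the inherited default method throws.
    static boolean defaultThrows(SketchAggregatorFactory factory) {
        try {
            factory.getRequiredColumns();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        SketchAggregatorFactory factory = new CountSketchFactory("rows");
        System.out.println(factory.getName() + " default throws: " + defaultThrows(factory));
    }
}
```

The point of the default body is that an implementation which drops its override still compiles, while any remaining caller fails loudly instead of silently receiving stale data.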

gianm

Finally!! 🎉

I looked through and thought about whether there are other things that could be updated. What are your thoughts on these two?

longLast, longFirst, doubleFirst, doubleLast could stop lying about their intermediate types. IIRC they only had to lie about this to make groupBy v1 work.
OnheapIncrementalIndex no longer needs to support multithreaded writers, so addToFacts could be simplified.

clintropolis · 2023-08-18T22:18:39Z

longLast, longFirst, doubleFirst, doubleLast could stop lying about their intermediate types. IIRC they only had to lie about this to make groupBy v1 work.

I actually did this initially, but then noticed that #14462 was also making this change, so I reverted it for now. That said, it would only be a minor conflict with the other PR, so I'm willing to add it back.

OnheapIncrementalIndex no longer needs to support multithreaded writers, so addToFacts could be simplified

Yeah, totally, I actually imagine a lot of other parts of IncrementalIndex could also be improved. Should I do that in this PR or save it for a follow-up? At the very least I can remove the group by v1 comment in addToFacts and replace it with another comment indicating that the condition should now be impossible and that we should work to simplify this code area in the future.

gianm · 2023-08-21T15:33:47Z

I actually did this initially, but then noticed that #14462 was also making this change, so I reverted it for now. That said, it would only be a minor conflict with the other PR, so I'm willing to add it back.

OK, fair enough. We can leave it for #14462.

Yeah, totally, I actually imagine a lot of other parts of IncrementalIndex could also be improved. Should I do that in this PR or save it for a follow-up? At the very least I can remove the group by v1 comment in addToFacts and replace it with another comment indicating that the condition should now be impossible and that we should work to simplify this code area in the future.

Up to you how much you want to do in this PR. I'd at least adjust the comment in there that mentions groupBy v1. Other stuff could be done as future work.

ektravel · 2023-08-21T16:23:31Z

docs/configuration/index.md

- Note: Even if cache is enabled, for [groupBy v2](../querying/groupbyquery.md#strategies) queries, segment level cache do not work on Brokers.
- See [Differences between v1 and v2](../querying/groupbyquery.md#differences-between-v1-and-v2) and [Query caching](../querying/caching.md) for more information.
+ Note: Even if cache is enabled, for [groupBy](../querying/groupbyquery.md) queries, segment level cache does not work on Brokers.
+ See [query caching](../querying/caching.md) for more information.


Suggested change

See [query caching](../querying/caching.md) for more information.

See [Query caching](../querying/caching.md) for more information.

I see lots of other links like this.... but why?

I mean, it's in the middle of a sentence, why would we uppercase it?

FWIW, personally I find the suggested style to be better :)

I guess like... I disagree, but we have a lot of other links doing that too.

Like take this example:

Why should the A in "Advance groupBy v2 configurations" and the M in "Memory tuning and resource limits" be uppercase?

Is this a matter of perspective? It bothers me because I view these inline links as just some text that is part of a sentence and happens to be clickable. But I guess another perspective might be that this is a proper section title and it should match as closely as possible? I still don't really like it personally, but I changed it because a handful of other links are like this on this and other pages (not all of them though...)

For me, it seems better because upper-casing highlights that part of the text. But yes, in the end, it's a matter of perspective. I don't have an objective argument for or against either.

@clintropolis In this specific example, Query caching is the title of the page and must be referenced using sentence case.

IMO @ektravel's perspective makes sense. If the link text is a title of a page and is referring to the page itself then capitalizing the first letter seems good. It emphasizes that we're referring to a documentation page rather than a concept.

To me this is correct (the text refers to the page being linked to):

See [Query caching](../querying/caching.md) for more information.

And this is also correct (the text refers to a concept, not the page being linked to):

It is important to set up [query caching](../querying/caching.md) properly for best performance.

Btw, the following is also IMO correct, but awkward and I'd avoid it in favor of the first example. Here "query caching" is a common noun acting as an adjective for "documentation"; it's not referring to the page being linked to, so it shouldn't be capitalized:

See the [query caching](../querying/caching.md) documentation for more information.

ektravel · 2023-08-21T16:28:27Z

docs/querying/groupbyquery.md

@@ -440,6 +372,7 @@ Supported query contexts:

 |Key|Description|Default|
 |---|-----------|-------|
+|`groupByIsSingleThreaded`|Overrides the value of `druid.query.groupBy.singleThreaded` for this query.|


ektravel

Reviewed the docs portion.

IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way, as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript` in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't new: Strings have been stored in the dimension selectors for some time, but we didn't have a test that checked for that case; we only have this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try to offer a thread-safe "add" method. It also improves performance a bit by adding a row ID supplier to the selectors it uses to read InputRows, meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator, which the similar datasketches versions already have. This is done to help them adhere to the contract of Aggregator: concurrent calls to "aggregate" and "get" must be thread-safe.
2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the druid-benchmarks module.

* Spelling.
* Changes from static analysis.
* Fix javadoc.

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
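The Aggregator contract mentioned above (concurrent calls to "aggregate" and "get" must be thread-safe) can be illustrated with a minimal sketch. It is not the code from that patch: `SketchAggregator` and `SynchronizedDistinctCounter` are hypothetical names, and a plain `HashSet` stands in for the HyperLogLog-style structures the real aggregators use. It only shows what adding synchronization to both methods looks like.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, minimal aggregator contract; Druid's real Aggregator interface has more methods.
interface SketchAggregator {
    void aggregate(long value);

    long get();
}

// Exact distinct counter (a HashSet swapped in for a sketch structure). Both methods lock the
// same monitor, so a "get" can never observe the set in the middle of an "aggregate".
final class SynchronizedDistinctCounter implements SketchAggregator {
    private final Set<Long> seen = new HashSet<>();

    @Override
    public synchronized void aggregate(long value) {
        seen.add(value);
    }

    @Override
    public synchronized long get() {
        return seen.size();
    }
}

final class SynchronizedAggregatorDemo {
    // Starts several threads that each add a disjoint range of values while also reading.
    static long run(int writers, int valuesPerWriter) {
        SketchAggregator aggregator = new SynchronizedDistinctCounter();
        Thread[] threads = new Thread[writers];
        for (int w = 0; w < writers; w++) {
            final long offset = (long) w * valuesPerWriter;
            threads[w] = new Thread(() -> {
                for (int i = 0; i < valuesPerWriter; i++) {
                    aggregator.aggregate(offset + i);
                    aggregator.get(); // read interleaved with concurrent writes
                }
            });
            threads[w].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for writers", e);
            }
        }
        return aggregator.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000));
    }
}
```

Because every writer adds a disjoint range and all access goes through the same lock, the final count equals `writers * valuesPerWriter`. Without the `synchronized` keyword the unsynchronized `HashSet` could lose updates or corrupt its internal state.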

remove group-by v1

6e93044

clintropolis added Area - Querying Release Notes WIP labels Aug 18, 2023

docs

c1f4664

github-actions bot added the Area - Documentation label Aug 18, 2023

clintropolis added 5 commits August 17, 2023 19:56

remove unused configs, fix test

7e5ed4d

fix test

1681540

adjustments

0be4267

why not

caa9f9f

adjust

04fa897

gianm reviewed Aug 18, 2023

ektravel reviewed Aug 21, 2023

clintropolis added 2 commits August 22, 2023 20:56

review stuff

64e8c32

Merge remote-tracking branch 'upstream/master' into remove-group-by-v1

f75f501

clintropolis removed the WIP label Aug 23, 2023

gianm approved these changes Aug 23, 2023

gianm merged commit 36e659a into apache:master Aug 23, 2023
74 checks passed

clintropolis deleted the remove-group-by-v1 branch August 23, 2023 19:48

LakshSingla added this to the 28.0 milestone Oct 12, 2023

LakshSingla mentioned this pull request Nov 4, 2023

[DRAFT] 28.0.0 release notes #15326

Closed

clintropolis mentioned this pull request Nov 29, 2023

simplify IncrementalIndex since group-by v1 has been removed #15448

Merged

2 tasks

clintropolis mentioned this pull request Jan 11, 2024

tidy up group by engines after removal of v1 #15665

Merged

3 tasks

gianm mentioned this pull request Jan 16, 2024

IncrementalIndex#add is no longer thread-safe. #15697

Merged

Suggested change (docs/querying/groupbyquery.md)

|`groupByIsSingleThreaded`|Overrides the value of `druid.query.groupBy.singleThreaded` for this query.|

|`groupByIsSingleThreaded`|Overrides the value of `druid.query.groupBy.singleThreaded` for this query.|None|
