Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache value selectors in RowBasedColumnSelectorFactory. #15615

Merged
merged 3 commits into from
Jan 16, 2024

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Jan 3, 2024

There was already caching for dimension selectors. This patch adds caching for value (object and number) selectors. It's helpful when the same field is read multiple times during processing of a single row (for example, by being an input to both MIN and MAX aggregations).

This patch also updates Rows#objectToNumber to take an expectedType parameter, which is used to avoid parsing numbers twice (when the expected type is unknown, we try both long and double).

There was already caching for dimension selectors. This patch adds caching
for value (object and number) selectors. It's helpful when the same field is
read multiple times during processing of a single row (for example, by being
an input to both MIN and MAX aggregations).
@gianm gianm merged commit d359fb3 into apache:master Jan 16, 2024
83 checks passed
@gianm gianm deleted the rbcsf-cache-values branch January 16, 2024 02:03
gianm added a commit to gianm/druid that referenced this pull request Jan 16, 2024
Following apache#14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by apache#15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.
gianm added a commit that referenced this pull request Jan 18, 2024
* IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024
LakshSingla pushed a commit to LakshSingla/druid that referenced this pull request Jan 30, 2024
* IncrementalIndex#add is no longer thread-safe.

Following apache#14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by apache#15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.
abhishekagarwal87 pushed a commit that referenced this pull request Jan 30, 2024
* IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
gianm added a commit to gianm/druid that referenced this pull request Feb 28, 2024
PR apache#15615 added an optimization to avoid parsing numbers twice in cases
where we know that they should definitely be longs or
definitely be doubles. Rather than try parsing as long first, and then
try parsing as double, it would use only the parsing routine specific to
the requested outputType.

This caused a bug: previously, we would accept decimals like "1.0" or
"1.23" as longs, by truncating them to "1". After that patch, we would
treat such decimals as nulls when the outputType is set to LONG.

This patch retains the short-circuit for doubles: if outputType is
DOUBLE, we only parse the string as a double. But for outputType LONG,
this patch restores the old behavior: try to parse as long first,
then double.
abhishekagarwal87 pushed a commit that referenced this pull request Mar 4, 2024
* Rows.objectToNumber: Accept decimals with output type LONG.

PR #15615 added an optimization to avoid parsing numbers twice in cases
where we know that they should definitely be longs or
definitely be doubles. Rather than try parsing as long first, and then
try parsing as double, it would use only the parsing routine specific to
the requested outputType.

This caused a bug: previously, we would accept decimals like "1.0" or
"1.23" as longs, by truncating them to "1". After that patch, we would
treat such decimals as nulls when the outputType is set to LONG.

This patch retains the short-circuit for doubles: if outputType is
DOUBLE, we only parse the string as a double. But for outputType LONG,
this patch restores the old behavior: try to parse as long first,
then double.
gianm added a commit to gianm/druid that referenced this pull request Mar 6, 2024
…5999)

* Rows.objectToNumber: Accept decimals with output type LONG.

PR apache#15615 added an optimization to avoid parsing numbers twice in cases
where we know that they should definitely be longs or
definitely be doubles. Rather than try parsing as long first, and then
try parsing as double, it would use only the parsing routine specific to
the requested outputType.

This caused a bug: previously, we would accept decimals like "1.0" or
"1.23" as longs, by truncating them to "1". After that patch, we would
treat such decimals as nulls when the outputType is set to LONG.

This patch retains the short-circuit for doubles: if outputType is
DOUBLE, we only parse the string as a double. But for outputType LONG,
this patch restores the old behavior: try to parse as long first,
then double.
abhishekagarwal87 pushed a commit that referenced this pull request Mar 8, 2024
…16062)

* Rows.objectToNumber: Accept decimals with output type LONG.

PR #15615 added an optimization to avoid parsing numbers twice in cases
where we know that they should definitely be longs or
definitely be doubles. Rather than try parsing as long first, and then
try parsing as double, it would use only the parsing routine specific to
the requested outputType.

This caused a bug: previously, we would accept decimals like "1.0" or
"1.23" as longs, by truncating them to "1". After that patch, we would
treat such decimals as nulls when the outputType is set to LONG.

This patch retains the short-circuit for doubles: if outputType is
DOUBLE, we only parse the string as a double. But for outputType LONG,
this patch restores the old behavior: try to parse as long first,
then double.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants