Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IncrementalIndex#add is no longer thread-safe. #15697

Merged
merged 4 commits into from
Jan 18, 2024

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Jan 16, 2024

Following #14866, there is no longer a reason for IncrementalIndex#add to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way, as exposed by #15615 making testMultithreadAddFactsUsingExpressionAndJavaScript in IncrementalIndexIngestionTest flaky. Note that this problem isn't new: Strings have been stored in the dimension selectors for some time, but we didn't have a test that checked for that case; we only have this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try to offer a thread-safe "add" method. It also improves performance a bit by adding a row ID supplier to the selectors it uses to read InputRows, meaning that it can get the benefit of caching values inside the selectors.

This patch also:

  1. Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
    which the similar datasketches versions already have. This is done to
    help them adhere to the contract of Aggregator: concurrent calls to
    "aggregate" and "get" must be thread-safe.

  2. Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
    druid-benchmarks module.

Benchmarks below. I think the speedup is mainly due to the additional caching enabled by the row ID supplier. The benchmark involves two primitive aggregators reading the same input fields, which would now be able to make use of in-selector caching.

patch

Benchmark                                          Mode  Cnt     Score    Error  Units
OnheapIncrementalIndexBenchmark.concurrentAddRead  avgt    5  1805.434 ± 51.638  ms/op

master

Benchmark                                          Mode  Cnt     Score    Error  Units
OnheapIncrementalIndexBenchmark.concurrentAddRead  avgt    5  2049.733 ± 21.237  ms/op

Following apache#14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by apache#15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.
@gianm gianm merged commit 792e5c5 into apache:master Jan 18, 2024
83 checks passed
@gianm gianm deleted the incremental-index-add-not-thread-safe branch January 18, 2024 11:45
@abhishekagarwal87 abhishekagarwal87 added this to the 29.0.0 milestone Jan 24, 2024
LakshSingla pushed a commit to LakshSingla/druid that referenced this pull request Jan 30, 2024
* IncrementalIndex#add is no longer thread-safe.

Following apache#14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by apache#15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.
abhishekagarwal87 pushed a commit that referenced this pull request Jan 30, 2024
* IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants