Cache value selectors in RowBasedColumnSelectorFactory. #15615
Merged
Conversation
There was already caching for dimension selectors. This patch adds caching for value (object and number) selectors. It's helpful when the same field is read multiple times during processing of a single row (for example, by being an input to both MIN and MAX aggregations).
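The caching idea can be sketched as a standalone toy (hypothetical names, not Druid's actual classes): a selector that remembers the value it extracted for the current row ID, so a second read of the same field within one row skips re-extraction.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Minimal sketch of per-row value caching: the selector remembers the value it
// extracted for the current row ID, so repeated reads of the same field within
// one row touch the underlying row only once.
class CachingValueSelector {
  static int extractions = 0; // counts how often the underlying row is actually read

  private final Supplier<Long> rowIdSupplier;   // identifies the row being processed
  private final Map<String, Object> row;        // stands in for the underlying InputRow
  private final String fieldName;

  private long cachedRowId = -1;
  private Object cachedValue;

  CachingValueSelector(Supplier<Long> rowIdSupplier, Map<String, Object> row, String fieldName) {
    this.rowIdSupplier = rowIdSupplier;
    this.row = row;
    this.fieldName = fieldName;
  }

  Object getObject() {
    final long rowId = rowIdSupplier.get();
    if (rowId != cachedRowId) {
      extractions++;               // only re-extract when the row ID has advanced
      cachedValue = row.get(fieldName);
      cachedRowId = rowId;
    }
    return cachedValue;
  }
}
```

With this shape, a MIN and a MAX aggregator both reading the same field for one row trigger a single extraction rather than two.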
Force-pushed from 17ed965 to 5f5b54c
abhishekagarwal87 approved these changes on Jan 15, 2024.
gianm added a commit to gianm/druid that referenced this pull request on Jan 16, 2024:
Following apache#14866, there is no longer a reason for IncrementalIndex#add to be thread-safe. It turns out it already was not using its selectors in a thread-safe way, as exposed by apache#15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript` in `IncrementalIndexIngestionTest` flaky.

Note that this problem isn't new: Strings have been stored in the dimension selectors for some time, but we didn't have a test that checked for that case; we only have this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try to offer a thread-safe "add" method. It also improves performance a bit by adding a row ID supplier to the selectors it uses to read InputRows, meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator, which the similar datasketches versions already have. This is done to help them adhere to the contract of Aggregator: concurrent calls to "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the druid-benchmarks module.
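The Aggregator contract invoked in point 1 can be illustrated with a toy synchronized aggregator (a hypothetical stand-in, not HyperUniquesAggregator or CardinalityAggregator themselves): because "aggregate" and "get" may be called concurrently, both must guard the shared state with the same lock.

```java
// Toy illustration of the Aggregator thread-safety contract: concurrent calls
// to aggregate() and get() must be safe, so both methods synchronize on the
// instance before touching the shared count.
class SynchronizedCountAggregator {
  private long count = 0;

  public synchronized void aggregate() {
    count++;
  }

  public synchronized long get() {
    return count;
  }
}
```

Without the `synchronized` keyword on both methods, a reader calling `get()` during a concurrent `aggregate()` could observe a stale or torn value of `count`.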
gianm added a commit that referenced this pull request on Jan 18, 2024:
* IncrementalIndex#add is no longer thread-safe. (Same commit message as the Jan 16 commit above.) * Spelling. * Changes from static analysis. * Fix javadoc.
LakshSingla pushed a commit to LakshSingla/druid that referenced this pull request on Jan 30, 2024:
* IncrementalIndex#add is no longer thread-safe. (Same commit message as the Jan 18 commit above.)
abhishekagarwal87 pushed a commit that referenced this pull request on Jan 30, 2024:
* IncrementalIndex#add is no longer thread-safe. (Same commit message as the Jan 18 commit above.) Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
gianm added a commit to gianm/druid that referenced this pull request on Feb 28, 2024:
PR apache#15615 added an optimization to avoid parsing numbers twice in cases where we know that they should definitely be longs or definitely be doubles. Rather than try parsing as long first, and then try parsing as double, it would use only the parsing routine specific to the requested outputType.

This caused a bug: previously, we would accept decimals like "1.0" or "1.23" as longs, by truncating them to "1". After that patch, we would treat such decimals as nulls when the outputType is set to LONG.

This patch retains the short-circuit for doubles: if outputType is DOUBLE, we only parse the string as a double. But for outputType LONG, this patch restores the old behavior: try to parse as long first, then double.
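The restored behavior can be sketched as a standalone helper (hypothetical names; this is not Druid's actual `Rows.objectToNumber`): for LONG output, try a long parse first, then fall back to parsing as a double and truncating; for DOUBLE output, short-circuit straight to the double parse.

```java
// Sketch of the restored number-parsing behavior: LONG output tries a long
// parse first, then falls back to a double parse with truncation; DOUBLE
// output short-circuits to the double parse only.
class NumberParseSketch {
  enum OutputType { LONG, DOUBLE }

  static Number parse(String s, OutputType outputType) {
    if (outputType == OutputType.DOUBLE) {
      try {
        return Double.parseDouble(s); // short-circuit: only the double routine
      } catch (NumberFormatException e) {
        return null; // stands in for a null result on unparseable input
      }
    }
    // outputType == LONG: try long first, then double (truncated), as before the regression
    try {
      return Long.parseLong(s);
    } catch (NumberFormatException e) {
      try {
        return (long) Double.parseDouble(s); // "1.23" truncates to 1
      } catch (NumberFormatException e2) {
        return null;
      }
    }
  }
}
```

Under this scheme, `"1.0"` and `"1.23"` parse to `1` when the output type is LONG, instead of becoming null as they did after the regression.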
abhishekagarwal87 pushed a commit that referenced this pull request on Mar 4, 2024:
* Rows.objectToNumber: Accept decimals with output type LONG. (Same commit message as the Feb 28 commit above.)
gianm added a commit to gianm/druid that referenced this pull request on Mar 6, 2024:
…5999) * Rows.objectToNumber: Accept decimals with output type LONG. (Same commit message as the Feb 28 commit above.)
abhishekagarwal87 pushed a commit that referenced this pull request on Mar 8, 2024:
…16062) * Rows.objectToNumber: Accept decimals with output type LONG. (Same commit message as the Feb 28 commit above.)
This patch also updates `Rows#objectToNumber` to take an `expectedType` parameter, which is used to avoid parsing numbers twice (when the expected type is unknown, we try both long and double).