Add optimized global aggregates using doc-values #10048

mfussenegger · 2020-06-04T13:15:29Z

Summary of the changes / Why this improves CrateDB

Q: select avg("adRevenue") from uservisits
C: 1
| Version |         Mean ±    Stdev |        Min |     Median |         Q3 |        Max |
|   V1    |       80.198 ±   32.376 |     46.869 |     78.789 |     80.146 |    789.931 |
|   V2    |       27.888 ±   20.782 |     23.429 |     25.451 |     26.543 |    480.902 |
├---------┴-------------------------┴------------┴------------┴------------┴------------┘
|               -  96.79%                           - 102.34%
There is a 100.00% probability that the observed difference is not random, and the best estimate of that difference is 96.79%
The test has statistical significance

System/JVM Metrics (durations in ms, byte-values in MB)
    |    YOUNG GC            |       OLD GC           |      HEAP         |     ALLOC
    |  cnt      avg      max |  cnt      avg      max |  initial     used |     rate      total
 V1 |    8    23.13    26.17 |    0     0.00     0.00 |     8590     2319 |   777.04      31413
 V2 |    0     0.00     0.00 |    0     0.00     0.00 |     8590        0 |    14.71        198

V1 top allocation frames
  Float.valueOf(float):30818982785
  CompositeBatchIterator$AsyncCompositeBI.lambda$loadNextBatch$0(int, int):287010080
  169079530.get$Lambda(...):78850440
  291374498.get$Lambda(...):57452096
  MatchAllDocsQuery.createWeight(...):26222072
V2 top allocation frames
  CompositeBatchIterator$AsyncCompositeBI.lambda$loadNextBatch$0(int, int):46133248
  ParserATNSimulator.getEpsilonTarget(...):12582912
  IndexInput.getFullSliceDescription(String):10487808
  Weight.scorerSupplier(...):1047961

Checklist

Added an entry in CHANGES.txt for user facing changes
Updated documentation & sql_features table for user facing changes
Touched code is covered by tests
CLA is signed
This does not contain breaking changes, or if it does:
- It is released within a major release
- It is recorded in CHANGES.txt
- It was marked as deprecated in an earlier release if possible
- You've thought about the consequences and other components are adapted
  (E.g. AdminUI)

mfussenegger · 2020-06-19T09:11:33Z

server/src/main/java/io/crate/execution/engine/aggregation/DocValueAggregator.java

+
+public abstract class DocValueAggregator<T> {
+
+    public abstract T initialState();


Once we use DocValueAggregator for grouping we should probably extend this for memory accounting. Not sure if already necessary for the global aggregate case, since there is only a single row.

Ram accounting would also be required for the collect_set aggregation afaik.

Right, I'll adapt it once I optimize that function (or any other where we allocate more than a couple of bytes)

seut

🚀 Awesome, just left some minor comments

seut · 2020-06-19T09:52:31Z

server/src/main/java/io/crate/execution/engine/aggregation/DocValueAggregator.java

+
+import org.apache.lucene.index.LeafReader;
+
+public abstract class DocValueAggregator<T> {


any reason for an abstract class instead of an interface here?

No not really. I remember reading that other projects saw performance improvements using abstract classes over interfaces, but comparing two variants it looks to be within the noise.

Given the current use I don't see a good rationale to prefer one over the other, do you?

Q: select avg("adRevenue") from uservisits C: 1 | Version | Mean ± Stdev | Min | Median | Q3 | Max | | V1 | 29.886 ± 31.273 | 23.795 | 25.733 | 29.571 | 716.034 | | V2 | 30.842 ± 28.182 | 24.007 | 27.005 | 32.376 | 644.652 | ├---------┴-------------------------┴------------┴------------┴------------┴------------┘ | + 3.15% + 4.83% There is a 38.82% probability that the observed difference is not random, and the best estimate of that difference is 3.15% The test has no statistical significance System/JVM Metrics (durations in ms, byte-values in MB) | YOUNG GC | OLD GC | HEAP | ALLOC | cnt avg max | cnt avg max | initial used | rate total V1 | 0 0.00 0.00 | 0 0.00 0.00 | 8590 0 | 14.29 201 V2 | 1 31.33 31.33 | 0 0.00 0.00 | 8590 2288 | 15.94 249

V1: 4.2.0-5dc2bf16ea7b60b70a4decdb1699fab7d239062e V2: 4.2.0-96ccb18013a00b077ed72764750d99bd6b9c14d4 Q: select avg("adRevenue") from uservisits C: 1 | Version | Mean ± Stdev | Min | Median | Q3 | Max | | V1 | 30.697 ± 30.922 | 25.107 | 27.639 | 29.414 | 711.979 | | V2 | 30.527 ± 26.952 | 24.135 | 27.340 | 29.316 | 620.206 | ├---------┴-------------------------┴------------┴------------┴------------┴------------┘ | - 0.55% - 1.09% There is a 7.37% probability that the observed difference is not random, and the best estimate of that difference is 0.55% The test has no statistical significance System/JVM Metrics (durations in ms, byte-values in MB) | YOUNG GC | OLD GC | HEAP | ALLOC | cnt avg max | cnt avg max | initial used | rate total V1 | 0 0.00 0.00 | 0 0.00 0.00 | 8590 0 | 12.99 201 V2 | 0 0.00 0.00 | 0 0.00 0.00 | 8590 0 | 12.89 199

Given the current use I don't see a good rationale to prefer one over the other, do you?

Only for the multi-inheritance support of interfaces. If there is no need for an abstract class (e.g. ctor definition) I prefer interfaces, but no strong opinion here.

By coincidence found one of the references mentioning the improved performance: https://netty.io/wiki/new-and-noteworthy-in-4.0.html#bytebuf-is-not-an-interface-but-an-abstract-class

seut · 2020-06-19T09:53:46Z

server/src/main/java/io/crate/execution/engine/aggregation/DocValueAggregator.java

+
+public abstract class DocValueAggregator<T> {
+
+    public abstract T initialState();


Ram accounting would also be required for the collect_set aggregation afaik.

Q: select avg("adRevenue") from uservisits C: 1 | Version | Mean ± Stdev | Min | Median | Q3 | Max | | V1 | 80.198 ± 32.376 | 46.869 | 78.789 | 80.146 | 789.931 | | V2 | 27.888 ± 20.782 | 23.429 | 25.451 | 26.543 | 480.902 | ├---------┴-------------------------┴------------┴------------┴------------┴------------┘ | - 96.79% - 102.34% There is a 100.00% probability that the observed difference is not random, and the best estimate of that difference is 96.79% The test has statistical significance System/JVM Metrics (durations in ms, byte-values in MB) | YOUNG GC | OLD GC | HEAP | ALLOC | cnt avg max | cnt avg max | initial used | rate total V1 | 8 23.13 26.17 | 0 0.00 0.00 | 8590 2319 | 777.04 31413 V2 | 0 0.00 0.00 | 0 0.00 0.00 | 8590 0 | 14.71 198 V1 top allocation frames Float.valueOf(float):30818982785 CompositeBatchIterator$AsyncCompositeBI.lambda$loadNextBatch$0(int, int):287010080 169079530.get$Lambda(...):78850440 291374498.get$Lambda(...):57452096 MatchAllDocsQuery.createWeight(...):26222072 V2 top allocation frames CompositeBatchIterator$AsyncCompositeBI.lambda$loadNextBatch$0(int, int):46133248 ParserATNSimulator.getEpsilonTarget(...):12582912 IndexInput.getFullSliceDescription(String):10487808 Weight.scorerSupplier(...):1047961

It improves performance of the global standart deviation aggregations. See #10048 for more information about the appoach.

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

It improves the performance of the global standard deviation aggregations. See #10048 for more information

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

It improves the performance of the global standard deviation aggregations. See #10048 for more information

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

It improves the performance of the global standard deviation aggregations. See #10048 for more information

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

It improves the performance of the global standard deviation aggregations. See #10048 for more information

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

This was referenced Jun 9, 2020

DRAFT: Vectorized execution engine #10063

Open

Add block based aggregation processing #10065

Closed

mfussenegger force-pushed the j/fast-sum branch 5 times, most recently from 40b724b to 7bc3380 Compare June 19, 2020 09:01

mfussenegger marked this pull request as ready for review June 19, 2020 09:01

mfussenegger requested review from seut and kovrus June 19, 2020 09:01

mfussenegger commented Jun 19, 2020

View reviewed changes

seut approved these changes Jun 19, 2020

View reviewed changes

mfussenegger force-pushed the j/fast-sum branch from 7bc3380 to 5dc2bf1 Compare June 19, 2020 12:43

mfussenegger force-pushed the j/fast-sum branch from 5dc2bf1 to f617355 Compare June 19, 2020 13:10

mfussenegger added the ready-to-merge Let Mergify merge the PR once approved and checks pass label Jun 19, 2020

mergify bot merged commit 12fcce5 into master Jun 19, 2020

mergify bot deleted the j/fast-sum branch June 19, 2020 14:28

kovrus mentioned this pull request Aug 6, 2020

Fix min, max, avg, sum and variance global aggregations for float and double types #10293

Merged

5 tasks

kovrus added a commit that referenced this pull request Aug 11, 2020

Add doc-values aggregator for std deviation aggregation.

724068e

It improves performance of the global standart deviation aggregations. See #10048 for more information about the appoach.

kovrus mentioned this pull request Aug 11, 2020

Add doc-values aggregator for std deviation aggregation. #10347

Merged

5 tasks

kovrus added a commit that referenced this pull request Aug 11, 2020

Add doc-values aggregator for std deviation aggregation.

10f4e24

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

kovrus added a commit that referenced this pull request Aug 12, 2020

Add doc-values aggregator for std deviation aggregation.

d3dc6df

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

kovrus added a commit that referenced this pull request Aug 12, 2020

Add doc-values aggregator for the geometric mean aggregation.

dd2d576

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

chaudum pushed a commit that referenced this pull request Aug 12, 2020

Add doc-values aggregator for std deviation aggregation.

6d00cf8

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

kovrus added a commit that referenced this pull request Aug 12, 2020

Add doc-values aggregator for std deviation aggregation.

ee2a2d0

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

kovrus added a commit that referenced this pull request Aug 12, 2020

Add doc-values aggregator for the geometric mean aggregation.

c746ce0

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

kovrus mentioned this pull request Aug 12, 2020

Add doc-values aggregator for the geometric mean aggregation. #10358

Merged

5 tasks

kovrus added a commit that referenced this pull request Aug 13, 2020

Add doc-values aggregator for std deviation aggregation.

9669887

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

kovrus added a commit that referenced this pull request Aug 13, 2020

Add doc-values aggregator for the count(v) aggregation.

24ad161

It improves the performance of the global standard deviation aggregations. See #10048 for more information

kovrus mentioned this pull request Aug 13, 2020

Add doc-values aggregator for the count(v) aggregation. #10369

Merged

5 tasks

kovrus added a commit that referenced this pull request Aug 13, 2020

Add doc-values aggregator for the geometric mean aggregation.

a704194

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

mergify bot pushed a commit that referenced this pull request Aug 13, 2020

Add doc-values aggregator for std deviation aggregation.

c397854

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

kovrus added a commit that referenced this pull request Aug 13, 2020

Add doc-values aggregator for the count(v) aggregation.

51a3863

It improves the performance of the global standard deviation aggregations. See #10048 for more information

chaudum pushed a commit that referenced this pull request Aug 13, 2020

Add doc-values aggregator for the geometric mean aggregation.

2622d3b

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

kovrus added a commit that referenced this pull request Aug 13, 2020

Add doc-values aggregator for the count(v) aggregation.

d95d300

It improves the performance of the global standard deviation aggregations. See #10048 for more information

mergify bot pushed a commit that referenced this pull request Aug 13, 2020

Add doc-values aggregator for the geometric mean aggregation.

b0c819f

It improves the performance of the global standard deviation aggregations. See #10048 for more information about the approach.

mergify bot pushed a commit that referenced this pull request Aug 13, 2020

Add doc-values aggregator for the count(v) aggregation.

df71e7a

It improves the performance of the global standard deviation aggregations. See #10048 for more information

kovrus added a commit that referenced this pull request Aug 14, 2020

Add doc-values aggregators for the array aggregation.

d50205c

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 14, 2020

Add doc-values aggregators for the collect set aggregation.

a9f5b88

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 17, 2020

Add doc-values aggregators for the array aggregation.

ce93beb

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 17, 2020

Add doc-values aggregators for the collect set aggregation.

135d449

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 17, 2020

Add doc-values aggregators for the array aggregation.

0b4f677

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 17, 2020

Add doc-values aggregators for the collect set aggregation.

6c4e36c

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 17, 2020

Add doc-values aggregators for the arbitrary aggregation.

dda0b3d

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus mentioned this pull request Aug 17, 2020

Add doc-values aggregators for the arbitrary aggregation. #10384

Merged

5 tasks

kovrus added a commit that referenced this pull request Aug 17, 2020

Add doc-values aggregators for the arbitrary aggregation.

99a3544

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 17, 2020

Add doc-values aggregators for the arbitrary aggregation.

599f328

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 17, 2020

Add doc-values aggregators for the arbitrary aggregation.

2495071

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

chaudum pushed a commit that referenced this pull request Aug 18, 2020

Add doc-values aggregators for the arbitrary aggregation.

9acf7dc

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

mergify bot pushed a commit that referenced this pull request Aug 18, 2020

Add doc-values aggregators for the arbitrary aggregation.

8895420

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 19, 2020

Add doc-values aggregators for the array aggregation.

dbc6849

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 19, 2020

Add doc-values aggregators for the collect set aggregation.

7a006c0

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 20, 2020

Add doc-values aggregators for the array aggregation.

443ecf5

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 20, 2020

Add doc-values aggregators for the collect set aggregation.

2ad5ad8

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 20, 2020

Add doc-values aggregators for the array aggregation.

2295016

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

kovrus added a commit that referenced this pull request Aug 20, 2020

Add doc-values aggregators for the collect set aggregation.

791fada

It improves the performance of the global standard deviation aggregations. See #10048 for more information.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add optimized global aggregates using doc-values #10048

Add optimized global aggregates using doc-values #10048

mfussenegger commented Jun 4, 2020

mfussenegger Jun 19, 2020

seut Jun 19, 2020

mfussenegger Jun 19, 2020

seut left a comment

seut Jun 19, 2020

mfussenegger Jun 19, 2020

seut Jun 19, 2020

mfussenegger Jun 19, 2020

mfussenegger Jun 19, 2020

seut Jun 19, 2020


		public abstract class DocValueAggregator<T> {

		public abstract T initialState();


		import org.apache.lucene.index.LeafReader;

		public abstract class DocValueAggregator<T> {

Add optimized global aggregates using doc-values #10048

Add optimized global aggregates using doc-values #10048

Conversation

mfussenegger commented Jun 4, 2020

Summary of the changes / Why this improves CrateDB

Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seut left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment