Integer Tuple Sketch support #10427

andimiller · 2023-03-15T17:20:18Z

This adds support for `BYTES` columns containing Tuple Sketches with Integer as the summary type.

The added classes currently support `Sum` as the semigroup, but are generic so others can be added.

Feature breakdown:

1. Add transform functions that can be used to create Integer Tuple Sketches during ingestion, e.g. `toIntegerSumTupleSketch(colA, colB, 16)`
2. Add Codecs that use the Datasketches serialization
3. Add aggregation functions:
   * `DISTINCT_COUNT_TUPLE_SKETCH` will just get the estimate for the number of unique keys, same as Theta or HLL
   * `DISTINCT_COUNT_RAW_INTEGER_SUM_TUPLE_SKETCH` will merge the sketches using `Sum` as the semigroup and return the raw sketch
   * `SUM_VALUES_INTEGER_SUM_TUPLE_SKETCH` will merge the sketches using `Sum` as the semigroup and estimate the sum of the value side
   * `AVG_VALUES_INTEGER_SUM_TUPLE_SKETCH` will merge the sketches using `Sum` as the semigroup and estimate the average of the value side
4. Add `ValueAggregator<_, _>`s for use in `StarTree` indexes for all 4 above aggregations
5. Add `ValueAggregator`s for use in rollups for all 4 above aggregations

andimiller · 2023-03-15T17:20:47Z

I could do with some advice on the best place to add tests for the aggregation functions. I've been looking through the existing tests and can't find anywhere suitable.

codecov-commenter · 2023-03-15T18:19:51Z

Codecov Report

Merging #10427 (1b7fe74) into master (00d3133) will decrease coverage by 56.66%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master   #10427       +/-   ##
=============================================
- Coverage     70.30%   13.64%   -56.66%     
+ Complexity     6494      439     -6055     
=============================================
  Files          2158     2110       -48     
  Lines        116070   113782     -2288     
  Branches      17566    17294      -272     
=============================================
- Hits          81608    15531    -66077     
- Misses        28778    96979    +68201     
+ Partials       5684     1272     -4412

Flag	Coverage Δ
integration1	`?`
integration2	`?`
unittests1	`?`
unittests2	`13.64% <0.00%> (-0.04%)`	⬇️

Impacted Files	Coverage Δ
...org/apache/pinot/core/common/ObjectSerDeUtils.java	`0.00% <0.00%> (-90.83%)`	⬇️
...he/pinot/core/function/scalar/SketchFunctions.java	`0.00% <0.00%> (-100.00%)`	⬇️
...gregation/function/AggregationFunctionFactory.java	`0.00% <0.00%> (-82.66%)`	⬇️
...AvgValueIntegerTupleSketchAggregationFunction.java	`0.00% <0.00%> (ø)`
...nctCountIntegerTupleSketchAggregationFunction.java	`0.00% <0.00%> (ø)`
...unction/IntegerTupleSketchAggregationFunction.java	`0.00% <0.00%> (ø)`
...umValuesIntegerTupleSketchAggregationFunction.java	`0.00% <0.00%> (ø)`
...ssing/aggregator/IntegerTupleSketchAggregator.java	`0.00% <0.00%> (ø)`
.../processing/aggregator/ValueAggregatorFactory.java	`0.00% <0.00%> (-42.86%)`	⬇️
.../aggregator/IntegerTupleSketchValueAggregator.java	`0.00% <0.00%> (ø)`
... and 4 more

... and 1699 files with indirect coverage changes

pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java

swaminathanmanish · 2023-03-16T16:36:25Z

...c/main/java/org/apache/pinot/segment/local/aggregator/IntegerTupleSketchValueAggregator.java

+
+  @Override
+  public byte[] serializeAggregatedValue(Sketch<IntegerSummary> value) {
+    return CustomSerDeUtils.DATA_SKETCH_INT_TUPLE_SER_DE.serialize(value);


Just curious to know if there's a reason why we have 2 ser/deser utilities (CustomSerDeUtils, ObjectSerDeUtils)? @Jackie-Jiang

...nt-local/src/main/java/org/apache/pinot/segment/local/aggregator/ValueAggregatorFactory.java

davecromberge

Looks good @andimiller!

davecromberge · 2023-03-16T23:04:14Z

pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java

@@ -213,6 +216,8 @@ public static ObjectType getObjectType(Object value) {
        return ObjectType.VarianceTuple;
      } else if (value instanceof PinotFourthMoment) {
        return ObjectType.PinotFourthMoment;
+      } else if (value instanceof org.apache.datasketches.tuple.Sketch) {
+        return ObjectType.IntegerTupleSketch;


Is this a safe assumption? Is it also necessary to inspect the summary type to verify integer?

Right now it is, but to add other types of Tuple Sketch we'd need to add wrapper types, due to JVM type erasure.
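To illustrate the type-erasure point in this thread, here is a minimal sketch. It uses hypothetical stand-in classes (`Sketch`, `IntegerTupleSketch`, `DoubleTupleSketch`), not the real DataSketches or Pinot types, and only shows why an `instanceof` check cannot see the summary type unless each variant gets its own wrapper class.

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
  // Hypothetical stand-in for a generic tuple sketch; not the DataSketches class.
  static class Sketch<S> {
    final List<S> summaries = new ArrayList<>();
  }

  // Hypothetical wrapper types: one concrete class per summary type.
  static final class IntegerTupleSketch {
    final Sketch<Integer> sketch;

    IntegerTupleSketch(Sketch<Integer> sketch) {
      this.sketch = sketch;
    }
  }

  static final class DoubleTupleSketch {
    final Sketch<Double> sketch;

    DoubleTupleSketch(Sketch<Double> sketch) {
      this.sketch = sketch;
    }
  }

  // The type argument is erased at runtime, so "value instanceof Sketch<Integer>"
  // does not compile for an Object operand. Every Sketch lands in the same branch.
  static String objectTypeOfRaw(Object value) {
    if (value instanceof Sketch) {
      return "TupleSketch(summary type unknown)";
    }
    return "Unknown";
  }

  // With wrapper classes, instanceof can tell the variants apart.
  static String objectTypeOfWrapped(Object value) {
    if (value instanceof IntegerTupleSketch) {
      return "IntegerTupleSketch";
    } else if (value instanceof DoubleTupleSketch) {
      return "DoubleTupleSketch";
    }
    return "Unknown";
  }

  public static void main(String[] args) {
    Sketch<Integer> ints = new Sketch<>();
    Sketch<Double> doubles = new Sketch<>();
    // Both variables share one runtime class.
    System.out.println(ints.getClass() == doubles.getClass());
    System.out.println(objectTypeOfRaw(doubles));
    System.out.println(objectTypeOfWrapped(new DoubleTupleSketch(doubles)));
  }
}
```

The wrapper approach costs one small class per summary type, which is the trade-off the reply above alludes to.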

pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java

pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java

...c/main/java/org/apache/pinot/core/query/aggregation/function/AggregationFunctionFactory.java

davecromberge · 2023-03-16T23:26:53Z

...he/pinot/core/query/aggregation/function/SumValuesIntegerTupleSketchAggregationFunction.java

+    }
+    double estimate = retainedTotal / union.getResult().getRetainedEntries() * union.getResult().getEstimate();
+    return Double.valueOf(estimate).longValue();
+  }


Does the serde always deserialise bytes to a compact sketch? It could be better to use the base Sketch abstraction for cases where the sketches have been created outside the system and not compacted.

I can give that a go. I swapped it to all compact because I was having issues with the non-thread-safe nature of Sketch.

I had the same question as well :).
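For readers following the quoted `estimate` expression in this thread, the arithmetic can be sketched on plain numbers: the mean of the retained summaries is scaled up by the estimated number of distinct keys. The method and parameter names below are hypothetical, and the declared type of `retainedTotal` is not visible in the quoted diff, so the second method is only a reminder of what would happen if it were an integral type.

```java
public class SumEstimateDemo {
  // Scales the total of the retained summaries up to the estimated key population.
  static long estimateSum(double retainedTotal, int retainedEntries, double distinctEstimate) {
    if (retainedEntries == 0) {
      return 0L; // nothing retained, nothing to scale
    }
    double estimate = retainedTotal / retainedEntries * distinctEstimate;
    return (long) estimate;
  }

  // Same formula with an integral total: the division truncates before scaling.
  static long estimateSumWithLongTotal(long retainedTotal, int retainedEntries, double distinctEstimate) {
    if (retainedEntries == 0) {
      return 0L;
    }
    double estimate = retainedTotal / retainedEntries * distinctEstimate;
    return (long) estimate;
  }

  public static void main(String[] args) {
    // 4096 retained entries whose summaries total 12289, with an estimated 1,000,000 distinct keys.
    System.out.println(estimateSum(12289, 4096, 1_000_000.0));
    System.out.println(estimateSumWithLongTotal(12289L, 4096, 1_000_000.0));
  }
}
```

The two calls differ only in the type of the total, which is enough to change the result once the per-key mean is not a whole number.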

...he/pinot/core/query/aggregation/function/SumValuesIntegerTupleSketchAggregationFunction.java

...g/apache/pinot/core/query/aggregation/function/AvgIntegerTupleSketchAggregationFunction.java

.../org/apache/pinot/core/query/aggregation/function/IntegerTupleSketchAggregationFunction.java

...nt-local/src/main/java/org/apache/pinot/segment/local/aggregator/ValueAggregatorFactory.java

swaminathanmanish · 2023-03-16T17:01:23Z

pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java

@@ -96,6 +96,9 @@ public static class Helix {
    // https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html
    public static final int DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES = 65536;

+
+    public static final int DEFAULT_TUPLE_SKETCH_LGK = 16;


Any references that can help explain this value?

I'll add a comment. It's the same size as the Theta default above, expressed as log base 2 (2^16 = 65536).
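The relationship between the two constants can be checked with a few lines. This is a small sketch with hypothetical helper names; only the two constant values come from the diff above.

```java
public class LgKDemo {
  static final int DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES = 65536;
  static final int DEFAULT_TUPLE_SKETCH_LGK = 16;

  // Nominal entries for a given log2 size.
  static int nominalEntriesFromLgK(int lgK) {
    return 1 << lgK;
  }

  // Inverse direction; only defined for positive powers of two.
  static int lgKFromNominalEntries(int nominalEntries) {
    if (nominalEntries <= 0 || Integer.bitCount(nominalEntries) != 1) {
      throw new IllegalArgumentException("Not a power of two: " + nominalEntries);
    }
    return Integer.numberOfTrailingZeros(nominalEntries);
  }

  public static void main(String[] args) {
    System.out.println(nominalEntriesFromLgK(DEFAULT_TUPLE_SKETCH_LGK));
    System.out.println(lgKFromNominalEntries(DEFAULT_THETA_SKETCH_NOMINAL_ENTRIES));
  }
}
```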

swaminathanmanish · 2023-03-16T20:41:14Z

pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java

+        is.update((String) key, value);
+      } else if (key instanceof byte[]) {
+        is.update((byte[]) key, value);
+      }


In case you want to validate/catch invalid types, consider throwing an IllegalStateException or IllegalArgumentException?

Done. Added it for Theta too and expanded the tests to cover it.
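A sketch of the pattern discussed here, under stated assumptions: `RecordingSketch` is a hypothetical stand-in for an updatable sketch, and the `Integer` and `Long` branches are guesses at what precedes the `String` and `byte[]` branches in the quoted diff. The point is only the final `else`, which rejects unsupported key types instead of silently ignoring them.

```java
public class KeyDispatchDemo {
  // Hypothetical stand-in for an updatable sketch; it records which overload was hit.
  static class RecordingSketch {
    String lastCall = "";

    void update(long key, int value) {
      lastCall = "long";
    }

    void update(String key, int value) {
      lastCall = "String";
    }

    void update(byte[] key, int value) {
      lastCall = "byte[]";
    }
  }

  static void updateWithKey(RecordingSketch is, Object key, int value) {
    if (key instanceof Integer) {
      is.update(((Integer) key).longValue(), value);
    } else if (key instanceof Long) {
      is.update(((Long) key).longValue(), value);
    } else if (key instanceof String) {
      is.update((String) key, value);
    } else if (key instanceof byte[]) {
      is.update((byte[]) key, value);
    } else {
      // Fail loudly rather than dropping the row.
      String typeName = key == null ? "null" : key.getClass().getName();
      throw new IllegalArgumentException("Unsupported key type for sketch update: " + typeName);
    }
  }

  public static void main(String[] args) {
    RecordingSketch sketch = new RecordingSketch();
    updateWithKey(sketch, "user-1", 5);
    System.out.println(sketch.lastCall);
    try {
      updateWithKey(sketch, 1.5, 5);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```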

swaminathanmanish · 2023-03-16T20:54:10Z

...c/main/java/org/apache/pinot/segment/local/aggregator/IntegerTupleSketchValueAggregator.java

+import org.apache.pinot.spi.data.FieldSpec.DataType;
+
+
+public class IntegerTupleSketchValueAggregator implements ValueAggregator<byte[], Sketch<IntegerSummary>> {


Can the raw type (R) be Sketch instead of byte[] here? Looking at the other sketch implementation (DistinctCountThetaSketchValueAggregator), which has Object as the raw type, I just wanted to check.

It can be, but Sketch isn't thread-safe, and I swapped this to byte[] while hunting down some thread-safety issues. I will see if I can swap it back now that I've made all the Union use thread-safe.

It may need to be Object. This was a good catch; doing more local testing.

I have tested it more locally. It is fine being byte[] because we only handle aggregated sketches.

swaminathanmanish · 2023-03-16T22:23:40Z

...c/main/java/org/apache/pinot/core/query/aggregation/function/AggregationFunctionFactory.java

@@ -299,13 +299,21 @@ public static AggregationFunction getAggregationFunction(FunctionContext functio
            return new FourthMomentAggregationFunction(firstArgument, FourthMomentAggregationFunction.Type.KURTOSIS);
          case FOURTHMOMENT:
            return new FourthMomentAggregationFunction(firstArgument, FourthMomentAggregationFunction.Type.MOMENT);
+          case DISTINCTCOUNTTUPLESKETCH:
+            // mode actually doesn't matter here because we only care about keys, not values
+            return new DistinctCountIntegerTupleSketchAggregationFunction(arguments, IntegerSummary.Mode.Sum);


Is there a reason why we pass IntegerSummary.Mode.Sum as a parameter? We are already differentiating based on the aggregation implementations: IntegerTupleSketchAggregationFunction vs AvgIntegerTupleSketchAggregationFunction vs SumValuesIntegerTupleSketchAggregationFunction.

That is the mode for IntegerSummary merging; all of these use Sum.

OK, so there can be functions that use other summary modes (min, max, ...) in the future.

.../org/apache/pinot/core/query/aggregation/function/IntegerTupleSketchAggregationFunction.java

swaminathanmanish · 2023-03-17T00:04:45Z

...he/pinot/core/query/aggregation/function/SumValuesIntegerTupleSketchAggregationFunction.java

+import org.apache.pinot.segment.spi.AggregationFunctionType;
+
+
+public class SumValuesIntegerTupleSketchAggregationFunction extends IntegerTupleSketchAggregationFunction {


Would composition + delegation make the APIs for Sum, Avg, and Distinct clearer than inheritance? That way we know exactly when and how IntegerTupleSketchAggregationFunction is used, and it would decouple the Integer API from the rest.

I've followed the way it was implemented for Theta, using the simplest one as the base and inheriting from it.

Yes, it makes sense to keep them consistent.

...che/pinot/core/query/aggregation/function/AvgValueIntegerTupleSketchAggregationFunction.java

.../org/apache/pinot/core/query/aggregation/function/IntegerTupleSketchAggregationFunction.java

swaminathanmanish · 2023-03-20T14:40:13Z

.../org/apache/pinot/core/query/aggregation/function/IntegerTupleSketchAggregationFunction.java

+    }
+    ArrayList<CompactSketch<IntegerSummary>> merged =
+        new ArrayList<>(intermediateResult1.size() + intermediateResult2.size());
+    merged.addAll(intermediateResult1);


Just curious: we don't want to do a union here for the merge? I'm looking at DistinctCountThetaSketchAggregationFunction for reference.

This is an optimisation similar to the one used in the Theta version: merges can be quite expensive, so it's better to delay the merge until we have a lot of sketches to combine, hence using List as the intermediate type.
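The deferred-merge idea can be sketched generically. This is an illustration under stated assumptions, not the PR's code: `merge` mirrors the list concatenation in the quoted diff, while `extractFinal` and its `union` parameter are hypothetical names for the single expensive combine step that runs once at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class DeferredMergeDemo {
  // Cheap intermediate merge: concatenate the two lists, no union yet.
  static <T> List<T> merge(List<T> result1, List<T> result2) {
    if (result1 == null || result1.isEmpty()) {
      return result2;
    }
    if (result2 == null || result2.isEmpty()) {
      return result1;
    }
    List<T> merged = new ArrayList<>(result1.size() + result2.size());
    merged.addAll(result1);
    merged.addAll(result2);
    return merged;
  }

  // Expensive step, run once over everything that was collected.
  static <T> T extractFinal(List<T> collected, T identity, BinaryOperator<T> union) {
    T accumulated = identity;
    for (T item : collected) {
      accumulated = union.apply(accumulated, item);
    }
    return accumulated;
  }

  public static void main(String[] args) {
    // Integers stand in for sketches, and addition stands in for the union.
    List<Integer> fromSegmentA = Arrays.asList(1, 2);
    List<Integer> fromSegmentB = Arrays.asList(3);
    List<Integer> collected = merge(fromSegmentA, fromSegmentB);
    System.out.println(collected.size());
    System.out.println(extractFinal(collected, 0, Integer::sum));
  }
}
```

The intermediate result grows with the number of inputs, so this trades memory for fewer union operations.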

swaminathanmanish · 2023-03-20T14:50:23Z

.../org/apache/pinot/core/query/aggregation/function/IntegerTupleSketchAggregationFunction.java

+          }
+        }
+      } catch (Exception e) {
+        throw new RuntimeException("Caught exception while merging Tuple Sketches", e);


I guess this is groupBy and not merging Tuple Sketches?

swaminathanmanish · 2023-03-20T14:51:36Z

.../org/apache/pinot/core/query/aggregation/function/IntegerTupleSketchAggregationFunction.java

+      byte[] value = valueArray[i];
+      CompactSketch<IntegerSummary> newSketch =
+          ObjectSerDeUtils.DATA_SKETCH_INT_TUPLE_SER_DE.deserialize(value).compact();
+      for (int groupKey : groupKeysArray[i]) {


This looks exactly the same as aggregateGroupBySV, except that we iterate over group keys because the column can be multi-valued?

Jackie-Jiang

LGTM
Can you please rebase and resolve the conflict, and also respond to the pending comments?

swaminathanmanish · 2023-04-27T16:03:34Z

Thanks for taking care of the comments!

@dang

…pache#250)

### Notify
cc stripe-private-oss-forks/pinot-reviewers

### Summary
For upstream pulls, please list all the PRs of interest and risk.

Interesting commits:
- apache#10598 : better null handling in transform functions
- apache#10757: table configs has an updated version?
- apache#10766 - ideal state compression is on for findata and rad clusters.
- apache#10643 - new percentile agg function - better error rates than t-digest
- apache#10687 - refactoring + new mutable index spi - will require specific commit to fix our HLL implementation + proper testing in QA
- apache#10784 (@dang contributed): making sure servers wait the full time they need to wait before shutting down
- apache#10785 - allow env var substitution for pinot configs
- apache#10427 - another new agg function - integer sketch support

### Motivation
biweekly upstream pull

### Testing
Specifics to include:
- diff some test table configs, see if anything changes
- HLL pre-agg and big decimal tests

Table config diff testing + Big decimal / HLL testing: https://docs.google.com/document/d/1E8X_ARM_m9VtecRhoscOztZ_cPTZ6aV2YqjEJ4n9PaM/edit?usp=sharing
Upstream pull load testing: https://docs.google.com/document/d/1GmCiHhaFP8HVVaoQ4t_3a2SqF9JE6RtF6on8erzCX3U/edit?usp=sharing

### Rollout/monitoring/revert plan
rollout to prod sandbox, perform load testing, then roll out to a400 clusters and a200 next day

(Squashed by Merge Queue - Original PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/250)

andimiller added 4 commits March 15, 2023 16:47

Merge branch 'master' into tuple-sketch-support

af8a3b9

fix style

0eb23bb

add test for sketch agg

a081c10

fix mangled license headers

9a3771c

andimiller changed the title ~~Tuple sketch support~~ Integer Tuple Sketch support Mar 15, 2023

annotate types for old versions of java

bb8054f

swaminathanmanish reviewed Mar 16, 2023

pinot-core/src/main/java/org/apache/pinot/core/function/scalar/SketchFunctions.java

swaminathanmanish reviewed Mar 16, 2023

...nt-local/src/main/java/org/apache/pinot/segment/local/aggregator/ValueAggregatorFactory.java

davecromberge reviewed Mar 16, 2023

swaminathanmanish reviewed Mar 16, 2023

swaminathanmanish reviewed Mar 17, 2023

andimiller added 6 commits March 17, 2023 15:41

Cache Tuple Union result so it's not recomputed

1821a2c

Improve null handling in Tuple aggregation functions

4dcffb3

Cleanup in IntegerTupleSketchAggregationFunction's parameters

b038bbc

Make Theta and Tuple transform functions throw on unexpected key types

3b41c5b

Clean up sum/avg implementations for Tuple Sketch values

bb69257

Fix on Java 8

0c26150

swaminathanmanish reviewed Mar 20, 2023

Jackie-Jiang approved these changes Apr 26, 2023

Jackie-Jiang added the feature and release-notes (Referenced by PRs that need attention when compiling the next release notes) labels Apr 26, 2023

andimiller added 3 commits April 27, 2023 14:46

Merge branch 'master' into tuple-sketch-support

c99bb69

Expand todo for tuple sketch aggregation function

1207e88

add preconditions to tuple aggregation function

d246b81

swaminathanmanish approved these changes Apr 27, 2023

andimiller added 9 commits April 28, 2023 10:29

empty commit to re-trigger CI

0efea6e

empty commit to re-trigger CI again

da958b8

Merge branch 'master' into tuple-sketch-support

798181a

fix merge

3e13196

empty commit to re-trigger CI again

b63d47e

Merge branch 'master' into tuple-sketch-support

c021c64

Merge branch 'master' into tuple-sketch-support

2381e4d

fix merge

b44b0d9

fix merge again

1b7fe74

mayankshriv merged commit ded7e8f into apache:master May 25, 2023
14 checks passed
