Support for exact distinct count for non int data types #5872

kishoreg · 2020-08-16T09:31:41Z

Description

Currently, DistinctCount uses an IntOpenHashSet to store distinct values (as hash codes) even for non-INT types. While this is efficient, accuracy drops as cardinality increases. This PR sets up the right HashSet based on the column data type.

Upgrade Notes

Brokers should be upgraded before servers to maintain backward compatibility.

Release Notes

With this change, the DistinctCount aggregation function always returns the exact distinct count regardless of the column data type. This may add performance overhead for data types other than INT.
For use cases that are performance-sensitive and do not require the exact distinct count, use DistinctCountBitmap, which has the same behavior as the previous DistinctCount and better performance.
A new boolean Helix cluster config, enable.distinct.count.bitmap.override, auto-rewrites DistinctCount to DistinctCountBitmap on the broker.

pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java

Jackie-Jiang

I think we might need to create a new function for this, or the existing distinctCount queries will face inconsistent results and performance degradation.

pinot-common/src/main/java/org/apache/pinot/common/function/AggregationFunctionType.java

Jackie-Jiang · 2020-08-17T17:58:19Z

pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java

@@ -111,6 +127,14 @@ public static ObjectType getObjectType(Object value) {
        return ObjectType.Geometry;
      } else if (value instanceof RoaringBitmap) {
        return ObjectType.RoaringBitmap;
+      } else if (value instanceof LongSet) {
+        return ObjectType.LongSet;
+      } else if (value instanceof it.unimi.dsi.fastutil.floats.FloatSet) {


Suggested change

-      } else if (value instanceof it.unimi.dsi.fastutil.floats.FloatSet) {
+      } else if (value instanceof FloatSet) {

Jackie-Jiang · 2020-08-17T17:58:29Z

pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java

+        return ObjectType.LongSet;
+      } else if (value instanceof it.unimi.dsi.fastutil.floats.FloatSet) {
+        return ObjectType.FloatSet;
+      } else if (value instanceof it.unimi.dsi.fastutil.doubles.DoubleSet) {


Suggested change

-      } else if (value instanceof it.unimi.dsi.fastutil.doubles.DoubleSet) {
+      } else if (value instanceof DoubleSet) {

Jackie-Jiang · 2020-08-17T18:00:12Z

pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java

-      GEOMETRY_SER_DE,
-      ROARING_BITMAP_SER_DE
-  };
+  private static final ObjectSerDe[] SER_DES =


Revert this reformat (you may want to enable formatter markers in comments in your IDE)

Jackie-Jiang · 2020-08-17T18:01:38Z

...e/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java

              for (int dictId = 0; dictId < dictionarySize; dictId++) {
-                set.add(dictionary.getStringValue(dictId).hashCode());
+                set.add(ByteBuffer.wrap(dictionary.getStringValue(dictId).getBytes(Charsets.UTF_8)));


Suggest using ByteArray instead of ByteBuffer to store bytes

Use StringUtils.encodeUtf8() to encode string for better performance

Jackie-Jiang · 2020-08-17T18:07:27Z

.../java/org/apache/pinot/core/query/aggregation/function/DistinctCountAggregationFunction.java

    }
  }

+  private AbstractCollection emptyCollection() {
+    return new AbstractCollection() {


I don't think this works for ser/de. You need to construct a type specific set based on the data type

kishoreg · 2020-08-17T19:48:38Z

I think we might need to create a new function for this, or the existing distinctCount queries will face inconsistent results and performance degradation.

The existing behavior is very hard to explain. We can add a new function 'distinctCountHash' to bring back the previous behavior, but I don't know why someone would use that over distinctCountHLL.

mayankshriv

I think we might need to create a new function for this, or the existing distinctCount queries will face inconsistent results and performance degradation.

+1

mayankshriv · 2020-08-17T21:56:17Z

pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java

+      int size = floatSet.size();
+      byte[] bytes = new byte[Integer.BYTES + size * Float.BYTES];
+      ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
+      byteBuffer.putInt(size);


Wondering if we should have a single ser/de for different data types, by writing the data type as part of header. Not sure if the iterators share the same interface to be able to share the same serialize().

Jackie-Jiang · 2020-08-20

We are using fastutil Sets for better performance, and each Set has its own iterator class. Keeping them separate is more readable IMO. The ObjectType info is already maintained in the header, so there is no need to introduce another level of type info.
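The layout visible in the diff above (a 4-byte element count followed by the fixed-width values) can be sketched as a round trip. This is only an illustration, not the PR's code: the class name `FloatSetSerDe` is hypothetical, and `java.util.Set<Float>` stands in for the fastutil `FloatSet` so the example has no dependencies.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; java.util.Set<Float> stands in for fastutil's FloatSet.
class FloatSetSerDe {
  // Layout: 4-byte element count, then one 4-byte float per element.
  static byte[] serialize(Set<Float> floatSet) {
    int size = floatSet.size();
    byte[] bytes = new byte[Integer.BYTES + size * Float.BYTES];
    ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
    byteBuffer.putInt(size);
    for (float value : floatSet) {
      byteBuffer.putFloat(value);
    }
    return bytes;
  }

  // Reads the count first, then exactly that many floats.
  static Set<Float> deserialize(byte[] bytes) {
    ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
    int size = byteBuffer.getInt();
    Set<Float> floatSet = new HashSet<>();
    for (int i = 0; i < size; i++) {
      floatSet.add(byteBuffer.getFloat());
    }
    return floatSet;
  }
}
```

A LONG or DOUBLE variant would differ only in the element width and the put/get calls, which is the duplication the question above is weighing against readability.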

mayankshriv · 2020-08-17T21:59:17Z

...e/src/main/java/org/apache/pinot/core/operator/query/DictionaryBasedAggregationOperator.java

              }
              break;
            case STRING:
+              set = new ObjectOpenHashSet<ByteBuffer>(dictionarySize);


why not byte[]?
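One likely reason, sketched below with plain JDK classes: a raw `byte[]` inherits identity-based `equals()`/`hashCode()` from `Object`, so a hash set of raw arrays does not deduplicate by content, while a wrapper such as `ByteBuffer` compares by content. The class and method names here are hypothetical, and `java.util.HashSet` stands in for `ObjectOpenHashSet`; the `ByteArray` class suggested earlier in the review presumably plays the same wrapper role.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; not part of the PR.
class BytesKeyDemo {
  // Raw arrays: each getBytes() call returns a new array object, and arrays
  // are compared by identity, so equal contents are not merged.
  static int distinctWithRawArrays(String[] values) {
    Set<byte[]> set = new HashSet<>();
    for (String value : values) {
      set.add(value.getBytes(StandardCharsets.UTF_8));
    }
    return set.size();
  }

  // Wrapped arrays: ByteBuffer.equals()/hashCode() are based on the remaining
  // bytes, so equal contents collapse into one element.
  static int distinctWithWrappedArrays(String[] values) {
    Set<ByteBuffer> set = new HashSet<>();
    for (String value : values) {
      set.add(ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }
    return set.size();
  }
}
```

For the input `{"a", "b", "a"}`, the raw-array version counts three elements and the wrapped version counts two.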

mayankshriv

A few comments:

Given that there's a potential result change for existing distinct cases, we should provide a way to measure impact (perhaps with a config, or by having a separate aggregation function).
We should also add tests to see how it improves on the existing function.
The release notes section in the PR should be updated accordingly.

Jackie-Jiang · 2020-08-17T23:34:18Z

@mayankshriv Can you please check the existing use cases on distinctCount? If the column is not STRING type, then the overhead should be minimal.
For the existing behavior, you may use distinctCountBitmap which is the enhanced version of the current distinctCount using the RoaringBitmap

mayankshriv · 2020-08-19T00:18:37Z

@Jackie-Jiang We have thousands of tables at LinkedIn, so they will likely cover all kinds of distinctCount use cases. I don't quite follow your suggestion on us using distinctCountBitmap. Are you suggesting we ask our customers to migrate to the bitmap based implementation?

Jackie-Jiang · 2020-08-19T01:26:43Z

@mayankshriv Currently, DistinctCount does not have the expected behavior of returning the exact distinct count, because it stores the hashCode() of the values instead of the actual values and returns a less-than-accurate result when hash collisions happen. We are fixing this unexpected behavior in this PR, but the fix has performance overhead.
DistinctCountBitmap will have the same behavior as the current DistinctCount (storing the hash of the values) and similar or better performance. If there are performance-sensitive use cases with DistinctCount, you might consider using DistinctCountBitmap instead. For non-performance-sensitive use cases, nothing needs to be changed, as DistinctCount will return the exact distinct count, which is the expected behavior.
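The undercount described above is easy to reproduce with plain JDK classes. The sketch below is illustrative only: `HashCollisionDemo` is a hypothetical name, and `java.util.HashSet` stands in for the fastutil sets. It relies on the well-known fact that the strings "Aa" and "BB" share the same `String.hashCode()`.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; not part of the PR.
class HashCollisionDemo {
  // Mimics the old behavior: only the int hash code of each value is stored.
  static int countByHashCode(String[] values) {
    Set<Integer> hashes = new HashSet<>();
    for (String value : values) {
      hashes.add(value.hashCode());
    }
    return hashes.size();
  }

  // Mimics the new behavior: the actual values are stored.
  static int countExact(String[] values) {
    Set<String> distinct = new HashSet<>();
    for (String value : values) {
      distinct.add(value);
    }
    return distinct.size();
  }
}
```

For the input `{"Aa", "BB", "C"}`, counting by hash code yields 2 while the exact count is 3, because "Aa" and "BB" collide.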

xiangfu0

LGTM

xiangfu0 · 2020-08-20T01:02:01Z

pinot-core/src/main/java/org/apache/pinot/core/common/ObjectSerDeUtils.java

+        return ObjectType.DoubleSet;
+      } else if (value instanceof ObjectSet) {
+        ObjectSet objectSet = (ObjectSet) value;
+        if (objectSet.isEmpty() || objectSet.iterator().next() instanceof String) {


Will this be a problem, since we always return StringSet for an empty set?

Jackie-Jiang · 2020-08-20

It is fine for an empty set, as an empty string set and an empty bytes set are the same (both are an empty ObjectOpenHashSet).

mayankshriv · 2020-08-20T04:08:50Z

@Jackie-Jiang I completely agree that this is a good change for the functionality. I am just apprehensive that it can potentially have a negative impact on performance and deployment (different versions of broker/server).

It is not just latency performance, there's memory impact as well, right? There are likely various use cases that call distinctCount on string columns (in fact, that is the most common use case). Given that we are going to use String hashSets instead of Int hashSets, there's a memory penalty as well. This can be amplified for multi-tenant cases where a single distinct query on a string column can put memory pressure and adversely impact all tables on the node.

For a large deployment, it is not practical to identify all use cases which may get impacted, manually evaluate the impact for each one of them, and ask the clients to move to the bit-map based aggregation function. For this reason, I was proposing to either have this as a new aggregation function, or provide a runtime way of disabling the new behavior in case issues happen in production.

At the same time, I am fine with having this PR merged (as I fundamentally agree with the functional side of the change). However, I do want to call out that large deployments will likely need to invest in manually validating this PR for the various use cases they may have.

mcvsubbu · 2020-08-20T15:41:51Z

+1 on this. Site-facing use cases breaking after deployment is not something to look forward to.

Jackie-Jiang · 2020-08-20T18:38:13Z

@mayankshriv @mcvsubbu Added a new boolean Helix cluster config, enable.distinct.count.bitmap.override, to auto-rewrite DistinctCount to DistinctCountBitmap on the broker. If the cluster runs into performance issues because of the new DistinctCount behavior, this flag can switch queries to DistinctCountBitmap without restarting the machines.
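The override can be pictured as a small rewrite step applied to the aggregation function name before the query is executed. The sketch below is an assumption about the general shape, not the broker's actual code: the class `DistinctCountOverride`, the `rewrite` method, and the use of a plain `Map` for the cluster config are all hypothetical. Only the config key comes from the comment above.

```java
import java.util.Map;

// Hypothetical sketch of a broker-side rewrite; not Pinot's implementation.
class DistinctCountOverride {
  // Config key as named in the PR discussion.
  static final String OVERRIDE_KEY = "enable.distinct.count.bitmap.override";

  // Returns the function name to execute: DistinctCount is swapped for
  // DistinctCountBitmap only when the flag is set to "true".
  static String rewrite(String functionName, Map<String, String> clusterConfig) {
    // Boolean.parseBoolean(null) is false, so a missing key leaves the override off.
    boolean enabled = Boolean.parseBoolean(clusterConfig.get(OVERRIDE_KEY));
    if (enabled && functionName.equalsIgnoreCase("distinctCount")) {
      return "distinctCountBitmap";
    }
    return functionName;
  }
}
```

Because the flag is read from cluster config at rewrite time in this sketch, flipping it changes behavior for subsequent queries without a restart, which matches the intent described above.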

kishoreg requested a review from Jackie-Jiang August 16, 2020 09:31

kishoreg force-pushed the exact-distinct-count branch from dc1b19e to ef30e07 Compare August 17, 2020 06:16

xiangfu0 reviewed Aug 17, 2020

Jackie-Jiang reviewed Aug 17, 2020

mayankshriv reviewed Aug 17, 2020

Jackie-Jiang force-pushed the exact-distinct-count branch from 8f2dff4 to 1810550 Compare August 19, 2020 23:45

Jackie-Jiang added the release-notes label Aug 19, 2020

xiangfu0 approved these changes Aug 20, 2020

Support exact distinct count

1159f79

Jackie-Jiang force-pushed the exact-distinct-count branch from 1810550 to a683628 Compare August 20, 2020 18:28

Add DistinctCountBitmap query override

4fdf4cd

Jackie-Jiang force-pushed the exact-distinct-count branch from a683628 to 4fdf4cd Compare August 20, 2020 18:32

Jackie-Jiang merged commit c223dfc into master Aug 20, 2020

Jackie-Jiang deleted the exact-distinct-count branch August 20, 2020 20:07
