Support for exact distinct count for non int data types #5872
dc1b19e to ef30e07
@@ -111,6 +127,14 @@ public static ObjectType getObjectType(Object value) {
      return ObjectType.Geometry;
    } else if (value instanceof RoaringBitmap) {
      return ObjectType.RoaringBitmap;
    } else if (value instanceof LongSet) {
      return ObjectType.LongSet;
    } else if (value instanceof it.unimi.dsi.fastutil.floats.FloatSet) {
Suggested change:
-    } else if (value instanceof it.unimi.dsi.fastutil.floats.FloatSet) {
+    } else if (value instanceof FloatSet) {
      return ObjectType.LongSet;
    } else if (value instanceof it.unimi.dsi.fastutil.floats.FloatSet) {
      return ObjectType.FloatSet;
    } else if (value instanceof it.unimi.dsi.fastutil.doubles.DoubleSet) {
Suggested change:
-    } else if (value instanceof it.unimi.dsi.fastutil.doubles.DoubleSet) {
+    } else if (value instanceof DoubleSet) {
      GEOMETRY_SER_DE,
      ROARING_BITMAP_SER_DE
  };
  private static final ObjectSerDe[] SER_DES =
Revert this reformat (you may want to enable formatter markers in comments in your IDE)
    for (int dictId = 0; dictId < dictionarySize; dictId++) {
      set.add(dictionary.getStringValue(dictId).hashCode());
      set.add(ByteBuffer.wrap(dictionary.getStringValue(dictId).getBytes(Charsets.UTF_8)));
Suggest using ByteArray instead of ByteBuffer to store bytes.
Use StringUtils.encodeUtf8() to encode the string for better performance.
    }
  }

  private AbstractCollection emptyCollection() {
    return new AbstractCollection() {
I don't think this works for ser/de. You need to construct a type-specific set based on the data type.
The existing behavior is very hard to explain. We can add a new function 'distinctCountHash' to bring back the previous behavior, but I don't know why someone would use that over distinctCountHLL.
I think we might need to create a new function for this, or the existing distinctCount queries will face inconsistent results and performance degradation.
+1
    int size = floatSet.size();
    byte[] bytes = new byte[Integer.BYTES + size * Float.BYTES];
    ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
    byteBuffer.putInt(size);
Wondering if we should have a single ser/de for the different data types, by writing the data type as part of the header. Not sure if the iterators share the same interface so that they could share the same serialize().
We are using fastutil Sets for better performance, and each Set has its own iterator class. Keeping them separate is more readable IMO. The ObjectType info is already maintained in the header, so there is no need to introduce another level of type info.
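For readers following the layout in the hunk above, here is a minimal round-trip sketch of a per-type ser/de: a 4-byte size followed by the values. It uses java.util.Set<Float> as a stand-in for the fastutil FloatSet, and the class and method names are hypothetical, so treat it as an illustration of the byte layout rather than the code in this PR.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: java.util.Set<Float> stands in for fastutil's FloatSet.
final class FloatSetSerDeSketch {

  // Layout: [int size][float value] * size
  static byte[] serialize(Set<Float> floatSet) {
    int size = floatSet.size();
    byte[] bytes = new byte[Integer.BYTES + size * Float.BYTES];
    ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
    byteBuffer.putInt(size);
    for (float value : floatSet) {
      byteBuffer.putFloat(value);
    }
    return bytes;
  }

  // Reads the size header first, then exactly that many floats.
  static Set<Float> deserialize(byte[] bytes) {
    ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
    int size = byteBuffer.getInt();
    Set<Float> floatSet = new HashSet<>();
    for (int i = 0; i < size; i++) {
      floatSet.add(byteBuffer.getFloat());
    }
    return floatSet;
  }

  public static void main(String[] args) {
    Set<Float> original = new HashSet<>();
    original.add(1.5f);
    original.add(2.5f);
    byte[] bytes = serialize(original);
    System.out.println(bytes.length);                        // 4 + 2 * 4 = 12
    System.out.println(deserialize(bytes).equals(original)); // true
  }
}
```

The payload carries no type tag of its own, which matches the point made in the reply: the type is already known from the ObjectType written in the header, so each ser/de only needs to handle its own element width.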
        }
        break;
      case STRING:
        set = new ObjectOpenHashSet<ByteBuffer>(dictionarySize);
why not byte[]?
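One likely reason, sketched below with plain JDK collections: Java arrays inherit identity-based equals() and hashCode() from Object, so a hash set of raw byte[] does not deduplicate equal contents, whereas ByteBuffer compares by content. The class and method names are hypothetical, and a wrapper type such as the ByteArray suggested earlier is assumed to serve the same purpose.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch using java.util.HashSet in place of ObjectOpenHashSet.
final class BytesKeySketch {

  // Raw arrays: every getBytes() call returns a new array with its own identity.
  static int countWithRawArrays(String... values) {
    Set<byte[]> set = new HashSet<>();
    for (String value : values) {
      set.add(value.getBytes(StandardCharsets.UTF_8));
    }
    return set.size();
  }

  // ByteBuffer.equals()/hashCode() are computed from the remaining content.
  static int countWithByteBuffers(String... values) {
    Set<ByteBuffer> set = new HashSet<>();
    for (String value : values) {
      set.add(ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }
    return set.size();
  }

  public static void main(String[] args) {
    System.out.println(countWithRawArrays("a", "a", "b"));   // 3: duplicates are not merged
    System.out.println(countWithByteBuffers("a", "a", "b")); // 2: duplicates are merged
  }
}
```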
A few comments:
- Given that there's a potential result change for existing distinct cases, we should provide a way to measure the impact (perhaps with a config, or by having a separate aggregation function).
- We should also add tests to see how it improves with respect to the existing function.
- The release notes section in the PR should be updated accordingly.
@mayankshriv Can you please check the existing use cases on |
@Jackie-Jiang We have thousands of tables at LinkedIn, so they will likely cover all kinds of distinctCount use cases. I don't quite follow your suggestion on us using |
@mayankshriv Currently the |
8f2dff4 to 1810550
LGTM
      return ObjectType.DoubleSet;
    } else if (value instanceof ObjectSet) {
      ObjectSet objectSet = (ObjectSet) value;
      if (objectSet.isEmpty() || objectSet.iterator().next() instanceof String) {
Will this be a problem, given that we always return StringSet for an empty set?
It is fine for an empty set, since an empty string set and an empty bytes set are the same (both are an empty ObjectOpenHashSet).
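The reasoning in this exchange can be sketched as follows. The enum, the class, and the use of ByteBuffer as the bytes element type are all assumptions made for illustration; java.util.Set stands in for fastutil's ObjectSet.

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of element-based type detection for an untyped object set.
final class ObjectSetTypeSketch {

  enum SetType { STRING_SET, BYTES_SET }

  // An empty set has no element to inspect, so it falls back to STRING_SET.
  static SetType classify(Set<?> set) {
    if (set.isEmpty() || set.iterator().next() instanceof String) {
      return SetType.STRING_SET;
    }
    return SetType.BYTES_SET;
  }

  public static void main(String[] args) {
    Set<String> emptyStrings = new HashSet<>();
    Set<ByteBuffer> emptyBytes = new HashSet<>();
    System.out.println(classify(emptyBytes));            // STRING_SET
    System.out.println(emptyStrings.equals(emptyBytes)); // true: two empty sets are equal
    System.out.println(classify(Collections.singleton(ByteBuffer.wrap(new byte[] {1})))); // BYTES_SET
  }
}
```

Because an empty set contributes nothing when merged and has size zero either way, labelling it as a string set does not change the final count.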
@Jackie-Jiang I completely agree that this is a good change for the functionality. I am just apprehensive that it can potentially have a negative impact on performance and deployment (different versions of broker/server). It is not just latency; there's memory impact as well, right? There are likely various use cases that call

For a large deployment, it is not practical to identify all use cases which may get impacted, manually evaluate the impact for each one of them, and ask the clients to move to the bitmap-based aggregation function. For this reason, I was proposing to either have this as a new aggregation function, or provide a runtime way of disabling the new behavior in case issues happen in production.

At the same time, I am fine with having this PR merged (as I fundamentally agree with the functional side of the change). However, I do want to call out that large deployments will likely need to invest in manually validating this PR for the various use cases they may have.
+1 on this. Site-facing use cases breaking after deployment is not something to look forward to.
1810550 to a683628
a683628 to 4fdf4cd
@mayankshriv @mcvsubbu Added a new boolean Helix cluster config |
Description
Currently in DistinctCount, we use IntOpenHashSet to store distinct ids even for non-int types. While this is efficient, the accuracy drops as the cardinality increases. This PR sets up the right HashSet based on the column data type.
Upgrade Notes
Brokers should be upgraded before servers in order to stay backward compatible.
Release Notes
With this change, the DistinctCount aggregation function will always return the exact distinct count regardless of the column data type. It might bring performance overhead for data types other than INT.
For use cases that are performance sensitive and do not require the exact distinct count, use DistinctCountBitmap, which has the same behavior as the current DistinctCount and better performance.
Provide a new boolean Helix cluster config enable.distinct.count.bitmap.override to auto-rewrite DistinctCount to DistinctCountBitmap on the broker.
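To make the accuracy loss mentioned in the description concrete, here is a small sketch that contrasts counting by hash code with counting by value. It uses plain java.util sets instead of IntOpenHashSet, and the class and method names are hypothetical; it only demonstrates the collision effect, not the aggregation function itself.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: counting by 32-bit hash code versus counting by value.
final class HashedDistinctCountSketch {

  // Stand-in for the previous behavior: only String.hashCode() of each value is stored.
  static int hashedDistinctCount(String... values) {
    Set<Integer> hashes = new HashSet<>();
    for (String value : values) {
      hashes.add(value.hashCode());
    }
    return hashes.size();
  }

  // Stand-in for the new behavior: the values themselves are stored.
  static int exactDistinctCount(String... values) {
    Set<String> distinct = new HashSet<>();
    for (String value : values) {
      distinct.add(value);
    }
    return distinct.size();
  }

  public static void main(String[] args) {
    // "Aa" and "BB" both hash to 2112 under String.hashCode().
    System.out.println(hashedDistinctCount("Aa", "BB")); // 1
    System.out.println(exactDistinctCount("Aa", "BB"));  // 2
  }
}
```

With only 2^32 possible hash codes, collisions like this become more frequent as the number of distinct values grows, which is why the undercount gets worse at high cardinality.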