Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for exact distinct count for non int data types #5872

Merged
merged 2 commits into from Aug 20, 2020

Conversation

kishoreg
Copy link
Member

@kishoreg kishoreg commented Aug 16, 2020

Description

Currently in DistinctCount, we use IntOpenHashSet to store distinct ids even for non int types. While this is efficient, the accuracy drops as the cardinality increase. This PR sets up the right HashSet based on column data type.

Upgrade Notes

Brokers should be upgraded before servers in order to keep backward-compatible

Release Notes

With this change, the DistinctCount aggregation function will always return the exact distinct count regardless of the column data type. It might bring performance overhead for data types other than INT.
For use cases that is performance sensitive and not require the exact distinct count, use DistinctCountBitmap which has the same behavior as the current DistinctCount and better performance.
Provide a new boolean Helix cluster config enable.distinct.count.bitmap.override to auto-rewrite DistinctCount to DistinctCountBitmap on broker.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to create a new function for this, or the existing distinctCount queries will face inconsistent results and performance degradation.

@@ -111,6 +127,14 @@ public static ObjectType getObjectType(Object value) {
return ObjectType.Geometry;
} else if (value instanceof RoaringBitmap) {
return ObjectType.RoaringBitmap;
} else if (value instanceof LongSet) {
return ObjectType.LongSet;
} else if (value instanceof it.unimi.dsi.fastutil.floats.FloatSet) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if (value instanceof it.unimi.dsi.fastutil.floats.FloatSet) {
} else if (value instanceof FloatSet) {

return ObjectType.LongSet;
} else if (value instanceof it.unimi.dsi.fastutil.floats.FloatSet) {
return ObjectType.FloatSet;
} else if (value instanceof it.unimi.dsi.fastutil.doubles.DoubleSet) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if (value instanceof it.unimi.dsi.fastutil.doubles.DoubleSet) {
} else if (value instanceof DoubleSet) {

GEOMETRY_SER_DE,
ROARING_BITMAP_SER_DE
};
private static final ObjectSerDe[] SER_DES =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert this reformat (you may want to enable formatter markers in comments in your IDE)

for (int dictId = 0; dictId < dictionarySize; dictId++) {
set.add(dictionary.getStringValue(dictId).hashCode());
set.add(ByteBuffer.wrap(dictionary.getStringValue(dictId).getBytes(Charsets.UTF_8)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest using ByteArray instead of ByteBuffer to store bytes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use StringUtils.encodeUtf8() to encode string for better performance

}
}

private AbstractCollection emptyCollection() {
return new AbstractCollection() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this works for ser/de. You need to construct a type specific set based on the data type

@kishoreg
Copy link
Member Author

I think we might need to create a new function for this, or the existing distinctCount queries will face inconsistent results and performance degradation.

Existing behavior is very hard to explain, we can add a new function ''distinctCountHash' to bring back the previous behavior but I don't know why would someone use that vs distinctCountHLL

Copy link
Contributor

@mayankshriv mayankshriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to create a new function for this, or the existing distinctCount queries will face inconsistent results and performance degradation.

+1

int size = floatSet.size();
byte[] bytes = new byte[Integer.BYTES + size * Float.BYTES];
ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
byteBuffer.putInt(size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we should have a single ser/de for different data types, by writing the data type as part of header. Not sure if the iterators share the same interface to be able to share the same serialize().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using fastutil Sets for better performance, and each Set has its own iterator class. Keeping them separate is more readable IMO. The ObjectType info is already maintained in the header, no need to introduce another level of type info.

}
break;
case STRING:
set = new ObjectOpenHashSet<ByteBuffer>(dictionarySize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not byte[]?

Copy link
Contributor

@mayankshriv mayankshriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments:

  • Given that there's potential result change for existing distinct cases, we should not provide a way to measure impact (perhaps with config, or having a separate aggr function).

  • We should also add tests to see how it improves wrt existing function.

  • The release notes section in the PR should be updated accordingly.

@Jackie-Jiang
Copy link
Contributor

@mayankshriv Can you please check the existing use cases on distinctCount? If the column is not STRING type, then the overhead should be minimal.
For the existing behavior, you may use distinctCountBitmap which is the enhanced version of the current distinctCount using the RoaringBitmap

@mayankshriv
Copy link
Contributor

@mayankshriv Can you please check the existing use cases on distinctCount? If the column is not STRING type, then the overhead should be minimal.
For the existing behavior, you may use distinctCountBitmap which is the enhanced version of the current distinctCount using the RoaringBitmap

@Jackie-Jiang We have thousands of tables at LinkedIn, so they will likely cover all kinds of distinctCount use cases. I don't quite follow your suggestion on us using distinctCountBitmap. Are you suggesting we ask our customers to migrate to the bitmap based implementation?

@Jackie-Jiang
Copy link
Contributor

@mayankshriv Currently the DistinctCount is not having the expected behavior of returning the exact distinct count because it is storing the hashCode() of the values instead of the actual values, and will return less than accurate result when hash collision happens. We are fixing this unexpected behavior in this PR, but that has performance overhead.
DistinctCountBitmap will have the same behavior as the current DistinctCount (storing hash of the values) and similar or better performance. In case there are performance-sensitive use cases with DistinctCount, you might consider using DistinctCountBitmap instead. For non-performance-sensitive use cases, nothing need to be changed as DistinctCount will return the exact distinct count, which is the expected behavior.

@Jackie-Jiang Jackie-Jiang added the release-notes Referenced by PRs that need attention when compiling the next release notes label Aug 19, 2020
Copy link
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

return ObjectType.DoubleSet;
} else if (value instanceof ObjectSet) {
ObjectSet objectSet = (ObjectSet) value;
if (objectSet.isEmpty() || objectSet.iterator().next() instanceof String) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this be a problem as we always return StringSet for empty value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine for empty set as empty string set and empty bytes set are the same (both are empty ObjectOpenHashSet)

@mayankshriv
Copy link
Contributor

@mayankshriv Currently the DistinctCount is not having the expected behavior of returning the exact distinct count because it is storing the hashCode() of the values instead of the actual values, and will return less than accurate result when hash collision happens. We are fixing this unexpected behavior in this PR, but that has performance overhead.
DistinctCountBitmap will have the same behavior as the current DistinctCount (storing hash of the values) and similar or better performance. In case there are performance-sensitive use cases with DistinctCount, you might consider using DistinctCountBitmap instead. For non-performance-sensitive use cases, nothing need to be changed as DistinctCount will return the exact distinct count, which is the expected behavior.

@Jackie-Jiang I completely agree that this is a good change for the functionality. I am just apprehensive that it can potentially have negative impact on performance, deployment (different versions of broker/server).

It is not just latency performance, there's memory impact as well, right? There are likely various use cases that call distinctCount on string columns (in fact, that is the most common use case). Given that we are going to use String hashSets instead of Int hashSets, there's a memory penalty as well. This can be amplified for multi-tenant cases where a single distinct query on a string column can put memory pressure and adversely impact all tables on the node.

For a large deployment, it is not practical to identify all use cases which may get impacted, manually evaluate the impact for each one of them, and ask the clients to move to the bit-map based aggregation function. For this reason, I was proposing to either have this as a new aggregation function, or provide a runtime way of disabling the new behavior in case issues happen in production.

At the same time, I am fine to have this PR merged (as I fundamentally agree with the functional side of the change). However, I do want to call out that large deployments might likely need to invest in manually validating this PR for the various use cases they may have.

@mcvsubbu
Copy link
Contributor

@mayankshriv Currently the DistinctCount is not having the expected behavior of returning the exact distinct count because it is storing the hashCode() of the values instead of the actual values, and will return less than accurate result when hash collision happens. We are fixing this unexpected behavior in this PR, but that has performance overhead.
DistinctCountBitmap will have the same behavior as the current DistinctCount (storing hash of the values) and similar or better performance. In case there are performance-sensitive use cases with DistinctCount, you might consider using DistinctCountBitmap instead. For non-performance-sensitive use cases, nothing need to be changed as DistinctCount will return the exact distinct count, which is the expected behavior.

@Jackie-Jiang I completely agree that this is a good change for the functionality. I am just apprehensive that it can potentially have negative impact on performance, deployment (different versions of broker/server).

It is not just latency performance, there's memory impact as well, right? There are likely various use cases that call distinctCount on string columns (in fact, that is the most common use case). Given that we are going to use String hashSets instead of Int hashSets, there's a memory penalty as well. This can be amplified for multi-tenant cases where a single distinct query on a string column can put memory pressure and adversely impact all tables on the node.

For a large deployment, it is not practical to identify all use cases which may get impacted, manually evaluate the impact for each one of them, and ask the clients to move to the bit-map based aggregation function. For this reason, I was proposing to either have this as a new aggregation function, or provide a runtime way of disabling the new behavior in case issues happen in production.

At the same time, I am fine to have this PR merged (as I fundamentally agree with the functional side of the change). However, I do want to call out that large deployments might likely need to invest in manually validating this PR for the various use cases they may have.

+1 on this. Site facing use cases breaking after deployment is not something to look forward to.

@Jackie-Jiang
Copy link
Contributor

Jackie-Jiang commented Aug 20, 2020

@mayankshriv @mcvsubbu Added a new boolean Helix cluster config enable.distinct.count.bitmap.override to auto-rewrite DistinctCount to DistinctCountBitmap on broker. If the cluster runs into performance issue because of the new DistinctCount behavior, this flag is able to switch to DistinctCountBitmap without restarting the machines.

@Jackie-Jiang Jackie-Jiang merged commit c223dfc into master Aug 20, 2020
@Jackie-Jiang Jackie-Jiang deleted the exact-distinct-count branch August 20, 2020 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants