Add String Column Support for Count Distinct Aggregation #1196

atangwbd · 2023-06-27T17:05:20Z

Add String Support for Count Distinct Aggregation

Currently, Feathr doesn't support the COUNT_DISTINCT aggregation on string columns, because it assumes the column's data type from the schema before aggregating. Applying COUNT_DISTINCT to a string column fails with this error:

Caused by: Job aborted due to stage failure: Error while encoding: java.lang.RuntimeException: java.lang.Integer is not a valid external type for schema of string

This PR converts each string into a 32-bit integer using Spark's built-in hash function, so that COUNT_DISTINCT aggregations also work on string columns. (The integers are not strictly unique: two different strings can, rarely, hash to the same value.)
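To make the idea concrete, here is a minimal Python sketch. It is not Feathr's code: `to_int32` and `count_distinct` are hypothetical helpers, and `zlib.crc32` is only a stand-in for Spark's built-in `hash` function, so the integers it produces will not match Spark's.

```python
import zlib


def to_int32(value: str) -> int:
    """Hypothetical stand-in for Spark's hash: map a string to a signed 32-bit int."""
    unsigned = zlib.crc32(value.encode("utf-8"))  # in the range 0 .. 2**32 - 1
    # Reinterpret the unsigned result as a signed 32-bit integer.
    return unsigned - 2**32 if unsigned >= 2**31 else unsigned


def count_distinct(values) -> int:
    """Count distinct values, hashing strings to integers first."""
    seen = set()
    for v in values:
        # Strings are replaced by their hash; other values pass through unchanged.
        seen.add(to_int32(v) if isinstance(v, str) else v)
    return len(seen)


page_views = ["home", "cart", "home", "checkout", "cart"]
print(count_distinct(page_views))
```

Because the hash has only 32 bits, two different strings can occasionally map to the same integer, in which case the distinct count would come out slightly low.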

How was this PR tested?

I ran Spark jobs locally: they failed before this change and succeeded after it.

Does this PR introduce any user-facing changes?

[ ] No. You can skip the rest of this section.
[x] Yes. Make sure to clarify your proposed changes.

COUNT_DISTINCT should now work on string columns.

xiaoyongzhu · 2023-06-27T17:15:26Z

@atangwbd looks like there's a test failure? do you mind fixing it?

Gradle suite > Gradle test > com.linkedin.feathr.offline.SlidingWindowAggIntegTest > testSWACountDistinct FAILED
org.apache.spark.sql.AnalysisException at SlidingWindowAggIntegTest.scala:1823

atangwbd · 2023-06-28T06:31:51Z

> @atangwbd looks like there's a test failure? do you mind fixing it?
>
> Gradle suite > Gradle test > com.linkedin.feathr.offline.SlidingWindowAggIntegTest > testSWACountDistinct FAILED
> org.apache.spark.sql.AnalysisException at SlidingWindowAggIntegTest.scala:1823

Fixed.

xiaoyongzhu · 2023-06-28T19:07:14Z

Thanks for the PR! The failed tests should be unrelated to this change.

add string support for count distinct column

869c31a

atangwbd added the "safe to test" label (Tag to execute build pipeline for a PR from forked repo) Jun 27, 2023

atangwbd assigned aabbasi-hbo and xiaoyongzhu Jun 27, 2023

aabbasi-hbo previously approved these changes Jun 27, 2023

xiaoyongzhu previously approved these changes Jun 27, 2023

atangwbd added the DO-NOT-MERGE (The PR shall not be merged) and work-in-progress/do-not-merge (Work in Progress PR, do not merge) labels and removed the DO-NOT-MERGE label Jun 28, 2023

fix: use native hashing algorithm and fix bug

ff9bd8d

atangwbd dismissed stale reviews from xiaoyongzhu and aabbasi-hbo via ff9bd8d June 28, 2023 06:23

atangwbd requested review from aabbasi-hbo and xiaoyongzhu June 28, 2023 06:32

atangwbd removed the work-in-progress/do-not-merge (Work in Progress PR, do not merge) label Jun 28, 2023

xiaoyongzhu approved these changes Jun 28, 2023

xiaoyongzhu merged commit 169b86e into main Jun 28, 2023
20 of 33 checks passed

xiaoyongzhu deleted the feature/count-distinct-string-support branch June 28, 2023 19:07
