Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add String Column Support for Count Distinct Aggregation #1196

Merged
merged 2 commits into from
Jun 28, 2023

Conversation

atangwbd
Copy link
Collaborator

@atangwbd atangwbd commented Jun 27, 2023

Add String Support for Count Distinct Aggregation

Currently, feathr doesn't support the COUNT_DISTINCT aggregation on string column types, since it assumes the data type from the schema prior to the aggregation. When we try applying COUNT_DISTINCT on string columns, we get this error:

Caused by: Job aborted due to stage failure: Error while encoding: java.lang.RuntimeException: java.lang.Integer is not a valid external type for schema of string

This PR converts each string into a unique 32 bit number using the built in spark hash function such that COUNT_DISTINCT aggregations can also work on string columns.

How was this PR tested?

I ran spark jobs locally and saw them fail before this change, and succeed after this change.

Does this PR introduce any user-facing changes?

COUNT_DISTINCT should now work on string columns.

  • No. You can skip the rest of this section.
  • [x ] Yes. Make sure to clarify your proposed changes.

@atangwbd atangwbd added the safe to test Tag to execute build pipeline for a PR from forked repo label Jun 27, 2023
aabbasi-hbo
aabbasi-hbo previously approved these changes Jun 27, 2023
xiaoyongzhu
xiaoyongzhu previously approved these changes Jun 27, 2023
@xiaoyongzhu
Copy link
Member

@atangwbd looks like there's a test failure? do you mind fixing it?

Gradle suite > Gradle test > com.linkedin.feathr.offline.SlidingWindowAggIntegTest > testSWACountDistinct FAILED
org.apache.spark.sql.AnalysisException at SlidingWindowAggIntegTest.scala:1823

@atangwbd atangwbd added DO-NOT-MERGE The PR shall not be merged work-in-progress/do-not-merge Work in Progress PR, do not merge and removed DO-NOT-MERGE The PR shall not be merged labels Jun 28, 2023
@atangwbd atangwbd dismissed stale reviews from xiaoyongzhu and aabbasi-hbo via ff9bd8d June 28, 2023 06:23
@atangwbd
Copy link
Collaborator Author

@atangwbd looks like there's a test failure? do you mind fixing it?

Gradle suite > Gradle test > com.linkedin.feathr.offline.SlidingWindowAggIntegTest > testSWACountDistinct FAILED org.apache.spark.sql.AnalysisException at SlidingWindowAggIntegTest.scala:1823

Fixed.

@atangwbd atangwbd removed the work-in-progress/do-not-merge Work in Progress PR, do not merge label Jun 28, 2023
@xiaoyongzhu
Copy link
Member

Thanks for the PR! The tests failed should be irrelevant to this change

@xiaoyongzhu xiaoyongzhu merged commit 169b86e into main Jun 28, 2023
20 of 33 checks passed
@xiaoyongzhu xiaoyongzhu deleted the feature/count-distinct-string-support branch June 28, 2023 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
safe to test Tag to execute build pipeline for a PR from forked repo
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants