[SPARK-34165][SQL] Add count_distinct as an option to Dataset#summary
### What changes were proposed in this pull request?

Add `count_distinct` as an option argument to `Dataset#summary` (the method already supports count, min, max, etc.).

### Why are the changes needed?

The `summary()` method is used for lightweight exploratory data analysis. A distinct count of all the columns is one of the most common exploratory data analysis queries.

Distinct counts can be expensive, so this shouldn't be enabled by default. The proposed implementation is completely backwards compatible.

### Does this PR introduce _any_ user-facing change?

Yes, users can now call `df.summary("count_distinct")`, which wasn't an option before.

Users can still call `df.summary()` without any arguments and the output is the same. `count_distinct` was not added as one of the `defaultStatistics`.

### How was this patch tested?

Unit tests.

### Additional comments

If this idea is accepted, we should add a PySpark implementation in this PR, as suggested by zero323.

Closes #31254 from MrPowers/SPARK-34165.

Authored-by: MrPowers <matthewkevinpowers@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
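
As an illustration of the user-facing change, the following is a minimal sketch of calling `summary` with the new option. It assumes a Spark build that includes this commit and a local session; the object name, the sample data, and the column names `name` and `age` are hypothetical and not taken from the patch.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the patch itself.
object SummaryCountDistinctExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("summary-count-distinct")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with repeated values in both columns.
    val df = Seq(
      ("alice", 30),
      ("bob", 30),
      ("alice", 41)
    ).toDF("name", "age")

    // Opt in explicitly: count_distinct is only computed when requested.
    df.summary("count", "count_distinct").show()

    // The no-argument call keeps using the default statistics,
    // which do not include count_distinct.
    df.summary().show()

    spark.stop()
  }
}
```

Because a distinct count can be expensive on large datasets, requesting it by name alongside cheaper statistics keeps the cost visible at the call site.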
Showing 3 changed files with 48 additions and 0 deletions.