[SPARK-37711][PS] Reduce pandas describe job count from O(N) to O(1) #54370
devin-petersohn wants to merge 3 commits into apache:master from
Conversation
I generated some benchmarks for the new implementation and compared them against the old implementation. The performance numbers are shown below.

- **Row counts:** 1,000 and 10,000
- **Column counts:** 2, 5, 10, 20, 40, 100
- **Data distribution:** Random uniform distribution over 10 distinct values per column
- **Total tests:** 11 configurations (plus the two single-column regression cases shown in bold)

| Rows | Columns | Old Time | New Time | Speedup | Time Saved | Improvement | Jobs (Old→New) | Jobs Saved |
|---------|---------|----------|----------|----------|------------|-------------|----------------|------------|
| 1,000 | **1** | **0.125s** | **0.188s** | **0.66x** | **-0.063s** | **-50.6%** | **2 → 3** | **-1** |
| 1,000 | 2 | 0.226s | 0.233s | 0.97x | -0.007s | -2.9% | 4 → 3 | 1 |
| 1,000 | 5 | 0.501s | 0.225s | 2.23x | 0.276s | 55.1% | 10 → 3 | 7 |
| 1,000 | 10 | 0.861s | 0.351s | 2.46x | 0.511s | 59.3% | 20 → 3 | 17 |
| 1,000 | 20 | 1.539s | 0.418s | 3.68x | 1.120s | 72.8% | 40 → 3 | 37 |
| 1,000 | 40 | 3.176s | 0.514s | 6.18x | 2.662s | 83.8% | 80 → 3 | 77 |
| 1,000 | 100 | 7.483s | 0.586s | 12.77x | 6.897s | 92.2% | 200 → 3 | 197 |
| 10,000 | **1** | **0.073s** | **0.111s** | **0.66x** | **-0.038s** | **-51.9%** | **2 → 3** | **-1** |
| 10,000 | 5 | 0.362s | 0.194s | 1.87x | 0.168s | 46.5% | 10 → 3 | 7 |
| 10,000 | 10 | 1.446s | 0.257s | 5.61x | 1.188s | 82.2% | 20 → 3 | 17 |
| 10,000 | 20 | 1.424s | 0.382s | 3.72x | 1.041s | 73.1% | 40 → 3 | 37 |
| 10,000 | 40 | 3.171s | 0.521s | 6.09x | 2.650s | 83.6% | 80 → 3 | 77 |
| 10,000 | 100 | 10.953s | 1.163s | 9.41x | 9.789s | 89.4% | 200 → 3 | 197 |

**Aggregate Statistics:**

- Average speedup: 4.33x
- Average improvement: 48.7%
- Average jobs saved: 54.2 per operation
- Maximum speedup: 12.77x (100 columns)
- **Regression case: 0.66x for N=1** (the new approach is ~50% slower for a single column)

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com>
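For context on where the savings come from: the old path launched a separate job per string column to find that column's most frequent value, while the new path gets the top value and frequency for every column at once. Below is a minimal sketch of one way to do that in a single pass with plain PySpark; it is illustrative only, not the exact code in this PR, and the sample DataFrame and column names are made up.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [("a", "x"), ("a", "y"), ("b", "y")], ["col1", "col2"]
)

# Unpivot the frame into one row per (column_name, value) pair.
unpivoted = sdf.select(
    F.explode(
        F.array(*[
            F.struct(F.lit(c).alias("column_name"),
                     F.col(c).cast("string").alias("value"))
            for c in sdf.columns
        ])
    ).alias("kv")
).select("kv.column_name", "kv.value")

# Count every (column_name, value) pair, then keep the most frequent value per column.
counts = unpivoted.groupBy("column_name", "value").agg(F.count("*").alias("freq"))
w = Window.partitionBy("column_name").orderBy(F.desc("freq"))
top_values = (
    counts.withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") == 1)
    .drop("rank")
    .collect()  # a single action covering all columns instead of one per column
)
```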
holdenk
left a comment
I like this :) Quick first look, haven't had a chance for a proper review yet but love to see these old TODOs getting fixed :)
Reason for dropping the comment?
```python
# Unfortunately, there's no straightforward way to get the top value and its frequency
# for each column without collecting the data to the driver side.
```
Note for the future: this seems like a good follow-up issue; I think we could do something smarter here long term. I've been thinking about some kind of bounded collection types for aggregations, and this might fit (although tbf `describe` isn't used all that often, I'd love to put these together if we can). The collections do still end up being large, but they stay on the executors and the final driver-side part is a bit smaller.
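A rough sketch of what such a bounded aggregation could look like, assuming a size-capped `Counter` per column merged via `RDD.treeAggregate`; the cap, sample data, and truncation policy are illustrative assumptions, and truncating trades exactness for bounded memory.

```python
from collections import Counter
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("a", "x"), ("a", "y"), ("b", "y")], ["col1", "col2"])

MAX_DISTINCT = 10_000  # hypothetical per-column bound on tracked distinct values

def seq_op(counters, row):
    # Update one bounded counter per column for each row on the executors.
    for i, value in enumerate(row):
        counters[i][value] += 1
        if len(counters[i]) > MAX_DISTINCT:
            # Drop the rarest entries once the bound is exceeded (approximate).
            counters[i] = Counter(dict(counters[i].most_common(MAX_DISTINCT)))
    return counters

def comb_op(left, right):
    # Merge per-partition counters column by column.
    return [a + b for a, b in zip(left, right)]

zero = [Counter() for _ in sdf.columns]
merged = sdf.rdd.treeAggregate(zero, seq_op, comb_op)
tops = [c.most_common(1)[0] for c in merged]  # (top value, freq) per column
```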
cc @gaogaotiantian FYI
```python
)
top_freq_dict = {row.column_name: (row.value, row.freq) for row in top_values}
tops = [str(top_freq_dict[col_name][0]) for col_name in column_names]
freqs = [str(top_freq_dict[col_name][1]) for col_name in column_names]
```
```python
top_freq_dict = {row.column_name: (str(row.value), str(row.freq)) for row in top_values}
tops, freqs = map(list, zip(*(top_freq_dict[col_name] for col_name in column_names)))
```

Maybe less duplication? Not a huge thing though.
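A quick self-contained check of the suggested two-liner, with a `namedtuple` standing in for the Spark `Row` objects returned by `collect()`; the sample values are made up.

```python
from collections import namedtuple

Row = namedtuple("Row", ["column_name", "value", "freq"])
top_values = [Row("col1", "a", 7), Row("col2", "y", 3)]
column_names = ["col1", "col2"]

top_freq_dict = {row.column_name: (str(row.value), str(row.freq)) for row in top_values}
tops, freqs = map(list, zip(*(top_freq_dict[col_name] for col_name in column_names)))

assert tops == ["a", "y"] and freqs == ["7", "3"]
```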
The improvement for multiple columns is great. A few questions:
> I generated some benchmarks for the new implementation and compared them against the old implementation. The performance numbers are shown below.

> Aggregate Statistics:
What changes were proposed in this pull request?
Fixes `describe` for string-only DataFrames so that it runs a fixed number of Spark jobs rather than one job per column.
Why are the changes needed?
Performance
Does this PR introduce any user-facing change?
No
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
Co-authored-by: Claude Sonnet 4.5