[SPARK-37711][PS] Reduce pandas describe job count from O(N) to O(1) #54370
devin-petersohn wants to merge 3 commits into apache:master from
Conversation
I generated some benchmarks for the new implementation and compared them against the old implementation. The performance numbers are shown below.

- **Row counts:** 1,000 and 10,000
- **Column counts:** 2, 5, 10, 20, 40, 100
- **Data distribution:** Random uniform distribution over 10 distinct values per column
- **Total tests:** 11 configurations (plus the two single-column regression cases shown in bold)

| Rows | Columns | Old Time | New Time | Speedup | Time Saved | Improvement | Jobs (Old→New) | Jobs Saved |
|---------|---------|----------|----------|----------|------------|-------------|----------------|------------|
| 1,000 | **1** | **0.125s** | **0.188s** | **0.66x** | **-0.063s** | **-50.6%** | **2 → 3** | **-1** |
| 1,000 | 2 | 0.226s | 0.233s | 0.97x | -0.007s | -2.9% | 4 → 3 | 1 |
| 1,000 | 5 | 0.501s | 0.225s | 2.23x | 0.276s | 55.1% | 10 → 3 | 7 |
| 1,000 | 10 | 0.861s | 0.351s | 2.46x | 0.511s | 59.3% | 20 → 3 | 17 |
| 1,000 | 20 | 1.539s | 0.418s | 3.68x | 1.120s | 72.8% | 40 → 3 | 37 |
| 1,000 | 40 | 3.176s | 0.514s | 6.18x | 2.662s | 83.8% | 80 → 3 | 77 |
| 1,000 | 100 | 7.483s | 0.586s | 12.77x | 6.897s | 92.2% | 200 → 3 | 197 |
| 10,000 | **1** | **0.073s** | **0.111s** | **0.66x** | **-0.038s** | **-51.9%** | **2 → 3** | **-1** |
| 10,000 | 5 | 0.362s | 0.194s | 1.87x | 0.168s | 46.5% | 10 → 3 | 7 |
| 10,000 | 10 | 1.446s | 0.257s | 5.61x | 1.188s | 82.2% | 20 → 3 | 17 |
| 10,000 | 20 | 1.424s | 0.382s | 3.72x | 1.041s | 73.1% | 40 → 3 | 37 |
| 10,000 | 40 | 3.171s | 0.521s | 6.09x | 2.650s | 83.6% | 80 → 3 | 77 |
| 10,000 | 100 | 10.953s | 1.163s | 9.41x | 9.789s | 89.4% | 200 → 3 | 197 |

**Aggregate Statistics:**

- Average speedup: 4.33x
- Average improvement: 48.7%
- Average jobs saved: 54.2 per operation
- Maximum speedup: 12.77x (100 columns)
- **Regression case: 0.66x for N=1** (the new approach is ~50% slower for a single column)

Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com>
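For context on where the savings come from: the old path launched a separate job per string column to find that column's most frequent value, while the new path gets the top value and frequency for every column at once. Below is a minimal sketch of one way to do that in a single pass with plain PySpark; it is illustrative only, not the exact code in this PR, and the sample DataFrame and column names are made up.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [("a", "x"), ("a", "y"), ("b", "y")], ["col1", "col2"]
)

# Unpivot the frame into one row per (column_name, value) pair.
unpivoted = sdf.select(
    F.explode(
        F.array(*[
            F.struct(F.lit(c).alias("column_name"),
                     F.col(c).cast("string").alias("value"))
            for c in sdf.columns
        ])
    ).alias("kv")
).select("kv.column_name", "kv.value")

# Count every (column_name, value) pair, then keep the most frequent value per column.
counts = unpivoted.groupBy("column_name", "value").agg(F.count("*").alias("freq"))
w = Window.partitionBy("column_name").orderBy(F.desc("freq"))
top_values = (
    counts.withColumn("rank", F.row_number().over(w))
    .filter(F.col("rank") == 1)
    .drop("rank")
    .collect()  # a single action covering all columns instead of one per column
)
```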
holdenk
left a comment
I like this :) Quick first look, haven't had a chance for a proper review yet but love to see these old TODOs getting fixed :)
Reason for dropping the comment?
```python
# Unfortunately, there's no straightforward way to get the top value and its frequency
# for each column without collecting the data to the driver side.
```
Note for the future: this seems like a good follow-up issue; I think we could do something smarter here long term. I've been thinking about some kind of bounded collection types for aggregations, and this might fit (although tbf `describe` isn't used all that often, I'd love to put these together if we can). The collections do still end up being large, but they stay on the executors and the final driver-side part is a bit smaller.
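A rough sketch of what such a bounded aggregation could look like, assuming a size-capped `Counter` per column merged via `RDD.treeAggregate`; the cap, sample data, and truncation policy are illustrative assumptions, and truncating trades exactness for bounded memory.

```python
from collections import Counter
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("a", "x"), ("a", "y"), ("b", "y")], ["col1", "col2"])

MAX_DISTINCT = 10_000  # hypothetical per-column bound on tracked distinct values

def seq_op(counters, row):
    # Update one bounded counter per column for each row on the executors.
    for i, value in enumerate(row):
        counters[i][value] += 1
        if len(counters[i]) > MAX_DISTINCT:
            # Drop the rarest entries once the bound is exceeded (approximate).
            counters[i] = Counter(dict(counters[i].most_common(MAX_DISTINCT)))
    return counters

def comb_op(left, right):
    # Merge per-partition counters column by column.
    return [a + b for a, b in zip(left, right)]

zero = [Counter() for _ in sdf.columns]
merged = sdf.rdd.treeAggregate(zero, seq_op, comb_op)
tops = [c.most_common(1)[0] for c in merged]  # (top value, freq) per column
```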
cc @gaogaotiantian FYI
```python
)
top_freq_dict = {row.column_name: (row.value, row.freq) for row in top_values}
tops = [str(top_freq_dict[col_name][0]) for col_name in column_names]
freqs = [str(top_freq_dict[col_name][1]) for col_name in column_names]
```
```python
top_freq_dict = {row.column_name: (str(row.value), str(row.freq)) for row in top_values}
tops, freqs = map(list, zip(*(top_freq_dict[col_name] for col_name in column_names)))
```

Maybe less duplication? Not a huge thing though.
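A quick self-contained check of the suggested two-liner, with a `namedtuple` standing in for the Spark `Row` objects returned by `collect()`; the sample values are made up.

```python
from collections import namedtuple

Row = namedtuple("Row", ["column_name", "value", "freq"])
top_values = [Row("col1", "a", 7), Row("col2", "y", 3)]
column_names = ["col1", "col2"]

top_freq_dict = {row.column_name: (str(row.value), str(row.freq)) for row in top_values}
tops, freqs = map(list, zip(*(top_freq_dict[col_name] for col_name in column_names)))

assert tops == ["a", "y"] and freqs == ["7", "3"]
```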
The improvement for multiple columns is great. A few questions:
> I generated some benchmarks for the new implementation and compared them against the old implementation. The performance numbers are shown below.

> Aggregate Statistics:
What changes were proposed in this pull request?
Fixes `describe` for string-only DataFrames so that it runs a fixed number of Spark jobs rather than one job per column.
Why are the changes needed?
Performance
Does this PR introduce any user-facing change?
No
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
Co-authored-by: Claude Sonnet 4.5