[SPARK-56607][PYTHON][FOLLOWUP] Use pyspark.sql.DataFrame to support connect-only #55630

Closed
gaogaotiantian wants to merge 1 commit into apache:master from gaogaotiantian:fix-mlutils

@gaogaotiantian
Contributor

What changes were proposed in this pull request?

Use pyspark.sql.DataFrame, not the classic one, in mlutils.py.

Why are the changes needed?

We have a connect-only CI job that does not even have the classic DataFrame class. This util should work with the Connect DataFrame too.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

test_pipeline and test_parity_pipeline passed locally.

Was this patch authored or co-authored using generative AI tooling?

No.

@gaogaotiantian
Contributor Author

gaogaotiantian commented Apr 30, 2026

@HyukjinKwon you were right in #55526 (comment): I should not use classic.DataFrame directly, because this util is also used by Connect. We should still override __new__ like the others did.

https://github.com/apache/spark/actions/runs/25129704754
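
For readers unfamiliar with the `__new__` point above, here is a minimal sketch of the pattern. It uses stand-in classes rather than real PySpark imports, so it runs anywhere. It assumes the public DataFrame class dispatches construction to the classic implementation in its `__new__`, which is unavailable in a connect-only install. The names `DataFrame` (the stand-in) and `MockDataset` are illustrative and are not taken from the actual diff.

```python
class DataFrame:
    """Stand-in for a public, backend-agnostic DataFrame class (not PySpark's)."""

    def __new__(cls, *args, **kwargs):
        # Assumption: the public class forwards construction to the classic
        # backend. Here we simulate that backend being absent, as it would be
        # in a connect-only environment.
        raise ImportError("classic backend is not available in this install")


class MockDataset(DataFrame):
    """Hypothetical test double that subclasses the public class."""

    def __new__(cls, *args, **kwargs):
        # Override __new__ so the parent's backend dispatch is bypassed and
        # a plain instance of the mock is created instead.
        return object.__new__(cls)

    def __init__(self):
        self.index = 0


# The mock can be built even though the base class cannot.
dataset = MockDataset()
```

Because the mock inherits from the public class rather than a backend-specific one, `isinstance` checks against the public class still pass, and nothing backend-specific has to be importable.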

@HyukjinKwon
Member

Merged to master.


2 participants