[SPARK-44815][CONNECT]Cache df.schema to avoid extra RPC #42499

Closed
wants to merge 5 commits into from

grundprinzip
Contributor

What changes were proposed in this pull request?

This patch caches the result of the df.schema call on the DataFrame, avoiding an extra round trip to the Spark Connect service each time the columns or the schema are retrieved. Because a DataFrame is immutable, its schema does not change, so the cached value stays valid.
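
The caching idea described above can be illustrated with a minimal, self-contained sketch. This is not the code from the patch: `FakeClient`, its `schema` method, the `_cached_schema` attribute, and the column names are all hypothetical stand-ins chosen for illustration, and the real Spark Connect client and DataFrame classes differ.

```python
from typing import List, Optional


class FakeClient:
    """Hypothetical stand-in for a Spark Connect client; counts schema RPCs."""

    def __init__(self) -> None:
        self.schema_calls = 0

    def schema(self, plan: str) -> List[str]:
        # Each call here represents one round trip to the service.
        self.schema_calls += 1
        return ["id", "name"]  # made-up column names


class DataFrame:
    """Illustrative immutable DataFrame that fetches its schema at most once."""

    def __init__(self, plan: str, client: FakeClient) -> None:
        self._plan = plan
        self._client = client
        self._cached_schema: Optional[List[str]] = None  # assumed attribute name

    @property
    def schema(self) -> List[str]:
        # The plan never changes, so the first result can be reused.
        if self._cached_schema is None:
            self._cached_schema = self._client.schema(self._plan)
        return self._cached_schema

    @property
    def columns(self) -> List[str]:
        # Derived from the cached schema, so no additional RPC is needed.
        return list(self.schema)


client = FakeClient()
df = DataFrame("plan", client)
df.schema
df.columns
df.schema
print(client.schema_calls)
```

In this sketch, repeated access to `schema` or `columns` on the same object reuses the first response. A transformation that produces a new DataFrame would start with an empty cache, since its plan differs.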

Why are the changes needed?

Performance and stability: repeated schema or column lookups no longer trigger an RPC each time.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests.

@grundprinzip grundprinzip changed the title [SPARK-44815] Cache df.schema to avoid extra RPC [SPARK-44815][CONNECT]Cache df.schema to avoid extra RPC Aug 15, 2023
@hvanhovell hvanhovell reopened this Jan 24, 2024
# Conflicts:
#	python/pyspark/sql/connect/dataframe.py
Contributor

@hvanhovell hvanhovell left a comment

LGTM

@hvanhovell
Contributor

Merging to master.

@hvanhovell hvanhovell closed this in 6f87fe2 Feb 2, 2024
2 participants