[SPARK-50324][PYTHON][CONNECT] Make createDataFrame trigger Config RPC at most once
#48856
Conversation
LGTM thank you!
Shall we just get all the confs in batch eagerly? Seems like we should get the conf once anyway.
There are 2 cases in which we don't need the configs:
1. the local data is empty and the schema is specified, so it returns a valid empty DataFrame;
2. the creation fails due to some assertions.
HyukjinKwon left a comment
I think another way of doing this is to maintain a sized dictionary on the Python side, and cache the retrieved values there. e.g.,
- spark.get("a")
  - look up cached["a"]
  - if not cached, spark.get("a") and store cached["a"] = v
- spark.set("a", "aa")
  - evict the cache entry cached["a"]
  - spark.set("a", "aa")
and create a dictionary with a TTL and max size
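The suggestion above could be sketched roughly as follows. This is an illustrative sketch only, not Spark code: the class name `ConfigCache` and the `fetch` callback are hypothetical stand-ins for whatever would issue the real Config RPC in Spark Connect or read the conf in Spark Classic.

```python
import time
from collections import OrderedDict

class ConfigCache:
    """A dictionary with a max size and per-entry TTL (hypothetical sketch)."""

    def __init__(self, max_size=64, ttl=60.0):
        self._data = OrderedDict()  # key -> (value, expiry timestamp)
        self._max_size = max_size
        self._ttl = ttl

    def get(self, key, fetch):
        """Return the cached value for key, calling fetch(key) on miss or expiry."""
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            self._data.move_to_end(key)  # keep LRU ordering
            return entry[0]
        value = fetch(key)  # e.g. a real Config RPC in Spark Connect
        self._data[key] = (value, time.monotonic() + self._ttl)
        if len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value

    def invalidate(self, key):
        """Drop the cached entry, e.g. after a conf.set(key, ...) call."""
        self._data.pop(key, None)
```

With such a cache, repeated `get` calls for the same key within the TTL would hit the dictionary instead of triggering one RPC each, and `set` would invalidate the stale entry before writing through.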
which will work for both Spark Classic and Spark Connect.
A problem is that
It seems we don't need this helper class to achieve the goal; will have another try.
thanks, merged to master
What changes were proposed in this pull request?
Get all configs in batch
Why are the changes needed?
There are many related configs in createDataFrame, and they are fetched one by one (or group by group) in different branches:
1. it is possible that no Config RPC is triggered at all, e.g. in this branch:
spark/python/pyspark/sql/connect/session.py
Lines 502 to 509 in 2633035
2. multiple Config RPCs may be triggered for different configs, e.g. in this branch:
spark/python/pyspark/sql/connect/session.py
Lines 599 to 601 in 2633035
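The difference can be sketched with a mock client. This is not the actual patch: `MockClient.get_configs` is a hypothetical stand-in for the Connect client's config lookup, used only to show how batching the keys reduces the RPC count from one-per-key to one total.

```python
class MockClient:
    """Hypothetical stand-in for a Spark Connect client's config lookup."""

    def __init__(self, confs):
        self._confs = confs
        self.rpc_count = 0  # count simulated Config RPCs

    def get_configs(self, *keys):
        self.rpc_count += 1  # one RPC regardless of how many keys are asked
        return tuple(self._confs.get(k) for k in keys)

def fetch_one_by_one(client, keys):
    # Before: each branch asks for its own config, one RPC per key.
    return {k: client.get_configs(k)[0] for k in keys}

def fetch_in_batch(client, keys):
    # After: collect every key the code paths may need and ask once.
    return dict(zip(keys, client.get_configs(*keys)))
```

With N keys, the one-by-one path issues N RPCs while the batched path issues exactly one, which is the round-trip saving this change targets.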
Does this PR introduce any user-facing change?
no
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
no