[SPARK-47336][SQL][CONNECT] Provide to PySpark a functionality to get estimated size of DataFrame in bytes #46368
Open

SemyonSinchenko wants to merge 19 commits into `apache:master` from `SemyonSinchenko:size_in_bytes_api`.
Conversation
Commits:

- modified: connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala, connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala, connector/connect/common/src/main/protobuf/spark/connect/base.proto, connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAnalyzeHandler.scala, sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
- modified: .gitignore, connector/connect/common/src/main/protobuf/spark/connect/base.proto, connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAnalyzeHandler.scala, sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
- Relation instead of Plan in base.proto; fix broken ids in base.proto; fix corresponding parts in AnalyzeHandler. modified: connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala, connector/connect/common/src/main/protobuf/spark/connect/base.proto, connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAnalyzeHandler.scala
- Update naming following the discussion in JIRA. modified: connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala, connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala, connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala, connector/connect/common/src/main/protobuf/spark/connect/base.proto, connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala, connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAnalyzeHandler.scala
- Small fixes; tests. modified: python/pyspark/sql/classic/dataframe.py, python/pyspark/sql/connect/client/core.py, python/pyspark/sql/connect/dataframe.py, python/pyspark/sql/connect/proto/base_pb2.py, python/pyspark/sql/connect/proto/base_pb2.pyi, python/pyspark/sql/dataframe.py, python/pyspark/sql/tests/connect/test_connect_basic.py, python/pyspark/sql/tests/test_dataframe.py, sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
- modified: python/pyspark/sql/dataframe.py, sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
- Delete an example because it requires data. modified: python/pyspark/sql/connect/dataframe.py, python/pyspark/sql/dataframe.py
HyukjinKwon reviewed (May 7, 2024)
HyukjinKwon reviewed (May 7, 2024)
HyukjinKwon reviewed (May 7, 2024)
zhengruifeng reviewed (May 7, 2024)
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala (outdated review thread, resolved)
- Change from Long to bytes[] in the proto; JVM methods now return BigInteger; on the Python side, the conversion from BigInteger to int goes through bytes[]; drop .dir-locals.el from .gitignore; rename _sizeInBytes to _size_in_bytes on the Python side. modified: .gitignore, connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala, connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala, connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala, connector/connect/common/src/main/protobuf/spark/connect/base.proto, connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAnalyzeHandler.scala, python/pyspark/sql/classic/dataframe.py, python/pyspark/sql/connect/client/core.py, python/pyspark/sql/connect/dataframe.py, python/pyspark/sql/connect/proto/base_pb2.py, python/pyspark/sql/connect/proto/base_pb2.pyi, python/pyspark/sql/dataframe.py, sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
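The BigInteger-to-int conversion this commit describes can be sketched in plain Python: Java's `BigInteger.toByteArray()` produces a big-endian two's-complement byte array, which Python decodes directly with `int.from_bytes`. The helper name below is illustrative, not the actual code from the PR.

```python
def size_from_big_integer_bytes(raw: bytes) -> int:
    """Decode a big-endian two's-complement byte array (the format
    produced by Java's BigInteger.toByteArray()) into a Python int."""
    return int.from_bytes(raw, byteorder="big", signed=True)

# Round-trip a large size estimate; both BigInteger and Python's int
# are arbitrary precision, so nothing overflows on either side.
raw = (70368744177664).to_bytes(7, byteorder="big", signed=True)
print(size_from_big_integer_bytes(raw))  # 70368744177664
```

Carrying the value as bytes rather than a fixed-width Long means a plan statistic larger than 2^63 - 1 bytes can still cross the wire intact.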
- modified: connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala, connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala, connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectAnalyzeHandler.scala
- modified: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
- modified: connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala
- modified: python/pyspark/sql/classic/dataframe.py
New changes:

- Resolve conflicts; regenerate Python proto classes
@HyukjinKwon Sorry for tagging, but could you please take a look again? Thanks in advance!
I updated the docstring for the `sizeInBytes` method of DataFrame. modified: python/pyspark/sql/connect/proto/base_pb2.py, python/pyspark/sql/dataframe.py
Changes from the last two commits (actual changes marked in bold):

- modified: connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala, connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
@HyukjinKwon I'm sorry for tagging you again, but could you take a look? Thanks in advance!
What changes were proposed in this pull request?

In PySpark Connect there is no access to the JVM to call `queryExecution().optimizedPlan.stats`, so the only way to get the size-in-bytes information from the plan is to parse the output of `explain` with regular expressions. This PR fills that gap by providing a `sizeInBytesApproximation` method in the JVM, PySpark Classic, and PySpark Connect APIs. Under the hood it is just a call to `queryExecution().optimizedPlan.stats.sizeInBytes`. The JVM and PySpark Classic APIs were updated only to keep parity.

- `Dataset.scala` in JVM Connect: add the new API
- `Dataset.scala` in JVM Classic: add the new API
- `dataframe.py` in sql: add the signature and docstring of the new API
- `dataframe.py` in connect: add the implementation of the new API
- `dataframe.py` in classic: add the implementation of the new API
- `base.proto`: add a new message in the `AnalyzeRequest` / `AnalyzeResponse` part
- `SparkConnectAnalyzeHandler`: extend the `match` and add the call to `queryExecution`
- `SparkConnectClient`: add a new method that builds the new request
- `SparkSession`: add a call to the client and parse the response

Why are the changes needed?

To give PySpark Connect users the ability to get a DataFrame size estimate at runtime without forcing them to parse the string output of `df.explain`. The other changes are needed to keep parity across Connect / Classic and PySpark / JVM Spark.

Does this PR introduce any user-facing change?

Only a new API. The new API is mostly for PySpark Connect users.
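The motivation can be made concrete: without such an API, a Connect user has to scrape the optimizer statistics out of the text that cost-mode `explain` prints. Below is a sketch of that fragile workaround, run against a sample string resembling Spark's cost-mode output; the sample text and the helper function are illustrative assumptions, not code from the PR, and real explain output varies by plan and Spark version.

```python
import re

# Sample text resembling what df.explain(mode="cost") prints;
# constructed here for illustration, not captured from a real run.
sample_explain = """== Optimized Logical Plan ==
Project [id#0L], Statistics(sizeInBytes=16.0 KiB)
+- Range (0, 2048, step=1, splits=Some(8)), Statistics(sizeInBytes=16.0 KiB)
"""

_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

def parse_size_in_bytes(explain_text: str) -> int:
    """Scrape the first sizeInBytes statistic out of cost-mode explain text."""
    m = re.search(r"sizeInBytes=([\d.]+)\s*(B|KiB|MiB|GiB|TiB)", explain_text)
    if m is None:
        raise ValueError("no sizeInBytes statistic found")
    return int(float(m.group(1)) * _UNITS[m.group(2)])

print(parse_size_in_bytes(sample_explain))  # 16384
```

A direct API call avoids both the string round trip and the dependence on the exact formatting of the explain output, which this regex silently assumes.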
How was this patch tested?
Because the actual logic lives in `queryExecution`, I added tests only for the syntax and the calls. The tests check that, for a DataFrame, the returned size is greater than zero.

Was this patch authored or co-authored using generative AI tooling?
No.
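The test strategy described above (assert only that the call succeeds and yields a positive value, since the estimate itself comes from `queryExecution`) can be sketched with a stand-in object. The stub class and its return value are hypothetical; the real tests run against a live SparkSession.

```python
class StubDataFrame:
    """Stand-in for a live DataFrame: returns a fixed estimate where
    the real API would consult the optimizer's sizeInBytes statistic."""

    def sizeInBytesApproximation(self) -> int:
        return 1024  # placeholder estimate in bytes

def check_size_api(df) -> int:
    size = df.sizeInBytesApproximation()
    # Mirror the PR's test strategy: the estimate is an int and positive.
    assert isinstance(size, int)
    assert size > 0
    return size

print(check_size_api(StubDataFrame()))  # 1024
```

Testing only the shape of the result keeps the tests independent of the optimizer's actual numbers, which depend on data sources and statistics configuration.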
@grundprinzip We discussed this ticket with you; could you please take a look? Thanks!