[SPARK-32846][SQL][PYTHON] Support createDataFrame from an RDD of pd.DataFrames #29719
Closed · linar-jether wants to merge 6 commits into apache:master from linar-jether:pandas-rdd-to-spark-df-SPARK-3284
Commits (6)
c2d6892 Initial commit (linar-jether)
823dbbf Reuse the ArrowStreamPandasSerializer and fix test & linter issues (linar-jether)
f045838 Fix doctests (linar-jether)
22aace8 Merge pull request #1 from apache/master (linar-jether)
d88aad6 Merge branch 'master' of https://github.com/linar-jether/spark into p… (linar-jether)
b324a5d Merge branch 'master' into pandas-rdd-to-spark-df-SPARK-3284 (linar-jether)
Conversations
Would it make sense to do type checking here instead of having a flag?
Can we somehow define/get the type of the RDD[py-object] without evaluating its first element?
If not, the RDD might contain any type of object, so the pandasRDD option is used as a way to differentiate between initialization from a plain RDD and an RDD of pd.DataFrames.
Thank you for reviewing! Please let me know if there's anything else I can do to get this merged.
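For context, the flag-free alternative the reviewer is alluding to might look like the sketch below: peek at the RDD's first element and branch on its type. The helper name `classify_first_element` is hypothetical, not part of this PR; the point is that the peek forces evaluation, which is exactly what the `pandasRDD` flag avoids.

```python
import pandas as pd

def classify_first_element(first):
    """Decide which createDataFrame code path to take from the type of
    the RDD's first element (hypothetical helper; the PR instead uses
    an explicit pandasRDD flag to avoid evaluating any element).
    """
    if isinstance(first, pd.DataFrame):
        return "pandas"   # serialize partitions as Arrow RecordBatches
    return "rows"         # fall back to the regular Row/tuple path
```

Calling this would typically mean `classify_first_element(rdd.first())`, and `rdd.first()` triggers computation of the first partition — the cost under discussion.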
That's a good point. If we look in session.py we can see _createFromRDD does its magic there. Personally I would put this logic inside _inferSchema and toInternal respectively, but I'm coming at this from more of a core-Spark dev perspective; maybe @HyukjinKwon has a different view.
Even if we don't refactor this back into session.py, I'd encourage you to look at session.py and consider structuring this in a similar way so that we don't have to have this flag here.
I agree that this seems to fit well into _inferSchema & _createFromRDD, although we would still need some way to discern between an RDD of DataFrames and other types when the user provides a schema (and we don't want to peek into the first item). Do you think it would be better to move the pandas flag into _createFromRDD?
So _inferSchema does effectively peek into the first element. I think we could just put the logic down inside of the map and then the user doesn't have to specify this flag.
That's in case the user wants to infer the schema (so we have to peek into the RDD), but when the user does specify the schema there's no need to peek, and we're left with no other way to tell which code path we need.
So let's say the user specifies a schema; in that case, inside of _createFromRDD we can just look at the type of each element that we're processing and see if it's a DataFrame, a Row, or a dictionary, and dispatch the logic there. What do you think? Or is there a reason I'm missing why we couldn't do the dispatch inside of _createFromRDD based on type?
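A minimal sketch of the per-element dispatch being proposed: check each element's type inside the map itself, so no user-facing flag is needed and nothing is evaluated eagerly. The helper name `to_internal` and the dispatch rules are illustrative assumptions, not Spark's actual internals.

```python
import pandas as pd

def to_internal(element, schema):
    """Dispatch on element type inside the map, as suggested in the
    review: a pd.DataFrame expands into many rows, while dicts and
    row-like elements map one-to-one. Illustrative only -- not the
    real _createFromRDD logic.
    """
    if isinstance(element, pd.DataFrame):
        # one DataFrame element expands to many internal row tuples
        return [tuple(row) for row in element.itertuples(index=False)]
    if isinstance(element, dict):
        # project dict fields in schema order
        return [tuple(element[f] for f in schema)]
    # assume an indexable Row/tuple-like element
    return [tuple(element)]

# usage would be something like:
# rdd.flatMap(lambda e: to_internal(e, schema))   # stays fully lazy
```

Because the check happens per element inside `flatMap`, mixed-type RDDs would even work, at the cost of an `isinstance` check per element.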
Well, in case the user specifies a schema the entire process is lazy, so there's no need to evaluate any of the RDD's elements...
If we keep everything lazy and map each element to either a Row or a RecordBatch, we would still need to know which path to take up front, e.g. for RecordBatches we need to call:
and for Rows we need to call: