New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-30257] [PySpark] Add simpleString map #26884
Conversation
ok to test |
Thank you for making your first PR, @vanhooser . |
Test build #115324 has finished for PR 26884 at commit
|
No problem @dongjoon-hyun hopefully it isn’t my last. Lmk if any revisions needed. |
Test build #115330 has finished for PR 26884 at commit
|
Test build #115331 has finished for PR 26884 at commit
|
@@ -1603,6 +1627,13 @@ def convert(self, obj, gateway_client): | |||
t.setNanos(obj.microsecond * 1000) | |||
return t | |||
|
|||
|
|||
def from_simple_string(simple_str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it an API? I don't think this is a particularily useful. I wouldn't add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, doesn't seem useful enough; when would you need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When scanning over a DataFrame’s columns, I very often use dtypes()
to get both column and type info.
Therefore, if I want to take this information and create more columns or synthesize other dataframes, I have to map from simpleString to actual type manually.
This leads me to make dictionaries in many places that do what’s contained in this PR. This will make it much easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A Spark DataFrame exposes .schema, which would already give you all fields and their types. Is that what you're talking about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the dtypes method. This unwraps the struct for you and makes it easy to scan over a schema.
schema()
is useful for full manipulation but I personally find dtypes
used more often by my users. The simplicity right now is limited to schema -> Python list of string descriptions, the reverse for string descriptions -> schema is what’s added here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if you need the type objects, that's not what dtypes
is for - you do want schema
. It's available in Python. dtypes
is really just mapping those objects to strings anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I know what the schema()
method is for, my point here is dtypes()
is more convenient for users as it makes it quite convenient and easy to scan a schema.
I understand philosophically schema()
is more correct, but I don’t think I’m the end users care if it’s more correct if it’s more annoying to use.
dtypes()
is less annoying to users in practice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't say I can speak for all users but I don't particularly see why dtypes is easier or more common to use, esp. if it's not giving you what you need in this case. I wouldn't add this method but wouldn't object to someone else merging it.
Thoughts here @dongjoon-hyun ? |
What changes were proposed in this pull request?
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?
https://issues.apache.org/jira/browse/SPARK-30257