Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-30257] [PySpark] Add simpleString map #26884

Closed
wants to merge 7 commits into from

Conversation

vanhooser
Copy link

@vanhooser vanhooser commented Dec 13, 2019

What changes were proposed in this pull request?

  • Adds a simple map function from simpleString to the equivalent Spark SQL type

Why are the changes needed?

  • Mapping results of dtype() to equivalent Spark SQL types is annoying

Does this PR introduce any user-facing change?

  • Exposes new method to PySpark API

How was this patch tested?

https://issues.apache.org/jira/browse/SPARK-30257

@vanhooser vanhooser changed the title Add simpleString map [SPARK-30257] Add simpleString map Dec 13, 2019
@vanhooser vanhooser changed the title [SPARK-30257] Add simpleString map [SPARK-30257] [PySpark] Add simpleString map Dec 13, 2019
@dongjoon-hyun
Copy link
Member

ok to test

@dongjoon-hyun
Copy link
Member

Thank you for making your first PR, @vanhooser .

@SparkQA
Copy link

SparkQA commented Dec 14, 2019

Test build #115324 has finished for PR 26884 at commit ed9f39c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanhooser
Copy link
Author

No problem @dongjoon-hyun hopefully it isn’t my last. Lmk if any revisions needed.

@SparkQA
Copy link

SparkQA commented Dec 14, 2019

Test build #115330 has finished for PR 26884 at commit 2c41d9c.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 14, 2019

Test build #115331 has finished for PR 26884 at commit c708f59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1603,6 +1627,13 @@ def convert(self, obj, gateway_client):
t.setNanos(obj.microsecond * 1000)
return t


def from_simple_string(simple_str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it an API? I don't think this is a particularily useful. I wouldn't add it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, doesn't seem useful enough; when would you need this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When scanning over a DataFrame’s columns, I very often use dtypes() to get both column and type info.

Therefore, if I want to take this information and create more columns or synthesize other dataframes, I have to map from simpleString to actual type manually.

This leads me to make dictionaries in many places that do what’s contained in this PR. This will make it much easier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Spark DataFrame exposes .schema, which would already give you all fields and their types. Is that what you're talking about?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the dtypes method. This unwraps the struct for you and makes it easy to scan over a schema.

schema() is useful for full manipulation but I personally find dtypes used more often by my users. The simplicity right now is limited to schema -> Python list of string descriptions, the reverse for string descriptions -> schema is what’s added here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you need the type objects, that's not what dtypes is for - you do want schema. It's available in Python. dtypes is really just mapping those objects to strings anyway.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I know what the schema() method is for, my point here is dtypes() is more convenient for users as it makes it quite convenient and easy to scan a schema.

I understand philosophically schema() is more correct, but I don’t think I’m the end users care if it’s more correct if it’s more annoying to use.

dtypes() is less annoying to users in practice

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say I can speak for all users but I don't particularly see why dtypes is easier or more common to use, esp. if it's not giving you what you need in this case. I wouldn't add this method but wouldn't object to someone else merging it.

@vanhooser
Copy link
Author

Thoughts here @dongjoon-hyun ?

@srowen srowen closed this Mar 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants