[SPARK-30257] [PySpark] Add simpleString map #26884

vanhooser · 2019-12-13T19:06:16Z

What changes were proposed in this pull request?

Adds a simple map function from simpleString to the equivalent Spark SQL type

Why are the changes needed?

Mapping the results of dtypes back to the equivalent Spark SQL types by hand is annoying

Does this PR introduce any user-facing change?

Exposes new method to PySpark API

How was this patch tested?

https://issues.apache.org/jira/browse/SPARK-30257

dongjoon-hyun · 2019-12-14T01:46:31Z

ok to test

dongjoon-hyun · 2019-12-14T01:46:44Z

Thank you for making your first PR, @vanhooser.

SparkQA · 2019-12-14T02:13:58Z

Test build #115324 has finished for PR 26884 at commit ed9f39c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanhooser · 2019-12-14T04:03:45Z

No problem, @dongjoon-hyun, hopefully it isn’t my last. Let me know if any revisions are needed.

python/pyspark/sql/types.py

SparkQA · 2019-12-14T12:44:53Z

Test build #115330 has finished for PR 26884 at commit 2c41d9c.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-14T13:41:11Z

Test build #115331 has finished for PR 26884 at commit c708f59.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-12-15T08:19:16Z

python/pyspark/sql/types.py

@@ -1603,6 +1627,13 @@ def convert(self, obj, gateway_client):
        t.setNanos(obj.microsecond * 1000)
        return t

+
+def from_simple_string(simple_str):


Is it an API? I don't think this is particularly useful. I wouldn't add it.

srowen · 2019-12-15

Yeah, doesn't seem useful enough; when would you need this?

vanhooser · 2019-12-15

When scanning over a DataFrame’s columns, I very often use dtypes to get both column and type info.

So if I want to take this information and create more columns or synthesize other DataFrames, I have to map from simpleString to the actual type manually.

This leads me to build dictionaries in many places that do what this PR contains. Having it in the API will make that much easier.

srowen · 2019-12-15

A Spark DataFrame exposes .schema, which would already give you all fields and their types. Is that what you're talking about?

vanhooser · 2019-12-15

No, I mean dtypes. It unwraps the struct for you and makes it easy to scan over a schema.

schema is useful for full manipulation, but I personally find that my users reach for dtypes more often. Right now the convenience only goes one way, from schema to a Python list of string descriptions; the reverse, from string descriptions back to types, is what’s added here.

srowen · 2019-12-15

But if you need the type objects, that's not what dtypes is for - you do want schema. It's available in Python. dtypes is really just mapping those objects to strings anyway.

vanhooser · 2019-12-15

Yes, I know what schema is for. My point is that dtypes is more convenient for users because it makes scanning a schema easy.

I understand that, philosophically, schema is more correct, but I don’t think end users care that it’s more correct if it’s more annoying to use.

dtypes is less annoying to users in practice.

srowen · 2019-12-15

I can't speak for all users, but I don't particularly see why dtypes is easier or more common to use, especially if it's not giving you what you need in this case. I wouldn't add this method, but I wouldn't object to someone else merging it.

vanhooser · 2019-12-16T20:20:35Z

Thoughts here @dongjoon-hyun ?

vanhooser added 3 commits December 13, 2019 14:04

Add simpleString map

7cffa9d

Remove cruft method

89a96c9

Add exposed method

aafd3a3

vanhooser changed the title ~~Add simpleString map~~ [SPARK-30257] Add simpleString map Dec 13, 2019

vanhooser changed the title ~~[SPARK-30257] Add simpleString map~~ [SPARK-30257] [PySpark] Add simpleString map Dec 13, 2019

vanhooser added 2 commits December 13, 2019 14:29

Fix type annotation

4f9fd78

Fix linting

ed9f39c

dongjoon-hyun added the PYSPARK label Dec 13, 2019

oliverw1 reviewed Dec 14, 2019

Make more pythonic

2c41d9c

Fix lint

c708f59

HyukjinKwon reviewed Dec 15, 2019

vanhooser requested a review from HyukjinKwon December 16, 2019 19:17

srowen closed this Mar 16, 2020
