
Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented May 13, 2016

What changes were proposed in this pull request?

createDataFrame returns inconsistent types for column names.

```python
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField(u"col", StringType())])
>>> df1 = spark.createDataFrame([("a",)], schema)
>>> df1.columns  # "col" is str
['col']
>>> df2 = spark.createDataFrame([("a",)], [u"col"])
>>> df2.columns  # "col" is unicode
[u'col']
```

The reason is that only `StructField` has the following code.

```python
if not isinstance(name, str):
    name = name.encode('utf-8')
```

This PR adds the same logic to `createDataFrame` for consistency.

```python
if isinstance(schema, list):
    schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
```
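
To make the behavior concrete, here is a minimal standalone sketch of the same normalization; the `normalize_column_names` helper is hypothetical, for illustration only (on Python 2, `encode('utf-8')` turns a `unicode` name into a byte `str`):

```python
# Hypothetical helper mirroring the patch: encode unicode column names
# to utf-8 byte strings so they match what StructField produces (Py2).
def normalize_column_names(names):
    return [n.encode('utf-8') if not isinstance(n, str) else n
            for n in names]

print(normalize_column_names([u"col", "other"]))
# Python 2: ['col', 'other'] -- both entries are now plain str
```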

How was this patch tested?

Pass the Jenkins tests (with a new Python doctest).

@SparkQA

SparkQA commented May 13, 2016

Test build #58557 has finished for PR 13097 at commit c47b12b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 13, 2016

Test build #58565 has finished for PR 13097 at commit 406a53d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-15244][PYSPARK] Type of column name created with createDataFrame is not consistent." to "[SPARK-15244][PYTHON] Type of column name created with createDataFrame is not consistent." on May 13, 2016
@dongjoon-hyun
Member Author

Hi, @andrewor14.
Could you review this PR?

@SparkQA

SparkQA commented May 16, 2016

Test build #58646 has finished for PR 13097 at commit 66a0bf5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Looks OK, also cc @davies

@dongjoon-hyun
Member Author

Thank you for the review, @andrewor14.

Contributor

We should put this into unit tests.

Member Author

Oh, it's already tested by a Python doctest run by runtest.py.
Do you mean I need to add this to a separate unit-test Python file?

Contributor

Yes, these will become part of the API docs as examples; it's annoying to see many corner cases here.

Member Author

I see. Sorry, but could you tell me the location of the unit-test Python file?

Member Author

Maybe tests.py?

@davies
Contributor

davies commented May 16, 2016

LGTM, only one comment.

@dongjoon-hyun
Member Author

Thank you for the review, @davies. I'll fix it within an hour.

@SparkQA

SparkQA commented May 16, 2016

Test build #58654 has finished for PR 13097 at commit e994d12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2016

Test build #58655 has finished for PR 13097 at commit 50d3ee4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Thank you, @andrewor14 and @davies.
I moved the test case into sql/tests.py, and it passed the tests again.
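
For reference, a minimal sketch of what such a test in sql/tests.py could look like; the test name and session setup below are illustrative assumptions, not the exact code that was merged:

```python
# Illustrative sketch only; the merged test may differ in name and setup.
from pyspark.sql import SparkSession

def test_column_name_encoding():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    try:
        # A schema given as a list of unicode names should yield plain
        # str column names, matching the StructField code path.
        df = spark.createDataFrame([("a",)], [u"col"])
        assert all(isinstance(c, str) for c in df.columns)
    finally:
        spark.stop()
```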

@dongjoon-hyun
Member Author

Hi, @davies.
Could you merge this PR?

@davies
Contributor

davies commented May 17, 2016

LGTM.
Merging this into master and 2.0, thanks!

asfgit pushed a commit that referenced this pull request May 17, 2016
…me is not consistent.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13097 from dongjoon-hyun/SPARK-15244.

(cherry picked from commit 0f576a5)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
@asfgit asfgit closed this in 0f576a5 May 17, 2016
@dongjoon-hyun
Member Author

Thank you, @davies!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-15244 branch July 20, 2016 07:35
ghost pushed a commit to dbtsai/spark that referenced this pull request Nov 15, 2017
…ateDataFrame with Arrow

## What changes were proposed in this pull request?

If a schema is passed as a list of unicode strings for column names, they should be re-encoded to 'utf-8' to be consistent. This is similar to apache#13097, but for the creation of a DataFrame using Arrow.

## How was this patch tested?

Added a new test using unicode names for the schema.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#19738 from BryanCutler/arrow-createDataFrame-followup-unicode-SPARK-20791.
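
For context, a hedged sketch of the Arrow-enabled path that follow-up targets; the config key is the Spark 2.3-era `spark.sql.execution.arrow.enabled`, and the session setup here is an assumption for illustration:

```python
# Sketch of createDataFrame from pandas with Arrow enabled (Spark >= 2.3,
# PyArrow installed); unicode column names should round-trip consistently.
import pandas as pd
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())

pdf = pd.DataFrame({"a": ["x"]})
df = spark.createDataFrame(pdf, schema=[u"col"])
print(df.columns)  # ['col'], with a consistent string type
spark.stop()
```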