
Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented May 13, 2016

What changes were proposed in this pull request?

createDataFrame returns inconsistent types for column names.

```python
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField(u"col", StringType())])
>>> df1 = spark.createDataFrame([("a",)], schema)
>>> df1.columns  # "col" is str
['col']
>>> df2 = spark.createDataFrame([("a",)], [u"col"])
>>> df2.columns  # "col" is unicode
[u'col']
```

The reason is that only `StructField` has the following code.

```python
if not isinstance(name, str):
    name = name.encode('utf-8')
```

This PR adds the same logic to `createDataFrame` for consistency.

```python
if isinstance(schema, list):
    schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
```
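
To make the behavior concrete, here is a minimal standalone sketch of the same normalization; the `normalize_column_names` helper is hypothetical, for illustration only (on Python 2, `encode('utf-8')` turns a `unicode` name into a byte `str`):

```python
# Hypothetical helper mirroring the patch: encode unicode column names
# to utf-8 byte strings so they match what StructField produces (Py2).
def normalize_column_names(names):
    return [n.encode('utf-8') if not isinstance(n, str) else n
            for n in names]

print(normalize_column_names([u"col", "other"]))
# Python 2: ['col', 'other'] -- both entries are now plain str
```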

How was this patch tested?

Pass the Jenkins tests (with a new Python doctest).

@SparkQA

SparkQA commented May 13, 2016

Test build #58557 has finished for PR 13097 at commit c47b12b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 13, 2016

Test build #58565 has finished for PR 13097 at commit 406a53d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-15244][PYSPARK] Type of column name created with createDataFrame is not consistent." to "[SPARK-15244][PYTHON] Type of column name created with createDataFrame is not consistent." on May 13, 2016
@dongjoon-hyun
Member Author

Hi, @andrewor14.
Could you review this PR?

@SparkQA

SparkQA commented May 16, 2016

Test build #58646 has finished for PR 13097 at commit 66a0bf5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Looks OK, also cc @davies

@dongjoon-hyun
Member Author

Thank you for the review, @andrewor14.

Contributor

We should put this into unit tests.

Member Author

Oh, it's already tested by a Python doctest run by runtest.py.
Do you mean I need to add this to a separate unit-test Python file?

Contributor

Yes, these will become part of the API docs as examples; it's annoying to see many corner cases here.

Member Author

I see. Sorry, but could you tell me the location of the unit-test Python file?

Member Author

Maybe tests.py?

@davies
Contributor

davies commented May 16, 2016

LGTM, only one comment.

@dongjoon-hyun
Member Author

Thank you for the review, @davies. I'll fix it within an hour.

@SparkQA

SparkQA commented May 16, 2016

Test build #58654 has finished for PR 13097 at commit e994d12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2016

Test build #58655 has finished for PR 13097 at commit 50d3ee4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Thank you, @andrewor14 and @davies.
I moved the test case into sql/tests.py, and it passed the tests again.
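
For reference, a minimal sketch of what such a test in sql/tests.py could look like; the test name and session setup below are illustrative assumptions, not the exact code that was merged:

```python
# Illustrative sketch only; the merged test may differ in name and setup.
from pyspark.sql import SparkSession

def test_column_name_encoding():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    try:
        # A schema given as a list of unicode names should yield plain
        # str column names, matching the StructField code path.
        df = spark.createDataFrame([("a",)], [u"col"])
        assert all(isinstance(c, str) for c in df.columns)
    finally:
        spark.stop()
```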

@dongjoon-hyun
Member Author

Hi, @davies.
Could you merge this PR?

@davies
Contributor

davies commented May 17, 2016

LGTM.
Merging this into master and 2.0, thanks!

asfgit pushed a commit that referenced this pull request May 17, 2016
…me is not consistent.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13097 from dongjoon-hyun/SPARK-15244.

(cherry picked from commit 0f576a5)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
@asfgit asfgit closed this in 0f576a5 May 17, 2016
@dongjoon-hyun
Member Author

Thank you, @davies!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-15244 branch July 20, 2016 07:35
ghost pushed a commit to dbtsai/spark that referenced this pull request Nov 15, 2017
…ateDataFrame with Arrow

## What changes were proposed in this pull request?

If a schema is passed as a list of unicode strings for column names, they should be re-encoded to 'utf-8' to be consistent. This is similar to apache#13097, but for the creation of a DataFrame using Arrow.

## How was this patch tested?

Added a new test using unicode names for the schema.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#19738 from BryanCutler/arrow-createDataFrame-followup-unicode-SPARK-20791.
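
For context, a hedged sketch of the Arrow-enabled path that follow-up targets; the config key is the Spark 2.3-era `spark.sql.execution.arrow.enabled`, and the session setup here is an assumption for illustration:

```python
# Sketch of createDataFrame from pandas with Arrow enabled (Spark >= 2.3,
# PyArrow installed); unicode column names should round-trip consistently.
import pandas as pd
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())

pdf = pd.DataFrame({"a": ["x"]})
df = spark.createDataFrame(pdf, schema=[u"col"])
print(df.columns)  # ['col'], with a consistent string type
spark.stop()
```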