[SPARK-15244][PYTHON] Type of column name created with createDataFrame is not consistent. #13097
|
Test build #58557 has finished for PR 13097 at commit
|
|
Test build #58565 has finished for PR 13097 at commit
|
|
Hi, @andrewor14 . |
|
Test build #58646 has finished for PR 13097 at commit
|
|
Looks OK, also cc @davies |
|
Thank you for review, @andrewor14 . |
python/pyspark/sql/session.py
We should put this into unit tests.
Oh, it's already covered by the Python doctest, which runs through runtest.py.
Do you mean I need to add this into another separate unit test python file?
Yes, these will become part of the API docs as examples, and it's annoying to see many corner cases there.
I see. Sorry, but could you tell me the location of the unit test python file?
Maybe, tests.py?
|
LGTM, only one comment. |
|
Thank you for review, @davies . I'll fix in an hour. |
…me is not consistent.
|
Test build #58654 has finished for PR 13097 at commit
|
|
Test build #58655 has finished for PR 13097 at commit
|
|
Thank you, @andrewor14 and @davies . |
|
Hi, @davies . |
|
LGTM, |
…me is not consistent.
## What changes were proposed in this pull request?
**createDataFrame** returns inconsistent types for column names.
```python
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField(u"col", StringType())])
>>> df1 = spark.createDataFrame([("a",)], schema)
>>> df1.columns # "col" is str
['col']
>>> df2 = spark.createDataFrame([("a",)], [u"col"])
>>> df2.columns # "col" is unicode
[u'col']
```
The reason is that only **StructField** has the following code.
```
if not isinstance(name, str):
    name = name.encode('utf-8')
```
This PR adds the same logic into **createDataFrame** for consistency.
```
if isinstance(schema, list):
    schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
```
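As a minimal, Spark-free sketch of the list branch above: the helper name `normalize_column_names` is hypothetical and is not part of PySpark. It only illustrates the re-encoding step. On Python 2 it turns `unicode` names into `str`; on Python 3 every name is already `str`, so the list passes through unchanged.

```python
def normalize_column_names(schema):
    # Hypothetical helper (not a PySpark API) mirroring the list branch above.
    if isinstance(schema, list):
        # Encode any name that is not a native `str` (Python 2 `unicode`) to UTF-8.
        return [x if isinstance(x, str) else x.encode('utf-8') for x in schema]
    # Anything that is not a list (for example a StructType) is returned untouched.
    return schema


names = normalize_column_names([u"col", "other"])
print([type(n).__name__ for n in names])  # ['str', 'str']
```

With this normalization, column names given as a plain list end up with the same type as names that went through **StructField**.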
## How was this patch tested?
Pass the Jenkins test (with new python doctest)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13097 from dongjoon-hyun/SPARK-15244.
(cherry picked from commit 0f576a5)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
|
Thank you, @davies ! |
…ateDataFrame with Arrow
## What changes were proposed in this pull request?
If the schema is passed as a list of unicode strings for column names, they should be re-encoded to 'utf-8' to be consistent. This is similar to apache#13097, but for creating a DataFrame using Arrow.
## How was this patch tested?
Added a new test that uses unicode names for the schema.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes apache#19738 from BryanCutler/arrow-createDataFrame-followup-unicode-SPARK-20791.