[SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating dataframe using python #14198

Closed
wants to merge 10 commits

Conversation

zasdfgbnm
Contributor

zasdfgbnm commented Jul 14, 2016

What changes were proposed in this pull request?

Fix bugs about types that result in an array of nulls when creating a DataFrame using Python.

Python's `array.array` has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. The code in spark-sql and pyspark didn't take this into account, which can result in an array of null values when a row contains an `array('f')`.

A simple snippet to reproduce this bug:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext,Row,DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
rows = sc.parallelize([ row1 ])
df = sqlContext.createDataFrame(rows)
df.show()
```

which produces the output

```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

How was this patch tested?

Tested manually.

Python's array has more types than Python itself; for example,
Python only has float while array supports 'f' (float) and 'd' (double).
Switching to array.typecode helps Spark make a better inference.

For example, for the code:

```
from pyspark.sql.types import _infer_type
from array import array
a = array('f',[1,2,3,4,5,6])
_infer_type(a)
```

We get ArrayType(DoubleType,true) before the change,
but ArrayType(FloatType,true) after the change.
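
For reference, the `typecode` attribute that the improved inference reads comes straight from Python's standard `array` module; a quick illustration (item sizes are platform-typical, not guaranteed):

```
from array import array

a = array('f', [1, 2, 3])   # single-precision floats
b = array('d', [1, 2, 3])   # double-precision floats
print(a.typecode, a.itemsize)  # 'f', typically 4 bytes
print(b.typecode, b.itemsize)  # 'd', typically 8 bytes
```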
zasdfgbnm changed the title Fix bugs about types that result an array of null when creating dataframe using python [SPARK-16542] Fix bugs about types that result an array of null when creating dataframe using python Jul 14, 2016
zasdfgbnm changed the title [SPARK-16542] Fix bugs about types that result an array of null when creating dataframe using python [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating dataframe using python Jul 14, 2016
@holdenk
Contributor

holdenk commented Oct 7, 2016

Oh interesting - thanks for working on this @zasdfgbnm, and sorry it's sort of fallen through the cracks. Is this something you are still working on? For PRs to get in you generally need some form of automated tests; let me know if you would like some help adding tests for this issue.

@zasdfgbnm
Contributor Author

zasdfgbnm commented Oct 7, 2016

I'd love to help @holdenk

@zasdfgbnm
Contributor Author

Something to mention: there is still one problem that I'm not sure I solved correctly. Python's array supports unsigned types, but unsigned types are not supported on the JVM. The solution in this PR is to convert unsigned types to a larger type, e.g. unsigned int -> long. I'm not sure whether it would be better to reject the unsigned types in Python and throw an exception instead.
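
A minimal sketch of the widening idea described above (an illustration only, not the PR's actual code): each unsigned typecode maps to the next larger signed Spark SQL type so every possible value still fits.

```
# Illustrative only -- not the PR's actual implementation.
from pyspark.sql.types import ShortType, IntegerType, LongType

unsigned_widening = {
    'B': ShortType(),    # unsigned char (0..255) fits in a signed 16-bit short
    'H': IntegerType(),  # unsigned short fits in a signed 32-bit int
    'I': LongType(),     # unsigned int fits in a signed 64-bit long
    # 'L' (unsigned long) has no larger signed JVM primitive, which is part
    # of why rejecting unsigned typecodes is also worth considering.
}
```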

@zasdfgbnm
Contributor Author

Hi @holdenk, I think I'm done. I created a test for this issue, and from the test I found that Spark has the same issue not only for float but also for byte and short. After several commits, ./python/run-tests --modules=pyspark-sql passes on my machine.

To be clear, only arrays with typecodes b, h, i, l, f, and d are supported. Arrays with typecode u are not supported because it "corresponds to Python's obsolete unicode character"; arrays with typecodes B, H, I, and L are not supported because there are no unsigned types on the JVM; and arrays with typecodes q and Q are not supported because they "are available only if the platform C compiler used to build Python supports C long long", which makes supporting them complicated. For the unsupported typecodes, a TypeError is raised if the user tries to create a DataFrame from one.
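
A rough sketch of the rule just described (the merged implementation may differ in details, e.g. how the platform-dependent 'l' typecode is handled):

```
from pyspark.sql.types import (
    ArrayType, ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType
)

# Only signed integer and floating-point typecodes are accepted.
_array_type_mappings = {
    'b': ByteType(),
    'h': ShortType(),
    'i': IntegerType(),
    'l': LongType(),
    'f': FloatType(),
    'd': DoubleType(),
}

def infer_array_type(arr):
    """Infer the Spark SQL type of an array.array, or raise TypeError."""
    try:
        return ArrayType(_array_type_mappings[arr.typecode], True)
    except KeyError:
        raise TypeError("array typecode %r is not supported" % arr.typecode)
```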

Would you, or any other developer, review my code and get it merged?

@gatorsmile
Member

cc @ueshin

@ueshin
Member

ueshin commented Jun 13, 2017

@holdenk Are you still working on this? If so, could you rebase or merge master to fix the conflicts, please?
Oops, I'm sorry, I pinged the wrong person.

@ueshin
Member

ueshin commented Jun 13, 2017

@zasdfgbnm Are you still working on this? If so, could you rebase or merge master to fix the conflicts, please?

@gatorsmile
Member

We are closing it due to inactivity. Please reopen it if you want to push it forward. Thanks!

@zasdfgbnm
Contributor Author

@ueshin @gatorsmile I'm happy to resolve the conflicts IF AND ONLY IF a developer will work on the code review for this. This PR was opened more than a year ago and I have been waiting for a review for a year. If it is guaranteed that a reviewer will be assigned for this soon, I will resolve the conflicts. Otherwise, I don't want to maintain a PR forever just waiting for review.

asfgit closed this in b32bd00 Jun 27, 2017
@felixcheung
Member

@zasdfgbnm I think you can ping @ueshin to review.
This sounds important to have. Ping me if it falls through.

@gatorsmile
Member

@zasdfgbnm Please reopen the PR and @ueshin can help review your PR. Thanks!

@zasdfgbnm
Contributor Author

reopened at #18444

@gatorsmile
Member

retest this please

asfgit pushed a commit that referenced this pull request Jul 20, 2017
… of null when creating DataFrame using python

## What changes were proposed in this pull request?
This reopens #14198, with merge conflicts resolved.

ueshin Could you please take a look at my code?

Fix bugs about types that result in an array of nulls when creating a DataFrame using Python.

Python's `array.array` has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. The code in spark-sql and pyspark didn't take this into account, which can result in an array of null values when a row contains an `array('f')`.

A simple snippet to reproduce this bug:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext,Row,DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
rows = sc.parallelize([ row1 ])
df = sqlContext.createDataFrame(rows)
df.show()
```

which produces the output

```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

## How was this patch tested?

New test case added

Author: Xiang Gao <qasdfgtyuiop@gmail.com>
Author: Gao, Xiang <qasdfgtyuiop@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>

Closes #18444 from zasdfgbnm/fix_array_infer.