[SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating dataframe using python #14198

Closed
wants to merge 10 commits

Conversation

zasdfgbnm
Contributor

zasdfgbnm commented Jul 14, 2016

What changes were proposed in this pull request?

Fix bugs about types that result in an array of nulls when creating a DataFrame using Python.

Python's `array.array` has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. The code in spark-sql and pyspark didn't take this into account, which can result in an array of null values when a row contains an `array('f')`.

A simple snippet to reproduce this bug:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext,Row,DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
rows = sc.parallelize([ row1 ])
df = sqlContext.createDataFrame(rows)
df.show()
```

which produces the output

```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

How was this patch tested?

Tested manually.

Python's array has more types than Python itself; for example,
Python only has float while array supports 'f' (float) and 'd' (double).
Switching to array.typecode helps Spark make a better inference.

For example, for the code:

```
from pyspark.sql.types import _infer_type
from array import array
a = array('f',[1,2,3,4,5,6])
_infer_type(a)
```

We get ArrayType(DoubleType,true) before the change,
but ArrayType(FloatType,true) after the change.
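
For reference, the `typecode` attribute that the improved inference reads comes straight from Python's standard `array` module; a quick illustration (item sizes are platform-typical, not guaranteed):

```
from array import array

a = array('f', [1, 2, 3])   # single-precision floats
b = array('d', [1, 2, 3])   # double-precision floats
print(a.typecode, a.itemsize)  # 'f', typically 4 bytes
print(b.typecode, b.itemsize)  # 'd', typically 8 bytes
```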
zasdfgbnm changed the title Fix bugs about types that result an array of null when creating dataframe using python [SPARK-16542] Fix bugs about types that result an array of null when creating dataframe using python Jul 14, 2016
zasdfgbnm changed the title [SPARK-16542] Fix bugs about types that result an array of null when creating dataframe using python [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating dataframe using python Jul 14, 2016
@holdenk
Contributor

holdenk commented Oct 7, 2016

Oh interesting - thanks for working on this @zasdfgbnm, and sorry it's sort of fallen through the cracks. Is this something you are still working on? For PRs to get in you generally need some form of automated tests; let me know if you would like some help adding tests for this issue.

@zasdfgbnm
Contributor Author

zasdfgbnm commented Oct 7, 2016

I'd love to help @holdenk

@zasdfgbnm
Contributor Author

Something to mention: there is still one problem that I'm not sure I solved correctly. Python's array supports unsigned types, but unsigned types are not supported on the JVM. The solution in this PR is to convert unsigned types to a larger type, e.g. unsigned int -> long. I'm not sure whether it would be better to reject the unsigned types in Python and throw an exception instead.
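
A minimal sketch of the widening idea described above (an illustration only, not the PR's actual code): each unsigned typecode maps to the next larger signed Spark SQL type so every possible value still fits.

```
# Illustrative only -- not the PR's actual implementation.
from pyspark.sql.types import ShortType, IntegerType, LongType

unsigned_widening = {
    'B': ShortType(),    # unsigned char (0..255) fits in a signed 16-bit short
    'H': IntegerType(),  # unsigned short fits in a signed 32-bit int
    'I': LongType(),     # unsigned int fits in a signed 64-bit long
    # 'L' (unsigned long) has no larger signed JVM primitive, which is part
    # of why rejecting unsigned typecodes is also worth considering.
}
```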

@zasdfgbnm
Contributor Author

Hi @holdenk, I think I'm done. I created a test for this issue, and from the test I found that Spark has the same issue not only for float but also for byte and short. After several commits, ./python/run-tests --modules=pyspark-sql passes on my machine.

To be clear, only arrays with typecodes b, h, i, l, f, and d are supported. Arrays with typecode u are not supported because it "corresponds to Python's obsolete unicode character"; arrays with typecodes B, H, I, and L are not supported because there are no unsigned types on the JVM; and arrays with typecodes q and Q are not supported because they "are available only if the platform C compiler used to build Python supports C long long", which makes supporting them complicated. For the unsupported typecodes, a TypeError is raised if the user tries to create a DataFrame from one.
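
A rough sketch of the rule just described (the merged implementation may differ in details, e.g. how the platform-dependent 'l' typecode is handled):

```
from pyspark.sql.types import (
    ArrayType, ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType
)

# Only signed integer and floating-point typecodes are accepted.
_array_type_mappings = {
    'b': ByteType(),
    'h': ShortType(),
    'i': IntegerType(),
    'l': LongType(),
    'f': FloatType(),
    'd': DoubleType(),
}

def infer_array_type(arr):
    """Infer the Spark SQL type of an array.array, or raise TypeError."""
    try:
        return ArrayType(_array_type_mappings[arr.typecode], True)
    except KeyError:
        raise TypeError("array typecode %r is not supported" % arr.typecode)
```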

Would you, or any other developer, review my code and get it merged?

@gatorsmile
Member

cc @ueshin

@ueshin
Member

ueshin commented Jun 13, 2017

@holdenk Are you still working on this? If so, could you rebase or merge master to fix the conflicts, please?
Oops, I'm sorry, I pinged the wrong person.

@ueshin
Member

ueshin commented Jun 13, 2017

@zasdfgbnm Are you still working on this? If so, could you rebase or merge master to fix the conflicts, please?

@gatorsmile
Member

We are closing it due to inactivity. Please reopen it if you want to push it forward. Thanks!

@zasdfgbnm
Contributor Author

@ueshin @gatorsmile I'm happy to resolve the conflicts IF AND ONLY IF a developer will work on the code review for this. This PR was opened more than a year ago and I have been waiting for a review for a year. If it is guaranteed that a reviewer will be assigned for this soon, I will resolve the conflicts. Otherwise, I don't want to maintain a PR forever just waiting for review.

asfgit closed this in b32bd00 Jun 27, 2017
@felixcheung
Member

@zasdfgbnm I think you can ping @ueshin to review.
This sounds important to have. Ping me if it falls through.

@gatorsmile
Member

@zasdfgbnm Please reopen the PR and @ueshin can help review your PR. Thanks!

@zasdfgbnm
Contributor Author

reopened at #18444

@gatorsmile
Member

retest this please

asfgit pushed a commit that referenced this pull request Jul 20, 2017
… of null when creating DataFrame using python

## What changes were proposed in this pull request?
This reopens #14198, with merge conflicts resolved.

ueshin Could you please take a look at my code?

Fix bugs about types that result in an array of nulls when creating a DataFrame using Python.

Python's `array.array` has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. The code in spark-sql and pyspark didn't take this into account, which can result in an array of null values when a row contains an `array('f')`.

A simple snippet to reproduce this bug:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext,Row,DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
rows = sc.parallelize([ row1 ])
df = sqlContext.createDataFrame(rows)
df.show()
```

which produces the output

```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

## How was this patch tested?

New test case added

Author: Xiang Gao <qasdfgtyuiop@gmail.com>
Author: Gao, Xiang <qasdfgtyuiop@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>

Closes #18444 from zasdfgbnm/fix_array_infer.