Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-28454][PYTHON] Validate LongType in createDataFrame(verifySchema=True) #25117

Closed
wants to merge 5 commits into from
Closed

Conversation

simplylizz
Copy link
Contributor

@simplylizz simplylizz commented Jul 11, 2019

What changes were proposed in this pull request?

Add missing validation for LongType in pyspark.sql.types._make_type_verifier.

How was this patch tested?

Doctests / unittests / manual tests.

Unpatched version:

In [23]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()
Out[23]: [Row(x=None)]

Patched:

In [5]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-c1740fcadbf9> in <module>
----> 1 s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()

/usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    689             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    690         else:
--> 691             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    692         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    693         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    405         # make sure data could consumed multiple times
    406         if not isinstance(data, list):
--> 407             data = list(data)
    408
    409         if schema is None or isinstance(schema, (list, tuple)):

/usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in prepare(obj)
    671
    672             def prepare(obj):
--> 673                 verify_func(obj)
    674                 return obj
    675         elif isinstance(schema, DataType):

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj)
   1427     def verify(obj):
   1428         if not verify_nullability(obj):
-> 1429             verify_value(obj)
   1430
   1431     return verify

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_struct(obj)
   1397             if isinstance(obj, dict):
   1398                 for f, verifier in verifiers:
-> 1399                     verifier(obj.get(f))
   1400             elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
   1401                 # the order in obj could be different than dataType.fields

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj)
   1427     def verify(obj):
   1428         if not verify_nullability(obj):
-> 1429             verify_value(obj)
   1430
   1431     return verify

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_long(obj)
   1356             if obj < -9223372036854775808 or obj > 9223372036854775807:
   1357                 raise ValueError(
-> 1358                     new_msg("object of LongType out of range, got: %s" % obj))
   1359
   1360         verify_value = verify_long

ValueError: field x: object of LongType out of range, got: 18446744073709551616

@HyukjinKwon
Copy link
Member

Can you show the output before / after? Also, please file a JIRA. See https://spark.apache.org/contributing.html

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jul 15, 2019

Test build #107661 has finished for PR 25117 at commit 1d8761a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@simplylizz simplylizz changed the title [Python] Validate LongType in _make_type_verifier [SPARK-28454][Python] Validate LongType in _make_type_verifier Jul 19, 2019
@simplylizz
Copy link
Contributor Author

Can you show the output before / after?

Could you clarify what kind of output do you want to see? Test results or what?

@SparkQA
Copy link

SparkQA commented Jul 19, 2019

Test build #107928 has finished for PR 25117 at commit 87bc45c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 20, 2019

Test build #107936 has finished for PR 25117 at commit 97771d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

createDataFrame with verifySchema enabled where long exceeds the range you specified.

@simplylizz
Copy link
Contributor Author

Unpatched version:

In [23]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()
Out[23]: [Row(x=None)]

Patched:

In [5]: s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-c1740fcadbf9> in <module>
----> 1 s.createDataFrame([{'x': 1 << 64}], StructType([StructField('x', LongType())])).collect()

/usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    689             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    690         else:
--> 691             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    692         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    693         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    405         # make sure data could consumed multiple times
    406         if not isinstance(data, list):
--> 407             data = list(data)
    408
    409         if schema is None or isinstance(schema, (list, tuple)):

/usr/local/lib/python3.5/site-packages/pyspark/sql/session.py in prepare(obj)
    671
    672             def prepare(obj):
--> 673                 verify_func(obj)
    674                 return obj
    675         elif isinstance(schema, DataType):

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj)
   1427     def verify(obj):
   1428         if not verify_nullability(obj):
-> 1429             verify_value(obj)
   1430
   1431     return verify

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_struct(obj)
   1397             if isinstance(obj, dict):
   1398                 for f, verifier in verifiers:
-> 1399                     verifier(obj.get(f))
   1400             elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
   1401                 # the order in obj could be different than dataType.fields

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify(obj)
   1427     def verify(obj):
   1428         if not verify_nullability(obj):
-> 1429             verify_value(obj)
   1430
   1431     return verify

/usr/local/lib/python3.5/site-packages/pyspark/sql/types.py in verify_long(obj)
   1356             if obj < -9223372036854775808 or obj > 9223372036854775807:
   1357                 raise ValueError(
-> 1358                     new_msg("object of LongType out of range, got: %s" % obj))
   1359
   1360         verify_value = verify_long

ValueError: field x: object of LongType out of range, got: 18446744073709551616

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you post the comment in PR description and update migration guide (https://github.com/apache/spark/blob/master/docs/sql-migration-guide-upgrade.md#upgrading-from-spark-sql-24-to-30) as well?

We can mention like in spark 3.0, it verifies long type when creating a dataframe via spark.createDataFrame. Previously long type was not verified and resulted in null. To keep this behavior, set verifySchema=False.

@HyukjinKwon HyukjinKwon changed the title [SPARK-28454][Python] Validate LongType in _make_type_verifier [SPARK-28454][PYTHON] Validate LongType in _make_type_verifier Jul 29, 2019
@HyukjinKwon HyukjinKwon changed the title [SPARK-28454][PYTHON] Validate LongType in _make_type_verifier [SPARK-28454][PYTHON] Validate LongType in createDataFrame(verifySchema=True) Jul 29, 2019
@simplylizz
Copy link
Contributor Author

Done.

@SparkQA
Copy link

SparkQA commented Jul 31, 2019

Test build #108489 has finished for PR 25117 at commit fde0a90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 31, 2019

Test build #108488 has finished for PR 25117 at commit 080431d.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@simplylizz
Copy link
Contributor Author

@HyukjinKwon any action required from my side?

It seems that tests are just flapping. They are failing in unrelated parts of code which I didn't touch.

@HyukjinKwon
Copy link
Member

retest this please

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good otherwise.

@SparkQA
Copy link

SparkQA commented Aug 7, 2019

Test build #108759 has finished for PR 25117 at commit fde0a90.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 7, 2019

Test build #108761 has finished for PR 25117 at commit 1f02d0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@HyukjinKwon
Copy link
Member

Thanks for the contribution, @simplylizz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants