
[SPARK-35211][PYTHON] verify inferred schema for _create_dataframe #32320

Closed
wants to merge 2 commits

Conversation

da-liii (Contributor) commented Apr 24, 2021

What changes were proposed in this pull request?

  1. Refactor the code using an inner_map helper.
  2. Do extra schema verification after the schema is inferred.

This PR does not introduce any semantic changes except for the extra schema verification.

This PR fixes SPARK-35211 when schema verification is turned on. If schema verification is turned off, the bug described in SPARK-35211 still exists; I will create another PR to solve that case.
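
For illustration, here is a minimal sketch of what "verify after inference" means. It uses pyspark's internal _make_type_verifier; the helper name verify_rows and its arguments are hypothetical stand-ins for the real wiring inside _create_dataframe, not the PR's actual code:

    # Hedged sketch: verify rows against an inferred schema before conversion.
    # verify_rows, rows, and inferred_schema are illustrative stand-ins;
    # _make_type_verifier is the pyspark-internal verifier factory that
    # raises on a schema mismatch.
    from pyspark.sql.types import _make_type_verifier

    def verify_rows(rows, inferred_schema, verifySchema=True):
        verify = _make_type_verifier(inferred_schema) if verifySchema else lambda _: True
        for row in rows:
            verify(row)  # raises TypeError/ValueError if a value does not fit
            yield row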

Why are the changes needed?

# Reproduction, run in a PySpark shell (where `spark` is predefined):
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
from pyspark.testing.sqlutils import ExamplePoint
import pandas as pd
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf)
df.show()

The result is not correct because of an incorrect type conversion.

With this PR, the type check is performed and the mismatch is reported:

(spark) ➜  spark git:(sadhen/SPARK-35211) ✗ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/04/24 17:42:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
      /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.12:4040
Spark context available as 'sc' (master = local[*], app id = local-1619257343692).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
>>> from pyspark.testing.sqlutils  import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
>>> df = spark.createDataFrame(pdf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 653, in createDataFrame
    return super(SparkSession, self).createDataFrame(
  File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", line 340, in createDataFrame
    return self._create_dataframe(data, schema, samplingRatio, verifySchema)
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 699, in _create_dataframe
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 499, in _createFromLocal
    data = list(data)
  File "/Users/da/github/apache/spark/python/pyspark/sql/session.py", line 688, in prepare
    verify_func(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1390, in verify_struct
    verifier(v)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1304, in verify_udf
    verifier(dataType.toInternal(obj))
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1354, in verify_array
    element_verifier(i)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1409, in verify
    verify_value(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1403, in verify_default
    verify_acceptable_types(obj)
  File "/Users/da/github/apache/spark/python/pyspark/sql/types.py", line 1291, in verify_acceptable_types
    raise TypeError(new_msg("%s can not accept object %r in type %s"
TypeError: element in array field point: DoubleType can not accept object 1 in type <class 'int'>
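
The traceback shows the root cause: ExamplePoint's UDT serializes each point into an array of doubles, but ExamplePoint(1, 1) produces Python ints, which DoubleType rejects. Below is a standalone sketch of just the check that fires, assuming the UDT's sqlType is ArrayType(DoubleType(), False); _make_type_verifier is pyspark-internal and may change between versions:

    # Reproduce only the verifier step from the traceback above.
    from pyspark.sql.types import ArrayType, DoubleType, _make_type_verifier

    verify = _make_type_verifier(ArrayType(DoubleType(), False), name="point")
    verify([1.0, 1.0])  # passes: floats match DoubleType
    verify([1, 1])      # raises TypeError: DoubleType can not accept object 1 ...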

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests.

da-liii changed the title from "[SPARK-35211][PYSPARK] _create_dataframe: infer schema earlier and do type check" to "[SPARK-35211][PYTHON] _create_dataframe: infer schema earlier and do type check" Apr 24, 2021
SparkQA commented Apr 24, 2021

Test build #137882 has finished for PR 32320 at commit bea87a5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

da-liii marked this pull request as draft April 24, 2021 12:27
SparkQA commented Apr 24, 2021

Test build #137889 has finished for PR 32320 at commit 4dc085c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

da-liii changed the title from "[SPARK-35211][PYTHON] _create_dataframe: infer schema earlier and do type check" to "[SPARK-35211][PYTHON] _create_dataframe: verify inferred schema" Apr 24, 2021
da-liii changed the title from "[SPARK-35211][PYTHON] _create_dataframe: verify inferred schema" to "[SPARK-35211][PYTHON] verify inferred schema for _create_dataframe" Apr 24, 2021
da-liii marked this pull request as ready for review April 24, 2021 15:44
HyukjinKwon (Member) commented:

@sadhen, can we separate the refactoring from the UDT inferred-type verification? That would make the change much easier to review.

da-liii (Contributor, Author) commented Apr 25, 2021

@HyukjinKwon There are small differences between _createFromRDD and _createFromLocal. If I did the inferred-type verification in a separate PR, I would need to insert the following code snippet twice:

    # Build a verifier from the inferred schema (no-op when verifySchema is off)
    verify_func = _make_type_verifier(struct) if verifySchema else lambda _: True

    # Verify each row against the schema, then convert it as before
    def verified_converter(obj):
        verify_func(obj)
        return converter(obj)

    data = inner_map(verified_converter, data)

That's why I did the refactoring.
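
For context, a plausible shape of the inner_map helper is sketched below (this body is a hypothetical guess, not the PR's actual code): it applies a per-row function on both code paths, so the verification wrapper above only needs to be written once.

    # Hypothetical sketch of inner_map: apply f to each row, whether data
    # is an RDD (_createFromRDD) or a local iterable (_createFromLocal).
    def inner_map(f, data):
        if hasattr(data, "mapPartitions"):  # duck-typed RDD check
            return data.map(f)
        return map(f, data)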

Let me create another PR for inferred type verification.

da-liii (Contributor, Author) commented Apr 25, 2021

A PR without the refactoring is prepared: #32332

da-liii (Contributor, Author) commented Apr 25, 2021

This PR will be rebased on master when #32332 is merged.
