
[SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests #22568

Closed
wants to merge 9 commits

Conversation

@alex7c4 (Contributor) commented Sep 27, 2018

What changes were proposed in this pull request?

Add more data types for Pandas UDF Tests for PySpark SQL

How was this patch tested?

manual tests

Review excerpt from python/pyspark/sql/tests.py (surrounding lines elided):

    # type, error message regexp
    unsupported_types_with_msg = (
    'Invalid returnType.*grouped map Pandas UDF.*ArrayType.*TimestampType'):
    pandas_udf(lambda x: x, schema, PandasUDFType.GROUPED_MAP)
Member comment:
I think one Invalid returnType.*grouped map Pandas UDF.* is good enough.

@HyukjinKwon (Member)

ok to test

@SparkQA commented Sep 27, 2018

Test build #96690 has finished for PR 22568 at commit 4a57833.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 27, 2018

Test build #96691 has finished for PR 22568 at commit b27e0a1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 27, 2018

Test build #96692 has finished for PR 22568 at commit 53ff750.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment
LGTM otherwise

python/pyspark/sql/tests.py — five review threads, all resolved (four marked outdated)
@alex7c4 (Contributor, Author) commented Sep 28, 2018

@HyukjinKwon
Just noticed strange behavior that can be reproduced with the test below, placed under python.pyspark.sql.tests.GroupedMapPandasUDFTests.
Is it a bug or some feature?

def test_supported_types_array(self):
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    schema = StructType([
        StructField('id', IntegerType()),
        StructField('array', ArrayType(IntegerType()))
    ])
    df = self.spark.createDataFrame(
        [[1, [1, 2, 3]]], schema=schema
    )

    udf1 = pandas_udf(
        lambda pdf: pdf.assign(array=pdf.array * 2),
        schema,
        PandasUDFType.GROUPED_MAP
    )

    result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
    expected1 = df.toPandas().groupby('id').apply(udf1.func).reset_index(drop=True)
    self.assertPandasEqual(expected1, result1)

Here is the output:

python/pyspark/sql/tests.py:244: in assertPandasEqual
    self.assertTrue(expected.equals(result), msg=msg)
E   AssertionError: DataFrames are not equal:
E
E   Expected:
E      id               array
E   0   1  [1, 2, 3, 1, 2, 3]
E   id        int32
E   array    object
E   dtype: object
E
E   Result:
E      id      array
E   0   1  [2, 4, 6]
E   id        int32
E   array    object
E   dtype: object

You can see that the behavior of array=pdf.array * 2 differs between result1 and expected1:

result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
    >> [2, 4, 6]
expected1 = df.toPandas().groupby('id').apply(udf1.func).reset_index(drop=True)
    >> [1, 2, 3, 1, 2, 3]

Default Python behavior is:

[1, 2, 3] * 2
    >> [1, 2, 3, 1, 2, 3]

Thanks.
Added:
The same code run in the pyspark shell:

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

schema = StructType([
    StructField('id', IntegerType()),
    StructField('array', ArrayType(IntegerType()))
])
df = spark.createDataFrame(
    [[1, [1, 2, 3]]], schema=schema
)

udf1 = pandas_udf(
    lambda pdf: pdf.assign(array=pdf.array * 2),
    schema,
    PandasUDFType.GROUPED_MAP
)

result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
expected1 = df.toPandas().groupby('id').apply(udf1.func).reset_index(drop=True)

result1.equals(expected1)
result1
expected1
>>> result1
   id      array
0   1  [2, 4, 6]
>>> expected1
   id               array
0   1  [1, 2, 3, 1, 2, 3]

Edit:
Or maybe this happens because we are performing operations on different types of data?

@SparkQA commented Sep 28, 2018

Test build #96744 has finished for PR 22568 at commit 20e360e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 28, 2018

Test build #96747 has finished for PR 22568 at commit 90a4b1f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 28, 2018

Test build #96755 has finished for PR 22568 at commit f5b6ac8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 28, 2018

Test build #96756 has finished for PR 22568 at commit 5b8ca68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented Sep 29, 2018

re: #22568 (comment)

That's because within the pandas UDF the column holds numpy.ndarray values, which allow vectorized operations, whereas in the latter case it holds Python lists (since Arrow optimization is disabled by default).

Try:

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

schema = StructType([
    StructField('id', IntegerType()),
    StructField('array', ArrayType(IntegerType()))])
df = spark.createDataFrame([[1, [1, 2, 3]]], schema=schema)

def foo(pdf):
    print(type(pdf.array[0]))
    return pdf.assign(array=pdf.array * 2)

udf1 = pandas_udf(foo, schema, PandasUDFType.GROUPED_MAP)
result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
print(type(result1.array[0]))

There's a difference in the type conversion details when Arrow is enabled versus disabled. You will get the same results if you enable spark.sql.execution.arrow.enabled (which turns on the Arrow optimization for toPandas and createDataFrame).
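For example, a minimal sketch of that workaround (it assumes an active SparkSession bound to the name spark, with df and udf1 defined as in the snippet above; the config key is as of Spark 2.x):

```python
# Sketch: enable Arrow-based conversion so that toPandas()/createDataFrame()
# go through the same Arrow path as pandas UDFs. Assumes an active
# SparkSession named `spark` (config key name as of Spark 2.x).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
expected1 = df.toPandas().groupby('id').apply(udf1.func).reset_index(drop=True)
# Per the note above, with Arrow enabled on both paths the two
# DataFrames should now agree.
```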

The Arrow-powered type conversion does not exactly match PySpark's original Pandas conversion. We should make the two match. From a cursory look, I think matching the non-Arrow path to numpy.ndarray makes more sense when spark.sql.execution.arrow.enabled is disabled. Probably we should change the array conversion (from JVM to Python) from Array -> a Python list to Array -> numpy array. My gut says it needs some more discussion and investigation.
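The core of the difference can be reproduced with plain Python and numpy, outside Spark (a minimal sketch; the variable names are illustrative): `* 2` on a list is sequence repetition, while `* 2` on a numpy array is an element-wise multiply.

```python
import numpy as np

# A Python list: `* 2` concatenates (sequence repetition) --
# this is what udf1.func sees when toPandas() runs without Arrow.
plain = [1, 2, 3]
assert plain * 2 == [1, 2, 3, 1, 2, 3]

# A numpy array: `* 2` is vectorized (element-wise multiply) --
# this is what the pandas UDF sees when data arrives via Arrow.
vec = np.array([1, 2, 3])
assert (vec * 2).tolist() == [2, 4, 6]
```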

@HyukjinKwon (Member)

Please feel free to merge alex7c4#1 to your branch. Should be good to go then.

Fix array test in PR 22568
@SparkQA commented Oct 1, 2018

Test build #96812 has finished for PR 22568 at commit 2d6fb62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment
LGTM!

@HyukjinKwon (Member)

Merged to master and branch-2.4.

asfgit pushed a commit that referenced this pull request Oct 1, 2018
## What changes were proposed in this pull request?
Add more data types for Pandas UDF Tests for PySpark SQL

## How was this patch tested?
manual tests

Closes #22568 from AlexanderKoryagin/new_types_for_pandas_udf_tests.

Lead-authored-by: Aleksandr Koriagin <aleksandr_koriagin@epam.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Alexander Koryagin <AlexanderKoryagin@users.noreply.github.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(cherry picked from commit 30f5d0f)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
@asfgit asfgit closed this in 30f5d0f Oct 1, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019 (same commit message as the merge above; Closes apache#22568).