
[SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests #22568

Closed
wants to merge 9 commits

Conversation

@alex7c4 (Contributor) commented Sep 27, 2018

What changes were proposed in this pull request?

Add more data types for Pandas UDF Tests for PySpark SQL

How was this patch tested?

manual tests

Review excerpt from python/pyspark/sql/tests.py (surrounding lines elided):

    # type, error message regexp
    unsupported_types_with_msg = (
    'Invalid returnType.*grouped map Pandas UDF.*ArrayType.*TimestampType'):
    pandas_udf(lambda x: x, schema, PandasUDFType.GROUPED_MAP)
Member comment:
I think one Invalid returnType.*grouped map Pandas UDF.* is good enough.

@HyukjinKwon (Member)

ok to test

@SparkQA commented Sep 27, 2018

Test build #96690 has finished for PR 22568 at commit 4a57833.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 27, 2018

Test build #96691 has finished for PR 22568 at commit b27e0a1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 27, 2018

Test build #96692 has finished for PR 22568 at commit 53ff750.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment
LGTM otherwise

python/pyspark/sql/tests.py — five review threads, all resolved (four marked outdated)
@alex7c4 (Contributor, Author) commented Sep 28, 2018

@HyukjinKwon
Just noticed strange behavior that can be reproduced with the test below, placed under python.pyspark.sql.tests.GroupedMapPandasUDFTests.
Is it a bug or some feature?

def test_supported_types_array(self):
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    schema = StructType([
        StructField('id', IntegerType()),
        StructField('array', ArrayType(IntegerType()))
    ])
    df = self.spark.createDataFrame(
        [[1, [1, 2, 3]]], schema=schema
    )

    udf1 = pandas_udf(
        lambda pdf: pdf.assign(array=pdf.array * 2),
        schema,
        PandasUDFType.GROUPED_MAP
    )

    result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
    expected1 = df.toPandas().groupby('id').apply(udf1.func).reset_index(drop=True)
    self.assertPandasEqual(expected1, result1)

Here is the output:

python/pyspark/sql/tests.py:244: in assertPandasEqual
    self.assertTrue(expected.equals(result), msg=msg)
E   AssertionError: DataFrames are not equal:
E
E   Expected:
E      id               array
E   0   1  [1, 2, 3, 1, 2, 3]
E   id        int32
E   array    object
E   dtype: object
E
E   Result:
E      id      array
E   0   1  [2, 4, 6]
E   id        int32
E   array    object
E   dtype: object

You can see that the behavior of array=pdf.array * 2 differs between result1 and expected1:

result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
    >> [2, 4, 6]
expected1 = df.toPandas().groupby('id').apply(udf1.func).reset_index(drop=True)
    >> [1, 2, 3, 1, 2, 3]

Default Python behavior is:

[1, 2, 3] * 2
    >> [1, 2, 3, 1, 2, 3]

Thanks.
Added:
The same code run in the pyspark shell:

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

schema = StructType([
    StructField('id', IntegerType()),
    StructField('array', ArrayType(IntegerType()))
])
df = spark.createDataFrame(
    [[1, [1, 2, 3]]], schema=schema
)

udf1 = pandas_udf(
    lambda pdf: pdf.assign(array=pdf.array * 2),
    schema,
    PandasUDFType.GROUPED_MAP
)

result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
expected1 = df.toPandas().groupby('id').apply(udf1.func).reset_index(drop=True)

result1.equals(expected1)
result1
expected1
>>> result1
   id      array
0   1  [2, 4, 6]
>>> expected1
   id               array
0   1  [1, 2, 3, 1, 2, 3]

Edit:
Or maybe this happens because we are performing operations on different types of data?

@SparkQA commented Sep 28, 2018

Test build #96744 has finished for PR 22568 at commit 20e360e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 28, 2018

Test build #96747 has finished for PR 22568 at commit 90a4b1f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 28, 2018

Test build #96755 has finished for PR 22568 at commit f5b6ac8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 28, 2018

Test build #96756 has finished for PR 22568 at commit 5b8ca68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented Sep 29, 2018

re: #22568 (comment)

That's because within the pandas UDF the column holds numpy.ndarray values, which allow vectorized operations, whereas in the latter case it holds Python lists (since Arrow optimization is disabled by default).

Try:

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

schema = StructType([
    StructField('id', IntegerType()),
    StructField('array', ArrayType(IntegerType()))])
df = spark.createDataFrame([[1, [1, 2, 3]]], schema=schema)

def foo(pdf):
    print(type(pdf.array[0]))
    return pdf.assign(array=pdf.array * 2)

udf1 = pandas_udf(foo, schema, PandasUDFType.GROUPED_MAP)
result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
print(type(result1.array[0]))

There's a difference in the type conversion details when Arrow is enabled versus disabled. You will get the same results if you enable spark.sql.execution.arrow.enabled (which turns on the Arrow optimization for toPandas and createDataFrame).
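For example, a minimal sketch of that workaround (it assumes an active SparkSession bound to the name spark, with df and udf1 defined as in the snippet above; the config key is as of Spark 2.x):

```python
# Sketch: enable Arrow-based conversion so that toPandas()/createDataFrame()
# go through the same Arrow path as pandas UDFs. Assumes an active
# SparkSession named `spark` (config key name as of Spark 2.x).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
expected1 = df.toPandas().groupby('id').apply(udf1.func).reset_index(drop=True)
# Per the note above, with Arrow enabled on both paths the two
# DataFrames should now agree.
```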

The Arrow-powered type conversion does not exactly match PySpark's original Pandas conversion. We should make the two match. From a cursory look, I think matching the non-Arrow path to numpy.ndarray makes more sense when spark.sql.execution.arrow.enabled is disabled. Probably we should change the array conversion (from JVM to Python) from Array -> a Python list to Array -> numpy array. My gut says it needs some more discussion and investigation.
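The core of the difference can be reproduced with plain Python and numpy, outside Spark (a minimal sketch; the variable names are illustrative): `* 2` on a list is sequence repetition, while `* 2` on a numpy array is an element-wise multiply.

```python
import numpy as np

# A Python list: `* 2` concatenates (sequence repetition) --
# this is what udf1.func sees when toPandas() runs without Arrow.
plain = [1, 2, 3]
assert plain * 2 == [1, 2, 3, 1, 2, 3]

# A numpy array: `* 2` is vectorized (element-wise multiply) --
# this is what the pandas UDF sees when data arrives via Arrow.
vec = np.array([1, 2, 3])
assert (vec * 2).tolist() == [2, 4, 6]
```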

@HyukjinKwon (Member)

Please feel free to merge alex7c4#1 to your branch. Should be good to go then.

Fix array test in PR 22568
@SparkQA commented Oct 1, 2018

Test build #96812 has finished for PR 22568 at commit 2d6fb62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment
LGTM!

@HyukjinKwon (Member)

Merged to master and branch-2.4.

asfgit pushed a commit that referenced this pull request Oct 1, 2018
## What changes were proposed in this pull request?
Add more data types for Pandas UDF Tests for PySpark SQL

## How was this patch tested?
manual tests

Closes #22568 from AlexanderKoryagin/new_types_for_pandas_udf_tests.

Lead-authored-by: Aleksandr Koriagin <aleksandr_koriagin@epam.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Alexander Koryagin <AlexanderKoryagin@users.noreply.github.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(cherry picked from commit 30f5d0f)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
@asfgit asfgit closed this in 30f5d0f Oct 1, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019 (same commit message as the merge above; Closes apache#22568).