[SPARK-23401][PYTHON][TESTS] Add more data types for PandasUDFTests #22568
python/pyspark/sql/tests.py
'Invalid returnType.*grouped map Pandas UDF.*ArrayType.*TimestampType'):
pandas_udf(lambda x: x, schema, PandasUDFType.GROUPED_MAP)
# type, error message regexp
unsupported_types_with_msg = (
I think one Invalid returnType.*grouped map Pandas UDF.* is good enough.
ok to test
Test build #96690 has finished for PR 22568 at commit
Test build #96691 has finished for PR 22568 at commit
Test build #96692 has finished for PR 22568 at commit
LGTM otherwise
@HyukjinKwon

def test_supported_types_array(self):
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    schema = StructType([
        StructField('id', IntegerType()),
        StructField('array', ArrayType(IntegerType()))
    ])
    df = self.spark.createDataFrame(
        [[1, [1, 2, 3]]], schema=schema
    )
    udf1 = pandas_udf(
        lambda pdf: pdf.assign(array=pdf.array * 2),
        schema,
        PandasUDFType.GROUPED_MAP
    )
    result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
    expected1 = df.toPandas().groupby('id').apply(udf1.func).reset_index(drop=True)
    self.assertPandasEqual(expected1, result1)

Here is output:
You can see that behavior of
Default Python behavior is:
Thanks.

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

schema = StructType([
    StructField('id', IntegerType()),
    StructField('array', ArrayType(IntegerType()))
])
df = spark.createDataFrame(
    [[1, [1, 2, 3]]], schema=schema
)
udf1 = pandas_udf(
    lambda pdf: pdf.assign(array=pdf.array * 2),
    schema,
    PandasUDFType.GROUPED_MAP
)
result1 = df.groupby('id').apply(udf1).sort('id').toPandas()
expected1 = df.toPandas().groupby('id').apply(udf1.func).reset_index(drop=True)
result1.equals(expected1)
result1
expected1
Edit:
Test build #96744 has finished for PR 22568 at commit
Test build #96747 has finished for PR 22568 at commit
Test build #96755 has finished for PR 22568 at commit
Test build #96756 has finished for PR 22568 at commit
re: #22568 (comment)
That's because within Pandas UDF it's
Try:
There's a difference in the type conversion details depending on whether Arrow is enabled or disabled. You will have the same results if you enable
The type conversion (powered by Arrow) does not exactly match PySpark's original Pandas conversion. We should match both. For now, I think matching it to
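A minimal sketch of how such a mismatch can arise, using only pandas and NumPy (no Spark session). It assumes, purely for illustration, that one conversion path hands the function NumPy arrays while the other yields plain Python lists; the inline details of the comment above were lost, so this is not a reconstruction of it. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: a conversion path that stores each array value as a Python list.
lists = pd.Series([[1, 2, 3]])

# Assumption: a conversion path that stores each array value as a NumPy array.
arrays = pd.Series([np.array([1, 2, 3])])

# Both columns have object dtype, so "* 2" is applied to each element
# using that element's own multiplication rule.
doubled_lists = lists * 2    # list * 2 repeats the list
doubled_arrays = arrays * 2  # ndarray * 2 multiplies element-wise

print(doubled_lists[0])
print(doubled_arrays[0])
```

The same lambda, pdf.array * 2, can therefore give different values depending on which element type the column holds, which is one way the expected and actual frames in a test like the one above could disagree.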
Please feel free to merge alex7c4#1 to your branch. Should be good to go then.
Fix array test in PR 22568
Test build #96812 has finished for PR 22568 at commit
LGTM!
Merged to master and branch-2.4.
## What changes were proposed in this pull request?

Add more data types for Pandas UDF Tests for PySpark SQL

## How was this patch tested?

manual tests

Closes #22568 from AlexanderKoryagin/new_types_for_pandas_udf_tests.

Lead-authored-by: Aleksandr Koriagin <aleksandr_koriagin@epam.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Alexander Koryagin <AlexanderKoryagin@users.noreply.github.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(cherry picked from commit 30f5d0f)
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
What changes were proposed in this pull request?
Add more data types for Pandas UDF Tests for PySpark SQL
How was this patch tested?
manual tests