[SPARK-36710][PYTHON] Support new typing syntax in function apply APIs in pandas API on Spark #34007

HyukjinKwon · 2021-09-15T12:53:59Z

What changes were proposed in this pull request?

This PR adds support for the new typing syntax introduced in #33954 to the function apply APIs. Namely, users can now specify the index type and name as below:

import pandas as pd
import pyspark.pandas as ps

# Return type hint: an int index and two int columns (no names given).
def transform(pdf) -> pd.DataFrame[int, [int, int]]:
    pdf['A'] = pdf.id + 1
    return pdf

ps.range(5).pandas_on_spark.apply_batch(transform)

import pandas as pd
import pyspark.pandas as ps

# Return type hint: an int index named "index" and int columns named "a" and "b".
def transform(pdf) -> pd.DataFrame[("index", int), [("a", int), ("b", int)]]:
    pdf['A'] = pdf.id * pdf.id
    return pdf

ps.range(5).pandas_on_spark.apply_batch(transform)

       a   b
index
0      0   0
1      1   1
2      2   4
3      3   9
4      4  16

Again, this syntax remains experimental and is non-standard with respect to Python's typing conventions. We should migrate to proper typing once pandas supports it, as NumPy does with numpy.typing.

Why are the changes needed?

The rationale is described in #33954: specifying the index and column types up front avoids the unnecessary computation of attaching a default index or inferring the schema.
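As a rough illustration of why a return type hint saves work, the sketch below checks whether a function declares a return annotation and picks a strategy accordingly. The helper names `has_return_hint` and `plan` are hypothetical and this is not the pandas API on Spark implementation; it only assumes that, without a hint, the schema has to be inferred by running the function on sample data.

```python
import inspect
from typing import Any, Callable


def has_return_hint(func: Callable[..., Any]) -> bool:
    """Return True when func declares a return annotation."""
    return inspect.signature(func).return_annotation is not inspect.Signature.empty


def plan(func: Callable[..., Any]) -> str:
    # Hypothetical decision: a declared hint gives the schema up front,
    # otherwise the schema has to be inferred from a sample run.
    if has_return_hint(func):
        return "use declared schema"
    return "infer schema from a sample run"


# The annotation is a plain string here so the sketch runs without pandas or Spark.
def typed(pdf) -> "DataFrame[int, [int, int]]":
    return pdf


def untyped(pdf):
    return pdf


print(plan(typed))
print(plan(untyped))
```
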

Does this PR introduce any user-facing change?

Yes, this PR affects the following APIs:

DataFrame.apply(..., axis=1)
DataFrame.groupby.apply(...)
DataFrame.pandas_on_spark.transform_batch(...)
DataFrame.pandas_on_spark.apply_batch(...)

The return type hints of these APIs can now specify the index type and name with the new syntax below:

DataFrame[index_type, [type, ...]]
DataFrame[(index_name, index_type), [(name, type), ...]]
DataFrame[dtype instance, dtypes instance]
DataFrame[(index_name, index_type), zip(names, types)]
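To make the forms listed above concrete, here is a minimal pure-Python sketch of how such a subscript could be normalized into one index field and a list of column fields. `FrameHint`, `Field`, and `_to_field` are hypothetical names invented for illustration, not pyspark classes, and the sketch does not cover the dtype-instance form.

```python
from typing import Any, List, NamedTuple, Optional, Tuple


class Field(NamedTuple):
    name: Optional[str]  # None means no name was given
    tpe: Any


def _to_field(item: Any) -> Field:
    # ("name", type) becomes a named field; a bare type becomes an unnamed one.
    if isinstance(item, tuple) and len(item) == 2:
        return Field(item[0], item[1])
    return Field(None, item)


class FrameHint:
    """Hypothetical stand-in for the DataFrame type hint; not pyspark code."""

    def __class_getitem__(cls, params: Any) -> Tuple[Field, List[Field]]:
        # Expect exactly two parts: the index spec and the column specs.
        if not (isinstance(params, tuple) and len(params) == 2):
            raise TypeError("expected FrameHint[index, columns]")
        index_spec, column_specs = params
        # Iterating also consumes a zip(names, types) iterator.
        return _to_field(index_spec), [_to_field(c) for c in column_specs]


# Types only, names and types, and the zip(names, types) form.
print(FrameHint[int, [int, int]])
print(FrameHint[("index", int), [("a", int), ("b", int)]])
print(FrameHint[("index", int), zip(["a", "b"], [int, int])])
```

Returning plain tuples keeps the sketch short; a real implementation would more likely return a dedicated type object that the apply APIs can inspect later.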

How was this patch tested?

Manually tested, and unit tests were added.

HyukjinKwon · 2021-09-15T12:54:30Z

python/pyspark/pandas/tests/test_dataframe.py

I used slice because mypy fails. FYI, the new syntax doesn't fail w/ mypy.

HyukjinKwon · 2021-09-15T12:55:17Z

cc @ueshin @xinrong-databricks @itholic FYI

SparkQA · 2021-09-15T15:21:30Z

Test build #143304 has finished for PR 34007 at commit 0b436b8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-09-15T15:42:33Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47806/

SparkQA · 2021-09-15T15:53:23Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47806/

SparkQA · 2021-09-15T21:25:55Z

Test build #143323 has finished for PR 34007 at commit 682a95a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-09-15T21:40:16Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47826/

SparkQA · 2021-09-15T21:48:14Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47826/

HyukjinKwon · 2021-09-17T00:41:52Z

will merge in 3 days if there are no more comments.

HyukjinKwon · 2021-09-20T01:36:56Z

Merged to master.

HyukjinKwon requested a review from ueshin September 15, 2021 12:54

github-actions bot added CORE PYTHON labels Sep 15, 2021

HyukjinKwon force-pushed the SPARK-36710 branch 2 times, most recently from 4271ffc to 8f312b3 Compare September 15, 2021 13:04

HyukjinKwon changed the title ~~[SPARK-36710][PYTHON] Support new syntax in function apply APIs in pandas API on Spark~~ [SPARK-36710][PYTHON] Support new typing syntax in function apply APIs in pandas API on Spark Sep 15, 2021

Support new syntax in function apply APIs in pandas API on Spark

0b436b8

HyukjinKwon force-pushed the SPARK-36710 branch from 8f312b3 to 0b436b8 Compare September 15, 2021 14:33

HyukjinKwon added 2 commits September 16, 2021 05:52

Update python/pyspark/pandas/accessors.py

567213c

Update python/pyspark/pandas/accessors.py

682a95a

HyukjinKwon mentioned this pull request Sep 17, 2021

[SPARK-36708][PYTHON] Support numpy.typing for annotating ArrayType in pandas API on Spark #34028

Closed

HyukjinKwon closed this in 8d8b4aa Sep 20, 2021

HyukjinKwon deleted the SPARK-36710 branch January 4, 2022 00:52
