@HyukjinKwon HyukjinKwon commented Sep 15, 2021

What changes were proposed in this pull request?

This PR proposes to support the new typing syntax introduced in #33954 in the function apply APIs. Namely, users can now specify the index type and name as below:

import pyspark.pandas as ps

# `pdf` is a pandas DataFrame; the hint declares an int index and two int columns.
def transform(pdf) -> ps.DataFrame[int, [int, int]]:
    pdf['A'] = pdf.id + 1
    return pdf

ps.range(5).pandas_on_spark.apply_batch(transform)

   c0  c1
0   0   1
1   1   2
2   2   3
3   3   4
4   4   5

import pyspark.pandas as ps

# The hint names the index "index" and the two int columns "a" and "b".
def transform(pdf) -> ps.DataFrame[("index", int), [("a", int), ("b", int)]]:
    pdf['A'] = pdf.id * pdf.id
    return pdf

ps.range(5).pandas_on_spark.apply_batch(transform)

       a   b
index
0      0   0
1      1   1
2      2   4
3      3   9
4      4  16

Again, this syntax remains experimental and is non-standard with respect to Python's typing conventions. We should migrate to proper typing once pandas supports it, as NumPy does with numpy.typing.

Why are the changes needed?

The rationale is described in #33954: the new syntax avoids unnecessary computation for the default index and for schema inference.

Does this PR introduce any user-facing change?

Yes, this PR affects the following APIs:

  • DataFrame.apply(..., axis=1)
  • DataFrame.groupby.apply(...)
  • DataFrame.pandas_on_spark.transform_batch(...)
  • DataFrame.pandas_on_spark.apply_batch(...)

With these APIs, users can now specify the index type (and optionally its name) with the new syntax below:

DataFrame[index_type, [type, ...]]
DataFrame[(index_name, index_type), [(name, type), ...]]
DataFrame[dtype instance, dtypes instance]
DataFrame[(index_name, index_type), zip(names, types)]
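To make the shape of these forms concrete, here is a minimal pure-Python sketch of how a subscript such as `DataFrame[index_spec, data_spec]` could be normalized into an index field and a list of column fields. It is not the pandas-on-Spark implementation: `DataFrameHint` and `_as_field` are hypothetical names, the `c0`, `c1`, ... defaults simply mirror the output shown in the description above, and the dtype-instance form is left out because it needs pandas objects.

```python
from typing import Any, List, Optional, Tuple

# A field is a (name, type) pair; the name is None when it was not given.
Field = Tuple[Optional[str], Any]


def _as_field(item: Any) -> Field:
    """Normalize `type` or `(name, type)` into a (name, type) pair (hypothetical helper)."""
    if isinstance(item, tuple) and len(item) == 2:
        return item[0], item[1]
    return None, item


class DataFrameHint:
    """Hypothetical stand-in for a subscriptable DataFrame class."""

    def __class_getitem__(cls, params: Any) -> Tuple[Field, List[Field]]:
        # DataFrameHint[index_spec, data_spec] arrives here as a 2-tuple.
        if not (isinstance(params, tuple) and len(params) == 2):
            raise TypeError("expected DataFrameHint[index_spec, data_spec]")
        index_spec, data_spec = params
        index_field = _as_field(index_spec)

        # data_spec may be a list or any iterable, e.g. zip(names, types).
        data_fields: List[Field] = []
        for position, item in enumerate(data_spec):
            name, tpe = _as_field(item)
            # Assumption: unnamed columns get positional names c0, c1, ...
            data_fields.append((name if name is not None else f"c{position}", tpe))
        return index_field, data_fields


# Types only: the index stays unnamed, columns get default names.
print(DataFrameHint[int, [int, int]])
# Names and types for both the index and the columns.
print(DataFrameHint[("index", int), [("a", int), ("b", int)]])
# Column fields built from parallel sequences with zip.
print(DataFrameHint[("index", int), zip(["a", "b"], [int, float])])
```

Because the whole schema is known from the annotation alone, a function annotated this way would not need to be executed on sample data to discover its output columns, which is the computation the new syntax is meant to avoid.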

How was this patch tested?

Tested manually, and unit tests were added.

Member Author

I used slice because mypy fails. FYI, the new syntax doesn't fail w/ mypy.

@HyukjinKwon
Member Author

cc @ueshin @xinrong-databricks @itholic FYI

@HyukjinKwon HyukjinKwon force-pushed the SPARK-36710 branch 2 times, most recently from 4271ffc to 8f312b3 Compare September 15, 2021 13:04
@HyukjinKwon HyukjinKwon changed the title from "[SPARK-36710][PYTHON] Support new syntax in function apply APIs in pandas API on Spark" to "[SPARK-36710][PYTHON] Support new typing syntax in function apply APIs in pandas API on Spark" Sep 15, 2021
@SparkQA

SparkQA commented Sep 15, 2021

Test build #143304 has finished for PR 34007 at commit 0b436b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47806/

@SparkQA

SparkQA commented Sep 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47806/

@SparkQA

SparkQA commented Sep 15, 2021

Test build #143323 has finished for PR 34007 at commit 682a95a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47826/

@SparkQA

SparkQA commented Sep 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47826/

@HyukjinKwon
Member Author

Will merge in 3 days if there are no more comments.

@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon HyukjinKwon deleted the SPARK-36710 branch January 4, 2022 00:52