[SPARK-34544][PYTHON] Convert PandasDataFrameLike and PandasSeriesLike to aliases of Pandas types #34927

Closed
zero323 wants to merge 46 commits into apache:master from zero323:SPARK-34544

zero323 (Member) commented Dec 17, 2021

What changes were proposed in this pull request?

This PR proposes replacing the currently used Protocols:

  • PandasDataFrameLike
  • PandasSeriesLike

with simple aliases of upstream types.
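
To illustrate the difference, here is a minimal sketch, not the code from this PR. `DataFrame` below is a stand-in class (so the snippet runs without pandas installed), and `DataFrameLikeProtocol` and `column_count` are hypothetical names; only `PandasDataFrameLike` comes from the PR title.

```python
from typing import List, Protocol


class DataFrame:
    """Stand-in for pandas.DataFrame (assumption: lets the sketch run without pandas)."""

    def __init__(self, columns: List[str]) -> None:
        self.columns = list(columns)


# Before (hypothetical, simplified): a hand-written structural Protocol that
# lists only the attributes the project chose to describe.
class DataFrameLikeProtocol(Protocol):
    columns: List[str]


# After: a plain alias, so type checkers see the full upstream type
# (with real pandas this would point at pandas.DataFrame).
PandasDataFrameLike = DataFrame


def column_count(df: PandasDataFrameLike) -> int:
    # Annotating with the alias is the same as annotating with DataFrame itself.
    return len(df.columns)


print(column_count(DataFrame(["id", "value"])))  # prints 2
```

With the alias, everything covered by the installed stubs is visible to the type checker, whereas a Protocol exposes only what was written into it by hand.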

This exposed a number of typing issues, primarily around the pyspark.pandas API, which will be resolved in this PR.

Additionally, it adds VirtusLab/pandas-stubs to the CI dependencies.

Why are the changes needed?

The currently used Protocols were a workaround, included to improve typing coverage until Pandas exposes its type hints.

In the meantime, relatively stable stub sources have emerged (with a lot of ongoing discussion around them), and the pandas-stubs package (available on PyPI and conda-forge) provides better coverage without adding maintenance overhead on our side.

Does this PR introduce any user-facing change?

Better typing experience around Pandas UDFs.

How was this patch tested?

Existing typecheck pipeline and unit tests.

zero323 (Member, Author) commented Dec 17, 2021

FYI @HyukjinKwon

There is still a lot of work to be done here (typing errors reduced from ~150 to <50 so far).

(Also, the commit messages explain why certain fixes are needed; I'll include these here once I am closer to completion.)

SparkQA commented Dec 17, 2021

Test build #146302 has finished for PR 34927 at commit 5958791.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member) commented

I like this!!

HyukjinKwon (Member) commented

cc @ueshin @itholic @xinrong-databricks FYI

SparkQA commented Dec 17, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50774/

itholic (Contributor) commented Dec 17, 2021

I've briefly reviewed the work, and so far it looks pretty nice to me!!

Let me revisit and take a closer look when it's ready to review.

SparkQA commented Dec 17, 2021

Test build #146332 has finished for PR 34927 at commit 3c5c894.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50807/

SparkQA commented Dec 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50807/

SparkQA commented Dec 17, 2021

Test build #146339 has finished for PR 34927 at commit e5043a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50813/

SparkQA commented Dec 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50813/

zero323 (Member, Author) commented Dec 17, 2021

> I've briefly reviewed the work, and so far it looks pretty nice to me!!
>
> Let me revisit and take a closer look when it's ready to review.

I'll ping you once I am closer to completion ‒ I resolved most of the minor problems, but hit a bigger issue with pandas_udf annotations ‒ seems like we'll have to rethink how this stuff is done, which is something I am not very happy about.

SparkQA commented Dec 17, 2021

Test build #146341 has finished for PR 34927 at commit d81bfee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ueshin (Member) commented Dec 17, 2021

We can remove the pandas entry from mypy.ini?
https://github.com/apache/spark/blob/d81bfee6b56ffffa3deac2ab14e8a88821998e77/python/mypy.ini#L119-L120
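
For context, a sketch of what such an entry typically looks like and what removing it amounts to. The section name `[mypy-pandas.*]` and the `ignore_missing_imports` setting are assumptions about the linked lines (they are not quoted in this thread), and the snippet edits an in-memory string rather than the real python/mypy.ini. The helper name `drop_section` is hypothetical.

```python
import configparser
import io

# Assumed shape of the relevant mypy.ini sections; not copied from the Spark repo.
MYPY_INI = """\
[mypy]
strict_optional = True

[mypy-pandas.*]
ignore_missing_imports = True

[mypy-numpy]
ignore_missing_imports = True
"""


def drop_section(ini_text: str, section: str) -> str:
    """Return the INI text with one per-module override section removed."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    parser.remove_section(section)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Once stubs for pandas are installed, mypy can resolve the imports itself,
# so an override that silences missing pandas imports is no longer needed.
cleaned = drop_section(MYPY_INI, "mypy-pandas.*")
print(cleaned)
```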

SparkQA commented Dec 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50815/

SparkQA commented Dec 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50815/

SparkQA commented Dec 18, 2021

Test build #146356 has finished for PR 34927 at commit 045d587.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 18, 2021

Test build #146357 has finished for PR 34927 at commit 111b6cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50830/

SparkQA commented Dec 18, 2021

Test build #146358 has finished for PR 34927 at commit 535e3fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50831/

SparkQA commented Dec 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50832/

SparkQA commented Dec 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50830/

SparkQA commented Dec 19, 2021

Test build #146370 has finished for PR 34927 at commit 47a727d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50844/

SparkQA commented Dec 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50844/

@zero323 zero323 marked this pull request as ready for review December 19, 2021 05:33
@zero323 zero323 changed the title from "[WIP][SPARK-34544][PYTHON] Convert PandasDataFrameLike and PandasSeriesLike to aliases of Pandas types" to "[SPARK-34544][PYTHON] Convert PandasDataFrameLike and PandasSeriesLike to aliases of Pandas types" Dec 19, 2021

SparkQA commented Dec 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50920/

SparkQA commented Dec 21, 2021

Test build #146445 has finished for PR 34927 at commit 1224d1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • trait ExposesMetadataColumns extends LogicalPlan

SparkQA commented Dec 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50920/

itholic (Contributor) left a comment

Thanks for the bunch of cleanup!

@zero323 zero323 closed this in a70006d Dec 23, 2021

zero323 (Member, Author) commented Dec 23, 2021

Merged into master.

Thanks all!

@zero323 zero323 deleted the SPARK-34544 branch December 23, 2021 00:16
HyukjinKwon added a commit that referenced this pull request Dec 27, 2021
### What changes were proposed in this pull request?

This PR is a minor follow-up to #34927 that adds the `pandas-stubs` dependency to `dev/requirements.txt`.

### Why are the changes needed?

For easier development setup.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested as below:

```bash
pip install -r dev/requirements.txt
```

Closes #35029 from HyukjinKwon/SPARK-34544.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>