[SPARK-36403][PYTHON] Implement `Index.putmask` #33744

beobest2 · 2021-08-14T17:04:02Z

What changes were proposed in this pull request?

Implement Index.putmask

This pull request is based on databricks/koalas#1560

Why are the changes needed?

putmask returns a new Index of the values set with the mask.
putmask is supported in pandas. PySpark should support that as well.

Does this PR introduce any user-facing change?

Yes. Index.putmask can be used.

>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> pidx = pd.Index(["a", "b", "c", "d", "e"])
>>> psidx = ps.from_pandas(pidx)
>>> psidx.putmask(psidx < "c", "k").sort_values()
Index(['c', 'd', 'e', 'k', 'k'], dtype='object')
>>> psidx.putmask(psidx < "c", ["g", "h", "i", "j", "k"]).sort_values()
Index(['c', 'd', 'e', 'g', 'h'], dtype='object')
>>> psidx.putmask(psidx < "c", ("g", "h", "i", "j", "k")).sort_values()
Index(['c', 'd', 'e', 'g', 'h'], dtype='object')
>>> psidx.putmask(psidx < "c", ps.Index(["g", "h", "i", "j", "k"])).sort_values()
Index(['c', 'd', 'e', 'g', 'h'], dtype='object')
>>> psidx.putmask(psidx < "c", "MASKED").sort_values()
Index(['MASKED', 'MASKED', 'c', 'd', 'e'], dtype='object')
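
For readers without a Spark session, the replacement rule shown in the examples above can be sketched in plain Python. This is only an illustration of the semantics, not the implementation in this PR: `putmask_reference` is a hypothetical name, and rejecting a list-like value of a different length is a simplification chosen for this sketch (NumPy's `putmask` repeats shorter values instead).

```python
from collections.abc import Sequence


def putmask_reference(values, mask, value):
    """Return a new list with masked positions replaced (hypothetical helper).

    A scalar `value` is written to every masked position; a list-like `value`
    supplies the replacement found at the same position as the masked element.
    """
    values = list(values)
    mask = list(mask)
    if len(values) != len(mask):
        raise ValueError("mask must have the same length as values")

    # Strings are sequences too, but they are treated as scalars here.
    is_list_like = isinstance(value, Sequence) and not isinstance(value, (str, bytes))
    if is_list_like:
        replacements = list(value)
        if len(replacements) != len(values):
            # Simplification for this sketch only.
            raise ValueError("list-like value must have the same length as values")
    else:
        replacements = [value] * len(values)

    return [r if m else v for v, m, r in zip(values, mask, replacements)]


letters = ["a", "b", "c", "d", "e"]
mask = [v < "c" for v in letters]
print(sorted(putmask_reference(letters, mask, "k")))
print(sorted(putmask_reference(letters, mask, ["g", "h", "i", "j", "k"])))
```

Sorting the result mirrors the `sort_values()` calls in the examples, which keep the output deterministic regardless of row order.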

How was this patch tested?

Unit tests.

HyukjinKwon · 2021-08-15T02:18:53Z

Jenkins, ok to test

SparkQA · 2021-08-15T02:55:07Z

Test build #142472 has finished for PR 33744 at commit 57de3fd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-15T03:04:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46980/

SparkQA · 2021-08-15T03:41:20Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46980/

xinrong-meng · 2021-08-16T16:47:34Z

CC @ueshin @HyukjinKwon @itholic

ueshin · 2021-08-16T18:23:09Z

@beobest2 Could you enable GitHub Actions in your forked repository to run tests?

SparkQA · 2021-08-17T01:40:54Z

Test build #142522 has finished for PR 33744 at commit 7b6a20e.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-17T02:26:37Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47024/

SparkQA · 2021-08-17T02:53:38Z

Test build #142524 has finished for PR 33744 at commit f9ee813.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-17T03:01:41Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47024/

SparkQA · 2021-08-17T03:07:48Z

Test build #142525 has finished for PR 33744 at commit f1c4692.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-17T03:11:25Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47025/

SparkQA · 2021-08-17T03:27:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47026/

SparkQA · 2021-08-17T04:06:01Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47025/

SparkQA · 2021-08-17T04:24:27Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47026/

SparkQA · 2021-08-17T05:01:20Z

Test build #142531 has finished for PR 33744 at commit b239373.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-17T05:25:34Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47032/

SparkQA · 2021-08-17T06:03:23Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47032/

SparkQA · 2021-08-17T06:55:43Z

Test build #142538 has finished for PR 33744 at commit c02382c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-17T07:09:51Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47039/

SparkQA · 2021-08-17T08:01:40Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47039/

SparkQA · 2021-08-17T08:38:24Z

Test build #142544 has finished for PR 33744 at commit 12eea7b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-17T08:53:39Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47046/

SparkQA · 2021-08-17T10:24:50Z

Test build #142550 has finished for PR 33744 at commit e1ae627.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-17T10:56:40Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47052/

SparkQA · 2021-08-17T11:54:19Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47052/

SparkQA · 2021-08-17T12:01:21Z

Test build #142553 has finished for PR 33744 at commit 52064b9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-17T12:32:12Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47055/

SparkQA · 2021-08-17T12:49:51Z

Test build #142555 has finished for PR 33744 at commit b7e06cd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-17T12:53:00Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47057/

SparkQA · 2021-08-17T13:03:29Z

Test build #142556 has finished for PR 33744 at commit 83b0c4d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-17T13:10:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47058/

SparkQA · 2021-08-17T13:30:28Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47057/

SparkQA · 2021-08-17T13:33:39Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47055/

SparkQA · 2021-08-17T13:51:48Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47058/

SparkQA · 2021-08-19T05:10:07Z

Test build #142636 has finished for PR 33744 at commit ac455de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-19T05:47:14Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47137/

SparkQA · 2021-08-19T06:28:28Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47137/

itholic · 2021-09-06T08:16:16Z

Could you check beobest2#1 when you find some time?

The point is that we should avoid to_pandas() and Index.size() as much as possible if there is another workaround.

This is a rough fix, so I think additional review is needed.

itholic · 2021-09-06T08:47:39Z

python/pyspark/pandas/typedef/typehints.py

@@ -323,7 +323,7 @@ def infer_pd_series_spark_type(pser: pd.Series, dtype: Dtype) -> types.DataType:
    if dtype == np.dtype("object"):
        if len(pser) == 0 or pser.isnull().all():
            return types.NullType()
-        elif hasattr(pser.iloc[0], "__UDT__"):
+        elif hasattr(pser, "iloc") and hasattr(pser.iloc[0], "__UDT__"):


Could you explain what this change is for?

At this point: https://github.com/apache/spark/pull/33744/files#diff-c19199b1eb4ba73f00acb31a2c2c055be95b697fd08049ee6ba54655392adfa5R1984

If the value parameter passed to putmask is an Index,
the function infer_pd_series_spark_type raises an exception, because the Index type doesn't have an iloc attribute.
That is why I changed this part. I thought it would have no effect on the existing behavior for the Series type.

Adding the infer_pd_index_spark_type function would make it cleaner.

I think we should not pass the Index to the infer_pd_series_spark_type.

At least we should change the function name (such as infer_pd_indexops_spark_type) and the input and output types of the function, or add a new function and use it, as you mentioned.

BTW, I think we don't really need to use pandas_udf here.
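
To make the iloc point in this thread concrete, here is a minimal sketch in plain pandas. It is not code from this PR: `first_element_has_udt` and `FakeUDTValue` are hypothetical names used only for illustration, and the sketch covers just the `__UDT__` check from the diff, not the full Spark type inference.

```python
import pandas as pd


def first_element_has_udt(obj) -> bool:
    """Check the first element for a __UDT__ attribute (hypothetical helper).

    Accepts either a pandas Series or a pandas Index, so the caller does not
    need a hasattr(obj, "iloc") guard.
    """
    if len(obj) == 0:
        return False
    # A Series needs .iloc for positional access; an Index supports plain [0].
    first = obj.iloc[0] if isinstance(obj, pd.Series) else obj[0]
    return hasattr(first, "__UDT__")


class FakeUDTValue:
    """Stand-in for a value that carries a user-defined type marker."""

    __UDT__ = object()


print(first_element_has_udt(pd.Series([FakeUDTValue()])))
print(first_element_has_udt(pd.Index([FakeUDTValue()])))
print(first_element_has_udt(pd.Index(["a", "b"])))
```

Dispatching on the input type keeps the Series path unchanged, which is one way to read the suggestion of a separate function for Index input.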

beobest2 · 2021-09-06T08:48:11Z

@itholic Thank you for the suggestion, I will review and test this more.

SparkQA · 2021-10-08T09:22:51Z

Test build #144017 has finished for PR 33744 at commit e083147.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-10-08T09:38:20Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48494/

SparkQA · 2021-10-08T10:23:15Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48494/

SparkQA · 2021-11-12T14:57:00Z

Test build #145165 has finished for PR 33744 at commit e083147.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

github-actions · 2022-02-21T00:14:47Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added CORE PYTHON labels Aug 14, 2021

beobest2 mentioned this pull request Aug 14, 2021

Implements Index.putmask databricks/koalas#1560

Open

beobest2 force-pushed the add_putmask branch from 57de3fd to 7b6a20e Compare August 17, 2021 01:37

beobest2 changed the title ~~[SPARK-36403][PYTHON] Implement Index.putmask~~ [SPARK-36403][PYTHON] Implement Index.putmask Aug 17, 2021

beobest2 force-pushed the add_putmask branch from 83b0c4d to ac455de Compare August 19, 2021 04:40

itholic reviewed Sep 6, 2021

Implement Index.putmask

e083147

beobest2 force-pushed the add_putmask branch from ac455de to e083147 Compare October 8, 2021 08:50

github-actions bot added the Stale label Feb 21, 2022

github-actions bot closed this Feb 22, 2022
