[SPARK-36403][PYTHON] Implement Index.putmask #33744

Closed

beobest2
Contributor

What changes were proposed in this pull request?

Implement Index.putmask

This pull request is based on databricks/koalas#1560

Why are the changes needed?

putmask returns a new Index in which the values at the positions where the mask is True are replaced with the given value.
pandas supports putmask, so PySpark should support it as well.

Does this PR introduce any user-facing change?

Yes. Index.putmask can be used.

>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> pidx = pd.Index(["a", "b", "c", "d", "e"])
>>> psidx = ps.from_pandas(pidx)
>>> psidx.putmask(psidx < "c", "k").sort_values()
Index(['c', 'd', 'e', 'k', 'k'], dtype='object')
>>> psidx.putmask(psidx < "c", ["g", "h", "i", "j", "k"]).sort_values()
Index(['c', 'd', 'e', 'g', 'h'], dtype='object')
>>> psidx.putmask(psidx < "c", ("g", "h", "i", "j", "k")).sort_values()
Index(['c', 'd', 'e', 'g', 'h'], dtype='object')
>>> psidx.putmask(psidx < "c", ps.Index(["g", "h", "i", "j", "k"])).sort_values()
Index(['c', 'd', 'e', 'g', 'h'], dtype='object')
>>> psidx.putmask(psidx < "c", "MASKED").sort_values()
Index(['MASKED', 'MASKED', 'c', 'd', 'e'], dtype='object')
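
For comparison, the plain pandas behavior that these examples are meant to mirror can be checked without Spark. This is only a minimal sketch: it assumes an object-dtype index (matching the `dtype='object'` shown above), and the variable names are illustrative rather than taken from the patch.

```python
import pandas as pd

# Object dtype is requested explicitly so the result matches the
# dtype='object' output shown in the examples above.
pidx = pd.Index(["a", "b", "c", "d", "e"], dtype=object)
mask = pidx < "c"  # True for "a" and "b", False for the rest

# Scalar replacement: every masked position receives the same value.
scalar_result = pidx.putmask(mask, "k")

# List-like replacement of the same length as the index: each masked
# position takes the value found at the same position in the list.
listlike_result = pidx.putmask(mask, ["g", "h", "i", "j", "k"])

print(list(scalar_result))    # ['k', 'k', 'c', 'd', 'e']
print(list(listlike_result))  # ['g', 'h', 'c', 'd', 'e']
```

pandas keeps the original order, so no `sort_values()` call is needed here; the pandas-on-Spark examples above sort the result before displaying it.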

How was this patch tested?

Unit tests.

@HyukjinKwon
Member

Jenkins, ok to test

@SparkQA

SparkQA commented Aug 15, 2021

Test build #142472 has finished for PR 33744 at commit 57de3fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46980/

@SparkQA

SparkQA commented Aug 15, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46980/

@xinrong-meng
Member

CC @ueshin @HyukjinKwon @itholic

@ueshin
Member

ueshin commented Aug 16, 2021

@beobest2 Could you enable GitHub Actions in your forked repository to run the tests?

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142522 has finished for PR 33744 at commit 7b6a20e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47024/

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142524 has finished for PR 33744 at commit f9ee813.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47024/

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142525 has finished for PR 33744 at commit f1c4692.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47025/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47026/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47025/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47026/

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142531 has finished for PR 33744 at commit b239373.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47032/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47032/

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142538 has finished for PR 33744 at commit c02382c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47039/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47039/

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142544 has finished for PR 33744 at commit 12eea7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47046/

@beobest2 beobest2 changed the title [SPARK-36403][PYTHON] Implement Index.putmask [SPARK-36403][PYTHON] Implement Index.putmask Aug 17, 2021
@SparkQA

SparkQA commented Aug 17, 2021

Test build #142550 has finished for PR 33744 at commit e1ae627.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47052/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47052/

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142553 has finished for PR 33744 at commit 52064b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47055/

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142555 has finished for PR 33744 at commit b7e06cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47057/

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142556 has finished for PR 33744 at commit 83b0c4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47058/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47057/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47055/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47058/

@SparkQA

SparkQA commented Aug 19, 2021

Test build #142636 has finished for PR 33744 at commit ac455de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47137/

@SparkQA

SparkQA commented Aug 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47137/

@itholic
Contributor

itholic commented Sep 6, 2021

Could you check beobest2#1 when you find some time?

The point is that we should avoid to_pandas() and Index.size() as much as possible when there is another workaround.

This is a rough fix, so I think additional review is needed.

@@ -323,7 +323,7 @@ def infer_pd_series_spark_type(pser: pd.Series, dtype: Dtype) -> types.DataType:
     if dtype == np.dtype("object"):
         if len(pser) == 0 or pser.isnull().all():
             return types.NullType()
-        elif hasattr(pser.iloc[0], "__UDT__"):
+        elif hasattr(pser, "iloc") and hasattr(pser.iloc[0], "__UDT__"):
Contributor

Could you explain what this change is for?

Contributor Author

At this point: https://github.com/apache/spark/pull/33744/files#diff-c19199b1eb4ba73f00acb31a2c2c055be95b697fd08049ee6ba54655392adfa5R1984

If the value parameter passed to putmask is an Index,
the function infer_pd_series_spark_type raises an exception, because Index does not have an iloc attribute.
That is why I changed this part. I believed it would not affect the existing behavior for Series.
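
The difference can be reproduced with plain pandas. The snippet below is only a sketch of the guarded condition from the diff; `first_element_has_udt` is a hypothetical helper name used for illustration and is not part of the patch.

```python
import pandas as pd

pser = pd.Series(["a", "b"], dtype=object)
pidx = pd.Index(["a", "b"], dtype=object)


def first_element_has_udt(obj) -> bool:
    """Hypothetical helper mirroring the guarded condition in the diff."""
    # The first hasattr short-circuits for objects without .iloc,
    # so obj.iloc[0] is never evaluated for an Index.
    return hasattr(obj, "iloc") and hasattr(obj.iloc[0], "__UDT__")


print(hasattr(pser, "iloc"))        # True: Series supports positional .iloc
print(hasattr(pidx, "iloc"))        # False: Index has no .iloc attribute
print(first_element_has_udt(pser))  # False: a plain str has no __UDT__
print(first_element_has_udt(pidx))  # False: guard short-circuits, no error
```

Without the first `hasattr` check, evaluating `pidx.iloc[0]` raises an `AttributeError`.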

Contributor Author

@beobest2 beobest2 Sep 6, 2021

Adding the infer_pd_index_spark_type function would make it cleaner.

Contributor

@itholic itholic Sep 7, 2021

I think we should not pass an Index to infer_pd_series_spark_type.

At the very least, we should rename the function (for example, to infer_pd_indexops_spark_type) and update its input and output types, or add a new function and use that, as you mentioned.

BTW, actually I think we don't really need to use pandas_udf here, though.

@beobest2
Contributor Author

beobest2 commented Sep 6, 2021

@itholic Thank you for the suggestion, I will review and test this more.

@SparkQA

SparkQA commented Oct 8, 2021

Test build #144017 has finished for PR 33744 at commit e083147.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48494/

@SparkQA

SparkQA commented Oct 8, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48494/

@SparkQA

SparkQA commented Nov 12, 2021

Test build #145165 has finished for PR 33744 at commit e083147.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 21, 2022
@github-actions github-actions bot closed this Feb 22, 2022