[SPARK-36470][PYTHON] Implement CategoricalIndex.map and DatetimeIndex.map #33756

Closed
wants to merge 11 commits into from

xinrong-meng
Member

@xinrong-meng xinrong-meng commented Aug 17, 2021

What changes were proposed in this pull request?

Implement CategoricalIndex.map and DatetimeIndex.map

MultiIndex.map cannot be implemented in the same way as the map of other indexes. It should be taken care of separately if necessary.

Why are the changes needed?

Mapping values through an input correspondence (a function, a dict, or a Series) is a common operation that pandas supports. pandas-on-Spark should support it as well.
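
As a rough illustration of what "input correspondence" means here, the pure-Python sketch below maps a sequence through either a callable or a dict. The helper name `map_values` and its `missing` parameter are hypothetical and are not part of pandas or pandas-on-Spark; the sketch only assumes that keys absent from a dict should yield a missing-value marker, loosely mirroring the `NaT` seen in the `DatetimeIndex.map` output in this description.

```python
from typing import Any, Callable, Dict, Hashable, List, Sequence, Union

Mapper = Union[Callable[[Any], Any], Dict[Hashable, Any]]


def map_values(values: Sequence[Any], mapper: Mapper, missing: Any = None) -> List[Any]:
    """Hypothetical helper: apply a callable or a dict correspondence to each value."""
    if callable(mapper):
        # A function is applied to every element.
        return [mapper(v) for v in values]
    # A dict acts as a lookup table; absent keys yield the `missing` marker.
    return [mapper.get(v, missing) for v in values]


upper = map_values(["a", "b", "c"], lambda x: x.upper())
looked_up = map_values(["a", "b", "c"], {"a": "first", "b": "second"})
```

Here `upper` holds the upper-cased values, while `looked_up` ends with the `missing` marker because `"c"` has no entry in the dict.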

Does this PR introduce any user-facing change?

Yes. CategoricalIndex.map and DatetimeIndex.map can be used now.

  • CategoricalIndex.map
>>> import pandas as pd
>>> import pyspark.pandas as ps

>>> idx = ps.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'], categories=['A', 'B', 'C'], ordered=False, dtype='category')

>>> pser = pd.Series([1, 2, 3], index=pd.CategoricalIndex(['a', 'b', 'c'], ordered=True))
>>> idx.map(pser)
CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category')

>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first', 'second', 'third'], ordered=False, dtype='category')
  • DatetimeIndex.map
>>> import datetime
>>> pidx = pd.date_range(start="2020-08-08", end="2020-08-10")
>>> psidx = ps.from_pandas(pidx)

>>> mapper_dict = {
...   datetime.datetime(2020, 8, 8): datetime.datetime(2021, 8, 8),
...   datetime.datetime(2020, 8, 9): datetime.datetime(2021, 8, 9),
... }
>>> psidx.map(mapper_dict)
DatetimeIndex(['2021-08-08', '2021-08-09', 'NaT'], dtype='datetime64[ns]', freq=None)

>>> mapper_pser = pd.Series([1, 2, 3], index=pidx)
>>> psidx.map(mapper_pser)
Int64Index([1, 2, 3], dtype='int64')
>>> psidx
DatetimeIndex(['2020-08-08', '2020-08-09', '2020-08-10'], dtype='datetime64[ns]', freq=None)

>>> psidx.map(lambda x: x.strftime("%B %d, %Y, %r"))
Index(['August 08, 2020, 12:00:00 AM', 'August 09, 2020, 12:00:00 AM',
       'August 10, 2020, 12:00:00 AM'],
      dtype='object')
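
The `CategoricalIndex.map` examples above keep a categorical result because each mapping is one-to-one; pandas documents that a mapping which is not one-to-one returns a plain `Index` instead. The sketch below illustrates that idea in pure Python by transforming the categories and re-pointing the integer codes. The function `map_categorical` and its return shape are hypothetical, it assumes the mapped values are hashable, and it is not the implementation used in this PR.

```python
from typing import Any, Callable, List, Tuple


def map_categorical(
    categories: List[Any], codes: List[int], func: Callable[[Any], Any]
) -> Tuple[List[Any], List[int], bool]:
    """Hypothetical sketch: map a categorical by transforming its categories.

    Returns (new_categories, new_codes, is_one_to_one).
    """
    mapped = [func(c) for c in categories]
    # Deduplicate the mapped categories while preserving first-seen order.
    new_categories = list(dict.fromkeys(mapped))
    position = {cat: i for i, cat in enumerate(new_categories)}
    # Re-point every code at the position of its mapped category.
    new_codes = [position[mapped[code]] for code in codes]
    return new_categories, new_codes, len(new_categories) == len(categories)


cats, codes, one_to_one = map_categorical(["a", "b", "c"], [0, 1, 2, 0], str.upper)
```

When `is_one_to_one` is false, several original categories have collapsed into one, which is the case where pandas falls back to a non-categorical `Index`.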

How was this patch tested?

Unit tests.

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142521 has finished for PR 33756 at commit 29aeaec.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142523 has finished for PR 33756 at commit 6a1578a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47022/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47023/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47023/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47022/

@xinrong-meng xinrong-meng changed the title from "[SPARK-36470][PYTHON] Implement CategoricalIndex.map and DatetimeIndx.map" to "[SPARK-36470][PYTHON] Implement CategoricalIndex.map and DatetimeIndex.map" on Aug 17, 2021

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142568 has finished for PR 33756 at commit 08aa1f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142569 has finished for PR 33756 at commit c598f53.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47070/

@SparkQA

SparkQA commented Aug 17, 2021

Test build #142570 has finished for PR 33756 at commit f47ccfa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47071/

@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47070/

@xinrong-meng xinrong-meng marked this pull request as ready for review August 17, 2021 22:17
@SparkQA

SparkQA commented Aug 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47071/

@SparkQA

SparkQA commented Aug 18, 2021

Test build #142572 has finished for PR 33756 at commit b831748.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47073/

@SparkQA

SparkQA commented Aug 18, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47073/

Member

@ueshin ueshin left a comment

Otherwise, LGTM.

python/docs/source/reference/pyspark.pandas/indexing.rst (outdated, resolved)
python/pyspark/pandas/missing/indexes.py (outdated, resolved)
@SparkQA

SparkQA commented Aug 19, 2021

Test build #142653 has finished for PR 33756 at commit ad0ab47.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

python/pyspark/pandas/indexes/category.py (outdated, resolved)
@@ -516,19 +515,6 @@ def test_missing(self):
getattr(psdf.set_index("c").index, name)()

# CategoricalIndex functions
Member

We should remove from this line to line 525 as well.

Member Author

Good catch! Removed.

@SparkQA

SparkQA commented Aug 19, 2021

Test build #142654 has finished for PR 33756 at commit 9938b4b.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47154/

@SparkQA

SparkQA commented Aug 19, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47155/

@SparkQA

SparkQA commented Aug 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47154/

Member

@ueshin ueshin left a comment

LGTM, pending tests.

@SparkQA

SparkQA commented Aug 20, 2021

Test build #142674 has finished for PR 33756 at commit 080b746.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 20, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47176/

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

Merging to branch-3.2 too since RC1 failed.

HyukjinKwon pushed a commit that referenced this pull request Aug 23, 2021
…ndex.map`

Implement `CategoricalIndex.map` and `DatetimeIndex.map`

`MultiIndex.map` cannot be implemented in the same way as the `map` of other indexes. It should be taken care of separately if necessary.

Mapping values using input correspondence is a common operation that is supported in pandas. We shall support that as well.

Yes. `CategoricalIndex.map` and `DatetimeIndex.map` can be used now.

- CategoricalIndex.map

```py
>>> idx = ps.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'], categories=['A', 'B', 'C'], ordered=False, dtype='category')

>>> pser = pd.Series([1, 2, 3], index=pd.CategoricalIndex(['a', 'b', 'c'], ordered=True))
>>> idx.map(pser)
CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category')

>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first', 'second', 'third'], ordered=False, dtype='category')
```

- DatetimeIndex.map

```py
>>> pidx = pd.date_range(start="2020-08-08", end="2020-08-10")
>>> psidx = ps.from_pandas(pidx)

>>> mapper_dict = {
...   datetime.datetime(2020, 8, 8): datetime.datetime(2021, 8, 8),
...   datetime.datetime(2020, 8, 9): datetime.datetime(2021, 8, 9),
... }
>>> psidx.map(mapper_dict)
DatetimeIndex(['2021-08-08', '2021-08-09', 'NaT'], dtype='datetime64[ns]', freq=None)

>>> mapper_pser = pd.Series([1, 2, 3], index=pidx)
>>> psidx.map(mapper_pser)
Int64Index([1, 2, 3], dtype='int64')
>>> psidx
DatetimeIndex(['2020-08-08', '2020-08-09', '2020-08-10'], dtype='datetime64[ns]', freq=None)

>>> psidx.map(lambda x: x.strftime("%B %d, %Y, %r"))
Index(['August 08, 2020, 12:00:00 AM', 'August 09, 2020, 12:00:00 AM',
       'August 10, 2020, 12:00:00 AM'],
      dtype='object')
```

Unit tests.

Closes #33756 from xinrong-databricks/other_indexes_map.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 0b6af46)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
4 participants