[SPARK-37511][PYTHON] Introduce TimedeltaIndex to pandas API on Spark by xinrong-meng · Pull Request #34657 · apache/spark

xinrong-meng · 2021-11-19T01:02:37Z

What changes were proposed in this pull request?

Introduce TimedeltaIndex to pandas API on Spark.

Properties, functions, and basic operations of TimedeltaIndex will be supported in follow-up PRs.

Note

Note that PySpark's DayTimeIntervalType follows Python's datetime.timedelta, whose smallest time unit is the microsecond, whereas pandas TimedeltaIndex supports nanoseconds.
As a result, sub-microsecond values are lost in conversion, and we may observe the inconsistency below:

>>> pidx = pd.TimedeltaIndex([1])
>>> pidx
TimedeltaIndex(['0 days 00:00:00.000000001'], dtype='timedelta64[ns]', freq=None)
>>> ps.from_pandas(pidx)
TimedeltaIndex(['0 days'], dtype='timedelta64[ns]', freq=None)

Inspecting further on the PySpark side:

>>> pdf  # nanosecond
          __index_level_0__
0 0 days 00:00:00.000000001

>>> sdf = spark.createDataFrame(pdf)

>>> sdf.show(2, False)
+-----------------------------------+
|__index_level_0__                  |
+-----------------------------------+
|INTERVAL '0 00:00:00' DAY TO SECOND|
+-----------------------------------+
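The precision gap described in the note can be reproduced with the standard library alone. The sketch below is only an illustration: `nanos_to_timedelta` is a hypothetical helper, not part of PySpark or pandas, and it assumes that sub-microsecond digits are simply dropped for non-negative inputs, which is consistent with the 1 ns value above becoming `0 days`.

```python
from datetime import timedelta

NANOS_PER_MICRO = 1_000


def nanos_to_timedelta(nanos: int) -> timedelta:
    """Hypothetical helper: convert non-negative nanoseconds to a timedelta.

    datetime.timedelta cannot store anything finer than a microsecond,
    so the sub-microsecond remainder is discarded here (an assumption
    made for illustration, not the documented Spark conversion rule).
    """
    return timedelta(microseconds=nanos // NANOS_PER_MICRO)


# The smallest difference datetime.timedelta can represent.
print(timedelta.resolution)        # 0:00:00.000001

# 1 ns is below that resolution, so it collapses to zero.
print(nanos_to_timedelta(1))       # 0:00:00

# 1500 ns keeps only its whole microsecond.
print(nanos_to_timedelta(1_500))   # 0:00:00.000001
```

Any value that is a whole number of microseconds survives unchanged, which is why the `timedelta(seconds=1)` examples elsewhere in this description match pandas exactly.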

Why are the changes needed?

Now that DayTimeIntervalType is supported in PySpark, we can add TimedeltaIndex support to pandas API on Spark on top of it.

Does this PR introduce any user-facing change?

Yes.

# TimedeltaIndex is introduced
>>> ps.TimedeltaIndex([timedelta(1)])
TimedeltaIndex(['1 days'], dtype='timedelta64[ns]', freq=None)
>>> ps.TimedeltaIndex([timedelta(seconds=1)])
TimedeltaIndex(['0 days 00:00:01'], dtype='timedelta64[ns]', freq=None)
>>> ps.from_pandas(pd.TimedeltaIndex([timedelta(seconds=1)]))
TimedeltaIndex(['0 days 00:00:01'], dtype='timedelta64[ns]', freq=None)

# timedelta64 Series/DataFrame is also supported as a consequence
>>> ps.DataFrame({'td': [timedelta(hours=1, minutes=30)]})
               td
0 0 days 01:30:00
>>> ps.Series([timedelta(hours=1, minutes=30)])
0   0 days 01:30:00
dtype: timedelta64[ns]

How was this patch tested?

Unit tests.

SparkQA · 2021-11-19T02:13:56Z

Test build #145423 has finished for PR 34657 at commit b860fe0.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-11-19T03:11:15Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49896/

SparkQA · 2021-11-19T03:55:42Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49896/

SparkQA · 2021-11-30T04:54:15Z

Test build #145747 has finished for PR 34657 at commit 37d2ff4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-11-30T05:17:46Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50218/

SparkQA · 2021-11-30T06:18:40Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50218/

HyukjinKwon · 2021-12-01T02:49:13Z

nice!

SparkQA · 2021-12-01T03:58:38Z

Test build #145787 has finished for PR 34657 at commit ffd1979.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-01T05:22:03Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50260/

xinrong-meng · 2021-12-01T05:34:06Z

Considering the "Note" section in the PR description, shall we call the type timedelta64[ns] to follow pandas, or timedelta64[us]?

xinrong-meng · 2021-12-01T05:34:14Z

CC @ueshin @HyukjinKwon @itholic

xinrong-meng · 2021-12-01T05:57:25Z

python/pyspark/pandas/indexes/timedelta.py

+            raise TypeError("Index.name must be a hashable type")
+
+        if isinstance(data, (Series, Index)):
+            # TODO(SPARK-37512): Support TimedeltaIndex creation given a timedelta Series/Index


Supporting TimedeltaIndex creation from a timedelta Series/Index involves many changes in python/pyspark/pandas/data_type_ops/. Shall we implement that separately in https://issues.apache.org/jira/browse/SPARK-37512?
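For context on the first line of the quoted diff, a name check of this kind can be sketched as follows. This is a standalone illustration, not the code in timedelta.py: `check_index_name` is a hypothetical function, and only the error message is taken from the diff.

```python
from typing import Any


def check_index_name(name: Any) -> None:
    """Hypothetical stand-in: reject index names that cannot be hashed."""
    try:
        hash(name)
    except TypeError:
        # Same message as in the quoted diff line.
        raise TypeError("Index.name must be a hashable type") from None


# Strings and tuples of hashable items are accepted silently.
check_index_name("td")
check_index_name(("level", 0))

# A list is unhashable, so the check raises.
try:
    check_index_name(["not", "hashable"])
except TypeError as exc:
    print(exc)  # Index.name must be a hashable type
```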

SparkQA · 2021-12-01T06:01:21Z

Test build #145794 has finished for PR 34657 at commit 0317bde.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-01T06:06:42Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50260/

SparkQA · 2021-12-01T06:36:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50267/

SparkQA · 2021-12-01T07:06:05Z

Test build #145797 has finished for PR 34657 at commit 5f00fe7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-12-01T07:36:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50268/

SparkQA · 2021-12-01T07:39:27Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50267/

SparkQA · 2021-12-01T08:13:58Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50272/

SparkQA · 2021-12-01T08:42:10Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50268/

HyukjinKwon

Looks pretty good

SparkQA · 2021-12-01T09:15:13Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50272/

HyukjinKwon · 2021-12-01T09:20:49Z

Merged to master.

…imedeltaIndex

### What changes were proposed in this pull request?

This PR is a followup of #34657 that adds underline to match with the title.

### Why are the changes needed?

To fix the PySpark documentation build warning:

```
/.../spark/python/docs/source/reference/pyspark.pandas/indexing.rst:340: WARNING: Title underline too short.

TimedeltaIndex
-------------
```

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manual build of the PySpark documentation.

Closes #34775 from HyukjinKwon/SPARK-37511.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

github-actions bot added CORE PYTHON labels Nov 19, 2021

xinrong-meng added 5 commits November 30, 2021 11:49

Add TimedeltaIndex

e5a6bcd

fmt

54185a5

fix

b8012d4

creation

0a1b43b

doctest

37d2ff4

xinrong-meng force-pushed the timedeltaindex branch from d739a22 to 37d2ff4 Compare November 30, 2021 03:49

xinrong-meng marked this pull request as ready for review December 1, 2021 02:24

xinrong-meng changed the title ~~[WIP] Support TimedeltaIndex in pandas API on Spark~~ [SPARK-37511][PYSPARK] Introduce TimedeltaIndex to pandas API on Spark Dec 1, 2021

xinrong-meng changed the title ~~[SPARK-37511][PYSPARK] Introduce TimedeltaIndex to pandas API on Spark~~ [SPARK-37511][PYTHON] Introduce TimedeltaIndex to pandas API on Spark Dec 1, 2021

__new__ and test

ffd1979

fmt

0317bde

xinrong-meng added 2 commits December 1, 2021 13:42

minor

8093dd0

TODO SPARK-37512

5f00fe7

xinrong-meng commented Dec 1, 2021

HyukjinKwon approved these changes Dec 1, 2021

HyukjinKwon closed this in 654cd97 Dec 1, 2021

HyukjinKwon mentioned this pull request Dec 2, 2021

[SPARK-37511][DOCS][FOLLOW-UP] Fix documentation build warning from TimedeltaIndex #34775

Closed
