[SPARK-37511][PYTHON] Introduce TimedeltaIndex to pandas API on Spark #34657

Closed
xinrong-meng wants to merge 9 commits into apache:master from xinrong-meng:timedeltaindex

@xinrong-meng (Member) commented Nov 19, 2021

What changes were proposed in this pull request?

Introduce TimedeltaIndex to pandas API on Spark.

Properties, functions, and basic operations of TimedeltaIndex will be supported in follow-up PRs.

Note

Note that PySpark's DayTimeIntervalType follows Python's datetime.timedelta, whose smallest time unit is the microsecond, whereas pandas TimedeltaIndex supports nanoseconds.
As a result, sub-microsecond values are lost in conversion, and we may observe an inconsistency such as the following:

>>> pidx = pd.TimedeltaIndex([1])
>>> pidx
TimedeltaIndex(['0 days 00:00:00.000000001'], dtype='timedelta64[ns]', freq=None)
>>> ps.from_pandas(pidx)
TimedeltaIndex(['0 days'], dtype='timedelta64[ns]', freq=None)

To inspect further on the PySpark side:

>>> pdf  # nanosecond
          __index_level_0__
0 0 days 00:00:00.000000001

>>> sdf = spark.createDataFrame(pdf)

>>> sdf.show(2, False)
+-----------------------------------+
|__index_level_0__                  |
+-----------------------------------+
|INTERVAL '0 00:00:00' DAY TO SECOND|
+-----------------------------------+
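
The precision gap described in the note can be reproduced with the standard library alone. The following is a minimal sketch, not a description of how Spark performs the conversion: `ns_to_timedelta` is a hypothetical helper, and truncating (rather than rounding) the sub-microsecond remainder is an assumption made for illustration.

```python
from datetime import timedelta

# Python's timedelta cannot represent anything finer than one microsecond.
resolution = timedelta.resolution


def ns_to_timedelta(nanoseconds: int) -> timedelta:
    """Hypothetical helper: convert integer nanoseconds to a timedelta.

    Assumption: the sub-microsecond remainder is truncated.
    """
    return timedelta(microseconds=nanoseconds // 1_000)


# A single nanosecond collapses to a zero-length interval.
one_ns = ns_to_timedelta(1)

# One second plus one nanosecond keeps only the one-second part.
mixed = ns_to_timedelta(1_000_000_001)

print(resolution, one_ns, mixed)
```

Any value that passes through a microsecond-resolution type loses its nanosecond component in this way, which is consistent with the 1 ns index above being displayed as '0 days'.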

Why are the changes needed?

Since PySpark supports DayTimeIntervalType, pandas API on Spark can support TimedeltaIndex on top of it.

Does this PR introduce any user-facing change?

Yes.

>>> from datetime import timedelta
>>> import pandas as pd
>>> import pyspark.pandas as ps
# TimedeltaIndex is introduced
>>> ps.TimedeltaIndex([timedelta(1)])
TimedeltaIndex(['1 days'], dtype='timedelta64[ns]', freq=None)
>>> ps.TimedeltaIndex([timedelta(seconds=1)])
TimedeltaIndex(['0 days 00:00:01'], dtype='timedelta64[ns]', freq=None)
>>> ps.from_pandas(pd.TimedeltaIndex([timedelta(seconds=1)]))
TimedeltaIndex(['0 days 00:00:01'], dtype='timedelta64[ns]', freq=None)
# timedelta64 Series/DataFrame is also supported as a consequence
>>> ps.DataFrame({'td': [timedelta(hours=1, minutes=30)]})
               td
0 0 days 01:30:00
>>> ps.Series([timedelta(hours=1, minutes=30)])
0   0 days 01:30:00
dtype: timedelta64[ns]

How was this patch tested?

Unit tests.

SparkQA commented Nov 19, 2021

Test build #145423 has finished for PR 34657 at commit b860fe0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49896/

SparkQA commented Nov 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49896/

SparkQA commented Nov 30, 2021

Test build #145747 has finished for PR 34657 at commit 37d2ff4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50218/

SparkQA commented Nov 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50218/

@xinrong-meng marked this pull request as ready for review December 1, 2021 02:24
@xinrong-meng changed the title from "[WIP] Support TimedeltaIndex in pandas API on Spark" to "[SPARK-37511][PYSPARK] Introduce TimedeltaIndex to pandas API on Spark" Dec 1, 2021
@xinrong-meng changed the title from "[SPARK-37511][PYSPARK] Introduce TimedeltaIndex to pandas API on Spark" to "[SPARK-37511][PYTHON] Introduce TimedeltaIndex to pandas API on Spark" Dec 1, 2021
@HyukjinKwon (Member) commented:

nice!

SparkQA commented Dec 1, 2021

Test build #145787 has finished for PR 34657 at commit ffd1979.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50260/

@xinrong-meng (Member, Author) commented:

Considering the "Note" section in the PR description, shall we call the type timedelta64[ns] to follow pandas, or timedelta64[us] to reflect the microsecond precision?

@xinrong-meng (Member, Author) commented:

CC @ueshin @HyukjinKwon @itholic

raise TypeError("Index.name must be a hashable type")

if isinstance(data, (Series, Index)):
# TODO(SPARK-37512): Support TimedeltaIndex creation given a timedelta Series/Index

@xinrong-meng (Member, Author) commented on this line:

Supporting TimedeltaIndex creation from a timedelta Series/Index involves many changes in python/pyspark/pandas/data_type_ops/. Shall we implement that separately in https://issues.apache.org/jira/browse/SPARK-37512?

SparkQA commented Dec 1, 2021

Test build #145794 has finished for PR 34657 at commit 0317bde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50260/

SparkQA commented Dec 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50267/

SparkQA commented Dec 1, 2021

Test build #145797 has finished for PR 34657 at commit 5f00fe7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50268/

SparkQA commented Dec 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50267/

SparkQA commented Dec 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50272/

SparkQA commented Dec 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50268/

@HyukjinKwon (Member) left a comment:

Looks pretty good

SparkQA commented Dec 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50272/

@HyukjinKwon (Member) commented:

Merged to master.

HyukjinKwon added a commit that referenced this pull request Dec 2, 2021
…imedeltaIndex

### What changes were proposed in this pull request?

This PR is a follow-up to #34657 that extends the section underline so that it matches the length of the title.

### Why are the changes needed?

To fix the PySpark documentation build warning:

```
/.../spark/python/docs/source/reference/pyspark.pandas/indexing.rst:340: WARNING: Title underline too short.

TimedeltaIndex
-------------
```
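
The warning comes down to a length comparison: the title has 14 characters while the underline has 13. A minimal sketch of that check follows; `underline_too_short` is a hypothetical helper written for illustration, not part of Sphinx or docutils, and it assumes plain ASCII titles.

```python
def underline_too_short(title: str, underline: str) -> bool:
    """Return True when a section underline is shorter than its title.

    Hypothetical helper; assumes ASCII text, so len() equals display width.
    """
    return len(underline.rstrip()) < len(title.rstrip())


title = "TimedeltaIndex"
before_fix = "-" * 13          # the underline quoted in the warning
after_fix = "-" * len(title)   # an underline as long as the title

print(underline_too_short(title, before_fix))
print(underline_too_short(title, after_fix))
```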

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manual build of the PySpark documentation.

Closes #34775 from HyukjinKwon/SPARK-37511.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>