[SPARK-40437][SS][PYTHON] Support string representation of durationMs in GroupState.setTimeoutDuration#56178

Open
brijrajk wants to merge 1 commit into apache:master from brijrajk:SPARK-40437-groupstate-string-duration

@brijrajk
Contributor

What changes were proposed in this pull request?

GroupState.setTimeoutDuration previously accepted only an integer milliseconds value. This PR
extends it to also accept a Spark interval string (e.g. "5 minutes", "1 hour 30 minutes",
"1.5 seconds"), matching the behaviour of the Scala API's
GroupStateImpl.setTimeoutDuration(String) overload.

Changes:

  • Added _parse_timeout_duration(duration: str) -> int helper in
    python/pyspark/sql/streaming/state.py that converts a Spark interval string to milliseconds.
    Parsing behaviour mirrors Scala's IntervalUtils.stringToInterval and IntervalUtils.getDuration,
    including the 31-days-per-month convention that Structured Streaming uses for watermarks.
  • Updated setTimeoutDuration to accept Union[int, str] and call the helper when a string is
    passed.
  • Added INVALID_TIMEOUT_DURATION_STRING error class to
    python/pyspark/errors/error-conditions.json.
  • Added python/pyspark/sql/tests/streaming/test_state.py with 27 unit tests covering: all
    supported units, months/years (31-day convention), negative component offsets, fractional seconds,
    leading-dot decimals (.5 seconds), explicit +/- signs, whitespace between sign and
    quantity, the interval keyword prefix, compound durations, case-insensitivity, and various
    invalid-input cases.
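
To make the conversion concrete, here is a minimal sketch of how an interval string could be turned into milliseconds. It is not the PR's `_parse_timeout_duration`; the name `parse_timeout_duration_sketch`, the regular expression, and the use of the built-in `ValueError` (instead of `PySparkValueError`) are assumptions made for illustration. It applies the 31-days-per-month convention described above and allows fractions only on seconds.

```python
import re
from decimal import Decimal

# Microseconds per unit. Months and years follow the 31-days-per-month
# convention from the PR description (an assumption of this sketch).
_MICROS_PER_DAY = 86_400 * 1_000_000
_MICROS_PER_UNIT = {
    "microsecond": 1,
    "millisecond": 1_000,
    "second": 1_000_000,
    "minute": 60 * 1_000_000,
    "hour": 3_600 * 1_000_000,
    "day": _MICROS_PER_DAY,
    "week": 7 * _MICROS_PER_DAY,
    "month": 31 * _MICROS_PER_DAY,
    "year": 12 * 31 * _MICROS_PER_DAY,
}

# One "<sign> <quantity> <unit>" component, e.g. "5 minutes" or "- .5 seconds".
_COMPONENT = re.compile(r"\s*([+-]?)\s*(\d+\.?\d*|\.\d+)\s*([a-z]+)")


def parse_timeout_duration_sketch(duration: str) -> int:
    """Convert a string such as '1 hour 30 minutes' to whole milliseconds."""
    text = duration.strip().lower()
    if text.startswith("interval "):
        text = text[len("interval "):]  # optional keyword prefix
    total_micros = 0
    pos = 0
    while pos < len(text):
        match = _COMPONENT.match(text, pos)
        if match is None:
            raise ValueError(f"Invalid duration string: {duration!r}")
        sign, quantity, unit = match.groups()
        if unit.endswith("s"):
            unit = unit[:-1]  # accept plural unit names
        if unit not in _MICROS_PER_UNIT:
            raise ValueError(f"Unknown unit in duration string: {duration!r}")
        if "." in quantity and unit != "second":
            raise ValueError(f"Fractions are only allowed on seconds: {duration!r}")
        micros = int(Decimal(quantity) * _MICROS_PER_UNIT[unit])
        total_micros += -micros if sign == "-" else micros
        pos = match.end()
    total_ms = total_micros // 1_000
    if total_ms <= 0:
        raise ValueError(f"Duration must be positive: {duration!r}")
    return total_ms
```

Working in integer microseconds and dividing once at the end avoids floating-point rounding for inputs like "1.5 seconds"; a negative component such as "1 hour -30 minutes" simply offsets the total.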

Why are the changes needed?

The Scala API supports both setTimeoutDuration(long durationMs) and
setTimeoutDuration(String duration). The Python implementation only supported the integer form,
leaving users unable to use human-readable interval strings as described in SPARK-40437.

Does this PR introduce any user-facing change?

Yes. GroupState.setTimeoutDuration now also accepts a Spark interval string such as
"5 minutes" or "1 hour 30 minutes". The integer form continues to work unchanged.
This change is relative to the unreleased master branch.
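
The user-facing behaviour can be sketched with a small stand-in class. `GroupStateSketch`, its constructor arguments, and the placeholder `_simple_duration_to_ms` are hypothetical and exist only to show the int-or-string dispatch; the real `GroupState` is constructed by Spark, and its checks and error types may differ.

```python
from typing import Union

_UNIT_MS = {"second": 1_000, "minute": 60_000, "hour": 3_600_000}


def _simple_duration_to_ms(duration: str) -> int:
    """Placeholder parser: handles only '<integer> <unit>' with one unit."""
    quantity, unit = duration.strip().lower().split()
    return int(quantity) * _UNIT_MS[unit.rstrip("s")]


class GroupStateSketch:
    """Hypothetical stand-in for GroupState; not the PySpark class."""

    def __init__(self, batch_time_ms: int, processing_time_timeout: bool = True) -> None:
        self.batch_time_ms = batch_time_ms
        self.processing_time_timeout = processing_time_timeout
        self.timeout_timestamp_ms = -1

    def setTimeoutDuration(self, durationMs: Union[int, str]) -> None:
        if not self.processing_time_timeout:
            raise RuntimeError("processing time timeout is not enabled")
        if isinstance(durationMs, str):
            # String form: convert to milliseconds first.
            durationMs = _simple_duration_to_ms(durationMs)
        if durationMs <= 0:
            raise ValueError("timeout duration must be positive")
        # The timeout fires relative to the current batch's processing time.
        self.timeout_timestamp_ms = self.batch_time_ms + durationMs


state = GroupStateSketch(batch_time_ms=1_000)
state.setTimeoutDuration("5 minutes")  # new string form
state.setTimeoutDuration(300_000)      # existing integer form, same effect
```

Both calls leave the stand-in with the same timeout timestamp, which is the equivalence the PR description claims for the two forms.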

How was this patch tested?

27 new pure-Python unit tests in python/pyspark/sql/tests/streaming/test_state.py, covering
both positive cases (all units, compound durations, fractional seconds, edge-case signs and
whitespace) and negative cases (invalid strings, non-positive durations, wrong timeout mode).

Tests can be run without a full Spark build:

source .venv/bin/activate
PYTHONPATH=python python3 -m unittest pyspark.sql.tests.streaming.test_state -v

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic)

@brijrajk force-pushed the SPARK-40437-groupstate-string-duration branch 3 times, most recently from 1b06aaa to d063818, on May 29, 2026 12:35
@brijrajk
Contributor Author

Could a committer please review this? It extends GroupState.setTimeoutDuration to accept a Spark interval string (e.g. "5 minutes", "1 hour 30 minutes") in addition to integer milliseconds, matching the existing Scala API overload (SPARK-40437).

cc @zhengruifeng @itholic

@zhengruifeng changed the title from "[SPARK-40437][PYTHON] Support string representation of durationMs in GroupState.setTimeoutDuration" to "[SPARK-40437][SS][PYTHON] Support string representation of durationMs in GroupState.setTimeoutDuration" on Jun 3, 2026
@zhengruifeng
Contributor

I think @HyukjinKwon and @HeartSaVioR should have more context as per the discussion in https://issues.apache.org/jira/browse/SPARK-40437

def setTimeoutDuration(self, durationMs: Union[int, str]) -> None:
    """
    Set the timeout duration in ms for this key.
    Processing time timeout must be enabled.
Contributor

Shall we add a versionchanged note to the docstring saying that str is now supported?

Contributor Author

@brijrajk brijrajk Jun 3, 2026

Good point, added! A versionchanged note is useful here because this is a behavioral change to an existing method — users upgrading from an older Spark version would not know that string durations are now accepted unless the API docs call it out explicitly.
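
For readers unfamiliar with the directive, a versionchanged note in a Sphinx-style docstring could look roughly like the sketch below. The class name `GroupStateDocSketch` is hypothetical, the wording is not taken from the PR, and `X.Y.Z` is a placeholder because the thread does not state the target release.

```python
from typing import Union


class GroupStateDocSketch:
    """Hypothetical stand-in used only to show the docstring layout."""

    def setTimeoutDuration(self, durationMs: Union[int, str]) -> None:
        """
        Set the timeout duration for this key.
        Processing time timeout must be enabled.

        .. versionchanged:: X.Y.Z
            ``durationMs`` also accepts an interval string such as ``"5 minutes"``.
        """
        # Body omitted on purpose: only the docstring is being illustrated.
        raise NotImplementedError("docstring illustration only")
```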

…GroupState.setTimeoutDuration

Allow `setTimeoutDuration` to accept a Spark interval string (e.g. '5 seconds',
'1 hour 30 minutes') in addition to an integer millisecond value, matching
the Scala-side overload. A Python parser converts supported time units
(weeks, days, hours, minutes, seconds, milliseconds, microseconds) to
milliseconds; month/year units and invalid strings raise PySparkValueError.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@brijrajk force-pushed the SPARK-40437-groupstate-string-duration branch from d063818 to 646a31c on June 3, 2026 13:07