Skip to content

[SPARK-57073][SS][PYTHON][TESTS] Fix flaky test_listener_events_spark_command#56109

Draft
zhengruifeng wants to merge 1 commit into
apache:masterfrom
zhengruifeng:fix-listener-events-spark-command-dev4
Draft

[SPARK-57073][SS][PYTHON][TESTS] Fix flaky test_listener_events_spark_command#56109
zhengruifeng wants to merge 1 commit into
apache:masterfrom
zhengruifeng:fix-listener-events-spark-command-dev4

Conversation

@zhengruifeng
Copy link
Copy Markdown
Contributor

@zhengruifeng zhengruifeng commented May 26, 2026

What changes were proposed in this pull request?

Reintroduce the time.sleep(10) before reading the listener-written tables in test_listener_events_spark_command, and add a sticky comment explaining why the sleep must not be removed or replaced with @eventually.

Why are the changes needed?

SPARK-54957 replaced a time.sleep(60) with @eventually(timeout=60, catch_assertions=True). eventually only retries AssertionError by default. spark.read.table() on a missing table raises AnalysisException (TABLE_OR_VIEW_NOT_FOUND), which is not in the caught set, so the decorator aborts on the first attempt instead of polling.

The onQueryTerminated callback fires asynchronously after q.stop(), so the listener_terminated_events table is sometimes not yet written when the test first reads it. This makes the test flaky; it has failed on the Build / Python-only (master, Python 3.12, MacOS26) scheduled workflow on:

Both failed with the same exception at the same line.

A 10s sleep gives an order-of-magnitude safety margin over the observed sub-second race window, while avoiding the 60s waste that SPARK-54957 was trying to eliminate.

Does this PR introduce any user-facing change?

No, test-only change.

How was this patch tested?

Existing test pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests.test_listener_events_spark_command.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

@zhengruifeng zhengruifeng force-pushed the fix-listener-events-spark-command-dev4 branch from 0c58ed1 to 5c34f7e Compare May 26, 2026 09:01
### What changes were proposed in this pull request?

Reintroduce the `time.sleep(30)` before reading the listener-written
tables in `test_listener_events_spark_command`, and add a sticky comment
explaining why the sleep must not be removed or replaced with
`@eventually`.

### Why are the changes needed?

SPARK-54957 replaced a `time.sleep(60)` with
`@eventually(timeout=60, catch_assertions=True)`. `eventually` only
retries `AssertionError` by default. `spark.read.table()` on a missing
table raises `AnalysisException` (`TABLE_OR_VIEW_NOT_FOUND`), which is
not in the caught set, so the decorator aborts on the first attempt
instead of polling.

The `onQueryTerminated` callback fires asynchronously after `q.stop()`,
so the `listener_terminated_events` table is sometimes not yet written
when the test first reads it. This makes the test flaky; it has failed
on the `Build / Python-only (master, Python 3.12, MacOS26)` scheduled
workflow on:

- 2026-05-23 — https://github.com/apache/spark/actions/runs/26346300968/job/77556662680
- 2026-05-25 — https://github.com/apache/spark/actions/runs/26423905857/job/77783724134

Both failed with the same exception at the same line.

### Does this PR introduce _any_ user-facing change?

No, test-only change.

### How was this patch tested?

Existing test
`pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests.test_listener_events_spark_command`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7
@zhengruifeng zhengruifeng force-pushed the fix-listener-events-spark-command-dev4 branch from 5c34f7e to 7065ea1 Compare May 26, 2026 09:35
@zhengruifeng zhengruifeng changed the title [SS][PYTHON][TESTS] Fix flaky test_listener_events_spark_command [SPARK-57073][SS][PYTHON][TESTS] Fix flaky test_listener_events_spark_command May 26, 2026
@gaogaotiantian
Copy link
Copy Markdown
Contributor

I think the better way to solve this is to catch the potential exception in eventually. We can either add AnalysisException in expected_exceptions, or do it explicitly in the wrapped function. We should avoid using time.sleep as much as possible. That's a potential cause for unstable tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants