Skip to content

fix(tests): stabilize flaky durable timeout tests#9035

Merged
bnusunny merged 3 commits into
aws:developfrom
bnusunny:fix/durable-timeout-test-flakes
May 20, 2026
Merged

fix(tests): stabilize flaky durable timeout tests#9035
bnusunny merged 3 commits into
aws:developfrom
bnusunny:fix/durable-timeout-test-flakes

Conversation

@bnusunny
Copy link
Copy Markdown
Contributor

Summary

  • The ExecutionTimeout and wait_for_callback timeout integration tests have been intermittently failing on unrelated PRs (e.g. run 26182603010).
  • Root cause: an upstream emulator/SDK issue causes the invoke event payload to occasionally not propagate to the Lambda handler. When the payload is lost, handlers fall back to short default durations (e.g. wait_seconds=2, timeout_seconds=120) and executions complete SUCCEEDED / stay RUNNING instead of timing out, so the assertions fail.
  • Fix: introduce dedicated handlers that hardcode the wait and callback timeout durations, removing the dependency on event-payload propagation. The original event-driven handlers remain for the tests that exercise input handling.

Changes:

  • New handler wait/wait_30_seconds.py (hardcoded 30s wait) wired up to the ExecutionTimeout function.
  • New handler wait_for_callback/wait_for_callback_short_timeout.py (hardcoded 5s/3s) wired up to a new WaitForCallbackShortTimeout function.
  • Tests updated to use --no-event and the new dedicated functions; obsolete timeout_test_event.json deleted.

Test plan

  • pytest tests/integration/local/invoke/test_invoke_durable.py::TestInvokeDurable::test_local_invoke_durable_function_timeout
  • pytest tests/integration/local/start_lambda/test_start_lambda_durable.py::TestStartLambdaDurable::test_local_start_lambda_invoke_timeout
  • pytest tests/integration/local/start_lambda/test_start_lambda_durable.py::TestStartLambdaDurable::test_local_start_lambda_invoke_wait_for_callback_timeout
  • Verify the non-timeout durable tests (HelloWorld, NamedStep, NamedWait, MapOperations, Parallel, callback success/failure/heartbeat) still pass and were not affected by template changes.

The ExecutionTimeout and wait_for_callback timeout integration tests were
intermittently failing due to an upstream emulator/SDK issue where event
payloads occasionally fail to propagate to the Lambda handler. When the
input was lost, handlers fell back to short default durations and the
executions completed successfully instead of timing out, causing the
assertions to fail.

Replace the event-driven handlers with dedicated handlers that hardcode
the wait and callback timeout durations, removing the dependency on the
event payload reaching the function and making the tests deterministic.
@bnusunny bnusunny requested a review from a team as a code owner May 20, 2026 20:47
bnusunny added 2 commits May 20, 2026 20:52
The :latest tag of public.ecr.aws/durable-functions/aws-durable-execution-emulator
currently fails to persist execution state, causing GetDurableExecution to return
404 immediately after a successful start_durable_execution. This breaks every
durable integration test (HelloWorld, NamedStep, callbacks, etc.), not just the
ones modified in this PR.

Pin the emulator image tag for the durable-functions CI matrix entry so the
suite is reproducible until :latest is republished cleanly.
Copy link
Copy Markdown
Member

@roger-zhangg roger-zhangg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coverage concern (finding 1 of 3): SAM CLI event-payload propagation to durable functions is no longer exercised by any test that would fail if the payload were silently dropped.

The two timeout tests being rewritten here were effectively the only durable integ tests whose pass/fail signal depended on event payload values reaching the handler. After this change:

  • test_local_invoke_callback_heartbeat (tests/integration/local/invoke/test_invoke_durable.py:181) passes {timeout_seconds: 60, heartbeat_timeout_seconds: 30}, but completes via callback success well before any timeout — it would still pass with the handler defaults (120/60).
  • test_local_callback_cli_heartbeat (tests/integration/local/start_lambda/test_start_lambda_durable.py:339) follows the same pattern and has the same property.

The PR description attributes the flake to an upstream emulator/SDK issue. If that's confirmed, this is acceptable — but the description doesn't include emulator/SDK logs proving the propagation gap is purely upstream. If SAM CLI is partly to blame for invoke-event drops to durable handlers, this PR removes the only integ-level detector.

Suggestion: add (or restore) one durable test where the assertion explicitly depends on the event payload being honored — e.g., a wait-for-callback test that sends a short-timeout payload and asserts CallbackTimedOut, or an invoke test that varies wait_seconds between two values and asserts the resulting wait duration. Otherwise a real propagation regression would land silently regardless of where the underlying bug lives.

# Assert invoke output shows timeout
execution_arn = self.assert_invoke_output(
stdout, input_data={"wait_seconds": 30}, execution_name=execution_name, expected_status="TIMED_OUT"
stdout, input_data={}, execution_name=execution_name, expected_status="TIMED_OUT"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finding 2 of 3: this weakens the only durable invoke test that asserted a non-trivial input made it into the execution-history output.

Before: input_data={"wait_seconds": 30}assert_invoke_output validated the input shown in the execution actually matched what was passed in.

After: input_data={} — there's nothing to validate against, so this branch of assert_invoke_output is effectively a no-op for propagation. Combined with the switch to --no-event, the timeout assertion now passes solely because the hardcoded handler does what it does, regardless of whether the event channel works at all.

If the goal is just stabilization, that's fine, but please consider keeping at least one durable test that asserts the input round-trips correctly through invoke → handler → execution history.

function_name, event_path=event_path, durable_execution_name=execution_name
)
command_list = self.get_invoke_command_list(function_name, no_event=True, durable_execution_name=execution_name)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finding 3 of 3: the -e <event_path> code path for durable invokes is no longer exercised by the timeout test.

It's still covered by test_local_invoke_callback_heartbeat (line 181) which uses event_path=..., so it's not a total loss — flagging mainly because the timeout flow was the test most likely to fail on a regression in event-file handling specific to durable invoke (e.g., payload encoding, file read path differences when DurableExecutionName is also set). Worth a sanity check that the heartbeat test would actually catch such a regression, since it doesn't assert on the event contents being honored either.

@bnusunny
Copy link
Copy Markdown
Contributor Author

bnusunny commented May 20, 2026

Thanks for the careful review — these are fair findings. I traced the propagation path locally to evaluate the suggested coverage:

What I tried: added an EchoInput durable function whose handler returns event["echo_value"] plus an integ test passing a sentinel payload via -e (and via Payload= for start-lambda) and asserting the sentinel round-trips into the printed Result.

What I found: the test reproduces a real propagation drop. SAM CLI prints Input: {"echo_value": "..."} correctly in the execution summary (so SAM CLI → emulator's StartDurableExecution.Input is intact and the emulator stores it on the EXECUTION operation's InputPayload), but the handler receives event={}. Confirmed by replacing the body with raise RuntimeError(f"event={event!r}") — output is event={} type=dict. The drop happens after the emulator (which both stores and surfaces the payload via GetDurableExecution), so the cause is downstream of samcli/local/docker/durable_functions_emulator_container.py:start_durable_execution — most likely in the emulator → RAPID lambda_client.invoke(Payload=...) hop, or in how the durable SDK's handler wrapper reconstructs InitialExecutionState.Operations[0].ExecutionDetails.InputPayload. I did not finish localizing it.

On scope for this PR: since this is upstream of the timeout-flake fix, and a working propagation integ test would currently fail in CI on develop today, I'm leaving #9035 narrow:

  • not adding a new propagation test in this PR;
  • I'll file a separate aws-sam-cli issue with the reproducer and what I observed so the bug isn't lost.

@bnusunny bnusunny requested a review from roger-zhangg May 20, 2026 22:46
@bnusunny bnusunny added this pull request to the merge queue May 20, 2026
@bnusunny bnusunny removed this pull request from the merge queue due to a manual request May 20, 2026
@bnusunny bnusunny added this pull request to the merge queue May 20, 2026
Merged via the queue into aws:develop with commit 78c43a8 May 20, 2026
53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants