fix: handle CloudFormation throttling in import gateway polling#1185
Conversation
agentcore-cli-automation
left a comment
There was a problem hiding this comment.
Thanks for improving the import flow's throttle resilience — the shared poll() utility is a nice abstraction.
A couple of things worth addressing before merging, mostly around the new utility's error semantics. The PR's stated goal is to improve behavior under throttling, but as written the caller loses all context when polling gives up, which somewhat undermines debuggability in the exact scenario this PR targets. Details inline.
Adds a shared poll() utility with throttle-aware retry and migrates phase2-import.ts to use it. Previously, Rate exceeded errors from CloudFormation during concurrent e2e tests would crash the import operation. Now throttle errors are retried on the next poll iteration. Fixes: import-gateway e2e test failures under parallel execution
53872eb to
e38ac3e
Compare
…pping
Addresses PR review feedback:
- PollExhaustedError and PollTimeoutError now include the last error
as `cause` for debuggability (e.g., shows 'Rate exceeded' when
throttling exhausts retries)
- phase2-import.ts wraps poll errors with operation-specific messages
('Timed out waiting for change set creation') preserving original
error as cause
- Fixed misleading message when maxConsecutiveErrors triggers (now
reports actual attempt count)
- Added 3 tests verifying cause propagation
e38ac3e to
39164a3
Compare
agentcore-cli-automation
left a comment
There was a problem hiding this comment.
Reviewed the current state of the PR. The substantive concerns from the earlier review have been addressed:
39164a3preserveslastErrorascauseon bothPollExhaustedErrorandPollTimeoutError, so the throttling root cause is no longer discarded.39164a3also wrapsPollExhaustedError/PollTimeoutErrorat the call sites inphase2-import.tswith domain-specific messages (Timed out waiting for change set creation/Timed out waiting for import to complete), preserving the underlying error viacause.e3d827acorrectly scopesvi.useFakeTimers()to thebackoffdescribe block viabeforeEach/afterEach, so timer mocks don't leak into other tests.
The one remaining minor loose end already covered in the existing comment thread is the maxConsecutiveErrors branch still producing "Polling exhausted after N attempts" without noting that the abort was specifically due to consecutive errors (line 82). cause is now populated so users can diagnose, so IMO this is fine to leave as a follow-up polish.
The tests use real dependencies (no fs/SDK mocking) and only rely on fake timers where needed. No new serious issues from me — LGTM.
jesseturner21
left a comment
There was a problem hiding this comment.
Thanks for fixing this!
Description
Problem
There has been a flaky test on main due to CloudFormation throttling under parallel test execution. See https://github.com/aws/agentcore-cli/actions/runs/25528496015/job/74929406576 for an example. The key log is
[error] Phase 2 failed: Import change set failed: Rate exceeded. The existing retry logic in import does not gracefully handle throttling exceptions.Solution
introduce a general
pollutility that does handle transient errors, and is extendable to handle other types of errors. Migrate the import code to leverage it. Note that existing polling mechanisms exist, but none are generalized and re-usable. Migrating other cases is high-risk and low reward, and is therefore left out of scope here. They can be moved as needed.Related Issue
Partially addresses #1179
Documentation PR
N/A
Type of Change
Testing
How have you tested the change?
npm run test:unitandnpm run test:integnpm run typechecknpm run lintsrc/assets/, I rannpm run test:update-snapshotsand committed the updated snapshots17 unit tests added for the new polling utility. All pass along with typecheck.
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.