
fix: add per-environment reserved concurrency via CLI flags#217

Merged
alinarublea merged 2 commits into main from fix/reserved-concurrency
Mar 19, 2026

Conversation

@alinarublea
Contributor

@alinarublea alinarublea commented Mar 18, 2026

Summary

Add per-environment AWS Lambda reserved concurrency via --aws-reserved-concurrency CLI flag in deploy scripts. Values are based on observed CloudWatch ConcurrentExecutions metrics (7-day peak).

Concurrency Values

Environment   Reserved Concurrency
dev           10
stage         10
prod          10

Stage values match dev since the stage environment currently has no meaningful traffic.

The prod account has a 10,000 concurrent-execution limit — total reserved across all services is ~590, leaving ample headroom.

Context

On 2026-03-16, an unbounded SQS burst on content-processor saturated the PostgREST DB connection pool, cascading 503 errors to api-service and audit-worker. Reserved concurrency caps prevent any single Lambda from monopolizing the account's execution pool.

Rollback

Removing --aws-reserved-concurrency and redeploying does not remove the Lambda setting. To fully roll back:

# Per environment:
aws lambda delete-function-concurrency --function-name spacecat-services--task-processor --region us-east-1
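To confirm the setting is actually gone after the delete, a quick follow-up check (same function name and region as above; in my experience get-function-concurrency returns an empty result when no reserved concurrency is configured):

```shell
# Should return an empty object once reserved concurrency has been deleted
aws lambda get-function-concurrency \
  --function-name spacecat-services--task-processor \
  --region us-east-1
```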

Follow-ups

  • Verify SQS maxReceiveCount is tuned for throttled invocations
  • Add CloudWatch alarm on Lambda throttle metrics

Move awsReservedConcurrency to deploy script CLI flags for per-environment
differentiation (dev=10, stage=15, prod=25).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

This PR will trigger a patch release when merged.

@codecov

codecov bot commented Mar 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@alinarublea alinarublea requested a review from solaris007 March 18, 2026 12:24
Member

@solaris007 solaris007 left a comment


Hey @alinarublea,

Thanks for the quick response to the 2026-03-16 incident. Deploying concurrency caps across the fleet is the right instinct, and the CLI flag approach (--aws-reserved-concurrency=N per deploy script) is the correct helix-deploy mechanism: it achieves per-environment differentiation via deploy/deploy-stage/deploy-dev. That said, two pre-merge gates need to be cleared before any of these PRs ship.

Strengths

  • Minimal, focused diffs: each PR touches only the three deploy scripts and nothing else. Right scope for incident response.
  • Per-environment tiering (dev=10, stage=15, prod=25): graduated limits let throttling behavior surface in lower environments before prod.
  • Consistent mechanism across all 10 repos: the same CLI flag pattern makes the configuration greppable, auditable, and easy to update together.

Issues

Critical (Must Fix)

1. Prod limit of 25 not validated against actual concurrency - risk of immediate production degradation

package.json (prod deploy line)

The limit of 25 is applied uniformly across services with fundamentally different throughput profiles. Audit-worker (100+ audit types) and import-worker (ETL engine, 20+ import types) likely sustain peak concurrency well above 25 during normal operations. If any service's p99 ConcurrentExecutions exceeds 25, these PRs will cause throttling and SQS message buildup under normal load the moment they deploy to prod - the exact incident class they are designed to prevent.

Fix before merging: pull 30-day ConcurrentExecutions (p99 and max) from CloudWatch for each of the 10 Lambda functions. Share the data in each PR. Adjust limits for any service where p99 exceeds the proposed limit. Content-processor (the incident's source) likely justifies 25. Audit-worker and import-worker may need higher values.
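A sketch of the data pull, assuming us-east-1 and using audit-worker as the example function (GetMetricStatistics accepts either --statistics or --extended-statistics per call, not both, hence two calls; dates are illustrative):

```shell
FN=spacecat-services--audit-worker

# 30-day p99 of per-function ConcurrentExecutions, hourly resolution
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name ConcurrentExecutions \
  --dimensions Name=FunctionName,Value="$FN" \
  --extended-statistics p99 \
  --start-time 2026-02-16T00:00:00Z \
  --end-time 2026-03-18T00:00:00Z \
  --period 3600 \
  --region us-east-1

# Same window, Maximum statistic
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name ConcurrentExecutions \
  --dimensions Name=FunctionName,Value="$FN" \
  --statistics Maximum \
  --start-time 2026-02-16T00:00:00Z \
  --end-time 2026-03-18T00:00:00Z \
  --period 3600 \
  --region us-east-1
```

Repeat per function (or loop over the 10 names) and paste the resulting p99/max pairs into each PR.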

2. SQS maxReceiveCount interaction - concurrency cap may introduce silent data loss

package.json (all deploy scripts)

When Lambda is throttled by reserved concurrency, SQS increments ApproximateReceiveCount on messages that time out. If any of these services' queues have maxReceiveCount=1 in their SQS redrive policy (noted as a known risk in the content-processor PR description), a single throttle event routes the message straight to the DLQ - without it ever being processed. The concurrency cap becomes a data loss mechanism during the exact bursts it is designed to contain.

Fix before merging: audit all 10 services' SQS queue configurations in spacecat-infrastructure. Confirm maxReceiveCount >= 3 on every queue. If any queue has maxReceiveCount=1, increase it before enabling any concurrency cap.
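One way to run that audit without extra tooling — RedrivePolicy comes back as a JSON string, so a POSIX sed extraction avoids a jq dependency (the queue ARN and sample policy below are illustrative; the live value comes from get-queue-attributes):

```shell
# Live fetch per queue (queue URL illustrative):
#   aws sqs get-queue-attributes --queue-url "$QUEUE_URL" \
#     --attribute-names RedrivePolicy --query 'Attributes.RedrivePolicy' --output text
# Sample of the JSON string that call returns:
policy='{"deadLetterTargetArn":"arn:aws:sqs:us-east-1:123456789012:example-dlq","maxReceiveCount":1}'

# Extract maxReceiveCount with POSIX sed
count=$(printf '%s' "$policy" | sed -n 's/.*"maxReceiveCount":\([0-9][0-9]*\).*/\1/p')

if [ "$count" -lt 3 ]; then
  echo "WARN: maxReceiveCount=$count (< 3): a throttled receive can route the message to the DLQ"
fi
```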

Important (Should Fix)

3. helix-deploy version not verifiable - change could be a silent no-op

The PR description states "Requires helix-deploy >= 13.5.1." CI passes, but that does not prove the version requirement is met. The critical unknown: if the installed helix-deploy version is below 13.5.1, does it error on --aws-reserved-concurrency (in which case CI would have caught it) or silently ignore it (in which case no concurrency limit is applied and the team believes they are protected but is not)?

Fix: confirm helix-deploy's behavior with unknown flags, or verify the installed version in each repo's package-lock.json. A comment linking to the helix-deploy changelog entry for this feature would give reviewers confidence.
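A minimal version check, assuming the tool is installed from npm as @adobe/helix-deploy (its published package name):

```shell
# Prints the installed version from each repo's lockfile-resolved tree
npm ls @adobe/helix-deploy
```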

4. IAM permission not verified - first prod deploy after merge may fail

PutFunctionConcurrencyCommand requires lambda:PutFunctionConcurrency on the deploy role. If this permission is missing from spacecat-role-lambda-generic, the first deploy after merge will fail. The branch deploy test plan already asks you to verify via aws lambda get-function-concurrency - that check also implicitly confirms the IAM permission. Recommend doing this on dev before other environments.
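The dev-first verification could look like this (function name taken from the rollback section; adjust if the dev deployment uses a different name):

```shell
# After a dev deploy, confirm the limit was actually applied — a successful
# response also proves the deploy role holds lambda:PutFunctionConcurrency
aws lambda get-function-concurrency \
  --function-name spacecat-services--task-processor \
  --region us-east-1
# Should report ReservedConcurrentExecutions matching the deployed flag value
```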

5. Rollback procedure uses placeholder function names

The PR description documents the rollback (aws lambda delete-function-concurrency) but uses <function-name> as a placeholder. Under incident pressure, on-call should not need to look up the Lambda function name. Populate the concrete function name for this service (likely following the spacecat-services--<service-name> naming convention), or link to a runbook mapping services to function names.

6. Reserved concurrency is the blunter primitive for SQS-backed services

AWS offers ScalingConfig.MaximumConcurrency on Lambda SQS event source mappings (GA since Nov 2023). This caps concurrency for the SQS trigger only, does not reserve from the account-wide Lambda pool, and does not affect other invocation paths (direct invoke, Step Functions). Reserved concurrency caps ALL invocation paths and permanently removes units from the shared pool. With 10 services x 25 = 250 reserved units, this is a meaningful draw on the default 1,000-unit account limit.

This is not a blocker for the incident response - reserved concurrency does solve the immediate problem. Track switching to SQS ESM MaximumConcurrency as a follow-up for the more surgical long-term primitive.
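For the tracked follow-up, the ESM route is a two-step CLI change per service (content-processor used as the example; the UUID placeholder must come from the first command):

```shell
# Find the SQS event source mapping for the function
aws lambda list-event-source-mappings \
  --function-name spacecat-services--content-processor \
  --query 'EventSourceMappings[].UUID' --output text

# Cap concurrency for the SQS trigger only, leaving the shared pool untouched
aws lambda update-event-source-mapping \
  --uuid <uuid-from-above> \
  --scaling-config MaximumConcurrency=25
```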

Minor (Nice to Have)

7. No CloudWatch alarms on Throttles metric

Once the limit is live, throttling becomes expected behavior during bursts. Without alarms on the Throttles metric per function, there is no signal to distinguish healthy burst throttling from sustained degradation indicating the limit is set too low.
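A sketch of such an alarm — sustained throttling over three 5-minute periods (alarm name, threshold, and SNS topic ARN are illustrative, not existing resources):

```shell
aws cloudwatch put-metric-alarm \
  --alarm-name spacecat-content-processor-throttles \
  --namespace AWS/Lambda \
  --metric-name Throttles \
  --dimensions Name=FunctionName,Value=spacecat-services--content-processor \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:spacecat-alerts
```

Tuning the threshold distinguishes healthy burst throttling (brief, low-volume) from a limit set too low (sustained, high-volume).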

8. Account-level concurrency budget

10 services x 25 (prod) = 250 reserved units removed from the shared account pool. If other SpaceCat services also use reserved concurrency, the remaining unreserved pool may be smaller than expected. Verify total account reserved concurrency against the account limit before prod deploy.

Recommendations

  1. Root cause: reserved concurrency is supply-side throttling. The structural fix is right-sizing PostgREST's connection pool relative to total consumer concurrency (or adding PgBouncer). These PRs are a valid compensating control, not the fix. Track separately.
  2. After deployment, tune per-service limits based on observed ConcurrentExecutions rather than keeping the uniform 25 permanently.
  3. Consider rolling out sequentially - content-processor first (the cause), then lower-traffic services, then high-throughput services last - rather than merging all 10 simultaneously.

Assessment

Ready to merge? No - with two specific pre-merge gates.

The approach is correct and the implementation is clean. Two gates must be cleared before merge:

  1. Pull CloudWatch ConcurrentExecutions p99/max for all 10 functions and confirm no service normally exceeds its proposed limit.
  2. Verify maxReceiveCount >= 3 on all 10 services' SQS queues.

Once confirmed (and limits adjusted for any over-the-cap services), these PRs are ready to ship.

@alinarublea
Contributor Author

Computed Reserved Concurrency Values

Based on CloudWatch ConcurrentExecutions metrics (last 7 days):

Environment   Peak Concurrency   Reserved Concurrency
dev           3                  10
prod          5                  10

Stage is set to match prod values (10).

Values computed from spacecat-dev and spacecat-prod AWS profiles.

@alinarublea alinarublea requested a review from solaris007 March 18, 2026 15:23
Member

@solaris007 solaris007 left a comment


Hey @alinarublea,

Thanks for updating the concurrency values based on actual CloudWatch metrics - that was the main blocker from the previous review and it's been addressed well. The per-service differentiation is a significant improvement. A few items remain.

Strengths

  • Previously flagged issue resolved: concurrency limits are now differentiated per service based on observed CloudWatch ConcurrentExecutions data. High-throughput services (content-scraper=200, audit-worker=100, content-processor=100, scrape-job-manager=100) get proportionally higher limits, while low-traffic services (autofix-worker=10, task-processor=10, fulfillment-worker=10, import-job-manager=10) are capped conservatively. This is the right approach.
  • Content-processor rollback instructions now include the concrete function name (spacecat-services--content-processor).

Issues

Important (Should Fix)

1. PR descriptions no longer match the code

All 10 PR descriptions still show the original uniform table:

Environment   Concurrency Limit
dev           10
stage         15
prod          25

The actual values in the code are now very different per service (e.g., content-scraper is 200/200/10, audit-worker is 100/100/10). When someone investigates concurrency behavior during an incident, they will read these descriptions and make decisions based on wrong numbers. Update each PR description to reflect the actual per-service values.

2. Stage limits should be reduced - stage environment is unused

Stage values currently match prod across all 10 services (total: 590 reserved units in stage). The stage environment has no data or content currently, so reserving prod-level concurrency there wastes account pool budget - those reserved units cannot be used by other Lambda functions in the account. Stage limits should be set to dev-level or lower until stage has meaningful traffic.

3. Total account-level reserved concurrency is now 590 in prod

The updated per-service limits total 590 reserved units in prod (200+100+100+100+25+25+10+10+10+10). On a default 1,000-unit account, that leaves only 410 unreserved units for api-service, jobs-dispatcher, and all other Lambda functions. Verify the account's concurrency limit accommodates this, or confirm the limit has been raised.
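The arithmetic behind those numbers, plus the call that gives the live account figures (the limit values are the ones quoted in this review):

```shell
# Sum the proposed prod reserved-concurrency values across the 10 services
limits="200 100 100 100 25 25 10 10 10 10"
total=0
for l in $limits; do total=$((total + l)); done
echo "total reserved: $total"                                  # 590
echo "unreserved on a 1,000-unit account: $((1000 - total))"   # 410

# The live numbers come from:
#   aws lambda get-account-settings \
#     --query '{limit:AccountLimit.ConcurrentExecutions,unreserved:AccountLimit.UnreservedConcurrentExecutions}'
```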

4. Audit-worker CI build is failing (audit-worker #2158 only)

The latest CI run shows build: fail. This needs to be resolved before merge.

5. Rollback instructions still use <function-name> placeholder (9 of 10 PRs)

Content-processor correctly uses spacecat-services--content-processor. The other 9 PRs still use <function-name>. Populate the concrete Lambda function name for each service.

Minor (Nice to Have)

6. SQS maxReceiveCount verification still outstanding

This was flagged in the prior review and acknowledged as a tracked follow-up. Not blocking merge, but should be resolved before these limits encounter real throttling in prod.

Assessment

Ready to merge? With fixes.

The core mechanism is correct and the concurrency values are now data-driven. Fix the PR descriptions (issue #1), reduce stage values (issue #2), fix the audit-worker build (issue #4), and fill in rollback function names (issue #5). Issues #3 and #6 can be verified async but should be checked before prod deployment.

Next Steps

  1. Update all 10 PR descriptions to match the actual concurrency values in code.
  2. Lower stage reserved concurrency to dev-level across all services.
  3. Fix audit-worker CI build failure.
  4. Populate concrete Lambda function names in rollback instructions.

@alinarublea alinarublea requested a review from solaris007 March 19, 2026 09:29
Member

@solaris007 solaris007 left a comment


Hey @alinarublea,

All previously flagged issues have been addressed across three review rounds. Nice work iterating on this.

Strengths

  • Previously flagged issues now addressed:
    • Concurrency limits are now per-service based on observed CloudWatch ConcurrentExecutions data, ranging from 10 (low-traffic services) to 200 (content-scraper) in prod.
    • Stage values lowered to dev-level since the stage environment currently has no meaningful traffic.
    • PR descriptions updated to reflect actual per-service values, document the account's 10,000 concurrency limit (590 total reserved = 5.9%), and include concrete rollback commands with real Lambda function names.
    • Audit-worker CI build now passing after merge from main.
  • Clean, focused diffs: each PR modifies only the three deploy scripts in package.json.
  • Data-driven approach: values justified by 7-day CloudWatch peaks rather than arbitrary numbers.
  • Follow-ups documented: SQS maxReceiveCount, CloudWatch Throttles alarms, and ESM MaximumConcurrency tracked separately.

Assessment

Ready to merge? Yes.

Three rounds of review have addressed all critical, important, and minor issues. The per-service concurrency values are data-driven, stage values are appropriately conservative, rollback procedures are documented with concrete function names, and the account-level budget (590/10,000) has ample headroom. The tracked follow-ups (maxReceiveCount, Throttles alarms, ESM MaximumConcurrency) are reasonable to address post-merge.

@alinarublea alinarublea merged commit f4a8f42 into main Mar 19, 2026
16 checks passed
@alinarublea alinarublea deleted the fix/reserved-concurrency branch March 19, 2026 14:08
solaris007 pushed a commit that referenced this pull request Mar 19, 2026
## [1.12.1](v1.12.0...v1.12.1) (2026-03-19)

### Bug Fixes

* add per-environment reserved concurrency via CLI flags (#217) (f4a8f42)
@solaris007
Member

🎉 This PR is included in version 1.12.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
