v1.222.0-rc.8
Pre-release

docs(ci): document `github/artifacts` planfile runtime-token requirement + E2E test Erik Osterman (Cloud Posse) (@osterman) (#2649)
## what
- Planfile storage works end-to-end in CI. The `github/artifacts` store talks to the GitHub Artifacts runtime API for both upload and download, so a planfile uploaded by a `plan` job can be consumed by a separate `deploy` job in the same run.
- Automatic, configurable drift verification on `deploy`. When planfile storage is configured and `atmos terraform deploy` runs under CI, Atmos downloads the stored plan, re-plans, compares the two with a semantic JSON plan-diff, and applies the verified plan, failing on drift by default. Configurable via `components.terraform.planfiles.verify` (`fail` | `warn` | `off`) and `--verify-plan`/`--no-verify-plan` (CLI > config > CI default).
- Generalized the in-repo `github-runtime` action to advertise planfiles, documented the runtime-token requirement, and added the automatic-flow `planfile-verify-e2e` workflow (kept the manual `planfile-artifacts-e2e`).
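The precedence rule above (CLI > config > CI default) can be sketched in Go as follows. This is a minimal illustration, not Atmos's code: `resolveVerifyMode` and `VerifyMode` are hypothetical names, and two behaviors are assumptions not stated in the PR: that `--verify-plan` keeps a configured `warn` and otherwise fails on drift, and that verification defaults to off outside CI.

```go
package main

import "fmt"

// VerifyMode mirrors the documented values of
// components.terraform.planfiles.verify.
type VerifyMode string

const (
	VerifyFail VerifyMode = "fail"
	VerifyWarn VerifyMode = "warn"
	VerifyOff  VerifyMode = "off"
)

// resolveVerifyMode is a hypothetical helper, not Atmos's code. cliFlag is nil
// when neither --verify-plan nor --no-verify-plan was passed; configMode is ""
// when components.terraform.planfiles.verify is unset.
func resolveVerifyMode(cliFlag *bool, configMode VerifyMode, isCI bool) VerifyMode {
	// 1. An explicit CLI flag wins.
	if cliFlag != nil {
		if !*cliFlag {
			return VerifyOff
		}
		// Assumption: --verify-plan keeps a configured "warn", otherwise fails on drift.
		if configMode == VerifyWarn {
			return VerifyWarn
		}
		return VerifyFail
	}
	// 2. Otherwise the configured mode applies.
	if configMode != "" {
		return configMode
	}
	// 3. Otherwise the CI default: fail on drift under CI (assumed off elsewhere).
	if isCI {
		return VerifyFail
	}
	return VerifyOff
}

func main() {
	on, off := true, false
	fmt.Println(resolveVerifyMode(nil, "", true))          // fail
	fmt.Println(resolveVerifyMode(nil, VerifyWarn, true))  // warn
	fmt.Println(resolveVerifyMode(&off, VerifyFail, true)) // off
	fmt.Println(resolveVerifyMode(&on, "", false))         // fail
}
```

Modeling the flag as `*bool` keeps "not passed" distinct from "passed as false", which is what lets the config layer apply only when the CLI is silent.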
why
- The same-run plan→deploy handoff (the core CI use case) was broken: GitHub's REST API won't serve an in-progress run's artifact, and verification was opt-in and undocumented.
- A planfile legitimately varies between review and apply (values known-after-apply, computed fields, hashes, ordering). A naive diff rejects a still-valid plan as "drifted"; the semantic comparison tolerates benign variation while catching real drift — which is what makes plan-then-deploy practical.
- Verification belongs on `deploy` (which re-runs `plan`, so a fresh plan exists to diff against), not `apply` (which never re-plans).
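To illustrate why a semantic comparison tolerates benign variation, here is a deliberately simplified sketch that compares only each resource's planned `actions` from Terraform's JSON plan output (`resource_changes[].address` and `change.actions`), keyed by address so ordering does not matter. It is not the Atmos plan-diff, which is more thorough; `drift` and `actionsByAddress` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// planJSON models only the part of `terraform show -json` output used here.
type planJSON struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Change  struct {
			Actions []string `json:"actions"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// actionsByAddress indexes planned actions by resource address, so the order
// of resource_changes in the JSON does not affect the comparison.
func actionsByAddress(raw []byte) (map[string]string, error) {
	var p planJSON
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, err
	}
	out := make(map[string]string, len(p.ResourceChanges))
	for _, rc := range p.ResourceChanges {
		out[rc.Address] = strings.Join(rc.Change.Actions, ",")
	}
	return out, nil
}

// drift lists addresses whose planned actions differ between the stored plan
// and the fresh plan, or that appear in only one of them. Attribute values
// (including known-after-apply ones) are deliberately not compared.
func drift(stored, fresh []byte) ([]string, error) {
	a, err := actionsByAddress(stored)
	if err != nil {
		return nil, err
	}
	b, err := actionsByAddress(fresh)
	if err != nil {
		return nil, err
	}
	var diffs []string
	for addr, act := range a {
		if other, ok := b[addr]; !ok || other != act {
			diffs = append(diffs, addr)
		}
	}
	for addr := range b {
		if _, ok := a[addr]; !ok {
			diffs = append(diffs, addr)
		}
	}
	sort.Strings(diffs)
	return diffs, nil
}

func main() {
	stored := []byte(`{"resource_changes":[
	  {"address":"aws_s3_bucket.a","change":{"actions":["create"]}},
	  {"address":"aws_s3_bucket.b","change":{"actions":["no-op"]}}]}`)
	reordered := []byte(`{"resource_changes":[
	  {"address":"aws_s3_bucket.b","change":{"actions":["no-op"]}},
	  {"address":"aws_s3_bucket.a","change":{"actions":["create"]}}]}`)
	replaced := []byte(`{"resource_changes":[
	  {"address":"aws_s3_bucket.a","change":{"actions":["create"]}},
	  {"address":"aws_s3_bucket.b","change":{"actions":["delete","create"]}}]}`)

	benign, _ := drift(stored, reordered) // same changes, different order
	drifted, _ := drift(stored, replaced) // bucket b is now replaced
	fmt.Println(benign, drifted)          // [] [aws_s3_bucket.b]
}
```

A naive byte-level diff would flag the reordered plan; keying by address is the smallest step toward the semantic comparison the PR describes.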
references
- Docs: Planfile Storage, Planfile drift verification, `atmos terraform deploy`
- Changelog: `website/blog/2026-06-22-native-ci-planfile-verification.mdx`
Summary by CodeRabbit
Release Notes
- New Features
  - Configurable Terraform planfile drift verification during deploy with three modes: `fail` (strict), `warn` (proceed), or `off` (skip).
  - GitHub Actions Artifacts support for same-run planfile downloads via runtime API.
  - CLI flags (`--verify-plan`/`--verify-plan=false`) to override configuration at runtime.
- Tests
  - Added GitHub Actions E2E workflows for planfile artifacts and verification scenarios.
  - Comprehensive unit and integration test coverage for verification modes and storage operations.
- Documentation
  - Updated planfile storage guides with drift verification behavior and GitHub Actions setup instructions.
  - Blog post explaining plan-then-deploy workflow and configuration options.
- Chores
  - GitHub Actions runtime action now exposes credentials for planfile artifact storage.
feat: native container steps for workflows and custom commands Erik Osterman (Cloud Posse) (@osterman) (#2626)
## what
- Add a native `type: container` step (build/push/run) to the shared step library used by both workflows and custom commands, built on the existing `pkg/container` Docker/Podman runtime (new ephemeral one-shot runner plus image build/tag/push/inspect helpers; `ImageInspect` added to the `Runtime` interface, mocks regenerated).
- Formalize step outputs: every named step exposes `value`/`values`/`metadata`/`outputs`/`skipped`/`error` (command-like steps add `stdout`/`stderr`/`exit_code`), so a build step can publish an image reference consumed by later push/run steps via `{{ .steps.<name>.outputs.<key> }}`.
- Support per-step `identity` for registry auth and Docker Buildx + Buildx Bake builds; Podman uses the native `podman build` path.
- Add the `examples/container-step` example and a hermetic GitHub Actions job (`[container-step]`) that exercises build → push → run against a `registry:2` service on `localhost:5000`, including failure propagation.
- Document the step type (`website/docs/workflows`), add a changelog blog post, and update the roadmap (container steps + step outputs marked shipped).
- Land the design PRDs for the follow-on primitives (`container-components.md`, `compose-components.md`, and a rewritten membership-based `compositions.md`) and trim `container-actions-and-step-outputs.md` to cover only the procedural step. Remove the earlier `targets:`-based composition scaffolding (`pkg/composition`, `cmd/composition`, the composition step, and `schema.Composition*`) in favor of those PRDs.
- Split the container-step handler into focused files and reduce complexity to satisfy the lint gate.
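The step-output reference syntax `{{ .steps.<name>.outputs.<key> }}` reads like Go `text/template` field access. Assuming that, a minimal sketch of how a later step's argument could be rendered from an earlier step's outputs follows; `renderStepArg` and the exact shape of the `steps` map are hypothetical, and enabling `missingkey=error` is an assumption made here so that a mistyped step name fails loudly.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderStepArg expands {{ .steps.<name>.outputs.<key> }} references in a
// step argument. The shape of the steps map is a hypothetical stand-in for
// the context Atmos builds from earlier steps' outputs.
func renderStepArg(arg string, steps map[string]any) (string, error) {
	tmpl, err := template.New("arg").Option("missingkey=error").Parse(arg)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, map[string]any{"steps": steps}); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Outputs recorded after a hypothetical build step named "build".
	steps := map[string]any{
		"build": map[string]any{
			"outputs": map[string]any{"image": "localhost:5000/app:dev"},
			"skipped": false,
		},
	}
	cmd, err := renderStepArg("docker push {{ .steps.build.outputs.image }}", steps)
	fmt.Println(cmd, err) // docker push localhost:5000/app:dev <nil>

	// A typo in the step name fails instead of rendering "<no value>".
	_, err = renderStepArg("{{ .steps.biuld.outputs.image }}", steps)
	fmt.Println(err != nil) // true
}
```

Passing structured outputs through the template context is what removes the shell parsing and temporary env files mentioned below.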
why
- Atmos workflows and custom commands increasingly resemble CI pipelines; running containers natively (build images, push to registries, run tools) removes the need for one-off shell scripts and keeps the same automation usable locally and in CI.
- A first-class step-outputs contract lets build → push → run/deploy pipelines pass structured values without shell parsing or temporary env files.
- The procedural container step is the shippable foundation; the component kinds (container, compose) and compositions are specified as PRDs so the broader system can be designed and reviewed before implementation, without blocking this PR.
references
- PRDs: `docs/prd/container-actions-and-step-outputs.md`, `docs/prd/container-components.md`, `docs/prd/compose-components.md`, `docs/prd/compositions.md`
- Changelog: `website/blog/2026-06-17-native-container-steps.mdx`
- Roadmap initiative: "Container Composition & Local Development" in `website/src/data/roadmap.js`
Summary by CodeRabbit
- New Features
  - Added native `type: container` workflow step actions: `build`, `push`, `run`, and `inspect`, with image metadata and step exit/status details.
  - Added workflow-level container sandbox support for shared execution context across shell steps.
  - Added step `outputs` support with `{{ .steps.<name>.outputs.<key> }}` propagation for later steps.
  - Improved Podman runtime auto-start/recovery and identity-based registry authentication in container flows.
- Bug Fixes
  - Improved environment-variable casing handling for nested `env:` declarations.
- Documentation
  - Added extensive docs and examples for container steps, workflow container config, and step outputs.
- Tests / CI
  - Expanded automated tests and added CI coverage for local container build→push→run flows.
🚀 Enhancements
fix(secret): skip remote-state reads in credential-free secret list Brian Ojeda (@sgtoj) (#2657)
## what
- Make credential-free `atmos secret list` skip the YAML functions that perform authenticated backend reads (`!terraform.state`, `!terraform.output`, `!store`, `!store.get`) while it enumerates secret declarations.
- Add a `credentialFreeSkip()` helper and use it in the two credential-free paths: `secret list -s <stack>` enumeration and the single-scope `secret list -s <stack> -c <component>` path without `--verify`.
- Authenticated paths (`get`/`set`/`exec`/`shell`, and `secret list --verify`) are unchanged; they keep skipping only `!secret`.
- Adds `TestCredentialFreeSkip` pinning the skip set and a `docs/fixes` write-up.
why
- Secret listing is intentionally credential-free: it disables authentication so a large stack doesn't run one full auth cycle per component. But it still evaluated `!terraform.state`/`!terraform.output`/`!store` in component `vars`/`settings`. With auth disabled, the S3 backend assumes its configured role with no base credentials, the AWS SDK falls back to the default credential chain, and ultimately dials the EC2 IMDS endpoint, which is unreachable on a workstation. The command therefore aborts with a confusing assume-role/credentials error, even immediately after a successful `atmos auth login`.
- Listing only needs the static `secrets.vars` declarations (`secrets.ExtractDeclarations`), which never require a resolved value. Evaluating these functions was unnecessary and failure-prone. A skipped function leaves its raw string in place, which the declaration extractor ignores, so discovery is unchanged.
- This is a regression: before credential-free enumeration was introduced, `secret list` authenticated per component, so these reads had credentials (slow, but working). Disabling auth removed the credentials without removing the reads.
references
- Related to #2639 (originally reported against `atmos secret list`).
- Follow-up to #2646, which made secret-list enumeration credential-free but left the credentialed function evaluation in place.
- Write-up: `docs/fixes/2026-06-23-secret-list-credential-free-skip.md`
- Verified with `go test ./cmd/secret/...` and the repo's `custom-gcl` lint (both green), and end-to-end against a multi-account repo whose components reference cross-account `!terraform.state`: `secret list -s <stack>` aborted before and completes after (no state reads, no credential-resolution fallbacks).
Summary by CodeRabbit
- Bug Fixes
- Fixed `atmos secret list` failing with credential errors after `atmos auth login` by preventing credential-free operations from attempting authenticated backend reads.
fix(auth): retry transient auth on freshly-brokered STS git clones Erik Osterman (Cloud Posse) (@osterman) (#2653)
## what
- Retry transient git authentication failures within a bounded window (default 30s, exponential backoff + jitter) only when Atmos brokered a fresh GitHub STS token in this process. This is wired through a new `broker.HasBrokeredCredentials()` signal and a `CustomGitGetter.RetryAuthErrors` flag (existing per-source `retry:` config still takes precedence).
- Keep auth failures terminal (fail fast) for non-brokered/static-credential clones, so a genuinely wrong or expired token is never masked by retries.
- Surface previously swallowed credential-broker failures at `Warn` (was `Debug`, invisible at the default `Warning` log level) and log an actionable `Error` when the brokered-auth retry window is exhausted.
- Add tests: brokered retry succeeds, non-brokered fails fast, bounded-budget exhaustion, and a `-race` concurrency guard proving `EnsureCredentials` provisions exactly once with a happens-before barrier.
why
- Under Atmos Pro cross-repo STS, `atmos vendor pull` intermittently failed with `fatal: Authentication failed` even though the same run logged a successful token mint and OIDC auth; a subset of clones failed and a rerun was clean.
- Root cause is GitHub's brief post-creation 401 window: a just-minted installation token is not yet valid across all of GitHub's git frontends. The atmos-pro server already self-heals its own API calls on this 401 (Sentry `APP-CLOUDPOSSE-COM-AM2`), but the CLI git path did not: `isRetryableGitError` treated auth as terminal and vendor sources have no retry by default, so the earliest clones failed hard.
- This gives the CLI the same tolerance the server has, scoped narrowly to brokered tokens so static credentials still fail fast, and removes the observability gap that made the failure hard to diagnose.
references
- Token TTL is GitHub's standard ~60 min (confirmed in atmos-pro `mint.ts`/token-provider), ruling out mid-run expiry; the post-mint propagation window is the cause.
- Follow-up (out of scope): `revoke_on_exit` cross-process token race on the shared, unlocked `github/sts` `state.json`.
Summary by CodeRabbit
- New Features
  - Added automatic, bounded retries for transient Git authentication failures when using brokered GitHub App credentials, including a sensible default when no explicit retry window is configured.
  - Introduced a `RetryAuthErrors` setting on the custom Git getter to enable this behavior.
- Improvements
  - Credential brokering now runs only for remote sources, and auth-retry logic activates only after brokered credentials are successfully exported.
  - Enhanced warning logs when credential broker provisioning or environment export fails.
- Tests
  - Expanded coverage for brokered-credential detection, export-failure handling, and concurrent provisioning, plus brokered vs non-brokered retry behavior.
- Chores/CI
  - Pinned the `gofumpt` version in pre-commit.
perf(stacks): dedupe per-identity auth in nested terraform.state resolution Brian Ojeda (@sgtoj) (#2656)
## what
- Extends the per-component `AuthManager` memoization introduced in #2652 from the top-level `describe stacks` pass into the nested resolution path that runs while templates and YAML functions are evaluated (`!terraform.state`, `!terraform.output`, `atmos.Component(...)`).
- Adds a process-scoped `nestedAuthManagerCache`, consulted by `resolveAuthManagerForNestedComponent`, keyed by the parent auth chain plus a deterministic JSON fingerprint of the component's auth section.
- Extracts the key logic into a shared `buildComponentAuthCacheKey` used by both the processor cache (#2652) and this nested path, so the two keying strategies cannot drift.
- Caches only successful, non-nil resolutions; `ResetStateCache()` also clears the new cache (kept consistent with the `terraformStateCache` it pairs with). Neither is reset in production.
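The keying and caching rules above can be sketched in Go. `buildKey`, `nestedCache`, and `authManager` are hypothetical stand-ins for `buildComponentAuthCacheKey`, `nestedAuthManagerCache`, and `AuthManager`; the SHA-256 over `json.Marshal` output is one plausible fingerprint (Go's `encoding/json` emits map keys in sorted order, which makes it deterministic), not necessarily the one Atmos uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

// buildKey: parent auth chain plus a SHA-256 of the auth section's JSON.
// encoding/json writes map keys in sorted order, so equal sections always
// fingerprint identically. ok is false when the section cannot be serialized.
func buildKey(parentChain string, authSection map[string]any) (string, bool) {
	b, err := json.Marshal(authSection)
	if err != nil {
		return "", false // unserializable section: the caller must not cache
	}
	sum := sha256.Sum256(b)
	return parentChain + "|" + hex.EncodeToString(sum[:]), true
}

// authManager stands in for a resolved AuthManager.
type authManager struct{ identity string }

// nestedCache is a process-scoped memo of resolved managers.
type nestedCache struct {
	mu sync.Mutex
	m  map[string]*authManager
}

// get returns a cached manager or calls resolve, caching only successful,
// non-nil results. Two concurrent misses may both call resolve; this sketch
// does not coalesce them.
func (c *nestedCache) get(parentChain string, section map[string]any, resolve func() (*authManager, error)) (*authManager, error) {
	key, cacheable := buildKey(parentChain, section)
	if cacheable {
		c.mu.Lock()
		am, hit := c.m[key]
		c.mu.Unlock()
		if hit {
			return am, nil
		}
	}
	am, err := resolve()
	if err != nil || am == nil || !cacheable {
		return am, err
	}
	c.mu.Lock()
	if c.m == nil {
		c.m = make(map[string]*authManager)
	}
	c.m[key] = am
	c.mu.Unlock()
	return am, nil
}

// reset clears the cache, as ResetStateCache() does for the real one.
func (c *nestedCache) reset() {
	c.mu.Lock()
	c.m = nil
	c.mu.Unlock()
}

func main() {
	var cache nestedCache
	resolves := 0
	succeed := func() (*authManager, error) { resolves++; return &authManager{identity: "core-auto"}, nil }
	section := map[string]any{"identity": "core-auto"}
	// Three distinct !terraform.state targets that share one identity.
	for i := 0; i < 3; i++ {
		cache.get("root", section, succeed)
	}
	fmt.Println(resolves) // 1

	failures := 0
	fail := func() (*authManager, error) { failures++; return nil, errors.New("sts: denied") }
	cache.get("root", map[string]any{"identity": "other"}, fail)
	cache.get("root", map[string]any{"identity": "other"}, fail)
	fmt.Println(failures) // 2 (errors are not cached)
}
```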
why
- #2652 deduped per-component auth at the top level, but a component that references another component via `!terraform.state` still ran a full auth cycle (credential writes, file locks, keyring rebuilds) once per distinct target, even when every target resolves to the same identity. `terraformStateCache` only short-circuits a repeat read of the same target, not distinct targets that share an identity.
- The result was the same N-auth blowup that #2652 removed, just relocated into template/YAML resolution. Memoizing by identity removes it.
Measured: `atmos describe stacks -s <stack>` on a large real-world stack (credentials provided via `auth exec`, 45s cap; only the binary under test varies):

| build | wall time | per-component auth cycles |
|---|---|---|
| latest release | DNF (>45s) | n/a |
| `main` (includes #2652) | ~17–19s | 44 |
| this PR (#2652 + nested dedup) | ~10–11s | 5 |
Output was verified equivalent to `main`: the remaining run-to-run differences are pre-existing `auto_provision_workdir_for_outputs` / `tofu output` provisioning nondeterminism present on both builds (same identity resolved throughout, no new errors). A matched-output pair differed by fewer lines than the `main`-vs-`main` noise floor.
The nested path is shared by `describe affected`, `list`, and `terraform --all`/`--query`. On a large multi-component change, the number of full auth cycles during `describe affected` likewise drops from scaling with the number of resolved components to roughly one per identity, with the rest served from the cache.
test plan
- `go build ./... && go test ./internal/exec/...`: new unit tests cover key behavior, dedupe-by-identity, parent-chain isolation, errors-not-cached, unserializable-section-not-cached, and the `ResetStateCache` coupling.
- `custom-gcl run --new-from-rev=main` → 0 issues.
- Real-repo benchmark above.
references
- Related to #2639
- Builds on #2652
- Design notes: `docs/fixes/2026-06-22-dedupe-nested-component-auth.md`
Summary by CodeRabbit
- Bug Fixes
- Fixed per-identity authentication deduplication for nested Terraform state/output references, reducing redundant authentication manager creation.
- Improved caching behavior so successful results are reused, failures aren’t memoized, and cache reset clears nested authentication state alongside the Terraform state cache.
- Documentation
- Added a documentation note explaining the nested authentication caching behavior and reset semantics.
- Tests
- Added unit tests covering deterministic cache-key generation, deduplication/differentiation by identity, non-caching of errors, handling of non-fingerprintable auth sections, and cache invalidation on reset.
fix(describe): respect metadata.enabled when evaluating component functions Brian Ojeda (@sgtoj) (#2655)
## what
- Respect `metadata.enabled` when the shared describe pipeline (`describe affected`, `describe stacks`, `list`) evaluates a component's functions:
  - `!terraform.state`/`!terraform.output` are skipped for components disabled via `metadata.enabled: false`; the raw function string is left unresolved (no backend read).
  - `atmos.Component(...)` returns empty sections (including an empty `outputs`) when the enclosing component is disabled: no describe, no state read, and template-safe (`.outputs.x`/`.vars.x` evaluate to nil instead of erroring).
- The gate keys strictly on `metadata.enabled` (via the existing `isComponentEnabled`), independent of `vars.enabled`.
why
`describe affected` describes the current and base stacks and evaluates every component's templates/YAML functions with no `metadata.enabled` gate. A component disabled with `metadata.enabled: false` that references an unprovisioned component's state therefore failed hard with `terraform state not provisioned`, even though disabled components are (correctly) excluded from the final affected list. The enabled-aware filters (`shouldSkipComponent`, `FilterAbstractComponents`) only run when assembling that list, after the describe phase has already failed.
Fixes #2654.
references
- Fixes #2654
- Design notes: `docs/fixes/2026-06-22-describe-respect-metadata-enabled.md`
test plan
- Unit tests: `disabledComponentTerraformSkip` (adds the terraform funcs, clones the base skip), `enclosingComponentDisabled` (nil/absent metadata ⇒ enabled; `vars.enabled: false` alone ⇒ enabled; `metadata.enabled: false` ⇒ disabled), `componentFunc` returns empty sections for a disabled enclosing component, and an end-to-end `processComponentEntry` test (disabled ⇒ `!terraform.state` not resolved; enabled / `vars.enabled: false`-only ⇒ resolved).
- `go build ./...`, `go vet ./internal/exec/...`, and `custom-gcl run --new-from-rev=main` (0 issues).
Note: the `TestDescribeAffected*` integration tests are environment-sensitive and fail identically on a clean `main` checkout locally (macOS); they are unrelated to this change. CI (Linux) is authoritative.
Summary by CodeRabbit
- Bug Fixes
- `atmos describe affected` and related stack inspection commands now correctly honor `metadata.enabled: false`, avoiding Terraform state/output inspection and template evaluation for disabled components.
- Documentation
- Added a documentation entry describing the metadata handling fix and the scenarios it resolves.
- Tests
- Added end-to-end and unit-style coverage to ensure disabled/enabled behavior works consistently for YAML-function and `atmos.Component(...)` template handling.