docs(ci): document github/artifacts planfile runtime-token requirement + E2E test Erik Osterman (Cloud Posse) (@osterman) (#2649)

## what

Planfile storage works end-to-end in CI. The github/artifacts store talks to the GitHub Artifacts runtime API for both upload and download, so a planfile uploaded by a plan job can be consumed by a separate deploy job in the same run.
Automatic, configurable drift verification on deploy. When planfile storage is configured and atmos terraform deploy runs under CI, Atmos downloads the stored plan, re-plans, compares them with a semantic JSON plan-diff, and applies the verified plan — failing on drift by default. Configurable via components.terraform.planfiles.verify (fail | warn | off) and --verify-plan / --no-verify-plan (CLI > config > CI default).
Generalized the in-repo github-runtime action to advertise planfiles, documented the runtime-token requirement, and added the automatic-flow planfile-verify-e2e workflow (kept the manual planfile-artifacts-e2e).
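
The precedence described above (CLI > config > CI default) can be sketched as a small resolver. This is a minimal illustration, not the Atmos implementation: `resolveVerifyMode` and `VerifyMode` are hypothetical names, and the mapping of `--verify-plan` to `fail` and `--no-verify-plan` to `off`, as well as `off` outside CI, are assumptions made for the example.

```go
package main

import "fmt"

// VerifyMode mirrors the three documented values of
// components.terraform.planfiles.verify.
type VerifyMode string

const (
	VerifyFail VerifyMode = "fail"
	VerifyWarn VerifyMode = "warn"
	VerifyOff  VerifyMode = "off"
)

// resolveVerifyMode is a hypothetical helper. cliFlag is nil when neither
// --verify-plan nor --no-verify-plan was passed; configMode is "" when the
// setting is absent from configuration.
func resolveVerifyMode(cliFlag *bool, configMode VerifyMode, inCI bool) VerifyMode {
	switch {
	case cliFlag != nil && *cliFlag:
		return VerifyFail // assumption: --verify-plan means strict verification
	case cliFlag != nil:
		return VerifyOff // assumption: --no-verify-plan skips verification
	case configMode != "":
		return configMode // configuration beats the CI default
	case inCI:
		return VerifyFail // documented default under CI: fail on drift
	default:
		return VerifyOff // assumption: no verification outside CI
	}
}

func main() {
	noVerify := false
	fmt.Println(resolveVerifyMode(&noVerify, VerifyWarn, true)) // CLI flag wins
	fmt.Println(resolveVerifyMode(nil, VerifyWarn, true))       // config wins over CI default
	fmt.Println(resolveVerifyMode(nil, "", true))               // CI default applies
}
```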

## why

The same-run plan→deploy handoff (the core CI use case) was broken: GitHub's REST API won't serve an in-progress run's artifact, and verification was opt-in and undocumented.
A planfile legitimately varies between review and apply (values known-after-apply, computed fields, hashes, ordering). A naive diff rejects a still-valid plan as "drifted"; the semantic comparison tolerates benign variation while catching real drift — which is what makes plan-then-deploy practical.
Verification belongs on deploy (which re-runs plan, so a fresh plan exists to diff against), not apply (which never re-plans).
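
To illustrate why a semantic comparison is more forgiving than a byte-level diff, here is a simplified sketch that compares two plans by resource address and planned actions only. It assumes the `resource_changes[].address` and `resource_changes[].change.actions` fields of Terraform's JSON plan representation; `summarize` and `drifted` are hypothetical names, and the actual Atmos plan-diff is not described here and presumably covers more (for example known-after-apply values).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// plan decodes only the fields this sketch compares; everything else in the
// JSON plan is ignored.
type plan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Change  struct {
			Actions []string `json:"actions"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// summarize maps each resource address to its planned actions, which
// discards the ordering of the resource_changes array.
func summarize(raw []byte) (map[string]string, error) {
	var p plan
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, err
	}
	out := make(map[string]string, len(p.ResourceChanges))
	for _, rc := range p.ResourceChanges {
		out[rc.Address] = strings.Join(rc.Change.Actions, ",")
	}
	return out, nil
}

// drifted reports whether the stored and fresh plans intend different actions.
func drifted(stored, fresh []byte) (bool, error) {
	a, err := summarize(stored)
	if err != nil {
		return false, err
	}
	b, err := summarize(fresh)
	if err != nil {
		return false, err
	}
	if len(a) != len(b) {
		return true, nil
	}
	for addr, actions := range a {
		if other, ok := b[addr]; !ok || other != actions {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	stored := []byte(`{"resource_changes":[{"address":"aws_s3_bucket.a","change":{"actions":["create"]}}]}`)
	fresh := []byte(`{"resource_changes":[{"address":"aws_s3_bucket.a","change":{"actions":["delete","create"]}}]}`)
	d, err := drifted(stored, fresh)
	fmt.Println(d, err)
}
```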

## references

Docs: Planfile Storage, Planfile drift verification, atmos terraform deploy
Changelog: website/blog/2026-06-22-native-ci-planfile-verification.mdx

Summary by CodeRabbit

Release Notes

New Features
- Configurable Terraform planfile drift verification during deploy with three modes: fail (strict), warn (proceed), or off (skip).
- GitHub Actions Artifacts support for same-run planfile downloads via runtime API.
- CLI flags (--verify-plan / --verify-plan=false) to override configuration at runtime.

Tests
- Added GitHub Actions E2E workflows for planfile artifacts and verification scenarios.
- Comprehensive unit and integration test coverage for verification modes and storage operations.

Documentation
- Updated planfile storage guides with drift verification behavior and GitHub Actions setup instructions.
- Blog post explaining plan-then-deploy workflow and configuration options.

Chores
- GitHub Actions runtime action now exposes credentials for planfile artifact storage.

feat: native container steps for workflows and custom commands Erik Osterman (Cloud Posse) (@osterman) (#2626)

## what

Add a native type: container step (build / push / run) to the shared step library used by both workflows and custom commands, built on the existing pkg/container Docker/Podman runtime (new ephemeral one-shot runner plus image build/tag/push/inspect helpers; ImageInspect added to the Runtime interface, mocks regenerated).
Formalize step outputs: every named step exposes value/values/metadata/outputs/skipped/error (command-like steps add stdout/stderr/exit_code), so a build step can publish an image reference consumed by later push/run steps via {{ .steps.<name>.outputs.<key> }}.
Support per-step identity for registry auth and Docker Buildx + Buildx Bake builds; Podman uses the native podman build path.
Add the examples/container-step example and a hermetic GitHub Actions job ([container-step]) that exercises build → push → run against a registry:2 service on localhost:5000, including failure-propagation.
Document the step type (website/docs/workflows), add a changelog blog post, and update the roadmap (container steps + step outputs marked shipped).
Land the design PRDs for the follow-on primitives — container-components.md, compose-components.md, and a rewritten membership-based compositions.md — and trim container-actions-and-step-outputs.md to cover only the procedural step. Remove the earlier targets:-based composition scaffolding (pkg/composition, cmd/composition, the composition step, and schema.Composition*) in favor of those PRDs.
Split the container-step handler into focused files and reduce complexity to satisfy the lint gate.
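
The `{{ .steps.<name>.outputs.<key> }}` expression reads like a Go template lookup over nested maps. The sketch below shows that mechanism with the standard `text/template` package. It is an illustration under assumptions: `renderStepExpr`, the step name `build`, and the output key `image` are hypothetical, and Atmos's own template engine and data shape may differ.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderStepExpr evaluates a template expression against recorded step
// results. missingkey=error makes a reference to an unknown step or output
// fail instead of rendering "<no value>".
func renderStepExpr(expr string, steps map[string]any) (string, error) {
	tmpl, err := template.New("step").Option("missingkey=error").Parse(expr)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, map[string]any{"steps": steps}); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// A hypothetical build step that published an image reference.
	steps := map[string]any{
		"build": map[string]any{
			"outputs": map[string]any{"image": "localhost:5000/app:dev"},
		},
	}
	ref, err := renderStepExpr("{{ .steps.build.outputs.image }}", steps)
	if err != nil {
		panic(err)
	}
	fmt.Println(ref) // a later push or run step would consume this value
}
```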

## why

Atmos workflows and custom commands increasingly resemble CI pipelines; running containers natively (build images, push to registries, run tools) removes the need for one-off shell scripts and keeps the same automation usable locally and in CI.
A first-class step-outputs contract lets build → push → run/deploy pipelines pass structured values without shell parsing or temporary env files.
The procedural container step is the shippable foundation; the component kinds (container, compose) and compositions are specified as PRDs so the broader system can be designed and reviewed before implementation, without blocking this PR.

## references

PRDs: docs/prd/container-actions-and-step-outputs.md, docs/prd/container-components.md, docs/prd/compose-components.md, docs/prd/compositions.md
Changelog: website/blog/2026-06-17-native-container-steps.mdx
Roadmap initiative: "Container Composition & Local Development" in website/src/data/roadmap.js

Summary by CodeRabbit

New Features
- Added native type: container workflow step actions: build, push, run, and inspect, with image metadata and step exit/status details.
- Added workflow-level container sandbox support for shared execution context across shell steps.
- Added step outputs support with {{ .steps.<name>.outputs.<key> }} propagation for later steps.
- Improved Podman runtime auto-start/recovery and identity-based registry authentication in container flows.

Bug Fixes
- Improved environment-variable casing handling for nested env: declarations.

Documentation
- Added extensive docs and examples for container steps, workflow container config, and step outputs.

Tests / CI
- Expanded automated tests and added CI coverage for local container build→push→run flows.

🚀 Enhancements

fix(secret): skip remote-state reads in credential-free secret list Brian Ojeda (@sgtoj) (#2657)

## what

Make credential-free atmos secret list skip the YAML functions that perform authenticated backend reads (!terraform.state, !terraform.output, !store, !store.get) while it enumerates secret declarations.
Add a credentialFreeSkip() helper and use it in the two credential-free paths: secret list -s <stack> enumeration and the single-scope secret list -s <stack> -c <component> path without --verify.
Authenticated paths (get / set / exec / shell, and secret list --verify) are unchanged — they keep skipping only !secret.
Adds TestCredentialFreeSkip pinning the skip set and a docs/fixes write-up.
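
The skip-set idea can be sketched as follows. Only the function tags come from the notes above; the signature of `credentialFreeSkip`, the `baseSkip` variable, and the `shouldSkip` matcher are assumptions for illustration and may not match the real helper.

```go
package main

import (
	"fmt"
	"strings"
)

// baseSkip is what the authenticated paths skip, per the notes: only !secret.
var baseSkip = []string{"!secret"}

// credentialFreeSkip returns the base skip set plus the YAML functions that
// perform authenticated backend reads. It clones baseSkip so the shared
// slice is never mutated.
func credentialFreeSkip() []string {
	skip := append([]string(nil), baseSkip...)
	return append(skip, "!terraform.state", "!terraform.output", "!store", "!store.get")
}

// shouldSkip (hypothetical) reports whether a raw YAML value invokes one of
// the skipped functions. A skipped value is left as its raw string.
func shouldSkip(value string, skip []string) bool {
	for _, tag := range skip {
		if value == tag || strings.HasPrefix(value, tag+" ") {
			return true
		}
	}
	return false
}

func main() {
	raw := "!terraform.state vpc outputs.vpc_id"
	fmt.Println(shouldSkip(raw, credentialFreeSkip())) // credential-free listing
	fmt.Println(shouldSkip(raw, baseSkip))             // authenticated paths
}
```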

## why

Secret listing is intentionally credential-free: it disables authentication so a large stack doesn't run one full auth cycle per component. But it still evaluated !terraform.state / !terraform.output / !store in component vars / settings. With auth disabled, the S3 backend assumes its configured role with no base credentials, the AWS SDK falls back to the default credential chain, and ultimately dials the EC2 IMDS endpoint — unreachable on a workstation — so the command aborts with a confusing assume-role/credentials error even immediately after a successful atmos auth login.
Listing only needs the static secrets.vars declarations (secrets.ExtractDeclarations), which never require a resolved value. Evaluating these functions was unnecessary and failure-prone. A skipped function leaves its raw string in place, which the declaration extractor ignores, so discovery is unchanged.
This is a regression: before credential-free enumeration was introduced, secret list authenticated per component, so these reads had credentials (slow, but working). Disabling auth removed the credentials without removing the reads.

## references

Related to #2639 (originally reported against atmos secret list).
Follow-up to #2646, which made secret-list enumeration credential-free but left the credentialed function evaluation in place.
Write-up: docs/fixes/2026-06-23-secret-list-credential-free-skip.md
Verified with go test ./cmd/secret/... and the repo's custom-gcl lint (both green), and end-to-end against a multi-account repo whose components reference cross-account !terraform.state: secret list -s <stack> aborted before, completes after (no state reads, no credential-resolution fallbacks).

Summary by CodeRabbit

Bug Fixes
- Fixed atmos secret list failing with credential errors after atmos auth login by preventing credential-free operations from attempting authenticated backend reads.

fix(auth): retry transient auth on freshly-brokered STS git clones Erik Osterman (Cloud Posse) (@osterman) (#2653)

## what

Retry transient git authentication failures within a bounded window (default 30s, exponential backoff + jitter) only when Atmos brokered a fresh GitHub STS token in this process. This is wired through a new broker.HasBrokeredCredentials() signal and a CustomGitGetter.RetryAuthErrors flag (an existing per-source retry: config still takes precedence).
Keep auth failures terminal (fail fast) for non-brokered/static-credential clones, so a genuinely wrong or expired token is never masked by retries.
Surface previously-swallowed credential-broker failures at Warn (was Debug, invisible at the default Warning log level) and log an actionable Error when the brokered-auth retry window is exhausted.
Add tests: brokered retry succeeds, non-brokered fails fast, bounded-budget exhaustion, and a -race concurrency guard proving EnsureCredentials provisions exactly once with a happens-before barrier.

## why

Under Atmos Pro cross-repo STS, atmos vendor pull intermittently failed with fatal: Authentication failed even though the same run logged a successful token mint and OIDC auth — a subset of clones failed and a rerun was clean.
Root cause is GitHub's brief post-creation 401 window: a just-minted installation token is not yet valid across all of GitHub's git frontends. The atmos-pro server already self-heals its own API calls on this 401 (Sentry APP-CLOUDPOSSE-COM-AM2), but the CLI git path did not — isRetryableGitError treated auth as terminal and vendor sources have no retry by default, so the earliest clones failed hard.
This gives the CLI the same tolerance the server has, scoped narrowly to brokered tokens so static credentials still fail fast, and removes the observability gap that made the failure hard to diagnose.

## references

Token TTL is GitHub's standard ~60 min (confirmed in atmos-pro mint.ts / token-provider), ruling out mid-run expiry; the post-mint propagation window is the cause.
Follow-up (out of scope): revoke_on_exit cross-process token race on the shared, unlocked github/sts state.json.

Summary by CodeRabbit

New Features
- Added automatic, bounded retries for transient Git authentication failures when using brokered GitHub App credentials, including a sensible default when no explicit retry window is configured.
- Introduced a RetryAuthErrors setting on the custom Git getter to enable this behavior.

Improvements
- Credential brokering now runs only for remote sources, and auth-retry logic activates only after brokered credentials are successfully exported.
- Enhanced warning logs when credential broker provisioning or environment export fails.

Tests
- Expanded coverage for brokered-credential detection, export-failure handling, and concurrent provisioning, plus brokered vs non-brokered retry behavior.

Chores/CI
- Pinned the gofumpt version in pre-commit.

perf(stacks): dedupe per-identity auth in nested terraform.state resolution Brian Ojeda (@sgtoj) (#2656)

## what

Extends the per-component AuthManager memoization introduced in #2652 from the top-level describe stacks pass into the nested resolution path that runs while templates and YAML functions are evaluated (!terraform.state, !terraform.output, atmos.Component(...)).
Adds a process-scoped nestedAuthManagerCache, consulted by resolveAuthManagerForNestedComponent, keyed by the parent auth chain + a deterministic JSON fingerprint of the component's auth section.
Extracts the key logic into a shared buildComponentAuthCacheKey used by both the processor cache (#2652) and this nested path, so the two keying strategies cannot drift.
Caches only successful, non-nil resolutions; ResetStateCache() also clears the new cache (kept consistent with the terraformStateCache it pairs with). Neither is reset in production.

## why

#2652 deduped per-component auth at the top level, but a component that references another component via !terraform.state still ran a full auth cycle (credential writes, file locks, keyring rebuilds) once per distinct target — even when every target resolves to the same identity. terraformStateCache only short-circuits a repeat read of the same target, not distinct targets that share an identity.
The result was the same N-auth blowup #2652 removed, just relocated into template/YAML resolution. Memoizing by identity removes it.

Measured — atmos describe stacks -s <stack> on a large real-world stack (credentials provided via auth exec, 45s cap; only the binary under test varies):

| build | wall time | per-component auth cycles |
|---|---|---|
| latest release | DNF (>45s) | n/a |
| `main` (includes #2652) | ~17–19s | 44 |
| this PR (#2652 + nested dedup) | ~10–11s | 5 |

Output was verified equivalent to main: the remaining run-to-run differences are pre-existing auto_provision_workdir_for_outputs / tofu output provisioning nondeterminism present on both builds (same identity resolved throughout, no new errors). A matched-output pair differed by fewer lines than the main-vs-main noise floor.

The nested path is shared by describe affected, list, and terraform --all/--query. On a large multi-component change, the full per-identity auth cycles during describe affected likewise drop from scaling with the number of resolved components to roughly one-per-identity, with the rest served from the cache.

## test plan

go build ./... && go test ./internal/exec/... — new unit tests cover key behavior, dedupe-by-identity, parent-chain isolation, errors-not-cached, unserializable-section-not-cached, and the ResetStateCache coupling.
custom-gcl run --new-from-rev=main → 0 issues.
Real-repo benchmark above.

## references

Related to #2639
Builds on #2652
Design notes: docs/fixes/2026-06-22-dedupe-nested-component-auth.md

Summary by CodeRabbit

Bug Fixes
- Fixed per-identity authentication deduplication for nested Terraform state/output references, reducing redundant authentication manager creation.
- Improved caching behavior so successful results are reused, failures aren’t memoized, and cache reset clears nested authentication state alongside the Terraform state cache.

Documentation
- Added a documentation note explaining the nested authentication caching behavior and reset semantics.

Tests
- Added unit tests covering deterministic cache-key generation, deduplication/differentiation by identity, non-caching of errors, handling of non-fingerprintable auth sections, and cache invalidation on reset.

fix(describe): respect metadata.enabled when evaluating component functions Brian Ojeda (@sgtoj) (#2655)

## what

Respect metadata.enabled when the shared describe pipeline (describe affected, describe stacks, list) evaluates a component's functions:
- !terraform.state / !terraform.output are skipped for components disabled via metadata.enabled: false — the raw function string is left unresolved (no backend read).
- atmos.Component(...) returns empty sections (including an empty outputs) when the enclosing component is disabled — no describe, no state read, and template-safe (.outputs.x / .vars.x evaluate to nil instead of erroring).
The gate keys strictly on metadata.enabled (via the existing isComponentEnabled), independent of vars.enabled.
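
The gate can be sketched as a check that looks only at `metadata.enabled` and treats a missing value as enabled. This is an illustration under assumptions: `metadataEnabled` and `sectionsFor` are hypothetical names, and the real `isComponentEnabled` and `atmos.Component(...)` signatures are not shown in these notes.

```go
package main

import "fmt"

// metadataEnabled reports whether a component is enabled according to
// metadata.enabled only. Absent metadata or an absent flag means enabled,
// and vars.enabled is deliberately ignored.
func metadataEnabled(component map[string]any) bool {
	meta, ok := component["metadata"].(map[string]any)
	if !ok {
		return true
	}
	enabled, ok := meta["enabled"].(bool)
	if !ok {
		return true
	}
	return enabled
}

// sectionsFor returns empty sections (including an empty outputs map) when
// the enclosing component is disabled, so no describe or state read happens.
// Otherwise it calls describe.
func sectionsFor(enclosing map[string]any, describe func() map[string]any) map[string]any {
	if !metadataEnabled(enclosing) {
		return map[string]any{"outputs": map[string]any{}, "vars": map[string]any{}}
	}
	return describe()
}

func main() {
	disabled := map[string]any{"metadata": map[string]any{"enabled": false}}
	varsOnly := map[string]any{"vars": map[string]any{"enabled": false}}
	fmt.Println(metadataEnabled(disabled)) // gated: functions are skipped
	fmt.Println(metadataEnabled(varsOnly)) // vars.enabled alone does not gate
}
```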

## why

describe affected describes the current and base stacks and evaluates every component's templates/YAML functions with no metadata.enabled gate. A component disabled with metadata.enabled: false that references an unprovisioned component's state therefore failed hard with terraform state not provisioned — even though disabled components are (correctly) excluded from the final affected list. The enabled-aware filters (shouldSkipComponent, FilterAbstractComponents) only run when assembling that list, after the describe phase has already failed.

Fixes #2654.

## references

Fixes #2654
Design notes: docs/fixes/2026-06-22-describe-respect-metadata-enabled.md

## test plan

Unit tests: disabledComponentTerraformSkip (adds the terraform funcs, clones the base skip), enclosingComponentDisabled (nil/absent metadata ⇒ enabled; vars.enabled:false alone ⇒ enabled; metadata.enabled:false ⇒ disabled), componentFunc returns empty sections for a disabled enclosing component, and an end-to-end processComponentEntry test (disabled ⇒ !terraform.state not resolved; enabled / vars.enabled:false-only ⇒ resolved).
go build ./..., go vet ./internal/exec/..., and custom-gcl run --new-from-rev=main (0 issues).

Note: the TestDescribeAffected* integration tests are environment-sensitive and fail identically on a clean main checkout locally (macOS); they are unrelated to this change. CI (Linux) is authoritative.

Summary by CodeRabbit

Bug Fixes
- atmos describe affected and related stack inspection commands now correctly honor metadata.enabled: false, avoiding Terraform state/output inspection and template evaluation for disabled components.

Documentation
- Added a documentation entry describing the metadata handling fix and the scenarios it resolves.

Tests
- Added end-to-end and unit-style coverage to ensure disabled/enabled behavior works consistently for YAML-function and atmos.Component(...) template handling.

v1.222.0-rc.8