Release v1.222.0 · cloudposse/atmos

feat: native Helm components (experimental) Erik Osterman (Cloud Posse) (@osterman) (#2667)

what

Adds the native Helm component type to Atmos (stacked on the native Kubernetes work), so components.helm.<name> is a first-class component with the same stack semantics as Terraform/Kubernetes — rendered, diffed, applied, and deleted through the Helm Go SDK (no helm/helmfile binary required).

This branch contributes, on top of osterman/kubernetes-native-component:

Native Helm component + atmos helm commands: template, diff, plan, apply, deploy, delete. Charts can be local, repository, OCI, or vendored from a source: (parity with terraform/kubernetes JIT provisioning). The values: section holds the chart's values; secrets flow via !secret and are masked.
Marked experimental — atmos helm renders the [EXPERIMENTAL] badge and honors settings.experimental / ATMOS_EXPERIMENTAL.
Real diff — atmos helm diff/plan now produces a true unified diff via the helm-diff library (used as a Go library, not the CLI plugin; v3.15.10 pins the same helm.sh/helm/v4 v4.2.1 Atmos uses). Secret values are redacted. Three baselines:
- deployed release (default; action.NewGet, cluster only for the baseline)
- --from-manifest=<path> — local baseline file (offline)
- --against=target[:<name>] — current manifests in a git deployment-repo provision target (offline; the GitOps producer-side diff)
GitOps producer side — apply/deploy can publish rendered manifests to a git deployment repository (provision targets). A new optional Fetcher capability on the target registry lets diff read that target's current state.
CI .Diff job summary at parity with the native Kubernetes component (collapsible block, Secrets omitted).

why

Helm has no native cross-release dependency ordering, no first-class secrets, and no in-process rendering — the ecosystem stitches together helm, helmfile, helm-secrets, and helm-diff. Atmos provides these directly through the stack model and the Helm Go SDK, including a real diff with no plugin to install, plus an offline GitOps-repo diff for producer-side workflows.

references

Docs: atmos helm, helm diff, Helm components
Changelog: website/blog/2026-06-15-native-helm-components.mdx
Example: examples/helm
Helmfile parity request: #2069

notes

Experimental feature; ships behind the experimental gate.
Pre-existing helm-feature lint debt (5 issues in executor.go/provision.go, e.g. os.Getenv→viper, funlen, arg-limit) is tracked for a follow-up cleanup; the diff work itself is lint-clean against the base.

feat: local Terraform tests against cloud emulators Erik Osterman (Cloud Posse) (@osterman) (#2663)

what

Run atmos terraform test (Terraform's native *.tftest.hcl framework) against a local cloud emulator instead of a real cloud account, via a new examples/terraform-tests example.
Add before.terraform.test / after.terraform.test lifecycle events and wire cmd/terraform/test.go to capture output and fire them — which drives both component hooks: and the native-CI plugin from one place.
New emulator workflow step type that drives emulator up/down/reset, so declarative kind: step hooks can bring a sandbox up before tests and tear it down after (when: always), with no manual atmos emulator up/down.
Native-CI job step summary for terraform test: pass/fail/skip badges and a per-run results table, alongside the existing plan/apply summaries.
Bug fix: under the Podman runtime, parsePodmanContainer dropped the container Ports array, so the emulator endpoint resolved empty and Terraform silently hit real AWS (403 InvalidAccessKeyId). Podman's structured Ports are now parsed into Info.Ports.
Docs (emulator step type, hook events, job summaries, hooks guide), a changelog blog post, a roadmap milestone, and a docs/fixes/ write-up for the Podman fix.
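
To make the Podman fix above concrete, here is a minimal Go sketch of reading a published host port from a container's structured Ports array. It is not the actual parsePodmanContainer code: the struct names, the hostPortFor helper, and the sample port 4566 are hypothetical, and the JSON field names (host_port, container_port) are assumed to match what podman ps --format json emits.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podmanPort mirrors one entry of the structured Ports array.
// Field names are an assumption about Podman's JSON output.
type podmanPort struct {
	HostIP        string `json:"host_ip"`
	HostPort      int    `json:"host_port"`
	ContainerPort int    `json:"container_port"`
	Protocol      string `json:"protocol"`
}

// podmanContainer holds only the fields this sketch needs.
type podmanContainer struct {
	Names []string     `json:"Names"`
	Ports []podmanPort `json:"Ports"`
}

// hostPortFor (hypothetical helper) returns the host port published for
// the given container port, or an error when nothing is published.
func hostPortFor(raw []byte, containerPort int) (int, error) {
	var c podmanContainer
	if err := json.Unmarshal(raw, &c); err != nil {
		return 0, err
	}
	for _, p := range c.Ports {
		if p.ContainerPort == containerPort {
			return p.HostPort, nil
		}
	}
	return 0, fmt.Errorf("no published host port for container port %d", containerPort)
}

func main() {
	raw := []byte(`{"Names":["emulator"],"Ports":[{"host_ip":"","host_port":32768,"container_port":4566,"protocol":"tcp"}]}`)
	port, err := hostPortFor(raw, 4566)
	if err != nil {
		panic(err)
	}
	fmt.Printf("emulator endpoint: http://localhost:%d\n", port)
}
```

Returning an error when no mapping is found, rather than an empty endpoint, is the design point: an empty endpoint is what let Terraform fall through to real AWS.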

why

terraform test's apply run blocks create real infrastructure, so they need a cloud account, cost money, and rarely get run locally. Pointing them at an emulator makes them free, hermetic, and identical between local and CI.
A single hook-events seam keeps the emulator lifecycle declarative (in the component) rather than a hand-written custom command, and reuses the existing kind: step machinery.
The Podman fix is required for any emulator-backed Terraform to reach the sandbox at all on Podman (it also fixes the existing emulator-aws example), and is documented in docs/fixes/ rather than the changelog because it restores already-intended behavior.

references

Builds on the emulators feature (#2647).
Podman fix rationale: docs/fixes/2026-06-27-podman-port-readback-emulator-endpoint.md.
Changelog: website/blog/2026-06-27-local-terraform-tests-with-emulators.mdx.

[codex] Fix mobile gutters and name runtime CSS Erik Osterman (Cloud Posse) (@osterman) (#2673)

what

Renamed the homepage runtime stylesheet from landing-redesign.css to landing-runtime.css.
Updated the homepage import to use the new runtime stylesheet name.
Tightened mobile and tablet hero CSS so the homepage content keeps consistent left/right gutters and CTA elements stay within the content column.
Added a more compact phone hero by reducing vertical spacing, scaling mobile type, hiding the heavier demo/runs band on small screens, centering the overall mobile content column, placing cloud logos in the whitespace to the right of the value props, and centering the CTA row.
Optimized the mobile AI section by hiding the decorative badge, reducing text scale/line-height, tightening spacing, and using left-aligned copy on phones.

why

Makes the stylesheet name describe the current homepage theme instead of a past redesign event.
Fixes the mobile homepage hero feeling clipped or overly left-aligned on narrow viewports without making the lower action area look disconnected.
Helps the primary mobile hero and AI section content fit better above the fold on common devices.
Protects the runtime hero from legacy broad landing-page header rules at responsive breakpoints.

references

Validation: pre-commit hooks passed during commit.
Validation: Docusaurus dev server compiled successfully with src/css/landing-runtime.css and AISection/styles.css.
Validation: postcss.parse passed for the updated CSS files.

feat: native Kubernetes components with GitOps deployment-repo delivery Erik Osterman (Cloud Posse) (@osterman) (#2607)

what

Native kubernetes component type. Define Kubernetes objects in stacks and run atmos kubernetes render|diff|plan|apply|deploy|delete <component> -s <stack> through the Kubernetes Go SDKs (server-side apply) — no kubectl or kustomize binary required.
Inputs can be inline manifests, files/directories (paths), and Kustomize overlays; full stack semantics (vars/env/auth/metadata/inheritance/overrides), --all/--affected DAG ordering, Atmos Auth (e.g. EKS) integration, and dotted lifecycle hooks (before/after.kubernetes.*).
GitOps delivery via provision.targets. apply/deploy deliver to a target selected by kind: kubernetes applies to the cluster (default), git renders the manifests and commits them to a managed Git deployment repository (Argo CD/Flux source-of-truth) instead. Selected with --target (precedence: --target → provision.default → implicit cluster), so existing components are unaffected.
New reusable, component-agnostic target-provisioner registry (pkg/provisioner/target, registry pattern) + a ProvisionArtifact model. The git target composes the pkg/git service: clone-reconcile a git.repositories.<name>, replace the managed templated path with the rendered files, path-scoped commit with provenance trailers, and push-with-retry. Credentials come from Atmos Auth (GitHub STS); pull_request publishing is deferred.
Schema, LSP, docs, examples, changelog. Typed kubernetes component and provision.targets in Go schema and both JSON schemas; LSP; command/config/stack docs; examples/kubernetes and examples/kustomize; a changelog blog post and a roadmap milestone.
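
The target precedence described above (--target, then provision.default, then the implicit cluster target) can be sketched in Go as follows. The resolveTarget function and the literal name "cluster" for the implicit target are hypothetical; the real resolution lives in Atmos and may differ in detail.

```go
package main

import "fmt"

// resolveTarget (hypothetical helper) picks the delivery target name:
// the --target flag wins, then provision.default from the stack config,
// and finally the implicit cluster target.
func resolveTarget(flagTarget, configDefault string) string {
	if flagTarget != "" {
		return flagTarget
	}
	if configDefault != "" {
		return configDefault
	}
	// "cluster" is an assumed name for the implicit default target.
	return "cluster"
}

func main() {
	fmt.Println(resolveTarget("", ""))              // no flag, no default
	fmt.Println(resolveTarget("", "gitops"))        // provision.default only
	fmt.Println(resolveTarget("cluster", "gitops")) // flag overrides default
}
```

Because the last fallback is the cluster, a component with no provision section behaves as it did before targets existed.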

why

Kubernetes should be orchestrated by the same stack-based engine as Terraform/Helmfile/Ansible — one set of inheritance, auth, and affected-detection — rather than shelling out to kubectl/kustomize from glue scripts.
GitOps pipelines have always needed ad hoc glue to render manifests into a deployment repo, commit, survive push races, and wire credentials. Atmos already owns rendering, lifecycle events, and authentication, so provision.targets adds the delivery step with centralized safety rules — the same component config can apply to a cluster in dev and publish to a GitOps repo in prod with one flag.

references

Builds on the Atmos Git foundational capability (#2597), now merged into main, which provides the reusable pkg/git service and git.repositories configuration consumed by the git target.
Docs: Kubernetes component, atmos kubernetes.

Add emulator workflows, skill catalog, and website refresh Erik Osterman (Cloud Posse) (@osterman) (#2665)

what

Added emulator workflow improvements, including emulator listing, Kubernetes readiness handling, Podman port parsing, and emulator-aware Terraform backend reads for AWS, GCP, and Azure.
Added offline bundled AI skill catalog support, including available-vs-installed skill listing and install-by-name behavior.
Added component dependency listing support plus updated examples, docs, landing-page demo assets, and website sidebar/landing refresh work.

why

Makes local emulator workflows more reliable by keeping in-process backend reads pointed at emulator endpoints instead of real cloud services.
Lets users discover and install bundled Atmos AI skills without requiring network or Git access.
Improves dependency visibility and updates the docs/website experience around the new emulator and skill workflows.

references

None.

Add parallel and matrix workflow control steps Erik Osterman (Cloud Posse) (@osterman) (#2635)

what

Add parallel and matrix workflow control steps with sibling needs DAG scheduling.
Add configurable failure behavior, parent-owned grouped/prefixed/none output, summary rendering through UI helpers, and child result metadata capture.
Keep the internal exec integration thin while placing the scheduler, matrix expansion, command child executor, output handling, and tests in pkg/workflow.
Add workflow/schema validation, registered pkg/runner/step/parallel and pkg/runner/step/matrix handlers, JSON schema updates, and examples/parallel-steps.
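
As an illustration of sibling needs scheduling, the sketch below groups steps into waves with Kahn's algorithm and runs each wave concurrently. It is a simplification under stated assumptions: the Step type and waves function are hypothetical, and the pkg/workflow scheduler may start a step as soon as its own needs finish rather than waiting for a whole wave.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Step is a hypothetical, minimal stand-in for a workflow child step.
type Step struct {
	Name  string
	Needs []string
}

// waves groups steps into batches. Every step in a batch has all of its
// needs satisfied by earlier batches. Unknown needs and cycles are errors.
func waves(steps []Step) ([][]string, error) {
	indegree := make(map[string]int, len(steps))
	dependents := make(map[string][]string)
	for _, s := range steps {
		indegree[s.Name] = 0
	}
	for _, s := range steps {
		for _, n := range s.Needs {
			if _, ok := indegree[n]; !ok {
				return nil, fmt.Errorf("step %q needs unknown sibling %q", s.Name, n)
			}
			indegree[s.Name]++
			dependents[n] = append(dependents[n], s.Name)
		}
	}
	var ready []string
	for name, d := range indegree {
		if d == 0 {
			ready = append(ready, name)
		}
	}
	var out [][]string
	done := 0
	for len(ready) > 0 {
		sort.Strings(ready) // deterministic order within a wave
		out = append(out, ready)
		done += len(ready)
		var next []string
		for _, name := range ready {
			for _, dep := range dependents[name] {
				indegree[dep]--
				if indegree[dep] == 0 {
					next = append(next, dep)
				}
			}
		}
		ready = next
	}
	if done != len(indegree) {
		return nil, fmt.Errorf("dependency cycle detected among steps")
	}
	return out, nil
}

func main() {
	steps := []Step{
		{Name: "lint"},
		{Name: "build"},
		{Name: "test", Needs: []string{"build"}},
		{Name: "publish", Needs: []string{"lint", "test"}},
	}
	batches, err := waves(steps)
	if err != nil {
		panic(err)
	}
	for i, batch := range batches {
		var wg sync.WaitGroup
		for _, name := range batch {
			wg.Add(1)
			go func(n string) {
				defer wg.Done()
				_ = n // a real executor would run the step here
			}(name)
		}
		wg.Wait() // the next wave starts only after this one finishes
		fmt.Printf("wave %d: %v\n", i+1, batch)
	}
}
```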

why

Enables non-interactive workflow steps to run concurrently without moving orchestration policy into internal/exec.
Provides deterministic dependency, failure, output, and matrix semantics before allowing broader step types inside concurrent groups.
Documents the new behavior with runnable examples and keeps pkg/workflow coverage above 80%.

references

pkg/workflow coverage: 82.9%
Validation run: go test ./pkg/schema ./pkg/runner/step ./pkg/scheduler ./pkg/workflow ./internal/exec
Validation run: go test ./cmd ./tests -run 'Workflow|workflow|Schema|schema'
Validation run: ./custom-gcl run --new-from-rev=origin/main --config=.golangci.yml

feat(emulator): local cloud emulators + emulator-based advanced quick-start & docs example drawer Erik Osterman (Cloud Posse) (@osterman) (#2647)

what

Emulator feature — run cloud-service emulators locally as first-class Atmos components:

emulator component kind + driver registry (pkg/emulator): the EmulatorDriver interface, ResolveDriver/Drivers, Endpoint/Profile types, the built-in AWS Floci driver, and the AWS target-profile builder (dummy creds, AWS_ENDPOINT_URL, and the Terraform provider behavior flags env can't set).
atmos emulator CLI (cmd/emulator): lifecycle verbs (up/down/reset/list/…), flags, and shell completions.
Auth/identity binding so in-process AWS and Terraform both reach the emulator (pkg/auth, pkg/component, internal/exec); generic provider-config contribution (pkg/generator).
Design captured as three PRDs: docs/prd/emulators.md, docs/prd/kubernetes-identity.md, docs/prd/provider-config-contributor.md.
Examples + E2E: examples/emulator-aws, examples/demo-floci, and the floci/acceptance jobs.
Changelog: website/blog/2026-06-22-emulator-persistence.mdx.

Emulator-based advanced quick-start — rewrote the advanced tutorial to deploy a real event-driven AWS backend (KMS key, encrypted S3 bucket, DynamoDB table, SNS topic, SQS queue, SSM Parameter Store config) entirely on your laptop, with no AWS account and no credentials, via the emulator. New backing example examples/quick-start-advanced (replaces the old VPC-based one).

Docs UI — a right-side [Example] drawer that follows each quickstart page and shows the page's backing example files (QuickStartExampleDrawer, wired through theme/DocItem/Content + theme/Root, reading the file-browser plugin's global data). Plus restyled File, Terminal, KeyPoints ("You will learn"), KeyTakeaways, EmbedExample, and ActionCard components, a CodeBlock line-numbers toggle, and supporting theme/CSS overrides.

why

Emulators let contributors and CI run the full Atmos workflow — auth, secrets, vendoring, toolchain, Terraform apply — against local cloud emulators, the same on a laptop and in CI, with no cloud credentials. That makes the advanced tutorial runnable by anyone and gives fast, hermetic local iteration.
The example drawer and component restyle let each tutorial page show its backing example inline, so readers can follow the docs and the code side by side.

references

Stacked on osterman/container-component-type (reuses its persistent container lifecycle via ComponentType: "emulator").
See docs/prd/emulators.md for the full design and per-step implementation sequence.

Fix bare command docs links Erik Osterman (Cloud Posse) (@osterman) (#2660)

what

Adds explicit redirects from bare command overview routes for auth, ai, and toolchain to their canonical /usage pages.
Updates announcement and feature-card links to point directly at the canonical command overview URLs.

why

Prevents users from hitting 404s when following bare command docs links like /cli/commands/auth.
Keeps existing /usage command overview URLs canonical without changing valid bare command routes such as workflow, devcontainer, and ci.

references

Reported from https://atmos.tools/cli/commands/auth returning 404.
Validated with cd website && npm run build.

feat(hooks): run custom step types as lifecycle hooks (kind: step) Erik Osterman (Cloud Posse) (@osterman) (#2658)

what

Add a new kind: step component-lifecycle hook that delegates to the workflow/custom-command step registry, making every registered step type (container, http, toast, log, markdown, …) runnable on terraform lifecycle events — name a step type: and pass its parameters under with:.
Plumb the operation outcome to hooks: user hooks now fire on the failure path (not just success), a new when: success|failure|always selector (default success) controls outcome-based firing, and {{ .status }}/{{ .exit_code }}/{{ .error }} template context plus ATMOS_HOOK_* env vars (alongside component/stack) let a hook announce exactly what happened.
Tighten the hooks JSON schema into a structured per-hook envelope (kind enum incl. step, events, on_failure, when, type, with, retry) across all three schema copies, kept non-breaking (additionalProperties: true).
Add docs (hooks reference + new sections), a PRD, a changelog blog post, and a roadmap milestone; unit tests cover routing, nested with: decode, when filtering, outcome template/env exposure, retry, and on_failure.
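
The when selector above reduces to a small decision function. This Go sketch assumes the three documented values with success as the default; the shouldFire name and the choice to never fire on an unrecognized value are hypothetical.

```go
package main

import "fmt"

// shouldFire (hypothetical helper) decides whether a hook runs for a
// given operation outcome. An empty selector means the default, success.
func shouldFire(when string, succeeded bool) bool {
	switch when {
	case "always":
		return true
	case "failure":
		return !succeeded
	case "", "success":
		return succeeded
	default:
		// Assumption: unknown selectors never fire in this sketch.
		return false
	}
}

func main() {
	for _, when := range []string{"", "success", "failure", "always"} {
		fmt.Printf("when=%q on success: %v, on failure: %v\n",
			when, shouldFire(when, true), shouldFire(when, false))
	}
}
```

Defaulting the empty selector to success is what keeps existing hooks, such as store, from suddenly running on failed operations.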

why

The hook system previously hard-coded a small kind list (store, command, infracost, checkov, kics, trivy, git); every new capability meant a new kind. Reusing the existing, well-tested step registry lets the whole step library work as hooks without forking the abstraction.
A key use case — "the VPC component in the foobar stack failed" — was impossible: after-* hooks fired only on success (cobra skips PostRunE on error) and the outcome reached only CI hooks, never user hooks. Firing user hooks on failure with when + outcome context closes that gap while defaulting to success-only so existing hooks (e.g. store) keep their behavior.

references

PRD: docs/prd/hooks-step-types.md
Docs: /stacks/hooks#kind-step-run-a-step-type and #reacting-to-success-or-failure
The http step type used in the Slack example lands in a separate PR; the bridge works today with every registered step type.

Skip fork autofix and refresh setup-go pins Erik Osterman (Cloud Posse) (@osterman) (#2659)

what

Skip the atmos.ci autofix job when a pull request comes from a fork.
Keep the existing atmos-pro[bot] loop guard and same-repo PR autofix behavior.
Refresh eight actions/setup-go v6 SHA pins to match the current upstream v6 tag.

why

Fork PRs do not receive OIDC, repo variables, or writable credentials, so atmos pro commit cannot authenticate or push fixes.
Skipping the job avoids guaranteed red checks for external contributors while preserving formatting automation for internal PRs.
The verify workflow checks that SHA-pinned actions match their tag comments; the previous setup-go pins pointed at v6.4.0 while labeled as v6.

references

Validated with workflow YAML parsing, upstream tag checks for actions/setup-go, and commit hook check yaml.

feat(workflows): http step type (webhook alias) with retries Erik Osterman (Cloud Posse) (@osterman) (#2641)

what

Add a native http workflow/custom-command step (type: http) that performs an HTTP request — any verb (GET/POST/PUT/PATCH/DELETE/HEAD/OPTIONS), query string parameters, headers, and a request body via body (raw) or form (urlencoded, or JSON when Content-Type is JSON).
Keep webhook as a first-class alias for http (type: webhook behaves identically) for the fire-a-notification use case. This adds alias support to the step registry: NewBaseHandler is variadic for aliases, Get() resolves aliases, and List/Count report only the canonical entry (no duplicate step type).
Per-attempt timeout and retry that composes with the existing retry: policy; retry is HTTP-aware (transport errors, 5xx, and 429 retry by default, other 4xx fail fast, and retry.conditions regexes force additional cases).
Configurable success criteria via expect.status (codes) and expect.response (regexes); the response body and status are captured as the step's value/metadata for downstream steps.
Schema fields on WorkflowStep and Task (so it works in both workflows and custom commands) plus the HTTPExpect struct, ErrHTTPStep* sentinels, JSON manifest updates, docs, an examples/http-webhooks example, a changelog blog post, and a roadmap milestone.
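
The HTTP-aware retry rule above (transport errors, 5xx, and 429 retry; other 4xx fail fast; retry.conditions regexes add cases) can be sketched as a predicate. The shouldRetry function is hypothetical, and matching the condition regexes against the response body is an assumption; the source does not say what they are matched against.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"regexp"
)

// errTransport is a sample transport-level failure for the demo.
var errTransport = errors.New("connection reset by peer")

// shouldRetry (hypothetical helper) reports whether an attempt should be
// retried. Matching conditions against the body is an assumption.
func shouldRetry(status int, transportErr error, body string, conditions []*regexp.Regexp) bool {
	if transportErr != nil {
		return true // no response at all: retry
	}
	if status == http.StatusTooManyRequests || (status >= 500 && status <= 599) {
		return true // throttling and server errors: retry
	}
	for _, re := range conditions {
		if re.MatchString(body) {
			return true // user-supplied condition forces a retry
		}
	}
	return false // everything else, including other 4xx, fails fast
}

func main() {
	conds := []*regexp.Regexp{regexp.MustCompile(`(?i)try again`)}
	fmt.Println(shouldRetry(0, errTransport, "", nil))
	fmt.Println(shouldRetry(503, nil, "", nil))
	fmt.Println(shouldRetry(429, nil, "", nil))
	fmt.Println(shouldRetry(404, nil, "", nil))
	fmt.Println(shouldRetry(409, nil, "please try again", conds))
}
```

A predicate like this is the kind of function that could be handed to a status-aware retry wrapper, which is the decision a generic command retry cannot make.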

why

Calling external endpoints (notify a service, trigger a CI job, hit a deployment webhook, poll a health check) previously required shelling out to curl, which isn't portable (Windows), is awkward to template, and gets no first-class timeout/retry handling.
The step is a general-purpose, verb-agnostic outbound HTTP client, so http is the accurate name (an inbound callback receiver is what "webhook" conventionally means); webhook is retained as an alias so the evocative name still works.
Extended/registry step types are not wrapped by the legacy retry.Do path that shell/atmos use, so the handler applies retry itself via retry.WithPredicate — which is what enables status-code-aware retry decisions a generic wrapper can't make.

references

Docs: workflow step types and custom commands
Changelog: website/blog/2026-06-20-http-step-type.mdx

feat(workflows): add `say` step for audible TTS notifications Erik Osterman (Cloud Posse) (@osterman) (#2640)

what

Add a new say workflow step type that speaks its content aloud using text-to-speech, and gracefully degrades to printing the message as a Markdown blockquote when no speech engine is available or when running in CI/headless.
Introduce a reusable cross-platform pkg/say package (mirroring pkg/browser) that detects macOS say, Linux spd-say/espeak/espeak-ng, and Windows PowerShell System.Speech, behind a Speaker interface with a CommandRunner DI seam and functional options.
Support a CSS font-family-style voice list (first installed voice on the host wins), a rate field (slow/normal/fast), and a print policy (fallback/always/never); add the matching Voice/Rate/Print fields to WorkflowStep and sentinel errors ErrSayNotFound/ErrVoiceListUnsupported.
Add an examples/say-something/ reference example, workflow step-type docs, an announcement blog post, and a roadmap milestone under the Workflows Overhaul initiative.
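
The "first installed voice wins" behavior of the font-family-style voice list can be sketched as below. The pickVoice helper is hypothetical; treating the list as a comma-separated string and matching names case-insensitively are assumptions, and the voice names are sample data.

```go
package main

import (
	"fmt"
	"strings"
)

// pickVoice (hypothetical helper) walks a comma-separated voice list and
// returns the first entry present in the installed set. Keys of installed
// are expected in lowercase; matching is case-insensitive by assumption.
func pickVoice(list string, installed map[string]bool) (string, bool) {
	for _, raw := range strings.Split(list, ",") {
		name := strings.Trim(strings.TrimSpace(raw), `"'`)
		if name == "" {
			continue
		}
		if installed[strings.ToLower(name)] {
			return name, true
		}
	}
	return "", false
}

func main() {
	installed := map[string]bool{"samantha": true, "alex": true}
	voice, ok := pickVoice(`Daniel, "Samantha", Alex`, installed)
	fmt.Println(voice, ok)
}
```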

why

Long-running workflows often outlast your attention; say gives an audible cue when a workflow finishes or needs input, going beyond the bell-only alert step by announcing what happened.
Shelling out to say only works on macOS — this makes audible notifications portable across macOS/Linux/Windows and safe in CI, so the same workflow runs unchanged everywhere and never fails on a missing engine.

feat(hooks): CI annotations and SARIF upload for scanner findings Erik Osterman (Cloud Posse) (@osterman) (#2631)

what

Surface scanner-hook findings (Checkov, Trivy, KICS) in CI beyond the job summary:
- ci.annotations (default on): inline GitHub ::error/::warning annotations anchored at each finding's file and line on the PR diff. This path does not use Code Scanning, so it needs no GitHub Advanced Security.
- ci.results (default off): upload the raw SARIF to GitHub Code Scanning (Security tab) natively, with no github/codeql-action step. The analysis category is auto-derived from the scan target so per-component uploads don't overwrite each other.
Implemented as native CI provider capabilities (Annotator, SARIFReporter) — siblings of the existing check-run/comment/summary capabilities — not as hooks. All three reporting outputs (ci.summary/ci.annotations/ci.results) are gated by ci.enabled.
Custom hooks opt in by adding format: sarif to a kind: command hook — any SARIF-emitting tool (tfsec, semgrep, gitleaks, …) gets findings, annotations, and upload with no Go code.
Docs (incl. required GitHub Actions permissions), a changelog blog post, and a roadmap milestone.

why

The CI job summary (#2617) gave a readable report, but the two richest GitHub surfaces — inline PR annotations and tracked Code Scanning alerts — were missing even though the data was already in the parsed SARIF.
Modeling this as provider capabilities (rather than reviving the deprecated ci.* hook kinds) keeps CI reporting where it belongs and lets every SARIF-emitting hook, built-in or custom, participate through one shared path.

references

Builds on #2617 (scanner findings → CI job summary) and completes the CI-reporting direction from #2614.
Note: base is main; #2617 is not yet merged, so this PR's diff currently includes #2617's commits — they drop out once #2617 merges and this branch is rebased.

feat(container): persistent container component kind + compositions Erik Osterman (Cloud Posse) (@osterman) (#2645)

what

Adds a stack-scoped, Atmos-native container component kind — one component is one persistent service — plus first-class compositions membership.

Lifecycle (atmos container <verb> <component> -s <stack>): build, push, pull, run, up, ps, logs, exec, restart, stop, rm, down. Long-running containers are discovered by labels derived from the canonical instance address <stack>/container/<component> (name atmos-<stack>-container-<component>), not local state files. up/run build the image first when vars.build-style build: is set and the image is missing.
First-class config — image, build, run are top-level component sections (reusing the workflow container-step structs ContainerBuildStep/ContainerRunStep), not nested under vars. Container app env comes from the component env: section.
atmos container list shows per-instance running state (running/stopped/unknown), discovered by label. The generic atmos list components lists containers as a component type without container-specific status — consistent with terraform/ansible (there is no atmos terraform list/atmos ansible list).
Compositions — a first-class composition: membership field + a top-level compositions: section (closed for membership, open for fulfillment). Operating a component with undeclared membership is a hard error; atmos composition validate <name> -s <stack> reports fulfilled vs. not-provided services.
Deep merge — the custom-component fallback now runs metadata.inherits inheritance + generic deep-merge of all top-level keys, so container honors catalog/abstract defaults like built-in kinds. Abstract components are rejected for execution and filtered from listings.
Extends the describe-component auto-detect and the describe/list type whitelist for container (and fixes the pre-existing ansible gap in list components).

why

Containers should be first-class, addressable component instances like terraform/helmfile/packer/ansible, and atmos container list should show whether each is running. A set of container components grouped by a composition is "our own Compose" with no compose.yaml. Implements docs/prd/container-components.md.

references

PRD: docs/prd/container-components.md, docs/prd/compositions.md
Examples: examples/container-component/, examples/compositions/
Docs: website/docs/cli/commands/container/, website/docs/components/container.mdx
Contributor skill: .claude/skills/atmos-core-component-development/

Note: Stacked on osterman/container-step-prd (the container step), not main. Changelog/roadmap are not required for this base (the gate is main-only); they'll go on the PR that brings the container feature to main.

docs(ci): document github/artifacts planfile runtime-token requirement + E2E test Erik Osterman (Cloud Posse) (@osterman) (#2649)

what

Planfile storage works end-to-end in CI. The github/artifacts store talks to the GitHub Artifacts runtime API for both upload and download, so a planfile uploaded by a plan job can be consumed by a separate deploy job in the same run.
Automatic, configurable drift verification on deploy. When planfile storage is configured and atmos terraform deploy runs under CI, Atmos downloads the stored plan, re-plans, compares them with a semantic JSON plan-diff, and applies the verified plan — failing on drift by default. Configurable via components.terraform.planfiles.verify (fail | warn | off) and --verify-plan / --no-verify-plan (CLI > config > CI default).
Generalized the in-repo github-runtime action to advertise planfiles, documented the runtime-token requirement, and added the automatic-flow planfile-verify-e2e workflow (kept the manual planfile-artifacts-e2e).
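
The verification precedence above (CLI > config > CI default) can be sketched as a small resolver. Everything here is illustrative: resolveVerifyMode is a hypothetical name, mapping --verify-plan to fail and defaulting to off outside CI are assumptions, and only the fail | warn | off values and the ordering come from the notes.

```go
package main

import "fmt"

// VerifyMode mirrors the documented planfiles.verify values.
type VerifyMode string

const (
	VerifyFail VerifyMode = "fail"
	VerifyWarn VerifyMode = "warn"
	VerifyOff  VerifyMode = "off"
)

// resolveVerifyMode (hypothetical helper) applies CLI > config > CI default.
// cliFlag is nil when neither --verify-plan nor --no-verify-plan was passed.
func resolveVerifyMode(cliFlag *bool, configValue string, inCI bool) VerifyMode {
	if cliFlag != nil {
		if *cliFlag {
			return VerifyFail // assumption: --verify-plan means fail on drift
		}
		return VerifyOff
	}
	switch VerifyMode(configValue) {
	case VerifyFail, VerifyWarn, VerifyOff:
		return VerifyMode(configValue)
	}
	if inCI {
		return VerifyFail // documented default under CI
	}
	return VerifyOff // assumption: no verification outside CI
}

func main() {
	disabled := false
	fmt.Println(resolveVerifyMode(nil, "", true))          // CI default
	fmt.Println(resolveVerifyMode(nil, "warn", true))      // config wins over default
	fmt.Println(resolveVerifyMode(&disabled, "fail", true)) // flag wins over config
}
```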

why

The same-run plan→deploy handoff (the core CI use case) was broken: GitHub's REST API won't serve an in-progress run's artifact, and verification was opt-in and undocumented.
A planfile legitimately varies between review and apply (values known-after-apply, computed fields, hashes, ordering). A naive diff rejects a still-valid plan as "drifted"; the semantic comparison tolerates benign variation while catching real drift — which is what makes plan-then-deploy practical.
Verification belongs on deploy (which re-runs plan, so a fresh plan exists to diff against), not apply (which never re-plans).

references

Docs: Planfile Storage, Planfile drift verification, atmos terraform deploy
Changelog: website/blog/2026-06-22-native-ci-planfile-verification.mdx

feat: native container steps for workflows and custom commands Erik Osterman (Cloud Posse) (@osterman) (#2626)

what

Add a native type: container step (build / push / run) to the shared step library used by both workflows and custom commands, built on the existing pkg/container Docker/Podman runtime (new ephemeral one-shot runner plus image build/tag/push/inspect helpers; ImageInspect added to the Runtime interface, mocks regenerated).
Formalize step outputs: every named step exposes value/values/metadata/outputs/skipped/error (command-like steps add stdout/stderr/exit_code), so a build step can publish an image reference consumed by later push/run steps via {{ .steps.<name>.outputs.<key> }}.
Support per-step identity for registry auth and Docker Buildx + Buildx Bake builds; Podman uses the native podman build path.
Add the examples/container-step example and a hermetic GitHub Actions job ([container-step]) that exercises build → push → run against a registry:2 service on localhost:5000, including failure-propagation.
Document the step type (website/docs/workflows), add a changelog blog post, and update the roadmap (container steps + step outputs marked shipped).
Land the design PRDs for the follow-on primitives — container-components.md, compose-components.md, and a rewritten membership-based compositions.md — and trim container-actions-and-step-outputs.md to cover only the procedural step. Remove the earlier targets:-based composition scaffolding (pkg/composition, cmd/composition, the composition step, and schema.Composition*) in favor of those PRDs.
Split the container-step handler into focused files and reduce complexity to satisfy the lint gate.
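
To show how a later step might consume a build step's output through {{ .steps.<name>.outputs.<key> }}, here is a sketch using Go's standard text/template. The render helper, the step name build, the output key image, and the image reference are all hypothetical; Atmos's own template context and functions may differ.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render (hypothetical helper) expands a step template against a data map.
// missingkey=error makes a reference to an unpublished output fail loudly.
func render(text string, data map[string]any) (string, error) {
	tmpl, err := template.New("step").Option("missingkey=error").Parse(text)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Sample data shaped like the documented steps.<name>.outputs.<key> path.
	data := map[string]any{
		"steps": map[string]any{
			"build": map[string]any{
				"outputs": map[string]any{"image": "localhost:5000/demo:dev"},
			},
		},
	}
	out, err := render("push {{ .steps.build.outputs.image }}", data)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```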

why

Atmos workflows and custom commands increasingly resemble CI pipelines; running containers natively (build images, push to registries, run tools) removes the need for one-off shell scripts and keeps the same automation usable locally and in CI.
A first-class step-outputs contract lets build → push → run/deploy pipelines pass structured values without shell parsing or temporary env files.
The procedural container step is the shippable foundation; the component kinds (container, compose) and compositions are specified as PRDs so the broader system can be designed and reviewed before implementation, without blocking this PR.

references

PRDs: docs/prd/container-actions-and-step-outputs.md, docs/prd/container-components.md, docs/prd/compose-components.md, docs/prd/compositions.md
Changelog: website/blog/2026-06-17-native-container-steps.mdx
Roadmap initiative: "Container Composition & Local Development" in website/src/data/roadmap.js

docs: add Custom to the Component Library Erik Osterman (Cloud Posse) (@osterman) (#2638)

what

Add a Custom entry to the Component Library so command-backed custom component types are discoverable alongside Terraform/OpenTofu, Helmfile, Packer, and Ansible.
New page website/docs/components/custom.mdx explaining custom component types (with a minimal Script Runner example and a native-vs-custom comparison), linking to the existing reference and changelog rather than duplicating them.
Wire the new page into the Component Library sidebar (website/sidebars.js) after Ansible.
Surface custom types in the Component Library overview (components-overview.mdx) — a pointer under the Component Types table and a Next Steps bullet.

why

Custom component types already shipped and are fully documented under cli/configuration/commands#custom-component-types, but a user browsing the Component Library never saw them as a first-class option — the nav didn't match the actual capability.
This is a docs-only change (no-release): no behavior changes, and the feature already has its own changelog post.

references

Reference: /cli/configuration/commands#custom-component-types
Changelog: /changelog/custom-component-types

feat: support description in component metadata Erik Osterman (Cloud Posse) (@osterman) (#2634)

what

Add an optional description field to component metadata.
Update the embedded, test-fixture, and published website JSON schemas to allow metadata.description.
Document the field in the component metadata reference and quick-start guides, and demo it in the quick-start example.
Add a schema validation test (pkg/datafetcher/schema_metadata_validation_test.go) verifying both the embedded and website schemas accept metadata.description.
Add a changelog blog post and a shipped roadmap milestone.

why

Lets users document the purpose of a component inline, right next to its configuration — especially useful when many components share the same Terraform root module with different configs.
The field is purely informational: it does not change how a component is processed, planned, or applied, so the change is safe and additive (component metadata is already a free-form map at runtime, so no Go changes were required).
Schema support gives editors auto-completion and validation for the new field.

references

Component metadata docs: /stacks/components/component-metadata

feat: terminal steps - tty/interactive fields and exec step type Erik Osterman (Cloud Posse) (@osterman) (#2602)

what

Terminal steps for custom commands and workflows — three related capabilities:

interactive: true — attach host stdin and let the step own Ctrl-C. Atmos suspends its SIGINT-exit handler while the step runs (new pkg/signals suspension registry consulted by the main.go signal handler).
tty: true — allocate a pseudo-terminal (reusing pkg/terminal/pty, same engine as atmos devcontainer attach). The command sees a real TTY; secret masking is applied to PTY output. With interactive: true, the host terminal switches to raw mode so Ctrl-C flows through the PTY to the child.
type: exec — replace the Atmos process entirely (shell exec semantics): execve of the system shell on Unix (env, working directory, and terminal inherited natively; ATMOS_SHLVL unchanged), spawn-and-propagate-exit-code emulation on Windows. Validated to be the final step; tty/interactive/retry/timeout/output are rejected on exec steps.

Architecture: all logic lives in narrow packages — pkg/process (RunShellStep routing, RunShellSession, ReplaceShellSession), pkg/schema (validation), pkg/signals (interrupt suspension). cmd/ and internal/exec contain only inline switch-case call sites; pkg/runner and the step handler share the same routing.
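
The exec-step rules above can be sketched as a small validation pass. This is only an illustration: the Step struct and validateSteps are hypothetical names, not the actual pkg/schema types, and the real field types may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Step is a hypothetical, simplified view of a custom-command step.
type Step struct {
	Type        string // "shell" or "exec"
	TTY         bool
	Interactive bool
	Retry       bool
	Timeout     string
	Output      string
}

// validateSteps enforces the two rules described for exec steps:
// an exec step must be last, and it rejects tty/interactive/retry/timeout/output.
func validateSteps(steps []Step) error {
	for i, s := range steps {
		if s.Type != "exec" {
			continue
		}
		if i != len(steps)-1 {
			return fmt.Errorf("step %d: exec must be the final step", i)
		}
		if s.TTY || s.Interactive || s.Retry || s.Timeout != "" || s.Output != "" {
			return errors.New("exec steps reject tty/interactive/retry/timeout/output")
		}
	}
	return nil
}

func main() {
	// An exec step followed by another step is rejected.
	fmt.Println(validateSteps([]Step{{Type: "exec"}, {Type: "shell"}}))
}
```

Rejecting the extra fields up front makes sense because, once the process is replaced, nothing is left to apply a retry, timeout, or output mode.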

Also fixes issues in pkg/terminal/pty found along the way:

stdin copier no longer blocks completion (it's detached, docker-CLI pattern)
session teardown is bounded: when grandchildren (e.g. aws ssm's session-manager-plugin) keep the PTY slave open after the child exits, output drains on a 1s deadline instead of hanging with the terminal in raw mode
DisableStdinForward provides -t-without--i semantics (a TTY is allocated but host stdin is not forwarded)
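
The bounded teardown described above amounts to waiting on the output copier with a deadline instead of waiting forever. A minimal sketch, assuming a done channel closed by the copier; waitDrained and drainDeadline are hypothetical names, and an io.Pipe stands in for the PTY:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"time"
)

// drainDeadline matches the 1s deadline mentioned in the description.
const drainDeadline = time.Second

// waitDrained blocks until the output copier signals completion or the
// deadline passes. It reports whether the copier finished in time.
func waitDrained(done <-chan struct{}, deadline time.Duration) bool {
	timer := time.NewTimer(deadline)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	pr, pw := io.Pipe() // stands in for the PTY master/slave pair
	done := make(chan struct{})
	go func() {
		_, _ = io.Copy(os.Stdout, pr) // output copier
		close(done)
	}()
	fmt.Fprintln(pw, "child output")
	// pw is never closed, as if a grandchild still held the slave end open.
	if !waitDrained(done, drainDeadline) {
		fmt.Println("drain deadline hit; restoring the terminal anyway")
	}
}
```

On timeout the copier goroutine is simply abandoned, which is acceptable here because the process is about to finish the step and restore the terminal.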

why

Custom commands had no way to hand the terminal to an interactive process:

commands:
  - name: ssh
    steps:
      - type: shell
        command: "exec aws ssm start-session --target {{ .Arguments.instance_id }}"

ran the SSM session as a piped, masked subprocess: full-screen rendering broke, and Ctrl-C inside the session killed Atmos itself (the global SIGINT handler exits 130), which in turn killed the orphaned session with SIGPIPE.

With this change:

commands:
  - name: ssh
    steps:
      - type: shell
        tty: true
        interactive: true
        command: "aws ssm start-session --target {{ .Arguments.instance_id }}"

behaves like docker run -it (supervised: masking preserved, more steps can follow), and:

      - type: exec
        command: "aws ssm start-session --target {{ .Arguments.instance_id }}"

hands the process over entirely (launcher: native job control, zero proxy overhead, must be the last step).

references

Reported in SweetOps Slack (SSM session via custom command gets a mangled terminal and dies with SIGPIPE on Ctrl-C); teardown hang + raw-terminal-after-exit reproduced live on this PR and fixed
Docs: Interactive and TTY Steps

feat(secrets): declarative secrets management with !secret, CRUD CLI, and masking Erik Osterman (Cloud Posse) (@osterman) (#1911)

what

Implements the Secrets Management PRD end to end — a GitOps-friendly, multi-cloud secrets workflow built on top of the existing store registry (not a parallel backend). Secrets are declared in stack config (committed to git) and their values live in a cloud secret backend or a SOPS-encrypted file, managed with a Vercel-like CLI and resolved at runtime with a new !secret YAML function.

Stores (`pkg/store`)

StoreConfig gains secret: true (subsystem membership) and kind (in cloud/thing form, such as aws/ssm) with legacy type mapping; using !store against a secret: true store is now an error ("use !secret").
New DeletableStore / StatusStore / SecretAwareStore interfaces; AWS SSM writes SecureString when used as a secret backend and gains Delete/Has.
New store backends: AWS Secrets Manager and HashiCorp Vault (KV v2). Registry refactored to a table-driven builder map; kind↔type compatibility.

Secrets core (`pkg/secrets`)

service, declaration registry, resolver, validator, kinds, and a leaf pkg/secrets/providers/ subpackage with a store-adapter (track 1) and a native SOPS provider (track 2: age/aws-kms/gcp-kms/gpg).
SOPS providers can be defined in atmos.yaml, globally in a stack (secrets: top-level merges into every component), or under a single component.

`!secret` + masking (the headline)

!secret NAME [| path ...] [| default ...] wired into the live YAML pipeline, with automatic masker registration.
Mask-without-retrieval: inspection commands (describe, list) resolve !secret to <MASKED> without contacting the backend when masking is on (the default) — so you can inspect a stack with no cloud credentials. Value-producing commands (secret get, terraform plan/apply) always retrieve; --mask/ATMOS_MASK only controls redaction of display output.
Sensitive Terraform outputs (sensitive = true) auto-register with the masker as they flow through !terraform.output / atmos.Component() / describe.
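
The mask-without-retrieval rule can be illustrated with a small decision function. This is a sketch under stated assumptions: resolveSecret, Backend, and the two boolean parameters are hypothetical and do not reflect the actual pkg/secrets API.

```go
package main

import "fmt"

const masked = "<MASKED>"

// Backend is a hypothetical secret backend lookup.
type Backend func(name string) (string, error)

// resolveSecret sketches the rule: an inspection command with masking on
// never contacts the backend; every other case retrieves the value.
func resolveSecret(name string, inspecting, maskOn bool, fetch Backend) (string, error) {
	if inspecting && maskOn {
		return masked, nil
	}
	return fetch(name)
}

func main() {
	fetch := func(name string) (string, error) { return "value-of-" + name, nil }
	// Inspection with masking on: the backend is not called.
	v, _ := resolveSecret("db_password", true, true, fetch)
	fmt.Println(v)
}
```

Keeping the check ahead of the fetch is what allows inspection without cloud credentials.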

CLI (`cmd/secret`)

init, set (alias add), get, delete (alias rm), list, pull, push, import, validate.

Stack processing

secrets is now a first-class inheritable component section, plus a global stack-level secrets: block that merges into every component.

Docs + example

Full Docusaurus docs: atmos secret overview + all 9 subcommands, secrets configuration page, !secret function page; blog post (with an embedded example) and a roadmap milestone.
examples/sops-secrets/ — fully self-contained, age-encrypted, no cloud credentials. Bundled atmos test custom command (.atmos.d/test.yaml) proves the full lifecycle end to end (set → encrypted-at-rest → get → list → validate → masked-without-credentials inspection → reveal-needs-key).

why

There was no unified way to manage human-provisioned secrets in Atmos — stores were designed for machine-written Terraform outputs, and the historical workaround (Chamber) was AWS-only. This adds explicit, declarative secret registration so a secret must be declared before it can be set, read, or resolved, and makes "inspect a stack" decoupled from "authenticate to the secret backend."

references

PRD: docs/prd/secrets-management.md and docs/prd/secrets-masking/
Example: examples/sops-secrets/ (run atmos test)

notes / follow-ups

Fixed a pre-existing init-ordering bug where the global --mask=false flag did not disable the early-initialized I/O masker (only ATMOS_MASK=false did). io.ReconcileMasking() now reconciles the masker after flags are parsed, so --mask=false and ATMOS_MASK=false behave identically.
pkg/store backend implementations could be moved into a pkg/store/providers/ subpackage (mirroring pkg/secrets/providers/) — deferred to a dedicated follow-up PR since it touches ~30 external call-sites.
Base-component (metadata.component) inheritance of the secrets section is not wired yet (component-level + import: + global-stack inheritance all work).

feat(terraform): registry cache, RC management, and multi-platform mirror Erik Osterman (Cloud Posse) (@osterman) (#2582)

what

Add a transparent Terraform/OpenTofu registry cache: an ephemeral local HTTPS network-mirror proxy (pkg/http/proxy, pkg/terraform/{cache,registry}) that caches providers and modules in the canonical filesystem_mirror layout, enabled with components.terraform.cache.enabled: true.
Add the atmos terraform cache command group — list, stats, prune, delete, plus mirror (alias warm) for eager multi-platform pre-seeding and trust/untrust for the proxy certificate.
Add declarative Terraform CLI-config (.terraformrc) management via components.terraform.rc, exposed to the subprocess through TF_CLI_CONFIG_FILE/TOFU_CLI_CONFIG_FILE.
Add a first-class components.terraform.platforms setting (target <os>_<arch> list) that drives both eager atmos terraform cache mirror pre-seeding (--all/--components/--query/-s, package-manager-style TUI, --format json|yaml) and automatic completion of .terraform.lock.hcl.
Keep .terraform.lock.hcl complete across platforms: a built-in after.terraform.init provisioner runs terraform/tofu providers lock -platform=… for the declared platforms whenever a customized provider installation method (the default plugin cache, or the registry cache) is active. Because it runs after init, it sees the fully JIT-vendored and code-generated working directory, so the generated provider set (including stack-config provider versions) is what gets locked — and committed lock files install cleanly on every platform in a fleet.
Generate and cache a self-signed loopback certificate so the proxy can serve HTTPS (required by Terraform/OpenTofu network mirrors); trusted automatically via SSL_CERT_FILE on Linux/CI and via a one-time atmos terraform cache trust on macOS/Windows.
Add examples/caching (auto-installs OpenTofu via the toolchain), PRDs, command + configuration docs, blog posts, and a roadmap update.
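
The lock-completion step boils down to passing one -platform flag per declared platform to providers lock. A minimal sketch of building that argument list; lockArgs is a hypothetical helper, not the provisioner's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// lockArgs builds the arguments for `terraform providers lock` (or
// `tofu providers lock`) from a list of <os>_<arch> platforms.
func lockArgs(platforms []string) []string {
	args := []string{"providers", "lock"}
	for _, p := range platforms {
		args = append(args, "-platform="+p)
	}
	return args
}

func main() {
	args := lockArgs([]string{"linux_amd64", "darwin_arm64"})
	fmt.Println("tofu " + strings.Join(args, " "))
}
```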

why

Repeated and CI runs re-download the same providers and modules; the cache eliminates that, keeps runs working through registry outages, and preserves the exact versions a deployment used.
Atmos enables a provider plugin cache (TF_PLUGIN_CACHE_DIR) by default, and network mirrors behave the same way: Terraform can no longer record the registry's signed cross-platform checksums, so init writes a .terraform.lock.hcl with hashes for only the current platform and prints the "Incomplete lock file information for providers" warning. Declaring components.terraform.platforms lets Atmos complete the lock automatically for every target platform.
The lazy proxy only caches the host platform, so mixed CI/developer fleets and air-gapped reproducible builds need declarative multi-platform pre-seeding — components.terraform.platforms + cache mirror provide it.
Declarative rc lets teams manage provider mirrors, credentials, and other CLI-config directives from atmos.yaml instead of per-machine .terraformrc files.

references

Closes #2150
docs/prd/terraform-registry-cache.md, docs/prd/terraform-rc-management.md, docs/prd/terraform-registry-cache-tls.md

feat: Atmos Git — foundational capability for GitOps enablement Erik Osterman (Cloud Posse) (@osterman) (#2597)

what

Atmos Git: Git becomes a foundational platform capability, on par with Toolchain, Auth, and Hooks — the enablement layer for GitOps workflows where Atmos commits generated artifacts to a source-of-truth repository. PRD: docs/prd/git-ops.md.

Top-level git config — git.repositories.<name> declares managed repositories (uri, branch, remote, clone depth/filter/single-branch/submodules, auth.identity, commit.signing/commit.author, push.retries), git.hooks declares local Git hooks, git.list configures list output. Workdirs default to automatic XDG cache locations ($XDG_CACHE_HOME/atmos/git/repositories/<name>) so the native CI cache captures and restores managed clones.
pkg/git service — provider registry (registry pattern) with the cli provider in v1 (chosen because GitHub STS materializes credentials as GIT_CONFIG_* env vars, which subprocess git honors and go-git ignores). Clone is defined as reconcile (clone-if-absent, else fetch + checkout + ff-only) so stale CI-cache restores are just faster clones. Safety rules: ff-only pull, no force push ever, push retry-with-rebase on non-fast-forward rejection, path-scoped commits that fail on unrelated dirty files, worktree path-traversal validation, per-invocation commit author injection (CI runners need no user.name), provenance trailers (Atmos-Stack, Atmos-Component, Atmos-Source-SHA).
atmos git command group — clone, pull, status, diff, commit, push, list, clean, plus git hooks install|uninstall|run, registered under the Git help group. --all bulk operations (bounded concurrency, attempt-all with errors.Join). Clone accepts configured names, plain URLs, and go-getter git::...?ref=&depth= URIs. No-arg clone in native CI (ci.enabled: true) infers the current repository from CI metadata and clones into the workspace — an actions/checkout replacement. atmos list git-repositories alias registered.
git hook kind — publishes generated artifacts on lifecycle events (after.terraform.apply, ...) to the current repository by default or a named managed repository, with templated commit messages, trailers, clean no-ops, and push-after-commit with retry. Inherits --skip-hooks and on_failure.
Local Git hook shims — atmos git hooks install writes worktree-aware .git/hooks/* shims (marker-protected, --force to overwrite, warns when core.hooksPath is set); run dispatches configured commands with stdin forwarding and exit-code propagation.
Error handling — new sentinels (ErrGitRepositoryNotFound, ErrGitAuthFailed, ErrGitPushRejected, ErrGitDirtyUnmanagedFiles, ErrGitPathEscapesWorktree, ErrGitHookNotConfigured, ErrGitRepositoryRequired, ErrGitProviderNotFound) with error-builder hints and exit-code mapping. Git stderr streams to the masked writer and is never embedded in error chains.
Docs & example — command pages under website/docs/cli/commands/git/, git configuration reference, hook kind docs, changelog blog post (atmos-gitops), roadmap milestone (CI/CD Simplification initiative), and a GitOps publishing demo at examples/gitops (reconcile → review → publish against a managed deployment repo via custom commands).
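
"Clone as reconcile" can be pictured as choosing between two command plans depending on whether the workdir already exists. The sketch below only builds git argument lists; reconcilePlan is a hypothetical name, and the remote name origin is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// reconcilePlan returns the git invocations for "clone as reconcile":
// clone when the workdir is absent, otherwise fetch + checkout + ff-only merge.
func reconcilePlan(exists bool, uri, branch, dir string) [][]string {
	if !exists {
		return [][]string{{"clone", "--branch", branch, uri, dir}}
	}
	return [][]string{
		{"-C", dir, "fetch", "origin", branch},
		{"-C", dir, "checkout", branch},
		{"-C", dir, "merge", "--ff-only", "origin/" + branch},
	}
}

func main() {
	for _, args := range reconcilePlan(true, "https://example.com/deploy.git", "main", "/tmp/deploy") {
		fmt.Println("git " + strings.Join(args, " "))
	}
}
```

Because the existing-workdir branch never rewrites history, a stale CI-cache restore converges to the same state as a fresh clone, only faster.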

What this is — and isn't

Atmos owns the publishing side of GitOps: render → diff → commit → push, with centralized safety rules. Reconciliation stays with the consumer — Argo CD or Flux pulls from the repository, or CI applies on merge. There are no agents and no drift-correction loop in Atmos itself (explicit non-goal in the PRD); Atmos is the producer feeding the reconciler. This also isn't a replacement for the existing GitHub Actions plan/apply integration — it's the Git plumbing those pipelines use.

why

GitOps workflows have always needed glue: ad hoc scripts to render manifests into deployment repos, commit them, survive push races, and wire credentials. Atmos already owns rendering, lifecycle events, toolchain, and credentials (GitHub STS) — this PR gives it the Git operations between them, with centralized safety rules instead of per-pipeline shell scripts. It is the foundation for Kubernetes deployment-repository provisioning (Argo CD / Flux rendered-manifest publishing, on the kubernetes component branch) and a future github provider for pull-request-based publishing to protected branches.

references

PRD: docs/prd/git-ops.md (in this PR)
Coverage: pkg/git 86%, pkg/git/providers/cli 88%, pkg/hooks/kinds/git 94%, cmd/git 81%
Related: native CI cache (XDG-root archiving) and the Kubernetes component branch (consumes provision.git next)

feat: support dotenv files in !include Erik Osterman (Cloud Posse) (@osterman) (#1930)

Summary

Adds explicit dotenv file support to the existing !include YAML function. Dotenv files now resolve to maps, so they can be used directly in CLI and stack env sections and with YAML merge keys.

env:
  <<: !include .env
  AWS_REGION: us-east-2

Dotenv files can also be layered with YAML merge sequences. This uses YAML's << merge-key syntax, the same YAML mechanism commonly used with anchors and aliases:

env:
  <<:
    - !include .env.local
    - !include .env
  AWS_REGION: us-east-2

In a YAML merge sequence, earlier items take precedence over later ones, and inline keys under env override all merged values.
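
The precedence rule can be modeled in a few lines. This is an illustrative sketch of the merge semantics only, not the Atmos implementation; mergeEnv is a hypothetical helper.

```go
package main

import "fmt"

// mergeEnv applies merge-key precedence: among merged layers the earliest
// one wins, and inline keys override anything that was merged.
func mergeEnv(inline map[string]string, layers ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, layer := range layers {
		for k, v := range layer {
			if _, seen := out[k]; !seen { // an earlier layer already set it
				out[k] = v
			}
		}
	}
	for k, v := range inline {
		out[k] = v
	}
	return out
}

func main() {
	local := map[string]string{"DEBUG": "1", "AWS_REGION": "us-west-2"} // .env.local
	base := map[string]string{"DEBUG": "0", "APP": "demo"}              // .env
	inline := map[string]string{"AWS_REGION": "us-east-2"}
	fmt.Println(mergeEnv(inline, local, base))
}
```

With .env.local listed first, its DEBUG value survives, while the inline AWS_REGION replaces the value from either file.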

What Changed

Parse .env, .env.*, and exact *.env filenames as dotenv files when used with !include
Support env: !include .env and env: { <<: !include .env } / block merge forms in stack config
Support dotenv !include in atmos.yaml env, including merge sequences for layered dotenv files
Preserve !include.raw behavior for raw file contents
Keep .envrc and foo.env.local unsupported/raw; Atmos does not auto-load or execute dotenv files
Preserve YAML custom tags during schema validation so env: !include .env satisfies stack manifest schema rules
Update the stack manifest JSON schema description for env to document the !include string form
Document dotenv includes in both CLI env and stack env docs, including YAML merge-key behavior, include path resolution, and layered files
Add a short blog post for explicit dotenv inclusion
Add a roadmap milestone entry for the shipped dotenv !include support
Add coverage-focused tests for dotenv merge-key retry handling, include path helpers, case-preservation helpers, and YAML custom-tag conversion
Harden the LocalStack demo provider config to use the local edge endpoint directly, path-style S3, and skip AWS account-ID discovery so Terraform does not hang before reaching LocalStack in CI
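
The filename rules above (.env, .env.*, and *.env are dotenv; .envrc and foo.env.local are not) can be expressed as a simple predicate. This is a sketch of the matching rule as described, with isDotenvFile as a hypothetical name; the actual detection in Atmos may be implemented differently.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// isDotenvFile reports whether a path's base name is ".env", starts with
// ".env.", or ends with ".env".
func isDotenvFile(name string) bool {
	base := path.Base(name)
	return base == ".env" ||
		strings.HasPrefix(base, ".env.") ||
		strings.HasSuffix(base, ".env")
}

func main() {
	for _, name := range []string{".env", ".env.local", "config/prod.env", ".envrc", "foo.env.local"} {
		fmt.Println(name, isDotenvFile(name))
	}
}
```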

Tests

cd examples/demo-localstack && ATMOS_IDENTITY=false go run ../.. describe component demo -s dev --format json --logs-level Off | jq '.providers.aws'
cd examples/demo-localstack && ATMOS_IDENTITY=false go run ../.. validate stacks --logs-level Off
go test ./pkg/config ./pkg/validator ./pkg/filetype
go test ./internal/exec -run 'TestGenerateProviderOverrides|TestGenerateProviderOverridesForAliases|TestProcessStackConfigProviderSection'
go test ./pkg/config ./pkg/validator -coverprofile=.context/dotenv-include-coverage.out
go test ./pkg/utils -run 'TestInclude(Dotenv|ExtensionBased|RawFunction|WithNoExtension)'
node -e "import('./website/src/data/roadmap.js').then(() => console.log('roadmap import ok'))"
git diff --check
Real stack manifest schema regression: env: !include .env validates against tests/fixtures/schemas/atmos/atmos-manifest/1.0/atmos-manifest.json
Commit hooks passed: go-fumpt, Go build, go mod tidy, golangci-lint, whitespace/EOF/large-file checks

feat(ci): GitHub Actions build cache (atmos ci cache) Erik Osterman (Cloud Posse) (@osterman) (#2579)

what

Add a CI build cache that restores the well-known Atmos cache root (~/.cache/atmos — toolchain binaries, vendored components, remote import clones, provider/plugin caches) at startup and saves it at exit, using the same store actions/cache uses (GitHub Actions Cache Service v2).
New atmos ci cache subcommands: restore, save, list, delete — so the lifecycle can run in one invocation or be spread across CI steps.
New ci.cache configuration block (enabled, auto: off|restore|save|both, root, paths, key, restore_keys, compression) with ATMOS_CI_CACHE_* env overrides.
Model it as a CI-provider capability (provider.CacheProvider + ci.DetectCache()) with a backend registry (pkg/ci/cache) and a GitHub Actions implementation (pkg/ci/cache/github), mirroring the existing artifact subsystem; outside a runner it's a clean no-op.
Consolidate the default toolchain install path under the XDG cache root (~/.cache/atmos/toolchain) so a single cache captures it; add a PRD, command/config docs, blog post, and roadmap entry.

why

In CI, every job re-downloads the toolchain, providers, and modules from upstream — wasting time/bandwidth and exposing runs to transient and rate-limit failures. Persisting the cache root across jobs makes executions faster, more reliable, and reduces supply-chain exposure.
Teams otherwise hand-wire an actions/cache step and own the key/path logic themselves; Atmos already knows its cache root and can derive a stable key from toolchain.lock.yaml + OS/arch, so it's two settings to enable.
Cache entries are write-once; a per-run state marker makes automatic and manual usage idempotent (an exact-key hit on restore skips the save), so the same operations work whether triggered automatically or via the subcommands.
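
Deriving a stable key from toolchain.lock.yaml plus OS/arch might look like the sketch below. The key layout (prefix, field order, truncated hash) is purely an assumption for illustration; the description does not specify the real format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"runtime"
)

// cacheKey combines the platform with a digest of the lock file contents,
// so the key changes only when the lock file or the platform changes.
// The "atmos-<os>-<arch>-<hash>" layout is hypothetical.
func cacheKey(lockFile []byte, goos, goarch string) string {
	sum := sha256.Sum256(lockFile)
	return fmt.Sprintf("atmos-%s-%s-%s", goos, goarch, hex.EncodeToString(sum[:8]))
}

func main() {
	lock := []byte("tools: {}\n") // placeholder lock file contents
	fmt.Println(cacheKey(lock, runtime.GOOS, runtime.GOARCH))
}
```

A content-derived key fits write-once cache entries: an unchanged lock file maps to the same entry, so an exact-key hit means there is nothing new to save.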

references

PRD: docs/prd/native-ci/framework/ci-cache.md
Docs: /cli/commands/ci/cache and /cli/configuration/ci/cache
GitHub Actions Cache Service v2 (the store actions/cache uses)

🚀 Enhancements

Warn on explicit version constraint overrides Erik Osterman (Cloud Posse) (@osterman) (#2670)

what

Downgrade version constraint failures to structured log.Warn messages when an explicit version override is present.
Detect overrides from --use-version, ATMOS_VERSION_USE, ATMOS_USE_VERSION, and ATMOS_VERSION, while keeping config-only version.use enforcement unchanged.
Preserve fatal errors for invalid constraint syntax and add coverage for non-semver override binaries like test.
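
The behavior above can be summarized as: detect an explicit override, then turn a constraint failure into a warning instead of an error. A minimal sketch, with hypothetical function names and message text; invalid constraint syntax, which stays fatal, is not modeled here.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

var errConstraint = errors.New("version constraint not satisfied")

// overrideEnvVars lists the environment variables named in the description.
var overrideEnvVars = []string{"ATMOS_VERSION_USE", "ATMOS_USE_VERSION", "ATMOS_VERSION"}

// hasExplicitOverride reports whether --use-version or one of the
// override environment variables is set.
func hasExplicitOverride(useVersionFlag string, lookup func(string) string) bool {
	if useVersionFlag != "" {
		return true
	}
	for _, name := range overrideEnvVars {
		if lookup(name) != "" {
			return true
		}
	}
	return false
}

// checkConstraint returns a warning for an overridden failure and an
// error for a config-only failure.
func checkConstraint(satisfied, override bool) (warning string, err error) {
	switch {
	case satisfied:
		return "", nil
	case override:
		return "version constraint bypassed by explicit override", nil
	default:
		return "", errConstraint
	}
}

func main() {
	override := hasExplicitOverride("", os.Getenv)
	warning, err := checkConstraint(false, override)
	fmt.Println(warning, err)
}
```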

why

--use-version ref:* can re-exec into unreleased binaries that report version.Version == "test", which previously failed constraint validation before the requested command could run.
Explicit overrides are intentional, so Atmos should continue with a warning that explains the bypass instead of enforcing the configured constraint.

references

Closes #2668

fix(validate): dogfood `atmos validate stacks` for example YAML; fix nil-map crash Erik Osterman (Cloud Posse) (@osterman) (#2666)

what

Replace the deprecated third-party InoUno/yaml-ls-check GitHub Action in the [validate] CI matrix with Atmos itself, running atmos validate stacks --schemas-atmos-manifest <in-repo schema> against each example.
Expand the matrix to also validate three previously-excluded function-using examples (custom-components, sops-secrets, onepassword-secrets), which now pass because Atmos understands its own YAML tags natively.
Fix a crash in atmos validate stacks --schemas-atmos-manifest: it panicked with assignment to entry in nil map when the target atmos.yaml had no schemas: section. Added a lazy-initializing SetSchemaRegistry setter and a regression test.
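
The crash is the classic Go pitfall of writing to a nil map, and a lazy-initializing setter is the usual remedy. A minimal sketch: the Config and SchemaRegistry types here are hypothetical stand-ins, and only the setter name comes from the description (its real signature may differ).

```go
package main

import "fmt"

// SchemaRegistry and Config are simplified, hypothetical stand-ins.
type SchemaRegistry struct {
	Manifest string
}

type Config struct {
	Schemas map[string]SchemaRegistry // nil when atmos.yaml has no schemas: section
}

// SetSchemaRegistry creates the map on first use, so assigning into a
// config without a schemas: section no longer panics.
func (c *Config) SetSchemaRegistry(key string, r SchemaRegistry) {
	if c.Schemas == nil {
		c.Schemas = make(map[string]SchemaRegistry)
	}
	c.Schemas[key] = r
}

func main() {
	var cfg Config // Schemas is nil here
	cfg.SetSchemaRegistry("atmos", SchemaRegistry{Manifest: "schema.json"})
	fmt.Println(cfg.Schemas["atmos"].Manifest)
}
```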

why

The third-party action targets Node 20 and is force-run on Node 24, emitting deprecation warnings across every [validate] job; it also can't parse Atmos YAML tags, which forced many examples to be excluded from validation.
atmos validate stacks is a strict superset of the old static check (YAML syntax + manifest JSON Schema + import resolution + duplicate-component detection) and parses Atmos tags natively — better coverage with no external dependency. Pointing --schemas-atmos-manifest at the in-repo schema lets a PR add a schema field and use it in an example in the same change.
Dogfooding immediately surfaced and fixed a real user-visible crash in the validate command.

references

quick-start-advanced and native-terraform are intentionally left out of the matrix (documented inline): the former's stacks/workflows/*.yaml uses newer workflow step types the manifest schema doesn't describe yet, and the latter intentionally configures no stacks.name_pattern.

fix(hooks): store-output hooks inherit the run's default identity Andriy Knysh (@aknysh) (#2662)

what

Make the terraform after-apply store-outputs hook path inherit the run's auto-detected identity for stores that don't declare their own identity, matching the main terraform path.
Add a new internal/exec.HookStoreDefaultIdentity helper (auto-detect the active identity from the auth manager's chain, normalize empty/select/disabled to ""); cmd/terraform's injectHookStoreAuthResolver now calls SetAuthContextResolverWithDefaultIdentity instead of the resolver-only variant.
Fix an adjacent bug: pkg/store.defaultIdentityForStore was missing *SecretsManagerStore (aws/asm), so AWS Secrets Manager stores never inherited a default identity on any path. Added the case so aws/asm behaves like aws/ssm.
Tests: internal/exec.TestHookStoreDefaultIdentity (new), cmd/terraform TestInjectHookStoreAuthResolver_InheritsDefaultIdentity (replaces …_ResolverOnly), updated pkg/store default-identity test so identity-less aws/asm asserts inheritance, and Floci E2E TestAWSStoreHooks_InheritedIdentity_FlociE2E with fixture aws-store-hooks-floci-inherit.
Fix doc: docs/fixes/2026-06-27-store-hook-inherit-default-identity.md.

why

Hook fix. Under Atmos auth, atmos terraform apply on a component with a store-outputs hook applied successfully but then failed in the hook when the target store had no identity:
```
INFO  Running hooks event=after.terraform.apply status=success
✓ Fetching <output> from <component> in <stack>
Error: failed to assume write role: … get identity: get credentials:
failed to refresh cached credentials, no EC2 IMDS role found, … ec2imds: GetMetadata …
```
Hooks run in a freshly-loaded config, so the apply-phase store registry (and its injected default identity) is gone. The hook re-injected the resolver but no default identity, so identity-less stores fell back to the default AWS SDK credential chain, which is empty under Atmos auth (credentials live in the keyring, not the environment), and dropped to EC2 IMDS. The main terraform path and !store reads already inherit the run's identity; this removes a surprising asymmetry and completes the follow-up explicitly deferred in #2625 ("Component-identity inheritance for identity-less stores is intentionally left for a follow-up design decision").
ASM fix. defaultIdentityForStore handled *SSMStore, *AzureKeyVaultStore, and *GSMStore but not *SecretsManagerStore, so aws/asm stores without an explicit identity could never inherit one. This was latent before (and was even codified by the old test); the hook fix's E2E surfaced it.
Backward compatible. HookStoreDefaultIdentity returns "" whenever no identity is resolved (no auth manager, or empty/select/disabled), and SetAuthContextResolverWithDefaultIdentity("") is a no-op for the default, so runs without Atmos auth keep their prior ambient/default-SDK credential behavior, and stores with an explicit identity are never overridden.

references

Follow-up to #2625 (AWS stores/secrets auth; deferred identity-less inheritance in the hook path).
Related fix docs: docs/fixes/2026-06-17-aws-stores-secrets-auth-and-gists.md, docs/fixes/2026-05-25-store-hook-missing-backend-role-assumption.md.

fix(secret): skip remote-state reads in credential-free secret list Brian Ojeda (@sgtoj) (#2657)

what

Make credential-free atmos secret list skip the YAML functions that perform authenticated backend reads (!terraform.state, !terraform.output, !store, !store.get) while it enumerates secret declarations.
Add a credentialFreeSkip() helper and use it in the two credential-free paths: secret list -s <stack> enumeration and the single-scope secret list -s <stack> -c <component> path without --verify.
Authenticated paths (get / set / exec / shell, and secret list --verify) are unchanged — they keep skipping only !secret.
Adds TestCredentialFreeSkip pinning the skip set and a docs/fixes write-up.

why

Secret listing is intentionally credential-free: it disables authentication so a large stack doesn't run one full auth cycle per component. But it still evaluated !terraform.state / !terraform.output / !store in component vars / settings. With auth disabled, the S3 backend assumes its configured role with no base credentials, the AWS SDK falls back to the default credential chain, and ultimately dials the EC2 IMDS endpoint — unreachable on a workstation — so the command aborts with a confusing assume-role/credentials error even immediately after a successful atmos auth login.
Listing only needs the static secrets.vars declarations (secrets.ExtractDeclarations), which never require a resolved value. Evaluating these functions was unnecessary and failure-prone. A skipped function leaves its raw string in place, which the declaration extractor ignores, so discovery is unchanged.
This is a regression: before credential-free enumeration was introduced, secret list authenticated per component, so these reads had credentials (slow, but working). Disabling auth removed the credentials without removing the reads.

references

Related to #2639 (originally reported against atmos secret list).
Follow-up to #2646, which made secret-list enumeration credential-free but left the credentialed function evaluation in place.
Write-up: docs/fixes/2026-06-23-secret-list-credential-free-skip.md
Verified with go test ./cmd/secret/... and the repo's custom-gcl lint (both green), and end-to-end against a multi-account repo whose components reference cross-account !terraform.state: secret list -s <stack> aborted before, completes after (no state reads, no credential-resolution fallbacks).

fix(auth): retry transient auth on freshly-brokered STS git clones Erik Osterman (Cloud Posse) (@osterman) (#2653)

what

Retry transient git authentication failures within a bounded window (default 30s, exponential backoff + jitter) only when Atmos brokered a fresh GitHub STS token in this process. This is wired through a new broker.HasBrokeredCredentials() signal and a CustomGitGetter.RetryAuthErrors flag (an existing per-source retry: config still takes precedence).
Keep auth failures terminal (fail fast) for non-brokered/static-credential clones, so a genuinely wrong or expired token is never masked by retries.
Surface previously-swallowed credential-broker failures at Warn (was Debug, invisible at the default Warning log level) and log an actionable Error when the brokered-auth retry window is exhausted.
Add tests: brokered retry succeeds, non-brokered fails fast, bounded-budget exhaustion, and a -race concurrency guard proving EnsureCredentials provisions exactly once with a happens-before barrier.
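
The retry policy above (bounded window, exponential backoff with jitter, only for brokered tokens) can be sketched as follows. This is not the CustomGitGetter implementation: retryBrokeredAuth, errAuth, and the 100ms base delay are hypothetical, and only the 30s window comes from the description.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

const (
	retryBudget = 30 * time.Second       // default window from the description
	retryBase   = 100 * time.Millisecond // assumed initial backoff
)

var errAuth = errors.New("authentication failed")

// retryBrokeredAuth retries op on auth errors only when the token was
// brokered in this process; all other failures are returned immediately.
func retryBrokeredAuth(brokered bool, budget, base time.Duration, op func() error) error {
	if base <= 0 {
		base = time.Millisecond
	}
	deadline := time.Now().Add(budget)
	delay := base
	for {
		err := op()
		if err == nil || !brokered || !errors.Is(err, errAuth) {
			return err
		}
		// Jitter: sleep a random duration in [0, delay), then double delay.
		sleep := time.Duration(rand.Int63n(int64(delay)))
		if time.Now().Add(sleep).After(deadline) {
			return fmt.Errorf("brokered-auth retry window exhausted: %w", err)
		}
		time.Sleep(sleep)
		delay *= 2
	}
}

func main() {
	attempts := 0
	err := retryBrokeredAuth(true, retryBudget, retryBase, func() error {
		attempts++
		if attempts < 3 {
			return errAuth // simulated post-mint 401 window
		}
		return nil
	})
	fmt.Println(attempts, err)
}
```

Gating on the brokered flag keeps static-credential failures terminal, so a wrong or expired token still surfaces on the first attempt.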

why

Under Atmos Pro cross-repo STS, atmos vendor pull intermittently failed with fatal: Authentication failed even though the same run logged a successful token mint and OIDC auth — a subset of clones failed and a rerun was clean.
Root cause is GitHub's brief post-creation 401 window: a just-minted installation token is not yet valid across all of GitHub's git frontends. The atmos-pro server already self-heals its own API calls on this 401 (Sentry APP-CLOUDPOSSE-COM-AM2), but the CLI git path did not — isRetryableGitError treated auth as terminal and vendor sources have no retry by default, so the earliest clones failed hard.
This gives the CLI the same tolerance the server has, scoped narrowly to brokered tokens so static credentials still fail fast, and removes the observability gap that made the failure hard to diagnose.

references

Token TTL is GitHub's standard ~60 min (confirmed in atmos-pro mint.ts / token-provider), ruling out mid-run expiry; the post-mint propagation window is the cause.
Follow-up (out of scope): revoke_on_exit cross-process token race on the shared, unlocked github/sts state.json.

perf(stacks): dedupe per-identity auth in nested terraform.state resolution Brian Ojeda (@sgtoj) (#2656)

what

Extends the per-component AuthManager memoization introduced in #2652 from the top-level describe stacks pass into the nested resolution path that runs while templates and YAML functions are evaluated (!terraform.state, !terraform.output, atmos.Component(...)).
Adds a process-scoped nestedAuthManagerCache, consulted by resolveAuthManagerForNestedComponent, keyed by the parent auth chain + a deterministic JSON fingerprint of the component's auth section.
Extracts the key logic into a shared buildComponentAuthCacheKey used by both the processor cache (#2652) and this nested path, so the two keying strategies cannot drift.
Caches only successful, non-nil resolutions; ResetStateCache() also clears the new cache (kept consistent with the terraformStateCache it pairs with). Neither is reset in production.
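
A key built from the parent auth chain plus a deterministic JSON fingerprint might look like the sketch below. It relies on encoding/json emitting map keys in sorted order; the function name authCacheKey and the key layout are assumptions, not the actual buildComponentAuthCacheKey.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"strings"
)

// authCacheKey joins the parent auth chain with a hash of the component's
// auth section. The second result is false when the section cannot be
// serialized, in which case the resolution should not be cached.
func authCacheKey(parentChain []string, authSection map[string]any) (string, bool) {
	raw, err := json.Marshal(authSection) // map keys are emitted in sorted order
	if err != nil {
		return "", false
	}
	sum := sha256.Sum256(raw)
	return strings.Join(parentChain, ">") + "|" + hex.EncodeToString(sum[:]), true
}

func main() {
	key, ok := authCacheKey(
		[]string{"sso", "admin"},
		map[string]any{"identity": "admin", "region": "us-east-2"},
	)
	fmt.Println(key, ok)
}
```

Two components with the same auth section under the same parent chain produce the same key, which is what lets distinct targets share one auth cycle.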

why

#2652 deduped per-component auth at the top level, but a component that references another component via !terraform.state still ran a full auth cycle (credential writes, file locks, keyring rebuilds) once per distinct target — even when every target resolves to the same identity. terraformStateCache only short-circuits a repeat read of the same target, not distinct targets that share an identity.
The result was the same N-auth blowup #2652 removed, just relocated into template/YAML resolution. Memoizing by identity removes it.

Measured — atmos describe stacks -s <stack> on a large real-world stack (credentials provided via auth exec, 45s cap; only the binary under test varies):

| build | wall time | per-component auth cycles |
| --- | --- | --- |
| latest release | DNF (>45s) | n/a |
| `main` (includes #2652) | ~17–19s | 44 |
| this PR (#2652 + nested dedup) | ~10–11s | 5 |

Output was verified equivalent to main: the remaining run-to-run differences are pre-existing auto_provision_workdir_for_outputs / tofu output provisioning nondeterminism present on both builds (same identity resolved throughout, no new errors). A matched-output pair differed by fewer lines than the main-vs-main noise floor.

The nested path is shared by describe affected, list, and terraform --all/--query. On a large multi-component change, the number of full auth cycles during describe affected likewise drops from scaling with the number of resolved components to roughly one per identity, with the rest served from the cache.

test plan

go build ./... && go test ./internal/exec/... — new unit tests cover key behavior, dedupe-by-identity, parent-chain isolation, errors-not-cached, unserializable-section-not-cached, and the ResetStateCache coupling.
custom-gcl run --new-from-rev=main → 0 issues.
Real-repo benchmark above.

references

Related to #2639
Builds on #2652
Design notes: docs/fixes/2026-06-22-dedupe-nested-component-auth.md

fix(describe): respect metadata.enabled when evaluating component functions Brian Ojeda (@sgtoj) (#2655)

what

Respect metadata.enabled when the shared describe pipeline (describe affected, describe stacks, list) evaluates a component's functions:
- !terraform.state / !terraform.output are skipped for components disabled via metadata.enabled: false — the raw function string is left unresolved (no backend read).
- atmos.Component(...) returns empty sections (including an empty outputs) when the enclosing component is disabled — no describe, no state read, and template-safe (.outputs.x / .vars.x evaluate to nil instead of erroring).
The gate keys strictly on metadata.enabled (via the existing isComponentEnabled), independent of vars.enabled.
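
As a rough sketch of the gate described above: the check looks only at metadata.enabled, treats absent metadata as enabled, and ignores vars.enabled. The names componentDisabled and resolveStateRef are hypothetical and stand in for the real isComponentEnabled-based logic, whose exact signature is not shown here.

```go
package main

import "fmt"

// componentDisabled reports whether a component section is disabled via
// metadata.enabled: false. Missing metadata or a missing key means enabled;
// vars.enabled is deliberately not consulted.
func componentDisabled(section map[string]any) bool {
	metadata, ok := section["metadata"].(map[string]any)
	if !ok {
		return false
	}
	enabled, ok := metadata["enabled"].(bool)
	return ok && !enabled
}

// resolveStateRef leaves the raw function string untouched for a disabled
// component, so no backend read happens; otherwise it calls read.
func resolveStateRef(section map[string]any, raw string, read func() string) string {
	if componentDisabled(section) {
		return raw
	}
	return read()
}

func main() {
	read := func() string { return "vpc-123" }
	raw := "!terraform.state vpc vpc_id"

	disabled := map[string]any{"metadata": map[string]any{"enabled": false}}
	varsOnly := map[string]any{"vars": map[string]any{"enabled": false}}

	fmt.Println(resolveStateRef(disabled, raw, read)) // raw string, unresolved
	fmt.Println(resolveStateRef(varsOnly, raw, read)) // vpc-123
}
```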

why

describe affected describes the current and base stacks and evaluates every component's templates/YAML functions with no metadata.enabled gate. A component disabled with metadata.enabled: false that references an unprovisioned component's state therefore failed hard with terraform state not provisioned — even though disabled components are (correctly) excluded from the final affected list. The enabled-aware filters (shouldSkipComponent, FilterAbstractComponents) only run when assembling that list, after the describe phase has already failed.

references

Fixes #2654
Design notes: docs/fixes/2026-06-22-describe-respect-metadata-enabled.md

test plan

Unit tests: disabledComponentTerraformSkip (adds the terraform funcs, clones the base skip), enclosingComponentDisabled (nil/absent metadata ⇒ enabled; vars.enabled:false alone ⇒ enabled; metadata.enabled:false ⇒ disabled), componentFunc returns empty sections for a disabled enclosing component, and an end-to-end processComponentEntry test (disabled ⇒ !terraform.state not resolved; enabled / vars.enabled:false-only ⇒ resolved).
go build ./..., go vet ./internal/exec/..., and custom-gcl run --new-from-rev=main (0 issues).

Note: the TestDescribeAffected* integration tests are environment-sensitive and fail identically on a clean main checkout locally (macOS); they are unrelated to this change. CI (Linux) is authoritative.

fix(stacks): scope and cache per-component auth in describe stacks Brian Ojeda (@sgtoj) (#2652)

what

Move the stack and component filters above resolveComponentAuthManager in processComponentEntry so only in-scope components authenticate (auth still precedes BuildTerraformWorkspace, template, and YAML-function processing).
Add a pass-scoped auth cache keyed by the parent chain + a deterministic JSON fingerprint of the component auth section, so components that share an auth section reuse one authenticated manager.
Regression tests: out-of-scope skip + cache reuse.
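
The effect of the reordering can be illustrated with a small sketch. It assumes a simplified model: component and countAuthCycles are hypothetical, and the cache is keyed by identity alone for brevity instead of the parent chain plus auth-section fingerprint.

```go
package main

import "fmt"

// component is a simplified, hypothetical view of a stack component.
type component struct {
	stack    string
	name     string
	identity string
}

// countAuthCycles applies the stack filter before authenticating, then reuses
// an already-authenticated identity within the same pass.
func countAuthCycles(components []component, stackFilter string) int {
	authenticated := map[string]bool{} // pass-scoped cache
	cycles := 0
	for _, c := range components {
		if stackFilter != "" && c.stack != stackFilter {
			continue // out of scope: never authenticated
		}
		if !authenticated[c.identity] {
			authenticated[c.identity] = true
			cycles++
		}
	}
	return cycles
}

func main() {
	components := []component{
		{"dev", "vpc", "dev-admin"},
		{"dev", "eks", "dev-admin"},
		{"prod", "vpc", "prod-admin"},
	}
	fmt.Println(countAuthCycles(components, "dev")) // 1
}
```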

why

Any auth-enabled ExecuteDescribeStacks caller — atmos describe stacks, atmos list values/instances, atmos terraform --all/--query — resolves per-component auth before the stack/component filters and never reuses it. On a multi-stack repo where components declare their own default: true identity, atmos describe stacks -s <stack> authenticates components in other stacks before discarding them, and re-authenticates each same-identity component from scratch — so the command effectively hangs.

Per-component auth exists only to populate info.AuthContext for that component's later template (atmos.Component(...)) and YAML-function (!terraform.state, !terraform.output) processing, which is skipped for filtered-out components — so authenticating them is wasted work.

#2646 fixed atmos secret list by disabling per-component auth for that command; it did not touch the shared processor, so every other caller still hits this.

Measured with the identical command atmos describe stacks -s <stack> --logs-level Debug under a 45s budget, only the atmos binary varying:

binary	result
latest release (v1.221.1)	did not complete within 45s (authenticating mostly out-of-scope stacks)
current `main` (`aa68d85be`)	did not complete within 45s
this PR	completed in ~18s

With the fix, in-scope processor-path authentications drop to 2 and out-of-scope ones to zero (the ~42 remaining auths are legitimate nested !terraform.output / atmos.Component reads).

references

Related to #2639; supersedes #2642 and #2644.
Fix write-up: docs/fixes/2026-06-22-describe-stacks-scope-and-cache-per-component-auth.md

fix(secrets): fast, credential-free atmos secret list Erik Osterman (Cloud Posse) (@osterman) (#2646)

what

Make atmos secret list require no authenticated identity and never decrypt — it only reports whether each secret is set. On a 72-component stack, listing drops from ~21–34s (it previously authenticated every component and decrypted every secret) to effectively instant.
Disable per-component authentication during secret-list stack enumeration.
Resolve SOPS initialization status from the file's cleartext key names — no age key, no decryption.
Rewrite every store's existence check (Has) to a non-decrypting metadata API: SSM GetParameter with WithDecryption=false, Secrets Manager DescribeSecret, GCP GetSecretVersion, Azure Key Vault properties pager, Vault KV metadata read, and a no-reveal 1Password check.
Add a tri-state STATUS (initialized / missing / unknown) plus a new --verify flag: remote-store status shows unknown by default; --verify contacts backends with a read/describe identity (never a decrypt identity) on a fully-scoped target.
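
A minimal sketch of the tri-state logic, under stated assumptions: status, resolveStatus, and the has callback are hypothetical names, and mapping a failed existence check to unknown is an assumption made for illustration, not confirmed behavior.

```go
package main

import (
	"errors"
	"fmt"
)

type status string

const (
	statusInitialized status = "initialized"
	statusMissing     status = "missing"
	statusUnknown     status = "unknown"
)

// resolveStatus checks local backends unconditionally and remote backends
// only when verify is set. has is a non-decrypting existence check.
func resolveStatus(local, verify bool, has func() (bool, error)) status {
	if !local && !verify {
		return statusUnknown // credential-free default for remote stores
	}
	exists, err := has()
	if err != nil {
		return statusUnknown // assumption: a failed check is reported as unknown
	}
	if exists {
		return statusInitialized
	}
	return statusMissing
}

func main() {
	present := func() (bool, error) { return true, nil }
	failing := func() (bool, error) { return false, errors.New("access denied") }

	fmt.Println(resolveStatus(true, false, present))  // initialized
	fmt.Println(resolveStatus(false, false, present)) // unknown
	fmt.Println(resolveStatus(false, true, present))  // initialized
	fmt.Println(resolveStatus(false, true, failing))  // unknown
}
```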

why

Listing is introspection — it shows which secrets are declared and whether they exist, and never needs a plaintext value, so it should not force authentication or decryption (or require kms:Decrypt-style permissions).
The old path authenticated per component and fetched+decrypted every secret just to populate the status column, making secret list slow (and prone to hanging) on real-world stacks and unusable without cloud credentials.
Existence on a remote store still needs a read credential, so those rows now default to unknown (credential-free) and opt into a real check via --verify, while local backends (SOPS) always show accurate status for free.

references

Supersedes the per-component auth-cache approach in #2644 (its atmos secret list workload is fully addressed here); follows #2642.
Docs: website/docs/cli/commands/secret/list.mdx; changelog: fast-credential-free-secret-list.

fix(secrets): SOPS cloud-KMS secrets authenticate via Atmos identity Erik Osterman (Cloud Posse) (@osterman) (#2643)

what

atmos secret and !secret (during terraform plan) against a SOPS cloud-KMS backend now authenticate using the Atmos identity — --identity / ATMOS_IDENTITY, the per-provider secrets.providers.<name>.identity, or the stack/component effective identity — instead of requiring ambient cloud credentials.
The cloud is inferred from the SOPS file's master-key type at runtime (AWS KMS / GCP KMS / Azure Key Vault); there is no per-cloud kind. Credentials are injected into the in-process getsops encrypt/decrypt via ApplyToMasterKey (no process-environment mutation).
Refactors SOPS into its own package pkg/secrets/providers/sops/ with a registry of per-cloud key handlers (aws.go / gcp.go / azure.go); the cloud-SDK credential building lives in the depguard-exempt pkg/store/sopsauth/ bridge so the SOPS package imports no cloud SDK directly.
Threads the auth resolver + effective identity to the provider via a transient AtmosConfiguration.SecretsAuth seam, populated in both the atmos secret and terraform code paths.
Fixes the SOPS decrypt error to emit identity/permission hints for cloud-KMS files (derived from the file's actual key types) instead of the misleading age-key hint.

why

Resolves #2637: the documented secrets.providers.<name>.identity field and --identity were silently ignored for the SOPS cloud-KMS track, forcing every command to be wrapped in atmos auth exec even though Track-1 stores (SSM/ASM/Key Vault/Secret Manager) already authenticated via the identity.
The fix is additive and backward compatible: with no resolvable identity, the SOPS provider falls back to the ambient credential chain exactly as before. kind remains only for the legitimate age-vs-KMS keygen distinction.
Covered by a Floci KMS end-to-end test (ambient AWS creds cleared, identity-only secret set/get — the exact #2637 scenario, RED before this change) plus unit tests for key-service selection, per-cloud registry dispatch, identity precedence, and kind-aware error hints.

references

Closes #2637
docs/fixes/2026-06-20-sops-cloud-kms-identity.md (root cause, fix, and full backend audit)

Add Atmos media kit and CI branding Erik Osterman (Cloud Posse) (@osterman) (#2636)

what

Add an Atmos media kit page, blog announcement, brandkit redirect, and generated ZIP download workflow.
Add logo, wordmark, animated gradient, Atmos CI, and Atmos AI SVG variants for light and dark surfaces.
Update native Terraform CI summaries and fixtures to use the Atmos CI lockup linking to https://atmos.tools/ci.

why

Provide a canonical source for Atmos brand assets and usage guidance.
Align CI summary branding with Atmos instead of the Cloud Posse logo.
Keep animated treatment assets downloadable and consistent across docs, media kit, and CI output.

references

Validation: go test ./pkg/ci/plugins/terraform
Validation: pnpm exec docusaurus build

Fix AWS store auth and add Floci E2E coverage Erik Osterman (Cloud Posse) (@osterman) (#2625)

what

Fix AWS SSM/Secrets Manager store auth during hooks, describe, and secret commands, including inherited identities and secret-store access enforcement.
Make slash kind notation canonical, add AWS store/secrets gists, document the fix, and add custom endpoint support for AWS, GCP Secret Manager, and Azure Key Vault.
Add opt-in Floci E2E tests and CI coverage for AWS, GCP, and Azure store/secrets workflows.

why

The reported SSM hook workflow could fall back to ambient AWS credentials or fail with a missing auth resolver even when the Terraform identity was valid.
The feature needed full-circle examples plus emulator-backed regression coverage so AWS stores and declared secrets stay working across providers.

references

No issue linked.

Fix use-version before command resolution Erik Osterman (Cloud Posse) (@osterman) (#2629)

what

Run explicit --use-version / ATMOS_USE_VERSION re-exec before Cobra resolves subcommands.
Add regression coverage for env var, --use-version=..., and --use-version ... forms with commands unknown to the current binary.
Add a few unrelated, test-only coverage improvements to satisfy Codecov; these do not change production behavior.

why

Cobra rejected newly added commands before PersistentPreRun could switch Atmos versions.
This restores the workflow for testing new commands from ref:, sha:, and PR Atmos builds.
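
One way to picture the fix is a scan of the raw arguments that runs before any subcommand resolution. The sketch below is illustrative only: findUseVersion is a hypothetical helper, and giving the flag precedence over the environment variable is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// findUseVersion scans raw arguments for --use-version=<v> or
// --use-version <v>, falling back to the environment value. It does not
// depend on the subcommand being known to the current binary.
func findUseVersion(args []string, envValue string) string {
	for i, arg := range args {
		if arg == "--" {
			break // everything after -- belongs to the wrapped command
		}
		if v, ok := strings.CutPrefix(arg, "--use-version="); ok {
			return v
		}
		if arg == "--use-version" && i+1 < len(args) {
			return args[i+1]
		}
	}
	return envValue
}

func main() {
	fmt.Println(findUseVersion([]string{"helm", "diff", "--use-version=1.222.0"}, "")) // 1.222.0
	fmt.Println(findUseVersion([]string{"--use-version", "1.222.0", "helm"}, ""))      // 1.222.0
	fmt.Println(findUseVersion([]string{"helm", "diff"}, "1.221.1"))                   // 1.221.1
}
```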

references

Closes #2624
Tested with go test ./cmd -run 'UseVersion|UnknownSubcommand|ParseUseVersion' and go test ./pkg/version -run 'CheckAndReexec|UseVersion|RefVersion'.

fix(toolchain): harden cosign verifier bootstrap Erik Osterman (Cloud Posse) (@osterman) (#2627)

what

Keep verifier bootstrap version resolution latest-first, using the existing authenticated GitHub/Aqua lookup path.
Add a sigstore/cosign@v3.0.6 fallback only when latest-version lookup fails.
Add Renovate regex-manager coverage for the fallback cosign version so the safety pin is updateable.
Update installer tests to prove latest wins when available, cosign falls back when lookup fails, and non-pinned verifier lookup errors still surface.

why

Prevent OpenTofu toolchain installs from failing when cosign auto-install hits a slow or unavailable GitHub releases API.
Avoid making the fallback version the default forever; normal installs still use the latest resolved cosign release when GitHub lookup succeeds.
Preserve existing escape hatches: existing cosign on PATH still wins, and verifier_install: path_only still disables auto-install.

references

Failing run: https://github.com/cloudposse/atmos/actions/runs/27661641040/job/81808473011
Fallback cosign release: https://github.com/sigstore/cosign/releases/tag/v3.0.6

test: stabilize Terraform cache coverage Erik Osterman (Cloud Posse) (@osterman) (#2620)

what

Add environment overrides for components.terraform.cache.enabled and components.terraform.cache.location, plus docs in the Terraform config and environment variable references.
Add focused registry-cache coverage, including Windows-safe trust command unit tests and a non-golden acceptance test with an isolated cache location.
Stabilize acceptance CI provider reuse with a process-level TF_PLUGIN_CACHE_DIR under the Atmos XDG cache root, and bump the CI cache key so actions/cache saves a fresh provider-plugin cache.

why

The native registry cache should be testable on Windows only after its loopback certificate is trusted, but it should not be enabled globally where cold/warm cache state can flip snapshots or screenshots.
Windows timeout mitigation should use Terraform’s provider plugin cache, which avoids the native cache proxy TLS trust problem.
The new environment overrides make targeted cache dogfooding possible without editing shared fixture atmos.yaml files.

references

Related context from #2607.
Validated with go test ./pkg/config -run 'TestViperBindEnv_.*Cache', go test ./pkg/terraform/cache -run 'Test.*Trust|Test.*Windows', go test ./tests -run TestTerraformRegistryCache -timeout 10m, and git diff --check.

refactor(utils): drop dead helpers and hand-rolled SliceContainsString Erik Osterman (Cloud Posse) (@osterman) (#2608)

what

Replace the hand-rolled SliceContainsString / SliceContainsStringHasPrefix / SliceContainsStringStartsWith helpers with stdlib slices.Contains / slices.ContainsFunc across ~39 call sites, and remove the helpers from pkg/utils.
Delete nine dead exported functions that had zero callers anywhere: ExtractAtmosConfig, GetGitHubRepoReleases, GetGitHubReleaseByTag, GetGitHubLatestRelease, PrintAsHcl, NewHighlightWriter (plus the now-orphaned HighlightWriter type/method), GetAtmosConfigJSON, PrintAsJSONToFileDescriptor, and PrintAsYAMLWithConfig — including the now-empty config_utils.go and cascaded-unused imports/aliases.
Convert two depends_on dynamic errors in stack_utils.go to wrapped static errors (ErrDependencyResolution); their messages now carry a dependency resolution failed: prefix.

why

First step in dismantling pkg/utils, one of the repo's historical "dumping grounds" — CLAUDE.md already forbids adding to it, so this begins draining it.
slices.Contains performs the same O(n) scan as the deleted helper (the hot path in yaml_utils.go already uses an O(1) map), so the swap changes neither behavior nor performance; it also drops a per-call perf.Track defer.
The static-error conversion satisfies the err113 lint gate after a flagged if/else chain was restructured into early returns, and aligns with the mandatory static-error policy.

references

Internal cleanup; no issue. Follow-up PRs will relocate the remaining pkg/utils files into purpose-built packages (pkg/yaml, pkg/filesystem, pkg/data, etc.).

fix(terraform): restore init + workspace in terraform shell, add --skip-init Erik Osterman (Cloud Posse) (@osterman) (#2616)

what

Restore terraform init and Terraform workspace select/new to atmos terraform shell so the interactive shell again starts in an initialized component and the correct workspace (not default).
Extract a provisioner-free executeTerraformInitCommand from executeTerraformInitPhase so the shell can run init without re-firing the before.terraform.init provisioners it already runs (no double execution). Main ExecuteTerraform pipeline behavior is unchanged.
The shell now resolves the terraform/tofu binary (and toolchain), generates backend/provider-override files, and assembles the full component environment before launching — matching the shared pipeline.
Add a --skip-init opt-out to atmos terraform shell (reuses the existing terraform flag; no new flag definition). Workspace selection stays governed by workspaces_enabled.
Add regression tests for the init → workspace → shell ordering, the --skip-init decoupling, and the shell's shouldRunTerraformInit/shouldSkipWorkspaceSetup contract; document --skip-init in the command docs.

why

This was an accidental regression introduced in v1.202.0 by #1813, which migrated terraform shell to a standalone ExecuteTerraformShell and silently dropped the init + workspace steps that the shared ExecuteTerraform pipeline used to run.
The result contradicted the published docs (which promise the command generates a backend file and creates the component's workspace) and forced users to pin old versions.
--with-secrets behavior is preserved: secrets are still kept out of the on-disk varfile and withheld from the shell unless explicitly requested.

references

Regression introduced in #1813 (first released in v1.202.0).

fix(templates): honor ignore_missing_template_values for stack name_template (#2345) Andriy Knysh (@aknysh) (#2619)

what

Route the global templates.settings.ignore_missing_template_values flag into every stack name_template rendering site. Previously all 11 name-template ProcessTmpl(...) call sites passed a hardcoded false, so the flag was silently ignored for name-template rendering.
Sites updated: atlantis stack name, EKS cluster name, spacelift admin/stack name (describe affected), describe locals name, spacelift utils, terraform workspace, terraform generate backends/varfiles, the shared name-template util, and validate stacks.
Add TestBuildTerraformWorkspace_IgnoreMissingTemplateValues asserting both directions (flag off → error; flag on → <no value>).
Incidental cleanup: gofumpt reformatting touched two adjacent pre-existing fmt.Errorf calls in stack_utils.go, which surfaced them as err113 debt under golangci-lint --new-from-rev=origin/main. They are converted to the mandated static-wrapped-error pattern (new sentinels ErrInvalidDependsOn / ErrInvalidSettingsDependsOn in errors/errors.go), with tests covering both resolution branches and the errors.Is behavior.
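
The two directions of the flag correspond to standard Go text/template behavior, which the sketch below demonstrates. renderName is a hypothetical helper, and mapping the flag onto the missingkey option is an assumption about how ProcessTmpl behaves rather than a description of its code.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderName renders a name template against map data. With ignoreMissing
// off, a missing key is a hard error; with it on, text/template's default
// behavior prints "<no value>" instead.
func renderName(tmpl string, data map[string]any, ignoreMissing bool) (string, error) {
	option := "missingkey=error"
	if ignoreMissing {
		option = "missingkey=default"
	}
	t, err := template.New("name").Option(option).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	data := map[string]any{"tenant": "acme"}
	tmpl := "{{ .tenant }}-{{ .stage }}"

	_, err := renderName(tmpl, data, false)
	fmt.Println(err != nil) // true (map has no entry for key "stage")

	name, _ := renderName(tmpl, data, true)
	fmt.Println(name) // acme-<no value>
}
```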

why

When a user sets templates.settings.ignore_missing_template_values: true, they still hit hard errors like map has no entry for key "..." whenever the error originated from rendering the stack name_template — because the name-template ProcessTmpl sites bypassed the flag.
The fix is behavior-preserving: the flag defaults to false, so existing configurations render exactly as before; behavior only changes for users who explicitly opt in.
The err113 conversion follows the repository's mandated static-error pattern and keeps the pre-commit/CI lint green; messages are unchanged.

references

Closes #2345
Fix doc: docs/fixes/2026-06-15-name-template-ignore-missing-template-values.md

Summary by CodeRabbit

Bug Fixes
- Updated template rendering to consistently honor ignore_missing_template_values across stack- and dependency-related name derivations (including Terraform workspace and generated stack naming).
- Added clearer error handling for invalid depends_on inputs via dedicated sentinel errors.
Tests
- Added regression tests covering enabled/disabled ignore_missing_template_values behavior and dependency resolution success/failure.
Documentation
- Added a documentation page explaining the corrected ignore_missing_template_values behavior for stack name template rendering.

fix(flags): scope --skip-hooks to the terraform command subtree Erik Osterman (Cloud Posse) (@osterman) (#2578)

what

Scope --skip-hooks to the terraform command subtree. The flag (and ATMOS_SKIP_HOOKS) moved off the global flag set onto atmos terraform and its subcommands, so it no longer appears in the help of unrelated commands (auth, helmfile, atlantis, toolchain, about, secret, …). Lifecycle hooks only ever run on terraform plan/apply/deploy.
Stop tracking native-ci CI scratch output. tests/fixtures/scenarios/native-ci/{github-output,github-step-summary}.txt are runtime artifacts; they are now gitignored and untracked (matching the newer native-ci-gha-plan scenario).
Standardize the CLI test suite on OpenTofu. The suite forces ATMOS_COMPONENTS_TERRAFORM_COMMAND=tofu via a single test-harness default, gates every binary-invoking test on a precondition so a missing binary skips cleanly (instead of baking "executable file not found" into goldens), and sanitizes the harness-injected env var out of debug snapshots. A small parity set (terraform -help/-version passthrough) opts back into terraform.
Provision test tooling via the Atmos toolchain (dogfooding). TestMain installs any missing pinned binary (terraform/tofu/packer/helmfile/helm) through the Atmos toolchain itself and prepends it to PATH — "install as necessary", so CI (which supplies them via setup-* actions) downloads nothing while local runs become self-contained. No host binaries (brew, etc.) required.

why

--skip-hooks on every command was misleading — hooks only run on terraform. Mirrors the existing --github-token/toolchain scoping precedent.
The native-ci scratch files were tracked, so every local run without terraform dirtied them. They're CI artifacts, not fixtures.
Test runs depended on whatever terraform/tofu binary was on the host; a missing binary silently corrupted golden snapshots and tracked fixtures. Standardizing on a single, license-clean (MPL) OpenTofu — with explicit preconditions — makes the suite deterministic and host-independent. The product runtime default stays terraform; only tests change.
Provisioning tools through the toolchain dogfoods the feature and removes the dependency on host-installed binaries, so the suite runs the same way everywhere.

references

Follows the --github-token/toolchain flag-scoping precedent in pkg/flags/global_builder.go.

fix(toolchain): retry cosign verification on transport-level network errors Erik Osterman (Cloud Posse) (@osterman) (#2604)

what

Add a transportFlakeMarkers allowlist to the cosign retry classifier (pkg/toolchain/verification/signature_rekor.go) so transport-level network errors are retried like other transient Sigstore Rekor flakes:
- stream error: stream ID (Go net/http2 stream errors — covers all HTTP/2 error codes and both send/recv variants)
- connection reset by peer
- TLS handshake timeout
- i/o timeout
- unexpected EOF
Extend TestClassifyCosignError with the exact error observed in CI plus one case per new marker, and add TestRunCosignWithRetry_RecoversFromTransportFlake covering end-to-end retry recovery.

why

CI failed on TestToolchainCustomCommands_InstallAllTools/Install_tofu while toolchain install opentofu/opentofu@1.9.0 was verifying the download signature. Cosign's query to the Sigstore Rekor transparency log died with:

searching log query: stream error: stream ID 1; INTERNAL_ERROR; received from peer

Atmos already retries cosign flakes (runCosignWithRetry, 5 attempts with exponential backoff), but the retryable classification is a deliberate allowlist that only recognized Rekor HTTP response flakes (searchLogQueryBadRequest, the IEEE_P1363 decode error, and 5xx scoped to the tlog retrieve endpoint). An HTTP/2 transport error matched none of the markers, so it surfaced on the first attempt with no retry.

Broadening to transport-level failures is safe within the allowlist's design rule: the allowlist exists so a real signature verdict (tampering, identity mismatch, expired cert) is never silently retried away. A transport failure means the request never completed and no verdict was rendered, so retrying it categorically cannot mask tampering. Existing negative tests (tampered artifact, identity mismatch, generic failure) continue to assert those still fail on the first attempt.

references

Observed failure: Acceptance Tests (linux), TestToolchainCustomCommands_InstallAllTools/Install_tofu
