[codex] consolidate terraform bulk execution on scheduler #2462

Closed
Mikhail Shirkov (shirkevich) wants to merge 15 commits into cloudposse:codex/dag-scheduler-core from shirkevich:codex/dag-terraform-graph-bulk-path

@shirkevich Mikhail Shirkov (shirkevich) commented May 21, 2026

Summary

  • route Terraform --all, --components, and --query through the scheduler-backed Terraform adapter
  • build Terraform dependency graphs from dependencies.components first, with settings.depends_on fallback
  • preserve query-path auth manager setup, store resolver bridging, YAML function processing, and per-component CI hook capture
  • include the identity/auth fixes from #2348 (fix(terraform): preserve explicit identity and auth context for local runs) in this stack so that local terraform testing with --identity works
  • include the credential-store concurrency-safety prerequisite discovered by the concurrency-4 timing experiment
  • serialize Terraform nodes that share the same physical component path so logical aliases of one Terraform component do not race on local .terraform metadata and generated files

Stacking

Branch history is stacked on #2461 and includes #2348. GitHub cannot use #2461 as the visible base because #2461's head branch currently lives in the fork, and pushing mirror branches to cloudposse/atmos is not permitted from this account.

Draft note

This PR is intentionally a draft that reflects the current state of the branch. The branch currently includes a temporary ATMOS_EXPERIMENTAL_DAG_MAX_CONCURRENCY override used for local timing experiments; the PR3-ready version should force effective concurrency back to 1 before review.

This branch now includes two concurrency-safety prerequisites discovered by timing experiments:

  • credential-store initialization no longer mutates global Viper env bindings per component and preserves ATMOS_KEYRING_TYPE precedence
  • Terraform adapter dispatch now locks by physical component path, preserving parallelism across distinct folders while serializing logical aliases that share generated files and .terraform metadata
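
The physical-path locking described above can be sketched roughly as follows. This is an assumption-laden illustration, not the adapter's actual code: `pathLocks`, `lockFor`, and `run` are hypothetical names, and keying on `filepath.Clean` is a simplification (a real implementation might also resolve absolute paths or symlinks).

```go
package main

import (
	"fmt"
	"path/filepath"
	"sync"
)

// pathLocks hands out one mutex per cleaned physical component path.
type pathLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newPathLocks() *pathLocks {
	return &pathLocks{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex for a path, creating it on first use.
func (p *pathLocks) lockFor(path string) *sync.Mutex {
	key := filepath.Clean(path)
	p.mu.Lock()
	defer p.mu.Unlock()
	l, ok := p.locks[key]
	if !ok {
		l = &sync.Mutex{}
		p.locks[key] = l
	}
	return l
}

// run executes fn while holding the lock for the node's physical path, so
// aliases of one folder are serialized and distinct folders stay parallel.
func (p *pathLocks) run(path string, fn func()) {
	l := p.lockFor(path)
	l.Lock()
	defer l.Unlock()
	fn()
}

func main() {
	locks := newPathLocks()
	// Hypothetical logical nodes; vpc-a and vpc-b share one physical folder.
	nodes := map[string]string{
		"vpc-a": "components/terraform/vpc",
		"vpc-b": "components/terraform/vpc/",
		"eks":   "components/terraform/eks",
	}
	var wg sync.WaitGroup
	for name, path := range nodes {
		wg.Add(1)
		go func(name, path string) {
			defer wg.Done()
			locks.run(path, func() {
				fmt.Println("running", name, "in", filepath.Clean(path))
			})
		}(name, path)
	}
	wg.Wait()
}
```

The lock map is never pruned in this sketch, which is acceptable for a single bulk run but would need revisiting for a long-lived process.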

Validation

  • go test ./pkg/scheduler ./pkg/scheduler/adapters ./internal/exec -run 'TestExecuteTerraformQuery|TestExecuteTerraformQueryNoMatches|TestBuildTerraformDependencyGraph|TestExecuteTerraformAllUsesGraphBackedSequentialOrder|TestExecuteTerraformComponentsUsesGraphBackedSequentialOrder|TestExecuteTerraformQueryUsesGraphBackedSequentialOrder|TestExecuteTerraformSerializesSharedPhysicalComponentPath|TestExecuteTerraformAllowsParallelDifferentPhysicalComponentPaths|TestBuildTerraformGraph'
  • go test ./pkg/auth/credentials
  • go test -race ./pkg/auth/credentials -run TestNewCredentialStoreWithConfig_ConcurrentInitialization
  • go test ./pkg/auth ./internal/exec -run 'TestCreateAndAuthenticateManagerWithAtmosConfig|TestSetupTerraformAuth|TestProcessComponentConfig_PropagatesAuthManager|TestProcessComponentConfig_AuthManagerGuardBranches'
  • built build/atmos and live-tested against a downstream stack with terraform plan --all and an explicit identity

Local timing findings

  • concurrency 1: success, real 202.33s
  • concurrency 2: success, real 136.56s
  • concurrency 4 before credential-store fix: failed; captured rerun panicked with fatal error: concurrent map writes in viper.BindEnv during auth credential store setup
  • concurrency 4 after credential-store fix: success, real 110.59s; no fatal error, no concurrent map writes, and no viper.(*Viper).BindEnv stack in the captured log
  • concurrency 8 before physical-path locking: failed, real 96.36s, with local OpenTofu state lock contention at .terraform/terraform.tfstate while multiple logical components shared one Terraform component directory
  • concurrency 8 after physical-path locking: success, real 93.47s; no state-lock error and no auth panic
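
To put the successful runs on one scale, the sketch below computes speedup over the concurrency-1 baseline and a naive parallel efficiency (speedup divided by concurrency). It uses only the wall-clock numbers listed above; `speedup` is a hypothetical helper, and these are single local runs rather than averaged benchmarks.

```go
package main

import "fmt"

// speedup is the baseline wall-clock time divided by the measured time.
func speedup(baseline, measured float64) float64 {
	return baseline / measured
}

func main() {
	const baseline = 202.33 // concurrency 1, seconds

	// Successful runs only (after the credential-store and path-locking fixes).
	runs := []struct {
		concurrency int
		seconds     float64
	}{
		{2, 136.56},
		{4, 110.59},
		{8, 93.47},
	}
	for _, r := range runs {
		s := speedup(baseline, r.seconds)
		efficiency := 100 * s / float64(r.concurrency)
		fmt.Printf("concurrency %d: %.2fx speedup, %.0f%% efficiency\n",
			r.concurrency, s, efficiency)
	}
}
```

The speedups come out to roughly 1.48x, 1.83x, and 2.16x, so returns diminish quickly past concurrency 4 in this stack. That would be consistent with dependency ordering and shared-folder serialization limiting the available parallelism, although the measurements alone do not establish the cause.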

Follow-up discussion

The short-term fix keeps current Atmos/Terraform behavior debuggable: commands still run in the normal component directory, and aliases sharing a physical folder are serialized. The longer-term way to unlock true parallelism for aliases sharing a folder would be per-node isolated workdirs plus isolated TF_DATA_DIR and generated files. That needs repo-owner discussion because it changes the operator debugging model: Atmos would need to decide whether and how to retain those per-node copies for inspection, how atmos terraform shell maps to them, and how cleanup/debug artifacts are managed.

Output remains heavily interleaved when concurrency is greater than 1, so user-visible concurrency still needs output orchestration before shipping.

…ight noise

Prevent terraform execution from falling back to default CI identity when users pass --identity/-i, skip CI hook setup on non-CI local runs unless forced, and propagate AuthManager through component processing so nested terraform state/output resolution reuses the authenticated identity.

Made-with: Cursor
- Added tests to verify the behavior of the `-i` flag as an optional-value flag, ensuring that trailing native flags are preserved during pass-through stripping.
- Implemented a test to confirm that explicitly passing an empty identity flag (`--identity=`) is treated as an interactive-selection sentinel, preventing it from being overridden by environment variables.
- Updated `TestProcessComponentConfig_PropagatesAuthManager` to verify that the AuthContext with AWS credentials is propagated correctly.

These changes improve the robustness of identity handling in the Atmos CLI and ensure accurate propagation of authentication contexts.
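
The precedence these tests describe can be sketched as a small resolver: an explicit flag value wins, an explicitly empty flag means interactive selection, and the environment is consulted only when the flag was not passed. `identityChoice` and `resolveIdentity` are hypothetical names, and the sketch does not model the bare `-i` optional-value case.

```go
package main

import "fmt"

// identityChoice is a hypothetical result type for this sketch.
type identityChoice struct {
	Name        string
	Interactive bool
}

// resolveIdentity applies the precedence: explicit flag value, then the
// empty-flag interactive sentinel, then the environment value.
func resolveIdentity(flagSet bool, flagValue, envValue string) identityChoice {
	switch {
	case flagSet && flagValue == "":
		// --identity= was passed explicitly: prompt, and ignore the environment.
		return identityChoice{Interactive: true}
	case flagSet:
		return identityChoice{Name: flagValue}
	default:
		return identityChoice{Name: envValue}
	}
}

func main() {
	fmt.Printf("%+v\n", resolveIdentity(true, "admin", "ci-default"))
	fmt.Printf("%+v\n", resolveIdentity(true, "", "ci-default"))
	fmt.Printf("%+v\n", resolveIdentity(false, "", "ci-default"))
}
```

Tracking whether the flag was set separately from its value is what lets an empty value act as a sentinel instead of being treated as "unset".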
…ity-ci-preflight-noise

# Conflicts:
#	cmd/auth.go
#	cmd/auth_exec.go
#	cmd/auth_shell.go
#	cmd/terraform/flags.go
#	internal/exec/cli_utils_identity_test.go
#	internal/exec/cli_utils_test.go
#	pkg/component/ansible/executor.go
#	pkg/flags/global_registry.go
#	pkg/hooks/hooks_test.go
…ght-noise' into codex/dag-terraform-graph-bulk-path

atmos-pro Bot commented May 21, 2026

Tip

Atmos Pro  

No affected stacks workflow was detected for this pull request.
If this is expected, no action is needed.
Learn More.

@shirkevich Mikhail Shirkov (shirkevich) force-pushed the codex/dag-terraform-graph-bulk-path branch from ea62d6f to 9e83b56 Compare May 21, 2026 10:23
@shirkevich Mikhail Shirkov (shirkevich) changed the base branch from main to codex/dag-scheduler-core May 21, 2026 18:13
@mergify mergify Bot added the stacked Stacked label May 21, 2026
@shirkevich Mikhail Shirkov (shirkevich) commented

Superseded by same-repo draft PR #2466, which uses cloudposse/atmos as the head repository and is based on codex/dag-scheduler-core.

Labels

size/l Large size PR stacked Stacked
