feat: add Ludus and Proxmox provider support with unified infra/provision orchestration #141

Merged: l50 merged 40 commits into main from feat/provider-agnostic-provisioning on Apr 29, 2026

@l50 (Contributor) commented Apr 23, 2026

Key Changes:

  • Introduced first-class Ludus and Proxmox provider support (SSH/WinRM orchestration, inventory bootstrap, VM management)
  • Refactored CLI to use provider abstraction for all infrastructure and command execution operations
  • Added new init, up, and lab reset commands for streamlined setup and full end-to-end lab orchestration
  • Enhanced validator with embedded PowerShell scripts for robust, provider-agnostic vulnerability checks

Added:

  • Provider abstraction layer (internal/provider): Unified interface and factory for AWS, Ludus, and Proxmox; supports VM discovery, command execution, and lifecycle operations
  • Ludus provider implementation: SSH-based CLI orchestration, WinRM tunnel via Go SSH, SOCKS5 proxy, auto CLI install, API key/env support, and range/VM management
  • Proxmox provider implementation: Direct REST API integration, QEMU agent execution, and lab orchestration via Terraform templates
  • Inventory bootstrapping from provider templates for Ludus/Proxmox; dynamic IP range resolution
  • Embedded PowerShell script runner for validator: Safe text/template rendering, output JSON bracketing, and strongly typed result parsing
  • New CLI commands:
    • init: Interactive setup wizard with provider probing and config generation
    • up: End-to-end pipeline (doctor → infra → provision → health-check) with resume support
    • lab reset and lab purge-unmanaged: Automated AD object cleanup and AD-state restoration
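
The `up` pipeline (doctor → infra → provision → health-check) with resume support can be pictured as an ordered list of stages in which a resume point skips everything before it. The following is a minimal sketch under that assumption; `stage`, `runPipeline`, and `demoStages` are hypothetical names for illustration, and the actual command may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
)

// stage is a hypothetical named pipeline step, not the type used by the real CLI.
type stage struct {
	name string
	run  func() error
}

// runPipeline executes stages in order. When resumeFrom is non-empty, every
// stage before it is skipped. It returns the names of the stages that ran and,
// on failure, an error naming the stage to resume from.
func runPipeline(stages []stage, resumeFrom string) ([]string, error) {
	start := 0
	if resumeFrom != "" {
		start = -1
		for i, s := range stages {
			if s.name == resumeFrom {
				start = i
				break
			}
		}
		if start < 0 {
			return nil, fmt.Errorf("unknown stage %q", resumeFrom)
		}
	}
	var ran []string
	for _, s := range stages[start:] {
		ran = append(ran, s.name)
		if err := s.run(); err != nil {
			return ran, fmt.Errorf("stage %s failed (resume from %q): %w", s.name, s.name, err)
		}
	}
	return ran, nil
}

// demoStages builds the four stages named in the description. failAt makes one
// stage fail, to simulate an interrupted run.
func demoStages(failAt string) []stage {
	names := []string{"doctor", "infra", "provision", "health-check"}
	stages := make([]stage, 0, len(names))
	for _, n := range names {
		n := n
		stages = append(stages, stage{name: n, run: func() error {
			if n == failAt {
				return errors.New("simulated failure")
			}
			return nil
		}})
	}
	return stages
}

func main() {
	ran, err := runPipeline(demoStages("provision"), "")
	fmt.Println(ran, err)
	ran, err = runPipeline(demoStages(""), "provision")
	fmt.Println(ran, err)
}
```

The first call stops at the failing stage and reports where to resume; the second call starts at that stage and skips the two that already succeeded.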

Changed:

  • CLI command handling refactored to dispatch on provider, supporting AWS, Ludus, and Proxmox seamlessly for all infra, lab, and provisioning flows
  • Health check and validate now execute via provider abstraction, supporting WinRM/SSH in addition to SSM
  • Inventory sync and mapping commands generalized for non-SSM providers (Ludus/Proxmox IPs)
  • Ansible retry and SSM session cleanup logic updated for provider-agnostic session management
  • All validator checks now use embedded, parameterized PowerShell scripts with safe interpolation and robust error handling
  • Docs updated: CLI reference, Ludus provider guide, and architecture diagram
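
To illustrate what "safe interpolation" and "output JSON bracketing" can look like for templated PowerShell, here is a small sketch. It assumes values are injected as single-quoted PowerShell literals and that the script prints its JSON payload between two marker strings. The markers, `psQuote`, `renderScript`, `extractJSON`, and `taskResult` are illustrative names, not the validator's actual helpers.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"text/template"
)

// Hypothetical markers that bracket the JSON payload in script output.
const (
	beginMarker = "<<<JSON_BEGIN>>>"
	endMarker   = "<<<JSON_END>>>"
)

// psQuote renders s as a PowerShell single-quoted literal. Inside single
// quotes the only character that needs escaping is ' itself, written as ''.
func psQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// renderScript fills a script template, exposing psQuote as the "ps" function
// and failing on any variable the caller forgot to supply.
func renderScript(tmpl string, vars map[string]string) (string, error) {
	t, err := template.New("script").
		Funcs(template.FuncMap{"ps": psQuote}).
		Option("missingkey=error").
		Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, vars); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// extractJSON returns the text between the markers, ignoring any noise
// (warnings, banners) printed before or after the payload.
func extractJSON(output string) ([]byte, error) {
	i := strings.Index(output, beginMarker)
	if i < 0 {
		return nil, errors.New("begin marker not found")
	}
	rest := output[i+len(beginMarker):]
	j := strings.Index(rest, endMarker)
	if j < 0 {
		return nil, errors.New("end marker not found")
	}
	return []byte(strings.TrimSpace(rest[:j])), nil
}

// taskResult is a hypothetical typed result for a scheduled-task probe.
type taskResult struct {
	Found bool   `json:"found"`
	State string `json:"state"`
}

func parseTaskResult(output string) (taskResult, error) {
	var r taskResult
	raw, err := extractJSON(output)
	if err != nil {
		return r, err
	}
	err = json.Unmarshal(raw, &r)
	return r, err
}

func main() {
	script, err := renderScript("Get-ScheduledTask -TaskName {{ps .Task}}",
		map[string]string{"Task": "O'Brien backup"})
	fmt.Println(script, err)

	out := "WARNING: noise\n" + beginMarker + `{"found":true,"state":"Ready"}` + endMarker + "\n"
	r, err := parseTaskResult(out)
	fmt.Printf("%+v %v\n", r, err)
}
```

Quoting at render time keeps a value such as `O'Brien backup` from terminating the string early, and the markers let the parser ignore anything the remote shell prints around the payload.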

Removed:

  • AWS-only assumptions from CLI, provisioning, and validator flows
  • Legacy shell-out-to-ansible-over-ssh for Ludus validation (replaced with direct WinRM via Go SSH dialer)
  • LUDUS_VERSION env requirement (now auto-detected)
  • Manual inventory copy step (now auto-bootstrapped per provider)

@l50 l50 changed the title ``` feat: add provider-agnostic inventory and session management logic Apr 23, 2026
@dreadnode-renovate-bot (bot) added the lab/GOAD, area/ad-labs, area/packer, area/docs, and area/roles labels on Apr 25, 2026
l50 added 9 commits April 26, 2026 19:24
… integration

**Added:**

- Provider abstraction layer with dynamic registration for AWS, Proxmox, and Ludus
- Ludus provider implementation using Ludus CLI for VM/range management and command execution
- Proxmox provider implementation using Proxmox REST API for VM lifecycle and guest agent commands
- Generic infrastructure command handling for infra, provisioning, inventory, and validation, supporting all providers
- Terraform wrapper for direct apply/plan/init/output for template-based providers (Proxmox)
- Template rendering engine for Proxmox Terraform files with variable substitution
- Ludus-specific infrastructure validation and configuration in CLI
- Provider-specific configuration in dreadgoad.yaml with defaults and docs
- Ludus documentation detailing template requirements, deployment workflow, and troubleshooting
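
A provider layer with dynamic registration is commonly built as a name-to-factory map guarded by a mutex. The sketch below shows that pattern under stated assumptions: the `Provider` interface is trimmed to two methods, and `Register`, `New`, `Registered`, and `stubProvider` are illustrative names that may not match the code in `internal/provider`.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Provider is a hypothetical, trimmed-down interface for illustration.
type Provider interface {
	Name() string
	RunCommand(host, command string) (string, error)
}

// Factory builds a provider from its configuration section.
type Factory func(cfg map[string]string) (Provider, error)

var (
	mu        sync.RWMutex
	factories = map[string]Factory{}
)

// Register makes a provider available under name; typically called from init().
func Register(name string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	factories[name] = f
}

// Registered lists known provider names in sorted order.
func Registered() []string {
	mu.RLock()
	defer mu.RUnlock()
	names := make([]string, 0, len(factories))
	for n := range factories {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// New looks up the factory for name and builds the provider.
func New(name string, cfg map[string]string) (Provider, error) {
	mu.RLock()
	f, ok := factories[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("unknown provider %q (registered: %v)", name, Registered())
	}
	return f(cfg)
}

// stubProvider stands in for a real implementation.
type stubProvider struct{ name string }

func (s stubProvider) Name() string { return s.name }
func (s stubProvider) RunCommand(host, command string) (string, error) {
	return fmt.Sprintf("[%s] %s: %s", s.name, host, command), nil
}

func registerStub(name string) {
	Register(name, func(map[string]string) (Provider, error) {
		return stubProvider{name: name}, nil
	})
}

func main() {
	for _, n := range []string{"aws", "ludus", "proxmox"} {
		registerStub(n)
	}
	p, err := New("ludus", nil)
	if err != nil {
		fmt.Println(err)
		return
	}
	out, _ := p.RunCommand("dc01", "hostname")
	fmt.Println(out)
	_, err = New("gcp", nil)
	fmt.Println(err)
}
```

With this shape, CLI commands depend only on the interface, and adding a provider means registering one more factory.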

**Changed:**

- Refactored all CLI commands (infra, provision, lab, inventory, health-check, validate, ssm) to use the provider interface instead of hardcoded AWS logic
- Inventory bootstrap now supports dynamic provider templates and Ludus IP range detection
- Health checks, validation, and trust verification now work across AWS, Proxmox, and Ludus
- Ansible retry logic now uses provider session management and SSM recovery interfaces where available
- Improved infra subcommand UX for each provider, including Ludus and Proxmox-specific flows
- Updated documentation and configuration comments for provider usage and setup
- Updated Packer and Terraform templates with new Proxmox defaults and variable handling

**Removed:**

- Direct dependencies on internal AWS client from CLI command implementations
- Hardcoded AWS/SSM logic in inventory, provision, and validation flows
… providers

**Added:**

- Added PrivateIP field to instanceInfo struct for capturing instance private IPs
- Implemented extractHostRole function to consistently extract Ansible inventory hostnames from VM names across AWS, Ludus, and Proxmox

**Changed:**

- Updated instance discovery to populate PrivateIP in inventory sync for non-SSM providers
- Modified inventory update logic to set ansible_host to PrivateIP when available, falling back to InstanceID
- Enhanced output messages during inventory updates for clarity
- Refactored Proxmox hostname resolution to extract role from the last hyphen-separated segment of the VM name, improving robustness and supporting additional naming patterns

**Removed:**

- Removed fragile logic matching known roles by substring in Proxmox resolveHostname, replacing with pattern-based extraction
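
The two rules described in this commit (take the role from the last hyphen-separated segment of the VM name, and prefer the private IP over the instance ID for `ansible_host`) can be sketched as follows. `roleFromVMName` and `ansibleHost` are hypothetical helpers and the VM names are made up; the real `extractHostRole` may handle more cases.

```go
package main

import (
	"fmt"
	"strings"
)

// roleFromVMName returns the last hyphen-separated segment of a VM name.
// A name without hyphens is returned unchanged.
func roleFromVMName(vmName string) string {
	if i := strings.LastIndex(vmName, "-"); i >= 0 {
		return vmName[i+1:]
	}
	return vmName
}

// ansibleHost prefers the private IP and falls back to the instance ID,
// which is what an SSM-style connection would address.
func ansibleHost(privateIP, instanceID string) string {
	if privateIP != "" {
		return privateIP
	}
	return instanceID
}

func main() {
	// Example VM names are illustrative only.
	for _, name := range []string{"goad-lab-dc01", "range1-GOAD-srv02", "dc03"} {
		fmt.Printf("%-20s -> %s\n", name, roleFromVMName(name))
	}
	fmt.Println(ansibleHost("10.2.10.11", "i-0abc"))
	fmt.Println(ansibleHost("", "i-0abc"))
}
```

Matching on position rather than on a list of known role substrings means a new role name needs no code change, as long as it stays the final segment.
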
**Added:**

- Added `inventory_test.go` with comprehensive tests for `extractHostRole` and `applyInstanceUpdates` covering AWS, Ludus, mixed providers, and no-op scenarios

**Changed:**

- Improved error handling in the Proxmox client: response bodies are now always closed, and errors from closing are propagated in the `authenticate`, `get`, `post`, `DestroyVM`, and `QEMUAgentExec` methods
- Changed some Proxmox client method signatures to use named return values, which lets a deferred close report its error to the caller
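
The close-and-propagate pattern mentioned above relies on a named `err` result that a deferred function can still assign to. A minimal sketch, using a fake body instead of a real HTTP response; `fakeBody`, `readBody`, and `readString` are illustrative names, not the Proxmox client's methods.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// fakeBody simulates a response body whose Close may fail.
type fakeBody struct {
	io.Reader
	closeErr error
	closed   bool
}

func (b *fakeBody) Close() error {
	b.closed = true
	return b.closeErr
}

// readBody reads rc fully and always closes it. Because err is a named
// result, the deferred function can report a Close failure when nothing
// else went wrong, without masking an earlier read error.
func readBody(rc io.ReadCloser) (data []byte, err error) {
	defer func() {
		if cerr := rc.Close(); cerr != nil && err == nil {
			err = fmt.Errorf("close body: %w", cerr)
		}
	}()
	return io.ReadAll(rc)
}

// readString is a demo helper: closeMsg, if non-empty, becomes the Close error.
func readString(content, closeMsg string) (string, bool, error) {
	b := &fakeBody{Reader: strings.NewReader(content)}
	if closeMsg != "" {
		b.closeErr = errors.New(closeMsg)
	}
	data, err := readBody(b)
	return string(data), b.closed, err
}

func main() {
	fmt.Println(readString("ok", ""))
	fmt.Println(readString("ok", "connection reset"))
}
```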

**Removed:**

- Removed redundant setting of `$Ansible.Changed = $false` from PowerShell logic in `vulns_adcs_esc7` Ansible role to streamline module behavior
…us client

**Changed:**

- Refactored command execution to capture stdout and stderr separately, preventing Ludus v2 informational logs from contaminating JSON output
- Updated VerifyConnection to use "version --json" for structured output and improved parsing of the Ludus CLI response
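
Capturing the two streams separately is what keeps log lines out of the JSON parser. The sketch below demonstrates the idea with `sh` standing in for the Ludus CLI (so it assumes a POSIX shell is available); the `versionInfo` shape is hypothetical and not the documented output of `version --json`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
)

// runSplit runs a command and captures stdout and stderr in separate buffers.
func runSplit(name string, args ...string) (stdout, stderr string, err error) {
	var out, errBuf bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &out
	cmd.Stderr = &errBuf
	err = cmd.Run()
	return out.String(), errBuf.String(), err
}

// versionInfo is a hypothetical payload shape used only for this example.
type versionInfo struct {
	Version string `json:"version"`
}

func parseVersion(stdout string) (string, error) {
	var v versionInfo
	if err := json.Unmarshal([]byte(stdout), &v); err != nil {
		return "", err
	}
	return v.Version, nil
}

func main() {
	// JSON goes to stdout, an informational log line goes to stderr.
	script := `echo '{"version":"2.0.0"}'; echo 'INFO contacting server' >&2`
	out, logs, err := runSplit("sh", "-c", script)
	if err != nil {
		fmt.Println("run failed:", err)
		return
	}
	v, err := parseVersion(out)
	fmt.Println("version:", v, "err:", err)
	fmt.Print("logs: ", logs)

	// Parsing the merged streams fails, which is the contamination problem.
	_, err = parseVersion(logs + out)
	fmt.Println("merged parse error:", err != nil)
}
```
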
…support

**Added:**

- Implemented detailed validation checks for ADCS vulnerabilities ESC6, ESC7, ESC10,
  ESC11, ESC13, and ESC15 in the validator (cli/internal/validate/checks.go)
- Added new validator checks for LDAP signing/channel binding, LSASS RunAsPPL,
  CertEnroll share presence, and SIDHistory on trusts
- Extended labmap with HostsWithESC7 method and ESC7Fact struct to support ESC7
  vulnerability checks (cli/internal/labmap/labmap.go)

**Changed:**

- Updated dev, staging, and test overlay JSONs to include new ADCS vulns
  (adcs_esc6, adcs_esc7, adcs_esc10_case1, adcs_esc10_case2, adcs_esc11,
  adcs_esc13, adcs_esc15) for relevant hosts
- Improved MSSQL validator to check SeImpersonatePrivilege and TRUSTWORTHY
  databases, and refactored logic into a helper
- Updated password policy validator to check and report lockout threshold
- Enhanced service checks to report on WebClient (WebDAV) service status


- Improved ADCS ESC7 Ansible role to ensure NuGet provider installation,
  switched PSPKI installation to win_powershell, and made rights assignment more
  robust (ansible/roles/vulns_adcs_esc7/tasks/main.yml, README.md)
- Updated SSM bucket names in dev and staging inventory examples for better
  alignment with actual infrastructure naming conventions

**Removed:**

- Eliminated redundant logic and unnecessary checks for Machine Account Quota and
  MSSQL validator sections
…roperties

**Changed:**

- Updated PowerShell script to reference `.Access.IdentityReference` instead of
  `.Identity` and to search rights with correct property for 'ManageCa' check
- Removed unnecessary string replacement for CA manager variable, passing it
  directly to improve accuracy
**Added:**

- Implement health check retry mechanism using configurable max retries and delay in `health_check.go`
- Provide per-check retry feedback and summarize recovered checks in health check output
- Add comprehensive regression and parsing tests for Ludus client JSON handling in `client_test.go`

**Changed:**

- Refactor health check execution to use `RunCommandWithRetry` for improved reliability and transient failure recovery
- Enhance health check result output to indicate checks that passed after retries and include retry count in summary
**Added:**

- Implement RunCommandWithRetry to handle transient failures and retry commands in cli/internal/provider/retry.go
- Add IsTransientFailure to detect connection-level failures worth retrying
- Define RetryCommandOptions to configure retry behavior
- Provide a comprehensive test suite for retry logic in cli/internal/provider/retry_test.go, covering success, transient failures, retry exhaustion, non-transient failures, zero retries, and context cancellation
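
A retry wrapper of the kind described here usually retries only errors classified as transient, stops when retries are exhausted, and honors context cancellation while waiting. The following sketch shows those three behaviors. `retryOptions`, `isTransient`, `runWithRetry`, and the substring list are assumptions for illustration, not the signatures or heuristics in `cli/internal/provider/retry.go`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// retryOptions is a hypothetical configuration struct.
type retryOptions struct {
	MaxRetries int           // retries after the first attempt
	Delay      time.Duration // fixed wait between attempts
}

// isTransient is an illustrative heuristic for connection-level failures.
func isTransient(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	for _, s := range []string{"connection refused", "connection reset", "i/o timeout"} {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}

// runWithRetry calls run until it succeeds, fails with a non-transient error,
// exhausts MaxRetries, or ctx is cancelled. It reports the attempt count.
func runWithRetry(ctx context.Context, opts retryOptions,
	run func(context.Context) (string, error)) (out string, attempts int, err error) {
	for {
		attempts++
		out, err = run(ctx)
		if err == nil || !isTransient(err) || attempts > opts.MaxRetries {
			return out, attempts, err
		}
		select {
		case <-ctx.Done():
			return out, attempts, ctx.Err()
		case <-time.After(opts.Delay):
		}
	}
}

// simulate fails `failures` times with msg, then succeeds.
func simulate(failures int, msg string, maxRetries int) (string, int, error) {
	calls := 0
	run := func(context.Context) (string, error) {
		calls++
		if calls <= failures {
			return "", errors.New(msg)
		}
		return "ok", nil
	}
	opts := retryOptions{MaxRetries: maxRetries, Delay: time.Millisecond}
	return runWithRetry(context.Background(), opts, run)
}

func main() {
	fmt.Println(simulate(2, "dial tcp: connection refused", 3)) // recovers
	fmt.Println(simulate(5, "connection reset by peer", 2))     // exhausts retries
	fmt.Println(simulate(1, "access denied", 3))                // not retried
}
```
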
**Changed:**

- Refined PowerShell command to explicitly detect missing scheduled tasks, returning a sentinel value when not found
- Updated state handling logic to differentiate between tasks not found and cases where WinRM returns empty output, issuing a WARN for unreadable states and a FAIL only when the task is confirmed missing
- Clarified comments for better understanding of PASS, WARN, and FAIL conditions in scheduled task checks
@l50 force-pushed the feat/provider-agnostic-provisioning branch from edd56db to ff7c54e on April 27, 2026 01:40
l50 added 3 commits April 26, 2026 19:53
**Changed:**

- Updated switch statement in scheduled task validation to use direct state
  matching, ensuring correct detection of not found and empty state cases in
  PowerShell output. This fixes improper handling where the default case could
  be triggered unexpectedly.
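
The PASS/WARN/FAIL split for scheduled tasks can be expressed as a switch on the trimmed state string itself. This is a sketch under assumptions: the sentinel value is made up, and treating any other reported state as PASS is a simplification that the real check may not share.

```go
package main

import (
	"fmt"
	"strings"
)

// notFoundSentinel is a hypothetical marker the PowerShell side would print
// when the task does not exist.
const notFoundSentinel = "__TASK_NOT_FOUND__"

// classifyTaskState maps raw script output to a status and a reason.
// Matching on the value directly keeps the "not found" and "empty" cases
// from falling through to the default branch.
func classifyTaskState(raw string) (status, reason string) {
	state := strings.TrimSpace(raw)
	switch state {
	case notFoundSentinel:
		return "FAIL", "task is confirmed missing"
	case "":
		return "WARN", "state unreadable (empty output from the remote shell)"
	default:
		// Assumption: any reported state means the task exists.
		return "PASS", "task present, state " + state
	}
}

func main() {
	for _, raw := range []string{"Ready\r\n", "", notFoundSentinel + "\n"} {
		status, reason := classifyTaskState(raw)
		fmt.Printf("%-4s %s\n", status, reason)
	}
}
```
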
**Added:**

- Added `terragrunt_fmt` pre-commit hook for infra directory to enforce consistent
  Terragrunt formatting

**Changed:**

- Enhanced PowerShell command execution in validator to retry up to 4 times and
  handle transient WinRM issues, increasing timeout and robustness
- Updated comments and spacing in environment configuration for clarity in
  `env.hcl`
@dreadnode-renovate-bot (bot) added the area/pre-commit label on Apr 27, 2026
l50 added 2 commits April 26, 2026 21:42
…ble inventory

**Added:**

- Support for configuring remote Ludus host execution via SSH, including new SSH-related fields (`ssh_host`, `ssh_user`, `ssh_key_path`, `ssh_password`, `ssh_port`) in Ludus config, CLI defaults, provider options, and `dreadgoad.yaml`
- SSHConfig struct and logic to manage SSH connection parameters in Ludus client, with `IsConfigured` helper method
- SSH command execution support in Ludus client, including argument quoting, sshpass integration for password auth, and building SSH command arguments
- Methods to run arbitrary commands on the remote Ludus host via SSH (`RunSSHCommand`)
- Unit tests for SSHConfig, shell quoting, SSH argument construction, and ansible output parsing in Ludus client tests
- SSH mode reporting in provider credential verification for clarity
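
Argument quoting matters here because a command passed to `ssh` is re-parsed by the remote shell. A common approach, sketched below with hypothetical helper names, wraps each argument in single quotes and rewrites embedded single quotes as `'\''`. A later commit in this PR replaces the OpenSSH subprocess path with a native Go client.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote returns s as a single POSIX shell word. Everything inside single
// quotes is literal, so only the quote character itself needs special handling:
// close the quote, emit an escaped quote, and reopen.
func shellQuote(s string) string {
	if s == "" {
		return "''"
	}
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// remoteCommand joins already-split arguments into one string that the remote
// shell will split back into the same arguments.
func remoteCommand(args ...string) string {
	quoted := make([]string, len(args))
	for i, a := range args {
		quoted[i] = shellQuote(a)
	}
	return strings.Join(quoted, " ")
}

func main() {
	fmt.Println(remoteCommand("echo", "it's a test", "$HOME; rm -rf /"))
}
```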

**Changed:**

- Refactored Ludus provider to route command execution via SSH when configured, constructing inline ansible inventories for WinRM without requiring inventory files on the remote host
- Increased Ludus status cache TTL from 2 minutes to 30 minutes for efficiency
- Improved handling of Windows registry queries in validator checks to emit a warning when keys are missing instead of failing, with more robust PowerShell scripting for registry reads
- Enhanced ACL validation logic to better match well-known accounts and perform robust identity reference lookups in PowerShell scripts
- Improved password policy check in validator to handle errors gracefully and provide clearer reporting

**Removed:**

- Removed the requirement for an Ansible inventory file on the remote host when running Ludus commands over SSH; all inventory is now passed inline
**Changed:**

- Refactored switch statements in checkAnonymousSMB and checkLLMNR to use direct value matching instead of boolean expressions, improving readability and aligning with Go best practices in checks.go
@dreadnode-renovate-bot (bot) added the area/github label on Apr 27, 2026
**Added:**

- Added a step to install Terragrunt v0.69.1 using curl and install command
  in the pre-commit GitHub Actions workflow to support Terragrunt-related
  checks and commands.
@l50 force-pushed the feat/provider-agnostic-provisioning branch from 6b2a712 to 8a3644f on April 27, 2026 05:05
**Added:**

- Introduced new playbook `load_network_mappings.yml` to load AWS instance/IP
  mappings for AD-state-only runs, ensuring hostvars are populated for downstream
  playbooks
- Implemented `lab purge-unmanaged` CLI command to delete AD users, computers,
  and groups not defined in the lab config, with dry-run and selective class/host
  support
- Added `Allowlist` loader in `cli/internal/labconfig` to aggregate expected
  users, computers, groups, and trusts from lab config JSON for use in AD purges
- Created comprehensive unit tests for allowlist logic, unmanaged-object purge,
  result parsing, and CSV splitting in new test files

**Changed:**

- Refactored `lab reset` workflow to supersede the legacy WIN-* ghost computer
  purge with the new unmanaged-object purge, covering users, computers, and
  groups using allowlist from lab config
- Updated help text, CLI flags, and internal logic to support the new AD object
  purge with options for apply, class/host filtering, and admin-creator safety
  belt
- Improved DC instance discovery to support case-insensitive hostname filtering
  and robustly handle missing/invalid inventory entries
- Enhanced PowerShell script for object purge to support allowlist-based checks,
  class filtering, and trust account recognition
- Updated documentation diagrams in `docs/architecture.mmd` and
  `docs/architecture.svg` to reflect the new playbook and playbook count
- Improved error handling and summary reporting for AD object purges

**Removed:**

- Removed legacy `purge-ghosts` command and WIN-* regex-based computer account
  purge logic, consolidating all unmanaged AD object cleanup under the new
  allowlist-driven mechanism
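
At its core, the allowlist-driven purge is a set difference: everything found in AD that the lab config does not declare is a candidate, and nothing is deleted unless the run is explicitly applied. A minimal sketch of that logic with hypothetical function names and made-up account names; the real command also filters by class and host and applies a creator-based safety check.

```go
package main

import (
	"fmt"
	"strings"
)

// unmanagedNames returns the entries of found that are not in allowed,
// comparing case-insensitively.
func unmanagedNames(found, allowed []string) []string {
	allow := make(map[string]struct{}, len(allowed))
	for _, a := range allowed {
		allow[strings.ToLower(a)] = struct{}{}
	}
	var out []string
	for _, f := range found {
		if _, ok := allow[strings.ToLower(f)]; !ok {
			out = append(out, f)
		}
	}
	return out
}

// purge lists unmanaged names and deletes them only when apply is true.
// With apply false it is a dry run and del is never called.
func purge(found, allowed []string, apply bool, del func(string) error) (targets []string, deleted int, err error) {
	targets = unmanagedNames(found, allowed)
	if !apply {
		return targets, 0, nil
	}
	for _, t := range targets {
		if err := del(t); err != nil {
			return targets, deleted, fmt.Errorf("delete %s: %w", t, err)
		}
		deleted++
	}
	return targets, deleted, nil
}

func main() {
	found := []string{"Administrator", "jon.snow", "WIN-ABC123$", "leftover.user"}
	allowed := []string{"administrator", "Jon.Snow"}

	targets, deleted, _ := purge(found, allowed, false, nil)
	fmt.Println("dry run:", targets, "deleted:", deleted)

	targets, deleted, err := purge(found, allowed, true, func(name string) error {
		fmt.Println("deleting", name)
		return nil
	})
	fmt.Println("applied:", targets, "deleted:", deleted, "err:", err)
}
```
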
@dreadnode-renovate-bot (bot) added the area/playbooks label on Apr 27, 2026
l50 added 3 commits April 27, 2026 10:26
**Changed:**

- Updated `ansible_aws_ssm_bucket_name` to use placeholder example values in `dev-inventory.example` and `staging-inventory.example` for clarity and to prevent accidental use of real bucket names
**Added:**

- Introduced SSH-mode support for the ludus provider, enabling remote operation
  by connecting to the Ludus server over SSH
- Added new configuration options for Ludus SSH (host, user, port, key, password)
- Implemented Ludus-specific doctor checks to validate SSH connectivity, required
  binaries (`ssh`, `sshpass`), and API key configuration
- Expanded documentation to cover SSH-mode configuration and usage details

**Changed:**

- Refactored doctor pre-flight checks to dispatch based on provider, supporting
  both local and SSH modes for Ludus
- Updated help text and usage instructions for the `doctor` command to clarify
  provider-specific checks and operation modes
- Revised Ludus provider documentation to reflect both local and SSH operation,
  including configuration examples, prerequisites, and behavioral notes

**Removed:**

- Removed the restriction that DreadGOAD must run on the Ludus server itself;
  remote workstation clients are now supported via SSH mode
**Added:**

- Introduced `host` field to Ludus config allowing use of ssh_config Host aliases
- Implemented `SSHTarget()` method to prefer `host` over `ssh_host` for SSH target resolution
- Added `ResolveAlias` flag to Ludus doctor checks to trigger alias resolution
- Created `sshconfig` package to resolve ssh_config aliases via `ssh -G`
- Added tests for ssh_config alias resolution in `sshconfig/resolve_test.go`

**Changed:**

- Updated Ludus SSH doctor checks to resolve Host aliases using `sshconfig.Resolve`
- Modified TCP reachability probe to use resolved hostname/port when alias is set
- Refactored SSH argument builder to honor explicit override fields only when set;
  otherwise, pass through the Host alias verbatim and rely on user's ssh_config
- Improved documentation for Ludus provider to clarify preferred use of `host` and
  explain override behavior for CI/automation contexts

**Removed:**

- Deprecated `ssh_host` as the primary SSH-mode toggle; now prefer `host` field
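
Resolving an ssh_config alias with `ssh -G` works because that flag prints the effective configuration as lowercase `key value` lines without connecting. The sketch below parses the three fields a reachability probe needs and shows the `host`-over-`ssh_host` preference. The type and function names are illustrative, not the `sshconfig` package's API.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// resolved holds the subset of ssh -G output used for a TCP probe.
type resolved struct {
	Hostname string
	Port     int
	User     string
}

// parseSSHG parses `ssh -G <alias>` output, one "key value" pair per line.
func parseSSHG(output string) (resolved, error) {
	var r resolved
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		key, val, ok := strings.Cut(strings.TrimSpace(sc.Text()), " ")
		if !ok {
			continue
		}
		switch key {
		case "hostname":
			r.Hostname = val
		case "user":
			r.User = val
		case "port":
			p, err := strconv.Atoi(val)
			if err != nil {
				return r, fmt.Errorf("bad port %q: %w", val, err)
			}
			r.Port = p
		}
	}
	if r.Hostname == "" {
		return r, errors.New("no hostname in ssh -G output")
	}
	return r, sc.Err()
}

// resolveAlias shells out to the local OpenSSH client.
func resolveAlias(alias string) (resolved, error) {
	out, err := exec.Command("ssh", "-G", alias).Output()
	if err != nil {
		return resolved{}, fmt.Errorf("ssh -G %s: %w", alias, err)
	}
	return parseSSHG(string(out))
}

// sshTarget prefers the ssh_config alias (host) over the explicit ssh_host.
func sshTarget(host, sshHost string) string {
	if host != "" {
		return host
	}
	return sshHost
}

func main() {
	sample := "user root\nhostname 10.0.0.5\nport 2222\nidentityfile ~/.ssh/id_ed25519\n"
	r, err := parseSSHG(sample)
	fmt.Printf("%+v %v\n", r, err)
	fmt.Println(sshTarget("ludus", "10.0.0.5"), sshTarget("", "10.0.0.5"))
}
```
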
l50 added 2 commits April 27, 2026 11:45
…docs

**Added:**

- Introduced an interactive setup wizard (`dreadgoad init`) for generating a
  ready-to-use `dreadgoad.yaml` config, supporting Ludus and AWS providers
- Wizard includes provider selection, SSH connectivity probing for Ludus,
  prompts for API key, and writes configuration with guidance for next steps

**Changed:**

- Updated Ludus provider documentation to recommend the new interactive wizard
  (`dreadgoad init`) as the fastest setup path
- Clarified documentation to reflect the wizard's workflow, including probing
  for Ludus host, handling API key, and providing example config and commands
- Added explanatory notes on using environment variables for API keys instead
  of writing them to the config file
**Added:**

- Introduced `up` command for end-to-end lab deployment, orchestrating doctor,
  infra apply, provision, and health-check steps with flags for granular control
  (`cli/cmd/up.go`)
- Expanded CLI reference with detailed `init` and `up` sections, including usage
  examples and flag descriptions (`docs/mkdocs/docs/cli-reference.md`)
- Added quickstart documentation highlighting the new orchestrator workflow and
  resume mechanism (`docs/mkdocs/docs/usage/index.md`)
- Documented orchestrator workflow for Ludus, with usage and resume examples
  (`docs/mkdocs/docs/providers/ludus.md`)
@l50 force-pushed the feat/provider-agnostic-provisioning branch from d965158 to 1ac9578 on April 27, 2026 17:46
l50 added 2 commits April 27, 2026 11:46
…tion

**Changed:**

- Updated `ludus.ssh_user` and `ludus.ssh_port` defaults to empty string and 0
  respectively, allowing alias-mode guard logic to detect when users have not
  customized these values; fallback values ("root" and 22) are now provided at
  runtime by SSH logic rather than configuration defaults
- Added explanatory comments clarifying the reasoning behind leaving these
  defaults unset for improved maintainability and clarity
…docs

**Changed:**

- Replaced PowerShell-based trust creation with netdom CLI commands in trust tasks
  to avoid issues with .NET API under WinRM and simplify credential handling
- Separated trust existence check into its own shell task using netdom /verify
- Updated conditional logic to add trusts only if not present, based on trust
  check result
- Modified playbook variable usage: removed unused `domain_username`, updated
  trust task logic to align with new netdom approach
- Updated documentation in trust role README to reflect use of netdom via shell
  instead of PowerShell, and clarified step descriptions

**Removed:**

- Removed PowerShell script and related become/runas logic for trust creation,
  eliminating need for domain_username and .NET context handling
@l50 force-pushed the feat/provider-agnostic-provisioning branch from 4588c08 to 3591dbd on April 27, 2026 19:04
l50 added 10 commits April 27, 2026 13:47
**Added:**

- SOCKS5 tunnel support for Ludus SSH mode, enabling WinRM provisioning via a dynamic SSH proxy (`cli/internal/ludus/socks.go`)
- Automatic Ludus CLI download and install if not found in PATH, with checksum verification and platform detection (`cli/internal/ludus/install.go`)
- CLI installation tests (`cli/internal/ludus/install_test.go`)
- Playbook task for building/distributing `dc_hostname_to_ip` mapping (`ansible/roles/network_discovery/tasks/dc_hostname_mapping.yml`)
- Conditional inventory and Ansible collection checks for Ludus in SSH mode, including pypsrp/requests[socks] detection in pre-flight (`cli/internal/doctor/checks.go`)
- Documentation note explaining Ludus CLI auto-install in provider docs (`docs/mkdocs/docs/providers/ludus.md`)
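
Checksum verification for a downloaded binary comes down to hashing the bytes and comparing against a published digest before installing anything. A minimal sketch, assuming SHA-256 and hex-encoded digests; the function names are hypothetical and the actual download source and asset naming are not shown.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"runtime"
	"strings"
)

// sha256Hex returns the lowercase hex SHA-256 digest of data.
func sha256Hex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// verifySHA256 compares the digest of data with the expected hex string,
// tolerating case differences and surrounding whitespace in wantHex.
func verifySHA256(data []byte, wantHex string) error {
	got := sha256Hex(data)
	if !strings.EqualFold(got, strings.TrimSpace(wantHex)) {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, strings.TrimSpace(wantHex))
	}
	return nil
}

// platform reports the OS/architecture pair used to pick a release asset.
func platform() string {
	return runtime.GOOS + "/" + runtime.GOARCH
}

func main() {
	payload := []byte("pretend this is a downloaded binary")
	want := sha256Hex(payload) // in practice this comes from a published checksum file

	fmt.Println("platform:", platform())
	fmt.Println("verify ok:", verifySHA256(payload, want) == nil)
	fmt.Println("tampered:", verifySHA256(append(payload, 'x'), want))
}
```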

**Changed:**

- Ludus provider now supports SSH mode for remote Ludus servers; provision command starts and manages a SOCKS5 tunnel, auto-injecting Ansible connection vars for psrp/WinRM over the tunnel (`cli/cmd/provision.go`)
- Playbooks refactored to use new network_discovery role tasks for building mappings, improving reuse and simplifying logic (`ansible/playbooks/load_network_mappings.yml`, `ansible/playbooks/network_setup.yml`)
- Refactored retry logic to preserve connection-level vars (like SOCKS proxy) when retrying Ansible playbooks, preventing retry strategies from overwriting existing vars (`cli/internal/ansible/retry.go`)
- Improved Ludus CLI version detection; now defaults to v2 if version check fails, with a warning (`cli/internal/ludus/client.go`)
- Ludus provider hostname matching is now exact on role suffix, reducing ambiguity and improving robustness (`cli/internal/ludus/provider.go`)
- Pre-flight checks now return failure count, display warnings for missing config, and improve Ludus/Ansible collection logic (`cli/internal/doctor/checks.go`, `cli/cmd/doctor.go`, `cli/cmd/up.go`)
- Ludus API key and SSH config detection improved for SSH mode (`cli/internal/ludus/provider.go`)
- Role `trusts` now uses explicit DirectoryContext with credentials for forest trust creation, improving reliability and avoiding netdom password escaping issues (`ansible/roles/trusts/tasks/main.yml`)
- Role `vulns_adcs_esc7` avoids restarting CA when ManageCa rights already granted (`ansible/roles/vulns_adcs_esc7/tasks/main.yml`)
- Ludus provider VM destroy now invalidates internal cache to avoid stale instance info (`cli/internal/ludus/provider.go`)
- Ludus deployment progress now shows elapsed time and live powered-on VM count (`cli/internal/ludus/client.go`)
- Ludus provider interface extended for SSH awareness (`cli/internal/ludus/provider.go`)
- Minor: improved config missing detection, warning if no config found (`cli/internal/config/config.go`)

**Removed:**

- Inlined dc_hostname_to_ip mapping logic from playbooks, replaced with reusable role include (`ansible/playbooks/load_network_mappings.yml`, `ansible/playbooks/network_setup.yml`)
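
Preserving connection-level variables across retries can be pictured as a merge in which values already set by the caller win over whatever a retry strategy proposes. The sketch below shows one way to do that; `mergeVars` is a hypothetical helper and the variable names and values are examples only.

```go
package main

import "fmt"

// mergeVars returns strategy vars overlaid with existing vars, so a retry
// strategy can add new settings but never overwrite one that is already set
// (for example a SOCKS proxy injected for the tunnel).
func mergeVars(existing, strategy map[string]string) map[string]string {
	out := make(map[string]string, len(existing)+len(strategy))
	for k, v := range strategy {
		out[k] = v
	}
	for k, v := range existing {
		out[k] = v
	}
	return out
}

func main() {
	existing := map[string]string{
		"ansible_connection": "psrp",
		"ansible_psrp_proxy": "socks5h://127.0.0.1:1080", // example tunnel address
	}
	strategy := map[string]string{
		"ansible_psrp_proxy": "", // a naive strategy would wipe the proxy
		"ansible_timeout":    "120",
	}
	fmt.Println(mergeVars(existing, strategy))
}
```
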
**Changed:**

- Add handling for EPIPE errors when writing to stdout in Validator, exiting early if detected and logging other write errors to stderr for better robustness
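
EPIPE shows up when output is piped into a reader that exits early (for example `| head`). The sketch below shows the behavior described: stop quietly on a broken pipe, log any other write error to stderr. `writeLine` and the fake writers are illustrative and not the validator's code.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"syscall"
)

// writeLine writes one line to w and reports whether the caller should stop.
// A broken pipe means the reader is gone, so further output is pointless;
// any other error is logged and the caller carries on.
func writeLine(w io.Writer, line string) (stop bool) {
	_, err := fmt.Fprintln(w, line)
	switch {
	case err == nil:
		return false
	case errors.Is(err, syscall.EPIPE):
		return true
	default:
		fmt.Fprintln(os.Stderr, "write error:", err)
		return false
	}
}

// errWriter always fails with a fixed error; used to simulate bad stdout.
type errWriter struct{ err error }

func (w errWriter) Write([]byte) (int, error) { return 0, w.err }

// brokenPipeWriter mimics the wrapped error an *os.File returns on EPIPE.
func brokenPipeWriter() io.Writer {
	return errWriter{&os.PathError{Op: "write", Path: "/dev/stdout", Err: syscall.EPIPE}}
}

func failingWriter() io.Writer { return errWriter{errors.New("disk full")} }

func main() {
	for _, w := range []io.Writer{os.Stdout, failingWriter(), brokenPipeWriter()} {
		if writeLine(w, "check passed") {
			fmt.Fprintln(os.Stderr, "reader went away, stopping")
			return
		}
	}
}
```
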
…idator

**Added:**

- Added documentation for the `dc_hostname_mapping.yml` task in network_discovery README
- Introduced detection of users with passwords equal to their name, first name, or surname
- Added comprehensive credential discovery checks (autologon, credential manager, SYSVOL, shares, permissions, administrator folder)
- Implemented checks for network protocol vulnerabilities (SMBv1, CredSSP server/client, WebDAV-Redirector)
- Added group and user existence validation against configured lab domain/group settings
- Added checks for local Administrators group membership consistency per host
- Implemented checks for ADCS template flags (ESC1, ESC2, ESC3, ESC4, ESC9, ESC13) and issuance policy linkage
- Added validation for DNS conditional forwarders, directory service audit, LDAP diagnostic logging, Defender ASR rules
- Implemented checks for IIS upload permissions, CVE patch status (ZeroLogon, PrintNightmare, noPac, Certifried), and default admin shares

**Changed:**

- Updated trusts README to clarify and rename PowerShell trust configuration step
- Refactored validator's RunAllChecks to include new validation functions in a logical, grouped order
- Expanded labmap HostConfig to include LocalGroups for local admin checks
- Improved sorting, deduplication, and reporting logic for users and groups validation
- Minor whitespace and formatting adjustments for consistency across files

**Removed:**

- Removed redundant lines and trailing whitespace in provider.go and config.go for clarity
…roxy

**Added:**

- Implemented native Go SSH client in `native_ssh.go` with support for ssh_config parsing, agent/identity authentication, ProxyJump, and SOCKS5 proxying
- Added unit and integration tests for native SSH handling in `native_ssh_test.go` and `native_ssh_integration_test.go`

**Changed:**

- Replaced all SSH and SOCKS5 subprocess calls with pure-Go implementations, including connection reuse and concurrent session limiting (`client.go`, `socks.go`)
- Updated `provider.go` to patch missing VM IPs using the authoritative `/opt/ludus/ranges/<rangeID>/etc-hosts` via the new SSH client when needed
- Improved error handling and logging for SSH and ansible command execution in `provider.go`
- Increased parallelism for validator checks to 16 in `validator.go`
- Updated configuration comments and docstrings to reflect native SSH auth changes in `config.go`, `defaults.go`, and `factory.go`
- Updated dependencies: added `github.com/armon/go-socks5`, `golang.org/x/crypto`, and `golang.org/x/net` in `go.mod` and `go.sum`

**Removed:**

- Removed legacy OpenSSH subprocess argument builders and tests in `client.go` and `client_test.go`
- Removed all use of `sshpass` and related password handling via external processes
- Eliminated subprocess-based SOCKS5 proxy management in `socks.go`
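
Concurrent session limiting over a reused connection is often done with a buffered channel used as a counting semaphore, since an SSH server typically caps how many sessions one connection may open. This sketch uses only the standard library and simulates sessions with a short sleep; `limiter` and `peakConcurrency` are hypothetical names, and the real client may limit sessions differently.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// limiter bounds how many functions run at once.
type limiter struct{ slots chan struct{} }

func newLimiter(n int) *limiter { return &limiter{slots: make(chan struct{}, n)} }

// do waits for a free slot (or ctx cancellation), runs fn, then frees the slot.
func (l *limiter) do(ctx context.Context, fn func() error) error {
	select {
	case l.slots <- struct{}{}:
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-l.slots }()
	return fn()
}

// peakConcurrency pushes jobs through a limiter of the given size and returns
// the highest number of jobs observed running at the same time.
func peakConcurrency(jobs, limit int) int {
	l := newLimiter(limit)
	var wg sync.WaitGroup
	var cur, peak atomic.Int64
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = l.do(context.Background(), func() error {
				n := cur.Add(1)
				for {
					p := peak.Load()
					if n <= p || peak.CompareAndSwap(p, n) {
						break
					}
				}
				time.Sleep(5 * time.Millisecond) // stands in for a remote command
				cur.Add(-1)
				return nil
			})
		}()
	}
	wg.Wait()
	return int(peak.Load())
}

func main() {
	fmt.Println("peak with limit 4:", peakConcurrency(20, 4))
}
```
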
**Changed:**

- Replaced custom PowerShell scripts and async handling with the native
  `microsoft.ad.group` Ansible module for universal, global, and domainlocal
  group creation in `groups.yml`. This simplifies task logic, improves
  readability, and leverages built-in idempotency and error handling.
- Updated documentation in `README.md` to reflect use of `microsoft.ad.group`
  instead of custom scripts and removed references to async wait tasks.

**Removed:**

- Eliminated async status polling and retries for group creation tasks, as the
  `microsoft.ad.group` module handles these concerns internally.
- Removed custom PowerShell scripts for group creation, reducing complexity and
  reliance on hand-written error handling.
**Changed:**

- Consolidated SSH client logic by renaming nativeClient to sshClient and native_ssh.go to ssh.go, updating all usages for clarity and consistency
- Removed nativeSOCKS abstraction in favor of simpler SOCKSTunnel with direct listener management, moving SOCKS5 listener logic into socks.go and ssh.go
- Updated client.go to use new sshClient abstraction and dialSSH function, replacing references to nativeClient and dialNative
- Refactored integration and unit tests to use new sshClient and SOCKSTunnel interfaces, updating all references and helper function names for clarity
- Added ansible_win_async_startup_timeout: 30 to relevant Ansible tasks in ou.yml and users.yml to improve async reliability
- Updated go.mod to move golang.org/x/net from indirect to direct dependency

**Removed:**

- Removed nativeSOCKS type and related methods from ssh.go, consolidating SOCKS5 server functionality
fix: improve reliability of ManageCa rights check in ADCS ESC7 role
**Changed:**

- Updated the logic for detecting existing ManageCa rights to use the Rights
  enum's string representation, avoiding reliance on the
  [PKI.CertificateServices.CertificationAuthorityRights] type literal. This
  addresses issues where Import-Module PSPKI does not consistently expose the
  type in runas sessions, making the grant check more robust in
  ansible/roles/vulns_adcs_esc7/tasks/main.yml.
**Added:**

- Introduced dead host tracking with mutex to avoid repeated failed attempts per host

**Changed:**

- Replaced retry loop in runPS with a single attempt and dead host caching
- Reduced RunCommand timeout from 180 seconds to 15 seconds for faster failure detection

**Removed:**

- Removed retry logic and exponential backoff from runPS to improve efficiency
**Added:**

- Introduced failureDetail helper to provide detailed error messages when ansible output is empty, including fallback to the underlying run error - cli/internal/ludus/provider.go

**Changed:**

- Updated error reporting in runCommandLocal and runCommandSSH to use failureDetail, ensuring clear error reasons even when output is empty - cli/internal/ludus/provider.go
- Replaced map with sync.Map for dead host tracking to ensure concurrency safety and single warning per host in Validator - cli/internal/validate/validator.go
- Refactored dead host marking logic to use sync.Map.LoadOrStore, reducing lock contention and improving logging clarity - cli/internal/validate/validator.go

**Removed:**

- Removed deadMu and dead map fields, along with associated locking logic, in favor of sync.Map for dead host management - cli/internal/validate/validator.go
… validation

**Added:**

- Introduced per-host failure counters using `sync.Map` and `atomic.Int64` to
  track consecutive runPS failures and only mark hosts dead after a configurable
  threshold
- Added constants for runPS timeout (90s), retry attempts (3), and dead host
  threshold (3 consecutive failures) to allow robust handling of transient SSM/
  WinRM errors

**Changed:**

- Refactored runPS to retry up to 3 times per call on transient errors, with
  exponential backoff, resetting failure counters on success
- Updated logic so that only hosts exceeding the deadThreshold are marked dead,
  reducing false positives from transient issues
- Enhanced warning logs to include failure counts and error details for better
  observability and debugging
@l50 force-pushed the feat/provider-agnostic-provisioning branch from 3b574d2 to 45685af on April 28, 2026 18:10
l50 added 5 commits April 28, 2026 12:18
… provider

**Added:**

- Implement direct-from-Go WinRM client (`winrm.go`) to run PowerShell over SSH-tunneled TCP connections, replacing the previous ansible-based execution path for Windows VMs
- Introduce lazy initialization and reuse of the WinRM runner in `LudusProvider` for more efficient command execution
- Add credential and per-host client caching logic to optimize repeated WinRM connections

**Changed:**

- Replace ansible-based SSH command execution with direct WinRM calls in `runCommandSSH`, reducing overhead and improving reliability under concurrent validator fan-out
- Update documentation and comments to reflect the new WinRM-based workflow and rationale for the architectural change
**Added:**

- Added github.com/masterzen/winrm as a direct dependency in go.mod
- Added checksums for github.com/gorilla/securecookie and github.com/gorilla/sessions in go.sum

**Changed:**

- Promoted github.com/masterzen/winrm from indirect to direct dependency in go.mod

**Removed:**

- Removed github.com/masterzen/winrm as an indirect dependency in go.mod
…alidation

**Added:**

- Implemented `script_runner.go` providing helpers to render, execute, and parse
  templated PowerShell scripts with safe interpolation and JSON payload envelopes
- Added `runScriptText` and `runScriptJSON` utilities for reliable remote script
  execution with typed output and error handling
- Created `script_runner_test.go` covering script template rendering and JSON
  extraction edge cases
- Embedded a suite of PowerShell scripts under `cli/internal/validate/scripts/`
  for various validation checks (e.g. registry, features, ACLs, scheduled tasks)
- Defined typed result structs for each new script to enable structured output
- Integrated new scripts for:
  - ADCS features and template queries
  - ESC6/ESC7/ESC10/ESC11/ESC15/ESC4 ADCS checks
  - Password policy, registry DWORD, certutil flag, and SMBv1/ASR/WebDAV checks
  - Scheduled task, admin profile ACL, share ACL, and IIS upload ACL probes
  - CVE patch presence, sysvol/share plaintext file count, and more

**Changed:**

- Refactored validation checks in `checks.go` to use `runScriptJSON` and
  `runScriptText` for host-side queries instead of inline PowerShell/`runPS`
- Enhanced all validation logic for ADCS, SMB, MSSQL, scheduled tasks,
  password policies, registry checks, LAPS/gMSA, network features, and file
  system ACLs to leverage structured script output and improved error reporting
- Updated all relevant checks to pass template variables safely using the new
  helpers, ensuring proper escaping and resilience to input edge cases
- Improved result reporting granularity and accuracy by handling script errors,
  partial data, and missing keys explicitly
- Ensured all host/role-specific probes now use strongly typed Go structs for
  output parsing and reporting, improving maintainability and correctness

**Removed:**

- Eliminated all ad-hoc and string-matching PowerShell output scraping from
  validation checks in favor of robust JSON parsing
- Removed fragile inline PowerShell one-liners in favor of reusable, tested,
  parameterized scripts
- Deprecated legacy result parsing in favor of structured per-check logic
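
A rough sketch of the render, execute, and parse flow described above, assuming `text/template` for rendering and marker-delimited JSON in the script's stdout. The marker strings, the `psq` template function, and the names `renderScript`, `extractJSON`, and `registryResult` are illustrative assumptions, not the identifiers used in `script_runner.go`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
	"text/template"
)

// Hypothetical markers that bracket the JSON payload in the script's stdout.
const (
	beginMarker = "<<<JSON_BEGIN>>>"
	endMarker   = "<<<JSON_END>>>"
)

// psQuote renders s as a PowerShell single-quoted literal,
// doubling any embedded single quotes.
func psQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// renderScript fills a script template; a missing variable is an error.
func renderScript(src string, vars map[string]string) (string, error) {
	tmpl, err := template.New("script").
		Funcs(template.FuncMap{"psq": psQuote}).
		Option("missingkey=error").
		Parse(src)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, vars); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// extractJSON decodes the payload found between the two markers,
// ignoring any banner or warning text around it.
func extractJSON(stdout string, out any) error {
	start := strings.Index(stdout, beginMarker)
	if start < 0 {
		return fmt.Errorf("begin marker not found")
	}
	rest := stdout[start+len(beginMarker):]
	end := strings.Index(rest, endMarker)
	if end < 0 {
		return fmt.Errorf("end marker not found")
	}
	return json.Unmarshal([]byte(strings.TrimSpace(rest[:end])), out)
}

// registryResult is an example of a typed per-script result struct.
type registryResult struct {
	Exists bool `json:"exists"`
	Value  int  `json:"value"`
}

func main() {
	script, err := renderScript(
		"Get-ItemProperty -Path {{psq .Path}} -Name {{psq .Name}}",
		map[string]string{"Path": `HKLM:\SOFTWARE\Demo`, "Name": "O'Brien"},
	)
	if err != nil {
		fmt.Println("render:", err)
		return
	}
	fmt.Println(script)

	// Simulated remote output with noise before the envelope.
	stdout := "WARNING: profile noise\n" + beginMarker + `{"exists":true,"value":1}` + endMarker + "\n"
	var res registryResult
	if err := extractJSON(stdout, &res); err != nil {
		fmt.Println("parse:", err)
		return
	}
	fmt.Printf("%+v\n", res)
}
```

Bracketing the payload lets the parser skip unrelated stdout noise, and quoting every interpolated value in one helper keeps escaping out of the individual checks.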
…lidation

**Added:**

- Add verification step for NTDS diagnostic registry values after configuration to ensure correct LDAP diagnostic logging levels - ldap_diagnostic_logging role
- Populate LabMap.AdminUser from inventory admin_user variable to enable domain-principal checks that match Ansible role behavior - cli/cmd/infra.go, cli/internal/labmap/labmap.go

**Changed:**

- Refactor lab loading logic into a dedicated loadLab function that sets AdminUser based on inventory, defaulting to "administrator" if unset - cli/cmd/infra.go
- Expand LabMap.HostsWithESC7 to include CA-serving host, domain NetBIOS, password, and admin_user for accurate vulnerability context - cli/internal/labmap/labmap.go
- Update ADCS ESC7 validator to run ManageCA check on the actual CA host and elevate to domain admin using inventory's admin_user for consistent context - cli/internal/validate/checks.go, cli/internal/validate/scripts/adcs_esc7.ps1
- Clarify comments and logic for MSSQL service validation, ensuring robust detection by probing each possible service name and always returning exit 0 - cli/internal/validate/checks.go
- Remove unnecessary conditional on MSSQL service startup in main.yml and README for idempotency - ansible/roles/mssql

**Removed:**

- Remove redundant conditional from MSSQL service startup task to ensure the service is always started if not already running - ansible/roles/mssql/tasks/main.yml, ansible/roles/mssql/README.md

**Added:**

- Added `runScriptTextErr` function to propagate PowerShell transport errors and return partial output for better diagnostics in script execution - `script_runner.go`
- Introduced `runPSErr` function to provide diagnostic errors from PowerShell execution, distinguishing between different host and transport failures - `validator.go`

**Changed:**

- Updated DNS conditional forwarder check to use `runScriptTextErr`, improving error reporting by surfacing transport errors and unexpected output - `checks.go`
- Refactored error handling in DNS probe to provide more actionable WARN results, including details about unexpected probe output and transport failures - `checks.go`
- Modified `runPS` to delegate to `runPSErr` and always return a string, ensuring backward compatibility while centralizing error logic - `validator.go`
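
The delegation between `runPS` and `runPSErr` could be sketched like this. The `validator` fields, the sentinel errors, and the `classify` and `demoValidator` helpers are assumptions made for illustration. Whether the real `runPS` returns partial output or an empty string on failure is not stated in the commit message; this sketch returns the partial output.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors separating host-level and transport-level failures.
var (
	errHostDead  = errors.New("host marked dead")
	errTransport = errors.New("transport failure")
	errReset     = errors.New("connection reset")
)

// executor stands in for whatever transport runs PowerShell on a host.
type executor func(host, script string) (string, error)

// validator is a reduced, hypothetical version of the real type.
type validator struct {
	exec executor
	dead map[string]bool
}

// runPSErr returns the output together with a diagnostic error.
func (v *validator) runPSErr(host, script string) (string, error) {
	if v.dead[host] {
		return "", fmt.Errorf("%s: %w", host, errHostDead)
	}
	out, err := v.exec(host, script)
	if err != nil {
		// Keep partial output so callers can include it in diagnostics.
		return out, fmt.Errorf("%s: %w: %v", host, errTransport, err)
	}
	return out, nil
}

// runPS keeps the older string-only contract by discarding the error.
func (v *validator) runPS(host, script string) string {
	out, _ := v.runPSErr(host, script)
	return out
}

// classify maps an error to a short label a check could put in a WARN result.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, errHostDead):
		return "host-dead"
	case errors.Is(err, errTransport):
		return "transport"
	default:
		return "unknown"
	}
}

// demoValidator builds a validator with one healthy, one flaky, and one dead host.
func demoValidator() *validator {
	return &validator{
		exec: func(host, script string) (string, error) {
			if host == "flaky" {
				return "partial", errReset
			}
			return host, nil
		},
		dead: map[string]bool{"gone": true},
	}
}

func main() {
	v := demoValidator()
	for _, host := range []string{"dc01", "flaky", "gone"} {
		out, err := v.runPSErr(host, "hostname")
		fmt.Printf("%s: out=%q class=%s\n", host, out, classify(err))
	}
	fmt.Printf("legacy: %q\n", v.runPS("flaky", "hostname"))
}
```

Keeping all error logic in the error-returning variant lets existing callers stay unchanged while new checks opt in to richer diagnostics.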
@l50 l50 changed the title feat: add provider-agnostic inventory and session management logic feat: add Ludus and Proxmox provider support with unified infra/provision orchestration Apr 28, 2026
l50 added 2 commits April 28, 2026 18:22
…n cli and ludus

**Changed:**

- Refactored user prompt functions to use standard output directly, removing
  redundant writer parameters and simplifying prompt logic in `cli/cmd/init.go`
- Improved error handling for closing HTTP responses, temporary files, and file
  descriptors throughout `cli/internal/ludus/install.go`
- Split Ludus CLI installation logic into smaller functions for better clarity
  and maintainability, with explicit checksum verification and cleanup
- Enhanced robustness of temp file and session cleanup with error handling
  during file close and removal, including new helper functions for cleanup
- Updated file copy, SHA256, and download logic to surface cleanup errors only
  if no primary error occurred, increasing reliability
- Improved SSH session cleanup in `cli/internal/ludus/ssh.go` to only override
  errors with close errors if the main command succeeded; now tolerates EOF on
  close
- Added comments and clarified best-effort logic for legacy Ludus releases with
  missing checksums
- Minor formatting and variable name improvements for clarity and Go idioms
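
The rule that cleanup errors surface only when no primary error occurred is commonly written with a named return value and a deferred helper. The sketch below shows that pattern under assumed names (`closeKeepingPrimary`, `fileSHA256`, `failingCloser`); it does not reproduce the helpers in `cli/internal/ludus/install.go`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"os"
)

// Hypothetical errors used only by the demo below.
var (
	errPrimary     = errors.New("primary failure")
	errCloseFailed = errors.New("close failed")
)

// closeKeepingPrimary closes c and records the close error in *err
// only when *err is still nil, so a primary error is never masked.
func closeKeepingPrimary(c io.Closer, err *error) {
	if cerr := c.Close(); cerr != nil && *err == nil {
		*err = fmt.Errorf("close: %w", cerr)
	}
}

// fileSHA256 returns the hex-encoded SHA-256 digest of the file at path.
func fileSHA256(path string) (sum string, err error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	// The deferred call sees the final value of the named return err.
	defer closeKeepingPrimary(f, &err)

	h := sha256.New()
	if _, err = io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// failingCloser always fails on Close, to exercise the helper.
type failingCloser struct{}

func (failingCloser) Close() error { return errCloseFailed }

func main() {
	tmp, err := os.CreateTemp("", "demo-*.bin")
	if err != nil {
		fmt.Println("create:", err)
		return
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.WriteString("hello"); err != nil {
		fmt.Println("write:", err)
		return
	}
	if err := tmp.Close(); err != nil {
		fmt.Println("close:", err)
		return
	}

	sum, err := fileSHA256(tmp.Name())
	fmt.Println(sum, err)

	// With a primary error already set, the close error is dropped.
	e := errPrimary
	closeKeepingPrimary(failingCloser{}, &e)
	fmt.Println(e)
}
```

The same helper shape works for temp files, HTTP response bodies, and sessions, since each only needs to satisfy `io.Closer`.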
…c-provisioning

# Conflicts:
#	cli/go.mod
#	cli/go.sum
@l50 l50 merged commit 633607c into main Apr 29, 2026
8 checks passed
@l50 l50 deleted the feat/provider-agnostic-provisioning branch April 29, 2026 01:37

Labels

- area/ad-labs: Changes made to AD lab definitions
- area/docs: Changes made to documentation
- area/github: Changes made to github actions
- area/packer: Changes made to Packer configurations
- area/playbooks: Changes made to playbooks directory
- area/pre-commit: Changes made to pre-commit hooks
- area/roles: Changes made to Ansible roles
- lab/GOAD: Changes made to GOAD lab
