chore: update infrastructure automation and agent deployment for reliability and GPU support #275

Merged
l50 merged 3 commits into main from chore/golden-image-and-task-infra on May 10, 2026

l50 commented May 10, 2026

Key Changes:

  • Improved EC2 deployment and verification with SHA256 checks for binary integrity
  • Added robust systemd resource isolation and memory limits for orchestrator and workers
  • Enabled optional NVIDIA GPU driver and CUDA toolkit installation for cracking tools on AMIs
  • Enhanced Ansible roles for better error logging and dependency handling, and removed legacy/incorrect Python packages

Added:

  • Systemd system-ares.slice with memory caps and OOM settings to isolate orchestrator and workers (setup.sh)
  • Automated creation and activation of a 4GB swap file, plus OOM and swappiness sysctl tuning for better stability under memory pressure (setup.sh)
  • Conditional installation of NVIDIA kernel-mode drivers and CUDA toolkit in the cracking_tools role, with full DKMS build logs and OpenCL/CUDA verification steps
  • SHA256 build and deploy verification for Rust binaries on EC2 and S3 stages in .taskfiles/ec2/Taskfile.yaml
  • Support for direct target IP specification in red team multi-agent ops (IPS variable) in .taskfiles/red/Taskfile.yaml
  • Pass-through and environment variable injection for Loki endpoint (LOKI_URL) in EC2 and orchestrator launch scripts
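
The SHA256 build and deploy verification listed above can be illustrated with a minimal sketch. This is not the Taskfile's actual logic (which is not shown in this description); the function names `sha256_of` and `verify_deploy` are hypothetical, and the sketch assumes both copies of the binary are readable as local paths.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_deploy(built: Path, deployed: Path) -> str:
    """Raise if the deployed copy differs from the build; return the digest otherwise.

    Hypothetical helper: the real pipeline compares sums across EC2 and S3 stages.
    """
    expected = sha256_of(built)
    actual = sha256_of(deployed)
    if expected != actual:
        raise RuntimeError(
            f"SHA256 mismatch for {deployed.name}: expected {expected}, got {actual}"
        )
    return expected
```

Failing loudly on a mismatch is the point of the check: a stale or truncated binary then stops the deploy instead of running silently.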

Changed:

  • EC2 deployment and polling logic to allow up to 30 minutes for SSM commands and builds (increased from 10 minutes)
  • All S3 deployment and EC2-side install steps now check and verify SHA256 sums before/after deployment to prevent silent failures or mismatches
  • Orchestrator launch script now uses transient systemd units in a dedicated slice, ensuring resource limits and separation from SSM agent cgroups
  • Python package installation in Ansible base role: switched to shell+tee for pip to capture full logs and added error tail output on failure; removed incorrect inclusion of asyncio package (now only using stdlib asyncio)
  • Ansible cracking_tools role: added steps to use NVIDIA's Debian repo for driver installation (Kali archive drivers are too old for current kernels); improved error handling and debugging for DKMS/NVIDIA install failures
  • Ansible lateral_movement_tools role: Ruby gem updates are now looped per package, run only on Ubuntu (Kali gets CVE fixes via apt), and failures are non-fatal to avoid AMI builder SIGKILL issues
  • Default Loki URL is now configurable and passed through all relevant scripts and configs
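
The 30-minute allowance for SSM commands and builds amounts to polling against a deadline. A minimal sketch of that pattern follows; the `poll_until` helper, the 10-second interval, and the status strings are assumptions for illustration, not taken from the Taskfile.

```python
import time
from typing import Callable, Optional


def poll_until(
    check: Callable[[], Optional[str]],
    timeout_s: float = 30 * 60,   # 30-minute ceiling, matching the new limit
    interval_s: float = 10.0,     # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call check() until it returns a terminal status or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = check()
        if status is not None:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"no terminal status after {timeout_s:.0f}s")
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without waiting in real time.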

Removed:

  • Legacy .claude/agents/python-ares-expert.md agent definition file
  • Explicit BIN_DIR and RUST_TARGET top-level variables in EC2 Taskfile (now set per-task as needed)
  • Redundant legacy steps that disabled and cleaned up old worker systemd units (now handled in setup.sh)
  • Unnecessary Python asyncio package from Ansible dependencies (was breaking imports on Python 3.13+)
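
One way to confirm that `asyncio` resolves to the standard library rather than a leftover PyPI copy is to inspect where the import system finds it. The sketch below is an assumption about how such a check could look; `resolves_to_stdlib` and `origin_is_under` are hypothetical helpers and are not part of the Ansible role.

```python
import importlib.util
import os
import sysconfig

# Directory names that indicate a pip- or distro-installed package.
THIRD_PARTY_DIRS = {"site-packages", "dist-packages"}


def origin_is_under(origin: str, root: str) -> bool:
    """True if the file path `origin` lives inside the directory `root`."""
    origin_real = os.path.realpath(origin)
    root_real = os.path.realpath(root)
    return os.path.commonpath([origin_real, root_real]) == root_real


def resolves_to_stdlib(module_name: str) -> bool:
    """True if `module_name` would be imported from the interpreter's stdlib directory."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or not spec.origin or not os.path.isabs(spec.origin):
        return False
    if THIRD_PARTY_DIRS & set(spec.origin.split(os.sep)):
        return False
    return origin_is_under(spec.origin, sysconfig.get_paths()["stdlib"])
```

Calling `resolves_to_stdlib("asyncio")` on a freshly built image would be expected to return `True` once the PyPI package is gone.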

l50 added 3 commits May 10, 2026 09:12
…ilder reliability

**Added:**

- Install NVIDIA GPU kernel driver and CUDA toolkit in cracking_tools role for EC2 AMI/host builds; enables hashcat GPU acceleration and CUDA backend support
- Systemd slice (system-ares.slice) with global memory cap and swap OOM cushion for orchestrator/worker isolation on EC2 setup script
- Loki URL support for log aggregation across Taskfiles, launch scripts, and environment propagation
- SHA256 checks and verification for binary deploys to EC2 and S3 to prevent silent deploy errors or mismatches
- Option to pass custom OPERATION_ID in ec2:run task for reproducible operation IDs
- Strategy parameter in red:multi operation launch
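
To make the slice and Loki items above concrete, here is a rough sketch of how a launcher might assemble a `systemd-run` command that places a transient unit in `system-ares.slice` and passes `LOKI_URL` through. The helper name, unit name, binary path, and URL are hypothetical; only the `systemd-run` options (`--unit`, `--slice`, `--collect`, `--property`, `--setenv`) are standard.

```python
from typing import Dict, List, Optional


def systemd_run_argv(
    unit: str,
    slice_name: str,
    command: List[str],
    env: Dict[str, str],
    properties: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build an argv that starts `command` as a transient unit inside a slice."""
    argv = ["systemd-run", f"--unit={unit}", f"--slice={slice_name}", "--collect"]
    # Optional per-unit properties, e.g. {"MemoryMax": "2G"} (value is an assumption).
    for key, value in sorted((properties or {}).items()):
        argv.append(f"--property={key}={value}")
    # Environment injection, e.g. the Loki endpoint.
    for key, value in sorted(env.items()):
        argv.append(f"--setenv={key}={value}")
    argv.append("--")
    argv.extend(command)
    return argv
```

Because the slice carries the memory cap, the unit inherits the limit from its parent cgroup even when no per-unit property is given.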

**Changed:**

- Cracking_tools: add robust NVIDIA driver installation using NVIDIA CUDA repo (for kernel 6.19+ compatibility), improved logging, error reporting, and verification steps
- Cracking_tools: install CUDA toolkit optionally, and verify nvidia-smi/clinfo status even if no GPU present
- Cracking_tools: restructure tasks to separate kernel headers/DKMS install from driver install to avoid DKMS race conditions
- Base role: remove asyncio from pip dependencies (now stdlib; PyPI version breaks imports in Python 3.13+)
- Base role: switch pip install to shell+tee for full log capture and better error visibility; show tail on failure
- Lateral movement tools: run the Ruby gem patching task only on non-Kali distributions (Kali gets CVE fixes via apt), and handle gem update failures gracefully
- EC2 Taskfile: improve open file limit (ulimit) logic for Zig builds, always setting concrete limit to avoid RLIM_INFINITY bug
- EC2 Taskfile: extend SSM build/deploy timeout and polling window to 30 minutes for large builds
- EC2 Taskfile: add SHA verification for both local and remote binary deploy, with error and mismatch detection
- EC2 Taskfile: propagate Loki URL to orchestrator and .env
- Launch orchestrator on EC2 in its own systemd transient unit in system-ares.slice for proper cgroup/memory isolation and logging
- Red Taskfile: support passing explicit IPs for target selection (IPS var) and conditional AWS lookup
- Remote Taskfile: auto-detect K8s node architecture for RUST_TARGET to support arm64/amd64 clusters
- Remote orchestrator-wrapper: improve operation claim parsing and skip malformed op requests
- Goad attack box Ansible playbook: bake GPU drivers and CUDA toolkit into AMI by default for cracking_tools role; make AWS roles conditional on cloud provider
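
The open-file-limit change above comes down to never handing the build an unlimited value. A minimal sketch of that selection logic, assuming a target of 65536 descriptors (the actual number used in the Taskfile is not stated here) and hypothetical function names:

```python
import resource


def concrete_nofile_limit(soft: int, hard: int, wanted: int = 65536) -> int:
    """Pick a finite soft limit for open files, never RLIM_INFINITY."""
    # The highest finite value we are allowed and willing to set.
    ceiling = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    # Keep an existing finite soft limit if it is already high enough.
    if soft != resource.RLIM_INFINITY and soft >= ceiling:
        return soft
    return ceiling


def apply_nofile_limit(wanted: int = 65536) -> int:
    """Set the soft RLIMIT_NOFILE for this process to a concrete value."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = concrete_nofile_limit(soft, hard, wanted)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target
```

Child processes inherit the limit, so calling `apply_nofile_limit()` before spawning the build gives it a concrete number to work with.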

**Removed:**

- Deleted python-ares-expert Claude agent definition (no longer needed or relocated)
- Removed asyncio from base pip dependencies; it has been part of the standard library since Python 3.4

**Changed:**

- Removed the default value for LOKI_URL, now requiring explicit configuration in Taskfile.yaml
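
With no default for `LOKI_URL`, a consumer would need to fail fast when the variable is missing. A small sketch of such a guard, assuming the value arrives through the process environment; `require_loki_url` is a hypothetical name and not part of the repository.

```python
import os
from typing import Mapping


def require_loki_url(env: Mapping[str, str] = os.environ) -> str:
    """Return LOKI_URL from the environment, or raise if it is unset or blank."""
    url = env.get("LOKI_URL", "").strip()
    if not url:
        raise RuntimeError("LOKI_URL is not set; configure it explicitly")
    return url
```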

l50 merged commit 67bcb16 into main May 10, 2026
11 checks passed
l50 deleted the chore/golden-image-and-task-infra branch May 10, 2026 17:48