chore: update infrastructure automation and agent deployment for reliability and GPU support #275
Merged
…ilder reliability

**Added:**
- Install NVIDIA GPU kernel driver and CUDA toolkit in cracking_tools role for EC2 AMI/host builds; enables hashcat GPU acceleration and CUDA backend support
- Systemd slice (system-ares.slice) with global memory cap and swap OOM cushion for orchestrator/worker isolation, created by the EC2 setup script
- Loki URL support for log aggregation across Taskfiles, launch scripts, and environment propagation
- SHA256 checks and verification for binary deploys to EC2 and S3 to prevent silent deploy errors or mismatches
- Option to pass custom OPERATION_ID in ec2:run task for reproducible operation IDs
- Strategy parameter in red:multi operation launch

**Changed:**
- Cracking_tools: add robust NVIDIA driver installation using NVIDIA CUDA repo (for kernel 6.19+ compatibility), improved logging, error reporting, and verification steps
- Cracking_tools: install CUDA toolkit optionally, and verify nvidia-smi/clinfo status even if no GPU is present
- Cracking_tools: restructure tasks to separate kernel headers/DKMS install from driver install to avoid DKMS race conditions
- Base role: remove asyncio from pip dependencies (now stdlib; PyPI version breaks imports in Python 3.13+)
- Base role: switch pip install to shell+tee for full log capture and better error visibility; show tail on failure
- Lateral movement tools: update Ruby gem patching task to run only on non-Kali distributions (Kali gets CVE fixes via apt) and handle gem update failures gracefully
- EC2 Taskfile: improve open file limit (ulimit) logic for Zig builds, always setting a concrete limit to avoid the RLIM_INFINITY bug
- EC2 Taskfile: extend SSM build/deploy timeout and polling interval to 30 minutes for large builds
- EC2 Taskfile: add SHA verification for both local and remote binary deploy, with error and mismatch detection
- EC2 Taskfile: propagate Loki URL to orchestrator and .env
- Launch orchestrator on EC2 in its own systemd transient unit in system-ares.slice for proper cgroup/memory isolation and logging
- Red Taskfile: support passing explicit IPs for target selection (IPS var) and conditional AWS lookup
- Remote Taskfile: auto-detect K8s node architecture for RUST_TARGET to support arm64/amd64 clusters
- Remote orchestrator-wrapper: improve operation claim parsing and skip malformed op requests
- Goad attack box Ansible playbook: bake GPU drivers and CUDA toolkit into AMI by default for cracking_tools role; make AWS roles conditional on cloud provider

**Removed:**
- Deleted python-ares-expert Claude agent definition (no longer needed or relocated)
- Removed asyncio from base pip dependencies, as it has been part of the standard library since Python 3.4
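The SHA256 deploy check described above can be sketched roughly as follows. This is a minimal local illustration, not the Taskfile's actual implementation: the helper names `sha256_of` and `verify_deploy` are hypothetical, and it assumes both copies are readable as local files (in the real flow the remote digest would presumably be computed on the EC2 host or from the S3 object).

```shell
# Hypothetical helpers; names are illustrative and not taken from the PR.

sha256_of() {
  # Print only the hex digest of the file passed as $1.
  sha256sum "$1" | awk '{print $1}'
}

verify_deploy() {
  # Compare the digest of the built artifact ($1) with the deployed copy ($2).
  local src="$1" dst="$2"
  local want got
  want="$(sha256_of "$src")"
  got="$(sha256_of "$dst")"
  if [ "$want" != "$got" ]; then
    echo "SHA256 mismatch: expected $want, got $got" >&2
    return 1
  fi
  echo "SHA256 verified: $want"
}
```

Returning non-zero on a mismatch lets a calling task abort instead of silently running a stale or truncated binary.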
**Changed:**
- Removed the default value for LOKI_URL, now requiring explicit configuration in Taskfile.yaml
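With no default for LOKI_URL, a launch script may want to fail early when the variable is missing. The guard below is one possible sketch under that assumption; the function name `require_loki_url` and the messages are hypothetical, and the PR may handle an unset value differently.

```shell
# Hypothetical guard; not taken from the PR's scripts.
require_loki_url() {
  # Fail with a clear message when LOKI_URL is unset or empty.
  if [ -z "${LOKI_URL:-}" ]; then
    echo "LOKI_URL is not set; configure it explicitly" >&2
    return 1
  fi
  echo "Using Loki at ${LOKI_URL}"
}
```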
Key Changes:
Added:
- `system-ares.slice` with memory caps and OOM settings to isolate orchestrator and workers (`setup.sh`)
- … (`setup.sh`)
- … `cracking_tools` role, with full DKMS build logs and OpenCL/CUDA verification steps
- … `.taskfiles/ec2/Taskfile.yaml`
- … (`IPS` variable) in `.taskfiles/red/Taskfile.yaml`
- … (`LOKI_URL`) in EC2 and orchestrator launch scripts

Changed:
- `base` role: switched to shell+tee for pip to capture full logs and added error tail output on failure; removed incorrect inclusion of `asyncio` package (now only using stdlib asyncio)
- `cracking_tools` role: added steps to use NVIDIA's Debian repo for driver installation (Kali archive drivers are too old for current kernels); improved error handling and debugging for DKMS/NVIDIA install failures
- `lateral_movement_tools` role: Ruby gem updates are now looped per package, only on Ubuntu (Kali gets CVEs via apt), and failures are non-fatal to avoid AMI builder SIGKILL issues

Removed:
- `.claude/agents/python-ares-expert.md` agent definition file
- `BIN_DIR` and `RUST_TARGET` top-level variables in EC2 Taskfile (now set per-task as needed)
- … (`setup.sh`)
- … `asyncio` package from Ansible dependencies (was breaking imports on Python 3.13+)
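Since `RUST_TARGET` is now set per task and the remote Taskfile auto-detects the node architecture, the mapping step might look something like the sketch below. The function name `rust_target_for_arch` is hypothetical, and the choice of `-unknown-linux-gnu` triples is an assumption; the PR may target a different libc or triple.

```shell
# Hypothetical mapping from a node architecture string to a Rust target triple.
rust_target_for_arch() {
  case "$1" in
    amd64|x86_64)  echo "x86_64-unknown-linux-gnu" ;;
    arm64|aarch64) echo "aarch64-unknown-linux-gnu" ;;
    *)
      echo "unsupported architecture: $1" >&2
      return 1
      ;;
  esac
}
```

On a cluster, the input could come from the node's reported architecture, for example `kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.architecture}'`, which yields values such as `amd64` or `arm64`.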