perf: improve ansible reliability with async, retries, and exponential backoff #10
Merged
… install

**Added:**
- Added parallel installation and async waiting for required PowerShell modules in the common role, with consolidated module check and improved progress feedback
- Added async job wait tasks with fail-fast error detection for AD OU, group, and managed_by tasks, as well as user and SPN creation
- Added custom progress labeling to loops for clearer job tracking in logs

**Changed:**
- Refactored AD group creation and managed_by assignment to use async mode, structured waiting, and improved error handling
- Replaced `with_dict`/`with_items` with `loop` and `loop_control` in all AD and user tasks for clarity and better result access
- Reduced retries for ACL async task for faster failure detection and added fail conditions for early error reporting
- Improved runner exponential backoff logic for unreachable and failed hosts, with dynamic wait times and better logging
- Removed unnecessary serial PowerShell module checks and installs in common role, replacing them with a single parallel approach

**Removed:**
- Removed pre-installation reboot from MSSQL role, relying on prior provisioning for OS readiness
- Eliminated redundant checks and serial installs for individual PowerShell modules in the common role to streamline provisioning
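The "async job wait with fail-fast error detection" idea in this commit can be illustrated outside Ansible. The following is only a minimal Python sketch of the polling pattern, not the role's actual tasks: the function name `wait_for_job`, the status keys `finished`/`failed`/`msg`, and the default `retries` and `delay` values are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Dict


def wait_for_job(
    get_status: Callable[[], Dict],
    retries: int = 30,      # hypothetical default, not taken from the PR
    delay: float = 5.0,     # hypothetical default, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll a background job until it finishes.

    Fails fast: a status with a truthy "failed" key raises immediately
    instead of consuming the remaining polls.
    """
    for attempt in range(1, retries + 1):
        status = get_status()
        if status.get("failed"):
            msg = status.get("msg", "no message")
            raise RuntimeError(f"job failed on poll {attempt}: {msg}")
        if status.get("finished"):
            return status
        if attempt < retries:
            # Only wait when another poll will follow.
            sleep(delay)
    raise TimeoutError(f"job not finished after {retries} polls")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. The trade-off it shows is the one the commit messages discuss: fewer polls surface failures sooner, but too few can abort a job that is merely slow.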
…k for mssql

**Added:**
- Added a task to check for pending Windows reboots before MSSQL installation and trigger a reboot with long timeout if needed to ensure installation prerequisites

**Changed:**
- Increased ACL operation retries from 40 to 50 to accommodate observed slow cases
- Removed all explicit `failed_when` conditions from AD role tasks for groups, OUs, and users to prevent premature task failure and allow full async job completion
- Updated task comments to clarify reasons for retry and failure handling changes

**Removed:**
- Removed pre-installation reboot comment in MSSQL role, replacing it with an explicit check and reboot step for clarity and reliability
- Eliminated `failed_when` clauses from AD role tasks for universal, global, domainlocal groups, OUs, and users to avoid unnecessary task failures during async operations
l50 added a commit that referenced this pull request on Oct 29, 2025
…l backoff (#10)

**Key Changes:**
- Converted all AD and OU creation tasks to async operations with robust wait and retry logic
- Refactored module installation to parallelize and wait for completion, improving speed and reliability
- Introduced pending reboot checks before MSSQL install to avoid install failures
- Enhanced Ansible runner to use exponential backoff on connection failures and task errors

**Added:**
- Asynchronous handling for AD group, OU, and user creation, including wait loops for job completion with retries and labeled progress messages for clearer output
- Pending reboot detection before MSSQL installation to ensure prerequisite state is met
- Exponential backoff logic for Ansible runner retries on unreachable VMs and playbook failures

**Changed:**
- Switched AD group and OU creation from `with_dict` to `loop` with `dict2items`, enabling async, improved error handling, and progress labeling in `ad/tasks/groups.yml`, `ad/tasks/ou.yml`, and related files
- Consolidated PowerShell module checks into a single step, and parallelized missing module installations with async handling and job status waits in `common/tasks/main.yml`
- Updated user creation and SPN tasks to use async loops with job status checks and clearer labeling in `ad/tasks/users.yml`
- Increased ACL retry count to 50 in `acl/tasks/main.yml` to better reflect real-world timing observations
- Improved Ansible runner retry logic to use exponential backoff (10s, 30s, 60s) for both unreachable and failure states, and updated abort logic in `goad/provisioner/ansible/runner.py`

**Removed:**
- Redundant, sequential module existence checks and installations for PowerShell DSC modules in `common/tasks/main.yml`
- Legacy `with_dict` and `with_items` loops for AD, OU, and user tasks, now replaced with async-enabled `loop` constructs

**Performance Impact:**
- OU creation: 114-134s → 107s (10% faster)
- User creation: Eliminates extreme outliers (consistent 2-3min instead of occasional 63min)
- Module installation: Parallel execution with <1s overhead
- Overall provisioning: 10-15% faster with more robust error handling

**Test Results:**

Validated across multiple full provisioning runs on AWS with SSM:
- build.yml, ad-servers.yml, ad-parent_domain.yml, ad-child_domain.yml, ad-members.yml, ad-trusts.yml
- ad-data.yml, ad-gmsa.yml, laps.yml, ad-relations.yml, adcs.yml, ad-acl.yml
- servers.yml, security.yml, vulnerabilities.yml
All playbooks complete successfully with improved timing and reliability.
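
For readers who want a concrete picture of the runner's retry waits (10s, 30s, 60s), here is a minimal Python sketch. It is not the code in `goad/provisioner/ansible/runner.py`: the names `BACKOFF_SECONDS`, `backoff_delay`, and `run_with_backoff`, the `max_retries` default, and the choice to clamp at the last delay are assumptions for illustration. Only the three wait values come from the commit message.

```python
import time
from typing import Callable, Sequence

# Wait times quoted in the commit message; modelled here as a fixed
# lookup schedule (an assumption about how the values are produced).
BACKOFF_SECONDS: Sequence[int] = (10, 30, 60)


def backoff_delay(attempt: int, schedule: Sequence[int] = BACKOFF_SECONDS) -> int:
    """Return the wait before retry number `attempt` (1-based).

    Attempts beyond the end of the schedule reuse the last delay.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return schedule[min(attempt, len(schedule)) - 1]


def run_with_backoff(
    run_once: Callable[[], bool],
    max_retries: int = 3,  # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
    schedule: Sequence[int] = BACKOFF_SECONDS,
) -> bool:
    """Call `run_once` until it returns True or the retries are exhausted."""
    if run_once():
        return True
    for attempt in range(1, max_retries + 1):
        # Wait longer before each successive retry.
        sleep(backoff_delay(attempt, schedule))
        if run_once():
            return True
    return False
```

In this sketch a playbook run that fails twice and then succeeds would wait 10s and then 30s before returning `True`. A run that never succeeds waits 10s, 30s, and 60s before giving up.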