perf: improve ansible reliability with async, retries, and exponential backoff #10

Merged: l50 merged 2 commits into main from perf/ansible-playbook-optimizations on Oct 29, 2025

l50 (Contributor) commented Oct 29, 2025

Key Changes:

  • Converted all AD and OU creation tasks to async operations with robust wait and retry logic
  • Refactored module installation to parallelize and wait for completion, improving speed and reliability
  • Introduced pending reboot checks before MSSQL install to avoid install failures
  • Enhanced Ansible runner to use exponential backoff on connection failures and task errors
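
As a rough illustration of the "consolidate the check, then install in parallel" idea from the second bullet, here is a minimal Python sketch. It is an analogy only: the actual change is implemented as Ansible tasks in `common/tasks/main.yml`, and the names `missing_modules`, `install_all`, the `install` callback, and the placeholder module names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def missing_modules(required: Iterable[str], installed: Iterable[str]) -> List[str]:
    """Single consolidated check: which required modules are not installed yet?"""
    installed_set = set(installed)
    return [name for name in required if name not in installed_set]


def install_all(modules: List[str], install: Callable[[str], str],
                max_workers: int = 4) -> Dict[str, str]:
    """Start every install at once, then wait for all of them to finish.

    `install` is a hypothetical callback standing in for the real install step.
    Any exception raised by an install is re-raised when its result is collected.
    """
    if not modules:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(install, name) for name in modules}
        return {name: future.result() for name, future in futures.items()}


# Placeholder names, not the modules used by the playbooks.
todo = missing_modules(["ModuleA", "ModuleB", "ModuleC"], installed=["ModuleB"])
results = install_all(todo, install=lambda name: f"installed {name}")
```

Only the modules reported missing are passed to the parallel step, which is why an already-provisioned host pays almost nothing for the check.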

Added:

  • Asynchronous handling for AD group, OU, and user creation, including wait loops for job completion with retries and labeled progress messages for clearer output
  • Pending reboot detection before MSSQL installation to ensure prerequisite state is met
  • Exponential backoff logic for Ansible runner retries on unreachable VMs and playbook failures
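
The wait loops described above follow a common pattern: start a job without blocking, then poll its status a bounded number of times. The sketch below shows that pattern in Python under stated assumptions; `wait_for_job`, `is_finished`, and the retry and delay values are hypothetical and are not taken from the playbooks.

```python
import time
from typing import Callable


def wait_for_job(is_finished: Callable[[], bool], retries: int, delay: float,
                 sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll `is_finished` up to `retries` times, pausing `delay` seconds between checks.

    Returns the 1-based number of the check that reported completion and
    raises TimeoutError if the job never finishes within the allowed checks.
    """
    for attempt in range(1, retries + 1):
        if is_finished():
            return attempt
        if attempt < retries:
            sleep(delay)
    raise TimeoutError(f"job not finished after {retries} checks")


# Simulated job that reports completion on the third status check.
_states = iter([False, False, True])
checks_used = wait_for_job(lambda: next(_states), retries=5, delay=0)
```

Bounding the number of checks is what turns an occasional hang into a clear, attributable failure.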

Changed:

  • Switched AD group and OU creation from with_dict to loop with dict2items, enabling async, improved error handling, and progress labeling in ad/tasks/groups.yml, ad/tasks/ou.yml, and related files
  • Consolidated PowerShell module checks into a single step, and parallelized missing module installations with async handling and job status waits in common/tasks/main.yml
  • Updated user creation and SPN tasks to use async loops with job status checks and clearer labeling in ad/tasks/users.yml
  • Increased ACL retry count to 50 in acl/tasks/main.yml to better reflect real-world timing observations
  • Improved Ansible runner retry logic to use exponential backoff (10s, 30s, 60s) for both unreachable and failure states, and updated abort logic in goad/provisioner/ansible/runner.py
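
To make the retry schedule concrete, here is a minimal Python sketch of a runner that waits 10s, 30s, then 60s between attempts. Only the three delay values come from the description above; the function names, the `max_retries` default, and the choice to reuse the last delay for any further retries are assumptions, and the actual abort logic in `goad/provisioner/ansible/runner.py` is not shown here.

```python
import time
from typing import Callable, List, Sequence, Tuple

BACKOFF_DELAYS = (10, 30, 60)  # seconds, as listed in the PR description


def backoff_delay(attempt: int, delays: Sequence[int] = BACKOFF_DELAYS) -> int:
    """Delay before retry number `attempt` (1-based).

    Assumption: retries beyond the end of the schedule reuse the last delay.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return delays[min(attempt, len(delays)) - 1]


def run_with_backoff(run_once: Callable[[], bool], max_retries: int = 3,
                     sleep: Callable[[float], None] = time.sleep) -> Tuple[bool, List[int]]:
    """Run once, then retry up to `max_retries` times with growing waits.

    `run_once` is a hypothetical callback that returns True on success.
    Returns (succeeded, list of waits that were applied).
    """
    waits: List[int] = []
    if run_once():
        return True, waits
    for attempt in range(1, max_retries + 1):
        delay = backoff_delay(attempt)
        waits.append(delay)
        sleep(delay)
        if run_once():
            return True, waits
    return False, waits
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; for example, a run that fails twice and then succeeds records waits of `[10, 30]`.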

Removed:

  • Redundant, sequential module existence checks and installations for PowerShell DSC modules in common/tasks/main.yml
  • Legacy with_dict and with_items loops for AD, OU, and user tasks, now replaced with async-enabled loop constructs
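
For readers unfamiliar with the replacement construct: Ansible's `dict2items` filter turns a mapping into a list of items with `key` and `value` fields, which `loop` can then iterate and label. The Python sketch below mimics that transformation; the group names and fields are made-up sample data, not values from the repository.

```python
from typing import Any, Dict, List


def dict2items(mapping: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Python equivalent of the shape produced by Ansible's `dict2items` filter."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


# Hypothetical sample data for illustration only.
groups = {
    "ExampleGroupA": {"path": "OU=Example,DC=example,DC=local"},
    "ExampleGroupB": {"path": "OU=Other,DC=example,DC=local"},
}

items = dict2items(groups)
# A per-item label (what `loop_control.label` would show) can be just the key.
labels = [item["key"] for item in items]
```

Because each item carries its own `key`, the result of every loop iteration can be matched back to the group, OU, or user it belongs to when checking job status later.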

Performance Impact

  • OU creation: 114-134s → 107s (10% faster)
  • User creation: Eliminates extreme outliers (consistent 2-3min instead of occasional 63min)
  • Module installation: Parallel execution with <1s overhead
  • Overall provisioning: 10-15% faster with more robust error handling
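
As a quick check on the OU figure, the speedup can be computed from the quoted timings. This is plain arithmetic on the numbers above; `percent_faster` is a hypothetical helper name.

```python
def percent_faster(before_s: float, after_s: float) -> float:
    """Relative reduction in duration, as a percentage of the original time."""
    return (before_s - after_s) / before_s * 100


# OU creation: 114-134s before, 107s after.
low = percent_faster(114, 107)   # measured against the fastest earlier run
high = percent_faster(134, 107)  # measured against the slowest earlier run
print(f"{low:.1f}% to {high:.1f}% faster")
```

The quoted 10% falls inside this range, which runs from about 6% against the fastest earlier run to about 20% against the slowest.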

Test Results

Validated across multiple full provisioning runs on AWS with SSM:

  • build.yml, ad-servers.yml, ad-parent_domain.yml, ad-child_domain.yml, ad-members.yml, ad-trusts.yml
  • ad-data.yml, ad-gmsa.yml, laps.yml, ad-relations.yml, adcs.yml, ad-acl.yml
  • servers.yml, security.yml, vulnerabilities.yml

All playbooks complete successfully with improved timing and reliability.

l50 added 2 commits October 29, 2025 11:55
… install

**Added:**

- Added parallel installation and async waiting for required PowerShell modules in
  the common role, with consolidated module check and improved progress feedback
- Added async job wait tasks with fail-fast error detection for AD OU, group, and
  managed_by tasks, as well as user and SPN creation
- Added custom progress labeling to loops for clearer job tracking in logs

**Changed:**

- Refactored AD group creation and managed_by assignment to use async mode,
  structured waiting, and improved error handling
- Replaced `with_dict`/`with_items` with `loop` and `loop_control` in all AD and
  user tasks for clarity and better result access
- Reduced retries for ACL async task for faster failure detection and added fail
  conditions for early error reporting
- Improved runner exponential backoff logic for unreachable and failed hosts, with
  dynamic wait times and better logging
- Removed unnecessary serial PowerShell module checks and installs in common role,
  replacing with a single parallel approach

**Removed:**

- Removed pre-installation reboot from MSSQL role, relying on prior provisioning
  for OS readiness
- Eliminated redundant checks and serial installs for individual PowerShell
  modules in the common role to streamline provisioning
…k for mssql

**Added:**

- Added a task to check for pending Windows reboots before MSSQL installation and
  trigger a reboot with long timeout if needed to ensure installation prerequisites

**Changed:**

- Increased ACL operation retries from 40 to 50 to accommodate observed slow cases
- Removed all explicit `failed_when` conditions from AD role tasks for groups, OUs,
  and users to prevent premature task failure and allow full async job completion
- Updated task comments to clarify reasons for retry and failure handling changes

**Removed:**

- Removed pre-installation reboot comment in MSSQL role, replacing with an explicit
  check and reboot step for clarity and reliability
- Eliminated `failed_when` clauses from AD role tasks for universal, global,
  domainlocal groups, OUs, and users to avoid unnecessary task failures during
  async operations

l50 merged commit d1483c1 into main on Oct 29, 2025
l50 deleted the perf/ansible-playbook-optimizations branch on October 29, 2025 21:34
l50 added a commit that referenced this pull request on Oct 29, 2025
…l backoff (#10)
