Prior to 3.1, the Ansible Tower HA solution was not a true high-availability system. This system has been entirely rewritten in 3.1 with a focus towards a proper highly-available clustered system. This has been extended further in 3.2 to allow grouping of clustered instances into different pools/queues.
- Each instance should be able to act as an entry point for UI and API access. This should enable AWX administrators to use load balancers in front of as many instances as they wish and maintain good data visibility.
- Each instance should be able to join the AWX cluster and expand its ability to execute jobs.
- Provisioning new instances should be as simple as updating the `inventory` file and re-running the setup playbook.
- Instances can be de-provisioned with a simple management command.
- Instances can be grouped into one or more Instance Groups to share resources for topical purposes.
- These instance groups should be assignable to certain resources:
  - Organizations
  - Inventories
  - Job Templates

  ...such that execution of jobs under those resources will favor particular queues.
It's important to point out a few existing things:
- PostgreSQL is still a standalone instance and is not clustered. Replica configuration will not be managed. If the user configures standby replicas, database failover will also not be managed.
- All instances should be reachable from all other instances, and they should be able to reach the database. It's also important for the hosts to have a stable address and/or hostname (depending on how you configure the AWX host).
- Existing old-style HA deployments will be transitioned automatically to the new HA system during the upgrade process to 3.1.
- Manual projects will need to be synced to all instances by the customer.
- Ansible Tower 3.3 adds support for container-based clusters using OpenShift or Kubernetes.
- There is no concept of primary/secondary in the new AWX system. All systems are primary.
- The `inventory` file for AWX deployments should be saved/persisted. If new instances are to be provisioned, the passwords, configuration options, and host names will need to be available to the installer.
The current standalone instance configuration doesn't change for a 3.1+ deployment. The inventory file does change in some important ways:
- Since there is no primary/secondary configuration, those inventory groups go away and are replaced with a single inventory group, `tower`. The customer may optionally define other groups and group instances in those groups. These groups should be prefixed with `instance_group_`. One instance must be present in the `tower` group. Technically, `tower` is a group like any other `instance_group_` group, but it must always be present, and if a specific group is not associated with a specific resource, then job execution will always fall back to the `tower` group:
```ini
[tower]
hostA
hostB
hostC

[instance_group_east]
hostB
hostC

[instance_group_west]
hostC
hostD
```
The `database` group remains in order to specify an external PostgreSQL instance. If the database host is provisioned separately, this group should be empty:
```ini
[tower]
hostA
hostB
hostC

[database]
hostDB
```
Recommendations and constraints:
- Do not create a group named `instance_group_tower`.
- Do not name any instance the same as a group name.
- Provisioning - Provisioning instances after installation is supported by updating the `inventory` file and re-running the setup playbook. It's important that this file contains all passwords and related information used when installing the cluster; if it does not, other instances may be reconfigured (this can also be done intentionally).
- Deprovisioning - AWX does not automatically deprovision instances, since it cannot distinguish between an instance that was taken offline intentionally and one that failed. Starting with AWX version 19.3.0, deprovisioning an instance requires updating one or more Receptor configurations across one or more nodes, which cannot be done via a manual process; the Automation Mesh Installer needs to deprovision the nodes. Adding nodes to and removing them from the mesh does not require that every node be listed in the inventory file; in other words, the absence of a node from the inventory file does not indicate that the node should be removed. Instead, a `hostvar` of `node_state: deprovision` conveys to the mesh installer that the node should be deprovisioned (see the sketch after this list).
- Removing/Deprovisioning Instance Groups - AWX does not automatically de-provision or remove instance groups, even though re-provisioning will often leave them unused. They may still show up in API endpoints and stats monitoring. These groups can be removed with the following command:
```sh
$ awx-manage unregister_queue --queuename=<name>
```
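As an illustration of the `node_state: deprovision` hostvar mentioned above, here is a minimal inventory sketch; the `execution_nodes` group name and the host names are assumptions for the example, and only the `node_state` hostvar is what the mesh installer acts on:

```ini
[execution_nodes]
hostC
; mark hostD for removal; re-running the setup playbook deprovisions it
hostD node_state=deprovision
```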
Instance Groups can be created by posting to `/api/v2/instance_groups` as a System Administrator.

Once created, `Instances` can be associated with an Instance Group with:

```
HTTP POST /api/v2/instance_groups/x/instances/ {'id': y}
```

An `Instance` that is added to an `InstanceGroup` will automatically reconfigure itself to listen on the group's work queue. See the following section, Instance Group Policies, for more details.
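For example, a `curl` sketch of the same two operations; the hostname, token variable, group name, and IDs below are placeholders, not values from this document:

```sh
# Create a new instance group (requires System Administrator credentials).
curl -s -X POST https://awx.example.com/api/v2/instance_groups/ \
     -H "Authorization: Bearer $AWX_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"name": "east"}'

# Associate the instance with id 3 with the instance group with id 2.
curl -s -X POST https://awx.example.com/api/v2/instance_groups/2/instances/ \
     -H "Authorization: Bearer $AWX_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"id": 3}'
```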
AWX `Instances` can be configured to automatically join `Instance Groups` when they come online by defining a policy. These policies are evaluated for every new `Instance` that comes online.

Instance Group Policies are controlled by three optional fields on an `Instance Group`:

- `policy_instance_percentage`: This is a number between 0 - 100. It guarantees that this percentage of active AWX instances will be added to this `Instance Group`. As new instances come online, if the number of `Instances` in this group relative to the total number of instances is fewer than the given percentage, then new ones will be added until the percentage condition is satisfied.
- `policy_instance_minimum`: This policy attempts to keep at least this many `Instances` in the `Instance Group`. If the number of available instances is lower than this minimum, then all `Instances` will be placed in this `Instance Group`.
- `policy_instance_list`: This is a fixed list of `Instance` names to always include in this `Instance Group`.
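As a sketch of configuring these fields on an existing group (the group ID and values are placeholders; the values chosen here match the worked example in the notes below):

```sh
# Keep at least 2 instances, and at least 50% of all instances, in group 2.
curl -s -X PATCH https://awx.example.com/api/v2/instance_groups/2/ \
     -H "Authorization: Bearer $AWX_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"policy_instance_percentage": 50, "policy_instance_minimum": 2}'
```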
NOTES

- `Instances` that are assigned directly to `Instance Groups` by posting to `/api/v2/instance_groups/x/instances` or `/api/v2/instances/x/instance_groups` are automatically added to the `policy_instance_list`. This means they are subject to the normal caveats for `policy_instance_list` and must be manually managed.
- `policy_instance_percentage` and `policy_instance_minimum` work together. For example, if you have a `policy_instance_percentage` of 50% and a `policy_instance_minimum` of 2 and you start 6 `Instances`, 3 of them would be assigned to the `Instance Group`. If you reduce the number of `Instances` to 2, then both of them would be assigned to the `Instance Group` to satisfy `policy_instance_minimum`. In this way, you can set a lower bound on the amount of available resources.
- Policies don't actively prevent `Instances` from being associated with multiple `Instance Groups`, but this can effectively be achieved by making the percentages sum to 100. If you have 4 `Instance Groups`, assign each a percentage value of 25 and the `Instances` will be distributed among them with no overlap.
If you have a special `Instance` which needs to be exclusively assigned to a specific `Instance Group` but don't want it to automatically join other groups via "percentage" or "minimum" policies:

1. Add the `Instance` to one or more `Instance Groups`' `policy_instance_list`.
2. Update the `Instance`'s `managed_by_policy` property to be `False`.

This will prevent the `Instance` from being automatically added to other groups based on percentage and minimum policies; it will only belong to the groups you've manually assigned it to:
```
HTTP PATCH /api/v2/instance_groups/N/
{
    "policy_instance_list": ["special-instance"]
}

HTTP PATCH /api/v2/instances/X/
{
    "managed_by_policy": false
}
```
AWX itself reports as much status as it can via the API at `/api/v2/ping` in order to provide validation of the health of the cluster. This includes:

- The instance servicing the HTTP request.
- The last heartbeat time of all other instances in the cluster.
- Instance Groups and Instance membership in those groups.

A more detailed view of Instances and Instance Groups, including running jobs and membership information, can be seen at `/api/v2/instances/` and `/api/v2/instance_groups/`.
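A quick way to eyeball this from the command line (the hostname is a placeholder, and the exact response fields depend on your AWX version):

```sh
# Pretty-print the cluster health summary returned by the ping endpoint.
curl -s https://awx.example.com/api/v2/ping/ | python -m json.tool
```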
Each AWX instance is made up of several different services working collaboratively:
- HTTP Services - This includes the AWX application itself as well as external web services.
- Callback Receiver - Receives job events that result from running Ansible jobs.
- Celery - The worker queue that processes and runs all jobs.
- Redis - This is used as a queue for AWX to process Ansible playbook callback events.
AWX is configured in such a way that if any of these services or their components fail, then all services are restarted. If these failures happen often enough in a short span of time, then the entire instance will be placed offline in an automated fashion in order to allow remediation without causing unexpected behavior.
Ideally, a regular user of AWX should not notice any semantic difference in the way jobs are run and reported. Behind the scenes, it is worth pointing out the differences in how the system behaves.

When a job is submitted from the API interface, it gets pushed into the dispatcher queue via PostgreSQL NOTIFY/LISTEN (https://www.postgresql.org/docs/10/sql-notify.html), and the task is handled by the dispatcher process running on that specific AWX node. If an instance fails while executing jobs, then the work is marked as permanently failed.
If a cluster is divided into separate Instance Groups, then the behavior is similar to the cluster as a whole. If two instances are assigned to a group then either one is just as likely to receive a job as any other in the same group.
As AWX instances are brought online, the work capacity of the AWX system effectively expands. If those instances are also placed into Instance Groups, then they also expand that group's capacity. If an instance is performing work and is a member of multiple groups, then capacity will be reduced from all groups of which it is a member. De-provisioning an instance will remove capacity from the cluster wherever that instance was assigned.
It's important to note that not all instances are required to be provisioned with an equal capacity.
If an Instance Group is configured but all instances in that group are offline or unavailable, any jobs that are launched targeting only that group will be stuck in a waiting state until instances become available. Fallback or backup resources should be provisioned to handle any work that might encounter this scenario.
It is important that project updates run on the instance which prepares the ansible-runner private data directory. This is accomplished by a project sync which is done by the dispatcher control / launch process. The sync will update the source tree to the correct version on the instance immediately prior to transmitting the job. If the needed revision is already locally checked out and Galaxy or Collections updates are not needed, then a sync may not be performed.
When the sync happens, it is recorded in the database as a project update with a `launch_type` of "sync" and a `job_type` of "run". Project syncs will not change the status or version of the project; instead, they will update the source tree only on the instance where they run. The only exception to this behavior is when the project is in the "never updated" state (meaning that no project updates of any type have been run), in which case a sync should fill in the project's initial revision and status, and subsequent syncs should not make such changes.
All project updates run with container isolation (like jobs) and volume-mount the persistent projects folder.
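To see these recorded syncs, one option (a sketch; the hostname and token are placeholders, and the querystring uses AWX's standard field filtering) is:

```sh
# List project updates that were recorded as syncs.
curl -s -H "Authorization: Bearer $AWX_TOKEN" \
     "https://awx.example.com/api/v2/project_updates/?launch_type=sync&job_type=run"
```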
By default, a job will be submitted to the default queue (formerly the `tower` queue). To see the name of the queue, view the setting `DEFAULT_EXECUTION_QUEUE_NAME`.

Administrative actions, like project updates, will run in the control plane queue. The name of the control plane queue is surfaced in the setting `DEFAULT_CONTROL_PLANE_QUEUE_NAME`.
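One way to inspect both settings on a node (a sketch assuming shell access to an AWX node; `print_settings` comes from the bundled django-extensions tooling):

```sh
# Print the effective queue names from the running configuration.
awx-manage print_settings | grep -E 'DEFAULT_(EXECUTION|CONTROL_PLANE)_QUEUE_NAME'
```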
If the Job Template, Inventory, or Organization has instance groups associated with it, a job run from that Job Template will not be eligible for the default behavior. This means that if all of the instances associated with these three resources are out of capacity, the job will remain in the `pending` state until capacity frees up.
The order of preference in determining which instance group the job gets submitted to is as follows:
- Job Template
- Inventory
- Organization (by way of Inventory)
To expand further: If instance groups are associated with the Job Template and all of them are at capacity, then the job will be submitted to instance groups specified on Inventory, and then Organization.
The global `tower` group can still be associated with a resource, just like any of the custom instance groups defined in the playbook. This can be used to specify a preferred instance group on the job template or inventory, but still allow the job to be submitted to any instance if those are out of capacity.
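Associating an instance group with a resource follows the same pattern as instance association above; a sketch with placeholder IDs:

```sh
# Associate instance group 2 with job template 7 so its jobs prefer that group.
curl -s -X POST https://awx.example.com/api/v2/job_templates/7/instance_groups/ \
     -H "Authorization: Bearer $AWX_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"id": 2}'
```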
In order to support temporarily taking an `Instance` offline, there is a boolean property `enabled` defined on each instance. When this property is disabled, no jobs will be assigned to that `Instance`. Existing jobs will finish, but no new work will be assigned.
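A minimal sketch of toggling this from the API (the instance ID is a placeholder; set `"enabled": true` to bring the instance back into rotation):

```sh
# Take instance 3 out of rotation without deprovisioning it.
curl -s -X PATCH https://awx.example.com/api/v2/instances/3/ \
     -H "Authorization: Bearer $AWX_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"enabled": false}'
```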
When verifying acceptance, we should ensure that the following statements are true:
- AWX should install as a standalone Instance
- AWX should install in a Clustered fashion
- Instances should, optionally, be able to be grouped arbitrarily into different Instance Groups
- Capacity should be tracked at the group level and capacity impact should make sense relative to what instance a job is running on and what groups that instance is a member of
- Provisioning should be supported via the setup playbook
- De-provisioning should be supported via a management command
- All jobs, inventory updates, and project updates should run successfully
- Jobs should be able to run on hosts for which they are targeted; if assigned implicitly or directly to groups, then they should only run on instances in those Instance Groups
- Project updates should manifest their data on the host that will run the job immediately prior to the job running
- AWX should be able to reasonably survive the removal of all instances in the cluster
- AWX should behave in a predictable fashion during network partitioning
- Basic testing should be able to demonstrate parity with a standalone instance for all integration testing.
- Basic playbook testing to verify routing differences, including:
  - Basic FQDN
  - Short-name name resolution
  - IP addresses
  - `/etc/hosts` static routing information
- We should test behavior of large and small clusters; small clusters usually consist of 2 - 3 instances and large clusters have 10 - 15 instances.
- Failure testing should involve killing single instances and killing multiple instances while the cluster is performing work. Job failures during the time period should be predictable and not catastrophic.
- Instance downtime testing should also include recoverability testing (killing single services and ensuring the system can return itself to a working state).
- Persistent failure should be tested by killing single services in such a way that the cluster instance cannot be recovered and ensuring that the instance is properly taken offline.
- Network partitioning failures will also be important. In order to test this:
- Disallow a single instance from communicating with the other instances but allow it to communicate with the database
- Break the link between instances such that it forms two or more groups where Group A and Group B can't communicate but all instances can communicate with the database.
- Crucially, when network partitioning is resolved, all instances should recover into a consistent state.
- Upgrade Testing - verify behavior before and after are the same for the end user.
- Project Updates should be thoroughly tested for all SCM types (`git`, `svn`, `archive`) and for manual projects.
- Setting up instance groups should be tested in two scenarios:
  a) instances are shared between groups;
  b) instances are isolated to particular groups.
  Organizations, Inventories, and Job Templates should be variously assigned to one or many groups, and jobs should execute in those groups in preferential order as resources are available.
Performance testing should be twofold:
- A large volume of simultaneous jobs
- Jobs that generate a large amount of output
These should also be benchmarked against the same playbooks using the 3.0.X Tower release and a stable Ansible version. For a large volume playbook (e.g., against 100+ hosts), something like the following is recommended:
https://gist.github.com/michelleperz/fe3a0eb4eda888221229730e34b28b89