This repository has been archived by the owner on Jul 11, 2020. It is now read-only.

Allow switching between running open source AWX and Ansible Tower #5

Closed
geerlingguy opened this issue Nov 8, 2019 · 29 comments

Comments

@geerlingguy
Owner

Right now I'm building out everything using open source AWX, just for convenience's sake. But I'm working on building the operator in a way where users could choose between AWX and Tower (if they want support and a license, and all that).

See:

Docs for setup:

@geerlingguy
Owner Author

Follow-up to #1.

@geerlingguy
Owner Author

Just switching the image variables to the Tower images, like so:

tower_task_image: registry.access.redhat.com/ansible-tower-35/ansible-tower:3.5.3
tower_web_image: registry.access.redhat.com/ansible-tower-35/ansible-tower:3.5.3

Results in:

[Screenshot: Screen Shot 2019-11-11 at 1 16 31 PM]

@geerlingguy
Owner Author

And in the logs:

2019-11-11 19:17:17,444 WARNING  awx.conf.settings Database settings are not available, using defaults, error:
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/handlers/wsgi.py", line 157, in __call__
    response = self.get_response(request)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/handlers/base.py", line 131, in get_response
    response = middleware_method(request, response)
  File "/middleware.py", line 54, in process_response
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 1306, in info
    self._log(INFO, msg, args, **kwargs)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 1442, in _log
    self.handle(record)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 1452, in handle
    self.callHandlers(record)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 1514, in callHandlers
    hdlr.handle(record)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 859, in handle
    rv = self.filter(record)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/logging/__init__.py", line 718, in filter
    result = f.filter(record)
  File "/filters.py", line 91, in filter
  File "/filters.py", line 38, in __get__
  File "/settings.py", line 543, in __getattr_without_cache__
  File "/settings.py", line 447, in __getattr__
  File "/settings.py", line 390, in _get_local
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/query.py", line 567, in first
    objects = list((self if self.ordered else self.order_by('pk'))[:1])
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/query.py", line 250, in __iter__
    self._fetch_all()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/query.py", line 1121, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/query.py", line 53, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
    raise original_exception
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/sql/compiler.py", line 889, in execute_sql
    cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/utils/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
ProgrammingError: relation "conf_setting" does not exist
LINE 1: ...f_setting"."value", "conf_setting"."user_id" FROM "conf_sett...
                                                             ^

So it seems something is different in the config between Tower 3.5 and AWX 9.x?

@geerlingguy
Owner Author

It looks like the database initialization is not done automatically for Tower, only for AWX. So I had to:

$ kubectl exec -it -n example-tower example-tower-tower-6858559bcd-crc75 bash
bash$ awx-manage migrate --noinput

I'll have to add something to the operator that checks if this is a fresh install, and runs the migration if it needs to be run.
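
One way the fresh-install check could look, as a rough sketch rather than what the operator actually does: ask Django which migrations are unapplied and only run the migration when there are some. `has_pending_migrations` and `migrate_if_needed` are hypothetical helper names, and the sketch assumes `awx-manage` passes `showmigrations --plan` through to Django and that `kubectl` can reach the pod.

```shell
# Reads `showmigrations` output on stdin.
# Returns 0 if at least one migration is still unapplied ("[ ]"), 1 otherwise.
has_pending_migrations() {
  grep -q '^ *\[ \]'
}

# Hypothetical wrapper: the namespace and pod name are supplied by the caller.
migrate_if_needed() {
  local namespace="$1" pod="$2"
  if kubectl exec -n "$namespace" "$pod" -- awx-manage showmigrations --plan \
      | has_pending_migrations; then
    kubectl exec -n "$namespace" "$pod" -- awx-manage migrate --noinput
  fi
}

# Example call (names taken from the commands above):
#   migrate_if_needed example-tower example-tower-tower-6858559bcd-crc75
```

Running `awx-manage migrate --noinput` unconditionally should also be safe, since Django skips migrations that are already applied; the check mainly avoids reporting a change when nothing happened.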

@geerlingguy
Owner Author

After that, it looks like the tower_admin_user and tower_admin_password weren't consumed as they are when installing AWX... so I need to figure out what they are. The install guide kinda hints at them being admin and password, but that didn't work either.

@geerlingguy
Owner Author

RE: the above two comments: for Tower, the OpenShift setup playbook contains the following tasks. (AWX seems to do all of these automatically on first setup, as long as your env vars and config are correct, so I'm not sure why it's not the same for Tower.)

- name: Migrate database
  shell: |
    {{ kubectl_or_oc }} -n {{ kubernetes_namespace }} exec ansible-tower-management -- \
      bash -c "awx-manage migrate --noinput"

- name: Check for Tower Super users
  shell: |
    {{ kubectl_or_oc }} -n {{ kubernetes_namespace }} exec ansible-tower-management -- \
      bash -c "echo 'from django.contrib.auth.models import User; nsu = User.objects.filter(is_superuser=True).count(); exit(0 if nsu > 0 else 1)' | awx-manage shell"
  register: super_check
  ignore_errors: yes
  changed_when: super_check.rc > 0

- name: create django super user if it does not exist
  shell: |
    {{ kubectl_or_oc }} -n {{ kubernetes_namespace }} exec ansible-tower-management -- \
      bash -c "echo \"from django.contrib.auth.models import User; User.objects.create_superuser('{{ admin_user }}', '{{ admin_email }}', '{{ admin_password }}')\" | awx-manage shell"
  no_log: yes
  when: super_check.rc > 0

- name: update django super user password
  shell: |
    {{ kubectl_or_oc }} -n {{ kubernetes_namespace }} exec ansible-tower-management -- \
      bash -c "awx-manage update_password --username='{{ admin_user }}' --password='{{ admin_password }}'"
  no_log: yes
  register: result
  changed_when: "'Password updated' in result.stdout"

- name: Create the default organization if it is needed.
  shell: |
    {{ kubectl_or_oc }} -n {{ kubernetes_namespace }} exec ansible-tower-management -- \
      bash -c "awx-manage create_preload_data"
  register: cdo
  changed_when: "'added' in cdo.stdout"
  when: create_preload_data | bool

@geerlingguy
Owner Author

After asking some Ansible devs about this, I found out that the automatic setup steps are part of the AWX Docker image's convenience launch script:

https://github.com/ansible/awx/blob/d3b413c1258af5e6b0f7ef417831985a0bb4e348/installer/roles/image_build/files/launch_awx_task.sh#L15-L22

For OpenShift/Kubernetes installs, it looks like this is the command used for the task container (/usr/bin/launch_awx_task.sh):

https://github.com/ansible/awx/blob/devel/installer/roles/kubernetes/templates/deployment.yml.j2#L238

And the default Dockerfile CMD is also set to it (CMD /usr/bin/launch_awx_task.sh):

https://github.com/ansible/awx/blob/17798edbc4577a0601d4b0e867ce198e06837794/installer/roles/image_build/templates/Dockerfile.task.j2#L6

So... I guess I'll just have to detect if we're installing Tower or AWX, and from that decide whether to do the extra steps.
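
A minimal sketch of that detection, assuming a Tower image can be recognised by `ansible-tower` appearing in its image reference (as in the `tower_task_image` value earlier in this thread). `is_tower_image` is a hypothetical helper name, and a real check might need a more explicit flag in the CR instead of string matching.

```shell
# Returns 0 when the image reference looks like a Tower image, 1 otherwise.
# Assumption: Tower images contain "ansible-tower" in the repository path.
is_tower_image() {
  case "$1" in
    *ansible-tower*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example call:
#   if is_tower_image "registry.access.redhat.com/ansible-tower-35/ansible-tower:3.5.3"; then
#     echo "Tower: run the migrate and superuser steps"
#   fi
```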

@geerlingguy
Owner Author

Looks like there is no user account (used psql to connect inside the Tower container):

awx=# select * from auth_user
awx-# ;
 id | password | last_login | is_superuser | username | first_name | last_name | email | is_staff | is_active | date_joined 
----+----------+------------+--------------+----------+------------+-----------+-------+----------+-----------+-------------
(0 rows)

So I ran:

echo "from django.contrib.auth.models import User; User.objects.create_superuser('test', 'test@example.com', 'changeme')" | awx-manage shell

And now I'm on the license page, logged in. Nice!

[Screenshot: Screen Shot 2019-11-11 at 4 42 32 PM]

@geerlingguy
Owner Author

To achieve everything automatically, I'm going to need the k8s_exec module that's in this PR: ansible/ansible#55029

I'll probably toss it into the tower role's library directory and call it a day for now... just wish it could've been merged into Ansible sooner :P

@geerlingguy
Owner Author

That module is giving me:

  File \"/usr/lib/python2.7/site-packages/kubernetes/stream/ws_client.py\", line 255, in websocket_call
    raise ApiException(status=0, reason=str(e))
kubernetes.client.rest.ApiException: (0)
Reason: Handshake status 403 Forbidden

@geerlingguy
Owner Author

In the above commit, I split the example CRs, with one for AWX and one for Tower. That way I can continue using the AWX one in the CI tests (at least for now... eventually I'll want to test both AWX and local...).

@geerlingguy
Owner Author

At this point I'm getting:

TASK [tower : Migrate the database if the K8s resources were updated.] *********
task path: /opt/ansible/roles/tower/tasks/main.yml:32
fatal: [localhost]: FAILED! => {"changed": false, "module_stderr": "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py:496: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config_dict=yaml.load(f),
Traceback (most recent call last):
  File \"/opt/ansible/.ansible/tmp/ansible-tmp-1573847294.36-128275569375865/AnsiballZ_k8s_exec.py\", line 114, in <module>
    _ansiballz_main()
  File \"/opt/ansible/.ansible/tmp/ansible-tmp-1573847294.36-128275569375865/AnsiballZ_k8s_exec.py\", line 106, in _ansiballz_main
    invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
  File \"/opt/ansible/.ansible/tmp/ansible-tmp-1573847294.36-128275569375865/AnsiballZ_k8s_exec.py\", line 49, in invoke_module
    imp.load_module('__main__', mod, module, MOD_DESC)
  File \"/tmp/ansible_k8s_exec_payload_bavjVr/__main__.py\", line 136, in <module>
  File \"/tmp/ansible_k8s_exec_payload_bavjVr/__main__.py\", line 123, in main
  File \"/usr/lib/python2.7/site-packages/kubernetes/stream/stream.py\", line 32, in stream
    return func(*args, **kwargs)
  File \"/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py\", line 835, in connect_get_namespaced_pod_exec
    (data) = self.connect_get_namespaced_pod_exec_with_http_info(name, namespace, **kwargs)
  File \"/usr/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py\", line 935, in connect_get_namespaced_pod_exec_with_http_info
    collection_formats=collection_formats)
  File \"/usr/lib/python2.7/site-packages/kubernetes/client/api_client.py\", line 321, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File \"/usr/lib/python2.7/site-packages/kubernetes/client/api_client.py\", line 155, in __call_api
    _request_timeout=_request_timeout)
  File \"/usr/lib/python2.7/site-packages/kubernetes/stream/stream.py\", line 27, in _intercept_request_call
    return ws_client.websocket_call(config, *args, **kwargs)
  File \"/usr/lib/python2.7/site-packages/kubernetes/stream/ws_client.py\", line 255, in websocket_call
    raise ApiException(status=0, reason=str(e))
kubernetes.client.rest.ApiException: (0)
Reason: Handshake status 200 OK

", "module_stdout": "", "msg": "MODULE FAILURE
See stdout/stderr for the exact error", "rc": 1}

@geerlingguy
Owner Author

Some testing—on the command line, I can run:

$ kubectl exec -n example-tower example-tower-tower-6858559bcd-pbghh date
Fri Nov 15 20:10:51 UTC 2019

Testing in the operator playbook:

- name: Test a simple command.
  k8s_exec:
    namespace: '{{ meta.namespace }}'
    pod: '{{ tower_pod_name }}'
    command: date
  register: date_result
- debug: var=date_result

It results in:

raise ApiException(status=0, reason=str(e))
kubernetes.client.rest.ApiException: (0)
Reason: Handshake status 200 OK

Digging a little bit, it seems that can happen if you're hitting an endpoint that's not actually a websocket; see https://stackoverflow.com/a/40110656/100134

So maybe the module's not finding the right URL to hit when it's running inside the Operator? Could it be an Ansible 2.8 issue (I believe I'm running 2.9 externally)? Going to do some more digging...

@geerlingguy
Owner Author

Running the same task on my host against Minikube with ansible===2.9.1, I had no problem.

@geerlingguy
Owner Author

I ran pip3 uninstall ansible and pip3 install ansible===2.8.0 to match the version inside the Operator image, and the command still worked fine. So it's definitely something with running it from inside the cluster vs. running it from outside :/

@geerlingguy
Owner Author

Inside the container I was hitting:

bash-4.2$ ansible-playbook test.yml 

PLAY [localhost] *******************************************************************************************************

TASK [Get the Tower web pod information.] ******************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'getpwuid(): uid not found: 1001'
fatal: [localhost]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}

It looks like the Ansible Operator Dockerfile adds environment information for an ansible-operator user (see: https://github.com/operator-framework/operator-sdk/blob/master/ci/dockerfiles/ansible.Dockerfile#L16-L19), but since OpenShift assigns a random UID on container start, that user is not added to /etc/passwd. I added the following line:

ansible-operator:x:1001:1001:ansible-operator user:/opt/ansible:/sbin/nologin

And the Ansible/Python getpwuid errors went away.
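
The manual fix above could be automated at container start with something like the following sketch. `ensure_passwd_entry` is a hypothetical helper name, and the approach assumes `/etc/passwd` is writable by whatever UID the container ends up running as, which depends on how the image was built.

```shell
# Appends a passwd entry for the given UID unless the file already has one.
# Arguments: passwd file path, UID, GID, user name, home directory.
ensure_passwd_entry() {
  local passwd_file="$1" uid="$2" gid="$3" name="$4" home="$5"
  # Field 3 of each passwd line is the numeric UID.
  if ! awk -F: -v uid="$uid" '$3 == uid { found = 1 } END { exit !found }' "$passwd_file"; then
    echo "${name}:x:${uid}:${gid}:${name} user:${home}:/sbin/nologin" >> "$passwd_file"
  fi
}

# Example call at container start:
#   ensure_passwd_entry /etc/passwd "$(id -u)" "$(id -g)" ansible-operator /opt/ansible
```

Using `id -u` rather than a fixed 1001 lets the entry match whichever UID was assigned at start.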

@geerlingguy
Owner Author

So this is fun. If I create the following playbook inside the running ansible container of the tower-operator Pod:

- hosts: localhost
  connection: local
  gather_facts: false

  tasks:
    - name: Get the Tower web pod information.
      # TODO: Change to k8s_info after Ansible 2.9.0 is available in Operator image.
      k8s_facts:
        kind: Pod
        namespace: example-tower
        label_selectors:
          - app=tower
      register: tower_pods

    - name: Set the tower pod name as a variable.
      set_fact:
        tower_pod_name: "{{ tower_pods['resources'][0]['metadata']['name'] }}"

    - name: Verify tower_pod_name is populated.
      assert:
        that: tower_pod_name != ''
        fail_msg: "Could not find the tower pod's name."

    - name: Test a simple command.
      k8s_exec:
        namespace: example-tower
        pod: '{{ tower_pod_name }}'
        command: date
      register: date_result
    - debug: var=date_result

Then I get the result:

TASK [Test a simple command.] ******************************************************************************************
changed: [localhost]

TASK [debug] ***********************************************************************************************************
ok: [localhost] => {
    "date_result": {
        "changed": true, 
        "failed": false, 
        "stderr": "", 
        "stderr_lines": [], 
        "stdout": "Fri Nov 15 20:52:14 UTC 2019\n", 
        "stdout_lines": [
            "Fri Nov 15 20:52:14 UTC 2019"
        ]
    }
}

PLAY RECAP *************************************************************************************************************
localhost                  : ok=5    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

So it seems that something is different when it runs through ansible-runner? This is extremely puzzling.

@geerlingguy
Owner Author

@fabianvf and I were discussing this in the CoreOS Slack. It could be that the proxy the Ansible Operator sets up between K8s and the Ansible runs is intercepting the websocket request and not proxying the connection cleanly. I was glancing through https://github.com/operator-framework/operator-sdk/tree/424a61d56000e6e3d91d352faa1bd4f7c814661f/internal/scaffold/ansible and will have to dig a little deeper.

One other possibility: Install kubectl inside the operator image, and use command: kubectl exec [stuff].
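
As a rough illustration of that fallback, the wrapper below shells out to `kubectl exec` instead of going through the Python client's websocket call. `pod_exec` and the `KUBECTL` override are hypothetical names for this sketch, and it assumes `kubectl` is on the PATH inside the operator image and that `bash` exists in the target container.

```shell
# Runs a shell command string inside a pod via `kubectl exec`.
# KUBECTL can be overridden (for example with `oc`); it defaults to kubectl.
pod_exec() {
  local namespace="$1" pod="$2" cmd="$3"
  "${KUBECTL:-kubectl}" exec -n "$namespace" "$pod" -- bash -c "$cmd"
}

# Example call (TOWER_POD is a placeholder for the pod name):
#   pod_exec example-tower "$TOWER_POD" "awx-manage migrate --noinput"
```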

@geerlingguy
Owner Author

Opened an upstream issue operator-framework/operator-sdk#2204, as it does seem related to the ansible operator's proxy.

@geerlingguy
Owner Author

I have everything working—I think—to get Tower automatically installed and operating now, but using kubectl instead of k8s_exec. I'm going to work on finishing this issue up, and move the work of getting k8s_exec working into #8

@geerlingguy
Owner Author

Now when I run jobs they're never starting, and the logs on the task Pod instance seem to indicate there could be some issues:

celery.beat Removing corrupted schedule file '/var/lib/awx/beat.db': error(11, 'Resource temporarily unavailable')
...
psycopg2.errors.UndefinedColumn: column main_instancegroup.credential_id does not exist
... [much later] ...
2019-11-18 21:18:13,932 DEBUG    awx.main.scheduler Running Tower task manager.
2019-11-18 21:18:13,940 DEBUG    awx.main.scheduler Starting Scheduler
2019-11-18 21:18:14,016 DEBUG    awx.main.scheduler project_update 2 (pending) couldn't be scheduled on graph, waiting for next cycle
2019-11-18 21:18:14,066 DEBUG    awx.main.scheduler Dependent project_update 2 (pending) couldn't be scheduled on graph, waiting for next cycle
2019-11-18 21:18:14,078 DEBUG    awx.main.scheduler job 1 (pending) is blocked from running
2019-11-18 21:18:14,147 DEBUG    awx.main.scheduler Dependent project_update 2 (pending) couldn't be scheduled on graph, waiting for next cycle
2019-11-18 21:18:14,159 DEBUG    awx.main.scheduler job 3 (pending) is blocked from running
2019-11-18 21:18:14,165 DEBUG    awx.main.dispatch task 743791cf-dac7-49db-870a-a44b482b4530 is finished

And the last messages repeat over and over as it seems to be trying to kick off jobs but is not successful.
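
To see at a glance which jobs are stuck, the repeated scheduler lines can be reduced to a list of job IDs. This is only a diagnostic sketch: `blocked_jobs` is a hypothetical helper name, and it assumes the log lines keep the exact "job N (pending) is blocked from running" wording shown above.

```shell
# Reads task-container log lines on stdin and prints each distinct job ID
# that the scheduler reported as blocked, one per line, in numeric order.
blocked_jobs() {
  grep -o 'job [0-9]* (pending) is blocked from running' \
    | awk '{ print $2 }' \
    | sort -un
}

# Example call (TASK_POD is a placeholder for the task pod name):
#   kubectl logs -n example-tower "$TASK_POD" | blocked_jobs
```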

@geerlingguy
Owner Author

(For the first item, see #3).

@geerlingguy
Owner Author

It looks like in the AWX/Tower OpenShift installer, it uses a sidecar pod to provide celery... or something strange like that. It's running the command /usr/bin/launch_awx_task.sh and has the privileged context (which is a little odd... but maybe it needs it?).

So I added the privileged context, and started it up again, and now am getting:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/tasks.py", line 1255, in run
    self.pre_run_hook(self.instance, private_data_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/tasks.py", line 1761, in pre_run_hook
    raise RuntimeError(msg)
RuntimeError: The project revision for this job template is unknown due to a failed update.

And in the backend:

2019-11-18 21:30:39,791 ERROR    awx.main.tasks job 1 (running) Exception occurred while running task
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/tasks.py", line 1255, in run
    self.pre_run_hook(self.instance, private_data_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/tasks.py", line 1761, in pre_run_hook
    raise RuntimeError(msg)
RuntimeError: The project revision for this job template is unknown due to a failed update.

That seems to be related to the initial SCM sync job, which errored out with the following after I restarted the tower task container:

Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed.

@geerlingguy
Owner Author

And now everything seems to be working, after manually re-running the SCM sync job for the Demo Project...

@geerlingguy
Owner Author

[Screenshot: Screen Shot 2019-11-18 at 3 34 09 PM]

Looking good. Next up: time to delete everything and build from scratch to verify it works OOTB.

@geerlingguy
Owner Author

It takes about 10 minutes for everything to come up on first run, but the task container still runs into the following when I run the first job on it:

2019-11-18 22:19:01,692 DEBUG    awx.main.scheduler project_update 1 (pending) couldn't be scheduled on graph, waiting for next cycle

If I delete the task pod, then wait for its replacement, then monitor it, it seems to at least bump jobs from 'Pending' to 'Waiting'... and then it takes some time for new jobs to be processed. Maybe just a weird first-time setup thing. But I'll probably take a deeper look at it later. Don't want to have to be restarting the task container all the time...

Side note—one other error that occurs on startup every time:

Using /etc/ansible/ansible.cfg as config file
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ImportError: No module named psycopg2
127.0.0.1 | FAILED! => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    }, 
    "changed": false, 
    "msg": "Failed to import the required Python library (psycopg2) on example-tower-tower-task-dcbf4bdcb-k8hjg's Python /usr/bin/python. Please read module documentation and install in the appropriate location"
}

@geerlingguy
Owner Author

If this next test passes, I'm going to test that AWX still works the same, and if so, close out this issue as complete.

@geerlingguy
Owner Author

Yay, test passed! Just need to test that AWX works similarly to Tower, then I'll close the issue. Day is wrapping up so it'll have to be later or tomorrow.

@geerlingguy
Owner Author

AWX worked just fine, but also needed the task Pod to be deleted/restarted before it would start running Jobs. Strange, but whatever for now...

CI tests are now passing, too, so I'm going to go ahead and merge to master and close out this issue. Yay!
