
Can't get output from an ansible-runner traceback when run as Kubernetes work #736

Closed
AlanCoding opened this issue Feb 9, 2023 · 3 comments
Labels
type:bug Something isn't working

Comments

@AlanCoding
Member

I have created an image that you should be able to pull from ghcr.io/alancoding/bad-ee:traceback; the build steps are at:

https://github.com/AlanCoding/bad-execution-environments

You can verify its behavior easily:

(env) [alancoding@alan-red-hat bad-execution-environments]$ docker run --rm ghcr.io/alancoding/bad-ee:traceback /bin/bash -c "ansible-runner worker"
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ansible_runner/__main__.py", line 910, in main
    res = run(**run_options)
  File "/usr/local/lib/python3.8/site-packages/ansible_runner/interface.py", line 216, in run
    r = init_runner(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/ansible_runner/interface.py", line 109, in init_runner
    stream_worker = Worker(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/ansible_runner/streaming.py", line 72, in __init__
    raise Exception("surprise3")
Exception: surprise3

The key expectation of this issue is that I should be able to run this over receptor and somehow obtain the critical debugging information of "surprise3".

In AWX, I create an execution environment from this image and associate it with a job template that runs as a container group job. This calls receptor_ctl.submit_work(payload=sockout.makefile('rb'), **work_submit_kw), where work_submit_kw contains a "params" dict with a "secret_kube_pod" entry holding the pod spec, which includes that image. After submitting, the job of course fails. The details look like:

    "R7CLSe7b": {
        "Detail": "Error creating pod: container failed with exit code 1: ",
        "ExtraData": {
            "Command": "",
            "Image": "",
            "KubeConfig": "",
            "KubeNamespace": "bynb6w7iu5",
            "KubePod": "",
            "Params": "",
            "PodName": "automation-job-45-wcv7r"
        },
        "State": 3,
        "StateName": "Failed",
        "StdoutSize": 0,
        "WorkType": "kubernetes-runtime-auth"
    },

And the results:

bash-5.1$ receptorctl work results R7CLSe7b
Warning: receptorctl and receptor are different versions, they may not be compatible
ERROR: Remote unit failed: Error creating pod: container failed with exit code 1: 

This is missing the output. I can confirm that /tmp/receptor/awx_1/R7CLSe7b/ is missing a stdout file as well.

Ultimately, AWX needs to collect that output and pass it on to the user, but receptor needs to provide it before that can happen.
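To make the submission path concrete, here is a minimal sketch of the shape AWX hands to receptorctl. The helper name, the socket path, and the pod spec below are illustrative assumptions, not AWX's actual code; only the `params`/`secret_kube_pod` shape and the `kubernetes-runtime-auth` worktype come from the report above.

```python
# Sketch of the submission described above. build_work_submit_kw is a
# hypothetical helper mirroring the kwargs AWX passes to submit_work;
# the pod spec and socket path are illustrative.

def build_work_submit_kw(pod_spec):
    """params carries secret_kube_pod, which holds the full pod spec."""
    return {
        "worktype": "kubernetes-runtime-auth",
        "params": {"secret_kube_pod": pod_spec},
    }

pod_spec = {"spec": {"containers": [{"image": "ghcr.io/alancoding/bad-ee:traceback"}]}}
work_submit_kw = build_work_submit_kw(pod_spec)

# With a running receptor node, the actual submission would look roughly like:
# from receptorctl import ReceptorControl
# ctl = ReceptorControl("/tmp/receptor.sock")   # illustrative socket path
# result = ctl.submit_work(payload=sockout.makefile("rb"), **work_submit_kw)
```

When the pod's container exits non-zero, it is this work unit whose stdout ends up empty under /tmp/receptor/.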

@AlanCoding
Member Author

I'm adding two screenshots here to clarify just how bad this bug is, as presented in AWX.

Screenshot from 2023-04-03 14-24-02

Screenshot from 2023-04-03 14-23-56

The pod fails with a traceback, and we never receive that traceback at all. This falls in receptor's domain: receptor is what receives the secret_kube_pod, so receptor is what manages this pod.

In AWX's awx_k8s_reaper we already do import from awx.main.scheduler.kubernetes import PodManager, which acts on client.CoreV1Api, obtained via from kubernetes import client. From the job's name, it is able to figure out how to read the kube logs for its pod. As a painful short-term hack, I might consider doing the same in a branch of AWX's error-handling code, as this fix is sorely needed.
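The short-term hack could look roughly like the sketch below, which reads the failed pod's logs with the kubernetes Python client's real read_namespaced_pod_log call. The function name and the error-handling hook are assumptions; the API object is passed in so the logic can be exercised without a cluster.

```python
# Hedged sketch of the workaround: fetch the failed pod's logs directly.
# fetch_pod_traceback is a hypothetical helper; how AWX derives pod_name
# and namespace from the job is not shown here.

def fetch_pod_traceback(core_v1, pod_name, namespace):
    """Return the pod's container log, where the ansible-runner traceback lands."""
    try:
        return core_v1.read_namespaced_pod_log(name=pod_name, namespace=namespace)
    except Exception as exc:  # the pod may already have been reaped
        return f"could not read logs for {pod_name}: {exc}"

# In AWX the API object would come from the real client, roughly:
# from kubernetes import client, config
# config.load_incluster_config()            # or load_kube_config() outside the cluster
# core_v1 = client.CoreV1Api()
# print(fetch_pod_traceback(core_v1, "automation-job-45-wcv7r", "bynb6w7iu5"))
```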

@AaronH88
Contributor

AaronH88 commented May 8, 2023

I have been looking at this issue for the past few weeks, and finally have a PR to address it here: #776

Root cause: sometimes receptor calls the k8s API too quickly, so the pod phase comes back as "Pending" instead of "Failed". On top of that, the failure path in receptor exits without attaching the log stream, so no traceback is captured. The PR fixes both issues.
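The race can be illustrated with a small poller (in Python here, not receptor's actual Go implementation from #776): "Pending" and "Running" are non-terminal phases, so the decision must wait for "Succeeded" or "Failed", and the failure path must still read the log stream.

```python
# Illustration of the race: a phase read too early returns "Pending";
# the poller must keep waiting until a terminal phase before deciding
# the outcome. A real poller would sleep between reads.

TERMINAL_PHASES = {"Succeeded", "Failed"}

def wait_for_terminal_phase(read_phase, max_polls=50):
    """read_phase() returns the current pod phase string; poll until terminal."""
    for _ in range(max_polls):
        phase = read_phase()
        if phase in TERMINAL_PHASES:
            return phase
    raise TimeoutError("pod never reached a terminal phase")
```

Deciding "Failed" only after reaching a terminal phase is what lets the caller know it still needs to collect the container's logs before tearing the pod down.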

@AlanCoding
Member Author

I believe this has been fixed for a while by now.
