
Can't get output from an ansible-runner traceback when run as Kubernetes work #736

Closed
AlanCoding opened this issue Feb 9, 2023 · 3 comments
Labels
type:bug Something isn't working

Comments

@AlanCoding
Member

I have created an image that you should be able to pull from ghcr.io/alancoding/bad-ee:traceback; the build steps are at:

https://github.com/AlanCoding/bad-execution-environments

You can verify its behavior easily:

(env) [alancoding@alan-red-hat bad-execution-environments]$ docker run --rm ghcr.io/alancoding/bad-ee:traceback /bin/bash -c "ansible-runner worker"
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ansible_runner/__main__.py", line 910, in main
    res = run(**run_options)
  File "/usr/local/lib/python3.8/site-packages/ansible_runner/interface.py", line 216, in run
    r = init_runner(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/ansible_runner/interface.py", line 109, in init_runner
    stream_worker = Worker(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/ansible_runner/streaming.py", line 72, in __init__
    raise Exception("surprise3")
Exception: surprise3

The key expectation of this issue is that I should be able to run this over receptor and somehow obtain the critical debugging information of "surprise3".

In AWX, I create an execution environment from this image and associate it with a job template that runs as a container group job. This calls receptor_ctl.submit_work(payload=sockout.makefile('rb'), **work_submit_kw), where work_submit_kw contains a "params" dict with a "secret_kube_pod" entry holding the pod spec, which includes that image. After submitting, the job of course fails. The details look like:

    "R7CLSe7b": {
        "Detail": "Error creating pod: container failed with exit code 1: ",
        "ExtraData": {
            "Command": "",
            "Image": "",
            "KubeConfig": "",
            "KubeNamespace": "bynb6w7iu5",
            "KubePod": "",
            "Params": "",
            "PodName": "automation-job-45-wcv7r"
        },
        "State": 3,
        "StateName": "Failed",
        "StdoutSize": 0,
        "WorkType": "kubernetes-runtime-auth"
    },

And the results:

bash-5.1$ receptorctl work results R7CLSe7b
Warning: receptorctl and receptor are different versions, they may not be compatible
ERROR: Remote unit failed: Error creating pod: container failed with exit code 1: 

This is missing the output. I can confirm that /tmp/receptor/awx_1/R7CLSe7b/ is missing a stdout file as well.

Ultimately, AWX needs to collect that output and pass it on to the user, but receptor needs to provide it before that can happen.
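To make the submission path concrete, here is a minimal sketch of the shape AWX hands to receptorctl. The helper name, the socket path, and the pod spec below are illustrative assumptions, not AWX's actual code; only the `params`/`secret_kube_pod` shape and the `kubernetes-runtime-auth` worktype come from the report above.

```python
# Sketch of the submission described above. build_work_submit_kw is a
# hypothetical helper mirroring the kwargs AWX passes to submit_work;
# the pod spec and socket path are illustrative.

def build_work_submit_kw(pod_spec):
    """params carries secret_kube_pod, which holds the full pod spec."""
    return {
        "worktype": "kubernetes-runtime-auth",
        "params": {"secret_kube_pod": pod_spec},
    }

pod_spec = {"spec": {"containers": [{"image": "ghcr.io/alancoding/bad-ee:traceback"}]}}
work_submit_kw = build_work_submit_kw(pod_spec)

# With a running receptor node, the actual submission would look roughly like:
# from receptorctl import ReceptorControl
# ctl = ReceptorControl("/tmp/receptor.sock")   # illustrative socket path
# result = ctl.submit_work(payload=sockout.makefile("rb"), **work_submit_kw)
```

When the pod's container exits non-zero, it is this work unit whose stdout ends up empty under /tmp/receptor/.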

@AlanCoding
Member Author

I'm adding two screenshots here to clarify just how bad this bug is, as presented in AWX.

Screenshot from 2023-04-03 14-24-02

Screenshot from 2023-04-03 14-23-56

The pod fails with a traceback, and we never receive that traceback at all. This falls in receptor's domain: receptor is what receives the secret_kube_pod, so receptor is what manages this pod.

In AWX's awx_k8s_reaper we already do import from awx.main.scheduler.kubernetes import PodManager, which acts on client.CoreV1Api, obtained via from kubernetes import client. From the job's name, it is able to figure out how to read the kube logs for its pod. As a painful short-term hack, I might consider doing the same in a branch of AWX's error-handling code, as this fix is sorely needed.
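The short-term hack could look roughly like the sketch below, which reads the failed pod's logs with the kubernetes Python client's real read_namespaced_pod_log call. The function name and the error-handling hook are assumptions; the API object is passed in so the logic can be exercised without a cluster.

```python
# Hedged sketch of the workaround: fetch the failed pod's logs directly.
# fetch_pod_traceback is a hypothetical helper; how AWX derives pod_name
# and namespace from the job is not shown here.

def fetch_pod_traceback(core_v1, pod_name, namespace):
    """Return the pod's container log, where the ansible-runner traceback lands."""
    try:
        return core_v1.read_namespaced_pod_log(name=pod_name, namespace=namespace)
    except Exception as exc:  # the pod may already have been reaped
        return f"could not read logs for {pod_name}: {exc}"

# In AWX the API object would come from the real client, roughly:
# from kubernetes import client, config
# config.load_incluster_config()            # or load_kube_config() outside the cluster
# core_v1 = client.CoreV1Api()
# print(fetch_pod_traceback(core_v1, "automation-job-45-wcv7r", "bynb6w7iu5"))
```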

@AaronH88
Contributor

AaronH88 commented May 8, 2023

I have been looking at this issue for the past few weeks, and finally have a PR to address it here: #776

Root cause: sometimes receptor calls the k8s API too quickly, so the pod phase comes back as "Pending" instead of "Failed". On top of that, the failure path in receptor exits without attaching the log stream, so no traceback is captured. The PR fixes both issues.
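The race can be illustrated with a small poller (in Python here, not receptor's actual Go implementation from #776): "Pending" and "Running" are non-terminal phases, so the decision must wait for "Succeeded" or "Failed", and the failure path must still read the log stream.

```python
# Illustration of the race: a phase read too early returns "Pending";
# the poller must keep waiting until a terminal phase before deciding
# the outcome. A real poller would sleep between reads.

TERMINAL_PHASES = {"Succeeded", "Failed"}

def wait_for_terminal_phase(read_phase, max_polls=50):
    """read_phase() returns the current pod phase string; poll until terminal."""
    for _ in range(max_polls):
        phase = read_phase()
        if phase in TERMINAL_PHASES:
            return phase
    raise TimeoutError("pod never reached a terminal phase")
```

Deciding "Failed" only after reaching a terminal phase is what lets the caller know it still needs to collect the container's logs before tearing the pod down.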

@AlanCoding
Member Author

I believe this has been fixed for a while by now.
