Job details and Job view not working #1861

Closed
Borrelworst opened this issue May 9, 2018 · 82 comments

Comments

@Borrelworst

ISSUE TYPE
  • Bug Report
COMPONENT NAME
  • UI
SUMMARY

Job details and Job view not working properly

ENVIRONMENT
  • AWX version: 1.0.6.5
  • AWX install method: docker on linux
  • Ansible version: 2.5.2
  • Operating System: RedHat 7.4
  • Web Browser: Firefox/Chrome
STEPS TO REPRODUCE

Run any playbook. Failed and succeeded jobs are listed, but no details are shown for them.

EXPECTED RESULTS

Details from jobs

ACTUAL RESULTS

Nothing is showing, no errors, no timeouts, just nothing

ADDITIONAL INFORMATION

For example I have a failed job. When clicking on details, I can see the URL changing to:
https://awx-url/#/jobz/project/
However nothing happens. When using right mouse button and opening in new tab/page I will only get the navigation pane and a blank page.
The same happens when I click on the job itself.

Additionally, adding inventory sources works fine; however, when navigating to 'Schedule inventory sync' I can see the gear wheel spinning, but nothing happens.
I did a fresh installation today (9th May)

@matthew-hickok

I am experiencing the same issue.

@anasypany

anasypany commented May 9, 2018

What are you using for a proxy in front of AWX? Do you have your awx_web container bound to 0.0.0.0:port or 127.0.0.1:port? I was experiencing the same issue while accessing AWX behind an nginx proxy running on the Linux host, and noticed that when the proxy was disabled the job detail pages displayed properly. After I set the awx_web container to listen on 127.0.0.1, I was no longer experiencing the issue. To bind the awx_web container to 127.0.0.1, you can specify host_port=127.0.0.1:port (instead of host_port=port) in the installer inventory file.

@cstuart1

cstuart1 commented May 10, 2018

I'm having the same issue where the job details will not display (also running with a proxy in front of awx). Adjusting the awx_web container to listen on 127.0.0.1 did not resolve the issue. Prior to upgrading to 1.0.6.5 this was working properly.

ENVIRONMENT
AWX version: 1.0.6.5
AWX install method: docker on linux
Ansible version: 2.5.2
Operating System: Ubuntu 16.04
Web Browser: Firefox/Chrome

In developer tools I'm seeing this error:
WebSocket connection to 'wss://<>/websocket/' failed: WebSocket is closed before the connection is established.

where the <> is the correct URI to my instance.

"/#/jobs?job_search=page_size:20;order_by:-finished;not__launch_type:sync:1 /#/jobz/inventory/33:1". I am also using Nginx as a front-end proxy (port 443).

@Borrelworst
Author

Thanks for the tip @anasypany and for trying this solution @cstuart1. I do indeed also use nginx as a front-end proxy, as I need SSL and port 443. What I haven't tried yet is connecting directly to the awx_web container via an SSH tunnel. If the issue still persists then, it is in the application itself. I will not be able to test this today, but it will be the first thing I do tomorrow morning.

@anasypany

anasypany commented May 10, 2018

@cstuart1 Can you paste your nginx proxy config? (with environment details censored, of course)

@cstuart1

cstuart1 commented May 10, 2018

@Borrelworst
The solution here is to add in a block for the websocket in your Nginx config

location /websocket {
    proxy_pass http://x.x.x.x:80;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Upgrade";
}

@anasypany this is probably what you were going to suggest/inquire about?
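
To check whether a proxy really forwards the upgrade, one option is to send the opening handshake by hand and look at the status line: a working path answers `101 Switching Protocols`. A minimal Python sketch, assuming plain HTTP on the proxy port. The function names are made up for illustration and are not part of AWX or nginx:

```python
import base64
import os
import socket


def build_handshake(host: str, path: str, key: str) -> bytes:
    """Build the HTTP/1.1 upgrade request a browser sends to open a WebSocket."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(host: str, port: int, path: str = "/websocket/", timeout: float = 5.0) -> str:
    """Send the handshake and return the first line of the response."""
    # The key is 16 random bytes, base64-encoded, as RFC 6455 requires.
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_handshake(host, path, key))
        response = sock.recv(4096).decode("latin-1")
    return response.split("\r\n", 1)[0]
```

Calling `probe("<proxy host>", 80)` against the proxy and then against the awx_web port directly should show which hop drops the upgrade, if any.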

@anasypany

anasypany commented May 10, 2018

@cstuart1 I was able to get the job details pages working again with this simple nginx proxy config once awx_web was bound to 127.0.0.1:

location / {
    # xxxx = 80 in your case
    proxy_pass http://127.0.0.1:xxxx;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

If you try this config make sure to add HTTP_X_FORWARDED_FOR in your Remote Host Headers on AWX as well. Let me know if you have any luck!

@cstuart1

Yes, that resolved the issue for me.
I had already added HTTP_X_FORWARDED_FOR to AWX as I'm using SAML for auth.

For anyone else reading this thread and trying to set up SAML:
I also had to alter /etc/tower/settings.py (task and web) to have the following:
USE_X_FORWARDED_PORT = True
USE_X_FORWARDED_HOST = True

and restart tower after making the setting change.
This is mentioned in the tower documents, but I thought I would post it here in case someone else reads this thread.
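
For readers wondering what these two settings change: with them enabled, Django takes the request's host and port from the `X-Forwarded-Host` and `X-Forwarded-Port` headers when the proxy sends them, instead of from the values seen on the proxied connection. A simplified Python sketch of that lookup order. The function names are hypothetical, and Django's real code also validates the host and has a fuller fallback:

```python
def effective_host(meta: dict, use_x_forwarded_host: bool) -> str:
    """Pick the host roughly the way Django does, without validation."""
    if use_x_forwarded_host and "HTTP_X_FORWARDED_HOST" in meta:
        return meta["HTTP_X_FORWARDED_HOST"]
    if "HTTP_HOST" in meta:
        return meta["HTTP_HOST"]
    # Simplified fallback: Django also appends a non-default port here.
    return meta["SERVER_NAME"]


def effective_port(meta: dict, use_x_forwarded_port: bool) -> str:
    """Pick the port: the forwarded header wins only when the setting is on."""
    if use_x_forwarded_port and "HTTP_X_FORWARDED_PORT" in meta:
        return meta["HTTP_X_FORWARDED_PORT"]
    return meta["SERVER_PORT"]
```

The settings only have an effect if the proxy actually sets those headers.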

@Borrelworst
Author

@cstuart1: That indeed solved the issue. I have not explicitly bound awx_web to 127.0.0.1, and apparently that is not needed. The only issue I still see is that when I go to my custom inventory scripts and click on schedule inventory syncs, I just see the cog wheel, but nothing happens. This is also described in #1850.

@nmpacheco

I am also experiencing problems with job details. I deployed a stack with postgres, rabbitmq, memcache, awx_web and awx_task in a swarm (ansible role to check variables, create dirs, instantiating a docker-compose template, deploy and so on). I am using vfarcic docker-flow to provide access to all the services in the swarm and to automatically detect changes in the configuration and reflect those changes in the proxy configuration. Within this stack, only awx_web is provided access outside the swarm with the docker-flow stack.
All works well except that the websocket of the job listing and details works only during rare intervals, usually after repeatedly killing daphne and nginx inside the awx_web container.
Debugging in the browser, I can see a bunch of websocket upgrades being tried, all of them failing with "502 Bad Gateway" after 5 to 6 seconds. At the same time, for each of the failing websocket attempts, a message like the one below appears in the awx_web log:

2018/05/16 23:36:18 [error] 31#0: *543 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: <internal proxy ip>, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "<my specific virtual host>"

Occasionally, the following messages are also printed in the same log:

127.0.0.1:59526 - - [16/May/2018:19:22:54] "WSCONNECTING /websocket/" - -
127.0.0.1:59526 - - [16/May/2018:19:22:54] "WSCONNECT /websocket/" - -
127.0.0.1:59526 - - [16/May/2018:19:22:55] "WSDISCONNECT /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:55] "WSCONNECTING /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:55] "WSCONNECT /websocket/" - -
127.0.0.1:59536 - - [16/May/2018:19:22:56] "WSDISCONNECT /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:06] "WSCONNECTING /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:06] "WSCONNECT /websocket/" - -
127.0.0.1:59976 - - [16/May/2018:19:23:21] "WSDISCONNECT /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:23:27] "WSCONNECTING /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:23:27] "WSCONNECT /websocket/" - -
127.0.0.1:60994 - - [16/May/2018:19:25:05] "WSDISCONNECT /websocket/" - -
127.0.0.1:34510 - - [16/May/2018:22:42:34] "WSDISCONNECT /websocket/" - -
127.0.0.1:34710 - - [16/May/2018:22:42:43] "WSCONNECTING /websocket/" - -
127.0.0.1:34710 - - [16/May/2018:22:42:48] "WSDISCONNECT /websocket/" - -
127.0.0.1:34794 - - [16/May/2018:22:42:57] "WSCONNECTING /websocket/" - -
127.0.0.1:34794 - - [16/May/2018:22:43:02] "WSDISCONNECT /websocket/" - -
(...)
127.0.0.1:35964 - - [16/May/2018:23:35:48] "WSDISCONNECT /websocket/" - -
127.0.0.1:37394 - - [16/May/2018:23:35:52] "WSCONNECTING /websocket/" - -
127.0.0.1:37312 - - [16/May/2018:23:35:52] "WSDISCONNECT /websocket/" - -
127.0.0.1:37412 - - [16/May/2018:23:35:57] "WSCONNECTING /websocket/" - -
127.0.0.1:37394 - - [16/May/2018:23:35:57] "WSDISCONNECT /websocket/" - -

The haproxy config generated by docker-flow for this service (awx_web) is:

frontend services
(...)
    acl url_awx-stack_awxweb8052_0 path_beg /
    acl domain_awx-stack_awxweb8052_0 hdr_beg(host) -i <my specific virtual host>
    use_backend awx-stack_awxweb-be8052_0 if url_awx-stack_awxweb8052_0 domain_awx-stack_awxweb8052_0
(...)
backend awx-stack_awxweb-be8052_0
    mode http
    http-request add-header X-Forwarded-Proto https if { ssl_fc }
    http-request add-header X-Forwarded-For %[src]
    http-request add-header X-Client-IP %[src]
    http-request add-header Upgrade "websocket"
    http-request add-header Connection "upgrade"
    server awx-stack_awxweb awx-stack_awxweb:8052

It is very similar to a bunch of other services in the swarm.
As far as I can understand, the upstream referenced in the message above is daphne inside the awx_web container; that daphne instance listens on http://127.0.0.1:8051 and is "called" by the proxy configuration of the nginx that also runs inside the same container. I am currently investigating how one can troubleshoot daphne.
I would appreciate if anyone can help me with some ideas or guidelines to proceed with the investigations.
Thanks!
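
The WSCONNECTING / WSCONNECT / WSDISCONNECT lines quoted above are easier to read once grouped per peer. A small Python sketch, assuming the log format is exactly as pasted in this comment; the function name is invented for illustration. It reports how long each connection lasted and whether the handshake ever completed:

```python
import re
from datetime import datetime

# Matches lines such as:
# 127.0.0.1:59526 - - [16/May/2018:19:22:54] "WSCONNECTING /websocket/" - -
LINE_RE = re.compile(
    r'^(?P<peer>\S+) - - \[(?P<ts>[^\]]+)\] "(?P<event>WS\w+) (?P<path>\S+)"'
)


def connection_lifetimes(lines):
    """Map each peer to (seconds between first and last event, handshake done)."""
    first, last, connected = {}, {}, set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a daphne websocket log line
        peer = match["peer"]
        when = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S")
        first.setdefault(peer, when)
        last[peer] = when
        if match["event"] == "WSCONNECT":
            connected.add(peer)
    return {
        peer: ((last[peer] - first[peer]).total_seconds(), peer in connected)
        for peer in first
    }
```

On the excerpt above, the later peers show WSCONNECTING followed by WSDISCONNECT about five seconds later with no WSCONNECT in between, which lines up with the failures seen in the browser.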

@leweafan

I'm experiencing the same issue

ENVIRONMENT
AWX version: 1.0.6.8
AWX install method: docker on linux
Ansible version: 2.5.2
Operating System: Debian 9
Web Browser: Firefox/Chrome

@mkoshevoi

I have the same issue too.

@Rpera

Rpera commented May 29, 2018

Hi, I had the same issue and I was able to get the job output by running this command to fix the permissions:

  • chmod 744 -R /opt/awx/embedded

@matthew-hickok

Since most of these comments are related to proxy configurations, I should probably mention that I have the same issue but I do not have a proxy in front of mine.

@SatiricFX

I'm experiencing the same issue as well. Initially it works fine. I noticed that restarting the containers/docker resolves the issue. I will monitor it to see whether the issue occurs again, which I assume it will.

@cavamagie

Same error.
I use nginx with a configuration similar to @anasypany's:

location / {
    proxy_pass http://127.0.0.1:8052;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

but I'm unable to see the job.

@bogdansharuk

bogdansharuk commented Jun 13, 2018

@cavamagie

ENVIRONMENT

  • AWX version: 1.0.6.15
  • AWX install method: docker on linux
  • Ansible version: 2.5.4
  • Operating System: Debian 9
  • Web Browser: Firefox/Chrome

cat awx/installer/inventory

host_port=127.0.0.1:9999

location / {
    proxy_pass http://127.0.0.1:9999/;
    proxy_http_version 1.1;
    proxy_set_header Host               $host;
    proxy_set_header X-Real-IP          $remote_addr;
    proxy_set_header X-Forwarded-For    $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto  $scheme;
    proxy_set_header Upgrade            $http_upgrade;
    proxy_set_header Connection         "upgrade";
} 

It works for me

@sudomateo

sudomateo commented Jun 15, 2018

@cstuart1 Do you think we can chat out of band regarding SAML setup with AWX? I've been at this for hours with no success.

Edit: I commented on #1016 with details on how to configure AWX for use with SAML auth.

@piroux
Contributor

piroux commented Jun 18, 2018

Same issues
@SatiricFX I have noticed the same thing: restarting the docker containers usually helps.
Moreover, I am not using any proxy nor https access.

@SatiricFX

@piroux That does resolve it for us as well temporarily. Haven't found a permanent fix for it. Maybe a bug.

@strawgate

strawgate commented Jun 27, 2018

It appears you can swap the supervisor.conf and add verbose output to daphne:

[program:daphne]
command = /var/lib/awx/venv/awx/bin/daphne -b 127.0.0.1 -p 8051 awx.asgi:channel_layer -v 2

With this I am seeing the following behavior related to websockets from Daphne/nginx:

2018-06-27 03:18:59,295 DEBUG    Upgraded connection daphne.response.XbupPxYRcS!BfsxXxiUPF to WebSocket daphne.response.XbupPxYRcS!ReBXomhGtg
RESULT 2
OKREADY
10.255.0.2 - - [27/Jun/2018:03:19:02 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:03,491 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!ReBXomhGtg
2018-06-27 03:19:21,372 DEBUG    Upgraded connection daphne.response.XbupPxYRcS!aPmLgJGDZd to WebSocket daphne.response.XbupPxYRcS!hTzJudfDoM
10.255.0.2 - - [27/Jun/2018:03:19:24 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:25,571 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!hTzJudfDoM
2018-06-27 03:19:50,862 DEBUG    Upgraded connection daphne.response.XbupPxYRcS!lnvEJzPynj to WebSocket daphne.response.XbupPxYRcS!XCyaFNijYM
10.255.0.2 - - [27/Jun/2018:03:19:53 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018-06-27 03:19:53,999 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!XCyaFNijYM
RESULT 2
OKREADY

This eventually logs:

2018-06-27 03:34:03,939 WARNING  dropping connection to peer tcp4:127.0.0.1:34576 with abort=True: WebSocket opening handshake timeout (peer did not finish the opening handshake in time)
10.255.0.2 - - [27/Jun/2018:03:34:03 +0000] "GET /websocket/ HTTP/1.1" 502 575 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17682"
2018/06/27 03:34:03 [error] 32#0: *147 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.255.0.2, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "localhost:8080"
2018-06-27 03:34:03,941 DEBUG    WebSocket closed for daphne.response.XbupPxYRcS!gbrIRtuqeq
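
In these logs, 499 is nginx's non-standard status for a client that closed the connection before a response was sent, and 502 means nginx got no valid answer from its upstream (daphne on 127.0.0.1:8051 here). To see how the two are distributed over a longer log, a tally like the following can help. This is a rough Python sketch that assumes the access-log format shown above; the function name is made up:

```python
import re
from collections import Counter

# Matches the request and status part of an nginx access-log line, e.g.
# "GET /websocket/ HTTP/1.1" 499 0
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) ')


def websocket_status_counts(lines, path="/websocket/"):
    """Count HTTP status codes for GET requests to the websocket path."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        # Error-log and daphne lines do not match and are skipped.
        if match and match["path"] == path:
            counts[match["status"]] += 1
    return counts
```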

@DBLaci

DBLaci commented Jun 28, 2018

awx_web:1.0.6.23 here:

10.255.0.2 - - [28/Jun/2018:13:31:14 +0000] "GET /websocket/ HTTP/1.1" 502 575 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36"
2018/06/28 13:31:14 [error] 25#0: *440 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 10.255.0.2, server: _, request: "GET /websocket/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/", host: "awx.prmrgt.com:80"
10.255.0.2 - - [28/Jun/2018:13:31:19 +0000] "GET /websocket/ HTTP/1.1" 499 0 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36"

and so on: the websocket is simply not working. The same reverse proxy configuration worked before (with 1.0.3.29, for example). The nginx config is fine:

      location / {
        proxy_pass http://10.20.1.100:8053/;
        proxy_http_version 1.1;
        proxy_set_header   Host               $host:$server_port;
        proxy_set_header   X-Real-IP          $remote_addr;
        proxy_set_header   X-Forwarded-For    $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto  $scheme;
        proxy_set_header   Upgrade            $http_upgrade;
        proxy_set_header   Connection         "upgrade";
      }

I appended these lines to /etc/tower/settings.py:

USE_X_FORWARDED_PORT = True
USE_X_FORWARDED_HOST = True

I found ansible/awx_web:1.0.6.11 is the latest image working fine for me (this means the websocket reverse proxy settings are fine outside the awx_web!). I hope this helps.

Please note the settings.py changes are not needed for 1.0.6.11 to work. I don't see any impact whether I set those or not.

@josemgom

I am also facing the same issue.

ENVIRONMENT

  • AWX version: 1.0.6.11
  • AWX install method: docker on linux
  • Ansible version: 2.5.7
  • Operating System: CentOS 7
  • Web Browser: Firefox/Chrome

The only workaround that currently works for me is stopping everything and starting the containers again.

@strawgate

strawgate commented Jul 2, 2018

This issue does not appear to occur for a little while after redeploying AWX.

I did however notice that none of the job details from while this issue is occurring are available, even after you restart. It appears as though the "stdout" response on the API is populated via the task container posting data to a websocket for that job.

I also noticed that when the issue is occurring that the task container fails with the following errors:

[2018-07-02 19:03:47,717: DEBUG/Worker-4] using channel_id: 2
2018-07-02 19:03:47,718 ERROR    awx.main.models.unified_jobs job 15 (running) failed to emit channel msg about status change
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/awx/main/models/unified_jobs.py", line 1169, in _websocket_emit_status
    emit_channel_notification('jobs-status_changed', status_data)
  File "/usr/lib/python2.7/site-packages/awx/main/consumers.py", line 70, in emit_channel_notification
    Group(group).send({"text": json.dumps(payload, cls=DjangoJSONEncoder)})
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/channels/channel.py", line 88, in send
    self.channel_layer.send_group(self.name, content)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 190, in send_group
    self.send(channel, message)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 95, in send
    self.recover()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/asgi_amqp/core.py", line 77, in recover
    self.tdata.consumer.revive(self.tdata.connection.channel())
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/connection.py", line 255, in channel
    chan = self.transport.create_channel(self.connection)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 92, in create_channel
    return connection.channel()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/connection.py", line 282, in channel
    return self.Channel(self, channel_id)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/channel.py", line 101, in __init__
    self._x_open()
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/channel.py", line 427, in _x_open
    self._send_method((20, 10), args)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
    self.channel_id, method_sig, args, content,
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
    write_frame(1, channel, payload)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
    frame_type, channel, size, payload, 0xce,
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 104] Connection reset by peer

This would explain why the job details from jobs that ran while the websockets were not working aren't visible even after restarting the web/task container, and why they aren't available when hitting the stdout resource on the job endpoint.

@stmarier

stmarier commented Jul 3, 2018

I ran into this issue as well and resolved it by stopping both the web and task containers and rerunning the installer playbook to start them again.

@boris-42

@ryanpetrello

  • We are using official image 1.0.7.2
  • Web browser is not the problem (we tried different browsers on different operating systems)

Some observation:

  • if we curl this "api/v2/jobs/<JOB_ID>/stdout/" it's empty
  • After restart of awx web and awx task it gets populated
  • In the logs of awx task we see " File "/usr/lib/python2.7/site-packages/awx/main/models/unified_jobs.py", line 1169, in _websocket_emit_status", the same as in one of the comments above
  • After restarting it works for ~15 minutes
  • Seems like problem between awx-task and rabbitmq...

@ryanpetrello
Contributor

It sounds to me like job events aren't being saved into the database. This can be caused by a number of things. Do you see anything when you visit /api/v2/jobs/N/event/?

@boris-42

boris-42 commented Aug 31, 2018

@ryanpetrello I suspect you meant job_events.

it returns

{
  "count": 0, 
  "next": null, 
  "previous": null, 
  "results": []
}

If I restart awx-task and awx-web this information gets populated. And it continues working until we see in awx-task that log message related to rabbitmq
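
The same check can be scripted so it is easy to repeat after each job run. A minimal Python sketch, assuming the endpoint is `/api/v2/jobs/<id>/job_events/` and that the response body is fetched separately (authentication is left out). Both function names are hypothetical:

```python
import json
from urllib.parse import urljoin


def job_events_url(base_url: str, job_id: int) -> str:
    """Build the job events URL for one job under the given AWX base URL."""
    return urljoin(base_url.rstrip("/") + "/", f"api/v2/jobs/{job_id}/job_events/")


def has_events(response_body: str) -> bool:
    """True if the paginated API response reports at least one event."""
    return json.loads(response_body)["count"] > 0
```

A job that has finished but still returns `"count": 0` is the symptom described here.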

@ryanpetrello
Contributor

Yep, that's exactly what I meant, thanks :)

In your awx task container, can you run:

supervisorctl -c /supervisor_task.conf status

@boris-42

boris-42 commented Aug 31, 2018

@ryanpetrello

bash-4.2$ supervisorctl -c /supervisor_task.conf status
awx-config-watcher                  RUNNING   pid 195, uptime 12:38:18
tower-processes:callback-receiver   RUNNING   pid 199, uptime 12:38:18
tower-processes:celery              RUNNING   pid 196, uptime 12:38:18
tower-processes:celery-watcher      RUNNING   pid 198, uptime 12:38:18
tower-processes:channels-worker     RUNNING   pid 197, uptime 12:38:18

@boris-42

boris-42 commented Sep 1, 2018

@ryanpetrello

Some more information:

  • If I create schedule and run jobs every 3-5 minutes it works perfectly
  • If I create schedule and run jobs with gap of 20 minutes it stops working

@boris-42

boris-42 commented Sep 2, 2018

@ryanpetrello Some more details. The bug reproduces on many versions of AWX.

If I run /usr/bin/awx-manage run_callback_receiver in the task container

all results get sent to the database...

More interesting thing is this piece of code:
https://github.com/ansible/awx/blob/devel/awx/main/management/commands/run_callback_receiver.py#L233-L238

If something happens to rabbitmq and the connection breaks, it is not recreated. On the other hand, there is a large try/except in the code that uses the connection, which keeps run_callback_receiver from crashing, so supervisor never gets the chance to bring it back...
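
The failure mode suspected here can be shown with a toy model. The Python sketch below is not AWX's callback receiver code, and every name in it is invented. It only contrasts a loop that swallows errors on a dead connection with one that recreates the connection:

```python
class FlakyConnection:
    """Stand-in for a broker connection that breaks after `good_reads` reads."""

    def __init__(self, good_reads):
        self.good_reads = good_reads

    def read(self):
        if self.good_reads <= 0:
            raise ConnectionResetError(104, "Connection reset by peer")
        self.good_reads -= 1
        return "event"


def consume_swallowing(connect, attempts):
    """Broad try/except around one long-lived connection: after the first
    reset every later read fails too, but the loop never exits."""
    conn = connect()
    received = 0
    for _ in range(attempts):
        try:
            conn.read()
            received += 1
        except Exception:
            pass  # error hidden; the dead connection is reused
    return received


def consume_reconnecting(connect, attempts):
    """Same loop, but a failed read replaces the connection before retrying."""
    conn = connect()
    received = 0
    for _ in range(attempts):
        try:
            conn.read()
            received += 1
        except ConnectionError:
            conn = connect()  # recreate the broken connection
    return received
```

In the first variant the process stays up, so a supervisor sees nothing to restart, yet no further events arrive. In the second, events resume after each reset.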

@ryanpetrello
Contributor

ryanpetrello commented Sep 5, 2018

@boris-42 the example you linked is catching KeyboardInterrupt - I'd expect the callback receiver to gracefully handle and recover from AMQP unavailability in the way you described (testing this a bit myself).

@ryanpetrello
Contributor

ryanpetrello commented Sep 5, 2018

I'm having a hard time reproducing this by stopping RabbitMQ - the callback receiver recovers for me after stopping and starting the message broker:

image

It also seems resilient to me screwing w/ TCP via tcpkill:

image

@ryanpetrello
Contributor

ryanpetrello commented Sep 5, 2018

@boris-42 do you see any logs in the task container for the callback receiver that might provide some hints?

@josemgom

IMHO, I don't know why this issue is closed when it is still happening, even with recent versions.

@ryanpetrello
Contributor

ryanpetrello commented Sep 11, 2018

@josemgom the reason it's closed is that the original reporter described their issue and found a solution to it here: #1861 (comment)

(also, see: #1861 (comment))

The number of people chiming in on this one has generated a lot of noise; it's likely people are encountering a number of issues across a variety of configurations that are being conflated:

  • some people are using older awx versions with resolved bugs
  • some are deploying behind a proxy and needed additional X-Forwarded-For configuration
  • some have reported that things work better with a newer version of Chrome

If you're still encountering an issue with the job details page, and you're using the most recent version of awx, and none of the suggestions in this comment thread have addressed it for you, then please open a new issue with as much detail as possible about the problem you're encountering: https://github.com/ansible/awx/issues/new?template=bug_report.md

In the meantime, I and other awx maintainers are happy to help as much as possible here (see my and others' various interactions with people above) and in our IRC room on freenode (#awx-devel).

@boris-42

@ryanpetrello you are back ! =)

Steps to reproduce:

  • My production deployment is running on top of k8s and looks like this:
    -- awx-rabbitmq is a statefulset with 3 replicas
    -- memcached and postgres are 2 deployments
    -- awx-web is coupled with awx-task in the same pod as part of one deployment (there is some bug that we are still debugging that is blocking us from decoupling)
  • After deploying everything, don't touch anything for 15+ minutes
  • Run any job template (demo one for example)
  • You won't see the logs in output
  • If you restart callback receiver logs are populated
  • (if you don't run anything for next 15 minutes issue is going to be reproduced)

@ryanpetrello
Contributor

Hey @boris-42,

Do you see any logs in the task container for the callback receiver that might provide some hints? Errors/exceptions/tracebacks?

@ryanpetrello
Contributor

ryanpetrello commented Oct 9, 2018

@boris-42 @strawgate @DBLaci @nmpacheco and others who have encountered the Connection reset by peer errors: we think we might have an idea of what's causing this issue. If any of you are feeling like experimenting, could you give this PR a try in your environments to see if it improves things?

#2391

Alternatively, you could try running something like this (in all of your containers) and then restarting awx services to get the latest version:

~ /var/lib/awx/venv/awx/bin/pip uninstall asgi-amqp
~ /var/lib/awx/venv/awx/bin/pip install "asgi-amqp==1.1.2"

@boris-42

boris-42 commented Oct 9, 2018

@ryanpetrello Thanks, I'll try to patch container this weekend!

@josemgom

Thanks @ryanpetrello

I just upgraded the package in my development and production envs. I'll let you know if users are still facing this issue.

@taspotts

Running:
/var/lib/awx/venv/awx/bin/pip install -U asgi-amqp==1.1.2 brought in a newer version of kombu (4.2.1), which breaks daphne/celery badly.

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/bin/daphne", line 11, in <module>
    sys.exit(CommandLineInterface.entrypoint())
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/daphne/cli.py", line 144, in entrypoint
    cls().run(sys.argv[1:])
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/daphne/cli.py", line 174, in run
    channel_layer = importlib.import_module(module_path)
  File "/usr/lib64/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/usr/lib/python2.7/site-packages/awx/asgi.py", line 9, in <module>
    prepare_env() # NOQA
  File "/usr/lib/python2.7/site-packages/awx/__init__.py", line 55, in prepare_env
    if not settings.DEBUG: # pragma: no cover
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 56, in __getattr__
    self._setup(name)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 41, in _setup
    self._wrapped = Settings(settings_module)
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/django/conf/__init__.py", line 110, in __init__
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/usr/lib64/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/usr/lib/python2.7/site-packages/awx/settings/production.py", line 17, in <module>
    from defaults import *  # NOQA
  File "/usr/lib/python2.7/site-packages/awx/settings/defaults.py", line 7, in <module>
    import djcelery
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/djcelery/__init__.py", line 34, in <module>
    from celery import current_app as celery  # noqa
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/five.py", line 312, in __getattr__
    module = __import__(self._object_origins[name], None, None, [name])
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/_state.py", line 20, in <module>
    from celery.utils.threads import LocalStack
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/utils/__init__.py", line 405, in <module>
    from .functional import chunks, noop                    # noqa
  File "/var/lib/awx/venv/awx/lib/python2.7/site-packages/celery/utils/functional.py", line 19, in <module>
    from kombu.utils.compat import OrderedDict
ImportError: cannot import name OrderedDict

Running:
/var/lib/awx/venv/awx/bin/pip install -U asgi-amqp==1.1.2 kombu==3.0.37 and holding back kombu appears to have worked. No more Connection reset by peer errors and the job details load!

ENVIRONMENT

AWX version: 2.0.0
AWX install method: docker on linux
Ansible version: 2.6.5
Operating System: Ubuntu 18.04
Web Browser: Firefox/Chrome
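
Since the upgrade can silently pull in a different kombu, it may be worth confirming both versions afterwards. A small Python sketch that compares `pip freeze` output against the two pins mentioned in this comment; the function names and the `PINS` table are illustrative only:

```python
# Versions reported as working in the comment above.
PINS = {"asgi-amqp": "1.1.2", "kombu": "3.0.37"}


def parse_freeze(text: str) -> dict:
    """Turn `pip freeze` output into {package_name: version}."""
    versions = {}
    for line in text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:
            # Treat asgi_amqp and asgi-amqp as the same package name.
            versions[name.lower().replace("_", "-")] = version
    return versions


def pin_mismatches(installed: dict, pins: dict) -> dict:
    """Return {name: (expected, actual)} for every pin that is not satisfied."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }
```

Feeding it the output of `/var/lib/awx/venv/awx/bin/pip freeze` from each container should return an empty dict when both pins hold.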

@ryanpetrello
Contributor

@taspotts thanks for the feedback. We've merged the asgi_amqp update and are planning to release it in a new version of awx in the near future.

@ryanpetrello
Contributor

ryanpetrello commented Oct 11, 2018

@boris-42 @strawgate @DBLaci @nmpacheco and others who have encountered the Connection reset by peer errors: we've released a new version of awx, 2.0.1, which we believe should resolve this issue. Please give it a shot and let us know if you continue to encounter issues!

@nightvisi0n

I also had this error and verified that it was fixed in the latest released docker-image.
Thanks for addressing this issue!

@wenottingham
Contributor

Closing this, please reopen if it persists.

@boris-42

@ryanpetrello thanks for fixing this, I checked it finally yesterday, everything works.
