Intermittent GHA Listener Failures  #3682

@jb-2020

Description

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

  1. Deploy the Helm charts.
  2. Wait for the listener to restart.

Describe the bug

Intermittently, some of our listener pods become unresponsive for 15-20 minutes, which surfaces as long queue times for workflows. This occurs ~4 times a day and is usually correlated with load on the GHES server. It seems to happen in 'waves', impacting roughly 90% of our listeners.

Observed behavior:

  1. The listener throws `context deadline exceeded (Client.Timeout exceeded while awaiting headers)`. This error is repeated 3 times with 5-minute pauses between the events.
  2. The listener throws `read tcp <REDACTED>:41054-><REDACTED>:443: read: connection timed out`.
  3. One of the following occurs:
    • The controller restarts the listener pod and it comes back healthy.
    • No error message is thrown and the listener continues on as expected.
    • The listener throws `Message queue token is expired during GetNextMessage, refreshing...` and continues on as expected.

During step 1 the listener is not functional, which accounts for the 15-20 minute downtime (three timeout events spaced roughly 5 minutes apart).

Should this timeout be set to 1 minute? Is 5 minutes too long?
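
For reference, the step 1 error string matches what Go's net/http client reports when its overall request timeout expires before response headers arrive. Below is a minimal, self-contained sketch (not ARC's actual listener code; the stub server and the 2-second timeout are purely illustrative) that reproduces the same message and shows where a shorter deadline, e.g. 1 minute, would live:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// Stub server that accepts the connection but never sends response
	// headers, mimicking a long poll that stalls under GHES load.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		time.Sleep(10 * time.Second) // never answer within the client timeout
	}()

	// The listener presumably uses something like http.Client.Timeout for the
	// long poll; 2 seconds here just keeps the demo short. A 1-minute value
	// would surface a stalled connection far sooner than the 5 minutes we see.
	client := &http.Client{Timeout: 2 * time.Second}
	_, err = client.Get("http://" + ln.Addr().String())
	fmt.Println(err)
	// Prints: Get "http://127.0.0.1:...": context deadline exceeded
	// (Client.Timeout exceeded while awaiting headers)
}
```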

Note: We do not observe any other connectivity issues with our GHES instance. We are still investigating our connectivity to GHES, the resiliency of the server, and its compatibility with HTTP long polling. That said, I think there is an opportunity here to make the listeners more resilient to networking blips; a rough sketch of what I mean follows.
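
As an illustration only: short per-attempt deadlines with brief backoff between failures, so a transient blip costs seconds instead of minutes. getNextMessage, the attempt count, and the backoff values are hypothetical placeholders, not the listener's real API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// getNextMessage stands in for the listener's long-poll call; it is a
// hypothetical placeholder, not the real client method.
func getNextMessage(ctx context.Context) (string, error) {
	return "", errors.New("context deadline exceeded")
}

// pollWithRetry gives each long-poll attempt a 1-minute deadline and backs
// off briefly between failures instead of pausing for 5 minutes, so a
// transient network blip is retried within seconds.
func pollWithRetry(ctx context.Context) (string, error) {
	backoff := 5 * time.Second
	for attempt := 1; attempt <= 5; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, time.Minute)
		msg, err := getNextMessage(attemptCtx)
		cancel()
		if err == nil {
			return msg, nil
		}
		fmt.Printf("attempt %d failed: %v; retrying in %s\n", attempt, err, backoff)
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return "", ctx.Err()
		}
		backoff *= 2
	}
	return "", errors.New("long poll failed after retries")
}

func main() {
	if _, err := pollWithRetry(context.Background()); err != nil {
		fmt.Println("giving up:", err)
	}
}
```

Something along these lines (ideally driven by the existing listener configuration) would keep a single stalled poll from blocking the queue for the full 15-20 minutes.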

Describe the expected behavior

The listener is not restarted by the controller and doesn't become unresponsive for 15-20 minutes.

Additional Context

GHES Version: 3.9

Controller Logs

Listener logs: https://gist.github.com/jb-2020/13f246a361f039a54733f90f270eeafa

Controller logs: https://gist.github.com/jb-2020/18c6f276fd351e4f09ac894e545258e6

Runner Pod Logs

N/A


Labels

bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)
