@coqbot could report failures from GitLab CI and restart spurious ones #3

Closed
Zimmi48 opened this issue May 20, 2018 · 8 comments

@Zimmi48
Member

Zimmi48 commented May 20, 2018

When a GitLab CI pipeline completes with failures, first check if the pipeline is up-to-date with respect to the PR.

If yes, check if some of the failures are spurious:

  • "runner system failure" detected by GitLab, example error message: ERROR: Job failed (system failure): Cannot connect to the Docker daemon at tcp://10.142.0.123:2376. Is the docker daemon running?
  • connection trouble:
    error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500 Internal Server Error
    fatal: The remote end hung up unexpectedly
    ERROR: Job failed: exit code 1
    
  • uploading artifacts failed:
    Uploading artifacts to coordinator... ok            id=71082452 responseStatus=201 Created token=G7Azf-fN
    ERROR: Job failed: exit code 1
    

If some failures are spurious, restart the corresponding jobs.
(Spurious failures that we have control over should be fixed at the source instead.)

Otherwise, post a message in the PR thread with the last few lines of the failing job logs and direct links to these logs.
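
The procedure above could be sketched roughly as follows. This is only an illustration, not the bot's actual code: the regular expressions are derived from the three example messages quoted above, and `FailedJob`, `classify_failure`, `last_lines` and `plan_actions` are hypothetical names. It also assumes one reading of the description, namely that the decision is made per job (retry the spurious ones, report the others). Note that the artifact-upload pattern may also match genuine failures in jobs that upload artifacts on failure, so it would need refinement.

```python
import re
from dataclasses import dataclass

# Patterns derived from the example error messages quoted in this issue.
# The exact wording GitLab emits may differ, so treat them as a starting point.
SPURIOUS_PATTERNS = {
    "runner system failure": re.compile(r"ERROR: Job failed \(system failure\)"),
    "connection trouble": re.compile(
        r"RPC failed; HTTP 5\d\d|The remote end hung up unexpectedly"
    ),
    "artifact upload failed": re.compile(
        r"Uploading artifacts to coordinator\.\.\..*\n\s*ERROR: Job failed: exit code 1"
    ),
}


@dataclass
class FailedJob:
    """Hypothetical record for a failed CI job."""
    job_id: int
    url: str
    trace: str


def classify_failure(trace):
    """Return the name of the first spurious pattern found in the trace, else None."""
    for name, pattern in SPURIOUS_PATTERNS.items():
        if pattern.search(trace):
            return name
    return None


def last_lines(trace, n=10):
    """Return the last n lines of a job trace."""
    return "\n".join(trace.rstrip().splitlines()[-n:])


def plan_actions(pipeline_sha, pr_head_sha, failed_jobs, tail_lines=10):
    """Decide what to do with each failed job of a finished pipeline.

    Returns a list of ("retry", job_id, reason) or ("report", job_id, message).
    A pipeline that did not run on the PR's current head is ignored.
    """
    if pipeline_sha != pr_head_sha:
        return []  # stale pipeline: neither retry nor report
    actions = []
    for job in failed_jobs:
        reason = classify_failure(job.trace)
        if reason is not None:
            actions.append(("retry", job.job_id, reason))
        else:
            message = f"Job {job.job_id} failed: {job.url}\n{last_lines(job.trace, tail_lines)}"
            actions.append(("report", job.job_id, message))
    return actions
```

Keeping the decision logic free of network calls makes it easy to test against saved job traces before wiring it to the GitLab and GitHub APIs.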

@Zimmi48 Zimmi48 added the enhancement New feature or request label May 20, 2018
@Zimmi48
Member Author

Zimmi48 commented Jun 12, 2018

This is going to be way easier than I thought thanks to the job webhook, the build trace API endpoint and the build retry API endpoint (cf. bf62f4b and https://docs.gitlab.com/ee/api/jobs.html#retry-a-job).
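
For illustration, here is a minimal sketch of how requests to the two job endpoints could be built, assuming the GitLab v4 REST API on gitlab.com and authentication with a personal access token sent in the `PRIVATE-TOKEN` header. The helper names (`job_url`, `trace_request`, `retry_request`) are hypothetical and this is not the bot's actual implementation.

```python
import urllib.parse
import urllib.request

GITLAB_API = "https://gitlab.com/api/v4"


def job_url(project, job_id, action):
    """Build the URL of a job endpoint; project is a numeric ID or a path such as 'group/name'."""
    project_ref = urllib.parse.quote(str(project), safe="")
    return f"{GITLAB_API}/projects/{project_ref}/jobs/{job_id}/{action}"


def trace_request(project, job_id, token):
    """GET request that fetches the log (trace) of a job."""
    return urllib.request.Request(
        job_url(project, job_id, "trace"),
        headers={"PRIVATE-TOKEN": token},
        method="GET",
    )


def retry_request(project, job_id, token):
    """POST request that asks GitLab to retry a job."""
    return urllib.request.Request(
        job_url(project, job_id, "retry"),
        headers={"PRIVATE-TOKEN": token},
        method="POST",
    )
```

Either request can then be sent with `urllib.request.urlopen(req)`. The functions only build the request objects, so they can be checked without network access.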

@Zimmi48
Member Author

Zimmi48 commented Jun 13, 2018

Actually, neither the webhook payload nor the trace contains enough information to tell whether the failure is due to a failing runner. Cf. https://gitlab.com/gitlab-org/gitlab-ee/issues/6408
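
To make the limitation concrete, here is a rough sketch of what a handler can extract from a job event. The field names (`object_kind`, `build_status`, `project_id`, `build_id`) reflect my understanding of the GitLab job events payload and should be checked against the webhook documentation; `failed_job_from_event` is a hypothetical helper. The payload identifies the failed job, but deciding why it failed is left to other means, such as inspecting the trace.

```python
import json


def failed_job_from_event(body):
    """Return (project_id, build_id) if the webhook body reports a failed job, else None.

    Assumes the job event payload carries 'object_kind', 'build_status',
    'project_id' and 'build_id' fields.
    """
    event = json.loads(body)
    if event.get("object_kind") != "build":
        return None  # not a job event
    if event.get("build_status") != "failed":
        return None  # job is pending, running, successful, etc.
    return event.get("project_id"), event.get("build_id")
```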

@Zimmi48
Member Author

Zimmi48 commented Jun 13, 2018

Another, unrelated obstacle to putting this into practice: we would have to stop relying on Heroku's free dynos, because the GitLab job webhook generates far too many requests for the bot to get its 7 hours of statutory sleep.

@ejgallego
Member

Please disable the report functionality until the "stale build problem" is fixed, as detailed in coq/coq#7871 (comment).

@Zimmi48
Member Author

Zimmi48 commented Jul 12, 2018

OK, this is fixed now.

@ejgallego
Member

Thanks!!!

@Zimmi48
Member Author

Zimmi48 commented Jul 16, 2018

This is basically implemented now; further enhancements can be handled in separate issues.

@Zimmi48 Zimmi48 closed this as completed Jul 16, 2018
@ejgallego
Member

Thanks for this great work.

@Zimmi48 Zimmi48 added this to the 0.1.0 milestone Sep 9, 2020