Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch slack errors and add a command to remove jobs #388

Merged
merged 2 commits into from
Jan 8, 2024

Conversation

rebkwok
Copy link
Contributor

@rebkwok rebkwok commented Jan 8, 2024

On Friday Rich tried to run an op command, but nothing happened
https://bennettoxford.slack.com/archives/CGF9TKZLG/p1704475634495869

@BennettBot status suggested that this request to run a measures preview from 2nd Jan was still running, so the new request was just scheduled. Something apparently happened that meant the job didn't report that it either failed or succeeded, but I can't see far enough back in the old logs to see if anything was logged. In any case, the "running" job blocked any attempts to retry it, so this PR adds a command to remove a job by id in case this happens again.

Although I can't see if it was the case for the 2nd Jan job, there was an instance of an uncaught error from slack

ERROR:slack_bolt.App:on_error invoked (session id: xxxxxx, error: BlockingIOError, message: [Errno 11] Resource temporarily unavailable)

This seems to be an occasional transient thing, but if we happen to encounter it when notifying slack of starting or completing a job, we don't want it to prevent the job from being marked as complete, so now it retries up to 3 times and then logs the error.

@rebkwok rebkwok force-pushed the remove-job-and-catch-slack-errors branch from a27ffde to 5082977 Compare January 8, 2024 14:19
)
return resp.data
except Exception as err:
attempts += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is is worth a small sleep() here? Might not be worth it, just wondered if you'd considered it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider it; I guess for a maximum 3 sec delay it's probably worth adding

@rebkwok rebkwok force-pushed the remove-job-and-catch-slack-errors branch from 5082977 to abff127 Compare January 8, 2024 16:44
If some unexpected error is encountered during the running of
a job, the job may be updated in the database to indicate that
it's running, but never complete or fail. This prevents the job
being retried, and any further attempts to run the job only update
the job scheduled to run after the current one is done. This adds a
command to remove specific jobs by id (identified by calling status
first).
If the slack client errors in some unexpected way, it can
cause a scheduled job to error without returning any success/fail
status, which means the job gets stuck in the "running" state, even
though it's only been reserved and isn't actually running. Now we
retry a max of 3 times and log the error if we can't do the
notification. This could mean that jobs run and never report
back, if the slack client is more than transiently broken, but
then the chances are that simple commands like help are also
broken and at least the output of jobs should be accessible in
the job logs.
@rebkwok rebkwok force-pushed the remove-job-and-catch-slack-errors branch from abff127 to 8b69286 Compare January 8, 2024 17:04
@rebkwok rebkwok merged commit 11e16e6 into main Jan 8, 2024
6 checks passed
@rebkwok rebkwok deleted the remove-job-and-catch-slack-errors branch January 8, 2024 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants