-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GH-1593] Notify Slack watchers on failed TDR snapshot jobs #590
Conversation
And converted with-redefs-fn to with-redefs in test files I touched for better readability
source/tdr-job-failed-slack-msg mock-tdr-job-failed-slack-msg | ||
slack/notify-watchers (constantly nil)] | ||
(let [metadata {:job_status "running"}] | ||
(with-redefs [datarepo/job-metadata (constantly metadata)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you brought my attention here, I do find with-redefs
easier to work with and read than with-redefs-fn
, especially in cases where wanting to nest redefinitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad to hear it.
source/tdr-job-failed-slack-msg mock-tdr-job-failed-slack-msg | ||
slack/notify-watchers (constantly nil)] | ||
(let [metadata {:job_status "running"}] | ||
(with-redefs [datarepo/job-metadata (constantly metadata)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad to hear it.
(with-redefs [datarepo/job-metadata (constantly metadata)] | ||
(is (= metadata | ||
(#'source/check-tdr-job-and-notify-on-failure workload job-id)) | ||
"Should return metadata for job with unknown status"))))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh … OK.
|
||
![](assets/staged-workload/workflow-finished-notifications.png) | ||
|
||
In the future, WFL may allow for these two notification streams | ||
to be configured separately. | ||
High-volume use cases (ex. 100s of workflows/day) may find | ||
state change notifications too noisy. | ||
some state change notifications too noisy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or … may not? (-;
Purpose
Staged workloads with
TerraDataRepoSource
snapshot new rows as they land in TDR. If/when the snapshot job succeeds, the associated snapshot is eligible for a consumption by a downstream stage (ex. imported to a Terra workspace as a snapshot reference and submitted).We've seen a few flavors of snapshot creation job failure over the past few months of running staged workloads:
workflow-launcher
Firecloud group must be granted appropriate resource permissions for snapshotting to be picked up. (As an aside, maybe these permissions are something we could check for when validating theTerraDataRepoSource
request, but that's outside the scope of this ticket.)Prompt discovery of these issues (especially 2 and 3) make it more likely that they'll be corrected within the 2 hour window of WFL's automatic retry. To get this information, I've previously created GCP Stackdriver alerts and looked at WFL logs.
In this PR I call the Slacker directly when we've registered that an active TDR snapshot creation job has failed or reached some unknown state. The messages emitted reflect feedback from key stakeholders and Jade team re: what info would help them, and include:
Changes
with-redefs-fn
->with-redefs
for better readabilityManual Verification
Here's some scratch code I ran in
wfl.source
to generate Slack notifications for a past failed snapshot creation job:And the result:
https://broadinstitute.slack.com/archives/C026PTM4XPA/p1646931430498359
System Tests
Pass:
Review Instructions
Check out above manual testing and added unit test
wfl.unit.source-test/test-check-tdr-job-and-notify-on-failure
.I didn't find it necessary to add an integration test which explicitly checks that a Slack notification has been sent, as we have that covered in the
slack
automated tests.