[GH-1593] Notify Slack watchers on failed TDR snapshot jobs #590

Merged Mar 14, 2022 (5 commits)

@okotsopoulos okotsopoulos commented Mar 10, 2022

Purpose

Staged workloads with a TerraDataRepoSource snapshot new rows as they land in TDR. If/when the snapshot job succeeds, the associated snapshot is eligible for consumption by a downstream stage (e.g. imported to a Terra workspace as a snapshot reference and submitted).

We've seen a few flavors of snapshot creation job failure over the past few months of running staged workloads:

  1. Transient issues: rows which failed to snapshot are picked up on subsequent snapshot attempts, capped at 2 hours from row discovery. No manual intervention needed.
  2. Dataset and/or billing profile permissions issues: most common for new workloads and projects. The appropriate workflow-launcher Firecloud group must be granted the necessary resource permissions before snapshotting can proceed. (As an aside, maybe these permissions are something we could check for when validating the TerraDataRepoSource request, but that's outside the scope of this ticket.)
  3. Persistent issue TDR-side: as an example, a past TDR permission change broke snapshotting and we needed them to remediate (Slack).

Prompt discovery of these issues (especially 2 and 3) makes it more likely that they'll be corrected within the 2-hour window of WFL's automatic retry. To get this information, I've previously created GCP Stackdriver alerts and looked at WFL logs.

In this PR I call the Slacker directly when we've registered that an active TDR snapshot creation job has failed or reached some unknown state. The messages emitted reflect feedback from key stakeholders and Jade team re: what info would help them, and include:

  • Snapshot job ID
  • The snapshot job result's error code and message
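As a rough illustration of the flow described above (not the code in this PR), the check could be sketched as below. Everything named here is an assumption for the example: the function names, the injected `notify!`, `fetch-metadata` and `fetch-result` arguments standing in for the Slack and TDR clients, the `:status_code` and `:message` result keys, and the set of statuses treated as healthy.

```clojure
(ns example.snapshot-notify
  (:require [clojure.string :as str]))

(defn failed-job-slack-msg
  "Hypothetical formatter: builds the text sent to Slack watchers.
   The :status_code and :message keys are assumed for illustration."
  [workload job-id result]
  (str/join \newline
            [(format "Snapshot creation job %s failed for workload %s"
                     job-id (:uuid workload))
             (format "Error code: %s" (:status_code result))
             (format "Error message: %s" (:message result))]))

(defn check-job-and-notify
  "Fetch job metadata. When the status is neither running nor succeeded
   (that is, failed or unknown), notify watchers with the job result.
   Always returns the metadata. notify!, fetch-metadata and fetch-result
   are injected stand-ins, not real WFL functions."
  [notify! fetch-metadata fetch-result workload job-id]
  (let [metadata (fetch-metadata job-id)]
    (when-not (contains? #{"running" "succeeded"} (:job_status metadata))
      (notify! workload
               (failed-job-slack-msg workload job-id (fetch-result job-id))))
    metadata))
```

Passing the clients in as arguments keeps the sketch self-contained and easy to exercise without a live TDR or Slack connection.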

Changes

  • Call the Slack notifier when we register that an active TDR snapshot creation job has failed
  • Add unit test
  • Update public-facing docs
  • In test files that I touched, convert with-redefs-fn -> with-redefs for better readability
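For readers unfamiliar with the two forms, here is a minimal comparison using a made-up `job-status` function (it is not part of WFL). Both forms temporarily rebind the Var's root value and restore it afterwards; `with-redefs` is simply the macro that expands to `with-redefs-fn`.

```clojure
(defn job-status
  "Stand-in for a function that a test wants to stub."
  []
  "real")

;; with-redefs-fn takes a map of Vars to temporary values, plus a
;; no-argument function to call while the redefinitions are in effect.
(def via-redefs-fn
  (with-redefs-fn {#'job-status (constantly "mocked")}
    (fn [] (job-status))))

;; with-redefs is the macro form: binding pairs followed by a body,
;; which reads more like let and nests without extra fn wrappers.
(def via-redefs
  (with-redefs [job-status (constantly "mocked")]
    (job-status)))
```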

Manual Verification

Here's some scratch code I ran in wfl.source to generate Slack notifications for a past failed snapshot creation job:

  (let [workload {:watchers [["slack" "C026PTM4XPA" "#hornet-slack-app-testing"]]
                  :uuid     "12c92a7c-79ac-4b0d-9617-b7cc48277459"
                  :project  "test-notify-on-failed-snapshot-job"}
        job-id   "hzycMuapS2GW2eUpt_uH-w"]
    (slack/start-notification-loop)
    (check-tdr-job-and-notify-on-failure workload job-id)) 

And the result:

[Screenshot of the resulting Slack notification]

https://broadinstitute.slack.com/archives/C026PTM4XPA/p1646931430498359

System Tests

Pass:

wm111-e35:wfl okotsopo$ make TARGET=system
export CPCACHE=/Users/okotsopo/wfl/api/.cpcache;            \
	export WFL_WFL_URL=http://localhost:3000; \
	clojure  -M:parallel-test wfl.system.v1-endpoint-test | \
	tee /Users/okotsopo/wfl/derived/api/system.log
WARNING: Specified path is external to project: ../derived/api/src
WARNING: Specified path is external to project: ../derived/api/resources

Ran 33 tests containing 374 assertions.
0 failures, 0 errors.
api system finished on Thu Mar 10 12:18:15 EST 2022
docs system finished on Thu Mar 10 12:18:15 EST 2022
functions/aou system finished on Thu Mar 10 12:18:15 EST 2022
functions/sg system finished on Thu Mar 10 12:18:15 EST 2022
helm system finished on Thu Mar 10 12:18:15 EST 2022
ui system finished on Thu Mar 10 12:18:15 EST 2022

Review Instructions

Check out the manual testing above and the added unit test wfl.unit.source-test/test-check-tdr-job-and-notify-on-failure.

I didn't find it necessary to add an integration test that explicitly checks that a Slack notification has been sent, as we have that covered in the Slack automated tests.

source/tdr-job-failed-slack-msg mock-tdr-job-failed-slack-msg
slack/notify-watchers (constantly nil)]
(let [metadata {:job_status "running"}]
(with-redefs [datarepo/job-metadata (constantly metadata)]
Contributor Author:

Since you brought my attention here, I do find with-redefs easier to work with and read than with-redefs-fn, especially when nesting redefinitions.

Contributor:

Glad to hear it.

@okotsopoulos okotsopoulos marked this pull request as ready for review March 10, 2022 18:37

(with-redefs [datarepo/job-metadata (constantly metadata)]
(is (= metadata
(#'source/check-tdr-job-and-notify-on-failure workload job-id))
"Should return metadata for job with unknown status")))))))
Contributor:

Uh … OK.


![](assets/staged-workload/workflow-finished-notifications.png)

In the future, WFL may allow for these two notification streams
to be configured separately.
High-volume use cases (ex. 100s of workflows/day) may find
- state change notifications too noisy.
+ some state change notifications too noisy.
Contributor:

Or … may not? (-;

@okotsopoulos okotsopoulos merged commit 58bfbba into develop Mar 14, 2022
@okotsopoulos okotsopoulos deleted the okotsopo/GH-1593-slack-snapshot-fail branch March 14, 2022 15:26
okotsopoulos added a commit that referenced this pull request Mar 21, 2022
GH-1592 GH-1635 Notify Slack watchers on Terra submission creation (#591)
GH-1593 Notify Slack watchers on failed TDR snapshot jobs (#590)
GH-1618 Add workload info to executor logs (#589)
GH-1633 Remove WFL_SLACK_ENABLED feature switch (#587)
okotsopoulos added a commit that referenced this pull request Mar 22, 2022
GH-1592 GH-1635 Notify Slack watchers on Terra submission creation (#591)
GH-1593 Notify Slack watchers on failed TDR snapshot jobs (#590)
GH-1618 Add workload info to executor logs (#589)
GH-1633 Remove WFL_SLACK_ENABLED feature switch (#587)