feat: add alert for publishing @next version #2297

0618 · 2022-07-14T22:39:34Z

Description of changes

Issue #, if available

https://app.asana.com/0/1201736086077862/1201914591116096/f

Description of how you validated changes

This PR adds metrics for the successes and failures of the after-merge @next tag publishing events in the Prod account.

TODO:

manually create an alarm to trigger Sev3 tickets
Ideally, we need to migrate the metrics and alarms to CDK and deploy them to the canary accounts

Checklist

PR description included
yarn test passes
Tests are updated
No side effects or sideEffects field updated
Relevant documentation is changed or added (and PR referenced)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

changeset-bot · 2022-07-14T22:39:36Z

⚠️ No Changeset found

Latest commit: acdce5d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

ErikCH · 2022-07-15T17:10:29Z

.github/workflows/run-and-test-builds.yml

          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
-      - run: aws cloudwatch put-metric-data --metric-name RunTimeTestsFailure --namespace GithubCanaryApps --value 0
+      - run: aws cloudwatch put-metric-data --metric-name RunTimeTestsSuccess --namespace GithubCanaryApps --value 0


We don't have a RunTimeTestsSuccess metric set up with any alarms right now. Is the idea that we'll use this in the future to alarm if it's missing for a certain period of time?

From my understanding, this line will create a RunTimeTestsSuccess metric and we'll need to manually set up the alarm from the metric. Is that right?

Looking at this further, this might be an issue. The way the alarm works is that if RunTimeTestsFailure > 0 for 2 data points within 40 minutes, the alarm is triggered. When a success occurs, RunTimeTestsFailure is sent a 0. If a 0 is never sent, won't RunTimeTestsFailure stay above 0 forever?

Perhaps you have to change the additional configuration to have missing data treated as being good.

Or change this back to RunTimeTestsFailure

I'm with @ErikCH on this one: is this change required? It would break the existing metric, because we've been relying on the values 0 and 1 to determine the alarm state.

Awesome point! Thanks for catching it. That means I need to add the success action back to the publish workflow as well.

Just pushed a new commit 😁
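
As a rough illustration of the 0/1 convention discussed in this thread, the sketch below maps a job result to the value sent to a single failure metric. The metric name, namespace, and `put-metric-data` flags are taken from the diff above; the helper names `failure_metric_value` and `put_metric_cmd` are hypothetical and not part of the workflow. The command is printed rather than executed, so nothing is sent to CloudWatch.

```shell
#!/bin/sh
# Map a job result to the value sent to the single failure metric:
# 0 for success, 1 for anything else. (Hypothetical helper.)
failure_metric_value() {
  case "$1" in
    success) echo 0 ;;
    *)       echo 1 ;;
  esac
}

# Print the put-metric-data command for a metric name and job result.
# Printing instead of executing keeps this sketch free of side effects.
put_metric_cmd() {
  metric_name="$1"
  job_result="$2"
  echo "aws cloudwatch put-metric-data --metric-name $metric_name --namespace GithubCanaryApps --value $(failure_metric_value "$job_result")"
}

put_metric_cmd RunTimeTestsFailure success
put_metric_cmd RunTimeTestsFailure failure
```

Because both results write to the same metric, an alarm on RunTimeTestsFailure > 0 sees a 0 datapoint after a successful run, which is the behavior the reviewers wanted to preserve.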

ErikCH · 2022-07-15T17:10:57Z

.github/workflows/test-deploy-main.yml

+          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws-region: us-east-1
+      - run: aws cloudwatch put-metric-data --metric-name publishSuccess --namespace GithubCanaryApps --value 0


Same here with these two alarms. Will the success metric alarm on missing data?

Same. I think this line is only for metrics

We configured canary to trigger alarm on missing data.

I don't think we can do that here though, because publishing to @next does not happen on a regular schedule. IMO we should have @next trigger the alarm only on failure data and ignore missing data.

I thought this PR would add a canary for the failures? Not sure about the missing data, but can we take a look once it's added?

Right, my thought here is that we shouldn't even need to add success metrics here. We can safely delete trigger-success-alarm and only have trigger-failure-alarm job.

Good point! I just removed the trigger-success-alarm 😁
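
The alarm behavior suggested in this thread (alarm on failure data only, treat missing data as healthy) might be configured roughly as sketched below. This is a sketch under assumptions, not the alarm the team actually created: the alarm name, statistic, period, and evaluation settings are illustrative guesses, and no alarm action (such as the Sev3 ticket hookup) is included because the thread does not describe it. Only the metric name and namespace come from the diff. The script prints the command by default so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical alarm name; the thread does not say what the real alarm is called.
ALARM_NAME="${ALARM_NAME:-publish-next-failure}"

# Build the put-metric-alarm command for the publishFailure metric.
# Period, statistic, and evaluation periods are assumed values.
publish_failure_alarm_cmd() {
  echo "aws cloudwatch put-metric-alarm" \
    "--alarm-name $ALARM_NAME" \
    "--namespace GithubCanaryApps" \
    "--metric-name publishFailure" \
    "--statistic Maximum" \
    "--period 300" \
    "--evaluation-periods 1" \
    "--threshold 0" \
    "--comparison-operator GreaterThanThreshold" \
    "--treat-missing-data notBreaching"
}

# Dry run by default: print the command for review. Set DRY_RUN=0 to run it.
if [ "${DRY_RUN:-1}" = "1" ]; then
  publish_failure_alarm_cmd
else
  eval "$(publish_failure_alarm_cmd)"
fi
```

With `--treat-missing-data notBreaching`, periods with no publish at all count as healthy, so no success metric is needed to keep the alarm out of the ALARM state.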

ErikCH · 2022-07-15T17:11:52Z

.github/workflows/test-deploy-main.yml

+          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws-region: us-east-1
+      - run: aws cloudwatch put-metric-data --metric-name publishFailure --namespace GithubCanaryApps --value 1


I'm assuming publishFailure will be set up to alarm right away?

I remember we manually set up the run-and-test-build failure alarm. I still remember we were on a call and @wlee221 set it up... I hope my memory is not lying to me.

wlee221

LGTM, let's make sure cloudwatch alarms are set up before we make this happen.

reesscot · 2022-07-15T17:28:34Z

.github/workflows/run-and-test-builds.yml

      - run: aws cloudwatch put-metric-data --metric-name RunTimeTestsFailure --namespace GithubCanaryApps --value 1

-  trigger-sucess-alarm:
+  trigger-success-alarm:


Shouldn't this be called put-success-metric? It doesn't actually trigger an alarm

reesscot · 2022-07-15T17:31:53Z

.github/workflows/test-deploy-main.yml

        working-directory: ./canary
+
+  trigger-failure-alarm:
+    # Triggers an alarm if any of builds failed.


This comment isn't actually correct. Can we change the language to say that it publishes a metric, which we separately alarm on?

.github/workflows/test-deploy-main.yml

wlee221

Added a comment on run-and-test-builds changes 🙏

Co-authored-by: Scott Rees <6165315+reesscot@users.noreply.github.com>

feat: add alert for publishing @next version

4e1be56

0618 requested a review from a team as a code owner July 14, 2022 22:39

0618 temporarily deployed to ci July 14, 2022 22:52 Inactive

ErikCH reviewed Jul 15, 2022

fix: remove trigger-success-alarm

5a6cedb

0618 temporarily deployed to ci July 18, 2022 23:40 Inactive

wlee221 previously approved these changes Jul 18, 2022

reesscot reviewed Jul 19, 2022

wlee221 self-requested a review July 19, 2022 00:12

wlee221 reviewed Jul 19, 2022

Update .github/workflows/test-deploy-main.yml

a4f9ee6

Co-authored-by: Scott Rees <6165315+reesscot@users.noreply.github.com>

0618 dismissed wlee221’s stale review via a4f9ee6 July 19, 2022 00:18

0618 temporarily deployed to ci July 19, 2022 00:30 Inactive

fix: address review comments

acdce5d

0618 temporarily deployed to ci July 19, 2022 01:01 Inactive

ErikCH approved these changes Jul 19, 2022

wlee221 approved these changes Jul 19, 2022

0618 merged commit c455bce into main Jul 19, 2022

0618 deleted the next-publish-alert branch July 19, 2022 18:37

0618 mentioned this pull request Jul 20, 2022

feat: add CloudWatch metric when publishLatest fails #2311

Merged

5 tasks
