Skip to content

Conversation

@0618
Copy link
Contributor

@0618 0618 commented Jul 14, 2022

Description of changes

Issue #, if available

https://app.asana.com/0/1201736086077862/1201914591116096/f

Description of how you validated changes

This PR add metrics for success and failures of the after merge next tag publishing events in Prod account.

TODO:

  • manually create an alarm to trigger Sev3 tickets
  • Ideally, we need to migrate the metrics and alarms to CDK and deploy them to the canary accounts

Checklist

  • PR description included
  • yarn test passes
  • Tests are updated
  • No side effects or sideEffects field updated
  • Relevant documentation is changed or added (and PR referenced)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@0618 0618 requested a review from a team as a code owner July 14, 2022 22:39
@changeset-bot
Copy link

changeset-bot bot commented Jul 14, 2022

⚠️ No Changeset found

Latest commit: acdce5d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@0618 0618 temporarily deployed to ci July 14, 2022 22:52 Inactive
@0618 0618 temporarily deployed to ci July 14, 2022 22:52 Inactive
@0618 0618 temporarily deployed to ci July 14, 2022 22:52 Inactive
@0618 0618 temporarily deployed to ci July 14, 2022 22:52 Inactive
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: us-east-1
- run: aws cloudwatch put-metric-data --metric-name RunTimeTestsFailure --namespace GithubCanaryApps --value 0
- run: aws cloudwatch put-metric-data --metric-name RunTimeTestsSuccess --namespace GithubCanaryApps --value 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a RunTimeTestsSuccess metric setup with any alarms right now. Is the idea that we'll use this in the future to alarm if it's missing for a certain period of time?

Copy link
Contributor Author

@0618 0618 Jul 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, this line will create a RunTimeTestsSuccess metric and we'll need to manually set up the alarm from the metric. Is that right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this further this might be an issue. The way the alarm works is that if RunTimeTestsFailure > 0 for 2 data points within 40 minutes the alarm is triggered. When a success occurs the RunTimeTestsFailure is sent a 0. If a 0 is never sent won't the RunTimeTestsFailure be above 0 forever?

Perhaps you have to change the additional configuration to have missing data treated as being good.

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or change this back to RunTimeTestsFailure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With @ErikCH on this one, is this change required? This would break the existing metric because we've been relying on the value 0 and 1 to determine the alarm state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome point! Thanks for catching it. That means I need to add the success action back to the publish workflow as well.

Just pushed a new commit 😁

aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: us-east-1
- run: aws cloudwatch put-metric-data --metric-name publishSuccess --namespace GithubCanaryApps --value 0
Copy link
Contributor

@ErikCH ErikCH Jul 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here with these two alarms. Will the success alarm on missing data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same. I think this line is only for metrics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We configured canary to trigger alarm on missing data.

I don't think we can do that here though, because publish to @next does not happen on a regular schedule. IMO should have @next trigger alarm only on failure data and ignore missing data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this PR would add a canary for the failures? Not sure about the missing data, but can we take a look once it's added?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, my thought here is that we shouldn't even need to add success metrics here. We can safely delete trigger-success-alarm and only have trigger-failure-alarm job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I just removed the trigger-success-alarm 😁

aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: us-east-1
- run: aws cloudwatch put-metric-data --metric-name publishFailure --namespace GithubCanaryApps --value 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming publishFailure will be setup to alarm right away?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember we manually set up the run-and-test-build failure. Still remember we were on a call and @wlee221 set it up...I hope my memory is not lying to me

@0618 0618 temporarily deployed to ci July 18, 2022 23:40 Inactive
@0618 0618 temporarily deployed to ci July 18, 2022 23:40 Inactive
@0618 0618 temporarily deployed to ci July 18, 2022 23:40 Inactive
@0618 0618 temporarily deployed to ci July 18, 2022 23:40 Inactive
wlee221
wlee221 previously approved these changes Jul 18, 2022
Copy link
Contributor

@wlee221 wlee221 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's make sure cloudwatch alarms are set up before we make this happen.

- run: aws cloudwatch put-metric-data --metric-name RunTimeTestsFailure --namespace GithubCanaryApps --value 1

trigger-sucess-alarm:
trigger-success-alarm:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be called put-success-metric? It doesn't actually trigger an alarm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

working-directory: ./canary

trigger-failure-alarm:
# Triggers an alarm if any of builds failed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment isn't actually correct. Can we change this language to saw that it publishes a metric, which we separately alarm on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@wlee221 wlee221 self-requested a review July 19, 2022 00:12
Copy link
Contributor

@wlee221 wlee221 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment on run-and-test-builds changes 🙏

Co-authored-by: Scott Rees <6165315+reesscot@users.noreply.github.com>
@0618 0618 temporarily deployed to ci July 19, 2022 00:30 Inactive
@0618 0618 temporarily deployed to ci July 19, 2022 00:30 Inactive
@0618 0618 temporarily deployed to ci July 19, 2022 00:30 Inactive
@0618 0618 temporarily deployed to ci July 19, 2022 00:30 Inactive
@0618 0618 temporarily deployed to ci July 19, 2022 01:01 Inactive
@0618 0618 temporarily deployed to ci July 19, 2022 01:01 Inactive
@0618 0618 temporarily deployed to ci July 19, 2022 01:01 Inactive
@0618 0618 temporarily deployed to ci July 19, 2022 01:01 Inactive
@0618 0618 merged commit c455bce into main Jul 19, 2022
@0618 0618 deleted the next-publish-alert branch July 19, 2022 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants