Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add callback failure warning email #2190

Closed
wants to merge 12 commits into from

Conversation

whabanks
Copy link
Contributor

@whabanks whabanks commented Jun 4, 2024

Summary | Résumé

This PR adds the groundwork for sending a warning email to service owners when their callback service is not working correctly.

The actual email sending based on threshold is not yet implemented. It is non-trivial to ensure that we don't spam users with emails every time a retried callback reaches the max retries of 5. For the time being we will simply monitor how frequently this occurs while the remaining work is completed.

Related Issues | Cartes liées

Test instructions | Instructions pour tester la modification

TODO: Fill in test instructions for the reviewer.

Release Instructions | Instructions pour le déploiement

None.

Reviewer checklist | Liste de vérification du réviseur

  • This PR does not break existing functionality.
  • This PR does not violate GCNotify's privacy policies.
  • This PR does not raise new security concerns. Refer to our GC Notify Risk Register document on our Google drive.
  • This PR does not significantly alter performance.
  • Additional required documentation resulting of these changes is covered (such as the README, setup instructions, a related ADR or the technical documentation).

⚠ If boxes cannot be checked off before merging the PR, they should be moved to the "Release Instructions" section with appropriate steps required to verify before release. For example, changes to celery code may require tests on staging to verify that performance has not been affected.

- Alphabeticalized the list of Notify's templates in config.py, because readability is nice.
@@ -86,5 +86,5 @@ def _send_data_to_service_callback_api(self, data, service_callback_url, token,
self.retry(queue=QueueNames.CALLBACKS_RETRY)
except self.MaxRetriesExceededError:
current_app.logger.warning(
"Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id}"
f"Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out we have not been properly capturing callback failure logs for a while now.

- Added CALLBACK_FAILURE_THRESHOLD_PERCENTAGE env var
- Updated the callback email migration script with the latest changes from content
- Added a method to send the callback failure email to service owners
- Stubbed a method to query cloudwatch so we can determine if callbacks for the service have failed at least 5 times in a 30 minute time period before we send the email to the service owner
ALREADY_REGISTERED_EMAIL_TEMPLATE_ID = "0880fbb1-a0c6-46f0-9a8e-36c986381ceb"
APIKEY_REVOKE_TEMPLATE_ID = "a0a4e7b8-8a6a-4eaa-9f4e-9c3a5b2dbcf3"
BRANDING_REQUEST_TEMPLATE_ID = "7d423d9e-e94e-4118-879d-d52f383206ae"
CALLBACK_FAILURE_TEMPLATE_ID = "d8d580f4-86b3-4ba4-9d7c-263a630af354"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amongst the rearranging, this env var is the one I added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easier to read, thank you 🔤

- fix circular dependency
- formatting
@whabanks whabanks marked this pull request as ready for review June 6, 2024 21:02
@whabanks whabanks requested review from andrewleith and jzbahrai and removed request for andrewleith June 11, 2024 14:41
Copy link
Member

@andrewleith andrewleith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. If our threshold is simply whenever MaxRetriesExceededError is thrown, I think we can use a redis key to ensure we only send one email per day. See comment below.

# Retry if the response status code is server-side or 429 (too many requests).
if not isinstance(e, HTTPError) or e.response.status_code >= 500 or e.response.status_code == 429:
try:
self.retry(queue=QueueNames.CALLBACKS_RETRY)
except self.MaxRetriesExceededError:
current_app.logger.warning(
"Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id}"
f"Retry: {function_name} has retried the max num of times for callback url {service_callback_url} and notification_id: {notification_id} for service: {current_app.service.id}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the limits feature, we use an expiring redis key to ensure only one email is sent within a particular time period. For example, if they reach the threshold, we send the email then set this key in redis to true. Then if they hit the threshold again within the time period the redis key will still be true and we will check it before sending another email.

ALREADY_REGISTERED_EMAIL_TEMPLATE_ID = "0880fbb1-a0c6-46f0-9a8e-36c986381ceb"
APIKEY_REVOKE_TEMPLATE_ID = "a0a4e7b8-8a6a-4eaa-9f4e-9c3a5b2dbcf3"
BRANDING_REQUEST_TEMPLATE_ID = "7d423d9e-e94e-4118-879d-d52f383206ae"
CALLBACK_FAILURE_TEMPLATE_ID = "d8d580f4-86b3-4ba4-9d7c-263a630af354"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easier to read, thank you 🔤

HEARTBEAT_TEMPLATE_EMAIL_LOW = "73079cb9-c169-44ea-8cf4-8d397711cc9d"
HEARTBEAT_TEMPLATE_EMAIL_MEDIUM = "c75c4539-3014-4c4c-96b5-94d326758a74"
HEARTBEAT_TEMPLATE_EMAIL_HIGH = "276da251-3103-49f3-9054-cbf6b5d74411"
HEARTBEAT_TEMPLATE_SMS_HIGH = "4969a9e9-ddfd-476e-8b93-6231e6f1be4a"

Check warning

Code scanning / CodeQL

Variable defined multiple times Warning

This assignment to 'HEARTBEAT_TEMPLATE_SMS_HIGH' is unnecessary as it is
redefined
before this value is used.
@jzbahrai
Copy link
Collaborator

We are not doing this,this is already in the PRD GC notify service

@jzbahrai jzbahrai closed this Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants