New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Elasticsearch cluster health monitor DAGs #3748
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/3748 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. Changed files 🔄: |
from airflow.models.connection import Connection | ||
from airflow.models.xcom_arg import XComArg | ||
|
||
from common.constants import Environment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI I am working on a PR where I move this task into a new file called common/elasticsearch.py
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha! Should I update this PR to do that as well or do you have a different approach you would prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like mine will take a minute to get over the line -- I'm totally fine to keep it here for this PR and I can rebase/update in mine.
I am struggling to get the DAG docs generation to run locally, due to an issue with the version of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, I'm glad it was easy enough to knock out! My only blocking request is that we hold off on yellow alerts during a data refresh, see the explanation given in the in-line comment. Tested locally and it works as expected too 😄
@AetherUnbound I've made the requested changes. I made it so that the notification of yellow status is skipped if the data refresh is running and environment is production, but otherwise the other two notification types still send. I haven't done anything to address the same issue in staging, though, even though there are a few staging DAGs that would cause a yellow status. I don't know how to create a comprehensive list of them without manually doing it, which would easily lead to drift. Do you have any advice? It will be easy to add later as well. I don't think we currently use those DAGs very much if at all. |
I think catching the case where the data refresh in production is perfect for now! If we find it alerts too much in other scenarios, we can figure out how to add those as well. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making those changes, looks great!
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal Excluding weekend1 days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2. @sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- the checks for running data refreshes are fantastic, and I agree that it's fine to consider the staging alerts in the future. 🚢
from airflow.models.connection import Connection | ||
from airflow.models.xcom_arg import XComArg | ||
|
||
from common.constants import Environment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like mine will take a minute to get over the line -- I'm totally fine to keep it here for this PR and I can rebase/update in mine.
Fixes
Fixes #3747 by @sarayourfriend
Description
Adds two new dags to monitor the staging and production Elasticsearch clusters. Each dag runs on a 15 minute schedule.
Testing Instructions
Both DAGs should run and respond with the appropriate notification or alert based on the cluster health response.
Run these locally by searching for the
monitoring
tag and enabling the twoelasticsearch_cluster_healthcheck
. Both should pass. If production passes but staging does not, make sure you have a new copy ofenv.template
atcatalog/.env
because staging ES connection string was only recently added.Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin