Add basic monitoring #60

Merged
merged 7 commits into from
Nov 6, 2023

Conversation

pankajastro
Collaborator

@pankajastro pankajastro commented Oct 27, 2023

closes: #39

An example post showing success and failure service statuses:
Screenshot 2023-11-01 at 11 27 58 AM

@cloudflare-pages

cloudflare-pages bot commented Oct 27, 2023

Deploying with Cloudflare Pages

Latest commit: 6762d20
Status: ✅  Deploy successful!
Preview URL: https://9cdff5ab.ask-astro.pages.dev
Branch Preview URL: https://monitoring-dag.ask-astro.pages.dev


Collaborator

@sunank200 sunank200 left a comment

How will the production DAG post the status of every run?



@task(trigger_rule=TriggerRule.ALL_DONE)
def check_weaviate_status():
Collaborator

what are we trying to check here?

Collaborator Author

  1. Check that the Weaviate class exists; if it does not, the task fails.
  2. Print the number of records in the Weaviate class.
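
For illustration only, the two checks described above could look roughly like the sketch below. It is not the code from this PR: the function name `check_weaviate_class` is hypothetical, and it assumes a client object shaped like the v3 Weaviate Python client (`client.schema.get()` and `client.query.aggregate(...).with_meta_count().do()`).

```python
from typing import Any


def check_weaviate_class(client: Any, class_name: str) -> int:
    """Fail if `class_name` is missing from the schema; otherwise return its record count.

    `client` is assumed to behave like a v3 weaviate.Client (an assumption,
    not something confirmed by the PR).
    """
    # 1. The class must exist, otherwise the check (and the task) fails.
    schema = client.schema.get()
    existing = {cls["class"] for cls in schema.get("classes", [])}
    if class_name not in existing:
        raise ValueError(f"Weaviate class {class_name!r} does not exist")

    # 2. Print the number of records stored in the class.
    response = client.query.aggregate(class_name).with_meta_count().do()
    count = response["data"]["Aggregate"][class_name][0]["meta"]["count"]
    print(f"{class_name}: {count} records")
    return count
```

Raising an exception is what would make the surrounding Airflow task fail when the class is missing.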

Collaborator

We should just describe what we are monitoring in Weaviate in a docstring.

@pankajastro pankajastro marked this pull request as ready for review October 30, 2023 16:25
Collaborator

@sunank200 sunank200 left a comment

@pankajastro we need the following changes in the DAGs:

  • On a regular basis we can post the status on Slack.
  • We should run the DAGs every 10 minutes or so and check whether anything is down; if it is, we should post a status on Slack (and maybe email folks).
    Something like this incident can happen at any time.

@pankajastro
Collaborator Author

pankajastro commented Oct 31, 2023

  • On a regular basis we can post the status on Slack.
  • We should run the DAGs every 10 minutes or so and check whether anything is down; if it is, we should post a status on Slack (and maybe email folks).

As the PR currently stands, it posts to Slack, and we can add an environment variable for the schedule interval, for example */10 * * * *
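
A minimal sketch of the environment-variable idea, assuming a hypothetical variable name `MONITORING_SCHEDULE_INTERVAL` (the PR does not name one here) and the 10-minute cron expression from the comment as the fallback:

```python
import os
from typing import Mapping

# Fallback cron expression: every 10 minutes, as suggested in the discussion.
DEFAULT_SCHEDULE = "*/10 * * * *"


def get_monitoring_schedule(env: Mapping[str, str] = os.environ) -> str:
    """Return the cron schedule from the environment, or the default if unset or blank.

    MONITORING_SCHEDULE_INTERVAL is a hypothetical name used for illustration.
    """
    value = env.get("MONITORING_SCHEDULE_INTERVAL", "").strip()
    return value or DEFAULT_SCHEDULE
```

The returned string could then be passed as the DAG's schedule argument.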

@sunank200
Collaborator

sunank200 commented Oct 31, 2023

  • On a regular basis we can post the status on Slack.
  • We should run the DAGs every 10 minutes or so and check whether anything is down; if it is, we should post a status on Slack (and maybe email folks).

As the PR currently stands, it posts to Slack, and we can add an environment variable for the schedule interval, for example */10 * * * *

As discussed on the call:

  • This schedule should be fine:
  • We should post on Slack once a day regardless.
  • During the day, we should post immediately only if anything breaks or fails.
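
The posting policy agreed above can be sketched as a small decision function. This is an illustration under assumptions, not the merged implementation: the function name and the idea of tracking the date of the last daily summary are hypothetical.

```python
from datetime import date, datetime
from typing import Optional


def should_post_to_slack(
    any_failed: bool,
    now: datetime,
    last_summary_date: Optional[date],
) -> bool:
    """Decide whether a monitoring run should post to Slack.

    - Any failure is posted immediately.
    - Otherwise, post only if no daily summary has gone out yet today.
    """
    if any_failed:
        return True
    return last_summary_date != now.date()
```

With a 10-minute schedule, healthy runs stay silent after the first summary of the day, while a failing run always posts.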

@pankajastro
Collaborator Author

@sunank200 I have tested this; you can check the sample message in the PR description.

@pankajastro
Collaborator Author

@jedcunningham requesting your feedback on this!

Collaborator

@sunank200 sunank200 left a comment

Overall LGTM. @pankajastro, I added a few comments.

airflow/dags/monitor/monitor.py


@task(trigger_rule=TriggerRule.ALL_DONE)
def check_weaviate_status():
Collaborator

We should just describe what we are monitoring in Weaviate in a docstring.

airflow/dags/monitor/monitor.py
@pankajastro pankajastro merged commit 82ea78a into main Nov 6, 2023
7 checks passed
@pankajastro pankajastro deleted the monitoring_dag branch November 6, 2023 12:45
Development

Successfully merging this pull request may close these issues.

Implement some observability/monitoring part-1
2 participants