Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add description on the ways how users should approach DB monitoring #36483

Merged
merged 3 commits into from Dec 30, 2023

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Dec 29, 2023

Often our users are not aware that they are responsible for setting up and monitoring the database they chose as the metaa-data backend.

While details of the tables and database structure of the metadata DB used by Airflow is internal detail, the monitoring, tracking the usage, fine-tuning and optimisation of the database configuration and detecting some cases where database becomes a bottle neck is generally a task that Deployment Manager should be aware of and it should be approached in a generic way - specific to the database chosen by the Deployment Manager and it also depends a lot on the choice of managed database if managed database is chosen by the Deployment Manager.

This chapter makes it explicit and gives enough leads to the Deployment Manager to be able to follow after they chose the database, it also explain the specific parameters tha the Deployment Manager should pay attention to when setting up such monitoring.

We also add an explanation of how Deployment Manager can setup client-side logging of SQL queries generated by Airflow in case database access is suspected for performance issues with Airflow, as a poor-man's version of complete, server-side monitoring and explains caveats of such client side configuraiton.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk
Copy link
Member Author

potiuk commented Dec 29, 2023

Inspired by discussion in #36454 - whenever user/issue is likely related to excessive DB usage, we should be able to just link the discussion/issue to this documentation, and possibly it will even inspire Deployment Managers to set-up monitoring of their DB as they might not be aware this falls under their responsibility when they choose the DB backend and that they have to learn how to monitor the DB.

@potiuk potiuk force-pushed the add-documentation-on-db-monitoring branch from fe0b4bc to 3bac97e Compare December 29, 2023 11:57
Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First block has very long sentences, besides this well explained.
I did not know about hte SQLAlchemy logging yet, could not find this. Woul dbe cool to have WARNings generated if queries are above an certain threshold like if running >500ms - but seems this would be rather a contribution for SQLAlchemy :-D

docs/apache-airflow/howto/set-up-database.rst Outdated Show resolved Hide resolved
Copy link
Member

@hussein-awala hussein-awala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

https://www.datadoghq.com/dg/monitor/databases/ is great for this kind of monitoring.

@BasPH
Copy link
Contributor

BasPH commented Dec 29, 2023

I agree with explaining monitoring of Airflow's meta-database in the docs. However, this chapter contains a lot of (IMO) unnecessary details which makes it difficult to extract the key information.

How about restructuring it as:

  • 1 sentence explaining "everything you see and do in Airflow is registered in the database, therefore it's important to monitor"
  • 1 sentence with disclaimer that monitoring the database is up to yourself, not part of Airflow
  • 1 paragraph stating "Here are a few key considerations when monitoring your Airflow database" with the bulletpoints from your text
  • 1 paragraph explaining how to enable query logging together with the "why" and a warning that it can have a performance impact

Copy link
Collaborator

@dirrao dirrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great initiative. I have ideas along similar lines.
I have observed in our use case where db grows (1GB per day) indefinitely unless manually run the db clean cli command. So, it will eventually degrade the performance of the database over time. If we can add a feature to automatically clean the data based on the retention parameter that we set in the airflow configuration.

@hussein-awala
Copy link
Member

If we can add a feature to automatically clean the data based on the retention parameter that we set in the airflow configuration.

IMHO, the db clean command available in the Airflow CLI and the Python methods provided in the module db_cleanup are sufficient for the users because they can be easily used in a simple dag with a bash operator for the CLI or Python operator for the Python API. (that's what most of Airflow users do)

@potiuk
Copy link
Member Author

potiuk commented Dec 29, 2023

Great initiative. I have ideas along similar lines. I have observed in our use case where db grows (1GB per day) indefinitely unless manually run the db clean cli command. So, it will eventually degrade the performance of the database over time.

Good points . I will add the description about those commands

@potiuk
Copy link
Member Author

potiuk commented Dec 29, 2023

If we can add a feature to automatically clean the data based on the retention parameter that we set in the airflow configuration.

IMHO, the db clean command available in the Airflow CLI and the Python methods provided in the module db_cleanup are sufficient for the users because they can be easily used in a simple dag with a bash operator for the CLI or Python operator for the Python API. (that's what most of Airflow users do)

Agree here with @hussein-awala . Those are tools we give the Deployment Managers but they should adjust the usage to their needs and run those tools as and when needed with the parameters they decide are best for their deployment.

Generally speaking I think we should stress (and this is partially my motivation) that Airflow (components) and Airflow maintainers will NOT solve the problems of DB retention, configuration and optimization. Also commenting to @BasPH - I think the part where we explicilty say Hey Deployment Manager - it's Your job to make sure to keep your database healthy and to analyse if you see some performance issues is extremely important - and also FAIR,

I think (and in parts that is my motivation behind this change) is not only to help our users to see what they should do but also make it crystal clear they are responsible for taking care about all those things.

It's extremely complex and rather brittle to develop a generic and applicable to all cases ways to keep the database healthy and optimized. There is a good reason why databases need maintenance and someone to look after that - otherwise we would have it already implemented in Postgres and MySQL so that it's completely worry-free and handle all usage cases. Airflow and the way it can be used - from small single node installation to 100s of nodes and 1000s of DAGs and tasks does not narrow down the DB maintenance problem to much smaller and simpler case - not enough that we provide a "zero maintenance" solution out-of the box. Even the @dirrao 's request is going in that direction. But I think it's far too much to expect this kind of "zero maintenance" from a non-managed, free, open-source software.

But some of our users expect this will happen. And with that documentation chapter I want to make it crystal clear and set expectations

Also mentioning "managed database" in this context is important - those databases might be closer to "zero-maintenance". And then if you go to "managed Airflow" - that's even more "zero-maintenance" when you pay someone to manage Airflow, then absolutely - yes, you should expect you do not have to worry about database maintenance and the one who provides "managed Airflow" should take care about it - this is precisely why you pay (among other things) - you pay for the maintanance that you do not have to do.

@BasPH - yep I hear you. I will try to remove some duplications and restructure it a bit (I just bought Chat GPT 4.5 subscription just to see if it can help with that). But the points about "Deployment Manager responsibity" especially and settting clear expectations what they have to do is crucial part of this change and one that is the main reason I am doing it in the first place.

@potiuk
Copy link
Member Author

potiuk commented Dec 29, 2023

https://www.datadoghq.com/dg/monitor/databases/ is great for this kind of monitoring.

I am not sure if we should endorse specific services in our docs, but I will leave enough clues for the users to search for services that offer it.

@potiuk
Copy link
Member Author

potiuk commented Dec 29, 2023

Here it is - after employing Chat GPT (after few iterations and discussion with it) and applying the comments above. @BasPH - does it look better ?

@potiuk potiuk force-pushed the add-documentation-on-db-monitoring branch from d3575a9 to bfe7aea Compare December 29, 2023 20:56
potiuk and others added 3 commits December 29, 2023 21:57
Often our users are not aware that they are responsible for setting
up and monitoring the database they chose as the metaa-data backend.

While details of the tables and database structure of the metadata
DB used by Airflow is internal detail, the monitoring, tracking
the usage, fine-tuning and optimisation of the database configuration
and detecting some cases where database becomes a bottle neck is
generally a task that Deployment Manager should be aware of and
it should be approached in a generic way - specific to the database
chosen by the Deployment Manager and it also depends a lot on the
choice of managed database if managed database is chosen by the
Deployment Manager.

This chapter makes it explicit and gives enough leads to the
Deployment Manager to be able to follow after they chose the
database, it also explain the specific parameters tha the
Deployment Manager should pay attention to when setting up
such monitoring.

We also add an explanation of how Deployment Manager can setup
client-side logging of SQL queries generated by Airflow in case
database access is suspected for performance issues with Airflow,
as a poor-man's version of complete, server-side monitoring and
explains caveats of such client side configuraiton.
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
@potiuk potiuk force-pushed the add-documentation-on-db-monitoring branch from bfe7aea to ab62d9f Compare December 29, 2023 20:57
@@ -244,7 +244,7 @@ Zombie/Undead Tasks
No system runs perfectly, and task instances are expected to die once in a while. Airflow detects two kinds of task/process mismatch:

* *Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive
(e.g. their process didn't send a recent heartbeat as it got killed, or the machine died). Airflow will find these
(e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This failed when I run --clean-build locally so i fixed it :)

@@ -273,7 +273,7 @@ The explanation of the criteria used in the above snippet to detect zombie tasks

3. **Job Type**

The job associated with the task must be of type "LocalTaskJob."
The job associated with the task must be of type ``LocalTaskJob``.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@potiuk
Copy link
Member Author

potiuk commented Dec 30, 2023

I will merge it for now - we can always correct it later. I believe the structured way is much cleaner and responds to the "too many details" concern.

@potiuk potiuk merged commit dea715d into apache:main Dec 30, 2023
51 checks passed
@potiuk potiuk deleted the add-documentation-on-db-monitoring branch December 30, 2023 08:31
ephraimbuddy pushed a commit that referenced this pull request Jan 11, 2024
…36483)

* Add description on the ways how users should approach DB monitoring

Often our users are not aware that they are responsible for setting
up and monitoring the database they chose as the metaa-data backend.

While details of the tables and database structure of the metadata
DB used by Airflow is internal detail, the monitoring, tracking
the usage, fine-tuning and optimisation of the database configuration
and detecting some cases where database becomes a bottle neck is
generally a task that Deployment Manager should be aware of and
it should be approached in a generic way - specific to the database
chosen by the Deployment Manager and it also depends a lot on the
choice of managed database if managed database is chosen by the
Deployment Manager.

This chapter makes it explicit and gives enough leads to the
Deployment Manager to be able to follow after they chose the
database, it also explain the specific parameters tha the
Deployment Manager should pay attention to when setting up
such monitoring.

We also add an explanation of how Deployment Manager can setup
client-side logging of SQL queries generated by Airflow in case
database access is suspected for performance issues with Airflow,
as a poor-man's version of complete, server-side monitoring and
explains caveats of such client side configuraiton.

* Update docs/apache-airflow/howto/set-up-database.rst

Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>

* fixup! Update docs/apache-airflow/howto/set-up-database.rst

---------

Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
(cherry picked from commit dea715d)
@ephraimbuddy ephraimbuddy added the type:doc-only Changelog: Doc Only label Jan 15, 2024
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
…pache#36483)

* Add description on the ways how users should approach DB monitoring

Often our users are not aware that they are responsible for setting
up and monitoring the database they chose as the metaa-data backend.

While details of the tables and database structure of the metadata
DB used by Airflow is internal detail, the monitoring, tracking
the usage, fine-tuning and optimisation of the database configuration
and detecting some cases where database becomes a bottle neck is
generally a task that Deployment Manager should be aware of and
it should be approached in a generic way - specific to the database
chosen by the Deployment Manager and it also depends a lot on the
choice of managed database if managed database is chosen by the
Deployment Manager.

This chapter makes it explicit and gives enough leads to the
Deployment Manager to be able to follow after they chose the
database, it also explain the specific parameters tha the
Deployment Manager should pay attention to when setting up
such monitoring.

We also add an explanation of how Deployment Manager can setup
client-side logging of SQL queries generated by Airflow in case
database access is suspected for performance issues with Airflow,
as a poor-man's version of complete, server-side monitoring and
explains caveats of such client side configuraiton.

* Update docs/apache-airflow/howto/set-up-database.rst

Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>

* fixup! Update docs/apache-airflow/howto/set-up-database.rst

---------

Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants