Add description on the ways how users should approach DB monitoring #36483

potiuk · 2023-12-29T11:41:32Z

Often our users are not aware that they are responsible for setting up and monitoring the database they chose as the metaa-data backend.

While details of the tables and database structure of the metadata DB used by Airflow is internal detail, the monitoring, tracking the usage, fine-tuning and optimisation of the database configuration and detecting some cases where database becomes a bottle neck is generally a task that Deployment Manager should be aware of and it should be approached in a generic way - specific to the database chosen by the Deployment Manager and it also depends a lot on the choice of managed database if managed database is chosen by the Deployment Manager.

This chapter makes it explicit and gives enough leads to the Deployment Manager to be able to follow after they chose the database, it also explain the specific parameters tha the Deployment Manager should pay attention to when setting up such monitoring.

We also add an explanation of how Deployment Manager can setup client-side logging of SQL queries generated by Airflow in case database access is suspected for performance issues with Airflow, as a poor-man's version of complete, server-side monitoring and explains caveats of such client side configuraiton.

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

potiuk · 2023-12-29T11:45:35Z

Inspired by discussion in #36454 - whenever user/issue is likely related to excessive DB usage, we should be able to just link the discussion/issue to this documentation, and possibly it will even inspire Deployment Managers to set-up monitoring of their DB as they might not be aware this falls under their responsibility when they choose the DB backend and that they have to learn how to monitor the DB.

jscheffl

First block has very long sentences, besides this well explained.
I did not know about hte SQLAlchemy logging yet, could not find this. Woul dbe cool to have WARNings generated if queries are above an certain threshold like if running >500ms - but seems this would be rather a contribution for SQLAlchemy :-D

docs/apache-airflow/howto/set-up-database.rst

hussein-awala

👍

https://www.datadoghq.com/dg/monitor/databases/ is great for this kind of monitoring.

BasPH · 2023-12-29T13:15:12Z

I agree with explaining monitoring of Airflow's meta-database in the docs. However, this chapter contains a lot of (IMO) unnecessary details which makes it difficult to extract the key information.

How about restructuring it as:

1 sentence explaining "everything you see and do in Airflow is registered in the database, therefore it's important to monitor"
1 sentence with disclaimer that monitoring the database is up to yourself, not part of Airflow
1 paragraph stating "Here are a few key considerations when monitoring your Airflow database" with the bulletpoints from your text
1 paragraph explaining how to enable query logging together with the "why" and a warning that it can have a performance impact

dirrao

Great initiative. I have ideas along similar lines.
I have observed in our use case where db grows (1GB per day) indefinitely unless manually run the db clean cli command. So, it will eventually degrade the performance of the database over time. If we can add a feature to automatically clean the data based on the retention parameter that we set in the airflow configuration.

hussein-awala · 2023-12-29T16:02:19Z

If we can add a feature to automatically clean the data based on the retention parameter that we set in the airflow configuration.

IMHO, the db clean command available in the Airflow CLI and the Python methods provided in the module db_cleanup are sufficient for the users because they can be easily used in a simple dag with a bash operator for the CLI or Python operator for the Python API. (that's what most of Airflow users do)

potiuk · 2023-12-29T16:55:00Z

Great initiative. I have ideas along similar lines. I have observed in our use case where db grows (1GB per day) indefinitely unless manually run the db clean cli command. So, it will eventually degrade the performance of the database over time.

Good points . I will add the description about those commands

potiuk · 2023-12-29T17:11:39Z

If we can add a feature to automatically clean the data based on the retention parameter that we set in the airflow configuration.

IMHO, the db clean command available in the Airflow CLI and the Python methods provided in the module db_cleanup are sufficient for the users because they can be easily used in a simple dag with a bash operator for the CLI or Python operator for the Python API. (that's what most of Airflow users do)

Agree here with @hussein-awala . Those are tools we give the Deployment Managers but they should adjust the usage to their needs and run those tools as and when needed with the parameters they decide are best for their deployment.

Generally speaking I think we should stress (and this is partially my motivation) that Airflow (components) and Airflow maintainers will NOT solve the problems of DB retention, configuration and optimization. Also commenting to @BasPH - I think the part where we explicilty say Hey Deployment Manager - it's Your job to make sure to keep your database healthy and to analyse if you see some performance issues is extremely important - and also FAIR,

I think (and in parts that is my motivation behind this change) is not only to help our users to see what they should do but also make it crystal clear they are responsible for taking care about all those things.

It's extremely complex and rather brittle to develop a generic and applicable to all cases ways to keep the database healthy and optimized. There is a good reason why databases need maintenance and someone to look after that - otherwise we would have it already implemented in Postgres and MySQL so that it's completely worry-free and handle all usage cases. Airflow and the way it can be used - from small single node installation to 100s of nodes and 1000s of DAGs and tasks does not narrow down the DB maintenance problem to much smaller and simpler case - not enough that we provide a "zero maintenance" solution out-of the box. Even the @dirrao 's request is going in that direction. But I think it's far too much to expect this kind of "zero maintenance" from a non-managed, free, open-source software.

But some of our users expect this will happen. And with that documentation chapter I want to make it crystal clear and set expectations

Also mentioning "managed database" in this context is important - those databases might be closer to "zero-maintenance". And then if you go to "managed Airflow" - that's even more "zero-maintenance" when you pay someone to manage Airflow, then absolutely - yes, you should expect you do not have to worry about database maintenance and the one who provides "managed Airflow" should take care about it - this is precisely why you pay (among other things) - you pay for the maintanance that you do not have to do.

@BasPH - yep I hear you. I will try to remove some duplications and restructure it a bit (I just bought Chat GPT 4.5 subscription just to see if it can help with that). But the points about "Deployment Manager responsibity" especially and settting clear expectations what they have to do is crucial part of this change and one that is the main reason I am doing it in the first place.

potiuk · 2023-12-29T17:17:15Z

https://www.datadoghq.com/dg/monitor/databases/ is great for this kind of monitoring.

I am not sure if we should endorse specific services in our docs, but I will leave enough clues for the users to search for services that offer it.

potiuk · 2023-12-29T17:44:40Z

Here it is - after employing Chat GPT (after few iterations and discussion with it) and applying the comments above. @BasPH - does it look better ?

Often our users are not aware that they are responsible for setting up and monitoring the database they chose as the metaa-data backend. While details of the tables and database structure of the metadata DB used by Airflow is internal detail, the monitoring, tracking the usage, fine-tuning and optimisation of the database configuration and detecting some cases where database becomes a bottle neck is generally a task that Deployment Manager should be aware of and it should be approached in a generic way - specific to the database chosen by the Deployment Manager and it also depends a lot on the choice of managed database if managed database is chosen by the Deployment Manager. This chapter makes it explicit and gives enough leads to the Deployment Manager to be able to follow after they chose the database, it also explain the specific parameters tha the Deployment Manager should pay attention to when setting up such monitoring. We also add an explanation of how Deployment Manager can setup client-side logging of SQL queries generated by Airflow in case database access is suspected for performance issues with Airflow, as a poor-man's version of complete, server-side monitoring and explains caveats of such client side configuraiton.

Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>

potiuk · 2023-12-29T21:21:19Z

docs/apache-airflow/core-concepts/tasks.rst

@@ -244,7 +244,7 @@ Zombie/Undead Tasks
 No system runs perfectly, and task instances are expected to die once in a while. Airflow detects two kinds of task/process mismatch:

 * *Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive
-  (e.g. their process didn't send a recent heartbeat as it got killed, or the machine died). Airflow will find these
+  (e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these


This failed when I run --clean-build locally so i fixed it :)

potiuk · 2023-12-29T21:21:31Z

docs/apache-airflow/core-concepts/tasks.rst

@@ -273,7 +273,7 @@ The explanation of the criteria used in the above snippet to detect zombie tasks

 3. **Job Type**

-    The job associated with the task must be of type "LocalTaskJob."
+    The job associated with the task must be of type ``LocalTaskJob``.


potiuk · 2023-12-30T08:31:29Z

I will merge it for now - we can always correct it later. I believe the structured way is much cleaner and responds to the "too many details" concern.

…36483) * Add description on the ways how users should approach DB monitoring Often our users are not aware that they are responsible for setting up and monitoring the database they chose as the metaa-data backend. While details of the tables and database structure of the metadata DB used by Airflow is internal detail, the monitoring, tracking the usage, fine-tuning and optimisation of the database configuration and detecting some cases where database becomes a bottle neck is generally a task that Deployment Manager should be aware of and it should be approached in a generic way - specific to the database chosen by the Deployment Manager and it also depends a lot on the choice of managed database if managed database is chosen by the Deployment Manager. This chapter makes it explicit and gives enough leads to the Deployment Manager to be able to follow after they chose the database, it also explain the specific parameters tha the Deployment Manager should pay attention to when setting up such monitoring. We also add an explanation of how Deployment Manager can setup client-side logging of SQL queries generated by Airflow in case database access is suspected for performance issues with Airflow, as a poor-man's version of complete, server-side monitoring and explains caveats of such client side configuraiton. * Update docs/apache-airflow/howto/set-up-database.rst Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com> * fixup! Update docs/apache-airflow/howto/set-up-database.rst --------- Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com> (cherry picked from commit dea715d)

…pache#36483) * Add description on the ways how users should approach DB monitoring Often our users are not aware that they are responsible for setting up and monitoring the database they chose as the metaa-data backend. While details of the tables and database structure of the metadata DB used by Airflow is internal detail, the monitoring, tracking the usage, fine-tuning and optimisation of the database configuration and detecting some cases where database becomes a bottle neck is generally a task that Deployment Manager should be aware of and it should be approached in a generic way - specific to the database chosen by the Deployment Manager and it also depends a lot on the choice of managed database if managed database is chosen by the Deployment Manager. This chapter makes it explicit and gives enough leads to the Deployment Manager to be able to follow after they chose the database, it also explain the specific parameters tha the Deployment Manager should pay attention to when setting up such monitoring. We also add an explanation of how Deployment Manager can setup client-side logging of SQL queries generated by Airflow in case database access is suspected for performance issues with Airflow, as a poor-man's version of complete, server-side monitoring and explains caveats of such client side configuraiton. * Update docs/apache-airflow/howto/set-up-database.rst Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com> * fixup! Update docs/apache-airflow/howto/set-up-database.rst --------- Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>

boring-cyborg bot added the kind:documentation label Dec 29, 2023

potiuk requested review from Taragolis, BasPH, dstandish, hussein-awala and jscheffl December 29, 2023 11:42

potiuk added this to the Airflow 2.8.1 milestone Dec 29, 2023

potiuk force-pushed the add-documentation-on-db-monitoring branch from fe0b4bc to 3bac97e Compare December 29, 2023 11:57

jscheffl approved these changes Dec 29, 2023

View reviewed changes

docs/apache-airflow/howto/set-up-database.rst Outdated Show resolved Hide resolved

hussein-awala approved these changes Dec 29, 2023

View reviewed changes

dirrao approved these changes Dec 29, 2023

View reviewed changes

potiuk force-pushed the add-documentation-on-db-monitoring branch from d3575a9 to bfe7aea Compare December 29, 2023 20:56

potiuk and others added 3 commits December 29, 2023 21:57

Update docs/apache-airflow/howto/set-up-database.rst

21c48fd

Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>

fixup! Update docs/apache-airflow/howto/set-up-database.rst

ab62d9f

potiuk force-pushed the add-documentation-on-db-monitoring branch from bfe7aea to ab62d9f Compare December 29, 2023 20:57

potiuk commented Dec 29, 2023

View reviewed changes

potiuk merged commit dea715d into apache:main Dec 30, 2023
51 checks passed

potiuk deleted the add-documentation-on-db-monitoring branch December 30, 2023 08:31

ephraimbuddy added the type:doc-only Changelog: Doc Only label Jan 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add description on the ways how users should approach DB monitoring #36483

Add description on the ways how users should approach DB monitoring #36483

potiuk commented Dec 29, 2023

potiuk commented Dec 29, 2023

jscheffl left a comment

hussein-awala left a comment

BasPH commented Dec 29, 2023

dirrao left a comment

hussein-awala commented Dec 29, 2023

potiuk commented Dec 29, 2023 •

edited

potiuk commented Dec 29, 2023

potiuk commented Dec 29, 2023

potiuk commented Dec 29, 2023

potiuk Dec 29, 2023

potiuk Dec 29, 2023

potiuk commented Dec 30, 2023

Add description on the ways how users should approach DB monitoring #36483

Add description on the ways how users should approach DB monitoring #36483

Conversation

potiuk commented Dec 29, 2023

potiuk commented Dec 29, 2023

jscheffl left a comment

Choose a reason for hiding this comment

hussein-awala left a comment

Choose a reason for hiding this comment

BasPH commented Dec 29, 2023

dirrao left a comment

Choose a reason for hiding this comment

hussein-awala commented Dec 29, 2023

potiuk commented Dec 29, 2023 • edited

potiuk commented Dec 29, 2023

potiuk commented Dec 29, 2023

potiuk commented Dec 29, 2023

potiuk Dec 29, 2023

Choose a reason for hiding this comment

potiuk Dec 29, 2023

Choose a reason for hiding this comment

potiuk commented Dec 30, 2023

potiuk commented Dec 29, 2023 •

edited