fixup! Update docs/apache-airflow/howto/set-up-database.rst
potiuk committed Dec 29, 2023
1 parent 79d1449 commit bfe7aea
Showing 2 changed files with 95 additions and 64 deletions.
4 changes: 2 additions & 2 deletions docs/apache-airflow/core-concepts/tasks.rst
@@ -244,7 +244,7 @@ Zombie/Undead Tasks
No system runs perfectly, and task instances are expected to die once in a while. Airflow detects two kinds of task/process mismatch:

* *Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive
(e.g. their process didn't send a recent heartbeat as it got killed, or the machine died). Airflow will find these
(e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these
periodically, clean them up, and either fail or retry the task depending on its settings.

* *Undead tasks* are tasks that are *not* supposed to be running but are, often caused when you manually edit Task
@@ -273,7 +273,7 @@ The explanation of the criteria used in the above snippet to detect zombie tasks

3. **Job Type**

The job associated with the task must be of type "LocalTaskJob."
The job associated with the task must be of type ``LocalTaskJob``.

4. **Queued by Job ID**

155 changes: 93 additions & 62 deletions docs/apache-airflow/howto/set-up-database.rst
@@ -383,70 +383,101 @@ After configuring the database and connecting to it in Airflow configuration, you
airflow db migrate
Monitoring your database and logging queries
--------------------------------------------

Airflow makes heavy use of the relational metadata database. When scheduling and executing tasks, the
database is the central and crucial part of all the calculations and synchronization. It is important
to monitor your database and make sure it is configured properly. Excessive and long-running queries
will most probably impact Airflow performance. Such queries might occur due to specific cases in your
workflow, missing optimizations, or even bugs in the code (which maintainers try to take care of, but
glitches can appear). There is also a possibility that the database optimization engine makes wrong
decisions based on (outdated) statistics of data, for example when the set of data in your database
changes and the database statistics get outdated.

It is the responsibility of the Deployment Manager to set up monitoring and configure the database properly.

These kinds of issues are not specific to Airflow; they can happen in any application that uses a
database as a backend. There are a number of ways to monitor a database and we do not provide an
opinionated answer on how you should monitor yours; we leave that to the discretion of the Deployment
Manager, who should take care of the database configuration and monitoring. Typical parameters that
should be monitored regularly are:

* CPU usage of your DB
* I/O usage of your DB
* Memory usage of your DB
* Number and frequency of queries handled by your DB
* Detecting and logging slow/long running queries in your DB
* Detecting cases where the execution plan of queries leads to full table scans of huge tables
* Using disk swap vs. memory by the DB and frequent swapping out of the cache

This is something that can only be done by configuring and monitoring your DB; Airflow does not provide
any tooling for it. Such monitoring is usually specific to your database: you can enable server-side
monitoring or logging in your DB, which can also give you some extra metrics, and you can usually
selectively enable tracking of the longest-running queries above thresholds you define that might
indicate excessive resource usage.
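
For example, if you use PostgreSQL as the metadata DB, enabling server-side logging of queries slower
than a threshold might look like the sketch below. The one-second threshold and the ``$AIRFLOW_DB_URI``
connection placeholder are assumptions to adapt, and ``ALTER SYSTEM`` requires superuser privileges:

.. code-block:: bash

    # Log every statement running longer than 1000 ms (tune the threshold to your workload).
    psql "$AIRFLOW_DB_URI" -c "ALTER SYSTEM SET log_min_duration_statement = 1000;"
    # Reload the server configuration so the new setting takes effect.
    psql "$AIRFLOW_DB_URI" -c "SELECT pg_reload_conf();"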

It is strongly recommended for the Deployment Manager to configure such monitoring and tracking in the
DB they chose and to learn how to make use of it; it should be used to monitor regular DB performance and
issues such as misuse of indexes. Databases often also have tools that allow you to fix some of those
issues by regularly running house-keeping that cannot be done by Airflow on its own. The databases
supported as metadata DB have dedicated ways of doing this (most often a variant of the ``ANALYZE`` SQL
command that you can run periodically on your database to update statistics and let the optimization
engine make better decisions about query execution plans).
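
A minimal sketch of such house-keeping, assuming PostgreSQL or MySQL as the metadata DB (the
``$AIRFLOW_DB_URI`` placeholder and the database/table names are assumptions to adapt):

.. code-block:: bash

    # PostgreSQL: refresh planner statistics for all tables in the current database.
    psql "$AIRFLOW_DB_URI" -c "ANALYZE;"

    # MySQL: refresh statistics for a specific table, here task_instance in the airflow schema.
    mysql -e "ANALYZE TABLE airflow.task_instance;"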

When configured and monitored properly, such monitoring and logging can be enabled even in your
production system without significantly impacting the performance of your database.

Consult the documentation of the database you chose as the metadata backend for details on how this
should be set up. Many managed databases already run the maintenance tasks automatically, but this very
much depends on the choice of both the database and the provider of the managed database.

If you suspect that excessive or long-running database queries are the reason for slow Airflow
performance, there is also an option to make the SQLAlchemy client log all queries sent to the database.
However, this is less selective and will impact Airflow on the client side much more than properly
configured server-side monitoring. It is also likely to interfere with Airflow's operation:
it will drastically slow down performance and might induce race conditions and possibly
even deadlocks, so it should be used carefully, preferably on a staging system where you
replicate your production system and the usage patterns you want to check.

This can be done by setting ``echo=True`` in the SQLAlchemy engine configuration, as explained in the
`SQLAlchemy logging documentation <https://docs.sqlalchemy.org/en/14/core/engines.html#configuring-logging>`_.
Database Monitoring and Maintenance in Airflow
----------------------------------------------

Airflow extensively utilizes a relational metadata database for task scheduling and execution.
Monitoring and proper configuration of this database are crucial for optimal Airflow performance.

Key Concerns
............
1. **Performance Impact**: Long or excessive queries can significantly affect Airflow's functionality.
These may arise due to workflow specifics, lack of optimizations, or code bugs.
2. **Database Statistics**: Incorrect optimization decisions by the database engine,
often due to outdated data statistics, can degrade performance.
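
For example, to see whether statistics might be stale on PostgreSQL, you can check when tables were
last analyzed; this is a hedged sketch (the ``psql`` invocation and the ``$AIRFLOW_DB_URI``
placeholder are assumptions):

.. code-block:: bash

    # Show when planner statistics were last refreshed for the ten largest tables.
    psql "$AIRFLOW_DB_URI" -c \
        "SELECT relname, last_analyze, last_autoanalyze
         FROM pg_stat_user_tables
         ORDER BY n_live_tup DESC
         LIMIT 10;"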

Responsibilities
................

The responsibilities for database monitoring and maintenance in Airflow environments vary depending on
whether you're using self-managed databases and Airflow instances or opting for managed services.

**Self-Managed Environments**:

In setups where both the database and Airflow are self-managed, the Deployment Manager
is responsible for setting up, configuring, and maintaining the database. This includes monitoring
its performance, managing backups, performing periodic cleanups, and ensuring its optimal operation with Airflow.

**Managed Services**:

- Managed Database Services: When using managed DB services, many maintenance tasks (like backups,
  patching, and basic monitoring) are handled by the provider. However, the Deployment Manager still
  needs to oversee the configuration of Airflow, optimize performance settings specific to their
  workflows, manage periodic cleanups, and monitor the DB to ensure optimal operation with Airflow.

- Managed Airflow Services: With managed Airflow services, the service provider takes responsibility
  for the configuration and maintenance of Airflow and its database. However, the Deployment Manager
  needs to verify that the sizing and configuration of the managed service match their workflow
  requirements.

Monitoring Aspects
..................

Regular monitoring should include:

- CPU, I/O, and memory usage.
- Query frequency and number.
- Identification and logging of slow or long-running queries (see the sketch after this list).
- Detection of inefficient query execution plans.
- Analysis of disk swap versus memory usage and cache swapping frequency.
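
For instance, identifying slow queries on PostgreSQL could look like the sketch below; it assumes the
``pg_stat_statements`` extension is enabled and uses PostgreSQL 13+ column names (older versions use
``mean_time`` instead of ``mean_exec_time``):

.. code-block:: bash

    # Top ten statements by average execution time (requires pg_stat_statements).
    psql "$AIRFLOW_DB_URI" -c \
        "SELECT calls, mean_exec_time, left(query, 80) AS query
         FROM pg_stat_statements
         ORDER BY mean_exec_time DESC
         LIMIT 10;"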

Tools and Strategies
....................

- Airflow doesn't provide direct tooling for database monitoring.
- Use server-side monitoring and logging to obtain metrics.
- Enable tracking of long-running queries based on defined thresholds.
- Regularly run house-keeping tasks (like the ``ANALYZE`` SQL command) for maintenance; a scheduling
  sketch follows this list.
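
Such house-keeping is typically scheduled outside Airflow; the crontab entry below is only a sketch
(the nightly 03:00 schedule and the ``$AIRFLOW_DB_URI`` placeholder are assumptions):

.. code-block:: bash

    # Crontab entry: run ANALYZE nightly at 03:00 to keep planner statistics fresh.
    0 3 * * * psql "$AIRFLOW_DB_URI" -c "ANALYZE;"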

Database Cleaning Tools
.......................

- **Airflow DB Clean Command**: Use the ``airflow db clean`` command to help manage and clean
  up your database (see the sketch after this list).
- **Python methods in** ``airflow.utils.db_cleanup``: This module provides additional Python methods for
  database cleanup and maintenance, offering more fine-grained control and customization for specific needs.
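
A short sketch of the CLI tool in action; the cutoff date and table names are examples only, and you
should check ``airflow db clean --help`` for the exact flags available in your Airflow version:

.. code-block:: bash

    # Preview which rows older than the cutoff would be archived and deleted; makes no changes.
    airflow db clean --clean-before-timestamp '2023-06-01' --dry-run

    # Clean only selected tables, skipping the interactive confirmation prompt.
    airflow db clean --clean-before-timestamp '2023-06-01' --tables task_instance,log --yes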

Recommendations
...............

- **Proactive Monitoring**: Implement monitoring and logging in production without significantly
impacting performance.
- **Database-Specific Guidance**: Consult the chosen database's documentation for specific monitoring
setup instructions.
- **Managed Database Services**: Check if automatic maintenance tasks are available with your
database provider.

SQLAlchemy Logging
..................

For detailed query analysis, enable SQLAlchemy client logging (``echo=True`` in SQLAlchemy
engine configuration).

- This method is more intrusive and can affect Airflow's client-side performance.
- It generates a lot of logs, especially in a busy Airflow environment.
- Suitable for non-production environments like staging systems.

You can do this by setting ``echo=True`` in the SQLAlchemy engine configuration, as explained in the
`SQLAlchemy logging documentation <https://docs.sqlalchemy.org/en/14/core/engines.html#configuring-logging>`_.

In the case of Airflow, it can be set via the :ref:`config:database__sql_alchemy_engine_args`
configuration parameter (set the ``echo`` argument to ``True``). However, again, this will impact
Airflow processing heavily: it introduces a lot of I/O contention for writing to log files, extra CPU
needed to format and print the log messages, and you need a lot of space to store the logs, so it
should be used carefully on production systems. Consider it a "poor man's" version of proper
server-side monitoring of your DB, as it provides a very limited and production-interfering setup.
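
For example, via an environment variable; this sketch assumes your Airflow version accepts a
JSON-encoded dict for this option (verify against the configuration reference):

.. code-block:: bash

    # Pass echo=True to SQLAlchemy's create_engine() through Airflow configuration.
    # Remove this again after debugging - the logging is very verbose.
    export AIRFLOW__DATABASE__SQL_ALCHEMY_ENGINE_ARGS='{"echo": true}'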

Caution
.......

- Be mindful of the impact on Airflow's performance and system resources when enabling extensive logging.
- Prefer server-side monitoring over client-side logging for production environments to minimize
performance interference.

What's next?
------------
