Airflow database CPU usage #23

zstoth · 2018-11-06T15:45:10Z

After adding the plugin to our installation, the Airflow Postgres DB reported ~100% CPU usage until the plugin was removed. This made it impossible for the scheduler to schedule new tasks; Airflow basically stopped functioning.
Our Airflow installation contains around 100 DAGs, some with a few thousand DAG runs, and I think reloading all of this was too much for the DB. Do you know any workaround for this issue?

seelmann · 2018-11-12T19:10:56Z

We experienced the same problem today.

The problem is that the plugin is also activated on the workers whenever a task instance is executed, so the collect() function is called at plugin registration for every started task.
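A minimal sketch of that mechanism, assuming the plugin registers a custom collector that defines only collect(). CollectOnlyCollector, the metric name example_task_instances and the db_queries counter are hypothetical; the counter stands in for the task_instance queries. prometheus_client's default REGISTRY is created with auto_describe=True, so register() falls back to calling collect() when the collector has no describe().

```python
from prometheus_client import CollectorRegistry
from prometheus_client.core import GaugeMetricFamily


class CollectOnlyCollector:
    """Hypothetical collector that, like the plugin before the fix, defines only collect()."""

    def __init__(self):
        self.db_queries = 0  # counts simulated database round-trips

    def collect(self):
        self.db_queries += 1  # stands in for the expensive task_instance query
        yield GaugeMetricFamily("example_task_instances", "Task instances (illustrative)", value=42)


# The default REGISTRY uses auto_describe=True; a private registry with the
# same setting reproduces its registration behaviour without global state.
registry = CollectorRegistry(auto_describe=True)
collector = CollectOnlyCollector()
registry.register(collector)
print(collector.db_queries)  # 1: register() already ran collect()
```

Every process that loads the plugin and registers the collector this way pays for one full query, which is why each started task hit the database.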

In pg_stat_activity we saw that the query from get_task_state_info() in particular is heavy and takes multiple seconds to execute when the database is at 100% CPU. Postgres does a full table scan for that GROUP BY + COUNT subquery. We currently have around 500k rows in the task_instance table and run a db.m4.large AWS RDS instance.

A simple solution may be to implement the describe() method, which the registry calls instead of collect() at registration time (see https://github.com/prometheus/client_python#custom-collectors). Otherwise, the plugin should be enabled only on the webserver, not on the worker nodes.
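A minimal sketch of the describe() approach, assuming a custom collector like the one the plugin registers. DescribedCollector, METRIC_NAME and the query_task_states callable are hypothetical placeholders, not the plugin's actual code. describe() returns the metric family without samples, so registration does not touch the database; the query only runs when metrics are actually read.

```python
from prometheus_client import CollectorRegistry
from prometheus_client.core import GaugeMetricFamily

METRIC_NAME = "example_task_instances"  # illustrative, not the plugin's real metric
METRIC_DOC = "Number of task instances per state (illustrative)"


class DescribedCollector:
    """Hypothetical collector: describe() for registration, collect() for scrapes."""

    def __init__(self, query_task_states):
        # query_task_states is a hypothetical callable standing in for the
        # heavy GROUP BY + COUNT query; it returns (state, count) pairs.
        self._query_task_states = query_task_states
        self.collect_calls = 0

    def describe(self):
        # Called by register() instead of collect(). It only announces the
        # metric family (no samples), so no database query runs here.
        return [GaugeMetricFamily(METRIC_NAME, METRIC_DOC, labels=["state"])]

    def collect(self):
        # Runs only when metrics are read, e.g. on a Prometheus scrape.
        self.collect_calls += 1
        family = GaugeMetricFamily(METRIC_NAME, METRIC_DOC, labels=["state"])
        for state, count in self._query_task_states():
            family.add_metric([state], float(count))
        yield family


registry = CollectorRegistry(auto_describe=True)  # same setting as the default REGISTRY
collector = DescribedCollector(lambda: [("running", 3), ("failed", 1)])
registry.register(collector)
print(collector.collect_calls)  # 0: registration did not touch the database
print(registry.get_sample_value(METRIC_NAME, {"state": "running"}))  # 3.0
print(collector.collect_calls)  # 1
```

The query cost is not reduced, only moved to scrape time, so the scrape interval still determines how often the database is hit.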

elephantum · 2018-11-21T13:11:39Z

@zstoth @seelmann can you please check if v0.4.3 fixes this issue?

zstoth · 2018-11-23T12:45:58Z

It helps, yes. I applied the same solution locally, and it prevents the scheduler from calling the plugin. The DB operations are still heavy, but with the Prometheus scrape interval set to 5m the CPU usage only spikes to ~50% instead of 100%, so it's possible to live with that.
Maybe a note about this in the README would be helpful. Thanks for the fix!

elephantum · 2018-11-23T13:51:07Z

Cool!

@zstoth If you think clarifications in README are necessary - feel free to make a PR.

Also, I'll open another bug for the heavy CPU usage; maybe some extra indexes here and there will help.

elephantum assigned cleverCat Nov 19, 2018

elephantum mentioned this issue Nov 21, 2018

fix cpu problem #31 (Merged)

elephantum added the "bug" label (Something isn't working) Nov 21, 2018

elephantum closed this as completed Nov 23, 2018

elephantum mentioned this issue Nov 23, 2018

Heavy CPU usage on large number of DAGs #35 (Closed)
