Airflow database CPU usage #23

Closed
zstoth opened this issue Nov 6, 2018 · 4 comments

@zstoth

zstoth commented Nov 6, 2018

After adding the plugin to our installation, the Airflow Postgres DB reported ~100% CPU usage until the plugin was removed. This made it impossible for the scheduler to schedule new tasks, so Airflow basically stopped functioning.
Our Airflow installation contains around 100 DAGs, some with a few thousand DAG runs. I think reloading all of this was too much for the DB. Do you know of any workaround for this issue?

@seelmann

We experienced the same problem today.

The problem is that the plugin is also activated on the workers whenever a task instance is executed, so for each started task the collect() function is called at plugin registration.

In pg_stat_activity we saw that the query from get_task_state_info() is especially heavy and takes multiple seconds to execute when the database is at 100% CPU. Postgres does a full table scan for its group-by + count subquery. We currently have around 500k rows in the task_instance table and run a db.m4.large AWS RDS instance.

A simple solution may be to implement the describe() function, which is then called instead of collect() at registration time (see https://github.com/prometheus/client_python#custom-collectors). Otherwise, the plugin should be enabled only on the webserver and not on the worker nodes.
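
To illustrate the describe() idea, here is a minimal sketch and not the plugin's actual code. The class name `TaskStateCollector`, the `fetch_counts` callable, and the metric name `example_task_state_count` are all made up for this example. It assumes the behaviour described in the client_python README: when a collector has a describe() method, the registry calls that at registration instead of falling back to collect().

```python
class TaskStateCollector:
    """Custom collector whose registration does not touch the database."""

    def __init__(self, fetch_counts):
        # fetch_counts is a hypothetical callable that runs the heavy query
        # and returns (dag_id, task_id, state, count) rows.
        self._fetch_counts = fetch_counts

    def describe(self):
        # Because describe() exists, the registry uses it at registration
        # time instead of collect(). An empty list means no query is run.
        return []

    def collect(self):
        # Local import: only the scrape path needs the client library.
        from prometheus_client.core import GaugeMetricFamily

        family = GaugeMetricFamily(
            "example_task_state_count",  # hypothetical metric name
            "Number of task instances per DAG, task and state",
            labels=["dag_id", "task_id", "state"],
        )
        for dag_id, task_id, state, count in self._fetch_counts():
            family.add_metric([dag_id, task_id, state], count)
        yield family


# Stand-in for the real database query, so we can count how often it runs.
query_calls = []


def fake_fetch_counts():
    query_calls.append(1)
    return [("example_dag", "example_task", "success", 3)]


collector = TaskStateCollector(fake_fetch_counts)
described = list(collector.describe())  # what registration would trigger
print(described, len(query_calls))
```

Registering it would then be `prometheus_client.REGISTRY.register(collector)`, and the query should only run when Prometheus actually scrapes the endpoint.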

@elephantum
Contributor

@zstoth @seelmann can you please check if v0.4.3 fixes this issue?

elephantum added the bug label on Nov 21, 2018

@zstoth
Author

zstoth commented Nov 23, 2018

It helps, yes. I applied the same solution locally, and it prevents the scheduler from calling the plugin. The DB operations are still heavy, but with the Prometheus scrape interval set to 5m they cause only ~50% CPU usage spikes instead of 100%, so it's possible to live with that.
Maybe a note about that in the README would be helpful. Thanks for the fix!

@elephantum
Contributor

Cool!

@zstoth If you think clarifications in the README are necessary, feel free to make a PR.

Also, I'll open another bug for the heavy CPU usage; maybe some extra indexes here and there will help.
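
As a rough way to see what an index could change for a group-by + count query, here is a small sketch using an in-memory SQLite database as a stand-in. Everything in it is an assumption: the simplified `task_instance` columns, the query shape, and the index name `ti_dag_task_state` are illustrative only, and the Postgres planner may well decide differently, so the real test would be EXPLAIN ANALYZE against the actual Airflow database.

```python
import sqlite3

# Assumed shape of the heavy query: count task instances per DAG/task/state.
QUERY = """
    SELECT dag_id, task_id, state, COUNT(*) AS n
    FROM task_instance
    GROUP BY dag_id, task_id, state
"""


def query_plan(conn, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
# Simplified, hypothetical version of the table (no keys or constraints).
conn.execute(
    "CREATE TABLE task_instance ("
    "dag_id TEXT, task_id TEXT, execution_date TEXT, state TEXT)"
)

# Without an index: full scan plus a temporary structure for the grouping.
plan_before = query_plan(conn, QUERY)

# Hypothetical index whose column order matches the GROUP BY.
conn.execute(
    "CREATE INDEX ti_dag_task_state "
    "ON task_instance (dag_id, task_id, state)"
)

# With the index: rows can be read in group order from the index alone.
plan_after = query_plan(conn, QUERY)

print("before:", plan_before)
print("after: ", plan_after)
```

Note that the query still has to visit every row either way; the index can only avoid the extra sort or hash step and reading the wider table rows, so the gain on a 500k-row table is something to measure rather than assume.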
