New grid view in Airflow 2.3.0 has very slow performance on large DAGs relative to tree view in 2.2.5 #23772

Closed
1 of 2 tasks
joshzana opened this issue May 18, 2022 · 11 comments · Fixed by #23947
Labels
area:UI Related to UI/UX. For Frontend Developers. area:webserver Webserver related Issues kind:bug This is a clearly a bug

Comments

@joshzana

Apache Airflow version

2.3.0 (latest released)

What happened

I upgraded a local dev deployment of Airflow from 2.2.5 to 2.3.0, then loaded the new /dags/<dag_id>/grid page for a few dag ids.

On a big DAG, I’m seeing 30+ second latency on the /grid API, followed by a 10+ second delay each time I click a green rectangle. For a smaller DAG I tried, the page was pretty snappy.

I went back to 2.2.5 and loaded the tree view for comparison, and saw that the /tree/ endpoint on the large DAG had 9 seconds of latency, and clicking a green rectangle had instant responsiveness.

This is slow enough that it would be a blocker for my team to upgrade.

What you think should happen instead

The grid view should perform as well as the tree view it replaces.

How to reproduce

Generate a large DAG. Mine looks like the following:

  • 900 tasks
  • 150 task groups
  • 25 historical runs

Compare against a small DAG, in my case:

  • 200 tasks
  • 36 task groups
  • 25 historical runs

The large DAG is unusable; the small DAG is usable.
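
A DAG of roughly this shape can be generated programmatically. The sketch below is an assumption about the structure, not the actual DAG from this report: it spreads 900 no-op tasks evenly over 150 flat task groups (6 per group) with no dependencies between them. The names `build_layout`, `build_dag`, and `grid_perf_repro` are made up for illustration, and the 25 historical runs would still have to be created separately, for example by triggering the DAG manually.

```python
from datetime import datetime


def build_layout(n_groups=150, tasks_per_group=6):
    """Return {group_id: [task_id, ...]} for the synthetic DAG (pure Python)."""
    return {
        f"group_{g:03d}": [f"task_{t}" for t in range(tasks_per_group)]
        for g in range(n_groups)
    }


def build_dag(dag_id="grid_perf_repro", layout=None):
    """Build the DAG. Airflow is imported lazily so build_layout works without it."""
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator  # available from Airflow 2.3.0
    from airflow.utils.task_group import TaskGroup

    layout = layout or build_layout()
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,  # no schedule: trigger runs manually to build history
        catchup=False,
    ) as dag:
        for group_id, task_ids in layout.items():
            # Task ids are prefixed with the group id, so they stay unique per DAG.
            with TaskGroup(group_id=group_id):
                for task_id in task_ids:
                    EmptyOperator(task_id=task_id)
    return dag


layout = build_layout()
total_tasks = sum(len(task_ids) for task_ids in layout.values())
# Number of task-instance cells the grid has to show for 25 runs of this DAG.
cells = total_tasks * 25
```

In a DAG file, assigning `dag = build_dag()` at module level should be enough for Airflow to discover it. With 25 runs this layout gives 22,500 cells, against 5,000 for the 200-task DAG, which may help explain why only the large DAG becomes unusable.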

Operating System

Ubuntu 20.04.3 LTS (Focal Fossa)

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

Docker-compose deployment on an EC2 instance running ubuntu.
The Airflow webserver is a nearly stock image based on apache/airflow:2.3.0-python3.9

Anything else

Screenshot of load time:
image

GIF of click latency:
2022-05-17 21 26 26

Are you willing to submit PR?

  • Yes I am willing to submit a PR!


@joshzana joshzana added area:core kind:bug This is a clearly a bug labels May 18, 2022
@bbovenzi bbovenzi added area:UI Related to UI/UX. For Frontend Developers. area:webserver Webserver related Issues and removed area:core labels May 18, 2022
@bbovenzi
Contributor

Oh yes, 30s is way too long. I would say there are probably 2 issues here:

  1. Just getting the data together for the grid view in the webserver is taking too long.
  2. The UI is taking too long to render or re-render the page on new selections.

Could you share the network latency when you click auto-refresh and get a response in the network from grid_data or tree_data? That will help determine which is the bigger bottleneck to work on first.
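
One way to separate the two bottlenecks is to time the data request outside the browser, so that rendering cost is excluded. The helper below is only a sketch: `time_calls` and `summarize` are hypothetical names, and the URL and session cookie in the commented usage are placeholders that would have to be copied from the browser's network tab.

```python
import statistics
import time


def time_calls(fetch, repeats=3):
    """Call fetch() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Example usage (DATA_URL and SESSION_COOKIE are placeholders, not real values):
# import urllib.request
# req = urllib.request.Request(DATA_URL, headers={"Cookie": SESSION_COOKIE})
# print(summarize(time_calls(lambda: urllib.request.urlopen(req).read())))
```

If the timings measured this way are close to what the browser reports, the webserver is the dominant cost; if they are much lower, the time is going into rendering.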

@bbovenzi
Contributor

Also, a temporary workaround: changing the number of runs (num_runs) to something less than 25 should help a bit.
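
As a rough illustration of that workaround, the snippet below builds a grid URL with a lower `num_runs` query parameter. The base URL and DAG id are placeholders, `grid_url` is a hypothetical helper, and it assumes the grid page reads `num_runs` from the query string. The default of 25 normally comes from the `default_dag_run_display_number` option in the `[webserver]` config section, which could be lowered instead.

```python
from urllib.parse import quote, urlencode, urljoin


def grid_url(base_url, dag_id, num_runs):
    """Build a grid-view URL that requests fewer historical runs."""
    if num_runs < 1:
        raise ValueError("num_runs must be at least 1")
    path = f"dags/{quote(dag_id, safe='')}/grid"
    return urljoin(base_url, path) + "?" + urlencode({"num_runs": num_runs})


# Placeholder host and DAG id, for illustration only.
url = grid_url("http://localhost:8080/", "my_large_dag", 5)
```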

@bbovenzi bbovenzi self-assigned this May 18, 2022
@joshzana
Author

Thanks for the quick response @bbovenzi !

  • I enabled auto-refresh and am seeing 30+ seconds of latency from /tree_data
  • I just noticed that when the grid is showing, other UI is also super laggy (e.g., the Auto Refresh toggle and the Update button)
  • You are correct that if I go down to num_runs=5, things are better. I see 5 seconds of latency from /tree_data and click lagginess is much improved

@joshzana
Author

@bbovenzi Thanks for the progress so far. I tried out the new 2.3.1 release with #23813 in it, but this is still an issue for us. The big DAG mentioned above is now taking >60 seconds to load and then timing out (we have an nginx reverse proxy with a 60 second timeout on it). The timeout is on the airflow/dags/<dag_id>/grid HTTP GET. Smaller DAGs are able to load fine, but medium-sized ones are taking ~20 seconds to load.

Are you still planning more work on this issue? Can we get it into 2.3.2?

@bbovenzi
Contributor

Yes, that change was just for dynamic tasks. I am working on more optimizations. They just didn't make it in time for 2.3.1.

@bbovenzi
Contributor

Going to reopen, as we can still do more to improve performance for large DAGs.

@c-thiel
Contributor

c-thiel commented Jun 9, 2022

@bbovenzi we just updated to 2.3.2 but are still running into timeouts.
Are further optimizations planned from your side?
@joshzana does 2.3.2 do the trick for you?

@jedcunningham
Member

@c-thiel, #24284 should also help significantly. If you can give us some info on your DAG size/structure, I'd be happy to profile it further.

@sbailliez

I can confirm that I'm running 2.3.2 and the grid view is not usable (not loading) with a large number of tasks and the default of 25 runs. It was no problem in 2.2.x, and it loads fine on 2.3.x with 5 runs. The biggest number of tasks I have is 812 in one DAG, which is essentially 6 datasources loading 7 different days of data for 1-10 tables, with about 6 tasks for each, so things add up quickly.

@jedcunningham
Member

@c-thiel / @sbailliez, have you tried with #24284 applied in your environment? I'm curious if that gets it working for you.

@bbovenzi
Contributor

Don't worry that we closed this. We'll keep working on performance improvements.


5 participants