fix: compute historical metrics percentages from uncapped counts by MD-Mushfiqur123 · Pull Request #67549 · apache/airflow

MD-Mushfiqur123 · 2026-05-26T11:37:02Z

Problem

The dashboard "Historical Metrics" section shows incorrect percentages because the API endpoint /ui/dashboard/historical_metrics_data caps per-state counts at STATE_COUNT_CAP = 1000 (for performance), but the frontend computes the total as the sum of these capped values and uses it as the denominator for percentage calculations.

For example, if success has 100,000 task instances and every other state has 0, the API returns success: 1000 (capped) while all others are 0. The frontend computes total = 1000 and shows success: 100%. With many states hitting the cap, the total itself is wrong and percentages are meaningless.
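The effect described above can be reproduced with a small sketch. It only mimics the behavior stated in this description (per-state capping at STATE_COUNT_CAP, then summing the capped values). The function names and the sample counts are hypothetical and are not taken from the Airflow code base.

```python
STATE_COUNT_CAP = 1000  # cap value quoted in the description above


def capped(count: int, cap: int = STATE_COUNT_CAP) -> int:
    """Mimic the API side: report at most `cap` rows per state."""
    return min(count, cap)


def percentages_from_capped(true_counts: dict) -> dict:
    """Mimic the frontend side: the denominator is the sum of capped values."""
    reported = {state: capped(n) for state, n in true_counts.items()}
    total = sum(reported.values())
    return {
        state: (100.0 * n / total if total else 0.0)
        for state, n in reported.items()
    }


# Hypothetical workload: two states exceed the cap, one does not.
true_counts = {"success": 100_000, "failed": 1_500, "queued": 10}
shown = percentages_from_capped(true_counts)

# True shares, for comparison with what the dashboard would display.
true_total = sum(true_counts.values())
actual = {s: 100.0 * n / true_total for s, n in true_counts.items()}
```

With these sample numbers, `shown` reports success and failed as equal shares (both were capped to 1000), while `actual` shows success dominating. Even the uncapped `queued` state gets a distorted percentage because the denominator is wrong.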

Fix

Return the real uncapped total counts (dag_run_total_count and task_instance_total_count) alongside the existing capped per-state counts. The frontend now uses these uncapped totals as the denominator for percentage and progress-bar computations.

This is a minimal, backward-compatible change that:

Adds two COUNT(*) queries (fast aggregate queries with the same filters)
Adds two fields to the API response
Updates the frontend to use the uncapped totals instead of summing capped values
Preserves the existing capped display behavior ("N+" indicator, bar expansion, percentage hiding) when a state hits the limit
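A minimal sketch of the display logic listed above, assuming the response carries capped per-state counts plus one uncapped total. `StateRow` and `render_rows` are hypothetical names used for illustration only. The actual frontend is not written in Python and may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional

STATE_COUNT_CAP = 1000  # cap value quoted in the description above


@dataclass
class StateRow:
    label: str                 # e.g. "250" or "1000+"
    percent: Optional[float]   # None when the percentage is hidden


def render_rows(capped_counts: dict, total_count: int,
                cap: int = STATE_COUNT_CAP) -> dict:
    """Use the uncapped total as the denominator; hide percentages for capped states."""
    rows = {}
    for state, count in capped_counts.items():
        if count >= cap:
            # The real value is unknown beyond the cap, so show "N+" and no percentage.
            rows[state] = StateRow(label=f"{cap}+", percent=None)
        else:
            pct = 100.0 * count / total_count if total_count else 0.0
            rows[state] = StateRow(label=str(count), percent=pct)
    return rows


# Hypothetical response values: "success" hit the cap, "failed" did not.
rows = render_rows({"success": 1000, "failed": 250}, total_count=100_250)
```

Here `rows["failed"]` gets a percentage computed against the real total, while `rows["success"]` keeps the "1000+" label with its percentage hidden.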

Closes #67336

boring-cyborg · 2026-05-26T11:37:07Z

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
Be sure to read the Airflow Coding style.
Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

henry3260

I think the percentages can still be misleading when any state hits STATE_COUNT_CAP, since the per-state counts remain capped while the total is uncapped.

I'm also not sure we should add potentially expensive uncapped count queries just to make these percentages precise.
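To make the first concern concrete, here is a small sketch with hypothetical numbers. It divides capped per-state counts by an uncapped total, which is what a naive percentage computation would do under the proposed response shape.

```python
STATE_COUNT_CAP = 1000  # cap value referenced in this thread

# Hypothetical true counts: "success" is far above the cap.
true_counts = {"success": 100_000, "failed": 200}
uncapped_total = sum(true_counts.values())

capped_counts = {s: min(n, STATE_COUNT_CAP) for s, n in true_counts.items()}

# Capped numerator over uncapped denominator.
shares = {s: 100.0 * n / uncapped_total for s, n in capped_counts.items()}
true_shares = {s: 100.0 * n / uncapped_total for s, n in true_counts.items()}
```

The uncapped state ("failed") gets its true share, but the capped state is understated by roughly two orders of magnitude in this example, and the displayed shares no longer add up to anything close to 100.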

MD-Mushfiqur123 · 2026-05-27T04:22:36Z

Friendly ping — this PR is ready for review. All CI checks are passing.

MD-Mushfiqur123 · 2026-05-27T07:28:33Z

Thanks for the review @henry3260!

Re: misleading percentages when capped. The per-state counts are still capped at 1000, and the frontend already shows "N+" when a count reaches that cap (see state_count_limit in the response). The uncapped total just provides a better denominator so that non-capped states get accurate percentages. Capped states will show an understated percentage, but the "1000+" indicator signals the uncertainty. This is strictly better than the current behavior where all percentages are wrong (even for non-capped states) because the denominator itself is wrong.

Re: expensive COUNT(*) queries. Both dag_run_total_count and task_instance_total_count are simple COUNT(*) queries that hit the same indexed filters (dag_id, start_date, end_date) already used by the capped counting subqueries. In practice these are index-only scans that execute in single-digit milliseconds, adding negligible overhead to the overall endpoint.

An alternative would be to remove the caps entirely, but that was intentionally avoided to keep the endpoint fast for deployments with millions of task instances. The current approach is a minimal, pragmatic improvement.

henry3260 · 2026-05-27T09:25:24Z

IMO, we cannot assume these queries are cheap just because they are simple COUNT(*) queries. Am I missing something, or do we not have a composite index that matches this filter shape, such as (dag_id, start_date, end_date)?

My concern is that the existing capped queries are bounded by STATE_COUNT_CAP, while these new total count queries need to count every matching Dag run / task instance. For example, if there are 100000 matching Dag runs in the selected time range, and each Dag run has 50 task instances, then ti_total may need to count 5000000 task instance rows.
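The difference between a bounded count and a full count can be sketched with an in-memory SQLite table. This is a toy illustration only: the single-column schema, the LIMIT-in-subquery pattern, and the helper names are assumptions made for the example, not the queries Airflow actually runs. In the scenario above, the full count would have to cover 100000 × 50 = 5000000 rows.

```python
import sqlite3

STATE_COUNT_CAP = 1000  # cap value referenced in this thread

# Toy table standing in for a much larger task instance table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO task_instance (state) VALUES (?)",
    (("success",) for _ in range(5000)),
)


def capped_count(conn: sqlite3.Connection, state: str,
                 cap: int = STATE_COUNT_CAP) -> int:
    """Count matching rows, but let the inner query stop after `cap` rows."""
    row = conn.execute(
        "SELECT COUNT(*) FROM "
        "(SELECT 1 FROM task_instance WHERE state = ? LIMIT ?)",
        (state, cap),
    ).fetchone()
    return row[0]


def total_count(conn: sqlite3.Connection) -> int:
    """Count every row; the work grows with the size of the table."""
    return conn.execute("SELECT COUNT(*) FROM task_instance").fetchone()[0]


bounded = capped_count(conn, "success")
full = total_count(conn)
```

The bounded query never needs to look at more than `STATE_COUNT_CAP` matching rows, whereas the full count has no such upper bound. How expensive the full count is in practice depends on the database, the available indexes, and the joins involved.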

pierrejeambrun

Yes, these counts were omitted on purpose. On tables with 10 million rows, a simple count can take a very long time, depending on the joins etc. that are performed.

Current behavior is expected: percentages are computed from the (truncated) information returned to the front end. We can improve this if that's confusing, but computing a real hard count() on the entire table is not a possibility.

That's why we introduced cursor-based pagination and reworked this dashboard page. (It would take seconds to respond on big tables.)

MD-Mushfiqur123 · 2026-05-27T09:50:09Z

Thank you for the detailed review @pierrejeambrun. I understand now — the COUNT(*) queries were omitted on purpose due to performance concerns on large tables.

I will revert this PR to keep the original behavior. Would it be acceptable to add a state_count_limit field (the existing cap value) to the response so the frontend can clearly indicate when values are truncated? That way the UI can show "1000+" labels without needing uncapped totals.

boring-cyborg Bot added the area:API and area:UI labels May 26, 2026

MD-Mushfiqur123 requested review from bbovenzi, bugraoz93, choo121600, ephraimbuddy, guan404ming, jason810496, pierrejeambrun, rawwar, ryanahamilton, shubhamraj-git and vatsrahul1001 as code owners May 26, 2026 11:37

henry3260 reviewed May 26, 2026

pierrejeambrun requested changes May 27, 2026

MD-Mushfiqur123 closed this May 27, 2026

MD-Mushfiqur123 force-pushed the fix/issue-67336-historical-metrics-percentages branch from 628f11a to 1d5150e Compare May 27, 2026 09:51

MD-Mushfiqur123 wants to merge 0 commits into apache:main from MD-Mushfiqur123:fix/issue-67336-historical-metrics-percentages

3 participants
