fix: compute historical metrics percentages from uncapped counts #67549

Closed
MD-Mushfiqur123 wants to merge 0 commits into apache:main from MD-Mushfiqur123:fix/issue-67336-historical-metrics-percentages

Conversation

@MD-Mushfiqur123

Problem

The dashboard "Historical Metrics" section shows incorrect percentages because the API endpoint /ui/dashboard/historical_metrics_data caps per-state counts at STATE_COUNT_CAP = 1000 (for performance), but the frontend computes the total as the sum of these capped values and uses it as the denominator for percentage calculations.

For example, if success has 100,000 task instances and every other state has 0, the API returns success: 1000 (capped) while all others are 0. The frontend computes total = 1000 and shows success: 100%. With many states hitting the cap, the total itself is wrong and percentages are meaningless.
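A minimal sketch of the effect described above, in Python rather than the dashboard's actual frontend code. The helper names and the sample counts are hypothetical; only `STATE_COUNT_CAP = 1000` comes from the description.

```python
STATE_COUNT_CAP = 1000  # cap value named in the PR description


def cap_counts(real_counts: dict, cap: int = STATE_COUNT_CAP) -> dict:
    """Mimic the endpoint: every per-state count is truncated at the cap."""
    return {state: min(count, cap) for state, count in real_counts.items()}


def percentages_from_capped_sum(capped: dict) -> dict:
    """Mimic the frontend: the denominator is the sum of the capped values."""
    total = sum(capped.values())
    if total == 0:
        return {state: 0.0 for state in capped}
    return {state: 100 * count / total for state, count in capped.items()}


# Hypothetical real counts: two states exceed the cap, one does not.
real = {"success": 100_000, "failed": 2_000, "queued": 50}

capped = cap_counts(real)
shown = percentages_from_capped_sum(capped)

# Percentages computed from the real counts, for comparison.
real_total = sum(real.values())
true_pct = {state: 100 * count / real_total for state, count in real.items()}

for state in real:
    print(f"{state}: shown {shown[state]:.2f}%, real {true_pct[state]:.2f}%")
```

With these sample numbers, `success` and `failed` both collapse to the cap, so they are displayed with equal shares even though their real counts differ by a factor of 50.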

Fix

Return the real uncapped total counts (dag_run_total_count and task_instance_total_count) alongside the existing capped per-state counts. The frontend now uses these uncapped totals as the denominator for percentage and progress-bar computations.

This is a minimal, backward-compatible change that:

  • Adds two COUNT(*) queries (fast aggregate queries with the same filters)
  • Adds two fields to the API response
  • Updates the frontend to use the uncapped totals instead of summing capped values
  • Preserves the existing capped display behavior ("N+" indicator, bar expansion, percentage hiding) when a state hits the limit
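The proposed computation could be sketched roughly as follows. This is a Python stand-in for the frontend logic, not the code in this PR. `task_instance_total_count` and `state_count_limit` are names used in this thread; the `task_instance_states` key, the function name, and the sample values are assumptions made for illustration.

```python
from typing import Optional


def percentages_from_uncapped_total(
    capped_counts: dict, total_count: int, cap: int
) -> dict:
    """Use the uncapped total as the denominator.

    A state that reached the cap gets None, mirroring the "percentage
    hiding" behavior mentioned above, because its real count is unknown.
    """
    result: dict = {}
    for state, count in capped_counts.items():
        pct: Optional[float]
        if count >= cap or total_count <= 0:
            pct = None
        else:
            pct = 100 * count / total_count
        result[state] = pct
    return result


# Hypothetical response shape; only the two *_count / *_limit names
# below appear in this thread.
response = {
    "task_instance_states": {"success": 1000, "failed": 1000, "queued": 50},
    "task_instance_total_count": 102_050,
    "state_count_limit": 1000,
}

pct = percentages_from_uncapped_total(
    response["task_instance_states"],
    response["task_instance_total_count"],
    response["state_count_limit"],
)
print(pct)
```

In this sketch only states below the cap get a number, which is the case where the uncapped denominator actually helps.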

Closes #67336

@boring-cyborg boring-cyborg Bot added area:API Airflow's REST/HTTP API area:UI Related to UI/UX. For Frontend Developers. labels May 26, 2026
@boring-cyborg

boring-cyborg Bot commented May 26, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

Contributor

@henry3260 henry3260 left a comment

I think the percentages can still be misleading when any state hits STATE_COUNT_CAP, since the per-state counts remain capped while the total is uncapped.

I'm also not sure we should add potentially expensive uncapped count queries just to make these percentages precise.

@MD-Mushfiqur123
Author

Friendly ping — this PR is ready for review. All CI checks are passing.

@MD-Mushfiqur123
Author

Thanks for the review @henry3260!

Re: misleading percentages when capped: The per-state counts are still capped at 1000, and the frontend already shows "N+" when a count reaches that cap (see state_count_limit in the response). The uncapped total just provides a better denominator so that non-capped states get accurate percentages. Capped states will show an understated percentage, but the "1000+" indicator signals the uncertainty. This is strictly better than the current behavior where all percentages are wrong (even for non-capped states) because the denominator itself is wrong.

Re: expensive COUNT(*) queries: Both dag_run_total_count and task_instance_total_count are simple COUNT(*) queries that hit the same indexed filters (dag_id, start_date, end_date) already used by the capped counting subqueries. In practice these are index-only scans that execute in single-digit milliseconds, adding negligible overhead to the overall endpoint.

An alternative would be to remove the caps entirely, but that was intentionally avoided to keep the endpoint fast for deployments with millions of task instances. The current approach is a minimal, pragmatic improvement.

@henry3260
Contributor

> Thanks for the review @henry3260!
>
> Re: misleading percentages when capped: The per-state counts are still capped at 1000, and the frontend already shows "N+" when a count reaches that cap (see state_count_limit in the response). The uncapped total just provides a better denominator so that non-capped states get accurate percentages. Capped states will show an understated percentage, but the "1000+" indicator signals the uncertainty. This is strictly better than the current behavior where all percentages are wrong (even for non-capped states) because the denominator itself is wrong.
>
> Re: expensive COUNT(*) queries: Both dag_run_total_count and task_instance_total_count are simple COUNT(*) queries that hit the same indexed filters (dag_id, start_date, end_date) already used by the capped counting subqueries. In practice these are index-only scans that execute in single-digit milliseconds, adding negligible overhead to the overall endpoint.
>
> An alternative would be to remove the caps entirely, but that was intentionally avoided to keep the endpoint fast for deployments with millions of task instances. The current approach is a minimal, pragmatic improvement.

IMO, we cannot assume these queries are cheap just because they are simple COUNT(*) queries. Am I missing something, or do we not have a composite index that matches this filter shape, such as (dag_id, start_date, end_date)?

My concern is that the existing capped queries are bounded by STATE_COUNT_CAP, while these new total count queries need to count every matching Dag run / task instance. For example, if there are 100000 matching Dag runs in the selected time range, and each Dag run has 50 task instances, then ti_total may need to count 5000000 task instance rows.
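A back-of-the-envelope version of that example, as a sketch. The number of states and the idea that each capped per-state query reads at most `STATE_COUNT_CAP` rows are assumptions made here for comparison, not measurements of the real queries.

```python
# Figures from the example above.
matching_dag_runs = 100_000
task_instances_per_run = 50

# Rows an uncapped COUNT(*) over task instances would have to cover.
rows_uncapped = matching_dag_runs * task_instances_per_run

# Assumed upper bound for the capped approach: each state's query stops
# after STATE_COUNT_CAP rows. The state count is a hypothetical value.
STATE_COUNT_CAP = 1000
assumed_state_count = 10
rows_capped_upper_bound = STATE_COUNT_CAP * assumed_state_count

print(f"uncapped count covers   {rows_uncapped:,} rows")
print(f"capped counts cover <=  {rows_capped_upper_bound:,} rows")
print(f"ratio: {rows_uncapped / rows_capped_upper_bound:.0f}x")
```

The point is that the capped work stays roughly constant as the table grows, while the uncapped count grows with the number of matching rows.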

Member

@pierrejeambrun pierrejeambrun left a comment

Yes, these counts were omitted on purpose. On tables with 10 million rows, a simple count can take a very long time depending on the joins, etc. performed.

Current behavior is expected: percentages are computed from the (truncated) information returned to the front end. We can improve this if it's confusing, but computing a real hard count() on the entire table is not an option.

That's why we introduced cursor-based pagination and reworked this dashboard page. (It would take seconds to respond on big tables.)
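To illustrate the difference between a bounded and an unbounded count, here is a small sketch using SQLite from the Python standard library. Counting over a `LIMIT`ed subquery is one common way to cap a count; whether Airflow's endpoint is implemented exactly this way is an assumption, and the table layout and helper names are hypothetical.

```python
import sqlite3


def capped_count(conn: sqlite3.Connection, state: str, cap: int) -> int:
    """Count rows for one state, but never look past `cap` matching rows."""
    row = conn.execute(
        "SELECT COUNT(*) FROM "
        "(SELECT 1 FROM task_instance WHERE state = ? LIMIT ?)",
        (state, cap),
    ).fetchone()
    return row[0]


def full_count(conn: sqlite3.Connection) -> int:
    """Unbounded count: has to cover every row in the table."""
    return conn.execute("SELECT COUNT(*) FROM task_instance").fetchone()[0]


# Toy in-memory table standing in for a much larger real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO task_instance (state) VALUES (?)",
    [("success",)] * 5000 + [("failed",)] * 20,
)

print(capped_count(conn, "success", 1000))  # stops at the cap
print(capped_count(conn, "failed", 1000))   # below the cap, exact
print(full_count(conn))                     # covers all rows
```

The capped variant returns the cap itself once a state exceeds it, which is the truncated value the dashboard then displays.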

@MD-Mushfiqur123
Author

Thank you for the detailed review @pierrejeambrun. I understand now — the COUNT(*) queries were omitted on purpose due to performance concerns on large tables.

I will revert this PR to keep the original behavior. Would it be acceptable to add a state_count_limit field (the existing cap value) to the response so the frontend can clearly indicate when values are truncated? That way the UI can show "1000+" labels without needing uncapped totals.
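The labelling idea could look something like this, sketched in Python rather than the frontend's own code. The function name is hypothetical; `state_count_limit` is the field name proposed above.

```python
def format_state_count(count: int, state_count_limit: int) -> str:
    """Render a per-state count, marking values that reached the cap."""
    if count >= state_count_limit:
        # The real count is at least the limit, so show "N+".
        return f"{state_count_limit}+"
    return str(count)


for count in (42, 999, 1000):
    print(format_state_count(count, state_count_limit=1000))
```

This needs only the cap value from the response, so no uncapped totals are required.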

@MD-Mushfiqur123 MD-Mushfiqur123 force-pushed the fix/issue-67336-historical-metrics-percentages branch from 628f11a to 1d5150e Compare May 27, 2026 09:51

Labels

area:API Airflow's REST/HTTP API area:UI Related to UI/UX. For Frontend Developers.

Development

Successfully merging this pull request may close these issues.

Dashboard summary page shows wrong percentages when a state count exceeds the API cap (1000)

3 participants