Improve AIP progress tracker example for accuracy by vikramkoka · Pull Request #68037 · apache/airflow

vikramkoka · 2026-06-04T20:28:59Z

A key addition here is an AI validation step.

The example DAG was producing hallucinated output, including fabricated completion percentages, invented blockers, and missed shipped work. There were several causes, including an evidence pipeline that was too thin and prompts that were too permissive.

Key changes:

- Add AIP registry with Confluence page IDs, GitHub search aliases, and codebase directory paths for multi-strategy evidence gathering
- Fetch GitHub file tree (Git Trees API) for codebase-level evidence
- Replace flat 3000-char spec truncation with section-aware parsing
- Replace completion_pct/blockers Pydantic model with per-deliverable DeliverableStatus (name, status, evidence, confidence)
- Add grounding rules to analysis/synthesis/validation system prompts
- Add three-layer quality pipeline: AI validation (LLMOperator) identifies ungrounded claims, deterministic apply_validation task does mechanical find-and-replace, human reviews the corrected report
- Add arithmetic validation that cross-checks X/Y fractions against structured analysis data (catches validator-introduced errors)
- Set temperature=0 on all LLM calls for run-to-run consistency
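To illustrate the move from a free-form `completion_pct` to per-deliverable records, here is a minimal sketch. It is not the PR's code: the PR describes a Pydantic model, while this sketch uses a stdlib dataclass so that it runs without dependencies. The allowed `status` and `confidence` values and the `progress_fraction` helper are assumptions made for illustration; only the four field names come from the description above.

```python
from dataclasses import dataclass
from typing import Literal

# Assumed value sets; the PR description only names the four fields.
Status = Literal["done", "in_progress", "not_started", "unknown"]
Confidence = Literal["high", "medium", "low"]


@dataclass(frozen=True)
class DeliverableStatus:
    name: str
    status: Status
    evidence: str  # what was actually observed, e.g. a file path or PR reference
    confidence: Confidence


def progress_fraction(deliverables: list[DeliverableStatus]) -> str:
    """Derive an 'X/Y' fraction by counting records (hypothetical helper).

    Counting structured records avoids asking the model for a percentage
    that it could simply make up.
    """
    done = sum(1 for d in deliverables if d.status == "done")
    return f"{done}/{len(deliverables)}"
```

With this shape, progress is computed from the records rather than generated, so every number in the report can be traced back to a named deliverable and its evidence.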

Was generative AI tooling used to co-author this PR?

[x] Yes (please specify the tool below)

…68042)

A large diff to example DAGs (e.g. a single provider example like #68037, +667/-119) tripped the `_is_large_enough_pr` line-count gate, which set `full-tests-needed=true` and fanned out the entire matrix (core DB tests, Kubernetes, Helm, PROD images, all-provider compat and special tests) for what is illustrative, non-shipped code.

Exclude `example_dags/` from `PYTHON_PRODUCTION_FILES` (the "production code" definition that feeds the line-count gate) for both the airflow-core top-level `airflow/example_dags/` and the nested `providers/<name>/.../example_dags/` layout.

Example DAGs are still selected for their own tests via the broader `ALL_AIRFLOW_PYTHON_FILES` / `ALL_PROVIDERS_PYTHON_FILES` groups, so they keep running the relevant core/provider tests; they just no longer force the full matrix.
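The effect of excluding `example_dags/` from the line-count gate can be sketched roughly as below. This is a simplified illustration and not Airflow's selective-checks code: the regex, the function names, the file paths, and the threshold value are all hypothetical, and the real "production code" definition involves more than the two conditions checked here.

```python
import re

# Hypothetical pattern: any path segment named example_dags/,
# whether top-level or nested inside a provider package.
EXAMPLE_DAGS = re.compile(r"(^|/)example_dags/")


def is_production_python_file(path: str) -> bool:
    """Treat a .py file as production code unless it lives under example_dags/."""
    return path.endswith(".py") and EXAMPLE_DAGS.search(path) is None


def exceeds_line_gate(changed_lines: dict[str, int], threshold: int) -> bool:
    """Sum changed lines over production files only and compare to the gate."""
    total = sum(
        lines
        for path, lines in changed_lines.items()
        if is_production_python_file(path)
    )
    return total > threshold
```

Under these assumptions, a large change confined to an example DAG contributes zero lines to the gate, while the same change in a non-example module still counts in full.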

…nly changes (apache#68042)

(cherry picked from commit 4adf4e6)

Co-authored-by: Shahar Epstein <60007259+shahar1@users.noreply.github.com>


The same file now contains two DAGs that solve the same use case, tracking AIP implementation progress, with different architectures:

1. example_aip_progress_tracker (pipeline): a 12-task deterministic pipeline with per-AIP LLM analysis, structured Pydantic output, AI validation, and arithmetic correction. More accurate, more auditable, and cheaper in tokens (~66K total), but more complex.
2. example_aip_progress_tracker_skills (agent): a single AgentOperator with the aip-tracker skill loaded via AgentSkillsToolset, plus custom tool functions for the Confluence and GitHub APIs. A simpler DAG (2 tasks), but with less control over output discipline (~82K tokens, coarser granularity).

The aip-tracker SKILL.md bundle teaches the agent the same grounding rules the pipeline enforces structurally: spec-level deliverable granularity, fraction-only progress format, evidence-backed assessments, and a mandatory self-verification checklist.

Also strengthens the pipeline DAG's arithmetic validation to cross-check per-AIP fractions and summary totals against structured analysis data.
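The arithmetic cross-check described above could look something like the following sketch. It assumes a report format in which each AIP line carries an `X/Y` fraction on the same line as the AIP identifier; that format, the function names, and the `counts` mapping are illustrative assumptions, not the PR's implementation.

```python
import re

# Assumed line format, e.g. "AIP-99: 3/5 deliverables done".
# [^\d\n]*? keeps the fraction on the same line as the AIP identifier.
AIP_FRACTION = re.compile(r"(AIP-\d+)([^\d\n]*?)(\d+)/(\d+)")


def correct_fractions(report: str, counts: dict[str, tuple[int, int]]) -> str:
    """Overwrite each per-AIP fraction with the (done, total) pair from structured data."""

    def _fix(match: re.Match) -> str:
        aip, separator = match.group(1), match.group(2)
        expected = counts.get(aip)
        if expected is None:
            return match.group(0)  # no structured data for this AIP: leave untouched
        return f"{aip}{separator}{expected[0]}/{expected[1]}"

    return AIP_FRACTION.sub(_fix, report)


def summary_total(counts: dict[str, tuple[int, int]]) -> str:
    """Recompute the overall fraction from the per-AIP counts."""
    done = sum(d for d, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return f"{done}/{total}"
```

Because the correction is a deterministic substitution driven by structured counts, it can also undo a wrong number introduced by the AI validation step itself.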

Based on feedback from Kaxil, removed the duplicate import of `re` and resolved the redundant definition of `_github_headers`.

Fix mypy errors in AIP tracker skills DAG for _safe_api_get return type

Narrow type guard from `isinstance(data, str)` to `not isinstance(data, dict)` so mypy recognizes that `.get()` calls are valid after the check, since `_safe_api_get` returns `dict | list | str`.
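The narrowing issue can be reproduced in a small sketch. The `extract_title` function and the `"title"` key are hypothetical; only the `dict | list | str` union and the two guard forms come from the commit message.

```python
from __future__ import annotations


def extract_title(data: dict | list | str) -> str:
    """Return the 'title' field if data is a dict, else an empty string.

    A guard of `if isinstance(data, str): return ""` would leave the type
    as `dict | list`, so mypy would reject `.get()` because lists have no
    such method. Guarding on `not isinstance(data, dict)` narrows the
    remaining type to `dict`.
    """
    if not isinstance(data, dict):
        return ""
    return str(data.get("title", ""))
```

The stricter guard also changes runtime behavior slightly: a list response is now handled by the early return instead of raising `AttributeError` on `.get()`.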

vikramkoka requested review from gopidesupavan and kaxil as code owners June 4, 2026 20:29

boring-cyborg Bot added area:providers provider:common-ai labels Jun 4, 2026

shahar1 mentioned this pull request Jun 4, 2026

Don't force full test matrix for large example_dags-only changes #68042

Merged

1 task

kaxil reviewed Jun 4, 2026


Comment thread ...iders/common/ai/src/airflow/providers/common/ai/example_dags/example_aip_progress_tracker.py Outdated

Comment thread ...iders/common/ai/src/airflow/providers/common/ai/example_dags/example_aip_progress_tracker.py Outdated

kaxil approved these changes Jun 5, 2026


github-actions Bot mentioned this pull request Jun 5, 2026

[v3-2-test] Don't force the full test matrix for large example_dags-only changes (#68042) #68052

Closed

vikramkoka added 3 commits June 5, 2026 16:00

Removed duplicate import and redundant definition

e6133bc


ashb force-pushed the aip99-example-aiptracker branch from 8fa7926 to e6133bc June 5, 2026 15:00

vikramkoka merged commit db6ce84 into main Jun 5, 2026
94 checks passed

vikramkoka deleted the aip99-example-aiptracker branch June 5, 2026 20:16

vikramkoka merged 4 commits into main from aip99-example-aiptracker
