Improve AIP progress tracker example for accuracy#68037
Merged
Conversation
1 task
kaxil
reviewed
Jun 4, 2026
kaxil
approved these changes
Jun 5, 2026
potiuk
pushed a commit
that referenced
this pull request
Jun 5, 2026
…68042) A large diff to example DAGs (e.g. a single provider example like #68037, +667/-119) tripped the `_is_large_enough_pr` line-count gate, which set `full-tests-needed=true` and fanned out the entire matrix — core DB tests, Kubernetes, Helm, PROD images, all-provider compat and special tests — for what is illustrative, non-shipped code. Exclude `example_dags/` from `PYTHON_PRODUCTION_FILES` (the "production code" definition that feeds the line-count gate) for both the airflow-core top-level `airflow/example_dags/` and the nested `providers/<name>/.../example_dags/` layout. Example DAGs are still selected for their own tests via the broader `ALL_AIRFLOW_PYTHON_FILES` / `ALL_PROVIDERS_PYTHON_FILES` groups, so they keep running the relevant core/provider tests — they just no longer force the full matrix.
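The effect of the exclusion can be sketched roughly as below. This is a minimal illustration, not the actual selective-checks code: `is_production_python_file`, `production_line_count`, and the sample paths are hypothetical names introduced here, and the real `PYTHON_PRODUCTION_FILES` definition is a file-group pattern whose exact form is not shown in this thread.

```python
import re

# Assumption: any path segment named example_dags/ marks illustrative code,
# covering both the top-level and the nested provider layout.
EXAMPLE_DAGS = re.compile(r"(^|/)example_dags/")


def is_production_python_file(path: str) -> bool:
    """Return True if a changed file should count toward the line-count gate."""
    if not path.endswith(".py"):
        return False
    return EXAMPLE_DAGS.search(path) is None


def production_line_count(changes: dict) -> int:
    """Sum changed lines over production files only.

    `changes` maps a file path to its number of changed lines.
    """
    return sum(
        lines for path, lines in changes.items() if is_production_python_file(path)
    )


# Hypothetical change set: a large example-DAG diff plus a small core change.
changes = {
    "airflow/example_dags/example_a.py": 100,
    "providers/acme/src/example_dags/example_tracker.py": 786,
    "airflow/models/dag.py": 12,
}
print(production_line_count(changes))
```

With the example DAGs filtered out, only the 12 changed lines in the core file would be weighed against the gate's threshold.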
github-actions Bot
pushed a commit
to aws-mwaa/upstream-to-airflow
that referenced
this pull request
Jun 5, 2026
…nly changes (apache#68042) A large diff to example DAGs (e.g. a single provider example like apache#68037, +667/-119) tripped the `_is_large_enough_pr` line-count gate, which set `full-tests-needed=true` and fanned out the entire matrix — core DB tests, Kubernetes, Helm, PROD images, all-provider compat and special tests — for what is illustrative, non-shipped code. Exclude `example_dags/` from `PYTHON_PRODUCTION_FILES` (the "production code" definition that feeds the line-count gate) for both the airflow-core top-level `airflow/example_dags/` and the nested `providers/<name>/.../example_dags/` layout. Example DAGs are still selected for their own tests via the broader `ALL_AIRFLOW_PYTHON_FILES` / `ALL_PROVIDERS_PYTHON_FILES` groups, so they keep running the relevant core/provider tests — they just no longer force the full matrix. (cherry picked from commit 4adf4e6) Co-authored-by: Shahar Epstein <60007259+shahar1@users.noreply.github.com>
aws-airflow-bot
pushed a commit
to aws-mwaa/upstream-to-airflow
that referenced
this pull request
Jun 5, 2026
…nly changes (apache#68042) A large diff to example DAGs (e.g. a single provider example like apache#68037, +667/-119) tripped the `_is_large_enough_pr` line-count gate, which set `full-tests-needed=true` and fanned out the entire matrix — core DB tests, Kubernetes, Helm, PROD images, all-provider compat and special tests — for what is illustrative, non-shipped code. Exclude `example_dags/` from `PYTHON_PRODUCTION_FILES` (the "production code" definition that feeds the line-count gate) for both the airflow-core top-level `airflow/example_dags/` and the nested `providers/<name>/.../example_dags/` layout. Example DAGs are still selected for their own tests via the broader `ALL_AIRFLOW_PYTHON_FILES` / `ALL_PROVIDERS_PYTHON_FILES` groups, so they keep running the relevant core/provider tests — they just no longer force the full matrix. (cherry picked from commit 4adf4e6) Co-authored-by: Shahar Epstein <60007259+shahar1@users.noreply.github.com>
…e-backed results

The example DAG was producing hallucinated output -- fabricated completion percentages, invented blockers, and missed shipped work -- because the evidence pipeline was too thin and the prompts too permissive.

Key changes:
- Add AIP registry with Confluence page IDs, GitHub search aliases, and codebase directory paths for multi-strategy evidence gathering
- Fetch GitHub file tree (Git Trees API) for codebase-level evidence
- Replace flat 3000-char spec truncation with section-aware parsing
- Replace completion_pct/blockers Pydantic model with per-deliverable DeliverableStatus (name, status, evidence, confidence)
- Add grounding rules to analysis/synthesis/validation system prompts
- Add three-layer quality pipeline: AI validation (LLMOperator) identifies ungrounded claims, deterministic apply_validation task does mechanical find-and-replace, human reviews the corrected report
- Add arithmetic validation that cross-checks X/Y fractions against structured analysis data (catches validator-introduced errors)
- Set temperature=0 on all LLM calls for run-to-run consistency
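As an illustration of the difference between flat truncation and section-aware parsing, here is a minimal sketch. It assumes the spec text has already been converted to Markdown-style `#` headings; the function names, the keyword list, and the budget are hypothetical and not taken from the DAG itself.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(spec: str) -> list:
    """Split a spec into (title, body) pairs at heading lines."""
    sections = []
    title = "preamble"
    body = []
    for line in spec.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section; skip an empty leading preamble.
            if body or title != "preamble":
                sections.append((title, "\n".join(body).strip()))
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    sections.append((title, "\n".join(body).strip()))
    return sections


def select_sections(spec: str, keywords: list, budget: int) -> str:
    """Keep whole sections whose title contains a keyword, within a char budget."""
    picked = []
    used = 0
    for title, body in split_sections(spec):
        if not any(keyword in title.lower() for keyword in keywords):
            continue
        chunk = f"## {title}\n{body}"
        if used + len(chunk) > budget:
            break  # never cut a section in half
        picked.append(chunk)
        used += len(chunk)
    return "\n\n".join(picked)


spec = (
    "Intro text\n# Motivation\nwhy\n"
    "# Deliverables\n- scheduler change\n- UI change\n"
    "# History\nold"
)
print(select_sections(spec, ["deliverable"], budget=3000))
```

A flat `spec[:3000]` slice keeps whatever happens to come first; selecting whole sections keeps the parts the analysis prompt actually needs and never ends mid-sentence.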
Same file now contains two DAGs that solve the same use case -- tracking AIP implementation progress -- with different architectures:

1. example_aip_progress_tracker (pipeline): 12-task deterministic pipeline with per-AIP LLM analysis, structured Pydantic output, AI validation, and arithmetic correction. More accurate, more auditable, fewer tokens (~66K total), but more complex.
2. example_aip_progress_tracker_skills (agent): Single AgentOperator with the aip-tracker skill loaded via AgentSkillsToolset plus custom tool functions for Confluence/GitHub APIs. Simpler DAG (2 tasks), but less control over output discipline (~82K tokens, coarser granularity).

The aip-tracker SKILL.md bundle teaches the agent the same grounding rules the pipeline enforces structurally: spec-level deliverable granularity, fraction-only progress format, evidence-backed assessments, and a mandatory self-verification checklist.

Also strengthens the pipeline DAG's arithmetic validation to cross-check per-AIP fractions and summary totals against structured analysis data.
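The arithmetic cross-check might look roughly like the sketch below. It is only an approximation of the idea: the commit describes `DeliverableStatus` as a Pydantic model, while a stdlib dataclass stands in for it here, and the `"done"` status value, the helper names, and the `AIP-A`/`AIP-B` keys are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class DeliverableStatus:
    # Field names follow the commit message; the value vocabulary is assumed.
    name: str
    status: str
    evidence: str
    confidence: str


FRACTION = re.compile(r"\b(\d+)\s*/\s*(\d+)\b")


def expected_fraction(deliverables, done_status="done"):
    """Derive (done, total) from the structured analysis, not from prose."""
    done = sum(1 for d in deliverables if d.status == done_status)
    return done, len(deliverables)


def check_report_line(line, deliverables):
    """Return one error string per X/Y fraction that disagrees with the data."""
    expected = expected_fraction(deliverables)
    errors = []
    for match in FRACTION.finditer(line):
        found = (int(match.group(1)), int(match.group(2)))
        if found != expected:
            errors.append(
                f"found {found[0]}/{found[1]}, expected {expected[0]}/{expected[1]}"
            )
    return errors


def summary_totals(analyses):
    """Add up per-AIP fractions to get the figure the summary line must show."""
    pairs = [expected_fraction(items) for items in analyses.values()]
    return sum(p[0] for p in pairs), sum(p[1] for p in pairs)


analyses = {
    "AIP-A": [
        DeliverableStatus("x", "done", "merged PR", "high"),
        DeliverableStatus("y", "in_progress", "none found", "low"),
    ],
    "AIP-B": [DeliverableStatus("z", "done", "file in tree", "high")],
}
print(check_report_line("AIP-A: 2/2 deliverables done", analyses["AIP-A"]))
print(summary_totals(analyses))
```

Because the expected numbers are recomputed from the structured records, a fraction that the LLM validator altered in the report text can be detected without another model call.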
Based on feedback from Kaxil, removed the duplicate import of `re` and the redundant definition of `_github_headers`.
8fa7926 to
e6133bc
Compare
Fix mypy errors in AIP tracker skills DAG for _safe_api_get return type

Narrow type guard from `isinstance(data, str)` to `not isinstance(data, dict)` so mypy recognizes that `.get()` calls are valid after the check, since `_safe_api_get` returns `dict | list | str`.
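The narrowing issue can be reproduced in isolation. In this sketch `_safe_api_get` is a trivial stand-in that echoes its argument (the real helper's signature and behavior are not shown in this thread), and `page_title` and the `"title"` key are hypothetical.

```python
from typing import Union

# Equivalent to the `dict | list | str` return type named in the commit.
ApiResult = Union[dict, list, str]


def _safe_api_get(payload: ApiResult) -> ApiResult:
    """Stand-in for the real helper; it simply returns what it is given."""
    return payload


def page_title(payload: ApiResult) -> str:
    data = _safe_api_get(payload)
    # Checking `isinstance(data, str)` would only exclude str, leaving
    # `dict | list`; list has no .get(), so mypy would still report an error.
    if not isinstance(data, dict):
        return ""
    # After the guard, mypy has narrowed `data` to dict, so .get() type-checks.
    return str(data.get("title", ""))


print(page_title({"title": "Some AIP page"}))
print(page_title("request failed"))
print(page_title(["unexpected", "list"]))
```

The guard also makes the runtime behavior stricter: a list result now takes the early return instead of raising `AttributeError` on `.get()`.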
A key addition here is an AI validation step.
The example DAG was producing hallucinated output, including fabricated completion percentages, invented blockers, and missed shipped work. The causes included an evidence pipeline that was too thin and prompts that were too permissive.
Key changes:
Was generative AI tooling used to co-author this PR?