Improve AIP progress tracker example for accuracy #68037

Merged: vikramkoka merged 4 commits into main from aip99-example-aiptracker on Jun 5, 2026
@vikramkoka
Contributor

A key addition here is an AI validation step.

The example DAG was producing hallucinated output, including fabricated completion percentages, invented blockers, and missed shipped work. Among the causes: the evidence pipeline was too thin and the prompts were too permissive.

Key changes:

  • Add AIP registry with Confluence page IDs, GitHub search aliases, and codebase directory paths for multi-strategy evidence gathering
  • Fetch GitHub file tree (Git Trees API) for codebase-level evidence
  • Replace flat 3000-char spec truncation with section-aware parsing
  • Replace completion_pct/blockers Pydantic model with per-deliverable DeliverableStatus (name, status, evidence, confidence)
  • Add grounding rules to analysis/synthesis/validation system prompts
  • Add three-layer quality pipeline: AI validation (LLMOperator) identifies ungrounded claims, deterministic apply_validation task does mechanical find-and-replace, human reviews the corrected report
  • Add arithmetic validation that cross-checks X/Y fractions against structured analysis data (catches validator-introduced errors)
  • Set temperature=0 on all LLM calls for run-to-run consistency
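
To illustrate the move from a single `completion_pct` number to per-deliverable records, here is a minimal sketch. It is not the code from this PR: a stdlib dataclass stands in for the Pydantic model, the four field names come from the list above, and the allowed `status` values and the `progress_fraction` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DeliverableStatus:
    """One spec deliverable and the evidence behind its assessed status."""

    name: str
    status: str      # assumed values: "done", "in_progress", "not_started"
    evidence: str    # what backs the status, e.g. a file path or PR reference
    confidence: str  # assumed values: "high", "medium", "low"


def progress_fraction(deliverables: List[DeliverableStatus]) -> str:
    """Derive an X/Y fraction by counting, instead of asking the LLM for a percentage."""
    done = sum(1 for d in deliverables if d.status == "done")
    return f"{done}/{len(deliverables)}"


if __name__ == "__main__":
    items = [
        DeliverableStatus("first deliverable", "done", "placeholder evidence", "high"),
        DeliverableStatus("second deliverable", "in_progress", "placeholder evidence", "low"),
    ]
    print(progress_fraction(items))
```

Because the fraction is counted from structured records rather than generated as text, a reviewer can trace every "done" back to its `evidence` string.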

Was generative AI tooling used to co-author this PR?
  • [x] Yes (please specify the tool below)

potiuk pushed a commit that referenced this pull request Jun 5, 2026
…68042)

A large diff to example DAGs (e.g. a single provider example like
#68037, +667/-119) tripped the `_is_large_enough_pr`
line-count gate, which set `full-tests-needed=true` and fanned out the
entire matrix — core DB tests, Kubernetes, Helm, PROD images, all-provider
compat and special tests — for what is illustrative, non-shipped code.

Exclude `example_dags/` from `PYTHON_PRODUCTION_FILES` (the "production
code" definition that feeds the line-count gate) for both the airflow-core
top-level `airflow/example_dags/` and the nested
`providers/<name>/.../example_dags/` layout. Example DAGs are still selected
for their own tests via the broader `ALL_AIRFLOW_PYTHON_FILES` /
`ALL_PROVIDERS_PYTHON_FILES` groups, so they keep running the relevant
core/provider tests — they just no longer force the full matrix.
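
The exclusion described in this commit message can be sketched roughly as follows. This is not the actual selective-checks code: the regular expression, the function names, and the sample paths are hypothetical, and the real gate's threshold is deliberately left out because it is not stated here.

```python
import re
from typing import Dict

# Hypothetical pattern: any path that has an `example_dags/` directory component,
# which covers both the top-level and the nested provider layout.
EXAMPLE_DAGS = re.compile(r"(^|/)example_dags/")


def is_production_python_file(path: str) -> bool:
    """Treat a Python file as production code unless it lives under example_dags/."""
    return path.endswith(".py") and EXAMPLE_DAGS.search(path) is None


def production_lines_changed(changes: Dict[str, int]) -> int:
    """Sum changed lines over production files only (the input to a line-count gate)."""
    return sum(n for path, n in changes.items() if is_production_python_file(path))


if __name__ == "__main__":
    changes = {
        "providers/foo/src/pkg/example_dags/demo.py": 786,  # hypothetical path
        "airflow/models/dag.py": 10,
    }
    print(production_lines_changed(changes))
```

Under these assumptions, a large example-only diff contributes nothing to the count that feeds the gate, while the file can still be matched by broader file groups for its own tests.
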
github-actions Bot pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request Jun 5, 2026
…nly changes (apache#68042)

(cherry picked from commit 4adf4e6)

Co-authored-by: Shahar Epstein <60007259+shahar1@users.noreply.github.com>
aws-airflow-bot pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request Jun 5, 2026
…nly changes (apache#68042)

(cherry picked from commit 4adf4e6)

Co-authored-by: Shahar Epstein <60007259+shahar1@users.noreply.github.com>
…e-backed results

The example DAG was producing hallucinated output -- fabricated completion
percentages, invented blockers, and missed shipped work -- because the
evidence pipeline was too thin and the prompts too permissive.

Key changes:
- Add AIP registry with Confluence page IDs, GitHub search aliases, and
  codebase directory paths for multi-strategy evidence gathering
- Fetch GitHub file tree (Git Trees API) for codebase-level evidence
- Replace flat 3000-char spec truncation with section-aware parsing
- Replace completion_pct/blockers Pydantic model with per-deliverable
  DeliverableStatus (name, status, evidence, confidence)
- Add grounding rules to analysis/synthesis/validation system prompts
- Add three-layer quality pipeline: AI validation (LLMOperator) identifies
  ungrounded claims, deterministic apply_validation task does mechanical
  find-and-replace, human reviews the corrected report
- Add arithmetic validation that cross-checks X/Y fractions against
  structured analysis data (catches validator-introduced errors)
- Set temperature=0 on all LLM calls for run-to-run consistency
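
The "mechanical find-and-replace" layer in the list above might look something like the sketch below. The `Correction` type and the `apply_corrections` function are hypothetical names for this illustration and do not reproduce the `apply_validation` task itself. The idea shown is that the LLM only proposes exact-text corrections and plain string replacement applies them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Correction:
    """An ungrounded claim flagged by the validator and the text that should replace it."""

    original: str
    replacement: str


def apply_corrections(report: str, corrections: List[Correction]) -> Tuple[str, List[Correction]]:
    """Apply each correction verbatim; return the new report and any that did not match."""
    unmatched = []
    for c in corrections:
        if c.original and c.original in report:
            report = report.replace(c.original, c.replacement)
        else:
            # Surface corrections that cannot be applied so a human reviewer sees them.
            unmatched.append(c)
    return report, unmatched


if __name__ == "__main__":
    text = "The feature is 80% complete. Two deliverables shipped."
    fixes = [Correction("is 80% complete", "has 2/5 deliverables done")]
    print(apply_corrections(text, fixes))
```

Keeping this step deterministic means the validator cannot silently rewrite the report: every change is an exact substitution that can be logged and audited.
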

The same file now contains two DAGs that solve the same use case -- tracking
AIP implementation progress -- with different architectures:

1. example_aip_progress_tracker (pipeline): 12-task deterministic pipeline
   with per-AIP LLM analysis, structured Pydantic output, AI validation,
   and arithmetic correction. More accurate, more auditable, fewer tokens
   (~66K total), but more complex.

2. example_aip_progress_tracker_skills (agent): Single AgentOperator with
   the aip-tracker skill loaded via AgentSkillsToolset plus custom tool
   functions for Confluence/GitHub APIs. Simpler DAG (2 tasks), but less
   control over output discipline (~82K tokens, coarser granularity).

The aip-tracker SKILL.md bundle teaches the agent the same grounding
rules the pipeline enforces structurally: spec-level deliverable
granularity, fraction-only progress format, evidence-backed assessments,
and a mandatory self-verification checklist.

Also strengthens the pipeline DAG's arithmetic validation to cross-check
per-AIP fractions and summary totals against structured analysis data.
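
A cross-check of this kind can be sketched as below, assuming the report mentions each AIP and its `X/Y` fraction on the same line. The regular expression, the `check_fractions` name, and the sample AIP identifiers are assumptions for illustration. Summary totals could be verified the same way by summing the expected pairs.

```python
import re
from typing import Dict, List, Tuple

# Assumed report convention: an AIP identifier followed later on the same line by X/Y.
FRACTION = re.compile(r"(AIP-\d+)[^\n]*?\b(\d+)/(\d+)\b")


def check_fractions(report: str, expected: Dict[str, Tuple[int, int]]) -> List[str]:
    """Compare X/Y fractions in the report text with counts from the structured analysis."""
    problems = []
    for line in report.splitlines():
        match = FRACTION.search(line)
        if match is None:
            continue
        aip, done, total = match.group(1), int(match.group(2)), int(match.group(3))
        if aip in expected and expected[aip] != (done, total):
            want_done, want_total = expected[aip]
            problems.append(
                f"{aip}: report says {done}/{total}, analysis says {want_done}/{want_total}"
            )
    return problems


if __name__ == "__main__":
    expected = {"AIP-1": (3, 5), "AIP-2": (1, 4)}  # hypothetical analysis counts
    report = "AIP-1: 3/5 deliverables done\nAIP-2: 2/4 deliverables done"
    print(check_fractions(report, expected))
```
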

Based on feedback from Kaxil, removed the duplicate import of `re` and the redundant definition of `_github_headers`.
@ashb ashb force-pushed the aip99-example-aiptracker branch from 8fa7926 to e6133bc Compare June 5, 2026 15:00
Fix mypy errors in AIP tracker skills DAG for _safe_api_get return type

  Narrow type guard from `isinstance(data, str)` to `not isinstance(data, dict)`
  so mypy recognizes that `.get()` calls are valid after the check, since
  `_safe_api_get` returns `dict | list | str`.
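
The narrowing issue can be reproduced in isolation. In this sketch `fake_api_get` is a hypothetical stand-in for `_safe_api_get` that returns canned values instead of calling an API; only the `dict | list | str` return type is taken from the commit message.

```python
from __future__ import annotations


def fake_api_get(kind: str) -> dict | list | str:
    """Hypothetical stand-in: a JSON object, a JSON array, or an error string."""
    if kind == "object":
        return {"title": "example"}
    if kind == "array":
        return [1, 2, 3]
    return "error: request failed"


def title_of(kind: str) -> str | None:
    data = fake_api_get(kind)
    # Guarding with `isinstance(data, str)` would leave `dict | list` after the check,
    # so a type checker rejects `.get()`. Excluding everything that is not a dict
    # narrows `data` to `dict` on the line below.
    if not isinstance(data, dict):
        return None
    return data.get("title")


if __name__ == "__main__":
    print(title_of("object"), title_of("array"), title_of("error"))
```

The broader guard also changes runtime behavior slightly: a list response is now treated like an error string and skipped, rather than reaching a `.get()` call that would fail.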
@vikramkoka vikramkoka merged commit db6ce84 into main Jun 5, 2026
94 checks passed
@vikramkoka vikramkoka deleted the aip99-example-aiptracker branch June 5, 2026 20:16