License Notice
Copyright © 2026 Aleksy Pyrz. All rights reserved.
This repository is source-available for private, internal, non-commercial evaluation only. It is not open source.
You may view, clone, and run the code for private evaluation only. You may not use it commercially, deploy it in production, redistribute it, sublicense it, resell it, offer it as a hosted service, or claim it as your own.
See LICENSE.md for full terms.
For commercial licensing or acquisition discussions: aleksypyrz@gmail.com
Control Tower is a Python-based observability service for data pipeline operations.
This repository contains:
- API service (FastAPI) exposing Snowflake-backed operational endpoints.
- Aggregation UI backend (Flask) that merges AWS and Azure API responses.
- Deployment and verification scripts for Azure Web App and AWS App Runner.
- app/main.py: FastAPI service entrypoint and API routes.
- app/snowflake_client.py: Snowflake connection/auth/query logic.
- controltower_ui/app.py: Aggregation backend (AWS + Azure source merge).
- verify_and_deploy_API.sh: Azure API deploy + verification script.
- verify_and_deploy_ui.sh: Azure UI deploy + verification script.
- deploy/create_apprunner_controltower.sh: first-time AWS App Runner service creation.
- deploy/deploy_apprunner_controltower.sh: App Runner update using image identifier.
- deploy/deploy_apprunner_controltower_rel.sh: App Runner update by building/pushing release image.
- scripts/git_update.sh: safe Git update workflow for deployment branches.
- scripts/post_deploy_api_smoke.sh: API smoke tests.
- scripts/post_deploy_ui_smoke.sh: UI smoke tests.
- scripts/post_deploy_truth_check.sh: deployment truth verification (/health, /version, UI build banner, smoke index).
- scripts/ci_failure_probe.sh: intentional CI failure probe (for workflow failure-surfacing validation).
- scripts/ci_preflight.sh: CI config/secrets preflight classification (strict or warn mode).
- scripts/assisted_provider_validate.sh: assisted provider runtime validation probe (Bedrock/Azure OpenAI).
- scripts/release.sh: one-command release orchestrator (git update, deploy, smoke tests).
You may:
- View the source code
- Clone it for private evaluation
- Run and modify it for private, internal, non-production, non-commercial evaluation
You may not:
- Commercial use
- Production use
- Redistribution
- Resale
- Sublicensing
- Public hosting / SaaS use
- Copying substantial parts into another product
- Claiming the code as your own
- API service reads Snowflake env vars and serves operational endpoints under /api/*.
- UI aggregation service proxies and merges responses from:
  - AWS_BASE
  - AZURE_BASE
- Both services expose /health endpoints for liveness checks.
- UI overview and incidents views show operator-visible warning banners when a source is degraded and when endpoint capabilities are degraded.
Install the following tools locally:
- python3 (3.11+ recommended)
- pip
- jq
- curl
- zip / unzip
- git
- az (Azure CLI) for Azure deployments
- aws + docker for AWS/App Runner deployments
Required:
- SF_USER
- one of: SF_ACCOUNT_IDENTIFIER or SF_ACCOUNT_URL
- one auth method: SF_PRIVATE_KEY_PEM_B64 (recommended), or SF_PASSWORD / SNOWFLAKE_PASSWORD
Recommended:
- SF_ROLE
- SF_WAREHOUSE
- SF_DATABASE (default often BHP_PLATFORM_LAB)
- SF_SCHEMA (default often OBS)
Required for merged mode:
- AWS_BASE (example: https://api-controltower.aws.example.com)
- AZURE_BASE (example: https://api-controltower.azure.example.com)
Set these in API and UI environments so /health and /version show exact deployed build identity:
- BUILD_ID
- GIT_SHA
- RELEASE_TAG
The smoke scripts hard-assert that these values are present and non-empty in /version.
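The presence check on /version can be sketched in Python. This is a minimal illustration, not the repository's smoke script: the helper name missing_build_fields is hypothetical, and it assumes the payload nests the three values under a build object (the smoke checks refer to them as build.build_id, build.git_sha, and build.release_tag).

```python
from typing import Any

# Field names as referenced by the smoke checks (build.build_id, etc.).
REQUIRED_BUILD_FIELDS = ("build_id", "git_sha", "release_tag")


def missing_build_fields(version_payload: dict[str, Any]) -> list[str]:
    """Return the build fields that are absent or empty in a /version payload.

    Assumption: the payload looks like {"build": {"build_id": "...", ...}}.
    """
    build = version_payload.get("build")
    if not isinstance(build, dict):
        # No build object at all: every field counts as missing.
        return list(REQUIRED_BUILD_FIELDS)
    missing = []
    for field in REQUIRED_BUILD_FIELDS:
        value = build.get(field)
        # None, non-string values, and whitespace-only strings are treated as missing.
        if not isinstance(value, str) or not value.strip():
            missing.append(field)
    return missing
```

An empty return value means the deployed build identity is fully stamped; any other result names the variables to export before redeploying.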
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
Health check:
curl -fsS http://localhost:8000/health | jq .
cd controltower_ui
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export AWS_BASE="http://localhost:8000"
export AZURE_BASE="http://localhost:8000"
python3 app.py
UI health:
curl -fsS http://localhost:8000/health | jq .
Use the repository script to keep deployment branches updated and avoid interactive Git mistakes.
BASE_BRANCH=main TARGET_BRANCH=main ./scripts/git_update.sh
BASE_BRANCH=main TARGET_BRANCH=release PUSH=1 ./scripts/git_update.sh
What it does:
- fetches latest refs
- checks out target branch
- fast-forwards main or rebases non-main target on base
- optionally pushes when PUSH=1
Export these before running deploy/release flows:
export BUILD_ID="$(date +%Y-%m-%d).1"
export GIT_SHA="$(git rev-parse --short HEAD)"
export RELEASE_TAG="controltower-v0.4.1"
If these are missing, smoke tests can fail at /version assertions.
Step 1: set variables (or rely on defaults in script)
export API_DIR="/Users/aleksypyrz/Documents/GitHub/controltower_api"
export RG="rg-bhp-platformlab-dev-aue-app"
export WEBAPP_API="app-bhp-platformlab-dev-aue-controltower-api"
export API_URL="https://api-controltower.gitpushandpray.ai"
Step 2: run deploy + verify
./verify_and_deploy_API.sh
If BUILD_ID, GIT_SHA, and RELEASE_TAG are exported, deploy scripts apply them to App Service settings before deploy.
Optional: verify package only (no deploy)
SKIP_DEPLOY=1 ./verify_and_deploy_API.sh
Additional API deploy script behavior:
- Enforces Azure startup command to bash startup.sh on each run.
- Supports retry attempts with DEPLOY_RETRIES (default: 2).
- Uses az webapp deploy --track-status false to avoid false-negative tracking issues seen with OneDeploy status polling.
./verify_and_deploy_ui.sh
Notes:
- Script builds a zip, compares file hashes, deploys with az webapp deploy, then runs endpoint checks.
- If release stamp vars are exported, script writes BUILD_ID / GIT_SHA / RELEASE_TAG to UI app settings before deploy.
- Script supports env overrides for UI_DIR, RG, WEBAPP_UI, UI_URL, and DEPLOY_RETRIES.
- Export required Snowflake environment variables.
- Run:
./deploy/create_apprunner_controltower.sh
IMAGE_IDENTIFIER="<account>.dkr.ecr.ap-southeast-2.amazonaws.com/bhp-platformlab-controltower:rel-YYYYMMDD-HHMMSS" \
./deploy/deploy_apprunner_controltower.sh
SERVICE_ARN="arn:aws:apprunner:..." ./deploy/deploy_apprunner_controltower_rel.sh
API_URL="https://api-controltower.gitpushandpray.ai" ./scripts/post_deploy_api_smoke.sh
Impact endpoint contract is enforced by default (IMPACT_REQUIRED=1 default):
IMPACT_REQUIRED=1 API_URL="https://api-controltower.gitpushandpray.ai" ./scripts/post_deploy_api_smoke.sh
Checks:
- /health
- /version (hard assertion of build.build_id, build.git_sha, build.release_tag)
- /api/summary
- summary contract extensions:
  - stale_datasets (non-negative)
  - top_impacted_downstream_assets (array)
  - cost_24h_aud, cost_7d_aud, cost_7d_daily_avg_aud (non-negative)
- /api/pipelines
- pagination behavior on /api/pipelines/{id}/runs (offset must change output)
- direct pipeline SLA contract (/api/pipelines/{id}/sla)
- costs contract (/api/costs)
- first-class impact contract: /api/pipelines/{id}/impact must return HTTP 200 with required fields.
- lineage strategy contract: /api/pipelines/{id}/lineage must return deprecated alias metadata and point to /impact.
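The summary contract extensions listed above can be expressed as a small validator. This is a sketch under assumptions: the function name summary_contract_errors is hypothetical, and it assumes the extension fields sit at the top level of the /api/summary payload, which the checks do not state explicitly.

```python
from numbers import Real
from typing import Any

# Fields the checks describe as non-negative.
NON_NEGATIVE_FIELDS = (
    "stale_datasets",
    "cost_24h_aud",
    "cost_7d_aud",
    "cost_7d_daily_avg_aud",
)


def summary_contract_errors(summary: dict[str, Any]) -> list[str]:
    """Return human-readable violations of the summary contract extensions."""
    errors = []
    for field in NON_NEGATIVE_FIELDS:
        value = summary.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real) or value < 0:
            errors.append(f"{field} must be a non-negative number")
    if not isinstance(summary.get("top_impacted_downstream_assets"), list):
        errors.append("top_impacted_downstream_assets must be an array")
    return errors
```

Returning a list of violations rather than raising on the first one makes a failed smoke run report every broken field at once.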
UI_URL="https://controltower.gitpushandpray.ai" ./scripts/post_deploy_ui_smoke.sh
Checks:
- /health
- /version (hard assertion of build.build_id, build.git_sha, build.release_tag)
- aggregate summary
- aggregate summary contract extensions:
  - stale_datasets (non-negative)
  - top_impacted_downstream_assets (array)
  - cost_24h_aud, cost_7d_aud, cost_7d_daily_avg_aud (non-negative)
- aggregate costs contract (/api/aggregate/costs)
- aggregate pipeline views (aws, both)
- aggregate incidents
- aggregate pipeline SLA contract (/api/aggregate/pipelines/{id}/sla)
- aggregate runs pagination for AWS (offset=0 must differ from offset=3)
- pagination assertion compares items[].run_id sets (not full payload text) to avoid metadata timestamp false-passes
- semantic assertions:
  - pipeline p2 exists in aws and both
  - aws.runs_7d(p2) > 1000
  - aws.last_end_ts(p2) is not null
  - both.runs_7d(p2) >= aws.runs_7d(p2)
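The pagination and semantic assertions above can be sketched as follows. The function names are hypothetical and the smoke script itself is a shell script, so this only illustrates the logic. It assumes a runs page shaped like {"items": [{"run_id": ...}]} and pipeline rows that carry runs_7d and last_end_ts keys.

```python
from typing import Any, Optional


def run_id_set(page: dict[str, Any]) -> set[str]:
    """Collect run_id values from one runs page, ignoring all other metadata."""
    return {str(item["run_id"]) for item in page.get("items", []) if "run_id" in item}


def pages_differ(page_a: dict[str, Any], page_b: dict[str, Any]) -> bool:
    """True only when the two pages contain different runs.

    Comparing run_id sets means a changed timestamp in the payload
    cannot make two identical pages look different.
    """
    return run_id_set(page_a) != run_id_set(page_b)


def semantic_failures(
    aws_row: Optional[dict[str, Any]],
    both_row: Optional[dict[str, Any]],
    runs_gt: int = 1000,
) -> list[str]:
    """Check one pipeline's rows from the aws and both views."""
    if aws_row is None or both_row is None:
        return ["pipeline missing from aws or both view"]
    failures = []
    aws_runs = aws_row.get("runs_7d", 0)
    if not aws_runs > runs_gt:
        failures.append(f"aws.runs_7d must be > {runs_gt}")
    if aws_row.get("last_end_ts") is None:
        failures.append("aws.last_end_ts must not be null")
    if both_row.get("runs_7d", 0) < aws_runs:
        failures.append("both.runs_7d must be >= aws.runs_7d")
    return failures
```

Two pages with the same runs in a different order, or with different metadata timestamps, compare as equal here, which is the false-pass the set comparison is meant to avoid.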
- Update local branch safely:
BASE_BRANCH=main TARGET_BRANCH=main ./scripts/git_update.sh
- Run local syntax checks:
python3 -m py_compile app/main.py controltower_ui/app.py
- Deploy API:
./verify_and_deploy_API.sh
- Deploy UI:
./verify_and_deploy_ui.sh
- Run independent smoke tests:
API_URL="https://api-controltower.gitpushandpray.ai" ./scripts/post_deploy_api_smoke.sh
UI_URL="https://controltower.gitpushandpray.ai" ./scripts/post_deploy_ui_smoke.sh
- Commit and push deployment-related changes if needed:
git add -A
git commit -m "docs: update deployment runbook and smoke test scripts"
git push origin main
- Azure Web App: redeploy previous known-good zip artifact or previous commit.
- App Runner: update service back to previous immutable image tag/digest.
- Re-run smoke scripts after rollback.
- Never commit secrets or raw private keys.
- Use secret managers / CI secret stores for SF_* credentials.
- Prefer immutable container tags or digest-pinned images for App Runner updates.
- Keep jq, az, aws, and docker versions current.
- Database dependency unavailable: check Snowflake env vars and key format.
- UI returns upstream errors: confirm AWS_BASE / AZURE_BASE are reachable and healthy.
- App Runner update fails with same image: provide a new image tag or digest.
- Azure deploy succeeds but app unhealthy: inspect app logs and rerun smoke tests.
- Smoke fails at /version build assertions: ensure BUILD_ID, GIT_SHA, and RELEASE_TAG were exported before deploy (or already present in App Service settings).
For contribution and policy docs, see:
- CONTRIBUTING.md
- SECURITY.md
- SUPPORT.md
Use this when you want a full redeploy flow in one command.
./scripts/release.sh
Recommended full release command:
export BUILD_ID="$(date +%Y-%m-%d).1"
export GIT_SHA="$(git rev-parse --short HEAD)"
export RELEASE_TAG="controltower-v0.4.1"
RUN_GIT_UPDATE=0 DEPLOY_RETRIES=3 ./scripts/release.sh
Default flow order:
- Git update (scripts/git_update.sh)
- API deploy + verification (verify_and_deploy_API.sh)
- UI deploy + verification (verify_and_deploy_ui.sh)
- API smoke tests (scripts/post_deploy_api_smoke.sh)
- UI smoke tests (scripts/post_deploy_ui_smoke.sh)
Skip git update:
RUN_GIT_UPDATE=0 ./scripts/release.sh
Dry run packaging/verification only (no API deploy):
SKIP_DEPLOY=1 RUN_DEPLOY_UI=0 RUN_SMOKE_API=0 RUN_SMOKE_UI=0 ./scripts/release.sh
Run only smoke tests:
RUN_GIT_UPDATE=0 RUN_DEPLOY_API=0 RUN_DEPLOY_UI=0 ./scripts/release.sh
Run semantic checks with custom pipeline/threshold:
SEMANTIC_PIPELINE_ID=p2 ASSERT_RUNS7D_GT=1000 RUN_GIT_UPDATE=0 RUN_DEPLOY_API=0 RUN_DEPLOY_UI=0 ./scripts/release.sh
Run against custom URLs or resources:
RG="rg-bhp-platformlab-dev-aue-app" \
WEBAPP_API="app-bhp-platformlab-dev-aue-controltower-api" \
WEBAPP_UI="app-bhp-platformlab-dev-aue-controltower-ui" \
API_URL="https://api-controltower.gitpushandpray.ai" \
UI_URL="https://controltower.gitpushandpray.ai" \
./scripts/release.sh
RUN_GIT_UPDATE=0 RUN_DEPLOY_API=0 RUN_SMOKE_API=0 RUN_DEPLOY_UI=1 RUN_SMOKE_UI=1 ./scripts/release.sh
RUN_GIT_UPDATE=0 RUN_DEPLOY_UI=0 RUN_SMOKE_UI=0 DEPLOY_RETRIES=3 ./scripts/release.sh
GitHub Actions now includes a release gate workflow:
.github/workflows/release-smoke-controltower.yml
What it does:
- runs deploy + smoke from CI (Azure path on push, optional App Runner on manual dispatch),
- fails the workflow on smoke regressions,
- runs explicit deployment-truth checks (health/version/build fingerprints + smoke artifact index),
- runs CI preflight checks for required deployment config/secrets,
- supports optional assisted-provider validation probe on manual dispatch,
- supports optional automatic rollback attempt (manual dispatch only) when Azure release fails,
- captures /version fingerprints after deploy,
- uploads full smoke logs (deployment_logs/...) and build metadata as workflow artifacts.
Manual dispatch options:
- run_azure_release (default true)
- run_apprunner_release (default false)
- run_apprunner_api_smoke (default true)
- run_failure_probe (default false) to run a controlled negative probe and verify failure surfacing
- run_assisted_provider_validation (default false)
- expect_azure_assisted (default false) for strict Azure OpenAI provider validation
- fail_on_preflight_errors (default true)
- enable_auto_rollback (default false)
- optional rollback_ref override (defaults to previous commit)
- aws_region
- optional apprunner_service_arn override
Scheduled regression workflow:
.github/workflows/scheduled-smoke-regression-controltower.yml
- runs every 6 hours (plus manual dispatch),
- executes API/UI runtime smoke contracts and deployment-truth checks,
- runs CI preflight classification for workflow configuration,
- can emit failure alerts via SMOKE_ALERT_WEBHOOK secret,
- uploads logs and JSON artifacts from deployment_logs/scheduled-smoke-....
The UI now exposes two warning layers in both Overview and Incidents:
- Source health warning banner: driven by _meta.aws_ok / _meta.azure_ok.
- Capability warning banner: driven by _meta.capabilities, with expandable details (endpoint | source | status | reason).
This makes partial-source or degraded endpoint behavior visible in-page instead of only in raw payload metadata.
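How the two warning layers could be derived from the _meta block is sketched below. The function names are hypothetical and the real UI logic may differ. The sketch assumes _meta.capabilities is a list of objects with endpoint, source, status, and reason keys, and that a status of "ok" means healthy; neither detail is confirmed by this README.

```python
from typing import Any, Optional


def source_warning(meta: dict[str, Any]) -> Optional[str]:
    """Build the source health banner text, or None when no source is degraded."""
    down = [
        label
        for label, key in (("AWS", "aws_ok"), ("Azure", "azure_ok"))
        # Only an explicit False counts as degraded; a missing flag is left alone.
        if meta.get(key) is False
    ]
    if not down:
        return None
    return "Degraded source(s): " + ", ".join(down)


def capability_warning_rows(meta: dict[str, Any]) -> list[str]:
    """Return one 'endpoint | source | status | reason' row per degraded capability."""
    rows = []
    for cap in meta.get("capabilities", []):
        if cap.get("status") != "ok":  # "ok" as the healthy value is an assumption
            rows.append(
                " | ".join(str(cap.get(k, "")) for k in ("endpoint", "source", "status", "reason"))
            )
    return rows
```

Keeping the two checks separate mirrors the two banners: a source can be healthy overall while a single endpoint on it is degraded.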
UI now includes a Diagnostics view for operational trust checks:
- build fingerprint reconciliation across UI/AWS/Azure surfaces,
- assisted triage telemetry counters and recent event feed,
- assisted provider posture (selected provider/policy/retry settings),
- recent GitHub workflow outcomes (release smoke, scheduled smoke, deploy workflows),
- JSON snapshot export for handoff/audit.
Official contracts:
- Impact: GET /api/pipelines/{pipeline_id}/impact is the first-class graph impact contract.
- SLA (direct): GET /api/pipelines/{pipeline_id}/sla remains a first-class per-pipeline contract.
- Costs: GET /api/costs is the first-class cost aggregate contract.
- Cost trend: GET /api/costs/trend provides daily cost movement.
- Cost anomalies: GET /api/costs/anomalies surfaces latest-day outliers vs baseline.
- Triage query: POST /api/triage/query is the first-class deterministic triage contract.
- Runbook lifecycle governance (persisted): GET /api/runbooks/governance, POST /api/runbooks/governance/upsert, POST /api/runbooks/governance/remove.
- Policy profiles (persisted): GET /api/policies/profiles, POST /api/policies/profiles/assign, POST /api/policies/profiles/remove.
- Effective policy resolution: GET /api/policies/effective.
- Policy audit history: GET /api/policies/history.
- Policy audit/export reporting: GET /api/policies/history/export, GET /api/policies/exceptions/register, GET /api/policies/review-queue.
- Team scorecards export: GET /api/scorecards/teams/export.
- Governance posture snapshot: GET /api/governance/posture.
- Governance posture trend: GET /api/governance/posture/trend.
- Governance export bundle: GET /api/governance/export/bundle.
- Governance executive report pack: GET /api/governance/report/executive.
- Assisted triage telemetry: GET /api/triage/assisted/telemetry.
- Assisted provider status: GET /api/triage/assisted/provider/status.
- DQ exception governance: POST /api/dq/exceptions/review, GET /api/dq/history, GET /api/dq/exceptions/register.
Compatibility contracts:
- Lineage: GET /api/pipelines/{pipeline_id}/lineage is a deprecated alias to impact.
- Deprecated lineage responses include:
  - deprecated: true
  - replacement_endpoint: /api/pipelines/{pipeline_id}/impact
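A client or smoke check could verify the deprecation metadata along these lines. The helper name is hypothetical. Whether replacement_endpoint carries the concrete pipeline ID or the literal {pipeline_id} template is not stated here, so the sketch accepts both forms as an assumption.

```python
from typing import Any


def is_valid_lineage_alias(payload: dict[str, Any], pipeline_id: str) -> bool:
    """Check that a lineage response declares itself a deprecated alias of impact."""
    accepted = {
        f"/api/pipelines/{pipeline_id}/impact",   # concrete form
        "/api/pipelines/{pipeline_id}/impact",    # templated form
    }
    return payload.get("deprecated") is True and payload.get("replacement_endpoint") in accepted
```

Clients that still call the lineage route can use the same two fields to log a migration warning and switch to the impact route.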
Aggregator support:
- GET /api/aggregate/pipelines/{pipeline_id}/impact
- GET /api/aggregate/pipelines/{pipeline_id}/sla
- GET /api/aggregate/costs
- GET /api/aggregate/costs/trend
- GET /api/aggregate/costs/anomalies
- GET /api/aggregate/costs/pipelines
- GET /api/aggregate/pipelines/{pipeline_id}/cost
- GET /api/aggregate/runbooks
- GET /api/aggregate/runbooks/governance
- POST /api/aggregate/runbooks/governance/upsert
- POST /api/aggregate/runbooks/governance/remove
- GET /api/aggregate/governance/drilldown
- GET /api/aggregate/policies/profiles
- POST /api/aggregate/policies/profiles/assign
- POST /api/aggregate/policies/profiles/remove
- GET /api/aggregate/policies/effective
- GET /api/aggregate/policies/history
- GET /api/aggregate/policies/history/export
- GET /api/aggregate/policies/exceptions/register
- GET /api/aggregate/policies/review-queue
- GET /api/aggregate/scorecards/teams/export
- GET /api/aggregate/governance/posture
- GET /api/aggregate/governance/posture/trend
- GET /api/aggregate/governance/export/bundle
- GET /api/aggregate/governance/report/executive
- POST /api/aggregate/dq/exceptions/review
- GET /api/aggregate/dq/history
- GET /api/aggregate/dq/exceptions/register
- POST /api/aggregate/triage/query
- GET /api/aggregate/triage/assisted/telemetry
- GET /api/aggregate/triage/assisted/provider/status
- GET /api/aggregate/ops/diagnostics
| Capability | API route | Aggregator route | Status |
|---|---|---|---|
| Impact (official) | /api/pipelines/{pipeline_id}/impact | /api/aggregate/pipelines/{pipeline_id}/impact | First-class |
| Lineage (compat) | /api/pipelines/{pipeline_id}/lineage | n/a (use impact route) | Deprecated alias |
| Pipeline SLA (direct) | /api/pipelines/{pipeline_id}/sla | /api/aggregate/pipelines/{pipeline_id}/sla | First-class |
| SLA pipeline counts | /api/sla/pipelines | /api/aggregate/sla/pipelines | First-class |
| Costs | /api/costs | /api/aggregate/costs | First-class |
| Cost trend | /api/costs/trend | /api/aggregate/costs/trend | First-class |
| Cost anomalies | /api/costs/anomalies | /api/aggregate/costs/anomalies | First-class |
| Cost ranking | /api/costs/pipelines | /api/aggregate/costs/pipelines | First-class |
| Pipeline cost detail | /api/pipelines/{pipeline_id}/cost | /api/aggregate/pipelines/{pipeline_id}/cost | First-class |
| Runbook catalog | /api/runbooks | /api/aggregate/runbooks | First-class |
| Runbook lifecycle governance | /api/runbooks/governance + upsert/remove | /api/aggregate/runbooks/governance + upsert/remove | First-class |
| Triage query | /api/triage/query | /api/aggregate/triage/query | First-class |
| Assisted triage telemetry | /api/triage/assisted/telemetry | /api/aggregate/triage/assisted/telemetry | First-class |
| Assisted provider status | /api/triage/assisted/provider/status | /api/aggregate/triage/assisted/provider/status | First-class |
| Policy assignments | /api/policies/profiles + assign/remove | /api/aggregate/policies/profiles + assign/remove | First-class |
| Effective policy resolution | /api/policies/effective | /api/aggregate/policies/effective | First-class |
| Policy audit history | /api/policies/history | /api/aggregate/policies/history | First-class |
| Policy audit/export reporting | /api/policies/history/export, /api/policies/exceptions/register, /api/policies/review-queue | /api/aggregate/policies/history/export, /api/aggregate/policies/exceptions/register, /api/aggregate/policies/review-queue | First-class |
| Team scorecards export | /api/scorecards/teams/export | /api/aggregate/scorecards/teams/export | First-class |
| Governance posture snapshot | /api/governance/posture | /api/aggregate/governance/posture | First-class |
| Governance posture trend | /api/governance/posture/trend | /api/aggregate/governance/posture/trend | First-class |
| Governance export bundle | /api/governance/export/bundle | /api/aggregate/governance/export/bundle | First-class |
| Governance executive report | /api/governance/report/executive | /api/aggregate/governance/report/executive | First-class |
| Operator diagnostics | n/a | /api/aggregate/ops/diagnostics | First-class |
| DQ exception governance | /api/dq/exceptions/review, /api/dq/history, /api/dq/exceptions/register | /api/aggregate/dq/exceptions/review, /api/aggregate/dq/history, /api/aggregate/dq/exceptions/register | First-class |
Incidents now include an operator action panel that provides:
- likely cause
- impact summary
- next actions
- runbook jump link
The panel is powered by same-origin aggregator routes:
- GET /api/aggregate/explain/run/{run_id}
- GET /api/aggregate/pipelines/{pipeline_id}/impact
- optional POST /api/aggregate/triage/query fallback when error details are present
API:
POST /api/triage/query
Aggregator:
POST /api/aggregate/triage/query
Request (example):
{
"source": "both",
"pipeline_id": "p2",
"run_id": "r_p2_20260307235500_Q5_76458",
"context_days": 7,
"mode": "deterministic",
"user_prompt": "optional operator context",
"error_message": "optional",
"error_code": "optional"
}
Response contract:
- summary (string)
- likely_cause (string)
- impacted_consumers[]
- impact_summary (nodes_count, edges_count, degraded, impact_model)
- next_actions[]
- runbook_references[]
- degraded (boolean)
- warnings[]
- confidence
- confidence_reason
- mode (deterministic_rules_v1 or assisted_bedrock_v1)
- requested_mode (deterministic or assisted)
- facts_used[] (field, value, source)
- traceability object:
  - telemetry_facts[]
  - runbook_facts[]
  - impact_facts[]
  - cost_facts[]
  - evidence_links[]
- evidence (object)
- generated_at
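A client call against the aggregator route might be prepared as shown here. This is a sketch using only the Python standard library. The helper names and the base URL in the usage note are illustrative, and treating every listed top-level field as always present in the response is an assumption.

```python
import json
import urllib.request
from typing import Any

# Top-level fields from the response contract listed above.
RESPONSE_FIELDS = (
    "summary", "likely_cause", "impacted_consumers", "impact_summary",
    "next_actions", "runbook_references", "degraded", "warnings",
    "confidence", "confidence_reason", "mode", "requested_mode",
    "facts_used", "traceability", "evidence", "generated_at",
)


def build_triage_request(
    base_url: str,
    pipeline_id: str,
    run_id: str,
    source: str = "both",
    context_days: int = 7,
    mode: str = "deterministic",
) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the aggregator triage route."""
    body = {
        "source": source,
        "pipeline_id": pipeline_id,
        "run_id": run_id,
        "context_days": context_days,
        "mode": mode,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/aggregate/triage/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def missing_response_fields(payload: dict[str, Any]) -> list[str]:
    """List contract fields that are absent from a triage response."""
    return [field for field in RESPONSE_FIELDS if field not in payload]
```

To send the request, pass the returned object to urllib.request.urlopen and decode the body with json.load; for example, build_triage_request("https://controltower.gitpushandpray.ai", "p2", "r_p2_20260307235500_Q5_76458").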
Current mode behavior:
- mode=deterministic: active and fully supported.
- mode=assisted: provider-assisted enrichment is attempted when runtime config is present.
- Assisted provider selection is controlled by ASSISTED_TRIAGE_PROVIDER:
  - bedrock (default)
  - azure_openai (scaffolded parity path)
  - auto (use Azure OpenAI when configured, else Bedrock)
- If assisted provider is unavailable/misconfigured/fails and ASSISTED_FAILURE_POLICY=fallback (default), the API falls back to deterministic rules with:
  - warning metadata,
  - evidence.assisted.fallback_used=true,
  - degraded=true.
- If assisted provider fails and ASSISTED_FAILURE_POLICY=strict, the API returns HTTP 503 (assisted_mode_failed) and does not fall back.
- Assisted telemetry/observability is exposed via:
  - API: GET /api/triage/assisted/telemetry
  - aggregate: GET /api/aggregate/triage/assisted/telemetry
  - payload includes status counters (success|fallback|failed), provider breakdown, error-class counts, and recent events.
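The difference between the fallback and strict failure policies can be sketched as a small wrapper. This is not the service's code: run_triage and its callable arguments are hypothetical, and the 503 body shape and the warning text are assumptions. Only the status code, the assisted_mode_failed code, and the fallback flags come from the behavior described above.

```python
from typing import Any, Callable

Result = dict[str, Any]


def run_triage(
    assisted: Callable[[], Result],
    deterministic: Callable[[], Result],
    failure_policy: str = "fallback",
) -> tuple[int, Result]:
    """Return (http_status, body) for a request made with mode=assisted."""
    try:
        return 200, assisted()
    except Exception as exc:  # a real service would catch narrower provider errors
        if failure_policy == "strict":
            # Strict policy: surface the failure, never fall back.
            return 503, {"error": "assisted_mode_failed"}
        # Fallback policy: answer with deterministic rules and mark the degradation.
        body = dict(deterministic())
        body["degraded"] = True
        body["warnings"] = list(body.get("warnings", [])) + [f"assisted fallback: {exc}"]
        evidence = dict(body.get("evidence", {}))
        evidence["assisted"] = {"fallback_used": True}
        body["evidence"] = evidence
        return 200, body
```

Copying the deterministic result before annotating it keeps the rule engine's own output untouched, so the same result can be reused or cached without carrying fallback markers.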
Assisted mode runtime configuration:
- BEDROCK_MODEL_ID (required for assisted mode; expected provider: anthropic.*)
- BEDROCK_REGION (optional, defaults to AWS_REGION, then ap-southeast-2)
- standard AWS credentials/role permissions for bedrock:InvokeModel
- ASSISTED_TRIAGE_PROVIDER (bedrock|azure_openai|auto, default bedrock)
- ASSISTED_FAILURE_POLICY (fallback|strict, default fallback)
- ASSISTED_RETRY_MAX_ATTEMPTS (default 4)
- ASSISTED_RETRY_BASE_MS (default 300)
Azure OpenAI assisted-mode scaffolding:
- AZURE_OPENAI_ENDPOINT
- AZURE_OPENAI_API_KEY
- AZURE_OPENAI_DEPLOYMENT
- AZURE_OPENAI_API_VERSION (optional, default 2024-10-21)
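Provider selection from these variables could look like the following sketch. The function name select_provider is hypothetical. Two details are assumptions rather than documented behavior: "configured" is taken to mean that the endpoint, API key, and deployment variables are all set, and an unrecognized provider value raises an error.

```python
from typing import Mapping

# Azure OpenAI variables without a documented default.
AZURE_REQUIRED = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
)


def select_provider(env: Mapping[str, str]) -> str:
    """Resolve ASSISTED_TRIAGE_PROVIDER to 'bedrock' or 'azure_openai'."""
    choice = env.get("ASSISTED_TRIAGE_PROVIDER", "").strip().lower() or "bedrock"
    if choice == "auto":
        # auto: use Azure OpenAI when configured, else Bedrock.
        azure_ready = all(env.get(name) for name in AZURE_REQUIRED)
        return "azure_openai" if azure_ready else "bedrock"
    if choice in ("bedrock", "azure_openai"):
        return choice
    raise ValueError(f"unsupported ASSISTED_TRIAGE_PROVIDER: {choice!r}")
```

In a running service the mapping would normally be os.environ; taking it as a parameter keeps the function easy to test.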
Runbook catalog enrichment:
- GET /api/runbooks and GET /api/aggregate/runbooks now include richer metadata:
  - owner
  - sla_minutes
  - runbook_status (configured, missing, invalid_format)
  - tags[]
  - summary
  - coverage block (configured, missing, invalid_format, coverage_ratio)
- Supported filters: search, team, domain, criticality, owner, include_missing.
- Runbook lifecycle governance (persisted):
  - GET /api/runbooks/governance
  - POST /api/runbooks/governance/upsert
  - POST /api/runbooks/governance/remove
  - aggregate equivalents under /api/aggregate/runbooks/governance*
  - store path override: RUNBOOK_GOV_STORE_PATH=/custom/path/runbook_governance_store.json
  - lifecycle states: none, open, in_progress, remediated, waived
Governance surface:
- UI includes a baseline Governance view (?view=governance) with:
  - runbook coverage
  - critical runbook gaps
  - SLA pressure count
  - cost anomaly count
- Governance drilldown rows for runbook-oriented views include an operator workflow jump:
  - Open in runbooks pre-fills Triage runbook filters (search/team/domain/criticality/status) and switches to ?view=triage.
- Governance drilldown API:
  - GET /api/aggregate/governance/drilldown?source=both&days=7&drilldown=runbook_coverage|critical_runbook_gaps|sla_pressure|cost_anomalies|dq_mismatches&limit=200
- Governance deep-link state params:
  - view=governance
  - gov_source=both|aws|azure
  - gov_days=7|14|30
  - gov_drilldown=runbook_coverage|critical_runbook_gaps|sla_pressure|cost_anomalies|dq_mismatches
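A drilldown URL can be assembled from the documented query parameters as sketched below. The helper name drilldown_url and the base URL in the usage note are illustrative. The accepted drilldown values are the ones listed above, and the defaults follow the example route.

```python
from urllib.parse import urlencode

# Drilldown values listed for the governance drilldown API.
DRILLDOWNS = (
    "runbook_coverage",
    "critical_runbook_gaps",
    "sla_pressure",
    "cost_anomalies",
    "dq_mismatches",
)


def drilldown_url(
    base_url: str,
    drilldown: str,
    source: str = "both",
    days: int = 7,
    limit: int = 200,
) -> str:
    """Build the aggregator governance drilldown URL for one drilldown type."""
    if drilldown not in DRILLDOWNS:
        raise ValueError(f"unknown drilldown: {drilldown!r}")
    query = urlencode(
        {"source": source, "days": days, "drilldown": drilldown, "limit": limit}
    )
    return f"{base_url.rstrip('/')}/api/aggregate/governance/drilldown?{query}"
```

For example, drilldown_url("https://controltower.gitpushandpray.ai", "sla_pressure", days=14) selects the SLA pressure rows over a 14-day window.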
Policy/profile surface (baseline):
- UI includes ?view=policy with:
  - persisted policy assignment actions by scope (pipeline, team, domain)
  - SLA profile assignment/effective table with mismatch + exception flags
  - exception metadata (reason, reviewer, expiry, approved)
  - approved vs unmanaged exception-state visibility
  - policy audit history table with filters (action, scope, target)
  - baseline DQ/operational standards table with live evidence
  - profile KPIs (strict/balanced/relaxed counts + standards coverage)
  - row click-through actions into Overview, Governance, and Triage
- Policy deep-link state params:
  - view=policy
  - policy_source=both|aws|azure
  - policy_days=7|14|30
  - policy_profile=all|strict|balanced|relaxed
  - policy_mismatch=all|mismatch
  - policy_exception=all|exception
  - policy_exception_state=all|approved|unmanaged|expired|rejected
  - policy_audit_action=all|assign|remove
  - policy_audit_scope=all|pipeline|team|domain
  - policy_audit_target=<target_id_substring>
- Actionable DQ standards workflow (persisted):
  - API:
    - GET /api/dq/standards
    - POST /api/dq/standards/upsert
    - POST /api/dq/standards/remove
    - GET /api/dq/effective
  - aggregate equivalents:
    - GET /api/aggregate/dq/standards
    - POST /api/aggregate/dq/standards/upsert
    - POST /api/aggregate/dq/standards/remove
    - GET /api/aggregate/dq/effective
  - supported standards:
    - runbook_required
    - max_breach_ratio
    - max_failures_7d
  - precedence:
    - pipeline override > team override > domain override > default baseline
  - temporary deviation fields:
    - is_exception
    - exception_approved
    - exception_reviewer
    - exception_expires_at
    - waiver_template
  - effective DQ response includes:
    - per-pipeline remediation_suggestions[]
    - per-pipeline action_links (overview/governance/triage/incidents)
    - summary aggregates: failing_by_team, failing_by_domain, failing_by_profile, failing_standards_counts
  - store path override:
    - DQ_STORE_PATH=/custom/path/dq_standards_store.json
Persisted policy backend notes:
- default store path:
  - /home/site/wwwroot/policy_profiles_store.json when available
  - fallback local runtime path under app/
- override with env var:
  - POLICY_STORE_PATH=/custom/path/policy_profiles_store.json
- effective resolution precedence:
  - pipeline assignment > team assignment > domain assignment > recommended default (from criticality)
- assignment payload supports exception governance metadata:
  - is_exception
  - exception_approved
  - exception_status (none|proposed|approved|rejected|expired)
  - exception_reviewer
  - exception_expires_at
  - review_due_at
  - waiver_template
- effective response includes:
  - exception_state (none|approved|unmanaged|expired|rejected)
  - exception_status
  - summary unmanaged_exception_count
  - summary proposed_exception_count, rejected_exception_count, approved_exception_count, expired_exception_count
- history endpoint returns immutable change events:
  - action, scope, target_id, changed_by, changed_at, reason, old_value, new_value
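The effective resolution precedence (pipeline, then team, then domain, then the recommended default) can be illustrated with a short lookup. The data shapes are assumptions made for this sketch: assignments are modeled as a mapping of scope to {target_id: profile}, and a pipeline is a dict with pipeline_id, team, and domain keys. The persisted store format is not documented here.

```python
from typing import Any, Mapping

# Scopes in precedence order, highest first.
SCOPES = ("pipeline", "team", "domain")


def resolve_profile(
    pipeline: Mapping[str, Any],
    assignments: Mapping[str, Mapping[str, str]],
    recommended_default: str,
) -> tuple[str, str]:
    """Return (profile, resolved_from) for one pipeline."""
    targets = {
        "pipeline": pipeline.get("pipeline_id"),
        "team": pipeline.get("team"),
        "domain": pipeline.get("domain"),
    }
    for scope in SCOPES:
        profile = assignments.get(scope, {}).get(targets[scope])
        if profile:
            # The first scope with an assignment wins.
            return profile, scope
    # No explicit assignment: use the default recommended from criticality.
    return recommended_default, "default"
```

Returning the winning scope alongside the profile makes it possible to show where an effective policy came from, which helps when a team-level assignment is unexpectedly overridden by a pipeline-level one.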
Cost drilldown polish:
- Cost anomalies table now includes:
  - team/domain context
  - severity band
  - anomaly filters (all vs anomaly_only)
  - minimum ratio filter
  - row Focus action to jump to pipeline detail in overview.
  - row Investigate action to build deeper incident + triage context in-page.
- Investigation endpoint:
  - API: GET /api/costs/anomalies/investigate/{pipeline_id}
  - aggregate: GET /api/aggregate/costs/anomalies/investigate/{pipeline_id}
  - response includes:
    - anomaly
    - trend.items[]
    - top_cost_runs[]
    - related_incidents[]
    - baseline_delta_aud
    - baseline_delta_ratio
    - recent_change_24h_aud
    - policy_pressure
    - dq_pressure
    - recommended_next_actions[]
    - triage_prefill
    - degraded
    - warnings[]
Assisted triage evidence UX:
- Incident/Triage action panels now surface:
  - requested vs executed mode
  - degraded/fallback metadata
  - grouped facts_used by source
  - grouped traceability facts (telemetry, runbook, impact, cost)
  - traceability evidence links
  - structured evidence cards
  - raw evidence payload details (expandable)
Azure OpenAI parity:
- Azure OpenAI assisted-mode parity is now scaffolded behind ASSISTED_TRIAGE_PROVIDER=azure_openai (or auto when configured).
- Bedrock remains the default provider for current production posture.
Team scorecards:
- API: GET /api/scorecards/teams
- aggregate: GET /api/aggregate/scorecards/teams
- export:
  - API: GET /api/scorecards/teams/export
  - aggregate: GET /api/aggregate/scorecards/teams/export
- governance UI now renders team scorecards with:
- composite score
- DQ failures
- policy mismatches
- unmanaged exceptions
- runbook coverage ratio
- ownership posture ratio
- anomaly count
- SLA pressure trend direction
- breach ratio
- click-through into policy workflow
Governance audit/reporting:
- Policy reporting endpoints:
  - API: GET /api/policies/history/export, GET /api/policies/exceptions/register, GET /api/policies/review-queue
  - aggregate: GET /api/aggregate/policies/history/export, GET /api/aggregate/policies/exceptions/register, GET /api/aggregate/policies/review-queue
- Governance posture snapshot:
  - API: GET /api/governance/posture
  - aggregate: GET /api/aggregate/governance/posture
DQ exception approval workflow:
- API:
  - POST /api/dq/exceptions/review
  - GET /api/dq/history
  - GET /api/dq/exceptions/register
- aggregate:
  - POST /api/aggregate/dq/exceptions/review
  - GET /api/aggregate/dq/history
  - GET /api/aggregate/dq/exceptions/register
Backward compatibility:
- The API response still includes legacy fields such as triage_hints for existing clients.
Copy helpers are available in the panel:
- Copy action summary (plain text)
- Copy as Markdown (ticket/chat ready format)