casp24kom/controltower_api

License Notice

Copyright © 2026 Aleksy Pyrz. All rights reserved.

This repository is source-available for private, internal, non-commercial evaluation only. It is not open source.

You may view, clone, and run the code for private evaluation only. You may not use it commercially, deploy it in production, redistribute it, sublicense it, resell it, offer it as a hosted service, or claim it as your own.

See LICENSE.md for full terms.

For commercial licensing or acquisition discussions: aleksypyrz@gmail.com

Control Tower API + Aggregation UI

Control Tower is a Python-based observability service for data pipeline operations.

This repository contains:

  • API service (FastAPI) exposing Snowflake-backed operational endpoints.
  • Aggregation UI backend (Flask) that merges AWS and Azure API responses.
  • Deployment and verification scripts for Azure Web App and AWS App Runner.

Repository Structure

  • app/main.py: FastAPI service entrypoint and API routes.
  • app/snowflake_client.py: Snowflake connection/auth/query logic.
  • controltower_ui/app.py: Aggregation backend (AWS + Azure source merge).
  • verify_and_deploy_API.sh: Azure API deploy + verification script.
  • verify_and_deploy_ui.sh: Azure UI deploy + verification script.
  • deploy/create_apprunner_controltower.sh: first-time AWS App Runner service creation.
  • deploy/deploy_apprunner_controltower.sh: App Runner update using image identifier.
  • deploy/deploy_apprunner_controltower_rel.sh: App Runner update by building/pushing release image.
  • scripts/git_update.sh: safe Git update workflow for deployment branches.
  • scripts/post_deploy_api_smoke.sh: API smoke tests.
  • scripts/post_deploy_ui_smoke.sh: UI smoke tests.
  • scripts/post_deploy_truth_check.sh: deployment truth verification (/health, /version, UI build banner, smoke index).
  • scripts/ci_failure_probe.sh: intentional CI failure probe (for workflow failure-surfacing validation).
  • scripts/ci_preflight.sh: CI config/secrets preflight classification (strict or warn mode).
  • scripts/assisted_provider_validate.sh: assisted provider runtime validation probe (Bedrock/Azure OpenAI).
  • scripts/release.sh: one-command release orchestrator (git update, deploy, smoke tests).

Usage Terms

Permitted

  • View the source code
  • Clone it for private evaluation
  • Run and modify it for private, internal, non-production, non-commercial evaluation

Not permitted

  • Commercial use
  • Production use
  • Redistribution
  • Resale
  • Sublicensing
  • Public hosting / SaaS use
  • Copying substantial parts into another product
  • Claiming the code as your own

Runtime Architecture

  1. API service reads Snowflake env vars and serves operational endpoints under /api/*.
  2. UI aggregation service proxies and merges responses from:
    • AWS_BASE
    • AZURE_BASE
  3. Both services expose /health endpoints for liveness checks.
  4. The UI Overview and Incidents views show warning banners when a source is degraded or when endpoint capabilities are degraded.
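
The proxy-and-merge step in item 2 can be pictured with a small sketch. This is not the code in controltower_ui/app.py: the function name `aggregate`, the injected `fetch` callable, and the `aws_error`/`azure_error` keys are hypothetical. Only the `_meta.aws_ok` / `_meta.azure_ok` flags are taken from this README.

```python
from typing import Any, Callable, Dict


def aggregate(fetch: Callable[[str], Any], aws_base: str, azure_base: str, path: str) -> Dict[str, Any]:
    """Call the same path on both sources and record per-source health in _meta.

    `fetch` is any callable that takes a URL and returns parsed JSON
    (hypothetical; injected so the sketch needs no network access).
    """
    result: Dict[str, Any] = {"_meta": {}}
    for name, base in (("aws", aws_base), ("azure", azure_base)):
        try:
            result[name] = fetch(base.rstrip("/") + path)
            result["_meta"][f"{name}_ok"] = True
        except Exception as exc:
            # A failing upstream degrades only that source, not the whole response.
            result[name] = None
            result["_meta"][f"{name}_ok"] = False
            result["_meta"][f"{name}_error"] = str(exc)  # hypothetical key
    return result
```

Injecting `fetch` keeps the sketch testable offline; a real caller would pass a function that performs the HTTP GET against AWS_BASE and AZURE_BASE.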

Prerequisites

Install the following tools locally:

  • python3 (3.11+ recommended)
  • pip
  • jq
  • curl
  • zip / unzip
  • git
  • az (Azure CLI) for Azure deployments
  • aws + docker for AWS/App Runner deployments

Environment Variables

API (Snowflake)

Required:

  • SF_USER
  • one of: SF_ACCOUNT_IDENTIFIER or SF_ACCOUNT_URL
  • one auth method:
    • SF_PRIVATE_KEY_PEM_B64 (recommended), or
    • SF_PASSWORD / SNOWFLAKE_PASSWORD

Recommended:

  • SF_ROLE
  • SF_WAREHOUSE
  • SF_DATABASE (often defaults to BHP_PLATFORM_LAB)
  • SF_SCHEMA (often defaults to OBS)

UI Aggregator

Required for merged mode:

  • AWS_BASE (example: https://api-controltower.aws.example.com)
  • AZURE_BASE (example: https://api-controltower.azure.example.com)

Deployment fingerprint (required for release smoke success)

Set these in API and UI environments so /health and /version show exact deployed build identity:

  • BUILD_ID
  • GIT_SHA
  • RELEASE_TAG

The smoke scripts hard-assert that these values are present and non-empty in /version.
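
For illustration, the fingerprint assertion can be expressed as a small Python check. This is a sketch, not the logic in the smoke scripts. It assumes the /version payload nests the three values under a `build` object (build.build_id, build.git_sha, build.release_tag), as the smoke-test checklists describe. The helper name `missing_build_fields` is hypothetical.

```python
from typing import Any, Dict, List

REQUIRED_BUILD_FIELDS = ("build_id", "git_sha", "release_tag")


def missing_build_fields(version_payload: Dict[str, Any]) -> List[str]:
    """Return the build fields that are absent or empty in a /version payload."""
    build = version_payload.get("build")
    if not isinstance(build, dict):
        # No build object at all: every required field is missing.
        return list(REQUIRED_BUILD_FIELDS)
    return [
        field
        for field in REQUIRED_BUILD_FIELDS
        if not str(build.get(field) or "").strip()
    ]
```

An empty list means the deployment is fully stamped; anything else names the variables that still need to be set in the environment.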

Local Development

1) Create virtual environment

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2) Run API locally

uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

Health check:

curl -fsS http://localhost:8000/health | jq .

3) Run UI aggregator locally

cd controltower_ui
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

export AWS_BASE="http://localhost:8000"
export AZURE_BASE="http://localhost:8000"
python3 app.py

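UI health (this assumes the aggregator listens on port 8000; if the API from step 2 is still running there, use the port that app.py actually binds):

curl -fsS http://localhost:8000/health | jq .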

Git Update Workflow (Before Deploying)

Use the repository script to keep deployment branches updated and avoid interactive Git mistakes.

Standard update (main)

BASE_BRANCH=main TARGET_BRANCH=main ./scripts/git_update.sh

Update release branch from main and push

BASE_BRANCH=main TARGET_BRANCH=release PUSH=1 ./scripts/git_update.sh

What it does:

  • fetches latest refs
  • checks out target branch
  • fast-forwards main, or rebases a non-main target branch onto the base branch
  • optionally pushes when PUSH=1

Azure Deployment (Zip Deploy)

Required release stamp exports (before deployment)

Export these before running deploy/release flows:

export BUILD_ID="$(date +%Y-%m-%d).1"
export GIT_SHA="$(git rev-parse --short HEAD)"
export RELEASE_TAG="controltower-v0.4.1"

If these are missing, smoke tests can fail at /version assertions.

API deployment/update (recommended script)

Step 1: set variables (or rely on defaults in script)

export API_DIR="/Users/aleksypyrz/Documents/GitHub/controltower_api"
export RG="rg-bhp-platformlab-dev-aue-app"
export WEBAPP_API="app-bhp-platformlab-dev-aue-controltower-api"
export API_URL="https://api-controltower.gitpushandpray.ai"

Step 2: run deploy + verify

./verify_and_deploy_API.sh

If BUILD_ID, GIT_SHA, and RELEASE_TAG are exported, deploy scripts apply them to App Service settings before deploy.

Optional: verify package only (no deploy)

SKIP_DEPLOY=1 ./verify_and_deploy_API.sh

Additional API deploy script behavior:

  • Enforces Azure startup command to bash startup.sh on each run.
  • Supports retry attempts with DEPLOY_RETRIES (default: 2).
  • Uses az webapp deploy --track-status false to avoid false-negative tracking issues seen with OneDeploy status polling.

UI deployment/update (recommended script)

./verify_and_deploy_ui.sh

Notes:

  • Script builds a zip, compares file hashes, deploys with az webapp deploy, then runs endpoint checks.
  • If release stamp vars are exported, script writes BUILD_ID/GIT_SHA/RELEASE_TAG to UI app settings before deploy.
  • Script supports env overrides for UI_DIR, RG, WEBAPP_UI, UI_URL, and DEPLOY_RETRIES.

AWS App Runner Deployment

First-time service creation

  1. Export required Snowflake environment variables.
  2. Run:
./deploy/create_apprunner_controltower.sh

Update existing App Runner service (image already in ECR)

IMAGE_IDENTIFIER="<account>.dkr.ecr.ap-southeast-2.amazonaws.com/bhp-platformlab-controltower:rel-YYYYMMDD-HHMMSS" \
./deploy/deploy_apprunner_controltower.sh

Build + push release image + update service

SERVICE_ARN="arn:aws:apprunner:..." ./deploy/deploy_apprunner_controltower_rel.sh

Post-Deployment Smoke Tests

API smoke test script

API_URL="https://api-controltower.gitpushandpray.ai" ./scripts/post_deploy_api_smoke.sh

The impact endpoint contract is enforced by default (IMPACT_REQUIRED=1), so setting it explicitly is equivalent:

IMPACT_REQUIRED=1 API_URL="https://api-controltower.gitpushandpray.ai" ./scripts/post_deploy_api_smoke.sh

Checks:

  • /health
  • /version (hard assertion of build.build_id, build.git_sha, build.release_tag)
  • /api/summary
  • summary contract extensions:
    • stale_datasets (non-negative)
    • top_impacted_downstream_assets (array)
    • cost_24h_aud, cost_7d_aud, cost_7d_daily_avg_aud (non-negative)
  • /api/pipelines
  • pagination behavior on /api/pipelines/{id}/runs (offset must change output)
  • direct pipeline SLA contract (/api/pipelines/{id}/sla)
  • costs contract (/api/costs)
  • first-class impact contract: /api/pipelines/{id}/impact must return HTTP 200 with required fields.
  • lineage strategy contract: /api/pipelines/{id}/lineage must return deprecated alias metadata and point to /impact.

UI smoke test script

UI_URL="https://controltower.gitpushandpray.ai" ./scripts/post_deploy_ui_smoke.sh

Checks:

  • /health
  • /version (hard assertion of build.build_id, build.git_sha, build.release_tag)
  • aggregate summary
  • aggregate summary contract extensions:
    • stale_datasets (non-negative)
    • top_impacted_downstream_assets (array)
    • cost_24h_aud, cost_7d_aud, cost_7d_daily_avg_aud (non-negative)
  • aggregate costs contract (/api/aggregate/costs)
  • aggregate pipeline views (aws, both)
  • aggregate incidents
  • aggregate pipeline SLA contract (/api/aggregate/pipelines/{id}/sla)
  • aggregate runs pagination for AWS (offset=0 must differ from offset=3)
  • pagination assertion compares items[].run_id sets (not full payload text) to avoid metadata timestamp false-passes
  • semantic assertions:
    • pipeline p2 exists in aws and both
    • aws.runs_7d(p2) > 1000
    • aws.last_end_ts(p2) is not null
    • both.runs_7d(p2) >= aws.runs_7d(p2)
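
The pagination and semantic assertions above can be sketched in Python. The field names items[].run_id, runs_7d, and last_end_ts come from the checklist. The function names and the idea of passing pre-parsed rows are assumptions made for illustration, not the smoke script's actual structure.

```python
from typing import Any, Dict, Set


def run_ids(page: Dict[str, Any]) -> Set[str]:
    """Collect the run_id values from one page of a runs response."""
    return {item["run_id"] for item in page.get("items", [])}


def pages_differ(page_a: Dict[str, Any], page_b: Dict[str, Any]) -> bool:
    """Compare run_id sets only, so volatile metadata (timestamps) cannot fake a pass."""
    return run_ids(page_a) != run_ids(page_b)


def semantic_ok(aws_row: Dict[str, Any], both_row: Dict[str, Any], min_runs: int = 1000) -> bool:
    """Check the three semantic assertions for one pipeline (for example p2)."""
    return (
        aws_row["runs_7d"] > min_runs
        and aws_row["last_end_ts"] is not None
        and both_row["runs_7d"] >= aws_row["runs_7d"]
    )
```

Comparing sets rather than raw payload text is the point of the design: two pages that return identical runs but different generation timestamps are correctly reported as not differing.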

Full Step-by-Step Release Procedure (Recommended)

  1. Update local branch safely:
    BASE_BRANCH=main TARGET_BRANCH=main ./scripts/git_update.sh
  2. Run local syntax checks:
    python3 -m py_compile app/main.py controltower_ui/app.py
  3. Deploy API:
    ./verify_and_deploy_API.sh
  4. Deploy UI:
    ./verify_and_deploy_ui.sh
  5. Run independent smoke tests:
    API_URL="https://api-controltower.gitpushandpray.ai" ./scripts/post_deploy_api_smoke.sh
    UI_URL="https://controltower.gitpushandpray.ai" ./scripts/post_deploy_ui_smoke.sh
  6. Commit and push deployment-related changes if needed:
    git add -A
    git commit -m "docs: update deployment runbook and smoke test scripts"
    git push origin main

Rollback Guidance

  • Azure Web App: redeploy previous known-good zip artifact or previous commit.
  • App Runner: update service back to previous immutable image tag/digest.
  • Re-run smoke scripts after rollback.

Security and Operations Notes

  • Never commit secrets or raw private keys.
  • Use secret managers / CI secret stores for SF_* credentials.
  • Prefer immutable container tags or digest-pinned images for App Runner updates.
  • Keep jq, az, aws, and docker versions current.

Troubleshooting

  • Database dependency unavailable: check Snowflake env vars and key format.
  • UI returns upstream errors: confirm AWS_BASE/AZURE_BASE are reachable and healthy.
  • App Runner update fails with same image: provide a new image tag or digest.
  • Azure deploy succeeds but app unhealthy: inspect app logs and rerun smoke tests.
  • Smoke fails at /version build assertions: ensure BUILD_ID, GIT_SHA, and RELEASE_TAG were exported before deploy (or already present in App Service settings).

For contribution and policy docs, see:

  • CONTRIBUTING.md
  • SECURITY.md
  • SUPPORT.md

One-Command Release Script

Use this when you want a full redeploy flow in one command.

./scripts/release.sh

Recommended full release command:

export BUILD_ID="$(date +%Y-%m-%d).1"
export GIT_SHA="$(git rev-parse --short HEAD)"
export RELEASE_TAG="controltower-v0.4.1"
RUN_GIT_UPDATE=0 DEPLOY_RETRIES=3 ./scripts/release.sh

Default flow order:

  1. Git update (scripts/git_update.sh)
  2. API deploy + verification (verify_and_deploy_API.sh)
  3. UI deploy + verification (verify_and_deploy_ui.sh)
  4. API smoke tests (scripts/post_deploy_api_smoke.sh)
  5. UI smoke tests (scripts/post_deploy_ui_smoke.sh)

Common usage options

Skip git update:

RUN_GIT_UPDATE=0 ./scripts/release.sh

Dry run packaging/verification only (no API deploy):

SKIP_DEPLOY=1 RUN_DEPLOY_UI=0 RUN_SMOKE_API=0 RUN_SMOKE_UI=0 ./scripts/release.sh

Run only smoke tests:

RUN_GIT_UPDATE=0 RUN_DEPLOY_API=0 RUN_DEPLOY_UI=0 ./scripts/release.sh

Run semantic checks with custom pipeline/threshold:

SEMANTIC_PIPELINE_ID=p2 ASSERT_RUNS7D_GT=1000 RUN_GIT_UPDATE=0 RUN_DEPLOY_API=0 RUN_DEPLOY_UI=0 ./scripts/release.sh

Run against custom URLs or resources:

RG="rg-bhp-platformlab-dev-aue-app" \
WEBAPP_API="app-bhp-platformlab-dev-aue-controltower-api" \
WEBAPP_UI="app-bhp-platformlab-dev-aue-controltower-ui" \
API_URL="https://api-controltower.gitpushandpray.ai" \
UI_URL="https://controltower.gitpushandpray.ai" \
./scripts/release.sh

Deploy only UI (when API is already healthy)

RUN_GIT_UPDATE=0 RUN_DEPLOY_API=0 RUN_SMOKE_API=0 RUN_DEPLOY_UI=1 RUN_SMOKE_UI=1 ./scripts/release.sh

Deploy only API (when testing API fixes)

RUN_GIT_UPDATE=0 RUN_DEPLOY_UI=0 RUN_SMOKE_UI=0 DEPLOY_RETRIES=3 ./scripts/release.sh

CI/CD Release + Smoke Automation

GitHub Actions now includes a release gate workflow:

  • .github/workflows/release-smoke-controltower.yml

What it does:

  • runs deploy + smoke from CI (Azure path on push, optional App Runner on manual dispatch),
  • fails the workflow on smoke regressions,
  • runs explicit deployment-truth checks (health/version/build fingerprints + smoke artifact index),
  • runs CI preflight checks for required deployment config/secrets,
  • supports optional assisted-provider validation probe on manual dispatch,
  • supports optional automatic rollback attempt (manual dispatch only) when Azure release fails,
  • captures /version fingerprints after deploy,
  • uploads full smoke logs (deployment_logs/...) and build metadata as workflow artifacts.

Manual dispatch options:

  • run_azure_release (default true)
  • run_apprunner_release (default false)
  • run_apprunner_api_smoke (default true)
  • run_failure_probe (default false) to run a controlled negative probe and verify failure surfacing
  • run_assisted_provider_validation (default false)
  • expect_azure_assisted (default false) for strict Azure OpenAI provider validation
  • fail_on_preflight_errors (default true)
  • enable_auto_rollback (default false)
  • optional rollback_ref override (defaults to previous commit)
  • aws_region
  • optional apprunner_service_arn override

Scheduled regression workflow:

  • .github/workflows/scheduled-smoke-regression-controltower.yml
  • runs every 6 hours (plus manual dispatch),
  • executes API/UI runtime smoke contracts and deployment-truth checks,
  • runs CI preflight classification for workflow configuration,
  • can emit failure alerts via SMOKE_ALERT_WEBHOOK secret,
  • uploads logs and JSON artifacts from deployment_logs/scheduled-smoke-....

UI Operator Warnings

The UI now exposes two warning layers in both Overview and Incidents:

  • Source health warning banner: driven by _meta.aws_ok / _meta.azure_ok.
  • Capability warning banner: driven by _meta.capabilities, with expandable details (endpoint | source | status | reason).

This makes partial-source or degraded endpoint behavior visible in-page instead of only in raw payload metadata.
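
A minimal sketch of how the two banner layers could be derived from `_meta` follows. The keys `aws_ok`, `azure_ok`, and `capabilities` are named above; everything else is an assumption: that `capabilities` is a list of objects with endpoint/source/status/reason keys, that a status of "ok" means healthy, and that a missing `*_ok` flag counts as degraded. The function name is hypothetical.

```python
from typing import Any, Dict, List


def warning_banners(meta: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Derive source-health and capability banners from an aggregate _meta block."""
    banners: List[Dict[str, Any]] = []

    # Layer 1: source health. A missing flag is treated as degraded (assumption).
    down = [src for src in ("aws", "azure") if not meta.get(f"{src}_ok", False)]
    if down:
        banners.append({"kind": "source_health", "sources": down})

    # Layer 2: capabilities. Any status other than "ok" is surfaced (assumption).
    degraded = [cap for cap in meta.get("capabilities", []) if cap.get("status") != "ok"]
    if degraded:
        banners.append({"kind": "capability", "details": degraded})

    return banners
```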

Operator Diagnostics View

UI now includes a Diagnostics view for operational trust checks:

  • build fingerprint reconciliation across UI/AWS/Azure surfaces,
  • assisted triage telemetry counters and recent event feed,
  • assisted provider posture (selected provider/policy/retry settings),
  • recent GitHub workflow outcomes (release smoke, scheduled smoke, deploy workflows),
  • JSON snapshot export for handoff/audit.

Endpoint Strategy (Locked)

Official contracts:

  • Impact: GET /api/pipelines/{pipeline_id}/impact is the first-class graph impact contract.
  • SLA (direct): GET /api/pipelines/{pipeline_id}/sla remains a first-class per-pipeline contract.
  • Costs: GET /api/costs is the first-class cost aggregate contract.
  • Cost trend: GET /api/costs/trend provides daily cost movement.
  • Cost anomalies: GET /api/costs/anomalies surfaces latest-day outliers vs baseline.
  • Triage query: POST /api/triage/query is the first-class deterministic triage contract.
  • Runbook lifecycle governance (persisted): GET /api/runbooks/governance, POST /api/runbooks/governance/upsert, POST /api/runbooks/governance/remove.
  • Policy profiles (persisted): GET /api/policies/profiles, POST /api/policies/profiles/assign, POST /api/policies/profiles/remove.
  • Effective policy resolution: GET /api/policies/effective.
  • Policy audit history: GET /api/policies/history.
  • Policy audit/export reporting: GET /api/policies/history/export, GET /api/policies/exceptions/register, GET /api/policies/review-queue.
  • Team scorecards export: GET /api/scorecards/teams/export.
  • Governance posture snapshot: GET /api/governance/posture.
  • Governance posture trend: GET /api/governance/posture/trend.
  • Governance export bundle: GET /api/governance/export/bundle.
  • Governance executive report pack: GET /api/governance/report/executive.
  • Assisted triage telemetry: GET /api/triage/assisted/telemetry.
  • Assisted provider status: GET /api/triage/assisted/provider/status.
  • DQ exception governance: POST /api/dq/exceptions/review, GET /api/dq/history, GET /api/dq/exceptions/register.

Compatibility contracts:

  • Lineage: GET /api/pipelines/{pipeline_id}/lineage is a deprecated alias to impact.
  • Deprecated lineage responses include:
    • deprecated: true
    • replacement_endpoint: /api/pipelines/{pipeline_id}/impact
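
A client that still calls the lineage alias could verify the deprecation metadata with a check like the one below. This is a sketch under assumptions: the README does not say whether `replacement_endpoint` carries the literal `{pipeline_id}` template or the substituted ID, so the check accepts both. The function name is hypothetical.

```python
from typing import Any, Dict, List


def lineage_alias_problems(payload: Dict[str, Any], pipeline_id: str) -> List[str]:
    """Return a list of contract violations for a deprecated lineage response."""
    problems: List[str] = []
    if payload.get("deprecated") is not True:
        problems.append("deprecated must be true")

    # Accept either the concrete path or the un-substituted template (assumption).
    accepted = {
        f"/api/pipelines/{pipeline_id}/impact",
        "/api/pipelines/{pipeline_id}/impact",
    }
    if payload.get("replacement_endpoint") not in accepted:
        problems.append("replacement_endpoint must point to the impact route")
    return problems
```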

Aggregator support:

  • GET /api/aggregate/pipelines/{pipeline_id}/impact
  • GET /api/aggregate/pipelines/{pipeline_id}/sla
  • GET /api/aggregate/costs
  • GET /api/aggregate/costs/trend
  • GET /api/aggregate/costs/anomalies
  • GET /api/aggregate/costs/pipelines
  • GET /api/aggregate/pipelines/{pipeline_id}/cost
  • GET /api/aggregate/runbooks
  • GET /api/aggregate/runbooks/governance
  • POST /api/aggregate/runbooks/governance/upsert
  • POST /api/aggregate/runbooks/governance/remove
  • GET /api/aggregate/governance/drilldown
  • GET /api/aggregate/policies/profiles
  • POST /api/aggregate/policies/profiles/assign
  • POST /api/aggregate/policies/profiles/remove
  • GET /api/aggregate/policies/effective
  • GET /api/aggregate/policies/history
  • GET /api/aggregate/policies/history/export
  • GET /api/aggregate/policies/exceptions/register
  • GET /api/aggregate/policies/review-queue
  • GET /api/aggregate/scorecards/teams/export
  • GET /api/aggregate/governance/posture
  • GET /api/aggregate/governance/posture/trend
  • GET /api/aggregate/governance/export/bundle
  • GET /api/aggregate/governance/report/executive
  • POST /api/aggregate/dq/exceptions/review
  • GET /api/aggregate/dq/history
  • GET /api/aggregate/dq/exceptions/register
  • POST /api/aggregate/triage/query
  • GET /api/aggregate/triage/assisted/telemetry
  • GET /api/aggregate/triage/assisted/provider/status
  • GET /api/aggregate/ops/diagnostics

API vs Aggregator Matrix

| Capability | API route | Aggregator route | Status |
| --- | --- | --- | --- |
| Impact (official) | /api/pipelines/{pipeline_id}/impact | /api/aggregate/pipelines/{pipeline_id}/impact | First-class |
| Lineage (compat) | /api/pipelines/{pipeline_id}/lineage | n/a (use impact route) | Deprecated alias |
| Pipeline SLA (direct) | /api/pipelines/{pipeline_id}/sla | /api/aggregate/pipelines/{pipeline_id}/sla | First-class |
| SLA pipeline counts | /api/sla/pipelines | /api/aggregate/sla/pipelines | First-class |
| Costs | /api/costs | /api/aggregate/costs | First-class |
| Cost trend | /api/costs/trend | /api/aggregate/costs/trend | First-class |
| Cost anomalies | /api/costs/anomalies | /api/aggregate/costs/anomalies | First-class |
| Cost ranking | /api/costs/pipelines | /api/aggregate/costs/pipelines | First-class |
| Pipeline cost detail | /api/pipelines/{pipeline_id}/cost | /api/aggregate/pipelines/{pipeline_id}/cost | First-class |
| Runbook catalog | /api/runbooks | /api/aggregate/runbooks | First-class |
| Runbook lifecycle governance | /api/runbooks/governance + upsert/remove | /api/aggregate/runbooks/governance + upsert/remove | First-class |
| Triage query | /api/triage/query | /api/aggregate/triage/query | First-class |
| Assisted triage telemetry | /api/triage/assisted/telemetry | /api/aggregate/triage/assisted/telemetry | First-class |
| Assisted provider status | /api/triage/assisted/provider/status | /api/aggregate/triage/assisted/provider/status | First-class |
| Policy assignments | /api/policies/profiles + assign/remove | /api/aggregate/policies/profiles + assign/remove | First-class |
| Effective policy resolution | /api/policies/effective | /api/aggregate/policies/effective | First-class |
| Policy audit history | /api/policies/history | /api/aggregate/policies/history | First-class |
| Policy audit/export reporting | /api/policies/history/export, /api/policies/exceptions/register, /api/policies/review-queue | /api/aggregate/policies/history/export, /api/aggregate/policies/exceptions/register, /api/aggregate/policies/review-queue | First-class |
| Team scorecards export | /api/scorecards/teams/export | /api/aggregate/scorecards/teams/export | First-class |
| Governance posture snapshot | /api/governance/posture | /api/aggregate/governance/posture | First-class |
| Governance posture trend | /api/governance/posture/trend | /api/aggregate/governance/posture/trend | First-class |
| Governance export bundle | /api/governance/export/bundle | /api/aggregate/governance/export/bundle | First-class |
| Governance executive report | /api/governance/report/executive | /api/aggregate/governance/report/executive | First-class |
| Operator diagnostics | n/a | /api/aggregate/ops/diagnostics | First-class |
| DQ exception governance | /api/dq/exceptions/review, /api/dq/history, /api/dq/exceptions/register | /api/aggregate/dq/exceptions/review, /api/aggregate/dq/history, /api/aggregate/dq/exceptions/register | First-class |

Incident Action Workflow

Incidents now include an operator action panel that provides:

  • likely cause
  • impact summary
  • next actions
  • runbook jump link

The panel is powered by same-origin aggregator routes:

  • GET /api/aggregate/explain/run/{run_id}
  • GET /api/aggregate/pipelines/{pipeline_id}/impact
  • optional POST /api/aggregate/triage/query fallback when error details are present

Formal Triage Query Contract

API:

  • POST /api/triage/query

Aggregator:

  • POST /api/aggregate/triage/query

Request (example):

{
  "source": "both",
  "pipeline_id": "p2",
  "run_id": "r_p2_20260307235500_Q5_76458",
  "context_days": 7,
  "mode": "deterministic",
  "user_prompt": "optional operator context",
  "error_message": "optional",
  "error_code": "optional"
}
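
As a rough illustration, the request above could be assembled in Python with the standard library. The helper name `build_triage_request` and the example base URL are hypothetical, and the JSON Content-Type header is an assumption. The route and field names are taken from this section.

```python
import json
from urllib.request import Request


def build_triage_request(
    base_url: str,
    pipeline_id: str,
    run_id: str,
    source: str = "both",
    context_days: int = 7,
    mode: str = "deterministic",
) -> Request:
    """Build (but do not send) a POST to the aggregator triage query route."""
    body = {
        "source": source,
        "pipeline_id": pipeline_id,
        "run_id": run_id,
        "context_days": context_days,
        "mode": mode,
    }
    return Request(
        base_url.rstrip("/") + "/api/aggregate/triage/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen(req, timeout=10)` and parse the response body with `json.load`. The optional fields (user_prompt, error_message, error_code) are left out here and can be added to `body` as needed.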

Response contract:

  • summary (string)
  • likely_cause (string)
  • impacted_consumers[]
  • impact_summary (nodes_count, edges_count, degraded, impact_model)
  • next_actions[]
  • runbook_references[]
  • degraded (boolean)
  • warnings[]
  • confidence
  • confidence_reason
  • mode (deterministic_rules_v1 or assisted_bedrock_v1)
  • requested_mode (deterministic or assisted)
  • facts_used[] (field, value, source)
  • traceability object:
    • telemetry_facts[]
    • runbook_facts[]
    • impact_facts[]
    • cost_facts[]
    • evidence_links[]
  • evidence (object)
  • generated_at

Current mode behavior:

  • mode=deterministic: active and fully supported.
  • mode=assisted: provider-assisted enrichment is attempted when runtime config is present.
  • Assisted provider selection is controlled by ASSISTED_TRIAGE_PROVIDER:
    • bedrock (default)
    • azure_openai (scaffolded parity path)
    • auto (use Azure OpenAI when configured, else Bedrock)
  • If assisted provider is unavailable/misconfigured/fails and ASSISTED_FAILURE_POLICY=fallback (default), the API falls back to deterministic rules with:
    • warning metadata,
    • evidence.assisted.fallback_used=true,
    • degraded=true.
  • If assisted provider fails and ASSISTED_FAILURE_POLICY=strict, the API returns HTTP 503 (assisted_mode_failed) and does not fall back.
  • Assisted telemetry/observability is exposed via:
    • API: GET /api/triage/assisted/telemetry
    • aggregate: GET /api/aggregate/triage/assisted/telemetry
    • payload includes status counters (success|fallback|failed), provider breakdown, error-class counts, and recent events.
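
The fallback and strict policies can be summarised in a short sketch. It is not the service's implementation: the `deterministic` and `assisted` callables, the `AssistedModeFailed` class, and the warning text are hypothetical. It also assumes a fallback response reports `mode` as deterministic_rules_v1, which the README implies but does not state.

```python
from typing import Any, Callable, Dict


class AssistedModeFailed(Exception):
    """Stands in for the HTTP 503 (assisted_mode_failed) response under strict policy."""

    status_code = 503
    error = "assisted_mode_failed"


def run_triage(
    requested_mode: str,
    deterministic: Callable[[], Dict[str, Any]],
    assisted: Callable[[], Dict[str, Any]],
    failure_policy: str = "fallback",
) -> Dict[str, Any]:
    """Apply the documented mode and failure-policy rules to two triage backends."""
    if requested_mode != "assisted":
        result = dict(deterministic())
        result.update(mode="deterministic_rules_v1", requested_mode=requested_mode)
        return result
    try:
        result = dict(assisted())
        result.update(mode="assisted_bedrock_v1", requested_mode="assisted")
        return result
    except Exception as exc:
        if failure_policy == "strict":
            # Strict: surface the failure, never fall back.
            raise AssistedModeFailed(str(exc)) from exc
        # Fallback: deterministic rules plus explicit degradation markers.
        result = dict(deterministic())
        result.update(mode="deterministic_rules_v1", requested_mode="assisted", degraded=True)
        result["warnings"] = list(result.get("warnings", [])) + [f"assisted provider failed: {exc}"]
        result["evidence"] = {**result.get("evidence", {}), "assisted": {"fallback_used": True}}
        return result
```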

Assisted mode runtime configuration:

  • BEDROCK_MODEL_ID (required for assisted mode; expected provider: anthropic.*)
  • BEDROCK_REGION (optional, defaults to AWS_REGION, then ap-southeast-2)
  • standard AWS credentials/role permissions for bedrock:InvokeModel
  • ASSISTED_TRIAGE_PROVIDER (bedrock|azure_openai|auto, default bedrock)
  • ASSISTED_FAILURE_POLICY (fallback|strict, default fallback)
  • ASSISTED_RETRY_MAX_ATTEMPTS (default 4)
  • ASSISTED_RETRY_BASE_MS (default 300)

Azure OpenAI assisted-mode scaffolding:

  • AZURE_OPENAI_ENDPOINT
  • AZURE_OPENAI_API_KEY
  • AZURE_OPENAI_DEPLOYMENT
  • AZURE_OPENAI_API_VERSION (optional, default 2024-10-21)
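
The defaults listed above can be read as the following resolution logic. This is a sketch: the `AssistedConfig` dataclass and function names are hypothetical, and "configured" for the `auto` provider is assumed to mean that the endpoint, API key, and deployment variables are all set.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AssistedConfig:
    provider: str
    failure_policy: str
    retry_max_attempts: int
    retry_base_ms: int
    bedrock_region: str
    azure_api_version: str


def azure_configured(env: Mapping[str, str]) -> bool:
    """Assumption: Azure OpenAI counts as configured when all three values are set."""
    required = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_DEPLOYMENT")
    return all(env.get(key) for key in required)


def load_assisted_config(env: Mapping[str, str]) -> AssistedConfig:
    """Resolve assisted-mode settings from an environment mapping using the documented defaults."""
    provider = env.get("ASSISTED_TRIAGE_PROVIDER", "bedrock")
    if provider == "auto":
        provider = "azure_openai" if azure_configured(env) else "bedrock"
    return AssistedConfig(
        provider=provider,
        failure_policy=env.get("ASSISTED_FAILURE_POLICY", "fallback"),
        retry_max_attempts=int(env.get("ASSISTED_RETRY_MAX_ATTEMPTS", "4")),
        retry_base_ms=int(env.get("ASSISTED_RETRY_BASE_MS", "300")),
        # BEDROCK_REGION, then AWS_REGION, then ap-southeast-2.
        bedrock_region=env.get("BEDROCK_REGION") or env.get("AWS_REGION") or "ap-southeast-2",
        azure_api_version=env.get("AZURE_OPENAI_API_VERSION", "2024-10-21"),
    )
```

Calling `load_assisted_config(os.environ)` would resolve the settings for the current process.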

Runbook catalog enrichment:

  • GET /api/runbooks and GET /api/aggregate/runbooks now include richer metadata:
    • owner
    • sla_minutes
    • runbook_status (configured, missing, invalid_format)
    • tags[]
    • summary coverage block (configured, missing, invalid_format, coverage_ratio)
  • Supported filters: search, team, domain, criticality, owner, include_missing.
  • Runbook lifecycle governance (persisted):
    • GET /api/runbooks/governance
    • POST /api/runbooks/governance/upsert
    • POST /api/runbooks/governance/remove
    • aggregate equivalents under /api/aggregate/runbooks/governance*
    • store path override: RUNBOOK_GOV_STORE_PATH=/custom/path/runbook_governance_store.json
    • lifecycle states: none, open, in_progress, remediated, waived

Governance surface:

  • UI includes a baseline Governance view (?view=governance) with:
    • runbook coverage
    • critical runbook gaps
    • SLA pressure count
    • cost anomaly count
  • Governance drilldown rows for runbook-oriented views include an operator workflow jump:
    • Open in runbooks pre-fills Triage runbook filters (search/team/domain/criticality/status) and switches to ?view=triage.
  • Governance drilldown API:
    • GET /api/aggregate/governance/drilldown?source=both&days=7&drilldown=runbook_coverage|critical_runbook_gaps|sla_pressure|cost_anomalies|dq_mismatches&limit=200
  • Governance deep-link state params:
    • view=governance
    • gov_source=both|aws|azure
    • gov_days=7|14|30
    • gov_drilldown=runbook_coverage|critical_runbook_gaps|sla_pressure|cost_anomalies|dq_mismatches
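
A deep link into the Governance view can be built from these parameters. The sketch below validates values against the options listed above. The helper name and the assumption that the UI is served from the site root (`/`) are hypothetical.

```python
from urllib.parse import urlencode

GOV_SOURCES = {"both", "aws", "azure"}
GOV_DAYS = {7, 14, 30}
GOV_DRILLDOWNS = {
    "runbook_coverage",
    "critical_runbook_gaps",
    "sla_pressure",
    "cost_anomalies",
    "dq_mismatches",
}


def governance_link(base_url: str, source: str = "both", days: int = 7,
                    drilldown: str = "runbook_coverage") -> str:
    """Build a Governance deep link, rejecting values outside the documented options."""
    if source not in GOV_SOURCES:
        raise ValueError(f"unsupported gov_source: {source}")
    if days not in GOV_DAYS:
        raise ValueError(f"unsupported gov_days: {days}")
    if drilldown not in GOV_DRILLDOWNS:
        raise ValueError(f"unsupported gov_drilldown: {drilldown}")
    query = urlencode({
        "view": "governance",
        "gov_source": source,
        "gov_days": days,
        "gov_drilldown": drilldown,
    })
    return f"{base_url.rstrip('/')}/?{query}"
```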

Policy/profile surface (baseline):

  • UI includes ?view=policy with:
    • persisted policy assignment actions by scope (pipeline, team, domain)
    • SLA profile assignment/effective table with mismatch + exception flags
    • exception metadata (reason, reviewer, expiry, approved)
    • approved vs unmanaged exception-state visibility
    • policy audit history table with filters (action, scope, target)
    • baseline DQ/operational standards table with live evidence
    • profile KPIs (strict/balanced/relaxed counts + standards coverage)
    • row click-through actions into Overview, Governance, and Triage
  • Policy deep-link state params:
    • view=policy
    • policy_source=both|aws|azure
    • policy_days=7|14|30
    • policy_profile=all|strict|balanced|relaxed
    • policy_mismatch=all|mismatch
    • policy_exception=all|exception
    • policy_exception_state=all|approved|unmanaged|expired|rejected
    • policy_audit_action=all|assign|remove
    • policy_audit_scope=all|pipeline|team|domain
    • policy_audit_target=<target_id_substring>
  • Actionable DQ standards workflow (persisted):
    • API:
      • GET /api/dq/standards
      • POST /api/dq/standards/upsert
      • POST /api/dq/standards/remove
      • GET /api/dq/effective
    • aggregate equivalents:
      • GET /api/aggregate/dq/standards
      • POST /api/aggregate/dq/standards/upsert
      • POST /api/aggregate/dq/standards/remove
      • GET /api/aggregate/dq/effective
    • supported standards:
      • runbook_required
      • max_breach_ratio
      • max_failures_7d
    • precedence:
      • pipeline override > team override > domain override > default baseline
    • temporary deviation fields:
      • is_exception
      • exception_approved
      • exception_reviewer
      • exception_expires_at
      • waiver_template
    • effective DQ response includes:
      • per-pipeline remediation_suggestions[]
      • per-pipeline action_links (overview/governance/triage/incidents)
      • summary aggregates: failing_by_team, failing_by_domain, failing_by_profile, failing_standards_counts
    • store path override:
      • DQ_STORE_PATH=/custom/path/dq_standards_store.json

Persisted policy backend notes:

  • default store path:
    • /home/site/wwwroot/policy_profiles_store.json when available
    • fallback local runtime path under app/
  • override with env var:
    • POLICY_STORE_PATH=/custom/path/policy_profiles_store.json
  • effective resolution precedence:
    • pipeline assignment > team assignment > domain assignment > recommended default (from criticality)
  • assignment payload supports exception governance metadata:
    • is_exception
    • exception_approved
    • exception_status (none|proposed|approved|rejected|expired)
    • exception_reviewer
    • exception_expires_at
    • review_due_at
    • waiver_template
  • effective response includes:
    • exception_state (none|approved|unmanaged|expired|rejected)
    • exception_status
    • summary unmanaged_exception_count
    • summary proposed_exception_count, rejected_exception_count, approved_exception_count, expired_exception_count
  • history endpoint returns immutable change events:
    • action, scope, target_id, changed_by, changed_at, reason, old_value, new_value
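
The resolution precedence (pipeline, then team, then domain, then a criticality-based default) can be sketched as a simple lookup chain. The data shapes, the `resolved_from` key, and the criticality-to-profile mapping used in the usage example are hypothetical. Only the precedence order and the profile names strict/balanced/relaxed come from this README.

```python
from typing import Any, Dict, Mapping, Optional


def effective_profile(
    pipeline: Mapping[str, Any],
    assignments: Mapping[str, Mapping[Optional[str], str]],
    default_by_criticality: Mapping[Optional[str], str],
) -> Dict[str, str]:
    """Resolve a pipeline's policy profile using the documented precedence order."""
    lookups = (
        ("pipeline", pipeline["pipeline_id"]),
        ("team", pipeline.get("team")),
        ("domain", pipeline.get("domain")),
    )
    for scope, key in lookups:
        profile = assignments.get(scope, {}).get(key)
        if profile:
            # The first scope with an explicit assignment wins.
            return {"profile": profile, "resolved_from": scope}
    # No assignment at any scope: fall back to the criticality-based recommendation.
    fallback = default_by_criticality.get(pipeline.get("criticality"), "balanced")
    return {"profile": fallback, "resolved_from": "default"}
```

For example, with `assignments = {"team": {"data-eng": "strict"}}`, a pipeline owned by team `data-eng` resolves to strict from the team scope unless a pipeline-level assignment exists for it.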

Cost drilldown polish:

  • Cost anomalies table now includes:
    • team/domain context
    • severity band
    • anomaly filters (all vs anomaly_only)
    • minimum ratio filter
    • row Focus action to jump to pipeline detail in overview.
    • row Investigate action to build deeper incident + triage context in-page.
  • Investigation endpoint:
    • API: GET /api/costs/anomalies/investigate/{pipeline_id}
    • aggregate: GET /api/aggregate/costs/anomalies/investigate/{pipeline_id}
    • response includes:
      • anomaly
      • trend.items[]
      • top_cost_runs[]
      • related_incidents[]
      • baseline_delta_aud
      • baseline_delta_ratio
      • recent_change_24h_aud
      • policy_pressure
      • dq_pressure
      • recommended_next_actions[]
      • triage_prefill
      • degraded
      • warnings[]

Assisted triage evidence UX:

  • Incident/Triage action panels now surface:
    • requested vs executed mode
    • degraded/fallback metadata
    • grouped facts_used by source
    • grouped traceability facts (telemetry, runbook, impact, cost)
    • traceability evidence links
    • structured evidence cards
    • raw evidence payload details (expandable)

Azure OpenAI parity:

  • Azure OpenAI assisted-mode parity is now scaffolded behind ASSISTED_TRIAGE_PROVIDER=azure_openai (or auto when configured).
  • Bedrock remains the default provider for current production posture.

Team scorecards:

  • API: GET /api/scorecards/teams
  • aggregate: GET /api/aggregate/scorecards/teams
  • export:
    • API: GET /api/scorecards/teams/export
    • aggregate: GET /api/aggregate/scorecards/teams/export
  • governance UI now renders team scorecards with:
    • composite score
    • DQ failures
    • policy mismatches
    • unmanaged exceptions
    • runbook coverage ratio
    • ownership posture ratio
    • anomaly count
    • SLA pressure trend direction
    • breach ratio
    • click-through into policy workflow

Governance audit/reporting:

  • Policy reporting endpoints:
    • API: GET /api/policies/history/export, GET /api/policies/exceptions/register, GET /api/policies/review-queue
    • aggregate: GET /api/aggregate/policies/history/export, GET /api/aggregate/policies/exceptions/register, GET /api/aggregate/policies/review-queue
  • Governance posture snapshot:
    • API: GET /api/governance/posture
    • aggregate: GET /api/aggregate/governance/posture

DQ exception approval workflow:

  • API:
    • POST /api/dq/exceptions/review
    • GET /api/dq/history
    • GET /api/dq/exceptions/register
  • aggregate:
    • POST /api/aggregate/dq/exceptions/review
    • GET /api/aggregate/dq/history
    • GET /api/aggregate/dq/exceptions/register

Backward compatibility:

  • The API response still includes legacy fields such as triage_hints for existing clients.

Copy helpers are available in the operator action panel:

  • Copy action summary (plain text)
  • Copy as Markdown (ticket/chat ready format)

About

Data & AI Pipeline Control Tower API — FastAPI service for pipeline health metrics from Snowflake, deployable to AWS App Runner.
