Control Tower API + Aggregation UI

License Notice

Copyright © 2026 Aleksy Pyrz. All rights reserved.

This repository is source-available for private, internal, non-commercial evaluation only. It is not open source.

You may view, clone, and run the code for private evaluation only. You may not use it commercially, deploy it in production, redistribute it, sublicense it, resell it, offer it as a hosted service, or claim it as your own.

See LICENSE.md for full terms.

For commercial licensing or acquisition discussions: aleksypyrz@gmail.com

Control Tower API + Aggregation UI

Control Tower is a Python-based observability service for data pipeline operations.

This repository contains:

API service (FastAPI) exposing Snowflake-backed operational endpoints.
Aggregation UI backend (Flask) that merges AWS and Azure API responses.
Deployment and verification scripts for Azure Web App and AWS App Runner.

Repository Structure

app/main.py: FastAPI service entrypoint and API routes.
app/snowflake_client.py: Snowflake connection/auth/query logic.
controltower_ui/app.py: Aggregation backend (AWS + Azure source merge).
verify_and_deploy_API.sh: Azure API deploy + verification script.
verify_and_deploy_ui.sh: Azure UI deploy + verification script.
deploy/create_apprunner_controltower.sh: first-time AWS App Runner service creation.
deploy/deploy_apprunner_controltower.sh: App Runner update using image identifier.
deploy/deploy_apprunner_controltower_rel.sh: App Runner update by building/pushing release image.
scripts/git_update.sh: safe Git update workflow for deployment branches.
scripts/post_deploy_api_smoke.sh: API smoke tests.
scripts/post_deploy_ui_smoke.sh: UI smoke tests.
scripts/post_deploy_truth_check.sh: deployment truth verification (/health, /version, UI build banner, smoke index).
scripts/ci_failure_probe.sh: intentional CI failure probe (for workflow failure-surfacing validation).
scripts/ci_preflight.sh: CI config/secrets preflight classification (strict or warn mode).
scripts/assisted_provider_validate.sh: assisted provider runtime validation probe (Bedrock/Azure OpenAI).
scripts/release.sh: one-command release orchestrator (git update, deploy, smoke tests).

Usage Terms

Permitted

View the source code
Clone it for private evaluation
Run and modify it for private, internal, non-production, non-commercial evaluation

Not permitted

Commercial use
Production use
Redistribution
Resale
Sublicensing
Public hosting / SaaS use
Copying substantial parts into another product
Claiming the code as your own

Runtime Architecture

API service reads Snowflake env vars and serves operational endpoints under /api/*.
UI aggregation service proxies and merges responses from:
- AWS_BASE
- AZURE_BASE
Both services expose /health endpoints for liveness checks.
UI overview and incidents views show operator-visible warning banners when a source is degraded and when endpoint capabilities are degraded.

Prerequisites

Install the following tools locally:

python3 (3.11+ recommended)
pip
jq
curl
zip / unzip
git
az (Azure CLI) for Azure deployments
aws + docker for AWS/App Runner deployments

Environment Variables

API (Snowflake)

Required:

SF_USER
one of: SF_ACCOUNT_IDENTIFIER or SF_ACCOUNT_URL
one auth method:
- SF_PRIVATE_KEY_PEM_B64 (recommended), or
- SF_PASSWORD / SNOWFLAKE_PASSWORD

Recommended:

SF_ROLE
SF_WAREHOUSE
SF_DATABASE (commonly BHP_PLATFORM_LAB)
SF_SCHEMA (commonly OBS)
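
The required-variable rules above can be checked before starting the API. This is a minimal sketch that only mirrors the list in this README; it is not the logic in app/snowflake_client.py, and the function name missing_snowflake_settings is hypothetical.

```python
from typing import List, Mapping


def missing_snowflake_settings(env: Mapping[str, str]) -> List[str]:
    """Describe required Snowflake settings that are absent or blank.

    Hypothetical helper: it encodes only the rules listed in this README.
    """
    def is_set(name: str) -> bool:
        return bool(env.get(name, "").strip())

    problems = []
    if not is_set("SF_USER"):
        problems.append("SF_USER")
    # Exactly one account locator is needed; either form is accepted.
    if not (is_set("SF_ACCOUNT_IDENTIFIER") or is_set("SF_ACCOUNT_URL")):
        problems.append("SF_ACCOUNT_IDENTIFIER or SF_ACCOUNT_URL")
    # Key-pair auth is recommended, password auth is the alternative.
    if not (is_set("SF_PRIVATE_KEY_PEM_B64") or is_set("SF_PASSWORD")
            or is_set("SNOWFLAKE_PASSWORD")):
        problems.append("SF_PRIVATE_KEY_PEM_B64, SF_PASSWORD, or SNOWFLAKE_PASSWORD")
    return problems
```

Calling missing_snowflake_settings(os.environ) before launching uvicorn returns an empty list when the required settings are present.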

UI Aggregator

Required for merged mode:

AWS_BASE (example: https://api-controltower.aws.example.com)
AZURE_BASE (example: https://api-controltower.azure.example.com)

Deployment fingerprint (required for release smoke success)

Set these in API and UI environments so /health and /version show exact deployed build identity:

BUILD_ID
GIT_SHA
RELEASE_TAG

The smoke scripts hard-assert that these values are present and non-empty in /version.
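
The same assertion can be expressed in Python. This sketch assumes the /version payload nests the values under a build object (inferred from the build.build_id, build.git_sha, and build.release_tag paths named in the smoke test section); the function name is hypothetical and this is not the smoke script itself.

```python
from typing import Any, Dict, List

REQUIRED_BUILD_FIELDS = ("build_id", "git_sha", "release_tag")


def missing_build_fields(version_payload: Dict[str, Any]) -> List[str]:
    """List build fingerprint fields that are absent or empty.

    Assumes a payload shaped like {"build": {"build_id": ..., ...}}.
    """
    build = version_payload.get("build")
    if not isinstance(build, dict):
        # No build object at all: every field counts as missing.
        return list(REQUIRED_BUILD_FIELDS)
    return [
        name for name in REQUIRED_BUILD_FIELDS
        if not str(build.get(name) or "").strip()
    ]
```

An empty result means the deployed build identity is fully stamped.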

Local Development

1) Create virtual environment

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2) Run API locally

uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

Health check:

curl -fsS http://localhost:8000/health | jq .

3) Run UI aggregator locally

cd controltower_ui
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

export AWS_BASE="http://localhost:8000"
export AZURE_BASE="http://localhost:8000"
python3 app.py

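UI health (replace <ui-port> with the port the UI process binds to; port 8000 is already used by the API from step 2):

curl -fsS http://localhost:<ui-port>/health | jq .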

Git Update Workflow (Before Deploying)

Use the repository script to keep deployment branches updated and avoid interactive Git mistakes.

Standard update (main)

BASE_BRANCH=main TARGET_BRANCH=main ./scripts/git_update.sh

Update release branch from main and push

BASE_BRANCH=main TARGET_BRANCH=release PUSH=1 ./scripts/git_update.sh

What it does:

fetches latest refs
checks out target branch
fast-forwards the target when it is main; otherwise rebases the target branch onto the base branch
optionally pushes when PUSH=1

Azure Deployment (Zip Deploy)

Required release stamp exports (before deployment)

Export these before running deploy/release flows:

export BUILD_ID="$(date +%Y-%m-%d).1"
export GIT_SHA="$(git rev-parse --short HEAD)"
export RELEASE_TAG="controltower-v0.4.1"

If these are missing and not already present in App Service settings, smoke tests fail at the /version assertions.

API deployment/update (recommended script)

Step 1: set variables (or rely on defaults in script)

export API_DIR="/Users/aleksypyrz/Documents/GitHub/controltower_api"
export RG="rg-bhp-platformlab-dev-aue-app"
export WEBAPP_API="app-bhp-platformlab-dev-aue-controltower-api"
export API_URL="https://api-controltower.gitpushandpray.ai"

Step 2: run deploy + verify

./verify_and_deploy_API.sh

If BUILD_ID, GIT_SHA, and RELEASE_TAG are exported, deploy scripts apply them to App Service settings before deploy.

Optional: verify package only (no deploy)

SKIP_DEPLOY=1 ./verify_and_deploy_API.sh

Additional API deploy script behavior:

Sets the Azure startup command to bash startup.sh on every run.
Supports retry attempts with DEPLOY_RETRIES (default: 2).
Uses az webapp deploy --track-status false to avoid false-negative tracking issues seen with OneDeploy status polling.

UI deployment/update (recommended script)

./verify_and_deploy_ui.sh

Notes:

Script builds a zip, compares file hashes, deploys with az webapp deploy, then runs endpoint checks.
If release stamp vars are exported, script writes BUILD_ID/GIT_SHA/RELEASE_TAG to UI app settings before deploy.
Script supports env overrides for UI_DIR, RG, WEBAPP_UI, UI_URL, and DEPLOY_RETRIES.

AWS App Runner Deployment

First-time service creation

Export required Snowflake environment variables.
Run:

./deploy/create_apprunner_controltower.sh

Update existing App Runner service (image already in ECR)

IMAGE_IDENTIFIER="<account>.dkr.ecr.ap-southeast-2.amazonaws.com/bhp-platformlab-controltower:rel-YYYYMMDD-HHMMSS" \
./deploy/deploy_apprunner_controltower.sh

Build + push release image + update service

SERVICE_ARN="arn:aws:apprunner:..." ./deploy/deploy_apprunner_controltower_rel.sh

Post-Deployment Smoke Tests

API smoke test script

API_URL="https://api-controltower.gitpushandpray.ai" ./scripts/post_deploy_api_smoke.sh

The impact endpoint contract is enforced by default (IMPACT_REQUIRED=1). Setting it explicitly is equivalent:

IMPACT_REQUIRED=1 API_URL="https://api-controltower.gitpushandpray.ai" ./scripts/post_deploy_api_smoke.sh

Checks:

/health
/version (hard assertion of build.build_id, build.git_sha, build.release_tag)
/api/summary
summary contract extensions:
- stale_datasets (non-negative)
- top_impacted_downstream_assets (array)
- cost_24h_aud, cost_7d_aud, cost_7d_daily_avg_aud (non-negative)
/api/pipelines
pagination behavior on /api/pipelines/{id}/runs (offset must change output)
direct pipeline SLA contract (/api/pipelines/{id}/sla)
costs contract (/api/costs)
first-class impact contract: /api/pipelines/{id}/impact must return HTTP 200 with required fields.
lineage strategy contract: /api/pipelines/{id}/lineage must return deprecated alias metadata and point to /impact.
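
The summary contract extensions listed above amount to a small set of type and range checks. The sketch below assumes those fields sit at the top level of the /api/summary response, which this README does not state; the helper name is hypothetical.

```python
from numbers import Real
from typing import Any, Dict, List

NON_NEGATIVE_FIELDS = (
    "stale_datasets",
    "cost_24h_aud",
    "cost_7d_aud",
    "cost_7d_daily_avg_aud",
)


def summary_contract_violations(summary: Dict[str, Any]) -> List[str]:
    """Return one message per violated summary contract extension."""
    violations = []
    for name in NON_NEGATIVE_FIELDS:
        value = summary.get(name)
        # bool is a numeric type in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real) or value < 0:
            violations.append(f"{name} must be a non-negative number")
    if not isinstance(summary.get("top_impacted_downstream_assets"), list):
        violations.append("top_impacted_downstream_assets must be an array")
    return violations
```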

UI smoke test script

UI_URL="https://controltower.gitpushandpray.ai" ./scripts/post_deploy_ui_smoke.sh

Checks:

/health
/version (hard assertion of build.build_id, build.git_sha, build.release_tag)
aggregate summary
aggregate summary contract extensions:
- stale_datasets (non-negative)
- top_impacted_downstream_assets (array)
- cost_24h_aud, cost_7d_aud, cost_7d_daily_avg_aud (non-negative)
aggregate costs contract (/api/aggregate/costs)
aggregate pipeline views (aws, both)
aggregate incidents
aggregate pipeline SLA contract (/api/aggregate/pipelines/{id}/sla)
aggregate runs pagination for AWS (offset=0 must differ from offset=3)
pagination assertion compares items[].run_id sets (not full payload text), so that differing metadata timestamps alone cannot produce a false pass
semantic assertions:
- pipeline p2 exists in aws and both
- aws.runs_7d(p2) > 1000
- aws.last_end_ts(p2) is not null
- both.runs_7d(p2) >= aws.runs_7d(p2)
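
The run_id set comparison used by the pagination assertion can be illustrated as follows. This is a sketch of the idea only, not the smoke script; the generated_at key in the usage example is a hypothetical stand-in for any metadata timestamp.

```python
from typing import Any, Dict, Set


def run_ids(page: Dict[str, Any]) -> Set[str]:
    """Collect the run_id of every item in a runs page."""
    return {str(item["run_id"]) for item in page.get("items", []) if "run_id" in item}


def pagination_changed(first_page: Dict[str, Any], later_page: Dict[str, Any]) -> bool:
    """True only when the two pages contain different sets of run_ids.

    Metadata outside items[] is ignored on purpose.
    """
    return run_ids(first_page) != run_ids(later_page)
```

Two pages that differ only in a timestamp compare as unchanged, so an endpoint that ignores offset is still caught:

```python
page_a = {"items": [{"run_id": "r1"}, {"run_id": "r2"}], "generated_at": "t1"}
page_b = {"items": [{"run_id": "r1"}, {"run_id": "r2"}], "generated_at": "t2"}
print(pagination_changed(page_a, page_b))  # False
```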

Full Step-by-Step Release Procedure (Recommended)

Update local branch safely:

BASE_BRANCH=main TARGET_BRANCH=main ./scripts/git_update.sh

Run local syntax checks:

python3 -m py_compile app/main.py controltower_ui/app.py

Deploy API:
```
./verify_and_deploy_API.sh
```
Deploy UI:
```
./verify_and_deploy_ui.sh
```

Run independent smoke tests:

API_URL="https://api-controltower.gitpushandpray.ai" ./scripts/post_deploy_api_smoke.sh
UI_URL="https://controltower.gitpushandpray.ai" ./scripts/post_deploy_ui_smoke.sh

Commit and push deployment-related changes if needed:

git add -A
git commit -m "docs: update deployment runbook and smoke test scripts"
git push origin main

Rollback Guidance

Azure Web App: redeploy previous known-good zip artifact or previous commit.
App Runner: update service back to previous immutable image tag/digest.
Re-run smoke scripts after rollback.

Security and Operations Notes

Never commit secrets or raw private keys.
Use secret managers / CI secret stores for SF_* credentials.
Prefer immutable container tags or digest-pinned images for App Runner updates.
Keep jq, az, aws, and docker versions current.

Troubleshooting

Database dependency unavailable: check Snowflake env vars and key format.
UI returns upstream errors: confirm AWS_BASE/AZURE_BASE are reachable and healthy.
App Runner update fails with same image: provide a new image tag or digest.
Azure deploy succeeds but app unhealthy: inspect app logs and rerun smoke tests.
Smoke fails at /version build assertions: ensure BUILD_ID, GIT_SHA, and RELEASE_TAG were exported before deploy (or already present in App Service settings).

For contribution and policy docs, see:

CONTRIBUTING.md
SECURITY.md
SUPPORT.md

One-Command Release Script

Use this when you want a full redeploy flow in one command.

./scripts/release.sh

Recommended full release command:

export BUILD_ID="$(date +%Y-%m-%d).1"
export GIT_SHA="$(git rev-parse --short HEAD)"
export RELEASE_TAG="controltower-v0.4.1"
RUN_GIT_UPDATE=0 DEPLOY_RETRIES=3 ./scripts/release.sh

Default flow order:

Git update (scripts/git_update.sh)
API deploy + verification (verify_and_deploy_API.sh)
UI deploy + verification (verify_and_deploy_ui.sh)
API smoke tests (scripts/post_deploy_api_smoke.sh)
UI smoke tests (scripts/post_deploy_ui_smoke.sh)

Common usage options

Skip git update:

RUN_GIT_UPDATE=0 ./scripts/release.sh

Dry run packaging/verification only (no API deploy):

SKIP_DEPLOY=1 RUN_DEPLOY_UI=0 RUN_SMOKE_API=0 RUN_SMOKE_UI=0 ./scripts/release.sh

Run only smoke tests:

RUN_GIT_UPDATE=0 RUN_DEPLOY_API=0 RUN_DEPLOY_UI=0 ./scripts/release.sh

Run semantic checks with custom pipeline/threshold:

SEMANTIC_PIPELINE_ID=p2 ASSERT_RUNS7D_GT=1000 RUN_GIT_UPDATE=0 RUN_DEPLOY_API=0 RUN_DEPLOY_UI=0 ./scripts/release.sh

Run against custom URLs or resources:

RG="rg-bhp-platformlab-dev-aue-app" \
WEBAPP_API="app-bhp-platformlab-dev-aue-controltower-api" \
WEBAPP_UI="app-bhp-platformlab-dev-aue-controltower-ui" \
API_URL="https://api-controltower.gitpushandpray.ai" \
UI_URL="https://controltower.gitpushandpray.ai" \
./scripts/release.sh

Deploy only UI (when API is already healthy)

RUN_GIT_UPDATE=0 RUN_DEPLOY_API=0 RUN_SMOKE_API=0 RUN_DEPLOY_UI=1 RUN_SMOKE_UI=1 ./scripts/release.sh

Deploy only API (when testing API fixes)

RUN_GIT_UPDATE=0 RUN_DEPLOY_UI=0 RUN_SMOKE_UI=0 DEPLOY_RETRIES=3 ./scripts/release.sh

CI/CD Release + Smoke Automation

GitHub Actions now includes a release gate workflow:

.github/workflows/release-smoke-controltower.yml

What it does:

runs deploy + smoke from CI (Azure path on push, optional App Runner on manual dispatch),
fails the workflow on smoke regressions,
runs explicit deployment-truth checks (health/version/build fingerprints + smoke artifact index),
runs CI preflight checks for required deployment config/secrets,
supports optional assisted-provider validation probe on manual dispatch,
supports optional automatic rollback attempt (manual dispatch only) when Azure release fails,
captures /version fingerprints after deploy,
uploads full smoke logs (deployment_logs/...) and build metadata as workflow artifacts.

Manual dispatch options:

run_azure_release (default true)
run_apprunner_release (default false)
run_apprunner_api_smoke (default true)
run_failure_probe (default false) to run a controlled negative probe and verify failure surfacing
run_assisted_provider_validation (default false)
expect_azure_assisted (default false) for strict Azure OpenAI provider validation
fail_on_preflight_errors (default true)
enable_auto_rollback (default false)
optional rollback_ref override (defaults to previous commit)
aws_region
optional apprunner_service_arn override

Scheduled regression workflow:

.github/workflows/scheduled-smoke-regression-controltower.yml
runs every 6 hours (plus manual dispatch),
executes API/UI runtime smoke contracts and deployment-truth checks,
runs CI preflight classification for workflow configuration,
can emit failure alerts via SMOKE_ALERT_WEBHOOK secret,
uploads logs and JSON artifacts from deployment_logs/scheduled-smoke-....

UI Operator Warnings

The UI now exposes two warning layers in both Overview and Incidents:

Source health warning banner: driven by _meta.aws_ok / _meta.azure_ok.
Capability warning banner: driven by _meta.capabilities, with expandable details (endpoint | source | status | reason).

This makes partial-source or degraded endpoint behavior visible in-page instead of only in raw payload metadata.
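
As an illustration of how the two layers could be derived from response metadata, the sketch below reads _meta.aws_ok, _meta.azure_ok, and _meta.capabilities. The shape of capabilities (a list of objects with endpoint, source, status, and reason keys) and the "ok" status value are assumptions based on the column names above, not confirmed payload details.

```python
from typing import Any, Dict, List


def warning_banners(payload: Dict[str, Any]) -> List[str]:
    """Build operator-facing banner strings from aggregate response metadata."""
    meta = payload.get("_meta", {})
    banners = []

    # Source health layer: only an explicit False marks a source as degraded.
    down = [
        label for label, key in (("AWS", "aws_ok"), ("Azure", "azure_ok"))
        if meta.get(key) is False
    ]
    if down:
        banners.append("Source degraded: " + ", ".join(down))

    # Capability layer: one banner line per non-ok capability entry (assumed shape).
    for cap in meta.get("capabilities", []):
        if cap.get("status") != "ok":
            detail = " | ".join(
                str(cap.get(k, "")) for k in ("endpoint", "source", "status", "reason")
            )
            banners.append("Capability degraded: " + detail)
    return banners
```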

Operator Diagnostics View

UI now includes a Diagnostics view for operational trust checks:

build fingerprint reconciliation across UI/AWS/Azure surfaces,
assisted triage telemetry counters and recent event feed,
assisted provider posture (selected provider/policy/retry settings),
recent GitHub workflow outcomes (release smoke, scheduled smoke, deploy workflows),
JSON snapshot export for handoff/audit.

Endpoint Strategy (Locked)

Official contracts:

Impact: GET /api/pipelines/{pipeline_id}/impact is the first-class graph impact contract.
SLA (direct): GET /api/pipelines/{pipeline_id}/sla remains a first-class per-pipeline contract.
Costs: GET /api/costs is the first-class cost aggregate contract.
Cost trend: GET /api/costs/trend provides daily cost movement.
Cost anomalies: GET /api/costs/anomalies surfaces latest-day outliers vs baseline.
Triage query: POST /api/triage/query is the first-class deterministic triage contract.
Runbook lifecycle governance (persisted): GET /api/runbooks/governance, POST /api/runbooks/governance/upsert, POST /api/runbooks/governance/remove.
Policy profiles (persisted): GET /api/policies/profiles, POST /api/policies/profiles/assign, POST /api/policies/profiles/remove.
Effective policy resolution: GET /api/policies/effective.
Policy audit history: GET /api/policies/history.
Policy audit/export reporting: GET /api/policies/history/export, GET /api/policies/exceptions/register, GET /api/policies/review-queue.
Team scorecards export: GET /api/scorecards/teams/export.
Governance posture snapshot: GET /api/governance/posture.
Governance posture trend: GET /api/governance/posture/trend.
Governance export bundle: GET /api/governance/export/bundle.
Governance executive report pack: GET /api/governance/report/executive.
Assisted triage telemetry: GET /api/triage/assisted/telemetry.
Assisted provider status: GET /api/triage/assisted/provider/status.
DQ exception governance: POST /api/dq/exceptions/review, GET /api/dq/history, GET /api/dq/exceptions/register.

Compatibility contracts:

Lineage: GET /api/pipelines/{pipeline_id}/lineage is a deprecated alias to impact.
Deprecated lineage responses include:
- deprecated: true
- replacement_endpoint: /api/pipelines/{pipeline_id}/impact
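
A client that still calls the lineage alias can use these two fields to migrate. The sketch below is hypothetical client-side code; it handles both a templated replacement_endpoint and one with the pipeline id already filled in, since this README does not say which form the API returns.

```python
from typing import Any, Dict, Optional


def replacement_for(response_body: Dict[str, Any], pipeline_id: str) -> Optional[str]:
    """Return the concrete replacement path when the body is marked deprecated."""
    if response_body.get("deprecated") is not True:
        return None
    template = response_body.get("replacement_endpoint")
    if not isinstance(template, str):
        return None
    # A no-op when the endpoint is already concrete.
    return template.replace("{pipeline_id}", pipeline_id)
```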

Aggregator support:

GET /api/aggregate/pipelines/{pipeline_id}/impact
GET /api/aggregate/pipelines/{pipeline_id}/sla
GET /api/aggregate/costs
GET /api/aggregate/costs/trend
GET /api/aggregate/costs/anomalies
GET /api/aggregate/costs/pipelines
GET /api/aggregate/pipelines/{pipeline_id}/cost
GET /api/aggregate/runbooks
GET /api/aggregate/runbooks/governance
POST /api/aggregate/runbooks/governance/upsert
POST /api/aggregate/runbooks/governance/remove
GET /api/aggregate/governance/drilldown
GET /api/aggregate/policies/profiles
POST /api/aggregate/policies/profiles/assign
POST /api/aggregate/policies/profiles/remove
GET /api/aggregate/policies/effective
GET /api/aggregate/policies/history
GET /api/aggregate/policies/history/export
GET /api/aggregate/policies/exceptions/register
GET /api/aggregate/policies/review-queue
GET /api/aggregate/scorecards/teams/export
GET /api/aggregate/governance/posture
GET /api/aggregate/governance/posture/trend
GET /api/aggregate/governance/export/bundle
GET /api/aggregate/governance/report/executive
POST /api/aggregate/dq/exceptions/review
GET /api/aggregate/dq/history
GET /api/aggregate/dq/exceptions/register
POST /api/aggregate/triage/query
GET /api/aggregate/triage/assisted/telemetry
GET /api/aggregate/triage/assisted/provider/status
GET /api/aggregate/ops/diagnostics

API vs Aggregator Matrix

| Capability | API route | Aggregator route | Status |
| --- | --- | --- | --- |
| Impact (official) | `/api/pipelines/{pipeline_id}/impact` | `/api/aggregate/pipelines/{pipeline_id}/impact` | First-class |
| Lineage (compat) | `/api/pipelines/{pipeline_id}/lineage` | n/a (use impact route) | Deprecated alias |
| Pipeline SLA (direct) | `/api/pipelines/{pipeline_id}/sla` | `/api/aggregate/pipelines/{pipeline_id}/sla` | First-class |
| SLA pipeline counts | `/api/sla/pipelines` | `/api/aggregate/sla/pipelines` | First-class |
| Costs | `/api/costs` | `/api/aggregate/costs` | First-class |
| Cost trend | `/api/costs/trend` | `/api/aggregate/costs/trend` | First-class |
| Cost anomalies | `/api/costs/anomalies` | `/api/aggregate/costs/anomalies` | First-class |
| Cost ranking | `/api/costs/pipelines` | `/api/aggregate/costs/pipelines` | First-class |
| Pipeline cost detail | `/api/pipelines/{pipeline_id}/cost` | `/api/aggregate/pipelines/{pipeline_id}/cost` | First-class |
| Runbook catalog | `/api/runbooks` | `/api/aggregate/runbooks` | First-class |
| Runbook lifecycle governance | `/api/runbooks/governance` + upsert/remove | `/api/aggregate/runbooks/governance` + upsert/remove | First-class |
| Triage query | `/api/triage/query` | `/api/aggregate/triage/query` | First-class |
| Assisted triage telemetry | `/api/triage/assisted/telemetry` | `/api/aggregate/triage/assisted/telemetry` | First-class |
| Assisted provider status | `/api/triage/assisted/provider/status` | `/api/aggregate/triage/assisted/provider/status` | First-class |
| Policy assignments | `/api/policies/profiles` + assign/remove | `/api/aggregate/policies/profiles` + assign/remove | First-class |
| Effective policy resolution | `/api/policies/effective` | `/api/aggregate/policies/effective` | First-class |
| Policy audit history | `/api/policies/history` | `/api/aggregate/policies/history` | First-class |
| Policy audit/export reporting | `/api/policies/history/export`, `/api/policies/exceptions/register`, `/api/policies/review-queue` | `/api/aggregate/policies/history/export`, `/api/aggregate/policies/exceptions/register`, `/api/aggregate/policies/review-queue` | First-class |
| Team scorecards export | `/api/scorecards/teams/export` | `/api/aggregate/scorecards/teams/export` | First-class |
| Governance posture snapshot | `/api/governance/posture` | `/api/aggregate/governance/posture` | First-class |
| Governance posture trend | `/api/governance/posture/trend` | `/api/aggregate/governance/posture/trend` | First-class |
| Governance export bundle | `/api/governance/export/bundle` | `/api/aggregate/governance/export/bundle` | First-class |
| Governance executive report | `/api/governance/report/executive` | `/api/aggregate/governance/report/executive` | First-class |
| Operator diagnostics | n/a | `/api/aggregate/ops/diagnostics` | First-class |
| DQ exception governance | `/api/dq/exceptions/review`, `/api/dq/history`, `/api/dq/exceptions/register` | `/api/aggregate/dq/exceptions/review`, `/api/aggregate/dq/history`, `/api/aggregate/dq/exceptions/register` | First-class |

Incident Action Workflow

Incidents now include an operator action panel that provides:

likely cause
impact summary
next actions
runbook jump link

The panel is powered by same-origin aggregator routes:

GET /api/aggregate/explain/run/{run_id}
GET /api/aggregate/pipelines/{pipeline_id}/impact
optional POST /api/aggregate/triage/query fallback when error details are present

Formal Triage Query Contract

API:

POST /api/triage/query

Aggregator:

POST /api/aggregate/triage/query

Request (example):

{
  "source": "both",
  "pipeline_id": "p2",
  "run_id": "r_p2_20260307235500_Q5_76458",
  "context_days": 7,
  "mode": "deterministic",
  "user_prompt": "optional operator context",
  "error_message": "optional",
  "error_code": "optional"
}
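
For callers that prefer Python over curl, the request above can be prepared with the standard library. This is a minimal sketch: the function name is hypothetical, and which payload fields are mandatory is not specified here, so the example simply forwards whatever dictionary it is given.

```python
import json
import urllib.request
from typing import Any, Dict


def build_triage_request(base_url: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the aggregator triage endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/aggregate/triage/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends it; against the direct API, swap the path for /api/triage/query.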

Response contract:

summary (string)
likely_cause (string)
impacted_consumers[]
impact_summary (nodes_count, edges_count, degraded, impact_model)
next_actions[]
runbook_references[]
degraded (boolean)
warnings[]
confidence
confidence_reason
mode (deterministic_rules_v1 or assisted_bedrock_v1)
requested_mode (deterministic or assisted)
facts_used[] (field, value, source)
traceability object:
- telemetry_facts[]
- runbook_facts[]
- impact_facts[]
- cost_facts[]
- evidence_links[]
evidence (object)
generated_at

Current mode behavior:

mode=deterministic: active and fully supported.
mode=assisted: provider-assisted enrichment is attempted when runtime config is present.
Assisted provider selection is controlled by ASSISTED_TRIAGE_PROVIDER:
- bedrock (default)
- azure_openai (scaffolded parity path)
- auto (use Azure OpenAI when configured, else Bedrock)
If assisted provider is unavailable/misconfigured/fails and ASSISTED_FAILURE_POLICY=fallback (default), the API falls back to deterministic rules with:
- warning metadata,
- evidence.assisted.fallback_used=true,
- degraded=true.
If assisted provider fails and ASSISTED_FAILURE_POLICY=strict, the API returns HTTP 503 (assisted_mode_failed) and does not fall back.
Assisted telemetry/observability is exposed via:
- API: GET /api/triage/assisted/telemetry
- aggregate: GET /api/aggregate/triage/assisted/telemetry
- payload includes status counters (success|fallback|failed), provider breakdown, error-class counts, and recent events.

Assisted mode runtime configuration:

BEDROCK_MODEL_ID (required for assisted mode; expected provider: anthropic.*)
BEDROCK_REGION (optional, defaults to AWS_REGION, then ap-southeast-2)
standard AWS credentials/role permissions for bedrock:InvokeModel
ASSISTED_TRIAGE_PROVIDER (bedrock|azure_openai|auto, default bedrock)
ASSISTED_FAILURE_POLICY (fallback|strict, default fallback)
ASSISTED_RETRY_MAX_ATTEMPTS (default 4)
ASSISTED_RETRY_BASE_MS (default 300)

Azure OpenAI assisted-mode scaffolding:

AZURE_OPENAI_ENDPOINT
AZURE_OPENAI_API_KEY
AZURE_OPENAI_DEPLOYMENT
AZURE_OPENAI_API_VERSION (optional, default 2024-10-21)
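
Provider selection under ASSISTED_TRIAGE_PROVIDER can be sketched as follows. The definition of "configured" used here (endpoint, API key, and deployment all set) is an assumption; the real check may differ, and both function names are hypothetical.

```python
from typing import Mapping

AZURE_REQUIRED = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
)


def azure_openai_configured(env: Mapping[str, str]) -> bool:
    """Assumed rule: every required Azure OpenAI variable is non-empty."""
    return all(env.get(name, "").strip() for name in AZURE_REQUIRED)


def select_provider(env: Mapping[str, str]) -> str:
    """Resolve bedrock | azure_openai | auto to a concrete provider name."""
    choice = (env.get("ASSISTED_TRIAGE_PROVIDER") or "bedrock").strip().lower()
    if choice == "auto":
        # auto: use Azure OpenAI when configured, else Bedrock.
        return "azure_openai" if azure_openai_configured(env) else "bedrock"
    if choice in ("bedrock", "azure_openai"):
        return choice
    raise ValueError(f"unsupported ASSISTED_TRIAGE_PROVIDER: {choice!r}")
```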

Runbook catalog enrichment:

GET /api/runbooks and GET /api/aggregate/runbooks now include richer metadata:
- owner
- sla_minutes
- runbook_status (configured, missing, invalid_format)
- tags[]
- summary coverage block (configured, missing, invalid_format, coverage_ratio)
Supported filters: search, team, domain, criticality, owner, include_missing.
Runbook lifecycle governance (persisted):
- GET /api/runbooks/governance
- POST /api/runbooks/governance/upsert
- POST /api/runbooks/governance/remove
- aggregate equivalents under /api/aggregate/runbooks/governance*
- store path override: RUNBOOK_GOV_STORE_PATH=/custom/path/runbook_governance_store.json
- lifecycle states: none, open, in_progress, remediated, waived

Governance surface:

UI includes a baseline Governance view (?view=governance) with:
- runbook coverage
- critical runbook gaps
- SLA pressure count
- cost anomaly count
Governance drilldown rows for runbook-oriented views include an operator workflow jump:
- Open in runbooks pre-fills Triage runbook filters (search/team/domain/criticality/status) and switches to ?view=triage.
Governance drilldown API:
- GET /api/aggregate/governance/drilldown?source=both&days=7&drilldown=runbook_coverage|critical_runbook_gaps|sla_pressure|cost_anomalies|dq_mismatches&limit=200
Governance deep-link state params:
- view=governance
- gov_source=both|aws|azure
- gov_days=7|14|30
- gov_drilldown=runbook_coverage|critical_runbook_gaps|sla_pressure|cost_anomalies|dq_mismatches
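
A drilldown request URL can be assembled from the parameters shown above. This is a small sketch using only the standard library; the function name is hypothetical and the defaults copy the example query string.

```python
from urllib.parse import urlencode

DRILLDOWNS = (
    "runbook_coverage",
    "critical_runbook_gaps",
    "sla_pressure",
    "cost_anomalies",
    "dq_mismatches",
)


def drilldown_url(base_url: str, drilldown: str, source: str = "both",
                  days: int = 7, limit: int = 200) -> str:
    """Build the aggregate governance drilldown URL for one drilldown type."""
    if drilldown not in DRILLDOWNS:
        raise ValueError(f"unknown drilldown: {drilldown!r}")
    query = urlencode(
        {"source": source, "days": days, "drilldown": drilldown, "limit": limit}
    )
    return f"{base_url.rstrip('/')}/api/aggregate/governance/drilldown?{query}"
```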

Policy/profile surface (baseline):

UI includes ?view=policy with:
- persisted policy assignment actions by scope (pipeline, team, domain)
- SLA profile assignment/effective table with mismatch + exception flags
- exception metadata (reason, reviewer, expiry, approved)
- approved vs unmanaged exception-state visibility
- policy audit history table with filters (action, scope, target)
- baseline DQ/operational standards table with live evidence
- profile KPIs (strict/balanced/relaxed counts + standards coverage)
- row click-through actions into Overview, Governance, and Triage
Policy deep-link state params:
- view=policy
- policy_source=both|aws|azure
- policy_days=7|14|30
- policy_profile=all|strict|balanced|relaxed
- policy_mismatch=all|mismatch
- policy_exception=all|exception
- policy_exception_state=all|approved|unmanaged|expired|rejected
- policy_audit_action=all|assign|remove
- policy_audit_scope=all|pipeline|team|domain
- policy_audit_target=<target_id_substring>
Actionable DQ standards workflow (persisted):
- API:
  - GET /api/dq/standards
  - POST /api/dq/standards/upsert
  - POST /api/dq/standards/remove
  - GET /api/dq/effective
- aggregate equivalents:
  - GET /api/aggregate/dq/standards
  - POST /api/aggregate/dq/standards/upsert
  - POST /api/aggregate/dq/standards/remove
  - GET /api/aggregate/dq/effective
- supported standards:
  - runbook_required
  - max_breach_ratio
  - max_failures_7d
- precedence:
  - pipeline override > team override > domain override > default baseline
- temporary deviation fields:
  - is_exception
  - exception_approved
  - exception_reviewer
  - exception_expires_at
  - waiver_template
- effective DQ response includes:
  - per-pipeline remediation_suggestions[]
  - per-pipeline action_links (overview/governance/triage/incidents)
  - summary aggregates: failing_by_team, failing_by_domain, failing_by_profile, failing_standards_counts
- store path override:
  - DQ_STORE_PATH=/custom/path/dq_standards_store.json
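
The override precedence for DQ standards (pipeline, then team, then domain, then the default baseline) can be sketched as a lookup in priority order. The storage layout used here, a dictionary keyed by (scope, target), is an assumption made for illustration and does not describe the persisted store format; the team and domain names in the usage example are made up.

```python
from typing import Any, Dict, Tuple

SCOPE_ORDER = ("pipeline", "team", "domain")


def resolve_standard(
    overrides: Dict[Tuple[str, str], Any],
    pipeline_id: str,
    team: str,
    domain: str,
    default: Any,
) -> Tuple[Any, str]:
    """Return (effective value, scope that supplied it)."""
    targets = {"pipeline": pipeline_id, "team": team, "domain": domain}
    # The first matching scope wins: pipeline > team > domain.
    for scope in SCOPE_ORDER:
        key = (scope, targets[scope])
        if key in overrides:
            return overrides[key], scope
    return default, "default"
```

For example, with a team-level and a domain-level override for max_failures_7d, the team value applies until a pipeline-level override is added:

```python
overrides = {("team", "data-eng"): 5, ("domain", "finance"): 9}
print(resolve_standard(overrides, "p2", "data-eng", "finance", default=3))  # (5, 'team')
```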

Persisted policy backend notes:

default store path:
- /home/site/wwwroot/policy_profiles_store.json when available
- fallback local runtime path under app/
override with env var:
- POLICY_STORE_PATH=/custom/path/policy_profiles_store.json
effective resolution precedence:
- pipeline assignment > team assignment > domain assignment > recommended default (from criticality)
assignment payload supports exception governance metadata:
- is_exception
- exception_approved
- exception_status (none|proposed|approved|rejected|expired)
- exception_reviewer
- exception_expires_at
- review_due_at
- waiver_template
effective response includes:
- exception_state (none|approved|unmanaged|expired|rejected)
- exception_status
- summary unmanaged_exception_count
- summary proposed_exception_count, rejected_exception_count, approved_exception_count, expired_exception_count
history endpoint returns immutable change events:
- action, scope, target_id, changed_by, changed_at, reason, old_value, new_value
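
The store path rules above can be sketched as a three-step resolution. Two details are assumptions: "when available" is read as "the /home/site/wwwroot directory exists", and the exact fallback file name under app/ is not given in this README, so LOCAL_FALLBACK is a hypothetical placeholder.

```python
import os
from typing import Callable, Mapping

AZURE_STORE = "/home/site/wwwroot/policy_profiles_store.json"
# Hypothetical fallback location; the README only says "under app/".
LOCAL_FALLBACK = "app/policy_profiles_store.json"


def resolve_policy_store_path(
    env: Mapping[str, str],
    dir_exists: Callable[[str], bool] = os.path.isdir,
) -> str:
    """Pick the policy store path: env override, Azure default, local fallback."""
    override = env.get("POLICY_STORE_PATH")
    if override:
        return override
    if dir_exists(os.path.dirname(AZURE_STORE)):
        return AZURE_STORE
    return LOCAL_FALLBACK
```

The dir_exists parameter exists only so the resolution can be exercised without touching the file system.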

Cost drilldown polish:

Cost anomalies table now includes:
- team/domain context
- severity band
- anomaly filters (all vs anomaly_only)
- minimum ratio filter
- row Focus action to jump to pipeline detail in overview.
- row Investigate action to build deeper incident + triage context in-page.
Investigation endpoint:
- API: GET /api/costs/anomalies/investigate/{pipeline_id}
- aggregate: GET /api/aggregate/costs/anomalies/investigate/{pipeline_id}
- response includes:
  - anomaly
  - trend.items[]
  - top_cost_runs[]
  - related_incidents[]
  - baseline_delta_aud
  - baseline_delta_ratio
  - recent_change_24h_aud
  - policy_pressure
  - dq_pressure
  - recommended_next_actions[]
  - triage_prefill
  - degraded
  - warnings[]

Assisted triage evidence UX:

Incident/Triage action panels now surface:
- requested vs executed mode
- degraded/fallback metadata
- grouped facts_used by source
- grouped traceability facts (telemetry, runbook, impact, cost)
- traceability evidence links
- structured evidence cards
- raw evidence payload details (expandable)

Azure OpenAI parity:

Azure OpenAI assisted-mode parity is now scaffolded behind ASSISTED_TRIAGE_PROVIDER=azure_openai (or auto when configured).
Bedrock remains the default provider for current production posture.

Team scorecards:

API: GET /api/scorecards/teams
aggregate: GET /api/aggregate/scorecards/teams
export:
- API: GET /api/scorecards/teams/export
- aggregate: GET /api/aggregate/scorecards/teams/export
governance UI now renders team scorecards with:
- composite score
- DQ failures
- policy mismatches
- unmanaged exceptions
- runbook coverage ratio
- ownership posture ratio
- anomaly count
- SLA pressure trend direction
- breach ratio
- click-through into policy workflow

Governance audit/reporting:

Policy reporting endpoints:
- API: GET /api/policies/history/export, GET /api/policies/exceptions/register, GET /api/policies/review-queue
- aggregate: GET /api/aggregate/policies/history/export, GET /api/aggregate/policies/exceptions/register, GET /api/aggregate/policies/review-queue
Governance posture snapshot:
- API: GET /api/governance/posture
- aggregate: GET /api/aggregate/governance/posture

DQ exception approval workflow:

API:
- POST /api/dq/exceptions/review
- GET /api/dq/history
- GET /api/dq/exceptions/register
aggregate:
- POST /api/aggregate/dq/exceptions/review
- GET /api/aggregate/dq/history
- GET /api/aggregate/dq/exceptions/register

Backward compatibility:

The API response still includes legacy fields such as triage_hints for existing clients.

Copy helpers are available in the panel:

Copy action summary (plain text)
Copy as Markdown (ticket/chat ready format)
