Production reliability infrastructure for agentic systems.
This repository is a thesis artifact, not a chatbot demo. It demonstrates a bounded decision workflow through a real scenario, but the point is the supervision architecture: explicit intent framing, explicit planning, execution telemetry, reconciliation, operator intervention, and recovery after feedback.
A control surface for keeping AI decision systems reliable after they have been declared safe enough to deploy.
This does not replace your agent framework. It wraps systems built with Strands, LangGraph, LangChain, or custom loops with control, observability, and learning.
The reasoning steps are agentic. They use OpenAI in the same bounded pattern used in agent-memory-failure-demo: a shared client/config layer, explicit prompts per step, strict structured outputs, and deterministic tool/state orchestration around the model calls.
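As a rough sketch of that bounded pattern (illustrative only; the step name, prompt, and schema below are not the repository's actual code), one reasoning step looks like:

```python
# Minimal sketch of one bounded reasoning step, assuming a shared OpenAI client
# and a strict output schema. Names here (FitAssessment, assess_fit) are
# illustrative, not identifiers from this repository.
import json
import os

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # shared client/config layer; reads OPENAI_API_KEY from the environment


class FitAssessment(BaseModel):
    """Strict structured output for a single step."""
    verdict: str
    confidence: float
    unknowns: list[str]


def assess_fit(job_description: str, profile: str) -> FitAssessment:
    """One bounded step: explicit prompt in, validated structure out, no hidden tool use."""
    resp = client.chat.completions.create(
        model=os.getenv("MODEL", "gpt-5.4"),
        messages=[
            {"role": "system", "content": "Return only a JSON object with keys verdict, confidence, unknowns."},
            {"role": "user", "content": f"Job description:\n{job_description}\n\nCandidate profile:\n{profile}"},
        ],
    )
    return FitAssessment.model_validate(json.loads(resp.choices[0].message.content))
```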
Most agent demos ask: can the model complete a task?
This repository asks a different question: how do you keep an agentic system reliable after it has been judged safe enough to deploy?
In production, reliability is not a static label. It is a continuous property, maintained through visibility into runs, reconciliation between intended and observed behavior, operator intervention when needed, and feedback loops that improve future executions.
Safety in staging is a checkpoint. Reliability in production is a continuous control problem.
This is what systems evolve into when verification, observability, and human-in-the-loop are taken seriously.
This system is designed to be adopted incrementally:
- Add verification and telemetry to an existing agent loop
- Introduce pause and human gating for low-confidence cases
- Capture remediation as structured signals
- Use signals to influence routing decisions
- Extract shared control logic into a dedicated control layer only when complexity justifies it
You do not need to adopt the full architecture upfront.
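As a sketch of the first step on that path, wrapping an existing agent step with verification and telemetry can be as small as the following (the wrapper, the verify predicate, and the emit sink are hypothetical, not this repo's API):

```python
# Sketch: add verification and telemetry around an existing agent step without
# changing the step itself. The emit() sink and verify() predicate are placeholders.
import time
from typing import Any, Callable


def supervised(step_name: str,
               step: Callable[..., Any],
               verify: Callable[[Any], bool],
               emit: Callable[[dict], None]) -> Callable[..., Any]:
    """Wrap an agent step: run it, verify its output, and emit a structured event."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        started = time.monotonic()
        output = step(*args, **kwargs)
        ok = verify(output)
        emit({
            "step": step_name,
            "latency_ms": round((time.monotonic() - started) * 1000, 1),
            "verified": ok,
        })
        if not ok:
            # Pause-and-gate behavior can hang off this point in a real loop.
            raise ValueError(f"{step_name}: output failed verification, hold for operator review")
        return output
    return wrapped
```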
A run through the bundled demo looks like this:
- Initialize a scenario (bundled Stripe example or your own input)
- Inspect the explicit plan before execution
- Watch telemetry accumulate as bounded steps run
- Read the reconciliation report after execution
- Apply operator feedback when intent and evidence diverge
- Observe the updated plan, evidence, and verdict
The operator console exposes this loop as a live control surface instead of a chat transcript.
# Backend
cd backend && python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
OPENAI_API_KEY=... MODEL=gpt-5.4 uvicorn app:app --reload
# Frontend
cd frontend && npm install && npm run dev

Deterministic stub mode (no API key required):

CONTROL_SURFACE_STUB=1 uvicorn app:app --reload

Open http://127.0.0.1:3000.
The UI is intentionally closer to an operations console than a chat app.
Panels:
- Inputs
- Intent
- Plan
- Telemetry
- Evidence
- Reconciliation
- Decision Artifact
- Operator Controls
The reconciliation panel is central because it is the heart of the thesis: a script can execute, but a control system continuously checks divergence between plan and reality and surfaces it for correction.
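A minimal sketch of that divergence check, assuming a plan expressed as ordered step ids and telemetry as a list of observed step events (function and field names are illustrative):

```python
# Sketch: reconcile an explicit plan against observed telemetry and surface divergence.
def reconcile(planned_steps: list[str],
              observed_events: list[dict],
              cited_evidence: set[str],
              retrieved_evidence: set[str]) -> dict:
    """Compare intent (the plan) with reality (telemetry and evidence use)."""
    observed_steps = [e["step"] for e in observed_events]
    missing = [s for s in planned_steps if s not in observed_steps]
    unplanned = [s for s in observed_steps if s not in planned_steps]
    uncited = retrieved_evidence - cited_evidence
    return {
        "intent_alignment": "full" if not missing and not unplanned else "partial",
        "evidence_coverage": "full" if not uncited else "partial",
        "missing_steps": missing,
        "unplanned_steps": unplanned,
        "uncited_evidence": sorted(uncited),
    }
```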
Live mid-run view:
This screenshot shows the console in live mode after initialization, with the explicit plan visible, operator controls pinned high in the left column, and the bottom telemetry strip ready to expand into a full trace.
The example below shows a single intervention on a single run. The production claim is that many such interventions, compounded across many runs, are the real reliability signal.
Initial verdict:
{
  "verdict": "conditionally_pursue",
  "confidence": 0.63,
  "evidence_ids": ["ctx_stripe_1"],
  "unknowns": ["team structure", "success metrics"]
}

Operator intervention:

{
  "action": "force_retrieval",
  "target_id": "step_retrieve_context",
  "feedback_type": "missing_evidence",
  "note": "Need more evidence before locking the verdict."
}

Updated verdict after rerun:

{
  "verdict": "conditionally_pursue",
  "confidence": 0.58,
  "evidence_ids": ["ctx_stripe_1"],
  "reconciliation": {
    "intent_alignment": "partial",
    "evidence_coverage": "partial",
    "recommended_action": null
  }
}

- Production reliability is a continuous control property, not a one-time certification event.
- A single successful run does not make an agent trustworthy. Longitudinal observation does.
- The real production system is the agent plus the control surface plus the reconciliation and feedback loop across many runs.
- Operator intervention is a signal, not a failure mode, and should compound into improved future runs.
This repo is an implementation companion to the Control Systems for Intelligent Software thesis.
The example domain is job opportunity evaluation because it is real, evidence-rich, and inspectable. The architecture is the point. It generalizes to any production workflow where agent behavior must remain legible, auditable, and correctable over time.
Production reliability is a closed loop, not a pipeline.
Short target-state overview:
- Architecture Overview
- Roadmap: from the thesis prototype to a control layer, with milestones for a Strands-first vertical slice, run-state projection, routing, and governance integration
- Architecture Decision Records
The target architecture is best understood as a control layer around agent execution, not as a replacement runtime. The first adoption path can wrap a real Strands loop and add verification, telemetry, gates, and remediation before any broader control plane extraction.
┌────────────────────┐
│      Operator      │
└─────────┬──────────┘
          │ supervision & feedback
          ▼
┌───────────────┐     ┌──────────────────────┐     ┌──────────────┐
│ Intent Framer │ ──▶ │ Planner & Executors  │ ──▶ │   Decision   │
│               │     │ (bounded tool calls) │     │   Artifact   │
└───────┬───────┘     └──────────┬───────────┘     └──────▲───────┘
        │                        │                        │
        │                        ▼                        │
        │             ┌──────────────────────┐            │
        └────────────▶│    Telemetry Bus     │────────────┘
                      └──────────┬───────────┘
                                 ▼
                      ┌──────────────────────┐
                      │ Reconciliation Layer │
                      └──────────────────────┘
In repo terms:
- execution: bounded evaluators extract requirements, retrieve context, compare fit, assess unknowns, and generate a verdict
- telemetry: every major step emits a structured event with confidence deltas, evidence refs, and model usage
- interface: a single-page operator console surfaces live state instead of hiding it in chat
- supervision: the operator can approve, reject, revise intent, force retrieval, retry with constraints, skip, or escalate
- reconciliation: the system checks whether intent, evidence, and output still agree before presenting a final artifact
- longitudinal visibility: today a single run, designed to extend to many runs across time
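As a sketch of what one per-step telemetry event might carry (field names mirror what the console surfaces, not the repo's exact schema):

```python
# Sketch of one telemetry event per bounded step; fields mirror what the console
# surfaces (confidence deltas, evidence refs, model usage), not the actual schemas.py.
from dataclasses import dataclass, field


@dataclass
class StepEvent:
    run_id: str
    step_id: str
    status: str                      # e.g. "completed", "paused", "failed"
    confidence_delta: float          # change in verdict confidence caused by this step
    evidence_refs: list[str] = field(default_factory=list)
    model: str | None = None         # None in stub mode
    tokens: int = 0
    latency_ms: float = 0.0
```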
This is:
- a prototype of production reliability infrastructure for agentic systems
- a control surface for supervised decision workflows with file-backed run state
- OpenAI-backed agentic reasoning for framing, planning, fit analysis, verdict generation, and reconciliation
- a demonstration of telemetry and reconciliation as first-class runtime concerns
- a feedback loop where operator correction changes downstream behavior
This is not:
- a generic chatbot or autonomous-agent theater demo
- a one-time safety validation or staging-only evaluation harness
- a framework or template for every AI workflow
- a live job scraper or generic RAG system
The bundled single-run scenario evaluates a role against a fixed profile and company context. It is deliberately small, because the control surface is what the repo is demonstrating, not the task itself.
Inputs:
- job description
- company name
- profile fixture
- operator constraints
Outputs:
- framed intent
- revisable plan
- evidence set
- telemetry trace
- reconciliation reports
- final decision artifact
The default bundled example is a Stripe EM (Developer Productivity AI) role, editable before execution.
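The artifact's rough shape can be read off the worked example earlier in this README; as a hypothetical schema sketch (not the actual models in backend/schemas.py):

```python
# Sketch of the final decision artifact, inferred from the worked example above;
# illustrative only.
from pydantic import BaseModel


class Reconciliation(BaseModel):
    intent_alignment: str            # e.g. "full" | "partial"
    evidence_coverage: str           # e.g. "full" | "partial"
    recommended_action: str | None = None


class DecisionArtifact(BaseModel):
    verdict: str                     # e.g. "conditionally_pursue"
    confidence: float
    evidence_ids: list[str]
    unknowns: list[str] = []
    reconciliation: Reconciliation | None = None
```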
The eval layer is small on purpose. It checks:
- schema validity
- retrieval behavior
- evidence coverage in the artifact
- recovery after operator correction
- deterministic stub-mode behavior for local tests and screenshots
The most important evaluation is not accuracy in the abstract. It is whether the system improves after structured feedback. That property is what makes a run a data point in a longitudinal reliability claim instead of an isolated success.
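As a sketch of the evidence-coverage check, for example (a hypothetical test, not the repo's test suite; execute_stub_run is a placeholder helper):

```python
# Sketch: one eval check in the spirit of the list above; asserts that every
# evidence id cited by the artifact was actually retrieved during the run.
def test_artifact_evidence_is_covered():
    run = execute_stub_run()  # hypothetical helper: run the bundled scenario in stub mode
    retrieved = {e["id"] for e in run["evidence"]}
    cited = set(run["artifact"]["evidence_ids"])
    assert cited, "artifact must cite at least one piece of evidence"
    assert cited <= retrieved, f"artifact cites evidence that was never retrieved: {cited - retrieved}"
```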
The system supports two execution modes:
- live: reasoning steps call OpenAI using MODEL (default gpt-5.4)
- stub: the same workflow runs with deterministic local outputs, which keeps tests and screenshots reproducible without an API key
Stub mode activates automatically when OPENAI_API_KEY is not set, or explicitly with CONTROL_SURFACE_STUB=1.
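That rule is small enough to state directly; a sketch of the behavior described above, not necessarily the repo's exact implementation:

```python
import os


def resolve_mode() -> str:
    """Stub when explicitly requested or when no API key is available; live otherwise."""
    if os.getenv("CONTROL_SURFACE_STUB") == "1":
        return "stub"
    if not os.getenv("OPENAI_API_KEY"):
        return "stub"
    return "live"
```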
The operator console surfaces per-run:
- agent mode
- model name
- total tokens
- latency
- per-step model/token/latency telemetry
These are the primitives a longitudinal view would aggregate across runs.
backend/
  app.py
  schemas.py
  engine/
  fixtures/
  tests/
frontend/
  app/
  lib/
docs/
cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# optional: live model mode
export OPENAI_API_KEY=sk-...
export MODEL=gpt-5.4
uvicorn app:app --reload

cd frontend
npm install
npm run dev

If your backend is not on http://127.0.0.1:8000, set NEXT_PUBLIC_API_BASE_URL before running the frontend.
If you want deterministic local runs without calling OpenAI:
export CONTROL_SURFACE_STUB=1

This repository implements the single-run control surface: one intent, one plan, one execution, one reconciliation, one artifact, one feedback loop.
A single run can show whether an agent succeeded. A history of runs shows whether the system is actually trustworthy.
The natural extension of this work is a production view across many runs, where operators can:
- identify recurring failure modes
- compare interventions over time
- audit reconciliation patterns
- detect drift and regression
- evaluate whether the system is becoming more reliable, not just passing isolated checks
The goal is operational governance for agent systems that change, accumulate failure modes, and must remain correctable in production.
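A sketch of the kind of aggregation such a view would run over per-run summaries (the summary dict shape here is hypothetical):

```python
# Sketch: aggregate per-run summaries into the signals a longitudinal view cares
# about; the summary dict shape is hypothetical, not a schema from this repo.
from statistics import mean


def longitudinal_signals(runs: list[dict]) -> dict:
    """Turn a history of runs into drift and intervention-rate signals."""
    if not runs:
        return {}
    return {
        "runs": len(runs),
        "intervention_rate": mean(1.0 if r["operator_interventions"] > 0 else 0.0 for r in runs),
        "mean_confidence": mean(r["final_confidence"] for r in runs),
        "full_alignment_rate": mean(1.0 if r["intent_alignment"] == "full" else 0.0 for r in runs),
        "recurring_failures": sorted({f for r in runs for f in r.get("failure_modes", [])}),
    }
```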
The architecture here is the point:
- explicit planning over hidden reasoning
- telemetry over opaque autonomy
- reconciliation over one-shot outputs
- supervision over "just trust the agent"
- longitudinal visibility over single-run validation
A production agent is not reliable because it once passed an evaluation. It is reliable only if its behavior remains legible, auditable, and correctable across repeated real-world runs.
