
# Product Requirements Document
## Conversational Analytics System

> **Version:** 1.4 (Draft)  
> **Author:** Koen Rutten  
> **Date:** February 17, 2026  
> **Status:** Draft for Review  
> **Objective:** O10 - PRD / Technical Design

---

## Document Structure

This PRD is organized into two main sections:

| Section | Purpose | Audience |
|---------|---------|----------|
| **Part A: Conceptual Framework** (Sections 1-9) | Architecture, principles, and requirements | Human reviewers |
| **Part B: Implementation Details** (Appendices) | Schemas, interfaces, and agent instructions | Engineers & AI agents |

---

# PART A: CONCEPTUAL FRAMEWORK

---

# 1. Executive Summary

## 1.1 Vision Statement

Build a **context-aware, memory-driven, natural language analytics system** that enables analysts (human and agentic) to obtain **complete, accurate, and reliable** answers from enterprise data through conversational interfaces.

### Staged Approach

| Stage | Objective | Focus |
|-------|-----------|-------|
| **Stage 1** (Current) | Accurate, reliable query generation | Correct SQL from natural language; minimize silent errors |
| **Stage 2** (Future) | Augmented judgment | Help users ask better questions; surface strategic relevance |

**Stage 1** is the MVP focus. Stage 2 represents a future evolution where the system not only answers questions correctly but helps users identify which questions matter most given business objectives.

## 1.2 Stage Boundaries

### Stage 1: Read-Only Analytics (Current Focus)

| Capability | Included | Notes |
|------------|----------|-------|
| SQL Generation | ✅ | SELECT queries only |
| Query Validation | ✅ | Gotcha checks, schema validation |
| Acceptance Flow | ✅ | User confirms before closing |
| Disambiguation | ✅ | Clarifying questions when ambiguous |
| Personal Memory | ✅ | User corrections applied locally |
| Context Routing | ✅ | Match queries to entities |

**Hard Gate:** Stage 1 **rejects any non-SELECT SQL**. INSERT, UPDATE, DELETE, and DDL operations are not permitted, regardless of autonomy level or user request.

```
IF sql_type NOT IN ["SELECT", "WITH...SELECT"]:
    REJECT with message: "Write operations are not supported in the current system version."
```
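
As an illustration of the hard gate, the following is a minimal Python sketch using only the standard library. The keyword list, the helper names (`is_read_only`, `gate`), and the regex-based cleanup are assumptions made for this example; a production gate would more likely rely on a real SQL parser for the target warehouse dialect. The sketch also rejects `WITH ... DELETE` style statements, which some dialects allow, and it errs on the side of rejecting when unsure.

```python
import re

# Keywords treated as write/DDL operations in this sketch (assumed, not exhaustive).
FORBIDDEN = {"INSERT", "UPDATE", "DELETE", "MERGE", "CREATE", "ALTER",
             "DROP", "TRUNCATE", "GRANT", "REVOKE"}

REJECT_MESSAGE = "Write operations are not supported in the current system version."

# Comments and single-quoted literals, matched left to right in one pass so that
# a quote inside a comment (or a "--" inside a string) is handled correctly.
_NOISE = re.compile(r"--[^\n]*|/\*.*?\*/|'(?:[^']|'')*'", re.S)


def is_read_only(sql: str) -> bool:
    """Return True only for a single SELECT or WITH ... SELECT statement."""
    cleaned = _NOISE.sub(" ", sql).strip().rstrip(";").strip()
    if not cleaned or ";" in cleaned:  # empty input or multiple statements
        return False
    tokens = re.findall(r"[A-Za-z_]+", cleaned.upper())
    if not tokens or tokens[0] not in {"SELECT", "WITH"}:
        return False
    # Reject if any write/DDL keyword appears anywhere (e.g. WITH ... DELETE).
    return not FORBIDDEN.intersection(tokens)


def gate(sql: str) -> str:
    """Apply the hard gate: pass read-only SQL, otherwise return the rejection message."""
    return "OK" if is_read_only(sql) else REJECT_MESSAGE
```

Because the check scans for keywords rather than parsing, a quoted identifier such as `"drop"` would be rejected. That false rejection is the conservative direction for a read-only gate.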

### Stage 2: Extended Capabilities (Future)

| Capability | Status | Prerequisites |
|------------|--------|---------------|
| Write Operations | Future | Proven Stage 1 reliability, enhanced audit trail |
| Strategic Judgment | Future | Business knowledge integration, OKR alignment |
| Richer Autonomy | Future | Calibrated thresholds from Stage 1 data |
| Proactive Insights | Future | Memory patterns, usage analysis |

**Stage 2 unlocks when:**
- Stage 1 production accuracy ≥85% sustained for 3 months
- Error likelihood calibration validated
- Business knowledge integration complete

## 1.3 Core Value Proposition

| Current State | Stage 1 Target | Stage 2 Vision |
|---------------|----------------|----------------|
| AI doesn't know our data → wrong queries | **Context Layer** provides curated knowledge → accurate queries | Context includes business objectives → strategically relevant queries |
| Same mistakes repeat indefinitely | **Memory Layer** learns from corrections → continuous improvement | Memory includes user interests → personalized insights |
| No guardrails on autonomous action | **Rules Layer** defines graduated autonomy → appropriate trust | Rules adapt to user risk tolerance |
| Ad hoc query execution | **Orchestration Layer** manages request lifecycle → reliability | Orchestration prioritizes high-impact work |
| Unknown error risk | **Error Likelihood Engine** quantifies risk → informed decisions | Risk includes business impact → cost-aware autonomy |

## 1.4 Success Metrics

| Metric | Current | Stage 1 Target | Measurement |
|--------|---------|----------------|-------------|
| Context Layer Routing Accuracy | 98.2% | ≥95% maintained | Automated eval suite |
| Production Query Accuracy | Unknown | ≥85% | User feedback + correction rate |
| Time to First Answer | Variable | <30s for routine queries | System logs |
| Correction Learning Rate | 0% | ≥80% of corrections applied within 24h | Memory layer metrics |
| User Acceptance Rate | N/A | ≥90% of queries accepted without revision | User feedback |

### Disambiguation Quality Metrics

Rather than targeting a specific disambiguation *rate*, we measure disambiguation *quality*:

| Metric | Definition | Target | Why It Matters |
|--------|------------|--------|----------------|
| **Clarification Precision** | % of clarifications that were actually needed (user chose non-default option) | ≥70% | Avoids unnecessary friction |
| **Post-Clarification Acceptance** | % of queries accepted after clarification vs. rejected/revised | ≥85% | Clarification should help, not confuse |
| **Incorrect-First-Answer Rate** | % of queries where first answer was wrong and required revision | ≤10% | Measures silent errors that disambiguation should have caught |

**Interpretation:**
- Low clarification precision → system is over-asking (annoying users)
- Low post-clarification acceptance → clarification questions aren't helpful
- High incorrect-first-answer rate → system is under-asking (missing ambiguity)
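
The three metrics above can be computed from per-query outcome records. The sketch below assumes a hypothetical `QueryOutcome` record with three boolean fields; the field names and the decision to return `None` when a denominator is zero are assumptions of this example, not part of the specification.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class QueryOutcome:
    clarified: bool          # system asked a clarifying question
    chose_non_default: bool  # user picked something other than the default option
    accepted: bool           # first answer accepted without revision


def _rate(numerator: int, denominator: int) -> Optional[float]:
    """Return a ratio, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def disambiguation_metrics(outcomes: Iterable[QueryOutcome]) -> dict:
    outcomes = list(outcomes)
    clarified = [o for o in outcomes if o.clarified]
    return {
        # Share of clarifications where the user actually needed to choose.
        "clarification_precision": _rate(
            sum(o.chose_non_default for o in clarified), len(clarified)),
        # Share of clarified queries whose answer was then accepted.
        "post_clarification_acceptance": _rate(
            sum(o.accepted for o in clarified), len(clarified)),
        # Share of all queries whose first answer needed revision.
        "incorrect_first_answer_rate": _rate(
            sum(not o.accepted for o in outcomes), len(outcomes)),
    }
```

For example, with four queries of which two were clarified (one with a non-default choice, both accepted) and one unclarified query was revised, the result is a precision of 0.5, a post-clarification acceptance of 1.0, and an incorrect-first-answer rate of 0.25.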

---

# 2. System Architecture

## 2.1 High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                            ORCHESTRATION LAYER                                    │
│                                                                                   │
│   INTAKE → SCOPE → PLAN → EXECUTE → DELIVER → ACCEPT → CLOSE                    │
│                                                                                   │
│   [Error Likelihood + Expected Cost calculated at each decision point]           │
└────────────────────────────────┬────────────────────────────────────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│  CONTEXT LAYER  │   │   RULES LAYER   │   │  MEMORY LAYER   │
│                 │   │                 │   │                 │
│ Data Knowledge: │   │ • Permissions   │   │ • Corrections   │
│ • Entities      │   │ • Autonomy      │   │ • Preferences   │
│ • Metrics       │   │ • Thresholds    │   │ • Work History  │
│ • Gotchas       │   │ • Escalation    │   │ • Relationships │
│ • Lineage       │   │ • PII Rules     │   │ • Learnings     │
│                 │   │                 │   │                 │
│ Business        │   │ Risk Tolerance: │   │ User Profile:   │
│ Knowledge:      │   │ • Per-user      │   │ • Interests     │
│ • OKRs          │   │   settings      │   │ • Authority     │
│ • Strategy docs │   │ • Per-request   │   │ • Role context  │
│ • Projects      │   │   overrides     │   │                 │
│ • Domain guides │   │                 │   │                 │
└─────────────────┘   └─────────────────┘   └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                        EXECUTION & VALIDATION LAYER                               │
│                                                                                   │
│   SQL Generation → Gotcha Validation → Execute → Result Presentation             │
└────────────────────────────────┬────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                        EVALUATION & FEEDBACK LAYER                                │
│                                                                                   │
│   Query-Level Evals ← Component-Level Evals ← System-Level Evals                 │
│                              ↓                                                    │
│                    Correction Classification → Layer Updates                      │
└─────────────────────────────────────────────────────────────────────────────────┘
```

## 2.2 Component Differentiation

### Context Layer vs Memory Layer

These layers serve distinct but complementary purposes:

| Aspect | Context Layer | Memory Layer |
|--------|---------------|--------------|
| **Scope** | Shared, organizational knowledge | User-specific, session-specific |
| **Content** | Data schemas, business objectives, domain guides | Corrections, preferences, conversation history |
| **Persistence** | Long-lived, curated | Evolving, accumulated |
| **Access** | Available to all actors | Filtered by user/session |
| **Updates** | Deliberate, reviewed | Continuous, automated |

**Context Layer** answers: "What does the organization know about this data and domain?"

**Memory Layer** answers: "What does the system remember about this user and their past interactions?"

### What Makes Context Layer More Than a Semantic Layer?

Traditional semantic layers focus on **data definitions** (entities, metrics, joins). Our Context Layer expands to include:

| Traditional Semantic Layer | Extended Context Layer |
|---------------------------|------------------------|
| Entity definitions | Entity definitions |
| Metric calculations | Metric calculations |
| Join paths | Join paths |
| — | **Business objectives** (OKRs, strategy) |
| — | **Domain guides** (conventions, gotchas) |
| — | **Project context** (what matters now) |
| — | **Organizational artifacts** (qualitative studies, competitive intel) |

This extension enables **Stage 2** (augmented judgment) by connecting data questions to business priorities.

## 2.3 Data Access Model

**Decision:** The LLM can see query results to enable summarization and, in Stage 2, trend analysis. PII scrubbing is applied on the backend before results reach the LLM.

This enables richer functionality while managing data sensitivity through existing infrastructure controls.

---

# 3. Design Principles

## P1: Compound, Not Monolithic

**Principle:** The system is an orchestrated set of components, not a single agent.

**Rationale:**
- Each component can evolve independently
- Clear ownership enables accountability
- Failures are isolated and debuggable
- Teams can work in parallel

## P2: Explicit Interfaces Over Implicit Coupling

**Principle:** Components interact through defined contracts (registries), not assumptions.

**What are Registries?**

| Registry | Purpose | Contents |
|----------|---------|----------|
| **Agent Registry** | Discover and route to available agents/models | Agent capabilities, cost profiles, latency characteristics |
| **Data Registry** | Discover and understand available data | Entity definitions, schemas, owners, freshness |

**Why explicit interfaces are preferable to implicit coupling:**

| Implicit Coupling | Explicit Interfaces |
|-------------------|---------------------|
| Component A assumes Component B's behavior | Component A reads B's contract |
| Changes to B silently break A | Contract changes are versioned and communicated |
| Testing requires full system | Components tested in isolation with mock contracts |
| Ownership unclear | Contract has explicit owner |
| Debugging requires tracing through system | Interface violations caught at boundaries |

## P3: Meaning from Code

**Principle:** Context is enriched from dbt/pipeline code, not only manual documentation.

**What this means:**

The dbt codebase contains rich semantic information that can be automatically extracted:

| Source | Extractable Meaning |
|--------|---------------------|
| Model SQL | Business logic, transformations, scope |
| Model config | Materialization, freshness guarantees |
| Column descriptions | Semantic definitions |
| Tests | Data quality constraints |
| Exposures | Downstream dependencies |
| Sources | Upstream data lineage |

**How we intend to achieve automated meaning extraction:**

1. **Parse dbt manifest.json** — Extract model metadata, column info, tests
2. **Analyze model SQL** — Identify join patterns, filter conditions, aggregations
3. **Extract from YAML docs** — Pull descriptions, owners, tags
4. **Infer from usage** — Common query patterns, frequently joined entities
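
Step 1 could look roughly like the following sketch. It relies on the documented layout of dbt's `manifest.json` (a top-level `nodes` mapping whose entries carry `resource_type`, `name`, `description`, `columns`, and `depends_on.nodes`). The function names and the shape of the returned dictionary are assumptions for illustration only.

```python
import json


def load_manifest(path: str) -> dict:
    """Load a dbt manifest, typically written to target/manifest.json."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def extract_model_context(manifest: dict) -> dict:
    """Collect description, column docs, and upstream nodes for each dbt model."""
    context = {}
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and other node types
        columns = node.get("columns", {})
        context[node["name"]] = {
            "description": node.get("description", ""),
            "columns": {c: meta.get("description", "") for c, meta in columns.items()},
            "upstream": node.get("depends_on", {}).get("nodes", []),
            # Columns declared in YAML but left without a description.
            "undocumented_columns": [
                c for c, meta in columns.items() if not meta.get("description")
            ],
        }
    return context
```

Note that the manifest only lists columns that are declared in the project's YAML files, so columns that exist in the warehouse but were never declared will not appear in this output.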

**Context Refresh Latency:**

Context Layer updates are **CI/CD-driven**:
- When dbt models are merged/deployed, the manifest is re-parsed
- Context Layer is updated within minutes of deployment
- No runtime parsing overhead on each query

**Risk:** If a dbt model changes and the Context Layer hasn't refreshed, the agent may use stale schema info, causing errors. Mitigations:
- CI/CD pipeline includes Context Layer refresh as a post-deploy step
- Context Layer includes schema version/timestamp for staleness detection
- High-novelty queries trigger a freshness check

Automated meaning extraction, including the CI/CD-driven refresh described above, is a **Stage 2** enhancement. For Stage 1, we focus on optimizing manual model documentation for both agents and humans.

## P4: Production-in-the-Loop

**Principle:** Evaluations are fed by real query patterns and corrections.

**Why this matters:**

| Synthetic Evals Only | Production-in-the-Loop |
|----------------------|------------------------|
| Test what we think matters | Test what actually matters |
| Optimized for imagined scenarios | Optimized for real usage |
| Blind to distribution shift | Detects when reality changes |
| No feedback on actual accuracy | Continuous accuracy measurement |

**The production-benchmark gap is real:** The Spider 2.0 benchmark reports that state-of-the-art models which achieve ~91% on simpler benchmarks such as Spider 1.0 solve only ~21% of its enterprise-style tasks. Our evals must reflect production reality.

## P5: Graduated Autonomy

**Principle:** Risk thresholds determine when to proceed, seek confirmation, or block.

**Risk = Probability × Cost**

Autonomy decisions should consider both:
- **Error Likelihood** — Probability the query is wrong
- **Expected Cost** — Impact if the query is wrong

See Section 5 for detailed autonomy model.

## P6: Consistent Interfaces, Optimized Presentation

**Principle:** Components expose consistent interfaces; presentation is optimized per actor type.

**Clarification:** "Consistent interfaces" means the underlying data contracts are the same for every actor. However, **how information is presented** may differ:

| Aspect | Human Presentation | Agent Presentation |
|--------|-------------------|-------------------|
| Format | Natural language, visualizations | Structured YAML/JSON |
| Verbosity | Concise summaries | Complete details |
| Navigation | Interactive, drill-down | Direct access |
| Context | Highlighted relevance | Full context window |

The **transformation layer** adapts consistent underlying data to optimal presentation per actor type.

---

# 4. Context Layer

## 4.1 Purpose

Provide structured knowledge that enables accurate query generation and (in Stage 2) strategically relevant insights.

## 4.2 Content Categories

### 4.2.1 Data Knowledge (Stage 1 Focus)

| Category | Description | Current State |
|----------|-------------|---------------|
| **Entity definitions** | Tables, columns, types, distributions | ✅ 36+ entities documented |
| **Metric calculations** | Canonical formulas, gotchas | ⚠️ Partial (embedded in entities) |
| **Join paths** | How entities relate | ✅ Documented |
| **Gotchas** | Silent error patterns | ✅ 11+ documented |
| **Domain guides** | Conventions per domain | ✅ GTM, Revenue, Finance, Marketing |

### 4.2.2 Business Knowledge (Stage 2 Focus)

| Category | Description | Current State |
|----------|-------------|---------------|
| **OKRs** | Company/team objectives | ❌ Not integrated |
| **Strategy documents** | Corporate priorities | ❌ Not integrated |
| **Project descriptions** | Current initiatives | ⚠️ Exists in discussion/ folder |
| **Qualitative studies** | User research, insights | ❌ Not integrated |
| **Competitive intelligence** | Market context | ❌ Not integrated |

**Note:** Business knowledge integration is a Stage 2 objective. The mechanism for integration (direct indexing, RAG, summarization) is to be determined.

## 4.3 Domain Coverage

The current `analytics-context` repo focuses on specific domains. Coverage decisions remain open:

| Domain | Current Status | Priority |
|--------|---------------|----------|
| Growth | ✅ Covered | High |
| Finance | ✅ Covered | High |
| Revenue | ✅ Covered | High |
| Marketing | ✅ Covered | Medium |
| **Product Analytics** | ❌ Not covered | **High (MVP focus)** |
| Support | ❌ Not covered | Medium |
| Engineering | ❌ Not covered | Low |

**Decision needed:** How do we prioritize and resource domain coverage expansion?

## 4.4 Enhancement Roadmap

| Priority | Enhancement | Stage | Notes |
|----------|-------------|-------|-------|
| P1 | Add Product Analytics domain guide | 1 | MVP focus area |
| P1 | Optimize model documentation for agents + humans | 1 | Replace key_dimension_registry focus |
| P2 | Establish domain coverage strategy | 1 | Decide prioritization |
| P3 | dbt code enrichment pipeline | 2 | Automated meaning extraction |
| P3 | Business knowledge integration | 2 | OKRs, strategy docs |

---

# 5. Rules Layer

## 5.1 Purpose

Define what actions are permitted, by whom, under what conditions, and with what level of autonomy.

## 5.2 Autonomy Model

### 5.2.1 Risk-Based Autonomy

Autonomy level is determined by:

```
Risk Score = f(Error Likelihood, Expected Cost)
```

Where:
- **Error Likelihood** = Probability the output is incorrect (see Section 6)
- **Expected Cost** = Estimated impact if the output is wrong

### 5.2.2 Expected Cost Factors

| Factor | Description | How to Assess |
|--------|-------------|---------------|
| **Decision criticality** | How important is this decision? | User indicates; infer from context |
| **Reversibility** | Can mistakes be undone? | Query type (read vs write) |
| **Audience** | Who sees the results? | User indicates; infer from request |
| **Time sensitivity** | Is there urgency? | User indicates |

**Capturing Expected Cost:**

Since we lack empirical data on error costs, we propose:

1. **Ask users** to indicate criticality per request (optional field)
2. **Infer from context** (e.g., "for the board meeting" = high stakes)
3. **Learn from feedback** — track which errors users flag as serious

### 5.2.3 Autonomy Levels

| Level | Name | Behavior | When Used |
|-------|------|----------|-----------|
| **5** | Full Autonomy | Execute and close | Low risk, routine query |
| **4** | Execute & Notify | Execute, inform stakeholders | Low-medium risk, notable query |
| **3** | Execute & Confirm | Execute, await user confirmation before considering complete | Medium risk, user accepts or revises |
| **2** | Recommend Only | Generate recommendation, user executes | High risk or user preference |
| **1** | Human Only | Flag for human handling | Very high risk or policy requirement |

**Clarification: Level 3 vs Level 4**

- **Level 4 (Notify):** System executes and sends notification. No response required. Useful for audit trail.
- **Level 3 (Confirm):** System executes but the request remains "pending acceptance" until user confirms. User must explicitly accept or revise.

### 5.2.4 User Risk Tolerance

Users can set preferences that influence autonomy decisions:

| Preference | Effect |
|------------|--------|
| "I prefer to review all queries" | Never use Level 5 |
| "I trust routine queries" | Use Level 5 for low complexity |
| "Always confirm financial data" | Cap at Level 3 (Confirm) or lower for finance domain |

**Per-request overrides:** Users can always specify "please confirm before executing" or "just do it" on individual requests.

### 5.2.5 Mandatory Validation Threshold

When risk exceeds a threshold (to be calibrated), manual validation is **always required** regardless of user preferences:

```
IF (error_likelihood × expected_cost) > MANDATORY_REVIEW_THRESHOLD:
    autonomy_level = min(autonomy_level, 3)  # At most "Execute & Confirm"
```
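
A small worked example may help. The sketch below assumes that both error likelihood and expected cost are expressed on a 0 to 1 scale and uses a placeholder threshold of 0.4; the document leaves the real threshold to be calibrated, so both choices are assumptions.

```python
# Placeholder value; the actual threshold is to be calibrated.
MANDATORY_REVIEW_THRESHOLD = 0.4


def apply_mandatory_review(autonomy_level: int,
                           error_likelihood: float,
                           expected_cost: float) -> int:
    """Cap autonomy at Level 3 (Execute & Confirm) when risk exceeds the threshold."""
    if error_likelihood * expected_cost > MANDATORY_REVIEW_THRESHOLD:
        return min(autonomy_level, 3)
    return autonomy_level
```

With an error likelihood of 0.6 and an expected cost of 0.8, the risk is 0.48, so a request that would otherwise run at Level 5 is capped at Level 3. A request already at Level 2 stays at Level 2, because the rule only ever reduces autonomy.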

### 5.2.6 Stage 1 Operational Safeguards

Stage 1 operates with minimal but essential safeguards:

| Safeguard | Implementation | Rationale |
|-----------|----------------|-----------|
| **Query Type Allowlist** | Only SELECT queries permitted | Stage 1 is read-only by design |
| **Schema Allowlist** | User RBAC determines accessible schemas | Prevent unauthorized data access |
| **Audit Logging** | All queries logged with context | Traceability and debugging |

**Hard Gate: Non-SELECT Rejection**

```
IF sql_type NOT IN ["SELECT", "WITH...SELECT"]:
    REJECT immediately
    Response: "Write operations are not supported. Stage 1 is read-only."
```

This is not a "high-cost trigger" — it's a hard boundary. Write operations are a Stage 2 capability.

### 5.2.7 Policy-Based Triggers (Stage 1)

Within read-only operations, certain patterns still warrant elevated review:

| Trigger | Condition | Maximum Autonomy Level | Rationale |
|---------|-----------|---------------|-----------|
| **Resource Intensity** | EXPLAIN plan estimates >100GB scan or >10min runtime | Level 3 (Confirm) | Warehouse protection |
| **Cross-Domain Joins** | Query joins entities from 3+ domains | Level 4 (Notify) | Complex, error-prone |
| **Sensitive Tables** | Query touches tables tagged `sensitive` | Level 4 (Notify) | Awareness for sensitive data |

These triggers are defined in `rules/autonomy.yaml` and apply within Stage 1's read-only scope.
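
The triggers in the table can be sketched as a function that only ever lowers autonomy. This example reads the level column as the most autonomous level a triggered query may run at, since higher numbers mean more autonomy. The `QueryProfile` record and its field names are hypothetical, and the contents of `rules/autonomy.yaml` are not shown in this document, so the thresholds are hard-coded here.

```python
from dataclasses import dataclass, field


@dataclass
class QueryProfile:
    est_scan_gb: float               # from the EXPLAIN plan
    est_runtime_min: float           # from the EXPLAIN plan
    domains: set = field(default_factory=set)  # domains of the joined entities
    touches_sensitive: bool = False  # any table tagged `sensitive`


def cap_for_policy(level: int, q: QueryProfile) -> int:
    """Apply the Stage 1 policy triggers; each trigger can only reduce autonomy."""
    if q.est_scan_gb > 100 or q.est_runtime_min > 10:
        level = min(level, 3)   # Resource Intensity -> Confirm
    if len(q.domains) >= 3:
        level = min(level, 4)   # Cross-Domain Joins -> Notify
    if q.touches_sensitive:
        level = min(level, 4)   # Sensitive Tables -> Notify
    return level
```

For instance, a Level 5 query estimated to scan 150 GB would be lowered to Level 3, while a Level 2 query touching a sensitive table would stay at Level 2.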

---

# 6. Error Likelihood Engine

## 6.1 Purpose

Calculate the probability of error to inform autonomy decisions and trigger disambiguation.

## 6.2 Formula

```
Error Likelihood = (
    w_complexity × complexity_score +
    w_source     × source_risk_score +
    w_novelty    × novelty_score +
    w_ambiguity  × ambiguity_score
)
```

### Bootstrap Weights (Equal)

```
w_complexity = 0.25
w_source     = 0.25
w_novelty    = 0.25
w_ambiguity  = 0.25
```

**This is an explicit bootstrap policy.** Equal weights are used because we lack production data to inform better weights.
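
As a worked example of the formula with the bootstrap weights, here is a minimal sketch. The dictionary keys and the input validation are choices made for this example; the document only fixes the four factors and their weights.

```python
BOOTSTRAP_WEIGHTS = {
    "complexity": 0.25,
    "source": 0.25,
    "novelty": 0.25,
    "ambiguity": 0.25,
}


def error_likelihood(scores: dict, weights: dict = BOOTSTRAP_WEIGHTS) -> float:
    """Weighted sum of the four factor scores, each expected in [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same factors")
    if any(not 0.0 <= s <= 1.0 for s in scores.values()):
        raise ValueError("factor scores must be in [0, 1]")
    return sum(weights[k] * scores[k] for k in weights)
```

With scores of 0.6 (complexity), 0.2 (source), 0.8 (novelty), and 0.4 (ambiguity), the equal weights give an error likelihood of approximately 0.5.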

### Weight Calibration Plan

| Phase | Timing | Action |
|-------|--------|--------|
| **Bootstrap** | Launch | Equal weights (0.25 each) |
| **Observation** | Months 1-2 | Collect correction/failure data tagged by factor |
| **First Calibration** | Month 3 | Analyze which factors correlate with actual errors |
| **Recalibration** | Monthly thereafter | Update weights based on trailing 30-day error patterns |

**Calibration method:**
```
For each failed query (user rejected or corrected):
  1. Record the factor scores at time of query
  2. Identify which factors were elevated vs. baseline
  3. Weight factors that were elevated in failures more heavily
  
New weights = normalize(
  baseline_weight + learning_rate × (factor_failure_correlation)
)
```

**Constraint:** No single weight can exceed 0.5 or fall below 0.1 to prevent over-fitting to noise.
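
One possible way to combine the update rule with the 0.1 to 0.5 constraint is sketched below. The document does not specify how clipping and normalization interact, so the approach here (pin any out-of-range weight at its bound, then rescale the remaining weights to fill what is left) is an assumption, as are the function name and the default learning rate.

```python
def calibrate_weights(baseline: dict, failure_correlation: dict,
                      learning_rate: float = 0.1,
                      lo: float = 0.1, hi: float = 0.5) -> dict:
    """Update weights from failure correlations, keeping each in [lo, hi] and the sum at 1."""
    # Raw update; a tiny floor keeps every weight positive before normalizing.
    raw = {k: max(baseline[k] + learning_rate * failure_correlation[k], 1e-12)
           for k in baseline}
    total = sum(raw.values())
    weights = {k: v / total for k, v in raw.items()}

    fixed = {}  # weights pinned at a bound
    for _ in range(len(weights)):
        free = {k: v for k, v in weights.items() if k not in fixed}
        if not free:
            break
        # Rescale the unpinned weights so that everything sums to 1.
        scale = (1.0 - sum(fixed.values())) / sum(free.values())
        scaled = {k: v * scale for k, v in free.items()}
        violators = {k: (lo if v < lo else hi)
                     for k, v in scaled.items() if v < lo or v > hi}
        if not violators:
            return {**fixed, **scaled}
        fixed.update(violators)
    return fixed
```

For example, if only ambiguity correlates strongly with failures, its unconstrained weight might reach 0.625. The sketch pins it at 0.5 and spreads the remaining 0.5 equally over the other three factors.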

## 6.3 Factor Definitions and Computation

| Factor | What It Measures | High Score Means |
|--------|------------------|------------------|
| **Complexity** | Query structural complexity (joins, CTEs, window functions) | More ways to make mistakes |
| **Source Risk** | Data source reliability (documentation, test coverage, freshness) | Less trustworthy data |
| **Novelty** | Dissimilarity to past successful queries | Uncharted territory |
| **Ambiguity** | Multiple interpretations possible (column names, metrics, entities) | User intent unclear |

### 6.3.1 Complexity Score (Deterministic)

Computed via **AST (Abstract Syntax Tree) parsing** of the generated SQL:

```
complexity = normalize(
    num_joins × 0.25 +
    num_ctes × 0.20 +
    has_window_functions × 0.20 +
    num_subqueries × 0.20 +
    num_aggregations × 0.15
)
```

This is deterministic and computed after SQL generation.
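
The `normalize` step is not defined above. One interpretation, shown here as a sketch, is to scale each count to [0, 1] with a saturation cap before applying the weights, so that the weighted sum itself stays in [0, 1] (the weights add up to 1.0). The cap of 5 and the function signature are assumptions; the counts would come from the SQL parser, which is not shown.

```python
def complexity_score(num_joins: int, num_ctes: int, has_window_functions: bool,
                     num_subqueries: int, num_aggregations: int,
                     cap: int = 5) -> float:
    """Weighted structural complexity in [0, 1], with counts saturating at `cap`."""
    def saturate(n: int) -> float:
        return min(n, cap) / cap

    return (
        0.25 * saturate(num_joins)
        + 0.20 * saturate(num_ctes)
        + 0.20 * (1.0 if has_window_functions else 0.0)
        + 0.20 * saturate(num_subqueries)
        + 0.15 * saturate(num_aggregations)
    )
```

Under these assumptions, a query with two joins, one CTE, no window functions, no subqueries, and three aggregations scores about 0.23.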

### 6.3.2 Source Risk Score (Deterministic)

Computed from metadata in the Context Layer:

```
source_risk = 1 - average([
    entity.documentation_completeness,  # % of columns documented
    entity.test_coverage,               # % of columns with tests
    entity.freshness_score,             # How recently validated
    entity.owner_responsiveness         # SLA for owner response
])
```
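
The formula is defined per entity, but a query usually touches several. The sketch below computes the per-entity score as written and, as an assumption not stated in the document, takes the riskiest entity as the query-level score. The `EntityMeta` record is hypothetical, and all four inputs are assumed to be pre-scaled to [0, 1].

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EntityMeta:
    documentation_completeness: float  # share of columns documented
    test_coverage: float               # share of columns with tests
    freshness_score: float             # how recently validated
    owner_responsiveness: float        # owner response SLA, scaled to [0, 1]


def entity_source_risk(e: EntityMeta) -> float:
    """1 minus the average of the four quality signals."""
    return 1 - mean([e.documentation_completeness, e.test_coverage,
                     e.freshness_score, e.owner_responsiveness])


def query_source_risk(entities: list) -> float:
    """Assumed aggregation: a query is as risky as its least trustworthy entity."""
    return max(entity_source_risk(e) for e in entities)
```

Taking the maximum is a conservative choice; averaging across entities would be an equally plausible reading and would produce lower scores for queries that mix well-documented and poorly documented tables.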

### 6.3.3 Novelty Score (Semantic Similarity)

Computed via **semantic similarity to the Golden Query Set** (verified past queries):

```
novelty = 1 - max_similarity(
    query_embedding,
    golden_query_embeddings  # Verified correct queries
)
```

If the query is very different from any verified past query, novelty is high.
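
A minimal sketch of this computation follows, assuming cosine similarity as the similarity measure and plain Python lists as embeddings. The document does not name the embedding model or the similarity function, so both are placeholders, and treating an empty Golden Query Set as maximum novelty is also an assumption.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_score(query_embedding: list, golden_query_embeddings: list) -> float:
    """1 minus the best similarity to any verified query, clamped to [0, 1]."""
    if not golden_query_embeddings:
        return 1.0  # nothing verified yet: treat every query as novel
    best = max(cosine_similarity(query_embedding, g) for g in golden_query_embeddings)
    return 1 - max(0.0, min(1.0, best))
```

At realistic Golden Query Set sizes, the linear scan would likely be replaced by a vector index, but the score definition stays the same.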

### 6.3.4 Ambiguity Score (Retrieval Conflict + Optional LLM Confidence)

Ambiguity is the hardest factor to compute deterministically. We use a primary signal with an optional enhancement.

**Primary Signal: Retrieval Conflict + Schema Collision**

These heuristics are always available:

```
# Retrieval conflict: similar scores but conflicting definitions
retrieval_spread = std_dev(top_n_relevance_scores)
retrieval_conflict = detect_conflicting_definitions(top_n_chunks)
ambiguity_retrieval = retrieval_conflict × (1 - retrieval_spread)

# Schema collision: multiple columns/tables match the same term
schema_matches = count_matching_columns(query_terms)
ambiguity_schema = normalize(schema_matches - 1)  # 0 if unique match

# Combined heuristic
ambiguity_heuristic = 0.6 × ambiguity_retrieval + 0.4 × ambiguity_schema
```

**Optional Enhancement: LLM Log Probabilities**

If the LLM API provides log probabilities (not all do):

```
# Log probabilities are ≤ 0, so convert them to probabilities in [0, 1] first
ambiguity_llm = 1 - average(exp(log_probs[key_entity_tokens]))
```

**Combined (when logprobs available):**
```
ambiguity = 0.5 × ambiguity_heuristic + 0.5 × ambiguity_llm
```

**Fallback (when logprobs unavailable):**
```
ambiguity = ambiguity_heuristic
```
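
The combined and fallback cases can be expressed as one function. In this sketch the two heuristic inputs are assumed to be already scaled to [0, 1], and token log probabilities are converted to probabilities with `exp` so that the LLM term also stays in [0, 1]. That conversion and the function signature are assumptions of this example.

```python
import math
from typing import Optional, Sequence


def ambiguity_score(ambiguity_retrieval: float,
                    ambiguity_schema: float,
                    entity_token_logprobs: Optional[Sequence[float]] = None) -> float:
    """Heuristic ambiguity, blended 50/50 with an LLM confidence term when available."""
    heuristic = 0.6 * ambiguity_retrieval + 0.4 * ambiguity_schema
    if not entity_token_logprobs:
        return heuristic  # fallback: provider exposes no logprobs
    # exp(logprob) is the token probability; low probability means high ambiguity.
    mean_prob = sum(math.exp(lp) for lp in entity_token_logprobs) / len(entity_token_logprobs)
    ambiguity_llm = 1 - mean_prob
    return 0.5 * heuristic + 0.5 * ambiguity_llm
```

With a retrieval ambiguity of 0.5 and a schema ambiguity of 0.25, the heuristic alone gives 0.4. If the model assigned probability 0.5 to each key entity token, the blended score would be 0.45.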

| LLM Provider | Logprobs Available | Ambiguity Calculation |
|--------------|--------------------|-----------------------|
| OpenAI (GPT-4) | ✅ Yes | Hybrid (heuristic + logprobs) |
| Anthropic (Claude) | ❌ No | Heuristic only |
| Azure OpenAI | ✅ Yes | Hybrid |
| Local models | Varies | Check per deployment |

The heuristic-only approach is still effective — it catches the most common ambiguity patterns (multiple matching columns, conflicting definitions). Logprobs provide refinement, not a fundamental capability.

## 6.4 Thresholds and Actions

| Error Likelihood | Expected Cost: Low | Expected Cost: High |
|------------------|-------------------|---------------------|
| < 0.3 | Level 5 (Full) | Level 4 (Notify) |
| 0.3 to <0.5 | Level 4 (Notify) | Level 3 (Confirm) |
| 0.5 to <0.7 | Level 3 (Confirm) | Level 2 (Recommend) |
| ≥ 0.7 | Level 2 (Recommend) | Level 1 (Human) |
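
The table can be read as a lookup. The sketch below assumes each band includes its lower bound (so 0.3 falls in the second row and 0.7 in the last), which matches the `< 0.3` and `≥ 0.7` rows, and it assumes expected cost has already been bucketed into `"low"` or `"high"`. How that bucketing is done is not specified here.

```python
import bisect

_BAND_EDGES = [0.3, 0.5, 0.7]
_LEVELS = {
    "low":  [5, 4, 3, 2],   # Expected Cost: Low
    "high": [4, 3, 2, 1],   # Expected Cost: High
}


def autonomy_level(error_likelihood: float, expected_cost: str) -> int:
    """Map an error likelihood and a cost bucket to an autonomy level (1 to 5)."""
    if expected_cost not in _LEVELS:
        raise ValueError("expected_cost must be 'low' or 'high'")
    # bisect_right places a value equal to an edge into the higher band.
    band = bisect.bisect_right(_BAND_EDGES, error_likelihood)
    return _LEVELS[expected_cost][band]
```

For example, an error likelihood of 0.5 with high expected cost maps to Level 2 (Recommend), while 0.2 with low expected cost maps to Level 5 (Full).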

---

# 7. Evaluation Framework

## 7.1 Evaluation Hierarchy

Evaluations operate at three levels:

```
┌─────────────────────────────────────────────────────────┐
│                 SYSTEM-LEVEL EVALS                       │
│   Does the overall system meet user needs?               │
│   Metrics: User satisfaction, task completion rate       │
└────────────────────────────┬────────────────────────────┘
                             │
┌────────────────────────────┴────────────────────────────┐
│               COMPONENT-LEVEL EVALS                      │
│   Does each component perform its function?              │
│   Metrics: Routing accuracy, autonomy calibration        │
└────────────────────────────┬────────────────────────────┘
                             │
┌────────────────────────────┴────────────────────────────┐
│                 QUERY-LEVEL EVALS                        │
│   Is this specific query correct?                        │
│   Metrics: SQL correctness, result accuracy              │
└─────────────────────────────────────────────────────────┘
```

## 7.2 Current State

| Level | Eval Type | Status |
|-------|-----------|--------|
| Component | Context layer routing | ✅ 178 questions, 98.2% accuracy |
| Query | SQL correctness | ❌ No ground truth dataset |
| Component | Memory effectiveness | ❌ Not implemented |
| Component | Autonomy calibration | ❌ Not implemented |
| System | End-to-end accuracy | ❌ No production measurement |
| System | User satisfaction | ❌ No feedback collection |

## 7.3 Eval Development Priority

### Stage 1: Foundation

1. **Production query logging** — Capture real queries to understand distribution
2. **User feedback collection** — Thumbs up/down, corrections
3. **Ground truth dataset** — Curated set of queries with verified correct SQL

### Stage 2: Calibration

4. **Error likelihood validation** — Do high-likelihood queries actually fail more?
5. **Autonomy calibration** — Are thresholds set correctly?
6. **Component isolation testing** — Each layer tested independently

### Informing Golden Evals with Error Likelihood

The ground truth dataset should be **informed by error likelihood factors**:

| Factor | Implication for Eval Coverage |
|--------|------------------------------|
| High complexity | Include multi-join, CTE, window function queries |
| Source risk | Include queries on poorly-documented tables |
| Novelty | Include unusual query patterns |
| Ambiguity | Include queries with ambiguous column names |

This ensures evals cover the **riskiest** scenarios, not just easy cases.

---

# 8. Orchestration Layer

## 8.1 Purpose

Manage the full request lifecycle from intake to closure, including user acceptance.

## 8.2 Request Lifecycle

```
INTAKE → SCOPE → PLAN → EXECUTE → DELIVER → ACCEPT → CLOSE
                  │                            ↑
                  │                   User confirms or
                  │                   requests revision
                  ▼
           [EXPLAIN Plan Check]
           If resource-intensive → Halt, request confirmation
```

### Stage Gates

| Stage | Entry | Exit | Key Decision |
|-------|-------|------|--------------|
| **INTAKE** | Request received | Request parsed | Is request valid? |
| **SCOPE** | Parsed request | Entities identified | Error likelihood calculated |
| **PLAN** | SQL generated | Plan validated | Is execution safe? (resource check) |
| **EXECUTE** | Plan approved | SQL executed | Autonomy level applied |
| **DELIVER** | Results available | Results presented | Presentation complete |
| **ACCEPT** | Results presented | User accepts OR revises | **User acceptance required** |
| **CLOSE** | Acceptance received | Request closed | Feedback logged |

### PLAN Stage: Resource Safety Check

Before execution, the system runs an `EXPLAIN` (dry run) on the generated SQL:

```sql
EXPLAIN <generated_sql>
```

If the estimated resource cost exceeds thresholds:

| Metric | Threshold | Action |
|--------|-----------|--------|
| Estimated scan size | >100GB | Halt, request confirmation |
| Estimated runtime | >10 minutes | Halt, request confirmation |
| Estimated row count | >100M rows | Warn, proceed with caution |

This prevents the AI from accidentally overwhelming the data warehouse, even if the semantic risk was calculated as low.
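
The threshold table translates into a small decision function. The sketch assumes the three estimates have already been extracted from the warehouse's `EXPLAIN` output into a hypothetical `PlanEstimate` record; the format of that output differs by warehouse and the parsing is not shown.

```python
from dataclasses import dataclass


@dataclass
class PlanEstimate:
    scan_gb: float
    runtime_minutes: float
    rows: int


def resource_check(est: PlanEstimate) -> tuple:
    """Return (action, reasons) where action is 'halt', 'warn', or 'proceed'."""
    halts = []
    if est.scan_gb > 100:
        halts.append("estimated scan size exceeds 100 GB")
    if est.runtime_minutes > 10:
        halts.append("estimated runtime exceeds 10 minutes")
    if halts:
        return "halt", halts  # request user confirmation before executing
    if est.rows > 100_000_000:
        return "warn", ["estimated row count exceeds 100M rows"]
    return "proceed", []
```

A halt takes precedence over a warning, so a plan that both scans 250 GB and returns 200M rows is reported as `halt` with the scan-size reason.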

### User Acceptance

**User acceptance is always required before closing a request.**

| Acceptance Type | Mechanism | When Used |
|-----------------|-----------|-----------|
| **Explicit** | User clicks "Accept" or "Looks good" | Level 3 and below |
| **Implicit** | User takes action on results (downloads, shares) | Level 4-5 |
| **Timeout** | No response within window → auto-accept with flag | Level 4-5 |

For Level 1-3, explicit acceptance is required. For Level 4-5, implicit acceptance (user engagement with results) or timeout is acceptable.

### Timeout Handling

Different autonomy levels have different timeout policies:

| Level | Timeout Policy | On Timeout |
|-------|----------------|------------|
| **Level 5** | 24 hours | Auto-accept, close request |
| **Level 4** | 48 hours | Auto-accept, close request |
| **Level 3** | 72 hours | **Expire request, notify user** |
| **Level 2** | 7 days | Expire recommendation, notify user |
| **Level 1** | No timeout | Remains open until human handles |

**Level 3 Timeout Behavior (Execute & Confirm):**

When a Level 3 request times out without user confirmation:

1. Request is marked as **"Abandoned"** (not auto-accepted)
2. User receives notification: *"Your query from [date] timed out without confirmation. Results were not finalized. Please resubmit if still needed."*
3. Any actions taken are flagged for potential rollback (if applicable)
4. Query is logged with `outcome: abandoned` for eval purposes

**Rationale:** Level 3 requests are medium-risk. Auto-accepting them on timeout would defeat the purpose of requiring confirmation. Expiring with notification ensures users are aware without creating orphan requests.
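The timeout policy can be expressed as a lookup keyed by autonomy level. The sketch below is illustrative only: `abandoned` matches the `outcome: abandoned` label used above, while `auto_accepted`, `expired`, and the function name `timeout_outcome` are assumed labels chosen for this example.

```python
from typing import Optional

# Timeout window (hours) and outcome per autonomy level, from the table above.
# Level 1 has no timeout, so it is absent from the mapping.
TIMEOUT_POLICY = {
    5: (24, "auto_accepted"),
    4: (48, "auto_accepted"),
    3: (72, "abandoned"),      # never auto-accepted; user is notified
    2: (7 * 24, "expired"),
}


def timeout_outcome(level: int, hours_open: float) -> Optional[str]:
    """Return the timeout outcome for a request, or None if it stays open."""
    if level not in (1, 2, 3, 4, 5):
        raise ValueError(f"Unknown autonomy level: {level}")
    policy = TIMEOUT_POLICY.get(level)
    if policy is None:         # Level 1: remains open until a human handles it
        return None
    window_hours, outcome = policy
    return outcome if hours_open >= window_hours else None
```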

---

# 9. Memory Layer

## 9.1 Purpose

Persist user-specific learnings to enable personalization and continuous improvement.

## 9.2 Memory Categories

| Category | Contents | Retention |
|----------|----------|-----------|
| **Corrections** | What the user fixed | Permanent (→ may promote to Context) |
| **Preferences** | How user likes things presented | Until changed |
| **Work History** | Past queries and outcomes | Rolling window (e.g., 90 days) |
| **Relationships** | User's team, stakeholders, interests | Updated periodically |

## 9.3 Correction Classification

When a user corrects system output:

| Correction Type | Example | Route To |
|-----------------|---------|----------|
| **Factual** | "This column is in schema X" | Context Layer (after review) |
| **Gotcha** | "Always exclude test accounts" | Context Layer (after review) |
| **Permission** | "I shouldn't see this PII" | Rules Layer |
| **Preference** | "I prefer weekly aggregations" | Memory Layer |
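As a rough sketch, the routing table reduces to a dictionary lookup. The type and layer names follow the `correction_type` and `target_layer` enums in Appendix B.2; the function name `route_correction` is an assumption made for illustration.

```python
# Correction type -> target layer, mirroring the routing table above.
CORRECTION_ROUTES = {
    "factual": "context",     # promoted only after review
    "gotcha": "context",      # promoted only after review
    "permission": "rules",
    "preference": "memory",
}


def route_correction(correction_type: str) -> str:
    """Return the layer a correction of the given type is routed to."""
    try:
        return CORRECTION_ROUTES[correction_type]
    except KeyError:
        raise ValueError(f"Unknown correction type: {correction_type!r}") from None
```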

## 9.4 Context Poisoning Guardrails

The promotion path from Memory (user corrections) to Context (organizational knowledge) creates a potential vector for "context poisoning" — where incorrect or malicious corrections could degrade system accuracy for everyone.

### Four-Stage Correction Promotion

```
┌─────────────┐    ┌─────────────┐    ┌─────────────────┐    ┌─────────────┐
│   CAPTURE   │ →  │  LOCALIZE   │ →  │  CANDIDATE GEN  │ →  │  AIR GAP    │
│             │    │             │    │                 │    │             │
│ User makes  │    │ Applied to  │    │ Consensus       │    │ Human       │
│ correction  │    │ Personal    │    │ detection       │    │ Stewardship │
│             │    │ Memory ONLY │    │ triggers flag   │    │ required    │
└─────────────┘    └─────────────┘    └─────────────────┘    └─────────────┘
```

### Stage 1: Capture

When a user corrects the system:
- Correction is recorded with full context (original query, original answer, correction, user role)
- No immediate action beyond acknowledgment

### Stage 2: Localize

The correction is applied **ONLY** to that specific user's Personal Memory:
- The correcting user benefits immediately
- Other users are **never affected automatically**
- This is a strict isolation boundary

### Stage 3: Aggregated Candidate Generation

The system monitors for **Consensus Signals** across users:

```yaml
consensus_rule:
  trigger: "semantic_similarity"
  conditions:
    - distinct_users: ">3"
    - user_role_filter: ["Senior", "Lead", "Principal", "Manager"]
    - correction_domain: "same"  # e.g., all corrections about "Churn"
    - semantic_similarity: ">0.85"  # Corrections mean the same thing
  action: "flag_as_context_candidate"
```

**Example trigger:** "If > 3 distinct users with 'Senior' roles make the same semantic correction regarding 'Churn', flag this as a Context Candidate."

**Why role filtering?** The filter assumes that senior roles have deeper domain expertise, so their corrections carry more signal than corrections from newer users who may still be learning the data.
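A minimal sketch of the consensus check follows, using cosine similarity over correction embeddings. Several details are assumptions rather than specified behavior: roles are matched by keyword containment (for example, "Senior Analyst" matches "Senior"), the triggering correction counts toward the distinct-user total, and the names `Correction`, `cosine`, and `is_context_candidate` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

SENIOR_ROLES = ("Senior", "Lead", "Principal", "Manager")
DISTINCT_USER_THRESHOLD = 3     # trigger requires MORE than this many users
SIMILARITY_THRESHOLD = 0.85


@dataclass
class Correction:
    actor_id: str
    actor_role: str
    correction_domain: str
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_senior(role: str) -> bool:
    # Assumption: a role qualifies if it contains one of the filter keywords.
    return any(keyword in role for keyword in SENIOR_ROLES)


def is_context_candidate(new: Correction, history: List[Correction]) -> bool:
    """True if enough distinct senior users made a semantically similar correction."""
    users = set()
    for c in [new, *history]:
        if (is_senior(c.actor_role)
                and c.correction_domain == new.correction_domain
                and cosine(c.embedding, new.embedding) > SIMILARITY_THRESHOLD):
            users.add(c.actor_id)
    return len(users) > DISTINCT_USER_THRESHOLD
```

Counting distinct `actor_id` values, rather than raw corrections, keeps a single user from reaching the threshold by repeating the same correction.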

### Stage 4: The "Air Gap" (Human Stewardship)

Context Candidates **never auto-promote**. There is always a human in the loop.

| Step | Action | Owner |
|------|--------|-------|
| **Queue** | Context Candidate appears in review queue (JIRA ticket or dashboard) | System |
| **Review** | Data Steward reviews the candidate correction | Analytics Engineer |
| **Verify** | Steward validates against Data Dictionary / dbt manifest | Analytics Engineer |
| **Promote** | If valid, Steward submits PR against the governed context artifacts (`context/` in the platform repo, or `analytics-context` if split per Section 11.6) | Analytics Engineer |
| **Merge** | PR reviewed and merged following standard process | Team |

```
Context Candidate → JIRA Ticket → Steward Review → Validate vs dbt → PR → Merge → Context Layer
                                       │
                                       └── REJECT if invalid (with reason logged)
```
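The promotion path can be modeled as a small state machine over the `stage` values defined in Appendix B.2. The sketch below is one possible encoding; the `advance` function and its `reviewed_by` parameter are hypothetical, and the only rule it enforces beyond ordering is that promotion or rejection requires a named steward.

```python
from typing import Optional

# Allowed stage transitions for a correction (stage names from Appendix B.2).
ALLOWED_TRANSITIONS = {
    "captured": {"localized"},
    "localized": {"candidate"},
    "candidate": {"promoted", "rejected"},
    "promoted": set(),
    "rejected": set(),
}


def advance(stage: str, next_stage: str, reviewed_by: Optional[str] = None) -> str:
    """Validate a stage transition and return the new stage."""
    if next_stage not in ALLOWED_TRANSITIONS.get(stage, set()):
        raise ValueError(f"Illegal transition: {stage} -> {next_stage}")
    # Air gap: promotion and rejection always require a human steward.
    if next_stage in {"promoted", "rejected"} and not reviewed_by:
        raise PermissionError("Steward review is required for this transition")
    return next_stage
```

Because no transition leads directly from `localized` to `promoted`, a correction cannot skip candidate review even if a caller supplies a reviewer.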

### Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `distinct_user_threshold` | 3 | Number of distinct users that must be exceeded to trigger a candidate (`>3`) |
| `role_filter` | `["Senior", "Lead", "Principal", "Manager"]` | Roles whose corrections trigger consensus |
| `semantic_similarity_threshold` | 0.85 | How similar corrections must be to count as "same" |
| `approval_required` | Always (non-configurable) | Human review is mandatory |
| `promotion_mechanism` | Pull Request | How corrections enter Context Layer |

### Why This Architecture?

| Risk | Mitigation |
|------|------------|
| Single malicious user poisons context | Localization + consensus threshold |
| Multiple confused users create false consensus | Role filtering (senior roles only) |
| Valid corrections blocked by bureaucracy | Automated candidate detection reduces manual triage |
| Steward rubber-stamps without checking | Validation against dbt manifest required |
| Steward unavailable creates bottleneck | Queue visible to multiple stewards |

**Key principle:** A single user — even a malicious one — cannot define `revenue = cost` for the entire organization. The air gap ensures all organizational knowledge changes are deliberate and verified.

---

# 10. Governance & Ownership

## 10.1 Component Ownership

| Component | Owner | Accountability |
|-----------|-------|----------------|
| Context Layer | Data Platform | Accuracy of entity definitions |
| Rules Layer | Platform Lead | Permission correctness, autonomy calibration |
| Memory Layer | Platform Lead | Learning effectiveness |
| Orchestration | Platform Lead | Request handling reliability |
| Error Likelihood | Data Science | Factor calibration |
| Evaluation | Analytics Lead | Eval coverage and quality |

## 10.2 Change Management

Interface changes require:
1. Proposal with rationale
2. Review with dependent component owners
3. Versioning
4. Migration timeline for deprecated interfaces

---

# 11. Physical Architecture & Repository Strategy

## 11.1 Why This Section Exists

Sections 2-10 define **logical architecture** (Context, Rules, Memory, Orchestration, Evals).  
This section defines **physical architecture**: where components live, how they are deployed, and how they are versioned.

## 11.2 Repository Strategy (Recommended)

### Decision: Dedicated Platform Repository

Use a dedicated repository for the conversational analytics platform:

- **Recommended repo name:** `conversational-analytics-platform`
- **Ownership:** Data Platform (primary), Analytics Engineering + Data Science (reviewers)
- **Release cadence:** Independent from dbt model release cadence

### Relationship to Existing Repositories

| Repository | Role | Source of Truth |
|------------|------|-----------------|
| `dbt` | Data transformations, tests, contracts, metadata artifacts | SQL logic and model contracts |
| `looker` | BI semantic/explore layer | BI serving and field presentation |
| `conversational-analytics-platform` | Orchestration, context retrieval, rules, memory, evals | Conversational runtime behavior |

**Rationale:** Keep runtime services and eval pipelines decoupled from analytics transformation code. This enables cleaner ownership, faster iteration, and clearer incident boundaries.

## 11.3 Canonical Repository Structure

```text
conversational-analytics-platform/
  README.md
  docs/
    architecture/
      system-overview.md
      deployment-topology.md
      interface-contracts.md
    adr/
  services/
    api-gateway/
    orchestrator/
    context-service/
    rules-service/
    memory-service/
    evaluation-service/
    execution-gateway/
  packages/
    contracts/
    risk-engine/
    telemetry/
    authz/
  context/
    domains/
      product/
      growth/
      finance/
      revenue/
      marketing/
    entities/
    metrics/
    gotchas/
    join-paths/
    registry.yaml
  evals/
    datasets/
      golden-queries/
    suites/
      query-level/
      component-level/
      system-level/
    reports/
  configs/
    environments/
      dev.yaml
      staging.yaml
      prod.yaml
    rules/
      autonomy.yaml
      escalation.yaml
  infra/
    terraform/
    pipelines/
      ci/
      cd/
  scripts/
    sync_dbt_metadata.py
    sync_looker_metadata.py
  tests/
    unit/
    integration/
    contract/
    e2e/
  .github/
    workflows/
```

## 11.4 Component-to-Folder Mapping

| PRD Component | Runtime/Code Location |
|---------------|-----------------------|
| Context Layer | `context/` + `services/context-service/` |
| Rules Layer | `configs/rules/` + `services/rules-service/` |
| Memory Layer | `services/memory-service/` |
| Orchestration Layer | `services/orchestrator/` |
| Error Likelihood Engine | `packages/risk-engine/` |
| Execution & Validation | `services/execution-gateway/` |
| Evaluation Framework | `services/evaluation-service/` + `evals/` |
| Interface Specifications | `packages/contracts/` |

## 11.5 Stage 1 Deployment Boundaries

Stage 1 must enforce these boundaries in implementation:

1. **Read-only execution only:** `execution-gateway` accepts only SELECT queries
2. **Schema whitelist enforcement:** query execution scoped by user RBAC
3. **Audit logging required:** all requests and outcomes logged
4. **Feature-flag Stage 2 capabilities:** keep scaffolding disabled in production
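The first two boundaries can be illustrated with a deliberately naive, conservative gate. This is a sketch under stated assumptions, not the `execution-gateway` implementation: it uses regular expressions instead of a SQL parser, rejects CTEs (`WITH ...`) and any statement containing an inner semicolon, and only sees schema-qualified tables after `FROM` or `JOIN`. The returned keys mirror the `validation` object in Appendix A.4, except `passes_gate`, which is a hypothetical convenience field.

```python
import re
from typing import Dict, List

# Matches schema-qualified tables after FROM or JOIN, e.g. "FROM analytics.users".
_TABLE_REF = re.compile(
    r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)\.[A-Za-z_]\w*", re.IGNORECASE
)


def validate_sql(sql: str, schema_whitelist: List[str]) -> Dict[str, object]:
    """Naive Stage 1 gate: single SELECT statement, whitelisted schemas only."""
    statement = sql.strip().rstrip(";").strip()
    first_word = statement.split(None, 1)[0].upper() if statement else ""
    # Reject anything that is not a single SELECT statement.
    is_select = first_word == "SELECT" and ";" not in statement
    referenced = sorted({m.lower() for m in _TABLE_REF.findall(statement)})
    allowed = {s.lower() for s in schema_whitelist}
    blocked = [s for s in referenced if s not in allowed]
    return {
        "sql_type": first_word,
        "is_allowed": is_select,
        "schemas_referenced": referenced,
        "schemas_blocked": blocked,
        "passes_gate": is_select and not blocked,   # hypothetical field
    }
```

A production gateway would more likely rely on a real SQL parser and on database-level read-only credentials, since pattern matching cannot cover every SQL dialect feature.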

## 11.6 Repo Split Option (Future)

The recommended default is one dedicated platform repo.  
If governance complexity grows, split context into a separate repo:

- `conversational-analytics-platform` (runtime + orchestration + evals)
- `analytics-context` (governed shared context artifacts)

Only split when needed; start with one platform repo to reduce operational complexity.

---

# PART B: IMPLEMENTATION DETAILS

---

# Appendix A: Interface Specifications

## A.1 Context Layer Interface

```yaml
# INPUT: Context Request
context_request:
  query: string              # Natural language question
  domain_hint: string?       # Optional domain
  actor_id: string?          # For personalized context
  session_id: string?        # For multi-turn context

# OUTPUT: Context Response
context_response:
  matched_entities:
    - entity_name: string
      relevance_score: float
      columns: Column[]
      gotchas: Gotcha[]
      join_paths: JoinPath[]
  domain_guide: DomainGuide?
  business_context: BusinessContext?  # Stage 2
  disambiguation_needed: boolean
  disambiguation_questions: Question[]?
  error_likelihood_factors:
    complexity: float
    source_risk: float
    novelty: float
    ambiguity: float
```

## A.2 Rules Layer Interface

```yaml
# INPUT: Authorization Request
auth_request:
  actor:
    id: string
    type: enum[human, agent]
    role: string
    risk_tolerance: RiskTolerance?
  action:
    type: enum[query, modify, recommend]
    target_schemas: string[]
  risk:
    error_likelihood: float
    expected_cost: float?

# OUTPUT: Authorization Response
auth_response:
  decision: enum[PROCEED, NOTIFY, CONFIRM, RECOMMEND, BLOCK]
  autonomy_level: int
  rationale: string
  escalation_path: Actor[]?
```

## A.3 Memory Layer Interface

```yaml
# RETAIN: Store memory
retain_request:
  memory_type: enum[correction, preference, experience]
  content: any
  actor_id: string
  session_id: string?

# RECALL: Retrieve memories
recall_request:
  actor_id: string
  context: string[]  # Relevant entities/domains
  
recall_response:
  memories:
    - type: string
      content: any
      relevance: float
```

## A.4 Execution Layer Interface

```yaml
# INPUT: Execution Request
execution_request:
  request_id: string
  sql: string
  schema_whitelist: string[]  # Injected based on user RBAC

# OUTPUT: Execution Response
execution_response:
  request_id: string
  status: enum[success, rejected, error]
  rejection_reason: string?   # e.g., "Non-SELECT query rejected", "Schema not in whitelist"
  
  validation:
    sql_type: string          # SELECT, INSERT, UPDATE, etc.
    is_allowed: boolean       # Stage 1: only SELECT allowed
    schemas_referenced: string[]
    schemas_blocked: string[]
  
  plan:                       # EXPLAIN results (only if validation passed)
    estimated_scan_gb: float
    estimated_runtime_ms: int
    estimated_rows: int
    warnings: string[]
  
  results:                    # Only if execution succeeded
    columns: Column[]
    rows: Row[]
    row_count: int
    truncated: boolean
```

## A.5 Orchestration Interface

```yaml
# Analytics Request
analytics_request:
  request_id: string
  actor:
    id: string
    type: enum[human, agent]
  query:
    natural_language: string
  preferences:
    autonomy_override: int?  # User can request specific level
    criticality: enum[low, normal, high]?

# Analytics Response
analytics_response:
  request_id: string
  status: enum[completed, pending_acceptance, needs_disambiguation, blocked]
  result:
    sql: string?
    data: any?
    confidence: float
    caveats: string[]
  acceptance:
    required: boolean
    timeout_minutes: int?
```

---

# Appendix B: Data Schemas

## B.1 Query Log Schema

```yaml
query_log:
  log_id: string
  timestamp: datetime
  
  request:
    request_id: string
    actor_id: string
    natural_language: string
    session_id: string
    stated_criticality: enum[low, normal, high]?
  
  context:
    matched_entities: string[]
    domain: string
    routing_confidence: float
  
  risk:
    error_likelihood: float
    expected_cost: float?
    factors:
      complexity: float
      source_risk: float
      novelty: float
      ambiguity: float
  
  execution:
    autonomy_level: int
    generated_sql: string
    execution_time_ms: int
  
  outcome:
    user_accepted: boolean
    feedback: enum[positive, negative, none]?
    correction_made: boolean
    correction_type: string?
```

## B.2 Correction Schema

```yaml
correction:
  id: string
  timestamp: datetime
  actor_id: string
  actor_role: string               # For consensus role filtering
  source_request_id: string
  
  correction_type: enum[factual, gotcha, permission, preference]
  correction_domain: string        # e.g., "Churn", "Revenue", "User"
  
  original:
    content: any
  corrected:
    content: any
  
  embedding: vector                # For semantic similarity matching
  
  routing:
    target_layer: enum[context, rules, memory]
    stage: enum[captured, localized, candidate, promoted, rejected]
    
  # Populated when stage = "localized"
  personal_memory:
    applied_to_user: boolean
    applied_at: datetime
    
  # Populated when stage = "candidate"
  consensus:
    similar_corrections: string[]  # IDs of similar corrections
    distinct_users: int
    triggered_at: datetime
    jira_ticket: string?
    
  # Populated when stage = "promoted" or "rejected"
  stewardship:
    reviewed_by: string
    reviewed_at: datetime
    decision: enum[approved, rejected]
    rejection_reason: string?
    pull_request_url: string?
    merged_at: datetime?
```

---

# Appendix C: Agent Instructions

> **This appendix provides explicit instructions for AI agents operating within this system.**

## C.1 Context for AI Agents

You are operating within a **compound AI system** for conversational analytics. Your role is defined by the **Rules Layer** and your actions are logged to the **Memory Layer**.

### Operating Principles

1. **ALWAYS** check error likelihood before proceeding
2. **ALWAYS** respect autonomy level decisions
3. **ALWAYS** log your actions for auditability
4. **ALWAYS** obtain user acceptance before closing a request
5. **NEVER** bypass disambiguation when error likelihood ≥ 0.5
6. **NEVER** auto-close requests at Level 1-3 without user confirmation

### Stage 1 Operational Constraints

**Read-Only:** Stage 1 only permits SELECT queries. Any non-SELECT SQL is rejected at the Execution Layer.

**Schema Whitelist:** The Orchestration Layer injects a `schema_whitelist` based on user RBAC. The Execution Layer rejects queries referencing schemas outside this list.

**Audit Logging:** All queries are logged with full context for traceability.

## C.2 Decision Tree (Stage 1: Read-Only)

```
START
  │
  ▼
Parse user query
  │
  ▼
Call Context Layer → Get matched entities
  │
  ▼
Generate SQL
  │
  ├── If NOT SELECT query → REJECT immediately
  │     Response: "Write operations not supported in Stage 1."
  │     END
  │
  ▼
Call Error Likelihood Engine → Get risk score
  │
  ▼
Check user risk tolerance and stated criticality
  │
  ▼
Calculate combined risk = error_likelihood × expected_cost
  │
  ▼
Determine autonomy level
  │
  ├── If Level 5 (Full Autonomy)
  │     │
  │     ▼
  │   Run EXPLAIN → Check resource cost
  │     ├── If resources OK → Execute
  │     └── If resources HIGH → Elevate to Level 3
  │   Present Results
  │   Wait for implicit acceptance or timeout (24h)
  │   Close request
  │
  ├── If Level 4 (Notify)
  │     │
  │     ▼
  │   Run EXPLAIN → Check resource cost
  │     ├── If resources OK → Execute
  │     └── If resources HIGH → Elevate to Level 3
  │   Present Results, Notify stakeholders
  │   Wait for implicit acceptance or timeout (48h)
  │   Close request
  │
  ├── If Level 3 (Confirm)
  │     │
  │     ▼
  │   Run EXPLAIN → Check resource cost
  │     ├── If resources OK → Execute
  │     └── If resources HIGH → Warn user, request confirmation
  │   Present Results
  │   Request explicit user acceptance
  │   IF user accepts → Close request
  │   IF user revises → Loop back to parse revised query
  │   IF timeout (72h) → Mark as Abandoned, notify user
  │
  ├── If Level 2 (Recommend)
  │     │
  │     ▼
  │   Show SQL recommendation (do not execute)
  │   Run EXPLAIN → Show estimated cost
  │   Present recommendation with rationale
  │   User decides whether to execute
  │   IF timeout (7 days) → Expire recommendation, notify user
  │
  └── If Level 1 (Human Only)
        │
        ▼
      Flag for human handling
      Explain why automation is inappropriate
      No timeout (remains open)
```
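The branching on autonomy level and resource cost can be condensed into two small functions. This is a hedged sketch of the tree above: the step labels (`execute`, `confirm_before_execute`, `recommend_only`, `flag_for_human`) and the function names are hypothetical, and `resources_high` is assumed to come from the EXPLAIN check.

```python
def effective_level(level: int, resources_high: bool) -> int:
    """Levels 4-5 are elevated to Level 3 when EXPLAIN reports high resource cost."""
    if level not in (1, 2, 3, 4, 5):
        raise ValueError(f"Unknown autonomy level: {level}")
    if level in (4, 5) and resources_high:
        return 3
    return level


def next_step(level: int, resources_high: bool) -> str:
    """Return the next orchestration step for a request."""
    level = effective_level(level, resources_high)
    if level == 1:
        return "flag_for_human"
    if level == 2:
        return "recommend_only"          # show SQL and estimated cost, do not execute
    if resources_high:
        return "confirm_before_execute"  # warn user and request confirmation
    return "execute"
```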

## C.3 Handling Disambiguation

When ambiguity score is high (≥ 0.5):

```yaml
# 1. Identify ambiguous elements
ambiguous_elements:
  - type: column
    options: ["user.user_id", "activity.user_id"]
    question: "Which user_id do you mean?"
  - type: metric
    options: ["ARR (annual)", "MRR (monthly)"]
    question: "Which revenue metric?"

# 2. Present structured questions
response: |
  I need to clarify a few things:
  
  1. Multiple columns match 'user_id'. Which should I use?
     a) user.user_id - Primary user identifier
     b) activity.user_id - Activity-specific ID
  
  2. Which revenue metric do you need?
     a) ARR (Annual Recurring Revenue)
     b) MRR (Monthly Recurring Revenue)

# 3. Wait for user response
# 4. Recalculate error likelihood with clarified scope
# 5. Proceed when ambiguity score < 0.5
```
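Rendering the structured questions is mostly string formatting. The sketch below assumes each ambiguous element is a dictionary with the `question` and `options` keys shown above; the function names are illustrative.

```python
from string import ascii_lowercase
from typing import Dict, List

AMBIGUITY_THRESHOLD = 0.5


def needs_disambiguation(ambiguity_score: float) -> bool:
    """Disambiguation is required when the ambiguity score is at or above 0.5."""
    return ambiguity_score >= AMBIGUITY_THRESHOLD


def render_questions(elements: List[Dict]) -> str:
    """Format ambiguous elements as numbered questions with lettered options."""
    lines = ["I need to clarify a few things:", ""]
    for number, element in enumerate(elements, start=1):
        lines.append(f"{number}. {element['question']}")
        for letter, option in zip(ascii_lowercase, element["options"]):
            lines.append(f"   {letter}) {option}")
        lines.append("")
    return "\n".join(lines).rstrip()
```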

## C.4 Handling Corrections

```text
# When user provides correction:

# 1. Acknowledge
response: "I understand. Let me update my understanding."

# 2. Classify
correction_type = classify(correction)
correction_domain = extract_domain(correction)  # e.g., "Churn", "Revenue"

# 3. Route based on type
if correction_type in [factual, gotcha]:
  # CAPTURE: Record correction with full context
  correction = create_correction(
    type: correction_type,
    domain: correction_domain,
    actor_role: user.role,
    embedding: embed(correction.content)
  )
  
  # LOCALIZE: Apply to Personal Memory ONLY
  personal_memory.add(correction)
  correction.stage = "localized"
  
  # Inform user of the process
  response: |
    I've applied this correction to your personal context — it will
    improve my answers for you immediately.
    
    If multiple senior analysts make similar corrections, this may be
    flagged for review to update the shared knowledge base.
  
if correction_type == preference:
  # Preferences go directly to Memory Layer (no consensus needed)
  action: Store directly in Memory Layer
  memory: Update user preferences
  response: "I've noted your preference and will apply it to future queries."
  
if correction_type == permission:
  # Security issues bypass the normal flow
  action: Flag for immediate Rules Layer review
  log: Security event
  response: "I've flagged this as a permission issue for immediate review."

# 4. Background: Consensus Detection (runs asynchronously)
# System checks if this correction triggers a consensus signal:
#
# IF (
#   count(distinct actor_id in similar_corrections where actor_role in SENIOR_ROLES) > 3
#   AND correction_domain is the same
#   AND semantic_similarity > 0.85
# ) THEN:
#   correction.stage = "candidate"
#   create_jira_ticket(correction)
#   notify_stewards()
```

## C.5 Multi-Turn Context

```yaml
# Maintain session state
session:
  entities_discussed: []      # Accumulate
  filters_established: []     # Carry forward
  corrections_applied: []     # Apply to all queries
  disambiguation_choices: []  # Remember preferences

# On each turn:
#   1. Load session context
#   2. Apply accumulated filters and corrections
#   3. Boost relevance for previously discussed entities
#   4. Process new query
#   5. Update session context
#   6. Persist to Memory Layer

# Referential phrases:
# "same as before" → use previous parameters
# "but for Q2" → modify previous query
# "actually, I meant..." → correction to previous query
```
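Session state and the relevance boost could be sketched as follows. The field names follow the `session` block above; the boost size of `0.1`, the cap at `1.0`, and the helper names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Session:
    entities_discussed: List[str] = field(default_factory=list)
    filters_established: List[str] = field(default_factory=list)
    corrections_applied: List[str] = field(default_factory=list)
    disambiguation_choices: List[str] = field(default_factory=list)


def boost_relevance(session: Session, scores: Dict[str, float],
                    boost: float = 0.1) -> Dict[str, float]:
    """Raise scores of entities already discussed in this session (capped at 1.0)."""
    return {
        entity: min(1.0, score + boost) if entity in session.entities_discussed else score
        for entity, score in scores.items()
    }


def update_session(session: Session, entities: Iterable[str],
                   filters: Iterable[str]) -> None:
    """Accumulate new entities and filters without duplicating earlier ones."""
    for entity in entities:
        if entity not in session.entities_discussed:
            session.entities_discussed.append(entity)
    for item in filters:
        if item not in session.filters_established:
            session.filters_established.append(item)
```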

---

# Appendix D: Implementation Roadmap

## Phase 1: Foundation (Weeks 1-4)

| Task | Owner | Priority |
|------|-------|----------|
| Finalize PRD with architect input | Koen | P1 |
| Design QueryLog schema | Koen | P1 |
| Implement logging in get_query_context | Felipe | P1 |
| Add Product Analytics domain guide | Analytics | P1 |
| Prototype error likelihood calculation | Koen | P2 |

## Phase 2: Core Components (Weeks 5-8)

| Task | Owner | Priority |
|------|-------|----------|
| Implement rules/roles.yaml | Platform | P1 |
| Implement rules/autonomy.yaml | Platform | P1 |
| Add user feedback collection | Felipe | P1 |
| Implement memory/corrections.jsonl | Platform | P2 |

## Phase 3: Orchestration (Weeks 9-12)

| Task | Owner | Priority |
|------|-------|----------|
| Implement orchestration state machine | Platform | P1 |
| Add user acceptance flow | Platform | P1 |
| Implement disambiguation UI | Frontend | P2 |

## Phase 4: Calibration (Weeks 13-16)

| Task | Owner | Priority |
|------|-------|----------|
| Build ground truth dataset | Analytics | P1 |
| Implement error likelihood validation | Data Science | P2 |
| Calibrate autonomy thresholds | Data Science | P2 |

---

# Document History

| Date | Version | Change | Author |
|------|---------|--------|--------|
| 2026-02-17 | 1.0 | Initial draft | Koen Rutten |
| 2026-02-17 | 1.1 | Restructured per feedback: separated concepts from implementation; clarified Context vs Memory; added staged approach; added expected cost to risk; clarified autonomy levels; added user acceptance requirement; established eval hierarchy | Koen Rutten |
| 2026-02-17 | 1.2 | Architecture review: Privacy architecture, CI/CD context refresh, policy triggers, improved error likelihood computation, PLAN stage, timeout handling, context poisoning guardrails, four-stage correction promotion | Koen Rutten |
| 2026-02-17 | 1.3 | Stage boundary clarification: Stage 1 read-only (SELECT only hard gate); Stage 2 scope defined; Replaced disambiguation rate with quality metrics (clarification precision, post-clarification acceptance, incorrect-first-answer rate); Formalized weight calibration plan (bootstrap + monthly recalibration); Added logprob availability fallback for ambiguity; Reduced privacy/security narrative emphasis; Retained minimal operational safeguards (schema whitelist, query allowlist, audit logging) | Koen Rutten |
| 2026-02-17 | 1.4 | Added physical architecture and repository strategy section: dedicated platform repo recommendation, canonical repo structure, component-to-folder mapping, Stage 1 deployment boundaries, and future repo split option | Koen Rutten |

---

# Approval

| Role | Name | Date | Signature |
|------|------|------|-----------|
| Author | Koen Rutten | 2026-02-17 | ✓ |
| Technical Review | Larissa / Rebecca | Pending | |
| Business Review | Dima Potapov | Pending | |

---

*This document should be treated as a living specification. Updates should be proposed via PR with review from affected component owners.*
