# Colab Book 05 — Multi-Tenant ETL (Hetzner-Isolated Tenants, Keys, Storage, Audit Streams)

**Focus:** Tenant isolation by construction: per-tenant environments, per-tenant keys, per-tenant stores, and immutable audit streams; blast-radius control.

**Iteration note:** This notebook is designed to be shared with Frode and edited live as the pipeline matures.


## Sources (public)
These sources reflect Frode/Frostbyte’s public positioning around enterprise AI infrastructure, sovereignty, and security risk.

- Frostbyte site (enterprise AI infra for regulated industries): https://frostbyteholding.com/
- “Stop Selling Toys as Enterprise Solutions” (enterprise RAG needs control): https://frostbyteholding.com/blog/stop-selling-toys-enterprise-solutions
- “Your AI-Powered Platform Just Became a Security Nightmare” (documents-as-weapons / injection framing): https://frostbyteholding.com/blog/ai-platform-security-nightmare
- YouTube series: “AI, Law & Infrastructure” (roundtable with lawyers + AI/data/security practitioners): https://www.youtube.com/playlist?list=PLpl5qpGe_tuAB5Sjx8g_aj9Kwbg22tcc5
- Example episode featuring Frode: “Blind Trust in Legal AI Vendors Is Now a Legal Liability”: https://www.youtube.com/watch?v=UCBcmWPHSKY
- Docling docs: https://docling-project.github.io/docling/reference/document_converter/
- Unstructured OSS docs: https://docs.unstructured.io/open-source/introduction/overview
- OpenRouter embeddings API: https://openrouter.ai/docs/api/reference/embeddings
- Hetzner Cloud API: https://docs.hetzner.cloud/reference/cloud
- Nomic embed-text model: https://huggingface.co/nomic-ai/nomic-embed-text-v1


## How this ties to the pipeline sketch
The source sketch emphasizes:
- demand for **data pipelines**
- two execution modes: **online API** and **offline via Docker**
- open-source-first ingestion: **Unstructured** + **Docling**
- embeddings via **OpenRouter** (OpenAI / Qwen / Kimi)
- per-tenant isolation via **Hetzner**
- outputs stored as **structured DB records + vector index**

This notebook turns that into an implementable, review-ready ETL architecture with explicit sovereignty and safety boundaries.


## Architectural diagrams (Mermaid)
The diagrams are rendered inline below.


In [None]:
# Mermaid rendering helper (Colab)
# This creates an HTML block that Colab can display.
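from IPython.display import HTML, display

def show_mermaid(mermaid_text: str):
    """Render Mermaid source as an HTML block in the Colab cell output."""
    # Escape backticks with a backslash before embedding the text in the HTML block.
    escaped = mermaid_text.replace("`", "\\`")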
    html = f"""
    <script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
    <div class="mermaid">
    {escaped}
    </div>
    <script>
      mermaid.initialize({{ startOnLoad: true, theme: 'default' }});
    </script>
    """
    display(HTML(html))


### End-to-end pipeline

In [None]:
show_mermaid('''flowchart TD
  A[Vendor Sources\nSFTP / HTTPS / API / Batch Drop] --> B[Intake Gateway\nAuthZ, manifest, checksums, scanning]
  B --> C[Normalization Layer\nDocling + Unstructured]
  C --> D[Canonical Structured Doc JSON\nsections/tables/offsets/lineage]
  D --> E[Policy Gates + Enrichment\nPII, classification, injection defense]
  E --> F[Object Store\nraw + normalized]
  E --> G[Relational DB\nmetadata/lineage/jobs]
  E --> H[Vector Store\nembeddings + metadata]
  F --> I[Serving\nRAG / Agents / Dashboards]
  G --> I
  H --> I
  I --> J[Audit + Observability\nimmutable logs + metrics]
''')

### Multi-tenant isolation (control plane → tenant environments)

In [None]:
show_mermaid('''flowchart LR
  CP[Control Plane] -->|provision tenant A| A[Tenant A\ncompute + db + vector + keys]
  CP -->|provision tenant B| B[Tenant B\ncompute + db + vector + keys]
  CP -->|audit stream| L[Immutable Audit Store]
  A --> L
  B --> L
''')

### Offline Docker bundle

In [None]:
show_mermaid('''flowchart TD
  H[Air-gapped Host] --> DC[Docker Compose]
  DC --> P[Parsing Services\nDocling + Unstructured]
  DC --> E[Local Embeddings\nNomic]
  DC --> V[Local Vector Store]
  DC --> R[Local Relational DB]
  DC --> UI[Local Validation UI\nreceipts + diffs + export]
  UI --> X[Export Bundle\nstructured JSON + index snapshot + audit log]
''')

## Full proposal (canonical)
The canonical proposal and safety model are embedded below for single-file sharing.


### ETL Pipeline Proposal

# ETL Pipeline Proposal — Data Sovereignty + Safety (Online + Offline)

## 0) Executive Summary
Build a dual-mode ETL pipeline that supports:
- **Online mode**: API-driven ingestion + optional provider embeddings via OpenRouter.
- **Offline mode**: Docker-run pipeline with **local embeddings (Nomic)** and **no outbound calls**.

Core contract:
- **Document in → Structure out → Stored in DB + Vector**
- **Tenant isolation by construction** (per-tenant environments)
- **Full provenance + audit trails** at every step
- **Injection-resistant ingestion** (documents are untrusted inputs)

## 1) Use For / Do Not Use For

### Use for
- Regulated vendor document onboarding (contracts, policies, SOPs, case/matter files)
- RAG-ready corpora where retrieval provenance must be inspectable
- Dashboards/analytics derived from normalized, governed structured outputs
- Multi-tenant deployments where each tenant’s data/keys/compute are isolated

### Do Not Use for
- Returning “legal citations” unless citations are verifiably derived from stored source slices
- Any workflow allowing untrusted docs to influence tool instructions or control prompts
- Cross-tenant aggregation without explicit de-identification + contractual permission
- “Demo-only” pipelines without manifests, acceptance checks, and audit logs

## 2) Customer Persona + Context
See the Customer Persona + Journey Map section of this notebook (embedded from `docs/CUSTOMER_JOURNEY_MAP.md`) for the full journey.

## 3) Pipeline Phases (Major Components)

### Phase A — Intake Gateway (Trust Boundary)
**Inputs**
- Vendor batch drops (SFTP/HTTPS), API pulls, manual uploads (if needed)

**Responsibilities**
- Tenant authentication/authorization
- Manifest + checksum receipts
- File-type allowlists + malware scanning
- Quotas/rate limits

**Outputs**
- Immutable intake receipt: (tenant_id, file_id, sha256, timestamp, source)
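
A minimal sketch of the intake receipt, assuming the five fields named above and a frozen dataclass as the in-memory form. The helper name `make_receipt` and the `source` labels are illustrative, not part of the proposal; durable immutability would still have to come from the store the receipt is written to.

```python
import hashlib
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IntakeReceipt:
    """Fields follow the receipt tuple listed under Outputs."""
    tenant_id: str
    file_id: str
    sha256: str
    timestamp: str  # ISO 8601, UTC
    source: str     # e.g. "sftp", "https", "api" (labels are assumptions)


def make_receipt(tenant_id: str, file_id: str, payload: bytes, source: str) -> IntakeReceipt:
    """Hash the raw bytes at the trust boundary and stamp the receipt."""
    return IntakeReceipt(
        tenant_id=tenant_id,
        file_id=file_id,
        sha256=hashlib.sha256(payload).hexdigest(),
        timestamp=datetime.now(timezone.utc).isoformat(),
        source=source,
    )


receipt = make_receipt("tenant-a", "file-0001", b"example contract bytes", "sftp")
print(asdict(receipt))
```

Freezing the dataclass only prevents accidental mutation inside the process; it is a convenience, not a security control.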

### Phase B — Document Normalization (Structure Extraction)
**Primary tooling**
- **Docling**: document conversion + layout-aware structuring
- **Unstructured**: partitioning/chunking primitives + metadata enrichment

**Deliverable**
- A canonical “Structured Document” JSON with:
  - sections, tables, figures
  - reading order + offsets
  - doc lineage pointers (raw → normalized → chunks)
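
One possible shape for the canonical “Structured Document” JSON, sketched with dataclasses. All class and field names here (`Block`, `Lineage`, `StructuredDocument`, `rows`, and so on) are assumptions for illustration; neither Docling nor Unstructured emits this schema directly, so a mapping step from their outputs would be needed.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Block:
    kind: str                 # "section", "table", or "figure"
    order: int                # position in reading order
    start_char: int           # offsets into the normalized text
    end_char: int
    text: str = ""
    rows: Optional[List[List[str]]] = None  # cell grid, tables only


@dataclass
class Lineage:
    raw_sha256: str           # raw file hash, as recorded at intake
    normalized_sha256: str    # hash of the normalized artifact
    chunk_ids: List[str] = field(default_factory=list)


@dataclass
class StructuredDocument:
    tenant_id: str
    doc_id: str
    blocks: List[Block]
    lineage: Lineage

    def to_json(self) -> str:
        # sort_keys keeps the serialized form stable across runs
        return json.dumps(asdict(self), sort_keys=True)


doc = StructuredDocument(
    tenant_id="tenant-a",
    doc_id="doc-0001",
    blocks=[
        Block(kind="section", order=0, start_char=0, end_char=12, text="1. Liability"),
        Block(kind="table", order=1, start_char=13, end_char=40,
              rows=[["Party", "Cap"], ["Vendor", "EUR 1m"]]),
    ],
    lineage=Lineage(raw_sha256="<raw hash>", normalized_sha256="<normalized hash>"),
)
print(doc.to_json())
```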

### Phase C — Policy Gates + Enrichment (Safety + Governance)
**Gates**
- PII/PHI detection & redaction policies (configurable)
- Document classification (contract / invoice / SOP / etc.)
- Prompt/document-injection defenses (see the Threat Model + Safety Controls section, embedded from `docs/THREAT_MODEL_SAFETY.md`)
- Deterministic chunking + stable chunk IDs

### Phase D — Storage (DB + Object + Vector)
**Write targets**
- Object store: raw + normalized artifacts
- Relational DB: governance metadata, lineage, job statuses, retention rules
- Vector store: embeddings for retrieval (per-chunk + metadata)

**Key requirement**
- Tenant isolation across all stores (namespaces + keys + IAM policies)

### Phase E — Serving (RAG + Analytics + Agent Networks)
**Surfaces**
- Retrieval API (RAG/agents)
- Analytics extracts (warehouse-ready tables)
- Dashboards

**Operational requirement**
- Observability + audit trails for all retrieval and generation events

## 4) Deployment Modes

### 4.1 Online mode (API)
- Orchestrator provisions per-tenant resources
- Embeddings routed via OpenRouter where permitted
- Provider selection must not violate sovereignty contracts
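
The sovereignty rule above can be enforced as a gate that runs before any embedding request is built. The sketch below assumes a per-tenant policy object with an external-call flag and a model allowlist; `EmbeddingPolicy`, `choose_route`, and the model identifier are hypothetical names, and no OpenRouter call is made here.

```python
from dataclasses import dataclass
from typing import FrozenSet


class SovereigntyViolation(Exception):
    """Raised when a requested embedding route breaks the tenant's contract."""


@dataclass(frozen=True)
class EmbeddingPolicy:
    tenant_id: str
    allow_external: bool                      # False forces local embeddings
    allowed_models: FrozenSet[str] = frozenset()


def choose_route(policy: EmbeddingPolicy, model: str) -> str:
    """Return "local" or "openrouter"; refuse models outside the allowlist."""
    if not policy.allow_external:
        return "local"
    if model not in policy.allowed_models:
        raise SovereigntyViolation(
            f"{model!r} is not permitted for tenant {policy.tenant_id!r}"
        )
    return "openrouter"


# Placeholder model id; real ids come from the provider's catalogue.
strict = EmbeddingPolicy("tenant-a", allow_external=False)
permissive = EmbeddingPolicy("tenant-b", True, frozenset({"example/embedding-model"}))

print(choose_route(strict, "example/embedding-model"))
print(choose_route(permissive, "example/embedding-model"))
```

Failing closed (an exception rather than a silent fallback) keeps a misconfigured model from quietly leaving the tenant boundary.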

### 4.2 Offline mode (Docker)
- Compose bundle includes parsers + local embeddings + local storage
- Default networking: no outbound routes
- Export artifacts as signed bundles (normalized JSON + index snapshot)

## 5) Tooling Choices + Alternatives

### Parsing / structuring
Primary:
- Unstructured (OSS)
- Docling

Alternatives:
- Apache Tika
- OCRmyPDF + layout models (scan-heavy)

### Embeddings
Primary:
- Online: OpenRouter embeddings endpoint
- Offline: Nomic embed-text

Alternatives:
- bge / e5 family models for local embeddings

### Vector store
Primary pattern:
- Tenant-scoped Qdrant/Weaviate/pgvector (choose based on ops maturity; pgvector suits moderate scale)

### Infrastructure + tenant isolation
Primary:
- Hetzner Cloud API to provision per-tenant isolated environments

Alternatives:
- Dedicated Kubernetes cluster per tenant (strongest blast-radius control)
- Shared cluster with strict namespaces + network policies (lower cost, higher diligence)
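
As a rough illustration of per-tenant provisioning, the sketch below builds (but does not send) a server-creation request for the Hetzner Cloud API. The helper name, the server type, image, and location values, and the labelling scheme are placeholders to check against the API reference; using a separate API token per tenant is likewise an assumption about how isolation might be arranged.

```python
import json
import urllib.request

HETZNER_API = "https://api.hetzner.cloud/v1"


def build_create_server_request(
    token: str,
    tenant_id: str,
    server_type: str = "cx22",       # placeholder; verify available types
    image: str = "ubuntu-24.04",     # placeholder image name
    location: str = "fsn1",          # placeholder location
) -> urllib.request.Request:
    """Build a POST /servers request labelled with the tenant id."""
    payload = {
        "name": f"etl-{tenant_id}",
        "server_type": server_type,
        "image": image,
        "location": location,
        "labels": {"tenant": tenant_id, "role": "etl"},
    }
    return urllib.request.Request(
        f"{HETZNER_API}/servers",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_create_server_request("dummy-token", "tenant-a")
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Sending the request would be a call to `urllib.request.urlopen(req)` with a real token. Keeping request construction separate makes the provisioning payload reviewable and testable without touching the network.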

## 6) Acceptance Criteria (Rollout-Ready)
- Manifest parity: 0 missing files vs vendor manifest
- Parse visibility: “what extracted / what dropped” report per file
- Deterministic, idempotent reruns: identical inputs yield identical outputs, and a rerun reprocesses only failed docs/segments
- Provenance: every chunk maps to a source slice + offsets + hash
- Audit: ingestion → chunking → embedding model/version → index write → retrieval events

## 7) Deliverables for Vendor Rollout
- Vendor intake checklist + manifest format
- Acceptance report template (`templates/VENDOR_ACCEPTANCE_REPORT.md`)
- Sandbox tenant environment for trial submissions
- Offline Docker bundle + compatibility matrix (hardware/OS/GPU)



### Customer Persona + Journey Map

# Customer Persona + Journey Map (Vendor Rollout)

## Persona
**Dana (Vendor Data Operations Lead)**
- Accountable for sending documents/data to the buyer on schedule
- Limited engineering support
- High risk sensitivity: “Where does our data go? Who can see it?”

## Illustrative pain points
- [P1] Ambiguous requirements → repeated rework, missed deadlines
- [P2] Black-box ingestion → “What did you parse? What did you drop?”
- [P3] Sovereignty anxiety → “Did this go to external APIs?”
- [P4] Retrieval mismatch → “Your system answers wrong because retrieval is wrong”
- [P5] Offline installs break → Docker/GPU/deps/updates

## Journey map

| Stage | Dana’s goal | Dana’s action | System behavior | Failure to prevent | Required artifact |
|---|---|---|---|---|---|
| Trust framing | Confirm sovereignty | asks where data flows | clear boundary statement | vague claims | Data boundary contract |
| Onboarding | Connect sources | provides SFTP/API creds | least privilege + rotation | over-permission | scoped connector config |
| Upload | Ship documents | sends batch + manifest | receipts + checksums | missing/duplicate files | intake receipt |
| Parsing | Preserve structure | waits | parse preview + diffs | silent loss | parse diff report |
| Enrichment | Validate categories | approves taxonomy | human-in-loop gate | misclassification | review UI/report |
| Storage | Confirm isolation | asks about tenancy | tenant-only namespaces/keys | cross-tenant bleed | isolation evidence |
| Retrieval QA | Trust results | runs test queries | source slices + offsets | “fluent wrong” | retrieval proof report |
| Operations | Reduce burden | schedules deltas | idempotent deltas | drift/unseen failures | observability dashboard |
| Incident | Contain blast radius | reports issue | tenant kill-switch | lateral movement | immutable audit log |



### Threat Model + Safety Controls

# Threat Model + Safety Controls (Document ETL → RAG)

## Core threat statement
Documents are untrusted inputs. They can contain hidden instructions, malicious payloads, or adversarial text intended to subvert downstream models.

## Primary risks
1. **Prompt/document injection**: instructions embedded in docs that try to control the model or tools
2. **Cross-tenant data leakage**: shared infra or logging surfaces exposing other tenants
3. **Citation fraud**: model returns citations not present in the corpus
4. **Silent extraction loss**: parsing drops critical tables/clauses without visibility
5. **Supply chain / dependency drift**: offline bundle breaks as dependencies change

## Controls (required)
### A) Boundary controls
- Strict separation between: *content* vs *system/tool instructions*
- Never include raw doc text inside system prompts
- Create a “content-only” channel for retrieved slices

### B) Ingestion controls
- File allowlist + MIME verification
- Malware scanning
- Checksums and immutable receipts

### C) Parsing + chunking controls
- Deterministic chunk IDs
- Store offsets (doc_id, page, start_char, end_char)
- Preserve tables with structured representation
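
A minimal sketch of deterministic chunk IDs, assuming the ID is a SHA-256 over the offset tuple listed above plus the chunk text. The fixed-size character window is a stand-in for whatever chunking strategy the pipeline adopts, and the function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def chunk_id(doc_id: str, page: int, start_char: int, end_char: int, text: str) -> str:
    """Derive the ID purely from content and position, never from run state."""
    canonical = json.dumps([doc_id, page, start_char, end_char, text], ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def chunk_page(doc_id: str, page: int, text: str, size: int = 40) -> List[Dict]:
    """Split one page into fixed-size windows and record offsets per chunk."""
    chunks = []
    for start in range(0, len(text), size):
        end = min(start + size, len(text))
        chunks.append({
            "chunk_id": chunk_id(doc_id, page, start, end, text[start:end]),
            "doc_id": doc_id,
            "page": page,
            "start_char": start,
            "end_char": end,
        })
    return chunks


page_text = "The Vendor shall indemnify the Buyer against third-party claims arising from a breach."
first = chunk_page("doc-0001", 1, page_text)
second = chunk_page("doc-0001", 1, page_text)
print(first == second)
```

Because the ID depends only on inputs, a rerun over an unchanged document produces the same IDs, which is what lets downstream stores upsert rather than duplicate.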

### D) Retrieval + output controls
- “Cite-only-from-retrieval”: answer must be backed by stored slices
- Block or flag the answer when retrieval confidence falls below a configured recall/coverage threshold
- Maintain a “retrieval proof” object for every answer in legal workflows
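
The “cite-only-from-retrieval” rule reduces to a set-membership check between what the answer cites and what retrieval actually returned. The sketch below assumes citations are expressed as chunk IDs; `RetrievalProof` and `build_proof` are illustrative names, and how citations are extracted from a generated answer is out of scope here.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RetrievalProof:
    query: str
    retrieved: Tuple[str, ...]    # chunk ids returned by retrieval
    cited: Tuple[str, ...]        # chunk ids the answer claims to rely on
    unsupported: Tuple[str, ...]  # cited ids that retrieval never returned

    @property
    def accepted(self) -> bool:
        # Require at least one citation and no citation outside retrieval.
        return bool(self.cited) and not self.unsupported


def build_proof(query: str, retrieved: Iterable[str], cited: Iterable[str]) -> RetrievalProof:
    retrieved_t = tuple(retrieved)
    cited_t = tuple(cited)
    allowed = set(retrieved_t)
    unsupported = tuple(c for c in cited_t if c not in allowed)
    return RetrievalProof(query, retrieved_t, cited_t, unsupported)


ok = build_proof("liability cap?", ["c1", "c2"], ["c1"])
bad = build_proof("liability cap?", ["c1"], ["c1", "c9"])
print(ok.accepted, bad.accepted, bad.unsupported)
```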

### E) Tenancy controls
- Per-tenant keys (KMS)
- Per-tenant storage namespaces
- Network isolation and strict IAM boundaries
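
To illustrate the property that per-tenant keys should give (one tenant's key reveals nothing about another's), here is a sketch that derives tenant keys with HKDF-SHA256 (RFC 5869) from a master secret. This is a stand-in for demonstration only: the controls above call for a KMS, where keys would be generated and held by the service rather than derived in application code. The salt string, `info` layout, and function names are assumptions.

```python
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF with SHA-256: extract a pseudorandom key, then expand it."""
    if length > 255 * 32:
        raise ValueError("requested length too large for HKDF-SHA256")
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def tenant_key(master_key: bytes, tenant_id: str, purpose: str) -> bytes:
    """Bind the derived key to both the tenant and its intended use."""
    info = f"tenant={tenant_id};purpose={purpose}".encode("utf-8")
    return hkdf_sha256(master_key, salt=b"etl-tenant-keys-v1", info=info)


master = b"\x00" * 32  # demo value only; never hard-code a real secret
key_a = tenant_key(master, "tenant-a", "object-store")
key_b = tenant_key(master, "tenant-b", "object-store")
print(len(key_a), key_a != key_b)
```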

### F) Auditability
- Immutable audit log for: ingestion → normalization → embedding model/version → index write → retrieval
- Version pins for models and pipelines
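
One common way to make an audit stream tamper-evident is hash chaining, where each entry commits to the hash of the previous one. The sketch below assumes an in-memory list and JSON-serializable events; the function names and event fields are illustrative. Chaining detects modification but does not prevent it, so write-once storage would still be needed for actual immutability.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # prev_hash for the first entry


def _entry_hash(prev_hash: str, event: Dict) -> str:
    body = json.dumps({"prev_hash": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_event(log: List[Dict], event: Dict) -> Dict:
    """Append an event that commits to the previous entry's hash."""
    prev = log[-1]["entry_hash"] if log else GENESIS
    entry = {"prev_hash": prev, "event": event, "entry_hash": _entry_hash(prev, event)}
    log.append(entry)
    return entry


def verify_chain(log: List[Dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev:
            return False
        if _entry_hash(entry["prev_hash"], entry["event"]) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True


log: List[Dict] = []
append_event(log, {"tenant_id": "tenant-a", "stage": "ingestion", "file_id": "file-0001"})
append_event(log, {"tenant_id": "tenant-a", "stage": "embedding", "model": "example-model", "version": "1"})
append_event(log, {"tenant_id": "tenant-a", "stage": "index_write", "chunks": 12})
print(verify_chain(log))
```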


## Variant-specific emphasis


**Emphasis areas:**
- Tenancy Contract
- Isolation Evidence
- Explicit, reviewable provisioning via the Hetzner Cloud API
- Audit Streams

**Why this variant matters:** Use this when the buyer’s first question is: “If one tenant is compromised, can the attacker reach the others?”


## Data Engineering audition slice — what to implement first (concrete)
1. Intake receipts + manifest parity checks
2. Deterministic parse preview + diff report per document
3. Canonical structured-doc JSON schema with lineage pointers
4. Embedding + index writes with model/version recording
5. Vendor acceptance report generation

These five items are enough to demonstrate production discipline before the rest of the platform exists.
