RAG Support Assistant

A customer support RAG pipeline built over the Wix Help Center dataset. Every stage — chunking, retrieval, reranking, classification, generation, and evaluation — is implemented from scratch so the design decisions are visible and auditable.

The focus is on what happens when the pipeline can't confidently answer: when retrieval returns nothing useful, when the model hedges, or when the user is upset and asking to cancel their account. Each of these cases has a specific, designed response path rather than a generic fallback.

In a support context, these failure cases are where the business impact lives. A bot that handles a cancellation request the same way it handles a how-to question will lose that customer. A bot that confidently presents a wrong answer erodes trust faster than having no bot at all. And a bot that says "I don't know" with no next step is a dead end that drives a support ticket anyway. The failure handling in this pipeline is designed to deflect tickets when the bot can help, preserve trust when it can't, and route to a human immediately when the situation calls for it.

The support widget lives in the corner of a mock SaaS dashboard

Opening the widget shows a welcome message and suggested questions

Limitations

No multi-turn conversation. The pipeline is completely stateless — each request is independent with no memory of previous messages. For a production support bot, conversation context is table stakes (follow-up questions, pronoun resolution, etc.). This is a deliberate scope boundary for the demo; the architecture would need a session store and context window to support it.
Connect-to-agent is a demo stub. The agent handoff flow shows mock team availability but does not actually connect to a live agent. In production this would integrate with a live chat system.
No adversarial input protection. The pipeline does no prompt injection filtering or input sanitization beyond what the LLM provider applies. A malicious user could craft inputs to manipulate classifier routing or generation output. Production deployment would need an input guardrail layer before the classifier.

How it works

Here's what a user sees when they ask a question:

Got an answer — the bot replies with a grounded answer and links to the source help articles. If the answer only partially covers the question, the bot says so and offers a soft link to talk to a human instead of pretending it nailed it.

A grounded answer with step-by-step instructions and source article links

Got asked a follow-up — the bot couldn't find anything relevant in the knowledge base, so instead of guessing, it asks a clarifying question to keep the conversation going.

When retrieval finds nothing relevant, the bot asks a clarifying question

Got connected to an agent — the user asked something sensitive (cancellation, billing dispute) or the bot determined it genuinely can't help. The user sees an empathetic message and a button to reach a support agent, not a generic "I don't know."

High-stakes queries get an empathetic response with a path to a human agent

Under the hood, every question is classified into one of five categories — answerable, nonsense, irrelevant, out-of-scope, or high-stakes — and routed to a purpose-built handler. Only answerable and high-stakes queries go through retrieval; the rest get static responses immediately without wasting compute.

For queries that go through retrieval, the pipeline applies two confidence gates before presenting an answer:

Relevance gate — if the cross-encoder reranker's top score is below 2.0, the retrieved results aren't good enough. Instead of generating from weak context, the bot asks a clarifying follow-up question.
Confidence gate — if the top score is below 5.0, the answer is shown but the bot signals that it isn't fully sure and asks the user to be more specific.

Low confidence answers show a disclaimer and omit source links

After generation, a self-critique step assesses whether the answer fully, partially, or cannot address the question. This determines whether to offer human escalation and how much confidence to convey to the user.

High-stakes queries (cancellations, billing disputes, complaints) bypass normal generation entirely. They get an empathetic acknowledgment with a structured retention offer and escalation path — because a standard RAG answer to "I want to cancel my account" is the wrong response.

Latency

The full pipeline for an answerable query takes ~6–12 seconds end-to-end, averaging around 9 seconds. Almost all of that is LLM calls — local compute (embedding, reranking) is under 250ms after model warmup.

Stage	Time	Notes
Classify	~1.3s	Every query pays this
Query rewrite	~2s	Only short queries (<8 words)
Retrieve + rerank	~0.3s	Local models, fast after warmup
Generate	~5s	Longest single call
Self-critique	~1.5s	Only runs when confidence is high enough to cite sources

The cost is in the sequential LLM chain. An answerable query makes 3–4 LLM calls in sequence: classify → (rewrite) → generate → self-critique. Each call depends on the output of the previous one — classification determines whether to retrieve at all, generation produces the text that self-critique evaluates — so they cannot be parallelized. This is different from a pipeline where multiple independent checks run on the same input, which could run concurrently. Here, each step genuinely needs the previous step's output before it can start.

That's a real tradeoff. Whether 9 seconds is acceptable depends entirely on the product context. A search autocomplete can't afford a single extra LLM call. But in a support context, a wrong answer or a tone-deaf response to a frustrated user costs more than a few extra seconds of wait time — users who are stuck on a billing issue or trying to set up a domain are willing to wait longer for a good answer than they would for a quick-and-wrong one.

The pipeline prioritizes response quality over speed: the classifier prevents wasted retrieval on nonsense, the self-critique catches answers that look plausible but don't actually address the question, and the confidence gate prevents the bot from confidently citing articles it shouldn't. Each of those calls earns its latency. But the cumulative cost is real, and production would need to address it — streaming the answer to the user while self-critique runs in the background, swapping the classifier for a fine-tuned model that runs locally in <50ms, or running independent checks concurrently where the dependency graph allows it.

Failure states

Situation	Response	Reasoning
Retrieval finds nothing relevant	Clarifying follow-up question	Avoids hallucinating from weak context; keeps the conversation going
Low retrieval confidence (score 2.0–5.0)	A disclaimer appears above the answer ("I'm not fully sure this matches what you're asking…") and no source links are shown	The context is good enough to attempt an answer but not good enough to cite confidently — the disclaimer is rendered as a visually distinct element above the answer bubble so the user reads everything that follows with the right expectations
Self-critique: CANNOT_ANSWER	"I couldn't find an answer" + connect-to-agent button	Immediate escalation path instead of a vague hedge
Self-critique: PARTIALLY_ANSWERED	Answer + 1 source link (labeled "Related article") + soft inline escalation link ("Need more detail? Talk to a support agent")	Partial answers are still useful — a prominent connect-agent button would push escalation too hard, but a soft text link gives the user an easy path if the answer isn't enough
Self-critique: FULLY_ANSWERED	Answer + up to 2 source links	Confident answer with "read more" links to the full articles
High-stakes query	Empathetic acknowledgment + retention offer + escalation	Sensitive queries need tone and structure control, not a help article
Out-of-scope query	Escalation offer	Doesn't pretend to help with things it can't handle
Nonsense or irrelevant query	Polite deflection + topic suggestions	Saves retrieval and generation compute
Rate limit exceeded	HTTP 429, "try again tomorrow" in the widget	Prevents runaway API cost

These are implemented in src/pipeline.py and src/generator.py. The frontend (frontend/main.js) renders each routing state differently — source pills, connect-agent buttons, suggestion chips.

Evaluation

The bot finds the right help article about 7 times out of 10, and when it does generate an answer, that answer is almost always grounded in the retrieved content rather than hallucinated.

The pipeline is evaluated against the WixQA benchmark with retrieval and generation scored separately, so failures can be diagnosed to the right stage. I ran 7 experiments across retrieval strategy, query expansion, classifier tuning, and chunk enrichment. The best configuration is title-prepended embeddings + cross-encoder reranking.

Metric	Score	What it means
Expert Hit Rate @ 5	0.71 (n=200)	"Did the right article show up?" — 71% of the time, the correct help article appeared in the top 5 results when tested against human-written questions with known answers
Expert MRR	0.51 (n=200)	"How high did it rank?" — on average, the correct article appeared around position 2. Higher is better (1.0 = always first)
Faithfulness	4.80 / 5 (n=50)	"Did the bot make things up?" — an LLM judge scored how well answers stuck to the retrieved content. 4.8/5 means answers are almost entirely grounded, not hallucinated
Relevancy	4.76 / 5 (n=50)	"Did the bot actually answer the question?" — an LLM judge scored whether the response addressed what was asked. 4.76/5 means answers are consistently on-topic

Sample sizes are sufficient to compare configurations and identify failure patterns, but too small for tight confidence intervals — individual values could shift ±0.05 with a different sample.

The expert hit rate ceiling at 0.71 is a genuine finding: the remaining 29% are KB coverage gaps and vocabulary mismatches that persisted across all seven experiments, including approaches I tried and reverted (HyDE, always-on query expansion, larger candidate pools). The most impactful next steps would be hybrid retrieval (dense + BM25) and a larger embedding model.

Details: experiment log and evaluation methodology.

Pipeline

flowchart LR

  %% ── Ingest (one-time) ──────────────────────────────────────
  subgraph INGEST["ingest.py  (one-time)"]
    direction LR
    DS["Wix/WixQA dataset<br/>(HuggingFace)"]
    CHUNK["Chunker<br/>1 · split on \\n\\n<br/>2 · merge &lt;50 tok<br/>3 · split &gt;300 tok at sentences<br/>4 · 50-token sliding overlap"]
    EMBED_I["Embed chunks<br/>all-MiniLM-L6-v2"]

    DS -->|"title + body"| CHUNK --> EMBED_I
  end

  EMBED_I --> CHROMA[("ChromaDB<br/>collection: wix_kb")]

  %% ── Runtime ────────────────────────────────────────────────
  USER["User question"] --> API["FastAPI<br/>POST /ask"] --> RL{"Rate limit<br/>≤100 req/day"}

  RL -- "limit exceeded" --> ERR429["HTTP 429"]
  RL -- "ok" --> CLF{"Classifier<br/>gpt-4o-mini<br/>5 categories"}

  CLF -- "nonsense /<br/>irrelevant" --> STATIC["Static response<br/>+ topic suggestions"]
  CLF -- "out of scope" --> OOS["Escalation offer"]
  CLF -- "answerable /<br/>high-stakes" --> QR["Query rewriter<br/>gpt-4o-mini<br/>(short queries only)"]

  QR --> EMBED_R["Embed query<br/>all-MiniLM-L6-v2"] --> VSEARCH["Vector search<br/>ChromaDB · top-20"]
  VSEARCH <--> CHROMA
  VSEARCH -- "20 candidates" --> RERANK["Cross-encoder reranker<br/>ms-marco-MiniLM-L6-v2 · top-5"]

  RERANK --> REL{"Relevance gate<br/>score ≥ 2.0?"}

  REL -- "yes" --> HS_CHECK{"High-stakes?"}
  REL -- "no" --> HS_CHECK_NO{"High-stakes?"}

  HS_CHECK -- "yes" --> HS["High-stakes response<br/>empathy + context<br/>+ escalation"] --> RESP["Response<br/>to user"]
  HS_CHECK -- "no" --> GEN["Generator<br/>gpt-4o-mini"] --> CONF{"Confidence gate<br/>score ≥ 5.0?"}

  HS_CHECK_NO -- "yes" --> HS_NO["High-stakes response<br/>empathy + escalation<br/>(no context)"] --> RESP
  HS_CHECK_NO -- "no" --> FOLLOWUP["Follow-up question<br/>ask for clarification"] --> RESP

  CONF -- "low confidence" --> LC["Answer + disclaimer<br/>no sources"] --> RESP
  CONF -- "confident" --> CRITIQUE{"Self-critique<br/>gpt-4o-mini"}

  CRITIQUE -- "CANNOT_ANSWER" --> CA["Escalation message<br/>+ connect-agent"] --> RESP
  CRITIQUE -- "PARTIALLY_ANSWERED" --> PA["Answer + 1 source<br/>+ soft escalation link"] --> RESP
  CRITIQUE -- "FULLY_ANSWERED" --> FA["Answer + up to 2 sources"] --> RESP

  STATIC --> RESP
  OOS --> RESP
  RESP --> USER

  classDef store     fill:#e8f4fd,stroke:#3b82f6,color:#1e3a5f
  classDef llm       fill:#fef3c7,stroke:#f59e0b,color:#78350f
  classDef gate      fill:#f3f4f6,stroke:#6b7280,color:#111827
  classDef endpoint  fill:#ede9fe,stroke:#7c3aed,color:#3b0764
  classDef io        fill:#dcfce7,stroke:#16a34a,color:#14532d

  class CHROMA store
  class CLF,QR,GEN,FOLLOWUP,HS,HS_NO llm
  class RL,HS_CHECK,HS_CHECK_NO,REL,CONF,CRITIQUE gate
  class API endpoint
  class USER,RESP io

Stack

Layer	Tech	What it does
Chunking	Paragraph split + merge/split heuristics + 50-token overlap	Breaks help articles into retrievable pieces while preserving natural structure
Embedding	`sentence-transformers/all-MiniLM-L6-v2` (local)	Converts text to vectors for similarity search, runs locally with no API cost
Vector store	ChromaDB (persistent, local)	Stores and searches article chunks by similarity, zero infrastructure to set up
Reranker	`cross-encoder/ms-marco-MiniLM-L6-v2`	Two-pass search: fast scan of all articles, then precise re-scoring of top candidates so the right answer surfaces even when the user's phrasing doesn't match the article's wording
LLM	OpenAI `gpt-4o-mini` via `pydantic-ai`	Generates answers, classifies queries, and self-critiques responses with structured output
API	FastAPI + CORS + static files	Serves the pipeline and frontend; preloads models at startup to eliminate cold-start delays
Frontend	Vanilla HTML/CSS/JS chat widget	Renders each response type differently: source links, escalation buttons, suggestion chips
Observability	Langfuse (traces + prompt versioning)	Shows exactly what happened on every request; prompts can be updated without code changes

Key design decisions (detailed reasoning in ARCHITECTURE.md):

No framework — every pipeline stage is written explicitly so the design decisions are visible and debuggable, not hidden behind abstractions
Two-stage retrieval — fast vector search for recall, then a cross-encoder reranker for precision; this was the single largest quality improvement in evaluation
Self-critique over generic caveats — a structured LLM assessment (FULLY / PARTIALLY / CANNOT) drives specific UI behavior rather than appending "please verify" to every answer

What changes for production

Area	Demo	Production
Vector store	ChromaDB (local file)	pgvector or managed service (Pinecone, Qdrant)
Retrieval	Dense only	Hybrid: dense + BM25 with RRF
Embeddings	Local `all-MiniLM-L6-v2`	Embedding API or domain-adapted model
Query expansion	8-word gated rewriter	Selective expansion based on learned heuristics
Classification	LLM call (gpt-4o-mini)	Fine-tuned small classifier, <50 ms
Evaluation	Offline script, LLM-as-judge on 50 questions	CI-gated regression suite on prompt/retrieval changes
Rate limiting	JSON file	Redis `INCR` + `EXPIRE`, per-user
Prompt caching	Process lifetime	TTL-based (60–300 s), A/B tested via Langfuse datasets
Agent init	Lazy, per-module	Startup lifespan event, shared across workers
High-stakes	Fictional retention, mock queue	CRM-driven offers, real ticketing / live chat integration
Conversation context	Stateless (single turn)	Sliding window stored in Redis, session ID on API
Streaming	None	Server-Sent Events, token-by-token
Frontend	Vanilla JS widget	React + TypeScript
Auth	None	API key or OAuth, per-user quotas
Observability	Langfuse traces	Langfuse + structured logging + latency metrics per stage

Setup, API, project structure, and environment variables

Setup

# Create venv and install dependencies
uv venv && uv sync

# Set environment variables
cp .env.example .env
# Fill in OPENAI_API_KEY (required) and optionally LANGFUSE_* keys

# Ingest the dataset (one-time, idempotent — skips if already populated)
uv run python src/ingest.py

# Register prompts with Langfuse (optional — run once, or after any prompt edit)
uv run python prompts/seed.py

# Start the API + frontend
uv run uvicorn api.main:app --reload
# Visit http://localhost:8000

API

# Health check
curl http://localhost:8000/health

# Ask a question
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I connect a custom domain to my Wix site?"}'

Response:

{
  "answer": "To connect a custom domain...",
  "sources": ["Connecting a Domain to Your Wix Site"],
  "routing": "answered"
}

The routing field tells the frontend how to render the response:

Value	Meaning
`answered`	Full pipeline ran; answer grounded in retrieved content
`partially_answered`	Self-critique found a partial answer; 1 source link + soft escalation link
`followup`	Retrieval found nothing useful; bot asks a clarifying question
`low_confidence`	Reranker scores too low; disclaimer shown above answer, no sources
`cannot_answer`	Self-critique determined context can't answer; escalation offered
`high_stakes`	Cancellation/dispute/complaint; empathetic response + escalation
`out_of_scope`	Beyond assistant scope; escalation offered
`irrelevant`	Off-topic; topic suggestions shown
`nonsense`	No discernible intent; topic suggestions shown

Project structure

src/
  ingest.py          # Dataset loading, chunking, ChromaDB population
  retriever.py       # Query embedding + vector search (top-20 candidates)
  reranker.py        # Cross-encoder reranking → top-5 chunks
  classifier.py      # 5-category query classifier (pydantic-ai)
  query_rewriter.py  # Query expansion for short queries (<8 words)
  generator.py       # Answer generation, self-critique, confidence signaling
  pipeline.py        # Orchestrates the full request pipeline
  rate_limit.py      # Daily 100-request cap (rate_limit.json)
  prompts.py         # Langfuse prompt fetching + OTel prompt linkage
prompts/
  *.txt              # Prompt source files (version-controlled)
  seed.py            # Registers prompts with Langfuse
api/
  main.py            # FastAPI app with CORS and static file mount
frontend/
  index.html         # Mock SaaS dashboard + chat widget
  style.css          # All styles (CSS variables, no framework)
  main.js            # Widget logic, markdown rendering, routing handlers
eval/
  evaluate.py        # Retrieval + generation evaluation against WixQA benchmark
  EXPERIMENTS.md     # Full experiment log with results and root cause analysis
  README.md          # Evaluation methodology and metric definitions
tests/
  test_chunking.py   # Chunking pipeline unit tests
  test_generator.py  # Generator helpers and source link logic tests
  test_pipeline.py   # Routing decision tests (mocked LLM calls)
  test_metrics.py    # Evaluation metric function tests

Environment variables

Variable	Required	Description
`OPENAI_API_KEY`	Yes	OpenAI API key
`LANGFUSE_PUBLIC_KEY`	No	Langfuse public key for tracing
`LANGFUSE_SECRET_KEY`	No	Langfuse secret key for tracing
`LANGFUSE_HOST`	No	Langfuse host (defaults to cloud)

Tracing and prompt versioning degrade gracefully if Langfuse keys are not set — the pipeline falls back to local prompt files.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAG Support Assistant

Limitations

How it works

Latency

Failure states

Evaluation

Pipeline

Stack

What changes for production

Setup

API

Project structure

Environment variables

Further reading

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
api		api
eval		eval
frontend		frontend
images		images
prompts		prompts
src		src
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
ARCHITECTURE.md		ARCHITECTURE.md
README.md		README.md
main.py		main.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

RAG Support Assistant

Limitations

How it works

Latency

Failure states

Evaluation

Pipeline

Stack

What changes for production

Setup

API

Project structure

Environment variables

Further reading

About

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages