Upload documents. Ask questions. Get cited answers.
A production-grade Retrieval-Augmented Generation system with real-time streaming, multi-provider LLM support, and built-in answer evaluation.
- Full RAG Pipeline — document ingestion, chunking, embedding, retrieval, and generation in one system
- Real-Time Streaming — token-by-token answer delivery over WebSocket with chain-of-thought reasoning
- Multi-Provider LLM — OpenAI, Anthropic, and Ollama with automatic provider detection
- Answer Evaluation — RAGAS-inspired faithfulness, relevancy, and precision scoring (LLM-as-judge)
- 7 File Formats — PDF, DOCX, TXT, Markdown, HTML, CSV, JSON
- Persistent Storage — ChromaDB vector store + SQLite for conversations, messages, and sources
- No API Keys Required — runs fully standalone with local embeddings and dummy LLM fallback
┌─────────────────────────────────────────────────────────────────────────┐
│ INDEXING PIPELINE │
│ │
│ 📄 Document → Loader → Chunker → ChromaDB Embedder → HNSW │
│ (7 formats) │ (recursive) (all-MiniLM-L6-v2) Index │
│ │ │
│ └──→ SQLite (metadata, doc records) │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ QUERY PIPELINE │
│ │
│ ❓ Question → Embed → Cosine Search → Top-K Chunks │
│ │ │
│ ┌─────────────────────┘ │
│ ▼ │
│ ┌─── Reasoning (CoT) ───┐ │
│ │ gpt-4.1-nano │ │
│ └───────────┬───────────┘ │
│ ▼ │
│ ┌─── Generation ────────┐ │
│ │ user-selected model │──→ 💬 Streaming Answer │
│ └───────────┬───────────┘ with Source Citations │
│ ▼ │
│ ┌─── Evaluation ────────┐ │
│ │ Faithfulness check │──→ 📊 Quality Score │
│ └───────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
The RAGBackend facade orchestrates all components and persists state via app.state, so documents indexed through /upload are immediately searchable via /chat.
| 🖥️ Frontend | React 19 · TypeScript · Vite 7 · Tailwind CSS v4 · shadcn/ui · TanStack Query |
| ⚙️ Backend | Python · FastAPI · Uvicorn · Pydantic v2 · SQLModel |
| 🔍 Retrieval | ChromaDB (HNSW index) · all-MiniLM-L6-v2 embeddings (384-dim) · Cosine similarity |
| 🤖 LLM | OpenAI (GPT-5, GPT-4.1) · Anthropic (Claude) · Ollama (Llama 3, Mistral, local models) |
| 📄 Parsing | pypdf · python-docx · BeautifulSoup4 |
| 💾 Storage | SQLite (conversations, messages, evaluations) · ChromaDB (vector persistence) |
| 📦 Deploy | Docker Compose · Multi-stage builds |
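
To make the retrieval row concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. It is a brute-force illustration only: ChromaDB answers the same kind of query through its HNSW index, which is approximate and far faster. The function names `cosine_similarity` and `top_k` are hypothetical and do not come from this repository, and the 2-dimensional vectors stand in for the 384-dimensional all-MiniLM-L6-v2 embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 5):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors for illustration only.
chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = top_k([1.0, 0.0], chunks, k=2)
```

Sorting by descending score mirrors the score-based ordering in the sources panel, where the most relevant chunks surface first.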
- Drag-and-drop upload with instant auto-indexing — no manual "upload" button needed
- 7 supported formats: PDF, DOCX, TXT, Markdown, HTML, CSV, JSON (50 MB max)
- 3 chunking strategies: fixed-size, recursive (paragraph-aware), semantic (sentence-level)
- Smart filtering: strips chunks under 20 chars or with >15% dots (table-of-contents noise)
- Idempotent upsert: re-uploading the same document overwrites existing chunks
- WebSocket protocol delivers tokens in real-time as they generate
- Two-phase generation: lightweight reasoning pass (gpt-4.1-nano) followed by full answer
- Chain-of-thought panel: collapsible display of the model's reasoning process with timing
- Markdown rendering: GitHub Flavored Markdown with syntax-highlighted code blocks
- Conversation history: sliding window context (5 turns) for follow-up questions
- Resizable sources panel with per-chunk relevance scores
- Collapsible source cards showing document name, excerpt, and chunk/doc IDs
- Score-based ordering: most relevant chunks surface first
- Full traceability: every answer links back to specific document chunks
- Persistent conversations stored in SQLite with full message history
- Pin, rename, search, export (Markdown), and share (read-only public link)
- Auto-generated titles from first user message
- Delete confirmation dialog to prevent accidental loss
- Faithfulness: decomposes answer into atomic claims, validates each against retrieved context
- Answer Relevancy: judges whether the answer addresses the question
- Context Precision: assesses which retrieved chunks were actually useful
- LLM-as-judge: separate evaluation model (gpt-4.1-mini) to avoid self-evaluation bias
- Real-time: faithfulness runs automatically after each answer; full evaluation on-demand
- Collection statistics: document count, total chunks, storage size, file type breakdown
- Sortable table with filename, chunk count, upload date, and file size
- Chunk inspector: view individual chunks to verify chunking quality
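
The smart-filtering rule listed earlier (drop chunks under 20 characters or with more than 15% dots) can be sketched as follows. This is an illustration under assumptions, not the project's code: the names `MIN_CHUNK_CHARS`, `MAX_DOT_RATIO`, `is_noise`, and `filter_chunks` are hypothetical, and measuring the dot ratio against the whitespace-stripped length is a guess.

```python
MIN_CHUNK_CHARS = 20   # hypothetical name; threshold taken from the feature list
MAX_DOT_RATIO = 0.15   # hypothetical name; threshold taken from the feature list


def is_noise(chunk: str) -> bool:
    """True for chunks that are too short or dominated by dot leaders."""
    text = chunk.strip()
    if len(text) < MIN_CHUNK_CHARS:
        return True
    dot_ratio = text.count(".") / len(text)  # assumption: ratio over stripped length
    return dot_ratio > MAX_DOT_RATIO


def filter_chunks(chunks: list[str]) -> list[str]:
    """Keep only chunks that pass the noise check."""
    return [chunk for chunk in chunks if not is_noise(chunk)]


kept = filter_chunks([
    "tiny",
    "Introduction ........................ 3",
    "A full sentence that is long enough to keep.",
])
```

The dot-ratio test targets table-of-contents lines, whose leader dots would otherwise pollute the index with near-empty chunks.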
git clone https://github.com/mohamed-elkholy95/rag-document-qa.git
cd rag-document-qa
docker compose up --build

| Service | URL |
|---|---|
| Frontend | localhost:3000 |
| API | localhost:8001 |
| Swagger Docs | localhost:8001/docs |
docker compose --profile observability up

Phoenix UI is available at http://localhost:6006. The FastAPI backend
will export per-stage spans (rag.retrieve, rag.generate) with token
counts and cost as span attributes. If Phoenix isn't running, the app
works normally — span export silently fails.
# Backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m src.api.main # API on :8001
# Frontend (separate terminal)
cd frontend && npm install
npm run dev # Vite on :5173

The system runs without any API keys. Retrieval works with local ChromaDB embeddings, and the LLM falls back to dummy responses. To enable real answers:
# Pick one or more:
export OPENAI_API_KEY="sk-..." # GPT-5, GPT-4.1, o-series
export ANTHROPIC_API_KEY="sk-ant-..." # Claude Opus, Sonnet, Haiku
# Or run a local model with Ollama (no key needed):
ollama pull llama3

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/api/upload` | Upload and index a single document |
| `POST` | `/api/upload/batch` | Batch upload multiple files |
| `GET` | `/api/documents` | List all indexed documents |
| `GET` | `/api/documents/{doc_id}/chunks` | Inspect individual chunks |
| `DELETE` | `/api/documents/{doc_id}` | Delete document and its chunks |
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/api/query` | Synchronous RAG query → answer + sources |
| `WS` | `/api/chat` | Streaming chat via WebSocket |
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/conversations` | List all conversations |
| `POST` | `/api/conversations` | Create a new conversation |
| `GET` | `/api/conversations/{id}` | Get conversation with messages |
| `PATCH` | `/api/conversations/{id}` | Rename or pin a conversation |
| `DELETE` | `/api/conversations/{id}` | Delete (cascades messages + sources) |
| `GET` | `/api/conversations/search?q=` | Search conversations |
| `GET` | `/api/conversations/{id}/export` | Export as Markdown |
| `POST` | `/api/conversations/{id}/share` | Generate a share token |
| `GET` | `/api/shared/{token}` | View shared conversation (read-only) |
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/api/messages/{id}/evaluate` | Run 3-metric evaluation |
| `GET` | `/api/messages/{id}/evaluation` | Get stored evaluation scores |
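
The faithfulness metric behind these endpoints is described as decomposing an answer into atomic claims and validating each against the retrieved context. A minimal sketch of the scoring step, assuming the score is the fraction of supported claims, might look like this. The names `faithfulness_score` and `naive_judge` are hypothetical; in the real system the judge would be an LLM call (gpt-4.1-mini per the configuration), whereas the substring check below is only a stand-in so the example runs offline.

```python
from typing import Callable


def faithfulness_score(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of claims that the judge marks as supported by the context."""
    if not claims:
        return 0.0  # assumption: an answer with no claims scores zero
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_judge(claim: str, context: str) -> bool:
    """Placeholder judge: case-insensitive substring match, not an LLM."""
    return claim.lower() in context.lower()


context = "ChromaDB stores vectors. SQLite stores conversations."
claims = ["ChromaDB stores vectors", "SQLite stores embeddings"]
score = faithfulness_score(claims, context, naive_judge)
```

Passing the judge in as a callable keeps the arithmetic separate from the model call, which makes the scoring logic easy to test with deterministic fixtures.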
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Liveness check |
The /api/chat endpoint streams responses through structured JSON events:
src/
├── api/
│ ├── main.py # FastAPI app with lifespan, CORS, routers
│ ├── models.py # Pydantic v2 request/response schemas
│ ├── dependencies.py # Dependency injection helpers
│ └── routes/
│ ├── upload.py # File upload + validation
│ ├── query.py # REST query + WebSocket streaming
│ ├── documents.py # Document CRUD + chunk inspection
│ ├── conversations.py # Conversation CRUD + search/export/share
│ └── evaluation.py # On-demand evaluation endpoints
├── models/
│ ├── conversation.py # Conversation table (cascade relationships)
│ ├── message.py # Message + MessageSource tables
│ ├── document.py # DocumentRecord metadata
│ └── evaluation.py # MessageEvaluation scores
├── backend.py # RAGBackend — stateful orchestration facade
├── config.py # Centralized configuration constants
├── database.py # SQLite + SQLModel setup
├── document_loader.py # Multi-format parser (7 file types)
├── llm_handler.py # Multi-provider LLM adapter with streaming
├── vector_store.py # ChromaDB wrapper (embeddings + search)
├── evaluation.py # RAGAS-inspired scoring (3 metrics)
└── generator.py # Prompt templates + context assembly
frontend/src/
├── pages/
│ ├── chat.tsx # Streaming chat with resizable sources panel
│ ├── upload.tsx # Drag-and-drop upload with auto-indexing
│ ├── documents.tsx # Document library + chunk inspector
│ └── shared.tsx # Read-only shared conversation view
├── components/
│ ├── chat/ # ChatThread, ChatMessage, ThinkingPanel, SourcesPanel
│ ├── upload/ # Dropzone, FileQueue
│ ├── documents/ # DocTable, DocStats, ChunkViewer
│ └── layout/ # AppLayout, Sidebar (model selector, top-K, stats)
├── hooks/ # useChat, useConversations, useDocuments, useSettings
└── api/ # Typed HTTP + WebSocket client
All settings are centralized in src/config.py:
| Setting | Default | Description |
|---|---|---|
| `CHUNK_SIZE` | 500 | Characters per chunk |
| `CHUNK_OVERLAP` | 50 | Overlap between adjacent chunks |
| `TOP_K_RESULTS` | 5 | Number of chunks retrieved per query |
| `DEFAULT_MODEL` | `gpt-5-mini` | Default answer generation model |
| `REASONING_MODEL` | `gpt-4.1-nano` | Chain-of-thought model (lightweight) |
| `EVAL_MODEL` | `gpt-4.1-mini` | Evaluation judge model |
| `SLIDING_WINDOW_SIZE` | 5 | Conversation turns kept in context |
| `API_PORT` | 8001 | Server port |
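
To show how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a sketch of the fixed-size strategy, the simplest of the three chunkers. The default recursive chunker additionally respects paragraph and sentence boundaries and is not shown. The function name `fixed_size_chunks` is hypothetical and the implementation is an assumption, not code from `src/`.

```python
CHUNK_SIZE = 500     # characters per chunk (default from the settings table)
CHUNK_OVERLAP = 50   # characters shared by adjacent chunks


def fixed_size_chunks(
    text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP
) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# 1000 characters with the defaults give windows starting at 0, 450, and 900.
sample = "".join(chr(97 + i % 26) for i in range(1000))
pieces = fixed_size_chunks(sample)
```

With the defaults, each chunk repeats the last 50 characters of its predecessor, so a sentence cut at a boundary still appears whole in at least one chunk most of the time.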
Provider is auto-detected from the model name:
| Prefix | Provider | API Key |
|---|---|---|
| `gpt-*`, `o1-*`, `o3-*` | OpenAI | `OPENAI_API_KEY` |
| `claude-*` | Anthropic | `ANTHROPIC_API_KEY` |
| Everything else | Ollama (local) | None needed |
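
The prefix rules in this table translate into a few lines of Python. The sketch below is one possible reading of them, not the code in `llm_handler.py`: the function name `detect_provider`, the lowercase provider labels, and the case-insensitive matching are all assumptions.

```python
from typing import Optional

OPENAI_PREFIXES = ("gpt-", "o1-", "o3-")

# Environment variable each provider needs; None means no key is required.
API_KEY_ENV: dict[str, Optional[str]] = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "ollama": None,
}


def detect_provider(model_name: str) -> str:
    """Map a model name to a provider label using the prefix table."""
    name = model_name.lower()  # assumption: matching ignores case
    if name.startswith(OPENAI_PREFIXES):
        return "openai"
    if name.startswith("claude-"):
        return "anthropic"
    return "ollama"  # everything else is treated as a local model


provider = detect_provider("gpt-5-mini")
required_key = API_KEY_ENV[provider]
```

Because the fallback is Ollama, a mistyped hosted model name would be routed to the local provider rather than rejected, which is worth keeping in mind when debugging.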
| Decision | Rationale |
|---|---|
| ChromaDB over Qdrant/Pinecone | Zero external services — persistent vector store that runs in-process |
| Recursive chunking as default | Preserves paragraph and sentence boundaries vs. naive fixed-size windows |
| Separate reasoning model | Cheap CoT pass (gpt-4.1-nano) avoids doubling cost on the main model |
| Separate evaluation model | Prevents self-evaluation bias — a different model judges the answer |
| WebSocket streaming | Token-by-token delivery for responsive UX vs. waiting for full response |
| SQLite + SQLModel | Zero-ops relational storage — conversations, messages, and sources persist across restarts |
| React over Streamlit | Full control over real-time streaming, layout, and component architecture |
| Multi-provider adapter | Swap models by name with zero code changes — same interface for all providers |
| Idempotent document upsert | Re-uploading overwrites by content hash — no duplicate chunks |
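
The last row, overwriting by content hash, can be illustrated with deterministic chunk IDs. In this sketch a plain dict stands in for the vector store, and the helper names (`document_id`, `chunk_ids`, `upsert`), the SHA-256 choice, the 16-character truncation, and the `{doc_id}-{index}` ID format are assumptions made for illustration.

```python
import hashlib


def document_id(content: bytes) -> str:
    """Derive a stable ID from the file bytes (assumed: truncated SHA-256)."""
    return hashlib.sha256(content).hexdigest()[:16]


def chunk_ids(content: bytes, n_chunks: int) -> list[str]:
    """One deterministic ID per chunk, so the same file yields the same IDs."""
    doc_id = document_id(content)
    return [f"{doc_id}-{index}" for index in range(n_chunks)]


def upsert(store: dict[str, str], content: bytes, chunks: list[str]) -> None:
    """Insert or overwrite chunks keyed by their deterministic IDs."""
    for cid, chunk in zip(chunk_ids(content, len(chunks)), chunks):
        store[cid] = chunk


store: dict[str, str] = {}
file_bytes = b"hello world, this is a document"
upsert(store, file_bytes, ["hello world,", "this is a document"])
upsert(store, file_bytes, ["hello world,", "this is a document"])  # re-upload
```

After the second call the store still holds two entries, because identical bytes map to identical IDs. ChromaDB collections expose an `upsert` method with the same insert-or-overwrite semantics.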
python -m pytest tests/ -v # Full test suite
python -m pytest tests/ --cov=src --cov-report=term-missing # With coverage report
python -m pytest tests/test_evaluation.py -v # Single module

Tests use ChromaDB's EphemeralClient for isolated vector store testing and deterministic fixtures from conftest.py.
All optional. The system runs fully without any keys using local ChromaDB embeddings and dummy LLM responses.
| Variable | Purpose |
|---|---|
| `OPENAI_API_KEY` | OpenAI models (GPT-5, GPT-4.1, o-series) |
| `ANTHROPIC_API_KEY` | Anthropic models (Claude Opus, Sonnet, Haiku) |
Mohamed Elkholy — GitHub
Built with FastAPI · React · ChromaDB · scikit-learn