A production-grade, fully offline Retrieval-Augmented Generation system built for Apple M1. Combines a Go ingestion service for reliable file watching with a Python RAG core for AI-powered document search and question answering.
- Fully offline — runs entirely on-device using local GGUF models; optionally use Groq cloud API for faster inference
- Web dashboard — chat, document management, and folder watching via browser UI
- Incremental indexing — content hashing ensures only changed documents are re-processed
- Structured chunking — section-aware splitting with token-limit fallback
- Reranking — optional cross-encoder reranking for higher answer quality
- Debounced ingestion — avoids duplicate indexing from rapid file events
- File stability checks — prevents partial reads during file writes
- Dynamic folder watching — add/remove watched directories via UI without restarting
- Metadata tracking — SQLite-backed document and chunk lineage
┌──────────────────────┐ ┌──────────────────────────────────┐ ┌──────────────┐
│ Go Ingestion Service│ │ Python RAG Core (FastAPI) │ │ Web Dashboard │
│ │ │ │ │ │
│ File Watcher │───▶│ Parser ─▶ Chunker ─▶ Embedder │◀───│ Chat │
│ Event Normalizer │ │ Vector Store (Qdrant) │ │ Documents │
│ Debouncer │ │ Retrieval ─▶ Reranker ─▶ LLM │ │ Folders │
│ Stability Checker │ │ Metadata DB (SQLite) │ │ Upload │
│ Content Hasher │ │ Dashboard API │ │ │
│ Job Queue + Worker │ │ │ │ │
│ Dynamic Folder Poll │ │ │ │ │
└──────────────────────┘ └──────────────────────────────────┘ └──────────────┘
| Component | Technology |
|---|---|
| LLM | Phi-3 / Gemma / Mistral / Llama 3 (GGUF via llama.cpp) or Groq cloud API |
| Embeddings | BGE-small-en-v1.5 (384 dim) |
| Reranker | BGE-reranker-base (CrossEncoder) |
| Vector DB | Qdrant (Docker) |
| Metadata DB | SQLite (WAL mode) |
| API | FastAPI + Uvicorn |
| UI | Static HTML/CSS/JS (served by FastAPI) |
| File Watcher | Go + fsnotify |
| Hashing | xxhash |
- macOS (Apple Silicon)
- Python 3.13 (Homebrew:
brew install python@3.13) - Go 1.21+
- Docker Desktop
# 1. Clone and enter the project
cd local-rag-system
# 2. Full setup (venv, dependencies, Qdrant, SQLite, Go)
make setup
# 3. Download models
# LLM (~2.3GB)
curl -L -o models/llm/Phi-3-mini-4k-instruct-q4.gguf \
"https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf"
# Reranker (~1.1GB, optional)
make download-reranker
# 4. Start services (in separate terminals)
make python-api # Terminal 1: FastAPI on :8000
make go-watcher # Terminal 2: Go file watcher
# Or start everything in the background:
make run-allAfter starting the API, open http://localhost:8000 in your browser. The dashboard has three tabs:
Ask questions about your indexed documents. Answers include source citations with relevance scores, powered by the full RAG pipeline (embed → retrieve → rerank → generate).
View all indexed documents with chunk counts and timestamps. Upload files via drag-and-drop (PDF, DOCX, TXT, MD) or delete individual documents from the index.
See all directories being monitored by the Go file watcher. Add new folders (existing files are auto-scanned immediately) or remove them (full index cleanup).
- The default
data/documents/folder is always present and cannot be removed - The Go watcher polls for folder changes every 30 seconds (no restart needed)
Drop files into data/documents/ (the Go watcher picks them up automatically), or call the API directly:
curl -s -X POST http://localhost:8000/ingest \
-H "Content-Type: application/json" \
-d '{"file_path": "/absolute/path/to/document.pdf", "event_type": "create", "file_hash": ""}' \
| python3 -m json.toolcurl -s -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "What is this document about?"}' \
| python3 -m json.toolcurl -s http://localhost:8000/health | python3 -m json.toolcurl -s -X POST http://localhost:8000/reindex | python3 -m json.tool| Command | Description |
|---|---|
make setup |
Full setup (Python, Docker, SQLite, Go) |
make python-api |
Start the FastAPI server on :8000 |
make go-watcher |
Start the Go file watcher |
make run-all |
Start all services in background with logs |
make stop |
Stop all background services |
make download-reranker |
Download the reranker model (~1.1GB) |
make start-qdrant |
Start Qdrant container |
make stop-qdrant |
Stop Qdrant container |
make build-go |
Build Go binary |
make health |
Check API health |
make ingest-test |
Ingest a sample test document |
make query-test |
Run a sample query |
make reindex |
Force reindex all documents |
make clean |
Remove build artifacts and caches |
make clean-all |
Remove venv, DB, and Docker volumes |
All settings live in python-core/config.py:
| Setting | Default | Description |
|---|---|---|
LLM_PROVIDER |
"local" |
"local" (GGUF) or "groq" (cloud API) |
LLM_MODEL_FILENAME |
"Phi-3-mini-4k-instruct-q4.gguf" |
GGUF filename in models/llm/ (local only) |
GROQ_MODEL |
"llama-3.3-70b-versatile" |
Groq model ID (groq only) |
USE_RERANKER |
False |
Enable/disable cross-encoder reranking |
CHUNK_SIZE |
400 |
Token limit per chunk |
CHUNK_OVERLAP |
80 |
Overlap tokens between chunks |
RETRIEVAL_TOP_K |
20 |
Candidates fetched from vector search |
RERANK_TOP_K |
5 |
Final chunks after reranking |
MAX_CONTEXT_TOKENS |
2000 |
Token budget for LLM context |
LLM_MAX_TOKENS |
512 |
Max tokens in LLM response |
LLM_TEMPERATURE |
0.1 |
LLM sampling temperature |
- Set
LLM_PROVIDER = "groq"and optionallyGROQ_MODELinpython-core/config.py - Export your API key before starting the server:
export GROQ_API_KEY=gsk_...
make python-apiGet a free key at console.groq.com. Popular model IDs:
llama-3.3-70b-versatile— general purposellama-3.1-8b-instant— ultra-fastmixtral-8x7b-32768— large 32k contextgemma2-9b-it— Google Gemma 2
Place any GGUF file in models/llm/ and update LLM_MODEL_FILENAME in config.py. Stop tokens and context size are auto-detected from the filename — no other changes needed.
local-rag-system/
├── python-core/ # Python RAG core
│ ├── api/ # FastAPI endpoints
│ │ ├── main.py # App entry + static file serving
│ │ ├── ingest_endpoint.py
│ │ ├── query_endpoint.py
│ │ └── dashboard_endpoint.py # Documents, folders, upload API
│ ├── chunking/ # Structure-aware document chunker
│ ├── db/ # SQLite metadata (documents, chunks, watched_folders)
│ ├── embeddings/ # BGE embedding service
│ ├── ingestion/ # Incremental ingestion pipeline
│ ├── llm/ # llama.cpp LLM service
│ ├── parsers/ # PDF, DOCX, TXT parsers
│ ├── reranker/ # Cross-encoder reranker
│ ├── retrieval/ # Vector search + prompt builder
│ ├── vector_store/ # Qdrant client
│ ├── config.py # Central configuration
│ └── requirements.txt # Python dependencies
├── go-ingestion/ # Go file watcher service
│ ├── watcher/ # fsnotify file watcher (dynamic folder support)
│ ├── normalizer/ # Event normalization
│ ├── debouncer/ # Event deduplication
│ ├── stability/ # File stability checks
│ ├── hashing/ # Content hashing (xxhash)
│ ├── queue/ # Job queue
│ ├── worker/ # Ingestion worker (calls Python API)
│ └── config/ # Go configuration
├── ui/static/ # Web dashboard (served by FastAPI)
│ ├── index.html # Single-page app shell
│ ├── style.css # Dark theme styles
│ └── app.js # Client-side logic
├── data/documents/ # Default watched folder for ingestion
├── models/ # Downloaded model files
│ ├── llm/ # GGUF model
│ ├── embeddings/ # Sentence transformer cache
│ └── reranker/ # Reranker model cache
├── docs/ # Phase documentation
├── docker-compose.yml # Qdrant service
├── Makefile # All project commands
└── plan.md # Original system design
- PDF (
.pdf) - Microsoft Word (
.docx) - Plain text (
.txt)
With 8GB RAM on M1, all three models can run simultaneously but memory is tight:
| Model | Size |
|---|---|
| BGE-small-en-v1.5 | ~130 MB |
| BGE-reranker-base | ~1.1 GB |
| Phi-3 Mini Q4 (GGUF) | ~2.3 GB |
If memory is constrained, set USE_RERANKER = False in config.py to skip loading the reranker model.


