Local RAG System

A production-grade, fully offline Retrieval-Augmented Generation system built for Apple M1. Combines a Go ingestion service for reliable file watching with a Python RAG core for AI-powered document search and question answering.

Features

Fully offline — runs entirely on-device using local GGUF models; optionally use Groq cloud API for faster inference
Web dashboard — chat, document management, and folder watching via browser UI
Incremental indexing — content hashing ensures only changed documents are re-processed
Structured chunking — section-aware splitting with token-limit fallback
Reranking — optional cross-encoder reranking for higher answer quality
Debounced ingestion — avoids duplicate indexing from rapid file events
File stability checks — prevents partial reads during file writes
Dynamic folder watching — add/remove watched directories via UI without restarting
Metadata tracking — SQLite-backed document and chunk lineage

Architecture

┌──────────────────────┐     ┌──────────────────────────────────┐     ┌──────────────┐
│  Go Ingestion Service│     │  Python RAG Core (FastAPI)       │     │  Web Dashboard │
│                      │     │                                  │     │              │
│  File Watcher        │───▶│  Parser ─▶ Chunker ─▶ Embedder  │◀───│  Chat        │
│  Event Normalizer    │     │  Vector Store (Qdrant)           │     │  Documents    │
│  Debouncer           │     │  Retrieval ─▶ Reranker ─▶ LLM   │     │  Folders      │
│  Stability Checker   │     │  Metadata DB (SQLite)            │     │  Upload       │
│  Content Hasher      │     │  Dashboard API                   │     │              │
│  Job Queue + Worker  │     │                                  │     │              │
│  Dynamic Folder Poll │     │                                  │     │              │
└──────────────────────┘     └──────────────────────────────────┘     └──────────────┘

Tech Stack

Component	Technology
LLM	Phi-3 / Gemma / Mistral / Llama 3 (GGUF via llama.cpp) or Groq cloud API
Embeddings	BGE-small-en-v1.5 (384 dim)
Reranker	BGE-reranker-base (CrossEncoder)
Vector DB	Qdrant (Docker)
Metadata DB	SQLite (WAL mode)
API	FastAPI + Uvicorn
UI	Static HTML/CSS/JS (served by FastAPI)
File Watcher	Go + fsnotify
Hashing	xxhash

Prerequisites

macOS (Apple Silicon)
Python 3.13 (Homebrew: brew install python@3.13)
Go 1.21+
Docker Desktop

Quick Start

# 1. Clone and enter the project
cd local-rag-system

# 2. Full setup (venv, dependencies, Qdrant, SQLite, Go)
make setup

# 3. Download models
#    LLM (~2.3GB)
curl -L -o models/llm/Phi-3-mini-4k-instruct-q4.gguf \
  "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf"

#    Reranker (~1.1GB, optional)
make download-reranker

# 4. Start services (in separate terminals)
make python-api      # Terminal 1: FastAPI on :8000
make go-watcher      # Terminal 2: Go file watcher

# Or start everything in the background:
make run-all

Web Dashboard

After starting the API, open http://localhost:8000 in your browser. The dashboard has three tabs:

Chat

Ask questions about your indexed documents. Answers include source citations with relevance scores, powered by the full RAG pipeline (embed → retrieve → rerank → generate).

Documents

View all indexed documents with chunk counts and timestamps. Upload files via drag-and-drop (PDF, DOCX, TXT, MD) or delete individual documents from the index.

Watched Folders

See all directories being monitored by the Go file watcher. Add new folders (existing files are auto-scanned immediately) or remove them (full index cleanup).

The default data/documents/ folder is always present and cannot be removed
The Go watcher polls for folder changes every 30 seconds (no restart needed)

API Usage

Ingest a document

Drop files into data/documents/ (the Go watcher picks them up automatically), or call the API directly:

curl -s -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"file_path": "/absolute/path/to/document.pdf", "event_type": "create", "file_hash": ""}' \
  | python3 -m json.tool

Query

curl -s -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is this document about?"}' \
  | python3 -m json.tool

Health check

curl -s http://localhost:8000/health | python3 -m json.tool

Reindex all documents

curl -s -X POST http://localhost:8000/reindex | python3 -m json.tool

Makefile Commands

Command	Description
`make setup`	Full setup (Python, Docker, SQLite, Go)
`make python-api`	Start the FastAPI server on :8000
`make go-watcher`	Start the Go file watcher
`make run-all`	Start all services in background with logs
`make stop`	Stop all background services
`make download-reranker`	Download the reranker model (~1.1GB)
`make start-qdrant`	Start Qdrant container
`make stop-qdrant`	Stop Qdrant container
`make build-go`	Build Go binary
`make health`	Check API health
`make ingest-test`	Ingest a sample test document
`make query-test`	Run a sample query
`make reindex`	Force reindex all documents
`make clean`	Remove build artifacts and caches
`make clean-all`	Remove venv, DB, and Docker volumes

Configuration

All settings live in python-core/config.py:

Setting	Default	Description
`LLM_PROVIDER`	`"local"`	`"local"` (GGUF) or `"groq"` (cloud API)
`LLM_MODEL_FILENAME`	`"Phi-3-mini-4k-instruct-q4.gguf"`	GGUF filename in `models/llm/` (local only)
`GROQ_MODEL`	`"llama-3.3-70b-versatile"`	Groq model ID (groq only)
`USE_RERANKER`	`False`	Enable/disable cross-encoder reranking
`CHUNK_SIZE`	`400`	Token limit per chunk
`CHUNK_OVERLAP`	`80`	Overlap tokens between chunks
`RETRIEVAL_TOP_K`	`20`	Candidates fetched from vector search
`RERANK_TOP_K`	`5`	Final chunks after reranking
`MAX_CONTEXT_TOKENS`	`2000`	Token budget for LLM context
`LLM_MAX_TOKENS`	`512`	Max tokens in LLM response
`LLM_TEMPERATURE`	`0.1`	LLM sampling temperature

Using Groq

Set LLM_PROVIDER = "groq" and optionally GROQ_MODEL in python-core/config.py
Export your API key before starting the server:

export GROQ_API_KEY=gsk_...
make python-api

Get a free key at console.groq.com. Popular model IDs:

llama-3.3-70b-versatile — general purpose
llama-3.1-8b-instant — ultra-fast
mixtral-8x7b-32768 — large 32k context
gemma2-9b-it — Google Gemma 2

Switching local models

Place any GGUF file in models/llm/ and update LLM_MODEL_FILENAME in config.py. Stop tokens and context size are auto-detected from the filename — no other changes needed.

Project Structure

local-rag-system/
├── python-core/            # Python RAG core
│   ├── api/                # FastAPI endpoints
│   │   ├── main.py         # App entry + static file serving
│   │   ├── ingest_endpoint.py
│   │   ├── query_endpoint.py
│   │   └── dashboard_endpoint.py  # Documents, folders, upload API
│   ├── chunking/           # Structure-aware document chunker
│   ├── db/                 # SQLite metadata (documents, chunks, watched_folders)
│   ├── embeddings/         # BGE embedding service
│   ├── ingestion/          # Incremental ingestion pipeline
│   ├── llm/                # llama.cpp LLM service
│   ├── parsers/            # PDF, DOCX, TXT parsers
│   ├── reranker/           # Cross-encoder reranker
│   ├── retrieval/          # Vector search + prompt builder
│   ├── vector_store/       # Qdrant client
│   ├── config.py           # Central configuration
│   └── requirements.txt    # Python dependencies
├── go-ingestion/           # Go file watcher service
│   ├── watcher/            # fsnotify file watcher (dynamic folder support)
│   ├── normalizer/         # Event normalization
│   ├── debouncer/          # Event deduplication
│   ├── stability/          # File stability checks
│   ├── hashing/            # Content hashing (xxhash)
│   ├── queue/              # Job queue
│   ├── worker/             # Ingestion worker (calls Python API)
│   └── config/             # Go configuration
├── ui/static/              # Web dashboard (served by FastAPI)
│   ├── index.html          # Single-page app shell
│   ├── style.css           # Dark theme styles
│   └── app.js              # Client-side logic
├── data/documents/         # Default watched folder for ingestion
├── models/                 # Downloaded model files
│   ├── llm/                # GGUF model
│   ├── embeddings/         # Sentence transformer cache
│   └── reranker/           # Reranker model cache
├── docs/                   # Phase documentation
├── docker-compose.yml      # Qdrant service
├── Makefile                # All project commands
└── plan.md                 # Original system design

Supported File Types

PDF (.pdf)
Microsoft Word (.docx)
Plain text (.txt)

Memory Considerations

With 8GB RAM on M1, all three models can run simultaneously but memory is tight:

Model	Size
BGE-small-en-v1.5	~130 MB
BGE-reranker-base	~1.1 GB
Phi-3 Mini Q4 (GGUF)	~2.3 GB

If memory is constrained, set USE_RERANKER = False in config.py to skip loading the reranker model.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Local RAG System

Features

Architecture

Tech Stack

Prerequisites

Quick Start

Web Dashboard

Chat

Documents

Watched Folders

API Usage

Ingest a document

Query

Health check

Reindex all documents

Makefile Commands

Configuration

Using Groq

Switching local models

Project Structure

Supported File Types

Memory Considerations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data/documents		data/documents
docs		docs
go-ingestion		go-ingestion
python-core		python-core
scripts		scripts
ui/static		ui/static
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
docker-compose.yml		docker-compose.yml

Folders and files

Latest commit

History

Repository files navigation

Local RAG System

Features

Architecture

Tech Stack

Prerequisites

Quick Start

Web Dashboard

Chat

Documents

Watched Folders

API Usage

Ingest a document

Query

Health check

Reindex all documents

Makefile Commands

Configuration

Using Groq

Switching local models

Project Structure

Supported File Types

Memory Considerations

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages