Skip to content

fizznix/local-rag-system

Repository files navigation

Local RAG System

A production-grade, fully offline Retrieval-Augmented Generation system built for Apple M1. Combines a Go ingestion service for reliable file watching with a Python RAG core for AI-powered document search and question answering.

Features

  • Fully offline — runs entirely on-device using local GGUF models; optionally use Groq cloud API for faster inference
  • Web dashboard — chat, document management, and folder watching via browser UI
  • Incremental indexing — content hashing ensures only changed documents are re-processed
  • Structured chunking — section-aware splitting with token-limit fallback
  • Reranking — optional cross-encoder reranking for higher answer quality
  • Debounced ingestion — avoids duplicate indexing from rapid file events
  • File stability checks — prevents partial reads during file writes
  • Dynamic folder watching — add/remove watched directories via UI without restarting
  • Metadata tracking — SQLite-backed document and chunk lineage

Architecture

┌──────────────────────┐     ┌──────────────────────────────────┐     ┌──────────────┐
│  Go Ingestion Service│     │  Python RAG Core (FastAPI)       │     │  Web Dashboard │
│                      │     │                                  │     │              │
│  File Watcher        │───▶│  Parser ─▶ Chunker ─▶ Embedder  │◀───│  Chat        │
│  Event Normalizer    │     │  Vector Store (Qdrant)           │     │  Documents    │
│  Debouncer           │     │  Retrieval ─▶ Reranker ─▶ LLM   │     │  Folders      │
│  Stability Checker   │     │  Metadata DB (SQLite)            │     │  Upload       │
│  Content Hasher      │     │  Dashboard API                   │     │              │
│  Job Queue + Worker  │     │                                  │     │              │
│  Dynamic Folder Poll │     │                                  │     │              │
└──────────────────────┘     └──────────────────────────────────┘     └──────────────┘

Tech Stack

Component Technology
LLM Phi-3 / Gemma / Mistral / Llama 3 (GGUF via llama.cpp) or Groq cloud API
Embeddings BGE-small-en-v1.5 (384 dim)
Reranker BGE-reranker-base (CrossEncoder)
Vector DB Qdrant (Docker)
Metadata DB SQLite (WAL mode)
API FastAPI + Uvicorn
UI Static HTML/CSS/JS (served by FastAPI)
File Watcher Go + fsnotify
Hashing xxhash

Prerequisites

  • macOS (Apple Silicon)
  • Python 3.13 (Homebrew: brew install python@3.13)
  • Go 1.21+
  • Docker Desktop

Quick Start

# 1. Clone and enter the project
cd local-rag-system

# 2. Full setup (venv, dependencies, Qdrant, SQLite, Go)
make setup

# 3. Download models
#    LLM (~2.3GB)
curl -L -o models/llm/Phi-3-mini-4k-instruct-q4.gguf \
  "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf"

#    Reranker (~1.1GB, optional)
make download-reranker

# 4. Start services (in separate terminals)
make python-api      # Terminal 1: FastAPI on :8000
make go-watcher      # Terminal 2: Go file watcher

# Or start everything in the background:
make run-all

Web Dashboard

After starting the API, open http://localhost:8000 in your browser. The dashboard has three tabs:

Chat

Ask questions about your indexed documents. Answers include source citations with relevance scores, powered by the full RAG pipeline (embed → retrieve → rerank → generate).

Chat tab

Documents

View all indexed documents with chunk counts and timestamps. Upload files via drag-and-drop (PDF, DOCX, TXT, MD) or delete individual documents from the index.

Documents tab

Watched Folders

See all directories being monitored by the Go file watcher. Add new folders (existing files are auto-scanned immediately) or remove them (full index cleanup).

Watched Folders tab

  • The default data/documents/ folder is always present and cannot be removed
  • The Go watcher polls for folder changes every 30 seconds (no restart needed)

API Usage

Ingest a document

Drop files into data/documents/ (the Go watcher picks them up automatically), or call the API directly:

curl -s -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"file_path": "/absolute/path/to/document.pdf", "event_type": "create", "file_hash": ""}' \
  | python3 -m json.tool

Query

curl -s -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is this document about?"}' \
  | python3 -m json.tool

Health check

curl -s http://localhost:8000/health | python3 -m json.tool

Reindex all documents

curl -s -X POST http://localhost:8000/reindex | python3 -m json.tool

Makefile Commands

Command Description
make setup Full setup (Python, Docker, SQLite, Go)
make python-api Start the FastAPI server on :8000
make go-watcher Start the Go file watcher
make run-all Start all services in background with logs
make stop Stop all background services
make download-reranker Download the reranker model (~1.1GB)
make start-qdrant Start Qdrant container
make stop-qdrant Stop Qdrant container
make build-go Build Go binary
make health Check API health
make ingest-test Ingest a sample test document
make query-test Run a sample query
make reindex Force reindex all documents
make clean Remove build artifacts and caches
make clean-all Remove venv, DB, and Docker volumes

Configuration

All settings live in python-core/config.py:

Setting Default Description
LLM_PROVIDER "local" "local" (GGUF) or "groq" (cloud API)
LLM_MODEL_FILENAME "Phi-3-mini-4k-instruct-q4.gguf" GGUF filename in models/llm/ (local only)
GROQ_MODEL "llama-3.3-70b-versatile" Groq model ID (groq only)
USE_RERANKER False Enable/disable cross-encoder reranking
CHUNK_SIZE 400 Token limit per chunk
CHUNK_OVERLAP 80 Overlap tokens between chunks
RETRIEVAL_TOP_K 20 Candidates fetched from vector search
RERANK_TOP_K 5 Final chunks after reranking
MAX_CONTEXT_TOKENS 2000 Token budget for LLM context
LLM_MAX_TOKENS 512 Max tokens in LLM response
LLM_TEMPERATURE 0.1 LLM sampling temperature

Using Groq

  1. Set LLM_PROVIDER = "groq" and optionally GROQ_MODEL in python-core/config.py
  2. Export your API key before starting the server:
export GROQ_API_KEY=gsk_...
make python-api

Get a free key at console.groq.com. Popular model IDs:

  • llama-3.3-70b-versatile — general purpose
  • llama-3.1-8b-instant — ultra-fast
  • mixtral-8x7b-32768 — large 32k context
  • gemma2-9b-it — Google Gemma 2

Switching local models

Place any GGUF file in models/llm/ and update LLM_MODEL_FILENAME in config.py. Stop tokens and context size are auto-detected from the filename — no other changes needed.

Project Structure

local-rag-system/
├── python-core/            # Python RAG core
│   ├── api/                # FastAPI endpoints
│   │   ├── main.py         # App entry + static file serving
│   │   ├── ingest_endpoint.py
│   │   ├── query_endpoint.py
│   │   └── dashboard_endpoint.py  # Documents, folders, upload API
│   ├── chunking/           # Structure-aware document chunker
│   ├── db/                 # SQLite metadata (documents, chunks, watched_folders)
│   ├── embeddings/         # BGE embedding service
│   ├── ingestion/          # Incremental ingestion pipeline
│   ├── llm/                # llama.cpp LLM service
│   ├── parsers/            # PDF, DOCX, TXT parsers
│   ├── reranker/           # Cross-encoder reranker
│   ├── retrieval/          # Vector search + prompt builder
│   ├── vector_store/       # Qdrant client
│   ├── config.py           # Central configuration
│   └── requirements.txt    # Python dependencies
├── go-ingestion/           # Go file watcher service
│   ├── watcher/            # fsnotify file watcher (dynamic folder support)
│   ├── normalizer/         # Event normalization
│   ├── debouncer/          # Event deduplication
│   ├── stability/          # File stability checks
│   ├── hashing/            # Content hashing (xxhash)
│   ├── queue/              # Job queue
│   ├── worker/             # Ingestion worker (calls Python API)
│   └── config/             # Go configuration
├── ui/static/              # Web dashboard (served by FastAPI)
│   ├── index.html          # Single-page app shell
│   ├── style.css           # Dark theme styles
│   └── app.js              # Client-side logic
├── data/documents/         # Default watched folder for ingestion
├── models/                 # Downloaded model files
│   ├── llm/                # GGUF model
│   ├── embeddings/         # Sentence transformer cache
│   └── reranker/           # Reranker model cache
├── docs/                   # Phase documentation
├── docker-compose.yml      # Qdrant service
├── Makefile                # All project commands
└── plan.md                 # Original system design

Supported File Types

  • PDF (.pdf)
  • Microsoft Word (.docx)
  • Plain text (.txt)

Memory Considerations

With 8GB RAM on M1, all three models can run simultaneously but memory is tight:

Model Size
BGE-small-en-v1.5 ~130 MB
BGE-reranker-base ~1.1 GB
Phi-3 Mini Q4 (GGUF) ~2.3 GB

If memory is constrained, set USE_RERANKER = False in config.py to skip loading the reranker model.

About

A production-grade local RAG system

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors