Offline multimodal document retrieval — images, PDFs, Office, code. Fully local, no cloud required.
📕 Table of Contents
A fully local, multimodal Retrieval-Augmented Generation (RAG) pipeline that processes images, scanned PDFs, Office documents, and code — all on your machine, no cloud API required (cloud models optional as fallback).
Most RAG frameworks only handle plain text. This pipeline fills the gap:
- 🖼️ Images → PaddleOCR extracts text → OCR failures fall back to vision model descriptions
- 📄 Scanned PDFs → Render pages → OCR → text extraction
- 📊 Office files (docx/xlsx/pptx) → LibreOffice render → OCR/visual description
- 💻 Code & configs → Language-aware semantic chunking
- 🎯 Search → Pure NumPy cosine similarity (zero external vector DB dependency)
Built from real-world production experience — handles GPU driver crashes, PaddleOCR memory leaks, and subprocess deadlocks out of the box.
- 🔒 100% local — Your files never leave your machine
- 🖼️ Multimodal ingestion — Images, PDFs, Office, code, text in one pipeline
- 🧠 Smart OCR fallback — PaddleOCR → vision model description → never lose content
- 🔍 Zero-dependency search — Pure NumPy, no FAISS/Milvus/Chroma needed
- 🖥️ Web UI — Google-style search page + config dashboard
- 🚀 REST API — Simple search endpoint for integration
- 📡 OpenAI-compatible vision — LMStudio / Ollama / vLLM / any
/v1/chat/completionsendpoint - 🔄 Cross-platform — Linux & Windows (auto-detects Everything on Windows, falls back to os.walk)
- 📊 SQLite-based — Single
.dbfile, easy backup and transfer - 🛡️ Battle-tested — Memory leak isolation, GPU crash defense, signal handling, resume support
┌─────────────────────────────────────────────────────────┐
│ File Scanner │
│ Windows: Everything SDK │ Linux: os.walk │
└─────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Metadata Extractor │
│ EXIF / PDF info / Media / Code stats │
└─────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Multimodal Extractor │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ PaddleOCR │ │ PyMuPDF │ │LibreOffice│ │ Code │ │
│ │ (images) │ │ (PDF) │ │ (Office) │ │ Parser │ │
│ └─────┬─────┘ └────┬─────┘ └────┬──────┘ └───┬────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ OCR Quality Check │ │
│ │ Failed? → Vision Model (OpenAI-compat API) │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Semantic Chunker │
│ Greedy merging · Code-aware · 1200 char max │
└─────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Jina v5 Embedding (512-dim) │
│ Local model · Matryoshka · int8 quantized │
└─────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ SQLite + NumPy Search │
│ Cosine similarity · No external vector DB needed │
└─────────────────────────────────────────────────────────┘
git clone https://github.com/yourname/local-multimodal-rag.git
cd local-multimodal-rag
pip install -r requirements.txt
# Optional: Install PaddleOCR for image text extraction
pip install paddlepaddle paddleocr
# Optional: Install LibreOffice for Office file rendering (Linux)
sudo apt install libreoffice# Copy and edit config
cp .env.example .env
# Edit .env: add API keys (optional, only for cloud vision fallback)
# Edit config.yaml: set your scan_paths
# Default: scans ~/documents and ~/code# Step 1: Scan files into database
python run_pipeline.py scan
# Step 2: Extract metadata
python run_pipeline.py extract-meta
# Step 3: Process & embed (the main RAG pipeline)
python run_pipeline.py process
# Step 4: Search via CLI
python run_pipeline.py search "your keywords"
# Or: Start web server (http://localhost:8100)
python run_pipeline.py serveThat's it. Open http://localhost:8100 in your browser.
All settings live in config.yaml. Key sections:
# Where to scan files
scanner:
db_path: ./data/file_index.db
scan_paths:
- ~/documents
- ~/code
# Vision model (any OpenAI-compatible endpoint)
vision:
endpoints:
- api_base: "http://localhost:1234/v1" # LMStudio / Ollama
model: "qwen2.5-vl-7b"
max_tokens: 1024
# Web server
server:
host: "0.0.0.0"
port: 8100
# Embedding
embedding:
model_path: ./models/jina-embeddings-v5-text-small
dim: 512See config.yaml for the full list with comments.
- Search page (
/) — Google-style search with file type filters - Config dashboard (
/config.html) — Edit all settings in browser
# Search
curl "http://localhost:8100/api/search?q=machine+learning&top_k=50"
# Response:
# {
# "results": [
# {"path": "/docs/paper.pdf", "score": 0.95, "snippet": "...", "file_type": "pdf"},
# ...
# ],
# "meta": {"query": "machine learning", "total": 50, "elapsed_ms": 120}
# }
# Get config
curl http://localhost:8100/api/config
# Trigger pipeline
curl -X POST http://localhost:8100/api/pipeline/scan
curl -X POST http://localhost:8100/api/pipeline/process| Feature | Local Multimodal RAG | LangChain RAG | LlamaIndex | RAGFlow |
|---|---|---|---|---|
| Fully offline | ✅ | ❌ | ❌ | ❌ |
| Image OCR → text | ✅ | Plugin | Plugin | ✅ |
| Scanned PDF OCR | ✅ | Plugin | Plugin | ✅ |
| Office files | ✅ | ❌ | Plugin | ✅ |
| External vector DB needed | ❌ | ✅ | ✅ | ✅ |
| GPU crash defense | ✅ | ❌ | ❌ | ❌ |
| Memory leak isolation | ✅ | ❌ | ❌ | ❌ |
| Zero-config search | ✅ | ❌ | ❌ | ❌ |
| Built-in web UI | ✅ | ❌ | ❌ | ✅ |
| Vision model fallback | ✅ | ❌ | ❌ | ✅ |
| Windows + Linux | ✅ | ✅ | ✅ | Docker |
local-multimodal-rag/
├── run_pipeline.py # CLI entry point (scan/extract/process/search/serve)
├── server.py # FastAPI web server
├── config.yaml # All configuration
├── config_loader.py # Config loader + env vars
│
├── file_scanner.py # File discovery (Everything / os.walk)
├── metadata_extractor.py # EXIF, PDF info, media metadata
├── extractors.py # Multimodal content extraction
├── chunker_v2.py # Semantic chunking
├── jina_v5_embedding.py # Jina v5 text embedding
├── search_numpy.py # NumPy cosine search
├── describer.py # Vision model (OpenAI-compat) descriptions
├── description_embedder.py # Description text embedding
├── ocr_server.py # PaddleOCR HTTP service
├── ocr_worker.py # OCR worker process
│
├── static/ # Web UI
│ ├── index.html # Search page
│ ├── config.html # Config dashboard
│ └── style.css
│
├── requirements.txt
├── .env.example
├── .gitignore
├── LICENSE
└── README.md
Issues and pull requests are welcome!
- Fork the repo
- Create a feature branch (
git checkout -b feature/amazing-thing) - Commit your changes
- Open a PR
MIT License — use it however you want.