🗂️ Local Multimodal RAG

Offline multimodal document retrieval — images, PDFs, Office, code. Fully local, no cloud required.

📕 Table of Contents

💡 What is this?
✨ Key Features
📸 How it Works
🚀 Quick Start
⚙️ Configuration
🌐 Web UI & API
📊 Comparison
📁 Project Structure
🤝 Contributing
📄 License

💡 What is this?

A fully local, multimodal Retrieval-Augmented Generation (RAG) pipeline that processes images, scanned PDFs, Office documents, and code on your own machine. No cloud API is required; cloud vision models are optional and used only as a fallback.

Most RAG frameworks only handle plain text. This pipeline fills the gap:

🖼️ Images → PaddleOCR extracts text → OCR failures fall back to vision model descriptions
📄 Scanned PDFs → Render pages → OCR → text extraction
📊 Office files (docx/xlsx/pptx) → LibreOffice render → OCR/visual description
💻 Code & configs → Language-aware semantic chunking
🎯 Search → Pure NumPy cosine similarity (zero external vector DB dependency)
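
The image path above (OCR first, vision description on failure) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `looks_usable`, `extract_text`, the 20-character threshold, and the two callables standing in for PaddleOCR and the vision endpoint are all hypothetical.

```python
from typing import Callable, Tuple

# Hypothetical threshold: OCR output with fewer alphanumeric characters
# than this is treated as a failed extraction.
MIN_USABLE_CHARS = 20


def looks_usable(text: str, min_chars: int = MIN_USABLE_CHARS) -> bool:
    """Crude quality check: enough alphanumeric characters to be worth indexing."""
    meaningful = sum(1 for ch in text if ch.isalnum())
    return meaningful >= min_chars


def extract_text(
    path: str,
    run_ocr: Callable[[str], str],
    describe_image: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, source), where source is 'ocr' or 'vision'."""
    try:
        text = run_ocr(path)
    except Exception:
        # An OCR crash is handled the same way as empty output.
        text = ""
    if looks_usable(text):
        return text, "ocr"
    # Fall back to a vision model description of the image.
    return describe_image(path), "vision"
```

Passing the OCR and vision steps in as callables keeps the fallback logic testable without PaddleOCR or a model server installed.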

Built from production experience: the pipeline includes handling for GPU driver crashes, PaddleOCR memory leaks, and subprocess deadlocks.

✨ Key Features

🔒 Local by default: files stay on your machine unless you configure a cloud vision fallback
🖼️ Multimodal ingestion: images, PDFs, Office, code, and text in one pipeline
🧠 OCR fallback: when PaddleOCR output fails the quality check, a vision model describes the file instead
🔍 No vector DB: search is pure NumPy, with no FAISS/Milvus/Chroma needed
🖥️ Web UI: Google-style search page plus a config dashboard
🚀 REST API: simple search endpoint for integration
📡 OpenAI-compatible vision: LM Studio / Ollama / vLLM / any /v1/chat/completions endpoint
🔄 Cross-platform: Linux and Windows (auto-detects Everything on Windows, falls back to os.walk)
📊 SQLite-based: single .db file, easy to back up and transfer
🛡️ Hardened for long runs: memory leak isolation, GPU crash defense, signal handling, resume support
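
To make the "SQLite plus NumPy" idea concrete, here is a minimal sketch of storing embeddings as BLOBs and ranking them by cosine similarity. The table name `chunks`, its columns, and the helper names are assumptions made for illustration; the project's real schema may differ.

```python
import sqlite3

import numpy as np


def store_chunk(conn: sqlite3.Connection, path: str, vector: np.ndarray) -> None:
    """Save one embedding as a float32 BLOB next to its file path."""
    blob = np.asarray(vector, dtype=np.float32).tobytes()
    conn.execute("INSERT INTO chunks (path, embedding) VALUES (?, ?)", (path, blob))


def search(conn: sqlite3.Connection, query: np.ndarray, top_k: int = 5):
    """Brute-force cosine similarity over every stored embedding."""
    rows = conn.execute("SELECT path, embedding FROM chunks").fetchall()
    if not rows:
        return []
    paths = [row[0] for row in rows]
    matrix = np.stack([np.frombuffer(row[1], dtype=np.float32) for row in rows])
    # Normalize rows and query so that a dot product equals cosine similarity.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    matrix = matrix / np.clip(norms, 1e-12, None)
    q = np.asarray(query, dtype=np.float32)
    q = q / max(float(np.linalg.norm(q)), 1e-12)
    scores = matrix @ q
    best = np.argsort(-scores)[:top_k]
    return [(paths[i], float(scores[i])) for i in best]


# Tiny in-memory demo with 3-dimensional vectors.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (path TEXT, embedding BLOB)")
store_chunk(conn, "a.txt", np.array([1.0, 0.0, 0.0]))
store_chunk(conn, "b.txt", np.array([0.0, 1.0, 0.0]))
store_chunk(conn, "c.txt", np.array([0.7, 0.7, 0.0]))
results = search(conn, np.array([1.0, 0.1, 0.0]), top_k=2)
print(results)
```

A full scan like this is linear in the number of chunks, which is the trade-off for not running a separate vector database.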

📸 How it Works

┌─────────────────────────────────────────────────────────┐
│                    File Scanner                          │
│   Windows: Everything SDK  │  Linux: os.walk             │
└─────────────┬───────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────┐
│                 Metadata Extractor                       │
│         EXIF / PDF info / Media / Code stats             │
└─────────────┬───────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────┐
│              Multimodal Extractor                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────┐  │
│  │ PaddleOCR │  │ PyMuPDF  │  │LibreOffice│  │  Code  │  │
│  │ (images)  │  │  (PDF)   │  │ (Office)  │  │ Parser │  │
│  └─────┬─────┘  └────┬─────┘  └────┬──────┘  └───┬────┘  │
│        │              │              │              │       │
│        ▼              ▼              ▼              ▼       │
│   ┌──────────────────────────────────────────────────┐    │
│   │  OCR Quality Check                               │    │
│   │  Failed? → Vision Model (OpenAI-compat API)      │    │
│   └──────────────────────────────────────────────────┘    │
└─────────────┬───────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────┐
│              Semantic Chunker                            │
│    Greedy merging · Code-aware · 1200 char max          │
└─────────────┬───────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────┐
│            Jina v5 Embedding (512-dim)                   │
│      Local model · Matryoshka · int8 quantized          │
└─────────────┬───────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────┐
│            SQLite + NumPy Search                         │
│     Cosine similarity · No external vector DB needed     │
└─────────────────────────────────────────────────────────┘
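
As a rough illustration of the "greedy merging, 1200 char max" step in the diagram, the sketch below merges consecutive paragraphs until the next one would exceed the limit. It is an assumption about how such a chunker could work, not the code in chunker_v2.py, and it leaves out the code-aware splitting entirely; `greedy_chunks` is a hypothetical name.

```python
from typing import List

MAX_CHARS = 1200  # mirrors the "1200 char max" shown in the diagram


def greedy_chunks(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Merge consecutive paragraphs into chunks of at most max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        # Hard-split any single paragraph that is longer than the limit.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                # Still fits: keep growing the current chunk.
                current = candidate
            else:
                # Does not fit: close the current chunk and start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Greedy merging keeps neighbouring paragraphs together, so each embedded chunk carries more context than a fixed-width split would.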

🚀 Quick Start

1. Install

git clone https://github.com/yourname/local-multimodal-rag.git
cd local-multimodal-rag
pip install -r requirements.txt

# Optional: Install PaddleOCR for image text extraction
pip install paddlepaddle paddleocr

# Optional: Install LibreOffice for Office file rendering (Linux)
sudo apt install libreoffice

2. Configure

# Copy and edit config
cp .env.example .env
# Edit .env: add API keys (optional, only for cloud vision fallback)

# Edit config.yaml: set your scan_paths
# Default: scans ~/documents and ~/code

3. Run

# Step 1: Scan files into database
python run_pipeline.py scan

# Step 2: Extract metadata
python run_pipeline.py extract-meta

# Step 3: Process & embed (the main RAG pipeline)
python run_pipeline.py process

# Step 4: Search via CLI
python run_pipeline.py search "your keywords"

# Or: Start web server (http://localhost:8100)
python run_pipeline.py serve

That's it. With the server running, open http://localhost:8100 in your browser.

⚙️ Configuration

All settings live in config.yaml. Key sections:

# Where to scan files
scanner:
  db_path: ./data/file_index.db
  scan_paths:
    - ~/documents
    - ~/code

# Vision model (any OpenAI-compatible endpoint)
vision:
  endpoints:
    - api_base: "http://localhost:1234/v1"  # LM Studio default; Ollama typically serves http://localhost:11434/v1
      model: "qwen2.5-vl-7b"
      max_tokens: 1024

# Web server
server:
  host: "0.0.0.0"  # listens on all interfaces; use "127.0.0.1" for local-only access
  port: 8100

# Embedding
embedding:
  model_path: ./models/jina-embeddings-v5-text-small
  dim: 512

See config.yaml for the full list with comments.
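
For readers wiring their own scripts to these settings, the sketch below shows one way the parsed values might be consumed. It assumes the YAML has already been loaded into a plain dict (for example with PyYAML's `yaml.safe_load`); the helper names and the `RAG_PORT` environment variable are hypothetical and not taken from config_loader.py.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

# The same structure a YAML parser would return for the excerpt above.
config = {
    "scanner": {
        "db_path": "./data/file_index.db",
        "scan_paths": ["~/documents", "~/code"],
    },
    "server": {"host": "0.0.0.0", "port": 8100},
    "embedding": {"model_path": "./models/jina-embeddings-v5-text-small", "dim": 512},
}


def resolve_scan_paths(cfg: dict) -> List[Path]:
    """Expand '~' so the scanner receives absolute directories."""
    return [Path(p).expanduser() for p in cfg["scanner"]["scan_paths"]]


def server_port(cfg: dict, env: Optional[Mapping[str, str]] = None) -> int:
    """Read the port, letting a hypothetical RAG_PORT variable override the file."""
    env = os.environ if env is None else env
    return int(env.get("RAG_PORT", cfg["server"]["port"]))


print(resolve_scan_paths(config))
print(server_port(config, env={}))
```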

🌐 Web UI & API

Web Interface

Search page (/) — Google-style search with file type filters
Config dashboard (/config.html) — Edit all settings in browser

REST API

# Search
curl "http://localhost:8100/api/search?q=machine+learning&top_k=50"

# Response:
# {
#   "results": [
#     {"path": "/docs/paper.pdf", "score": 0.95, "snippet": "...", "file_type": "pdf"},
#     ...
#   ],
#   "meta": {"query": "machine learning", "total": 50, "elapsed_ms": 120}
# }

# Get config
curl http://localhost:8100/api/config

# Trigger pipeline
curl -X POST http://localhost:8100/api/pipeline/scan
curl -X POST http://localhost:8100/api/pipeline/process
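
The search endpoint can also be called from Python using only the standard library. The sketch below assumes the response shape shown in the curl example (`results` entries with `path` and `score`); the function names are illustrative, and `search` only works while the server is running.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8100"  # default port from config.yaml


def build_search_url(query: str, top_k: int = 10, base_url: str = BASE_URL) -> str:
    """Build the /api/search URL with properly encoded parameters."""
    params = urlencode({"q": query, "top_k": top_k})
    return f"{base_url}/api/search?{params}"


def parse_results(payload: str) -> List[Tuple[str, float]]:
    """Pull (path, score) pairs out of a search response body."""
    data = json.loads(payload)
    return [(hit["path"], hit["score"]) for hit in data.get("results", [])]


def search(query: str, top_k: int = 10) -> List[Tuple[str, float]]:
    """Query a running server (start it with: python run_pipeline.py serve)."""
    with urlopen(build_search_url(query, top_k), timeout=10) as resp:
        return parse_results(resp.read().decode("utf-8"))
```

Keeping URL building and response parsing separate from the network call makes both easy to check without a live server.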

📊 Comparison

Feature	Local Multimodal RAG	LangChain RAG	LlamaIndex	RAGFlow
Fully offline	✅	❌	❌	❌
Image OCR → text	✅	Plugin	Plugin	✅
Scanned PDF OCR	✅	Plugin	Plugin	✅
Office files	✅	❌	Plugin	✅
External vector DB needed	❌	✅	✅	✅
GPU crash defense	✅	❌	❌	❌
Memory leak isolation	✅	❌	❌	❌
Zero-config search	✅	❌	❌	❌
Built-in web UI	✅	❌	❌	✅
Vision model fallback	✅	❌	❌	✅
Windows + Linux	✅	✅	✅	Docker

📁 Project Structure

local-multimodal-rag/
├── run_pipeline.py          # CLI entry point (scan/extract-meta/process/search/serve)
├── server.py                # FastAPI web server
├── config.yaml              # All configuration
├── config_loader.py         # Config loader + env vars
│
├── file_scanner.py          # File discovery (Everything / os.walk)
├── metadata_extractor.py    # EXIF, PDF info, media metadata
├── extractors.py            # Multimodal content extraction
├── chunker_v2.py            # Semantic chunking
├── jina_v5_embedding.py     # Jina v5 text embedding
├── search_numpy.py          # NumPy cosine search
├── describer.py             # Vision model (OpenAI-compat) descriptions
├── description_embedder.py  # Description text embedding
├── ocr_server.py            # PaddleOCR HTTP service
├── ocr_worker.py            # OCR worker process
│
├── static/                  # Web UI
│   ├── index.html           # Search page
│   ├── config.html          # Config dashboard
│   └── style.css
│
├── requirements.txt
├── .env.example
├── .gitignore
├── LICENSE
└── README.md

🤝 Contributing

Issues and pull requests are welcome!

Fork the repo
Create a feature branch (git checkout -b feature/amazing-thing)
Commit your changes
Open a PR

📄 License

MIT License — use it however you want.
