This repository provides a simple docker-compose.yml to run the core components of a Retrieval-Augmented Generation (RAG) pipeline on your own server or local machine, even without a GPU.
Update (April 2026): Now using Ollama for LLM serving — automatic model management, broader model support, and simpler configuration. The previous llama.cpp setup remains available as an alternative.
This stack includes:
- `postgres`: A PostgreSQL database with the `pgvector` extension for storing vector embeddings.
- `embeddings`: A Hugging Face text embeddings model (`bge-base-en-v1.5`) served via the Text Embeddings Inference (TEI) container.
- `llm`: A local LLM served via Ollama. Pulls and caches models automatically. Supports Llama, Qwen, Mistral, and many others.
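If you want to poke at the services directly from Python, a minimal sketch like the one below works against the default ports used later in this README (the database name, user, and password are placeholders; use whatever you set in docker-compose.yml):

```python
# pip install requests psycopg2-binary
import requests
import psycopg2

# postgres: connection details are placeholders for whatever docker-compose.yml defines
conn = psycopg2.connect(host="localhost", port=5433, dbname="rag", user="rag", password="rag")

# embeddings: TEI exposes POST /embed and returns one vector per input string
vectors = requests.post(
    "http://localhost:8081/embed",
    json={"inputs": ["Hello world"]},
).json()
print(len(vectors[0]))  # bge-base-en-v1.5 embeddings are 768-dimensional

# llm: Ollama exposes an OpenAI-compatible chat completions endpoint
reply = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "qwen2.5:7b", "messages": [{"role": "user", "content": "Say hello."}]},
).json()
print(reply["choices"][0]["message"]["content"])
```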
Prerequisites:

- Docker and Docker Compose are installed.
- ~2 GB disk space for the embeddings model; 4–8 GB for the LLM (varies by model).
1. Clone this repository:

   ```bash
   git clone https://github.com/emilvrana/local-rag-stack.git
   cd local-rag-stack
   ```

2. Configure your model:

   - Copy the example environment file:

     ```bash
     cp .env.example .env
     ```

   - Edit `.env` and set `OLLAMA_MODEL` to your preferred model:
     - `qwen2.5:7b` — fast, good for most tasks (default)
     - `qwen2.5:14b` — better quality, slower
     - `llama3.2` — compact, good for constrained environments
     - See ollama.com/library for all options
3. Start the services:

   ```bash
   docker-compose up -d
   ```

   The model will download automatically on first startup (this may take a few minutes).

4. Verify:

   ```bash
   # Quick health check (all services)
   ./healthcheck.sh

   # Or manually:
   curl http://localhost:8080/api/tags   # LLM
   curl http://localhost:8081/embed -X POST \
     -H "Content-Type: application/json" \
     -d '{"inputs": "Hello world"}'      # Embeddings
   ```
You should have:

- PostgreSQL with pgvector on `localhost:5433`
- Embeddings API at `http://localhost:8081`
- LLM API (OpenAI-compatible) at `http://localhost:8080`
A complete working example is provided in example_rag.py. It demonstrates:
- Database initialization with pgvector
- Document chunking (sentence-aware semantic chunking by default; see below)
- Embedding via the local TEI service
- Similarity search using cosine distance
- Answer generation via the local LLM
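At its core, that flow is: embed the question, pull the nearest chunks by cosine distance, and hand them to the LLM as context. The condensed sketch below illustrates the idea; the helper names, table schema, and prompt are illustrative rather than the script's actual code:

```python
import requests
import psycopg2

TEI_URL = "http://localhost:8081/embed"
LLM_URL = "http://localhost:8080/v1/chat/completions"

def embed(text: str) -> list[float]:
    # TEI returns one embedding vector per input string
    return requests.post(TEI_URL, json={"inputs": [text]}).json()[0]

def ask(question: str, conn) -> str:
    qvec = embed(question)
    vec_literal = "[" + ",".join(str(x) for x in qvec) + "]"
    with conn.cursor() as cur:
        # "<=>" is pgvector's cosine distance operator; table/column names are illustrative
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 4",
            (vec_literal,),
        )
        context = "\n\n".join(row[0] for row in cur.fetchall())
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    reply = requests.post(LLM_URL, json={
        "model": "qwen2.5:7b",
        "messages": [{"role": "user", "content": prompt}],
    }).json()
    return reply["choices"][0]["message"]["content"]
```

To install dependencies and run the full example: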
```bash
# Install dependencies
pip install -r requirements.txt

# Configure
# Edit .env with your settings (if you changed defaults in docker-compose.yml)

# Run the example
python example_rag.py
```

| Model | Use Case | Notes |
|---|---|---|
| `qwen2.5:7b` | General purpose (default) | Good balance of speed and quality |
| `qwen2.5:14b` | Complex reasoning | Noticeably slower, better answers |
| `llama3.2` | Constrained environments | Fastest, sufficient for simple tasks |
| `mistral:7b` | Coding tasks | Good code understanding |
The default chunking now uses sentence-aware semantic chunking instead of naive sliding windows. This keeps sentences intact, producing more coherent chunks that work better for RAG retrieval.
```python
from semantic_chunker import semantic_chunk

# Sentence-aware (default): groups sentences into chunks
chunks = semantic_chunk(text, chunk_size=500, strategy="sentence")

# Paragraph-aware: splits at paragraph boundaries, subdivides oversized paragraphs
chunks = semantic_chunk(text, chunk_size=500, strategy="paragraph")
```

The example_rag.py script uses semantic chunking automatically. If you need the old naive chunker, just remove the semantic_chunker import — it falls back gracefully.
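For intuition, sentence-aware chunking amounts to packing whole sentences into size-bounded chunks. A standalone sketch of the idea (not the actual semantic_chunker code):

```python
import re

def sentence_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Greedy sentence-aware chunking: pack whole sentences into chunks of
    roughly chunk_size characters, never splitting mid-sentence."""
    # Naive sentence splitter: break after ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > chunk_size:
            chunks.append(current)
            current = sent  # a single oversized sentence becomes its own chunk
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```

The paragraph strategy follows the same pattern, splitting on paragraph boundaries first and subdividing any paragraph that exceeds the chunk size.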
Vector similarity alone misses exact term matches. The new hybrid_search.py module combines vector search with PostgreSQL full-text search and trigram matching, using Reciprocal Rank Fusion (RRF) to merge results.
```python
from hybrid_search import hybrid_query, init_hybrid_tables

# Run once to add full-text indexes
init_hybrid_tables()

# Alpha controls vector vs keyword weight (0.0–1.0, default 0.7)
answer = hybrid_query("What is pgvector?", alpha=0.7)
```

Why it matters: queries with specific names, error codes, or IDs often fail on pure vector search. Keyword-only search misses paraphrases. Hybrid catches both.
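For intuition, Reciprocal Rank Fusion scores each document as 1 / (k + rank) in every result list where it appears and sums the scores, so documents ranked well by both searches bubble to the top. A standalone sketch of the fusion step (hybrid_search.py additionally applies the alpha weighting; this is not its actual implementation):

```python
def rrf_merge(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked result lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            # Each appearance contributes 1 / (k + rank); k damps the influence of top ranks
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" appears in both lists, so it outranks results found by only one search
print(rrf_merge(["a", "b", "c"], ["d", "b", "e"]))
```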
Extend the example for your use case:
- Add document loaders (PDF, web scraping, APIs)
- Try paragraph-aware chunking for structured documents
- Build a web interface (FastAPI, Streamlit); see the sketch after this list
- Add caching and rate limiting
- Deploy to your own infrastructure
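As a starting point for the web interface item above, a minimal FastAPI wrapper around the hybrid search pipeline could look like this (a sketch, assuming hybrid_search.py is importable from the app's working directory):

```python
# pip install fastapi uvicorn
# Run with: uvicorn app:app --reload
from fastapi import FastAPI
from pydantic import BaseModel

from hybrid_search import hybrid_query  # provided by this repository

app = FastAPI()

class Question(BaseModel):
    text: str
    alpha: float = 0.7  # vector vs keyword weight, as in hybrid_query

@app.post("/ask")
def ask(question: Question):
    # Retrieval and generation both run against the local stack
    return {"answer": hybrid_query(question.text, alpha=question.alpha)}
```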
If you prefer direct GGUF model serving without Ollama, the docker-compose.yml includes a commented-out `llm` service definition that uses llama.cpp. Uncomment that block and comment out the Ollama service to switch.