Fully Self-Hosted RAG System for OpenShift Container Platform • CPU-only • Zero External Dependencies
A complete Retrieval-Augmented Generation (RAG) stack designed to run entirely on OpenShift Container Platform without GPU requirements or external API dependencies. Built as a proof-of-value pilot for platform engineers to query internal documentation using natural language.
- 100% CPU-Only: Runs on standard OpenShift nodes without GPU requirements
- Fully Self-Hosted: No external API calls, complete data sovereignty
- Smart Chunking: Markdown-aware chunking preserves document structure
- Real-Time Streaming: Server-Sent Events for a responsive chat experience
- User uploads documents via Frontend
- Ingestion API chunks and embeds documents
- Embeddings stored in ChromaDB
- User asks questions via Chat UI
- Chat API retrieves relevant chunks
- Ollama generates contextual responses
- Streaming response displayed in real-time
- OpenShift 4.14+ cluster with cluster-admin access
- oc CLI configured and authenticated
- 16GB+ available memory across the cluster
- 50GB+ available storage
# Clone repository
git clone https://github.com/devenes/ocp-rag-stack.git
cd ocp-rag-stack
# Deploy entire stack
make deploy
# Wait for all pods to be ready (5-10 minutes for model downloads)
oc get pods -n ocp-rag-stack -w
# Seed with example runbooks
make seed
# Get frontend URL
make demo

# Create namespace and RBAC
oc apply -f deploy/namespace.yaml
oc apply -f deploy/rbac/
# Deploy infrastructure (Ollama + ChromaDB)
oc apply -f deploy/ollama/
oc apply -f deploy/chromadb/
# Wait for Ollama models to download
oc wait --for=condition=complete job/ollama-init-models -n ocp-rag-stack --timeout=600s
# Build and deploy application services
make build-images
oc apply -f deploy/ingestion-api/
oc apply -f deploy/chat-api/
# Deploy frontend
oc apply -f deploy/frontend/
# Verify deployment
oc get pods -n ocp-rag-stack

# Get frontend URL
FRONTEND_URL=$(oc get route frontend -n ocp-rag-stack -o jsonpath='{.spec.host}')
echo "Frontend: https://$FRONTEND_URL"
# Open in browser
open "https://$FRONTEND_URL"

- Click the upload area or drag-and-drop files
- Supported formats: .txt, .md
- Documents are automatically chunked and indexed
- View indexed documents in the sidebar
- Type your question in the chat input
- Press Enter or click Send
- Watch the AI response stream in real-time
- View source citations below each response
"How do I restart the payment service?"
"What should I do if a node shows NotReady status?"
"How do I check etcd cluster health?"
"What are the steps for incident response?"
# Build Go binaries
make build
# Run tests
make test
# Build container images
make build-images
# Push to OpenShift internal registry
make push-images

# Start ChromaDB (requires Docker)
docker run -d -p 8000:8000 chromadb/chroma:0.5.23
# Start Ollama (requires Ollama installed)
ollama serve &
ollama pull nomic-embed-text
ollama pull qwen2.5:1.5b
# Run ingestion API
VECTOR_STORE=chromahttp \
CHROMA_URL=http://localhost:8000 \
OLLAMA_URL=http://localhost:11434 \
go run cmd/ingestion/main.go
# Run chat API (in another terminal)
VECTOR_STORE=chromahttp \
CHROMA_URL=http://localhost:8000 \
OLLAMA_URL=http://localhost:11434 \
go run cmd/chat/main.go
# Open frontend/index.html in a browser

Ingestion API:
PORT=8081 # API port (default: 8081)
VECTOR_STORE=chromahttp # Vector store backend (chromem|chromahttp)
CHROMA_URL=http://chromadb:8000 # ChromaDB URL
OLLAMA_URL=http://ollama:11434 # Ollama URL
EMBED_MODEL=nomic-embed-text # Embedding model
CHUNKING_STRATEGY=markdown # Chunking strategy (fixed|sentence|markdown)
CHUNK_SIZE=512 # Chunk size in tokens
CHUNK_OVERLAP=50 # Chunk overlap in tokens
LOG_LEVEL=info # Log level (debug|info|warn|error)

Chat API:
PORT=8082 # API port (default: 8082)
VECTOR_STORE=chromahttp # Vector store backend
CHROMA_URL=http://chromadb:8000 # ChromaDB URL
OLLAMA_URL=http://ollama:11434 # Ollama URL
EMBED_MODEL=nomic-embed-text # Embedding model
CHAT_MODEL=qwen2.5:1.5b # Chat model
TOP_K=5 # Number of chunks to retrieve
LOG_LEVEL=info # Log level

| Component | CPU Request | CPU Limit | Memory Request | Memory Limit | Storage |
|---|---|---|---|---|---|
| Ollama | 2 cores | 4 cores | 4Gi | 6Gi | 10Gi |
| ChromaDB | 500m | 1 core | 1Gi | 2Gi | 10Gi |
| Ingestion API | 200m | 500m | 256Mi | 512Mi | - |
| Chat API | 200m | 500m | 256Mi | 512Mi | - |
| Frontend | 50m | 200m | 64Mi | 128Mi | - |
Total Cluster Requirements:
- CPU: ~3 cores (requests), ~6.2 cores (limits)
- Memory: ~6Gi (requests), ~10Gi (limits)
- Storage: ~20Gi persistent volumes
Embedding model (nomic-embed-text):
- Size: 274MB
- Dimensions: 768
- Context: 8192 tokens
- Performance: ~100 embeddings/sec on CPU
- Use Case: Document and query embeddings
Chat model (qwen2.5:1.5b):
- Size: 1.1GB
- Parameters: 1.5 billion
- Context: 32K tokens
- Performance: 15-25 tokens/sec on CPU
- Use Case: Conversational responses
Both models are optimized for CPU inference and automatically downloaded during deployment.
# All pods
oc get pods -n ocp-rag-stack
# Specific component logs
make logs-ollama
make logs-chromadb
make logs-ingestion
make logs-chat
make logs-frontend
# Port forward for debugging
make port-forward-ollama # localhost:11434
make port-forward-chromadb # localhost:8000
make port-forward-ingestion # localhost:8081
make port-forward-chat # localhost:8082

# Ingestion API
curl http://ingestion-api:8081/health
# Chat API
curl http://chat-api:8082/health
# Ollama
curl http://ollama:11434/api/tags
# ChromaDB
curl http://chromadb:8000/api/v1/heartbeat

# Check pod status
oc describe pod <pod-name> -n ocp-rag-stack
# Check events
oc get events -n ocp-rag-stack --sort-by='.lastTimestamp'
# Check logs
oc logs <pod-name> -n ocp-rag-stack

# Check init job status
oc get job ollama-init-models -n ocp-rag-stack
# Check job logs
oc logs job/ollama-init-models -n ocp-rag-stack
# Manually trigger model download
oc exec -n ocp-rag-stack deployment/ollama -- ollama pull nomic-embed-text
oc exec -n ocp-rag-stack deployment/ollama -- ollama pull qwen2.5:1.5b

# Check ChromaDB pod
oc get pod -n ocp-rag-stack -l app=chromadb
# Test connectivity from ingestion pod
oc exec -n ocp-rag-stack deployment/ingestion-api -- curl -v http://chromadb:8000/api/v1/heartbeat
# Check network policies
oc get networkpolicy -n ocp-rag-stack

# Check route
oc get route frontend -n ocp-rag-stack
# Check ConfigMap
oc get configmap frontend-html -n ocp-rag-stack
# Restart frontend
oc rollout restart deployment/frontend -n ocp-rag-stack

# All tests
make test
# Specific package
go test ./internal/chunking/... -v
# With coverage
go test ./... -coverprofile=coverage.out
go tool cover -html=coverage.out

# Deploy to test namespace
oc new-project ocp-rag-stack-test
make deploy NAMESPACE=ocp-rag-stack-test
# Run integration tests
make test-integration
# Cleanup
oc delete project ocp-rag-stack-test

POST /api/v1/ingest/text
curl -X POST http://ingestion-api:8081/api/v1/ingest/text \
-F "file=@runbook.md"
Response:
{
"message": "Document ingested successfully",
"chunks_created": 42,
"document_id": "runbook.md"
}

GET /health
curl http://ingestion-api:8081/health
Response:
{
"status": "healthy",
"vector_store": "connected",
"ollama": "connected"
}

POST /api/v1/chat/stream (SSE)
curl -X POST http://chat-api:8082/api/v1/chat/stream \
-H "Content-Type: application/json" \
-d '{"message": "How do I restart a pod?", "stream": true}'
Response (SSE):
data: {"content": "To restart"}
data: {"content": " a pod"}
data: {"content": ", use the"}
data: {"sources": [...]}
data: [DONE]

POST /api/v1/chat (Synchronous)
curl -X POST http://chat-api:8082/api/v1/chat \
-H "Content-Type: application/json" \
-d '{"message": "How do I restart a pod?", "stream": false}'
Response:
{
"response": "To restart a pod, use the oc delete pod command...",
"sources": [
{
"content": "...",
"metadata": {"source": "runbook.md"},
"similarity": 0.89
}
]
}

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Ollama - Local LLM runtime
- ChromaDB - Vector database
- chromem-go - Pure Go vector store
- go-chi - Lightweight HTTP router
- OpenShift - Enterprise Kubernetes platform
Built with ❤️ for Platform Engineers