A comprehensive Retrieval-Augmented Generation (RAG) system that combines semantic search with LLM generation, featuring both CLI tools and a web interface.
- CLI Interface: Use CLI for indexing, searching, and chatting
- Semantic Vector Search: Search documents using semantic similarity
- RAG Chat System: Context-aware chat powered by LLMs with retrieved documents
- Multiple Data Sources: Index from JSON files or markdown directories
- Hybrid Search: Combines vector similarity with keyword matching for better relevance
- Incremental Indexing: Update single files without re-indexing everything
- LLM-Optimized Output: Token-efficient output format designed for AI agents
- Interactive Modes: Both search and chat interfaces with rich commands
- Streaming Responses: Real-time streaming of LLM responses
- Web Interface: FastAPI backend with Svelte frontend.
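To make the hybrid-search feature concrete: QRAS's internal scoring is not shown in this README, but one common way to combine the two signals is a weighted sum of vector similarity and keyword overlap, as in this illustrative sketch:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_overlap(query: str, text: str) -> float:
    """Fraction of query terms that appear in the document text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0

def hybrid_score(query: str, query_vec: list[float],
                 doc_text: str, doc_vec: list[float],
                 alpha: float = 0.7) -> float:
    """Blend semantic and lexical relevance; alpha weights the vector side.

    Illustrative only -- QRAS may fuse the two signals differently.
    """
    return alpha * cosine(query_vec, doc_vec) + (1 - alpha) * keyword_overlap(query, doc_text)
```

Documents that match both semantically and lexically outrank those that match on only one signal, which is why hybrid search often improves relevance over pure vector search.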
- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Start required services:

  ```bash
  # Start Qdrant (using Docker)
  docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant

  # Start Ollama and pull required models
  ollama serve
  ollama pull bge-m3:567m
  ollama pull llama3
  ```
- Make the CLI executable:

  ```bash
  chmod +x ./qras
  ```
Use the `./qras` script as your main entry point:

```bash
# Show help
./qras help

# Search documents
./qras query "machine learning algorithms"
./qras query "what is neural network" --hybrid --limit 5

# Index documents
./qras index --input-path ./documents --collection docs
./qras index --input-path ./articles.json --collection articles --recreate

# Interactive chat
./qras chat --interactive
./qras chat "What is machine learning?" --chat-model llama3

# Start web interface (backend + frontend)
./qras web
```

Search your indexed documents with semantic or hybrid search:
```bash
# Basic search (LLM-optimized output by default)
./qras query "machine learning algorithms"

# Hybrid search (vector + keyword)
./qras query "neural networks" --hybrid --limit 5

# Verbose mode (detailed output with emojis and timing)
./qras query "neural networks" --hybrid --verbose

# Save results to file
./qras query "deep learning" --output-format json --output results.json

# Interactive search mode
./qras query --interactive

# Show specific article
./qras query --article-id 123 --collection articles
```

Key Options:

- `--hybrid`: Use hybrid search (vector + keyword matching)
- `--limit`: Maximum number of results (default: 10)
- `--output-format`: `llm` (default), `verbose`, `compact`, or `json`
- `--verbose`: Show debug logs and detailed output with emojis
- `--max-content`: Max characters per result (default: 500)
- `--collection`: Collection to search (default: docs)
- `--interactive`: Interactive search mode
Index documents into the vector database:
```bash
# Index a directory of documents
./qras index --input-path ./documents --collection docs

# Index a single file (incremental update)
./qras index --input-path ./notes/meeting.md --collection docs

# Delete chunks for a specific file
./qras index --delete "meeting.md" --collection docs

# Index JSON files
./qras index --input-path ./articles.json --collection articles

# Recreate collection from scratch
./qras index --input-path ./data --recreate

# Custom chunking settings
./qras index --input-path ./docs --chunk-size 200 --chunk-overlap 50

# Show collection info
./qras index --info
./qras index --info --collection docs
```

Key Options:

- `--input-path`: Path to documents directory or single file
- `--delete`: Delete all chunks for a specific source/file
- `--collection`: Collection name (default: docs)
- `--recreate`: Delete and recreate collection
- `--chunk-size`: Words per chunk (default: 150)
- `--file-type`: `auto`, `json`, or `markdown`
Note: Single file indexing uses deterministic chunk IDs, enabling proper incremental updates without duplicates.
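The deterministic-ID idea can be sketched as follows: derive each chunk's ID from its source path and position, so re-indexing a file produces the same IDs and overwrites stale chunks instead of duplicating them. This is an illustrative sketch, not QRAS's actual ID scheme:

```python
import uuid

def chunk_id(source_path: str, chunk_index: int) -> str:
    """Derive a stable UUID from the source file and chunk position.

    Because uuid5 is deterministic, indexing the same file twice yields
    the same IDs, so upserts replace old chunks rather than adding copies.
    (Illustrative only -- QRAS's real ID derivation may differ.)
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source_path}#{chunk_index}"))
```

A vector database upsert keyed on these IDs then gives incremental updates for free.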
Interactive RAG chat combining search with LLM generation:
```bash
# Interactive chat mode
./qras chat --interactive

# Single question
./qras chat "What is machine learning?"

# Use different models
./qras chat --interactive --embedding-model bge-m3:567m --chat-model llama3

# Hybrid search in chat
./qras chat --interactive --hybrid --search-limit 3
```

Key Options:

- `--interactive`: Start interactive chat mode
- `--embedding-model`: Model for search embeddings
- `--chat-model`: LLM for response generation
- `--search-limit`: Max sources to retrieve (default: 5)
- `--hybrid`: Use hybrid search
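Under the hood, a RAG chat turn stitches the retrieved sources into the prompt sent to the LLM. QRAS's exact prompt template is not shown in this README; the sketch below illustrates one common shape, with `max_chars` mirroring the `--max-content` idea:

```python
def build_rag_prompt(question: str, sources: list[str], max_chars: int = 500) -> str:
    """Assemble retrieved chunks into a numbered context block for the chat model.

    Hypothetical template for illustration -- the real QRAS prompt may differ.
    """
    context = "\n\n".join(f"[{i + 1}] {s[:max_chars]}" for i, s in enumerate(sources))
    return (
        "Answer the question using only the sources below. "
        "If the sources are insufficient, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
```

A lower `--search-limit` shrinks this context block, which is why 3-5 sources tend to give more focused answers.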
Start both backend and frontend servers simultaneously:
```bash
# Start web interface (backend + frontend)
./qras web
```

Output:

```text
Starting Qdrant RAG Web Services...
====================================
Starting backend server on http://localhost:8000...
Starting frontend server on http://localhost:5173...

Web services are starting up...
Backend API: http://localhost:8000
Frontend UI: http://localhost:5173

Press Ctrl+C to stop all services
```
Requirements:

- Node.js and npm installed
- Backend dependencies installed (`pip install -r web/backend/app/requirements.txt`)
- Frontend dependencies installed (`cd web/frontend && npm install`)
Start both backend and frontend with a single command:

```bash
# Start both web backend and frontend simultaneously
./qras web
```

- Backend API: http://localhost:8000
- Frontend UI: http://localhost:5173
- API Docs: http://localhost:8000/api/v1/docs

Press Ctrl+C to stop both services.
If you prefer to run services separately:

```bash
# Backend
cd web/backend/app
python main.py
```

```bash
# Frontend
cd web/frontend
npm install
npm run dev
```

Set environment variables to customize behavior:
```bash
# Database settings
export QDRANT_URL="http://localhost:6333"
export QDRANT_TIMEOUT=30

# Embedding settings
export OLLAMA_URL="http://localhost:11434"
export OLLAMA_MODEL="embeddinggemma:latest"
export OLLAMA_TIMEOUT=120

# Indexing settings
export INDEXING_CHUNK_SIZE=150
export INDEXING_CHUNK_OVERLAP=30

# Search settings
export SEARCH_DEFAULT_LIMIT=10
export SEARCH_ENABLE_HYBRID=true
```

Create a JSON configuration file:
```json
{
  "database": {
    "url": "http://localhost:6333",
    "timeout": 30
  },
  "embedding": {
    "url": "http://localhost:11434",
    "model": "embeddinggemma:latest",
    "timeout": 120
  },
  "indexing": {
    "chunk_size": 150,
    "chunk_overlap": 30
  }
}
```

Use with: `./qras query "test" --config-file config.json`
- `embeddinggemma:latest` (768 dims): Fast and efficient
- `bge-m3:567m` (1024 dims): Multilingual, high quality
- `bge-large:latest` (1024 dims): Large model for quality
- `all-minilm-l6-v2` (384 dims): Compact and fast
- `llama3`: General-purpose conversational AI
- `codellama`: Code-focused responses
- `mistral`: Efficient and capable
- `gemma`: Google's efficient model
- Any Ollama-compatible model
- Index some documents:

  ```bash
  ./qras index --input-path ./my-documents --collection knowledge
  ```

- Search the indexed content:

  ```bash
  ./qras query "artificial intelligence" --collection knowledge --hybrid
  ```

- Chat with your documents:

  ```bash
  ./qras chat --interactive --collection knowledge
  ```

- Use the web interface:

  ```bash
  ./qras web
  ```
```bash
# Index with custom settings
./qras index \
  --input-path ./documents \
  --collection docs \
  --model bge-m3:567m \
  --chunk-size 200 \
  --chunk-overlap 40 \
  --workers 8

# Search with specific parameters
./qras query \
  "machine learning algorithms" \
  --collection docs \
  --hybrid \
  --limit 15 \
  --min-score 0.1 \
  --output-format json

# Chat with custom prompts
./qras chat \
  --interactive \
  --collection docs \
  --embedding-model bge-m3:567m \
  --chat-model llama3 \
  --system-prompt "You are a technical expert. Provide detailed explanations."
```

The system is built with a clean separation between:

- Shared Library (`lib/`): Common functionality used by both CLI and web
- CLI Tools (`cli/`): Command-line interfaces for different operations
- Web Interface (`web/`): FastAPI backend + Svelte frontend
- Import errors: Make sure you're using `./qras` (not individual Python scripts)
- Permission denied: Run `chmod +x ./qras` to make the script executable
- Connection errors: Verify Qdrant and Ollama are running
- Model not found: Pull the required model with `ollama pull <model-name>`
- Empty results: Check if documents are properly indexed
Run with verbose output:

```bash
LOG_LEVEL=DEBUG ./qras query "test"
```

If you get a "Vector dimension error", ensure the embedding model used for querying matches the one used for indexing:
```bash
# Check collection info
./qras query --interactive
# Then type: stats

# Re-index with correct model if needed
./qras index --input-path ./docs --recreate --model <correct-model>
```

Ollama connection refused:

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if needed
ollama serve
```

Qdrant connection refused:

```bash
# Check if Qdrant is running
docker ps | grep qdrant

# Start Qdrant if needed
docker run -p 6333:6333 qdrant/qdrant
```
- Use hybrid search for better relevance:

  ```bash
  ./qras query "your query" --hybrid
  ```

- Optimize chunk size based on your content:
  - Shorter chunks (100-150 words) for precise retrieval
  - Longer chunks (200-300 words) for more context

- Adjust search parameters for RAG:
  - Lower limit (3-5) for focused responses
  - Higher limit (7-10) for comprehensive answers

- Use streaming for better UX:

  ```bash
  ./qras chat --interactive  # Streaming enabled by default
  ```
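To see how `--chunk-size` and `--chunk-overlap` interact, here is a word-based splitter in the spirit of those options (a sketch, not QRAS's exact splitter):

```python
def chunk_words(text: str, chunk_size: int = 150, overlap: int = 30) -> list[str]:
    """Split text into word-based chunks where consecutive chunks share
    `overlap` words, mirroring the --chunk-size / --chunk-overlap options.

    Illustrative sketch; QRAS's real splitter may handle sentences or
    markdown structure differently.
    """
    words = text.split()
    step = max(chunk_size - overlap, 1)  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks
```

With the defaults (150 words, 30 overlap), each chunk repeats the last 30 words of its predecessor, so facts that straddle a boundary remain retrievable from at least one chunk.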
QRAS can be used as a semantic memory search backend for OpenClaw agents.
- Install QRAS (follow Getting Started above)

- Copy the `qmem` wrapper to your workspace:

  ```bash
  cd ~/.openclaw/workspace
  cp qras/openclaw/qmem ./qmem
  chmod +x ./qmem
  ```

- Create a config file:

  ```bash
  cp .qmem.conf.example .qmem.conf
  nano .qmem.conf
  ```

  `.qmem.conf` contents:

  ```bash
  QRAS_DIR=/home/user/.openclaw/workspace/qras
  COLLECTION=oc_memory
  OLLAMA_HOST=localhost:11434
  ```

- Index your memory files:

  ```bash
  cd qras
  ./qras index --input-path ~/.openclaw/workspace/memory --collection oc_memory --file-type markdown
  ./qras index --input-path ~/.openclaw/workspace/MEMORY.md --collection oc_memory
  ```

- Test the search:

  ```bash
  cd ~/.openclaw/workspace
  ./qmem "test query"
  ```
Add to your AGENTS.md to enforce QRAS-first behavior:
### 🧠 Memory Recall Rule
**QRAS first, `memory_search` fallback.** When recalling anything, always use QRAS first. Only fall back to built-in `memory_search` if QRAS returns no results or errors.
### ⚠️ Index After Writing Memory
Every time you update a memory file (`memory/*.md`, `MEMORY.md`), re-index it:
```bash
cd ~/.openclaw/workspace/qras && ./qras index --input-path <path-to-file> --collection oc_memory
```

| Option | Required | Default | Description |
|---|---|---|---|
| `QRAS_DIR` | Yes | - | Path to QRAS installation |
| `COLLECTION` | Yes | - | Qdrant collection name |
| `OLLAMA_HOST` | Yes | - | Ollama server (host:port) |
| `MIN_SCORE` | No | 0.22 | Minimum relevance score |
| `LIMIT` | No | 3 | Max results returned |
| `MAX_CONTENT` | No | 600 | Max content chars per result |
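The `.qmem.conf` format above is plain `KEY=VALUE` lines. A sketch of how a wrapper like `qmem` could read it, applying the table's defaults for the optional keys (assumed behavior, not the wrapper's actual source):

```python
def parse_qmem_conf(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines like .qmem.conf, skipping blanks and comments.

    Defaults below come from the option table; the parsing details are an
    assumption about the wrapper, for illustration only.
    """
    conf = {"MIN_SCORE": "0.22", "LIMIT": "3", "MAX_CONTENT": "600"}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf
```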
For detailed instructions, see the OpenClaw QRAS skill.
- Qdrant - Vector database
- Ollama - Local LLM runtime
- FastAPI - Web framework
- Svelte - Frontend framework
Author: Robin Syihab