Production-ready hybrid RAG (Retrieval-Augmented Generation) system with intelligent reranking, multi-backend vector database support, and comprehensive evaluation framework.
- Hybrid Search: Combines BM25/sparse and dense vector retrieval with configurable fusion methods (Alpha, RRF, relative scoring)
- Intelligent Reranking: Local BGE reranker (bge-reranker-v2-m3) with Cohere Rerank 3.5 fallback for optimal relevance scoring
- Query Enhancement: HyDE and RAG-Fusion techniques for improved recall and comprehensive coverage
- Multi-Backend Support: Weaviate (primary), Milvus 2.5+ (with sparse BM25), and Qdrant vector databases
- Comprehensive Evaluation: TruLens RAG Triad, Ragas v0.2+, offline local LLM judges, and embedding similarity metrics with A/B testing capabilities
- Production Ready: FastAPI service with health checks, Docker deployment, monitoring, and extensive configuration
Documents → Ingest → Vector DB (Hybrid Index) → Query → Retrieve (topK=50) →
Rerank (topN=8) → LLM Generate → Answer with Citations → Evaluate
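Expressed as a minimal, runnable Python sketch (every helper here is an illustrative stand-in, not askme's internal API), the flow looks like this:

```python
"""Toy end-to-end flow mirroring the diagram above; every stage is a stand-in."""
from typing import Dict, List


def hybrid_retrieve(query: str, topk: int = 50) -> List[Dict]:
    # Stand-in for hybrid (sparse BM25 + dense vector) retrieval from the vector DB.
    corpus = ["BM25 is a sparse ranking function.", "Dense vectors capture semantics."]
    return [{"id": i, "text": text} for i, text in enumerate(corpus)][:topk]


def rerank(query: str, candidates: List[Dict], topn: int = 8) -> List[Dict]:
    # Stand-in for cross-encoder reranking; a real reranker scores (query, passage) pairs.
    return candidates[:topn]


def generate(query: str, passages: List[Dict]) -> str:
    # Stand-in for LLM generation; a real call would cite the supporting passages.
    cited = ", ".join(f"[{p['id']}]" for p in passages)
    return f"Answer to {query!r}, grounded in passages {cited}"


query = "What is BM25?"
print(generate(query, rerank(query, hybrid_retrieve(query))))
```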
- Embeddings: BGE-M3 multilingual model (dense + sparse support)
- Vector Database: Weaviate (primary), Milvus 2.5+/Qdrant alternatives with hybrid search support
- Reranking: BAAI/bge-reranker-v2-m3 cross-encoder (local), Cohere Rerank 3.5 (cloud fallback)
- Framework: FastAPI with Python 3.10+, uvicorn ASGI server
- Evaluation: TruLens + Ragas with local embedding metrics, configurable LLM judges, and automated quality thresholds
- Generation: OpenAI-compatible, local Ollama, or template-based approaches
- Python 3.10 or higher
- uv package manager (recommended) or pip
- Docker and Docker Compose (for vector database and full deployment)
# Clone the repository
git clone https://github.com/deadjoe/askme.git
cd askme
# Install dependencies
uv sync --dev
# Start vector database (Weaviate)
docker compose -f docker/docker-compose.yaml --profile weaviate up -d weaviate
# Start API server (development mode)
ASKME_SKIP_HEAVY_INIT=1 uv run uvicorn askme.api.main:app --port 8080 --reload
# Ingest documents
./scripts/ingest.sh /path/to/documents --tags="project,documentation"
# Ask questions
./scripts/answer.sh "What is machine learning?"
# Retrieve documents only (for debugging/tuning)
./scripts/retrieve.sh "hybrid search techniques" --topk=100 --alpha=0.7
# Quick end-to-end smoke test (requires curl + jq)
./query_test.sh "Who created BM25?"
# Run evaluation
./scripts/evaluate.sh --suite=baseline
The `./scripts/answer.sh` script supports extensive parameter tuning:
| Parameter | Default | Range/Options | Description | Impact |
|---|---|---|---|---|
| `--topk=N` | 100 | 1-100 | Number of initial retrieval candidates | Higher = better recall, slower response |
| `--alpha=X` | 0.5 | 0.0-1.0 | Hybrid search weight (0=sparse, 1=dense) | 0.0=keyword matching, 1.0=semantic similarity |
| `--rrf` / `--no-rrf` | `--rrf` | boolean | Use RRF vs alpha fusion | RRF=stable ranking, alpha=direct weighting |
| `--rrf-k=N` | 60 | 1-200 | RRF fusion smoothing parameter | Lower=aggressive reranking, higher=conservative |
| `--reranker=TYPE` | `bge_local` | `bge_local`, `cohere` | Reranking model selection | Local=private, Cohere=higher quality |
| `--max-passages=N` | 8 | 1-20 | Final passages for LLM generation | More=richer context, risk of attention dilution |
| `--hyde` | disabled | boolean | Enable HyDE query expansion | Better for abstract/conceptual queries |
| `--rag-fusion` | disabled | boolean | Multi-query generation and fusion | Better coverage for complex questions |
| `--debug` | disabled | boolean | Include retrieval debug information | Shows timing and score details |
| `--format=FORMAT` | `text` | `text`, `json`, `markdown` | Output format selection | Choose based on consumption needs |
| `--verbose` | disabled | boolean | Enable verbose logging | Detailed execution information |
| `--api-url=URL` | `localhost:8080` | URL | Target API base URL | Override for remote deployments |
# Dense semantic search for conceptual queries
./scripts/answer.sh "Explain machine learning principles" --alpha=0.8 --max-passages=12
# Sparse keyword search for specific terms
./scripts/answer.sh "BGE-M3 model architecture" --alpha=0.2 --topk=80
# Enhanced query with expansion techniques
./scripts/answer.sh "How does attention mechanism work?" --hyde --rag-fusion --format=markdown
# Debug mode with detailed retrieval information
./scripts/answer.sh "Vector database comparison" --debug --verbose --format=json
# High-quality reranking with cloud fallback
./scripts/answer.sh "Production RAG best practices" --reranker=cohere --max-passages=10
Configuration is managed through `configs/askme.yaml`, with environment variable overrides. Key parameters:
# Vector backend selection
vector_backend: weaviate # weaviate | milvus | qdrant
# Hybrid search configuration
hybrid:
mode: rrf # rrf | alpha | relative_score | ranked
alpha: 0.5 # 0=sparse only, 1=dense only, 0.5=balanced
rrf_k: 60 # RRF fusion parameter
topk: 50 # Initial retrieval candidates
# Embedding model
embedding:
model: BAAI/bge-m3
dimension: 1024
normalize_embeddings: true
# Reranking
rerank:
local_model: BAAI/bge-reranker-v2-m3
local_enabled: true
cohere_enabled: false # Enable via ASKME_ENABLE_COHERE=1
top_n: 8
# Generation
generation:
provider: ollama # simple | ollama | openai
ollama_model: gpt-oss:20b
ollama_endpoint: http://localhost:11434
openai_model: gpt-4o-mini
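For intuition on what the reranking stage configured above does, here is a minimal cross-encoder sketch; it assumes `BAAI/bge-reranker-v2-m3` loads through `sentence-transformers` and skips the batching and fallback logic the service itself needs:

```python
from sentence_transformers import CrossEncoder

# Assumption: the model is fetched from the Hugging Face hub on first use.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

query = "What is hybrid search?"
passages = [
    "Hybrid search combines sparse keyword retrieval with dense vector retrieval.",
    "The weather today is sunny with light winds.",
]

# Higher score = more relevant; keep only the top_n passages for generation.
scores = reranker.predict([(query, passage) for passage in passages])
top = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)[:8]
for passage, score in top:
    print(f"{score:.3f}  {passage}")
```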
- `ASKME_SKIP_HEAVY_INIT=1` - Skip heavy service initialization during development
- `ASKME_API_URL` / `ASKME_API_KEY` - Target API base URL and optional auth for CLI scripts
- `ASKME_ENABLE_COHERE=1` + `COHERE_API_KEY` - Enable Cohere reranking
- `ASKME_ENABLE_OLLAMA=1` - Enable local Ollama generation
- `ASKME_RAGAS_LLM_MODEL` - Override the local LLM judge used for evaluations (default `gpt-oss:20b`)
- `ASKME_RAGAS_EMBED_MODEL` - Override the embedding model for Ragas metrics (default `BAAI/bge-m3`)
- `ASKME_TRULENS_LLM_MODEL` - Override the TruLens evaluation model (falls back to `ASKME_RAGAS_LLM_MODEL`)
- `OPENAI_BASE_URL` / `OPENAI_API_KEY` - OpenAI-compatible endpoint and key for evaluation
- `GET /health/` - Basic health check
- `GET /health/ready` - Readiness check for orchestration
- `GET /health/live` - Liveness check for orchestration

- `POST /ingest/` - Universal document ingestion (file/directory)
- `POST /ingest/file` - Single file ingestion
- `POST /ingest/directory` - Directory ingestion with recursion
- `GET /ingest/status/{task_id}` - Task status monitoring (see the client sketch below)
- `GET /ingest/stats` - Global ingestion statistics

- `POST /query/` - Hybrid search + reranking + generation pipeline
- `POST /query/retrieve` - Retrieval-only endpoint for debugging
- `GET /query/similar/{doc_id}` - Similar document discovery
- `POST /query/explain` - Retrieval explanation (debugging)

- `POST /eval/run` - Execute evaluation pipeline with TruLens + Ragas
- `GET /eval/runs/{run_id}` - Retrieve evaluation results
- `POST /eval/compare` - A/B test comparison between runs
- `GET /eval/runs` - List recent evaluation runs
- `DELETE /eval/runs/{run_id}` - Delete evaluation run
- `GET /eval/metrics` - Available evaluation metrics
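A small Python client sketch for the ingestion endpoints: the request body fields and the `task_id`/`status` response fields are assumptions inferred from `scripts/ingest.sh`, so check the FastAPI schema at `/docs` for the authoritative shapes.

```python
import time

import requests

BASE = "http://localhost:8080"

# Assumed request body; mirrors what scripts/ingest.sh passes (a path plus tags).
resp = requests.post(
    f"{BASE}/ingest/",
    json={"path": "/path/to/documents", "tags": ["project", "documentation"]},
    timeout=60,
)
resp.raise_for_status()
task_id = resp.json().get("task_id")  # response field name is an assumption

# Poll the documented status endpoint until the task reports a terminal state.
for _ in range(30):
    status = requests.get(f"{BASE}/ingest/status/{task_id}", timeout=30).json()
    print(status)
    if status.get("status") in {"completed", "failed"}:  # status values are assumptions
        break
    time.sleep(2)

print(requests.get(f"{BASE}/ingest/stats", timeout=30).json())
```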
- `./scripts/evaluate.sh` provides a unified CLI for starting suites, overriding retrieval parameters (`--alpha`, `--topk`, `--topn`), and choosing output formats (`text`, `json`, `table`).
- Embedding similarity metrics run locally when `embedding_service` is available, adding groundedness, context precision/recall, and answer relevance without leaving the host (see the sketch after this list).
- Local LLM judge metrics (faithfulness, answer relevancy, context precision/recall) default to Ollama via the OpenAI-compatible API; override with `ASKME_RAGAS_LLM_MODEL`, `OPENAI_BASE_URL`, or standard OpenAI keys as needed.
- TruLens metrics automatically fall back to LiteLLM/OpenAI providers and respect `ASKME_TRULENS_LLM_MODEL`, enabling fully offline evaluation when paired with Ollama.
- Use `ASKME_RAGAS_EMBED_MODEL` to swap in alternative embedding backends for evaluation-only workloads without touching production retrieval settings.
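As a rough illustration of what the local embedding-similarity metrics measure, here is a cosine-similarity groundedness proxy; it assumes `BAAI/bge-m3` loads through `sentence-transformers` and is far simpler than the metrics askme actually reports:

```python
from sentence_transformers import SentenceTransformer

# Assumption: dense-only BGE-M3 embeddings are sufficient for this illustration.
model = SentenceTransformer("BAAI/bge-m3")

answer = "BM25 is a sparse ranking function based on term frequency and document length."
contexts = [
    "BM25 scores documents using term frequency, inverse document frequency, and length normalization.",
    "Dense retrieval encodes text into vectors and ranks by cosine similarity.",
]

vectors = model.encode([answer] + contexts, normalize_embeddings=True)
answer_vec, context_vecs = vectors[0], vectors[1:]

# Groundedness proxy: how close the answer is to its best-supporting retrieved context.
similarities = context_vecs @ answer_vec
print({"max_context_similarity": float(similarities.max())})
```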
curl -X POST "http://localhost:8080/query/" \
-H "Content-Type: application/json" \
-d '{
"q": "What is machine learning?",
"topk": 50,
"alpha": 0.5,
"use_rrf": true,
"reranker": "bge_local",
"max_passages": 8,
"include_debug": true
}'
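The same request from Python: the request fields mirror the curl example above, while the response field names (`answer`, `citations`) are assumptions, so adapt them to the actual schema.

```python
import requests

payload = {
    "q": "What is machine learning?",
    "topk": 50,
    "alpha": 0.5,
    "use_rrf": True,
    "reranker": "bge_local",
    "max_passages": 8,
    "include_debug": True,
}

resp = requests.post("http://localhost:8080/query/", json=payload, timeout=60)
resp.raise_for_status()
data = resp.json()

# Field names below are assumptions; print the full payload if they differ.
print(data.get("answer"))
for citation in data.get("citations", []):
    print(citation)
```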
# Full stack with the default Milvus backend
docker compose -f docker/docker-compose.yaml up -d
# Alternative vector databases
docker compose -f docker/docker-compose.yaml --profile weaviate up -d
docker compose -f docker/docker-compose.yaml --profile qdrant up -d
# With monitoring
docker compose -f docker/docker-compose.yaml --profile monitoring up -d
- Performance Targets: P95 < 1500ms retrieval, < 1800ms with reranking
- Scaling: ~50k documents per node, horizontal scaling via load balancing
- Security: Local-only by default, cloud services require explicit opt-in
- Monitoring: Prometheus metrics and Grafana dashboards included
# Install development dependencies
uv sync --dev
# Install pre-commit hooks
uv run pre-commit install
# Run tests with coverage
uv run pytest --cov=askme --cov-report=term --cov-report=html
# Code formatting and type checking
uv run black askme tests && uv run isort askme tests
uv run mypy askme
The project includes comprehensive test coverage with pytest:
# Run all tests
uv run pytest -ra
# Run specific test categories
uv run pytest -m "unit and not slow"
uv run pytest -m integration
uv run pytest -m "slow"
# Run with coverage reporting
uv run pytest --cov=askme --cov-report=html
Test Markers (see the example below):
- `unit`: Unit tests for individual components
- `integration`: Cross-component integration tests
- `slow`: Time-intensive tests (model loading, evaluation)
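A minimal sketch of how these markers are applied in a test module (a hypothetical test, not one from the repository):

```python
import pytest


@pytest.mark.unit
def test_rrf_contributions_decrease_with_rank() -> None:
    # Hypothetical unit test: lower ranks should always yield higher RRF contributions.
    contributions = [1.0 / (60 + rank) for rank in range(1, 6)]
    assert contributions == sorted(contributions, reverse=True)


@pytest.mark.integration
@pytest.mark.slow
def test_end_to_end_query_pipeline() -> None:
    # Hypothetical slow integration test that would exercise retrieval + reranking.
    pytest.skip("requires a running vector database")
```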
The project maintains high code quality standards:
- Formatting: Black and isort for consistent code style
- Type Checking: MyPy with strict configuration
- Linting: Flake8 with Black-compatible settings
- Security: Bandit security analysis
- Pre-commit Hooks: Automated quality checks on commit
# Comprehensive baseline evaluation
./scripts/evaluate.sh --suite=baseline
# Quick evaluation for CI/CD
./scripts/evaluate.sh --suite=quick --sample-size=3
# Custom dataset evaluation
./scripts/evaluate.sh --dataset="/path/to/qa_dataset.jsonl" --metrics="faithfulness,context_precision"
# Parameter tuning evaluation
./scripts/evaluate.sh --suite=baseline --alpha=0.3 --topk=75 --topn=10
TruLens RAG Triad:
- Context Relevance: How relevant retrieved context is to the query
- Groundedness: How well the answer is supported by retrieved context
- Answer Relevance: How relevant the answer is to the original query
Ragas Metrics:
- Faithfulness: Factual consistency of answer with retrieved context
- Answer Relevancy: Semantic relevance of answer to query
- Context Precision: Precision of retrieved context chunks
- Context Recall: Recall of relevant context chunks
- TruLens Triad: ≥ 0.7 (configurable)
- Ragas Faithfulness: ≥ 0.7 (configurable)
- Context Precision: ≥ 0.6 (configurable)
- Answer Consistency: ≥ 90% with fixed seeds
We welcome contributions to the askme project! Please follow these guidelines:
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make your changes following the code style guidelines
- Add or update tests for your changes
- Ensure all tests pass and code quality checks succeed
- Update documentation as needed
- Submit a pull request with a clear description
- Language: All code, documentation, and commit messages must be in English
- Formatting: Use Black (line length 88) and isort for imports
- Type Hints: All functions must include proper type annotations
- Documentation: Follow Google-style docstrings
- Testing: Maintain or improve test coverage (currently ~88%)
Before submitting a PR, ensure:
# All tests pass
uv run pytest
# Code formatting
uv run black askme tests && uv run isort askme tests
# Type checking
uv run mypy askme
# Security check
uv run bandit -r askme
# Basic evaluation passes
./scripts/evaluate.sh --suite=quick
- Milvus Container Startup Issues
  - If Milvus fails to start with port binding errors, try using Weaviate instead:
    - `docker run --name weaviate -p 8081:8080 -p 8082:50051 -d cr.weaviate.io/semitechnologies/weaviate:1.24.1 --host 0.0.0.0 --port 8080 --scheme http`
  - Update `configs/askme.yaml` to use `vector_backend: weaviate`
  - Ensure both HTTP (8081) and gRPC (8082) ports are exposed for Weaviate

- Script Command Syntax
  - Use the `--param=value` format for script parameters:
    - Correct: `./scripts/retrieve.sh "query" --topk=25 --alpha=0.7`
    - Incorrect: `./scripts/retrieve.sh "query" --topk 25 --alpha 0.7`

- Slow Retrieval Performance
  - Check hybrid search parameters (alpha, RRF vs alpha fusion)
  - Verify vector database connection and indexing
  - Monitor embedding service latency

- Poor Reranking Quality
  - Ensure the local BGE reranker model is properly loaded
  - Check Cohere API key and fallback configuration
  - Verify reranking score thresholds

- Memory Issues
  - Adjust batch sizes in `configs/askme.yaml`
  - Use `ASKME_SKIP_HEAVY_INIT=1` for development
  - Monitor model memory usage (BGE-M3 + reranker)

- Evaluation Failures
  - Check TruLens and Ragas library versions
  - Verify the evaluation dataset format (JSONL with required fields; see the sketch after this list)
  - Ensure OpenAI-compatible API access for the evaluation LLM
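If the dataset format is in doubt, the following sketch shows one way to write a JSONL evaluation file; the field names (`question`, `ground_truth`, `contexts`) are assumptions based on common RAG evaluation datasets, so align them with what `scripts/evaluate.sh` actually expects.

```python
import json

# Assumed field names; adjust to match the fields the evaluation suite requires.
examples = [
    {
        "question": "What is BM25?",
        "ground_truth": "BM25 is a sparse ranking function used for keyword retrieval.",
        "contexts": ["BM25 ranks documents by term frequency and length normalization."],
    },
]

with open("qa_dataset.jsonl", "w", encoding="utf-8") as fh:
    for example in examples:
        fh.write(json.dumps(example, ensure_ascii=False) + "\n")
```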
- Submit issues via GitHub Issues
- Review CLAUDE.md for development context
- Check the codebase for implementation details and examples
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- BAAI for BGE-M3 embeddings and reranker models
- Milvus for hybrid search capabilities with sparse BM25 support
- TruLens and Ragas for evaluation frameworks
- FastAPI for the high-performance web framework