A general-purpose RAG (Retrieval-Augmented Generation) application for research document drafting. Upload PDFs and other documents, build a searchable vector database, and chat with multiple LLM providers using your knowledge base as context.
- Multi-Format Support: PDF, Word (.docx), Markdown (.md), Plain Text (.txt), XML
- Document Processing: PyMuPDF for PDFs, python-docx for Word documents
- Vector Database: ChromaDB with OpenAI embeddings for semantic search
- Citation Management: Automatic BibTeX generation with page-level tracking
- MCP Server: Claude Code CLI integration for direct querying
- Multi-Provider Support: OpenAI, Anthropic, Google, DeepSeek, Groq, xAI, OpenRouter
- Sources Panel: Upload documents (PDF, DOCX, TXT, MD, XML), select for RAG context, view chunk statistics
- Chat Panel: Multi-provider LLM chat with streaming responses, model selection, and conversation history
- Project Instructions: Customizable system prompt editor with file upload support for .md/.txt files
```bash
uv venv .venv
uv pip install -r requirements.txt
```

Create a .env file with your API keys:
```
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GOOGLE_API_KEY=your-google-key
DEEPSEEK_API_KEY=your-deepseek-key
GROQ_API_KEY=your-groq-key
GROK_API_KEY=your-grok-key
OPENROUTER_API_KEY=your-openrouter-key
```
Only OPENAI_API_KEY is required for embeddings. Other keys are optional based on which LLM providers you want to use.
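Since only the embeddings key is mandatory, a startup check might look like the following minimal sketch (the function name and key lists are illustrative, not the app's actual code):

```python
import os

# Illustrative startup check: the embeddings key is mandatory,
# provider keys merely enable the corresponding chat providers.
REQUIRED = ["OPENAI_API_KEY"]
OPTIONAL = ["ANTHROPIC_API_KEY", "GOOGLE_API_KEY", "DEEPSEEK_API_KEY",
            "GROQ_API_KEY", "GROK_API_KEY", "OPENROUTER_API_KEY"]

def check_keys(env=os.environ):
    """Raise if a required key is missing; return the enabled optional providers."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required keys: {missing}")
    return [k for k in OPTIONAL if env.get(k)]
```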
Create a sources_raw/ directory and add your documents (supports PDF, DOCX, TXT, MD, XML):
```bash
mkdir -p sources_raw
```

You can organize documents in subdirectories as needed (e.g., by topic, date, or type). The application will recursively discover all documents:
```
sources_raw/
├── your_topic_1/
├── your_topic_2/
└── ...
```
The --phase filter accepts partial folder name matches for targeted searches.
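Partial matching here can be pictured as a case-insensitive substring test over folder names (a sketch of the behaviour, not the actual implementation):

```python
def match_phase(folders, phase):
    """Return folders whose name contains the phase filter (case-insensitive)."""
    return [f for f in folders if phase.lower() in f.lower()]
```

So a filter like `--phase "meeting_1"` would match a folder named `meeting_1_notes`.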
Process all documents:
```bash
uv run python -m app.main build
```

Options:
- `--clear`: Clear existing data before building
- `--fallback`: Use the PyMuPDF parser instead of Docling
Launch Web UI:
```bash
uv run python -m app.main ui
```

Open http://127.0.0.1:7860 in your browser.
Search from Command Line:
```bash
uv run python -m app.main search "international law principles"
uv run python -m app.main search "regulatory framework" --phase "meeting_1" -n 10
```

View Statistics:
```bash
uv run python -m app.main stats
```

Add Single Document:
```bash
uv run python -m app.main add path/to/document.pdf
uv run python -m app.main add path/to/notes.docx
```

Run MCP Server Directly:
```bash
uv run python -m app.main mcp
```

Add to your Claude Code settings (~/.claude.json or project .claude/settings.json):
```json
{
  "mcpServers": {
    "research-rag": {
      "command": "uv",
      "args": ["run", "python", "-m", "mcp_server.server"],
      "cwd": "/path/to/your/research-notebook"
    }
  }
}
```

Once configured, Claude Code can use these tools:
- search_documents: Semantic search with metadata filtering
- get_citation: Get BibTeX citation for a document
- verify_citation: Verify statements against source documents
- list_documents: List all documents in the knowledge base
- get_document_context: Get full content of a specific document
- get_database_stats: Get knowledge base statistics
Use the search_documents tool to find information about
regulatory frameworks in the knowledge base.
```
Source Documents (sources_raw/)
         │ Supports: PDF, DOCX, TXT, MD, XML
         ▼
┌─────────────────────────────────────┐
│  Document Parser (app/parsers/)     │
│  - PDF: PyMuPDF text extraction     │
│  - DOCX: python-docx                │
│  - TXT/MD/XML: Native parsing       │
│  - Character-based chunking         │
└─────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐
│  Vector DB (app/vectordb/)          │
│  - ChromaDB + OpenAI embeddings     │
│  - Metadata: page, section, folder  │
│  - Persistent at data/chroma_db/    │
└─────────────────────────────────────┘
         │
         ├──► MCP Server (mcp_server/)
         │      └─► Claude Code CLI tools
         │
         ├──► Gradio UI (app/ui/)
         │      └─► Document upload/search/chat
         │
         └──► Citations (app/citations/)
                └─► BibTeX generation
```
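The embed-and-retrieve step in the middle of this pipeline can be illustrated with a toy in-memory index; the real application uses OpenAI embeddings and ChromaDB, but the core idea is ranking stored chunk vectors by cosine similarity to the query vector:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(index, query_vec, n=2):
    """index maps chunk_id -> (vector, metadata). Return top-n ids by similarity."""
    ranked = sorted(index, key=lambda cid: cosine(index[cid][0], query_vec),
                    reverse=True)
    return ranked[:n]
```

ChromaDB performs this ranking internally and additionally filters on the stored metadata (page, section, folder).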
```
├── app/              # Main application
│   ├── parsers/      # Document processing (PDF, DOCX, TXT, MD, XML)
│   ├── vectordb/     # ChromaDB + embeddings
│   ├── citations/    # BibTeX management
│   └── ui/           # Gradio interface
├── mcp_server/       # MCP server for Claude Code
├── scripts/          # Utility scripts
├── data/             # Generated data
│   ├── chroma_db/    # Vector database
│   └── citations/    # BibTeX files
├── sources_raw/      # Your source documents (create this directory)
└── outputs/          # Generated outputs
```
| Provider | Models |
|---|---|
| OpenAI | GPT-5.x, GPT-4o, GPT-4o Mini, o3/o1 reasoning |
| Anthropic | Claude Sonnet 4, Opus 4, 3.7, 3.5 |
| Google | Gemini 2.0/1.5 |
| DeepSeek | R1, V3 |
| Groq | Llama 3.3, Mixtral, Gemma2 (free tier) |
| xAI | Grok 2, Grok 2 Vision (free tier) |
| OpenRouter | Llama 3.3 70B, Gemma 2 9B, Mistral 7B (free tier) |
- Chunk size: ~1500 tokens with 200 token overlap
- Sections preserved from markdown headings
- Footnotes attached to parent paragraphs
- Page numbers estimated from document position
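The sliding-window chunking and position-based page estimate above can be sketched as follows (character-based and simplified; the actual chunker works in tokens and also preserves section boundaries, and these function names are illustrative):

```python
def chunk_text(text, size=1500, overlap=200):
    """Split text into fixed-size chunks where consecutive chunks
    share `overlap` characters, so context is not cut mid-thought."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def estimate_page(char_pos, total_chars, total_pages):
    """Estimate a 1-based page number from a chunk's relative position."""
    return min(total_pages, int(char_pos / total_chars * total_pages) + 1)
```

The overlap means the last 200 tokens of one chunk reappear at the start of the next, which keeps sentences that straddle a chunk boundary retrievable from both sides.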
Citations are generated in BibTeX format:
```bibtex
@techreport{example2024report,
  title = {Research Report Title},
  author = {Author Name},
  year = {2024},
  note = {Document ID: abc123def456},
}
```

Docling installation issues: Use the fallback parser:
```bash
uv run python -m app.main build --fallback
```

OpenAI API errors:
Verify your API key in .env and check rate limits.
Empty search results: Ensure the database is built:
```bash
uv run python -m app.main stats
```

MIT License
