A general-purpose RAG (Retrieval-Augmented Generation) application for research document drafting. Upload PDFs and other documents, build a searchable vector database, and chat with multiple LLM providers using your knowledge base as context.
- Multi-Format Support: PDF, Word (.docx), Markdown (.md), Plain Text (.txt), XML
- Document Processing: PyMuPDF for PDFs, python-docx for Word documents
- Vector Database: ChromaDB with OpenAI embeddings for semantic search
- Citation Management: Automatic BibTeX generation with page-level tracking
- MCP Server: Claude Code CLI integration for direct querying
- Multi-Provider Support: OpenAI, Anthropic, Google, DeepSeek, Groq, xAI, OpenRouter
- Sources Panel: Upload documents (PDF, DOCX, TXT, MD, XML), select for RAG context, view chunk statistics
- Chat Panel: Multi-provider LLM chat with streaming responses, model selection, and conversation history
- Project Instructions: Customizable system prompt editor with file upload support for .md/.txt files
```bash
uv venv .venv
uv pip install -r requirements.txt
```

Create a .env file with your API keys:
```
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GOOGLE_API_KEY=your-google-key
DEEPSEEK_API_KEY=your-deepseek-key
GROQ_API_KEY=your-groq-key
GROK_API_KEY=your-grok-key
OPENROUTER_API_KEY=your-openrouter-key
```
Only OPENAI_API_KEY is required for embeddings. Other keys are optional based on which LLM providers you want to use.
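Since only the embeddings key is mandatory, a startup check might look like the following minimal sketch (the function name and key lists are illustrative, not the app's actual code):

```python
import os

# Illustrative startup check: the embeddings key is mandatory,
# provider keys merely enable the corresponding chat providers.
REQUIRED = ["OPENAI_API_KEY"]
OPTIONAL = ["ANTHROPIC_API_KEY", "GOOGLE_API_KEY", "DEEPSEEK_API_KEY",
            "GROQ_API_KEY", "GROK_API_KEY", "OPENROUTER_API_KEY"]

def check_keys(env=os.environ):
    """Raise if a required key is missing; return the enabled optional providers."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required keys: {missing}")
    return [k for k in OPTIONAL if env.get(k)]
```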
Create a sources_raw/ directory and add your documents (supports PDF, DOCX, TXT, MD, XML):
```bash
mkdir -p sources_raw
```

You can organize documents in subdirectories as needed (e.g., by topic, date, or type). The application will recursively discover all documents:
```
sources_raw/
├── your_topic_1/
├── your_topic_2/
└── ...
```
The --phase filter accepts partial folder name matches for targeted searches.
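Partial matching here can be pictured as a case-insensitive substring test over folder names (a sketch of the behaviour, not the actual implementation):

```python
def match_phase(folders, phase):
    """Return folders whose name contains the phase filter (case-insensitive)."""
    return [f for f in folders if phase.lower() in f.lower()]
```

So a filter like `--phase "meeting_1"` would match a folder named `meeting_1_notes`.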
Process all documents:
```bash
uv run python -m app.main build
```

Options:
- `--clear`: Clear existing data before building
- `--fallback`: Use the PyMuPDF parser instead of Docling
Launch Web UI:
```bash
uv run python -m app.main ui
```

Open http://127.0.0.1:7860 in your browser.
Search from Command Line:
```bash
uv run python -m app.main search "international law principles"
uv run python -m app.main search "regulatory framework" --phase "meeting_1" -n 10
```

View Statistics:
```bash
uv run python -m app.main stats
```

Add Single Document:
```bash
uv run python -m app.main add path/to/document.pdf
uv run python -m app.main add path/to/notes.docx
```

Run MCP Server Directly:
```bash
uv run python -m app.main mcp
```

Add to your Claude Code settings (~/.claude.json or project .claude/settings.json):
```json
{
  "mcpServers": {
    "research-rag": {
      "command": "uv",
      "args": ["run", "python", "-m", "mcp_server.server"],
      "cwd": "/path/to/your/research-notebook"
    }
  }
}
```

Once configured, Claude Code can use these tools:
- search_documents: Semantic search with metadata filtering
- get_citation: Get BibTeX citation for a document
- verify_citation: Verify statements against source documents
- list_documents: List all documents in the knowledge base
- get_document_context: Get full content of a specific document
- get_database_stats: Get knowledge base statistics
Use the search_documents tool to find information about
regulatory frameworks in the knowledge base.
```
Source Documents (sources_raw/)
         │ Supports: PDF, DOCX, TXT, MD, XML
         ▼
┌─────────────────────────────────────┐
│  Document Parser (app/parsers/)     │
│  - PDF: PyMuPDF text extraction     │
│  - DOCX: python-docx                │
│  - TXT/MD/XML: Native parsing       │
│  - Character-based chunking         │
└─────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐
│  Vector DB (app/vectordb/)          │
│  - ChromaDB + OpenAI embeddings     │
│  - Metadata: page, section, folder  │
│  - Persistent at data/chroma_db/    │
└─────────────────────────────────────┘
         │
         ├──► MCP Server (mcp_server/)
         │      └─► Claude Code CLI tools
         │
         ├──► Gradio UI (app/ui/)
         │      └─► Document upload/search/chat
         │
         └──► Citations (app/citations/)
                └─► BibTeX generation
```
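The embed-and-retrieve step in the middle of this pipeline can be illustrated with a toy in-memory index; the real application uses OpenAI embeddings and ChromaDB, but the core idea is ranking stored chunk vectors by cosine similarity to the query vector:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(index, query_vec, n=2):
    """index maps chunk_id -> (vector, metadata). Return top-n ids by similarity."""
    ranked = sorted(index, key=lambda cid: cosine(index[cid][0], query_vec),
                    reverse=True)
    return ranked[:n]
```

ChromaDB performs this ranking internally and additionally filters on the stored metadata (page, section, folder).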
```
├── app/              # Main application
│   ├── parsers/      # Document processing (PDF, DOCX, TXT, MD, XML)
│   ├── vectordb/     # ChromaDB + embeddings
│   ├── citations/    # BibTeX management
│   └── ui/           # Gradio interface
├── mcp_server/       # MCP server for Claude Code
├── scripts/          # Utility scripts
├── data/             # Generated data
│   ├── chroma_db/    # Vector database
│   └── citations/    # BibTeX files
├── sources_raw/      # Your source documents (create this directory)
└── outputs/          # Generated outputs
```
| Provider | Models |
|---|---|
| OpenAI | GPT-5.x, GPT-4o, GPT-4o Mini, o3/o1 reasoning |
| Anthropic | Claude Sonnet 4, Opus 4, 3.7, 3.5 |
| Google | Gemini 2.0/1.5 |
| DeepSeek | R1, V3 |
| Groq | Llama 3.3, Mixtral, Gemma2 (free tier) |
| xAI | Grok 2, Grok 2 Vision (free tier) |
| OpenRouter | Llama 3.3 70B, Gemma 2 9B, Mistral 7B (free tier) |
- Chunk size: ~1500 tokens with 200 token overlap
- Sections preserved from markdown headings
- Footnotes attached to parent paragraphs
- Page numbers estimated from document position
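The sliding-window chunking and position-based page estimate above can be sketched as follows (character-based and simplified; the actual chunker works in tokens and also preserves section boundaries, and these function names are illustrative):

```python
def chunk_text(text, size=1500, overlap=200):
    """Split text into fixed-size chunks where consecutive chunks
    share `overlap` characters, so context is not cut mid-thought."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def estimate_page(char_pos, total_chars, total_pages):
    """Estimate a 1-based page number from a chunk's relative position."""
    return min(total_pages, int(char_pos / total_chars * total_pages) + 1)
```

The overlap means the last 200 tokens of one chunk reappear at the start of the next, which keeps sentences that straddle a chunk boundary retrievable from both sides.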
Citations are generated in BibTeX format:
```bibtex
@techreport{example2024report,
  title = {Research Report Title},
  author = {Author Name},
  year = {2024},
  note = {Document ID: abc123def456},
}
```

Docling installation issues: Use the fallback parser:
```bash
uv run python -m app.main build --fallback
```

OpenAI API errors:
Verify your API key in .env and check rate limits.
Empty search results: Ensure the database is built:
```bash
uv run python -m app.main stats
```

MIT License
