An intelligent documentation crawler and RAG (Retrieval-Augmented Generation) agent built using Pydantic AI and Supabase. The agent can crawl documentation websites, store content in a vector database, and provide intelligent answers to user questions by retrieving and analyzing relevant documentation chunks.
- Documentation website crawling and chunking
- GitHub repository content processing
- Conversation history storage and retrieval
- Vector database storage with Supabase
- Semantic search using sentence-transformers embeddings
- RAG-based question answering with DeepSeek models
- Streamlit UI with conversation management
- Support for code block preservation in chunks
- Available as both API endpoint and web interface
- Python 3.11+
- Supabase account and database
- OpenRouter API key
- Streamlit (for web interface)
- Clone the repository
- Install dependencies (recommended to use a Python virtual environment):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

- Set up environment variables:
  - Rename `.env.example` to `.env`
  - Edit `.env` with your API keys and preferences:

```
OPENROUTER_API_KEY=your_openrouter_api_key
SUPABASE_URL=your_supabase_url
SUPABASE_SERVICE_KEY=your_supabase_service_key
LLM_MODEL=deepseek/deepseek-chat  # or your preferred model
EMBEDDING_MODEL=sentence-transformers/multi-qa-mpnet-base-dot-v1  # or your preferred model
TOP_K=5  # number of most relevant documents to retrieve per search
```
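The application reads these values at startup. Below is a minimal sketch of how the settings might be loaded, assuming python-dotenv is installed; the `settings` dict is illustrative and the real module layout may differ:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is available

# Load variables from the .env file into the process environment
load_dotenv()

# Illustrative settings dict; the actual code may organize this differently
settings = {
    "openrouter_api_key": os.getenv("OPENROUTER_API_KEY"),
    "supabase_url": os.getenv("SUPABASE_URL"),
    "supabase_service_key": os.getenv("SUPABASE_SERVICE_KEY"),
    "llm_model": os.getenv("LLM_MODEL", "deepseek/deepseek-chat"),
    "embedding_model": os.getenv("EMBEDDING_MODEL", "sentence-transformers/multi-qa-mpnet-base-dot-v1"),
    "top_k": int(os.getenv("TOP_K", "5")),
}
```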
Execute the SQL commands in site_pages.sql to:
- Create the necessary tables
- Enable vector similarity search
- Set up Row Level Security policies
In Supabase, you can do this by going to the "SQL Editor" tab, pasting the SQL into the editor, and clicking "Run".
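After running the SQL, you can optionally confirm the table is reachable from Python. This is a small smoke-test sketch using the supabase client, with connection details taken from your `.env`:

```python
import os

from dotenv import load_dotenv
from supabase import create_client

load_dotenv()

# Connect with the service key configured in .env
supabase = create_client(os.getenv("SUPABASE_URL"), os.getenv("SUPABASE_SERVICE_KEY"))

# Select one row (or zero rows on a fresh table) from site_pages
response = supabase.table("site_pages").select("id").limit(1).execute()
print(f"site_pages reachable, {len(response.data)} row(s) returned")
```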
To crawl and store documentation in the vector database:
```bash
python crawl_conflowgen_docs.py
```

To crawl and store repository content (markdown files, notebooks, etc.):

```bash
python crawl_conflowgen_github.py
```

This will:
- Fetch URLs from the documentation sitemap
- Crawl each page and split into chunks
- Generate embeddings and store in Supabase
Note: Before running crawl_conflowgen_github.py, make sure the conflowgen GitHub repository is placed at the root level of this project.
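The crawler scripts follow roughly this flow. The sketch below is a simplified illustration rather than the actual implementation; `fetch_sitemap_urls` and `fetch_page_markdown` are hypothetical stand-ins for the real crawling code, and the naive character split stands in for the smarter chunker described later:

```python
import os

from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer
from supabase import create_client

load_dotenv()
supabase = create_client(os.getenv("SUPABASE_URL"), os.getenv("SUPABASE_SERVICE_KEY"))
embedder = SentenceTransformer(os.getenv("EMBEDDING_MODEL", "sentence-transformers/multi-qa-mpnet-base-dot-v1"))


def store_page(url: str, markdown: str, chunk_size: int = 5000) -> None:
    """Split a page into chunks, embed each chunk, and insert it into site_pages."""
    chunks = [markdown[i:i + chunk_size] for i in range(0, len(markdown), chunk_size)]  # naive split for illustration
    for number, chunk in enumerate(chunks):
        embedding = embedder.encode(chunk).tolist()  # 768-dim vector for multi-qa-mpnet-base-dot-v1
        supabase.table("site_pages").insert({
            "url": url,
            "chunk_number": number,
            "title": url,            # the real crawler derives a proper title and summary
            "summary": chunk[:200],
            "content": chunk,
            "metadata": {"source": "docs"},
            "embedding": embedding,
        }).execute()


# for url in fetch_sitemap_urls():               # hypothetical helper
#     store_page(url, fetch_page_markdown(url))  # hypothetical helper
```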
For an interactive web interface to query the documentation:
```bash
streamlit run streamlit_ui.py
```

The interface will be available at http://localhost:8501
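Under the hood, answering a question starts with a vector search over the chunks stored in the site_pages table described below. The following is a hedged sketch of that retrieval step; it assumes site_pages.sql exposes a similarity function named `match_site_pages`, which (along with its parameters) is an assumption rather than a documented API:

```python
import os

from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer
from supabase import create_client

load_dotenv()
supabase = create_client(os.getenv("SUPABASE_URL"), os.getenv("SUPABASE_SERVICE_KEY"))
embedder = SentenceTransformer(os.getenv("EMBEDDING_MODEL", "sentence-transformers/multi-qa-mpnet-base-dot-v1"))

question = "How do I configure a container flow?"
query_embedding = embedder.encode(question).tolist()

# Hypothetical RPC; adjust to whatever function site_pages.sql actually defines
result = supabase.rpc("match_site_pages", {
    "query_embedding": query_embedding,
    "match_count": int(os.getenv("TOP_K", "5")),
}).execute()

for row in result.data:
    print(row["url"], row["content"][:100])
```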
The Supabase database uses the following schema:
```sql
CREATE TABLE site_pages (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    url TEXT,
    chunk_number INTEGER,
    title TEXT,
    summary TEXT,
    content TEXT,
    metadata JSONB,
    embedding VECTOR(768)
);
```

The system maintains conversation history with:
- Persistent storage in Supabase
- Automatic title generation from first message
- Timestamp tracking
- Message serialization/deserialization
Execute conversations.sql to set up the conversation table before using the Streamlit UI.
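As an illustration, persisting a conversation could look like the following. The column names shown here are assumptions based on the features above; conversations.sql is the authoritative definition of the table:

```python
import json
import os
from datetime import datetime, timezone

from dotenv import load_dotenv
from supabase import create_client

load_dotenv()
supabase = create_client(os.getenv("SUPABASE_URL"), os.getenv("SUPABASE_SERVICE_KEY"))

messages = [
    {"role": "user", "content": "How do I generate a synthetic container flow?"},
    {"role": "assistant", "content": "The documentation suggests starting with ..."},
]

# Assumed columns: title, messages, created_at; check conversations.sql for the real schema
supabase.table("conversations").insert({
    "title": messages[0]["content"][:80],                   # title derived from the first message
    "messages": json.dumps(messages),                       # serialized message history
    "created_at": datetime.now(timezone.utc).isoformat(),   # timestamp tracking
}).execute()
```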
You can configure chunking parameters in crawl_conflowgen_docs.py:

```python
chunk_size = 5000  # Characters per chunk
```

The chunker intelligently preserves:
- Code blocks
- Paragraph boundaries
- Sentence boundaries
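A minimal sketch of that splitting strategy, preferring code-fence, paragraph, and sentence boundaries in that order; this is simplified relative to the real chunker:

```python
def split_into_chunks(text: str, chunk_size: int = 5000) -> list[str]:
    """Split text into chunks of roughly chunk_size characters,
    preferring to break at code fences, then paragraphs, then sentences."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Prefer a boundary in the second half of the window so chunks stay reasonably large
            for boundary in ("\n```\n", "\n\n", ". "):
                cut = window.rfind(boundary)
                if cut > chunk_size // 2:
                    end = start + cut + len(boundary)
                    break
        chunks.append(text[start:end].strip())
        start = end
    return [chunk for chunk in chunks if chunk]
```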
- crawl_conflowgen_docs.py: Documentation website crawler
- crawl_conflowgen_github.py: GitHub repo content processor
- pydantic_ai_rag.py: RAG agent with tool integrations
- streamlit_ui.py: Interactive web interface with chat history
- site_pages.sql: Vector storage table setup
- conversations.sql: Chat history table setup
- .env.sample: Environment configuration template
- requirements.txt: Python dependencies
The system includes robust error handling for:
- Network failures during crawling
- Content conversion errors
- API rate limits and timeouts
- Database connection issues
- Embedding generation failures
- Invalid URLs or content
- Conversation serialization errors
- Vector search failures
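For transient failures such as rate limits or network errors, a simple retry-with-backoff wrapper along these lines can be used; this is a sketch, not the project's exact error-handling code:

```python
import time


def with_retries(func, *args, max_attempts: int = 3, base_delay: float = 1.0, **kwargs):
    """Call func, retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # in practice, catch specific network/API errors
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)


# Example: wrap an embedding call that may hit a rate limit
# embedding = with_retries(embedder.encode, chunk)
```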