An intelligent documentation crawler and RAG (Retrieval-Augmented Generation) agent built using Pydantic AI and Supabase. The agent can crawl documentation websites, store content in a vector database, and provide intelligent answers to user questions by retrieving and analyzing relevant documentation chunks.
- Documentation website crawling and chunking
- GitHub repository content processing
- Conversation history storage and retrieval
- Vector database storage with Supabase
- Semantic search using sentence-transformers embeddings
- RAG-based question answering with DeepSeek models
- Streamlit UI with conversation management
- Support for code block preservation in chunks
- Available as both API endpoint and web interface
- Python 3.11+
- Supabase account and database
- OpenRouter API key
- Streamlit (for web interface)
- Clone the repository
- Install dependencies (recommended to use a Python virtual environment):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

- Set up environment variables:
  - Rename `.env.example` to `.env`
  - Edit `.env` with your API keys and preferences:

```
OPENROUTER_API_KEY=your_openrouter_api_key
SUPABASE_URL=your_supabase_url
SUPABASE_SERVICE_KEY=your_supabase_service_key
LLM_MODEL=deepseek/deepseek-chat  # or your preferred model
EMBEDDING_MODEL=sentence-transformers/multi-qa-mpnet-base-dot-v1  # or your preferred model
TOP_K=5  # number of most relevant documents to retrieve per search
```
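The application reads these values at startup. Below is a minimal sketch of how the settings might be loaded, assuming python-dotenv is installed; the `settings` dict is illustrative and the real module layout may differ:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is available

# Load variables from the .env file into the process environment
load_dotenv()

# Illustrative settings dict; the actual code may organize this differently
settings = {
    "openrouter_api_key": os.getenv("OPENROUTER_API_KEY"),
    "supabase_url": os.getenv("SUPABASE_URL"),
    "supabase_service_key": os.getenv("SUPABASE_SERVICE_KEY"),
    "llm_model": os.getenv("LLM_MODEL", "deepseek/deepseek-chat"),
    "embedding_model": os.getenv("EMBEDDING_MODEL", "sentence-transformers/multi-qa-mpnet-base-dot-v1"),
    "top_k": int(os.getenv("TOP_K", "5")),
}
```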
Execute the SQL commands in site_pages.sql to:
- Create the necessary tables
- Enable vector similarity search
- Set up Row Level Security policies
In Supabase, you can do this by going to the "SQL Editor" tab, pasting the SQL into the editor, and clicking "Run".
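After running the SQL, you can optionally confirm the table is reachable from Python. This is a small smoke-test sketch using the supabase client, with connection details taken from your `.env`:

```python
import os

from dotenv import load_dotenv
from supabase import create_client

load_dotenv()

# Connect with the service key configured in .env
supabase = create_client(os.getenv("SUPABASE_URL"), os.getenv("SUPABASE_SERVICE_KEY"))

# Select one row (or zero rows on a fresh table) from site_pages
response = supabase.table("site_pages").select("id").limit(1).execute()
print(f"site_pages reachable, {len(response.data)} row(s) returned")
```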
To crawl and store documentation in the vector database:
```bash
python crawl_conflowgen_docs.py
```

To crawl and store repository content (markdown files, notebooks, etc.):

```bash
python crawl_conflowgen_github.py
```

This will:
- Fetch URLs from the documentation sitemap
- Crawl each page and split into chunks
- Generate embeddings and store in Supabase
Note: Before running crawl_conflowgen_github.py, make sure the conflowgen GitHub repository is placed at the root level of this project.
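The crawler scripts follow roughly this flow. The sketch below is a simplified illustration rather than the actual implementation; `fetch_sitemap_urls` and `fetch_page_markdown` are hypothetical stand-ins for the real crawling code, and the naive character split stands in for the smarter chunker described later:

```python
import os

from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer
from supabase import create_client

load_dotenv()
supabase = create_client(os.getenv("SUPABASE_URL"), os.getenv("SUPABASE_SERVICE_KEY"))
embedder = SentenceTransformer(os.getenv("EMBEDDING_MODEL", "sentence-transformers/multi-qa-mpnet-base-dot-v1"))


def store_page(url: str, markdown: str, chunk_size: int = 5000) -> None:
    """Split a page into chunks, embed each chunk, and insert it into site_pages."""
    chunks = [markdown[i:i + chunk_size] for i in range(0, len(markdown), chunk_size)]  # naive split for illustration
    for number, chunk in enumerate(chunks):
        embedding = embedder.encode(chunk).tolist()  # 768-dim vector for multi-qa-mpnet-base-dot-v1
        supabase.table("site_pages").insert({
            "url": url,
            "chunk_number": number,
            "title": url,            # the real crawler derives a proper title and summary
            "summary": chunk[:200],
            "content": chunk,
            "metadata": {"source": "docs"},
            "embedding": embedding,
        }).execute()


# for url in fetch_sitemap_urls():               # hypothetical helper
#     store_page(url, fetch_page_markdown(url))  # hypothetical helper
```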
For an interactive web interface to query the documentation:
```bash
streamlit run streamlit_ui.py
```

The interface will be available at http://localhost:8501
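Under the hood, answering a question starts with a vector search over the chunks stored in the site_pages table described below. The following is a hedged sketch of that retrieval step; it assumes site_pages.sql exposes a similarity function named `match_site_pages`, which (along with its parameters) is an assumption rather than a documented API:

```python
import os

from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer
from supabase import create_client

load_dotenv()
supabase = create_client(os.getenv("SUPABASE_URL"), os.getenv("SUPABASE_SERVICE_KEY"))
embedder = SentenceTransformer(os.getenv("EMBEDDING_MODEL", "sentence-transformers/multi-qa-mpnet-base-dot-v1"))

question = "How do I configure a container flow?"
query_embedding = embedder.encode(question).tolist()

# Hypothetical RPC; adjust to whatever function site_pages.sql actually defines
result = supabase.rpc("match_site_pages", {
    "query_embedding": query_embedding,
    "match_count": int(os.getenv("TOP_K", "5")),
}).execute()

for row in result.data:
    print(row["url"], row["content"][:100])
```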
The Supabase database uses the following schema:
```sql
CREATE TABLE site_pages (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    url TEXT,
    chunk_number INTEGER,
    title TEXT,
    summary TEXT,
    content TEXT,
    metadata JSONB,
    embedding VECTOR(768)
);
```

The system maintains conversation history with:
- Persistent storage in Supabase
- Automatic title generation from first message
- Timestamp tracking
- Message serialization/deserialization
Execute conversations.sql to set up the conversation table before using the Streamlit UI.
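As an illustration, persisting a conversation could look like the following. The column names shown here are assumptions based on the features above; conversations.sql is the authoritative definition of the table:

```python
import json
import os
from datetime import datetime, timezone

from dotenv import load_dotenv
from supabase import create_client

load_dotenv()
supabase = create_client(os.getenv("SUPABASE_URL"), os.getenv("SUPABASE_SERVICE_KEY"))

messages = [
    {"role": "user", "content": "How do I generate a synthetic container flow?"},
    {"role": "assistant", "content": "The documentation suggests starting with ..."},
]

# Assumed columns: title, messages, created_at; check conversations.sql for the real schema
supabase.table("conversations").insert({
    "title": messages[0]["content"][:80],                   # title derived from the first message
    "messages": json.dumps(messages),                       # serialized message history
    "created_at": datetime.now(timezone.utc).isoformat(),   # timestamp tracking
}).execute()
```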
You can configure chunking parameters in crawl_conflowgen_docs.py:

```python
chunk_size = 5000  # Characters per chunk
```

The chunker intelligently preserves:
- Code blocks
- Paragraph boundaries
- Sentence boundaries
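A minimal sketch of that splitting strategy, preferring code-fence, paragraph, and sentence boundaries in that order; this is simplified relative to the real chunker:

```python
def split_into_chunks(text: str, chunk_size: int = 5000) -> list[str]:
    """Split text into chunks of roughly chunk_size characters,
    preferring to break at code fences, then paragraphs, then sentences."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Prefer a boundary in the second half of the window so chunks stay reasonably large
            for boundary in ("\n```\n", "\n\n", ". "):
                cut = window.rfind(boundary)
                if cut > chunk_size // 2:
                    end = start + cut + len(boundary)
                    break
        chunks.append(text[start:end].strip())
        start = end
    return [chunk for chunk in chunks if chunk]
```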
- crawl_conflowgen_docs.py: Documentation website crawler
- crawl_conflowgen_github.py: GitHub repo content processor
- pydantic_ai_rag.py: RAG agent with tool integrations
- streamlit_ui.py: Interactive web interface with chat history
- site_pages.sql: Vector storage table setup
- conversations.sql: Chat history table setup
- .env.sample: Environment configuration template
- requirements.txt: Python dependencies
The system includes robust error handling for:
- Network failures during crawling
- Content conversion errors
- API rate limits and timeouts
- Database connection issues
- Embedding generation failures
- Invalid URLs or content
- Conversation serialization errors
- Vector search failures
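For transient failures such as rate limits or network errors, a simple retry-with-backoff wrapper along these lines can be used; this is a sketch, not the project's exact error-handling code:

```python
import time


def with_retries(func, *args, max_attempts: int = 3, base_delay: float = 1.0, **kwargs):
    """Call func, retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # in practice, catch specific network/API errors
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)


# Example: wrap an embedding call that may hit a rate limit
# embedding = with_retries(embedder.encode, chunk)
```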