A Python pipeline for processing documentation files, cleaning them, and storing them in a vector database for efficient retrieval and searching.
- Document Loading: Supports MDX and MD files with flexible directory traversal
- Content Cleaning:
  - Removes redundant whitespace from code blocks
  - Normalizes document formatting
  - Preserves code block metadata and language specifications
- Smart Text Splitting: Uses tiktoken-based splitting to produce 1000-token chunks with a 200-token overlap
- Vector Storage: Stores document embeddings in PostgreSQL for efficient similarity search
- OpenAI Integration: Leverages OpenAI's embedding models for high-quality vector representations
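The content-cleaning behavior listed above can be illustrated with a minimal sketch. It is not taken from the pipeline's source: the function name `clean_code_blocks` and the exact rules (strip trailing spaces, drop leading and trailing blank lines, collapse repeated blank lines) are assumptions about what "redundant whitespace" means here. The fence's info string, such as the language and any metadata after it, is kept unchanged.

```python
import re

# Three backticks, built at runtime so this example can itself sit inside a fence.
FENCE = "`" * 3

# Opening fence with optional info string, lazily matched body, closing fence.
FENCE_RE = re.compile(rf"^{FENCE}([^\n]*)\n(.*?)^{FENCE}[ \t]*$", re.DOTALL | re.MULTILINE)


def clean_code_blocks(markdown: str) -> str:
    """Hypothetical helper: tidy whitespace inside fenced code blocks only."""

    def _clean(match: re.Match) -> str:
        info = match.group(1).strip()  # e.g. 'js title="app.js"', preserved as-is
        lines = [line.rstrip() for line in match.group(2).split("\n")]

        # Drop blank lines at the start and end of the block.
        while lines and not lines[0]:
            lines.pop(0)
        while lines and not lines[-1]:
            lines.pop()

        # Collapse runs of blank lines into a single blank line.
        cleaned: list[str] = []
        for line in lines:
            if line == "" and cleaned and cleaned[-1] == "":
                continue
            cleaned.append(line)

        return "\n".join([FENCE + info, *cleaned, FENCE])

    return FENCE_RE.sub(_clean, markdown)
```

Text outside fenced blocks passes through untouched, so prose formatting would be handled by a separate normalization step.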
- Python 3.11 or higher
- PostgreSQL database
- OpenAI API key
- Clone the repository:
git clone https://github.com/lokeswaran-aj/docs-sync-pipeline.git
cd docs-sync-pipeline
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
cp .env.example .env
Edit .env with your configuration:
DATABASE_URL=postgresql://user:password@localhost:5432/your_db
OPENAI_API_KEY=your_openai_api_key
EMBEDDING_MODEL=text-embedding-3-small
- Place your documentation files in the target directory:
mkdir -p repos/next.js/docs
# Add your .md or .mdx files to this directory
- Run the processing pipeline:
python src/main.py
The pipeline consists of several key components:
- Document Loading: Uses DirectoryLoader to recursively load markdown files
- Content Processing:
  - Cleans and normalizes document content
  - Preserves important formatting and metadata
- Text Splitting:
  - Chunks documents into optimal sizes (1000 tokens)
  - Maintains context with overlapping chunks (200 tokens)
- Vector Storage:
  - Generates embeddings using OpenAI's models
  - Stores vectors in PostgreSQL using pgvector
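To make the chunk size and overlap figures concrete, here is a minimal sliding-window sketch over a list of tokens. It is an illustration only: the pipeline uses a tiktoken-based splitter, which may also respect separators such as paragraph breaks, whereas this hypothetical `chunk_tokens` function works on any already-tokenized sequence and cuts at fixed positions.

```python
def chunk_tokens(tokens: list, chunk_size: int = 1000, overlap: int = 200) -> list[list]:
    """Hypothetical helper: split tokens into windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    # Each new window starts `step` tokens after the previous one.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Integers stand in for token IDs in this example.
chunks = chunk_tokens(list(range(2500)))
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With these defaults each window begins 800 tokens after the previous one, so the last 200 tokens of one chunk reappear at the start of the next and text near a boundary keeps some of its surrounding context.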
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request