# Document Processing and Vector Storage Pipeline

A Python pipeline for processing documentation files, cleaning them, and storing them in a vector database for efficient retrieval and searching.

License: MIT · Python 3.11+

## 🚀 Features

- **Document Loading**: Supports MDX and MD files with flexible directory traversal
- **Content Cleaning**:
  - Removes redundant whitespace from code blocks
  - Normalizes document formatting
  - Preserves code block metadata and language specifications
- **Smart Text Splitting**: Uses tiktoken-based splitting for optimal chunk sizes
- **Vector Storage**: Stores document embeddings in PostgreSQL for efficient similarity search
- **OpenAI Integration**: Leverages OpenAI's embedding models for high-quality vector representations
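The code-block cleaning described above can be sketched roughly as follows. This is an illustrative example, not the pipeline's actual implementation: `clean_code_blocks` and its regex are hypothetical, and the fence string is built programmatically only so the example itself renders cleanly.

```python
import re

FENCE = "`" * 3  # a markdown code fence, built to avoid a literal fence here

# Matches a fenced code block, capturing the language tag and the body.
FENCE_RE = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)

def clean_code_blocks(text: str) -> str:
    """Strip trailing whitespace and blank leading/trailing lines inside
    fenced code blocks, keeping the language specifier intact."""
    def _clean(match: re.Match) -> str:
        lang, body = match.group(1), match.group(2)
        lines = [line.rstrip() for line in body.splitlines()]
        # Drop empty lines at the start and end of the block
        while lines and not lines[0]:
            lines.pop(0)
        while lines and not lines[-1]:
            lines.pop()
        return FENCE + lang + "\n" + "\n".join(lines) + "\n" + FENCE
    return FENCE_RE.sub(_clean, text)

doc = "Intro\n\n" + FENCE + "python\n\nprint('hi')   \n\n" + FENCE + "\n"
print(clean_code_blocks(doc))
```

Because the substitution rebuilds the fence with the captured language tag, metadata such as `python` or `mdx` on the opening fence survives the cleanup.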

## 📋 Prerequisites

- Python 3.11 or higher
- PostgreSQL database
- OpenAI API key

## 🛠️ Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/lokeswaran-aj/docs-sync-pipeline.git
   cd docs-sync-pipeline
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables:

   ```bash
   cp .env.example .env
   ```

   Edit `.env` with your configuration:

   ```env
   DATABASE_URL=postgresql://user:password@localhost:5432/your_db
   OPENAI_API_KEY=your_openai_api_key
   EMBEDDING_MODEL=text-embedding-3-small
   ```
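A minimal sketch of how the pipeline might read this configuration at startup. The variable names match `.env.example` above, but the `load_config` helper itself is illustrative, not part of the actual codebase:

```python
import os

def load_config() -> dict:
    """Read pipeline settings from the environment. Required secrets fail
    fast with a KeyError; the embedding model falls back to a default."""
    return {
        "database_url": os.environ["DATABASE_URL"],      # required
        "openai_api_key": os.environ["OPENAI_API_KEY"],  # required
        "embedding_model": os.getenv("EMBEDDING_MODEL", "text-embedding-3-small"),
    }

# Example values for demonstration only (real values come from your .env)
os.environ.setdefault("DATABASE_URL", "postgresql://user:password@localhost:5432/your_db")
os.environ.setdefault("OPENAI_API_KEY", "sk-test")
print(load_config()["embedding_model"])  # → text-embedding-3-small
```

Failing fast on missing secrets at startup is generally preferable to discovering them mid-run, after documents have already been loaded and split.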

## 💻 Usage

1. Place your documentation files in the target directory:

   ```bash
   mkdir -p repos/next.js/docs
   # Add your .md or .mdx files to this directory
   ```

2. Run the processing pipeline:

   ```bash
   python src/main.py
   ```

## 🏗️ Architecture

The pipeline consists of several key components:

1. **Document Loading**: Uses `DirectoryLoader` to recursively load markdown files
2. **Content Processing**:
   - Cleans and normalizes document content
   - Preserves important formatting and metadata
3. **Text Splitting**:
   - Chunks documents into optimal sizes (1000 tokens)
   - Maintains context with overlapping chunks (200 tokens)
4. **Vector Storage**:
   - Generates embeddings using OpenAI's models
   - Stores vectors in PostgreSQL using pgvector
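The splitting step above can be sketched in plain Python. The real pipeline counts tokens with tiktoken; here whitespace-separated words stand in for tokens, and `split_with_overlap` is an illustrative helper showing how a 200-token overlap preserves context across chunk boundaries:

```python
def split_with_overlap(tokens, chunk_size=1000, overlap=200):
    """Split a token list into chunks of at most `chunk_size`, where each
    chunk repeats the last `overlap` tokens of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by 800 tokens per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk reached the end of the document
    return chunks

words = [f"w{i}" for i in range(2500)]
chunks = split_with_overlap(words, chunk_size=1000, overlap=200)
print([len(c) for c in chunks])  # → [1000, 1000, 900]
```

The last 200 tokens of each chunk reappear at the start of the next, so a sentence falling on a boundary is embedded in full at least once.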

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 🙏 Acknowledgments

- LangChain for document processing tools
- pgvector for vector similarity search in PostgreSQL
