LLM Embedding Generation and PostgreSQL Vector Database

Production-ready Python package for generating embeddings with 🤗 Hugging Face models and storing them in PostgreSQL with the pgvector extension.

Author: https://www.github.com/deburky

Python 3.10+ · PostgreSQL 17 · pgvector

🚀 Quick Start

# 1. Install dependencies
uv sync

# 2. Set up PostgreSQL with pgvector (see setup guide below)
brew install postgresql@17 pgvector
brew services start postgresql@17

# 3. Run the demo
export PG_PASSWORD=your_actual_password
uv run python demo/test_real_embeddings.py

📖 Table of Contents

  1. Features
  2. Installation
  3. Quick Demo
  4. PostgreSQL Setup
  5. Usage Examples
  6. AWS Integration
  7. Performance

✨ Features

✅ Real LLM Embeddings - Generate 384-1536 dimensional vectors using sentence-transformers
✅ PostgreSQL Integration - Store embeddings with enterprise-grade reliability
✅ Vector Similarity Search - Fast semantic search with IVFFlat/HNSW indexing
✅ Batch Processing - Efficient bulk operations with pg_copy and pg_insert
✅ Apple Silicon Support - Optimized for MPS (Metal Performance Shaders)
✅ AWS Integration - S3 storage and RDS support
✅ Production Ready - Comprehensive error handling and logging

🔧 Installation

This project uses uv for fast Python package management:

# Clone the repository
git clone https://github.com/deburky/pgvector_db.git
cd pgvector_db

# Install dependencies
uv sync

# Optional: Install with development dependencies
uv sync --extra dev --extra test

🎯 Quick Demo

Try our interactive demos to see the package in action:

# 1. Basic connectivity test
uv run python demo/test_db_connection.py

# 2. Real LLM embeddings generation and storage
uv run python demo/test_real_embeddings.py

# 3. Interactive similarity search demo
uv run python demo/similarity_search_demo.py

📝 For detailed instructions, see demo/README.md

📊 Embeddings

An embedding is a mapping from discrete objects (words, sentences, documents) to points in a continuous vector space. This enables neural networks and machine learning models to process text semantically.

We use open-source models from 🤗 Hugging Face and sentence-transformers to generate high-quality embeddings for downstream tasks like similarity search, classification, and retrieval-augmented generation (RAG).

Download and Cache Models

from pgvector_db.utils import download_llm

# Sentence Transformers model
model_path = download_llm(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    save_path="./models",
    model_type="sentence-transformers"
)

# Hugging Face model
hf_model_path = download_llm(
    model_name="bert-base-uncased",
    save_path="./models",
    model_type="huggingface"
)

Generate Embeddings and Store to S3

from pgvector_db.generate_embeddings import EmbeddingGenerator

# Initialize with local model
generator = EmbeddingGenerator(
    model_path="./models/sentence-transformers-all-MiniLM-L6-v2",
    model_type="sentence-transformers",
    batch_size=32,
    device="mps"  # Use Apple Silicon GPU
)

texts = ["Hello world", "Vector databases are powerful"]
embeddings = generator.generate_embeddings(texts)

# Store to S3 with partitioning for Athena
generator.save_to_parquet(
    texts, 
    embeddings, 
    "s3://my-bucket/embeddings/",
    partition_cols=["generation_date"]
)

🗄️ PostgreSQL Setup

Prerequisites

  • macOS: Homebrew installed
  • Windows: Use Docker, WSL, or native PostgreSQL installer
  • Linux: Use your distribution's package manager

Install PostgreSQL 17 and pgvector

# Install PostgreSQL 17 (latest with pgvector support)
brew install postgresql@17

# Install pgvector extension  
brew install pgvector

# Verify installation
/opt/homebrew/opt/postgresql@17/bin/pg_config --sharedir

Start PostgreSQL and Create Database

# Start PostgreSQL 17
brew services start postgresql@17

# Add to PATH for easier access
export PATH="/opt/homebrew/opt/postgresql@17/bin:$PATH"

# Create database and enable vector extension
psql postgres -c "CREATE DATABASE vector_db;"
psql vector_db -c "CREATE EXTENSION vector;"

Create Database Schema

-- Connect to vector_db
\c vector_db

-- Create documents table (384 dims for all-MiniLM-L6-v2)
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    text TEXT,
    embedding VECTOR(384)
);

-- Create application user
CREATE USER py_pg_user WITH PASSWORD 'your_secure_password';
GRANT ALL PRIVILEGES ON DATABASE vector_db TO py_pg_user;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO py_pg_user;
GRANT USAGE, SELECT ON SEQUENCE documents_id_seq TO py_pg_user;

-- Create index for fast similarity search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
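
The `lists = 100` value above is a starting point, not a fixed rule. The pgvector documentation suggests roughly rows / 1000 lists for tables of up to 1M rows and sqrt(rows) beyond that, with sqrt(lists) as an initial `ivfflat.probes` setting at query time. A minimal sketch of that heuristic follows; the helper name `suggest_ivfflat_params` is hypothetical and not part of this package.

```python
import math


def suggest_ivfflat_params(row_count: int) -> tuple[int, int]:
    """Return (lists, probes) starting points for an IVFFlat index."""
    if row_count <= 1_000_000:
        # Up to 1M rows: about one list per 1000 rows
        lists = max(1, row_count // 1000)
    else:
        # Beyond 1M rows: about sqrt(rows) lists
        lists = max(1, round(math.sqrt(row_count)))
    # Query-time probes: about sqrt(lists)
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes


for rows in (100_000, 4_000_000):
    lists, probes = suggest_ivfflat_params(rows)
    print(f"{rows} rows -> lists={lists}, probes={probes}")
```

These are only starting values to tune against your own recall and latency measurements. The pgvector documentation also recommends creating an IVFFlat index after the table already contains data.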

Environment Variables

export PG_PASSWORD=your_secure_password
export PGUSER=py_pg_user  # Optional
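
On the Python side, these variables can be read once and validated before any connection is attempted. The sketch below is illustrative only: `read_pg_settings` is a hypothetical helper, not part of the package, and falling back to `py_pg_user` when `PGUSER` is unset is an assumption based on the user created in the schema above.

```python
import os


def read_pg_settings(env=os.environ):
    """Read connection settings from the environment, failing fast on a missing password."""
    password = env.get("PG_PASSWORD")
    if not password:
        raise RuntimeError("PG_PASSWORD is not set; export it before running the demos")
    return {
        "db_user": env.get("PGUSER", "py_pg_user"),  # PGUSER is optional
        "db_password": password,
    }


# A plain dict stands in for os.environ so the example runs anywhere
settings = read_pg_settings({"PG_PASSWORD": "example-only"})
print(settings["db_user"])
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by the database driver.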

💻 Usage Examples

Basic Database Configuration

import os

from pgvector_db.utils import DBConfigLocal

db_config = DBConfigLocal(
    db_name="vector_db",
    db_user="py_pg_user", 
    db_password=os.getenv('PG_PASSWORD'),
    db_host="localhost",
    schema_name="public",
    table_name="documents"
)

Store Embeddings in PostgreSQL

import pandas as pd
from pgvector_db.pg_insert import pg_insert

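# Prepare data with embeddings as lists.
# embedding1 and embedding2 are assumed to be array-like vectors (for example,
# rows of the generate_embeddings output); db_config is the DBConfigLocal above.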
dataset = pd.DataFrame({
    'text': ["Sample document", "Another document"],
    'embedding': [embedding1.tolist(), embedding2.tolist()]
})

# Insert into PostgreSQL
pg_insert(dataset, db_config, batch_size=1000)

Vector Similarity Search

-- Find most similar documents to a query
WITH query_embedding AS (
    SELECT embedding FROM documents WHERE text LIKE '%database%' LIMIT 1
)
SELECT 
    d.text,
    1 - (d.embedding <=> qe.embedding) as similarity
FROM documents d, query_embedding qe
WHERE d.text NOT LIKE '%database%'
ORDER BY d.embedding <=> qe.embedding
LIMIT 5;
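
In pgvector, `<=>` returns cosine distance, so `1 - distance` is cosine similarity and ordering by ascending distance puts the closest matches first. The sketch below reproduces that ranking in plain Python on made-up three-dimensional vectors. It only illustrates the arithmetic; the function names are hypothetical and the database does this work itself, normally through the index.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, rows, k=5):
    """Rank (text, embedding) rows by descending similarity, i.e. ascending distance."""
    scored = [(text, 1.0 - cosine_distance(emb, query)) for text, emb in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy data for illustration; real embeddings would have 384 dimensions
rows = [
    ("postgres stores vectors", [0.9, 0.1, 0.0]),
    ("cats sleep all day", [0.0, 0.2, 0.9]),
    ("indexes speed up search", [0.7, 0.6, 0.1]),
]
for text, similarity in top_k([1.0, 0.0, 0.0], rows, k=2):
    print(f"{similarity:.3f}  {text}")
```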

AWS RDS PostgreSQL Integration

The package also works with AWS RDS PostgreSQL instances. RDS integration involves the following steps:

  1. Connection Setup

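    import os

    # Assumption: DBConfigRDS lives alongside DBConfigLocal in pgvector_db.utils
    from pgvector_db.utils import DBConfigRDS

    # Standard password authentication (used in our examples)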
    db_config = DBConfigRDS(
        db_name="postgres",
        db_user="admin_user",
        db_password=os.environ.get('ADMIN_PASS'),
        db_host="your-pgvector-rds.region.rds.amazonaws.com",
        schema_name="public",
        table_name="documents"
    )
    
    # Alternative: IAM authentication (more secure for production)
    # db_config = DBConfigRDS(
    #     db_name="postgres",
    #     db_user="admin_user", 
    #     db_host="your-pgvector-rds.region.rds.amazonaws.com",
    #     use_iam=True,  # Enable IAM authentication
    #     aws_region="us-east-1"
    # )
  2. Vector Operations

    # Create table with vector column if needed
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        text TEXT,
        embedding vector(384)
    );
    """)
    
    # Create vector index for fast similarity search
    cursor.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx 
    ON documents USING ivfflat (embedding vector_cosine_ops) 
    WITH (lists = 100);
    """)
  3. Vector Similarity Search

    # Important: Use string representation with explicit type casting
    vector_str = f"'[{','.join(map(str, query_embedding))}]'"
    
    # Execute search with proper vector operator
    query = f"""
    SELECT 
        id, 
        text, 
        1 - (embedding <=> {vector_str}::vector) AS similarity
    FROM 
        documents
    ORDER BY 
        embedding <=> {vector_str}::vector
    LIMIT 3;
    """
  4. Using the AWS RDS Examples

    # Set password for RDS connection
    export ADMIN_PASS=your_actual_rds_password
    
    # Run the AWS RDS examples
    python demo/test_aws_rds_connection.py
    python demo/aws_rds_similarity_search_demo.py

For more details, explore the demo examples in the demo/ directory.
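
The similarity search in step 3 interpolates the vector into the SQL string with an f-string. An alternative is to pass the vector literal as a bound parameter and let the driver handle quoting. The sketch below assumes a DB-API driver with `%s` placeholders (psycopg2, for instance); `to_vector_literal` and `build_search_query` are hypothetical helpers, not part of this package.

```python
def to_vector_literal(embedding):
    """Format a sequence of numbers as pgvector's text form, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


def build_search_query(embedding, limit=3):
    """Return (sql, params) for a cosine-similarity search using placeholders."""
    sql = (
        "SELECT id, text, 1 - (embedding <=> %s::vector) AS similarity "
        "FROM documents "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s;"
    )
    literal = to_vector_literal(embedding)
    # The literal is bound twice: once for the score, once for the ordering
    return sql, (literal, literal, limit)


sql, params = build_search_query([0.1, 0.2, 0.3])
print(params[0])
# With an open cursor (not created here): cursor.execute(sql, params)
```

Keeping values out of the SQL text avoids quoting mistakes and lets the same statement be reused for every query vector.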

⚑ Performance Options

Bulk Operations (Recommended)

For large datasets, use the COPY-based workflow:

from pgvector_db.pg_copy import pg_copy

# Efficient bulk insert using PostgreSQL COPY
pg_copy(dataset, db_config)
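
To give a feel for what a COPY-based load involves, the sketch below serialises rows into an in-memory CSV buffer in which each embedding is written in pgvector's text form. This is not the package's `pg_copy` implementation, which is not shown here; `rows_to_copy_buffer` is a hypothetical helper, and the psycopg2 call in the final comment assumes that driver is in use.

```python
import csv
import io


def rows_to_copy_buffer(rows):
    """Serialise (text, embedding) pairs as CSV for COPY ... FROM STDIN WITH (FORMAT csv)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for text, embedding in rows:
        # pgvector accepts vectors in the text form [x1,x2,...]
        vector = "[" + ",".join(str(float(x)) for x in embedding) + "]"
        writer.writerow([text, vector])  # fields containing commas are quoted
    buffer.seek(0)
    return buffer


buffer = rows_to_copy_buffer(
    [("Sample document", [0.1, 0.2]), ("Another, with comma", [0.3, 0.4])]
)
print(buffer.getvalue())
# With psycopg2 (connection not created here), the buffer could be streamed via:
# cursor.copy_expert("COPY documents (text, embedding) FROM STDIN WITH (FORMAT csv)", buffer)
```

Streaming one buffer through COPY avoids a round trip per row, which is why it is usually preferred for large loads.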

Real-time Operations

For single records or small batches:

from pgvector_db.pg_insert import pg_insert

# Row-by-row insertion for real-time applications
pg_insert(dataset, db_config, batch_size=100)
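
The `batch_size` argument suggests that rows are sent in groups rather than all at once. How `pg_insert` does this internally is not shown here; the sketch below only illustrates the general idea of splitting an iterable into fixed-size batches, using a hypothetical `iter_batches` helper.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    # islice pulls the next batch_size items; an empty list ends the loop
    while batch := list(islice(iterator, batch_size)):
        yield batch


sizes = [len(batch) for batch in iter_batches(range(250), batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Smaller batches keep each statement and transaction short, at the cost of more round trips to the database.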

🛠️ Development

Project Structure

pgvector_db/
├── src/pgvector_db/           # Main package
│   ├── generate_embeddings.py # LLM embedding generation
│   ├── pg_copy.py             # Bulk COPY operations
│   ├── pg_insert.py           # INSERT operations
│   └── utils.py               # Database configs & utilities
├── demo/                      # Working examples
│   ├── README.md              # Demo instructions
│   ├── test_real_embeddings.py # Complete pipeline test
│   └── similarity_search_demo.py # Interactive demo
├── notebooks/                 # Jupyter examples
└── tests/                     # Unit tests

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=pgvector_db

# Run the end-to-end demo script (requires a running PostgreSQL instance)
uv run python demo/test_real_embeddings.py

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📚 Resources

📄 License

MIT License - see LICENSE file for details.


⭐ Star this repo if you found it helpful!
