dingqiangliu/GenerateEmbedding

Vertica Embedding UDx

A Python user-defined extension (UDx) for Vertica that generates embeddings from text or image content, using either a local model or a remote embedding API.

Features

  • Multiple Providers: Support for local embeddings and remote APIs
  • Text and Image Embeddings: Generate embeddings for text content using a sentence transformer model, or for both text content and images using CLIP-based models
  • Flexible Configuration: Easy to configure for different embedding services
  • Batch Processing: High-performance batch processing for both local and remote providers
  • Vertica UDx SDK Compliance: Follows official Vertica Python UDx patterns and works well with Vertica SQL queries

Package Structure

/opt/vertica/packages/GenerateEmbedding/
├── package.conf              # Package metadata and version info
├── generate_embedding.py     # Main Python UDx implementation
├── config_example.json       # Configuration template
├── README.md                 # This documentation
├── requirements.txt          # Python package dependencies
├── ddl/
│   ├── install.sql           # Installation script
│   ├── isinstalled.sql       # Installation verification
│   └── uninstall.sql         # Cleanup and removal script
└── examples/
    ├── basic.sql             # Basic usage examples
    ├── similarity_search.sql # Advanced similarity search examples
    ├── image_search.sql      # Image embedding and similarity search examples
    └── images/               # Sample images for image embedding examples

Setup Instructions

1. Prerequisites

  • Access to your chosen embedding provider

2. Installation

  1. Copy the package directory to each of your Vertica servers:

    # Copy the GenerateEmbedding directory (including generate_embedding.py) to the Vertica packages location
    scp -r ./GenerateEmbedding/ vertica_server:/opt/vertica/packages/
  2. Install required Python packages on each Vertica server:

    # Optional: download the packages into wheels/ on a machine with internet access, to prepare for offline installation
    # python3 -m pip download -r requirements.txt --dest wheels/
    
    # Install the packages from the local wheels/ directory
    # (to install from the internet instead, run: python3 -m pip install -r requirements.txt)
    python3 -m pip install --no-index --find-links wheels/ -r requirements.txt
  3. Configure the embedding providers on each Vertica server:

    # Copy and edit the configuration file
    cp config_example.json /opt/vertica/config/generate_embedding_config.json
    # Edit the file with your API keys and settings
  4. Register the UDx in Vertica:

    -- Run the registration script (paths are already configured)
    \i ddl/install.sql
  5. Verify Installation:

    -- Check if the UDx is properly installed
    \i ddl/isinstalled.sql
    -- Should return 't' (true) if installation is successful
  6. Explore Examples:

    -- View usage examples
    \i examples/basic.sql
  7. Uninstall (if needed):

    -- WARNING: This will remove all embedding functions
    \i ddl/uninstall.sql

Package Dependencies

The GenerateEmbedding UDx requires the following Python packages:

  • sentence-transformers - For generating embeddings using the local model
  • requests - For remote API calls

Note: The requests library is already available in the Vertica environment.

For production deployment, you can either install directly from the internet or install offline in two steps:

  • Install packages directly: python3 -m pip install -r requirements.txt
  • Offline, step 1, download packages on a connected machine: python3 -m pip download -r requirements.txt --dest wheels/
  • Offline, step 2, install from the downloaded packages: python3 -m pip install --no-index --find-links wheels/ -r requirements.txt

3. Configuration

Update the /opt/vertica/config/generate_embedding_config.json file with your specific settings:

{
  "logging": {
    "level": "INFO"
  },
  "embedding_config": {
    "default_provider": "local",
    "batch_size": 20,
    "max_content_length": 10000,
    "local": {
      "model": "/opt/vertica/packages/GenerateEmbedding/model"
    },
    "remote": {
      "model": "nomic-embed-text-v2-moe",
      "api_key": "your-api-key-here-eg-ollama",
      "api_url": "http://172.24.80.1:11434/api/embed",
      "timeout": 60
    }
  }
}
  • Local: Uses a sentence transformer model loaded from the configured model directory
  • Remote: Calls your custom embedding API endpoint, such as Ollama or OpenAI
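
The configuration file above does not have to list every key if missing values fall back to the documented defaults. The following is a minimal Python sketch of that idea, not the UDx's actual loader: the names DEFAULTS, deep_merge, and load_config are hypothetical, and the assumption is that omitted keys take the defaults listed in this README.

```python
import copy
import json

# Defaults copied from the values documented in this README.
DEFAULTS = {
    "logging": {"level": "INFO"},
    "embedding_config": {
        "default_provider": "local",
        "batch_size": 20,
        "max_content_length": 10000,
        "local": {"model": "/opt/vertica/packages/GenerateEmbedding/model"},
        "remote": {
            "model": "nomic-embed-text-v2-moe",
            "api_key": None,
            "api_url": "http://172.24.80.1:11434/api/embed",
            "timeout": 60,
        },
    },
}


def deep_merge(base, override):
    """Return a copy of base with override applied recursively (hypothetical helper)."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def load_config(text):
    """Parse a JSON config string and fill in missing keys from DEFAULTS."""
    return deep_merge(DEFAULTS, json.loads(text))


# A partial config that only switches the provider and shortens the timeout.
config = load_config(
    '{"embedding_config": {"default_provider": "remote", "remote": {"timeout": 10}}}'
)
print(config["embedding_config"]["default_provider"])   # remote
print(config["embedding_config"]["remote"]["timeout"])  # 10
print(config["embedding_config"]["batch_size"])         # 20
```

Merging recursively keeps sibling keys such as api_url intact when only one nested value is overridden.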

Usage Examples

Basic Usage

-- Generate text embedding using UDx SDK with default parameters
SELECT generate_embedding('Hello world') AS embedding;

-- Generate text embedding with specific provider
SELECT generate_embedding('Hello world' USING PARAMETERS provider='local') AS embedding;

-- Generate image embedding (treat input as image file path)
SELECT generate_embedding('/path/to/image.jpg' USING PARAMETERS is_image=true, provider='local') AS embedding;

-- Generate embedding with customized remote API
SELECT generate_embedding('Hello world' USING PARAMETERS provider='remote'
    , model='your-model', api_key='your-key', api_url='your-url') AS embedding;

-- Generate text embeddings for table data
SELECT id, text_content, generate_embedding(text_content) AS embedding
FROM documents
LIMIT 10;

-- Generate image embeddings for table data
SELECT id, image_path, generate_embedding(image_path USING PARAMETERS is_image=true, provider='local') AS embedding
FROM images
LIMIT 10;

Working with Embeddings

-- Check embedding dimensions
SELECT ARRAY_LENGTH(generate_embedding('test')) AS embedding_dim;

-- Calculate cosine similarity between embeddings
SELECT
    product_id,
    product_name,
    description,
    price,
    COSINE_SIMILARITY(product_embedding, generate_embedding('I love high-quality audio equipment and music technology') ) AS similarity_score
FROM products
ORDER BY similarity_score DESC
LIMIT 3;
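
For reference, the similarity score in the query above is the standard cosine similarity: the dot product of two vectors divided by the product of their lengths. The Python sketch below shows the formula only; it is not the SQL function's implementation, and returning 0.0 for a zero vector is an assumption made here to avoid division by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Scores range from -1 (opposite direction) to 1 (same direction), which is why the query sorts by similarity_score in descending order.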

Advanced Usage

-- Generate embeddings with custom parameters
SELECT generate_embedding(text_content USING PARAMETERS
      provider='remote',
      model='custom-model',
      api_key='your-api-key',
      api_url='https://your-api.com/embeddings'
  ) AS embedding
FROM documents;

Configuration Options

Logging Configuration

  • level: Logging level ('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL') - default: 'INFO'

Global Settings

  • default_provider: Default provider type ('local' or 'remote') - default: 'local'
  • batch_size: Number of records to process in each batch - default: 20
  • max_content_length: Maximum content length in characters - default: 10000
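
To illustrate how batch_size and max_content_length could interact, here is a minimal Python sketch. The function prepare_batches is hypothetical, and truncating over-length content (rather than rejecting it) is an assumption; the README does not say which the UDx does.

```python
def prepare_batches(texts, batch_size=20, max_content_length=10000):
    """Truncate each text to max_content_length and yield lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for text in texts:
        # Assumption: over-length content is truncated, not rejected.
        batch.append(text[:max_content_length])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


rows = ["row %d" % i for i in range(45)]
print([len(b) for b in prepare_batches(rows)])  # [20, 20, 5]
```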

Local Provider

  • model: Path to sentence transformer model directory - default: '/opt/vertica/packages/GenerateEmbedding/model'
  • device: Device to run the model on ('CPU' or 'GPU') - default: 'CPU'

Remote Provider

  • model: Model name for remote API - default: 'nomic-embed-text-v2-moe'
  • api_key: API key for remote providers - default: NULL
  • api_url: API URL for remote providers - default: 'http://172.24.80.1:11434/api/embed' (Ollama)
  • timeout: API request timeout in seconds - default: 60
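
As an illustration of how these remote settings might map onto an HTTP request, the sketch below builds a request in the shape accepted by Ollama's /api/embed endpoint (a JSON body with model and input) and parses either an Ollama-style or an OpenAI-style response. This is not the UDx's actual client code: the function names are hypothetical, sending the key as a Bearer token is an assumption, and the standard library's urllib is used here instead of requests so the example has no dependencies.

```python
import json
import urllib.request


def build_embed_request(texts, model, api_url, api_key=None):
    """Build a POST request carrying a batch of texts (hypothetical helper)."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the API key is sent as a Bearer token.
        headers["Authorization"] = "Bearer " + api_key
    return urllib.request.Request(api_url, data=payload, headers=headers, method="POST")


def parse_embed_response(body):
    """Extract the list of embeddings from an Ollama-style or OpenAI-style JSON body."""
    data = json.loads(body)
    if "embeddings" in data:   # Ollama /api/embed response shape
        return data["embeddings"]
    if "data" in data:         # OpenAI /v1/embeddings response shape
        return [item["embedding"] for item in data["data"]]
    raise ValueError("unrecognized embedding response")


request = build_embed_request(
    ["Hello world"],
    model="nomic-embed-text-v2-moe",
    api_url="http://172.24.80.1:11434/api/embed",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request would then be a call such as urllib.request.urlopen(request, timeout=60), with the timeout taken from the configuration.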

UDx Architecture

This implementation follows the official Vertica Python UDx SDK patterns:

  • GenerateEmbedding: Main scalar function class, a subclass of vertica_sdk.ScalarFunction
  • GenerateEmbeddingFactory: Factory class, a subclass of vertica_sdk.ScalarFunctionFactory

Parameters

The UDx supports the following parameters:

  • is_image: Whether to treat input as image file paths (true/false) - default: false
  • provider: Provider type ('local', 'remote') - default: 'local'
  • model: Model name - default: '/opt/vertica/packages/GenerateEmbedding/model' for local, 'nomic-embed-text-v2-moe' for remote
  • api_key: API key for remote providers - default: NULL
  • api_url: API URL for remote providers - default: 'http://172.24.80.1:11434/api/embed' (Ollama)

The UDx SDK approach provides:

  • Better integration with Vertica's query processing
  • Proper error handling and logging
  • Type safety and validation
  • Runtime parameter configuration

Error Handling

The UDx includes comprehensive error handling:

  • Empty content: Returns descriptive error message
  • API failures: Detailed error logging with specific HTTP status codes
  • Invalid providers: Raises descriptive error messages
  • Network timeouts: Configurable timeout handling
  • Response validation: Validates embedding format and dimensions
  • Fallback mechanism: Returns zero vector on critical failures
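
The validation and fallback behavior listed above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the UDx's code: the function names and the logger name are hypothetical, empty content is assumed to raise an error, and every other failure is assumed to count as "critical" and fall back to a zero vector.

```python
import logging

logger = logging.getLogger("generate_embedding")  # hypothetical logger name


def validate_embedding(vector, expected_dim):
    """Check that vector is a numeric sequence of the expected length."""
    if not isinstance(vector, (list, tuple)):
        raise ValueError("embedding must be a list of numbers")
    if len(vector) != expected_dim:
        raise ValueError(
            "expected %d dimensions, got %d" % (expected_dim, len(vector))
        )
    for value in vector:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("embedding contains a non-numeric value")
    return [float(value) for value in vector]


def embed_with_fallback(embed_fn, text, expected_dim):
    """Embed text with embed_fn; return a zero vector if embedding or validation fails."""
    if not text or not text.strip():
        # Assumption: empty content is reported as an error, not embedded.
        raise ValueError("content is empty")
    try:
        return validate_embedding(embed_fn(text), expected_dim)
    except Exception:
        # Assumption: any failure here is treated as critical.
        logger.exception("embedding failed; returning zero vector")
        return [0.0] * expected_dim


def failing_provider(text):
    raise TimeoutError("simulated network timeout")


print(embed_with_fallback(lambda text: [1, 2, 3], "hello", 3))
print(embed_with_fallback(failing_provider, "hello", 3))
```

A zero vector keeps the query running, but rows that received one carry no similarity information, so it is worth checking the log after a large run.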

Performance Considerations

  • Batch Processing: Automatic batch processing with configurable batch size (default: 20) for optimal performance
  • Local Model: Uses sentence-transformers with batch encoding for efficient local embedding generation
  • Remote APIs: Multi-threaded parallel API calls for maximum throughput with remote providers
  • Caching: Implement caching for frequently used text
  • Rate Limiting: Be aware of API rate limits for external providers
  • Configuration: Use external configuration files for easy management
  • Input Validation: Content length limits prevent memory issues
  • Connection Pooling: Reuse connections for multiple requests
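
Two of the points above, parallel remote calls and caching, can be sketched in a few lines of Python. This is a hedged illustration rather than the UDx's implementation: embed_batches_parallel, make_cached, and the max_workers and maxsize values are hypothetical, and the fake provider functions stand in for real API calls.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


def embed_batches_parallel(batches, embed_batch, max_workers=4):
    """Run embed_batch over each batch on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input batches.
        results = list(pool.map(embed_batch, batches))
    return [vector for batch_result in results for vector in batch_result]


def make_cached(embed_one, maxsize=1024):
    """Wrap a single-text embedding function with an in-memory LRU cache."""
    @lru_cache(maxsize=maxsize)
    def cached(text):
        # Tuples are immutable, so callers cannot corrupt cached entries.
        return tuple(embed_one(text))
    return cached


def fake_embed_batch(batch):
    """Stand-in for a remote call: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_batches_parallel([["a", "bb"], ["ccc"]], fake_embed_batch)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

Threads suit remote providers because the work is dominated by waiting on the network; keep max_workers low enough to stay within the provider's rate limits.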

Security

  • API Keys: Store API keys securely, not in plain text
  • Network: Ensure secure network connectivity to embedding APIs
  • Data Privacy: Review data privacy requirements for your embedding provider

Troubleshooting

Common Issues

  1. Module not found: Ensure the required Python packages are installed on every Vertica server
  2. Permission denied: Check file permissions for the Python UDx file
  3. API errors: Verify API keys and network connectivity
  4. Memory issues: Large embeddings may require increased memory allocation
  5. Logging level: Set logging level to DEBUG in configuration for detailed troubleshooting

Debug Mode

Configure logging level in the configuration file:

{
  "logging": {
    "level": "DEBUG"
  },
  "embedding_config": {
    ...
  }
}

Available logging levels:

  • DEBUG: Detailed information for diagnosing problems
  • INFO: Confirmation that things are working as expected (default)
  • WARNING: Something unexpected happened
  • ERROR: Serious problem, some functionality may not work
  • CRITICAL: Critical error, program may be unable to continue
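
A configured level name can be mapped onto Python's logging module along these lines. This is a small sketch, not the UDx's code: configure_logging and the logger name are hypothetical, and falling back to INFO for an unrecognized level is an assumption.

```python
import logging

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def configure_logging(config):
    """Set the logger level from config["logging"]["level"], defaulting to INFO."""
    name = str(config.get("logging", {}).get("level", "INFO")).upper()
    if name not in VALID_LEVELS:
        # Assumption: unknown level names fall back to the documented default.
        name = "INFO"
    logger = logging.getLogger("generate_embedding")  # hypothetical logger name
    logger.setLevel(getattr(logging, name))
    return logger


logger = configure_logging({"logging": {"level": "DEBUG"}})
print(logging.getLevelName(logger.level))  # DEBUG
```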

License

This project is provided as-is for educational and development purposes.

Support

For issues and questions, please refer to:

  • Vertica documentation for UDx setup
  • Your embedding provider's API documentation
  • Python requests library documentation
