A comprehensive framework for evaluating and selecting the most suitable embedding model for various NLP tasks. This tool provides detailed performance metrics, cost analysis, and resource utilization comparisons across different embedding providers including OpenAI, VoyageAI, Hugging Face models, and Sentence Transformers.
- Comprehensive embedding model evaluation framework
- Support for multiple embedding providers:
- OpenAI (text-embedding-3-small, text-embedding-3-large)
- VoyageAI (voyage-2, voyage-3)
- Hugging Face models (BERT, DistilBERT)
- Sentence Transformers
- Detailed performance metrics and resource monitoring
- Cost analysis and estimation
- Task-specific model recommendations
- Automated model selection based on requirements
- Python 3.8+
- CUDA-compatible GPU (optional, but recommended for better performance)
- API keys for OpenAI and VoyageAI (if using these providers)
- Clone the repository:
git@github.com:Ziv2/Embedding-selection-framework.git
cd embedding-selection-framework- Install required packages:
pip install -r requirements.txt- Set up environment variables for API keys:
export OPENAI_API_KEY='your-openai-api-key'
export VOYAGE_API_KEY='your-voyage-api-key'from embedding_evaluator import EnhancedEmbeddingEvaluator
from embedding_providers import OpenAIEmbedding, VoyageAIEmbedding, HuggingFaceEmbedding
# Initialize providers
providers = [
HuggingFaceEmbedding('bert-base-uncased'),
OpenAIEmbedding('text-embedding-3-small'),
VoyageAIEmbedding('voyage-2')
]
# Compare embeddings
results = EnhancedEmbeddingEvaluator.compare_embeddings(
embedding_providers=providers,
classifier_factory=lambda: SVC(kernel='linear'),
train_texts=train_texts,
train_labels=train_labels,
test_texts=test_texts,
test_labels=test_labels
)from nlp_embeddings import NLPEmbeddingHandler
handler = NLPEmbeddingHandler()
result = handler.get_embeddings('sentiment_analysis', "This product is amazing!")The framework provides comprehensive metrics including:
- Accuracy, Precision, Recall, F1 score
- Processing speed (tokens/second)
- Memory usage (RAM and GPU)
- Cost estimates
- Embedding dimensions
- Total processing time
The framework supports various configuration options through environment variables or a config file:
MODEL_SPECS = {
'OpenAI_text-embedding-3-small': {
'dim': 1536,
'cost_per_1k_tokens': 0.00002,
'max_tokens': 8191,
'suggested_batch': 2048
},
# ... more model specifications
}Model Accuracy F1-Score Tokens/sec Cost/1M tokens
---------------------------- ----------- ----------- ------------- ----------------
OpenAI_text-embedding-3-small 0.945 0.943 5000 $0.02
VoyageAI_voyage-2 0.932 0.930 4500 $0.12
HF_bert-base-uncased 0.918 0.915 2000 $0.00
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add some AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Hugging Face for their transformer models
- Sentence Transformers team
- OpenAI for their embedding models and API
- VoyageAI for their embedding services