Conversation

@jjmachan (Member) opened this pull request:

No description provided.

@shahules786 (Member) left a comment:


LGTM

@shahules786 shahules786 merged commit 6a8fcee into main May 12, 2023
@jjmachan jjmachan deleted the fix/batching-metrics branch May 13, 2023 05:33
jjmachan pushed a commit that referenced this pull request Sep 7, 2024
NirantK added a commit to ScaledFocus/ragas that referenced this pull request Aug 19, 2025
jjmachan added a commit that referenced this pull request Oct 17, 2025:

## Issue Link / Problem Description

This PR introduces a comprehensive, reusable testing infrastructure to
streamline the migration of legacy metrics to the modern collections
architecture.

**Problem**: Migrating metrics requires consistent validation that new
implementations match legacy behavior, but this process was ad-hoc and
time-consuming without standardized tooling.

**Solution**: A complete testing framework with:
- Step-by-step migration guide
- Configuration-driven validation notebook
- Shared test utilities and fixtures
- Dataset loading with safe fallbacks

## Changes Made

### 📋 Migration Documentation
- **`tests/e2e/metrics_migration/plan-for-metrics-migration.md`**
  - Complete migration workflow (pre-migration → implementation → testing → finalization)
  - Code templates for prompts, output models, and metric classes
  - Validation criteria for different metric types (LLM-based, embeddings-based, deterministic)

### 📓 Testing Notebook
- **`tests/e2e/metrics_migration/metric_score_diff.ipynb`** + **`tests/notebooks/metric_score_diff.ipynb`**
  - General-purpose, configuration-driven (edit one cell to test any metric; see the config-cell sketch below)
  - Concurrent dataset comparison (Amnesty QA + FIQA)
  - Statistical analysis with 7-plot visualizations
  - Automated validation criteria checking
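
For illustration, the single configuration cell might look roughly like this. Every name and default below is a hypothetical stand-in (drawn from the tolerances and sample counts quoted later in this description), not the notebook's actual variables:

```python
# Hypothetical config cell: names and defaults are illustrative only.
METRIC_NAME = "context_recall"             # metric under migration
TOLERANCE = 0.2                            # per-case tolerance for LLM-based metrics
MEAN_ABS_DIFF_LIMIT = 0.15                 # stricter criterion on the mean |diff|
DATASETS = {"amnesty_qa": 20, "fiqa": 30}  # dataset -> number of samples
MAX_CONCURRENCY = 8                        # parallel scoring requests (assumed default)
```

Editing this one cell should be all that is needed to point the notebook at a different metric.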

### 🧪 Test Infrastructure
- **`tests/e2e/metrics_migration/base_migration_test.py`**
  - `BaseMigrationTest` class with reusable test methods (usage sketch below)
  - `run_e2e_compatibility_test()` - Compare legacy vs. modern scores within a tolerance
  - `run_metric_specific_test()` - Custom behavior validation
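
A rough sketch of how a metric-specific test might subclass `BaseMigrationTest`. The method signature, the LLM wiring, and the `modern_context_recall` fixture are assumptions inferred from the bullets above, not copied from the source:

```python
# Hypothetical subclass; run_e2e_compatibility_test()'s real signature
# may differ from what is assumed here.
from ragas.metrics import context_recall  # legacy module-level metric instance

from tests.e2e.metrics_migration.base_migration_test import BaseMigrationTest


class TestContextRecallMigration(BaseMigrationTest):
    """Check that the modern implementation tracks legacy scores."""

    def test_e2e_compatibility(self, legacy_llm, modern_context_recall):
        # modern_context_recall is a hypothetical fixture providing the
        # new collections-based metric wired to the modern LLM.
        context_recall.llm = legacy_llm  # attach the fixture-provided LLM
        self.run_e2e_compatibility_test(
            legacy_metric=context_recall,
            modern_metric=modern_context_recall,
            tolerance=0.2,  # per-case tolerance for LLM-based metrics
        )
```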
  
- **`tests/e2e/metrics_migration/conftest.py`**
  - Shared pytest fixtures: `legacy_llm`, `modern_llm`, `legacy_embeddings`, `modern_embeddings`
  - Automatic API key validation and graceful skipping (fixture sketch below)
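
The fixtures are presumably shaped something like this; the `check_api_key()` return type and the skip messages are assumptions:

```python
# Hypothetical fixture shapes; the real conftest.py may wire models
# and skip messages differently.
import pytest

from tests.utils.llm_setup import check_api_key, create_legacy_llm, create_modern_llm


@pytest.fixture
def legacy_llm():
    if not check_api_key():  # assumed to return a bool
        pytest.skip("API key not set; skipping live-LLM migration test")
    return create_legacy_llm()


@pytest.fixture
def modern_llm():
    if not check_api_key():
        pytest.skip("API key not set; skipping live-LLM migration test")
    return create_modern_llm()
```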

- **`tests/e2e/metrics_migration/test_utils.py`**
  - Helper utilities for migration tests

### 🏭 Component Factories
- **`tests/utils/llm_setup.py`**
  - `create_legacy_llm()` / `create_modern_llm()` - LLM initialization for both architectures
  - `create_legacy_embeddings()` / `create_modern_embeddings()` - Embeddings initialization
  - `check_api_key()` - Validation utility (factory sketch below)
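
A minimal sketch of the factory shapes, assuming OpenAI-backed defaults via LangChain; the modern-architecture counterparts are omitted because their constructors aren't shown in this PR:

```python
# Hypothetical factory bodies; the model name and wrapper choices are
# assumptions, not the file's actual contents.
import os


def check_api_key(env_var: str = "OPENAI_API_KEY") -> bool:
    """Report whether the key required for live LLM calls is present."""
    return bool(os.getenv(env_var))


def create_legacy_llm(model: str = "gpt-4o-mini"):
    """Wrap a LangChain chat model for the legacy metrics architecture."""
    from langchain_openai import ChatOpenAI
    from ragas.llms import LangchainLLMWrapper

    return LangchainLLMWrapper(ChatOpenAI(model=model))


def create_legacy_embeddings():
    """Wrap LangChain embeddings for the legacy architecture."""
    from langchain_openai import OpenAIEmbeddings
    from ragas.embeddings import LangchainEmbeddingsWrapper

    return LangchainEmbeddingsWrapper(OpenAIEmbeddings())
```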

### ⚡ Optimized Comparison Engine
- **`tests/utils/metric_comparison.py`**
  - `compare_metrics()` - Concurrent execution with configurable parallelism (sketch below)
  - Support for both parallel (independent) and sequential (dependent) metric execution
  - `ComparisonResult` dataclass with statistical analysis and pandas export
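
The core of the comparison engine can be sketched as a semaphore-capped `asyncio.gather`; every name, field, and signature below is an assumption about the shape, not the actual implementation:

```python
# Hypothetical sketch of the comparison engine's core.
import asyncio
from dataclasses import dataclass
from statistics import mean


@dataclass
class ComparisonResult:
    legacy_scores: list[float]
    modern_scores: list[float]

    @property
    def abs_diffs(self) -> list[float]:
        return [abs(a - b) for a, b in zip(self.legacy_scores, self.modern_scores)]

    def mean_abs_diff(self) -> float:
        return mean(self.abs_diffs)

    def fraction_within(self, tolerance: float = 0.2) -> float:
        """Fraction of samples whose |diff| falls within the tolerance."""
        return sum(d <= tolerance for d in self.abs_diffs) / len(self.abs_diffs)


async def compare_metrics(samples, score_legacy, score_modern, max_concurrency: int = 8):
    """Score each sample with both implementations under capped concurrency."""
    sem = asyncio.Semaphore(max_concurrency)

    async def score_pair(sample):
        async with sem:
            # Sequential per sample here; independent metrics could instead
            # run both calls in parallel with another asyncio.gather.
            return await score_legacy(sample), await score_modern(sample)

    pairs = await asyncio.gather(*(score_pair(s) for s in samples))
    legacy, modern = (list(col) for col in zip(*pairs))
    return ComparisonResult(legacy, modern)
```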

### 📊 Dataset Utilities
- **`tests/e2e/test_dataset_utils.py`**
  - `load_amnesty_dataset_safe()` - Amnesty QA dataset with local fallback (sketch below)
  - `load_fiqa_dataset_safe()` - FIQA dataset with local fallback
  - Embedded sample data for offline/CI testing
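
The safe-fallback pattern presumably looks something like this; the dataset id, config, and the embedded sample are illustrative placeholders:

```python
# Hypothetical fallback pattern; the Hub id/config and the embedded
# sample below are placeholders, not the file's actual contents.
def load_amnesty_dataset_safe(limit: int = 20) -> list[dict]:
    """Load Amnesty QA from the Hub, falling back to embedded samples."""
    try:
        from datasets import load_dataset

        ds = load_dataset("explodinggradients/amnesty_qa", "english_v3", split="eval")
        return list(ds)[:limit]
    except Exception:
        # Offline/CI fallback: a small set of embedded samples.
        embedded = [
            {
                "user_input": "Which rights does the report focus on?",
                "retrieved_contexts": ["<context snippet>"],
                "response": "<model answer>",
                "reference": "<ground-truth answer>",
            }
        ]
        return embedded[:limit]
```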

### ⚙️ Configuration
- **`.gitignore`** - Added `plan/` directory for migration planning docs
- **`CLAUDE.md`** - Updated with migration workflow guidance

## Testing

### Validation Status
- [x] Automated tests: the infrastructure provides the framework for all future migrations
- [x] Manual validation: tested with the Context Recall migration

### Test Results (Context Recall Migration)
| Dataset    | Samples | Mean \|diff\| | Within tolerance | Status |
|------------|---------|---------------|------------------|--------|
| Amnesty QA | 20      | 0.0708        | 90% (18/20)      | ✅     |
| FIQA       | 30      | 0.0667        | 93.3% (28/30)    | ✅     |

**Validation Criteria Met:**
- ✅ Mean |diff| < 0.15 (stricter than the per-case tolerance)
- ✅ ≥90% of samples within the 0.2 tolerance for LLM-based metrics
- ✅ No systematic bias (mean signed diff < 0.05)
- ✅ Domain generalization confirmed across datasets (see the checker sketch below)
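
These criteria translate directly into a small automated check over paired score lists; a sketch, with names assumed and thresholds taken from the list above:

```python
# Hypothetical checker mirroring the criteria above.
from statistics import mean


def check_validation_criteria(legacy: list[float], modern: list[float]) -> dict[str, bool]:
    abs_diffs = [abs(a - b) for a, b in zip(legacy, modern)]
    signed = [b - a for a, b in zip(legacy, modern)]
    return {
        "mean_abs_diff_lt_0.15": mean(abs_diffs) < 0.15,
        "ge_90pct_within_0.2": sum(d <= 0.2 for d in abs_diffs) / len(abs_diffs) >= 0.9,
        "no_systematic_bias": abs(mean(signed)) < 0.05,
    }
```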

## Impact & Benefits

**Before this PR:**
- Ad-hoc testing with no standardized approach
- Difficult to validate score consistency
- Time-consuming manual comparison
- No reusable infrastructure

**After this PR:**
1. **Faster migrations** - Standardized workflow reduces implementation
time
2. **Higher quality** - Dataset-based validation catches issues early
3. **Consistency** - All metrics follow same validation process
4. **Reusability** - Shared utilities eliminate boilerplate
5. **Documentation** - Clear guide enables team-wide contribution

## Architecture

```
tests/
├── e2e/metrics_migration/
│   ├── plan-for-metrics-migration.md    # Migration guide
│   ├── metric_score_diff.ipynb          # Testing notebook
│   ├── base_migration_test.py           # Base test class
│   ├── conftest.py                      # Shared fixtures
│   └── test_utils.py                    # Test helpers
├── e2e/test_dataset_utils.py            # Dataset loading
└── utils/
    ├── llm_setup.py                     # Component factories
    └── metric_comparison.py             # Comparison engine
```

## Next Steps

This infrastructure enables the following PRs:
1. **PR #2** (depends on this): Context Recall migration
2. **PR #3** (depends on PR #2): Context Precision migration
3. **Future PRs**: Remaining metric migrations using this framework

---

**Note**: This PR contains **only infrastructure and documentation** -
no actual metric implementations. Metric migrations follow in subsequent
PRs.
