feat: Add reusable testing infrastructure for metrics migration #2370
Issue Link / Problem Description
This PR introduces a comprehensive, reusable testing infrastructure to streamline the migration of legacy metrics to the modern collections architecture.
Problem: Migrating metrics requires consistent validation that new implementations match legacy behavior, but this process was ad-hoc and time-consuming without standardized tooling.
Solution: A complete testing framework covering a reusable base test class, component factories, an optimized comparison engine, and dataset utilities, detailed under Changes Made below.
Changes Made
📋 Migration Documentation
- `tests/e2e/metrics_migration/plan-for-metrics-migration.md`

📓 Testing Notebook
- `tests/e2e/metrics_migration/metric_score_diff.ipynb`
- `tests/notebooks/metric_score_diff.ipynb`
🧪 Test Infrastructure
- `tests/e2e/metrics_migration/base_migration_test.py`
  - `BaseMigrationTest` class with reusable test methods (usage sketch after this list):
    - `run_e2e_compatibility_test()` - compare legacy vs. modern scores within a tolerance
    - `run_metric_specific_test()` - custom behavior validation
- `tests/e2e/metrics_migration/conftest.py` - shared fixtures: `legacy_llm`, `modern_llm`, `legacy_embeddings`, `modern_embeddings`
- `tests/e2e/metrics_migration/test_utils.py`
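As a usage sketch, a concrete migration test might subclass `BaseMigrationTest` roughly as follows. The class, method, and fixture names are from this PR; the metric under test, the method signature, async-ness, and the tolerance value are illustrative assumptions.

```python
# Hypothetical migration test built on BaseMigrationTest. Names taken from
# this PR: BaseMigrationTest, run_e2e_compatibility_test, legacy_llm,
# modern_llm. Signature and tolerance are assumptions.
import pytest

from tests.e2e.metrics_migration.base_migration_test import BaseMigrationTest


class TestContextRecallMigration(BaseMigrationTest):
    """Checks that the modern context recall matches legacy scores."""

    @pytest.mark.asyncio
    async def test_e2e_compatibility(self, legacy_llm, modern_llm):
        # Assumed: scores the same samples with both implementations and
        # asserts the per-sample difference stays within the tolerance.
        await self.run_e2e_compatibility_test(
            legacy_llm=legacy_llm,
            modern_llm=modern_llm,
            tolerance=0.1,
        )
```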
🏭 Component Factories
- `tests/utils/llm_setup.py` (usage sketch below)
  - `create_legacy_llm()` / `create_modern_llm()` - LLM initialization for both architectures
  - `create_legacy_embeddings()` / `create_modern_embeddings()` - embeddings initialization
  - `check_api_key()` - validation utility
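A minimal sketch of how the factories might be used. The function names are from this PR; the argument to `check_api_key()` and the factories' parameters are assumptions.

```python
# Usage sketch for tests/utils/llm_setup.py. Function names are from this
# PR; arguments and return behavior are assumed for illustration.
from tests.utils.llm_setup import (
    check_api_key,
    create_legacy_embeddings,
    create_legacy_llm,
    create_modern_embeddings,
    create_modern_llm,
)

check_api_key("OPENAI_API_KEY")  # assumed: skip/fail fast when the key is unset

legacy_llm = create_legacy_llm()                # LLM for the legacy architecture
modern_llm = create_modern_llm()                # LLM for the modern architecture
legacy_embeddings = create_legacy_embeddings()  # embeddings, legacy architecture
modern_embeddings = create_modern_embeddings()  # embeddings, modern architecture
```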
⚡ Optimized Comparison Engine
- `tests/utils/metric_comparison.py` (usage sketch below)
  - `compare_metrics()` - concurrent execution with configurable parallelism
  - `ComparisonResult` - dataclass with statistical analysis and pandas export
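A hedged sketch of the comparison engine's API. `compare_metrics()` and `ComparisonResult` exist in this PR; the parameter names, the concurrency option, and the result methods shown are assumptions.

```python
# Usage sketch for tests/utils/metric_comparison.py. compare_metrics and
# ComparisonResult come from this PR; everything else here is assumed.
from tests.utils.metric_comparison import compare_metrics


def summarize_migration(legacy_metric, modern_metric, samples):
    result = compare_metrics(
        legacy_metric,
        modern_metric,
        samples,
        max_concurrency=8,  # assumed name for the configurable parallelism
    )
    df = result.to_dataframe()  # assumed name for the pandas export
    print(df.describe())        # statistical summary of score differences
    return result
```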
📊 Dataset Utilities
- `tests/e2e/test_dataset_utils.py` (usage sketch below)
  - `load_amnesty_dataset_safe()` - Amnesty QA dataset with local fallback
  - `load_fiqa_dataset_safe()` - FIQA dataset with local fallback
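A short sketch of the loaders. The function names are from this PR; parameters and return types are assumptions. "Safe" refers to the local fallback: if the remote dataset cannot be fetched, a local copy is used so the e2e tests still run.

```python
# Usage sketch for tests/e2e/test_dataset_utils.py. Function names are from
# this PR; parameters and return types are assumptions.
from tests.e2e.test_dataset_utils import (
    load_amnesty_dataset_safe,
    load_fiqa_dataset_safe,
)

amnesty_samples = load_amnesty_dataset_safe()  # Amnesty QA, falls back to local data
fiqa_samples = load_fiqa_dataset_safe()        # FIQA, falls back to local data
```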
⚙️ Configuration
- `.gitignore` - added `plan/` directory for migration planning docs
- `CLAUDE.md` - updated with migration workflow guidance

Testing
Validation Status
Test Results (Context Recall Migration)
| Dataset | Samples | Mean \|Diff\| | Within Tolerance | Status |
|---------|---------|---------------|------------------|--------|
| Amnesty QA | 20 | 0.0708 | 90% (18/20) | ✅ |
| FIQA | 30 | 0.0667 | 93.3% (28/30) | ✅ |
Validation Criteria Met: both datasets pass the compatibility check, with at least 90% of samples scoring within tolerance.
Impact & Benefits
Before this PR: each metric migration required ad-hoc, hand-rolled validation to confirm that the new implementation matched legacy behavior.
After this PR: migrations reuse a standardized base test class, component factories, a concurrent comparison engine, and dataset loaders, making legacy-vs-modern validation consistent and repeatable.
Architecture
Next Steps
This infrastructure enables the following PRs:
Note: This PR contains only infrastructure and documentation - no actual metric implementations. Metric migrations follow in subsequent PRs.