
Conversation

@jjmachan (Member) commented on Oct 16, 2025

Issue Link / Problem Description

This PR introduces a comprehensive, reusable testing infrastructure to streamline the migration of legacy metrics to the modern collections architecture.

Problem: Migrating metrics requires consistent validation that new implementations match legacy behavior, but this process was ad-hoc and time-consuming without standardized tooling.

Solution: A complete testing framework with:

  • Step-by-step migration guide
  • Configuration-driven validation notebook
  • Shared test utilities and fixtures
  • Dataset loading with safe fallbacks

Changes Made

📋 Migration Documentation

  • tests/e2e/metrics_migration/plan-for-metrics-migration.md
    • Complete migration workflow (pre-migration → implementation → testing → finalization)
    • Code templates for prompts, output models, and metric classes (sketched below)
    • Validation criteria for different metric types (LLM-based, embeddings-based, deterministic)
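
To make the guide concrete, here is a minimal sketch of the kind of output-model template it describes; the class name, fields, and scoring rule are illustrative assumptions, not the guide's actual code:

```python
# Illustrative sketch only: the real templates live in
# plan-for-metrics-migration.md; class name, fields, and the scoring
# rule here are assumptions.
from pydantic import BaseModel, Field


class ContextRecallOutput(BaseModel):
    """Structured output a migrated LLM-based metric might parse."""

    statements: list[str] = Field(description="Claims extracted from the response")
    attributed: list[bool] = Field(
        description="Whether each claim is supported by the retrieved context"
    )

    @property
    def score(self) -> float:
        # Recall-style score: fraction of claims the context supports.
        if not self.attributed:
            return 0.0
        return sum(self.attributed) / len(self.attributed)
```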

📓 Testing Notebook

  • tests/e2e/metrics_migration/metric_score_diff.ipynb + tests/notebooks/metric_score_diff.ipynb
    • General-purpose and configuration-driven: edit a single cell to test any metric (see the sketch below)
    • Concurrent dataset comparison (Amnesty QA + FIQA)
    • Statistical analysis with 7-plot visualizations
    • Automated validation criteria checking
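
The "edit one cell" workflow boils down to a small configuration block; a plausible shape (key names are assumptions, not the notebook's literal variables) is:

```python
# Plausible shape of the notebook's single configuration cell.
CONFIG = {
    "metric_name": "context_recall",
    "datasets": ["amnesty_qa", "fiqa"],  # compared concurrently
    "sample_limit": 30,                  # cap per dataset for faster runs
    "tolerance": 0.2,                    # per-sample |legacy - modern| budget
    "mean_diff_threshold": 0.15,         # aggregate acceptance criterion
}
```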

🧪 Test Infrastructure

  • tests/e2e/metrics_migration/base_migration_test.py

    • BaseMigrationTest class with reusable test methods
    • run_e2e_compatibility_test() - Compare legacy vs modern with tolerance
    • run_metric_specific_test() - Custom behavior validation
  • tests/e2e/metrics_migration/conftest.py

    • Shared pytest fixtures: legacy_llm, modern_llm, legacy_embeddings, modern_embeddings
    • Automatic API key validation and graceful skipping
  • tests/e2e/metrics_migration/test_utils.py

    • Helper utilities for migration tests
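
Taken together, a metric's migration test might look like the following sketch; the import path, fixture names, and helper methods are inferred from the bullets above rather than copied from the PR:

```python
# Hypothetical subclass: the import path, fixture names, and helper
# methods are inferred from the description above, not the PR's code.
import pytest

from base_migration_test import BaseMigrationTest  # path assumed


class TestContextRecallMigration(BaseMigrationTest):
    """End-to-end parity check for one migrated metric."""

    @pytest.mark.asyncio  # assumes pytest-asyncio is configured
    async def test_e2e_compatibility(self, legacy_llm, modern_llm):
        # legacy_llm / modern_llm come from the shared conftest.py fixtures,
        # which validate API keys and skip gracefully when they are missing.
        await self.run_e2e_compatibility_test(
            legacy_metric=self.build_legacy_metric(legacy_llm),  # hypothetical helper
            modern_metric=self.build_modern_metric(modern_llm),  # hypothetical helper
            tolerance=0.2,  # per-sample |score_legacy - score_modern| budget
        )
```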

🏭 Component Factories

  • tests/utils/llm_setup.py
    • create_legacy_llm() / create_modern_llm() - LLM initialization for both architectures
    • create_legacy_embeddings() / create_modern_embeddings() - Embeddings initialization
    • check_api_key() - Validation utility
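
A hedged example of how these factories could be combined in a test setup; the exact signatures of check_api_key and the create_* functions are assumptions:

```python
# Hedged usage sketch based on the bullets above.
import pytest

from tests.utils.llm_setup import check_api_key, create_legacy_llm, create_modern_llm


def make_llm_pair(model: str = "gpt-4o-mini"):
    """Build a (legacy, modern) LLM pair, skipping cleanly without a key."""
    if not check_api_key("OPENAI_API_KEY"):  # argument shape assumed
        pytest.skip("OPENAI_API_KEY not set; skipping live-LLM comparison")
    return create_legacy_llm(model), create_modern_llm(model)
```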

⚡ Optimized Comparison Engine

  • tests/utils/metric_comparison.py
    • compare_metrics() - Concurrent execution with configurable parallelism
    • Support for both parallel (independent) and sequential (dependent) metric execution
    • ComparisonResult dataclass with statistical analysis and pandas export
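
Driving the engine might look like this sketch; the keyword arguments and ComparisonResult attributes are assumptions based on the feature list above:

```python
# Sketch only; not the real API surface.
from tests.utils.metric_comparison import compare_metrics


def run_comparison(legacy_metric, modern_metric, samples):
    result = compare_metrics(
        legacy_metric=legacy_metric,
        modern_metric=modern_metric,
        samples=samples,
        max_concurrency=8,  # configurable parallelism
        sequential=False,   # True for metrics whose calls depend on each other
    )
    print(f"mean |diff| = {result.mean_abs_diff:.4f}")  # stats on the dataclass
    return result.to_dataframe()  # pandas export feeding the notebook plots
```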

📊 Dataset Utilities

  • tests/e2e/test_dataset_utils.py
    • load_amnesty_dataset_safe() - Amnesty QA dataset with local fallback
    • load_fiqa_dataset_safe() - FIQA dataset with local fallback
    • Embedded sample data for offline/CI testing
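
The safe-fallback pattern these helpers implement can be re-sketched as follows; the hub dataset id, config name, and sample schema are assumptions for illustration only:

```python
# Hub-then-fallback pattern, illustrative only.
def load_amnesty_dataset_safe(limit: int = 20) -> list[dict]:
    try:
        from datasets import load_dataset  # optional dependency

        rows = load_dataset("explodinggradients/amnesty_qa", "english_v2")["eval"]
        return list(rows)[:limit]
    except Exception:
        # Embedded sample keeps offline/CI runs exercising the pipeline.
        sample = {
            "user_input": "What are human rights?",
            "retrieved_contexts": ["Human rights are basic rights and freedoms."],
            "response": "Human rights are fundamental protections owed to everyone.",
            "reference": "Human rights are basic rights and freedoms for all people.",
        }
        return [sample][:limit]
```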

⚙️ Configuration

  • .gitignore - Added plan/ directory for migration planning docs
  • CLAUDE.md - Updated with migration workflow guidance

Testing

Validation Status

  • Automated tests: this PR ships the framework itself; metric-specific tests arrive with each migration
  • Manual validation: tested end to end with the Context Recall migration (results below)

Test Results (Context Recall Migration)

| Dataset | Samples | Mean \|diff\| | Within tolerance | Status |
|------------|---------|---------------|------------------|--------|
| Amnesty QA | 20 | 0.0708 | 90% (18/20) | ✅ |
| FIQA | 30 | 0.0667 | 93.3% (28/30) | ✅ |

Validation Criteria Met:

  • ✅ Mean |diff| < 0.15 (stricter than per-case tolerance)
  • ✅ >90% of samples within the 0.2 tolerance for LLM-based metrics
  • ✅ No systematic bias (|mean signed diff| < 0.05)
  • ✅ Domain generalization confirmed across datasets
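
Restated as code so the thresholds are unambiguous (the helper itself is illustrative, not part of the PR):

```python
# Illustrative helper restating the acceptance criteria above.
def migration_passes(diffs: list[float], tolerance: float = 0.2) -> bool:
    n = len(diffs)
    mean_abs = sum(abs(d) for d in diffs) / n
    within = sum(abs(d) <= tolerance for d in diffs) / n
    mean_signed = sum(diffs) / n  # large magnitude here signals systematic bias
    return mean_abs < 0.15 and within >= 0.90 and abs(mean_signed) < 0.05
```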

Impact & Benefits

Before this PR:

  • Ad-hoc testing with no standardized approach
  • Difficult to validate score consistency
  • Time-consuming manual comparison
  • No reusable infrastructure

After this PR:

  1. Faster migrations - Standardized workflow reduces implementation time
  2. Higher quality - Dataset-based validation catches issues early
  3. Consistency - All metrics follow same validation process
  4. Reusability - Shared utilities eliminate boilerplate
  5. Documentation - Clear guide enables team-wide contribution

Architecture

tests/
├── e2e/metrics_migration/
│   ├── plan-for-metrics-migration.md    # Migration guide
│   ├── metric_score_diff.ipynb          # Testing notebook
│   ├── base_migration_test.py           # Base test class
│   ├── conftest.py                      # Shared fixtures
│   └── test_utils.py                    # Test helpers
├── e2e/test_dataset_utils.py            # Dataset loading
└── utils/
    ├── llm_setup.py                     # Component factories
    └── metric_comparison.py             # Comparison engine

Next Steps

This infrastructure enables the following PRs:

  1. PR "fix: batching in Metric" #2 (depends on this PR): Context Recall migration
  2. PR "added textual entailment score" #3 (depends on PR #2): Context Precision migration
  3. Future PRs: Remaining metric migrations using this framework

Note: This PR contains only infrastructure and documentation - no actual metric implementations. Metric migrations follow in subsequent PRs.

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Oct 16, 2025
@jjmachan jjmachan changed the title Feat/metrics migration infrastructure feat: Add reusable testing infrastructure for metrics migration Oct 16, 2025
@jjmachan (Member, Author) commented:

This helps us compare metrics against datasets like:

  • Amnesty QA
  • FIQA

[two screenshots of the comparison results attached]

openhands-ai (bot) commented on Oct 16, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • CI

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #2370 at branch `feat/metrics-migration-infrastructure`

Feel free to include any additional details that might help me get this PR into a better state.


- Add --exclude src/ragas/_version.py to format and CI commands
- This matches GitHub CI behavior which overrides pyproject.toml exclusions
- Ensures notebooks are formatted locally, preventing CI failures
- Add noqa: E402 comment to notebook for intentional sys.path modification
@jjmachan jjmachan requested a review from anistark October 16, 2025 23:34
@jjmachan jjmachan merged commit af38eec into main Oct 17, 2025
9 checks passed
@jjmachan jjmachan deleted the feat/metrics-migration-infrastructure branch October 17, 2025 04:49