Conversation

@jjmachan (Member) opened this pull request:

No description provided.

@shahules786 (Member) left a comment:


LGTM

@shahules786 shahules786 merged commit 6a8fcee into main May 12, 2023
@jjmachan jjmachan deleted the fix/batching-metrics branch May 13, 2023 05:33
jjmachan pushed a commit that referenced this pull request Sep 7, 2024
NirantK added a commit to ScaledFocus/ragas that referenced this pull request Aug 19, 2025
jjmachan added a commit that referenced this pull request Oct 17, 2025:

## Issue Link / Problem Description

This PR introduces a comprehensive, reusable testing infrastructure to
streamline the migration of legacy metrics to the modern collections
architecture.

**Problem**: Migrating metrics requires consistent validation that new
implementations match legacy behavior, but this process was ad-hoc and
time-consuming without standardized tooling.

**Solution**: A complete testing framework with:
- Step-by-step migration guide
- Configuration-driven validation notebook
- Shared test utilities and fixtures
- Dataset loading with safe fallbacks

## Changes Made

### 📋 Migration Documentation
- **`tests/e2e/metrics_migration/plan-for-metrics-migration.md`**
  - Complete migration workflow (pre-migration → implementation → testing → finalization)
  - Code templates for prompts, output models, and metric classes
  - Validation criteria for different metric types (LLM-based, embeddings-based, deterministic)

### 📓 Testing Notebook
- **`tests/e2e/metrics_migration/metric_score_diff.ipynb`** + **`tests/notebooks/metric_score_diff.ipynb`**
  - General-purpose, configuration-driven (edit one cell to test any metric; see the config-cell sketch below)
  - Concurrent dataset comparison (Amnesty QA + FIQA)
  - Statistical analysis with 7-plot visualizations
  - Automated validation criteria checking
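
For illustration, the single configuration cell might look roughly like this. Every name and default below is a hypothetical stand-in (drawn from the tolerances and sample counts quoted later in this description), not the notebook's actual variables:

```python
# Hypothetical config cell: names and defaults are illustrative only.
METRIC_NAME = "context_recall"             # metric under migration
TOLERANCE = 0.2                            # per-case tolerance for LLM-based metrics
MEAN_ABS_DIFF_LIMIT = 0.15                 # stricter criterion on the mean |diff|
DATASETS = {"amnesty_qa": 20, "fiqa": 30}  # dataset -> number of samples
MAX_CONCURRENCY = 8                        # parallel scoring requests (assumed default)
```

Editing this one cell should be all that is needed to point the notebook at a different metric.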

### 🧪 Test Infrastructure
- **`tests/e2e/metrics_migration/base_migration_test.py`**
  - `BaseMigrationTest` class with reusable test methods (usage sketch below)
  - `run_e2e_compatibility_test()` - Compare legacy vs. modern scores within a tolerance
  - `run_metric_specific_test()` - Custom behavior validation
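
A rough sketch of how a metric-specific test might subclass `BaseMigrationTest`. The method signature, the LLM wiring, and the `modern_context_recall` fixture are assumptions inferred from the bullets above, not copied from the source:

```python
# Hypothetical subclass; run_e2e_compatibility_test()'s real signature
# may differ from what is assumed here.
from ragas.metrics import context_recall  # legacy module-level metric instance

from tests.e2e.metrics_migration.base_migration_test import BaseMigrationTest


class TestContextRecallMigration(BaseMigrationTest):
    """Check that the modern implementation tracks legacy scores."""

    def test_e2e_compatibility(self, legacy_llm, modern_context_recall):
        # modern_context_recall is a hypothetical fixture providing the
        # new collections-based metric wired to the modern LLM.
        context_recall.llm = legacy_llm  # attach the fixture-provided LLM
        self.run_e2e_compatibility_test(
            legacy_metric=context_recall,
            modern_metric=modern_context_recall,
            tolerance=0.2,  # per-case tolerance for LLM-based metrics
        )
```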
  
- **`tests/e2e/metrics_migration/conftest.py`**
  - Shared pytest fixtures: `legacy_llm`, `modern_llm`, `legacy_embeddings`, `modern_embeddings`
  - Automatic API key validation and graceful skipping (fixture sketch below)
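
The fixtures are presumably shaped something like this; the `check_api_key()` return type and the skip messages are assumptions:

```python
# Hypothetical fixture shapes; the real conftest.py may wire models
# and skip messages differently.
import pytest

from tests.utils.llm_setup import check_api_key, create_legacy_llm, create_modern_llm


@pytest.fixture
def legacy_llm():
    if not check_api_key():  # assumed to return a bool
        pytest.skip("API key not set; skipping live-LLM migration test")
    return create_legacy_llm()


@pytest.fixture
def modern_llm():
    if not check_api_key():
        pytest.skip("API key not set; skipping live-LLM migration test")
    return create_modern_llm()
```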

- **`tests/e2e/metrics_migration/test_utils.py`**
  - Helper utilities for migration tests

### 🏭 Component Factories
- **`tests/utils/llm_setup.py`**
  - `create_legacy_llm()` / `create_modern_llm()` - LLM initialization for both architectures
  - `create_legacy_embeddings()` / `create_modern_embeddings()` - Embeddings initialization
  - `check_api_key()` - Validation utility (factory sketch below)
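
A minimal sketch of the factory shapes, assuming OpenAI-backed defaults via LangChain; the modern-architecture counterparts are omitted because their constructors aren't shown in this PR:

```python
# Hypothetical factory bodies; the model name and wrapper choices are
# assumptions, not the file's actual contents.
import os


def check_api_key(env_var: str = "OPENAI_API_KEY") -> bool:
    """Report whether the key required for live LLM calls is present."""
    return bool(os.getenv(env_var))


def create_legacy_llm(model: str = "gpt-4o-mini"):
    """Wrap a LangChain chat model for the legacy metrics architecture."""
    from langchain_openai import ChatOpenAI
    from ragas.llms import LangchainLLMWrapper

    return LangchainLLMWrapper(ChatOpenAI(model=model))


def create_legacy_embeddings():
    """Wrap LangChain embeddings for the legacy architecture."""
    from langchain_openai import OpenAIEmbeddings
    from ragas.embeddings import LangchainEmbeddingsWrapper

    return LangchainEmbeddingsWrapper(OpenAIEmbeddings())
```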

### ⚡ Optimized Comparison Engine
- **`tests/utils/metric_comparison.py`**
  - `compare_metrics()` - Concurrent execution with configurable parallelism (sketch below)
  - Support for both parallel (independent) and sequential (dependent) metric execution
  - `ComparisonResult` dataclass with statistical analysis and pandas export
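
The core of the comparison engine can be sketched as a semaphore-capped `asyncio.gather`; every name, field, and signature below is an assumption about the shape, not the actual implementation:

```python
# Hypothetical sketch of the comparison engine's core.
import asyncio
from dataclasses import dataclass
from statistics import mean


@dataclass
class ComparisonResult:
    legacy_scores: list[float]
    modern_scores: list[float]

    @property
    def abs_diffs(self) -> list[float]:
        return [abs(a - b) for a, b in zip(self.legacy_scores, self.modern_scores)]

    def mean_abs_diff(self) -> float:
        return mean(self.abs_diffs)

    def fraction_within(self, tolerance: float = 0.2) -> float:
        """Fraction of samples whose |diff| falls within the tolerance."""
        return sum(d <= tolerance for d in self.abs_diffs) / len(self.abs_diffs)


async def compare_metrics(samples, score_legacy, score_modern, max_concurrency: int = 8):
    """Score each sample with both implementations under capped concurrency."""
    sem = asyncio.Semaphore(max_concurrency)

    async def score_pair(sample):
        async with sem:
            # Sequential per sample here; independent metrics could instead
            # run both calls in parallel with another asyncio.gather.
            return await score_legacy(sample), await score_modern(sample)

    pairs = await asyncio.gather(*(score_pair(s) for s in samples))
    legacy, modern = (list(col) for col in zip(*pairs))
    return ComparisonResult(legacy, modern)
```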

### 📊 Dataset Utilities
- **`tests/e2e/test_dataset_utils.py`**
  - `load_amnesty_dataset_safe()` - Amnesty QA dataset with local fallback (sketch below)
  - `load_fiqa_dataset_safe()` - FIQA dataset with local fallback
  - Embedded sample data for offline/CI testing
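
The safe-fallback pattern presumably looks something like this; the dataset id, config, and the embedded sample are illustrative placeholders:

```python
# Hypothetical fallback pattern; the Hub id/config and the embedded
# sample below are placeholders, not the file's actual contents.
def load_amnesty_dataset_safe(limit: int = 20) -> list[dict]:
    """Load Amnesty QA from the Hub, falling back to embedded samples."""
    try:
        from datasets import load_dataset

        ds = load_dataset("explodinggradients/amnesty_qa", "english_v3", split="eval")
        return list(ds)[:limit]
    except Exception:
        # Offline/CI fallback: a small set of embedded samples.
        embedded = [
            {
                "user_input": "Which rights does the report focus on?",
                "retrieved_contexts": ["<context snippet>"],
                "response": "<model answer>",
                "reference": "<ground-truth answer>",
            }
        ]
        return embedded[:limit]
```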

### ⚙️ Configuration
- **`.gitignore`** - Added `plan/` directory for migration planning docs
- **`CLAUDE.md`** - Updated with migration workflow guidance

## Testing

### Validation Status
- [x] Automated tests: the infrastructure provides the framework for all future migrations
- [x] Manual validation: tested with the Context Recall migration

### Test Results (Context Recall Migration)
| Dataset    | Samples | Mean \|diff\| | Within tolerance | Status |
|------------|---------|---------------|------------------|--------|
| Amnesty QA | 20      | 0.0708        | 90% (18/20)      | ✅     |
| FIQA       | 30      | 0.0667        | 93.3% (28/30)    | ✅     |

**Validation Criteria Met:**
- ✅ Mean |diff| < 0.15 (stricter than the per-case tolerance)
- ✅ ≥90% of samples within the 0.2 tolerance for LLM-based metrics
- ✅ No systematic bias (mean signed diff < 0.05)
- ✅ Domain generalization confirmed across datasets (see the checker sketch below)
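
These criteria translate directly into a small automated check over paired score lists; a sketch, with names assumed and thresholds taken from the list above:

```python
# Hypothetical checker mirroring the criteria above.
from statistics import mean


def check_validation_criteria(legacy: list[float], modern: list[float]) -> dict[str, bool]:
    abs_diffs = [abs(a - b) for a, b in zip(legacy, modern)]
    signed = [b - a for a, b in zip(legacy, modern)]
    return {
        "mean_abs_diff_lt_0.15": mean(abs_diffs) < 0.15,
        "ge_90pct_within_0.2": sum(d <= 0.2 for d in abs_diffs) / len(abs_diffs) >= 0.9,
        "no_systematic_bias": abs(mean(signed)) < 0.05,
    }
```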

## Impact & Benefits

**Before this PR:**
- Ad-hoc testing with no standardized approach
- Difficult to validate score consistency
- Time-consuming manual comparison
- No reusable infrastructure

**After this PR:**
1. **Faster migrations** - Standardized workflow reduces implementation
time
2. **Higher quality** - Dataset-based validation catches issues early
3. **Consistency** - All metrics follow same validation process
4. **Reusability** - Shared utilities eliminate boilerplate
5. **Documentation** - Clear guide enables team-wide contribution

## Architecture

```
tests/
├── e2e/metrics_migration/
│   ├── plan-for-metrics-migration.md    # Migration guide
│   ├── metric_score_diff.ipynb          # Testing notebook
│   ├── base_migration_test.py           # Base test class
│   ├── conftest.py                      # Shared fixtures
│   └── test_utils.py                    # Test helpers
├── e2e/test_dataset_utils.py            # Dataset loading
└── utils/
    ├── llm_setup.py                     # Component factories
    └── metric_comparison.py             # Comparison engine
```

## Next Steps

This infrastructure enables the following PRs:
1. **PR #2** (depends on this): Context Recall migration
2. **PR #3** (depends on PR #2): Context Precision migration
3. **Future PRs**: Remaining metric migrations using this framework

---

**Note**: This PR contains **only infrastructure and documentation** -
no actual metric implementations. Metric migrations follow in subsequent
PRs.
