added textual entailment score #3
Merged
Conversation
jjmachan reviewed on May 12, 2023
belar/metrics/factual.py (Outdated)
```python
self.id2label = model_config["id2label"]

def name(self):
    return "Entailment-Score"
```
Suggested change:
```diff
-    return "Entailment-Score"
+    return "entailment"
```
belar/metrics/factual.py (Outdated)
Comment on lines 58 to 59:
```python
ground_truth: t.Union[str, t.List[str]],
generated_text: t.Union[str, t.List[str]],
```
Should we use the new syntax, `str | list[str]`? Whichever we choose, we have to be consistent.
Co-authored-by: Jithin James <jamesjithin97@gmail.com>
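For reference, the two annotation styles under discussion are equivalent. A minimal illustrative snippet (the function names and signatures below are made up, not code from this PR):

```python
import typing as t

# Pre-PEP 604 style: works on every supported Python version.
def score_legacy(ground_truth: t.Union[str, t.List[str]]) -> float:
    ...

# PEP 604 / built-in generics style: requires Python 3.10+ at runtime,
# or `from __future__ import annotations` on 3.7-3.9.
def score_modern(ground_truth: str | list[str]) -> float:
    ...
```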
jjmachan approved these changes on May 12, 2023
jjmachan added a commit that referenced this pull request on Oct 17, 2025:
## Issue Link / Problem Description

This PR introduces a comprehensive, reusable testing infrastructure to streamline the migration of legacy metrics to the modern collections architecture.

**Problem**: Migrating metrics requires consistent validation that new implementations match legacy behavior, but this process was ad-hoc and time-consuming without standardized tooling.

**Solution**: A complete testing framework with:
- Step-by-step migration guide
- Configuration-driven validation notebook
- Shared test utilities and fixtures
- Dataset loading with safe fallbacks

## Changes Made

### 📋 Migration Documentation
- **`tests/e2e/metrics_migration/plan-for-metrics-migration.md`**
  - Complete migration workflow (pre-migration → implementation → testing → finalization)
  - Code templates for prompts, output models, and metric classes
  - Validation criteria for different metric types (LLM-based, embeddings-based, deterministic)

### 📓 Testing Notebook
- **`tests/e2e/metrics_migration/metric_score_diff.ipynb`** + **`tests/notebooks/metric_score_diff.ipynb`**
  - General-purpose, configuration-driven (edit 1 cell to test any metric)
  - Concurrent dataset comparison (Amnesty QA + FIQA)
  - Statistical analysis with 7-plot visualizations
  - Automated validation criteria checking

### 🧪 Test Infrastructure
- **`tests/e2e/metrics_migration/base_migration_test.py`**
  - `BaseMigrationTest` class with reusable test methods
  - `run_e2e_compatibility_test()` - compare legacy vs. modern with tolerance
  - `run_metric_specific_test()` - custom behavior validation
- **`tests/e2e/metrics_migration/conftest.py`**
  - Shared pytest fixtures: `legacy_llm`, `modern_llm`, `legacy_embeddings`, `modern_embeddings`
  - Automatic API key validation and graceful skipping
- **`tests/e2e/metrics_migration/test_utils.py`** - helper utilities for migration tests

### 🏭 Component Factories
- **`tests/utils/llm_setup.py`**
  - `create_legacy_llm()` / `create_modern_llm()` - LLM initialization for both architectures
  - `create_legacy_embeddings()` / `create_modern_embeddings()` - embeddings initialization
  - `check_api_key()` - validation utility

### ⚡ Optimized Comparison Engine
- **`tests/utils/metric_comparison.py`**
  - `compare_metrics()` - concurrent execution with configurable parallelism
  - Support for both parallel (independent) and sequential (dependent) metric execution
  - `ComparisonResult` dataclass with statistical analysis and pandas export

### 📊 Dataset Utilities
- **`tests/e2e/test_dataset_utils.py`**
  - `load_amnesty_dataset_safe()` - Amnesty QA dataset with local fallback
  - `load_fiqa_dataset_safe()` - FIQA dataset with local fallback
  - Embedded sample data for offline/CI testing

### ⚙️ Configuration
- **`.gitignore`** - added `plan/` directory for migration planning docs
- **`CLAUDE.md`** - updated with migration workflow guidance

## Testing

### Validation Status
- [x] Automated tests: infrastructure provides the framework for all future migrations
- [x] Manual validation: tested with the Context Recall migration

### Test Results (Context Recall Migration)

| Dataset | Samples | Mean \|Diff\| | Within Tolerance | Status |
|---------|---------|---------------|------------------|--------|
| Amnesty QA | 20 | 0.0708 | 90% (18/20) | ✅ |
| FIQA | 30 | 0.0667 | 93.3% (28/30) | ✅ |

**Validation Criteria Met:**
- ✅ Mean |diff| < 0.15 (stricter than per-case tolerance)
- ✅ >90% within 0.2 tolerance for LLM-based metrics
- ✅ No systematic bias (mean diff < 0.05)
- ✅ Domain generalization confirmed across datasets
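A minimal sketch of the validation check described above, assuming two aligned lists of per-sample scores; the function name, signature, and defaults are illustrative, not the PR's actual implementation:

```python
from statistics import mean

def meets_validation_criteria(
    legacy: list[float],
    modern: list[float],
    per_case_tolerance: float = 0.2,   # per-sample tolerance for LLM-based metrics
    mean_abs_limit: float = 0.15,      # stricter limit on the mean |diff|
    bias_limit: float = 0.05,          # guard against systematic over/under-scoring
    min_within_ratio: float = 0.9,     # at least 90% of samples within tolerance
) -> bool:
    diffs = [m - l for l, m in zip(legacy, modern)]
    abs_diffs = [abs(d) for d in diffs]
    within_ratio = sum(d <= per_case_tolerance for d in abs_diffs) / len(abs_diffs)
    return (
        mean(abs_diffs) < mean_abs_limit
        and within_ratio >= min_within_ratio
        and abs(mean(diffs)) < bias_limit
    )
```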
## Impact & Benefits

**Before this PR:**
- Ad-hoc testing with no standardized approach
- Difficult to validate score consistency
- Time-consuming manual comparison
- No reusable infrastructure

**After this PR:**
1. **Faster migrations** - Standardized workflow reduces implementation time
2. **Higher quality** - Dataset-based validation catches issues early
3. **Consistency** - All metrics follow the same validation process
4. **Reusability** - Shared utilities eliminate boilerplate
5. **Documentation** - Clear guide enables team-wide contribution

## Architecture

```
tests/
├── e2e/metrics_migration/
│   ├── plan-for-metrics-migration.md   # Migration guide
│   ├── metric_score_diff.ipynb         # Testing notebook
│   ├── base_migration_test.py          # Base test class
│   ├── conftest.py                     # Shared fixtures
│   └── test_utils.py                   # Test helpers
├── e2e/test_dataset_utils.py           # Dataset loading
└── utils/
    ├── llm_setup.py                    # Component factories
    └── metric_comparison.py            # Comparison engine
```

## Next Steps

This infrastructure enables the following PRs:
1. **PR #2** (depends on this): Context Recall migration
2. **PR #3** (depends on PR #2): Context Precision migration
3. **Future PRs**: Remaining metric migrations using this framework

---

**Note**: This PR contains **only infrastructure and documentation** - no actual metric implementations. Metric migrations follow in subsequent PRs.
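As a rough illustration, the "automatic API key validation and graceful skipping" described above could be handled by a conftest-style fixture like the following; the fixture name and environment variable are assumptions, not the PR's actual code:

```python
# Sketch of a conftest.py-style fixture that skips e2e migration tests
# when no API key is configured (e.g. in offline CI runs).
import os

import pytest

@pytest.fixture(scope="session")
def api_key() -> str:
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        pytest.skip("OPENAI_API_KEY not set; skipping metric migration e2e tests")
    return key
```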
Add entailment score using NLI.
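For context, an NLI-based entailment score of this kind can be computed roughly as follows. This is a hedged sketch using Hugging Face `transformers` with an off-the-shelf MNLI checkpoint, not the exact code added in this PR; the model choice and helper name are assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # any NLI checkpoint with an "entailment" label works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# Find the index of the entailment class from the model config, mirroring
# the `id2label` lookup visible in the reviewed diff above.
ENTAILMENT_IDX = next(
    int(i) for i, label in model.config.id2label.items() if "entail" in label.lower()
)

def entailment_score(ground_truth: str, generated_text: str) -> float:
    """Probability that the ground truth entails the generated text."""
    inputs = tokenizer(
        ground_truth, generated_text, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    return probs[0, ENTAILMENT_IDX].item()
```

With a checkpoint like this, a generated sentence that follows from the ground truth should score close to 1, and a contradictory one close to 0.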