added textual entailment score #3
Merged
Conversation
jjmachan reviewed on May 12, 2023
belar/metrics/factual.py (Outdated)
```python
self.id2label = model_config["id2label"]

def name(self):
    return "Entailment-Score"
```
Suggested change:
```diff
-    return "Entailment-Score"
+    return "entailment"
```
belar/metrics/factual.py (Outdated)
Comment on lines 58 to 59:
```python
ground_truth: t.Union[str, t.List[str]],
generated_text: t.Union[str, t.List[str]],
```
Should we use the new syntax, `str | list[str]`? Whichever we choose, we have to be consistent.
Co-authored-by: Jithin James <jamesjithin97@gmail.com>
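For reference, the two annotation styles under discussion are equivalent. A minimal illustrative snippet (the function names and signatures below are made up, not code from this PR):

```python
import typing as t

# Pre-PEP 604 style: works on every supported Python version.
def score_legacy(ground_truth: t.Union[str, t.List[str]]) -> float:
    ...

# PEP 604 / built-in generics style: requires Python 3.10+ at runtime,
# or `from __future__ import annotations` on 3.7-3.9.
def score_modern(ground_truth: str | list[str]) -> float:
    ...
```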
jjmachan approved these changes on May 12, 2023
jjmachan added a commit that referenced this pull request on Oct 17, 2025:
## Issue Link / Problem Description

This PR introduces a comprehensive, reusable testing infrastructure to streamline the migration of legacy metrics to the modern collections architecture.

**Problem**: Migrating metrics requires consistent validation that new implementations match legacy behavior, but this process was ad-hoc and time-consuming without standardized tooling.

**Solution**: A complete testing framework with:
- Step-by-step migration guide
- Configuration-driven validation notebook
- Shared test utilities and fixtures
- Dataset loading with safe fallbacks

## Changes Made

### 📋 Migration Documentation
- **`tests/e2e/metrics_migration/plan-for-metrics-migration.md`**
  - Complete migration workflow (pre-migration → implementation → testing → finalization)
  - Code templates for prompts, output models, and metric classes
  - Validation criteria for different metric types (LLM-based, embeddings-based, deterministic)

### 📓 Testing Notebook
- **`tests/e2e/metrics_migration/metric_score_diff.ipynb`** + **`tests/notebooks/metric_score_diff.ipynb`**
  - General-purpose, configuration-driven (edit 1 cell to test any metric)
  - Concurrent dataset comparison (Amnesty QA + FIQA)
  - Statistical analysis with 7-plot visualizations
  - Automated validation criteria checking

### 🧪 Test Infrastructure
- **`tests/e2e/metrics_migration/base_migration_test.py`**
  - `BaseMigrationTest` class with reusable test methods
  - `run_e2e_compatibility_test()` - compare legacy vs. modern with tolerance
  - `run_metric_specific_test()` - custom behavior validation
- **`tests/e2e/metrics_migration/conftest.py`**
  - Shared pytest fixtures: `legacy_llm`, `modern_llm`, `legacy_embeddings`, `modern_embeddings`
  - Automatic API key validation and graceful skipping
- **`tests/e2e/metrics_migration/test_utils.py`** - helper utilities for migration tests

### 🏭 Component Factories
- **`tests/utils/llm_setup.py`**
  - `create_legacy_llm()` / `create_modern_llm()` - LLM initialization for both architectures
  - `create_legacy_embeddings()` / `create_modern_embeddings()` - embeddings initialization
  - `check_api_key()` - validation utility

### ⚡ Optimized Comparison Engine
- **`tests/utils/metric_comparison.py`**
  - `compare_metrics()` - concurrent execution with configurable parallelism
  - Support for both parallel (independent) and sequential (dependent) metric execution
  - `ComparisonResult` dataclass with statistical analysis and pandas export

### 📊 Dataset Utilities
- **`tests/e2e/test_dataset_utils.py`**
  - `load_amnesty_dataset_safe()` - Amnesty QA dataset with local fallback
  - `load_fiqa_dataset_safe()` - FIQA dataset with local fallback
  - Embedded sample data for offline/CI testing

### ⚙️ Configuration
- **`.gitignore`** - added `plan/` directory for migration planning docs
- **`CLAUDE.md`** - updated with migration workflow guidance

## Testing

### Validation Status
- [x] Automated tests: infrastructure provides the framework for all future migrations
- [x] Manual validation: tested with the Context Recall migration

### Test Results (Context Recall Migration)

| Dataset | Samples | Mean \|Diff\| | Within Tolerance | Status |
|---------|---------|---------------|------------------|--------|
| Amnesty QA | 20 | 0.0708 | 90% (18/20) | ✅ |
| FIQA | 30 | 0.0667 | 93.3% (28/30) | ✅ |

**Validation Criteria Met:**
- ✅ Mean |diff| < 0.15 (stricter than per-case tolerance)
- ✅ >90% within 0.2 tolerance for LLM-based metrics
- ✅ No systematic bias (mean diff < 0.05)
- ✅ Domain generalization confirmed across datasets
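A minimal sketch of the validation check described above, assuming two aligned lists of per-sample scores; the function name, signature, and defaults are illustrative, not the PR's actual implementation:

```python
from statistics import mean

def meets_validation_criteria(
    legacy: list[float],
    modern: list[float],
    per_case_tolerance: float = 0.2,   # per-sample tolerance for LLM-based metrics
    mean_abs_limit: float = 0.15,      # stricter limit on the mean |diff|
    bias_limit: float = 0.05,          # guard against systematic over/under-scoring
    min_within_ratio: float = 0.9,     # at least 90% of samples within tolerance
) -> bool:
    diffs = [m - l for l, m in zip(legacy, modern)]
    abs_diffs = [abs(d) for d in diffs]
    within_ratio = sum(d <= per_case_tolerance for d in abs_diffs) / len(abs_diffs)
    return (
        mean(abs_diffs) < mean_abs_limit
        and within_ratio >= min_within_ratio
        and abs(mean(diffs)) < bias_limit
    )
```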
## Impact & Benefits

**Before this PR:**
- Ad-hoc testing with no standardized approach
- Difficult to validate score consistency
- Time-consuming manual comparison
- No reusable infrastructure

**After this PR:**
1. **Faster migrations** - Standardized workflow reduces implementation time
2. **Higher quality** - Dataset-based validation catches issues early
3. **Consistency** - All metrics follow the same validation process
4. **Reusability** - Shared utilities eliminate boilerplate
5. **Documentation** - Clear guide enables team-wide contribution

## Architecture

```
tests/
├── e2e/metrics_migration/
│   ├── plan-for-metrics-migration.md   # Migration guide
│   ├── metric_score_diff.ipynb         # Testing notebook
│   ├── base_migration_test.py          # Base test class
│   ├── conftest.py                     # Shared fixtures
│   └── test_utils.py                   # Test helpers
├── e2e/test_dataset_utils.py           # Dataset loading
└── utils/
    ├── llm_setup.py                    # Component factories
    └── metric_comparison.py            # Comparison engine
```

## Next Steps

This infrastructure enables the following PRs:
1. **PR #2** (depends on this): Context Recall migration
2. **PR #3** (depends on PR #2): Context Precision migration
3. **Future PRs**: Remaining metric migrations using this framework

---

**Note**: This PR contains **only infrastructure and documentation** - no actual metric implementations. Metric migrations follow in subsequent PRs.
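As a rough illustration, the "automatic API key validation and graceful skipping" described above could be handled by a conftest-style fixture like the following; the fixture name and environment variable are assumptions, not the PR's actual code:

```python
# Sketch of a conftest.py-style fixture that skips e2e migration tests
# when no API key is configured (e.g. in offline CI runs).
import os

import pytest

@pytest.fixture(scope="session")
def api_key() -> str:
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        pytest.skip("OPENAI_API_KEY not set; skipping metric migration e2e tests")
    return key
```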
Add entailment score using NLI.
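For context, an NLI-based entailment score of this kind can be computed roughly as follows. This is a hedged sketch using Hugging Face `transformers` with an off-the-shelf MNLI checkpoint, not the exact code added in this PR; the model choice and helper name are assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # any NLI checkpoint with an "entailment" label works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# Find the index of the entailment class from the model config, mirroring
# the `id2label` lookup visible in the reviewed diff above.
ENTAILMENT_IDX = next(
    int(i) for i, label in model.config.id2label.items() if "entail" in label.lower()
)

def entailment_score(ground_truth: str, generated_text: str) -> float:
    """Probability that the ground truth entails the generated text."""
    inputs = tokenizer(
        ground_truth, generated_text, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    return probs[0, ENTAILMENT_IDX].item()
```

With a checkpoint like this, a generated sentence that follows from the ground truth should score close to 1, and a contradictory one close to 0.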