# How do we _evaluate_ generated summaries for factual inconsistency / hallucination?

## One way is with an _evaluator_ model

- **Input:** (Source document, summary, [label])
- **Output:** Probability of summary being factually (in)consistent with source

## How we'll frame the task: Natural Language Inference (NLI)

### The conventional NLI task: Entailment = True, Contradiction = False
![](https://eugeneyan.com/assets/nli.jpg)

### NLI applied to factual inconsistency detection: Contradiction = Factual Inconsistency
![](https://eugeneyan.com/assets/summary-nli.jpg)
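
To make the framing concrete, here is a minimal sketch of turning a three-way NLI classifier's output into a factual-inconsistency probability. It assumes the classifier emits logits in the order (entailment, neutral, contradiction). That label order, the example logits, and the names `softmax` and `inconsistency_prob` are illustrative assumptions, not taken from the notebooks, so check the label mapping of whichever model you use.

```python
import math

# Assumed label order; real NLI checkpoints differ, so check the model's config.
LABELS = ("entailment", "neutral", "contradiction")


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inconsistency_prob(logits):
    """Probability that the summary (hypothesis) contradicts the source (premise)."""
    probs = softmax(logits)
    return probs[LABELS.index("contradiction")]


# Hypothetical logits for one (source, summary) pair.
example_logits = [0.2, 0.1, 2.5]
p = inconsistency_prob(example_logits)
print(f"P(factually inconsistent) = {p:.3f}")
```

Here the source document plays the role of the premise and the summary the hypothesis, so a high contradiction probability is read as a likely hallucination.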

## Agenda

### 0. Objective
- **Finetune an evaluator-model** that can catch hallucinations in the Factual Inconsistency Benchmark (FIB)
- **Eval the evaluator-model** after each epoch and for each data blend
- (Optional assignment: Use the evaluator-model to **eval** generative models)
- (Optional assignment: Use the evaluator-model as a **guardrail**)
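
As a rough sketch of what "eval the evaluator" and "guardrail" could look like in code: the snippet below scores made-up evaluator outputs against labels and uses the same probability as a block/allow check. The choice of precision and recall, the 0.5 threshold, the convention that label 1 means "inconsistent", and the function names are all assumptions for illustration; the notebooks may use different metrics and thresholds.

```python
def precision_recall(labels, probs, threshold=0.5):
    """Precision and recall for the 'inconsistent' class (label 1)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def guardrail(prob_inconsistent, threshold=0.5):
    """Return True if the summary may be shown, False if it should be blocked."""
    return prob_inconsistent < threshold


# Made-up labels and evaluator outputs, for illustration only.
labels = [1, 1, 0, 0, 1]
probs = [0.9, 0.4, 0.2, 0.6, 0.8]
precision, recall = precision_recall(labels, probs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a guardrail, recall on inconsistent summaries usually matters most, since a missed hallucination reaches the user; lowering the threshold trades precision for recall.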

### 1. Examine, prepare, and split our data
- Factual Inconsistency Benchmark (FIB) and Unified Summarization Benchmark (USB)
- Link: [1_prep_data.ipynb](1_prep_data.ipynb)

### 2. Finetuning on FIB
- We'll see that performance isn't great 😔
- Link: [2_ft_fib.ipynb](2_ft_fib.ipynb)

### 3. Blending in and finetuning on USB before FIB
- Profit! 📈
- Link: [3_ft_usb_then_fib.ipynb](3_ft_usb_then_fib.ipynb)

### Appendix
- [Task-Specific LLM Evals that Do & Don't Work (writeup)](https://eugeneyan.com/writing/evals/)
- [Evaluation & Hallucination Detection for Abstractive Summaries (writeup)](https://eugeneyan.com/writing/abstractive/)
- [Out-of-Domain Finetuning to Bootstrap Hallucination Detection (writeup)](https://eugeneyan.com/writing/finetuning/)
- [evals 🤝 finetuning 🤝 evals 🤝 finetuning ... (slides)](https://docs.google.com/presentation/d/1sH6RsoEUM6P38R_A3mkjvvHYhfilzVIi74pQTYY90rQ)

# Next: [1_prep_data.ipynb](1_prep_data.ipynb)