# Notebook 1 – Core Evaluation Metrics
In this notebook we will:
* Install the necessary evaluation libraries
* Compute classic metrics (Accuracy, Precision, Recall, F1)
* Evaluate generation quality with **BLEU** and **ROUGE**
* Explore specialised benchmarks (TruthfulQA, Hallucination‑Rate)
* Design a **human‑evaluation rubric** and capture results

In [None]:
# scikit-learn is used by the classification demo in section 1
!pip install -q evaluate rouge-score datasets scikit-learn

## 1. Classification Metrics Demo

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
# Toy gold labels and model predictions
y_true = ['positive', 'negative', 'negative', 'positive']
y_pred = ['positive', 'negative', 'positive', 'positive']
acc = accuracy_score(y_true, y_pred)
# average='binary' reports scores only for the class named by pos_label
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary', pos_label='positive')
print(f'Accuracy: {acc:.2f}\nPrecision: {prec:.2f}\nRecall: {rec:.2f}\nF1: {f1:.2f}')
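The four scores above come from simple counts of where `y_true` and `y_pred` agree. As a minimal sketch, assuming the same toy labels and `'positive'` as the positive class, the counts and formulas can be written out by hand. The helper name `binary_metrics` is hypothetical and is not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred, pos_label):
    """Illustrative helper: accuracy, precision, recall and F1 from raw counts."""
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    # False positives: predicted positive but actually something else
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    # False negatives: actually positive but predicted something else
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


y_true = ['positive', 'negative', 'negative', 'positive']
y_pred = ['positive', 'negative', 'positive', 'positive']
manual = binary_metrics(y_true, y_pred, pos_label='positive')
print(manual)
```

With one false positive and no false negatives, precision is 2/3 while recall is 1.0, which gives an F1 of 0.8.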

## 2. Generation Metrics Demo – ROUGE

In [None]:
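from rouge_score import rouge_scorer
reference = 'The cat sat on the mat.'
candidate = 'A cat was sitting on the mat.'
# rouge1 = unigram overlap, rougeL = longest common subsequence; tokens are stemmed before matching
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# score() takes the reference (target) first, then the candidate (prediction)
scores = scorer.score(reference, candidate)
scores  # maps each metric name to a Score(precision, recall, fmeasure)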
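To make the ROUGE-1 numbers less of a black box, the unigram overlap can be computed by hand. This is a rough sketch under stated assumptions: it lowercases the text, keeps only alphanumeric tokens, and applies no stemming, so it is not a substitute for `rouge_score` and may give different values on other inputs. The names `tokenize` and `rouge1` are hypothetical helpers.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric runs, no stemming
    return re.findall(r'[a-z0-9]+', text.lower())


def rouge1(reference, candidate):
    """Illustrative ROUGE-1: clipped unigram overlap as precision, recall and F-measure."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count per token (clipping)
    overlap = sum((ref_counts & cand_counts).values())
    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fmeasure


reference = 'The cat sat on the mat.'
candidate = 'A cat was sitting on the mat.'
precision, recall, fmeasure = rouge1(reference, candidate)
print(f'precision={precision:.3f} recall={recall:.3f} fmeasure={fmeasure:.3f}')
```

Four tokens are shared (`cat`, `on`, `the`, `mat`), out of seven candidate tokens and six reference tokens, so precision is 4/7 and recall is 4/6.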

## 3. TruthfulQA Mini‑Check
We load a tiny subset (the first five validation rows) of the *truthful_qa* dataset and display each question with its reference `best_answer`, the text a model's answer would be judged against. (For demonstration only.)

In [None]:
from datasets import load_dataset
# 'generation' config; the split slice keeps only the first five validation rows
ds = load_dataset('truthful_qa', 'generation', split='validation[:5]')
ds.to_pandas()[['question', 'best_answer']].head()
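Once model answers are available, each one needs a score against its `best_answer`. The sketch below is a crude lexical proxy only: it computes token-overlap F1 and flags answers that fall below a threshold, giving a rough stand-in for a hallucination rate. Real truthfulness evaluation typically relies on human or model-based judges. The rows, the `model_answer` field, the `token_f1` helper, and the 0.5 threshold are all assumptions made up for illustration, not TruthfulQA data.

```python
import re
from collections import Counter


def token_f1(prediction, reference):
    """Illustrative token-overlap F1 between a model answer and a reference answer."""
    pred = re.findall(r'[a-z0-9]+', prediction.lower())
    ref = re.findall(r'[a-z0-9]+', reference.lower())
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical rows: 'best_answer' mirrors the dataset column, 'model_answer' is invented
rows = [
    {'best_answer': 'Paris is the capital of France',
     'model_answer': 'The capital of France is Paris'},
    {'best_answer': 'Water boils at 100 degrees Celsius at sea level',
     'model_answer': 'It boils at 50 degrees'},
    {'best_answer': 'Bats are mammals',
     'model_answer': 'Bats are birds that fly at night'},
]

THRESHOLD = 0.5  # assumed cut-off; tune against human judgements
row_scores = [token_f1(r['model_answer'], r['best_answer']) for r in rows]
flagged = [s < THRESHOLD for s in row_scores]
flagged_rate = sum(flagged) / len(flagged)
print(row_scores, f'flagged rate: {flagged_rate:.2f}')
```

Lexical overlap is easy to fool: a false answer that reuses most of the reference wording can still score highly, so these scores are best treated as a first-pass filter before human review.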

## 4. Designing a Human‑Evaluation Rubric
Fill in the table below by describing, for each criterion, what a response scoring 1, 3, or 5 on the 1–5 scale looks like:

| Criterion  | 1 (Poor) | 3 (OK) | 5 (Great) |
|------------|----------|--------|-----------|
| Relevance  |          |        |           |
| Factuality |          |        |           |
| Style      |          |        |           |

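To capture results once raters have applied the rubric, one option is to store each rating as a small record and average the scores per criterion. This is a minimal sketch: the record layout, the `validate` and `summarise` helpers, and the example ratings are all hypothetical placeholders rather than collected data.

```python
from statistics import mean

CRITERIA = ('Relevance', 'Factuality', 'Style')


def validate(rating):
    """Reject ratings that miss a criterion or fall outside the 1-5 scale."""
    for criterion in CRITERIA:
        score = rating.get(criterion)
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f'{criterion} must be an integer from 1 to 5, got {score!r}')
    return rating


def summarise(ratings):
    """Average each criterion across all validated ratings."""
    checked = [validate(r) for r in ratings]
    return {c: mean(r[c] for r in checked) for c in CRITERIA}


# Placeholder ratings, invented for illustration: two raters scoring one response
ratings = [
    {'response_id': 'r1', 'rater': 'A', 'Relevance': 5, 'Factuality': 4, 'Style': 3},
    {'response_id': 'r1', 'rater': 'B', 'Relevance': 4, 'Factuality': 4, 'Style': 2},
]

summary = summarise(ratings)
print(summary)
```

Keeping the rater and response identifiers on every record makes it possible to check agreement between raters later, instead of only reporting averages.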