# Day 5 – Refined Evaluation Plan + PAIR Codelab

## 1. Objective
Extend the evaluation criteria to cover trust, fairness (including subgroup alignment), and safety. Optional: complete the PAIR Codelab on building trusted AI systems.

## 2. Extended Evaluation Criteria – Human-Centered Design

To move beyond technical accuracy, we expand evaluation to include **trust, fairness, and safety**. These dimensions help ensure the AI product serves real users equitably and transparently.

### Trust / Perception

- Log **user-facing feedback** (e.g., 1–5 star helpfulness or clarity ratings)
- Track **user override behavior** (e.g., do users ignore or correct the AI?)
- Monitor **retention and drop-off** patterns over time
- Detect erosion of trust by logging repeated questions or failed resolution chains
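
The trust signals above could be aggregated per batch of sessions roughly as follows. This is a minimal sketch, not Tangent's actual logging schema: `SessionLog`, its field names, `trust_summary`, and the `repeat_threshold` default are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class SessionLog:
    """Hypothetical per-session record; field names are assumptions."""
    session_id: str
    rating: Optional[int]     # 1-5 helpfulness rating, None if the prompt was skipped
    ai_suggestions: int       # number of AI answers shown to the user
    overrides: int            # answers the user ignored or corrected
    repeated_questions: int   # times the user re-asked the same question


def trust_summary(sessions, repeat_threshold=2):
    """Aggregate rating, override, and repeated-question signals."""
    rated = [s.rating for s in sessions if s.rating is not None]
    total_suggestions = sum(s.ai_suggestions for s in sessions)
    total_overrides = sum(s.overrides for s in sessions)
    return {
        # Mean over sessions that actually submitted a rating.
        "mean_rating": mean(rated) if rated else None,
        # Share of AI answers that users ignored or corrected.
        "override_rate": (
            total_overrides / total_suggestions if total_suggestions else None
        ),
        # Sessions whose repeated questions may signal eroding trust.
        "sessions_with_repeats": [
            s.session_id
            for s in sessions
            if s.repeated_questions >= repeat_threshold
        ],
    }


if __name__ == "__main__":
    logs = [
        SessionLog("a", 4, 10, 2, 0),
        SessionLog("b", None, 5, 1, 3),
        SessionLog("c", 2, 5, 2, 2),
    ]
    print(trust_summary(logs))
```

Sessions without a rating are excluded from the mean rather than counted as zero, so a low response rate to the rating prompt does not drag the score down artificially.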

### Fairness / Subgroup Alignment

- Disaggregate performance metrics across **user segments**:
  - Skill level (e.g., novice vs advanced)
  - Language background (e.g., ESL vs native speaker)
  - Device context (e.g., mobile vs desktop)
  - Gender, age, or region (if known and ethically sourced)
- Define the maximum acceptable **performance gap** between groups, and flag any larger gap for review
- Include **manual audits** or flagged edge cases from underperforming segments
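
One way to disaggregate a metric and check it against a gap threshold is sketched below. The record layout (dicts with a boolean `correct` field plus segment fields), the function names, and the 10-point `max_gap` default are assumptions for illustration, not values taken from this plan.

```python
from collections import defaultdict


def accuracy_by_segment(records, segment_key):
    """Compute accuracy per segment.

    records: iterable of dicts with a boolean 'correct' field and a
    segment field named by segment_key (hypothetical layout).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        segment = record[segment_key]
        totals[segment] += 1
        correct[segment] += int(record["correct"])
    return {segment: correct[segment] / totals[segment] for segment in totals}


def flag_gaps(per_segment, max_gap=0.10):
    """Return segments trailing the best segment by more than max_gap."""
    if not per_segment:
        return []
    best = max(per_segment.values())
    return sorted(
        segment
        for segment, accuracy in per_segment.items()
        if best - accuracy > max_gap
    )


if __name__ == "__main__":
    records = [
        {"skill": "novice", "correct": True},
        {"skill": "novice", "correct": False},
        {"skill": "advanced", "correct": True},
        {"skill": "advanced", "correct": True},
    ]
    per_segment = accuracy_by_segment(records, "skill")
    print(per_segment, flag_gaps(per_segment))
```

Comparing each segment to the best-performing one is only one possible definition of a gap; comparing to the overall average is an equally reasonable alternative. Flagged segments are natural candidates for the manual audits mentioned above.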

### Safety & Recovery

- Log **confidently wrong answers** as a high-risk failure mode
- Measure **escalation effectiveness** (e.g., how often tutor fallback resolves the issue)
- Add **recovery metrics**:
  - % of failures that trigger helpful fallback
  - % of recovery attempts that result in successful resolution
- Reward low-risk uncertainty (e.g., deferring when unsure)
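
The safety and recovery metrics above can be computed from per-response logs along these lines. This is a sketch under stated assumptions: the field names (`correct`, `confidence`, `fallback_triggered`, `resolved`) and the 0.8 confidence cutoff are hypothetical, and the sketch counts any triggered fallback because judging whether a fallback was helpful needs a separate signal.

```python
def safety_recovery_metrics(responses, confidence_threshold=0.8):
    """Summarize confidently wrong answers and fallback recovery.

    responses: list of dicts with hypothetical keys
      'correct' (bool), 'confidence' (float in [0, 1]),
      'fallback_triggered' (bool), 'resolved' (bool).
    """
    failures = [r for r in responses if not r["correct"]]
    confidently_wrong = [
        r for r in failures if r["confidence"] >= confidence_threshold
    ]
    with_fallback = [r for r in failures if r["fallback_triggered"]]
    recovered = [r for r in with_fallback if r["resolved"]]

    def ratio(numerator, denominator):
        # Return None instead of dividing by zero on empty groups.
        return numerator / denominator if denominator else None

    return {
        # Share of all responses that were wrong despite high confidence.
        "confidently_wrong_rate": ratio(len(confidently_wrong), len(responses)),
        # Share of failures that triggered a fallback.
        "fallback_rate": ratio(len(with_fallback), len(failures)),
        # Share of fallback attempts that ended in a resolved session.
        "recovery_success_rate": ratio(len(recovered), len(with_fallback)),
    }


if __name__ == "__main__":
    sample = [
        {"correct": True, "confidence": 0.9, "fallback_triggered": False, "resolved": False},
        {"correct": False, "confidence": 0.95, "fallback_triggered": True, "resolved": True},
        {"correct": False, "confidence": 0.4, "fallback_triggered": True, "resolved": False},
        {"correct": False, "confidence": 0.85, "fallback_triggered": False, "resolved": False},
    ]
    print(safety_recovery_metrics(sample))
```

A failure with high confidence and no fallback, like the last sample record, is the case this section treats as highest risk, so it may be worth logging those records individually in addition to tracking the rate.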

These dimensions make the evaluation process **reflect real human use**, not just technical model performance. This helps Tangent, the AI tutor, deliver trustworthy, fair, and supportive learning experiences.

## 3. Refined Evaluation Matrix – Tangent

| Dimension  | Metric                           | Method of Collection  | Frequency      |
|------------|----------------------------------|-----------------------|----------------|
| Accuracy   | % correct solutions              | System logs           | Per problem    |
| Trust      | User helpfulness score (1–5)     | UI rating prompt      | Per session    |
| Fairness   | Accuracy by user segment         | Segment audit         | Weekly/Monthly |
| Safety     | % confidently wrong answers      | Error log inspection  | Per batch      |
| Recovery   | % of escalated sessions resolved | Escalation flow logs  | Per session    |
| Confidence | Avg. model certainty score       | Model output metadata | Per response   |

## 4. Summary

- Extended evaluation to include human-centered metrics: trust, fairness, and safety
- Created subgroup-aware audit plan for fairness