## Evaluating Classifiers: From Confusion Matrix to F1 Score

Classification models are evaluated using metrics derived from the **confusion matrix**, which counts True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).[1][2][3]

```
Confusion Matrix
           Predicted
           0                   1
Actual  0  TN                  FP(Type I error)
        1  FN(Type II error)   TP    
```

**TP**: Correctly predicted positive  
**TN**: Correctly predicted negative  
**FP**: Wrongly predicted positive (Type I error)  
**FN**: Wrongly predicted negative (Type II error)
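
The four cells can be counted directly from paired lists of actual and predicted labels. The sketch below is a minimal plain-Python illustration; the helper name `confusion_counts` and the toy labels are invented for this example, and 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem (hypothetical helper)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (Type I error)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (Type II error)
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Toy labels, invented for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

With 0/1 labels, scikit-learn's `confusion_matrix(y_true, y_pred).ravel()` returns the same four counts in the order TN, FP, FN, TP.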

### Core Metrics Explained (With Real Examples)

#### Accuracy: "Overall Correctness"
**Formula**: `(TP + TN) / (TP + TN + FP + FN)`  
**Question**: "What fraction of all predictions were correct?"

**Example**: 9/10 predictions correct → **90% accuracy**  
**When to use**: Balanced classes (roughly equal positives/negatives)  
**Problem**: Misleading on imbalanced data (e.g., with 99% healthy patients, always predicting "healthy" scores 99% accuracy while catching no sick patients)
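
The imbalance problem is easy to reproduce. The sketch below assumes a made-up population of 1,000 patients (990 healthy, 10 sick) and a "model" that always predicts healthy; all numbers are illustrative.

```python
# 1,000 patients: 990 healthy (0) and 10 sick (1); numbers invented for illustration
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a "model" that always predicts healthy

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the sick patients did the model actually find?
sick_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
sick_total = sum(y_true)

print(f"accuracy: {accuracy:.2f}")                           # accuracy: 0.99
print(f"sick patients found: {sick_found} of {sick_total}")  # sick patients found: 0 of 10
```

High accuracy here says nothing about the minority class, which is usually the one that matters.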

#### Precision: "How Trustworthy Are Positives?"
**Formula**: `TP / (TP + FP)`  
**Question**: "Of everything predicted positive, how much was actually positive?"

**Example**: 8/10 cancer predictions correct → **80% precision** (2 healthy flagged as cancer)  
**When to use**: False positives are costly (spam filter, fraud detection, loan approval)  
**Real scenario**: Credit risk, where "positive" means "approve the loan": high precision means few approved loans turn out to be risky
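
As a quick check of the formula, here is a minimal sketch that computes precision from raw counts. The function name is illustrative, and returning 0.0 when nothing was predicted positive is a convention chosen for this example, not a mathematical necessity.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 by convention if nothing was predicted positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


# 10 cancer predictions: 8 correct (TP), 2 healthy patients flagged (FP)
p = precision(tp=8, fp=2)
print(p)  # 0.8

# Edge case: the model never predicted positive, so the ratio is undefined
print(precision(tp=0, fp=0))  # 0.0 (by the convention above)
```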

#### Recall (Sensitivity): "Did We Catch All Positives?"
**Formula**: `TP / (TP + FN)`  
**Question**: "Of all actual positives, how many did we catch?"

**Example**: 7/10 COVID patients detected → **70% recall** (3 missed)  
**When to use**: False negatives deadly (disease detection, quality control)  
**Real scenario**: Medical diagnosis—high recall catches sick patients
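
Recall on its own can be gamed by flagging everyone, which is why it is usually read alongside precision. The sketch below uses illustrative helper names and a hypothetical "flag everyone" test on 10 sick and 90 healthy people.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 by convention when nothing was predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """TP / (TP + FN); 0.0 by convention when there are no actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0


# 10 COVID patients: 7 detected (TP), 3 missed (FN)
r_example = recall(tp=7, fn=3)
print(r_example)  # 0.7

# Hypothetical "flag everyone" test: all 10 sick people are caught,
# but all 90 healthy people are flagged as well
r_flag_all = recall(tp=10, fn=0)
p_flag_all = precision(tp=10, fp=90)
print(r_flag_all, p_flag_all)  # 1.0 0.1
```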

#### Specificity (True Negative Rate): "Did We Correctly Clear the Negatives?"
**Formula**: `TN / (TN + FP)`  
**Question**: "Of all actual negatives, how many were correctly identified?"

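**Example**: 85/100 healthy patients correctly identified → **85% specificity** (15 healthy patients wrongly flagged)  
**When to use**: False positives are costly and you care how well the negative class is handled  
**Perfect for**: Medical screening (limiting false alarms), spam filtering (keeping legitimate mail out of the spam folder)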
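
Specificity is simply recall computed on the negative class, and its complement is the false positive rate. A minimal sketch from raw counts follows; the function name is illustrative.

```python
def specificity(tn, fp):
    """TN / (TN + FP); 0.0 by convention when there are no actual negatives."""
    actual_negative = tn + fp
    return tn / actual_negative if actual_negative else 0.0


# 100 healthy patients: 85 correctly cleared (TN), 15 wrongly flagged (FP)
s = specificity(tn=85, fp=15)
false_positive_rate = 1 - s

print(f"specificity: {s:.2f}")                            # specificity: 0.85
print(f"false positive rate: {false_positive_rate:.2f}")  # false positive rate: 0.15
```

When negatives are labelled 0, scikit-learn's `recall_score(y_true, y_pred, pos_label=0)` computes the same quantity.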

#### F1 Score: "Balance of Precision + Recall"
**Formula**: `2 × (precision × recall) / (precision + recall)`  
**Question**: "What single number balances precision and recall?"

**Example**: Precision=0.8, Recall=0.7 → **F1≈0.75** (2 × 0.56 / 1.5 = 0.747)  
**When to use**: Imbalanced classes where both false positives and false negatives matter  
**Perfect for**: Fraud detection, rare disease diagnosis
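
F1 is the harmonic mean of precision and recall, so it drops sharply when either one is low. The sketch below compares it with the plain average; the function name and the second pair of values are chosen only for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 by convention if both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Precision=0.8, Recall=0.7
balanced = f1(0.8, 0.7)
print(round(balanced, 4))  # 0.7467

# Lopsided case: very precise, but misses most positives
lopsided = f1(0.99, 0.10)
plain_average = (0.99 + 0.10) / 2
print(round(lopsided, 4), round(plain_average, 4))  # 0.1817 0.545
```

The plain average would rate the lopsided model at about 0.55, while F1 rates it at about 0.18.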

### When to Use Each Metric (Quick Guide)

| Metric | Best For | Example Scenario | Watch Out For |
|--------|----------|------------------|---------------|
| **Accuracy** | Balanced classes | Equal spam/ham emails | Imbalanced data |
| **Precision** | Costly false positives | Fraud alerts, spam filter | Misses true positives |
| **Recall** | Costly false negatives | Cancer screening | Too many false positives |
| **F1** | Imbalanced + need balance | Rare disease detection | Ignores true negatives; weights precision and recall equally |

### Precision vs Recall Trade-off (Business Decisions)

**Key insight**: You usually can't maximize both at once. For a given model, raising the decision threshold tends to raise precision and lower recall (and vice versa).

| Scenario | Prioritize Precision | Prioritize Recall |
|----------|---------------------|-------------------|
| **Medical Diagnosis** | Avoid unnecessary treatments | Catch all sick patients |
| **Fraud Detection** | Avoid bothering honest customers | Catch all fraudsters |
| **Quality Control** | Avoid scrapping good products | Catch all defects |
| **Spam Filter** | Don't block important emails | Block all spam |

**Balance both** → Use the F1 score, or tune the decision threshold using a precision-recall or ROC curve.
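
The trade-off shows up as soon as you sweep the decision threshold over predicted scores. The sketch below uses made-up labels and scores (1 = positive), and the helper `precision_recall_at` is hypothetical.

```python
# Toy labels and model scores, invented for illustration (1 = positive)
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = [0.10, 0.25, 0.35, 0.40, 0.55, 0.60, 0.65, 0.80, 0.90, 0.95]


def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then compute both metrics."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.3  precision=0.75  recall=1.00
# threshold=0.5  precision=0.83  recall=0.83
# threshold=0.7  precision=1.00  recall=0.50
```

On real data, scikit-learn's `precision_recall_curve(y_true, scores)` produces the same kind of sweep over every distinct threshold.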

### Python Quick Check (scikit-learn)

```python
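from sklearn.metrics import accuracy_score, classification_report

# y_test: true labels; y_pred: the model's predictions for the same rows,
# e.g. y_pred = model.predict(X_test) for an already-fitted model
print("Accuracy:", accuracy_score(y_test, y_pred))
# Per-class precision, recall, F1, and support in one table
print(classification_report(y_test, y_pred))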
```

**Sample output** (per-class rows only; the full report also prints accuracy and averaged rows):
```
              precision    recall  f1-score   support
       Paid       0.85      0.92      0.88        50
Did not pay       0.90      0.82      0.86        50
```

### ROC Curve (Bonus Visual Metric)

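The ROC curve plots the true positive rate (recall) against the false positive rate (1 - specificity) at every decision threshold.

**ROC AUC**: Single number (0 to 1) measuring how well the model separates the classes across all thresholds.

- AUC=1.0: Perfect separator
- AUC=0.5: Random guessing

**Use when**: Comparing models or tuning thresholds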
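
One way to build intuition for AUC: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below computes it by brute force over all pairs; the function name and toy data are invented for illustration, and this is meant for understanding rather than for large datasets.

```python
def roc_auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy labels and scores, invented for illustration (1 = positive)
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
auc = roc_auc_pairwise(y_true, scores)
print(round(auc, 3))  # 0.889 (8 of 9 pairs ranked correctly)
```

On the same inputs, scikit-learn's `roc_auc_score(y_true, scores)` should give the same value far more efficiently.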

Sources:

- [1](https://www.geeksforgeeks.org/machine-learning/sklearn-classification-metrics/)
- [2](https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall)
- [3](https://www.geeksforgeeks.org/machine-learning/how-to-produce-a-confusion-matrix-and-find-the-misclassification-rate-of-the-naive-bayes-classifier-in-r/)
- [4](https://www.statology.org/misclassification-rate/)
- [5](https://www.geeksforgeeks.org/machine-learning/metrics-for-machine-learning-model/)
- [6](https://www.geeksforgeeks.org/machine-learning/evaluation-metrics-for-classification-model-in-python/)