# Evaluating Machine Learning Models

Once we've trained a model, how do we know if it's any good? Accuracy is a common metric, but it can be very misleading, especially for imbalanced datasets.

This notebook covers two of the most important concepts in model evaluation:

1.  **Classification Metrics**: Precision, Recall, and F1-Score.
2.  **Model Diagnostics**: The Bias-Variance Tradeoff.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

np.set_printoptions(precision=2, suppress=True)

### What Are Precision, Recall, and F1 Score?

Before jumping into the calculations and code, let’s define these terms:

- **Precision**: Measures the accuracy of positive predictions. It answers the question, “Of all the items the model labeled as positive, how many were actually positive?”
- **Recall (Sensitivity)**: Measures the model’s ability to find all the positive instances. It answers the question, “Of all the actual positives, how many did the model correctly identify?”
- **F1 Score**: The harmonic mean of precision and recall. It balances the two metrics in a single number, which is especially useful when improving one comes at the cost of the other.

### Why Accuracy Isn’t Always Enough

While accuracy is often the first metric people check, it can be misleading on imbalanced datasets. For example:

- Imagine a dataset where 99% of the data belongs to Class A and only 1% to Class B.
- A model that always predicts Class A would have 99% accuracy but would completely fail to detect Class B.
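
To make this concrete, here is a minimal sketch of that scenario in plain Python. The 990/10 split and the `always_a` predictions are made-up values chosen to match the 99%/1% example above, not data from a real problem.

```python
# Hypothetical imbalanced dataset: 990 samples of Class A (0), 10 of Class B (1)
labels = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts Class A
always_a = [0] * len(labels)

# Accuracy: fraction of predictions that match the label
accuracy = sum(t == p for t, p in zip(labels, always_a)) / len(labels)

# Recall for Class B: fraction of actual Class B samples that were detected
detected_b = sum(1 for t, p in zip(labels, always_a) if t == 1 and p == 1)
recall_b = detected_b / labels.count(1)

print(f"Accuracy:       {accuracy:.2%}")   # 99.00%
print(f"Class B recall: {recall_b:.2%}")   # 0.00%
```

The accuracy looks excellent, yet the model never finds a single Class B sample.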

# 1. Precision, Recall, and F1-Score

These metrics are used for **classification** tasks and are more informative than accuracy alone, particularly when the classes are imbalanced.

Imagine an email spam detector:
- **True Positive (TP)**: A spam email is correctly identified as spam.
- **True Negative (TN)**: A normal email is correctly identified as not spam.
- **False Positive (FP)**: A normal email is incorrectly identified as spam (ouch!).
- **False Negative (FN)**: A spam email is incorrectly identified as normal (annoying).

These four outcomes are summarized in a **confusion matrix**.

In [None]:
# Let's imagine some true labels and model predictions
y_true = [0, 1, 0, 1, 0, 0, 1, 1, 0, 1] # 0: Not Spam, 1: Spam
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1] # Model's predictions

# Scikit-learn makes it easy to calculate the confusion matrix
cm = confusion_matrix(y_true, y_pred)

print("Confusion Matrix:")
print(cm)

tn, fp, fn, tp = cm.ravel()
print(f"\nTrue Negatives (TN): {tn}")
print(f"False Positives (FP): {fp}")
print(f"False Negatives (FN): {fn}")
print(f"True Positives (TP): {tp}")
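
For 0/1 labels, scikit-learn puts the true labels on the rows and the predicted labels on the columns, so the matrix reads `[[TN, FP], [FN, TP]]` and `ravel()` yields the four counts in that order. As a sanity check, the same counts can be tallied by hand. This is only a sketch in plain Python; the names `spam_true` and `spam_pred` are copies of the labels above under different names so that nothing in the notebook is overwritten.

```python
# Same labels as the cell above, under different names
spam_true = [0, 1, 0, 1, 0, 0, 1, 1, 0, 1]  # 0: Not Spam, 1: Spam
spam_pred = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1]

# Pair each true label with its prediction and count each outcome
pairs = list(zip(spam_true, spam_pred))
counts = {
    "TN": pairs.count((0, 0)),  # normal email left alone
    "FP": pairs.count((0, 1)),  # normal email flagged as spam
    "FN": pairs.count((1, 0)),  # spam that slipped through
    "TP": pairs.count((1, 1)),  # spam correctly flagged
}
print(counts)  # {'TN': 4, 'FP': 1, 'FN': 1, 'TP': 4}
```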

### Precision: The 'Purity' of Positive Predictions

Precision measures the accuracy of positive predictions.

**Question**: Of all the emails the model flagged as spam, what fraction were *actually* spam?

`Precision = TP / (TP + FP)`

A high precision means the model is trustworthy when it says something is spam (it generates few false positives).

In [None]:
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.2f}")
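
The formula is simple enough to apply by hand. The sketch below uses a hypothetical helper, `precision_from_counts`, with the counts from the spam example (4 true positives, 1 false positive). Returning 0.0 when the model flags nothing is an assumption made here for illustration; scikit-learn's `precision_score` has a `zero_division` parameter for choosing how that case is handled.

```python
def precision_from_counts(tp, fp):
    """Precision = TP / (TP + FP)."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # Assumption: treat the undefined 0/0 case as 0.0
        return 0.0
    return tp / predicted_positive

# Counts from the spam example above
print(precision_from_counts(tp=4, fp=1))  # 0.8

# A model that never flags anything has no positive predictions to score
print(precision_from_counts(tp=0, fp=0))  # 0.0
```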

### Recall (Sensitivity): Finding All the Positives

Recall measures the model’s ability to find all the positive instances.

**Question**: Of all the emails that were *actually* spam, what fraction did the model correctly identify?

`Recall = TP / (TP + FN)`

A high recall means the model is good at finding all the spam emails (it generates few false negatives).

In [None]:
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.2f}")
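
Few false positives and few false negatives are often competing goals. Many classifiers output a score, and the threshold at which a score counts as "spam" decides which way the balance tips. The sketch below illustrates this with invented scores and labels (`scores`, `actual`) and a hypothetical helper `precision_recall_at`. It assumes at least one email is flagged at every threshold tried, so no division by zero occurs.

```python
# Hypothetical spam scores from some classifier, with the true labels
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
actual = [1,    1,    0,    1,    1,    0,    1,    0,    0,    0]

def precision_recall_at(threshold):
    """Flag an email as spam when its score is at least `threshold`."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, a in zip(flagged, actual) if f and a == 1)
    fp = sum(1 for f, a in zip(flagged, actual) if f and a == 0)
    fn = sum(1 for f, a in zip(flagged, actual) if not f and a == 1)
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (0.25, 0.50, 0.85):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.25  precision=0.71  recall=1.00
# threshold=0.50  precision=0.80  recall=0.80
# threshold=0.85  precision=1.00  recall=0.40
```

On this toy data, raising the threshold makes the model more cautious: precision goes up while recall goes down.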

### F1-Score: The Harmonic Mean

The F1-score is the harmonic mean of precision and recall. Because improving one often comes at the cost of the other, F1 combines both into a single number:

`F1 = 2 * (Precision * Recall) / (Precision + Recall)`

Because the harmonic mean is dominated by the smaller of the two values, F1 is high only when both precision and recall are high. It also ignores true negatives, so on imbalanced datasets it is usually more informative than accuracy.

In [None]:
f1 = f1_score(y_true, y_pred)
print(f"F1-Score: {f1:.2f}")
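
To see why a harmonic mean is used rather than a plain average, compare the two on a lopsided model. The helper `f1_from` and the precision/recall values 1.0 and 0.1 are hypothetical, chosen only to illustrate the formula above.

```python
def f1_from(precision_value, recall_value):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision_value + recall_value
    if total == 0:
        return 0.0
    return 2 * precision_value * recall_value / total

# Hypothetical lopsided model: rarely flags spam, but is right when it does
p_lopsided, r_lopsided = 1.0, 0.1
arithmetic_mean = (p_lopsided + r_lopsided) / 2

print(f"Arithmetic mean: {arithmetic_mean:.2f}")                  # 0.55
print(f"F1 (harmonic):   {f1_from(p_lopsided, r_lopsided):.2f}")  # 0.18

# The spam example above has precision = recall = 0.8, so F1 is 0.8 too
print(f"Spam example F1: {f1_from(0.8, 0.8):.2f}")                # 0.80
```

The arithmetic mean would rate the lopsided model as mediocre, while F1 exposes that it misses 90% of the spam.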

# 2. The Bias-Variance Tradeoff

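The bias-variance tradeoff is one of the most fundamental concepts in machine learning. It helps us diagnose whether a model's errors come from being too simple or from being too sensitive to its training data.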

- **Bias**: The error from incorrect assumptions in the learning algorithm. High bias can cause the model to miss the relevant relations between features and target outputs (**underfitting**).
- **Variance**: The error from sensitivity to small fluctuations in the training set. High variance can cause the model to fit the random noise in the training data rather than the intended outputs (**overfitting**).

**The Goal**: Find a balance. A simple model tends to have high bias and low variance, while a complex model tends to have low bias and high variance.
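
Both quantities can be estimated empirically when the true function is known. The sketch below is one possible way to do it, not a standard recipe: it refits polynomials of several degrees on many freshly sampled noisy datasets, then measures how far the average fit is from the truth (squared bias) and how much the fits scatter around their own average (variance). The sine target, the noise level of 0.2, the 200 trials, and the helper `bias_variance` are all assumptions made for this illustration. `numpy.polynomial.Polynomial.fit` is used because it rescales `x` internally, which keeps high-degree fits numerically stable.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_grid = np.linspace(-5, 5, 100)
true_f = np.sin(x_grid)   # noise-free target (assumed known for this demo)
noise_sd = 0.2
n_trials = 200            # number of resampled training sets (arbitrary)

def bias_variance(degree):
    """Estimate average squared bias and variance of a polynomial fit."""
    preds = np.empty((n_trials, x_grid.size))
    for i in range(n_trials):
        # Draw a fresh noisy training set around the same true function
        y_sample = true_f + rng.normal(0, noise_sd, x_grid.size)
        preds[i] = Polynomial.fit(x_grid, y_sample, degree)(x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f) ** 2)  # average fit vs. truth
    variance = np.mean(preds.var(axis=0))         # scatter across trials
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 15)}
for d, (b, v) in results.items():
    print(f"degree {d:>2}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

In this setup the straight line should show by far the largest squared bias, and the degree-15 polynomial the largest variance.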

In [None]:
# Let's create a synthetic dataset with a non-linear relationship.
# The noisy targets are named y_noisy so they don't overwrite the
# y_true spam labels used in the classification cells above.
np.random.seed(42)
x = np.linspace(-5, 5, 100)
y_noisy = np.sin(x) + np.random.normal(0, 0.2, 100)

# --- Model 1: Low Complexity (High Bias, Low Variance) ---
# This is a simple linear model (degree 1 polynomial)
p1 = np.polyfit(x, y_noisy, 1)
y_pred1 = np.polyval(p1, x)

# --- Model 2: High Complexity (Low Bias, High Variance) ---
# This is a very complex model (degree 15 polynomial)
p15 = np.polyfit(x, y_noisy, 15)
y_pred15 = np.polyval(p15, x)

# --- Model 3: 'Just Right' ---
# This model's complexity is close to the true function
p3 = np.polyfit(x, y_noisy, 3)
y_pred3 = np.polyval(p3, x)

# --- Visualization ---
plt.figure(figsize=(15, 5))

# Underfitting
plt.subplot(1, 3, 1)
plt.scatter(x, y_noisy, s=10, label='Data')
plt.plot(x, y_pred1, color='red', label='Fit')
plt.title('High Bias (Underfitting)')
plt.legend()

# Overfitting
plt.subplot(1, 3, 2)
plt.scatter(x, y_noisy, s=10, label='Data')
plt.plot(x, y_pred15, color='red', label='Fit')
plt.title('High Variance (Overfitting)')
plt.legend()

# Good Fit
plt.subplot(1, 3, 3)
plt.scatter(x, y_noisy, s=10, label='Data')
plt.plot(x, y_pred3, color='red', label='Fit')
plt.title('Good Balance')
plt.legend()

plt.show()
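
The plots show each model only against the data it was trained on. In practice, a common way to diagnose bias and variance is to compare the error on the training data with the error on held-out data. The sketch below does that for the same three polynomial degrees. The 30/70 random split, the seed, and the variable names are assumptions made for this illustration, and `numpy.polynomial.Polynomial.fit` is used in place of `np.polyfit` because it rescales `x` internally.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)
x_all = np.linspace(-5, 5, 100)
y_all = np.sin(x_all) + rng.normal(0, 0.2, x_all.size)

# Assumed split: 30 random points for training, the other 70 held out
train_idx = rng.choice(x_all.size, size=30, replace=False)
test_mask = np.ones(x_all.size, dtype=bool)
test_mask[train_idx] = False

train_mse, test_mse = {}, {}
for degree in (1, 3, 15):
    # Fit on the training points only
    model = Polynomial.fit(x_all[train_idx], y_all[train_idx], degree)
    train_mse[degree] = np.mean((model(x_all[train_idx]) - y_all[train_idx]) ** 2)
    test_mse[degree] = np.mean((model(x_all[test_mask]) - y_all[test_mask]) ** 2)
    print(f"degree {degree:>2}: train MSE = {train_mse[degree]:.3f}, "
          f"held-out MSE = {test_mse[degree]:.3f}")
```

Training error can only go down as the degree increases, so it cannot reveal overfitting by itself. A held-out error well above the training error suggests high variance, while errors that are both high and similar suggest high bias.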