# Error Analysis in Machine Learning

## Introduction

In this notebook, we will explore **Error Analysis** in Machine Learning. Many beginners stop at evaluating a model with overall metrics like accuracy, precision, and recall. 

However, these metrics alone are often **not enough** to fully understand the model’s strengths and weaknesses.

### Why Error Analysis?

Imagine you build a model that predicts whether passengers on the Titanic survived. 

You train it, evaluate it, and achieve an accuracy of 80%. That seems like a great result! But what if we look deeper?

- Did the model perform equally well for **men and women**?
- What about **children versus adults**?
- Did it struggle more with **certain passenger classes**?

Overall performance metrics can **hide critical weaknesses**. A model that performs well on average may still make **systematic errors** on important subgroups.

If these errors affect certain groups more than others, the model may reinforce biases and lead to unfair or unreliable decisions.

### What is Error Analysis?

Error Analysis helps us:
- Identify subgroups where the model performs poorly
- Diagnose biases or unfairness in predictions
- Gain insights to improve model performance in targeted ways
- Detect situations where the model **fails consistently**

Instead of just looking at overall performance, we **slice** our dataset into different subsets and analyze performance within each. This allows us to find hidden weaknesses and understand **why** the model makes mistakes.
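
As a minimal sketch of what slicing looks like in practice, the toy example below groups a handful of predictions by `sex` and compares per-slice accuracy with the overall figure. The eight rows and the `results` DataFrame are invented purely for illustration; they are not taken from the Titanic data or from the model trained later in this notebook.

```python
import pandas as pd

# Hypothetical predictions for eight passengers (values invented for illustration).
results = pd.DataFrame({
    "sex":    ["male", "male", "male", "male", "female", "female", "female", "female"],
    "y_true": [0, 0, 1, 1, 1, 1, 0, 1],
    "y_pred": [0, 0, 0, 0, 1, 1, 0, 1],
})

# Mark each prediction as correct or not, then average within each slice.
results["correct"] = results["y_true"] == results["y_pred"]
overall = results["correct"].mean()
accuracy_by_sex = results.groupby("sex")["correct"].mean()

print(f"Overall accuracy: {overall:.2f}")
print(accuracy_by_sex)
```

In this toy data the overall accuracy looks acceptable, yet every error falls in a single slice, which is exactly the kind of pattern an aggregate metric cannot show.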

### Real-World Importance of Error Analysis

Error Analysis is critical in **high-stakes applications** like:
- **Healthcare:** A diagnostic AI system might perform well overall but fail to detect certain diseases in specific demographics.
- **Finance:** A credit approval model may systematically reject applicants from certain backgrounds due to biases in training data.
- **Autonomous Vehicles:** A self-driving car model may struggle with detecting pedestrians at night compared to daylight conditions.

In short, **Error Analysis bridges the gap between high-level performance metrics and real-world usability**, making models not just accurate but also **fair, reliable, and trustworthy**.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report

In [1]:
# Load a simple binary classification dataset (e.g., The Titanic Dataset)

df = sns.load_dataset("titanic")

df.head()

**Note:** 

We will be very brief and *sloppy* with our feature engineering and data prep here. Although important, it is not the point of this demonstration.

In [None]:
# Drop rows with missing target values
df = df.dropna(subset=["survived"])

# Convert categorical features to numerical
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df['embark_town'] = df['embark_town'].astype('category').cat.codes
df['class'] = df['class'].astype('category').cat.codes
df['who'] = df['who'].astype('category').cat.codes

df.head()

In [None]:
# Select relevant features
features = ['pclass', 'sex', 'age', 'fare', 'sibsp', 'parch']
X = df[features]
y = df['survived']

# Drop rows with missing values in features
X = X.dropna()
y = y.loc[X.index]

print(X.shape)
print(y.shape)

In [None]:
X.head()

In [None]:
y

In [None]:
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [None]:
# Evaluate on test set
y_pred = model.predict(X_test)
print("Overall Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
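
The imports above include `confusion_matrix`, which is a natural first step toward error analysis because it separates the two kinds of mistakes a binary classifier can make. The sketch below uses small hand-written label lists (`y_true_demo` and `y_pred_demo`, both hypothetical) so that the counts are easy to verify by eye.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, invented for illustration (1 = survived, 0 = did not survive).
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

print(f"True negatives:  {tn}")
print(f"False positives: {fp}")
print(f"False negatives: {fn}")
print(f"True positives:  {tp}")
```

In this notebook, the same call would presumably be `confusion_matrix(y_test, y_pred)`; the false positives and false negatives are the rows worth inspecting by hand.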

## Error Analysis

Now, let's go beyond these overall metrics and analyze **performance across different subgroups**. 
We will analyze how well the model performs across age groups.

### Performance Across Age Groups

Here we split the test set into age groups and compute accuracy, precision, and recall for each.

In [None]:
# Define age bins; pd.cut uses right-closed intervals: (0, 18], (18, 40], (40, 60], (60, 100]
age_bins = [0, 18, 40, 60, 100]
age_labels = ['0-18', '19-40', '41-60', '60+']

# Keep the grouping in a separate Series so X_test still contains only model features
age_group_test = pd.cut(X_test['age'], bins=age_bins, labels=age_labels)

# Align predictions with the test-set index once, outside the loop
y_pred_series = pd.Series(y_pred, index=X_test.index)

# Compute metrics per age group
for age_group in age_labels:
    idx = age_group_test.index[age_group_test == age_group]
    if len(idx) == 0:
        # Skip empty groups: metrics are undefined without samples
        print(f"Performance for Age Group {age_group}: no test samples\n")
        continue
    y_true_subset = y_test.loc[idx]
    y_pred_subset = y_pred_series.loc[idx]

    acc = accuracy_score(y_true_subset, y_pred_subset)
    prec = precision_score(y_true_subset, y_pred_subset, zero_division=0)
    rec = recall_score(y_true_subset, y_pred_subset, zero_division=0)

    print(f"Performance for Age Group {age_group} (n={len(idx)}):")
    print(f"  Accuracy: {acc:.4f}")
    print(f"  Precision: {prec:.4f}")
    print(f"  Recall: {rec:.4f}\n")
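
The same per-group computation applies to any grouping column, such as sex or passenger class from the questions in the introduction. One possible way to make it reusable is a small helper that returns a table with one row per group. `metrics_by_group` is a hypothetical name introduced here, and the demo labels are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score


def metrics_by_group(y_true, y_pred, groups):
    """Return one row of metrics per distinct value in `groups`.

    Hypothetical helper: the three arguments are assumed to be equal-length
    sequences aligned by position. Rows with a missing group are dropped.
    """
    frame = pd.DataFrame(
        {"y_true": list(y_true), "y_pred": list(y_pred), "group": list(groups)}
    )
    rows = []
    for name, part in frame.groupby("group"):
        rows.append({
            "group": name,
            "n": len(part),  # group size, to judge how noisy the estimate is
            "accuracy": accuracy_score(part["y_true"], part["y_pred"]),
            "precision": precision_score(part["y_true"], part["y_pred"], zero_division=0),
            "recall": recall_score(part["y_true"], part["y_pred"], zero_division=0),
        })
    return pd.DataFrame(rows).set_index("group")


# Invented example data, for illustration only.
demo_groups = ["child", "child", "child", "adult", "adult", "adult", "adult", "adult"]
demo_true = [1, 1, 0, 0, 0, 1, 1, 0]
demo_pred = [0, 1, 0, 0, 0, 1, 1, 0]

summary = metrics_by_group(demo_true, demo_pred, demo_groups)
print(summary)
```

With the variables from this notebook, a call such as `metrics_by_group(y_test, y_pred, X_test['sex'])` or `metrics_by_group(y_test, y_pred, X_test['pclass'])` should work the same way. The `n` column matters: a metric computed on a handful of rows can swing widely by chance.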

### Conclusion

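We can now identify the age groups on which the model performs worse. Keep in mind that metrics for small groups are noisy, so check the group size before drawing conclusions.
If the model performs poorly on certain subgroups, we can: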
- Collect more data from those groups
- Try different model architectures
- Adjust loss functions to handle class imbalances better
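
A random forest has no loss function to adjust directly, but scikit-learn's `class_weight` parameter serves a related purpose by giving errors on the rarer class more weight during training. The sketch below illustrates that substitute technique on synthetic data from `make_classification`. The dataset, the `weighted_model` name, and all parameter values are assumptions made for illustration, and whether reweighting actually helps has to be checked per dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 90% negatives), used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# class_weight="balanced" weights classes inversely to their training frequency.
weighted_model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
weighted_model.fit(X_tr, y_tr)

# Recall on the minority (positive) class is the metric to compare against
# an unweighted baseline trained on the same split.
minority_recall = recall_score(y_te, weighted_model.predict(X_te))
print(f"Minority-class recall: {minority_recall:.4f}")
```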

This is how **Error Analysis** helps improve machine learning models beyond just looking at overall accuracy!