# ML Lab 02: Your Data Is a Mess

In Lab 1 you trained a model on clean, balanced data and got ~97% accuracy. That was the
parking lot. Real data is the highway.

In this lab, you'll intentionally corrupt a dataset to simulate the kinds of problems you'll
hit in production: duplicates, mislabeled examples, garbage strings, and class imbalance.
You'll watch your model fall apart, learn to diagnose the issues with EDA, clean the data,
and build validation checks that catch these problems before they reach your model.

---

## Section 1: The Pristine Baseline

First, let's establish a clean baseline. Same 20 Newsgroups dataset, same pipeline from Lab 1.
This is the score you're trying to recover after we trash the data.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the same dataset from Lab 1
categories = ['rec.sport.baseball', 'sci.space']

train_data = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)

test_data = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42
)

print(f"Training samples: {len(train_data.data)}")
print(f"Test samples:     {len(test_data.data)}")
print(f"Classes:          {train_data.target_names}")

In [None]:
# Train the pristine baseline model
pristine_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

pristine_pipeline.fit(train_data.data, train_data.target)
pristine_preds = pristine_pipeline.predict(test_data.data)
pristine_accuracy = accuracy_score(test_data.target, pristine_preds)

print(f"Pristine baseline accuracy: {pristine_accuracy:.3f}")
print(f"\nThis is the parking lot. Real data is the highway.")
print(f"\nFull classification report:")
print(classification_report(test_data.target, pristine_preds,
                            target_names=train_data.target_names))

**Bookmark this number.** We'll see how far it drops when the data gets messy, and how close we can recover.

---

## Section 2: Meet Your Messy Data

Real-world data is never this clean. Let's simulate four common data quality problems:

1. **Duplicates** (~10% of samples repeated): inflate the dataset and bias the model toward the duplicated examples
2. **Mislabeled examples** (~5% of labels flipped): teach the model the wrong patterns
3. **Garbage strings** (~3% empty or junk text): add noise and spend training examples on text that carries no signal
4. **Class imbalance** (roughly 90/10): pushes the model toward predicting the majority class

We'll apply the corruptions in sequence, printing the dataset size after each step, and then train once on the combined result.
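
To see why class imbalance hurts, here is a minimal sketch on synthetic labels (not the lab's data). It trains a `DummyClassifier` that always predicts the majority class on a 90/10 split, then scores it on a balanced test set like the one used in this lab. The arrays `X_train`, `y_train`, `X_test`, and `y_test` are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic 90/10 training labels; the features are placeholders because
# DummyClassifier ignores them
y_train = np.array([0] * 90 + [1] * 10)
X_train = np.zeros((len(y_train), 1))

# Balanced test labels, similar in spirit to the lab's test set
y_test = np.array([0] * 50 + [1] * 50)
X_test = np.zeros((len(y_test), 1))

# A model that has learned nothing except "the majority class is 0"
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
preds = baseline.predict(X_test)

acc = accuracy_score(y_test, preds)
minority_recall = recall_score(y_test, preds, pos_label=1)
print(f"Accuracy on balanced test set: {acc:.2f}")
print(f"Recall on the minority class:  {minority_recall:.2f}")
```

A real model trained on imbalanced data will usually do better than this floor, but it is pulled in the same direction: high recall on the majority class, low recall on the minority class.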

In [None]:
np.random.seed(42)

# Start with a clean copy
messy_texts = list(train_data.data)
messy_labels = list(train_data.target)

print(f"Starting with {len(messy_texts)} clean samples")
print(f"Class distribution: {sum(np.array(messy_labels) == 0)} baseball, {sum(np.array(messy_labels) == 1)} space")

# --- Corruption 1: Inject duplicates (~10%) ---
n_dups = int(len(messy_texts) * 0.10)
dup_idx = np.random.choice(len(messy_texts), n_dups, replace=False)
for i in dup_idx:
    messy_texts.append(messy_texts[i])
    messy_labels.append(messy_labels[i])
print(f"\n[1] Injected {n_dups} duplicates -> {len(messy_texts)} samples")

# --- Corruption 2: Flip labels (~5%) ---
n_flip = int(len(messy_texts) * 0.05)
flip_idx = np.random.choice(len(messy_texts), n_flip, replace=False)
for i in flip_idx:
    messy_labels[i] = 1 - messy_labels[i]
print(f"[2] Flipped {n_flip} labels")

# --- Corruption 3: Inject garbage strings (~3%) ---
garbage_strings = [
    "",
    " ",
    "asdf",
    "xxx",
    "!!!",
    "null",
    "N/A",
    "test",
    ".",
    "---",
]
n_garbage = int(len(messy_texts) * 0.03)
garbage_idx = np.random.choice(len(messy_texts), n_garbage, replace=False)
for i in garbage_idx:
    messy_texts[i] = np.random.choice(garbage_strings)
print(f"[3] Replaced {n_garbage} samples with garbage strings")

# --- Corruption 4: Create class imbalance (90/10) ---
# Keep all baseball (class 0), subsample space (class 1) to 10% of total
messy_arr = np.array(messy_labels)
class0_idx = np.where(messy_arr == 0)[0]
class1_idx = np.where(messy_arr == 1)[0]

# Target: class 1 should be ~10% of the final dataset
target_class1 = int(len(class0_idx) * 0.11)  # ~10% of total
if len(class1_idx) > target_class1:
    keep_class1 = np.random.choice(class1_idx, target_class1, replace=False)
    keep_idx = np.concatenate([class0_idx, keep_class1])
    np.random.shuffle(keep_idx)
    messy_texts = [messy_texts[i] for i in keep_idx]
    messy_labels = [messy_labels[i] for i in keep_idx]

messy_arr = np.array(messy_labels)
print(f"[4] Created class imbalance -> {len(messy_texts)} samples")
print(f"    Class 0 (baseball): {sum(messy_arr == 0)} ({sum(messy_arr == 0)/len(messy_arr)*100:.1f}%)")
print(f"    Class 1 (space):    {sum(messy_arr == 1)} ({sum(messy_arr == 1)/len(messy_arr)*100:.1f}%)")

In [None]:
# Train the SAME model on the messy data
messy_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

messy_pipeline.fit(messy_texts, messy_labels)
messy_preds = messy_pipeline.predict(test_data.data)
messy_accuracy = accuracy_score(test_data.target, messy_preds)

print(f"Pristine accuracy: {pristine_accuracy:.3f}")
print(f"Messy accuracy:    {messy_accuracy:.3f}")
print(f"Drop:              {pristine_accuracy - messy_accuracy:.3f}")
print(f"\nFull classification report on messy model:")
print(classification_report(test_data.target, messy_preds,
                            target_names=train_data.target_names))

In [None]:
# Side-by-side confusion matrices: pristine vs messy
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

cm_pristine = confusion_matrix(test_data.target, pristine_preds)
sns.heatmap(cm_pristine, annot=True, fmt='d', cmap='Greens',
            xticklabels=train_data.target_names,
            yticklabels=train_data.target_names,
            ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title(f'Pristine Model (acc={pristine_accuracy:.3f})')

cm_messy = confusion_matrix(test_data.target, messy_preds)
sns.heatmap(cm_messy, annot=True, fmt='d', cmap='Reds',
            xticklabels=train_data.target_names,
            yticklabels=train_data.target_names,
            ax=axes[1])
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_title(f'Messy Model (acc={messy_accuracy:.3f})')

plt.tight_layout()
plt.show()

print("Same algorithm, same test set, different data quality. Data quality > model complexity.")

**The model didn't get dumber — the data got worse.** Same algorithm, same hyperparameters.
The only thing that changed was the quality of the training data.
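
The cell in Section 2 applies all the corruptions before training once, so the accuracy drop is their combined effect. To attribute the damage to individual problems, one option is to wrap each corruption in its own function and apply them separately. The sketch below is an assumption about how that could look: the helper names (`add_duplicates`, `flip_labels`, `add_garbage`) and the toy data are hypothetical, and the class-imbalance step is left out for brevity.

```python
import numpy as np

def add_duplicates(texts, labels, frac, rng):
    """Append copies of a random `frac` of the samples."""
    idx = rng.choice(len(texts), int(len(texts) * frac), replace=False)
    new_texts = list(texts) + [texts[i] for i in idx]
    new_labels = list(labels) + [labels[i] for i in idx]
    return new_texts, new_labels

def flip_labels(texts, labels, frac, rng):
    """Flip a random `frac` of binary (0/1) labels."""
    labels = list(labels)
    for i in rng.choice(len(labels), int(len(labels) * frac), replace=False):
        labels[i] = 1 - labels[i]
    return list(texts), labels

def add_garbage(texts, labels, frac, rng, garbage=("", "asdf", "N/A")):
    """Overwrite a random `frac` of the texts with junk strings."""
    texts = list(texts)
    for i in rng.choice(len(texts), int(len(texts) * frac), replace=False):
        texts[i] = str(rng.choice(garbage))
    return texts, list(labels)

# Same rates as the lab's corruption cell
CORRUPTIONS = {
    "duplicates": (add_duplicates, 0.10),
    "flipped labels": (flip_labels, 0.05),
    "garbage": (add_garbage, 0.03),
}

# Toy stand-in for train_data.data / train_data.target
toy_texts = [f"document number {i}" for i in range(100)]
toy_labels = [i % 2 for i in range(100)]

corrupted = {}
for name, (corrupt, frac) in CORRUPTIONS.items():
    rng = np.random.default_rng(42)  # fresh generator for each corruption
    corrupted[name] = corrupt(toy_texts, toy_labels, frac, rng)
    print(f"{name}: {len(corrupted[name][0])} samples")
```

In the lab you would pass `list(train_data.data)` and `list(train_data.target)` instead of the toy lists, fit a fresh pipeline on each corrupted copy, and compare each test accuracy against `pristine_accuracy`.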

---

## Section 3: Explore Before You Train (EDA)

Before you try to fix anything, you need to *understand* what's wrong. This is **Exploratory Data
Analysis** (EDA) — the step most people skip and then regret.

We'll check:
1. Class distribution
2. Text length distribution, including empty and very short texts
3. Exact duplicates
4. Random samples to spot-check labels

In [None]:
# Put messy data into a DataFrame for easier analysis
df = pd.DataFrame({
    'text': messy_texts,
    'label': messy_labels
})
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len().fillna(0).astype(int)

print(f"Dataset shape: {df.shape}")
print(f"\nBasic stats:")
print(df.describe())

In [None]:
# Check 1: Class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

class_counts = df['label'].value_counts().sort_index()
class_names = [train_data.target_names[i] for i in class_counts.index]
colors = ['#2196F3', '#FF9800']

axes[0].bar(class_names, class_counts.values, color=colors)
axes[0].set_title('Class Distribution (Messy Data)')
axes[0].set_ylabel('Count')
for i, (name, count) in enumerate(zip(class_names, class_counts.values)):
    axes[0].text(i, count + 5, f'{count}\n({count/len(df)*100:.1f}%)',
                ha='center', fontweight='bold')

# Compare with what clean data looks like
clean_counts = pd.Series(train_data.target).value_counts().sort_index()
clean_names = [train_data.target_names[i] for i in clean_counts.index]
axes[1].bar(clean_names, clean_counts.values, color=colors, alpha=0.7)
axes[1].set_title('Class Distribution (Clean Data)')
axes[1].set_ylabel('Count')
for i, (name, count) in enumerate(zip(clean_names, clean_counts.values)):
    axes[1].text(i, count + 5, f'{count}\n({count/len(train_data.target)*100:.1f}%)',
                ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

ratio = class_counts.max() / class_counts.min()
print(f"Imbalance ratio: {ratio:.1f}:1")
print("Rule of thumb: a ratio above 3:1 is a red flag; above 10:1 is a problem.")

In [None]:
# Check 2: Text length distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of text lengths
axes[0].hist(df['text_length'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_title('Text Length Distribution')
axes[0].set_xlabel('Character count')
axes[0].set_ylabel('Frequency')
axes[0].axvline(x=10, color='red', linestyle='--', label='Min threshold (10 chars)')
axes[0].legend()

# Zoom in on the short texts
short_texts = df[df['text_length'] < 50]
axes[1].hist(short_texts['text_length'], bins=20, color='tomato', edgecolor='black', alpha=0.7)
axes[1].set_title(f'Short Texts (< 50 chars) — {len(short_texts)} samples')
axes[1].set_xlabel('Character count')
axes[1].set_ylabel('Frequency')
axes[1].axvline(x=10, color='red', linestyle='--', label='Min threshold (10 chars)')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"Texts shorter than 10 chars: {len(df[df['text_length'] < 10])}")
print(f"Empty texts: {len(df[df['text_length'] == 0])}")
print(f"\nSample short texts:")
for _, row in df[df['text_length'] < 10].head(10).iterrows():
    print(f"  label={row['label']} len={row['text_length']}: '{row['text']}'")

In [None]:
# Check 3: Duplicates
n_exact_dups = df.duplicated(subset='text').sum()
print(f"Exact duplicates: {n_exact_dups} ({n_exact_dups/len(df)*100:.1f}%)")

# Show some duplicated texts
dup_mask = df.duplicated(subset='text', keep=False)
dup_df = df[dup_mask].sort_values('text')
if len(dup_df) > 0:
    print(f"\nSample duplicated entries (showing first 5 groups):")
    seen = set()
    count = 0
    for _, row in dup_df.iterrows():
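        text_preview = row['text'][:80].replace('\n', ' ')
        # Track the full text, not the preview, so distinct texts that share
        # their first 80 characters are not collapsed into one group
        if row['text'] not in seen:
            seen.add(row['text'])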
            n_copies = len(df[df['text'] == row['text']])
            print(f"  [{n_copies} copies] label={row['label']}: \"{text_preview}...\"")
            count += 1
            if count >= 5:
                break

In [None]:
# Check 4: Spot-check labels by sampling random examples
print("Random samples from each class — do the labels look right?\n")

for label in [0, 1]:
    class_name = train_data.target_names[label]
    class_df = df[(df['label'] == label) & (df['text_length'] > 50)]
    samples = class_df.sample(n=min(3, len(class_df)), random_state=42)
    
    print(f"=== {class_name} (label={label}) ===")
    for _, row in samples.iterrows():
        preview = row['text'][:150].replace('\n', ' ')
        print(f"  \"{preview}...\"")
    print()

**EDA Summary:**
- Heavy class imbalance (look at the bar chart)
- Duplicate entries inflating the dataset
- Garbage/empty strings that the model will try to learn from
- Some labels might be wrong (hard to tell without domain knowledge — we'll use a trick in the next section)

**Rule:** Never train on data you haven't explored. Ten minutes of EDA can save hours of debugging model performance.
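
The duplicate check above only catches exact matches. Near-duplicates (the same post with a changed signature or an extra word) slip through. One way to surface them, sketched here under assumptions, is to compare TF-IDF vectors with cosine similarity. The function name `find_near_duplicates`, the toy texts, and the threshold are illustrative choices, not part of the lab. The full similarity matrix is n × n, so this only suits small datasets.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_near_duplicates(texts, threshold=0.9):
    """Return (i, j, similarity) for text pairs with cosine similarity >= threshold."""
    tfidf = TfidfVectorizer().fit_transform(texts)
    sim = cosine_similarity(tfidf)  # dense n x n matrix
    # Keep only the upper triangle (i < j) so each pair is reported once
    # and a text is never compared with itself
    i_idx, j_idx = np.where(np.triu(sim, k=1) >= threshold)
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(i_idx, j_idx)]

toy_texts = [
    "The shuttle launch was delayed by bad weather at the cape",
    "The shuttle launch was delayed by bad weather at the cape today",
    "The pitcher threw a complete game shutout last night",
]

pairs = find_near_duplicates(toy_texts, threshold=0.8)
for i, j, score in pairs:
    print(f"texts {i} and {j} look like near-duplicates (similarity {score:.2f})")
```

On the lab's DataFrame you could call it on `df['text'].tolist()` after dropping empty strings. The right threshold depends on the data, so inspect a few flagged pairs before removing anything.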

---

## Section 4: Clean It Up

Now let's fix each issue step by step. We'll track the sample count at each step
so you can see exactly what we're removing and why.

In [None]:
# Start with the messy DataFrame
df_clean = df.copy()
cleaning_log = [{'step': 'Original messy data', 'samples': len(df_clean)}]
print(f"Starting: {len(df_clean)} samples")

# --- Step 1: Remove exact duplicates ---
n_before = len(df_clean)
df_clean = df_clean.drop_duplicates(subset='text', keep='first')
n_removed = n_before - len(df_clean)
cleaning_log.append({'step': 'Remove duplicates', 'samples': len(df_clean)})
print(f"\n[Step 1] Removed {n_removed} exact duplicates -> {len(df_clean)} samples")

# --- Step 2: Remove empty/very short texts ---
n_before = len(df_clean)
df_clean = df_clean[df_clean['text_length'] >= 10]
n_removed = n_before - len(df_clean)
cleaning_log.append({'step': 'Remove short texts', 'samples': len(df_clean)})
print(f"[Step 2] Removed {n_removed} short/empty texts -> {len(df_clean)} samples")

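# --- Step 3: Flag suspicious labels ---
# Train a quick model on the current data to find samples where the model's
# prediction strongly disagrees with the label (potential mislabels).
# Caveat: these are in-sample predictions, so the model has already seen the
# noisy labels it is judging. Treat the flags as candidates, not proof.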
quick_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])
quick_pipe.fit(df_clean['text'].tolist(), df_clean['label'].values)
proba = quick_pipe.predict_proba(df_clean['text'].tolist())
predicted = quick_pipe.predict(df_clean['text'].tolist())

# Flag samples where: prediction disagrees with label AND confidence is high
confidence = np.max(proba, axis=1)
disagree = predicted != df_clean['label'].values
suspicious = disagree & (confidence > 0.9)

n_suspicious = suspicious.sum()
print(f"\n[Step 3] Found {n_suspicious} suspicious labels (model strongly disagrees)")
print(f"  These are likely mislabeled. Removing them.")

n_before = len(df_clean)
df_clean = df_clean[~suspicious]
n_removed = n_before - len(df_clean)
cleaning_log.append({'step': 'Remove suspicious labels', 'samples': len(df_clean)})
print(f"  Removed {n_removed} suspicious samples -> {len(df_clean)} samples")
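
Step 3 scores each sample with a model that was trained on that same sample, which can hide mislabels the model has memorized. A common alternative is to use out-of-fold predictions, so every sample is judged by a model that never saw its label. The sketch below uses scikit-learn's `cross_val_predict` for this. The helper name `flag_suspicious_labels`, the toy texts, and the threshold values are assumptions for illustration, and it assumes labels are the integers 0..k-1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

def flag_suspicious_labels(texts, labels, threshold=0.9, n_splits=5):
    """Boolean mask of samples whose out-of-fold prediction confidently
    disagrees with the given label. Assumes labels are 0..k-1."""
    labels = np.asarray(labels)
    pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    # Each row of `proba` comes from a model fitted on the other folds only
    proba = cross_val_predict(pipe, texts, labels, cv=n_splits,
                              method="predict_proba")
    predicted = proba.argmax(axis=1)  # column order matches sorted labels
    confidence = proba.max(axis=1)
    return (predicted != labels) & (confidence > threshold)

# Toy data: two easily separable topics with one deliberately flipped label
baseball = [f"pitcher inning homerun batter game {i}" for i in range(20)]
space = [f"orbit rocket shuttle launch nasa {i}" for i in range(20)]
toy_texts = baseball + space
toy_labels = [0] * 20 + [1] * 20
toy_labels[3] = 1  # a baseball text labeled as space

mask = flag_suspicious_labels(toy_texts, toy_labels, threshold=0.6)
print(f"Flagged {mask.sum()} of {len(mask)} samples: {np.where(mask)[0].tolist()}")
```

On the lab's data you would pass `df_clean['text'].tolist()` and `df_clean['label'].values`. With heavy class imbalance, a confident model can also flag correctly labeled minority examples, so it is worth reading a sample of the flagged texts before dropping them.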

In [None]:
# Show the cleaning pipeline summary
print("Cleaning Pipeline Summary:")
print("=" * 50)
for i, entry in enumerate(cleaning_log):
    if i == 0:
        print(f"  {entry['step']:30s}  {entry['samples']:5d} samples")
    else:
        removed = cleaning_log[i-1]['samples'] - entry['samples']
        print(f"  {entry['step']:30s}  {entry['samples']:5d} samples  (-{removed})")

total_removed = cleaning_log[0]['samples'] - cleaning_log[-1]['samples']
print("-" * 50)
print(f"  {'Total removed':30s}  {total_removed:5d} samples ({total_removed/cleaning_log[0]['samples']*100:.1f}%)")

# Show remaining class distribution
print(f"\nRemaining class distribution:")
for label in [0, 1]:
    count = (df_clean['label'] == label).sum()
    print(f"  {train_data.target_names[label]}: {count} ({count/len(df_clean)*100:.1f}%)")

**Note:** We still have class imbalance after cleaning. We'll handle that at training time
using `class_weight='balanced'` in LogisticRegression, which weights each class inversely to
its frequency so that mistakes on the minority class cost more in the loss function.
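
To make the weighting concrete, here is a small sketch on synthetic 90/10 labels (not the lab's data). scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`; the code computes it both with `compute_class_weight` and by hand to show they agree.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with a 90/10 split
y = np.array([0] * 90 + [1] * 10)

# What class_weight='balanced' resolves to for these labels
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)

# The same formula written out: n_samples / (n_classes * count_per_class)
manual = len(y) / (2 * np.bincount(y))

print(f"Class 0 weight: {weights[0]:.3f}")
print(f"Class 1 weight: {weights[1]:.3f}")
```

With these weights, each class contributes the same total weight to the loss (90 × 0.556 ≈ 10 × 5.0 ≈ 50), so the minority class is no longer drowned out.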

---

## Section 5: Retrain and Compare

Now let's train on the cleaned data and see how much we recovered.

In [None]:
# Train on cleaned data WITH class_weight='balanced' to handle remaining imbalance
clean_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42))
])

clean_pipeline.fit(df_clean['text'].tolist(), df_clean['label'].values)
clean_preds = clean_pipeline.predict(test_data.data)
clean_accuracy = accuracy_score(test_data.target, clean_preds)

print(f"Pristine baseline: {pristine_accuracy:.3f}")
print(f"Messy model:       {messy_accuracy:.3f}")
print(f"Cleaned model:     {clean_accuracy:.3f}")
print(f"\nRecovered {clean_accuracy - messy_accuracy:.3f} accuracy points by cleaning the data.")

In [None]:
# Side-by-side comparison table
from sklearn.metrics import precision_recall_fscore_support

def get_metrics(y_true, y_pred, class_names):
    """Get per-class and overall metrics."""
    p, r, f, s = precision_recall_fscore_support(y_true, y_pred, average=None)
    acc = accuracy_score(y_true, y_pred)
    rows = []
    for i, name in enumerate(class_names):
        rows.append({
            'Class': name,
            'Precision': f'{p[i]:.3f}',
            'Recall': f'{r[i]:.3f}',
            'F1': f'{f[i]:.3f}',
            'Support': int(s[i])
        })
    rows.append({
        # Unweighted mean over classes, i.e. the macro average
        'Class': 'Macro avg',
        'Precision': f'{np.mean(p):.3f}',
        'Recall': f'{np.mean(r):.3f}',
        'F1': f'{np.mean(f):.3f}',
        'Support': int(np.sum(s))
    })
    return pd.DataFrame(rows)

print("=== MESSY MODEL ===")
messy_metrics = get_metrics(test_data.target, messy_preds, train_data.target_names)
print(messy_metrics.to_string(index=False))

print("\n=== CLEANED MODEL ===")
clean_metrics = get_metrics(test_data.target, clean_preds, train_data.target_names)
print(clean_metrics.to_string(index=False))

print("\n=== PRISTINE BASELINE ===")
pristine_metrics = get_metrics(test_data.target, pristine_preds, train_data.target_names)
print(pristine_metrics.to_string(index=False))

In [None]:
# Side-by-side confusion matrices: messy vs cleaned vs pristine
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, preds, title, cmap in [
    (axes[0], messy_preds, f'Messy (acc={messy_accuracy:.3f})', 'Reds'),
    (axes[1], clean_preds, f'Cleaned (acc={clean_accuracy:.3f})', 'Blues'),
    (axes[2], pristine_preds, f'Pristine (acc={pristine_accuracy:.3f})', 'Greens'),
]:
    cm = confusion_matrix(test_data.target, preds)
    sns.heatmap(cm, annot=True, fmt='d', cmap=cmap,
                xticklabels=train_data.target_names,
                yticklabels=train_data.target_names,
                ax=ax)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_title(title)

plt.tight_layout()
plt.show()

print("Data quality > model complexity. Compare how close the cleaned model gets to the pristine one.")

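**Key takeaway:** We kept the same algorithm and the same features. The only model-side change
was `class_weight='balanced'`; everything else came from cleaning the data. Compare the cleaned
and pristine numbers above to see how much of the lost performance that recovered.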

---

## Section 6: Build a Data Validation Pipeline

The cleaning we just did was manual. In production, you need **automated checks** that run
before every training job. If the data fails validation, training should not proceed.

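Let's build simple validation functions that would have caught the duplicates, garbage strings,
and class imbalance from Section 2. Mislabels are harder to catch with simple rules: they need a
model-based check like the one in Section 4.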

In [None]:
def check_class_balance(labels, max_ratio=3.0):
    """Check if class distribution is within acceptable bounds."""
    counts = pd.Series(labels).value_counts()
    ratio = counts.max() / counts.min()
    passed = ratio <= max_ratio
    status = 'PASS' if passed else 'FAIL'
    print(f"  [{status}] Class balance — ratio: {ratio:.1f}:1 (max allowed: {max_ratio:.1f}:1)")
    if not passed:
        print(f"         Distribution: {dict(counts)}")
    return passed

def check_duplicates(texts, max_pct=1.0):
    """Check for exact duplicate texts."""
    n_dups = pd.Series(texts).duplicated().sum()
    pct = n_dups / len(texts) * 100
    passed = pct <= max_pct
    status = 'PASS' if passed else 'FAIL'
    print(f"  [{status}] Duplicates — {n_dups} found ({pct:.1f}%, max allowed: {max_pct:.1f}%)")
    return passed

def check_empty_texts(texts, max_pct=0.5):
    """Check for empty or very short texts (< 10 chars)."""
    lengths = pd.Series(texts).str.len()
    n_short = (lengths < 10).sum()
    pct = n_short / len(texts) * 100
    passed = pct <= max_pct
    status = 'PASS' if passed else 'FAIL'
    print(f"  [{status}] Short/empty texts — {n_short} found ({pct:.1f}%, max allowed: {max_pct:.1f}%)")
    return passed

def check_text_length_distribution(texts, min_median=100, max_std_ratio=5.0):
    """Check that text lengths are reasonable."""
    lengths = pd.Series(texts).str.len()
    median_len = lengths.median()
    std_ratio = lengths.std() / lengths.median() if lengths.median() > 0 else float('inf')
    passed = median_len >= min_median and std_ratio <= max_std_ratio
    status = 'PASS' if passed else 'FAIL'
    print(f"  [{status}] Text length — median: {median_len:.0f} chars, std/median ratio: {std_ratio:.2f}")
    if not passed:
        print(f"         Expected median >= {min_median}, std/median <= {max_std_ratio}")
    return passed

print("Validation functions defined.")
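
None of the checks above looks at the labels themselves. One cheap signal that needs no model is a label conflict: the same text appearing with different labels. The sketch below follows the style of the other checks. The function name `check_label_conflicts` and its `max_conflicts` parameter are assumptions for illustration, not part of the lab.

```python
import pandas as pd

def check_label_conflicts(texts, labels, max_conflicts=0):
    """Check for identical texts that carry more than one distinct label."""
    df = pd.DataFrame({"text": texts, "label": labels})
    # Number of distinct labels attached to each unique text
    labels_per_text = df.groupby("text")["label"].nunique()
    n_conflicts = int((labels_per_text > 1).sum())
    passed = n_conflicts <= max_conflicts
    status = "PASS" if passed else "FAIL"
    print(f"  [{status}] Label conflicts: {n_conflicts} texts with mixed labels "
          f"(max allowed: {max_conflicts})")
    return passed

# Toy example: the first two rows are the same text with different labels
conflict_result = check_label_conflicts(["a b c", "a b c", "d e f"], [0, 1, 0])
clean_result = check_label_conflicts(["a b c", "a b c", "d e f"], [0, 0, 1])
```

This only catches mislabels that happen to sit on a duplicated text, so it complements the model-based check and does not replace it.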

In [None]:
def validate_dataset(texts, labels, dataset_name="dataset"):
    """Run all validation checks on a dataset. Returns True if all pass."""
    print(f"\n{'='*60}")
    print(f"  DATA VALIDATION: {dataset_name}")
    print(f"  Samples: {len(texts)}")
    print(f"{'='*60}")
    
    results = []
    results.append(check_class_balance(labels))
    results.append(check_duplicates(texts))
    results.append(check_empty_texts(texts))
    results.append(check_text_length_distribution(texts))
    
    all_passed = all(results)
    print(f"\n  {'ALL CHECKS PASSED' if all_passed else 'VALIDATION FAILED'} — "
          f"{sum(results)}/{len(results)} checks passed")
    print(f"{'='*60}")
    
    return all_passed

# Run on the messy data (should FAIL)
print("Running validation on the MESSY data:")
messy_passed = validate_dataset(messy_texts, messy_labels, "Messy Training Data")

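# Run on the cleaned data. Cleaning did not rebalance the classes, so expect
# the class-balance check to FAIL at the default 3:1 limit; the duplicate and
# short-text checks should now pass.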
print("\nRunning validation on the CLEANED data:")
clean_passed = validate_dataset(
    df_clean['text'].tolist(),
    df_clean['label'].values.tolist(),
    "Cleaned Training Data"
)

# Run on the original pristine data (should PASS)
print("\nRunning validation on the PRISTINE data:")
pristine_passed = validate_dataset(
    train_data.data,
    train_data.target.tolist(),
    "Pristine Training Data"
)

In [None]:
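# Show how this would work as a pre-training gate
print("Example: Pre-training gate\n")

def train_with_validation(texts, labels, test_texts, test_labels, class_names,
                          max_ratio=3.0):
    """Only train if data passes validation.

    max_ratio is forwarded to check_class_balance so a caller that handles
    imbalance at training time can relax the balance check explicitly.
    """
    print("  Pre-training checks:")
    # Build the list first so every check runs and prints, even after a failure
    results = [
        check_class_balance(labels, max_ratio=max_ratio),
        check_duplicates(texts),
        check_empty_texts(texts),
        check_text_length_distribution(texts),
    ]

    if not all(results):
        print("\n  TRAINING BLOCKED: fix data quality issues first.")
        return None

    print("\n  Validation passed. Training model...")
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
        ('clf', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42))
    ])
    pipe.fit(texts, labels)
    preds = pipe.predict(test_texts)
    acc = accuracy_score(test_labels, preds)
    print(f"  Model trained. Test accuracy: {acc:.3f}")
    return pipe

# Try with messy data: should be blocked
print("Attempt 1: Train on messy data")
result = train_with_validation(messy_texts, messy_labels,
                                test_data.data, test_data.target,
                                train_data.target_names)

# Try with clean data. It is still imbalanced (see Section 4), so the default
# 3:1 limit would block it too. Because class_weight='balanced' handles the
# imbalance, we relax the limit here. 20.0 is an assumed threshold; adjust it
# to your data. Training proceeds only if the remaining checks pass.
print("\n" + "-"*60)
print("\nAttempt 2: Train on cleaned data")
result = train_with_validation(df_clean['text'].tolist(),
                                df_clean['label'].values.tolist(),
                                test_data.data, test_data.target,
                                train_data.target_names,
                                max_ratio=20.0)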

**These checks should run before every training job.** In a real ML pipeline, you'd wire
these into your CI/CD or orchestration system so that bad data never silently reaches the model.
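
Most CI and orchestration systems decide whether a step passed by looking at the process exit code. A gate script can therefore be as simple as "return 0 if the data passes, 1 otherwise". The sketch below shows that idea with a single duplicate check. The function names `duplicate_pct` and `gate` and the sample data are hypothetical, and a real gate would run all of the checks from this section.

```python
import pandas as pd

def duplicate_pct(texts):
    """Percentage of texts that are exact repeats of an earlier text."""
    return float(pd.Series(texts).duplicated().mean() * 100)

def gate(texts, max_dup_pct=1.0):
    """Return a process exit code: 0 if the data passes, 1 otherwise."""
    pct = duplicate_pct(texts)
    passed = pct <= max_dup_pct
    status = "PASS" if passed else "FAIL"
    print(f"[{status}] duplicates: {pct:.1f}% (max allowed: {max_dup_pct:.1f}%)")
    return 0 if passed else 1

# Toy data with one repeated text out of three
sample = ["a b c", "a b c", "d e f"]
exit_code = gate(sample)

# In a standalone script you would end with:
#     import sys
#     sys.exit(exit_code)
# so that a non-zero code stops the pipeline before the training step runs.
```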

---

## Summary

You just experienced the most common reason ML models fail in production: bad data.

| Concept | What You Learned |
|---------|------------------|
| **Data corruption** | Duplicates, mislabels, garbage, and imbalance silently destroy model performance |
| **EDA** | Always explore your data before training — plots and stats catch what code doesn't |
| **Cleaning pipeline** | Step-by-step deduplication, filtering, and label validation |
| **class_weight='balanced'** | A simple way to handle class imbalance at training time |
| **Validation gates** | Automated checks that block training when data quality is too low |
| **Data quality > model complexity** | Cleaning the data recovered accuracy without a fancier algorithm |

### What's Next?

In **ML Lab 03**, you'll take a trained model and deploy it as a live API — complete with
health checks, Prometheus metrics, and a Grafana dashboard.