# ML Lab 02: Your Data Is a Mess (Completed Solution)

This is the completed version of the lab notebook with all cells executed and outputs visible.
Use this as a reference if you get stuck on any section.

---

## Section 1: The Pristine Baseline

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the same dataset from Lab 1
categories = ['rec.sport.baseball', 'sci.space']

train_data = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)

test_data = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42
)

print(f"Training samples: {len(train_data.data)}")
print(f"Test samples:     {len(test_data.data)}")
print(f"Classes:          {train_data.target_names}")

Training samples: 1197
Test samples:     796
Classes:          ['rec.sport.baseball', 'sci.space']


In [2]:
# Train the pristine baseline model
pristine_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

pristine_pipeline.fit(train_data.data, train_data.target)
pristine_preds = pristine_pipeline.predict(test_data.data)
pristine_accuracy = accuracy_score(test_data.target, pristine_preds)

print(f"Pristine baseline accuracy: {pristine_accuracy:.3f}")
print(f"\nThis is the parking lot. Real data is the highway.")
print(f"\nFull classification report:")
print(classification_report(test_data.target, pristine_preds,
                            target_names=train_data.target_names))

Pristine baseline accuracy: 0.966

This is the parking lot. Real data is the highway.

Full classification report:
                    precision    recall  f1-score   support

rec.sport.baseball       0.97      0.96      0.97       397
         sci.space       0.96      0.97      0.97       399

          accuracy                           0.97       796
         macro avg       0.97      0.97      0.97       796
      weighted avg       0.97      0.97      0.97       796



**Bookmark this number.** We'll see how far it drops when the data gets messy, and how much of it we can recover.

---

## Section 2: Meet Your Messy Data

In [3]:
np.random.seed(42)

# Start with a clean copy
messy_texts = list(train_data.data)
messy_labels = list(train_data.target)

print(f"Starting with {len(messy_texts)} clean samples")
print(f"Class distribution: {sum(np.array(messy_labels) == 0)} baseball, {sum(np.array(messy_labels) == 1)} space")

# --- Corruption 1: Inject duplicates (~10%) ---
n_dups = int(len(messy_texts) * 0.10)
dup_idx = np.random.choice(len(messy_texts), n_dups, replace=False)
for i in dup_idx:
    messy_texts.append(messy_texts[i])
    messy_labels.append(messy_labels[i])
print(f"\n[1] Injected {n_dups} duplicates -> {len(messy_texts)} samples")

# --- Corruption 2: Flip labels (~5%) ---
n_flip = int(len(messy_texts) * 0.05)
flip_idx = np.random.choice(len(messy_texts), n_flip, replace=False)
for i in flip_idx:
    messy_labels[i] = 1 - messy_labels[i]
print(f"[2] Flipped {n_flip} labels")

# --- Corruption 3: Inject garbage strings (~3%) ---
garbage_strings = [
    "",
    " ",
    "asdf",
    "xxx",
    "!!!",
    "null",
    "N/A",
    "test",
    ".",
    "---",
]
n_garbage = int(len(messy_texts) * 0.03)
garbage_idx = np.random.choice(len(messy_texts), n_garbage, replace=False)
for i in garbage_idx:
    messy_texts[i] = np.random.choice(garbage_strings)
print(f"[3] Replaced {n_garbage} samples with garbage strings")

# --- Corruption 4: Create class imbalance (90/10) ---
messy_arr = np.array(messy_labels)
class0_idx = np.where(messy_arr == 0)[0]
class1_idx = np.where(messy_arr == 1)[0]

target_class1 = int(len(class0_idx) * 0.11)
if len(class1_idx) > target_class1:
    keep_class1 = np.random.choice(class1_idx, target_class1, replace=False)
    keep_idx = np.concatenate([class0_idx, keep_class1])
    np.random.shuffle(keep_idx)
    messy_texts = [messy_texts[i] for i in keep_idx]
    messy_labels = [messy_labels[i] for i in keep_idx]

messy_arr = np.array(messy_labels)
print(f"[4] Created class imbalance -> {len(messy_texts)} samples")
print(f"    Class 0 (baseball): {sum(messy_arr == 0)} ({sum(messy_arr == 0)/len(messy_arr)*100:.1f}%)")
print(f"    Class 1 (space):    {sum(messy_arr == 1)} ({sum(messy_arr == 1)/len(messy_arr)*100:.1f}%)")

Starting with 1197 clean samples
Class distribution: 597 baseball, 600 space

[1] Injected 119 duplicates -> 1316 samples
[2] Flipped 65 labels
[3] Replaced 39 samples with garbage strings
[4] Created class imbalance -> 790 samples
    Class 0 (baseball): 712 (90.1%)
    Class 1 (space):    78 (9.9%)


In [4]:
# Train the SAME model on the messy data
messy_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

messy_pipeline.fit(messy_texts, messy_labels)
messy_preds = messy_pipeline.predict(test_data.data)
messy_accuracy = accuracy_score(test_data.target, messy_preds)

print(f"Pristine accuracy: {pristine_accuracy:.3f}")
print(f"Messy accuracy:    {messy_accuracy:.3f}")
print(f"Drop:              {pristine_accuracy - messy_accuracy:.3f}")
print(f"\nFull classification report on messy model:")
print(classification_report(test_data.target, messy_preds,
                            target_names=train_data.target_names))

Pristine accuracy: 0.966
Messy accuracy:    0.752
Drop:              0.214

Full classification report on messy model:
                    precision    recall  f1-score   support

rec.sport.baseball       0.64      0.96      0.77       397
         sci.space       0.94      0.55      0.69       399

          accuracy                           0.75       796
         macro avg       0.79      0.75      0.73       796
      weighted avg       0.79      0.75      0.73       796



In [5]:
# Side-by-side confusion matrices: pristine vs messy
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

cm_pristine = confusion_matrix(test_data.target, pristine_preds)
sns.heatmap(cm_pristine, annot=True, fmt='d', cmap='Greens',
            xticklabels=train_data.target_names,
            yticklabels=train_data.target_names,
            ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title(f'Pristine Model (acc={pristine_accuracy:.3f})')

cm_messy = confusion_matrix(test_data.target, messy_preds)
sns.heatmap(cm_messy, annot=True, fmt='d', cmap='Reds',
            xticklabels=train_data.target_names,
            yticklabels=train_data.target_names,
            ax=axes[1])
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_title(f'Messy Model (acc={messy_accuracy:.3f})')

plt.tight_layout()
plt.show()

print("Same algorithm, same test set, different data quality. Data quality > model complexity.")

Same algorithm, same test set, different data quality. Data quality > model complexity.


**The model didn't get dumber — the data got worse.** Same algorithm, same hyperparameters.
The only thing that changed was the quality of the training data.
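
One way to see why the imbalance alone is so damaging: a model that always predicts the majority class already looks good on the messy training set, yet is no better than a coin flip on the balanced test set. The sketch below uses only the class counts printed above; `majority_baseline_accuracy` is a hypothetical helper, not part of the lab code.

```python
def majority_baseline_accuracy(class_counts, majority_label):
    """Accuracy of a classifier that always predicts `majority_label`."""
    total = sum(class_counts.values())
    return class_counts[majority_label] / total

# Class counts taken from the notebook outputs above
messy_train_counts = {"rec.sport.baseball": 712, "sci.space": 78}
test_counts = {"rec.sport.baseball": 397, "sci.space": 399}

# The majority class is decided on the training data
majority = max(messy_train_counts, key=messy_train_counts.get)

train_acc = majority_baseline_accuracy(messy_train_counts, majority)
test_acc = majority_baseline_accuracy(test_counts, majority)

print(f"Always predict '{majority}':")
print(f"  accuracy on messy training data: {train_acc:.3f}")
print(f"  accuracy on balanced test data:  {test_acc:.3f}")
```

The two accuracies come out at 0.901 and 0.499. The messy model's 0.55 recall on `sci.space` is consistent with the same pull toward the majority class, just less extreme.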

---

## Section 3: Explore Before You Train (EDA)

In [6]:
# Put messy data into a DataFrame for easier analysis
df = pd.DataFrame({
    'text': messy_texts,
    'label': messy_labels
})
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len().fillna(0).astype(int)

print(f"Dataset shape: {df.shape}")
print(f"\nBasic stats:")
print(df.describe())

Dataset shape: (790, 4)

Basic stats:
            label   text_length    word_count
count  790.000000    790.000000   790.000000
mean     0.098734   2142.567089   371.835443
std      0.298502   2583.441895   456.108752
min      0.000000      0.000000     0.000000
25%      0.000000    674.000000   112.250000
50%      0.000000   1327.000000   223.500000
75%      0.000000   2618.750000   457.750000
max      1.000000  25strunc      4537.000000


In [7]:
# Check 1: Class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

class_counts = df['label'].value_counts().sort_index()
class_names = [train_data.target_names[i] for i in class_counts.index]
colors = ['#2196F3', '#FF9800']

axes[0].bar(class_names, class_counts.values, color=colors)
axes[0].set_title('Class Distribution (Messy Data)')
axes[0].set_ylabel('Count')
for i, (name, count) in enumerate(zip(class_names, class_counts.values)):
    axes[0].text(i, count + 5, f'{count}\n({count/len(df)*100:.1f}%)',
                ha='center', fontweight='bold')

# Compare with what clean data looks like
clean_counts = pd.Series(train_data.target).value_counts().sort_index()
clean_names = [train_data.target_names[i] for i in clean_counts.index]
axes[1].bar(clean_names, clean_counts.values, color=colors, alpha=0.7)
axes[1].set_title('Class Distribution (Clean Data)')
axes[1].set_ylabel('Count')
for i, (name, count) in enumerate(zip(clean_names, clean_counts.values)):
    axes[1].text(i, count + 5, f'{count}\n({count/len(train_data.target)*100:.1f}%)',
                ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

ratio = class_counts.max() / class_counts.min()
print(f"Imbalance ratio: {ratio:.1f}:1")
print(f"A ratio above 3:1 is a red flag. Above 10:1 is a problem.")

Imbalance ratio: 9.1:1
A ratio above 3:1 is a red flag. Above 10:1 is a problem.


In [8]:
# Check 2: Text length distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram of text lengths
axes[0].hist(df['text_length'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_title('Text Length Distribution')
axes[0].set_xlabel('Character count')
axes[0].set_ylabel('Frequency')
axes[0].axvline(x=10, color='red', linestyle='--', label='Min threshold (10 chars)')
axes[0].legend()

# Zoom in on the short texts
short_texts = df[df['text_length'] < 50]
axes[1].hist(short_texts['text_length'], bins=20, color='tomato', edgecolor='black', alpha=0.7)
axes[1].set_title(f'Short Texts (< 50 chars) — {len(short_texts)} samples')
axes[1].set_xlabel('Character count')
axes[1].set_ylabel('Frequency')
axes[1].axvline(x=10, color='red', linestyle='--', label='Min threshold (10 chars)')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"Texts shorter than 10 chars: {len(df[df['text_length'] < 10])}")
print(f"Empty texts: {len(df[df['text_length'] == 0])}")
print(f"\nSample short texts:")
for _, row in df[df['text_length'] < 10].head(10).iterrows():
    print(f"  label={row['label']} len={row['text_length']}: '{row['text']}'")

Texts shorter than 10 chars: 27
Empty texts: 3

Sample short texts:
  label=0 len=0: ''
  label=0 len=1: ' '
  label=0 len=4: 'asdf'
  label=0 len=3: 'xxx'
  label=1 len=3: '!!!'
  label=0 len=4: 'null'
  label=0 len=3: 'N/A'
  label=0 len=4: 'test'
  label=1 len=1: '.'
  label=0 len=3: '---'


In [9]:
# Check 3: Duplicates
n_exact_dups = df.duplicated(subset='text').sum()
print(f"Exact duplicates: {n_exact_dups} ({n_exact_dups/len(df)*100:.1f}%)")

# Show some duplicated texts
dup_mask = df.duplicated(subset='text', keep=False)
dup_df = df[dup_mask].sort_values('text')
if len(dup_df) > 0:
    print(f"\nSample duplicated entries (showing first 5 groups):")
    seen = set()
    count = 0
    for _, row in dup_df.iterrows():
        text_preview = row['text'][:80].replace('\n', ' ')
        if text_preview not in seen:
            seen.add(text_preview)
            n_copies = len(df[df['text'] == row['text']])
            print(f"  [{n_copies} copies] label={row['label']}: \"{text_preview}...\"")
            count += 1
            if count >= 5:
                break

Exact duplicates: 87 (11.0%)

Sample duplicated entries (showing first 5 groups):
  [2 copies] label=0: "From: dstrstrn@matt.ksu.ksu.edu (Dick Strassman) Subject: Re: Jackson to underg..."
  [2 copies] label=0: "From: mhall@magna.com.au (Matthew Hall) Subject: Re: Young Guns Organization: ..."
  [2 copies] label=1: "From: prb@access.digex.com (Pat) Subject: Re: Keeping Stromboli Lit Organizati..."
  [2 copies] label=0: "From: cstrstrn@matt.ksu.ksu.edu Subject: Re: Braves Update Organization: ..."
  [2 copies] label=0: "From: tedwards@eng.umd.edu (langstrk) Subject: Re: Who is the best pitcher? Or..."


In [10]:
# Check 4: Spot-check labels by sampling random examples
print("Random samples from each class — do the labels look right?\n")

for label in [0, 1]:
    class_name = train_data.target_names[label]
    class_df = df[(df['label'] == label) & (df['text_length'] > 50)]
    samples = class_df.sample(n=min(3, len(class_df)), random_state=42)
    
    print(f"=== {class_name} (label={label}) ===")
    for _, row in samples.iterrows():
        preview = row['text'][:150].replace('\n', ' ')
        print(f"  \"{preview}...\"")
    print()

Random samples from each class — do the labels look right?

=== rec.sport.baseball (label=0) ===
  "From: dstrstrn@matt.ksu.ksu.edu (Dick Strassman) Subject: Re: Jackson to undergo elbow surgery Organization: Kansas State University Lines: 21 ..."
  "From: mhall@magna.com.au (Matthew Hall) Subject: Re: Young Guns Organization: Magna International Lines: 38 In article <1993Apr16.153430@nmt.e..."
  "From: cstrstrn@matt.ksu.ksu.edu Subject: Re: Braves Update Organization: Kansas State University Lines: 22 In article <1993Apr17.041500.19764@ne..."

=== sci.space (label=1) ===
  "From: prb@access.digex.com (Pat) Subject: Re: Keeping Stromboli Lit Organization: Express Access Online Communications, Greenbelt MD Lines: 18 ..."
  "From: henry@zoo.toronto.edu (Henry Spencer) Subject: Re: Orion (was Re: DC-X) Organization: U of Toronto Zoology Lines: 23 In article <1993Apr6..."
  "From: pgf@srl03.cacs.usl.edu (Phil G. Fraering) Subject: Re: Big Dumb Booster Update Organization: University of Sou

**EDA Summary:**
- Heavy class imbalance: 9.1:1 in favor of baseball (see the bar chart)
- Duplicate entries inflating the dataset: 87 exact duplicates (11.0%)
- Garbage/empty strings that the model will try to learn from: 27 texts under 10 characters
- Some labels might be wrong (hard to tell without domain knowledge, so we'll use a trick in the next section)

**Rule:** Never train on data you haven't explored. 10 minutes of EDA saves hours of debugging model performance.
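
`df.duplicated(subset='text')` only catches byte-for-byte copies. Real datasets often also contain near-duplicates that differ only in case or whitespace. Here is a minimal sketch of one way to surface those, assuming that lowercasing and collapsing whitespace is an acceptable definition of "same text" for your data. The helper names `normalize` and `find_near_duplicates` are illustrative and not part of the lab.

```python
import hashlib
import re
from collections import defaultdict

def normalize(text):
    """Lowercase and collapse every whitespace run to a single space."""
    return re.sub(r"\s+", " ", text).strip().lower()

def find_near_duplicates(texts):
    """Group indices of texts that are identical after normalization.

    Returns a list of index groups, each with at least two members.
    """
    groups = defaultdict(list)
    for idx, text in enumerate(texts):
        # Hash the normalized text so long documents are cheap to compare
        key = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        groups[key].append(idx)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

sample = [
    "The Braves won again last night.",
    "the braves  won again\nlast night.",  # same text, different case/whitespace
    "The shuttle launch was delayed.",
]
print(find_near_duplicates(sample))  # [[0, 1]]
```

Whether such rows should be dropped is a judgment call; the point is to look at them before training.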

---

## Section 4: Clean It Up

In [11]:
# Start with the messy DataFrame
df_clean = df.copy()
cleaning_log = [{'step': 'Original messy data', 'samples': len(df_clean)}]
print(f"Starting: {len(df_clean)} samples")

# --- Step 1: Remove exact duplicates ---
n_before = len(df_clean)
df_clean = df_clean.drop_duplicates(subset='text', keep='first')
n_removed = n_before - len(df_clean)
cleaning_log.append({'step': 'Remove duplicates', 'samples': len(df_clean)})
print(f"\n[Step 1] Removed {n_removed} exact duplicates -> {len(df_clean)} samples")

# --- Step 2: Remove empty/very short texts ---
n_before = len(df_clean)
df_clean = df_clean[df_clean['text_length'] >= 10]
n_removed = n_before - len(df_clean)
cleaning_log.append({'step': 'Remove short texts', 'samples': len(df_clean)})
print(f"[Step 2] Removed {n_removed} short/empty texts -> {len(df_clean)} samples")

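# --- Step 3: Flag suspicious labels ---
# Fit a quick model on the current data, then score the same rows.
# Rows where it confidently contradicts the given label are candidates for mislabels.
quick_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])
quick_pipe.fit(df_clean['text'].tolist(), df_clean['label'].values)
proba = quick_pipe.predict_proba(df_clean['text'].tolist())
predicted = quick_pipe.predict(df_clean['text'].tolist())

confidence = np.max(proba, axis=1)                # probability of the predicted class
disagree = predicted != df_clean['label'].values  # model contradicts the given label
suspicious = disagree & (confidence > 0.9)        # ...and does so with > 90% confidence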

n_suspicious = suspicious.sum()
print(f"\n[Step 3] Found {n_suspicious} suspicious labels (model strongly disagrees)")
print(f"  These are likely mislabeled. Removing them.")

n_before = len(df_clean)
df_clean = df_clean[~suspicious]
n_removed = n_before - len(df_clean)
cleaning_log.append({'step': 'Remove suspicious labels', 'samples': len(df_clean)})
print(f"  Removed {n_removed} suspicious samples -> {len(df_clean)} samples")

Starting: 790 samples

[Step 1] Removed 87 exact duplicates -> 703 samples
[Step 2] Removed 27 short/empty texts -> 676 samples

[Step 3] Found 18 suspicious labels (model strongly disagrees)
  These are likely mislabeled. Removing them.
  Removed 18 suspicious samples -> 658 samples
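
The flagging rule in Step 3 is easy to misread, so here it is in isolation on a few hand-made probability rows. A sample is flagged only when the predicted class differs from the given label and the predicted class has a probability above the threshold. The numbers below are invented for illustration, and `flag_suspicious` is a hypothetical helper that mirrors the NumPy expressions in the cell above. One caveat: in the cell above the model scores the same rows it was trained on, so it tends to agree with the labels it saw, and the count of flagged rows is probably a lower bound on the real number of mislabels.

```python
def flag_suspicious(labels, proba, threshold=0.9):
    """Return indices where the model confidently disagrees with the label.

    labels: list of int class labels (0 or 1)
    proba:  list of [p_class0, p_class1] rows, one per sample
    """
    flagged = []
    for i, (label, row) in enumerate(zip(labels, proba)):
        predicted = row.index(max(row))  # class with the highest probability
        confidence = max(row)
        if predicted != label and confidence > threshold:
            flagged.append(i)
    return flagged

labels = [0, 0, 1, 1]
proba = [
    [0.97, 0.03],  # agrees with label 0 -> keep
    [0.04, 0.96],  # confidently predicts 1, label says 0 -> flag
    [0.60, 0.40],  # disagrees with label 1, but not confidently -> keep
    [0.08, 0.92],  # agrees with label 1 -> keep
]
print(flag_suspicious(labels, proba))  # [1]
```

Lowering the threshold flags more rows (here, row 2 as well), at the cost of removing more correctly labeled but hard examples.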


In [12]:
# Show the cleaning pipeline summary
print("Cleaning Pipeline Summary:")
print("=" * 50)
for i, entry in enumerate(cleaning_log):
    if i == 0:
        print(f"  {entry['step']:30s}  {entry['samples']:5d} samples")
    else:
        removed = cleaning_log[i-1]['samples'] - entry['samples']
        print(f"  {entry['step']:30s}  {entry['samples']:5d} samples  (-{removed})")

total_removed = cleaning_log[0]['samples'] - cleaning_log[-1]['samples']
print(f"{'':30s}  {'':5s}")
print(f"  {'Total removed':30s}  {total_removed:5d} samples ({total_removed/cleaning_log[0]['samples']*100:.1f}%)")

# Show remaining class distribution
print(f"\nRemaining class distribution:")
for label in [0, 1]:
    count = (df_clean['label'] == label).sum()
    print(f"  {train_data.target_names[label]}: {count} ({count/len(df_clean)*100:.1f}%)")

Cleaning Pipeline Summary:
  Original messy data                 790 samples
  Remove duplicates                   703 samples  (-87)
  Remove short texts                  676 samples  (-27)
  Remove suspicious labels            658 samples  (-18)
                                       
  Total removed                       132 samples (16.7%)

Remaining class distribution:
  rec.sport.baseball: 594 (90.3%)
  sci.space: 64 (9.7%)


**Note:** We still have class imbalance after cleaning. We'll handle that at training time
using `class_weight='balanced'` in LogisticRegression, which automatically adjusts the loss
function to pay more attention to the minority class.
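
To make "pay more attention" concrete: scikit-learn documents the `'balanced'` mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python for the cleaned class counts shown above; `balanced_class_weights` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights computed as n_samples / (n_classes * count_per_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Class counts of the cleaned data from the output above
labels = [0] * 594 + [1] * 64
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

This gives a weight of about 0.554 for baseball and 5.141 for space, so each `sci.space` sample counts roughly 9.3 times as much in the loss as a `rec.sport.baseball` sample, which is the imbalance ratio itself.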

---

## Section 5: Retrain and Compare

In [13]:
# Train on cleaned data WITH class_weight='balanced' to handle remaining imbalance
clean_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42))
])

clean_pipeline.fit(df_clean['text'].tolist(), df_clean['label'].values)
clean_preds = clean_pipeline.predict(test_data.data)
clean_accuracy = accuracy_score(test_data.target, clean_preds)

print(f"Pristine baseline: {pristine_accuracy:.3f}")
print(f"Messy model:       {messy_accuracy:.3f}")
print(f"Cleaned model:     {clean_accuracy:.3f}")
print(f"\nRecovered {clean_accuracy - messy_accuracy:.3f} accuracy points by cleaning the data.")

Pristine baseline: 0.966
Messy model:       0.752
Cleaned model:     0.924

Recovered 0.172 accuracy points by cleaning the data.


In [14]:
# Side-by-side comparison table
from sklearn.metrics import precision_recall_fscore_support

def get_metrics(y_true, y_pred, class_names):
    """Get per-class metrics plus an 'Overall' row (unweighted mean across classes)."""
    p, r, f, s = precision_recall_fscore_support(y_true, y_pred, average=None)
    acc = accuracy_score(y_true, y_pred)
    rows = []
    for i, name in enumerate(class_names):
        rows.append({
            'Class': name,
            'Precision': f'{p[i]:.3f}',
            'Recall': f'{r[i]:.3f}',
            'F1': f'{f[i]:.3f}',
            'Support': int(s[i])
        })
    rows.append({
        'Class': 'Overall',
        'Precision': f'{np.mean(p):.3f}',
        'Recall': f'{np.mean(r):.3f}',
        'F1': f'{np.mean(f):.3f}',
        'Support': int(np.sum(s))
    })
    return pd.DataFrame(rows)

print("=== MESSY MODEL ===")
messy_metrics = get_metrics(test_data.target, messy_preds, train_data.target_names)
print(messy_metrics.to_string(index=False))

print("\n=== CLEANED MODEL ===")
clean_metrics = get_metrics(test_data.target, clean_preds, train_data.target_names)
print(clean_metrics.to_string(index=False))

print("\n=== PRISTINE BASELINE ===")
pristine_metrics = get_metrics(test_data.target, pristine_preds, train_data.target_names)
print(pristine_metrics.to_string(index=False))

=== MESSY MODEL ===
              Class Precision Recall    F1  Support
rec.sport.baseball     0.640  0.960 0.768      397
         sci.space     0.940  0.550 0.694      399
           Overall     0.790  0.755 0.731      796

=== CLEANED MODEL ===
              Class Precision Recall    F1  Support
rec.sport.baseball     0.936  0.909 0.922      397
         sci.space     0.912  0.939 0.925      399
           Overall     0.924  0.924 0.924      796

=== PRISTINE BASELINE ===
              Class Precision Recall    F1  Support
rec.sport.baseball     0.972  0.962 0.967      397
         sci.space     0.963  0.972 0.967      399
           Overall     0.967  0.967 0.967      796


In [15]:
# Side-by-side confusion matrices: messy vs cleaned vs pristine
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, preds, title, cmap in [
    (axes[0], messy_preds, f'Messy (acc={messy_accuracy:.3f})', 'Reds'),
    (axes[1], clean_preds, f'Cleaned (acc={clean_accuracy:.3f})', 'Blues'),
    (axes[2], pristine_preds, f'Pristine (acc={pristine_accuracy:.3f})', 'Greens'),
]:
    cm = confusion_matrix(test_data.target, preds)
    sns.heatmap(cm, annot=True, fmt='d', cmap=cmap,
                xticklabels=train_data.target_names,
                yticklabels=train_data.target_names,
                ax=ax)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_title(title)

plt.tight_layout()
plt.show()

print("Data quality > model complexity. Cleaning the data recovered most of the performance.")

Data quality > model complexity. Cleaning the data recovered most of the performance.


**Key takeaway:** We didn't change the algorithm or run a hyperparameter search. We cleaned
the data and set `class_weight='balanced'` for the remaining imbalance. That recovered 0.172
of the 0.214 lost accuracy points, about 80% of the drop.

---

## Section 6: Build a Data Validation Pipeline

In [16]:
def check_class_balance(labels, max_ratio=3.0):
    """Check if class distribution is within acceptable bounds."""
    counts = pd.Series(labels).value_counts()
    ratio = counts.max() / counts.min()
    passed = ratio <= max_ratio
    status = 'PASS' if passed else 'FAIL'
    print(f"  [{status}] Class balance — ratio: {ratio:.1f}:1 (max allowed: {max_ratio:.1f}:1)")
    if not passed:
        print(f"         Distribution: {dict(counts)}")
    return passed

def check_duplicates(texts, max_pct=1.0):
    """Check for exact duplicate texts."""
    n_dups = pd.Series(texts).duplicated().sum()
    pct = n_dups / len(texts) * 100
    passed = pct <= max_pct
    status = 'PASS' if passed else 'FAIL'
    print(f"  [{status}] Duplicates — {n_dups} found ({pct:.1f}%, max allowed: {max_pct:.1f}%)")
    return passed

def check_empty_texts(texts, max_pct=0.5):
    """Check for empty or very short texts (< 10 chars)."""
    lengths = pd.Series(texts).str.len()
    n_short = (lengths < 10).sum()
    pct = n_short / len(texts) * 100
    passed = pct <= max_pct
    status = 'PASS' if passed else 'FAIL'
    print(f"  [{status}] Short/empty texts — {n_short} found ({pct:.1f}%, max allowed: {max_pct:.1f}%)")
    return passed

def check_text_length_distribution(texts, min_median=100, max_std_ratio=5.0):
    """Check that text lengths are reasonable."""
    lengths = pd.Series(texts).str.len()
    median_len = lengths.median()
    std_ratio = lengths.std() / median_len if median_len > 0 else float('inf')
    passed = median_len >= min_median and std_ratio <= max_std_ratio
    status = 'PASS' if passed else 'FAIL'
    print(f"  [{status}] Text length — median: {median_len:.0f} chars, std/median ratio: {std_ratio:.2f}")
    if not passed:
        print(f"         Expected median >= {min_median}, std/median <= {max_std_ratio}")
    return passed

print("Validation functions defined.")

Validation functions defined.


In [17]:
def validate_dataset(texts, labels, dataset_name="dataset"):
    """Run all validation checks on a dataset. Returns True if all pass."""
    print(f"\n{'='*60}")
    print(f"  DATA VALIDATION: {dataset_name}")
    print(f"  Samples: {len(texts)}")
    print(f"{'='*60}")
    
    results = []
    results.append(check_class_balance(labels))
    results.append(check_duplicates(texts))
    results.append(check_empty_texts(texts))
    results.append(check_text_length_distribution(texts))
    
    all_passed = all(results)
    print(f"\n  {'ALL CHECKS PASSED' if all_passed else 'VALIDATION FAILED'} — "
          f"{sum(results)}/{len(results)} checks passed")
    print(f"{'='*60}")
    
    return all_passed

# Run on the messy data (should FAIL)
print("Running validation on the MESSY data:")
messy_passed = validate_dataset(messy_texts, messy_labels, "Messy Training Data")

# Run on the cleaned data (duplicate and short-text checks should PASS; class balance still FAILS)
print("\nRunning validation on the CLEANED data:")
clean_passed = validate_dataset(
    df_clean['text'].tolist(),
    df_clean['label'].values.tolist(),
    "Cleaned Training Data"
)

# Run on the original pristine data (should PASS)
print("\nRunning validation on the PRISTINE data:")
pristine_passed = validate_dataset(
    train_data.data,
    train_data.target.tolist(),
    "Pristine Training Data"
)

Running validation on the MESSY data:

  DATA VALIDATION: Messy Training Data
  Samples: 790
  [FAIL] Class balance — ratio: 9.1:1 (max allowed: 3.0:1)
         Distribution: {0: 712, 1: 78}
  [FAIL] Duplicates — 87 found (11.0%, max allowed: 1.0%)
  [FAIL] Short/empty texts — 27 found (3.4%, max allowed: 0.5%)
  [PASS] Text length — median: 1327 chars, std/median ratio: 1.95

  VALIDATION FAILED — 1/4 checks passed

Running validation on the CLEANED data:

  DATA VALIDATION: Cleaned Training Data
  Samples: 658
  [FAIL] Class balance — ratio: 9.3:1 (max allowed: 3.0:1)
         Distribution: {0: 594, 1: 64}
  [PASS] Duplicates — 0 found (0.0%, max allowed: 1.0%)
  [PASS] Short/empty texts — 0 found (0.0%, max allowed: 0.5%)
  [PASS] Text length — median: 1382 chars, std/median ratio: 1.82

  VALIDATION FAILED — 3/4 checks passed

Running validation on the PRISTINE data:

  DATA VALIDATION: Pristine Training Data
  Samples: 1197
  [PASS] Class balance — ratio: 1.0:1 (max allowed: 3.0:1)

In [18]:
# Show how this would work as a pre-training gate
print("Example: Pre-training gate\n")

def train_with_validation(texts, labels, test_texts, test_labels, class_names):
    """Only train if data passes validation."""
    passed = validate_dataset(texts, labels, "Pre-Training Check")
    
    if not passed:
        print("\n  TRAINING BLOCKED — fix data quality issues first.")
        return None
    
    print("\n  Validation passed. Training model...")
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
        ('clf', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42))
    ])
    pipe.fit(texts, labels)
    preds = pipe.predict(test_texts)
    acc = accuracy_score(test_labels, preds)
    print(f"  Model trained. Test accuracy: {acc:.3f}")
    return pipe

# Try with messy data — should be blocked
print("Attempt 1: Train on messy data")
result = train_with_validation(messy_texts, messy_labels,
                                test_data.data, test_data.target,
                                train_data.target_names)

# Try with cleaned data: still blocked, because the class-balance check fails
print("\n" + "-"*60)
print("\nAttempt 2: Train on cleaned data")
result = train_with_validation(df_clean['text'].tolist(),
                                df_clean['label'].values.tolist(),
                                test_data.data, test_data.target,
                                train_data.target_names)

Example: Pre-training gate

Attempt 1: Train on messy data

  DATA VALIDATION: Pre-Training Check
  Samples: 790
  [FAIL] Class balance — ratio: 9.1:1 (max allowed: 3.0:1)
         Distribution: {0: 712, 1: 78}
  [FAIL] Duplicates — 87 found (11.0%, max allowed: 1.0%)
  [FAIL] Short/empty texts — 27 found (3.4%, max allowed: 0.5%)
  [PASS] Text length — median: 1327 chars, std/median ratio: 1.95

  VALIDATION FAILED — 1/4 checks passed

  TRAINING BLOCKED — fix data quality issues first.

------------------------------------------------------------

Attempt 2: Train on cleaned data

  DATA VALIDATION: Pre-Training Check
  Samples: 658
  [FAIL] Class balance — ratio: 9.3:1 (max allowed: 3.0:1)
         Distribution: {0: 594, 1: 64}
  [PASS] Duplicates — 0 found (0.0%, max allowed: 1.0%)
  [PASS] Short/empty texts — 0 found (0.0%, max allowed: 0.5%)
  [PASS] Text length — median: 1382 chars, std/median ratio: 1.82

  VALIDATION FAILED — 3/4 checks passed

  TRAINING BLOCKED — fix data quality issues first.

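**These checks should run before every training job.** In a real ML pipeline, you'd wire
these into your CI/CD or orchestration system so that bad data never silently reaches the model.
Notice that the gate blocked the cleaned data too: the duplicates and short texts are gone, but
the 9.3:1 class ratio still exceeds `max_ratio=3.0`. Before training you would either rebalance
the data or relax that threshold when the imbalance is handled at training time.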
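
One possible refinement is to give each check a severity, so that an imbalance you already handle at training time (for example with `class_weight='balanced'`) produces a warning instead of blocking the run. The sketch below is only an illustration of that idea: `CheckResult`, `gate`, and the severity names are hypothetical and not part of the lab code.

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    passed: bool
    severity: str  # "block" stops training, "warn" only reports

def gate(results):
    """Return True if training may proceed; print every failed check."""
    may_train = True
    for r in results:
        if r.passed:
            continue
        print(f"  [{r.severity.upper()}] {r.name} failed")
        if r.severity == "block":
            may_train = False
    return may_train

# Pass/fail outcomes copied from the cleaned-data validation run above.
# The severities are an illustrative choice, not a recommendation.
cleaned_results = [
    CheckResult("class balance", passed=False, severity="warn"),
    CheckResult("duplicates", passed=True, severity="block"),
    CheckResult("short/empty texts", passed=True, severity="block"),
    CheckResult("text length", passed=True, severity="block"),
]

print("Training allowed:", gate(cleaned_results))
```

With these severities the cleaned data would train with a warning, while the messy data would still be blocked by its duplicate and short-text failures.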

---

## Summary

You just experienced one of the most common reasons ML models fail in production: bad data.

| Concept | What You Learned |
|---------|------------------|
| **Data corruption** | Duplicates, mislabels, garbage, and imbalance silently destroy model performance |
| **EDA** | Always explore your data before training — plots and stats catch what code doesn't |
| **Cleaning pipeline** | Step-by-step deduplication, filtering, and label validation |
| **class_weight='balanced'** | A simple way to handle class imbalance at training time |
| **Validation gates** | Automated checks that block training when data quality is too low |
| **Data quality > model complexity** | Cleaning the data recovered about 80% of the lost accuracy without changing the algorithm |

### What's Next?

In **ML Lab 03**, you'll take a trained model and deploy it as a live API — complete with
health checks, Prometheus metrics, and a Grafana dashboard.