# ML Lab 01: Train Your First Model (Completed Solution)

This is the completed version of the lab notebook with all cells executed and outputs visible.
Use this as a reference if you get stuck on any section.

---

## Section 1: Load and Explore the Data

In [1]:
from sklearn.datasets import fetch_20newsgroups

categories = ['rec.sport.baseball', 'sci.space']

train_data = fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=42
)

test_data = fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=42
)

print(f"Training samples: {len(train_data.data)}")
print(f"Test samples:     {len(test_data.data)}")
print(f"Classes:          {train_data.target_names}")
print(f"Label encoding:   0 = {train_data.target_names[0]}, 1 = {train_data.target_names[1]}")

Training samples: 1197
Test samples:     796
Classes:          ['rec.sport.baseball', 'sci.space']
Label encoding:   0 = rec.sport.baseball, 1 = sci.space


In [2]:
import numpy as np

for label in [0, 1]:
    idx = np.where(train_data.target == label)[0][0]
    print(f"=== Class: {train_data.target_names[label]} ===")
    print(train_data.data[idx][:500])
    print("...\n")

=== Class: rec.sport.baseball ===
From: dstrstrn@matt.ksu.ksu.edu (Dick Strassman)
Subject: Re: Jackson to undergo elbow surgery
...

=== Class: sci.space ===
From: prb@access.digex.com (Pat)
Subject: Re: Keeping Stromboli Lit
...


In [3]:
import pandas as pd

train_counts = pd.Series(train_data.target).value_counts().sort_index()
print("Training set class distribution:")
for idx, count in train_counts.items():
    print(f"  {train_data.target_names[idx]}: {count} samples ({count/len(train_data.target)*100:.1f}%)")

print(f"\nRoughly balanced? {'Yes' if abs(train_counts.iloc[0] - train_counts.iloc[1]) / len(train_data.target) < 0.1 else 'No'}")

Training set class distribution:
  rec.sport.baseball: 597 samples (49.9%)
  sci.space: 600 samples (50.1%)

Roughly balanced? Yes


**What you should see:** Two roughly balanced classes with ~600 samples each. The text is messy real-world data: email-style headers, signatures, and quoted replies. Raw text like this is typical of real ML projects.
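
To make "messy" concrete, here is a minimal sketch of stripping the header block and quoted reply lines from a post. The function name, the sample post, and the two parsing rules (headers end at the first blank line, quoted lines start with `>`) are assumptions made for illustration. In practice, `fetch_20newsgroups` also accepts `remove=('headers', 'footers', 'quotes')`, which is usually the better option.

```python
def strip_headers_and_quotes(post):
    """Drop the header block and quoted reply lines from a newsgroup-style post.

    Assumptions: headers end at the first blank line, and quoted lines
    start with '>'. Real posts are messier than this.
    """
    _, separator, body = post.partition("\n\n")
    if not separator:  # no blank line found: treat the whole post as body
        body = post
    kept = [line for line in body.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()


# Hypothetical post written for this example (not from the dataset).
sample = (
    "From: someone@example.com\n"
    "Subject: Re: launch window\n"
    "\n"
    "> When is the next launch?\n"
    "Next week, weather permitting.\n"
)
print(strip_headers_and_quotes(sample))
```

The lab keeps the raw text on purpose, so the model in the next section also sees headers and quotes.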

---

## Section 2: Your First Model

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

pipeline.fit(train_data.data, train_data.target)
predictions = pipeline.predict(test_data.data)

accuracy = accuracy_score(test_data.target, predictions)
print(f"Accuracy: {accuracy:.3f}")
print(f"\nThat's {accuracy*100:.1f}% correct! Looks great... right?")

Accuracy: 0.966

That's 96.6% correct! Looks great... right?
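
Before moving on, it may help to see what the `tfidf` step produces. The sketch below computes TF-IDF weights by hand in plain Python, using the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The three-sentence corpus is made up, and the sketch skips stop-word removal and the `max_features` cap used in the pipeline above.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, not taken from the 20 Newsgroups data.
docs = [
    "pitcher throws the ball",
    "rocket reaches orbit",
    "pitcher hits the ball out of the park",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))


def idf(word):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[word])) + 1


def tfidf_vector(tokens):
    """Return a word -> weight dict, scaled to unit (L2) length."""
    counts = Counter(tokens)
    raw = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {word: v / norm for word, v in raw.items()}


vec = tfidf_vector(tokenized[0])
for word, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{word:>8s}  {weight:.3f}")
```

Words that appear in fewer documents ("throws") end up with larger weights than words that appear in many ("the"), which is the signal the classifier works with.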


---
## Section 3: Why Accuracy Lies

In [5]:
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)

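# Indices of each class in the original (balanced) training set
baseball_idx = np.where(train_data.target == 0)[0]
space_idx = np.where(train_data.target == 1)[0]

# Build a 95/5 imbalanced subset: 475 baseball posts, 25 space posts
n_majority = 475
n_minority = 25
imb_idx = np.concatenate([
    baseball_idx[:n_majority],
    space_idx[:n_minority]
])
# Shuffle in place so the two classes are interleaved
np.random.shuffle(imb_idx)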

imb_texts = [train_data.data[i] for i in imb_idx]
imb_labels = train_data.target[imb_idx]

print(f"Imbalanced dataset: {sum(imb_labels == 0)} baseball, {sum(imb_labels == 1)} space")
print(f"Majority class is {sum(imb_labels == 0)/len(imb_labels)*100:.1f}% of the data")

Imbalanced dataset: 475 baseball, 25 space
Majority class is 95.0% of the data


In [6]:
imb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

imb_pipeline.fit(imb_texts, imb_labels)
imb_predictions = imb_pipeline.predict(test_data.data)

imb_accuracy = accuracy_score(test_data.target, imb_predictions)
print(f"Accuracy: {imb_accuracy:.3f}")
print(f"\nStill looks decent... but let's look deeper.")

Accuracy: 0.849

Still looks decent... but let's look deeper.


In [7]:
cm = confusion_matrix(test_data.target, imb_predictions)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=train_data.target_names,
            yticklabels=train_data.target_names,
            ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix (Imbalanced Training)')

cm_balanced = confusion_matrix(test_data.target, predictions)
sns.heatmap(cm_balanced, annot=True, fmt='d', cmap='Greens',
            xticklabels=train_data.target_names,
            yticklabels=train_data.target_names,
            ax=axes[1])
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_title('Confusion Matrix (Balanced Training)')

plt.tight_layout()
plt.show()

print("=== Imbalanced Model ===")
print(classification_report(test_data.target, imb_predictions,
                            target_names=train_data.target_names))

print("\n=== Balanced Model ===")
print(classification_report(test_data.target, predictions,
                            target_names=train_data.target_names))

=== Imbalanced Model ===
                    precision    recall  f1-score   support

rec.sport.baseball       0.82      0.96      0.88       397
         sci.space       0.94      0.74      0.83       399

          accuracy                           0.85       796
         macro avg       0.88      0.85      0.85       796
      weighted avg       0.88      0.85      0.85       796


=== Balanced Model ===
                    precision    recall  f1-score   support

rec.sport.baseball       0.97      0.96      0.97       397
         sci.space       0.96      0.97      0.97       399

          accuracy                           0.97       796
         macro avg       0.97      0.97      0.97       796
      weighted avg       0.97      0.97      0.97       796
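
The reports above show the minority class losing recall. The extreme case is easy to work out by hand: a hypothetical "model" that always predicts baseball, scored on the same 475/25 class mix used for the imbalanced training set. This is a plain-Python sketch with a hand-written `recall` helper, not the notebook's pipeline.

```python
# Class sizes copied from the imbalanced set above: 475 baseball (0), 25 space (1).
y_true = [0] * 475 + [1] * 25
# Hypothetical "model" that ignores its input and always predicts baseball.
y_pred = [0] * len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the true `label` samples that were predicted as `label`."""
    actual = sum(1 for t in y_true if t == label)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / actual if actual else 0.0


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy:        {accuracy:.3f}")
print(f"Baseball recall: {recall(y_true, y_pred, 0):.3f}")
print(f"Space recall:    {recall(y_true, y_pred, 1):.3f}")
```

Accuracy comes out at 95% even though not a single space post is found, which is why per-class recall and the confusion matrix matter more than the headline number.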


---
## Section 4: Proper Train/Validation/Test Split

In [8]:
from sklearn.model_selection import train_test_split

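# Pool the original train and test sets so we can make our own three-way split
all_texts = train_data.data + test_data.data
all_labels = np.concatenate([train_data.target, test_data.target])

# First split: 60% train, 40% held out (stratify keeps the class ratio in each part)
X_train, X_temp, y_train, y_temp = train_test_split(
    all_texts, all_labels, test_size=0.4, random_state=42, stratify=all_labels
)

# Second split: halve the held-out 40% into 20% validation and 20% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)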

print(f"Train:      {len(X_train)} samples ({len(X_train)/len(all_texts)*100:.0f}%)")
print(f"Validation: {len(X_val)} samples ({len(X_val)/len(all_texts)*100:.0f}%)")
print(f"Test:       {len(X_test)} samples ({len(X_test)/len(all_texts)*100:.0f}%)")

Train:      1195 samples (60%)
Validation: 399 samples (20%)
Test:       399 samples (20%)
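
The 60/20/20 figures come from chaining two splits: 40% held out, then half of that held out again. Here is a small arithmetic sketch of where 1195 / 399 / 399 come from. The helper name is made up, and the round-up rule for the held-out part is an assumption that happens to match the sizes printed above.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up, which is consistent
    with the sizes printed above.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 1197 + 796  # training + test samples reported in Section 1
n_train, n_temp = split_sizes(n_total, 0.4)
n_val, n_test = split_sizes(n_temp, 0.5)
print(n_train, n_val, n_test)  # 1195 399 399
```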


In [9]:
model = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])
model.fit(X_train, y_train)

val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
print("(Use this to decide if you need to change your approach)")

test_accuracy = model.score(X_test, y_test)
print(f"\nFinal test accuracy: {test_accuracy:.3f}")
print("(This is your real-world performance estimate)")

Validation accuracy: 0.967
(Use this to decide if you need to change your approach)

Final test accuracy: 0.972
(This is your real-world performance estimate)


---
## Section 5: Overfit on Purpose

In [10]:
from sklearn.tree import DecisionTreeClassifier

overfit_model = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('clf', DecisionTreeClassifier(random_state=42))
])

overfit_model.fit(X_train, y_train)

train_acc = overfit_model.score(X_train, y_train)
val_acc = overfit_model.score(X_val, y_val)
test_acc = overfit_model.score(X_test, y_test)

print(f"Train accuracy:      {train_acc:.3f}")
print(f"Validation accuracy: {val_acc:.3f}")
print(f"Test accuracy:       {test_acc:.3f}")
print(f"\nTrain-Test gap:      {train_acc - test_acc:.3f}")
print(f"\nThe model memorized the training data: 100% on train, but noticeably worse on unseen data.")

Train accuracy:      1.000
Validation accuracy: 0.912
Test accuracy:       0.917

Train-Test gap:      0.083

The model memorized the training data: 100% on train, but noticeably worse on unseen data.


In [11]:
depths = [1, 3, 5, 10, 20, 50, None]
train_scores = []
val_scores = []

for depth in depths:
    dt = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
        ('clf', DecisionTreeClassifier(max_depth=depth, random_state=42))
    ])
    dt.fit(X_train, y_train)
    train_scores.append(dt.score(X_train, y_train))
    val_scores.append(dt.score(X_val, y_val))

fig, ax = plt.subplots(figsize=(10, 6))
x_labels = [str(d) if d else 'None' for d in depths]
x_pos = range(len(depths))

ax.plot(x_pos, train_scores, 'o-', label='Train Accuracy', linewidth=2, markersize=8)
ax.plot(x_pos, val_scores, 's-', label='Validation Accuracy', linewidth=2, markersize=8)
ax.set_xticks(x_pos)
ax.set_xticklabels(x_labels)
ax.set_xlabel('max_depth')
ax.set_ylabel('Accuracy')
ax.set_title('Overfitting: Train vs Validation Accuracy by Tree Depth')
ax.legend()
ax.grid(True, alpha=0.3)

best_val_idx = np.argmax(val_scores)
ax.axvline(x=best_val_idx, color='green', linestyle='--', alpha=0.5,
           label=f'Best validation (depth={x_labels[best_val_idx]})')
ax.legend()

plt.tight_layout()
plt.show()

print(f"\nBest validation accuracy: {max(val_scores):.3f} at max_depth={depths[best_val_idx]}")
print(f"Unconstrained tree: train={train_scores[-1]:.3f}, val={val_scores[-1]:.3f}")
print(f"\nThe gap between train and validation curves IS overfitting.")


Best validation accuracy: 0.960 at max_depth=10
Unconstrained tree: train=1.000, val=0.912

The gap between train and validation curves IS overfitting.
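
`np.argmax` picks the single best validation score. A common variation is to prefer the shallowest tree that is nearly as good. The sketch below shows that idea. The scores are hypothetical values invented for illustration (they are not the output of the sweep above), and `pick_depth` and its `tolerance` parameter are made-up names.

```python
# Hypothetical validation scores for illustration only; they are NOT the
# values produced by the sweep above. Depths run from shallowest to deepest.
depths = [1, 3, 5, 10, 20, 50, None]
val_scores = [0.80, 0.88, 0.93, 0.96, 0.955, 0.92, 0.91]


def pick_depth(depths, scores, tolerance=0.01):
    """Return the first (shallowest) depth scoring within `tolerance` of the best."""
    best = max(scores)
    for depth, score in zip(depths, scores):
        if best - score <= tolerance:
            return depth, score


print(pick_depth(depths, val_scores))                  # strict: close to the best only
print(pick_depth(depths, val_scores, tolerance=0.05))  # looser: accepts a simpler tree
```

A looser tolerance trades a little validation accuracy for a smaller, more constrained model.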


In [12]:
lr_train = model.score(X_train, y_train)
lr_val = model.score(X_val, y_val)

print(f"Logistic Regression:")
print(f"  Train accuracy:      {lr_train:.3f}")
print(f"  Validation accuracy: {lr_val:.3f}")
print(f"  Gap:                 {lr_train - lr_val:.3f}")
print(f"\nMuch smaller gap = less overfitting. Here the simpler linear model generalizes better than the unconstrained tree.")

Logistic Regression:
  Train accuracy:      0.998
  Validation accuracy: 0.967
  Gap:                 0.031

Much smaller gap = less overfitting. Here the simpler linear model generalizes better than the unconstrained tree.
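
To put the two models side by side, this short sketch recomputes the train-validation gap from the accuracies printed in this section. The numbers are copied from the outputs above; only the dictionary layout is new.

```python
# Accuracies copied from the printed outputs in this section.
scores = {
    "Decision tree (unconstrained)": {"train": 1.000, "val": 0.912},
    "Logistic regression": {"train": 0.998, "val": 0.967},
}

# Gap = how much accuracy is lost going from seen to unseen data.
gaps = {name: s["train"] - s["val"] for name, s in scores.items()}

for name, gap in gaps.items():
    print(f"{name:<30s} gap = {gap:.3f}")
```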


---
## Section 6: Save and Inspect the Model

In [13]:
import joblib
import os

model_path = 'sentiment_model.joblib'  # Note: despite the file name, this is a topic classifier, not a sentiment model
joblib.dump(model, model_path)

size_bytes = os.path.getsize(model_path)
size_mb = size_bytes / (1024 * 1024)
print(f"Model saved to: {model_path}")
print(f"File size: {size_mb:.2f} MB ({size_bytes:,} bytes)")
print(f"\nThis file contains everything the model learned:")
print(f"  - The TF-IDF vocabulary ({len(model.named_steps['tfidf'].vocabulary_):,} words)")
print(f"  - The IDF weights for each word")
print(f"  - The logistic regression coefficients")
print(f"  - The intercept (bias) term")

Model saved to: sentiment_model.joblib
File size: 0.42 MB (440,320 bytes)

This file contains everything the model learned:
  - The TF-IDF vocabulary (10,000 words)
  - The IDF weights for each word
  - The logistic regression coefficients
  - The intercept (bias) term
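
Saving a model is ordinary object serialization. joblib's persistence is built on Python's pickle mechanism, so the round trip can be sketched with the standard library alone. The `toy_model` dictionary is a hypothetical stand-in for the fitted pipeline, with invented numbers.

```python
import pickle

# Hypothetical stand-in for a fitted pipeline: a vocabulary plus learned numbers.
toy_model = {
    "vocabulary": {"pitcher": 0, "orbit": 1, "inning": 2},
    "idf": [1.3, 1.7, 1.3],
    "coefficients": [-2.1, 1.8, -1.5],
    "intercept": 0.05,
}

blob = pickle.dumps(toy_model)   # serialize the object to bytes
restored = pickle.loads(blob)    # rebuild an equal object from those bytes

print(f"Serialized size: {len(blob)} bytes")
print(f"Round trip preserved everything: {restored == toy_model}")
```

Because loading a pickle-based file can run arbitrary code, only load model files from sources you trust.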


In [14]:
loaded_model = joblib.load(model_path)

test_texts = [
    "The pitcher threw a fastball and struck out the batter in the ninth inning.",
    "NASA launched a new satellite to study the atmosphere of Mars.",
    "The home run in the bottom of the seventh sealed the championship.",
    "The telescope captured images of a distant galaxy cluster."
]

loaded_predictions = loaded_model.predict(test_texts)
loaded_probabilities = loaded_model.predict_proba(test_texts)

print("Predictions from the loaded model:\n")
for text, pred, proba in zip(test_texts, loaded_predictions, loaded_probabilities):
    label = train_data.target_names[pred]
    confidence = max(proba) * 100
    print(f"  [{label:>20s}] ({confidence:.0f}% confident) \"{text[:60]}...\"")

Predictions from the loaded model:

  [rec.sport.baseball] (99% confident) "The pitcher threw a fastball and struck out the batter in t..."
  [         sci.space] (98% confident) "NASA launched a new satellite to study the atmosphere of Ma..."
  [rec.sport.baseball] (98% confident) "The home run in the bottom of the seventh sealed the champi..."
  [         sci.space] (97% confident) "The telescope captured images of a distant galaxy cluster..."
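
The confidence figures above come from `predict_proba`. For a two-class logistic regression, that boils down to a weighted sum of the features passed through a sigmoid. The sketch below shows the calculation with made-up weights and feature values. It illustrates the idea and is not a copy of scikit-learn's implementation.

```python
import math

# Hypothetical weights for illustration; the real model has one coefficient
# per vocabulary word. Negative pushes toward class 0, positive toward class 1.
weights = {"pitcher": -2.0, "inning": -1.5, "satellite": 1.8, "nasa": 2.2}
intercept = 0.0
class_names = ["rec.sport.baseball", "sci.space"]


def predict_proba(features):
    """Return [P(class 0), P(class 1)] for a dict of word -> feature value."""
    z = intercept + sum(weights.get(w, 0.0) * v for w, v in features.items())
    p1 = 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)
    return [1.0 - p1, p1]


# Hypothetical feature values standing in for TF-IDF weights.
proba = predict_proba({"nasa": 0.7, "satellite": 0.7})
pred = proba.index(max(proba))
print(f"[{class_names[pred]}] ({max(proba) * 100:.0f}% confident)")
```

Words the model has never seen contribute nothing, so a text made only of unknown words lands at 50/50 in this sketch.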


In [15]:
feature_names = model.named_steps['tfidf'].get_feature_names_out()
coefficients = model.named_steps['clf'].coef_[0]

sorted_idx = np.argsort(coefficients)  # ascending: most negative (baseball) first, most positive (space) last

n_top = 15

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

top_baseball = sorted_idx[:n_top]
axes[0].barh(range(n_top), coefficients[top_baseball], color='#2196F3')
axes[0].set_yticks(range(n_top))
axes[0].set_yticklabels([feature_names[i] for i in top_baseball])
axes[0].set_title(f'Top {n_top} words -> {train_data.target_names[0]}')
axes[0].set_xlabel('Coefficient (more negative = stronger signal)')

top_space = sorted_idx[-n_top:]
axes[1].barh(range(n_top), coefficients[top_space], color='#FF9800')
axes[1].set_yticks(range(n_top))
axes[1].set_yticklabels([feature_names[i] for i in top_space])
axes[1].set_title(f'Top {n_top} words -> {train_data.target_names[1]}')
axes[1].set_xlabel('Coefficient (more positive = stronger signal)')

plt.suptitle('What the Model Learned: Most Important Words per Class', fontsize=14)
plt.tight_layout()
plt.show()

print(f"\nThe model learned {len(feature_names):,} word-to-number mappings.")
print(f"Each word has a coefficient: positive = space, negative = baseball.")
print(f"These coefficients ARE the model. That's all there is to it.")


The model learned 10,000 word-to-number mappings.
Each word has a coefficient: positive = space, negative = baseball.
These coefficients ARE the model. That's all there is to it.
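
The ranking behind the two bar charts is just a sort on the coefficients. Here is the same idea in plain Python, using a handful of hypothetical word/coefficient pairs rather than the trained model's values.

```python
# Hypothetical (word, coefficient) pairs; not values from the trained model.
coefs = {
    "pitcher": -2.0, "inning": -1.5, "game": -0.9,
    "orbit": 1.6, "nasa": 2.2, "shuttle": 1.1,
}

# Sort words by coefficient, most negative first (same order as np.argsort).
ranked = sorted(coefs, key=coefs.get)

n_top = 2
top_baseball = ranked[:n_top]       # strongest class-0 (baseball) signals
top_space = ranked[-n_top:][::-1]   # strongest class-1 (space) signals, strongest first

print("Baseball:", top_baseball)
print("Space:   ", top_space)
```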


In [16]:
os.remove(model_path)
print(f"Cleaned up {model_path}")

Cleaned up sentiment_model.joblib


---
## Summary

You just trained, evaluated, and saved a machine learning model end to end. Here's what you now know:

| Concept | What You Learned |
|---------|------------------|
| **A model** | Learned parameters (coefficients, weights) stored in a file |
| **Training** | Feeding data through an algorithm to learn those parameters |
| **Accuracy trap** | On imbalanced data, accuracy lies — use precision, recall, F1, confusion matrix |
| **Train/val/test** | Three-way split prevents you from fooling yourself during tuning |
| **Overfitting** | Memorizing training data (100% train, bad test) — fix with simpler models or constraints |
| **Model file** | A serialized object containing vocabulary + learned weights — nothing magical |

### What's Next?

In **ML Lab 02**, you'll take this saved model and deploy it behind a FastAPI endpoint — turning a file on disk into a live prediction service.