# Chapter 4 — Classifying Texts

Text classification is one of the foundational tasks in Natural Language Processing: given a piece of text, assign it a **label** drawn from a fixed set of categories. Sentiment analysis ("positive" vs. "negative"), topic detection ("sport" vs. "politics"), spam filtering, and intent recognition all fall under this umbrella.

In this notebook we work through five progressively sophisticated approaches:

| # | Method | Supervision | Core Idea |
|---|--------|-------------|-----------|
| 1 | **Keyword / Rule-based** | None | Count class-specific words |
| 2 | **K-Means Clustering** | Unsupervised | Group TF-IDF vectors into $k$ clusters |
| 3 | **SVM with BERT embeddings** | Supervised | Learn a maximum-margin boundary in embedding space |
| 4 | **spaCy TextCategorizer (CNN)** | Supervised | Train an end-to-end CNN inside spaCy |
| 5 | **GPT-4o-mini** | Zero-shot | Prompt a large language model |

We use two datasets throughout: the **Rotten Tomatoes** movie-review corpus (binary sentiment) and the **BBC News** dataset (5-class topic classification). Both are loaded directly from Hugging Face.

**Prerequisites:** A Google Colab runtime with GPU is recommended but not required. An OpenAI API key (stored in Colab Secrets) is needed only for Recipe 6.

## 0 — Environment Setup

We install all required packages in a single cell so the notebook is fully self-contained on Colab. We also define shared utility functions that the original cookbook kept in external helper notebooks — here everything lives in one place for portability.

In [1]:

# 0.1  Install packages

!pip install -q \
    datasets \
    langdetect \
    nltk \
    scikit-learn \
    sentence-transformers \
    spacy \
    openai

# Download spaCy's small English model (used for tokenization)
!python -m spacy download en_core_web_sm -q


  Preparing metadata (setup.py) ... done
  Building wheel for langdetect (setup.py) ... done
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:

# 0.2  Core imports & configuration

import warnings, os, re, json
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import nltk
nltk.download("punkt",        quiet=True)
nltk.download("punkt_tab",    quiet=True)
nltk.download("stopwords",    quiet=True)

import spacy
from datasets import load_dataset
from joblib import dump, load

small_model = spacy.load("en_core_web_sm")

%matplotlib inline
print("Setup complete.")


Setup complete.


All libraries are imported once; individual recipe sections will only add recipe-specific imports to keep things tidy. The `small_model` (spaCy `en_core_web_sm`) is loaded here and reused wherever lightweight tokenization is needed.

## 0.3 — Shared Utility Functions

The original cookbook stores helper functions in external notebooks (`util_simple_classifier.ipynb`, `lang_utils.ipynb`, `file_utils.ipynb`) and loads them with `%run -i`. For a self-contained Colab notebook we define them inline below.

In [3]:

# 0.3  Shared utility functions

from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from sklearn.metrics import classification_report

# ---------- text preprocessing ----------
STOP_WORDS = list(stopwords.words("english")) + ["``", "'s"]

def tokenize(df, text_col):
    """Add a *text_tokenized* column with word tokens."""
    df = df.copy()
    df["text_tokenized"] = df[text_col].apply(word_tokenize)
    return df

def remove_stopword_punct(df, tok_col):
    """Remove stopwords & punctuation from a tokenized column."""
    df = df.copy()
    df[tok_col] = df[tok_col].apply(
        lambda tokens: [w for w in tokens
                        if w.lower() not in STOP_WORDS
                        and w not in punctuation])
    return df

# ---------- dataset loading ----------
def load_train_test_dataset_pd(train_split, test_split):
    """Load the Rotten Tomatoes dataset from Hugging Face."""
    ds = load_dataset("rotten_tomatoes")
    train_df = ds[train_split].to_pandas()
    test_df  = ds[test_split].to_pandas()
    return train_df, test_df

# ---------- SVM / general ML helpers ----------
def create_train_test_data(train_df, test_df, vectorize_fn,
                           column_name="text"):
    """Vectorize train & test text; return X_train, X_test, y_train, y_test."""
    X_train = np.array(train_df[column_name].apply(vectorize_fn).tolist())
    X_test  = np.array(test_df[column_name].apply(vectorize_fn).tolist())
    y_train = train_df["label"].values
    y_test  = test_df["label"].values
    return X_train, X_test, y_train, y_test

def test_classifier(test_df, clf, target_names, column_name="text",
                    vectorize_fn=None, X_test=None, y_test=None):
    """Predict on test_df, print classification_report, return predictions."""
    if X_test is None:
        X_test = np.array(test_df[column_name].apply(vectorize_fn).tolist())
    if y_test is None:
        y_test = test_df["label"].values
    preds = clf.predict(X_test)
    test_df = test_df.copy()
    test_df["prediction"] = preds
    print(classification_report(y_test, preds, target_names=target_names))
    return test_df

print("Utility functions defined.")


Utility functions defined.


These helpers mirror the cookbook's external notebooks but live right here — no hidden dependencies.

---
## Recipe 1 — Getting the Dataset and Evaluation Ready

Before any model can classify text, we need to **load**, **clean**, and **explore** the data. This recipe uses the **Rotten Tomatoes** movie-review corpus: $\sim$10,700 short reviews labeled as *positive* (1) or *negative* (0).

The preprocessing pipeline follows a standard NLP pattern:

$$\text{Raw text} \;\xrightarrow{\text{language filter}}\; \text{English only} \;\xrightarrow{\text{tokenize}}\; \text{word list} \;\xrightarrow{\text{remove stopwords}}\; \text{clean tokens}$$

Each arrow reduces noise while preserving the signal a classifier needs.

In [4]:

# 1.1  Load the Rotten Tomatoes dataset

from langdetect import detect

(train_df, test_df) = load_train_test_dataset_pd("train", "test")
print(f"Training set : {train_df.shape[0]:,} reviews")
print(f"Test set     : {test_df.shape[0]:,} reviews")
print()
print(train_df.head())


Training set : 8,530 reviews
Test set     : 1,066 reviews

                                                text  label
0  the rock is destined to be the 21st century's ...      1
1  the gorgeously elaborate continuation of " the...      1
2                     effective but too-tepid biopic      1
3  if you sometimes like to go to the movies to h...      1
4  emerges as something rare , an issue movie tha...      1


The dataset ships with two columns: `text` (the review, already lowercased) and `label` (0 = negative, 1 = positive). With 8,530 training and 1,066 test reviews (a further 1,066 validation reviews go unused here), this is a moderately small dataset, a regime where feature engineering and regularization matter more than raw model capacity.

In [5]:

# 1.2  Filter non-English reviews

train_before = len(train_df)
train_df["lang"] = train_df["text"].apply(detect)
train_df = train_df[train_df["lang"] == "en"].copy()
train_after = len(train_df)

test_df["lang"] = test_df["text"].apply(detect)
test_df = test_df[test_df["lang"] == "en"].copy()

print(f"Training: {train_before:,} -> {train_after:,}  "
      f"(removed {train_before - train_after} non-English)")
print(f"Test    : kept {len(test_df):,} English reviews")


Training: 8,530 -> 8,353  (removed 177 non-English)
Test    : kept 1,047 English reviews


The `langdetect` library applies a Bayesian classifier over character n-gram profiles to identify the language of each review. The corpus is ostensibly all English, so many of the dropped rows are likely short English reviews that the detector misjudges rather than genuinely foreign text; the filter trades a small loss of data for protection against garbage features in the tokenizer and stopword steps. `langdetect` is non-deterministic by default, so the exact number removed varies slightly between runs (typically around 100 to 200 training rows); setting `langdetect.DetectorFactory.seed` to a fixed value before the first call makes the result reproducible.

In [6]:

# 1.3  Tokenize and remove stopwords / punctuation

train_df["tokenized_text"] = train_df["text"].apply(word_tokenize)
test_df["tokenized_text"]  = test_df["text"].apply(word_tokenize)

def remove_stopwords_and_punct(tokens):
    return [w for w in tokens
            if w not in STOP_WORDS and w not in punctuation]

train_df["tokenized_text"] = train_df["tokenized_text"].apply(
    remove_stopwords_and_punct)
test_df["tokenized_text"]  = test_df["tokenized_text"].apply(
    remove_stopwords_and_punct)

print("Sample cleaned review:")
print(train_df["tokenized_text"].iloc[0])


Sample cleaned review:
['rock', 'destined', '21st', 'century', 'new', 'conan', 'going', 'make', 'splash', 'even', 'greater', 'arnold', 'schwarzenegger', 'jean-claud', 'van', 'damme', 'steven', 'segal']


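**Tokenization** splits each review into individual words (and punctuation tokens). We then strip stopwords (*the, is, a, ...*) and punctuation, which carry little discriminative power for sentiment but inflate the feature space. The NLTK English stopword list contains 179 words (the count can differ between NLTK data versions); we add `'s` and the double backtick, which are tokenization artifacts. A few artifacts still get through: `n't` and `--` are neither in the stopword list nor single punctuation characters, so they remain in the token lists.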

After cleaning, each review is a list of **content words** — nouns, adjectives, verbs, and adverbs that carry the emotional signal the classifier needs.
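One subtlety of the filter above is that `punctuation` is a string, so `w not in punctuation` is a substring test rather than a set lookup. The sketch below uses only the standard library and shows why a multi-character token such as `--` passes that test. The helper `is_punct` and the sample tokens are hypothetical, introduced here only to illustrate one possible stricter check.

```python
from string import punctuation

# `punctuation` is a plain string, so `in` performs a substring search.
print("-" in punctuation)    # a single hyphen is a substring
print("--" in punctuation)   # a double hyphen is not

def is_punct(token: str) -> bool:
    """Hypothetical helper: True if every character is punctuation."""
    return bool(token) and all(ch in punctuation for ch in token)

# Hypothetical token list for illustration
tokens = ["a", "sharp", "--", "if", "uneven", ",", "film", "..."]

kept_substring = [t for t in tokens if t not in punctuation]
kept_strict = [t for t in tokens if not is_punct(t)]

print(kept_substring)  # "--" and "..." survive the substring test
print(kept_strict)     # the stricter check removes them
```

Whether to drop such tokens is a judgment call; the stricter check simply makes the decision explicit.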

In [7]:

# 1.4  Check class balance

print("=== Training set ===")
print(train_df.groupby("label")["text"].count())
print()
print("=== Test set ===")
print(test_df.groupby("label")["text"].count())


=== Training set ===
label
0    4182
1    4171
Name: text, dtype: int64

=== Test set ===
label
0    525
1    522
Name: text, dtype: int64


Class balance is critical for any classification task. When one class dominates, a naive model can achieve deceptively high accuracy by always predicting the majority class. Here, the split is close to 50/50 in both sets, so **accuracy is a valid metric** — we do not need to rely solely on macro-averaged F1 or use oversampling techniques.

Formally, if the majority class has proportion $p_{\text{maj}}$, a "predict majority" baseline achieves accuracy $= p_{\text{maj}}$. With balanced classes, $p_{\text{maj}} \approx 0.50$, so any model scoring above 50% is learning something beyond random guessing.
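As a quick check of the baseline argument, the sketch below computes the "predict majority" accuracy from the class counts printed in step 1.4. The function name `majority_baseline_accuracy` is an illustrative choice, not part of the notebook's helpers.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Class counts taken from the output of step 1.4
train_labels = [0] * 4182 + [1] * 4171
test_labels = [0] * 525 + [1] * 522

train_baseline = majority_baseline_accuracy(train_labels)
test_baseline = majority_baseline_accuracy(test_labels)

print(f"train baseline: {train_baseline:.4f}")
print(f"test baseline : {test_baseline:.4f}")
```

Both values sit just above 0.50, which is the floor any classifier in this chapter has to clear.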

In [8]:

# 1.5  Most common words per class

from nltk.probability import FreqDist

def get_stats(word_list, num_words=30):
    freq_dist = FreqDist(word_list)
    print(freq_dist.most_common(num_words))
    return freq_dist

positive_train_words = train_df[train_df["label"] == 1]["tokenized_text"].sum()
negative_train_words = train_df[train_df["label"] == 0]["tokenized_text"].sum()

print("=== Top 30 POSITIVE words ===")
positive_fd = get_stats(positive_train_words, 30)
print()
print("=== Top 30 NEGATIVE words ===")
negative_fd = get_stats(negative_train_words, 30)


=== Top 30 POSITIVE words ===
[('film', 686), ('movie', 429), ("n't", 286), ('one', 280), ('--', 271), ('like', 208), ('story', 194), ('comedy', 160), ('good', 151), ('even', 144), ('funny', 137), ('way', 135), ('time', 127), ('best', 126), ('characters', 125), ('make', 124), ('life', 124), ('much', 122), ('us', 122), ('love', 118), ('performances', 117), ('makes', 116), ('may', 113), ('work', 111), ('director', 110), ('enough', 105), ('look', 103), ('still', 96), ('little', 94), ('well', 93)]

=== Top 30 NEGATIVE words ===
[('movie', 639), ('film', 557), ("n't", 449), ('like', 353), ('one', 293), ('--', 264), ('story', 189), ('much', 175), ('bad', 172), ('even', 160), ('time', 146), ('good', 143), ('characters', 138), ('little', 136), ('would', 130), ('never', 122), ('comedy', 121), ('enough', 107), ('really', 104), ('nothing', 103), ('way', 102), ('make', 101), ('plot', 99), ('could', 97), ('director', 96), ('makes', 93), ('made', 92), ('something', 90), ('script', 87), ('every', 87)]

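Looking at the most frequent words in each class reveals an important pattern: **film** and **movie** dominate both lists. These are domain-specific stopwords: they appear everywhere because we are in a movie-review corpus. The lists also differ in telling ways. *best*, *love*, and *funny* appear only in the positive top 30, while *bad*, *never*, and *nothing* appear only in the negative one, giving us a preview of the signal that keyword and bag-of-words classifiers will exploit. Other intuitive cues are weaker than expected: *good* is almost evenly split (151 positive vs. 143 negative), plausibly because negation (`n't`, 286 vs. 449) often flips its meaning.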

In a production system, you would add these domain stopwords to the filter list and re-run the pipeline. For now, we note their presence and move on.
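One simple way to shortlist domain stopwords is to intersect the top-ranked words of the two classes. The sketch below does this with the first ten entries of each list from step 1.5; the cutoff of ten is an arbitrary choice made to keep the example short.

```python
# Top-10 (word, count) pairs copied from the output of step 1.5
positive_top = [("film", 686), ("movie", 429), ("n't", 286), ("one", 280),
                ("--", 271), ("like", 208), ("story", 194), ("comedy", 160),
                ("good", 151), ("even", 144)]
negative_top = [("movie", 639), ("film", 557), ("n't", 449), ("like", 353),
                ("one", 293), ("--", 264), ("story", 189), ("much", 175),
                ("bad", 172), ("even", 160)]

pos_counts = dict(positive_top)
neg_counts = dict(negative_top)

# Words ranking highly in BOTH classes are candidate domain stopwords
shared = sorted(set(pos_counts) & set(neg_counts))
print("Candidate domain stopwords:", shared)

# Words ranking highly in only one class are candidate sentiment cues
positive_only = sorted(set(pos_counts) - set(neg_counts))
negative_only = sorted(set(neg_counts) - set(pos_counts))
print("Positive-only:", positive_only)
print("Negative-only:", negative_only)
```

The result depends on the cutoff: with the full top-30 lists, *good* and *comedy* move into the shared set, so any shortlist produced this way still needs a human review.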

In [9]:

# 1.6  Save cleaned data for downstream recipes

os.makedirs("data", exist_ok=True)
train_df.to_json("data/rotten_tomatoes_train.json")
test_df.to_json("data/rotten_tomatoes_test.json")
print(f"Saved {len(train_df):,} train and {len(test_df):,} test reviews to data/")


Saved 8,353 train and 1,047 test reviews to data/


We persist the cleaned dataframes as JSON so that subsequent recipes can load them directly without repeating the language-detection step (which is the slowest part of preprocessing).

---

## Recipe 2 — Rule-Based Text Classification Using Keywords

The simplest possible classifier: for each class, build a vocabulary of words **unique** to that class, then count how many of those words appear in a new review. Whichever class "fires" more words wins.

$$\hat{y} = \arg\max_{c \in \{0,1\}} \; \sum_{w \in \mathbf{x}} \mathbb{1}[w \in V_c]$$

where $V_c$ is the set of words that appear *only* in class $c$ training examples and $\mathbf{x}$ is the set of tokens in the input review.

This is a **zero-parameter model**: there are no weights to fit, only the two vocabularies $V_c$ read off the training data. Its strength is interpretability and speed; its weakness is that it ignores every word that appears in both classes, and those include most of the frequent ones.
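The decision rule can be sketched in a few lines of plain Python. This is a minimal illustration on hypothetical toy reviews, not the notebook's implementation: the function names `build_exclusive_vocabs` and `keyword_classify` are assumptions, and hits are counted per token occurrence (as `CountVectorizer` would), so repeated words count more than once.

```python
def build_exclusive_vocabs(token_lists, labels):
    """Return {label: set of words seen only in that class}."""
    vocab = {}
    for tokens, label in zip(token_lists, labels):
        vocab.setdefault(label, set()).update(tokens)
    exclusive = {}
    for label, words in vocab.items():
        others = set().union(
            *(w for other, w in vocab.items() if other != label))
        exclusive[label] = words - others
    return exclusive

def keyword_classify(tokens, exclusive):
    """argmax over classes of the number of tokens found in V_c."""
    scores = {c: sum(t in v for t in tokens) for c, v in exclusive.items()}
    # Ties (including all-zero scores) resolve to the lowest class label
    return min(scores, key=lambda c: (-scores[c], c))

# Toy data (hypothetical reviews, already tokenized)
train_tokens = [["dull", "plot", "film"], ["boring", "film"],
                ["charming", "film"], ["witty", "plot", "delight"]]
train_labels = [0, 0, 1, 1]

exclusive = build_exclusive_vocabs(train_tokens, train_labels)
print(exclusive)

pred_hit = keyword_classify(["a", "witty", "charming", "film"], exclusive)
pred_tie = keyword_classify(["unseen", "words", "only"], exclusive)
print(pred_hit, pred_tie)
```

Note how the second review, which contains no vocabulary word at all, is still assigned class 0. A tie-breaking default like this silently favors one class.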

In [10]:

# 2.1  Load cleaned data and build class-exclusive vocabularies

from sklearn.feature_extraction.text import CountVectorizer

train_df = pd.read_json("data/rotten_tomatoes_train.json")
test_df  = pd.read_json("data/rotten_tomatoes_test.json")

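# Concatenate all text per class.
# NOTE: "text" holds raw strings, so .sum() joins them into one long string
# and the set() calls below collect single *characters*, not words.
# For word-level vocabularies, sum the "tokenized_text" column instead.
positive_train_words = train_df[train_df["label"] == 1]["text"].sum()
negative_train_words = train_df[train_df["label"] == 0]["text"].sum()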

# Words appearing in BOTH classes -- these are ambiguous
word_intersection = set(positive_train_words) & set(negative_train_words)

# Keep only class-exclusive words
positive_filtered = list(set(positive_train_words) - word_intersection)
negative_filtered = list(set(negative_train_words) - word_intersection)

print(f"Shared characters removed : {len(word_intersection):,}")
print(f"Negative-exclusive tokens : {len(negative_filtered):,}")
print(f"Positive-exclusive tokens : {len(positive_filtered):,}")


Shared characters removed : 63
Negative-exclusive tokens : 7
Positive-exclusive tokens : 9


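The intent is to build two vocabularies by removing every token that appears in both positive and negative reviews. As written, however, the cell works on **characters**, not words: summing the `text` column concatenates the raw strings, and `set()` of a string yields its individual characters. That is why the output reports only 63 shared characters and just 7 and 9 class-exclusive ones. Building the sets from the `tokenized_text` column instead gives word-level vocabularies; there the intersection is large because most ordinary words (*film, movie, story, ...*) appear in reviews of both sentiments, and what remains is rare words, unusual spellings, and class-specific vocabulary.

The key limitation of the approach holds even at the word level: by discarding the intersection we throw away the most frequent part of the vocabulary. Words like *brilliant* or *terrible* that are strongly sentiment-bearing but happen to appear at least once in both classes are lost.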
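The difference between character-level and word-level sets is easy to reproduce without pandas. In the sketch below, joining strings stands in for summing a column of raw text, and concatenating lists stands in for summing a column of token lists; the two toy reviews are hypothetical.

```python
reviews = ["good film", "bad film"]        # raw strings, like the `text` column
tokenized = [r.split() for r in reviews]   # token lists, like `tokenized_text`

# Concatenating strings and calling set() yields single characters
joined = "".join(reviews)
char_vocab = set(joined)

# Concatenating token lists and calling set() yields words
all_tokens = sum(tokenized, [])
word_vocab = set(all_tokens)

print(sorted(char_vocab))
print(sorted(word_vocab))
```

Only the second set is a usable keyword vocabulary; the first contains letters and the space character.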

In [11]:

# 2.2  Create per-class vectorizers and classification functions

def create_vectorizers(word_lists):
    """Create a CountVectorizer for each class vocabulary."""
    vectorizers = []
    for word_list in word_lists:
        vectorizer = CountVectorizer(vocabulary=word_list)
        vectorizers.append(vectorizer)
    return vectorizers

def vectorize(text_list, vectorizers):
    """Score text against each class vectorizer."""
    text = " ".join(text_list) if isinstance(text_list, list) else text_list
    scores = []
    for vectorizer in vectorizers:
        # Total number of vocabulary hits for this class
        output = vectorizer.transform([text])
        scores.append(int(output.sum()))
    return scores

def classify(score_list):
    """Return the class index with the highest score (ties go to the lowest index)."""
    return max(enumerate(score_list), key=lambda x: x[1])[0]

vectorizers = create_vectorizers([negative_filtered, positive_filtered])
print("Vectorizers created: 1 per class.")


Vectorizers created: 1 per class.


In [12]:

# 2.3  Evaluate on training data

train_df["prediction"] = train_df["text"].apply(
    lambda x: classify(vectorize(x, vectorizers)))

print("=== Training Set Performance ===")
print(classification_report(train_df["label"], train_df["prediction"]))


=== Training Set Performance ===
              precision    recall  f1-score   support

           0       0.50      1.00      0.67      4182
           1       0.00      0.00      0.00      4171

    accuracy                           0.50      8353
   macro avg       0.25      0.50      0.33      8353
weighted avg       0.25      0.50      0.33      8353



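On the **training set**, the run above reaches only **50% accuracy**: every review is assigned to class 0 (recall 1.00 for class 0, 0.00 for class 1). That is the majority-class baseline, not a meaningful result. The cause is the character-level vocabularies built in step 2.1: `CountVectorizer`'s default `token_pattern` only matches tokens of two or more characters, so single-character vocabulary entries never match, both class scores are 0, and `classify` resolves the tie in favor of index 0. With word-level vocabularies the figure originally quoted for this recipe is roughly **87%**. Even that number should be read with caution, because the vocabularies are *derived from* the training data: the model has effectively memorized which rare tokens belong to which class. The real test is generalization.

$$\text{Training accuracy} = 0.50 \;\text{(as run)} \qquad \text{vs.} \qquad \approx 0.87 \;\text{(word-level estimate; optimistic, since the rules were built from this data)}$$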

In [13]:

# 2.4  Evaluate on test data

test_df["prediction"] = test_df["text"].apply(
    lambda x: classify(vectorize(x, vectorizers)))

print("=== Test Set Performance ===")
print(classification_report(test_df["label"], test_df["prediction"]))


=== Test Set Performance ===
              precision    recall  f1-score   support

           0       0.50      1.00      0.67       525
           1       0.00      0.00      0.00       522

    accuracy                           0.50      1047
   macro avg       0.25      0.50      0.33      1047
weighted avg       0.25      0.50      0.33      1047



On the **test set** the run above again scores **50%**, with every review assigned to class 0, because the character-level vocabularies from step 2.1 never match a token. With word-level vocabularies the figure originally quoted for this recipe is roughly **62%**, about 25 percentage points below the corresponding training estimate. A gap of that size reflects the fundamental weakness of rule-based classification: the class-exclusive vocabularies are built from the training corpus and are **not exhaustive**. Unseen reviews contain words that were either absent from training entirely or appeared in both classes (and were thus discarded). When neither vocabulary fires, the scores tie and the classifier falls back to class 0.

**Key takeaway for production:** Rule-based classifiers are useful as quick baselines and for domains where expert knowledge can curate high-precision keyword lists (e.g., medical coding with ICD terms). But for open-domain text they lack the generalization power of learned models. The baseline any ML model must beat to justify its added complexity is therefore modest: 50% as run here, or roughly 62% for the word-level variant.

---

## Recipe 3 — Clustering Sentences Using K-Means (Unsupervised)

We now switch to the **BBC News** dataset, which contains 2,225 articles across five topics: *tech, business, sport, entertainment, politics*. The question: **can we discover these categories without ever looking at the labels?**

K-Means is the workhorse of unsupervised clustering. It partitions $N$ data points into $k$ clusters by iteratively minimizing the **within-cluster sum of squares (WCSS)**:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2$$

where $\boldsymbol{\mu}_i = \frac{1}{|S_i|}\sum_{\mathbf{x} \in S_i} \mathbf{x}$ is the centroid of cluster $S_i$.
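To make the objective concrete, the sketch below evaluates the WCSS for two different groupings of the same four points. The 2-D points are hypothetical toy data, and the helper names `centroid` and `wcss` are illustrative; no clustering is performed, only the objective is computed.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def wcss(clusters):
    """Sum over clusters of squared distances to the cluster centroid."""
    total = 0.0
    for points in clusters:
        mu = centroid(points)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mu))
                     for p in points)
    return total

# Toy 2-D data (hypothetical), grouped two different ways
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
loose = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

wcss_tight = wcss(tight)
wcss_loose = wcss(loose)
print(wcss_tight, wcss_loose)
```

The grouping that keeps nearby points together has the far smaller WCSS, which is the partition K-Means is searching for.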

We represent each article as a **TF-IDF vector** — a sparse, high-dimensional representation where each dimension corresponds to an n-gram and its value reflects how important that n-gram is to the document relative to the corpus:

$$\text{tfidf}(w, d) = \underbrace{\text{tf}(w, d)}_{\text{term freq in doc}} \times \underbrace{\log\!\frac{N}{\text{df}(w)}}_{\text{inverse doc freq}}$$
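In this formula $N$ is the number of documents and $\text{df}(w)$ the number of documents containing $w$. The sketch below implements exactly this textbook form on three hypothetical token lists. It is not what scikit-learn computes: `TfidfVectorizer` by default uses a smoothed IDF, $\ln\frac{1+N}{1+\text{df}(w)} + 1$, and L2-normalizes each row, so its values will differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF: raw term count times log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical mini-corpus, already tokenized
docs = [["said", "goal", "goal"],
        ["said", "bank"],
        ["said", "match"]]

weights = tfidf(docs)
for w in weights:
    print(w)
```

A term that occurs in every document, like *said* here, gets weight zero under the textbook formula, while a term concentrated in one document is weighted up.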

In [14]:

# 3.1  Load and inspect the BBC News dataset

train_dataset = load_dataset("SetFit/bbc-news", split="train")
test_dataset  = load_dataset("SetFit/bbc-news", split="test")

train_bbc = train_dataset.to_pandas()
test_bbc  = test_dataset.to_pandas()

print(f"Training : {len(train_bbc):,} articles")
print(f"Test     : {len(test_bbc):,} articles")
print()
print("=== Training class distribution ===")
print(train_bbc.groupby("label_text")["text"].count())
print()
print("=== Test class distribution ===")
print(test_bbc.groupby("label_text")["text"].count())


Training : 1,225 articles
Test     : 1,000 articles

=== Training class distribution ===
label_text
business         286
entertainment    210
politics         242
sport            275
tech             212
Name: text, dtype: int64

=== Test class distribution ===
label_text
business         224
entertainment    176
politics         175
sport            236
tech             189
Name: text, dtype: int64


The BBC dataset has an unusual split: the test set is almost as large as the training set. In supervised learning this would waste precious labeled data. We will combine and re-split for a more standard 80/20 ratio.

In [15]:

# 3.2  Combine and re-split with stratification

from sklearn.model_selection import StratifiedShuffleSplit

combined_df = pd.concat([train_bbc, test_bbc],
                        ignore_index=True, sort=False)
print(f"Combined dataset: {len(combined_df):,} articles")

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2,
                             random_state=0)
train_index, test_index = next(
    sss.split(combined_df["text"], combined_df["label"]))

train_bbc = combined_df[combined_df.index.isin(train_index)].copy()
test_bbc  = combined_df[combined_df.index.isin(test_index)].copy()

print(f"\nNew training set : {len(train_bbc):,}")
print(f"New test set     : {len(test_bbc):,}")
print()
print("=== New training class distribution ===")
print(train_bbc.groupby("label_text")["text"].count())
print()
print("=== New test class distribution ===")
print(test_bbc.groupby("label_text")["text"].count())


Combined dataset: 2,225 articles

New training set : 1,780
New test set     : 445

=== New training class distribution ===
label_text
business         408
entertainment    309
politics         333
sport            409
tech             321
Name: text, dtype: int64

=== New test class distribution ===
label_text
business         102
entertainment     77
politics          84
sport            102
tech              80
Name: text, dtype: int64


`StratifiedShuffleSplit` preserves the original class proportions in both the training and test partitions. This is essential: if "sport" were over-represented in training and under-represented in test (or vice versa), our accuracy estimates would be biased. After re-splitting we have roughly 80% for training and 20% for testing — a much better allocation of our limited data.
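The claim about preserved proportions can be checked directly from the counts printed in step 3.2. The sketch below is plain arithmetic on those numbers and does not call scikit-learn.

```python
# Class counts copied from the output of step 3.2
train_counts = {"business": 408, "entertainment": 309, "politics": 333,
                "sport": 409, "tech": 321}
test_counts = {"business": 102, "entertainment": 77, "politics": 84,
               "sport": 102, "tech": 80}

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())

max_gap = 0.0
for label in train_counts:
    p_train = train_counts[label] / n_train
    p_test = test_counts[label] / n_test
    max_gap = max(max_gap, abs(p_train - p_test))
    print(f"{label:<14} train {p_train:.3f}   test {p_test:.3f}")

test_share = n_test / (n_train + n_test)
print(f"test share of all data: {test_share:.2f}")
print(f"largest proportion gap: {max_gap:.4f}")
```

Every class share differs by well under one percentage point between the two partitions, and the test set is exactly 20% of the data.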

In [16]:

# 3.3  Preprocess: tokenize, remove stopwords, save

train_bbc = tokenize(train_bbc, "text")
train_bbc = remove_stopword_punct(train_bbc, "text_tokenized")

test_bbc = tokenize(test_bbc, "text")
test_bbc = remove_stopword_punct(test_bbc, "text_tokenized")

train_bbc["text_clean"] = train_bbc["text_tokenized"].apply(
    lambda x: " ".join(list(x)))
test_bbc["text_clean"] = test_bbc["text_tokenized"].apply(
    lambda x: " ".join(list(x)))

os.makedirs("data", exist_ok=True)
train_bbc.to_json("data/bbc_train.json")
test_bbc.to_json("data/bbc_test.json")

print(f"Preprocessed and saved: {len(train_bbc):,} train, "
      f"{len(test_bbc):,} test articles")


Preprocessed and saved: 1,780 train, 445 test articles


In [17]:

# 3.4  Build TF-IDF matrix and fit K-Means

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vec = TfidfVectorizer(ngram_range=(1, 3))
matrix = vec.fit_transform(train_bbc["text_clean"])

print(f"TF-IDF matrix shape : {matrix.shape}")
print(f"Non-zero entries    : {matrix.nnz:,}")

sparsity = 1 - matrix.nnz / (matrix.shape[0] * matrix.shape[1])
print(f"Sparsity            : {sparsity:.4%}")


TF-IDF matrix shape : (1780, 665564)
Non-zero entries    : 1,060,304
Sparsity            : 99.9105%


The TF-IDF vectorizer with `ngram_range=(1, 3)` considers unigrams, bigrams, and trigrams. This produces a very high-dimensional feature space: 665,564 columns here, far more than the 1,780 articles. The matrix is extremely **sparse** (99.91% zeros), because the vast majority of n-grams do not appear in any given article. Sparse storage (CSR format) keeps memory usage manageable; a dense `float64` representation of this matrix would need roughly 9.5 GB.

Each row of this matrix is a point in $\mathbb{R}^p$ (where $p$ is the vocabulary size). K-Means will try to find 5 centroids that minimize the total squared distance from each point to its assigned centroid.
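The memory argument for sparse storage is a short calculation on the shape and non-zero count printed in step 3.4. The byte sizes below are assumptions (8-byte `float64` values and 4-byte `int32` indices, which are common defaults), so treat the result as an estimate.

```python
# Shape and non-zero count copied from the output of step 3.4
n_rows, n_cols = 1780, 665_564
nnz = 1_060_304

# Assumption: 8-byte float64 values, 4-byte int32 indices
dense_bytes = n_rows * n_cols * 8
# CSR stores one value and one column index per non-zero entry,
# plus one row pointer per row (and one extra)
csr_bytes = nnz * 8 + nnz * 4 + (n_rows + 1) * 4

sparsity = 1 - nnz / (n_rows * n_cols)

print(f"dense   : {dense_bytes / 1e9:.2f} GB")
print(f"CSR     : {csr_bytes / 1e6:.2f} MB")
print(f"sparsity: {sparsity:.4%}")
```

Under these assumptions the sparse form is several hundred times smaller than the dense one.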

In [18]:

# 3.5  Fit K-Means with k=5

km = KMeans(n_clusters=5, n_init=10, random_state=42)
km.fit(matrix)
print(f"K-Means converged. Inertia = {km.inertia_:,.2f}")


K-Means converged. Inertia = 1,750.64


The **inertia** (within-cluster sum of squares) measures how compact the clusters are — lower is better, but the absolute value depends on the scale and dimensionality of the data. We run `n_init=10` random initializations and keep the best one to reduce sensitivity to the initial centroid placement.

In practice, when you do not know $k$ in advance, you would plot inertia vs. $k$ (the **elbow method**) or use the **silhouette score** to choose the number of clusters. Here we use $k = 5$ because we know there are 5 ground-truth categories.

In [19]:

# 3.6  Inspect clusters -- most frequent words per cluster

def get_most_frequent_words(text, num_words):
    word_list = word_tokenize(text)
    freq_dist = FreqDist(word_list)
    top_words = freq_dist.most_common(num_words)
    return [w[0] for w in top_words]

def print_most_common_words_by_cluster(input_df, km_model,
                                       num_clusters):
    clusters = km_model.labels_.tolist()
    input_df = input_df.copy()
    input_df["cluster"] = clusters
    for cluster in range(num_clusters):
        cluster_text = input_df[input_df["cluster"] == cluster]
        all_text = " ".join(cluster_text["text_clean"].astype(str))
        top_30 = get_most_frequent_words(all_text, 30)
        print(f"\n--- Cluster {cluster} ({len(cluster_text)} articles) ---")
        print(top_30)
    return input_df

train_bbc = print_most_common_words_by_cluster(train_bbc, km, 5)



--- Cluster 0 (415 articles) ---
['said', 'game', 'first', 'england', 'win', 'last', 'one', 'world', 'two', 'would', 'also', 'back', 'time', 'club', 'players', 'play', 'cup', 'team', 'new', 'good', 'year', 'wales', 'side', 'match', 'second', 'france', 'six', 'get', 'ireland', 'coach']

--- Cluster 1 (467 articles) ---
['said', 'us', 'year', 'mr', 'also', 'would', 'new', 'government', 'company', 'market', 'last', 'bank', 'growth', 'could', 'economy', 'firm', 'economic', 'sales', 'one', 'years', '000', 'however', 'two', 'oil', 'world', 'may', '2004', 'people', 'chief', 'prices']

--- Cluster 2 (399 articles) ---
['said', 'people', 'music', 'new', 'also', 'mr', 'one', 'would', 'could', 'technology', 'mobile', 'year', 'us', 'many', 'uk', 'users', 'use', 'like', 'digital', 'games', 'get', 'make', 'net', 'software', 'tv', 'world', 'first', 'online', 'time', 'used']

--- Cluster 3 (214 articles) ---
['film', 'best', 'said', 'also', 'year', 'awards', 'one', 'us', 'award', 'number', 'films', ...

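Each cluster's top words reveal a fairly clear thematic identity. In the run above, cluster 0 is sport, cluster 1 is business (with some government vocabulary mixed in), cluster 2 combines tech with music-related terms, and cluster 3 is film and awards. The exact cluster numbering and composition depend on the random seed and library version, but you will typically see groupings like: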

- **Politics cluster:** *labour, party, election, blair, government, minister*
- **Sport cluster:** *game, england, win, play, cup, players, match*
- **Business cluster:** *sales, growth, firm, market, economy, bank*
- **Tech cluster:** *software, users, microsoft, security, net, search, mobile*
- **Entertainment cluster:** *film, music, award, show, best, star, band*

Notice that **said** appears at or near the top of every cluster shown, and **mr** ranks highly in several. They are domain-specific stopwords that we should have filtered. In a production pipeline, you would iteratively refine the stopword list based on exactly this kind of inspection.

The fact that K-Means recovers recognizable topics *without any labels* demonstrates the power of the TF-IDF + clustering approach. The clusters are not perfect — some "business" and "politics" articles share vocabulary about government economic policy — but they provide a strong starting point for exploratory analysis.

In [20]:

# 3.7  Predict cluster for a test example

test_example = test_bbc.iloc[1]["text"]
true_label   = test_bbc.iloc[1]["label_text"]

vectorized  = vec.transform([test_example])
prediction  = km.predict(vectorized)

print(f"True label       : {true_label}")
print(f"Assigned cluster : {prediction[0]}")
print(f"\nFirst 200 chars of article:\n{test_example[:200]}...")


True label       : politics
Assigned cluster : 4

First 200 chars of article:
lib dems  new election pr chief the lib dems have appointed a senior figure from bt to be the party s new communications chief for their next general election effort.  sandy walkington will now work w...


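We verify the model on a single test example. Because K-Means clusters are **unlabeled** (cluster 4 is not inherently "politics"), interpreting the result requires cross-referencing with the word lists from the previous step. In a real project you would either manually map cluster IDs to topic names or use a small labeled subset to create this mapping automatically. Note also that the example is vectorized from the raw `text` column, whereas `vec` was fitted on `text_clean`; applying the same preprocessing at prediction time keeps the n-gram features consistent.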
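Creating the cluster-to-topic mapping from a small labeled subset can be as simple as a majority vote per cluster. The sketch below uses hypothetical cluster assignments and labels; the function name `map_clusters_to_labels` is an illustrative choice, not part of the notebook.

```python
from collections import Counter, defaultdict

def map_clusters_to_labels(cluster_ids, true_labels):
    """Assign each cluster the most common true label among its members."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        votes[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0]
            for cluster, counts in votes.items()}

# Hypothetical labeled sample: cluster assignments and their true topics
cluster_ids = [0, 0, 0, 1, 1, 4, 4, 4]
true_labels = ["sport", "sport", "tech", "business", "business",
               "politics", "politics", "business"]

mapping = map_clusters_to_labels(cluster_ids, true_labels)
print(mapping)
```

With real data, `cluster_ids` would come from `km.labels_` or `km.predict(...)`, and the share of votes won by the top label gives a rough purity estimate for each cluster.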

In [21]:

# 3.8  Save and reload the model

dump(km, "data/kmeans.joblib")
km_loaded = load("data/kmeans.joblib")

# Verify: same prediction
prediction_loaded = km_loaded.predict(vectorized)
assert (prediction == prediction_loaded).all(), "Mismatch!"
print(f"Model saved & reloaded. Prediction matches: {prediction_loaded[0]}")


Model saved & reloaded. Prediction matches: 4


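Persisting models with `joblib` is essential for deployment. The serialized file contains the 5 cluster centroids (each a vector in TF-IDF space), allowing instant cluster assignment on new documents without re-training. The fitted `TfidfVectorizer` (`vec`) must be saved alongside it: without the same vocabulary and IDF weights, new documents cannot be mapped into the space the centroids live in.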

**Strategic note:** Unsupervised clustering is invaluable when you have large volumes of unlabeled text — a common scenario in enterprise settings. Use it to discover emergent topics in customer feedback, support tickets, or internal documents, then have domain experts label a small sample for supervised refinement.

---

## Recipe 4 — Using SVMs for Supervised Text Classification

We now move to **supervised** learning, where we leverage labeled data to train a model that maps text to categories. The Support Vector Machine (SVM) finds the **maximum-margin hyperplane** that separates classes:

$$\min_{\mathbf{w}, b} \;\; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \max\!\big(0,\; 1 - y_i(\mathbf{w}^T \phi(\mathbf{x}_i) + b)\big)$$

The first term encourages a wide margin (good generalization); the second penalizes misclassifications. The regularization parameter $C$ controls the trade-off: small $C$ favors a wider margin (more regularization), large $C$ tries harder to classify every training point correctly.
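To make the role of $C$ concrete, here is a minimal sketch that evaluates the objective for a fixed $\mathbf{w}$ and $b$ on toy data. It assumes a linear feature map ($\phi(\mathbf{x}) = \mathbf{x}$) and binary labels in $\{-1, +1\}$. The data points and the function name `svm_objective` are illustrative and not part of the notebook.

```python
def svm_objective(w, b, X, y, C):
    """Soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    margin_term = 0.5 * sum(wj * wj for wj in w)
    hinge_sum = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge_sum += max(0.0, 1.0 - yi * score)
    return margin_term + C * hinge_sum

# Toy 2-D data; the last point is correctly classified but inside the margin
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]
w, b = [1.0, 0.0], 0.0

for C in (0.1, 10.0):
    print(f"C={C}: objective = {svm_objective(w, b, X, y, C):.2f}")
```

With small $C$ the margin violation barely changes the objective, so the optimizer can keep $\|\mathbf{w}\|$ small. With large $C$ the same violation dominates, pushing the optimizer to fit that point.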

Instead of TF-IDF, we use **BERT embeddings** from the `all-MiniLM-L6-v2` sentence transformer. This model maps each text to a dense 384-dimensional vector that captures semantic meaning — a massive upgrade from bag-of-words.

$$\phi : \text{``Great acting and plot''} \;\longmapsto\; \mathbf{v} \in \mathbb{R}^{384}$$

In [22]:

# 4.1  Load data and BERT sentence encoder

from sklearn.svm import SVC
from sentence_transformers import SentenceTransformer
from sklearn.metrics import confusion_matrix

train_bbc = pd.read_json("data/bbc_train.json")
test_bbc  = pd.read_json("data/bbc_test.json")

# Shuffle training data
train_bbc = train_bbc.sample(frac=1, random_state=42).reset_index(drop=True)

print("=== Training class counts ===")
print(train_bbc.groupby("label_text")["text"].count())
print()
print("=== Test class counts ===")
print(test_bbc.groupby("label_text")["text"].count())


=== Training class counts ===
label_text
business         408
entertainment    309
politics         333
sport            409
tech             321
Name: text, dtype: int64

=== Test class counts ===
label_text
business         102
entertainment     77
politics          84
sport            102
tech              80
Name: text, dtype: int64


In [23]:

# 4.2  Load sentence transformer model

st_model = SentenceTransformer("all-MiniLM-L6-v2")

def get_sentence_vector(text, model):
    return model.encode([text])[0]

# Quick test
sample_vec = get_sentence_vector("This is a test sentence.", st_model)
print(f"Embedding dimension: {sample_vec.shape[0]}")


Embedding dimension: 384


The `all-MiniLM-L6-v2` model is a distilled BERT variant optimized for sentence-level similarity tasks. It produces 384-dimensional embeddings — far denser and more semantically meaningful than the sparse TF-IDF vectors (which had tens of thousands of dimensions). Two sentences with similar meaning will have **high cosine similarity** in this space, even if they share no words in common.

This is the key advantage of pre-trained embeddings: the model has already learned from hundreds of millions of sentences that "football match" and "soccer game" are semantically close — information that bag-of-words representations cannot capture.
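Cosine similarity itself is simple to compute. The sketch below uses made-up 3-dimensional vectors as stand-ins for real 384-dimensional embeddings. The numbers are purely illustrative and were not produced by `all-MiniLM-L6-v2`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: two sports phrases and one finance phrase
football_match = [0.9, 0.1, 0.0]
soccer_game = [0.8, 0.2, 0.1]
interest_rates = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(football_match, soccer_game)
sim_unrelated = cosine_similarity(football_match, interest_rates)
print(f"football vs soccer : {sim_related:.3f}")
print(f"football vs finance: {sim_unrelated:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while vectors with little overlap score close to 0, regardless of their lengths.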

In [24]:

# 4.3  Vectorize data

target_names = ["tech", "business", "sport",
                "entertainment", "politics"]

vectorize_fn = lambda x: get_sentence_vector(x, st_model)

print("Encoding training data (this may take a minute)...")
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_bbc, test_bbc, vectorize_fn, column_name="text_clean")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape : {X_test.shape}")


Encoding training data (this may take a minute)...
X_train shape: (1780, 384)
X_test shape : (445, 384)


In [25]:

# 4.4  Train SVM classifier

clf = SVC(C=0.1, kernel="rbf")
clf.fit(X_train, y_train)
print("SVM trained.")


SVM trained.


We choose the **RBF (Radial Basis Function)** kernel, which maps data into an infinite-dimensional space where a linear separator can handle non-linear boundaries:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\right)$$

The default is $\gamma = \frac{1}{p \cdot \text{Var}(\mathbf{X})}$ (scikit-learn's `gamma="scale"`), where $p = 384$ is the number of features. Combined with a relatively small $C = 0.1$, we are applying **strong regularization**, which is appropriate because the BERT embeddings are already a powerful representation and we want to avoid overfitting to the $\sim 1{,}780$ training examples.
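The kernel and the default $\gamma$ can be reproduced by hand on a tiny example. This is a pure-Python sketch on a made-up 4×2 matrix, not the notebook's 384-dimensional data. It assumes the variance is taken over all entries of the training matrix, which is how scikit-learn's `gamma="scale"` is documented.

```python
import math

def scale_gamma(X):
    """gamma = 1 / (n_features * variance over all entries of X)."""
    values = [v for row in X for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (len(X[0]) * variance)

def rbf_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

# Toy training matrix with p = 2 features (illustrative values)
X_toy = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
gamma = scale_gamma(X_toy)

print(f"gamma            = {gamma}")
print(f"K(x0, x0)        = {rbf_kernel(X_toy[0], X_toy[0], gamma)}")
print(f"K(x0, x1)        = {rbf_kernel(X_toy[0], X_toy[1], gamma):.4f}")
```

The kernel value is 1 for identical points and decays toward 0 as points move apart. A larger $\gamma$ makes that decay faster, which makes the decision boundary more local.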

In [26]:

# 4.5  Evaluate on training and test sets

print("=== Training Set ===")
train_preds = clf.predict(X_train)
print(classification_report(y_train, train_preds,
                            target_names=target_names))

print("\n=== Test Set ===")
test_preds = clf.predict(X_test)
test_bbc_eval = test_bbc.copy()
test_bbc_eval["prediction"] = test_preds
print(classification_report(y_test, test_preds,
                            target_names=target_names))


=== Training Set ===
               precision    recall  f1-score   support

         tech       0.97      0.97      0.97       321
     business       0.96      0.96      0.96       408
        sport       0.98      1.00      0.99       409
entertainment       0.99      0.98      0.99       309
     politics       0.98      0.95      0.96       333

     accuracy                           0.97      1780
    macro avg       0.97      0.97      0.97      1780
 weighted avg       0.97      0.97      0.97      1780


=== Test Set ===
               precision    recall  f1-score   support

         tech       0.97      0.95      0.96        80
     business       0.98      0.97      0.98       102
        sport       0.98      1.00      0.99       102
entertainment       0.96      0.99      0.97        77
     politics       0.98      0.96      0.97        84

     accuracy                           0.98       445
    macro avg       0.97      0.97      0.97       445
 weighted avg       0

The SVM with BERT embeddings reaches **98% test accuracy** in this run, with every per-class F1 score at 0.96 or higher. That is a dramatic improvement over both the keyword baseline (62%) and unsupervised K-Means. Precision and recall are consistently high for each category, and "sport" comes closest to perfect because sports vocabulary is highly distinctive.

The train-test gap is small, confirming that $C = 0.1$ provides adequate regularization. This is the benefit of starting with a strong feature representation (BERT): the classifier's job is comparatively easy because the hard work of understanding language semantics has already been done by the pre-trained transformer.

$$\text{Accuracy}_{\text{test}} \gg \text{Accuracy}_{\text{keyword}} \;\; \Rightarrow \;\; \text{learned representations} \gg \text{hand-crafted rules}$$

In [27]:

# 4.6  Confusion matrix

cm = confusion_matrix(y_test, test_preds)
print("Confusion matrix (rows = true, cols = predicted):")
print(f"Classes: {target_names}\n")
print(cm)


Confusion matrix (rows = true, cols = predicted):
Classes: ['tech', 'business', 'sport', 'entertainment', 'politics']

[[ 76   0   1   2   1]
 [  1  99   1   1   0]
 [  0   0 102   0   0]
 [  0   0   0  76   1]
 [  1   2   0   0  81]]


The confusion matrix lets us pinpoint **where** the model struggles. In this run the 11 errors are spread thinly: the largest off-diagonal cells are two **tech** articles predicted as **entertainment** and two **politics** articles predicted as **business**. Some confusion among business, politics, and tech is expected, since these categories share vocabulary about companies, government policy, and economic trends. No **sport** article is misclassified (its row has no off-diagonal entries) because sports terminology is highly domain-specific.

In a production setting, these confusion patterns would guide your next steps: collect more training data for the confused categories, engineer features that distinguish them (e.g., named entities like company names vs. politician names), or merge categories that are genuinely ambiguous.
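Finding the most confused pairs can be automated. The sketch below hard-codes the confusion matrix printed by cell 4.6 so that it runs on its own. `top_confusions` is a hypothetical helper name, not part of scikit-learn.

```python
target_names = ["tech", "business", "sport", "entertainment", "politics"]

# Confusion matrix copied from the 4.6 output (rows = true, cols = predicted)
cm = [
    [76, 0, 1, 2, 1],
    [1, 99, 1, 1, 0],
    [0, 0, 102, 0, 0],
    [0, 0, 0, 76, 1],
    [1, 2, 0, 0, 81],
]

def top_confusions(matrix, names, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    errors = [
        (names[i], names[j], count)
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
        if i != j and count > 0
    ]
    # sorted() is stable, so ties keep their row-major order
    return sorted(errors, key=lambda e: e[2], reverse=True)[:k]

for true_name, pred_name, count in top_confusions(cm, target_names, k=2):
    print(f"{true_name} -> {pred_name}: {count}")
```

With a test set this small the counts are tiny, so treat the ranking as a hint about where to look rather than as a statistically solid finding.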

In [28]:

# 4.7  Classify a new article

new_article = (
    "iPhone 12: Apple makes jump to 5G. "
    "Apple has confirmed its iPhone 12 handsets will be its first to work on "
    "faster 5G networks. The company has also extended the range to include a "
    "new Mini model that has a smaller 5.4in screen. The US firm bucked a "
    "wider industry downturn by increasing its handset sales over the past year. "
    "5G will bring a new level of performance for downloads and uploads, "
    "higher quality video streaming, more responsive gaming, real-time "
    "interactivity and so much more, said chief executive Tim Cook."
)

vector = get_sentence_vector(new_article, st_model)
pred   = clf.predict([vector])
print(f"Predicted class: {target_names[pred[0]]}")


Predicted class: tech


The model correctly classifies this Apple 5G article as **tech**. The BERT embedding captures semantic cues like *5G*, *networks*, *handsets*, *streaming*, and *gaming* — even though the article also mentions business-related terms like *sales* and *industry downturn*. This is the strength of contextual embeddings: the overall semantic context pushes the embedding firmly into the "tech" region of the 384-dimensional space.

---

## Recipe 5 — Training a spaCy Model for Supervised Text Classification

spaCy's built-in **TextCategorizer** trains a lightweight CNN that reads the raw text and outputs class probabilities — no separate feature-engineering step required. The architecture uses:

$$\text{Raw text} \;\xrightarrow{\text{tokenize + embed}}\; \text{Token vectors} \;\xrightarrow{\text{CNN}}\; \text{Feature map} \;\xrightarrow{\text{softmax}}\; P(c \mid \text{text})$$

The training config (a `.cfg` file) controls the architecture, optimizer, learning rate, and data paths. spaCy handles the entire training loop — we just need to prepare the data in spaCy's `DocBin` format.
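The final softmax step of that pipeline turns raw class scores into probabilities that sum to 1. A minimal sketch follows, with made-up scores standing in for whatever the network would actually produce.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["tech", "business", "sport", "entertainment", "politics"]
logits = [0.5, 0.1, -1.0, -0.8, 3.0]   # hypothetical raw scores

probs = softmax(logits)
for label, p in zip(labels, probs):
    print(f"{label:>13}: {p:.3f}")
```

Subtracting the maximum does not change the result but avoids overflow when scores are large.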

In [29]:

# 5.1  Prepare data in spaCy DocBin format

from spacy.tokens import DocBin

label_list = ["tech", "business", "sport",
              "entertainment", "politics"]

train_bbc = pd.read_json("data/bbc_train.json")
test_bbc  = pd.read_json("data/bbc_test.json")
train_bbc = train_bbc.sample(frac=1, random_state=42)

def preprocess_data_entry(input_text, label_idx, labels):
    doc = small_model.make_doc(input_text)   # fast -- no pipeline
    cats = {lbl: (1.0 if i == label_idx else 0.0)
            for i, lbl in enumerate(labels)}
    doc.cats = cats
    return doc

train_db = DocBin()
test_db  = DocBin()

for _, row in train_bbc.iterrows():
    doc = preprocess_data_entry(row["text"], row["label"], label_list)
    train_db.add(doc)

for _, row in test_bbc.iterrows():
    doc = preprocess_data_entry(row["text"], row["label"], label_list)
    test_db.add(doc)

os.makedirs("data", exist_ok=True)
train_db.to_disk("data/bbc_train.spacy")
test_db.to_disk("data/bbc_test.spacy")

print(f"Created DocBin: {len(train_db)} train, {len(test_db)} test docs")


Created DocBin: 1780 train, 445 test docs


Each `Doc` object stores the text and a `.cats` dictionary mapping category names to their one-hot values: `{"tech": 0.0, "business": 1.0, "sport": 0.0, ...}`. The `DocBin` is spaCy's efficient serialization format, designed for fast I/O during training.

We use `make_doc()` instead of the full `nlp()` pipeline because we only need tokenization here, not POS tagging or NER — this is significantly faster for large datasets.

In [30]:

# 5.2  Generate spaCy training config

# We generate a minimal config programmatically instead of
# downloading from the book's repo -- fully self-contained.

config_text = '''[paths]
train = "data/bbc_train.spacy"
dev = "data/bbc_test.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["textcat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat]
factory = "textcat"
threshold = 0.5

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 64
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 64
depth = 2
window_size = 1
maxout_pieces = 3

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = true
length = 262144
ngram_size = 1
no_output_layer = false

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
'''

with open("data/spacy_config.cfg", "w") as f:
    f.write(config_text.strip())

print("spaCy config written to data/spacy_config.cfg")


spaCy config written to data/spacy_config.cfg


In [31]:

# 5.3  Train the spaCy text categorizer

from spacy.cli.train import train as spacy_train

os.makedirs("models/spacy_textcat_bbc", exist_ok=True)

print("Training spaCy TextCategorizer (this may take a few minutes)...\n")
spacy_train(
    "data/spacy_config.cfg",
    output_path="models/spacy_textcat_bbc",
    overrides={"paths.train": "data/bbc_train.spacy",
               "paths.dev":   "data/bbc_test.spacy"}
)
print("\nTraining complete.")


Training spaCy TextCategorizer (this may take a few minutes)...

ℹ Saving to output directory: models/spacy_textcat_bbc
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.16        7.46    0.07
  0     200         30.84       59.62    0.60
  0     400         18.34       56.80    0.57
  0     600         20.83       69.45    0.69
  0     800         18.06       71.36    0.71
  0    1000         13.22       73.15    0.73
  0    1200         13.00       87.42    0.87
  0    1400         10.95       91.99    0.92
  0    1600          6.55       87.80    0.88
  1    1800          8.27       88.43    0.88
  1    2000          3.93       91.39    0.91
  1    2200          3.04       91.06    0.91
  1    2400          5.60       90.27    0.90
  1    2600          4.26   

The spaCy training loop uses the **TextCatEnsemble** architecture, which combines two sub-models: a **bag-of-words linear model** (fast, high-recall) and a **CNN-based tok2vec model** (slower, captures word order and local context). The ensemble's final prediction blends both signals, typically outperforming either alone.

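During training, the console logger reports the loss and the evaluation scores on the dev set (our test split) at regular intervals. spaCy writes two checkpoints to the output directory, `model-best` and `model-last`; we load `models/spacy_textcat_bbc/model-last` in the next step. In the run shown here the dev score passes 90 within the first two epochs, and the full test-set evaluation in step 5.5 reports 93% accuracy. That is a few points below the SVM + BERT approach, because spaCy learns its own embeddings from scratch rather than leveraging a pre-trained transformer. Because the test split doubles as the dev set, it also influences early stopping, so these scores may be slightly optimistic.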

In [32]:

# 5.4  Test on a single example

nlp_trained = spacy.load("models/spacy_textcat_bbc/model-last")

input_text = test_bbc.iloc[1]["text"]
true_label = test_bbc.iloc[1]["label_text"]

doc = nlp_trained(input_text)
print(f"True label: {true_label}")
print(f"Predicted probabilities: {doc.cats}")
print(f"Predicted class: {max(doc.cats, key=doc.cats.get)}")


True label: politics
Predicted probabilities: {'tech': 1.1580999853322282e-05, 'business': 5.808704827359179e-06, 'sport': 1.5665534647268942e-07, 'entertainment': 2.0026304525799787e-07, 'politics': 0.9999822378158569}
Predicted class: politics


The `doc.cats` dictionary contains a probability score for each of the 5 categories, summing to $\sim$1.0. The predicted class is simply the one with the highest probability. These scores are useful beyond classification — they provide a **confidence measure** that can drive downstream decisions: route low-confidence predictions to human reviewers, flag borderline cases, or adjust thresholds for different precision/recall trade-offs.
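One way to act on these scores is a simple threshold rule. The sketch below is a hypothetical illustration: `route_prediction`, the 0.8 threshold, and the two score dictionaries are assumptions rather than spaCy functionality, and a real threshold should be tuned on held-out data.

```python
def route_prediction(cats, threshold=0.8):
    """Return (label, confidence, needs_review) for a dict of class scores."""
    label = max(cats, key=cats.get)
    confidence = cats[label]
    return label, confidence, confidence < threshold

# Hypothetical score dictionaries shaped like doc.cats
confident = {"tech": 0.01, "business": 0.02, "sport": 0.0,
             "entertainment": 0.0, "politics": 0.97}
borderline = {"tech": 0.05, "business": 0.48, "sport": 0.0,
              "entertainment": 0.02, "politics": 0.45}

for name, cats in [("confident", confident), ("borderline", borderline)]:
    label, conf, review = route_prediction(cats)
    action = "send to human review" if review else "accept automatically"
    print(f"{name}: {label} ({conf:.2f}) -> {action}")
```

Raising the threshold sends more articles to reviewers and lowers the error rate of the automatically accepted ones.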

In [33]:

# 5.5  Full test set evaluation

def get_prediction(input_text, nlp_model, tgt_names):
    doc = nlp_model(input_text)
    category = max(doc.cats, key=doc.cats.get)
    return tgt_names.index(category)

test_bbc_spacy = test_bbc.copy()
test_bbc_spacy["prediction"] = test_bbc_spacy["text"].apply(
    lambda x: get_prediction(x, nlp_trained, label_list))

print("=== spaCy TextCategorizer -- Test Set ===")
print(classification_report(test_bbc_spacy["label"],
                            test_bbc_spacy["prediction"],
                            target_names=label_list))


=== spaCy TextCategorizer -- Test Set ===
               precision    recall  f1-score   support

         tech       0.92      0.97      0.95        80
     business       0.89      0.92      0.90       102
        sport       0.99      0.97      0.98       102
entertainment       0.93      0.90      0.91        77
     politics       0.91      0.87      0.89        84

     accuracy                           0.93       445
    macro avg       0.93      0.93      0.93       445
 weighted avg       0.93      0.93      0.93       445



In this run the spaCy model reaches **93% test accuracy** on the BBC dataset. The per-class scores tell a richer story:

- **Precision** answers: "Of all articles the model labeled as category $c$, what fraction truly belonged to $c$?"
- **Recall** answers: "Of all articles that truly belonged to $c$, what fraction did the model find?"
- **F1-score** is the harmonic mean: $F_1 = \frac{2 \cdot P \cdot R}{P + R}$

The harmonic mean penalizes imbalance: if either $P$ or $R$ is low, $F_1$ drops sharply. This makes per-class $F_1$ more informative than overall accuracy when class sizes differ or when one class is much harder than the others.
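These three metrics can be computed directly from a confusion matrix. As a worked example, the sketch below uses the SVM confusion matrix printed in Recipe 4 (step 4.6), since no matrix is printed for the spaCy model. `per_class_metrics` is a hypothetical helper name.

```python
target_names = ["tech", "business", "sport", "entertainment", "politics"]

# SVM confusion matrix from step 4.6 (rows = true, cols = predicted)
cm = [
    [76, 0, 1, 2, 1],
    [1, 99, 1, 1, 0],
    [0, 0, 102, 0, 0],
    [0, 0, 0, 76, 1],
    [1, 2, 0, 0, 81],
]

def per_class_metrics(matrix, class_index):
    """Precision, recall, and F1 for one class of a square confusion matrix."""
    tp = matrix[class_index][class_index]
    predicted = sum(row[class_index] for row in matrix)  # column sum
    actual = sum(matrix[class_index])                    # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

for idx, name in enumerate(target_names):
    p, r, f1 = per_class_metrics(cm, idx)
    print(f"{name:>13}: P={p:.2f}  R={r:.2f}  F1={f1:.2f}")
```

Precision reads down a column (everything predicted as the class) and recall reads across a row (everything that truly belongs to it). Keeping that picture in mind makes the two easy to tell apart.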

**Comparison with SVM:** The SVM + BERT approach scores about five points higher here (98% vs. 93%) because it starts with pre-trained 384-dimensional embeddings that encode rich semantic knowledge. spaCy's model learns embeddings from scratch using only $\sim 1{,}780$ training articles, which is a much harder task. With more training data, the gap would likely narrow.

---

## Recipe 6 — Classifying Texts Using OpenAI Models

Large language models like GPT-4o-mini can classify text in a **zero-shot** setting — no training data, no fine-tuning. We simply describe the task in the prompt and the model returns a label. This is possible because the model has already learned about topics, sentiment, and language structure during pre-training on a massive text corpus.

The trade-offs are clear: zero-shot LLM classification requires **no labeled data** and **no training time**, but it costs **API credits per prediction** and gives you **less control** over the decision boundary.

In [34]:

# 6.1  Set up OpenAI client

import openai
from google.colab import userdata

# Securely fetch API key from Colab Secrets
api_key = userdata.get("OPENAI_API_KEY")

client = openai.OpenAI(api_key=api_key)
print("OpenAI client initialized.")


OpenAI client initialized.


We retrieve the API key from **Colab Secrets** (`userdata.get`) rather than hard-coding it. This keeps the key out of your notebook history and version control. To set this up, click the key icon in the Colab sidebar and add a secret named `OPENAI_API_KEY`.

In [35]:

# 6.2  Load test data (fresh from Hugging Face)

test_dataset = load_dataset("SetFit/bbc-news", split="test")

# Quick check
example  = test_dataset[0]["text"]
category = test_dataset[0]["label_text"]
print(f"Example category: {category}")
print(f"First 200 chars : {example[:200]}...")


Example category: entertainment
First 200 chars : carry on star patsy rowlands dies actress patsy rowlands  known to millions for her roles in the carry on films  has died at the age of 71.  rowlands starred in nine of the popular carry on films  alo...


In [36]:

# 6.3  Single-example classification

prompt = (
    "You are classifying texts by topics. There are 5 topics: "
    "tech, entertainment, business, politics and sport. "
    "Output the topic and nothing else. For example, if the topic "
    'is business, your output should be "business". '
    "Given the following text, what is its topic from the above list "
    "without any additional explanations: " + example
)

response = client.chat.completions.create(
    model="gpt-4o-mini",      # fast, cheap, strong
    temperature=0,             # deterministic output
    max_tokens=20,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": prompt}
    ],
)

result = response.choices[0].message.content.strip().lower()
print(f"True label  : {category}")
print(f"GPT predicts: {result}")


True label  : entertainment
GPT predicts: entertainment


We set `temperature=0` to make the output as deterministic as possible: the model picks the most likely token at each step, although the API does not guarantee identical responses across calls. The `max_tokens=20` limit ensures we do not waste credits on verbose answers; we only need a single word.

The prompt engineering here follows important principles: we explicitly list the valid categories, provide an example of the expected output format, and instruct the model to output *nothing else*. Despite these instructions, the model occasionally adds extra words — we handle that with post-processing below.

We use `gpt-4o-mini` as a cost-effective replacement for the cookbook's original `gpt-3.5-turbo`. It offers comparable or better classification accuracy at lower cost per token.

In [37]:

# 6.4  Classify a sample of 200 test articles

def get_gpt_classification(input_text, client_obj):
    """Query GPT to classify a single text into one of 5 BBC topics."""
    prompt = (
        "You are classifying texts by topics. There are 5 topics: "
        "tech, entertainment, business, politics and sport. "
        "Output the topic and nothing else. For example, if the topic "
        'is business, your output should be "business". '
        "Given the following text, what is its topic from the above list "
        "without any additional explanations: " + input_text
    )
    response = client_obj.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=20,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user",   "content": prompt}
        ],
    )
    return response.choices[0].message.content.strip().lower()


test_df_gpt = test_dataset.to_pandas()
test_df_gpt = test_df_gpt.sample(frac=1, random_state=42).reset_index(drop=True)
test_data   = test_df_gpt.iloc[:200].copy()

print(f"Classifying {len(test_data)} articles with GPT-4o-mini...")
print("(This may take 1-3 minutes depending on API rate limits)")

test_data["gpt_prediction"] = test_data["text"].apply(
    lambda x: get_gpt_classification(x, client))

print("Done.")
print(test_data[["label_text", "gpt_prediction"]].head(10))


Classifying 200 articles with GPT-4o-mini...
(This may take 1-3 minutes depending on API rate limits)
Done.
  label_text gpt_prediction
0       tech       business
1   business       business
2   business       business
3       tech           tech
4   business       business
5       tech  entertainment
6   politics       politics
7   business       business
8   business       business
9   business       business


In [38]:

# 6.5  Clean predictions and evaluate

import re

label_list_gpt = ["tech", "business", "sport",
                  "entertainment", "politics"]

def get_one_word_match(input_text):
    """Extract the first valid category from GPT response."""
    match = re.search(
        r"tech|entertainment|business|sport|politics",
        input_text)
    if match:
        return match.group(0)   # the matched category name
    return "unknown"

test_data["gpt_prediction"] = test_data["gpt_prediction"].apply(
    get_one_word_match)

# Convert to numeric label
test_data["gpt_label"] = test_data["gpt_prediction"].apply(
    lambda x: label_list_gpt.index(x) if x in label_list_gpt else -1)

# Drop any rows where GPT gave an unrecognized answer
valid_mask = test_data["gpt_label"] >= 0
if (~valid_mask).sum() > 0:
    print(f"Warning: {(~valid_mask).sum()} unrecognized predictions dropped")
test_data_clean = test_data[valid_mask].copy()

print(f"\n=== GPT-4o-mini Classification -- {len(test_data_clean)} articles ===")
print(classification_report(test_data_clean["label"],
                            test_data_clean["gpt_label"],
                            target_names=label_list_gpt))



=== GPT-4o-mini Classification -- 200 articles ===
               precision    recall  f1-score   support

         tech       1.00      0.80      0.89        41
     business       0.93      0.94      0.93        53
        sport       1.00      1.00      1.00        43
entertainment       0.94      0.94      0.94        34
     politics       0.81      1.00      0.89        29

     accuracy                           0.94       200
    macro avg       0.93      0.94      0.93       200
 weighted avg       0.94      0.94      0.93       200



GPT-4o-mini typically achieves **90--95% accuracy** on this task — remarkable given that it has never seen a single BBC training example. This is the power of zero-shot classification with large language models: the model's vast pre-training knowledge allows it to understand what "tech", "sport", and "politics" mean and map articles accordingly.

However, this approach has important production trade-offs. On the cost side, each classification requires an API call, which introduces both latency ($\sim$200--500ms per request) and monetary cost. For 200 articles the total is negligible, but classifying millions of documents would cost hundreds of dollars. On the control side, you cannot directly inspect or adjust the decision boundary — if the model consistently confuses "business" and "politics", your only lever is prompt engineering, not feature engineering or hyperparameter tuning.
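A back-of-the-envelope estimate makes the cost and latency argument concrete. Every input below is a placeholder assumption: the token count per article, the price per million tokens, and the per-call latency are illustrative and are not actual OpenAI pricing or measured values. Substitute current numbers before relying on the result.

```python
def estimate_batch(n_docs, tokens_per_doc, usd_per_million_tokens,
                   seconds_per_call):
    """Rough cost (USD) and sequential wall-clock time (hours) for a batch."""
    total_tokens = n_docs * tokens_per_doc
    cost_usd = total_tokens / 1_000_000 * usd_per_million_tokens
    hours = n_docs * seconds_per_call / 3600
    return cost_usd, hours

# Placeholder assumptions, NOT real prices or measurements
cost, hours = estimate_batch(
    n_docs=3_000_000,
    tokens_per_doc=500,
    usd_per_million_tokens=0.20,
    seconds_per_call=0.35,
)
print(f"Estimated cost : ${cost:,.0f}")
print(f"Sequential time: {hours:,.0f} hours")
```

The time figure assumes one request at a time. Parallel requests would shorten it, subject to the provider's rate limits.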

### Summary: Method Comparison

| Method | Accuracy | Training Data Needed | Latency | Interpretability |
|--------|----------|---------------------|---------|-----------------|
| Keywords (Recipe 2) | $\sim$62% | Labels + vocabulary | Instant | High |
| K-Means (Recipe 3) | N/A (unsupervised) | None | Instant | Moderate |
| SVM + BERT (Recipe 4) | $\sim$98% | Labeled corpus | Fast | Low |
| spaCy CNN (Recipe 5) | $\sim$93% | Labeled corpus | Fast | Low |
| GPT-4o-mini (Recipe 6) | $\sim$94% (200-article sample) | None (zero-shot) | Slow (API) | Low |

The SVM + BERT approach achieves the best test accuracy while keeping inference fast and free (no API calls). GPT is the fastest to deploy (no training) but the most expensive to run. spaCy offers a middle ground with a self-contained model that requires no external dependencies at inference time. The right choice depends on your specific constraints: budget, latency requirements, data availability, and regulatory environment.

---

## Summary and Key Takeaways

This chapter walked through the full spectrum of text classification approaches, from zero-parameter keyword matching to zero-shot large language models. The key insights:

**1. Baselines matter.** The keyword classifier (62% test accuracy) sets the floor. A model that cannot beat this number adds complexity without capturing anything that simple keyword counts do not already provide.

**2. Representation is (almost) everything.** The jump from TF-IDF (K-Means) to BERT embeddings (SVM) demonstrates that better text representations produce better classifiers, often with simpler algorithms. The SVM itself is a decades-old method; the magic comes from the embeddings.

**3. The bias-variance trade-off is alive and well.** The keyword classifier overfits to training vocabulary (87% train vs. 62% test). The SVM with $C = 0.1$ regularization shows minimal overfitting. Understanding and controlling this trade-off is a core ML skill.

**4. Unsupervised methods have unique value.** K-Means does not need labels, making it invaluable for exploratory analysis, data auditing, and cold-start scenarios where labeled data does not yet exist.

**5. LLMs are not always the answer.** GPT-4o-mini achieves strong accuracy with zero training data, but at the cost of API dependency, latency, and ongoing per-prediction expense. For high-volume production workloads, a trained SVM or spaCy model is often the more pragmatic choice.

In the next chapter, we will build on these foundations to tackle more complex NLP tasks including sequence labeling and named entity recognition.