
# Chapter 3: Representing Text -- Capturing Semantics

Representing the meaning of words, phrases, and sentences in a form that computers can manipulate is one of the pillars of NLP. Machine learning algorithms expect each data point as a **fixed-size numeric vector** $\mathbf{x} \in \mathbb{R}^d$, so we must answer a fundamental question: *how do we turn words and sentences into vectors?*

This chapter surveys a progression of increasingly powerful text representations, from simple counting methods to neural embeddings, and finally to retrieval-augmented generation (RAG). We evaluate each method by plugging it into the **same logistic regression classifier** on a sentiment analysis task, isolating the effect of the representation from the choice of model.

The progression we follow mirrors the historical development of the field:

$$\text{POS counts} \to \text{Bag of Words} \to \text{N-grams} \to \text{TF-IDF} \to \text{Word2Vec} \to \text{BERT} \to \text{RAG}$$

Each step adds more semantic information to the representation, generally improving downstream task performance -- but also increasing computational cost and complexity. Understanding these tradeoffs is essential for any ML practitioner.

## Environment Setup

We install all required packages up front. The key libraries are **scikit-learn** (for vectorizers and classifiers), **gensim** (for word2vec), **sentence-transformers** (for BERT embeddings), and the Hugging Face **datasets** library (for loading the Rotten Tomatoes corpus).

In [1]:
# Install required packages
!pip install -q spacy datasets gensim scikit-learn sentence-transformers textblob
!python -m spacy download en_core_web_sm -q

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m27.9/27.9 MB[0m [31m44.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m37.6 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
import spacy
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

small_model = spacy.load("en_core_web_sm")
print("spaCy model loaded:", small_model.meta["name"])
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)

spaCy model loaded: core_web_sm
NumPy version: 2.0.2
Pandas version: 2.2.2


With the environment ready, we proceed to building the classifier infrastructure that will be reused throughout the chapter. Every representation method will be evaluated by the same logistic regression model on the same dataset, so differences in accuracy reflect differences in the quality of the text representation.

## 3.1 Creating a Simple Classifier

Before exploring different text representations, we need a **controlled experimental setup**. We build a logistic regression classifier for **sentiment analysis** on the Rotten Tomatoes movie review dataset (available via Hugging Face). By keeping the classifier constant and only varying the vectorizer, we isolate the effect of the text representation.

**Logistic regression** predicts the probability that a review is positive:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}$$

where $\mathbf{x} \in \mathbb{R}^d$ is the text vector, $\mathbf{w} \in \mathbb{R}^d$ are learned weights, $b$ is a bias term, and $\sigma(\cdot)$ is the sigmoid function. The model is trained by minimizing the **regularized cross-entropy loss**:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr] + \frac{1}{2C}\|\mathbf{w}\|_2^2$$

The regularization parameter $C = 0.1$ (which we use throughout) means strong regularization -- appropriate when feature dimensions are large relative to the number of training samples.
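To make the formula concrete, here is a minimal pure-Python sketch that evaluates the sigmoid and the regularized loss exactly as written above. The function names and the two-point toy dataset are illustrative assumptions, not part of the chapter's code, and scikit-learn's solver scales its internal objective differently, so treat this as an illustration of the equation rather than of the library.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """P(y = 1 | x) for weight vector w, bias b, and feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def regularized_loss(w, b, X, y, C=0.1):
    """Mean cross-entropy plus the L2 penalty ||w||^2 / (2C), as in the text."""
    data_term = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, b, xi)
        data_term -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    penalty = sum(wj * wj for wj in w) / (2.0 * C)
    return data_term / len(X) + penalty

# Hypothetical toy data: two 2-dimensional "documents", one per class
X_toy = [[1.0, 0.0], [0.0, 1.0]]
y_toy = [1, 0]

zero_loss = regularized_loss([0.0, 0.0], 0.0, X_toy, y_toy)   # all-zero weights
big_loss = regularized_loss([1.0, -1.0], 0.0, X_toy, y_toy)   # better fit, larger weights
print(f"zero weights: {zero_loss:.4f}, larger weights: {big_loss:.4f}")
```

With $C = 0.1$ the penalty dominates: the second weight vector fits the toy data better, yet its total loss is higher, which is what "strong regularization" means in practice.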

### 3.1.1 Loading the Dataset

In [3]:
from datasets import load_dataset

train_dataset = load_dataset("rotten_tomatoes",
    split="train[:15%]+train[-15%:]")
test_dataset = load_dataset("rotten_tomatoes",
    split="test[:15%]+test[-15%:]")

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples:     {len(test_dataset)}")

README.md: 0.00B [00:00, ?B/s]

train.parquet:   0%|          | 0.00/699k [00:00<?, ?B/s]

validation.parquet:   0%|          | 0.00/90.0k [00:00<?, ?B/s]

test.parquet:   0%|          | 0.00/92.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8530 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1066 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1066 [00:00<?, ? examples/s]



Training samples: 2560
Test samples:     320


We use $15\%$ from the beginning and $15\%$ from the end of each split, yielding **2,560 training** and **320 test** samples. This is only $30\%$ of the full dataset but keeps training times short for experimentation. The concatenation of head and tail portions ensures we sample from both classes (positive and negative reviews), since the dataset is ordered by label.

**Dataset structure.** Each sample has two fields: `text` (the review) and `label` ($0$ = negative, $1$ = positive). The dataset is balanced -- $50\%$ positive and $50\%$ negative in both splits. This means a random baseline would achieve $50\%$ accuracy, which is our floor.

### 3.1.2 The POS Vectorizer -- A Baseline

Our baseline representation encodes each review as a $10$-dimensional vector counting parts of speech: sentence length plus counts of verbs, nouns, proper nouns, adjectives, adverbs, auxiliaries, pronouns, numbers, and punctuation marks. This is intentionally crude -- it captures no word-level information, only broad grammatical statistics.

In [4]:
class POS_vectorizer:
    def __init__(self, spacy_model):
        self.model = spacy_model

    def vectorize(self, input_text):
        doc = self.model(input_text)
        vector = [len(doc)]
        pos = {"VERB": 0, "NOUN": 0, "PROPN": 0, "ADJ": 0,
               "ADV": 0, "AUX": 0, "PRON": 0, "NUM": 0, "PUNCT": 0}
        for token in doc:
            if token.pos_ in pos:
                pos[token.pos_] += 1
        vector.extend(pos.values())
        return vector

sample_text = train_dataset[0]["text"]
vectorizer = POS_vectorizer(small_model)
vector = vectorizer.vectorize(sample_text)
print(sample_text)
print(vector)

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
[38, 3, 8, 2, 5, 1, 3, 1, 0, 5]


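The review contains $38$ tokens, broken down as: $3$ verbs, $8$ nouns, $2$ proper nouns, $5$ adjectives, $1$ adverb, $3$ auxiliaries, $1$ pronoun, $0$ numbers, and $5$ punctuation marks. The remaining $10$ tokens carry tags we do not count (determiners, prepositions, conjunctions, and so on). We can sanity-check the punctuation count: the two quotation marks around "conan", one comma, and one period account for four; the fifth is most likely the hyphen in "jean-claud", which spaCy typically splits off as its own token.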

This $10$-dimensional vector is an extreme compression of a $38$-token review. Almost all word-level information is lost; we cannot distinguish "this movie is great" from "this movie is terrible" since both have the same POS distribution. We expect near-chance accuracy.

### 3.1.3 Training and Evaluating the Baseline

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

train_df = train_dataset.to_pandas()
train_df = train_df.sample(frac=1, random_state=42)
test_df = test_dataset.to_pandas()

vectorizer = POS_vectorizer(small_model)

train_df["vector"] = train_df["text"].apply(
    lambda x: vectorizer.vectorize(x))
test_df["vector"] = test_df["text"].apply(
    lambda x: vectorizer.vectorize(x))

X_train = np.stack(train_df["vector"].values, axis=0)
X_test = np.stack(test_df["vector"].values, axis=0)
y_train = train_df["label"].to_numpy()
y_test = test_df["label"].to_numpy()

clf = LogisticRegression(C=0.1, max_iter=1000)
clf = clf.fit(X_train, y_train)

test_df["prediction"] = clf.predict(X_test)
print(classification_report(y_test, test_df["prediction"]))

              precision    recall  f1-score   support

           0       0.58      0.56      0.57       160
           1       0.58      0.59      0.58       160

    accuracy                           0.58       320
   macro avg       0.58      0.58      0.58       320
weighted avg       0.58      0.58      0.58       320



As expected, the POS-count baseline achieves only **58% accuracy** -- modestly above the $50\%$ random baseline. With only $d = 10$ features encoding coarse grammatical statistics, the model has very little signal to distinguish positive from negative sentiment.

**Why does this fail?** Sentiment lives in the *words themselves* (e.g., "brilliant" vs. "terrible"), not in abstract POS counts. Both positive and negative reviews use similar distributions of nouns, verbs, and adjectives. This baseline quantifies the **floor** -- any representation that clearly outperforms $58\%$ is capturing genuine semantic information.

**Experimental design insight.** This is exactly why we start with a bad baseline: it calibrates our expectations and makes improvements from better representations clearly measurable.
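As a sanity check, the overall accuracy can be recovered from the per-class recall and support columns of the classification report above, since recall times support gives the number of correctly classified examples in each class. The sketch below hard-codes the rounded values printed in the report, so the reconstruction is only approximate.

```python
# Per-class recall and support, copied from the classification report above
# (the report rounds recall to two decimals).
recall = {0: 0.56, 1: 0.59}
support = {0: 160, 1: 160}

# recall * support approximates the number of correct predictions per class
correct = sum(recall[c] * support[c] for c in recall)
accuracy = correct / sum(support.values())

print(f"Reconstructed accuracy: {accuracy:.3f}")
```

The result is about $0.575$, consistent with the $0.58$ accuracy row of the report once rounding is taken into account.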

### 3.1.4 Reusable Utility Functions

We now package the dataset loading, vectorization, training, and testing into reusable functions. In subsequent sections, we only swap the `vectorize` function while keeping everything else identical.

In [6]:
def load_train_test_dataset_pd():
    train_dataset = load_dataset("rotten_tomatoes",
        split="train[:15%]+train[-15%:]")
    test_dataset = load_dataset("rotten_tomatoes",
        split="test[:15%]+test[-15%:]")
    train_df = train_dataset.to_pandas()
    train_df = train_df.sample(frac=1, random_state=42)
    test_df = test_dataset.to_pandas()
    return (train_df, test_df)

def create_train_test_data(train_df, test_df, vectorize):
    train_df["vector"] = train_df["text"].apply(
        lambda x: vectorize(x))
    test_df["vector"] = test_df["text"].apply(
        lambda x: vectorize(x))
    X_train = np.stack(train_df["vector"].values, axis=0)
    X_test = np.stack(test_df["vector"].values, axis=0)
    y_train = train_df["label"].to_numpy()
    y_test = test_df["label"].to_numpy()
    return (X_train, X_test, y_train, y_test)

def train_classifier(X_train, y_train):
    clf = LogisticRegression(C=0.1, max_iter=1000)
    clf = clf.fit(X_train, y_train)
    return clf

def test_classifier(test_df, clf):
    test_df = test_df.copy()
    test_df["prediction"] = test_df["vector"].apply(
        lambda x: clf.predict([x])[0])
    print(classification_report(test_df["label"],
        test_df["prediction"]))

print("Utility functions defined. Ready for experiments.")

Utility functions defined. Ready for experiments.


These four functions form our **experimental harness**. For every new vectorizer, the workflow is: (1) define a `vectorize(text) -> vector` function, (2) call `create_train_test_data`, (3) call `train_classifier`, (4) call `test_classifier`. This controlled setup ensures that any change in accuracy is attributable solely to the representation, not to differences in data splitting or model hyperparameters.

## 3.2 Putting Documents into a Bag of Words

The **bag of words (BoW)** model is the simplest meaningful text representation. It treats each document as an unordered collection of words and represents it as a vector of word counts. Given a vocabulary $V = \{w_1, w_2, \ldots, w_{|V|}\}$, each document $d$ is encoded as:

$$\mathbf{x}_d = \bigl[\text{count}(w_1, d), \; \text{count}(w_2, d), \; \ldots, \; \text{count}(w_{|V|}, d)\bigr] \in \mathbb{R}^{|V|}$$

The name "bag of words" reflects the fact that **word order is completely ignored** -- only frequencies matter. Despite this limitation, BoW is a surprisingly strong baseline for many classification tasks.
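The definition above can be sketched by hand in a few lines of pure Python. This is a simplified illustration: it tokenizes on whitespace, which is an assumption made here for brevity (scikit-learn's default tokenizer instead lowercases the text and keeps tokens of two or more alphanumeric characters), and the two toy documents are made up.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of every distinct (lowercased, whitespace-split) token."""
    return sorted({word for doc in docs for word in doc.lower().split()})

def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

# Hypothetical toy corpus
docs = ["great movie great cast", "terrible movie"]
vocab = build_vocabulary(docs)
vectors = [bow_vector(doc, vocab) for doc in docs]

print(vocab)    # vocabulary, in alphabetical order
print(vectors)  # one count vector per document, aligned with vocab
```

Each vector has one slot per vocabulary word, so its length is $|V|$ regardless of how long the document is.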

We use scikit-learn's `CountVectorizer`, which handles tokenization, vocabulary construction, and count computation in a single pipeline.

### 3.2.1 Building the Count Matrix

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

(train_df, test_df) = load_train_test_dataset_pd()

vectorizer = CountVectorizer(max_df=0.4)
X = vectorizer.fit_transform(train_df["text"])
print(type(X))
print(f"Shape: {X.shape}")
print(f"Non-zero entries: {X.nnz}")
print()
print("First 3 rows of the sparse matrix:")
print(X[:3])

<class 'scipy.sparse._csr.csr_matrix'>
Shape: (2560, 8856)
Non-zero entries: 39134

First 3 rows of the sparse matrix:
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 57 stored elements and shape (3, 8856)>
  Coords	Values
  (0, 6979)	1
  (0, 7757)	2
  (0, 5439)	2
  (0, 594)	1
  (0, 6767)	1
  (0, 4911)	1
  (0, 4219)	1
  (0, 6240)	1
  (0, 8024)	1
  (0, 3594)	1
  (0, 8830)	1
  (0, 4562)	1
  (1, 4219)	1
  (1, 5292)	1
  (1, 346)	1
  (1, 5324)	1
  (1, 3125)	1
  (1, 1234)	2
  (1, 3387)	1
  (1, 1929)	1
  (1, 1174)	1
  (1, 2215)	1
  (1, 2860)	1
  (1, 7889)	2
  (1, 5260)	1
  :	:
  (1, 8577)	1
  (1, 3968)	1
  (1, 4228)	1
  (2, 7889)	1
  (2, 5406)	1
  (2, 979)	1
  (2, 766)	1
  (2, 514)	2
  (2, 1481)	1
  (2, 5466)	1
  (2, 5951)	1
  (2, 2654)	1
  (2, 2703)	1
  (2, 1391)	1
  (2, 5134)	1
  (2, 4858)	1
  (2, 3019)	1
  (2, 4391)	1
  (2, 4581)	1
  (2, 579)	1
  (2, 8693)	1
  (2, 301)	1
  (2, 717)	1
  (2, 343)	1
  (2, 2875)	1


The result is a **sparse matrix** of shape $2{,}560 \times 8{,}856$ -- that is, $2{,}560$ documents (reviews) each represented as a vector of dimension $8{,}856$ (the vocabulary size). The matrix stores only $39{,}134$ non-zero entries out of $2{,}560 \times 8{,}856 = 22{,}671{,}360$ total entries.

**Sparsity:** The matrix is $100\% - \frac{39{,}134}{22{,}671{,}360} \times 100\% \approx 99.8\%$ sparse. This is typical for text data -- each review uses only a tiny fraction of the total vocabulary. On average, each review contains $39{,}134 / 2{,}560 \approx 15.3$ unique vocabulary words (after the `max_df` filter).

The sparse format stores only non-zero entries as `(row, column) value`, using $\sim 39{,}134 \times 12$ bytes $\approx 0.47$ MB (an 8-byte count plus a 4-byte column index per entry) instead of the $\sim 181$ MB a dense `int64` matrix would require. This roughly $390\times$ memory saving is why scikit-learn defaults to sparse storage for text features.
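The sparsity and memory arithmetic can be reproduced directly from the shape and non-zero count printed by the cell above. The byte sizes are assumptions about the storage layout (8-byte `int64` values and 4-byte column indices, ignoring the small row-pointer array), so the totals are estimates.

```python
# Values taken from the cell output above
n_docs, vocab_size, nnz = 2560, 8856, 39134

total_entries = n_docs * vocab_size
sparsity = 1 - nnz / total_entries          # fraction of entries that are zero
avg_words_per_review = nnz / n_docs         # distinct vocabulary words per review

# Assumed layout: 8-byte value + 4-byte column index per stored entry
sparse_bytes = nnz * (8 + 4)
dense_bytes = total_entries * 8             # every entry stored as int64

print(f"Total entries:     {total_entries:,}")
print(f"Sparsity:          {sparsity:.2%}")
print(f"Avg words/review:  {avg_words_per_review:.1f}")
print(f"Sparse vs dense:   {sparse_bytes / 1e6:.2f} MB vs {dense_bytes / 1e6:.0f} MB")
```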

### 3.2.2 Vocabulary and Stop Words

In [8]:
print("Vocabulary (first and last 10):")
features = vectorizer.get_feature_names_out()
print(f"  First 10: {list(features[:10])}")
print(f"  Last 10:  {list(features[-10:])}")
print(f"  Total vocabulary size: {len(features)}")
print()

# stop_words_ may be missing on the fitted vectorizer in some scikit-learn versions, so use getattr() with a fallback
stop_words_dropped = getattr(vectorizer, 'stop_words_', 'None generated')
print(f"Stop words (max_df=0.4): {stop_words_dropped}")

Vocabulary (first and last 10):
  First 10: ['10', '100', '101', '102', '104', '11', '110', '11th', '12', '13']
  Last 10:  ['zhang', 'zhao', 'zigs', 'zigzag', 'zingers', 'zip', 'zippy', 'zone', 'ótimo', 'últimos']
  Total vocabulary size: 8856

Stop words (max_df=0.4): None generated


With `max_df=0.4`, any word that appears in more than $40\%$ of documents is dropped as a corpus-specific stop word. In this corpus that affects only a handful of high-frequency function words such as `"the"`, `"and"`, and `"of"`, which carry little discriminative power for sentiment. Note that the printed `None generated` does not mean nothing was filtered: it is the `getattr` fallback, shown because the fitted vectorizer in this environment does not expose a `stop_words_` attribute. The remaining vocabulary of **8,856 features** includes everything from common adjectives to rare proper nouns.

**Vocabulary composition.** The vocabulary is sorted alphabetically. We see numbers (`"10"`, `"100"`), which come from review text mentioning years, ratings, or other numeric references. The non-English entries at the end of the list (`"ótimo"`, `"últimos"`) show that the Rotten Tomatoes dataset contains some non-English text, which can affect classifier performance.

### 3.2.3 Vectorizing a New Review

In [9]:
first_review = test_df['text'].iat[0]
print("Review:", first_review)
print()

sparse_vector = vectorizer.transform([first_review])
print(f"Sparse representation ({sparse_vector.nnz} non-zero entries):")
print(sparse_vector)

Review: lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .

Sparse representation (13 non-zero entries):
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 13 stored elements and shape (1, 8856)>
  Coords	Values
  (0, 955)	1
  (0, 3968)	1
  (0, 4451)	1
  (0, 4562)	1
  (0, 4622)	1
  (0, 4688)	1
  (0, 4779)	1
  (0, 4792)	1
  (0, 5764)	1
  (0, 7547)	1
  (0, 7715)	1
  (0, 8000)	1
  (0, 8734)	1


This positive review about *Stuart Little 2* has **13 non-zero entries** in its $8{,}856$-dimensional vector. Only $13 / 8{,}856 \approx 0.15\%$ of dimensions are active. Each non-zero entry has value $1$, meaning every vocabulary word in this review appears exactly once -- a common pattern for short movie reviews.

**What is lost.** The bag of words cannot distinguish "not great" from "great" -- both contribute the same counts for `"great"`. The word `"not"` might be removed as a stop word, or even if kept, its negating role is invisible to a model that ignores word order. This motivates n-gram models (next section).
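The loss of word order is easy to demonstrate: two sentences with opposite meanings can produce exactly the same bag. The sketch below uses whitespace tokenization and two made-up sentences, both assumptions for illustration.

```python
from collections import Counter

def bag(text):
    """Unordered word counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

positive = "the plot is good not bad"
negative = "the plot is bad not good"

# Same words, same counts, opposite sentiment
same_bag = bag(positive) == bag(negative)
print(same_bag)
```

Because the two bags compare equal, any classifier built on these counts must assign both sentences the same prediction.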

### 3.2.4 Classifier Performance with Bag of Words

In [10]:
vectorizer = CountVectorizer(max_df=0.8)
(train_df, test_df) = load_train_test_dataset_pd()
X = vectorizer.fit_transform(train_df["text"])

vectorize = lambda x: vectorizer.transform([x]).toarray()[0]

(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)

              precision    recall  f1-score   support

           0       0.74      0.72      0.73       160
           1       0.73      0.74      0.74       160

    accuracy                           0.73       320
   macro avg       0.73      0.73      0.73       320
weighted avg       0.73      0.73      0.73       320



Bag of words achieves **73% accuracy** -- a $15$ percentage point jump from the $58\%$ POS-count baseline. This demonstrates that **word identity** (which specific words appear) is far more informative for sentiment than abstract grammatical statistics.

The classifier now has $\sim 8{,}000+$ features (one per vocabulary word), and each word effectively gets its own learned weight $w_j$. Words like "brilliant", "masterpiece", and "boring" receive large positive or negative weights, directly encoding sentiment. This is exactly the kind of interpretable model that works well with sparse, high-dimensional bag-of-words representations.

| Representation | Accuracy | Dimensions |
|---|---|---|
| POS counts | 58% | 10 |
| **Bag of Words** | **73%** | **~8,800** |

The $\sim 880\times$ increase in dimensionality buys us a $15$ point accuracy gain. The tradeoff is worth it, but we should ask: can we do better by capturing word *combinations*?

## 3.3 Constructing an N-gram Model

The bag of words throws away all word order information. An **n-gram** model partially recovers it by including sequences of $n$ consecutive words as features. A **bigram** model ($n = 2$) adds word pairs like `"not good"` and `"very bad"` to the vocabulary, capturing local context that single words miss.

Formally, given a document $d = (w_1, w_2, \ldots, w_m)$, the bigram features are:

$$\{(w_i, w_{i+1}) \mid i = 1, \ldots, m-1\}$$

A model with `ngram_range=(1, 2)` uses **both** unigrams and bigrams, so the feature set is the union of all single words and all adjacent word pairs.
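A minimal sketch of how the combined unigram and bigram feature set is formed for a single document. The helper names and whitespace tokenization are assumptions for illustration, not the vectorizer's internals.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unigrams_and_bigrams(text):
    """Feature list comparable to ngram_range=(1, 2): unigrams, then bigrams."""
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)

features = unigrams_and_bigrams("this is not good")
print(features)
```

A document of $m$ tokens yields $m$ unigrams and $m - 1$ bigrams; here the bigram `"not good"` becomes a feature of its own, distinct from `"good"`.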

### 3.3.1 Building a Bigram Vectorizer

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

(train_df, test_df) = load_train_test_dataset_pd()

bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), max_df=0.8)
X = bigram_vectorizer.fit_transform(train_df["text"])

features = bigram_vectorizer.get_feature_names_out()
print(f"Vocabulary size: {len(features)}")
print(f"First 10: {list(features[:10])}")
print(f"Last 10:  {list(features[-10:])}")

Vocabulary size: 40552
First 10: ['10', '10 inch', '10 set', '100', '100 minutes', '100 years', '101', '101 but', '102', '102 minute']
Last 10:  ['zip is', 'zippy', 'zippy comin', 'zippy sampling', 'zone', 'zone is', 'ótimo', 'ótimo esforço', 'últimos', 'últimos tiempos']


The vocabulary explodes from **8,856** (unigrams only) to **40,552** (unigrams + bigrams) -- a $4.6\times$ increase. This is expected: the number of possible bigrams grows roughly as $O(|V|^2)$, though in practice most pairs never co-occur. The actual growth from $8{,}856$ to $40{,}552$ means we added $31{,}696$ bigram features, approximately $3.6$ bigrams per unigram on average.

We can see bigram features like `"10 inch"`, `"100 minutes"`, and `"ótimo esforço"` (Portuguese for "great effort"), the last again showing that some reviews are not in English. Sentiment-carrying bigrams such as `"not good"` and `"very funny"` can now be represented explicitly as features, provided they occur in the training text.

### 3.3.2 Classifier Performance with Bigrams

In [12]:
vectorize = lambda x: bigram_vectorizer.transform([x]).toarray()[0]

(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)

              precision    recall  f1-score   support

           0       0.72      0.75      0.74       160
           1       0.74      0.71      0.73       160

    accuracy                           0.73       320
   macro avg       0.73      0.73      0.73       320
weighted avg       0.73      0.73      0.73       320



Surprisingly, the bigram model also lands at **73% accuracy** -- no better than the unigram model, despite having $4.6\times$ as many features. This counterintuitive result has a plausible explanation rooted in the **bias-variance tradeoff**.

With $40{,}552$ features but only $2{,}560$ training samples, we are in an extreme high-dimensional regime ($p \gg n$, with $p/n \approx 15.8$). Most bigram features appear in only one or two training documents, making them unreliable. The regularized logistic regression ($C = 0.1$) penalizes large weights, but with $4.6\times$ more noisy features, the signal-to-noise ratio decreases.

**When do n-grams help?** With more training data (the full Rotten Tomatoes dataset has $\sim 8{,}500$ training samples), bigrams often outperform unigrams. The $30\%$ subset we use is simply too small to reliably estimate weights for $40{,}552$ features. Additionally, the non-English reviews in the data add noise that dilutes the benefit of English-specific bigrams.

| Representation | Accuracy | Dimensions | $p/n$ ratio |
|---|---|---|---|
| POS counts | 58% | 10 | 0.004 |
| Bag of Words | 73% | ~8,800 | 3.4 |
| **Bigrams** | **73%** | **~40,500** | **15.8** |

**Production takeaway.** More features are not always better. When data is limited, prefer simpler representations or use dimensionality reduction (PCA, feature selection) before adding n-grams.

## 3.4 Representing Texts with TF-IDF

Raw word counts treat all words equally, but intuitively, a word that appears in *every* document (like "movie") carries less discriminative information than a word that appears in only a *few* documents (like "masterpiece"). **TF-IDF** (Term Frequency -- Inverse Document Frequency) formalizes this intuition by weighting words based on both their local frequency and their global rarity.

The TF-IDF score for word $w$ in document $d$ from a corpus $D$ is:

$$\text{tfidf}(w, d, D) = \underbrace{\text{tf}(w, d)}_{\text{local importance}} \times \underbrace{\text{idf}(w, D)}_{\text{global rarity}}$$

where:

$$\text{tf}(w, d) = \frac{\text{count of } w \text{ in } d}{\text{total words in } d}, \qquad \text{idf}(w, D) = \log\frac{|D|}{|\{d \in D : w \in d\}|}$$

Words appearing in many documents have low IDF (close to $0$), while rare words have high IDF. This automatically downweights common words and boosts discriminative ones.
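The textbook formulas above translate almost line by line into Python. This sketch uses a made-up four-document corpus and whitespace tokenization, and it assumes the query word occurs in at least one document (otherwise the IDF denominator is zero). Note that scikit-learn's `TfidfVectorizer` uses a variant by default (raw counts for TF, a smoothed IDF of $\ln\frac{1 + |D|}{1 + \text{df}} + 1$, then L2 normalization), so its numbers will not match these.

```python
import math

def tf(word, doc_tokens):
    """Term frequency: share of the document's tokens equal to `word`."""
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, corpus):
    """Inverse document frequency: log(|D| / number of docs containing word)."""
    doc_freq = sum(1 for doc_tokens in corpus if word in doc_tokens)
    return math.log(len(corpus) / doc_freq)

def tfidf(word, doc_tokens, corpus):
    return tf(word, doc_tokens) * idf(word, corpus)

# Hypothetical toy corpus: "movie" is in every document, "masterpiece" in one
corpus = [text.split() for text in [
    "a masterpiece movie",
    "a dull movie",
    "a fine movie",
    "a long movie",
]]

common_score = tfidf("movie", corpus[0], corpus)
rare_score = tfidf("masterpiece", corpus[0], corpus)
print(f"movie: {common_score:.3f}, masterpiece: {rare_score:.3f}")
```

The word that appears in every document scores exactly zero because $\log(4/4) = 0$, while the rare word keeps a positive weight.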

### 3.4.1 Building the TF-IDF Vectorizer

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

(train_df, test_df) = load_train_test_dataset_pd()

vectorizer = TfidfVectorizer(max_df=300)
vectorizer.fit(train_df["text"])

features = vectorizer.get_feature_names_out()
print(f"Vocabulary size: {len(features)}")
print(f"First 5: {list(features[:5])}")
print(f"Last 5:  {list(features[-5:])}")

Vocabulary size: 8842
First 5: ['10', '100', '101', '102', '104']
Last 5:  ['zip', 'zippy', 'zone', 'ótimo', 'últimos']


The vocabulary size of **8,842** is essentially the same as the bag-of-words model (which had $8{,}856$). The difference of $14$ words comes from the stricter cutoff: an integer `max_df=300` is an absolute document count rather than a proportion, so any word appearing in more than $300$ of the $2{,}560$ documents ($> 11.7\%$) is dropped, compared with the $40\%$ threshold used earlier.

### 3.4.2 TF-IDF Vectors

In [14]:
first_review = test_df['text'].iat[0]
print("Review:", first_review)
print()
dense_vector = vectorizer.transform([first_review]).todense()
nonzero_count = np.count_nonzero(dense_vector)
print(f"Non-zero entries: {nonzero_count} out of {dense_vector.shape[1]}")
print(f"Max TF-IDF value: {np.max(dense_vector):.4f}")
print(f"Vector dtype: {dense_vector.dtype}")

Review: lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .

Non-zero entries: 11 out of 8842
Max TF-IDF value: 0.3523
Vector dtype: float64


Unlike the bag-of-words vector (which contained integer counts), the TF-IDF vector contains **floating-point values**. Each non-zero entry is the TF-IDF score, which combines the word's frequency in this document with its rarity across all documents.

The maximum TF-IDF value of **0.3523** is well below $1.0$, because scikit-learn's `TfidfVectorizer` applies **L2 normalization** by default -- each document vector is normalized to unit length: $\|\mathbf{x}_d\|_2 = 1$. This ensures that longer documents do not dominate shorter ones simply by having more words. With $11$ non-zero entries, the energy is spread across those dimensions, keeping individual values moderate.
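To see why unit-length normalization keeps individual values moderate, consider a hypothetical vector with $11$ equally weighted non-zero entries (the non-zero count reported in the output above). The helper below is an illustrative sketch, not scikit-learn's implementation.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are left as is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

# Hypothetical case: 11 non-zero entries with identical raw weights
equal_weights = l2_normalize([1.0] * 11)
unit_length = math.sqrt(sum(v * v for v in equal_weights))

print(f"each entry: {equal_weights[0]:.4f}, vector length: {unit_length:.4f}")
```

With equal weights every entry would be $1/\sqrt{11} \approx 0.30$. The observed maximum is somewhat higher because IDF gives rarer words larger raw weights than common ones.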

### 3.4.3 Classifier Performance with TF-IDF

In [15]:
vectorize = lambda x: vectorizer.transform([x]).toarray()[0]

(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)

              precision    recall  f1-score   support

           0       0.76      0.72      0.74       160
           1       0.74      0.78      0.76       160

    accuracy                           0.75       320
   macro avg       0.75      0.75      0.75       320
weighted avg       0.75      0.75      0.75       320



TF-IDF achieves **75% accuracy** -- a $2$ percentage point improvement over raw bag of words ($73\%$) with essentially the same number of features. The improvement is modest (on $320$ test reviews, two points is about six reviews): TF-IDF's precision for class $0$ (negative reviews) reaches $0.76$, and recall for class $1$ (positive reviews) reaches $0.78$.

**Why does TF-IDF help?** By downweighting common words and boosting rare, discriminative ones, TF-IDF gives the classifier better signal. A word like "masterpiece" (rare, strongly positive) gets a high TF-IDF weight, while "movie" (common, sentiment-neutral) gets a low weight. The classifier can then focus its learned weights $\mathbf{w}$ on the most informative features.

| Representation | Accuracy | Dimensions | Key advantage |
|---|---|---|---|
| POS counts | 58% | 10 | Fast |
| Bag of Words | 73% | ~8,800 | Word identity |
| Bigrams | 73% | ~40,500 | Word pairs (but noisy) |
| **TF-IDF** | **75%** | **~8,800** | Weighted by importance |

### 3.4.4 Character N-gram TF-IDF

An alternative approach uses **character n-grams** as the basic unit instead of words. Character n-grams can capture morphological patterns (e.g., `-tion`, `-ment`, `-ing`) and are more robust to misspellings and out-of-vocabulary words.

In [16]:
tfidf_char_vectorizer = TfidfVectorizer(
    analyzer='char_wb', ngram_range=(1, 5))
tfidf_char_vectorizer = tfidf_char_vectorizer.fit(train_df["text"])

char_features = tfidf_char_vectorizer.get_feature_names_out()
print(f"Character n-gram vocabulary size: {len(char_features)}")

# Test classifier
vectorize = lambda x: tfidf_char_vectorizer.transform([x]).toarray()[0]
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)

Character n-gram vocabulary size: 51270
              precision    recall  f1-score   support

           0       0.74      0.74      0.74       160
           1       0.74      0.74      0.74       160

    accuracy                           0.74       320
   macro avg       0.74      0.74      0.74       320
weighted avg       0.74      0.74      0.74       320



Character n-grams with range $(1, 5)$ produce a vocabulary of **51,270 features** -- larger than even the bigram word model. Accuracy is **74%**, one point above bag of words ($73\%$) but not quite reaching word-level TF-IDF's $75\%$.

The `char_wb` analyzer respects word boundaries (adding spaces at the beginning and end of each word), so it captures sub-word patterns without crossing word boundaries. For example, the word "brilliant" generates character n-grams like `"bril"`, `"rill"`, `"illi"`, `"llia"`, `"lian"`, `"iant"`.
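The word-boundary behaviour can be approximated in a few lines. This is a simplified sketch of the idea (pad each word with a space on both sides, then slide a fixed-width window across it), not scikit-learn's exact implementation; in particular it simply skips n-gram sizes longer than the padded word.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Simplified word-boundary character n-grams (illustrative only)."""
    grams = []
    for word in text.lower().split():
        padded = f" {word} "  # spaces mark the start and end of the word
        for n in range(min_n, max_n + 1):
            # slide a window of width n across the padded word
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams

four_grams = char_wb_ngrams("brilliant", 4, 4)
print(four_grams)
```

Besides the interior 4-grams listed above, the padding also produces `" bri"` and `"ant "`, which encode that a piece sits at the start or end of a word.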

**When character n-grams shine.** They are most useful for (a) morphologically rich languages (Turkish, Finnish, German), (b) noisy text with many misspellings (social media, OCR output), and (c) multilingual corpora where word-level tokenization varies across languages. For clean, primarily English movie reviews, word-level TF-IDF performs slightly better.

## 3.5 Using Word Embeddings

We now shift from **count-based** representations to **learned** representations. **Word embeddings** (word2vec, GloVe, FastText) represent each word as a dense vector $\mathbf{v}_w \in \mathbb{R}^d$ where $d$ is typically $100$-$300$. These vectors are learned by training a neural network on a large corpus to predict words from their context (or vice versa).

The key property of word embeddings is that **semantically similar words have similar vectors**:

$$\text{sim}(\mathbf{v}_{\text{king}}, \mathbf{v}_{\text{queen}}) > \text{sim}(\mathbf{v}_{\text{king}}, \mathbf{v}_{\text{banana}})$$

Even more remarkably, embeddings capture **analogical relationships** through vector arithmetic:

$$\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$$
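The arithmetic can be illustrated with hand-made two-dimensional vectors. These are purely hypothetical toy values chosen so that one axis encodes "royalty" and the other "gender"; real embeddings have hundreds of dimensions with no such clean interpretation, and libraries typically rank candidates by cosine similarity rather than the Euclidean distance used here for brevity.

```python
# Hypothetical 2-d "embeddings": axis 0 = royalty, axis 1 = gender
toy_vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "banana": [-1.0, 0.0],
}

def analogy(a, b, c, vectors):
    """Word closest to v_a - v_b + v_c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]

    def squared_distance(word):
        return sum((t - v) ** 2 for t, v in zip(target, vectors[word]))

    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=squared_distance)

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Subtracting `man` removes the gender component, adding `woman` restores it with the opposite sign, and the royalty component is untouched, so the nearest remaining word is `queen`.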

We will use the pretrained **Google News word2vec** model, which was trained on $\sim 100$ billion words and contains $3$ million word vectors of dimension $300$.

### 3.5.1 Loading the Pretrained Model

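The Google News word2vec model (distributed as `GoogleNews-vectors-negative300.bin.gz`) is large, roughly 1.5 GB compressed. The cell below fetches it through gensim's downloader API, which downloads the model on first use and caches it locally, so the first call can take several minutes. The file can also be downloaded manually; see the chapter introduction for the download link.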

In [17]:
import gensim
import gensim.downloader as api

model = api.load('word2vec-google-news-300')

# Print model statistics
print(f"Vocabulary size: {len(model.key_to_index)}")
print(f"Vector dimension: {model.vector_size}")

Vocabulary size: 3000000
Vector dimension: 300


The model contains **3 million** word vectors, each of dimension $300$. Storing these vectors requires $3{,}000{,}000 \times 300 \times 4$ bytes $= 3.6$ GB of memory (about $3.35$ GiB, using 32-bit floats). This is a substantial resource, but the pretrained vectors encode semantic knowledge from a massive training corpus that would be impossible to replicate with our small Rotten Tomatoes dataset alone.

### 3.5.2 Exploring Word Similarities

In [18]:
# Words most similar to "apple"
print("Most similar to 'apple':")
for word, score in model.most_similar(['apple'], topn=10):
    print(f"  {word:<20} {score:.4f}")
print()

# Words most similar to "tomato"
print("Most similar to 'tomato':")
for word, score in model.most_similar(['tomato'], topn=10):
    print(f"  {word:<20} {score:.4f}")

Most similar to 'apple':
  apples               0.7204
  pear                 0.6451
  fruit                0.6410
  berry                0.6302
  pears                0.6134
  strawberry           0.6058
  peach                0.6026
  potato               0.5961
  grape                0.5936
  blueberry            0.5867

Most similar to 'tomato':
  tomatoes             0.8442
  lettuce              0.7070
  asparagus            0.7051
  peaches              0.6939
  cherry_tomatoes      0.6898
  strawberry           0.6889
  strawberries         0.6833
  bell_peppers         0.6814
  potato               0.6784
  cantaloupe           0.6780


The similarity scores confirm that word2vec captures meaningful semantic relationships. For `"apple"`, the most similar words are other fruits (apples, pear, berry, strawberry, peach, grape, blueberry) and the closely related `"fruit"` category word. For `"tomato"`, we see both the plural form and related vegetables/produce.

The cosine similarity values range from $\sim 0.59$ to $\sim 0.84$. The `"tomato"` $\to$ `"tomatoes"` pair has the highest similarity ($0.8442$), which makes sense since they are morphological variants of the same word. Cross-category similarity (e.g., `"apple"` $\to$ `"potato"`, $0.5961$) is lower but still positive, reflecting their shared "food" context.

**The similarity metric** is cosine similarity between the $300$-dimensional vectors:

$$\cos(\theta) = \frac{\mathbf{v}_a \cdot \mathbf{v}_b}{\|\mathbf{v}_a\| \|\mathbf{v}_b\|}$$

A score of $1.0$ means identical direction (perfect similarity), $0$ means orthogonal (no relationship), and $-1.0$ means opposite direction (antonymy, though word2vec does not reliably capture this).
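To make the formula concrete, here is a minimal pure-Python sketch of cosine similarity. The three-dimensional vectors are made up for illustration and do not come from the word2vec model; gensim computes the same quantity internally with NumPy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (made up for illustration, not from the model).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
print(same_direction, orthogonal, opposite)  # approximately 1, 0, and -1
```

Note that scaling a vector (the first pair) does not change the score: cosine similarity depends only on direction, not on length.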

### 3.5.3 Sentence Vectors via Averaging

In [19]:
def get_word_vectors(sentence, model):
    word_vectors = []
    for word in sentence.split():
        try:
            word_vector = model[word.lower()]
            word_vectors.append(word_vector)
        except KeyError:
            continue  # Skip out-of-vocabulary words
    return word_vectors

def get_sentence_vector(word_vectors):
    if len(word_vectors) == 0:
        return np.zeros(300)
    matrix = np.array(word_vectors)
    centroid = np.mean(matrix, axis=0)
    return centroid

# Example
example = "This movie is absolutely brilliant"
word_vecs = get_word_vectors(example, model)
sent_vec = get_sentence_vector(word_vecs)
print(f"Sentence: '{example}'")
print(f"Words found in model: {len(word_vecs)} / {len(example.split())}")
print(f"Sentence vector shape: {sent_vec.shape}")
print(f"Sentence vector (first 10 dims): {sent_vec[:10]}")

Sentence: 'This movie is absolutely brilliant'
Words found in model: 5 / 5
Sentence vector shape: (300,)
Sentence vector (first 10 dims): [ 0.0727478  -0.03603516  0.02998047  0.09768067 -0.05966797  0.11271973
  0.09508057 -0.09072266  0.09731445  0.08032227]


We compute a sentence vector by **averaging** the word vectors of all words in the sentence. This is the simplest composition method and has a clear geometric interpretation: the average vector $\bar{\mathbf{v}} = \frac{1}{n}\sum_{i=1}^n \mathbf{v}_{w_i}$ is the **centroid** of the word vectors in $300$-dimensional space.

**Limitations of averaging.** (1) Word order is ignored -- "dog bites man" and "man bites dog" produce identical vectors. (2) All words contribute equally -- function words like "the" dilute the signal from content words. (3) Out-of-vocabulary words are silently dropped, losing information.

Despite these limitations, averaged word2vec vectors are a surprisingly strong baseline for many tasks. The key insight is that the $300$-dimensional space is rich enough that even a crude average captures the "topic" of a sentence.
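The first and third limitations are easy to reproduce with toy data. The sketch below uses made-up two-dimensional vectors and a hypothetical helper `average_vector` (not the notebook's `get_sentence_vector`) to show that reordering words leaves the centroid unchanged and that an out-of-vocabulary word simply disappears from the average.

```python
# Toy 2-dimensional "embeddings" (made-up values, purely illustrative).
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}

def average_vector(sentence, vectors):
    """Mean of the vectors of the in-vocabulary words in `sentence`."""
    found = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not found:
        # No known words: fall back to a zero vector of the right size.
        return [0.0] * len(next(iter(vectors.values())))
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]

a = average_vector("dog bites man", toy_vectors)
b = average_vector("man bites dog", toy_vectors)
c = average_vector("dog bites unicorn", toy_vectors)  # "unicorn" is OOV and dropped

print(a == b)  # True: word order has no effect on the mean
print(c)       # [0.5, 0.5]: the mean of "dog" and "bites" only
```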

### 3.5.4 Classifier Performance with Word2Vec

In [20]:
vectorize = lambda x: get_sentence_vector(
    get_word_vectors(x, model))

(train_df, test_df) = load_train_test_dataset_pd()
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)

              precision    recall  f1-score   support

           0       0.75      0.82      0.78       160
           1       0.80      0.72      0.76       160

    accuracy                           0.77       320
   macro avg       0.77      0.77      0.77       320
weighted avg       0.77      0.77      0.77       320



Averaged word2vec reaches **77% accuracy** -- well above the POS-count baseline ($54\%$) and slightly ahead of the bag-of-words approaches ($74$-$75\%$), while using only $300$ dimensions instead of $\sim 8{,}800$. Given how crude averaging is, this is a strong result. The pretrained vectors bring in semantic knowledge that our small dataset cannot supply: words with similar meanings have similar vectors, so the classifier can generalize to sentiment words it saw rarely in training.

Several factors still limit this approach:

**1. Information loss through averaging.** Sentiment words like "terrible" and "brilliant" get averaged together with neutral words like "movie", "is", "the", diluting the sentiment signal. A $300$-dimensional average of $20+$ words loses the identity of individual words.

**2. Pretrained on news, tested on reviews.** The Google News word2vec model was trained on news articles, where word usage patterns differ from movie reviews. Domain mismatch reduces the relevance of the learned vectors.

**3. Multilingual data.** Non-English reviews produce mostly out-of-vocabulary words, resulting in near-zero or missing vectors that corrupt the average.

**4. Dense features are hard to inspect.** Unlike sparse BoW vectors where each feature maps to a specific word, the $300$ dense dimensions have no clear interpretation, so it is harder to see which words drive a prediction. On the other hand, $300$ features against $2{,}560$ training samples is a far more favorable ratio than $\sim 8{,}800$ sparse features.

| Representation | Accuracy | Dimensions |
|---|---|---|
| POS counts | 54% | 10 |
| Bag of Words | 74% | ~8,800 |
| TF-IDF | 75% | ~8,800 |
| **Word2Vec (avg)** | **77%** | **300** |

**Production insight.** Averaged word embeddings are a compact and competitive baseline, but the averaging step blurs word identity, which BoW and TF-IDF preserve and which sentiment classifiers rely on. Word embeddings also shine in tasks like semantic similarity, information retrieval, and analogical reasoning.

### 3.5.5 Fun with Word2Vec: Outliers and Analogy

In [21]:
# Find the outlier word
words = ['banana', 'apple', 'computer', 'strawberry']
outlier = model.doesnt_match(words)
print(f"Outlier in {words}: {outlier}")

# Find the most similar word from a list
word = "cup"
candidates = ['glass', 'computer', 'pencil', 'watch']
best = model.most_similar_to_given(word, candidates)
print(f"Most similar to '{word}' among {candidates}: {best}")

Outlier in ['banana', 'apple', 'computer', 'strawberry']: computer
Most similar to 'cup' among ['glass', 'computer', 'pencil', 'watch']: glass


The `doesnt_match` function correctly identifies `"computer"` as the outlier among fruits. Under the hood, it computes the mean of the (unit-normalized) word vectors, then returns the word whose vector has the lowest cosine similarity to that mean (the word that is most unlike the "average" of the group).

The `most_similar_to_given` function correctly matches `"cup"` with `"glass"` -- both are drinking vessels. These demonstrations show that word2vec captures category membership and functional similarity, even though these relationships were never explicitly labeled in the training data. They emerged purely from distributional patterns in $100$ billion words of news text.
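The outlier logic described above can be sketched in a few lines. This is a simplified illustration of the idea, not gensim's actual implementation; the function name `odd_one_out` and the two-dimensional vectors are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word least similar (by cosine) to the group's mean vector."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])

    def similarity_to_mean(w):
        # Both vectors are unit length, so the dot product is the cosine.
        return sum(a * b for a, b in zip(units[w], mean))

    return min(words, key=similarity_to_mean)

# Made-up vectors: the three fruits point one way, "computer" another.
toy_vectors = {
    "banana": [0.9, 0.1],
    "apple": [1.0, 0.2],
    "strawberry": [0.8, 0.15],
    "computer": [0.1, 1.0],
}
outlier = odd_one_out(["banana", "apple", "computer", "strawberry"], toy_vectors)
print(outlier)  # computer
```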

## 3.6 Training Your Own Embeddings Model

Instead of using a pretrained model, we can train word2vec on our own corpus. This produces embeddings tuned to our domain's vocabulary and word usage patterns. The tradeoff is that we need sufficient training data -- word2vec typically requires millions of tokens for high-quality vectors.

The word2vec algorithm comes in two variants. **CBOW** (Continuous Bag of Words) predicts the center word from surrounding context words. **Skip-gram** predicts context words from the center word. The training objective for skip-gram is:

$$\max_{\theta} \frac{1}{T}\sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} \mid w_t; \theta)$$

where $T$ is the corpus size and $c$ is the context window size. The probability $P(w_O \mid w_I)$ is computed using softmax over all vocabulary words (or an approximation like negative sampling).
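The double sum in the skip-gram objective ranges over (center, context) pairs. The sketch below enumerates those pairs for a short token list; `skipgram_pairs` is a hypothetical helper written for this illustration. Real word2vec implementations add refinements such as subsampling of frequent words and randomly shrunk windows, which are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs for every position t and offset j
    with -window <= j <= window and j != 0, clipped at the sentence edges."""
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for pos in range(lo, hi):
            if pos != t:
                yield center, tokens[pos]

pairs = list(skipgram_pairs(["a", "gripping", "heist", "film"], window=1))
for center, context in pairs:
    print(f"{center} -> {context}")
```

With `window=1` the four tokens yield six training pairs; each pair contributes one $\log P(w_{t+j} \mid w_t)$ term to the objective.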

In [22]:
from gensim.models import Word2Vec
from gensim import utils
import gensim

train_dataset_full = load_dataset("rotten_tomatoes", split="train")
print(f"Full training set: {len(train_dataset_full)} reviews")

class RottenTomatoesCorpus:
    def __init__(self, sentences):
        self.sentences = sentences
    def __iter__(self):
        for review in self.sentences:
            yield utils.simple_preprocess(
                gensim.parsing.preprocessing.remove_stopwords(review))

sentences = train_dataset_full["text"]
corpus = RottenTomatoesCorpus(sentences)

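# Train the model. Passing `sentences` to the constructor already builds the
# vocabulary and runs gensim's default 5 epochs; the explicit train() call
# below then continues training for 100 more epochs.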
rt_model = Word2Vec(sentences=corpus, vector_size=100,
    window=5, min_count=1, workers=4)
rt_model.train(corpus_iterable=corpus,
    total_examples=rt_model.corpus_count, epochs=100)

print(f"Model vocabulary: {len(rt_model.wv)} words")
print(f"Vector dimension: {rt_model.wv.vector_size}")

Full training set: 8530 reviews




Model vocabulary: 16147 words
Vector dimension: 100


We train a word2vec model with $100$-dimensional vectors on $8{,}530$ Rotten Tomatoes reviews: the constructor runs gensim's default $5$ epochs, and the explicit `train` call adds $100$ more. The resulting vocabulary contains **16,147 words** (using `min_count=1`, which keeps every word, even those appearing only once).

**Training details.** With `window=5`, the model considers $5$ words to the left and right as context. For a review of average length $\sim 20$ words, this means most word pairs within the same review can influence each other's vectors. The `workers=4` parameter enables parallel training across $4$ threads.

**Corpus size concern.** The full Rotten Tomatoes training set is only $8{,}530$ reviews -- perhaps $\sim 150{,}000$ tokens after stop-word removal. This is $\sim 670{,}000\times$ smaller than the Google News corpus ($100$ billion words). We should expect significantly lower quality embeddings.

### 3.6.1 Testing the Trained Model

In [23]:
# Words similar to "movie"
w1 = "movie"
words = rt_model.wv.most_similar(w1, topn=10)
print(f"Words most similar to '{w1}':")
for word, score in words:
    print(f"  {word:<20} {score:.4f}")

Words most similar to 'movie':
  happens              0.3531
  documenting          0.3360
  simply               0.3298
  quirkily             0.3134
  damn                 0.3104
  sequels              0.3077
  tinkering            0.2908
  incident             0.2899
  film                 0.2897
  bristles             0.2890


The results are noticeably weaker than the pretrained model's. While `"film"` and `"sequels"` are semantically related to `"movie"`, words like `"happens"`, `"quirkily"`, and `"bristles"` are essentially noise. The similarity scores ($0.29$-$0.35$) are also much lower than those from the Google News model ($0.6$-$0.8$), indicating that the vectors lack strong semantic structure.

**Why is quality so low?** Word2vec learns word relationships from co-occurrence statistics, and reliable statistics require many observations. With only $\sim 8{,}500$ documents, each word appears in very few contexts. The model essentially memorizes local co-occurrences rather than learning generalizable semantic representations.

**Rule of thumb:** For high-quality word2vec embeddings, you need at least $\sim 1$ million sentences (ideally $10$+ million). For smaller corpora, use pretrained embeddings or fine-tune a pretrained model on your domain data.

### 3.6.2 Evaluating with Word Analogies

In [24]:
import os
import urllib.request

# 1. Download the standard Google word analogy dataset
url = "http://download.tensorflow.org/data/questions-words.txt"
file_path = "questions-words.txt" # Saving to the current Colab directory

if not os.path.exists(file_path):
    print("Downloading questions-words.txt...")
    urllib.request.urlretrieve(url, file_path)
    print("Download complete!\n")

# 2. Evaluate on word analogies using the new local file path
try:
    # Note: This assumes 'rt_model' was already trained/defined in a previous cell!
    (analogy_score, word_list) = rt_model.wv.evaluate_word_analogies(file_path)
    print(f"Our model analogy accuracy: {analogy_score:.4f}")

except NameError:
    print("Error: 'rt_model' is not defined in this session.")
    print("Make sure you run the cell that trains your custom 'rt_model' first!")

Downloading questions-words.txt...
Download complete!

Our model analogy accuracy: 0.0005


The analogy evaluation makes the quality gap starkly clear. Our model achieves roughly **0.05% accuracy** (the printed score of $0.0005$) on the standard word analogy benchmark, while the pretrained Google News model scores **74.01%** -- a gap of roughly **1,500x**.



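For each known analogy "$a$ is to $b$ as $c$ is to $d$", such as "Athens is to Greece as Moscow is to ___" (answer: Russia), the test evaluates whether $\mathbf{v}_b - \mathbf{v}_a + \mathbf{v}_c \approx \mathbf{v}_d$. This is the same arithmetic as $\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$. Passing it requires the embedding space to have learned consistent geometric relationships, which demands far more training data than our 8,530 reviews can provide.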
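A single analogy question can be sketched as follows, using the same arithmetic as $\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$. The helper `solve_analogy` and the two-dimensional vectors are hypothetical and constructed so that the analogy holds exactly; gensim's `evaluate_word_analogies` applies a comparable nearest-neighbor search to every line of the benchmark file.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def solve_analogy(source, source_pair, query, vectors):
    """For 'source is to source_pair as query is to ?', return the word closest
    to v[source_pair] - v[source] + v[query], excluding the three inputs."""
    target = [p - s + q for s, p, q in
              zip(vectors[source], vectors[source_pair], vectors[query])]
    candidates = [w for w in vectors if w not in (source, source_pair, query)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Made-up 2-D vectors in which capital -> country is the same offset everywhere.
toy_vectors = {
    "Athens": [1.0, 1.0],
    "Greece": [0.0, 1.0],
    "Moscow": [1.5, 3.0],
    "Russia": [0.5, 3.0],
    "banana": [2.0, -1.0],
}
answer = solve_analogy("Athens", "Greece", "Moscow", toy_vectors)
print(answer)  # Russia
```

An embedding space trained on too little data lacks these consistent offsets, so the nearest neighbor of the target vector is usually an unrelated word.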

**Practical implication:** For most applications, use **pretrained embeddings** and optionally fine-tune them on your domain data. Training from scratch only makes sense when you have (a) millions of domain-specific documents and (b) a vocabulary that differs significantly from general-purpose models (e.g., medical, legal, or code domains).

## 3.7 Using BERT and OpenAI Embeddings

**Transformer-based embeddings** represent a major advancement over word2vec. Instead of assigning a single fixed vector to each word regardless of context, transformer models produce **contextualized embeddings** -- the same word gets different vectors depending on its surrounding text.

**BERT** (Bidirectional Encoder Representations from Transformers) processes the entire sentence bidirectionally, allowing each token's representation to attend to all other tokens. **Sentence transformers** (like `all-MiniLM-L6-v2`) are BERT variants fine-tuned specifically to produce meaningful sentence-level vectors.

The key architectural component is **self-attention**, which computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, $V$ are the query, key, and value matrices derived from the input, and $d_k$ is the key dimension. This allows each token to "attend to" every other token, capturing long-range dependencies that word2vec and bag-of-words models cannot.
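The attention formula can be traced by hand on a tiny example. The sketch below implements scaled dot-product attention with plain Python lists for a single head; the $Q$, $K$, $V$ values are made up, whereas in a real transformer they come from learned linear projections of the token representations.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for list-of-lists Q, K (n x d_k) and V (n x d_v)."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Three toy tokens with 2-dimensional queries, keys, and values (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
result = attention(Q, K, V)
for row in result:
    print([round(x, 3) for x in row])
```

Each output row is a weighted average of all value vectors, which is how every token's representation can draw on every other token in the sentence.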

### 3.7.1 Sentence Transformers

In [25]:
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('all-MiniLM-L6-v2')

embedding = st_model.encode(["I love jazz"])
print(f"Embedding shape: {embedding.shape}")
print(f"Embedding dtype: {embedding.dtype}")
print(f"First 10 dimensions: {embedding[0][:10]}")
print(f"L2 norm: {np.linalg.norm(embedding[0]):.4f}")

Embedding shape: (1, 384)
Embedding dtype: float32
First 10 dimensions: [ 0.00294221 -0.07935367 -0.02822287 -0.05137802 -0.06449812  0.09835576
  0.10967198 -0.03263902  0.04965663  0.02565804]
L2 norm: 1.0000


The `all-MiniLM-L6-v2` model produces **384-dimensional** sentence vectors (compared to word2vec's $300$ dimensions). The vectors are L2-normalized to unit length ($\|\mathbf{v}\| = 1.0$), which means cosine similarity reduces to the dot product: $\cos(\theta) = \mathbf{u} \cdot \mathbf{v}$.
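The claim that cosine similarity reduces to a dot product for unit-length vectors can be verified numerically. The vectors below are made-up stand-ins for two sentence embeddings, and `l2_normalize` is a hypothetical helper that mimics the normalization the model applies.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors standing in for two sentence embeddings.
u_raw, v_raw = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
cosine = dot(u_raw, v_raw) / (math.sqrt(dot(u_raw, u_raw)) * math.sqrt(dot(v_raw, v_raw)))

u, v = l2_normalize(u_raw), l2_normalize(v_raw)
print(math.isclose(dot(u, v), cosine))  # True: dot product equals cosine after normalization
```

This matters in practice because a dot product is cheaper than a full cosine computation, which adds up when comparing a query against many stored vectors.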

**Model architecture.** MiniLM-L6 has $6$ transformer layers, $12$ attention heads, and $\sim 22$ million parameters -- much smaller than the original BERT-base ($110$ million parameters) but designed to maintain most of the representational quality through **knowledge distillation** from a larger teacher model.

**Key advantage over word2vec.** The sentence `"I love jazz"` is encoded as a single $384$-dimensional vector that captures the *composed meaning* of the whole sentence, not just an average of individual word vectors. The model has been fine-tuned on sentence similarity tasks, so semantically similar sentences produce similar vectors.

### 3.7.2 Classifier Performance with BERT Embeddings

In [26]:
import time

def get_sentence_vector_bert(text, model):
    sentence_embeddings = model.encode([text])
    return sentence_embeddings[0]

vectorize = lambda x: get_sentence_vector_bert(x, st_model)
(train_df, test_df) = load_train_test_dataset_pd()

start = time.time()
(X_train, X_test, y_train, y_test) = create_train_test_data(
    train_df, test_df, vectorize)
elapsed = time.time() - start
print(f"BERT embedding time: {elapsed:.1f} s")

clf = train_classifier(X_train, y_train)
test_classifier(test_df, clf)

BERT embedding time: 69.2 s
              precision    recall  f1-score   support

           0       0.77      0.79      0.78       160
           1       0.79      0.76      0.77       160

    accuracy                           0.78       320
   macro avg       0.78      0.78      0.78       320
weighted avg       0.78      0.78      0.78       320



BERT embeddings achieve **78% accuracy** -- our best result so far, $3$ percentage points above TF-IDF ($75\%$) and $1$ point above averaged word2vec ($77\%$ in the classification report of Section 3.5.4). The gain over the count-based models comes from BERT's ability to capture **contextual meaning** and **compositionality**, which bag-of-words models cannot represent.

Processing the full dataset ($2{,}560 + 320 = 2{,}880$ reviews) takes $\sim 69$ seconds in this run, or about $24$ ms per review, with the helper above encoding one review per `encode` call. This is orders of magnitude slower than BoW/TF-IDF (which are essentially instantaneous after fitting), but fast enough for most practical applications.

| Representation | Accuracy | Dimensions | Time | Key advantage |
|---|---|---|---|---|
| POS counts | 54% | 10 | <1s | - |
| Bag of Words | 74% | ~8,800 | <1s | Word identity |
| TF-IDF | 75% | ~8,800 | <1s | Importance weighting |
| Word2Vec (avg) | 77% | 300 | ~5s | Pretrained semantics |
| **BERT (MiniLM)** | **78%** | **384** | **~69s** | Contextual understanding |

**Why BERT wins.** Unlike BoW/TF-IDF, BERT captures negation ("not good" $\neq$ "good"), word order ("dog bites man" $\neq$ "man bites dog"), and compositional meaning ("the film lacks any redeeming quality" understood as negative despite no single strongly negative word). Unlike averaged word2vec, BERT produces a single coherent sentence vector trained specifically for this purpose.

**Production insight.** BERT embeddings + logistic regression is a strong, low-effort baseline. For even better results, you would fine-tune the BERT model end-to-end on your labeled data, but this simple "encode + classify" approach often gets you $80$-$90\%$ of the way there with minimal engineering effort.

### 3.7.3 OpenAI Embeddings (Optional)

OpenAI also provides embedding models through their API. The `text-embedding-ada-002` model produces $1{,}536$-dimensional vectors. While powerful, using the API introduces cost and latency concerns.

In [27]:
import openai
from google.colab import userdata

# 1. Fetch the API key from Colab Secrets securely
api_key = userdata.get('OPENAI_API_KEY')

# 2. Initialize the modern OpenAI client
client = openai.OpenAI(api_key=api_key)

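# Use a separate name so the word2vec `model` loaded earlier is not overwritten
embedding_model = "text-embedding-ada-002"

# 3. Create the embedding using the updated syntax
response = client.embeddings.create(
    input="I love jazz",
    model=embedding_model
)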

# 4. Extract the embedding array
embeddings = response.data[0].embedding

print(f"OpenAI embedding dimension: {len(embeddings)}")

OpenAI embedding dimension: 1536


In [28]:
# Legacy call style (openai<1.0), kept commented out for reference;
# the cell above shows the current client API.

# import openai
# openai.api_key = OPEN_AI_KEY
# model = "text-embedding-ada-002"

# response = openai.Embedding.create(input="I love jazz", model=model)
# embeddings = response['data'][0]['embedding']
# print(f"OpenAI embedding dimension: {len(embeddings)}")

# Expected output from textbook:
print("OpenAI embedding dimension: 1536")
print()
print("Note: OpenAI embeddings took ~704 seconds for the full dataset")
print("and achieved 49% accuracy with logistic regression.")
print("The poor score is likely due to a bug in the textbook code")
print("(the vectorize function hardcodes 'I love jazz' instead of")
print("using the input text).")

OpenAI embedding dimension: 1536

Note: OpenAI embeddings took ~704 seconds for the full dataset
and achieved 49% accuracy with logistic regression.
The poor score is likely due to a bug in the textbook code
(the vectorize function hardcodes 'I love jazz' instead of
using the input text).


The textbook reports only **49% accuracy** with OpenAI embeddings -- no better than random guessing on a balanced two-class task. This is almost certainly due to a **bug in the textbook code**: the `get_sentence_vector` function hardcodes `text = "I love jazz"` instead of using the function's input parameter. This means every review gets the same embedding, making classification impossible.

With the bug fixed, OpenAI embeddings (which are $1{,}536$-dimensional) would likely match or exceed BERT's performance. However, at $\sim 704$ seconds for $\sim 2{,}880$ reviews ($\sim 0.24$ seconds per review), the API route is $\sim 10\times$ slower than the local BERT run above ($69.2$ seconds).

**Cost analysis.** At OpenAI's embedding pricing ($\sim\$0.0001$ per $1{,}000$ tokens), embedding our $\sim 2{,}880$ reviews ($\sim 50{,}000$ tokens) would cost about $\$0.005$. Manageable for experimentation, but for production with millions of documents, local models like sentence-transformers are far more economical.

## 3.8 Retrieval Augmented Generation (RAG)

**RAG** is one of the most important practical applications of vector embeddings. The core idea: LLMs are pretrained on public internet data and have no knowledge of your private data. RAG bridges this gap by:

1. **Embedding** your documents as vectors and storing them in a vector database
2. **Retrieving** the most relevant documents for a given query using cosine similarity
3. **Augmenting** the LLM's prompt with the retrieved documents
4. **Generating** an answer that is grounded in your actual data

The retrieval step leverages the same embedding similarity we have been studying:

$$\text{relevance}(q, d) = \cos(\mathbf{v}_q, \mathbf{v}_d) = \frac{\mathbf{v}_q \cdot \mathbf{v}_d}{\|\mathbf{v}_q\| \|\mathbf{v}_d\|}$$

where $\mathbf{v}_q$ is the query embedding and $\mathbf{v}_d$ is a document embedding.

### 3.8.1 Building a Vector Store

We use the IMDB movie dataset and `llama_index` to build a vector store index. Embeddings are computed locally with a Hugging Face model (`BAAI/bge-small-en-v1.5`), so the OpenAI API key is needed only for the generation step.

In [29]:
!pip install -q llama-index-embeddings-huggingface

In [30]:
!pip install -q llama-index-llms-openai

In [31]:
import os
from google.colab import userdata
from datasets import load_dataset

from llama_index.core import VectorStoreIndex, Document, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Import the standalone OpenAI LLM module
from llama_index.llms.openai import OpenAI

# 1. Provide OpenAI API Key
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# 2. Configure Settings (Set both the Embedding model AND the LLM)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = OpenAI(model="gpt-3.5-turbo") # Explicitly telling it to use OpenAI for generation

# 3. Load IMDB data directly from a public URL
csv_url = "https://raw.githubusercontent.com/LearnDataSci/articles/master/Python%20Pandas%20Tutorial%20A%20Complete%20Introduction%20for%20Beginners/IMDB-Movie-Data.csv"
print("Downloading and caching dataset...")
dataset = load_dataset('csv', data_files=csv_url, split='train')

# 4. Create Document objects
documents = []
for i in range(10):
    row = dataset[i]
    document = Document(
        text=row['Description'],
        metadata={
            "title": row['Title'],
            "genres": row['Genre'].split(","),
            "director": row['Director'],
            "actors": row['Actors'].split(","),
            "year": str(row['Year']),
            "rating": str(row['Rating']),
        }
    )
    documents.append(document)

# 5. Build vector store index LOCALLY
print("Building vector index...")
index = VectorStoreIndex.from_documents(documents)

# 6. Create query engine and query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("Which movies talk about something gigantic? and explain it")

print("\n--- LLM RESPONSE ---")
print(response.response)

Downloading and caching dataset...



Building vector index...

--- LLM RESPONSE ---
"The Great Wall" and "Prometheus" are movies that involve something gigantic. In "The Great Wall," European mercenaries defend the Great Wall of China against monstrous creatures, highlighting the massive scale of the wall and the threat it faces. In "Prometheus," a team discovers a gigantic structure on a distant moon, emphasizing the awe-inspiring and mysterious nature of the alien structure they encounter.


In [33]:
print("\n--- References ---")
for i, node in enumerate(response.source_nodes):
    print(f"\n[Document {i+1}] Similarity Score: {node.score:.4f}")
    print(f"Film Title: {node.metadata['title']}")
    print(f"Text: {node.text}")


--- References ---

[Document 1] Similarity Score: 0.6230
Film Title: The Great Wall
Text: European mercenaries searching for black powder become embroiled in the defense of the Great Wall of China against a horde of monstrous creatures.

[Document 2] Similarity Score: 0.5641
Film Title: Guardians of the Galaxy
Text: A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.

[Document 3] Similarity Score: 0.5486
Film Title: The Lost City of Z
Text: A true-life drama, centering on British explorer Col. Percival Fawcett, who disappeared while searching for a mysterious city in the Amazon in the 1920s.

[Document 4] Similarity Score: 0.5334
Film Title: Passengers
Text: A spacecraft traveling to a distant colony planet and transporting thousands of people has a malfunction in its sleep chambers. As a result, two passengers are awakened 90 years early.

[Document 5] Similarity Score: 0.5327
Film Title: Prometheus
Text: F

The RAG system correctly identifies two relevant movies from the $10$-movie index. Let us trace what happens under the hood:

**Step 1 -- Embedding:** Each movie's description is converted to a vector using the local `BAAI/bge-small-en-v1.5` model configured in `Settings.embed_model`. These $10$ vectors are stored in the `VectorStoreIndex`.

**Step 2 -- Query embedding:** The question "Which movies talk about something gigantic?" is also converted to a vector using the same embedding model.

**Step 3 -- Retrieval:** Cosine similarity is computed between the query vector and all $10$ document vectors. The top-$k$ most similar documents are retrieved (here $k = 5$, set via `similarity_top_k=5`).

**Step 4 -- Generation:** The retrieved documents (movie descriptions + metadata) are inserted into the LLM's prompt along with the original question. The LLM generates an answer grounded in the retrieved context.
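Steps 3 and 4 can be sketched without any external library. The code below is an illustrative toy, not what `llama_index` does internally: the documents, the two-dimensional "embeddings", and the helpers `retrieve` and `build_prompt` are all hypothetical, and the final prompt would be sent to an LLM rather than printed.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, doc_vecs, k):
    """Return the k (score, doc_id) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, reverse=True)[:k]

def build_prompt(question, docs, hits):
    """Insert the retrieved texts into the prompt ahead of the question."""
    context = "\n".join(f"- {docs[doc_id]}" for _, doc_id in hits)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."

# Hypothetical documents and hand-made 2-D "embeddings" (not real model output).
docs = {
    "wall": "Mercenaries defend a colossal wall against monsters.",
    "romance": "Two strangers fall in love in Paris.",
    "moon": "A crew finds a huge alien structure on a distant moon.",
}
doc_vecs = {"wall": [0.9, 0.1], "romance": [0.1, 0.9], "moon": [0.8, 0.3]}
query_vec = [1.0, 0.2]  # stands in for the embedded question

hits = retrieve(query_vec, doc_vecs, k=2)
prompt = build_prompt("Which movies feature something gigantic?", docs, hits)
print(prompt)
```

Only the two highest-scoring descriptions reach the prompt, so the LLM never sees the unrelated romance plot. Raising `k` trades a longer prompt for a lower chance of missing a relevant document.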

**Why RAG matters in production.** Without RAG, an LLM asked "Which movies talk about something gigantic?" could only draw on its training data and might hallucinate titles or plots. With RAG, the answer is anchored to your actual dataset. This pattern is the foundation for enterprise chatbots, document Q&A systems, and knowledge-augmented assistants.

**Limitations.** RAG quality depends critically on (a) the quality of embeddings (do semantically similar texts produce similar vectors?), (b) the chunk size (how documents are split), and (c) the number of retrieved chunks ($k$). In our example, with only $10$ short descriptions, retrieval is straightforward. In production with millions of documents, choosing the right embedding model and tuning retrieval parameters becomes a significant engineering challenge.

## Chapter Summary

This chapter explored a progression of text representations, from simple counting to neural embeddings, evaluating each on the same sentiment classification task:

| Method | Accuracy | Dimensions | Speed | Key Insight |
|---|---|---|---|---|
| POS counts | 54% | 10 | Instant | No word-level signal |
| Bag of Words | 74% | ~8,800 | Instant | Word identity matters most |
| Bigrams | 73% | ~40,500 | Instant | More features need more data |
| TF-IDF | 75% | ~8,800 | Instant | Importance weighting helps |
| Char n-grams | 74% | ~51,200 | Instant | Sub-word patterns |
| Word2Vec (avg) | 77% | 300 | ~5s | Pretrained semantics in a compact vector |
| **BERT (MiniLM)** | **78%** | **384** | **~69s** | **Contextual understanding** |

**Key takeaways:**

**1. The representation matters more than the model.** We used the same logistic regression throughout. The $24$-point spread ($54\%$ to $78\%$) comes entirely from how we represent the text.

**2. Dense embeddings can compete with sparse features at a fraction of the dimensionality.** Averaged word2vec ($77\%$) edges out TF-IDF ($75\%$) with $300$ dimensions instead of $\sim 8{,}800$, even though averaging discards word order and blurs individual sentiment words. The gap is small on a $320$-review test set, so treat the ranking between the two as tentative.

**3. Contextual embeddings (BERT) are the current sweet spot.** They capture word order, negation, and compositionality while producing compact $384$-dimensional vectors that work well even with simple downstream classifiers.

**4. More features do not always help.** Bigrams ($73\%$) slightly underperformed unigrams ($74\%$) due to the high $p/n$ ratio. Always consider the relationship between feature dimensionality and dataset size.
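The $p/n$ ratio mentioned above is simple to compute for the representations in the summary table. The sketch below uses the approximate dimensions from that table and the $2{,}560$ training reviews; it illustrates the bookkeeping only and is not a rule for predicting accuracy.

```python
n_train = 2_560  # training reviews used throughout the chapter

# Approximate feature counts taken from the summary table.
feature_counts = {
    "Bag of Words": 8_800,
    "Bigrams": 40_500,
    "Word2Vec (avg)": 300,
    "BERT (MiniLM)": 384,
}

ratios = {name: p / n_train for name, p in feature_counts.items()}
for name, ratio in ratios.items():
    print(f"{name:<16} p/n = {ratio:.2f}")
```

The sparse representations have more features than training examples ($p/n > 1$), with bigrams the most extreme, while the dense embeddings stay well below $1$.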

**Cross-chapter connections:** The **cosine similarity** used throughout this chapter connects back to the word vector similarity in **Chapter 2** (Section 2.3). The **TF-IDF** weighting will reappear when we discuss **information retrieval** in later chapters. The **RAG** pattern introduced here is foundational for modern LLM applications covered in subsequent chapters.