# **Module 3: NLP & LLM Core**

## L13: Text Processing & Embeddings

Welcome to the NLP module. We are moving from Computer Vision (pixels) to Natural Language Processing (sequences/tokens). This is the foundation for the Agentic AI and RAG work you have planned later.

This lesson focuses on how we convert human language into numerical vectors that machines can "understand." We will progress from simple frequency counts to deep semantic representations.

### Topic Breakdown

```text
L13: Text Processing & Embeddings
├── Concept 1: Tokenization (Subword & BPE)
│   ├── Word-level vs. Character-level vs. Subword
│   ├── The OOV (Out of Vocabulary) Problem
│   ├── Byte Pair Encoding (BPE) Intuition
│   ├── Explanation: Breaking text into meaningful chunks (tokens)
│   └── Task: Use a tokenizer to inspect tokenization differences
│
├── Concept 2: Sparse Representations (TF-IDF) [Baseline]
│   ├── Term Frequency (TF)
│   ├── Inverse Document Frequency (IDF)
│   ├── Explanation: Weighing words by how "rare" and "informative" they are
│   └── Task: Compute TF-IDF matrix for a mini-corpus using sklearn
│
├── Concept 3: Static Dense Embeddings (Word2Vec/GloVe Intuition)
│   ├── One-Hot vs. Dense Vectors
│   ├── Semantic Meaning in Vector Space (King - Man + Woman = Queen)
│   ├── Limitation: Context Independence (Polysemy)
│   └── Task: Manual Cosine Similarity calculation on mock embedding vectors
│
├── Concept 4: Transformer Embeddings (Sentence-BERT)
│   ├── Contextual Embeddings (Why "bank" differs in two sentences)
│   ├── The Cross-Encoder vs. Bi-Encoder (Siamese Network) architecture
│   ├── Explanation: Capturing the meaning of whole sentences
│   └── Task: Load a Sentence-Transformer model and encode text
│
└── Mini-Project: Semantic Classifier Comparison
    ├── Dataset: 20 Newsgroups (Subset) or similar text dataset
    ├── Pipeline A: TF-IDF + Logistic Regression
    ├── Pipeline B: SBERT Embeddings + Logistic Regression
    └── Evaluation: Compare Accuracy/F1 Score

```

---


## **Concept 1: Tokenization (Subword & BPE)**

### Intuition

Before a model can process text, it must be broken down into smaller units called **tokens**. The simplest approach is splitting by spaces (Word-level), but this fails when the model encounters a word it hasn't seen before (the "Out-Of-Vocabulary" or **OOV** problem). Conversely, splitting by characters (Character-level) solves OOV but results in extremely long sequences where individual units carry little meaning.

Modern NLP uses **Subword Tokenization** (e.g., Byte-Pair Encoding, or BPE, and its close relative WordPiece, which BERT uses). This is the "Goldilocks" zone. It keeps common words as single tokens (e.g., "apple") but breaks rare or complex words into smaller sub-units (e.g., "tokenization" $\rightarrow$ "token", "##iza", "##tion", where `##` is WordPiece's marker for a piece that continues the previous one). This allows the model to process essentially *any* text using a fixed-size vocabulary.
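To make the splitting step concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece tokenizers use at inference time. The function name `wordpiece_tokenize` and the five-entry `toy_vocab` are hypothetical, chosen only for illustration; a real vocabulary has tens of thousands of entries, including single characters, which is why real tokenizers rarely fall back to `[UNK]`.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits at this position: give up on the whole word
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (assumption for this sketch)
toy_vocab = {"apple", "token", "##iza", "##tion", "##s"}

print(wordpiece_tokenize("apple", toy_vocab))          # ['apple']
print(wordpiece_tokenize("tokenization", toy_vocab))   # ['token', '##iza', '##tion']
print(wordpiece_tokenize("tokenizations", toy_vocab))  # ['token', '##iza', '##tion', '##s']
print(wordpiece_tokenize("zebra", toy_vocab))          # ['[UNK]']
```

The last call returns `[UNK]` only because the toy vocabulary contains no single-character pieces to fall back on.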

### Mechanics: Byte-Pair Encoding (BPE)

BPE works by iteratively merging the most frequently occurring adjacent pairs of characters (or bytes) in the training corpus.
   1. **Initialize:** Vocabulary includes all individual characters.
   2. **Count:** Calculate frequency of all symbol pairs (e.g., "e" + "s" $\rightarrow$ "es").
   3. **Merge:** Add the most frequent pair to the vocabulary as a new symbol.
   4. **Repeat:** Continue until the vocabulary size reaches a target limit (e.g., 30k or 50k tokens).
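The four steps above can be sketched in plain Python. This is a toy trainer, not the implementation used by any particular library: the function names and the word counts in `toy_counts` are assumptions for illustration, and ties between equally frequent pairs are broken by first occurrence, which real implementations may handle differently.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(word_freqs, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus_counts, num_merges):
    # Step 1 (Initialize): every word starts as a sequence of single characters
    word_freqs = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):          # Step 4 (Repeat)
        counts = get_pair_counts(word_freqs)   # Step 2 (Count)
        if not counts:
            break
        best = max(counts, key=counts.get)     # Step 3 (Merge) the most frequent pair
        merges.append(best)
        word_freqs = merge_pair(word_freqs, best)
    return merges, word_freqs

# Hypothetical word counts (assumption for this sketch)
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(toy_counts, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
print(list(segmented))
```

After three merges, "newest" is stored as `n e w est`: the frequent ending "est" has become one symbol, while the rarer stem is still spelled out character by character.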

### Simpler Explanation

Think of tokens like Lego bricks.
   * **Word-level:** Every unique word is a custom-molded brick. If you need a "microscope" brick and don't have it, you can't build the sentence.
   * **Character-level:** You only have 26 types of tiny 1x1 bricks. You can build anything, but it takes thousands of bricks to build a simple house.
   * **Subword (BPE):** You have a set of standard complex shapes (walls, windows) for common structures, but you also keep the tiny 1x1 bricks. If you encounter a rare structure, you build it using the standard shapes and the tiny bricks.

### Trade-offs
   * **Pros:** Solves OOV (can represent any string), balances sequence length and meaning.
   * **Cons:** Downstream code must handle continuation pieces (like `##ing` in BERT), for example when aligning word-level labels to tokens. Typos can split a word into many short, unusual pieces.

---

### Your Task

You will use the Hugging Face `transformers` library to observe how a subword tokenizer handles known words versus rare words/typos.

**Specifications:**
   1. **Import:** `AutoTokenizer` from `transformers`.
   2. **Load:** The tokenizer for `bert-base-uncased`.
   3. **Input Text:** "unaffable" (a real but rare word) vs "unaffabwle" (a typo/nonsense word).
   4. **Action:**
      * Tokenize both strings.
      * Convert the IDs back to tokens (strings) to see the split.
   5. **Output:** Print the list of tokens for both words.

**Note:** You might need to install transformers: `pip install transformers`



```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

word1 = "unaffable"
word2 = "unaffabwle"

# Encode each word; the result holds input_ids, token_type_ids and attention_mask
t1 = tokenizer(word1)
t2 = tokenizer(word2)

# Map the IDs back to their token strings to see the split
w1 = tokenizer.convert_ids_to_tokens(t1["input_ids"])
w2 = tokenizer.convert_ids_to_tokens(t2["input_ids"])

print(f"For {word1}\nToken = {t1}\nBack to string = {w1}")
print(f"For {word2}\nToken = {t2}\nBack to string = {w2}\n==============================")
```

```text
For unaffable
Token = {'input_ids': [101, 14477, 20961, 3468, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
Back to string = ['[CLS]', 'una', '##ffa', '##ble', '[SEP]']
For unaffabwle
Token = {'input_ids': [101, 14477, 20961, 2497, 13668, 2063, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
Back to string = ['[CLS]', 'una', '##ffa', '##b', '##wl', '##e', '[SEP]']
```



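Excellent. This illustrates the main benefit of Subword Tokenization. (The `[CLS]` and `[SEP]` entries are special tokens that the BERT tokenizer adds around every input.)

* **"unaffable"**: Split into 3 pieces (`una`, `##ffa`, `##ble`). Even this real word is too rare to have its own vocabulary entry. Note that the pieces follow corpus frequency rather than morphology (the split is not `un` + `affable`), so the model has to learn the word's meaning from how these pieces behave in context.
* **"unaffabwle"**: The tokenizer didn't fail or fall back to an unknown token. It kept splitting until every piece was in the vocabulary (`##b`, `##wl`, `##e`). Even though the word is nonsense, the model still gets a valid sequence of token IDs to process.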

---

## **Concept 2: Sparse Representations (TF-IDF)**

### Intuition

After tokenization, we have a list of tokens. The simplest way to turn these into numbers is counting them ("Bag of Words"). However, raw counts have a flaw: common words like "the", "is", and "and" appear frequently but carry little information.

**TF-IDF** (Term Frequency - Inverse Document Frequency) fixes this by balancing two factors:

1. **Frequency:** How often does the word appear in *this specific* document? (More is better).
2. **Rarity:** How often does the word appear in *all* documents? (Less is better).

### Mechanics

The score $w_{t,d}$ for a term $t$ in document $d$ is:
$$w_{t,d} = TF_{t,d} \times IDF_t$$

1. **TF (Term Frequency):**
$$TF_{t,d} = \frac{\text{count of } t \text{ in } d}{\text{total terms in } d}$$
*(Note: Implementations often use raw count or log normalization)*

2. **IDF (Inverse Document Frequency):**
$$IDF_t = \log \left( \frac{N}{df_t} \right)$$
Where $N$ is the total number of documents, and $df_t$ is the number of documents containing term $t$.
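As a worked example, the textbook formulas can be evaluated by hand in a few lines of plain Python. The three-document corpus below is hypothetical (chosen so that "the" appears in every document), the helper name `tf_idf` is illustrative, and the natural logarithm is assumed for $\log$.

```python
import math
from collections import Counter

# Hypothetical mini-corpus (assumption): "the" occurs in every document
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird sang",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

# df_t: number of documents that contain term t at least once
df = Counter()
for tokens in docs:
    df.update(set(tokens))

def tf_idf(term, tokens):
    tf = tokens.count(term) / len(tokens)   # count of t in d / total terms in d
    idf = math.log(N / df[term])            # log(N / df_t), natural log assumed
    return tf * idf

for term in ["the", "sat", "cat"]:
    print(term, round(tf_idf(term, docs[0]), 4))
# the 0.0
# sat 0.0676
# cat 0.1831
```

"the" scores exactly 0 because $\log(3/3) = 0$, while "cat", which appears in only one document, gets the highest weight even though it occurs less often than "the" in the first document.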


### Simpler Explanation

Imagine you are scanning a library for books about "Quantum Physics".
   * The word "the" is in every book. $IDF \approx 0$. It gets a score of 0.
   * The word "Quantum" appears many times in specific books, but not in cookbooks or novels. It has high TF (in the physics book) and high IDF (rare globally). It gets a high score.

### Trade-offs
   * **Pros:** Very fast, interpretable (you know exactly which words triggered the score), works surprisingly well for simple keyword matching.
   * **Cons:** **Sparse** (vectors are mostly zeros), **No Semantics** (it doesn't know "car" and "automobile" are related; they are just different orthogonal dimensions).

---

### Your Task

You will compute a TF-IDF matrix with Scikit-Learn and inspect how sparse it is.

**Specifications:**
   1. **Import:** `TfidfVectorizer` from `sklearn.feature_extraction.text`.
   2. **Data:** Create a list of strings:
   ```python
   corpus = [
       "the cat sat on the mat",
       "the dog sat on the log",
       "cats and dogs are great"
   ]
   
   ```

   3. **Action:**
      * Initialize the vectorizer.
      * Fit and transform the corpus.
      * Get the feature names (the vocabulary).
      * Convert the result to a dense array (using `.toarray()`) or a DataFrame for readability.
   
   
   4. **Output:** Print the feature names and the resulting matrix.


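```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great"
]

tfiv = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then build the sparse document-term matrix
X = tfiv.fit_transform(corpus)

# Column labels: one per vocabulary term
feature_names = tfiv.get_feature_names_out()

# Densify only for display; real corpora should stay sparse
dense_matrix = X.toarray()
df = pd.DataFrame(dense_matrix, columns=feature_names)
df
```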

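|   | and | are | cat | cats | dog | dogs | great | log | mat | on | sat | the |
|---|-----|-----|-----|------|-----|------|-------|-----|-----|----|-----|-----|
| 0 | 0.0 | 0.0 | 0.427554 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.427554 | 0.325166 | 0.325166 | 0.650331 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.427554 | 0.0 | 0.0 | 0.427554 | 0.0 | 0.325166 | 0.325166 | 0.650331 |
| 2 | 0.447214 | 0.447214 | 0.0 | 0.447214 | 0.0 | 0.447214 | 0.447214 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |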


### Review


Notice the **Sparsity**:
   * "cat" and "cats" are treated as completely different words (columns). The model doesn't know they are related.
   * "dog" and "cat" have 0 overlap. In this vector space, they are as different as "dog" and "refrigerator".
   * The matrix is mostly zeros (empty space).

This lack of relationship is what Dense Embeddings solve.
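One detail of the matrix is worth checking by hand: "the" receives the *largest* weight in the first two rows, even though it is the most common word. That is because `TfidfVectorizer`'s defaults differ from the textbook formula: it uses raw counts for TF, a smoothed IDF of $\ln\frac{1+N}{1+df_t} + 1$ (which never reaches zero), and then L2-normalizes each row. The sketch below reproduces row 0 under those assumed defaults; the helper name `sklearn_style_row` is hypothetical, and a plain `split()` stands in for sklearn's tokenizer, which is equivalent here only because every word is lowercase and at least two characters long.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency of each term
df = Counter()
for tokens in docs:
    df.update(set(tokens))

def sklearn_style_row(tokens):
    counts = Counter(tokens)
    # raw count * smoothed IDF: ln((1 + N) / (1 + df)) + 1
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    # L2-normalize so the row has unit length
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

row0 = sklearn_style_row(docs[0])
for term in sorted(row0):
    print(term, round(row0[term], 6))
```

With only three documents and "the" appearing twice per sentence, the raw count outweighs the small IDF penalty. On a realistic corpus, very common words are down-weighted far more strongly, and stop-word removal is often added on top.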


---

## **Concept 3: Static Dense Embeddings (Word2Vec/GloVe Intuition)**

### Intuition

To fix the "cat vs cats" problem, we need **Dense Vectors**. Instead of a one-hot vector of size 10,000 (the vocabulary size) holding a single `1` and `0`s everywhere else, we compress the meaning into a smaller vector (e.g., size 300) of continuous numbers (floats like 0.2, -0.9, 0.5).

In this "embedding space," words with similar meanings are pushed closer together.
   * **Word2Vec:** Learns from local context, either by predicting a word from its neighbors (CBOW, e.g., "The quick brown ____ jumps") or by predicting the neighbors from the word (skip-gram).
   * **GloVe:** Learns by analyzing global co-occurrence counts across the entire corpus.
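A small numerical comparison shows why one-hot vectors cannot express similarity while dense vectors can. The dense values below are made up for illustration (they are not taken from Word2Vec or GloVe), and `cosine` is a helper defined just for this sketch.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["cat", "cats", "refrigerator"]

# One-hot: each word gets its own axis, so all pairs are orthogonal
one_hot = np.eye(len(vocab))
print(cosine(one_hot[0], one_hot[1]))  # cat vs cats         -> 0.0
print(cosine(one_hot[0], one_hot[2]))  # cat vs refrigerator -> 0.0

# Hypothetical dense embeddings (assumption: hand-picked values)
dense = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "cats": np.array([0.85, 0.75, 0.2]),
    "refrigerator": np.array([0.1, 0.0, 0.9]),
}
print(cosine(dense["cat"], dense["cats"]))          # close to 1
print(cosine(dense["cat"], dense["refrigerator"]))  # much lower
```

In the one-hot space, "cat" is exactly as far from "cats" as from "refrigerator". In the dense space, related words can share directions, so similarity becomes a matter of degree.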

### Mechanics: Vector Arithmetic

The most famous property of these embeddings is that they capture semantic relationships algebraically:
$$\vec{King} - \vec{Man} + \vec{Woman} \approx \vec{Queen}$$
This works because the "direction" you travel to go from "Man" to "Woman" (gender dimension) is the same direction you travel to go from "King" to "Queen".

### Trade-offs

* **Pros:** Captures semantic similarity (synonyms are close), efficient (dense).
* **Cons:** **Context Independent (Static)**. The word "bank" has only **one** vector, even if it refers to a river bank or a financial bank. It averages all meanings into one messy vector.

---

### Your Task

You will perform "Semantic Arithmetic" manually using NumPy to understand how similarity works.

**Specifications:**
   1. **Define Mock Vectors:**
   ```python
   import numpy as np
   # Simplified 3D vectors for visualization
   king  = np.array([0.5, 0.7, 0.2])
   man   = np.array([0.5, 0.1, 0.2])
   woman = np.array([0.5, 0.1, 0.8])
   queen = np.array([0.5, 0.7, 0.8])
   
   ```
   
   2. **Vector Math:** Calculate a `target` vector: `king - man + woman`.
   3. **Similarity:** Calculate the **Cosine Similarity** between your `target` vector and the `queen` vector.
      * **Formula:** $\text{Similarity} = \frac{A \cdot B}{||A|| \times ||B||}$
      * Where $A \cdot B$ is the dot product and $||A||$ is the L2 norm (magnitude).
   4. **Constraint:** You **must** write the cosine similarity formula yourself using `np.dot` and `np.linalg.norm`. Do not use `sklearn`.


```python
import numpy as np

# Simplified 3D vectors for visualization
king  = np.array([0.5, 0.7, 0.2])
man   = np.array([0.5, 0.1, 0.2])
woman = np.array([0.5, 0.1, 0.8])
queen = np.array([0.5, 0.7, 0.8])

# Semantic arithmetic: remove "man", add "woman"
target = king - man + woman

# Cosine similarity = dot product / (product of L2 norms)
sim = np.dot(target, queen) / (np.linalg.norm(target) * np.linalg.norm(queen))
sim
```

```text
np.float64(1.0)
```

### Review

You got a result of **1.0**. In this idealized example, the math worked perfectly: the "gender direction" you added to King landed exactly on Queen. In real-world data (like GloVe), the similarity is rarely 1.0, but Queen is typically the *closest* word vector to the target once the input words (King, Man, Woman) are excluded from the search.

This illustrates how **math can represent meaning**.
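In practice, an analogy is answered by a nearest-neighbor search over the whole vocabulary rather than by comparing against one known answer. Here is a minimal sketch of that search using the mock vectors from the task plus one extra hypothetical word ("apple"); the function name `analogy` is illustrative. The input words are excluded because, with real embeddings, one of them is often the nearest neighbor of the target.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mock embedding table: the task's vectors plus a hypothetical distractor
vectors = {
    "king":  np.array([0.5, 0.7, 0.2]),
    "man":   np.array([0.5, 0.1, 0.2]),
    "woman": np.array([0.5, 0.1, 0.8]),
    "queen": np.array([0.5, 0.7, 0.8]),
    "apple": np.array([0.9, 0.1, 0.1]),
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman", vectors))  # queen
```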

---

## **Concept 4: Transformer Embeddings (Sentence-BERT)**

### Intuition: The Context Problem

Word2Vec is **static**.
   * Sentence A: "I went to the **bank** to deposit money."
   * Sentence B: "I sat on the river **bank**."

In Word2Vec, the word "bank" has the **exact same vector** in both sentences. This confuses the model.

**Transformers (like BERT)** are **dynamic**. They use an "Attention Mechanism" to look at the whole sentence at once. The vector for "bank" changes based on the words around it ("money" vs "river").

**Sentence-BERT (SBERT)** takes this further. Standard BERT gives you a vector for every token. SBERT is fine-tuned to output a single, high-quality vector **for the entire sentence** that is mathematically optimized for similarity search (cosine similarity).
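One common way to turn per-token vectors into a single sentence vector is mean pooling: average the token vectors while ignoring padding positions. The sketch below uses made-up 2-dimensional token vectors, and it assumes mean pooling as the strategy; other SBERT-style models may pool differently (for example, by taking the `[CLS]` vector).

```python
import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors where attention_mask == 1 (real tokens only)."""
    mask = attention_mask[:, None].astype(float)   # shape (tokens, 1) for broadcasting
    summed = (token_vectors * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)                   # guard against an all-padding input
    return summed / count

# Hypothetical token vectors: 3 real tokens followed by 1 padding token
token_vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [9.0, 9.0],   # padding: must not influence the result
])
attention_mask = np.array([1, 1, 1, 0])

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # roughly [0.667, 0.667]
```

The padding row is masked out, so the sentence vector is the mean of the three real token vectors only.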

### Mechanics: Siamese Networks

SBERT is trained as a **Siamese Network** (twin networks): one BERT whose weights are shared across two branches.
   1. Feed Sentence A into BERT A.
   2. Feed Sentence B into BERT B (the same weights as BERT A).
   3. Pool each output into one sentence vector and compare the two using Cosine Similarity.
   4. Backpropagate to ensure similar sentences have similar vectors and dissimilar ones are far apart.

This results in embeddings where `distance = semantic_difference`.
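The forward pass of such a twin setup can be sketched with NumPy. Everything here is a stand-in: a single random matrix `W` plays the role of "BERT plus pooling", the inputs are made-up feature vectors, and the loss shown (squared error between the predicted cosine similarity and a gold similarity score) is just one of several objectives used to train SBERT-style models. The backpropagation step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # shared weights: hypothetical stand-in for BERT + pooling

def encode(x):
    """Both branches call this same function, so they share W."""
    return np.tanh(x @ W)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_similarity_loss(x_a, x_b, gold_score):
    """Squared error between predicted cosine similarity and the gold label."""
    predicted = cosine(encode(x_a), encode(x_b))
    return (predicted - gold_score) ** 2

# Made-up inputs (assumption): stand-ins for two sentences
x_a = np.array([1.0, 0.5, -0.2, 0.1])
x_b = np.array([-0.3, 0.8, 0.4, -1.0])

loss_identical = cosine_similarity_loss(x_a, x_a, gold_score=1.0)
loss_different = cosine_similarity_loss(x_a, x_b, gold_score=1.0)
print(loss_identical, loss_different)
```

Identical inputs pass through identical weights, so their similarity is 1 and the loss is essentially zero. For a pair labeled as similar but encoded far apart, the loss is positive, and training would adjust `W` to pull the two vectors together.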

### Trade-offs
   * **Pros:** State-of-the-art accuracy for semantic search and clustering. Understands context deeply.
   * **Cons:** Slower to compute than TF-IDF or Word2Vec. Heavy memory usage.

---

### Your Task

You will use the `sentence-transformers` library to generate these context-aware embeddings.

**Specifications:**
   1. **Install:** `pip install sentence-transformers` (if needed).
   2. **Import:** `SentenceTransformer` from `sentence_transformers`.
   3. **Load Model:** Load the model named `'all-MiniLM-L6-v2'` (This is a small, fast, industry-standard model).
   4. **Data:**
      ```python
      sentences = [
          "That is a happy dog",
          "That is a very happy person",
          "Today is a sunny day"
      ]
      
      ```
   5. **Action:**
      * Encode the sentences into embeddings.
      * Print the **shape** of the resulting embedding matrix.
      * Print the first 5 values of the first sentence's vector (to see what they look like).


```python
from sentence_transformers import SentenceTransformer

sentences = [
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]

# Load the pretrained model (downloaded on first use)
model = SentenceTransformer('all-MiniLM-L6-v2')

# One embedding vector per sentence
encode = model.encode(sentences)
print(f"Shape of matrix = {encode.shape}")
print(f"First 5 values = {encode[0][:5]}")
```

```text
Shape of matrix = (3, 384)
First 5 values = [ 0.00504993  0.06316978  0.01415724  0.02694938 -0.06023403]
```


### Review


**Analysis of your Output:**
   * **Shape `(3, 384)`:** You have 3 sentences, and each sentence is represented by a vector of **384 dimensions**.
   * Unlike TF-IDF (where dimensions = vocabulary size, often 10,000+), this is a compact, dense representation.
   * Unlike the 3D toy example (King/Queen), this 384-dimensional space captures subtle nuances of grammar, tone, and meaning.
   * **Values:** These floats (`0.005...`) are the coordinates in that high-dimensional space.

---

## **Mini-Project: Semantic Classifier Showdown**

**Objective:**
Build two parallel text classification pipelines to classify news articles into 4 topics. You will test whether "understanding meaning" (Embeddings) beats "counting words" (TF-IDF) on this dataset.

**Specifications:**
   1. **Dataset:**
      * Use `sklearn.datasets.fetch_20newsgroups`.
      * **Categories:** `['sci.space', 'comp.graphics', 'rec.sport.hockey', 'talk.politics.mideast']`.
      * **Cleaning:** Set `remove=('headers', 'footers', 'quotes')` (Critical: this forces the model to read the actual text, not just email headers).
      * **Subset:** Use `subset='all'` (fetches both train and test for simplicity, we will split manually).
  
   2. **Preprocessing:**
      * Split data into **Train (80%)** and **Test (20%)** using `train_test_split` (random_state=42).
   
   3. **Pipeline A (The Baseline):**
      * Vectorize text using `TfidfVectorizer`.
      * Train a `LogisticRegression` classifier on the TF-IDF vectors.
      * Predict on Test set.
   
   4. **Pipeline B (The Challenger):**
      * Encode text using `SentenceTransformer('all-MiniLM-L6-v2')`.
      * Train a **new** `LogisticRegression` classifier on these dense embeddings.
      * Predict on Test set.
   
   5. **Evaluation:**
      * Print the **Accuracy Score** for both pipelines.
      * (Optional but recommended) Print a `classification_report` for both.

**Forbidden Shortcuts:**

* Do not use raw `CountVectorizer`.
* Do not skip the train/test split.


```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer

# Base Class for Text Classifiers
class BaseTextClassifier:
    def __init__(self):
        self.classifier = LogisticRegression(max_iter=1000)

    def extract_features(self, texts, is_training=False):
        """Convert list of strings to numpy array/matrix."""
        raise NotImplementedError("Child class must implement this")

    def train(self, X_train, y_train):
        print(f"Training {self.__class__.__name__}...")
        # 1. Extract features (is_training=True)
        X_train_features = self.extract_features(X_train, is_training=True)
        # 2. Fit the classifier
        self.classifier.fit(X_train_features, y_train)

    def evaluate(self, X_test, y_test):
        print(f"Evaluating {self.__class__.__name__}...")
        # 1. Extract features (is_training=False)
        X_test_features = self.extract_features(X_test, is_training=False)
        # 2. Predict using the classifier
        y_pred = self.classifier.predict(X_test_features)
        # 3. Print accuracy and classification report
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Accuracy: {accuracy}")
        print("Classification Report:")
        print(classification_report(y_test, y_pred))

# Pipeline A: TF-IDF + Logistic Regression
class TfidfPipeline(BaseTextClassifier):
    def __init__(self):
        super().__init__()
        # Initialize the TfidfVectorizer
        self.vectorizer = TfidfVectorizer()

    def extract_features(self, texts, is_training=False):
        if is_training:
            # Fit and transform on training data
            return self.vectorizer.fit_transform(texts)
        else:
            # Transform the test data (use the fitted vectorizer)
            return self.vectorizer.transform(texts)

# Pipeline B: Sentence Embeddings + Logistic Regression
class SemanticPipeline(BaseTextClassifier):
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        super().__init__()
        # Initialize the SentenceTransformer model
        self.model = SentenceTransformer(model_name)

    def extract_features(self, texts, is_training=False):
        # Extract dense embeddings (this is independent of is_training)
        return self.model.encode(texts)

# Load the dataset (fetching only the 4 specified categories)
categories = ['sci.space', 'comp.graphics', 'rec.sport.hockey', 'talk.politics.mideast']
newsgroups_data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Split the data into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(newsgroups_data.data, newsgroups_data.target, test_size=0.2, random_state=42)

# --- Train and Evaluate TfidfPipeline ---
tfidf_pipeline = TfidfPipeline()
tfidf_pipeline.train(X_train, y_train)
tfidf_pipeline.evaluate(X_test, y_test)

# --- Train and Evaluate SemanticPipeline ---
semantic_pipeline = SemanticPipeline()
semantic_pipeline.train(X_train, y_train)
semantic_pipeline.evaluate(X_test, y_test)
```


```text
Training TfidfPipeline...
Evaluating TfidfPipeline...
Accuracy: 0.9064102564102564
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.90      0.91       192
           1       0.94      0.88      0.91       199
           2       0.82      0.94      0.88       193
           3       0.96      0.90      0.93       196

    accuracy                           0.91       780
   macro avg       0.91      0.91      0.91       780
weighted avg       0.91      0.91      0.91       780

Training SemanticPipeline...
Evaluating SemanticPipeline...
Accuracy: 0.9179487179487179
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.91      0.93       192
           1       0.97      0.89      0.93       199
           2       0.85      0.93      0.89       193
           3       0.92      0.93      0.93       196

    accuracy                           0.92       780
   macro avg       0.
```

### Code Review & Analysis

**The Verdict:**
   * **TF-IDF Accuracy:** ~90.6%
   * **SBERT Accuracy:** ~91.8%

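**Why did SBERT win (slightly)?**

The margin is small (~1.2 percentage points on a single 780-document test split), so treat it as suggestive rather than conclusive. A plausible explanation is that SBERT captures **context**:

   * If a space article says "The vessel left the atmosphere," TF-IDF might miss it if the classifier has mostly learned from words like "rocket" or "shuttle."
   * SBERT can relate "vessel" + "atmosphere" $\rightarrow$ "space travel."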

**Why was TF-IDF so competitive?**

The *20 Newsgroups* dataset is very **keyword-heavy**.
   * Space articles are full of "orbit," "nasa," "launch."
   * Hockey articles are full of "puck," "goal," "team."
   * TF-IDF thrives here. If we were classifying subtle sentiment (e.g., sarcasm), TF-IDF would likely struggle, while contextual embeddings such as SBERT would be expected to do better.

### Optimizations for the Future
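   1. **Batching:** In `SemanticPipeline`, `model.encode(texts)` already encodes in mini-batches (its `batch_size` parameter defaults to 32). For huge datasets, lower `batch_size` if the GPU runs out of memory, and consider encoding the corpus in chunks so you never hold every text and embedding in memory at once.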
   2. **Freezing:** Currently, we used the embeddings as fixed features. In later modules (M3 Week 15), we will "Fine-Tune" BERT, allowing the embeddings themselves to shift to better fit our specific data.
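For the batching point, one possible pattern is to encode the corpus chunk by chunk. The sketch below is an illustration only: `encode_in_chunks` is a hypothetical helper, and `fake_encode` is a stand-in for `model.encode` so the example runs without downloading a model.

```python
import numpy as np

def encode_in_chunks(texts, encode_fn, chunk_size=1000):
    """Yield one embedding array per chunk of texts."""
    for start in range(0, len(texts), chunk_size):
        yield encode_fn(texts[start:start + chunk_size])

# Hypothetical stand-in for model.encode (assumption): 2 toy features per text
def fake_encode(batch):
    return np.array([[len(text), text.count(" ")] for text in batch], dtype=float)

texts = ["a b", "c", "d e f", "g h", "i"]
chunks = list(encode_in_chunks(texts, fake_encode, chunk_size=2))
embeddings = np.vstack(chunks)
print(embeddings.shape)  # (5, 2)
```

Here the chunks are stacked back into one array for simplicity. For a corpus that truly does not fit in memory, each chunk could instead be written to disk as it is produced.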

---

### Module Progress

You have successfully completed **L13: Text Processing & Embeddings**.

**Summary of Skills Acquired:**

1. **Tokenization:** You know why we split words into subwords (`##ing`) to handle OOVs.
2. **Sparse Vectors:** You built a TF-IDF baseline and saw its sparsity.
3. **Dense Vectors:** You understand how semantic meaning is captured in vector arithmetic (`King - Man + Woman`).
4. **Transformers:** You implemented a production-grade embedding pipeline using Sentence-BERT.

