# **Module 3: NLP & LLM Core**

## L13: Text Processing & Embeddings

Welcome to the NLP module. We are moving from Computer Vision (pixels) to Natural Language Processing (sequences/tokens). This is the foundation for the Agentic AI and RAG work you have planned later.

This lesson focuses on how we convert human language into numerical vectors that machines can "understand." We will progress from simple frequency counts to deep semantic representations.

### Topic Breakdown

```text
L13: Text Processing & Embeddings
├── Concept 1: Tokenization (Subword & BPE)
│   ├── Word-level vs. Character-level vs. Subword
│   ├── The OOV (Out of Vocabulary) Problem
│   ├── Byte Pair Encoding (BPE) Intuition
│   ├── Explanation: Breaking text into meaningful chunks (tokens)
│   └── Task: Use a tokenizer to inspect tokenization differences
│
├── Concept 2: Sparse Representations (TF-IDF) [Baseline]
│   ├── Term Frequency (TF)
│   ├── Inverse Document Frequency (IDF)
│   ├── Explanation: Weighing words by how "rare" and "informative" they are
│   └── Task: Compute TF-IDF matrix for a mini-corpus using sklearn
│
├── Concept 3: Static Dense Embeddings (Word2Vec/GloVe Intuition)
│   ├── One-Hot vs. Dense Vectors
│   ├── Semantic Meaning in Vector Space (King - Man + Woman = Queen)
│   ├── Limitation: Context Independence (Polysemy)
│   └── Task: Manual Cosine Similarity calculation on mock embedding vectors
│
├── Concept 4: Transformer Embeddings (Sentence-BERT)
│   ├── Contextual Embeddings (Why "bank" differs in two sentences)
│   ├── The Cross-Encoder vs. Bi-Encoder (Siamese Network) architecture
│   ├── Explanation: Capturing the meaning of whole sentences
│   └── Task: Load a Sentence-Transformer model and encode text
│
└── Mini-Project: Semantic Classifier Comparison
    ├── Dataset: 20 Newsgroups (Subset) or similar text dataset
    ├── Pipeline A: TF-IDF + Logistic Regression
    ├── Pipeline B: SBERT Embeddings + Logistic Regression
    └── Evaluation: Compare Accuracy/F1 Score

```

---


## **Concept 1: Tokenization (Subword & BPE)**

### Intuition

Before a model can process text, it must be broken down into smaller units called **tokens**. The simplest approach is splitting by spaces (Word-level), but this fails when the model encounters a word it hasn't seen before (the "Out-Of-Vocabulary" or **OOV** problem). Conversely, splitting by characters (Character-level) solves OOV but results in extremely long sequences where individual units carry little meaning.

Modern NLP uses **Subword Tokenization** (e.g., Byte-Pair Encoding or BPE). This is the "Goldilocks" zone. It keeps common words as single tokens (e.g., "apple") but breaks rare or complex words into smaller sub-units (e.g., "tokenization" $\rightarrow$ "token", "##iza", "##tion"). The `##` prefix is the continuation marker used by BERT's WordPiece tokenizer, a close relative of BPE. This allows the model to process *any* text using a fixed-size vocabulary.
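To make the three strategies concrete, here is a minimal sketch. It assumes tiny hand-made vocabularies (`WORD_VOCAB` and `SUBWORD_VOCAB` are invented for illustration) and uses a greedy longest-match split in the style of WordPiece; real tokenizers learn their vocabulary from a corpus.

```python
import string

# Toy vocabularies, made up for this illustration.
WORD_VOCAB = {"the", "apple", "is", "red"}
LETTERS = set(string.ascii_lowercase)
SUBWORD_VOCAB = (
    {"apple", "token", "##iza", "##tion", "##s"}
    | LETTERS
    | {"##" + c for c in LETTERS}
)

def word_level(text):
    """Split on spaces; unseen words collapse to [UNK] (the OOV problem)."""
    return [w if w in WORD_VOCAB else "[UNK]" for w in text.split()]

def char_level(text):
    """One token per character: no OOV, but long sequences."""
    return [c for c in text if c != " "]

def subword_level(word):
    """Greedy longest-match-first split of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark word-internal pieces
            if piece in SUBWORD_VOCAB:
                tokens.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched (e.g. an unknown character)
        start = end
    return tokens

print(word_level("the apple tokenization"))  # ['the', 'apple', '[UNK]']
print(len(char_level("tokenization")))       # 12
print(subword_level("tokenization"))         # ['token', '##iza', '##tion']
```

The subword version never needs `[UNK]` for lowercase words here because the single letters act as a fallback.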

### Mechanics: Byte-Pair Encoding (BPE)

BPE works by iteratively merging the most frequently occurring adjacent pairs of characters (or bytes) in the training corpus.
   1. **Initialize:** Vocabulary includes all individual characters.
   2. **Count:** Calculate frequency of all symbol pairs (e.g., "e" + "s" $\rightarrow$ "es").
   3. **Merge:** Add the most frequent pair to the vocabulary as a new symbol.
   4. **Repeat:** Continue until the vocabulary size reaches a target limit (e.g., 30k or 50k tokens).
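
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration, not a production tokenizer: the word frequencies are invented, the helper names (`get_pair_counts`, `merge_pair`, `train_bpe`) are hypothetical, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Step 1: start from individual characters.
    words = {tuple(w): f for w, f in word_freqs.items()}
    vocab = {ch for w in words for ch in w}
    merges = []
    for _ in range(num_merges):            # Step 4: repeat up to a limit
        pairs = get_pair_counts(words)     # Step 2: count pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # Step 3: merge the most frequent
        words = merge_pair(best, words)
        vocab.add(best[0] + best[1])
        merges.append(best)
    return merges, vocab

# Invented corpus statistics: word -> frequency.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = train_bpe(word_freqs, num_merges=3)
print(merges)
print(sorted(vocab))
```

Real implementations stop at a target vocabulary size rather than a fixed merge count, and they store the ordered merge list so the same merges can be replayed on new text.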

### Simpler Explanation

Think of tokens like Lego bricks.
   * **Word-level:** Every unique word is a custom-molded brick. If you need a "microscope" brick and don't have it, you can't build the sentence.
   * **Character-level:** You only have 26 types of tiny 1x1 bricks. You can build anything, but it takes thousands of bricks to build a simple house.
   * **Subword (BPE):** You have a set of standard complex shapes (walls, windows) for common structures, but you also keep the tiny 1x1 bricks. If you encounter a rare structure, you build it using the standard shapes and the tiny bricks.

### Trade-offs
   * **Pros:** Solves OOV (can represent any string), balances sequence length and meaning.
   * **Cons:** Handling the "sub-tokens" (like `##ing` in BERT) requires careful implementation. Typos can result in weird subword splits.

---

### Your Task

You will use the Hugging Face `transformers` library to observe how a subword tokenizer handles known words versus rare words/typos.

**Specifications:**
   1. **Import:** `AutoTokenizer` from `transformers`.
   2. **Load:** The tokenizer for `bert-base-uncased`.
   3. **Input Text:** "unaffable" (a standard word) vs "unaffabwle" (a typo/nonsense word).
   4. **Action:**
      * Tokenize both strings.
      * Convert the IDs back to tokens (strings) to see the split.
   5. **Output:** Print the list of tokens for both words.

**Note:** You might need to install transformers: `pip install transformers`



```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

word1 = "unaffable"
word2 = "unaffabwle"

# Encode each word; the result holds input_ids, token_type_ids, and attention_mask.
t1 = tokenizer(word1)
t2 = tokenizer(word2)

# Map the IDs back to token strings to see the split.
w1 = tokenizer.convert_ids_to_tokens(t1["input_ids"])
w2 = tokenizer.convert_ids_to_tokens(t2["input_ids"])

print(f"For {word1}\nToken = {t1}\nBack to string = {w1}")
print(f"For {word2}\nToken = {t2}\nBack to string = {w2}\n==============================")
```

```text
For unaffable
Token = {'input_ids': [101, 14477, 20961, 3468, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
Back to string = ['[CLS]', 'una', '##ffa', '##ble', '[SEP]']
For unaffabwle
Token = {'input_ids': [101, 14477, 20961, 2497, 13668, 2063, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
Back to string = ['[CLS]', 'una', '##ffa', '##b', '##wl', '##e', '[SEP]']
```



Excellent. This output shows subword tokenization at work.

* **"unaffable"**: Split into 3 pieces (`una`, `##ffa`, `##ble`), wrapped in the special `[CLS]` and `[SEP]` tokens that BERT adds. The pieces are chosen by corpus frequency, not by morphology, so they do not line up with "un" + "affable"; the model has to learn the word's meaning from how these pieces occur in context.
* **"unaffabwle"**: The tokenizer didn't fail or fall back to `[UNK]`. It kept splitting until every piece was in its vocabulary (`##b`, `##wl`, `##e`). Even though the word is nonsense, the model still receives a valid sequence of token IDs to process.

---

## **Concept 2: Sparse Representations (TF-IDF)**

### Intuition

After tokenization, we have a list of tokens. The simplest way to turn these into numbers is counting them ("Bag of Words"). However, raw counts have a flaw: common words like "the", "is", and "and" appear frequently but carry little information.

**TF-IDF** (Term Frequency - Inverse Document Frequency) fixes this by balancing two factors:

1. **Frequency:** How often does the word appear in *this specific* document? (More is better).
2. **Rarity:** How often does the word appear in *all* documents? (Less is better).

### Mechanics

The score $w_{t,d}$ for a term $t$ in document $d$ is:
$$w_{t,d} = TF_{t,d} \times IDF_t$$

1. **TF (Term Frequency):**
$$TF_{t,d} = \frac{\text{count of } t \text{ in } d}{\text{total terms in } d}$$
*(Note: Implementations often use raw count or log normalization)*

2. **IDF (Inverse Document Frequency):**
$$IDF_t = \log \left( \frac{N}{df_t} \right)$$
Where $N$ is the total number of documents, and $df_t$ is the number of documents containing term $t$.
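
As a worked check of these formulas, the sketch below computes TF, IDF, and their product by hand in plain Python. It assumes whitespace tokenization and the natural logarithm, and the three-sentence corpus is only an example. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

# Example corpus, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are great".split(),
]

def tf(term, doc):
    """Count of the term in the document divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "cat" appears once in doc 0 and in 1 of 3 documents: (1/6) * log(3)
print(round(tf_idf("cat", docs[0], docs), 4))  # 0.1831
# "the" appears twice in doc 0 and in 2 of 3 documents: (2/6) * log(3/2)
print(round(tf_idf("the", docs[0], docs), 4))  # 0.1352
```

In a corpus this small, "the" still scores fairly high because it is missing from one of the three documents; the down-weighting of common words becomes pronounced only when $N$ is large.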


### Simpler Explanation

Imagine you are scanning a library for books about "Quantum Physics".
   * The word "the" is in every book. $IDF \approx 0$. It gets a score of 0.
   * The word "Quantum" appears many times in specific books, but not in cookbooks or novels. It has high TF (in the physics book) and high IDF (rare globally). It gets a high score.

### Trade-offs
   * **Pros:** Very fast, interpretable (you know exactly which words triggered the score), works surprisingly well for simple keyword matching.
   * **Cons:** **Sparse** (vectors are mostly zeros), **No Semantics** (it doesn't know "car" and "automobile" are related; they are just different orthogonal dimensions).
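
The "No Semantics" point can be checked with a small sketch. It assumes a hand-picked five-word vocabulary and plain word counts (no IDF weighting) to keep the arithmetic easy to follow; the `bow` and `cosine` helpers are illustrative, not library functions.

```python
import math

# Hypothetical vocabulary: one dimension per word.
vocab = ["automobile", "car", "fast", "is", "my"]

def bow(text):
    """Bag-of-words count vector over `vocab`."""
    words = text.split()
    return [words.count(term) for term in vocab]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Synonyms occupy different dimensions, so they look unrelated.
print(cosine(bow("car"), bow("automobile")))  # 0.0

# Any similarity comes only from the words the sentences share.
print(cosine(bow("my car is fast"), bow("my automobile is fast")))  # 0.75
```

The two sentences mean the same thing, yet the score of 0.75 reflects only the three shared words; "car" and "automobile" contribute nothing to each other.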

---

### Your Task

You will compute the TF-IDF matrix with scikit-learn and inspect how sparse it is.

**Specifications:**
   1. **Import:** `TfidfVectorizer` from `sklearn.feature_extraction.text`.
   2. **Data:** Create a list of strings:
   ```python
   corpus = [
       "the cat sat on the mat",
       "the dog sat on the log",
       "cats and dogs are great"
   ]
   
   ```

   3. **Action:**
      * Initialize the vectorizer.
      * Fit and transform the corpus.
      * Get the feature names (the vocabulary).
      * Convert the result to a dense array (using `.toarray()`) or a DataFrame for readability.
   
   
   4. **Output:** Print the feature names and the resulting matrix.
