### Introduction to Natural Language Processing (NLP)
#### *from classical methods to modern LLMs*

**Part I — Lecture (1.5 hours)**

---

### Speaker: Jeffrey Luo, Ph.D.


## About the Lecturer

**Ph.D. in Physics**, UNC Chapel Hill  
**B.S. in Mathematics and Physics**, Tsinghua University

---
### Experience
- **Workday**: LLM-based agentic systems for chat, search, and automation
- **Gradient AI**: Deep learning models for health insurance underwriting
- **Wolters Kluwer**: ML models for infectious disease prediction 
- **T2 Biosystems**: NMR systems for sepsis detection
- **Schlumberger**: NMR study of porous materials
---
### Hobby Project
**OrcaEcho.ai**: AI assistant for presentation creation and editing  
[Google Slides Add-on](https://workspace.google.com/marketplace/app/orcarina/404235414546) | [Website](https://orcaecho.ai)

---
[LinkedIn](https://www.linkedin.com/in/zhixiang-jeffrey-luo-70850124/) | [Google Scholar](https://scholar.google.com/citations?user=gzyW_GUAAAAJ&hl=en)


# Part I: Lecture Overview

1. **What is NLP and Why It Matters**
2. **Rule-Based NLP (1950s–1980s)**
3. **Statistical NLP (1990s–2010s)**
4. **Word Embeddings and Deep Learning**
5. **Transformers and the LLM Revolution**
6. **Modern LLM Applications**


# 1. What is NLP and Why It Matters


## 1.1 What is NLP?

**Natural Language Processing (NLP)** is a field of AI focused on enabling computers to understand, interpret, generate, and interact with human language.

### Some common things powered by NLP:

- **Translation** (e.g., English → Chinese)
- **Chatbots & assistants**
- **Speech-to-text / text-to-speech**
- **Summarization**
- **Sentiment analysis** (detecting positive/negative tone)
- **Search and question-answering**

NLP blends **linguistics**, **computer science**, and **machine learning** so that machines can handle language in a useful way.


## 1.2 Evolution of NLP Approaches

### Before Computers: Foundation (Ancient Times - 1950s)

- **Linguistics**: Analysis of grammar, syntax, semantics, sentence structure, and meaning
- **Mathematical Logic & Formal Models**: Scholars created systems that describe language in precise, machine-like terms, such as formal grammars and symbolic logic

---

### After Computers: NLP Timeline

**1954**: First machine translation experiment (Georgetown-IBM)

**1950s–1980s**: Rule-based language systems  
→ Translation, parsing algorithms, computational linguistics

**1990s–2010s**: Statistical and early neural models  
→ Probabilistic approaches, early neural networks

**2017+**: Transformers and modern LLMs  
→ The Transformer (2017), then BERT and GPT (2018)


## 1.3 Major NLP Applications

| **Application Category**                   | **Examples & Notes**                                      |
|--------------------------------------------|------------------------------------------------------------|
| **Machine Translation (1950s–)**           | Georgetown-IBM demo; Google Translate, DeepL              |
| **Search & Information Retrieval (1960s–)**| SMART system; AltaVista; Google Search                    |
| **Spell/Grammar & Text Classification (1980s–)** | WordPerfect, spam filters; Grammarly                |
| **Speech Recognition (1990s–)**            | Dragon Dictate; Siri, Alexa, Google Assistant             |
| **Named Entity & Information Extraction (1990s–)** | MUC NER systems; news/event extraction          |
| **Chatbots & QA Systems (2000s–)**         | TREC QA, IBM Watson; ChatGPT, Claude                      |
| **Text Generation & LLMs (2018–)**         | GPT series, Claude; rewriting, summarization              |
| **AI Agents & Tool Use (2023–)**           | LLMs executing actions via APIs/tools                     |


# 2. Rule-Based NLP (1950s–1980s)

## What Is Rule-Based NLP? (1950s–1980s)

Early NLP relied on **manually written rules** from linguists:

- **Symbolic, deterministic systems**
- **No machine learning or statistical models**
- Worked only in **narrow, controlled domains**

**Example rule:** "If a sentence starts with 'Who', classify it as a question."


### Language Basics from School

**Sentence Structure:**
- Sentences have parts: subject, verb, object
- Words group into phrases: noun phrase (NP), verb phrase (VP)

**Parts of Speech (POS):**
- Noun, verb, adjective, adverb
- Prepositions, conjunctions

You already identify these intuitively.


### Language Basics from School

**Morphology:**
- Plural forms: cat → cats
- Verb tenses: walk → walked
- Irregulars: go → went, mouse → mice

**Common Patterns:**
- "Who…?" → question
- "I feel X" → emotional statement
- "If…then…" → condition

These everyday language concepts form the foundation of early NLP.


### How Rule-Based NLP Extended Linguistic Concepts

**1. What NLP Inherited from Linguistics**

- Sentence structure → NP, VP, subject, object
- Parts of speech → nouns, verbs, adjectives, etc.
- Morphology concepts → plurals, tenses, irregulars
- Lexicon (linguistic sense) → words + meanings

*(These are the same concepts students learn in school grammar.)*

**2. What NLP Extended into Machine-Readable Form**

- **Formal grammars:** Phrase structure rules rewritten as explicit production rules (e.g., S → NP VP, NP → Det N)
- **Computational lexicon:** Structured entries storing POS, forms, syntactic frames, semantic features

*(Linguistic ideas made precise and explicit for computer use.)*
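
Production rules such as S → NP VP can be run directly by a small top-down recognizer. The sketch below is only an illustration: the grammar, the six-word lexicon, and the names `parse` and `is_sentence` are assumptions made up for this slide, not taken from any historical system.

```python
# Toy grammar written as explicit production rules (illustrative only).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
# Toy computational lexicon: word -> part of speech.
LEXICON = {
    "the": "Det", "a": "Det",
    "dog": "N", "cat": "N",
    "chased": "V", "slept": "V",
}


def parse(symbol, words, start):
    """Return every position where `symbol` can end if it begins at `start`."""
    if symbol not in GRAMMAR:  # POS category: must match exactly one word
        if start < len(words) and LEXICON.get(words[start]) == symbol:
            return {start + 1}
        return set()
    ends = set()
    for production in GRAMMAR[symbol]:
        positions = {start}
        for child in production:  # expand the right-hand side left to right
            positions = {e for p in positions for e in parse(child, words, p)}
        ends |= positions
    return ends


def is_sentence(text):
    """True if the whole word sequence can be derived from S."""
    words = text.lower().split()
    return len(words) in parse("S", words, 0)


print(is_sentence("the dog chased a cat"))  # True
print(is_sentence("dog the chased"))        # False
```

Anything outside the grammar or lexicon is rejected outright, which is the brittleness that later slides return to.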


### How Rule-Based NLP Extended Linguistic Concepts

**3. What NLP Created Anew for Computation**

- **Morphological rule engines:** Algorithms for generating/analyzing word forms, handling exceptions
- **Parsing algorithms:** Automatic construction of parse trees from grammar rules

*(These are engineering mechanisms that did not exist in pure linguistics.)*
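
A morphological rule engine can be sketched as an exception table consulted before a few suffix rules. The tables and function names here are hypothetical and cover only a handful of English patterns; real engines were far larger.

```python
# Irregular forms are looked up first; regular suffix rules are the fallback.
IRREGULAR_PLURALS = {"mouse": "mice", "child": "children", "foot": "feet"}
IRREGULAR_PAST = {"go": "went", "eat": "ate", "run": "ran"}


def pluralize(noun):
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # consonant + y -> ies (city -> cities), but vowel + y keeps the y (day -> days)
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def past_tense(verb):
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"


for noun in ["cat", "mouse", "city"]:
    print(noun, "->", pluralize(noun))
for verb in ["walk", "go"]:
    print(verb, "->", past_tense(verb))
```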

**4. Pattern Rules (New for Early NLP Applications)**

- IF text matches pattern → THEN respond or extract
- *Example: "I feel X" → "Why do you feel X?"*
- Enabled early chatbots (ELIZA) and information extraction
- Not derived from linguistics; purely application-driven
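
An IF-pattern-THEN-respond rule fits in a few lines of regular expressions. This is a minimal sketch in the spirit of ELIZA, not its actual script; the rule list, `DEFAULT_REPLY`, and `respond` are assumptions for illustration.

```python
import re

# Each rule pairs a pattern with a response template.
RULES = [
    (re.compile(r"^i feel (.+?)[.!]?$", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"^who\b", re.IGNORECASE), "You are asking about a person."),
]
DEFAULT_REPLY = "Please tell me more."


def respond(utterance):
    """Return the response of the first rule whose pattern matches."""
    text = utterance.strip()
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Captured text is pasted into the template with no understanding.
            return template.format(*match.groups())
    return DEFAULT_REPLY


print(respond("I feel sad."))  # Why do you feel sad?
print(respond("Nice weather"))  # Please tell me more.
```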


## Classic Systems (ELIZA, SHRDLU, MT)

**ELIZA (1966) — Shallow rules**

- Pattern matching; no understanding
- *Example: "I feel sad." → "Why do you feel sad?"*

**SHRDLU (1970s) — Deep rules in a tiny world**

- Real parsing + reasoning, but only in the "blocks world"
- *Example: "Put the red block on the green cube."*

**Rule-Based MT (SYSTRAN) — Large-scale rules**

- Thousands of grammar rules; used in real translation systems
- Expensive to maintain; brittle outside covered patterns


## Strengths, Weaknesses & Why It Ended

**Strengths:**
- Interpretable, linguistically grounded
- Reliable in narrow domains

**Weaknesses:**
- Brittle, unscalable
- Huge rule sets, fails on new inputs
- Could not handle ambiguity or real-world language variation

**Shift in 1990s:** Rise of digital text + computing power enabled statistical NLP  
Data-driven models outperformed handcrafted rules


# 3. Statistical NLP (1990s–2010s)

From hand-written rules → data-driven models.

- Learn patterns from corpora instead of experts
- Use probabilities to handle ambiguity
- Built the foundation for modern NLP applications


### What Statistical NLP Built On

- Parts of speech → still needed for tagging
- Phrase structure → still used for parsing
- Lexicon concepts → still needed for word categories
- Morphology → still informs word forms
- Linguistic features → used as input signals (suffixes, capitalization)

*(Kept linguistic structure, but learned patterns instead of hand-writing them.)*


### Data-Driven Learning

1. **Learn from corpora, not experts**
2. **Probabilities replace deterministic rules**
3. **NLP becomes an empirical science**

*Example:* "What word comes next?" → choose the most probable one from data.


### Key Corpora Fueling Statistical NLP

**POS Tagging:** Penn Treebank WSJ (PTB-POS), Brown Corpus

**Parsing:** PTB phrase-structure trees, PropBank / FrameNet (predicate–argument)

**NER:** CoNLL-2003 (EN/DE), MUC, ACE

**Machine Translation:** IBM Canadian Hansard (EN–FR), EuroParl, early WMT corpora

**Language Modeling:** PTB LM split, Gigaword, Google N-grams

**Text Classification & QA:** Reuters-21578, 20 Newsgroups, TREC QA

*(Data availability enabled empirical training and benchmarking.)*


### Why Penn Treebank Became the Backbone

**High-quality annotation:** POS tags, phrase-structure trees, consistent WSJ text

**Used across tasks:** POS tagging, parsing, LM, syntactic features for NER/QA

**Right size for 1990s–2000s compute:** large enough for statistics, small enough to train

**Standard benchmarks:** shared splits enabled reproducible research

**Reliable source (LDC @ Penn):** clear licensing, consistent versioning

→ Became the de-facto shared dataset for statistical NLP


### Penn Treebank Example (Data Backbone)

A PTB item pairs a sentence with its phrase-structure tree and POS tags.

```
(S
  (NP-SBJ (DT The) (NN company))
  (VP (VBD said)
      (SBAR
        (S (NP-SBJ (PRP it))
           (VP (MD would)
               (VP (VB cut)
                   (NP (NNS costs))
                   (PP (IN by)
                       (NP (CD 10) (NN percent)))))))
  (. .))
```

Raw sentence: *The company said it would cut costs by 10 percent.*

- POS tags: DT, NN, VBD, PRP …
- Phrase labels: NP, VP, PP, SBAR, S
- Bracketed tree = hallmark PTB format

*(Statistical models learned probabilities from thousands of such trees.)*
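
The bracketed format is easy to read programmatically. As a rough sketch (a regular expression rather than a full tree reader, with the hypothetical helper name `pos_pairs`), the word/tag leaves of the tree above can be pulled out like this:

```python
import re

PTB_TREE = """
(S
  (NP-SBJ (DT The) (NN company))
  (VP (VBD said)
      (SBAR
        (S (NP-SBJ (PRP it))
           (VP (MD would)
               (VP (VB cut)
                   (NP (NNS costs))
                   (PP (IN by)
                       (NP (CD 10) (NN percent)))))))
  (. .))
"""


def pos_pairs(tree_text):
    """Extract (word, POS tag) leaves from a bracketed tree."""
    # A leaf is "(TAG word)" with no nested parentheses inside it.
    leaves = re.findall(r"\(([^\s()]+) ([^\s()]+)\)", tree_text)
    return [(word, tag) for tag, word in leaves]


pairs = pos_pairs(PTB_TREE)
print(pairs[:3])  # [('The', 'DT'), ('company', 'NN'), ('said', 'VBD')]
```

Counting these pairs over many trees is how a tagger's probabilities are estimated.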


### Basic Math: Probability & Estimation

**1. Probability as Likelihood (Intuition)**

Language is unpredictable → we model how likely each word or label is.  
P(A|B) = "probability of A given B"

**Example:**

P("morning" | "good") >> P("hippopotamus" | "good")

**2. Count-Based Estimation**

Probability ≈ how often something appears in a corpus

**Example:**

If "San Francisco" appears 10,000 times and "San Jose" appears 5,000 times, the model estimates that "Francisco" is twice as likely as "Jose" after "San".

**3. Local Dependencies (Markov Idea)**

The next word depends mostly on a few previous words.

**Example:**

P("you" | "thank") is high  
P("saucepan" | "thank") is near zero
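
Count-based estimation with a one-word history can be sketched as follows. The four-sentence corpus is made up for illustration, and `bigram_prob` is a hypothetical helper name; real models are estimated from millions of sentences.

```python
from collections import Counter

# Tiny made-up corpus (an assumption for this sketch).
CORPUS = [
    "thank you very much",
    "thank you for coming",
    "thank goodness it is friday",
    "good morning to you",
]

context_counts = Counter()  # how often a word occurs with a following word
bigram_counts = Counter()   # how often (prev, word) occurs
for sentence in CORPUS:
    words = sentence.split()
    context_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))


def bigram_prob(word, prev):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


print(bigram_prob("you", "thank"))       # 2 of the 3 "thank" contexts
print(bigram_prob("saucepan", "thank"))  # 0.0: never observed
```

The zero for an unseen pair is the sparsity problem that smoothing addresses.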


### Basic Math: Features & Sequence Decisions

**4. Features as Linguistic Clues**

Models use signals from the text to make decisions.

**Examples of features:**

- word suffix "-ed" → likely past tense
- capitalized → maybe a name
- previous word = "Mr." → next likely PER
- contains digits → maybe a date/number

**These features feed:**

- Naive Bayes
- MaxEnt classifiers
- CRFs
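
The clues listed above translate directly into a feature function. The feature names and the helper `token_features` are illustrative assumptions; real taggers used hundreds of such templates.

```python
def token_features(tokens, i):
    """Binary clues about token i, read from the word and its left neighbor."""
    word = tokens[i]
    return {
        "suffix_ed": word.endswith("ed"),              # likely past tense
        "is_capitalized": word[:1].isupper(),          # maybe a name
        "prev_is_mr": i > 0 and tokens[i - 1] == "Mr.",  # next likely PER
        "has_digit": any(ch.isdigit() for ch in word),   # maybe a date/number
    }


tokens = ["Mr.", "Smith", "walked", "in", "1999"]
for i, tok in enumerate(tokens):
    active = [name for name, on in token_features(tokens, i).items() if on]
    print(tok, active)
```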

**5. Joint Decision for Sequences**

For tasks like POS tagging or NER, labels influence each other.

**Example:**

"New York City" → all three tokens should be labeled as LOCATION

*(Sequence models like HMMs and CRFs enforce consistency.)*


### Probabilistic Models

- **n-grams** for language modeling
- **Hidden Markov Models (HMMs)** for POS tagging & speech recognition
- **Maximum Entropy models** for classification
- **Conditional Random Fields (CRFs)** for sequence labeling (NER, segmentation)

*(Uncertainty is modeled mathematically.)*


### n-gram Language Models

- **What**: Count-based probabilistic models over short word sequences
- **Key concept**: Markov assumption — next word depends on the previous *(n−1)* words; smoothed conditional probabilities
- **Used for**: Language modeling, predictive text, decoding in speech recognition and MT
- **Why it works**: Converts large corpora into likelihoods that capture dominant local usage patterns
- **Builds on**: Extends rule-based frequency tables with statistical estimation and smoothing learned directly from corpora


#### How n-gram Assigns Probability to a Sentence

**Sentence Probability = Product of Local Conditional Probabilities**

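By the chain rule, the probability of a sentence factorizes exactly:

$$P(w_1, w_2, \dots, w_T) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1, w_2) \cdots P(w_T \mid w_1, \dots, w_{T-1})$$

An n-gram model approximates each factor by conditioning on only the previous $n-1$ words:

$$P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1})$$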

**Example (bigram):**

Sentence: "I want to eat pizza"

*P(I) × P(want|I) × P(to|want) × P(eat|to) × P(pizza|eat)*

**Interpretation**

- Each conditional probability measures how "natural" that local phrase is
- Multiplying them gives an overall fluency score for the whole sentence
- High score → common, natural sentence
- Very low score → unnatural or incorrect sentence

*(This is the core job of a language model.)*
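
A bigram fluency score can be sketched in a few lines. Several choices here are assumptions made for brevity: the three-sentence corpus, add-one smoothing, and a start marker `<s>` so that the first word is also scored as a bigram (it stands in for the P(I) term). Log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on long sentences.

```python
import math
from collections import Counter

CORPUS = ["i want to eat pizza", "i want to sleep", "you want to eat"]

context_counts = Counter()
bigram_counts = Counter()
vocab = set()
for sentence in CORPUS:
    words = ["<s>"] + sentence.split()
    vocab.update(words[1:])
    context_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))


def log_prob(sentence):
    """Sum of log P(word | previous word) under an add-one-smoothed bigram model."""
    words = ["<s>"] + sentence.lower().split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        # Add-one smoothing keeps unseen bigrams from zeroing the whole product.
        p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
        total += math.log(p)
    return total


natural = log_prob("I want to eat pizza")
scrambled = log_prob("pizza eat to want I")
print(natural > scrambled)  # True: the natural word order scores higher
```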


#### How n-gram Probabilities Add Value in Real Applications

**n-gram LM = Fluency/Correctness Scorer**

| Application | What Creates Candidates | What n-gram Adds | Result |
|------------|------------------------|------------------|--------|
| **Spell Checking** | String-similarity corrections (receive / revise / relieve) | Scores each candidate in sentence context | Picks correct word in context |
| **Speech Recognition** | Acoustic model generates many possible word sequences | Scores each sequence by fluency | Picks the natural sentence, not just sound match |
| **Machine Translation** | Translation model proposes literal or alternative translations | Scores target-language fluency | Picks most natural translation |
| **Predictive Text** | All words in vocabulary | Predictive next-word probabilities | Produces useful next-word suggestions |

**Key Insight**

Across all tasks, n-gram LM is the component that ensures the output "sounds like real language."


### Hidden Markov Models (HMMs)

- **What**: Generative sequence models with hidden state chains emitting observed words or acoustic frames
- **Key concept**: Transition matrix over tags/states + emission probabilities; decoded efficiently via Viterbi
- **Used for**: POS tagging, speech recognition, shallow parsing
- **Why it works**: Separates latent linguistic structure from surface tokens and models uncertainty end-to-end
- **Builds on**: Adds probabilistic state transitions on top of n-gram assumptions and reuses linguistic tagsets from rule-based systems


### Math of $P(s, x)$ in a Hidden Markov Model

We model the joint probability of:

- Hidden states: $s = (s_1, s_2, ..., s_T)$
- Observed words: $x = (x_1, x_2, ..., x_T)$

The HMM factorizes the joint probability as:

$$P(s, x) = P(s_1) \cdot \prod_{t=2}^{T} P(s_t \mid s_{t-1}) \cdot \prod_{t=1}^{T} P(x_t \mid s_t)$$

**Interpretation**

- $P(s_1)$: probability of the starting state
- $P(s_t \mid s_{t-1})$: transition probability
- $P(x_t \mid s_t)$: emission probability  

This expresses the generative story of an HMM:
1. Pick a starting hidden tag  
2. Emit a word based on that tag  
3. Transition to the next hidden tag  
4. Emit the next word  
5. Repeat
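
The factorization maps one-to-one onto code. All probabilities below are hypothetical numbers chosen for illustration (the emission tables list only the words needed here), and `joint_prob` is an assumed helper name.

```python
# Hypothetical parameters for a three-tag toy HMM.
START = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
    "VBZ": {"DT": 0.5, "NN": 0.4, "VBZ": 0.1},
}
EMIT = {
    "DT": {"the": 0.7},
    "NN": {"dog": 0.1, "barks": 0.01},
    "VBZ": {"barks": 0.05},
}


def joint_prob(states, words):
    """P(s, x) = P(s_1) * prod_t P(s_t | s_{t-1}) * prod_t P(x_t | s_t)."""
    prob = START[states[0]]                      # starting state
    for prev, cur in zip(states, states[1:]):    # transitions
        prob *= TRANS[prev][cur]
    for state, word in zip(states, words):       # emissions
        prob *= EMIT[state].get(word, 0.0)
    return prob


print(joint_prob(["DT", "NN", "VBZ"], ["the", "dog", "barks"]))
print(joint_prob(["DT", "NN", "NN"], ["the", "dog", "barks"]))
```

With these numbers the verb reading of "barks" gets a higher joint probability than the noun reading.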


### Example of $P(s, x)$ for POS Tagging

**Sentence:** `The dog barks`  
**Hidden states (POS):**  
$s = (\text{DT}, \text{NN}, \text{VBZ})$  
**Observed words:**  
$x = (\text{The}, \text{dog}, \text{barks})$

$$P(s, x) = P(DT) \cdot P(NN \mid DT) \cdot P(VBZ \mid NN) \cdot P(The \mid DT) \cdot P(dog \mid NN) \cdot P(barks \mid VBZ)$$

**Meaning:**
- Probability of starting in DT  
- Probability that NN follows DT  
- Probability that VBZ follows NN  
- Probability each word is emitted by its tag  

Multiply all terms → the joint probability of generating both the tag sequence and the sentence under the HMM.
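
Tagging runs this in reverse: given the words, find the tag sequence with the highest joint probability. The Viterbi algorithm does so without enumerating every sequence. The sketch below uses made-up probabilities (emission tables list only the words needed) and is an illustration, not a production tagger.

```python
# Hypothetical parameters for a three-tag toy HMM.
TAGS = ["DT", "NN", "VBZ"]
START = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
    "VBZ": {"DT": 0.5, "NN": 0.4, "VBZ": 0.1},
}
EMIT = {
    "DT": {"the": 0.7},
    "NN": {"dog": 0.1, "barks": 0.01},
    "VBZ": {"barks": 0.05},
}


def viterbi(words):
    """Return the most probable tag sequence and its joint probability."""
    # best[t][tag] = (probability of the best path ending in tag at t, previous tag)
    best = [{tag: (START[tag] * EMIT[tag].get(words[0], 0.0), None) for tag in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prob, prev = max((best[-1][p][0] * TRANS[p][tag], p) for p in TAGS)
            column[tag] = (prob * EMIT[tag].get(word, 0.0), prev)
        best.append(column)
    # Trace back from the best final tag using the stored back-pointers.
    last = max(TAGS, key=lambda tag: best[-1][tag][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


tags, prob = viterbi(["the", "dog", "barks"])
print(tags)  # ['DT', 'NN', 'VBZ']
```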


### Maximum Entropy (Log-Linear) Models

- **What**: Conditional probability models that weight arbitrary linguistic features to predict labels
- **Key concept**: Choose weights that maximize entropy subject to empirical feature expectations (a flexible logistic regression)
- **Used for**: Text classification, POS/NER taggers, feature-rich decision components in MT pipelines
- **Why it works**: Combines many overlapping signals without independence assumptions; learns optimal weights from data
- **Builds on**: Takes handcrafted linguistic cues from earlier systems but replaces manual rules with data-driven weight learning
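
At prediction time a log-linear model sums the weights of the active features for each label and normalizes with a softmax. The weights below are invented for illustration; in practice they are learned from labeled data.

```python
import math

# Hypothetical learned weights: WEIGHTS[label][feature].
WEIGHTS = {
    "PER": {"is_capitalized": 1.2, "prev_is_mr": 2.5, "has_digit": -1.0},
    "O": {"is_capitalized": -0.3, "prev_is_mr": -1.5, "has_digit": 0.8},
}


def predict_proba(active_features):
    """P(label | features) = exp(score(label)) / sum over labels of exp(score)."""
    scores = {
        label: sum(weights.get(f, 0.0) for f in active_features)
        for label, weights in WEIGHTS.items()
    }
    z = sum(math.exp(s) for s in scores.values())  # normalizer
    return {label: math.exp(s) / z for label, s in scores.items()}


probs = predict_proba(["is_capitalized", "prev_is_mr"])
print(max(probs, key=probs.get))  # PER
```

Overlapping features simply add their weights, so no independence assumption between them is needed.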


### Conditional Random Fields (CRFs)

- **What**: Discriminative sequence models that directly learn $P(\mathbf{y}\mid\mathbf{x})$ over label sequences
- **Key concept**: Log-linear factors over adjacent labels and rich features spanning the whole sentence; avoids label bias
- **Used for**: NER, segmentation, POS, bioNLP entity/event extraction
- **Why it works**: Supports arbitrary, overlapping features while enforcing global consistency across the sequence
- **Builds on**: Extends Maximum Entropy to structured outputs and replaces HMM generative assumptions with conditional training


### Math of a Linear-Chain Conditional Random Field (CRF)

A CRF models the conditional probability of a label sequence **y** given the entire observation sequence **x**:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right)$$

**Components**

- $f_k(\cdot)$: feature functions  
  (e.g., "is the word capitalized?", "previous label = B-LOC?")
- $\lambda_k$: learned weights for each feature
- $Z(x)$: normalization term (ensures probabilities sum to 1)

**Interpretation**

- The model scores a whole **label sequence** using rich features.
- Then normalizes over *all possible label sequences*.
- Training = learn weights $\lambda_k$ that best fit data.


### Example of CRF for NER (Named Entity Recognition)

**Sentence:**  
`New York City is busy`

**Labels (BIO scheme):**  
`B-LOC   I-LOC   I-LOC   O   O`

CRF uses **features across words and labels**:

**Examples of features:**
- Current word is capitalized?
- Previous label = B-LOC?
- Next word starts with capital?
- Word shape ("Xx")
- Suffix "-ity" etc.

CRF score for the sequence:

$$\text{score}(y, x) = \sum_{t,k} \lambda_k \, f_k(y_{t-1}, y_t, x, t)$$

Prediction:

$$\hat{y} = \arg\max_y P(y \mid x)$$

**Key idea:**  
The CRF can learn transitions like:

- B-LOC → I-LOC (good)  
- I-LOC → O (common)  
- B-LOC → B-PER (rare, so it gets a low weight)
- O → I-LOC (invalid under BIO, strongly penalized)

CRF chooses the best entire sequence, not token-by-token.
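
Scoring, normalizing, and decoding can be shown on this sentence by brute force. The weights are invented, only three labels are used, and every label sequence is enumerated, which is feasible only for toy inputs; real CRFs compute Z(x) and the argmax with dynamic programming.

```python
import math
from itertools import product

LABELS = ["B-LOC", "I-LOC", "O"]
# Hypothetical weights for transition features (previous label, current label).
TRANSITION_W = {
    ("B-LOC", "I-LOC"): 2.0, ("I-LOC", "I-LOC"): 1.0,
    ("I-LOC", "O"): 0.5, ("O", "O"): 1.0,
    ("O", "I-LOC"): -4.0, ("<start>", "I-LOC"): -4.0,
}
# Hypothetical weights for word-shape features, keyed by label.
CAPITALIZED_W = {"B-LOC": 1.0, "I-LOC": 1.0, "O": -1.0}
LOWERCASE_W = {"O": 1.5}


def score(labels, words):
    """Sum of weighted features over the whole sequence (the exponent in P(y|x))."""
    total, prev = 0.0, "<start>"
    for label, word in zip(labels, words):
        total += TRANSITION_W.get((prev, label), 0.0)
        shape_w = CAPITALIZED_W if word[0].isupper() else LOWERCASE_W
        total += shape_w.get(label, 0.0)
        prev = label
    return total


def decode(words):
    """Return the best label sequence and its probability P(y | x)."""
    candidates = list(product(LABELS, repeat=len(words)))
    z = sum(math.exp(score(y, words)) for y in candidates)  # Z(x)
    best = max(candidates, key=lambda y: score(y, words))
    return list(best), math.exp(score(best, words)) / z


labels, p = decode("New York City is busy".split())
print(labels)  # ['B-LOC', 'I-LOC', 'I-LOC', 'O', 'O']
```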


### HMM vs CRF (Key Differences)

| Aspect | Hidden Markov Model (HMM) | Conditional Random Field (CRF) |
|-------|----------------------------|--------------------------------|
| Model Type | Generative | Discriminative |
| What It Learns | P(y) and P(x \| y), which combine into P(x, y) | P(y \| x) directly |
| Assumption | Strong independence assumptions (each word depends only on its own tag) | No independence assumptions among observation features |
| Features | Limited: emissions must be simple (word, maybe morphology) | Arbitrary, overlapping features of the whole sentence |
| Label Bias Problem | No (it affects MEMMs, which normalize at each step) | No (normalizes over whole label sequences) |
| Sequence Consistency | Via transition probabilities; Viterbi decodes the whole sequence | Via transition features; the whole sequence is scored and normalized jointly |
| Typical Use | POS tagging (early), speech recognition (with acoustics) | NER, segmentation, chunking, POS tagging (modern) |
| Strength | Simple, fast, usable with small data | Usually more accurate; flexible feature design |
| Weakness | Cannot use rich features; weaker accuracy | Needs labeled data and more compute |

**Summary**  
- **HMM** models how words are *generated* from hidden states.  
- **CRF** models how labels should be assigned *given the whole sentence*.  
- CRFs outperform HMMs on most NLP sequence-labeling tasks.


### Latent Semantic Analysis (LSA)

**What it is**
- Statistical method that discovers hidden semantic structure in text
- Based on co-occurrence counts and **SVD (Singular Value Decomposition)**

**Key idea**
- Words that appear in similar contexts have similar meanings
- Compress the word–document matrix into a low-dimensional semantic space

**Math**
1. Build term-document matrix (TF–IDF)
2. Apply SVD:  M ≈ U Σ Vᵀ   (keep top k dimensions)
3. Use vectors from U/V as semantic embeddings


### Latent Semantic Analysis (LSA)

**Used for**
- Document similarity, clustering  
- Topic modeling (pre-LDA era)  
- Early summarization and search  
- Foundation for word embeddings

**Why it mattered**
- One of the first successful statistical methods for capturing semantic similarity  
- Precursor to word2vec/GloVe  

### Latent Dirichlet Allocation (LDA)

**What it is**
- A **probabilistic topic model**
- Assumes each document is a mixture of topics
- Each topic is a distribution over words

**Generative idea**
1. For each document, choose topic proportions (Dirichlet).
2. For each word:
   - Pick a topic from the doc’s topic proportions
   - Pick a word from that topic’s word distribution

**Key equations**
- Topic distribution per document:  θ_d ~ Dirichlet(α)
- Word distribution per topic:    φ_k ~ Dirichlet(β)
- Word generation:  z ~ Mult(θ_d),  w ~ Mult(φ_z)
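
The generative story can be simulated with the standard library alone. This sketch only samples documents from the model; it does not perform inference (fitting topics to real text). The vocabulary, hyperparameter values, and helper names are assumptions for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(phi, vocab, alpha, length, rng):
    theta = sample_dirichlet(alpha, rng)       # topic proportions for this document
    words = []
    for _ in range(length):
        z = sample_categorical(theta, rng)     # pick a topic for this word slot
        w = sample_categorical(phi[z], rng)    # pick a word from that topic
        words.append(vocab[w])
    return theta, words


rng = random.Random(0)
VOCAB = ["goal", "team", "match", "vote", "party", "election"]
NUM_TOPICS = 2
# One word distribution per topic, each drawn from Dirichlet(beta) with beta = 0.5.
phi = [sample_dirichlet([0.5] * len(VOCAB), rng) for _ in range(NUM_TOPICS)]
theta, doc = generate_document(phi, VOCAB, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(doc)
```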

**Why it’s useful**
- Discovers hidden themes in large text collections
- Each doc gets a vector of topic weights → useful for clustering, search, classification.

### Applications Powered by Statistical NLP

- Search engines → relevance ranking (TF–IDF, probabilistic IR)
- Spam filtering → Naive Bayes, SVM
- Speech recognition → HMM-based decoding
- Machine translation → IBM statistical models
- NER, POS tagging → HMM/CRF taggers
- Sentiment analysis → supervised classification

*(Scaled to real-world data for the first time.)*


### Strengths of Statistical NLP

- Learns from data → scalable
- Handles messy real-world text
- Manages ambiguity probabilistically
- Reduces manual rule writing
- Foundation for early web search & speech engines


### Limitations → Motivation for Deep Learning

- Heavy feature engineering required
- Sparse data → poor generalization
- Struggles with long-distance dependencies
- Requires labeled training data
- Words treated as symbols (no notion of meaning)

*(Set the stage for word embeddings and neural networks.)*


### How Deep Learning Eliminates Sparsity (Word-Level & Above)

**1. Word Embeddings → Fix sparsity above the word level**
- Before: each word = isolated symbol (Paris, London, Berlin unrelated)  
  → unseen word combinations = zero counts, brittle models  
- After: dense word vectors  
  → similar words cluster; models generalize to unseen phrases/sentences

**2. Subword/Character Embeddings → Fix sparsity at the word level**
- Before: OOV words = impossible to represent (probability = 0)  
- After: BPE/WordPiece split words into pieces (un-, -tion, -ing…)  
  → even unseen words have vectors; morphology becomes learnable

**Result:**  
Word embeddings handle semantic similarity, subwords handle OOV/morphology, and neural layers build contextual meaning → sparsity becomes a minor issue compared to statistical NLP.


### How Deep Learning Fixes “Words Treated as Symbols” (with definition of meaning)

**What is “meaning” in NLP?**  
- Meaning is relational: a word is defined by the contexts it appears in and its similarity to other words.

**Before (Statistical NLP):**
- Words were **IDs only**, no semantic relation (Paris ≠ London ≠ Berlin in the model).
- Only **primitive meaning signals** existed (POS, morphology), which tell *role* but not *true meaning*.
  - e.g., *dog* and *cat* both → noun (POS), but models cannot tell they are semantically close.

**After (Deep Learning):**
- **Word embeddings** learn relational meaning:
  - Similar contexts → nearby vectors (king–man + woman ≈ queen).
- **Contextual embeddings** give different meanings based on usage:
  - “bank” (river) vs “bank” (finance).
- **Subword embeddings** capture morphology and rare words:
  - “bioluminescence” → bio + lumi + nescence.

**Result:**  
Deep learning replaces symbolic word IDs with **rich semantic, relational, context-dependent representations**, providing real “meaning” instead of shallow syntactic categories.


# 4. Word Embeddings and Deep Learning

**Deep Learning NLP (2010s–2018)**

## Evolution

- **Word2Vec (2013)** — meaning as vectors
- **RNN/LSTM (2014–2016)** — sequential modeling
- **Attention (2014–2015)**
- **Transformer (2017)**

## Capabilities

- Better translation
- Better speech recognition
- Better sentiment, NER
- Far less manual feature engineering

<!-- slide -->

### Why We Need Word Embeddings

**Statistical NLP limitations:**
- Words treated as **IDs only** → no semantic similarity  
- Sparse data → unseen combinations get **zero counts**  
- Heavy reliance on **manual features** (suffixes, capitalization, POS)  
- No way to capture relationships like:  
  *Paris ≈ London ≈ Berlin*  
  *run ≈ jog ≈ sprint*

**Core problem:**  
The model sees no relation between similar words — all are isolated symbols.

**Motivation for embeddings:**  
We need a **dense, learned, continuous representation** so that:
- similar words are near each other,
- rare/unseen words still get meaningful vectors,
- models can generalize beyond observed counts.


<!-- slide: subslide -->

### What Is a Word Embedding?

**Definition:**  
A *word embedding* is a **dense vector** (typically 50–300 dimensions) that represents a word based on **its context and usage**.

**Key properties:**
- Similar words → **nearby vectors**  
  - *Paris*, *London*, *Berlin* cluster together  
  - *king – man + woman ≈ queen*
- Encodes **semantic** and **syntactic** information  
- Learned automatically from large corpora (distributional hypothesis)

**Conceptual picture:**
- One-hot:  
  `[0 0 0 ... 1 ... 0]`  → no relationship
- Embedding:  
  `[0.39, -0.12, 0.85, ...]` → captures meaning via geometry
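
"Meaning via geometry" usually means cosine similarity between vectors. The three-dimensional dense vectors below are made up for illustration; real embeddings have 50–300 learned dimensions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot: every pair of distinct words is orthogonal.
ONE_HOT = {"paris": [1, 0, 0], "london": [0, 1, 0], "banana": [0, 0, 1]}
# Made-up dense vectors in which the two cities point in similar directions.
DENSE = {
    "paris": [0.90, 0.80, 0.10],
    "london": [0.85, 0.75, 0.20],
    "banana": [0.10, 0.05, 0.90],
}

print(cosine(ONE_HOT["paris"], ONE_HOT["london"]))  # 0.0
print(cosine(DENSE["paris"], DENSE["london"]) > cosine(DENSE["paris"], DENSE["banana"]))  # True
```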

**Why it matters:**  
Embeddings transform language from **symbolic** to **semantic**, enabling neural networks to learn meaning instead of memorizing patterns.


<!-- slide -->

### Timeline of Word Embeddings (2000–2017)

**2000–2010 — Neural Language Models (early DL)**  
- Bengio et al. (2003): neural LM with learned embeddings  
- Too slow to scale, but introduced the core idea: **embedding matrix + neural prediction**

**2013 — Word2Vec (Mikolov et al.)**  
- Skip-Gram and CBOW  
- Efficient training on billions of tokens  
- Sparked the modern embedding revolution

**2014 — GloVe (Pennington et al.)**  
- Global co-occurrence matrix factorization  
- Complement to window-based Word2Vec

**2016–2017 — fastText (Facebook AI Research)**  
- Subword/character n-gram embeddings  
- Solved OOV and morphology limitations

**2015–2017 — Contextual Precursor Models**  
- Bi-LSTM language models, the architecture behind ELMo (2018)  
- Bi-LSTM + embeddings → context-dependent meaning


<!-- slide -->

### How Word Embeddings Are Trained (Intuition)

**Key idea: Distributional Hypothesis**  
“You shall know a word by the company it keeps.” (J. R. Firth, 1957)

Words appearing in similar contexts → acquire similar vectors.

Two major training styles:

**1. Skip-Gram (Predict context given a word)**  
- Input: “bank”  
- Predict nearby words: *river, water, money, loan*  
- Objective: words with similar neighborhoods → similar vectors

**2. CBOW (Continuous Bag of Words — Predict the word from its context)**  
- Input: “the *?* is flowing fast”  
- Predict: *river*

**Common idea:**  
Train on massive corpora → embeddings emerge automatically from co-occurrence behavior.


<!-- slide -->

### Skip-Gram (Window Size = 2)

Sentence: **The quick brown fox jumps over the lazy dog**  
Indexes:  
1:The  2:quick  3:brown  4:fox  5:jumps  6:over  7:the  8:lazy  9:dog

Window size = **2**

#### Center = “fox” (position 4)
Context positions = **2, 3, 5, 6**

Training pairs:
- (fox → quick)
- (fox → brown)
- (fox → jumps)
- (fox → over)

#### Center = “jumps” (position 5)
Context positions = **3, 4, 6, 7**

Training pairs:
- (jumps → brown)
- (jumps → fox)
- (jumps → over)
- (jumps → the)

**Summary:**  
Skip-Gram predicts **context words from the center word**.
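
Generating the (center → context) pairs is a short loop. `skipgram_pairs` is a hypothetical helper name; this sketch covers only pair extraction, not training.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


TOKENS = "The quick brown fox jumps over the lazy dog".split()
fox_pairs = [pair for pair in skipgram_pairs(TOKENS) if pair[0] == "fox"]
print(fox_pairs)
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```

Words near the sentence edges get fewer pairs because the window is clipped.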


<!-- slide -->

### CBOW (Window Size = 2)

Sentence: **The quick brown fox jumps over the lazy dog**  
Indexes:  
1:The  2:quick  3:brown  4:fox  5:jumps  6:over  7:the  8:lazy  9:dog

Window size = **2**

#### Center = “fox” (position 4)
Context positions = **2, 3, 5, 6**

- Input: **{quick, brown, jumps, over}**  
- Predict: **fox**

#### Center = “jumps” (position 5)
Context positions = **3, 4, 6, 7**

- Input: **{brown, fox, over, the}**  
- Predict: **jumps**

**Summary:**  
CBOW predicts the **center word from surrounding context words**.
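
The matching (context → center) examples can be built the same way. `cbow_examples` is a hypothetical helper name, and again only the data preparation is shown.

```python
def cbow_examples(tokens, window=2):
    """Return (context words, center word) for every position in the sentence."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        examples.append((left + right, target))
    return examples


TOKENS = "The quick brown fox jumps over the lazy dog".split()
context, target = cbow_examples(TOKENS)[3]
print(context, "->", target)
# ['quick', 'brown', 'jumps', 'over'] -> fox
```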


<!-- slide -->

### Why Word Embeddings Exploded Around 2013

**1. Data got big enough**  
- Web-scale corpora (billions of words) became available  
- Finally enough signal to learn stable semantic patterns

**2. Compute got fast enough**  
- Multicore CPUs + GPUs made vector math cheap  
- Training neural models became practical (hours, not weeks)

**3. Algorithms became efficient**  
- Word2Vec introduced **negative sampling** + **subsampling**  
- Shallow SG/CBOW architectures scaled to huge vocabularies

**Result:**  
Embeddings became fast, scalable, and meaningful — enabling modern neural NLP.


<!-- slide -->

### Why Word Embeddings Mattered (The Real Breakthrough)

**Embeddings finally showed *true semantic structure*.**  
This was the moment the NLP community realized neural methods worked.

**Shocking results at the time:**
- *king – man + woman ≈ queen*  
- *Paris ≈ London ≈ Berlin*  
- *walk, walking, walked* clustering naturally

**What made this transformative:**
- **Linear analogies** emerged from geometry  
- **High-quality nearest neighbors** (semantic and syntactic)  
- **Smooth, meaningful clusters** across massive vocabularies  

Earlier methods (LSA, early neural LMs) hinted at similarity,  
but **never at this clarity, scale, or consistency**.

**Impact:**  
This single discovery convinced researchers that  
**dense vectors encode real meaning**, launching modern neural NLP.


<!-- slide -->

### GloVe: What It Adds on Top of Word2Vec

**What Word2Vec uses:**  
- Local context windows (Skip-Gram / CBOW)  
- Learns meaning from *predicting* nearby words  
- Great semantic clusters but only local statistics

---

### **What GloVe adds: Global Co-Occurrence Information**
- Builds a **global word–word co-occurrence matrix**  
- Learns embeddings by **factorizing** it with a weighted least-squares loss  
- Embeddings capture **ratios** of probabilities  
  - e.g., P(solid | ice) / P(solid | steam) is large, while P(gas | ice) / P(gas | steam) is small  
- Encodes semantic structure that **local windows miss**
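
The weighted least-squares loss for a single word pair can be sketched as below. The defaults `x_max=100` and `alpha=0.75` are the values reported in the GloVe paper; the function names and the example vectors are assumptions for illustration, and the full objective sums this term over all co-occurring pairs.

```python
import math


def glove_weight(count, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the influence of very frequent ones."""
    return (count / x_max) ** alpha if count < x_max else 1.0


def glove_pair_loss(w_i, w_j, b_i, b_j, count):
    """f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2 for one co-occurring pair."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    error = dot + b_i + b_j - math.log(count)
    return glove_weight(count) * error ** 2


# The loss is zero when the dot product plus biases equals the log co-occurrence count.
print(glove_pair_loss([1.0, 0.0], [math.log(50), 0.0], 0.0, 0.0, count=50))  # 0.0
print(glove_pair_loss([0.0, 0.0], [0.0, 0.0], 0.0, 0.0, count=50) > 0)       # True
```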

---

### **How GloVe + Word2Vec connect**
- Word2Vec = **local prediction**  
- GloVe = **global co-occurrence factorization**  
- Both produce dense vectors, but GloVe injects **global corpus structure** into the embedding space

**Result:**  
More stable embeddings and better analogy structure on some tasks.


<!-- slide -->

### fastText: What It Adds on Top of Word2Vec

**What Word2Vec lacks:**  
- Treats each word as a whole symbol  
- Fails on **OOV** (out-of-vocabulary) words  
- Weak on morphology (walk, walking, walker)

---

### **What fastText adds: Subword / Character n-grams**
- Splits each word into character-level n-grams  
  - “playing” → `<pl`, `pla`, `lay`, `ayi`, `yin`, `ing`, `ng>` (`<` and `>` mark word boundaries), plus longer n-grams such as `play`  
- Embedding = **sum of subword embeddings**  
- Even unseen words are representable (OOV solved)  
- Morphology becomes learnable automatically
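
A simplified sketch of the subword idea: extract character n-grams with boundary markers, then sum their vectors. The real library also adds a vector for the whole word and hashes n-grams into a fixed number of buckets; both are omitted here. The n-gram vectors are made-up numbers and the function names are hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>', where < and > mark the word boundaries."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of all known n-grams; unknown n-grams contribute nothing."""
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for k, value in enumerate(ngram_vectors.get(gram, [0.0] * dim)):
            vec[k] += value
    return vec


print(char_ngrams("playing", 3, 3))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']

# Made-up vectors for two n-grams; an unseen word still gets a non-zero vector.
NGRAM_VECTORS = {"pla": [1.0, 0.0], "ing": [0.0, 1.0]}
print(word_vector("planting", NGRAM_VECTORS, dim=2))  # [1.0, 1.0]
```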

---

### **How fastText + Word2Vec connect**
- fastText still uses **Skip-Gram / CBOW**  
- But replaces “word as a single token” with  
  **word = bag of character n-grams**
- Keeps Word2Vec’s training idea  
- Extends it to handle **rare words, morphology, OOV**

**Result:**  
Word2Vec semantics + subword structure = robust, generalizable embeddings.


<!-- slide -->

### Embedding Dimensions: Typical Sizes & What They Mean

#### **Typical Dimensions**
| Model       | Common Dim |
|-------------|------------|
| **Word2Vec** | 100–300 |
| **GloVe**     | 50–300 |
| **fastText**  | 100–300 |
| **Contextual Models (BERT/GPT)** | 768–12k+ |

**Why 100–300?**  
- Big enough to encode rich semantic structure  
- Small enough to train efficiently  
- Empirically a good tradeoff for static embeddings

---

### **Do dimensions have actual meaning?**

**Not individually**  
- No single dimension corresponds cleanly to a topic or concept
- You cannot point to dimension **#47** and say “this is the *animal* dimension”  

**Think of it like a coordinate system:**  
Individual axes don’t mean anything alone,  
but **distances and directions** encode meaningful relationships.



<!-- slide -->

### From Embeddings to Deep Neural Models

Once words are represented as dense vectors, they can be fed into neural networks:

**1. RNNs (Recurrent Neural Networks)**  
- Read tokens sequentially  
- Capture short-range dependencies  
- Still struggled with long-distance context

**2. LSTMs / GRUs**  
- Improved memory  
- Enabled early neural MT, tagging, classification

**3. CNNs for Text**  
- Extract local n-gram patterns  
- Worked well for sentiment and classification

**Embeddings → enable neural networks to reason over meaning,  
not just symbols or counts.**

This sets the stage for:
- Attention (2014–2015)  
- Transformers (2017)  
- Modern LLMs


# 5. Transformers and the LLM Revolution


## 5.1 Self-Attention Mechanism

*[Content to be filled]*

- The attention mechanism
- Query, Key, Value (QKV) framework
- Multi-head attention
- Why attention is powerful


## 5.2 Transformer Architecture

*[Content to be filled]*

- Encoder-decoder architecture
- Positional encoding
- Layer normalization and residual connections
- Feed-forward networks


## 5.3 Major Transformer Models

*[Content to be filled]*

- **BERT**: Bidirectional Encoder Representations
- **GPT**: Generative Pre-trained Transformer
- **T5**: Text-to-Text Transfer Transformer
- Other notable models


# 6. Modern LLM Applications


## 6.1 Core NLP Tasks

*[Content to be filled]*

- **Text Summarization**: Extractive and abstractive
- **Question Answering**: Reading comprehension
- **Sentiment Analysis**: Understanding emotions and opinions
- **Named Entity Recognition**: Identifying entities in text


## 6.2 Advanced Applications

*[Content to be filled]*

- **AI Agents**: Autonomous systems with NLP capabilities
- **Conversational AI**: Chatbots and dialogue systems
- **Code Generation**: Programming with natural language
- **Multimodal Systems**: Text, images, and beyond


## 6.3 Training Paradigms

*[Content to be filled]*

- **Fine-tuning**: Adapting pre-trained models to specific tasks
- **Instruction-tuning**: Training models to follow instructions
- **In-context learning**: Few-shot and zero-shot capabilities


## 6.4 Retrieval-Augmented Generation (RAG)

*[Content to be filled]*

- Combining retrieval with generation
- Knowledge bases and vector databases
- Reducing hallucinations and improving accuracy
- Practical applications


# Summary and Key Takeaways

*[Content to be filled]*

- Evolution from rules to LLMs
- Key concepts and techniques
- Modern applications and future directions


# Questions?

---

Thank you!
