### Introduction to Natural Language Processing (NLP)
#### *from classical methods to modern LLMs*

**Part I: Lecture (1.5 hours)**

___

### Speaker: Jeffrey Luo, Ph.D.


## About me

[LinkedIn](https://www.linkedin.com/in/zhixiang-jeffrey-luo-70850124/) | [Google Scholar](https://scholar.google.com/citations?user=gzyW_GUAAAAJ&hl=en)

**Ph.D. in Physics**, UNC Chapel Hill  
**B.S. in Mathematics and Physics**, Tsinghua University
### Experience
- **Workday**: LLM-based agentic systems for chat, search, and automation
- **Gradient AI**: Deep learning models for health insurance underwriting
- **Wolters Kluwer**: ML models for infectious disease prediction 
- **T2 Biosystems**: NMR systems for sepsis detection
- **Schlumberger**: NMR study of porous materials

### Hobby Project
**OrcaEcho.ai**: AI assistant for presentation creation and editing  
[Google Slides Add-on](https://workspace.google.com/marketplace/app/orcarina/404235414546) | [Website](https://orcaecho.ai)



# Part I: Lecture Overview

1. **What is NLP and Why It Matters**
2. **Rule-Based NLP (1950s–1980s)**
3. **Statistical NLP (1990s–2010s)**
4. **Word Embeddings and Deep Learning (2010s–2018)**
5. **Transformers and the LLM Revolution (2018–today)**
6. **Modern LLM Applications**


# 1. What is NLP and Why It Matters


## 1.1 What is NLP?

**Natural Language Processing (NLP)** is a field of AI focused on enabling computers to understand, interpret, generate, and interact with human language.

### Some common things powered by NLP:

- **Translation** (e.g., English → Chinese)
- **Chatbots & assistants**
- **Speech-to-text / text-to-speech**
- **Summarization**
- **Sentiment analysis** (detecting positive/negative tone)
- **Search and question-answering**

NLP blends **linguistics**, **computer science**, and **machine learning** so that machines can handle language in a useful way.


## 1.2 Evolution of NLP Approaches

### Before Computers: Foundation (Ancient Times - 1950s)

- **Linguistics**: analysis of grammar, syntax, semantics, sentence structure, and meaning
- **Mathematical logic and formal models**: systems that describe language in precise, machine-like terms, such as formal grammars and symbolic logic

### After Computers: NLP Timeline

**1954**: First machine translation experiment (Georgetown-IBM)

**1960s–90s**: Rule-based language systems  
→ Translation, parsing algorithms, computational linguistics

**2000s–2010s**: Statistical and neural models  
→ Probabilistic approaches, early neural networks

**2018+**: Transformers and modern LLMs  
→ BERT, GPT, and the transformer revolution


## 1.3 Major NLP Applications

| **Application Category**                   | **Examples & Notes**                                      |
|--------------------------------------------|------------------------------------------------------------|
| **Machine Translation (1950s–)**           | Georgetown-IBM demo; Google Translate, DeepL              |
| **Search & Information Retrieval (1960s–)**| SMART system; AltaVista; Google Search                    |
| **Spell/Grammar & Text Classification (1980s–)** | WordPerfect, spam filters; Grammarly                |
| **Speech Recognition (1990s–)**            | Dragon Dictate; Siri, Alexa, Google Assistant             |
| **Named Entity & Information Extraction (1990s–)** | MUC NER systems; news/event extraction          |
| **Chatbots & QA Systems (2000s–)**         | TREC QA, IBM Watson; ChatGPT, Claude                      |
| **Text Generation & LLMs (2018–)**         | GPT series, Claude; rewriting, summarization              |
| **AI Agents & Tool Use (2023–)**           | LLMs executing actions via APIs/tools                     |


# 2. Rule-Based NLP (1950s–1980s)

## What Is Rule-Based NLP? (1950s–1980s)

Early NLP relied on **manually written rules** from linguists.

- **Symbolic, deterministic systems**
- **No machine learning or statistical models**
- Worked only in **narrow, controlled domains**

**Example rule:** "If a sentence starts with 'Who', classify it as a question."


### Language Basics from School

**Sentence Structure:**
- Sentences have parts: subject, verb, object
- Words group into phrases: noun phrase (NP), verb phrase (VP)

**Parts of Speech (POS):**
- Noun, verb, adjective, adverb
- Prepositions, conjunctions

You already identify these intuitively.


### Language Basics from School

**Morphology:**
- Plural forms: cat → cats
- Verb tenses: walk → walked
- Irregulars: go → went, mouse → mice

**Common Patterns:**
- "Who…?" → question
- "I feel X" → emotional statement
- "If…then…" → condition

These everyday language concepts form the foundation of early NLP.


### How Rule-Based NLP Extended Linguistic Concepts

**1. What NLP Inherited from Linguistics**

- Sentence structure → NP, VP, subject, object
- Parts of speech → nouns, verbs, adjectives, etc.
- Morphology concepts → plurals, tenses, irregulars
- Lexicon (linguistic sense) → words + meanings

*(These are the same concepts students learn in school grammar.)*

**2. What NLP Extended into Machine-Readable Form**

- **Formal grammars:** Phrase structure rules rewritten as explicit production rules (e.g., S → NP VP, NP → Det N)
- **Computational lexicon:** Structured entries storing POS, forms, syntactic frames, semantic features

*(Linguistic ideas made precise and explicit for computer use.)*


### How Rule-Based NLP Extended Linguistic Concepts

**3. What NLP Created Anew for Computation**

- **Morphological rule engines:** Algorithms for generating/analyzing word forms, handling exceptions
- **Parsing algorithms:** Automatic construction of parse trees from grammar rules

*(These are engineering mechanisms that did not exist in pure linguistics.)*

**4. Pattern Rules (New for Early NLP Applications)**

- IF text matches pattern → THEN respond or extract
- *Example: "I feel X" → "Why do you feel X?"*
- Enabled early chatbots (ELIZA) and information extraction
- Not derived from linguistics; purely application-driven
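The pattern-rule idea above can be sketched in a few lines of Python. This is only an illustration in the spirit of ELIZA, not its actual script: the `RULES` table, the `respond` helper, and the fallback reply are all hypothetical names chosen for this example.

```python
import re

# Hypothetical hand-written rules: (pattern, response template).
RULES = [
    # "I feel X" -> "Why do you feel X?"
    (re.compile(r"^i feel (.+?)[.!]?$", re.IGNORECASE), "Why do you feel {0}?"),
    # "Who ...?" -> classify as a question about a person
    (re.compile(r"^who\b.*\?$", re.IGNORECASE), "That is a question about a person."),
]


def respond(text: str) -> str:
    """Return the response of the first matching rule, else a fallback."""
    cleaned = text.strip()
    for pattern, template in RULES:
        match = pattern.match(cleaned)
        if match:
            # Captured groups (if any) are substituted into the template.
            return template.format(*match.groups())
    return "Please tell me more."


print(respond("I feel sad."))
print(respond("The weather is nice"))
```

Any input that no rule covers falls through to the fallback, which is the brittleness that rule-based systems were known for.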


## Classic Systems (ELIZA, SHRDLU, MT)

**ELIZA (1966): Shallow rules**

- Pattern matching; no understanding
- *Example: "I feel sad." → "Why do you feel sad?"*

**SHRDLU (1970s): Deep rules in a tiny world**

- Real parsing + reasoning, but only in the "blocks world"
- *Example: "Put the red block on the green cube."*

**Rule-Based MT (SYSTRAN): Large-scale rules**

- Thousands of grammar rules; used in real translation systems
- Expensive to maintain; brittle outside covered patterns


## Strengths, Weaknesses & Why It Ended

**Strengths:**
- Interpretable, linguistically grounded
- Reliable in narrow domains

**Weaknesses:**
- Brittle, unscalable
- Huge rule sets, fails on new inputs
- Could not handle ambiguity or real-world language variation

**Shift in 1990s:** Rise of digital text + computing power enabled statistical NLP  
Data-driven models outperformed handcrafted rules


# 3. Statistical NLP (1990s–2010s)


### What Statistical NLP Built On

- Parts of speech → still needed for tagging
- Phrase structure → still used for parsing
- Lexicon concepts → still needed for word categories
- Morphology → still informs word forms
- Linguistic features → used as input signals (suffixes, capitalization)

*(Kept linguistic structure, but learned patterns instead of hand-writing them.)*


### Data-Driven Learning

From hand-written rules → data-driven models.

- Learn patterns from corpora instead of experts
- Use probabilities to handle ambiguity
- Built the foundation for modern NLP applications

*Example:* "What word comes next?" → choose the most probable one from data.


### Key Corpora Fueling Statistical NLP

- **POS Tagging:** Penn Treebank WSJ (PTB-POS), Brown Corpus

- **Parsing:** PTB phrase-structure trees, PropBank / FrameNet (predicate–argument)

- **NER:** CoNLL-2003 (EN/DE), MUC, ACE

- **Machine Translation:** IBM Canadian Hansard (EN–FR), EuroParl, early WMT corpora

- **Language Modeling:** PTB LM split, Gigaword, Google N-grams

- **Text Classification:** Reuters-21578, 20 Newsgroups, TREC QA

- *(Data availability enabled empirical training and benchmarking.)*


### Why Penn Treebank Became the Backbone

- **High-quality annotation:** POS tags, phrase-structure trees, consistent WSJ text

- **Used across tasks:** POS tagging, parsing, LM, syntactic features for NER/QA

- **Right size for 1990s–2000s compute:** large enough for statistics, small enough to train

- **Standard benchmarks:** shared splits enabled reproducible research

- **Reliable source (LDC @ Penn):** clear licensing, consistent versioning

→ Became the de-facto shared dataset for statistical NLP


### Penn Treebank Example (Data Backbone)

A PTB item pairs a sentence with its phrase-structure tree and POS tags.

```
(S
  (NP-SBJ (DT The) (NN company))
  (VP (VBD said)
      (SBAR
        (S (NP-SBJ (PRP it))
           (VP (MD would)
               (VP (VB cut)
                   (NP (NNS costs))
                   (PP (IN by)
                       (NP (CD 10) (NN percent)))))))
  (. .))
```
Raw sentence: *The company said it would cut costs by 10 percent.*

- POS tags: DT, NN, VBD, PRP …
- Phrase labels: NP, VP, PP, SBAR, S
- Bracketed tree = hallmark PTB format

*(Statistical models learned probabilities from thousands of such trees.)*
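To make the bracketed format concrete, here is a minimal sketch that pulls the (word, POS tag) pairs out of the tree shown above using only a regular expression. The `tagged_words` helper is a hypothetical name for this example, and the regex assumes every leaf has the simple form `(TAG word)`; a full PTB reader would need a proper recursive parser.

```python
import re

PTB_TREE = """
(S
  (NP-SBJ (DT The) (NN company))
  (VP (VBD said)
      (SBAR
        (S (NP-SBJ (PRP it))
           (VP (MD would)
               (VP (VB cut)
                   (NP (NNS costs))
                   (PP (IN by)
                       (NP (CD 10) (NN percent)))))))
  (. .))
"""

# A leaf is "(TAG word)" where neither part contains spaces or parentheses.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def tagged_words(tree: str) -> list[tuple[str, str]]:
    """Return (word, POS tag) pairs in sentence order."""
    return [(word, tag) for tag, word in LEAF.findall(tree)]


pairs = tagged_words(PTB_TREE)
print(pairs[:3])
print(" ".join(word for word, _ in pairs))
```

Counting how often each tag follows another, or how often each word carries each tag, over thousands of such trees is the raw material for the probabilistic models in this section.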


### Key concepts and assumptions
**1. Conditional probability**

Language is unpredictable → we model how likely each word or label is.  
P(A|B) = "probability of A given B"

**Example:** P("morning" | "good") >> P("hippopotamus" | "good")

**2. Word frequency**

How often something appears in a corpus

**Example:**

If "San Francisco" appears 10,000 times and "San Jose" appears 5,000 times, the model learns "San Francisco" is more common.

**3. Local Dependencies (Markov Idea)**

The next word depends mostly on a few previous words.

**Example:** P("you" | "thank") is high, P("saucepan" | "thank") is near zero


### Key concepts and assumptions

**4. Features from Linguistic Clues**

Models use signals from the text to make decisions.

**Examples of features:**

- word suffix "-ed" → likely past tense
- capitalized → maybe a name
- previous word = "Mr." → next likely PER
- contains digits → maybe a date/number

**These features feed:** Naive Bayes, MaxEnt classifiers, CRFs
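As a rough sketch of what "features" means in practice, the function below turns one token into a small dictionary of binary signals matching the examples above. The function name `token_features` and the feature names are hypothetical; real taggers typically use many more features.

```python
def token_features(tokens: list[str], i: int) -> dict[str, bool]:
    """Binary features for the token at position i (illustrative only)."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    return {
        "suffix_ed": word.lower().endswith("ed"),       # likely past tense
        "is_capitalized": word[:1].isupper(),           # maybe a name
        "prev_is_mr": prev_word == "Mr.",               # next token likely PER
        "has_digit": any(ch.isdigit() for ch in word),  # maybe a date/number
    }


sentence = ["Mr.", "Smith", "walked", "in", "1999"]
for position, token in enumerate(sentence):
    print(token, token_features(sentence, position))
```

A classifier such as Naive Bayes or MaxEnt then learns from labeled data how much weight each of these signals deserves.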

**5. Joint Decision for Sequences**

For tasks like POS tagging or NER, labels influence each other.

**Example:** "New York City" → all three tokens should be labeled as LOCATION


### Probabilistic Models

- **n-grams** for language modeling
- **Hidden Markov Models (HMMs)** for POS tagging & speech recognition
- **Maximum Entropy models** for classification
- **Conditional Random Fields (CRFs)** for sequence labeling (NER, segmentation)

*(Uncertainty is modeled mathematically.)*


### n-gram Language Models

- **What**: Count-based probabilistic models over short word sequences, **P(next word | previous n-1 words)**
- **Key concept**: Markov assumption (the next word depends only on the previous *n−1* words); smoothed conditional probabilities
- **Used for**: Language modeling, predictive text, decoding in speech recognition and MT
- **Why it works**: Converts large corpora into likelihoods that capture dominant local usage patterns
- **Builds on**: Extends rule-based frequency tables with statistical estimation and smoothing learned directly from corpora


### How n-gram Assigns Probability to a Sentence

**Sentence Probability = Product of Local Conditional Probabilities**

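Using the chain rule together with the Markov assumption, the probability of a sentence is approximated as:

$$P(w_1, w_2, \ldots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$

(At the start of the sentence the context is simply shorter, e.g. $P(w_1)$, then $P(w_2 \mid w_1)$.)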

**Example (bigram):**

Sentence: P("I want to eat pizza") = *P(I) × P(want|I) × P(to|want) × P(eat|to) × P(pizza|eat)*

**Interpretation**

- Each conditional probability measures how "natural" that local phrase is
- Multiplying them gives an overall fluency score for the whole sentence
- High score → common, natural sentence
- Very low score → unnatural or incorrect sentence

*(This is the core job of a language model.)*
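A minimal sketch of the bigram computation above, assuming a three-sentence toy corpus and plain maximum-likelihood counts with no smoothing. The names `CORPUS`, `train_bigram`, and `sentence_probability` are hypothetical, and the first word is conditioned on an assumed start marker `<s>` rather than on a separate unigram probability.

```python
from collections import Counter

# Toy corpus, invented for illustration.
CORPUS = [
    "i want to eat pizza",
    "i want to sleep",
    "you want to eat pasta",
]


def train_bigram(sentences):
    """Count contexts and bigrams; <s> marks the sentence start."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # (previous, next) pairs
    return contexts, bigrams


def sentence_probability(sentence, contexts, bigrams):
    """Product of P(word | previous word), estimated by relative frequency."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob


contexts, bigrams = train_bigram(CORPUS)
print(sentence_probability("i want to eat pizza", contexts, bigrams))
print(sentence_probability("pizza eat to want i", contexts, bigrams))
```

The scrambled sentence scores zero because it contains bigrams never seen in the corpus, which is exactly the sparsity problem that smoothing is meant to soften.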


### How n-gram Probabilities Add Value in Real Applications

**n-gram LM = Fluency/Correctness Scorer**

| Application | What Creates Candidates | What n-gram Adds | Result |
|------------|------------------------|------------------|--------|
| **Spell Checking** | String-similarity corrections (receive / revise / relieve) | Scores each candidate in sentence context | Picks correct word in context |
| **Speech Recognition** | Acoustic model generates many possible word sequences | Scores each sequence by fluency | Picks the natural sentence, not just sound match |
| **Machine Translation** | Translation model proposes literal or alternative translations | Scores target-language fluency | Picks most natural translation |
| **Predictive Text** | All words in vocabulary | Predictive next-word probabilities | Produces useful next-word suggestions |

**Key Insight**

Across all tasks, n-gram LM is the component that ensures the output "sounds like real language."


### Hidden Markov Models (HMMs)

- **What**: Generative sequence models with hidden state chains emitting observed words or acoustic frames
- **Key concept**: Transition matrix over tags/states + emission probabilities; decoded efficiently via Viterbi
- **Used for**: POS tagging, speech recognition, shallow parsing
- **Why it works**: Separates latent linguistic structure from surface tokens and models uncertainty end-to-end
- **Builds on**: Adds probabilistic state transitions on top of n-gram assumptions and reuses linguistic tagsets from rule-based systems


### How Hidden Markov Model Works

An HMM involves two sequences: the observed word sequence $x = (x_1, x_2, ..., x_T)$ and a hidden state sequence $s = (s_1, s_2, ..., s_T)$ that represents some property of each word (for example, its POS tag).

The joint probability of the state sequence and the sentence is
$$P(s, x) = P(s_1) \cdot \prod_{t=2}^{T} P(s_t \mid s_{t-1}) \cdot \prod_{t=1}^{T} P(x_t \mid s_t)$$

Where

- $P(s_1)$: probability of the starting state
- $P(s_t \mid s_{t-1})$: transition probability
- $P(x_t \mid s_t)$: emission probability  

In training, the transition and emission probabilities are estimated from annotated data.

At prediction time, those probabilities score each candidate state sequence $s$ for the observed $x$, and the highest-scoring sequence is selected.


### Example of P(s, x) for POS Tagging

**Sentence:** `The dog barks`  
**Hidden states (POS):**  
$s = (DT, NN, VBZ)$  
**Observed words:**  
$x = (The, dog, barks)$

$$P(s, x) = P(DT) \cdot P(NN \mid DT) \cdot P(VBZ \mid NN) \cdot P(The \mid DT) \cdot P(dog \mid NN) \cdot P(barks \mid VBZ)$$

**Meaning:**
- Probability of starting in DT  
- Probability that NN follows DT  
- Probability that VBZ follows NN  
- Probability each word is emitted by its tag  

Multiply all terms → the joint probability of generating both the tag sequence and the sentence under the HMM.
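The product above can be checked numerically with a small sketch. All probabilities in `START`, `TRANS`, and `EMIT` are made up for illustration, and `best_tags` searches every tag sequence by brute force, which is only feasible for tiny inputs; practical taggers use Viterbi decoding instead.

```python
from itertools import product

# Made-up probabilities for illustration only.
START = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
    "VBZ": {"DT": 0.5, "NN": 0.4, "VBZ": 0.1},
}
EMIT = {
    "DT": {"the": 0.7},
    "NN": {"dog": 0.1, "barks": 0.01},
    "VBZ": {"barks": 0.05},
}


def joint_probability(tags, words):
    """P(s, x) = P(s_1) * prod P(s_t | s_{t-1}) * prod P(x_t | s_t)."""
    prob = START[tags[0]] * EMIT[tags[0]].get(words[0], 0.0)
    for t in range(1, len(words)):
        prob *= TRANS[tags[t - 1]][tags[t]] * EMIT[tags[t]].get(words[t], 0.0)
    return prob


def best_tags(words):
    """Brute-force search over all tag sequences (stand-in for Viterbi)."""
    candidates = product(START, repeat=len(words))
    return max(candidates, key=lambda tags: joint_probability(tags, words))


words = ["the", "dog", "barks"]
print(joint_probability(("DT", "NN", "VBZ"), words))
print(best_tags(words))
```

With these toy numbers the sequence DT, NN, VBZ beats DT, NN, NN because the NN to VBZ transition and the VBZ emission of "barks" are both more likely than the noun alternatives.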


### Maximum Entropy (Log-Linear) Models

- **What**: Conditional probability models that weight arbitrary linguistic features to predict labels
- **Key concept**: Choose weights that maximize entropy subject to empirical feature expectations (a flexible logistic regression)
- **Used for**: Text classification, POS/NER taggers, feature-rich decision components in MT pipelines
- **Why it works**: Combines many overlapping signals without independence assumptions; learns optimal weights from data
- **Builds on**: Takes handcrafted linguistic cues from earlier systems but replaces manual rules with data-driven weight learning


### Conditional Random Fields (CRFs)

- **What**: Discriminative sequence models that directly learn $P(y \mid x)$ over label sequences
- **Key concept**: Log-linear factors over adjacent labels and rich features spanning the whole sentence; avoids label bias
- **Used for**: NER, segmentation, POS, bioNLP entity/event extraction
- **Why it works**: Supports arbitrary, overlapping features while enforcing global consistency across the sequence
- **Builds on**: Extends Maximum Entropy to structured outputs and replaces HMM generative assumptions with conditional training


### Math of a Linear-Chain Conditional Random Field (CRF)

A CRF models the conditional probability of a label sequence **y** given the entire observation sequence **x**:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right)$$

**Components**

- $f_k(\cdot)$: feature functions  
  (e.g., "is the word capitalized?", "previous label = B-LOC?")
- $\lambda_k$: learned weights for each feature
- $Z(x)$: normalization term (ensures probabilities sum to 1)

**Interpretation**

- The model scores a whole **label sequence** using rich features.
- Then normalizes over *all possible label sequences*.
- Training = learn weights $\lambda_k$ that best fit data.


### Example of CRF for NER (Named Entity Recognition)

**Sentence:**  `New York City is busy`

**Labels (BIO scheme):**  `B-LOC   I-LOC   I-LOC   O   O`

CRF uses **features across words and labels**:

**Examples of features:**
- Current word is capitalized?
- Previous label = B-LOC?
- Next word starts with capital?
- Word shape ("Xx")
- Suffix "-ity" etc.

CRF score for the sequence: $\text{score}(y, x) = \sum_{t,k} \lambda_k \, f_k(y_{t-1}, y_t, x, t)$

Prediction: $\hat{y} = \arg\max_y P(y \mid x)$

CRF chooses the best entire sequence, not token-by-token.
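The scoring and normalization steps can be sketched with a toy linear-chain CRF. The feature names and the hand-set values in `WEIGHTS` are hypothetical (a real CRF learns them from labeled data), and both the normalizer Z(x) and the argmax are computed by brute-force enumeration, which only works for very short sentences.

```python
import math
from itertools import product

LABELS = ["B-LOC", "I-LOC", "O"]

# Hypothetical hand-set weights (lambda_k); a trained CRF would learn these.
WEIGHTS = {
    "cap_and_entity": 2.0,
    "lower_and_O": 2.0,
    "I_after_entity": 1.0,
    "I_without_entity": -5.0,
    "B_after_entity": -1.0,
}


def features(prev, label, words, t):
    """Binary feature functions f_k(y_{t-1}, y_t, x, t)."""
    cap = words[t][0].isupper()
    inside = prev in ("B-LOC", "I-LOC")
    return {
        "cap_and_entity": cap and label != "O",
        "lower_and_O": (not cap) and label == "O",
        "I_after_entity": label == "I-LOC" and inside,
        "I_without_entity": label == "I-LOC" and not inside,
        "B_after_entity": label == "B-LOC" and inside,
    }


def score(labels, words):
    """Sum of lambda_k * f_k over all positions."""
    total, prev = 0.0, "<START>"
    for t, label in enumerate(labels):
        active = features(prev, label, words, t)
        total += sum(WEIGHTS[name] for name, on in active.items() if on)
        prev = label
    return total


def predict(words):
    """Return the best label sequence and its probability P(y | x)."""
    candidates = list(product(LABELS, repeat=len(words)))
    scores = [score(y, words) for y in candidates]
    z = sum(math.exp(s) for s in scores)  # Z(x): sum over all label sequences
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], math.exp(scores[best]) / z


labels, prob = predict("New York City is busy".split())
print(labels, round(prob, 3))
```

Because the score includes features on adjacent label pairs, an isolated `I-LOC` with no preceding entity label is penalized, which is how the model keeps the whole sequence consistent.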


### HMM vs CRF (Key Differences)

| Aspect | Hidden Markov Model (HMM) | Conditional Random Field (CRF) |
|-------|----------------------------|--------------------------------|
| Model Type | Generative | Discriminative |
| Assumption | Strong independence assumptions (word depends only on its tag) | No independence assumptions on features |
| Features | Limited: emissions must be simple (word, maybe morphology) | Arbitrary, overlapping, global features allowed |
| Sequence Consistency | Local (only via transitions) | Global (entire sequence optimized jointly) |
| Typical Use | POS tagging (early), speech recognition (with acoustics) | NER, segmentation, chunking, POS tagging (modern) |
| Strength | Simple, fast, usable with small data | Typically more accurate; supports flexible, overlapping features |
| Weakness | Cannot use rich features; weaker accuracy | Needs labeled data and more compute |

**Summary**  
- **HMM** models how words are *generated* from hidden states.  
- **CRF** models how labels should be assigned *given the whole sentence*.  
- CRFs outperform HMMs on most NLP sequence-labeling tasks.


### Latent Semantic Analysis (LSA)

**What it is**
- Statistical method that discovers hidden semantic structure in text
- Based on co-occurrence counts and **SVD (Singular Value Decomposition)**

**Key idea**
- Words that appear in similar contexts have similar meanings
- Compress the word–document matrix into a low-dimensional semantic space

**Math**
1. Build term-document matrix (TF–IDF)
2. Apply SVD:  M ≈ U Σ Vᵀ   (keep top k dimensions)
3. Use vectors from U/V as semantic embeddings
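The three steps can be sketched with NumPy on a tiny hand-made matrix. This is an illustration under simplifying assumptions: the counts are invented, raw counts stand in for TF–IDF weights, and the names `TERMS`, `M`, and `cosine` are chosen for this example.

```python
import numpy as np

TERMS = ["cat", "dog", "pet", "stock", "market"]

# Rows = terms, columns = 4 toy documents (raw counts instead of TF-IDF).
M = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Step 2: SVD, then keep only the top k singular directions.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
term_vectors = U[:, :k] * S[:k]  # Step 3: k-dimensional term embeddings


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_cat_dog = cosine(term_vectors[0], term_vectors[1])
sim_cat_stock = cosine(term_vectors[0], term_vectors[3])
print(round(sim_cat_dog, 3), round(sim_cat_stock, 3))
```

In the raw count space "cat" and "dog" are similar but not identical; after truncation to two dimensions they collapse onto the same "animal" direction, while "cat" and "stock" stay orthogonal.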


### Latent Semantic Analysis (LSA)

**Used for**
- Document similarity, clustering  
- Topic modeling (pre-LDA era)  
- Early summarization and search  
- Foundation for word embeddings

**Why it mattered**
- First successful statistical method capturing semantic similarity  
- Precursor to word2vec/GloVe  

### Latent Dirichlet Allocation (LDA)

**What it is**
- A **probabilistic topic model**
- Assumes each document is a mixture of topics
- Each topic is a distribution over words

**Generative idea**
1. For each document, choose topic proportions (Dirichlet).
2. For each word:
   - Pick a topic from the doc's topic proportions
   - Pick a word from that topic's word distribution
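The generative story can be simulated directly. The sketch below assumes two made-up topics with fixed word distributions and a symmetric Dirichlet prior; `TOPICS`, `sample_dirichlet`, and `generate_document` are hypothetical names. It covers only the generation direction; fitting LDA to real text requires inference (for example variational methods or Gibbs sampling), which is not shown.

```python
import random

# Two invented topics, each a distribution over words.
TOPICS = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "finance": {"market": 0.5, "stock": 0.3, "bank": 0.2},
}


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha=0.5, seed=0):
    rng = random.Random(seed)
    names = list(TOPICS)
    # Step 1: choose this document's topic proportions.
    theta = sample_dirichlet([alpha] * len(names), rng)
    words = []
    for _ in range(n_words):
        # Step 2a: pick a topic from the document's proportions.
        topic = rng.choices(names, weights=theta)[0]
        # Step 2b: pick a word from that topic's word distribution.
        dist = TOPICS[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return theta, words


theta, words = generate_document(10)
print([round(p, 2) for p in theta], words)
```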

**Why it's useful**
- Discovers hidden themes in large text collections
- Each doc gets a vector of topic weights → useful for clustering, search, classification.

### Applications Powered by Statistical NLP

- Search engines → relevance ranking (TF–IDF, probabilistic IR)
- Spam filtering → Naive Bayes, SVM
- Speech recognition → HMM-based decoding
- Machine translation → IBM statistical models
- NER, POS tagging → HMM/CRF taggers
- Sentiment analysis → supervised classification

*(Scaled to real-world data for the first time.)*


### Strengths of Statistical NLP

- Learns from data → scalable
- Handles messy real-world text
- Manages ambiguity probabilistically
- Reduces manual rule writing
- Foundation for early web search & speech engines


### Limitations → Motivation for Deep Learning

- Heavy feature engineering required
- Sparse data → poor generalization
- Struggles with long-distance dependencies
- Requires labeled training data
- Words treated as symbols (no notion of meaning)

*(Set the stage for word embeddings and neural networks.)*


### How Deep Learning Eliminates Sparsity (Word-Level & Above)

**1. Word Embeddings → Fix sparsity above the word level**
- Before: each word = isolated symbol (Paris, London, Berlin unrelated)  
  → unseen word combinations = zero counts, brittle models  
- After: dense word vectors  
  → similar words cluster; models generalize to unseen phrases/sentences

**2. Subword/Character Embeddings → Fix sparsity at the word level**
- Before: OOV words = impossible to represent (probability = 0)  
- After: BPE/WordPiece split words into pieces (un-, -tion, -ing…)  
  → even unseen words have vectors; morphology becomes learnable

**Result:**  
Word embeddings handle semantic similarity, subwords handle OOV/morphology, and neural layers build contextual meaning → sparsity becomes a minor issue compared to statistical NLP.


### How Deep Learning Fixes "Words Treated as Symbols" (with definition of meaning)

**What is "meaning" in NLP?**  
- Meaning is relational: a word is defined by the contexts it appears in and its similarity to other words.

**Before (Statistical NLP):**
- Words were **IDs only**, no semantic relation (Paris ≠ London ≠ Berlin in the model).
- Only **primitive meaning signals** existed (POS, morphology), which tell *role* but not *true meaning*.
  - e.g., *dog* and *cat* both → noun (POS), but models cannot tell they are semantically close.

**After (Deep Learning):**
- **Word embeddings** learn relational meaning:
  - Similar contexts → nearby vectors (king − man + woman ≈ queen).
- **Contextual embeddings** give different meanings based on usage:
  - "bank" (river) vs "bank" (finance).
- **Subword embeddings** capture morphology and rare words:
  - "bioluminescence" → bio + lumi + nescence.

**Result:**  
Deep learning replaces symbolic word IDs with **rich semantic, relational, context-dependent representations**, providing real "meaning" instead of shallow syntactic categories.


# 4. Word Embeddings and Deep Learning (2010s–2018)

<!-- slide -->

### Why We Need Word Embeddings

**Statistical NLP limitations:**
- Words treated as **IDs only** → no semantic similarity  
- Sparse data → unseen combinations get **zero counts**  
- Heavy reliance on **manual features** (suffixes, capitalization, POS)  
- No way to capture relationships like:  
  *Paris ≈ London ≈ Berlin*  
  *run ≈ jog ≈ sprint*

**Core problem:**  
The model sees no relation between similar words: all are isolated symbols.

**Motivation for embeddings:**  
We need a **dense, learned, continuous representation** so that:
- similar words are near each other,
- rare/unseen words still get meaningful vectors,
- models can generalize beyond observed counts.


<!-- slide: subslide -->

### What Is a Word Embedding?

**Definition:**  
A *word embedding* is a **dense vector** (typically 50–300 dimensions) that represents a word based on **its context and usage**.

**Key properties:**
- Similar words → **nearby vectors**  
  - *Paris*, *London*, *Berlin* cluster together  
  - *king − man + woman ≈ queen*
- Encodes **semantic** and **syntactic** information  
- Learned automatically from large corpora (distributional hypothesis)

**Conceptual picture:**
- One-hot:  
  `[0 0 0 ... 1 ... 0]`  → no relationship
- Embedding:  
  `[0.39, -0.12, 0.85, ...]` → captures meaning via geometry
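The contrast between one-hot and dense vectors can be made concrete with cosine similarity. The three-dimensional `DENSE` vectors below are invented purely for illustration (real embeddings are learned and have far more dimensions), and `cosine` is a small helper written for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One-hot: every word gets its own axis, so all pairs are orthogonal.
ONE_HOT = {"paris": [1, 0, 0], "london": [0, 1, 0], "banana": [0, 0, 1]}

# Made-up dense vectors: the two cities point in a similar direction.
DENSE = {
    "paris": [0.9, 0.8, 0.1],
    "london": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

print(cosine(ONE_HOT["paris"], ONE_HOT["london"]))
print(cosine(DENSE["paris"], DENSE["london"]))
print(cosine(DENSE["paris"], DENSE["banana"]))
```

With one-hot vectors every pair of distinct words has similarity zero, so the representation cannot express that two cities are more alike than a city and a fruit.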

**Why it matters:**  
Embeddings transform language from **symbolic** to **semantic**, enabling neural networks to learn meaning instead of memorizing patterns.


<!-- slide -->

### Timeline of Word Embeddings (2000–2017)

**2000–2010: Neural Language Models (early DL)**  
- Bengio et al. (2003): neural LM with learned embeddings  
- Too slow to scale, but introduced the core idea: **embedding matrix + neural prediction**

**2013: Word2Vec (Mikolov et al.)**  
- Skip-Gram and CBOW  
- Efficient training on billions of tokens  
- Sparked the modern embedding revolution

**2014: GloVe (Pennington et al.)**  
- Global co-occurrence matrix factorization  
- Complement to window-based Word2Vec

**2014–2016: fastText (Facebook)**  
- Subword/character n-gram embeddings  
- Solved OOV and morphology limitations

**2015–2017: Contextual Precursor Models**  
- Architectures that led to ELMo (published 2018)  
- Bi-LSTM + embeddings → context-dependent meaning


<!-- slide -->

### How Word Embeddings Are Trained (Intuition)

**Key idea: Distributional Hypothesis**  
"You shall know a word by the company it keeps."

Words appearing in similar contexts → acquire similar vectors.

Two major training styles:

**1. Skip-Gram (Predict context given a word)**  
- Input: "bank"  
- Predict nearby words: *river, water, money, loan*  
- Objective: words with similar neighborhoods → similar vectors

**2. CBOW (Continuous Bag of Words: predict the word from its context)**  
- Input: "the *?* is flowing fast"  
- Predict: *river*

**Common idea:**  
Train on massive corpora → embeddings emerge automatically from co-occurrence behavior.


<!-- slide -->

### Skip-Gram (Window Size = 2)

Sentence: **The quick brown fox jumps over the lazy dog**  
Indexes:  
1:The  2:quick  3:brown  4:fox  5:jumps  6:over  7:the  8:lazy  9:dog

Window size = **2**

#### Center = "fox" (position 4)
Context positions = **2, 3, 5, 6**

Training pairs:
- (fox → quick)
- (fox → brown)
- (fox → jumps)
- (fox → over)

#### Center = "jumps" (position 5)
Context positions = **3, 4, 6, 7**

Training pairs:
- (jumps → brown)
- (jumps → fox)
- (jumps → over)
- (jumps → the)

**Summary:**  
Skip-Gram predicts **context words from the center word**.
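The pair generation shown on this slide can be reproduced with a short sketch. `skipgram_pairs` is a hypothetical helper written for this example; it only builds the (center, context) training pairs and does not train any vectors.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs


tokens = "The quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, window=2)
print([p for p in pairs if p[0] == "fox"])
print([p for p in pairs if p[0] == "jumps"])
```

Words near the sentence edges simply get fewer pairs, because the window is clipped at the boundaries.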


<!-- slide -->

### CBOW (Window Size = 2)

Sentence: **The quick brown fox jumps over the lazy dog**  
Indexes:  
1:The  2:quick  3:brown  4:fox  5:jumps  6:over  7:the  8:lazy  9:dog

Window size = **2**

#### Center = "fox" (position 4)
Context positions = **2, 3, 5, 6**

- Input: **{quick, brown, jumps, over}**  
- Predict: **fox**

#### Center = "jumps" (position 5)
Context positions = **3, 4, 6, 7**

- Input: **{brown, fox, over, the}**  
- Predict: **jumps**

**Summary:**  
CBOW predicts the **center word from surrounding context words**.
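The CBOW training examples from this slide can be generated the same way, with the roles reversed: the context words are the input and the center word is the target. `cbow_examples` is a hypothetical helper for this sketch and does no training.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, center_word) examples for every position."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        examples.append((left + right, center))
    return examples


tokens = "The quick brown fox jumps over the lazy dog".split()
examples = cbow_examples(tokens, window=2)
print(examples[3])  # center = fox
print(examples[4])  # center = jumps
```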


<!-- slide -->

### Why Word Embeddings Exploded Around 2013

**1. Data got big enough**  
- Web-scale corpora (billions of words) became available  
- Finally enough signal to learn stable semantic patterns

**2. Compute got fast enough**  
- Multicore CPUs + GPUs made vector math cheap  
- Training neural models became practical (hours, not weeks)

**3. Algorithms became efficient**  
- Word2Vec introduced **negative sampling** + **subsampling**  
- Shallow SG/CBOW architectures scaled to huge vocabularies

**Result:**  
Embeddings became fast, scalable, and meaningful, enabling modern neural NLP.


<!-- slide -->

### Why Word Embeddings Mattered (The Real Breakthrough)

**Embeddings finally showed *true semantic structure*.**  
This was the moment the NLP community realized neural methods worked.

**Shocking results at the time:**
- *king − man + woman ≈ queen*  
- *Paris ≈ London ≈ Berlin*  
- *walk, walking, walked* clustering naturally

**What made this transformative:**
- **Linear analogies** emerged from geometry  
- **High-quality nearest neighbors** (semantic and syntactic)  
- **Smooth, meaningful clusters** across massive vocabularies  

Earlier methods (LSA, early neural LMs) hinted at similarity,  
but **never at this clarity, scale, or consistency**.

**Impact:**  
This single discovery convinced researchers that  
**dense vectors encode real meaning**, launching modern neural NLP.


<!-- slide -->

### GloVe: What It Adds on Top of Word2Vec

**What Word2Vec uses:**  
- Local context windows (Skip-Gram / CBOW)  
- Learns meaning from *predicting* nearby words  
- Great semantic clusters but only local statistics

___

### **What GloVe adds: Global Co-Occurrence Information**
- Builds a **global word–word co-occurrence matrix**  
- Learns embeddings by **factorizing** it with a weighted least-squares loss  
- Embeddings capture **ratios** of probabilities  
  - ice : cold  vs.  steam : hot  
- Encodes semantic structure that **local windows miss**

___

### **How GloVe + Word2Vec connect**
- Word2Vec = **local prediction**  
- GloVe = **global co-occurrence factorization**  
- Both produce dense vectors, but GloVe injects **global corpus structure** into the embedding space

**Result:**  
More stable embeddings and better analogy structure on some tasks.


<!-- slide -->

### fastText: What It Adds on Top of Word2Vec

**What Word2Vec lacks:**  
- Treats each word as a whole symbol  
- Fails on **OOV** (out-of-vocabulary) words  
- Weak on morphology (walk, walking, walker)

___

### **What fastText adds: Subword / Character n-grams**
- Splits each word into character-level n-grams  
  - "playing" → play, lay, yin, ing, pla…  
- Embedding = **sum of subword embeddings**  
- Even unseen words are representable (OOV solved)  
- Morphology becomes learnable automatically
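A minimal sketch of the subword split, assuming n-gram lengths of 3 to 4 and `<` / `>` as word-boundary markers. `char_ngrams` is a hypothetical helper and the length range is a choice made for this example; it only produces the n-grams, whereas the real model also learns one vector per n-gram and sums them.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word wrapped in boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


print(char_ngrams("playing"))
# An unseen word still shares pieces with known words:
print(set(char_ngrams("playing")) & set(char_ngrams("replaying")))
```

Because "replaying" shares n-grams such as `play` and `ing>` with "playing", an unseen word can inherit a sensible vector from pieces the model has already seen.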

___

### **How fastText + Word2Vec connect**
- fastText still uses **Skip-Gram / CBOW**  
- But replaces "word as a single token" with  
  **word = bag of character n-grams**
- Keeps Word2Vec's training idea  
- Extends it to handle **rare words, morphology, OOV**

**Result:**  
Word2Vec semantics + subword structure = robust, generalizable embeddings.


<!-- slide -->

### Embedding Dimensions: Typical Sizes & What They Mean

#### **Typical Dimensions**
| Model       | Common Dim |
|-------------|------------|
| **Word2Vec** | 100–300 |
| **GloVe**     | 50–300 |
| **fastText**  | 100–300 |
| **Contextual Models (BERT/GPT)** | 768–12k+ |

**Why 100–300?**  
- Big enough to encode rich semantic structure  
- Small enough to train efficiently  
- Empirically the best tradeoff for static embeddings

___

### **Do dimensions have actual meaning?**

**Not individually**  
- No single dimension corresponds cleanly to a topic or concept
- You cannot point to dimension **#47** and say "this is the *animal* dimension"  

**Think of it like a coordinate system:**  
Individual axes don't mean anything alone,  
but **distances and directions** encode meaningful relationships.



<!-- slide -->

### From Embeddings to Neural Models (2013–2017)

Once words are dense vectors, neural networks can finally **use meaning**, not symbols.

**1. RNNs (2013–2014)**  
- Read tokens sequentially  
- Capture short-range patterns  
- Weak long-distance memory

**2. LSTMs / GRUs (2014–2016)**  
- Add gating + memory  
- Handle long dependencies  
- Powered early neural MT, tagging, sentiment

**3. CNNs for Text (2014–2016)**  
- Convolution over embeddings  
- Strong for sentiment & classification

**Key impact:**  
Embeddings + these neural models replaced feature engineering and enabled **end-to-end learned NLP**, paving the way for **attention (2017)** and **Transformers**.


<!-- slide -->

### What Is a Neural Network (NN)?

A neural network is a model built from layers of simple mathematical units (“neurons”) that learn to transform inputs into useful outputs.

**1. A neuron = weighted sum + nonlinearity**
$$
\sigma(w \cdot x + b)
$$
- $x$: input  
- $w$: learned weights  
- $b$: learned bias  
- $\sigma$: nonlinear activation (ReLU, tanh, …)


This is the basic building block.

**2. A NN = many neurons arranged in layers**
- Input layer receives embeddings or raw features  
- Hidden layers learn internal patterns  
- Output layer predicts the next word / label / score  
- Stacking layers → learn complex relationships

**3. NNs learn by adjusting weights**
- Compare prediction vs. truth  
- Compute error  
- Update weights via gradient descent  
- Repeat over millions of examples

This allows the network to gradually discover patterns in the data.
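
The neuron formula and the "compare, compute error, update" loop can be sketched for a single neuron. This is an illustrative setup, assuming a sigmoid activation, a squared-error loss, and one hand-picked training example; none of these choices come from the lecture.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(w, b, x):
    """Weighted sum plus bias, passed through a nonlinearity."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_step(w, b, x, y, lr=0.5):
    """One gradient-descent step on the loss 0.5 * (pred - y)^2."""
    pred = neuron(w, b, x)
    # Chain rule: dL/dz = (pred - y) * sigmoid'(z), sigmoid'(z) = pred * (1 - pred).
    grad_z = (pred - y) * pred * (1.0 - pred)
    new_w = [wi - lr * grad_z * xi for wi, xi in zip(w, x)]
    new_b = b - lr * grad_z
    return new_w, new_b


w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1.0          # single toy example (assumption)
before = neuron(w, b, x)         # 0.5 with all-zero parameters
for _ in range(100):
    w, b = train_step(w, b, x, y)
after = neuron(w, b, x)
print(before, after)             # the prediction moves from 0.5 toward the target 1.0
```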


<!-- slide -->

### Why Neural Networks Matter

**✔ Learn features automatically**  
- No manual feature engineering: useful patterns emerge from data.

**✔ Capture nonlinear relationships**  
- Language structure is nonlinear; NNs model complex interactions naturally.

**✔ Scale with data**  
- Larger datasets + bigger networks → better performance.

___

**Summary:**  
A neural network is a layered function that learns to map inputs (like embeddings) to outputs by adjusting parameters to capture complex patterns in data.


<!-- slide -->
## NN Language Modeling (Setup)

We use the previous *N* words to predict the next one.

Example (window = 5):

The cat sat on the → mat


**Flow:**

1. **Token IDs → embeddings (300 dim)**  
   Each token ID is looked up in an embedding matrix:  
   `IDs → [e₁, e₂, e₃, e₄, e₅]`

2. **Combine embeddings (5 × 300 = 1500 dim)**  
   Usually concatenation:  
   `[e₁ ; e₂ ; e₃ ; e₄ ; e₅]`

3. **Feed into a neural network**  
   One or more layers transform the window into a context representation.
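
The first two steps of this flow can be sketched as follows. The five-word vocabulary and the randomly initialised embedding matrix are assumptions for illustration; in a real model the matrix is learned.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (assumption)
EMBED_DIM = 300
WINDOW = 5

rng = random.Random(0)
# One row per token ID; random here, learned during training in practice.
embedding_matrix = [[rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)]
                    for _ in VOCAB]
token_to_id = {tok: i for i, tok in enumerate(VOCAB)}


def build_input(context_tokens):
    """Look up each token's embedding and concatenate them into one vector."""
    ids = [token_to_id[t] for t in context_tokens]
    embeddings = [embedding_matrix[i] for i in ids]    # WINDOW vectors of EMBED_DIM
    return [value for e in embeddings for value in e]  # WINDOW * EMBED_DIM values


x = build_input(["the", "cat", "sat", "on", "the"])
print(len(x))  # 1500
```

Note that both occurrences of "the" map to the same row of the matrix; only their position in the concatenated vector differs.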


<!-- slide -->
## NN Processing → Prediction

Neural layers transform the input:

$$h_1 = f_1(\text{input})$$  
$$h_2 = f_2(h_1)$$  
$$\vdots$$  
$$h_{\text{last}} = f_{\text{last}}(h_{\text{last}-1})$$  

Final linear layer:

$$\text{logits} = W_{\text{out}} \cdot h_{\text{last}} + b_{\text{out}}$$

Softmax gives probabilities:

$$p(w) = \text{softmax}(\text{logits})$$

**Output:** a probability distribution over the entire vocabulary  
(e.g., all 50k possible next-word candidates).
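
A minimal sketch of the softmax step, using a made-up three-word vocabulary and hand-picked logits in place of the output of a real final layer:

```python
import math


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["mat", "dog", "moon"]   # toy vocabulary (assumption)
logits = [2.0, 1.0, 0.1]         # hypothetical output of the final linear layer
probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

The highest logit gets the highest probability, but every candidate keeps some probability mass.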


<!-- slide -->

## Training With Cross-Entropy Loss

If the true next word is **“mat”**, the loss is:

$$
L = -\log p(\text{"mat"})
$$

Backpropagation updates **all** parameters:

- network layers  
- output layer  
- embedding matrix (learned like any other parameter)
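
As a sketch of the loss above: the probabilities are invented, and the gradient formula is the standard result for softmax followed by cross-entropy (predicted probability minus the one-hot target).

```python
import math


def cross_entropy(probs, target_index):
    """Loss L = -log p(target)."""
    return -math.log(probs[target_index])


def grad_wrt_logits(probs, target_index):
    """For softmax + cross-entropy: dL/dlogit_k = p_k - 1[k == target]."""
    return [p - (1.0 if k == target_index else 0.0)
            for k, p in enumerate(probs)]


probs = [0.7, 0.2, 0.1]            # hypothetical p("mat"), p("dog"), p("moon")
loss = cross_entropy(probs, 0)     # about 0.357
grads = grad_wrt_logits(probs, 0)
print(loss, grads)
```

The gradient is negative for the correct word (push its logit up) and positive for the others (push them down); backpropagation carries this signal into every layer and into the embedding matrix.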

**Summary:**  
A neural LM takes **token IDs → embeddings → neural layers → softmax**,  
and learns to predict the next token using a **fixed context window**.


<!-- slide -->
## What Is an RNN?

A **Recurrent Neural Network (RNN)** processes a sequence **one token at a time**, keeping a running memory.

At each step it uses  
- current input **\(x_t\)**  
- previous hidden state **\(h_{t-1}\)**  

to compute the next hidden state:

$$
h_t = \tanh(W_x x_t + W_h h_{t-1})
$$

**Meaning of components**
- **\(W_x\)**: processes the *current token*
- **\(W_h\)**: updates/retains *past information*
- **\(h_t\)**: compressed summary of all tokens so far

**Why it matters:**  
RNNs create a **learned, persistent memory** rather than relying on a fixed window.
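
The recurrence can be sketched in plain Python. The 2×2 weight matrices and the three-step input sequence are arbitrary toy values, and the bias is omitted to match the formula above.

```python
import math


def matvec(W, v):
    """Matrix-vector product for lists of lists."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def rnn_step(W_x, W_h, x_t, h_prev):
    """h_t = tanh(W_x x_t + W_h h_{t-1})"""
    a = matvec(W_x, x_t)
    b = matvec(W_h, h_prev)
    return [math.tanh(a_k + b_k) for a_k, b_k in zip(a, b)]


def run_rnn(W_x, W_h, inputs, hidden_size):
    """Feed the sequence one step at a time, carrying the hidden state along."""
    h = [0.0] * hidden_size
    for x_t in inputs:
        h = rnn_step(W_x, W_h, x_t, h)
    return h


# Toy parameters and inputs (assumptions, not trained values).
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

h_final = run_rnn(W_x, W_h, sequence, hidden_size=2)
print(h_final)
```

The same two matrices are reused at every step, and the final state depends on the order of the inputs, which is what lets the hidden state act as a running summary.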


<!-- slide -->
## How RNN Improves NNLM (Window-Based Models)

Traditional NNLMs use a **fixed window** (e.g., last 5 tokens):

- Only see limited context  
- No memory beyond the window  
- Cannot integrate long-distance patterns  

**RNN removes the fixed window:**
- Reads tokens sequentially  
- Hidden state evolves over time:
  $$
  h_t \rightarrow h_{t+1} \rightarrow h_{t+2}
  $$
- The hidden state stays a fixed size, but the context it summarizes grows with sequence length

**Advantages over NNLM:**
- Learns *what* to remember  
- Learns *how long* to remember it  
- Can, in principle, capture dependencies across arbitrarily long sequences  
- Far more expressive than fixed-window neural models


<!-- slide -->
## Why Vanilla RNNs Fail on Long Sequences

**Core problem: vanishing gradients**
- Training signal must pass through every time step.
- At each step, the gradient is multiplied by a factor that is typically < 1 (recurrent weights times the tanh derivative).
- Over long sequences, this product → **0**, so early information cannot influence learning.
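
A quick numeric illustration of this shrinking product. It collapses the per-step Jacobian to a single scalar factor, which is a simplification made only for illustration; the value 0.9 is an arbitrary choice.

```python
def gradient_scale(factor, steps):
    """Magnitude left after multiplying by `factor` at each of `steps` time steps."""
    return factor ** steps


for steps in (10, 50, 100):
    print(steps, gradient_scale(0.9, steps))
# 0.9 ** 100 is about 2.7e-5: the signal from 100 steps back has almost vanished.
# With a factor above 1 (for example 1.1), the same product explodes instead.
```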

**Resulting limitations**
- Cannot learn dependencies far in the past.
- Forget subjects in long sentences.
- Memory fades exponentially with sequence length.
- Performs only slightly better than fixed-window NNLMs for long-range tasks.

**Motivation for LSTM/GRU**
- We need a recurrent model that can **decide what to keep**,  
  **what to forget**, and **how long to remember**,  
  without losing gradients over time.


<!-- slide -->

## LSTM Timeline: Why a 1997 Idea Took Over in 2014

**1997: LSTM Invented**  
Hochreiter & Schmidhuber propose LSTM to address vanishing gradients by introducing memory cells and gating mechanisms.

**2000s: Little Adoption**  
LSTMs remain rare because hardware was too slow, datasets were small, optimization methods were immature, and training deep recurrent networks was difficult.

**2013: Conditions Become Right**  
GPUs become powerful enough, large text corpora become available, Word2Vec produces high-quality embeddings, and improved optimizers (RMSProp, followed by Adam in 2014) make training more stable.

**2014–2016: Breakthrough Era**  
LSTMs win across speech, handwriting, and early translation tasks; Seq2Seq models (Sutskever et al., 2014) push them into mainstream NLP.

**2014: GRU Introduced**  
GRU offers a simpler gated architecture with performance similar to LSTM.

**2017: Transformers Begin to Replace RNNs**  
Self-attention scales better than recurrence, and LSTM/GRU usage rapidly declines.


<img src="21cceeab741f.svg" width="800">

<!-- slide -->
## LSTM Architecture: How It Works

An LSTM keeps two states at each step:  
- **hidden state** \( h_t \): short-term signal  
- **cell state** \( C_t \): long-term memory  

### 1. Forget gate: decide what to erase
$$
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
$$

### 2. Input gate: decide what new info to add
$$
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
$$

Candidate memory:
$$
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)
$$

### 3. Update cell state: combine past + new
$$
C_t = f_t \odot C_{t-1} \;+\; i_t \odot \tilde{C}_t
$$

### 4. Output gate: decide what to expose as hidden state
$$
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
$$

Hidden state:
$$
h_t = o_t \odot \tanh(C_t)
$$

### Intuition
- **Forget gate** removes irrelevant old memory.  
- **Input gate + candidate** add useful new information.  
- **Cell state** carries long-term memory with **minor gradient decay**.  
- **Output gate** controls what part of the memory becomes the next hidden state.

This gating structure largely mitigates **vanishing gradients** and allows LSTMs to remember information over much longer sequences than vanilla RNNs.
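
The gate equations can be sketched as one step function in plain Python. The parameters are hand-set rather than trained: a large forget bias and a very negative input bias are assumptions, chosen so the cell state visibly survives many steps.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b for lists of lists."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(p, x_t, h_prev, c_prev):
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]          # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]          # input gate
    c_tilde = [math.tanh(a) for a in affine(p["W_C"], p["b_C"], z)]  # candidate
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]          # output gate
    c_t = [f_k * c_k + i_k * g_k
           for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_tilde)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t


hidden, inputs = 2, 1
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {
    "W_f": zeros, "b_f": [10.0] * hidden,   # forget gate near 1: keep old memory
    "W_i": zeros, "b_i": [-10.0] * hidden,  # input gate near 0: ignore new input
    "W_C": zeros, "b_C": [0.0] * hidden,
    "W_o": zeros, "b_o": [0.0] * hidden,
}

h, c = [0.0, 0.0], [1.0, -1.0]
for _ in range(100):
    h, c = lstm_step(params, [0.5], h, c)
print(c)  # still close to [1.0, -1.0] after 100 steps
```

With the forget gate near 1, the cell state is carried forward almost unchanged, which is the additive path that keeps gradients from decaying quickly.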


<!-- slide -->
## What RNNs / LSTMs / GRUs Solved, and What Still Remained

### ✔ What They Solved
- Learned **longer dependencies** than fixed-window NNLMs  
- Added **memory** that adapts to the sequence  
- Reduced **vanishing gradients** with gates (LSTM/GRU)  
- Enabled early **neural machine translation**, tagging, speech  

### ✘ What Still Remained (Major Limitations)
- **Sequential processing is slow**  
  RNNs read one token at a time → no parallelism.

- **Hard to model very long-range dependencies**  
  Even LSTMs eventually forget as the sequence grows.

- **Memory bottleneck**  
  All information must pass through one hidden state \(h_t\).

- **Difficulty capturing relationships between distant tokens**  
  RNNs don’t selectively focus on important words.

- **Training unstable / expensive**  
  Long sequences → exploding/vanishing gradients → tricky optimization.

___

### Summary
RNN/LSTM/GRU added memory and solved short-term dependency issues,  
but **speed, parallelism, and truly long-range reasoning** were still limited,  
*opening the door for attention and Transformers.*


<!-- slide -->
## What NN / RNN / LSTM / GRU Solved, and What Still Remained

### ✔ What They Solved

**Neural Networks (NN)**
- Learned **nonlinear features automatically**  
- Removed the need for **manual feature engineering**  
- Enabled **end-to-end learning** from embeddings  

**Recurrent Neural Networks (RNN)**
- Replaced fixed windows with **learned memory over sequences**  
- Captured **short- and mid-range dependencies**  
- Modeled sequences more naturally than NNLMs  

**LSTM / GRU**
- Added **gating mechanisms** → largely mitigated vanishing gradients  
- Preserved **longer-term memory**  
- Enabled early **neural MT, tagging, speech, handwriting**  
- Much more stable to train than vanilla RNNs  

___



### ✘ What Still Remained (Major Limitations)

- **Sequential processing is slow**  
  RNNs must process tokens one-by-one → no parallelism.

- **Still struggle with very long-range dependencies**  
  Even LSTM/GRU memory fades as sequences grow.

- **Single-state bottleneck**  
  All information flows through a single hidden state \(h_t\).

- **No selective focus**  
  RNNs have no explicit mechanism to look back at specific past tokens, so they cannot highlight the important ones.

- **Training instability / cost**  
  Long sequences → exploding/vanishing gradients → difficult optimization.

___

### Summary  
NN → RNN → LSTM/GRU steadily improved feature learning, sequence modeling, and memory.  
But **speed, parallelism, and truly long-range reasoning** remained unsolved,  
creating the need for **attention** and ultimately **Transformers**.


# 5. Transformers and the LLM Revolution (2018-today)


### Self-Attention Mechanism: Intuition

**Self-attention** lets each token look at *all other tokens* in the sequence and decide which ones matter.

- Each token computes weights over all other tokens instead of just relying on a single hidden state.
- The model builds a **contextual representation** by taking a weighted sum of other token representations.
- Long-range dependencies (e.g., subject ↔ verb far apart) can be captured in **one step**, not gradually over time.

**Contrast with RNNs:**

- RNNs process tokens **sequentially** and compress everything into one hidden state.
- Self-attention processes all tokens **in parallel**, removing the sequential bottleneck.


### Query–Key–Value (QKV) Framework

For each token’s current representation \(x\), the model learns three projections:

- **Query (Q):** what this token is *looking for* in other tokens  
- **Key (K):** what this token *offers* for others to match against  
- **Value (V):** the *information* this token contributes when attended to  

Mathematically:

$$
Q = W_Q x
$$

$$
K = W_K x
$$

$$
V = W_V x
$$

Attention score between token \(i\) and \(j\):

$$
\text{score}(i,j) = \frac{Q_i \cdot K_j}{\sqrt{d_k}}
$$

Attention output:

$$
\text{output}_i = \sum_j \alpha_{ij}\, V_j, \qquad \alpha_{ij} = \frac{\exp(\text{score}(i,j))}{\sum_{j'} \exp(\text{score}(i,j'))}
$$
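
The score and output formulas can be sketched directly. This is a single-head toy with no learned projections: `Q`, `K`, and `V` are given as small hand-written matrices, which is an assumption made to keep the example short.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # score(i, j) = (Q_i . K_j) / sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # normalised over all positions j
        # output_i = sum_j weight_ij * V_j
        out = [sum(w * v[col] for w, v in zip(weights, V))
               for col in range(len(V[0]))]
        outputs.append(out)
    return outputs


# Toy inputs (assumptions): two tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

out = attention(Q, K, V)
print(out)
```

Each output row is a weighted average of the value rows: token 0 matches key 0 best, so its output sits closer to `V[0]`, and token 1 sits closer to `V[1]`.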


### Transformer Architecture

- Encoder-decoder architecture
- Positional encoding
- Layer normalization and residual connections
- Feed-forward networks


<img src="transformer.png" width="800">

### Structure of a Single Transformer Block

A Transformer block has **two sublayers**, both preserving the model dimension $(d_{\text{model}})$

**1. Multi-Head Self-Attention (MHSA)**  
Input:
$$
X \in \mathbb{R}^{T \times d_{\text{model}}}
$$

Linear projections:
$$
Q = X W_Q,\quad K = X W_K,\quad V = X W_V
$$

Where:
$$
W_Q, W_K, W_V \in \mathbb{R}^{d_{\text{model}} \times (h \cdot d_k)}
$$

Heads: split into \(h\) heads, each of size
$$
d_k = \frac{d_{\text{model}}}{h}
$$

Concatenate heads and project back:
$$
W_O \in \mathbb{R}^{(h \cdot d_k) \times d_{\text{model}}}
$$

**Output shape:** $(T \times d_{\text{model}})$




**2. Feed-Forward Network (FFN)**  
Two-layer MLP applied per token:
$$
d_{\text{model}} \;\rightarrow\; d_{\text{ff}} \;\rightarrow\; d_{\text{model}}
$$

Typical:
$$
d_{\text{ff}} = 4\, d_{\text{model}}
$$

___

**3. Residual + LayerNorm**  
Applied around both sublayers.
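
The shapes above give a rough weight count per block. The sketch below ignores biases, LayerNorm parameters, and embeddings (a simplifying assumption) and assumes $h \cdot d_k = d_{\text{model}}$ and $d_{\text{ff}} = 4\, d_{\text{model}}$ unless stated otherwise; `block_params` is an illustrative helper name.

```python
def block_params(d_model, d_ff=None):
    """Approximate weight count of one Transformer block."""
    if d_ff is None:
        d_ff = 4 * d_model
    attention = 4 * d_model * d_model  # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff           # d_model -> d_ff -> d_model
    return attention + ffn


print(block_params(1024))         # 12582912, about 12.6M per block
print(96 * block_params(12288))   # 173946175488, about 174B for 96 blocks
```

With the default ratio this is $12\, d_{\text{model}}^2$ per block. Plugging in $d_{\text{model}} = 12288$ and 96 layers gives roughly 174B weights, close to the 175B usually quoted for GPT-3, which suggests most parameters sit in the repeated blocks.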

### What One Transformer Block Can Learn

A single block (MHSA + FFN) gives **one round of contextualization**:

- Learns **local context**: pronoun–noun links, adjective–noun, negation, short-range dependencies  
- Forms **attention patterns**: heads focusing on punctuation, capitalization, subjects, nearby important tokens  
- Captures **shallow structure**: basic roles (noun/verb), phrase boundaries, simple entity signals  
- FFN adds **nonlinear refinement**: polarity, simple logic (“A but B”)  
- Enables **basic next-token prediction**: common continuations, grammar, punctuation  

**Cannot:** handle dependencies that need several rounds of composition, multi-step reasoning, deep semantics, or world knowledge.  


### High-Level Summary of Modern LLM Structure

Modern LLMs (GPT-3/4/5, LLaMA-3, Mistral, Gemini) share the same core design:

**1. Embeddings**
- Token embeddings of size $d_{\text{model}}$
- Add positional info (learned or RoPE)
- Output: $T \times d_{\text{model}}$

**2. Repeated Transformer Blocks (12–120+)**
- **Multi-Head Self-Attention (MHSA):** mixes information across all tokens  
- **Feed-Forward Network (FFN):** $d_{\text{model}} \rightarrow d_{\text{ff}} \rightarrow d_{\text{model}}$  
- **Residual + LayerNorm:** stabilize training, keep dimensions fixed  
- Shape preserved: $T \times d_{\text{model}}$ throughout

**3. Final LayerNorm + LM Head**
- Project hidden states to vocabulary
- Softmax → next-token probabilities

**Modern Additions:** RoPE, gated FFNs, long-context attention, sparse/MoE routing, multimodality.

**Core Idea:** LLMs = deep stacks of identical Transformer blocks trained on massive data.


### 🧠 LLM Evolution Timeline

| Model (Large Version)     | Release Date | d_model | Heads | Layers | FFN Dim | Params | Notes |
|---------------------------|--------------|---------|--------|---------|----------|---------|-------|
| **Transformer Big**       | Jun 2017     | 1024    | 16     | 6 enc / 6 dec | 4096 | ~213M | Original Transformer (Attention Is All You Need) |
| **BERT Large**            | Oct 2018     | 1024    | 16     | 24 enc | 4096     | 340M  | Bidirectional encoder-only model |
| **GPT-2 XL**              | Feb 2019     | 1600    | 25     | 48     | 6400     | 1.5B  | Largest GPT-2 model |
| **GPT-3 (175B)**          | Jun 2020     | 12288   | 96     | 96     | 49152    | 175B  | Major scaling breakthrough |
| **GPT-3.5**               | Nov 2022     | ~12288  | ~96    | ~96    | ~49k     | ~175B (est.) | Powered the first ChatGPT release; architecture undisclosed, figures assume GPT-3 scale |
| **GPT-4** (closed)        | Mar 2023     | ~15k–20k | ~120  | ~120   | ~60k–80k | ~500B–1T est. | Undisclosed; all figures are unofficial estimates (reportedly MoE, ~220B active) |
| **LLaMA-3 (70B)**         | Apr 2024     | 8192    | 64     | 80     | 28672    | 70B   | Meta’s 2024 flagship; grouped-query attention (8 KV heads) |
| **Mistral Large**         | early 2024   | –       | –      | –      | –        | undisclosed | Closed-weight model; architecture not published |
| **Gemini 2.0 Pro**        | Feb 2025 (experimental) | – | –  | –      | –        | undisclosed | Reportedly sparse MoE; partial activation |
| **Gemini 2.5 Pro**        | Mar 2025     | –       | –      | –      | –        | undisclosed | Long-context, multimodal |
| **GPT-5**                 | 2025         | –       | –      | –      | –        | undisclosed | Architecture undisclosed; speculated to be MoE and larger than GPT-4 |
| **Claude 3.7**            | Feb 2025     | –       | –      | –      | –        | undisclosed | Architecture undisclosed; hybrid reasoning model with extended thinking |


### 📚 How LLMs Learn: Sources of Training Data

| Source Type | Examples | Approx. Scale | Purpose in Training | Quality / Notes |
|-------------|----------|----------------|----------------------|------------------|
| **Reference & Knowledge Bases** | Wikipedia, Gutenberg, open-access research (arXiv), PubMed, government docs | **10–100B words** (≈ 15–130B tokens) | Build factual grounding, structured knowledge, reasoning | 🟢 High quality |
| **Modern Writing & Media** | News articles (Reuters, BBC), educational content, technical docs | **100–300B words** (≈ 150–400B tokens) | Teach current language, tone, domain context | 🟢 High quality (partly licensed) |
| **Web & Conversational Data** | Common Crawl, Reddit, Stack Overflow (licensed), blogs, forums | **1–3T words** (≈ 1.3–4T tokens) | Capture natural phrasing, diversity, informal usage | 🟡 Mixed quality; heavily filtered |
| **Code & Technical Corpora** | GitHub open-source code, API docs, comments | **100M+ repos** (≈ 20–80B lines of code) | Teach logic, structure, problem-solving | 🟢 Reliable when curated |
| **Licensed & Curated Sets** | Paid news, textbooks, alignment datasets (RLHF) | **10–100B words** (≈ 15–130B tokens) | Improve factual reliability & human intent alignment | 🟢 High quality; proprietary |
| **Excluded / Not Used** | Private data, paywalled content, nondigitized archives | – | – | 🔴 Excluded due to privacy/copyright |


### What Large-Scale LLMs Solve (vs. Earlier DL Models)

**1. Long-Range Reasoning**
- Attention over *entire sequences* (not limited memory)
- Handles multi-sentence context, documents, instructions

**2. Rich Knowledge & Generalization**
- Trained on trillion-token corpora → broad world knowledge  
- Learns patterns not present in earlier small-data models

**3. Compositional & Multi-Step Reasoning**
- Chain-of-thought, planning, tool use, multi-hop logic  
- Earlier models handled only local, shallow correlations

**4. Robustness & Versatility**
- Works across tasks (QA, code, translation, dialogue) without task-specific training
- Earlier models needed separate architectures per task

**5. Emergent Abilities from Scale**
- In-context learning, few-shot prompting, instruction following  
- Not present in RNN/LSTM or small Transformers


### 🧩 Traditional NLP Pre-Processing (Before Deep Learning)

**Goal:** Clean and normalize text before feature extraction.

**1. Tokenization**
- Rule-based splitting (whitespace, punctuation)
- Produces whole-word tokens

**2. Normalization**
- Lowercasing
- Remove punctuation / stopwords
- Handle numbers, dates, URLs

**3. Stemming**
- Rule-based suffix stripping  
  *(running → run, studies → studi)*
- Fast but crude; loses grammatical meaning

**4. Lemmatization**
- Dictionary-based canonical form  
  *(better → good, mice → mouse)*
- More accurate but slower

**5. Why so much preprocessing?**
- Traditional models used **bags-of-words**, **n-grams**, **TF–IDF**  
- Needed clean, consistent token forms to reduce sparsity
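
A toy version of this pipeline, as a sketch only. The stopword list and suffix rules are tiny hand-written assumptions, not the Porter stemmer or any library's defaults, and lemmatization is left out because it needs a dictionary.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "on"}  # illustrative subset
SUFFIXES = ("ing", "es", "ed", "s")  # crude rules for demonstration


def tokenize(text):
    """Lowercase, then keep runs of letters/digits (punctuation is dropped)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]


print(preprocess("The cats are running on the mats."))  # ['cat', 'runn', 'mat']
```

The output `runn` shows why rule-based stemming is called crude: "cats" and "mats" collapse nicely, but "running" is cut to a non-word that a more careful stemmer would reduce to "run".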


### 🔡 Tokenization in Modern LLMs (Deep Learning Era)

**Goal:** Convert text → model-friendly tokens while preserving meaning.

**1. Subword Tokenization (dominant today)**
- Algorithms: **BPE**, **WordPiece**, **Unigram** (the latter often via the **SentencePiece** library)
- Break rare/complex words into meaningful chunks  
  *“bioluminescence” → bio + lumi + nescence*
- Keeps vocabulary manageable (≈ 30k–50k tokens in BERT/GPT-2; 100k+ in many recent LLMs)

**2. Minimal Normalization**
- No stemming or lemmatization  
- Casing, morphology, punctuation kept intact  
- Models learn meaning directly from data

**3. Why subwords?**
- Handles rare words and typos  
- Preserves semantic structure  
- Enables open vocabulary

**Key shift:**  
Traditional NLP relied on heavy text cleaning.  
LLMs rely on **subword tokenization + large training data**, and learn everything else in the model.
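
To show how a subword vocabulary is learned, here is a toy byte pair encoding (BPE) trainer. It is a sketch under simplifying assumptions: a six-word corpus, no end-of-word marker, and a deterministic tie-break chosen for reproducibility. Production tokenizers differ in all three respects.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Highest count wins; ties are broken by the pair itself (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["low", "low", "lower", "lowest", "newest", "widest"], num_merges=4)
print(merges)                     # [('o', 'w'), ('l', 'ow'), ('s', 't'), ('e', 'st')]
print(encode("lowest", merges))   # ['low', 'est']
print(encode("slowest", merges))  # ['s', 'low', 'est']
```

The word "slowest" never appears in the training corpus, yet it is encoded from known pieces; this is the open-vocabulary property.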


<!-- slide -->
### 🔣 How LLM Tokenizers Handle Punctuation & Special Characters

**Modern LLM tokenizers (BPE / WordPiece / SentencePiece):**
- Treat punctuation as **meaningful tokens**, not noise  
- Keep characters like `, . ! ? ; : - ' "` as **separate subword units**
- Handle emojis, math symbols, and Unicode via **dedicated tokens** or byte-level fallback
- Preserve case, punctuation, and spacing → model learns patterns directly

**Examples** (exact splits vary by tokenizer)
- `"Hello!"` → `["Hello", "!"]`
- `"don't"` → `["don", "'", "t"]` (subword splits)
- `"C++"` → `["C", "+", "+"]`
- Emojis → often a single token such as `"😊"`, otherwise a few byte-level tokens  
- Mixed scripts → tokenizer keeps Unicode blocks intact

**Why this matters**
- Preserves semantics (e.g., **“not good”**, **quotes**, **exclamation emphasis**)  
- Reduces vocabulary size while still handling rare forms  
- Lets the model learn punctuation-based structure: sentence boundaries, lists, emphasis, dialogue

**Key idea:**  
LLMs *do not strip or normalize punctuation*; they **learn its function** directly from data.


# 6. Modern LLM Applications


## Core NLP Tasks


| Task | Impact of LLMs | Simple Reason |
|------|----------------|----------------|
| **Summarization** | **Revolutionary** | First time models can *truly rewrite* text with abstraction, coherence, and long-range understanding. |
| **Question Answering** | **Revolutionary** | Moves from keyword lookup → *actual comprehension and reasoning* across sentences/documents. |
| **Sentiment Analysis** | **Large improvement, not revolutionary** | Much better at nuance and sarcasm, but the task was already well-solved with older methods. |
| **NER** | **Large improvement, not revolutionary** | More robust and contextual, but earlier CRF/BiLSTM systems already handled core patterns well. |



## Advanced Applications

| Application | What It Is | Key Capabilities |
|------------|------------|------------------|
| **AI Agents** | Autonomous LLM-driven systems | Planning, tool use, API calls, workflow automation |
| **Conversational AI** | Dialogue systems & chatbots | Multi-turn reasoning, task chat, interactive support |
| **Code Generation** | Natural language → executable code | Autocomplete, debugging, refactoring, API usage |
| **Multimodal Systems** | Models using text + images/audio/video | Captioning, OCR, visual reasoning, grounded understanding |

**Impact:**  
These application ideas aren’t new, but **LLMs transformed them into reality**, delivering **human-level fluency**, **reasoning**, and **cross-task versatility** at **real-world scale** for the first time.


## Retrieval-Augmented Generation (RAG)

### What It Is
RAG combines **retrieval** (finding relevant external information) with **generation** (LLM reasoning and synthesis) to produce accurate, grounded answers.

### How It Works
1. **Query → Embedding**  
   Convert the user query into a vector.
2. **Vector Search**  
   Retrieve top-k relevant documents from a knowledge base or vector store.
3. **Augment Context**  
   Insert retrieved text into the LLM prompt.
4. **Generate Answer**  
   LLM synthesizes a response grounded in retrieved evidence.
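
Steps 1 to 3 can be sketched without any external service. This toy assumes a bag-of-words count vector as the "embedding" and a three-sentence in-memory list as the knowledge base; a real system would use a learned embedding model and a vector store. Step 4 (calling the LLM) is deliberately left out.

```python
import math
import re
from collections import Counter

# Tiny in-memory "knowledge base" (assumption, for illustration only).
DOCS = [
    "The LSTM uses gates to control its cell state.",
    "RAG retrieves documents and adds them to the prompt.",
    "Byte pair encoding merges frequent symbol pairs.",
]


def embed(text):
    """Toy embedding: word counts. Stands in for a learned embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query, docs, k=1):
    """Insert the retrieved text into the prompt that would be sent to the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only the context.\nContext:\n{context}\nQuestion: {query}"


prompt = build_prompt("How does RAG use the prompt?", DOCS)
print(prompt)
```

Updating the system's knowledge here means editing `DOCS`, not retraining anything, which is the practical appeal of the approach.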


### Why It Matters
- Provides **fresh, domain-specific knowledge** not in training data  
- **Reduces hallucinations** by anchoring output in retrieved text  
- Scales with **your documents**, not model size  
- Knowledge can be **updated instantly** without retraining  

### Practical Applications
- Enterprise QA over docs, wikis, RCAs, Slack  
- Customer support using manuals and troubleshooting guides  
- Semantic search assistants for large corpora  
- Code assistants retrieving APIs and examples  
- Research tools surfacing papers and reports  

**Core Idea:**  
RAG = **retrieval for facts** + **LLM for reasoning**, producing responses that are accurate, grounded, and up-to-date.


## AI Agents

<img src="agent-overview.png" width="800">

### How an AI Agent Behaves Like a Human

The LLM's ability to **understand and generate natural language** enables an AI agent to mirror the way a human thinks and acts:

1. **Brain = LLM (Central Reasoning Core)**  
   The large language model functions like the agent’s *brain*: it interprets user input, understands intentions, and decides what to do next.

2. **Perception → Understanding Intent**  
   When the user provides an instruction, the LLM processes the text the way a person listens:  
   - extracts meaning  
   - infers goals  
   - identifies missing information  
   - forms an initial plan  

3. **Planning → Figuring Out What’s Needed**  
   The “brain” determines:  
   - What info do I already have?  
   - What info do I need to gather?  
   - What tools can help me?  
   - What steps should I take next?


4. **Tools = Human Abilities (Memory, Search, Calculator, APIs)**  
   Just as humans use notebooks, search engines, calendars, or calculators, the AI agent uses external tools:  
   - search  
   - code interpreter  
   - calendar/API calls  
   - knowledge retrieval  
   - long-term memory  

5. **Reasoning Loop (Perception → Action → Reflection)**  
   The agent repeatedly:  
   - retrieves info using tools  
   - returns it to the “brain”  
   - re-evaluates what it knows  
   - refines the plan  

   This loop continues until the agent judges that it has enough information.

6. **Decision + Action**  
   Once ready, the agent:  
   - **responds to the user**, *or*  
   - **executes actions** using tools (run code, fetch data, update a system, etc.)

**Core idea:**  
An AI agent behaves like a human problem-solver: the LLM acts as the brain, tools act as abilities, and the agent cycles through understanding, planning, gathering information, and acting until the task is complete.
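
The loop described above can be sketched with a stub in place of the model. Everything here is hypothetical: `fake_llm` is a hard-coded policy standing in for a real LLM call, and `calculator` is the only tool. The point is the control flow, not the intelligence.

```python
def calculator(expression):
    """Tool: evaluate a tiny 'a op b' arithmetic expression."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "/":
        return a / b
    raise ValueError(f"unsupported operator: {op}")


TOOLS = {"calculator": calculator}


def fake_llm(task, observations):
    """Stand-in for the LLM 'brain' (NOT a real model call)."""
    if not observations:
        # Nothing gathered yet: decide to use a tool.
        return {"action": "calculator", "input": task}
    # Enough information: answer the user.
    return {"action": "respond", "input": f"The answer is {observations[-1]}"}


def run_agent(task, max_steps=5):
    """Plan -> act with a tool -> observe -> re-plan, until ready to respond."""
    observations = []
    for _ in range(max_steps):
        decision = fake_llm(task, observations)
        if decision["action"] == "respond":
            return decision["input"]
        tool = TOOLS[decision["action"]]
        observations.append(tool(decision["input"]))
    return "Stopped: step limit reached"


print(run_agent("6 * 7"))  # The answer is 42.0
```

The `max_steps` cap is a design choice worth keeping in real agents too: it bounds cost when the model never decides it has enough information.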

# Questions?

___

Thank you!
