# Lecture-1: Overview and Tokenization

The first lecture of CS-336 focused on **tokenization**, and in particular on the **BPE (Byte-Pair Encoding) tokenizer**.

## 1. Intro To Tokenization

Language models are essentially giant mathematical functions. They don't understand "words"; they understand numbers. We need a way to turn `The quick brown fox` into something like `[42, 512, 999, 204]`. This is where tokenization comes into play. Tokenization is the process of breaking a stream of raw text (like the example above) into smaller, discrete units called tokens, each of which maps to an integer ID. <br>
A language model places a probability distribution over sequences of tokens. Hence, we need one procedure that encodes strings into tokens and another that decodes tokens back into strings. A **tokenizer** is a class that implements these `encode` and `decode` methods.
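The encode/decode contract described above can be sketched in a few lines of Python. This is only an illustration: the names `Tokenizer` and `CodePointTokenizer` are hypothetical, and using Unicode code points as token IDs is an assumption made to keep the example short, not a scheme taken from the lecture.

```python
from typing import Protocol


class Tokenizer(Protocol):
    """Hypothetical interface: strings go in, integer IDs come out, and back."""

    def encode(self, text: str) -> list[int]: ...

    def decode(self, ids: list[int]) -> str: ...


class CodePointTokenizer:
    """Toy tokenizer (assumption): each character's Unicode code point is its ID."""

    def encode(self, text: str) -> list[int]:
        return [ord(ch) for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(chr(i) for i in ids)


tokenizer: Tokenizer = CodePointTokenizer()
ids = tokenizer.encode("The quick brown fox")
print(ids[:3])                # [84, 104, 101]
print(tokenizer.decode(ids))  # The quick brown fox
```

The property that matters is the round trip: `decode(encode(text))` should give back the original string.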

### 1.1 Approach-1: Character Based Tokenization

Before fancier algorithms like BPE, WordPiece and Unigram, there was the most fundamental tokenization method of all: tokenizing text at the character level. The idea sounds almost too simple, and that is exactly the point of character-based tokenization.<br>

In character-based tokenization, each token is literally a single character.<br>
For example, the sentence:
```css
hello world
```

gets tokenized as:

```css
['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
```

Every letter, digit, punctuation mark, whitespace character and symbol becomes its own token. That means the tokenizer’s vocabulary is simply:
- All letters (a–z, A–Z)
- All digits (0–9)
- All punctuation and symbols (!, @, #, $, …)
- All whitespace types
- Any special tokens you define (e.g., BOS, EOS)

So the vocabulary size might be ~100–300 tokens depending on the language, which is tiny compared with a word-level or subword vocabulary.
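A character-level tokenizer along these lines might look like the sketch below. The class name `CharTokenizer`, the `<BOS>`/`<EOS>` strings, and the choice to build the vocabulary from a sample corpus are all assumptions for illustration.

```python
class CharTokenizer:
    """Toy character-level tokenizer; the vocabulary is built from a sample corpus."""

    def __init__(self, corpus: str, special_tokens=("<BOS>", "<EOS>")):
        # Special tokens take the lowest IDs, followed by each distinct character (sorted).
        symbols = list(special_tokens) + sorted(set(corpus))
        self.token_to_id = {tok: i for i, tok in enumerate(symbols)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for a character that was not in the corpus.
        return [self.token_to_id[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.id_to_token[i] for i in ids)


tok = CharTokenizer("hello world")
ids = tok.encode("hello world")
print(len(tok.token_to_id))  # 10 (2 special tokens + 8 distinct characters)
print(ids)                   # [5, 4, 6, 6, 7, 2, 9, 7, 8, 6, 3]
print(tok.decode(ids))       # hello world
```

Even for this short string, the vocabulary stays tiny while the encoded sequence has one ID per character.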


#### Some strengths of this approach

1. **Zero Out-of-Vocabulary Issues**:

With a word-level tokenizer, if we encounter a rare word like:
```css
supercalifragilisticexpialidocious
```
then the model can’t represent it unless that exact word was already in the vocabulary. Character-level tokenization solves this problem. It breaks the word into:
```css
['s', 'u', 'p', 'e', 'r', ...]
```
and makes everything representable.

2. **Useful in Low-Data Settings**

If we are training a small RNN, a tiny GPT-like model from scratch (e.g., nanoGPT experiments), or a language model for a very narrow domain, a small vocabulary helps the model converge faster, and this is where the character-based approach is most useful. Karpathy’s early [char-RNNs for generating Shakespeare](https://github.com/karpathy/char-rnn) relied on this simplicity.

#### But Then What’s the Drawback?

1. **Sequence Length Explosion:**

Consider the example below:

```bash
hello world
```

There are clearly `11` character-level tokens here, for just two words. Real language modelling works with full sentences, paragraphs, or books, so the model has to process much longer sequences. Longer sequences mean more memory, more training time, less text covered per batch (slower learning), and more attention computation (the cost grows quadratically with sequence length). Hence **this is the single biggest reason modern LLMs don’t use pure character tokenization.**
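To put rough numbers on this, the sketch below compares character-level and whitespace-split token counts for the same string, and the number of pairwise attention scores each would need. Counting `n ** 2` pairs is a simplification of real attention cost, used here only to show the quadratic growth.

```python
text = "hello world"

char_tokens = list(text)    # character-level: one token per character
word_tokens = text.split()  # word-level (whitespace split), for comparison

n_char, n_word = len(char_tokens), len(word_tokens)

# Self-attention scores every token against every other token,
# so the number of pairwise scores grows as n ** 2.
char_pairs = n_char ** 2
word_pairs = n_word ** 2

print(n_char, n_word)          # 11 2
print(char_pairs, word_pairs)  # 121 4
```

A 5.5x longer sequence here turns into roughly 30x more pairwise scores, and the gap widens as the text grows.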


### 1.2 Approach-2: Word Based Tokenization

Given the drawbacks of character-based tokenization, the next natural idea is: **a text is just a sequence of words, so why not tokenize the text by splitting it into words?** This idea is very natural because that’s how humans think. But machines aren't humans, and they don't think in words. This mismatch turned word-based tokenization from NLP’s first “obvious solution” into one of its biggest limitations.<br>

#### What is Word-Based Tokenization?

Put simply, word tokenization splits text into words at whitespace and punctuation. For example:<br>

```bash
Input:   "Hello, world! I’m learning NLP."
Tokens:  ["Hello", "world", "I’m", "learning", "NLP"]
```

The tokenizer either removes punctuation or isolates it as separate tokens (in the example above it is dropped); everything else becomes a word. Each unique word then becomes a vocabulary entry, each entry maps to an integer ID, and the model finally processes sequences of those integer IDs.
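The pipeline above (split into words, build a vocabulary, map words to IDs) can be sketched as follows. The helper names `word_tokenize` and `build_vocab` are hypothetical, and the regular expression is just one simple way to drop punctuation while keeping apostrophes inside words; real word tokenizers use more elaborate rules.

```python
import re


def word_tokenize(text: str) -> list[str]:
    # Keep runs of letters/digits, allowing an apostrophe (straight or curly)
    # inside a word; all other punctuation is dropped.
    return re.findall(r"\w+(?:['’]\w+)*", text)


def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Assign IDs in order of first appearance.
    vocab: dict[str, int] = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


tokens = word_tokenize("Hello, world! I’m learning NLP.")
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

print(tokens)  # ['Hello', 'world', 'I’m', 'learning', 'NLP']
print(ids)     # [0, 1, 2, 3, 4]
```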

#### How This Approach Fails:

1. **Vocabulary Explosion:**

When every unique word becomes an entry in the vocabulary, even simple variations produce new tokens. For example, the words below:

```css
apple
apples
Apple
APPLE
apple's
apple-like
```
each get their own vocabulary entry, even though they all refer to the same underlying word, and across a real corpus this makes the vocabulary size explode.


2. **Out of Vocabulary (OOV):**

If a word never appeared during training, it has no entry in the vocabulary, so the model cannot represent it. Hence the model can’t learn representations for words that are missing from the training data.
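Both failure modes can be seen in a small sketch. Reserving an `<UNK>` ID for unseen words is a common convention but is an assumption here, as is the helper name `encode`; the notes above do not specify how unknown words are handled.

```python
variants = ["apple", "apples", "Apple", "APPLE", "apple's", "apple-like"]

# Each surface form gets its own ID; ID 0 is reserved for unknown words (assumption).
vocab = {"<UNK>": 0}
for word in variants:
    vocab.setdefault(word, len(vocab))

print(len(vocab) - 1)  # 6 separate entries for one underlying word


def encode(words: list[str]) -> list[int]:
    # Any word missing from the vocabulary collapses to the <UNK> ID.
    return [vocab.get(w, vocab["<UNK>"]) for w in words]


print(encode(["apple", "Apples", "pineapple"]))  # [1, 0, 0]
```

Note that `Apples` and `pineapple` map to the same ID, so the model cannot tell them apart once they are encoded.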