# Understanding LLMs

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from dotenv import load_dotenv

In [4]:
load_dotenv()

True

## Tokenizing Text

### Why Tokenization?

Tokenization converts text into a sequence of integer IDs that a model can process. There are several ways to tokenize text, each with its pros and cons:

1. **Character-Based Tokenization**:
   - **Method**: Splitting the text into individual characters and assigning each a unique numerical ID.
   - **Pros**: Needs only a small vocabulary and almost never meets an unknown token. Works well for languages like Chinese, where each character carries significant information.
   - **Cons**: Requires many tokens to represent a string, so sequences get long. In alphabetic languages an individual character carries little meaning, which makes it harder for the model to learn useful representations.

2. **Word-Based Tokenization**:
   - **Method**: Splitting the text into individual words.
   - **Pros**: Captures more meaning per token.
   - **Cons**: Requires a very large vocabulary, still treats unseen words (e.g., typos, slang) as unknown, and stores different forms of the same word (e.g., "run", "runs", "running") as unrelated entries.
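
To make the trade-off between the two methods concrete, here is a minimal sketch in plain Python. The sample sentence and the `build_vocab` helper are made up for illustration and do not come from any library.

```python
# Toy comparison of character-based and word-based tokenization.
# The sentence and helper name are illustrative assumptions.

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

text = "the runner runs and the runners run"

char_tokens = list(text)      # one token per character (spaces included)
word_tokens = text.split()    # one token per whitespace-separated word

char_vocab = build_vocab(char_tokens)
word_vocab = build_vocab(word_tokens)

# Sequence length vs. vocabulary size for each method
print("chars:", len(char_tokens), "tokens,", len(char_vocab), "vocab entries")
print("words:", len(word_tokens), "tokens,", len(word_vocab), "vocab entries")

# A word that was never seen has no ID in the word-level vocabulary...
print("hunters" in word_vocab)
# ...but the character vocabulary can still spell it out
print(all(ch in char_vocab for ch in "hunters"))
```

The character version produces a much longer sequence from a tiny vocabulary, while the word version is short but cannot represent a word it has not seen.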

### Modern Tokenization Strategies

Modern approaches balance character and word tokenization by splitting text into subwords. These methods effectively capture both the structure and meaning of the text while efficiently handling unknown words and different forms of the same word.

- **Subword Tokenization**:
  - **Method**: Frequently occurring words or subwords are assigned a single token, while complex words are split into multiple tokens, each representing a meaningful part of the word.
  - **Example**: "flabbergasted" could be split into:
              
              tensor(781) 	:  fl
              tensor(397) 	: ab
              tensor(3900) 	: berg
              tensor(8992) 	: asted
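
The idea of splitting a word into known pieces can be sketched with a greedy longest-match lookup against a toy vocabulary. Note that this is a simplified, WordPiece-style matching rule, not the merge procedure GPT-2's byte-level BPE actually applies. The vocabulary and the function name `greedy_subword_split` are hypothetical.

```python
def greedy_subword_split(word, vocab):
    """Split `word` by repeatedly taking the longest vocabulary entry that
    matches at the current position. If nothing matches, emit one character."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until an entry matches
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical toy vocabulary of subword pieces
toy_vocab = {"fl", "ab", "berg", "asted", "run", "ning", "s"}

print(greedy_subword_split("flabbergasted", toy_vocab))
print(greedy_subword_split("running", toy_vocab))
print(greedy_subword_split("runs", toy_vocab))
```

Because "run" is in the vocabulary, "running" and "runs" share a piece, which is how subword methods relate different forms of the same word.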

Different models use different tokenizers, each with its unique strategy and vocabulary size. Let's see how the GPT-2 tokenizer handles a sentence.

### Example with GPT-2 Tokenizer

We'll use the GPT-2 tokenizer to tokenize the sentence shown below. This involves converting the text into tokens and then decoding those tokens back into text.

In [8]:
# Loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
# Getting the token ids
input_ids = tokenizer("Preposterous, I'm Flabbergasted!", return_tensors='pt').input_ids
print(input_ids)
# Decoding the tokens back into text
for t in input_ids[0]:
    print(t,'\t:', tokenizer.decode(t))

tensor([[37534,  6197,   516,    11,   314,  1101,  1610,   397,  3900,  8992,
             0]])
tensor(37534) 	: Prep
tensor(6197) 	: oster
tensor(516) 	: ous
tensor(11) 	: ,
tensor(314) 	:  I
tensor(1101) 	: 'm
tensor(1610) 	:  Fl
tensor(397) 	: ab
tensor(3900) 	: berg
tensor(8992) 	: asted
tensor(0) 	: !


In [11]:
input_ids2 = tokenizer("I skip across the", return_tensors="pt").input_ids
for t2 in input_ids2[0]:
    print(t2, "\t:", tokenizer.decode(t2))

tensor(40) 	: I
tensor(14267) 	:  skip
tensor(1973) 	:  across
tensor(262) 	:  the


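As shown, the tokenizer splits the input string into a sequence of tokens, each mapped to an integer ID from its vocabulary. Notice that a leading space is part of the token (` I`, ` skip`). Common words are usually represented by a single token, while words that were rare in the tokenizer's training data can be split into several pieces, whether they are long or short. Play around with this!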

### Training Tokenizers vs. Training Models

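It's important to note that training a tokenizer differs from training a model. Training a model is a stochastic (non-deterministic) process: random initialization and data shuffling mean two runs rarely end up with identical weights. Training a tokenizer is generally a deterministic, statistical procedure: given the same dataset and settings, it produces the same vocabulary. Which subwords end up in that vocabulary depends on the dataset and on the design decisions of the tokenization algorithm.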
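
To illustrate the deterministic, count-based nature of tokenizer training, here is a toy sketch of byte-pair-encoding (BPE) style training in plain Python. It is a simplification under stated assumptions: it works on characters rather than bytes, skips pre-tokenization, and breaks ties alphabetically, which real implementations may do differently. The function names and word list are made up for this example.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with one concatenated symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = dict(Counter(tuple(w) for w in words))  # symbols -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken alphabetically (an assumption)
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges

words = ["run", "runs", "running", "runner", "fun", "funny"]
merges_a = train_bpe(words, 3)
merges_b = train_bpe(words, 3)
print(merges_a)
print(merges_a == merges_b)  # same data and settings give the same merges
```

There is no random seed anywhere in this procedure: the learned merges depend only on pair counts in the data and on the tie-breaking rule.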

Popular subword tokenization approaches include byte-level BPE (used in GPT-2), WordPiece (used in BERT), and SentencePiece (a toolkit that implements BPE and Unigram tokenization directly on raw text). Each has its advantages and is chosen based on the specific needs of the model and dataset.

By understanding tokenization, we can better appreciate how models process text and generate meaningful outputs.