# Understanding Large Language Models (LLMs)
## A Beginner's Guide to Next Token Prediction, Tokenization, and Embeddings

**Learning Objectives:**
1. Understand how LLMs predict the next token
2. Learn about tokenization and how it works across different languages
3. Build intuition about vector embeddings and how meaning is represented

**Prerequisites:**
- Basic Python knowledge
- Understanding of basic machine learning concepts (helpful but not required)

---

## Setup and Installation

First, let's install the required libraries. We'll use open-source models from Hugging Face.

In [None]:
# Install required packages
!pip install transformers torch numpy matplotlib seaborn scikit-learn tokenizers sentencepiece --quiet

print("✅ All packages installed successfully!")

In [None]:
# Import libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoTokenizer, AutoModelForCausalLM, GPT2LMHeadModel, GPT2Tokenizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("✅ Libraries imported successfully!")

---
# Part 1: Next Token Prediction - The Core of LLMs

## What is Next Token Prediction?

LLMs work by predicting the next token (word or subword) given a sequence of previous tokens. This simple idea is the foundation of how models like GPT, LLaMA, and others generate text.

**Key Concept:** Given "The cat sat on the", the model predicts "mat" (or "chair", "floor", etc.) based on probabilities.
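
To make "based on probabilities" concrete, here is a minimal sketch of the last step of a prediction: turning raw scores (logits) into probabilities with softmax. The four candidate tokens and their logits are made up for illustration; a real model scores every token in its vocabulary.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens after "The cat sat on the".
candidates = ["mat", "chair", "floor", "moon"]
logits = [3.0, 2.0, 1.5, -1.0]

probs = softmax(logits)
for token, p in sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True):
    print(f"{token:<8} {p:.3f}")
```

The token with the highest logit gets the highest probability, but the other candidates keep some probability mass too.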

Let's see this in action with GPT-2, a small open-source model.

In [None]:
# Load GPT-2 small model (124M parameters)
print("Loading GPT-2 model...")
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()  # Set to evaluation mode

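print(f"✅ Model loaded: {model_name}")
# Count the parameters instead of hard-coding the number
num_params = sum(p.numel() for p in model.parameters())
print(f"Model size: ~{num_params / 1e6:.0f}M parameters")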

## Visualizing Next Token Prediction

Let's see what the model predicts as the next token for different prompts.

In [None]:
def predict_next_tokens(text, top_k=10):
    """
    Predict the most likely next tokens given input text.
    
    Args:
        text: Input text prompt
        top_k: Number of top predictions to show
    """
    # Tokenize input
    input_ids = tokenizer.encode(text, return_tensors='pt')
    
    # Get model predictions
    with torch.no_grad():
        outputs = model(input_ids)
        predictions = outputs.logits
    
    # Get the predictions for the next token (last position)
    next_token_logits = predictions[0, -1, :]
    
    # Convert to probabilities
    next_token_probs = torch.softmax(next_token_logits, dim=-1)
    
    # Get top k predictions
    top_probs, top_indices = torch.topk(next_token_probs, top_k)
    
    # Display results
    print(f"\n📝 Input: '{text}'\n")
    print("Top predictions for the next token:\n")
    print(f"{'Rank':<6} {'Token':<20} {'Probability':<12}")
    print("-" * 50)
    
    for rank, (prob, idx) in enumerate(zip(top_probs, top_indices), 1):
        token = tokenizer.decode([idx])
        print(f"{rank:<6} {repr(token):<20} {prob.item():.4f} ({prob.item()*100:.2f}%)")
    
    return top_probs, top_indices

# Example 1: Simple completion
predict_next_tokens("The capital of Rwanda is")

In [None]:
# Example 2: Another completion
predict_next_tokens("Once upon a time")

In [None]:
# Example 3: Technical context
predict_next_tokens("Machine learning is")
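
Generating longer text repeats the single-step prediction above: pick a next token, append it, and predict again. The sketch below shows that loop with a tiny hand-written lookup table instead of a neural network. The table `NEXT_TOKEN_PROBS` and the function `generate_greedy` are hypothetical names for illustration, and unlike a real LLM the table only looks at the last token.

```python
# Hypothetical next-token table: for each token, the probabilities of what follows.
# A real LLM conditions on the whole preceding sequence, not only the last token.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "the": 0.1},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<end>": 1.0},
}

def generate_greedy(prompt_tokens, max_new_tokens=10):
    """Repeatedly append the most probable next token (greedy decoding)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        distribution = NEXT_TOKEN_PROBS.get(tokens[-1])
        if distribution is None:
            break  # no known continuation
        next_token = max(distribution, key=distribution.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

print(" ".join(generate_greedy(["the"], max_new_tokens=4)))  # the cat sat on the
```

Because greedy decoding always takes the top choice, this toy model never reaches "mat" and would loop forever on "the cat sat on". That kind of repetition is one reason text generation often samples from the distribution instead.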

### 🎯 Exercise 1: Experiment with Next Token Prediction

Try different prompts and observe:
1. How do probabilities change with different contexts?
2. What happens with ambiguous prompts?
3. Try prompts in different languages (if the model supports them)

In [None]:
# Your turn! Try your own prompts here:
your_prompt = "The weather today is"  # Change this!
predict_next_tokens(your_prompt)

## Visualizing Probability Distribution

Let's visualize how confident the model is about different predictions.

In [None]:
def visualize_predictions(text, top_k=15):
    """
    Visualize the probability distribution of next token predictions.
    """
    input_ids = tokenizer.encode(text, return_tensors='pt')
    
    with torch.no_grad():
        outputs = model(input_ids)
        next_token_logits = outputs.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
    
    top_probs, top_indices = torch.topk(next_token_probs, top_k)
    tokens = [tokenizer.decode([idx]) for idx in top_indices]
    
    # Create visualization
    plt.figure(figsize=(12, 6))
    plt.barh(range(top_k), top_probs.numpy())
    plt.yticks(range(top_k), [f"{i+1}. {repr(t)}" for i, t in enumerate(tokens)])
    plt.xlabel('Probability')
    plt.title(f'Top {top_k} Next Token Predictions for: "{text}"')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
    
visualize_predictions("The capital of Rwanda is")
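
One way to put a number on "how confident" a distribution is, is its entropy. The sketch below computes Shannon entropy in bits for two made-up distributions; `entropy_bits` is a helper written for this example, not part of any library used in this notebook.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; lower means a more peaked (confident) distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]   # hypothetical: one clear favorite
uncertain = [0.25, 0.25, 0.25, 0.25]   # hypothetical: four equally likely tokens

print(f"confident: {entropy_bits(confident):.2f} bits")
print(f"uncertain: {entropy_bits(uncertain):.2f} bits")
```

Four equally likely options give exactly 2 bits, while the peaked distribution comes out much lower. You could apply the same function to `next_token_probs.tolist()` from the cell above.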

## Understanding Temperature in Text Generation

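Temperature controls how random sampling is: low values make the model favor its top predictions, while high values spread probability across more tokens. Let's see how it affects generation.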

In [None]:
def generate_with_temperature(prompt, temperature=1.0, max_length=50):
    """
    Generate text with different temperature settings.
    
    Temperature:
    - Low (0.1-0.5): More deterministic, focused
    - Medium (0.7-1.0): Balanced
    - High (1.5-2.0): More random, creative
    """
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    
    output = model.generate(
        input_ids,
        max_length=max_length,
        temperature=temperature,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Temperature: {temperature}")
    print(f"Generated: {generated_text}\n")
    print("-" * 80)

prompt = "Artificial intelligence will"

print("Comparing different temperatures:\n")
generate_with_temperature(prompt, temperature=0.3)
generate_with_temperature(prompt, temperature=1.0)
generate_with_temperature(prompt, temperature=1.5)
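
Under the hood, temperature is commonly applied by dividing the logits by the temperature before the softmax. Here is a minimal sketch with three made-up logits, using the same temperatures as the cell above. `softmax_with_temperature` is a helper defined here for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 2.0, 1.0]  # hypothetical scores for three candidate tokens

for temperature in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}: {[round(p, 3) for p in probs]}")
```

At T=0.3 almost all probability goes to the top token. At T=1.5 the distribution is flatter, so sampling picks lower-ranked tokens more often.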

---
# Part 2: Tokenization - Breaking Text into Pieces

## What is Tokenization?

Tokenization is the process of breaking text into smaller units (tokens) that the model can process. Different languages and writing systems require different tokenization strategies.

**Key Concepts:**
- Tokens can be words, subwords, or characters
- Different tokenizers handle different languages differently
- Languages with rich morphology (like Kinyarwanda) may be tokenized less efficiently
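
To get a feel for the different granularities, the sketch below splits one sentence into words, characters, and UTF-8 bytes using only built-in Python. Subword tokenizers sit between the word and character extremes. GPT-2's tokenizer builds its subwords from bytes, which is one reason accented characters can cost extra tokens.

```python
text = "Hola, ¿cómo estás?"

word_tokens = text.split()                 # split on whitespace
char_tokens = list(text)                   # one token per Unicode character
byte_tokens = list(text.encode("utf-8"))   # one token per UTF-8 byte

print(f"Words:      {len(word_tokens):>2}  {word_tokens}")
print(f"Characters: {len(char_tokens):>2}")
print(f"Bytes:      {len(byte_tokens):>2}")
```

The byte count is higher than the character count because each accented character takes more than one byte in UTF-8.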

## Comparing Different Tokenizers

In [None]:
# Load different tokenizers
print("Loading different tokenizers...\n")

tokenizers_to_compare = {
    "GPT-2": AutoTokenizer.from_pretrained("gpt2"),
    "BERT": AutoTokenizer.from_pretrained("bert-base-uncased"),
    "RoBERTa": AutoTokenizer.from_pretrained("roberta-base"),
}

print("✅ Tokenizers loaded successfully!")

In [None]:
def compare_tokenization(text, tokenizers_dict):
    """
    Compare how different tokenizers process the same text.
    """
    print(f"\n📝 Original text: '{text}'\n")
    print("=" * 80)
    
    for name, tokenizer in tokenizers_dict.items():
        tokens = tokenizer.tokenize(text)
        token_ids = tokenizer.encode(text, add_special_tokens=False)
        
        print(f"\n{name}:")
        print(f"  Number of tokens: {len(tokens)}")
        print(f"  Tokens: {tokens}")
        print(f"  Token IDs: {token_ids}")
    
    print("\n" + "=" * 80)

# Example 1: English text
compare_tokenization("Hello, how are you today?", tokenizers_to_compare)

In [None]:
# Example 2: Technical text
compare_tokenization("Machine learning is revolutionizing technology.", tokenizers_to_compare)

## Tokenization for Different Languages

Let's see how tokenization works for different languages, including Kinyarwanda. This matters because tokenizers such as GPT-2's were trained mostly on English text, so other languages often need more tokens for the same content.

In [None]:
# Test sentences in different languages
multilingual_examples = {
    "English": "Hello, how are you?",
    "Kinyarwanda": "Mwaramutse, mumeze mute?",
    "French": "Bonjour, comment allez-vous?",
    "Swahili": "Habari, unajisikiaje?",
    "Spanish": "Hola, ¿cómo estás?",
}

def analyze_multilingual_tokenization(examples, tokenizer, tokenizer_name):
    """
    Analyze how a tokenizer handles different languages.
    """
    print(f"\n{'='*80}")
    print(f"Tokenizer: {tokenizer_name}")
    print(f"{'='*80}\n")
    
    results = {}
    
    for language, text in examples.items():
        tokens = tokenizer.tokenize(text)
        num_tokens = len(tokens)
        num_chars = len(text)
        efficiency = num_chars / num_tokens if num_tokens > 0 else 0
        
        results[language] = {
            'tokens': tokens,
            'num_tokens': num_tokens,
            'num_chars': num_chars,
            'efficiency': efficiency
        }
        
        print(f"{language}:")
        print(f"  Text: '{text}'")
        print(f"  Tokens: {tokens}")
        print(f"  Number of tokens: {num_tokens}")
        print(f"  Characters per token: {efficiency:.2f}")
        print()
    
    return results

# Analyze with GPT-2 tokenizer
gpt2_results = analyze_multilingual_tokenization(
    multilingual_examples, 
    tokenizers_to_compare["GPT-2"],
    "GPT-2"
)

In [None]:
# Visualize tokenization efficiency across languages
def visualize_tokenization_efficiency(results):
    """
    Visualize how efficiently different languages are tokenized.
    """
    languages = list(results.keys())
    num_tokens = [results[lang]['num_tokens'] for lang in languages]
    efficiency = [results[lang]['efficiency'] for lang in languages]
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Number of tokens
    ax1.bar(languages, num_tokens, color='steelblue')
    ax1.set_ylabel('Number of Tokens')
    ax1.set_title('Number of Tokens per Language')
    ax1.tick_params(axis='x', rotation=45)
    
    # Efficiency (chars per token)
    ax2.bar(languages, efficiency, color='coral')
    ax2.set_ylabel('Characters per Token')
    ax2.set_title('Tokenization Efficiency (Higher = More Efficient)')
    ax2.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

visualize_tokenization_efficiency(gpt2_results)

## 🎯 Exercise 2: Explore Tokenization

### Part A: Experiment with Different Texts

Try tokenizing:
1. Long Kinyarwanda sentences
2. Technical terms in Kinyarwanda
3. Mixed language text (code-switching)

**Questions to consider:**
- Which languages are tokenized more efficiently?
- Why might some languages require more tokens?
- What are the implications for LLM performance?

In [None]:
# Your turn! Add your own examples
your_examples = {
    "Example 1": "Add your text here",
    "Example 2": "Add another example",
    # Add more examples
}

# Uncomment to test:
# your_results = analyze_multilingual_tokenization(your_examples, tokenizers_to_compare["GPT-2"], "GPT-2")
# visualize_tokenization_efficiency(your_results)

### Part B: OpenAI Tokenizer Playground

**📎 Online Exercise:**

Visit the OpenAI Tokenizer Playground: https://platform.openai.com/tokenizer

**Tasks:**
1. Test the same Kinyarwanda sentences you used above
2. Compare the token counts with GPT-2
3. Try different GPT models (GPT-3.5, GPT-4) and observe differences
4. Experiment with:
   - Punctuation
   - Numbers
   - Special characters
   - Emojis

**Discussion Points:**
- Why do newer models (GPT-4) tokenize some languages more efficiently?
- What does this mean for cost and performance?
- How might this affect model training on low-resource languages?

## Understanding Subword Tokenization

Let's visualize how subword tokenization works with a detailed example.

In [None]:
def visualize_subword_tokens(text, tokenizer, tokenizer_name):
    """
    Visualize how text is broken into subword tokens.
    """
    tokens = tokenizer.tokenize(text)
    
    print(f"\nTokenizer: {tokenizer_name}")
    print(f"Original text: '{text}'")
    print(f"\nToken breakdown:")
    print("-" * 60)
    
    for i, token in enumerate(tokens, 1):
        # Show the token and its representation
        token_clean = token.replace('Ġ', '▁')  # GPT-2 marks a leading space with Ġ; show it as ▁
        token_id = tokenizer.convert_tokens_to_ids([token])[0]
        print(f"Token {i:2d}: {token_clean:20s} (ID: {token_id})")
    
    print("-" * 60)
    print(f"Total tokens: {len(tokens)}\n")

# Example with uncommon/technical words
examples = [
    "The biotechnology industry is growing.",
    "Umunyarwanda w'umwanditsi",  # Kinyarwanda
    "Preprocessing and tokenization",
]

for example in examples:
    visualize_subword_tokens(example, tokenizers_to_compare["GPT-2"], "GPT-2")
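
The GPT-2 tokenizer used above is based on byte-pair encoding (BPE). The sketch below shows the core idea of how BPE learns its vocabulary: start from single characters and repeatedly merge the most frequent adjacent pair. It is a simplified toy. The corpus and the function name `learn_bpe_merges` are made up here, and real tokenizers add details such as byte-level handling and word-boundary markers.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Start with each word as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus
words = ["low", "low", "lower", "lowest", "new", "newer"]
merges, corpus = learn_bpe_merges(words, num_merges=3)
print("Merges:", merges)
print("Segmented words:", list(corpus))
```

After a few merges the frequent word "low" becomes a single token, while rarer words are still split into pieces. The same thing happens at scale: text that is frequent in the tokenizer's training data gets short encodings, and everything else gets fragmented.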

---
# Part 3: Vector Embeddings - Representing Meaning

## What are Embeddings?

Embeddings are numerical representations (vectors) of tokens that capture aspects of their meaning. Tokens used in similar contexts tend to have similar embeddings.

**Key Concepts:**
- Each token is represented as a vector of numbers (768 dimensions for GPT-2 small; larger models use more)
- Similar meanings → Similar vectors
- We can measure similarity using cosine similarity
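
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them rather than their size. Here is a minimal sketch with made-up 3-dimensional vectors. The numbers are invented for illustration and are not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(f"cat vs dog: {cosine(cat, dog):.3f}")
print(f"cat vs car: {cosine(cat, car):.3f}")
```

Vectors pointing in nearly the same direction score close to 1, and vectors pointing in different directions score lower.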

## Extracting Embeddings from GPT-2

In [None]:
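def get_word_embedding(word, model, tokenizer):
    """
    Get the input embedding vector for a word.

    Note: if the tokenizer splits the word into several tokens, only the
    first token's embedding is returned. GPT-2 also tokenizes a word
    differently with and without a leading space ("king" vs " king").
    """
    # Get token ID(s); keep only the first one
    token_ids = tokenizer.encode(word, add_special_tokens=False)
    if len(token_ids) > 1:
        print(f"Note: '{word}' splits into {len(token_ids)} tokens; using the first only")
    token_id = token_ids[0]
    
    # Get embedding from model's embedding layer
    embedding = model.transformer.wte.weight[token_id].detach().numpy()
    
    return embedding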

# Get embeddings for some words
words = ["king", "queen", "man", "woman", "cat", "dog", "computer", "phone"]
embeddings = {}

for word in words:
    embeddings[word] = get_word_embedding(word, model, tokenizer)
    print(f"✅ Embedding for '{word}': shape {embeddings[word].shape}")

print(f"\nEmbedding dimension: {embeddings[words[0]].shape[0]}")

## Computing Similarity Between Words

In [None]:
def compute_similarity_matrix(words, embeddings):
    """
    Compute cosine similarity between all pairs of words.
    """
    n = len(words)
    similarity_matrix = np.zeros((n, n))
    
    for i, word1 in enumerate(words):
        for j, word2 in enumerate(words):
            emb1 = embeddings[word1].reshape(1, -1)
            emb2 = embeddings[word2].reshape(1, -1)
            similarity_matrix[i, j] = cosine_similarity(emb1, emb2)[0, 0]
    
    return similarity_matrix

def visualize_similarity_matrix(words, similarity_matrix):
    """
    Visualize the similarity matrix as a heatmap.
    """
    plt.figure(figsize=(10, 8))
    sns.heatmap(similarity_matrix, 
                xticklabels=words, 
                yticklabels=words,
                annot=True, 
                fmt='.3f',
                cmap='coolwarm',
                center=0.5,
                vmin=0,
                vmax=1)
    plt.title('Cosine Similarity Between Word Embeddings')
    plt.tight_layout()
    plt.show()

# Compute and visualize similarities
similarity_matrix = compute_similarity_matrix(words, embeddings)
visualize_similarity_matrix(words, similarity_matrix)

## Interpreting Similarity Scores

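**What do the numbers mean?**

Cosine similarity ranges from -1 to 1. The bands below are rough rules of thumb; absolute values vary from model to model, so compare scores with each other rather than against fixed thresholds.
- 1.0: Identical (same word)
- 0.8-0.9: Very similar meaning
- 0.6-0.7: Related concepts
- 0.4-0.5: Some relation
- < 0.4: Not very related

**What to look for in the heatmap:**
- Words with similar meanings tend to have higher similarity scores
- Semantic relationships are often captured (e.g., king-queen, man-woman)
- Category relationships often show up as well (e.g., cat-dog, computer-phone)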

In [None]:
def find_most_similar(target_word, words, embeddings, top_k=5):
    """
    Find the most similar words to a target word.
    """
    target_emb = embeddings[target_word].reshape(1, -1)
    similarities = []
    
    for word in words:
        if word != target_word:
            emb = embeddings[word].reshape(1, -1)
            sim = cosine_similarity(target_emb, emb)[0, 0]
            similarities.append((word, sim))
    
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    print(f"\nWords most similar to '{target_word}':")
    print("-" * 40)
    for i, (word, sim) in enumerate(similarities[:top_k], 1):
        print(f"{i}. {word:<15} (similarity: {sim:.4f})")

find_most_similar("king", words, embeddings)
find_most_similar("computer", words, embeddings)

## Visualizing Embeddings in 2D

Embeddings exist in high-dimensional space (768 dimensions for GPT-2). We can use dimensionality reduction to visualize them in 2D.

In [None]:
def visualize_embeddings_2d(words, embeddings):
    """
    Visualize embeddings in 2D using PCA.
    """
    # Prepare embedding matrix
    embedding_matrix = np.array([embeddings[word] for word in words])
    
    # Reduce to 2D using PCA
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(embedding_matrix)
    
    # Plot
    plt.figure(figsize=(12, 8))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100, alpha=0.6)
    
    # Add labels
    for i, word in enumerate(words):
        plt.annotate(word, 
                    (embeddings_2d[i, 0], embeddings_2d[i, 1]),
                    fontsize=12,
                    ha='center',
                    va='bottom')
    
    plt.xlabel(f'First Principal Component ({pca.explained_variance_ratio_[0]:.2%} variance)')
    plt.ylabel(f'Second Principal Component ({pca.explained_variance_ratio_[1]:.2%} variance)')
    plt.title('Word Embeddings Visualized in 2D (PCA)')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print(f"\nTotal variance explained: {sum(pca.explained_variance_ratio_):.2%}")

visualize_embeddings_2d(words, embeddings)
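
To see what PCA does behind the `PCA(n_components=2)` call, here is a minimal sketch that finds only the first principal component by hand: center the data, build the covariance matrix, and use power iteration to find the direction of greatest variance. The 3-D points and the function name are made up for illustration. scikit-learn uses a different, more robust algorithm internally.

```python
import math

def first_principal_component(points, iterations=200):
    """Return (direction, explained_variance_ratio) of the first principal component."""
    n, dim = len(points), len(points[0])
    # Center the data: PCA works on deviations from the mean.
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - means[d] for d in range(dim)] for p in points]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Variance along that direction, as a share of the total variance.
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim))
    total_variance = sum(cov[d][d] for d in range(dim))
    return vec, eigenvalue / total_variance

# Hypothetical 3-D points lying close to a straight line.
points = [[1.0, 2.0, 0.1], [2.0, 4.1, 0.0], [3.0, 5.9, 0.2], [4.0, 8.0, 0.1]]
direction, ratio = first_principal_component(points)
print("Direction:", [round(x, 3) for x in direction])
print(f"Explained variance: {ratio:.2%}")
```

Because these points almost lie on a line, a single component explains nearly all of the variance. With real 768-dimensional embeddings, two components usually capture much less, so treat the 2D plot as a rough picture.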

## Vector Arithmetic: The Famous "King - Man + Woman = Queen" Example

In [None]:
def vector_arithmetic_example(embeddings, tokenizer):
    """
    Demonstrate vector arithmetic with embeddings.
    """
    # Get embeddings
    king_emb = embeddings['king']
    man_emb = embeddings['man']
    woman_emb = embeddings['woman']
    
    # Compute: king - man + woman
    result_emb = king_emb - man_emb + woman_emb
    
    # Find the closest tokens to the result across the full vocabulary.
    # Searching only the first few thousand token IDs could miss the answer,
    # and GPT-2's embedding matrix is small enough to compare in one call.
    all_embeddings = model.transformer.wte.weight.detach().numpy()
    similarities = cosine_similarity([result_emb], all_embeddings)[0]
    
    # Get top 10 matches (the input words themselves may rank near the top)
    top_indices = np.argsort(similarities)[::-1][:10]
    
    print("Vector Arithmetic: king - man + woman = ?\n")
    print("Top 10 closest words:")
    print("-" * 50)
    
    for i, idx in enumerate(top_indices, 1):
        word = tokenizer.decode([idx])
        sim = similarities[idx]
        print(f"{i:2d}. {word:<20} (similarity: {sim:.4f})")

vector_arithmetic_example(embeddings, tokenizer)
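
Results from a real model can be noisy, so it helps to see the idea in an idealized setting first. The sketch below uses hand-made 2-D vectors where one axis roughly means "royalty" and the other "gender". These vectors are invented for illustration. It also excludes the three input words from the search, which is common practice when evaluating analogies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman", toy))  # queen
```

Here the arithmetic works perfectly because the axes were designed for it. In learned embeddings the directions are only approximate, so the expected word may or may not come out on top.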

## Exploring More Word Relationships

In [None]:
# Let's explore more semantic categories
semantic_groups = {
    "Royalty": ["king", "queen", "prince", "princess"],
    "Animals": ["cat", "dog", "lion", "tiger"],
    "Technology": ["computer", "phone", "internet", "software"],
    "Countries": ["France", "Rwanda", "Japan", "Brazil"],
}

# Get embeddings for all words
all_words = []
all_embeddings = {}

for category, words_list in semantic_groups.items():
    for word in words_list:
        try:
            all_embeddings[word] = get_word_embedding(word, model, tokenizer)
            all_words.append(word)
        except Exception:
            print(f"Could not get embedding for: {word}")

print(f"\nGot embeddings for {len(all_words)} words")

# Visualize all semantic groups
if len(all_words) > 0:
    visualize_embeddings_2d(all_words, all_embeddings)

## 🎯 Exercise 3: Explore Embeddings

### Part A: Custom Word Lists

Create your own word lists and explore their embeddings:

**Suggested explorations:**
1. Professional titles (doctor, teacher, engineer, farmer)
2. Colors (red, blue, green, yellow)
3. Emotions (happy, sad, angry, excited)
4. Foods (rice, bread, banana, coffee)
5. Kinyarwanda words (if available in tokenizer)

In [None]:
# Your turn! Create your own word list
your_words = [
    "doctor", "teacher", "engineer", "farmer",
    "hospital", "school", "office", "farm"
]

# Get embeddings
your_embeddings = {}
for word in your_words:
    try:
        your_embeddings[word] = get_word_embedding(word, model, tokenizer)
    except Exception:
        print(f"Skipping: {word}")

# Analyze
if len(your_embeddings) > 1:
    print("\nSimilarity Analysis:")
    valid_words = list(your_embeddings.keys())
    sim_matrix = compute_similarity_matrix(valid_words, your_embeddings)
    visualize_similarity_matrix(valid_words, sim_matrix)
    visualize_embeddings_2d(valid_words, your_embeddings)

### Part B: Vector Arithmetic Experiments

Try your own vector arithmetic:
- Paris - France + Rwanda = ?
- Doctor - Hospital + School = ?
- Computer - Technology + Nature = ?

Think about:
- What relationships are captured?
- What relationships are missed?
- Why might some analogies work better than others?

In [None]:
# Your custom vector arithmetic here
# Example: word1 - word2 + word3 = ?

def custom_vector_arithmetic(word1, word2, word3, tokenizer, model, top_k=10):
    """
    Compute: word1 - word2 + word3 = ?
    """
    try:
        emb1 = get_word_embedding(word1, model, tokenizer)
        emb2 = get_word_embedding(word2, model, tokenizer)
        emb3 = get_word_embedding(word3, model, tokenizer)
        
        result = emb1 - emb2 + emb3
        
        # Find closest words
        all_embeddings = model.transformer.wte.weight.detach().numpy()
        # Search the full vocabulary so the expected answer is not cut off
        similarities = cosine_similarity([result], all_embeddings)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        print(f"\n{word1} - {word2} + {word3} = ?\n")
        print("Top matches:")
        print("-" * 40)
        for i, idx in enumerate(top_indices, 1):
            word = tokenizer.decode([idx])
            print(f"{i:2d}. {word:<20} ({similarities[idx]:.4f})")
    except Exception as e:
        print(f"Error: {e}")

# Try some analogies
custom_vector_arithmetic("Paris", "France", "Rwanda", tokenizer, model)

---
# Summary and Key Takeaways

## What We Learned

### 1. Next Token Prediction
- LLMs predict the next token based on probability distributions
- Temperature controls randomness in generation
- The model assigns a probability to every token in its vocabulary (50,257 tokens for GPT-2)

### 2. Tokenization
- Text is broken into tokens (words, subwords, or characters)
- Different tokenizers handle languages differently
- Languages with less training data are often tokenized less efficiently
- Kinyarwanda and other low-resource languages may require more tokens
- This affects both cost (API pricing) and model performance

### 3. Vector Embeddings
- Words are represented as vectors in high-dimensional space
- Similar meanings have similar vectors
- We can measure similarity using cosine similarity
- Embeddings capture semantic relationships
- Vector arithmetic can reveal word analogies

## Important Implications

### For Low-Resource Languages (like Kinyarwanda):
1. **Tokenization Challenges**: More tokens needed → Higher costs, longer context
2. **Representation**: Fewer examples in training data → Potentially less accurate
3. **Solutions**:
   - Train language-specific tokenizers
   - Use multilingual models
   - Fine-tune on local language data
   - Develop community datasets

### For Model Development:
1. Tokenization strategy affects model performance
2. Embeddings quality depends on training data
3. Context length limitations impact what the model can process

## Next Steps

1. **Explore More Models**: Try different open-source models (Llama, Mistral, etc.)
2. **Build Custom Tokenizers**: Create tokenizers optimized for Kinyarwanda
3. **Fine-tuning**: Adapt models for specific tasks or languages
4. **Contribute**: Help build datasets for low-resource languages

## Additional Resources

- OpenAI Tokenizer: https://platform.openai.com/tokenizer
- Hugging Face Transformers: https://huggingface.co/transformers/
- Papers:
  - "Attention Is All You Need" (Transformer architecture)
  - "Language Models are Few-Shot Learners" (GPT-3)
  - "Neural Machine Translation of Rare Words with Subword Units" (BPE)

---

## 🎓 Final Exercise: Reflection Questions

1. How might tokenization inefficiency affect the cost of using LLMs for Kinyarwanda applications?
2. What are some strategies to improve LLM performance for low-resource languages?
3. How do embeddings capture meaning, and what are their limitations?
4. Why is understanding these fundamentals important for building AI applications?

**Discussion**: Share your insights with your peers and instructor!