# Understanding Large Language Models (LLMs)
## A Beginner's Guide to Next Token Prediction, Tokenization, and Embeddings

**Learning Objectives:**
1. Understand how LLMs predict the next token
2. Learn about tokenization and how it works across different languages
3. Build intuition about vector embeddings and how meaning is represented
4. Explore vector databases and semantic search for document retrieval

**Prerequisites:**
- Basic Python knowledge
- Understanding of basic machine learning concepts (helpful but not required)

---

## Setup and Installation

First, let's install the required libraries. We'll use open-source models from Hugging Face.

In [None]:
# Install required packages
!pip install transformers torch numpy matplotlib seaborn scikit-learn tokenizers sentencepiece chromadb sentence-transformers --quiet

print("✅ All packages installed successfully!")

In [None]:
# Import libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoTokenizer, AutoModelForCausalLM, GPT2LMHeadModel, GPT2Tokenizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("✅ Libraries imported successfully!")

---
# Part 1: Next Token Prediction - The Core of LLMs

## What is Next Token Prediction?

LLMs work by predicting the next token (word or subword) given a sequence of previous tokens. This simple idea is the foundation of how models like GPT, LLaMA, and others generate text.

**Key Concept:** Given "The cat sat on the", the model predicts "mat" (or "chair", "floor", etc.) based on probabilities.
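
Before loading a real model, here is a minimal sketch of what "based on probabilities" means. The scores below are made up for illustration (a real model produces one score, called a logit, for every token in its vocabulary), and `softmax` is written by hand rather than taken from a library.

```python
import math

# Made-up scores (logits) for a tiny vocabulary; a real model produces
# one score per token in a vocabulary of tens of thousands.
toy_logits = {"mat": 3.0, "floor": 2.0, "chair": 1.5, "moon": -1.0}

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - max_score) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

toy_probs = softmax(toy_logits)

# Print candidates from most to least likely
for tok, p in sorted(toy_probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok:<6} {p:.3f}")

most_likely = max(toy_probs, key=toy_probs.get)
```

The highest-scoring token gets the largest probability, but the other candidates keep some probability too, which is why generated text can vary between runs.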

Let's see this in action with GPT-2, a small open-source model.

In [None]:
# Load GPT-2 small model (124M parameters)
print("Loading GPT-2 model...")
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()  # Set to evaluation mode

print(f"✅ Model loaded: {model_name}")
print("Model size: ~124M parameters")

## Visualizing Next Token Prediction

Let's see what the model predicts as the next token for different prompts.

In [None]:
def predict_next_tokens(text, top_k=10):
    """
    Predict the most likely next tokens given input text.
    
    Args:
        text: Input text prompt
        top_k: Number of top predictions to show
    """
    # Tokenize input
    input_ids = tokenizer.encode(text, return_tensors='pt')
    
    # Get model predictions
    with torch.no_grad():
        outputs = model(input_ids)
        predictions = outputs.logits
    
    # Get the predictions for the next token (last position)
    next_token_logits = predictions[0, -1, :]
    
    # Convert to probabilities
    next_token_probs = torch.softmax(next_token_logits, dim=-1)
    
    # Get top k predictions
    top_probs, top_indices = torch.topk(next_token_probs, top_k)
    
    # Display results
    print(f"\n📝 Input: '{text}'\n")
    print("Top predictions for the next token:\n")
    print(f"{'Rank':<6} {'Token':<20} {'Probability':<12}")
    print("-" * 50)
    
    for rank, (prob, idx) in enumerate(zip(top_probs, top_indices), 1):
        token = tokenizer.decode([idx])
        print(f"{rank:<6} {repr(token):<20} {prob.item():.4f} ({prob.item()*100:.2f}%)")
    
    return top_probs, top_indices

# Example 1: Simple completion
predict_next_tokens("The capital of Rwanda is")

In [None]:
# Example 2: Another completion
predict_next_tokens("Once upon a time")

In [None]:
# Example 3: Technical context
predict_next_tokens("Machine learning is")

### 🎯 Exercise 1: Experiment with Next Token Prediction

Try different prompts and observe:
1. How do probabilities change with different contexts?
2. What happens with ambiguous prompts?
3. Try prompts in different languages (if the model supports them)

In [None]:
# Your turn! Try your own prompts here:
your_prompt = "The weather today is"  # Change this!
predict_next_tokens(your_prompt)

## Visualizing Probability Distribution

Let's visualize how confident the model is about different predictions.

In [None]:
def visualize_predictions(text, top_k=15):
    """
    Visualize the probability distribution of next token predictions.
    """
    input_ids = tokenizer.encode(text, return_tensors='pt')
    
    with torch.no_grad():
        outputs = model(input_ids)
        next_token_logits = outputs.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
    
    top_probs, top_indices = torch.topk(next_token_probs, top_k)
    tokens = [tokenizer.decode([idx]) for idx in top_indices]
    
    # Create visualization
    plt.figure(figsize=(12, 6))
    plt.barh(range(top_k), top_probs.numpy())
    plt.yticks(range(top_k), [f"{i+1}. {repr(t)}" for i, t in enumerate(tokens)])
    plt.xlabel('Probability')
    plt.title(f'Top {top_k} Next Token Predictions for: "{text}"')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
    
visualize_predictions("The capital of Rwanda is")

## Understanding Temperature in Text Generation

Temperature controls the randomness of predictions. Let's see how it affects generation.
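
As a rough sketch of the mechanism: temperature divides the logits before they are turned into probabilities. The logits below are made-up values, and `softmax_with_temperature` is a hypothetical helper written for this illustration, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At low temperature almost all of the probability goes to the top candidate; at high temperature the distribution flattens, so sampling picks less likely tokens more often.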

In [None]:
def generate_with_temperature(prompt, temperature=1.0, max_length=50):
    """
    Generate text with different temperature settings.
    
    Temperature:
    - Low (0.1-0.5): More deterministic, focused
    - Medium (0.7-1.0): Balanced
    - High (1.5-2.0): More random, creative
    """
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    
    output = model.generate(
        input_ids,
        max_length=max_length,
        temperature=temperature,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Temperature: {temperature}")
    print(f"Generated: {generated_text}\n")
    print("-" * 80)

prompt = "Artificial intelligence will"

print("Comparing different temperatures:\n")
generate_with_temperature(prompt, temperature=0.3)
generate_with_temperature(prompt, temperature=1.0)
generate_with_temperature(prompt, temperature=1.5)

---
# Part 2: Tokenization - Breaking Text into Pieces

## What is Tokenization?

Tokenization is the process of breaking text into smaller units (tokens) that the model can process. Different languages and writing systems require different tokenization strategies.

**Key Concepts:**
- Tokens can be words, subwords, or characters
- Different tokenizers handle different languages differently
- Languages that are underrepresented in a tokenizer's training data (like Kinyarwanda) tend to be split into more tokens, especially when they have rich morphology

## Comparing Different Tokenizers

In [None]:
# Load different tokenizers
print("Loading different tokenizers...\n")

tokenizers_to_compare = {
    "GPT-2": AutoTokenizer.from_pretrained("gpt2"),
    "BERT": AutoTokenizer.from_pretrained("bert-base-uncased"),
    "RoBERTa": AutoTokenizer.from_pretrained("roberta-base"),
}

print("✅ Tokenizers loaded successfully!")

In [None]:
def compare_tokenization(text, tokenizers_dict):
    """
    Compare how different tokenizers process the same text.
    """
    print(f"\n📝 Original text: '{text}'\n")
    print("=" * 80)
    
    for name, tokenizer in tokenizers_dict.items():
        tokens = tokenizer.tokenize(text)
        token_ids = tokenizer.encode(text, add_special_tokens=False)
        
        print(f"\n{name}:")
        print(f"  Number of tokens: {len(tokens)}")
        print(f"  Tokens: {tokens}")
        print(f"  Token IDs: {token_ids}")
    
    print("\n" + "=" * 80)

# Example 1: English text
compare_tokenization("Hello, how are you today?", tokenizers_to_compare)

In [None]:
# Example 2: Technical text
compare_tokenization("Machine learning is revolutionizing technology.", tokenizers_to_compare)

## Tokenization for Different Languages

Let's see how tokenization works for different languages, including Kinyarwanda. This is important because most tokenizers are trained primarily on English data.
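
One contributing factor is easy to check without any model. GPT-2's tokenizer works on UTF-8 bytes, and characters outside ASCII (such as accented letters) take more than one byte. The short sketch below only counts characters and bytes; it does not run a tokenizer.

```python
samples = {
    "English": "Hello, how are you?",
    "Kinyarwanda": "Mwaramutse, mumeze mute?",
    "Spanish": "Hola, ¿cómo estás?",
}

for language, text in samples.items():
    n_chars = len(text)                   # number of Unicode characters
    n_bytes = len(text.encode("utf-8"))   # number of bytes a byte-level tokenizer starts from
    print(f"{language:<12} chars={n_chars:<3} utf8_bytes={n_bytes}")
```

Note that the Kinyarwanda sentence is plain ASCII, so byte length alone does not explain its token count. What matters more is which character sequences were frequent in the tokenizer's training data.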

In [None]:
# Test sentences in different languages
multilingual_examples = {
    "English": "Hello, how are you?",
    "Kinyarwanda": "Mwaramutse, mumeze mute?",
    "French": "Bonjour, comment allez-vous?",
    "Swahili": "Habari, unajisikiaje?",
    "Spanish": "Hola, ¿cómo estás?",
}

def analyze_multilingual_tokenization(examples, tokenizer, tokenizer_name):
    """
    Analyze how a tokenizer handles different languages.
    """
    print(f"\n{'='*80}")
    print(f"Tokenizer: {tokenizer_name}")
    print(f"{'='*80}\n")
    
    results = {}
    
    for language, text in examples.items():
        tokens = tokenizer.tokenize(text)
        num_tokens = len(tokens)
        num_chars = len(text)
        efficiency = num_chars / num_tokens if num_tokens > 0 else 0
        
        results[language] = {
            'tokens': tokens,
            'num_tokens': num_tokens,
            'num_chars': num_chars,
            'efficiency': efficiency
        }
        
        print(f"{language}:")
        print(f"  Text: '{text}'")
        print(f"  Tokens: {tokens}")
        print(f"  Number of tokens: {num_tokens}")
        print(f"  Characters per token: {efficiency:.2f}")
        print()
    
    return results

# Analyze with GPT-2 tokenizer
gpt2_results = analyze_multilingual_tokenization(
    multilingual_examples, 
    tokenizers_to_compare["GPT-2"],
    "GPT-2"
)

In [None]:
# Visualize tokenization efficiency across languages
def visualize_tokenization_efficiency(results):
    """
    Visualize how efficiently different languages are tokenized.
    """
    languages = list(results.keys())
    num_tokens = [results[lang]['num_tokens'] for lang in languages]
    efficiency = [results[lang]['efficiency'] for lang in languages]
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Number of tokens
    ax1.bar(languages, num_tokens, color='steelblue')
    ax1.set_ylabel('Number of Tokens')
    ax1.set_title('Number of Tokens per Language')
    ax1.tick_params(axis='x', rotation=45)
    
    # Efficiency (chars per token)
    ax2.bar(languages, efficiency, color='coral')
    ax2.set_ylabel('Characters per Token')
    ax2.set_title('Tokenization Efficiency (Higher = More Efficient)')
    ax2.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

visualize_tokenization_efficiency(gpt2_results)

## 🎯 Exercise 2: Explore Tokenization

### Part A: Experiment with Different Texts

Try tokenizing:
1. Long Kinyarwanda sentences
2. Technical terms in Kinyarwanda
3. Mixed language text (code-switching)

**Questions to consider:**
- Which languages are tokenized more efficiently?
- Why might some languages require more tokens?
- What are the implications for LLM performance?

In [None]:
# Your turn! Add your own examples
your_examples = {
    "Example 1": "Add your text here",
    "Example 2": "Add another example",
    # Add more examples
}

# Uncomment to test:
# your_results = analyze_multilingual_tokenization(your_examples, tokenizers_to_compare["GPT-2"], "GPT-2")
# visualize_tokenization_efficiency(your_results)

### Part B: OpenAI Tokenizer Playground

**📎 Online Exercise:**

Visit the OpenAI Tokenizer Playground: https://platform.openai.com/tokenizer

**Tasks:**
1. Test the same Kinyarwanda sentences you used above
2. Compare the token counts with GPT-2
3. Try different GPT models (GPT-3.5, GPT-4) and observe differences
4. Experiment with:
   - Punctuation
   - Numbers
   - Special characters
   - Emojis

**Discussion Points:**
- Why do newer models (GPT-4) tokenize some languages more efficiently?
- What does this mean for cost and performance?
- How might this affect model training on low-resource languages?

## Understanding Subword Tokenization

Let's visualize how subword tokenization works with a detailed example.
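
To build intuition first, here is a toy subword tokenizer. It is a simplified sketch, not GPT-2's actual algorithm: GPT-2 uses byte-pair encoding with learned merge rules, while this version greedily takes the longest matching piece from a small vocabulary (closer in spirit to how WordPiece splits words). The vocabulary `toy_vocab` is invented for this example.

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right.

    Falls back to single characters when no longer piece matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary (or one char long)
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical vocabulary, invented for this illustration
toy_vocab = {"token", "ization", "pre", "process", "ing", "bio", "techno", "logy"}

for w in ["tokenization", "preprocessing", "biotechnology", "umwanditsi"]:
    print(f"{w:<15} -> {greedy_subword_tokenize(w, toy_vocab)}")
```

Words built from pieces the vocabulary knows split into a few tokens, while a word the vocabulary has never seen falls apart into single characters. The same effect, on a larger scale, is why text in underrepresented languages uses more tokens.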

In [None]:
def visualize_subword_tokens(text, tokenizer, tokenizer_name):
    """
    Visualize how text is broken into subword tokens.
    """
    tokens = tokenizer.tokenize(text)
    
    print(f"\nTokenizer: {tokenizer_name}")
    print(f"Original text: '{text}'")
    print(f"\nToken breakdown:")
    print("-" * 60)
    
    for i, token in enumerate(tokens, 1):
        # Show the token and its representation
        token_clean = token.replace('Ġ', '▁')  # Show GPT-2's space marker (Ġ) as ▁
        token_id = tokenizer.convert_tokens_to_ids([token])[0]
        print(f"Token {i:2d}: {token_clean:20s} (ID: {token_id})")
    
    print("-" * 60)
    print(f"Total tokens: {len(tokens)}\n")

# Example with uncommon/technical words
examples = [
    "The biotechnology industry is growing.",
    "Umunyarwanda w'umwanditsi",  # Kinyarwanda
    "Preprocessing and tokenization",
]

for example in examples:
    visualize_subword_tokens(example, tokenizers_to_compare["GPT-2"], "GPT-2")

---
# Part 3: Vector Embeddings - Representing Meaning

## What are Embeddings?

Embeddings are numerical representations (vectors) of tokens that capture their meaning. Similar words have similar embeddings.

**Key Concepts:**
- Each token is represented as a vector of numbers (768 dimensions for GPT-2 small; larger models use more)
- Similar meanings → Similar vectors
- We can measure similarity using cosine similarity
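
Cosine similarity is simple enough to compute by hand. The sketch below uses hand-made 3-dimensional vectors (illustrative values, not taken from any model) to show the formula: the dot product of two vectors divided by the product of their lengths.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made 3-dimensional "embeddings" (illustrative values, not from a model)
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "computer": [0.1, 0.2, 0.9],
}

print(f"cat vs dog:      {cosine(toy_vectors['cat'], toy_vectors['dog']):.3f}")
print(f"cat vs computer: {cosine(toy_vectors['cat'], toy_vectors['computer']):.3f}")
```

Vectors pointing in nearly the same direction score close to 1, regardless of their length.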

## Extracting Embeddings from GPT-2

In [None]:
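def get_word_embedding(word, model, tokenizer):
    """
    Get the input-embedding vector for a word.

    GPT-2 encodes a word that follows a space as a token with a leading space,
    so we look up " word". If the word splits into several tokens, only the
    first token's embedding is returned.
    """
    # Get token IDs (leading space = word-initial form in GPT-2's vocabulary)
    token_ids = tokenizer.encode(" " + word, add_special_tokens=False)
    if len(token_ids) > 1:
        print(f"Note: '{word}' splits into {len(token_ids)} tokens; using the first one only")
    
    # Look up the first token's row in the model's embedding matrix
    embedding = model.transformer.wte.weight[token_ids[0]].detach().numpy()
    
    return embedding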

# Get embeddings for some words
words = ["king", "queen", "man", "woman", "cat", "dog", "computer", "phone"]
embeddings = {}

for word in words:
    embeddings[word] = get_word_embedding(word, model, tokenizer)
    print(f"✅ Embedding for '{word}': shape {embeddings[word].shape}")

print(f"\nEmbedding dimension: {embeddings[words[0]].shape[0]}")

## Computing Similarity Between Words

In [None]:
def compute_similarity_matrix(words, embeddings):
    """
    Compute cosine similarity between all pairs of words.
    """
    n = len(words)
    similarity_matrix = np.zeros((n, n))
    
    for i, word1 in enumerate(words):
        for j, word2 in enumerate(words):
            emb1 = embeddings[word1].reshape(1, -1)
            emb2 = embeddings[word2].reshape(1, -1)
            similarity_matrix[i, j] = cosine_similarity(emb1, emb2)[0, 0]
    
    return similarity_matrix

def visualize_similarity_matrix(words, similarity_matrix):
    """
    Visualize the similarity matrix as a heatmap.
    """
    plt.figure(figsize=(10, 8))
    sns.heatmap(similarity_matrix, 
                xticklabels=words, 
                yticklabels=words,
                annot=True, 
                fmt='.3f',
                cmap='coolwarm',
                center=0.5,
                vmin=0,
                vmax=1)
    plt.title('Cosine Similarity Between Word Embeddings')
    plt.tight_layout()
    plt.show()

# Compute and visualize similarities
similarity_matrix = compute_similarity_matrix(words, embeddings)
visualize_similarity_matrix(words, similarity_matrix)

## Interpreting Similarity Scores

**What do the numbers mean?**

As a rough rule of thumb (the exact ranges depend on the model, so compare scores relative to each other rather than against fixed thresholds):
- 1.0: Identical (same word)
- 0.8-0.9: Very similar meaning
- 0.6-0.7: Related concepts
- 0.4-0.5: Some relation
- < 0.4: Not very related

**Observations from the heatmap:**
- Words with similar meanings tend to have higher similarity scores
- Semantic relationships are often captured (e.g., king-queen, man-woman)
- So are category relationships (e.g., cat-dog, computer-phone)

In [None]:
def find_most_similar(target_word, words, embeddings, top_k=5):
    """
    Find the most similar words to a target word.
    """
    target_emb = embeddings[target_word].reshape(1, -1)
    similarities = []
    
    for word in words:
        if word != target_word:
            emb = embeddings[word].reshape(1, -1)
            sim = cosine_similarity(target_emb, emb)[0, 0]
            similarities.append((word, sim))
    
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    print(f"\nWords most similar to '{target_word}':")
    print("-" * 40)
    for i, (word, sim) in enumerate(similarities[:top_k], 1):
        print(f"{i}. {word:<15} (similarity: {sim:.4f})")

find_most_similar("king", words, embeddings)
find_most_similar("computer", words, embeddings)

## Visualizing Embeddings in 2D

Embeddings exist in high-dimensional space (768 dimensions for GPT-2). We can use dimensionality reduction to visualize them in 2D.

In [None]:
def visualize_embeddings_2d(words, embeddings):
    """
    Visualize embeddings in 2D using PCA.
    """
    # Prepare embedding matrix
    embedding_matrix = np.array([embeddings[word] for word in words])
    
    # Reduce to 2D using PCA
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(embedding_matrix)
    
    # Plot
    plt.figure(figsize=(12, 8))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100, alpha=0.6)
    
    # Add labels
    for i, word in enumerate(words):
        plt.annotate(word, 
                    (embeddings_2d[i, 0], embeddings_2d[i, 1]),
                    fontsize=12,
                    ha='center',
                    va='bottom')
    
    plt.xlabel(f'First Principal Component ({pca.explained_variance_ratio_[0]:.2%} variance)')
    plt.ylabel(f'Second Principal Component ({pca.explained_variance_ratio_[1]:.2%} variance)')
    plt.title('Word Embeddings Visualized in 2D (PCA)')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print(f"\nTotal variance explained: {sum(pca.explained_variance_ratio_):.2%}")

visualize_embeddings_2d(words, embeddings)

## Vector Arithmetic: The Famous "King - Man + Woman = Queen" Example
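
To see why the arithmetic can work at all, here is a toy sketch with hand-made 2-dimensional vectors, where one axis roughly stands for "royalty" and the other for "gender". These values are invented for illustration; real embeddings have hundreds of dimensions and no such clean axes, so results from a real model are usually noisier.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only)
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
    "princess": [0.8, -1.0],
}

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank the remaining words by cosine similarity to the result,
# excluding the three input words (a common convention for analogy tests)
candidates = {w: v for w, v in toy.items() if w not in {"king", "man", "woman"}}
ranked = sorted(candidates, key=lambda w: cosine(target, candidates[w]), reverse=True)

print(f"king - man + woman = {target}")
print(f"Closest words: {ranked}")
```

Subtracting `man` removes the "male" direction, adding `woman` adds the "female" direction, and the "royalty" component is left untouched.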

In [None]:
def vector_arithmetic_example(embeddings, tokenizer):
    """
    Demonstrate vector arithmetic with embeddings.
    """
    # Get embeddings
    king_emb = embeddings['king']
    man_emb = embeddings['man']
    woman_emb = embeddings['woman']
    
    # Compute: king - man + woman
    result_emb = king_emb - man_emb + woman_emb
    
    # Find the closest tokens to the result across the full vocabulary.
    # GPT-2's embedding matrix is small enough to compare against every row;
    # searching only the first 5000 token IDs would skip most of the vocabulary.
    all_embeddings = model.transformer.wte.weight.detach().numpy()
    similarities = cosine_similarity([result_emb], all_embeddings)[0]
    
    # Get top 10 matches
    top_indices = np.argsort(similarities)[::-1][:10]
    
    print("Vector Arithmetic: king - man + woman = ?\n")
    print("Top 10 closest words:")
    print("-" * 50)
    
    for i, idx in enumerate(top_indices, 1):
        word = tokenizer.decode([idx])
        sim = similarities[idx]
        print(f"{i:2d}. {word:<20} (similarity: {sim:.4f})")

vector_arithmetic_example(embeddings, tokenizer)

## Exploring More Word Relationships

In [None]:
# Let's explore more semantic categories
semantic_groups = {
    "Royalty": ["king", "queen", "prince", "princess"],
    "Animals": ["cat", "dog", "lion", "tiger"],
    "Technology": ["computer", "phone", "internet", "software"],
    "Countries": ["France", "Rwanda", "Japan", "Brazil"],
}

# Get embeddings for all words
all_words = []
all_embeddings = {}

for category, words_list in semantic_groups.items():
    for word in words_list:
        try:
            all_embeddings[word] = get_word_embedding(word, model, tokenizer)
            all_words.append(word)
        except Exception:
            print(f"Could not get embedding for: {word}")

print(f"\nGot embeddings for {len(all_words)} words")

# Visualize all semantic groups
if len(all_words) > 0:
    visualize_embeddings_2d(all_words, all_embeddings)

## 🎯 Exercise 3: Explore Embeddings

### Part A: Custom Word Lists

Create your own word lists and explore their embeddings:

**Suggested explorations:**
1. Professional titles (doctor, teacher, engineer, farmer)
2. Colors (red, blue, green, yellow)
3. Emotions (happy, sad, angry, excited)
4. Foods (rice, bread, banana, coffee)
5. Kinyarwanda words (if available in tokenizer)

In [None]:
# Your turn! Create your own word list
your_words = [
    "doctor", "teacher", "engineer", "farmer",
    "hospital", "school", "office", "farm"
]

# Get embeddings
your_embeddings = {}
for word in your_words:
    try:
        your_embeddings[word] = get_word_embedding(word, model, tokenizer)
    except Exception:
        print(f"Skipping: {word}")

# Analyze
if len(your_embeddings) > 1:
    print("\nSimilarity Analysis:")
    valid_words = list(your_embeddings.keys())
    sim_matrix = compute_similarity_matrix(valid_words, your_embeddings)
    visualize_similarity_matrix(valid_words, sim_matrix)
    visualize_embeddings_2d(valid_words, your_embeddings)

### Part B: Vector Arithmetic Experiments

Try your own vector arithmetic:
- Paris - France + Rwanda = ?
- Doctor - Hospital + School = ?
- Computer - Technology + Nature = ?

Think about:
- What relationships are captured?
- What relationships are missed?
- Why might some analogies work better than others?

In [None]:
# Your custom vector arithmetic here
# Example: word1 - word2 + word3 = ?

def custom_vector_arithmetic(word1, word2, word3, tokenizer, model, top_k=10):
    """
    Compute: word1 - word2 + word3 = ?
    """
    try:
        emb1 = get_word_embedding(word1, model, tokenizer)
        emb2 = get_word_embedding(word2, model, tokenizer)
        emb3 = get_word_embedding(word3, model, tokenizer)
        
        result = emb1 - emb2 + emb3
        
        # Find closest tokens across the full vocabulary
        all_embeddings = model.transformer.wte.weight.detach().numpy()
        similarities = cosine_similarity([result], all_embeddings)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        print(f"\n{word1} - {word2} + {word3} = ?\n")
        print("Top matches:")
        print("-" * 40)
        for i, idx in enumerate(top_indices, 1):
            word = tokenizer.decode([idx])
            print(f"{i:2d}. {word:<20} ({similarities[idx]:.4f})")
    except Exception as e:
        print(f"Error: {e}")

# Try some analogies
custom_vector_arithmetic("Paris", "France", "Rwanda", tokenizer, model)

---
# Part 4: Vector Databases - Storing and Retrieving Documents

## What are Vector Databases?

Vector databases are specialized databases designed to store and efficiently search through vector embeddings. They are crucial for:
- **Semantic search**: Finding documents by meaning, not just keywords
- **RAG (Retrieval Augmented Generation)**: Providing LLMs with relevant context
- **Recommendation systems**: Finding similar items
- **Question answering**: Retrieving relevant information

**How it works:**
1. Convert documents into embeddings
2. Store embeddings in a vector database
3. Convert user queries into embeddings
4. Find similar documents using vector similarity search
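
The four steps above can be sketched in plain Python. This is a toy in-memory version, not how ChromaDB is implemented: `embed` is a stand-in that counts words (a real embedding model captures meaning, not just shared words), and `store` is just a list.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Steps 1-2: embed documents and store (vector, text) pairs
toy_docs = [
    "Kigali is the capital city of Rwanda.",
    "Mountain gorillas live in Volcanoes National Park.",
    "Kinyarwanda is the national language of Rwanda.",
]
store = [(embed(d), d) for d in toy_docs]

def search(query, store, n_results=2):
    # Step 3: embed the query; step 4: rank stored documents by similarity
    q = embed(query)
    scored = sorted(((cosine(q, vec), text) for vec, text in store), reverse=True)
    return scored[:n_results]

for score, text in search("What is the capital city?", store):
    print(f"{score:.3f}  {text}")
```

A real vector database follows the same outline but replaces the linear scan with an index, so that search stays fast with millions of vectors.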

## Installing Vector Database Libraries

We'll use **ChromaDB** - a lightweight, open-source vector database perfect for learning.

In [None]:
# Install ChromaDB and sentence-transformers for better embeddings
!pip install chromadb sentence-transformers --quiet

import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
from typing import List
import uuid

print("✅ Vector database libraries installed!")

## Creating a Sample Document Collection

Let's create a collection of documents about Rwanda that we'll store in our vector database.

In [None]:
# Sample documents about Rwanda and technology
documents = [
    {
        "text": "Rwanda is a landlocked country in East Africa, known as the land of a thousand hills. The capital city is Kigali.",
        "metadata": {"category": "geography", "topic": "rwanda_overview"}
    },
    {
        "text": "Kigali is one of the cleanest cities in Africa. It has modern infrastructure and is a growing technology hub.",
        "metadata": {"category": "cities", "topic": "kigali"}
    },
    {
        "text": "Rwanda has made significant progress in technology adoption. The country has invested heavily in ICT infrastructure and digital literacy.",
        "metadata": {"category": "technology", "topic": "digital_transformation"}
    },
    {
        "text": "Machine learning and artificial intelligence are emerging fields in Rwanda. Several startups are working on AI solutions for agriculture and healthcare.",
        "metadata": {"category": "technology", "topic": "ai_ml"}
    },
    {
        "text": "Kinyarwanda is the national language of Rwanda, spoken by most of the population. French and English are also official languages.",
        "metadata": {"category": "language", "topic": "kinyarwanda"}
    },
    {
        "text": "The African Institute for Mathematical Sciences (AIMS) in Rwanda provides advanced training in mathematical sciences and data science.",
        "metadata": {"category": "education", "topic": "aims"}
    },
    {
        "text": "Rwanda's economy has grown rapidly, with technology and services sectors leading the growth. The country aims to become a knowledge-based economy.",
        "metadata": {"category": "economy", "topic": "growth"}
    },
    {
        "text": "Natural language processing for Kinyarwanda is an active research area. Challenges include limited training data and unique linguistic features.",
        "metadata": {"category": "technology", "topic": "nlp_kinyarwanda"}
    },
    {
        "text": "Mountain gorillas can be found in the Volcanoes National Park in Rwanda. Gorilla trekking is a major tourist attraction.",
        "metadata": {"category": "tourism", "topic": "wildlife"}
    },
    {
        "text": "Rwanda has implemented various digital government services. Citizens can access many government services online through the Irembo platform.",
        "metadata": {"category": "technology", "topic": "e_government"}
    }
]

print(f"📚 Created {len(documents)} sample documents")
print("\nSample document:")
print(f"Text: {documents[0]['text']}")
print(f"Metadata: {documents[0]['metadata']}")

## Setting Up the Embedding Model

We'll use a sentence transformer model that's optimized for creating semantic embeddings of text.

In [None]:
# Load a sentence transformer model
# Using 'all-MiniLM-L6-v2' - a good balance of quality and speed
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

print("✅ Model loaded: all-MiniLM-L6-v2")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

# Test the model
test_text = "Rwanda is a beautiful country"
test_embedding = embedding_model.encode(test_text)
print(f"\nTest embedding shape: {test_embedding.shape}")

## Creating a Vector Database

Now let's create a ChromaDB database and add our documents to it.

In [None]:
# Initialize ChromaDB client
chroma_client = chromadb.Client(Settings(
    anonymized_telemetry=False,
    allow_reset=True
))

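# Create or get a collection
# Chroma's default distance is squared L2; "hnsw:space": "cosine" switches the
# collection to cosine distance, so that 1 - distance is a cosine similarity
collection_name = "rwanda_documents"
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    metadata={
        "description": "Collection of documents about Rwanda",
        "hnsw:space": "cosine"
    }
)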

print(f"✅ Created collection: {collection_name}")

## Adding Documents to the Vector Database

Let's convert our documents to embeddings and store them in ChromaDB.

In [None]:
def add_documents_to_collection(documents, collection, embedding_model):
    """
    Convert documents to embeddings and add them to the collection.
    """
    print("Converting documents to embeddings...\n")
    
    for i, doc in enumerate(documents, 1):
        # Create embedding
        embedding = embedding_model.encode(doc["text"]).tolist()
        
        # Generate unique ID
        doc_id = f"doc_{i}"
        
        # Add to collection
        collection.add(
            embeddings=[embedding],
            documents=[doc["text"]],
            metadatas=[doc["metadata"]],
            ids=[doc_id]
        )
        
        print(f"✓ Added document {i}/{len(documents)}: {doc['text'][:60]}...")
    
    print(f"\n✅ Successfully added {len(documents)} documents to the vector database!")

# Add all documents to the collection
add_documents_to_collection(documents, collection, embedding_model)

# Verify the count
print(f"\nTotal documents in collection: {collection.count()}")

## Semantic Search: Querying the Vector Database

Now comes the exciting part - searching for relevant documents based on the meaning of our query!

In [None]:
def semantic_search(query, collection, embedding_model, n_results=3):
    """
    Perform semantic search on the vector database.
    
    Args:
        query: Text query to search for
        collection: ChromaDB collection
        embedding_model: Model to create query embedding
        n_results: Number of results to return
    """
    # Convert query to embedding
    query_embedding = embedding_model.encode(query).tolist()
    
    # Search the collection
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    
    # Display results
    print(f"\n🔍 Query: '{query}'\n")
    print("="*80)
    print(f"\nTop {n_results} most relevant documents:\n")
    
    for i, (doc, metadata, distance) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ), 1):
        similarity = 1 - distance  # Cosine similarity, provided the collection uses cosine distance
        print(f"Result {i}:")
        print(f"  Similarity: {similarity:.4f} ({similarity*100:.2f}%)")
        print(f"  Category: {metadata['category']}")
        print(f"  Text: {doc}")
        print()
    
    return results

# Example queries
queries = [
    "Tell me about artificial intelligence in Rwanda",
    "What is the capital city?",
    "Information about Kinyarwanda language",
]

for query in queries:
    semantic_search(query, collection, embedding_model)
    print("-"*80)

## Understanding the Results

**Key Observations:**

1. **Semantic Matching**: The search finds relevant documents even when they don't contain the exact query words
2. **Similarity Scores**: Higher scores (closer to 1.0) indicate more relevant documents
3. **Context Awareness**: The system understands that "artificial intelligence" relates to "machine learning" and "technology"

**Compare this to keyword search:**
- Keyword search: Looks for exact word matches
- Semantic search: Understands meaning and context
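
A note on the similarity scores: vector databases return distances, and how a distance converts to a similarity depends on the metric in use. The sketch below assumes unit-length vectors and checks a standard identity: squared Euclidean distance equals 2 - 2 × cosine similarity. So `1 - distance` is a cosine similarity only when the distance itself is a cosine distance.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_similarity_unit(u, v):
    # For unit-length vectors the dot product is the cosine similarity
    return sum(a * b for a, b in zip(u, v))

def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

u = normalize([3.0, 4.0])
v = normalize([4.0, 3.0])

cos = cosine_similarity_unit(u, v)
d2 = squared_l2(u, v)

print(f"cosine similarity:      {cos:.4f}")
print(f"cosine distance:        {1 - cos:.4f}")
print(f"squared L2 distance:    {d2:.4f}")
print(f"2 - 2 * cosine:         {2 - 2 * cos:.4f}")
```

Both metrics rank unit-length vectors in the same order, but the numbers differ, so it is worth checking which metric a collection uses before reading scores as percentages.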

## Visualizing Query Results

Let's visualize how queries relate to documents in the embedding space.

In [None]:
def visualize_query_results(query, documents, embedding_model, n_results=5):
    """
    Visualize query and document embeddings in 2D space.
    """
    # Get embeddings for all documents and query
    doc_texts = [doc['text'] for doc in documents]
    doc_embeddings = embedding_model.encode(doc_texts)
    query_embedding = embedding_model.encode(query)
    
    # Combine all embeddings
    all_embeddings = np.vstack([doc_embeddings, query_embedding.reshape(1, -1)])
    
    # Reduce to 2D using PCA
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(all_embeddings)
    
    # Split back into docs and query
    doc_embeddings_2d = embeddings_2d[:-1]
    query_embedding_2d = embeddings_2d[-1]
    
    # Calculate similarities
    similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:n_results]
    
    # Plot
    plt.figure(figsize=(14, 10))
    
    # Plot documents, colored by similarity to the query
    scatter = plt.scatter(doc_embeddings_2d[:, 0], doc_embeddings_2d[:, 1], 
                         c=similarities, cmap='viridis', s=100, alpha=0.6, 
                         edgecolors='black', linewidth=1)
    
    # Plot query
    plt.scatter(query_embedding_2d[0], query_embedding_2d[1], 
               c='red', s=300, marker='*', edgecolors='black', 
               linewidth=2, label='Query', zorder=5)
    
    # Add labels for top results
    for idx in top_indices:
        plt.annotate(f"Doc {idx+1}\n({similarities[idx]:.3f})",
                    (doc_embeddings_2d[idx, 0], doc_embeddings_2d[idx, 1]),
                    xytext=(10, 10), textcoords='offset points',
                    bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.7),
                    fontsize=9, ha='left')
    
    # Add query label
    plt.annotate('Query',
                (query_embedding_2d[0], query_embedding_2d[1]),
                xytext=(10, 10), textcoords='offset points',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='red', alpha=0.7),
                fontsize=10, fontweight='bold', ha='left')
    
    plt.colorbar(scatter, label='Similarity to Query')
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.title(f'Document Embeddings vs Query: "{query}"\n(Brighter colors = Higher similarity)')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

# Visualize a query
visualize_query_results(
    "What technology initiatives exist in Rwanda?", 
    documents, 
    embedding_model,
    n_results=3
)

## Filtering with Metadata

Vector databases allow you to combine semantic search with metadata filtering.
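
Conceptually, a metadata filter decides which records are eligible, and similarity then ranks whatever is left. The sketch below shows that idea in plain Python. It is not how ChromaDB implements filtering internally, and `filtered_search`, `records`, and the toy two-dimensional embeddings are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_search(query_vec, records, where, n_results=3):
    """Keep records whose metadata matches every key in `where`, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)
    return ranked[:n_results]

# Toy records with invented embeddings
records = [
    {"id": "a", "embedding": [0.9, 0.1], "metadata": {"category": "technology"}},
    {"id": "b", "embedding": [0.7, 0.7], "metadata": {"category": "technology"}},
    {"id": "c", "embedding": [1.0, 0.0], "metadata": {"category": "geography"}},
]

hits = filtered_search([1.0, 0.0], records, {"category": "technology"})
print([r["id"] for r in hits])
```

Record `c` is the closest match to the query vector, but it never appears in the results because the filter removes it before ranking.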

In [None]:
def search_with_filter(query, collection, embedding_model, filter_dict, n_results=3):
    """
    Perform semantic search with metadata filtering.
    """
    query_embedding = embedding_model.encode(query).tolist()
    
    # Search with filter
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        where=filter_dict
    )
    
    print(f"\n🔍 Query: '{query}'")
    print(f"📋 Filter: {filter_dict}\n")
    print("="*80)
    
    if not results['documents'][0]:
        print("No documents found matching the filter.")
        return
    
    print(f"\nTop {len(results['documents'][0])} results:\n")
    
    for i, (doc, metadata, distance) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ), 1):
        # Valid only if the collection uses cosine distance; other metrics need a different conversion
        similarity = 1 - distance
        print(f"Result {i}:")
        print(f"  Similarity: {similarity:.4f}")
        print(f"  Category: {metadata['category']}")
        print(f"  Topic: {metadata['topic']}")
        print(f"  Text: {doc}")
        print()

# Example: Search only in technology category
search_with_filter(
    "machine learning and AI",
    collection,
    embedding_model,
    filter_dict={"category": "technology"},
    n_results=3
)

print("-"*80)

# Example: Search only in geography category
search_with_filter(
    "beautiful landscapes",
    collection,
    embedding_model,
    filter_dict={"category": "geography"},
    n_results=2
)

## Practical Application: Building a Simple RAG System

Let's combine our vector database with an LLM to build a basic Retrieval Augmented Generation (RAG) system.
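
Before bringing in real models, the overall flow can be sketched with stand-ins: a toy retriever based on word overlap and a fake "LLM" that is just a Python function. Everything here (`retrieve`, `build_prompt`, `rag_answer`, `fake_llm`, and the sample documents) is hypothetical and exists only to show how the pieces connect.

```python
def retrieve(question, docs, n_context=2):
    """Toy retriever: rank documents by how many question words they share."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:n_context]

def build_prompt(question, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

def rag_answer(question, docs, generate):
    """Retrieve, build the prompt, then generate. `generate` is any str -> str callable."""
    context_docs = retrieve(question, docs)
    return generate(build_prompt(question, context_docs)), context_docs

def fake_llm(prompt):
    """Stand-in for a language model: reports how much text it received."""
    return f"(model saw {len(prompt)} characters)"

docs = [
    "Kigali is the capital city of Rwanda.",
    "Mountain gorillas live in Volcanoes National Park.",
    "Rwanda has invested in broadband and drone delivery.",
]

answer, used = rag_answer("What is the capital of Rwanda?", docs, fake_llm)
print(used)
print(answer)
```

Swapping `retrieve` for a vector database query and `fake_llm` for a real model gives the same structure as a full RAG system; the generator only ever sees what the retriever hands it.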

In [None]:
def simple_rag_query(question, collection, embedding_model, llm_model, llm_tokenizer, n_context=2):
    """
    Simple RAG: Retrieve relevant context and generate answer.
    
    Steps:
    1. Embed the question and retrieve relevant documents
    2. Join the documents into a context string
    3. Build the prompt from context and question
    4. Generate the answer
    """
    print(f"\n{'='*80}")
    print(f"❓ Question: {question}")
    print(f"{'='*80}\n")
    
    # Step 1: Retrieve relevant documents
    query_embedding = embedding_model.encode(question).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_context
    )
    
    print(f"📚 Retrieved {len(results['documents'][0])} relevant documents:\n")
    context_docs = results['documents'][0]
    for i, doc in enumerate(context_docs, 1):
        print(f"  {i}. {doc[:80]}...")
    
    # Step 2: Build context
    context = "\n\n".join(context_docs)
    
    # Step 3: Create prompt
    prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {question}

Answer:"""
    
    print(f"\n🤖 Generating answer...\n")
    
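    # Step 4: Generate answer
    # Note: GPT-2 accepts at most 1,024 tokens, so prompt plus answer must fit in that window
    input_ids = llm_tokenizer.encode(prompt, return_tensors='pt')
    output = llm_model.generate(
        input_ids,
        max_new_tokens=100,  # cap on generated tokens, independent of prompt length
        temperature=0.7,
        do_sample=True,
        pad_token_id=llm_tokenizer.eos_token_id
    )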
    
    # Extract only the generated answer (not the prompt)
    answer = llm_tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
    
    print(f"💡 Answer: {answer.strip()}")
    print(f"\n{'='*80}\n")
    
    return answer, context_docs

# Test RAG system
questions = [
    "What technological developments are happening in Rwanda?",
    "Tell me about education in Rwanda",
]

for question in questions:
    simple_rag_query(question, collection, embedding_model, model, tokenizer)

## Comparing Search Methods

Let's compare semantic search with traditional keyword search to see the difference.

In [None]:
import re

def keyword_search(query, documents, n_results=3):
    """
    Simple keyword-based search for comparison.
    """
    # Split on word characters so trailing punctuation ("Rwanda.") does not block a match
    query_words = set(re.findall(r"\w+", query.lower()))
    scores = []
    
    for doc in documents:
        doc_words = set(re.findall(r"\w+", doc['text'].lower()))
        # Count distinct query words that appear in the document
        matches = len(query_words.intersection(doc_words))
        scores.append(matches)
    
    # Get top results
    top_indices = np.argsort(scores)[::-1][:n_results]
    
    print(f"\n🔍 Keyword Search: '{query}'\n")
    print("Results:\n")
    
    for i, idx in enumerate(top_indices, 1):
        print(f"Result {i}:")
        print(f"  Matching words: {scores[idx]}")
        print(f"  Text: {documents[idx]['text']}")
        print()

def compare_search_methods(query, documents, collection, embedding_model):
    """
    Compare semantic search vs keyword search.
    """
    print("\n" + "="*80)
    print(f"Comparing Search Methods for: '{query}'")
    print("="*80)
    
    # Keyword search
    print("\n" + "-"*80)
    print("METHOD 1: KEYWORD SEARCH (Traditional)")
    print("-"*80)
    keyword_search(query, documents, n_results=3)
    
    # Semantic search
    print("\n" + "-"*80)
    print("METHOD 2: SEMANTIC SEARCH (Vector Database)")
    print("-"*80)
    semantic_search(query, collection, embedding_model, n_results=3)

# Test with queries that demonstrate the difference
test_queries = [
    "AI and machine learning innovations",  # Uses different words than documents
    "cleanest urban areas in Africa",  # Synonymous concept
]

for query in test_queries:
    compare_search_methods(query, documents, collection, embedding_model)
    print("\n" + "="*80 + "\n")

## 🎯 Exercise 4: Build Your Own Vector Database

### Part A: Create Your Own Document Collection

**Task:** Create a collection of documents about a topic of your choice:
1. Pick a topic (e.g., Rwandan history, technology startups, agriculture, education)
2. Create 8-10 documents about this topic
3. Add relevant metadata to each document
4. Store them in a vector database

**Bonus:** Include some documents in Kinyarwanda if the embedding model supports it!

In [None]:
# Your turn! Create your own document collection
my_documents = [
    {
        "text": "Your document text here",
        "metadata": {"category": "your_category", "topic": "your_topic"}
    },
    # Add more documents...
]

# Create a new collection
# my_collection = chroma_client.create_collection(name="my_collection")
# add_documents_to_collection(my_documents, my_collection, embedding_model)

# Test with queries
# semantic_search("your query", my_collection, embedding_model)

### Part B: Experiment with Different Queries

**Tasks:**
1. Try synonymous queries (e.g., "AI" vs "artificial intelligence")
2. Try queries in different languages
3. Experiment with metadata filtering
4. Compare semantic vs keyword search results

**Questions to consider:**
- How does semantic search handle synonyms?
- What happens with very short vs very long queries?
- How does the number of documents affect search quality?
- How could you improve retrieval accuracy?
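
To compare retrieval setups you need a number to compare. One simple option is the hit rate at k: the fraction of test queries for which at least one relevant document appears in the top k results. The sketch below assumes you have hand-labelled which documents are relevant for each query; the query strings, document ids, and the function name `hit_rate_at_k` are all hypothetical.

```python
def hit_rate_at_k(results, relevant, k=3):
    """Fraction of queries whose top-k results contain at least one relevant document id."""
    hits = 0
    for query, ranked_ids in results.items():
        if set(ranked_ids[:k]) & relevant[query]:
            hits += 1
    return hits / len(results)

# Hypothetical retrieval output: query -> document ids, best match first
results = {
    "capital city": ["d1", "d4", "d2"],
    "gorilla tourism": ["d3", "d5", "d1"],
    "mobile payments": ["d2", "d6", "d4"],
}
# Hand-labelled ground truth: query -> ids a human judged relevant
relevant = {
    "capital city": {"d1"},
    "gorilla tourism": {"d5"},
    "mobile payments": {"d7"},
}

print(f"hit rate @1: {hit_rate_at_k(results, relevant, k=1):.2f}")
print(f"hit rate @3: {hit_rate_at_k(results, relevant, k=3):.2f}")
```

Running the same labelled queries before and after a change (a different embedding model, different chunking, added metadata) shows whether the change actually helped.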

In [None]:
# Experiment space
# Try different queries and observe the results

my_query = "Your experimental query here"
# semantic_search(my_query, collection, embedding_model)

## Key Concepts Summary: Vector Databases

### What We Learned

1. **Vector Databases** store embeddings for efficient similarity search
2. **Semantic Search** finds documents by meaning, not just keywords
3. **RAG Systems** combine retrieval with generation for better answers
4. **Metadata Filtering** combines semantic search with exact-match conditions on fields such as category

### Real-World Applications

1. **Question Answering**: Find relevant information to answer user queries
2. **Document Search**: Search large document collections semantically
3. **Recommendation Systems**: Find similar items/content
4. **Chatbots**: Provide contextual responses using relevant documents
5. **Knowledge Management**: Organize and retrieve organizational knowledge

### Popular Vector Databases

- **ChromaDB**: Lightweight, great for prototyping (used here)
- **Pinecone**: Managed cloud service
- **Weaviate**: Open-source with GraphQL
- **Milvus**: High-performance, scalable
- **FAISS**: Similarity search library from Meta (formerly Facebook); an index library rather than a full database
- **Qdrant**: Written in Rust, high performance

### Best Practices

1. **Choose the right embedding model**: Balance between quality and speed
2. **Chunk documents appropriately**: Large enough to hold a complete idea, small enough that each embedding stays on one topic
3. **Add good metadata**: Enables filtering and better organization
4. **Test with real queries**: Evaluate retrieval quality
5. **Monitor performance**: Track search latency and accuracy
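
As an illustration of the chunking advice, here is a minimal sketch of a word-based chunker with overlap, so that a sentence cut at a chunk boundary still appears whole in a neighbouring chunk. The function name `chunk_words` and the default sizes are assumptions for this example; a real pipeline would more likely count tokens than words and tune the sizes to its embedding model.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into overlapping chunks of at most `chunk_size` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# A synthetic 120-word "document": w0 w1 ... w119
text = " ".join(f"w{i}" for i in range(120))
chunks = chunk_words(text, chunk_size=50, overlap=10)

for i, chunk in enumerate(chunks):
    words = chunk.split()
    print(f"chunk {i}: {len(words)} words, {words[0]} .. {words[-1]}")
```

Each chunk would then be embedded and stored as its own record, usually with metadata pointing back to the source document.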

---
# Summary and Key Takeaways

## What We Learned

### 1. Next Token Prediction
- LLMs predict the next token based on probability distributions
- Temperature controls randomness in generation
- The model assigns a probability to every token in its vocabulary (50,257 tokens for GPT-2)
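
As a quick refresher on the temperature point, the sketch below applies a softmax with temperature to three made-up logits. The candidate tokens and their scores are hypothetical; a real model produces one logit per vocabulary entry.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens after "The cat sat on the"
tokens = ["mat", "chair", "moon"]
logits = [2.0, 1.0, 0.1]

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    summary = ", ".join(f"{tok}={p:.2f}" for tok, p in zip(tokens, probs))
    print(f"T={temperature}: {summary}")
```

At low temperature almost all probability goes to the top candidate; at high temperature the distribution flattens and unlikely tokens get sampled more often.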

### 2. Tokenization
- Text is broken into tokens (words, subwords, or characters)
- Different tokenizers handle languages differently
- Languages with less training data are often tokenized less efficiently
- Kinyarwanda and other low-resource languages may require more tokens
- This affects both cost (API pricing) and model performance
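
The cost and context effects are simple arithmetic. The sketch below uses entirely hypothetical numbers (the price and both tokens-per-word ratios are invented) to show how a higher token count per word raises cost and shrinks how much text fits in a 1,024-token window such as GPT-2's.

```python
def estimated_cost(n_tokens, price_per_1k_tokens):
    """Cost of processing n_tokens at a flat price per 1,000 tokens."""
    return n_tokens / 1000 * price_per_1k_tokens

def words_that_fit(context_tokens, tokens_per_word):
    """Roughly how many words fit in a context window of the given size."""
    return int(context_tokens / tokens_per_word)

# Every number below is hypothetical and only illustrates the arithmetic
PRICE_PER_1K = 0.002   # made-up price per 1,000 tokens
N_WORDS = 10_000       # size of the text to process
ratios = {
    "English": 1.3,        # made-up tokens per word
    "Kinyarwanda": 2.9,    # made-up tokens per word
}

for language, tokens_per_word in ratios.items():
    n_tokens = N_WORDS * tokens_per_word
    cost = estimated_cost(n_tokens, PRICE_PER_1K)
    fit = words_that_fit(1024, tokens_per_word)
    print(f"{language}: {n_tokens:.0f} tokens, cost {cost:.4f}, about {fit} words per 1,024-token window")
```

To get real ratios, count tokens with the tokenizer you actually plan to use on parallel text in both languages.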

### 3. Vector Embeddings
- Words are represented as vectors in high-dimensional space
- Similar meanings have similar vectors
- We can measure similarity using cosine similarity
- Embeddings capture semantic relationships
- Vector arithmetic can reveal word analogies
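
The analogy idea can be shown with hand-made vectors. In the sketch below, the two axes loosely stand for "royalty" and "gender"; the vectors are invented so the arithmetic works out cleanly, and real embeddings are far noisier. The function name `analogy` is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (hand-made, not learned)
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman", vectors))  # prints: queen
```

Excluding the input words from the candidates matters in practice, because the nearest vector to the result is often one of the inputs themselves.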

### 4. Vector Databases
- Store and efficiently search through embeddings
- Enable semantic search based on meaning, not just keywords
- Essential for RAG (Retrieval Augmented Generation) systems
- Support metadata filtering to restrict semantic search to matching records
- Power many real-world AI applications

## Important Implications

### For Low-Resource Languages (like Kinyarwanda):
1. **Tokenization Challenges**: More tokens needed → higher costs, and more of the context window used for the same text
2. **Representation**: Fewer examples in training data → potentially less accurate output
3. **Solutions**:
   - Train language-specific tokenizers
   - Use multilingual models
   - Fine-tune on local language data
   - Develop community datasets

### For Model Development:
1. Tokenization strategy affects model performance
2. Embeddings quality depends on training data
3. Context length limitations impact what the model can process

## Next Steps

1. **Explore More Models**: Try different open-source models (Llama, Mistral, etc.)
2. **Build Custom Tokenizers**: Create tokenizers optimized for Kinyarwanda
3. **Fine-tuning**: Adapt models for specific tasks or languages
4. **Contribute**: Help build datasets for low-resource languages

## Additional Resources

- OpenAI Tokenizer: https://platform.openai.com/tokenizer
- Hugging Face Transformers: https://huggingface.co/transformers/
- Papers:
  - "Attention Is All You Need" (Transformer architecture)
  - "Language Models are Few-Shot Learners" (GPT-3)
  - "Neural Machine Translation of Rare Words with Subword Units" (BPE)

---

## 🎓 Final Exercise: Reflection Questions

1. How might tokenization inefficiency affect the cost of using LLMs for Kinyarwanda applications?
2. What are some strategies to improve LLM performance for low-resource languages?
3. How do embeddings capture meaning, and what are their limitations?
4. Why is understanding these fundamentals important for building AI applications?
5. How does semantic search in vector databases differ from traditional keyword search?
6. What are the advantages of using RAG systems over standalone LLMs?
7. How could vector databases be used to build applications for Rwanda?

**Discussion**: Share your insights with your peers and instructor!