<a href="https://colab.research.google.com/github/feamcor/bytebyteai-cohort-4/blob/main/project_1_llm_playground.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1: Build an LLM Playground

Welcome to your first project! In this project, you'll build a simple large language model (LLM) playground, an interactive environment where you can experiment with LLMs and understand how they work under the hood.

The goal here is to understand the foundations and mechanics behind LLMs rather than relying on higher-level abstractions or frameworks. You'll see what happens under the hood, how an LLM receives text, processes it, and generates a response. In later projects, you'll use frameworks like `Ollama` and `LangChain` that simplify many of these steps. But before that, this project will help you build a solid mental model of how LLMs actually work.

---
## Environment Setup
We'll use Google Colab, a free browser-based platform that lets you run Python code and machine learning models without installing anything locally. Go to [Google Colab](https://colab.research.google.com/) and upload this notebook to get started.

If you prefer to run the project locally, you need a reproducible setup. Open a terminal in the same directory as this notebook and run the environment setup commands below to install dependencies and create an isolated environment.

### Step 1: Use Conda or uv to install the project dependencies.

#### Option 1: Conda

[Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/index.html) is an open-source package and environment manager that lets you create isolated environments and install Python and non-Python dependencies together.

```bash
# Create and activate the conda environment
conda env create -f environment.yaml && conda activate llm_playground
```

#### Option 2: uv (faster)

[uv](https://docs.astral.sh/uv/) is a fast Python package installer and virtual environment tool written in Rust that aims to replace pip, pip-tools, and virtualenv with a single, high-performance workflow.

```bash
# Install uv (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create venv and install dependencies
uv venv .venv && source .venv/bin/activate
uv pip install -r requirements.txt
```

### Step 2: Register this environment as a Jupyter kernel
This step is optional. Do it only if your environment doesn't appear in Jupyter's kernel list.
```bash
python -m ipykernel install --user --name=llm_playground --display-name "llm_playground"
```

Now switch the kernel to `llm_playground` (Kernel → Change Kernel).

---
## Learning Objectives  
- Understand tokenization and how raw text is converted into a sequence of discrete tokens
- Load and inspect a pretrained LLM (GPT-2) using Hugging Face
- Understand logits, probabilities, and how models predict the next token
- Count and explore model parameters to understand model scale
- Explore decoding strategies: greedy decoding and top-p (nucleus) sampling
- See the leap from GPT-2 (simple text completion) to a modern model that understands questions and thinks before answering
- Look ahead: inference engines for serving models in practice


Let's get started!

---

In [None]:
import string, torch, transformers, tiktoken
print("torch:\t\t", torch.__version__)
print("transformers:\t", transformers.__version__)
print("tiktoken:\t", tiktoken.__version__)
print("Environment check complete ✅")

# 1: Tokenization

A neural network cannot process raw text directly. It needs numbers.
Tokenization is the process of converting text into numerical IDs that models can understand. In this section, you will learn how tokenization works in practice and why it is an essential step in every language model pipeline.

Tokenization methods generally fall into three main categories:
1. Word-level
2. Character-level
3. Subword-level

### 1.1 - Word-level tokenization
This method splits text by whitespace and treats each word as a single token. In the next cell, you will implement a basic word-level tokenizer by building a vocabulary that maps words to IDs and writing `encode` and `decode` functions.

In [None]:
CORPUS = [
    "Please do not deploy on friday",
    "It works on my machine",
    "Extra spicy noodles are a bad idea!",
    "Tokens are tiny pieces of text",
]
MODEL_GPT2 = "gpt2"
MODEL_QWEN = "Qwen/Qwen3-0.6B"
UNKNOWN = "[UNK]"
SAMPLE_SENTENCE = "Friday I deploy spicy noodles on tiny pieces of my bad machine ✅"
SAMPLE_PROMPT = "I like spicy"

In [None]:
vocabulary = []
word_to_id = {}
id_to_word = {}

In [None]:
def tokenize_words():
    for sentence in CORPUS:
        for word in sentence.lower().split():
            if word not in word_to_id:
                _id = len(word_to_id)
                word_to_id[word] = _id
                id_to_word[_id] = word
                vocabulary.append(word)

In [None]:
def encode_words(text):
    ids = []
    for word in text.lower().split():
        if word in word_to_id:
            ids.append(word_to_id[word])
        else:
            ids.append(-1)
    return ids

In [None]:
def decode_words(ids):
    words = []
    for _id in ids:
        if _id in id_to_word:
            words.append(id_to_word[_id])
        else:
            words.append(UNKNOWN)
    return words

In [None]:
tokenize_words()

In [None]:
print(vocabulary)
print(word_to_id)
print(id_to_word)

In [None]:
test_words = encode_words(SAMPLE_SENTENCE)
print(test_words)
print(decode_words(test_words))

While word-level tokenization is simple and easy to understand, it has two key limitations that make it impractical for large-scale models:
1. Large vocabulary size: every new word or variation (for example, run, runs, running) increases the total vocabulary, leading to higher memory and training costs.
2. Out-of-vocabulary (OOV) problem: the model cannot handle unseen or rare words that were not part of the training vocabulary, so they must be replaced with a generic [UNK] token.

The next section introduces character-level tokenization, where text is represented as individual characters instead of words.

### 1.2 - Character-level tokenization

In this approach, every single character (including spaces, punctuation, and even emojis) is assigned its own ID.

In the next cell, we will rebuild a tokenizer using the same corpus as before, but this time with a character-level approach.
For simplicity, the vocabulary is built only from the lowercased characters that appear in the corpus (letters, spaces, and punctuation).

In [None]:
vocabulary = []
char_to_id = {}
id_to_char = {}

In [None]:
def tokenize_chars():
    for sentence in CORPUS:
        for char in sentence.lower():
            if char not in char_to_id:
                _id = len(char_to_id)
                char_to_id[char] = _id
                id_to_char[_id] = char
                vocabulary.append(char)

In [None]:
def encode_chars(text):
    ids = []
    for char in text.lower():
        if char in char_to_id:
            ids.append(char_to_id[char])
        else:
            ids.append(-1)
    return ids

In [None]:
def decode_chars(ids):
    chars = []
    for _id in ids:
        if _id in id_to_char:
            chars.append(id_to_char[_id])
        else:
            chars.append(UNKNOWN)
    return chars

In [None]:
tokenize_chars()

In [None]:
print(vocabulary)
print(char_to_id)
print(id_to_char)

In [None]:
test_chars = encode_chars(SAMPLE_SENTENCE)
print(test_chars)
print(decode_chars(test_chars))

Character-level tokenization solves the out-of-vocabulary problem but introduces new challenges:

1. Longer sequences: because each word becomes many tokens, models need to process much longer inputs.
2. Weaker semantic representation: individual characters carry very little meaning, so models must learn relationships across many steps.
3. Higher computational cost: longer sequences lead to more tokens per input, which increases training and inference time.

To find a better balance between vocabulary size and sequence length, we move to subword-level tokenization next.
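
To make the sequence-length trade-off concrete, here is a small sketch that counts tokens for one corpus sentence under both schemes. It uses plain Python only, and the variable names are illustrative.

```python
sentence = "Tokens are tiny pieces of text"

# Word-level: split on whitespace; character-level: one token per character
word_tokens = sentence.lower().split()
char_tokens = list(sentence.lower())

print(f"Word-level tokens:      {len(word_tokens)}")
print(f"Character-level tokens: {len(char_tokens)}")
print(f"Length ratio:           {len(char_tokens) / len(word_tokens):.1f}x")
```

The same sentence becomes several times longer at the character level, which is the cost that subword methods try to reduce.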

### 1.3 - Subword-level tokenization

Subword methods such as `Byte-Pair Encoding (BPE)`, `WordPiece`, and `SentencePiece` **learn** common groups of characters and merge them into tokens. For example, the word **unbelievable** might turn into three tokens: **["un", "believ", "able"]**. This approach strikes a balance between word-level and character-level methods and fixes their limitations.

The BPE algorithm builds a vocabulary iteratively using the following process:
1. Start with individual characters or bytes (each character is a token).
2. Count all adjacent pairs of tokens in a large text corpus.
3. Merge the most frequent pair into a single new token.

Repeat steps 2 and 3 until you reach the desired vocabulary size (for example, 50,000 tokens).
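
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up toy text, not the exact procedure used to train GPT-2's tokenizer (which works on bytes and applies extra pre-splitting rules). The helper names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Step 2: count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(words, pair):
    """Step 3: replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(text, num_merges):
    # Step 1: start with every word split into individual characters
    words = dict(Counter(tuple(word) for word in text.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

toy_text = "low low low lower lower newest newest newest widest"
merges, words = train_bpe(toy_text, num_merges=4)
print("Learned merges:", merges)
print("Segmented words:", list(words))
```

On this toy text the learned merges build up the subwords `low` and `est`, which then appear as single tokens inside `lower`, `newest`, and `widest`.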

In the next cell, you will experiment with BPE in practice to see how it compresses text into meaningful subword units. Instead of implementing the algorithm from scratch, you will use a pretrained tokenizer, which was already trained on a large text corpus to build its vocabulary, such as the data used to train `GPT-2`. This allows you to see how BPE works in practice with a real, learned vocabulary.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_GPT2)

In [None]:
input_ids = tokenizer.encode(SAMPLE_SENTENCE)
print("Input IDs:\t", input_ids)

tokens = tokenizer.convert_ids_to_tokens(input_ids)
print("BPE Tokens:\t", tokens)

decoded_text = tokenizer.decode(input_ids)
print("Decoded:\t", decoded_text)

### 1.4 - TikToken

`tiktoken` is a fast, production-ready library for tokenization used by OpenAI models.
It is designed for efficiency and consistency with how OpenAI counts tokens in GPT models.

In this section, you will explore how different model families use different tokenizers. We will compare the tokenizer used by `GPT-2` with the one used by more recent models such as `GPT-4`. By trying both, you will see how tokenization has evolved to handle more diverse text (including emojis and special characters) while remaining efficient.

In the next cell, you will use tiktoken to load these encodings and inspect how each one splits the same text. You may find reading this doc helpful: https://github.com/openai/tiktoken

In [None]:
import tiktoken

gpt2_enc = tiktoken.get_encoding(MODEL_GPT2)
gpt4_enc = tiktoken.get_encoding("cl100k_base")

gpt2_tokens = gpt2_enc.encode(SAMPLE_SENTENCE)
gpt4_tokens = gpt4_enc.encode(SAMPLE_SENTENCE)

print(f"GPT-2 ({len(gpt2_tokens)} tokens):\t{gpt2_tokens}")
print(f"GPT-4 ({len(gpt4_tokens)} tokens):\t{gpt4_tokens}")

print(f"GPT-2 Special Tokens:\t{gpt2_enc.special_tokens_set}")
print(f"GPT-4 Special Tokens:\t{gpt4_enc.special_tokens_set}")

Try changing the input sentence and observe how different tokenizers behave.
Experiment with:
- Emojis, special characters, or punctuation
- Code snippets or structured text
- Non-English text (for example, Japanese, French, or Arabic)

If you are curious, you can also attempt to implement the BPE algorithm yourself using a small text corpus to see how token merges are learned in practice.

### 1.5 - Key Takeaways
- **Word-level**: simple and intuitive, but limited by large vocabularies and out-of-vocabulary issues
- **Character-level**: flexible and covers all text, but produces long sequences that are harder to model
- **Subword / BPE**: balances both worlds and is the default choice for most modern LLMs
- **TikToken**: a production-ready tokenizer used in OpenAI models, demonstrating how optimized subword vocabularies are applied in real systems

# 2: What is a Language Model?

At its core, a **language model** is a function that predicts the next token. Given a sequence of tokens `[t₁, t₂, …, tₙ]`, it outputs probabilities for the next token `tₙ₊₁`.

Models like GPT-2 use many stacked Transformer layers. Each layer mixes information between tokens (attention) and transforms it (feed-forward). Together, these layers learn patterns in text. At the end, the model produces logits: one score per token in the vocabulary. Higher logits mean the token is more likely to be next.

In this section, you'll load GPT-2, look at its architecture, count its parameters, and see how it predicts the next token.

### 2.1: Loading GPT-2

There are different ways to load and run pretrained language models.

In this project, we'll use Hugging Face Transformers, a popular Python library that makes it easy to download models like GPT-2 and run them locally.

There are also dedicated inference engines for serving and running modern LLMs more efficiently, such as **Ollama**, **SGLang**, and **vLLM**. We'll explore those in future projects.

In the next cell, you'll load GPT-2 and inspect its architecture.

In [None]:
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained(MODEL_GPT2)

print("\n==> FULL MODEL ARCHITECTURE <==")
print(model)

print("\n==> FIRST TRANSFORMER BLOCK <==")
print(model.transformer.h[0])

### 2.2: Counting Parameters

You often hear that an LLM has a few million or billion parameters. But what does that actually mean? Every weight and bias value inside every layer is a parameter. These are the numbers that the model learned during training.

Next, you will count the total number of parameters in GPT-2 and break them down by component to see where most of the model's capacity lives.

In [None]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total Parameters in GPT-2:\t{total_params:,}")

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable Parameters:\t\t{trainable_params:,}")
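
To see where those parameters live, the total can also be reconstructed by hand. The sketch below assumes the standard GPT-2 Small configuration (vocabulary 50,257, context length 1,024, hidden size 768, 12 blocks, feed-forward size 4 × 768) and assumes the output head shares its weights with the token embeddings, so it is not counted twice. The variable names are illustrative.

```python
# Assumed GPT-2 Small configuration
vocab_size, n_positions, d_model, n_layers = 50257, 1024, 768, 12
d_ff = 4 * d_model  # feed-forward hidden size

token_embeddings = vocab_size * d_model
position_embeddings = n_positions * d_model
layer_norm = 2 * d_model  # one weight vector and one bias vector
# Attention: fused query/key/value projection plus output projection (weights + biases)
attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
# Feed-forward: expand to d_ff, then project back to d_model (weights + biases)
feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
block = 2 * layer_norm + attention + feed_forward

breakdown = {
    "token embeddings": token_embeddings,
    "position embeddings": position_embeddings,
    f"{n_layers} transformer blocks": n_layers * block,
    "final layer norm": layer_norm,
}
total = sum(breakdown.values())
for name, count in breakdown.items():
    print(f"{name:<24}{count:>14,}  ({count / total:.1%})")
print(f"{'total':<24}{total:>14,}")
```

If the assumptions hold, the total should match the number printed by `model.parameters()` above, with roughly two thirds of the parameters in the Transformer blocks and most of the rest in the token embeddings.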

**Think about scale:** GPT-2 Small has 124M parameters. GPT-4's size has not been officially disclosed, though unofficial estimates put it above 1 trillion. If each parameter is stored as a 16-bit floating-point number (2 bytes), how much memory would you need just to hold GPT-2's weights in RAM? What about a 70B-parameter model?
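
One way to check your answer is a quick back-of-the-envelope calculation. The helper `model_memory_gb` below is hypothetical and counts only the weights, ignoring activations, caches, and framework overhead.

```python
def model_memory_gb(num_params, bytes_per_param=2):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

for name, num_params in [("GPT-2 Small", 124_000_000), ("70B model", 70_000_000_000)]:
    print(f"{name}: about {model_memory_gb(num_params):.2f} GB at 16-bit precision")
```

The weights of GPT-2 Small fit comfortably in laptop memory, while a 70B-parameter model needs far more memory than a single consumer GPU typically offers.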

### 2.3: From Text to Predictions

When you pass a sequence of tokens through a language model, it produces a tensor of logits with shape
`(batch_size, seq_len, vocab_size)`.
Each position in the sequence receives a vector of scores representing how likely every possible token is to appear next. By applying a softmax function on the last dimension, these logits can be converted into probabilities that sum to 1.
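
As a small illustration of the softmax step, here is a plain-Python sketch applied to a made-up four-token vocabulary. The tokens and logit values are invented for the example; a real model produces one logit per entry in its full vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_vocab = ["noodles", "food", "curry", "the"]  # hypothetical vocabulary
toy_logits = [3.0, 2.0, 1.0, -1.0]               # made-up scores
probs = softmax(toy_logits)

for token, prob in zip(toy_vocab, probs):
    print(f"{token}\t{prob:.4f}")
print("Sum:", round(sum(probs), 6))
```

Higher logits map to higher probabilities, and the ordering of the tokens is preserved.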

In the next cell, you will feed a sentence into GPT-2, print the shape of its logits, and display the five most likely next tokens predicted for the final position in the sequence.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_GPT2)

In [None]:
inputs = tokenizer(SAMPLE_PROMPT, return_tensors="pt")
input_ids = inputs["input_ids"]

In [None]:
with torch.no_grad():
    outputs = model(input_ids)
    logits = outputs.logits

print(f"Logits shape: \t{logits.shape}") # (batch_size, seq_len, vocab_size)

In [None]:
import torch.nn.functional as F

next_token_logits = logits[0, -1, :]

probs = F.softmax(next_token_logits, dim=-1)

top_k = 5
top_probs, top_indices = torch.topk(probs, top_k)

print(f"Prompt:\t{SAMPLE_PROMPT}")
print("Top 5 predicted next tokens:")
for i in range(top_k):
    token = tokenizer.decode([top_indices[i]])
    print(f"{token}\t{top_probs[i].item():.4f}")

### 2.4: Key Takeaway

A language model is not a black box or something mysterious.
It is a large composition of simple, understandable layers such as attention and feed-forward networks, trained together to predict the next token in a sequence.

By learning this next-token prediction task at scale, the model gradually develops an internal understanding of language structure, meaning, and context, which allows it to generate coherent and relevant text.

# 3: Text Generation (Decoding)
Once a language model can predict the next-token probabilities, we can use it to generate text. This is called text generation or decoding.

Conceptually, generation is a loop:

1. Feed the current token sequence into the model.

2. The model outputs a probability distribution over the next token.

3. A decoding algorithm picks the next token from that distribution.

4. Append the chosen token to the sequence and repeat.

In practice, libraries like Hugging Face Transformers provide a `generate()` method that handles this loop for you, including stopping conditions, batching, and efficiency tricks.
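
To make the loop concrete, here is a minimal sketch that uses a hypothetical lookup table in place of a real model. `TOY_MODEL`, `next_token_distribution`, and the `<eos>` stop token are assumptions made for illustration. A real LLM conditions on the whole sequence, not just the last token.

```python
# Hypothetical "model": maps the last token to next-token probabilities
TOY_MODEL = {
    "I": {"like": 0.9, "am": 0.1},
    "like": {"spicy": 0.6, "the": 0.4},
    "spicy": {"noodles": 0.7, "food": 0.3},
    "noodles": {"<eos>": 1.0},
}

def next_token_distribution(tokens):
    """Steps 1-2: feed the sequence in, get a distribution over the next token."""
    return TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})

def greedy(distribution):
    """A decoding rule: pick the single most probable token."""
    return max(distribution, key=distribution.get)

def generate(prompt_tokens, pick_next, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        distribution = next_token_distribution(tokens)
        token = pick_next(distribution)  # step 3: the decoding strategy chooses
        if token == "<eos>":             # stopping condition
            break
        tokens.append(token)             # step 4: append and repeat
    return tokens

print(generate(["I", "like", "spicy"], greedy))
```

Swapping `greedy` for a different `pick_next` function changes the decoding strategy without touching the loop itself.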

Different decoding strategies control how the next token is chosen, and how creative or deterministic the output feels:

- **Greedy decoding**: always pick the token with the highest probability. Simple and consistent, but often repetitive.

- **Top-p (nucleus) sampling**: randomly sample from the smallest set of tokens whose cumulative probability exceeds a threshold p. This adds variety while keeping outputs coherent.

In the following sections, you'll use `generate()` to produce text from GPT-2 using both greedy decoding and top-p sampling.

### 3.1: Greedy Decoding
In this section, you will use GPT-2 and Hugging Face's built-in generate method to produce text using the greedy decoding strategy.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(MODEL_GPT2)
model = AutoModelForCausalLM.from_pretrained(MODEL_GPT2)

inputs = tokenizer(SAMPLE_PROMPT, return_tensors="pt")

# Greedy decoding is the default when no sampling is specified
output_tokens = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))

In [None]:
tests = ["Once upon a time", "What is 2+2?", "Suggest a party theme."]

def generate_greedy(model, tokenizer, prompt, max_new_tokens=32):
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)

for prompt in tests:
    print("\n==> GPT-2 | Greedy <==")
    print(generate_greedy(model, tokenizer, prompt, 32))

Naively selecting the single most probable token at each step (known as greedy decoding) often leads to poor results in practice:
- Repetition loops: phrases like "The cat is is is…"
- Short-sighted choices: the most likely token right now might lead to incoherent text later

These issues are why more advanced decoding methods such as top-p (nucleus) sampling are commonly used to make model outputs more diverse and natural.

### 3.2: Top-p (Nucleus) Sampling
Top-p sampling (also called nucleus sampling) introduces controlled randomness into text generation. Instead of always picking the single most likely token, it samples from the smallest set of tokens whose cumulative probability exceeds a threshold `p` (e.g., 0.9).

This allows the model to explore multiple plausible continuations, producing more diverse and natural-sounding text while still staying coherent.
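
The nucleus selection itself can be sketched in plain Python. The probabilities below are made up, and the helper names `top_p_filter` and `sample_top_p` are hypothetical. Library implementations operate on tensors and may differ in details such as how the threshold boundary is handled.

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1 again
    return {token: prob / cumulative for token, prob in nucleus}

def sample_top_p(probs, p=0.9, rng=random):
    """Draw one token at random from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens, weights = zip(*nucleus.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

toy_probs = {"noodles": 0.5, "food": 0.3, "curry": 0.15, "socks": 0.05}
print(top_p_filter(toy_probs, p=0.9))
print(sample_top_p(toy_probs, p=0.9))
```

With `p=0.9`, the unlikely token `socks` falls outside the nucleus and can never be sampled, while the three plausible continuations remain in play.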

In this section, you will implement a generate function that supports both greedy and top-p strategies.

In [None]:
inputs = tokenizer(SAMPLE_PROMPT, return_tensors="pt")

# top_p=0.9 means we sample from the top 90% of the probability mass
output_tokens = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9, temperature=0.7)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))

In [None]:
def generate_top_p(model, tokenizer, prompt, strategy="top_p", max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt")

    if strategy == "top_p":
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, top_p=0.9, temperature=0.8)
    else:
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    return tokenizer.decode(output[0], skip_special_tokens=True)

tests = ["Once upon a time", "What is 2+2?", "Suggest a party theme."]
for prompt in tests:
    print("\n==> GPT-2 | Top-p <==")
    print(generate_top_p(model, tokenizer, prompt, "top_p", 32))

# 4: From Simple GPT-2 to Modern LLMs

So far, we have used **GPT-2**, an early openly released language model (released in 2019, 124M parameters). GPT-2 can only do one thing: **complete text**. Given some input, it predicts which tokens come next. It has no built-in notion of questions, instructions, or conversation. If you type `"What is 2+2?"`, GPT-2 will continue the text as if it were part of a web page or article. It does not recognize that you are asking a question.

Modern language models are different. Models like **Qwen3-0.6B** (2025, 600M parameters) have gone through additional training stages that unlock fundamentally new capabilities:

- **Instruction following**: they interpret your input as a request and produce a helpful response
- **Conversation**: they maintain context across a multi-turn dialogue
- **Reasoning**: they can *think step-by-step* before answering, using internal reasoning tokens (`<think>...</think>`) to work through a problem before giving a final answer

In this section, you will see this dramatic contrast firsthand by giving the same prompts to both models. In week 4, we'll learn about thinking and reasoning models in detail.

### 4.1: Chat Templates

Instruction-tuned models expect input in a structured **chat format**, not raw text. Instead of receiving `"What is 2+2?"` as plain text, the model expects a formatted message like:

```
<|user|>What is 2+2?<|assistant|>
```

Each model family defines its own format. The Hugging Face `tokenizer.apply_chat_template()` method handles this formatting automatically. Without it, even an instruction-tuned model receives unstructured text and falls back to simple completion behavior.
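
To get a feel for what a chat template does, here is a hand-written sketch of a ChatML-style formatter, the general layout used by the Qwen family. The function `format_chatml` is hypothetical, and the exact template for any given model (including how system prompts and thinking are handled) is defined by its tokenizer, so treat this as an approximation and use `apply_chat_template()` in real code.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Illustrative ChatML-style formatting; real templates ship with each tokenizer."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model knows it should answer next
        text += "<|im_start|>assistant\n"
    return text

messages = [{"role": "user", "content": "What is 2+2?"}]
print(format_chatml(messages))
```

The special markers tell the model where each turn starts and ends and whose turn it is, which is the signal an instruction-tuned model was trained to expect.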

### 4.2: GPT-2 vs. Qwen3-0.6B

In the next cells, you will load both models and feed the same prompt to each one:

- **GPT-2**: receives the raw prompt and blindly continues the text
- **Qwen3-0.6B**: receives a properly formatted chat message, *thinks* through the problem, and produces a direct answer

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

gpt2_model = AutoModelForCausalLM.from_pretrained(MODEL_GPT2)
gpt2_tokenizer = AutoTokenizer.from_pretrained(MODEL_GPT2)

qwen_model = AutoModelForCausalLM.from_pretrained(MODEL_QWEN)
qwen_tokenizer = AutoTokenizer.from_pretrained(MODEL_QWEN)

Both models are now loaded. GPT-2 has 124M parameters; Qwen3-0.6B has roughly 600M. If the previous cell took some time, that was mainly due to model download. The models are cached locally, so future runs will be much faster.

Next, we will generate text from both models using the same prompt. For GPT-2, we pass the raw text directly. For Qwen, we first format the prompt as a chat message using `apply_chat_template()`, then generate. Both use **top-p sampling** so the outputs are varied and natural.

In [None]:
prompt = "What is 2+2?"

print("==> GPT-2 Output <==")
inputs_gpt2 = gpt2_tokenizer(prompt, return_tensors="pt")
output_gpt2 = gpt2_model.generate(**inputs_gpt2, max_new_tokens=32, do_sample=True, top_p=0.9)
print(gpt2_tokenizer.decode(output_gpt2[0], skip_special_tokens=True))

print("\n==> Qwen3 Output <==")
messages = [
    {"role": "user", "content": prompt}
]
text = qwen_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs_qwen = qwen_tokenizer([text], return_tensors="pt")
output_qwen = qwen_model.generate(**inputs_qwen, max_new_tokens=64, do_sample=True, top_p=0.9)

generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs_qwen.input_ids, output_qwen)]
print(qwen_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

# 5: (Optional) A Small Interactive LLM Playground
This section is optional. You do not need to implement it to complete the project. It is meant purely for exploration, and skipping it will not affect the core skills this project teaches.

If you are curious, you can build a simple interactive playground to experiment with text generation. You can:
- Create input widgets for the prompt, model selection, decoding strategy, and temperature
- Use Hugging Face's generate method to produce text based on the selected settings
- Display the model's response directly in the notebook output

You may find the following links helpful:
- https://ipywidgets.readthedocs.io/en/latest/
- https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html

In [None]:
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output

prompt_input = widgets.Textarea(value='What is the meaning of life?', placeholder='Type your prompt here...', description='Prompt:', layout={'width': '90%', 'height': '100px'})
model_dropdown = widgets.Dropdown(options=[('GPT-2 (Completion)', 'gpt2'), ('Qwen (Instruction)', 'qwen')], value='qwen', description='Model:')
strategy_dropdown = widgets.Dropdown(options=[('Top-p Sampling', 'top_p'), ('Greedy', 'greedy')], value='top_p', description='Strategy:')
temp_slider = widgets.FloatSlider(value=0.7, min=0.1, max=1.5, step=0.1, description='Temp:')
tokens_slider = widgets.IntSlider(value=64, min=16, max=256, step=16, description='Max Tokens:')
generate_btn = widgets.Button(description='Generate', button_style='primary')
output_area = widgets.Output()

def run_generation(b):
    with output_area:
        clear_output()
        print("Generating...")

        current_model = qwen_model if model_dropdown.value == 'qwen' else gpt2_model
        current_tok = qwen_tokenizer if model_dropdown.value == 'qwen' else gpt2_tokenizer

        text = prompt_input.value
        if model_dropdown.value == 'qwen':
            messages = [{"role": "user", "content": text}]
            text = current_tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

        inputs = current_tok(text, return_tensors="pt")

        do_sample = strategy_dropdown.value == 'top_p'
        # Pass sampling parameters only when sampling; greedy decoding does not use them
        sampling_kwargs = {"top_p": 0.9, "temperature": temp_slider.value} if do_sample else {}

        outputs = current_model.generate(
            **inputs,
            max_new_tokens=tokens_slider.value,
            do_sample=do_sample,
            pad_token_id=current_tok.eos_token_id,
            **sampling_kwargs,
        )

        # Remove input tokens from output
        gen_ids = outputs[0][len(inputs['input_ids'][0]):]
        response = current_tok.decode(gen_ids, skip_special_tokens=True)

        clear_output()
        display(Markdown(f"### Response:\n{response}"))

generate_btn.on_click(run_generation)

ui = widgets.VBox([
    widgets.HTML("<h2>LLM Playground</h2>"),
    prompt_input,
    widgets.HBox([model_dropdown, strategy_dropdown]),
    widgets.HBox([temp_slider, tokens_slider]),
    generate_btn,
    output_area
])
display(ui)

# 6: Inference Engines: Ollama, vLLM, SGLang

So far, we loaded models directly in Python using Hugging Face's `transformers` library. This is great for learning, but in practice models usually run as **servers** that expose an API. Client applications send requests and receive responses over HTTP, while the model itself stays loaded in memory (and on the GPU) between requests.

An **inference engine** handles all the heavy lifting: model loading, GPU memory management, request batching, and serving an HTTP API. Popular inference engines include:

| Engine | Best for |
|--------|----------|
| **Ollama** | Easy local use and experimentation |
| **vLLM** | High-throughput production serving |
| **SGLang** | Fast serving + structured output |

Most inference engines expose an **OpenAI-compatible API**. This means you can learn one client library (the `openai` Python package) and swap backends freely: Ollama for local development, vLLM or SGLang for production.
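
As a rough sketch of what such a request looks like, the snippet below builds (but does not send) a chat completion request using only the standard library. It assumes a local Ollama server on its default port and a model tagged `qwen3:0.6b`. Both are assumptions, so adjust the URL and model name to your own setup. `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumptions: local Ollama server on its default port, model tag "qwen3:0.6b"
BASE_URL = "http://localhost:11434/v1"
payload = {
    "model": "qwen3:0.6b",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "temperature": 0.7,
    "max_tokens": 64,
}

def build_chat_request(base_url, payload):
    """Build an OpenAI-style chat completion request without sending it."""
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_chat_request(BASE_URL, payload)
print(request.get_method(), request.full_url)

# Uncomment once a server is running:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

Because the request shape is shared across engines, pointing the same code at a different backend should mostly be a matter of changing `BASE_URL` and the model name.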

In future weeks, we'll learn about Ollama, set it up, and use it to easily load and build on top of modern powerful LLMs!

## 🎉 Congratulations!

You've just explored the internals of a real **LLM**. In this project you:
* Learned how **tokenization** works, from word-level to BPE, and why it matters
* Used `tiktoken` to compare tokenizers across different model generations
* Loaded GPT-2 and inspected its Transformer blocks and layers
* **Counted parameters** and understood where a model's capacity lives
* Learned how the model produces **logits and probabilities** to predict the next token
* Explored **decoding strategies**: greedy decoding and top-p (nucleus) sampling
* Witnessed the leap from GPT-2 (simple text completion) to Qwen3-0.6B, a modern model that **understands questions and thinks before answering**
* Learned about **inference engines** (Ollama, vLLM, SGLang) and the OpenAI-compatible API pattern

👏 **Great job!** Take a moment to celebrate. You now have a working mental model of how LLMs work, from raw text input all the way to generated output. The skills and intuitions you built here will serve as the foundation for everything that comes next.

