---

## 🛠️ Use Case: **Customer Support Agent Assistant**

### 🎯 Goal:

Build a Bigram Language Model that can **assist customer support agents** by:

* Understanding customer queries
* Suggesting helpful and brand-aligned responses
* Learning from past resolved issues

This mimics a real-world LLM application in **call centers**, **SaaS platforms**, and **chatbot assistants**.

#### **Understanding the Basics**

##### 🔎 What is a Language Model?

A language model (LM) is a model, in this tutorial a small neural network, that learns to predict the next word or character in a sequence from the ones before it. GPT-style models use a transformer-based architecture for this.


#### **Load from Hugging Face as Pandas**

In [None]:
import pandas as pd

df = pd.read_csv("hf://datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset/Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11.csv")

In [None]:
pairs = []

for _, row in df.iterrows():
  prompt = row['instruction'].strip()
  reply = row['response'].strip()

  if prompt and reply:
    text_pair = f"### Instruction:\n{prompt}\n\n### Response:\n{reply}\n\n"
    pairs.append(text_pair)


full_text = "".join(pairs)


output_path = 'customer_support_data.txt'
with open(output_path, "w", encoding="utf-8") as f:
  f.write(full_text)

'customer_support_data.txt'

1. **Initialize an empty list**

   ```python
   pairs = []
   ```

   *We’ll collect each instruction/response block here.*

2. **Loop over every row** in your DataFrame

   ```python
   for _, row in df.iterrows():
   ```

   * `_` is the row index (unused).
   * `row` holds one conversation example at a time.

3. **Extract & clean the text**

   ```python
   prompt = row['instruction'].strip()
   reply  = row['response'].strip()
   ```

   * `.strip()` removes extra whitespace/newlines.
   * Ensures your prompts and replies are tidy.

4. **Only keep valid pairs**

   ```python
   if prompt and reply:
   ```

   * Skips any examples missing either an instruction or a response.

5. **Format as instruction–response blocks**

   ```python
   text_pair = (
       "### Instruction:\n"
       f"{prompt}\n\n"
       "### Response:\n"
       f"{reply}\n\n"
   )
   pairs.append(text_pair)
   ```

   * Wraps each prompt/reply in clear headers.
   * Appends the formatted string to `pairs`.

6. **Join all blocks into one string**

   ```python
   full_text = "".join(pairs)
   ```

   * Creates one continuous string with all examples back-to-back.

7. **Save to disk**

   ```python
   with open('customer_support_data.txt', "w", encoding="utf-8") as f:
       f.write(full_text)
   ```

   * Writes the complete training data to `customer_support_data.txt`.
   * Ready for tokenization and model training.
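The same steps can also be sketched without an explicit Python loop. The snippet below is a minimal illustration on a tiny made-up DataFrame (the three rows are assumptions, not rows from the Bitext dataset). It also drops missing values first, since calling `.strip()` on a missing cell would raise an error.

```python
import pandas as pd

# Tiny stand-in for the real dataset (hypothetical rows, for illustration only)
toy_df = pd.DataFrame({
    "instruction": ["  cancel my order ", None, "track my refund"],
    "response": ["Sure, I can help with that.\n", "Orphan reply", "   "],
})

# Drop rows with missing values, then strip whitespace column-wise
clean = toy_df.dropna(subset=["instruction", "response"]).copy()
clean["instruction"] = clean["instruction"].str.strip()
clean["response"] = clean["response"].str.strip()

# Keep only rows where both fields are non-empty after stripping
clean = clean[(clean["instruction"] != "") & (clean["response"] != "")]

# Format each row as an instruction/response block and join them
toy_pairs = [
    f"### Instruction:\n{p}\n\n### Response:\n{r}\n\n"
    for p, r in zip(clean["instruction"], clean["response"])
]
toy_text = "".join(toy_pairs)
print(toy_text)
```

Only the first toy row survives: the second has no instruction and the third has a whitespace-only response.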


#### Read the entire contents of `customer_support_data.txt` into the string variable `text`.

In [11]:
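# Read with the same encoding the file was written in, and close it afterwards
with open('customer_support_data.txt', 'r', encoding='utf-8') as f:
    text = f.read()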

#### Create Vocabulary (Character-Level)

In this step, we build a **character-level vocabulary** — a list of all unique characters that appear in our dataset. This vocabulary will form the basis of our tokenizer, which maps characters to integers and vice versa.



In [14]:
# Create a sorted list of unique characters (vocabulary)
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Output results
print("Unique characters:\n", ''.join(chars))
print("\nVocabulary size:", vocab_size)

Unique characters:
 	
 !"#$&'()*+,-./0123456789:;>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_`abcdefghijklmnopqrstuvwxyz{}¡àé–—’☺✨️🌟👍💡💪🔐🔒🗝😊🙁🙏🛡🤗🤝

Vocabulary size: 112


#### Build Your Tokenizer (Encoder & Decoder)

Now that we know *all* the unique characters in our text (our “vocabulary”), we need a way to turn any string of text into numbers—and back again. That’s what a **tokenizer** does.

---

### 1️⃣ Why Tokenize?

* **Neural nets speak numbers**, not letters.
* We need a consistent way to map each character to an integer ID.
* Later, when we generate text, we’ll convert IDs back into characters.

---

### 2️⃣ How It Works

| Operation  | Input                     | Process                                              | Output                    |
| ---------- | ------------------------- | ---------------------------------------------------- | ------------------------- |
| **Encode** | `"hello!"`                | Look up each character in our `stoi` dictionary      | `[69, 66, 73, 73, 76, 3]` |
| **Decode** | `[69, 66, 73, 73, 76, 3]` | Map each ID back to its character in our `itos` dict | `"hello!"`                |

* **`stoi`** = **S**tring → **I**nteger map
* **`itos`** = **I**nteger → **S**tring map

---

### 3️⃣ Why Character-Level?

* **Simplicity:** Easy to inspect and debug—every token is one character.
* **Transparency:** You see exactly how “a”, “b”, “!” and “4” get their own IDs.
* **Good for small models:** No giant vocabularies, no complex subword merges.

---

### 4️⃣ (Bonus) Subword / BPE Alternative

If you ever want to scale up to subword tokens, as GPT-2 does with byte-pair encoding (BPE):

```python
import tiktoken
enc = tiktoken.get_encoding('gpt2')

print("GPT-2 uses", enc.n_vocab, "subword tokens")
print("Example:", enc.encode("hello world"))
```


In [20]:
# Build character-to-index and index-to-character mappings
stoi = { ch:i for i, ch in enumerate(chars) }  # String to Integer
itos = { i:ch for i, ch in enumerate(chars) }  # Integer to String

# Define encoder and decoder functions
encode = lambda s: [stoi[c] for c in s]        # Converts string to list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # Converts list of integers back to string

# Example usage
sample_text = "LLM"
encoded = encode(sample_text)
decoded = decode(encoded)

print("Original text:", sample_text)
print("Encoded:", encoded)
print("Decoded:", decoded)

Original text: LLM
Encoded: [43, 43, 44]
Decoded: LLM
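
One caveat: `encode` looks characters up directly in `stoi`, so any character that never appeared in the training text raises a `KeyError`. The sketch below shows one possible workaround on a small stand-alone vocabulary. The `safe_encode` helper and the choice of `'?'` as a fallback are assumptions made for illustration, not part of the original notebook.

```python
# Stand-alone toy vocabulary (the real notebook builds `chars` from the dataset)
toy_chars = sorted(set("hello world?"))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def safe_encode(s, fallback="?"):
    """Encode `s`, mapping unknown characters to the ID of `fallback` (hypothetical helper)."""
    return [toy_stoi.get(c, toy_stoi[fallback]) for c in s]

def toy_decode(ids):
    """Map a list of IDs back to a string."""
    return "".join(toy_itos[i] for i in ids)

# Known text round-trips exactly
assert toy_decode(safe_encode("hello")) == "hello"

# 'z' is not in the toy vocabulary, so it is replaced by '?'
print(toy_decode(safe_encode("hello z")))
```

Whether to substitute a fallback or fail loudly is a design choice; for training data you usually want the vocabulary to cover every character, so the fallback mostly matters for user input at generation time.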


#### Encode the Entire Dataset

Now that we’ve built our **encode** and **decode** functions, it’s time to turn **all** of our raw text into numbers—one big sequence of token IDs that our model can train on.

---

### 🧠 Why Encode the Full Text?

* **Batching & slicing** require numeric tensors, not strings.
* Converting once up front is more efficient than encoding on the fly.
* The resulting tensor is the “source of truth” for all downstream data splits and training loops.




In [21]:
import torch  # We use PyTorch: https://pytorch.org

# Encode the entire text using our character-level encoder
data = torch.tensor(encode(text), dtype=torch.long)

# Print tensor information
print("Data shape:", data.shape)
print("Data type:", data.dtype)

# Preview first 1000 encoded tokens
print("First 1000 tokens:\n", data[:1000])

Data shape: torch.Size([19240191])
Data type: torch.int64
First 1000 tokens:
 tensor([ 5,  5,  5,  2, 40, 75, 80, 81, 79, 82, 64, 81, 70, 76, 75, 27,  1, 78,
        82, 66, 80, 81, 70, 76, 75,  2, 62, 63, 76, 82, 81,  2, 64, 62, 75, 64,
        66, 73, 73, 70, 75, 68,  2, 76, 79, 65, 66, 79,  2, 88, 88, 46, 79, 65,
        66, 79,  2, 45, 82, 74, 63, 66, 79, 89, 89,  1,  1,  5,  5,  5,  2, 49,
        66, 80, 77, 76, 75, 80, 66, 27,  1, 40,  8, 83, 66,  2, 82, 75, 65, 66,
        79, 80, 81, 76, 76, 65,  2, 86, 76, 82,  2, 69, 62, 83, 66,  2, 62,  2,
        78, 82, 66, 80, 81, 70, 76, 75,  2, 79, 66, 68, 62, 79, 65, 70, 75, 68,
         2, 64, 62, 75, 64, 66, 73, 70, 75, 68,  2, 76, 79, 65, 66, 79,  2, 88,
        88, 46, 79, 65, 66, 79,  2, 45, 82, 74, 63, 66, 79, 89, 89, 13,  2, 62,
        75, 65,  2, 40,  8, 74,  2, 69, 66, 79, 66,  2, 81, 76,  2, 77, 79, 76,
        83, 70, 65, 66,  2, 86, 76, 82,  2, 84, 70, 81, 69,  2, 81, 69, 66,  2,
        70, 75, 67, 76, 79, 74, 62, 81, 70

#### Train/Validation Split

Split data into training (90%) and validation (10%) sets.

In [22]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

#### **Creating Mini‐Batches from chunks for Training**
Instead of training on one long sequence at a time, we cut our data into many small windows—then group those windows into mini‐batches so the model learns faster and more stably.

#### 1️⃣ Why Mini‐Batches?

* **Efficiency:** GPUs work best on parallel data.
* **Stability:** Averaging the loss over multiple windows smooths out noisy gradients.
* **Coverage:** Random windows from across the text expose the model to more diverse patterns each step.


#### Full data (token IDs):

  ```
  data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …]
  ```
* **block\_size** = 4
* **Batch size** (B) = 3

##### Picking a Chunk

Each chunk will be length `block_size + 1 = 5`.

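`chunk = [0, 1, 2, 3, 4]`


#### Extracting `x` and `y` from chunk

* **`chunk`**: 5 tokens, including one extra for alignment.
* **`x`**: the **context window** of length 4, i.e. `chunk[:block_size]` → `[0, 1, 2, 3]`
* **`y`**: the **next-token labels**, shifted by one, i.e. `chunk[1:block_size + 1]` → `[1, 2, 3, 4]`

The code cell that follows does the same thing on the real data, with `block_size = 8`.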

In [24]:
block_size = 8  # This is the context window length
chunk = train_data[:block_size + 1]  # One extra token to create input-output alignment
print("📦 Sample chunk of tokens:", chunk.tolist())

# Create input (x) and target (y) sequences
x = chunk[:block_size]            # inputs
y = chunk[1:block_size + 1]       # targets (shifted by one)


print("\n🔁 Next-token prediction learning:")
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"📚 When input is {context.tolist()} → target is: {target.item()}")

📦 Sample chunk of tokens: [5, 5, 5, 2, 40, 75, 80, 81, 79]

🔁 Next-token prediction learning:
📚 When input is [5] → target is: 5
📚 When input is [5, 5] → target is: 5
📚 When input is [5, 5, 5] → target is: 2
📚 When input is [5, 5, 5, 2] → target is: 40
📚 When input is [5, 5, 5, 2, 40] → target is: 75
📚 When input is [5, 5, 5, 2, 40, 75] → target is: 80
📚 When input is [5, 5, 5, 2, 40, 75, 80] → target is: 81
📚 When input is [5, 5, 5, 2, 40, 75, 80, 81] → target is: 79


#### Create Data Batches

* **Full data** (token IDs):

  ```
  data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …]
  ```
* **block\_size** = 4
* **Batch size** (B) = 3

---

### 1️⃣ Picking Random Chunks

Suppose we randomly choose start indices:

```
start_idxs = [2, 5, 8]
```

Each chunk will be length `block_size + 1 = 5`.

---

### 2️⃣ Extracting `x` and `y` from `chunk`

| Batch b | start\_idx | chunk = data\[i\:i+5] | x\[b] = first 4 tokens | y\[b] = last 4 tokens |
| :-----: | :--------: | :-------------------: | :--------------------: | :-------------------: |
|  **0**  |      2     |   `[2, 3, 4, 5, 6]`   |     `[2, 3, 4, 5]`     |     `[3, 4, 5, 6]`    |
|  **1**  |      5     |   `[5, 6, 7, 8, 9]`   |     `[5, 6, 7, 8]`     |     `[6, 7, 8, 9]`    |
|  **2**  |      8     |  `[8, 9, 10, 11, 12]` |    `[8, 9, 10, 11]`    |   `[9, 10, 11, 12]`   |

* **`chunk`**: 5 tokens including one extra for alignment.
* **`x[b]`**: the **context window** of length 4.
* **`y[b]`**: the **next-token labels**, shifted by one.

---

### 3️⃣ Resulting Mini‐Batch

* `x` has shape `(3, 4)` and looks like:

  ```
  [
    [2,  3,  4,  5],
    [5,  6,  7,  8],
    [8,  9, 10, 11],
  ]
  ```
* `y` has shape `(3, 4)` and looks like:

  ```
  [
    [3,  4,  5,  6],
    [6,  7,  8,  9],
    [9, 10, 11, 12],
  ]
  ```
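
The toy walkthrough above can be reproduced in a few lines. This is a sketch that fixes the start indices to `[2, 5, 8]` instead of sampling them, so the result matches the table; the `toy_` names are introduced here for illustration only.

```python
import torch

# Toy data from the walkthrough: token IDs 0..12
toy_data = torch.arange(13)
toy_block_size = 4
start_idxs = [2, 5, 8]  # fixed here instead of random, to match the table

# Inputs: block_size tokens starting at each index
x = torch.stack([toy_data[i:i + toy_block_size] for i in start_idxs])
# Targets: the same windows shifted one position to the right
y = torch.stack([toy_data[i + 1:i + toy_block_size + 1] for i in start_idxs])

print(x.shape, y.shape)
print(x)
print(y)
```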

In [None]:
import torch

# Set random seed for reproducibility
torch.manual_seed(1337)

batch_size = 4   # How many sequences to process in parallel
block_size = 8   # Length of each sequence (context window)

def get_batch(split):
    """
    Samples a mini-batch of input (x) and target (y) sequences from the dataset.

    Args:
        split (str): One of 'train' or 'val' to choose the dataset split.

    Returns:
        x (torch.Tensor): Input sequences of shape (batch_size, block_size)
        y (torch.Tensor): Target sequences of shape (batch_size, block_size)
                          Each y[i, t] is the next character after x[i, t]
    """
    assert split in ['train', 'val'], "split must be 'train' or 'val'"

    data_source = train_data if split == 'train' else val_data

    # Randomly sample starting indices for each sequence
    start_indices = torch.randint(0, len(data_source) - block_size, (batch_size,))

    # Build input and target tensors using slicing
    x = torch.stack([data_source[i:i + block_size] for i in start_indices])
    y = torch.stack([data_source[i + 1:i + block_size + 1] for i in start_indices])

    return x, y



In [37]:
# Generate a training batch
xb, yb = get_batch('train')

# Inspect the shape of the input and target tensors
print("🧮 Input batch shape:", xb.shape)   # Expected: (4, 8)
print("🧮 Target batch shape:", yb.shape) # Expected: (4, 8)

# View actual data
print("\n🧾 Inputs (xb):")
print(xb)

print("\n🎯 Targets (yb):")
print(yb)


🧮 Input batch shape: torch.Size([4, 8])
🧮 Target batch shape: torch.Size([4, 8])

🧾 Inputs (xb):
tensor([[86, 76, 82,  2, 64, 62, 75,  2],
        [ 2, 70, 75, 67, 76, 79, 74, 62],
        [84, 76, 79, 65,  2, 67, 76, 79],
        [ 2, 80, 77, 66, 64, 70, 67, 70]])

🎯 Targets (yb):
tensor([[76, 82,  2, 64, 62, 75,  2, 80],
        [70, 75, 67, 76, 79, 74, 62, 81],
        [76, 79, 65,  2, 67, 76, 79,  2],
        [80, 77, 66, 64, 70, 67, 70, 64]])


#### 🧠 Why Do We Train This Way?

This is called **next-token prediction** — the foundational idea behind models like GPT.

At each position in a sequence, the model learns to predict **what character comes next** based on the context it has seen so far.

* In this example:

  * Feed the model `[86]` → expect it to predict `76`
  * Feed `[86, 76]` → expect `82`
  * Feed `[86, 76, 82]` → expect `2`
  * ... and so on


### 📚 What the Model Learns Over Time

By seeing **many random sequences from across the dataset**, the model learns:

* Patterns in character sequences
* What tokens frequently follow others
* How to continue a sentence or phrase
* Eventually — how to generate coherent text from scratch

This chunking and batching strategy feeds the **core training loop** of autoregressive language models.


#### Build the Model - Bigram Language Model

### ⚙️ Step 9: Define a Simple Model

In this step, we define and test a **Bigram Language Model** — the simplest type of autoregressive model. It predicts the next token **only** based on the current token, without considering any context before it.

---

### 🔧 What Is a Bigram Model?

A **bigram model** looks at just **one character (token)** and tries to predict **what comes next**. For example:
- Given `'H'`, predict `'e'`
- Given `'e'`, predict `'l'`
- Given `'l'`, predict `'l'`
- Given `'l'`, predict `'o'`

This is the minimal form of a language model — it **ignores all previous context** except the current token.
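
To make the idea concrete, here is a small counting sketch (not the neural model used in this notebook) that tallies which character follows which in the word `"Hello"`. It shows why a bigram model cannot be certain after `'l'`: both `'l'` and `'o'` follow it.

```python
from collections import Counter, defaultdict

toy_text = "Hello"

# Count how often each character is followed by each other character
follow_counts = defaultdict(Counter)
for current, nxt in zip(toy_text, toy_text[1:]):
    follow_counts[current][nxt] += 1

for current, counts in follow_counts.items():
    print(repr(current), "->", dict(counts))
```

The neural bigram model defined later learns a table that plays the same role as these counts, but it gets there by gradient descent instead of counting.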

---

### `token_embedding_table`


* We define a PyTorch `nn.Module`.
* Inside, we create a simple `nn.Embedding` layer.

  * This is a **lookup table** with shape `(vocab_size, vocab_size)`
  * Each token maps directly to a row — a vector of logits predicting the next token.


**Defining the Embedding Layer**

Assuming we have a vocab_size of 5

```python
self.token_embedding_table = nn.Embedding(5, 5)
```

This creates a **learnable** weight matrix of shape `(5 × 5)`. PyTorch initializes it randomly (by default from a standard normal distribution), so the values might look like this:

| **Token ID** | **Dim 0** | **Dim 1** | **Dim 2** | **Dim 3** | **Dim 4** |
| :----------: | :-------: | :-------: | :-------: | :-------: | :-------: |
|     **0**    |    0.12   |   –0.34   |    0.45   |   –0.01   |    0.33   |
|     **1**    |   –0.05   |    0.78   |   –1.10   |    0.22   |    0.09   |
|     **2**    |    1.30   |   –0.44   |    0.00   |    0.55   |   –0.88   |
|     **3**    |   –0.77   |    0.21   |    0.64   |   –0.19   |    1.15   |
|     **4**    |    0.03   |   –0.67   |   –0.40   |    0.90   |   –0.12   |

* **Rows** correspond to **token IDs** (0–4).
* **Columns** are the **embedding dimensions**, each of which will serve as the raw “logit score” for predicting the next token when you use this embedding in a Bigram model.

---
### The Input Indices (`idx`)

Suppose during training you form a mini-batch of **B = 2** sequences, each of length **T = 3**. You might have:

```
idx = [
  [2, 0, 4],   # ← Batch 0’s token IDs at time-steps 0,1,2
  [1, 3, 2],   # ← Batch 1’s token IDs
]
```

Shape: `(2, 3)`

---

## 3. Looking Up Embeddings

When you call:

```python
logits = self.token_embedding_table(idx)
```

PyTorch “gathers” the corresponding rows from the 5×5 table for each position in `idx`, producing a tensor of shape **(B, T, 5)**. Concretely:

### Batch 0 (`idx[0] = [2,0,4]`)

| Time-step (t) | token ID | Embedding Vector (Dim 0–4)           |
| ------------: | :------: | :----------------------------------- |
|         **0** |     2    | \[ 1.30, –0.44,  0.00,  0.55, –0.88] |
|         **1** |     0    | \[ 0.12, –0.34,  0.45, –0.01,  0.33] |
|         **2** |     4    | \[ 0.03, –0.67, –0.40,  0.90, –0.12] |

### Batch 1 (`idx[1] = [1,3,2]`)

| Time-step (t) | token ID | Embedding Vector (Dim 0–4)           |
| ------------: | :------: | :----------------------------------- |
|         **0** |     1    | \[–0.05,  0.78, –1.10,  0.22,  0.09] |
|         **1** |     3    | \[–0.77,  0.21,  0.64, –0.19,  1.15] |
|         **2** |     2    | \[ 1.30, –0.44,  0.00,  0.55, –0.88] |

---

## 4. Resulting `logits` Tensor

Putting those together, `logits` is:

```
logits = [
  [  # Batch 0 (shape: 3×5)
    [1.30, –0.44,  0.00,  0.55, –0.88],  # t=0
    [0.12, –0.34,  0.45, –0.01,  0.33],  # t=1
    [0.03, –0.67, –0.40,  0.90, –0.12],  # t=2
  ],
  [  # Batch 1 (shape: 3×5)
    [–0.05,  0.78, –1.10,  0.22,  0.09], # t=0
    [–0.77,  0.21,  0.64, –0.19,  1.15], # t=1
    [ 1.30, –0.44,  0.00,  0.55, –0.88], # t=2
  ]
] #(shape: 2x3×5 = BxTxC)
```

Shape: **(2, 3, 5)** in our toy; in your real model it’s **(B, T, vocab\_size)**.
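
The lookup can be checked directly. The sketch below loads the illustrative 5×5 table into an embedding layer with `nn.Embedding.from_pretrained` (the values are the made-up ones from the table above, not real initial weights) and confirms the output shape.

```python
import torch
import torch.nn as nn

# The 5x5 toy table from above (illustrative values, not real initial weights)
table = torch.tensor([
    [ 0.12, -0.34,  0.45, -0.01,  0.33],
    [-0.05,  0.78, -1.10,  0.22,  0.09],
    [ 1.30, -0.44,  0.00,  0.55, -0.88],
    [-0.77,  0.21,  0.64, -0.19,  1.15],
    [ 0.03, -0.67, -0.40,  0.90, -0.12],
])
emb = nn.Embedding.from_pretrained(table)  # frozen embedding built from the table

idx = torch.tensor([[2, 0, 4],
                    [1, 3, 2]])  # shape (B=2, T=3)

logits = emb(idx)                # shape (B, T, C) = (2, 3, 5)
print(logits.shape)
print(logits[0, 0])              # row 2 of the table
```

An embedding lookup is plain row indexing, so `table[idx]` gives the same tensor.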

---

#### 🎯 What Are Logits?

* They are **raw scores** for all possible next tokens.
* Later used in `cross_entropy()` to calculate loss.
* Think of them as the model saying:

  > “Here is my score for each of the 112 possible next tokens after this one.”

---

### `Loss Calculation`

* PyTorch's `cross_entropy` expects:

  * `logits` to be `(N, C)` → N is the number of predictions, C is the number of classes
  * `targets` to be a 1D tensor of shape `(N,)`
* So we reshape the tensors:

  * `B*T` is how many total predictions we made
* The result is a **single scalar loss** that tells how well the model predicts the next tokens across the batch.

### Recall Our Toy Example

* **Batch size (B)** = 2
* **Time-steps (T)** = 3
* **Classes (C)** = 5

```python
import torch

# (B, T) input indices
idx = torch.tensor([
    [2, 0, 4],   # Batch 0
    [1, 3, 2],   # Batch 1
])

# After embedding lookup → (B, T, C) logits:
logits = torch.tensor([
    [  # Batch 0
        [ 1.30, -0.44,  0.00,  0.55, -0.88],  # t=0
        [ 0.12, -0.34,  0.45, -0.01,  0.33],  # t=1
        [ 0.03, -0.67, -0.40,  0.90, -0.12],  # t=2
    ],
    [  # Batch 1
        [-0.05,  0.78, -1.10,  0.22,  0.09],  # t=0
        [-0.77,  0.21,  0.64, -0.19,  1.15],  # t=1
        [ 1.30, -0.44,  0.00,  0.55, -0.88],  # t=2
    ],
])
```

Suppose our **targets** tensor is the “true” next-token IDs for each position:

```python
targets = torch.tensor([
    [0, 2, 1],   # Batch 0's true next-token labels
    [3, 4, 0],   # Batch 1's labels
])
```

---

## 1. Shapes Before Flattening

| Tensor    | Shape       | Description                                   |
| --------- | ----------- | --------------------------------------------- |
| `logits`  | `(2, 3, 5)` | 2 batches × 3 time-steps × 5 class-scores     |
| `targets` | `(2, 3)`    | 2 batches × 3 true labels (one per time-step) |

---

## 2. Flattening to Fit `F.cross_entropy`

PyTorch’s **`cross_entropy`** expects:

* **`logits`** shape `(N, C)`
* **`targets`** shape `(N,)`

where **N** is the total number of predictions (`B × T`). We flatten accordingly:

```python
B, T, C = logits.shape              # B=2, T=3, C=5

logits = logits.view(B * T, C)      # → shape (6, 5)
targets = targets.view(B * T)       # → shape (6,)
```

---

### 3. How the Flattening Works

| (batch, time) | Flat index n | Logits row (length 5)               | Target label |
| ------------: | -----------: | ----------------------------------- | -----------: |
|        (0, 0) |            0 | \[1.30, –0.44,  0.00,  0.55, –0.88] |            0 |
|        (0, 1) |            1 | \[0.12, –0.34,  0.45, –0.01,  0.33] |            2 |
|        (0, 2) |            2 | \[0.03, –0.67, –0.40,  0.90, –0.12] |            1 |
|        (1, 0) |            3 | \[–0.05, 0.78, –1.10, 0.22,  0.09]  |            3 |
|        (1, 1) |            4 | \[–0.77, 0.21,  0.64, –0.19, 1.15]  |            4 |
|        (1, 2) |            5 | \[ 1.30, –0.44, 0.00,  0.55, –0.88] |            0 |

* **Rows 0–5** of the flattened `logits` correspond to each `(batch, time)` pair in row-major order.
* The flattened `targets` vector is `[0, 2, 1, 3, 4, 0]`.

---

## 4. Computing the Loss

With shapes `(6, 5)` and `(6,)`, we can now call:

```python
loss = F.cross_entropy(logits, targets)
```

* For each of the 6 rows, `cross_entropy`

  1. applies `softmax` over the 5 class-scores
  2. picks out the probability of the true class (from `targets`)
  3. computes `–log(p_true)`
* Finally, it **averages** these 6 values into a single scalar loss.
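
The three steps can be verified by hand. The sketch below recomputes the loss manually on the flattened toy tensors and compares it with `F.cross_entropy`; the variable names `p_true`, `manual_loss` and `builtin_loss` are introduced here for illustration.

```python
import torch
from torch.nn import functional as F

# Flattened toy logits (6, 5) and targets (6,) from the table above
logits = torch.tensor([
    [ 1.30, -0.44,  0.00,  0.55, -0.88],
    [ 0.12, -0.34,  0.45, -0.01,  0.33],
    [ 0.03, -0.67, -0.40,  0.90, -0.12],
    [-0.05,  0.78, -1.10,  0.22,  0.09],
    [-0.77,  0.21,  0.64, -0.19,  1.15],
    [ 1.30, -0.44,  0.00,  0.55, -0.88],
])
targets = torch.tensor([0, 2, 1, 3, 4, 0])

# 1. softmax over the 5 class-scores of each row
probs = F.softmax(logits, dim=-1)                      # (6, 5)
# 2. pick out the probability assigned to the true class
p_true = probs[torch.arange(len(targets)), targets]    # (6,)
# 3. negative log, then average over the 6 predictions
manual_loss = (-torch.log(p_true)).mean()

builtin_loss = F.cross_entropy(logits, targets)
print(manual_loss.item(), builtin_loss.item())
```

The two numbers should agree up to floating-point error. In practice `F.cross_entropy` is preferable because it works from the logits in a numerically stable way.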


### 📉 What `loss` Gives You:

* A single **scalar value** (e.g., `4.87`)
* Measures **how wrong** the model’s predictions are
* Lower is better! For reference, a model that guesses uniformly at random scores `-ln(1/vocab_size)`, which is `ln(112) ≈ 4.72` for our vocabulary.
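
The uniform-guess baseline takes only a couple of lines to compute. This is a sketch that hard-codes the vocabulary size of 112 reported earlier; the names `n_tokens` and `baseline_loss` are introduced here for illustration.

```python
import math

n_tokens = 112  # vocabulary size reported earlier in this notebook

# A model with no preference assigns probability 1/n_tokens to every token
p_uniform = 1.0 / n_tokens
baseline_loss = -math.log(p_uniform)   # same value as ln(n_tokens)

print(f"Uniform-guess loss: {baseline_loss:.4f}")
```

A freshly initialized embedding table typically starts somewhat above this value, because its random logits are not uniform. Training should eventually push the loss well below it.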

#### `Generate Method (Sampling Text)`


#### Setup Recap

**Embedding table** (5×5) from before:

| Token ID | Dim 0 | Dim 1 | Dim 2 | Dim 3 | Dim 4 |
| :------: | :---: | :---: | :---: | :---: | :---: |
|   **0**  |  0.12 | –0.34 |  0.45 | –0.01 |  0.33 |
|   **1**  | –0.05 |  0.78 | –1.10 |  0.22 |  0.09 |
|   **2**  |  1.30 | –0.44 |  0.00 |  0.55 | –0.88 |
|   **3**  | –0.77 |  0.21 |  0.64 | –0.19 |  1.15 |
|   **4**  |  0.03 | –0.67 | –0.40 |  0.90 | –0.12 |

**Initial `idx`** (shape (2, 3)):

| Batch | t=0 | t=1 | t=2 |
| :---: | :-: | :-: | :-: |
| **0** |  2  |  0  |  4  |
| **1** |  1  |  3  |  2  |

---

## Steps for Generation

1. **Lookup logits** for the **last** token in each batch (t=2):

   * Batch 0 last ID = 4 → logits = embedding row 4 = `[0.03, –0.67, –0.40, 0.90, –0.12]`
   * Batch 1 last ID = 2 → logits = embedding row 2 = `[1.30, –0.44, 0.00, 0.55, –0.88]`

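2. **Argmax** (greedy) picks the highest logit. The `generate` method in the code cell samples from the softmax distribution with `torch.multinomial` instead; argmax is used here only to keep the example deterministic: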

   * Batch 0 → max at **Dim 3** (0.90) → new token = 3
   * Batch 1 → max at **Dim 0** (1.30) → new token = 0

3. **Append** to each sequence:

| Batch | Before     | After append (t=3) |
| :---: | :--------- | :----------------- |
| **0** | \[2, 0, 4] | \[2, 0, 4, **3**]  |
| **1** | \[1, 3, 2] | \[1, 3, 2, **0**]  |

Now `idx.shape == (2, 4)`.
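
The step above can be sketched as code. Note that `greedy_generate` is a hypothetical helper written for this toy example: it indexes the illustrative table directly and uses argmax, whereas the notebook's `generate` method samples from the softmax distribution.

```python
import torch

# The 5x5 toy table from the setup recap (illustrative values)
table = torch.tensor([
    [ 0.12, -0.34,  0.45, -0.01,  0.33],
    [-0.05,  0.78, -1.10,  0.22,  0.09],
    [ 1.30, -0.44,  0.00,  0.55, -0.88],
    [-0.77,  0.21,  0.64, -0.19,  1.15],
    [ 0.03, -0.67, -0.40,  0.90, -0.12],
])

def greedy_generate(idx, max_new_tokens):
    """Greedy variant for illustration; the notebook's `generate` samples instead."""
    for _ in range(max_new_tokens):
        logits = table[idx[:, -1]]                       # (B, C): row for each last token
        idx_next = logits.argmax(dim=-1, keepdim=True)   # (B, 1): highest-scoring token
        idx = torch.cat((idx, idx_next), dim=1)          # (B, T+1)
    return idx

idx = torch.tensor([[2, 0, 4],
                    [1, 3, 2]])
out = greedy_generate(idx, max_new_tokens=1)
print(out.tolist())  # [[2, 0, 4, 3], [1, 3, 2, 0]]
```

Each further step repeats the same lookup on the newly appended token, which is all the context a bigram model ever uses.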



## Summary

Starting from

```
idx = [
  [2, 0, 4],
  [1, 3, 2],
]
```

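and running a greedy variant of `generate(idx, max_new_tokens=2)`:

1. **Step 1** → append **3** to batch 0, **0** to batch 1
2. **Step 2** → the last tokens are now 3 and 0; row 3 peaks at **Dim 4** (1.15) and row 0 peaks at **Dim 2** (0.45), so append **4** to batch 0, **2** to batch 1

Yields:

```
idx_out = [
  [2, 0, 4, 3, 4],
  [1, 3, 2, 0, 2],
]
```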

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

# ✅ Define the Bigram Language Model
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # A lookup table that maps token indices to vocab-sized logits for next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx: (B, T) input indices
        # targets: (B, T) expected next-token indices

        # Embed the tokens (output: logits of shape [B, T, vocab_size])
        logits = self.token_embedding_table(idx)  # (B, T, C)

        # If we're training (targets are provided), compute the loss
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)        # Flatten (B, T, C) -> (B*T, C)
            targets = targets.view(B * T)         # Flatten targets: (B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T)
        for _ in range(max_new_tokens):
            logits, _ = self(idx)               # (B, T, C)
            logits = logits[:, -1, :]           # focus on last time step (B, C)

            # logits = logits[:, -1, :]
                        # └─┬─┘ └─┬─┘ └─┬─┘
                          # │     │     └── all class-scores
                          # │     └──────── last time-step (t = T-1)
                          # └────────────── all batch entries

            probs = F.softmax(logits, dim=-1)   # convert to probabilities (B, C)
            idx_next = torch.multinomial(probs, num_samples=1)  # sample (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)  # append to sequence (B, T+1)
        return idx

# ✅ Instantiate and test the model
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)

print("Logits shape:", logits.shape)  # [B*T, vocab_size]
print("Loss:", loss)                  # Cross-entropy loss

# ✅ Try generating some text from the model
sample = m.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)
print("\nGenerated text (raw tokens):", sample[0].tolist())
print("\nGenerated text (decoded):")
print(decode(sample[0].tolist()))


Logits shape: torch.Size([32, 112])
Loss: tensor(5.2504, grad_fn=<NllLossBackward0>)

Generated text (raw tokens): [0, 6, 39, 103, 40, 60, 108, 103, 103, 37, 31, 46, 35, 12, 104, 62, 29, 43, 64, 3, 14, 36, 84, 35, 27, 13, 60, 15, 99, 96, 93, 45, 32, 74, 44, 14, 25, 99, 18, 53, 7, 34, 77, 1, 65, 91, 62, 63, 46, 26, 81, 28, 91, 65, 90, 84, 103, 59, 16, 38, 104, 33, 65, 59, 82, 9, 29, 85, 33, 77, 19, 36, 9, 36, 68, 25, 0, 62, 45, 98, 80, 41, 38, 107, 24, 14, 111, 13, 44, 62, 88, 34, 36, 68, 107, 34, 77, 61, 83, 80, 76]

Generated text (decoded):
	$H🔐I_🙏🔐🔐F@OD+🔒a>Lc!-EwD:,_.🌟☺–NAmM-8🌟1V&Cp
dàabO9t;àd¡w🔐]/G🔒Bd]u(>xBp2E(Eg8	aN️sJG🙁7-🤝,Ma{CEg🙁Cp`vso


#### Train the Model

Optimize the model using a basic training loop.

1. **Build** your Bigram model and AdamW optimizer.
2. **Repeat 100×:**

   * **Grab** a random batch (`xb`, `yb`).
   * **Forward** through the model to get `loss`.
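   * **Zero** the old gradients, then **backprop** to compute new ones.
   * **Step** the optimizer to update the embedding table, which holds the logits.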


In [None]:
m = BigramLanguageModel(vocab_size)
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

for steps in range(100):
    xb, yb = get_batch('train')
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    print(loss.item())

5.332954406738281
5.358665943145752
5.151556968688965
5.235673904418945
5.46928071975708
5.0979108810424805
5.282344341278076
5.01500129699707
5.568490505218506
5.367368221282959
5.1276350021362305
5.375743865966797
5.245172023773193
5.231736660003662
5.352851390838623
5.219082832336426
5.791552543640137
4.828013896942139
5.439228534698486
5.397055625915527
4.983537197113037
5.428558826446533
5.535346031188965
5.413728713989258
5.238778591156006
5.203005790710449
5.35411262512207
5.070978164672852
5.397096157073975
5.209497451782227
4.900765419006348
5.011595726013184
5.300110816955566
5.278593063354492
5.2811760902404785
5.214981555938721
5.351513862609863
5.1683735847473145
5.572271347045898
5.3149189949035645
5.360107421875
5.484967231750488
5.286406517028809
5.2235894203186035
5.5175957679748535
5.22564172744751
5.440962791442871
5.232370853424072
4.816390514373779
5.533224105834961
5.294939994812012
5.235952854156494
5.159475326538086
5.338944435119629
5.122669219970703
5.31830120

In [None]:
context = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(context, max_new_tokens=100)[0].tolist()))

	🔐vym-U{☺:👍Dm[)_—ED	️M;W+–n🤗RQ😊 `lJO,v¡p🤝d🙁k'&zwUq5🙁👍[🔐;B$k🙏1🔐8p.90{👍4k,"(e{EG5.+nb?k2"Ho é¡yà💪V.A1+K


#### Issues with Simple Bigram Model

1. **Zero Context Beyond One Token**

2. **No Positional or Segment Embeddings**
   * Every character treated identically, regardless of its position in the sequence.
   * Model has no sense of “this is the start,” “this is the end,” or token order beyond one step.

3. **Tiny Context Window (`block_size`)**
   * We used very small chunks (e.g. 8 tokens) for demonstration → which severely limits what the model can memorize or generalize.

4. **Character‐Level Tokenization Only**
   * Single characters as tokens → extremely long sequences and slow convergence.

5. **No Attention or Deep Layers**

6. **No Loss Smoothing or Metrics**

   * We printed raw loss per step → extremely noisy signal due to random mini‐batches, without averaging or exponential smoothing.
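
As a rough illustration of the last point, the sketch below applies an exponential moving average to the first few losses printed above (rounded to four decimals). The helper name and the smoothing factor `beta=0.9` are assumptions chosen for the example, not values from the notebook.

```python
# First few raw losses from the training output above, rounded to 4 decimals
raw_losses = [5.3330, 5.3587, 5.1516, 5.2357, 5.4693, 5.0979]

def exponential_moving_average(values, beta=0.9):
    """Smooth a sequence; `beta` close to 1 means heavier smoothing (assumed value)."""
    smoothed = []
    running = values[0]          # start from the first observation
    for v in values:
        running = beta * running + (1 - beta) * v
        smoothed.append(running)
    return smoothed

for raw, smooth in zip(raw_losses, exponential_moving_average(raw_losses)):
    print(f"raw={raw:.4f}  smoothed={smooth:.4f}")
```

Another common option is to average the loss over many batches from both the training and validation splits at fixed intervals, which also shows whether the model is overfitting.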

---

#### What’s Next: Building a Better Bigram Language Model

1. **Longer Context Windows**
   * Increase `block_size` to dozens or hundreds of tokens so the model sees more history.

2. **Subword Tokenization (BPE / WordPiece)**

3. **Positional & Segment Embeddings**

4. **Self-Attention**
   * Introduce multi-head attention layers

5. **Loss Smoothing**

In the next video, we’ll upgrade our simple bigram into a much better bigram language model before we build our first transformer model.
