## 🧠 Building on the Bigram Language Model

**Improving the Bigram LM**
*(Still conditions only on token *t*)*

1. **Longer Context Windows**

   * Increase `block_size` from 1 to, e.g., 128.
   * *Why?*: You can process longer sequences in each batch—but the prediction at *t+1* still only uses the embedding at *t*.

2. **Subword Tokenization (BPE/WordPiece)**

   * Replace characters with subword units (e.g. “refund” → one token).
   * *Why?*: Sequences become semantically richer and shorter—yet you still map one token to the next via a single lookup.

3. **Token + Positional (± Segment) Embeddings**

   * *Why?*: Adds learned semantic vectors and absolute order (or speaker) information, but if only `x[t]` is used to predict the next token, the model remains a bigram.
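The point shared by all three items can be checked with a toy model. The sketch below is a minimal illustration, not the notebook's model: the sizes, the random tables, and the helper `logits_at` are all made up for the example. It shows that when the logits are computed from `x[t]` alone, two sequences with different histories but the same token at position *t* get identical predictions.

```python
import random

random.seed(0)
VOCAB, EMB, BLOCK = 5, 3, 4  # toy sizes, chosen only for illustration

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

tok_emb = rand_matrix(VOCAB, EMB)   # one vector per token id
pos_emb = rand_matrix(BLOCK, EMB)   # one vector per position
head    = rand_matrix(EMB, VOCAB)   # projects an embedding to vocab logits

def logits_at(seq, t):
    """Logits for the token after position t, computed from x[t] alone."""
    x = [a + b for a, b in zip(tok_emb[seq[t]], pos_emb[t])]
    return [sum(x[e] * head[e][v] for e in range(EMB)) for v in range(VOCAB)]

# Two sequences that differ in their history but share the token at t = 3.
seq_a = [0, 1, 2, 4]
seq_b = [3, 3, 0, 4]
same_prediction = logits_at(seq_a, 3) == logits_at(seq_b, 3)
print(same_prediction)  # True: the history never enters the computation
```

Positional embeddings change the prediction per position, but they do not let position *t* see what came before it.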

---

**Minimal Transformer LM**
*(Now conditions on all tokens 0…t)*

4. **Self-Attention + Causal Masking**

   * *Why?*: Each position *t* now “looks at” tokens 0…*t* (itself and everything before it) when forming its representation. That full-history dependency is what **breaks** the bigram limitation and makes your model a **decoder-only Transformer**.
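As a rough sketch of what causal masking does to the attention weights, the pure-Python snippet below applies a row-wise softmax while hiding future positions. The function name `causal_attention_weights` and the all-zero scores are assumptions for illustration. A real Transformer computes the scores from learned query and key projections.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][j], keeping only keys j <= t."""
    T = len(scores)
    weights = []
    for t in range(T):
        visible = scores[t][: t + 1]                 # mask out future positions
        m = max(visible)                             # subtract max for stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (T - t - 1))
    return weights

# Toy query-key scores for T = 3 positions (all zeros gives uniform attention).
scores = [[0.0] * 3 for _ in range(3)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Row *t* has non-zero weight only on columns 0…*t*, so each position mixes information from its own token and its history, never from the future.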



#### **Load from Hugging Face with pandas**

In [None]:
import pandas as pd

df = pd.read_csv("hf://datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset/Bitext_Sample_Customer_Support_Training_Dataset_27K_responses-v11.csv")



In [None]:
pairs = []

for _, row in df.iterrows():
  prompt = row['instruction'].strip()
  reply = row['response'].strip()

  if prompt and reply:
    text_pair = f"### Instruction:\n{prompt}\n\n### Response:\n{reply}\n\n"
    pairs.append(text_pair)


full_text = "".join(pairs)


output_path = 'customer_support_data.txt'
with open(output_path, "w", encoding="utf-8") as f:
  f.write(full_text)

# Read the file back with the same encoding it was written with
with open(output_path, "r", encoding="utf-8") as f:
  text = f.read()

'customer_support_data.txt'
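One possible hardening of the formatting loop, assuming the CSV could contain empty cells: pandas reads those as `NaN` floats, and calling `.strip()` on a float raises `AttributeError`. The helper `format_pair` and the sample rows below are hypothetical stand-ins for the real DataFrame rows.

```python
def format_pair(prompt, reply):
    """Return one training example, or None if either field is missing/blank."""
    if not isinstance(prompt, str) or not isinstance(reply, str):
        return None                      # e.g. NaN floats from empty CSV cells
    prompt, reply = prompt.strip(), reply.strip()
    if not prompt or not reply:
        return None
    return f"### Instruction:\n{prompt}\n\n### Response:\n{reply}\n\n"

rows = [  # made-up rows standing in for df.iterrows()
    {"instruction": " cancel my order ", "response": "Sure, I can help."},
    {"instruction": float("nan"), "response": "orphan reply"},
    {"instruction": "   ", "response": "blank prompt"},
]
demo_pairs = [p for r in rows if (p := format_pair(r["instruction"], r["response"]))]
print(len(demo_pairs))  # 1
print(demo_pairs[0])
```

Only the first row survives. The other two are skipped instead of crashing the loop or producing an empty instruction block.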

### ✏️ Step 2: Create Tokenizer - Subword (BPE) Level (Encoder/Decoder)

Instead of a character-level vocabulary, we’ll use a GPT-2–style BPE tokenizer to break text into meaningful subword units. This produces shorter, semantically richer sequences.
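To build intuition for how BPE arrives at subword units, here is a toy merge loop on characters. It is a simplified sketch: the real GPT-2 tokenizer works on bytes, applies a pre-tokenization regex, and uses a fixed table of learned merges. The helper names and the sample string are made up for the example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent pair that occurs most often (first seen wins ties)."""
    counts = Counter(zip(tokens, tokens[1:]))
    return max(counts, key=counts.get)

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("refund refund refill")   # start from 20 single characters
for _ in range(3):                      # three merge rounds
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
# ['refu', 'n', 'd', ' ', 'refu', 'n', 'd', ' ', 'ref', 'i', 'l', 'l']
```

Three merges shrink the sequence from 20 symbols to 12. With many more merges learned on a large corpus, frequent words end up as one or two tokens.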

In [None]:
import tiktoken
import torch

# 1️⃣ Initialize the BPE encoder (GPT-2 vocab)
enc = tiktoken.get_encoding('gpt2')

# 2️⃣ Replace character-level stoi/itos with BPE encode/decode
encode = lambda s: enc.encode(s)        # str → list[int] (subword IDs)
decode = lambda ids: enc.decode(ids)    # list[int] → str

# 3️⃣ Example usage
sample_text = "LLM"
encoded_ids = encode(sample_text)
decoded_text = decode(encoded_ids)

print("Original text: ", sample_text)
print("Encoded IDs:   ", encoded_ids)
print("Decoded text:  ", decoded_text)

# 4️⃣ Integrate into your dataset pipeline
with open('customer_support_data.txt', encoding='utf-8') as f:
    full_text = f.read()
data_ids = encode(full_text)                  # tokenize entire corpus
data     = torch.tensor(data_ids, dtype=torch.long)
vocab_size = enc.n_vocab
print("Vocab size:", vocab_size)


Original text: LLM
Encoded: [43, 43, 44]
Decoded: LLM


### ✏️ Step 4: Encode the Entire Dataset

Now that we have defined our **encoder** and **decoder**, we can use them to convert the **entire text dataset** into a sequence of integers. This forms the actual training data that we’ll feed into the language model.


In [None]:
import torch  # We use PyTorch: https://pytorch.org

# Encode the entire text using our BPE encoder
data = torch.tensor(encode(text), dtype=torch.long)

# Print tensor information
print("Data shape:", data.shape)
print("Data type:", data.dtype)

# Preview first 1000 encoded tokens
print("First 1000 tokens:\n", data[:1000])

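The output below reflects a character-level encoding (one ID per character, e.g. `5, 5, 5` for `###`); with the GPT-2 BPE encoder the length and IDs will differ.

Data shape: torch.Size([19240191])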
Data type: torch.int64
First 1000 tokens:
 tensor([ 5,  5,  5,  2, 40, 75, 80, 81, 79, 82, 64, 81, 70, 76, 75, 27,  1, 78,
        82, 66, 80, 81, 70, 76, 75,  2, 62, 63, 76, 82, 81,  2, 64, 62, 75, 64,
        66, 73, 73, 70, 75, 68,  2, 76, 79, 65, 66, 79,  2, 88, 88, 46, 79, 65,
        66, 79,  2, 45, 82, 74, 63, 66, 79, 89, 89,  1,  1,  5,  5,  5,  2, 49,
        66, 80, 77, 76, 75, 80, 66, 27,  1, 40,  8, 83, 66,  2, 82, 75, 65, 66,
        79, 80, 81, 76, 76, 65,  2, 86, 76, 82,  2, 69, 62, 83, 66,  2, 62,  2,
        78, 82, 66, 80, 81, 70, 76, 75,  2, 79, 66, 68, 62, 79, 65, 70, 75, 68,
         2, 64, 62, 75, 64, 66, 73, 70, 75, 68,  2, 76, 79, 65, 66, 79,  2, 88,
        88, 46, 79, 65, 66, 79,  2, 45, 82, 74, 63, 66, 79, 89, 89, 13,  2, 62,
        75, 65,  2, 40,  8, 74,  2, 69, 66, 79, 66,  2, 81, 76,  2, 77, 79, 76,
        83, 70, 65, 66,  2, 86, 76, 82,  2, 84, 70, 81, 69,  2, 81, 69, 66,  2,
        70, 75, 67, 76, 79, 74, 62, 81, 70

### 🌐 Step 5: Train/Validation Split

Split data into training (90%) and validation (10%) sets.

In [8]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

### 🚗 Step 6: Create Data Batches

So far, we've seen how a **single sequence (or chunk)** of text helps a model learn by predicting the **next token** at every position. But training on one sequence at a time is inefficient. To train faster and learn more general patterns, we move to **batching**.

In [None]:
import torch

# Set random seed for reproducibility
torch.manual_seed(1337)

batch_size = 4   # How many sequences to process in parallel
block_size = 128   # Length of each sequence (context window)

def get_batch(split):
    """
    Samples a mini-batch of input (x) and target (y) sequences from the dataset.

    Args:
        split (str): One of 'train' or 'val' to choose the dataset split.

    Returns:
        x (torch.Tensor): Input sequences of shape (batch_size, block_size)
        y (torch.Tensor): Target sequences of shape (batch_size, block_size)
                          Each y[i, t] is the next token after x[i, t]
    """
    assert split in ['train', 'val'], "split must be 'train' or 'val'"

    data_source = train_data if split == 'train' else val_data

    # Randomly sample starting indices for each sequence
    start_indices = torch.randint(0, len(data_source) - block_size, (batch_size,))

    # Build input and target tensors using slicing
    x = torch.stack([data_source[i:i + block_size] for i in start_indices])
    y = torch.stack([data_source[i + 1:i + block_size + 1] for i in start_indices])

    return x, y
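The slicing in `get_batch` can be illustrated on a tiny list. The snippet below is a sketch with made-up token IDs and a hypothetical helper `make_example`. It shows that `y` is `x` shifted by one, and it contrasts what a bigram model and a full-context model would condition on for each target.

```python
def make_example(ids, start, block_size):
    """Slice one (x, y) pair: y is x shifted one token to the right."""
    x = ids[start : start + block_size]
    y = ids[start + 1 : start + block_size + 1]
    return x, y

ids = [10, 11, 12, 13, 14, 15]          # made-up token IDs
x, y = make_example(ids, start=1, block_size=4)
print(x)  # [11, 12, 13, 14]
print(y)  # [12, 13, 14, 15]

# One chunk yields block_size training targets, one per position.
for t in range(len(x)):
    context = x[: t + 1]
    print(f"bigram sees {x[t]} | full-context model sees {context} -> target {y[t]}")
```

Here `start=1` is the largest valid start index, since `y` needs one token beyond the end of `x`. That is why `get_batch` samples start indices below `len(data_source) - block_size`.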

In [10]:
# Generate a training batch
xb, yb = get_batch('train')

# Inspect the shape of the input and target tensors
print("🧮 Input batch shape:", xb.shape)   # Expected: (batch_size, block_size) = (4, 128)
print("🧮 Target batch shape:", yb.shape) # Expected: (batch_size, block_size) = (4, 128)

# View actual data
print("\n🧾 Inputs (xb):")
print(xb)

print("\n🎯 Targets (yb):")
print(yb)

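The output below was captured with `block_size = 8` and character-level IDs; with `batch_size = 4` and `block_size = 128` as set above, both shapes are `(4, 128)`.

🧮 Input batch shape: torch.Size([4, 8])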
🧮 Target batch shape: torch.Size([4, 8])

🧾 Inputs (xb):
tensor([[62, 81,  2, 86, 76, 82,  8, 79],
        [79, 66, 65, 66, 75, 81, 70, 62],
        [ 2, 82, 75, 65, 66, 79, 80, 81],
        [84, 70, 81, 69,  2, 81, 69, 66]])

🎯 Targets (yb):
tensor([[81,  2, 86, 76, 82,  8, 79, 66],
        [66, 65, 66, 75, 81, 70, 62, 73],
        [82, 75, 65, 66, 79, 80, 81, 62],
        [70, 81, 69,  2, 81, 69, 66,  2]])


#### Bigram Language Model

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)

# ✅ Define the Bigram LM with Token + Positional Embeddings
class BigramWithPos(nn.Module):
    def __init__(self, vocab_size, emb_size, block_size):
        super().__init__()
        # token & positional embeddings
        self.tok_emb = nn.Embedding(vocab_size, emb_size)
        self.pos_emb = nn.Embedding(block_size, emb_size)
        # final head to project back to vocab logits
        self.lm_head = nn.Linear(emb_size, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # embed tokens and positions
        tok = self.tok_emb(idx)                                        # (B, T, E)
        pos = self.pos_emb(torch.arange(T, device=idx.device))[None]   # (1, T, E)
        x   = tok + pos                                                 # (B, T, E)

        # compute logits
        logits = self.lm_head(x)                                        # (B, T, V)

        # compute loss if targets are provided
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(B * T, -1),
                targets.view(B * T)
            )
        return logits, loss

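    def generate(self, idx, max_new_tokens):
        # autoregressively sample from the model
        max_len = self.pos_emb.num_embeddings   # number of positions we can embed (block_size)
        for _ in range(max_new_tokens):
            # crop the context so T never exceeds the positional table
            idx_cond   = idx[:, -max_len:]                             # (B, <=block_size)
            logits, _  = self(idx_cond)                                # (B, T, V)
            probs      = F.softmax(logits[:, -1, :], dim=-1)           # (B, V)
            next_id    = torch.multinomial(probs, num_samples=1)       # (B, 1)
            idx        = torch.cat([idx, next_id], dim=1)              # (B, T+1)
        return idx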

# ── Instantiate & Test ────────────────────────────────────────────────────────
block_size = 128
emb_size   = 128

model = BigramWithPos(vocab_size, emb_size, block_size)

# test forward and loss
logits, loss = model(xb, yb)
print("Logits shape:", logits.shape)  # (B, T, V)
print("Loss:", loss.item())

# generate sample text
context = torch.zeros((1, 1), dtype=torch.long)
sample_ids = model.generate(context, max_new_tokens=100)[0].tolist()
print("\nGenerated text:\n", decode(sample_ids))
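Two quick sanity checks for the cell above, written as plain arithmetic. The sketch assumes `vocab_size` is 50257 (the value tiktoken reports for the GPT-2 encoding) and the `emb_size` and `block_size` of 128 used in this notebook.

```python
import math

vocab_size = 50257   # GPT-2 BPE vocabulary size (enc.n_vocab)
emb_size   = 128
block_size = 128

tok_emb_params = vocab_size * emb_size               # nn.Embedding(vocab_size, emb_size)
pos_emb_params = block_size * emb_size               # nn.Embedding(block_size, emb_size)
lm_head_params = emb_size * vocab_size + vocab_size  # nn.Linear weight + bias
total_params   = tok_emb_params + pos_emb_params + lm_head_params

# A model that spreads probability evenly over the vocabulary scores ln(V).
uniform_loss = math.log(vocab_size)

print(f"total parameters: {total_params:,}")      # 12,932,433
print(f"uniform-guess loss: {uniform_loss:.2f}")  # 10.82
```

An untrained model usually starts somewhat above the uniform-guess value, because random logits are not perfectly flat. A first loss far from this range is a hint that something in the pipeline is off.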


#### Train the Model

In [None]:
import torch
import torch.nn.functional as F

# ── Hyperparameters ───────────────────────────────────────────────────────────
batch_size    = 32
block_size    = 128
emb_size      = 128
learning_rate = 3e-4
num_steps     = 50
ema_alpha     = 0.99


optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
ema_loss = None

for step in range(1, num_steps + 1):
    xb, yb = get_batch('train')              # (B, T)
    _, loss = model(xb, targets=yb)          # forward + loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    raw = loss.item()
    ema_loss = raw if ema_loss is None else ema_alpha * ema_loss + (1 - ema_alpha) * raw

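    # num_steps is 50, so log every 10 steps (a 100-step interval would never print)
    if step % 10 == 0 or step == num_steps:
        print(f"Step {step:4d} | raw loss = {raw:.4f} | ema loss = {ema_loss:.4f}")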
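The smoothing factor interacts with the number of steps. The sketch below replays the same EMA update on a made-up loss curve, using a hypothetical helper `ema_update`, to show how slowly an `ema_alpha` of 0.99 reacts over a 50-step run compared with 0.9.

```python
def ema_update(prev, value, alpha):
    """One step of the exponential moving average used in the training loop."""
    return value if prev is None else alpha * prev + (1 - alpha) * value

# Made-up loss curve: starts at 10.0 and drops to 5.0 immediately afterwards.
losses = [10.0] + [5.0] * 50

final = {}
for alpha in (0.99, 0.9):
    ema = None
    for raw in losses:
        ema = ema_update(ema, raw, alpha)
    final[alpha] = ema
    print(f"alpha={alpha}: ema after {len(losses)} steps = {ema:.3f}")
```

With `alpha = 0.99` the average is still around 8.0 after 50 steps at a true loss of 5.0, while `alpha = 0.9` has nearly converged. For short runs, a smaller alpha or the raw loss is likely more informative.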

#### Inference with the Model

In [None]:
model.eval()
with torch.no_grad():
    xb_val, yb_val = get_batch('val')
    _, val_loss = model(xb_val, targets=yb_val)
print(f"\nValidation loss: {val_loss.item():.4f}")



model.eval()

# we won’t need gradients
with torch.no_grad():
    context = torch.zeros((1, 1), dtype=torch.long, device=next(model.parameters()).device)
    
    out_ids = model.generate(context, max_new_tokens=100)[0].tolist()
    
print(decode(out_ids))
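The sampling step inside `generate` (softmax over the last position's logits, then a multinomial draw) can be mimicked without PyTorch. This is a stdlib-only sketch with made-up logits. `random.choices` stands in for `torch.multinomial`, and the helpers are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng):
    """Draw one token id in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(1337)
logits = [2.0, 1.0, 0.1, -1.0]   # made-up logits for a 4-token vocabulary
probs = softmax(logits)
draws = [sample_next(logits, rng) for _ in range(1000)]
print([round(p, 3) for p in probs])
print([draws.count(i) for i in range(len(logits))])
```

The draw counts roughly track the probabilities, which is why sampled text varies between runs unless the random seed is fixed.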
