# Tiny Transformer Language Model (Character-Level)

This notebook implements a **minimal Transformer language model** from scratch for **educational purposes**:

- Character-level dataset
- Token + positional embeddings
- Masked self-attention (single head)
- Transformer block (attention + feedforward + residual + layer norm)
- Tiny training loop
- Text generation

Feel free to modify the sample text, model size, or training steps for experiments.

In [1]:
# Imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import random

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using device:', device)

Using device: cpu


## 1. Build a tiny character-level dataset

We use a short text, build a character vocabulary, and create (context, next-character) pairs.


In [2]:
# Sample text data (you can replace this with anything you like)
text = "Hello world. This is a simple transformer demo."

# 1) Build vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
print('Vocabulary:', chars)
print('vocab_size =', vocab_size)

char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

# 2) Encode full text as integer tensor
data = torch.tensor([char_to_idx[ch] for ch in text], dtype=torch.long)
print('Encoded text (first 50 tokens):', data[:50].tolist())

# 3) Create training examples (context -> next char)
block_size = 16  # maximum context length

def get_batch(batch_size=32):
    """Return a batch of (x, y) where:
    - x: [batch_size, block_size]
    - y: [batch_size, block_size]
    y is x shifted by one position (next-character prediction).
    """
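    # randint's upper bound is exclusive, so this covers every valid start index
    # (with the extra "- 1" the final character would never appear as a target)
    idx = torch.randint(len(data) - block_size, (batch_size,))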
    x = torch.stack([data[i:i+block_size] for i in idx])
    y = torch.stack([data[i+1:i+1+block_size] for i in idx])
    return x.to(device), y.to(device)

# Quick sanity check
xb, yb = get_batch(batch_size=2)
print('Example batch x:', xb)
print('Example batch y:', yb)

Vocabulary: [' ', '.', 'H', 'T', 'a', 'd', 'e', 'f', 'h', 'i', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'w']
vocab_size = 19
Encoded text (first 50 tokens): [2, 6, 10, 10, 13, 0, 18, 13, 15, 10, 5, 1, 0, 3, 8, 9, 16, 0, 9, 16, 0, 4, 0, 16, 9, 11, 14, 10, 6, 0, 17, 15, 4, 12, 16, 7, 13, 15, 11, 6, 15, 0, 5, 6, 11, 13, 1]
Example batch x: tensor([[16,  0,  4,  0, 16,  9, 11, 14, 10,  6,  0, 17, 15,  4, 12, 16],
        [16,  0,  9, 16,  0,  4,  0, 16,  9, 11, 14, 10,  6,  0, 17, 15]])
Example batch y: tensor([[ 0,  4,  0, 16,  9, 11, 14, 10,  6,  0, 17, 15,  4, 12, 16,  7],
        [ 0,  9, 16,  0,  4,  0, 16,  9, 11, 14, 10,  6,  0, 17, 15,  4]])
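
To make the shift-by-one relationship concrete, here is a minimal sketch in plain Python (lists instead of tensors) that rebuilds the vocabulary and decodes one window back to text. The helpers `encode` and `decode` and the start index `16` are illustrative choices, not part of the notebook; that start index happens to reproduce the second row of the example batch above.

```python
text = "Hello world. This is a simple transformer demo."

# Same vocabulary construction as in the cell above
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

def encode(s):
    """Map a string to a list of integer token IDs (hypothetical helper)."""
    return [char_to_idx[ch] for ch in s]

def decode(ids):
    """Map a list of token IDs back to a string (hypothetical helper)."""
    return ''.join(idx_to_char[i] for i in ids)

block_size = 16
data = encode(text)

start = 16  # arbitrary start index chosen for illustration
x = data[start:start + block_size]          # context window
y = data[start + 1:start + 1 + block_size]  # same window shifted right by one

print(repr(decode(x)))  # 's is a simple tr'
print(repr(decode(y)))  # ' is a simple tra'
```

At every position `t`, `y[t]` is the character that follows `x[t]` in the text, so a single window supplies `block_size` training targets at once.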


## 2. Token + positional embedding layer

We embed token IDs and add positional embeddings (so the model knows *where* each token is in the sequence).

In [3]:
class TokenAndPositionalEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, max_len):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x):
        """x: [batch_size, seq_len] (token indices)
        returns: [batch_size, seq_len, d_model]
        """
        B, T = x.shape
        token_vecs = self.token_emb(x)  # [B, T, d_model]
        positions = torch.arange(T, device=x.device).unsqueeze(0)  # [1, T]
        pos_vecs = self.pos_emb(positions)  # [1, T, d_model]
        return token_vecs + pos_vecs

# Tiny test
embed_test = TokenAndPositionalEmbedding(vocab_size, d_model=8, max_len=block_size).to(device)
out = embed_test(xb)
print('Embedding output shape:', out.shape)

Embedding output shape: torch.Size([2, 16, 8])
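
As a rough illustration of what the addition does, the sketch below uses plain Python lists and made-up 2-dimensional lookup tables (`token_table` and `pos_table` are hypothetical, with hand-picked values). The same character at two different positions ends up with two different input vectors, which is how the model can tell them apart.

```python
# Hypothetical lookup tables; in the notebook these are learned nn.Embedding weights
token_table = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
pos_table = [[0.1, 0.1], [0.2, -0.2], [0.3, 0.3]]

def embed(tokens):
    """Add the position vector for index t to the token vector at index t."""
    return [
        [tok + pos for tok, pos in zip(token_table[ch], pos_table[t])]
        for t, ch in enumerate(tokens)
    ]

vectors = embed("aba")

# 'a' appears at positions 0 and 2 but receives different vectors
print(vectors[0])
print(vectors[2])
```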


## 3. Single-head masked self-attention

This is the core idea of a Transformer: each position attends to itself and to earlier positions, never to later ones (causal mask).

In [4]:
class SelfAttentionHead(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.key   = nn.Linear(d_model, d_model, bias=False)
        self.query = nn.Linear(d_model, d_model, bias=False)
        self.value = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        """x: [B, T, d_model]"""
        B, T, C = x.shape
        K = self.key(x)   # [B, T, C]
        Q = self.query(x) # [B, T, C]
        V = self.value(x) # [B, T, C]

        # Scaled dot-product attention
        scores = Q @ K.transpose(-2, -1)  # [B, T, T]
        scores = scores / math.sqrt(C)

        # Causal mask: prevent looking into the future
        mask = torch.tril(torch.ones(T, T, device=x.device))
        scores = scores.masked_fill(mask == 0, float('-inf'))

        attn = F.softmax(scores, dim=-1)  # [B, T, T]

        # Weighted sum of values
        out = attn @ V  # [B, T, C]
        return out

# Quick test
attn_test = SelfAttentionHead(d_model=8).to(device)
attn_out = attn_test(out)
print('Attention output shape:', attn_out.shape)

Attention output shape: torch.Size([2, 16, 8])
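
The effect of the causal mask is easiest to see on a tiny example. The following sketch, written in plain Python under the assumption that the raw scores are all zero, applies the same two steps as the cell above: set future positions to `-inf`, then take a softmax over each row. The function name `causal_softmax` is illustrative.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a T x T score matrix with future positions masked."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Positions j > t lie in the future, so their scores become -inf
        row = [scores[t][j] if j <= t else float('-inf') for j in range(T)]
        m = max(row)                           # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # assumed: every raw score equal
attn = causal_softmax(scores)
for row in attn:
    print(row)
# Row 0 puts all weight on position 0, row 1 splits it over positions 0 and 1,
# and row 2 spreads it evenly over all three positions.
```

Each row still sums to 1, and the upper triangle is exactly zero, so the output at position `t` is a weighted average of the value vectors at positions `0..t` only.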


## 4. Transformer block (Attention + Feedforward)

Each block:
1. LayerNorm → Self-Attention → Residual
2. LayerNorm → Feedforward (MLP) → Residual

In [5]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = SelfAttentionHead(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Self-attention with residual
        x_norm = self.ln1(x)
        attn_out = self.attn(x_norm)
        x = x + attn_out

        # Feedforward with residual
        x_norm = self.ln2(x)
        ff_out = self.ff(x_norm)
        x = x + ff_out
        return x

# Quick test
block_test = TransformerBlock(d_model=8, d_ff=16).to(device)
block_out = block_test(attn_out)
print('Block output shape:', block_out.shape)

Block output shape: torch.Size([2, 16, 8])
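
The two-step pattern above (normalize, apply a sublayer, add the result back) can be sketched on a single vector in plain Python. This is a simplified illustration: `layer_norm` here omits the learnable scale and shift that `nn.LayerNorm` has, and `residual` is a hypothetical helper name.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual(x, sublayer):
    """Pre-norm residual connection: x + sublayer(layer_norm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)

# A sublayer that outputs zeros leaves x untouched, so each block starts out
# close to the identity and only adds a correction on top of its input.
zero_sublayer = lambda h: [0.0] * len(h)
print(residual(x, zero_sublayer))  # [1.0, 2.0, 3.0, 4.0]
```

Because the normalization sits inside the residual branch, the unnormalized `x` flows straight through the addition, which tends to make stacks of such blocks easier to train.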


## 5. Full tiny Transformer language model

We stack:
- Token+position embedding
- Several Transformer blocks
- Final linear layer to predict next-token logits

In [6]:
class TinyTransformerLM(nn.Module):
    def __init__(self, vocab_size, d_model=64, n_layers=2, d_ff=128, max_len=block_size):
        super().__init__()
        self.embed = TokenAndPositionalEmbedding(vocab_size, d_model, max_len)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, d_ff) for _ in range(n_layers)
        ])
        self.ln_final = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        """idx: [B, T] token indices
        returns: [B, T, vocab_size] logits
        """
        x = self.embed(idx)
        for block in self.blocks:
            x = block(x)
        x = self.ln_final(x)
        logits = self.head(x)
        return logits

    @torch.no_grad()
    def generate(self, idx, max_new_tokens=50):
        """Autoregressive generation starting from idx [1, T]."""
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.embed.pos_emb.num_embeddings:]  # crop to the model's own max_len
            logits = self(idx_cond)
            logits_last = logits[:, -1, :]  # last time step
            probs = F.softmax(logits_last, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx

# Instantiate model
model = TinyTransformerLM(vocab_size).to(device)
print(model)

# Test forward pass
logits_test = model(xb)
print('Logits shape:', logits_test.shape)

TinyTransformerLM(
  (embed): TokenAndPositionalEmbedding(
    (token_emb): Embedding(19, 64)
    (pos_emb): Embedding(16, 64)
  )
  (blocks): ModuleList(
    (0-1): 2 x TransformerBlock(
      (ln1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (attn): SelfAttentionHead(
        (key): Linear(in_features=64, out_features=64, bias=False)
        (query): Linear(in_features=64, out_features=64, bias=False)
        (value): Linear(in_features=64, out_features=64, bias=False)
      )
      (ff): Sequential(
        (0): Linear(in_features=64, out_features=128, bias=True)
        (1): ReLU()
        (2): Linear(in_features=128, out_features=64, bias=True)
      )
    )
  )
  (ln_final): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
  (head): Linear(in_features=64, out_features=19, bias=True)
)
Logits shape: torch.Size([2, 16, 19])
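
To get a feel for how small this model is, the parameter count can be worked out by hand from the printed module tree. The sketch below is plain Python arithmetic using the default sizes shown above; the helpers `linear` and `layer_norm` are illustrative names that only count weights.

```python
vocab_size, d_model, n_layers, d_ff, max_len = 19, 64, 2, 128, 16

def linear(n_in, n_out, bias=True):
    """Parameter count of a linear layer: weight matrix plus optional bias."""
    return n_in * n_out + (n_out if bias else 0)

def layer_norm(d):
    """Parameter count of a layer norm: one scale and one shift per feature."""
    return 2 * d

embed = vocab_size * d_model + max_len * d_model    # token + positional tables
attn = 3 * linear(d_model, d_model, bias=False)     # key, query, value
ff = linear(d_model, d_ff) + linear(d_ff, d_model)  # two-layer MLP
block = 2 * layer_norm(d_model) + attn + ff
head = layer_norm(d_model) + linear(d_model, vocab_size)  # ln_final + output layer

total = embed + n_layers * block + head
print(total)  # 61843
```

In the notebook, `sum(p.numel() for p in model.parameters())` should report the same total.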


## 6. Training loop

We train with cross-entropy loss on next-character prediction. The corpus is a single 47-character sentence, so the model can do little more than memorize it; treat this run as a demonstration of the mechanics, not of text quality.

In [7]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

num_steps = 1000  # keep small for demo; increase if you like

for step in range(num_steps):
    model.train()
    xb, yb = get_batch(batch_size=32)

    logits = model(xb)
    B, T, V = logits.shape
    loss = F.cross_entropy(logits.view(B*T, V), yb.view(B*T))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 100 == 0:
        print(f"step {step}, loss = {loss.item():.4f}")

step 0, loss = 3.0165
step 100, loss = 0.1509
step 200, loss = 0.0669
step 300, loss = 0.0715
step 400, loss = 0.0643
step 500, loss = 0.0581
step 600, loss = 0.0551
step 700, loss = 0.0527
step 800, loss = 0.0509
step 900, loss = 0.0500
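
A quick sanity check on these numbers: a model that spreads its probability evenly over all 19 characters should have a loss of ln(19), which is close to the value logged at step 0. The sketch below computes this in plain Python; `cross_entropy` is a single-example reimplementation written for illustration, not the PyTorch function.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 19

# Equal logits mean a uniform prediction over the vocabulary
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)
print(f"{uniform_loss:.4f}")  # 2.9444

# exp(loss) is the perplexity; at the final logged loss of 0.0500 it is about 1.05,
# i.e. the model is almost certain about each next character
print(f"{math.exp(0.0500):.3f}")  # 1.051
```

A step-0 loss far above ln(vocab_size) would usually point to a problem with initialization or with how the targets are aligned.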


## 7. Text generation

We seed the model with some initial characters and let it generate more, one character at a time.

In [8]:
model.eval()

start_text = "Hello world."
start_idx = torch.tensor([[char_to_idx[ch] for ch in start_text]], device=device)

generated_idx = model.generate(start_idx, max_new_tokens=80)[0].cpu().tolist()
generated_text = ''.join(idx_to_char[i] for i in generated_idx)

print('Seed:')
print(start_text)
print('\nGenerated text:')
print(generated_text)

Seed:
Hello world.

Generated text:
Hello world. This is a simple transformer demormer d. This is a simple transformer demor del
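
The sampling step inside `generate` (softmax over the last position's logits, then one random draw) can be sketched in plain Python as follows. The `temperature` parameter is an addition for illustration and is not in the notebook, which effectively samples at temperature 1; the logits are made up.

```python
import math
import random

def sample_next(logits, temperature=1.0, rng=random):
    """Draw one index from softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract the max for stability
    weights = [math.exp(z - m) for z in scaled]  # unnormalized probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)    # fixed seed so the run is repeatable
logits = [2.0, 1.0, 0.1]  # hypothetical logits for a 3-token vocabulary

samples = [sample_next(logits, rng=rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(3)]
print(counts)  # index 0 is drawn most often, index 2 least often

# A very low temperature sharpens the distribution towards the argmax
print(sample_next(logits, temperature=0.01, rng=rng))  # 0
```

Sampling instead of always taking the most likely character is why repeated runs of the generation cell can produce different continuations.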


## 8. Inspect token embeddings ("character embeddings")

As in real LLMs, each token (here, a character) is mapped to a dense vector. Let's inspect a few of them.

In [9]:
with torch.no_grad():
    emb_weights = model.embed.token_emb.weight  # [vocab_size, d_model]
    print('Embedding matrix shape:', emb_weights.shape)

    for ch in ['H', 'e', 'o', ' ']:
        if ch in char_to_idx:
            idx = char_to_idx[ch]
            vec = emb_weights[idx]
            print(f"\nCharacter '{ch}' (index {idx}) embedding vector (first 10 dims):")
            print(vec[:10])

Embedding matrix shape: torch.Size([19, 64])

Character 'H' (index 2) embedding vector (first 10 dims):
tensor([-0.0954,  1.3438,  1.4395,  0.9203,  0.8983,  1.2144, -0.3915,  0.1463,
        -0.7597,  1.8002], requires_grad=True)

Character 'e' (index 6) embedding vector (first 10 dims):
tensor([ 0.9669,  0.4964, -1.8346, -0.2491, -0.8514, -1.5402, -0.8203,  0.4805,
         0.2247, -2.4325], requires_grad=True)

Character 'o' (index 13) embedding vector (first 10 dims):
tensor([ 0.6131,  2.3747, -0.3250, -0.6984,  0.1271,  1.7151,  0.3394,  0.6249,
         0.6121, -2.5168], requires_grad=True)

Character ' ' (index 0) embedding vector (first 10 dims):
tensor([-0.1127, -0.1695, -1.0145,  0.0959, -0.6880,  0.1390,  1.1265,  0.3472,
        -0.5273, -0.3755], requires_grad=True)
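
One common way to compare two embedding vectors is cosine similarity. The sketch below is plain Python and uses only the first 10 dimensions printed above for 'H' and 'e', so its result is a rough illustration and not the similarity of the full 64-dimensional vectors. The function name `cosine` is an illustrative choice.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# First 10 dimensions copied from the printed output above
emb_H = [-0.0954, 1.3438, 1.4395, 0.9203, 0.8983, 1.2144, -0.3915, 0.1463, -0.7597, 1.8002]
emb_e = [0.9669, 0.4964, -1.8346, -0.2491, -0.8514, -1.5402, -0.8203, 0.4805, 0.2247, -2.4325]

print(cosine(emb_H, emb_H))  # a vector compared with itself gives 1 (up to rounding)
print(cosine(emb_H, emb_e))  # negative for these two truncated vectors
```

In the notebook itself, `F.cosine_similarity(emb_weights[i], emb_weights[j], dim=0)` should give the corresponding value over all 64 dimensions.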
