# Baby GPT training notebook (single-file script / runnable in a Jupyter cell or as a .py)
# Filename: baby_gpt_notebook.py
# Purpose: Train a tiny decoder-only transformer (baby GPT) on sample soccer text.
# Notes:
# - Runs on CPU or GPU: the script moves the model to CUDA when torch.cuda.is_available().
# - Nothing is installed automatically; uncomment the pip line in step 1 in a fresh environment.
# - Uses HuggingFace tokenizer for convenience.

# %%
# 1) Install dependencies (run once)
# If running in Colab or a fresh env, uncomment the following pip commands.
# In a local environment you might already have these packages installed.

# !pip install -q transformers torch tqdm  # `datasets` is not imported by this script

# %%
# 2) Imports
import os
import math
import time
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from transformers import GPT2TokenizerFast
from tqdm.auto import tqdm

# %%
# 3) Tiny dataset (we'll use the 1000-word soccer text + a few synthetic match lines)
# You can replace `raw_text` with any large text file for better results.
raw_text = '''
Soccer, known as football outside North America, is the world’s most popular sport — a
simple game with rich complexity. At its core, soccer is played between two teams of
eleven players on a rectangular pitch with the objective of moving a spherical ball
into the opponent’s net more times than the opponent during a fixed interval. Yet a
single match contains dozens of interacting systems: individual skill, team tactics,
coaches’ game plans, referee decisions, and the unpredictable variables of weather,
pitch condition, and fan atmosphere. These layers make soccer both an accessible
pastime and a deep subject for analysis.

From grassroots to elite levels, soccer is shaped by its rules and by the continuous
evolution of tactics. The Laws of the Game, maintained by the International Football
Association Board (IFAB), provide the framework — fouls, offside, substitutions, and
restart procedures — while coaches constantly innovate within that framework. Over the
past decades, tactical trends have come and gone: from rigid formations to fluid,
positionless systems; from deep defensive blocks to intense, high-pressing attacks
that aim to win the ball high up the pitch. Modern teams often blend approaches,
shifting shapes dynamically depending on game state, opponent, and available
personnel.

Players are the sport’s primary storytellers. Technical skills such as first touch,
passing range, dribbling, and finishing create moments of magic, while physical
attributes — speed, strength, and stamina — determine whether a player can execute
repeated high-intensity actions across 90 minutes. Yet soccer prizes decision-making
and spatial intelligence arguably above pure athleticism. The elite practitioners
read the game: they anticipate runs, manage tempo, and choose when to accelerate or
conserve energy. Young players are typically developed through a mix of deliberate
practice and game intelligence training, where coaches encourage pattern recognition
and variability of practice rather than rote repetition.

Clubs and national teams operate inside an ecosystem shaped by youth academies,
scouting networks, sports science, and data analytics. Academies are talent pipelines,
using structured training programs to develop technical foundations and tactical
understanding. Scouting extends this pipeline, combining in-person observation with
video and data-driven scouting to find undervalued prospects. Sports science and
medical teams optimize player load, recovery, and nutrition to reduce injury risk
and maximize performance. In recent years, data analytics has exploded in influence:
tracking data, event logs, and advanced metrics are used to evaluate players, plan
tactics, and inform recruitment. Analysts convert raw event streams — passes, shots,
tackles, positional coordinates — into actionable insights that coaches use to
gain marginal advantages.

Competitions are central to soccer’s global culture. Domestic leagues provide weekly
drama across seasons, continental competitions like the UEFA Champions League
assemble elite clubs from different countries, and international tournaments such as
the FIFA World Cup capture the world’s attention periodically. Each competition has
different incentives: domestic leagues prize consistency over 38–40 matches, cup
competitions reward knockout efficiency, and international tournaments reward
short-term peak performance. Fans scaffold these competitions with rituals — chants,
scarves, matchday foods — creating tribal identities that amplify the stakes and
intensity.

The economics of soccer are complex and increasingly globalized. Broadcasting rights,
sponsorships, and commercial partnerships bring revenue that determines clubs’ transfer
budgets and wage structures. While elite clubs can leverage vast revenue streams to
assemble world-class squads, economic disparity exists between top-tier clubs and
smaller clubs whose survival often depends on player development and smart trading.
Financial regulations, such as spending rules and licensing, aim to promote stability
but are not universally enforced in the same way across leagues.

Technology has also reshaped how the game is played and adjudicated. Video Assistant
Referee (VAR) systems aim to reduce clear errors for goals, penalties, red cards, and
identity mistakes. Wearable GPS and inertial measurement systems provide teams with
fine-grained load data. Broadcasting innovations and camera systems enhance the
viewer experience and make previously hidden tactical elements visible. Meanwhile,
fan engagement has expanded into social media, fantasy sports, and interactive
viewing experiences that further entrench soccer’s cultural footprint.

Soccer’s social role cannot be overstated. It fosters community cohesion, provides
pathways for social mobility, and serves as a platform for social and political
expression. From local youth clubs to global campaigns for equality and anti-racism,
soccer often intersects with broader societal issues. Its global reach makes it an
effective vehicle for cultural exchange, yet it also concentrates power and attention
among a limited number of clubs and players.

For researchers and engineers working with soccer data, the richness of the sport
translates into many modeling opportunities: event prediction, player valuation,
tactical pattern discovery, and generative tasks like simulating plausible match
narratives or commentary. The success of those models depends on data quality,
representativeness, and careful feature engineering. Event-level data and tracking
data enable complementary analyses: event data is ideal for understanding actions and
outcomes, while tracking data captures movement and space control.

In short, soccer is a deceptively simple game whose surface belies a deep,
interconnected system. Whether you love the sport for the drama of a late winner,
the elegance of a well-worked team move, or the intellectual puzzle of tactical
analysis, soccer offers experiences that are simultaneously universal and endlessly
detailed.
'''

# Add some structured match lines to increase pattern variety
sample_matches = [
    "2024-08-10 — Arsenal 2-1 Manchester City; possession 54-46; shots 12-9.",
    "2024-08-10 — Manchester United 1-1 Chelsea; possession 49-51; shots 10-11.",
    "2024-08-11 — Real Madrid 3-2 Barcelona; possession 60-40; shots 15-13.",
    "2024-08-12 — Juventus 0-0 AC Milan; possession 47-53; shots 8-7.",
]

full_text = raw_text + "\n\n" + "\n".join(sample_matches)

# %%
# 4) Tokenizer (the pretrained GPT-2 byte-level BPE tokenizer is convenient)
# Its vocabulary (about 50k entries) is far larger than this tiny corpus needs, so most
# embedding rows never receive a useful training signal. To train your own tokenizer on a
# larger corpus, replace this with e.g. ByteLevelBPETokenizer from the `tokenizers` library.

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
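# GPT-2 ships without a pad token, so add one. The sliding-window loader below never pads,
# but the extra token is counted in len(tokenizer) and therefore in the model's vocab size.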
special_tokens = {"pad_token": "<|pad|>"}
tokenizer.add_special_tokens(special_tokens)

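# Tokenize the full text in one call. The tokenizer may warn that the sequence exceeds
# GPT-2's 1024-token limit; that is harmless here because we only ever feed the model
# block_size-token windows.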
enc = tokenizer(full_text)
input_ids = torch.tensor(enc["input_ids"], dtype=torch.long)

print(f"Tokenized dataset length (tokens): {input_ids.size(0)}")

# %%
# 5) Prepare dataset with sliding windows (typical for language modeling)
class TextDataset(Dataset):
    def __init__(self, tokens, block_size=128):
        self.tokens = tokens
        self.block_size = block_size

    def __len__(self):
        # number of training examples from sliding window
        return max(0, self.tokens.size(0) - self.block_size)

    def __getitem__(self, idx):
        x = self.tokens[idx : idx + self.block_size]
        y = self.tokens[idx + 1 : idx + 1 + self.block_size]
        return x, y

block_size = 128
dataset = TextDataset(input_ids, block_size=block_size)
loader = DataLoader(dataset, batch_size=8, shuffle=True)
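
# %%
# 5b) Illustrative sketch (not part of the original pipeline): how the sliding window above
# pairs inputs with targets, and how many windows and batches it yields. Pure Python on a
# toy list; `count_windows`, `steps_per_epoch`, and the `toy_*` names are hypothetical and
# used only in this cell.
import math


def count_windows(n_tokens, block_size):
    """Number of (x, y) pairs, matching TextDataset.__len__."""
    return max(0, n_tokens - block_size)


def steps_per_epoch(n_windows, batch_size):
    """Batches per epoch for a DataLoader with drop_last=False (the default)."""
    return math.ceil(n_windows / batch_size)


toy_tokens = list(range(10))  # stand-in for token ids
toy_block = 4
toy_idx = 2
toy_x = toy_tokens[toy_idx : toy_idx + toy_block]          # model input
toy_y = toy_tokens[toy_idx + 1 : toy_idx + 1 + toy_block]  # same window shifted by one token
print(toy_x, toy_y)  # [2, 3, 4, 5] [3, 4, 5, 6]
print(count_windows(len(toy_tokens), toy_block))  # 6
print(steps_per_epoch(6, batch_size=4))  # 2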

# %%
# 6) Define a tiny decoder-only transformer (baby GPT)
class CausalSelfAttention(nn.Module):
    def __init__(self, embed_dim, n_heads):
        super().__init__()
        assert embed_dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = embed_dim // n_heads
        self.scale = self.head_dim ** -0.5

        self.qkv = nn.Linear(embed_dim, embed_dim * 3, bias=False)
        self.proj = nn.Linear(embed_dim, embed_dim)

        # causal mask will be created dynamically in forward

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.qkv(x)  # (B, T, 3*C)
        qkv = qkv.reshape(B, T, 3, self.n_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # each is (B, n_heads, T, head_dim)

        att = (q @ k.transpose(-2, -1)) * self.scale  # (B, n_heads, T, T)
        # causal mask: set attention for j>i to -inf
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        att = att.masked_fill(mask.unsqueeze(0).unsqueeze(0), float('-inf'))

        att = torch.softmax(att, dim=-1)
        out = att @ v  # (B, n_heads, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)


class FeedForward(nn.Module):
    def __init__(self, embed_dim, ff_mult=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * ff_mult),
            nn.GELU(),
            nn.Linear(embed_dim * ff_mult, embed_dim),
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    def __init__(self, embed_dim, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = CausalSelfAttention(embed_dim, n_heads)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ff = FeedForward(embed_dim)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x


class BabyGPT(nn.Module):
    def __init__(self, vocab_size, block_size=128, n_layers=4, n_heads=4, embed_dim=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(block_size, embed_dim)
        self.blocks = nn.ModuleList([Block(embed_dim, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size, bias=False)

        # weight tying
        self.head.weight = self.tok_emb.weight
        self.block_size = block_size

    def forward(self, idx):
        B, T = idx.size()
        assert T <= self.block_size, "Sequence length exceeds block size"
        pos = torch.arange(0, T, device=idx.device).unsqueeze(0)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for b in self.blocks:
            x = b(x)
        x = self.ln_f(x)
        logits = self.head(x)
        return logits
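
# %%
# 6b) Illustrative sketch (not part of the model): what the causal mask in
# CausalSelfAttention does. Scores for "future" positions j > i are replaced by -inf before
# the softmax, so their attention weight becomes exactly 0 while each row still sums to 1.
# This pure-Python toy mirrors the masked_fill + softmax steps for a single head;
# `causal_softmax` and `toy_scores` are hypothetical names used only in this cell.
import math


def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with future positions masked out."""
    weights = []
    for i, row in enumerate(scores):
        # Keep positions j <= i; masked entries become -inf and exp(-inf) == 0.0.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        row_max = max(masked)  # position i is never masked, so this is finite
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


toy_scores = [[0.0, 1.0, 2.0], [0.0, 0.0, 5.0], [1.0, 1.0, 1.0]]
toy_weights = causal_softmax(toy_scores)
for toy_row in toy_weights:
    print([round(w, 3) for w in toy_row])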

# %%
# 7) Create model, optimizer, device
vocab_size = len(tokenizer)
model = BabyGPT(vocab_size=vocab_size, block_size=block_size, n_layers=4, n_heads=4, embed_dim=128)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
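
# %%
# 7b) Illustrative sketch (not part of the original pipeline): where the parameters go.
# The formula follows the layer definitions in section 6 (tied token embedding / output
# head, bias-free qkv projection, 4x feed-forward). The default vocab_size of 50258 ASSUMES
# the stock GPT-2 vocabulary (50257 entries) plus the one added pad token; pass
# len(tokenizer) to be exact. `count_baby_gpt_params` is a hypothetical helper; its result
# can be compared with sum(p.numel() for p in model.parameters()).
def count_baby_gpt_params(vocab_size=50258, block_size=128, n_layers=4, embed_dim=128, ff_mult=4):
    d = embed_dim
    tok_emb = vocab_size * d        # shared with the output head (weight tying)
    pos_emb = block_size * d
    layer_norm = 2 * d              # weight + bias
    attn = 3 * d * d + (d * d + d)  # qkv (no bias) + proj (with bias)
    ff = (d * ff_mult * d + ff_mult * d) + (ff_mult * d * d + d)
    block = 2 * layer_norm + attn + ff
    return tok_emb + pos_emb + n_layers * block + layer_norm  # last term is ln_f


est_params = count_baby_gpt_params()
print(f"Estimated parameters: {est_params:,}")
print(f"Share held by the token embedding: {50258 * 128 / est_params:.1%}")
# Under these assumptions nearly all parameters sit in the embedding table, which is one
# reason a smaller, corpus-specific vocabulary can suit a model this small.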

# %%
# 8) Training loop (small number of steps for demo)
num_epochs = 20
log_interval = 10
model.train()

for epoch in range(num_epochs):
    pbar = tqdm(loader, desc=f"Epoch {epoch}")
    running_loss = 0.0
    for step, (x, y) in enumerate(pbar):
        x = x.to(device)
        y = y.to(device)

        logits = model(x)  # (B, T, V)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

        running_loss += loss.item()
        if (step + 1) % log_interval == 0:
            pbar.set_postfix({"loss": running_loss / log_interval})
            running_loss = 0.0

    # Generate some text at the end of each epoch to monitor progress
    model.eval()
    with torch.no_grad():
        # prompt (tokenize a short seed)
        seed = "Soccer is"
        seed_ids = torch.tensor(tokenizer(seed)["input_ids"], dtype=torch.long, device=device).unsqueeze(0)

        # generate autoregressively
        for _ in range(60):
            # feed only the last block_size tokens, but keep the full sequence for decoding
            context = seed_ids[:, -block_size:]
            logits = model(context)
            probs = F.softmax(logits[:, -1, :], dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            seed_ids = torch.cat([seed_ids, next_id], dim=1)

        gen = tokenizer.decode(seed_ids[0].tolist())
        print(f"\n=== Sample (epoch {epoch}) ===\n{gen}\n")
    model.train()
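
# %%
# 8b) Illustrative sketch (not part of the original pipeline): reading the loss values
# printed above. F.cross_entropy reports the mean negative log-likelihood in nats, so
# exp(loss) is the per-token perplexity, and a model that spreads probability uniformly
# over the vocabulary scores ln(vocab_size). The value 50258 ASSUMES the stock GPT-2
# vocabulary plus one pad token; `perplexity` and `uniform_baseline_loss` are hypothetical
# helper names.
import math


def perplexity(mean_nll):
    """Per-token perplexity for a mean cross-entropy given in nats."""
    return math.exp(mean_nll)


def uniform_baseline_loss(vocab_size):
    """Cross-entropy of a uniform guess over the vocabulary."""
    return math.log(vocab_size)


print(f"Uniform-guess loss: {uniform_baseline_loss(50258):.3f}")
print(f"Perplexity at loss 2.0: {perplexity(2.0):.2f}")
# Every training window comes from the same short text, so a very low training loss here
# mostly signals memorization rather than generalization.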

# %%
# 9) Save the model and tokenizer
out_dir = Path("baby_gpt_checkpoints")
out_dir.mkdir(exist_ok=True)

save_path = out_dir / "baby_gpt.pth"
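torch.save({
    "model_state_dict": model.state_dict(),
    # hyperparameters needed to rebuild BabyGPT before calling load_state_dict
    "config": {"vocab_size": vocab_size, "block_size": block_size,
               "n_layers": 4, "n_heads": 4, "embed_dim": 128},
    "tokenizer": tokenizer.get_vocab(),
}, save_path)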
print(f"Saved checkpoint to {save_path}")

# Also save tokenizer in HF format for reuse
tokenizer.save_pretrained(out_dir)

# %%
# 10) Notes and next steps
# - This is a tiny model trained on very little data; results are mainly illustrative.
# - For better results:
#   * Train on much more text (millions of tokens)
#   * Increase model size (layers, heads, embed_dim)
#   * Use learning rate schedules (cosine, warmup)
#   * Use mixed precision (AMP) and larger batch sizes on GPU
#   * Consider HuggingFace `Trainer` or DeepSpeed/FairScale for scaling
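
# %%
# 11) Illustrative sketch for the "learning rate schedules" suggestion above: linear warmup
# followed by cosine decay. All defaults (min_lr, warmup_steps, total_steps) are ASSUMED
# values for illustration, not tuned settings; `lr_at_step` is a hypothetical helper.
import math


def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=3000):
    """Learning rate for a 0-indexed optimizer step."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps         # linear ramp up to max_lr
    if step >= total_steps:
        return min_lr                                      # hold the floor after the decay
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 down to 0
    return min_lr + cosine * (max_lr - min_lr)


for demo_step in (0, 99, 100, 1550, 3000):
    print(demo_step, f"{lr_at_step(demo_step):.2e}")

# One way to use it in the training loop is to set the rate before optimizer.step():
#     for group in optimizer.param_groups:
#         group["lr"] = lr_at_step(global_step)
# where `global_step` is a counter you increment once per batch.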
