# A Neural Probabilistic Language Model

In this notebook, we aim to replicate the idea of the paper "A Neural Probabilistic Language Model". The language model training class comes from the Medium article [A Neural Probabilistic Language Model: Breaking Down Bengio’s Approach](https://medium.com/@dahami/a-neural-probabilistic-language-model-breaking-down-bengios-approach-4bf793a84426).


## Dependencies

In [32]:
import re
from collections import Counter
from dataclasses import dataclass
import random
import math

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
from tqdm.auto import tqdm

## Language model class

In [33]:
class NeuralProbabilisticLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size, hidden_dim):
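        super().__init__()

        # One learned embedding_dim-sized vector per vocabulary entry.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The hidden layer sees all context embeddings concatenated.
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_dim)
        self.tanh = nn.Tanh()
        # The output layer produces one score per vocabulary entry.
        self.linear2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, inputs):
        emb = self.embeddings(inputs)          # (batch, context, emb) or (context, emb)
        if emb.dim() == 2:                     # single unbatched context
            emb = emb.unsqueeze(0)             # -> (1, context, emb)

        out = emb.flatten(start_dim=1)         # (batch, context * emb)
        out = self.tanh(self.linear1(out))     # (batch, hidden)
        out = self.linear2(out)                # (batch, vocab)

        # Log-probabilities over the vocabulary, as expected by nn.NLLLoss.
        return torch.log_softmax(out, dim=1)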
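
In equation form, the forward pass above can be summarized as follows. The symbols are notation assumed here for illustration and do not appear in the notebook: $C$ is the embedding table, $n$ the context size, $d$ the embedding dimension, $|V|$ the vocabulary size, and $W_1, b_1$ and $W_2, b_2$ the weights and biases of `linear1` and `linear2`. Unlike the most general form in the original paper, this class has no direct connection from the embeddings to the output layer.

```latex
\begin{aligned}
x &= \bigl[\,C(w_{t-n});\; \dots;\; C(w_{t-1})\,\bigr] \in \mathbb{R}^{n \cdot d} \\
h &= \tanh(W_1 x + b_1) \\
\log P(w_t = i \mid w_{t-n}, \dots, w_{t-1}) &= (W_2 h + b_2)_i - \log \sum_{j=1}^{|V|} \exp\bigl((W_2 h + b_2)_j\bigr)
\end{aligned}
```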


## Loading the dataset

In [34]:
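def manter_letras_e_espacos(texto: str) -> str:
    if texto is None:
        return ""
    # Map every whitespace run (newlines, tabs) to a single space first;
    # otherwise words separated only by a line break would be fused.
    s = re.sub(r"\s+", " ", str(texto).lower())
    # Drop everything that is not a lowercase ASCII letter or a space.
    return re.sub(r"[^a-z ]+", "", s)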

def adicionar_coluna_normalizada(exemplo):
    exemplo["texto_normalizado"] = manter_letras_e_espacos(exemplo.get("text", ""))
    return exemplo
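
To make the effect of the character filter concrete, here is a small sketch that applies the same regular expression to a few sample strings. The helper name `filter_chars` and the sample strings are made up for illustration. Characters are deleted rather than replaced by a space, so hyphenated words come out fused.

```python
import re

def filter_chars(text: str) -> str:
    """Lowercase the text and delete everything except a-z and spaces."""
    return re.sub(r"[^a-z ]+", "", text.lower())

# Hypothetical sample strings, not taken from the dataset.
samples = ["Hello, World! 123", "half-finished", "don't stop"]
cleaned = [filter_chars(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

This is one plausible source of fused tokens such as `halffinished` in the generated samples further down.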

In [35]:
config_hf = "books"
split = "train"
ds = load_dataset("ubaada/booksum-complete-cleaned", config_hf, split=split)
ds = ds.map(adicionar_coluna_normalizada)
ds = ds.filter(lambda x: len(x["texto_normalizado"].strip()) > 0)

## Training utilities

In [36]:
CONTEXT_SIZE = 5      
EMB_DIM = 128
HIDDEN_DIM = 256
MAX_VOCAB = 40000         
BATCH_SIZE = 512
EPOCHS = 1
LR = 3e-5
SEED = 42
DEVICE = torch.device("cuda" if torch.cuda.is_available() else ("mps" if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available() else "cpu"))
random.seed(SEED)
torch.manual_seed(SEED)


print(f"Using device: {DEVICE}")

Using device: mps


In [37]:
def tokenize(text):
    return re.sub(r"\s+", " ", text.strip()).split(" ")

ds_split = ds.train_test_split(test_size=0.01, seed=SEED)
train_texts = ds_split["train"]["texto_normalizado"]
valid_texts = ds_split["test"]["texto_normalizado"]

SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"]
counter = Counter()
for t in train_texts:
    counter.update(tokenize(t))

most_common = [w for w, _ in counter.most_common(MAX_VOCAB - len(SPECIALS))]
itos = SPECIALS + most_common
stoi = {w: i for i, w in enumerate(itos)}

PAD, UNK, BOS, EOS = (stoi["<pad>"], stoi["<unk>"], stoi["<bos>"], stoi["<eos>"])

def encode(tokens):
    return [stoi.get(w, UNK) for w in tokens]
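
The vocabulary logic above can be traced on a toy corpus. The sketch below repeats the same steps with a hypothetical three-sentence corpus and a hypothetical cap of 7 entries (`toy_texts` and `MAX_VOCAB_TOY` are illustrative names only); any word that misses the cut is mapped to `<unk>`.

```python
from collections import Counter

SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"]
MAX_VOCAB_TOY = 7  # hypothetical cap: 4 specials + 3 most frequent words
toy_texts = ["the cat sat", "the cat ran", "a dog sat"]

# Count word frequencies over the whole toy corpus.
counter = Counter()
for text in toy_texts:
    counter.update(text.split())

# Keep only the most frequent words that fit under the cap.
most_common = [w for w, _ in counter.most_common(MAX_VOCAB_TOY - len(SPECIALS))]
itos = SPECIALS + most_common                 # index -> string
stoi = {w: i for i, w in enumerate(itos)}     # string -> index
UNK = stoi["<unk>"]

def encode(tokens):
    """Map tokens to ids, falling back to <unk> for out-of-vocabulary words."""
    return [stoi.get(w, UNK) for w in tokens]

encoded = encode("the dog sat".split())
print(itos)
print(encoded)  # "dog" is out of vocabulary here
```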

In [38]:
def make_windows_from_text(text, context_size=CONTEXT_SIZE):
    toks = tokenize(text)
    toks = ["<bos>"] * context_size + toks + ["<eos>"]
    ids = encode(toks)
    contexts, targets = [], []
    for i in range(context_size, len(ids)):
        contexts.append(ids[i - context_size:i])
        targets.append(ids[i])
    return contexts, targets

class NGramDataset(Dataset):
    def __init__(self, texts, context_size):
        self.contexts = []
        self.targets = []
        for text in texts:
            c, y = make_windows_from_text(text, context_size)
            self.contexts.extend(c)
            self.targets.extend(y)

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return torch.tensor(self.contexts[idx], dtype=torch.long), torch.tensor(self.targets[idx], dtype=torch.long)
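
The windowing scheme is easier to see on a short sentence. The following sketch works on token strings instead of ids and uses a hypothetical helper `toy_windows` with a context size of 2; the padding and slicing follow the same pattern as `make_windows_from_text`.

```python
def toy_windows(tokens, context_size):
    """Return (context, target) pairs using <bos> padding and a final <eos>."""
    padded = ["<bos>"] * context_size + tokens + ["<eos>"]
    return [
        (padded[i - context_size:i], padded[i])
        for i in range(context_size, len(padded))
    ]

pairs = toy_windows("the cat sat".split(), context_size=2)
for context, target in pairs:
    print(context, "->", target)
```

Each text of `N` tokens yields `N + 1` training pairs: one per token, plus one whose target is `<eos>`.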

In [39]:
train_ds = NGramDataset(train_texts, CONTEXT_SIZE)
valid_ds = NGramDataset(valid_texts, CONTEXT_SIZE)

train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
valid_dl = DataLoader(valid_ds, batch_size=BATCH_SIZE, shuffle=False, drop_last=False)

In [40]:
model = NeuralProbabilisticLanguageModel(
    vocab_size=len(itos),
    embedding_dim=EMB_DIM,
    context_size=CONTEXT_SIZE,
    hidden_dim=HIDDEN_DIM,
).to(DEVICE)

criterion = nn.NLLLoss(ignore_index=PAD)  
optim = torch.optim.AdamW(model.parameters(), lr=LR)
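
For a sense of scale, the parameter count of this architecture can be worked out by hand. The sketch below is plain arithmetic and assumes the vocabulary fills all `MAX_VOCAB = 40000` slots; the actual `len(itos)` may be smaller. The function name `nplm_param_count` is illustrative.

```python
def nplm_param_count(vocab_size, embedding_dim, context_size, hidden_dim):
    """Count weights and biases of embeddings -> linear1 -> linear2."""
    embeddings = vocab_size * embedding_dim
    linear1 = context_size * embedding_dim * hidden_dim + hidden_dim
    linear2 = hidden_dim * vocab_size + vocab_size
    return {
        "embeddings": embeddings,
        "linear1": linear1,
        "linear2": linear2,
        "total": embeddings + linear1 + linear2,
    }

# Hyperparameter values copied from the notebook; vocab_size is an assumption.
counts = nplm_param_count(vocab_size=40000, embedding_dim=128,
                          context_size=5, hidden_dim=256)
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
```

Under that assumption, the output layer accounts for roughly two thirds of all parameters, which is typical of models with a full softmax over a large vocabulary.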

## Training

In [41]:
LOG_EVERY = 30

GOOD_SEEDS = [
    "the king of france",
    "once upon a time",
    "in a distant land",
    "deep learning models",
    "the history of science",
]

In [42]:
@torch.no_grad()
def generate_from_seed(seed_text, max_new_tokens=30, temperature=0.9, top_k=50):
    model.eval()
    seed_words = seed_text.lower().split()

    ctx = [BOS] * CONTEXT_SIZE
    seed_ids = encode(seed_words[-CONTEXT_SIZE:])
    for i, sid in enumerate(seed_ids[::-1]):
        ctx[-1 - i] = sid

    out_tokens = seed_words[:]
    for _ in range(max_new_tokens):
        x = torch.tensor(ctx, dtype=torch.long, device=DEVICE)    
        logp = model(x)                                            
        logits = logp[0] / temperature  # temperature < 1 sharpens the distribution

        if top_k and top_k < logits.numel():
            kth = torch.topk(logits, top_k).values[-1]
            logits = torch.where(logits < kth, torch.tensor(float('-inf'), device=logits.device), logits)

        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, 1).item()
        tok = itos[next_id]
        if tok == "<eos>":
            break
        if tok not in SPECIALS:
            out_tokens.append(tok)
        ctx = ctx[1:] + [next_id]
    return " ".join(out_tokens)
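
The two sampling controls used here, temperature and top-k, can be illustrated without PyTorch. The sketch below is a pure-Python version of the conventional formulation, in which scores are divided by the temperature before the softmax. The function names and the toy scores are assumptions made for this example.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_temperature_probs(scores, temperature=1.0, top_k=None):
    """Scale scores by 1/temperature, keep the top_k largest, renormalize."""
    scaled = [s / temperature for s in scores]
    if top_k is not None and top_k < len(scaled):
        kth = sorted(scaled, reverse=True)[top_k - 1]
        # Everything below the k-th largest score gets zero probability.
        scaled = [s if s >= kth else float("-inf") for s in scaled]
    return softmax(scaled)

toy_scores = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-word vocabulary
probs_sharp = top_k_temperature_probs(toy_scores, temperature=0.5, top_k=2)
probs_flat = top_k_temperature_probs(toy_scores, temperature=2.0, top_k=2)

rng = random.Random(0)
next_id = rng.choices(range(len(toy_scores)), weights=probs_sharp, k=1)[0]
print(probs_sharp, probs_flat, next_id)
```

A lower temperature concentrates probability on the best-scoring token, while top-k removes the long tail entirely, so only the two highest-scoring ids can be sampled in this example.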

In [43]:
def run_epoch(dataloader, train=True, log_every=LOG_EVERY):
    model.train(train)
    total_loss, total_items = 0.0, 0
    pbar = tqdm(enumerate(dataloader, start=1), total=len(dataloader),
                desc=f"{'Train' if train else 'Valid'}", leave=False)

    for step, (x_batch, y_batch) in pbar:
        x_batch, y_batch = x_batch.to(DEVICE), y_batch.to(DEVICE)
        bs = x_batch.size(0)

        if train:
            optim.zero_grad(set_to_none=True)

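        # Single batched forward pass; the model already accepts
        # (batch, context) inputs, so no per-item loop is needed.
        with torch.set_grad_enabled(train):
            log_probs = model(x_batch)
            loss = criterion(log_probs, y_batch)
        if train:
            loss.backward()
        # criterion averages over the batch, so rescale to a per-item sum.
        batch_loss_val = loss.item() * bs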

        if train:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optim.step()

        total_loss += batch_loss_val
        total_items += bs
        avg = total_loss / max(1, total_items)
        pbar.set_postfix(avg_loss=f"{avg:.4f}")

        if train and (step % log_every == 0):
            ppl = math.exp(min(20, avg))
            tqdm.write(f"step {step}/{len(dataloader)} | avg_loss={avg:.4f} | ppl={ppl:.2f}")
            seed = random.choice(GOOD_SEEDS)
            sample = generate_from_seed(seed, max_new_tokens=25, temperature=0.9, top_k=50)
            tqdm.write(f"▶️ seed: '{seed}'\n📝 {sample}\n")
            if hasattr(torch, "mps") and torch.backends.mps.is_available():
                torch.mps.synchronize()

    avg = total_loss / max(1, total_items)
    ppl = math.exp(min(20, avg))
    return avg, ppl
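
Perplexity here is the exponential of the average negative log-likelihood, capped at an exponent of 20 to avoid overflow. A useful reference point is a model that predicts every word with equal probability: its loss is `ln(V)` and its perplexity is `V`. The sketch below computes that baseline, assuming a vocabulary of exactly 40000 entries (the `MAX_VOCAB` cap; the real vocabulary may be smaller). The helper name `perplexity` is illustrative.

```python
import math

def perplexity(avg_nll, cap=20.0):
    """exp(average negative log-likelihood), with the exponent capped."""
    return math.exp(min(cap, avg_nll))

vocab_size = 40000  # assumption: the vocabulary reaches the MAX_VOCAB cap
uniform_loss = math.log(vocab_size)
uniform_ppl = perplexity(uniform_loss)

print(f"uniform baseline: loss={uniform_loss:.4f} ppl={uniform_ppl:.2f}")
```

The first values logged in the training output, with `avg_loss` around 10.6, sit close to this baseline, which is what one would expect from a freshly initialized model.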

In [44]:
for seed in GOOD_SEEDS:
    sample = generate_from_seed(seed, max_new_tokens=25, temperature=0.9, top_k=50)
    tqdm.write(f"▶️ seed: '{seed}'\n📝 {sample}\n")

▶️ seed: 'the king of france'
📝 the king of france garonne stevie effects fretting silencei drives veal themif persuasions wronged whyi isone jaded abased onlyone herinto power headman ploughing fagin andone stubble palings hedda partisan

▶️ seed: 'once upon a time'
📝 once upon a time lookthat gaywith rugs prophet yousergius stirring waywith ofduty olivers brag smokethe andwere dying demeanor chiefest fawned chantilly permanent hymn honoured brimstone disaster prospect frames thegods

▶️ seed: 'in a distant land'
📝 in a distant land mumbled catesby louise blackhair obligation asif siege sprinkled notno cords bumper crank thank corps notsee devisd mas notyour theinconvenience joked evelyn captainand forests appease locusts

▶️ seed: 'deep learning models'
📝 deep learning models longin nowmr observd wys iunderstood startingpoint angers guileless gondolas pact looming thesecond beconsidered herabout heremrs carve mountainside procurators togetherto myselfand pleasantness dorsets returnth

In [None]:
for epoch in range(1, EPOCHS + 1):
    tr_loss, tr_ppl = run_epoch(train_dl, train=True)
    va_loss, va_ppl = run_epoch(valid_dl, train=False)
    print(f"[{epoch}/{EPOCHS}] train loss={tr_loss:.4f} ppl={tr_ppl:.2f} | valid loss={va_loss:.4f} ppl={va_ppl:.2f}")

Train:   0%|          | 0/23208 [00:00<?, ?it/s]

step 30/23208 | avg_loss=10.6161 | ppl=40785.67
▶️ seed: 'the king of france'
📝 the king of france chronicle clarendon taxcollector andlost shecared crave norany apower habitable frilled allowedto tooif bide crier din lovemaster turnedback tennyson husbandry tightened single mlle parallels robe optical

step 60/23208 | avg_loss=10.6036 | ppl=40280.39
▶️ seed: 'the king of france'
📝 the king of france sceneof wen gruffly levelled rendered carriageand jewelled chimed crowning tought showthe dinewith halffinished slave boars whichwere tact loveshe quelled onthen goodwill toyour dock january answer

step 90/23208 | avg_loss=10.5904 | ppl=39749.79
▶️ seed: 'in a distant land'
📝 in a distant land associate luxuriant expelled mute truthit thenketh churning jess proclaimed reminds nicholas creative awaynow assigned adieux belles psalms uttermost ruth cottonville whenall growed meditatively inclose givet

step 120/23208 | avg_loss=10.5768 | ppl=39216.33
▶️ seed: 'once upon a time'
📝 once upon a