**Note**: Running this notebook in Google Colab is recommended to ensure compatibility and avoid dependency problems. Training the models requires a significant amount of RAM and compute, which may not be available in every local environment.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/antoniotrapote/chord-prediction-tfm/blob/main/anexos/notebooks/03_modelado/05_modelo_lstm_finetuning.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-black?logo=github)](https://github.com/antoniotrapote/chord-prediction-tfm/blob/main/anexos/notebooks/03_modelado/05_modelo_lstm_finetuning.ipynb)

# LSTM_torch_v3 — LSTM + Random Search (PyTorch)

**Goal:** hyperparameter tuning on top of a baseline **LSTM** model.  
**Dataset:** `songdb_funcional_v4`

**Structure:**  
1) Environment (seeds, GPU/versions).  
2) CSV loading + tokenization.  
3) Train/val/test split.  
4) Vocabulary + encoding.  
5) Dataset + DataLoaders.  
6) LSTM model with optional `tie_weights`.  
7) Metrics and training (Top@k, MRR, PPL).  
8) A/B test: manual hyperparameter selection.  
9) Random Search (search space, results table, checkpoints).  
10) Package the best checkpoint with metadata and a descriptive name.

## 1) Environment (Colab)
The model was trained in Google Colab with the following specifications:
>Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]  
>PyTorch: 2.8.0+cu126  
>CUDA disponible: True  
>GPU: Tesla T4

In [1]:
#@title Seeds and determinism
import random, os, numpy as np, torch
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False


In [2]:
#@title Check GPU/versions
import sys, torch
print("Python:", sys.version)
print("PyTorch:", torch.__version__)
print("CUDA disponible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("⚠️ Enable the GPU: Runtime ▶ Change runtime type ▶ GPU")


Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
PyTorch: 2.8.0+cu126
CUDA disponible: True
GPU: Tesla T4


## 2) Load the CSV and tokenize

In [3]:
import urllib.request

# Data configuration
USER = "antoniotrapote"
REPO = "chord-prediction-tfm"
BRANCH = "main"
PATH_IN_REPO = "anexos/data/songdb_funcional_v4.csv"
URL = f"https://raw.githubusercontent.com/{USER}/{REPO}/{BRANCH}/{PATH_IN_REPO}"

# Local path where the file is saved
data_path = "/content/songdb_funcional_v4.csv"

# Download the CSV file from GitHub
urllib.request.urlretrieve(URL, data_path)
print(f"Dataset descargado en: {data_path}")

Dataset descargado en: /content/songdb_funcional_v4.csv


In [4]:
#@title Tokenization of the functional progressions
import pandas as pd, ast, re, json

# Data configuration
sequence_col = "funcional_prog"  # Column holding the sequence (use "chordprog" for American letter-name chord symbols)
min_seq_len = 8  # Filter: skip very short sequences

df = pd.read_csv(data_path)
assert sequence_col in df.columns, f"Column {sequence_col} not found in the CSV."
print("Filas totales:", len(df))
display(df[[sequence_col]].head(3))

def parse_tokens_simple(s: str):
    # If the value is a stringified list, try to parse it
    if isinstance(s, str) and s.strip().startswith("[") and s.strip().endswith("]"):
        try:
            lst = ast.literal_eval(s)
            if isinstance(lst, list):
                return [str(t) for t in lst]
        except Exception:
            pass
    # Otherwise split on whitespace (bar lines and newlines count as separators)
    s = str(s).replace("|", " ").replace("\n", " ")
    toks = re.findall(r"\S+", s)
    return toks

df["_tokens_"] = df[sequence_col].apply(parse_tokens_simple)
df = df[df["_tokens_"].apply(len) >= min_seq_len].reset_index(drop=True)
print(f"Filas tras filtro min_seq_len >= {min_seq_len}:", len(df))

Filas totales: 2613


| | funcional_prog |
|---|---|
| 0 | vi #ivø V/III V/VI vi IV ii V7 iii vi ii V7 I ... |
| 1 | VII VII I vi ii V7 VII VII I vi ii V7 I IV #iv... |
| 2 | i VI V/V V7 i VI V/V V7 i VI iiø V7 i VI iiø V... |


Filas tras filtro min_seq_len >= 8: 2612
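
As a quick illustration of the two input formats the tokenizer accepts, here is a minimal standalone sketch. The function name `parse_tokens` and both input strings are made up for the example; they are not rows from the dataset.

```python
import ast
import re

def parse_tokens(s):
    """Parse a progression given either as a stringified list or as plain text."""
    text = str(s).strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            parsed = ast.literal_eval(text)
            if isinstance(parsed, list):
                return [str(t) for t in parsed]
        except (ValueError, SyntaxError):
            pass  # not a valid literal: fall back to whitespace splitting
    return re.findall(r"\S+", text.replace("|", " "))

# Hypothetical inputs, not taken from the dataset
as_text = parse_tokens("ii V7 | I vi")
as_list = parse_tokens("['ii', 'V7', 'I', 'vi']")
print(as_text)
print(as_list)
```

Both calls should yield the same four tokens, so downstream code does not need to know which format a row used.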


## 3) Train/val/test split (simple, row-wise)

In [5]:
from sklearn.model_selection import train_test_split

# Dataset split parameters
val_size = 0.10     # 10% for validation
test_size = 0.10    # 10% for test
random_state = 42   # Seed for reproducibility

# Split into train/val/test
train_df, tmp_df = train_test_split(df, test_size=val_size+test_size, random_state=random_state, shuffle=True)
rel_test = test_size / (val_size + test_size) if (val_size + test_size) > 0 else 0.5
val_df, test_df = train_test_split(tmp_df, test_size=rel_test, random_state=random_state, shuffle=True)

# Extract the tokenized sequences
train_seqs = train_df["_tokens_"].tolist()
val_seqs   = val_df["_tokens_"].tolist()
test_seqs  = test_df["_tokens_"].tolist()

print(f"Train: {len(train_seqs)}, Val: {len(val_seqs)}, Test: {len(test_seqs)}")

Train: 2089, Val: 261, Test: 262
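
The sizes printed above can be reproduced by hand. The sketch below assumes that `train_test_split` rounds the held-out share up and gives the remainder to the other part, which is consistent with the counts shown; the variable names are illustrative.

```python
import math

n_rows = 2612                      # rows left after the min_seq_len filter
val_size, test_size = 0.10, 0.10

# First split: hold out val + test together (assumed to be rounded up)
n_holdout = math.ceil((val_size + test_size) * n_rows)
n_train = n_rows - n_holdout

# Second split: divide the hold-out between val and test
rel_test = test_size / (val_size + test_size)
n_test = math.ceil(rel_test * n_holdout)
n_val = n_holdout - n_test

print(n_train, n_val, n_test)      # 2089 261 262
```

Because the split is made per row (per song), all windows from one song end up in the same partition.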


## 4) Vocabulary and encoding

In [6]:
from collections import Counter
import json

# Special tokens
PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"
tokenizer_path = "lstm_tokenizer.json"  # Tokenizer file name

def build_vocab(seqs, min_freq=1):
    """
    Builds the vocabulary from the training sequences ONLY.

    IMPORTANT: do not include validation/test data, to avoid data leakage.
    Tokens not seen in training are mapped to <unk> during evaluation.
    """
    c = Counter()
    for s in seqs:
        c.update(s)

    # Build the vocabulary: special tokens + tokens with frequency >= min_freq
    vocab = [PAD, UNK, BOS, EOS] + [t for t,f in c.items() if f >= min_freq and t not in {PAD,UNK,BOS,EOS}]
    stoi = {t:i for i,t in enumerate(vocab)}  # string to index
    itos = {i:t for t,i in stoi.items()}     # index to string
    return vocab, stoi, itos

# 🎯 Build the vocabulary from training data ONLY (avoids data leakage)
vocab, stoi, itos = build_vocab(train_seqs, min_freq=1)
vocab_size = len(vocab)
print("Tamaño del vocabulario:", vocab_size)
print(f"📊 Tokens únicos encontrados en entrenamiento: {vocab_size - 4}")  # -4 for the special tokens

# Save the tokenizer for later use (explicit UTF-8: tokens such as "iiø" are non-ASCII)
with open(tokenizer_path, "w", encoding="utf-8") as f:
    json.dump({"vocab": vocab}, f, ensure_ascii=False, indent=2)
print(f"Tokenizer guardado en: {tokenizer_path}")

def encode(seq, add_bos=True):
    """
    Converts a token sequence to numeric indices.
    Tokens not seen in training are mapped automatically to <unk>.
    """
    ids = [stoi[BOS]] if add_bos else []
    ids += [stoi.get(t, stoi[UNK]) for t in seq]  # .get() handles unknown tokens
    return ids

Tamaño del vocabulario: 86
📊 Tokens únicos encontrados en entrenamiento: 82
Tokenizer guardado en: lstm_tokenizer.json
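
To make the `<unk>` fallback concrete, here is a small self-contained sketch with a toy vocabulary. `toy_vocab`, `toy_stoi` and `encode_toy` are hypothetical names for this example; the real vocabulary has 86 entries.

```python
PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"

# Hypothetical toy vocabulary: special tokens first, then three chords
toy_vocab = [PAD, UNK, BOS, EOS, "I", "ii", "V7"]
toy_stoi = {tok: idx for idx, tok in enumerate(toy_vocab)}

def encode_toy(seq, add_bos=True):
    """Map tokens to ids, sending anything outside the vocabulary to <unk>."""
    ids = [toy_stoi[BOS]] if add_bos else []
    ids += [toy_stoi.get(tok, toy_stoi[UNK]) for tok in seq]
    return ids

ids = encode_toy(["ii", "V7", "I", "bVII"])  # "bVII" is not in the toy vocabulary
print(ids)  # [2, 5, 6, 4, 1]
```

Note that `<eos>` is reserved in the vocabulary, but the notebook's `encode` never appends it, so the model is not trained to predict the end of a song.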


## 5) Dataset (context→next) + DataLoaders
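
The `NextTokenDataset` defined in this section slices each encoded song into fixed-length contexts, each paired with the id that follows it. A minimal sketch of that windowing, using made-up ids and a hypothetical helper `make_windows`:

```python
def make_windows(ids, seq_len):
    """Return (context, target) pairs: the seq_len previous ids -> the next id."""
    return [(ids[i - seq_len:i], ids[i]) for i in range(seq_len, len(ids))]

toy_ids = [2, 10, 11, 12, 13, 14]   # hypothetical ids; 2 stands for <bos>
pairs = make_windows(toy_ids, seq_len=3)
for context, target in pairs:
    print(context, "->", target)
```

A song with `n` ids yields `n - seq_len` samples, and songs no longer than `seq_len` yield none. This is also why the number of test samples changes with `seq_len` in the Random Search table (9610 for 16, 7669 for 24).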

In [7]:
import torch
from torch.utils.data import Dataset, DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class NextTokenDataset(Dataset):
    def __init__(self, sequences, seq_len):
        self.samples = []
        for seq in sequences:
            ids = encode(seq, add_bos=True)
            if len(ids) <= seq_len: continue
            for i in range(seq_len, len(ids)):
                self.samples.append((ids[i-seq_len:i], ids[i]))
    def __len__(self): return len(self.samples)
    def __getitem__(self, idx):
        x, y = self.samples[idx]
        return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)

def make_dataloaders(train_seqs, val_seqs, test_seqs, seq_len, batch_size):
    """
    Creates the DataLoaders for training, validation and test.

    Args:
        train_seqs, val_seqs, test_seqs: lists of tokenized sequences
        seq_len: length of the context sequence
        batch_size: batch size

    Returns:
        train_loader, val_loader, test_loader, n_test_samples
    """
    train_data = NextTokenDataset(train_seqs, seq_len)
    val_data   = NextTokenDataset(val_seqs, seq_len)
    test_data  = NextTokenDataset(test_seqs, seq_len)

    train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, drop_last=True)
    val_loader   = DataLoader(val_data, batch_size=batch_size, shuffle=False)
    test_loader  = DataLoader(test_data, batch_size=batch_size, shuffle=False)

    return train_loader, val_loader, test_loader, len(test_data)


## 6) LSTM model (with optional `tie_weights`)
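
Weight tying is often motivated by saving parameters, but with a small vocabulary and `hidden_size != embedding_dim` the extra H->E projection can outweigh the saving. The sketch below counts only the output-head parameters implied by the `ChordLSTM` layers in this section; `head_params` is a hypothetical helper, and the sizes are the notebook's vocabulary size (86) and the A/B base configuration.

```python
def head_params(vocab_size, embedding_dim, hidden_size, tie_weights):
    """Trainable parameters of the output head only (embedding and LSTM excluded)."""
    if tie_weights:
        # The decoder reuses the embedding matrix; only the H->E projection is new
        return 0 if hidden_size == embedding_dim else hidden_size * embedding_dim
    return hidden_size * vocab_size + vocab_size  # linear weight + bias

untied = head_params(86, 128, 256, tie_weights=False)
tied = head_params(86, 128, 256, tie_weights=True)
print(untied, tied)  # 22102 32768
```

Under these sizes the tied head is actually larger, so any benefit of tying here would come from sharing the embedding geometry rather than from a smaller model.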

In [8]:

import torch.nn as nn

class ChordLSTM(nn.Module):
    """
    Simple LSTM with an optional tie_weights mode. If tie_weights=True and hidden_size != embedding_dim,
    an H->E projection is added before decoding with the weights shared with the embedding.
    """
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers, dropout, tie_weights=False):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_size, num_layers=num_layers,
                           batch_first=True, dropout=dropout if num_layers>1 else 0.0)
        self.dropout = nn.Dropout(dropout)

        self.tie_weights = tie_weights
        if tie_weights:
            self.proj = nn.Linear(hidden_size, embedding_dim, bias=False) if hidden_size != embedding_dim else nn.Identity()
            self.decoder = nn.Linear(embedding_dim, vocab_size, bias=False)
            self.decoder.weight = self.emb.weight
        else:
            self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        e = self.emb(x)                 # (B, T, E)
        o, _ = self.rnn(e)              # (B, T, H)
        h_t = self.dropout(o[:, -1, :]) # (B, H)
        if self.tie_weights:
            e_t = self.proj(h_t)        # (B, E)
            return self.decoder(e_t)    # (B, V)
        else:
            return self.fc(h_t)         # (B, V)


## 7) Metrics and training (Top@k, MRR, PPL)
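
As a reference for the three metrics used in this section, here is a pure-Python sketch on a hypothetical batch of three predictions over four classes. It mirrors the definitions only (rank of the true class, reciprocal rank, exponential of the mean cross-entropy), not the tensor implementation in the notebook.

```python
import math

def rank_of_target(scores, target):
    """1-based rank of the target class when scores are sorted in descending order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(target) + 1

# Hypothetical scores and targets
batch_scores = [[0.1, 0.6, 0.2, 0.1], [0.5, 0.1, 0.3, 0.1], [0.2, 0.25, 0.05, 0.5]]
batch_targets = [1, 2, 0]

ranks = [rank_of_target(s, t) for s, t in zip(batch_scores, batch_targets)]
top1 = sum(r <= 1 for r in ranks) / len(ranks)   # share of targets ranked first
top3 = sum(r <= 3 for r in ranks) / len(ranks)   # share of targets in the top 3
mrr = sum(1.0 / r for r in ranks) / len(ranks)   # mean reciprocal rank
print(ranks, top1, top3, round(mrr, 4))

# Perplexity is exp(mean cross-entropy); this loss is the best trial's test loss from section 9
ppl = math.exp(2.236275031737805)
print(round(ppl, 4))
```

MRR rewards near misses (a target ranked second still contributes 0.5), which makes it a smoother model-selection signal than Top@1 alone.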

In [9]:

import math, time, os, torch
import torch.nn.functional as F
from torch.amp import GradScaler, autocast

def topk_metrics(logits, targets, ks=(1,3,5)):
    """
    Computes the Top@k and MRR metrics.
    """
    out = {}
    with torch.no_grad():
        for k in ks:
            topk = logits.topk(k, dim=-1).indices
            out[f"Top@{k}"] = (topk == targets.unsqueeze(1)).any(dim=1).float().mean().item()
        ranks = (logits.argsort(dim=-1, descending=True) == targets.unsqueeze(1)).nonzero(as_tuple=False)[:,1] + 1
        out["MRR"] = (1.0 / ranks.float()).mean().item()
    return out

def evaluate(model, loader, criterion, device):
    """
    Evaluation function: mean loss, perplexity, Top@k and MRR over a loader.
    """
    model.eval()
    total, n = 0.0, 0
    agg = {"Top@1":0.0,"Top@3":0.0,"Top@5":0.0,"MRR":0.0}
    with torch.no_grad():
        for x,y in loader:
            x,y = x.to(device), y.to(device)
            logits = model(x)
            loss = criterion(logits, y)
            b = x.size(0); total += loss.item()*b; n += b
            m = topk_metrics(logits, y)
            for k in agg: agg[k] += m[k]*b
    for k in agg: agg[k] /= max(1,n)
    return {"loss": total/max(1,n), "ppl": math.exp(total/max(1,n)), **agg}

def train_once(model, train_loader, val_loader, epochs, lr, weight_decay,
               grad_clip=1.0, amp=True, save_path=None, label_smoothing=0.0,
               patience=2, device=torch.device("cpu"),
               scheduler_type="cosine", pct_start=0.15, div_factor=10.0,
               final_div_factor=1e3):
    """
    train_once: training loop with early stopping on validation MRR.

    Key parameters:
    - scheduler_type: "cosine" (default), "onecycle" or "none".
    - label_smoothing: e.g. 0.05 in train/val. (In test we usually use 0.0 for comparability.)
    - pct_start / div_factor / final_div_factor: apply ONLY if scheduler_type == "onecycle".
      They are ignored with "cosine" and "none".

    Notes:
    - scheduler.step() is called once per batch (T_max = epochs * steps_per_epoch for cosine).
    - The best state by validation MRR is kept and restored at the end.
    """

    scaler = GradScaler('cuda' if device.type=='cuda' else 'cpu',
                        enabled=(amp and device.type=='cuda'))
    crit = torch.nn.CrossEntropyLoss(label_smoothing=label_smoothing)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

    # Scheduler
    steps_per_epoch = len(train_loader)
    if scheduler_type == "onecycle":
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            opt, max_lr=lr, steps_per_epoch=steps_per_epoch, epochs=epochs,
            pct_start=pct_start,
            anneal_strategy='cos',
            div_factor=div_factor,
            final_div_factor=final_div_factor
        )
    elif scheduler_type == "cosine":
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            opt, T_max=epochs * steps_per_epoch
        )
    else:
        scheduler = None

    best_mrr, best_state = -1.0, None
    no_improve = 0

    for ep in range(1, epochs+1):
        model.train()
        for x,y in train_loader:
            x,y = x.to(device), y.to(device)
            opt.zero_grad(set_to_none=True)
            with autocast('cuda', enabled=(amp and device.type=='cuda')):
                logits = model(x); loss = crit(logits,y)
            scaler.scale(loss).backward()
            if grad_clip is not None:
                scaler.unscale_(opt)
                torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
            scaler.step(opt); scaler.update()
            if scheduler is not None:
                scheduler.step()

        valm = evaluate(model, val_loader, crit, device)
        print(f"Epoch {ep} | val loss {valm['loss']:.4f} ppl {valm['ppl']:.2f} Top@1 {valm['Top@1']:.3f} Top@3 {valm['Top@3']:.3f} Top@5 {valm['Top@5']:.3f} MRR {valm['MRR']:.3f}")
        if valm["MRR"] > best_mrr:
            best_mrr, no_improve = valm["MRR"], 0
            best_state = {k:v.detach().cpu().clone() for k,v in model.state_dict().items()}
            if save_path:
                os.makedirs(os.path.dirname(save_path), exist_ok=True)
                torch.save({"model_state": model.state_dict()}, save_path)
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f"⏹️ Early stopping (patience={patience})")
                break

    if best_state is not None:
        model.load_state_dict(best_state)
    return best_mrr


## 8) A/B tests: parameter comparison

In [10]:
#@title A/B helper
import copy, torch.nn as nn, random, numpy as np, torch

# Base configuration for the A/B tests
BASE_CONFIG = {
    "seq_len": 24,
    "batch_size": 128,
    "epochs": 6,
    "lr": 2e-3,
    "weight_decay": 1e-4,
    "dropout": 0.2,
    "embedding_dim": 128,
    "hidden_size": 256,
    "num_layers": 2,
    "grad_clip": 1.0,
    "amp": True,
    "scheduler": "cosine",
    "pct_start": 0.15,
    "div_factor": 10.0,
    "final_div_factor": 1e3,
    "random_state": random_state
}

def _reseed(seed: int):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
    if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)

def run_single(cfg_local, *, tie_weights=False, label_smoothing=0.05,
               test_label_smoothing=0.0):
    """
    Runs one training + test with the given config (no Random Search).
    - tie_weights: to quickly try with/without weight tying
    - label_smoothing: used in training/validation
    - test_label_smoothing: used in test (0.0 for comparability between runs)
    """
    # reseed so that the A/B runs are comparable
    _reseed(cfg_local["random_state"])

    # dataloaders
    train_loader_t, val_loader_t, test_loader_t, _ = make_dataloaders(
        train_seqs, val_seqs, test_seqs,
        seq_len=cfg_local["seq_len"], batch_size=cfg_local["batch_size"]
    )

    # model
    model = ChordLSTM(
        vocab_size=len(vocab),
        embedding_dim=cfg_local["embedding_dim"],
        hidden_size=cfg_local["hidden_size"],
        num_layers=cfg_local["num_layers"],
        dropout=cfg_local["dropout"],
        tie_weights=tie_weights
    ).to(device)

    # train (uses the scheduler defined in cfg_local)
    _ = train_once(
        model, train_loader_t, val_loader_t,
        epochs=cfg_local["epochs"], lr=cfg_local["lr"], weight_decay=cfg_local["weight_decay"],
        grad_clip=cfg_local["grad_clip"], amp=cfg_local["amp"], save_path=None,
        label_smoothing=label_smoothing, patience=2, device=device,
        scheduler_type=cfg_local["scheduler"], pct_start=cfg_local["pct_start"],
        div_factor=cfg_local["div_factor"], final_div_factor=cfg_local["final_div_factor"]
    )

    # test (choose between consistency with training and comparability between runs)
    test_crit = nn.CrossEntropyLoss(label_smoothing=test_label_smoothing)
    return evaluate(model, test_loader_t, test_crit, device)

In [11]:
#@title Mini A/B test function: compares Top@1 and MRR and shows the differences

def compare_ab(cfg_A, cfg_B, *,
               tie_A=False, tie_B=False,
               ls_train=0.05, ls_test=0.0,
               label_A="A", label_B="B"):
    """
    Runs A and B with the same seeds/hparams (except for the changes made in cfg_A/cfg_B)
    and shows Top@1 / MRR plus the differences.
    - ls_train: label smoothing used in training/validation
    - ls_test:  label smoothing of the test criterion (0.0 for comparability)
    - tie_A/B:  enables/disables tie_weights in each run
    """
    print(f"\n>>> Run {label_A}")
    resA = run_single(cfg_A, tie_weights=tie_A, label_smoothing=ls_train, test_label_smoothing=ls_test)
    print(f"{label_A}: Top@1={resA['Top@1']:.4f}  MRR={resA['MRR']:.4f}")

    print(f"\n>>> Run {label_B}")
    resB = run_single(cfg_B, tie_weights=tie_B, label_smoothing=ls_train, test_label_smoothing=ls_test)
    print(f"{label_B}: Top@1={resB['Top@1']:.4f}  MRR={resB['MRR']:.4f}")

    d_top1 = resB['Top@1'] - resA['Top@1']
    d_mrr  = resB['MRR']  - resA['MRR']
    print("\nΔ (B − A):  Top@1={:+.4f}  MRR={:+.4f}".format(d_top1, d_mrr))

    return resA, resB

In [12]:
#@title 8.1) A/B test: scheduler 'none' vs 'onecycle' (same hparams)
cfgA = BASE_CONFIG.copy()
cfgA["scheduler"] = "none"

cfgB = BASE_CONFIG.copy()
cfgB["scheduler"] = "onecycle"

compare_ab(cfgA, cfgB, tie_A=False, tie_B=False, ls_train=0.05, ls_test=0.0,
           label_A="none", label_B="onecycle")


>>> Run none
Epoch 1 | val loss 2.4192 ppl 11.24 Top@1 0.430 Top@3 0.669 Top@5 0.758 MRR 0.577
Epoch 2 | val loss 2.3566 ppl 10.55 Top@1 0.444 Top@3 0.680 Top@5 0.767 MRR 0.588
Epoch 3 | val loss 2.3282 ppl 10.26 Top@1 0.453 Top@3 0.687 Top@5 0.773 MRR 0.595
Epoch 4 | val loss 2.3258 ppl 10.23 Top@1 0.459 Top@3 0.690 Top@5 0.775 MRR 0.599
Epoch 5 | val loss 2.3287 ppl 10.26 Top@1 0.464 Top@3 0.694 Top@5 0.772 MRR 0.602
Epoch 6 | val loss 2.3492 ppl 10.48 Top@1 0.453 Top@3 0.693 Top@5 0.770 MRR 0.596
none: Top@1=0.4384  MRR=0.5829

>>> Run onecycle
Epoch 1 | val loss 2.4731 ppl 11.86 Top@1 0.423 Top@3 0.655 Top@5 0.745 MRR 0.567
Epoch 2 | val loss 2.3602 ppl 10.59 Top@1 0.451 Top@3 0.679 Top@5 0.767 MRR 0.591
Epoch 3 | val loss 2.3126 ppl 10.10 Top@1 0.461 Top@3 0.692 Top@5 0.775 MRR 0.601
Epoch 4 | val loss 2.2974 ppl 9.95 Top@1 0.469 Top@3 0.696 Top@5 0.778 MRR 0.607
Epoch 5 | val loss 2.2914 ppl 9.89 Top@1 0.465 Top@3 0.698 Top@5 0.778 MRR 0.605
Epoch 6 | val loss 2.2957 ppl 9.93 To

({'loss': 2.191780496691302,
  'ppl': 8.951136403560488,
  'Top@1': 0.43838831713557447,
  'Top@3': 0.6777937152672038,
  'Top@5': 0.7609857871520316,
  'MRR': 0.582893018788442},
 {'loss': 2.1547056823436495,
  'ppl': 8.62535121220173,
  'Top@1': 0.44321293572214315,
  'Top@3': 0.6841830750117325,
  'Top@5': 0.7624201332439359,
  'MRR': 0.5869966540891229})

In [13]:
#@title 8.2) A/B test: scheduler 'cosine' vs 'onecycle'
cfgA = BASE_CONFIG.copy()
cfgA["scheduler"] = "cosine"

cfgB = BASE_CONFIG.copy()
cfgB["scheduler"] = "onecycle"

compare_ab(cfgA, cfgB, tie_A=False, tie_B=False, ls_train=0.05, ls_test=0.0,
           label_A="cosine", label_B="onecycle")


>>> Run cosine
Epoch 1 | val loss 2.4185 ppl 11.23 Top@1 0.431 Top@3 0.671 Top@5 0.758 MRR 0.577
Epoch 2 | val loss 2.3394 ppl 10.37 Top@1 0.449 Top@3 0.687 Top@5 0.773 MRR 0.594
Epoch 3 | val loss 2.3022 ppl 10.00 Top@1 0.463 Top@3 0.692 Top@5 0.776 MRR 0.603
Epoch 4 | val loss 2.2928 ppl 9.90 Top@1 0.466 Top@3 0.697 Top@5 0.780 MRR 0.606
Epoch 5 | val loss 2.2906 ppl 9.88 Top@1 0.468 Top@3 0.697 Top@5 0.780 MRR 0.607
Epoch 6 | val loss 2.2938 ppl 9.91 Top@1 0.467 Top@3 0.698 Top@5 0.779 MRR 0.607
cosine: Top@1=0.4433  MRR=0.5879

>>> Run onecycle
Epoch 1 | val loss 2.4731 ppl 11.86 Top@1 0.423 Top@3 0.655 Top@5 0.745 MRR 0.567
Epoch 2 | val loss 2.3602 ppl 10.59 Top@1 0.451 Top@3 0.679 Top@5 0.767 MRR 0.591
Epoch 3 | val loss 2.3126 ppl 10.10 Top@1 0.461 Top@3 0.692 Top@5 0.775 MRR 0.601
Epoch 4 | val loss 2.2974 ppl 9.95 Top@1 0.469 Top@3 0.696 Top@5 0.778 MRR 0.607
Epoch 5 | val loss 2.2914 ppl 9.89 Top@1 0.465 Top@3 0.698 Top@5 0.778 MRR 0.605
Epoch 6 | val loss 2.2957 ppl 9.93 T

({'loss': 2.1458166598776263,
  'ppl': 8.549020028907307,
  'Top@1': 0.44334333079597105,
  'Top@3': 0.6858782112823811,
  'Top@5': 0.7680271224133707,
  'MRR': 0.5879050129969904},
 {'loss': 2.1547056823436495,
  'ppl': 8.62535121220173,
  'Top@1': 0.44321293572214315,
  'Top@3': 0.6841830750117325,
  'Top@5': 0.7624201332439359,
  'MRR': 0.5869966540891229})

In [14]:
#@title 8.3) A/B test: tie_weights False vs True (same scheduler)
cfg_fix = BASE_CONFIG.copy()
cfg_fix["scheduler"] = "onecycle"  # or "none"

compare_ab(cfg_fix, cfg_fix, tie_A=False, tie_B=True, ls_train=0.05, ls_test=0.0,
           label_A="tie=False", label_B="tie=True")


>>> Run tie=False
Epoch 1 | val loss 2.4731 ppl 11.86 Top@1 0.423 Top@3 0.655 Top@5 0.745 MRR 0.567
Epoch 2 | val loss 2.3602 ppl 10.59 Top@1 0.451 Top@3 0.679 Top@5 0.767 MRR 0.591
Epoch 3 | val loss 2.3126 ppl 10.10 Top@1 0.461 Top@3 0.692 Top@5 0.775 MRR 0.601
Epoch 4 | val loss 2.2974 ppl 9.95 Top@1 0.469 Top@3 0.696 Top@5 0.778 MRR 0.607
Epoch 5 | val loss 2.2914 ppl 9.89 Top@1 0.465 Top@3 0.698 Top@5 0.778 MRR 0.605
Epoch 6 | val loss 2.2957 ppl 9.93 Top@1 0.468 Top@3 0.698 Top@5 0.778 MRR 0.606
⏹️ Early stopping (patience=2)
tie=False: Top@1=0.4432  MRR=0.5870

>>> Run tie=True
Epoch 1 | val loss 2.5040 ppl 12.23 Top@1 0.411 Top@3 0.664 Top@5 0.753 MRR 0.561
Epoch 2 | val loss 2.3834 ppl 10.84 Top@1 0.442 Top@3 0.683 Top@5 0.769 MRR 0.586
Epoch 3 | val loss 2.3310 ppl 10.29 Top@1 0.455 Top@3 0.689 Top@5 0.775 MRR 0.597
Epoch 4 | val loss 2.3270 ppl 10.25 Top@1 0.458 Top@3 0.698 Top@5 0.776 MRR 0.601
Epoch 5 | val loss 2.3231 ppl 10.21 Top@1 0.458 Top@3 0.699 Top@5 0.778 MRR 0.6

({'loss': 2.1547056823436495,
  'ppl': 8.62535121220173,
  'Top@1': 0.44321293572214315,
  'Top@3': 0.6841830750117325,
  'Top@5': 0.7624201332439359,
  'MRR': 0.5869966540891229},
 {'loss': 2.158039945063216,
  'ppl': 8.654158397866162,
  'Top@1': 0.44282174999158347,
  'Top@3': 0.692137175953082,
  'Top@5': 0.7701134439676802,
  'MRR': 0.5898252001702824})
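
The `⏹️ Early stopping (patience=2)` message in the tie=False run can be traced by hand. The sketch below replays the stopping rule of `train_once` on the rounded validation MRR values printed for that run; `early_stop_trace` is a hypothetical helper written for this illustration.

```python
def early_stop_trace(val_scores, patience):
    """Return (best_score, best_epoch, last_epoch_run) for a 'higher is better' metric."""
    best, best_epoch, no_improve = float("-inf"), 0, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, best_epoch, no_improve = score, epoch, 0
        else:
            no_improve += 1
            if no_improve >= patience:
                return best, best_epoch, epoch   # stop: patience exhausted
    return best, best_epoch, len(val_scores)

# Validation MRR per epoch as printed (rounded) in the tie=False run above
mrr_log = [0.567, 0.591, 0.601, 0.607, 0.605, 0.606]
print(early_stop_trace(mrr_log, patience=2))  # (0.607, 4, 6)
```

Epochs 5 and 6 both fail to beat epoch 4, so training stops at epoch 6 and the weights restored for the test evaluation are those of epoch 4.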

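Several training settings were compared:
- scheduler: 'cosine', 'onecycle', 'none'
- tie_weights: True, False.

Both schedulers gave slightly better test metrics than no scheduler (Top@1 0.4384 vs 0.4432 for OneCycle and 0.4433 for cosine; MRR 0.5829 vs 0.5870 and 0.5879). Weight tying, tested with OneCycle, raised MRR (0.5870 to 0.5898) and left Top@1 practically unchanged (0.4432 to 0.4428).

Finally, the configuration with scheduler 'cosine' and tie_weights=True was selected, although the differences with respect to OneCycle without tying are small: every comparison above differs by less than 0.005 in Top@1 and MRR.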
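
For intuition about the selected scheduler, the following sketch evaluates the standard closed-form cosine annealing curve, assuming a minimum learning rate of 0 and one scheduler step per batch as in `train_once`. The function `cosine_lr` and the step count of 100 batches per epoch are illustrative assumptions; the real number of batches depends on `seq_len` and `batch_size`.

```python
import math

def cosine_lr(step, total_steps, base_lr, eta_min=0.0):
    """Closed-form cosine annealing from base_lr at step 0 to eta_min at total_steps."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term

total = 6 * 100          # hypothetical: 6 epochs x 100 batches per epoch
for step in (0, total // 2, total):
    print(step, cosine_lr(step, total, base_lr=2e-3))
```

The rate starts at the base value, passes through half of it at the midpoint and decays to (numerically) zero on the last step, with no warm-up phase, which is the main difference from OneCycle.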

## 9) Random Search (search space, results, checkpoints)
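
The learning rate in this section is drawn log-uniformly, meaning its logarithm is uniform over the range, so each order of magnitude is equally likely. A minimal standalone sketch follows; it uses a local `random.Random` generator with an arbitrary seed, which is an assumption of this example (the notebook's `log_uniform` uses the global generator).

```python
import math
import random

def log_uniform(low, high, rng):
    """Sample so that log(value) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)   # local generator; the seed value is arbitrary
samples = [log_uniform(5e-4, 5e-3, rng) for _ in range(10_000)]

geo_mid = math.sqrt(5e-4 * 5e-3)          # geometric midpoint of the range
share_below = sum(s < geo_mid for s in samples) / len(samples)
print(round(geo_mid, 6), round(share_below, 2))
```

About half of the samples should fall below the geometric midpoint (roughly 1.6e-3), whereas a plain uniform draw on the same range would put only about a quarter of them there.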

In [15]:
import math, time, copy, random, os, gc
import pandas as pd
from pprint import pprint
import torch.nn as nn

# Base configuration for the Random Search
base_epochs = 6
base_amp = True
base_save_dir = "/content/models_rs"

# Scheduler parameters (for OneCycle, although cosine is used by default)
base_pct_start = 0.15
base_div_factor = 10.0
base_final_div_factor = 1e3

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def log_uniform(a,b):
    return math.exp(random.uniform(math.log(a), math.log(b)))

def sample_cfg():
    """Generates a random configuration for the Random Search"""
    cfg_local = {}

    # Model architecture
    cfg_local["embedding_dim"]   = random.choice([128,160,192])
    cfg_local["hidden_size"]     = random.choice([256,320,384])
    cfg_local["num_layers"]      = random.choice([1,2])
    cfg_local["dropout"]         = random.choice([0.2,0.3])

    # Optimization
    cfg_local["lr"]              = log_uniform(5e-4, 5e-3)
    cfg_local["weight_decay"]    = random.choice([0.0, 1e-4, 5e-4])
    cfg_local["grad_clip"]       = random.choice([0.5, 1.0])

    # Data and training
    cfg_local["batch_size"]      = random.choice([64, 128])
    cfg_local["seq_len"]         = random.choice([16, 24])

    # Regularization
    cfg_local["label_smoothing"] = random.choice([0.03, 0.05])
    cfg_local["tie_weights"]     = True # fixed after the A/B tests

    # Scheduler (fixed after the A/B tests)
    cfg_local["scheduler"] = "cosine"

    # Training parameters (fixed)
    cfg_local["epochs"]   = base_epochs
    cfg_local["patience"] = 2
    cfg_local["amp"]      = base_amp

    # OneCycle params (unused with cosine, kept for compatibility)
    cfg_local["pct_start"]      = base_pct_start
    cfg_local["div_factor"]     = base_div_factor
    cfg_local["final_div_factor"]= base_final_div_factor

    return cfg_local

RESULTS = []
BEST = {"mrr": -1, "path": None, "cfg": None, "test": None}

N_TRIALS = 20  # number of configurations to try

for t in range(1, N_TRIALS+1):
    trial_cfg = sample_cfg()
    print(f"\n=== TRIAL {t}/{N_TRIALS} ===")
    print({k:trial_cfg[k] for k in ["embedding_dim","hidden_size","num_layers","dropout",
                                     "seq_len","batch_size","lr","weight_decay","grad_clip",
                                     "label_smoothing"]})

    # Trial-specific DataLoaders
    train_loader_t, val_loader_t, test_loader_t, n_test = make_dataloaders(
        train_seqs, val_seqs, test_seqs,
        seq_len=trial_cfg["seq_len"], batch_size=trial_cfg["batch_size"]
    )

    # Model for this trial
    model_t = ChordLSTM(vocab_size=len(vocab),
                        embedding_dim=trial_cfg["embedding_dim"],
                        hidden_size=trial_cfg["hidden_size"],
                        num_layers=trial_cfg["num_layers"],
                        dropout=trial_cfg["dropout"],
                        tie_weights=trial_cfg["tie_weights"]).to(device)

    # Training with early stopping on validation MRR
    save_path = os.path.join(base_save_dir, f"rs_trial_{t}.pt")
    best_mrr_val = train_once(
        model_t, train_loader_t, val_loader_t,
        epochs=trial_cfg["epochs"], lr=trial_cfg["lr"], weight_decay=trial_cfg["weight_decay"],
        grad_clip=trial_cfg["grad_clip"], amp=trial_cfg["amp"],
        save_path=save_path, label_smoothing=trial_cfg["label_smoothing"],
        patience=trial_cfg["patience"], device=device,
        scheduler_type=trial_cfg["scheduler"],
        pct_start=trial_cfg["pct_start"], div_factor=trial_cfg["div_factor"],
        final_div_factor=trial_cfg["final_div_factor"]
    )

    # Test evaluation with the same label smoothing as in training; test loss/ppl are
    # therefore not directly comparable with the A/B runs, which use 0.0 in test
    test_crit = nn.CrossEntropyLoss(label_smoothing=trial_cfg["label_smoothing"])
    testm = evaluate(model_t, test_loader_t, test_crit, device)


    row = {"trial": t, **{k:trial_cfg[k] for k in ["embedding_dim","hidden_size","num_layers","dropout",
                                                   "seq_len","batch_size","lr","weight_decay","grad_clip",
                                                   "label_smoothing","tie_weights","scheduler"]},
           "val_best_MRR": best_mrr_val,
           **{k:testm[k] for k in ["loss","ppl","Top@1","Top@3","Top@5","MRR"]},
           "n_test": n_test, "ckpt": save_path}
    RESULTS.append(row)

    # Track the best trial by test MRR
    if testm["MRR"] > BEST["mrr"]:
        BEST = {"mrr": testm["MRR"], "path": save_path, "cfg": trial_cfg, "test": testm}

    # Cleanup
    del model_t; gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# Show results
df_rs = pd.DataFrame(RESULTS).sort_values(by="MRR", ascending=False)
print("Top 10 trials por MRR (test):")
display(df_rs.head(10))

# Save a CSV with all the trials
os.makedirs(base_save_dir, exist_ok=True)
csv_path = os.path.join(base_save_dir, "random_search_results.csv")
df_rs.to_csv(csv_path, index=False)
print("Resultados guardados en:", csv_path)

print("\n=== MEJOR ENSAYO (por MRR en test) ===")
print("Config:", {k:BEST["cfg"][k] for k in ["embedding_dim","hidden_size","num_layers","dropout",
                                             "seq_len","batch_size","lr","weight_decay","grad_clip",
                                             "label_smoothing","tie_weights", "scheduler"]})
print("Metrics (test):", BEST["test"])
print("Checkpoint:", BEST["path"])


=== TRIAL 1/20 ===
{'embedding_dim': 192, 'hidden_size': 256, 'num_layers': 1, 'dropout': 0.3, 'seq_len': 24, 'batch_size': 64, 'lr': 0.0008787429588337066, 'weight_decay': 0.0, 'grad_clip': 0.5, 'label_smoothing': 0.03}
Epoch 1 | val loss 2.3949 ppl 10.97 Top@1 0.415 Top@3 0.664 Top@5 0.754 MRR 0.566
Epoch 2 | val loss 2.3093 ppl 10.07 Top@1 0.440 Top@3 0.676 Top@5 0.764 MRR 0.585
Epoch 3 | val loss 2.2728 ppl 9.71 Top@1 0.442 Top@3 0.685 Top@5 0.771 MRR 0.589
Epoch 4 | val loss 2.2382 ppl 9.38 Top@1 0.454 Top@3 0.689 Top@5 0.776 MRR 0.597
Epoch 5 | val loss 2.2269 ppl 9.27 Top@1 0.454 Top@3 0.694 Top@5 0.777 MRR 0.599
Epoch 6 | val loss 2.2241 ppl 9.25 Top@1 0.459 Top@3 0.694 Top@5 0.777 MRR 0.601

=== TRIAL 2/20 ===
{'embedding_dim': 128, 'hidden_size': 256, 'num_layers': 1, 'dropout': 0.2, 'seq_len': 16, 'batch_size': 128, 'lr': 0.0016007565680120513, 'weight_decay': 0.0, 'grad_clip': 0.5, 'label_smoothing': 0.05}
Epoch 1 | val loss 2.4322 ppl 11.38 Top@1 0.433 Top@3 0.673 Top@5 0

| | trial | embedding_dim | hidden_size | num_layers | dropout | seq_len | batch_size | lr | weight_decay | grad_clip | ... | scheduler | val_best_MRR | loss | ppl | Top@1 | Top@3 | Top@5 | MRR | n_test | ckpt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 17 | 192 | 320 | 2 | 0.3 | 16 | 64 | 0.001371 | 0.0005 | 1.0 | ... | cosine | 0.603018 | 2.236275 | 9.358407 | 0.451197 | 0.69948 | 0.779188 | 0.597337 | 9610 | /content/models_rs/rs_trial_17.pt |
| 8 | 9 | 160 | 256 | 1 | 0.3 | 16 | 128 | 0.000816 | 0.0001 | 1.0 | ... | cosine | 0.601694 | 2.324085 | 10.217326 | 0.449844 | 0.693132 | 0.777836 | 0.594969 | 9610 | /content/models_rs/rs_trial_9.pt |
| 12 | 13 | 160 | 256 | 2 | 0.2 | 16 | 64 | 0.004496 | 0.0005 | 1.0 | ... | cosine | 0.607986 | 2.32909 | 10.268592 | 0.449116 | 0.691675 | 0.777523 | 0.594903 | 9610 | /content/models_rs/rs_trial_13.pt |
| 1 | 2 | 128 | 256 | 1 | 0.2 | 16 | 128 | 0.001601 | 0.0 | 0.5 | ... | cosine | 0.604191 | 2.339745 | 10.378589 | 0.448595 | 0.695317 | 0.775338 | 0.594343 | 9610 | /content/models_rs/rs_trial_2.pt |
| 4 | 5 | 160 | 256 | 2 | 0.3 | 16 | 64 | 0.001889 | 0.0005 | 0.5 | ... | cosine | 0.604717 | 2.333321 | 10.312136 | 0.449324 | 0.693548 | 0.775442 | 0.594256 | 9610 | /content/models_rs/rs_trial_5.pt |
| 11 | 12 | 192 | 320 | 2 | 0.2 | 16 | 128 | 0.002395 | 0.0 | 1.0 | ... | cosine | 0.604657 | 2.379006 | 10.794168 | 0.450156 | 0.692404 | 0.770552 | 0.594072 | 9610 | /content/models_rs/rs_trial_12.pt |
| 9 | 10 | 128 | 256 | 2 | 0.3 | 16 | 64 | 0.003952 | 0.0001 | 1.0 | ... | cosine | 0.604258 | 2.323171 | 10.207992 | 0.448179 | 0.693132 | 0.776067 | 0.594039 | 9610 | /content/models_rs/rs_trial_10.pt |
| 17 | 18 | 128 | 320 | 1 | 0.2 | 16 | 64 | 0.001938 | 0.0 | 0.5 | ... | cosine | 0.600129 | 2.292265 | 9.897333 | 0.449948 | 0.689594 | 0.769615 | 0.593399 | 9610 | /content/models_rs/rs_trial_18.pt |
| 18 | 19 | 128 | 320 | 1 | 0.2 | 24 | 64 | 0.000949 | 0.0001 | 0.5 | ... | cosine | 0.604553 | 2.290088 | 9.875811 | 0.444778 | 0.692007 | 0.773373 | 0.590615 | 7669 | /content/models_rs/rs_trial_19.pt |
| 6 | 7 | 128 | 384 | 2 | 0.2 | 24 | 64 | 0.002033 | 0.0 | 0.5 | ... | cosine | 0.603992 | 2.363687 | 10.630075 | 0.444647 | 0.688747 | 0.774547 | 0.590102 | 7669 | /content/models_rs/rs_trial_7.pt |


Resultados guardados en: /content/models_rs/random_search_results.csv

=== MEJOR ENSAYO (por MRR en test) ===
Config: {'embedding_dim': 192, 'hidden_size': 320, 'num_layers': 2, 'dropout': 0.3, 'seq_len': 16, 'batch_size': 64, 'lr': 0.0013711030226214714, 'weight_decay': 0.0005, 'grad_clip': 1.0, 'label_smoothing': 0.03, 'tie_weights': True, 'scheduler': 'cosine'}
Metrics (test): {'loss': 2.236275031737805, 'ppl': 9.358406513740468, 'Top@1': 0.45119667014147813, 'Top@3': 0.6994797086368366, 'Top@5': 0.7791883455106793, 'MRR': 0.5973372343427557}
Checkpoint: /content/models_rs/rs_trial_17.pt
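
The winner above is picked by test MRR. A common alternative is to select on the validation metric (`val_best_MRR`) and reserve the test set for the final report. The sketch below uses only the ten rows printed above (the full results CSV may contain more trials) to show that the two criteria need not agree:

```python
# (trial, val_best_MRR, test MRR) copied from the ten rows shown above.
rows = [
    (17, 0.603018, 0.597337),
    (9, 0.601694, 0.594969),
    (13, 0.607986, 0.594903),
    (2, 0.604191, 0.594343),
    (5, 0.604717, 0.594256),
    (12, 0.604657, 0.594072),
    (10, 0.604258, 0.594039),
    (18, 0.600129, 0.593399),
    (19, 0.604553, 0.590615),
    (7, 0.603992, 0.590102),
]

best_by_test = max(rows, key=lambda r: r[2])  # criterion used in the notebook
best_by_val = max(rows, key=lambda r: r[1])   # alternative: select on validation

print("best by test MRR:", best_by_test[0])   # trial 17
print("best by val MRR: ", best_by_val[0])    # trial 13
```

Among these rows, selecting on validation MRR would pick trial 13 rather than trial 17. The gaps are small (a few thousandths), so the ranking is probably within run-to-run noise either way.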


Warnings to review:
/tmp/ipython-input-1199272555.py:31: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  scaler = torch.cuda.amp.GradScaler(enabled=(amp and device.type=='cuda'))
/tmp/ipython-input-1199272555.py:43: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=(amp and device.type=='cuda')):

## 10) Package the best checkpoint with metadata and an informative name

In [17]:
# === Retrain ONLY the best trial from the CSV and export it ===
import os, json, sys, datetime, torch, pandas as pd
import torch.nn as nn

# File paths
CSV_PATH = "models_rs/random_search_results.csv"
TOKENIZER_PATH = "lstm_tokenizer.json"

# 1) Read the best row by MRR (the "MRR" column holds the test-set MRR)
df = pd.read_csv(CSV_PATH)
best = df.sort_values("MRR", ascending=False).iloc[0]

cfg_best = {
    "embedding_dim":   int(best["embedding_dim"]),
    "hidden_size":     int(best["hidden_size"]),
    "num_layers":      int(best["num_layers"]),
    "dropout":         float(best["dropout"]),
    "seq_len":         int(best["seq_len"]),
    "batch_size":      int(best["batch_size"]),
    "lr":              float(best["lr"]),
    "weight_decay":    float(best["weight_decay"]),
    "grad_clip":       float(best["grad_clip"]),
    "label_smoothing": float(best["label_smoothing"]),
    "tie_weights":     bool(best.get("tie_weights", True)),
    "scheduler":       str(best.get("scheduler", "cosine")),
}

print("Reentrenando config:", cfg_best)

# 2) DataLoaders for the winning config (uses the functions defined earlier)
train_loader, val_loader, test_loader, n_test = make_dataloaders(
    train_seqs, val_seqs, test_seqs,
    seq_len=cfg_best["seq_len"], batch_size=cfg_best["batch_size"]
)

# 3) Model
model = ChordLSTM(
    vocab_size=len(vocab),
    embedding_dim=cfg_best["embedding_dim"],
    hidden_size=cfg_best["hidden_size"],
    num_layers=cfg_best["num_layers"],
    dropout=cfg_best["dropout"],
    tie_weights=cfg_best["tie_weights"]
).to(device)

# 4) Train with the same scheme (early stopping on MRR)
_ = train_once(
    model, train_loader, val_loader,
    epochs=base_epochs, lr=cfg_best["lr"], weight_decay=cfg_best["weight_decay"],
    grad_clip=cfg_best["grad_clip"], amp=base_amp,
    save_path=None, label_smoothing=cfg_best["label_smoothing"],
    patience=2, device=device,
    scheduler_type=cfg_best["scheduler"], pct_start=base_pct_start,
    div_factor=base_div_factor, final_div_factor=base_final_div_factor
)

# 5) Test with the same criterion as training (the reported loss includes label smoothing)
test_crit = nn.CrossEntropyLoss(label_smoothing=cfg_best["label_smoothing"])
testm = evaluate(model, test_loader, test_crit, device)
print("Test:", testm)

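# 6) Export with the tokenizer's vocab (reproducible)
with open(TOKENIZER_PATH, "r", encoding="utf-8") as f:
    tok = json.load(f)
# Guard: the exported mapping must match the vocab size the model was built with
assert len(tok["vocab"]) == len(vocab), "tokenizer vocab size differs from the training vocab"
vocab = tok["vocab"]
stoi = {t:i for i,t in enumerate(vocab)}
itos = {i:t for i,t in enumerate(vocab)}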

export = {
    "model_state": model.state_dict(),
    "model_class": "ChordLSTM",
    "config": cfg_best,
    "metrics_test": testm,
    "stoi": stoi,
    "itos": itos,
    "vocab_size": len(stoi),
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp instead
    "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z"),
    "env": {"python": sys.version, "torch": torch.__version__, "cuda_available": torch.cuda.is_available()},
}

out_dir = "."
name_info = f"Top1-{testm['Top@1']:.4f}_MRR-{testm['MRR']:.4f}_ppl-{testm['ppl']:.3f}"
best_named = os.path.join(out_dir, f"lstm_rs_best__{name_info}.pt")
stable_best = os.path.join(out_dir, "lstm_rs_best.pt")

torch.save(export, best_named)
torch.save(export, stable_best)

print("✅ Guardado")
print(" ├ OUT 1:", best_named)
print(" └ OUT 2:", stable_best)

Reentrenando config: {'embedding_dim': 192, 'hidden_size': 320, 'num_layers': 2, 'dropout': 0.3, 'seq_len': 16, 'batch_size': 64, 'lr': 0.0013711030226214, 'weight_decay': 0.0005, 'grad_clip': 1.0, 'label_smoothing': 0.03, 'tie_weights': True, 'scheduler': 'cosine'}
Epoch 1 | val loss 2.3647 ppl 10.64 Top@1 0.421 Top@3 0.661 Top@5 0.753 MRR 0.569
Epoch 2 | val loss 2.2803 ppl 9.78 Top@1 0.450 Top@3 0.686 Top@5 0.769 MRR 0.593
Epoch 3 | val loss 2.2436 ppl 9.43 Top@1 0.451 Top@3 0.692 Top@5 0.777 MRR 0.597
Epoch 4 | val loss 2.2281 ppl 9.28 Top@1 0.461 Top@3 0.697 Top@5 0.777 MRR 0.603
Epoch 5 | val loss 2.2200 ppl 9.21 Top@1 0.466 Top@3 0.696 Top@5 0.781 MRR 0.606
Epoch 6 | val loss 2.2258 ppl 9.26 Top@1 0.467 Top@3 0.697 Top@5 0.778 MRR 0.606
Test: {'loss': 2.256385751321338, 'ppl': 9.548516015025047, 'Top@1': 0.44443288242655665, 'Top@3': 0.6949011446534037, 'Top@5': 0.7798126951464753, 'MRR': 0.5928952473135322}
✅ Guardado
 ├ OUT 1: ./lstm_rs_best__Top1-0.4444_MRR-0.5929_ppl-9.549.p
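
The exported file bundles the weights together with the config and the vocabulary mappings, so a consumer can rebuild the model without the notebook state. The sketch below illustrates that round trip on a toy export. `TinyChordLSTM` is a hypothetical stand-in for the notebook's `ChordLSTM` (defined in an earlier cell and not reproduced here), and the vocabulary and sizes are made up.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyChordLSTM(nn.Module):
    """Hypothetical stand-in for the notebook's ChordLSTM."""

    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers, dropout):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers=num_layers,
                            dropout=dropout if num_layers > 1 else 0.0,
                            batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.out(h)


def build_model(ckpt):
    """Rebuild the model from the fields stored in the export dict."""
    cfg = ckpt["config"]
    model = TinyChordLSTM(
        vocab_size=ckpt["vocab_size"],
        embedding_dim=cfg["embedding_dim"],
        hidden_size=cfg["hidden_size"],
        num_layers=cfg["num_layers"],
        dropout=cfg["dropout"],
    )
    model.load_state_dict(ckpt["model_state"])
    return model.eval()


# Toy export with the same key names as the export dict above (made-up values).
vocab = ["<pad>", "<unk>", "I", "IV", "V"]
cfg = {"embedding_dim": 8, "hidden_size": 16, "num_layers": 2, "dropout": 0.3}
toy = TinyChordLSTM(len(vocab), **cfg)
export = {
    "model_state": toy.state_dict(),
    "config": cfg,
    "stoi": {t: i for i, t in enumerate(vocab)},
    "itos": {i: t for i, t in enumerate(vocab)},
    "vocab_size": len(vocab),
}
path = os.path.join(tempfile.mkdtemp(), "toy_export.pt")
torch.save(export, path)

# Load on CPU and predict the next token for a two-chord context.
ckpt = torch.load(path, map_location="cpu")
model = build_model(ckpt)
ids = torch.tensor([[ckpt["stoi"]["I"], ckpt["stoi"]["IV"]]])
with torch.no_grad():
    logits = model(ids)  # shape: (batch, seq_len, vocab_size)
next_id = int(logits[0, -1].argmax())
print("predicted next token:", ckpt["itos"][next_id])
```

Because the toy weights are untrained, the predicted token is arbitrary; the point is only that `config`, `vocab_size`, `stoi`, and `itos` are sufficient to rebuild and query the model.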
