**Note**: We recommend running these notebooks in Google Colab to ensure compatibility and avoid dependency problems. Training the models requires a significant amount of RAM and compute, which may not be available in every local environment.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/antoniotrapote/chord-prediction-tfm/blob/main/anexos/notebooks/03_modelado/03_modelo_gru.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-black?logo=github)](https://github.com/antoniotrapote/chord-prediction-tfm/blob/main/anexos/notebooks/03_modelado/03_modelo_gru.ipynb)

# Gated Recurrent Unit (GRU) model - PyTorch

We used PyTorch to implement and train a recurrent neural network based on Gated Recurrent Units (GRU) for chord prediction in musical sequences.

The most recent dataset used was `songdb_funcional_v4`.

Notebook contents:
1. Environment (Colab) - version check
2. Parameter configuration
3. Fetching the CSV into Colab


In [1]:
import pandas as pd
import numpy as np
import re
from collections import Counter, defaultdict
from pathlib import Path
from typing import List, Tuple, Dict

## 1) Environment (Colab)
The model was trained on Google Colab with the following specifications:
>Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]  
>PyTorch: 2.8.0+cu126  
>CUDA available: True  
>GPU: Tesla T4

In [1]:
#@title Check GPU/versions
import sys, torch
print("Python:", sys.version)
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("⚠️ Enable the GPU: Runtime ▶ Change runtime type ▶ GPU")


Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
PyTorch: 2.8.0+cu126
CUDA available: True
GPU: Tesla T4


## 2) Parameter and path configuration

In [None]:
from dataclasses import dataclass

@dataclass
class Config:
    # Path to the CSV in Colab (upload the file or mount it from Drive)
    data_path: str = "/content/songdb_funcional_v4.csv"
    # 🔧 Choose the sequence column (no auto-detection):
    sequence_col: str = "funcional_prog"

    # Splits and seed
    val_size: float = 0.10
    test_size: float = 0.10
    random_state: int = 42

    # Model/training
    seq_len: int = 24
    batch_size: int = 128
    epochs: int = 6
    lr: float = 2e-3
    weight_decay: float = 1e-4
    dropout: float = 0.2
    embedding_dim: int = 128
    hidden_size: int = 256
    num_layers: int = 2
    grad_clip: float = 1.0
    amp: bool = True

    # Saving
    save_dir: str = "/content/models_gru_v1"
    save_name: str = "gru_best.pt"
    tokenizer_path: str = "gru_tokenizer.json"

    # Filtering
    min_seq_len: int = 8  # skip very short sequences

cfg = Config()
print(cfg)


Config(data_path='https://raw.githubusercontent.com/antoniotrapote/chord-prediction-tfm/main/anexos/datasets/songdb_funcional_v4.csv', sequence_col='funcional_prog', val_size=0.1, test_size=0.1, random_state=42, seq_len=24, batch_size=128, epochs=6, lr=0.002, weight_decay=0.0001, dropout=0.2, embedding_dim=128, hidden_size=256, num_layers=2, grad_clip=1.0, amp=True, save_dir='/content/models_gru_v1', save_name='gru_best.pt', tokenizer_path='gru_tokenizer.json', min_seq_len=8)
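Because the configuration is a dataclass, a variant for a quick experiment can be derived without editing the class. The sketch below uses a trimmed-down, hypothetical `MiniConfig` (not the notebook's full `Config`) to show the pattern with the standard-library `dataclasses.replace`.

```python
from dataclasses import dataclass, replace

@dataclass
class MiniConfig:
    # Hypothetical stand-in holding only three of the notebook's fields
    seq_len: int = 24
    hidden_size: int = 256
    lr: float = 2e-3

base = MiniConfig()
# replace() returns a modified copy; `base` itself is left untouched
short_ctx = replace(base, seq_len=16, lr=1e-3)

print(base)
print(short_ctx)
```

Keeping the baseline object unchanged makes it easier to compare runs side by side.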


## 2) Fetch the CSV into Colab

In [None]:
import urllib.request
# Fetch the dataset from GitHub
# Settings for locating `songdb_funcional_v4.csv` in the repository
USER = "antoniotrapote"
REPO = "chord-prediction-tfm"
BRANCH = "main"
PATH_IN_REPO = "anexos/data/songdb_funcional_v4.csv"
URL = f"https://raw.githubusercontent.com/{USER}/{REPO}/{BRANCH}/{PATH_IN_REPO}"

# Download the CSV file to cfg.data_path
urllib.request.urlretrieve(URL, cfg.data_path)
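When the cell is re-run often, it may help to skip the download if the file is already present. The following is a minimal sketch of that idea; `build_raw_url` and `download_if_missing` are hypothetical helper names introduced here, not part of the notebook.

```python
import urllib.request
from pathlib import Path

def build_raw_url(user: str, repo: str, branch: str, path_in_repo: str) -> str:
    """Compose a raw.githubusercontent.com URL from its parts."""
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path_in_repo}"

def download_if_missing(url: str, dest: str) -> Path:
    """Download `url` to `dest` unless a non-empty file is already there."""
    dest_path = Path(dest)
    if dest_path.exists() and dest_path.stat().st_size > 0:
        return dest_path  # reuse the existing copy, no network access
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest_path))
    return dest_path

raw_url = build_raw_url(
    "antoniotrapote", "chord-prediction-tfm", "main",
    "anexos/data/songdb_funcional_v4.csv",
)
print(raw_url)
```

In the notebook this would be called as `download_if_missing(raw_url, cfg.data_path)`. Note that `urlretrieve` raises `urllib.error.HTTPError` if the path in the repository does not exist.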

## 3) Load the CSV and tokenize (whitespace)

In [3]:
#@title Load + tokenize
import pandas as pd, ast, re

df = pd.read_csv(cfg.data_path)
assert cfg.sequence_col in df.columns, f"Column {cfg.sequence_col} not found in the CSV."
print("Total rows:", len(df))
display(df[[cfg.sequence_col]].head(3))

def parse_tokens_simple(s: str):
    if isinstance(s, str) and s.strip().startswith("[") and s.strip().endswith("]"):
        try:
            lst = ast.literal_eval(s)
            if isinstance(lst, list):
                return [str(t) for t in lst]
        except Exception:
            pass
    # Normalize bar separators and line breaks to spaces
    s = str(s).replace("|", " ").replace("\n", " ")
    toks = [t for t in re.findall(r"\S+", s) if t.strip()]
    return toks

df["_tokens_"] = df[cfg.sequence_col].apply(parse_tokens_simple)
df = df[df["_tokens_"].apply(len) >= cfg.min_seq_len].reset_index(drop=True)
print("Rows after min_seq_len filter:", len(df))

HTTPError: HTTP Error 404: Not Found
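To make the two input formats handled by the tokenizer concrete, here is a standalone sketch of the same parsing idea applied to made-up progressions. `parse_tokens_sketch` is a hypothetical name, and the example strings are not taken from the dataset.

```python
import ast
import re

def parse_tokens_sketch(s):
    """Parse either a Python-style list literal or a bar-separated string."""
    text = str(s).strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            parsed = ast.literal_eval(text)
            if isinstance(parsed, list):
                return [str(t) for t in parsed]
        except (ValueError, SyntaxError):
            pass  # not a valid literal: fall back to whitespace splitting
    # "|" is treated as a separator; \S+ already skips spaces and newlines
    return re.findall(r"\S+", str(s).replace("|", " "))

plain = parse_tokens_sketch("I VI | II V |\nI")
listed = parse_tokens_sketch("['I', 'IV', 'V']")
print(plain)
print(listed)
```

Both calls yield a flat list of chord tokens, so the rest of the pipeline does not need to know which format the CSV column used.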

## 4) Train/val/test split (simple, row-wise)

In [6]:

#@title Split
from sklearn.model_selection import train_test_split
train_df, tmp_df = train_test_split(df, test_size=cfg.val_size+cfg.test_size, random_state=cfg.random_state, shuffle=True)
rel_test = cfg.test_size / (cfg.val_size + cfg.test_size) if (cfg.val_size + cfg.test_size) > 0 else 0.5
val_df, test_df = train_test_split(tmp_df, test_size=rel_test, random_state=cfg.random_state, shuffle=True)

train_seqs = train_df["_tokens_"].tolist()
val_seqs   = val_df["_tokens_"].tolist()
test_seqs  = test_df["_tokens_"].tolist()
print(len(train_seqs), len(val_seqs), len(test_seqs))


2089 261 262
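The split is done in two stages: first 20% of the rows are held out, then that held-out part is halved into validation and test. The sketch below reproduces the arithmetic in plain Python. It assumes that scikit-learn rounds the held-out count up for a fractional `test_size`, and that the filtered table has 2,612 rows (the sum of the three sizes printed above); both are assumptions, not values stated in the notebook.

```python
import math

def two_stage_split_sizes(n_rows, val_size, test_size):
    """Sizes of train/val/test for a holdout split followed by a val/test split."""
    holdout = val_size + test_size
    n_tmp = math.ceil(holdout * n_rows)      # rows held out in the first stage
    n_train = n_rows - n_tmp
    rel_test = test_size / holdout           # share of the holdout that becomes test
    n_test = math.ceil(rel_test * n_tmp)
    n_val = n_tmp - n_test
    return n_train, n_val, n_test

sizes = two_stage_split_sizes(2612, 0.10, 0.10)
print(sizes)
```

Under these assumptions the result agrees with the printed sizes, which explains why the test set ends up one row larger than the validation set.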


## 5) Vocabulary and encoding

In [7]:

#@title Vocab + encode
from collections import Counter
import json
PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"

def build_vocab(seqs, min_freq=1):
    c = Counter()
    for s in seqs: c.update(s)
    vocab = [PAD, UNK, BOS, EOS] + [t for t,f in c.items() if f >= min_freq and t not in {PAD,UNK,BOS,EOS}]
    stoi = {t:i for i,t in enumerate(vocab)}
    itos = {i:t for t,i in stoi.items()}
    return vocab, stoi, itos

vocab, stoi, itos = build_vocab(train_seqs, 1)
vocab_size = len(vocab)
print("Vocab size:", vocab_size)

with open(cfg.tokenizer_path, "w") as f:
    json.dump({"vocab": vocab}, f, ensure_ascii=False, indent=2)

def encode(seq, add_bos=True):
    ids = [stoi[BOS]] if add_bos else []
    ids += [stoi.get(t, stoi[UNK]) for t in seq]
    return ids


Vocab size: 86
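Since the vocabulary is built from the training split only, a chord that appears solely in validation or test data is mapped to `<unk>`. The toy example below illustrates this with made-up progressions; the `*_sketch` and `toy_*` names are hypothetical and chosen so that they do not overwrite the notebook's own `vocab` and `stoi`.

```python
from collections import Counter

PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"
SPECIALS = [PAD, UNK, BOS, EOS]

def build_vocab_sketch(seqs, min_freq=1):
    """Special tokens first, then training tokens in order of first appearance."""
    counts = Counter(tok for seq in seqs for tok in seq)
    vocab = SPECIALS + [t for t, f in counts.items() if f >= min_freq and t not in SPECIALS]
    return vocab, {t: i for i, t in enumerate(vocab)}

toy_train = [["I", "V", "I"], ["IV", "V"]]  # made-up progressions
toy_vocab, toy_stoi = build_vocab_sketch(toy_train)

def encode_sketch(seq, add_bos=True):
    ids = [toy_stoi[BOS]] if add_bos else []
    return ids + [toy_stoi.get(t, toy_stoi[UNK]) for t in seq]

toy_ids = encode_sketch(["I", "IV", "bIII"])  # "bIII" was never seen in toy_train
print(toy_vocab)
print(toy_ids)
```

The last id in `toy_ids` is the index of `<unk>`, because `bIII` does not occur in the toy training data.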


## 6) Dataset (context→next) and DataLoaders

In [8]:

#@title Dataset
import torch
from torch.utils.data import Dataset, DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class NextTokenDataset(Dataset):
    def __init__(self, sequences, seq_len):
        self.samples = []
        for seq in sequences:
            ids = encode(seq, add_bos=True)
            if len(ids) <= seq_len: continue
            for i in range(seq_len, len(ids)):
                self.samples.append((ids[i-seq_len:i], ids[i]))
    def __len__(self): return len(self.samples)
    def __getitem__(self, idx):
        x, y = self.samples[idx]
        return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)

train_data = NextTokenDataset(train_seqs, cfg.seq_len)
val_data   = NextTokenDataset(val_seqs,   cfg.seq_len)
test_data  = NextTokenDataset(test_seqs,  cfg.seq_len)

train_loader = DataLoader(train_data, batch_size=cfg.batch_size, shuffle=True, drop_last=True)
val_loader   = DataLoader(val_data,   batch_size=cfg.batch_size, shuffle=False)
test_loader  = DataLoader(test_data,  batch_size=cfg.batch_size, shuffle=False)

len(train_data), len(val_data), len(test_data)


(60979, 7044, 7669)
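Each song contributes one training pair per position once a full context of `seq_len` ids is available. A small pure-Python sketch of that sliding window follows; the ids are made up and `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(ids, seq_len):
    """Return (context, target) pairs: seq_len ids followed by the next id."""
    return [(ids[i - seq_len:i], ids[i]) for i in range(seq_len, len(ids))]

toy_ids = [2, 4, 5, 6, 7]  # 2 stands for <bos>; the other ids are made up
pairs = sliding_windows(toy_ids, seq_len=3)
for ctx, tgt in pairs:
    print(ctx, "->", tgt)

# With drop_last=True only full batches are used in each epoch
steps_per_epoch = 60979 // 128
print("steps per epoch:", steps_per_epoch)
```

A sequence of `n` ids yields `n - seq_len` pairs, so sequences with at most `seq_len` ids yield none. The step count computed here is consistent with the `476` shown in the training log.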

## 7) GRU model

In [9]:

#@title Model definition
import torch.nn as nn
class ChordGRU(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers, dropout):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_size, num_layers=num_layers,
                                   batch_first=True, dropout=dropout if num_layers>1 else 0.0)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, vocab_size)
    def forward(self, x):
        e = self.emb(x)
        o, _ = self.rnn(e)
        return self.fc(self.dropout(o[:, -1, :]))
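As a rough size check, the number of trainable parameters can be estimated by hand. The sketch below assumes the default `nn.GRU` layout (unidirectional, three gates per layer, two bias vectors per layer) and plugs in the values from this notebook (vocabulary of 86, embedding 128, hidden size 256, 2 layers). `gru_param_count` is a hypothetical helper, not part of the notebook.

```python
def gru_param_count(vocab_size, embedding_dim, hidden_size, num_layers):
    """Estimate parameters of embedding + stacked GRU + linear output layer."""
    emb = vocab_size * embedding_dim
    rnn = 0
    for layer in range(num_layers):
        in_dim = embedding_dim if layer == 0 else hidden_size
        # 3 gates, each with input weights, hidden weights and two bias terms
        rnn += 3 * hidden_size * (in_dim + hidden_size + 2)
    fc = hidden_size * vocab_size + vocab_size  # weights + bias
    return {"embedding": emb, "gru": rnn, "fc": fc, "total": emb + rnn + fc}

counts = gru_param_count(vocab_size=86, embedding_dim=128, hidden_size=256, num_layers=2)
print(counts)
```

In the notebook, the figure can be cross-checked with `sum(p.numel() for p in model.parameters())`. Most of the parameters sit in the recurrent layers, so `hidden_size` and `num_layers` dominate the model size.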


## 8) Training and metrics (Top@K, MRR, PPL)

In [10]:

#@title Utils
import math, time, os, torch
import torch.nn.functional as F

def topk_metrics(logits, targets, ks=(1,3,5)):
    out = {}
    with torch.no_grad():
        for k in ks:
            topk = logits.topk(k, dim=-1).indices
            out[f"Top@{k}"] = (topk == targets.unsqueeze(1)).any(dim=1).float().mean().item()
        ranks = (logits.argsort(dim=-1, descending=True) == targets.unsqueeze(1)).nonzero(as_tuple=False)[:,1] + 1
        out["MRR"] = (1.0 / ranks.float()).mean().item()
    return out

def evaluate(model, loader, criterion):
    model.eval()
    total, n = 0.0, 0
    agg = {"Top@1":0.0,"Top@3":0.0,"Top@5":0.0,"MRR":0.0}
    with torch.no_grad():
        for x,y in loader:
            x,y = x.to(device), y.to(device)
            logits = model(x)
            loss = criterion(logits, y)
            b = x.size(0); total += loss.item()*b; n += b
            m = topk_metrics(logits, y)
            for k in agg: agg[k] += m[k]*b
    for k in agg: agg[k] /= max(1,n)
    return {"loss": total/max(1,n), "ppl": math.exp(total/max(1,n)), **agg}

def train_model(model, train_loader, val_loader, epochs, lr, weight_decay, grad_clip=1.0, amp=True, save_dir=".", save_name="best.pt"):
    os.makedirs(save_dir, exist_ok=True)
    # torch.amp replaces the deprecated torch.cuda.amp entry points
    scaler = torch.amp.GradScaler("cuda", enabled=(amp and device.type=='cuda'))
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    crit = torch.nn.CrossEntropyLoss()
    best_mrr, best_path = -1.0, os.path.join(save_dir, save_name)
    for ep in range(1, epochs+1):
        model.train(); t0 = time.time()
        for i,(x,y) in enumerate(train_loader,1):
            x,y = x.to(device), y.to(device)
            opt.zero_grad(set_to_none=True)
            with torch.amp.autocast("cuda", enabled=(amp and device.type=='cuda')):
                logits = model(x); loss = crit(logits,y)
            scaler.scale(loss).backward()
            if grad_clip is not None:
                scaler.unscale_(opt)
                torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
            scaler.step(opt); scaler.update()
            if i % 100 == 0: print(f"Ep{ep} step {i}/{len(train_loader)} loss {loss.item():.4f}")
        valm = evaluate(model, val_loader, crit)
        print(f"Epoch {ep} | val loss {valm['loss']:.4f} ppl {valm['ppl']:.2f} Top@1 {valm['Top@1']:.3f} Top@3 {valm['Top@3']:.3f} Top@5 {valm['Top@5']:.3f} MRR {valm['MRR']:.3f}")
        if valm["MRR"] > best_mrr:
            best_mrr = valm["MRR"]
            torch.save({"model_state": model.state_dict(), "config": dict(vars(cfg)), "stoi": stoi, "itos": itos}, best_path)
            print("🔥 Guardado best ->", best_path, "| MRR:", best_mrr)
    return best_mrr, best_path
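The ranking metrics can be understood without tensors: for each example, find the rank of the true next chord among all scores. Top@K is the share of examples with rank at most K, and MRR is the mean of 1/rank. The following pure-Python sketch uses made-up scores; `rank_of_target` and `topk_and_mrr` are hypothetical names, and ties are broken by index order, which may differ from the tensor version.

```python
def rank_of_target(scores, target):
    """1-based rank of `target` when scores are sorted in descending order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(target) + 1

def topk_and_mrr(batch_scores, targets, ks=(1, 3, 5)):
    ranks = [rank_of_target(s, t) for s, t in zip(batch_scores, targets)]
    out = {f"Top@{k}": sum(r <= k for r in ranks) / len(ranks) for k in ks}
    out["MRR"] = sum(1.0 / r for r in ranks) / len(ranks)
    return out

toy_scores = [
    [0.1, 0.7, 0.2, 0.0],  # true class 1 is ranked first
    [0.4, 0.3, 0.2, 0.1],  # true class 2 is ranked third
]
toy_metrics = topk_and_mrr(toy_scores, targets=[1, 2])
print(toy_metrics)
```

MRR rewards near misses: the second example contributes 1/3 even though it counts as a miss for Top@1.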


## 9) Train the GRU

In [11]:

#@title Train
import torch, os, json
torch.manual_seed(cfg.random_state)
model = ChordGRU(vocab_size=len(vocab), embedding_dim=cfg.embedding_dim,
                          hidden_size=cfg.hidden_size, num_layers=cfg.num_layers,
                          dropout=cfg.dropout).to(device)
best_mrr, best_path = train_model(model, train_loader, val_loader, cfg.epochs, cfg.lr, cfg.weight_decay,
                                  grad_clip=cfg.grad_clip, amp=cfg.amp, save_dir=cfg.save_dir, save_name=cfg.save_name)
print("Best MRR:", best_mrr, "| path:", best_path)




Ep1 step 100/476 loss 2.2282
Ep1 step 200/476 loss 2.3747
Ep1 step 300/476 loss 2.0149
Ep1 step 400/476 loss 2.0779
Epoch 1 | val loss 2.1584 ppl 8.66 Top@1 0.442 Top@3 0.676 Top@5 0.763 MRR 0.585
🔥 Guardado best -> /content/models_gru_v1/gru_best.pt | MRR: 0.584794789535765
Ep2 step 100/476 loss 2.1084
Ep2 step 200/476 loss 2.1919
Ep2 step 300/476 loss 2.1190
Ep2 step 400/476 loss 2.1181
Epoch 2 | val loss 2.1306 ppl 8.42 Top@1 0.448 Top@3 0.685 Top@5 0.768 MRR 0.591
🔥 Guardado best -> /content/models_gru_v1/gru_best.pt | MRR: 0.591246411382035
Ep3 step 100/476 loss 2.0119
Ep3 step 200/476 loss 1.8909
Ep3 step 300/476 loss 1.8901
Ep3 step 400/476 loss 2.1402
Epoch 3 | val loss 2.1259 ppl 8.38 Top@1 0.455 Top@3 0.688 Top@5 0.770 MRR 0.597
🔥 Guardado best -> /content/models_gru_v1/gru_best.pt | MRR: 0.5970695267882555
Ep4 step 100/476 loss 1.9535
Ep4 step 200/476 loss 1.8190
Ep4 step 300/476 loss 1.8945
Ep4 step 400/476 loss 2.0423
Epoch 4 | val loss 2.1204 ppl 8.33 Top@1 0.459 Top@3 0.

## 10) Evaluation on the test set

In [12]:

#@title Test
import torch, os
ckpt = torch.load(os.path.join(cfg.save_dir, cfg.save_name), map_location=device)
model.load_state_dict(ckpt["model_state"])
test_metrics = evaluate(model, test_loader, torch.nn.CrossEntropyLoss())
print("Test:", test_metrics)


Test: {'loss': 2.1969814152866194, 'ppl': 8.997811807607055, 'Top@1': 0.4319989573755014, 'Top@3': 0.6775329250729151, 'Top@5': 0.7616377626377533, 'MRR': 0.5797308985110382}
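Perplexity is the exponential of the mean cross-entropy loss, which is how `evaluate` derives `ppl`. As a reference point, a model that guesses uniformly over the 86-token vocabulary would have a loss of ln(86) and a perplexity of 86. The short sketch below checks both numbers using the test loss reported above.

```python
import math

test_loss = 2.1969814152866194   # mean cross-entropy reported on the test set
vocab_size = 86                  # vocabulary size printed earlier

ppl = math.exp(test_loss)
uniform_loss = math.log(vocab_size)   # loss of a uniform guess
uniform_ppl = math.exp(uniform_loss)  # equals the vocabulary size

print(f"model:   loss {test_loss:.4f}  ppl {ppl:.2f}")
print(f"uniform: loss {uniform_loss:.4f}  ppl {uniform_ppl:.2f}")
```

Read this way, a perplexity of about 9 means the model is roughly as uncertain as if it were choosing among 9 equally likely chords at each step.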


## 11) predict_next(context, k=5)

In [13]:

#@title predict_next()
import torch.nn.functional as F
def predict_next(context_tokens, k=5):
    model.eval()
    ids = [stoi.get(t, stoi["<unk>"]) for t in context_tokens]
    if len(ids) < cfg.seq_len:
        ids = [stoi["<bos>"]] * (cfg.seq_len - len(ids)) + ids
    else:
        ids = ids[-cfg.seq_len:]
    x = torch.tensor(ids, dtype=torch.long, device=device).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)
        probs = F.softmax(logits[0], dim=-1)
        topk = torch.topk(probs, k)
        return [(itos[i], float(p)) for i,p in zip(topk.indices.tolist(), topk.values.tolist())]

# Quick example (if there is data)
if len(train_seqs):
    ctx = train_seqs[0][:cfg.seq_len]
    print("Context:", ctx)
    print("Pred:", predict_next(ctx, k=5))


Context: ['III', 'III', '#IV', '#IV', 'bII', 'bII', 'natIII', 'natIII', 'III', '#IV', 'IV', 'VI', 'V', 'bVII', 'natVI', 'I', 'II', 'VII', 'VI', 'IV', 'III', 'I', 'VII', 'natVI']
Pred: [('VII', 0.13458140194416046), ('VI', 0.12113039195537567), ('I', 0.11372911930084229), ('IV', 0.08486910164356232), ('III', 0.07772155851125717)]
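`predict_next` always feeds the model exactly `seq_len` ids: shorter contexts are left-padded with `<bos>` and longer ones keep only the most recent ids. A minimal sketch of that step, with made-up ids and a hypothetical helper name `fit_context`:

```python
def fit_context(ids, seq_len, bos_id):
    """Left-pad with bos_id or keep the last seq_len ids."""
    if len(ids) < seq_len:
        return [bos_id] * (seq_len - len(ids)) + ids
    return ids[-seq_len:]

padded = fit_context([4, 5], seq_len=4, bos_id=2)
truncated = fit_context([1, 2, 3, 4, 5, 6], seq_len=4, bos_id=2)
print(padded)
print(truncated)
```

During training, `<bos>` appears at most once per window, at the very start of a song. Contexts padded with several `<bos>` ids are therefore a pattern the model has not seen, so predictions for very short contexts may be less reliable.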



### Short roadmap
- Add **position within the bar** and **duration** as additional embeddings (v2).
- Soft re-ranking to avoid repetitions and favor cadences.
- Tune `seq_len`, the number of layers, and the *scheduler* once the pipeline is confirmed.
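One possible shape for the soft re-ranking item is sketched below. This is only an illustration of the idea, not something implemented in the notebook: the function name `rerank_repeats`, the `penalty` value and the toy probabilities are all assumptions. It down-weights the candidate that repeats the last chord of the context and then renormalizes.

```python
def rerank_repeats(preds, last_token, penalty=0.5):
    """Scale down the candidate equal to last_token, renormalize, and re-sort."""
    adjusted = [(tok, p * penalty if tok == last_token else p) for tok, p in preds]
    total = sum(p for _, p in adjusted)
    if total > 0:
        adjusted = [(tok, p / total) for tok, p in adjusted]
    return sorted(adjusted, key=lambda tp: tp[1], reverse=True)

toy_preds = [("V", 0.40), ("I", 0.35), ("IV", 0.25)]  # made-up output of predict_next
reranked = rerank_repeats(toy_preds, last_token="V")
print(reranked)
```

Because the penalty is multiplicative, a strongly preferred repetition can still come out on top, which is what makes the re-ranking soft rather than a hard filter.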
