# Implementing a deep learning model to detect human actions in videos from the UCF101 dataset

## Fernanda Díaz Gutiérrez A01639572

- In this notebook I use the 2D skeleton annotations of UCF101 (the ones provided) to classify its actions
- I train a simple LSTM model (baseline) and then an improved one with more layers and regularization
- I compare results, evaluate on test, and show some predictions to see how it works

## 1. Imports and configuration

In [55]:
import pickle
from pathlib import Path
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # GPU on Google Colab when available

## 2. Skeleton file

In [56]:
from pathlib import Path
import pickle
pkl_path = Path("/content/ucf101_2d.pkl")  # path where the file should be placed
# load the file; the encoding is set for compatibility
with open(pkl_path, "rb") as f:
  data = pickle.load(f, encoding="latin1")
# prints I used to inspect the structure of the data
print("keys:",data.keys())
print("splits:",data["split"].keys())
anotaciones= data["annotations"]
print("videos con anotaciones:", len(anotaciones))
ejemplo_fields= anotaciones[0]
print("ejemplo fields:",ejemplo_fields.keys())
print("label:",ejemplo_fields["label"])
print("forma:",ejemplo_fields["keypoint"].shape)

keys: dict_keys(['split', 'annotations'])
splits: dict_keys(['train1', 'train2', 'train3', 'test1', 'test2', 'test3'])
videos con anotaciones: 13320
ejemplo fields: dict_keys(['keypoint', 'keypoint_score', 'frame_dir', 'total_frames', 'original_shape', 'img_shape', 'label'])
label: 0
forma: (1, 119, 17, 2)


## 3. Splits, subsets, and skeleton dataset

In [57]:
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
import numpy as np
import torch

# map each frame_dir to its annotation
all_annotations = data["annotations"]
video_to_ann = {ann["frame_dir"]: ann for ann in all_annotations}

# use train1 for train/val and test1 as the final test set
train_ids = data["split"]["train1"]
test_ids  = data["split"]["test1"]
train_ann_list = [video_to_ann[vid] for vid in train_ids]
test_ann_list  = [video_to_ann[vid] for vid in test_ids]
print("videos en train1:", len(train_ann_list))
print("videos en test1 :", len(test_ann_list))

# subset of classes
train_labels = sorted({ann["label"] for ann in train_ann_list})
print("total de clases en train1:", len(train_labels))
n_classes_subset = 8  # at least 5
chosen_labels = train_labels[:n_classes_subset]
print("clases seleccionadas (labels originales):", chosen_labels)
label_map = {old: i for i, old in enumerate(chosen_labels)}
idx_to_label = {v: k for k, v in label_map.items()}
print("mapeo label original -> nuevo:", label_map)

# dataset: uses only person 0 of each video
# returns the padded sequence, the label (remapped class), and the real length before padding
class SkeletonDataset(Dataset):
    def __init__(self, ann_list, chosen_labels, label_map, max_len=64):
        self.max_len = max_len
        self.chosen_labels = set(chosen_labels)
        self.label_map = label_map
        # keep only the classes in the subset
        self.samples = [ann for ann in ann_list if ann["label"] in self.chosen_labels]
        # V and C taken from the first example
        kp = self.samples[0]["keypoint"]  # [M, T, V, C]
        _, _, V, C = kp.shape
        self.V = V
        self.C = C
        self.input_size = V * C
        print(f"dataset con {len(self.samples)} videos, V={V}, C={C}")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        ann = self.samples[idx]
        orig_label = ann["label"]
        label = self.label_map[orig_label]
        # keypoint [M, T, V, C] -> person 0: [T, V, C]
        kp = ann["keypoint"][0]
        T = kp.shape[0]
        kp = kp.astype(np.float32)  # to float32
        seq = kp.reshape(T, -1)  # flatten each frame
        # simple normalization: subtract the video's per-coordinate mean over time
        seq = seq - seq.mean(axis=0, keepdims=True)
        # truncate or pad to max_len
        if T >= self.max_len:
            seq = seq[: self.max_len]
            seq_len = self.max_len
        else:
            pad = np.zeros((self.max_len - T, seq.shape[1]), dtype=np.float32)
            seq = np.concatenate([seq, pad], axis=0)
            seq_len = T
        # to tensors
        seq = torch.from_numpy(seq)  # [max_len, input_size]
        label = torch.tensor(label).long()
        seq_len = torch.tensor(seq_len).long()
        return seq, label, seq_len

# create train/val/test datasets and dataloaders
# internal train/val split taken from train1
train_anns, val_anns = train_test_split(
    train_ann_list, test_size=0.2, random_state=42, shuffle=True
)

max_len = 64  # frames per sequence
train_ds = SkeletonDataset(train_anns, chosen_labels, label_map, max_len=max_len)
val_ds   = SkeletonDataset(val_anns,   chosen_labels, label_map, max_len=max_len)
test_ds  = SkeletonDataset(test_ann_list, chosen_labels, label_map, max_len=max_len)
batch_size = 32
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True,  drop_last=True)
val_loader   = DataLoader(val_ds,   batch_size=batch_size, shuffle=False, drop_last=False)
test_loader  = DataLoader(test_ds,  batch_size=batch_size, shuffle=False, drop_last=False)
num_classes = len(chosen_labels)
input_size = train_ds.input_size

print("batches de train:", len(train_loader))
print("batches de val  :", len(val_loader))
print("batches de test :", len(test_loader))
print("input_size:", input_size, "num_classes:", num_classes)

videos en train1: 9537
videos en test1 : 3783
total de clases en train1: 101
clases seleccionadas (labels originales): [0, 1, 2, 3, 4, 5, 6, 7]
mapeo label original -> nuevo: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7}
dataset con 612 videos, V=17, C=2
dataset con 167 videos, V=17, C=2
dataset con 304 videos, V=17, C=2
batches de train: 19
batches de val  : 6
batches de test : 10
input_size: 34 num_classes: 8
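
The preprocessing inside `SkeletonDataset.__getitem__` can be tried in isolation. The following is a minimal sketch that mirrors those steps with NumPy only; the function name `center_and_pad` and the random clips are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

def center_and_pad(keypoints, max_len=64):
    """Mirror of the per-video steps in __getitem__ (hypothetical helper).

    keypoints: array [T, V, C] for a single person.
    Returns (sequence [max_len, V*C], real length before padding).
    """
    T = keypoints.shape[0]
    seq = keypoints.astype(np.float32).reshape(T, -1)  # flatten each frame
    seq = seq - seq.mean(axis=0, keepdims=True)        # center each coordinate over time
    if T >= max_len:
        return seq[:max_len], max_len                  # truncate long clips
    pad = np.zeros((max_len - T, seq.shape[1]), dtype=np.float32)
    return np.concatenate([seq, pad], axis=0), T       # zero-pad short clips

# Random stand-ins for real clips (assumption: V=17 joints, C=2 coordinates)
rng = np.random.default_rng(0)
short_clip = rng.random((40, 17, 2))
long_clip = rng.random((119, 17, 2))

short_seq, short_len = center_and_pad(short_clip)
long_seq, long_len = center_and_pad(long_clip)
print(short_seq.shape, short_len)  # (64, 34) 40
print(long_seq.shape, long_len)    # (64, 34) 64
```

As in the notebook, the mean is computed over all frames of the video before truncating, so a truncated clip is not exactly zero-mean over the frames that are kept.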


## 4. Baseline model (LSTM)


In [58]:
import torch.nn as nn
import torch.nn.functional as F

class SkeletonLSTM(nn.Module):  # simple LSTM to classify the joint sequences
    def __init__(self, input_size, num_classes,
                 hidden_size=64, num_layers=1, dropout=0.0):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0,
        )
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):  # x: [B, T, input_size]
        _, (h_n, _) = self.lstm(x)  # h_n: [num_layers, B, hidden_size]
        last_hidden = h_n[-1]  # final hidden state of the last layer
        logits = self.fc(last_hidden)  # [B, num_classes]
        return logits

# baseline model
baseline_model = SkeletonLSTM(
    input_size=input_size,
    num_classes=num_classes,
    hidden_size=64,
    num_layers=1,
    dropout=0.0,
).to(device)

print("modelo baseline:")
print(baseline_model)

modelo baseline:
SkeletonLSTM(
  (lstm): LSTM(34, 64, batch_first=True)
  (fc): Linear(in_features=64, out_features=8, bias=True)
)
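
The dataset returns `seq_len`, but `SkeletonLSTM.forward` does not use it, so for clips shorter than `max_len` the final hidden state is taken after the LSTM has also processed the zero padding. Below is a minimal sketch of one possible way around that, assuming PyTorch's `pack_padded_sequence`. The class name `PackedSkeletonLSTM` and the extra `seq_len` argument are hypothetical and not part of the notebook; the training and evaluation loops would also have to pass `seq_len` to the model.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedSkeletonLSTM(nn.Module):
    """Hypothetical variant of SkeletonLSTM that skips padded frames."""

    def __init__(self, input_size, num_classes, hidden_size=64, num_layers=1, dropout=0.0):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size,
            hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0,
        )
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x, seq_len):
        # x: [B, T, input_size]; seq_len: [B] real lengths (must be >= 1)
        packed = pack_padded_sequence(
            x, seq_len.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)  # h_n is the state at each clip's last real frame
        return self.fc(h_n[-1])          # [B, num_classes]

torch.manual_seed(0)
model = PackedSkeletonLSTM(input_size=34, num_classes=8).eval()
x = torch.randn(3, 64, 34)
seq_len = torch.tensor([64, 40, 10])

with torch.no_grad():
    logits = model(x, seq_len)
    x_noisy = x.clone()
    x_noisy[1, 40:] = 99.0               # change only the padded region of clip 1
    logits_noisy = model(x_noisy, seq_len)

print(logits.shape)                           # torch.Size([3, 8])
print(torch.allclose(logits, logits_noisy))   # True
```

The second print checks the intended property: values in the padded region no longer influence the logits.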


## 5. Training and validation

In [59]:
from tqdm.auto import tqdm

def train_epoch(model, loader, optimizer, criterion):
    model.train()  # training mode
    total_loss = 0.0
    total_ok = 0
    total = 0
    for seq, labels, seq_len in tqdm(loader, desc="train", leave=False):
        seq = seq.to(device)#[B,T,input_size]
        labels = labels.to(device)
        optimizer.zero_grad()
        logits = model(seq)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)
        preds = logits.argmax(dim=1)
        total_ok += (preds == labels).sum().item()
        total += labels.size(0)
    avg_loss = total_loss / total
    avg_acc = total_ok / total
    return avg_loss, avg_acc

def eval_epoch(model, loader, criterion):
    model.eval()  # eval mode
    total_loss = 0.0
    total_ok = 0
    total = 0
    # no gradients needed
    with torch.no_grad():
        for seq, labels, seq_len in tqdm(loader, desc="val", leave=False):
            seq = seq.to(device)
            labels = labels.to(device)
            logits = model(seq)
            loss = criterion(logits, labels)
            total_loss += loss.item() * labels.size(0)
            preds = logits.argmax(dim=1)
            total_ok += (preds == labels).sum().item()
            total += labels.size(0)
    avg_loss = total_loss / total
    avg_acc = total_ok / total
    return avg_loss, avg_acc

## 6. Training the baseline model

In [60]:
epochs = 15
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    baseline_model.parameters(),
    lr=1e-3,
    weight_decay=0.0,  # baseline without regularization
)

history_baseline = {
    "train_loss": [],
    "train_acc": [],
    "val_loss": [],
    "val_acc": [],
}

for epoch in range(1, epochs + 1):
    train_loss, train_acc = train_epoch(baseline_model, train_loader, optimizer, criterion)
    val_loss, val_acc = eval_epoch(baseline_model, val_loader, criterion)
    history_baseline["train_loss"].append(train_loss)
    history_baseline["train_acc"].append(train_acc)
    history_baseline["val_loss"].append(val_loss)
    history_baseline["val_acc"].append(val_acc)
    print(
        f"baseline epoch {epoch:02d} "
        f"train_loss:{train_loss:.4f},train_acc:{train_acc:.3f} "
        f"val_loss:{val_loss:.4f},val_acc:{val_acc:.3f}"
    )

baseline epoch 01 train_loss:2.0638,train_acc:0.161 val_loss:1.9929,val_acc:0.222
baseline epoch 02 train_loss:1.8975,train_acc:0.304 val_loss:1.9454,val_acc:0.263
baseline epoch 03 train_loss:1.8015,train_acc:0.400 val_loss:1.8978,val_acc:0.287
baseline epoch 04 train_loss:1.7099,train_acc:0.461 val_loss:1.8595,val_acc:0.281
baseline epoch 05 train_loss:1.6242,train_acc:0.528 val_loss:1.8106,val_acc:0.323
baseline epoch 06 train_loss:1.5513,train_acc:0.558 val_loss:1.7792,val_acc:0.329
baseline epoch 07 train_loss:1.4768,train_acc:0.607 val_loss:1.7551,val_acc:0.341
baseline epoch 08 train_loss:1.4053,train_acc:0.630 val_loss:1.7220,val_acc:0.347
baseline epoch 09 train_loss:1.3332,train_acc:0.681 val_loss:1.6984,val_acc:0.383
baseline epoch 10 train_loss:1.2667,train_acc:0.692 val_loss:1.6830,val_acc:0.359
baseline epoch 11 train_loss:1.2103,train_acc:0.712 val_loss:1.6524,val_acc:0.401
baseline epoch 12 train_loss:1.1570,train_acc:0.717 val_loss:1.6445,val_acc:0.395
baseline epoch 13 train_loss:1.1011,train_acc:0.750 val_loss:1.6422,val_acc:0.395
baseline epoch 14 train_loss:1.0401,train_acc:0.776 val_loss:1.6009,val_acc:0.401
baseline epoch 15 train_loss:0.9973,train_acc:0.788 val_loss:1.5750,val_acc:0.419

## 7. Improved model (deeper + regularization)

In [61]:
better_model = SkeletonLSTM(
    input_size=input_size,
    num_classes=num_classes,
    hidden_size=128,  # more hidden units
    num_layers=2,  # more LSTM layers
    dropout=0.5,  # dropout between layers
).to(device)
print("modelo mejorado:")
print(better_model)

# train the improved model
better_criterion = nn.CrossEntropyLoss()
better_optimizer = torch.optim.Adam(
    better_model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,  # L2 regularization
)
history_better = {
    "train_loss": [],
    "train_acc": [],
    "val_loss": [],
    "val_acc": [],
}
epochs_better = 15  # same number of epochs, for comparison

for epoch in range(1, epochs_better + 1):
    train_loss, train_acc = train_epoch(better_model, train_loader, better_optimizer, better_criterion)
    val_loss, val_acc = eval_epoch(better_model, val_loader, better_criterion)
    history_better["train_loss"].append(train_loss)
    history_better["train_acc"].append(train_acc)
    history_better["val_loss"].append(val_loss)
    history_better["val_acc"].append(val_acc)
    print(
        f"mejorado epoch {epoch:02d} "
        f"train_loss:{train_loss:.4f}, train_acc:{train_acc:.3f} "
        f"val_loss:{val_loss:.4f}, val_acc:{val_acc:.3f}"
    )

modelo mejorado:
SkeletonLSTM(
  (lstm): LSTM(34, 128, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=128, out_features=8, bias=True)
)
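
To get a feel for how much capacity the improved model adds, the trainable parameters of both printed architectures can be counted by hand. This is a rough sketch that assumes PyTorch's `nn.LSTM` layout (four gates per layer, each with input weights, recurrent weights, and two bias vectors); the helper names are hypothetical.

```python
def lstm_param_count(input_size, hidden_size, num_layers=1):
    """Parameters of a unidirectional LSTM with PyTorch's layout (hypothetical helper)."""
    total = 0
    for layer in range(num_layers):
        layer_input = input_size if layer == 0 else hidden_size
        # 4 gates x (input weights + recurrent weights + two bias vectors)
        total += 4 * (layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total

def linear_param_count(in_features, out_features):
    """Weights plus bias of a fully connected layer."""
    return in_features * out_features + out_features

# Sizes taken from the two printed models: input 34, 8 classes
baseline_params = lstm_param_count(34, 64, 1) + linear_param_count(64, 8)
improved_params = lstm_param_count(34, 128, 2) + linear_param_count(128, 8)

print(baseline_params)  # 26120
print(improved_params)  # 217096
```

Under these assumptions the improved model has roughly eight times as many parameters as the baseline, while the training set has only 612 videos, which is one reason dropout and weight decay are worth adding.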


mejorado epoch 01 train_loss:2.0110, train_acc:0.196 val_loss:1.8801, val_acc:0.317
mejorado epoch 02 train_loss:1.7219, train_acc:0.421 val_loss:1.6102, val_acc:0.365
mejorado epoch 03 train_loss:1.3913, train_acc:0.546 val_loss:1.3244, val_acc:0.455
mejorado epoch 04 train_loss:1.1225, train_acc:0.604 val_loss:1.2685, val_acc:0.485
mejorado epoch 05 train_loss:0.9648, train_acc:0.681 val_loss:1.1694, val_acc:0.557
mejorado epoch 06 train_loss:0.8170, train_acc:0.729 val_loss:1.0877, val_acc:0.557
mejorado epoch 07 train_loss:0.6973, train_acc:0.757 val_loss:1.0650, val_acc:0.587
mejorado epoch 08 train_loss:0.6381, train_acc:0.773 val_loss:1.1345, val_acc:0.569
mejorado epoch 09 train_loss:0.6360, train_acc:0.778 val_loss:1.1656, val_acc:0.575
mejorado epoch 10 train_loss:0.5619, train_acc:0.771 val_loss:1.1432, val_acc:0.605
mejorado epoch 11 train_loss:0.5403, train_acc:0.808 val_loss:1.1591, val_acc:0.611
mejorado epoch 12 train_loss:0.5000, train_acc:0.822 val_loss:1.3294, val_acc:0.611
mejorado epoch 13 train_loss:0.5110, train_acc:0.817 val_loss:1.1674, val_acc:0.575
mejorado epoch 14 train_loss:0.4612, train_acc:0.826 val_loss:1.0780, val_acc:0.641
mejorado epoch 15 train_loss:0.4201, train_acc:0.849 val_loss:1.2281, val_acc:0.635

## 8. Final evaluation on test

In [62]:
# evaluate on test (baseline)
test_loss_base, test_acc_base = eval_epoch(baseline_model, test_loader, criterion)
print(f"baseline test_loss: {test_loss_base:.4f}, test_acc:{test_acc_base:.3f}")
# evaluate on test (improved)
test_loss_better, test_acc_better = eval_epoch(better_model, test_loader, better_criterion)
print(f"mejorado test_loss: {test_loss_better:.4f}, test_acc:{test_acc_better:.3f}")

baseline test_loss: 1.5587, test_acc:0.418
mejorado test_loss: 1.7057, test_acc:0.484

## 9. Example predictions

In [63]:
def get_preds(model, loader, num_batches=1):
    model.eval()
    preds_list = []
    labels_list = []
    with torch.no_grad():
        for i, (seq, labels, seq_len) in enumerate(loader):
            if i >= num_batches:
                break
            seq = seq.to(device)
            labels = labels.to(device)
            logits = model(seq)
            preds = logits.argmax(dim=1)
            preds_list.append(preds.cpu())
            labels_list.append(labels.cpu())
    if len(preds_list) == 0:
        return [], []
    preds_all = torch.cat(preds_list)
    labels_all = torch.cat(labels_list)
    return preds_all, labels_all

# improved model on a few test batches
preds, labels = get_preds(better_model, test_loader, num_batches=2)
for i in range(len(preds)):
    y_true_new = labels[i].item()
    y_pred_new = preds[i].item()
    y_true_orig = idx_to_label.get(y_true_new, None)
    y_pred_orig = idx_to_label.get(y_pred_new, None)
    print(
        f"ejemplo {i:02d} "
        f"true (nuevo)={y_true_new}, pred (nuevo)={y_pred_new} "
        f"true (original)={y_true_orig}, pred (original)={y_pred_orig}"
    )

ejemplo 00 true (nuevo)=0, pred (nuevo)=1 true (original)=0, pred (original)=1
ejemplo 01 true (nuevo)=0, pred (nuevo)=1 true (original)=0, pred (original)=1
ejemplo 02 true (nuevo)=0, pred (nuevo)=1 true (original)=0, pred (original)=1
ejemplo 03 true (nuevo)=0, pred (nuevo)=0 true (original)=0, pred (original)=0
ejemplo 04 true (nuevo)=0, pred (nuevo)=0 true (original)=0, pred (original)=0
ejemplo 05 true (nuevo)=0, pred (nuevo)=0 true (original)=0, pred (original)=0
ejemplo 06 true (nuevo)=0, pred (nuevo)=0 true (original)=0, pred (original)=0
ejemplo 07 true (nuevo)=0, pred (nuevo)=3 true (original)=0, pred (original)=3
ejemplo 08 true (nuevo)=0, pred (nuevo)=0 true (original)=0, pred (original)=0
ejemplo 09 true (nuevo)=0, pred (nuevo)=3 true (original)=0, pred (original)=3
ejemplo 10 true (nuevo)=0, pred (nuevo)=1 true (original)=0, pred (original)=1
ejemplo 11 true (nuevo)=0, pred (nuevo)=0 true (original)=0, pred (original)=0
ejemplo 12 true (nuevo)=0, pred (nuevo)=1 true (orig

## Interpretation (analysis of results)

### 1. Baseline model (simple LSTM)
The baseline model is a 1-layer LSTM with 64 hidden units. It does learn basic temporal patterns from the skeletons: in the logged run, train accuracy rises from about 0.16 to 0.79 and val accuracy reaches about 0.42, which is well above a random classifier (12.5% for 8 classes). Val loss falls from about 2.0 to about 1.6, which tells us the model is picking up signal from the data. Even so, there is a clear gap between train and val (0.79 vs. 0.42), which points to overfitting, since no regularization is used here. On the test set, this model reaches an accuracy of about 0.42.


### 2. Improved model
The improved model increases the capacity of the network (2-layer LSTM with 128 hidden units) and adds regularization: dropout of 0.5 and weight decay of 1e-4. These changes let it capture more complex temporal patterns without letting overfitting get too far out of hand, although val loss stops improving after epoch 7 while train loss keeps falling. In the logged run, val accuracy peaks at about 0.64 and stays between roughly 0.57 and 0.64 from epoch 7 onward, and on the test set the model reaches an accuracy of about 0.48, which is better than the baseline's 0.42.
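
Since the improved model's validation loss in the section 7 log does not fall monotonically, it can help to look up which epoch was actually best instead of reading only the last one. Below is a small sketch using the logged values copied by hand; `best_epoch` is a hypothetical helper, not part of the notebook.

```python
def best_epoch(values, mode="min"):
    """Return (1-based epoch, value) of the best entry; first one wins on ties."""
    pick = min if mode == "min" else max
    best_index = pick(range(len(values)), key=lambda i: values[i])
    return best_index + 1, values[best_index]

# Values copied from the improved model's training log (section 7)
val_loss = [1.8801, 1.6102, 1.3244, 1.2685, 1.1694, 1.0877, 1.0650, 1.1345,
            1.1656, 1.1432, 1.1591, 1.3294, 1.1674, 1.0780, 1.2281]
val_acc = [0.317, 0.365, 0.455, 0.485, 0.557, 0.557, 0.587, 0.569,
           0.575, 0.605, 0.611, 0.611, 0.575, 0.641, 0.635]

print(best_epoch(val_loss, mode="min"))  # (7, 1.065)
print(best_epoch(val_acc, mode="max"))   # (14, 0.641)
```

In the notebook itself the same function could be applied to `history_better["val_loss"]`. Saving a checkpoint at the best validation epoch and evaluating that checkpoint on test would be one way to act on it.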

### 3. Comparison
Comparing on test, the baseline model has an error of about 0.58 (1 - 0.42) and the improved model lowers it to about 0.52 (1 - 0.48). That is a relative error reduction of about 11% with respect to the baseline. This is consistent with the idea that increasing the LSTM's capacity and using regularization (here, dropout and weight decay) improves how well the model generalizes on the skeletons. Note, however, that test loss is higher for the improved model (1.71 vs. 1.56) even though its accuracy is higher, which suggests that its mistakes are made with more confidence.
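
The relative error reduction can be checked with a few lines of arithmetic. This sketch uses the test accuracies printed in section 8 (0.418 for the baseline, 0.484 for the improved model); the function name is a hypothetical helper.

```python
def relative_error_reduction(acc_base, acc_new):
    """Fraction of the baseline's error that the new model removes."""
    err_base = 1.0 - acc_base
    err_new = 1.0 - acc_new
    return (err_base - err_new) / err_base

# Test accuracies as printed in section 8
reduction = relative_error_reduction(0.418, 0.484)
print(f"{reduction:.1%}")  # 11.3%
```

This comes from a single training run on 304 test videos, so the figure should be read as indicative rather than as a precise estimate.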

### 4. Predictions
Looking at concrete predictions from a few test batches, the improved model gets many cases right but also confuses some classes with each other, which is to be expected because it only uses the 2D coordinates of the joints. In the complete printed examples, all of which have true class 0, it is correct in 6 of 12 and predicts class 1 or 3 in the rest. Either way, this is the kind of behavior one would expect from a model that relies on the temporal structure of the movements to tell the possible actions apart.
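
A confusion matrix makes it easier to see which classes get mixed up than reading the predictions line by line. Below is a minimal pure-Python sketch, fed with the twelve complete example lines printed in section 9 (the truncated last line is left out); `confusion_matrix` here is a hypothetical helper written for illustration, not a library call.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts samples with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Remapped labels copied from examples 00 to 11 in section 9
y_true = [0] * 12
y_pred = [1, 1, 1, 0, 0, 0, 0, 3, 0, 3, 1, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=8)
print(cm[0])  # [6, 4, 0, 2, 0, 0, 0, 0]
```

In the notebook, `get_preds(better_model, test_loader, num_batches=len(test_loader))` would give predictions for the whole test set, and calling `.tolist()` on the returned tensors would produce lists this helper accepts.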