### Clip-level feature extraction (UCF-Crime)

This notebook takes the file `processed/index_clips.csv` as input.

For each temporal clip, `index_clips.csv` contains:
- The path to the original video.
- The temporal range of the clip (`start_frame`, `end_frame`).
- The corresponding partition (`train`, `val`, `test`).
- The binary label (normal vs. anomalous) and the associated category.
- The segmentation parameters used (clip length and overlap, stored as `clip_len` and `stride`).

Starting from this index, the frames of each clip are loaded and transformed into the
representation required by the evaluated models, without redefining or modifying the composition of the
experimental set.


In [1]:
import pandas as pd
from pathlib import Path
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# Path to the clip index
INDEX_CLIPS_PATH = Path("processed/index_clips.csv")

# Check that the file exists
assert INDEX_CLIPS_PATH.exists(), f"File not found: {INDEX_CLIPS_PATH}"

# Load the CSV
df_clips = pd.read_csv(INDEX_CLIPS_PATH)

print("File loaded successfully")
print("Total number of clips:", len(df_clips))

display(df_clips.head())


File loaded successfully
Total number of clips: 145356


Unnamed: 0,split,y,category,path,clip_idx,start_frame,end_frame,clip_len,stride,fps,n_frames
0,train,0,Normal,/home/DIINF/dvaldes/tesis/UCF_Crime/Training-N...,0,0,32,32,16,30.0,2016
1,train,0,Normal,/home/DIINF/dvaldes/tesis/UCF_Crime/Training-N...,1,16,48,32,16,30.0,2016
2,train,0,Normal,/home/DIINF/dvaldes/tesis/UCF_Crime/Training-N...,2,32,64,32,16,30.0,2016
3,train,0,Normal,/home/DIINF/dvaldes/tesis/UCF_Crime/Training-N...,3,48,80,32,16,30.0,2016
4,train,0,Normal,/home/DIINF/dvaldes/tesis/UCF_Crime/Training-N...,4,64,96,32,16,30.0,2016
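
The index is produced upstream, so the segmentation code itself is not part of this notebook. As a minimal sketch, assuming clips are sliding windows with an exclusive `end_frame` (which is consistent with the `start_frame`, `end_frame`, `clip_len` and `stride` values in the rows above), the windows of one video could be enumerated as follows. The helper name `clip_windows` and the rule of dropping a trailing partial window are assumptions, not the author's implementation.

```python
def clip_windows(n_frames: int, clip_len: int, stride: int):
    """Enumerate (start_frame, end_frame) windows; end_frame is exclusive."""
    windows = []
    start = 0
    # Assumption: only windows that fit entirely inside the video are kept.
    while start + clip_len <= n_frames:
        windows.append((start, start + clip_len))
        start += stride
    return windows


windows = clip_windows(n_frames=2016, clip_len=32, stride=16)
print(windows[:3])   # first windows of a 2016-frame video
print(len(windows))  # number of windows under the stated assumptions
```

With `clip_len=32` and `stride=16`, consecutive clips share half of their frames.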


In [2]:
# Distribution by split
print("Clips per split:")
print(df_clips["split"].value_counts())

# Distribution by class
print("\nClips per class (y):")
print(df_clips["y"].value_counts())

# Check that the temporal ranges are valid
invalid_ranges = df_clips[df_clips["end_frame"] <= df_clips["start_frame"]]
print("\nClips with invalid ranges:", len(invalid_ranges))

# Check that every video path exists on disk
missing_paths = df_clips[~df_clips["path"].apply(lambda p: Path(p).exists())]
print("Clips with missing path:", len(missing_paths))


Clips per split:
split
train    106527
val       19793
test      19036
Name: count, dtype: int64

Clips per class (y):
y
0    73870
1    71486
Name: count, dtype: int64

Clips with invalid ranges: 0
Clips with missing path: 0


In [3]:
# Check that no video appears in more than one split
video_split_counts = df_clips.groupby("path")["split"].nunique()
n_leak = int((video_split_counts > 1).sum())

print("Videos with clips in more than one split:", n_leak)


Videos with clips in more than one split: 0


# 1. Converting clips to tensors
This section builds a DataLoader of clips that, for each row of `index_clips.csv`, reads a range of frames from the video, samples T frames, normalizes them, and returns them as a 5D tensor ready to be passed through TimeSformer.

Variables:

- T = 8
Number of frames the encoder sees per clip. For a typical TimeSformer checkpoint, 8 is the standard value.

- IMG_SIZE = 224
Final spatial size of each frame. 224×224 is the standard size for most pretrained checkpoints (it avoids problems with positional embeddings).

- BATCH_SIZE = 16
How many clips are processed in parallel. It affects GPU memory usage and speed.

- NUM_WORKERS = 8
Parallelizes video reading/decoding on the CPU (speeds up the input pipeline).

- mean/std (ImageNet)
Standard normalization for ImageNet-pretrained models. Note that the image processor of the checkpoint used here reports a mean of 0.45 and a std of 0.225 for all channels (see the output of cell 6), so the ImageNet statistics applied by `ClipDataset` differ slightly from the checkpoint's own preprocessing configuration.
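
To get a feel for how much the choice of normalization statistics matters, the sketch below normalizes a single mid-gray pixel with both sets of values: the ImageNet statistics used by `ClipDataset` and the statistics printed by the checkpoint's image processor in this notebook. The helper `normalize` and the test pixel are illustrative assumptions; the sketch says nothing about the effect on downstream accuracy.

```python
import numpy as np

imagenet_mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
imagenet_std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
# Values reported by the checkpoint's image processor in this notebook.
ckpt_mean = np.array([0.45, 0.45, 0.45], dtype=np.float32)
ckpt_std = np.array([0.225, 0.225, 0.225], dtype=np.float32)


def normalize(pixels01, mean, std):
    """Normalize RGB values in [0, 1] channel-wise."""
    return (pixels01 - mean) / std


gray = np.full(3, 0.5, dtype=np.float32)  # a mid-gray pixel (illustrative)
a = normalize(gray, imagenet_mean, imagenet_std)
b = normalize(gray, ckpt_mean, ckpt_std)
print("ImageNet stats:  ", a)
print("checkpoint stats:", b)
print("max abs difference:", float(np.abs(a - b).max()))
```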

In [4]:

# Input parameters
T = 8                # frames sampled per clip
IMG_SIZE = 224       # spatial size of each frame
BATCH_SIZE = 16
NUM_WORKERS = 8

def uniform_sample_indices(start_f: int, end_f: int, T: int):
    # Pick T evenly spaced frame indices in [start_f, end_f); end_f is exclusive
    n = max(1, end_f - start_f)
    idx = np.linspace(0, n - 1, T).round().astype(int)
    return (start_f + idx).astype(int)

class ClipDataset(Dataset):
    def __init__(self, df, T=8, img_size=224):
        self.df = df.reset_index(drop=True)
        self.T = T
        self.img_size = img_size
        self.mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        self.std  = np.array([0.229, 0.224, 0.225], dtype=np.float32)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        row = self.df.iloc[i]
        path = row["path"]
        start_f = int(row["start_frame"])
        end_f   = int(row["end_frame"])
        y = int(row["y"])

        cap = cv2.VideoCapture(path)
        if not cap.isOpened():
            raise RuntimeError(f"Could not open video: {path}")

        frame_ids = uniform_sample_indices(start_f, end_f, self.T)

        frames = []
        last_good = None
        try:
            for fid in frame_ids:
                # Seek to the requested frame and decode it
                cap.set(cv2.CAP_PROP_POS_FRAMES, int(fid))
                ok, frame = cap.read()

                if not ok:
                    # Fallback: repeat the last decoded frame, or use a black frame
                    if last_good is None:
                        frame = np.zeros((self.img_size, self.img_size, 3), dtype=np.uint8)
                    else:
                        frame = last_good
                else:
                    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                    frame = cv2.resize(frame, (self.img_size, self.img_size), interpolation=cv2.INTER_LINEAR)
                    last_good = frame

                frames.append(frame)
        finally:
            # Release the capture even if decoding raises
            cap.release()

        # (T,H,W,C) -> float [0,1]
        arr = np.stack(frames).astype(np.float32) / 255.0
        arr = (arr - self.mean) / self.std

        # -> (C,T,H,W)
        arr = np.transpose(arr, (3, 0, 1, 2))
        clip = torch.from_numpy(arr)  # float32

        return clip, torch.tensor(y, dtype=torch.long)
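
As a quick illustration of the temporal sampling, the sketch below repeats `uniform_sample_indices` from the cell above (so that it runs on its own) and applies it to one 32-frame clip and to a hypothetical clip shorter than `T`. The specific ranges are chosen only for illustration.

```python
import numpy as np


def uniform_sample_indices(start_f: int, end_f: int, T: int):
    # Same logic as in the notebook: end_f is treated as exclusive.
    n = max(1, end_f - start_f)
    idx = np.linspace(0, n - 1, T).round().astype(int)
    return (start_f + idx).astype(int)


# A 32-frame clip such as the second row of the index (frames 16..47).
ids = uniform_sample_indices(16, 48, 8)
print(ids)

# A hypothetical clip with fewer frames than T: indices are repeated.
short = uniform_sample_indices(0, 3, 8)
print(short)
```

The first and last frame of the range are always included, and a clip with fewer than `T` frames still yields `T` indices because some frames are repeated.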


In [5]:
df_train = df_clips[df_clips["split"]=="train"].copy()
df_val   = df_clips[df_clips["split"]=="val"].copy()
df_test  = df_clips[df_clips["split"]=="test"].copy()

train_ds = ClipDataset(df_train, T=T, img_size=IMG_SIZE)
val_ds   = ClipDataset(df_val,   T=T, img_size=IMG_SIZE)
test_ds = ClipDataset(df_test, T=T, img_size=IMG_SIZE)


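# Note: with shuffle=True, anything written later in loader order (such as the
# extracted train embeddings) does not follow the row order of df_train.
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True,
                          num_workers=NUM_WORKERS, pin_memory=True)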
val_loader   = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False,
                          num_workers=NUM_WORKERS, pin_memory=True)

test_loader = DataLoader(
    test_ds,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=NUM_WORKERS,
    pin_memory=True
)


xb, yb = next(iter(train_loader))
print("xb shape:", xb.shape)  # expected: (B, 3, 8, 224, 224)
print("yb shape:", yb.shape)




xb shape: torch.Size([16, 3, 8, 224, 224])
yb shape: torch.Size([16])


Meaning of each dimension
xb: [B, C, T, H, W]

- B = 16 → Batch size (number of clips processed in parallel).

- C = 3 → Color channels (RGB).

- T = 8 → Number of frames TimeSformer sees per clip.

- H = 224 → Image height (standard resize).

- W = 224 → Image width (standard resize).

In summary:
Each batch contains 16 clips, and each clip has 8 RGB frames of 224×224.
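
Since the batch size drives memory usage, a rough back-of-the-envelope calculation of the input tensor size can help when tuning it. The sketch below only counts the float32 input batch; activations inside the encoder are not included.

```python
# Dimensions of one input batch, as printed above.
B, C, T, H, W = 16, 3, 8, 224, 224
bytes_per_float32 = 4

n_elements = B * C * T * H * W
n_bytes = n_elements * bytes_per_float32
print(n_elements, "elements,", n_bytes / 2**20, "MiB")
```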

# 2. Loading and freezing TimeSformer

This block loads the pretrained TimeSformer model, moves it to the GPU (if available), and freezes its parameters so it can be used as a feature extractor.

Key variables

- DEVICE: determines whether the model runs on "cuda" (GPU) or "cpu".

- TIMESFORMER_CKPT: checkpoint used (facebook/timesformer-base-finetuned-k400), trained on Kinetics-400.

- image_processor: holds the checkpoint's standard preprocessing configuration (mean/std). Here it is only used to print those values; the clips are normalized inside `ClipDataset`.

- encoder: backbone that extracts spatiotemporal representations from the clip.

In [6]:
import torch
from transformers import TimesformerModel, AutoImageProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("DEVICE:", DEVICE)

TIMESFORMER_CKPT = "facebook/timesformer-base-finetuned-k400"

image_processor = AutoImageProcessor.from_pretrained(TIMESFORMER_CKPT)
encoder = TimesformerModel.from_pretrained(TIMESFORMER_CKPT)

encoder = encoder.to(DEVICE)
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

print("OK loaded:", TIMESFORMER_CKPT)
print("processor mean/std:", image_processor.image_mean, image_processor.image_std)


DEVICE: cuda


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


OK loaded: facebook/timesformer-base-finetuned-k400
processor mean/std: [0.45, 0.45, 0.45] [0.225, 0.225, 0.225]


The dimensions are reordered because TimeSformer expects clips in (batch, frames, channels, height, width) format, whereas our tensor is in (batch, channels, frames, height, width) format.

In [7]:
xb = xb.to(DEVICE)

# (B, C, T, H, W) -> (B, T, C, H, W)
pixel_values = xb.permute(0, 2, 1, 3, 4).contiguous()
print("pixel_values shape:", pixel_values.shape)


pixel_values shape: torch.Size([16, 8, 3, 224, 224])


In [8]:
with torch.no_grad():
    out = encoder(pixel_values=pixel_values)

print("out keys:", out.keys())
print("last_hidden_state shape:", out.last_hidden_state.shape)

# CLS token
cls = out.last_hidden_state[:, 0, :]   # (B, D)
print("CLS shape:", cls.shape)


out keys: odict_keys(['last_hidden_state'])
last_hidden_state shape: torch.Size([16, 1569, 768])
CLS shape: torch.Size([16, 768])
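
The sequence length of 1569 can be reproduced by hand: one CLS token plus one token per patch per frame. The sketch below assumes a patch size of 16, which is not printed in this notebook; the helper name `timesformer_num_tokens` is hypothetical.

```python
def timesformer_num_tokens(num_frames: int, image_size: int, patch_size: int = 16) -> int:
    """Tokens = 1 CLS token + num_frames * patches per frame (patch_size assumed)."""
    patches_per_frame = (image_size // patch_size) ** 2
    return 1 + num_frames * patches_per_frame


n_tokens = timesformer_num_tokens(num_frames=8, image_size=224)
print(n_tokens)
```

Under that assumption each 224×224 frame contributes 14 × 14 = 196 patch tokens, which matches the `last_hidden_state` shape above.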


Sanity check

In [9]:
cls_cpu = cls.detach().cpu()
print("CLS mean:", float(cls_cpu.mean()))
print("CLS std:", float(cls_cpu.std()))
print("Any NaN:", torch.isnan(cls_cpu).any().item())


CLS mean: -0.016693538054823875
CLS std: 0.9742990732192993
Any NaN: False


# 3. Embedding extraction


In [10]:
import numpy as np
from pathlib import Path

def create_memmap(path, shape, dtype="float16"):
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return np.memmap(path, mode="w+", dtype=dtype, shape=shape)
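
A raw `np.memmap` file stores only bytes, with no header for shape or dtype, so whoever reads it back must supply both. The sketch below writes a tiny array to a temporary file and reopens it read-only; the file name and the 4×768 shape are illustrative assumptions.

```python
import tempfile
from pathlib import Path

import numpy as np

shape, dtype = (4, 768), "float16"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.mmap"  # hypothetical file name

    # Write: row i is filled with the value i.
    mm = np.memmap(path, mode="w+", dtype=dtype, shape=shape)
    mm[:] = np.arange(4, dtype=np.float32)[:, None]
    mm.flush()
    del mm

    # Read back: shape and dtype must be passed again.
    ro = np.memmap(path, mode="r", dtype=dtype, shape=shape)
    row_values = ro[:, 0].astype(np.float32).tolist()
    file_bytes = path.stat().st_size
    del ro  # release the mapping before the directory is removed

print(row_values, file_bytes)
```

The file size equals rows × columns × bytes per element, which is what makes it possible to recover the number of rows later if the embedding dimension and dtype are known.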


In [11]:
D = cls.shape[1]
print("Embedding dim:", D)


Embedding dim: 768


In [12]:
df_train = df_clips[df_clips["split"]=="train"].copy()
df_val   = df_clips[df_clips["split"]=="val"].copy()
df_test  = df_clips[df_clips["split"]=="test"].copy()

train_ds = ClipDataset(df_train, T=T, img_size=IMG_SIZE)
val_ds   = ClipDataset(df_val,   T=T, img_size=IMG_SIZE)
test_ds  = ClipDataset(df_test,  T=T, img_size=IMG_SIZE)

print(len(train_ds), len(val_ds), len(test_ds))


106527 19793 19036


In [13]:
# Train
X_train = create_memmap("processed/emb_timesformer_cls_train.mmap",
                        (len(train_ds), D),
                        dtype="float16")
y_train = create_memmap("processed/y_train.mmap",
                        (len(train_ds),),
                        dtype="int8")

# Val
X_val = create_memmap("processed/emb_timesformer_cls_val.mmap",
                      (len(val_ds), D),
                      dtype="float16")
y_val = create_memmap("processed/y_val.mmap",
                      (len(val_ds),),
                      dtype="int8")

# Test
X_test = create_memmap("processed/emb_timesformer_cls_test.mmap",
                       (len(test_ds), D),
                       dtype="float16")
y_test = create_memmap("processed/y_test.mmap",
                       (len(test_ds),),
                       dtype="int8")


In [14]:
from tqdm import tqdm

def extract_embeddings(loader, encoder, X_mm, y_mm, split_name="train"):
    encoder.eval()

    total_batches = len(loader)
    print(f"\nExtracting {split_name.upper()} embeddings...")
    print(f"Total batches: {total_batches}")

    with torch.no_grad():
        offset = 0

        for xb, yb in tqdm(loader, desc=f"{split_name}", leave=True):

            xb = xb.to(DEVICE)
            pixel_values = xb.permute(0, 2, 1, 3, 4).contiguous()

            out = encoder(pixel_values=pixel_values)
            cls = out.last_hidden_state[:, 0, :]  # (B, D)

            cls_np = cls.detach().cpu().numpy().astype(X_mm.dtype, copy=False)
            y_np = yb.numpy().astype(y_mm.dtype, copy=False)

            batch_size = cls_np.shape[0]

            X_mm[offset:offset+batch_size] = cls_np
            y_mm[offset:offset+batch_size] = y_np

            offset += batch_size

    X_mm.flush()
    y_mm.flush()

    print(f"{split_name.upper()} extraction done.")
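
The offset bookkeeping in `extract_embeddings` can be checked in isolation. The sketch below copies a small array into a float16 buffer batch by batch, including a final batch that is smaller than the others, and also computes the number of batches expected for the train split. The helper `write_in_batches` and the toy data are assumptions made for illustration.

```python
import math

import numpy as np


def write_in_batches(data, out, batch_size):
    """Copy `data` into `out` batch by batch, tracking a running offset."""
    offset = 0
    n_batches = 0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        out[offset:offset + len(batch)] = batch  # cast to out.dtype on assignment
        offset += len(batch)
        n_batches += 1
    return offset, n_batches


data = np.arange(50, dtype=np.float32).reshape(10, 5)
out = np.zeros((10, 5), dtype=np.float16)
written, n_batches = write_in_batches(data, out, batch_size=4)
print(written, n_batches)

# Batches needed for the 106527 train clips with a batch size of 16.
train_batches = math.ceil(106527 / 16)
print(train_batches)
```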


In [15]:

from tqdm import tqdm

extract_embeddings(train_loader, encoder, X_train, y_train, split_name="train")
extract_embeddings(val_loader, encoder, X_val, y_val, split_name="val")
extract_embeddings(test_loader, encoder, X_test, y_test, split_name="test")

print("\nAll embeddings extracted successfully.")



Extracting TRAIN embeddings...
Total batches: 6658


train: 100%|██████████████████████████████| 6658/6658 [1:16:24<00:00,  1.45it/s]


TRAIN extraction done.

Extracting VAL embeddings...
Total batches: 1238


val: 100%|██████████████████████████████████| 1238/1238 [14:15<00:00,  1.45it/s]


VAL extraction done.

Extracting TEST embeddings...
Total batches: 1190


test: 100%|█████████████████████████████████| 1190/1190 [13:42<00:00,  1.45it/s]

TEST extraction done.

All embeddings extracted successfully.





In [16]:
import numpy as np

print("Train:", X_train.shape, y_train.shape)
print("Val:  ", X_val.shape, y_val.shape)
print("Test: ", X_test.shape, y_test.shape)


Train: (106527, 768) (106527,)
Val:   (19793, 768) (19793,)
Test:  (19036, 768) (19036,)


In [17]:
import numpy as np

def sanity_mm_fp32(X_mm, y_mm, name="split", n=5000):
    n = min(n, len(y_mm))
    X = np.array(X_mm[:n], dtype=np.float32)   # <- key: compute statistics in float32, not float16
    y = np.array(y_mm[:n], dtype=np.int64)

    print(f"\n[{name}]")
    print("  finite:", np.isfinite(X).all())
    print("  mean:", float(X.mean()))
    print("  std:",  float(X.std()))
    print("  min/max:", float(X.min()), float(X.max()))
    print("  y counts:", {int(v): int((y==v).sum()) for v in np.unique(y)})

sanity_mm_fp32(X_train, y_train, "train")
sanity_mm_fp32(X_val,   y_val,   "val")
sanity_mm_fp32(X_test,  y_test,  "test")




[train]
  finite: True
  mean: -0.0163530632853508
  std: 0.9735729098320007
  min/max: -6.234375 5.6015625
  y counts: {0: 2556, 1: 2444}

[val]
  finite: True
  mean: -0.01572292298078537
  std: 0.9724651575088501
  min/max: -6.6953125 5.98828125
  y counts: {0: 2762, 1: 2238}

[test]
  finite: True
  mean: -0.017447900027036667
  std: 0.973820686340332
  min/max: -6.01171875 5.7109375
  y counts: {0: 2083, 1: 2917}


In [18]:
X = np.array(X_train[:5000], dtype=np.float32)
print("Any inf:", np.isinf(X).any())
print("Any nan:", np.isnan(X).any())


Any inf: False
Any nan: False


In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Use a small sample so that this check runs quickly
ntr = min(20000, len(y_train))
nva = min(8000,  len(y_val))

Xtr = np.array(X_train[:ntr], dtype=np.float32)
ytr = np.array(y_train[:ntr], dtype=np.int64)

Xva = np.array(X_val[:nva], dtype=np.float32)
yva = np.array(y_val[:nva], dtype=np.int64)

clf = LogisticRegression(max_iter=200, n_jobs=-1)
clf.fit(Xtr, ytr)

pva = clf.predict_proba(Xva)[:, 1]
auc = roc_auc_score(yva, pva)
print("Sanity AUC (LogReg):", auc)


Sanity AUC (LogReg): 0.9303626521189672


In [20]:
import json
from datetime import datetime

manifest = {
    "created_at": datetime.now().isoformat(),
    "model": TIMESFORMER_CKPT,
    "T": T,
    "img_size": IMG_SIZE,
    "embedding_dim": int(X_train.shape[1]),
    "files": {
        "X_train": "processed/emb_timesformer_cls_train.mmap",
        "y_train": "processed/y_train.mmap",
        "X_val":   "processed/emb_timesformer_cls_val.mmap",
        "y_val":   "processed/y_val.mmap",
        "X_test":  "processed/emb_timesformer_cls_test.mmap",
        "y_test":  "processed/y_test.mmap",
    }
}

Path("processed").mkdir(exist_ok=True)
with open("processed/manifest_timesformer_cls.json", "w") as f:
    json.dump(manifest, f, indent=2)

print("Saved:", "processed/manifest_timesformer_cls.json")


Saved: processed/manifest_timesformer_cls.json
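
The manifest records the file paths and the embedding dimension, but not the number of rows or the dtype of each file. A minimal sketch of how a downstream notebook might reopen the embeddings is shown below, assuming the dtype is known to be float16 (as used when the files were created here) and inferring the row count from the file size. The helper `open_embeddings` and the temporary demo files are hypothetical.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def open_embeddings(path, embedding_dim, dtype="float16"):
    """Open a raw memmap read-only, inferring the row count from the file size."""
    path = Path(path)
    row_bytes = embedding_dim * np.dtype(dtype).itemsize
    n_bytes = path.stat().st_size
    if n_bytes % row_bytes != 0:
        raise ValueError(f"{path}: size is not a whole number of rows")
    n_rows = n_bytes // row_bytes
    return np.memmap(path, mode="r", dtype=dtype, shape=(n_rows, embedding_dim))


# Self-contained demo with tiny temporary files instead of the real outputs.
with tempfile.TemporaryDirectory() as tmp:
    emb_path = Path(tmp) / "emb_demo.mmap"            # hypothetical file
    np.zeros((5, 768), dtype="float16").tofile(emb_path)

    manifest_path = Path(tmp) / "manifest_demo.json"  # hypothetical file
    manifest_path.write_text(json.dumps(
        {"embedding_dim": 768, "files": {"X_demo": str(emb_path)}}
    ))

    loaded = json.loads(manifest_path.read_text())
    X_demo = open_embeddings(loaded["files"]["X_demo"], loaded["embedding_dim"])
    demo_shape = X_demo.shape
    del X_demo  # release the mapping before the directory is removed

print(demo_shape)
```

Storing the dtype and the number of rows per split in the manifest would remove the need for this inference.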
