# Instagram like prediction (PyTorch)
Demo notebook that reuses the `preprocess.py`, `train.py`, `predict.py`, and `models.py` modules to:
1) Preprocess the data
2) Train the multimodal model
3) Run predictions on a single post

Make sure your `venv` is created and activated, and that the packages in `requirements.txt` are installed, before running the cells.

## Setup
Defines the paths and the device. Adjust `DATA_DIR` if your dataset lives somewhere else.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from pathlib import Path
import torch
from models import get_device

DATA_DIR = Path('data')
PROCESSED_DIR = Path('processed')
PROCESSED_DIR.mkdir(exist_ok=True)

device = get_device()
device


## 1. Preprocessing
This cell calls the `preprocess.py` script to generate:
- `processed_data.pt`
- `vectorizer.joblib`
- `text_scaler.joblib`
- `meta_scaler.joblib`

You only need to run it the first time or whenever the data changes.
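
Since the script only has to be rerun when the data changes, a small guard can check whether the four artifacts already exist. This is a minimal sketch, assuming the files are written directly into the output directory under the names listed above. `missing_artifacts` is a hypothetical helper, not part of `preprocess.py`.

```python
from pathlib import Path

# File names taken from the list above; their location inside out_dir is an assumption.
EXPECTED_ARTIFACTS = (
    "processed_data.pt",
    "vectorizer.joblib",
    "text_scaler.joblib",
    "meta_scaler.joblib",
)


def missing_artifacts(out_dir):
    """Return the expected artifact names that are not present in out_dir."""
    out_dir = Path(out_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (out_dir / name).is_file()]


missing = missing_artifacts("processed")
if missing:
    print("Run preprocess.py; missing:", ", ".join(missing))
else:
    print("All artifacts found; preprocessing can be skipped.")
```

The check only looks at file presence, so it will not notice stale artifacts when the raw data has changed.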

In [None]:
!python preprocess.py --data_dir $DATA_DIR --out_dir $PROCESSED_DIR --max_features 3000

In [None]:
from typing import Dict, List

from preprocess import *  # provides extract_sample

# Rebuild the sample list in memory, skipping folders that fail to parse.
# DATA_DIR is defined in the Setup cell.
folders = sorted(p for p in DATA_DIR.iterdir() if p.is_dir())
samples: List[Dict] = []
for folder in folders:
    try:
        samples.append(extract_sample(folder))
    except Exception as exc:
        print(f"[WARN] skipping {folder}: {exc}")

# Cleaned captions, one per successfully parsed post.
captions = [s["caption_clean"] for s in samples]

## 2. Training
Trains the multimodal model. Adjust the number of epochs, the learning rate, or the validation split based on the results.
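
To make the `--val_split` option concrete, the sketch below shows how a fraction of 0.2 could partition sample indices into training and validation sets. It is an illustration only: `split_indices` is a hypothetical helper, and `train.py` may shuffle or split differently.

```python
import random


def split_indices(n_samples, val_split, seed=0):
    """Shuffle indices 0..n_samples-1 and hold out a val_split fraction."""
    if not 0.0 <= val_split < 1.0:
        raise ValueError("val_split must be in [0, 1)")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_val = int(n_samples * val_split)    # validation size, rounded down
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(n_samples=10, val_split=0.2)
print(len(train_idx), len(val_idx))
```

With very small datasets, rounding down can leave the validation set empty, which is worth checking before trusting any validation metric.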

In [None]:
!python train.py --processed $PROCESSED_DIR/processed_data.pt --epochs 5 --batch_size 1 --lr 1e-4 --val_split 0.2 --model_out $PROCESSED_DIR/model.pt

In [None]:
import train as t
import sys

sys.argv = [
    "notebook",                         # placeholder script name (argv[0])
    "--processed", "processed/processed_data.pt",
    "--epochs", "10",
    "--lr", "1e-3",
    "--batch_size", "1",
    "--val_split", "0.2",
    "--model_out", "processed/model_manual.pt",
]

model, val_loader, device = t.main()

In [None]:
import torch

model.eval()
total_abs_err, n = 0.0, 0
with torch.no_grad():
    for batch in val_loader:
        # Each batch is iterated as a list of per-post sample dicts.
        for sample in batch:
            images = [img.to(device) for img in sample["images"]]
            text   = sample["text"].to(device)
            meta   = sample["meta"].to(device)

            # expm1 maps target and prediction back to like counts
            # (this assumes the targets were stored as log1p(likes)).
            target = torch.expm1(sample["target"].to(device))
            pred = torch.expm1(model(images, text, meta))

            total_abs_err += torch.abs(pred - target).item()
            n += 1

# Guard against an empty validation loader.
mae = total_abs_err / n if n else float("nan")
print(f"MAE (likes): {mae:.4f} with n={n}")
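
The `torch.expm1` calls above suggest that targets are stored as `log1p(likes)`. That is an assumption about `preprocess.py`, not something this notebook confirms. Under that assumption, the sketch below uses plain Python and made-up numbers to show why both values are mapped back before taking the absolute error.

```python
import math

# Made-up like counts, for illustration only.
true_likes = [120.0, 4500.0]
pred_likes = [100.0, 5000.0]

# Assumed training-time transform: log1p compresses the heavy-tailed like counts.
true_log = [math.log1p(v) for v in true_likes]
pred_log = [math.log1p(v) for v in pred_likes]

# expm1 inverts log1p, so errors are measured in likes again.
abs_errors = [abs(math.expm1(p) - math.expm1(t)) for p, t in zip(pred_log, true_log)]
mae_likes = sum(abs_errors) / len(abs_errors)

# The same errors in log space are not expressed in likes.
mae_log = sum(abs(p - t) for p, t in zip(pred_log, true_log)) / len(true_log)

print(f"MAE in likes: {mae_likes:.1f}")
print(f"MAE in log space: {mae_log:.3f}")
```

Reporting the error in likes keeps the metric interpretable, while the log-space value is closer to what the model would optimize under this assumption.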

## 3. Prediction on a single post
Uses the saved model to predict the likes of a post (a folder with jpg + txt + json.xz files).
Change `POST_DIR` to the folder you want to evaluate.
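
Before calling `predict.py`, it can help to confirm that the folder holds the three kinds of files mentioned above. This is a minimal sketch, assuming the files are matched by suffix (`.jpg`, `.txt`, `.json.xz`) directly inside the folder. `describe_post_dir` and `is_complete` are hypothetical helpers, and `predict.py` may locate files differently.

```python
from pathlib import Path

# Suffixes taken from the description above; matching by suffix is an assumption.
SUFFIXES = (".jpg", ".txt", ".json.xz")


def describe_post_dir(post_dir):
    """Count files directly inside post_dir for each expected suffix."""
    post_dir = Path(post_dir)
    names = [p.name for p in post_dir.iterdir()] if post_dir.is_dir() else []
    return {suffix: sum(name.endswith(suffix) for name in names) for suffix in SUFFIXES}


def is_complete(counts):
    """Treat a post as usable when every file kind appears at least once."""
    return all(n > 0 for n in counts.values())


counts = describe_post_dir("data/some_post_folder")  # hypothetical path
print(counts, "complete" if is_complete(counts) else "incomplete")
```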

In [None]:
POST_DIR = DATA_DIR / 'aashnashroff_969148_3000403601659402518_25980_65'
!python predict.py --data_dir $POST_DIR --model $PROCESSED_DIR/model.pt --vectorizer $PROCESSED_DIR/vectorizer.joblib --text_scaler $PROCESSED_DIR/text_scaler.joblib --meta_scaler $PROCESSED_DIR/meta_scaler.joblib

## 4. Quick inspection of an in-memory sample (optional)
Loads an already processed sample and prints its shapes and data to help understand the pipeline.

In [None]:
import torch
from dataset import InstagramPostDataset

ds = InstagramPostDataset(PROCESSED_DIR/'processed_data.pt', device=device)
sample = ds[0]
print('id:', sample['id'])
print('n_imgs:', len(sample['images']))
print('text_vec shape:', sample['text'].shape)
print('meta_vec:', sample['meta'])
