<a href="https://colab.research.google.com/github/NickEsColR/MachineLearningV/blob/train/taller/Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Team**

- Nicolás Colmenares

- Carlos Martinez

**Situation:**
A city is facing a significant increase in dengue cases, with an incidence rate above the national average.
Anticipating outbreaks is crucial for implementing preventive measures and reducing the spread of the disease.

**Objective:**
Develop a predictive model based on neural networks to forecast future dengue outbreaks in each neighborhood of the city.
Use a historical database of dengue cases from 2015 to 2022 to train the model.
Anticipate outbreaks at least 3 weeks ahead.

**Purpose:**
Enable public health authorities to take timely action, such as:

- Preparing healthcare provider institutions (IPS).
- Managing resources (fumigation trucks, storm drain cleaning).
- Training the community.

Architectures evaluated:

1. Traditional neural network (MLP)
2. Convolutional network (CNN) adapted to time series
3. Basic recurrent neural network (RNN)
4. Model with LSTMs
5. Model with GRUs

## Data dictionary

- train.parquet: the training set
- test.parquet: the test set
- sample_submission.csv: an example of a submission file for the competition

| **Variable**         | **Description**                                                                  |
|----------------------|----------------------------------------------------------------------------------|
| id_bar               | Unique identifier of the neighborhood                                            |
| anio                 | Year of occurrence                                                               |
| semana               | Week of occurrence                                                               |
| Estrato              | Socioeconomic stratum of the neighborhood                                        |
| area_barrio          | Area of the neighborhood in km²                                                  |
| dengue               | Count of dengue cases                                                            |
| concentraciones      | Number of visits and interventions at places where people gather (institutions)  |
| vivienda             | Count of home visits for inspection and control of breeding sites                |
| equipesado           | Count of fumigations with heavy machinery                                        |
| sumideros            | Count of interventions on storm drains                                           |
| maquina              | Count of fumigations with motorized backpack sprayers                            |
| lluvia_mean          | Mean rainfall in week i                                                          |
| lluvia_var           | Variance of rainfall in week i                                                   |
| lluvia_max           | Maximum rainfall in week i                                                       |
| lluvia_min           | Minimum rainfall in week i                                                       |
| temperatura_mean     | Mean temperature in week i                                                       |
| temperatura_var      | Variance of temperature in week i                                                |
| temperatura_max      | Maximum temperature in week i                                                    |
| temperatura_min      | Minimum temperature in week i                                                    |


# 0.  Colab Setup

Move kaggle.json to the correct location after uploading it

In [None]:
# These are shell commands run from within the Jupyter notebook. They set up the Kaggle API credentials, which are needed to download datasets from Kaggle.

!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!rm -rf /content/kaggle/output
!rm -rf /content/kaggle/input

Download the competition dataset

In [None]:
!kaggle competitions download -c aa-v-2025-i-pronosticos-nn-rnn-cnn

In [None]:
!mkdir -p /content/kaggle/output
!mkdir -p /content/kaggle/input

In [None]:
!mv aa-v-2025-i-pronosticos-nn-rnn-cnn.zip /content/kaggle/input

In [None]:
!unzip /content/kaggle/input/aa-v-2025-i-pronosticos-nn-rnn-cnn.zip -d /content/kaggle/input/

In [None]:
#/kaggle/input
import os
for dirname, _, filenames in os.walk('/content/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# 1.   Imports



In [None]:
import os
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torchsummary import summary

import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader, TensorDataset
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score
from tqdm import tqdm
from datetime import datetime, timedelta # Importing the required modules datetime and timedelta

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import copy
from typing import List, Tuple, Type, Any, Dict, Union
import sys

In [None]:
#Printing library versions
print('Pandas:', pd.__version__)
print('Numpy:', np.__version__)
print('PyTorch:', torch.__version__)

In [None]:
import warnings
warnings.filterwarnings("ignore")

# 2.  Initial Configuration and Data Loading

In [None]:
!rm -rf output
!rm -rf output.zip
!mkdir output

In [None]:
config = {
    "TRAIN_DIR": '/content/kaggle/input/df_train.parquet',
    "TEST_DIR": '/content/kaggle/input/df_test.parquet',
    "SUBMISSION_DIR": '/content/sample_submission.csv',
    "BATCH_SIZE": 32,
    "TARGET_COLUMN": 'dengue',
    "GROUP_COLUMN": 'id_bar',
    "WINDOW_SIZE": 5,
    "HORIZON": 3,
}

In [None]:
# Device configuration
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# Load data
train_df = pd.read_parquet(config["TRAIN_DIR"])
test_df = pd.read_parquet(config["TEST_DIR"])

# 3.  Preprocesamiento de Datos

## 3.1  Generate the fecha Column
We create the fecha column from anio and semana. In the cell below it is a year-week string key (the year followed by the zero-padded week number), which sorts chronologically.

In [None]:
train_df['fecha'] = train_df['anio'].astype(str) + train_df['semana'].astype(str).str.zfill(2)
test_df['fecha'] = test_df['anio'].astype(str) + test_df['semana'].astype(str).str.zfill(2)
# sort_values returns a new DataFrame, so the result is assigned back
train_df = train_df.sort_values(by=['fecha','id_bar'])
test_df = test_df.sort_values(by=['fecha','id_bar'])
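The zero-padding of the week number is what makes the string key sort in chronological order. The following is a minimal sketch in plain Python, using hypothetical (anio, semana) pairs rather than the competition data; `year_week_key` is an illustrative helper, not part of the notebook.

```python
def year_week_key(anio: int, semana: int) -> str:
    """Build a sortable key: the year followed by the zero-padded week number."""
    return f"{anio}{semana:02d}"

# Hypothetical (anio, semana) pairs, deliberately out of order.
rows = [(2015, 10), (2015, 2), (2016, 1), (2015, 9)]

# With padding, lexicographic order matches chronological order.
padded = sorted(year_week_key(a, s) for a, s in rows)

# Without padding, week 10 sorts before weeks 2 and 9 of the same year.
unpadded = sorted(f"{a}{s}" for a, s in rows)

print(padded)
print(unpadded)
```

Without `zfill(2)` the key for week 10 would be ordered before the key for week 2, which would scramble the windows built later from the sorted rows.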

We split the dataset into training (through 2020) and validation (2021 onward).

In [None]:
# Split the training set into train and validation by year.
# Filtering on the anio column works regardless of the index type of the DataFrame.
train_df_full = train_df.copy()
train_df = train_df_full[train_df_full['anio'] <= 2020].copy()
val_df = train_df_full[train_df_full['anio'] >= 2021].copy()

## 3.2 Feature Selection
We define the input features taking into account the high correlations between variables (e.g., lluvia_mean and lluvia_var: 0.82). We keep a reduced set of features and leave out several of the highly correlated ones.

In [None]:
features = ['ESTRATO', 'area_barrio', 'concentraciones', 'vivienda', 'equipesado', 'sumideros', 'maquina','lluvia_mean', 'temperatura_mean','temperatura_max']  # Selection based on correlations
target = 'dengue'

## 3.3 Standardization


We standardize the input features with StandardScaler, fitted on the training split and then applied to the validation and test splits. Scalers for the target were also tried (see the commented-out lines in the cell below), but the target is left untransformed.

Candidate numeric features (excluding id, id_bar, and dengue): ESTRATO, area_barrio, concentraciones, vivienda, equipesado, sumideros, maquina, lluvia_mean, lluvia_var, lluvia_max, lluvia_min, temperatura_mean, temperatura_var, temperatura_max, temperatura_min.

Variables such as lluvia_var, lluvia_max, temperatura_var, and temperatura_min were excluded because of their high correlations (e.g., lluvia_var and lluvia_mean: 0.82), to reduce redundancy and improve the stability of the models.

In [None]:
# Standardization
# scaler_features = MinMaxScaler()
scaler_features = StandardScaler()
train_df[features] = scaler_features.fit_transform(train_df[features])
val_df[features] = scaler_features.transform(val_df[features])
test_df[features] = scaler_features.transform(test_df[features])

# scaler_target = StandardScaler()
# scaler_target = MinMaxScaler()
# train_df[target] = scaler_target.fit_transform(train_df[[target]])
# val_df[target] = scaler_target.transform(val_df[[target]])
# train_df[target] = np.log1p(train_df[[target]])
# val_df[target] = np.log1p(val_df[[target]])

Using StandardScaler for the features and leaving the target untransformed gave the best results.
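As a rough illustration of why the scaler is fitted on the training split only, the sketch below reproduces the same arithmetic with the standard library. The rainfall values are hypothetical, and `fit_standardizer` and `standardize` are illustrative helpers, not notebook functions. StandardScaler uses the population standard deviation, which is what `pstdev` computes.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Return the mean and population standard deviation of the training values."""
    return fmean(values), pstdev(values)

def standardize(values, mean, std):
    """Apply (x - mean) / std using statistics computed elsewhere."""
    return [(v - mean) / std for v in values]

# Hypothetical weekly rainfall values for the two splits.
train_rain = [2.0, 4.0, 6.0, 8.0]
val_rain = [5.0, 11.0]

# Statistics come from the training split only ...
mean, std = fit_standardizer(train_rain)

# ... and are reused unchanged for validation, so no validation
# information leaks into the preprocessing.
train_scaled = standardize(train_rain, mean, std)
val_scaled = standardize(val_rain, mean, std)

print(mean, std)
print(val_scaled)
```

The scaled training values have zero mean and unit variance by construction. The validation values generally do not, which is expected.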

## 3.4 Create Sequences for Time Series
To predict 3 weeks ahead, we use a 5-week window (window_size=5) and a 3-week horizon (horizon=3).

In [None]:
def create_sequences(df: pd.DataFrame,
                     window_size: int,
                     horizon: int,
                     features: List[str],
                     target: str,
                     group_column: str) -> Tuple[List[np.ndarray], List[float], List[Any]]:
    """
    Creates sequences and labels for time series forecasting.

    Args:
        df: The input DataFrame containing the time series data.
        window_size: The size of the rolling window.
        horizon: The forecasting horizon.
        features: A list of column names representing the input features.
        target: The column name representing the target variable.
        group_column: The column name to group the data by (e.g., 'id_bar').

    Returns:
        A tuple containing the sequences (a list of NumPy arrays), the labels (a list of floats), and the ids of the target rows.
    """
    sequences = []
    labels = []
    ids = []

    groups = df.groupby(group_column)
    for _, group in groups:
        group = group.sort_values(by=[group_column, 'fecha'])
        for i in range(len(group) - window_size - horizon + 1):
            X = group.iloc[i:i + window_size][features].values
            y = group.iloc[i + window_size + horizon - 1][target]
            ids.append(group.iloc[i + window_size + horizon - 1]['id'])
            sequences.append(X)
            labels.append(y)

    return sequences, labels, ids

train_sequences, train_labels, train_ids = create_sequences(train_df, config["WINDOW_SIZE"], config["HORIZON"], features, target, config["GROUP_COLUMN"])
val_sequences, val_labels, val_ids = create_sequences(val_df, config["WINDOW_SIZE"], config["HORIZON"], features, target, config["GROUP_COLUMN"])
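To make the indexing in create_sequences concrete, here is a small sketch that reproduces only the row arithmetic, with no DataFrame involved. `window_indices` is a hypothetical helper, and the 10-week series is an assumed toy size.

```python
from typing import List, Tuple

def window_indices(n_weeks: int, window_size: int, horizon: int) -> List[Tuple[List[int], int]]:
    """Return (input row positions, target row position) for every window."""
    pairs = []
    for i in range(n_weeks - window_size - horizon + 1):
        input_rows = list(range(i, i + window_size))   # rows fed to the model
        target_row = i + window_size + horizon - 1     # row whose target is predicted
        pairs.append((input_rows, target_row))
    return pairs

# A neighborhood with 10 weekly rows, window of 5 weeks, horizon of 3 weeks.
pairs = window_indices(n_weeks=10, window_size=5, horizon=3)
for input_rows, target_row in pairs:
    print(input_rows, "->", target_row)
```

Each target row sits exactly `horizon` rows after the last input row, so a group with fewer than window_size + horizon rows produces no sequences at all.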

## 3.5 Dataset and DataLoader
We create a custom Dataset and build the DataLoaders for training and validation.

In [None]:
class DengueDataset(Dataset):
    """
    A custom PyTorch Dataset for Dengue forecasting.

    Args:
        sequences: A list of NumPy arrays representing the input sequences.
        labels: A list of floats representing the corresponding target values.
    """
    def __init__(self, sequences: List[np.ndarray], labels: List[float]):
        """Initializes the DengueDataset with sequences and labels."""
        self.sequences = sequences
        self.labels = labels

    def __len__(self) -> int:
        """Returns the length of the dataset."""
        return len(self.sequences)

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Returns the input sequence and label for the given index.

        Args:
            idx: The index of the item to retrieve.

        Returns:
            A tuple containing the input sequence (as a PyTorch tensor)
            and the corresponding label (as a PyTorch tensor).
        """
        X = self.sequences[idx]
        y = self.labels[idx]
        return torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

In [None]:
class DengueTestDataset(Dataset):
    """
    A custom PyTorch Dataset for Dengue forecasting for test data.

    Args:
        sequences: A list of NumPy arrays representing the input sequences for test data.
    """
    def __init__(self, sequences: List[np.ndarray]):
        """Initializes the DengueTestDataset with sequences."""
        self.sequences = sequences

    def __len__(self) -> int:
        """Returns the length of the dataset."""
        return len(self.sequences)

    def __getitem__(self, idx: int) -> torch.Tensor:
        """
        Returns the input sequence for the given index.

        Args:
            idx: The index of the item to retrieve.

        Returns:
            The input sequence (as a PyTorch tensor).
        """
        X = self.sequences[idx]
        return torch.tensor(X, dtype=torch.float32)

In [None]:
train_dataset = DengueDataset(train_sequences, train_labels)
val_dataset = DengueDataset(val_sequences, val_labels)

train_loader = DataLoader(train_dataset, batch_size=config["BATCH_SIZE"], shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=config["BATCH_SIZE"], shuffle=False)

# 4.  Model Implementation

## 4.1 MLP Model
A multilayer perceptron that flattens the sequences.

In [None]:
class MLPModel(nn.Module):
    """
    A Multilayer Perceptron (MLP) model for Dengue forecasting.

    Args:
        input_dim: The dimensionality of the input features.
        hidden_dim: The dimensionality of the hidden layers.
        layer_dim: The number of hidden layers (default: 1).
        output_dim: The dimensionality of the output (default: 1).
        dropout_rate: The dropout rate to apply between layers (default: 0.2).
    """
    def __init__(self, input_dim: int, hidden_dim: int, layer_dim:int = 1, output_dim: int = 1, dropout_rate: float = 0.2):
        """Initializes the MLPModel with specified dimensions and dropout rate."""
        super(MLPModel, self).__init__()
        layers = []
        layers.append(nn.Linear(input_dim * config['WINDOW_SIZE'], hidden_dim)) # multiply window_size since we need to receive a tensor like (batch,window_size, features) and convert to (batch,window_size*features) to correctly output the prediction
        layers.append(nn.ReLU())
        layers.append(nn.Dropout(dropout_rate))
        for _ in range(layer_dim - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout_rate))

        layers.append(nn.Linear(hidden_dim, output_dim))

        self.model = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Performs a forward pass through the MLP model.

        Args:
            x: The input tensor of shape (batch_size, sequence_length, input_dim).

        Returns:
            The output tensor of shape (batch_size, output_dim).
        """
        batch_size = x.size(0)
        x = x.view(batch_size, -1)
        return self.model(x)

The *MLPModel* class is adjusted so that the MLP architecture can be used alongside the others in the Bayesian search, without having to modify the dataset when an MLP is selected. The adjustments are:

-  multiplying *input_dim* by the configured window size, so that the number of features is passed as the input dimension just as with the other models, while the model itself receives the features flattened across the window.
-  flattening the input in the *forward pass* while keeping the batch dimension, because an MLP only accepts a one-dimensional feature vector per sample.
## 4.2 CNN Model for Time Series
A 1D CNN adapted to time series.

In [None]:
class CNNModel(nn.Module):
    """
    A 1D Convolutional Neural Network (CNN) model for Dengue forecasting.

    Args:
        input_dim: The dimensionality of the input features.
        hidden_dim: The dimensionality of the hidden layers.
        layer_dim: The number of hidden layers.
        output_dim: The dimensionality of the output (usually 1 for regression).
        dropout_rate: The dropout rate to apply between layers (default: 0.2).
    """
    def __init__(self, input_dim: int, hidden_dim: int, layer_dim:int = 1, output_dim: int = 1, dropout_rate: float = 0.2):
        """Initializes the CNNModel with specified dimensions and dropout rate."""
        super(CNNModel, self).__init__()
        layers = []
        layers.append(nn.Conv1d(input_dim, hidden_dim, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        for _ in range(layer_dim - 1):
            layers.append(nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1))

        self.model = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Performs a forward pass through the CNN model.

        Args:
            x: The input tensor of shape (batch_size, sequence_length, input_dim).

        Returns:
            The output tensor of shape (batch_size, output_dim).
        """
        x = x.permute(0, 2, 1)  # (batch, features, window_size)
        x = self.model(x)
        x = self.pool(x).squeeze(-1) # Output is (batch_size, hidden_dim)
        x = self.dropout(x)
        x = self.fc(x)
        return x

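In the forward pass, the input is permuted from (batch, window_size, features) to (batch, features, window_size) so that the features act as the input channels, which is the layout nn.Conv1d expects.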
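The choice of kernel_size=3 with padding=1 keeps the temporal length unchanged through each convolution. A quick check of the standard Conv1d output-length formula, written as a plain Python helper (`conv1d_output_length` is an illustrative name, not a PyTorch function):

```python
def conv1d_output_length(length: int, kernel_size: int, padding: int = 0,
                         stride: int = 1, dilation: int = 1) -> int:
    """Output length of a 1D convolution, following the formula in the Conv1d docs."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

window_size = 5

# Same settings as CNNModel: the window length is preserved.
same_length = conv1d_output_length(window_size, kernel_size=3, padding=1)

# Without padding, each layer would shorten the sequence by 2 steps.
shrunk_length = conv1d_output_length(window_size, kernel_size=3, padding=0)

print(same_length, shrunk_length)
```

Because the length is preserved, stacking a second convolution with layer_dim=2 still sees all 5 time steps before the adaptive average pooling collapses them to one.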

## 4.3 Basic RNN Model


In [None]:
class RNNModel(nn.Module):
    """
    A basic Recurrent Neural Network (RNN) model for Dengue forecasting.

    Args:
        input_dim: The dimensionality of the input features.
        hidden_dim: The dimensionality of the hidden state.
        layer_dim: The number of RNN layers (default: 1).
        output_dim: The dimensionality of the output (default: 1).
        dropout_rate: The dropout rate to apply between layers (default: 0.2).
    """
    def __init__(self, input_dim: int, hidden_dim: int, layer_dim: int = 1, output_dim: int = 1, dropout_rate: float = 0.2):
        """Initializes the RNNModel with specified dimensions and dropout rate."""
        super(RNNModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='relu')
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Performs a forward pass through the RNN model.

        Args:
            x: The input tensor of shape (batch_size, sequence_length, input_dim).

        Returns:
            The output tensor of shape (batch_size, output_dim).
        """
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).to(x.device)
        out, _ = self.rnn(x, h0)
        out = self.dropout(out[:, -1, :])
        out = self.fc(out)
        return out

## 4.4 LSTM Model

In [None]:
class LSTMModel(nn.Module):
    """
    A Long Short-Term Memory (LSTM) model for Dengue forecasting.

    Args:
        input_dim: The dimensionality of the input features.
        hidden_dim: The dimensionality of the hidden state.
        layer_dim: The number of LSTM layers (default: 1).
        output_dim: The dimensionality of the output (default: 1).
        dropout_rate: The dropout rate to apply between layers (default: 0.2).
    """
    def __init__(self, input_dim: int, hidden_dim: int, layer_dim: int = 1, output_dim: int = 1, dropout_rate: float = 0.2):
        """Initializes the LSTMModel with specified dimensions and dropout rate."""
        super(LSTMModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Performs a forward pass through the LSTM model.

        Args:
            x: The input tensor of shape (batch_size, sequence_length, input_dim).

        Returns:
            The output tensor of shape (batch_size, output_dim).
        """
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).to(x.device)
        c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).to(x.device)
        out, _ = self.lstm(x, (h0, c0))
        out = self.dropout(out[:, -1, :])
        out = self.fc(out)
        return out

## 4.5 GRU Model

In [None]:
class GRUModel(nn.Module):
    """
    A Gated Recurrent Unit (GRU) model for Dengue forecasting.

    Args:
        input_dim: The dimensionality of the input features.
        hidden_dim: The dimensionality of the hidden state.
        layer_dim: The number of GRU layers (default: 1).
        output_dim: The dimensionality of the output (default: 1).
        dropout_rate: The dropout rate to apply between layers (default: 0.2).
    """
    def __init__(self, input_dim: int, hidden_dim: int, layer_dim: int = 1, output_dim: int = 1, dropout_rate: float = 0.2):
        """Initializes the GRUModel with specified dimensions and dropout rate."""
        super(GRUModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.gru = nn.GRU(input_dim, hidden_dim, layer_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Performs a forward pass through the GRU model.

        Args:
            x: The input tensor of shape (batch_size, sequence_length, input_dim).

        Returns:
            The output tensor of shape (batch_size, output_dim).
        """
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).to(x.device)
        out, _ = self.gru(x, h0)
        out = self.dropout(out[:, -1, :])
        out = self.fc(out)
        return out

# 5. Training and Evaluation

## 5.1 Training Function

In [None]:
def train_model(model: nn.Module,
                train_loader: DataLoader,
                epochs: int,
                optimizer: optim.Optimizer,
                criterion: nn.Module,
                device: torch.device,
                val_loader: DataLoader = None,
                patience: int = 10) -> Tuple[List[float], List[float]]:
    """
    Trains a PyTorch model and returns the training and validation losses.

    Args:
        model: The PyTorch model to train.
        train_loader: The DataLoader for the training data.
        val_loader: The DataLoader for the validation data. If None, no validation is performed.
        epochs: The number of epochs to train for.
        optimizer: The optimizer to use.
        criterion: The loss function to use.
        device: The device to train on (e.g., 'cpu' or 'cuda').
        patience: The number of epochs to wait for improvement before stopping. Default is 10.

    Returns:
        A tuple containing the training losses and validation losses.
    """

    model.to(device)
    train_losses = []
    val_losses = []
    best_loss = float('inf')
    epochs_without_improvement = 0
    best_model = None

    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for X, y in train_loader:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            # The model returns (batch, 1) while y is (batch,); squeeze the last
            # dimension so the loss does not silently broadcast to (batch, batch).
            output = model(X).squeeze(-1)
            loss = criterion(output, y)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_losses.append(train_loss)

        if val_loader is None:
            # Without validation data, early stopping monitors the training loss.
            monitored_loss = train_loss
            if (epoch + 1) % 10 == 0:
                print(f'Epoch {epoch+1}/{epochs}, Train Loss: {train_loss:.4f}')
        else:
            model.eval()
            val_loss = 0.0
            with torch.no_grad():
                for X, y in val_loader:
                    X, y = X.to(device), y.to(device)
                    output = model(X).squeeze(-1)
                    loss = criterion(output, y)
                    val_loss += loss.item()
            val_loss /= len(val_loader)
            val_losses.append(val_loss)
            monitored_loss = val_loss
            # Log after validation so the printed value belongs to this epoch.
            if (epoch + 1) % 10 == 0:
                print(f'Epoch {epoch+1}/{epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')

        if monitored_loss < best_loss:
            best_loss = monitored_loss
            best_model = copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f'Early stopping at epoch {epoch+1}')
                break

    # Rebinding the local name would not affect the caller, so the best
    # weights are copied back into the model object that was passed in.
    if best_model is not None:
        model.load_state_dict(best_model.state_dict())
    return train_losses, val_losses
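The patience rule used by train_model can be isolated from the training loop. The sketch below replays it over a hypothetical list of per-epoch validation losses; `early_stopping_epoch` is an illustrative helper that only mirrors the counter logic, not the notebook's function.

```python
from typing import List, Tuple

def early_stopping_epoch(losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (epoch of the best loss, epoch at which training stops), 0-indexed."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # Improvement: remember it and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # Patience never ran out: training ends at the last epoch.
    return best_epoch, len(losses) - 1

# Hypothetical validation losses, one per epoch.
val_losses = [1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.8, 0.9]

best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

With a smaller patience the loop can stop before reaching a later, better epoch, which is worth keeping in mind given that the search space pairs patience values of 10 and 50 with at most 50 epochs. A patience of 50 can never trigger in that setting.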

## 5.2 Evaluation Function

In [None]:
def evaluate_model(model: torch.nn.Module,
                   val_loader: torch.utils.data.DataLoader,
                   device: torch.device,
                   scaler_target: Union[MinMaxScaler, StandardScaler, None] = None) -> Tuple[float, float, float, np.ndarray, np.ndarray]:
    """
    Evaluates a PyTorch model on a validation set and returns evaluation metrics.

    Args:
        model: The PyTorch model to evaluate.
        val_loader: The DataLoader for the validation data.
        device: The device to evaluate on (e.g., 'cpu' or 'cuda').
        scaler_target: The scaler used to normalize the target variable, or None
            if the target was left untransformed.

    Returns:
        A tuple containing the mean absolute error (MAE), mean squared error (MSE),
        root mean squared error (RMSE), predicted values, and actual values.
    """
    model.eval()
    predictions = []
    actuals = []
    with torch.no_grad():
        for X, y in val_loader:
            X, y = X.to(device), y.to(device)
            output = model(X)
            predictions.append(output.cpu().numpy())
            actuals.append(y.cpu().numpy())
    # Flatten to 1-D so predictions (batch, 1) and actuals (batch,) line up.
    predictions = np.concatenate(predictions).reshape(-1)
    actuals = np.concatenate(actuals).reshape(-1)
    # Undo the target scaling only when a scaler was actually used.
    if scaler_target is not None:
        predictions = scaler_target.inverse_transform(predictions.reshape(-1, 1)).flatten()
        actuals = scaler_target.inverse_transform(actuals.reshape(-1, 1)).flatten()
    mae = mean_absolute_error(actuals, predictions)
    mse = mean_squared_error(actuals, predictions)
    rmse = np.sqrt(mse)
    print(f'MAE: {mae:.4f}, MSE: {mse:.4f}, RMSE: {rmse:.4f}')
    return mae, mse, rmse, predictions, actuals
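For reference, the three metrics reported by evaluate_model reduce to a few lines of arithmetic. This sketch uses only the standard library and hypothetical case counts; `regression_metrics` is an illustrative helper rather than a replacement for the scikit-learn functions used above.

```python
import math
from typing import Sequence, Tuple

def regression_metrics(actuals: Sequence[float],
                       predictions: Sequence[float]) -> Tuple[float, float, float]:
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    errors = [p - a for a, p in zip(actuals, predictions)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)

# Hypothetical weekly dengue counts and model outputs.
actuals = [0.0, 2.0, 5.0, 1.0]
predictions = [1.0, 2.0, 3.0, 1.0]

mae, mse, rmse = regression_metrics(actuals, predictions)
print(f"MAE: {mae:.4f}, MSE: {mse:.4f}, RMSE: {rmse:.4f}")
```

MSE and RMSE penalize the single error of 2 cases more heavily than MAE does, which matters for count data with occasional outbreak spikes.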

## 5.3 Plots
We generate plots of the losses and of predicted vs. actual values.

In [None]:
def plot_losses(train_losses: List[float], val_losses: List[float]) -> None:
    """
    Generates a plot of training and validation losses.

    Args:
        train_losses: A list of training losses for each epoch.
        val_losses: A list of validation losses for each epoch.
    """
    plt.plot(train_losses, label='Train Loss')
    plt.plot(val_losses, label='Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.title('Training and Validation Losses')
    plt.show()

def plot_predictions(actuals: np.ndarray, predictions: np.ndarray) -> None:
    """
    Generates a plot of actual vs predicted values.

    Args:
        actuals: A NumPy array of actual values.
        predictions: A NumPy array of predicted values.
    """
    plt.plot(actuals, label='Actual')
    plt.plot(predictions, label='Predicted')
    plt.xlabel('Sample')
    plt.ylabel('Dengue Cases')
    plt.legend()
    plt.title('Actual vs Predicted Values')
    plt.show()

## 5.4 Train Models with Bayesian Optimization

In [None]:
def bayesian_optimization(train_loader: DataLoader, val_loader: DataLoader, space: Dict[str, Any], max_evals: int = 50, device: Union[str, torch.device] = 'cpu') -> Dict[str, Any]:

    """
    Performs Bayesian optimization for hyperparameter tuning.

    Args:
        train_loader: The DataLoader for the training data.
        val_loader: The DataLoader for the validation data.
        space: The search space for hyperparameters.
        max_evals: The maximum number of evaluations
        device: The device to train on (e.g., 'cpu' or 'cuda'). Default is 'cpu'.

    Returns:
        A dictionary containing the best hyperparameters and the best model.
    """

    def objective(params):
        """
        Objective function for Bayesian optimization.

        Args:
            params: The hyperparameters to optimize.

        Returns:
            A hyperopt result dict with the last validation loss, the status, and the trained model.
        """
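        # layer_dim is part of the search space, so it is passed to the model as well.
        model = params['model'](input_dim=len(features), hidden_dim=params['hidden_dim'], layer_dim=params['layer_dim'], dropout_rate=params['dropout_rate'])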

        criterion = params['loss_fn']
        optimizer = params['optimizer'](model.parameters(), lr=params['lr'])
        train_losses, val_losses = train_model(model=model, train_loader=train_loader, val_loader=val_loader, epochs=params['epochs'], optimizer=optimizer, criterion=criterion, device=device,patience=params['patience'])

        return {'loss': val_losses[-1], 'status': STATUS_OK, 'model': model}

    trials = Trials()
    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=max_evals, trials=trials)

    # Get the best model from the trials
    best_trial = trials.best_trial
    best_model = best_trial['result']['model'] # Get the best model


    return {'best_params': best, 'best_model': best_model}

Search-space hyperparameters

In our experiments, a very large hidden_dim or layer_dim led to vanishing gradients, with the model predicting a constant.
The best results were obtained with 1 or 2 layers, a maximum dimensionality of 32, and few epochs.

Across the different experiments, the *PoissonNLLLoss* function and the *SGD* optimizer gave the best results in most cases. The best learning rate was close to 1e-4; we recommend fixing it to reduce the search space.
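As background on the loss mentioned above: with its default settings (log_input=True, full=False), PoissonNLLLoss computes exp(input) - target * input per element, so the network output is interpreted as the logarithm of the expected count. The sketch below evaluates that expression in plain Python for a hypothetical target of 4 cases; `poisson_nll` is an illustrative helper, not the PyTorch class.

```python
import math

def poisson_nll(log_rate: float, target: float) -> float:
    """Per-sample Poisson negative log-likelihood, up to a constant in the target."""
    return math.exp(log_rate) - target * log_rate

target = 4.0  # hypothetical observed case count

# Candidate model outputs, expressed as the log of rates 1, 2, 4, and 8.
candidate_rates = [1.0, 2.0, 4.0, 8.0]
losses = [poisson_nll(math.log(rate), target) for rate in candidate_rates]

# The loss is smallest when exp(output) matches the observed count.
best_rate = candidate_rates[losses.index(min(losses))]
print(best_rate)
```

Under that default, the raw output of a model trained with this loss would need `exp` applied to be read as a case count. This section does not show whether the notebook does so, so it is worth checking before comparing MAE or RMSE against a model trained with MSELoss.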

In [None]:
epochs_choices = [10, 20, 30, 40, 50]
patience_choices = [10, 50]
hidden_dim_choices = [8, 12, 16, 24, 32]
layer_dim_choices = [1, 2]
optimizer_choices = [optim.Adam, optim.RMSprop, optim.SGD]
loss_fn_choices = [nn.MSELoss(), nn.PoissonNLLLoss()]
model_choice = [MLPModel, CNNModel, RNNModel, LSTMModel, GRUModel]
max_evals_bayesian_search = 150

In [None]:
space = {
    'model': hp.choice('model', model_choice),
    'hidden_dim': hp.choice('hidden_dim', hidden_dim_choices),
    'layer_dim': hp.choice('layer_dim', layer_dim_choices),
    'dropout_rate': hp.uniform('dropout_rate', 0, 0.7),

    'epochs': hp.choice('epochs', epochs_choices),
    'optimizer': hp.choice('optimizer', optimizer_choices),
    'lr': hp.loguniform('lr', np.log(1e-4), np.log(1e-2)),
    'patience': hp.choice('patience',patience_choices),
    'loss_fn': hp.choice('loss_fn',loss_fn_choices)
}

In [None]:
best_model = bayesian_optimization(train_loader, val_loader, space, max_evals=max_evals_bayesian_search, device=DEVICE)

## Best hyperparameters

In [None]:
print("Best parameters:")
print(f"Model: {model_choice[best_model['best_params']['model']]}")
print(f"hidden_dim: {hidden_dim_choices[best_model['best_params']['hidden_dim']]}")
print(f"layer_dim: {layer_dim_choices[best_model['best_params']['layer_dim']]}")
print(f"dropout_rate: {best_model['best_params']['dropout_rate']}")
print("--------------------------------------------------------------")
print(f"epochs: {epochs_choices[best_model['best_params']['epochs']]}")
print(f"patience: {patience_choices[best_model['best_params']['patience']]}")
print(f"lr: {best_model['best_params']['lr']}")
print(f"optimizer: {optimizer_choices[best_model['best_params']['optimizer']]}")
print(f"loss: {loss_fn_choices[best_model['best_params']['loss_fn']]}")
print("Best model:", best_model['best_model'])
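
The search reports `hp.choice` parameters as indices into their option lists, which is why the cell above looks each value up by hand. That lookup can be gathered into a single helper. `decode_best` and the sample `best` dictionary below are hypothetical and only illustrate the idea; hyperopt's own `space_eval(space, best)` is another way to obtain the decoded values.

```python
def decode_best(best, choices):
    """Replace hp.choice indices with the options they point to.

    best:    dict of parameter name -> value as returned by the search
    choices: dict of parameter name -> list of options (hp.choice params only)
    Parameters without an entry in `choices` (e.g. lr) are copied unchanged.
    """
    decoded = {}
    for name, value in best.items():
        if name in choices:
            decoded[name] = choices[name][int(value)]
        else:
            decoded[name] = value
    return decoded

# Hypothetical search result and option lists, for illustration only.
best = {"hidden_dim": 2, "layer_dim": 0, "lr": 0.0003, "dropout_rate": 0.25}
choices = {"hidden_dim": [8, 12, 16, 24, 32], "layer_dim": [1, 2]}

params = decode_best(best, choices)
print(params)
```

Decoding once and reusing the result keeps the values shown on screen and the values written to disk in agreement.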

In [None]:
with open('output/params.txt', 'w') as f:
    print("Best parameters:", file=f)
    print(f"Model: {model_choice[best_model['best_params']['model']]}", file=f)
    print(f"hidden_dim: {hidden_dim_choices[best_model['best_params']['hidden_dim']]}", file=f)
    print(f"layer_dim: {layer_dim_choices[best_model['best_params']['layer_dim']]}", file=f)
    print(f"dropout_rate: {best_model['best_params']['dropout_rate']}", file=f)
    print("--------------------------------------------------------------",file=f)
    print(f"epochs: {epochs_choices[best_model['best_params']['epochs']]}", file=f)
    print(f"patience: {patience_choices[best_model['best_params']['patience']]}", file=f)
    print(f"lr: {best_model['best_params']['lr']}", file=f)
    print(f"optimizer: {optimizer_choices[best_model['best_params']['optimizer']]}", file=f)
    print(f"loss: {loss_fn_choices[best_model['best_params']['loss_fn']]}", file=f)
    print("Best model:", best_model['best_model'], file=f)
    print("--------------------------------------------------------------",file=f)
    print(f"Window: {config['WINDOW_SIZE']}",file=f)
    print(f"Horizon: {config['HORIZON']}",file=f)
    print(f"Features: {features}",file=f)
    print(f"Model choices: {model_choice}",file=f)
    print("additional info", file=f)

# 6. Prediction on the Test Set

## 6.1 Retrain the final model

We create a copy of the model to retrain it with all the available data.

In [None]:
best_model_trained = copy.deepcopy(best_model['best_model']).to(DEVICE)

We prepare the data to build the sequences in the form the model expects, creating the dataset and the dataloader.

In [None]:
# 1. Scale the features with the already fitted scaler (the target is left unscaled)
train_df_full[features] = scaler_features.transform(train_df_full[features])
# train_df_full[target] = scaler_target.transform(train_df_full[[target]])

# 2. Build the input windows and their labels, grouped by GROUP_COLUMN
train_sequences_full, train_labels_full, full_ids = create_sequences(train_df_full, config["WINDOW_SIZE"], config["HORIZON"], features, target, config["GROUP_COLUMN"])

# 3. Wrap the sequences in a dataset and a dataloader
train_dataset_full = DengueDataset(train_sequences_full, train_labels_full)
train_loader_full = DataLoader(train_dataset_full, batch_size=config["BATCH_SIZE"], shuffle=False)

We train the model again with the best hyperparameters, starting from the best model.

In [None]:
print("\n🔁 Retraining on the full dataset (train + val)...\n")
criterion = loss_fn_choices[best_model['best_params']['loss_fn']]
optimizer = optimizer_choices[best_model['best_params']['optimizer']](best_model_trained.parameters(), lr=best_model['best_params']['lr'])
train_losses, val_losses = train_model(best_model_trained, train_loader_full,  epochs=epochs_choices[best_model['best_params']['epochs']], optimizer=optimizer, criterion=criterion, device=DEVICE, patience=patience_choices[best_model['best_params']['patience']])

## 6.2 Create sequences for the test set
We combine train and test to obtain the preceding weeks needed for each window.

In [None]:
combined_df = pd.concat([train_df_full, test_df], sort=False)
combined_df = combined_df.sort_values(by=['id_bar', 'fecha'])

In [None]:
test_sequences = []
ids = []

for idx, row in test_df.iterrows():
    id_bar = row['id_bar']
    fecha = row.name
    prev_dates = combined_df[(combined_df['id_bar'] == id_bar) & (combined_df.index < fecha)].tail(config["WINDOW_SIZE"])
    if len(prev_dates) == config["WINDOW_SIZE"]:
        seq = prev_dates[features].values
        test_sequences.append(seq)
        ids.append(row['id'])
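
The loop above takes, for each test row, the last `WINDOW_SIZE` rows of the same neighborhood that precede its date. A minimal sketch of that lookup on toy data follows, assuming a date index; the helper `previous_window`, the feature column `x`, and all values are hypothetical.

```python
import pandas as pd

WINDOW_SIZE = 3  # illustrative value, not the notebook's configuration

# Toy history indexed by date: five weeks for neighborhood 1, two for neighborhood 2.
dates = pd.date_range("2022-01-02", periods=5, freq="W")
history = pd.DataFrame(
    {
        "id_bar": [1] * 5 + [2] * 2,
        "x": [10, 11, 12, 13, 14, 20, 21],
    },
    index=dates.append(dates[:2]),
)

def previous_window(df, id_bar, fecha, window):
    """Return the last `window` rows of `id_bar` strictly before `fecha`,
    or None when there is not enough history."""
    mask = (df["id_bar"] == id_bar) & (df.index < fecha)
    rows = df[mask].sort_index().tail(window)
    return rows if len(rows) == window else None

full = previous_window(history, 1, dates[4], WINDOW_SIZE)
short = previous_window(history, 2, dates[4], WINDOW_SIZE)
print(full["x"].tolist())
print(short)
```

Rows without enough history are skipped, so `ids` can end up shorter than `test_df`.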

In [None]:
test_sequences = np.array(test_sequences)
test_tensor = torch.tensor(test_sequences, dtype=torch.float32).to(DEVICE)

# 7. Submission

First we prepare the test dataset.

In [None]:
test_dataset = DengueTestDataset(test_sequences)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

We generate the prediction with the best retrained model.

In [None]:
def forecasting(modelo: torch.nn.Module, dataloader: DataLoader) -> np.ndarray:
    """Generates dengue case predictions using the provided model and dataloader.

    Args:
        modelo: The trained PyTorch model used for forecasting.
        dataloader: The DataLoader containing the test dataset.

    Returns:
        np.ndarray: An array of predicted dengue cases.
    """
    modelo.eval()
    predictions = []
    with torch.no_grad():
        for batch in dataloader:  # Iterate through batches from the dataloader
            if isinstance(batch, (list, tuple)):  # Check if batch is a list or tuple
                X = batch[0]  # Get the input data (assuming it's the first element)
            else:
                X = batch  # Otherwise, assume batch is the input data directly
            X = X.to(DEVICE)
            output = modelo(X)
            #output = torch.expm1(output)
            #output = torch.round(output)
            predictions.append(output.cpu().numpy())

    predictions = np.concatenate(predictions).flatten()
    # predictions = scaler_target.inverse_transform(predictions.reshape(-1, 1)).flatten()

    return predictions

In [None]:
predictions = forecasting(best_model_trained, test_loader)

df_submission = pd.DataFrame({'id': ids, 'dengue': predictions})
df_submission.to_csv('output/submission_retrain.csv', index=False)
print(f'Submission saved to output/submission_retrain.csv, with {len(df_submission)} predictions.')

We generate the prediction with the best model without retraining.

In [None]:
predictions = forecasting(best_model['best_model'], test_loader)

# Prepare submission
df_submission = pd.DataFrame({'id': ids, 'dengue': predictions})
df_submission.to_csv('output/submission.csv', index=False)
print(f'Submission saved to output/submission.csv, with {len(df_submission)} predictions.')

# 8. Prediction vs. validation plots

In [None]:
predictions = forecasting(best_model['best_model'], val_loader)

df_validation = pd.DataFrame({'id': val_ids, 'dengue': predictions})
val_df = val_df.sort_values(by=['id_bar', 'fecha'])

Per-neighborhood plots of the model's predictions against the validation dataset are saved.

In [None]:
# Extract neighborhood IDs from the 'id' column
df_validation['id_bar'] = df_validation['id'].str.split('_').str[0].astype(int)
df_validation['year'] = df_validation['id'].str.split('_').str[1].astype(int)
df_validation['week'] = df_validation['id'].str.split('_').str[2].astype(int)


# Group by neighborhood ID
for id_bar, group in df_validation.groupby('id_bar'):
    # Create a new figure for each neighborhood
    plt.figure(figsize=(8, 4))

    # Plot 'dengue' values for the current neighborhood
    plt.plot(group['week'].values, group['dengue'].values, label=f'Neighborhood {id_bar} (Predicted)')
    # Filter true values from val_df for the current neighborhood and plot
    true_values = val_df[(val_df['id_bar'] == id_bar)]['dengue'].values
    plt.plot(true_values, label=f'Neighborhood {id_bar} (True)')

    # Set title and labels
    plt.title(f'Dengue Cases for Neighborhood {id_bar}')
    plt.xlabel('Week')  # Assuming the x-axis represents weeks
    plt.ylabel('Dengue Cases')

    # Remove top and right spines
    plt.gca().spines[['top', 'right']].set_visible(False)

    # Add legend
    plt.legend()

    # Save the plot
    plt.savefig(f'output/val_plots/neighborhood_{id_bar}.png')

    # Show the plot
    plt.show()
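
Drawing both curves against the same week axis requires predictions and true values to be matched row by row. One way to do that, sketched here on toy data, is to rebuild the `id` key on the validation frame and merge on it. The `id` layout (`id_bar_year_week`) follows the parsing in the cell above, the column names `anio` and `semana` come from the data dictionary, and `dengue_pred` and all values are made up.

```python
import pandas as pd

# Toy predictions keyed by "<id_bar>_<year>_<week>" (made-up values).
pred = pd.DataFrame(
    {"id": ["4_2022_10", "4_2022_11", "7_2022_10"], "dengue_pred": [1.2, 0.8, 3.1]}
)

# Toy validation rows; one of them (week 9) has no prediction.
truth = pd.DataFrame(
    {
        "id_bar": [4, 4, 4, 7],
        "anio": [2022, 2022, 2022, 2022],
        "semana": [9, 10, 11, 10],
        "dengue": [0, 1, 1, 4],
    }
)

# Rebuild the same key on the validation side.
truth["id"] = (
    truth["id_bar"].astype(str)
    + "_" + truth["anio"].astype(str)
    + "_" + truth["semana"].astype(str)
)

# Inner merge keeps only the weeks present in both frames.
aligned = pred.merge(truth, on="id", how="inner").sort_values(["id_bar", "semana"])
print(aligned[["id_bar", "semana", "dengue", "dengue_pred"]])
```

With the merged frame, `semana` can serve as the x-axis for both the `dengue` and `dengue_pred` columns of each neighborhood.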

# 9. Save the model

The best model is saved, both in its retrained version and as found by the search.

In [None]:
torch.save(best_model_trained.state_dict(), 'output/best_model_retrained.pth')
torch.save(best_model['best_model'].state_dict(), 'output/best_model.pth')
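
A `state_dict` holds only the weights, so loading it later requires building the same architecture first. The following minimal sketch shows the round trip with a stand-in `TinyModel` class (hypothetical; the notebook's own model class and constructor arguments would take its place) and an in-memory buffer instead of a file.

```python
import io

import torch
from torch import nn

class TinyModel(nn.Module):
    """Stand-in for the notebook's model classes (hypothetical)."""

    def __init__(self, input_dim=4, hidden_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, x):
        return self.net(x)

trained = TinyModel()

# Save only the weights, here into a memory buffer rather than a .pth file.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Rebuild the same architecture, then load the weights into it.
restored = TinyModel()
restored.load_state_dict(torch.load(buffer, map_location="cpu"))
restored.eval()

# Both models should now produce the same outputs for the same input.
x = torch.randn(2, 4)
with torch.no_grad():
    same = torch.allclose(trained(x), restored(x))
print(same)
```

Because the architecture is not stored in the file, keeping the chosen hyperparameters alongside the weights makes the model reproducible later.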

In [None]:
with open('output/model_summary.txt', 'w') as f:
    print("Model Summary:", file=f)
    try:
        print(summary(best_model_trained, (1, len(features))), file=f)
    except Exception:
        # Retry with a flattened window as input shape before giving up
        try:
            print(summary(best_model_trained, (1, len(features) * config['WINDOW_SIZE'])), file=f)
        except Exception:
            print("Error generating the summary", file=f)

The files generated during the experiment are compressed for easy download.

In [None]:
!zip -r output.zip output