<a href="https://colab.research.google.com/github/cam2149/MachineLearningV/blob/main/CNNs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Team**

- Nicolás Colmenares

- Carlos Martinez

1. Implementation of a convolutional neural network (CNN) adapted to time series (1 pt)


**Situation:**
A city faces a significant rise in dengue cases, with an incidence rate above the national average.
Anticipating outbreaks is crucial for implementing preventive measures and reducing the spread of the disease.

**Objective:**

- Develop a predictive model based on neural networks to forecast future dengue outbreaks in each neighborhood of the city.
- Train the model on a historical database of dengue cases from 2015 to 2022.
- Anticipate outbreaks at least 3 weeks in advance.

**Purpose:**
Enable public health authorities to take timely action, such as:

- Preparing healthcare provider institutions (IPS).
- Managing resources (fumigation trucks, storm-drain cleaning).
- Training the community.

*   Convolutional neural network (CNN) adapted to time series.

# 0. Colab configuration

Move kaggle.json to the correct location after uploading it

In [None]:
# These lines are shell commands run inside the Jupyter notebook. They set up the Kaggle API credentials, which are required to download datasets from Kaggle.

!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!rm -rf /content/kaggle/output
!rm -rf /content/kaggle/input

Download the competition dataset

In [None]:
!kaggle competitions download -c aa-v-2025-i-pronosticos-nn-rnn-cnn

In [None]:
!mkdir -p /content/kaggle/output
!mkdir -p /content/kaggle/input

In [None]:
!mv aa-v-2025-i-pronosticos-nn-rnn-cnn.zip /content/kaggle/input

In [None]:
!unzip /content/kaggle/input/aa-v-2025-i-pronosticos-nn-rnn-cnn.zip -d /content/kaggle/input/

In [None]:
# List every file extracted under /content/kaggle/input
import os
for dirname, _, filenames in os.walk('/content/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# 1. Imports

In [None]:
!pip install altair

In [None]:
import os

import altair as alt  # needed by the version printout in the next cell
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score
from tqdm import tqdm

In [None]:
#Printing library versions
print('Pandas:', pd.__version__)
print('Numpy:', np.__version__)
print('PyTorch:', torch.__version__)
print('Altair:', alt.__version__)

In [None]:
import warnings
warnings.filterwarnings("ignore")

# 2. Configs

In [None]:
config = {
    "TRAIN_DIR": '/content/kaggle/input/df_train.parquet',
    "TEST_DIR": '/content/kaggle/input/df_test.parquet',
    "SUBMISSION_DIR": '/content/sample_submission.csv'
}

# Exploration

## Data dictionary

- train.parquet - The training dataset
- test.parquet - The test dataset
- sample_submission.csv - An example of a file to submit to the competition

| **Variable**     | **Description**                                                                 |
|------------------|---------------------------------------------------------------------------------|
| id_bar           | Unique identifier of the neighborhood                                           |
| anio             | Year of occurrence                                                              |
| semana           | Week of occurrence                                                              |
| Estrato          | Socioeconomic stratum of the neighborhood                                       |
| area_barrio      | Area of the neighborhood in km²                                                 |
| dengue           | Count of dengue cases                                                           |
| concentraciones  | Number of visits and interventions at places where people gather (institutions) |
| vivienda         | Count of home visits to inspect and control breeding sites                      |
| equipesado       | Count of fumigations with heavy machinery                                       |
| sumideros        | Count of interventions on storm drains                                          |
| maquina          | Count of fumigations with motorized backpack sprayers                           |
| lluvia_mean      | Mean rainfall in week i                                                         |
| lluvia_var       | Variance of rainfall in week i                                                  |
| lluvia_max       | Maximum rainfall in week i                                                      |
| lluvia_min       | Minimum rainfall in week i                                                      |
| temperatura_mean | Mean temperature in week i                                                      |
| temperatura_var  | Variance of temperature in week i                                               |
| temperatura_max  | Maximum temperature in week i                                                   |
| temperatura_min  | Minimum temperature in week i                                                   |


## Reading the training dataset

In [None]:
# This cell reads the training and test data from Parquet files, builds a row identifier, and sorts the rows chronologically
def load_and_prepare_data(train_dir, test_dir):
    try:
        train_df = pd.read_parquet(train_dir)
        test_df = pd.read_parquet(test_dir)
    except FileNotFoundError as e:
        # Report the missing file and re-raise so the caller never unpacks None
        print(f"Error: input file not found ({e}). Check the paths in config.")
        raise

    # Build the 'id' column as <id_bar>_<anio>_<semana>
    for df in [train_df, test_df]:
        df['id'] = df['id_bar'].astype(str) + '_' + df['anio'].astype(str) + '_' + df['semana'].astype(str)

    # Sort by neighborhood, year, and week
    train_df = train_df.sort_values(['id_bar', 'anio', 'semana'])
    test_df = test_df.sort_values(['id_bar', 'anio', 'semana'])

    return train_df, test_df



In [None]:
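def normalize_features(train_df, test_df, features):
    # Fit the scaler on the training set only and reuse its statistics on the
    # test set, so no information from the test set leaks into the scaling.
    # Errors are left to propagate instead of returning None silently.
    scaler = StandardScaler()
    train_df[features] = scaler.fit_transform(train_df[features])
    test_df[features] = scaler.transform(test_df[features])
    return train_df, test_df, scaler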
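The scaling step fits its statistics on the training rows and then reuses them for the test rows. A minimal NumPy sketch of that idea follows; the `train` and `test` arrays are made-up values for illustration, not competition data, and the manual formula is assumed to mirror what `StandardScaler` computes with its default settings (per-column mean and population standard deviation).

```python
import numpy as np

# Toy feature matrices (hypothetical values): rows are weeks, columns are features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# Statistics come from the training set only (population std, ddof=0)
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics; never refit on test

print(train_scaled.mean(axis=0))  # approximately zero for every column
print(test_scaled)
```

Because the test rows are scaled with training statistics, their values need not have zero mean or unit variance; that is expected.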

In [None]:
def create_sequences(df, window_size, forecast_horizon, features, target):
    # Build (window, target) pairs per neighborhood: each window of `window_size`
    # weeks is paired with the target `forecast_horizon` weeks after the
    # window's last week.
    X, y = [], []
    for id_bar in df['id_bar'].unique():
        df_bar = df[df['id_bar'] == id_bar].sort_values(['anio', 'semana'])
        feats = df_bar[features].values
        targets = df_bar[target].values
        for i in range(len(df_bar) - window_size - forecast_horizon + 1):
            X.append(feats[i:i+window_size])
            y.append(targets[i+window_size+forecast_horizon-1])
    return np.array(X), np.array(y)

def create_test_sequences(train_df, test_df, window_size, forecast_horizon, features):
    X_test = []
    # Weekly index counted from the first training year (assumes 52 weeks per year)
    base_year = train_df['anio'].min()
    train_df['temporal_idx'] = (train_df['anio'] - base_year) * 52 + train_df['semana']
    test_df['temporal_idx'] = (test_df['anio'] - base_year) * 52 + test_df['semana']

    for idx, row in test_df.iterrows():
        id_bar = row['id_bar']
        t = row['temporal_idx']
        # Same alignment as create_sequences: the window ends exactly
        # forecast_horizon weeks before the week being predicted
        start = t - forecast_horizon - window_size + 1
        end = t - forecast_horizon
        df_bar = train_df[(train_df['id_bar'] == id_bar) &
                          (train_df['temporal_idx'] >= start) &
                          (train_df['temporal_idx'] <= end)]
        if len(df_bar) == window_size:
            X_test.append(df_bar[features].values)
        else:
            # Zero padding when the training history does not cover the full window
            X_test.append(np.zeros((window_size, len(features))))
    return np.array(X_test)
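The windowing performed by `create_sequences` can be illustrated on a toy series. The sketch below assumes a single made-up univariate series and a hypothetical helper named `sliding_windows`; it uses the same index arithmetic, so each window is paired with the value `forecast_horizon` steps after its last element.

```python
import numpy as np

def sliding_windows(series, window_size, forecast_horizon):
    """Pair each window with the value forecast_horizon steps after its last element."""
    X, y = [], []
    for i in range(len(series) - window_size - forecast_horizon + 1):
        X.append(series[i:i + window_size])
        y.append(series[i + window_size + forecast_horizon - 1])
    return np.array(X), np.array(y)

series = np.arange(10)  # hypothetical weekly values 0..9
X, y = sliding_windows(series, window_size=4, forecast_horizon=3)

print(X.shape)  # (4, 4): four windows of four weeks each
print(X[0], y[0])  # first window is weeks 0..3, its target is week 6
```

With 10 observations, a window of 4, and a horizon of 3, only 4 training pairs exist, which shows how quickly long windows and horizons shrink the usable data for each neighborhood.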

In [None]:
class DengueDataset(Dataset):
    def __init__(self, X, y=None):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32) if y is not None else None

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        if self.y is not None:
            return self.X[idx], self.y[idx]
        return self.X[idx]

In [None]:
class CNNForecast(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, kernel_size=3, window_size=10):
        super(CNNForecast, self).__init__()
        self.conv1 = nn.Conv1d(input_size, hidden_size, kernel_size)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool1d(2)
        # Sequence length after the convolution (no padding, stride 1) and the pooling.
        # window_size is an explicit argument so the class does not depend on a global.
        conv_output_size = (window_size - kernel_size + 1) // 2
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(hidden_size * conv_output_size, 50)
        self.fc2 = nn.Linear(50, output_size)

    def forward(self, x):
        x = x.permute(0, 2, 1)  # (batch, sequence, features) -> (batch, features, sequence)
        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
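The input size of the first linear layer depends on how much the convolution and the pooling shorten the sequence. A small sketch of that arithmetic, assuming a `Conv1d` with stride 1, no padding, and no dilation followed by a non-overlapping max pool (`conv_pool_output_length` is a hypothetical helper, not part of the notebook):

```python
def conv_pool_output_length(seq_len, kernel_size, pool_size=2):
    """Length after Conv1d (stride 1, no padding, no dilation) followed by MaxPool1d."""
    conv_len = seq_len - kernel_size + 1
    if conv_len < pool_size:
        raise ValueError("window too short for this kernel and pool size")
    return conv_len // pool_size

hidden_size = 64
length = conv_pool_output_length(seq_len=10, kernel_size=3)

# Flattened size fed to the first linear layer: channels * remaining time steps
print(length, hidden_size * length)
```

For a 10-week window and a kernel of 3, the convolution leaves 8 steps and the pooling halves them to 4, so the flattened vector has 64 * 4 = 256 values.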

In [None]:
import copy

def train_model(model, train_loader, val_loader, criterion, optimizer, epochs, device):
    model.to(device)
    best_mse = float('inf')
    best_model_state = None
    best_preds, best_true = [], []

    for epoch in tqdm(range(epochs), desc="Training"):
        model.train()
        running_loss = 0.0
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            optimizer.zero_grad()
            y_pred = model(X_batch)
            # squeeze(-1) drops only the output dimension, so a batch of one sample stays 1-D
            loss = criterion(y_pred.squeeze(-1), y_batch)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        # Evaluation on the validation set
        model.eval()
        val_preds, val_true = [], []
        with torch.no_grad():
            for X_val, y_val in val_loader:
                X_val, y_val = X_val.to(device), y_val.to(device)
                y_pred = model(X_val).squeeze(-1)
                val_preds.extend(y_pred.cpu().numpy())
                val_true.extend(y_val.cpu().numpy())

        mse = mean_squared_error(val_true, val_preds)
        if mse < best_mse:
            best_mse = mse
            # Deep-copy the weights: state_dict() returns references to the live
            # tensors, which later optimizer steps would otherwise overwrite
            best_model_state = copy.deepcopy(model.state_dict())
            best_preds, best_true = val_preds, val_true

        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {running_loss/len(train_loader):.4f}, Val MSE: {mse:.4f}")

    # Restore the best weights and report the metrics of that same epoch
    model.load_state_dict(best_model_state)
    return model, compute_metrics(best_true, best_preds)

def compute_metrics(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    # "Accuracy" here is the share of samples where prediction and target fall on the same side of an arbitrary 0.5 threshold
    y_true_bin = [1 if x > 0.5 else 0 for x in y_true]
    y_pred_bin = [1 if x > 0.5 else 0 for x in y_pred]
    accuracy = accuracy_score(y_true_bin, y_pred_bin)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "Accuracy": accuracy}
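To make the reported metrics concrete, here is a worked example on made-up values, computed with plain NumPy rather than the scikit-learn helpers. The arrays are hypothetical and only illustrate the formulas.

```python
import numpy as np

y_true = np.array([0.0, 1.0, 2.0, 3.0])  # hypothetical weekly case counts
y_pred = np.array([0.0, 1.0, 1.0, 5.0])  # hypothetical forecasts

errors = y_pred - y_true
mae = np.abs(errors).mean()   # mean absolute error
mse = (errors ** 2).mean()    # mean squared error
rmse = np.sqrt(mse)           # root mean squared error

# Thresholded "accuracy": prediction and target fall on the same side of 0.5
accuracy = ((y_true > 0.5) == (y_pred > 0.5)).mean()

print(mae, mse, rmse, accuracy)
```

In this toy case the thresholded accuracy is 1.0 even though two forecasts miss by one and two cases, so it is best read as a coarse "any cases vs. none" indicator alongside MAE and RMSE.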

In [None]:
def find_best_model(X_train, y_train, input_size, hidden_size=64, output_size=1, device='cpu'):
    epochs_list = [100, 300, 500]
    learning_rates = [0.01, 0.001]
    optimizers = [optim.Adam, optim.SGD]

    # Split into training (first 80% of the sequences) and validation (last 20%)
    split_idx = int(0.8 * len(X_train))
    train_dataset = DengueDataset(X_train[:split_idx], y_train[:split_idx])
    val_dataset = DengueDataset(X_train[split_idx:], y_train[split_idx:])
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=32)

    best_metrics = None
    best_config = None
    best_model = None

    for epochs in epochs_list:
        for lr in learning_rates:
            for opt_class in optimizers:
                model = CNNForecast(input_size, hidden_size, output_size)
                optimizer = opt_class(model.parameters(), lr=lr)
                criterion = nn.MSELoss()

                print(f"\nTraining with epochs={epochs}, lr={lr}, optimizer={opt_class.__name__}")
                model, metrics = train_model(model, train_loader, val_loader, criterion, optimizer, epochs, device)

                if best_metrics is None or metrics['MSE'] < best_metrics['MSE']:
                    best_metrics = metrics
                    best_config = {'epochs': epochs, 'lr': lr, 'optimizer': opt_class.__name__}
                    best_model = model

    return best_model, best_metrics, best_config
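The three nested loops amount to an exhaustive grid over epochs, learning rate, and optimizer. As a rough sketch of the size of that search, the grid can be enumerated with `itertools.product`; optimizer names are plain strings here, an assumption made to keep the example free of PyTorch.

```python
from itertools import product

epochs_list = [100, 300, 500]
learning_rates = [0.01, 0.001]
optimizer_names = ["Adam", "SGD"]  # stand-ins for optim.Adam and optim.SGD

# One dictionary per combination, in the same order as the nested loops
grid = [
    {"epochs": e, "lr": lr, "optimizer": name}
    for e, lr, name in product(epochs_list, learning_rates, optimizer_names)
]

total_epochs = sum(cfg["epochs"] for cfg in grid)
print(len(grid), total_epochs)
print(grid[0])
```

The search trains 12 models for a combined 3600 epochs, which is worth keeping in mind before adding more values to any of the lists.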

In [None]:
def predict_and_submit(model, X_test, test_df, submission_dir, device):
    model.eval()
    X_test_tensor = torch.tensor(X_test, dtype=torch.float32).to(device)
    with torch.no_grad():
        y_pred = model(X_test_tensor).squeeze().cpu().numpy()

    submission_df = test_df[['id']].copy()
    submission_df['dengue'] = y_pred
    submission_df.to_csv(submission_dir, index=False)
    return submission_df

In [None]:
# Configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
window_size = 10
forecast_horizon = 3
features = ['ESTRATO', 'area_barrio', 'concentraciones', 'vivienda', 'equipesado', 'sumideros',
            'maquina', 'lluvia_mean', 'lluvia_var', 'lluvia_max', 'lluvia_min',
            'temperatura_mean', 'temperatura_var', 'temperatura_max', 'temperatura_min']

# Load and prepare the data
train_df, test_df = load_and_prepare_data(config["TRAIN_DIR"], config["TEST_DIR"])
train_df, test_df, scaler = normalize_features(train_df, test_df, features)

# Build the sequences
X_train, y_train = create_sequences(train_df, window_size, forecast_horizon, features, 'dengue')
X_test = create_test_sequences(train_df, test_df, window_size, forecast_horizon, features)

# Find the best model
input_size = len(features)
best_model, best_metrics, best_config = find_best_model(X_train, y_train, input_size, device=device)

# Print a summary
print("\nBest model summary:")
print(f"Configuration: {best_config}")
print(f"Metrics: {best_metrics}")

# Prediction and submission
submission_df = predict_and_submit(best_model, X_test, test_df, config["SUBMISSION_DIR"], device)

print("\nPredictions written to:", config["SUBMISSION_DIR"])