<div style="text-align: center;">
    <a target="_blank" href="https://colab.research.google.com/github/bmalcover/cursSocib/blob/main/2_AA/2_8_Time_Series.ipynb">
        <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
    </a>
</div>


# Deep Learning for Time Series


Traditional time series models are effective for small-scale, linear, and stationary data. However, real-world time series are often:

- **Non-linear** (e.g., financial markets, weather patterns)
- **High-dimensional** (e.g., multiple sensors, multivariate financial indicators)
- **Non-stationary** (e.g., concept drift, changing seasonality)
- **Noisy and incomplete** (e.g., IoT sensor dropouts)

Deep learning offers several key advantages:

- **Learns complex patterns**: Can model non-linear relationships and long-range dependencies.
- **Scales to large data**: Handles high-volume, high-frequency data without manual feature engineering.
- **Supports multivariate inputs**: Learns interactions between multiple time series.
- **Flexible input/output formats**: Sequence-to-sequence architectures allow custom horizons and multistep forecasting.
- **Robust to noise**: Neural networks can filter and learn over noisy, redundant inputs.

As a result, deep learning has become essential in domains where:
- Feature interactions are unknown or too complex to hand-craft,
- Patterns evolve over time (non-stationarity),
- Large, heterogeneous datasets are available.


In [None]:
import json
from matplotlib import pyplot as plt
from tqdm.auto import tqdm

import torch
from torch import nn

## Daily SST, Gulf of Mexico

The data comes from [Climate Reanalyzer](https://climatereanalyzer.org/clim/sst_daily/?dm_id=gomex). Our goal is to forecast the daily sea surface temperature. The source page describes the dataset as follows:

> This page provides time series and map visualizations of daily mean Sea Surface Temperature (SST) from NOAA Optimum Interpolation SST (OISST) version 2.1. OISST is a 0.25°x0.25° gridded dataset that provides estimates of temperature based on a blend of satellite, ship, and buoy observations. The dataset spans 1 September 1981 to present with a 1 to 2-day lag from the current day. Data are preliminary for about two weeks until a finalized product is posted by NOAA. This status is identified on the maps by "preliminary" appearing in the title, and applies to the time series as well. Learn more about OISST, including strengths and limitations, from the NCAR Climate Data Guide.


<img src="../assets/bloc2/data.png" />

In [None]:
!curl https://climatereanalyzer.org/clim/sst_daily/json_2clim/oisst2.1_gomex_sst_day.json -o mexico_gulf.json

In [None]:
with open('mexico_gulf.json') as f:
    data = json.load(f)

info = list()
for year_info in data[1:-3]:
    print(year_info['name'])
    info += year_info["data"]

import pandas as pd

df = pd.DataFrame(info)

## Data cleaning

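The first and the last year were dropped when loading the data because they contain NaNs. The last 365 days of the remaining series are held out for validation, and everything before them is used for training.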

In [None]:
x_train = df[:-365]
x_val = df[-365:]

Time series data often comes from sensors, logs, APIs, or manual inputs, and it is rarely clean out of the box. **Data cleaning** is a crucial first step before any time series analysis or modeling. Cleaning ensures that patterns in the data reflect reality, not noise or errors.

| Method                          | Description                                  | Syntax Example                           |
|--------------------------------|----------------------------------------------|-------------------------------------------|
| Drop NaNs                      | Removes rows with NaNs                       | `df.dropna()`                             |
| Forward Fill                   | Fills NaN with previous value                | `df.ffill()`                              |
| Backward Fill                  | Fills NaN with next value                    | `df.bfill()`                              |
| Fill with Fixed Value          | Replaces NaN with a specified value          | `df.fillna(0)`                            |
| Fill with Mean                 | Replaces NaN with column mean                | `df.fillna(df.mean())`                    |
| Fill with Median               | Replaces NaN with column median              | `df.fillna(df.median())`                  |
| Interpolate                    | Linearly interpolate missing values          | `df.interpolate()`                        |
| Fill Within Group (ffill)      | Forward fill within group (e.g., by date)    | `df.groupby('group_col').ffill()`         |
| Fill Within Group (mean)       | Fill with mean within each group             | `df.fillna(df.groupby('group_col').transform('mean'))` |
| Rolling Fill                   | Use rolling stats (e.g., mean) to fill       | `df.fillna(df.rolling(3, min_periods=1).mean())` |
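
To make the difference between a few of these strategies concrete, the sketch below applies them to a small hand-made series. The values are illustrative and are not taken from the SST data.

```python
import numpy as np
import pandas as pd

# Toy series with two gaps; the numbers are made up for illustration.
s = pd.Series([20.0, np.nan, 22.0, 23.0, np.nan, 25.0])

filled = pd.DataFrame({
    "original": s,
    "ffill": s.ffill(),              # repeat the last valid value
    "bfill": s.bfill(),              # use the next valid value
    "mean": s.fillna(s.mean()),      # constant: mean of the observed values
    "interpolate": s.interpolate(),  # linear between the neighbouring values
})
print(filled)
```

For a smooth physical signal such as temperature, forward fill and interpolation usually keep the local trend, whereas a global mean can introduce artificial jumps.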


In [None]:
x_train.isnull().sum()

We will use the ``ffill`` method to fill the NaNs.

In [None]:
x_train = x_train.ffill()
x_val = x_val.ffill()  # the validation split may contain NaNs as well
x_train.isnull().sum()

In [None]:
x_train = torch.from_numpy(x_train.values)
x_val = torch.from_numpy(x_val.values)

In [None]:
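def get_batches(data, window):
    """
    Slide a window over data with shape (n_samples, n_features).

    Yields one (x, y) pair at a time: x has shape (window, n_features) and
    y has shape (n_features,). The caller adds the batch dimension.
    """

    L = len(data)
    for i in range(L - (window + 1)):
        x_sequence = data[i:i + window]
        # Note: the target is data[i + window + 1], i.e. two steps after the
        # last element of the window; use data[i + window] for the next step.
        y_sequence = data[i + window + 1]

        yield x_sequence, y_sequence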
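
To check what the windowing produces, the sketch below runs the same idea on a toy tensor. `sliding_windows` is a hypothetical helper, not part of the notebook: it adds a `gap` parameter, where `gap=1` targets the step right after the window and `gap=2` reproduces the indexing of `get_batches` above.

```python
import torch

def sliding_windows(data, window, gap=1):
    """Yield (x, y): `window` consecutive rows and the row `gap` steps after the window's last row."""
    for i in range(len(data) - window - gap + 1):
        yield data[i:i + window], data[i + window + gap - 1]

# Toy series 0, 1, ..., 9 with a single feature: shape (10, 1)
series = torch.arange(10, dtype=torch.float32).unsqueeze(1)

pairs = list(sliding_windows(series, window=3))
x0, y0 = pairs[0]
print(x0.shape, y0.shape)                # torch.Size([3, 1]) torch.Size([1])
print(x0.squeeze().tolist(), y0.item())  # [0.0, 1.0, 2.0] 3.0
print(len(pairs))                        # 7
```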

## Model: RNN

![](https://www.mdpi.com/information/information-15-00517/article_deploy/html/images/information-15-00517-g001-550.jpg)

Recurrent Neural Networks (RNNs) are a type of neural network designed for **sequence data**. Unlike traditional feedforward neural networks, RNNs have connections that loop backward, allowing them to maintain a **memory of previous inputs**. This makes them ideal for tasks like:

- Time series forecasting
- Natural language processing
- Speech recognition
- Sequential data classification

### How RNNs Work

At each time step $t$, an RNN takes an input $x_t$ and the previous hidden state $h_{t-1}$, and computes the new hidden state $h_t$:

$$
    h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h)
$$

Where:
- $x_t$: input at time $t$
- $h_t$: hidden state at time $t$
- $W_{xh}$, $W_{hh}$: weight matrices
- $b_h$: bias term

The hidden state $h_t$ acts as a summary of everything the network has seen up to time $t$.
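
The update rule can be checked against PyTorch directly. The sketch below is a minimal illustration with toy sizes chosen only for this example: it applies the formula by hand, using the weights of an `nn.RNNCell`, and compares the result with the cell's own output. PyTorch stores two bias vectors (`bias_ih` and `bias_hh`); their sum plays the role of $b_h$.

```python
import torch
from torch import nn

torch.manual_seed(0)
input_size, hidden_size, seq_len = 1, 4, 5  # toy sizes, chosen for illustration

cell = nn.RNNCell(input_size, hidden_size)  # tanh non-linearity by default
W_xh, W_hh = cell.weight_ih, cell.weight_hh
b_h = cell.bias_ih + cell.bias_hh           # two PyTorch biases, one b_h in the formula

x = torch.randn(seq_len, input_size)        # one toy sequence
h_manual = torch.zeros(hidden_size)
h_cell = torch.zeros(1, hidden_size)

with torch.no_grad():
    for x_t in x:
        # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        h_manual = torch.tanh(W_xh @ x_t + W_hh @ h_manual + b_h)
        h_cell = cell(x_t.unsqueeze(0), h_cell)

print(torch.allclose(h_manual, h_cell.squeeze(0), atol=1e-6))
```

The same weights are reused at every iteration of the loop, which is what weight sharing across time steps means in practice.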

### Key Concepts

- **Hidden state ($h_t$)**: Carries information from previous time steps.
- **Weights are shared** across all time steps, making RNNs efficient for sequences.

### Limitations

- RNNs struggle with **long-term dependencies** due to vanishing gradients.
- Variants like **LSTM** and **GRU** are designed to solve this problem.
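
The vanishing-gradient effect can be observed numerically. The sketch below assumes a freshly initialised single-layer `nn.RNN` with 16 hidden units and random inputs (all illustrative choices). It measures how strongly the summed last hidden state depends on the first input as the sequence gets longer.

```python
import torch
from torch import nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)

def grad_wrt_first_input(seq_len):
    """Magnitude of d(sum of last hidden state) / d(first input) for a random sequence."""
    x = torch.randn(1, seq_len, 1, requires_grad=True)
    out, _ = rnn(x)
    out[:, -1, :].sum().backward()
    return x.grad[0, 0].norm().item()

for T in (2, 10, 50):
    print(T, grad_wrt_first_input(T))
```

With PyTorch's default initialisation the printed values typically shrink quickly as `T` grows, so early inputs barely influence the parameter updates.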

RNNs are a foundational building block in deep learning for sequential tasks, and understanding them is key to mastering time-based data models.

In [None]:
features = 1  # Features
n_hidden = 200  # Nodes
n_layers = 2  # Layers

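class RNN(nn.Module):
    def __init__(self):
        super().__init__()
        # batch_first=True so inputs are (batch_size, sequence_length, features),
        # matching the indexing used in forward().
        self.rnn = nn.RNN(features, n_hidden, n_layers, batch_first=True)
        self.fc = nn.Linear(n_hidden, 1)

    def forward(self, x):
        x, _ = self.rnn(x)  # (batch_size, sequence_length, n_hidden)
        x = x[:, -1, :]     # keep the output at the last time step

        x = self.fc(x)

        return x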

When created with ``batch_first=True``, an RNN layer returns an output of shape:

``batch_size, sequence_length, hidden_units``

Because the network is recurrent, the output at the last position of the sequence summarizes everything seen so far, so we use it to make the prediction. The hidden units play a role similar to the neurons of a fully connected network or the number of filters in a CNN.
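
The shapes can be verified on random data. The sketch below uses toy sizes picked only for illustration and assumes `batch_first=True`; without that flag, `nn.RNN` expects and returns tensors laid out as `(sequence_length, batch_size, ...)`.

```python
import torch
from torch import nn

batch_size, seq_len, n_features, n_hidden, n_layers = 4, 12, 1, 8, 2  # toy sizes

rnn = nn.RNN(n_features, n_hidden, n_layers, batch_first=True)
x = torch.randn(batch_size, seq_len, n_features)

out, h_n = rnn(x)
print(out.shape)  # torch.Size([4, 12, 8])  -> (batch_size, sequence_length, hidden_units)
print(h_n.shape)  # torch.Size([2, 4, 8])   -> (n_layers, batch_size, hidden_units)

# The last time step of the output is the final hidden state of the top layer.
last_step = out[:, -1, :]
print(torch.allclose(last_step, h_n[-1]))  # True
```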

In [None]:
net = RNN()

EPOCHS = 5
LR = 1e-3

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=LR)

train_loss = []
valid_loss = []

for _ in tqdm(range(EPOCHS), desc="Epochs"):
    net.train()
    t_loss = 0
    for i, (x, y) in enumerate(get_batches(x_train, 12)):
        optimizer.zero_grad()

        # Create batch_size dimension
        x = x.unsqueeze(0)
        out = net(x.float())

        loss = criterion(out, y.reshape(1, 1).float())
        loss.backward()
        optimizer.step()
        t_loss += loss.item()

    # Mean loss over the i + 1 windows seen in this epoch
    train_loss.append(t_loss / (i + 1))

    net.eval()
    val_loss = 0
    with torch.no_grad():  # no gradients are needed for validation
        for i, (x, y) in enumerate(get_batches(x_val, 12)):
            x = x.unsqueeze(0)
            out = net(x.float())
            loss = criterion(out, y.reshape(1, 1).float())

            val_loss += loss.item()

    valid_loss.append(val_loss / (i + 1))

In [None]:
plt.title("Train loss")
plt.plot(train_loss);

## Model: LSTM

A good explanation of LSTMs is available from [MathWorks](https://la.mathworks.com/discovery/lstm.html).

**Long Short-Term Memory (LSTM)** networks are a specialized type of **Recurrent Neural Network (RNN)** designed to model sequential data and capture long-term dependencies. Unlike traditional RNNs, which suffer from the **vanishing gradient problem**, LSTMs can learn to retain or forget information over long sequences using a more complex internal structure.

Standard RNNs update their hidden state $h_t$ at each time step based on the current input $x_t$ and the previous hidden state $h_{t-1}$. However, as the sequence length increases, the network struggles to retain relevant past information due to vanishing or exploding gradients during training.

LSTMs address this by introducing a **memory cell** that can preserve information across many time steps, controlled by gates that regulate the flow of information.

### Architecture Components

![](https://la.mathworks.com/discovery/lstm/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/columns_1332042868/09887e7d-81a1-4a53-b298-eb7bd9d6ac8c/image.adapt.full.medium.jpg/1747387158162.jpg)

Each LSTM cell has three primary gates:

- **Forget Gate $f_t$**: Decides what information to discard from the previous cell state.
- **Input Gate $i_t$**: Determines which new values to add to the cell state.
- **Output Gate $o_t$**: Controls the information passed to the next hidden state.

Additionally:

- **Cell State $C_t$**: Acts as a memory conveyor belt, updated at each step using the input and forget gates.
- **Hidden State $h_t$**: The output of the LSTM cell at each time step.

### Mathematical Formulation

The update equations for an LSTM cell at time step $t$ are:

$$
\begin{align*}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) & \text{(Forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) & \text{(Input gate)} \\
\tilde{C}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) & \text{(Candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t & \text{(Updated cell state)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) & \text{(Output gate)} \\
h_t &= o_t \odot \tanh(C_t) & \text{(Updated hidden state)}
\end{align*}
$$


| Symbol          | Meaning                          |
| --------------- | -------------------------------- |
| $x_t$           | Input vector at time $t$         |
| $h_t$           | Hidden state at time $t$         |
| $C_t$           | Cell state (memory) at time $t$  |
| $f_t, i_t, o_t$ | Forget, input, and output gates  |
| $\tilde{C}_t$   | Candidate values for cell state  |
| $W_*, U_*, b_*$ | Weights and biases for each gate |
| $\sigma$        | Sigmoid activation function      |
| $\tanh$         | Hyperbolic tangent activation    |
| $\odot$         | Element-wise multiplication      |
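
The equations can be checked against PyTorch by applying them by hand to the weights of an `nn.LSTMCell`. This is a minimal sketch with toy sizes chosen for illustration. It relies on the fact that PyTorch stacks the four weight blocks in the order input, forget, candidate, output, and keeps two bias vectors per gate whose sum corresponds to $b_*$.

```python
import torch
from torch import nn

torch.manual_seed(0)
input_size, hidden_size = 1, 4  # toy sizes, chosen for illustration
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)
h_prev = torch.randn(1, hidden_size)
c_prev = torch.randn(1, hidden_size)

with torch.no_grad():
    # Pre-activations of all four gates at once, then split into blocks.
    pre = x_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
    i_pre, f_pre, g_pre, o_pre = pre.chunk(4, dim=1)

    f_t = torch.sigmoid(f_pre)          # forget gate
    i_t = torch.sigmoid(i_pre)          # input gate
    c_tilde = torch.tanh(g_pre)         # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde  # updated cell state
    o_t = torch.sigmoid(o_pre)          # output gate
    h_t = o_t * torch.tanh(c_t)         # updated hidden state

    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_t, h_ref, atol=1e-6), torch.allclose(c_t, c_ref, atol=1e-6))
```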


### RNN vs LSTM

| Feature                           | RNN                                                                  | LSTM                                                              |
| --------------------------------- | -------------------------------------------------------------------- | ----------------------------------------------------------------- |
| **Memory capability**             | Short-term memory only; struggles with long sequences.               | Designed for **long-term memory** using a memory cell.            |
| **Structure**                     | Simple loop with one hidden state update.                            | Complex cell with **three gates**: input, forget, and output.     |
| **Vanishing gradient problem**    | Common issue during training with long sequences.                    | Much **less prone** to vanishing gradients.                       |
| **Gates**                         | ❌ No gates. Just applies an activation function to the hidden state. | ✅ Uses gates to control **what to remember, forget, and output**. |
| **Training time**                 | Faster to train (fewer parameters).                                  | Slower to train (more parameters, complex structure).             |
| **Performance on long sequences** | Poor, tends to forget early inputs.                                  | Excellent at **capturing long-term dependencies**.                |
| **Use cases**                     | Simple or short sequences.                                           | Complex, long sequences like text, audio, time series.            |
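
The difference in parameter count can be measured directly. The sketch below counts the parameters of the recurrent layers only, using the same sizes as the models in this notebook (1 input feature, 200 hidden units, 2 layers); `count_parameters` is a small helper introduced for this example.

```python
from torch import nn

def count_parameters(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

rnn = nn.RNN(input_size=1, hidden_size=200, num_layers=2)
lstm = nn.LSTM(input_size=1, hidden_size=200, num_layers=2)

n_rnn, n_lstm = count_parameters(rnn), count_parameters(lstm)
print(n_rnn, n_lstm, n_lstm / n_rnn)  # 121000 484000 4.0
```

The factor of four comes from the LSTM learning one set of weights for each of the three gates plus one for the candidate cell state.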



### When to Use Which?

- Use RNN when:
    - The sequence is short.
    - You need something lightweight.
    - You’re just experimenting or prototyping.



- Use LSTM when:
    - The sequence is long or has dependencies far apart.
    - You need accurate memory retention (e.g., text generation, time series prediction).
    - Vanishing gradients are a problem.

In [None]:
input_size = 1 # Features
n_hidden = 200 # Nodes
n_layers = 2 # Layers


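class LSTM(nn.Module):
    def __init__(self):
        super().__init__()
        # batch_first=True so inputs are (batch_size, sequence_length, features),
        # matching the indexing used in forward().
        self.lstm = nn.LSTM(input_size, n_hidden, n_layers, batch_first=True)
        self.fc = nn.Linear(n_hidden, 1)

    def forward(self, x):
        x, _ = self.lstm(x)  # (batch_size, sequence_length, n_hidden)
        x = x[:, -1, :]      # keep the output at the last time step

        x = self.fc(x)

        return x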

In [None]:
net = LSTM()

EPOCHS = 5
LR = 1e-3

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=LR)

train_loss = []
valid_loss = []

for e in tqdm(range(EPOCHS), desc="Epochs"):
    net.train()
    t_loss = 0
    for i, (x, y) in enumerate(get_batches(x_train, 12)):
        optimizer.zero_grad()

        # Create batch_size dimension
        x = x.unsqueeze(0)
        out = net(x.float())

        loss = criterion(out, y.reshape(1, 1).float())
        loss.backward()
        optimizer.step()
        t_loss += loss.item()

    # Mean loss over the i + 1 windows seen in this epoch
    train_loss.append(t_loss / (i + 1))