# PCS5024 - Aprendizado Estatístico - Statistical Learning - 2023/1
### Professors: 
### Anna Helena Reali Costa (anna.reali@usp.br)
### Fabio G. Cozman (fgcozman@usp.br)

In [1]:
!pip install --quiet torch numpy pandas gdown uniplot matplotlib tqdm

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader
import gdown
from tqdm import tqdm
import uniplot
import datetime
import random

# Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a family of neural networks specifically designed for processing **sequential data** (Rumelhart et al., 1986a). They can handle sequences of _variable length_ and _share parameters_ across different parts of the model, making them more adaptable to different input forms. RNNs can generalize across different sequence lengths and positions in time, which is particularly important when specific information can occur at multiple positions within a sequence.

In contrast to traditional fully connected feedforward networks, RNNs share the same weights across several time steps. This allows the network to learn language rules at each position in the sequence without having to relearn them separately. Convolution across a 1-D temporal sequence is another related idea, used in Time-Delay Neural Networks (TDNNs) (Lang and Hinton, 1988; Waibel et al., 1989; Lang et al., 1990). While convolution allows for parameter sharing across time, it is shallow compared to the deep computational graph sharing in RNNs.

**Sharing weights is an idea from 1980s ML and is still very commonly used in Deep Learning!**

In general, an RNN can be represented as follows:

<img src='https://drive.google.com/uc?id=14x9VlBRucvDCwBqS00xjOYWoYqVuMb3w'  width="40%" height="40%">

The transformations U, W and V are shared across all inputs $x$, in the same way that CNNs share one set of kernels across the whole image.

RNNs compute a new hidden state $h$ for each new input $x$:

$h^{(t)} = f(h^{(t-1)},x^{(t)},\theta)$
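As a minimal sketch of this recurrence, assume the common choice $f = \tanh(Ux + Wh + b)$ (one possible $f$, not the only one). The loop below applies the same `U`, `W` and `b` at every time step, so sequences of different lengths are handled by one set of parameters. The function name `rnn_forward` and all sizes are illustrative.

```python
import numpy as np

def rnn_forward(xs, U, W, b, h0=None):
    """Apply h_t = tanh(U x_t + W h_{t-1} + b) over xs of shape (T, input_size)."""
    h = np.zeros(W.shape[0]) if h0 is None else h0
    states = []
    for x_t in xs:                       # the same U, W, b are reused at every step
        h = np.tanh(U @ x_t + W @ h + b)
        states.append(h)
    return np.stack(states)              # shape (T, hidden_size)

rng = np.random.default_rng(0)
input_size, hidden_size = 2, 4           # illustrative sizes
U = rng.normal(size=(hidden_size, input_size))
W = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

short_seq = rng.normal(size=(3, input_size))
long_seq = rng.normal(size=(10, input_size))
h_short = rnn_forward(short_seq, U, W, b)
h_long = rnn_forward(long_seq, U, W, b)
print(h_short.shape, h_long.shape)
```

The number of parameters depends only on `input_size` and `hidden_size`, not on the sequence length.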

There are several RNN models. The most common are:

<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*B0q2ZLsUUw31eEImeVf3PQ.png"   width="60%" height="60%">

In this notebook we'll use the Gated Recurrent Unit or GRU (Chung et al. 2014) architecture to show how RNNs can be used in a forecasting task.


$r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\
z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{(t-1)} + b_{hz}) \\
n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)}+ b_{hn})) \\
h_t = (1 - z_t) * n_t + z_t * h_{(t-1)}$
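The four equations can be checked numerically against PyTorch's `nn.GRUCell`. The sketch below assumes the documented parameter layout, in which the reset, update and new-gate weights are stacked row-wise in that order, and it treats `*` as the element-wise product. The sizes and variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 2, 4
cell = nn.GRUCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)
h_prev = torch.randn(1, hidden_size)

# PyTorch stacks the three gates row-wise in the order (r, z, n).
W_ir, W_iz, W_in = cell.weight_ih.chunk(3, dim=0)
W_hr, W_hz, W_hn = cell.weight_hh.chunk(3, dim=0)
b_ir, b_iz, b_in = cell.bias_ih.chunk(3)
b_hr, b_hz, b_hn = cell.bias_hh.chunk(3)

with torch.no_grad():
    r_t = torch.sigmoid(x_t @ W_ir.T + b_ir + h_prev @ W_hr.T + b_hr)   # reset gate
    z_t = torch.sigmoid(x_t @ W_iz.T + b_iz + h_prev @ W_hz.T + b_hz)   # update gate
    n_t = torch.tanh(x_t @ W_in.T + b_in + r_t * (h_prev @ W_hn.T + b_hn))
    h_manual = (1 - z_t) * n_t + z_t * h_prev                           # new hidden state
    h_torch = cell(x_t, h_prev)

matches = torch.allclose(h_manual, h_torch, atol=1e-5)
print(matches)
```

When $z_t$ is close to 1 the cell copies $h_{(t-1)}$ forward almost unchanged, which is how the GRU keeps information over many steps.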

Let's see an example of a sequence of GPS measurements (2 features) being processed by an RNN with hidden size 4.

<img src='https://drive.google.com/uc?id=1EbcmMY7pvodQO_8a9mUNOu2ahggvDSEc'  width="60%" height="60%" align="left">

To further illustrate how RNNs can be used, we'll now use sea surface height (SSH) measurements extracted from the Santos Port Channel dataset. This dataset is a collection of measurements from several sensors, but we're only interested in SSH.

In [2]:
file_id = "1qZv6wwHLyMIZQQNQN8NN676AIgF-XJt5"  # renamed to avoid shadowing the built-in id()
gdown.download(id=file_id, output="santos_ssh.csv", quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1qZv6wwHLyMIZQQNQN8NN676AIgF-XJt5
To: C:\Caio\DevProjects\Python\academico\doutorado\disciplinas\PCS5024 - Aprendizado Estatistico\homework2\santos_ssh.csv
100%|███████████████████████████████████████████████| 792k/792k [00:00<00:00, 2.92MB/s]


'santos_ssh.csv'

In [3]:
df = pd.read_csv("santos_ssh.csv")
df["datetime"] = pd.to_datetime(df["datetime"])
train_df = df[df["datetime"].dt.tz_convert(None) < np.datetime64("2020-06-01 00:00:00")]
test_df = df[df["datetime"].dt.tz_convert(None) >= np.datetime64("2020-06-01 00:00:00")]
train_df.head(), test_df.head()
# ssh: tide level (sea surface height)

(                   datetime   ssh
 0 2020-01-01 00:00:00+00:00  0.70
 1 2020-01-01 00:10:00+00:00  0.69
 2 2020-01-01 00:20:00+00:00  0.68
 3 2020-01-01 00:30:00+00:00  0.67
 4 2020-01-01 00:40:00+00:00  0.67,
                        datetime   ssh
 21293 2020-06-01 00:00:00+00:00  0.90
 21294 2020-06-01 00:10:00+00:00  0.92
 21295 2020-06-01 00:20:00+00:00  0.94
 21296 2020-06-01 00:30:00+00:00  0.99
 21297 2020-06-01 00:40:00+00:00  1.01)

In [None]:
train_df["ssh"].iloc[:1000].plot()

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_size = 1
hidden_size = 1  # originally 64; the teacher forcing models below require 1
num_epochs = 15
past_len = 800
future_len = 100
batch_size = 32
learning_rate = 1e-3
device

device(type='cpu')

In [5]:
# Here we write a SimpleARModel that uses the network 
# output as the input for the next prediction step (i.e. no teacher forcing)
class SimpleARModel(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, input_size)

    def forward(self, x, target_seq_len):
        # encoding
        out, h_n = self.rnn(x)
        inp = self.linear(out[:, -1]).unsqueeze(1)

        output_seq = torch.empty(
            x.shape[0], target_seq_len, x.shape[-1], device=inp.device
        )

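        for i in range(target_seq_len):
            out, h_n = self.rnn(inp, h_n)
            # feed the prediction back in as the next input (autoregressive rollout)
            inp = self.linear(out[:, -1]).unsqueeze(1)
            output_seq[:, i] = inp.squeeze(1)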
        return output_seq


def create_sequences(data, past_len, future_len):
    xs, ys = [], []
    for i in range(past_len, len(data) - future_len):
        x = data[(i - past_len) : i]
        y = data[i : (i + future_len)]
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
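
A toy run makes the windowing concrete. The sketch below repeats `create_sequences` so that it is self-contained and applies it to ten consecutive integers; the toy array and the small window sizes are illustrative, not the notebook's settings.

```python
import numpy as np

def create_sequences(data, past_len, future_len):
    # same windowing as in the cell above
    xs, ys = [], []
    for i in range(past_len, len(data) - future_len):
        xs.append(data[(i - past_len) : i])
        ys.append(data[i : (i + future_len)])
    return np.array(xs), np.array(ys)

toy = np.arange(10)
X, y = create_sequences(toy, past_len=3, future_len=2)
print(X.shape, y.shape)   # (5, 3) (5, 2)
print(X[0], y[0])         # [0 1 2] [3 4]
print(X[-1], y[-1])       # [4 5 6] [7 8]
```

Each row of `X` is immediately followed in time by the matching row of `y`. Because the upper bound of `range` is exclusive, the last possible window (targets `[8 9]` in the toy case) is not generated. Assuming the training split holds 21293 rows, as the first test index suggests, `past_len = 800` and `future_len = 100` give 20393 windows, or 638 batches of 32.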

In [6]:
# Here we write another class, TeacherForcingModel, that uses the t-1 actual target value
# for the next prediction step (i.e. teacher forcing) from the second iteration on.
# Note: the value is passed to the GRU as its initial hidden state, so hidden_size must be 1.
class TeacherForcingModel(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, input_size)

    def forward(self, x, y, target_seq_len):
        # encoding
        out, h_n = self.rnn(x)
        inp = self.linear(out[:, -1]).unsqueeze(1)

        output_seq = torch.empty(
            x.shape[0], target_seq_len, x.shape[-1], device=inp.device
        )


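        for i in range(target_seq_len):
            # First step: last observed input value; later steps: previous actual target.
            prev = x[:, -1] if i == 0 else y[:, i - 1]
            # The second argument of nn.GRU is the initial hidden state, so `prev` is
            # injected as the hidden state while `inp` stays at the encoder's prediction.
            out, h_n = self.rnn(inp, prev.unsqueeze(0).contiguous())
            output_seq[:, i] = self.linear(out[:, -1])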

        return output_seq

In [7]:
# Here we write a third class, TeacherForcingBonusModel, that applies teacher forcing
# stochastically: from the second iteration on, each step uses the t-1 actual target value
# with probability tfr (0.5) and the last observed input value otherwise.
class TeacherForcingBonusModel(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, input_size)

    def forward(self, x, y, target_seq_len):
        # encoding
        out, h_n = self.rnn(x)
        inp = self.linear(out[:, -1]).unsqueeze(1)
        tfr = 0.5
        output_seq = torch.empty(
            x.shape[0], target_seq_len, x.shape[-1], device=inp.device
        )


        for i in range(target_seq_len):
            # Teacher-force with probability tfr (never on the first step);
            # otherwise fall back to the last observed input value.
            if i > 0 and random.random() < tfr:
                prev = y[:, i - 1]
            else:
                prev = x[:, -1]
            # As in TeacherForcingModel, `prev` is passed as the GRU hidden state.
            out, h_n = self.rnn(inp, prev.unsqueeze(0).contiguous())
            output_seq[:, i] = self.linear(out[:, -1])

        return output_seq

In [8]:
train_data = train_df["ssh"].values
test_data = test_df["ssh"].values

X_train, y_train = create_sequences(train_data, past_len, future_len)
X_test, y_test = create_sequences(test_data, past_len, future_len)

X_train = torch.from_numpy(X_train.astype(np.float32))
y_train = torch.from_numpy(y_train.astype(np.float32))
X_test = torch.from_numpy(X_test.astype(np.float32))
y_test = torch.from_numpy(y_test.astype(np.float32))

train_dataset = TensorDataset(X_train, y_train)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = TensorDataset(X_test, y_test)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [9]:
start_time = datetime.datetime.now()
#model = SimpleARModel(input_size, hidden_size).to(device)
#model = TeacherForcingModel(input_size,hidden_size).to(device)
model = TeacherForcingBonusModel(input_size,hidden_size).to(device)


criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    model.train()
    for inputs, targets in tqdm(train_dataloader):
        inputs = inputs.unsqueeze(-1).to(device)
        targets = targets.unsqueeze(-1).to(device)

        #outputs = model(inputs, targets.shape[1])
        outputs = model(inputs, targets,targets.shape[1])

        loss = criterion(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        for i, (inputs, targets) in enumerate(tqdm(test_dataloader)):
            inputs = inputs.unsqueeze(-1).to(device)
            targets = targets.unsqueeze(-1).to(device)

            #outputs = model(inputs, targets.shape[1])
            outputs = model(inputs, targets,targets.shape[1])
            test_loss = criterion(outputs, targets)
            past_view_size = 50
            if i in [0, 1, 2] and epoch % 3 == 0:
                inp = inputs[0, :, 0].cpu()
                tar = targets[0, :, 0].cpu()
                out = outputs[0, :, 0].cpu()
                uniplot.plot(
                    ys=[inp[-past_view_size:], tar, out],
                    xs=[
                        np.arange(0, past_view_size),
                        np.arange(past_view_size, past_view_size + len(tar)),
                        np.arange(past_view_size, past_view_size + len(tar)),
                    ],
                    color=True,
                    legend_labels=["Input", "Target", "Output"],
                    title=f"Epoch: {epoch + 1}, Test Loss: {test_loss.item():.4f} - Example {i}",
                    height=10
                )


[Training output, condensed: 15 epochs, each with 638 training batches (about 4 min per epoch, roughly 2.5 it/s) and 107 test batches (about 10 s). The terminal plots of Input, Target and Output were rendered with ANSI color codes and are truncated in this export; their surviving titles are listed below.]

Epoch: 1, Test Loss: 0.0865 - Example 0
Epoch: 4, Test Loss: 0.0896 - Example 0
Epoch: 7, Test Loss: 0.0965 - Example 0
Epoch: 10, Test Loss: 0.0841 - Example 0
Epoch: 13, Test Loss: 0.0694 - Example 0
Epoch: 13, Test Loss: 0.1162 - Example 2


In [10]:
end_time = datetime.datetime.now()
print(end_time - start_time)

1:05:54.770522


# The Challenge of Long-Term Dependencies

The basic problem is that gradients propagated over many stages tend to either vanish (most of the time) or explode (rarely, but with much damage to the optimization).

Recurrent networks involve the composition of the same function multiple times, once per time step.

In particular, the function composition employed by recurrent neural networks somewhat resembles matrix multiplication. We can think of the recurrence relation:

$h^{(t)} = W^Th^{(t-1)}$

as a very simple recurrent neural network lacking a nonlinear activation function and lacking inputs $x$.
This recurrence relation essentially describes the power method. It may be simplified to:

$h^{(t)} = (W^t)^Th^{(0)}$

and if W admits an eigendecomposition of the form:

$W = Q \Lambda Q^T$

with orthogonal $Q$, the recurrence may be simplified further to:

$h^{(t)} = Q^T \Lambda^{t} Q h^{(0)}$

The eigenvalues are raised to the power of $t$, causing eigenvalues with magnitude less than one to decay to zero and eigenvalues with magnitude greater than one to explode. Any component of $h^{(0)}$ that is not aligned with the eigenvector of the largest eigenvalue will eventually be discarded.
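A small numerical check of this behaviour, as a sketch under illustrative assumptions: a 2x2 symmetric $W$ is built from a rotation matrix `Q` and two hand-picked eigenvalues (0.9 and 1.1), and the linear recurrence is iterated for 50 steps. None of these values come from the notebook.

```python
import numpy as np

# Symmetric W with chosen eigenvalues: one below 1 (decays), one above 1 (grows).
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal matrix
eigvals = np.array([0.9, 1.1])
W = Q @ np.diag(eigvals) @ Q.T

h0 = np.array([1.0, 1.0])
h = h0.copy()
steps = 50
for _ in range(steps):
    h = W.T @ h                                   # h^(t) = W^T h^(t-1)

# Coordinates of the state along each eigenvector (the columns of Q).
coords_0 = Q.T @ h0
coords_t = Q.T @ h
print(coords_t / coords_0)                        # growth factor per eigen-direction
print(eigvals ** steps)                           # lambda^t for comparison
```

After 50 steps the component along the 0.9 eigenvector has shrunk to roughly 0.005 of its initial size, while the component along the 1.1 eigenvector has grown by a factor of roughly 117, so the state is dominated by a single direction.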

Several approaches have been proposed to tackle this. Namely:
- Adding Skip Connections through Time: One way to obtain coarse time scales is to add direct connections from variables in the distant past to variables in the present.
- Adding other controlled gates: This allows the network to accumulate information (such as evidence for a particular feature or category) over a long duration. Once that information has been used, however, it might be useful for the neural network to forget the old state.
- Gradient clipping: Prevents gradient explosion
- Many others...
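
Gradient clipping is the easiest of these to show in code. The sketch below uses `torch.nn.utils.clip_grad_norm_`, which rescales all gradients in place so that their global L2 norm does not exceed `max_norm`. The tiny linear model, the oversized inputs and `max_norm = 1.0` are illustrative choices made only to provoke a very large gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
x = torch.full((8, 3), 1000.0)            # large-magnitude inputs
target = torch.full((8, 1), 1e6)          # far-away target -> very large gradients

loss = nn.MSELoss()(model(x), target)
loss.backward()

max_norm = 1.0
# Returns the global gradient norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
print(f"before: {norm_before.item():.3e}, after: {norm_after.item():.3f}")
```

In a training loop like the one in this notebook, the call would sit between `loss.backward()` and `optimizer.step()`.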

# Exercise
A more comprehensive look into sequence modeling can be found at: https://www.deeplearningbook.org/contents/rnn.html

One of the strategies to train RNNs is called teacher forcing. During training, the network receives the actual label from the previous step, instead of its own generated output, as the input to the next step. In some networks this lets the gradient be propagated through fewer steps, and the network can then be moved gradually towards using its own outputs.

Exercise:
 - Implement the **teacher forcing** mechanism and compare the results.
 - It is described at '10.2.1 Teacher Forcing and Networks with Output Recurrence' section of the deeplearningbook.org book
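
For orientation, here is one common way to structure teacher forcing, as a sketch rather than a reference solution. The class name `TeacherForcedGRU`, the optional `y` argument and the sizes are hypothetical. The true previous value is fed as the decoder *input* when `y` is given, the hidden state is carried from step to step, and the model falls back to its own predictions when `y` is omitted (for example at test time, when future targets are unknown).

```python
import torch
import torch.nn as nn

class TeacherForcedGRU(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, input_size=1, hidden_size=16):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, input_size)

    def forward(self, x, target_seq_len, y=None):
        # Encode the observed past; h_n summarises it.
        out, h_n = self.rnn(x)
        pred = self.linear(out[:, -1]).unsqueeze(1)   # forecast for the first future step
        preds = [pred]
        for i in range(1, target_seq_len):
            # Teacher forcing: use the true previous value as the input when available,
            # otherwise feed back the model's own previous prediction.
            inp = y[:, i - 1 : i] if y is not None else pred
            out, h_n = self.rnn(inp, h_n)             # hidden state is carried, not replaced
            pred = self.linear(out[:, -1]).unsqueeze(1)
            preds.append(pred)
        return torch.cat(preds, dim=1)                # (batch, target_seq_len, input_size)

torch.manual_seed(0)
model = TeacherForcedGRU()
x = torch.randn(4, 20, 1)
y = torch.randn(4, 5, 1)

train_out = model(x, y.shape[1], y=y)     # training: targets guide the decoder inputs
with torch.no_grad():
    test_out = model(x, y.shape[1])       # evaluation: free-running, no access to targets
print(train_out.shape, test_out.shape)
```

Keeping `y` optional makes it explicit which forward passes see the targets, so a test loss can be computed from free-running forecasts only.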

## Answer

**Introduction**: 
- RNNs are a class of neural networks designed to __process sequences of related data__ by allowing previous outputs to be used as inputs while maintaining hidden states.
- Their main __advantages__ are that they can process inputs of any length, the model size does not grow with the input size, the computation takes historical information into account, and weights are shared across time.
- However, computation may be __slow, and they have difficulty accessing information from a long time ago__, which is related to the vanishing gradient problem. GRU and LSTM architectures are used to mitigate this.

**The Problem**: 
- In our example, a special kind of RNN architecture called GRU was used to predict the sea surface height (SSH) from measurements extracted from the Santos Port Channel dataset.
- The GRU, like the LSTM architecture, is an advanced version of the classical RNN. Their primary purpose is to retain long-term information.
- To improve RNN performance, the teacher forcing mechanism was used. It is one of the strategies used to train the so-called Encoder-Decoder model, which is a logical continuation of the RNN model and has provided state-of-the-art results in sequence-to-sequence multistep time series forecasting.

**More about _Teacher Forcing_**: 
- Teacher forcing adds external information (the actual target values) to the forward pass during training, before the loss is calculated.
- During teacher forcing training, the decoder RNN cell receives the actual previous target value at each step, as shown in the figure below:

![<caption>](teacher_forcing.jpg)

**The Implementation**: 
- In this implementation the actual previous target value is passed to the GRU cell in place of its hidden state, so the hidden_size parameter must be equal to **1** (the target dimension) instead of 64;
- A new class, TeacherForcingModel, was created. It uses the t-1 actual target value for the next prediction step (i.e. teacher forcing) from the second iteration on. Its forward method requires the target values (y) as an additional parameter;
- The training and evaluation loops now pass the targets to the model, calling _outputs = model(inputs, targets,targets.shape[1])_ instead of the previously used _outputs = model(inputs, targets.shape[1])_.

**The Scenarios**: 
- Both the teacher forcing model and the classical GRU model were run locally in a mini-conda environment and on Google Colab.
- To measure performance, three metrics were used: overall elapsed execution time, best test loss and time-to-fit.

**Results**: 
- As expected, the teacher forcing approach improved performance on all metrics in both environments.

| Training strategy | Environment | Execution time (h:mm:ss) | Best test loss |
| --- | --- | --- | --- |
| no teacher forcing | **Colab**, device(type='cpu') | 1:09:03.001693 | 0.0388 (Epoch 13, Example 2) |
| **with teacher forcing** | Colab, device(type='cpu') | 0:45:35.660404 | 0.0013 (Epoch 13, Example 0) |
| no teacher forcing | **mini-conda**, device(type='cpu') | 1:59:40.389969 | 0.0318 (Epoch 7, Example 0) |
| **with teacher forcing** | mini-conda, device(type='cpu') | 1:06:54.284348 | 0.0011 (Epoch 4, Example 0) |

- With teacher forcing, the Google Colab run took about 66% of the baseline's time (roughly 34% less) and reached a best test loss about 30 times lower. The mini-conda run was similar: about 56% of the baseline's time (roughly 44% less) and a best test loss about 29 times lower.
- Finally, visual inspection shows that with teacher forcing the model output already follows the shape of the actual values from the first epoch, whereas without it the model takes much longer to do so. See the figures below:

No teacher forcing training:

![<caption>](no_teacherforcing_training.jpg)

Teacher forcing training:

![<caption>](with_teacherforcing_training.jpg)
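
The timing and loss ratios quoted in the results can be recomputed from the table. In the sketch below the durations and losses are copied from the table, while the helper `parse` and the dictionary layout are illustrative.

```python
from datetime import timedelta

def parse(hms):
    """Parse 'H:MM:SS.ffffff' into a timedelta."""
    h, m, s = hms.split(":")
    return timedelta(hours=int(h), minutes=int(m), seconds=float(s))

runs = {
    "colab": {"no_tf": ("1:09:03.001693", 0.0388), "tf": ("0:45:35.660404", 0.0013)},
    "mini-conda": {"no_tf": ("1:59:40.389969", 0.0318), "tf": ("1:06:54.284348", 0.0011)},
}

summary = {}
for env, r in runs.items():
    t_no, loss_no = parse(r["no_tf"][0]), r["no_tf"][1]
    t_tf, loss_tf = parse(r["tf"][0]), r["tf"][1]
    time_ratio = t_tf / t_no              # fraction of the baseline run time
    loss_ratio = loss_no / loss_tf        # how many times lower the best test loss is
    summary[env] = (time_ratio, loss_ratio)
    print(f"{env}: time ratio {time_ratio:.2f} "
          f"({1 - time_ratio:.0%} less time), loss ratio {loss_ratio:.1f}x")
```

The time ratios come out near 0.66 and 0.56, i.e. roughly 34% and 44% less run time, and the loss ratios near 30 and 29.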

**References:**

- Article: The Unreasonable Effectiveness of Recurrent Neural Networks, available at: https://karpathy.github.io/2015/05/21/rnn-effectiveness/
- Book: Deep Learning, by Goodfellow, Bengio and Courville (MIT Press), available at: https://www.deeplearningbook.org/
- Book: Time Series Forecasting using Deep Learning: Combining PyTorch, RNN, TCN, and Deep Neural Network Models to Provide Production-Ready Prediction Solutions (English Edition)
- PyTorch Tutorial, by Patrick Loeber, available at: https://www.youtube.com/watch?v=EMXfZB8FVUA&list=PLqnslRFeH2UrcDBWF5mfPGpqQDSta6VK4

In [9]:
models = {t:TeacherForcingModel(input_size, hidden_size).to(device) for t in [True,False]}

In [10]:
test_losses = {t:[] for t in models.keys()}

In [15]:
for use_encoder_loss,model in models.items():
    print(use_encoder_loss)

True
False


In [31]:
hidden_size_TF = 1  # assumed from the printed modules below (GRU(1, 1)); not defined in the cells shown
models = [
    SimpleARModel(input_size, hidden_size).to(device),
    TeacherForcingModel(input_size, hidden_size_TF).to(device),
    TeacherForcingBonusModel(input_size, hidden_size_TF).to(device),
]
for model in models:
    if not isinstance(model, SimpleARModel):
        print(model)

TeacherForcingModel(
  (rnn): GRU(1, 1, batch_first=True)
  (linear): Linear(in_features=1, out_features=1, bias=True)
)
TeacherForcingBonusModel(
  (rnn): GRU(1, 1, batch_first=True)
  (linear): Linear(in_features=1, out_features=1, bias=True)
)
