# Recurrent Neural Network and Multi-Head Attention (MHA)

In this task, you will implement a conventional RNN cell, a GRU, and an MHA layer to understand these models. Then you will configure the GRU in special ways so that it either recovers a conventional RNN or retains its memory over the long term. NOTE: you should not change the provided function interfaces or test cases.

In [1]:
from google.colab import drive
drive.mount('/content/drive')
import sys
sys.path.append('/content/drive/MyDrive/cs137assignments/assignment4')

Mounted at /content/drive


In [19]:
# As usual, a bit of setup
import time
import numpy as np
import torch
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
%autosave 180

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Autosaving every 180 seconds


## Recurrent Neural Networks

In this task, you will implement the forward calculation of recurrent neural networks. Let's first set up an example problem for RNNs.

In [20]:
import torch.nn as nn
## Setup an example. Provide sizes and the input data. 

# set sizes 
time_steps = 12
batch_size = 4
input_size = 3
hidden_size = 2

# create input data with shape [batch_size, time_steps, input_size]
np.random.seed(137)
torch.manual_seed(137)  # torch.randn draws from torch's generator, not NumPy's
input_data = torch.randn(batch_size, time_steps, input_size, dtype = torch.float32)
## Create RNN layers

# initialize a random state shared by the RNN and the GRU
# `initial_state` has shape [1, batch_size, hidden_size]; the leading axis is num_layers
initial_state = torch.randn(batch_size, hidden_size, dtype = torch.float32).unsqueeze(0)

### Implement an RNN and a GRU with PyTorch

In [21]:
# create an RNN with only one layer from torch
t_rnn = nn.RNN(input_size, hidden_size, num_layers = 1, batch_first = True)

# 'outputs' is a tensor of shape [batch_size, time_steps, hidden_size]
# RNN cell outputs the hidden state directly, so the output at each step is the hidden state at that step
# final_state has shape [1, batch_size, hidden_size] and holds the last state: final_state[0] == outputs[:, -1, :]

# create a GRU RNN
t_gru = nn.GRU(input_size, hidden_size, num_layers = 1, batch_first = True)

with torch.no_grad():
    t_rnn_outputs, t_rnn_final_state = t_rnn(input_data, initial_state)
    # `outputs` and `final_state` are related in the same way for a GRU.
    t_gru_outputs, t_gru_final_state = t_gru(input_data, initial_state)

### Read out parameters from RNN and GRU cells

**Q1 (0 points)** Understanding `RNN` and `GRU` parameters

Please read the code and documentation of `get_rnn_params` and `get_gru_params` to see how to read out parameters from these two models. You will need to use these parameters in your own implementations. NO implementation is needed here.
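For orientation, the `torch.nn.GRU` documentation describes the input weights of a single-layer GRU (`weight_ih_l0`) as one matrix of shape `(3 * hidden_size, input_size)` with the reset, update, and new-gate blocks stacked in that order. The NumPy sketch below uses a stand-in array rather than a real layer to show how such a stacked matrix splits into per-gate blocks. The helper name `split_gru_weight` is hypothetical, and `get_gru_params` may organize its return values differently.

```python
import numpy as np

def split_gru_weight(stacked, hidden_size):
    """Split a stacked (3 * hidden_size, k) GRU matrix into its r, z, n blocks."""
    assert stacked.shape[0] == 3 * hidden_size
    w_r, w_z, w_n = np.split(stacked, 3, axis=0)
    return w_r, w_z, w_n

input_size, hidden_size = 3, 2
# Stand-in for a stacked GRU weight such as weight_ih_l0; the values are arbitrary.
stacked = np.arange(3 * hidden_size * input_size, dtype=np.float32)
stacked = stacked.reshape(3 * hidden_size, input_size)

w_r, w_z, w_n = split_gru_weight(stacked, hidden_size)
print(w_r.shape, w_z.shape, w_n.shape)
```

The same split applies to the recurrent weights and to both bias vectors, which is why the later cells rebuild the layer parameters by concatenating the r, z, and n pieces in that order.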


In [22]:
from rnn_param_helper import get_rnn_params, get_gru_params

wt_h, wt_x, bias = get_rnn_params(t_rnn)

# NOTE: please check the documentation of `torch.nn.GRU` and the implementation of `get_gru_params` to
# understand the three return values.

linear_trans_r, linear_trans_z, linear_trans_n = get_gru_params(t_gru)

### NumPy Implementation
**Q2 (3 points)** Please implement your own simple RNN. 

Your implementation needs to match the PyTorch calculation.
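As a point of reference, a tanh RNN applies the recurrence h_t = tanh(x_t W_x + h_{t-1} W_h + b) at every time step. The sketch below is a minimal NumPy version under assumed conventions: the function name `rnn_forward_sketch`, the argument order, and the `[input, hidden]` weight layout are illustrative choices and need not match the interface of `rnn` in `implementation.py`.

```python
import numpy as np

def rnn_forward_sketch(w_x, w_h, b, h0, x):
    """Vanilla tanh RNN (assumed layout, not the assignment's interface).

    w_x: [input_size, hidden_size], w_h: [hidden_size, hidden_size],
    b: [hidden_size], h0: [batch, hidden_size], x: [batch, time, input_size].
    """
    h = h0
    outputs = []
    for t in range(x.shape[1]):
        # The new state depends on the current input and the previous state.
        h = np.tanh(x[:, t, :] @ w_x + h @ w_h + b)
        outputs.append(h)
    # Stack along the time axis: [batch, time, hidden_size].
    return np.stack(outputs, axis=1), h

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 12, 3))
h0 = np.zeros((4, 2))
outputs, final = rnn_forward_sketch(
    rng.standard_normal((3, 2)), rng.standard_normal((2, 2)), np.zeros(2), h0, x
)
print(outputs.shape, final.shape)  # (4, 12, 2) (4, 2)
```

When matching `torch.nn.RNN`, keep in mind that PyTorch stores weights as `[hidden_size, input_size]` and keeps two bias vectors whose sum plays the role of `b` here, and that its state tensors carry a leading layer axis.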

**Q3 (5 points)** Please implement your own GRU. 

Your implementation needs to match the PyTorch calculation.
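The gate equations in the `torch.nn.GRU` documentation are r = σ(W_ir x + b_ir + W_hr h + b_hr), z = σ(W_iz x + b_iz + W_hz h + b_hz), n = tanh(W_in x + b_in + r ⊙ (W_hn h + b_hn)), and h' = (1 − z) ⊙ n + z ⊙ h. The sketch below writes one step of that update in NumPy. The function name `gru_step_sketch` and the dictionary of parameters are hypothetical conveniences, not the argument structure of `gru` in `implementation.py`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step_sketch(p, h, x):
    """One GRU step following the gate equations in the torch.nn.GRU docs.

    p maps names to arrays: W_i* are [hidden, input], W_h* are [hidden, hidden],
    b_* are [hidden]. h: [batch, hidden], x: [batch, input].
    """
    r = sigmoid(x @ p["W_ir"].T + p["b_ir"] + h @ p["W_hr"].T + p["b_hr"])
    z = sigmoid(x @ p["W_iz"].T + p["b_iz"] + h @ p["W_hz"].T + p["b_hz"])
    # The reset gate scales only the recurrent part of the candidate state.
    n = np.tanh(x @ p["W_in"].T + p["b_in"] + r * (h @ p["W_hn"].T + p["b_hn"]))
    return (1.0 - z) * n + z * h

rng = np.random.default_rng(0)
n_in, n_hid = 3, 2
p = {}
for gate in "rzn":
    p["W_i" + gate] = rng.standard_normal((n_hid, n_in))
    p["W_h" + gate] = rng.standard_normal((n_hid, n_hid))
    p["b_i" + gate] = rng.standard_normal(n_hid)
    p["b_h" + gate] = rng.standard_normal(n_hid)

x = rng.standard_normal((4, 12, n_in))
h = np.zeros((4, n_hid))
outputs = []
for t in range(x.shape[1]):
    h = gru_step_sketch(p, h, x[:, t, :])
    outputs.append(h)
outputs = np.stack(outputs, axis=1)
print(outputs.shape)  # (4, 12, 2)
```

Note that the bias `b_hn` sits inside the product with `r`, so it cannot simply be merged with `b_in`; that detail is a common source of small mismatches against the PyTorch output.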

In [23]:
from implementation import rnn,gru

# calculation from your own implementation of a basic RNN
nprnn_outputs, nprnn_final_state = rnn(wt_h, wt_x, bias, initial_state.numpy(), input_data.numpy())

print("Difference between your RNN implementation and torch RNN", 
      rel_error(t_rnn_outputs.numpy(), nprnn_outputs) + rel_error(t_rnn_final_state.numpy(), nprnn_final_state))

# calculation from your own implementation of a GRU
npgru_outputs, npgru_final_state = gru(linear_trans_r, linear_trans_z, linear_trans_n, initial_state.numpy(), input_data.numpy())

print("Difference between your GRU implementation and torch GRU", 
      rel_error(t_gru_outputs.numpy(), npgru_outputs) + rel_error(t_gru_final_state.numpy(), npgru_final_state))


Difference between your RNN implementation and torch RNN 4.474195e-07
Difference between your GRU implementation and torch GRU 4.826031e-07


### GRU includes RNN as a special case
**Q4 (2 points)** Can you assign a special set of parameters to the GRU such that its outputs are almost the same as those of the RNN?
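One way to see why this is possible: in h' = (1 − z) ⊙ n + z ⊙ h, an update gate z near 0 makes the new state equal to the candidate n, and a reset gate r near 1 turns n into an ordinary tanh RNN update. The sketch below checks this numerically for a single step by using zero gate weights and saturating gate biases. The bias magnitude 30 and all variable names are arbitrary illustrative choices, and this is not necessarily how `init_gru_with_rnn` is meant to be written.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
batch, n_in, n_hid = 4, 3, 2
x = rng.standard_normal((batch, n_in))
h = rng.uniform(-1.0, 1.0, size=(batch, n_hid))

# RNN parameters to reproduce (arbitrary values, PyTorch-style [out, in] layout).
w_x = rng.standard_normal((n_hid, n_in))
w_h = rng.standard_normal((n_hid, n_hid))
b = rng.standard_normal(n_hid)
rnn_h = np.tanh(x @ w_x.T + h @ w_h.T + b)

# GRU gates with zero weights and saturating biases.
big = 30.0
r = sigmoid(np.zeros((batch, n_hid)) + big)   # reset gate close to 1
z = sigmoid(np.zeros((batch, n_hid)) - big)   # update gate close to 0
# The candidate state reuses the RNN weights; the RNN bias goes into b_in, b_hn is zero.
n = np.tanh(x @ w_x.T + b + r * (h @ w_h.T))
gru_h = (1.0 - z) * n + z * h

print(np.max(np.abs(gru_h - rnn_h)))
```

The gates never saturate exactly, so the two results agree only up to a small numerical gap, which is consistent with the question asking for outputs that are "almost the same".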

In [24]:
# Assign some value to a parameter of GRU

from implementation import init_gru_with_rnn

linear_trans_r, linear_trans_z, linear_trans_n = init_gru_with_rnn(wt_h, wt_x, bias)

# concatenate these parameters to initialize GRU kernels
kernel_init = np.concatenate([linear_trans_r[0], linear_trans_z[0], linear_trans_n[0]], axis=1).T
rec_kernel_init = np.concatenate([linear_trans_r[2], linear_trans_z[2], linear_trans_n[2]], axis=1).T
bias_init0 = np.concatenate([linear_trans_r[1], linear_trans_z[1], linear_trans_n[1]], axis=0)
bias_init1 = np.concatenate([linear_trans_r[3], linear_trans_z[3], linear_trans_n[3]])

grurnn = nn.GRU(input_size, hidden_size, num_layers = 1, batch_first = True)
wt_x1, wt_h1, bias_ih1, bias_hh1 = grurnn._flat_weights

wt_x1.data = torch.tensor(kernel_init, dtype =torch.float32)
wt_h1.data = torch.tensor(rec_kernel_init, dtype = torch.float32)
bias_ih1.data = torch.tensor(bias_init0, dtype = torch.float32)
bias_hh1.data = torch.tensor(bias_init1, dtype = torch.float32)


# 'outputs' is a tensor of shape [batch_size, time_steps, hidden_size]
# Same as the basic RNN cell, final_state[0] == outputs[:, -1, :]
with torch.no_grad():
    t_rnn_outputs, t_rnn_final_state = t_rnn(input_data, initial_state)
    grurnn_outputs, grurnn_final_state = grurnn(input_data, initial_state)

# they are the same as the calculation from the basic RNN
print("Difference between RNN and a special GRU", rel_error(t_rnn_outputs.numpy(), grurnn_outputs.numpy()))

Difference between RNN and a special GRU 4.316596e-07


## Long-term dependency in GRUs


**Q5 (2 points)** Can you set the GRU parameters so that it keeps the initial state in memory over many time steps?
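The update rule h' = (1 − z) ⊙ n + z ⊙ h suggests one route: if the update gate z saturates at 1, the candidate n is ignored and the state is copied forward unchanged. The sketch below illustrates this in NumPy with a large positive bias on the update gate. The value 100 and the variable names are assumptions made for illustration, and `init_gru_with_long_term_memory` may be intended to achieve the effect differently.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
batch, steps, n_in, n_hid = 4, 12, 3, 2
x = rng.standard_normal((batch, steps, n_in))
h0 = rng.standard_normal((batch, n_hid))

# Candidate-state weights are arbitrary; they stop mattering once z saturates.
w_in = rng.standard_normal((n_hid, n_in))
w_hn = rng.standard_normal((n_hid, n_hid))

z_bias = 100.0  # arbitrary large bias that pushes the update gate to 1
h = h0
for t in range(steps):
    z = sigmoid(np.full((batch, n_hid), z_bias))
    r = sigmoid(np.zeros((batch, n_hid)))  # reset gate at 0.5; its value is irrelevant here
    n = np.tanh(x[:, t, :] @ w_in.T + r * (h @ w_hn.T))
    h = (1.0 - z) * n + z * h

print(np.mean(np.abs(h - h0)))
```

Because the sigmoid rounds to exactly 1 for a large enough bias in floating point, the difference can come out as exactly zero rather than merely small.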

In [25]:
from implementation import init_gru_with_long_term_memory

linear_trans_r, linear_trans_z, linear_trans_n = init_gru_with_long_term_memory(input_size, hidden_size)

# concatenate these parameters to initialize GRU kernels
kernel_init = np.concatenate([linear_trans_r[0], linear_trans_z[0], linear_trans_n[0]], axis=1).T
rec_kernel_init = np.concatenate([linear_trans_r[2], linear_trans_z[2], linear_trans_n[2]], axis=1).T
bias_init0 = np.concatenate([linear_trans_r[1], linear_trans_z[1], linear_trans_n[1]], axis=0)
bias_init1 = np.concatenate([linear_trans_r[3], linear_trans_z[3], linear_trans_n[3]])

gru2 = nn.GRU(input_size, hidden_size, num_layers=1, batch_first=True)
wt_xg, wt_hg, bias_ihg, bias_hhg = gru2._flat_weights

wt_xg.data = torch.tensor(kernel_init, dtype = torch.float32)
wt_hg.data = torch.tensor(rec_kernel_init, dtype = torch.float32)
bias_ihg.data = torch.tensor(bias_init0, dtype = torch.float32)
bias_hhg.data = torch.tensor(bias_init1, dtype = torch.float32)

with torch.no_grad():
    outputs, _ = gru2(input_data, initial_state)
    outputs = outputs.numpy()
    print('Difference between a later hidden state and the initial state is', np.mean(np.abs(outputs[:, 10, :] - initial_state[0, :, :].numpy())))
    

Difference between a later hidden state and the initial state is 0.0


# Implement a multi-head attention layer
**Q6 (5 points)** In this task, you need to implement the forward calculation of a multi-head attention layer. Your calculation needs to match the calculation of the torch MHA layer in the following test case.
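For reference, bias-free multi-head self-attention projects the input into queries, keys, and values, splits each into heads, applies softmax(Q Kᵀ / √d_head) V per head, concatenates the heads, and applies an output projection. The sketch below is a minimal NumPy version under assumed conventions: the name `mha_sketch`, the explicit `num_heads` argument, and weights applied as `x @ w` are illustrative, and whether the matrices returned by `get_mha_params` need transposing is not shown in this notebook.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def mha_sketch(w_q, w_k, w_v, w_o, x, num_heads):
    """Self-attention without biases, masks, or dropout.

    w_q, w_k, w_v, w_o: [embed, embed], applied as x @ w; x: [batch, time, embed].
    """
    batch, time, embed = x.shape
    d_head = embed // num_heads

    def split_heads(a):
        # [batch, time, embed] -> [batch, heads, time, d_head]
        return a.reshape(batch, time, num_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)  # [batch, heads, time, time]
    context = softmax(scores) @ v                           # [batch, heads, time, d_head]
    # Merge the heads back: [batch, time, embed].
    merged = context.transpose(0, 2, 1, 3).reshape(batch, time, embed)
    return merged @ w_o

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 10))
w_q, w_k, w_v, w_o = (rng.standard_normal((10, 10)) for _ in range(4))
out = mha_sketch(w_q, w_k, w_v, w_o, x, num_heads=5)
print(out.shape)  # (4, 8, 10)
```

A quick sanity check for any implementation: with zero query and key weights the attention is uniform, so identity value and output projections should return the time average of the input at every position.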

In [26]:
from rnn_param_helper import get_mha_params
from implementation import mha


batch_size = 4
time_steps = 8
input_size = 10
num_heads = 5

input_data = torch.randn(batch_size, time_steps, input_size, dtype = torch.float32)


# run torch implementation of MHA
with torch.no_grad():

    t_mha = nn.MultiheadAttention(embed_dim=input_size, num_heads=num_heads, dropout=0.0, bias=False, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None, batch_first=True)

    t_output, _ = t_mha(input_data, input_data, input_data, need_weights=False)


# extract model parameters from the torch MHA layer
Wq, Wk, Wv, Wo = get_mha_params(t_mha)

# run the same calculation with your implementation
output = mha(Wq, Wk, Wv, Wo, input_data )

print('Difference between my output and torch output is ', np.mean(np.abs(output - t_output.numpy())))



Difference between my output and torch output is  1.7877756e-08
