# The Remembering Problem

The "remembering problem" is a variant of the "adding problem" proposed by Schmidhuber and colleagues as an example of a sequential task that LSTMs are particularly well suited for: http://people.idsia.ch/~juergen/nipslstm/node4.html


![frequent words](https://blogs.sas.com/content/sastraining/files/2015/11/word_frequency.png)

Data source: http://norvig.com/google-books-common-words.txt


All methods will be compared using MSE on a held out test set. 
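
For reference, mean squared error averages the squared differences between predictions and targets. The following is a minimal sketch in plain Python; the helper name `mse` and the sample values are illustrative assumptions and not part of the notebook.

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length sequences of numbers."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only: targets stand in for repeat counts,
# predictions stand in for model outputs.
example_targets = [2.0, 3.0, 1.0, 2.0]
example_predictions = [2.0, 2.0, 1.0, 4.0]
example_mse = mse(example_predictions, example_targets)
print(example_mse)  # (0 + 1 + 0 + 4) / 4 = 1.25
```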

In [2]:
!pip install torch

Collecting torch
  Downloading https://files.pythonhosted.org/packages/49/0e/e382bcf1a6ae8225f50b99cc26effa2d4cc6d66975ccf3fa9590efcbedce/torch-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (519.5MB)
Installing collected packages: torch
Successfully installed torch-0.4.1


In [1]:
import numpy as np
import pandas as pd
import random
import string
import collections


import matplotlib.pylab as plt
import seaborn as sns;
%matplotlib inline

In [2]:
import sys
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader

In [3]:
# use CUDA or not
use_cuda = False
if use_cuda and torch.cuda.is_available():
  print("using cuda!")
  device = torch.device("cuda")
else:
  print("using CPU!")
  device = torch.device("cpu")

using CPU!


## Data loading functions

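We will define some helper functions to generate our datasets. `generate_char_seq` will generate a single sequence of random lowercase letters, whereas `get_set` returns multiple sequences (so a *dataset* of sequences). Each sequence is encoded as integer indices, and its target is the number of times its most frequent character appears.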



In [4]:
def generate_char_seq(seq_len = 10):
  return ''.join(random.choice(string.ascii_lowercase) for _ 
                 in range(seq_len))

def get_set(num_examples = 100, seq_len = 10):
  one_hot_encoded = {}
  for i, char in enumerate(list(string.ascii_lowercase)):
    one_hot_encoded[char] = i

  X_seqs = []
  num_repeats = []

  for _ in range(num_examples):
    seq_example = generate_char_seq(seq_len)
    X_seqs.append([one_hot_encoded[char] for char in list(seq_example)])
    num_repeats.append(collections.Counter(seq_example).most_common(1)[0][1])
    
  return np.array(X_seqs), np.array(num_repeats)  

Let's see `get_set` in action:

In [5]:
X_train, y_train = get_set(num_examples=100, seq_len = 3)
X_train.shape

(100, 3)

So the input is a 2D array of shape "num examples" x "sequence length".
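
To make the task concrete, here is a small sketch of how one string becomes an input row and a target, mirroring the logic in `get_set`. The helper name `encode_and_count` and the example string are illustrative assumptions.

```python
import collections
import string

# Map each lowercase letter to an integer index, as get_set does.
char_to_index = {char: i for i, char in enumerate(string.ascii_lowercase)}


def encode_and_count(sequence):
    """Return (integer-encoded sequence, count of its most frequent character)."""
    encoded = [char_to_index[char] for char in sequence]
    most_common_count = collections.Counter(sequence).most_common(1)[0][1]
    return encoded, most_common_count


encoded, target = encode_and_count("abcab")
print(encoded)  # [0, 1, 2, 0, 1]
print(target)   # 2, because 'a' and 'b' each appear twice
```

Only the count is kept as the target, not the character itself, so ties between characters do not affect the label.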

Note that the datasets that `get_set` returns are NumPy arrays, but PyTorch requires PyTorch tensors. We could of course convert these NumPy arrays to PyTorch tensors, and then do some bookkeeping with indices to keep track of the different batches when doing batch updates on the network.

But that is tedious, and PyTorch offers the `Dataset` class that we can inherit from to handle this bookkeeping for us. Below we define the `SequenceDataset` class that will be used for all our data handling in PyTorch.

In [6]:
class SequenceDataset(Dataset):
  
  def __init__(self, num_examples, seq_len):
    self.num_examples = num_examples
    self.seq_len = seq_len
    
    X, y = get_set(num_examples=self.num_examples, seq_len = self.seq_len)
    self.X = torch.LongTensor(X)
    self.y = torch.from_numpy(y).float()
    if use_cuda and torch.cuda.is_available():
      self.X = self.X.to(device)
      self.y = self.y.to(device)
    
    
    
  def __getitem__(self, index):
    return self.X[index], self.y[index]
  
  def __len__(self):
    return self.num_examples

  

Let's create a training set and a test set with 100 examples each and a sequence length of 10.

In [7]:
train_set = SequenceDataset(num_examples=100, seq_len = 10)
test_set = SequenceDataset(num_examples=100, seq_len = 10)



We can use PyTorch's `DataLoader` to specify the batches of data to load for training. Note that the 100 example sequences are independent of each other, so we also shuffle their order.


In [8]:
batch_size = 32

train_loader = DataLoader(dataset = train_set,
                          batch_size=batch_size,
                          shuffle = True)

test_loader = DataLoader(dataset = test_set,
                         batch_size=batch_size,
                         shuffle = True)
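
As a quick check of what a loader like this yields, the sketch below iterates over one. It uses `TensorDataset` with random stand-in data instead of `SequenceDataset` so that it runs on its own; the names `toy_set` and `toy_loader` are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data with the same shapes as the notebook's sets:
# 100 integer sequences of length 10 and one float target per sequence.
X = torch.randint(0, 26, (100, 10))
y = torch.rand(100)
toy_set = TensorDataset(X, y)

toy_loader = DataLoader(dataset=toy_set, batch_size=32, shuffle=True)

# Collect the shape of every batch produced in one pass over the data.
batch_shapes = [(sequences.shape, targets.shape) for sequences, targets in toy_loader]
for seq_shape, target_shape in batch_shapes:
    print(tuple(seq_shape), tuple(target_shape))
# Three batches of 32 sequences and a final batch of 4 (100 = 3 * 32 + 4).
```

Because 100 is not a multiple of 32, the last batch is smaller than the others. Passing `drop_last=True` to `DataLoader` would discard it instead.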

## RNN

We will start solving the Remembering Problem with a simple RNN (the *Elman Network*). The network will update its internal hidden state for every element in the sequence until we reach the end. When we reach the end, we pass the final hidden state through a fully connected linear layer to predict the target. This type of architecture is sometimes called *many-to-one* since we are taking "many" elements (a sequence) to a single element (the target).

<center>
![Many to one](https://i.stack.imgur.com/QCnpU.jpg)
</center>
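
The many-to-one recurrence described above can be sketched by hand and compared against `nn.RNN`. This is a minimal illustration under assumed toy dimensions, not the model used in this notebook; it relies on the parameter names `weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0` and `bias_hh_l0` that `nn.RNN` exposes for a single-layer network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy dimensions chosen only for this sketch.
batch_size, seq_len, input_dim, hidden_dim = 4, 6, 5, 8
rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, 1)

x = torch.randn(batch_size, seq_len, input_dim)

with torch.no_grad():
    # Manual recurrence: h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh)
    h = torch.zeros(batch_size, hidden_dim)
    for t in range(seq_len):
        h = torch.tanh(
            x[:, t, :] @ rnn.weight_ih_l0.t() + rnn.bias_ih_l0
            + h @ rnn.weight_hh_l0.t() + rnn.bias_hh_l0
        )

    # Library version: the second return value is the final hidden state,
    # with shape (1, batch_size, hidden_dim).
    _, h_final = rnn(x)

    # Many-to-one: only the final hidden state feeds the linear layer.
    prediction = fc(h).squeeze(-1)  # shape (batch_size,)

print(torch.allclose(h, h_final[0], atol=1e-5))
print(tuple(prediction.shape))
```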

In [9]:
class RNNRemember(nn.Module):

    def __init__(self, hidden_size, embedding_size, input_size):    
        super(RNNRemember, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embedding_size = embedding_size
        self.embedding = nn.Embedding(input_size, self.embedding_size)
        self.rnn = nn.RNN(input_size=embedding_size,
                          hidden_size=self.hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # Initialize the hidden state on the same device as the input
        # (num_layers * num_directions, batch, hidden_size)
        h_0 = torch.zeros(1, x.size(0), self.hidden_size, device=x.device)

        emb = self.embedding(x)
        # Propagate embedding through RNN
        # Input: (batch, seq_len, embedding_size)
        # h_0: (num_layers * num_directions, batch, hidden_size)
        _, h_f = self.rnn(emb, h_0)
        # h_f: (1, batch, hidden_size) -> prediction of shape (batch,)
        return self.fc(h_f).squeeze()

In [0]:
rnn_remember = RNNRemember(hidden_size = 50, embedding_size = 5, 
                           input_size = len(string.ascii_lowercase))

if use_cuda and torch.cuda.is_available():
    rnn_remember = rnn_remember.cuda(device)

In [0]:
# Set loss and optimizer function
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(rnn_remember.parameters(), lr=0.01)

In [18]:
%%time
num_epochs = 1000
for epoch in range(num_epochs):
  for i, (sequences, targets) in enumerate(train_loader):
    if use_cuda and torch.cuda.is_available():
      sequences = sequences.to(device)
      targets = targets.to(device)

    
    # forward pass
    outputs = rnn_remember(sequences)
    loss = criterion(outputs, targets)
    
    # update weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
  if (epoch+1)%100 == 0:
    print("loss is", loss.item())

loss is 2.297874743817374e-05
loss is 0.0032314001582562923
loss is 0.0002122371079167351
loss is 0.0012824522564187646
loss is 0.006178750656545162
loss is 3.2399691463069757e-06
loss is 2.0392006263136864e-05
loss is 0.00036697823088616133
loss is 5.5320295359706506e-05
loss is 3.730580647243187e-05
CPU times: user 12.2 s, sys: 1.13 s, total: 13.3 s
Wall time: 13.5 s


In [20]:
with torch.no_grad():
  outputs = rnn_remember(test_set.X)
  test_mse = torch.mean((outputs - test_set.y)**2)
print(test_mse.item())

0.724921464920044


## LSTM

RNNs suffer from the vanishing gradient problem because the final hidden state is the result of repeatedly multiplying the state every time a new element arrives in the sequence. LSTMs bypass this challenge by updating their cell state additively. As a result, gradients propagate much more easily and longer memories can persist. Below is an `LSTMRemember` that is nearly identical to the `RNNRemember`.
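
To show what "updating state additively" means in practice, the following sketch unrolls a single-layer `nn.LSTM` by hand and checks the result against the library. The toy dimensions are assumptions made for illustration, and the gate ordering (input, forget, cell candidate, output) follows how PyTorch stacks its LSTM weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy dimensions chosen only for this sketch.
batch_size, seq_len, input_dim, hidden_dim = 4, 6, 5, 8
lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim, batch_first=True)
x = torch.randn(batch_size, seq_len, input_dim)

with torch.no_grad():
    h = torch.zeros(batch_size, hidden_dim)
    c = torch.zeros(batch_size, hidden_dim)
    for t in range(seq_len):
        gates = (
            x[:, t, :] @ lstm.weight_ih_l0.t() + lstm.bias_ih_l0
            + h @ lstm.weight_hh_l0.t() + lstm.bias_hh_l0
        )
        # Split the stacked pre-activations into the four gates.
        i, f, g, o = gates.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        # Additive cell update: the old cell state is scaled elementwise by the
        # forget gate and new information is added to it.
        c = f * c + i * g
        h = o * torch.tanh(c)

    # Library version for comparison.
    _, (h_final, c_final) = lstm(x)

print(torch.allclose(h, h_final[0], atol=1e-5))
print(torch.allclose(c, c_final[0], atol=1e-5))
```

The line `c = f * c + i * g` is the key difference from the Elman update: the previous cell state is carried forward by an elementwise gate rather than being squashed through a weight matrix and `tanh` at every step.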



In [0]:
class LSTMRemember(nn.Module):

    def __init__(self, hidden_size, input_size, embedding_size):    
        super(LSTMRemember, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embedding_size = embedding_size
        self.embedding = nn.Embedding(input_size, self.embedding_size)
        self.lstm = nn.LSTM(input_size=self.embedding_size,
                          hidden_size=self.hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # Initialize hidden and cell states
        # (num_layers * num_directions, batch, hidden_size)
        h_0 = Variable(torch.zeros(1, x.size(0), self.hidden_size))
        c_0 = Variable(torch.zeros(1, x.size(0), self.hidden_size))
        if use_cuda and torch.cuda.is_available():
          h_0 = h_0.to(device)
          c_0 = c_0.to(device)
          
                  
        emb = self.embedding(x)
        # Propagate input through LSTM
        # Input: (batch, seq_len, embedding_size)
        # h_0: (num_layers * num_directions, batch, hidden_size)
        _, (h_f, c_f) = self.lstm(emb, (h_0, c_0))
        return self.fc(h_f).squeeze()


In [0]:
lstm_remember = LSTMRemember(hidden_size = 50, embedding_size = 5, 
                           input_size = len(string.ascii_lowercase))
if use_cuda and torch.cuda.is_available():
    lstm_remember = lstm_remember.cuda(device)

In [0]:
# Set loss and optimizer function
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(lstm_remember.parameters(), lr=0.01)

In [37]:
%%time
num_epochs = 10000
for epoch in range(num_epochs):
  for i, (sequences, targets) in enumerate(train_loader):
    # forward pass
    outputs = lstm_remember(sequences)
    loss = criterion(outputs, targets)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
  if (epoch+1)%100 == 0:
    print("loss is", loss.item())

loss is 0.0024148006923496723
loss is 5.1259557949379086e-05
loss is 8.673283673488186e-07
loss is 0.01033545471727848
loss is 4.511162842391059e-06
loss is 3.4083641367033124e-06
loss is 1.2738404109313706e-07
loss is 0.001695329905487597
loss is 1.3969316370321394e-08
loss is 3.6411762494026334e-11
loss is 9.540596579427074e-08
loss is 0.0002487917663529515
loss is 8.247862126609107e-08
loss is 0.0001593692577444017
loss is 0.00023692134709563106
loss is 9.572919952915981e-05
loss is 0.001537151401862502
loss is 1.7767004464985803e-05
loss is 5.851800233358517e-05
loss is 0.01244555227458477
loss is 4.7574124550919805e-07
loss is 1.535607623281976e-07
loss is 5.798185043204285e-09
loss is 1.6104791029647458e-06
loss is 2.9206352337496355e-05
loss is 0.00021576270228251815
loss is 1.823918501031585e-05
loss is 1.463759872422088e-06
loss is 3.244131221435964e-07
loss is 7.617662163283967e-07
loss is 3.2981015465338714e-06
loss is 0.00015684122627135366
loss is 0.00039906747406348586
lo

In [39]:
with torch.no_grad():
  outputs = lstm_remember(test_set.X)
  test_mse = torch.mean((outputs - test_set.y)**2)
print(test_mse.item())

0.8073420524597168


## ReLU RNN

The idea of the ReLU RNN is to initialize the recurrent (hidden-to-hidden) weight matrix of the RNN with the identity matrix and the biases with 0, and to use the ReLU activation function. Below we demonstrate how such an RNN can be implemented. The results are not as good as the LSTM but certainly better than the traditional Elman Network.
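
As a small aside on why this initialization helps, the sketch below (not the notebook's model) checks one property of it: with identity recurrent weights, zero biases and ReLU, a non-negative hidden state is carried forward unchanged when the inputs are zero. The dimensions and the name `relu_rnn` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Toy dimensions chosen only for this sketch.
hidden_dim, input_dim, seq_len = 4, 3, 20
relu_rnn = nn.RNN(input_size=input_dim, hidden_size=hidden_dim,
                  nonlinearity="relu", batch_first=True)

with torch.no_grad():
    # Identity recurrent weights and zero biases.
    relu_rnn.weight_hh_l0.copy_(torch.eye(hidden_dim))
    relu_rnn.bias_ih_l0.zero_()
    relu_rnn.bias_hh_l0.zero_()

    # With all-zero inputs, each step computes relu(I h) = h for h >= 0,
    # so the initial state should survive all 20 steps.
    h_0 = torch.tensor([[[0.5, 1.0, 0.0, 2.0]]])  # (1, batch=1, hidden_dim)
    zero_inputs = torch.zeros(1, seq_len, input_dim)
    _, h_final = relu_rnn(zero_inputs, h_0)

print(torch.allclose(h_final, h_0))
```

At initialization the network therefore starts out by remembering its state by default, and training only has to learn how inputs should modify it.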