# Recurrent Neural Networks

We have looked at Linear Neural Networks and Convolutional Neural Networks. So why do we need Recurrent Neural Networks? The architectures we have studied so far share an obvious limitation. Can you guess what it is?<br>
.<br>
.<br>
.<br>
.<br>
Memory!<br>
These architectures don't store any information about the previous inputs given to the network. This means they tend to give poor results on sequential data in most cases. Humans don't start thinking from scratch at every instant: while reading this sentence, you keep in mind the words which came before and anticipate the ones to follow. A Linear model, by contrast, processes each input independently, so the entire sequence must be converted into a single input data point. Such models are hence stateless.

### What is an RNN?

An RNN is an architecture which, unlike Linear models, preserves state. It processes a sequence by iterating through its elements while maintaining a <b>state</b>. This state is reset between two different sequences. This is what a simple RNN looks like:

<img src='Images/RNN.png' />

The saved state is called the <b>hidden state</b>. An RNN processes each element of the sequence sequentially. At each time step, it updates its hidden state and produces an output. This is what happens when we "unroll" an RNN:
    
<img src='Images/RNN_unrolled.png'/>

Unrolling an RNN simply means visualizing how it processes the sequence element by element. In reality, the RNN consists of just one cell processing the input in a loop. This property allows an RNN to process variable-length inputs. An RNN can be seen as a refactored, fully-connected neural network.

The working of an RNN (at timestep $t$) is as follows:
An RNN consists of 3 weight matrices: $W_x$, $W_h$, $W_y$.
- $W_x$ is the weight matrix for the input (x).
- $W_h$ is the weight matrix for the hidden state.
- $W_y$ is the weight matrix for the output.

The hidden state is given by:<br>
$H_t = \sigma(W_x * X_t + W_h * H_{t-1})$
- $H_t$ is the hidden state at timestep $t$.
- $\sigma$ is the activation function (generally sigmoid or tanh).
- $X_t$ is the input at the current timestep.

The output of the RNN is given by:<br>
$y = \sigma_y(W_y * H_t)$
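
The two update equations above can be sketched in a few lines of PyTorch. The sizes (`input_size=4`, `hidden_size=3`, `output_size=2`), the random weights, and the choice of `tanh` and softmax for the two activation functions are assumptions made purely for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
input_size, hidden_size, output_size = 4, 3, 2

# The three weight matrices from the text (randomly initialised here)
W_x = torch.randn(hidden_size, input_size)
W_h = torch.randn(hidden_size, hidden_size)
W_y = torch.randn(output_size, hidden_size)

def rnn_step(x_t, h_prev):
    """One timestep: returns the new hidden state and the output."""
    h_t = torch.tanh(W_x @ x_t + W_h @ h_prev)   # H_t = sigma(W_x X_t + W_h H_{t-1})
    y_t = torch.softmax(W_y @ h_t, dim=0)        # y = sigma_y(W_y H_t)
    return h_t, y_t

# The same cell handles sequences of any length: here 5 steps, then 2 steps
for seq_len in (5, 2):
    sequence = torch.randn(seq_len, input_size)
    h = torch.zeros(hidden_size)                 # state is reset for each new sequence
    for x_t in sequence:
        h, y = rnn_step(x_t, h)
    print(seq_len, h.shape, y.shape)
```

Note that the loop reuses the same three matrices at every timestep; only the hidden state `h` changes as the sequence is consumed.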

Due to the sequential nature of natural language, RNNs are commonly used in Natural Language Processing. 
Let us try to better understand the working of RNNs using an example. 

In this example we are going to build a model that classifies names by country of origin: Italy or Germany. Our dataset consists of two files, `Italian.txt` and `German.txt`. Each file contains a single name per line, in the following format:
```
name_1
name_2
name_3
...
```



### Importing modules

In [1]:
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader, random_split
from pprint import pprint
import os
from string import ascii_letters

## Cleaning the Data

### Reading the data

In [None]:
with open('Projects/NameClassifier/data/names/German.txt', 'r') as german_f, open('Projects/NameClassifier/data/names/Italian.txt', 'r') as italian_f:
    german_names = german_f.read()
    italian_names = italian_f.read()

print(f'German names:\n{german_names[:30]}')
print()
print(f'Italian names:\n{italian_names[:33]}')

### Finding all the unique characters in the files

The classifier which we are going to build is character based: it takes a sequence of characters as its input, so each name is read by the model character by character. For this we first find all the unique characters in the files and then take their union with all ASCII letters (uppercase and lowercase) to form our vocabulary.

In [68]:
unique_characters = set((german_names + italian_names).replace('\n', '')).union(set(ascii_letters))
unique_characters = list(unique_characters)
''.join(sorted(unique_characters))

" 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzßàäèéìòóöùü"

In [69]:
german_names = german_names.split('\n')
italian_names = italian_names.split('\n')
print(german_names[:5])
print(italian_names[:5])

['Abbing', 'Abel', 'Abeln', 'Abt', 'Achilles']
['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']


### Removing common names

We don't want our classifier to see the same input with two different classes. Hence, we will find the names which exist in both the German and Italian datasets and remove them from both.

In [70]:
common_names = list(set(german_names).intersection(set(italian_names)))
common_names

['', 'Salomon', 'Paternoster']

In [71]:
for common_name in common_names:
    german_names.remove(common_name)
    italian_names.remove(common_name)
    
common_names = list(set(german_names).intersection(set(italian_names)))
common_names

[]
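
Note that `list.remove` deletes only the first occurrence of a value, so the loop above relies on each shared name appearing once per file. A slightly more defensive variant, sketched here with small made-up lists (`german_demo` and `italian_demo` are hypothetical stand-ins for the real data), filters out every occurrence:

```python
# Made-up stand-ins for the real name lists (illustration only)
german_demo = ['Abel', 'Salomon', 'Salomon', 'Abt', '']
italian_demo = ['Abate', 'Salomon', 'Abatangelo', '']

# Names present in both lists (this can include an empty string,
# such as one produced by splitting on a trailing newline)
common = set(german_demo) & set(italian_demo)

# Keep only the names that are not shared; this drops every occurrence
german_clean = [name for name in german_demo if name not in common]
italian_clean = [name for name in italian_demo if name not in common]

print(german_clean)   # ['Abel', 'Abt']
print(italian_clean)  # ['Abate', 'Abatangelo']
```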

### Creating our data

We will create a list of all our names. This will be the input passed to our model. Along with this we will also need labels. We will use a label of `0` for German names and a label of `1` for Italian names.

In [72]:
german_label = [0]
italian_label = [1]

all_names = german_names + italian_names
all_labels = german_label * len(german_names) + italian_label * len(italian_names)
print(all_names[720:726])
print(all_labels[720:726])

['Zimmerman', 'Zimmermann', 'Abandonato', 'Abatangelo', 'Abatantuono', 'Abate']
[0, 0, 1, 1, 1, 1]


### One hot encoding characters

For our model to be able to process our input, we have to convert the characters to one-hot encoded vectors. The size of each vector will be the total number of unique characters in our dataset. Hence we will first create a mapping from each character to its index. We can then use this mapping to convert our input characters to integer indices.
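
To make the idea concrete, here is a minimal sketch with a hypothetical three-character vocabulary (`demo_stoi` is made up for this example and is not the mapping built below). It uses `torch.nn.functional.one_hot`, which turns a tensor of indices into one-hot rows:

```python
import torch
from torch.nn import functional as F

# Hypothetical toy vocabulary: character -> index
demo_stoi = {'a': 0, 'b': 1, 'c': 2}

# Convert the string to a tensor of indices, then to one-hot rows
indices = torch.tensor([demo_stoi[ch] for ch in 'cab'])
encoded = F.one_hot(indices, num_classes=len(demo_stoi)).float()

print(encoded)
# tensor([[0., 0., 1.],
#         [1., 0., 0.],
#         [0., 1., 0.]])
```

Each row has a single `1` at the index of its character, so a name of length `n` becomes an `n x vocabulary_size` matrix.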

In [73]:
stoi =  {char:idx for idx, char in enumerate(sorted(unique_characters))}
stoi

{' ': 0,
 "'": 1,
 'A': 2,
 'B': 3,
 'C': 4,
 'D': 5,
 'E': 6,
 'F': 7,
 'G': 8,
 'H': 9,
 'I': 10,
 'J': 11,
 'K': 12,
 'L': 13,
 'M': 14,
 'N': 15,
 'O': 16,
 'P': 17,
 'Q': 18,
 'R': 19,
 'S': 20,
 'T': 21,
 'U': 22,
 'V': 23,
 'W': 24,
 'X': 25,
 'Y': 26,
 'Z': 27,
 'a': 28,
 'b': 29,
 'c': 30,
 'd': 31,
 'e': 32,
 'f': 33,
 'g': 34,
 'h': 35,
 'i': 36,
 'j': 37,
 'k': 38,
 'l': 39,
 'm': 40,
 'n': 41,
 'o': 42,
 'p': 43,
 'q': 44,
 'r': 45,
 's': 46,
 't': 47,
 'u': 48,
 'v': 49,
 'w': 50,
 'x': 51,
 'y': 52,
 'z': 53,
 'ß': 54,
 'à': 55,
 'ä': 56,
 'è': 57,
 'é': 58,
 'ì': 59,
 'ò': 60,
 'ó': 61,
 'ö': 62,
 'ù': 63,
 'ü': 64}

While our RNN can accept inputs of variable length, we still have to define a sequence length. This will allow us to batch our data for parallel execution.
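
To see why a common length matters for batching: `torch.stack` needs tensors of identical shape. The sketch below uses made-up feature matrices of width 3 in place of real one-hot rows, and `torch.nn.utils.rnn.pad_sequence` in place of the manual zero-padding used in this tutorial; both choices are assumptions for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two made-up "names" of different lengths (2 and 4 steps), 3 features per step
shorter = torch.ones(2, 3)
longer = torch.ones(4, 3)

# torch.stack([shorter, longer]) would raise an error because the shapes differ.
# Padding the shorter sequence with zero rows gives a rectangular batch.
batch = pad_sequence([shorter, longer], batch_first=True)

print(batch.shape)  # torch.Size([2, 4, 3])
print(batch[0])     # first two rows are ones, last two rows are zero padding
```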

In [74]:
def one_hot_encoder(name, sequence_length):
    global stoi
    size = len(stoi)
    print(f'Size of stoi: {size}')
    # To save output
    encoded = []
    # Iterating through name
    for char in name:
        temp = torch.zeros(size)
        # Setting index of character to 1
        temp[stoi[char]] = 1
        encoded.append(temp)
        
    # Filling the rest of the sequence with zeros
    for i in range(sequence_length - len(name)):
        temp = torch.zeros(size)
        encoded.append(temp)

    return torch.stack(encoded)

one_hot_encoder('Aniket', 10)

Size of stoi: 65


tensor([[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0.,

### Creating our dataset object

Now we have done a lot of preprocessing! Let us combine all of this in our dataset class.

In [2]:
class NameDataset(Dataset):
    def __init__(self, german_fname='Projects/NameClassifier/data/names/German.txt', italian_fname='Projects/NameClassifier/data/names/Italian.txt'):
        super().__init__()
        # Reading from files
        with open(german_fname, 'r') as german_f, open(italian_fname, 'r') as italian_f:
            german_names = german_f.read()
            italian_names = italian_f.read()
        
        # Finding unique characters
        unique_characters = list(set((german_names + italian_names).replace('\n', '')).union(set(ascii_letters)))
        german_names = german_names.split('\n')
        italian_names = italian_names.split('\n')
        
        # Removing common names
        common_names = list(set(german_names).intersection(set(italian_names)))
        for common_name in common_names:
            german_names.remove(common_name)
            italian_names.remove(common_name)
        german_label = [0]
        italian_label = [1]

        # Setting names and labels
        self.names = german_names + italian_names
        self.labels = german_label * len(german_names) + italian_label * len(italian_names)
        
        # Mapping from chars to int
        self.stoi =  {char:idx for idx, char in enumerate(sorted(unique_characters))}
        
        # Size of longest word is 18
        self.sequence_length = 18
        
        # One hot encoded names
        self.encoded_names = self.encode_dataset()

    def one_hot_encoder(self, name):
        size = len(self.stoi)

        encoded = []
        for char in name:
            temp = torch.zeros(size)
            temp[self.stoi[char]] = 1
            encoded.append(temp)

        for i in range(self.sequence_length - len(name)):
            temp = torch.zeros(size)
            encoded.append(temp)

        return torch.stack(encoded)
        
    def encode_dataset(self):
        encoded_list = []
        for name in self.names:
            encoded_list.append(self.one_hot_encoder(name))
            
        return torch.stack(encoded_list)
        
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return self.encoded_names[idx], torch.tensor([self.labels[idx]])
    
        

In [3]:
names = NameDataset()
names[0]

(tensor([[0., 0., 1.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]), tensor([0]))

In [4]:
# Shape of input tensor (one word)
names[0][0].shape

torch.Size([18, 65])

In [7]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
split_ratio = 0.8
data_len = len(names)
train_size = int(split_ratio * data_len)
test_size = data_len - train_size

# Randomly splits data into given sizes
train_dataset, test_dataset = random_split(names, lengths=(train_size, test_size))

### Comparison with a Linear model

Before we build our RNN-based model, let us look at the results from a conventional Linear model of the kind you have studied until now. On our problem, using a 3-layer linear model, I was able to achieve an accuracy of just 69.2%, even after training for 100 epochs. This problem is very simple, with very short sequences. For a larger problem with longer sequences, the model's performance would likely be even worse.
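
The exact baseline is not shown here, so the following is only a sketch of what such a 3-layer linear model might look like. The hidden sizes (`256` and `64`) and the class name `LinearBaseline` are assumptions; the input size follows from flattening one `18 x 65` name tensor.

```python
import torch
from torch import nn

class LinearBaseline(nn.Module):
    """Hypothetical stateless baseline: the whole name is flattened into one vector."""
    def __init__(self, sequence_length=18, input_size=65, num_classes=2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),                                  # (batch, 18, 65) -> (batch, 1170)
            nn.Linear(sequence_length * input_size, 256),  # hidden sizes are assumptions
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),                    # raw scores (logits) per class
        )

    def forward(self, x):
        return self.layers(x)

baseline = LinearBaseline()
dummy_batch = torch.zeros(4, 18, 65)    # four all-zero "names", for a shape check only
print(baseline(dummy_batch).shape)      # torch.Size([4, 2])
```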

### Building an RNN using linear layers

Let us revisit the mathematics behind an RNN:<br>
$H_t = \sigma(W_x * X_t + W_h * H_{t-1})$<br>
$y = \sigma_y(W_y * H_t)$

Where $H_t$ is the hidden state at timestep $t$ and $y$ is the output of the RNN.

In [31]:
class LinearRNN(nn.Module):
    def __init__(self):
        super().__init__()
        global device
        self.device = device
        self.hidden_size = 256
        self.sequence_length = 18
        self.input_size = 65
        self.Wx = nn.Linear(self.input_size, self.hidden_size)
        self.Wh = nn.Linear(self.hidden_size, self.hidden_size)
        self.Wy = nn.Linear(self.sequence_length * self.hidden_size, self.hidden_size)
        self.h = torch.zeros(1, self.hidden_size).to(self.device)
        self.output_layer = nn.Linear(self.hidden_size, 2)
        self.bn = nn.BatchNorm1d(self.hidden_size)
        
    def forward(self, input_tensor):
#         h = self.h
        h = torch.zeros(1, self.hidden_size).to(self.device)
        res = []
        for i in range(input_tensor.shape[1]):     # input_tensor.shape[1] = sequence length
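            # Note: tanh is applied to each term separately before summing, a slight variation on the formula above
            x = torch.tanh(self.Wx(input_tensor[:, i]))     # input_tensor[:, i] = the ith element in the sequence
            h = torch.tanh(self.Wh(h))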
            h = torch.add(h, x)
            res.append(h)
        
        self.h = h.detach()        
        res = torch.stack(res, dim=1)
        res = res.reshape(-1, self.sequence_length * self.hidden_size)
        res = F.relu(self.Wy(res))
        res = F.softmax(self.output_layer(res), dim=1)     # dim=1: normalise over the two classes
        return res
    

In [32]:
batch_size = 1
linear_train_loader = DataLoader(train_dataset, batch_size=batch_size)
linear_test_loader = DataLoader(test_dataset, batch_size=batch_size)

In [33]:
model = LinearRNN().to(device)
criterion = nn.CrossEntropyLoss()
lr = 5e-6
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [34]:
num_epochs = 30
max_accuracy = 0.0
MODEL_PATH = ''
if os.path.exists(MODEL_PATH):
    print('Existing model found!')
    load_weights(model, MODEL_PATH)
else:
    print('No existing model found.')
    
model

No existing model found.


LinearRNN(
  (Wx): Linear(in_features=65, out_features=256, bias=True)
  (Wh): Linear(in_features=256, out_features=256, bias=True)
  (Wy): Linear(in_features=4608, out_features=256, bias=True)
  (output_layer): Linear(in_features=256, out_features=2, bias=True)
  (bn): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)

In [35]:
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    total = 0
    correct = 0
    print(f'Epoch: {epoch}')
    for input_batch, labels in linear_train_loader:
        if labels.size(0) != batch_size: continue
        model.zero_grad()
        output = model.forward(input_batch.to(device))
        loss = criterion(output, labels.to(device).long().reshape(batch_size,))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        total += labels.size(0)
        correct += torch.sum(torch.argmax(output, dim=1).view(1, -1) == labels.to(device).view(1, -1)).item()
#         break
#     break
    
    
    print(f'Accuracy: {correct/total * 100}\nLoss: {epoch_loss/total}')
    if (epoch + 1) % 3 == 0:
        test_epoch_loss = 0
        total = 0
        correct = 0
        model.eval()
        for input_batch, labels in linear_test_loader:
            with torch.no_grad():
                if labels.size(0) != batch_size: continue
                output = model.forward(input_batch.to(device))
                loss = criterion(output, labels.to(device).long().reshape(batch_size,))
                test_epoch_loss += loss.item()
                total += labels.size(0)
                correct += torch.sum(torch.argmax(output, dim=1).view(1, -1) == labels.to(device).view(1, -1)).item()
#                 print(torch.argmax(output, dim=0))
#                 print(labels)
#                 break
    
        test_accuracy = round(correct/total, 4) * 100
        print(f'''### TESTING ###
        Accuracy: {test_accuracy}
        Loss: {round(test_epoch_loss/total, 4)}''')
#         if max_accuracy < test_accuracy:
#             max_accuracy = test_accuracy
#             save_weights(model, MODEL_PATH)
#             print('Best model found!')
        

Epoch: 0




Accuracy: 61.85476815398076
Loss: 0.6853206089892516
Epoch: 1
Accuracy: 75.32808398950131
Loss: 0.6586288675235639
Epoch: 2
Accuracy: 82.76465441819772
Loss: 0.6005860694854382
### TESTING ###
        Accuracy: 82.87
        Loss: 0.5592
Epoch: 3
Accuracy: 81.97725284339458
Loss: 0.5180254277177578
Epoch: 4
Accuracy: 83.46456692913385
Loss: 0.48068066542334237
Epoch: 5
Accuracy: 86.3517060367454
Loss: 0.45952158613572075
### TESTING ###
        Accuracy: 84.61999999999999
        Loss: 0.4619
Epoch: 6
Accuracy: 87.83902012248468
Loss: 0.4428591350662218
Epoch: 7
Accuracy: 89.501312335958
Loss: 0.42897117972269666
Epoch: 8
Accuracy: 90.5511811023622
Loss: 0.41748434638935333
### TESTING ###
        Accuracy: 89.51
        Loss: 0.4273
Epoch: 9
Accuracy: 91.42607174103236
Loss: 0.40820089110343577
Epoch: 10
Accuracy: 91.68853893263342
Loss: 0.4006210315258797
Epoch: 11
Accuracy: 92.56342957130359
Loss: 0.39420335441958143
### TESTING ###
        Accuracy: 91.25999999999999
        Loss: 

As you can see, with just 30 epochs of training we are able to achieve a testing accuracy of more than <b>92%</b>. The RNN implementation we saw above is only meant to provide insight into the working of RNNs. I don't recommend building your own RNN cells while working on your projects. We will now use the `torch.nn.RNN` module on the same problem. This is a much faster implementation which supports parallel processing.
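
Before wiring `torch.nn.RNN` into a classifier, it helps to check the shapes it returns. The sketch below uses a batch of 8 random tensors shaped like our encoded names; the batch size and the random input are assumptions for illustration only.

```python
import torch
from torch import nn

rnn = nn.RNN(input_size=65, hidden_size=256, num_layers=2, batch_first=True)

batch = torch.randn(8, 18, 65)     # (batch, sequence_length, input_size)
h0 = torch.zeros(2, 8, 256)        # (num_layers, batch, hidden_size)

output, h_n = rnn(batch, h0)

print(output.shape)  # torch.Size([8, 18, 256]): top-layer hidden state at every timestep
print(h_n.shape)     # torch.Size([2, 8, 256]): final hidden state of each layer
```

Flattening `output` per example gives `18 * 256 = 4608` features, which is the `in_features` of the linear layer in the classifier below.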

In [5]:
class NameClassifier(nn.Module):
    def __init__(self, max_len=18, hidden_size=256, input_size=65):
        super().__init__()
        dropout_prob = 0.4
        self.input_size = input_size
        self.sequence_length = max_len
        self.hidden_size = hidden_size
        self.num_layers = 2
        self.rnn = nn.RNN(
            input_size=self.input_size,
            hidden_size=self.hidden_size, 
            num_layers=self.num_layers, 
            batch_first=True,
            dropout=dropout_prob
        )
        self.dropout_layer = nn.Dropout(dropout_prob)
        self.linear_layer = nn.Linear(self.hidden_size * self.sequence_length, 256)
        self.output_layer = nn.Linear(256, 2)
        
        
    def forward(self, input_tensor, hidden):
        rnn_output, new_hidden = self.rnn(input_tensor, hidden)
        rnn_output = self.dropout_layer(rnn_output)
        # Flatten the hidden states of all timesteps into one feature vector per name
        linear_output = F.relu(self.linear_layer(rnn_output.reshape(-1, self.hidden_size * self.sequence_length)))
        output = F.softmax(self.output_layer(linear_output), dim=1)
        new_hidden = new_hidden.detach()
        return output, new_hidden
        
        
    def init_hidden(self, batch_size):
        return torch.zeros(self.num_layers, batch_size, self.hidden_size)

In [8]:
model = NameClassifier().to(device)
criterion = nn.CrossEntropyLoss()
lr = 5e-6
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [9]:
num_epochs = 30
max_accuracy = 0.0
    
model

NameClassifier(
  (rnn): RNN(65, 256, num_layers=2, batch_first=True, dropout=0.4)
  (dropout_layer): Dropout(p=0.4)
  (linear_layer): Linear(in_features=4608, out_features=256, bias=True)
  (output_layer): Linear(in_features=256, out_features=2, bias=True)
)

In [12]:
batch_size = 8
train_loader = DataLoader(train_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

In [13]:
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    total = 0
    correct = 0
    hidden = model.init_hidden(batch_size=batch_size)
    print(f'Epoch: {epoch}')
    for input_batch, labels in train_loader:
        if labels.size(0) != batch_size: continue
        model.zero_grad()
        output, hidden = model.forward(input_batch.to(device), hidden.to(device))
        loss = criterion(output, labels.to(device).long().reshape(batch_size,))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        total += labels.size(0)
        correct += torch.sum(torch.argmax(output, dim=1).view(1, -1) == labels.to(device).view(1, -1)).item()
    
    print(f'Accuracy: {correct/total * 100}\nLoss: {epoch_loss/total}')
    if (epoch + 1) % 3 == 0:
        test_epoch_loss = 0
        total = 0
        correct = 0
        model.eval()
        hidden = model.init_hidden(batch_size=batch_size)
        for input_batch, labels in test_loader:
            with torch.no_grad():
                if labels.size(0) != batch_size: continue
                output, hidden = model.forward(input_batch.to(device), hidden.to(device))
                loss = criterion(output, labels.to(device).long().reshape(batch_size,))
                test_epoch_loss += loss.item()
                total += labels.size(0)
                correct += torch.sum(torch.argmax(output, dim=1).view(1, -1) == labels.to(device).view(1, -1)).item()
#                 print(torch.argmax(output, dim=0))
#                 print(labels)
#                 break
                
        test_accuracy = round(correct/total, 4) * 100
        print(f'''### TESTING ###
        Accuracy: {test_accuracy}
        Loss: {round(test_epoch_loss/total, 4)}''')
        if max_accuracy < test_accuracy:
            max_accuracy = test_accuracy
#             save_weights(model, MODEL_PATH)
            print('Best model found!')
        

Epoch: 0




Accuracy: 51.58450704225353
Loss: 0.086549880848804
Epoch: 1
Accuracy: 51.6725352112676
Loss: 0.08647492449258415
Epoch: 2
Accuracy: 51.76056338028169
Loss: 0.08638013862598111
### TESTING ###
        Accuracy: 48.209999999999994
        Loss: 0.0864
Best model found!
Epoch: 3
Accuracy: 52.55281690140845
Loss: 0.08620173363408572
Epoch: 4
Accuracy: 55.1056338028169
Loss: 0.08603270105283026
Epoch: 5
Accuracy: 58.71478873239436
Loss: 0.08576032760697351
### TESTING ###
        Accuracy: 60.36
        Loss: 0.0855
Best model found!
Epoch: 6
Accuracy: 62.676056338028175
Loss: 0.08538978919386864
Epoch: 7
Accuracy: 69.54225352112677
Loss: 0.08472762912721701
Epoch: 8
Accuracy: 76.76056338028168
Loss: 0.0831931986859147
### TESTING ###
        Accuracy: 78.93
        Loss: 0.081
Best model found!
Epoch: 9
Accuracy: 78.4330985915493
Loss: 0.07857666133155286
Epoch: 10
Accuracy: 78.4330985915493
Loss: 0.07003735299681274
Epoch: 11
Accuracy: 80.36971830985915
Loss: 0.06540740795538459
### TEST

As you can see, we get very similar accuracies from the two models. This is because they are essentially doing the same thing.
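
One way to check this claim is to replay a single-layer `nn.RNN` by hand. The sketch below (the small sizes and the random input are assumptions) reads the layer's own parameters, applies the recurrence $H_t = \tanh(W_x * X_t + W_h * H_{t-1})$ together with the bias terms PyTorch adds, and compares the result with the module's output:

```python
import torch
from torch import nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=5, hidden_size=4, batch_first=True)   # small hypothetical sizes
x = torch.randn(2, 7, 5)                                      # (batch, sequence, features)

with torch.no_grad():
    expected, _ = rnn(x)          # the initial hidden state defaults to zeros

    h = torch.zeros(2, 4)
    steps = []
    for t in range(x.shape[1]):
        # Same recurrence, written with the layer's own weights and biases
        h = torch.tanh(
            x[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
            + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
        )
        steps.append(h)
    manual = torch.stack(steps, dim=1)

# Compare the hand-written loop with the built-in module
print(torch.allclose(expected, manual, atol=1e-5))
```

The two results should agree up to floating-point tolerance, which is why a hand-written loop and the built-in module can be expected to learn similar classifiers.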

### Testing the model with user input

In [17]:
with torch.no_grad():
    model.eval()
    hidden = model.init_hidden(batch_size=1)
    input_tensor = names.one_hot_encoder('Tribbiani')
    input_tensor = input_tensor.view(1, *input_tensor.shape)
    # forward returns (class probabilities, new hidden state); only the probabilities are needed here
    output, _ = model(input_tensor.to(device), hidden.to(device))
    class_ = torch.argmax(output, dim=1).item()
    print('German' if class_ == 0 else 'Italian')
    print(f'Confidence: {output[0][class_]}')

Italian
Confidence: 0.9999949932098389


