# What are RNNs?

![alt text](rnn.jpg "RNN")

The diagram above shows an RNN being unrolled (or unfolded) into a full network. Unrolling simply means writing out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network is unrolled into a 5-layer neural network, one layer for each word.

# Core Ideas Behind RNN

1. Sequential data: the value at each time step depends on the values at earlier time steps.
2. Shared weights: the same parameters are used across all time steps, which lets the model generalize across examples of different lengths.

![alt text](rnns.jpeg "rnns")

# Vanilla RNN

![alt text](vanilla-rnn.png "Vanilla RNN")

Where __st__ is the memory (hidden state) of the RNN, __xt__ is the input at the current time step, __ot__ is the output at the current time step, and __U__, __V__, and __W__ are the weights for the input, output, and memory respectively.

This formulation has two major advantages:
1. Regardless of the sequence length, the learned model always has the same input size, because it is defined in terms of the transition from one state to the next rather than a variable-length history.
2. The same transition function, with the same parameters, computes the current memory at every time step.
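
To make the weight sharing concrete, here is a minimal sketch of the forward pass in plain Python. It assumes the common formulation `s_t = tanh(U x_t + W s_{t-1})` and `o_t = softmax(V s_t)`; the figure may use a different activation, and the helper names (`matvec`, `softmax`, `rnn_forward`) and the toy weights are made up for illustration.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(xs, U, W, V, s0):
    """Run a vanilla RNN over the sequence xs.

    Assumed formulation (not read off the figure):
        s_t = tanh(U x_t + W s_{t-1})
        o_t = softmax(V s_t)
    """
    s = s0
    states, outputs = [], []
    for x in xs:
        # The same U, W and V are reused at every time step.
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]
        states.append(s)
        outputs.append(softmax(matvec(V, s)))
    return states, outputs

# Toy sizes: 2-dimensional inputs, 3 hidden units, 2 output classes.
U = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
W = [[0.2, 0.0, -0.1], [0.1, 0.1, 0.0], [-0.3, 0.2, 0.1]]
V = [[0.3, -0.1, 0.2], [0.0, 0.4, -0.2]]
s0 = [0.0, 0.0, 0.0]

# The same parameters handle a 2-step and a 7-step sequence.
short_states, short_outputs = rnn_forward([[1.0, 0.0]] * 2, U, W, V, s0)
long_states, long_outputs = rnn_forward([[1.0, 0.0]] * 7, U, W, V, s0)
print(len(short_outputs), len(long_outputs))
```

Nothing in `rnn_forward` depends on the sequence length, which is what allows one set of weights to serve examples of any length.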

# How to train them?

## Backpropagation Through Time

![alt text](bptt.png "Backpropagation Through Time")

As we may recall, we have three parameters to train: __W__, __U__, and __V__. For the rest of this section, we will use __E3__, the error at the third time step, as an example.

For each time step, we calculate the following:
1. ![alt text](dl-dv.png "dl-dv")
2. ![alt text](dl-dw1.png "dl-dw1") ![alt text](dl-dw2.png "dl-dw2")
3. ![alt text](dl-du.png "dl-du")

Where:
1. ![alt text](E.png "E")
2. ![alt text](y.png "y")
3. ![alt text](q.png "q")

# Long-Term Dependencies Problem

![alt text](dl-dw2%20problems.png "dl-dw2")

The red square marks the problematic term when doing BPTT in an RNN. Because the gradient is propagated back through many time steps, its size depends on the weights: if they are too small the gradient will __vanish__, and if they are too big it will __explode__.
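
A rough numeric illustration of this effect: the sketch below uses a single scalar `w` as a stand-in for the per-step factor that gets multiplied during backpropagation. This is a simplification, since the real factor is a matrix that also includes the derivative of the activation function, and `gradient_factor` is a hypothetical helper name.

```python
def gradient_factor(w, steps):
    """Product of `steps` identical factors w, a scalar stand-in for the
    repeated per-step factor in backpropagation through time."""
    factor = 1.0
    for _ in range(steps):
        factor *= w
    return factor

# 30 steps matches the BPTT length used later in this document.
for w in (0.5, 1.0, 1.5):
    print(w, gradient_factor(w, 30))
```

With `w = 0.5` the product drops below `1e-9` after 30 steps, while with `w = 1.5` it grows past `1e5`. Only a factor close to 1 keeps the gradient at a usable scale.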

## How to mitigate?

1. For exploding gradients, a simple solution that practitioners have used for many years is __clipping the gradient__:
   ![alt text](gradient%20clipping%20pseudocode.png "gradient clipping pseudocode")
   ![alt text](gradient%20clipping%20effect.png "gradient clipping effect")
2. Adjust the weight initialization
3. Change the activation function (for example from sigmoid to tanh, ReLU, or leaky ReLU)
4. Use batch normalization
5. Use a gated cell, such as LSTM or GRU
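
Gradient clipping can be sketched in a few lines. The version below assumes the common "clip by norm" rule, which rescales the gradient whenever its L2 norm exceeds a threshold. The pseudocode in the figure may differ in detail, and `clip_by_norm` is a hypothetical helper written for this illustration.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    # Gradients that are already small enough are left untouched.
    return list(grad)

large = clip_by_norm([3.0, 4.0], 1.0)   # norm 5.0, gets rescaled
small = clip_by_norm([0.3, 0.4], 1.0)   # norm 0.5, unchanged
print(large, small)
```

Clipping changes the length of the gradient but not its direction. In PyTorch the same idea is available as `torch.nn.utils.clip_grad_norm_`, which would be called between `loss.backward()` and `opt.step()`.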

# Implementation

The problem we are trying to solve in this implementation is language modeling. A language model assigns a probability to a sequence of words. In layman's terms, we are trying to teach the machine to generate words based on their probability of occurrence given the words that precede them.
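
As a toy illustration of "the probability of a sequence of words", the sketch below estimates next-word probabilities from bigram counts and multiplies them along a sentence. It is a deliberate simplification: the corpus is made up, and a bigram model looks only one word back, whereas the recurrent model built below conditions on a longer history.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
corpus = "the cat sat on the mat . the cat ate .".split()

# Count how often each word occurs as a context, and each adjacent pair.
context_counts = Counter(corpus[:-1])
pair_counts = Counter(zip(corpus, corpus[1:]))

def next_word_prob(prev, word):
    """Estimate P(word | prev) from relative frequencies."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, word)] / context_counts[prev]

def sequence_prob(words):
    """Probability of words[1:] given words[0], via the chain rule."""
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= next_word_prob(prev, word)
    return prob

print(next_word_prob("the", "cat"))
print(sequence_prob(["the", "cat", "sat"]))
```

In this corpus "the" is followed by "cat" two times out of three, and "cat" by "sat" one time out of two, so the sequence "the cat sat" gets probability 2/3 × 1/2 = 1/3 given its first word.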

## Requirements
1. [Install PyTorch](http://pytorch.org/)
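2. Install torchtext by running `pip install git+https://github.com/pytorch/text --upgrade`. Note that the code below uses torchtext's older `data.Field` / `data.BPTTIterator` API, which newer torchtext releases no longer provide, so an older release may be needed.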

### Load Data

Let's start by loading the data using torchtext. For this example we will use the WikiText-2 dataset that ships with torchtext. The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia; WikiText-2 is its smaller variant, with roughly 2 million training tokens.

In [137]:
from torchtext import data
from torchtext import datasets
from torchtext.vocab import GloVe

In [138]:
BATCH_SIZE = 3
BPTT_LEN = 30
EMBEDDING_DIM = 300

In [139]:
TEXT = data.Field(lower=True)

In [140]:
train, valid, test = datasets.WikiText2.splits(TEXT)

In [141]:
TEXT.build_vocab(train, vectors=GloVe(name="6B", dim=EMBEDDING_DIM))

In [142]:
TEXT.vocab.freqs.most_common(5)

[('the', 130768), (',', 99913), ('.', 73388), ('of', 57030), ('<unk>', 54625)]

In [143]:
train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test), batch_size=BATCH_SIZE, bptt_len=BPTT_LEN,
    device=-1)

Below is a sample of what our dataset looks like. It consists of sequences of words, from which we will try to predict the next word.

In [144]:
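batch = next(iter(train_iter))
# batch.text has shape (bptt_len, batch_size); transpose so each row is one sequence.
# The array is named token_ids so it does not shadow the imported `data` module.
token_ids = batch.text.transpose(1, 0).data.numpy()
sample = []
for sequence in token_ids:
    for token_id in sequence:
        sample.append(TEXT.vocab.itos[token_id])
print(" ".join(sample))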

<eos> = valkyria chronicles iii = <eos> <eos> senjō no valkyria 3 : <unk> chronicles ( japanese : 戦場のヴァルキュリア3 , lit . valkyria of the battlefield 3 ) , commonly for its release model in contrast to the rock band series , causing some players to hold contempt towards activision . harmonix considered the rock band series as a " top rope against both headshrinkers . as he tried to attack samu from the top rope again , samu caught him and <unk> him before fatu executed a diving splash
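
The sketch below shows one way such input/target pairs can be built from a long token stream. It is an illustration of the idea, not torchtext's actual implementation: `bptt_batches` is a hypothetical helper, and details such as padding are ignored. The stream is split into `batch_size` parallel columns, cut into windows of at most `bptt_len` time steps, and the target is the input shifted by one position.

```python
def bptt_batches(tokens, batch_size, bptt_len):
    """Yield (text, target) pairs. Each is a list of rows (time steps)
    holding batch_size entries, and target is text shifted by one token."""
    col_len = len(tokens) // batch_size
    # Split the stream into batch_size contiguous columns.
    columns = [tokens[i * col_len:(i + 1) * col_len]
               for i in range(batch_size)]
    # Stop one short of the end so every input has a next token.
    for start in range(0, col_len - 1, bptt_len):
        end = min(start + bptt_len, col_len - 1)
        text = [[col[t] for col in columns] for t in range(start, end)]
        target = [[col[t + 1] for col in columns] for t in range(start, end)]
        yield text, target

# Integers stand in for word ids.
tokens = list(range(20))
batches = list(bptt_batches(tokens, batch_size=2, bptt_len=4))
first_text, first_target = batches[0]
print(first_text)
print(first_target)
```

Each column continues where the previous window stopped, which is why the printed sample above reads as three unrelated passages, one per batch element.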


In [145]:
# len() of an iterator is its number of batches
print("Total Training Data:", len(train_iter))
print("Total Validation Data:", len(valid_iter))
print("Total Testing Data:", len(test_iter))
print("Total Vocabularies:", len(TEXT.vocab))

Recorded values for the training split and the vocabulary:

Total Training Data: 23207
Total Vocabularies: 28913


In [113]:
import torch.nn as nn
from torch.optim import Adam

############################
# Variable Initialization #
############################
HIDDEN_SIZE = 100
NUM_LAYERS = 1
DROPOUT = 0.5
VOCAB_SIZE = len(TEXT.vocab)

#################################
# Neural Network Initialization #
#################################
class LanguageModelLSTM(nn.Module):
    def __init__(self):
        super(LanguageModelLSTM, self).__init__()
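        # Note: nn.LSTM applies dropout only between stacked layers,
        # so it has no effect while NUM_LAYERS is 1
        self.lstm = nn.LSTM(input_size=EMBEDDING_DIM,
                            hidden_size=HIDDEN_SIZE,
                            num_layers=NUM_LAYERS,
                            dropout=DROPOUT)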
        self.linear = nn.Linear(in_features=HIDDEN_SIZE,
                                out_features=VOCAB_SIZE)
        
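    def forward(self, X):
        # X: (step_size, batch_size, EMBEDDING_DIM)
        lstm_out, lstm_hidden = self.lstm(X)
        # lstm_out: (step_size, batch_size, HIDDEN_SIZE)
        step_size, batch_size, _ = lstm_out.size()
        # Flatten time and batch so the linear layer scores every position
        modified_output = lstm_out.view(step_size * batch_size, -1)

        # Sanity check: flattening keeps the first position's hidden vector intact
        assert (lstm_out[0][0] == modified_output[0]).all()
        out = self.linear(modified_output)

        return out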
    
embedding = nn.Embedding(TEXT.vocab.vectors.size(0),
                         TEXT.vocab.vectors.size(1))
embedding.weight.data.copy_(TEXT.vocab.vectors)
model = LanguageModelLSTM()
loss_fn = nn.CrossEntropyLoss()
opt = Adam(model.parameters())

################
# RNN Training #
################
import time

start = time.time()
total_steps = len(train_iter)
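for idx, batch in enumerate(train_iter):
    model.zero_grad()
    # The pretrained GloVe embedding is not part of `model`,
    # so the optimizer does not update it
    word_embedding = embedding(batch.text)
    out = model(word_embedding)

    # Flatten the targets to match the (step_size * batch_size, VOCAB_SIZE)
    # output; view(-1) also handles a shorter final batch
    target = batch.target.view(-1)
    loss = loss_fn(out, target)

    if idx % 100 == 0:
        # loss.item() requires PyTorch 0.4 or later
        print("Loss [%d/%d]: %f" % (idx, total_steps, loss.item()))

    loss.backward()

    opt.step()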

Loss [0/23207]: 10.284704
Loss [100/23207]: 7.636081
Loss [200/23207]: 7.254285
Loss [300/23207]: 7.189576
Loss [400/23207]: 7.070719
Loss [500/23207]: 6.672390
Loss [600/23207]: 6.892480
Loss [700/23207]: 6.824963
Loss [800/23207]: 6.405207
Loss [900/23207]: 6.652770
Loss [1000/23207]: 7.195287
Loss [1100/23207]: 6.251607
Loss [1200/23207]: 6.896472
Loss [1300/23207]: 6.249759
Loss [1400/23207]: 6.997591
Loss [1500/23207]: 6.630477
Loss [1600/23207]: 6.002308
Loss [1700/23207]: 6.442147
Loss [1800/23207]: 7.107758
Loss [1900/23207]: 6.655063
Loss [2000/23207]: 6.274049
Loss [2100/23207]: 6.689747
Loss [2200/23207]: 5.470231
Loss [2300/23207]: 6.647050
Loss [2400/23207]: 6.639992
Loss [2500/23207]: 6.521108
Loss [2600/23207]: 6.296050
Loss [2700/23207]: 6.597279
Loss [2800/23207]: 6.092088
Loss [2900/23207]: 6.122556
Loss [3000/23207]: 6.447164
Loss [3100/23207]: 5.756624
Loss [3200/23207]: 5.866466
Loss [3300/23207]: 6.300177
Loss [3400/23207]: 6.252677
Loss [3500/23207]: 6.046329
...

In [134]:
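for i, batch in enumerate(test_iter):
    word_embedding = embedding(batch.text)
    out = model(word_embedding)
    # Most likely next word for every (time step, batch element) row of `out`
    values, indices = out.max(1)

    # Rows of `out` are in time-major order, so the predictions for the
    # batch's sequences are interleaved when printed in this order
    print("PREDICTION: ")
    for idx in indices.data.numpy():
        print(TEXT.vocab.itos[idx], end=" ")
    # This prints the input text, one sequence after another; the true
    # next-word labels are in batch.target (the same text shifted by one)
    print("\n\nREAL LABEL: ")
    for idx in batch.text.transpose(1, 0).data.numpy():
        for idx2 in idx:
            print(TEXT.vocab.itos[idx2], end=" ")

    # Stop after five batches
    if (i+1) % 5 == 0:
        break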

90 28913
PREDICTION: 
, <unk> the = of <unk> shelley be , , <unk> the = . the <eos> . was the been the <unk> the player , by <unk> a be " <unk> <unk> " <unk> game , , <unk> , and . . series . . <unk> <unk> of , ) . , . the <eos> and own was the , a of . <unk> in the for the 's in games that and . the , . s the <unk> " guitar @-@ " series . " . . the 

REAL LABEL: 
<eos> = robert <unk> = <eos> <eos> robert <unk> is an english film , television and theatre actor . he had a guest @-@ starring role on the television series the rights to the <unk> shield had been sold to a new australian baseball league ( <unk> ) , with ownership split between major league baseball 's 75 percent share and bright " and that he evoked the " delicate , <unk> power @-@ playing vignettes " of his theater work . jackson said <unk> ' theatrical roots rarely showed 90 28913
PREDICTION: 
<unk> the and of <unk> the the @-@ the . of common <eos> of <unk> was in common also the of by <unk> , the government of <unk> <un