# Our first RNN

The goal of this lesson is to introduce the ideas behind Recurrent Neural Networks.

A very good starting point: [Karpathy's blog](http://karpathy.github.io/2015/05/21/rnn-effectiveness)

and the associated [code](https://github.com/karpathy/char-rnn) 

Fortunately, the code is not in pytorch, so that you can now 'translate it'!

In [1]:
import numpy as np

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch_utils import ScaledEmbedding
import torch.nn.functional as F

import torch_utils; 
from torch_utils import gpu

We're going to download the collected works of Nietzsche to use as our data for this class.

In [2]:
data_folder = 'data/nietzsche/'
data_nie = data_folder+'nietzsche.txt'

In [3]:
%mkdir -p $data_folder
!wget -O $data_nie 'https://s3.amazonaws.com/text-datasets/nietzsche.txt'

--2017-12-30 13:40:45--  https://s3.amazonaws.com/text-datasets/nietzsche.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.72.226
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.72.226|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 600901 (587K) [text/plain]
Saving to: ‘data/nietzsche/nietzsche.txt’


2017-12-30 13:40:46 (546 KB/s) - ‘data/nietzsche/nietzsche.txt’ saved [600901/600901]



In [4]:
text = open(data_nie).read()
print('corpus length:', len(text))

('corpus length:', 600901)


In [5]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)

('total chars:', 86)


In [6]:
chars.insert(0, "\0")

In [7]:
''.join(chars[1:-6])

'\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz'

Map from chars to indices and back again

In [8]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

idx will be the data we use from now own

In [9]:
idx = [char_indices[c] for c in text]

In [10]:
idx[:10]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [11]:
''.join(indices_char[i] for i in idx[:60])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is the'

# 3 Char

We want to learn the 4-th character from the 3 first ones.

In [12]:
cs=3
c1_dat = [idx[i] for i in range(0, len(idx)-1-cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-1-cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-1-cs, cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx)-1-cs, cs)]

Inputs

In [13]:
x1 = np.asarray(c1_dat[:-2])
x2 = np.asarray(c2_dat[:-2])
x3 = np.asarray(c3_dat[:-2])

Outputs

In [14]:
y = np.stack(c4_dat[:-2])

In [15]:
x1[:4], x2[:4], x3[:4]

(array([40, 30, 29,  1]), array([42, 25,  1, 43]), array([29, 27,  1, 45]))

In [16]:
y[:4]

array([30, 29,  1, 40])

In [17]:
x1.shape, y.shape

((200297,), (200297,))

## Char-RNN

I am using the architecture described in [Lesson 6 of fast.ai course](http://wiki.fast.ai/index.php/Lesson_6_Notes#Recurrent_Neural_Network_.28RNN.29:)

In [18]:
class RNN(nn.Module):
    def __init__(self, embedding_dim=42, vocab_size = 1, hidden_dim =256):
        super(RNN, self).__init__()
        
        self._embedding_dim = embedding_dim
        self._vocab_size = vocab_size
        self._hidden_dim = hidden_dim
        
        self.embeddings = ScaledEmbedding(vocab_size, embedding_dim)
        self.i2h = nn.Linear(embedding_dim, hidden_dim)
        self.h2h = nn.Linear(hidden_dim, hidden_dim)
        self.h2o = nn.Linear(hidden_dim, vocab_size)
            
    def forward(self, c1, c2, c3):
        c1_embedding = self.embeddings(c1)
        c2_embedding = self.embeddings(c2)
        c3_embedding = self.embeddings(c3)
        c1_r = F.relu(self.i2h(c1_embedding))
        c2_r = F.relu(self.i2h(c2_embedding))
        c3_r = F.relu(self.i2h(c3_embedding))
        h2 = F.tanh(self.h2h(c1_r))
        h3 = F.tanh(self.h2h(h2+c2_r))
        output = self.h2o(h3+c3_r)
        return output

In [19]:
rnn = RNN(vocab_size=vocab_size)

First try

In [20]:
in1 = Variable(gpu(torch.from_numpy(np.array([idx[0]]).astype(np.int64))))
in2 = Variable(gpu(torch.from_numpy(np.array([idx[1]]).astype(np.int64))))
in3 = Variable(gpu(torch.from_numpy(np.array([idx[2]]).astype(np.int64))))

In [21]:
out = rnn(in1,in2,in3)

In [22]:
out

Variable containing:

Columns 0 to 9 
 0.0718  0.0170 -0.0124  0.0873 -0.0355  0.0238  0.0207 -0.1071  0.0914 -0.0193

Columns 10 to 19 
-0.0449 -0.0220  0.0314 -0.0239  0.0117  0.0291  0.0102 -0.1325 -0.0243  0.0210

Columns 20 to 29 
-0.0958 -0.0412 -0.0721  0.1280 -0.0649  0.0328  0.0071 -0.0180  0.0723 -0.0227

Columns 30 to 39 
 0.0166  0.0243  0.0028 -0.1259  0.0542 -0.0516  0.0035  0.0489  0.0038  0.0288

Columns 40 to 49 
 0.0087 -0.0094  0.0105 -0.0717 -0.0045  0.0248  0.0597  0.0036 -0.1202 -0.0477

Columns 50 to 59 
 0.0273  0.0537 -0.1072  0.0944 -0.0132 -0.0244 -0.0301  0.0042 -0.0240  0.0526

Columns 60 to 69 
-0.0413  0.0178 -0.0824 -0.0099 -0.0428  0.0343  0.0424 -0.0864  0.0085  0.0890

Columns 70 to 79 
 0.1179  0.0024  0.0078  0.0530 -0.0092  0.0247 -0.0111 -0.1049 -0.0862  0.1223

Columns 80 to 85 
-0.1020  0.0353 -0.0573 -0.0429  0.1549  0.0165
[torch.FloatTensor of size 1x86]

Now with a batch of size 2

In [23]:
in1 = Variable(gpu(torch.from_numpy(np.array([x1[:2]]).astype(np.int64))).squeeze())
in2 = Variable(gpu(torch.from_numpy(np.array([x2[:2]]).astype(np.int64))).squeeze())
in3 = Variable(gpu(torch.from_numpy(np.array([x3[:2]]).astype(np.int64))).squeeze())

In [24]:
in1

Variable containing:
 40
 30
[torch.LongTensor of size 2]

In [25]:
out = rnn(in1,in2,in3)

In [26]:
out

Variable containing:

Columns 0 to 9 
 0.0718  0.0170 -0.0124  0.0873 -0.0355  0.0238  0.0207 -0.1071  0.0914 -0.0193
 0.0677  0.0192 -0.0184  0.0853 -0.0506  0.0336  0.0075 -0.1016  0.0968 -0.0191

Columns 10 to 19 
-0.0449 -0.0220  0.0314 -0.0239  0.0117  0.0291  0.0102 -0.1325 -0.0243  0.0210
-0.0444 -0.0152  0.0330 -0.0209  0.0175  0.0229  0.0075 -0.1290 -0.0200  0.0283

Columns 20 to 29 
-0.0958 -0.0412 -0.0721  0.1280 -0.0649  0.0328  0.0071 -0.0180  0.0723 -0.0227
-0.0944 -0.0347 -0.0732  0.1261 -0.0691  0.0320  0.0136 -0.0378  0.0662 -0.0092

Columns 30 to 39 
 0.0166  0.0243  0.0028 -0.1259  0.0542 -0.0516  0.0035  0.0489  0.0038  0.0288
 0.0079  0.0081 -0.0009 -0.1245  0.0452 -0.0495 -0.0010  0.0419  0.0054  0.0259

Columns 40 to 49 
 0.0087 -0.0094  0.0105 -0.0717 -0.0045  0.0248  0.0597  0.0036 -0.1202 -0.0477
 0.0004 -0.0177  0.0177 -0.0703 -0.0094  0.0334  0.0638  0.0185 -0.1140 -0.0572

Columns 50 to 59 
 0.0273  0.0537 -0.1072  0.0944 -0.0132 -0.0244 -0.0301  0.0042 -0.

In [27]:
rnn_loss = nn.CrossEntropyLoss()
lr = 0.000001
rnn_optimizer = torch.optim.Adam(rnn.parameters(),lr = lr)

In [28]:
def char2Var(ch):
    return Variable(gpu(torch.from_numpy(np.array([ch]).astype(np.int64))).squeeze())

In [29]:
def data_gen(c1,c2,c3,c4,batch_size=64,shuffle=True):
    if shuffle:
        index = np.random.permutation(c1.shape[0])
        c1 = c1[index]
        c2 = c2[index]
        c3 = c3[index]
        c4 = c4[index]
    for idx in range(0,c1.shape[0],batch_size):
        yield(c1[idx:idx+batch_size],c2[idx:idx+batch_size], c3[idx:idx+batch_size], c4[idx:idx+batch_size])

In [30]:
def train_model(c1,c2,c3,c4,model=rnn,epochs=1,train=True):
    if train:
        model.train()
    else:
        model.eval()
        
    for epoch in range(epochs):
        batches = data_gen(c1,c2,c3,c4)
        running_loss = 0.0
        for ch1,ch2,ch3,ch4 in batches:
            in1 = char2Var(ch1)#Variable(gpu(torch.from_numpy(np.array([ch1]).astype(np.int64))).squeeze())
            in2 = char2Var(ch2)#Variable(gpu(torch.from_numpy(np.array([ch2]).astype(np.int64))).squeeze())
            in3 = char2Var(ch3)#Variable(gpu(torch.from_numpy(np.array([ch3]).astype(np.int64))).squeeze())
            ou4 = char2Var(ch4)#Variable(gpu(torch.from_numpy(np.array([ch4]).astype(np.int64))).squeeze())
            
            out = rnn(in1,in2,in3)
            loss = rnn_loss(out,ou4)
            rnn_optimizer.zero_grad()
            loss.backward()
            rnn_optimizer.step()
            
            running_loss += loss.data[0]
            
        epoch_loss = running_loss / c1.shape[0]
        print('Loss: {:.4f}'.format(epoch_loss))            

In [31]:
%%time
train_model(x1[:4], x2[:4], x3[:4], y[:4])

Loss: 1.1113
CPU times: user 47.8 ms, sys: 3.01 ms, total: 50.8 ms
Wall time: 29.6 ms


In [32]:
%%time
train_model(x1, x2, x3, y)

Loss: 0.0685
CPU times: user 1min 20s, sys: 2.03 s, total: 1min 22s
Wall time: 28 s


In [33]:
lr = 0.01
rnn_optimizer = torch.optim.Adam(rnn.parameters(),lr = lr)

In [34]:
%%time
train_model(x1, x2, x3, y)

Loss: 0.0397
CPU times: user 1min 16s, sys: 1.62 s, total: 1min 18s
Wall time: 26.3 s


In [35]:
def get_next(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [char2Var(i)[np.newaxis] for i in idxs]
    out = rnn(arrs[0], arrs[1], arrs[2])
    i = np.argmax(out.data.numpy())
    return chars[i]

In [36]:
get_next(' th')

'e'

In [37]:
get_next(' an')

'd'

## RNN with pytorch

We will now use the Recurrent layers of pytorch. The documentation is [here](http://pytorch.org/docs/master/nn.html#rnn)

To understand it, we will use a very simple example taken from PyTorchZeroToAll [tutorials](https://github.com/hunkim/PyTorchZeroToAll/blob/master/12_1_rnn_basics.py)

In [38]:
import torch
import torch.nn as nn
from torch.autograd import Variable

# One hot encoding for each char in 'hello'
h = [1, 0, 0, 0]
e = [0, 1, 0, 0]
l = [0, 0, 1, 0]
o = [0, 0, 0, 1]

# One cell RNN input_dim (4) -> output_dim (2). sequence: 5
cell = nn.RNN(input_size=4, hidden_size=2, batch_first=True)

# (batch, num_layers * num_directions, hidden_size) for batch_first=True
hidden = (Variable(torch.randn(1, 1, 2)))

# Propagate input through RNN
# Input: (batch, seq_len, input_size) when batch_first=True
inputs = Variable(torch.Tensor([h, e, l, l, o]))
print('inputs size', inputs.size())
for one in inputs:
    one = one.view(1, 1, -1)
    # Input: (batch, seq_len, input_size) when batch_first=True
    out, hidden = cell(one, hidden)
    print("one input size", one.size(), "out size", out.size(), 'hidden size', hidden.size())

('inputs size', torch.Size([5, 4]))
('one input size', torch.Size([1, 1, 4]), 'out size', torch.Size([1, 1, 2]), 'hidden size', torch.Size([1, 1, 2]))
('one input size', torch.Size([1, 1, 4]), 'out size', torch.Size([1, 1, 2]), 'hidden size', torch.Size([1, 1, 2]))
('one input size', torch.Size([1, 1, 4]), 'out size', torch.Size([1, 1, 2]), 'hidden size', torch.Size([1, 1, 2]))
('one input size', torch.Size([1, 1, 4]), 'out size', torch.Size([1, 1, 2]), 'hidden size', torch.Size([1, 1, 2]))
('one input size', torch.Size([1, 1, 4]), 'out size', torch.Size([1, 1, 2]), 'hidden size', torch.Size([1, 1, 2]))


In [39]:
hidden = (Variable(torch.randn(1, 1, 2)))
# We can do the whole at once
# Propagate input through RNN
# Input: (batch, seq_len, input_size) when batch_first=True
inputs = inputs.view(1, 5, -1)
out, hidden = cell(inputs, hidden)
print("sequence input size", inputs.size(), "out size", out.size(), 'hidden size', hidden.size())

('sequence input size', torch.Size([1, 5, 4]), 'out size', torch.Size([1, 5, 2]), 'hidden size', torch.Size([1, 1, 2]))


In [40]:
# Strange batch number for hidden can be arbitrary?
hidden = (Variable(torch.randn(3, 1, 2)))

# One cell RNN input_dim (4) -> output_dim (2). sequence: 5, batch 3
# 3 batches 'hello', 'eolll', 'lleel'
inputs = Variable(torch.Tensor([[h, e, l, l, o],
                                [e, o, l, l, l],
                                [l, l, e, e, l]]))

# Propagate input through RNN
# Input: (batch, seq_len, input_size) when batch_first=True
# B x S x I
out, hidden = cell(inputs, hidden)
print("batch input size", inputs.size(), "out size", out.size(), 'hidden size', hidden.size())

('batch input size', torch.Size([3, 5, 4]), 'out size', torch.Size([3, 5, 2]), 'hidden size', torch.Size([1, 3, 2]))


There seems to be a bug: the hidden state has size (num_layers * num_directions, batch, hidden_size)?

We are now ready to build our new RNN with an arbitrary number of characters in input.

In [80]:
import torch.nn.functional as F

In [122]:
class new_RNN(nn.Module):
    def __init__(self, embedding_dim=42, vocab_size = 1, hidden_dim =256):
        super(new_RNN, self).__init__()
        
        self._embedding_dim = embedding_dim
        self._vocab_size = vocab_size
        self._hidden_dim = hidden_dim
               
        self.embeddings = ScaledEmbedding(vocab_size, embedding_dim)
        self.cell = nn.RNN(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)
        self.h2o = nn.Linear(hidden_dim, vocab_size)
        self.softmax = F.softmax
        
    def forward(self, c):
        #print(c.size(1))
        hidden = Variable(torch.zeros(c.size(0),1,self._hidden_dim))
        for i in range(c.size(1)):
            #print(self.embeddings(c[:,i]).size())
            #print(self.embeddings(c[:,i]).unsqueeze(1).size(),hidden.size())
            outp,hidden = self.cell(self.embeddings(c[:,i]).unsqueeze(1),hidden)
            #print(i, outp.size())
        output = self.h2o(outp)
#         output = self.softmax(output)
        return output

In [123]:
new_rnn = new_RNN(vocab_size=vocab_size)

In [124]:
inp = char2Var(idx[:8]).view(1,-1)

In [125]:
inp

Variable containing:
   40    42    29    30    25    27    29     1
[torch.LongTensor of size 1x8]

In [126]:
inp[:,0]

Variable containing:
 40
[torch.LongTensor of size 1]

In [127]:
emb = ScaledEmbedding(vocab_size, 30)

In [128]:
emb(inp[:,0]).unsqueeze(1).size()

torch.Size([1, 1, 30])

In [129]:
out = new_rnn(inp)

In [130]:
out

Variable containing:
(0 ,.,.) = 

Columns 0 to 8 
  -0.0037 -0.0466 -0.1427 -0.0551  0.0348  0.0384 -0.0661  0.0333  0.0169

Columns 9 to 17 
  -0.0306 -0.0655 -0.0020 -0.0534  0.0583 -0.0670  0.0251 -0.1005 -0.0304

Columns 18 to 26 
  -0.0505  0.0011 -0.0382 -0.0208 -0.0131  0.0861 -0.0114 -0.0634 -0.0473

Columns 27 to 35 
  -0.0225  0.0600 -0.0505 -0.0414 -0.0333 -0.0299  0.0343  0.0774  0.1006

Columns 36 to 44 
  -0.1005  0.0486 -0.0216 -0.0292 -0.0485  0.0431  0.0240  0.0827 -0.0338

Columns 45 to 53 
  -0.0799  0.0089  0.0333  0.0306 -0.0466 -0.0396  0.0446  0.0047  0.0131

Columns 54 to 62 
  -0.0511  0.0704  0.0324  0.0224 -0.0195 -0.0203  0.0074  0.0740  0.0607

Columns 63 to 71 
   0.0114 -0.0005  0.0579  0.0672 -0.0360 -0.0201  0.0246  0.0693 -0.0026

Columns 72 to 80 
   0.0118 -0.0309 -0.0331  0.0627 -0.0560  0.0022 -0.1064 -0.0071 -0.0088

Columns 81 to 85 
  -0.0442 -0.0814  0.0756  0.0332  0.0296
[torch.FloatTensor of size 1x1x86]

In [131]:
out.size()

torch.Size([1, 1, 86])

In [132]:
cs=8

In [133]:
c_in_dat = [[idx[i+n] for i in range(0, len(idx)-1-cs, cs)]
            for n in range(cs)]

In [134]:
np.shape(c_in_dat)

(8, 75112)

In [135]:
c_out_dat = [idx[i+cs] for i in range(0, len(idx)-1-cs, cs)]

In [136]:
xs = [np.stack(c[:-2]) for c in c_in_dat]

In [137]:
len(xs), xs[0].shape

(8, (75110,))

In [138]:
y = np.stack(c_out_dat[:-2])

In [139]:
np.shape(y)

(75110,)

In [140]:
[xs[n][:5] for n in range(cs)]

[array([40,  1, 33,  2, 72]),
 array([42,  1, 38, 44,  2]),
 array([29, 43, 31, 71, 54]),
 array([30, 45,  2, 74,  2]),
 array([25, 40, 73, 73, 76]),
 array([27, 40, 61, 61, 68]),
 array([29, 39, 54,  2, 66]),
 array([ 1, 43, 73, 62, 54])]

In [141]:
y[:5]

array([ 1, 33,  2, 72, 67])

In [142]:
idx[:16]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1, 43, 45, 40, 40, 39, 43]

In [143]:
inp = char2Var([xs[n][:2] for n in range(cs)]).permute(1,0)

In [144]:
inp

Variable containing:
   40    42    29    30    25    27    29     1
    1     1    43    45    40    40    39    43
[torch.LongTensor of size 2x8]

In [145]:
out = new_rnn(inp)

In [174]:
_x,_y = next(new_data_gen(np.asarray(xs),y))
xs

[array([40,  1, 33, ..., 72, 71, 61]),
 array([42,  1, 38, ..., 73, 65, 58]),
 array([29, 43, 31, ..., 62, 57,  2]),
 array([30, 45,  2, ..., 54,  2, 62]),
 array([25, 40, 73, ..., 67, 54, 67]),
 array([27, 40, 61, ...,  2, 72, 57]),
 array([29, 39, 54, ..., 76,  2, 62]),
 array([ 1, 43, 73, ..., 68, 73, 56])]

In [146]:
def new_data_gen(ch,y,batch_size=64,shuffle=True):
    if shuffle:
        index = np.random.permutation(ch.shape[0])
        ch = ch[index,:]
        y = y[index]
    for idx in range(0,ch.shape[0],batch_size):
        yield(ch[idx:idx+batch_size,:], y[idx:idx+batch_size])

In [175]:
def new_train_model(ch,y,model=new_rnn,epochs=10,train=True):
    if train:
        model.train()
    else:
        model.eval()
        
    for epoch in range(epochs):
        batches = new_data_gen(ch,y)
        running_loss = 0.0
        for ch,y in batches:
            inp = char2Var(ch)
            o = char2Var(y)
            
            out = new_rnn(inp)
            #print(out.squeeze().size(), o.size())
            loss = new_rnn_loss(out.squeeze(),o)
            new_rnn_optimizer.zero_grad()
            loss.backward()
            new_rnn_optimizer.step()
            
            running_loss += loss.data[0]
            
        epoch_loss = running_loss / ch.shape[0]
        print('Loss: {:.4f}'.format(epoch_loss))            

In [176]:
new_rnn_loss = nn.CrossEntropyLoss()
lr = 0.000001
new_rnn_optimizer = torch.optim.Adam(new_rnn.parameters(),lr = lr)

In [177]:
%%time
new_train_model(np.transpose([xs[n][:2] for n in range(cs)]),y[:2])

Loss: 2.3691
Loss: 2.3683
Loss: 2.3676
Loss: 2.3668
Loss: 2.3660
Loss: 2.3652
Loss: 2.3644
Loss: 2.3637
Loss: 2.3629
Loss: 2.3621
CPU times: user 443 ms, sys: 16 ms, total: 459 ms
Wall time: 158 ms


In [178]:
%%time
new_train_model(np.transpose(xs),y)

Loss: 86.0031
Loss: 0.0826
Loss: 0.0826
Loss: 0.0826
Loss: 0.0826
Loss: 0.0826
Loss: 0.0826
Loss: 0.0826
Loss: 0.0826
Loss: 0.0826
CPU times: user 1min 19s, sys: 1.82 s, total: 1min 21s
Wall time: 27.2 s


In [184]:
lr = 0.01
new_rnn_optimizer = torch.optim.Adam(new_rnn.parameters(),lr = lr)

In [185]:
%%time
new_train_model(np.transpose([xs[n] for n in range(cs)]),y)

Loss: 86.9620
Loss: 0.0651
Loss: 0.0516
Loss: 0.0391
Loss: 0.0300
Loss: 0.0236
Loss: 0.0175
Loss: 0.0130
Loss: 0.0097
Loss: 0.0070
CPU times: user 1min 20s, sys: 1.96 s, total: 1min 22s
Wall time: 27.8 s


In [181]:
def new_get_next(inp):
    idxs = [char_indices[c] for c in inp]
    #print(idxs)
    #arrs = [char2Var(i)[np.newaxis] for i in idxs]
    arrs = char2Var(idxs).unsqueeze(0)
    #print(arrs)
    out = new_rnn(arrs)
#     print out
    i = np.argmax(out.data.numpy())
    return chars[i]

In [182]:
new_get_next('th')

' '

In [183]:
new_get_next('bues')

' '

# Exercise

As stated during the course, this code is very preliminary and does not run on GPU. Fix it!

Also instead of RNN, use [GRU](http://pytorch.org/docs/master/nn.html#torch.nn.GRU)