# Recurrent neural networks

In the previous module, we used rich **_semantic representations of text_** with a simple linear classifier on top of the embeddings. This architecture captures the aggregated meaning of the words in a sentence, but it does not take the **order** of words into account, because the aggregation operation on top of the embeddings removes this information from the original text. Because these models cannot represent word order, they cannot solve more complex or ambiguous tasks such as text generation or question answering.

To capture the meaning of a text sequence, we need another neural network architecture, called a **recurrent neural network**, or RNN. In an RNN, we pass our sentence (here, a news article) through the network one word vector at a time, and the network produces some **state**, which we then pass to the network again together with the next word vector from the sequence. Storing a "memory" of the previous words in the state helps the network understand the **_context of the sentence_**, so that it can, for example, predict the next word in the sequence.

<img alt="Image showing an example recurrent neural network generation." src="images/5-recurrent-networks-1.png" align="middle" />

- Given the input sequence of word vectors $X_0,\dots,X_n$, RNN creates a sequence of neural network blocks, and trains this sequence end-to-end using back propagation. 
- Each network block takes a pair $(X_i,h_i)$ as an input, and produces $h_{i+1}$ as a result. 
- Final state $h_n$ or output $y$ goes into a linear classifier to produce the result. 
- All network blocks share the same weights, and are trained end-to-end using one backpropagation pass.

The hidden state, which combines the current input with the prior state, and the output are calculated with the following formulas:

- $h_t = \tanh(W_{h}h_{t-1} + W_{x}x_{t} + B_{h})$
- $y_t = W_{y}h_{t} + B_{y}$
- $\tanh$ is the hyperbolic tangent function, which is defined as $\tanh(x) = {\large \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}}$

At each network block, the weights $W_{x}$ are applied to the numeric word vector input, the weights $W_{h}$ to the previous hidden state, and the weights $W_{y}$ to the new hidden state to produce the output. The $\tanh$ activation function is applied to the hidden layer to keep its values in the range $(-1,1)$.
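
To make the recurrence concrete, here is a minimal sketch that applies the hidden-state formula by hand and compares it with PyTorch's `RNNCell`. The dimensions and the random word vectors are made up for illustration. Note that `RNNCell` keeps two bias vectors, so $B_h$ corresponds to their sum.

```python
import torch

torch.manual_seed(0)
embed_dim, hidden_dim, seq_len = 4, 3, 5   # illustrative sizes, not from the dataset

# RNNCell uses tanh by default and stores W_x as weight_ih and W_h as weight_hh.
cell = torch.nn.RNNCell(embed_dim, hidden_dim)
xs = torch.randn(seq_len, embed_dim)       # made-up word vectors

h_manual = torch.zeros(hidden_dim)         # initial state
h_cell = torch.zeros(1, hidden_dim)        # same state, with a batch dimension

with torch.no_grad():
    for x_t in xs:
        # h_t = tanh(W_h h_{t-1} + W_x x_t + B_h), same weights at every step
        h_manual = torch.tanh(
            cell.weight_hh @ h_manual + cell.weight_ih @ x_t
            + cell.bias_hh + cell.bias_ih
        )
        h_cell = cell(x_t.unsqueeze(0), h_cell)

# Both computations should agree up to floating-point error
print(torch.allclose(h_manual, h_cell[0], atol=1e-6))
```

The loop reuses one set of weights for every word, which is what "all network blocks share the same weights" means in practice.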

Because state vectors $h_0,\dots,h_n$ are passed through the network, it is able to learn the sequential dependencies between words. For example, when the word **_not_** appears somewhere in the sequence, it can learn to negate certain elements within the state vector, resulting in negation.  

Let's see how recurrent neural networks can help us classify our news dataset.

In [1]:
!pip install -r https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/nlp-pytorch/requirements.txt
!pip install torchinfo
!wget -q https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/nlp-pytorch/torchnlp.py


In [2]:
import torch
import torchtext
from torchinfo import summary
from torchnlp import *
train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_size = len(vocab)

Loading dataset...


train.csv: 29.5MB [00:00, 120MB/s]
test.csv: 1.86MB [00:00, 96.4MB/s]                  


Building vocab...


## Simple RNN classifier

In a simple RNN, each recurrent unit is a small network with one linear layer followed by a `tanh` activation: it takes the concatenated input vector and state vector and produces a new state vector. PyTorch represents this unit with the `RNNCell` class, and a network of such cells with the `RNN` layer.

To define an RNN classifier, we first apply an embedding layer to lower the dimensionality of the input vocabulary, and then put an RNN layer on top of it:

In [3]:
class RNNClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.rnn = torch.nn.RNN(embed_dim,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

    def forward(self, x):
        batch_size = x.size(0)
        x = self.embedding(x)
        x,h = self.rnn(x)
        return self.fc(x.mean(dim=1))

> **Note:** We use an untrained embedding layer here for simplicity, but for even better results we can use a pre-trained embedding layer with Word2Vec or GloVe embeddings, as described in the previous unit. For better understanding, you might want to adapt this code to work with pre-trained embeddings.
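
As a starting point for that adaptation, the sketch below shows one way to build an embedding layer from an existing weight matrix with `torch.nn.Embedding.from_pretrained`. The `pretrained` tensor here is a random placeholder, not real Word2Vec or GloVe data; in practice its rows would have to be aligned with the indices of `vocab`.

```python
import torch

vocab_size, embed_dim = 10, 4   # illustrative sizes

# Placeholder for a real (vocab_size x embed_dim) matrix of pre-trained vectors
pretrained = torch.randn(vocab_size, embed_dim)

# freeze=False allows the vectors to be fine-tuned during training;
# freeze=True (the default) keeps them fixed.
embedding = torch.nn.Embedding.from_pretrained(pretrained, freeze=False)

tokens = torch.tensor([[1, 2, 3]])   # one sequence of three word indices
print(embedding(tokens).shape)       # torch.Size([1, 3, 4])
```

Inside `RNNClassifier`, such a layer could replace the `torch.nn.Embedding(vocab_size, embed_dim)` line, provided `embed_dim` matches the width of the pre-trained vectors.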

In our case, we will use a padded data loader, so each batch contains a number of padded sequences of the same length. The data flows through the network as follows:

* The `input` to the embedding layer is the word sequence (a news article), encoded as the vocabulary index of each word.
* The `embedding layer` output contains a dense vector for each word index in the sequence.

The RNN layer takes this sequence of embedding tensors and produces two outputs:

* $x$ is the sequence of RNN cell outputs at each step.
* $h$ is the final `hidden state` for the last element of the sequence. As each word passes through the layer, the hidden state is updated from the prior state and the current word, so it summarizes the sequence seen so far.
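
The shapes of these tensors can be checked with a small sketch. The sizes and the random token indices below are made up; only the layer configuration (`batch_first=True`) follows the classifier above.

```python
import torch

torch.manual_seed(0)
batch, seq_len, vocab_size, embed_dim, hidden_dim = 2, 7, 50, 8, 4  # illustrative

emb = torch.nn.Embedding(vocab_size, embed_dim)
rnn = torch.nn.RNN(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (batch, seq_len))  # word indices
vectors = emb(tokens)                                    # dense word vectors
x, h = rnn(vectors)

print(tokens.shape)   # torch.Size([2, 7])
print(vectors.shape)  # torch.Size([2, 7, 8])
print(x.shape)        # torch.Size([2, 7, 4])  one output per step
print(h.shape)        # torch.Size([1, 2, 4])  final state per sequence

# For a single-layer, one-directional RNN the last output equals the final state
print(torch.allclose(x[:, -1], h[0]))
```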

We then apply a fully-connected linear classifier to the RNN outputs averaged over all time steps (`x.mean(dim=1)`) to get a score for each class.

> **Note:** RNNs are quite difficult to train, because once the RNN cells are unrolled along the sequence length, the number of layers involved in back propagation is quite large. Thus we need to select a small learning rate and train the network on a larger dataset to produce good results. This can take quite a long time, so using a GPU is preferred.

In [4]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=padify, shuffle=True)
net = RNNClassifier(vocab_size,64,32,len(classes)).to(device)
train_epoch(net,train_loader, lr=0.001)

3200: acc=0.311875
6400: acc=0.3775
9600: acc=0.433125
12800: acc=0.47953125
16000: acc=0.5185
19200: acc=0.5518229166666667
22400: acc=0.58
25600: acc=0.6065625
28800: acc=0.6254166666666666
32000: acc=0.6430625
35200: acc=0.6582954545454546
38400: acc=0.6727604166666666
41600: acc=0.6848076923076923
44800: acc=0.6960491071428572
48000: acc=0.7058333333333333
51200: acc=0.7148828125
54400: acc=0.7232536764705882
57600: acc=0.7298263888888888
60800: acc=0.7368092105263158
64000: acc=0.743109375
67200: acc=0.7488690476190476
70400: acc=0.7538920454545455
73600: acc=0.759266304347826
76800: acc=0.7635546875
80000: acc=0.7676
83200: acc=0.7715745192307693
86400: acc=0.7755208333333333
89600: acc=0.7794977678571429
92800: acc=0.7827155172413793
96000: acc=0.7858541666666666
99200: acc=0.7892842741935484
102400: acc=0.792138671875
105600: acc=0.7949242424242424
108800: acc=0.7975643382352942
112000: acc=0.80025
115200: acc=0.80265625
118400: acc=0.8050168918918919


(0.03318536987304688, 0.8061916666666666)

> The training run above took 49 s.

In [5]:
train_epoch(net,train_loader, lr=0.001)

3200: acc=0.9021875
6400: acc=0.905625
9600: acc=0.9040625
12800: acc=0.90703125
16000: acc=0.9061875
19200: acc=0.9067708333333333
22400: acc=0.906875
25600: acc=0.90671875
28800: acc=0.9071875
32000: acc=0.9075625
35200: acc=0.907528409090909
38400: acc=0.9072135416666667
41600: acc=0.9076442307692307
44800: acc=0.90765625
48000: acc=0.9078333333333334
51200: acc=0.9084375
54400: acc=0.9087316176470588
57600: acc=0.9089930555555555
60800: acc=0.9093092105263157
64000: acc=0.9099375
67200: acc=0.9099702380952381
70400: acc=0.9099573863636363
73600: acc=0.9097554347826087
76800: acc=0.9099088541666667
80000: acc=0.9102375
83200: acc=0.9103245192307692
86400: acc=0.9104513888888889
89600: acc=0.9104129464285714
92800: acc=0.9107974137931034
96000: acc=0.91115625
99200: acc=0.9112600806451613
102400: acc=0.911328125
105600: acc=0.9114962121212121
108800: acc=0.9117279411764706
112000: acc=0.9119285714285714
115200: acc=0.9122569444444445
118400: acc=0.9124155405405405


(0.016861434936523437, 0.9125666666666666)

Now, let's load the test dataset to evaluate the trained RNN model.  We'll be using the 4 different classes of the news categories to map the predicted output with the targeted label.

In [6]:
print(f'class map: {classes}')

test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, collate_fn=padify, shuffle=True)

class map: ['World', 'Sports', 'Business', 'Sci/Tech']


Before we evaluate the model, we'll extract a padded vector from the data loader. We will use the `vocab.itos` lookup to convert each numeric index to the word it matches in the vocabulary. When padded vectors are converted from numbers to strings, the padding `0` values map to the special token `<unk>`, the unknown-word identifier. So we remove these tokens before printing the text.

Finally, we’ll run the model with our test dataset to verify if the expected output matched the predicted.

In [7]:
net.eval()

RNNClassifier(
  (embedding): Embedding(95812, 64)
  (rnn): RNN(64, 32, batch_first=True)
  (fc): Linear(in_features=32, out_features=4, bias=True)
)

In [8]:
with torch.no_grad():
    for target, data in test_loader:
        # Inspect only the first sequence of the first batch
        word_lookup = [vocab.itos[w] for w in data[0]]
        # Padding zeros are decoded as '<unk>', so drop them before printing
        unknown_vals = {'<unk>'}
        word_lookup = [ele for ele in word_lookup if ele not in unknown_vals]
        print('Input text:\n {}\n'.format(word_lookup))

        data, target = data.to(device), target.to(device)
        pred = net(data)
        # Index of the highest-scoring class for the first sequence
        pred_class = torch.argmax(pred[0])
        print(pred_class)
        print("Actual:\nvalue={}, class_name= {}\n".format(target[0], classes[target[0]]))
        print("Predicted:\nvalue={}, class_name= {}\n".format(pred_class, classes[pred_class]))
        break

Input text:
 ['junior', 'swears', 'by', 'win', 'at', 'talladega', 'dale', 'earnhardt', 'jr', '.', 'went', 'from', '11th', 'on', 'a', 'restart', 'on', 'lap', '184', 'to', 'first', 'less', 'than', 'two', 'laps', 'later', 'to', 'win', 'the', 'ea', 'sports', '500', '.', 'he', 'led', 'nine', 'times', 'for', '78', 'laps', '.']

tensor(1, device='cuda:0')
Actual:
value=1, class_name= Sports

Predicted:
value=1, class_name= Sports



## Long Short Term Memory (LSTM)

One of the main problems of classical RNNs is the so-called **vanishing gradients** problem. Because RNNs are trained end-to-end in one back-propagation pass, they have a hard time propagating the error to the first layers of the network, and thus the network cannot learn relationships between distant tokens. The gradient is what adjusts the weights during back-propagation to improve accuracy and reduce the error. If the gradients are too small, the weights barely change and the network does not learn. Since the gradient shrinks as it is propagated back through the time steps of an RNN, the network learns little from the initial inputs. In other words, the network "forgets" the earlier word inputs.
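
The effect can be observed directly. The following sketch, which uses made-up sizes and random inputs rather than the news dataset, back-propagates from the final state of an untrained `torch.nn.RNN` and compares how much gradient reaches the first and the last input step.

```python
import torch

torch.manual_seed(0)
seq_len, embed_dim, hidden_dim = 100, 8, 32   # illustrative sizes
rnn = torch.nn.RNN(embed_dim, hidden_dim, batch_first=True)

# Random stand-in for one embedded sequence; requires_grad lets us inspect
# the gradient that reaches each time step.
x = torch.randn(1, seq_len, embed_dim, requires_grad=True)
_, h = rnn(x)
h.sum().backward()   # gradient of the final state w.r.t. every input step

grad_norms = x.grad[0].norm(dim=1)   # one gradient norm per time step
print(f"first step: {grad_norms[0].item():.3e}")
print(f"last step:  {grad_norms[-1].item():.3e}")
```

With PyTorch's default initialization, the gradient at the first step should come out many orders of magnitude smaller than at the last step, which is why the early words have almost no influence on training.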

One way to avoid this problem is to introduce **explicit state management** by using so-called **gates**. The two best-known architectures of this kind are **Long Short Term Memory** (LSTM) and **Gated Recurrent Unit** (GRU).

<img alt="Image showing an example long short term memory cell" src="images/5-recurrent-networks-2.png" align="middle" />


An LSTM network is organized in a manner similar to an RNN, but two states are passed from layer to layer: the `actual state` $c$ (also called the cell state) and the `hidden vector` $h$.
- At each unit, the hidden vector $h_i$ is concatenated with the input $x_i$, and together they control what happens to the state $c$ via **gates**.
- Each gate is a neural network with `sigmoid` (${\sigma}$) activation (output in the range $[0,1]$), which can be thought of as a bitwise mask when multiplied by the state vector.

There are the following gates (from left to right on the picture above):
* **forget gate** takes the hidden vector and the input and determines which components of the vector $c$ we need to forget and which to pass through. Components whose sigmoid value is close to zero are thrown out. The formula is $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$.
* **input gate** takes some information from the input and hidden vector and inserts it into the state. It combines a gate $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, which decides which components to update, with the candidate values $\tilde{C_t} = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$.
* **output gate** passes the state through a `tanh` activation, then selects some of its components using the hidden vector and the input to produce the new hidden vector $h_t$. The formula for the output gate is $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, and the new hidden vector is $h_t = o_t \circ \tanh(C_t)$.
* **cell state** is updated by multiplying the previous cell state by the forget gate, and then adding the product of the input gate and the candidate values: $C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C_t}$.
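
The gate equations can be checked against PyTorch with a short sketch. It computes one LSTM step by hand and compares the result with `torch.nn.LSTMCell`. The sizes and inputs are made up. PyTorch stores the weights for the input and the hidden vector separately (`weight_ih`, `weight_hh`), which is equivalent to one matrix applied to the concatenated vector, and it stacks the four blocks in the order input gate, forget gate, candidate, output gate.

```python
import torch

torch.manual_seed(0)
embed_dim, hidden_dim = 4, 3   # illustrative sizes
cell = torch.nn.LSTMCell(embed_dim, hidden_dim)

x_t = torch.randn(1, embed_dim)      # current input
h_prev = torch.randn(1, hidden_dim)  # previous hidden vector
c_prev = torch.randn(1, hidden_dim)  # previous cell state

with torch.no_grad():
    # All four pre-activations at once: W * [h_{t-1}, x_t] + b
    gates = (x_t @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i_pre, f_pre, g_pre, o_pre = gates.chunk(4, dim=1)

    i_t = torch.sigmoid(i_pre)       # input gate
    f_t = torch.sigmoid(f_pre)       # forget gate
    c_tilde = torch.tanh(g_pre)      # candidate values
    o_t = torch.sigmoid(o_pre)       # output gate

    c_t = f_t * c_prev + i_t * c_tilde   # new cell state
    h_t = o_t * torch.tanh(c_t)          # new hidden vector

    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_t, h_ref, atol=1e-6), torch.allclose(c_t, c_ref, atol=1e-6))
```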

Components of the state $c$ can be thought of as flags that can be switched on and off. For example, when we encounter the name *'Alice'* in the sequence, we may want to assume that it refers to a female character, and raise the flag in the state that we have a female noun in the sentence. When we further encounter the phrase *and Tom*, we will raise the flag that we have a plural noun. Thus, by manipulating the state, we can supposedly keep track of the grammatical properties of sentence parts.

> **Note**: A great resource for understanding the internals of LSTM is the article "Understanding LSTM Networks" by Christopher Olah.

While the internal structure of an LSTM cell may look complex, PyTorch hides this implementation inside the `LSTMCell` class, and provides the `LSTM` object to represent the whole LSTM layer. Thus, the implementation of an LSTM classifier is very similar to the simple RNN we have seen above:

In [9]:
class LSTMClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight.data = torch.randn_like(self.embedding.weight.data)-0.5
        self.rnn = torch.nn.LSTM(embed_dim,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

    def forward(self, x):
        batch_size = x.size(0)
        x = self.embedding(x)
        x,(h,c) = self.rnn(x)
        return self.fc(h[-1])

Now let's train our network. Note that training an LSTM is also quite slow, and you may not see much of a rise in accuracy at the beginning of training. Also, you may need to experiment with the `lr` learning rate parameter to find a value that results in reasonable training speed.

In [10]:
net = LSTMClassifier(vocab_size,64,32,len(classes)).to(device)
train_epoch(net,train_loader, lr=0.001)

3200: acc=0.2509375
6400: acc=0.2490625
9600: acc=0.24979166666666666
12800: acc=0.250859375
16000: acc=0.2668125
19200: acc=0.29104166666666664
22400: acc=0.3114732142857143
25600: acc=0.3290625
28800: acc=0.34479166666666666
32000: acc=0.35771875
35200: acc=0.3721306818181818
38400: acc=0.3857552083333333
41600: acc=0.4029326923076923
44800: acc=0.4202901785714286
48000: acc=0.43747916666666664
51200: acc=0.4533203125
54400: acc=0.4702205882352941
57600: acc=0.48602430555555554
60800: acc=0.5014309210526315
64000: acc=0.516046875
67200: acc=0.5305803571428571
70400: acc=0.5439772727272727
73600: acc=0.55625
76800: acc=0.5673567708333334
80000: acc=0.5786125
83200: acc=0.58921875
86400: acc=0.5989814814814814
89600: acc=0.6084263392857143
92800: acc=0.6168211206896552
96000: acc=0.62484375
99200: acc=0.6327822580645162
102400: acc=0.64017578125
105600: acc=0.6475094696969697
108800: acc=0.65421875
112000: acc=0.6604464285714285
115200: acc=0.6663020833333333
118400: acc=0.672035472972

(0.0438391357421875, 0.6748916666666667)

In [11]:
train_epoch(net,train_loader, lr=0.0005)

3200: acc=0.9009375
6400: acc=0.90046875
9600: acc=0.903125
12800: acc=0.902890625
16000: acc=0.9010625
19200: acc=0.90125
22400: acc=0.9033482142857143
25600: acc=0.9021484375
28800: acc=0.9012152777777778
32000: acc=0.901375
35200: acc=0.9014772727272727
38400: acc=0.9016927083333334
41600: acc=0.901826923076923
44800: acc=0.903125
48000: acc=0.9026041666666667
51200: acc=0.9030078125
54400: acc=0.9030698529411765
57600: acc=0.9027083333333333
60800: acc=0.9035032894736842
64000: acc=0.90371875
67200: acc=0.9041071428571429
70400: acc=0.9045880681818181
73600: acc=0.9048369565217391
76800: acc=0.9053255208333333
80000: acc=0.9055625
83200: acc=0.9056129807692308
86400: acc=0.9058912037037037
89600: acc=0.9056584821428572
92800: acc=0.9058405172413793
96000: acc=0.90584375
99200: acc=0.9063104838709677
102400: acc=0.906357421875
105600: acc=0.9065625
108800: acc=0.9067095588235294
112000: acc=0.9069553571428571
115200: acc=0.9071440972222222
118400: acc=0.9073902027027027


(0.01693394571940104, 0.9073916666666667)

## Packed sequences

In our example, we had to pad all sequences in the minibatch with zeros. While this wastes some memory, the bigger issue with RNNs is that additional RNN cells are created for the padded input items; these take part in training, yet do not carry any important input information. It would be much better to train the RNN only on the actual sequence length.

To do that, PyTorch introduces a special storage format, the packed sequence. Suppose we have an input padded minibatch which looks like this:
```
[[1,2,3,4,5],
 [6,7,8,0,0],
 [9,0,0,0,0]]
```
Here 0 represents padded values, and the actual length vector of input sequences is `[5,3,1]`.

In order to train the RNN efficiently on padded sequences, we want to begin training the first group of RNN cells with a large minibatch (`[1,6,9]`), but then end processing of the third sequence and continue training with shorter minibatches (`[2,7]`, `[3,8]`), and so on. Thus, a packed sequence is represented as one vector, in our case `[1,6,9,2,7,3,8,4,5]`, plus the length vector (`[5,3,1]`), from which we can easily reconstruct the original padded minibatch.

To produce a packed sequence, we can use the `torch.nn.utils.rnn.pack_padded_sequence` function. All recurrent layers, including RNN, LSTM and GRU, support packed sequences as input, and produce packed output, which can be decoded using `torch.nn.utils.rnn.pad_packed_sequence`.
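
The round trip can be tried on the small minibatch from above. This is a minimal sketch that uses float values so it also works as RNN input. Note that PyTorch's `PackedSequence` stores per-step batch sizes (`batch_sizes`) rather than the length vector itself; the two carry the same information.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

padded = torch.tensor([[1., 2., 3., 4., 5.],
                       [6., 7., 8., 0., 0.],
                       [9., 0., 0., 0., 0.]])
lengths = torch.tensor([5, 3, 1])   # actual sequence lengths, kept on the CPU

# The lengths are already sorted in decreasing order, as the default requires
packed = pack_padded_sequence(padded, lengths, batch_first=True)
print(packed.data)         # tensor([1., 6., 9., 2., 7., 3., 8., 4., 5.])
print(packed.batch_sizes)  # tensor([3, 2, 2, 1, 1])

# Decoding restores the padded minibatch and the length vector
restored, restored_lengths = pad_packed_sequence(packed, batch_first=True)
print(torch.equal(restored, padded))  # True
print(restored_lengths)               # tensor([5, 3, 1])
```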

To be able to produce a packed sequence, we need to pass the length vector to the network, and thus we need a different function to prepare minibatches:

In [12]:
def pad_length(b):
    # build vectorized sequence
    v = [encode(x[1]) for x in b]
    # compute max length of a sequence in this minibatch and length sequence itself
    len_seq = list(map(len,v))
    l = max(len_seq)
    return ( # tuple of three tensors - labels, padded features, length sequence
        torch.LongTensor([t[0]-1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=0) for t in v]),
        torch.tensor(len_seq)
    )

train_loader_len = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=pad_length, shuffle=True)
test_loader_len = torch.utils.data.DataLoader(test_dataset, batch_size=16, collate_fn=pad_length, shuffle=True)

The actual network is very similar to `LSTMClassifier` above, but the `forward` pass receives both the padded minibatch and the vector of sequence lengths. After computing the embedding, we compute the packed sequence and pass it to the LSTM layer.

> **Note**: We do not unpack the LSTM output here, because the following computations use only the final hidden state `h`. If you need the per-step network output in further computations, you can unpack it back into a padded tensor with `torch.nn.utils.rnn.pad_packed_sequence`.

In [13]:
class LSTMPackClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight.data = torch.randn_like(self.embedding.weight.data)-0.5
        self.rnn = torch.nn.LSTM(embed_dim,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

    def forward(self, x, lengths):
        batch_size = x.size(0)
        x = self.embedding(x)
        pad_x = torch.nn.utils.rnn.pack_padded_sequence(x,lengths,batch_first=True,enforce_sorted=False)
        _,(h,c) = self.rnn(pad_x)
        return self.fc(h[-1])
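
To see what packing changes, the sketch below runs the same untrained LSTM on a small batch twice, once packed and once as a plain padded tensor. The sizes and the random stand-in for the embedded tokens are assumptions made for illustration.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
embed_dim, hidden_dim = 4, 3                 # illustrative sizes
lstm = torch.nn.LSTM(embed_dim, hidden_dim, batch_first=True)

lengths = torch.tensor([5, 3, 1])            # kept on the CPU
x = torch.randn(3, 5, embed_dim)             # stand-in for embedded tokens
for row, n in enumerate(lengths.tolist()):
    x[row, n:] = 0                           # zero out the padded positions

with torch.no_grad():
    packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
    packed_out, (h_packed, _) = lstm(packed)
    # Optional: unpack to get the per-step outputs as a padded tensor
    out, _ = pad_packed_sequence(packed_out, batch_first=True)
    # Same batch without packing: the LSTM also steps over the padding
    _, (h_padded, _) = lstm(x)

# Output at the last real token of each sequence
last_real = out[torch.arange(3), lengths - 1]
print(torch.allclose(h_packed[-1], last_real, atol=1e-6))           # packed state vs last real step
print(torch.allclose(h_packed[-1][0], h_padded[-1][0], atol=1e-6))  # full-length row
print(torch.allclose(h_packed[-1][2], h_padded[-1][2], atol=1e-6))  # heavily padded row
```

With packing, the final hidden state of each sequence should match the output at its last real token. Without packing, only the full-length sequence should give the same state, because the shorter ones keep being updated on padding.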

Now let's train our network with packed sequences:

In [14]:
net = LSTMPackClassifier(vocab_size,64,32,len(classes)).to(device)
train_epoch_emb(net,train_loader_len, lr=0.001,use_pack_sequence=True)


3200: acc=0.289375
6400: acc=0.345625
9600: acc=0.40125
12800: acc=0.45515625
16000: acc=0.5005
19200: acc=0.539375
22400: acc=0.5726339285714286
25600: acc=0.5994921875
28800: acc=0.6219097222222222
32000: acc=0.64159375
35200: acc=0.6592045454545454
38400: acc=0.6746875
41600: acc=0.6877403846153847
44800: acc=0.6992410714285714
48000: acc=0.7097083333333334
51200: acc=0.71953125
54400: acc=0.7288235294117648
57600: acc=0.7359201388888889
60800: acc=0.7428289473684211
64000: acc=0.74953125
67200: acc=0.7555803571428571
70400: acc=0.7615340909090909
73600: acc=0.7664809782608696
76800: acc=0.7710026041666667
80000: acc=0.7755
83200: acc=0.7797355769230769
86400: acc=0.7839699074074075
89600: acc=0.7873549107142858
92800: acc=0.790614224137931
96000: acc=0.79371875
99200: acc=0.7969254032258064
102400: acc=0.80029296875
105600: acc=0.8033617424242424
108800: acc=0.8062224264705883
112000: acc=0.8089285714285714
115200: acc=0.8114930555555555
118400: acc=0.8135304054054054


(0.0300931640625, 0.8146666666666667)

In [15]:
train_epoch_emb(net,train_loader_len, lr=0.0005,use_pack_sequence=True)

3200: acc=0.9215625
6400: acc=0.9196875
9600: acc=0.9190625
12800: acc=0.919921875
16000: acc=0.9205
19200: acc=0.9219270833333333
22400: acc=0.9227678571428571
25600: acc=0.9225
28800: acc=0.9232638888888889
32000: acc=0.9236875
35200: acc=0.9234659090909091
38400: acc=0.923828125
41600: acc=0.924735576923077
44800: acc=0.9254464285714286
48000: acc=0.9250416666666667
51200: acc=0.92478515625
54400: acc=0.9247794117647059
57600: acc=0.9247222222222222
60800: acc=0.9247203947368421
64000: acc=0.924875
67200: acc=0.9250744047619047
70400: acc=0.9251136363636364
73600: acc=0.9248505434782609
76800: acc=0.9249348958333333
80000: acc=0.925225
83200: acc=0.9250841346153846
86400: acc=0.9250694444444445
89600: acc=0.925234375
92800: acc=0.9251400862068966
96000: acc=0.92525
99200: acc=0.9253225806451613
102400: acc=0.9254296875
105600: acc=0.9257102272727272
108800: acc=0.9259191176470588
112000: acc=0.9258928571428572
115200: acc=0.9260416666666667
118400: acc=0.9261402027027027


(0.013871371459960938, 0.9260416666666667)

In [16]:
train_epoch_emb(net,train_loader_len, lr=0.0001,use_pack_sequence=True)

3200: acc=0.9409375
6400: acc=0.9403125
9600: acc=0.9429166666666666
12800: acc=0.943046875
16000: acc=0.942
19200: acc=0.9425520833333333
22400: acc=0.9429017857142857
25600: acc=0.9438671875
28800: acc=0.9439583333333333
32000: acc=0.94421875
35200: acc=0.9439772727272727
38400: acc=0.943671875
41600: acc=0.9438942307692307
44800: acc=0.9440401785714285
48000: acc=0.9437708333333333
51200: acc=0.9435546875
54400: acc=0.9438419117647059
57600: acc=0.9440104166666666
60800: acc=0.94375
64000: acc=0.94378125
67200: acc=0.9438095238095238
70400: acc=0.9438920454545454
73600: acc=0.943695652173913
76800: acc=0.9436848958333334
80000: acc=0.9440125
83200: acc=0.9437860576923077
86400: acc=0.9435300925925926
89600: acc=0.9436941964285714
92800: acc=0.9436099137931034
96000: acc=0.9433020833333333
99200: acc=0.9433165322580646
102400: acc=0.94341796875
105600: acc=0.9434659090909091
108800: acc=0.9433639705882353
112000: acc=0.9431428571428572
115200: acc=0.943203125
118400: acc=0.9429898648

(0.010807306925455729, 0.9429833333333333)

> **Note:** You may have noticed the parameter `use_pack_sequence` that we pass to the training function. Currently, the `pack_padded_sequence` function requires the length tensor to be on the CPU, and thus the training function must avoid moving the length data to the GPU when training. You can look at the implementation of the `train_epoch_emb` helper function in the `torchnlp.py` file located in the local directory.

The `train_epoch_emb` helper is reproduced here:

```python
def train_epoch_emb(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.CrossEntropyLoss(),epoch_size=None, report_freq=200,use_pack_sequence=False):
    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)
    loss_fn = loss_fn.to(device)
    net.train()
    total_loss,acc,count,i = 0,0,0,0
    for labels,text,off in dataloader:
        optimizer.zero_grad()
        labels,text = labels.to(device), text.to(device)
        if use_pack_sequence:
            off = off.to('cpu')
        else:
            off = off.to(device)
        out = net(text, off)
        loss = loss_fn(out,labels) #cross_entropy(out,labels)
        loss.backward()
        optimizer.step()
        total_loss+=loss
        _,predicted = torch.max(out,1)
        acc+=(predicted==labels).sum()
        count+=len(labels)
        i+=1
        if i%report_freq==0:
            print(f"{count}: acc={acc.item()/count}")
        if epoch_size and count>epoch_size:
            break
    return total_loss.item()/count, acc.item()/count
```


In [17]:
net.eval()

LSTMPackClassifier(
  (embedding): Embedding(95812, 64)
  (rnn): LSTM(64, 32, batch_first=True)
  (fc): Linear(in_features=32, out_features=4, bias=True)
)

> You can run the cell below several times to see different results.

In [20]:

with torch.no_grad():
    for label,text,off in test_loader_len:
        # show the input text
        word_lookup = [vocab.itos[w] for w in text[0]]
        unknow_vals = {'<unk>'}
        word_lookup = [ele for ele in word_lookup if ele not in unknow_vals]
        print('Input text:\n {}\n'.format(word_lookup))

        text, label = text.to(device), label.to(device)
        off = off.to('cpu')
        print(f'off value: {off}')
        pred = net(text, off )
        print(f'target {label}')
        y=torch.argmax(pred, dim=1)
        print(f'pred: {y}')
        print("Predicted:\nvalue={}, class_name= {}\n".format(y[0],classes[y[0]]))
        print("Target:\nvalue={}, class_name= {}\n".format(label[0],classes[label[0]]))
        break
     

Input text:
 ['sony', ',', 'ibm', ',', 'and', 'toshiba', 'reveal', 'additional', 'details', 'on', 'cell', 'chip', 'initial', 'versions', 'of', 'playstation', '3', 'chip', 'will', 'not', 'be', 'produced', 'with', 'a', 'cutting-edge', 'chip-making', 'technology', '.', 'the', 'four', 'companies', 'developing', 'the', 'cell', 'consumer', 'electronics', 'microprocessor', 'released', 'a', 'few', 'more', 'details']

off value: tensor([42, 52, 31, 35, 47, 30, 27, 30, 47, 35, 46, 24, 48, 41, 46, 46])
target tensor([3, 2, 3, 1, 0, 1, 2, 2, 1, 2, 0, 0, 1, 0, 2, 2], device='cuda:0')
pred: tensor([3, 2, 3, 1, 0, 1, 3, 2, 1, 3, 0, 0, 1, 0, 2, 2], device='cuda:0')
Predicted:
value=3, class_name= Sci/Tech

Target:
value=3, class_name= Sci/Tech



## Bidirectional and multilayer RNNs

In our examples, all recurrent networks operated in one direction, from the beginning of a sequence to the end. This looks natural, because it resembles the way we read and listen to speech. However, since in many practical cases we have random access to the input sequence, it might make sense to run the recurrent computation in both directions. Such networks are called **bidirectional** RNNs, and they can be created by passing the `bidirectional=True` parameter to the RNN/LSTM/GRU constructor.

> **Example**:   _self.rnn = torch.nn.LSTM(embed_dim,hidden_dim,batch_first=True, bidirectional=True)_

When dealing with a bidirectional network, we need two hidden state vectors, one for each direction. PyTorch concatenates them in the per-step output, so each output vector is twice as large, while the final hidden state `h` holds one slice per direction. You would normally pass the resulting hidden state to a fully-connected linear layer, so you just need to take this increase in size into account when creating the layer.
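
A minimal sketch of the shapes involved, with made-up sizes and random input, could look like this. Concatenating `h[-2]` and `h[-1]` is one common way to feed both directions to the classifier, not the only one.

```python
import torch

torch.manual_seed(0)
batch, seq_len, embed_dim, hidden_dim, num_class = 2, 6, 8, 4, 4  # illustrative

lstm = torch.nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
fc = torch.nn.Linear(2 * hidden_dim, num_class)   # input is twice the hidden size

x = torch.randn(batch, seq_len, embed_dim)        # stand-in for embedded tokens
out, (h, c) = lstm(x)
print(out.shape)  # torch.Size([2, 6, 8])  both directions concatenated per step
print(h.shape)    # torch.Size([2, 2, 4])  one slice per direction

# h[-2] is the final state of the forward pass, h[-1] of the backward pass
features = torch.cat([h[-2], h[-1]], dim=1)
print(fc(features).shape)  # torch.Size([2, 4])
```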

A recurrent network, one-directional or bidirectional, captures certain patterns within a sequence, and can store them in the state vector or pass them to the output. As with convolutional networks, we can build another recurrent layer on top of the first one to capture higher-level patterns, built from the low-level patterns extracted by the first layer. This leads us to the notion of a **multi-layer RNN**, which consists of two or more recurrent networks, where the output of the previous layer is passed to the next layer as input.

<img alt="Image showing a Multilayer long-short-term-memory- RNN" src="images/5-recurrent-networks-3.jpg" align="middle" />

*Picture from "From a LSTM cell to a Multilayer LSTM Network with PyTorch" article by Fernando López*

PyTorch makes constructing such networks an easy task, because you just need to pass the `num_layers` parameter to the RNN/LSTM/GRU constructor to build several layers of recurrence automatically. The returned hidden/state tensor then grows proportionally, with one slice per layer, and you need to take this into account when handling the output of recurrent layers.
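
The following sketch, again with made-up sizes and random input, shows how the returned tensors look for a two-layer LSTM. It also illustrates why indexing the hidden state with `h[-1]`, as the classifiers above do, keeps working: it selects the state of the top layer.

```python
import torch

torch.manual_seed(0)
batch, seq_len, embed_dim, hidden_dim = 2, 6, 8, 4   # illustrative sizes

lstm = torch.nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
x = torch.randn(batch, seq_len, embed_dim)           # stand-in for embedded tokens
out, (h, c) = lstm(x)

print(out.shape)  # torch.Size([2, 6, 4])  outputs of the top layer only
print(h.shape)    # torch.Size([2, 2, 4])  one final state per layer

# The top layer's final state matches the last per-step output
print(torch.allclose(out[:, -1], h[-1]))
```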

## RNNs for other tasks

In this unit, we have seen that RNNs can be used for sequence classification, but in fact, they can handle many more tasks, such as text generation, machine translation, and more. We will consider those tasks in the next unit.