# Deep Learning For Text

## Description

Processing language is difficult because of its sequential nature.
Learn how to use deep learning models (RNNs, LSTMs, GRUs) for NLP tasks.


## Overview

- Need of sequence modelling
- Recurrent neural networks
- LSTM and its variants

## Pre-requisite

- Python (along with NumPy and pandas libraries)
- Word Embeddings
- Basic statistics 


## Learning Outcomes

- Understand the challenges of modelling sequences
- Learn about RNNs and how to implement them
- Learn about LSTMs and their variants and how to implement them
---

# 1. Introduction to Recurrent Neural Networks

## 1.1 Why sequence models?

Throughout the previous NLP concepts, we have repeatedly mentioned why processing text is a challenge for machines. Textual data, by its nature, requires an understanding of context, which machines can't do easily.

One of the more recent (and popular) techniques for helping machines capture context is 'word embeddings'.

They are based on the distributional hypothesis: words that appear in similar contexts tend to have similar meanings.
 

![](final_images/word_vector.jpg)

Word embeddings are pre-trained on a task where the objective is to predict a word from its context using a neural network. Using word embeddings, neural models have achieved superior results on various language-related tasks. The resulting word vectors encode both syntactic and semantic information, so relationships between words show up as vector arithmetic (e.g. 'king' - 'man' + 'woman' ≈ 'queen').
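The analogy arithmetic often quoted for word vectors can be tried out on a tiny scale. The sketch below uses hand-made 2-D vectors (the axes loosely stand for "royalty" and "gender"); these values are invented for illustration and are not real trained embeddings.

```python
import math

# Hand-made toy vectors (axes: royalty, gender); NOT real embeddings.
toy_vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

With real embeddings the match is only approximate, which is why the nearest neighbour (rather than an exact equality) is used.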


The limitation of word embedding methods shows up when we want vector representations for phrases or sentences such as "blue moon" or "USA Today". The difficulty lies in combining the individual word vectors, since the meaning of such a phrase is not simply the combination of the meanings of its words.

- 'Blue moon' refers to something that happens very rarely (and not to a blue-coloured moon)

- 'USA Today' refers to a newspaper (and not to a phrase about the country today)


It gets more complicated when longer phrases and sentences are considered.

Consider the following use cases:


- **Sentiment Classification:** The input X is a sentence, say "There is nothing to love in this movie", and the output is the number of stars the review gives.


![](final_images/sentiment_analysis.jpg)

- **Machine Translation:** 

This is one of the most popular NLP use cases, given the rise of globalization.

In this task, given a word, phrase or sentence in one language, the machine automatically translates it into another language. For example, we input a sentence in Russian and want the output to be the same sentence in English.

![](final_images/mt.jpg)


- **Named Entity Recognition:** The input is a sentence, say "Andrew Ng was a former Vice President and Chief Scientist at Baidu", and the output is a prediction of the named entities in that sentence.


![](final_images/ner.jpg)

NER systems are used in a number of areas. Some of the use cases are:

- In news, identifying the main subjects of an article.

- Finding relations between the various entities described in a document.

****

All the above use cases are examples that cannot be handled unless the input is treated as a sequence.

For example:

#1.
Consider the following review: "I have nothing but love for the movie"

If word order is ignored, words such as "nothing" and "but" can make the review look negative; read in sequence, it is clearly positive.

#2.
Consider the English sentence "Hello world" (2 words). Its French translation is "Bonjour le monde" (3 words). In language translation, the input and output sentences are usually not of the same length. If the data is not treated as a sequence, how can we decide the input or output length?
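To make the first example concrete, here is a small sketch of what an order-free representation throws away. It counts words with a plain bag-of-words; the reordered sentence is a made-up illustration, not taken from any dataset.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count words while ignoring their order."""
    return Counter(sentence.lower().split())

original = "I have nothing but love for the movie"
shuffled = "I have love for nothing but the movie"  # hypothetical reordering

print(bag_of_words(original) == bag_of_words(shuffled))  # True
print(original.split() == shuffled.split())              # False
```

Both sentences produce identical word counts even though they read very differently, so a model that only sees the counts cannot tell them apart.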


We don't realise it, but humans process sequences intrinsically. We don't isolate events and start thinking from scratch every second. We are able to 'watch a movie' or 'read a book' precisely because we keep a mental note of everything that has happened so far and how it affects current and future events.

Traditional neural networks are unable to do this, and it is one of their major shortcomings.

For example, imagine you want to classify what kind of event is happening on every page of a book. A traditional neural network can't use its reasoning about previous events in the book to inform later ones.

Another major limitation of traditional neural networks is that their input and output sizes are too constrained: they accept a fixed-size vector as input and produce a fixed-size vector as output. When dealing with, say, `machine translation`, how do you decide the input or output length? (Remember the English-French translation problem?)


Enter `'Recurrent Neural Networks'`.

Recurrent neural networks are a type of neural network designed to recognize patterns in sequences of data. These models take both `time` and `sequence` into account, so they have a temporal dimension.

Let's look at it in detail.

## 1.2 Recurrent Neural Network Model


**Intuition behind RNN**

Let's break down how humans process sequences to understand what's needed for machines to do the same.

Suppose you are watching a Tom Cruise action movie.

`In one scene, Tom Cruise drives from his home and arrives at a supermarket. In the next scene, he is shown holding an orange. In the scene after that, a man tries to attack Tom, but he saves himself by spraying orange juice in the attacker's face.`


Here's how you understood the movie:

- From the first scene, you identified that he's in a supermarket

- In the next scene, without any distinctive supermarket features, when you saw the image of Tom with an orange you categorized it as shopping instead of cooking/eating.

- In the final scene, when Tom fought off the attacker with the orange, you had no problem following the scene because you remembered that Tom was holding an orange


Following is what a model will need in order to do the same:


- After seeing each scene, the model should output a label as well as update the knowledge it has been learning.

For the movie, when seeing the first scene, the model has to output that Tom is in a supermarket and retain this knowledge, while additionally discarding the knowledge of Tom driving from home.

Similarly, the model might learn to automatically discover and track information like the time of day (if a scene contains an image of the setting sun, the model should remember that it's evening) and within-movie progress (is this image the first frame or the 10th?).

Most critically, though, your model should automatically discover useful information by itself (just as a neural network is able to learn good features with the help of data).


- When given a new scene, the model should incorporate the knowledge it has learned so far to do a better job.

For the movie, when seeing the second scene, the model should categorize the activity as shopping (as it has learned that Tom was in a supermarket) instead of eating.


Recurrent Neural Networks(RNNs) are a family of neural networks for processing sequential data.

**Recurrent Neural Networks(RNNs)**

Recurrent neural networks (RNNs) are designed to utilize and learn from sequential information. The RNN architecture performs the same task for every element of a sequence, hence the term `recurrent` in its name. RNNs have been of great use in natural language processing because of the sequential dependency of words in any language.


RNNs are specialized for processing a sequence of values $x_1, \ldots, x_t$ and can scale to much longer sequences than would be feasible for networks without sequence-based specialization. Most RNNs can also process sequences of variable length.

In an RNN, the information cycles through a loop. When it makes a decision, it takes into consideration the current input and also what it has learned from the inputs it received previously.



The two images below illustrate the difference in the information flow between a traditional Neural Network & RNNs.


![](final_images/rvsf.jpg)


As you can see from the above image, the difference is that RNNs have loops.

![](final_images/new_rnn_1.jpg)

In the above diagram, the neural network looks at the input $x_t$ and outputs a value $h_t$. The loop allows information to be passed from one step to the next.

Don't be baffled by the loop. Another way to think about an RNN is as multiple copies of the same network, each passing a message to its successor.

Following is the same network as above with its loop unrolled:

![](final_images/new_rnn_2.jpg)

This chain-like structure is itself a sequence, which makes the RNN a natural neural network architecture for such data.

Generally, at any time step of a sequence, an RNN computes some memory based on its computations so far, i.e. the prior memory and the current input. This computed memory is used to make predictions for the current time step and is passed on to the next step as an input. An RNN `recursively` applies the same computation to every element of the input sequence, conditioning itself on the previously computed results. An RNN therefore takes as input not just the current example, but also what it has perceived previously in time.

So recurrent networks have two inputs, the present and the recent past, which together determine how they respond to new data (just like we do in real life). Recurrent networks are therefore sometimes said to have `memory`.


This addition of memory to neural networks has a major purpose: there is information in the sequence itself, and recurrent nets use it to perform tasks that traditional neural networks can't.

An RNN is essentially finding correlations between events separated by time. One way to think about RNNs, therefore, is as a way to share weights over time.

*Note: When the correlated events are many time steps apart, these correlations are called "long-term dependencies"*

One would think that the current output depends directly on the previous output. That's not the case. An important thing to understand is that the relationship between the present input and past inputs (`memory`) is indirect. It's indirect because the current output depends on the previously calculated `hidden states`, not on the previous outputs.

For example, given the sentence "I like eating noodles", the RNN doesn't deduce "noodles" directly from "eating"; it deduces "noodles", in part, from the information (the hidden state) that gave rise to "eating".

Think of it as human memory that circulates invisibly within our brain, affecting our behavior without ever revealing its full shape. You know it's there and you use it, but you never give it proper form. Information circulates in the hidden states of recurrent nets in a similar way.

Sounds confusing? That's because it is.

The exact relationship between present and past input depends on the RNN's weights. Creating a coherent sequence as we go along is only possible if one can recall what came before. And RNNs do exactly that; they remember what came before. Obviously, RNNs are not magic models. They need to be trained. They work because trained networks identify and learn patterns in data.

***
**Dive Deeper(Optional)**


Let's now try to understand this process better by looking at it mathematically.


Following is the (deceptively) simple equation an RNN uses to carry memory forward:

$$h_t = \phi(W x_t + U h_{t-1})$$

The hidden state at time step t is $h_t$. It is a function of:

- $x_t$ (the input at the current time step t), multiplied by a weight matrix W
- The hidden state of the previous time step, $h_{t-1}$, multiplied by its own hidden-state-to-hidden-state matrix U (called the transition matrix)

Here $\phi$ is a squashing activation function such as tanh or the logistic sigmoid.


The weight matrices (both W and U) are the filters that determine how much significance to give to the current input and to the past hidden state (note that we are talking about the past hidden state here, not the past input).
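The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming tanh as the activation $\phi$, no bias terms, and randomly initialised (untrained) weights; the dimensions are arbitrary.

```python
import numpy as np

def rnn_forward(inputs, W, U, h0):
    """Apply h_t = tanh(W @ x_t + U @ h_{t-1}) over a sequence; return all hidden states."""
    h = h0
    states = []
    for x_t in inputs:
        h = np.tanh(W @ x_t + U @ h)  # same W and U reused at every time step
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 3, 4, 5
W = rng.normal(scale=0.5, size=(hidden_dim, input_dim))   # input-to-hidden weights
U = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))  # hidden-to-hidden (transition) weights
inputs = rng.normal(size=(steps, input_dim))

states = rnn_forward(inputs, W, U, h0=np.zeros(hidden_dim))
print(states.shape)  # (5, 4)
```

Because the same `W` and `U` are applied at each step, the function works for a sequence of any length without changing the number of parameters.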


These weights are learned using backpropagation (similar to how weights are learned in traditional neural networks).
Only here we are adding a time dimension and calling it `backpropagation through time (BPTT)`. There's not much difference between normal backprop and BPTT; when it comes down to it, BPTT is just backprop, but on RNNs!

Remember that when you "unroll" or "expand" an RNN, it essentially becomes a feedforward network. There's more work to do to compute the gradients; `time`, in this case, is expressed by a well-defined and properly ordered series of calculations linking one time step to the next, but backprop works almost the same way for recurrent nets as it does for traditional ones.
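To see that BPTT is ordinary backprop applied to the unrolled network, here is a small sketch for a scalar RNN (one input weight `w`, one recurrent weight `u`, tanh activation; all values are made up). It computes the gradient of the final hidden state with respect to `u` by walking backwards through the time steps, and compares it with a finite-difference estimate.

```python
import math

def forward(xs, w, u):
    """Scalar RNN: h_t = tanh(w * x_t + u * h_{t-1}), starting from h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * x + u * hs[-1]))
    return hs

def bptt_grad_u(xs, w, u):
    """Gradient of the final hidden state with respect to u, via backprop through time."""
    hs = forward(xs, w, u)
    grad_u = 0.0
    upstream = 1.0  # d h_T / d h_T
    for t in range(len(xs), 0, -1):
        local = 1.0 - hs[t] ** 2                 # derivative of tanh at step t
        grad_u += upstream * local * hs[t - 1]   # u is shared, so contributions add up
        upstream = upstream * local * u          # pass the gradient back to h_{t-1}
    return grad_u

xs, w, u = [0.5, -1.0, 0.25, 0.8], 0.7, 0.9
eps = 1e-6
numeric = (forward(xs, w, u + eps)[-1] - forward(xs, w, u - eps)[-1]) / (2 * eps)
print(bptt_grad_u(xs, w, u), numeric)
```

The two printed numbers should agree closely. Note how `upstream` is multiplied by `local * u` at every step; this repeated multiplication is what later makes gradients vanish or explode.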
 
The following diagram will help you visualise how it works:

![](final_images/backprop.gif)

In the diagram above, each x (input example) is multiplied by its weights w, and combined with the processed (hidden) state of the previous inputs.

If you were able to follow the above image, congrats: you now understand how an RNN works (at least at a theoretical level). For those who weren't, let's use an example.

**End of Dive Deeper(Optional)**
***

*Example:*

Suppose you had to predict the word 'data' using an RNN.

The neural network has the vocabulary d, a, t: exactly enough to produce the word 'data'.

We input the first character, "d", and from there expect the outputs at the following time steps to be "a", "t" and "a" respectively, to form:

`DATA`

In numerical form, it will look something like this:

![](final_images/new_rnn_4.jpg)



The RNN is therefore learning that, given "d", "a" is most likely to be the next character. Given "da", "t" is the most likely next character, and given "dat", the final character should be "a".

But if the neural network wasn't trained on the word "data", and thus didn't have optimal weights (i.e. just randomly initialized weights), then we'd get garbled output like "dtaa".
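As a rough illustration of how such a character-level problem is fed to a network, the sketch below one-hot encodes the three-character vocabulary and pairs each input character with the character that should follow it. The index order (d, a, t) is an arbitrary choice and may differ from the encoding used in the figure above.

```python
vocab = ["d", "a", "t"]
char_to_index = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return a one-hot list for a character in the vocabulary."""
    vec = [0] * len(vocab)
    vec[char_to_index[ch]] = 1
    return vec

word = "data"
inputs = [one_hot(ch) for ch in word[:-1]]  # d, a, t
targets = list(word[1:])                    # a, t, a (the next character at each step)

for x, y in zip(inputs, targets):
    print(x, "->", y)
# [1, 0, 0] -> a
# [0, 1, 0] -> t
# [0, 0, 1] -> a
```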


Now is a good time to point out the concept of `start` & `end` tokens.
One reason RNNs work so well is their ability to handle inputs and outputs of any length.

`<START>` and `<END>` tokens tell us where the input begins and where the output ends.

In our previous example, once the final character "a" is output, the `<END>` token follows; this signals that the word is complete.

Their usefulness is much more visible in other RNN applications like 'Image Captioning'. In image captioning, depending on the image, the caption could be any number of words long. It is the end token that lets the model signal that the caption is complete (as opposed to running a loop a fixed number of times or fixing a maximum value for n).

`<START>` and `<END>` tokens therefore help RNNs handle sequences whose length is not known in advance.

One small thing though: RNNs don't come up with these tokens on their own. We have to add them to the training data ourselves.
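The role of the two tokens can be sketched as follows. The `toy_next_token` function is a hypothetical stand-in for a trained model (it simply replays a memorised sequence), and `max_steps` is only a safety cap in case the end token never appears.

```python
START, END = "<START>", "<END>"

def add_boundary_tokens(tokens):
    """Wrap a token sequence with start and end markers (done when preparing training data)."""
    return [START] + list(tokens) + [END]

def generate(next_token, max_steps=20):
    """Call next_token(prefix) until it returns END or max_steps is reached."""
    output = [START]
    for _ in range(max_steps):
        token = next_token(output)
        if token == END:
            break
        output.append(token)
    return output[1:]  # drop the start marker

# Hypothetical stand-in for a trained model: it replays a memorised sequence.
memorised = add_boundary_tokens("data")

def toy_next_token(prefix):
    return memorised[len(prefix)]

print(add_boundary_tokens("data"))        # ['<START>', 'd', 'a', 't', 'a', '<END>']
print("".join(generate(toy_next_token)))  # data
```

The generation loop does not need to know the output length in advance; it stops as soon as the model emits `<END>`.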

Lot of theory, eh?

Let's get our hands dirty with coding

# Task 1



```python
from keras.preprocessing import sequence
from keras.utils import to_categorical
from keras.datasets import reuters

max_features = 10000  # number of words to consider as features
maxlen = 500  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print('Loading data...')
(input_train, y_train), (input_test, y_test) = reuters.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')

print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)

# one-hot encode the labels
one_hot_train_labels = to_categorical(y_train)
one_hot_test_labels = to_categorical(y_test)

# training set
x_train = input_train
y_train = one_hot_train_labels

# test set
x_test = input_test
y_test = one_hot_test_labels
```

```
Loading data...
8982 train sequences
2246 test sequences
Pad sequences (samples x time)
input_train shape: (8982, 500)
input_test shape: (2246, 500)
```
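The `pad_sequences` call above is what turns newswires of different lengths into the fixed `(samples, 500)` arrays shown in the output. The following pure-Python sketch mimics its default behaviour as I understand it (zeros are added at the front, and overly long sequences keep only their last `maxlen` items); `pad_pre` is a made-up helper name, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if the sequence is too long."""
    seq = list(seq)
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```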


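## Data Description

The code above loads the Reuters newswire dataset that ships with Keras: 8,982 training and 2,246 test newswires, each encoded as a sequence of word indices and labelled with one of 46 topics.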

## Instructions

- Data has already been loaded in `x_train`, `y_train`, `x_test`, `y_test`

- Create a `"Sequential()"` model and save it in a variable called `model`

- Add an embedding layer `"Embedding(max_features,100)"` to `model`. This layer learns a 100-dimensional embedding for each word in the news data.

- Add the RNN layer `"SimpleRNN"` with 32 units to `model`. Note that this is the RNN you are building.

- Next, create the output layer. For this, add `"Dense(46, activation='softmax')"` to `model`. 46 is the number of classes the output has.

- Compile the model using the `"compile()"` method of `model` and pass the following parameters to it: `optimizer='adam'`, `loss='categorical_crossentropy'`, `metrics=['acc']`

- Fit the model using the `"fit()"` method of `model` and pass the following parameters to it: `x_train`, `y_train`, `verbose=2`, `epochs=5`, `batch_size=128`, `validation_split=.2`

- Evaluate the model using the `"evaluate()"` method of `model`. Pass `x_test`, `y_test` as the parameters and store the result in a variable called `test_score`

***Note:*** After submitting once(and successfully passing the test cases), feel free to play around with the model parameters

## Test cases

#test_score

Variable declaration

test_score[1]+0.01>=round(0.5084594835793452,2)

```python
# fitting the model
from keras.layers import Dense, Embedding, SimpleRNN
from keras.models import Sequential

model = Sequential()
model.add(Embedding(max_features, 100))     # 100-dimensional embedding per word index
model.add(SimpleRNN(32))                    # recurrent layer with 32 units
model.add(Dense(46, activation='softmax'))  # one output per news topic

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.fit(x_train, y_train, verbose=2,
          epochs=5,
          batch_size=128,
          validation_split=.2)

# returns [loss, accuracy] on the held-out test set
test_score = model.evaluate(x_test, y_test)
```

```
Train on 7185 samples, validate on 1797 samples
Epoch 1/5
 - 13s - loss: 3.0128 - acc: 0.2679 - val_loss: 2.4564 - val_acc: 0.3450
Epoch 2/5
 - 12s - loss: 2.4208 - acc: 0.3479 - val_loss: 2.3478 - val_acc: 0.3450
Epoch 3/5
 - 13s - loss: 2.2224 - acc: 0.4242 - val_loss: 2.2190 - val_acc: 0.4062
Epoch 4/5
 - 13s - loss: 1.9052 - acc: 0.5219 - val_loss: 2.0144 - val_acc: 0.4880
Epoch 5/5
 - 14s - loss: 1.5727 - acc: 0.6007 - val_loss: 2.0210 - val_acc: 0.4691
```


## 1.3 Problem with RNNs

So far, RNNs seem fantastic and state of the art (SOTA) for text processing, right?

Well, they are great except for one **major** problem. Let's see what it is.

Continuing with our Tom Cruise movie example, we have placed no constraints on updates, so the knowledge can change chaotically within the RNN.

In one frame it sees the characters eating noodles and might conclude they are in Korea, and in the next frame it sees penguins and thinks they are in Antarctica. Or perhaps it has a lot of information to suggest that Tom is a lawyer, but then decides he is a professional assassin after seeing him in a karate class.

This chaos means information transforms and vanishes quickly, and it's difficult for the model to keep a 'long-term memory'.

There are cases where we only need to look at recent information to perform the present task.


For example, consider a language model predicting the next word. If we are trying to predict the last word in "the sun rises in the east", we don't need much context; it's obvious that the word is going to be "east". In such cases, the gap between the relevant information and the place where it is needed is small, and RNNs can learn to use such information.

Unfortunately, as the gap grows, RNNs become less and less able to connect the information. In theory, RNNs are absolutely capable of handling such "long-term dependencies". In practice though, RNNs don't seem to be able to cope with them. Let's try to understand why.

### Gradient issue

The basic problem in RNNs is that gradients propagated over many stages tend to either vanish (most of the time) or explode (rarely, but with much damage to the optimization).

This is in part because the information that is flowing through neural nets passes through many stages of multiplication.

Any quantity multiplied repeatedly by an amount slightly greater than one can become extremely large (e.g. compound interest, which can leave a borrower paying back far more than the original amount). The inverse is also true: repeatedly multiplying by a quantity less than one shrinks a value towards zero (e.g. gamblers who get back 99 cents for every dollar they bet go bankrupt fast).

Because the layers and time steps of deep neural networks relate to each other through multiplication, derivatives are prone to vanishing or exploding.
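A quick numerical check of this effect: the sketch below multiplies 1.0 by the same factor 100 times. The factors 1.1 and 0.9 and the step count are arbitrary choices for illustration.

```python
def repeated_product(factor, steps):
    """Multiply 1.0 by `factor` a total of `steps` times."""
    value = 1.0
    for _ in range(steps):
        value *= factor
    return value

print(repeated_product(1.1, 100))  # roughly 1.4e4: explodes
print(repeated_product(0.9, 100))  # roughly 2.7e-5: vanishes
```

A gradient that is scaled by a factor like this at every time step behaves the same way over a 100-step sequence.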



### Vanishing gradient

![](final_images/vanishing_gradient.jpg)


So in recurrent neural networks, layers that get a small gradient update stop learning. Those are usually the earlier layers, i.e. the earliest time steps of the unrolled network. Because these layers don't learn, RNNs can forget what they have seen in longer sequences, and thus have a short-term memory.


### Exploding gradient

On the other hand, exploding gradients treat every weight as though it were extremely important. This means that the computation within the RNN can potentially blow up to infinity without sensible weights. This makes learning VERY unstable, because a slight shift of the weights in the wrong direction during backprop can blow up the activations during the forward pass.

But exploding gradients can be handled relatively easily, because the gradients can be truncated (clipped), or the activations squashed using sigmoid/tanh (though squashing often results in the vanishing gradient problem).

![](final_images/exploding_gradient.jpg)
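One common way to truncate exploding gradients is to clip them by their norm. The sketch below is a minimal pure-Python version, assuming the gradient is given as a flat list of numbers; `clip_by_norm` is an illustrative helper, not a library function.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; its direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

print(clip_by_norm([30.0, 40.0], max_norm=25.0))  # [15.0, 20.0]
print(clip_by_norm([0.3, 0.4], max_norm=25.0))    # [0.3, 0.4]
```

Large gradients are scaled down while small ones pass through untouched. In Keras, a similar effect can be requested through the `clipnorm` or `clipvalue` arguments of an optimizer.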

Before we understand how to resolve this problem, let's try to quickly recap what we learned about RNNs so far.

|RNN Advantages|RNN Disadvantages|
|---|---|
|Capability of processing any length input<br>(Model size not affected with size of input)|Difficulty in capturing long distance dependencies|
|Past information taken into account during computation|Relatively slow processing|
|Weights are shared across time(BPTT)||


**Solution to Vanishing and Exploding Gradients Problem in RNN:**

1. Smart Weight Initialization:

When we initialize weights too small (<<1), it leads to vanishing gradients.

When we initialize weights too large (>>1), it leads to exploding gradients.

Initializing the weights randomly from a suitably scaled uniform distribution reduces the likelihood of gradient instability. Another way is to initialize the weights randomly from a suitably scaled normal distribution.

2. Better Activation Functions:

Using activation functions like ReLU or Leaky ReLU (as opposed to sigmoid or tanh) mainly helps with the vanishing gradient issue, since their gradient is a constant 1 for all inputs > 0, whereas the gradients of sigmoid and tanh shrink towards 0 as the input grows. This makes the neural net learn faster and speeds up the convergence of the training process.

3. LSTM:

An RNN overwrites its memory at each time step in a relatively uncontrolled fashion. An LSTM, on the other hand, transforms its memory using specific learned mechanisms that help it keep track of information over the long term. Let's look at it in detail in the next chapter.
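For the second point above (activation functions), the difference is easy to see numerically. The sketch below evaluates the derivatives of sigmoid, tanh and ReLU at a few arbitrary input values.

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh."""
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0

for x in (0.5, 2.0, 5.0):
    print(x, round(sigmoid_grad(x), 4), round(tanh_grad(x), 4), relu_grad(x))
```

The sigmoid derivative never exceeds 0.25 and both sigmoid and tanh derivatives fall towards zero as the input grows, while the ReLU derivative stays at 1 for every positive input.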

# 2. LSTM

## 2.1 Introduction to LSTM

Let's go back to the Tom Cruise movie example. You would like the model to learn about the world in a controlled way, using the following mechanisms:

- Need for a forget mechanism:

When a movie scene ends, the model should stop keeping track of (forget) any scene-specific information (time of day, location, etc.).

At the same time, if Tom kills someone in the scene, it should remember that the other character is no longer alive.


- Need for a saving mechanism:

Another important mechanism concerns input processing. When a new input comes in, the model needs to identify which parts of the scene to save. Maybe there's a scene with Tom telling a joke to his coworker (you don't need that information saved).

In summary, when a new input comes in, the model first needs to forget any previous long-term information it decides is no longer required. After that, it learns which parts of the new input are worth using, and stores them in its long-term memory.

- Selective long-term memory

Think of the forget/save mechanism as maintaining a library (of information). This library is useless, though, if the model doesn't know which parts of its long-term memory are immediately needed.

For example, Tom's family details (Is he married? How many kids?) may be a useful piece of information to keep in the long term, but are probably irrelevant if Tom is not in the current scene. So instead of using the full long-term memory for every scene, the model needs to learn which parts to focus on.


These three features are more or less what a variant of the RNN called 'Long Short-Term Memory networks' provides.

**Long Short-Term Memory Networks**

LSTM networks are explicitly designed to avoid the long-term dependency problem that RNNs have. 

``
Remembering information for long periods of time is practically their default behavior, not something they struggle to learn``

***Source:***[Colah Blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)


As mentioned before, an RNN overwrites its memory at each time step in a relatively uncontrolled fashion. An LSTM, on the other hand, transforms its memory using specific learned mechanisms that help it keep track of information over the long term.


Let's try to understand LSTM a bit more formally.


A crucial addition in LSTM is that the weight on the self-loop is conditioned on the context, rather than fixed. By making the weight of this self-loop gated (i.e. controlled by another hidden unit), the time scale of integration can be changed dynamically.


An LSTM consists of three gates: input, forget, and output gates.

Following is the LSTM circuit image:

![](final_images/lstm_1.jpg)

The image looks a little daunting at first. Let's try to understand it better.

At time t, we receive a new input $x_t$.

Let's start with how the long-term memory is modified, and then see how the working memory (i.e. the selective memory we talked about in the previous section) is derived from it.


***Note:*** Long-term memory is referred to as the cell state ($C_t$) and working memory is referred to as the hidden state ($h_t$).


***
***Step 1:***

Identify which pieces of long-term memory (the previous cell state $C_{t-1}$) to continue remembering and which to discard. This is decided by the forget gate (**$f_t$**).

***Note:*** The term forget gate is a bit of a misnomer, because a gate value close to 1 means the information is remembered, not forgotten.

Alternatively you can write,

**${f_t}$ = Function of new input($x_t$) and previous working memory($h_{t-1}$)**




***
***Step 2:***

Compute the new information we can learn from the present input $x_t$ and store it temporarily in the candidate memory ($\widetilde{C_t}$).

Alternatively you can write,

**$\widetilde{C_t}$ = Function of new input($x_t$) and previous working memory($h_{t-1}$)**

***
***Step 3:*** 

Before we add the candidate memory, we want to learn which parts of it are actually worth using and saving. The gate that decides what gets "saved" is the input gate ($i_t$).

Alternatively you can write,

**${i_t}$ = Function of new input($x_t$) and previous working memory($h_{t-1}$)**

For example, in weather forecasting, you should take in information about tomorrow's weather, but probably ignore the journalist's speculations about it.

***Note:*** This is similar to the forget gate, but it uses different weight matrices.

***
***Step 4:***

Combining all these steps, we get the actual long-term memory or cell state ($C_t$).

Alternatively you can write,

**${C_t}$ = Forget gate($f_t$) applied to previous long term memory($C_{t-1}$) + input gate($i_t$) applied to candidate memory($\widetilde{C_t}$)**

In this step what we are effectively doing is updating long-term memory by forgetting memories we don't need further and saving useful pieces from new input.

***
***Step 5:***

Update the working memory by focusing on the parts of long-term memory that will be immediately useful. This is controlled by the output gate ($o_t$).

**${o_t}$ = Function of current input($x_t$) and previous working memory($h_{t-1}$)**


Our new hidden state ($h_t$) then becomes:

**${h_t}$ = Function of output gate($o_t$) and current long term memory($C_t$).**



****
### Dive Deeper(Optional)

Let's look at the LSTM architecture again and understand the gates.

![](final_images/lstm_1.jpg)

Following are the mathematical formulae of the gates we discussed:

**Cell state($C_t$):**

$\widetilde{C_t} = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

$C_t = f_t * C_{t-1} + i_t * \widetilde{C_t}$


**Hidden state($h_t$):**

$h_t = o_t * \tanh(C_t)$


**Forget gate($f_t$):**

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

**Input gate($i_t$):**

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

**Output gate($o_t$):**

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
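The gate equations above translate almost line by line into NumPy. The sketch below is an illustration with randomly initialised, untrained weights; the `params` dictionary and its key names (`W_f`, `b_f`, ...) are just a convenient way to hold the matrices here, and `*` is element-wise multiplication.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the gate equations above."""
    concat = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                         # new cell state (long-term memory)
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                                   # new hidden state (working memory)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
params = {}
for gate in ("f", "i", "c", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

Note that the cell state is only ever scaled by the forget gate and added to, which is what lets information survive across many steps.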

**End of Dive Deeper(Optional)**
****

To summarize, following is the comparison between LSTM and RNN:

**RNN:**

Update of the hidden state.

**LSTM:**

Update of the cell state and the hidden state.

Creation of the forget gate, input gate and output gate (and, temporarily, the candidate memory).


***If you want to understand the process better, you can check out this [fantastic blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) explaining the underlying mathematics.***

Let's now get our hands dirty with actual LSTM coding.

# Task 2

```python
# fitting the model
from keras.layers import Dense, Dropout, Embedding, LSTM
from keras.models import Sequential

model = Sequential()
model.add(Embedding(max_features, 100))
model.add(LSTM(32))        # LSTM layer with 32 units
model.add(Dropout(0.5))    # randomly drop half of the LSTM outputs during training
model.add(Dense(46, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
# x_train and y_train are the arrays prepared in Task 1
history = model.fit(x_train, y_train,
                    epochs=5,
                    batch_size=128,
                    validation_split=.2)
```

```
Train on 7185 samples, validate on 1797 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
```


## 2.2 LSTM variants

If you have understood LSTM so far (***if you haven't, then by all means read it again. LSTM is both the most complex and the most pivotal concept that you are going to encounter in NLP using DL***), you might think that the LSTM architecture could be optimised. Thankfully, that's what leading researchers also ended up thinking, which led to many optimised LSTM architectures.

Following are two such variations:



### 1. GRU

One thing that may seem redundant in an LSTM is that it keeps both a long-term and a working memory, which in turn raises the question of why it needs both input gates and forget gates.
Addressing this redundancy led to the LSTM variant called the Gated Recurrent Unit (GRU).

Following is the circuit image of GRU:
![](final_images/gru_1.png)


A GRU has two gates, a reset gate `r` and an update gate `z`.  


**Update gate(z):**

We combine forget and input gates to create a single 'update' gate. Instead of separately learning what to forget and new information to add, we do that together. 

**Reset gate(r):**

The reset gate is another gate that is used to decide how much past information to forget. 


Interestingly, with the update rule used in this chapter, if you fix both the reset gate and the update gate to 1, you end up with a vanilla RNN model (some texts define the update gate the other way round, in which case the value would be 0). Following are the GRU's key differences when compared with the LSTM:


- A GRU has 2 gates, an LSTM has 3 gates.

- GRUs don't have a separate internal memory (cell state $C_t$); the hidden state is the only memory.

- GRUs don't have an output gate like LSTMs do.

- Owing to fewer operations, they are faster to train and need less data to generalize than LSTMs. Conversely, in cases with enough data, LSTMs with their greater flexibility can lead to better results.
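The "fewer operations" point can be made concrete by counting weights. The sketch below assumes the formulation used in this chapter: an LSTM with four weight matrices over $[h_{t-1}, x_t]$ plus four bias vectors, and a GRU with three weight matrices and no biases. Library implementations (for example the Keras `GRU` layer) add bias terms, so the numbers reported by `model.summary()` may differ for the GRU.

```python
def lstm_param_count(input_dim, units):
    """Four weight matrices over [h_{t-1}, x_t] plus four bias vectors."""
    return 4 * (units * (units + input_dim) + units)

def gru_param_count(input_dim, units):
    """Three weight matrices over [h_{t-1}, x_t]; biases omitted in this formulation."""
    return 3 * units * (units + input_dim)

# 100-dimensional embeddings feeding 32 recurrent units, as in the tasks of this chapter
print(lstm_param_count(100, 32))  # 17024
print(gru_param_count(100, 32))   # 12672
```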


***

### Dive Deeper(Optional):

Following are the equations of GRU:

$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$

$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$

$\widetilde{h_t} = \tanh(W \cdot [r_t * h_{t-1}, x_t])$

$h_t = (1 - z_t) * h_{t-1} + z_t * \widetilde{h_t}$
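As with the LSTM, these equations can be sketched directly in NumPy. The weights below are random and untrained, the dimensions are arbitrary, and (matching the equations above) no bias terms are used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step following the equations above (no bias terms)."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                                 # update gate
    r_t = sigmoid(W_r @ concat)                                 # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                 # blend old state and candidate

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
shape = (hidden_dim, hidden_dim + input_dim)
W_z, W_r, W = (rng.normal(scale=0.5, size=shape) for _ in range(3))

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W)
print(h.shape)  # (4,)
```

There is a single state vector `h` here, compared with the separate `h` and `c` an LSTM carries from step to step.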


Following is a side by side operation flow of LSTM and GRU

|LSTM|GRU|
|-----|-----|
|![](final_images/lstm.gif) |![](final_images/gru.gif)|




**End of Dive Deeper(Optional)**
***

### 2. Bi-Directional LSTM

Another variation of LSTM is the bidirectional LSTM.

As the term `bidirectional` suggests, you try to predict data based on information flowing from both directions.

Take the example of the following sentences:

"The man went to the local pool. He swam for a while and then got out of the pool"

Let's say we want to predict the word 'pool' at the end of the first sentence.

A unidirectional LSTM only preserves information from the past, because it has only seen inputs from the past.

On a surface level, what a unidirectional LSTM will see is the following:

The man went to the local `_____`

With a bidirectional LSTM we are also able to see information further down the road, for example:

Forward LSTM:

The man went to ...

Backward LSTM:

... swam for a while and then got out of the pool.

Using information from the future makes it easier for the network to work out what the missing word is.


![](final_images/bilstm.jpg)

A BiLSTM (or any BiRNN, for that matter) works as follows:

- The neurons of a regular LSTM are split into two directions, one for the positive time direction (forward states: seeing the input vectors in the correct order), and another for the negative time direction (backward states: seeing the input vectors in reverse order).


- BiLSTMs are trained using algorithms similar to those for LSTMs, because the neurons of the two directions do not interact with each other. The only difference is that forward propagation is applied twice, once for the forward cells and once for the backward cells.


- When the weights are updated via back-propagation, additional steps are required, because the input and output layers cannot be updated simultaneously.


- During the forward pass, the model computes the forward states and backward states first and then the output neurons. During the backward pass, the model processes the output neurons first and then the forward states and backward states. The weights are updated after the forward and backward passes are completed.
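The two-direction idea in the points above can be sketched with a plain tanh RNN standing in for the LSTM cell (a simplification made here for brevity; the weights are random and untrained). One pass reads the inputs in order, the other reads them reversed, and the two sets of states are concatenated per time step.

```python
import numpy as np

def run_rnn(inputs, W, U):
    """Simple tanh RNN; returns the hidden state at every time step."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W @ x_t + U @ h)
        states.append(h)
    return np.stack(states)

def run_bidirectional(inputs, fwd_params, bwd_params):
    """Concatenate forward states with backward states re-aligned to input order."""
    forward_states = run_rnn(inputs, *fwd_params)
    backward_states = run_rnn(inputs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward_states, backward_states], axis=1)

def make_params(rng, input_dim, hidden_dim):
    """Random (W, U) pair for one direction."""
    return (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)))

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 3, 4, 6
fwd_params = make_params(rng, input_dim, hidden_dim)
bwd_params = make_params(rng, input_dim, hidden_dim)
inputs = rng.normal(size=(steps, input_dim))

states = run_bidirectional(inputs, fwd_params, bwd_params)
print(states.shape)  # (6, 8)
```

Each time step now carries `2 * hidden_dim` features: the first half summarises everything before that step, the second half everything after it. The two directions keep separate weights, which is why they can be computed independently.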

Applications of Bidirectional LSTM include :

- POS tagging
- Named Entity Recognition
- Speech Recognition 
- Machine Translation

It is often more effective than a unidirectional LSTM architecture, but also more complex. Whether to use it depends on the application (e.g. bidirectional LSTMs cannot be used where we don't have access to the full sequence, as in real-time translation).

You can learn about bidirectional LSTMs in more detail in this [video](https://www.youtube.com/watch?v=bTXGpATdKRY)

# Task 3

```python
# fitting the model
from keras.layers import Dense, Dropout, Embedding, Bidirectional
from keras.layers import LSTM, GRU, SimpleRNN
from keras.models import Sequential

model = Sequential()
model.add(Embedding(max_features, 100))
# Alternative recurrent layers to experiment with (use one at a time):
# model.add(SimpleRNN(32))
# model.add(SimpleRNN(32, return_sequences=True))  # needed when stacking recurrent layers
# model.add(LSTM(32))
# model.add(GRU(32))
model.add(Bidirectional(LSTM(32)))  # one forward and one backward LSTM, 32 units each
# model.add(Dropout(0.5))
model.add(Dense(46, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
# x_train and y_train are the arrays prepared in Task 1
history = model.fit(x_train, y_train,
                    epochs=5,
                    batch_size=128,
                    validation_split=.2)
```

```
Train on 7185 samples, validate on 1797 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
```

