## Gated Recurrent Unit Network

To address the vanishing and exploding gradient problems often encountered when training a basic Recurrent Neural Network, many variations were developed. One of the best-known variations is the Long Short-Term Memory network (LSTM). A lesser-known variation that is often comparably effective is the Gated Recurrent Unit network (GRU).

Unlike an LSTM, a GRU consists of only two gates and does not maintain an Internal Cell State. The information that an LSTM recurrent unit stores in its Internal Cell State is instead incorporated into the hidden state of the Gated Recurrent Unit. This collective information is passed on to the next Gated Recurrent Unit. The different gates of a GRU are described below:

**Update Gate(z):** It determines how much of the past knowledge needs to be passed along into the future. It plays a role similar to the combination of the Input Gate and the Forget Gate in an LSTM recurrent unit.

**Reset Gate(r):** It determines how much of the past knowledge to forget when the new candidate memory is computed. It has no exact counterpart in an LSTM recurrent unit.

**Current Memory Gate($\overline{h}_{t}$):** It is often overlooked in discussions of the Gated Recurrent Unit network. It is incorporated into the Reset Gate, just as the Input Modulation Gate is a sub-part of the Input Gate, and is used to introduce some non-linearity into the input and to make the input zero-mean. Another reason to make it a sub-part of the Reset Gate is to reduce the effect that previous information has on the current information being passed into the future.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5f/Gated_Recurrent_Unit.svg/1594px-Gated_Recurrent_Unit.svg.png" style="height: 300px;" />  


### Working of a Gated Recurrent Unit

1. Take the current input and the previous hidden state as input vectors.

2. Calculate the values of the three different gates by following the steps given below:

a. For each gate, calculate the parameterized current input and previous hidden state vectors by multiplying the concerned vector by the respective weights for each gate. In the standard GRU formulation each gate has its own weight matrices, so this is a matrix-vector product rather than an element-wise (Hadamard) product.

b. Apply the respective activation function for each gate element-wise on the parameterized vectors. The gates and their activation functions are listed below.

Update Gate : Sigmoid Function  
Reset Gate  : Sigmoid Function
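
Written out, the two sigmoid gates can be sketched as follows. The symbols $W_{z}, U_{z}, W_{r}, U_{r}$ (weight matrices) and $b_{z}, b_{r}$ (bias vectors) are notation assumed here for illustration, following the common GRU formulation; they are not defined in the text above.

$$z_{t} = \sigma(W_{z} x_{t} + U_{z} h_{t-1} + b_{z})$$

$$r_{t} = \sigma(W_{r} x_{t} + U_{r} h_{t-1} + b_{r})$$

Here $\sigma$ is the sigmoid function applied element-wise, so every entry of $z_{t}$ and $r_{t}$ lies between 0 and 1.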

The process of calculating the Current Memory Gate is a little different. First, the Hadamard product of the Reset Gate and the previous hidden state vector is calculated. This vector is then parameterized and added to the parameterized current input vector.

$$\overline{h}_{t} = \tanh(W x_{t} + U (r_{t}\odot h_{t-1}))$$

To calculate the current hidden state, first define a vector of ones with the same dimensions as the hidden state. This vector is called ones and is denoted mathematically by 1. First, calculate the Hadamard product of the update gate and the previous hidden state vector. Then generate a new vector by subtracting the update gate from ones, and calculate the Hadamard product of this new vector with the current memory gate. Finally, add the two vectors to get the current hidden state vector.

$$h_{t} = z_{t}\odot h_{t-1} + (1-z_{t})\odot \overline{h}_{t}$$
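
The steps above can be sketched in NumPy as a single GRU time step. This is a minimal illustration, not the Keras implementation: the function name `gru_step`, the parameter dictionary with keys such as `W_z` and `U_z`, the bias terms, and the toy dimensions are all assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU time step (hypothetical helper, assumed parameter names)."""
    # Update and reset gates: sigmoid of the parameterized input and previous state
    z_t = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev + params["b_z"])
    r_t = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev + params["b_r"])
    # Current memory (candidate state): the reset gate scales h_prev element-wise
    h_cand = np.tanh(params["W_h"] @ x_t + params["U_h"] @ (r_t * h_prev) + params["b_h"])
    # New hidden state: z_t keeps part of h_prev, (1 - z_t) admits the candidate
    return z_t * h_prev + (1.0 - z_t) * h_cand

# Toy dimensions and random weights, chosen only for illustration
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
params = {}
for gate in ("z", "r", "h"):
    params["W_" + gate] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    params["U_" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    params["b_" + gate] = np.zeros(hidden_dim)

# Run the step over a short random sequence, carrying the hidden state forward
h = np.zeros(hidden_dim)
sequence = rng.normal(size=(5, input_dim))
for x_t in sequence:
    h = gru_step(x_t, h, params)
print(h.shape)  # (3,)
```

Because the new hidden state is an element-wise blend of the previous state and a `tanh` output, every entry of `h` stays strictly between -1 and 1 when the initial state is zero.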

Just like a basic Recurrent Neural Network, a GRU network generates an output at each time step, and this output is used to train the network using gradient descent.

### GRU Implementation

The conceptual procedure for training the network is to first map each character present in the training text to a unique number. Each character is then one-hot encoded into a vector, which is the format the network requires. The data for the described procedure is a collection of short, famous poems by famous poets, stored in a .txt file.

In [3]:
from __future__ import absolute_import, division, print_function, unicode_literals 

import numpy as np 
import tensorflow as tf 
import pandas as pd

from keras.models import Sequential 
from keras.layers import Dense, Activation 
from keras.layers import GRU

from keras.optimizers import RMSprop 

from keras.callbacks import LambdaCallback 
from keras.callbacks import ModelCheckpoint 
from keras.callbacks import ReduceLROnPlateau 
import random 
import sys 


Using TensorFlow backend.


In [4]:
# Reading the text file into a string 
with open('poems.txt', 'r') as file: 
	text = file.read() 

# A preview of the text file	 
print(text) 

Buffalo Bill’s

defunct

who used to

ride a watersmooth-silver

stallion

and break one two three four five pigeons just like that

Jesus



he was a handsome man

and what i want to know is

how do you like your blueeyed boy

Mister Death



Had I the heaven’s embroidered cloths,

Enwrought with golden and silver light,

The blue and the dim and the dark cloths

Of night and light and the half-light,

I would spread the cloths under your feet:

But I, being poor, have only my dreams;

I have spread my dreams under your feet;

Tread softly because you tread on my dreams.



He clasps the crag with crooked hands;

Close to the sun in lonely lands,

Ring’d with the azure world, he stands.



The wrinkled sea beneath him crawls;

He watches from his mountain walls,

And like a thunderbolt he falls.



Some say the world will end in fire,

Some say in ice.

From what I’ve tasted of desire

I hold with those who favor fire.

But if it had to perish twice,

I think I know enough of hate

To

In [5]:
# Creating a mapping from each unique character in the text to a unique number
# Storing all the unique characters present in the text 
vocabulary = sorted(list(set(text))) 

# Creating dictionaries to map each character to an index 
char_to_indices = dict((c, i) for i, c in enumerate(vocabulary)) 
indices_to_char = dict((i, c) for i, c in enumerate(vocabulary)) 

print(vocabulary) 

['\n', ' ', '!', ',', '-', '.', ':', ';', 'A', 'B', 'C', 'D', 'E', 'F', 'H', 'I', 'J', 'M', 'O', 'R', 'S', 'T', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', '—', '’']


In [6]:
# Pre-processing the data
# Dividing the text into overlapping subsequences of length max_length, 
# taken every `steps` characters. Each subsequence is one input, and the 
# character that follows it is the target the network learns to predict 
max_length = 100
steps = 5
sentences = [] 
next_chars = [] 
for i in range(0, len(text) - max_length, steps): 
	sentences.append(text[i: i + max_length]) 
	next_chars.append(text[i + max_length]) 
	
# One-hot encoding each character into a boolean vector 

# Initializing boolean arrays: X holds one one-hot vector per character 
# of each subsequence, y holds the one-hot vector of the character that follows 
# (np.bool was removed in NumPy 1.24, so the built-in bool is used instead) 
X = np.zeros((len(sentences), max_length, len(vocabulary)), dtype = bool) 
y = np.zeros((len(sentences), len(vocabulary)), dtype = bool) 

# Placing the value 1 at the appropriate position for each vector 
# to complete the one-hot encoding process 
for i, sentence in enumerate(sentences): 
	for t, char in enumerate(sentence): 
		X[i, t, char_to_indices[char]] = 1
	y[i, char_to_indices[next_chars[i]]] = 1
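
To make the windowing and one-hot encoding concrete, here is a small sketch of the same procedure on a toy string. The names `toy_text`, `window` and `stride`, and the values chosen for them, are assumptions for this illustration only; they are kept separate from the notebook's own variables so that nothing above is overwritten.

```python
import numpy as np

# Toy stand-ins for text, max_length and steps (illustrative values only)
toy_text = "hello world"
window, stride = 4, 2

toy_vocab = sorted(set(toy_text))
toy_char_to_idx = {c: i for i, c in enumerate(toy_vocab)}

# Each window of characters is an input; the character after it is the target
toy_inputs, toy_targets = [], []
for i in range(0, len(toy_text) - window, stride):
    toy_inputs.append(toy_text[i:i + window])
    toy_targets.append(toy_text[i + window])

# One-hot encode: exactly one True per character position
X_toy = np.zeros((len(toy_inputs), window, len(toy_vocab)), dtype=bool)
y_toy = np.zeros((len(toy_inputs), len(toy_vocab)), dtype=bool)
for i, chunk in enumerate(toy_inputs):
    for t, ch in enumerate(chunk):
        X_toy[i, t, toy_char_to_idx[ch]] = True
    y_toy[i, toy_char_to_idx[toy_targets[i]]] = True

print(toy_inputs)   # ['hell', 'llo ', 'o wo', 'worl']
print(toy_targets)  # ['o', 'w', 'r', 'd']
print(X_toy.shape, y_toy.shape)  # (4, 4, 8) (4, 8)
```

The shapes follow the pattern (number of windows, window length, vocabulary size) for the inputs and (number of windows, vocabulary size) for the targets, which is the layout the GRU layer below expects.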


In [7]:
#Building the GRU network
# Initializing the network 
model = Sequential() 

# Defining the cell type 
model.add(GRU(128, input_shape =(max_length, len(vocabulary)))) 

# Defining the densely connected Neural Network layer 
model.add(Dense(len(vocabulary))) 

# Applying softmax so the output is a probability distribution over characters 
model.add(Activation('softmax')) 

# Defining the optimizer (recent Keras versions name this argument 
# learning_rate; lr is the older, deprecated spelling) 
optimizer = RMSprop(learning_rate = 0.01) 

# Configuring the model for training 
model.compile(loss ='categorical_crossentropy', optimizer = optimizer) 


In [8]:
#Defining some helper functions which will be used during the training of the network
# Helper function to sample an index from a probability array 
def sample_index(preds, temperature = 1.0): 
	# temperature determines the freedom the function has when generating text: 
	# lower values favor the most likely characters, higher values add randomness 

	# Converting the predictions vector into a numpy array 
	preds = np.asarray(preds).astype('float64') 

	# Rescaling the predictions by the temperature and renormalizing 
	preds = np.log(preds) / temperature 
	exp_preds = np.exp(preds) 
	preds = exp_preds / np.sum(exp_preds) 

	# The main sampling step. Draws one sample from the rescaled 
	# distribution, returned as a one-hot array marking the 
	# character that was drawn 
	probas = np.random.multinomial(1, preds, 1) 

	# Returning the index of the sampled character, which becomes 
	# the next character in the generated text 
	return np.argmax(probas) 

# Helper function to generate text after the end of each epoch 
def on_epoch_end(epoch, logs): 
	print() 
	print('----- Generating text after Epoch: % d' % epoch) 

	# Choosing a random starting index for the text generation 
	start_index = random.randint(0, len(text) - max_length - 1) 

	# Sampling for different values of diversity 
	for diversity in [0.2, 0.5, 1.0, 1.2]: 
		print('----- diversity:', diversity) 

		generated = '' 

		# Seed sentence 
		sentence = text[start_index: start_index + max_length] 

		generated += sentence 
		print('----- Generating with seed: "' + sentence + '"') 
		sys.stdout.write(generated) 

		for i in range(400): 
			# Initializing the one-hot encoded input for the prediction 
			x_pred = np.zeros((1, max_length, len(vocabulary))) 

			for t, char in enumerate(sentence): 
				x_pred[0, t, char_to_indices[char]] = 1.

			# Making the predictions for the next character 
			preds = model.predict(x_pred, verbose = 0)[0] 

			# Sampling the index of the next character from the predictions 
			next_index = sample_index(preds, diversity) 

			# Getting the sampled next character using the mapping built 
			next_char = indices_to_char[next_index] 

			# Building the generated text 
			generated += next_char 
			sentence = sentence[1:] + next_char 

			sys.stdout.write(next_char) 
			sys.stdout.flush() 
		print() 

# Defining a custom callback that prints text generated 
# by the network at the end of each epoch 
print_callback = LambdaCallback(on_epoch_end = on_epoch_end) 

# Defining a callback to save the model after each epoch 
# in which the loss decreases 
filepath = "weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor ='loss', 
							verbose = 1, save_best_only = True, 
							mode ='min') 

# Defining a callback to reduce the learning rate each time 
# the loss plateaus 
reduce_alpha = ReduceLROnPlateau(monitor ='loss', factor = 0.2, 
							patience = 1, min_lr = 0.001) 
callbacks = [print_callback, checkpoint, reduce_alpha] 
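
The effect of the temperature (called diversity in `on_epoch_end`) can be seen on a toy distribution. The sketch below is an illustration only: `rescale_with_temperature` and `toy_probs` are hypothetical names, and unlike `sample_index` it subtracts the maximum before exponentiating, a common numerical-stability step that does not change the result.

```python
import numpy as np

def rescale_with_temperature(probs, temperature):
    """Rescale a probability vector the same way sample_index does before sampling."""
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs) / temperature
    # Subtracting the max is an added stability step; the normalized result is unchanged
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / np.sum(exp_logits)

# Made-up probabilities for three characters
toy_probs = [0.7, 0.2, 0.1]
for temperature in [0.2, 1.0, 1.2]:
    print(temperature, np.round(rescale_with_temperature(toy_probs, temperature), 3))
```

A temperature below 1 concentrates probability on the most likely character, a temperature of 1 leaves the distribution unchanged, and a temperature above 1 flattens it, which is why higher diversity values tend to produce more varied but noisier text.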


In [9]:
# Training the GRU model 
model.fit(X, y, batch_size = 128, epochs = 20, callbacks = callbacks) 

Epoch 1/20

----- Generating text after Epoch:  0
----- diversity: 0.2
----- Generating with seed: "led sea beneath him crawls;

He watches from his mountain walls,

And like a thunderbolt he falls.

"
led sea beneath him crawls;

He watches from his mountain walls,

And like a thunderbolt he falls.

                                                                                                                                                                                                                                                                                                                                                                                                                
----- diversity: 0.5
----- Generating with seed: "led sea beneath him crawls;

He watches from his mountain walls,

And like a thunderbolt he falls.

"
led sea beneath him crawls;

He watches from his mountain walls,

And like a thunderbolt he falls.

                d                d        s
 

I think I know enougmoped  ialpudirod’dehgthchefaalugl;ge t

raiglv tfo
sIwaenguiBhe cogTbiuntIn’w g hi e thsrrjcl’ergesday a asyfr fiye


Tap  iithwgnt  —F u
reaougs
g
powcoggr toTIden

w
Isalualid aso
IaAeiposthchd w
t t pfad yr,
d
TIsbyTest nt lDbrv

B’ mbsdnpsiwraauiwotedcahT
tIq


msIsa.awasssreg hrwogw ha eusor sHaioops

o
n auaOagn—jj mfDereast da a -uhlhs 
d dafih’ w
t  nirfiudy mneccluplvsjf;et ttsb



AAgaD
----- diversity: 1.2
----- Generating with seed: "d of desire

I hold with those who favor fire.

But if it had to perish twice,

I think I know enoug"
d of desire

I hold with those who favor fire.

But if it had to perish twice,

I think I know enoug u ret rafh wnireg,

T cIsrulwoukboy thasdohllefrbmlredad t m—cthe,
wrgss  Iah ,dn 
ou,

omegiivzkwomh

s
anDay;

Iwnse,






 on,sdjl DuronTy

Trst-rat ImmDubs
-Y alneirdedim myrasmthdnggvdsddouhlnwt oah fpithHr

iBz;IqrdhvosO-ci gaoua ay’ebuyirgdd
I,eOIinDflnsin

II wromgDn wosiue,

ATvwyo tprh snaBs

Tas alCdbssTa sou ind

<keras.callbacks.callbacks.History at 0x16e39b6aa58>

### Questionnaire

**1. The key difference between a GRU and an LSTM is that a GRU has two gates (the reset and update gates), whereas an LSTM has three gates (the input, output and forget gates). Why do we use a GRU when the LSTM clearly gives us more control over the network through its three gates? In which scenarios is a GRU preferred over an LSTM?**

[Solutions](https://github.com/ebi-byte/kt/blob/master/NN/GRU%20Solutions.ipynb)