<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#RECURRENT-NEURAL-NETWORKS-(RNNS)" data-toc-modified-id="RECURRENT-NEURAL-NETWORKS-(RNNS)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>RECURRENT NEURAL NETWORKS (RNNS)</a></span></li></ul></div>

## RECURRENT NEURAL NETWORKS (RNNS)
A defining characteristic of feedforward networks is that they take feature vectors of a fixed, predetermined size, combine them with their associated weights, and keep no hidden state. To process a sequence with a feedforward network, the entire sequence has to be presented as a single input, and the network must capture all relevant information in a single step. This makes it difficult to handle sequences of varying length and can fail to capture important information. Sequential data usually involves variable-length inputs. What we need is a model that still treats a sequence as a single input but, instead of processing it in one step, internally loops over the sequence elements, taking each element as input in turn and maintaining a state that holds information about what it has seen so far. This is the idea behind RNNs.


<b style="text-decoration:underline;font-size: 20px;text-transform: uppercase;">Recurrent neural networks or RNNs</b> are networks that contain recurrent connections and are often used for processing sequential data. RNNs assume that incoming data takes the form of a sequence of vectors or tensors, such as a sequence of words or characters in textual data, or a sequence of observations over a period of time in a time series.

Feedforward networks assume that inputs are independent of one another. In sequential data such as text (we limit ourselves to textual data, one of the most widespread forms of sequence data), this assumption does not hold: in a sentence, the occurrence of a word influences, and is influenced by, the occurrences of the other words.



RNNs have recurrent connections that allow a memory to persist in the network’s internal state, keeping track of the information observed so far and informing the decisions the network makes at later points in time.


RNNs also share parameters across different parts of the network, which makes it possible to apply the network to inputs of variable length and to generalize across them. This makes RNNs useful for time series forecasting and for natural language processing (NLP) tasks such as document classification, sentiment analysis, automatic translation, and text generation for applications such as chatbots. <b>Since the same parameters are used for all time steps, the parameterization cost of an RNN does not grow as the number of time steps increases.</b>
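The parameter-sharing claim can be checked with a little arithmetic. The sketch below assumes a simple recurrent layer with one input-to-hidden weight matrix, one hidden-to-hidden weight matrix, and one bias vector; the function name `simple_rnn_param_count` and the sizes used are illustrative only.

```python
def simple_rnn_param_count(input_dim, hidden_dim):
    """Count the parameters of one simple recurrent layer (assumed layout)."""
    input_to_hidden = input_dim * hidden_dim    # input-to-hidden weights
    hidden_to_hidden = hidden_dim * hidden_dim  # recurrent weights
    bias = hidden_dim                           # one bias per hidden unit
    return input_to_hidden + hidden_to_hidden + bias


# The sequence length never enters the formula, so the count is the same
# whether the layer is unrolled over 10 steps or over 1000 steps.
for num_steps in (10, 100, 1000):
    print(num_steps, simple_rnn_param_count(input_dim=16, hidden_dim=32))
# Each line reports 1568 parameters.
```

A feedforward network that received the whole sequence as one flat vector would instead need an input layer whose size grows with the number of time steps.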

From the book <a href='https://www.deeplearningbook.org/contents/rnn.html' target="_blank">Deep Learning, chapter 10</a>:

> Some examples of important design patterns for recurrent neural networks include the following:
>
> - Recurrent networks that produce an output at each time step and have recurrent connections between hidden units.
> - Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step.
> - Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output.


In this section, we consider a class of recurrent networks referred to as Elman networks (Elman, 1990), or simple recurrent networks, which serve as the basis for more complex approaches such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). Let us consider a simple RNN, which is typically a three-layer network comprising an input layer, a single hidden layer and an output layer.


<a href="#Fig">Figure 1</a> is a diagrammatic view of an RNN with input-to-hidden connections parametrized by a weight matrix $ U \in R^{d \times h} $, hidden-to-hidden recurrent connections parametrized by a weight matrix $ W \in R^{h \times h} $, and hidden-to-output connections parametrized by a weight matrix $ V \in R^{h \times o} $, with $ h \in R^{n \times h} $ representing the hidden state of the network. On the left, the RNN is drawn with its recurrent connections. On the right, the same network is shown as a time-unfolded computational graph, in which each node is associated with one particular time instance; it illustrates the computational logic of an RNN at adjacent time steps.

<img src="images/rnn.png">
<span id='Fig'>Figure 1</span>
<a href="https://www.google.com/search?q=rnn+image&client=firefox-b-d&tbm=isch&source=iu&ictx=1&fir=lD-kwEF8OCJIoM%252C5nGST21LG70DyM%252C_&vet=1&usg=AI4_-kTE51-vQdo1Mb1V3I10kNw5Xv3yAw&sa=X&ved=2ahUKEwir7rW18sjuAhVOXMAKHSm_CMQQ9QF6BAgHEAE&biw=1366&bih=580#imgrc=8TAzbbCVWa8qZM">source: RNN</a>

Most RNN computations can be decomposed into three blocks of parameters and their associated transformations or activation functions:
1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.


Armed with this summary of an RNN's computational decomposition, let us assume we have a minibatch of inputs $X^{t} \in R^{n \times d}$, where each row of $X^{t}$ corresponds to one example at time step ***t*** of the sequence, and let $ h^{t} \in R^{n \times h} $ be the hidden state at time ***t***. Unlike in standard feedforward networks, an RNN's current hidden state $h^{t}$ is a function $\phi$ of the previous hidden state $h^{t-1}$ and the current input $X^{t}$, defined by $$h^{t}=\phi(X^{t}U+ h^{t-1}W+b^{h} )$$ where $b^{h} \in R^{h}$ is the bias term and the weights $W$ determine how the network's current state makes use of past context when calculating the output for the current input.
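To see how $W$ carries past context forward, it can help to unroll the recurrence by hand for the first two time steps. The lines below simply substitute the definition into itself, writing $h^{0}$ for the initial hidden state:

$$h^{1}=\phi(X^{1}U+ h^{0}W+b^{h} )$$
$$h^{2}=\phi(X^{2}U+ h^{1}W+b^{h} )=\phi\big(X^{2}U+ \phi(X^{1}U+ h^{0}W+b^{h} )W+b^{h} \big)$$

The state $h^{2}$ therefore depends on both $X^{2}$ and $X^{1}$, and the same matrices $U$ and $W$ and the same bias $b^{h}$ appear at every step.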


The fact that computing the hidden state at time $t$ requires the value of the hidden state from time $t-1$ mandates an incremental inference algorithm that proceeds from the start of the sequence to the end; in this way, an RNN conditions the next word on the entire sentence history. With $h^{t}$ defined, an RNN can be viewed as a function $\phi$ that takes a state vector $h^{t-1}$ and an input vector $x^{t}$ and returns a new state vector $h^{t}$. The initial state vector $h^{0}$ is also an input to the RNN; it is usually assumed to be a zero vector and is often omitted. The hidden state $h^{t}$ is then used as the input to the output layer, which gives

$$h^{t}=\phi(X^{t}U+ h^{t-1}W+b^{h} )$$
 $$ O=f(h^{t}V +b^{o})$$
  $$\hat y=\phi (O)$$
  

 where $O \in R^{n \times o}$ and  $b^{o} \in R^{o}$.
 
 Layers that compute $h^{t}=\phi(X^{t}U+ h^{t-1}W+b^{h} )$ in RNNs are called recurrent layers.
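The recurrence and the output layer can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes $\phi=\tanh$ for the hidden state, applies no activation to the output, and uses small random weights and made-up sizes. The function name `rnn_forward` is hypothetical.

```python
import numpy as np


def rnn_forward(X, U, W, V, b_h, b_o):
    """Run a simple RNN over a batch of sequences.

    X has shape (T, n, d): T time steps, n examples, d input features.
    Returns hidden states of shape (T, n, h) and outputs of shape (T, n, o).
    """
    num_steps, n, _ = X.shape
    h_t = np.zeros((n, W.shape[0]))  # initial state h^0, assumed to be zero
    hidden_states, outputs = [], []
    for t in range(num_steps):
        # h^t = phi(X^t U + h^{t-1} W + b^h), with phi = tanh (assumption)
        h_t = np.tanh(X[t] @ U + h_t @ W + b_h)
        # O^t = h^t V + b^o, with an identity output activation (assumption)
        o_t = h_t @ V + b_o
        hidden_states.append(h_t)
        outputs.append(o_t)
    return np.stack(hidden_states), np.stack(outputs)


rng = np.random.default_rng(0)
T, n, d, h, o = 5, 4, 3, 8, 2  # illustrative sizes only
X = rng.normal(size=(T, n, d))
U = rng.normal(scale=0.1, size=(d, h))
W = rng.normal(scale=0.1, size=(h, h))
V = rng.normal(scale=0.1, size=(h, o))
b_h, b_o = np.zeros(h), np.zeros(o)

H, O = rnn_forward(X, U, W, V, b_h, b_o)
print(H.shape, O.shape)  # (5, 4, 8) (5, 4, 2)
```

The loop has to run in order because each `h_t` is needed to compute the next one, which is the incremental inference described above.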
 

In [1]:
from tensorflow.keras.layers import Embedding,Layer,Flatten,Dense,Dropout,SimpleRNN
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import tensorflow as tf
from tensorflow.keras.models import Sequential
import os,re,string
import numpy as np

### Loading the IMDB data
We restrict the movie reviews to the 15,000 most common words and look at only the first 228 tokens of every review. The network learns a 16-dimensional embedding for each of the 15,000 words.

In [2]:
!ls  './data/aclImdb/train/'

labeledBow.feat
neg
pos
unsupBow.feat
urls_neg.txt
urls_pos.txt
urls_unsup.txt


In [3]:
batch_size = 100
seed = 100
# The same seed and validation_split are used in both calls so that the
# training and validation subsets do not overlap.
tranin_data = tf.keras.preprocessing.text_dataset_from_directory(
    directory='./data/aclImdb/train/', batch_size=batch_size,
    validation_split=0.25, subset='training', seed=seed)
val_data = tf.keras.preprocessing.text_dataset_from_directory(
    directory='./data/aclImdb/train/', batch_size=batch_size,
    validation_split=0.25, subset='validation', seed=seed)

Found 25000 files belonging to 2 classes.
Using 18750 files for training.
Found 25000 files belonging to 2 classes.
Using 6250 files for validation.


In [4]:
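def remove_br(input_data):
    """Lowercase the text, strip HTML line breaks, and remove punctuation."""
    lowercase = tf.strings.lower(input_data)
    # Replace '<br />' with a space so the words on either side are not merged.
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation), '')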
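To see what this standardization does to a single review, here is a plain-Python analogue that uses only the standard library. It is a sketch for inspection, not the code the model runs: the `standardize` function and its `br_replacement` parameter are hypothetical, and the parameter exists only to show how the choice of replacement for `<br />` affects the tokens.

```python
import re
import string


def standardize(text, br_replacement=' '):
    """Lowercase, replace HTML line breaks, and strip ASCII punctuation."""
    lowercase = text.lower()
    stripped_html = lowercase.replace('<br />', br_replacement)
    return re.sub('[%s]' % re.escape(string.punctuation), '', stripped_html)


sample = "Great movie!<br />Loved it."
print(standardize(sample))                     # great movie loved it
print(standardize(sample, br_replacement=''))  # great movieloved it
```

Replacing the tag with an empty string glues the neighbouring words together, which produces tokens that are unlikely to appear in the vocabulary.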

In [5]:
max_features = 15000  # Maximum vocabulary size.
max_tokens = 228      # Length every review is truncated or padded to.

encoded_input = TextVectorization(max_tokens=max_features, output_mode='int',
                                  output_sequence_length=max_tokens,
                                  standardize=remove_br)
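The integer encoding performed by the vectorization layer can be pictured with a small pure-Python sketch. It assumes whitespace tokenization, a vocabulary ordered by frequency, index 0 reserved for padding and index 1 for out-of-vocabulary tokens. The helper names `build_vocab` and `encode` and the toy corpus are made up for illustration, and the real layer may break frequency ties differently.

```python
from collections import Counter


def build_vocab(texts, max_features):
    """Most frequent tokens first; index 0 is padding, index 1 is unknown."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = [token for token, _ in counts.most_common(max_features - 2)]
    return {token: index + 2 for index, token in enumerate(most_common)}


def encode(text, vocab, sequence_length):
    """Map tokens to integers, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


texts = ["the movie was great", "the movie was terrible", "the movie", "the"]
vocab = build_vocab(texts, max_features=5)  # keeps 'the', 'movie', 'was'
print(encode("the movie was great fun", vocab, sequence_length=6))
# [2, 3, 4, 1, 1, 0]
```

With a vocabulary limit of 5, only three real words fit next to the two reserved indices, so "great" and "fun" both map to the unknown index and the final position is padding.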

In [6]:
encoded_input.adapt(tranin_data.map(lambda x,y:x))

In [7]:
embedded_dim=16

In [8]:
model=Sequential()
model.add(encoded_input)
model.add(Embedding(input_dim=max_features,input_length=max_tokens,output_dim=embedded_dim))
model.add(SimpleRNN(32,return_sequences=True, activation="relu"))
model.add(SimpleRNN(32,return_sequences=True, activation="relu"))
model.add(SimpleRNN(32, activation="relu"))
model.add(Dense(1, activation='sigmoid'))
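The role of `return_sequences` in the stacked layers can be traced with a shape-only NumPy sketch. It uses random, untrained weights and a hypothetical helper `simple_rnn_layer`, so it mirrors only the tensor shapes of the model above, not its learned behaviour.

```python
import numpy as np


def simple_rnn_layer(inputs, units, return_sequences, rng):
    """Toy recurrent layer with random weights; inputs has shape (n, T, d)."""
    n, num_steps, d = inputs.shape
    U = rng.normal(scale=0.1, size=(d, units))
    W = rng.normal(scale=0.1, size=(units, units))
    b = np.zeros(units)
    h_t = np.zeros((n, units))
    states = []
    for t in range(num_steps):
        # ReLU recurrence, matching activation="relu" in the model above
        h_t = np.maximum(0.0, inputs[:, t, :] @ U + h_t @ W + b)
        states.append(h_t)
    # Either every hidden state (n, T, units) or only the last one (n, units).
    return np.stack(states, axis=1) if return_sequences else h_t


rng = np.random.default_rng(0)
batch, steps, vocab_size, embed_dim = 2, 228, 15000, 16
token_ids = rng.integers(0, vocab_size, size=(batch, steps))
embedding_table = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

x = embedding_table[token_ids]                                # (2, 228, 16)
x = simple_rnn_layer(x, 32, return_sequences=True, rng=rng)   # (2, 228, 32)
x = simple_rnn_layer(x, 32, return_sequences=True, rng=rng)   # (2, 228, 32)
x = simple_rnn_layer(x, 32, return_sequences=False, rng=rng)  # (2, 32)
print(x.shape)  # (2, 32)
```

The first two layers must return the full sequence because the next recurrent layer expects one input per time step; only the last layer collapses the sequence to a single vector for the dense classifier.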

In [9]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

In [10]:
# The datasets are already batched (batch_size=100 above), so batch_size is not passed to fit().
history = model.fit(tranin_data, epochs=10, validation_data=val_data)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [11]:
text = [
    "This movie is fantastic! I really like it because it is so good!",
    "Not a good movie!",
    "The movie was great!",
    "This movie really sucks! Can I get my money back please?",
    "The movie was terrible...",
    "This is a confused movie.",
    "The movie was great!",
    "This is a confused movie.",
    "The movie was terrible...",
]
for review in text:
    # The model starts with the TextVectorization layer, so it accepts raw strings.
    predictions = model.predict(np.array([review]))
    result = 'positive review' if predictions[0, 0] > 0.5 else 'negative review'
    print(result)

negative review
negative review
negative review
negative review
negative review
negative review
negative review
negative review
negative review


References

- <a href='https://www.deeplearningbook.org/contents/rnn.html'>Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press, pp. 389-413.</a>
- <a href="https://link.springer.com/article/10.1007/BF00114844">Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine learning, 7(2), 195-225.</a>
- <a href="https://arxiv.org/pdf/1412.7753.pdf">Mikolov, T., Joulin, A., Chopra, S., Mathieu, M., & Ranzato, M. A. (2014). Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.</a>