A neural network consists of layers of interconnected nodes, loosely modeled on the structure and function of the human brain. It learns patterns from large volumes of data, with a training algorithm adjusting its weights.

Different neural network architectures suit different kinds of problems. Let’s look at a few of them.
- Feed-Forward Neural Network: Used for general Regression and Classification problems.
- Convolutional Neural Network: Used for object detection and image classification.
- Deep Belief Network: Used in healthcare sectors for cancer detection.
- RNN: Used for speech recognition, voice recognition, time series prediction, and natural language processing.


# Recurrent Neural Network (RNN)
Traditional feedforward networks are poorly suited to learning from and predicting sequential or time series data. A mechanism is required that can retain past information to forecast future values. Recurrent neural networks, or RNNs for short, are a variant of conventional feedforward artificial neural networks that can deal with sequential data and can be trained to hold knowledge about the past.

An RNN works by saving the output of a layer and feeding it back into the input of that same layer at the next time step, so that it helps predict the layer's next output.

Below is how you can convert a Feed-Forward Neural Network into a Recurrent Neural Network. RNN contains networks with loops in them, allowing information to persist.

![image.png](attachment:image.png)

The nodes in different layers of the neural network are compressed to form a single recurrent layer. A, B, and C are the parameters of the network.

![](https://www.simplilearn.com/ice9/free_resources_article_thumb/Network_framework.gif 'gif')

Here, “x” is the input layer, “h” is the hidden layer, and “y” is the output layer. A, B, and C are the network parameters, which are learned during training to improve the output of the model. At any given time t, the hidden layer receives a combination of the current input x(t) and the hidden state from the previous step, which summarizes the inputs up to x(t-1). The output of the hidden layer at any given time is fed back into the network to improve the next output.
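
The loop described above can be written as a pair of equations. This is a sketch under assumptions: I take A to weight the input, B to weight the previous hidden state, and C to map the hidden state to the output (the figure's exact labeling may differ), and I assume a tanh activation with bias terms $b_{h}$ and $b_{y}$ that the figure does not show.

$$h_{t} = \tanh(A\,x_{t} + B\,h_{t-1} + b_{h})$$
$$y_{t} = C\,h_{t} + b_{y}$$

The same A, B, and C are reused at every time step; only $x_{t}$ and $h_{t-1}$ change.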

In the above diagram, a chunk of neural network, **h**, looks at some input **x** and outputs a value **y**. A loop allows information to be passed from one step of the network to the next. These loops make recurrent neural networks seem kind of mysterious. However, on closer inspection, they aren't all that different from a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.

![image.png](attachment:image.png)

This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They're the natural architecture of neural network to use for such data. 
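
To make the "multiple copies of the same network" idea concrete, here is a minimal sketch in plain Python. It uses a single scalar hidden unit, and the weight values `w_x`, `w_h` and the names `rnn_step` and `run_rnn` are made up for illustration. Every step calls the same function with the same weights; only the hidden state is passed along.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One copy of the network: mix the current input with the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, h0=0.0):
    """Unroll the loop: apply the same step once per element of the sequence."""
    h = h0
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h)  # the message passed to the successor
        states.append(h)
    return states

# Only the first input is non-zero, yet later states are still non-zero,
# because each step receives the state left behind by the previous one.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

The second and third inputs are zero, so anything left in those states comes purely from the message handed over by the earlier step.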

RNNs are applied to a variety of problems, including:
- Speech Recognition
- Language Modelling
- Language Translation: Given an input in one language, RNNs can be used to translate the input into different languages as output.
- Image Captioning: RNNs are used to caption an image by analyzing the activities present.
- Time Series Prediction: Time series problems, like predicting the prices of stocks in a particular month, can be approached using an RNN.

## Why RNN?

RNNs were created to address a few limitations of the feed-forward neural network:
- Cannot handle sequential data
- Considers only the current input
- Cannot memorize previous inputs

The solution to these issues is the RNN. An RNN can handle sequential data because it accepts both the current input and previously received inputs. RNNs can memorize previous inputs thanks to their internal memory.

## How Does a Recurrent Neural Network Work?

In a recurrent neural network, information cycles through a loop in the middle (hidden) layer.

![image.png](attachment:image.png)

The input layer ‘x’ takes in the input to the neural network and processes it and passes it onto the middle layer. 

The middle layer ‘h’ can consist of multiple hidden layers, each with its own activation functions, weights, and biases. In an ordinary network, the parameters of each hidden layer are independent of the layers before it, i.e., the network has no memory. When you need memory, you can use a recurrent neural network instead.

The Recurrent Neural Network will standardize the different activation functions and weights and biases so that each hidden layer has the same parameters. Then, instead of creating multiple hidden layers, it will create one and loop over it as many times as required. 

## Feed-Forward Neural Networks vs Recurrent Neural Networks

A feed-forward neural network allows information to flow only in the forward direction, from the input nodes, through the hidden layers, and to the output nodes. There are no cycles or loops in the network. 

Below is a simplified representation of a feed-forward neural network:

![image.png](attachment:image.png)

In a feed-forward neural network, decisions are based only on the current input. The network does not memorize past data, so it cannot use earlier inputs to inform later predictions. Feed-forward neural networks are used in general regression and classification problems.
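
The difference in memory can be seen with a toy comparison. This is only a sketch: both "networks" are single scalar units with made-up weights, and the function names `feed_forward` and `recurrent` are mine. Two sequences end with the same value but have different histories.

```python
import math

def feed_forward(x, w=0.7, b=0.1):
    """Output depends on the current input only."""
    return math.tanh(w * x + b)

def recurrent(sequence, w_x=0.7, w_h=0.9, b=0.1):
    """Output depends on the current input and on the state carried forward."""
    h = 0.0
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h + b)
    return h

seq_a = [0.0, 0.0, 1.0]
seq_b = [2.0, 2.0, 1.0]

# The feed-forward unit sees only the last element, so both results match.
ff_a, ff_b = feed_forward(seq_a[-1]), feed_forward(seq_b[-1])

# The recurrent unit has seen different histories, so the results differ.
rnn_a, rnn_b = recurrent(seq_a), recurrent(seq_b)

print(ff_a == ff_b, rnn_a == rnn_b)
```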

## Types of RNN

- One to One RNN: This type of neural network is known as the Vanilla Neural Network. It's used for general machine learning problems that have a single input and a single output.
![image.png](attachment:image.png)

- One to Many RNN: This type of neural network has a single input and multiple outputs. An example of this is image captioning. ![image-2.png](attachment:image-2.png)

- Many to One RNN: This RNN takes a sequence of inputs and generates a single output. Sentiment analysis is a good example of this kind of network, where a given sentence can be classified as expressing positive or negative sentiment. ![image-3.png](attachment:image-3.png)

- Many to Many RNN: This RNN takes a sequence of inputs and generates a sequence of outputs. Machine translation is one example. ![image-4.png](attachment:image-4.png)
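
The types above differ mainly in which inputs are fed in and which outputs are kept, not in the recurrent step itself. The sketch below illustrates this with a scalar toy unit. The weights and function names are hypothetical, and feeding zeros after the first step in `one_to_many` is just one simple convention I assume here.

```python
import math

def step(x_t, h_prev, w_x=0.5, w_h=0.8):
    """Shared recurrent step used by every variant below."""
    return math.tanh(w_x * x_t + w_h * h_prev)

def many_to_many(sequence):
    """Keep the output of every time step."""
    h, outputs = 0.0, []
    for x_t in sequence:
        h = step(x_t, h)
        outputs.append(h)
    return outputs

def many_to_one(sequence):
    """Keep only the output of the final time step."""
    return many_to_many(sequence)[-1]

def one_to_many(x, n_steps):
    """Feed a single input at the first step, then let the state evolve."""
    h, outputs = 0.0, []
    x_t = x
    for _ in range(n_steps):
        h = step(x_t, h)
        outputs.append(h)
        x_t = 0.0  # assumption: no further input after the first step
    return outputs

print(many_to_many([1.0, 2.0, 3.0]))
print(many_to_one([1.0, 2.0, 3.0]))
print(one_to_many(1.0, 4))
```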

## Problems with Standard RNN

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.
![image.png](attachment:image.png)
But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.
![image-2.png](attachment:image-2.png)

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them.

## Two Issues of Standard RNNs

- Vanishing Gradient Problems: Recurrent Neural Networks enable you to model time-dependent and sequential data problems, such as stock market prediction, machine translation, and text generation. You will find, however, that RNNs are hard to train because of the gradient problem.
    
    RNNs suffer from the problem of vanishing gradients. The gradients carry the information used to update the RNN's parameters, and when a gradient becomes too small, the parameter updates become insignificant. This makes learning long data sequences difficult.
    ![image.png](attachment:image.png)
    
- Exploding Gradient Problems: While training a neural network, if the gradient tends to grow exponentially instead of decaying, this is called an exploding gradient. This problem arises when large error gradients accumulate, resulting in very large updates to the neural network model weights during the training process.

    Long training times, poor performance, and low accuracy are the main consequences of these gradient problems.
    
    ![image-2.png](attachment:image-2.png)
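
A rough way to see both problems numerically: when the error is propagated back through the time steps, the gradient is multiplied by some factor at each step. The sketch below assumes that factor is one constant scalar, which is a simplification (in a real RNN it depends on the recurrent weights and the activation derivative at each step). The function name `gradient_scale` is made up for illustration.

```python
def gradient_scale(factor, n_steps):
    """Multiply a starting gradient of 1.0 by the same factor n_steps times."""
    scale = 1.0
    for _ in range(n_steps):
        scale *= factor
    return scale

# A per-step factor below 1 shrinks the gradient towards zero (vanishing).
vanishing = gradient_scale(0.5, 50)

# A per-step factor above 1 blows the gradient up (exploding).
exploding = gradient_scale(1.5, 50)

print(vanishing, exploding)
```

After only 50 steps the first value is far too small to produce a meaningful weight update, while the second is in the hundreds of millions.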

Now, let’s discuss one of the most popular and effective ways to deal with these gradient problems: Long Short-Term Memory networks (LSTMs).

## LSTM Networks

Long Short Term Memory networks, usually just called “LSTMs”, are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

![image.png](attachment:image.png)

LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

![image-2.png](attachment:image-2.png)

Let’s just try to get comfortable with the notation we’ll be using.
![image-3.png](attachment:image-3.png)

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

## How LSTMs Work

![image.png](attachment:image.png)
LSTMs work in a 3-step process.

### Step 1: Decide How Much Past Data It Should Remember

The first step in the LSTM is to decide which information should be omitted from the cell in that particular time step. The sigmoid function determines this. It looks at the previous state $h_{t-1}$ along with the current input $x_{t}$ and computes the function.
$$f_{t} = \sigma(W_{f} \cdot [h_{t-1}, x_{t}] + b_{f})$$

Here $f_{t}$ is the forget gate, which decides which information from the previous time step is not important and should be deleted.

Consider the following two sentences:

Let the output of $h_{t-1}$ be “Alice is good in Physics. John, on the other hand, is good at Chemistry.”

Let the current input at $x_{t}$ be “John plays football well. He told me yesterday over the phone that he had served as the captain of his college football team.”

The forget gate realizes there might be a change in context after encountering the first full stop. It compares with the current input sentence at $x_{t}$. The next sentence talks about John, so the information on Alice is deleted. The position of the subject is vacated and assigned to John.

### Step 2: Decide How Much This Unit Adds to the Current State 

The second step has two parts: a sigmoid layer and a tanh layer. The sigmoid layer decides which values to let through, producing a weight between 0 (block) and 1 (pass). The tanh layer produces candidate values between -1 and 1, which express how much weight each passed value should carry.

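$$\begin{aligned}
i_{t} &= \sigma(W_{i} \cdot [h_{t-1}, x_{t}] + b_{i}) \\
\tilde{C}_{t} &= \tanh(W_{C} \cdot [h_{t-1}, x_{t}] + b_{C})
\end{aligned}$$

Here $i_{t}$ is the input gate, which determines which information to let through based on its significance in the current time step, and $\tilde{C}_{t}$ holds the candidate values produced by the tanh layer.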

With the current input at x(t), the input gate analyzes the important information — John plays football, and the fact that he was the captain of his college team is important.

“He told me yesterday over the phone” is less important; hence it's forgotten. This process of adding some new information can be done via the input gate.
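
The steps so far produce a forget gate and an input gate, but the text does not spell out how the cell state itself is updated. In the standard LSTM formulation (stated here as background rather than taken from this tutorial's figures), the new cell state keeps the part of the old state that the forget gate lets through and adds the gated candidate values. Writing $\tilde{C}_{t}$ for the output of the tanh layer in Step 2 and $*$ for element-wise multiplication:

$$C_{t} = f_{t} * C_{t-1} + i_{t} * \tilde{C}_{t}$$

This $C_{t}$ is the cell state that Step 3 passes through tanh to produce the output.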

### Step 3: Decide What Part of the Current Cell State Makes It to the Output

The third step is to decide what the output will be. First, we run a sigmoid layer, which decides what parts of the cell state make it to the output. Then, we put the cell state through tanh to push the values to be between -1 and 1 and multiply it by the output of the sigmoid gate.

$$\begin{aligned}
o_{t} &= \sigma(W_{o} \cdot [h_{t-1}, x_{t}] + b_{o}) \\
h_{t} &= o_{t} * \tanh(C_{t})
\end{aligned}$$

Here $o_{t}$ is the output gate, which allows the passed-in information to impact the output in the current time step.

Let’s consider this example to predict the next word in the sentence: “John played tremendously well against the opponent and won for his team. For his contributions, brave ____ was awarded player of the match.”

There could be many choices for the empty space. The current input brave is an adjective, and adjectives describe a noun. So, “John” could be the best output after brave.
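
To tie the three steps together, here is a minimal sketch of a single LSTM step in plain Python. It is deliberately simplified: the input and the states are scalars, so each product $W \cdot [h_{t-1}, x_{t}]$ becomes `w_h * h_prev + w_x * x_t`. The parameter values and the names `lstm_step` and `params` are made up for illustration, and the cell state update follows the standard LSTM formulation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for scalar input and state.

    p maps a gate name ("f", "i", "c", "o") to a tuple (w_h, w_x, b).
    """
    def pre(name):
        w_h, w_x, b = p[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("f"))             # Step 1: how much of the old state to keep
    i_t = sigmoid(pre("i"))             # Step 2: how much new information to let in
    c_tilde = math.tanh(pre("c"))       # Step 2: candidate values
    c_t = f_t * c_prev + i_t * c_tilde  # updated cell state
    o_t = sigmoid(pre("o"))             # Step 3: how much of the state to expose
    h_t = o_t * math.tanh(c_t)          # Step 3: output / new hidden state
    return h_t, c_t

# Hypothetical parameters: the same small weights for every gate.
params = {name: (0.5, 0.5, 0.0) for name in "fico"}

h, c = 0.0, 0.0
for x_t in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

Setting the forget gate's bias to a large negative value makes `f_t` close to 0, which wipes the old cell state, while a large positive value keeps it almost unchanged. That is the "forgetting" behaviour described in Step 1.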

# Practical Example

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import os
import numpy as np

Architecture for building a simple one-layer, one-neuron RNN:
![image.png](attachment:image.png)

In [12]:
class SingleRNN(nn.Module):
    def __init__(self, n_inputs, n_neurons):
        super(SingleRNN, self).__init__()
        self.Wx = torch.randn(n_inputs, n_neurons) # 4 x 1
        self.Wy = torch.randn(n_neurons, n_neurons) # 1 x 1
        self.b = torch.zeros(1, n_neurons) # 1 x 1
        
        
    def forward(self, X0, X1):
        self.Y0 = torch.tanh(torch.mm(X0, self.Wx)+self.b) # 4 x 1
        self.Y1 = torch.tanh(torch.mm(self.Y0, self.Wy)+torch.mm(X1, self.Wx)+self.b) # 4 x 1
        return self.Y0, self.Y1

Here, I have initialized two weight matrices namely `Wx` and `Wy` with values from a random normal distribution. `Wx` contains connection weights for the inputs of the current time step, while `Wy` contains connection weights for the outputs of the previous time step. We also added a bias `b`. The `forward` function computes two outputs - one for each time step (two overall). Note that we are using `tanh` as the non-linearity (activation function).
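
One thing worth knowing about the class above: tensors created with `torch.randn` and stored as plain attributes are not registered as learnable parameters, so an optimizer would not see them. That is fine for a forward-pass demonstration like this one. If you wanted to train the model, one option is to wrap the tensors in `nn.Parameter`, as in the sketch below. The class name `TrainableSingleRNN` is hypothetical and not part of the original notebook.

```python
import torch
import torch.nn as nn

class TrainableSingleRNN(nn.Module):
    def __init__(self, n_inputs, n_neurons):
        super().__init__()
        # nn.Parameter registers each tensor so that model.parameters() returns it
        self.Wx = nn.Parameter(torch.randn(n_inputs, n_neurons))
        self.Wy = nn.Parameter(torch.randn(n_neurons, n_neurons))
        self.b = nn.Parameter(torch.zeros(1, n_neurons))

    def forward(self, X0, X1):
        Y0 = torch.tanh(torch.mm(X0, self.Wx) + self.b)
        Y1 = torch.tanh(torch.mm(Y0, self.Wy) + torch.mm(X1, self.Wx) + self.b)
        return Y0, Y1

model = TrainableSingleRNN(4, 1)

# Count the registered parameters: 4 (Wx) + 1 (Wy) + 1 (b)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)

Y0, Y1 = model(torch.zeros(4, 4), torch.ones(4, 4))
print(Y0.shape, Y1.shape)
```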

As for the input, I have provided a batch of 4 instances, each a sequence of two time steps with 4 features per step.
For illustration purposes, this is how the data is being fed into the RNN Model:
![image.png](attachment:image.png)

In [13]:
N_INPUT = 4
N_NEURONS = 1

X0_batch = torch.tensor([[0,1,2,0],
                        [3,4,5,0],
                        [6,7,8,0],
                        [9,0,1,0]], dtype = torch.float) # at t=0=>4x4

X1_batch = torch.tensor([[9,8,7,0],
                        [0,0,0,0],
                        [6,5,4,0],
                        [3,2,1,0]], dtype = torch.float) # at t=1=>4x4

model = SingleRNN(N_INPUT, N_NEURONS)

Y0_val, Y1_val = model(X0_batch, X1_batch)

We obtain one output per time step, each of size `4x1`, where 4 is the batch size and 1 is the number of hidden units.

In [15]:
Y0_val, Y1_val

(tensor([[0.5958],
         [0.9998],
         [1.0000],
         [0.9999]]),
 tensor([[1.0000],
         [0.5901],
         [1.0000],
         [0.9991]]))

### Increasing Neurons in RNN Layer

We can generalize the RNN above so that its single layer supports `n` neurons. In terms of the architecture, nothing really changes, since we have already parameterized the number of neurons in the computation graph we built. However, the size of the output changes, since we have changed the number of units (i.e., neurons) in the RNN layer.

Architecture for an RNN whose layer supports `n` neurons (in this case, 5):

![image.png](attachment:image.png)

In [21]:
class BasicRNN(nn.Module):
    def __init__(self, n_inputs, n_neurons):
        super(BasicRNN,self).__init__()
        self.Wx = torch.randn(n_inputs, n_neurons) # n_inputs x n_neurons
        self.Wy = torch.randn(n_neurons, n_neurons) # n_neurons x n_neurons
        self.b = torch.zeros(1, n_neurons) # 1 x n_neurons
        
    
    def forward(self, X0, X1):
        self.Y0 = torch.tanh(torch.mm(X0, self.Wx) + self.b) # batch_size x n_neurons
        self.Y1 = torch.tanh(torch.mm(self.Y0, self.Wy)+torch.mm(X1, self.Wx)+self.b)
        return self.Y0, self.Y1

In [22]:
N_INPUT = 3 # number of features in input
N_NEURONS = 5 # number of units in layer

X0_batch = torch.tensor([[0,1,2], [3,4,5], 
                         [6,7,8], [9,0,1]],
                        dtype = torch.float) #t=0 => 4 X 3

X1_batch = torch.tensor([[9,8,7], [0,0,0], 
                         [6,5,4], [3,2,1]],
                        dtype = torch.float) #t=1 => 4 X 3

model = BasicRNN(N_INPUT, N_NEURONS)

Y0_val, Y1_val = model(X0_batch, X1_batch)

We obtain one output per time step, each of size `4x5`, where 4 is the batch size and 5 is the number of neurons.

In [24]:
Y0_val, Y1_val

(tensor([[-0.9924, -0.9349, -0.9175, -0.9973,  0.9942],
         [-1.0000,  0.9996, -0.9050, -1.0000,  1.0000],
         [-1.0000,  1.0000, -0.8907, -1.0000,  1.0000],
         [-0.9998,  1.0000,  0.9883, -0.6148,  1.0000]]),
 tensor([[-1.0000,  1.0000, -0.9639, -1.0000,  1.0000],
         [ 0.9536,  0.9777, -0.9989,  0.9560, -0.4159],
         [-1.0000,  1.0000, -0.9654, -1.0000,  1.0000],
         [-0.9969,  1.0000, -0.9302, -0.8225,  1.0000]]))

If we take a closer look at the BasicRNN computation graph built above, however, we can find a serious flaw. What if we wanted to build an architecture that supports extremely long input and output sequences? The way it is currently built, we would have to compute the output for every time step individually, increasing the lines of code needed to implement the desired computation graph. Below I will show you how to consolidate and implement this more efficiently and cleanly using the built-in RNNCell module.

Let’s first try to implement this informally to analyze the role RNNCell plays:

In [25]:
rnn = nn.RNNCell(3, 5) # n_input X n_neurons

X_batch = torch.tensor([[[0,1,2], [3,4,5], 
                         [6,7,8], [9,0,1]],
                        [[9,8,7], [0,0,0], 
                         [6,5,4], [3,2,1]]
                       ], dtype = torch.float) # X0 and X1

hx = torch.randn(4, 5) # m X n_neurons
output = []

# for each time step
for i in range(2):
    hx = rnn(X_batch[i], hx)
    output.append(hx)

In [26]:
output

[tensor([[ 0.1367, -0.3844,  0.2619,  0.4333,  0.0134],
         [-0.9646, -0.9779,  0.9426, -0.9457,  0.6075],
         [-0.9867, -0.9999,  0.7206, -0.9211, -0.8564],
         [-0.9127, -0.2206,  0.9966, -0.9993, -0.9072]],
        grad_fn=<TanhBackward0>),
 tensor([[-0.9928, -0.9998,  0.9459, -0.9983, -0.6497],
         [ 0.8722,  0.5926, -0.6301,  0.5173, -0.5845],
         [-0.4334, -0.9781,  0.0501, -0.9670, -0.8408],
         [ 0.5823, -0.5676, -0.2860, -0.6979, -0.8324]],
        grad_fn=<TanhBackward0>)]

With the above code, we have basically implemented the same model as `BasicRNN`. `torch.nn.RNNCell(...)` does all the magic of creating and maintaining the necessary weights and biases for us. It accepts an input tensor and the previous hidden state, and outputs the next hidden state for each element in the batch.
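
If you do not need to control each step yourself, PyTorch also provides `nn.RNN`, which runs the loop over time steps internally. The sketch below is an alternative to the explicit `RNNCell` loop, not something the notebook itself uses. It assumes the default layout of `(seq_len, batch, input_size)`, which happens to match how `X_batch` is arranged here, and a zero initial hidden state.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=3, hidden_size=5)  # single layer, tanh by default

X_batch = torch.tensor([[[0, 1, 2], [3, 4, 5],
                         [6, 7, 8], [9, 0, 1]],
                        [[9, 8, 7], [0, 0, 0],
                         [6, 5, 4], [3, 2, 1]]
                       ], dtype=torch.float)  # 2 time steps x 4 instances x 3 features

h0 = torch.zeros(1, 4, 5)  # num_layers x batch x hidden_size

# output holds the hidden state at every time step, h_n only the final one
output, h_n = rnn(X_batch, h0)

print(output.shape)  # seq_len x batch x hidden_size
print(h_n.shape)     # num_layers x batch x hidden_size
```

For a single-layer, unidirectional RNN like this one, the last slice of `output` is the same tensor of values as `h_n[0]`.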

In [28]:
class CleanBasicRNN(nn.Module):
    def __init__(self, batch_size, n_inputs, n_neurons):
        super(CleanBasicRNN, self).__init__()
        self.rnn = nn.RNNCell(n_inputs, n_neurons)
        self.hx = torch.randn(batch_size, n_neurons) #initialize hidden state
        
    
    def forward(self,X):
        output = []
        
        # for each time step (the first dimension of X)
        for i in range(X.size(0)):
            self.hx = self.rnn(X[i], self.hx)
            output.append(self.hx)
        
        return output, self.hx
    
    
FIXED_BATCH_SIZE = 4 # our batch size is fixed for now
N_INPUT = 3
N_NEURONS = 5

X_batch = torch.tensor([[[0,1,2], [3,4,5], 
                         [6,7,8], [9,0,1]],
                        [[9,8,7], [0,0,0], 
                         [6,5,4], [3,2,1]]
                       ], dtype = torch.float) # X0 and X1

model = CleanBasicRNN(FIXED_BATCH_SIZE, N_INPUT, N_NEURONS)
output_val, states_val = model(X_batch)

In [29]:
output_val # contains all output for all timesteps

[tensor([[ 0.3489, -0.2178, -0.7767, -0.0347, -0.1622],
         [ 0.9972,  0.4597, -0.2447, -0.3591,  0.6597],
         [ 1.0000, -0.2101,  0.6157,  0.1537,  0.4235],
         [ 0.9989, -0.4978, -0.7464,  0.6646, -0.9919]],
        grad_fn=<TanhBackward0>),
 tensor([[ 1.0000, -0.8343,  0.1239, -0.6980, -0.3941],
         [-0.6669, -0.2945, -0.1055,  0.1857, -0.5236],
         [ 0.9999, -0.7028,  0.1643, -0.2523, -0.7879],
         [ 0.8953, -0.2621, -0.4800, -0.1950, -0.3313]],
        grad_fn=<TanhBackward0>)]

In [30]:
states_val # contains values for final state for final timestep

tensor([[ 1.0000, -0.8343,  0.1239, -0.6980, -0.3941],
        [-0.6669, -0.2945, -0.1055,  0.1857, -0.5236],
        [ 0.9999, -0.7028,  0.1643, -0.2523, -0.7879],
        [ 0.8953, -0.2621, -0.4800, -0.1950, -0.3313]],
       grad_fn=<TanhBackward0>)
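
The same looping pattern carries over to the LSTM described earlier. As a sketch, and not part of the original notebook, `nn.LSTMCell` can replace `nn.RNNCell`. The main difference is that it carries two states, the hidden state and the cell state, which are passed in and returned as a tuple. Starting both states at zero is an assumption made here for simplicity.

```python
import torch
import torch.nn as nn

lstm_cell = nn.LSTMCell(3, 5)  # n_inputs x n_neurons

X_batch = torch.tensor([[[0, 1, 2], [3, 4, 5],
                         [6, 7, 8], [9, 0, 1]],
                        [[9, 8, 7], [0, 0, 0],
                         [6, 5, 4], [3, 2, 1]]
                       ], dtype=torch.float)  # X0 and X1

hx = torch.zeros(4, 5)  # hidden state: batch x n_neurons
cx = torch.zeros(4, 5)  # cell state:   batch x n_neurons
outputs = []

# for each time step
for t in range(X_batch.size(0)):
    hx, cx = lstm_cell(X_batch[t], (hx, cx))
    outputs.append(hx)

print(len(outputs), hx.shape, cx.shape)
```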