![Alt text](rnn.jpg)

# Recurrent Neural Networks (RNN)

## Overview
Recurrent Neural Networks (RNNs) are a class of neural networks designed for processing sequential data. Unlike traditional feedforward networks, RNNs have connections that loop back on themselves, allowing them to maintain a state or memory of previous inputs.

## Mathematical Foundations

1. **Basic Structure**: 
   - An RNN consists of an input layer, hidden layer(s), and output layer. The key feature is the recurrent connection in the hidden layer.
   
2. **Mathematical Representation**:
   - Given an input sequence \( x = (x_1, x_2, \ldots, x_T) \), the hidden state \( h_t \) at time step \( t \) is computed as:
   $$
   h_t = \sigma(W_h h_{t-1} + W_x x_t + b)
   $$
   where:
   - \( W_h \) is the weight matrix for the hidden state,
   - \( W_x \) is the weight matrix for the input,
   - \( b \) is the bias,
   - \( \sigma \) is an activation function (often \( \tanh \) or ReLU).
   
3. **Output Calculation**:
   - The output \( y_t \) at time step \( t \) can be computed as:
   $$
   y_t = W_y h_t + b_y
   $$
   where \( W_y \) is the weight matrix for the output layer and \( b_y \) is the output bias.

4. **Training with Backpropagation Through Time (BPTT)**:
   - The RNN is trained using BPTT, which unfolds the network across time steps and applies backpropagation to update weights.
   - The loss function (often mean squared error for regression or cross-entropy for classification) is computed, and gradients are propagated back through time.
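
The hidden-state and output equations above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the sizes, the random initialisation, and the name `rnn_forward` are assumptions made for the example, and \( \sigma \) is taken to be \( \tanh \).

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b, W_y, b_y, h0=None):
    """Run a vanilla RNN over a sequence xs of shape (T, input_size)."""
    hidden_size = W_h.shape[0]
    h = np.zeros(hidden_size) if h0 is None else h0
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)   # h_t = sigma(W_h h_{t-1} + W_x x_t + b)
        hs.append(h)
        ys.append(W_y @ h + b_y)               # y_t = W_y h_t + b_y
    return np.stack(hs), np.stack(ys)

# Illustrative sizes (assumed, not taken from the text).
rng = np.random.default_rng(0)
T, input_size, hidden_size, output_size = 5, 3, 4, 2
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)
W_y = rng.normal(scale=0.5, size=(output_size, hidden_size))
b_y = np.zeros(output_size)

xs = rng.normal(size=(T, input_size))
hs, ys = rnn_forward(xs, W_h, W_x, b, W_y, b_y)
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

The same five arrays are reused at every iteration of the loop, which is what "recurrent" means in practice.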

## Learning Mechanism
- RNNs learn to predict the next item in a sequence based on previous items by adjusting weights to minimize the loss.
- They capture temporal dependencies by maintaining a hidden state that is updated at each time step.

## Applications
RNNs are used in various applications, including:

1. **Natural Language Processing (NLP)**:
   - Language modeling
   - Sentiment analysis
   - Machine translation
   - Text generation

2. **Speech Recognition**:
   - Converting spoken language into text.

3. **Time Series Prediction**:
   - Forecasting stock prices, weather, etc.

4. **Music Generation**:
   - Composing music based on previous notes.

5. **Video Analysis**:
   - Activity recognition in video sequences.

## Advantages
- **Temporal Dynamics**: RNNs can process sequences of varying lengths and capture temporal relationships between their elements.
- **Parameter Sharing**: The same weights are applied at each time step, leading to a smaller model size.
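
To make the parameter-sharing point concrete, the short sketch below counts the trainable parameters of the vanilla RNN defined earlier (\( W_h \), \( W_x \), \( b \), \( W_y \), \( b_y \)). The function name `rnn_param_count` and the sizes are assumptions chosen for illustration.

```python
def rnn_param_count(input_size, hidden_size, output_size):
    """Number of trainable parameters in a vanilla RNN with one hidden layer."""
    # W_h (hidden x hidden), W_x (hidden x input), b (hidden)
    hidden = hidden_size * hidden_size + hidden_size * input_size + hidden_size
    # W_y (output x hidden), b_y (output)
    output = output_size * hidden_size + output_size
    return hidden + output

# The sequence length T does not appear anywhere in the count.
params = rnn_param_count(input_size=3, hidden_size=4, output_size=2)
print(params)  # 42
```

Whether the sequence has 10 steps or 10,000, the count stays the same because every step reuses the same matrices.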

## Disadvantages
- **Vanishing and Exploding Gradients**: Training can be unstable due to the vanishing or exploding gradient problems, particularly in long sequences.
- **Long-Term Dependencies**: Standard RNNs struggle to learn long-range dependencies, which is addressed by architectures like LSTMs and GRUs.
- **Training Time**: Training can be slow because each time step depends on the hidden state of the previous one, so the computation cannot be parallelized across time steps.
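
The vanishing and exploding gradient problem can be illustrated numerically. The sketch below makes a simplifying assumption: it ignores the activation derivative, so each backward step only multiplies the gradient by \( W_h^\top \). The helper name `backprop_norms` and the matrix sizes are hypothetical.

```python
import numpy as np

def backprop_norms(W_h, steps, seed=0):
    """Norm of a gradient vector after repeatedly multiplying by W_h^T.

    Simplifying assumption: the activation derivative is ignored
    (linear recurrence), so each backward step is g <- W_h^T g.
    """
    rng = np.random.default_rng(seed)
    g = rng.normal(size=W_h.shape[0])
    g /= np.linalg.norm(g)                 # start from a unit-norm gradient
    norms = []
    for _ in range(steps):
        g = W_h.T @ g
        norms.append(np.linalg.norm(g))
    return np.array(norms)

# An orthogonal matrix Q preserves norms, so scaling it fixes the growth rate exactly.
Q, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(8, 8)))
small = backprop_norms(0.5 * Q, steps=50)   # shrinks like 0.5**k
large = backprop_norms(1.5 * Q, steps=50)   # grows like 1.5**k
print(small[-1], large[-1])
```

With a \( \tanh \) activation the derivative is at most 1, so the real gradient can only shrink further compared with this linear picture.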

## Variants
1. **LSTM (Long Short-Term Memory)**: Designed to better capture long-range dependencies with special gating mechanisms.
2. **GRU (Gated Recurrent Unit)**: A simpler alternative to LSTMs with fewer parameters, retaining the ability to learn long-term dependencies.
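
As an illustration of what a gating mechanism looks like, here is a minimal sketch of a single GRU step in NumPy. The weight names in the dictionary `p`, the sizes, and the convention \( h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}_t \) are assumptions for this example; some references swap the roles of \( z \) and \( 1 - z \).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update. p is a dict of weights; all names are illustrative."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    h_cand = np.tanh(p["W_c"] @ x_t + p["U_c"] @ (r * h_prev) + p["b_c"])
    return (1.0 - z) * h_prev + z * h_cand   # blend old state and candidate

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                           # assumed sizes
p = {}
for gate in ("z", "r", "c"):
    p[f"W_{gate}"] = rng.normal(scale=0.5, size=(n_hid, n_in))
    p[f"U_{gate}"] = rng.normal(scale=0.5, size=(n_hid, n_hid))
    p[f"b_{gate}"] = np.zeros(n_hid)

h_prev = rng.normal(size=n_hid)
x_t = rng.normal(size=n_in)
h_next = gru_step(x_t, h_prev, p)

# With the update gate pushed towards 0, the old state is copied almost unchanged.
p_closed = dict(p, b_z=np.full(n_hid, -50.0))
h_kept = gru_step(x_t, h_prev, p_closed)
```

The last two lines show why gates help with long-range dependencies: when the update gate is closed, the state passes through a step untouched instead of being squashed by \( \tanh \) again.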

## Conclusion
RNNs are powerful tools for sequence prediction and analysis, particularly in fields like NLP and time series forecasting. While they have limitations, advancements like LSTMs and GRUs have enhanced their capabilities, making them a cornerstone of modern deep learning architectures.


![Alt text](rnn_1.png)

### Recurrent Neural Network (RNN) Overview

The diagram represents the structure of a Recurrent Neural Network (RNN), where the same function (denoted as "A") is applied across different time steps to process sequential data.

#### Key Components:

- **A**: The RNN unit, which is the same across all time steps. It processes both the input at the current time step and the hidden state from the previous time step.
- **\( h_t \)**: The hidden state at time step \( t \). It is a function of:
  - The input \( x_t \) at the current time step.
  - The hidden state \( h_{t-1} \) from the previous time step.
  The hidden state captures information from both the current input and previous time steps.
- **\( x_t \)**: The input at time step \( t \).

#### How it works:
- The RNN unit processes each input \( x_t \) sequentially.
- The loop inside each RNN unit represents the recurrent nature of the network, where information from the previous time step is used in the current step.
- This allows the network to maintain a memory of previous inputs, making it ideal for tasks involving sequences, like time-series prediction and natural language processing.

#### Example Use Cases:
- Time-series data (e.g., stock prices, weather data)
- Natural language processing (e.g., text generation, translation)
- Sequence classification and prediction tasks

#### RNN Computation:
At each time step \( t \), the hidden state is updated as:
$$
h_t = f(W_h h_{t-1} + W_x x_t + b)
$$
Where:
- \( f \) is a non-linear activation function (e.g., tanh, ReLU).
- \( W_h \) and \( W_x \) are weight matrices for the hidden state and input, respectively.
- \( b \) is the bias.


# Recurrent Neural Networks (RNN) with Input Sequence Example

## Input Sequence
Let’s define our input sequence:
- \( x_1 = a \)
- \( x_2 = b \)
- \( x_3 = c \)
- \( x_4 = d \)
- \( x_5 = e \)
- \( x_6 = f \)

## RNN Processing Steps

1. **Initialization**: 
   - At the start, we initialize the hidden state \( h_0 \) (usually set to zeros).

2. **Processing Each Input**:
   - For each time step \( t \), the RNN updates its hidden state based on the current input \( x_t \) and the previous hidden state \( h_{t-1} \).

3. **Hidden State Update**:
   - For each input, the hidden state \( h_t \) is computed using the formula:
   $$
   h_t = \sigma(W_h h_{t-1} + W_x x_t + b)
   $$
   where:
   - \( W_h \) is the weight matrix for the hidden state,
   - \( W_x \) is the weight matrix for the input,
   - \( b \) is the bias,
   - \( \sigma \) is the activation function.

## Step-by-Step Calculation

Let’s illustrate this step-by-step for our sequence:

- **For \( t = 1 \)** (input \( x_1 = a \)):
  $$
  h_1 = \sigma(W_h h_0 + W_x a + b)
  $$

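- **For \( t = 2 \)** (input \( x_2 = b \)):
  $$
  h_2 = \sigma(W_h h_1 + W_x x_2 + b)
  $$
  Here the input is written as \( x_2 \) so that the input symbol "b" is not confused with the bias \( b \).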

- **For \( t = 3 \)** (input \( x_3 = c \)):
  $$
  h_3 = \sigma(W_h h_2 + W_x c + b)
  $$

- **For \( t = 4 \)** (input \( x_4 = d \)):
  $$
  h_4 = \sigma(W_h h_3 + W_x d + b)
  $$

- **For \( t = 5 \)** (input \( x_5 = e \)):
  $$
  h_5 = \sigma(W_h h_4 + W_x e + b)
  $$

- **For \( t = 6 \)** (input \( x_6 = f \)):
  $$
  h_6 = \sigma(W_h h_5 + W_x f + b)
  $$
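
The six updates above can be reproduced in a short loop. The sketch assumes a one-hot encoding for the letters, \( \tanh \) as the activation, a hidden size of 3, and random weights; none of these choices come from the text.

```python
import numpy as np

letters = ["a", "b", "c", "d", "e", "f"]

def one_hot(letter):
    """Encode a letter as a one-hot vector (an assumed encoding for this example)."""
    v = np.zeros(len(letters))
    v[letters.index(letter)] = 1.0
    return v

rng = np.random.default_rng(42)
hidden_size = 3                                   # assumed size
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, len(letters)))
bias = np.zeros(hidden_size)                      # the bias term from the formula

h = np.zeros(hidden_size)                         # h_0
states = [h]
for t, letter in enumerate(letters, start=1):
    # h_t is computed from h_{t-1} and the current input x_t
    h = np.tanh(W_h @ h + W_x @ one_hot(letter) + bias)
    states.append(h)
    print(f"h_{t} after input {letter!r}: {np.round(h, 3)}")
```

The bias is named `bias` in the code to keep it distinct from the input letter "b".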

## Output Calculation
After processing all inputs, the RNN can produce outputs based on the last hidden state or at each time step:

- For the output at time step \( t \):
$$
y_t = W_y h_t + b_y
$$

Where \( W_y \) is the weight matrix for the output layer and \( b_y \) is the output bias.
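
The two options mentioned above, one output per time step or a single output from the last hidden state, differ only in which hidden states are passed through the output layer. A small sketch, with random stand-in hidden states and assumed sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
T, hidden_size, output_size = 6, 3, 2             # assumed sizes
hs = np.tanh(rng.normal(size=(T, hidden_size)))   # stand-in for h_1 ... h_T
W_y = rng.normal(size=(output_size, hidden_size))
b_y = np.zeros(output_size)

# One output per time step: y_t = W_y h_t + b_y for every t.
ys_all = hs @ W_y.T + b_y          # shape (T, output_size)

# A single output for the whole sequence, computed from the last hidden state h_T.
y_last = W_y @ hs[-1] + b_y        # shape (output_size,)

print(ys_all.shape, y_last.shape)  # (6, 2) (2,)
```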

## Summary
In this way, the RNN processes the entire sequence \( (a, b, c, d, e, f) \) by updating its hidden state at each time step, allowing it to learn from the context provided by previous inputs. This mechanism enables RNNs to capture temporal dependencies in the data.


- A sequence is a collection of data that has a defined order.
- An RNN can transform sequences to vectors, vectors to sequences, and sequences to sequences.
- Dynamical systems
    1. Given the state of a system at time t, what will the state be at time t+n?
    2. Example: stock prediction.


___

### What is deep learning?
Deep learning is a method of representing differentiable functions that map a variable of one type to a variable of another type.

$$f(in\_var) = out\_var$$
$$f_{reg}: \mathbb{R}^D \to \mathbb{R}$$
$$f_{classification}: \mathbb{R}^D \to \mathbb{R}^c$$
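
As a rough illustration of the two signatures, the sketch below uses a single linear map for each. The linear form, the softmax normalisation, and the sizes `D` and `c` are assumptions; a deep network would stack several such differentiable maps.

```python
import numpy as np

rng = np.random.default_rng(0)
D, c = 4, 3                      # assumed input dimension and class count
x = rng.normal(size=D)

def f_reg(x, w, b):
    """R^D -> R: a single real-valued prediction."""
    return float(w @ x + b)

def f_classification(x, W, b):
    """R^D -> R^c: one score per class, normalised with softmax."""
    scores = W @ x + b
    exp = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exp / exp.sum()

y_reg = f_reg(x, rng.normal(size=D), 0.0)
y_cls = f_classification(x, rng.normal(size=(c, D)), np.zeros(c))
print(y_reg, y_cls)
```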

Vectors and matrices are abstractions of the raw data.

- A sequence is data that has a defined order.
  - An RNN can transform sequences to vectors, vectors to sequences, and sequences to sequences.
- Dynamical system
   - Unlike feedforward neural networks, which process inputs independently, RNNs have loops that allow them to maintain a hidden state that can capture temporal dependencies.
   - Predicting future outputs.
   - Let \( s^{(t)} \) be the state of a system at time \( t \). The state at time \( t+1 \) is then \( s^{(t+1)} = f(s^{(t)}; \theta) \).
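
The recurrence \( s^{(t+1)} = f(s^{(t)}; \theta) \) can be sketched as a plain loop that applies the same function with the same parameters at every step. The helper `evolve` and the decay system below are hypothetical examples, not part of the notes.

```python
def evolve(s, f, theta, n):
    """Apply s^(t+1) = f(s^(t); theta) n times and return every visited state."""
    trajectory = [s]
    for _ in range(n):
        s = f(s, theta)
        trajectory.append(s)
    return trajectory

# Hypothetical system: exponential decay with rate theta.
def decay(s, theta):
    return theta * s

traj = evolve(s=8.0, f=decay, theta=0.5, n=3)
print(traj)  # [8.0, 4.0, 2.0, 1.0]
```

An RNN follows the same pattern, except that the update also receives an external input \( x^{(t)} \) at each step.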
   

![Alt text](rnn_3.png)
![Alt text](rnn_4.png)

- Maps a sequence `X` to another sequence `O`.
- The amount of information retained from the previous time step is determined by the weight matrix \( W \).
- \( h^{(t)} \) is the hidden state at time step \( t \).


$$
a^{(t)} = W h^{(t-1)} + U x^{(t)} + b_u
$$

$$
h^{(t)} = \tanh(a^{(t)})
$$

$$
h^{(t+1)} \text{ depends on } h^{(t)}
$$

$$
x^{(t)} \text{ and } x^{(t+1)} \text{ are separate inputs}
$$

$$
c^{(t)} = V h^{(t)} + b_v
$$

$$
o^{(t)} = \text{softmax}(c^{(t)})
$$

$$
L^{(t)} = \text{Loss}(o^{(t)}, y^{(t)})
$$

$$
L = \sum_{t} L^{(t)}
$$
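
The chain of equations above, from \( a^{(t)} \) to the total loss \( L \), can be followed line by line in this sketch. It assumes cross-entropy as the per-step loss and integer class labels as targets; the function name `forward_loss` and all sizes are illustrative.

```python
import numpy as np

def softmax(c):
    e = np.exp(c - c.max())              # subtract the max for numerical stability
    return e / e.sum()

def forward_loss(xs, ys, W, U, V, b_u, b_v):
    """Total loss L = sum_t L^(t) for integer targets ys."""
    h = np.zeros(W.shape[0])
    total = 0.0
    for x_t, y_t in zip(xs, ys):
        a = W @ h + U @ x_t + b_u        # a^(t)
        h = np.tanh(a)                   # h^(t)
        c = V @ h + b_v                  # c^(t)
        o = softmax(c)                   # o^(t)
        total += -np.log(o[y_t])         # L^(t), cross-entropy (assumed loss)
    return total

rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 4, 3, 5, 3       # assumed sizes
xs = rng.normal(size=(T, n_in))
ys = rng.integers(0, n_out, size=T)
W = rng.normal(scale=0.1, size=(n_hid, n_hid))
U = rng.normal(scale=0.1, size=(n_hid, n_in))
V = np.zeros((n_out, n_hid))             # zero V gives uniform predictions
loss = forward_loss(xs, ys, W, U, V, np.zeros(n_hid), np.zeros(n_out))
print(loss, T * np.log(n_out))           # the two values agree for uniform outputs
```

Setting `V` to zero is only a sanity check: every class then gets probability \( 1/3 \), so the loss must equal \( T \log 3 \).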


The hidden state \( h^{(t)} \) is recurrent and is meant to capture the internal memory of the network. This memory persists across different time steps, which means it is updated but carries information from previous time steps. Therefore, \( h^{(t)} \) and \( h^{(t+1)} \) are related; they are not independent. The value of \( h^{(t+1)} \) depends on \( h^{(t)} \), but it gets updated as new input is processed.
While they are not exactly the same at different time steps, they are connected in a chain and retain information from prior states.

Input \( x^{(t)} \):

The input \( x^{(t)} \) represents the data fed into the network at each time step \( t \). Unlike \( h^{(t)} \), the input at each time step is typically independent of other inputs. Therefore, \( x^{(t)} \) and \( x^{(t+1)} \) are not the same, as the data points or observations at these time steps can vary and are not connected like the hidden states.

Backpropagation Through Time (BPTT)
- Backpropagation has to be performed through every time step.
   - This makes training computationally expensive.
- Teacher forcing
   - Instead of feeding the hidden state of the previous time step into the hidden layer of the next time step, we feed the ground-truth output y from the previous time step.

![Alt text](rnn_5.png)

- At inference time the predictions can be off, because during training the network was fed the actual y, whereas in a real-world scenario it only has its own predictions.
- BPTT is a method for training RNNs by calculating gradients across time steps.
- Teacher forcing is a strategy during training where the actual target outputs are used to guide the RNN, rather than its own predictions.
- Another architecture is the sequence-to-vector model.
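
The difference between teacher forcing and running on the model's own predictions can be sketched as a single flag. This example assumes a token-level model that keeps its hidden-to-hidden connection and receives the previous token as a one-hot input, which is one common variant; `run_decoder`, the vocabulary size, and the weights are all hypothetical.

```python
import numpy as np

def run_decoder(targets, W, U, V, start_token, teacher_forcing):
    """Return the tokens fed in at each step and the tokens predicted."""
    vocab = V.shape[0]
    h = np.zeros(W.shape[0])
    prev = start_token
    fed, preds = [], []
    for y_true in targets:
        fed.append(prev)
        x = np.eye(vocab)[prev]                 # one-hot of the token fed in
        h = np.tanh(W @ h + U @ x)
        pred = int(np.argmax(V @ h))
        preds.append(pred)
        # Teacher forcing feeds the ground truth; otherwise feed the model's own guess.
        prev = int(y_true) if teacher_forcing else pred
    return fed, preds

rng = np.random.default_rng(0)
vocab, n_hid = 4, 5                             # assumed sizes
W = rng.normal(scale=0.5, size=(n_hid, n_hid))
U = rng.normal(scale=0.5, size=(n_hid, vocab))
V = rng.normal(scale=0.5, size=(vocab, n_hid))
targets = [1, 2, 3, 0]

fed_tf, preds_tf = run_decoder(targets, W, U, V, start_token=0, teacher_forcing=True)
fed_free, preds_free = run_decoder(targets, W, U, V, start_token=0, teacher_forcing=False)
print(fed_tf)    # [0, 1, 2, 3]: the start token, then the ground-truth tokens
print(fed_free)  # the start token, then the model's own earlier predictions
```

With untrained weights the two runs will usually diverge after the first step, which mirrors the train/inference mismatch described in the bullets above.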

 ![Alt text](rnn_10.png)
 ![Alt text](rnn_11.png)
 ![Alt text](rnn_12.png)
 ![Alt text](rnn_13.png)

- Language translation is not just word-for-word conversion; words that come later may also affect the translation.
- Text summarization and language translation do not use an RNN in this specific architecture, because here the length of the input sequence has to equal the length of the output sequence.
  - Instead, use a bidirectional architecture that allows input and output sequences of unequal length.
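
The point that later words can affect earlier positions can be illustrated with a bidirectional pass: one RNN reads the sequence left to right, a second reads it right to left, and their states are concatenated. This is a minimal sketch with assumed names and sizes and no biases; it covers only the bidirectional part, not how output sequences of a different length are produced.

```python
import numpy as np

def run_rnn(xs, W_h, W_x):
    """Return the hidden state at every time step, processing xs in order."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
        out.append(h)
    return np.stack(out)

def bidirectional(xs, fwd, bwd):
    """Concatenate a left-to-right pass with a right-to-left pass."""
    h_fwd = run_rnn(xs, *fwd)
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]     # re-align to the original order
    return np.concatenate([h_fwd, h_bwd], axis=1)

def make_weights(rng, n_hid, n_in):
    return (rng.normal(scale=0.5, size=(n_hid, n_hid)),
            rng.normal(scale=0.5, size=(n_hid, n_in)))

rng = np.random.default_rng(0)
T, n_in, n_hid = 5, 3, 4                      # assumed sizes
xs = rng.normal(size=(T, n_in))
fwd, bwd = make_weights(rng, n_hid, n_in), make_weights(rng, n_hid, n_in)
states = bidirectional(xs, fwd, bwd)          # shape (T, 2 * n_hid)

# Changing the last input alters the backward half of the first state,
# but leaves the forward half of the first state untouched.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
states_changed = bidirectional(xs_changed, fwd, bwd)
```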

![Alt text](rnn_14.png)

```python
from datetime import datetime
import itertools
import numpy as np
import nltk
import os
import operator
import sys
```

```python
# Download the NLTK "book" collection of corpora.
nltk.download("book")
```


```python
from nltk.corpus import state_union
```