# Feed Forward Network and RNN for NLP Tasks

## Feedforward networks for NLP: Classification

The input elements $x_i$ could be scalar features like those in Fig. 5.2, e.g.,

- $x_1 = \text{count(words} \in \text{doc)}$,
- $x_2 = \text{count(positive lexicon words} \in \text{doc)}$,
- $x_3 = 1 \text{ if “no”} \in \text{doc}$, and
- ...

And the output layer $\hat{y}$ could have 3 nodes (positive, negative, neutral), in which case $\hat{y}_1$ would be the estimated probability of positive sentiment, $\hat{y}_2$ the probability of negative, and $\hat{y}_3$ the probability of neutral.

The resulting equations
would be just what we saw above for a 2-layer network (as always, we’ll continue
to use $\sigma$ to stand for any non-linearity, whether sigmoid, ReLU or other):

- $x = [x_1, x_2, ... ,x_N]$ (each $x_i$ is a hand-designed feature)
- $h = \sigma(Wx+b)$
- $z = Uh$
- $\hat{y} = \text{softmax}(z)$

Instead of using hand-built human-engineered features as the input to our classifier, we can draw on deep learning’s ability to learn features from the data by representing words as embeddings, like the *word2vec* or *GloVe* embeddings.

For example, for a text with $n$ input words/tokens $w_1, ..., w_n$, we can take the $n$ embeddings $e(w_1), ..., e(w_n)$ (each of dimensionality $d$) and apply a pooling function, such as the mean, to turn them into a single vector $x$:

- $x = \text{mean}(e(w_1), e(w_2), ... , e(w_n)) $ 
- $h = \sigma(Wx+b)$
- $z = Uh$
- $\hat{y} = \text{softmax}(z)$
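
The four equations above can be traced in a few lines of plain Python. This is only a minimal sketch: the embedding table, the weights `W`, `b`, `U`, and the three-word vocabulary are made-up values chosen for illustration, and ReLU is assumed as the non-linearity $\sigma$.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embedding table e(w) with d = 2 (illustrative values only).
embeddings = {
    "great": [0.9, 0.1],
    "movie": [0.2, 0.3],
    "boring": [-0.8, 0.4],
}

def classify(tokens, W, b, U):
    # x = mean(e(w_1), ..., e(w_n)): average each dimension over the tokens
    vectors = [embeddings[w] for w in tokens]
    x = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    # h = sigma(Wx + b), here with ReLU as the non-linearity
    h = [max(0.0, s + b_i) for s, b_i in zip(matvec(W, x), b)]
    # z = Uh, y_hat = softmax(z)
    return softmax(matvec(U, h))

W = [[1.0, 0.0], [-1.0, 0.0]]              # d_h x d = 2 x 2
b = [0.0, 0.0]
U = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]   # 3 x d_h (positive, negative, neutral)

y_hat = classify(["great", "movie"], W, b, U)
print(y_hat)  # three probabilities that sum to 1
```

Because the input is a mean over token embeddings, the classifier accepts texts of any length but ignores word order.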

For efficient computation, we process a whole batch of inputs at once by stacking the input vectors as the rows of a matrix $X$ and rewriting the equations in matrix form:

- $H = \sigma(XW^T +b)$
- $Z = HU^T$
- $\hat{Y} = \text{softmax}(Z)$

The idea of using word2vec or GloVe embeddings as our input representation (and more generally the idea of relying on another algorithm to have already learned an embedding representation for our input words) is called pretraining.

## Feedforward Neural Language Modeling 

Predicting the upcoming word from the prior word context.

Neural Language Modeling 

Tasks:
- Machine Translation
- Text Summarization
- Speech Recognition
- Grammar Correction
- Chat 

Compared to *n-gram* models, neural language models can handle much longer histories, can generalize better over contexts of similar words, and are more accurate at word prediction.

On the other hand, neural net language models are much more complex, are slower and need more energy to train, and are less interpretable than n-gram models.

The feedforward neural LM approximates the probability of a word given the entire prior context $P(w_t|w_{1:{t−1}})$ by approximating based on the $N − 1$ previous words:

$$ P(w_t|w_1, ..., w_{t−1}) ≈ P(w_t|w_{t−N+1}, ..., w_{t−1})$$

In the following examples we’ll use a 4-gram example, so we’ll show a neural net to estimate the probability $P(w_t = i|w_{t−3}, w_{t−2}, w_{t−1})$.

Neural language models represent words in this prior context by their embeddings, rather than just by their word identity as used in n-gram language models. Using embeddings allows neural language models to generalize better to unseen data.

### Forward inference (decoding) in the neural language model

Forward inference is the task, given an input, of running a forward pass on the
network to produce a probability distribution over possible outputs, in this case the next
word.

1. Represent each of the $N-1$ previous words ($N-1=3$ in our 4-gram example) as a one-hot vector of length $|V|$.
2. Let $x_{t}$, of dimension $|V| \times 1$, be the one-hot representation of $w_{t}$.
3. Let $E$ be the embedding matrix of dimension $d \times |V|$ (where $d$ is the length of the vector that represents each word).
4. Multiply $Ex_{t}=e_t$, of dimension $d \times 1$.
5. Stack the three context embeddings into $e = [e_{t-1}; e_{t-2}; e_{t-3}]$, of dimension $3d \times 1$.
6. Multiply the weight matrix $W$, of dimension $d_h \times 3d$, by the stacked embedding and apply a non-linear function: $h=\sigma(We)$.
7. Multiply $h$ by $U$, of dimension $|V| \times d_h$.
8. Apply softmax: after the softmax, each node $i$ in the output layer estimates
the probability $P(w_t = i|w_{t−1}, w_{t−2}, w_{t−3})$.
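
The forward pass described in these steps can be sketched in plain Python. The five-word vocabulary, the sizes $d=2$ and $d_h=3$, and all weight values below are assumptions made up for illustration, and `tanh` stands in for the non-linearity $\sigma$.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["<s>", "the", "cat", "sat", "mat"]   # toy vocabulary, |V| = 5
V, d, d_h = len(vocab), 2, 3

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

# E is d x |V|: column i holds the embedding of vocab[i] (illustrative values).
E = [[0.0, 0.1, 0.5, -0.3, 0.2],
     [0.0, 0.4, -0.2, 0.6, 0.1]]
# W is d_h x 3d and U is |V| x d_h, filled with arbitrary deterministic values.
W = [[0.1 * ((i + j) % 3 - 1) for j in range(3 * d)] for i in range(d_h)]
U = [[0.2 * ((i * j) % 4 - 1) for j in range(d_h)] for i in range(V)]

def next_word_distribution(w_t3, w_t2, w_t1):
    """Estimate P(w_t | w_{t-3}, w_{t-2}, w_{t-1}) for every word in vocab."""
    e = []
    for word in (w_t1, w_t2, w_t3):            # stacked as [e_{t-1}; e_{t-2}; e_{t-3}]
        e.extend(matvec(E, one_hot(word)))     # E x_t selects one column of E
    h = [math.tanh(s) for s in matvec(W, e)]   # h = sigma(We)
    return softmax(matvec(U, h))               # softmax(Uh)

probs = next_word_distribution("the", "cat", "sat")
print(dict(zip(vocab, probs)))
```

Multiplying $E$ by a one-hot vector just selects a column of $E$, which is why real implementations use a table lookup instead of a matrix product.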

## Recurrent Neural Network

Certain data types such as time series, text, and biological data contain sequential dependencies. A recurrent neural network (RNN) is a type of neural network specifically designed for processing such sequential data.


> A Bit of Vector Semantics and Embeddings
> 
> **Embedding**
> 
> An embedding is a way to represent the meaning of a word as a *vector*. Everything related to the word (semantics, syntax, etc.) is encoded in this vector.
> 
> There are static embeddings and dynamic contextualized embeddings (like BERT).
> 
> Embeddings avoid the need for feature engineering.
> 
> They lead to *self-supervised* ways of learning representations of the input.
> 
> **Vector Semantics**
> 
> A semantic vector represents the meaning of a word. Recall that "semantics" is the study of meaning; vector semantics is the standard way to represent word meaning in NLP.
> 
> The idea of vector semantics is to represent a word as a point in a multidimensional space 
> <!-- that is derived from the distributions of embeddings word neighbors. -->
> 
> Vectors for representing words are called *embeddings* (although the term is sometimes more strictly applied only to dense vectors like *word2vec*)
> 
> **Cosine for measuring similarity**
> 
> To measure similarity between two target words $v$ and $w$, we need a metric that
> takes two vectors and gives a measure of their similarity.
> 
> The *dot product* (inner product) acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions (orthogonal vectors) will have a dot product of 0, representing their strong dissimilarity.
> 
> If we normalize the dot product by the lengths of the two vectors, the result is the cosine of the angle between them:
> 
> $$\text{cosine}(v, w) = \frac{\sum_{i=1}^N v_iw_i}{\sqrt{\sum_{i=1}^N v_i^2}  \sqrt{\sum_{i=1}^N w_i^2} } $$
> 
> **Word2vec**
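
The cosine measure from the note above is short enough to write out directly. The sketch below uses plain Python lists as vectors; the example vectors are arbitrary and only illustrate the two extreme cases (same direction and orthogonal).

```python
import math

def cosine(v, w):
    """Cosine similarity: the dot product divided by the product of the vector lengths."""
    dot = sum(v_i * w_i for v_i, w_i in zip(v, w))
    norm_v = math.sqrt(sum(v_i * v_i for v_i in v))
    norm_w = math.sqrt(sum(w_i * w_i for w_i in w))
    return dot / (norm_v * norm_w)

# Vectors pointing in the same direction have cosine 1 regardless of length.
parallel = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal vectors (non-zero values in different dimensions) have cosine 0.
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
print(parallel, orthogonal)
```

Unlike the raw dot product, the cosine does not favor long vectors, so a frequent word with large counts does not look more similar to everything just because of its magnitude.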

# Sequence Models

#### Motivating Example 

Task: *Entity Recognition*. It is used to find entities such as company names, times, locations, and currency names.

$\text{input}$: Harry Potter and Hermione Granger invented a new spell

$$\textbf{x}: [x^{<1>}, x^{<2>} ..., x^{<9>}]$$

$$\textbf{y}: [1, 1, 0, 1, 1, 0, 0, 0, 0]$$

#### Representing Words

$\text{input}$: Harry Potter and Hermione Granger invented a new spell.

We can represent this input sequence as vectors based on this *vocabulary*:

$$\text{vocabulary} = \{\text{a}, \text{aaron}, ...,\text{Harry,} ..., \text{Potter}, ...,\text{zulu}\}$$

We split the sentence into tokens, where each token is a word.

$$\text{Input} \: \text{tokens} = [\text{Harry}, \text{Potter}, \text{and}, \text{Hermione}, \text{Granger}, \text{invented}, \text{a}, \text{new}, \text{spell}]$$

$$\text{Input} \: \text{tokens} = [x^{<1>}, x^{<2>}, x^{<3>}, x^{<4>}, x^{<5>}, x^{<6>}, x^{<7>}, x^{<8>}, x^{<9>}]$$

So $x^{<1>}=\text{Harry}$ is converted to the vector
$$x^{<1>} = \begin{bmatrix} 
0\\ 
0\\
0\\
...\\
1\\
...\\
0\\
...\\
0
\end{bmatrix}$$

This is called a one-hot representation. What if we encounter a word that is not in the *vocabulary*? We usually map it to a special token $\text{<UNK>}$.

Other special tokens that we use are $\text{<PAD>}$, $\text{<EOS>}$, and $\text{<SOS>}$.
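
As a rough illustration of the one-hot representation and the `<UNK>` fallback, the sketch below builds a tiny vocabulary by hand. The vocabulary contents, the lowercasing step, and the position of the special tokens are assumptions for this example, not a fixed convention.

```python
# Hypothetical toy vocabulary: special tokens first, then the known words.
vocabulary = ["<PAD>", "<UNK>", "<SOS>", "<EOS>", "a", "and", "granger", "harry",
              "hermione", "invented", "new", "potter", "spell"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(token):
    """Return a |V|-long vector with a single 1 at the token's vocabulary index."""
    # Words outside the vocabulary fall back to the <UNK> index.
    index = word_to_index.get(token.lower(), word_to_index["<UNK>"])
    vector = [0] * len(vocabulary)
    vector[index] = 1
    return vector

sentence = "Harry Potter and Hermione Granger invented a new spell"
x = [one_hot(token) for token in sentence.split()]   # x[0] is x^{<1>}, and so on

print(x[0])                  # one-hot vector for "Harry"
print(one_hot("Voldemort"))  # not in the vocabulary, so the <UNK> position is set
```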

> Problems with Feedforward Neural Networks
> * The *inputs* and the *outputs* can have different lengths. "This is my iPhone" (tokenized by word) has $T = 4$, while "This laptop doesn't have enough storage" (tokenized by word) has $T = 6$.
> * Each input token can depend on earlier input tokens.

## Recurrent Neural Networks

We are focusing on a class of recurrent networks referred to as *Elman Networks* (Elman, 1990) or *simple recurrent networks*.

These networks are useful in their own right and serve as the basis for more
complex approaches like the Long Short-Term Memory (LSTM). In this chapter, when we use the term RNN we’ll be referring to
these simpler, more constrained networks (although you will often see the term RNN used to mean any net with recurrent properties, including LSTMs).



A recurrent neural network (RNN) is any network that contains a cycle within its
network connections, meaning that the value of some unit is directly, or indirectly,
dependent on its own earlier outputs as an input.

Critically, this approach does not impose a fixed-length limit on this prior context; the context embodied in the previous hidden layer can include information extending back to the beginning of the sequence.

### Inference in RNNs

Inference maps a sequence of inputs to a sequence of outputs.

The inference shown in this example is for the *language modeling task*.

Language modeling, or LM, is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions.

It is one of the most basic and important tasks in natural language processing, and RNNs do it very well.

RNN language models (Mikolov et al., 2010) process the input sequence one
word at a time, attempting to predict the next word from the current word and the
previous hidden state. RNNs thus don’t have the limited context problem that n-gram
models have, or the fixed context that feedforward language models have, since the
hidden state can in principle represent information about all of the preceding words
all the way back to the beginning of the sequence.

$$P(x^{<t+1>} | x^{<1>},...,x^{<t>})$$

To distinguish the input from the output, the next word ($x^{<t+1>}$) is denoted by $y^{<t>}$.



0. $e^{<t>} = E\textbf{x}^{<t>}$
1. $a^{<t>} = \Phi(W_{aa}a^{<t-1>} + W_{ae}e^{<t>} + b_a)$
2. $y^{<t>} = \Theta(W_{ya}a^{<t>} + b_y)$

> Where $W_{ya}$ is the weight matrix that multiplies the vector $a$ to compute $y$.
>
> $a^{<0>}$ is a vector of zeros.

1. Let $\textbf{x}^{<{t}>}$, of dimension $|V| \times 1$, be the one-hot representation of the word $x^{<{t}>}$.
2. Let $E$ be the embedding matrix of dimension ${d}_e \times |V|$ (where $d_e$ is the length of the vector that represents each word).
3. Multiply $E$ by $\textbf{x}^{<t>}$ to obtain $e^{<t>}$, of dimension $d_e \times 1$.
4. Multiply $e^{<t>}$ by $W_{ae}$ (of dimension $d_h \times d_e$), and multiply $a^{<t-1>}$ (of dimension $d_h \times 1$, initialized to zeros) by $W_{aa}$ (of dimension $d_h \times d_h$). This last term summarizes all the preceding context.
5. Sum the two results, add the bias $b_a$ (of dimension $d_h \times 1$), and pass the sum through a non-linear function $\Phi$ (ReLU, for example) to obtain $a^{<t>}$, of dimension $d_h \times 1$.
6. Multiply $a^{<t>}$ by $W_{ya}$ (of dimension $|V| \times d_h$) and add the bias $b_y$ (of dimension $|V| \times 1$).
7. Apply softmax ($\Theta$) to the previous result (called the *logits*) to get the probability distribution over the possible output classes.
8. The class with the highest probability is the one chosen.
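
One way to make the inference steps concrete is to run them by hand on a tiny example. The sketch below assumes $|V|=4$, $d_e=2$, $d_h=3$, uses `tanh` for $\Phi$, and fills every matrix with arbitrary illustrative numbers; only the structure of the computation follows the equations above.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

V, d_e, d_h = 4, 2, 3   # assumed toy sizes

E = [[0.1, -0.2, 0.3, 0.0],          # d_e x |V|
     [0.4, 0.1, -0.1, 0.2]]
W_aa = [[0.5, 0.0, -0.1],            # d_h x d_h
        [0.0, 0.3, 0.2],
        [0.1, -0.2, 0.4]]
W_ae = [[1.0, 0.0],                  # d_h x d_e
        [0.0, 1.0],
        [0.5, -0.5]]
b_a = [0.0, 0.1, -0.1]
W_ya = [[0.2, -0.1, 0.0],            # |V| x d_h
        [0.0, 0.3, 0.1],
        [-0.2, 0.1, 0.4],
        [0.1, 0.0, -0.3]]
b_y = [0.0, 0.0, 0.0, 0.0]

def rnn_step(x_onehot, a_prev):
    e = matvec(E, x_onehot)                                   # e = E x
    pre = [r + s + b for r, s, b in
           zip(matvec(W_aa, a_prev), matvec(W_ae, e), b_a)]   # W_aa a + W_ae e + b_a
    a = [math.tanh(p) for p in pre]                           # a = Phi(...)
    logits = [s + b for s, b in zip(matvec(W_ya, a), b_y)]    # W_ya a + b_y
    return a, softmax(logits)                                 # y = Theta(logits)

a = [0.0] * d_h                     # a^{<0>} is a vector of zeros
for token_id in [0, 2, 1]:          # a toy input sequence of token ids
    x = [1.0 if i == token_id else 0.0 for i in range(V)]
    a, y_hat = rnn_step(x, a)       # the same weights are reused at every step

predicted = max(range(V), key=lambda i: y_hat[i])   # class with the highest probability
print(y_hat, predicted)
```

The loop carries `a` from one step to the next, which is how information from the start of the sequence can influence the last prediction.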

<!-- <div><span style="font-size:25px;font-weight:300">RNN Architecture (One Hidden Layer)</span> </div> -->

Graphical representation

<div style = "">
    <div style = "display:flex;align-items:center; justify-content:center; flex-direction:column" >
    <img src = "./assets/architecture-rnn.png" style="height:150px">
    <span>By: Andrew Ng</span>
    </div>
    
</div>


> Note (*Weight tying*)
> 
> Weight tying sets $d_h=d_e$ so that the embedding matrix $E$ can be reused at the output layer, i.e., $W_{ya} = E^T$.
> 
> *This approach significantly reduces the number of parameters required for the model.*
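
A quick count shows how much weight tying saves. The helper below is hypothetical, and the sizes (a 10,000-word vocabulary with $d_e = d_h = 128$) are assumed purely for the arithmetic.

```python
def count_parameters(vocab_size, d_e, d_h, tied):
    """Count the parameters of the simple RNN language model described above."""
    embedding = d_e * vocab_size                 # E
    recurrent = d_h * d_h + d_h * d_e + d_h      # W_aa, W_ae, b_a
    output_bias = vocab_size                     # b_y
    # With weight tying (requires d_h == d_e) the output layer reuses E^T,
    # so W_ya adds no new parameters.
    output = 0 if tied else vocab_size * d_h     # W_ya
    return embedding + recurrent + output + output_bias

untied = count_parameters(10_000, 128, 128, tied=False)
tied = count_parameters(10_000, 128, 128, tied=True)
saved = untied - tied
print(untied, tied, saved)
```

The saving is exactly $|V| \cdot d_h$, the size of $W_{ya}$, which for large vocabularies is one of the two largest matrices in the model.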

> *Note*
> 
> The architecture above is a specific structure called *Many-to-Many*, in which the inputs and the outputs have the same length ($T_x = T_y$). It is also unidirectional (left to right) and has a single hidden layer.

> *Teacher forcing*
> 
> Instead of using the model's predicted output at each time step as input for the next step, the correct target output (the ground truth) is fed into the model. This means the model gets to see the actual sequence it is supposed to generate during training, which speeds up learning and makes the model converge faster.

#### Examples of RNN architectures

- *Many-to-Many*: Both inputs and outputs are sequences. Can be
direct or delayed.
  - Ex.: Video captioning, i.e., describing a sequence of images via text (direct).
  - Ex.: Translating one language into another (delayed).

- *Many-to-One*: The input data is a sequence, but the output is a fixed-size vector, not a sequence.
  - Ex.: Sentiment classification, where the input is some text and the output is a class label.
- *One-to-Many*: The input data is in a standard format (not a sequence), and the output is a sequence.
  - Ex.: Image captioning, where the input is an image and the output is a text description of that image.

### Training RNN

Backpropagation

Backpropagation starts from the loss function.

Let $\textbf{y}^{<t>}$ (of dimension $|V| \times 1$) be the one-hot representation of the correct next word $y^{<t>}$, and let $\hat{\textbf{y}}^{<t>} = \text{softmax}(\text{logits}^{<t>})$ (of dimension $|V| \times 1$) be the predicted distribution, where the *logits* are the scores before applying softmax.

1. $L^{<t>} = - {\textbf{y}^{<t>}}^{T} \log(\hat{\textbf{y}}^{<t>}) = -\log(\hat{\textbf{y}}^{<t>}[k])$, where $k$ is the vocabulary index of the correct word at step $t$ (the cross-entropy loss keeps only the log-probability assigned to the correct word).
2. $L = \sum_{t=1}^{T_y} L^{<t>}$
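
The per-step loss is the negative log of the probability that the softmax assigns to the correct next word, and the sequence loss is the sum over steps. A small sketch with made-up logits for a three-word vocabulary and two time steps:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """L^{<t>}: negative log-probability of the correct word at one time step."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Illustrative logits for T_y = 2 steps, and the index of the correct word at each step.
logits_per_step = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
targets = [0, 2]

step_losses = [cross_entropy(l, k) for l, k in zip(logits_per_step, targets)]
total_loss = sum(step_losses)   # L = sum over t of L^{<t>}
print(step_losses, total_loss)
```

In PyTorch, `nn.CrossEntropyLoss` takes the raw logits and applies the softmax internally, and by default it averages over the steps rather than summing them.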

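$$\frac{\partial L^{<t>}}{\partial W_{aa}} = \sum_{k=1}^{t} \frac{\partial L^{<t>}}{\partial y^{<t>}} \, \frac{\partial y^{<t>}}{\partial a^{<t>}} \left( \prod_{j=k+1}^{t} \frac{\partial a^{<j>}}{\partial a^{<j-1>}} \right) \frac{\partial a^{<k>}}{\partial W_{aa}}$$

Here $\frac{\partial y^{<t>}}{\partial a^{<t>}}$ involves $\Theta'$ and $W_{ya}$, and each factor $\frac{\partial a^{<j>}}{\partial a^{<j-1>}}$ involves $\Phi'$ and $W_{aa}$. The term for $k$ therefore contains a product of $t-k$ such factors, one per time step between $k$ and $t$.

This is very problematic: when these factors are consistently smaller than 1 the product shrinks exponentially with the distance $t-k$ (vanishing gradients), and when they are consistently larger than 1 it grows exponentially (exploding gradients).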
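
The effect is easy to see in a scalar toy RNN, $a^{<t>} = \tanh(w\,a^{<t-1>} + x)$, where each backward step multiplies the gradient by $w\,(1 - (a^{<t>})^2)$. The weight values and the constant input below are assumptions chosen only to show the trend.

```python
import math

def gradient_through_time(w, steps, x=0.5):
    """d a^{<steps>} / d a^{<0>} for the scalar RNN a_t = tanh(w * a_{t-1} + x)."""
    a = 0.0
    grad = 1.0
    for _ in range(steps):
        a = math.tanh(w * a + x)
        grad *= w * (1.0 - a * a)   # chain rule factor d a_t / d a_{t-1}
    return grad

after_5_steps = abs(gradient_through_time(w=0.5, steps=5))
after_50_steps = abs(gradient_through_time(w=0.5, steps=50))
print(after_5_steps, after_50_steps)   # the second value is vanishingly small

# With an identity activation the factor is just w, so |w| > 1 makes the product explode.
exploding = 1.5 ** 50
print(exploding)
```

With $|w| < 1$ every factor is below 1, so the contribution of early time steps to the gradient all but disappears after a few dozen steps.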

> **Note**
> 
> *Training set*: a large corpus (NLP terminology for a large body of text) in any language.
> 
> *Tokenize*: divide the sentences into pieces of text (words, n-grams).
> 
> *Word tokenize*: divide the sentences into words. Punctuation can be treated as a word if it is useful.
> 
> $$\text{cats average 15 hours of sleep a day}.$$
> 
> If we treat `.` as a token, the tokens are $y^{<1>}, ..., y^{<8>}, y^{<9>}$; if not, the tokens are $y^{<1>}, ..., y^{<8>}$.
> 
> To mark the end of a sentence we can append an `<EOS>` token.
> 
> If a word is not in the vocabulary (say, 10,000 words), it is replaced by the `<UNK>` token.

*Image generation* and *code generation* constitute a new area of AI that is often called generative AI.

*Autoregressive generation* (or *causal LM generation*) samples each word conditioned on our previous choices. The name comes from autoregressive models, which predict a value at time $t$ based
on a linear function of the previous values at times $t−1$, $t−2$, and so on. 

Generation is primed with an appropriate *context*. This simple idea underlies state-of-the-art approaches to tasks like machine translation, summarization, and question answering.

For translation the context is the sentence in the source language; for summarization it’s the long text we want to summarize.
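
The generation loop itself is simple: pick a next token, append it, and feed the extended sequence back in. The sketch below replaces the neural model with a hypothetical lookup table that conditions only on the last token (an RNN would condition on its hidden state instead), and it chooses greedily rather than sampling.

```python
# Hypothetical next-token probabilities; a trained language model would supply these.
next_token_probs = {
    "<s>": {"cats": 0.6, "dogs": 0.4},
    "cats": {"sleep": 0.7, "eat": 0.3},
    "dogs": {"eat": 0.8, "sleep": 0.2},
    "sleep": {"<EOS>": 0.9, "cats": 0.1},
    "eat": {"<EOS>": 1.0},
}

def generate(context, max_tokens=10):
    """Extend the context one token at a time until <EOS> or the length limit."""
    tokens = list(context)
    while len(tokens) < max_tokens:
        distribution = next_token_probs[tokens[-1]]
        next_token = max(distribution, key=distribution.get)   # greedy choice
        tokens.append(next_token)        # the choice becomes part of the next input
        if next_token == "<EOS>":
            break
    return tokens

generated = generate(["<s>"])
print(generated)  # ['<s>', 'cats', 'sleep', '<EOS>']
```

For translation or summarization, `context` would hold the source sentence or the long text instead of just the start token.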

### Stacked and Bidirectional RNN architectures



- Stacked RNNs consist of multiple networks where the output of one layer serves as
the input to a subsequent layer.

- A bidirectional RNN (Schuster and Paliwal, 1997) combines two independent RNNs, one where the input is processed from the start to the end, and the other from the end to the start.
Bidirectional RNNs have also proven to be quite effective for sequence classification.


<div style = "">
    <div style = "display:flex;align-items:center; justify-content:center; flex-direction:column" >
    <img src = "./assets/Bidirectional RNN.png" style="height:180px; border: 0.5px solid;">
    </div>
    
</div>


<!-- ### In the practice

Let the sequence `"The reinforcement learning is the key of machine learning"` and we want to build a *Language Model* to predict the next word. 

So first we need to convert the sequence into *tokens* (in this case the token is a word). 

So the sequence is convert to a vector: $[\text{The}, \: \text{reinforcement}, \: \text{learning},\: \text{is}, \:\text{the}, \: \text{key}, \: \text{of},\: \text{machine},\: \text{learning}]$. 

Above sequence is splitted in *Source Text* and *Target Text* of equal length. $T_x = T_y$

Source Text : $[\text{The}, \: \text{reinforcement}, \: \text{learning},\: \text{is}, \:\text{the}, \: \text{key}, \: \text{of},\: \text{machine}]$. 

Target Text : $[\text{reinforcement}, \: \text{learning},\: \text{is}, \:\text{the}, \: \text{key}, \: \text{of},\: \text{machine},\: \text{learning}]$. 

We can fomalize into variables

Let be $x^{<t>} = [x^{<1>}, .., x^{<8>}]$ and $y^{<t>} = [y^{<1>}, .., y^{<8>}]$ -->

<!-- When $x^{<t>}$ and $y^{<t>}$ is fed to RNN (1 hidden layer) to train, it apply the following steps:

- $x^{<1>} = \text{The}$ is transformed using *embedding transformation* (we define the size of vocabulary and the dimention of embedding) or *one-hot* (based in to the vocabulary). So if $\text{vocabulary}\_\text{size} = 20$ and the dimention of  $\text{embedding}\_\text{size} = 5$ the vector would be $x^{<1>} = [0.79632116, 0.09251072, 0.20748794, 0.20226105, 0.533899253]$ and each time that the word *the* needs to encode this vector will be used.

- $a^{<0>}$ init with zero values $[0, 0, 0, 0, 0]$, $W_{aa}$ and $W_{ax}$ are matrices  of shape = $(5, 5)$ inits with random values. Where the first $5$ to left are the weights for the first unit and $b_a$ is a vector of shape $(1, 5)$ where each single value is the bias for each unit.
- With those parameters and data we can compute $a^{<1>} =\Phi(a^{<0>}\text{.}W_{aa} + x^{<1>}\text{.}W_{ax} + b_a)$.

- Since we need to predict the probability of the next word, we use $\text{softmax}$ function and output should be equal to $20$ the size of the vocabulary. So we can compute the probabilities $P(y^{<1>}), P(y^{<2>}), ..., P(y^{<20>})$.

- With those requeriments the weights $W_{ya}$ is a matrix of shape $(5,20)$ and $b_y$ is vecor of shape $(1, 20)$ with random values at init.

- So we can compute the $y^{<1>} = \Theta(a^{<1>}\text{.}W_{ya} + b_y)$.
- In the next iterartion we will apply the same steps but with the data and parameters of the previous iteration. -->

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

In [41]:
# Define a simple dataset for word prediction
class WordDataset(Dataset):
    def __init__(self, sequences, sequence_length):
        
        self.sequences = sequences
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence = self.sequences[idx]
        input_sequence = torch.tensor(sequence[:-1], dtype=torch.long)
        target = torch.tensor(sequence[1:], dtype=torch.long)
        return input_sequence, target

# Define the RNN model
class RNN(nn.Module):
    def __init__(self, vocabulary_size, embedding_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocabulary_size, embedding_size)
        self.rnn = nn.RNN(embedding_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

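    def forward(self, x, hidden):
        # x: (batch, seq_len) token ids -> embedded: (batch, seq_len, embedding_size)
        embedded = self.embedding(x)
        # out: hidden state at every time step; hidden: final hidden state
        out, hidden = self.rnn(embedded, hidden)
        # Project each hidden state to vocabulary logits: (batch, seq_len, output_size)
        out = self.fc(out)
        return out, hidden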

In [42]:
# Function to generate a simple dataset
def generate_dataset():
    # Toy sequences of token ids (ids range over 0-10); id 4 ends every sequence
    sequences = [
        [0, 1, 2, 3, 4],
        [5, 6, 7, 8, 4],
        [9, 1, 10, 3, 4],
    ]
    return sequences

In [43]:
# Define hyperparameters
vocabulary_size = 11  # Number of distinct token ids, so valid ids are 0-10
hidden_size = 8  # Size of the hidden state
output_size = 11  # Output size (same as the vocabulary size)
sequence_length = 4  # Sequence length for training
batch_size = 1  # Batch size for training
num_epochs = 100  # Number of training epochs
embedding_size = 10  # Size of each token embedding
# Create the dataset and dataloader
sequences = generate_dataset()
dataset = WordDataset(sequences, sequence_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Create the RNN model instance
model = RNN(vocabulary_size, embedding_size, hidden_size, output_size)

# Define the loss function and optimizer
# Cross-entropy loss for multi-class (next-token) prediction
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

In [None]:
# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    for batch_inputs, batch_targets in dataloader:
        optimizer.zero_grad()
        # Initialize the hidden state: (num_layers, batch, hidden_size)
        hidden = torch.zeros(1, batch_inputs.size(0), hidden_size)
        outputs, _ = model(batch_inputs, hidden)
        # Flatten to (batch * seq_len, vocab) and (batch * seq_len,) for the loss
        loss = criterion(outputs.view(-1, output_size), batch_targets.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch + 1) % 10 == 0:
        print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(dataloader):.4f}')


In [None]:
# Prediction
test_input = torch.tensor([[0, 1, 2, 3]], dtype=torch.long)  # First four token ids of the first training sequence
hidden = torch.zeros(1, 1, hidden_size)  # Initialize hidden state
with torch.no_grad():
    output, _ = model(test_input, hidden)
predicted_idx = torch.argmax(output, dim=2)
print("Predicted sequence:", [idx.item() for idx in predicted_idx.squeeze()])

## The LSTM

The main reason to introduce the LSTM is to address two downsides of simple RNNs:

- *Retrieving distant information* is difficult, since the information encoded in hidden states tends to be fairly local, more relevant to the most recent parts of the input sequence and recent decisions. This happens because the weights must perform two tasks simultaneously: *providing information useful for the current decision, and updating and carrying forward information required for future decisions*.
- A second difficulty with training RNNs is the *vanishing gradients problem*.

**LSTMs** divide the context management problem into two subproblems: removing information no longer
needed from the context, and adding information likely to be needed for later decision making.

LSTMs accomplish this by first adding an explicit context layer to the architecture (in addition to the
usual recurrent hidden layer), and through the use of specialized neural units that
make use of gates to control the flow of information into and out of the units that
comprise the network layers. 

The *gates* in an LSTM share a common design pattern; each consists of a feedforward layer, followed by a sigmoid activation function, followed by a pointwise multiplication with the layer being gated.


Values in the layer being *gated* that align with values near $1$ in the mask are passed through
nearly unchanged; values corresponding to lower values are essentially erased.

The first gate we’ll consider is the **forget gate**. The purpose of this gate is to delete information from the context that is no longer needed.
- $f_t = \sigma(U_f h_{t-1} + W_f x_t)$
- $k_t = c_{t-1} \odot f_{t}$

The next task is to compute the actual information we need to extract from the previous hidden state and current inputs—the same basic computation we’ve been using for all our recurrent networks

- $g_t=\tanh(U_g h_{t-1} + W_g x_{t})$

Next, we generate the mask for the add gate to select the information to add to the *current context*

- $i_t = \sigma(U_i h_{t-1} + W_i x_{t})$
- $j_t = g_t \odot i_t$

Next, we add this to the modified context vector to get our new context vector

- $c_t = k_t + j_t $

The final gate we’ll use is the output gate, which is used to decide what information is required for the current hidden state (as opposed to what information needs to be preserved for future decisions).
- $o_t=\sigma(U_o h_{t-1} + W_o x_{t})$
- $h_t = o_t \odot \tanh(c_t) $
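
The gate equations above translate almost line by line into code. The sketch below works on plain Python lists with a hidden size and input size of 2; `make_params` is a hypothetical helper that fills all eight weight matrices with one constant, used here only so the example stays short.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def add(u, v):
    return [u_i + v_i for u_i, v_i in zip(u, v)]

def hadamard(u, v):
    """Pointwise (element-wise) multiplication."""
    return [u_i * v_i for u_i, v_i in zip(u, v)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def tanh(v):
    return [math.tanh(x) for x in v]

def lstm_step(x_t, h_prev, c_prev, p):
    def layer(name, activation):
        # activation(U h_{t-1} + W x_t) with the weights of the named gate
        return activation(add(matvec(p["U_" + name], h_prev),
                              matvec(p["W_" + name], x_t)))
    f_t = layer("f", sigmoid)            # forget gate mask
    k_t = hadamard(c_prev, f_t)          # context with unneeded information removed
    g_t = layer("g", tanh)               # candidate information
    i_t = layer("i", sigmoid)            # add gate mask
    j_t = hadamard(g_t, i_t)             # selected new information
    c_t = add(k_t, j_t)                  # new context vector
    o_t = layer("o", sigmoid)            # output gate mask
    h_t = hadamard(o_t, tanh(c_t))       # new hidden state
    return h_t, c_t

def make_params(value, hidden_size=2, input_size=2):
    """Hypothetical helper: every weight matrix is filled with the same constant."""
    params = {}
    for gate in "fgio":
        params["U_" + gate] = [[value] * hidden_size for _ in range(hidden_size)]
        params["W_" + gate] = [[value] * input_size for _ in range(hidden_size)]
    return params

h_t, c_t = lstm_step([1.0, 2.0], [0.0, 0.0], [2.0, -4.0], make_params(0.1))
print(h_t, c_t)
```

Note that the context $c_t$ is updated only by a pointwise multiplication and an addition, which is what lets information travel across many time steps.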

## The Encoder-Decoder Model with RNN

The *encoder-decoder* (sequence-to-sequence) model is used in tasks such as *machine translation*, in which the input and output sequences can have different lengths.

The key idea is to train an *encoder network* that takes an input sequence and creates a context, which is passed to a *decoder network* that generates the output sequence.

1. An *encoder* that accepts an input sequence, $x$, and generates a corresponding
sequence of contextualized representations, $h^e$. LSTMs, convolutional networks, and Transformers can all be employed as encoders.
2. A *context vector*, $c$, which is a function of $h^e$ and conveys the essence of the input to the decoder.
3. A *decoder*, which accepts $c$ as input and generates an arbitrary-length sequence of hidden states $h^d$, from which a corresponding sequence of output states $y$ can be obtained. Just as with encoders, decoders can be realized by
any kind of sequence architecture.

<div style = "">
    <div style = "display:flex;align-items:center; justify-content:center; flex-direction:column" >
    <img src = "./assets/encoder-decoder-model.png" style="height:220px;">
    </div>
    
</div>


> *Note*
> 
> The entire purpose of the encoder is to generate a contextualized representation of the input

0. $e^{<t>}_x = E\textbf{x}^{<t>}$
1. $a^{<t>}_e = \Phi(W_{aa}a^{<t-1>}_e + W_{ae}e^{<t>}_x + b_a)$
2. $c = a^{<T_x>}_e=a^{<0>}_d$, $y^{<0>} = \text{<s>}$
3. $e^{<\tau>}_y = E\textbf{y}^{<\tau-1>}$
4. $a^{<\tau>}_d = \Phi(W_{aa}a^{<\tau-1>}_d + W_{ae}e^{<\tau>}_y + W_{ac}c + b_a)$
<!-- 5. $c = a^{T_x}_e$ -->
5. $y^{<\tau>} = \Theta(W_{ya}a^{<\tau>}_d + b_y)$

> Where $W_{ya}$ is the weight matrix that multiplies the vector $a$ to compute $y$.
>
> $a^{<0>}_e$ is a vector of zeros.
>
> The decoder input at step $\tau$ is the previous output token $y^{<\tau-1>}$, starting from $y^{<0>} = \text{<s>}$.

Encoder
1. Let $\textbf{x}^{<{t}>}$, of dimension $|V| \times 1$, be the one-hot representation of the word $x^{<{t}>}$.
2. Let $E$ be the embedding matrix of dimension ${d}_e \times |V|$ (where $d_e$ is the length of the vector that represents each word).
3. Multiply $E$ by $\textbf{x}^{<t>}$ to obtain $e^{<t>}_x$, of dimension $d_e \times 1$.
4. Multiply $e^{<t>}_x$ by $W_{ae}$ (of dimension $d_h \times d_e$), and multiply $a^{<t-1>}_e$ (of dimension $d_h \times 1$, initialized to zeros) by $W_{aa}$ (of dimension $d_h \times d_h$). This last term summarizes all the preceding context.
5. Sum the two results, add the bias $b_a$ (of dimension $d_h \times 1$), and pass the sum through a non-linear function (ReLU, for example) to obtain $a^{<t>}_e$, of dimension $d_h \times 1$.
6. At time $t=T_x$ (the length of the input sequence) we obtain $a^{<T_x>}_e$. This is the context vector $c$ used by the decoder network: the vector that summarizes all the input information.

Decoder

1. Start with the initial values $a^{<0>}_d=c$ and $y^{<0>}=\text{<s>}$.
2. Let $\textbf{y}^{<\tau-1>}$, of dimension $|V| \times 1$, be the one-hot representation of the previous output token $y^{<\tau-1>}$.
3. Let $E$ be the embedding matrix of dimension ${d}_e \times |V|$ (where $d_e$ is the length of the vector that represents each word).
4. Multiply $E$ by $\textbf{y}^{<\tau-1>}$ to obtain $e^{<\tau>}_y$, of dimension $d_e \times 1$.
5. Multiply $e^{<\tau>}_y$ by $W_{ae}$ (of dimension $d_h \times d_e$), multiply $c$ (of dimension $d_h \times 1$) by $W_{ac}$ (of dimension $d_h \times d_h$), and multiply $a^{<\tau-1>}_d$ (of dimension $d_h \times 1$, with initial value $c$) by $W_{aa}$ (of dimension $d_h \times d_h$).
6. Sum the three results, add the bias $b_a$ (of dimension $d_h \times 1$), and pass the sum through a non-linear function (ReLU, for example) to obtain $a^{<\tau>}_d$, of dimension $d_h \times 1$.
7. Multiply $a^{<\tau>}_d$ by $W_{ya}$ (of dimension $|V| \times d_h$) and add the bias $b_y$ (of dimension $|V| \times 1$).
8. Apply softmax ($\Theta$) to the previous result (called the *logits*) to get the probability distribution over the possible output classes.
9. The class with the highest probability is the one chosen.

During training, it is more common to use teacher forcing in the
decoder. *Teacher forcing* means that we force the system to use the gold target token
from training as the next input, rather than allowing it to rely on the (possibly
erroneous) decoder output.

## Attention 

Luong Attention 

Downsides of using a simple encoder-decoder:
- The final hidden state is thus acting as a **bottleneck**. It must represent absolutely
everything about the meaning of the source text, since the only thing the decoder
knows about the source text is what’s in this context vector

The attention mechanism is a solution to the bottleneck problem, a way of
allowing the decoder to get information from all the hidden states of the encoder,
not just the last hidden state

The idea of attention is instead to create the single fixed-length vector c by taking
a weighted sum of all the encoder hidden states

The weights focus on (‘attend to’) a particular part of the source text that is relevant for the token the decoder is
currently producing

Attention thus replaces the static context vector with one that
is dynamically derived from the encoder hidden states

This is the weighted average over all the *encoder hidden states*.

So $c \rightarrow c^{<\tau>} = \sum_{t=1}^{T_x} \alpha^{<t,\tau>} a^{<t>}_e$

Where $\alpha^{<t,\tau>}$ tells us the proportional relevance of each encoder hidden
state $a^{<t>}_e$ to the prior decoder hidden state $a^{<\tau-1>}_d$:

$\alpha^{<t,\tau>} = \text{softmax}(\text{score}(a^{<t>}_e, a^{<\tau-1>}_d))$, with the softmax taken over all encoder positions $t$.



Dot-Product Attention

$\text{score}(a^{<t>}_e, a^{<\tau-1>}_d) = a^{<\tau-1>}_d \cdot a^{<t>}_e$

This implements relevance as similarity: it measures how similar the decoder hidden state is to an encoder hidden state.
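
Dot-product attention needs only a dot product, a softmax, and a weighted sum. In the sketch below the three encoder hidden states and the decoder hidden state are made-up two-dimensional vectors chosen so the result is easy to check by eye.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def attention_context(decoder_state, encoder_states):
    # score: dot product between the decoder state and each encoder state
    scores = [dot(decoder_state, a_e) for a_e in encoder_states]
    # alpha: softmax over the encoder positions
    alphas = softmax(scores)
    # context: weighted sum of the encoder hidden states
    size = len(encoder_states[0])
    context = [sum(alpha * a_e[k] for alpha, a_e in zip(alphas, encoder_states))
               for k in range(size)]
    return alphas, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # illustrative a_e for t = 1, 2, 3
decoder_state = [2.0, 0.0]                              # illustrative decoder hidden state

alphas, context = attention_context(decoder_state, encoder_states)
print(alphas, context)
```

The first and third encoder states get the same score (2) and therefore the same weight, while the second, which is orthogonal to the decoder state, gets the smallest weight. A new context vector is computed this way at every decoding step.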

## Application

The most common way to check whether an email address is valid is authentication. But if the company where you work does not have an application that can check this, you can build a model that corrects the email address (only the domain part).

$$\text{wings@gmial.com} \rightarrow \text{wings@gmail.com}$$

We will be applying the following concepts:
- Embedding
- RNN (GRU or LSTM)

In [8]:
from __future__ import unicode_literals, print_function, division
from io import open
import re
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import numpy as np
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Loading data files

The data for this project is a set of hundreds of domain + extension pairs, each consisting of an incorrect version and its correct version.



We'll need a unique index per letter to use as the inputs and targets of
the networks later. To keep track of all this we will use a helper class
called `Domain` which has letter → index (`letter2index`) and index → letter
(`index2letter`) dictionaries, as well as a count of each letter
`letter2count` which will be used to replace rare letters later.

In [273]:
SOS_token = 0
EOS_token = 1

class Domain:
    def __init__(self, name):
        self.name = name
        self.letter2index = {}
        self.letter2count = {}
        self.index2letter = {0: "SOS", 1: "EOS"}
        self.n_letter = 2  # Count SOS and EOS

    def addWord(self, word):
        # Register every character of the word in the letter dictionaries
        for letter in word:
            self.addLetter(letter)

    def addLetter(self, letter):
        if letter not in self.letter2index:
            self.letter2index[letter] = self.n_letter
            self.letter2count[letter] = 1
            self.index2letter[self.n_letter] = letter
            self.n_letter += 1
        else:
            self.letter2count[letter] += 1

In [274]:
def readFile():
    print("Reading lines...")
    # Read the file and split into lines
    with open('data.csv', encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split every line into a tab-separated [incorrect, correct] pair
    pairs = [line.split('\t') for line in lines]

    input_domain = Domain('incorrect')
    output_domain = Domain('correct')

    return input_domain, output_domain, pairs

In [None]:
def prepareData():
    input_domain, output_domain, pairs = readFile()

    print("Read %s domain pairs" % len(pairs))

    print("Counting letters...")

    for pair in pairs:
        input_domain.addWord(pair[0])
        output_domain.addWord(pair[1])

    print("Counted letters:")
    print(input_domain.name, input_domain.n_letter)
    print(output_domain.name, output_domain.n_letter)
    return input_domain, output_domain, pairs

input_domain, output_domain, pairs = prepareData()

In [277]:
MAX_LENGTH = 35

### Encoder

In [278]:
class EncoderRNN(nn.Module):
    def __init__(self,  input_size, hidden_size, p_dropout=0.1):
        super(EncoderRNN, self).__init__()
        # input_size is the vocabulary size
        # (35 in this example)
        self.embedding = nn.Embedding(input_size, 
                                      hidden_size)
        
        self.gru = nn.GRU(hidden_size,
                           hidden_size, 
                           batch_first=True)
        
        self.dropout = nn.Dropout(p=p_dropout)

    def forward(self, input):
        embedded = self.dropout(self.embedding(input))
        output, hidden = self.gru(embedded)
        return output, hidden

### Decoder

In [279]:
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size) -> None:
        super(BahdanauAttention, self).__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    def forward(self, query, keys):
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))  # (batch, seq_len, 1)
        scores = scores.squeeze(2).unsqueeze(1)  # (batch, 1, seq_len)
        weights = F.softmax(scores, dim=-1)
        # Batch matrix-matrix product:
        # both input tensors must be 3-dimensional with the same batch size.
        context = torch.bmm(weights, keys)  # (batch, 1, hidden_size)

        return context, weights
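
The shape bookkeeping in the attention module can be checked with random tensors. The sketch below repeats the same additive-attention steps outside of a class, with illustrative sizes chosen for this example only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, hidden_size = 2, 5, 4   # illustrative sizes

Wa = nn.Linear(hidden_size, hidden_size)
Ua = nn.Linear(hidden_size, hidden_size)
Va = nn.Linear(hidden_size, 1)

query = torch.randn(batch, 1, hidden_size)       # decoder state
keys = torch.randn(batch, seq_len, hidden_size)  # encoder outputs

# The query broadcasts over the seq_len axis of the keys.
scores = Va(torch.tanh(Wa(query) + Ua(keys)))    # (batch, seq_len, 1)
scores = scores.squeeze(2).unsqueeze(1)          # (batch, 1, seq_len)
weights = F.softmax(scores, dim=-1)              # each row sums to 1
context = torch.bmm(weights, keys)               # (batch, 1, hidden_size)

print(weights.shape, context.shape)
print(weights.sum(dim=-1))
```

The context vector is a weighted average of the encoder outputs, so it has the same width as a single encoder output regardless of the input length.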


In [280]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()
        self.embedding = nn.Embedding(output_size,hidden_size)
        self.attention = BahdanauAttention(hidden_size)
        self.gru = nn.GRU(2*hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, 
                                    dtype=torch.long, device=device).fill_(SOS_token)
        decoder_hidden = encoder_hidden
        decoder_outputs = []
        attentions = []

        for i in range(MAX_LENGTH):
            decoder_output, decoder_hidden, attn_weights = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )
            decoder_outputs.append(decoder_output)
            attentions.append(attn_weights)

            if target_tensor is not None:
                # Teacher forcing: Feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1) # Teacher forcing
            else:
                # Without teacher forcing: use its own predictions as the next input
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input

        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        attentions = torch.cat(attentions, dim=1)

        return decoder_outputs, decoder_hidden, attentions


    def forward_step(self, input, hidden, encoder_outputs):
        # Example shapes with batch_size = 32 and hidden_size = 20:
        # input: (32, 1)
        embedded =  self.dropout(self.embedding(input))
        # each index is mapped to a vector of size 20,
        # so embedded has shape (32, 1, 20)

        query = hidden.permute(1, 0, 2)
        context, attn_weights = self.attention(query, encoder_outputs)
        input_gru = torch.cat((embedded, context), dim=2)

        output, hidden = self.gru(input_gru, hidden)
        output = self.out(output)

        return output, hidden, attn_weights

In [281]:
x = torch.rand(5, 3) 
#.permute(1, 0, 2)

In [None]:
x.size(1) #.permute(2, 1, 0)

In [284]:
def indexesFromWord(domain, word):
    # Look up the index of every character in the word
    return [domain.letter2index[letter] for letter in word]

In [285]:
def tensorFromWord(domain, word):
    indexes = indexesFromWord(domain, word)
    indexes.append(EOS_token)
    out = torch.tensor(indexes, dtype=torch.long, device=device).view(1, -1)
    return  out

In [286]:
def tensorsFromPair(pair):
    input_tensor = tensorFromWord(input_domain, pair[0])
    target_tensor = tensorFromWord(output_domain, pair[1])
    return (input_tensor, target_tensor)

In [287]:
def get_dataloader(batch_size):
    input_domain, output_domain, pairs = prepareData()

    n = len(pairs)
    input_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)
    target_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)

    for idx, (inp, tgt) in enumerate(pairs):
        inp_ids = indexesFromWord(input_domain, inp)
        tgt_ids = indexesFromWord(output_domain, tgt)
        inp_ids.append(EOS_token)
        tgt_ids.append(EOS_token)
        input_ids[idx, :len(inp_ids)] = inp_ids
        target_ids[idx, :len(tgt_ids)] = tgt_ids

    train_data = TensorDataset(torch.LongTensor(input_ids).to(device),
                               torch.LongTensor(target_ids).to(device))

    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
    return input_domain, output_domain, train_dataloader
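
The padding scheme used by the dataloader can be sketched on plain lists: each sequence gets an EOS index and is then right-padded with zeros up to `MAX_LENGTH`. `pad_ids` and `PAD_VALUE` are hypothetical names introduced for this illustration.

```python
MAX_LENGTH = 35
EOS_token = 1
PAD_VALUE = 0  # what np.zeros leaves behind; same index as SOS_token


def pad_ids(ids, max_length=MAX_LENGTH):
    """Append EOS, then right-pad with PAD_VALUE up to max_length."""
    ids = list(ids) + [EOS_token]
    if len(ids) > max_length:
        raise ValueError(f"sequence of {len(ids)} ids exceeds {max_length}")
    return ids + [PAD_VALUE] * (max_length - len(ids))


row = pad_ids([2, 3, 4, 5])
print(len(row))   # 35
print(row[:7])    # [2, 3, 4, 5, 1, 0, 0]
```

Note that the padding value coincides with `SOS_token`, and a loss without an ignore index will also count the padded positions. Whether that matters for this task is not examined here.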

In [None]:
input_domain, output_domain, train_dataloader = get_dataloader(20)

In [289]:
encoder = EncoderRNN(35, 20).to(device)  # same device as the dataloader tensors
encoder_optimizer = optim.Adam(encoder.parameters(), lr=0.001)

In [290]:
decoder = AttnDecoderRNN(20,35).to(device)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=0.001)

In [291]:
attn = BahdanauAttention(20).to(device)

In [None]:
# Run a single batch through the encoder and decoder to inspect the outputs
for idx, data in enumerate(train_dataloader):
    input_tensor, target_tensor = data
    encoder_outputs, encoder_hidden = encoder(input_tensor)
    decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)
    break

In [None]:
decoder_outputs

In [252]:
batch_size = encoder_outputs.size(0)
decoder_input = torch.empty(batch_size, 1, 
                                    dtype=torch.long, device=device).fill_(SOS_token)

In [None]:
encoder_hidden.permute(1, 0, 2)

In [None]:
encoder_hidden

In [None]:
encoder_outputs.permute(1, 0, 2)[0]

In [None]:
decoder_input

In [None]:
decoder.forward_step(decoder_input, encoder_hidden, encoder_outputs)

In [205]:

decoder_hidden = encoder_hidden
decoder_outputs = []
attentions = []

In [None]:
encoder_hidden[0][0]

In [None]:
def train_epoch(dataloader, encoder, decoder, encoder_optimizer,
          decoder_optimizer, criterion):

    total_loss = 0
    for data in dataloader:
        input_tensor, target_tensor = data

        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()

        encoder_outputs, encoder_hidden = encoder(input_tensor)
        decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)

        loss = criterion(
            decoder_outputs.view(-1, decoder_outputs.size(-1)),
            target_tensor.view(-1)
        )
        loss.backward()

        encoder_optimizer.step()
        decoder_optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)
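
The loss call in `train_epoch` flattens the batch and time axes so that every position becomes one classification example. The sketch below, with random tensors and illustrative sizes, shows that this negative log-likelihood on log-softmax outputs gives the same value as cross-entropy on the raw logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab = 3, 4, 6                    # illustrative sizes

logits = torch.randn(batch, seq_len, vocab)
targets = torch.randint(0, vocab, (batch, seq_len))

log_probs = F.log_softmax(logits, dim=-1)          # what the decoder returns
nll = nn.NLLLoss()(log_probs.view(-1, vocab), targets.view(-1))

# Same quantity computed directly from the raw logits.
ce = nn.CrossEntropyLoss()(logits.view(-1, vocab), targets.view(-1))

print(torch.allclose(nll, ce))  # True
```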

In [None]:
import time
import math

def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

In [None]:
def train(train_dataloader, encoder, decoder, n_epochs, learning_rate=0.001,
               print_every=100, plot_every=100):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)
    criterion = nn.NLLLoss()

    for epoch in range(1, n_epochs + 1):
        loss = train_epoch(train_dataloader, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if epoch % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, epoch / n_epochs),
                                        epoch, epoch / n_epochs * 100, print_loss_avg))

        if epoch % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    # showPlot(plot_losses)

In [None]:
hidden_size = 128
batch_size = 32

input_domain, output_domain, train_dataloader = get_dataloader(batch_size)

# Domain exposes its vocabulary size as n_letter
encoder = EncoderRNN(input_domain.n_letter, hidden_size).to(device)
decoder = AttnDecoderRNN(hidden_size, output_domain.n_letter).to(device)

train(train_dataloader, encoder, decoder, 80, print_every=5, plot_every=5)

In [4]:
len_max = max(data_model.DOMAIN_EXTENSION_FEATURES.str.len())
domain_extension_feat = data_model.DOMAIN_EXTENSION_FEATURES.str.ljust(len_max, '|') 
set_of_chars = set(domain_extension_feat.sum())
vocabulary = {char:idx for idx, char in enumerate(set_of_chars)}

In [5]:
def replace_chars(word):
    replaced_word = [vocabulary.get(char, char) for char in word]
    return replaced_word

In [6]:
data = domain_extension_feat.apply(replace_chars)
data_idx = list(data.values)
target_idx = data_model[["IDX_DOMAIN"]].values.tolist()

In [9]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

In [10]:
# Define a simple dataset for word prediction
class WordDataset(Dataset):
    def __init__(self, sequences, target, sequence_length):
        self.sequences = sequences
        self.target = target
        self.sequence_length = sequence_length
        
    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence = self.sequences[idx]
        input_sequence = torch.tensor(sequence, dtype=torch.long)
        target = torch.tensor(self.target[idx], dtype=torch.long)
        return input_sequence, target

# Define the RNN model
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size,num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        embedded = self.embedding(x)
        out, hidden = self.rnn(embedded, hidden)
        out = self.fc(out)
        return out, hidden

In [14]:
# Define hyperparameters
input_size = len(vocabulary)  # Vocabulary size
hidden_size = 20  # Size of the hidden state
output_size = 1  # Output size
sequence_length = len_max  # Sequence length for training
batch_size = 1  # Batch size for training
num_epochs = 10  # Number of training epochs
num_layers = 10

In [15]:
# Create the dataset and dataloader
# sequences = generate_dataset()
dataset = WordDataset(data_idx, target_idx, sequence_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Create the RNN model instance
model = RNN(input_size, hidden_size, output_size, num_layers)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

In [None]:
# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    for batch_inputs, batch_targets in dataloader:
        optimizer.zero_grad()
        # Initialize the hidden state for the actual batch size
        hidden = torch.zeros(num_layers, batch_inputs.size(0), hidden_size)
        outputs, _ = model(batch_inputs, hidden)  # (batch, len_max, 1)
        # Assumption: IDX_DOMAIN is a position below len_max (the original
        # one-hot target used len_max classes), so the per-position scores
        # are used as logits over len_max classes.
        loss = criterion(outputs.squeeze(-1), batch_targets.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(dataloader):.4f}')

In [None]:
outputs.shape

In [17]:
s = dataset[11][0].view(1, -1)

In [None]:
dataset[11][1]

In [None]:

(torch.nn.functional.one_hot(dataset[11][1].to(torch.int64), len_max).T)

In [25]:
# Prediction
# test_input = torch.tensor([[0, 1, 2, 3]], dtype=torch.long)  # Input sequence: "hell"
hidden = torch.zeros(num_layers, 1, hidden_size)  # Initialize hidden state
with torch.no_grad():
    output, _ = model(s, hidden)

In [None]:
output