# Recurrent Neural Network

Certain data types such as time-series, text, and biological data contain sequential dependencies. In such cases, a recurrent neural network (RNN) is a type of neural network that can be used to solve such problems. RNN is specifically designed for processing sequential data.


## Embedding 

It is a way to represent of the meaning of a word in a *vector*. All relate to this word is represented in this vector (semantic, syntax, etc).

Exists a static embedding and dynamic contextualized embedding (like BERT).

This avoids the need to feature engineering. 

This conducts to *self-supervised* ways to learn representations of the input.

### Vector Semantic

It is vector that represents the meaning of a word. Recall, "semantics" is the process of understanding the meaning of a word and this an standar in NLP.

The idea of vector semantics is to represent a word as a point in a multidimensional semantic space that is derived from the distributions of embeddings word neighbors.

Vectors for representing words are called *embeddings* (although the term is sometimes more strictly applied only to dense vectors like *word2vec*)

### Cosine for measuring similarity

To measure similarity between two target words v and w, we need a metric that
takes two vectors gives a measure of their similarity.

The *dot product* (inner product) acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions—orthogonal vectors—will have a dot product of 0, representing their strong dissimilarity.

If we normalize this dot product by its length of the vector the result is :

$$\text{consine} = \frac{\sum_i^N v_iw_i}{\sqrt{\sum_i^N v_i^2}  \sqrt{\sum_i^N w_i^2} } $$

### Word2vec

#
# Sequence Models

#### Motivating Example 

Problem: *Entity Recognition*

It is used to find companies's name, times location, currency names, 

$\text{input}$: Harry Potter and Hermione Granger invented a new spell

$$\textbf{x}: [x^{<1>}, x^{<2>} ..., x^{<9>}]$$

$$\textbf{y}: [1, 1, 0, 1, 1, 0, 0, 0, 0]$$

#### Representating Words

$\text{input}$: Harry Potter and Hermione Granger invented a new spell.

We can represent this input sequence in a vector based on this *vocabulary* 

$$\text{vocabulary} = \{\text{a}, \text{aaron}, ...,\text{Harry,} ..., \text{Potter}, ...,\text{zulu}\}$$

We split the sentence in tokens, each token is a word.

$$\text{Input} \: \text{tokens} = [\text{Harry}, \text{Potter}, \text{Hermione}, \text{Granger}, \text{invented}, \text{a}, \text{new}, \text{spell}]$$

$$\text{Input} \: \text{tokens} = [x^{<1>}, x^{<2>}, x^{<3>}, x^{<4>}, x^{<5>}, x^{<6>}, x^{<7>}, x^{<8>}]$$

So $ x^{<1>}=\text{Harry}$ it converted to vector
$$x^{<1>} = \begin{bmatrix} 
0\\ 
0\\
0\\
...\\
1\\
...\\
0\\
...\\
0
\end{bmatrix}$$

This is called one-hot representation. What if we encounter a word that is not in the *vocabulary*? we usually create a token $\text{<UNK>}$.

Other tokes that we use are $\text{<PAD>}$,$\text{<EOS>}$, $\text{<SOS>}$

> Problemns with Feed Forward Neural Network
> * The *inputs* and the *outputs* can have several lenghts. "This is my iPhone" (inputs tokenized by word) (T = 4), "This laptop don't have enough storage" (inputs tokenized by word) (T = 7).
> * The inputs tokenized can be related to earlier inputs tokenized.

## Recurrent Neural Networks (Unidirectional)

Notation for *Unidirectional Recurrent Neural Network (one hidden layer)*

$$a^{<t>} = \Phi(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$$

$$y^{<t>} = \Theta(W_{ya}a^{<t>} + b_y)$$

> Where $W_{ya}$ are the weights used to compute $y$ multiplied the vector $a$ 
>
> $a^{<0>}$ is a vector of zeros

<!-- <div><span style="font-size:25px;font-weight:300">RNN Architecture (One Hidden Layer)</span> </div> -->

### RNN Architecture (One Hidden Layer)

Graphical representation

<div style = "">
    <div style = "display:flex;align-items:center; justify-content:center; flex-direction:column" >
    <img src = "./assets/architecture-rnn.png" style="height:300px">
    <span>Andrew Ng</span>
    </div>
    
</div>


#### Examples of RRN architectures

- *Many-to-Many*: Both inputs and outputs are sequences. Can be
direct or delayed.
  - Ex.: Video-captioning, i.e., describing a sequence of images via text (direct).
  - Translating one language into another (delayed)

- *Many-to-One* (Sentimential Classification ) The input data is a sequence, but the output is a fixedsize vector, not a sequence.
  - Ex.: sentiment analysis, the input is some text, and the output is a class label.
- *One-to-Many*:  Input data is in a standard format (not a sequence), theoutput is a sequence.
  - Ex.: Image captioning, where the input is an image, the output is a textdescription of that image

### Forward Propagation and Back Propagation 

Using *RNN Many-to-Many* architecture (one hidden layer) with length $T_x = T_y$ as *Language Model*.

So the number of inputs and output are the same.

**Note:** language models predict the next word in a sequence given some preceding context. Since the next word can be any word in the vocabulary we will define the output as a vector of probabilities. The probability of next word is in $k^{th}$ class of $K$ classes in the time $<t>$ given a context $y^{<1>}, y^{<2>}, \dots, y^{<t-1>}$.

$$\hat{y_k}^{<t>} = P({y_k}^{<t>} | y^{<1>}, y^{<2>}, \dots, y^{<t-1>})$$

$$\hat{y}^{<t>} = [\hat{y_1}^{<1>}, \hat{y_2}^{<2>}, \dots, \hat{y_{T_y}}^{<T_y>}]$$

For $t=1$, $\hat{y}^{<1>} = P({y}^{<1>})$

In the Forward Propagation Fase the input of training instance are fed to *RNN* and the final result the vector of probabilities 

$$\hat{y}^{<1>}, \hat{y}^{<2>}, \dots, \hat{y}^{<T_y>}$$

In the Back Propagation we need to compute the loss function.

The $y^{<t>}$ is convert to one-hot.

$$L^{<t>} = -\sum_i y_i^{<t>} log(\hat{y}_i^{<t>})$$

$$L = \sum_{t=1}^{T_y} L^{<t>}$$

$$\frac{\partial L^{<t>}}{\partial W_{aa}} = \frac{\partial L^{<t>}}{\partial y^{<t>}} . \Theta\text{'}. W_{ya}. \Phi \text{'}  a^{<t-1>} + \frac{\partial L^{<t>}}{\partial y^{<t>}} . \Theta\text{'}. W_{ya}. \Phi \text{'}  a^{<t-1>}[\Phi \text{'}  a^{<t-2>}] + ... + \frac{\partial L^{<t>}}{\partial y^{<t>}} . \Theta\text{'}. W_{ya}. \Phi \text{'}  a^{<t-1>}[\Phi \text{'}  a^{<t-2>}]..[\Phi \text{'}  a^{<0>}]$$

This is very problematic:
Vanishing/Exploding gradient problem!

### Language Modelling

Language modeling, or LM, is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions.

It one of the most basic and important tasks in natural language processing and RNN do very well. 

We can predict a sentence given a sequence of words. This can be achieved using the chain rule of conditional probability already defined above.

$$P(y^{<1>}, ..., y^{<T_y>}) = P(y^{<1>})P(y^{<2>}|y^{<1>}) \dots P(y^{<T_y>}|y^{<1>},\dots, y^{<T_y-1>})$$



### Language modelling with an RNN

*Training set*: large corpus (NLP terminology for text with large body) of any language text

*Tokenize*: Divide the sentences into piece of string (words, n-gram) 

*Word Tokenize*: divide the sentences into words. Punctuation if is useful can be considere like a word.

$$\text{cats average 15 hours of sleep a day}.$$

If considere `.` like a token the tokens would be $y^{<1>}, ..., y^{<8>}, y^{<9>}$, but if it not the tokens would be $y^{<1>}, ..., y^{<8>}$. 

If we want to tokenize the end of sentence we can add `<EOS>` to tokens.

If a word is not in a vocabulary (10,000 words) it can be tokenize like `<UNK>` and add to tokens.

### RNN Model

Based in a sequence of tokens where:

$$x^{<1>} =: \text{cats} = \begin{bmatrix} 
0\\ 
0\\
0\\
...\\
1\\
...\\
0
\end{bmatrix}$$

Such as the dimention of the $x^{<1>}$ vector is the vocabulary. 

$$y^{<1>} = \begin{bmatrix} 
P(\text{a})\\ 
P(\text{aaron})\\
P(\text{average})\\
...\\
P(\text{cats})\\
...\\
P(\text{zuu})
\end{bmatrix}$$



<div><span style="font-size:25px;font-weight:300">RNN Architecture (One Hidden Layer)</span> </div>
<div >
    <div style = "display:flex;align-items:center; justify-content:center; flex-direction:column" >
    <img src = "./assets/architecture-rnn-language-modelling.png" style="height:200px">
    </div>
    
</div>


### In the practice

Let the sequence `"The reinforcement learning is the key of machine learning"` and we want to build a *Language Model* to predict the next word. 

So first we need to convert the sequence into *tokens* (in this case the token is a word). 

So the sequence is convert to a vector: $[\text{The}, \: \text{reinforcement}, \: \text{learning},\: \text{is}, \:\text{the}, \: \text{key}, \: \text{of},\: \text{machine},\: \text{learning}]$. 

Above sequence is splitted in *Source Text* and *Target Text* of equal length. $T_x = T_y$

Source Text : $[\text{The}, \: \text{reinforcement}, \: \text{learning},\: \text{is}, \:\text{the}, \: \text{key}, \: \text{of},\: \text{machine}]$. 

Target Text : $[\text{reinforcement}, \: \text{learning},\: \text{is}, \:\text{the}, \: \text{key}, \: \text{of},\: \text{machine},\: \text{learning}]$. 

We can fomalize into variables

Let be $x^{<t>} = [x^{<1>}, .., x^{<8>}]$ and $y^{<t>} = [y^{<1>}, .., y^{<8>}]$

When $x^{<t>}$ and $y^{<t>}$ is fed to RNN (1 hidden layer) to train, it apply the following steps:

- $x^{<1>} = \text{The}$ is transformed using *embedding transformation* (we define the size of vocabulary and the dimention of embedding) or *one-hot* (based in to the vocabulary). So if $\text{vocabulary}\_\text{size} = 20$ and the dimention of  $\text{embedding}\_\text{size} = 5$ the vector would be $x^{<1>} = [0.79632116, 0.09251072, 0.20748794, 0.20226105, 0.533899253]$ and each time that the word *the* needs to encode this vector will be used.

- $a^{<0>}$ init with zero values $[0, 0, 0, 0, 0]$, $W_{aa}$ and $W_{ax}$ are matrices  of shape = $(5, 5)$ inits with random values. Where the first $5$ to left are the weights for the first unit and $b_a$ is a vector of shape $(1, 5)$ where each single value is the bias for each unit.
- With those parameters and data we can compute $a^{<1>} =\Phi(a^{<0>}\text{.}W_{aa} + x^{<1>}\text{.}W_{ax} + b_a)$.

- Since we need to predict the probability of the next word, we use $\text{softmax}$ function and output should be equal to $20$ the size of the vocabulary. So we can compute the probabilities $P(y^{<1>}), P(y^{<2>}), ..., P(y^{<20>})$.

- With those requeriments the weights $W_{ya}$ is a matrix of shape $(8,20)$ and $b_y$ is vecor of shape $(1, 20)$ with random values at init.

- So we can compute the $y^{<1>} = \Theta(a^{<1>}\text{.}W_{ya} + b_y)$.
- In the next iterartion we will apply the same steps but with the data and parameters of the previous iteration.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

In [120]:
# Define a simple dataset for word prediction
class WordDataset(Dataset):
    def __init__(self, sequences, sequence_length):
        
        self.sequences = sequences
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence = self.sequences[idx]
        input_sequence = torch.tensor(sequence[:-1], dtype=torch.long)
        target = torch.tensor(sequence[1:], dtype=torch.long)
        return input_sequence, target

# Define the RNN model
class RNN(nn.Module):
    def __init__(self, vocabulary_size, embedding_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(vocabulary_size, embedding_size)
        self.rnn = nn.RNN(embedding_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        embedded = self.embedding(x)
        out, hidden = self.rnn(embedded, hidden)
        out = self.fc(out)
        return out, hidden

In [121]:
# Function to generate a simple dataset
def generate_dataset():
    # Simple dataset of word sequences
    sequences = [
        [0, 1, 2, 3, 4],  # "hello world"
        [5, 6, 7, 8, 4],  # "how are you"
        [9, 1, 10, 3, 4], # "goodbye world"
    ]
    return sequences

# Define hyperparameters

# input_size = 11
vocabulary_size = 11
# Input size = Vocabulary size
# This implies that the range of posibles values 
# is between 0-10. 
hidden_size = 8 # Size of the hidden state
output_size = 11  # Output size (same as input size)
sequence_length = 4  # Sequence length for training
batch_size = 1  # Batch size for training
num_epochs = 100  # Number of training epochs
embedding_size = 8
# Create the dataset and dataloader
sequences = generate_dataset()
dataset = WordDataset(sequences, sequence_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Create the RNN model instance
model = RNN(vocabulary_size,embedding_size, hidden_size, output_size)

# Define the loss function and optimizer
# For multinomial model
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

In [122]:
# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    for batch_inputs, batch_targets in dataloader:
        optimizer.zero_grad()
        hidden = torch.zeros(1, batch_size, hidden_size) # Initialize hidden state
        outputs, _ = model(batch_inputs, hidden)
        loss = criterion(outputs.view(-1, output_size), batch_targets.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()    
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(dataloader):.4f}')


Epoch 1/100, Loss: 2.5559
Epoch 2/100, Loss: 2.4079
Epoch 3/100, Loss: 2.2709
Epoch 4/100, Loss: 2.1404
Epoch 5/100, Loss: 2.0146
Epoch 6/100, Loss: 1.8934
Epoch 7/100, Loss: 1.7769
Epoch 8/100, Loss: 1.6663
Epoch 9/100, Loss: 1.5625
Epoch 10/100, Loss: 1.4676
Epoch 11/100, Loss: 1.3818
Epoch 12/100, Loss: 1.3038
Epoch 13/100, Loss: 1.2336
Epoch 14/100, Loss: 1.1705
Epoch 15/100, Loss: 1.1126
Epoch 16/100, Loss: 1.0601
Epoch 17/100, Loss: 1.0121
Epoch 18/100, Loss: 0.9677
Epoch 19/100, Loss: 0.9267
Epoch 20/100, Loss: 0.8879
Epoch 21/100, Loss: 0.8512
Epoch 22/100, Loss: 0.8161
Epoch 23/100, Loss: 0.7827
Epoch 24/100, Loss: 0.7503
Epoch 25/100, Loss: 0.7196
Epoch 26/100, Loss: 0.6902
Epoch 27/100, Loss: 0.6612
Epoch 28/100, Loss: 0.6341
Epoch 29/100, Loss: 0.6076
Epoch 30/100, Loss: 0.5819
Epoch 31/100, Loss: 0.5577
Epoch 32/100, Loss: 0.5344
Epoch 33/100, Loss: 0.5122
Epoch 34/100, Loss: 0.4912
Epoch 35/100, Loss: 0.4711
Epoch 36/100, Loss: 0.4520
Epoch 37/100, Loss: 0.4339
Epoch 38/1

In [123]:
# Prediction
test_input = torch.tensor([[0, 1, 2, 3]], dtype=torch.long)  # Input sequence: "hell"
hidden = torch.zeros(1, 1, hidden_size)  # Initialize hidden state
with torch.no_grad():
    output, _ = model(test_input, hidden)
predicted_idx = torch.argmax(output, dim=2)
print("Predicted sequence:", [idx.item() for idx in predicted_idx.squeeze()])

Predicted sequence: [1, 2, 3, 4]


## Bidirectional RNN and Stacked RNN

## The LSTM

## The GRU

## Application

The most common way for checking if an email is valid or not is with authentication. But if the company where you work not have an application that can check if an email is valid or not, then you can build a model that can correct the email (correct only domain part). 

$$\text{wings@gmial.com} \rightarrow \text{wings@gmail.com}$$

We will be applying the following concepts:
- Embedding
- RNN (GRU or LSTM)

In [2]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import re
import random
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

import numpy as np
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Loading data files

The data for this project is a set of hundred of domains + extension, correct and incorrect domains + extensions.

We\'ll need a unique index per word to use as the inputs and targets of
the networks later. To keep track of all this we will use a helper class
called `Lang` which has word → index (`word2index`) and index → word
(`index2word`) dictionaries, as well as a count of each word
`word2count` which will be used to replace rare words later.

In [273]:
SOS_token = 0
EOS_token = 1

class Domain:
    def __init__(self, name):
        self.name = name
        self.letter2index = {}
        self.letter2count = {}
        self.index2letter = {0: "SOS", 1: "EOS"}
        self.n_letter = 2  # Count SOS and EOS

    def addWord(self, word):
        for letter in re.split("",  word):
            if letter != '':
                self.addLetter(letter)

    def addLetter(self, letter):
        if letter not in self.letter2index:
            self.letter2index[letter] = self.n_letter
            self.letter2count[letter] = 1
            self.index2letter[self.n_letter] = letter
            self.n_letter += 1
        else:
            self.letter2count[letter] += 1

In [274]:
def readFile():
    print("Reading lines...")
    # Read the file and split into lines
    lines = open('%s.csv' % ('data'), encoding='utf-8').\
        read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[s for s in l.split('\t')] for l in lines]
    
    input_domain = Domain('incorrect')
    output_domain = Domain('correct')

    return input_domain, output_domain, pairs

In [276]:
def prepareData():
    input_domain, output_domain, pairs = readFile()

    print("Read %s sentence pairs" % len(pairs))

    print("Counting letters...")

    for pair in pairs:
        input_domain.addWord(pair[0])
        output_domain.addWord(pair[1])

    print("Counted letters:")
    print(input_domain.name, input_domain.n_letter)
    print(output_domain.name, output_domain.n_letter)
    return input_domain, output_domain, pairs

input_domain, output_domain, pairs = prepareData()

Reading lines...
Read 705 sentence pairs
Counting letters...
Counted letters:
incorrect 41
correct 39


In [277]:
MAX_LENGTH = 35

### Encoder

In [278]:
class EncoderRNN(nn.Module):
    def __init__(self,  input_size, hidden_size, p_dropout=0.1):
        super(EncoderRNN, self).__init__()
        # If it setted 
        # input size is the vocabulary size
        # in this example is = 35
        #
        self.embedding = nn.Embedding(input_size, 
                                      hidden_size)
        
        self.gru = nn.GRU(hidden_size,
                           hidden_size, 
                           batch_first=True)
        
        self.dropout = nn.Dropout(p=p_dropout)

    def forward(self, input):
        embedded = self.dropout(self.embedding(input))
        output, hidden = self.gru(embedded)
        return output, hidden

Decoder

In [279]:
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size) -> None:
        super(BahdanauAttention, self).__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    def forward(self, query, keys):
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))
        scores = scores.squeeze(2).unsqueeze(1)
        # Batch Matrix-Matrix Product
        # Both input tensors must be 3-dimensional with the same batch size.
        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights, keys)

        return context, weights


In [280]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()
        self.embedding = nn.Embedding(output_size,hidden_size)
        self.attention = BahdanauAttention(hidden_size)
        self.gru = nn.GRU(2*hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, 
                                    dtype=torch.long, device=device).fill_(SOS_token)
        decoder_hidden = encoder_hidden
        decoder_outputs = []
        attentions = []

        for i in range(MAX_LENGTH):
            decoder_output, decoder_hidden, attn_weights = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )
            decoder_outputs.append(decoder_output)
            attentions.append(attn_weights)

            if target_tensor is not None:
                # Teacher forcing: Feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1) # Teacher forcing
            else:
                # Without teacher forcing: use its own predictions as the next input
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input

        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        attentions = torch.cat(attentions, dim=1)

        return decoder_outputs, decoder_hidden, attentions


    def forward_step(self, input, hidden, encoder_outputs):
        # batch_size = 32
        # hidden_size = 20
        # input shape = (32, 1)
        embedded =  self.dropout(self.embedding(input))
        # for each value the embedding will be of a shape of (1, 20)
        # for all values the embedding will be of shape of (32, 20)

        query = hidden.permute(1, 0, 2)
        context, attn_weights = self.attention(query, encoder_outputs)
        input_gru = torch.cat((embedded, context), dim=2)

        output, hidden = self.gru(input_gru, hidden)
        output = self.out(output)

        return output, hidden, attn_weights

In [281]:
x = torch.rand(5, 3) 
#.permute(1, 0, 2)

In [282]:
x.size(1) #.permute(2, 1, 0)

3

In [284]:
def indexesFromWord(domain, word):
    return [domain.letter2index[letter]
            for letter in re.split("",word) if letter != ""]

In [285]:
def tensorFromWord(domain, word):
    indexes = indexesFromWord(domain, word)
    indexes.append(EOS_token)
    out = torch.tensor(indexes, dtype=torch.long, device=device).view(1, -1)
    return  out

In [286]:
def tensorsFromPair(pair):
    input_tensor = tensorFromWord(input_domain, pair[0])
    target_tensor = tensorFromWord(output_domain, pair[1])
    return (input_tensor, target_tensor)

In [287]:
def get_dataloader(batch_size):
    input_domain, output_domain, pairs = prepareData()

    n = len(pairs)
    input_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)
    target_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)

    for idx, (inp, tgt) in enumerate(pairs):
        inp_ids = indexesFromWord(input_domain, inp)
        tgt_ids = indexesFromWord(output_domain, tgt)
        inp_ids.append(EOS_token)
        tgt_ids.append(EOS_token)
        input_ids[idx, :len(inp_ids)] = inp_ids
        target_ids[idx, :len(tgt_ids)] = tgt_ids

    train_data = TensorDataset(torch.LongTensor(input_ids).to(device),
                               torch.LongTensor(target_ids).to(device))

    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
    return input_domain, output_domain, train_dataloader

In [288]:
input_domain, output_domain, train_dataloader = get_dataloader(20)

Reading lines...
Read 705 sentence pairs
Counting letters...
Counted letters:
incorrect 41
correct 39


In [289]:
encoder = EncoderRNN(35, 20)
encoder_optimizer = optim.Adam(encoder.parameters(), lr=0.001)

In [290]:
decoder = AttnDecoderRNN(20,35)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=0.001)

In [291]:
attn = BahdanauAttention(20)

In [301]:
for idx, data in enumerate(train_dataloader):
    input_tensor, target_tensor = data
    encoder_optimizer.zero_grad()
    encoder_outputs, encoder_hidden = encoder(input_tensor)
    # decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)
    # if idx == 0:
    #     break

tensor([[ 3, 31,  3, 12, 21, 12, 25,  9, 11, 12,  9,  3,  8, 11,  7, 23, 21, 11,
         17,  7,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [10,  7,  5, 10, 21, 10, 11, 17,  7,  1,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [24,  7,  6,  7, 16,  4, 17,  3,  7, 11, 17,  7,  1,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [13,  3,  8, 22,  7, 15,  4,  5,  6, 33,  3, 16, 15,  7,  8,  5,  9, 11,
         12,  9, 13,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [17,  8,  9, 15,  7,  4,  5, 17,  7,  8, 21, 11, 12,  9, 13,  1,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 7, 10, 15, 21, 23,  4,  9, 18,  3, 14,  3, 11, 12,  9, 13,  1,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 6, 13,  3,  4, 16, 11, 12,  9,  1,  

In [262]:
decoder_outputs

tensor([[[-3.5157, -3.2959, -3.2506,  ..., -3.5495, -3.9369, -3.2888],
         [-3.6846, -3.2415, -3.2439,  ..., -3.4782, -3.6354, -3.5759],
         [-3.8572, -3.1351, -3.3892,  ..., -3.7030, -3.5965, -3.5366],
         ...,
         [-4.1298, -3.3924, -3.1606,  ..., -3.5697, -3.2676, -3.0428],
         [-4.0428, -3.3729, -3.0954,  ..., -3.5798, -3.2507, -3.1020],
         [-4.0920, -3.3658, -3.0775,  ..., -3.5628, -3.2438, -3.0607]]],
       grad_fn=<LogSoftmaxBackward0>)

In [252]:
batch_size = encoder_outputs.size(0)
decoder_input = torch.empty(batch_size, 1, 
                                    dtype=torch.long, device=device).fill_(SOS_token)

In [253]:
encoder_hidden.permute(1, 0, 2)

tensor([[[-0.6255,  0.0973, -0.1827, -0.2655, -0.8882,  0.1329,  0.7038,
          -0.1865,  0.5318,  0.5003, -0.1420, -0.2387, -0.5108,  0.7329,
           0.5157, -0.4887,  0.6601, -0.1458, -0.4213,  0.4513,  0.0055,
           0.6962,  0.1793, -0.0806, -0.3179,  0.4270,  0.4415,  0.6404,
          -0.8983,  0.7225, -0.0383,  0.5296]]], grad_fn=<PermuteBackward0>)

In [236]:
encoder_hidden

tensor([[[-0.6255,  0.0973, -0.1827, -0.2655, -0.8882,  0.1329,  0.7038,
          -0.1865,  0.5318,  0.5003, -0.1420, -0.2387, -0.5108,  0.7329,
           0.5157, -0.4887,  0.6601, -0.1458, -0.4213,  0.4513,  0.0055,
           0.6962,  0.1793, -0.0806, -0.3179,  0.4270,  0.4415,  0.6404,
          -0.8983,  0.7225, -0.0383,  0.5296]]], grad_fn=<StackBackward0>)

In [237]:
encoder_outputs.permute(1, 0, 2)[0]

tensor([[-0.2062,  0.1338, -0.3355,  0.3414, -0.6480,  0.2240, -0.1252,  0.0252,
          0.2681, -0.0411, -0.0927,  0.0172,  0.2046,  0.0497,  0.0073, -0.5470,
          0.2593, -0.0483, -0.1065,  0.3331,  0.1543,  0.0232,  0.1063, -0.3404,
         -0.0736,  0.0889, -0.0934,  0.0108,  0.1893,  0.1067,  0.1679,  0.5443]],
       grad_fn=<SelectBackward0>)

In [238]:
decoder_input

tensor([[0]])

In [254]:
qq.forward_step(decoder_input, encoder_hidden, encoder_outputs)

weights tensor([[[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]],

        [[1.]]], grad_fn=<SoftmaxBackward0>)
scores tensor([[[-0.1569]],

        [[-0.3842]],

        [[-0.2082]],

        [[-0.1605]],

        [[-0.1908]],

        [[-0.3119]],

        [[-0.2547]],

        [[-0.2644]],

        [[-0.3318]],

        [[-0.2021]],

        [[-0.1359]],

        [[-0.1769]],

        [[-0.2103]],

        [[-0.2647]],

        [[-0.2137]],

        [[-0.1464]],

        [[

RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 1 but got size 35 for tensor number 1 in the list.

In [205]:

decoder_hidden = encoder_hidden
decoder_outputs = []
attentions = []

tensor([[ 0.0099, -0.0744, -0.0425,  ...,  0.3508, -0.2164, -0.0227],
        [-0.0391,  0.4516,  0.0288,  ...,  0.3973, -0.3705, -0.0697],
        [ 0.1854,  0.2195,  0.2014,  ...,  0.6389, -0.3764, -0.0347],
        ...,
        [-0.2908, -0.3483,  0.7059,  ...,  0.6978,  0.6935, -0.4306],
        [-0.2546, -0.3885,  0.6932,  ...,  0.7494,  0.6911, -0.2461],
        [-0.2882, -0.4019,  0.7049,  ...,  0.7579,  0.7133, -0.3795]],
       grad_fn=<SelectBackward0>)

In [178]:
encoder_hidden[0][0]

tensor([-0.2882, -0.4019,  0.7049,  0.0957,  0.0388, -0.7721, -0.0935, -0.7499,
         0.2285,  0.2972,  0.6144, -0.2520, -0.4238, -0.0793, -0.2018,  0.1237,
         0.8639,  0.0808,  0.2070, -0.6430, -0.3698, -0.3117, -0.7089,  0.4119,
         0.5417,  0.2847,  0.6854,  0.0236, -0.5349,  0.7579,  0.7133, -0.3795],
       grad_fn=<SelectBackward0>)

In [None]:
def train_epoch(dataloader, encoder, decoder, encoder_optimizer,
          decoder_optimizer, criterion):

    total_loss = 0
    for data in dataloader:
        input_tensor, target_tensor = data

        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()

        encoder_outputs, encoder_hidden = encoder(input_tensor)
        decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)

        loss = criterion(
            decoder_outputs.view(-1, decoder_outputs.size(-1)),
            target_tensor.view(-1)
        )
        loss.backward()

        encoder_optimizer.step()
        decoder_optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)

In [None]:
import time
import math

def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

In [None]:
def train(train_dataloader, encoder, decoder, n_epochs, learning_rate=0.001,
               print_every=100, plot_every=100):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)
    criterion = nn.NLLLoss()

    for epoch in range(1, n_epochs + 1):
        loss = train_epoch(train_dataloader, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if epoch % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, epoch / n_epochs),
                                        epoch, epoch / n_epochs * 100, print_loss_avg))

        if epoch % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    # showPlot(plot_losses)

In [None]:
hidden_size = 128
batch_size = 32

input_lang, output_lang, train_dataloader = get_dataloader(batch_size)

encoder = EncoderRNN(input_lang.n_words, hidden_size).to(device)
# decoder = AttnDecoderRNN(hidden_size, output_lang.n_words).to(device)
decoder = AttnDecoderRNN(hidden_size, output_lang.n_words).to(device)

train(train_dataloader, encoder, decoder, 80, print_every=5, plot_every=5)

In [4]:
len_max = max(data_model.DOMAIN_EXTENSION_FEATURES.str.len())
domain_extension_feat = data_model.DOMAIN_EXTENSION_FEATURES.str.ljust(len_max, '|') 
set_of_chars = set(domain_extension_feat.sum())
vocabulary = {char:idx for idx, char in enumerate(set_of_chars)}

In [5]:
def replace_chars(word):
    replaced_word = [vocabulary.get(char, char) for char in word]
    return replaced_word

In [6]:
data = domain_extension_feat.apply(replace_chars)
data_idx = list(data.values)
target_idx = data_model[["IDX_DOMAIN"]].values.tolist()

In [9]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

In [10]:
# Define a simple dataset for word prediction
class WordDataset(Dataset):
    def __init__(self, sequences, target, sequence_length):
        self.sequences = sequences
        self.target = target
        self.sequence_length = sequence_length
        
    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence = self.sequences[idx]
        input_sequence = torch.tensor(sequence, dtype=torch.long)
        target = torch.tensor(self.target[idx], dtype=torch.long)
        return input_sequence, target

# Define the RNN model
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size,num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        embedded = self.embedding(x)
        out, hidden = self.rnn(embedded, hidden)
        out = self.fc(out)
        return out, hidden

In [14]:
# Define hyperparameters
input_size = len(vocabulary)  # Vocabulary size
hidden_size = 20  # Size of the hidden state
output_size = 1  # Output size
sequence_length = len_max  # Sequence length for training
batch_size = 1  # Batch size for training
num_epochs = 10  # Number of training epochs
num_layers = 10

In [15]:
# Create the dataset and dataloader
# sequences = generate_dataset()
dataset = WordDataset(data_idx, target_idx, sequence_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Create the RNN model instance
model = RNN(input_size, hidden_size, output_size, num_layers)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

In [16]:
# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    for batch_inputs, batch_targets in dataloader:
        optimizer.zero_grad()
        hidden = torch.zeros(num_layers, batch_size, hidden_size)  # Initialize hidden state
        outputs, _ = model(batch_inputs, hidden)
        b_target = (torch.nn.functional.one_hot(batch_targets.view(-1).to(torch.int64), len_max).T).to(torch.float64)
        loss = criterion(outputs.view(-1, output_size), b_target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(dataloader):.4f}')

Epoch 1/10, Loss: 0.0000
Epoch 2/10, Loss: 0.0000
Epoch 3/10, Loss: 0.0000
Epoch 4/10, Loss: 0.0000
Epoch 5/10, Loss: 0.0000
Epoch 6/10, Loss: 0.0000
Epoch 7/10, Loss: 0.0000
Epoch 8/10, Loss: 0.0000
Epoch 9/10, Loss: 0.0000
Epoch 10/10, Loss: 0.0000


In [31]:
outputs.shape

torch.Size([1, 34, 1])

In [17]:
s = dataset[11][0].view(1, -1)

In [24]:
dataset[11][1]

tensor([6])

In [19]:

(torch.nn.functional.one_hot(dataset[11][1].to(torch.int64), len_max).T)

tensor([[0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [1],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0]])

In [25]:
# Prediction
# test_input = torch.tensor([[0, 1, 2, 3]], dtype=torch.long)  # Input sequence: "hell"
hidden = torch.zeros(num_layers, 1, hidden_size)  # Initialize hidden state
with torch.no_grad():
    output, _ = model(s, hidden)

In [28]:
output

tensor([[[0.3272],
         [0.4066],
         [0.3755],
         [0.3679],
         [0.3828],
         [0.3856],
         [0.3819],
         [0.3831],
         [0.3905],
         [0.3730],
         [0.3939],
         [0.3758],
         [0.3947],
         [0.3775],
         [0.3918],
         [0.3842],
         [0.3905],
         [0.3794],
         [0.3896],
         [0.3835],
         [0.3926],
         [0.3824],
         [0.3885],
         [0.3855],
         [0.3879],
         [0.3872],
         [0.3864],
         [0.3874],
         [0.3865],
         [0.3873],
         [0.3866],
         [0.3870],
         [0.3868],
         [0.3869]]])

## Long Short-Term Memory (LSTM)

The cell memory is updated 

The update of cell memory depend on *forget gate* $\hat{\alpha_t}$, *input gate* $\hat{z_{t}}$ and cell gate $\hat{c_t}$

$$c_t = \hat{\alpha}\: \text{o} \:c_{t-1} + \hat{z}_t \: \text{o} \: \hat{c}_{t}  $$

Where $\hat{\alpha_t}$ is refered as *forget gate*, $\hat{z_{t}}$ as *input gate* and $\hat{r}$ as *output gate*

The hidden state is updated using the followig definition:

$$h_t = \hat{r_t} \: \text{o} \: \tanh{(c_t)}$$

The  reset, forget, input and cell memory gates ared updated by *RNN* and all depending on the hidden state $h_t$

$$ \text{Output gate} :  \hat{r_{t}} = \sigma(U_r h_{t-1} + W_r x_t + b_r)$$
$$ \text{Forget gate} : \hat{\alpha_{t}} = \sigma(U_{\alpha} h_{t-1} + W_{\alpha} x_t + b_{\alpha})$$
$$ \text{Input gate} : \hat{z_{t}} = \sigma(U_{z} h_{t-1} + W_z x_t + b_z)$$
$$ \text{Cell memory gate} : \hat{c_{t}} = \sigma(U_{c} h_{t-1} + W_c x_t + b_c)$$