### PyTorch from scratch

- We represent each word as a one-hot vector: a giant vector of zeros, except for a single one at the index of that word.
- To keep track of each word's index, we define a helper class Lang.
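
As a small illustration of the one-hot idea, here is a minimal sketch in plain Python. The toy vocabulary and the `one_hot` helper are made up for this example; the indices play the same role as the word-to-index mapping that Lang keeps.

```python
# Hypothetical toy vocabulary; indices 0 and 1 are reserved for SOS and EOS.
word2index = {"SOS": 0, "EOS": 1, "i": 2, "am": 3, "cold": 4}


def one_hot(word, vocab):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


print(one_hot("am", word2index))  # [0, 0, 0, 1, 0]
```

The vector is as long as the vocabulary, which is why in practice we store only the index and let the embedding layer do the lookup.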

In [3]:
from __future__ import unicode_literals
import unicodedata
import torch 
import torch.nn as nn 
from torch import optim
import re 
import torch.nn.functional as F

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
SOS_token = 0
EOS_token = 1

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2
    
    # split the sentence, and then add the word to the word2index dictionary 
    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)
    
    # addWord: adds the word to the dict, if they don't exist in the dictionary 
    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

In [5]:
# convert Unicode to plain ASCII: English and French mostly share an alphabet, so stripping accents
# lets us work with a smaller set of characters

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

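# normalize: lowercase, trim, strip accents, and remove non-letter characters
def normalize_string(s):
    s = unicodeToAscii(s.lower().strip())
    # put a space before ., ! and ? so punctuation becomes its own token
    s = re.sub(r"([.!?])", r" \1", s)
    # replace every run of characters other than letters, ! and ? with a single space
    # (this also removes the "." separated out above, and turns apostrophes into spaces)
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)

    return s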
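
To see what the normalization does to real strings, here is a small self-contained check. The two functions are condensed copies of the helpers above under different names (`strip_accents`, `normalize`) so the example runs on its own; the sample sentences are made up.

```python
import re
import unicodedata


def strip_accents(s):
    # Same idea as unicodeToAscii: decompose characters, then drop combining marks.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


def normalize(s):
    # Condensed copy of normalize_string.
    s = strip_accents(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)
    return s


print(repr(normalize("Je suis très fatigué!")))  # 'je suis tres fatigue !'
print(repr(normalize("I'm cold.")))              # 'i m cold '
```

Note how the apostrophe becomes a space and the period disappears (leaving a trailing space). This is why the prefix filter later on matches forms such as "i m " rather than "i'm ".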

In [6]:
# read the tab-separated file and create the two Lang objects
# note: reverse is accepted but not used here, so pairs always stay in [eng, fra] order
def read_langs(lang1, lang2, reverse=False):
    print("Reading lines...")
    # read the file and split into lines
    with open("eng-fra.txt", encoding="utf-8") as f:
        lines = f.read().splitlines()
    # split each line on the tab into [eng, fra] and normalize both strings
    pairs = [[normalize_string(s) for s in l.split('\t')] for l in lines]
    input_lang = Lang(lang1)
    output_lang = Lang(lang2)
    return input_lang, output_lang, pairs


In [7]:
# to train something quickly, trim the data set to short, simple sentences:
# fewer than MAX_LENGTH (10) tokens, and only sentences of the form "I am ..." / "He is ..." etc.
# (the apostrophe forms appear as "i m ", "he s ", ... because normalization replaces apostrophes with spaces)
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

# returns True if the English sentence has fewer than MAX_LENGTH tokens and starts with one of the prefixes
def filter_pair(p):
    # named is_short rather than filter, to avoid shadowing the built-in filter()
    is_short = len(p[0].split(' ')) < MAX_LENGTH
    has_prefix = p[0].startswith(eng_prefixes)

    return is_short and has_prefix


def filter_pairs(pairs):
    return [pair for pair in pairs if filter_pair(pair)]

In [8]:
def prepare_data(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = read_langs(lang1, lang2, reverse)
    print("Reading %s sentence pairs" % len(pairs))
    pairs = filter_pairs(pairs)
    print("trimmed to %s sentence pairs " % len(pairs))

    print("counting words")
    
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs 

prepare_data('eng', 'fra', True)
print("end")

Reading lines...
Reading 135842 sentence pairs
trimmed to 11492 sentence pairs 
counting words
Counted words:
eng 2978
fra 4597
end


### Encoder
- In this case we're doing word-level tokenization.

- The encoder consists of:
    - Embedding layer: turns each input token into a dense vector that captures its meaning.
    - GRU layer (an RNN): for each token, it takes the embedded input vector and the previous hidden state, and outputs an output vector and a new hidden state.
    - In RNN fashion, the hidden state is fed into the next step, so the GRU layer ultimately outputs N output vectors,
    where N is the number of tokens, plus one final hidden state.


Encoder():
    input_size **size of the input vocabulary**
    hidden_size **dimension of the hidden states (also used as the embedding dimension here)**
    dropout rate **probability that a neuron outputs 0 instead of its regular output during training**
    **a dropout rate of 0.1 means each neuron is zeroed independently with probability 0.1, so on average 1 in 10 neurons outputs 0 regardless of its regular output**
    EMBEDDING LAYER ( nn.Embedding(input_size, hidden_size))
    
A neuron is just an element of a vector. For example, the embedding layer outputs a vector for each word (e.g. 300 x 1), and each element of that vector is a neuron.
- When we apply dropout to the embeddings, we randomly set some elements of each embedding vector to 0 during training, to help prevent overfitting in the network.

Dropout example:
# create an embedding layer for 1000 words with an embedding dimension of 5
embedding = nn.Embedding(1000, 5)
dropout = nn.Dropout(p=0.4)

# input: the indices of two words
input_indices = torch.tensor([42, 123])
# we have 2 words and the embedding has 5 dimensions, so we get a 2 x 5 tensor,
# where each 1 x 5 row is the embedding of one word
emb_output = embedding(input_indices)

emb_output might look like this (illustrative values):
[
    [0.123, 0.23, -0.23, 0.567, -0.9],
    [0.243, 0.555, 0.666, -0.999, 0.77]
]

**each of these numbers is a neuron! When we apply dropout, each neuron is zeroed with probability 0.4, so on average 40% of them turn to 0. Since each word is 5 neurons, we are not dropping the word, but rather dropping some features of the word. In training mode, PyTorch also scales the surviving values by 1 / (1 - p) = 1 / 0.6 so that the expected value stays the same.**

dropout(emb_output) might look like this:
[
    [0.000, 0.383, 0.000, 0.945, -1.500],
    [0.405, 0.000, 1.110, -1.665, 0.000]
]
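
The dropout behaviour can be checked directly. This is a minimal sketch that reuses the sizes from the example (1000 words, dimension 5, p = 0.4); the seed is arbitrary and only there to make the run repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

embedding = nn.Embedding(1000, 5)
dropout = nn.Dropout(p=0.4)
input_indices = torch.tensor([42, 123])

emb_output = embedding(input_indices)  # shape (2, 5)

dropout.train()  # training mode: dropout is active
dropped = dropout(emb_output)
kept = dropped != 0
# Surviving entries are scaled by 1 / (1 - p) = 1 / 0.6
scaled_ok = torch.allclose(dropped[kept], emb_output[kept] / 0.6)

dropout.eval()  # evaluation mode: dropout does nothing
unchanged = torch.equal(dropout(emb_output), emb_output)

print(emb_output.shape, scaled_ok, unchanged)
```

Which entries are zeroed changes from call to call, so only the scaling and the train/eval difference are checked here.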

**Embedding layer**
- First, an embedding is a mapping: it maps integer indices to dense vectors.

- When creating an embedding layer, you specify 2 main parameters:
1) number of embeddings
2) embedding dimension

- The embedding layer creates a weight matrix of shape (num_embeddings, embedding_dim).
- The EMBEDDING layer is just one big matrix. This is how it acts as a dictionary mapping:
word 0: vector 0
word 1: vector 1
word 2: vector 2

- So it is a matrix of shape (num_words, vector_dim).
- To get the vector representing the nth word, you choose the nth row of the matrix.
- These vectors are called **dense vectors**. They are initialized randomly when the nn.Embedding layer is created.
- DURING BACKPROP, these weights are also updated! So during the RNN's learning process this layer is updated to minimize the loss, like every other layer.
- That's how the embedding layer comes to capture the meaning of a word: it learns a useful vector for each word from training.

**KEY TAKEAWAY**: the embedding layer is a mapping from word index to dense vector. The dense vectors change during backprop to minimize the loss.
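
The "one big matrix" view can be verified in a few lines. This is a minimal sketch with toy sizes (6 words, dimension 3) chosen only for illustration.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=6, embedding_dim=3)  # toy sizes
print(embedding.weight.shape)  # torch.Size([6, 3])

indices = torch.tensor([2, 5])
vectors = embedding(indices)  # shape (2, 3)

# Looking up index n returns row n of the weight matrix.
same_rows = torch.equal(vectors, embedding.weight[indices])

# The weight matrix is a trainable parameter, so backprop updates it.
trainable = embedding.weight.requires_grad

print(same_rows, trainable)
```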

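**GRU layer**
- RNN architecture, used to **capture** dependencies between the tokens of a sequence. It does this by carrying a hidden state from one time step to the next.

- creation: 
    self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
- The first argument is the size of each input vector (here the embedding dimension) and the second is the size of the hidden state; in this model both are hidden_size. batch_first=True means inputs and outputs have shape (batch, sequence, features).
- The GRU processes the input sequence one element at a time. It outputs a sequence of hidden states (one for each input time step) and the final hidden state.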
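
To make the two return values concrete, here is a minimal sketch with toy sizes (3 sentences, 5 tokens, hidden size 8) and random numbers standing in for the embedded input.

```python
import torch
import torch.nn as nn

hidden_size = 8  # toy size
gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

# 3 sentences, 5 tokens each, every token already embedded into 8 numbers
embedded = torch.randn(3, 5, hidden_size)
output, hidden = gru(embedded)

print(output.shape)  # torch.Size([3, 5, 8]): one hidden state per time step
print(hidden.shape)  # torch.Size([1, 3, 8]): final hidden state per sentence

# For a single-layer, one-directional GRU, the last output step
# is the same as the final hidden state.
last_matches = torch.allclose(output[:, -1, :], hidden[0])
print(last_matches)
```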

**Forward pass**
def forward(self, input):
    embedded = self.dropout(self.embedding(input))
    output, hidden = self.gru(embedded)
    return output, hidden

- IMPORTANT: input is NOT one token, it is the entire input sequence. So for an input sentence, we don't pass one token at a time; the forward pass handles the entire input sentence.

- **what is input**

- input is a matrix: 
    (batch_size x sequence_length)
    - batch_size is the number of sentences in the input
    - sequence_length is the number of tokens in each sentence

    - so how does this get the embedding vector for each word in the sentence?
    - input:
s0    w0 w1 w2 w3 w4
s1    w0 w1 w2 w3 w4
s2    w0 w1 w2 w3 w4
s3    w0 w1 w2 w3 w4

    - so basically the input is a bunch of sentences (rows), and the words of each sentence (columns)
    - each word is actually just the INDEX of that word in the embedding layer
    - so for each index, we can pull the dense vector that corresponds to that word from the embedding layer!
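
As a rough illustration of the input matrix, here is a minimal sketch. The toy vocabulary and sentences are made up (they stand in for Lang.word2index and real pairs), and the embedding size of 4 is arbitrary.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary standing in for Lang.word2index.
word2index = {"SOS": 0, "EOS": 1, "i": 2, "am": 3, "cold": 4,
              "he": 5, "is": 6, "tall": 7}

sentences = ["i am cold", "he is tall"]  # same length, so no padding is needed
batch = torch.tensor(
    [[word2index[w] for w in s.split(' ')] for s in sentences]
)
print(batch)        # one row of word indices per sentence
print(batch.shape)  # torch.Size([2, 3]) -> (batch_size, sequence_length)

embedding = nn.Embedding(len(word2index), 4)  # 4 is an arbitrary embedding size
embedded = embedding(batch)
print(embedded.shape)  # torch.Size([2, 3, 4]) -> one dense vector per word
```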

**Key takeaways**
A SINGLE step of the GRU's internal processing:
(input) --> embedding_layer --> (embedded vector) + prev_hidden --> gru --> output, hidden

The forward pass is the entire process: it runs one of these steps for every token in the sequence.



In [9]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_p=0.1):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input):
        embedded = self.dropout(self.embedding(input))
        output, hidden = self.gru(embedded)
        return output, hidden 

### Decoder
- The decoder is just another RNN.
Input: the encoder's last hidden state (the context vector). In this simple decoder, the encoder's output vectors are only used to read off the batch size.
Output: sequence of words that represents the translation

- It's an RNN, so at every step it takes:
input: input token, hidden state
**The initial input token is the <sos> token and the initial hidden state is the CONTEXT vector, the encoder's last hidden state**

(input) --> embedding --> relu(embedding) + prev_hidden --> gru --> output, hidden

output --> out (linear layer) --> log_softmax --> translated token!
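
The last arrow in the flow above (scores to translated token) can be sketched in isolation. The scores below are made up; in the real model they would come from the decoder's linear layer.

```python
import torch
import torch.nn.functional as F

# Made-up scores over a 4-word vocabulary for one decoding step:
# shape (batch=1, step=1, vocab=4)
logits = torch.tensor([[[0.5, 2.0, -1.0, 0.1]]])

log_probs = F.log_softmax(logits, dim=-1)  # log-probabilities over the vocabulary
_, topi = log_probs.topk(1)                # index of the most likely word
next_token = topi.squeeze(-1)              # shape (1, 1)

print(next_token)                    # tensor([[1]])
print(log_probs.exp().sum().item())  # probabilities sum to roughly 1.0
```

Picking the top index at every step is greedy decoding: the chosen token becomes the next decoder input when no target is supplied.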

In [10]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        # one embedding vector of size hidden_size per word in the output vocabulary
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_token)
        decoder_hidden = encoder_hidden
        decoder_outputs = []
    
        for i in range(MAX_LENGTH):
            decoder_output, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
            decoder_outputs.append(decoder_output)

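            if target_tensor is not None:
                # teacher forcing: feed the target token as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1)
            else:
                # no target: feed the decoder's own most likely token as the next input
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()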
        
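        # join the per-step outputs along the sequence dimension: (batch, MAX_LENGTH, output_size)
        decoder_outputs = torch.cat(decoder_outputs, dim=1)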
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        return decoder_outputs, decoder_hidden, None 

    # one forward step
    # remember: one input --> embedding --> relu + prev_hidden --> GRU --> linear layer --> (scores, hidden)
    # log_softmax is applied afterwards in forward(), over all steps at once

    def forward_step(self, input, hidden):
        output = self.embedding(input)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = self.out(output)
        return output, hidden 

### Attention decoder
- The problem with the current setup is that we only pass the encoder's last hidden state to the decoder.
- This single vector has to encode the entire input sequence. Attention allows the decoder network to "focus" on a different part of the encoder's outputs for every step of the decoder's own outputs.
- First we calculate a set of attention weights. We then multiply them by the encoder output vectors to create a weighted combination. The result contains information about that specific part of the input sequence, and thus helps the decoder choose the right output words.

- Calculating the attention weights is done with another feed-forward layer, attn, using the decoder's input and hidden state as inputs. Because there are sentences of all sizes in the training data, to actually create and train this layer we have to choose a maximum sequence length that it can apply to. Sentences of the maximum length will use all the attention weights, while shorter sentences will only use the first few.

- We'll implement this later.
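
In the meantime, here is a rough sketch of just the weighted-combination step described above, not a full attention decoder. All sizes are toy values, the tensors are random stand-ins, and the way the decoder input and hidden state are concatenated is an assumption about how the attn layer would be wired.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, max_length, batch_size = 8, 10, 2  # toy sizes

# Hypothetical attn layer: produces one score per encoder position
attn = nn.Linear(hidden_size * 2, max_length)

embedded = torch.randn(batch_size, 1, hidden_size)  # embedded decoder input
hidden = torch.randn(1, batch_size, hidden_size)    # decoder hidden state
encoder_outputs = torch.randn(batch_size, max_length, hidden_size)

# Concatenate decoder input and hidden state: (batch, 1, 2 * hidden)
attn_input = torch.cat((embedded, hidden.permute(1, 0, 2)), dim=-1)

# Turn the scores into weights that sum to 1: (batch, 1, max_length)
attn_weights = F.softmax(attn(attn_input), dim=-1)

# Weighted combination of the encoder output vectors: (batch, 1, hidden)
context = torch.bmm(attn_weights, encoder_outputs)

print(attn_weights.shape, context.shape)
```

Because the attn layer has a fixed output size of max_length, the encoder outputs must be laid out in a tensor of that length, which is the constraint the last bullet describes.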