# Sequence to Sequence Models

We will now look at another major architecture called Seq2Seq, which takes a sequence as input and outputs another sequence. Where can we use this?

We generally use this for machine translation. Given a sequence of words in one language, we find its translation in another language. We will attempt this for translating English to Hindi. We will also look at some new metrics to gauge the "correctness" of our model.

To read more about Seq2Seq : https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/


We will start by importing all necessary libraries and defining directory paths...

In [1]:
# For file handling...
import pandas as pd
import os,string
import numpy as np
from collections import Counter
from functools import partial
from pathlib import Path
import itertools
from nltk import wordpunct_tokenize


#For dataset creation...
from torch.utils.data import Dataset,DataLoader,random_split


#For model building...
import torch
import torch.nn as nn
import torch.nn.functional as F


# For model training...
import torch.optim as optim
from tqdm import tqdm,tqdm_notebook

In [2]:
DATA_PATH = os.path.join(os.getcwd(),"data","english_to_hindi.txt")
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Data reading

This section will be geared towards building functions for:
1. Reading the text file
2. Dealing with missing values if any
3. Tokenizing the text data

In [3]:
def readFile(path,chkNa=True):
    """
        Load data from a text file. The file must have Lang1(delimiter)Lang2 in each row.
        Eg : "Hello<TAB>Hallo" or "Hello<TAB>Ola" {here, the delimiter is a tab}
        Returns None if the file could not be read.
    """
    
    try:
        df = pd.read_csv(path,header=None,sep="\t",names=["EN","HI"])
        if chkNa:
            # Report the number of missing values per column
            print(df.isna().sum())
        return df
    except FileNotFoundError:
        print(f"{path} does not exist")
    except OSError:
        print(f"{path} could not be read as a text file.")

#checking to make sure...
df = readFile(DATA_PATH)
df.head()

EN    0
HI    0
dtype: int64


Unnamed: 0,EN,HI
0,Help!,बचाओ!
1,Jump.,उछलो.
2,Jump.,कूदो.
3,Jump.,छलांग.
4,Hello!,नमस्ते।


## Background on the "HI" part seen above

The above dataframe holds Unicode strings, read from a UTF-8 encoded file. The strings in the "HI" column are all written in the Devanagari script. Unicode is a large character map which encompasses (or "supports", for the layman :) ) many scripts like Cyrillic (Russian, Ukrainian, etc.), as well as the accents in French and the umlauts in German (the a, o, u with a "snakebite" on top). Dealing with these strings is fairly easy if you know how they are formed. A good place to understand this is:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

http://www.utf-8.com/

To help wrap our heads around this: all strings are made of characters, and every character is represented in memory as a stream of bits. As more languages had to be supported, the number of bits available per character was increased. To look up the integer representation (the code point) of a character, or vice versa, we use the following functions:

ord(char) -> int

chr(int) -> char

Note that the two functions are inverses of each other, i.e. ord(chr(someInt)) == someInt

https://stackoverflow.com/questions/38454521/how-to-print-character-using-its-unicode-value-in-python
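To make the ord/chr relationship concrete, here is a minimal sketch in plain Python. The helper names `is_devanagari` and `script_ratio` are hypothetical and not part of the notebook; the block bounds 0x0900 and 0x097F are the limits of the Devanagari Unicode block (2304 and 2431 in decimal).

```python
# Bounds of the Devanagari Unicode block.
DEVANAGARI_START = 0x0900  # 2304
DEVANAGARI_END = 0x097F    # 2431


def is_devanagari(char):
    """Return True if the single character lies in the Devanagari block."""
    return DEVANAGARI_START <= ord(char) <= DEVANAGARI_END


def script_ratio(text):
    """Fraction of non-space characters in text that are Devanagari."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_devanagari(c) for c in chars) / len(chars)


# Build a Devanagari letter from its code point, then go back again.
letter = chr(0x0928)
print(ord(letter))               # 2344
print(ord(chr(2304)) == 2304)    # True: ord and chr undo each other
print(is_devanagari("A"))        # False
```

A check like `script_ratio` could be used to spot rows where the two columns were swapped, although the notebook itself does not do this.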

In [4]:
strin = df["HI"][0]

# The first code point of the Devanagari Unicode block...
print(ord("ऀ"))

# The last code point of the Devanagari Unicode block...
print(ord("ॿ"))

#The range between these two covers all the Devanagari characters...




2304
2431


In [5]:
pad = " <PAD> "
sentBegin = "<BEG> "
sentEnd = " <END>"
unk = "<UNK>"

def clean(txt):
    unwanted = "~|\\/_।.?,*@#$%^&(){}[]=+\"-'"
    for char in unwanted:
        txt = txt.replace(char,' ')
    return txt

def tokenize(txt):
    txt = clean(txt) 
    tokens = txt.split()
    return tokens

for i in range(2008,2010):
    #print(df["HI"][i])
    print(tokenize(df["HI"][i]))
    print(tokenize(df["EN"][i]))

['मुझे', 'टिकटें', 'कहाँ', 'से', 'लेनीं', 'होंगीं']
['Where', 'should', 'I', 'pick', 'the', 'tickets', 'up']
['तुम', 'आज', 'सुबह', 'यहाँ', 'क्यों', 'आए']
['Why', 'did', 'you', 'come', 'here', 'this', 'morning']


In [6]:
for col in df.columns:
    print(col)

EN
HI


## Creating Dataset and Dataloaders

This section will deal with generating our torch dataset and dataloaders. Our dataset class will be:
1. Taking a txt file path as input
2. Reading the txt file
3. Tokenizing the text data
4. Creating vocabulary
5. Creating charMaps and reverse charMaps
6. Converting the tokens to indices

In [7]:
class EngHinData(Dataset):    
    def __init__(self,path,maxVocabSize=500):
        """
            Read a text file from path and generate the input and target sequences
            Also generate english and hindi vocabulary with a max size.
            The most commonly occurring words are chosen.
        """
        self.maxVocabSize = maxVocabSize
        
        df = readFile(path,chkNa=False)
        self.df = self.tokenizeDf(df)
        
        #Generate a vocabulary for both languages...
        enVocab = self.mostFreqTokens(self.df.ENTokenized.tolist())
        hiVocab = self.mostFreqTokens(self.df.HITokenized.tolist())
        
        #Replace rare tokens with "<UNK>"
        self.replaceRareTokens(self.df)
        #Drop rows whose Hindi target is empty after tokenizing...
        self.findZeroTargets()
        #Remove all datarows with >20% unknowns...
        self.df = self.removeHighUnk(self.df)
        
        # Create char maps and reverse char maps
        self.enEncoder,self.enDecoder = self.generateMaps(enVocab,rev=True)
        self.hiEncoder,self.hiDecoder = self.generateMaps(hiVocab,rev=True)
        
        # Add <BEG> and <END> to all tokens...
        self.appendExtras(self.df)
        
        # change tokens to indices...
        self.token2idx(self.df)
        
        # Drop all columns except num...
        self.df.drop(["level_0","index","EN","HI","ENTokenized","HITokenized"],axis=1,inplace=True)
        self.df.reset_index(inplace=True)
        
        
    
    def __getitem__(self,i):
        return self.df.ENNum[i],self.df.HINum[i]
    
    def __len__(self):
        return self.df.shape[0]
    
    
    def token2idx(self,df):
        df["ENNum"] = df.ENTokenized.apply(lambda tokenList: [self.enEncoder[token] for token in tokenList])
        df["HINum"] = df.HITokenized.apply(lambda tokenList: [self.hiEncoder[token] for token in tokenList])
    
    
    def appender(self,tokenList):
        tokenList.insert(0,"<BEG>")
        tokenList.append("<END>")
        return tokenList
    
        
    def appendExtras(self,df):
        """
            Adds <BEG> and <END> at the start and end of each tokenList
        """
        
        
        df.ENTokenized.apply(self.appender)
        df.HITokenized.apply(self.appender)
        
        
    
    def generateMaps(self,vocab,rev=False):
        """
            Generates a dictionary {char : idx}
            If rev is set to True, a reverse map will also be generated {idx : char}
        """
        extras = ["<PAD>","<BEG>","<END>","<UNK>"]    
        charMap = {char : idx for idx,char in enumerate(vocab)}
        for extra in extras:
            charMap[extra] = len(charMap)
        
        if not rev:
            return charMap
        else:
            revCharMap = {idx : char for char,idx in charMap.items()}
            return charMap,revCharMap 
        
    
    def tokenizeDf(self,df):
        df["ENTokenized"] = df.EN.apply(tokenize)
        df["HITokenized"] = df.HI.apply(tokenize)
        return df
    
    def replaceRareTokens(self,df):
        commonInputs = self.mostFreqTokens(df.ENTokenized.tolist())
        commonTargets = self.mostFreqTokens(df.HITokenized.tolist())
        
        df.loc[:, 'ENTokenized'] = df.ENTokenized.apply(
            lambda tokens: [token if token in commonInputs 
                            else "<UNK>" for token in tokens]
        )
        df.loc[:, 'HITokenized'] = df.HITokenized.apply(
            lambda tokens: [token if token in commonTargets
                            else "<UNK>" for token in tokens]
        )
        
    
    def mostFreqTokens(self,sequence):
        allTokens = [word for sent in sequence for word in sent]
        common_tokens = set(list(zip(*Counter(allTokens).most_common(self.maxVocabSize - 4)))[0])
        return common_tokens
    
    def removeHighUnk(self, df, threshold=0.8):
        """Remove sequences with mostly <UNK>."""
        calculate_ratio = (
            lambda tokens: sum(1 for token in tokens if token != '<UNK>')/ len(tokens) > threshold
        )
        
        df = df[df.ENTokenized.apply(calculate_ratio)]
        df = df[df.HITokenized.apply(calculate_ratio)]
        df.reset_index(inplace=True)
        return df
    
        
    def findZeroTargets(self):
        badVals = []
        for i,val in enumerate(self.df.HITokenized.values):
            if len(val)==0:
                badVals.append(i)
        
        print(f"Found {len(badVals)} bad values...Imputing them...")
        self.df.drop(badVals,axis=0,inplace=True)
        self.df.reset_index(inplace=True)

In [8]:
ds = EngHinData(DATA_PATH,3000)

Found 2 bad values...Imputing them...


## What happened?

The class defined above does the following:
1. It reads the text file at the specified path.
2. It creates a pandas dataframe from it.
3. It tokenizes both the English and the Hindi sentences.
4. It finds the most frequently occurring tokens.
5. It replaces every other token with UNK, since those tokens won't be in our charMap.
6. It drops the rows whose Hindi target is empty after tokenization.
7. It removes the rows which have 20% or more UNKnowns (by default, can be changed).
8. It creates a charMap (or vocabulary) for both Hindi and English.
9. It creates a reverse charMap which helps to decode back (for testing purposes).
10. It adds BEG and END to every token list and converts the tokens to indices.
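The vocabulary and UNK steps can be sketched on a toy corpus in plain Python. This is only an illustration of the idea, not the class itself: the function names, the three-sentence corpus and the 0.6 threshold are all made up for the example.

```python
from collections import Counter

UNK = "<UNK>"


def build_vocab(token_lists, max_size):
    """Keep the max_size most frequent tokens across all sentences."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, _ in counts.most_common(max_size)}


def replace_rare(tokens, vocab):
    """Swap every token that is not in the vocabulary for <UNK>."""
    return [tok if tok in vocab else UNK for tok in tokens]


def known_ratio(tokens):
    """Share of tokens in the sentence that are not <UNK>."""
    return sum(tok != UNK for tok in tokens) / len(tokens)


# Toy corpus (hypothetical): "i" occurs 3 times, "like" twice, the rest once.
corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["i", "run"]]

vocab = build_vocab(corpus, max_size=2)
cleaned = [replace_rare(sentence, vocab) for sentence in corpus]

# Keep only the sentences whose known share is above the threshold.
kept = [sentence for sentence in cleaned if known_ratio(sentence) > 0.6]

print(sorted(vocab))   # ['i', 'like']
print(cleaned[2])      # ['i', '<UNK>']
print(len(kept))       # 2
```

The third sentence is dropped because only half of its tokens survive, which mirrors what `removeHighUnk` does with its own threshold.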

In [9]:
ds.df.head(10)

Unnamed: 0,index,ENNum,HINum
0,0,"[2997, 2344, 2322, 1548, 2998]","[2997, 656, 1568, 300, 2998]"
1,1,"[2997, 782, 1927, 2998]","[2997, 1831, 876, 1942, 2998]"
2,2,"[2997, 1326, 2998]","[2997, 1258, 303, 2825, 2998]"
3,3,"[2997, 2384, 2306, 2998]","[2997, 1470, 1045, 2998]"
4,4,"[2997, 2384, 2306, 2998]","[2997, 1470, 2525, 2998]"
5,5,"[2997, 2344, 2529, 2998]","[2997, 656, 1994, 1391, 2998]"
6,6,"[2997, 2344, 2529, 2998]","[2997, 656, 1994, 669, 2998]"
7,7,"[2997, 2344, 2322, 1561, 2998]","[2997, 1837, 2826, 1911, 1391, 2825, 2998]"
8,8,"[2997, 125, 489, 2998]","[2997, 2485, 556, 2199, 2998]"
9,9,"[2997, 2863, 489, 2998]","[2997, 201, 275, 2998]"


In [10]:
ds.df.shape

(15063, 3)

In [11]:
train_size = int(0.99 * len(ds))
test_size = len(ds) - train_size
train_ds, test_ds = torch.utils.data.random_split(ds, [train_size, test_size])

## Collation function and dataloaders



In [12]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence, pad_sequence

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def collate(batch):
    
    inputs = [torch.LongTensor(item[0]) for item in batch]
    targets = [torch.LongTensor(item[1]) for item in batch]
    
    
    # Pad sequences so that they are all the same length (within one minibatch)
    padded_inputs = pad_sequence(inputs, padding_value=ds.enEncoder["<PAD>"], batch_first=True)
    padded_targets = pad_sequence(targets, padding_value=ds.hiEncoder["<PAD>"], batch_first=True)
    
    
    # Sort by decreasing input length, as pack_padded_sequence expects by default
    lengths = torch.LongTensor([len(x) for x in inputs])
    lengths, permutation = lengths.sort(dim=0, descending=True)

    return padded_inputs[permutation].to(device), padded_targets[permutation].to(device), lengths.to(device)


batchSize = 1024
train_loader = DataLoader(train_ds, batch_size=batchSize, collate_fn=collate)
test_loader = DataLoader(test_ds, batch_size=1, collate_fn=collate)
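What the collation step does to one minibatch can be sketched in plain Python lists. The helper names and the pad index 0 are assumptions for the example; in the notebook the pad index comes from the encoder maps and the padding is done by `pad_sequence`.

```python
def sort_by_length(inputs, targets):
    """Order the pairs by decreasing input length, keeping pairs aligned."""
    order = sorted(range(len(inputs)), key=lambda i: len(inputs[i]), reverse=True)
    lengths = [len(inputs[i]) for i in order]
    return [inputs[i] for i in order], [targets[i] for i in order], lengths


def pad_batch(sequences, pad_idx):
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]


PAD = 0  # hypothetical pad index, for this sketch only

inputs = [[5, 6], [7, 8, 9, 10], [11]]
targets = [[1], [2, 3], [4, 5, 6]]

inputs, targets, lengths = sort_by_length(inputs, targets)
padded_inputs = pad_batch(inputs, PAD)
padded_targets = pad_batch(targets, PAD)

print(lengths)         # [4, 2, 1]
print(padded_inputs)   # [[7, 8, 9, 10], [5, 6, 0, 0], [11, 0, 0, 0]]
print(padded_targets)  # [[2, 3, 0], [1, 0, 0], [4, 5, 6]]
```

Note that the targets are reordered with the inputs but padded to their own maximum length, so the two padded batches can have different widths.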

## Creating the model

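The model has two major parts: the encoder, which reads the English sequence, and the decoder, which generates the Hindi sequence one token at a time while attending over the encoder outputs.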


In [13]:
class Encoder(nn.Module):
    
    def __init__(self,vocabSize,embDims,hiddenSize,batchSize):
        
        super(Encoder,self).__init__()
        
        # Copy in all the required sizes...
        self.batchSize = batchSize
        self.hiddenSize= hiddenSize
        self.vocabSize = vocabSize
        self.embDims = embDims
        
        
        # Encoder architecture...
        self.embedding = nn.Embedding(vocabSize,embDims)
        self.gru = nn.GRU(self.embDims,self.hiddenSize,batch_first=True)
    
    
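    def forward(self,inputs,lengths):
        self.batchSize = inputs.size(0)
        
        x = self.embedding(inputs)
        
        # pack_padded_sequence expects the lengths on the CPU
        x = pack_padded_sequence(x, lengths.cpu(), batch_first=True)
        output,self.hidden = self.gru(x,self.initWghts())
        
        # No batch_first here, so output is (seqLen, batch, hiddenSize);
        # the decoder permutes it back to batch-first
        output, _ = pad_packed_sequence(output)
        
        return output,self.hidden
    
    
    def initWghts(self):
        # Random initial hidden state of shape (numLayers, batch, hiddenSize)
        wghts = torch.empty(1,self.batchSize,self.hiddenSize)
        return nn.init.kaiming_normal_(wghts).to(device)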
    
    

In [14]:
class Decoder(nn.Module):
    def __init__(self,vocabSize,embDims,encoderSize,decoderSize,batchSize):
        
        super(Decoder,self).__init__()
        
        self.batchSize = batchSize
        self.vocabSize = vocabSize
        self.encoderSize = encoderSize
        self.decoderSize = decoderSize
        self.embDims = embDims
        
        self.embedding = nn.Embedding(self.vocabSize,self.embDims)
        self.gru = nn.GRU(self.embDims+self.encoderSize,
                          self.decoderSize,
                         batch_first=True)
        
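        # The GRU output has decoderSize features
        self.fc = nn.Linear(self.decoderSize,self.vocabSize)
        
        # Additive attention layers: W1 acts on the encoder outputs,
        # W2 on the decoder hidden state, V turns each score vector into a scalar
        self.W1 = nn.Linear(self.encoderSize,self.decoderSize)
        self.W2 = nn.Linear(self.decoderSize,self.decoderSize)
        self.V = nn.Linear(self.decoderSize,1)
        
    def forward(self,targets,hidden,encoderOutput):
        self.batchSize = targets.size(0)
        
        # (seqLen, batch, encoderSize) -> (batch, seqLen, encoderSize)
        encoderOutput = encoderOutput.permute(1,0,2)
        # (1, batch, decoderSize) -> (batch, 1, decoderSize)
        hiddenTimeAxis = hidden.permute(1,0,2)
        
        score = torch.tanh(self.W1(encoderOutput)+self.W2(hiddenTimeAxis))
        
        # One weight per encoder time step: (batch, seqLen, 1)
        attention = torch.softmax(self.V(score),dim=1)
        
        # Weighted sum of the encoder outputs: (batch, encoderSize)
        context = attention * encoderOutput
        context = torch.sum(context,dim=1)
        
        x = self.embedding(targets)
        x = torch.cat((context.unsqueeze(1),x),-1)
        
        # Continue from the hidden state passed in, not from a fresh random one
        output,state = self.gru(x,hidden)
        x = self.fc(output)
        
        return x,state,attention
    
    def initWghts(self):
        wghts = torch.empty(1,self.batchSize,self.decoderSize)
        return nn.init.kaiming_normal_(wghts).to(device)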
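The attention step inside the decoder's forward pass can be followed by hand on tiny vectors. Below is a minimal sketch of additive attention in plain Python, with the bias terms of the linear layers left out for brevity; the matrices `W1`, `W2`, the vector `v` and all the numbers are made up for illustration and are not the trained weights.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(encoder_states, decoder_state, W1, W2, v):
    """score_t = v . tanh(W1 h_t + W2 s); weights = softmax(scores);
    context = sum over t of weights_t * h_t."""
    projected_state = matvec(W2, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(W1, h)
        combined = [math.tanh(a + b) for a, b in zip(projected_h, projected_state)]
        scores.append(sum(vi * ci for vi, ci in zip(v, combined)))
    weights = softmax(scores)
    size = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(size)]
    return weights, context


# Toy values (hypothetical): three encoder time steps, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

weights, context = additive_attention(encoder_states, decoder_state, W1, W2, v)
print(weights)   # three positive numbers that sum to 1
print(context)   # a vector with the same size as one encoder state
```

The context vector is what gets concatenated with the embedded input token before it is fed to the decoder GRU.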

In [15]:
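# reduction='none' keeps one loss value per token so that padding can be masked out
criterion = nn.CrossEntropyLoss(reduction='none')

def loss_func(actual,predicted):
    
    # <PAD> is not index 0 in our maps, so look its index up explicitly
    mask = actual.ne(ds.hiEncoder["<PAD>"]).float()
    #print(predicted.size(),actual.size())
    loss = criterion(predicted.squeeze(1),actual) * mask
    
    # Average over the real (non-padding) tokens only
    return loss.sum() / mask.sum().clamp(min=1)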
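The idea of a masked loss is that padded positions should not contribute to the average. A minimal sketch in plain Python, assuming toy probabilities and a hypothetical pad index of 0 (the function name is also made up for the example):

```python
import math


def masked_cross_entropy(probabilities, targets, pad_idx):
    """Mean negative log-likelihood over positions whose target is not pad_idx."""
    losses = [-math.log(probs[target])
              for probs, target in zip(probabilities, targets)
              if target != pad_idx]
    return sum(losses) / len(losses) if losses else 0.0


# One predicted distribution per batch element (each row sums to 1).
probabilities = [
    [0.1, 0.7, 0.2],
    [0.1, 0.1, 0.8],
    [0.9, 0.05, 0.05],
]
targets = [1, 2, 0]  # the last target is padding

masked = masked_cross_entropy(probabilities, targets, pad_idx=0)
unmasked = masked_cross_entropy(probabilities, targets, pad_idx=-1)

print(masked)    # averages only the first two positions
print(unmasked)  # also counts the padded position
```

Without the mask, the model is rewarded for predicting the padding token, which pulls the reported loss down without improving the translations.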

In [16]:
class EngToHinModel(nn.Module):
    def __init__(self,inputVocabSize,targetVocabSize,
                 hiddenSize,embDims,batchSize,
                 targetsStart,targetsEnd):
        
        super(EngToHinModel,self).__init__()
        
        self.batchSize = batchSize
        self.targetsStart = targetsStart
        self.targetsEnd = targetsEnd
        
        
        self.encoder = Encoder(inputVocabSize,embDims,
                               hiddenSize,batchSize).to(device)
        
        self.decoder = Decoder(targetVocabSize,embDims,
                               hiddenSize,hiddenSize,batchSize).to(device)
    
    def predict(self,inputs,lengths):
        """
            Sample a translation for a single input sequence (batch size 1).
        """
        self.batchSize= inputs.size(0)
        
        output = []
        with torch.no_grad():
            encoderOutput,encoderHidden = self.encoder(inputs.to(device),lengths)
            decoderHidden = encoderHidden
            
            decoderInput = torch.LongTensor([[self.targetsStart]] * self.batchSize).to(device)
            
            for _ in range(20):
                pred,decoderHidden,_ = self.decoder(decoderInput,
                                                 decoderHidden,
                                                 encoderOutput)
                
                # pred is (batch, 1, vocab): sample over the vocabulary axis
                probs = F.softmax(pred.squeeze(1),dim=-1)
                prediction = torch.multinomial(probs,1)
                decoderInput = prediction
                
                prediction = prediction.item()
                output.append(prediction)
                
                if prediction == self.targetsEnd:
                    break
        
        # Return what we have even if <END> was never produced
        return output
        
    def forward(self,inputs,targets,lengths):
        self.batchSize = inputs.size(0)

        encOut, encHidden = self.encoder(inputs.to(device),lengths)

        decHidden = encHidden

        decIn = torch.LongTensor([[self.targetsStart]] * self.batchSize).to(device)


        #teacher forcing...
        loss=0
        for ts in range(1,targets.size(1)):
            preds,decHidden,_ = self.decoder(decIn,
                                           decHidden,
                                           encOut)
            # Feed the ground-truth token as the next decoder input
            decIn = targets[:,ts].unsqueeze(1).to(device)

            loss += loss_func(targets[:,ts].to(device),preds)

        return loss/targets.size(1)
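Teacher forcing means that at every step the decoder is fed the true previous token rather than its own prediction. A minimal sketch of how inputs and expected outputs line up, assuming hypothetical token indices (the notebook's real BEG and END indices come from the encoder maps):

```python
def teacher_forcing_pairs(target_ids):
    """Pair each decoder input with the token it should predict next."""
    return list(zip(target_ids[:-1], target_ids[1:]))


BEG, END = 1, 2  # hypothetical indices for this sketch
sequence = [BEG, 40, 41, 42, END]

pairs = teacher_forcing_pairs(sequence)
for decoder_input, expected_output in pairs:
    print(decoder_input, "->", expected_output)
```

The first input is always the BEG token and the last expected output is END, which is why the loop in the forward pass starts at time step 1.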
            

In [17]:
myModel = EngToHinModel(inputVocabSize=len(ds.enEncoder),
                        targetVocabSize=len(ds.hiEncoder),
                        hiddenSize=128,
                        embDims=100,
                        batchSize=batchSize,
                        targetsStart=ds.hiEncoder["<BEG>"],
                        targetsEnd=ds.hiEncoder["<END>"] 
                        ).to(device)

In [21]:
optimizer = optim.Adam([p for p in myModel.parameters() if p.requires_grad], lr=0.001)

# Training loop
myModel.train()
for epoch in range(20):
    total_loss = total = 0
    progress_bar = tqdm_notebook(train_loader, desc='Training')
    
    for inputs, targets, lengths in progress_bar:
        # Clean old gradients
        optimizer.zero_grad()

        # Forwards pass
        loss = myModel(inputs, targets, lengths)

        # Backward pass: compute the gradients
        loss.backward()

        # Take a step in the right direction
        optimizer.step()

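        # Record metrics (the loss is already averaged over time steps inside forward,
        # so dividing by the summed target lengths below scales it down further)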
        total_loss += loss.item()
        total += targets.size(1)

    train_loss = total_loss / total
    
    tqdm.write(f'epoch #{epoch + 1:3d}\ttrain_loss: {train_loss:.2e}\n')

epoch #  1	train_loss: 4.76e-02
epoch #  2	train_loss: 4.65e-02
epoch #  3	train_loss: 4.57e-02
epoch #  4	train_loss: 4.49e-02
epoch #  5	train_loss: 4.41e-02
epoch #  6	train_loss: 4.33e-02
epoch #  7	train_loss: 4.25e-02
epoch #  8	train_loss: 4.17e-02
epoch #  9	train_loss: 4.08e-02
epoch # 10	train_loss: 4.00e-02
epoch # 11	train_loss: 3.92e-02
epoch # 12	train_loss: 3.83e-02
epoch # 13	train_loss: 3.75e-02
epoch # 14	train_loss: 3.66e-02
epoch # 15	train_loss: 3.58e-02
epoch # 16	train_loss: 3.49e-02
epoch # 17	train_loss: 3.41e-02
epoch # 18	train_loss: 3.34e-02
epoch # 19	train_loss: 3.25e-02
epoch # 20	train_loss: 3.17e-02

In [19]:
print(total_loss)

16.34958690404892


In [22]:
torch.save(myModel.state_dict(), os.path.join(os.getcwd(),"models","Seq2SeqForEngHin-attempt1.pt"))

In [24]:
myModel.load_state_dict(torch.load(os.path.join(os.getcwd(),"models","Seq2SeqForEngHin-attempt1.pt")))
myModel.eval()

EngToHinModel(
  (encoder): Encoder(
    (embedding): Embedding(3000, 100)
    (gru): GRU(100, 128, batch_first=True)
  )
  (decoder): Decoder(
    (embedding): Embedding(3000, 100)
    (gru): GRU(228, 128, batch_first=True)
    (fc): Linear(in_features=128, out_features=3000, bias=True)
    (W1): Linear(in_features=128, out_features=128, bias=True)
    (W2): Linear(in_features=128, out_features=128, bias=True)
    (V): Linear(in_features=128, out_features=1, bias=True)
  )
)