# Sequence to Sequence Models

We will now look at another major architecture called Seq2Seq which basically takes sequences as input and outputs another sequence. Where can we use this?

We generally use this for machine translation. Given a set of words in a language, we find what will be the its translation in another language. We will be attempting to do this for translating English to Hindi. We will also look at some new metrics to gauge the "correctness" of our model.

To read more about Seq2Seq : https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/


We will start by importing all necessary libraries and defining directory paths...

In [1]:
# For file handling...
import pandas as pd
import os,string
import numpy as np
from collections import Counter
from functools import partial
from pathlib import Path
import itertools
from nltk import wordpunct_tokenize


#For dataset creation...
from torch.utils.data import Dataset,DataLoader,random_split


#For model building...
import torch
import torch.nn as nn
import torch.nn.functional as F


# For model training...
import torch.optim as optim
from tqdm import tqdm,tqdm_notebook

  return f(*args, **kwds)
  return f(*args, **kwds)


In [2]:
DATA_PATH = os.path.join(os.getcwd(),"data","english_to_hindi.txt")
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Data reading

This section will be geared towards building functions for:
1. Reading the text file
2. Dealing with missing values if any
3. Tokenizing the text data

In [14]:
def readFile(path,chkNa=True):
    """
        Load data from a text file. The file must have Lang1(delimiter)Lang2 in each row.
        Eg : "Hello Hallo" or "Hello Ola" {here, delimiter was space} 
    """
    
    try:
        df = pd.read_csv(path,header=None,sep="\t",names=["EN","HI"])
        if chkNa:
            print(df.isna().sum())
        return df
    except FileNotFoundError:
        print(f"{path} does not specify a text file.")    
    except OSError:
        print(f"{path} does not exist")

#checking to make sure...
df = readFile(DATA_PATH)
df.head()

EN    0
HI    0
dtype: int64


Unnamed: 0,EN,HI
0,Help!,बचाओ!
1,Jump.,उछलो.
2,Jump.,कूदो.
3,Jump.,छलांग.
4,Hello!,नमस्ते।


## Background on the "HI" part seen above

The above dataframe holds strings from UTF-8 character encoding. The strings in the "HI" column are all formed from the devnagiri script. Unicode is a larger character map which encompasses(or "supports", for the layman :) ) many scripts like Cyrillic ( Russian, Ukranian, etc.) and those accents in french as well as the Umlauts in German(the a,e,i,o,u with a "snakebite" on top). Dealing with these strings is fairly easy if you know how they are formed. A good place to understand this is:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

http://www.utf-8.com/

To help you wrap our heads around, all the strings are made of characters. However, All the characters are represented in memory as streams of bits. As we wanted to incorporate more languages, we increased the number of bits. To look at what is the integer representation of a character or vice-versa, we use following functions:

ord(char) -> int

chr(int) -> char

Note that the two functions are inverses of each other i.e, ord(chr(someInt) = someInt

https://stackoverflow.com/questions/38454521/how-to-print-character-using-its-unicode-value-in-python

In [32]:
strin = df["HI"][0]

# The first letter in Devnagiri script...
print(ord("ऀ"))

# The last letter in Devnagiri script...
print(ord("ॿ"))

#The difference between these two should cover all the letters...




2304
2431


In [174]:
pad = " <PAD> "
sentBegin = "<BEG> "
sentEnd = " <END>"
unk = "<UNK>"

def clean(txt):
    unwanted = "~|\\/_।.?,*@#$%^&(){}[]=+\"-'"
    for char in unwanted:
        txt = txt.replace(char,' ')
    return txt

def tokenize(txt):
    txt = clean(txt) 
    tokens = txt.split()
    return tokens

for i in range(2008,2010):
    #print(df["HI"][i])
    print(tokenize(df["HI"][i]))
    print(tokenize(df["EN"][i]))

['मुझे', 'टिकटें', 'कहाँ', 'से', 'लेनीं', 'होंगीं']
['Where', 'should', 'I', 'pick', 'the', 'tickets', 'up']
['तुम', 'आज', 'सुबह', 'यहाँ', 'क्यों', 'आए']
['Why', 'did', 'you', 'come', 'here', 'this', 'morning']


In [172]:
for col in df.columns:
    print(col)

EN
HI


## Creating Dataset and Dataloaders

This section will deal with generating our torch dataset and datloaders. Our dataset class will be:
1. Taking a txt file path as input
2. Reading the txt file
3. Tokenizing the text data
4. Creating vocabulary
5. Creating charMaps and reverse charMaps
6. 

In [319]:
class EngHinData(Dataset):    
    def __init__(self,path,maxVocabSize=500):
        """
            Read a text file from path and generate the input and target sequences
            Also generate english and hindi vocabulary with a max size.
            The most commonly occuring words are chosen.
        """
        self.maxVocabSize = maxVocabSize
        
        df = readFile(path,chkNa=False)
        self.df = self.tokenizeDf(df)
        
        self.replaceRareTokens(self.df)
        self.findThatFucker()
        self.df = self.removeHighUnk(self.df)
        
    
    def __getitem__(self,i):
        return self.sequences[i],self.targets[i]
    
    def __len__(self):
        return len(self.sequences)
    
    
    def tokenizeDf(self,df):
        df["ENTokenized"] = df.EN.apply(tokenize)
        df["HITokenized"] = df.HI.apply(tokenize)
        return df
    
    def replaceRareTokens(self,df):
        commonInputs = self.mostFreqTokens(df.ENTokenized.tolist())
        commonTargets = self.mostFreqTokens(df.HITokenized.tolist())
        
        df.loc[:, 'ENTokenized'] = df.ENTokenized.apply(
            lambda tokens: [token if token in commonInputs 
                            else "<UNK>" for token in tokens]
        )
        df.loc[:, 'HITokenized'] = df.HITokenized.apply(
            lambda tokens: [token if token in commonTargets
                            else "<UNK>" for token in tokens]
        )
    
    
    def mostFreqTokens(self,sequence):
        allTokens = [word for sent in sequence for word in sent]
        common_tokens = set(list(zip(*Counter(allTokens).most_common(self.maxVocabSize - 4)))[0])
        return common_tokens
    
    def removeHighUnk(self, df, threshold=0.8):
        """Remove sequences with mostly <UNK>."""
        calculate_ratio = (
            lambda tokens: sum(1 for token in tokens if token != '<UNK>')/ len(tokens) > threshold
        )
        
        df = df[df.ENTokenized.apply(calculate_ratio)]
        df = df[df.HITokenized.apply(calculate_ratio)]
        return df
    
        
    def findThatFucker(self):
        for i,val in enumerate(self.df.HITokenized.values):
            if len(val)==0:
                print(f"Found a target fucker, index: {i}")
                print(f"English: {self.df.EN[i]}")
                print(f"Hindi: {self.df.HI[i]}")
        
                   

In [320]:
train_ds = EngHinData(DATA_PATH,500)

Found a target fucker, index: 15643
English: Clear
Hindi: ...
Found a target fucker, index: 28850
English: Open a file to load
Hindi: ~


ZeroDivisionError: division by zero

In [302]:
train_ds.df.tail(10)

Unnamed: 0,EN,HI,ENTokenized,HITokenized
29405,Fade Curve,धीमा होने का वक्र,"[<UNK>, Curve]","[<UNK>, होने, का, वक्र]"
29406,Fade To Volume,इस आवाज़ तक धीमा हों,"[<UNK>, To, <UNK>]","[इस, आवाज़, तक, <UNK>, हों]"
29407,Fade Time,धीमा होने का समय,"[<UNK>, Time]","[<UNK>, होने, का, समय]"
29408,Start Fade,धीमा करना प्रारंभ करें,"[Start, <UNK>]","[<UNK>, करना, प्रारंभ, करें]"
29409,Cannot find demultiplexer plugin for the given...,दिए गए मीडिया डाटा के लिए डीमल्टीप्लेक्सर प्लग...,"[Cannot, <UNK>, <UNK>, plugin, for, the, given...","[दिए, गए, मीडिया, डाटा, के, लिए, <UNK>, प्लगइन..."
29410,Playback failed because no valid audio or vide...,प्लेबैक असफल चूंकि कोई वैध ऑडियो या वीडियो आउट...,"[<UNK>, <UNK>, <UNK>, no, <UNK>, <UNK>, or, vi...","[<UNK>, असफल, <UNK>, कोई, <UNK>, ऑडियो, या, वी..."
29411,fade curve,फेड कर्व,"[<UNK>, curve]","[<UNK>, <UNK>]"
29412,current volume,मौजूदा आवाज़,"[current, <UNK>]","[मौजूदा, आवाज़]"
29413,volume to fade to,आवाज़ जहां तक धीमा होना है,"[<UNK>, to, <UNK>, to]","[आवाज़, <UNK>, तक, <UNK>, <UNK>, है]"
29414,fade time in milliseconds,धीमा होने का समय मिलिसेकण्डों में,"[<UNK>, time, in, <UNK>]","[<UNK>, होने, का, समय, <UNK>, में]"
