Seq2Seq models (Sequence-to-Sequence)
Sequence-to-sequence models are a class of deep learning models that consist of an encoder and a decoder. They are used for problems that map an arbitrarily long sequence to another arbitrarily long sequence. For example, in machine translation, you convert a sequence of words in a source language to a sequence of words in a target language. Here we will see how to use a seq2seq model for a machine translation task: translating English to German.

The dataset comes from http://www.manythings.org/anki/ (the German-English sentence pairs).

## Importing all the necessary libraries and setting their random seed values

In [None]:
import random
import tensorflow as tf
import numpy as np
import time
import json

def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")
 
# Fixing the random seed
random_seed=4321
fix_random_seed(random_seed)

# Asking TF to allocate GPU memory dynamically (as our GPU memory usage grows)
%env TF_FORCE_GPU_ALLOW_GROWTH = true

# Printing the tensorflow version (helps in troubleshooting, reproducing the work and so on.)
print("TensorFlow version: {}".format(tf.__version__))

env: TF_FORCE_GPU_ALLOW_GROWTH=true
TensorFlow version: 2.9.2


In [None]:
# Mount Google Drive
from google.colab import drive 
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Section 11.1

import os
import requests
import zipfile

# Folder on Google Drive that holds the dataset
data_dir = '/content/drive/My Drive/Colab Notebooks/sequence_to_sequence_machine_transaltion/data'
zip_path = os.path.join(data_dir, 'deu-eng.zip')

# Make sure the zip file has been downloaded
if not os.path.exists(zip_path):
    raise FileNotFoundError(
        "deu-eng.zip not found. Download it manually from http://www.manythings.org/anki/deu-eng.zip and place it in {}".format(data_dir)
    )

# Extract the archive only if deu.txt is not already there
if not os.path.exists(os.path.join(data_dir, 'deu.txt')):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(data_dir)
else:
    print("The extracted data already exists")

The extracted data already exists


## Reading the data

The data is in a single `.txt` file. It is a parallel corpus, meaning each English sentence, phrase, or paragraph appears alongside its German translation. In the file, the source text and the translation are separated by a tab (i.e. it is a tab-separated file), followed by a third column holding attribution information.
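To make the file layout concrete, here is a minimal sketch that parses two lines in the same tab-separated layout using only the standard library. The sample lines and the attribution placeholders are made up for illustration; they are not taken from the dataset.

```python
import csv
import io

# Two made-up lines in the same layout as deu.txt:
# English <TAB> German <TAB> attribution (placeholder text, not real data)
sample = "Go.\tGeh.\tattribution-1\nHi.\tHallo!\tattribution-2\n"

# QUOTE_NONE keeps quote characters inside sentences as literal text
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
pairs = [(en, de) for en, de, _attribution in reader]

print(pairs)  # [('Go.', 'Geh.'), ('Hi.', 'Hallo!')]
```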

In [None]:
# Section 11.1

import pandas as pd

# Read the tab-separated file
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/sequence_to_sequence_machine_transaltion/data/deu.txt', delimiter='\t', encoding='utf-8', encoding_errors="strict", header=None)
# Set column names
df.columns = ["EN", "DE", "Attribution"]
df = df[["EN", "DE"]]
print('df.shape = {}'.format(df.shape))

df.shape = (255817, 2)


In [None]:
# Some German sentences contain characters whose UTF-8 encoding starts with the byte 0xc2
# (e.g. the non-breaking space, \xc2\xa0). These can cause errors such as
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 3: unexpected end of data
# when using the TextVectorization layer, so we drop the affected rows
clean_inds = [i for i in range(len(df)) if b"\xc2" not in df.iloc[i]["DE"].encode("utf-8")]

df = df.iloc[clean_inds]
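For context, `\xc2\xa0` is the UTF-8 encoding of the non-breaking space (U+00A0), and `0xc2` is the lead byte of every character in the range U+0080 to U+00BF. The short check below uses strings chosen here for illustration, and `has_c2_byte` is a hypothetical helper that mirrors the filter condition. It suggests the filter keeps German umlauts and `ß` (whose encodings start with `0xc3`) while dropping text that contains a non-breaking space.

```python
def has_c2_byte(text):
    """Return True if the UTF-8 encoding of text contains the byte 0xc2."""
    return b"\xc2" in text.encode("utf-8")

# Illustrative strings, not taken from the dataset
examples = {
    "umlaut": "Grüß Gott!",
    "nbsp": "Hallo\u00a0Welt",
}

print("\u00a0".encode("utf-8"))  # b'\xc2\xa0'
print("ü".encode("utf-8"))       # b'\xc3\xbc'
print({name: has_c2_byte(text) for name, text in examples.items()})
# {'umlaut': False, 'nbsp': True}
```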

In [None]:
df.head()

| | EN | DE |
|---|---|---|
| 0 | Go. | Geh. |
| 1 | Hi. | Hallo! |
| 2 | Hi. | Grüß Gott! |
| 3 | Run! | Lauf! |
| 4 | Run. | Lauf! |


In [None]:
df.tail()

| | EN | DE |
|---|---|---|
| 255808 | Even if some sentences by non-native speakers ... | Auch wenn Sätze von Nichtmuttersprachlern mitu... |
| 255809 | Remember that the purpose of the Tatoeba Proje... | Es gilt zu bedenken, dass es das Anliegen des ... |
| 255810 | When I was younger, I hated going to weddings.... | Als ich jünger war, hasste ich es, auf Hochzei... |
| 255811 | If someone who doesn't know your background sa... | Wenn jemand, der deine Herkunft nicht kennt, s... |
| 255812 | If someone who doesn't know your background sa... | Wenn jemand Fremdes dir sagt, dass du dich wie... |


In [None]:
len(df)

255167

## Use a smaller sample for computational speed

The original dataset has more than 250,000 samples. To keep computation fast, we will use a random subset of 50,000.

In [None]:
n_samples = 50000
df = df.sample(n=n_samples, random_state=random_seed)

## Introducing the `SOS` and `EOS` tokens (Decoder)

We will add these special tokens to the translated targets. `sos` indicates the start of the sentence and `eos` marks the end of the sentence. 

E.g. `Grüß Gott!` becomes `sos Grüß Gott! eos`

In [None]:
start_token = 'sos'
end_token = 'eos'

df["DE"] = start_token + ' ' + df["DE"] + ' ' + end_token
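One common reason for adding these markers is teacher forcing, where the decoder reads the target shifted by one position. Whether this notebook trains the decoder that way is an assumption here, not something this section states. The sketch below shows the idea on a single sentence; `wrap_target`, `decoder_inputs`, and `decoder_labels` are hypothetical names used only for illustration.

```python
start_token = "sos"
end_token = "eos"

def wrap_target(sentence):
    """Add the start and end markers around a target sentence."""
    return f"{start_token} {sentence} {end_token}"

tokens = wrap_target("Grüß Gott!").split()

# Hypothetical teacher-forcing split: the decoder reads all tokens but the last
# and is trained to predict the same sequence shifted left by one position
decoder_inputs = tokens[:-1]
decoder_labels = tokens[1:]

print(decoder_inputs)  # ['sos', 'Grüß', 'Gott!']
print(decoder_labels)  # ['Grüß', 'Gott!', 'eos']
```

With this layout, `sos` gives the decoder a first input before any real word has been produced, and `eos` gives it a label to predict when the sentence is complete.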

## Splitting training/validation/testing data

We will create three datasets by sampling randomly without replacement:

* Test dataset - 5000 samples
* Validation dataset - 5000 samples
* Training dataset - 40000 samples

In [None]:
# Randomly sample 5000 examples (10%) from the total 50000 as the test set
test_df = df.sample(n=int(n_samples/10), random_state=random_seed)
# Randomly sample another 5000 examples from the remaining 45000 as the validation set
valid_df = df.loc[~df.index.isin(test_df.index)].sample(n=int(n_samples/10), random_state=random_seed)
# Assign the rest to training data
train_df = df.loc[~(df.index.isin(test_df.index) | df.index.isin(valid_df.index))]

print('test_df.shape = {}'.format(test_df.shape))
print('valid_df.shape = {}'.format(valid_df.shape))
print('train_df.shape = {}'.format(train_df.shape))

test_df.shape = (5000, 2)
valid_df.shape = (5000, 2)
train_df.shape = (40000, 2)
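The key property of this split is that the three sets do not overlap and together cover every sample. The sketch below reproduces the same idea on a toy list of 50 row indices with the standard library's `random` module. It is an illustration of the sampling logic under that simplification, not the pandas code used above.

```python
import random

rng = random.Random(4321)  # local generator, so the global seed is untouched
indices = list(range(50))  # toy stand-in for the 50000 row indices
n_holdout = len(indices) // 10

# Sample the test set first, then the validation set from what remains
test_idx = set(rng.sample(indices, n_holdout))
remaining = [i for i in indices if i not in test_idx]
valid_idx = set(rng.sample(remaining, n_holdout))
train_idx = set(remaining) - valid_idx

# No overlap between any two splits, and nothing is lost
assert not (test_idx & valid_idx or test_idx & train_idx or valid_idx & train_idx)
assert test_idx | valid_idx | train_idx == set(indices)
print(len(test_idx), len(valid_idx), len(train_idx))  # 5 5 40
```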


## Analysing the vocabulary sizes (English and German)

Calculate the vocabulary size. We will only consider the words that appear at least 10 times in the corpus.

In [None]:
# Section 11.1

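from collections import Counter
from itertools import chain

# Create a flattened list of English words (chain avoids the quadratic cost of summing lists)
en_words = list(chain.from_iterable(train_df["EN"].str.split()))
# Create a flattened list of German words
de_words = list(chain.from_iterable(train_df["DE"].str.split()))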

# Get the vocabulary size of words appearing more than or equal to 10 times
n=10

# Code listing 11.1
def get_vocabulary_size_greater_than(words, n, verbose=True):
    
    """ Get the vocabulary size above a certain threshold """
    
    # Generate a counter object i.e. dict word -> frequency
    counter = Counter(words)
    
    # Create a pandas series from the counter, then sort most frequent to least
    freq_df = pd.Series(list(counter.values()), index=list(counter.keys())).sort_values(ascending=False)
    
    if verbose:
        # Print most common words
        print(freq_df.head(n=10))

    # Count of words >= n frequent    
    n_vocab = (freq_df>=n).sum()
    
    if verbose:
        print("\nVocabulary size (>={} frequent): {}".format(n, n_vocab))
        
    return n_vocab

print("English corpus")
print('='*50)
en_vocab = get_vocabulary_size_greater_than(en_words, n)

print("\nGerman corpus")
print('='*50)
de_vocab = get_vocabulary_size_greater_than(de_words, n)

English corpus
Tom    9498
to     8488
I      8243
the    6920
you    6092
a      5800
is     4318
in     2583
of     2544
was    2279
dtype: int64

Vocabulary size (>=10 frequent): 2225

German corpus
sos      40000
eos      40000
Tom       9960
Ich       7782
ist       4773
nicht     4546
zu        3528
Sie       3374
du        3141
das       2941
dtype: int64

Vocabulary size (>=10 frequent): 2482
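A related question is how much of the corpus a thresholded vocabulary still covers. The sketch below computes this on a made-up word list. `coverage_at_threshold` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
from collections import Counter

def coverage_at_threshold(words, n):
    """Return (vocabulary size, fraction of word occurrences covered)
    when keeping only words that appear at least n times."""
    counter = Counter(words)
    kept = {w: c for w, c in counter.items() if c >= n}
    return len(kept), sum(kept.values()) / len(words)

# Toy corpus for illustration only
words = "tom is here tom is not here today".split()
vocab_size, coverage = coverage_at_threshold(words, n=2)
print(vocab_size, coverage)  # 3 0.75
```

Applied to `en_words` or `de_words` with `n=10`, the same function would report what share of word occurrences falls inside the reduced vocabulary; the remainder would be treated as out-of-vocabulary.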


## Analysing the sequence length (English and German)

Here we compute the sequence length of the sequences in the English and German corpora. To ignore the outliers, we only consider data between the 1% and 99% quantiles.