<a href="https://colab.research.google.com/github/hussain0048/Projects-/blob/master/Build_and_Deploy_Data_Science_Products_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1-Understand the landscape of solutions available for machine translation**[1]


In this series we will take a use case, understand the solution landscape and its evolution, explore different architecture choices, look under the hood of the architecture to understand the nuts and bolts, build a prototype, convert the prototype into production ready code, build an application from the production ready code and finally understand the process for deploying the application.

The use case we will be dealing with is **Machine Translation**. By the end of the series you will have a working knowledge of how to build and deploy a machine translation application that translates German sentences into English. This series will comprise the following posts.

## **1.1 Introduction to Machine Translation**[1]

**Language translation** has always been a tough nut to crack. What makes it tough is the variation in structure and lexicon when one moves from one language to another. For this reason the problem of automated language translation, or machine translation, has fascinated and inspired the best minds. Over the past decade some trailblazing advances have happened within this field. We have now reached a stage where machine translation has become ubiquitous. These technologies are now embedded in all our devices (mobiles, watches, desktops, tablets etc.) and have become an integral part of our everyday life. A common example is the **Google Translate service**, which can identify our input language and subsequently translate it into a multitude of languages.

**Machine translation** technologies have passed through several different approaches before reaching the state we are in at present. Let us take a quick look at the evolution of the solution landscape of machine translation.

The journey to the current state of the art translation technologies tells a fascinating tale of the strides in machine learning.

The evolution of machine translation can be divided into three distinct phases. Let us look at each one of them and understand its distinct characteristics.

![](https://drive.google.com/uc?export=view&id=16oSEPChc0lXTMMBufQF2noWwEi2puYsA)

## **1.2 Classical Machine Translation [1]**
Classical machine translation methods rely heavily on linguistic rules and deep domain knowledge to translate from a source language to a target language. There are three approaches under this method.

**Direct Translation**

“Direct translation is based on a large bilingual dictionary; each entry in the dictionary can be viewed as a small program whose job is to translate one word.”

As the name suggests, this method adopts a word-to-word translation of the source language to the target language. After the word-to-word translation, a re-ordering of the translated words is required, based on linguistic rules formulated between the source language and the target language.
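The two steps described above (dictionary lookup, then rule-based re-ordering) can be sketched with a toy example. The three-word dictionary, the part-of-speech tags and the single adjective-noun rule below are all hypothetical and chosen only for illustration; a real system would need a far larger dictionary and many more rules.

```python
# Toy bilingual dictionary (hypothetical entries):
# English word -> (Spanish word, part of speech)
DICTIONARY = {
    "the": ("la", "DET"),
    "green": ("verde", "ADJ"),
    "witch": ("bruja", "NOUN"),
}

def direct_translate(sentence):
    # Step 1: word-to-word lookup in the bilingual dictionary
    translated = [DICTIONARY[word] for word in sentence.lower().split()]
    # Step 2: one hand-written re-ordering rule,
    # English "ADJ NOUN" becomes Spanish "NOUN ADJ"
    i = 0
    while i < len(translated) - 1:
        if translated[i][1] == "ADJ" and translated[i + 1][1] == "NOUN":
            translated[i], translated[i + 1] = translated[i + 1], translated[i]
            i += 2
        else:
            i += 1
    return " ".join(word for word, _ in translated)

print(direct_translate("the green witch"))  # la bruja verde
```

Even this tiny sketch shows why the approach is expensive: every new word needs a dictionary entry and every structural difference needs its own rule.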

**Transfer Method**[1]

In the example on the direct translation method, we saw how the mapping of the English words for the translated Spanish sentence had a completely different ordering from the source English sentence. Every language has such structural characteristics inherent in it. Transfer methods try to tap the structural differences between different language pairs.

Unlike the direct method, where there is word-to-word translation followed by re-ordering, transfer methods rely on codification of the contrastive knowledge, i.e. the differences between languages, for translation from the source to the target language. Similar to the direct method, this method also relies on deep domain knowledge and codification of complex rules governing language construction.


**Interlingua Method [1]**

The interlingua method takes a completely different approach from the word-to-word and contrastive translation methods we have already seen.

The interlingua method closely resembles the process by which human translators work. When translating, a human translator understands the meaning of the source sentence and translates it to the target language so that the essence of the conversation is not lost. There might not be a word-to-word mapping between the source sentence and the translated sentence. However the meaning would remain intact. This is the principle adopted in the interlingua methods. Like the other two methods in the classical approach, the interlingua method also depends on rich codification of rules and dictionaries.

The classical machine translation methods were effective for a large set of use cases. However the classical methods relied on a comprehensive set of rules and large dictionaries. Building such a knowledge base was a mammoth task requiring specialised skills and expertise. The complexity increased manyfold when designing systems able to handle translation between multiple languages. There was a need for an approach different from the domain-intensive classical techniques. This led to the rise in popularity of statistical methods in machine translation.


**Statistical Machine Translation [1]**

When we explored the classical methods we understood their over-dependence on domain knowledge in creating linguistic rules and dictionaries. However it was also a fact that no amount of domain knowledge was enough to handle the intricate nuances of languages. What if phrases, idioms and specialised usages in a language do not have any parallels in another language? In such circumstances a linguist would go for the closest match given the source language.

This idea of selecting the most probable sentence in the target language given a sentence in the source language is what is leveraged in statistical machine translation.

Statistical methods build probabilistic models that aim to find the target sentence with the highest probability of capturing the essence of the source sentence. In probability terms we can represent this as

$$\hat{T} = \arg\max_{T} P(T \mid S)$$

where T and S are the target and source sentences respectively. The above form is a posterior probability as per Bayes' theorem. Since P(S) does not depend on T, this is equivalent to

$$\hat{T} = \arg\max_{T} P(S \mid T) \, P(T)$$

The first term, P(S|T), is called the translation model and can be interpreted as the likelihood of finding the source sentence given the target sentence. The second term, P(T), is called the language model. It scores how likely the target sentence is in the target language and is usually built from the conditional probability of each word given some preceding words.
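As a rough illustration of the argmax above, the sketch below scores three candidate English sentences by multiplying a translation-model value with a language-model value. All the probabilities are made-up numbers for illustration; in a real system both models would be estimated from large corpora.

```python
# Hypothetical scores for three candidate translations T of one source sentence S.
# translation_model[T] stands in for P(S|T); language_model[T] stands in for P(T).
translation_model = {
    "i am happy": 0.30,
    "happy i am": 0.30,
    "i is happy": 0.35,
}
language_model = {
    "i am happy": 0.020,
    "happy i am": 0.002,
    "i is happy": 0.0001,
}

def best_translation(candidates):
    # argmax over T of P(S|T) * P(T)
    return max(candidates, key=lambda t: translation_model[t] * language_model[t])

print(best_translation(list(translation_model)))  # i am happy
```

Note how the candidate with the highest translation-model score ("i is happy") still loses, because the language model considers it very unlikely English.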

The statistical model aims at finding the conditional probabilities of words within a corpus and using these probabilities to find the best possible translation. Statistical machine translation models make use of large corpora of text available in the source and target languages. Even though statistical methods were effective, they also had some weaknesses. This method was predominantly focused on translating phrases, thereby compromising the broader context of the target language. It struggled when required to translate to a target language which was different in context from the source language. These shortcomings paved the way to advances in other methods which were more robust in retaining the context between the source and target languages.

**Neural Machine Translation**

Neural machine translation (NMT) is a different approach where artificial neural networks are used for machine translation. In the statistical machine translation approaches we saw that multiple components, like the translation model and the language model, are used to do the translation. In NMT models the entire translation is handled by a single integrated model. In terms of approach there is no drastic deviation from the statistical approaches. However NMT uses vector representations of words and sentences, which helps in retaining the context of the source and target sentences.

There are different approaches for machine translation using artificial neural networks. One of the earlier approaches was to use a multi layer perceptron or a fully connected network for machine translation. However these models weren't effective for long sequences.

Many shortfalls of the earlier approaches were addressed by the adoption of Recurrent Neural Network models (RNNs) for machine translation. RNNs are a class of neural networks suited for sequence data. Languages, as you know, are sequences of words with interdependencies between the words within the sequence. RNNs are capable of handling such interdependencies, which makes this class of models more suited for machine translation. There are different variations of sequence models which are used for machine translation, like encoder-decoder, encoder-decoder with attention etc. We will be using encoder-decoder models for building our application; they will be dealt with in greater depth in the next post.

The state of the art models for machine translation currently are the Transformer models. Transformer models take the concept of attention and build on it.

# **2- II : Build and Deploy Data Science Products : Exploring Sequence to Sequence architecture for Machine Translation.**

We already know that the problem of machine translation entails deciphering a sequence of words in a source language to predict a sequence of words in the target language. For example, look at the following input German sequence

Ich freue mich darauf, etwas über maschinelle Übersetzung zu lernen.

which can be translated to

I look forward to learning about machine translation

From these sequences we can observe the following.

- The length of the input sequence and the length of the target sequence are different
- There is no one to one mapping between words from the input language to the target language
- There is dependence on the context, which needs to be learned from the input language to get the best translation for the target language.

Inherent complexities like these made models like the multi layer perceptron ineffective for machine translation. The need of the hour was a model architecture capable of looking across sequences of words and understanding the context of the source language to effectively translate to the target language. This is where Recurrent Neural Networks (RNNs) became popular for solving machine translation problems. Let us now take a deeper look at RNNs.

## **2.1 Recurrent Neural Networks ( RNNs)**

RNN models, which fall under the category of sequence to sequence models, are designed to learn the context of any input language. But why is learning the context important? Let us understand this with a simple example.

Suppose we are predicting the next character in a sequence for the string “Happy B….”. We need to predict the next character after the letter ‘B’. For the time being let us assume that we are ignoring the word “Happy” falling before the letter B. In such a scenario the best bet would be to look at all the words which start with “B” and choose the word which is most frequent. Let us say the most frequent word starting with “B” is the word “Baby”. So the next character which will be predicted would be the letter “a”. Now let us imagine that we started looking at all the characters which precede B. Given the information about the preceding characters “H”, “A”, “P”, “P”, “Y” and “B”, the probability of predicting ‘i’ would be the highest, since the word “Birthday” is the most likely word given the context “Happy B”. This is where the concept of context becomes very significant. Language translation depends a lot on the context and therefore there was a need to adopt an architecture where context was learned. Sequence to sequence models like RNNs became an obvious choice.
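The “Happy B” argument can be reproduced with a simple counting sketch. The tiny corpus and its counts below are hypothetical, picked so that “baby” is the most frequent word starting with “b” while “birthday” is the most frequent continuation of “happy b”. This is plain frequency counting, not an RNN; it only illustrates why context changes the prediction.

```python
from collections import Counter

# Hypothetical toy corpus of phrases; the counts are made up for illustration
corpus = ["happy birthday"] * 3 + ["baby"] * 5 + ["big baby"] * 2 + ["happy bunny"]

def next_char(prefix, context_free):
    counts = Counter()
    for phrase in corpus:
        if context_free:
            # Ignore everything before the last letter:
            # count the second letter of every word starting with that letter
            for word in phrase.split():
                if word.startswith(prefix[-1]) and len(word) > 1:
                    counts[word[1]] += 1
        elif phrase.startswith(prefix) and len(phrase) > len(prefix):
            # Use the full preceding context
            counts[phrase[len(prefix)]] += 1
    return counts.most_common(1)[0][0]

print(next_char("happy b", context_free=True))   # a
print(next_char("happy b", context_free=False))  # i
```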

![](
https://drive.google.com/uc?export=view&id=1iArA19WCN9R0IAG_tUssJcinl-O-qhCA)

The dynamics of an RNN can be represented as above. The circular nodes represent each time step in the sequence. Each of the time steps receives an input, represented by the arrow pointing upwards. In this context each letter in the string becomes the input at each time step. With each character input, the output or the prediction is represented at the top. So given the letter ‘H’ the prediction is the letter ‘A’. Once the letter ‘A’ is predicted it becomes the next input, and we need to predict the next letter given the context that we had the letter ‘H’ at the previous time step. At each time step we can also see that there is an arrow which points to the right. This is the information or context each time step passes on to the subsequent time step, enabling it to predict contextually.

Unlike vanilla neural networks, where each layer has its own set of parameters, RNNs share the same parameters across all the time steps. Because the parameters are shared across all time steps, the implementation of back propagation is a little different for RNNs. The type of back propagation implemented in an RNN is called back propagation through time (BPTT). We will be covering the dynamics of BPTT with a toy example in the fourth blog of this series.
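Parameter sharing can be made concrete with a minimal NumPy sketch of an RNN forward pass. The names `W_x`, `W_h` and `b`, the tanh activation and the layer sizes are assumptions made for illustration; the point is only that the loop reuses one set of weights at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3   # arbitrary sizes for illustration

# One set of parameters, reused at every time step
W_x = rng.normal(size=(hidden_size, input_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_forward(inputs):
    h = np.zeros(hidden_size)    # initial context
    states = []
    for x_t in inputs:           # the same W_x, W_h and b at each step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

sequence = rng.normal(size=(5, input_size))   # 5 time steps
print(rnn_forward(sequence).shape)            # (5, 4)
```

The hidden state `h` plays the role of the arrow pointing to the right in the figure: it carries the context from one time step to the next.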

Earlier we saw that the RNN keeps the context of the previous time steps in memory and applies it when predicting for the time step in consideration. However in practice vanilla RNNs fail when they encounter long sequences. The gradients blow up or shrink to very small values in such cases. These scenarios are called exploding gradients and vanishing gradients respectively. So in practice an RNN can only leverage a few time steps to extract the context. To overcome these shortcomings, different variations of sequence to sequence models are used. One such variation is the Long Short Term Memory (LSTM) network. We will be using the LSTM network in our application for machine translation. Let us first look at what an LSTM looks like.
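A heavily simplified way to see why long sequences cause trouble: during back propagation through time, the gradient reaching an early time step involves a product of one term per time step. The sketch below replaces those terms with a single scalar factor, which is an assumption for illustration only (real gradients involve matrices and activation derivatives).

```python
def repeated_factor(factor, steps):
    # Product of the same per-step factor over many time steps
    value = 1.0
    for _ in range(steps):
        value *= factor
    return value

print(repeated_factor(0.5, 50))   # shrinks towards zero (vanishing)
print(repeated_factor(1.5, 50))   # grows very large (exploding)
```

With a factor slightly below 1 the signal from early time steps is effectively lost after a few dozen steps, and with a factor slightly above 1 it becomes numerically unstable.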

## **2.3 Long Short Term Memory Network ( LSTM)**

LSTMs, like vanilla RNNs, have recurrent connections, which means that the context from the previous time steps is passed on to the current time step when generating an output. However we discussed in the previous section that RNNs suffer from the major problem of exploding or vanishing gradients when they encounter long sequences. This shortcoming was overcome by building a memory block into LSTMs.

![](
https://drive.google.com/uc?export=view&id=1deEqu5a6TCoCDSO6zCcDciGBG_v1-_jx)

LSTM Network

The LSTM has three information sources, two from the previous time step and one from the current time step. The first one is the cell state denoted by ‘Ct’. The cell state transmits the information about the context from the previous cell states. The second piece of information which passes from the previous time step is its output, denoted by ‘ht’. The third is the input for the present time step. In our context of predicting characters, the input at time step t1 is the letter ‘H’. All these inputs get processed within the LSTM layer, enabling it to have memory for longer sequences. We will work through a detailed example on the dynamics of the LSTM in the next post.

An important part of building applications using sequence to sequence models is the selection of the right architecture for the use case. Let us now look at different architecture choices for different use cases.

**Network Architecture for Sequence to Sequence Models**

There are different architecture choices for sequence to sequence models which vary according to the use case. Some of the prominent ones are

**Many to one architecture**

This architecture is ideal for use cases like sentiment analysis where, after seeing a sequence of words in a string, the model predicts a single output, which in this case is the sentiment.

![](
https://drive.google.com/uc?export=view&id=1u7e1npU4Ful9bjg-VzkiJ9I5Drwslyo2)

**One to many architecture**

This architecture is well suited for use cases like image captioning. In such use cases, an image is provided as the input and a sequence of words describing the image is predicted as output. In this case there is one input and multiple outputs.
 ![](
https://drive.google.com/uc?export=view&id=1VmtQwQ0I2J-km2ot015g7nsaMBBpQZmj)

 

**Many to many architecture**

This is the architecture which is ideal for a use case like machine translation. In this architecture, a sequence of words is given as input and the output is also a sequence of words. The below figure is a representation of German to English translation using the many to many architecture. This architecture is also called the Encoder-Decoder architecture. We will see the encoder-decoder architecture in greater depth during our prototype building phase.

![](https://drive.google.com/uc?export=view&id=1tFDWnhLP9lEc8R-rD9XQTM1f9liJajZ_)

# **3-III : Build and Deploy Data Science Products : Looking under the hood of Machine translation model – LSTM Forward Propagation**

## **3.1 Forward pass of the LSTM**

Let us learn the dynamics of the forward pass of the LSTM with a simple network. Our network has two time steps as represented in the below figure. The first time step is represented as 't-1' and the subsequent one as time step 't'.


Let us try to understand each of the terms in the above network. An LSTM unit at time step 't-1' receives the following as its input

- $c^{\langle t-2 \rangle}$ : The cell state of the previous time step
- $a^{\langle t-2 \rangle}$ : The output from the previous time step
- $x^{\langle t-1 \rangle}$ : The input of the present time step

The cell state is the unit which is responsible for transmitting the context across different time steps. At each time step certain add and forget operations happen to the context transmitted from the previous time steps. These operations are controlled through multiple gates. Let us understand each of the gates.

## **3.2 Forget Gate**

The forget gate determines what part of the information has to be retained in the cell state and what needs to be forgotten. The forget gate operation can be represented as follows

$$\Gamma_f = \mathrm{sigmoid}(W_f \, x_t + U_f \, a_{t-1} + b_f)$$

There are two weight parameters ($W_f$ and $U_f$) which transform the input ($x_t$) and the output from the previous time step ($a_{t-1}$). This equation can be simplified by concatenating both the weight parameters and the corresponding $x_t$ and $a_{t-1}$ vectors into the form given below.

$$\Gamma_f = \mathrm{sigmoid}(W_f \, [x_t , a_{t-1}] + b_f)$$

$\Gamma_f$ is the forget gate

$W_f$ is the new weight matrix obtained by concatenating $[W_f , U_f]$

$[x_t , a_{t-1}]$ is the concatenation of the current time step input and the previous time step output

$b_f$ is the bias term.

The purpose of the sigmoid function is to squash the values within the bracket so that they act as a gate with values between 0 and 1. These gates are used to control the flow of information. A value of 0 means no information can flow and 1 means all information passes through. We will see more of those steps in a short while.
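The equivalence of the two forms of the forget gate can be checked numerically. The sketch below uses random weights and arbitrary sizes (`n_x`, `n_a`), which are assumptions for illustration; `W` and `U` play the roles of the two original weight matrices and `W_f` is their column-wise concatenation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_a = 3, 4                      # arbitrary input and hidden sizes

W = rng.normal(size=(n_a, n_x))      # transforms the input x_t
U = rng.normal(size=(n_a, n_a))      # transforms the previous output a_{t-1}
b_f = np.zeros(n_a)
x_t = rng.normal(size=n_x)
a_prev = rng.normal(size=n_a)

# Form 1: separate weight matrices
gate_separate = sigmoid(W @ x_t + U @ a_prev + b_f)

# Form 2: concatenated weights [W, U] applied to the concatenated vector [x_t, a_{t-1}]
W_f = np.concatenate([W, U], axis=1)
gate_concat = sigmoid(W_f @ np.concatenate([x_t, a_prev]) + b_f)

print(np.allclose(gate_separate, gate_concat))  # True
```

Every entry of the gate lies strictly between 0 and 1, which is what lets it scale how much of each component of the cell state is kept.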

# **V : Build and deploy data science products: Machine translation application-Develop the prototype**




## 5.1 **Building the prototype**

**Downloading the raw text**

Let us first grab the raw data for this application. The data can be downloaded from the link below.

[dataset](http://www.manythings.org/anki/deu-eng.zip)

This is also available in the github repository. The raw text consists of English sentences paired with the corresponding German sentences. Once the text file is downloaded, let us upload it to our Google Drive. If you do not want to build the prototype in Colab, you can download the file to your local drive and use a Jupyter notebook instead.



**Preprocessing the text**

Before starting the processes, let us import all the packages we will be using for the process



In [1]:
import string
import re
from numpy import array, argmax, random, take
from numpy.random import shuffle
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, RepeatVector
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import load_model
from tensorflow.keras import optimizers
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_colwidth', 200)
from pickle import dump
from unicodedata import normalize

The raw text which we have downloaded needs to be opened and progressively preprocessed through a series of processing steps to ultimately get the train and test sets which we require for building our models. Let us first define the path for the text, so as to read it from Google Drive. You will have to change this path to the location where you stored the data.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Define the path to the raw data set (change this to where you stored deu.txt)
fileurl = '/content/drive/My Drive/Datasets/Build and Deployed/deu.txt'

Once the path is defined, let us read the text data.



In [4]:
# open the file and read all text (the with block closes the file afterwards)
with open(fileurl, mode='rt', encoding='utf-8') as file:
    text = file.read()

The text which is read from the text file would be in the format shown below



In [5]:
text[0:200]

'Go.\tGeh.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)\nHi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\nHi.\tGrüß Gott!\tCC-BY 2.0'

Output of first 200 characters of text

From the output we can see that each record is separated by a new line (\n) and within each record the data we want is separated by tabs (\t). So we can first split the text into records on new lines (\n) and then split each line on the tabs (\t) to get the data in the format we want.



In [6]:
# Split the text into individual lines
lines = text.strip().split('\n')
# Splitting each line based on tab spaces and creating a list
lines = [line.split('\t') for line in lines]
# Visualizing first 5 lines
lines[0:5]

[['Go.',
  'Geh.',
  'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)'],
 ['Hi.',
  'Hallo!',
  'CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)'],
 ['Hi.',
  'Grüß Gott!',
  'CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #659813 (Esperantostern)'],
 ['Run!',
  'Lauf!',
  'CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)'],
 ['Run.',
  'Lauf!',
  'CC-BY 2.0 (France) Attribution: tatoeba.org #4008918 (JSakuragi) & #941078 (Fingerhut)']]

We can see that the processed records are stored as lists, with each list containing an English sentence, its German translation and some metadata about the record. Let us store these lists as an array for convenience and then display the shape of the array.

In [7]:
# Storing the lines into an array
mtData = array(lines)
# Displaying the shape of the array
print(mtData.shape)

(227080, 3)


Shape of array

All the above steps can be represented as a function. Let us construct the function which will be used to load the data and do basic preprocessing of the data.

In [8]:
# function to read raw text file
def read_text(filename):
    # open the file and read all text (the with block closes the file)
    with open(filename, mode='rt', encoding='utf-8') as file:
        text = file.read()

    # Split the text into individual lines
    lines = text.strip().split('\n')
    # Splitting each line based on tab spaces and creating a list
    lines = [line.split('\t') for line in lines]

    return array(lines)

We can call the function to load the data and convert it into an array of English and German sentences. We can also see that the raw data has more than 200,000 rows and three columns. We don't require the third column and therefore we can drop it. In addition, processing all rows would be computationally expensive. Let us take the first 50000 rows. However, how many rows you use is left to you, based on the capacity of your machine.

In [9]:
# Reading the data using the function
mtData = read_text(fileurl)
# Taking only 50000 rows of data
mtData = mtData[:50000,:2]
print(mtData.shape)
mtData[0:10]

(50000, 2)


array([['Go.', 'Geh.'],
       ['Hi.', 'Hallo!'],
       ['Hi.', 'Grüß Gott!'],
       ['Run!', 'Lauf!'],
       ['Run.', 'Lauf!'],
       ['Wow!', 'Potzdonner!'],
       ['Wow!', 'Donnerwetter!'],
       ['Fire!', 'Feuer!'],
       ['Help!', 'Hilfe!'],
       ['Help!', 'Zu Hülf!']], dtype='<U537')

With the array format, the data is in a neat format with the first column being English and the second one the corresponding German sentence. However if you look at the text, there is a lot of punctuation and other unwanted characters. We also need to standardize the text to lower case. Let us now crank up our cleaning process. The following are the steps which we will follow

- Normalize all unicode characters, which are special characters found in a language, to their corresponding ASCII form. We will be using a library called ‘unicodedata’ for this normalization.
- Tokenize the string into individual words
- Convert all the characters to lower case
- Remove all punctuation from the text
- Remove all non-alphabetic tokens from the text




In [11]:
# Cleaning the document for all unwanted characters
 
def cleanDocs(lines):
  cleanArray = list()
  for docs in lines:
    cleanRow = list()
    for line in docs:
      # Normalising unicode characters
      line = normalize('NFD', line).encode('ascii', 'ignore')
      line = line.decode('UTF-8')
      # Tokenize on white space
      line = line.split()
      # Removing punctuations from each token
      line = [word.translate(str.maketrans('', '', string.punctuation)) for word in line]
      # convert to lower case
      line = [word.lower() for word in line]
      # Keep only tokens made up entirely of alphabetic characters
      line = [word for word in line if word.isalpha()]
      # Store as string
      cleanRow.append(' '.join(line))
    cleanArray.append(cleanRow)
  return array(cleanArray)

The input to the function is the array which we created in the earlier step. We first initialize empty lists to store the processed text (lines 4 and 6).

In lines 5-7, we loop through each row (docs) and then through each column (line) of the row. The first process is to normalize the special characters. This is done through the normalize function available in the ‘unicodedata’ package. We use a normalization form called ‘NFD’, which decomposes accented characters into a base character followed by combining marks; encoding to ASCII with the 'ignore' option then drops everything that is not ASCII (lines 9-10). The next process is to tokenize the string into individual words by applying the split() function in line 12. We then proceed to remove all unwanted punctuation using the translate() function in line 14. After this process we convert the text to lower case and then retain only the tokens which are alphabetic using the isalpha() function in lines 16-18. We join the tokens back into a single string using the join() function and then store the processed row in the ‘cleanArray’ list in lines 20-21. The final output after the whole process looks quite clean and is ready for further processing.
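To see what the normalization step does to German text, here is a small standalone sketch. The helper name `to_ascii` is hypothetical; it simply wraps the same normalize, encode and decode calls used in the cleaning function.

```python
from unicodedata import normalize

def to_ascii(text):
    # NFD splits e.g. 'ü' into 'u' plus a combining mark;
    # encoding with 'ignore' then drops every non-ASCII character
    return normalize('NFD', text).encode('ascii', 'ignore').decode('UTF-8')

print(to_ascii('Zu Hülf!'))     # Zu Hulf!
print(to_ascii('Grüß Gott!'))   # Gru Gott!
```

One side effect worth knowing: 'ß' has no decomposition into ASCII characters, so it is dropped entirely rather than mapped to 'ss'. For this prototype that loss may be acceptable, but it does change some German words.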



In [None]:
# Cleaning the sentences
cleanMtDocs = cleanDocs(mtData)
cleanMtDocs[0:10]

**Neural Translation Data Set Preparation**

Now that we have completed the initial preprocessing, it's time to get closer to the core process. Let us first prepare the data sets in the format we need for modelling. The various steps which we will follow for preparation of the data set are

- Tokenizing the text and creating vocabulary dictionaries for English and German sentences
- Define the sequence length for both English and German text
- Encode the text sequences as integer sequences
- Split the data set into train and test sets

Let us see each of these processes



**Tokenization and vocabulary creation**

Tokenization is the process of splitting the string to individual unique words or tokens. So if the string is





"Hi I am enjoying this learning and I look forward for more"


The unique tokens vocabulary would look like the following

{'i': 1, 'hi': 2, 'am': 3, 'enjoying': 4, 'this': 5, 'learning': 6, 'and': 7, 'look': 8, 'forward': 9, 'for': 10, 'more': 11}



Note that only unique words are taken and each token is given an index, which will come in handy when we encode the tokens in later steps. So let us go ahead and prepare the tokens. Please note that we will be creating separate vocabularies for English words and German words.
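The way such a vocabulary is built can be sketched in plain Python. The function `build_word_index` below is a hypothetical stand-in written for illustration, not the Keras implementation: it lower-cases the text, splits on white space, counts the words and gives the most frequent word index 1, with ties kept in first-seen order.

```python
from collections import Counter

def build_word_index(texts):
    # Count lower-cased, whitespace-separated tokens across all texts
    counts = Counter(word for text in texts for word in text.lower().split())
    # The most frequent word gets index 1; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = build_word_index(["Hi I am enjoying this learning and I look forward for more"])
print(word_index)
```

Because 'i' occurs twice in the example sentence it ends up with index 1, and the remaining words follow in order of appearance, which matches the vocabulary shown above.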

In [15]:
# Instantiating the tokenizer class
tokenizer = Tokenizer()

Tokenization is done by the Tokenizer() class, which can be imported from tensorflow.keras as shown in the import cell earlier. The first step is to instantiate the Tokenizer() class. Next we will see how to fit text to the tokenizer object we created.

In [None]:
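# Fit the tokenizer on a list of texts (here, the example sentence from above)
sample_text = ["Hi I am enjoying this learning and I look forward for more"]
tokenizer.fit_on_texts(sample_text)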

Fitting the text is done using the fit_on_texts() method. This method splits the strings and then creates the vocabulary we saw earlier. Since these steps have to be repeated multiple times, let us package them as a function.

In [17]:
# Function for creating tokenizers
def createTokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

Let us use the above function to create the tokenizer for English words and look at the size of the English vocabulary

In [18]:
# Create English Tokenizer
eng_tokenizer = createTokenizer(cleanMtDocs[:,0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
print(eng_vocab_size)

6186


We can see that the size of the English vocabulary is 6186. This is after we incremented the actual vocabulary size by 1, because the Keras tokenizer starts its word indices at 1 and index 0 is reserved (it is used for padding). Let us list the first 10 words of the English vocabulary.

In [None]:
# Listing the first 10 items of the English tokenizer
list(eng_tokenizer.word_index.items())[0:10]


From the output we can see how the words are assigned an index value. Similarly we will create the German vocabulary.

In [20]:
# Create German tokenizer
ger_tokenizer = createTokenizer(cleanMtDocs[:,1])
# Defining German Vocabulary
ger_vocab_size = len(ger_tokenizer.word_index) + 1

Now that we have tokenized the German and English sentences, the next task is to define a standard sequence length for these languages

**Define Sequence lengths for German and English sentences**

From our earlier introduction on sequence models, we know that we need data in sequences. A prerequisite for building sequence models is that the sequences are of a standard length. However if we look at our corpus of English and German sentences, the length of each sentence varies. We need to adopt a strategy for standardizing this length. One common strategy would be to adopt the maximum length of all the sentences as the standard sequence length. Sentences shorter than the maximum length will have their remaining positions filled with zeros. However one pitfall of this strategy is that processing will be expensive. Let us say the length of the longest sentence is 50 and most of the other sentences are of length ranging from 8 to 12. We have a situation wherein, for just one sentence, we unnecessarily increase the length of all other sentences by filling in dummy values. When data sets become large, having all sentences standardized to the longest sentence will make the computation expensive.

To get over such issues we will adopt a strategy of finding a length under which the majority of the sentences fall. This can be done by taking a high quantile value of the sentence lengths.
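The trade-off can be illustrated with made-up numbers that mirror the scenario described above: 99 sentences of 8 to 12 words and a single outlier of 50 words. The lengths are hypothetical and the cell counts only show how large the padded matrix would be under each strategy.

```python
import numpy as np

# Hypothetical sentence lengths: 99 short sentences (8 to 12 words) and one outlier
lengths = [8 + (i % 5) for i in range(99)] + [50]

max_len = max(lengths)
quantile_len = int(np.quantile(lengths, 0.975))

# Size of the padded matrix under each strategy
# (with the quantile strategy, longer sentences would be truncated)
print(max_len, len(lengths) * max_len)            # 50 5000
print(quantile_len, len(lengths) * quantile_len)  # 12 1200
```

In this toy case the quantile-based length cuts the padded matrix to roughly a quarter of its size, at the cost of truncating one sentence.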

Let us implement this strategy. To start off we will have to count the lengths of all the sentences in the corpus



In [None]:
# Length (number of words) of every English sentence
len_english = [len(line.split()) for line in cleanMtDocs[:,0]]
len_english[0:10]

We iterate through all the English sentences in the corpus, count the words in each sentence using split(), and collect the counts in the list 'len_english'. The last line displays the first 10 lengths.

Similarly, we will create the list of all German sentence lengths.

In [None]:
# Length (number of words) of every German sentence
len_German = [len(line.split()) for line in cleanMtDocs[:,1]]
len_German[0:10]

After getting the lengths of all the sentences, let us find the 97.5% quantile of the English sentence lengths, which is the length under which the large majority of the sentences fall.

In [24]:
import numpy as np


In [25]:
# Find the quantile length
engLength = np.quantile(len_english, .975)
engLength

5.0

From the quantile value we can see that a sequence length of 5 would be a good value to adopt, as the large majority of the English sentences fall within this length. Similarly, let us calculate the value for the German sentences.

In [None]:
# Find the quantile length
gerLength = np.quantile(len_German, .975)
gerLength

We will be using the sequence lengths we have calculated in the next process where we encode the word tokens as sequences of integers.

**Encode the sequences as integers**

Earlier we tokenized all the unique words and created vocabulary dictionaries. Those dictionaries map each word to an integer value. For example, let us display the first 5 tokens of the English vocabulary.





In [27]:
# First 5 tokens and its integers of English tokenizer
list(eng_tokenizer.word_index.items())[0:5]


[('tom', 1), ('i', 2), ('you', 3), ('is', 4), ('a', 5)]

We can see that each token is associated with an integer value. In our sequence model we will be using the integer values instead of the tokens themselves. This process of converting the tokens to their corresponding integer values is called encoding. The tokenizer has a method called ‘texts_to_sequences’ to convert the tokens to integer sequences.

The standard sequence length which we calculated in the previous section will be the length of each of these integer encodings. What happens if a sentence is longer than the standard length? In that case the sequence is truncated to the standard length. Note that the Keras 'pad_sequences' function by default drops values from the beginning of the sequence unless truncating='post' is passed. If a sentence is shorter than the standard length, the remaining positions are filled with zeros. This process is called padding.

The above two processes will be implemented in a function for convenience. Let us look at the code implementation.

In [28]:
# Function for encoding and padding sequences
# pad_sequences import (the import path may differ across Keras versions)
from keras.preprocessing.sequence import pad_sequences

def encode_sequences(tokenizer,length, lines):
    # Sequences as integers
    X = tokenizer.texts_to_sequences(lines)
    # Padding the sentences with 0
    X = pad_sequences(X,maxlen=length,padding='post')
    return X

The above function takes three variables

tokenizer : The language tokenizer we created earlier

length : The standard sequence length

lines : Our data, i.e. the sentences to encode

Inside the function, each line is first converted to a sequence of integers using the 'texts_to_sequences' method and then padded using the 'pad_sequences' function. The parameter value padding = 'post' means that the zeros are added after the end of the sentence until the standard length is reached.
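The effect of padding = 'post' can be sketched in plain Python. This is only an illustration of the behaviour, not the Keras implementation: the helper `pad_post` is hypothetical, it assumes a length of at least 1, and it mirrors the Keras default of truncating long sequences from the front.

```python
def pad_post(sequences, length):
    """Pad with trailing zeros; longer sequences keep only their last `length` items."""
    padded = []
    for seq in sequences:
        kept = seq[-length:]  # truncate from the front, as Keras does by default
        padded.append(kept + [0] * (length - len(kept)))
    return padded

# Hypothetical integer sequences: one shorter and one longer than the standard length
toy = [[5, 940, 33], [1, 14, 156, 2738, 185, 7, 9]]
print(pad_post(toy, 6))
```

The short sequence gains three trailing zeros, while the long one loses its first value.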

Let us now use this function to prepare the integer sequence data for both English and German sentences. We will split the data set into train and test sets first and then encode the sequences. Please remember that German sequences are our X variable and English sentences are our Y variable as we are translating from German to English.

In [None]:
# Preparing the train and test splits
from sklearn.model_selection import train_test_split
# split data into train and test set
train, test = train_test_split(cleanMtDocs, test_size=0.1, random_state = 123)
print(train.shape)
print(test.shape)

In [None]:
# Creating the X variable for both train and test sets
trainX = encode_sequences(ger_tokenizer,int(gerLength),train[:,1])
testX = encode_sequences(ger_tokenizer,int(gerLength),test[:,1])
print(trainX.shape)
print(testX.shape)

Let us display the first few rows of the training set.

In [31]:
# Displaying the first 5 rows of the training set
trainX[0:5]

array([[   5,  940,   33, 2684,    0,    0],
       [  52,  347,    2,   15,    0,    0],
       [   8,   31,   26, 4757,    0,    0],
       [   1,   12,    4, 1496,    0,    0],
       [   1,   14,  156, 2738,  185,    0]], dtype=int32)

From the output we can see the integer encoding of the sequences and the zero padding at the end of each sequence. Let us repeat the process for the English sentences.

In [None]:
# Creating the Y variable for both train and test sets
trainY = encode_sequences(eng_tokenizer,int(engLength),train[:,0])
testY = encode_sequences(eng_tokenizer,int(engLength),test[:,0])
print(trainY.shape)
print(testY.shape)
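A quick way to sanity check an encoded row is to map the integers back to words. The sketch below uses only the five English tokens displayed earlier. The helpers `encode` and `decode` are hypothetical, and the skipping of unknown words assumes a tokenizer created without an out-of-vocabulary token.

```python
# The five most frequent English tokens shown in the earlier output
word_index = {"tom": 1, "i": 2, "you": 3, "is": 4, "a": 5}
index_word = {index: word for word, index in word_index.items()}

def encode(sentence):
    """Replace each known word by its integer; unknown words are skipped."""
    return [word_index[w] for w in sentence.lower().split() if w in word_index]

def decode(sequence):
    """Turn integers back into words, ignoring the padding value 0."""
    return " ".join(index_word[i] for i in sequence if i != 0)

encoded = encode("Tom is a teacher")
print(encoded)
print(decode(encoded + [0, 0]))
```

With the full tokenizer the same reverse lookup can be built from its word_index dictionary.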

We have come to the end of the preprocessing steps. Let us now get to the heart of the process which is defining the model and then training the model with the preprocessed training data.

**Neural Translation Model Building**

In this section we will look into the building blocks of the model. We will define the model structure in a function as shown below. Let us dive into the details of the model.





In [35]:
def defineModel(src_vocab,tar_vocab,src_timesteps,tar_timesteps,n_units):
    model = Sequential()
    model.add(Embedding(src_vocab,n_units,input_length=src_timesteps,mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units,return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab,activation='softmax')))
    # Compiling the model
    model.compile(optimizer = 'adam',loss='sparse_categorical_crossentropy')
    # Summarising the model
    model.summary()
     
    return model

In the second article of this series [2] we were introduced to the encoder-decoder architecture. We will be implementing that encoder-decoder architecture within this code block. In the above code, everything up to line 5 is the encoder part and the remainder is the decoder part.

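Let us now walk through each layer in this architecture. The Embedding layer turns each integer of the German sequence into a vector of size n_units, and mask_zero=True tells the following layers to ignore the padded zeros. The first LSTM reads the embedded sequence and returns a single vector of size n_units which summarizes the source sentence. RepeatVector copies this vector once for every time step of the target sequence. The second LSTM, with return_sequences=True, produces one output vector per target time step. Finally, the TimeDistributed Dense layer applies a softmax over the English vocabulary at each time step, giving the probability of every English word at that position.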
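The output shape and parameter count of each layer in defineModel can be traced with simple arithmetic. The sketch below is not Keras code and the helper `trace_shapes` is hypothetical. The target vocabulary of 6255 and the sequence lengths of 6 and 5 come from the earlier outputs, while the source vocabulary of 10000 and the 256 units are assumed values chosen only for illustration.

```python
def trace_shapes(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    """Return (layer, output shape, parameter count) per layer; None is the batch axis."""
    def lstm_params(n_inputs):
        # Four gates, each with input weights, recurrent weights and a bias
        return 4 * (n_units * (n_inputs + n_units) + n_units)

    return [
        ("Embedding", (None, src_timesteps, n_units), src_vocab * n_units),
        ("LSTM (encoder)", (None, n_units), lstm_params(n_units)),
        ("RepeatVector", (None, tar_timesteps, n_units), 0),
        ("LSTM (decoder)", (None, tar_timesteps, n_units), lstm_params(n_units)),
        ("TimeDistributed(Dense)", (None, tar_timesteps, tar_vocab),
         n_units * tar_vocab + tar_vocab),
    ]

# src_vocab=10000 and n_units=256 are assumptions, not values from the notebook
layers = trace_shapes(src_vocab=10000, tar_vocab=6255,
                      src_timesteps=6, tar_timesteps=5, n_units=256)
for name, shape, params in layers:
    print(f"{name:24s} {str(shape):18s} {params}")
```

The numbers should agree with what model.summary() prints for the same arguments, which makes this a handy cross-check.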

**Model fitting**



In [38]:
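from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
# Remaining layers used by defineModel
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed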

In [39]:
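# Fitting the model
from keras.callbacks import ModelCheckpoint
# Build the model; n_units=256 is an assumed value and eng_vocab_size is assumed to be defined like ger_vocab_size
model = defineModel(ger_vocab_size,eng_vocab_size,int(gerLength),int(engLength),256)
checkpoint = ModelCheckpoint('model1.h5',monitor='val_loss',verbose=1,save_best_only=True,mode='min')
model.fit(trainX,trainY,epochs=50,batch_size=64,validation_data=(testX,testY),callbacks=[checkpoint],verbose=2)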
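The checkpoint callback is configured with save_best_only=True and mode='min', so the model file is only overwritten when the validation loss improves. The rule can be sketched in plain Python; the helper `best_only` and the loss values are hypothetical and are not results from this notebook.

```python
def best_only(val_losses):
    """Return the 1-based epochs at which a save_best_only checkpoint would be written."""
    best = float("inf")
    saved_epochs = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # mode='min': only a strictly lower loss counts as an improvement
            best = loss
            saved_epochs.append(epoch)
    return saved_epochs

# Hypothetical validation losses for five epochs
toy_losses = [3.2, 2.7, 2.9, 2.5, 2.6]
print(best_only(toy_losses))
```

After training, the saved file therefore holds the weights from the epoch with the lowest validation loss rather than from the final epoch.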

# **References**

[[1] Build and Deploy Data Science Products : A Practical Guide to Building a Machine Translation Application.](https://bayesianquest.com/2020/10/24/build-your-machine-translation-application-byte-by-byte/)

[[2] II:Build and Deploy Data Science Products : Exploring Sequence to Sequence architecture for Machine Translation.](https://bayesianquest.com/2020/10/24/ii-build-and-deploy-data-science-products-exploring-sequence-to-sequence-architecture-for-machine-translation/)