# Neural machine translation - Encoder-Decoder seq2seq model

## Machine Translation – A Brief History
Most of us were introduced to machine translation when Google launched its translation service. But the concept has been around since the middle of the last century.

Research on Machine Translation (MT) started as early as the 1950s, primarily in the United States. These early systems relied on huge bilingual dictionaries, hand-coded rules, and universal principles underlying natural language.

In 1954, IBM held the first-ever public demonstration of machine translation. The system had a small vocabulary of only 250 words and could translate only 49 hand-picked Russian sentences into English. These numbers seem minuscule now, but the system is widely regarded as an important milestone in the progress of machine translation.

Soon, two schools of thought emerged:

- Empirical trial-and-error approaches, using statistical methods, and
- Theoretical approaches involving fundamental linguistic research

In 1964, the Automatic Language Processing Advisory Committee (ALPAC) was established by the United States government to evaluate the progress in Machine Translation. ALPAC did a little prodding around and published a report in November 1966 on the state of MT. Below are the key highlights from that report:

- It raised serious questions on the feasibility of machine translation and termed it hopeless
- Funding was discouraged for MT research
- It was quite a depressing report for the researchers working in this field
- Most of them left the field and started new careers

Not exactly a glowing recommendation!

A long dry period followed this miserable report. Finally, in 1981, a new system called the METEO System was deployed in Canada for translation of weather forecasts issued in French into English. It was quite a successful project which stayed in operation until 2001.

The world’s first web translation tool, Babel Fish, was launched by the AltaVista search engine in 1997.

And then came the breakthrough we are all familiar with now – Google Translate. It has since changed the way we work (and even learn) with different languages.

### Introduction to Sequence-to-Sequence (Seq2Seq) Modeling
Sequence-to-Sequence (seq2seq) models are used for a variety of NLP tasks, such as text summarization, speech recognition, and DNA sequence modeling. Our aim is to translate given sentences from one language to another.

Here, both the input and the output are sentences. In other words, a sequence of words goes into the model and another sequence of words comes out. This is the basic idea of Sequence-to-Sequence modeling.

A typical seq2seq model has 2 major components –

- an encoder
- a decoder

Both of these parts are essentially two different recurrent neural network (RNN) models combined into one large network.

Some other use cases of Sequence-to-Sequence modeling (apart from Machine Translation, of course):

- Speech Recognition
- Named Entity/Subject Extraction to identify the main subject from a body of text
- Relation Classification to tag relationships between various entities tagged in the above step
- Chatbot skills to have conversational ability and engage with customers
- Text Summarization to generate a concise summary of a large amount of text
- Question Answering systems
 

### Business example

For our example implementation, we will use a dataset of pairs of English sentences and their French translations, which you can download from manythings.org/anki. The file to download is called fra-eng.zip.

We will implement a character-level sequence-to-sequence model, processing the input character-by-character and generating the output character-by-character. Another option would be a word-level model, which tends to be more common for machine translation. 
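To make the difference concrete, the short sketch below tokenizes one sentence both ways. The sentence and the variable names are illustrative only, and the whitespace split is a simplification of what a real word-level tokenizer would do.

```python
# Illustrative sentence; not taken from the dataset.
sentence = "Run fast!"

# Character-level: every character (including spaces and punctuation) is a token.
char_tokens = list(sentence)

# Word-level (simplified): split on whitespace only.
# Real tokenizers usually also separate punctuation from words.
word_tokens = sentence.split()

print(char_tokens)
print(word_tokens)
print(len(char_tokens), len(word_tokens))
```

A character-level model has a small vocabulary but long sequences, while a word-level model has the opposite trade-off.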

#### Here's a summary of our process:

- 1) Turn the sentences into 3 NumPy arrays, encoder_input_data, decoder_input_data, and decoder_target_data (named tokenized_eng_sent, tokenized_fre_sent, and target_data in the code below):
    - encoder_input_data is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters) containing a one-hot vectorization of the English sentences.
    - decoder_input_data is a 3D array of shape (num_pairs, max_french_sentence_length, num_french_characters) containing a one-hot vectorization of the French sentences.
    - decoder_target_data is the same as decoder_input_data but offset by one timestep: decoder_target_data[:, t, :] will be the same as decoder_input_data[:, t + 1, :].
- 2) Train a basic LSTM-based Seq2Seq model to predict decoder_target_data given encoder_input_data and decoder_input_data. Our model uses teacher forcing.
- 3) Decode some sentences to check that the model is working (i.e. turn samples from encoder_input_data into corresponding samples from decoder_target_data).

Because the training process and the inference process (decoding sentences) are quite different, we use a separate model for each, although they share the same inner layers.
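As a rough illustration of the one-timestep offset described in step 1, the sketch below builds the decoder input and target arrays for two toy sentences. The sentences, the vocabulary, and the variable names are made up for this example; the tab and newline characters play the role of start-of-sequence and end-of-sequence markers.

```python
import numpy as np

# Toy data (hypothetical): '\t' marks start-of-sequence, '\n' marks end-of-sequence.
french = ["\tVa !\n", "\tCours !\n"]
chars = sorted(set("".join(french)))
char_to_index = {ch: i for i, ch in enumerate(chars)}
max_len = max(len(s) for s in french)

decoder_input_data = np.zeros((len(french), max_len, len(chars)), dtype="float32")
decoder_target_data = np.zeros_like(decoder_input_data)

for i, sentence in enumerate(french):
    for t, ch in enumerate(sentence):
        decoder_input_data[i, t, char_to_index[ch]] = 1.0
        if t > 0:
            # The target at step t - 1 is the input character at step t.
            decoder_target_data[i, t - 1, char_to_index[ch]] = 1.0

# The target is the input shifted one step to the left.
assert np.array_equal(decoder_target_data[:, :-1, :], decoder_input_data[:, 1:, :])
print(decoder_input_data.shape)
```

Positions beyond the end of a sentence stay all-zero in both arrays, which is how shorter sentences are padded to the common length.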

The training model leverages three key features of Keras RNNs:

- The return_state constructor argument, configuring an RNN layer to return a list where the first entry is the outputs and the next entries are the internal RNN states. This is used to recover the states of the encoder.
- The initial_state call argument, specifying the initial state(s) of an RNN. This is used to pass the encoder states to the decoder as initial states.
- The return_sequences constructor argument, configuring an RNN to return its full sequence of outputs (instead of just the last output, which is the default behavior). This is used in the decoder.
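To give a feel for what these three arguments mean, here is a plain NumPy sketch of a single LSTM layer. It is not Keras code and not how Keras implements its layers: the function `lstm_forward`, the random weights, and the gate layout are assumptions made for illustration. Only the argument names mirror the Keras ones.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(inputs, W, U, b, initial_state=None,
                 return_sequences=False, return_state=False):
    """Run a simplified LSTM over `inputs` of shape (timesteps, input_dim)."""
    units = U.shape[0]
    if initial_state is None:
        h, c = np.zeros(units), np.zeros(units)
    else:
        h, c = initial_state
    outputs = []
    for x_t in inputs:
        z = x_t @ W + h @ U + b              # shape (4 * units,)
        i, f, g, o = np.split(z, 4)          # input, forget, candidate, output
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)           # new cell state
        h = o * np.tanh(c)                   # new hidden state (the output)
        outputs.append(h)
    # return_sequences: all per-step outputs instead of only the last one
    out = np.stack(outputs) if return_sequences else outputs[-1]
    # return_state: also hand back the final hidden and cell states
    return (out, h, c) if return_state else out

rng = np.random.default_rng(0)
input_dim, units = 5, 3
W_enc, U_enc, b_enc = (rng.normal(size=(input_dim, 4 * units)),
                       rng.normal(size=(units, 4 * units)), np.zeros(4 * units))
W_dec, U_dec, b_dec = (rng.normal(size=(input_dim, 4 * units)),
                       rng.normal(size=(units, 4 * units)), np.zeros(4 * units))
enc_in = rng.normal(size=(4, input_dim))     # 4 encoder timesteps
dec_in = rng.normal(size=(6, input_dim))     # 6 decoder timesteps

# Encoder: keep only the final states, discard the outputs.
_, enc_h, enc_c = lstm_forward(enc_in, W_enc, U_enc, b_enc, return_state=True)

# Decoder: start from the encoder states and return one output per timestep.
dec_out = lstm_forward(dec_in, W_dec, U_dec, b_dec,
                       initial_state=(enc_h, enc_c), return_sequences=True)
print(dec_out.shape)
```

The decoder produces one output vector per timestep, which is what a dense softmax layer needs in order to predict a character at every position.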

**Summary of the algorithm**
- We start with input sequences from a domain (e.g. English sentences)
    and corresponding target sequences from another domain
    (e.g. French sentences).
- An encoder LSTM turns input sequences to 2 state vectors
    (we keep the last LSTM state and discard the outputs).
- A decoder LSTM is trained to turn the target sequences into
    the same sequence but offset by one timestep in the future,
    a training process called "teacher forcing" in this context.
    It uses as initial state the state vectors from the encoder.
    Effectively, the decoder learns to generate `targets[t+1...]`
    given `targets[...t]`, conditioned on the input sequence.
- In inference mode, when we want to decode unknown input sequences, we:
    - Encode the input sequence into state vectors
    - Start with a target sequence of size 1
        (just the start-of-sequence character)
    - Feed the state vectors and 1-char target sequence
        to the decoder to produce predictions for the next character
    - Sample the next character using these predictions
        (we simply use argmax).
    - Append the sampled character to the target sequence
    - Repeat until we generate the end-of-sequence character or we
        hit the character limit.
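
The inference loop above can be sketched without any trained model. In the example below, `toy_decoder_step` is a hypothetical stand-in for the trained decoder: it ignores its input character and simply walks through a fixed answer, so that the greedy loop itself (argmax, append, stop on end-of-sequence or length limit) can be seen in isolation.

```python
import numpy as np

vocab = ["\t", "\n", "V", "a", " ", "!"]   # toy vocabulary (assumption)
char_to_index = {ch: i for i, ch in enumerate(vocab)}

def toy_decoder_step(prev_char_index, state):
    """Hypothetical stand-in for the trained decoder.

    Returns a probability vector over `vocab` and the next state.
    Here the state is just a position counter into a fixed answer.
    """
    answer = "Va !\n"
    probs = np.full(len(vocab), 0.01)
    probs[char_to_index[answer[state]]] = 1.0
    return probs / probs.sum(), state + 1

def greedy_decode(initial_state, max_len=20):
    state = initial_state
    prev = char_to_index["\t"]              # start-of-sequence character
    decoded = ""
    while len(decoded) < max_len:           # character limit
        probs, state = toy_decoder_step(prev, state)
        prev = int(np.argmax(probs))        # greedy choice
        ch = vocab[prev]
        decoded += ch                       # append the sampled character
        if ch == "\n":                      # end-of-sequence character
            break
    return decoded

print(repr(greedy_decode(initial_state=0)))
```

In the real model the state would be the pair of LSTM state vectors produced by the encoder, and the step function would be a call to the decoder inference model.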

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Importing Packages
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

# Path to the English-French sentence pairs on the mounted Drive
strFilePath = "/content/drive/My Drive/GitUpload/Machine Translation Seq2Seq/fra.txt"

Using TensorFlow backend.


In [3]:
# Importing Data: each line has the form "<English>\t<French>"
with open(strFilePath, encoding="utf-8") as f:
  fileInput = f.read().split('\n')
fileInput[:2]

['\ufeffGo.\tVa !', 'Run!\tCours\u202f!']
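
The `\ufeff` at the start of the first entry is a byte-order mark (BOM) left at the beginning of the file. Read with `encoding="utf-8"`, it is kept as a character and ends up in the English vocabulary. A possible way to avoid this, sketched below on a small temporary file rather than on the real dataset, is Python's `utf-8-sig` codec, which strips a leading BOM. The sample content and the names `sample_path`, `with_bom`, and `without_bom` are assumptions for illustration.

```python
import os
import tempfile

# Write a small sample file that starts with a UTF-8 byte-order mark (BOM).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfGo.\tVa !\nRun!\tCours !\n")
    sample_path = tmp.name

# Plain utf-8 keeps the BOM as the first character of the text.
with open(sample_path, encoding="utf-8") as f:
    with_bom = f.read().split("\n")

# utf-8-sig removes a leading BOM if one is present.
with open(sample_path, encoding="utf-8-sig") as f:
    without_bom = f.read().split("\n")

os.remove(sample_path)

print(repr(with_bom[0]))
print(repr(without_bom[0]))
```

If the notebook were changed this way, the English vocabulary would contain one character fewer than the outputs shown here, so the printed shapes would differ accordingly.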

In [0]:
# Containers for the sentences and for the character vocabularies

listEnglishInput = []
listFrenchInput = []
setEnglishChars = set()
setFrenchChars = set()
intSamples = 10000  # number of sentence pairs to use

In [0]:
for index in range(intSamples):
  # Each line has the form "<English>\t<French>"
  CurrentString = str(fileInput[index])
  fields = CurrentString.split('\t')

  strEngLine = fields[0]
  # '\t' marks the start of the target sequence and '\n' marks its end
  strFreLine = '\t' + fields[1] + '\n'

  listEnglishInput.append(strEngLine)
  listFrenchInput.append(strFreLine)

  # Collect the distinct characters seen in each language
  setEnglishChars.update(strEngLine)
  setFrenchChars.update(strFreLine)

In [6]:
print(listEnglishInput[:3])

['\ufeffGo.', 'Run!', 'Run!']


In [7]:
print(listFrenchInput[:3])

['\tVa !\n', '\tCours\u202f!\n', '\tCourez\u202f!\n']


In [8]:
setEnglishChars

{' ',
 '!',
 '$',
 '&',
 "'",
 ',',
 '-',
 '.',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '9',
 ':',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'Y',
 'Z',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '’',
 '\ufeff'}

In [9]:
setFrenchChars

{'\t',
 '\n',
 ' ',
 '!',
 '$',
 '&',
 "'",
 '(',
 ')',
 ',',
 '-',
 '.',
 '0',
 '1',
 '5',
 '6',
 '9',
 ':',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'Y',
 'Z',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '\xa0',
 '«',
 '»',
 'À',
 'Ç',
 'É',
 'Ê',
 'à',
 'â',
 'ç',
 'è',
 'é',
 'ê',
 'ë',
 'î',
 'ï',
 'ô',
 'ù',
 'û',
 'œ',
 '\u2009',
 '‘',
 '’',
 '\u202f'}

In [0]:
# sorted() returns a list, so from here on these names refer to sorted lists of characters
setFrenchChars = sorted(setFrenchChars)
setEnglishChars = sorted(setEnglishChars)

In [11]:
eng_index_to_char = {}
eng_char_to_index = {}

for k,v in enumerate(setEnglishChars):
  eng_index_to_char[k] = v
  eng_char_to_index[v] = k

eng_char_to_index

{' ': 0,
 '!': 1,
 '$': 2,
 '&': 3,
 "'": 4,
 ',': 5,
 '-': 6,
 '.': 7,
 '0': 8,
 '1': 9,
 '2': 10,
 '3': 11,
 '4': 12,
 '5': 13,
 '6': 14,
 '7': 15,
 '9': 16,
 ':': 17,
 '?': 18,
 'A': 19,
 'B': 20,
 'C': 21,
 'D': 22,
 'E': 23,
 'F': 24,
 'G': 25,
 'H': 26,
 'I': 27,
 'J': 28,
 'K': 29,
 'L': 30,
 'M': 31,
 'N': 32,
 'O': 33,
 'P': 34,
 'Q': 35,
 'R': 36,
 'S': 37,
 'T': 38,
 'U': 39,
 'V': 40,
 'W': 41,
 'Y': 42,
 'Z': 43,
 'a': 44,
 'b': 45,
 'c': 46,
 'd': 47,
 'e': 48,
 'f': 49,
 'g': 50,
 'h': 51,
 'i': 52,
 'j': 53,
 'k': 54,
 'l': 55,
 'm': 56,
 'n': 57,
 'o': 58,
 'p': 59,
 'q': 60,
 'r': 61,
 's': 62,
 't': 63,
 'u': 64,
 'v': 65,
 'w': 66,
 'x': 67,
 'y': 68,
 'z': 69,
 '’': 70,
 '\ufeff': 71}

In [12]:
fre_index_to_char = {}
fre_char_to_index = {}

for k,v in enumerate(setFrenchChars):
  fre_index_to_char[k] = v
  fre_char_to_index[v] = k

fre_char_to_index

{'\t': 0,
 '\n': 1,
 ' ': 2,
 '!': 3,
 '$': 4,
 '&': 5,
 "'": 6,
 '(': 7,
 ')': 8,
 ',': 9,
 '-': 10,
 '.': 11,
 '0': 12,
 '1': 13,
 '5': 14,
 '6': 15,
 '9': 16,
 ':': 17,
 '?': 18,
 'A': 19,
 'B': 20,
 'C': 21,
 'D': 22,
 'E': 23,
 'F': 24,
 'G': 25,
 'H': 26,
 'I': 27,
 'J': 28,
 'K': 29,
 'L': 30,
 'M': 31,
 'N': 32,
 'O': 33,
 'P': 34,
 'Q': 35,
 'R': 36,
 'S': 37,
 'T': 38,
 'U': 39,
 'V': 40,
 'Y': 41,
 'Z': 42,
 'a': 43,
 'b': 44,
 'c': 45,
 'd': 46,
 'e': 47,
 'f': 48,
 'g': 49,
 'h': 50,
 'i': 51,
 'j': 52,
 'k': 53,
 'l': 54,
 'm': 55,
 'n': 56,
 'o': 57,
 'p': 58,
 'q': 59,
 'r': 60,
 's': 61,
 't': 62,
 'u': 63,
 'v': 64,
 'w': 65,
 'x': 66,
 'y': 67,
 'z': 68,
 '\xa0': 69,
 '«': 70,
 '»': 71,
 'À': 72,
 'Ç': 73,
 'É': 74,
 'Ê': 75,
 'à': 76,
 'â': 77,
 'ç': 78,
 'è': 79,
 'é': 80,
 'ê': 81,
 'ë': 82,
 'î': 83,
 'ï': 84,
 'ô': 85,
 'ù': 86,
 'û': 87,
 'œ': 88,
 '\u2009': 89,
 '‘': 90,
 '’': 91,
 '\u202f': 92}

In [13]:
max_len_eng_sent = max([len(line) for line in listEnglishInput])
max_len_fre_sent = max([len(line) for line in listFrenchInput])

print(max_len_eng_sent)
print(max_len_fre_sent)

16
59


In [14]:
print(len(setEnglishChars))
print(len(setFrenchChars))

72
93


In [0]:
tokenized_eng_sent = np.zeros(shape=(intSamples, max_len_eng_sent, len(setEnglishChars)), dtype='float32')
tokenized_fre_sent = np.zeros(shape=(intSamples, max_len_fre_sent, len(setFrenchChars)), dtype='float32')
target_data = np.zeros(shape=(intSamples, max_len_fre_sent, len(setFrenchChars)), dtype='float32')

In [16]:
print(tokenized_eng_sent.shape)
print(tokenized_fre_sent.shape)
print(target_data.shape)

#(Observation_num, Positions, Character)

(10000, 16, 72)
(10000, 59, 93)
(10000, 59, 93)


In [0]:
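for i in range(intSamples):
  # One-hot encode the English characters (encoder input)
  for pos,ch in enumerate(listEnglishInput[i]):
    tokenized_eng_sent[i,pos,eng_char_to_index[ch]] = 1

  # One-hot encode the French characters (decoder input)
  for pos,ch in enumerate(listFrenchInput[i]):
    tokenized_fre_sent[i,pos,fre_char_to_index[ch]] = 1

    # The target is the decoder input shifted one timestep ahead,
    # so it does not include the leading '\t'
    if pos>0:
      target_data[i,pos-1,fre_char_to_index[ch]] = 1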
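
One way to sanity-check arrays like these is to map a one-hot matrix back to text. The helper below, `one_hot_to_text`, is a hypothetical name and not part of the notebook; it is shown on a single toy sentence so that it runs on its own.

```python
import numpy as np

def one_hot_to_text(one_hot_rows, index_to_char):
    """Map a (timesteps, vocab_size) one-hot matrix back to a string.

    All-zero rows are treated as padding and skipped.
    """
    chars = []
    for row in one_hot_rows:
        if row.any():
            chars.append(index_to_char[int(np.argmax(row))])
    return "".join(chars)

# Toy example (assumption): one French sentence padded to 8 timesteps.
sentence = "\tVa !\n"
vocab = sorted(set(sentence))
char_to_index = {ch: i for i, ch in enumerate(vocab)}
index_to_char = {i: ch for ch, i in char_to_index.items()}

max_len = 8
decoder_in = np.zeros((max_len, len(vocab)), dtype="float32")
target = np.zeros((max_len, len(vocab)), dtype="float32")
for pos, ch in enumerate(sentence):
    decoder_in[pos, char_to_index[ch]] = 1
    if pos > 0:
        target[pos - 1, char_to_index[ch]] = 1

print(repr(one_hot_to_text(decoder_in, index_to_char)))
print(repr(one_hot_to_text(target, index_to_char)))
```

In the notebook, the same helper could be called as `one_hot_to_text(tokenized_fre_sent[0], fre_index_to_char)` to confirm that the first decoder input decodes back to the first French sentence.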

In [18]:
target_data

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       ...,

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]], dtype=float32)

In [19]:
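# Encoder: reads the one-hot English characters and returns its final states
encoder_input = Input(shape=(None, len(setEnglishChars)))
encoder_LSTM = LSTM(256, return_state=True)
# The encoder outputs are discarded; only the hidden state (h) and cell state (c) are kept
encoder_output, encoder_h, encoder_c = encoder_LSTM(encoder_input)
encoder_states = [encoder_h, encoder_c]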






In [0]:
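# Decoder: starts from the encoder states and predicts the next character at every timestep
decoder_input = Input(shape=(None, len(setFrenchChars)))
decoder_LSTM = LSTM(256, return_sequences=True, return_state=True)
# The decoder states are not needed during training, only the full output sequence
decoder_out, _, _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
# Softmax over the French character vocabulary
decoder_dense = Dense(len(setFrenchChars), activation='softmax')
decoder_out = decoder_dense(decoder_out)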

In [23]:
model = Model(inputs=[encoder_input, decoder_input], outputs=decoder_out)

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x=[tokenized_eng_sent, tokenized_fre_sent],
          y=target_data,
          batch_size=64,
          epochs=5,
          validation_split=0.2)


Train on 8000 samples, validate on 2000 samples
Epoch 1/5





Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fde28d1ecf8>

In [0]:
# Train for 15 more epochs; calling fit() again continues from the current weights
model.fit(x=[tokenized_eng_sent, tokenized_fre_sent],
          y=target_data,
          batch_size=64,
          epochs=15,
          validation_split=0.2)

Train on 8000 samples, validate on 2000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7ff345ad9c50>

In [0]:
# Building inference models

# Encoder inference model: maps an input sequence to the encoder states [h, c]
encoder_model_inf = Model(encoder_input, encoder_states)

# Decoder inference model: takes the previous character and the previous states,
# and returns the next-character probabilities plus the updated states
decoder_states_input_h = Input(shape=(256,))
decoder_states_input_c = Input(shape=(256,))
decoder_input_states = [decoder_states_input_h, decoder_states_input_c]

# Reuse the trained decoder LSTM and Dense layers
decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input, initial_state=decoder_input_states)
decoder_states = [decoder_h, decoder_c]

decoder_out = decoder_dense(decoder_out)
decoder_model_inf = Model(inputs=[decoder_input]+decoder_input_states, outputs=[decoder_out]+decoder_states)

In [0]:
def decode_seq(inp_seq):
  # Encode the input sequence into the initial decoder states [h, c]
  states_val = encoder_model_inf.predict(inp_seq)

  # Start the target sequence with the start-of-sequence character ('\t')
  target_seq = np.zeros((1,1,len(setFrenchChars)))
  target_seq[0,0,fre_char_to_index['\t']] = 1

  translated_sent = ''
  stop_condition = False

  while not stop_condition:
    decoder_out, decoder_h, decoder_c = decoder_model_inf.predict(x=[target_seq] + states_val)
    # Greedy choice: pick the most probable next character
    max_val_index = np.argmax(decoder_out[0,-1,:])
    sampled_fra_char = fre_index_to_char[max_val_index]
    translated_sent += sampled_fra_char

    # Stop at the end-of-sequence character ('\n') or at the length limit
    if (sampled_fra_char == '\n') or (len(translated_sent)>max_len_fre_sent):
      stop_condition = True

    # Feed the predicted character and the updated states back into the decoder
    target_seq = np.zeros((1, 1, len(setFrenchChars)))
    target_seq[0, 0, max_val_index] = 1
    states_val = [decoder_h, decoder_c]

  return translated_sent


In [0]:
decode_seq(tokenized_eng_sent[0:1])

'Va chercher !\n'

In [0]:
for seq_index in range(10):
  inp_seq = tokenized_eng_sent[seq_index:seq_index+1]
  translated_sent = decode_seq(inp_seq)
  print('-')
  print('Input sentence is : ',listEnglishInput[seq_index])
  print('Translated sentence is : ',translated_sent)
  print('Actual translated sentence is : ',listFrenchInput[seq_index])

-
Input sentence is :  ﻿Go.
Translated sentence is :  Va chercher !

Actual translated sentence is :  	Va !

-
Input sentence is :  Run!
Translated sentence is :  Pardez-vous !

Actual translated sentence is :  	Cours !

-
Input sentence is :  Run!
Translated sentence is :  Pardez-vous !

Actual translated sentence is :  	Courez !

-
Input sentence is :  Wow!
Translated sentence is :  Attrape les !

Actual translated sentence is :  	Ça alors !

-
Input sentence is :  Fire!
Translated sentence is :  Arrête !

Actual translated sentence is :  	Au feu !

-
Input sentence is :  Help!
Translated sentence is :  Prends un me chanten.

Actual translated sentence is :  	À l'aide !

-
Input sentence is :  Jump.
Translated sentence is :  Sourlez !

Actual translated sentence is :  	Saute.

-
Input sentence is :  Stop!
Translated sentence is :  Arrête !

Actual translated sentence is :  	Ça suffit !

-
Input sentence is :  Stop!
Translated sentence is :  Arrête !

Actual translated sentence is :  