## MACHINE TRANSLATION MODEL - SEQ2SEQ MODEL

In this project, I build a Seq2Seq model that translates English sentences into French.

The pipeline is:
1. Read the data from text files and split it into train and test dataframes
2. Tokenize the sentences and convert them to one-hot encoded sequences
3. Build the model using GRU and RepeatVector layers
4. Predict on the test data, i.e. translate its English sentences
5. Evaluate the model using the BLEU score (future work)

In [1]:
#import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (10.0, 5.0)
%matplotlib inline

### 1. Read data from txt file

In [2]:
en_sents = list()
with open('vocab_en.txt', encoding='utf-8') as txt_file:
    for line in txt_file:
        new_line = line.rstrip('\n')
        en_sents.append(new_line)

fr_sents = list()
with open('vocab_fr.txt', encoding='utf-8') as txt_file:
    for line in txt_file.readlines():
        new_line = line.rstrip('\n')
        fr_sents.append(new_line)
        
df = pd.DataFrame({'english_sentence':en_sents, 'french_sentence': fr_sents})
df.head()

| | english_sentence | french_sentence |
|---|---|---|
| 0 | new jersey is sometimes quiet during autumn , ... | new jersey est parfois calme pendant l' automn... |
| 1 | the united states is usually chilly during jul... | les états-unis est généralement froid en juill... |
| 2 | california is usually quiet during march , and... | california est généralement calme en mars , et... |
| 3 | the united states is sometimes mild during jun... | les états-unis est parfois légère en juin , et... |
| 4 | your least liked fruit is the grape , but my l... | votre moins aimé fruit est le raisin , mais mo... |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137860 entries, 0 to 137859
Data columns (total 2 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   english_sentence  137860 non-null  object
 1   french_sentence   137860 non-null  object
dtypes: object(2)
memory usage: 2.1+ MB


In [4]:
# shuffle the row indices
indicies = np.array(df.index)
np.random.shuffle(indicies)
indicies

array([ 64874,  19114,  22771, ...,  23172, 123966, 106799], dtype=int64)

#### Split the data into train (80%) and test (20%) dataframes after shuffling the row indices

In [5]:
train_size, test_size = int((len(df)/10) * 8), int((len(df)/10) * 2)
train_indicies, test_indicies = indicies[:train_size], indicies[train_size:]

train_df = df.iloc[train_indicies, :]
test_df = df.iloc[test_indicies, :]
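
The shuffle above is unseeded, so the split changes on every run. Below is a minimal sketch of a reproducible 80/20 split over row indices, using only the standard library. The helper name `split_indices` and the seed value are illustrative assumptions, not part of the notebook.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Return (train, test) lists of shuffled row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    train_size = int(n_rows * train_fraction)
    return indices[:train_size], indices[train_size:]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

In the notebook itself, calling `np.random.seed(...)` before `np.random.shuffle` would serve the same purpose.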

### 2. Tokenize sentences, then transform them to one-hot encoded sequences

In [6]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, GRU, RepeatVector, TimeDistributed
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [7]:
en_token = Tokenizer()
en_token.fit_on_texts(train_df['english_sentence'])

fr_token = Tokenizer()
fr_token.fit_on_texts(train_df['french_sentence'])

In [8]:
en_index_to_word = en_token.index_word
fr_index_to_word = fr_token.index_word

en_word_to_index = en_token.word_index
fr_word_to_index = fr_token.word_index

In [14]:
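# Keras' Tokenizer never assigns index 0 to a word, and pad_sequences pads with 0,
# so register a placeholder token for index 0 in both lookup tables.
en_index_to_word[0] = 'Unknown'
fr_index_to_word[0] = 'Unknown'

en_word_to_index['Unknown'] = 0
fr_word_to_index['Unknown'] = 0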

In [15]:
max_en_len = max([len(i) for i in en_token.texts_to_sequences(train_df['english_sentence'])])
max_fr_len = max([len(i) for i in fr_token.texts_to_sequences(train_df['french_sentence'])])
print(max_en_len, max_fr_len)

15 21


In [16]:
# vocabulary sizes, including the placeholder registered at index 0
en_vocab = len(en_token.index_word)
fr_vocab = len(fr_token.index_word)
print(en_vocab, fr_vocab)

200 342


In [17]:
# one-hot lookup tables: row i is the one-hot vector for word index i
en_oh = to_categorical(np.arange(0, en_vocab))
fr_oh = to_categorical(np.arange(0, fr_vocab))
print(en_oh.shape, fr_oh.shape)

(200, 200) (342, 342)


In [18]:
# look up the one-hot vector of a single word; language is 'en' or 'fr'
def word2oh(word, language='en'):
    oh = None
    if language == 'en':
        index = en_word_to_index[word]
        oh = en_oh[index]
    elif language == 'fr':
        index = fr_word_to_index[word]
        oh = fr_oh[index]
    return oh

In [19]:
# Convert a sentence to a one-hot encoded sequence. With reverse=True the padded
# sequence is reversed, a common trick for encoder inputs that may improve performance.
def sent2oh(sentence, language='en', reverse=False):
    oh = list()
    if language == 'en':
        sequence = en_token.texts_to_sequences([sentence])
        sequence = pad_sequences(sequence, padding='post', maxlen=max_en_len)  # pad with zeros at the end
        if reverse:
            sequence = sequence[:, ::-1]  # the padding zeros now come first
        for seq in sequence:
            oh.append(en_oh[seq])
    elif language == 'fr':
        sequence = fr_token.texts_to_sequences([sentence])
        sequence = pad_sequences(sequence, padding='post', maxlen=max_fr_len)
        if reverse:
            sequence = sequence[:, ::-1]
        for seq in sequence:
            oh.append(fr_oh[seq])
    return np.array(oh)
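
To make the pad-then-reverse step concrete, here is a small standard-library sketch on a toy sequence. The helpers `pad_post` and `one_hot` and the toy sizes are assumptions for illustration; they only mimic what `pad_sequences(..., padding='post')` and the one-hot lookup tables do for sequences no longer than `maxlen`.

```python
def pad_post(seq, maxlen, value=0):
    """Right-pad `seq` with `value` up to `maxlen` (assumes len(seq) <= maxlen)."""
    return list(seq) + [value] * (maxlen - len(seq))

def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at position `index`."""
    row = [0] * vocab_size
    row[index] = 1
    return row

token_ids = [3, 1, 4]                  # toy word indices for one sentence
padded = pad_post(token_ids, 5)        # [3, 1, 4, 0, 0]
reversed_ids = padded[::-1]            # [0, 0, 4, 1, 3]
encoded = [one_hot(i, 6) for i in reversed_ids]

print(reversed_ids)
print(len(encoded), len(encoded[0]))   # time steps, vocabulary size
```

Because the padding is added before the reversal, the encoder sees the padding first and the real words last, ending on the first word of the sentence.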

In [20]:
train_en_oh_rev = np.vstack([(sent2oh(sent, reverse=True)) for sent in train_df['english_sentence']])
train_fr_oh = np.vstack([(sent2oh(sent, language='fr')) for sent in train_df['french_sentence']])

print(train_en_oh_rev.shape, train_fr_oh.shape)

(110288, 15, 200) (110288, 21, 342)


In [21]:
test_en_oh_rev = np.vstack([(sent2oh(sent, reverse=True)) for sent in test_df['english_sentence']])
test_fr_oh = np.vstack([(sent2oh(sent, language='fr')) for sent in test_df['french_sentence']])

print(test_en_oh_rev.shape, test_fr_oh.shape)

(27572, 15, 200) (27572, 21, 342)


### 3. Build model

In [23]:
# hidden size of both GRU layers (set equal to the English vocabulary size)
hsize = en_vocab

#encoder
en_input = Input(shape=(max_en_len, en_vocab))
en_gru = GRU(hsize, return_state=True)
en_out, en_state = en_gru(en_input)

#decoder
de_input = RepeatVector(max_fr_len)(en_state)
de_gru = GRU(hsize, return_sequences=True)
de_out = de_gru(de_input, initial_state=en_state)

#prediction layer
de_dense_time = TimeDistributed(Dense(fr_vocab, activation='softmax'))
de_pred = de_dense_time(de_out)

#build the model (it is compiled in the next cell)
nmt = Model(inputs=en_input, outputs=de_pred)

#summarize model
nmt.summary()

#plot graph
# plot_model(nmt, to_file='layout.png')

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, 15, 200)]    0                                            
__________________________________________________________________________________________________
gru_2 (GRU)                     [(None, 200), (None, 241200      input_2[0][0]                    
__________________________________________________________________________________________________
repeat_vector_1 (RepeatVector)  (None, 21, 200)      0           gru_2[0][1]                      
__________________________________________________________________________________________________
gru_3 (GRU)                     (None, 21, 200)      241200      repeat_vector_1[0][0]            
                                                                 gru_2[0][1]                
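
The 241200 parameters reported for each GRU layer can be checked by hand. The sketch below assumes the TensorFlow 2 Keras default `reset_after=True`, under which each of the three gates keeps two bias vectors. `gru_param_count` is an illustrative helper, not a Keras function.

```python
def gru_param_count(input_dim, units, reset_after=True):
    """Trainable parameters of one GRU layer (update, reset and candidate gates)."""
    kernel = input_dim * units                   # input-to-hidden weights per gate
    recurrent = units * units                    # hidden-to-hidden weights per gate
    bias = 2 * units if reset_after else units   # bias terms per gate
    return 3 * (kernel + recurrent + bias)

# Encoder: one-hot inputs of size 200, hidden size 200.
# Decoder: the repeated 200-dimensional encoder state, hidden size 200.
print(gru_param_count(200, 200))  # 241200
```

Both layers have the same count because the decoder input (the repeated encoder state) has the same width as the one-hot English input.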

In [24]:
nmt.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
early_stopping = EarlyStopping(monitor='val_acc', patience=5)
nmt.fit(train_en_oh_rev, train_fr_oh, batch_size=128, epochs=10, validation_split=0.2, callbacks=[early_stopping])

Train on 88230 samples, validate on 22058 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x2569ef7d048>

### 4. Predict on the test data (translate its English sentences into French)

In [25]:
pred_fr_oh = nmt.predict(test_en_oh_rev)
pred_fr_oh.shape

(27572, 21, 342)

In [26]:
#convert one-hot (or softmax) outputs to French sentences; index 0 (padding) is skipped
def oh2fr(ohs):
    sequences = np.argmax(ohs, axis=-1)
    sentences = list()
    for seq in sequences:
        sent = [fr_index_to_word[i] for i in seq if i != 0]
        sent = ' '.join(sent)
        sentences.append(sent)
    return sentences
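
The decoding step can be illustrated on a toy example without NumPy. The vocabulary and probabilities below are made up for illustration, and `decode` is a hypothetical stand-in for what `oh2fr` does to a single sentence.

```python
def decode(prob_rows, index_to_word):
    """Pick the most likely index at each time step and drop padding (index 0)."""
    ids = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    return ' '.join(index_to_word[i] for i in ids if i != 0)

toy_index_to_word = {0: 'Unknown', 1: 'la', 2: 'france'}
toy_probs = [
    [0.1, 0.8, 0.1],    # step 1: index 1 is most likely
    [0.2, 0.1, 0.7],    # step 2: index 2 is most likely
    [0.9, 0.05, 0.05],  # step 3: index 0, treated as padding and skipped
]
print(decode(toy_probs, toy_index_to_word))  # la france
```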

In [27]:
pred_fr = oh2fr(pred_fr_oh)

#### Compare 10 predictions with the true French sentences

In [29]:
for i in range(10):
    print('English sentence: ', test_df['english_sentence'].iloc[i])
    print('True French sentence: ', test_df['french_sentence'].iloc[i])
    print('Predicted French sentence: ', pred_fr[i])
    print('==========')

English sentence:  france is sometimes cold during april , but it is never warm in november .
True French sentence:  la france est parfois froid en avril , mais il est jamais chaud en novembre .
Predicted French sentence:  la france est parfois froid en avril mais il est jamais chaud en novembre
English sentence:  china is usually pleasant during december , and it is wet in march .
True French sentence:  chine est généralement agréable en décembre , et il est humide en mars .
Predicted French sentence:  chine est généralement agréable en décembre et il est humide en mars
English sentence:  that cat was my most loved animal .
True French sentence:  ce chat était mon animal le plus aimé .
Predicted French sentence:  ce chat est mon animal le plus aimé
English sentence:  china is never relaxing during may , and it is pleasant in august .
True French sentence:  la chine est jamais relaxant au mois de mai , et il est agréable en août .
Predicted French sentence:  la chine est jamais relaxan

### 5. Evaluate model using BLEU score (future work)
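
As a starting point for this step, here is a minimal sketch of an unsmoothed, single-reference sentence-level BLEU score written with the standard library only. The function names are illustrative assumptions. Without smoothing the score is 0 whenever any n-gram order has no match, which includes every hypothesis shorter than `max_n` tokens.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed BLEU of one hypothesis against one reference (whitespace tokens)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped counts: an n-gram is credited at most as often as it occurs in the reference
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # brevity penalty: penalize hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = 'ce chat était mon animal le plus aimé .'
hypothesis = 'ce chat est mon animal le plus aimé'
score = sentence_bleu(reference, hypothesis)
print(round(score, 3))
```

The predictions contain no punctuation because the Keras `Tokenizer` filters it out by default, so the reference sentences would need the same normalization before scoring. For a full evaluation, an established implementation such as the `nltk.translate.bleu_score` module (which offers corpus-level scoring and smoothing) is likely a better choice than this sketch.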