## David Liu
## BAN 676

Build a Seq2Seq Language Translation model for any language pair of your choice. (See Datasets: http://www.manythings.org/anki/)

(1) Build a character-level model 

(2) Build a word-level model

(3) Build a word-level model with attention

Submit code and PDF. Include sample inferences from the best-performing model.

If you want to explore larger datasets, see:
1. http://casmacat.eu/corpus/global-voices.html
2. http://www.statmt.org/europarl/
3. http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/index.html
4. https://lionbridge.ai/datasets/25-best-parallel-text-datasets-for-machine-translation-training/
5. http://nlp.nju.edu.cn/cwmt-wmt/
6. http://opus.nlpl.eu/ (https://www.tensorflow.org/datasets/catalog/opus)


### Packages

In [120]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None) 

### Variables

In [121]:
batch_size = 64
num_samples = 10000
epochs = 50
latent_dim = 256
data_path = 'nld.txt'

### Extracting and Vectorizing the Data

Extracting English-Dutch data

In [122]:
# Vectorize the data.
punctuations = [".","?","!"]

input_texts = []
target_texts = []
input_words = set()
target_words = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

def split_tokens(text):
    """Split on whitespace; detach a trailing '.', '?' or '!' as its own token."""
    tokens = []
    for word in text.split():
        if word[-1] in punctuations:
            tokens.extend([word[:-1], word[-1]])
        else:
            tokens.append(word)
    return tokens

for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text = line.split('\t')[:2]
    # A tab and a newline are added as start/end characters, as in the
    # character-level model. Note that str.split() treats both as whitespace
    # and drops them, so the word-level token lists contain no such markers.
    target_text = '\t' + target_text + '\n'

    input_sentence = split_tokens(input_text)
    target_sentence = split_tokens(target_text)

    # Sets ignore duplicates, so no membership check is needed.
    input_words.update(input_sentence)
    target_words.update(target_sentence)

    input_texts.append(input_sentence)
    target_texts.append(target_sentence)
        
input_words = sorted(list(input_words))
target_words = sorted(list(target_words))
num_encoder_tokens = len(input_words)
num_decoder_tokens = len(target_words)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

Number of samples: 10000
Number of unique input tokens: 3047
Number of unique output tokens: 3719
Max sequence length for inputs: 7
Max sequence length for outputs: 13
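
One caveat about the cell above: `str.split()` treats the tab and newline added to `target_text` as whitespace and discards them, so the word-level target sequences carry no start or end marker. The following is a minimal sketch of one alternative, not the notebook's method. The marker strings `<start>` and `<end>` and the helper `tokenize` are hypothetical names introduced here for illustration. Using this approach would change the vocabulary size and the maximum target length printed above.

```python
# Sketch only: START, END and tokenize() are illustrative names,
# not part of the original notebook.
PUNCTUATION = {".", "?", "!"}
START, END = "<start>", "<end>"

def tokenize(text, add_markers=False):
    """Split on whitespace and detach one trailing '.', '?' or '!' per word."""
    tokens = []
    for word in text.split():
        # The len(word) > 1 guard (an addition to the notebook's logic)
        # avoids emitting an empty token for a lone punctuation mark.
        if len(word) > 1 and word[-1] in PUNCTUATION:
            tokens.extend([word[:-1], word[-1]])
        else:
            tokens.append(word)
    if add_markers:
        # Markers are added after splitting, so whitespace handling cannot drop them.
        tokens = [START] + tokens + [END]
    return tokens

# The tab/newline markers do not survive a plain split():
print("\tGa weg!\n".split())
# Explicit word-level markers do:
print(tokenize("Ga weg!", add_markers=True))
```

With markers in place, `<start>` can seed the decoder at inference time and `<end>` can serve as the stop condition.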


In [123]:
# Each sentence is converted into a sequence of integers, each integer representing a unique word
input_token_index = dict([(word, i) for i, word in enumerate(input_words)])
target_token_index = dict([(word, i) for i, word in enumerate(target_words)])

encoder_input_data = np.zeros((len(input_texts), max_encoder_seq_length), dtype='float32')
decoder_input_data = np.zeros((len(input_texts), max_decoder_seq_length), dtype='float32')
decoder_target_data = np.zeros((len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, word in enumerate(input_text):
        encoder_input_data[i, t] = input_token_index[word]
    for t, word in enumerate(target_text):
        decoder_input_data[i, t] = target_token_index[word]
        if t > 0:
            # The target at step t - 1 is the decoder input at step t (shifted one step ahead).
            # Only the target array is one-hot encoded; the inputs stay as integer ids.
            decoder_target_data[i, t - 1, target_token_index[word]] = 1.
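
To make the one-step shift between decoder input and decoder target concrete, here is a small sketch on toy data. The token ids, the sequence length of 6 and the vocabulary size of 5 are made up for illustration and do not come from the data set.

```python
import numpy as np

# Toy target sentence as token ids; all sizes here are illustrative assumptions.
target_ids = [3, 1, 4, 2]
max_len, vocab_size = 6, 5

decoder_input = np.zeros((1, max_len), dtype="int32")
decoder_target = np.zeros((1, max_len, vocab_size), dtype="float32")

for t, token_id in enumerate(target_ids):
    decoder_input[0, t] = token_id
    if t > 0:
        # The target at step t - 1 is the token the decoder reads at step t.
        decoder_target[0, t - 1, token_id] = 1.0

print(decoder_input[0])                    # integer ids, zero-padded
print(decoder_target[0].argmax(axis=-1))   # ids recovered from the one-hot rows
```

Two things are worth noticing. Padded positions in the target are all-zero rows, and id 0 doubles as both padding and a real word in the input arrays, because `enumerate` starts the vocabulary at 0. Also, at the sizes reported above, the one-hot target array holds 10000 × 13 × 3719 `float32` values, roughly 1.9 GB.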

### Building the Model

In [124]:
encoder_inputs = Input(shape=(None,))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [125]:
# Compile the model and print its architecture
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.summary()

Model: "model_20"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_41 (InputLayer)           [(None, None)]       0                                            
__________________________________________________________________________________________________
input_42 (InputLayer)           [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding_38 (Embedding)        (None, None, 256)    780032      input_41[0][0]                   
__________________________________________________________________________________________________
embedding_39 (Embedding)        (None, None, 256)    952064      input_42[0][0]                   
___________________________________________________________________________________________
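
The two embedding rows in the summary can be checked by hand, and the same arithmetic gives an estimate for the layers missing from the truncated output. This sketch assumes the standard Keras parameter layout (four LSTM gates, each with input weights, recurrent weights and a bias). The LSTM and Dense figures it prints are derived values, not values read from the summary.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

latent_dim = 256
num_encoder_tokens, num_decoder_tokens = 3047, 3719  # from the printed output above

print(embedding_params(num_encoder_tokens, latent_dim))  # matches embedding_38: 780032
print(embedding_params(num_decoder_tokens, latent_dim))  # matches embedding_39: 952064
print(lstm_params(latent_dim, latent_dim))               # each LSTM layer (derived)
print(dense_params(latent_dim, num_decoder_tokens))      # softmax output layer (derived)
```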

In [126]:
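# Run training; Keras takes the last 20% of the samples as the validation set
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
# Save model (the file name says DAN, but the data here are English-Dutch from nld.txt)
model.save('ENG-DAN-WORD.h5')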

Train on 8000 samples, validate on 2000 samples
Epoch 1/50
...
Epoch 50/50
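
A note on the 8000/2000 split: Keras takes the `validation_split` fraction from the end of the arrays, before any shuffling. If the source file is ordered (the Anki files appear to run from shorter to longer sentences, an assumption worth checking against `nld.txt`), the validation set would contain only the longest of the 10000 samples. One possible remedy is to shuffle all three arrays with the same permutation before calling `fit`. The helper `shuffle_in_unison` below is a hypothetical name and the arrays are toy stand-ins.

```python
import numpy as np

def shuffle_in_unison(arrays, seed=0):
    """Apply one random permutation to the first axis of every array."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must share the same first dimension")
    order = np.random.default_rng(seed).permutation(n)
    return [a[order] for a in arrays]

# Toy stand-ins for encoder_input_data, decoder_input_data, decoder_target_data.
enc = np.arange(10).reshape(10, 1)
dec = enc * 10
tgt = enc * 100

enc_s, dec_s, tgt_s = shuffle_in_unison([enc, dec, tgt])

# Rows stay aligned across the three arrays after shuffling.
print(np.array_equal(dec_s, enc_s * 10), np.array_equal(tgt_s, enc_s * 100))
```

In the notebook, the three shuffled arrays would then be passed to `model.fit` in place of the originals.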
