# Neural Machine Translation with Bidirectional LSTM

This project builds a Neural Machine Translation (NMT) model that translates human-readable dates, such as "10th of December, 2018", into machine-readable dates, such as "2018-12-10". The task is **very close to the Google Text Normalization Challenge**, which I once worked on.

Here are some more examples of what we want to achieve:
* "11 November 1993" ----------> "1993-11-11"
* "8 June 1967" ---------------> "1967-06-08"
* "20th of February 1992" -----> "1992-02-20"
* "Sat 10 Jul 2007" -----------> "2007-07-10"
* "Tuesday Sept 11 2011" ------> "2011-09-11"

We will use a sequence-to-sequence (seq2seq) model built around a **Bidirectional LSTM** encoder with an attention mechanism to accomplish this.

### Loading the dataset

To build the dataset, we generate sample dates with the faker library (https://github.com/joke2k/faker) and render them in a variety of formats with babel.

In [5]:
import random
from faker import Faker
from babel.dates import format_date

fake = Faker()

def load_date():
    
    # Define the format of the data we like to generate
    FORMATS = ["short",
               "medium",
               "long",
               "full",
               "full",
               "full",
               "full",
               "full",
               "full",
               "full",
               "full",
               "full",
               "full",
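               # Lowercase "y" is the calendar year; uppercase "Y" is the ISO
               # week-numbering year, which differs around New Year
               "d MMM yyy",
               "d MMMM yyy",
               "dd MMM yyy",
               "d MMM, yyy",
               "d MMMM, yyy",
               "dd, MMM yyy",
               "d MM yy",
               "d MMMM yyy",
               "MMMM d yyy",
               "MMMM d, yyy",
               "dd.MM.yy"]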
    
    dt = fake.date_object()
    
    try:
        human_readable = format_date(dt, 
                                     format=random.choice(FORMATS),
                                     locale="en_US").lower().replace(",","")
        machine_readable = dt.isoformat()
        
    except AttributeError as e:
        return None, None, None
    
    return human_readable, machine_readable, dt
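
To make the idea of "one date, many renderings" concrete, here is a minimal sketch that uses only the standard library. It is a stand-in, not the generator above: `render_pairs` and `STRFTIME_FORMATS` are hypothetical names, the patterns are `strftime` patterns rather than babel patterns, and the date is fixed instead of coming from Faker.

```python
from datetime import date

# Hypothetical stand-in for the babel-based generator: strftime patterns
# instead of babel patterns, and a fixed date instead of a Faker date.
STRFTIME_FORMATS = ["%d %B %Y", "%d %b %Y", "%A %B %d %Y", "%d.%m.%y"]

def render_pairs(dt):
    """Return (human_readable, machine_readable) pairs for one date."""
    machine_readable = dt.isoformat()
    return [(dt.strftime(fmt).lower(), machine_readable) for fmt in STRFTIME_FORMATS]

pairs = render_pairs(date(1993, 11, 11))
for human, machine in pairs:
    print(human, "->", machine)
```

Every rendering shares the same ISO target, which is exactly the many-to-one mapping the model has to learn.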

In [11]:
def load_dataset(m=10000):
    # Load a dataset with 'm' examples
    
    human_vocab = set()
    machine_vocab = set()
    dataset = []
    Tx = 30 # The maximum length of date format in the dataset
    
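    for _ in range(m):
        # Use names that do not shadow the parameter `m`
        human_date, machine_date, _dt = load_date()
        if human_date is not None:
            dataset.append((human_date, machine_date))
            human_vocab.update(human_date)
            machine_vocab.update(machine_date)
            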
            
    human = dict(zip(sorted(human_vocab) + ['<unk>',"<pad>"], list(range(len(human_vocab) + 2))))
    inv_machine = dict(enumerate(sorted(machine_vocab)))
    machine = {v:k for k,v in inv_machine.items()}
    
    return dataset, human, machine, inv_machine

Let's load a dataset of 10,000 human-readable dates in different formats, together with their standardized machine-readable equivalents. The first ten examples are shown below.

In [48]:
m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m=m)
dataset[:10]

[('thursday april 6 1995', '1995-04-06'),
 ('29 nov 1983', '1983-11-29'),
 ('23 oct 1977', '1977-10-23'),
 ('04.12.85', '1985-12-04'),
 ('monday march 4 1974', '1974-03-04'),
 ('monday may 22 2000', '2000-05-22'),
 ('17 apr 1987', '1987-04-17'),
 ('03 oct 2008', '2008-10-03'),
 ('19 jan 2000', '2000-01-19'),
 ('19.07.98', '1998-07-19')]

The returned values are:
* dataset: a list of tuples of (human readable date, machine readable date)
* human_vocab: dictionary mapping all characters used in the human readable dates to an integer-valued index
* machine_vocab: dictionary mapping all characters used in machine readable dates to an integer-valued index
* inv_machine_vocab: the inverse dictionary of machine_vocab, mapping from indices back to characters
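
The sketch below rebuilds the three dictionaries on a two-example toy dataset so their contents are easy to inspect. The `toy_` names are hypothetical; the construction mirrors `load_dataset` (sorted characters, with `<unk>` and `<pad>` appended on the source side only).

```python
# Toy dataset in the same (human, machine) shape as above.
toy_dataset = [("29 nov 1983", "1983-11-29"), ("04.12.85", "1985-12-04")]

human_chars, machine_chars = set(), set()
for human_date, machine_date in toy_dataset:
    human_chars.update(human_date)
    machine_chars.update(machine_date)

# Source side: sorted characters plus the two special tokens at the end.
toy_human_vocab = {ch: i for i, ch in enumerate(sorted(human_chars) + ["<unk>", "<pad>"])}
# Target side: characters only, plus the inverse mapping used for decoding.
toy_inv_machine_vocab = dict(enumerate(sorted(machine_chars)))
toy_machine_vocab = {ch: i for i, ch in toy_inv_machine_vocab.items()}

print(toy_human_vocab)
print(toy_machine_vocab)
```

Because the characters are sorted before indexing, "-" always gets index 0 on the target side and `<pad>` always gets the last index on the source side.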

### Processing the data into a trainable format

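We need to turn the text into numbers. Note that this is a character-level index lookup followed by one-hot encoding, not a learned embedding such as word2vec: each character of the raw text is mapped to its index in the vocabulary dictionaries loaded above.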

In [20]:
import numpy as np
from keras.utils import to_categorical

def string_to_int(string, length, vocab):
    
    # Convert the input string into a list of integers: the index of each
    # character in "vocab", truncated or padded to exactly "length" entries
    
    string = string.lower().replace(",","")
    
    if len(string) > length:
        string = string[:length]
        
    # Unknown characters map to the index of "<unk>", not to the string "<unk>"
    rep = [vocab[ch] if ch in vocab else vocab["<unk>"] for ch in string]
    
    if len(string) < length:
        rep += [vocab["<pad>"]] * (length - len(string))
        
    return rep
    
def preprocess_data(dataset, human_vocab, machine_vocab, max_length_X, max_length_y):
    
    X, y = zip(*dataset)
    
    X = np.array([string_to_int(i,max_length_X,human_vocab) for i in X])
    y = [string_to_int(t,max_length_y,machine_vocab) for t in y]
    
    X_one_hot = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)),X)))
    y_one_hot = np.array(list(map(lambda x: to_categorical(x, num_classes=len(machine_vocab)), y)))
    
    return X, np.array(y), X_one_hot, y_one_hot

We assume,
* max_length_X = 30: maximum length of the human-readable dates
* max_length_y = 10: length of the machine-readable dates, which always follow the 10-character format "YYYY-MM-DD"

If a human-readable date is shorter than 30 characters, we pad it on the right with the index of the "<pad\>" token, which is 36 in this vocabulary. Longer dates are truncated to 30 characters.
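
The truncate-then-pad rule can be sketched in isolation as follows. `encode` and `toy_vocab` are hypothetical names used only for this illustration; the toy vocabulary is much smaller than the real one, so `<pad>` is 12 here rather than 36.

```python
def encode(text, length, vocab):
    """Map characters to indices, then truncate or right-pad to `length`."""
    indices = [vocab.get(ch, vocab["<unk>"]) for ch in text[:length]]
    return indices + [vocab["<pad>"]] * (length - len(indices))

# Hypothetical toy vocabulary; the real one is built by load_dataset().
toy_vocab = {ch: i for i, ch in enumerate(list(" 0123456789") + ["<unk>", "<pad>"])}

print(encode("8 6 1967", 12, toy_vocab))    # shorter than 12: padded on the right
print(encode("8 june 67", 12, toy_vocab))   # letters are not in toy_vocab: mapped to <unk>
print(encode("1234567890123", 5, toy_vocab))  # longer than 5: truncated
```

Every encoded sequence has exactly `length` entries, which is what lets the examples be stacked into a single array.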

In [22]:
max_length_X = 30
max_length_y = 10

X, y, X_one_hot, y_one_hot = preprocess_data(dataset, human_vocab, machine_vocab, max_length_X, max_length_y)

print("X shape: {}".format(X.shape))
print("y shape: {}".format(y.shape))
print("X_one_hot shape: {}".format(X_one_hot.shape))
print("y_one_hot shape: {}".format(y_one_hot.shape))

X shape: (10000, 30)
y shape: (10000, 10)
X_one_hot shape: (10000, 30, 37)
y_one_hot shape: (10000, 10, 11)


Let's look at one random example from the preprocessed training set.

In [36]:
idx = random.randrange(len(X))  # randint(0, len(X)) could return len(X), which is out of range
print("Source date: {}".format(dataset[idx][0]))
print("Target date: {}".format(dataset[idx][1]))
print("=" * 100)
print("Source after preprocessing (indices): {}".format(X[idx]))
print("Target after preprocessing (indices): {}".format(y[idx]))
print("=" * 100)
print("Source after preprocessing (one-hot): {}".format(X_one_hot[idx]))
print("Target after preprocessing (one-hot): {}".format(y_one_hot[idx]))

Source date: monday october 8 2007
Target date: 2007-10-08
Source after preprocessing (indices): [24 26 25 16 13 34  0 26 15 30 26 14 17 28  0 11  0  5  3  3 10 36 36 36
 36 36 36 36 36 36]
Target after preprocessing (indices): [3 1 1 8 0 2 1 0 1 9]
Source after preprocessing (one-hot): [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
Target after preprocessing (one-hot): [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]
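
As a sanity check on the encoding, the sketch below rebuilds the one-hot target from the indices printed above and decodes it back to text. The `inv_vocab` mapping ("-" followed by the ten digits) is inferred from the printed indices and the 11 target classes, not read from the actual `inv_machine_vocab`.

```python
import numpy as np

# Inferred target vocabulary: "-" at index 0, then the digits 0-9.
inv_vocab = dict(enumerate("-0123456789"))

# "Target after preprocessing (indices)" from the example above.
indices = np.array([3, 1, 1, 8, 0, 2, 1, 0, 1, 9])
one_hot = np.eye(len(inv_vocab))[indices]   # shape (10, 11), a single 1 per row

# argmax over the last axis undoes the one-hot encoding.
decoded = "".join(inv_vocab[int(i)] for i in one_hot.argmax(axis=-1))
print(one_hot.shape, decoded)
```

The same argmax-then-lookup step is what turns the model's softmax outputs back into a date string at prediction time.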


### Building the NMT model

This section is the heart of the project. We will build an attention-based sequence-to-sequence model whose encoder is a Bidirectional LSTM (BiLSTM) to translate the dates.

We start with the one_step_attention() function, the basic building block of the attention mechanism that sits between the BiLSTM encoder and the decoder LSTM. Credit goes to Andrew Ng's course on Deep Learning (https://www.coursera.org/learn/nlp-sequence-models), which this code follows.

In [40]:
from keras.layers import Dense, Concatenate, RepeatVector, Activation, Dot
import keras.backend as K

def custom_softmax(x, axis=1):
    
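    # Softmax along a chosen axis (axis=1 is the time axis here), so that the
    # attention weights over the input positions sum to 1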
    
    ndim = K.ndim(x)
    if ndim == 2:
        return K.softmax(x)
    elif ndim > 2:
        e = K.exp(x - K.max(x, axis=axis, keepdims=True))
        s = K.sum(e, axis=axis, keepdims = True)
        return e / s
    else:
        raise ValueError("Cannot apply softmax to a tensor that is 1D")

repeator = RepeatVector(max_length_X)
concatenator = Concatenate(axis=-1)
densor_1 = Dense(units=10,activation="tanh")
densor_2 = Dense(units=1,activation="relu")
activator = Activation(custom_softmax,name="attention_weights") 
dotor = Dot(axes=1)

def one_step_attention(a, s_prev):
    # Repeat s_prev to be of shape (m, max_length_X, n_s)
    s_prev = repeator(s_prev)
    # Concatenate a and s_prev on the last axis
    concat = concatenator([a,s_prev])
    # Propagate concat through densor_1, a small fully-connected layer, to compute the "intermediate energies"
    energy_intermediate = densor_1(concat)
    # Propagate the intermediate energies through densor_2 to compute one "energy" per input position
    energy = densor_2(energy_intermediate)
    # Use "activator" on "energy" to compute the attention weights "alphas"
    alphas = activator(energy)
    # Use dotor together with "alphas" and "a" to compute the context vector to be given to the next post-attention LSTM-cell
    context = dotor([alphas,a])
    
    return context
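
To see the shapes involved, here is a rough NumPy sketch of the same computation for a single decoder step. It is an illustration under assumptions, not the Keras code above: `W1` and `W2` are random stand-ins for the trained weights of densor_1 and densor_2, and the bias terms are omitted.

```python
import numpy as np

def softmax(x, axis):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, Tx, n_a, n_s = 2, 30, 32, 64

a = rng.normal(size=(batch, Tx, 2 * n_a))   # BiLSTM outputs: both directions concatenated
s_prev = rng.normal(size=(batch, n_s))      # previous decoder hidden state

# Hypothetical random weights standing in for densor_1 and densor_2 (no biases).
W1 = rng.normal(size=(2 * n_a + n_s, 10)) * 0.1
W2 = rng.normal(size=(10, 1)) * 0.1

s_rep = np.repeat(s_prev[:, None, :], Tx, axis=1)       # (batch, Tx, n_s)
concat = np.concatenate([a, s_rep], axis=-1)            # (batch, Tx, 2*n_a + n_s)
energies = np.maximum(np.tanh(concat @ W1) @ W2, 0.0)   # tanh layer, then ReLU layer: (batch, Tx, 1)
alphas = softmax(energies, axis=1)                      # normalize over the Tx input positions
context = np.sum(alphas * a, axis=1, keepdims=True)     # weighted sum of encoder states: (batch, 1, 2*n_a)

print(alphas.shape, context.shape)
```

The softmax runs over axis 1 (the input positions), which is why the notebook needs a custom softmax rather than one over the last axis.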

With the building blocks in place, we can now assemble the entire model.

In [45]:
from keras.layers import Bidirectional, LSTM

n_a = 32 # hidden state size of each direction of the pre-attention BiLSTM (64 after concatenation)
n_s = 64 # hidden state size of the post-attention LSTM
post_activation_LSTM_cell = LSTM(n_s,return_state=True)
output_layer = Dense(units=len(machine_vocab),activation=custom_softmax)

In [46]:
from keras.layers import Input
from keras.models import Model

def model(max_length_X,max_length_y,n_a,n_s,human_vocab_size,machine_vocab_size):
    
    X = Input(shape=(max_length_X,human_vocab_size))
    s0 = Input(shape=(n_s,),name="s0") # Initial hidden state for the decoder LSTM 
    c0 = Input(shape=(n_s,),name="c0") # Initial cell state for the decoder LSTM 
    s = s0
    c = c0
    
    outputs = []
    
    # Define the pre-attention BiLSTM; its input shape is inferred from X
    a = Bidirectional(LSTM(n_a,return_sequences=True))(X)
    
    # Iterate for max_length_y steps
    for t in range(max_length_y):
        
        # Perform one step of the attention mechanism to get back the context vector 
        context = one_step_attention(a,s)
        
        # Apply the post-attention LSTM cell to the "context" vector
        s, _, c = post_activation_LSTM_cell(context,initial_state=[s,c])
        
        # Apply Dense layer to the hidden state output of the post-attention LSTM
        out = output_layer(s)
        
        # Append to the output list
        outputs.append(out)
    
    # Create model instance taking three inputs and returning the list of outputs
    model = Model(inputs=[X,s0,c0],outputs=outputs)
    
    return model

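Create the model. Note that this assignment rebinds the name model from the builder function to the model instance, so the cell below can only be run once per session.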

In [49]:
model = model(max_length_X,max_length_y,n_a,n_s,len(human_vocab),len(machine_vocab))

Take a look at the model summary

In [50]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 30, 37)       0                                            
__________________________________________________________________________________________________
s0 (InputLayer)                 (None, 64)           0                                            
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 30, 64)       17920       input_2[0][0]                    
__________________________________________________________________________________________________
repeat_vector_3 (RepeatVector)  (None, 30, 64)       0           s0[0][0]                         
                                                                 lstm_4[0][0]                     
          

Specify the optimizer, loss function, and metrics.

In [51]:
from keras.optimizers import Adam

optimizer = Adam(lr=0.005,beta_1=0.9,beta_2=0.999,decay=0.01)
model.compile(loss="categorical_crossentropy",optimizer=optimizer,metrics=['accuracy'])
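
The `decay=0.01` argument shrinks the learning rate as training proceeds. The sketch below assumes the time-based schedule used by the legacy Keras 2 optimizers, where the rate is divided by `1 + decay * iterations` and `iterations` counts batch updates; `decayed_lr` is a hypothetical helper, and newer Keras versions handle decay differently.

```python
def decayed_lr(initial_lr, decay, iteration):
    """Time-based decay (assumed legacy Keras 2 behavior): initial_lr / (1 + decay * iteration)."""
    return initial_lr / (1.0 + decay * iteration)

# 10,000 examples with the batch size of 100 used for training in this notebook.
updates_per_epoch = 10000 // 100

for epoch in (0, 1, 10, 30):
    step = epoch * updates_per_epoch
    print("after epoch", epoch, "lr =", decayed_lr(0.005, 0.01, step))
```

Under that assumption the rate has already halved after the first epoch, which is worth keeping in mind when deciding whether more epochs will help.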

### Training the model

In [55]:
s0 = np.zeros((m,n_s))
c0 = np.zeros((m,n_s))
outputs = list(y_one_hot.swapaxes(0,1))
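
The `swapaxes` call is needed because the model has one output per decoding step, so Keras expects a list of 10 target arrays rather than one array of shape (m, 10, 11). A small sketch with toy sizes (the `_toy` names are hypothetical):

```python
import numpy as np

m_toy, Ty, vocab_size = 4, 10, 11

# Same layout as y_one_hot: (examples, steps, classes), filled with distinct values.
y_toy = np.arange(m_toy * Ty * vocab_size, dtype=float).reshape(m_toy, Ty, vocab_size)

# One target array per output step, each of shape (examples, classes).
per_step_targets = list(y_toy.swapaxes(0, 1))

print(len(per_step_targets), per_step_targets[0].shape)
```

Entry `t` of the list holds the step-`t` character of every example, matching the `t`-th element of the model's `outputs` list.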

In [82]:
model.fit([X_one_hot,s0,c0],outputs,epochs=30,batch_size=100)

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x18986ef9550>

### Testing the model

Let's now test our model on some custom data. Feel free to change the custom data.

In [84]:
test_samples = ["11 November 1993","8 June 1967","20th of February 1992","Sat 10 Jul 2007","Tuesday Sept 11 2011"]

for sample in test_samples:
    
    source = string_to_int(sample,max_length_X,human_vocab)
    source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)),source)))
    source = np.array([source])
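    # Feed one initial state per input sample (a batch of 1 here)
    prediction = model.predict([source,s0[:1],c0[:1]])
    # prediction is a list of max_length_y arrays of shape (1, vocabulary size)
    prediction = np.argmax(prediction,axis=-1).ravel()
    output = [inv_machine_vocab[int(entry)] for entry in prediction]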
    
    print("source: ", sample)
    print("output: ", "".join(output))
    print("=" * 50)

source:  11 November 1993
output:  1993-11-11
source:  8 June 1967
output:  1976-06-07
source:  20th of February 1992
output:  1992-02-20
source:  Sat 10 Jul 2007
output:  2007-07-10
source:  Tuesday Sept 11 2011
output:  2011-09-11


The model does not get every test sample right: "8 June 1967" comes out as "1976-06-07" instead of "1967-06-08". Training it for more epochs should help.
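
To put a number on this, one option is exact-match accuracy: the share of outputs that equal the expected date character for character. The helper below is a hypothetical sketch, applied to the five outputs printed above and to the targets from the examples at the top of this document.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predicted strings that equal their target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)

# The five outputs printed above, against the dates they should have produced.
predicted = ["1993-11-11", "1976-06-07", "1992-02-20", "2007-07-10", "2011-09-11"]
expected = ["1993-11-11", "1967-06-08", "1992-02-20", "2007-07-10", "2011-09-11"]

print(exact_match_accuracy(predicted, expected))
```

Five samples are far too few for a reliable estimate; running the same measurement on a freshly generated held-out set would give a more meaningful figure.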