This script implements a basic character-level sequence-to-sequence model that translates human-readable dates ("25th of June, 2009") into machine-readable dates ("2009-06-25").

#### The basic steps of the algorithm are:
- Start with input sequences from a domain (human readable dates)
    and the corresponding target sequences from another domain
    (dates in a standard format).
- An encoder LSTM turns input sequences into two state vectors
    (we keep the last LSTM state and discard the outputs).
- A decoder LSTM is trained to turn the target sequences into
    the same sequence but offset by one timestep in the future,
    a training process called "teacher forcing" in this context.
    The decoder uses as initial state the state vectors returned by the encoder.
    Effectively, the decoder learns to generate `targets[t+1...]`
    given `targets[...t]`, conditioned on the input sequence.
- In inference mode, when we want to decode unknown input sequences, we:
    - Encode the input sequence into state vectors.
    - Start with a target sequence of size 1
        (just the start-of-sequence character)
    - Feed the state vectors and 1-char target sequence
        to the decoder to produce predictions for the next character
        (in fact, a distribution for the next char is generated)
    - Sample the next character using these predictions
        (we simply use argmax).
    - Append the sampled character to the target sequence.
    - Repeat until we generate the end-of-sequence character or we
        hit the character limit.
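
The one-timestep offset used in teacher forcing can be illustrated on a single target string. This is a minimal sketch that works on plain token lists rather than one-hot arrays; the helper name `teacher_forcing_pair` is hypothetical and not part of the notebook.

```python
START, END = "<start>", "<end>"

def teacher_forcing_pair(target):
    """Return (decoder_input, decoder_target) token lists for one target string."""
    tokens = list(target)
    decoder_input = [START] + tokens   # what the decoder sees at each timestep
    decoder_target = tokens + [END]    # what it has to predict at that timestep
    return decoder_input, decoder_target

dec_in, dec_out = teacher_forcing_pair("2009-06-25")
for step, (seen, expected) in enumerate(zip(dec_in, dec_out)):
    print(step, repr(seen), "->", repr(expected))
```

Both lists have 11 entries, and the target at step `t` is the input at step `t+1`, which is the same relationship the notebook builds between `decoder_input_data` and `decoder_target_data`.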

# Data
Dates are randomly generated in the load_dataset() function.


# References
- Sequence to Sequence Learning with Neural Networks
    https://arxiv.org/abs/1409.3215
- Learning Phrase Representations using
    RNN Encoder-Decoder for Statistical Machine Translation
    https://arxiv.org/abs/1406.1078

In [37]:
from nmt_utils import *

In [38]:
#Load dates datasets
m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)

100%|██████████| 10000/10000 [00:00<00:00, 16431.80it/s]


In [39]:
#[CHECK]
Tx = 30
Ty = 10
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

print("X.shape:", X.shape, "-> the maximum length of a phrase is set to 30")
print("Y.shape:", Y.shape, "-> the output date in standardized format has length=10")
print("Xoh.shape:", Xoh.shape, "-> the human vocab contains 37 chars")
print("Yoh.shape:", Yoh.shape, "-> the machine vocab contains 11 chars")

X.shape: (10000, 30) -> the maximum length of a phrase is set to 30
Y.shape: (10000, 10) -> the output date in standardized format has length=10
Xoh.shape: (10000, 30, 37) -> the human vocab contains 37 chars
Yoh.shape: (10000, 10, 11) -> the machine vocab contains 11 chars


In [40]:
# Reverse-lookup token index to decode sequences back to
# something readable.
reverse_input_char_index = dict(
    (i, char) for char, i in human_vocab.items())

In [41]:
# Add two additional chars to the dict: a start char and an end char
reverse_target_char_index = {}
reverse_target_char_index = dict(
    (i, char) for char, i in machine_vocab.items())
reverse_target_char_index.update({11:'<start>', 12:'<end>'})
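
The indices 11 and 12 above are hard-coded on the assumption that the machine vocabulary has exactly 11 entries. A possible alternative, sketched below, derives the indices from the vocabulary itself. The helper `add_special_tokens` and the toy vocabulary are hypothetical; the real mapping comes from `load_dataset`.

```python
def add_special_tokens(vocab, specials=("<start>", "<end>")):
    """Return (extended_vocab, reverse_lookup) with specials appended after the largest index."""
    extended = dict(vocab)
    next_index = max(extended.values()) + 1 if extended else 0
    for token in specials:
        if token not in extended:
            extended[token] = next_index
            next_index += 1
    reverse = {index: token for token, index in extended.items()}
    return extended, reverse

# Hypothetical 11-entry vocabulary: '-' plus the ten digits.
toy_machine_vocab = {ch: i for i, ch in enumerate("-0123456789")}
extended_vocab, reverse_lookup = add_special_tokens(toy_machine_vocab)
print(extended_vocab["<start>"], extended_vocab["<end>"], len(reverse_lookup))  # 11 12 13
```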

In [42]:
# Use Yoh to build decoder_input_data and decoder_target_data
num_samples,_,_ = Yoh.shape
Tx_decoder = 11 # 10 (length of YYYY-MM-DD) plus one additional char (<start> or <end>)
num_decoder_tokens = len(reverse_target_char_index) # returns 13
decoder_input_data = np.zeros([num_samples, Tx_decoder, num_decoder_tokens])
decoder_target_data = np.zeros([num_samples, Tx_decoder, num_decoder_tokens])

## Prepare training data

In [43]:
#####################
# encoder input data
#####################
encoder_input_data = Xoh

####################
# decoder_input_data
####################
# timestep = 0 : populate with <start>
oh_start_char = np.zeros(num_decoder_tokens)
oh_start_char[11] = 1 # index 11 corresponds to '<start>'
decoder_input_data[:, 0] = oh_start_char # all the examples get a <start> char at the first timestep

for i in range(num_samples):
    for j in range(0,Tx_decoder-1): # j=0,...,9
        for k in range(num_decoder_tokens - 2): # minus 2 because of the <start> and <end> chars we added
            decoder_input_data[i][j+1][k] = Yoh[i][j][k]
            
#####################
# decoder_target_data
#####################

oh_end_char = np.zeros(num_decoder_tokens)
oh_end_char[12] = 1 # index 12 corresponds to '<end>'
decoder_target_data[:, Tx_decoder-1] = oh_end_char # targets get an <end> char in the last timestep

for i in range(num_samples):
    for j in range(0, Tx_decoder-1):
        for k in range(num_decoder_tokens):
            # decoder_target_data is one time step ahead of the decoder_input_data
            decoder_target_data[i][j][k] = decoder_input_data[i][j+1][k] 
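
The triple loops above can be slow for 10000 samples. The same arrays can probably be built with NumPy slicing instead; the sketch below assumes the layout used in this notebook (start token at index 11, end token at index 12, 13 decoder tokens) and runs on a small random stand-in for `Yoh`. The function name `build_decoder_arrays` is hypothetical.

```python
import numpy as np

def build_decoder_arrays(Yoh, num_decoder_tokens, start_index, end_index):
    """Build teacher-forcing input/target arrays from one-hot targets of shape (m, Ty, n)."""
    num_samples, Ty, num_target_tokens = Yoh.shape
    steps = Ty + 1  # one extra timestep for <start> (input) or <end> (target)
    decoder_input = np.zeros((num_samples, steps, num_decoder_tokens))
    decoder_target = np.zeros((num_samples, steps, num_decoder_tokens))
    decoder_input[:, 0, start_index] = 1.0              # timestep 0 is <start>
    decoder_input[:, 1:, :num_target_tokens] = Yoh      # then the target characters
    decoder_target[:, :-1, :] = decoder_input[:, 1:, :]  # targets are one step ahead
    decoder_target[:, -1, end_index] = 1.0              # last timestep is <end>
    return decoder_input, decoder_target

# Random stand-in for Yoh: 4 samples, 10 timesteps, 11 machine-vocab entries.
rng = np.random.default_rng(0)
toy_indices = rng.integers(0, 11, size=(4, 10))
toy_Yoh = np.eye(11)[toy_indices]
dec_in, dec_tgt = build_decoder_arrays(toy_Yoh, 13, start_index=11, end_index=12)
print(dec_in.shape, dec_tgt.shape)
```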

# Auxiliary functions

In [44]:
def from_oh_to_chars(matrix, reverse_dictionary):
    # Take a one-hot encoded two-dimensional ndarray and
    # translate it back to a human-readable phrase.
    # Decoding stops at the first row that contains no 1.
    tx, dim = matrix.shape
    resu = str()
    for i in range(tx):
        if len(np.where(matrix[i]==1)[0])==0:
            break
        else:
            index = np.where(matrix[i]==1)[0][0]
            resu+=(reverse_dictionary[index])
    return resu

print(encoder_input_data[1],'\n', from_oh_to_chars(encoder_input_data[1], reverse_input_char_index))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]] 
 27 may 2014<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


In [45]:
def from_encode_to_chars(vector_indices, reverse_input_char_index):
    resu = ''   
    for _,index in enumerate(vector_indices):
        resu+=reverse_input_char_index[index]
    return resu

print(X[0], " \nis decoded to \n", from_encode_to_chars(X[0], reverse_input_char_index))

[30 31 17 29 16 13 34  0 24 13 28 15 20  0  5 11  0  4 12 12  8 36 36 36
 36 36 36 36 36 36]  
is decoded to 
 tuesday march 28 1995<pad><pad><pad><pad><pad><pad><pad><pad><pad>


In [46]:
def encode_string(phrase, human_vocab, Tx):
    # Take a string and return an array of length Tx holding the index of each char
    # (in the dictionary human_vocab), padded with the index of '<pad>'.
    # For example, the phrase 'foo' starts with [18,26,26] and the rest is padding.
    # The phrase must not be longer than Tx.
    resu = np.zeros(Tx, dtype=np.int8)
    for idx,char in enumerate(phrase):
        resu[idx] = human_vocab.get(char)
    resu[len(phrase):] = human_vocab.get('<pad>')
    return resu

def from_encode_to_oh(encoded_phrase_indices, reverse_input_char_index):
    cols = encoded_phrase_indices
    matrix = np.zeros((len(encoded_phrase_indices) , len(reverse_input_char_index) ) )
    matrix[np.arange(len(encoded_phrase_indices)) ,cols ] = 1
    return matrix

def from_nl_to_oh(phrase,human_vocab,reverse_input_char_index, Tx):
    encoded_phrase_indices = encode_string(phrase, human_vocab, Tx)
    resu = from_encode_to_oh(encoded_phrase_indices,reverse_input_char_index)
    return resu

from_nl_to_oh('20 may 1998', human_vocab, reverse_input_char_index, Tx)

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])
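
A quick way to gain confidence in helpers like these is a round-trip check: encode a phrase to one-hot and decode it back. The sketch below is self-contained and uses a hypothetical toy vocabulary (`toy_vocab`) and hypothetical helper names, so the indices differ from those of `human_vocab`.

```python
import numpy as np

def to_one_hot(text, vocab, length, pad_token="<pad>"):
    """Encode text as a (length, len(vocab)) one-hot matrix, padded with pad_token."""
    indices = [vocab[ch] for ch in text[:length]]
    indices += [vocab[pad_token]] * (length - len(indices))
    return np.eye(len(vocab))[indices]

def from_one_hot(matrix, reverse_vocab, pad_token="<pad>"):
    """Decode a one-hot matrix back to text, dropping padding."""
    tokens = [reverse_vocab[int(i)] for i in matrix.argmax(axis=1)]
    return "".join(t for t in tokens if t != pad_token)

# Toy vocabulary: space, digits, lowercase letters, plus a padding token.
toy_vocab = {ch: i for i, ch in enumerate(" 0123456789abcdefghijklmnopqrstuvwxyz")}
toy_vocab["<pad>"] = len(toy_vocab)
toy_reverse = {i: ch for ch, i in toy_vocab.items()}

encoded = to_one_hot("20 may 1998", toy_vocab, length=30)
print(encoded.shape, from_one_hot(encoded, toy_reverse))
```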

## Random Checks

In [47]:
print("Example 9 may 1998")
print(from_encode_to_chars(X[0], reverse_input_char_index))
print(from_oh_to_chars(decoder_input_data[0], reverse_target_char_index)) # '<start>1995-03-28'
print(from_oh_to_chars(decoder_target_data[0], reverse_target_char_index)) # '1995-03-28<end>'

Example 9 may 1998
tuesday march 28 1995<pad><pad><pad><pad><pad><pad><pad><pad><pad>
<start>1995-03-28
1995-03-28<end>


In [48]:
print("Example 10.09.70")
print(from_encode_to_chars(X[1], reverse_input_char_index))
print(from_oh_to_chars(decoder_input_data[1], reverse_target_char_index)) # '<start>2014-05-27'
print(from_oh_to_chars(decoder_target_data[1], reverse_target_char_index)) # '2014-05-27<end>'

Example 10.09.70
27 may 2014<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
<start>2014-05-27
2014-05-27<end>


In [49]:
print("Example: sunday may 22 1988")
print(from_encode_to_chars(X[5], reverse_input_char_index))
print(from_oh_to_chars(decoder_input_data[5], reverse_target_char_index)) # '<start>1979-10-22'
print(from_oh_to_chars(decoder_target_data[5], reverse_target_char_index)) # '1979-10-22<end>'

Example: sunday may 22 1988
october 22 1979<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
<start>1979-10-22
1979-10-22<end>


## Start with the model

In [50]:
from __future__ import print_function

#from keras.models import Model
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
import numpy as np

In [51]:
# Hyperparameters
batch_size = 64  # Batch size for training.
epochs = 50  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
# num_samples (number of samples to train on) and num_decoder_tokens were defined above.

num_encoder_tokens = len(human_vocab)

In [52]:
#lstm_decoder =  LSTM(latent_dim, return_sequences=True, return_state=True, name='lstm_decoder')
# Let's define here the building blocks of our models
lstm_encoder_layer = LSTM(latent_dim, return_state=True, name='lstm_encoder')
lstm_decoder_layer =  LSTM(latent_dim, return_sequences=True, return_state=True, name='lstm_decoder')
dense_layer = Dense(num_decoder_tokens, activation='softmax', name="decoder_dense")

In [53]:
###########################
# Define model for training
##########################

# Define an input sequence and process it.
def get_training_model(lstm_encoder_layer, lstm_decoder_layer, dense_layer):
    
    encoder_inputs = Input(shape=(None, num_encoder_tokens), name='encoder_inputs')
    
    # The input sequence is encoded; the resulting state vectors are kept; 
    # encoder_outputs, state_h, state_c are tensors
    encoder_outputs, state_h, state_c = lstm_encoder_layer(encoder_inputs)

    # We discard `encoder_outputs` and only keep the states.
    encoder_states = [state_h, state_c]

    decoder_inputs = Input(shape=(None, num_decoder_tokens))

    """
    We set our decoder up to return full output sequences,
    and to return internal states as well. We don't use the
    return states in the training model, but we will use them 
    in the inference phase.
    """
    decoder_outputs_0, _, _ = lstm_decoder_layer(decoder_inputs, initial_state=encoder_states)
    decoder_outputs_1 = dense_layer(decoder_outputs_0)

    # Define the model that will turn
    # `encoder_input_data` and `decoder_input_data` into `decoder_target_data`
    training_model = Model([encoder_inputs, decoder_inputs], decoder_outputs_1)
    return training_model

training_model = get_training_model(lstm_encoder_layer, lstm_decoder_layer, dense_layer)

In [54]:
lstm_decoder_layer.get_config()

{'name': 'lstm_decoder',
 'trainable': True,
 'dtype': 'float32',
 'return_sequences': True,
 'return_state': True,
 'go_backwards': False,
 'stateful': False,
 'unroll': False,
 'units': 256,
 'activation': 'tanh',
 'recurrent_activation': 'hard_sigmoid',
 'use_bias': True,
 'kernel_initializer': {'class_name': 'VarianceScaling',
  'config': {'scale': 1.0,
   'mode': 'fan_avg',
   'distribution': 'uniform',
   'seed': None,
   'dtype': 'float32'}},
 'recurrent_initializer': {'class_name': 'Orthogonal',
  'config': {'gain': 1.0, 'seed': None, 'dtype': 'float32'}},
 'bias_initializer': {'class_name': 'Zeros', 'config': {'dtype': 'float32'}},
 'unit_forget_bias': True,
 'kernel_regularizer': None,
 'recurrent_regularizer': None,
 'bias_regularizer': None,
 'activity_regularizer': None,
 'kernel_constraint': None,
 'recurrent_constraint': None,
 'bias_constraint': None,
 'dropout': 0.0,
 'recurrent_dropout': 0.0,
 'implementation': 1}

In [55]:
training_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     (None, None, 37)     0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, None, 13)     0                                            
__________________________________________________________________________________________________
lstm_encoder (LSTM)             [(None, 256), (None, 301056      encoder_inputs[0][0]             
__________________________________________________________________________________________________
lstm_decoder (LSTM)             [(None, None, 256),  276480      input_3[0][0]                    
                                                                 lstm_encoder[0][1]               
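
The parameter counts reported for the two LSTM layers can be checked by hand. A standard LSTM layer with bias has four gates, each with an input kernel, a recurrent kernel and a bias vector, which gives `4 * (units * (input_dim + units) + units)` weights. The sketch below applies that formula to the shapes used here (37 encoder tokens, 13 decoder tokens, 256 units); the function name is illustrative.

```python
def lstm_param_count(input_dim, units):
    """Trainable weights of a standard LSTM layer with bias: four gates, each with
    an input kernel, a recurrent kernel and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

print(lstm_param_count(37, 256))  # encoder: 301056
print(lstm_param_count(13, 256))  # decoder: 276480
```

Both values agree with the `Param #` column shown for `lstm_encoder` and `lstm_decoder`.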
          

In [56]:
##############
# Run training
##############
TRAINING = False
if TRAINING:
    training_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    training_model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
                       batch_size=batch_size,
                       epochs=epochs,
                       validation_split=0.2)
    # Save model
    training_model.save('s2s_dates_translator.h5')
else:
    training_model.load_weights('s2s_dates_translator.h5')

In [57]:
# Next: inference mode (sampling).
# Here are the steps:
# 1) Encode the input and retain the encoder states as the initial decoder state.
# 2) Run one step of the decoder with this initial state
#    and a "start of sequence" token as target.
#    The output will be the next target token.
# 3) Repeat with the current target token and current states.

def sampling_model(latent_dim, lstm_encoder_layer, lstm_decoder_layer,decoder_dense):
    # Define sampling models
    encoder_inputs = Input(shape=(None, num_encoder_tokens), name='encoder_inputs')
    encoder_outputs, state_h, state_c = lstm_encoder_layer(encoder_inputs)
    # We discard `encoder_outputs` and keep only the states.
    encoder_states = [state_h, state_c]
    
    encoder_model = Model(encoder_inputs, encoder_states)

    decoder_state_input_h = Input(shape=(latent_dim,), name='decoder_state_input_h')
    decoder_state_input_c = Input(shape=(latent_dim,), name='decoder_state_input_c')
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    # This is the lstm we defined before 
    decoder_inputs = Input(shape=(None, num_decoder_tokens))
    decoder_outputs, state_h, state_c = lstm_decoder_layer(decoder_inputs, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = Model(
        [decoder_inputs] + decoder_states_inputs, # the input is a list containing three tensors
        [decoder_outputs] + decoder_states) # the output is also a list containing three tensors
    return encoder_model, decoder_model

In [22]:
encoder_model, decoder_model = sampling_model(latent_dim, lstm_encoder_layer, lstm_decoder_layer, dense_layer)

In [159]:
states_v = encoder_model.predict(encoder_input_data[2:3])
target_seq = np.zeros((1, 11, num_decoder_tokens))
target_seq[0, 0, np.where(oh_start_char==1)[0][0]] = 1.
outs = decoder_model.predict([target_seq]+states_v)

In [160]:
acum=[]
for t in range(0,11):
    token_at_t = np.argmax(outs[0][0][t])
    acum.append(reverse_target_char_index[token_at_t])

In [161]:
acum

['2', '0', '0', '4', '-', '0', '3', '-', '2', '2', '<end>']

In [162]:
decode_sequence(encoder_input_data[2:3], np.where(oh_start_char==1)[0][0],encoder_model, decoder_model)

'2004-03-22<end>'

In [58]:
def decode_sequence(input_seq, index_start_char,encoder_model, decoder_model):
    # Encode the input sequence as state vectors.
    states_value = encoder_model.predict(input_seq)
    
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, index_start_char] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    decoded_sentence = ''
    for char_elem in range(11) :
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)
  
        # Sample a token with argmax
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        
        # Get the char associated with the token
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states. Here we reassign the initial decoder states.  
        states_value = [h, c]

    return decoded_sentence
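
The loop above always runs for 11 steps. The algorithm description also mentions stopping as soon as the end-of-sequence character is generated; the sketch below shows one way that variant might look. It is deliberately independent of Keras: `step_fn` is a hypothetical stand-in for one call to `decoder_model.predict`, and `scripted_step` is a fake decoder used only to make the example runnable.

```python
def greedy_decode(step_fn, initial_state, start_token, end_token, max_steps):
    """Greedy decoding: take the highest-scoring token at each step and stop
    at end_token or after max_steps, whichever comes first."""
    token, state, output = start_token, initial_state, []
    for _ in range(max_steps):
        scores, state = step_fn(token, state)   # one decoder step
        token = max(scores, key=scores.get)     # argmax over the token scores
        if token == end_token:
            break
        output.append(token)
    return "".join(output)

def scripted_step(script):
    """Fake decoder that replays a fixed script; the state is just a position.
    It ignores the previous token, unlike a real decoder."""
    def step_fn(token, position):
        nxt = script[position] if position < len(script) else "<end>"
        return {nxt: 1.0, "?": 0.0}, position + 1
    return step_fn

script = list("2004-03-22") + ["<end>"]
print(greedy_decode(scripted_step(script), 0, "<start>", "<end>", max_steps=11))
```

With this variant the returned string does not include `<end>`, so comparisons against the expected date would not need the `.replace('<end>','')` step.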

### Single check

In [59]:
print("Input date:", from_oh_to_chars(encoder_input_data[0], reverse_input_char_index))
print("Predicted date", decode_sequence(encoder_input_data[0:1], np.where(oh_start_char==1)[0][0], encoder_model, decoder_model)) # '1995-03-28<end>'
print("Expected result is", from_oh_to_chars(decoder_target_data[0], reverse_target_char_index)) # '1995-03-28<end>'

Input date: tuesday march 28 1995<pad><pad><pad><pad><pad><pad><pad><pad><pad>
Predicted date 1995-03-28<end>
Expected result is 1995-03-28<end>


In [60]:
print("Input date:", from_oh_to_chars(encoder_input_data[2], reverse_input_char_index))
print("Predicted date", decode_sequence(encoder_input_data[2:3], np.where(oh_start_char==1)[0][0], encoder_model, decoder_model)) # '2004-03-22<end>'
print("Expected result is", from_oh_to_chars(decoder_target_data[2], reverse_target_char_index)) # '2004-03-22<end>'

Input date: 22 march 2004<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
Predicted date 2004-03-22<end>
Expected result is 2004-03-22<end>


In [27]:
import pandas as pd

In [28]:
df = pd.DataFrame(columns=['matches','decoded','expected', 'input_sentence'])
df.loc[0, ['matches','decoded', 'expected', 'input_sentence']]=[True, 'decoded_value','true_value', 'elem']

In [29]:
for i, elem in enumerate(encoder_input_data[0:10000]):
    input_seq = elem.reshape(1,30,37)
    decoded_value = decode_sequence(input_seq,np.where(oh_start_char==1)[0][0],encoder_model, decoder_model)
    true_value = from_oh_to_chars(decoder_target_data[i], reverse_target_char_index)
    matches = decoded_value==true_value
    df.loc[i, ['matches','decoded', 'expected', 'input_sentence']]=[matches, decoded_value,true_value,  from_oh_to_chars(elem, reverse_input_char_index)]

In [30]:
df.matches.value_counts()

True     9826
False     174
Name: matches, dtype: int64

In [162]:
df['years'] = df.expected.str[0:4]
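
Extracting the year suggests looking at how the errors are distributed over years. A minimal sketch of that analysis in plain Python follows, assuming each record is a `(decoded, expected)` pair of strings; the function name and the three toy records are made up for illustration and are not results from the model.

```python
from collections import Counter

def error_rate_by_year(records):
    """Return {year: fraction of mismatches} for (decoded, expected) string pairs."""
    totals, errors = Counter(), Counter()
    for decoded, expected in records:
        year = expected[:4]          # same slice as df.expected.str[0:4]
        totals[year] += 1
        if decoded != expected:
            errors[year] += 1
    return {year: errors[year] / totals[year] for year in sorted(totals)}

# Made-up records, for illustration only.
toy_records = [
    ("1995-03-28<end>", "1995-03-28<end>"),
    ("199<start><start>", "1998-05-09<end>"),
    ("1998-06-20<end>", "1998-06-20<end>"),
]
print(error_rate_by_year(toy_records))
```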

### Test with data in the training set

In [67]:
for seq_index in range(4):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq,np.where(oh_start_char==1)[0][0],encoder_model, decoder_model)

    print('Input sentence:', dataset[seq_index][0])
    print('Decoded sentence:', decoded_sentence , '(should be ', dataset[seq_index][1], ')')

Input sentence: tuesday march 28 1995
Decoded sentence: 1995-03-28<end> (should be  1995-03-28 )
Input sentence: 27 may 2014
Decoded sentence: 2014-05-27<end> (should be  2014-05-27 )
Input sentence: 22 march 2004
Decoded sentence: 2004-03-22<end> (should be  2004-03-22 )
Input sentence: saturday october 29 1988
Decoded sentence: 1988-10-29<end> (should be  1988-10-29 )


In [68]:
from_oh_to_chars(input_seq[0], reverse_input_char_index)

'thursday january 26 1995<pad><pad><pad><pad><pad><pad>'

### Test with new data

In [79]:
#Load dates datasets
m = 10000
dataset_test, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)

100%|██████████| 10000/10000 [00:00<00:00, 20093.52it/s]


In [80]:
#[CHECK]
Tx = 30
Ty = 10
X_test, Y_test, Xoh_test, Yoh_test = preprocess_data(dataset_test, human_vocab, machine_vocab, Tx, Ty)

In [99]:
i=np.random.randint(0,10000)
print(from_encode_to_chars(Y_test[i], reverse_target_char_index))
print(from_encode_to_chars(X_test[i], reverse_input_char_index))
#print(from_oh_to_chars(decoder_input_data[0], reverse_target_char_index)) # '<start>1998-05-09'
#print(from_oh_to_chars(decoder_target_data[0], reverse_target_char_index)) # '1998-05-09<end>'

1998-06-20
jun 20 1998<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


In [100]:
# use this dataset to check whether the test date is already in the input dataset
nl_input_data= [from_encode_to_chars(X_test[i], reverse_input_char_index).replace('<pad>','') for i in range(0,10000)]

In [136]:
def encode_date_to_oh(date_nl, nl_input_data):
    # Check whether the date passed as argument is already present in nl_input_data
    if date_nl.replace('<pad>','') in nl_input_data:
        raise ValueError('date is already in input dataset')
    
    oh_new_phrase = from_nl_to_oh(date_nl, human_vocab, reverse_input_char_index, 30)
    oh_new_date = np.zeros((1,oh_new_phrase.shape[0], num_encoder_tokens))
    oh_new_date[0] = oh_new_phrase
    return oh_new_date

In [101]:
nl_input_data

['sunday july 23 1995',
 'wednesday june 5 1996',
 '19 feb 2011',
 'monday october 29 1979',
 '20 june 2005',
 '21 10 09',
 '11 09 83',
 '20 november 2004',
 '16 january 1997',
 '29 nov 1971',
 '26.10.72',
 'saturday may 12 1984',
 'thursday september 2 1971',
 'sunday november 13 2005',
 'saturday march 22 2008',
 'june 27 2012',
 'friday october 7 2016',
 'thursday february 19 2009',
 '5 nov 2013',
 '8 jun 1988',
 'february 22 2009',
 'feb 11 2010',
 'sunday september 2 2018',
 'sunday december 13 1992',
 '14 sep 2016',
 '03.02.77',
 '1 september 1994',
 'december 26 2011',
 '21 aug 2001',
 '8/26/88',
 'monday september 27 2004',
 '03.01.92',
 'march 5 1975',
 '4 08 85',
 'saturday november 28 1987',
 'wednesday october 18 2017',
 '11/9/15',
 'saturday january 29 1977',
 'friday june 26 1970',
 '21 august 1983',
 'thursday october 27 2016',
 '26 sep 1984',
 '24.06.08',
 '18 aug 2008',
 'november 20 2010',
 'friday september 25 1992',
 'jul 10 1985',
 '12/23/91',
 'friday july 7 1995'

In [120]:
df_test = pd.DataFrame(columns=['matches','decoded','expected', 'input_sentence'])
df_test.loc[0, ['matches','decoded', 'expected', 'input_sentence']]=[True, 'decoded_value','true_value', 'elem']
for i, elem in enumerate(Xoh_test[0:10000]):
    input_seq = elem.reshape(1,30,37)
    decoded_value = decode_sequence(input_seq,np.where(oh_start_char==1)[0][0],encoder_model, decoder_model).replace('<end>','')
    true_value = from_encode_to_chars(Y_test[i], reverse_target_char_index)
    matches = decoded_value==true_value
    df_test.loc[i, ['matches','decoded', 'expected', 'input_sentence']]=[matches, decoded_value,true_value,  from_oh_to_chars(elem, reverse_input_char_index)]

In [121]:
df_test.matches.value_counts()

True     9809
False     191
Name: matches, dtype: int64

## Study wrong cases

In [61]:
oh_new_phrase = from_nl_to_oh('9 may 1998', human_vocab, reverse_input_char_index, 30)
oh_new_date = np.zeros((1,oh_new_phrase.shape[0], num_encoder_tokens))
oh_new_date[0] = oh_new_phrase
decode_sequence(oh_new_date, np.where(oh_start_char==1)[0][0],encoder_model, decoder_model) # a correct result would be '1998-05-09<end>'

'199<start><start><start><start><start><start><start><start>'

In [63]:
oh_new_phrase = from_nl_to_oh('9 may 1989', human_vocab, reverse_input_char_index, 30)
oh_new_date = np.zeros((1,oh_new_phrase.shape[0], num_encoder_tokens))
oh_new_date[0] = oh_new_phrase
decode_sequence(oh_new_date, np.where(oh_start_char==1)[0][0],encoder_model, decoder_model) # a correct result would be '1989-05-09<end>'

'199<start><start><start><start><start><start><start><start>'

In [64]:
### Switching last two digits from years sometimes works

In [65]:
oh_new_phrase = from_nl_to_oh('9 may 1987', human_vocab, reverse_input_char_index, 30)
oh_new_date = np.zeros((1,oh_new_phrase.shape[0], num_encoder_tokens))
oh_new_date[0] = oh_new_phrase
decode_sequence(oh_new_date, np.where(oh_start_char==1)[0][0],encoder_model, decoder_model) # '1987-05-09<end>'

'1987-05-09<end>'

In [66]:
oh_new_phrase = from_nl_to_oh('9 may 1978', human_vocab, reverse_input_char_index, 30)
oh_new_date = np.zeros((1,oh_new_phrase.shape[0], num_encoder_tokens))
oh_new_date[0] = oh_new_phrase
decode_sequence(oh_new_date, np.where(oh_start_char==1)[0][0],encoder_model, decoder_model) # '1978-05-09<end>'

'1978-05-09<end>'