## Description

This notebook was written while **learning** about:
1. seq2seq implementations using "teacher forcing" in Keras,
2. using DistilBERT via Hugging Face's `transformers`,
3. using `tf.data` and TensorFlow 2 for high-performance data pipelines.

As such, please treat it as a student's scratchbook rather than a tutorial. Ideally, use it only for brainstorming rather than relying on any particular part. It is quite possible that the model design here has hidden "gotchas" or outright glaring mistakes.

In [9]:
import re
import json
from transformers import DistilBertTokenizerFast, TFDistilBertModel, DistilBertConfig
import tensorflow as tf
import tensorflow_io as tfio

import numpy as np
import h5py
from tensorflow.keras.utils import Sequence

from pathlib import Path
data_path = Path('../data') / 'span_model_oie'
model_path = Path('../models')

In [10]:
# `Path` and `data_path` are already defined in the previous cell
path_hdf5 = str(data_path / 'encoded_sample.hdf5')
# Let's quickly get the shapes from HDF5 for bookkeeping
if 'fp' in locals():
    fp.close()
fp = h5py.File(path_hdf5, "r")
x_train = fp['x_train']
x_test = fp['x_test']
y_train = fp['y_train']
y_test = fp['y_test']
x_train_shape = x_train.shape
y_train_shape = y_train.shape
x_test_shape = x_test.shape
y_test_shape = y_test.shape
fp.close()
print("data sizes: x_train %s, y_train %s, x_test %s, y_test %s ." %
      (x_train_shape, y_train_shape, x_test_shape, y_test_shape))
# The first ~90% of the training rows are used for training, the rest for validation
validation_index = 1 + int(0.9 * x_train_shape[0])

x_test = tfio.IODataset.from_hdf5(path_hdf5, dataset='/x_test')
y_test = tfio.IODataset.from_hdf5(path_hdf5, dataset='/y_test')
x_train = tfio.IODataset.from_hdf5(path_hdf5, dataset='/x_train')
y_train = tfio.IODataset.from_hdf5(path_hdf5, dataset='/y_train')


data sizes: x_train (10000, 128), y_train (10000, 64), x_test (2500, 128), y_test (2500, 64) .
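
The split point computed above can be checked by hand. The sketch below is only an illustration: it assumes that `take(n)` keeps the first `n` elements and `skip(n)` drops them (which is how the split is applied later in this notebook), and it mimics both with `itertools.islice` on a plain `range` instead of the HDF5-backed dataset. The names `n_train_rows`, `train_part` and `validation_part` are made up for this example.

```python
from itertools import islice

n_train_rows = 10000                       # x_train.shape[0] from the output above
validation_index = 1 + int(0.9 * n_train_rows)

rows = range(n_train_rows)                 # stand-in for the HDF5-backed dataset
train_part = list(islice(rows, validation_index))             # like .take(validation_index)
validation_part = list(islice(rows, validation_index, None))  # like .skip(validation_index)

print(validation_index, len(train_part), len(validation_part))
```

With 10,000 rows this gives 9,001 training examples and 999 validation examples, and the two parts do not overlap.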


In [11]:
# Thanks to the excellent tutorial at:
# https://towardsdatascience.com/working-with-hugging-face-transformers-and-tf-2-0-89bf35e3555a
# Setup the config and embedding layer, then prep data.
distil_bert = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained(distil_bert)
max_input_size = 128
max_target_size = 64

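# Note: `trainable=False` is only stored as an extra attribute on the config here and
# does not appear to freeze anything; the DistilBERT layer is frozen explicitly further below.
config = DistilBertConfig(dropout=0.2, attention_dropout=0.2, trainable=False)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config=config)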

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_layer_norm', 'activation_13', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


In [12]:
# Instantiate the full model.
# The crucial part for our setup is "teacher forcing", so as to properly teach the full triple generation
# Great starting example at: https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py#L159
vocab_size = tokenizer.vocab_size
num_encoder_tokens = num_decoder_tokens = vocab_size
latent_dim = int(max_input_size)

encoder_inputs = tf.keras.layers.Input(shape=(max_input_size,), name='input_token', dtype='int32')
encoder_masks  = tf.keras.layers.Input(shape=(max_input_size,), name='masked_token', dtype='int32')

lm_embedding = transformer_model(encoder_inputs, attention_mask=encoder_masks)[0]
encoder = tf.keras.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(lm_embedding)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = tf.keras.layers.Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = tf.keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                      initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = tf.keras.Model([encoder_inputs, encoder_masks, decoder_inputs], decoder_outputs)

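# Freeze the two input layers and the DistilBERT layer;
# only the two LSTMs and the dense output layer are trained.
for layer in model.layers[:3]:
    layer.trainable = False

# `summary()` prints by itself; wrapping it in print() only adds a trailing "None"
model.summary()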


Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_token (InputLayer)        [(None, 128)]        0                                            
__________________________________________________________________________________________________
masked_token (InputLayer)       [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_distil_bert_model_1 (TFDisti ((None, 128, 768),)  66362880    input_token[0][0]                
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None, 30522) 0                                            
____________________________________________________________________________________________

In [13]:
# Prepare training data stream and iterate efficiently via tf.data
label_dim = tokenizer.vocab_size
batch_size = 80

full_train_hdf5 = tf.data.Dataset.zip((x_train, y_train))

# We need to map the HDF5 (x, y) into ((x, x_mask, y_teacher), y_target), where
# 1) x_mask is an attention mask that zeros out padding tokens and enables all regular tokens;
# 2) we do "teacher forcing" with y_teacher and a shift-by-1 y_target.
def to_teacher_forcing(x, y):
    x_mask = tf.cast(tf.math.not_equal(x, 0), tf.int64)
    y_teacher = tf.one_hot(y, label_dim)
    # Drop the first target token and pad with 0 ([PAD]) at the end
    y_shifted = tf.concat([tf.slice(y, [1], [max_target_size - 1]), [0]], 0)
    y_target = tf.one_hot(y_shifted, label_dim)
    return (x, x_mask, y_teacher), y_target

full_train = full_train_hdf5.map(to_teacher_forcing)

# we can now split out a validation set and batch the datasets
train_batched = full_train.take(validation_index).batch(batch_size)
validation_batched = full_train.skip(validation_index).batch(batch_size)
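
To make the mapping above easier to follow, here is a minimal NumPy sketch of the same two ideas on toy data: the attention mask and the shift-by-1 target used for teacher forcing. It is not the pipeline itself. The token ids are toy values, `PAD_ID = 0` is assumed from the `not_equal(x, 0)` mask, and the helper names `attention_mask` and `shift_target` are hypothetical.

```python
import numpy as np

PAD_ID = 0  # assumed padding id, matching the `not_equal(x, 0)` mask above

def attention_mask(x):
    """Return 1 for real tokens and 0 for padding."""
    return (x != PAD_ID).astype(np.int64)

def shift_target(y):
    """Drop the first token and append PAD, so target[t] == teacher[t + 1]."""
    return np.concatenate([y[1:], [PAD_ID]])

x = np.array([101, 2023, 2003, 102, 0, 0])  # toy input ids, padded
y = np.array([101, 7592, 102, 0])           # toy target ids, padded

y_teacher, y_target = y, shift_target(y)
print(attention_mask(x).tolist())  # [1, 1, 1, 1, 0, 0]
print(y_teacher.tolist())          # [101, 7592, 102, 0]
print(y_target.tolist())           # [7592, 102, 0, 0]
```

At each position the decoder is fed the true token from `y_teacher` and is trained to predict the token one step ahead, which is what `y_target` holds.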


In [15]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=str(model_path/'checkpoint_sample_oie'),
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    verbose=1,
    save_best_only=True)
earlystop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)


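# Note: `workers` and `use_multiprocessing` only apply to generator or
# `keras.utils.Sequence` inputs; they should have no effect on a tf.data.Dataset.
model.fit(train_batched,
    validation_data=validation_batched,
    workers=8,
    use_multiprocessing=True,
    epochs=20,
    callbacks=[checkpoint, earlystop],
    verbose=1)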

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fcc6c10d220>
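
All 20 epochs ran, so the `EarlyStopping` callback (which monitors the training `loss` with `patience=3`) never fired. As a rough illustration of what a patience rule does, here is a simplified pure-Python sketch. It is not Keras's implementation: it ignores options such as `min_delta` and `restore_best_weights`, and the function name and loss values are made up.

```python
def early_stop_epoch(losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            # Improvement: remember it and reset the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

print(early_stop_epoch([1.0, 0.8, 0.9, 0.85, 0.81, 0.5], patience=3))  # 4
print(early_stop_epoch([1.0, 0.9, 0.8, 0.7], patience=3))              # None
```

Under this simplified rule, a training loss that keeps decreasing every epoch never triggers the stop, which is consistent with a model that is steadily fitting its training set.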

In [86]:
model.save_weights(str(model_path / 'sample_oie.h5'))

In [39]:

# Next: inference mode (sampling).
# Here's the drill:
# 1) encode input and retrieve initial decoder state
# 2) run one step of decoder with this initial state
# and a "start of sequence" token as target.
# Output will be the next target token
# 3) Repeat with the current target token and current states

# Define sampling models
encoder_model = tf.keras.Model([encoder_inputs,encoder_masks], encoder_states)
print(encoder_model.summary())

# The sampling decoder below reuses the trained `decoder_lstm` and `decoder_dense`
# layers from the training model; no new layers need to be created here.


decoder_state_input_h = tf.keras.layers.Input(shape=(latent_dim,))
decoder_state_input_c = tf.keras.layers.Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = tf.keras.Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
print(decoder_model.summary())

Model: "model_14"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_token (InputLayer)        [(None, 128)]        0                                            
__________________________________________________________________________________________________
masked_token (InputLayer)       [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_distil_bert_model_1 (TFDisti ((None, 128, 768),)  66362880    input_token[0][0]                
__________________________________________________________________________________________________
lstm_2 (LSTM)                   [(None, 128), (None, 459264      tf_distil_bert_model_1[0][0]     
Total params: 66,822,144
Trainable params: 459,264
Non-trainable params: 66,362,880
_______

In [55]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    input_mask = np.array([0 if x == 0 else 1 for x in input_seq[0]]).reshape(1, max_input_size)
    states_value = encoder_model.predict([input_seq, input_mask])

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first position of the target sequence with the start token.
    CLS_id = 101  # [CLS]; check with tokenizer.cls_token_id
    PAD_id = 0    # [PAD]; check with tokenizer.pad_token_id
    target_seq[0, 0, CLS_id] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    sentence_ids = []
    
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq]+states_value)

        # Greedily pick the most likely next token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sentence_ids.append(sampled_token_index)

        # Exit condition: either hit max length
        # or find the stop token ([PAD]).
        if (sampled_token_index == PAD_id or
           len(sentence_ids) > max_target_size):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return tokenizer.decode(sentence_ids)
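
The control flow of `decode_sequence` can be tried without any model. The sketch below is a toy: `toy_step` is a hypothetical stand-in for `decoder_model.predict` that returns hand-written scores from a lookup table, and the token ids 7 and 8 are arbitrary. Only the greedy loop and its stop conditions mirror the function above.

```python
def toy_step(token, state):
    """Hypothetical decoder step: returns (scores per token id, new state)."""
    table = {
        101: {7: 0.9, 0: 0.1},
        7: {8: 0.6, 0: 0.4},
        8: {0: 0.8, 7: 0.2},
    }
    return table[token], state + 1

def greedy_decode(step, start_id=101, pad_id=0, max_len=8):
    token, state, out = start_id, 0, []
    while True:
        scores, state = step(token, state)
        token = max(scores, key=scores.get)  # argmax over the scores
        out.append(token)
        # Stop on the pad token or when the output gets too long
        if token == pad_id or len(out) > max_len:
            return out

print(greedy_decode(toy_step))  # [7, 8, 0]
```

As in `decode_sequence`, the stop token itself ends up in the output, which is why the notebook strips `[PAD]` from the decoded string afterwards.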


In [85]:
# Try to extract triples for the first 100 distinct training examples
previous = set()
for input_seq in iter(x_train):
    np_input = np.array(input_seq)
    if str(np_input) in previous:
        continue
    previous.add(str(np_input))
    if len(previous)>100:
        break
    decoded_sentence = decode_sequence(np.array([np_input]))
    print('---')
    print('Input sentence:', tokenizer.decode(input_seq).replace(' [PAD]',''))
    print('')
    print('Decoded triples:', decoded_sentence.replace(' [PAD]',''))
    #print('Expected triples:', tokenizer.decode(y_train[seq_index]))


---
Input sentence: [CLS] of the agricultural land, 21. 6 % is used for growing crops and 13. 9 % is pastures, while 1. 4 % is used for orchards or vine crops and 5. 0 % is used for alpine pastures [SEP]

Decoded triples: 1. 8 [SEP] is used [SEP] for orchards or vine crops [SEP]
---
Input sentence: [CLS] of the agricultural land, 22. 3 % is used for growing crops and 12. 7 % is pastures, while 1. 4 % is used for orchards or vine crops [SEP]

Decoded triples: 1. 8 [SEP] is used [SEP] for orchards or vine crops [SEP]
---
Input sentence: [CLS] of the agricultural land, 24. 1 % is used for growing crops and 15. 5 % is pastures, while 1. 4 % is used for orchards or vine crops [SEP]

Decoded triples: 1. 8 [SEP] is used [SEP] for orchards or vine crops [SEP]
---
Input sentence: [CLS] of the agricultural land, 26. 6 % is used for growing crops and 10. 1 % is pastures, while 1. 4 % is used for orchards or vine crops [SEP]

Decoded triples: 1. 8 [SEP] is used [SEP] for orchards or vine crops [SE

---
Input sentence: [CLS] of the agricultural land, 57. 2 % is used for growing crops, while 1. 4 % is used for orchards or vine crops [SEP]

Decoded triples: 1. 8 [SEP] is used [SEP] for orchards or vine crops [SEP]
---
Input sentence: [CLS] of the agricultural land, 57. 2 % is used for growing crops and 17. 7 % is pastures, while 1. 4 % is used for orchards or vine crops [SEP]

Decoded triples: 1. 8 [SEP] is used [SEP] for orchards or vine crops [SEP]
---
Input sentence: [CLS] of the agricultural land, 57. 2 % is used for growing crops and 6. 0 % is pastures, while 1. 4 % is used for orchards or vine crops [SEP]

Decoded triples: 1. 8 [SEP] is used [SEP] for orchards or vine crops [SEP]
---
Input sentence: [CLS] of the agricultural land, 58. 3 % is used for growing crops and 4. 4 % is pastures, while 1. 4 % is used for orchards or vine crops [SEP]

Decoded triples: 1. 8 [SEP] is used [SEP] for orchards or vine crops [SEP]
---
Input sentence: [CLS] of the agricultural land, 59. 1 % is

---
Input sentence: [CLS] as of 2008, 1. 4 % of the population are resident foreign nationals [SEP]

Decoded triples: 10 % of the population [SEP] are [SEP] the population of the population [SEP]
---
Input sentence: [CLS] as of 2012, 1. 4 % of the population are resident foreign nationals [SEP]

Decoded triples: 10 % of the population [SEP] are [SEP] the population [SEP]
---
Input sentence: [CLS] 1. 4 % of the population was hispanic or latino of any race [SEP]

Decoded triples: 10 % of the population [SEP] were [SEP] hispanic [SEP]
---
Input sentence: [CLS] 1. 4 % of the population was of hispanic or latino ancestry [SEP]

Decoded triples: 10 % of the population [SEP] were [SEP] of the population of the population [SEP]
---
Input sentence: [CLS] 1. 4 % of the population were hispanic or latino of any race [SEP]

Decoded triples: 10 % of the population [SEP] were [SEP] hispanic [SEP]
---
Input sentence: [CLS] 1. 4 % of the population were indigenous australians [SEP]

Decoded triples: 

In [92]:
def string_to_triples(text):
    # Pad (and truncate, if needed) to the fixed encoder input length
    text_ids = tokenizer.encode(text, max_length=max_input_size, padding='max_length', truncation=True)
    triples = decode_sequence(np.array([text_ids])).replace(' [PAD]','')
    print("---")
    print("Input: %s"%text)
    print("")
    print("Extracted: %s"% triples)
    
    
string_to_triples('In mathematics, a monomial is, roughly speaking, a polynomial which has only one term.')

---
Input: In mathematics, a monomial is, roughly speaking, a polynomial which has only one term.

Extracted: 10 the the the [SEP] is [SEP] a a - - - - -....... - - - - - - - - - -propropropronz gymnastics gymnastics gymnastics fife proposal carroll whom whose um um um um um um um um um um um um um um um um um um um um um


### Discussion

As can be seen from the run above, 20 epochs are already enough for a model with about 20 million trainable parameters to overfit the 10,000 training examples. There is nothing to brag about in the decoded triples: generalization is poor, if there is any at all.
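
The parameter count can be checked with a few lines of arithmetic. This sketch assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights and a bias) and that the encoder LSTM, the decoder LSTM and the dense output layer are the only trainable layers. The sizes come from the model summaries above; the helper and variable names are made up.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias
    return input_dim * units + units

vocab_size, bert_dim, latent_dim = 30522, 768, 128

n_encoder_lstm = lstm_params(bert_dim, latent_dim)
n_decoder_lstm = lstm_params(vocab_size, latent_dim)
n_decoder_dense = dense_params(latent_dim, vocab_size)
total_trainable = n_encoder_lstm + n_decoder_lstm + n_decoder_dense

print(n_encoder_lstm)   # 459264, matching the encoder summary above
print(total_trainable)  # 20089914
```

Most of these parameters sit in the decoder LSTM and the output layer, because both scale with the 30,522-token vocabulary.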

This may also be due to the very naive "teacher forcing" implementation, which closely followed the Keras seq2seq example but was written without real expertise: this is a very first attempt at learning the seq2seq training setup over such a large vocabulary.
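
One concrete cost of the large vocabulary is the size of the one-hot tensors in the data pipeline. The back-of-the-envelope sketch below assumes the one-hot values are 32-bit floats (the default of `tf.one_hot`) and uses the sizes from this notebook; the variable names are made up.

```python
vocab_size = 30522      # tokenizer.vocab_size, as in the model summary
max_target_size = 64
batch_size = 80
bytes_per_float32 = 4   # assumption: default float32 one-hot values

per_example = max_target_size * vocab_size * bytes_per_float32
per_batch = per_example * batch_size

print(per_example)        # 7813632 bytes, about 7.5 MiB per example
print(per_batch / 2**20)  # about 596 MiB per batch
```

Both the teacher input and the shifted target are one-hot encoded, so under these assumptions each batch carries roughly twice this amount.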

Much more likely, 10,000 training examples are simply too few for a model with this many parameters, and/or the setup was somehow ill-posed. This may become more apparent with a run over the full data.