<a href="https://colab.research.google.com/github/dfossl/IWSS_ReactionPrediction_CoLab/blob/main/IWSS_ReactionPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hello!

### This notebook was made as a more detailed walkthrough of code that was briefly mentioned in [my IWSS talk](https://video.unbc.ca/media/IWSS+January+21st+2022.mp4/0_jl9w5gyh/282400) on the appropriation of RNN language Seq2Seq models in reaction prediction. I made the notebook to explain some of the code/theory/math that was hand-waved over in the talk.

### I assume the notebook will be navigated primarily by undergraduate students who have some interest in deep learning, so the notebook is written assuming that level of education. I will also assume that the reader has the background provided in the talk.

### I also feel it is important to note that [Schwaller et al. 2018](https://pubs.rsc.org/en/content/articlelanding/2018/sc/c8sc02339e) was the major inspiration for this work and the data is sourced from their paper.

### Note: Still a work in progress. Will likely add more descriptions in future!

### Have questions? --> Dylan.T.Fossl@gmail.com

---

If you stumbled on this but didn't watch the talk, I would recommend watching it for important background.

If you find any errors let me know!

Importing libraries

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Data
I have an example dataset on my GitHub. It samples 1100 reactions from an original dataset found [here](https://figshare.com/articles/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873).

Reactions are between 30 and 40 characters in length. I added the length constraint to make sure this subset was more uniform in size rather than sampling randomly from the whole set.

In [2]:
path_to_data = tf.keras.utils.get_file(fname="example_reaction_data_1100.csv",
                        origin="https://raw.githubusercontent.com/dfossl/IWSS_ReactionPrediction_CoLab/main/example_reaction_data_1100.csv")

We can load the data with pandas and have a quick look.

In [3]:
reactions = pd.read_csv(path_to_data, index_col=0, dtype="string")
reactions.head()

Unnamed: 0,Reaction,Product
895533,C c 1 n c 2 c c c c c 2 [nH] 1 . Cl C C C Br >...,C c 1 n c 2 c c c c c 2 n 1 C C C Cl
29754,O = S ( Cl ) Cl . O C c 1 c c c c c 1 O c 1 c ...,Cl C c 1 c c c c c 1 O c 1 c c c c c 1
377126,C O C ( = O ) c 1 c c ( C O ) c ( C ) o 1 > A_...,C O C ( = O ) c 1 c c ( C = O ) c ( C ) o 1
857844,N c 1 c c c c c 1 . O = C c 1 c c c o 1 > A_Cl...,C ( = N / c 1 c c c c c 1 ) \ c 1 c c c o 1
2247,Cl c 1 n c n c c 1 O c 1 c c c c c 1 . [NH4+] ...,N c 1 n c n c c 1 O c 1 c c c c c 1


The reactions are SMILES strings that have already been tokenized. The tokens are separated by spaces.

If interested in the rules for tokenization they are described in [Schwaller et al. 2018](https://pubs.rsc.org/en/content/articlelanding/2018/sc/c8sc02339e).

If interested in SMILES format see the [wiki](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system).

---
The reactions have the form:
Reactants > Reagents > Products

>  - **Reactants** are chemicals that are needed for the reaction and go through a chemical change. Another way to say this is that the reactants' atoms can be mapped to the product.

>  - **Reagents** are chemicals that are needed for the reaction but DO NOT go through a chemical change. In this dataset the reagents are unified into their own single tokens. The idea is that since the reagent does not map to the product, the categorization of its identity is enough to capture the importance of its presence. However, reagents can be general and many reagents can actually perform similar roles, so breaking reagents into normal tokens could allow the model to learn generality among reagents. Maybe an exercise for the reader 😉.

> - **Products** are a single SMILES string. The dataset only contains reactions with single products.
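To make the format concrete, here is a small sketch that splits one tokenized reaction string from the dataset into its reactant molecules and its reagent tokens. The function name `split_reaction` is made up for illustration and is not used elsewhere in the notebook. It assumes reagents always appear as single `A_`-prefixed tokens after the `>`, which is how they look in the samples printed in this notebook.

```python
def split_reaction(reaction):
    """Split a tokenized 'reactants > reagents' string (hypothetical helper)."""
    # Everything before the first '>' is reactants, everything after is reagents.
    reactant_part, _, reagent_part = reaction.partition(">")

    # Reactant molecules are separated by a standalone '.' token.
    molecules, current = [], []
    for token in reactant_part.split():
        if token == ".":
            molecules.append("".join(current))
            current = []
        else:
            current.append(token)
    if current:
        molecules.append("".join(current))

    # Each reagent is already a single token such as 'A_ClCCl'.
    reagents = reagent_part.split()
    return molecules, reagents


example = "C c 1 c c ( C O ) c c 2 c 1 C C C C 2 . O = S ( Cl ) Cl > A_ClCCl"
molecules, reagents = split_reaction(example)
print(molecules)  # ['Cc1cc(CO)cc2c1CCCC2', 'O=S(Cl)Cl']
print(reagents)   # ['A_ClCCl']
```

Joining the tokens back together gives ordinary SMILES strings, which is handy if you want to paste a molecule into a structure viewer.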


In [4]:
train, test = train_test_split(reactions, test_size=100)

In [5]:
test = test[test["Reaction"].str.len() <= max(train["Reaction"].str.len())]
test = test[test["Product"].str.len() <= max(train["Product"].str.len())]

## Creating a TensorFlow Dataset

Although not always necessary, we are using [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) to create a TensorFlow dataset. TensorFlow datasets allow us to preset the batch size and the shuffle buffer size.

In [6]:
BUFFER_SIZE = len(train)//4
BATCH_SIZE = 32

trainDataset = tf.data.Dataset.from_tensor_slices((train["Reaction"], train["Product"])).shuffle(BUFFER_SIZE)
trainDataset = trainDataset.batch(BATCH_SIZE)
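If the buffer size in `shuffle(BUFFER_SIZE)` feels mysterious, the pure Python sketch below mimics the idea as I understand it: keep a fixed-size buffer, emit a random element from it, and refill from the stream. The helper names `buffered_shuffle` and `make_batches` are hypothetical, and this is only an approximation of the behaviour, not how tf.data is actually implemented.

```python
import random


def buffered_shuffle(items, buffer_size, rng):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Buffer is full: emit one random element to make room.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Input exhausted: drain whatever is left in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


def make_batches(items, batch_size):
    """Group a stream of items into lists of at most batch_size elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # The final batch may be smaller than batch_size.
        yield batch


rng = random.Random(0)
shuffled = buffered_shuffle(range(10), buffer_size=4, rng=rng)
batches = list(make_batches(shuffled, batch_size=3))
print(batches)
```

A small buffer only mixes elements that are close together in the original order, while a buffer as large as the dataset gives a full shuffle.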

In [7]:
BUFFER_SIZE = len(train)//4
BATCH_SIZE = 32

testDataset = tf.data.Dataset.from_tensor_slices((test["Reaction"], test["Product"])).shuffle(BUFFER_SIZE)
testDataset = testDataset.batch(BATCH_SIZE)

We can take a peek at what a batch looks like.

In [8]:
for example_rxn_batch, example_product_batch in trainDataset.take(-1):
  print(f"Batch input shape: {tf.shape(example_rxn_batch)}")
  print(f"First 5 inputs of batch:\n{example_rxn_batch[:5]}")
  print()
  # Report the product batch here, not the reaction batch again.
  print(f"Batch output shape: {tf.shape(example_product_batch)}")
  print(f"First 5 outputs of batch:\n{example_product_batch[:5]}")
  print()
  break

Batch input shape: [32]
First 5 inputs of batch:
[b'C C ( C ) C [C@H] ( C C ( = O ) O ) C ( = O ) O >'
 b'C c 1 c c ( C O ) c c 2 c 1 C C C C 2 . O = S ( Cl ) Cl > A_ClCCl'
 b'C C C C I . C O C ( = O ) c 1 c c n c c 1 . [OH-] > A_O A_[Na+] A_Cl A_Cc1ccccc1 A_CC#N A_CC(C)=O'
 b'C [O-] . C c 1 c c ( Cl ) n c ( Cl ) c 1 > A_[Na+] A_CO'
 b'N = C ( N ) c 1 c c c ( C ( F ) ( F ) F ) c c 1 > A_[Na+] A_C[O-]']

Batch output shape: [32]



We will be processing the input text using the TextVectorization class provided by TensorFlow. It allows us to write our own text processing function. In this case we will just be adding our START and END tokens.

In [9]:
def tf_format_reactions(text):


  text = tf.strings.strip(text)

  text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
  return text


Running the adapt() method can take some time. This is constructing our vocabulary for us.

In [10]:
rxn_processor = tf.keras.layers.TextVectorization(
    standardize=tf_format_reactions)
rxn_processor.adapt(train["Reaction"])

In [11]:
product_processor = tf.keras.layers.TextVectorization(
    standardize=tf_format_reactions)
product_processor.adapt(train["Product"])

If we call get_vocabulary() we can see a list of all of our tokens. 

In [12]:
rxn_processor.get_vocabulary()[:5]

['', '[UNK]', 'c', 'C', '1']

Our text processor can take in a string of tokens and provide us with an integer encoding. These integers are just the indices in the vocabulary.

In this example we can see the zero padding.

In [13]:
print(f'Reaction: {example_rxn_batch[0]}')
print(f"Tokens: {rxn_processor(example_rxn_batch)[0]}")

Reaction: b'C C ( C ) C [C@H] ( C C ( = O ) O ) C ( = O ) O >'
Tokens: [ 9  3  3  7  3  6  3 49  7  3  3  7  8  5  6  5  6  3  7  8  5  6  5 11
 10  0  0  0  0  0  0  0]
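To demystify what the text processor is doing, here is a rough pure Python sketch of the same idea: add the START and END tokens, build a vocabulary sorted by frequency with index 0 reserved for padding and index 1 for unknown tokens, then map tokens to indices and pad with zeros. The function names are hypothetical, and the alphabetical tie-breaking between equally frequent tokens is my own assumption, so the exact indices will not match `TextVectorization`.

```python
from collections import Counter


def format_reaction(text):
    # Same idea as tf_format_reactions: strip and wrap in START/END tokens.
    return f"[START] {text.strip()} [END]"


def build_vocab(texts):
    counts = Counter(tok for t in texts for tok in format_reaction(t).split())
    # Most frequent tokens first; ties broken alphabetically (an assumption).
    tokens = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # Index 0 is the padding token and index 1 is the unknown token.
    return ["", "[UNK]"] + tokens


def encode(text, vocab, length):
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in format_reaction(text).split()]
    # Pad with zeros up to the requested length.
    return ids + [0] * (length - len(ids))


vocab = build_vocab(["C C O", "C N"])
print(vocab)                     # ['', '[UNK]', 'C', '[END]', '[START]', 'N', 'O']
print(encode("C O", vocab, 6))   # [4, 2, 6, 3, 0, 0]
print(encode("C S", vocab, 5))   # [4, 2, 1, 3, 0]  ('S' is unknown)
```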


# Encoder

With the data available we can explore the construction of the encoder.

The encoder is just an RNN, and we will be using the [tf.keras.layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers) module to construct it.

Here are a few examples.

In [14]:
# Make a single layer RNN with 10 units
cells = [tf.keras.layers.SimpleRNNCell(units=10)]
example_encoder = tf.keras.layers.RNN(cells, return_sequences=True, return_state=True)

In [15]:
# Make example data
print(f"Shape before onehot (batch, seq_length): {tf.shape(rxn_processor(example_rxn_batch))}")
oneHot = tf.one_hot(rxn_processor(example_rxn_batch), rxn_processor.vocabulary_size())
print(f"Shape after onehot (batch, seq_length, vocab_size): {tf.shape(oneHot)}")

Shape before onehot (batch, seq_length): [32 32]
Shape after onehot (batch, seq_length, vocab_size): [ 32  32 135]


When we pass the data through the RNN we get two things returned. We get an output, which holds the hidden states for each step of the RNN; this is important if we use attention. And we get the state, which for a simple RNN and the GRU is the '*h*' matrix shown in examples in the talk. For LSTM the state is a pair of matrices, *h* and *c*, as discussed in the presentation.


In [16]:
#Pass through encoder
output, state = example_encoder(oneHot)
print(f"Output shape (batch, seq_length, units): {tf.shape(output)}")
print(f"State shape (batch, units): {tf.shape(state)}")


Output shape (batch, seq_length, units): [32 32 10]
State shape (batch, units): [32 10]
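To see where the two return values come from, here is a minimal pure Python sketch of a simple RNN run over one sequence (no batch dimension). It assumes the standard update h_t = tanh(x_t W + h_{t-1} U + b). The function name `rnn_forward` and the tiny hand-picked weights are made up for illustration, and Keras details such as initializers, dropout and batching are left out.

```python
import math


def rnn_forward(inputs, W, U, b):
    """Run a simple RNN over one sequence.

    inputs: list of input vectors (seq_length x input_dim)
    W: input weights (input_dim x units), U: recurrent weights (units x units)
    b: bias (units)
    """
    units = len(b)
    h = [0.0] * units  # initial hidden state
    outputs = []
    for x in inputs:
        h = [
            math.tanh(
                sum(x[i] * W[i][j] for i in range(len(x)))
                + sum(h[k] * U[k][j] for k in range(units))
                + b[j]
            )
            for j in range(units)
        ]
        outputs.append(h)  # the hidden state at every step
    return outputs, h      # h is the hidden state after the last step


# Three one-hot encoded tokens from a vocabulary of size 2, and 2 units.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

outputs, state = rnn_forward(inputs, W, U, b)
print(len(outputs), len(outputs[0]))  # (seq_length, units)
print(outputs[-1] == state)           # the state is the last output
```

The output keeps one hidden vector per step, while the state is only the final one, which is why the output has an extra seq_length dimension.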


We can do multiple layers like this:




In [17]:

cells = [tf.keras.layers.SimpleRNNCell(units=10), tf.keras.layers.SimpleRNNCell(units=10)]
example_encoder = tf.keras.layers.RNN(cells, return_sequences=True, return_state=True)


In the case of two layers we will have two states returned.

In [18]:

output, *state = example_encoder(oneHot)
print(f"Output shape (batch, seq_length, units): {tf.shape(output)}")
print(f"Number of states: {len(state)}")
print(f"State 1 shape (batch, units): {tf.shape(state[0])}")
print(f"State 2 shape (batch, units): {tf.shape(state[1])}")



Output shape (batch, seq_length, units): [32 32 10]
Number of states: 2
State 1 shape (batch, units): [32 10]
State 2 shape (batch, units): [32 10]


And we can make the RNN bidirectional like this:

In [19]:
cells = [tf.keras.layers.SimpleRNNCell(units=10), tf.keras.layers.SimpleRNNCell(units=10)]
example_encoder = tf.keras.layers.Bidirectional(tf.keras.layers.RNN(cells, return_sequences=True, return_state=True))
output, *state = example_encoder(oneHot)
print(f"Output shape (batch, seq_length, units): {tf.shape(output)}")
print(f"Number of states: {len(state)}")

Output shape (batch, seq_length, units): [32 32 20]
Number of states: 4


It is important to point out here that the output has doubled in unit size. This is because by default the outputs of the forward and backward RNNs are concatenated. We also have 4 states for a bidirectional RNN with two layers. However, for bidirectionality it is important to combine the forward and reverse states of each layer. We can use [tf.keras.layers.Concatenate()](https://keras.io/api/layers/merging_layers/concatenate/) to do this. Keras builds the returned list by appending the backward RNN's states to the forward RNN's states, so the states follow the pattern [layer_1_forward, layer_2_forward, ..., layer_1_back, layer_2_back, ...].

So for two layers we can concatenate like this:

In [20]:
states = [tf.keras.layers.Concatenate()([state[0], state[2]]), tf.keras.layers.Concatenate()([state[1], state[3]])]
print(f"Number of states: {len(states)}")
print(f"State 1 shape (batch, units): {tf.shape(states[0])}")
print(f"State 2 shape (batch, units): {tf.shape(states[1])}")

Number of states: 2
State 1 shape (batch, units): [32 20]
State 2 shape (batch, units): [32 20]


I would recommend playing around with the number of layers, bidirectionality, and cell types ([tf.keras.layers.GRUCell()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRUCell) or [tf.keras.layers.LSTMCell()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTMCell)).

You'll notice that when you use LSTMs your state will contain a context AND a hidden matrix. I'll leave it up to you to explore all of that, but if you need hints you can look at the later code where I handle this case.

# Making Encoder Class

In the Encoder we will be allowing for parameters for the layer type, number of layers and bidirectionality.

In [21]:
class Encoder(tf.keras.layers.Layer):
  """
  Custom Encoder class that implements Tensorflow layer
  Constructs an encoder of the provided architecture
  """
  def __init__(self, input_vocab_size, enc_units, layertype="rnn", layers=1, isBidirectional=False):
    """
    Given the Vocabulary size, the number of units, layertype, number of layers and bidirectionality it
    will construct an encoder with these properties.
    """
    super(Encoder, self).__init__()
    self.enc_units = enc_units
    self.input_vocab_size = input_vocab_size
    self.layertype=layertype
    self.layers = layers
    self.isBidirectional = isBidirectional



    if not layertype in {"gru", "lstm", "rnn"}:
      raise ValueError(f"[layertype == {layertype}]: layertype must be one of: [gru, lstm, rnn]")    

    cells = []
    if self.layertype == "gru": 
      for _ in range(self.layers):
        cells.append(tf.keras.layers.GRUCell(units=enc_units,
                                             recurrent_initializer='glorot_uniform',
                                             recurrent_dropout=.2))
    elif self.layertype == "lstm":
      for _ in range(self.layers):
        cells.append(tf.keras.layers.LSTMCell(units=enc_units,
                                             recurrent_initializer='glorot_uniform',
                                             recurrent_dropout=.2))
    else:
      for _ in range(self.layers):
        cells.append(tf.keras.layers.SimpleRNNCell(units=enc_units,
                                             recurrent_initializer='glorot_uniform',
                                             recurrent_dropout=.2))

    if self.isBidirectional:
      self.rnn = tf.keras.layers.Bidirectional(tf.keras.layers.RNN(cells, return_sequences=True, return_state=True))
    else:
      self.rnn = tf.keras.layers.RNN(cells, return_sequences=True, return_state=True)




This is a helper function for getting the forward, backward state pairs for bidirectional networks.

In [22]:
def getBidirectionalStates(states, layerNum, layertype, state=None):
    # Bidirectional returns every forward layer state first, followed by
    # every backward layer state, so the backward state of a layer sits
    # numLayers positions after its forward state.
    numLayers = len(states)//2
    forward = states[layerNum-1]
    backward = states[numLayers+layerNum-1]
    if layertype == "lstm":
        if state == "hidden":
            return [forward[0], backward[0]]
        elif state == "context":
            return [forward[1], backward[1]]
        else:
            raise ValueError(f"[layertype == lstm but state = {state}]: state must be 'hidden' or 'context' for lstm")

    elif layertype in ["rnn", "gru"]:
        if state:
            raise ValueError(f"[layertype == {layertype} but state = {state}]: state must be None for {layertype}")

        return [forward, backward]

In [23]:
def call(self, tokens, state=None):
    
    oh_input = tf.one_hot(tokens, depth=self.input_vocab_size)

    if self.isBidirectional:

      if self.layertype == "lstm":
        # LSTM needs to concatenate both context and hidden states.

        output, *s = self.rnn(oh_input, initial_state=state)

        state = []
        for i in range(self.layers):
          layerHiddenStates = getBidirectionalStates(s, layerNum=i+1, 
                                                     layertype=self.layertype, 
                                                     state="hidden")
          concatHiddenStates = tf.keras.layers.Concatenate()(layerHiddenStates)


          layerContextStates = getBidirectionalStates(s, layerNum=i+1, 
                                                      layertype=self.layertype, 
                                                      state="context")
          concatContextStates = tf.keras.layers.Concatenate()(layerContextStates)


          state.append([concatHiddenStates, concatContextStates])

      else:
        # is GRU or RNN
        output, *s = self.rnn(oh_input, initial_state=state)

        state = []
        for i in range(self.layers):
          layerStates = getBidirectionalStates(s,
                                               layerNum=i+1,
                                               layertype=self.layertype)

          state.append(tf.keras.layers.Concatenate()(layerStates))



    else:
      #Not Bidirectional
      output, *state = self.rnn(oh_input, initial_state=state)



    # In single layer networks state is either one tensor or a tuple of tensors;
    # in n layer networks state is a list of n states, one for each layer.
    # Shapes:
    # For GRU:  output (batch, max_input_len, dims)
    #           state (batch, dims)
    # For LSTM: output (batch, max_input_len, dims)
    #           state ( h(batch, dims), c(batch, dims) )
    # For RNN:  output (batch, max_input_len, dims)
    #           state (batch, dims)
    # If n layers then there are n states.
    # If bidirectional dims -> 2*dims because of concatenation.
    return output, state

In [24]:
Encoder.call = call

We can do some tests. Play around with different parameters!

In [25]:
example_tokens = rxn_processor(example_rxn_batch)
encoder = Encoder(rxn_processor.vocabulary_size(), 128, layertype="rnn", layers=1, isBidirectional=False)
example_enc_output, example_enc_state = encoder(example_tokens)

print(f'Input batch, shape (batch): {example_rxn_batch.shape}')
print(f'Input batch tokens, shape (batch, max_seq_length): {example_tokens.shape}')
print(f'Encoder output, shape (batch, max_seq_length, units): {example_enc_output.shape}')
for i in range(len(example_enc_state)):
  print(f"Encoder state {i+1} shape: {example_enc_state[i].shape}")

Input batch, shape (batch): (32,)
Input batch tokens, shape (batch, max_seq_length): (32, 32)
Encoder output, shape (batch, max_seq_length, units): (32, 32, 128)
Encoder state 1 shape: (32, 128)


# Implementing Attention

Here we are going to use TensorFlow's built-in multiplicative attention mechanism (similar to Luong attention).

The query is the decoder's hidden states, and the query mask marks which positions of the decoder's output are real tokens (True) and which are padding (False). The value parameter is the encoder's hidden states, and the value mask does the same job for the encoder's sequence.

By default the TensorFlow implementation does not have weights, so we add a dense layer with no bias to act as the weight matrix.


In [26]:
class luongLikeAttention(tf.keras.layers.Layer):
  """
  Implementation of an attention layer using TensorFlow's implementation of
  multiplicative attention.
  """
  def __init__(self, units):
    super().__init__()
    
    self.W1 = tf.keras.layers.Dense(units, use_bias=False)


    self.attention = tf.keras.layers.Attention()

  def call(self, query, query_mask, value, value_mask):

    w1_query = self.W1(query)


    context_vector, attention_weights = self.attention(
        inputs = [w1_query, value],
        mask=[query_mask, value_mask],
        return_attention_scores = True,
    )


    return context_vector, attention_weights

If we test out the attention layer we can see the output for the attention weights.

Notice that the attention weights have shape (batch_size, max_output_len, max_input_length). This is because for each output token there is a vector of size max_input_length that holds the weights for which input locations to pay attention to.

In [27]:
attention_layer = luongLikeAttention(128)

example_attention_query = tf.random.normal(shape=[len(example_tokens), 32, 128])

context_vector, attention_weights = attention_layer(
    query=example_attention_query,
    query_mask = (example_tokens != 0), # should be a unique mask just using as an example
    value=example_enc_output,
    value_mask=(example_tokens != 0))

print(f'Attention layer output shape: (batch_size, max_output_len, units):           {context_vector.shape}')
print(f'Attention weights shape: (batch_size, max_output_len, max_input_length): {attention_weights.shape}')

Attention layer output shape: (batch_size, max_output_len, units):           (32, 32, 128)
Attention weights shape: (batch_size, max_output_len, max_input_length): (32, 32, 32)
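For intuition, here is a pure Python sketch of what dot-product attention computes for a single query vector: a score against every encoder position, a softmax over the unmasked positions, and a weighted sum of the encoder vectors. The function name is hypothetical, it ignores the batch dimension and the query mask, and it assumes at least one position is unmasked.

```python
import math


def dot_product_attention(query, values, value_mask):
    """Attend from one query vector over a list of value vectors."""
    # Score each value by its dot product with the query.
    scores = [sum(q * v for q, v in zip(query, value)) for value in values]

    # Padding positions get a score of -inf so their weight becomes 0.
    masked = [s if keep else -math.inf for s, keep in zip(scores, value_mask)]

    # Numerically stable softmax over the input positions.
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The context vector is the weighted sum of the value vectors.
    dims = len(values[0])
    context = [sum(w * value[d] for w, value in zip(weights, values))
               for d in range(dims)]
    return context, weights


query = [1.0, 0.0]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
value_mask = [True, True, False]  # the last position is padding

context, weights = dot_product_attention(query, values, value_mask)
print(weights)  # one weight per input position, padding gets 0.0
print(context)
```

Doing this for every decoder step gives one weight vector per output token, which is where the (max_output_len, max_input_length) part of the attention weights shape comes from.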


# Implementing Decoder

This will have a very similar structure to the Encoder in that it is just an RNN. It will have an additional Dense layer for the output, since we need the decoder to output vectors with size equal to the output vocabulary size so that each vector can be read as a prediction over the one-hot encoded tokens. The Decoder will also have an attention layer.

Note that the decoder is not given a bidirectional option, since the decoder needs to produce a sequence and cannot see ahead of where it is.

In [28]:
class Decoder(tf.keras.layers.Layer):
  """
  Custom decoder class that implements tensor flow layer.
  """
  def __init__(self, output_vocab_size, dec_units, layertype="rnn", layers=1, useAttention=True):
    """
    Given the vocab size, units, layertype, number of layers, and whether to use attention,
    constructs a decoder with those properties.
    """
    super(Decoder, self).__init__()
    self.dec_units = dec_units
    self.output_vocab_size = output_vocab_size

    self.layertype=layertype
    self.layers = layers
    self.useAttention = useAttention



    if not layertype in {"gru", "lstm", "rnn"}:
      raise ValueError(f"[layertype == {layertype}]: layertype must be one of: [gru, lstm, rnn]")    

    cells = []
    if self.layertype == "gru": 
      for _ in range(self.layers):
        cells.append(tf.keras.layers.GRUCell(units=dec_units,
                                             recurrent_initializer='glorot_uniform',
                                             dropout=.3))
    elif self.layertype == "lstm":
      for _ in range(self.layers):

        cells.append(tf.keras.layers.LSTMCell(units=dec_units,
                                             recurrent_initializer='glorot_uniform',
                                             dropout=.3))
    else:
      # rnn
      for _ in range(self.layers):
        cells.append(tf.keras.layers.SimpleRNNCell(units=dec_units,
                                             recurrent_initializer='glorot_uniform',
                                             dropout=.3))

    self.rnn = tf.keras.layers.RNN(cells, return_sequences=True, return_state=True)
    



    if self.useAttention:
      self.attention = luongLikeAttention(self.dec_units)

      # This weighted matrix is for applying the context vector to decoder
      # output
      self.Wc = tf.keras.layers.Dense(dec_units, activation=tf.math.tanh,
                                      use_bias=False)

    self.fc = tf.keras.layers.Dense(self.output_vocab_size)

  def call(self, inputs, state=None):

      vectors = tf.one_hot(inputs["input_tokens"], depth=self.output_vocab_size)


      rnn_output, *state = self.rnn(vectors, initial_state=state)


      if self.useAttention:
        context_vector, attention_weights = self.attention(
            query=rnn_output, query_mask = inputs["dec_mask"], value=inputs["enc_output"], value_mask=inputs["enc_mask"])

        context_and_rnn_output = tf.concat([context_vector, rnn_output], axis=-1)

        last_vector = self.Wc(context_and_rnn_output)

        
      else:
        attention_weights = None
        last_vector = rnn_output

      
      logits = self.fc(last_vector)

      return {"logits":logits, "attention_weights":attention_weights}, state

In [29]:
#Set up example Encoder
example_tokens = rxn_processor(example_rxn_batch)
encoder = Encoder(rxn_processor.vocabulary_size(), 128, layertype="rnn", layers=1, isBidirectional=False)
example_enc_output, example_enc_state = encoder(example_tokens)

#Set up example Decoder
decoder = Decoder(product_processor.vocabulary_size(), 128, layertype="rnn", useAttention=True)
example_output_tokens = product_processor(example_product_batch)
decoder_input = {"input_tokens":example_output_tokens[:,:-1],
                 "dec_mask":example_output_tokens[:,:-1] != 0,
                 "enc_output":example_enc_output,
                 "enc_mask":example_tokens!=0}

dec_result, dec_state = decoder(decoder_input, example_enc_state)

print(f"input tokens shape (batch, max_seq_len-1): {decoder_input['input_tokens'].shape}")
print(f'logits shape: (batch_size, product_max_len-1, product_vocab_size) {dec_result["logits"].shape}')
for i in range(len(dec_state)):
  print(f'state {i+1} shape: (batch_size, dec_units) {dec_state[i].shape}')

input tokens shape (batch, max_seq_len-1): (32, 31)
logits shape: (batch_size, product_max_len-1, product_vocab_size) (32, 31, 40)
state 1 shape: (batch_size, dec_units) (32, 128)


# Some additional functions.


### MaskedLoss extends Loss and makes sure the categorical cross entropy is only calculated for unmasked values.

In [30]:
class MaskedLoss(tf.keras.losses.Loss):
  """
  Extension of the TensorFlow Loss class for masking padding values.
  """
  def __init__(self):
    self.name = 'masked_loss'
    self.loss = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction="sum")

  def __call__(self, y_true, y_pred):


    # Mask off the losses on padding.
    mask = y_true != 0
    mask = tf.cast(mask, dtype=tf.int64)

    loss = self.loss(y_true, y_pred, sample_weight=mask)



    # We divide loss by reduce sum of mask as this allows us
    # to divide by the number of unmasked values.
    return loss / tf.reduce_sum(tf.cast(mask, tf.float32))
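As a sanity check on what this loss computes, here is a pure Python sketch of cross entropy averaged over the non-padding tokens only. The function name is hypothetical, token id 0 is taken to be padding as in the rest of the notebook, and it assumes at least one non-padding token in the batch.

```python
import math


def masked_cross_entropy(y_true, logits):
    """Mean cross entropy over tokens whose target id is not 0 (padding).

    y_true: list of sequences of token ids
    logits: list of sequences of per-token logit vectors
    """
    total, count = 0.0, 0
    for seq_true, seq_logits in zip(y_true, logits):
        for target, token_logits in zip(seq_true, seq_logits):
            if target == 0:
                continue  # skip padding positions entirely
            # log(sum(exp(logits))) computed in a numerically stable way.
            top = max(token_logits)
            log_norm = top + math.log(sum(math.exp(l - top) for l in token_logits))
            # Negative log probability of the correct token.
            total += log_norm - token_logits[target]
            count += 1
    return total / count


y_true = [[1, 0]]                          # second position is padding
logits = [[[0.0, 0.0], [10.0, -10.0]]]     # the padding logits are ignored
print(masked_cross_entropy(y_true, logits))  # log(2), about 0.693
```

Dividing by the number of unmasked tokens rather than by the full sequence length keeps heavily padded batches from looking artificially good.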


### MaskedTotalAccuracy extends Metric and gives the total accuracy (the fraction of sequences predicted entirely correctly) rather than the per-token sequence accuracy.

In [31]:
class MaskedTotalAccuracy(tf.keras.metrics.Metric):

  def __init__(self, name="masked_tot_acc", **kwargs):
    super(MaskedTotalAccuracy, self).__init__(name=name, **kwargs)
    self.sumBatchAcc = self.add_weight(name="Sum of Average Prediction Accuracies", initializer="zeros")
    self.numBatches = self.add_weight(name="Number of Batches called", initializer="zeros")
    

  def update_state(self, y_true, y_pred, sample_weight=None):

    # holds 1 for correct guess and 0 for wrong guess
    equalityTrueAndPrediction = tf.math.equal(tf.cast(y_true, dtype=tf.int32), 
                                              tf.cast(tf.math.argmax(y_pred, axis=-1), dtype=tf.int32))


    if sample_weight is not None:

      # Negate mask to make all masked values 1
      n_sample_weight = tf.logical_not(tf.cast(sample_weight, dtype=tf.bool))

      # Since all masked values are 1, the logical or forces all padding guesses to be true.
      # For total accuracy this is fine because the percentage right isn't being measured,
      # only the totality of correctness. In this way guesses can only be wrong in
      # non-padding regions.
      maskedResults = tf.logical_or(n_sample_weight, equalityTrueAndPrediction)
    else:
      maskedResults = equalityTrueAndPrediction


    collapsedResults = tf.reduce_all(maskedResults, axis=1, keepdims=True)


    batchNumberTruePositives = tf.reduce_sum(tf.cast(collapsedResults, dtype=tf.float32))

    batch_acc = batchNumberTruePositives / tf.cast(tf.shape(y_true)[0], dtype=tf.float32)
    self.sumBatchAcc.assign_add(batch_acc)

    self.numBatches.assign_add(1.)


  def result(self):
    # Return the current average batch accuracy
    return self.sumBatchAcc/self.numBatches
  
  def reset_state(self):
    self.sumBatchAcc.assign(0.)
    self.numBatches.assign(0.)
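The core of this metric for a single batch can be sketched in plain Python: a sequence counts as correct only if every non-padding token is predicted correctly. The function name is hypothetical, and unlike the class above it takes already-argmaxed token ids and does not average over batches.

```python
def exact_match_accuracy(y_true, y_pred_ids):
    """Fraction of sequences where all non-padding tokens are correct."""
    correct = 0
    for true_seq, pred_seq in zip(y_true, y_pred_ids):
        # Positions where the target is 0 (padding) are ignored.
        if all(t == p for t, p in zip(true_seq, pred_seq) if t != 0):
            correct += 1
    return correct / len(y_true)


y_true = [[3, 4, 0],   # last position is padding
          [3, 5, 6]]
y_pred = [[3, 4, 9],   # wrong only on padding, so still counts as correct
          [3, 5, 7]]   # wrong on a real token, so the whole sequence is wrong

print(exact_match_accuracy(y_true, y_pred))  # 0.5
```

This is a much stricter measure than per-token accuracy: one wrong token anywhere in a product makes the whole prediction count as a miss.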


# General Model Constructor

In [32]:
class Seq2SeqModelConstructor(tf.keras.Model):
  """
  Seq2SeqModelConstructor extends the TensorFlow Model class and handles the creation of the full model with
  encoder and decoder and implements the custom training and evaluation steps.
  """
  def __init__(self,
               units,
               input_rxn_processor,
               output_rxn_processor,
               output_vocab_size,
               n_layers=1,
               layertype="gru",
               isBidirectional=False,
               useAttention=True):
    
    super().__init__()

    self.input_rxn_processor = input_rxn_processor
    self.output_rxn_processor = output_rxn_processor
    self.output_vocab_size = output_vocab_size
    self.n_layers = n_layers
    self.layertype = layertype
    self.isBidirectional = isBidirectional
    self.useAttention = useAttention


    self.encoder = Encoder(input_vocab_size=input_rxn_processor.vocabulary_size(),
                        enc_units=units,
                        layertype=self.layertype,
                        layers=self.n_layers,
                        isBidirectional=self.isBidirectional)
    

    if self.isBidirectional:
      dec_units = 2*units
    else:
      dec_units = units
    

    self.decoder = Decoder(output_vocab_size=output_rxn_processor.vocabulary_size(),
                                  dec_units=dec_units,
                                  layertype=self.layertype,
                                  layers=self.n_layers,
                                  useAttention=self.useAttention)
    

    self.seqAcc_1 = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)
    self.seqAcc_2 = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=2)
    self.seqAcc_3 = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=3)
    self.seqAcc_4 = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=4)
    self.seqAcc_5 = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)
    self.totAcc = MaskedTotalAccuracy()

  
  @property
  def metrics(self):
    return [self.seqAcc_1,self.seqAcc_2,self.seqAcc_3,self.seqAcc_4,self.seqAcc_5,self.totAcc]

# Helper methods for Seq2Seq model

In [33]:
def topKAccLoop(self, y_true, y_pred):
  """
  This is only needed because the Keras top-k accuracy seems to have issues with
  batches?
  Here is my open issue:
  https://github.com/keras-team/keras/issues/15939
  """
  mask = tf.math.logical_not(tf.math.equal(y_true, 0))

  mask = tf.cast(mask, dtype=tf.int32)

  # Update each top-k metric one sequence at a time.
  for i in tf.range(tf.shape(y_true)[0]):
    self.seqAcc_1.update_state(y_true[i,:], y_pred[i,:,:], sample_weight=mask[i,:])
    self.seqAcc_2.update_state(y_true[i,:], y_pred[i,:,:], sample_weight=mask[i,:])
    self.seqAcc_3.update_state(y_true[i,:], y_pred[i,:,:], sample_weight=mask[i,:])
    self.seqAcc_4.update_state(y_true[i,:], y_pred[i,:,:], sample_weight=mask[i,:])
    self.seqAcc_5.update_state(y_true[i,:], y_pred[i,:,:], sample_weight=mask[i,:])

Seq2SeqModelConstructor.topKAccLoop = topKAccLoop
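If top-k accuracy is new to you, this pure Python sketch shows the idea for one sequence: a token counts as a hit when the correct id is among the k highest scoring predictions, and padding positions are skipped. The function name is hypothetical and the sketch is only meant to illustrate the quantity being tracked, not to reproduce the Keras metric exactly.

```python
def masked_top_k_accuracy(y_true, logits, k):
    """Top-k accuracy over the non-padding tokens of one sequence."""
    hits, total = 0, 0
    for target, token_logits in zip(y_true, logits):
        if target == 0:
            continue  # padding does not count towards the accuracy
        # Indices of the k largest logits.
        ranked = sorted(range(len(token_logits)),
                        key=lambda i: token_logits[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
        total += 1
    return hits / total


y_true = [2, 1, 0]  # last position is padding
logits = [[0.1, 0.5, 0.4],     # correct id 2 is the second best guess
          [0.9, 0.06, 0.04],   # correct id 1 is the second best guess
          [1.0, 0.0, 0.0]]     # ignored

print(masked_top_k_accuracy(y_true, logits, k=1))  # 0.0
print(masked_top_k_accuracy(y_true, logits, k=2))  # 1.0
```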

In [34]:
def preprocess(self, input_text, target_text):


  # Convert the text to token IDs
  input_tokens = self.input_rxn_processor(input_text)
  target_tokens = self.output_rxn_processor(target_text)


  # Convert IDs to masks.
  input_mask = input_tokens != 0


  target_mask = target_tokens != 0


  return input_tokens, input_mask, target_tokens, target_mask

Seq2SeqModelConstructor._preprocess = preprocess

There is one thing I neglected to mention in the talk to save time, but it is an important detail for this training process. In training we are using something called [teacher forcing](https://en.wikipedia.org/wiki/Teacher_forcing). The main idea is that in training, rather than feeding the outputs of the decoder back into itself, we actually feed the correct token in each time regardless of the prediction. Intuitively, this makes sure that even if the decoder makes mistakes early on, these mistakes don't compound over the sequence.

An analogy is if we treat training like a test. Sequence prediction is one of those questions that has part 'c' depend on part 'b' and part 'b' depend on part 'a'. Teacher forcing is like a nice teacher that will give you 'a' even if you got it wrong so you at least have a shot at doing 'b'.
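In code, teacher forcing mostly comes down to how the target sequence is sliced: the decoder is fed the target without its last token and is scored against the same sequence shifted by one. Here is a tiny sketch with a made-up helper name and a made-up product sequence.

```python
def teacher_forcing_pairs(target_tokens):
    """Pair each decoder input token with the token it should predict."""
    decoder_inputs = target_tokens[:-1]  # everything except the final token
    labels = target_tokens[1:]           # the same sequence shifted by one
    return list(zip(decoder_inputs, labels))


target = ["[START]", "C", "C", "O", "[END]"]
for step, (given, expected) in enumerate(teacher_forcing_pairs(target)):
    # At every step the decoder sees the correct previous token,
    # no matter what it predicted on the step before.
    print(f"step {step}: given {given!r}, should predict {expected!r}")
```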

---

We implement train_step so that we can call model.fit().
If you want, you can always make your own custom training loop!

In [35]:
@tf.function(input_signature=[[tf.TensorSpec(dtype=tf.string, shape=[None]),
                              tf.TensorSpec(dtype=tf.string, shape=[None])]])
def train_step(self, inputs):
  input_text, target_text = inputs


  (input_tokens, input_mask,
  target_tokens, target_mask) = self._preprocess(input_text, target_text)


  max_target_length = tf.shape(target_tokens)[1]

  with tf.GradientTape() as tape:
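    # Encode the input reaction.
    enc_output, enc_state = self.encoder(input_tokens)

    # The decoder starts from the encoder's final state.
    dec_state = enc_state

    # Teacher forcing: the decoder input is the true target sequence
    # without its last token.
    decoder_input = {"input_tokens":target_tokens[:, :-1],
                                "dec_mask":target_mask[:, :-1],
                                "enc_output":enc_output,
                                "enc_mask":input_mask}

    dec_result, dec_state = self.decoder(decoder_input, state=dec_state)

    # The labels are the true target sequence without its first token, so
    # each position is scored against the token that should come next.
    y = target_tokens[:,1:]
    y_pred = dec_result["logits"]
    average_loss = self.loss(y, y_pred)
    self.totAcc.update_state(y, y_pred, target_mask[:,1:])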


    self.topKAccLoop(y, y_pred)


    acc_top1 = self.seqAcc_1.result()
    acc_top2 = self.seqAcc_2.result()
    acc_top3 = self.seqAcc_3.result()
    acc_top4 = self.seqAcc_4.result()
    acc_top5 = self.seqAcc_5.result()
    totA = self.totAcc.result()

    
      
  variables = self.trainable_variables 
  gradients = tape.gradient(average_loss, variables)
  self.optimizer.apply_gradients(zip(gradients, variables))


  return {'batch_loss': average_loss,
          "acc_1":acc_top1,
          "acc_2":acc_top2,
          "acc_3":acc_top3,
          "acc_4":acc_top4,
          "acc_5":acc_top5,
          "totA":totA}

Seq2SeqModelConstructor.train_step = train_step

test_step is necessary for being able to call model.evaluate(). In this case test_step still uses the teacher forcing workflow. For true evaluation this is bad, but it is useful for seeing how the test and training results compare under the same, best-case conditions. We will implement our own evaluate function for real evaluation.

In [36]:
@tf.function(input_signature=[[tf.TensorSpec(dtype=tf.string, shape=[None]),
                              tf.TensorSpec(dtype=tf.string, shape=[None])]])
def test_step(self, inputs):
  input_text, target_text = inputs


  (input_tokens, input_mask,
  target_tokens, target_mask) = self._preprocess(input_text, target_text)


  max_target_length = tf.shape(target_tokens)[1]


  enc_output, enc_state = self.encoder(input_tokens)

  dec_state = enc_state


  decoder_input = {"input_tokens":target_tokens[:, :-1],
                              "dec_mask":target_mask[:, :-1],
                              "enc_output":enc_output,
                              "enc_mask":input_mask}

  dec_result, dec_state = self.decoder(decoder_input, state=dec_state)


  y = target_tokens[:,1:]
  y_pred = dec_result["logits"]


  average_loss = self.loss(y, y_pred)
  self.totAcc.update_state(y, y_pred, target_mask[:,1:])


  self.topKAccLoop(y, y_pred)

  # tf.print(average_loss)

  acc_top1 = self.seqAcc_1.result()
  acc_top2 = self.seqAcc_2.result()
  acc_top3 = self.seqAcc_3.result()
  acc_top4 = self.seqAcc_4.result()
  acc_top5 = self.seqAcc_5.result()
  totA = self.totAcc.result()

    

  return {'batch_loss': average_loss,
          "acc_1":acc_top1,
          "acc_2":acc_top2,
          "acc_3":acc_top3,
          "acc_4":acc_top4,
          "acc_5":acc_top5,
          "totA":totA}

Seq2SeqModelConstructor.test_step = test_step

This evaluation function feeds the decoder's own outputs back in as inputs, so the metrics reflect real prediction performance.

In [37]:

def evaluate_dataset(self, dataset):
  """
  Loops over dataset batches calculating and printing metrics
  for each batch. Final results are the batch averages.

  @return Dictionary with Batch average of metrics.
  """

  total_loss = 0
  acc_top1 = 0
  acc_top2 = 0
  acc_top3 = 0
  acc_top4 = 0
  acc_top5 = 0
  totA = 0

  for batch, (input_text, target_text) in enumerate(dataset.take(-1)):


    (input_tokens, input_mask,
    target_tokens, target_mask) = self._preprocess(input_text, target_text)



    max_target_length = tf.shape(target_tokens)[1]


    enc_output, enc_state = self.encoder(input_tokens)

    dec_state = enc_state


    

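    # Start decoding from the first target token of each sequence.
    pred_tokens = target_tokens[:, 0:1]

    for t in range(max_target_length-1):

      # Feed in the decoder's own previous prediction, not the true token.
      decoder_input = {"input_tokens":pred_tokens,
                            "dec_mask":target_mask[:, t:t+1],
                            "enc_output":enc_output,
                            "enc_mask":input_mask}

      dec_result, dec_state = self.decoder(decoder_input, state=dec_state)

      # Collect the logits of every step along the time axis.
      if t == 0:
        logits = dec_result["logits"]
      else:
        logits = tf.concat((logits, dec_result["logits"]), 1)

      # Greedy decoding: the most likely token becomes the next input.
      pred_tokens = tf.argmax(dec_result["logits"], -1)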
    

    y = target_tokens[:,1:]
    y_pred = logits

    loss = self.loss(y, y_pred).numpy()

    total_loss += loss

    self.totAcc.update_state(y, y_pred, target_mask[:,1:])


    self.topKAccLoop(y, y_pred)

    print((f"Batch: {batch} - batch_loss: {loss:.3f} -"),
          (f"acc_1: {self.seqAcc_1.result().numpy():.3f} -"),
          (f"acc_2: {self.seqAcc_2.result().numpy():.3f} -"),
          (f"acc_3: {self.seqAcc_3.result().numpy():.3f} -"),
          (f"acc_4: {self.seqAcc_4.result().numpy():.3f} -"),
          (f"acc_5: {self.seqAcc_5.result().numpy():.3f} -"),
          (f"TotA: {self.totAcc.result().numpy():.3f}"))
    
    acc_top1 += self.seqAcc_1.result().numpy()
    acc_top2 += self.seqAcc_2.result().numpy()
    acc_top3 += self.seqAcc_3.result().numpy()
    acc_top4 += self.seqAcc_4.result().numpy()
    acc_top5 += self.seqAcc_5.result().numpy()
    totA += self.totAcc.result().numpy()

    self.seqAcc_1.reset_state()
    self.seqAcc_2.reset_state()
    self.seqAcc_3.reset_state()
    self.seqAcc_4.reset_state()
    self.seqAcc_5.reset_state()
    self.totAcc.reset_state()


  return {"batch_loss":total_loss/(batch+1),
          "acc_1":acc_top1/(batch+1),
          "acc_2":acc_top2/(batch+1),
          "acc_3":acc_top3/(batch+1),
          "acc_4":acc_top4/(batch+1),
          "acc_5":acc_top5/(batch+1),
          "totA":totA/(batch+1)}


Seq2SeqModelConstructor.evaluate_dataset = evaluate_dataset
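
To make the difference between the two evaluation modes concrete, here is a toy sketch in plain Python. The "decoder" is just a hypothetical lookup table with one deliberate mistake, not the model from this notebook. The point is only to show how a single early error is contained under teacher forcing but compounds when predictions are fed back in.

```python
# Toy stand-in for a trained decoder: previous token -> predicted next token.
# It contains one deliberate mistake ("C" -> "X" instead of "C" -> "O").
TOY_DECODER = {"<s>": "C", "C": "X", "O": "N", "X": "X", "N": "</s>"}

target = ["<s>", "C", "O", "N", "</s>"]
truth = target[1:]

def teacher_forced(target):
    """Predict each token from the TRUE previous token."""
    return [TOY_DECODER[prev] for prev in target[:-1]]

def free_running(start, steps):
    """Predict each token from the model's OWN previous prediction."""
    preds, prev = [], start
    for _ in range(steps):
        prev = TOY_DECODER[prev]
        preds.append(prev)
    return preds

def token_accuracy(preds, truth):
    """Fraction of positions where the prediction matches the truth."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

forced_preds = teacher_forced(target)
free_preds = free_running(target[0], len(truth))

print(forced_preds, token_accuracy(forced_preds, truth))  # accuracy 0.75
print(free_preds, token_accuracy(free_preds, truth))      # accuracy 0.25
```

With teacher forcing the single mistake costs one token. Without it, every later step is conditioned on the wrong token, which is why the numbers from `evaluate_dataset` are expected to be lower than the ones from `model.evaluate()`.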

### This is enough in the model constructor to do some training.

In [38]:
units = 512
n_layers = 2
isBidirectional = True
useAttention = True
model = Seq2SeqModelConstructor(
    units = units,
    input_rxn_processor=rxn_processor,
    output_rxn_processor=product_processor, 
    output_vocab_size=product_processor.vocabulary_size(),
    n_layers=n_layers,
    layertype="lstm",
    isBidirectional=isBidirectional,
    useAttention=useAttention)


model.compile(
    optimizer=tf.keras.optimizers.Adam(clipnorm=5),
    loss=MaskedLoss()
)

# Note on Training

As described at the start, I cut my dataset down to a subset of 1000 reactions for training so that CPU training on Colab wouldn't take too long.

BUT that means I can't guarantee that training will be as good or as effective on this smaller dataset. Because there are potentially a lot of reaction patterns, it could be that this 1000-reaction subset is too diverse in the kinds of reactions it contains, with too few examples of each pattern, so although epochs will be quick the model may have trouble learning. Just something to keep in mind.

In testing, with these settings:

>units = 512

>n_layers = 2

>isBidirectional = True

>useAttention = True

it took about 10 epochs to reach acc_1 ≈ 80%.

So for educational purposes, play around and run models. If you want to actually get the accuracy high, you will likely have to train for a while!

In [39]:
h = model.fit(trainDataset, epochs=1)



Reminder: this is evaluating WITH teacher forcing. You can compare these values to the ones from fit to check for overfitting.

In [40]:
h = model.evaluate(testDataset)
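
One rough way to make the train/test comparison is to subtract the two sets of metrics. The sketch below uses a hypothetical helper, `generalization_gap`, and placeholder numbers that are NOT results from this notebook. It assumes you have both sets of metrics as dictionaries, for example the last entries of the `history` attribute returned by `fit` and the output of `model.evaluate(testDataset, return_dict=True)`, stored under different variable names.

```python
def generalization_gap(train_metrics, test_metrics):
    """Return train minus test for every metric the two dicts share."""
    shared = train_metrics.keys() & test_metrics.keys()
    return {name: train_metrics[name] - test_metrics[name]
            for name in sorted(shared)}

# Placeholder numbers for illustration only, not results from this notebook.
train = {"batch_loss": 0.40, "acc_1": 0.90}
test = {"batch_loss": 1.20, "acc_1": 0.60}

gap = generalization_gap(train, test)
for name, value in gap.items():
    print(f"{name}: {value:+.2f}")
```

A much higher training accuracy together with a much lower training loss is the usual sign of overfitting. Small gaps are normal.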



This is the true evaluation, where the decoder is fed its own last output. These are the values you would get for actual prediction.

In [41]:
h = model.evaluate_dataset(testDataset)
print(h)

Batch: 0 - batch_loss: 2.362 - acc_1: 0.346 - acc_2: 0.509 - acc_3: 0.616 - acc_4: 0.701 - acc_5: 0.761 - TotA: 0.000
Batch: 1 - batch_loss: 2.279 - acc_1: 0.313 - acc_2: 0.471 - acc_3: 0.587 - acc_4: 0.659 - acc_5: 0.716 - TotA: 0.000
Batch: 2 - batch_loss: 2.345 - acc_1: 0.323 - acc_2: 0.449 - acc_3: 0.551 - acc_4: 0.643 - acc_5: 0.703 - TotA: 0.000
Batch: 3 - batch_loss: 2.502 - acc_1: 0.306 - acc_2: 0.494 - acc_3: 0.600 - acc_4: 0.671 - acc_5: 0.694 - TotA: 0.000
{'batch_loss': 2.37203711271286, 'acc_1': 0.32199597358703613, 'acc_2': 0.48089657723903656, 'acc_3': 0.5885052680969238, 'acc_4': 0.6684274226427078, 'acc_5': 0.718407079577446, 'totA': 0.0}
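
Note that the final dictionary is the plain mean of the per-batch values, so a small last batch counts as much as a full one. The sketch below reproduces the printed `acc_1` average from the four batches above and compares it with a size-weighted mean. The batch sizes are hypothetical, since the notebook does not print them.

```python
def mean_of_batches(batch_values):
    """Unweighted mean of per-batch metrics, as evaluate_dataset returns."""
    return sum(batch_values) / len(batch_values)

def weighted_mean(batch_values, batch_sizes):
    """Mean of per-batch metrics weighted by the size of each batch."""
    total = sum(v * n for v, n in zip(batch_values, batch_sizes))
    return total / sum(batch_sizes)

# acc_1 per batch, copied from the printed output above.
acc_1 = [0.346, 0.313, 0.323, 0.306]
# Hypothetical batch sizes (assumed here: a smaller final batch).
sizes = [32, 32, 32, 4]

print(f"mean of batches: {mean_of_batches(acc_1):.3f}")
print(f"weighted mean:   {weighted_mean(acc_1, sizes):.3f}")
```

The difference is small here, but it grows when the last batch is tiny and behaves differently from the rest.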


### TODOs for myself, but if you want to play around these are some things to try!

!!TODO!!

- A prediction function that takes one reaction at a time?
- A visualization for attention weights?
