SeqToSeq Fingerprint
--------------------

In this example, we will use a `SeqToSeq` model to generate fingerprints for classifying molecules.  This is based on the following paper, although some of the implementation details are different: Xu et al., "Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery" (https://doi.org/10.1145/3107411.3107424).

Many types of models require their inputs to have a fixed shape.  Since molecules can vary widely in the numbers of atoms and bonds they contain, this makes it hard to apply those models to them.  We need a way of generating a fixed-length "fingerprint" for each molecule.  Various ways of doing this have been designed, such as Extended-Connectivity Fingerprints (ECFPs).  In this example, instead of designing a fingerprint by hand, we will let a `SeqToSeq` model learn its own method of creating fingerprints.

A `SeqToSeq` model performs sequence-to-sequence translation.  For example, such models are often used to translate text from one language to another.  The model consists of two parts called the "encoder" and "decoder".  The encoder is a stack of recurrent layers.  The input sequence is fed into it, one token at a time, and it generates a fixed-length vector called the "embedding vector".  The decoder is another stack of recurrent layers that performs the inverse operation: it takes the embedding vector as input, and generates the output sequence.  By training the model on appropriately chosen input/output pairs, you can make it perform many sorts of transformations.
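
To make the encoder idea concrete, here is a minimal sketch of a recurrent encoder in plain Python.  It is not the DeepChem implementation: the helper `make_toy_encoder`, its random untrained weights, and the embedding size of 4 are all assumptions chosen for illustration.  It only shows that feeding tokens in one at a time yields a vector whose length does not depend on the input length.

```python
import math
import random

def make_toy_encoder(alphabet, embedding_size=4, seed=0):
    """Build an untrained toy recurrent encoder (hypothetical helper)."""
    rng = random.Random(seed)
    # One random input vector per token, plus a random recurrent weight matrix.
    token_vectors = {t: [rng.uniform(-1, 1) for _ in range(embedding_size)]
                     for t in alphabet}
    recurrent = [[rng.uniform(-1, 1) for _ in range(embedding_size)]
                 for _ in range(embedding_size)]

    def encode(sequence):
        state = [0.0] * embedding_size
        for token in sequence:  # consume the sequence one token at a time
            x = token_vectors[token]
            state = [math.tanh(x[i] + sum(recurrent[i][j] * state[j]
                                          for j in range(embedding_size)))
                     for i in range(embedding_size)]
        return state  # same length for every input sequence

    return encode

encode = make_toy_encoder("CNO()=1c")
short_embedding = encode("CCO")
long_embedding = encode("c1ccccc1C(=O)N")
print(len(short_embedding), len(long_embedding))
```

A real encoder learns its weights during training; here they are random, so the embeddings carry no chemical meaning.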

In this case, we will use SMILES strings describing molecules as the input sequences.  We will train the model as an autoencoder, so it tries to make the output sequences identical to the input sequences.  For that to work, the encoder must create embedding vectors that contain all information from the original sequence.  That's exactly what we want in a fingerprint, so perhaps those embedding vectors will then be useful as a way to represent molecules in other models!

Let's start by loading the data.  We will use the MUV dataset.  It includes 74,501 molecules in the training set and 9,313 molecules in the validation set, so it gives us plenty of SMILES strings to work with.

In [1]:
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_muv()
train_dataset, valid_dataset, test_dataset = datasets
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids


About to load MUV dataset.
Loading dataset from disk.
Loading dataset from disk.
Loading dataset from disk.


We need to define the "alphabet" for our `SeqToSeq` model, the list of all tokens that can appear in sequences.  (It's also possible for input and output sequences to have different alphabets, but since we're training it as an autoencoder, they're identical in this case.)  Make a list of every character that appears in any training sequence.

In [2]:
tokens = set()
for s in train_smiles:
  tokens.update(s)  # add every character of this SMILES string
tokens = sorted(tokens)  # sort for a stable, reproducible token order
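
As a small illustration of what this alphabet looks like, the sketch below builds one from three toy SMILES strings and maps a string to integer indices.  The names `toy_smiles` and `token_to_index` are hypothetical and exist only for this example; the model handles its own encoding internally.

```python
# Toy SMILES strings standing in for train_smiles (illustrative only).
toy_smiles = ["CCO", "C=O", "CC(=O)O"]

# Collect every character that occurs in any string, then sort for a stable order.
toy_tokens = sorted(set().union(*toy_smiles))
print(toy_tokens)  # ['(', ')', '=', 'C', 'O']

# Hypothetical lookup table from character to integer index.
token_to_index = {t: i for i, t in enumerate(toy_tokens)}
encoded = [token_to_index[c] for c in "CC(=O)O"]
print(encoded)  # [3, 3, 0, 2, 4, 1, 4]
```

Note that character-level tokens split two-letter atom symbols such as `Cl` into two separate tokens.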

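First we define a subclass of `SeqToSeq` whose encoder is a stack of 1D convolutions and whose decoder is a stack of GRU layers.  Then we create the model and define the optimization method to use.  In this case, learning works much better if we gradually decrease the learning rate.  We use an `ExponentialDecay` to multiply the learning rate by 0.9 after each epoch.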

In [5]:
from deepchem.models import SeqToSeq
from deepchem.models.tensorgraph import layers
from deepchem.models.tensorgraph.layers import Layer
import tensorflow as tf

class AspuruGuzikAutoEncoder(SeqToSeq):
  def __init__(self,
               input_tokens,
               output_tokens,
               max_output_length,
               encoder_layers=4,
               decoder_layers=4,
               embedding_dimension=512,
               dropout=0.0,
               reverse_input=True,
               variational=False,
               annealing_start_step=5000,
               annealing_final_step=10000,
               **kwargs):
    self.filter_sizes = [9, 9, 10]
    self.kernel_sizes = [9, 9, 11]
    # Forward **kwargs so options such as model_dir reach the base class.
    super(AspuruGuzikAutoEncoder, self).__init__(
      input_tokens,
      output_tokens,
      max_output_length,
      encoder_layers,
      decoder_layers,
      embedding_dimension,
      dropout,
      reverse_input,
      variational,
      annealing_start_step,
      annealing_final_step,
      **kwargs)


  def _create_encoder(self, n_layers, dropout):
    """Create the encoder layers."""
    prev_layer = self._features
    for i in range(len(self.filter_sizes)):
      filter_size = self.filter_sizes[i]
      kernel_size = self.kernel_sizes[i]
      if dropout > 0.0:
        prev_layer = layers.Dropout(dropout, in_layers=prev_layer)
      prev_layer = layers.Conv1D(
        kernel_size, filter_size, in_layers=prev_layer, activation_fn=tf.nn.relu)
    prev_layer = layers.Gather(in_layers=[prev_layer, self._gather_indices])
    if self._variational:
      self._embedding_mean = layers.Dense(
        196, in_layers=prev_layer)
      self._embedding_stddev = layers.Dense(
        196, in_layers=prev_layer)
      prev_layer = layers.CombineMeanStd(
        [self._embedding_mean, self._embedding_stddev], training_only=True)
    return prev_layer

  def _create_decoder(self, n_layers, dropout):
    """Create the decoder layers."""
    prev_layer = layers.Repeat(
      self._max_output_length, in_layers=self.embedding)
    for i in range(3):
      if dropout > 0.0:
        prev_layer = layers.Dropout(dropout, in_layers=prev_layer)
      prev_layer = layers.GRU(
        488, self.batch_size, in_layers=prev_layer)
    retval = layers.Dense(
      len(self._output_tokens),
      in_layers=prev_layer,
      activation_fn=tf.nn.softmax)
    return retval


In [9]:
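from deepchem.models.tensorgraph.optimizers import Adam, ExponentialDecay

max_length = max(len(s) for s in train_smiles)
model = AspuruGuzikAutoEncoder(tokens,
                               tokens,
                               max_length,
                               encoder_layers=2,
                               decoder_layers=2,
                               embedding_dimension=256,
                               variational=False,
                               model_dir='fingerprint')
batches_per_epoch = len(train_smiles)/model.batch_size
# Decay the learning rate by a factor of 0.9 once per epoch.
# 0.004 is an assumed initial rate; tune as needed.
model.set_optimizer(Adam(learning_rate=ExponentialDecay(0.004, 0.9, batches_per_epoch)))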
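
The schedule described above, multiplying the learning rate by 0.9 once per epoch, can be sketched in plain Python.  This assumes a staircase schedule (the rate changes only at epoch boundaries); the function name `decayed_learning_rate`, the initial rate, and the number of steps per epoch are illustrative assumptions, not values from the library.

```python
def decayed_learning_rate(initial_rate, decay_rate, decay_steps, step):
    """Staircase exponential decay: multiply by decay_rate once per decay_steps steps."""
    return initial_rate * decay_rate ** (step // decay_steps)

# Illustrative numbers only (assumptions).
initial_rate = 0.004
steps_per_epoch = 745

for epoch in range(3):
    step = epoch * steps_per_epoch
    rate = decayed_learning_rate(initial_rate, 0.9, steps_per_epoch, step)
    print(epoch, rate)
```

Within an epoch the rate stays constant; after `n` full epochs it equals `initial_rate * 0.9 ** n`.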

Let's train it!  The input to `fit_sequences()` is a generator that produces input/output pairs.  Because we are training an autoencoder, each pair contains the same SMILES string twice.  On a good GPU, this should take a few hours or less.

In [None]:
def generate_sequences(epochs):
  for i in range(epochs):
    for s in train_smiles:
      yield (s, s)

model.fit_sequences(generate_sequences(100000))

Ending global_step 999: Average loss 96.9049
Ending global_step 1999: Average loss 95.9348
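
To see exactly what such a generator yields, here is a self-contained sketch on two toy strings.  The names `toy_smiles` and `generate_pairs` are hypothetical; the structure mirrors `generate_sequences()` above.

```python
toy_smiles = ["CCO", "C=O"]

def generate_pairs(sequences, epochs):
    """Yield (input, output) pairs; for an autoencoder both are the same string."""
    for _ in range(epochs):
        for s in sequences:
            yield (s, s)

pairs = list(generate_pairs(toy_smiles, epochs=3))
print(len(pairs))  # 6: three passes over two strings
print(pairs[0])    # ('CCO', 'CCO')
```

Because it is a generator, the pairs are produced lazily, so the full list of training pairs never has to be held in memory.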


Let's see how well it works as an autoencoder.  We'll run the first 500 molecules from the validation set through it, and see how many of them are exactly reproduced.

In [None]:
predicted = model.predict_from_sequences(valid_smiles[:500])
count = 0
for s,p in zip(valid_smiles[:500], predicted):
  print(''.join(p))
  if ''.join(p) == s:
    count += 1
print('reproduced', count, 'of 500 validation SMILES strings')
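
Exact reproduction is a strict criterion: a single wrong character makes the whole string count as a miss.  The sketch below, on made-up predictions, computes the exact-match count alongside a softer per-character agreement.  Both helper functions and the toy data are assumptions for illustration, not part of DeepChem.

```python
def exact_match_count(originals, predictions):
    """Count predictions whose joined tokens equal the original string."""
    return sum(1 for s, p in zip(originals, predictions) if "".join(p) == s)

def character_agreement(originals, predictions):
    """Fraction of original characters reproduced at the same position."""
    total = sum(len(s) for s in originals)
    same = sum(1 for s, p in zip(originals, predictions)
               for a, b in zip(s, p) if a == b)
    return same / total

originals = ["CCO", "C=O", "CCN"]
predictions = [["C", "C", "O"], ["C", "=", "O"], ["C", "C", "C"]]

matched = exact_match_count(originals, predictions)
agreement = character_agreement(originals, predictions)
print(matched, "of", len(originals), "reproduced exactly")
print(round(agreement, 3))
```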

Now we'll try using the encoder as a way to generate molecular fingerprints.  We compute the embedding vectors for all molecules in the training and validation datasets, and create new datasets that have those as their feature vectors.  The amount of data is small enough that we can just store everything in memory.

In [None]:
train_embeddings = model.predict_embeddings(train_smiles)
train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
                                                train_dataset.y,
                                                train_dataset.w,
                                                train_dataset.ids)

valid_embeddings = model.predict_embeddings(valid_smiles)
valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
                                                valid_dataset.y,
                                                valid_dataset.w,
                                                valid_dataset.ids)

For classification, we'll use a simple fully connected network with one hidden layer.

In [None]:
classifier = dc.models.MultiTaskClassifier(n_tasks=len(tasks),
                                           n_features=256,
                                           layer_sizes=[512])
classifier.fit(train_embeddings_dataset, nb_epoch=10)

Find out how well it worked.  Compute the ROC AUC for the training and validation datasets.

In [None]:
import numpy as np
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
train_score = classifier.evaluate(train_embeddings_dataset, [metric], transformers)
valid_score = classifier.evaluate(valid_embeddings_dataset, [metric], transformers)
print('Training set ROC AUC:', train_score)
print('Validation set ROC AUC:', valid_score)
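
For intuition about the reported number, here is a minimal pure-Python sketch of a ROC AUC averaged over tasks.  It uses the pairwise definition (the probability that a random positive scores higher than a random negative, with ties counted as one half).  The function `roc_auc` and the toy labels and scores are assumptions for illustration; the cell above relies on the library's own metric.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two toy tasks with four molecules each (made-up values).
task_labels = [[0, 0, 1, 1], [0, 1, 0, 1]]
task_scores = [[0.1, 0.4, 0.35, 0.8], [0.2, 0.9, 0.3, 0.7]]

per_task = [roc_auc(y, s) for y, s in zip(task_labels, task_scores)]
mean_auc = sum(per_task) / len(per_task)
print(per_task)  # [0.75, 1.0]
print(mean_auc)  # 0.875
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a validation score well above 0.5 suggests the embeddings carry useful information.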