# Installing requirements

The USPTO datasets are available through MoleculeNet, and DeepChem's reaction featurizer wrapper provides the infrastructure for tokenizing the reactions. The featurizer uses `RobertaTokenizerFast` as its backend, which requires the Hugging Face `transformers` library to be installed.
The model and the pre-trained weights are also loaded from Hugging Face.
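
As a quick sanity check, the snippet below reports which of these dependencies cannot be imported in the current environment. It is a minimal sketch using only the standard library; the helper name `missing_packages` and the exact list of module names (including `torch`) are assumptions made for illustration.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Top-level import names for the packages this tutorial relies on (assumed list).
required = ["deepchem", "transformers", "rdkit", "torch"]
print("Missing:", missing_packages(required))
```

An empty list means everything is importable and the install cell can be skipped.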

In [None]:
!pip install --pre deepchem
!pip install --pre deepchem[torch]
!pip install transformers
!pip install rdkit-pypi

Collecting deepchem
  Downloading deepchem-2.6.0.dev20210924223259-py3-none-any.whl (609 kB)

In [None]:
import deepchem as dc
from deepchem.molnet.load_function import uspto_datasets
from transformers import RobertaTokenizerFast

# Loading and tokenizing the datasets

The USPTO datasets can be loaded with the ```load_uspto``` function in MoleculeNet. The loader can load the entire USPTO dataset, which consists of over a million reactions, or one of its subsets such as STEREO, MIT, and 50K. It also supports separating the reagents from the reactants for training.
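
To make the reagent separation concrete, here is a minimal sketch of the idea on a single reaction SMILES string, which has the form `reactants>reagents>products`. The helper `split_reaction` is hypothetical, and the exact source and target strings that DeepChem builds internally are an assumption here; this only illustrates the difference between keeping reagents in their own field and mixing them with the reactants.

```python
def split_reaction(rxn_smiles, sep_reagent=True):
    """Split 'reactants>reagents>products' into (source, target) strings."""
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got: {rxn_smiles!r}")
    reactants, reagents, products = parts
    if sep_reagent:
        # Keep the reagents in their own field, after the '>' separator.
        source = f"{reactants}>{reagents}"
    else:
        # Mix the reagents in with the reactants and leave the reagent field empty.
        mixed = ".".join(p for p in (reactants, reagents) if p)
        source = f"{mixed}>"
    return source, products


# Acetic acid + ethanol, sulfuric acid as reagent, ethyl acetate as product.
rxn = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC"
print(split_reaction(rxn, sep_reagent=True))
print(split_reaction(rxn, sep_reagent=False))
```

In both cases the product side is the prediction target; only the layout of the input string changes.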

In [None]:
tasks, datasets, transformers = uspto_datasets.load_uspto(subset="MIT", splitter="random", sep_reagent=True)



In [None]:
# DeepChem loaders return the splits in (train, valid, test) order
train, valid, test = datasets

<DiskDataset X.shape: (383228, 2, 2, 1), y.shape: (), w.shape: (), task_names: []>

In [None]:
print(f"The train dataset : {train}")
print(f"Some reactions from the training set:\n {train.X[2]}")

The train dataset : <DiskDataset X.shape: (383228, 2, 2, 1), y.shape: (), w.shape: (), task_names: []>
Some reactions from the training set:
 [[[list([0, 304, 263, 51, 13, 71, 21, 261, 12, 50, 13, 308, 21, 39, 18, 51, 33, 39, 21, 267, 263, 51, 13, 50, 21, 288, 34, 291, 12, 39, 13, 39, 33, 51, 18, 51, 2])]
  [list([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])]]

 [[list([0, 304, 263, 51, 13, 71, 21, 261, 12, 50, 13, 71, 12, 288, 13, 272, 21, 39, 2])]
  [list([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])]]]
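
Each sample printed above is a nested array of shape `(2, 2, 1)`: two outer entries, each holding a list of token ids followed by a same-length list of ones. These look like the `input_ids` and `attention_mask` that a Hugging Face tokenizer produces, with ids starting at `0` and ending at `2`. A minimal sketch of unpacking one sample into named fields follows. It assumes the first entry is the source side of the reaction and the second is the target; that reading, the helper `unpack_sample`, and the choice of dictionary keys are assumptions, not part of DeepChem's API.

```python
def unpack_sample(sample):
    """Turn one (2, 2, 1)-shaped nested sample into named token lists."""
    (src_ids,), (src_mask,) = sample[0]  # assumed: source side of the reaction
    (tgt_ids,), (tgt_mask,) = sample[1]  # assumed: target side of the reaction
    return {
        "input_ids": list(src_ids),
        "attention_mask": list(src_mask),
        "labels": list(tgt_ids),
        "decoder_attention_mask": list(tgt_mask),
    }


# Shortened stand-in with the same nesting as train.X[2].
sample = [
    [[[0, 304, 263, 51, 2]], [[1, 1, 1, 1, 1]]],
    [[[0, 304, 39, 2]], [[1, 1, 1, 1]]],
]
features = unpack_sample(sample)
print(features["input_ids"])
print(features["labels"])
```

The keys match argument names accepted by `EncoderDecoderModel`, so a dictionary like this is one plausible starting point for building the training examples.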


In [None]:
from transformers import EncoderDecoderModel

# Tie the encoder and decoder weights so that both halves share parameters
roberta_shared = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "seyonec/ChemBERTa-zinc-base-v1",
    "seyonec/ChemBERTa-zinc-base-v1",
    tie_encoder_decoder=True,
)

  "Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 "


Some weights of the model checkpoint at seyonec/ChemBERTa-zinc-base-v1 were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.bias', 'lm_head.decoder.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForCausalLM were not initialized from the model checkpoint at seyonec/ChemBERTa-zinc-base-v1 and are newly initialized: ['roberta.encoder.layer.2.crossattention.output.LayerNorm.weight', 'roberta.encoder.layer.4.crossattention.self.

In [None]:
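# Model training
# NOTE: batch_size, tokenizer, compute_metrics, train_data and val_data are not
# defined earlier in this notebook and must be created before this cell will run.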

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# Set the training arguments. These values are not tuned, so feel free to change them.
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    logging_steps=2,  # set to 1000 for full training
    save_steps=16,  # set to 500 for full training
    eval_steps=4,  # set to 8000 for full training
    warmup_steps=1,  # set to 2000 for full training
    max_steps=16, # delete for full training
    overwrite_output_dir=True,
    save_total_limit=3,
    fp16=True,  # mixed-precision training; requires a CUDA GPU
)

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=roberta_shared,  # the tied encoder-decoder model created above
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()

NameError: ignored
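
The `compute_metrics` callback passed to the trainer is not defined in this section. One common choice for reaction prediction is exact-match (top-1) accuracy over the predicted product strings. The sketch below shows only that scoring step, assuming the predictions and references have already been decoded from token ids into strings; the function name `exact_match_accuracy` and the example strings are hypothetical, and a real callback would still need to decode the trainer's outputs with the tokenizer first.

```python
def exact_match_accuracy(predicted, expected):
    """Fraction of predictions that equal their reference string exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    # Ignore surrounding whitespace, which decoding can introduce.
    hits = sum(p.strip() == e.strip() for p, e in zip(predicted, expected))
    return hits / len(expected)


preds = ["CC(=O)OCC", "c1ccccc1", "CCO"]
refs = ["CC(=O)OCC", "c1ccccc1O", "CCO"]
print(exact_match_accuracy(preds, refs))
```

Plain string equality is stricter than chemical equivalence, because the same molecule can be written as several different SMILES strings. Canonicalizing both sides (for example with RDKit) before comparing would avoid counting those cases as errors.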