# LLM for Retrosynthesis

## 1. Setting up Environments

Set up the environment following the scripts on the GitHub page. I would advise using different PyTorch and Python versions from the ones listed there; the commands below already contain the corrected versions: <br>

> ``` conda create -n mol_transformer python=3.6 ``` <br>
> ``` conda activate mol_transformer ``` <br>
> ``` conda install rdkit -c rdkit ``` <br>
> ``` conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=9.2 -c pytorch ``` <br>
> ``` conda install future six tqdm pandas ```<br>

Clone the repository into the root folder of this notebook using: <br>
> ``` git clone https://github.com/pschwllr/MolecularTransformer.git ```

Lastly, run the following two commands to complete the setup (make sure to run *cd MolecularTransformer* first):<br> 
> ``` pip install torchtext==0.3.1``` <br>
> ```pip install -e . ```

## 2. Preprocess the data

1. From raw data (this could be done in the morning sessions):

In [1]:
import re
import pandas as pd
from pathlib import Path
from rdkit import Chem

In [2]:
import torch
print(torch.cuda.is_available())  # Should print True; without a GPU, training takes too long

False


In [3]:
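def tokenize_smiles(smiles):
    """Split a SMILES string into space-separated tokens for the model."""
    # Raw string avoids invalid-escape warnings; bracket atoms, Br and Cl each stay one token
    pattern = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
    regex = re.compile(pattern)
    tokens = regex.findall(smiles)
    # Fail loudly if any character was not covered by the pattern
    assert smiles == ''.join(tokens)
    return ' '.join(tokens)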
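
As a quick illustration of what the tokenizer produces, the sketch below repeats the same regular expression and applies it to aspirin and to a small salt. The `detokenize` helper is not part of the original notebook; it is a hypothetical convenience for turning a tokenized line (for example, model output) back into a plain SMILES string.

```python
import re

# Same token pattern as in tokenize_smiles, written as a raw string.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> str:
    """Split a SMILES string into space-separated tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Fail loudly if any character was not covered by the pattern.
    assert smiles == "".join(tokens), f"Could not tokenize: {smiles}"
    return " ".join(tokens)


def detokenize(tokenized: str) -> str:
    """Hypothetical inverse helper: drop the spaces between tokens."""
    return "".join(tokenized.split())


print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # C C ( = O ) O c 1 c c c c c 1 C ( = O ) O
print(tokenize_smiles("[NH4+].Cl"))              # [NH4+] . Cl
```

Bracket atoms such as `[NH4+]` and two-letter elements such as `Cl` stay together as single tokens, which is why a plain character split would not be enough.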

In [4]:
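def get_smiles_from_df(df: pd.DataFrame) -> tuple:
    """Return tokenized (products, reactants) from the reaction SMILES in the last column."""
    reactions = df.iloc[:, -1].values
    products, reactants = [], []
    for reaction in reactions:
        try:
            reac, pro = reaction.split('>>')
            pro_mol, reac_mol = Chem.MolFromSmiles(pro), Chem.MolFromSmiles(reac)
            # Remove atom-map numbers so they do not appear in the canonical SMILES
            for atom in pro_mol.GetAtoms():
                atom.SetAtomMapNum(0)
            for atom in reac_mol.GetAtoms():
                atom.SetAtomMapNum(0)
            product_smi = Chem.MolToSmiles(pro_mol, canonical=True, isomericSmiles=True)
            reactant_smi = Chem.MolToSmiles(reac_mol, canonical=True, isomericSmiles=True)
            product_tok, reactant_tok = tokenize_smiles(product_smi), tokenize_smiles(reactant_smi)
        except Exception:
            # Skip reactions that cannot be split, parsed by RDKit, or tokenized
            continue
        # Append only after both sides succeeded, so the two lists stay aligned
        products.append(product_tok)
        reactants.append(reactant_tok)

    return products, reactants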

In [5]:
# get path to data
raw_data_path = Path.cwd() / "RawData"
uspto_path = Path.cwd() / 'MolecularTransformer' / 'data' / 'USPTO50'
uspto_path.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
raw_test = pd.read_csv(raw_data_path / "raw_test.csv")
raw_train = pd.read_csv(raw_data_path / "raw_train.csv")
raw_val = pd.read_csv(raw_data_path / "raw_val.csv")

# create input files for the LLM (source = products, target = reactants)
modes = ['train', 'val', 'test']
for mode, df in zip(modes, [raw_train, raw_val, raw_test]):
    products, reactants = get_smiles_from_df(df)
    df_new = pd.DataFrame({'products': products, 'reactants': reactants})
    df_new[['products']].to_csv(uspto_path / f"src-{mode}.txt", index=False, header=False)
    df_new[['reactants']].to_csv(uspto_path / f"tgt-{mode}.txt", index=False, header=False)
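
Since the source and target files are read as line-aligned pairs, an optional sanity check is to confirm that both files of a split have the same number of lines, and to count the distinct tokens so they can be compared with the vocabulary size of 1000 used in the preprocessing command. The sketch below is not part of the original notebook; `summarize_parallel_files` is a hypothetical helper name, and the paths in the usage comment assume the USPTO50 folder used in this notebook.

```python
from pathlib import Path


def summarize_parallel_files(src_path, tgt_path):
    """Return (number of pairs, number of distinct tokens) for a src/tgt file pair."""
    src_lines = Path(src_path).read_text().splitlines()
    tgt_lines = Path(tgt_path).read_text().splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"{src_path} has {len(src_lines)} lines but {tgt_path} has {len(tgt_lines)}"
        )
    # Tokens are space-separated, so a plain split recovers them.
    vocab = {tok for line in src_lines + tgt_lines for tok in line.split()}
    return len(src_lines), len(vocab)


# Example usage (assumes the files for the training split already exist):
# n_pairs, n_tokens = summarize_parallel_files(
#     "MolecularTransformer/data/USPTO50/src-train.txt",
#     "MolecularTransformer/data/USPTO50/tgt-train.txt",
# )
```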


We have saved the tokenized strings to source and target files in the USPTO50 folder. Now, only one more preprocessing step is needed before training the model.

Use either bash (on Linux/macOS) or cmd (on Windows) to run this command:

In [None]:
%%bash 
cd MolecularTransformer
dataset=USPTO50 # folder written in the previous step (other options: MIT_mixed_augm / STEREO_mixed_augm)
python preprocess.py -train_src data/${dataset}/src-train.txt \
                     -train_tgt data/${dataset}/tgt-train.txt \
                     -valid_src data/${dataset}/src-val.txt \
                     -valid_tgt data/${dataset}/tgt-val.txt \
                     -save_data data/${dataset}/${dataset} \
                     -src_seq_length 1000 -tgt_seq_length 1000 \
                     -src_vocab_size 1000 -tgt_vocab_size 1000 -share_vocab

In [9]:
%%cmd
set dataset=USPTO50
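cd MolecularTransformer
rem Note the backslash between the two dataset entries in -save_data (it was missing in the run that produced the log below)
python preprocess.py -train_src data\%dataset%\src-train.txt ^
-train_tgt data\%dataset%\tgt-train.txt ^
-valid_src data\%dataset%\src-val.txt ^
-valid_tgt data\%dataset%\tgt-val.txt ^
-save_data data\%dataset%\%dataset% ^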
-src_seq_length 1000 -tgt_seq_length 1000 ^
-src_vocab_size 1000 -tgt_vocab_size 1000 -share_vocab

[2023-07-05 16:47:26,604 INFO] Extracting features...
[2023-07-05 16:47:26,610 INFO]  * number of source features: 0.
[2023-07-05 16:47:26,610 INFO]  * number of target features: 0.
[2023-07-05 16:47:26,610 INFO] Building `Fields` object...
[2023-07-05 16:47:26,610 INFO] Building & saving training data...
[2023-07-05 16:47:26,610 INFO] Reading source and target files: data\USPTO50\src-train.txt data\USPTO50\tgt-train.txt.
[2023-07-05 16:47:26,628 INFO] Splitting shard 0.
[2023-07-05 16:47:26,642 INFO] Building shard 0.
[2023-07-05 16:47:30,451 INFO]  * saving 0th train data shard to data\USPTO50USPTO50.train.0.pt.
[2023-07-05 16:47:31,602 INFO] Building & saving validation data...
[2023-07-05 16:47:31,608 INFO] Reading source and target files: data\USPTO50\src-val.txt data\USPTO50\tgt-val.txt.
[2023-07-05 16:47:31,610 INFO] Splitting shard 0.
[2023-07-05 16:47:31,613 INFO] Building shard 0.
[2023-07-05 16:47:32,084 INFO]  * saving 0th valid data shard to data\USPTO50USPTO50.valid.0.pt.

## 3. Training the model

Training takes around 3 hours on the HPC cluster; a checkpoint is written every 10,000 steps (`-save_checkpoint_steps 10000`), so an interrupted run does not lose everything. I have saved the last 20 checkpoints in the MolecularTransformer\experiments\checkpoints\USPTO50 folder.
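
To make the checkpoint bookkeeping concrete: with `-train_steps 500000`, `-save_checkpoint_steps 10000` and `-keep_checkpoint 20` (the values in the commands below), only the 20 most recent checkpoints should remain on disk. The sketch below works out which ones those are. The file-name pattern `<dataset>_model_step_<N>.pt` is taken from the testing section; `kept_checkpoints` is a hypothetical helper, not part of the repository, and it assumes the oldest checkpoints are removed first.

```python
def kept_checkpoints(train_steps, save_every, keep, prefix="USPTO50_model"):
    """List the checkpoint file names expected to remain after training."""
    all_steps = list(range(save_every, train_steps + 1, save_every))
    # Assumption: once more than `keep` checkpoints exist, the oldest are deleted.
    return [f"{prefix}_step_{step}.pt" for step in all_steps[-keep:]]


names = kept_checkpoints(train_steps=500_000, save_every=10_000, keep=20)
print(len(names), names[0], names[-1])
# 20 USPTO50_model_step_310000.pt USPTO50_model_step_500000.pt
```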

To run the training, use one of the following two commands:

In [None]:
%%bash
dataset=USPTO50
cd MolecularTransformer

python  train.py -data data/${dataset}/${dataset} \
                   -save_model experiments/checkpoints/${dataset}/${dataset}_model \
                   -seed 42 -gpu_ranks 0 -save_checkpoint_steps 10000 -keep_checkpoint 20 \
                   -train_steps 500000 -param_init 0  -param_init_glorot -max_generator_batches 32 \
                   -batch_size 4096 -batch_type tokens -normalization tokens -max_grad_norm 0  -accum_count 4 \
                   -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam -warmup_steps 8000  \
                   -learning_rate 2 -label_smoothing 0.0 -report_every 1000 \
                   -layers 4 -rnn_size 256 -word_vec_size 256 -encoder_type transformer -decoder_type transformer \
                   -dropout 0.1 -position_encoding -share_embeddings \
                   -global_attention general -global_attention_function softmax -self_attn_type scaled-dot \
                   -heads 8 -transformer_ff 2048

In [16]:
%%cmd
set dataset=USPTO50
cd MolecularTransformer

python train.py -data data\%dataset%\%dataset% ^
-save_model experiments\checkpoints\%dataset%\%dataset%_model ^
-seed 42 -gpu_ranks 0 -save_checkpoint_steps 10000 -keep_checkpoint 20 ^
-train_steps 500000 -param_init 0 -param_init_glorot -max_generator_batches 32 ^
-batch_size 4096 -batch_type tokens -normalization tokens -max_grad_norm 0 -accum_count 4 ^
-optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 ^
-learning_rate 2 -label_smoothing 0.0 -report_every 1000 ^
-layers 4 -rnn_size 256 -word_vec_size 256 -encoder_type transformer -decoder_type transformer ^
-dropout 0.1 -position_encoding -share_embeddings ^
-global_attention general -global_attention_function softmax -self_attn_type scaled-dot ^
-heads 8 -transformer_ff 2048

Traceback (most recent call last):
  File "train.py", line 118, in <module>
    main(opt)
  File "train.py", line 51, in main
    single_main(opt, 0)
  File "c:\Users\fgh18\Documents\Chem_LLM_Hackathon\example_notebooks\MolecularTransformer\onmt\train_single.py", line 86, in main
    opt = training_opt_postprocessing(opt, device_id)
  File "c:\Users\fgh18\Documents\Chem_LLM_Hackathon\example_notebooks\MolecularTransformer\onmt\train_single.py", line 77, in training_opt_postprocessing
    torch.cuda.set_device(device_id)
  File "c:\Users\fgh18\Anaconda3\envs\mol_transformer\lib\site-packages\torch\cuda\__init__.py", line 263, in set_device
    torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'


The command above fails because the PyTorch installation on this local machine has no CUDA support (`torch.cuda.is_available()` printed `False` earlier). With a working CUDA setup, training would start.
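
For orientation, the `-decay_method noam`, `-warmup_steps 8000` and `-learning_rate 2` flags in the training command describe a learning-rate schedule that warms up linearly and then decays with the inverse square root of the step. The sketch below evaluates the usual Noam formula, assuming the model size equals `-rnn_size 256`. `noam_lr` is an illustrative function rather than the repository's code, so treat the numbers as a rough guide.

```python
def noam_lr(step, base_lr=2.0, model_size=256, warmup_steps=8000):
    """Noam schedule for step >= 1: linear warmup, then inverse square-root decay."""
    return base_lr * model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1000, 8000, 100_000, 500_000):
    print(f"step {step:>7}: lr = {noam_lr(step):.2e}")
```

Under these assumptions the learning rate peaks at the end of warmup (step 8000) and falls off slowly afterwards, which is why the nominal `-learning_rate 2` is much larger than the rate actually applied.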

## 4. Testing

To test the model, the tokenized test molecules (`src-test.txt`) need to be in the MolecularTransformer\data\USPTO50 folder, which is where the preprocessing cell wrote them.
We can either load a single model from a checkpoint (as below) or build an ensemble of models.

Make sure to create a results folder in the experiments directory first.

For Linux/macOS, you simply have to run: 

In [None]:
%%bash
cd MolecularTransformer
dataset=USPTO50
model=${dataset}_model_step_500000.pt

python translate.py -model experiments/checkpoints/${dataset}/${model} \
                    -src data/${dataset}/src-test.txt \
                    -output experiments/results/predictions_${model}_on_${dataset}_test.txt \
                    -batch_size 64 -replace_unk -max_length 200 -fast

For Windows, you run:

In [25]:
%%cmd
cd MolecularTransformer
set dataset=USPTO50
set model=%dataset%_model_step_500000.pt

python translate.py -model experiments\checkpoints\%dataset%\%model% ^
                    -src data\%dataset%\src-test.txt ^
                    -output experiments\results\predictions_%model%_on_%dataset%_test.txt ^
                    -batch_size 64 -replace_unk -max_length 200 -fast


  var = torch.tensor(arr, dtype=self.dtype, device=device)
Traceback (most recent call last):
  File "translate.py", line 36, in <module>
    main(opt)
  File "translate.py", line 24, in main
    attn_debug=opt.attn_debug)
  File "c:\Users\fgh18\Documents\Chem_LLM_Hackathon\example_notebooks\MolecularTransformer\onmt\translate\translator.py", line 235, in translate
    batch_data = self.translate_batch(batch, data, fast=self.fast)
  File "c:\Users\fgh18\Documents\Chem_LLM_Hackathon\example_notebooks\MolecularTransformer\onmt\translate\translator.py", line 342, in translate_batch
    return_attention=self.replace_unk)
  File "c:\Users\fgh18\Documents\Chem_LLM_Hackathon\example_notebooks\MolecularTransformer\onmt\translate\translator.py", line 471, in _fast_translate_batch
    [alive_seq.index_select(0, select_indices),
RuntimeError: index_select(): Expected dtype int64 for index
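
The run above stops with an error, so no predictions file is produced on this machine. Once translation runs through, the predictions can be compared with `tgt-test.txt` line by line. The sketch below computes a simple top-1 exact-match accuracy, assuming one prediction per input line; `top1_accuracy` is a hypothetical helper that is not part of the repository.

```python
def top1_accuracy(predicted_lines, target_lines):
    """Fraction of predictions that match the target after removing token spaces."""
    if len(predicted_lines) != len(target_lines):
        raise ValueError("predictions and targets must have the same number of lines")
    if not target_lines:
        return 0.0
    hits = sum(
        "".join(pred.split()) == "".join(tgt.split())
        for pred, tgt in zip(predicted_lines, target_lines)
    )
    return hits / len(target_lines)


# Toy example with two reactions; the second prediction is wrong.
preds = ["C C ( = O ) Cl . O C", "C C O"]
targets = ["C C ( = O ) Cl . O C", "C C = O"]
print(top1_accuracy(preds, targets))  # 0.5
```

Plain string comparison is strict: the same reactant set written in a different order or in a non-canonical form counts as a miss. Canonicalizing both sides with RDKit before comparing would be a more forgiving variant.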
