# DeepPurpose Deep Dive
## Tutorial 1: Training a Drug-Target Interaction Model from Scratch
#### [@KexinHuang5](https://twitter.com/KexinHuang5)

In this tutorial, we take a deep dive into DeepPurpose and show how it builds a drug-target interaction model from scratch. 

Agenda:

- Part I: Overview of DeepPurpose and Data
- Part II: Drug Target Interaction Prediction
    - DeepPurpose Framework
    - Applications to Drug Repurposing and Virtual Screening
    - Pretrained Models
    - Hyperparameter Tuning
    - Model Robustness Evaluation

Let's start!

In [1]:
from DeepPurpose import utils, dataset
from DeepPurpose import DTI as models
import warnings
warnings.filterwarnings("ignore")



## Part I: Overview of DeepPurpose and Data

Drug-target interaction (DTI) measures the binding of drug molecules to protein targets. Accurate identification of DTI is fundamental for drug discovery and supports many downstream tasks, of which drug screening and drug repurposing are the two main applications. Drug screening identifies ligand candidates that can bind to a protein of interest, whereas drug repurposing finds new therapeutic uses for existing drugs. Both tasks can reduce the costly, time-consuming, and labor-intensive process of synthesis and analysis, which is especially important when searching for effective and safe treatments for COVID-19.

DeepPurpose is a PyTorch-based deep learning framework that provides a simple but powerful toolkit for drug-target interaction prediction and its related applications. There has been much exciting recent work in this direction, but using these models takes considerable effort because of their esoteric instructions and interfaces. DeepPurpose is designed to make things as simple as possible through a unified framework.

DeepPurpose uses an encoder-decoder framework. Drug repurposing and screening are two applications that become available once we have a DTI model. The input to the model is a drug-target pair, where the drug is given as a simplified molecular-input line-entry system (SMILES) string and the target as an amino acid sequence. The output is a score indicating the binding activity of the drug-target pair. We begin with the expected data format.


(**Data**) DeepPurpose takes an array of drug SMILES strings (**d**), an array of target protein amino acid sequences (**t**), and an array of labels (**y**), which can be either binary 0/1 values indicating the interaction outcome or real numbers indicating affinity. The drug and target arrays should be paired, i.e. **y**\[0\] is the score for **d**\[0\] and **t**\[0\].
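To make the pairing concrete, the sketch below builds three aligned toy inputs and checks that they line up. The SMILES strings, sequences, and scores are made-up placeholders, `check_paired` is a hypothetical helper that is not part of DeepPurpose, and plain Python lists stand in for arrays.

```python
# Toy paired inputs: index i of each list describes the same drug-target pair.
d = ["CCO", "CC(=O)O", "c1ccccc1"]           # placeholder SMILES strings
t = ["MKKFFDSRRE", "QQPEGKH", "MKKFFDSRRE"]  # placeholder amino acid sequences
y = [7.365, 4.999, 5.0]                      # placeholder affinity scores


def check_paired(drugs, targets, labels):
    """Hypothetical helper: verify that the three inputs have equal length."""
    if not (len(drugs) == len(targets) == len(labels)):
        raise ValueError("drugs, targets, and labels must have the same length")
    return len(labels)


n_pairs = check_paired(d, t, y)
print(n_pairs, "pairs; first pair:", d[0], t[0], y[0])
```

The same drug or target may appear in several rows; what matters is that row `i` of all three inputs refers to the same measurement.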

Besides wrangling your data into NumPy arrays on your own, you can use one of the two ways DeepPurpose provides to help with data preparation.

The first way is to read from local files. For example, to load drug target pairs, we expect a file.txt where each line is a drug SMILES string, followed by a protein sequence, and an affinity score or 0/1 label:

```CC1=C...C4)N MKK...LIDL 7.365``` \
```CC1=C...C4)N QQP...EGKH 4.999```

Then, we use ```dataset.read_file_training_dataset_drug_target_pairs``` to load it.

In [2]:
X_drugs, X_targets, y = dataset.read_file_training_dataset_drug_target_pairs('./toy_data/dti.txt')
print('Drug 1: ' + X_drugs[0])
print('Target 1: ' + X_targets[0])
print('Score 1: ' + str(y[0]))

Drug 1: CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N
Target 1: MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTAEKRKGGQTVDSGLPLLSVSDPFIPLQVPDAPEKLIEGLKSPDTSLLLPDLLPMTDPFGSTSDAVIEKADVAVESLIPGLEPPVPQRLPSQTESVTSNRTDSLTGEDSLLDCSLLSNPTTDLLEEFAPTAISAPVHKAAEDSNLISGFDVPEGSDKVAEDEFDPIPVLITKNPQ
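For readers who want to produce such a file themselves, here is a minimal sketch that writes two placeholder rows in the layout described above and reads them back. It assumes the three fields are separated by single spaces, as the example lines suggest. `parse_pairs` is a hypothetical stand-in written for illustration, not the implementation behind `dataset.read_file_training_dataset_drug_target_pairs`.

```python
import os
import tempfile

# Two placeholder rows: SMILES, protein sequence, and score.
rows = [
    ("CCO", "MKKFFDSRRE", 7.365),
    ("CC(=O)O", "QQPEGKH", 4.999),
]

# Write one space-separated pair per line into a temporary file.
path = os.path.join(tempfile.mkdtemp(), "dti.txt")
with open(path, "w") as f:
    for smiles, sequence, score in rows:
        f.write(f"{smiles} {sequence} {score}\n")


def parse_pairs(file_path):
    """Hypothetical parser: split each non-empty line into three fields."""
    drugs, targets, scores = [], [], []
    with open(file_path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            smiles, sequence, score = fields
            drugs.append(smiles)
            targets.append(sequence)
            scores.append(float(score))
    return drugs, targets, scores


drugs, targets, scores = parse_pairs(path)
print(drugs[0], targets[0], scores[0])
```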

Many method researchers want to test on benchmark datasets such as KIBA, DAVIS, or BindingDB, so DeepPurpose also provides data loaders to ease preprocessing. For example, to load the DAVIS dataset, we can use ```dataset.load_process_DAVIS```. It downloads the data and preprocesses it into the designated format. It supports log-scale transformation of the labels for easier regression and also allows label binarization given a customized threshold.

In [3]:
X_drugs, X_targets, y = dataset.load_process_DAVIS(path = './data', binary = False, convert_to_log = True, threshold = 30)
print('Drug 1: ' + X_drugs[0])
print('Target 1: ' + X_targets[0])
print('Score 1: ' + str(y[0]))

Beginning Processing...
Beginning to extract zip file...
Default set to logspace (nM -> p) for easier regression
Done!
Drug 1: CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N
Target 1: MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTAEKRKGGQTVDSGLPLLSVSDPFIPLQVPDAPEKLIEGLKSPDTSLLLPDLLPMT
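The log message "nM -> p" suggests the usual conversion from a dissociation constant in nanomolar to its negative base-10 logarithm in molar units. The sketch below shows that arithmetic together with a threshold-based binarization. Both functions are hypothetical helpers, the exact formula used inside DeepPurpose may differ (for example by a small numerical offset), and the convention that values below the threshold become label 1 is an assumption based on smaller Kd meaning tighter binding.

```python
import math


def kd_nm_to_pkd(kd_nm):
    """Convert a dissociation constant in nM to pKd = -log10(Kd in M)."""
    # Kd in M is kd_nm * 1e-9, so -log10(kd_nm * 1e-9) = 9 - log10(kd_nm).
    return 9.0 - math.log10(kd_nm)


def binarize(kd_nm, threshold=30.0):
    """Assumed convention: Kd below the threshold (tighter binding) maps to 1."""
    return 1 if kd_nm < threshold else 0


kd_values = [10000.0, 30.0, 1.0]  # placeholder Kd values in nM
print([round(kd_nm_to_pkd(k), 3) for k in kd_values])
print([binarize(k) for k in kd_values])
```

The log scale compresses a range that spans several orders of magnitude into a few units, which tends to make regression targets easier to fit.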

In [4]:
X_drugs.shape

(30056,)

In [5]:
X_targets.shape

(30056,)

In [6]:
y.shape

(30056,)

For more detailed examples and tutorials on data loading, check out this [tutorial](./DEMO/load_data_tutorial.ipynb).

## Part II: Drug Target Interaction Prediction Framework

DeepPurpose provides a simple framework for DTI research, with 8 encoders for drugs and 7 for proteins. It consists of the following steps, each of which corresponds to one line of code:

- Encoder specification
- Data encoding and split
- Model configuration generation
- Model initialization
- Model Training
- Model Prediction and Repurposing/Screening
- Model Saving and Loading

Let's start with data encoding! 

(**Encoder specification**) After we obtain the required data format from Part I, we need to prepare them for the encoders. Hence, we first specify the encoder to use for drug and protein. Here we try MPNN for drug and CNN for target.

If MPNN and CNN are too heavy for your CPU, you can try smaller encoders by uncommenting the last line of the cell below:

In [7]:
drug_encoding, target_encoding = 'MPNN', 'CNN'
#drug_encoding, target_encoding = 'Morgan', 'Conjoint_triad'

Note that you can switch encoders just by changing the encoding names above. The full list of encoders is available [here](https://github.com/kexinhuang12345/DeepPurpose#encodings). Here, we use the message passing neural network (MPNN) encoder for the drug and the convolutional neural network (CNN) encoder for the protein.

(**Data encoding and split**) Now, we encode the data into the specified format using the ```utils.data_process``` function. It takes the train/validation/test split fractions and a random seed, so that the same data splits can be reproduced. The function also supports splitting methods such as ```cold_drug``` and ```cold_protein```, which split on drugs or proteins so that the model is tested on drugs or proteins it has not seen during training, as a check of model robustness.

The function outputs train, val, test pandas dataframes.

In [8]:
train, val, test = utils.data_process(X_drugs, X_targets, y, 
                                drug_encoding, target_encoding, 
                                split_method='random',frac=[0.7,0.1,0.2],
                                random_seed = 1)
train.head(1)

Drug Target Interaction Prediction Mode...
in total: 30056 drug-target pairs
encoding drug...
unique drugs: 68
encoding protein...
unique target sequence: 379
splitting dataset...
Done.


|   | SMILES | Target Sequence | Label | drug_encoding | target_encoding |
|---|---|---|---|---|---|
| 0 | CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC... | PFWKILNPLLERGTYYYFMGQQPGKVLGDQRRPSLPALHFIKGAGK... | 5.0 | [[[tensor(1.), tensor(0.), tensor(0.), tensor(... | [P, F, W, K, I, L, N, P, L, L, E, R, G, T, Y, ... |
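To illustrate what a cold-drug split means, the sketch below assigns whole drugs, rather than individual rows, to the train, validation, and test sets, so that no drug appears in more than one split. It is a simplified illustration of the idea under that assumption, not the code DeepPurpose runs for `split_method='cold_drug'`. The function name and the toy drug identifiers are hypothetical.

```python
import random


def cold_drug_split(drugs, frac=(0.7, 0.1, 0.2), seed=1):
    """Assign row indices to splits so that no drug is shared across splits."""
    unique_drugs = sorted(set(drugs))
    rng = random.Random(seed)
    rng.shuffle(unique_drugs)

    # Split the list of unique drugs, not the list of rows.
    n_train = round(len(unique_drugs) * frac[0])
    n_val = round(len(unique_drugs) * frac[1])
    train_drugs = set(unique_drugs[:n_train])
    val_drugs = set(unique_drugs[n_train:n_train + n_val])

    split = {"train": [], "val": [], "test": []}
    for i, drug in enumerate(drugs):
        if drug in train_drugs:
            split["train"].append(i)
        elif drug in val_drugs:
            split["val"].append(i)
        else:
            split["test"].append(i)
    return split


# Ten placeholder drugs, each paired with three targets (30 rows in total).
drugs = [f"drug_{i}" for i in range(10) for _ in range(3)]
split = cold_drug_split(drugs)
print({name: len(idx) for name, idx in split.items()})
```

A random split would instead shuffle the 30 rows directly, so the same drug could appear in both training and test data. A cold-protein split follows the same pattern with target sequences in place of drugs.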


(**Model configuration generation**) Now, we generate the model configuration. You can modify almost any hyperparameter (e.g., learning rate, number of epochs, batch size) and model parameter (e.g., hidden dimensions, filter sizes) in this function. The supported configurations are listed at this [link](https://github.com/kexinhuang12345/DeepPurpose/blob/e169e2f550694145077bb2af95a4031abe400a77/DeepPurpose/utils.py#L486).

For the sake of example, we set the number of epochs to 5 and keep the model parameters small so that the example runs quickly on both CPUs and GPUs and you can proceed to the next steps. For reference parameters, check out the notebooks in the DEMO folder.

In [9]:
config = utils.generate_config(drug_encoding = drug_encoding, 
                        target_encoding = target_encoding, 
                        cls_hidden_dims = [1024,1024,512],  # hidden layers of the predictor
                        train_epoch = 5, 
                        LR = 0.001, 
                        batch_size = 128,
                        hidden_dim_drug = 128,
                        hidden_dim_protein = 128,
                        mpnn_hidden_size = 128,
                        mpnn_depth = 3, 
                        cnn_target_filters = [32,64,96],
                        cnn_target_kernels = [4,8,12],
                        general_architecture_version = 'mlp',
                        cuda_id = '6',  # GPU index used in this run; adjust for your machine
                        wandb_project_name = 'DeepPurpose',
                        wandb_project_entity = 'diliadis',  # wandb account used in this run; replace with your own
                        use_early_stopping = True,
                        patience = 5,
                        delta = 0.001,
                        metric_to_optimize_early_stopping = 'loss',
                        )

In [10]:
config

{'input_dim_drug': 1024,
 'input_dim_protein': 8420,
 'hidden_dim_drug': 128,
 'hidden_dim_protein': 128,
 'cls_hidden_dims': [1024, 1024, 512],
 'batch_size': 128,
 'train_epoch': 5,
 'test_every_X_epoch': 20,
 'LR': 0.001,
 'drug_encoding': 'MPNN',
 'target_encoding': 'CNN',
 'result_folder': './result/',
 'binary': False,
 'num_workers': 0,
 'cuda_id': '6',
 'general_architecture_version': 'mlp',
 'experiment_name': None,
 'wandb_project_name': 'DeepPurpose',
 'wandb_project_entity': 'diliadis',
 'use_early_stopping': True,
 'patience': 5,
 'delta': 0.001,
 'metric_to_optimize_early_stopping': 'loss',
 'metric_to_optimize_best_epoch_selection': 'loss',
 'mpnn_hidden_size': 128,
 'mpnn_depth': 3,
 'cnn_target_filters': [32, 64, 96],
 'cnn_target_kernels': [4, 8, 12]}
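The early-stopping entries in the configuration (`use_early_stopping`, `patience`, `delta`) are not explained further in this tutorial. One common reading is that training stops once the monitored validation metric has failed to improve by at least `delta` for `patience` consecutive epochs. The class below is a minimal sketch of that logic under this assumption, with made-up loss values. It is not the code DeepPurpose uses, and `EarlyStopper` is a hypothetical name.

```python
class EarlyStopper:
    """Minimal early-stopping tracker for a metric where lower is better."""

    def __init__(self, patience=5, delta=0.001):
        self.patience = patience
        self.delta = delta
        self.best = float("inf")
        self.counter = 0

    def step(self, value):
        """Record one validation value; return True when training should stop."""
        if value < self.best - self.delta:
            self.best = value  # sufficient improvement: remember it, reset counter
            self.counter = 0
        else:
            self.counter += 1  # no sufficient improvement this epoch
        return self.counter >= self.patience


stopper = EarlyStopper(patience=5, delta=0.001)
val_losses = [0.90, 0.80, 0.81, 0.7995, 0.82, 0.83, 0.84, 0.85]  # placeholder values
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print("stopped at epoch", stopped_at, "with best loss", stopper.best)
```

Note that the fourth value (0.7995) is lower than the best so far but by less than `delta`, so it still counts as a non-improving epoch in this sketch.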

(**Model initialization**) Next, we initialize a model using the above configuration.

In [11]:
model = models.model_initialize(**config)
model

Using the following device: cuda:6
Using the MLP version of the architecture...


<DeepPurpose.DTI.DBTA at 0x7f146b5ce350>

In [12]:
model.model

MLP_Classifier(
  (model_drug): MPNN(
    (W_i): Linear(in_features=50, out_features=128, bias=False)
    (W_h): Linear(in_features=128, out_features=128, bias=False)
    (W_o): Linear(in_features=167, out_features=128, bias=True)
  )
  (model_protein): CNN(
    (conv): ModuleList(
      (0): Conv1d(26, 32, kernel_size=(4,), stride=(1,))
      (1): Conv1d(32, 64, kernel_size=(8,), stride=(1,))
      (2): Conv1d(64, 96, kernel_size=(12,), stride=(1,))
    )
    (fc1): Linear(in_features=96, out_features=128, bias=True)
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (predictor): ModuleList(
    (0): Linear(in_features=256, out_features=1024, bias=True)
    (1): Linear(in_features=1024, out_features=1024, bias=True)
    (2): Linear(in_features=1024, out_features=512, bias=True)
    (3): Linear(in_features=512, out_features=1, bias=True)
  )
)
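The layer sizes in the printed model can be traced back to the configuration values. The sketch below reproduces that arithmetic: the predictor input is the concatenation of the drug and protein embeddings (128 + 128 = 256), followed by the `cls_hidden_dims` layers and a single output unit. The helper `conv1d_out_len` and the input length of 1000 are hypothetical and only illustrate how unpadded, stride-1 convolutions shorten a sequence.

```python
# Values copied from the configuration shown earlier in this tutorial.
hidden_dim_drug = 128
hidden_dim_protein = 128
cls_hidden_dims = [1024, 1024, 512]

# The predictor consumes the concatenated drug and protein embeddings
# and ends in a single output unit.
dims = [hidden_dim_drug + hidden_dim_protein] + cls_hidden_dims + [1]
layers = list(zip(dims[:-1], dims[1:]))

# A Linear layer with bias has in_features * out_features + out_features parameters.
n_params = sum(i * o + o for i, o in layers)
print(layers)
print(n_params)


def conv1d_out_len(length, kernels, stride=1):
    """Output length after stacked Conv1d layers without padding."""
    for k in kernels:
        length = (length - k) // stride + 1
    return length


# 1000 is a hypothetical input length chosen for illustration only.
print(conv1d_out_len(1000, [4, 8, 12]))
```

The `(in_features, out_features)` pairs match the four `Linear` layers listed under `predictor` in the model printout.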

(**Model Training**) The model is now ready to train using the ```model.train``` function! If you do not have a test set, you can just use ```model.train(train, val)```.

In [13]:
model.train(train, val, test)

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: diliadis. Use `wandb login --relogin` to force relogin


--- Data Preparation ---
--- Go for Training ---
Training at Epoch 1 iteration 0 with loss 29.2203. Total time 0.0 hours
Training at Epoch 1 iteration 100 with loss 1.10130. Total time 0.01166 hours
Validation at Epoch 1 with loss:0.39055, MSE: 0.78777 , Pearson Correlation: 0.25437 with p-value: 1.32E-45 , Concordance Index: 0.62580
Training at Epoch 2 iteration 0 with loss 0.93652. Total time 0.02138 hours
Training at Epoch 2 iteration 100 with loss 0.77707. Total time 0.03305 hours
Validation at Epoch 2 with loss:0.96737, MSE: 0.75909 , Pearson Correlation: 0.35641 with p-value: 9.73E-91 , Concordance Index: 0.67717
-----------------------------EarlyStopping counter: 1 out of 5---------------------- best epoch currently 0
Training at Epoch 3 iteration 0 with loss 0.94964. Total time 0.0425 hours
Training at Epoch 3 iteration 100 with loss 0.96878. Total time 0.05444 hours
Validation at Epoch 3 with loss:0.65576, MSE: 0.76386 , Pearson Correlation: 0.38056 with p-value: 3.35E-104 , C

0,1
batch,▁▃▅▆█
best_val_MSE,▁
best_val_concordance_index,▁
best_val_loss,▁
best_val_pearson_correlation,▁
epoch,▁▃▅▆█
test_MSE,▁
test_concordance_index,▁
test_pearson_correlation,▁
train_batch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███

0,1
batch,825.0
best_val_MSE,0.69062
best_val_concordance_index,0.71401
best_val_loss,0.66291
best_val_pearson_correlation,0.40036
epoch,4.0
test_MSE,0.70074
test_concordance_index,0.70435
test_pearson_correlation,0.37993
train_batch,825.0
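The training log reports three regression metrics: MSE, Pearson correlation, and concordance index. As a rough guide to what they measure, the sketch below computes each one on a small made-up set of labels and predictions using their standard definitions. These are illustrative pure-Python versions, not the functions DeepPurpose calls internally, and the toy values are placeholders.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between labels and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mean_t) ** 2 for a in y_true)
    var_p = sum((b - mean_p) ** 2 for b in y_pred)
    return cov / math.sqrt(var_t * var_p)


def concordance_index(y_true, y_pred):
    """Fraction of pairs with different labels whose predictions are ordered the same way."""
    concordant, comparable = 0.0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue  # pairs with equal labels are not comparable
            comparable += 1
            diff_true = y_true[i] - y_true[j]
            diff_pred = y_pred[i] - y_pred[j]
            if diff_pred == 0:
                concordant += 0.5  # tied predictions count as half
            elif (diff_true > 0) == (diff_pred > 0):
                concordant += 1.0
    return concordant / comparable


y_true = [5.0, 6.0, 7.0, 8.0]  # placeholder affinity labels
y_pred = [5.5, 5.8, 7.2, 7.9]  # placeholder predictions
print(mse(y_true, y_pred), pearson(y_true, y_pred), concordance_index(y_true, y_pred))
```

A concordance index of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, which gives some context for the values around 0.7 in the summary above.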
