# MoleculeACE - ChEMBL cliff training

Once the desired encoders have been pre-trained with the accompanying `encoder_pretraining` and placed in the [name of corresponding folder], we train them on the datasets included in MoleculeACE [1] in preparation for evaluation.  
The following ChEMBL datasets were chosen according to the criteria specified in the accompanying thesis [2].  

* ChEMBL234 - Dopamine D3 receptor
* ChEMBL4203 - Dual specificity protein kinase
* ChEMBL2047 - Farnesoid X receptor
* ChEMBL4616 - Ghrelin receptor
* ChEMBL264 - Histamine H3 receptor
* ChEMBL2835 - Janus kinase 1
* ChEMBL4792 - Orexin receptor 2
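
For scripting over these targets, the list above can be kept as a small lookup table. The sketch below is illustrative only: the `_Ki` / `_EC50` dataset keys are copied from the `datasets` tuple in the deprecated section of this notebook, while the names `CHEMBL_TARGETS` and `dataset_key` are hypothetical helpers, not part of MoleculeACE or this repository.

```python
# Lookup table: ChEMBL target ID -> (target name, MoleculeACE dataset key).
# Target names come from the list above; dataset keys come from the
# `datasets` tuple in the deprecated section of this notebook.
CHEMBL_TARGETS = {
    "CHEMBL234": ("Dopamine D3 receptor", "CHEMBL234_Ki"),
    "CHEMBL4203": ("Dual specificity protein kinase", "CHEMBL4203_Ki"),
    "CHEMBL2047": ("Farnesoid X receptor", "CHEMBL2047_EC50"),
    "CHEMBL4616": ("Ghrelin receptor", "CHEMBL4616_EC50"),
    "CHEMBL264": ("Histamine H3 receptor", "CHEMBL264_Ki"),
    "CHEMBL2835": ("Janus kinase 1", "CHEMBL2835_Ki"),
    "CHEMBL4792": ("Orexin receptor 2", "CHEMBL4792_Ki"),
}


def dataset_key(chembl_id: str) -> str:
    """Return the dataset key for a ChEMBL target ID (case-insensitive)."""
    return CHEMBL_TARGETS[chembl_id.upper()][1]


# Print one row per target: ID, dataset key, target name.
for chembl_id, (target, key) in CHEMBL_TARGETS.items():
    print(f"{chembl_id:<12}{key:<18}{target}")
```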

## Setup

In [None]:
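import os.path

try:
    # On Google Colab, mount Google Drive and work from the project folder there.
    from google.colab import drive
    drive.mount('/content/drive')
    _home = 'drive/MyDrive/tlacamr'
except ImportError:
    # Outside Colab, fall back to the user's home directory.
    # expanduser is needed because os.path.join does not expand '~'.
    _home = os.path.expanduser('~')

project_root = os.path.join(_home, 'tlacamr')
print(project_root)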

Mounted at /content/drive
drive/MyDrive/tlacamr/tlacamr


In [None]:
%cd $project_root
!pip install .
# Once the repo is public, the install statement should look like this:
# !pip install git+https://github.com/my-user/my-repo

/content/drive/MyDrive/tlacamr/tlacamr
Processing /content/drive/MyDrive/tlacamr/tlacamr
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting lightning>=2.0.0 (from acsuite==0.1)
  Downloading lightning-2.1.3-py3-none-any.whl (2.0 MB)
Collecting torchmetrics>=0.11.4 (from acsuite==0.1)
  Downloading torchmetrics-1.3.0.post0-py3-none-any.whl (840 kB)
Collecting hydra-core==1.3.2 (from acsuite==0.1)
  Downloading hydra_core-1.3.2-py3-none-any.whl (154 kB)
Collecting hydra-colorlog==1.2

## Model imports




## Training

### Classification

#### MLP 2048

In [None]:
!HYDRA_FULL_ERROR=1 python3 src/train.py +experiment/property_prediction/classification=mlp_2048

Streaming output truncated to the last 5000 lines.
Epoch 67/149  5/19  0:00:00 • 0:00:01  58.14it/s  v_num: iu2m  val/loss: 1.675  val/AUROC: 0.758  val/AUROC_best: 0.801  train/loss: 0.010
Epoch 67/149  6/19  0:00:00 • 0:00:01  59.08it/s  v_num: iu2m  val/loss: 1.675  val/AUROC: 0.758

#### MLP 256

In [None]:
!HYDRA_FULL_ERROR=1 python3 src/train.py experiment=property_prediction/jointautoencoder/classification/ChEMBL234 ++trainer.accelerator=gpu

### Regression

#### MLP 256

In [None]:
!HYDRA_FULL_ERROR=1 python3 src/train.py experiment=property_prediction/jointautoencoder/regression/ChEMBL234 ++trainer.accelerator=gpu
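
The two `jointautoencoder` commands above target ChEMBL234 only. A minimal sketch of how the same command line could be generated for every selected target is given below. It assumes that an experiment config exists for each task and ChEMBL ID under `experiment/property_prediction/jointautoencoder/`; only the ChEMBL234 configs are confirmed by the cells above. `build_command` is a hypothetical helper, and the sketch only prints the commands rather than running them.

```python
import shlex
from typing import List

# ChEMBL IDs from the dataset list at the top of this notebook.
CHEMBL_IDS = [
    "ChEMBL234", "ChEMBL4203", "ChEMBL2047", "ChEMBL4616",
    "ChEMBL264", "ChEMBL2835", "ChEMBL4792",
]
TASKS = ["classification", "regression"]


def build_command(task: str, chembl_id: str, accelerator: str = "gpu") -> List[str]:
    """Build the argument list for one training run.

    Assumption: a config named
    property_prediction/jointautoencoder/<task>/<chembl_id> exists.
    """
    return [
        "python3",
        "src/train.py",
        f"experiment=property_prediction/jointautoencoder/{task}/{chembl_id}",
        f"++trainer.accelerator={accelerator}",
    ]


commands = [build_command(task, cid) for task in TASKS for cid in CHEMBL_IDS]

# Print each command in the same form as the notebook cells above.
for cmd in commands:
    print("HYDRA_FULL_ERROR=1 " + shlex.join(cmd))
```

Each printed line can be pasted into a notebook cell with a leading `!`, or the argument lists can be passed to `subprocess.run` with `HYDRA_FULL_ERROR=1` set in the environment.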

#### MLP 2048

## References

[1] Derek van Tilborg, Alisa Alenicheva, and Francesca Grisoni. “Exposing the Limitations of Molecular Machine Learning with Activity Cliffs”. In: Journal of Chemical Information and Modeling 62.23 (Dec. 2022), pp. 5938–5951. DOI: 10.1021/acs.jcim.2c01073. URL: https://doi.org/10.1021/acs.jcim.2c01073.  
[2] César Miguel Valdez Córdova. Towards learning activity cliff-aware molecular representations. Publication pending.

## Deprecated

In [None]:
from MoleculeACE import MLP, Data, Descriptors, calc_rmse, calc_cliff_rmse, get_benchmark_config

import datamol as dm
import torch
from molfeat.calc import FP_FUNCS, FPCalculator
from molfeat.trans.concat import FeatConcat
from molfeat.trans import MoleculeTransformer

In [None]:
# MoleculeACE dataset keys for the seven selected targets
datasets = ('CHEMBL234_Ki', 'CHEMBL4203_Ki', 'CHEMBL2047_EC50', 'CHEMBL4616_EC50',
            'CHEMBL264_Ki', 'CHEMBL2835_Ki', 'CHEMBL4792_Ki')
algorithm = MLP
dataset = 'CHEMBL4203_Ki'
descriptor = Descriptors.ECFP

# Load data
data = Data(dataset)

# Get the already optimized hyperparameters
hyperparameters = get_benchmark_config(dataset, algorithm, descriptor)

# Featurize the SMILES strings so that data.x_train and data.x_test are populated
data(descriptor)

In [None]:
# Featurize the train/test SMILES with molfeat (2048-bit ECFP, radius 4)
train_smiles = data.smiles_train
test_smiles = data.smiles_test
featurizer = MoleculeTransformer(FPCalculator('ecfp', length=2048, radius=4))
featurized_train = torch.as_tensor(featurizer(train_smiles), dtype=torch.float32)
featurized_test = torch.as_tensor(featurizer(test_smiles), dtype=torch.float32)

In [None]:
# Train and use a model for prediction
model = algorithm(**hyperparameters)

model.train(data.x_train, data.y_train)
y_hat = model.predict(data.x_test)

# Evaluate your model on activity cliff compounds
rmse = calc_rmse(data.y_test, y_hat)
rmse_cliff = calc_cliff_rmse(y_test_pred=y_hat, y_test=data.y_test, cliff_mols_test=data.cliff_mols_test)

print(f"rmse: {rmse}")
print(f"rmse_cliff: {rmse_cliff}")
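
To make the two metrics printed above concrete, here is a minimal pure-Python sketch of what they measure, assuming that the cliff RMSE is the ordinary RMSE restricted to test molecules flagged as activity cliffs (flag value 1). This is not MoleculeACE's implementation of `calc_rmse` or `calc_cliff_rmse`; the function names and the toy values are made up for illustration.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error over all molecules."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def cliff_rmse(y_true: Sequence[float], y_pred: Sequence[float],
               cliff_mols: Sequence[int]) -> float:
    """RMSE restricted to molecules flagged as activity cliffs (flag == 1)."""
    pairs = [(t, p) for t, p, c in zip(y_true, y_pred, cliff_mols) if c == 1]
    if not pairs:
        raise ValueError("no activity cliff molecules in the test set")
    cliff_true, cliff_pred = zip(*pairs)
    return rmse(cliff_true, cliff_pred)


# Toy values (made up): the last two molecules are activity cliffs.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.0, 6.0, 8.0, 6.0]
cliff_flags = [0, 0, 1, 1]

print(f"rmse: {rmse(y_true, y_pred):.3f}")                           # rmse: 1.118
print(f"rmse_cliff: {cliff_rmse(y_true, y_pred, cliff_flags):.3f}")  # rmse_cliff: 1.581
```

In this toy case all of the error sits on the cliff molecules, so the cliff RMSE is larger than the overall RMSE. That gap is what the evaluation on activity cliff compounds is meant to expose.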