# Step-by-step notebook for reproducing the workflow from the RT prediction manuscript

Ensure that the requirements are met, i.e.:
- rdkit
- pytorch
- deepchem
- chemprop
- etc.




In [1]:
import pandas as pd 
#from rdkit import Chem
from rdkit import RDLogger 
import deepchem as dc
import os 

from functions.featurizers import get_METLIN_data, LogD_calculations, get_features
from functions.data_splitting import data_splitter, feature_splitter_csv, feature_splitter_diskdatasets


RDLogger.DisableLog('rdApp.*') # disable rdkit log
seed = 42


Skipped loading some Tensorflow models, missing a dependency. No module named 'tensorflow'
Skipped loading some PyTorch models, missing a dependency. No module named 'torch'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch'
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'torch'
Skipped loading some Jax models, missing a dependency. No module named 'jax'


## Featurization and data splitting

This workflow is run using the METLIN SMRT Dataset published by Domingo-Almenara et al. https://www.nature.com/articles/s41467-019-13680-7 

Any other dataset can be used, provided it contains the columns 'rt' and 'smiles'.
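
A custom dataset can be checked for the two required columns before running the workflow. The sketch below uses only the standard library; the helper name `missing_columns` and the example rows are illustrative assumptions and are not part of the repository.

```python
import csv
import io

REQUIRED_COLUMNS = ("rt", "smiles")  # column names required by this workflow


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in required if name not in header]


# Illustrative in-memory files; in practice pass an open file handle.
good = io.StringIO("id,smiles,rt\n1,CCO,100.0\n")
bad = io.StringIO("id,SMILES,retention_time\n1,CCO,100.0\n")

print(missing_columns(good))  # []
print(missing_columns(bad))   # ['rt', 'smiles']
```

The comparison is case-sensitive on purpose, since the column names are looked up literally.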

By default, we remove non-retained compounds from the METLIN SMRT dataset and randomly subsample 2000 compounds.

In [2]:
get_METLIN_data(sampling = 2000, only_retained = True)

saved to:  ../data/metlin_smrt\sample_dataset_2000_datapoints.csv
             id                                             smiles      rt
45507  53143824  CC(C)c1noc(-c2cc(=O)n(CC(=O)Nc3ccc(N(C)C)cc3)c...   737.1
1182    2966969  CCCCN1C(=O)[C@](N=C(O)c2ccccc2OC)(C(F)(F)F)C2=...  1076.0
29828  20952107             OC(=NCCc1ccco1)C1CCN(c2nc3cccnc3s2)CC1   697.1
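
The filtering and subsampling step can be pictured roughly as follows. This is a minimal sketch and not the implementation of `get_METLIN_data`: the function `filter_and_sample` is hypothetical, and the retention-time cutoff `min_rt` is a placeholder, because the threshold that defines "retained" is not stated in this notebook.

```python
import random


def filter_and_sample(records, min_rt, n_samples, seed=42):
    """Keep records with rt above min_rt, then draw a reproducible random sample."""
    retained = [r for r in records if r["rt"] > min_rt]
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    n = min(n_samples, len(retained))  # never ask for more rows than are available
    return rng.sample(retained, n)


# Toy records; the rt values and the cutoff are made up for illustration.
records = [{"id": i, "rt": float(rt)} for i, rt in enumerate([40, 50, 700, 800, 900, 1000])]
subset = filter_and_sample(records, min_rt=300.0, n_samples=3)
print(len(subset))  # 3
```

Fixing the seed makes the subsample reproducible across runs, which matters when the same 2000 compounds should feed every downstream model.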


Several feature sets are used for modeling:

- LogD descriptor set (pH 0.5 to 7.4)
- ECFP4-2048 fingerprints
- RDKit descriptors 
- molecular graph convolutions 

The feature_list, e.g. ``` feature_list = ['logD', 'ecfp4', 'rdkit', 'molgraphconv'] ```, defines which features are calculated.

Note that the ECFP4 fingerprints and RDKit descriptors are saved in two variants: a flat .csv file and a DeepChem DiskDataset object. Which one to use depends on the downstream application and models.


Note: LogD calculations rely on the proprietary software cxcalc from Chemaxon (https://docs.chemaxon.com/display/docs/cxcalc-command-line-tool.md). They require a path to a Chemaxon licence as well as a path to the Chemaxon tools. If these are not provided, the LogD calculations are skipped.


In [3]:
# file paths
path_to_data = '../data/metlin_smrt/sample_dataset_2000_datapoints.csv'
path_to_features = '../data/metlin_smrt/features/' 


In [4]:
feature_list = ['logD', 
                'ecfp4', 
                'rdkit', 
                'molgraphconv'
                ]

get_features(path_to_data, 
             path_to_features, feature_list, 
             path_to_chemaxon_licence= r"G:\Nuevolution\ChemAxon\licenses\license.cxl", 
             path_to_chemaxon_tools = r"C:\nuevolution\knime_temp")

ChemAxon-based LogD calculations - to .CSV
*** NA values in: ../data/metlin_smrt/features/logd_calculations
ECFP4 Featurization - to DiskDataset
ECFP4 Featurization - to .CSV
RDKit Descriptors - to diskdataset
RDKit Descriptors - to .CSV
MolGraphConv Feat - to diskdataset


### Data Splitting 

The data is split into a training set and a test set. In addition, the training set is split 5 times to allow for 5-fold cross-validation.

In [5]:
# base directory for the split data
path_to_splits = '../data/metlin_smrt/data_splits/'
os.makedirs(path_to_splits, exist_ok=True)  # create the directory if it is missing

data_splitter(path_to_data, path_to_splits, number_of_splits = 5, seed = seed)
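
For readers unfamiliar with this splitting scheme, here is a minimal standard-library sketch of a held-out test set combined with 5-fold cross-validation on the remaining data. It is not the implementation of `data_splitter`: the function `split_indices` is hypothetical, and the test fraction of 0.2 is an assumed value that this notebook does not specify.

```python
import random


def split_indices(n_items, test_fraction, n_folds, seed=42):
    """Return (test, folds): a held-out test set and n_folds (train, valid) pairs."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility

    n_test = int(round(n_items * test_fraction))
    test, train = indices[:n_test], indices[n_test:]

    # Deal the training indices into n_folds nearly equal validation chunks.
    chunks = [train[i::n_folds] for i in range(n_folds)]
    folds = []
    for k, valid in enumerate(chunks):
        # Each fold trains on every chunk except its own validation chunk.
        fold_train = [i for j, chunk in enumerate(chunks) if j != k for i in chunk]
        folds.append((fold_train, valid))
    return test, folds


# test_fraction=0.2 is an assumption made for this example only.
test, folds = split_indices(n_items=2000, test_fraction=0.2, n_folds=5)
print(len(test), [len(valid) for _, valid in folds])  # 400 [320, 320, 320, 320, 320]
```

Every training compound appears in exactly one validation chunk, and the test set never overlaps with any fold.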

The following functions take the previously calculated features and bundle them according to the splits above.

In [6]:

#* splitting the .CSV-based features according to the data splits
feature_splitter_csv(path_to_features, path_to_splits, save_folder_name = 'cv_splits')


#* splitting the diskDataset-based features according to the data splits
feature_splitter_diskdatasets(path_to_features,path_to_splits, save_folder_name = 'cv_splits')
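
Conceptually, bundling features by split amounts to selecting, for each split, the feature rows of the compounds assigned to it. The internals of `feature_splitter_csv` and `feature_splitter_diskdatasets` are not shown in this notebook, so the sketch below is only an assumption about the idea; `bundle_by_split` and the toy data are hypothetical.

```python
def bundle_by_split(features_by_id, split_ids):
    """Map each split name to the feature rows of the compounds in that split."""
    return {
        name: {cid: features_by_id[cid] for cid in ids if cid in features_by_id}
        for name, ids in split_ids.items()
    }


# Toy feature table keyed by compound id (values stand in for descriptor vectors).
features_by_id = {101: [0.1, 0.2], 102: [0.3, 0.4], 103: [0.5, 0.6]}
split_ids = {"train": [101, 103], "test": [102]}

bundles = bundle_by_split(features_by_id, split_ids)
print(bundles["train"])  # {101: [0.1, 0.2], 103: [0.5, 0.6]}
```

Compounds without calculated features (for example, rows with NA values in a descriptor file) are silently left out of the bundle in this sketch; a real pipeline may prefer to report them.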

#### Model training 