# ADMET Group classification tasks
This notebook benchmarks ZairaChem on the classification tasks of the ADMET benchmark group from Therapeutics Data Commons (TDC).

In [5]:
import os
import pandas as pd
import numpy as np

from tdc.benchmark_group import admet_group
group = admet_group(path = 'data/')

DATAPATH = "../data"
PREDPATH = "../predictions"

Downloading Benchmark Group...
100%|██████████| 1.47M/1.47M [00:00<00:00, 1.49MiB/s]
Extracting zip file...
Done!


In [6]:
admet_datasets = [
    "Bioavailability_Ma",
    "HIA_Hou",
    "Pgp_Broccatelli",
    "BBB_Martins",
    "CYP2C9_Veith",
    "CYP2D6_Veith",
    "CYP3A4_Veith",
    "CYP2C9_Substrate_CarbonMangels",
    "CYP2D6_Substrate_CarbonMangels",
    "CYP3A4_Substrate_CarbonMangels",
    "hERG",
    "AMES",
    "DILI",
]

## Data preparation

In [None]:
# Save the cleaned train and test sets as ZairaChem inputs
for a in admet_datasets:
    benchmark = group.get(a)
    train_val, test = benchmark['train_val'], benchmark['test']
    # Drop the identifier column before saving
    train_val = train_val.drop(columns=["Drug_ID"])
    test = test.drop(columns=["Drug_ID"])
    test.to_csv(os.path.join(DATAPATH, "{}_test.csv".format(a)), index=False)
    train_val.to_csv(os.path.join(DATAPATH, "{}_train.csv".format(a)), index=False)

## Model Training
The files in the /data folder can be used directly as ZairaChem inputs. A minimum of 5 folds per assay is required for the evaluation.

ZairaChem performs its own train/validation splits automatically, so the whole `train_val` set must be passed as input for model training.

```
zairachem fit -i <dataset>_train.csv -m <dataset>_fold1
zairachem predict -i <dataset>_test.csv -m <dataset>_fold1 -o <dataset>_pred_fold1
```
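
Because several folds are needed per dataset, the two commands above have to be repeated once per fold. The following is a minimal sketch of how those command lines could be generated. The helper name `zairachem_commands` and its `n_folds` argument are hypothetical, and only the `-i`, `-m`, and `-o` flags shown above are used. The commands are built as strings and are not executed here.

```python
def zairachem_commands(dataset, n_folds=5):
    """Build the fit/predict command lines for each fold (hypothetical helper)."""
    commands = []
    for fold in range(1, n_folds + 1):
        # Directory names follow the <dataset>_fold<N> pattern used above
        model_dir = "{}_fold{}".format(dataset, fold)
        pred_dir = "{}_pred_fold{}".format(dataset, fold)
        commands.append(
            "zairachem fit -i {}_train.csv -m {}".format(dataset, model_dir)
        )
        commands.append(
            "zairachem predict -i {}_test.csv -m {} -o {}".format(
                dataset, model_dir, pred_dir
            )
        )
    return commands


# Print the commands for one dataset so they can be reviewed before running
for cmd in zairachem_commands("HIA_Hou"):
    print(cmd)
```

Each printed line can then be run from a shell in the folder that contains the input files.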

## Model Evaluation
As an example, we provide the results of 8 folds of ZairaChem models for each dataset. The automated reports generated by ZairaChem, as well as the raw data outputs, are shared.

An example model is also provided in the /model folder.

*Some molecules cannot be processed by the ZairaChem pipeline because of the molecular descriptors used. The predictions for these molecules are set to the mean of all available predictions for the same dataset and fold, and the molecules are saved for manual inspection.*
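
The mean-imputation rule described in the note can be illustrated on a toy list of scores. This is a standard-library sketch with made-up values and a hypothetical function name; it assumes at least one prediction in the list is not NaN.

```python
import math


def fill_nan_with_mean(scores):
    """Replace NaN entries with the mean of the non-NaN entries."""
    valid = [s for s in scores if not math.isnan(s)]
    mean_val = sum(valid) / len(valid)  # fails if every entry is NaN
    return [mean_val if math.isnan(s) else s for s in scores]


# Toy predictions: the second molecule has no prediction
toy_scores = [0.25, float("nan"), 0.75]
filled = fill_nan_with_mean(toy_scores)
print(filled)
```

The missing value is replaced by the average of the two available scores, so every test molecule ends up with a prediction.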

In [8]:

predictions_list = []
nan_mols = {}

for i in range(1, 6):
    predictions = {}
    for a in admet_datasets:
        path = os.path.join(PREDPATH, a)
        pred_file = pd.read_csv(os.path.join(path, "fold{}".format(i), "output_table.csv"))
        y_pred_test = pred_file["pred-value"]
        if i == 1:  # record molecules without predictions once, from the first fold
            nan_mols[a] = pred_file.loc[y_pred_test.isna(), "input-smiles"].tolist()
        # replace NaN values with the mean of the available predictions
        arr = np.array(y_pred_test, dtype=float)
        arr[np.isnan(arr)] = np.nanmean(arr)
        predictions[a] = arr.tolist()
    predictions_list.append(predictions)

In [9]:
group.evaluate_many(predictions_list)

{'bioavailability_ma': [0.74, 0.017],
 'hia_hou': [0.957, 0.014],
 'pgp_broccatelli': [0.944, 0.002],
 'bbb_martins': [0.93, 0.003],
 'cyp2c9_veith': [0.792, 0.003],
 'cyp2d6_veith': [0.65, 0.09],
 'cyp3a4_veith': [0.872, 0.002],
 'cyp2c9_substrate_carbonmangels': [0.421, 0.038],
 'cyp2d6_substrate_carbonmangels': [0.725, 0.006],
 'cyp3a4_substrate_carbonmangels': [0.656, 0.005],
 'herg': [0.861, 0.012],
 'ames': [0.863, 0.003],
 'dili': [0.936, 0.008]}
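
Each entry in the output above is a pair that reads as [mean, standard deviation] of the benchmark metric across the submitted folds. The following is a minimal sketch of such an aggregation, assuming a population standard deviation and rounding to three decimals. The per-fold scores are made up and the function name is hypothetical, so this should not be read as the TDC implementation.

```python
import statistics


def aggregate_folds(scores_per_fold):
    """Summarise per-fold metric values as [mean, std], rounded to 3 decimals."""
    mean = statistics.mean(scores_per_fold)
    std = statistics.pstdev(scores_per_fold)  # population standard deviation (assumption)
    return [round(mean, 3), round(std, 3)]


# Made-up metric values for five folds of a single dataset
toy_fold_scores = [0.90, 0.92, 0.94, 0.92, 0.92]
summary = aggregate_folds(toy_fold_scores)
print(summary)
```

A small second value therefore indicates that the folds agree closely, while a larger one, as for `cyp2d6_veith` above, indicates more variation between folds.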