Before running this code, make sure that you have the following sets of descriptors. The full list of descriptors used is given in the Supplementary Information, Table 4.

1. Physicochemical descriptors of the ligands, calculated in the KNIME Analytics Platform using the CDK Molecular Properties, Indigo Molecule Properties and RDKit Descriptor Calculation nodes

2. Protein descriptors, calculated with the iFeature Python package (https://github.com/Superzchen/iFeature)

3. Rescoring values for the docking poses obtained from Glide, calculated with the ITScore_Aff (http://huanglab.phys.hust.edu.cn/ITScoreAff/) and Cyscore (http://clab.labshare.cn/software/) scoring functions

4. Docking descriptors extracted from Glide, together with the calculated SIFt descriptors (Tasks -> Interaction Fingerprints)



---



DUD-E decoys should be prepared to represent non-binding molecules (https://dude.docking.org/). To better simulate a real-world virtual screening scenario, the number of DUD-E decoys should be approximately 50 times greater than the size of your ligand dataset.
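
The decoy-to-ligand ratio described above can be checked with a few lines of arithmetic. The following is a minimal sketch: the helper names `decoys_needed` and `decoy_ratio` are hypothetical and not part of the original workflow, and the factor of 50 is taken from the recommendation above.

```python
# Hypothetical helpers for planning the decoy set size (illustrative only).

def decoys_needed(n_ligands: int, ratio: int = 50) -> int:
    """Return the approximate number of decoys for a given number of ligands."""
    if n_ligands < 0:
        raise ValueError("n_ligands must be non-negative")
    return n_ligands * ratio


def decoy_ratio(n_ligands: int, n_decoys: int) -> float:
    """Return the achieved decoy-to-ligand ratio of an assembled dataset."""
    if n_ligands <= 0:
        raise ValueError("n_ligands must be positive")
    return n_decoys / n_ligands


# Example with an assumed ligand count of 200
target_decoys = decoys_needed(200)
achieved_ratio = decoy_ratio(200, target_decoys)
print(f"Target decoys: {target_decoys}, ratio: {achieved_ratio:.1f}")
```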


In [None]:
# Install required libraries (the %pip magic runs pip from inside a notebook cell)
%pip install autogluon
%pip install PyTDC
import os
import pandas as pd
import numpy as np
from sklearn.metrics import average_precision_score, fbeta_score
from autogluon.tabular import TabularDataset, TabularPredictor
from autogluon.core.metrics import make_scorer
from tdc import Evaluator
from datetime import datetime

Upload your dataset. The data should be organized as a table in which each row represents one docking pose of a CYP450-ligand complex and each column represents one of the associated descriptors (listed in Supplementary Information, Table 4). The ligand and protein descriptors associated with the ligand SMILES and the crystal structure used are included in the same row. The "Class" column should be binary and indicate whether the molecule is a ligand or a decoy.

In [None]:
# Load your training and validation datasets
train_data = TabularDataset('train_dataset.csv')
test_data = TabularDataset('validation_dataset.csv')
label = "Class"
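
Before training, it may help to confirm that the table has the expected layout. The sketch below checks that the label column exists and is binary, and reports the class balance. The function `check_dataset`, the toy DataFrame and its descriptor column names are all hypothetical; in practice the check would be run on the loaded training and validation tables.

```python
import pandas as pd


def check_dataset(df: pd.DataFrame, label: str = "Class") -> pd.Series:
    """Validate the label column and return the number of rows per class."""
    if label not in df.columns:
        raise ValueError(f"Missing label column: {label!r}")
    values = set(df[label].dropna().unique())
    if len(values) != 2:
        raise ValueError(f"Expected a binary label, found values: {sorted(values)}")
    return df[label].value_counts()


# Toy table: one row per docking pose, hypothetical descriptor columns
toy = pd.DataFrame(
    {
        "docking_score": [-7.2, -5.1, -6.3, -4.8],
        "mol_weight": [312.4, 287.3, 401.5, 256.2],
        "Class": [1, 0, 0, 0],
    }
)
counts = check_dataset(toy)
print(counts)
```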

In [None]:
#Add logAUC metric
evaluator = Evaluator(name="range_logAUC")
ag_logauc_scorer = make_scorer(name="evaluator", score_func=evaluator, optimum=1, greater_is_better=True)

Because we aim to train the model multiple times for statistical analysis, each iteration should be assigned a unique identifier so that it can be accessed and evaluated individually. "medium" is a default AutoGluon preset that provides fast training and is well suited to initial prototyping. For further customisation, we recommend setting the maximum number of epochs and the num_workers parameter according to your available computational resources. AutoGluon uses max_epochs = 10 by default, but you can increase this value if your data has complex patterns. Tune num_workers to your available GPU. Customization details are described at https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html.

In [None]:
#Start training
model_path = f"model_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
print(f"Training has started. The results will be saved in {model_path}")

predictor = TabularPredictor(label=label, path=model_path, eval_metric=ag_logauc_scorer).fit(
    train_data,
    presets="medium"
)

print("Training is complete")
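
When several iterations are launched in quick succession, a timestamp with one-second resolution may not be unique on its own. The sketch below is one possible way to build the identifiers: it appends a run index to the timestamp. The helper `make_run_path`, the `prefix` argument and the choice of three runs are assumptions for illustration only; each path would then be passed as `path` to a separate training call.

```python
from datetime import datetime


def make_run_path(run_index: int, prefix: str = "model") -> str:
    """Build an output directory name from a timestamp and a run index."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{prefix}_{stamp}_run{run_index:02d}"


# Generate identifiers for three hypothetical training iterations
run_paths = [make_run_path(i) for i in range(3)]
for path in run_paths:
    print(f"Results for this iteration would be saved in {path}")
```

The run index keeps the names distinct even if two iterations start within the same second.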


Evaluate the validation results using LogAUC. AutoGluon also provides predefined metrics, such as ROC_AUC and the Matthews correlation coefficient. You can add further metrics (e.g. AUPR, F2) by editing the "evaluate" function.

In [None]:
#Evaluate validation results
y_pred = predictor.predict(test_data.drop(columns=[label]))
metrics = predictor.evaluate(test_data, silent=True)
print(f"Test metrics: {metrics}")
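
AUPR and F2 can also be computed directly with scikit-learn. The sketch below uses small made-up arrays purely for illustration; in the real workflow, `y_true` would be the "Class" column of the validation set and `y_score` would presumably come from the predictor's `predict_proba` output for the positive class. The 0.5 decision threshold is likewise an assumption.

```python
import numpy as np
from sklearn.metrics import average_precision_score, fbeta_score

# Illustrative labels (1 = ligand, 0 = decoy) and predicted probabilities
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])
y_score = np.array([0.9, 0.2, 0.4, 0.7, 0.1, 0.6, 0.3, 0.35])

# AUPR is computed from the continuous scores
aupr = average_precision_score(y_true, y_score)

# F2 needs hard class labels; beta=2 weights recall more heavily than precision
y_pred = (y_score >= 0.5).astype(int)
f2 = fbeta_score(y_true, y_pred, beta=2)

print(f"AUPR: {aupr:.3f}, F2: {f2:.3f}")
```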

In [None]:
# Save test metrics to a separate CSV file
eval_path = f"evaluation_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
os.makedirs(eval_path, exist_ok=True)
metrics_df = pd.DataFrame([metrics])  # Convert dict to DataFrame
metrics_df.to_csv(f"{eval_path}/test_metrics.csv", index=False)
print(f"Test metrics saved to {eval_path}/test_metrics.csv")