# pIC50 Prediction Task

In a drug discovery program, a critical task is to “screen” a library of compounds for
molecules that can bind to a target protein and potentially inhibit its activity (we call
this readout “potency”). Because large-scale experimental screening is prohibitively
expensive, virtual (in silico) screening serves as an initial step. It significantly
reduces costs while allowing a large number of small molecules to be evaluated and
prioritized.

A variety of methods are available for virtual screening, including ligand-based machine
learning models that take the molecular structure as input and predict activity.
This notebook explores a dataset of 4.6k compounds that have been tested experimentally
against the Epidermal Growth Factor Receptor (EGFR) kinase, a target associated with
various cancers, and then predicts the potency value (pIC50) of novel compounds targeting
EGFR using a pretrained foundation model.
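
For readers unfamiliar with the target variable: pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units, so higher values mean more potent compounds. The sketch below illustrates the conversion. The helper name `ic50_to_pic50` is hypothetical, and it assumes IC50 values reported in nanomolar; the units used in this dataset are not stated here.

```python
import math


def ic50_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 given in nM (assumed unit) to pIC50.

    pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)


if __name__ == "__main__":
    # Print the conversion for a few illustrative concentrations.
    for ic50 in (1.0, 100.0, 10_000.0):
        print(f"IC50 = {ic50:>8.1f} nM  ->  pIC50 = {ic50_to_pic50(ic50):.2f}")
```

Because the scale is logarithmic, a tenfold drop in IC50 corresponds to an increase of one pIC50 unit.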


In [57]:
#creating deepcopy of model instances
from copy import deepcopy

#Python standard libraries
import time
import warnings
from pathlib import Path
from warnings import filterwarnings

#data handling
import pandas as pd
import numpy as np

#XGBoost library
import xgboost as xgb

#visualization
import seaborn as sns
import matplotlib.pyplot as plt

import torch

#selected plotting functions
# from statsmodels.graphics.tsaplots import plot_acf,plot_pacf

#sklearn models and metric functions
from sklearn import svm, metrics, clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold, train_test_split
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
from skopt import BayesSearchCV
from sklearn.metrics import (accuracy_score, auc, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score, roc_curve)

#FP generators
from rdkit import Chem
from rdkit.Chem import MACCSkeys, rdFingerprintGenerator

#Silence some expected warnings
filterwarnings("ignore")

#Fix seeds for reproducible results (NumPy and PyTorch)
SEED = 22
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7fc21c783a50>

In [None]:
#import train and test datasets
train = pd.read_csv("pIC50_prediction/data/train.csv")
test = pd.read_csv("pIC50_prediction/data/test.csv")

In [None]:
#import pretrained model