## ML Exercise

The dataset below comes from the paper "Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships" (https://pubs.acs.org/doi/10.1021/ci500747n). The code imports it and pre-processes the training and test sets so that both contain the same features.

In [2]:
# The datasets report inhibition of HIV integrase in a cell-based assay, expressed as −log(IC50) M

In [3]:
## Import dataset with the defined train and test partitions

import pandas as pd
dataset_training = pd.read_csv("HIVINT_training_disguised.csv")
print("Original dataset size (train and test):")
print(dataset_training.shape)

dataset_test = pd.read_csv("HIVINT_test_disguised.csv")
print(dataset_test.shape)

## filter datasets to contain only mutual features
common_cols = list(set(dataset_training.columns) & set(dataset_test.columns))

dataset_training_eq, dataset_test_eq = dataset_training[common_cols], dataset_test[common_cols]

print("Dataset size with equal features (train and test):")
print(dataset_training_eq.shape)
print(dataset_test_eq.shape)

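## Set the molecule ID as index
## (reassign rather than use inplace=True, since both frames are column selections of the originals)
dataset_training_eq = dataset_training_eq.set_index('MOLECULE')
dataset_test_eq = dataset_test_eq.set_index('MOLECULE')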

## Separate training and testing dataset as input and output
## inputs (X) are the descriptors and outputs (Y) are the activity values
x_train, y_train = dataset_training_eq.loc[:,dataset_training_eq.columns != "Act"], dataset_training_eq["Act"]
x_test, y_test = dataset_test_eq.loc[:,dataset_test_eq.columns != "Act"], dataset_test_eq["Act"]

print("Final dataset size (train and test): ")
print(x_train.shape,y_train.shape)
print(x_test.shape,y_test.shape)

Original dataset size (train and test):
(1815, 4188)
(598, 2952)
Dataset size with equal features (train and test):
(1815, 2832)
(598, 2832)
Final dataset size (train and test): 
(1815, 2830) (1815,)
(598, 2830) (598,)
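
The column alignment above relies on a set intersection, which gives both frames the same columns within one run but does not guarantee the same column order from one run to the next. A minimal sketch of an order-preserving alternative, using small made-up frames (the `D_*` column names and values are hypothetical stand-ins for the real descriptors):

```python
import pandas as pd

# Toy frames standing in for the training and test CSVs (hypothetical data).
train = pd.DataFrame({"MOLECULE": ["m1", "m2"], "Act": [5.1, 6.3],
                      "D_1": [0, 1], "D_2": [2, 0], "D_3": [1, 1]})
test = pd.DataFrame({"MOLECULE": ["m3"], "Act": [4.8],
                     "D_2": [1], "D_3": [0], "D_4": [3]})

# Keep the columns present in both frames, in the order they appear in `train`.
test_cols = set(test.columns)
common_cols = [c for c in train.columns if c in test_cols]

# set_index returns a new frame, so no inplace modification is needed.
train_eq = train[common_cols].set_index("MOLECULE")
test_eq = test[common_cols].set_index("MOLECULE")

print(common_cols)
print(list(train_eq.columns) == list(test_eq.columns))
```

With a fixed column order, feature indices (for example in a selector mask or feature importances) refer to the same descriptor every time the notebook is rerun.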


The output variable (Y) measures compound activity and is therefore numeric. The input variables are calculated from the compound descriptors (fingerprints). The goal of the exercise is to obtain the best possible model for the dataset.

In [4]:
# imports
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.model_selection import GridSearchCV

In [5]:
# The output variable is continuous; the classifiers used below require categorical labels, so it is label-encoded.

In [6]:
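# Convert y values to categorical labels.
# Note: LabelEncoder assigns one class per distinct activity value, and the
# encoder is refit for each split, so the same integer code does not refer to
# the same activity value in y_test_categorical and y_train_categorical.
label_encoder = preprocessing.LabelEncoder()
y_test_categorical = label_encoder.fit_transform(y_test)
y_train_categorical = label_encoder.fit_transform(y_train)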

#print(y_test_categorical)
#print(y_train_categorical)
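
To illustrate what label encoding does to a continuous target, here is a small sketch with made-up activity values (the arrays `y_train_toy` and `y_test_toy` are hypothetical, not taken from the dataset). It shows that each distinct value becomes its own class, and that refitting the encoder on a second array gives the same value a different code:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up activity values (hypothetical); 5.0 appears in both arrays.
y_train_toy = np.array([4.2, 5.0, 5.0, 6.1, 7.3])
y_test_toy = np.array([5.0, 6.6])

enc = LabelEncoder()

# Fitting on the test values: classes are [5.0, 6.6].
test_codes = enc.fit_transform(y_test_toy)
print(test_codes)    # [0 1]

# Refitting on the training values: classes are [4.2, 5.0, 6.1, 7.3].
train_codes = enc.fit_transform(y_train_toy)
print(train_codes)   # [0 1 1 2 3]

# The value 5.0 is code 1 in the training labels but code 0 in the test labels.
print(len(enc.classes_), "classes for", len(y_train_toy), "training values")
```

With a real-valued target, most values are unique or nearly so, which leaves very few examples per class. Treating the problem as regression avoids the encoding step altogether.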

#### Data Standardization
Standardize input variables (descriptors).

In [7]:
scaler = StandardScaler()
scaler.fit(x_train)
sc_x_train = scaler.transform(x_train)
sc_x_test = scaler.transform(x_test)
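
As a worked check of what the cell above computes, the sketch below uses a tiny made-up matrix (hypothetical values) to confirm that `StandardScaler` applies the training mean and the training population standard deviation to the test rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up descriptor matrices (hypothetical values).
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test_toy = np.array([[4.0, 15.0]])

# Fit on the training rows only, then reuse those statistics for the test rows.
scaler_toy = StandardScaler().fit(X_train_toy)
scaled_test = scaler_toy.transform(X_test_toy)

# Manual equivalent: z = (x - train_mean) / train_std, with ddof=0.
manual = (X_test_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)

print(scaled_test)
print(np.allclose(scaled_test, manual))
```

Because the test rows are scaled with training statistics, their columns will generally not have exactly zero mean and unit variance, which is expected.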

#### Test different models

In [8]:
# The article cites RF and SVM as among the most predictive methods for QSAR problems, with RF as the gold standard.

In [9]:
#Random Forest Classifier
rf_model = RandomForestClassifier()

scores_rf = cross_val_score(rf_model, sc_x_train, y_train_categorical, cv = 5).mean()
scores_rf



0.06446280991735537

In [10]:
#Support Vector Classifier
svm_model = svm.SVC()

scores_svm = cross_val_score(svm_model, sc_x_train, y_train_categorical, cv=5).mean()
scores_svm



0.08650137741046833

In [11]:
# K-Neighbors Classifier
knn_model = KNeighborsClassifier()

scores_knn = cross_val_score(knn_model, sc_x_train, y_train_categorical, cv = 5).mean()
scores_knn



0.061157024793388436
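
Since the activity is a continuous value, the same three model families can also be compared as regressors. The sketch below is one possible setup, not the method used in this notebook: it runs on synthetic data from `make_regression` (an assumption, standing in for the descriptors) and scores with R², one common regression metric.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the descriptor matrix and activity values (assumption).
X, y = make_regression(n_samples=150, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Scaling is placed inside the pipeline so it is refit on each training fold.
models = {
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
    "svr": make_pipeline(StandardScaler(), SVR()),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor()),
}

# Mean 5-fold cross-validated R^2 for each model.
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in results.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

On the real data, `X` and `y` would be replaced by `x_train` and the unencoded `y_train`.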

#### Feature Selection

In [12]:
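# Drop low-variance descriptors. The selector is fit on the unscaled x_train, because
# standardized columns all have unit variance; the y argument is ignored by VarianceThreshold.
selector = VarianceThreshold(threshold=0.1)
filter_feat = selector.fit_transform(x_train, y_train_categorical)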

print(sc_x_train.shape, sc_x_test.shape)
print(filter_feat.shape)

(1815, 2830) (598, 2830)
(1815, 1280)
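
The variance filter only has an effect on unscaled inputs. The sketch below uses three made-up columns (hypothetical data) to show that the same threshold drops a rarely set binary column on raw data but keeps it once the data is standardized, and that the fitted selector can be reused on new rows with `transform`:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

n = 200
spread = np.arange(n, dtype=float)   # large variance
rare = np.zeros(n)
rare[:4] = 1.0                       # variance 0.02 * 0.98 = 0.0196
constant = np.zeros(n)               # variance 0
X_toy = np.column_stack([spread, rare, constant])

# On raw data, only the high-variance column passes the 0.1 threshold.
selector_raw = VarianceThreshold(threshold=0.1).fit(X_toy)
raw_mask = selector_raw.get_support()

# After standardization every non-constant column has variance 1.
X_toy_scaled = StandardScaler().fit_transform(X_toy)
scaled_mask = VarianceThreshold(threshold=0.1).fit(X_toy_scaled).get_support()

print(raw_mask)      # [ True False False]
print(scaled_mask)   # [ True  True False]

# The fitted selector applies the same column mask to unseen rows.
X_new = np.array([[5.0, 1.0, 0.0]])
print(selector_raw.transform(X_new).shape)   # (1, 1)
```

If the filtered features are later used for prediction, the test set needs the same mask, for example `selector.transform(x_test)`, so that training and test columns stay aligned.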


In [13]:
best_model_score = cross_val_score(svm_model, filter_feat, y_train_categorical, cv=5).mean()
best_model_score



0.08429752066115703

In [14]:
# The score is slightly lower with the filtered features (0.084 vs. 0.087 with all standardized features).

#### Hyperparameter Optimization

In [15]:
grid_list = {"C": np.arange(2, 10, 2),
             "gamma": np.arange(0.1, 1, 0.2)}

opt_model = GridSearchCV(svm_model, param_grid = grid_list, n_jobs = 4, cv = 3)
opt_model.fit(sc_x_train, y_train_categorical)
# opt_model.cv_results_ holds the scores for every parameter combination

print(opt_model.best_estimator_)



SVC(C=2, gamma=0.1)
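
Both selected values, `C=2` and `gamma=0.1`, are the smallest entries of their grids, which suggests the optimum may lie outside the searched range. A hedged sketch of a wider, log-spaced search follows. It is an illustration on synthetic data (an assumption), uses `SVR` rather than `SVC`, and the grid bounds are arbitrary choices rather than recommended values:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the descriptors and activities (assumption).
X, y = make_regression(n_samples=120, n_features=15, n_informative=5,
                       noise=5.0, random_state=0)

# make_pipeline names each step after its lowercased class, hence the "svr__" prefix.
pipe = make_pipeline(StandardScaler(), SVR())
param_grid = {
    "svr__C": np.logspace(-1, 2, 4),       # 0.1, 1, 10, 100
    "svr__gamma": np.logspace(-3, 0, 4),   # 0.001, 0.01, 0.1, 1
}

search = GridSearchCV(pipe, param_grid=param_grid, scoring="r2", cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Searching over the pipeline rather than the bare estimator keeps the scaler inside each cross-validation split, so the held-out fold never influences the scaling statistics.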


In [16]:
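# Passing the GridSearchCV object refits the whole grid search inside each of the 5 folds (nested CV).
# Note that this runs on filter_feat (unscaled, variance-filtered), not on sc_x_train.
opt_model_score = cross_val_score(opt_model, filter_feat, y_train_categorical, cv=5).mean()
print(opt_model_score)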



0.07988980716253444


#### Train and test model with the best score

In [17]:
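# Note: this fits the default svm_model, not opt_model.best_estimator_.
svm_model.fit(sc_x_train, y_train_categorical)

# y_test_categorical comes from a separately fitted LabelEncoder, so its codes
# do not correspond to the training codes and this accuracy is not meaningful.
svm_model.score(sc_x_test, y_test_categorical)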

0.0016722408026755853
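
For a continuous target, a held-out evaluation would usually compare predicted and observed values directly rather than count exact label matches. The following sketch shows one way to do that with R² and RMSE. It runs on synthetic data (an assumption), and the model and its `C=10.0` setting are illustrative choices, not results from this notebook:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the train/test partitions (assumption).
X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit on the training partition only; the pipeline scales the test rows
# with the training statistics at prediction time.
model = make_pipeline(StandardScaler(), SVR(C=10.0))
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

r2 = r2_score(y_te, y_pred)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))

print(f"test R^2 = {r2:.3f}")
print(f"test RMSE = {rmse:.3f}")
```

On the real data, `X_tr`, `y_tr`, `X_te` and `y_te` would correspond to `x_train`, `y_train`, `x_test` and `y_test` without label encoding.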