## Task: Random Forest Model for BBBP Classifier using Morgan Fingerprints and Hyperparameter Optimization

* Develop a random forest model for the BBBP classifier using the default hyperparameters and Morgan fingerprints, and report the performance measure (ROC_AUC) for all three datasets.

* List the default hyperparameters used in the model (n_estimators, max_depth, min_samples_leaf, min_impurity_decrease, max_features).

* Optimize the model using grid search to find the optimal set of hyperparameters and report the performance measure (ROC_AUC) for all three datasets with optimized hyperparameters.

* Compare results with the baseline model and the results reported in Table 4 of the following paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00479-8

* Summarize results and comment on the hyperparameters explored in the paper. (The paper explored the following hyperparameters: n_estimators (10, 50, 100, 200, 300, 400, 500), max_depth (3 to 12), min_samples_leaf (1, 3, 5, 10, 20, 50), min_impurity_decrease (0 to 0.01), and max_features (‘sqrt’, ‘log2’, 0.7, 0.8, 0.9).)



### import modules

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from rdkit import Chem
from rdkit.Chem import AllChem

from sklearn.model_selection import (train_test_split, 
                                     cross_val_predict, 
                                     GridSearchCV)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

### import data

In [2]:
bbbp = pd.read_csv("BBBP.csv")
bbbp.head()

| Unnamed: 0 | num | name | p_np | smiles |
|---|---|---|---|---|
| 0 | 1 | Propanolol | 1 | `[Cl].CC(C)NCC(O)COc1cccc2ccccc12` |
| 1 | 2 | Terbutylchlorambucil | 1 | `C(=O)(OC(C)(C)C)CCCc1ccc(cc1)N(CCCl)CCCl` |
| 2 | 3 | 40730 | 1 | `c12c3c(N4CCN(C)CC4)c(F)cc1c(c(C(O)=O)cn2C(C)CO...` |
| 3 | 4 | 24 | 1 | `C1CCN(CC1)Cc1cccc(c1)OCCCNC(=O)C` |
| 4 | 5 | cloxacillin | 1 | `Cc1onc(c2ccccc2Cl)c1C(=O)N[C@H]3[C@H]4SC(C)(C)...` |


### calculate morgan fingerprints

In [3]:
# suppress RDKit warnings for unparsable SMILES so the output stays readable;
# the invalid molecules themselves are dropped in the cells below
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')

In [4]:
# function: generate canon SMILES
def gen_canon_smiles(smiles_list):
    
    invalid_ids = []
    canon_smiles = []

    for i in range(len(smiles_list)):   
        mol = Chem.MolFromSmiles(smiles_list[i])
        
        # do not append NoneType if invalid
        if mol is None: 
            invalid_ids.append(i)
            continue

        canon_smiles.append(Chem.MolToSmiles(mol))

    return canon_smiles, invalid_ids

# function: calculate morgan fingerprints from SMILES
def calc_morgan_fpts(smiles_list):
    morgan_fingerprints = []
    
    for i in smiles_list:
        mol = Chem.MolFromSmiles(i)
        
        # do not try to calculate if invalid
        if mol is None: continue
            
        fpts = AllChem.GetMorganFingerprintAsBitVect(mol,2,2048)
        mfpts = np.array(fpts)
        morgan_fingerprints.append(mfpts) 
        
    return np.array(morgan_fingerprints)

In [5]:
# generate canon smiles
canon_smiles, invalid_ids = gen_canon_smiles(bbbp.smiles)

# drop rows with invalid SMILES
bbbp = bbbp.drop(invalid_ids)

# replace SMILES with canon SMILES
bbbp.smiles = canon_smiles

# drop duplicates to prevent train/valid/test contamination
bbbp.drop_duplicates(subset=['smiles'], inplace=True)

In [6]:
X = calc_morgan_fpts(bbbp.smiles) # features
y = bbbp.p_np # labels

### split data

In [7]:
# split data into training set and everything else
X_train, X_valid_and_test, y_train, y_valid_and_test = train_test_split(X, y, train_size=0.8)

# split everything else into validation and test sets
X_valid, X_test, y_valid, y_test = train_test_split(X_valid_and_test, y_valid_and_test, test_size=0.5)
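
The two calls above give roughly an 80/10/10 train/validation/test split, but without a fixed seed the ROC-AUC values change on every run. A minimal sketch of the same two-step split made reproducible and class-balanced follows. It runs on synthetic stand-in data, and `random_state=0` and `stratify` are additions that the cells above do not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the fingerprint matrix and labels (assumption:
# in the notebook these would be X and y from the previous cells).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 32))
y_demo = np.array([1] * 150 + [0] * 50)  # imbalanced labels

# Step 1: 80% training set, keeping the class ratio in both parts.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=0)

# Step 2: split the remaining 20% evenly into validation and test sets.
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print("set sizes:", len(y_tr), len(y_va), len(y_te))
print("positive fraction:", y_tr.mean(), y_va.mean(), y_te.mean())
```

Stratifying matters most for the small validation and test sets, where a random draw can otherwise shift the class ratio noticeably.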

### random forest model

In [8]:
# train and fit
forest = RandomForestClassifier().fit(X_train, y_train)

In [9]:
# compute decision scores
y_train_scores = forest.predict_proba(X_train)
y_valid_scores = forest.predict_proba(X_valid)
y_test_scores = forest.predict_proba(X_test)

# print ROC-AUC scores
print("train_set scores:", roc_auc_score(y_train, y_train_scores[:,1]))
print("valid_set scores:", roc_auc_score(y_valid, y_valid_scores[:,1]))
print("test_set scores :", roc_auc_score(y_test, y_test_scores[:,1]))

train_set scores: 0.9999955327228055
valid_set scores: 0.8754136789851075
test_set scores : 0.9012714558169103


The model overfits the training set: its ROC-AUC on the training set (about 1.00) is well above its ROC-AUC on the validation set (0.875) and the test set (0.901).
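
With the default settings each tree is grown until its leaves are pure, so a near-perfect in-sample score is expected and says little about generalization. One way to get a fairer estimate from the training data alone is to score out-of-fold predictions with `cross_val_predict`, which the notebook imports but does not use. The sketch below assumes synthetic data from `make_classification` in place of the fingerprints; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification data standing in for X_train, y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

forest_demo = RandomForestClassifier(n_estimators=50, random_state=0)

# Out-of-fold probabilities: every sample is scored by a model
# that did not see it during fitting.
oof_scores = cross_val_predict(
    forest_demo, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
cv_auc = roc_auc_score(y_demo, oof_scores)

# In-sample score for comparison: fit on everything, score on the same data.
forest_demo.fit(X_demo, y_demo)
train_auc = roc_auc_score(y_demo, forest_demo.predict_proba(X_demo)[:, 1])

print("in-sample ROC-AUC  :", train_auc)
print("out-of-fold ROC-AUC:", cv_auc)
```

The gap between the two numbers is a rough measure of how much the in-sample score overstates performance.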

In [10]:
# print default hyperparams
forest.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
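
The task asks for five of these defaults specifically. A small sketch that pulls just those out of `get_params()` is shown here; the names `requested` and `requested_defaults` are illustrative, and the printed values depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# The five hyperparameters named in the task description.
requested = ["n_estimators", "max_depth", "min_samples_leaf",
             "min_impurity_decrease", "max_features"]

# get_params() on an untouched estimator returns its default settings.
defaults = RandomForestClassifier().get_params()
requested_defaults = {name: defaults[name] for name in requested}

for name, value in requested_defaults.items():
    print(f"{name}: {value!r}")
```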

In [None]:
# optimize model with grid search
params = {
    'n_estimators': [10, 50, 100, 200, 300, 400, 500], 
    'max_depth': list(range(3, 13)) + [None],  # integer depths 3..12, plus unlimited depth
    'max_features': ['log2', 'sqrt', 0.7, 0.8, 0.9], 
    'min_samples_split': [int(x) for x in np.linspace(2, 6, 5)],
    'min_samples_leaf': [1, 3, 5, 10, 20, 50]
}

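# rank candidates by ROC-AUC, the metric reported in this task
# (GridSearchCV otherwise uses the estimator's default score, accuracy)
forest_op = GridSearchCV(forest, params, scoring='roc_auc', cv=3, n_jobs=-1)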
forest_op.fit(X_train, y_train)
forest_op.best_estimator_
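
Grid search fits one model per combination per fold, so the cost is the product of the list lengths times the number of folds. The sketch below counts candidates for a grid of this shape before launching it. It also shows a pitfall worth checking: `np.linspace(3, 12)` without an explicit `num` returns 50 points, which truncate to only ten distinct integers, and `GridSearchCV` does not remove duplicate values. The helper `n_candidates` and the `grid_demo` dict are illustrative names, not part of scikit-learn.

```python
from math import prod

import numpy as np

# Depth list built from linspace with its default of 50 points ...
depth_default_num = [int(x) for x in np.linspace(3, 12)]
# ... versus an explicit integer range.
depth_explicit = list(range(3, 13))

def n_candidates(grid):
    """Number of parameter combinations a full grid search would try."""
    return prod(len(values) for values in grid.values())

grid_demo = {
    "n_estimators": [10, 50, 100, 200, 300, 400, 500],
    "max_depth": depth_explicit + [None],
    "max_features": ["log2", "sqrt", 0.7, 0.8, 0.9],
    "min_samples_split": [2, 3, 4, 5, 6],
    "min_samples_leaf": [1, 3, 5, 10, 20, 50],
}
cv_folds = 3

print("linspace depths:", len(depth_default_num),
      "values,", len(set(depth_default_num)), "distinct")
print("candidates:", n_candidates(grid_demo))
print("model fits:", n_candidates(grid_demo) * cv_folds)
```

With the deduplicated depth list this grid has 11,550 candidates, or 34,650 fits at `cv=3`, which is why the search takes so long.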

In [None]:
forest_op = forest_op.best_estimator_

# compute decision scores
y_train_scores = forest_op.predict_proba(X_train)
y_valid_scores = forest_op.predict_proba(X_valid)
y_test_scores = forest_op.predict_proba(X_test)

# print ROC-AUC scores
print("train_set scores:", roc_auc_score(y_train, y_train_scores[:,1]))
print("valid_set scores:", roc_auc_score(y_valid, y_valid_scores[:,1]))
print("test_set scores :", roc_auc_score(y_test, y_test_scores[:,1]))

My random forest model performed about the same as the baseline model in the paper. Like the model in the paper, my model overfit the training set significantly, and its performance on the validation and test sets was similar to the performance reported in the paper.

In doing this task, I reviewed how to calculate Morgan fingerprints and use the RDKit module. I learned in more depth how a RandomForestClassifier works by training many estimators, and what each hyperparameter does. That said, other than 'n_estimators', it is difficult to know which hyperparameters should be changed, and the process of trial and error is very time consuming. The paper tuned a slightly different set of hyperparameters than I did: we share 'n_estimators', 'max_depth', 'max_features', and 'min_samples_leaf', but the paper tuned 'min_impurity_decrease' where I tuned 'min_samples_split'. For the shared hyperparameters my ranges follow the paper's, except that I also allowed max_depth=None. I wish there were a better way to choose a starting point for the hyperparameter search so that it was less time consuming. I chose these hyperparameters because they sounded the most important, but I do not know for sure which ones contribute the most to performance.
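
One commonly used way to cut the search time is to sample a fixed number of combinations instead of trying all of them. The sketch below uses scikit-learn's `RandomizedSearchCV` for that. This is a different technique from the grid search used above and from the paper's procedure, and the data, the ranges, and `n_iter=10` are assumptions chosen only to keep the example small.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the fingerprint matrix and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Lists are sampled uniformly; scipy distributions are sampled via .rvs().
param_distributions = {
    "n_estimators": [10, 50, 100],
    "max_depth": [3, 6, 9, 12, None],
    "min_samples_leaf": randint(1, 21),  # integers from 1 to 20
    "max_features": ["sqrt", "log2", 0.7],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,           # number of sampled combinations, not the full grid
    scoring="roc_auc",   # rank candidates by cross-validated ROC-AUC
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("best CV ROC-AUC:", search.best_score_)
```

The per-candidate scores in `search.cv_results_` can then be grouped by hyperparameter to get a rough sense of which settings move the score, which helps narrow the ranges before any finer search.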