#### This notebook defines the applicability domain of models and computes the pIC50 values for blinded molecules.

Methodology used to define applicability domain:

1) We use an ensemble of model predictions to define the applicability domain.<br>
2) In this notebook the 35 best CNN models form the ensemble, so each molecule receives 35 predictions.<br>
3) Suppose the models predict x1, x2, ..., x35 as pIC50 values for a particular molecule.<br>
4) We use a non-parametric confidence interval and keep the middle 95% of the predicted pIC50 values.<br>
5) This central band is taken around the median of the predictions so that outlier predictions are rejected.<br>
6) The median of the predictions is reported as the predicted value for the molecule.<br>
7) After this trimming, 32 of the original 35 predictions remain. (The code sorts the predictions and drops the lowest one and the two highest.)<br>
8) These 32 values can be treated as a vector in a 32-dimensional space. Let's call this vector a.<br>
9) The median value is repeated 32 times, which gives a second vector in the same 32-dimensional space. Let's call this vector b.<br>
10) A distance-to-model metric is defined from a and b:<br><br>
distance_to_model = |a - b| / |b|, where | | denotes the magnitude (Euclidean norm) of a vector.<br><br>
11) If all predictions are the same, then a and b are identical, |a - b| is 0, and distance_to_model is 0. All models agree, so the prediction can be considered reliable.<br>
12) If, on the other hand, a and b differ drastically, |a - b| is large. The models disagree, so the prediction cannot be considered reliable.<br>
13) Dividing |a - b| by |b| expresses the difference as a fraction of the magnitude of the median vector b. This makes the reliability measure independent of the absolute magnitudes of a and b.<br>
14) We fix the cutoff for reliable predictions empirically at distance_to_model <= 0.3. We arrived at this number by observing the prediction quality on training and test set compounds.<br>
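
The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the notebook's own code: the function name, its parameters, and the toy inputs are hypothetical, and the default fractions simply mirror the index arithmetic used in the cells further down.

```python
import numpy as np

def distance_to_model(predictions, lower_frac=0.025, upper_frac=0.95):
    """Relative spread of trimmed ensemble predictions around their median.

    predictions: array of shape (n_compounds, n_models).
    lower_frac, upper_frac: fractions that select the slice of sorted
    predictions to keep (hypothetical parameters for this sketch).
    """
    predictions = np.asarray(predictions, dtype=float)
    n_models = predictions.shape[1]
    median = np.median(predictions, axis=1, keepdims=True)

    # Sort each row and keep the central slice (vector a).
    lo = int(round(lower_frac * n_models))
    hi = int(round(upper_frac * n_models))
    a = np.sort(predictions, axis=1)[:, lo:hi]

    # Repeat the median to the same length (vector b).
    b = np.tile(median, (1, a.shape[1]))

    return np.linalg.norm(a - b, axis=1) / np.linalg.norm(b, axis=1)

# Toy inputs: one compound on which all 35 models agree, one on which they scatter.
identical = np.full((1, 35), -1.5)
scattered = np.linspace(-3.0, 0.0, 35).reshape(1, -1)
d_identical = distance_to_model(identical)
d_scattered = distance_to_model(scattered)
```

With identical predictions the distance is exactly 0; the scattered toy row lands above the 0.3 cutoff.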


In [1]:
from keras.models import Sequential, model_from_json
from keras.layers import Dense, Conv2D, Flatten, MaxPooling2D, UpSampling2D, Dropout
from keras.callbacks import EarlyStopping, ModelCheckpoint
import pandas as pd
import numpy as np
import sys
import pickle
import math
from numpy import percentile
from scipy import stats
from scipy.stats import norm
from sklearn.preprocessing import MinMaxScaler

# Show every row when a DataFrame is displayed.
pd.set_option('display.max_rows', None)

Using TensorFlow backend.


In [2]:
# Best models, selected by R2 on the test set.
# The key of this dictionary is the folder where the best models are saved; the values are the identifiers of those models.
best_models = {"best_models":['1','10','13','14','15','18',
                              '20','21','23','26','28','3',
                              '31','32','35','38','39','40',
                              '42','45','49','5','54','56',
                              '57','59','63','9','24','53','62',
                              '16','61','4','7']}

# Read training and loo datasets
training_csv = pd.read_csv('best_models/data/training_set_24.csv')
loo_csv = pd.read_csv('best_models/data/loo_set__24.csv')
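# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
training_csv = pd.concat([training_csv, loo_csv])
training_csv1 = training_csv
training_compound_smiles = training_csv1['Name']
# .copy() so that adding the padding column below does not write into a slice of the original frame.
training_csv = training_csv.loc[:,'nAcid':'Zagreb'].copy()
# Add a dummy zero descriptor so that each molecule has 1120 (= 35 * 32) descriptors.
training_csv['zeros'] = 0
# Reshape descriptors into 35 by 32 grayscale images.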
training_imgs = np.reshape(np.array(training_csv),(-1,35,32,1))
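
The padding and reshape step can be checked on toy data. This is a small sketch, assuming 1119 descriptor columns per molecule (inferred from the single padding column and the 1120 total); the array names and random values are illustrative only.

```python
import numpy as np

# 1119 is inferred from 1120 minus the one padding column; values are random toy data.
n_compounds, n_descriptors = 4, 1119
descriptors = np.random.default_rng(0).random((n_compounds, n_descriptors))

# Append one all-zero column so each row has 35 * 32 = 1120 entries.
padded = np.hstack([descriptors, np.zeros((n_compounds, 1))])

# Row-major reshape: each compound becomes a 35 x 32 single-channel image.
images = padded.reshape(-1, 35, 32, 1)

# Descriptor j of a compound lands at pixel (j // 32, j % 32).
j = 100
same_value = images[0, j // 32, j % 32, 0] == descriptors[0, j]
```

The padding value ends up in the last pixel of each image, so no descriptor is overwritten.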

In [3]:
training_csv1

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Name,nAcid,ALogP,ALogp2,AMR,apol,naAromAtom,nAromBond,...,WTPT-1,WTPT-2,WTPT-3,WTPT-4,WTPT-5,WPATH,WPOL,XLogP,Zagreb,values
0,7,7,CCCC1=CC(Cl)=NC(SCC(=O)NC2=CC=C(Cl)C=C2)=N1,0.0,0.548801,0.064859,0.41773,0.354838,0.388889,0.35,...,0.324713,0.359553,0.46864,5.9e-05,0.496087,0.178921,0.285714,0.590487,0.276316,0.311631
1,8,8,COC1=CC(=CC=C1)C1=C(C#N)C(=O)NC(SCC(=O)NC2=CC=...,0.0,0.180712,0.124568,0.694061,0.610455,0.388889,0.35,...,0.694176,0.422779,0.891471,0.633133,0.764134,0.565556,0.678571,0.175807,0.657895,0.279766
2,10,10,COC1=C(OC)C=C(NC(=O)CSC2=NC(O)=CC(=N2)C2=CC=CC...,0.0,0.275568,0.038484,0.318809,0.527526,0.722222,0.7,...,0.557796,0.564844,0.569524,0.509872,0.500191,0.383557,0.517857,0.547035,0.5,0.266719
3,11,11,COC1=CC=CC(=C1)C1=C(C#N)C(=O)NC(SCC(=O)NC2=CC=...,0.0,0.280942,0.035076,0.689344,0.599287,0.388889,0.35,...,0.661913,0.478353,0.637702,0.486536,0.63572,0.512133,0.642857,0.342337,0.605263,0.234854
4,12,12,CCC(SC1=NC(C2=CC=CC(OC)=C2)=C(C#N)C(=O)N1)C(=O...,0.0,0.25275,0.054709,0.785562,0.7116,0.388889,0.35,...,0.731942,0.436872,0.639543,0.487294,0.636674,0.570808,0.732143,0.41879,0.671053,0.234854
5,28,28,CC1=COC2=C1C(=O)C(=O)C1=C2C=CC2=C(C)C=CC=C12,0.0,0.48293,0.021242,0.144981,0.251797,0.555556,0.6,...,0.318461,0.931009,0.099453,0.357162,0.0,0.095799,0.535714,0.599178,0.381579,0.283425
6,60,60,ClC1=CC(CN2CCN(CC2)S(=O)(=O)C2=CC3=C(NC(=O)C3=...,0.0,0.244776,0.061047,0.560408,0.523047,0.388889,0.35,...,0.56669,0.691137,0.667453,0.473463,0.537009,0.356755,0.625,0.317205,0.592105,0.305491
7,61,61,COC1=CC(CN2CCN(CC2)S(=O)(=O)C2=CC3=C(NC(=O)C3=...,0.0,0.112668,0.216554,0.767963,0.707723,0.388889,0.35,...,0.744605,0.589432,0.855034,1.0,0.537653,0.593625,0.839286,0.149501,0.75,0.304206
8,62,62,O=C1NC2=C(C=C(C=C2)S(=O)(=O)N2CCN(CCC3=CC=CC=C...,0.0,0.173017,0.133703,0.545617,0.551734,0.388889,0.35,...,0.569249,0.727467,0.586647,0.473426,0.536257,0.368164,0.607143,0.471873,0.578947,0.294842
9,64,64,O=C1NC2=C(C=C(C=C2)S(=O)(=O)N2CCN(CC2)C2=CC=CC...,0.0,0.133783,0.185306,0.429155,0.415337,0.388889,0.35,...,0.496488,0.757511,0.685392,0.473528,0.703664,0.268381,0.589286,0.207399,0.526316,0.252141


In [4]:
# Get the actual training values. The scaler that was used to transform the data
# is used to inverse-transform it back to the original scale.

with open('scaler_data/scaler.dat','rb') as scaler_file:
    scaler = pickle.load(scaler_file)
training_csv1 = training_csv1.loc[:,'nAcid':'values']
training_csv1_inverse_transform = scaler.inverse_transform(training_csv1)
# The last column holds the target ('values') entries.
training_csv1_inverse_transform_actual_values = training_csv1_inverse_transform[:,-1]
df_training_values_inverse_transform = pd.DataFrame(training_csv1_inverse_transform_actual_values,columns=['actual_training_values'])
df_training_values_inverse_transform.reset_index(drop=True, inplace=True)
training_compound_smiles.reset_index(drop=True, inplace=True)
pd.concat([training_compound_smiles,df_training_values_inverse_transform],axis=1)

Unnamed: 0,Name,actual_training_values
0,CCCC1=CC(Cl)=NC(SCC(=O)NC2=CC=C(Cl)C=C2)=N1,-1.477121
1,COC1=CC(=CC=C1)C1=C(C#N)C(=O)NC(SCC(=O)NC2=CC=...,-1.60206
2,COC1=C(OC)C=C(NC(=O)CSC2=NC(O)=CC(=N2)C2=CC=CC...,-1.653213
3,COC1=CC=CC(=C1)C1=C(C#N)C(=O)NC(SCC(=O)NC2=CC=...,-1.778151
4,CCC(SC1=NC(C2=CC=CC(OC)=C2)=C(C#N)C(=O)N1)C(=O...,-1.778151
5,CC1=COC2=C1C(=O)C(=O)C1=C2C=CC2=C(C)C=CC=C12,-1.587711
6,ClC1=CC(CN2CCN(CC2)S(=O)(=O)C2=CC3=C(NC(=O)C3=...,-1.501196
7,COC1=CC(CN2CCN(CC2)S(=O)(=O)C2=CC3=C(NC(=O)C3=...,-1.506234
8,O=C1NC2=C(C=C(C=C2)S(=O)(=O)N2CCN(CCC3=CC=CC=C...,-1.54295
9,O=C1NC2=C(C=C(C=C2)S(=O)(=O)N2CCN(CC2)C2=CC=CC...,-1.710371
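
The notebook recovers the original-scale target by inverse-transforming the whole descriptor matrix and keeping the last column. The sketch below illustrates why that works, assuming the pickled scaler is a scikit-learn `MinMaxScaler` fitted on the descriptor columns plus the target column; `toy_scaler` and the random data are hypothetical stand-ins.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
raw = rng.normal(size=(20, 5))  # toy data; the last column plays the role of 'values'
toy_scaler = MinMaxScaler().fit(raw)
scaled = toy_scaler.transform(raw)

# Route used in the notebook: inverse-transform the full matrix, keep the last column.
full_route = toy_scaler.inverse_transform(scaled)[:, -1]

# Equivalent shortcut: undo the scaling of the last column only.
# MinMaxScaler computes X_scaled = X * scale_ + min_ column by column.
column_route = (scaled[:, -1] - toy_scaler.min_[-1]) / toy_scaler.scale_[-1]
```

Because the scaling is applied per column, the descriptor values passed alongside the predictions do not influence the recovered target; they only give the matrix the shape the scaler expects.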


#### Predict training set

In [5]:
# Compute values for training set
model_names = []

# Saves the output of CNN regression
predicted_training_values_modelwise = []
# Saves the output after the inverse transform of the CNN regression output.
predicted_training_values_modelwise_inverse_transform = []

# Iterate through best models.
for key,values in best_models.items():

    for value in values:
    
        model_names.append(key+","+value)
        # Load best models
        json_file = open(key+'/model_'+str(value)+'.json')
        loaded_model_json = json_file.read()
        json_file.close()
        loaded_model = model_from_json(loaded_model_json)
        loaded_model.load_weights(key+'/model'+str(value)+'.h5')

        # Predict training set
        predicted_training_values = loaded_model.predict(training_imgs)
        predicted_training_values = np.reshape(predicted_training_values,-1)

        # min max scaler inverse transform for predicted train values
        # .copy() so that the 'values' column is added to a new frame, not to a slice.
        training_data_predicted = training_csv.loc[:,'nAcid':'Zagreb'].copy()
        training_data_predicted['values'] = predicted_training_values
        training_data_inverse_transform_predicted = scaler.inverse_transform(training_data_predicted)
        training_value_predicted_inverse_transform = training_data_inverse_transform_predicted[:,len(training_data_inverse_transform_predicted[0])-1]

        predicted_training_values_modelwise.append(predicted_training_values)
        predicted_training_values_modelwise_inverse_transform.append(training_value_predicted_inverse_transform)


df_modelwise = pd.DataFrame(np.transpose(np.array(predicted_training_values_modelwise)),columns = model_names)




#### Display actual training set values along with values predicted by different models

In [6]:
# Display the predicted training values on their original scale.
df_modelwise_inverse_transform = pd.DataFrame(np.transpose(np.array(predicted_training_values_modelwise_inverse_transform)),columns = model_names)
pd.concat([training_compound_smiles,df_training_values_inverse_transform,df_modelwise_inverse_transform],axis = 1)

Unnamed: 0,Name,actual_training_values,"best_models,1","best_models,10","best_models,13","best_models,14","best_models,15","best_models,18","best_models,20","best_models,21",...,"best_models,59","best_models,63","best_models,9","best_models,24","best_models,53","best_models,62","best_models,16","best_models,61","best_models,4","best_models,7"
0,CCCC1=CC(Cl)=NC(SCC(=O)NC2=CC=C(Cl)C=C2)=N1,-1.477121,-1.47432,-1.412471,-1.475875,-1.465381,-1.449068,-1.436387,-1.446122,-1.447732,...,-1.379441,-1.483464,-1.484856,-1.328337,-1.475454,-1.449938,-1.46067,-1.44385,-1.385848,-1.464808
1,COC1=CC(=CC=C1)C1=C(C#N)C(=O)NC(SCC(=O)NC2=CC=...,-1.60206,-1.708336,-1.54961,-1.586811,-1.592853,-1.557375,-1.577646,-1.554884,-1.569967,...,-1.540483,-1.610723,-1.611405,-1.569894,-1.561618,-1.598784,-1.551831,-1.622843,-1.569972,-1.570997
2,COC1=C(OC)C=C(NC(=O)CSC2=NC(O)=CC(=N2)C2=CC=CC...,-1.653213,-1.657388,-1.608692,-1.660258,-1.668773,-1.600697,-1.619682,-1.59255,-1.638717,...,-1.587421,-1.650874,-1.659273,-1.551068,-1.638265,-1.606812,-1.61409,-1.582529,-1.651059,-1.611438
3,COC1=CC=CC(=C1)C1=C(C#N)C(=O)NC(SCC(=O)NC2=CC=...,-1.778151,-1.774718,-1.755901,-1.771163,-1.756724,-1.727756,-1.719781,-1.763305,-1.780978,...,-1.697038,-1.73172,-1.802463,-1.679384,-1.758363,-1.654902,-1.729909,-1.826888,-1.768031,-1.755477
4,CCC(SC1=NC(C2=CC=CC(OC)=C2)=C(C#N)C(=O)N1)C(=O...,-1.778151,-1.776239,-1.755804,-1.762228,-1.756724,-1.720213,-1.705125,-1.760706,-1.779911,...,-1.705168,-1.739381,-1.802758,-1.673427,-1.756235,-1.675094,-1.723369,-1.783661,-2.256568,-1.751167
5,CC1=COC2=C1C(=O)C(=O)C1=C2C=CC2=C(C)C=CC=C12,-1.587711,-1.563229,-1.545589,-1.58249,-1.57105,-1.53983,-1.533819,-1.560019,-1.578282,...,-1.592372,-1.588145,-1.585114,-1.554292,-1.564724,-1.42241,-1.584533,-1.531968,-1.570688,-1.562284
6,ClC1=CC(CN2CCN(CC2)S(=O)(=O)C2=CC3=C(NC(=O)C3=...,-1.501196,-1.476024,-1.469956,-1.494936,-1.49112,-1.473965,-1.455668,-1.490315,-1.467778,...,-1.411797,-1.586061,-1.512546,-1.462549,-1.485103,-1.177573,-1.473867,-1.446921,-1.454959,-1.468604
7,COC1=CC(CN2CCN(CC2)S(=O)(=O)C2=CC3=C(NC(=O)C3=...,-1.506234,-1.469491,-1.47119,-1.494343,-1.493208,-1.466994,-1.471141,-1.491704,-1.469738,...,-1.421264,-1.567348,-1.524915,-1.481373,-1.492692,-1.361671,-1.47883,-1.415635,-1.46233,-1.679817
8,O=C1NC2=C(C=C(C=C2)S(=O)(=O)N2CCN(CCC3=CC=CC=C...,-1.54295,-1.499576,-1.4895,-1.532403,-1.52777,-1.495326,-1.487563,-1.522677,-1.49542,...,-1.431742,-1.587724,-1.540124,-1.504152,-1.506394,-1.217925,-1.501498,-1.478402,-1.472599,-1.514372
9,O=C1NC2=C(C=C(C=C2)S(=O)(=O)N2CCN(CC2)C2=CC=CC...,-1.710371,-1.690553,-1.675866,-1.719726,-1.689192,-1.65551,-1.67947,-1.674179,-1.714456,...,-1.575525,-1.658686,-1.486584,-1.599115,-1.66037,-1.229836,-1.671426,-1.737532,-1.687206,-1.693697


#### Distance to model computation for training compounds

Compute the distance to model as |a - b| / |b|, where a is the prediction vector and b is the median vector.<br>
a = the middle 95% of the predicted values.<br>
b = the tiled median vector (the median value repeated 32 times).<br>
Both a and b are 32 dimensional in this case.
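
Because b is just the median repeated n times, the ratio simplifies. The short derivation below uses symbols that are not in the original text (x_i for the kept predictions, m for their median, n for the number of kept values) and is meant only as a reading aid.

```latex
\[
\frac{\lVert a - b \rVert}{\lVert b \rVert}
  = \frac{\sqrt{\sum_{i=1}^{n} (x_i - m)^2}}{\sqrt{n\,m^2}}
  = \frac{1}{\lvert m \rvert}
    \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - m)^2}
\]
```

So the metric is the root-mean-square deviation of the kept predictions from the median, divided by |m|. One consequence, noted here as an observation rather than as part of the original method: for the same absolute spread, a compound whose median prediction is close to 0 gets a larger distance.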


In [35]:
num_of_models = len(model_names)

# Calculate non parametric CI at 95% confidence (training)
df_modelwise_inverse_transform_array = np.array(df_modelwise_inverse_transform)

# Get predicted pIC50 values at 2.5 percentile and 97.5 percentile
two_point_five_percentile = percentile(df_modelwise_inverse_transform_array,2.5,axis=1)
ninety_seven_point_five_percentile = percentile(df_modelwise_inverse_transform_array,97.5,axis=1)

lower_training_df = pd.DataFrame(two_point_five_percentile,columns=['value at 2.5 percentile'])
upper_training_df = pd.DataFrame(ninety_seven_point_five_percentile,columns=['value at 97.5 percentile'])
median_training = np.reshape(np.array(df_modelwise_inverse_transform.median(axis = 1)),(-1,1))
median_training_df = pd.DataFrame(median_training,columns=['predicted_training_values (median)'])

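# Keep the central predictions around the median to discard outlier predictions.
sorted_df_modelwise_inverse_transform_array = np.sort(df_modelwise_inverse_transform_array,axis = 1)
# With 35 models: lower_index = round(0.875) = 1 and upper_index = round(33.25) = 33,
# so the slice [1:33] keeps 32 values (the lowest one and the two highest are dropped).
lower_index = int(round(0.025 * num_of_models))
upper_index = int(round(0.95 * num_of_models))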

mid_ninety_five_percent_values = sorted_df_modelwise_inverse_transform_array[:,lower_index:upper_index]
number_of_values_in_mid_95_percent = len(mid_ninety_five_percent_values[0])
median_training_tiled = np.tile(median_training,(1,number_of_values_in_mid_95_percent))

# For each compound the median prediction is taken as the reference value.
# The next lines compute |a - b|, |b| and their ratio.
subtract_median_from_predicted_values = np.sqrt(np.sum(np.square(median_training_tiled - mid_ninety_five_percent_values),axis = 1,keepdims=True))
magnitude_of_median_training_vector = np.sqrt(np.sum(np.square(median_training_tiled),axis = 1,keepdims=True))
weighted_subtraction_values = subtract_median_from_predicted_values/magnitude_of_median_training_vector
distance_to_model = pd.DataFrame(weighted_subtraction_values,columns = ['distance_to_ensemble_models_for_training_compounds'])
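
The index arithmetic in the cell above can be traced by hand for the 35-model ensemble. This small sketch uses a stand-in array in place of a real sorted row of predictions.

```python
import numpy as np

num_of_models = 35
lower_index = int(round(0.025 * num_of_models))   # round(0.875)
upper_index = int(round(0.95 * num_of_models))    # round(33.25)

sorted_predictions = np.arange(num_of_models)     # stand-in for one sorted row
kept = sorted_predictions[lower_index:upper_index]

dropped_low = lower_index                    # predictions removed at the low end
dropped_high = num_of_models - upper_index   # predictions removed at the high end
```

The slice keeps 32 values and removes one prediction at the low end and two at the high end, which is where the "32 dimensional" vectors in the description come from.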

In [36]:
# Display Distance to model for various training compounds along with actual and predicted training values
pd.concat([training_compound_smiles,df_training_values_inverse_transform,median_training_df,lower_training_df,upper_training_df
           ,distance_to_model],axis=1)    

Unnamed: 0,Name,actual_training_values,predicted_training_values (median),value at 2.5 percentile,value at 97.5 percentile,distance_to_ensemble_models_for_training_compounds
0,CCCC1=CC(Cl)=NC(SCC(=O)NC2=CC=C(Cl)C=C2)=N1,-1.477121,-1.447427,-1.487963,-1.345836,0.022374
1,COC1=CC(=CC=C1)C1=C(C#N)C(=O)NC(SCC(=O)NC2=CC=...,-1.60206,-1.569967,-1.635667,-1.504179,0.01528
2,COC1=C(OC)C=C(NC(=O)CSC2=NC(O)=CC(=N2)C2=CC=CC...,-1.653213,-1.619024,-1.66569,-1.546908,0.018502
3,COC1=CC=CC(=C1)C1=C(C#N)C(=O)NC(SCC(=O)NC2=CC=...,-1.778151,-1.747756,-1.806127,-1.647357,0.017118
4,CCC(SC1=NC(C2=CC=CC(OC)=C2)=C(C#N)C(=O)N1)C(=O...,-1.778151,-1.751167,-1.87083,-1.663103,0.018398
5,CC1=COC2=C1C(=O)C(=O)C1=C2C=CC2=C(C)C=CC=C12,-1.587711,-1.562284,-1.594579,-1.351836,0.018023
6,ClC1=CC(CN2CCN(CC2)S(=O)(=O)C2=CC3=C(NC(=O)C3=...,-1.501196,-1.473867,-1.52562,-1.349794,0.015686
7,COC1=CC(CN2CCN(CC2)S(=O)(=O)C2=CC3=C(NC(=O)C3=...,-1.506234,-1.479763,-1.584219,-1.376758,0.020237
8,O=C1NC2=C(C=C(C=C2)S(=O)(=O)N2CCN(CCC3=CC=CC=C...,-1.54295,-1.503414,-1.554875,-1.380714,0.018754
9,O=C1NC2=C(C=C(C=C2)S(=O)(=O)N2CCN(CC2)C2=CC=CC...,-1.710371,-1.675866,-1.722397,-1.448072,0.028204


In [17]:
# Read test set compounds
test_csv = pd.read_csv('best_models/data/test_compounds.csv')

# Get actual test values
test_set_smiles = test_csv['Name']
test_csv1 = test_csv.loc[:,'nAcid':'values']
test_csv1_inverse_transform = scaler.inverse_transform(test_csv1)
test_csv1_inverse_transform_actual_values = test_csv1_inverse_transform[:,len(test_csv1_inverse_transform[0])-1]
df_test_values_inverse_transform = pd.DataFrame(test_csv1_inverse_transform_actual_values,columns=['actual_test_values'])
test_set_smiles.reset_index(drop=True, inplace=True)
df_test_values_inverse_transform.reset_index(drop=True, inplace=True)
pd.concat([test_set_smiles,df_test_values_inverse_transform],axis = 1)

Unnamed: 0,Name,actual_test_values
0,CC(C)C1=CC=C(NC(=O)CSC2=NC(=CC=N2)C2=CC=CS2)C=C1,-1.60206
1,CC1=CC(=O)NC(SCC(=O)NC2=C(OC3=CC=CC=C3)C=CC(Cl...,-2.0
2,O=C1N(CC2=CC=C3C=CC=CC3=C2)C2=CC=C(C=C2C1=O)S(...,-1.600646
3,O=C(CC1=NCCS1)C1=NC=CS1,-1.60206
4,ClC(Cl)=C(Cl)C(=O)OC1=CC=C(C=C1)S(=O)(=O)C1=CC...,0.045757
5,IC1=CC=C2N(CC3=CC4=CC=CC=C4S3)C(=O)C(=O)C2=C1,0.022276
6,IC1=CC=C2N(C\C=C\C3=CC4=CC=CC=C4S3)C(=O)C(=O)C...,-1.371068
7,O=C(N1CCN(CC1)S(=O)(=O)C1=CC2=C(NC(=O)C2=O)C=C...,-1.003029
8,IC1=CC=C2N(CC3=CC=C(S3)C(=O)N3CCCCC3)C(=O)C(=O...,-1.243038
9,ClC1=C2C(=O)C(=O)N(CC3=CC4=CC=CC=C4S3)C2=CC=C1,-1.049218


In [19]:
# Test set prediction
# .copy() so that the padding column is added to a new frame, not to a slice.
test_csv = test_csv.loc[:,'nAcid':'Zagreb'].copy()
test_csv['zeros'] = 0
test_imgs = np.reshape(np.array(test_csv),(-1,35,32,1))

predicted_test_values_modelwise = []
predicted_test_values_modelwise_inverse_transform = []

# Iterate through best models
for key,values in best_models.items():
    for value in values:
        json_file = open(key+'/model_'+str(value)+'.json')
        loaded_model_json = json_file.read()
        json_file.close()
        loaded_model = model_from_json(loaded_model_json)
        loaded_model.load_weights(key+'/model'+str(value)+'.h5')

        # Predict test set
        predicted_test_values = loaded_model.predict(test_imgs)
        predicted_test_values = np.reshape(predicted_test_values,-1)

        # min max scaler inverse transform for predicted test values
        # .copy() so that the 'values' column is added to a new frame, not to a slice.
        test_data_predicted = test_csv.loc[:,'nAcid':'Zagreb'].copy()
        test_data_predicted['values'] = predicted_test_values
        test_data_inverse_transform_predicted = scaler.inverse_transform(test_data_predicted)
        test_value_predicted_inverse_transform = test_data_inverse_transform_predicted[:,len(test_data_inverse_transform_predicted[0])-1]

        predicted_test_values_modelwise.append(predicted_test_values)
        predicted_test_values_modelwise_inverse_transform.append(test_value_predicted_inverse_transform)

In [20]:
# Predicted test values by best models
df_modelwise_test_inverse_transform = pd.DataFrame(np.transpose(np.array(predicted_test_values_modelwise_inverse_transform)),columns = model_names)
df_modelwise_test_inverse_transform.reset_index(drop=True, inplace=True)
pd.concat([test_set_smiles,df_test_values_inverse_transform,df_modelwise_test_inverse_transform],axis=1)

Unnamed: 0,Name,actual_test_values,"best_models,1","best_models,10","best_models,13","best_models,14","best_models,15","best_models,18","best_models,20","best_models,21",...,"best_models,59","best_models,63","best_models,9","best_models,24","best_models,53","best_models,62","best_models,16","best_models,61","best_models,4","best_models,7"
0,CC(C)C1=CC=C(NC(=O)CSC2=NC(=CC=N2)C2=CC=CS2)C=C1,-1.60206,-1.733387,-1.452686,-1.370995,-1.693146,-1.358057,-1.456654,-1.346855,-1.562684,...,-1.444901,-1.697453,-1.457122,-1.219501,-1.51791,-1.493024,-1.331349,-1.573673,-1.608117,-1.440976
1,CC1=CC(=O)NC(SCC(=O)NC2=C(OC3=CC=CC=C3)C=CC(Cl...,-2.0,-2.269752,-2.215128,-2.216603,-1.796119,-2.297105,-1.831612,-2.026257,-2.326262,...,-2.267178,-2.25718,-2.117335,-2.320551,-2.327586,-2.06984,-2.257858,-2.298715,-2.255553,-2.173285
2,O=C1N(CC2=CC=C3C=CC=CC3=C2)C2=CC=C(C=C2C1=O)S(...,-1.600646,-1.727484,-1.373396,-1.21872,-1.382407,-1.354863,-1.497482,-1.466267,-1.197824,...,-1.389359,-1.619741,-1.488664,-1.52169,-1.283891,-1.035548,-1.193922,-1.40772,-1.473333,-1.625267
3,O=C(CC1=NCCS1)C1=NC=CS1,-1.60206,-1.178464,-1.149302,-1.13212,-1.101345,-1.122643,-1.109243,-1.138791,-1.13397,...,-1.038247,-1.152533,-1.139822,-1.082118,-1.126022,-1.138944,-1.135026,-1.094325,-1.159938,-1.235071
4,ClC(Cl)=C(Cl)C(=O)OC1=CC=C(C=C1)S(=O)(=O)C1=CC...,0.045757,-0.262465,-1.108509,-1.190463,-0.319793,-0.62199,-0.250439,-1.200078,-0.917332,...,-0.816969,-0.731002,-0.265995,-1.021762,-0.385958,-1.181407,-1.125431,-1.076879,-0.861368,-0.967678
5,IC1=CC=C2N(CC3=CC4=CC=CC=C4S3)C(=O)C(=O)C2=C1,0.022276,-0.502165,-0.213468,-0.619173,-0.926541,-0.521386,-1.01807,-0.418198,-0.501806,...,-0.498024,-0.509444,-1.105792,-0.533791,-0.987066,-0.870434,-0.673374,-0.692081,-0.608079,-0.971281
6,IC1=CC=C2N(C\C=C\C3=CC4=CC=CC=C4S3)C(=O)C(=O)C...,-1.371068,-0.838703,-0.897667,-1.044282,-0.681875,-0.382858,-1.06651,-0.891666,-0.737871,...,-0.436646,-1.010835,-1.039023,-0.332532,-0.938307,-1.101981,-0.733977,-0.745512,-0.490127,-0.934043
7,O=C(N1CCN(CC1)S(=O)(=O)C1=CC2=C(NC(=O)C2=O)C=C...,-1.003029,-1.580919,-1.541925,-1.555936,-1.541679,-1.496585,-1.536901,-1.558467,-1.538739,...,-1.376533,-1.530489,-1.473132,-1.29839,-1.499918,-1.100533,-1.574779,-1.403806,-1.276412,-1.569787
8,IC1=CC=C2N(CC3=CC=C(S3)C(=O)N3CCCCC3)C(=O)C(=O...,-1.243038,-1.141522,-1.112915,-1.09608,-1.086872,-1.124977,-1.071094,-1.096062,-1.096094,...,-0.947453,-1.091237,-1.128374,-0.94492,-1.103771,-1.10523,-1.06985,-1.129861,-1.062358,-1.060731
9,ClC1=C2C(=O)C(=O)N(CC3=CC4=CC=CC=C4S3)C2=CC=C1,-1.049218,-1.111519,-1.068595,-0.947309,-1.103404,-0.766985,-1.028217,-0.822793,-1.080779,...,-0.773126,-0.853605,-1.056886,-0.710334,-1.007207,-1.134072,-0.993568,-0.686744,-0.760711,-1.029675


#### Distance to model computation for test compounds.

In [21]:
# Calculate non parametric CI at 95% confidence (test)
df_modelwise_test_inverse_transform_array = np.array(df_modelwise_test_inverse_transform)
two_point_five_percentile = percentile(df_modelwise_test_inverse_transform_array,2.5,axis=1)
ninety_seven_point_five_percentile = percentile(df_modelwise_test_inverse_transform_array,97.5,axis=1)
lower_test_df = pd.DataFrame(two_point_five_percentile,columns=['lower_test'])
upper_test_df = pd.DataFrame(ninety_seven_point_five_percentile,columns=['upper_test'])
median_test = np.reshape(np.array(df_modelwise_test_inverse_transform.median(axis = 1)),(-1,1))
median_test_df = pd.DataFrame(median_test,columns=['predicted_test_values'])

# Get mid 95 percent values
sorted_df_modelwise_inverse_transform_array = np.sort(df_modelwise_test_inverse_transform_array,axis = 1)
lower_index = int(round(0.025 * num_of_models))
upper_index =int(round(0.95* num_of_models))

mid_ninety_five_percent_values = sorted_df_modelwise_inverse_transform_array[:,lower_index:upper_index]
number_of_values_in_mid_95_percent = len(mid_ninety_five_percent_values[0])

median_test_tiled = np.tile(median_test,(1,number_of_values_in_mid_95_percent))

subtract_median_from_predicted_values = np.sqrt(np.sum(np.square(median_test_tiled - mid_ninety_five_percent_values),axis = 1,keepdims=True))
magnitude_of_median_test_vector = np.sqrt(np.sum(np.square(median_test_tiled),axis = 1,keepdims=True))

weighted_subtraction_values = subtract_median_from_predicted_values/np.abs(magnitude_of_median_test_vector)
distance_to_model = pd.DataFrame(weighted_subtraction_values,columns = ['distance_to_ensemble_models_for_test_compounds'])

In [22]:
pd.concat([test_set_smiles,df_test_values_inverse_transform,median_test_df,lower_test_df,upper_test_df,distance_to_model],axis = 1)

Unnamed: 0,Name,actual_test_values,predicted_test_values,lower_test,upper_test,distance_to_ensemble_models_for_test_compounds
0,CC(C)C1=CC=C(NC(=O)CSC2=NC(=CC=N2)C2=CC=CS2)C=C1,-1.60206,-1.452686,-1.911949,-1.214886,0.116968
1,CC1=CC(=O)NC(SCC(=O)NC2=C(OC3=CC=CC=C3)C=CC(Cl...,-2.0,-2.210737,-2.326461,-1.826288,0.040779
2,O=C1N(CC2=CC=C3C=CC=CC3=C2)C2=CC=C(C=C2C1=O)S(...,-1.600646,-1.380437,-1.657532,-1.125171,0.10147
3,O=C(CC1=NCCS1)C1=NC=CS1,-1.60206,-1.13212,-1.265858,-1.055357,0.040177
4,ClC(Cl)=C(Cl)C(=O)OC1=CC=C(C=C1)S(=O)(=O)C1=CC...,0.045757,-0.861368,-1.191905,-0.034737,0.378365
5,IC1=CC=C2N(CC3=CC4=CC=CC=C4S3)C(=O)C(=O)C2=C1,0.022276,-0.533791,-1.054969,-0.197375,0.440668
6,IC1=CC=C2N(C\C=C\C3=CC4=CC=CC=C4S3)C(=O)C(=O)C...,-1.371068,-0.760634,-1.076522,-0.327856,0.261917
7,O=C(N1CCN(CC1)S(=O)(=O)C1=CC2=C(NC(=O)C2=O)C=C...,-1.003029,-1.530489,-1.583327,-1.126235,0.069653
8,IC1=CC=C2N(CC3=CC=C(S3)C(=O)N3CCCCC3)C(=O)C(=O...,-1.243038,-1.096094,-1.153682,-0.942176,0.040907
9,ClC1=C2C(=O)C(=O)N(CC3=CC4=CC=CC=C4S3)C2=CC=C1,-1.049218,-1.007207,-1.134227,-0.634655,0.162427
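
Applying the empirical cutoff from the methodology to the ten test-set rows shown above can be sketched as follows. The distances are copied from that table; the names `CUTOFF`, `report`, and `outside_domain` are hypothetical.

```python
import pandas as pd

# Distances copied from the ten test-set rows displayed above.
test_distances = pd.Series(
    [0.116968, 0.040779, 0.10147, 0.040177, 0.378365,
     0.440668, 0.261917, 0.069653, 0.040907, 0.162427],
    name="distance_to_model",
)

CUTOFF = 0.3  # empirical threshold stated in the methodology
report = test_distances.to_frame()
report["reliable"] = report["distance_to_model"] <= CUTOFF

# Row indices whose prediction falls outside the applicability domain.
outside_domain = report.index[~report["reliable"]].tolist()
```

For these ten rows, compounds 4 and 5 exceed the cutoff and the remaining eight are within the applicability domain.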


In [23]:
# Blinded molecules prediction
blinded_csv = pd.read_csv('molecular_descriptors_csv/min_max_scaled_blinded_molecular_descriptors.csv')
names_blinded = blinded_csv['Name']


# .copy() so that the padding column is added to a new frame, not to a slice.
blinded_csv = blinded_csv.loc[:,'nAcid':'Zagreb'].copy()
blinded_csv['zeros'] = 0

blinded_imgs = np.reshape(np.array(blinded_csv),(-1,35,32,1))
blinded_predicted_values_inverse_transform_modelwise = []
blinded_predicted_values_modelwise = []

for key,values in best_models.items():
    for value in values:
        json_file = open(key+'/model_'+str(value)+'.json')
        loaded_model_json = json_file.read()
        json_file.close()
        loaded_model = model_from_json(loaded_model_json)
        loaded_model.load_weights(key+'/model'+str(value)+'.h5')

        # Predict blinded set
        predicted_blinded_values = loaded_model.predict(blinded_imgs)
        predicted_blinded_values = np.reshape(predicted_blinded_values,-1)
        # min max scaler inverse transform for predicted blinded values
        # .copy() so that the 'values' column is added to a new frame, not to a slice.
        blinded_data_predicted = blinded_csv.loc[:,'nAcid':'Zagreb'].copy()
        blinded_data_predicted['values'] = predicted_blinded_values
        blinded_data_inverse_transform_predicted = scaler.inverse_transform(blinded_data_predicted)
        blinded_value_predicted_inverse_transform = blinded_data_inverse_transform_predicted[:,len(blinded_data_inverse_transform_predicted[0])-1]
        blinded_predicted_values_inverse_transform_modelwise.append(blinded_value_predicted_inverse_transform)
        blinded_predicted_values_modelwise.append(predicted_blinded_values)
df_blinded_values_inverse_transform_modelwise = pd.DataFrame(np.transpose(np.array(blinded_predicted_values_inverse_transform_modelwise)),columns=model_names)
df_blinded_values_modelwise = pd.DataFrame(np.transpose(np.array(blinded_predicted_values_modelwise)),columns=model_names)

In [24]:
pd.concat([pd.DataFrame(names_blinded),df_blinded_values_inverse_transform_modelwise],axis=1)

Unnamed: 0,Name,"best_models,1","best_models,10","best_models,13","best_models,14","best_models,15","best_models,18","best_models,20","best_models,21","best_models,23",...,"best_models,59","best_models,63","best_models,9","best_models,24","best_models,53","best_models,62","best_models,16","best_models,61","best_models,4","best_models,7"
0,CSC1=C(C(C)=C(S1)C1=NC(C)=CS1)C1=CC=NC(SCC(=O)...,-1.022385,-1.005521,-1.012495,-1.029578,-1.032155,-1.023687,-0.989363,-1.003371,-1.059993,...,-0.974629,-1.001711,-1.053785,-0.929538,-1.035385,-1.084139,-1.015936,-1.001421,-0.984339,-1.003269
1,COC1=CC=CC(=C1)C1=C(C#N)C(=O)NC(SCC(=O)NC2=CC=...,-1.757268,-1.72169,-1.711514,-1.81321,-1.735065,-1.682747,-1.719634,-1.732969,-1.711519,...,-1.670271,-1.729141,-1.781469,-1.684644,-1.747518,-1.58707,-1.768572,-1.835863,-1.91552,-1.714714
2,CC1=COC2=C1C(=O)C(=O)C1=C3CCCC(C)(C)C3=CC=C21,-1.444162,-1.421013,-1.336934,-1.459547,-1.53762,-1.437024,-1.308345,-1.303241,-1.548518,...,-1.357577,-1.573024,-1.458299,-1.526067,-1.450921,-1.378664,-1.611262,-1.707342,-1.622041,-1.36333
3,CCOC(=O)C(=C\NC1=CC=C(C=C1)S(=O)(=O)C1=CC=C(N\...,-1.463089,-0.35046,-1.370257,-1.553446,-1.297334,-1.496654,-1.178616,-1.218292,-1.210204,...,-1.90296,-1.383601,-1.380833,-1.260387,-1.355537,-1.246906,-1.697686,-1.538915,-1.36379,-1.168566
4,CCOC(=O)C(=CNC1=CC=C(C=C1)S(=O)(=O)C1=CC=C(NC=...,-1.382226,-0.320728,-1.413336,-1.668902,-1.343382,-1.444137,-1.2311,-1.213049,-1.216024,...,-1.800269,-1.49182,-1.285916,-1.246091,-0.886349,-1.153783,-1.582269,-1.593865,-1.352281,-1.077205
5,[O-][N+](=O)C1=C(C=CC(Cl)=C1)C1=CC=C(O1)C(=O)O...,0.637426,0.717903,0.624899,0.659582,0.648644,0.642657,0.683269,0.716889,0.743684,...,0.519072,0.6117,0.631748,0.876307,0.640626,0.777624,0.631749,0.881962,0.59277,0.66703
6,ClC1=CC=C(C=C1)C(=O)OC1=CN=CC(Cl)=C1,0.309826,0.446158,0.554757,0.485666,0.397424,0.321294,0.362738,0.357146,0.250157,...,0.453314,0.356685,0.538046,0.264678,0.223916,0.323356,0.249069,0.416404,0.356268,0.364487
7,CN1CCN(CC1)S(=O)(=O)C1=CC2=C(NC(=O)C2=O)C=C1,-0.103866,-0.315213,-0.137522,-0.973966,-1.114284,-0.585965,-0.164043,-0.54575,-0.455687,...,-1.050557,-0.140073,-0.228787,-0.959253,-0.706368,-1.066781,-0.163752,-0.505206,-0.651053,-0.605811
8,O=C1NC2=C(C=C(C=C2)S(=O)(=O)N2CCCCC2)C1=O,-0.329061,-0.453288,-0.358573,-0.095082,-0.442461,-0.20988,-0.386657,-0.375514,-0.54087,...,-0.701598,-0.649795,-0.345255,-0.699136,-0.572533,-0.818171,-0.284937,-0.4591,-0.51974,-0.475153
9,IC1=CC=C2N(CC3COC4=C(O3)C=CC=C4)C(=O)C(=O)C2=C1,-1.148802,-0.989688,-0.009474,-0.998987,-0.070087,-0.407185,-0.946206,-0.554551,-0.856654,...,-0.485938,-1.15668,-1.073934,0.072287,-0.68212,-0.84559,-0.270072,-0.360026,-0.46606,-0.9268


#### Distance to model computation for blinded molecules.

In [27]:
# Calculate non parametric CI at 95% confidence (blinded)
df_modelwise_blinded_inverse_transform_array = np.array(df_blinded_values_inverse_transform_modelwise)
two_point_five_percentile = percentile(df_modelwise_blinded_inverse_transform_array,2.5,axis=1)
ninety_seven_point_five_percentile = percentile(df_modelwise_blinded_inverse_transform_array,97.5,axis=1)
lower_blinded_df = pd.DataFrame(two_point_five_percentile,columns=['lower_blinded'])
upper_blinded_df = pd.DataFrame(ninety_seven_point_five_percentile,columns=['upper_blinded'])
median_blinded = np.reshape(np.array(df_blinded_values_inverse_transform_modelwise.median(axis = 1)),(-1,1))
median_blinded_df = pd.DataFrame(median_blinded,columns=['predicted_blinded_values'])

# Get mid 95 percent values
sorted_df_modelwise_inverse_transform_array = np.sort(df_modelwise_blinded_inverse_transform_array,axis = 1)
lower_index = int(round(0.025 * num_of_models))
upper_index =int(round(0.95* num_of_models))

mid_ninety_five_percent_values = sorted_df_modelwise_inverse_transform_array[:,lower_index:upper_index]
number_of_values_in_mid_95_percent = len(mid_ninety_five_percent_values[0])

median_blinded_tiled = np.tile(median_blinded,(1,number_of_values_in_mid_95_percent))

subtract_median_from_predicted_values = np.sqrt(np.sum(np.square(median_blinded_tiled - mid_ninety_five_percent_values),axis = 1,keepdims=True))
magnitude_of_median_blinded_vector = np.sqrt(np.sum(np.square(median_blinded_tiled),axis = 1,keepdims=True))

weighted_subtraction_values = subtract_median_from_predicted_values/magnitude_of_median_blinded_vector
distance_to_model = pd.DataFrame(weighted_subtraction_values,columns = ['distance_to_ensemble_models_for_blinded_compounds'])

In [28]:
# Blinded molecule pIC50 values are saved in pIC50_values_csv/blinded_pIC50_values.csv
blinded_results = pd.concat([names_blinded, median_blinded_df, distance_to_model], axis=1)
blinded_results.to_csv('pIC50_values_csv/blinded_pIC50_values.csv')

# Display the same table in the notebook.
blinded_results


Unnamed: 0,Name,predicted_blinded_values,distance_to_ensemble_models_for_blinded_compounds
0,CSC1=C(C(C)=C(S1)C1=NC(C)=CS1)C1=CC=NC(SCC(=O)...,-1.022385,0.026128
1,COC1=CC=CC(=C1)C1=C(C#N)C(=O)NC(SCC(=O)NC2=CC=...,-1.711519,0.029136
2,CC1=COC2=C1C(=O)C(=O)C1=C3CCCC(C)(C)C3=CC=C21,-1.449959,0.080469
3,CCOC(=O)C(=C\NC1=CC=C(C=C1)S(=O)(=O)C1=CC=C(N\...,-1.380833,0.130387
4,CCOC(=O)C(=CNC1=CC=C(C=C1)S(=O)(=O)C1=CC=C(NC=...,-1.352281,0.147236
5,[O-][N+](=O)C1=C(C=CC(Cl)=C1)C1=CC=C(O1)C(=O)O...,0.650024,0.115624
6,ClC1=CC=C(C=C1)C(=O)OC1=CN=CC(Cl)=C1,0.356268,0.228735
7,CN1CCN(CC1)S(=O)(=O)C1=CC2=C(NC(=O)C2=O)C=C1,-0.651053,0.486359
8,O=C1NC2=C(C=C(C=C2)S(=O)(=O)N2CCCCC2)C1=O,-0.386657,0.360394
9,IC1=CC=C2N(CC3COC4=C(O3)C=CC=C4)C(=O)C(=O)C2=C1,-0.503567,0.719646
