## ML regression to predict the Efficacy of an active G9a inhibitor. Dataset 5 (skewness removed after logarithmic transformation)

### Content   <a name="content"></a>

1. [Load data](#1)
2. [Regression Machine Learning](#2)
3. [Cross-validation](#3)
4. [Calculate the relative error of the Gradient Boosting Regressor model](#4)
5. [Feature importance of the Gradient Boosting Regressor model](#5)
6. [Comparison of the first six features from the feature importance results](#6)
7. [Hyperparameter tuning of the model with the reduced features](#7)
8. [Relative error of the reduced data model](#8)

## Load data<a name="1"></a>

In [1]:
# pip install modin[ray] 
# pip install sidetable

In [2]:
import pandas as pd 

# Load the dataset for the regression ML
df = pd.read_csv('data_reg_no_skew_2.csv', index_col=[0])
# Prevent columns from being truncated when the data frame is displayed
pd.set_option('display.max_columns', None)
# Display the data frame
print('Shape of df: ', df.shape)
df.head()

Shape of df:  (3890, 22)


Unnamed: 0,O_relative,Similarity,S,C_rel_2D,allAtoms_rel_2D,C_rel_XY_3D,allAtoms_rel_XY_3D,MMY_3D,CBUC,mean_C**SX6_S,mean_H**SX6_S,C_rel**MMZ6_3D,MMX_3D,YZ_3D_volume,XY_3D_volume,XZ_3D_volume,MMY6_3D,MMX,C_relative,HBDC,mean_H_rel**MM_3D,Efficacy
0,0.04879,-3.540459,2.815409,-0.036445,-0.076324,1.227126,0.88923,1.468128,0.0,2.352758,0.30916,-0.409235,2.357357,-4.427062,-0.759324,-1.725495,0.890357,2.021203,-0.798508,0.693147,-10.791702,130.562
1,0.0,-2.813411,3.218076,-0.459874,-0.520649,0.372476,0.231709,2.104756,0.0,-2.518752,-0.723723,-2.047153,2.336465,-1.671056,-2.016523,-3.688788,1.769224,1.760733,-1.049822,0.693147,-7.17033,123.936
2,0.039221,-2.813411,0.0,-0.375204,-0.375204,0.036507,0.08601,2.178302,0.0,-0.624115,-0.673394,-2.92677,2.264312,-2.604338,-0.94136,-1.670658,1.966175,2.018364,-0.798508,0.693147,-7.266715,131.956
3,0.039221,-2.27938,0.0,1.161764,0.92984,1.012986,0.708505,1.627592,0.0,0.551769,-0.356377,-1.32974,2.336097,-2.918561,-0.768325,-1.670452,1.097578,2.363304,-0.941609,1.098612,-8.029929,130.878
4,0.086178,-2.813411,0.0,-0.328363,-0.259727,0.696946,0.515371,1.944337,0.0,-0.038294,0.370315,-2.707203,2.459708,-2.309655,-1.476875,-5.541154,1.57294,2.01835,-0.967584,1.609438,-10.432946,115.735


In [3]:
# Check for NaN
df.isnull().values.any()

False

In [4]:
df.describe(include="all")

Unnamed: 0,O_relative,Similarity,S,C_rel_2D,allAtoms_rel_2D,C_rel_XY_3D,allAtoms_rel_XY_3D,MMY_3D,CBUC,mean_C**SX6_S,mean_H**SX6_S,C_rel**MMZ6_3D,MMX_3D,YZ_3D_volume,XY_3D_volume,XZ_3D_volume,MMY6_3D,MMX,C_relative,HBDC,mean_H_rel**MM_3D,Efficacy
count,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0,3890.0
mean,0.056535,-2.920095,1.319144,0.181489,0.172925,0.814016,0.613197,1.8773,0.0,0.029097,0.024811,-2.044991,2.491408,-2.207809,-1.868079,-2.209979,1.478284,2.293653,-0.885774,0.758967,-10.896239,132.911167
std,0.03891,0.245462,1.305918,0.542658,0.479744,0.407933,0.301689,0.207507,0.0,1.288589,0.430362,1.01128,0.19429,1.64904,1.500742,1.698765,0.329716,0.276471,0.110488,0.419714,2.639387,29.294427
min,0.0,-4.50986,0.0,-1.84191,-1.84191,-0.551531,-0.225224,1.106878,0.0,-6.053763,-2.031212,-7.062266,1.649233,-11.830452,-8.955455,-15.424812,-0.929629,0.957509,-1.275896,0.0,-21.271025,62.0004
25%,0.029559,-3.079114,0.0,-0.195802,-0.170659,0.512311,0.392284,1.721784,0.0,-0.756726,-0.229145,-2.63138,2.36166,-3.157131,-2.791262,-3.213495,1.248095,2.1383,-0.967584,0.693147,-12.50894,112.4935
50%,0.058269,-2.937463,2.253395,0.144369,0.141862,0.809754,0.612417,1.881273,0.0,0.010536,0.027735,-2.088391,2.500185,-2.186439,-1.82225,-2.203014,1.511549,2.314,-0.867501,0.693147,-10.683735,131.5865
75%,0.076961,-2.780621,2.488234,0.551068,0.523259,1.109457,0.826652,2.025041,0.0,0.81309,0.292953,-1.39267,2.625009,-1.232699,-0.886061,-1.205906,1.721283,2.487642,-0.820981,1.098612,-8.978427,151.59975
max,0.170051,-2.27938,3.385922,1.434642,1.232933,1.758897,1.318395,2.387927,0.0,3.34148,0.981189,-0.0819,2.96584,3.549883,1.708549,3.559887,2.200404,2.92008,-0.597124,1.619733,-6.021254,289.972
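
The summary above shows `CBUC` with mean, standard deviation, minimum and maximum all equal to 0.0, so the column is constant and gives a regressor nothing to learn from. Below is a minimal sketch of how such columns could be detected and dropped. The small `demo` frame is made up for illustration and is not the notebook's data.

```python
import pandas as pd

# Hypothetical toy frame: column "b" is constant, like CBUC in the summary above
demo = pd.DataFrame({"a": [0.1, 0.5, 0.9],
                     "b": [0.0, 0.0, 0.0],
                     "c": [1.0, 2.0, 2.0]})

# A column is constant when it holds a single distinct value
constant_cols = [col for col in demo.columns if demo[col].nunique() == 1]
print(constant_cols)

# Dropping the constant columns leaves only the informative ones
demo_reduced = demo.drop(columns=constant_cols)
print(demo_reduced.columns.tolist())
```

Whether to drop such a column before training is a judgment call. Tree-based models simply never split on it, while it adds nothing but a zero column after standardisation.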


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3890 entries, 0 to 3889
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   O_relative          3890 non-null   float64
 1   Similarity          3890 non-null   float64
 2   S                   3890 non-null   float64
 3   C_rel_2D            3890 non-null   float64
 4   allAtoms_rel_2D     3890 non-null   float64
 5   C_rel_XY_3D         3890 non-null   float64
 6   allAtoms_rel_XY_3D  3890 non-null   float64
 7   MMY_3D              3890 non-null   float64
 8   CBUC                3890 non-null   float64
 9   mean_C**SX6_S       3890 non-null   float64
 10  mean_H**SX6_S       3890 non-null   float64
 11  C_rel**MMZ6_3D      3890 non-null   float64
 12  MMX_3D              3890 non-null   float64
 13  YZ_3D_volume        3890 non-null   float64
 14  XY_3D_volume        3890 non-null   float64
 15  XZ_3D_volume        3890 non-null   float64
 16  MMY6_3D    

[<a href="#content">Back to top</a>]

## Regression Machine Learning <a name="2"></a>

In [6]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Separate the feature columns from the target column 'Efficacy'
X = df.drop(['Efficacy'], axis=1) 
y = df['Efficacy'] 

# Split the data set into train and test parts 
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=5) 
# Standardise the features, fitting the scaler on the training split only
sc = StandardScaler()
X_train = sc.fit_transform(X_train_unscaled)
X_test = sc.transform(X_test_unscaled)

# Print the shape of each part
print("Shapes:")
print("X_train: ", X_train.shape)
print("X_test:  ", X_test.shape)
print("y_train: ", y_train.shape)
print("y_test:  ", y_test.shape)

Shapes:
X_train:  (3112, 21)
X_test:   (778, 21)
y_train:  (3112,)
y_test:   (778,)


In [7]:
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate the algorithms that will be used, placing them in a dictionary 
regs = {"SVR":SVR(kernel='linear'),
        "DecisionTree":DecisionTreeRegressor(), 
        "RandomForest":RandomForestRegressor(), 
        "GradientBoost":GradientBoostingRegressor(),}

In [8]:
from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Train each algorithm and collect the train/test metrics in a data frame
def model_fit(regs):
    fitted_model = {}
    rows = []
    for model_name, model in regs.items():
        model.fit(X_train, y_train)
        fitted_model[model_name] = model
        # Predict once per split and reuse the predictions for every metric
        pred_train = model.predict(X_train)
        pred_test = model.predict(X_test)
        rows.append({
            'Algorithm': model_name,
            'RMSE_Train': round(root_mean_squared_error(y_train, pred_train), 2),
            'RMSE_Test': round(root_mean_squared_error(y_test, pred_test), 2),
            'MAE_Train': round(mean_absolute_error(y_train, pred_train), 2),
            'MAE_Test': round(mean_absolute_error(y_test, pred_test), 2),
            'R2_Train': round(r2_score(y_train, pred_train), 2),
            'R2_Test': round(r2_score(y_test, pred_test), 2),
        })
    # Build the frame in one step instead of calling the private DataFrame._append
    return fitted_model, pd.DataFrame(rows)

fitted_model, model_result = model_fit(regs)
model_result.sort_values(by=['MAE_Test'],ascending=True)

Unnamed: 0,Algorithm,RMSE_Train,RMSE_Test,MAE_Train,MAE_Test,R2_Train,R2_Test
2,RandomForest,11.07,27.75,8.62,21.92,0.86,0.03
3,GradientBoost,26.11,27.78,20.61,21.94,0.22,0.03
0,SVR,29.38,27.98,23.02,22.05,0.01,0.01
1,DecisionTree,0.0,39.81,0.0,31.8,1.0,-0.99
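
A test R2 of about 0.03 means the best models explain only slightly more variance than always predicting the training mean. One way to make that comparison explicit is a mean-predicting baseline. The sketch below uses scikit-learn's `DummyRegressor` on synthetic data. The arrays and the manual 160/40 split are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-in data: features carry no signal about the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = rng.normal(loc=130.0, scale=30.0, size=200)

# Simple hold-out split
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# The baseline ignores X and always predicts the mean of y_tr
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_pred = baseline.predict(X_te)

baseline_mae = mean_absolute_error(y_te, baseline_pred)
baseline_r2 = r2_score(y_te, baseline_pred)
print(round(baseline_mae, 2), round(baseline_r2, 2))
```

A constant prediction can never score above 0 for R2 on held-out data, so a fitted model's MAE and R2 are only meaningful relative to these baseline values.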


[<a href="#content">Back to top</a>]

## Cross-validation <a name="3"></a>

In [9]:
from sklearn.model_selection import cross_val_score
import numpy as np

# Cross-validate each algorithm and summarise the per-fold MAE
def model_CV(regs):
    fitted_model = {}
    rows = []
    for model_name, model in regs.items():
        fitted_model[model_name] = model
        # scikit-learn reports the negated MAE (higher is better), so flip the sign back
        scores = -cross_val_score(model, X_train, y_train, cv=5,
                                  scoring='neg_mean_absolute_error')
        rows.append({
            'Algorithm': model_name,
            'CV_MAE': round(np.mean(scores), 2),
            'Sta Dev MAE': round(np.std(scores), 2),
            'List of MAE': np.round(scores, 2),
        })
    # Build the frame in one step instead of calling the private DataFrame._append
    return fitted_model, pd.DataFrame(rows)

fitted_model, model_cv_result = model_CV(regs)
model_cv_result.sort_values(by=['CV_MAE'],ascending= True)

Unnamed: 0,Algorithm,CV_MAE,Sta Dev MAE,List of MAE
0,SVR,23.33,0.71,"[23.89, 22.33, 23.28, 24.33, 22.84]"
2,RandomForest,23.34,0.69,"[23.96, 22.14, 23.24, 24.08, 23.27]"
3,GradientBoost,23.64,0.64,"[24.32, 22.82, 23.37, 24.47, 23.23]"
1,DecisionTree,33.28,0.51,"[33.8, 32.76, 33.19, 32.72, 33.93]"
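
In the cells above, `X_train` was standardised once on the full training split before `cross_val_score` was called, so every validation fold had already contributed to the scaler's mean and variance. The effect is probably small here, but wrapping the scaler and the estimator in a pipeline keeps the scaling inside each fold. This is a sketch on synthetic data. The arrays and the choice of a linear SVR are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data with a simple linear signal in the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.5, size=100)

# The scaler is refitted on the training part of every fold
pipe = make_pipeline(StandardScaler(), SVR(kernel="linear"))

# Negate the scores to recover a positive MAE per fold
cv_mae = -cross_val_score(pipe, X_demo, y_demo, cv=5,
                          scoring="neg_mean_absolute_error")
print(np.round(cv_mae, 2))
```

With this arrangement the unscaled training features would be passed to `cross_val_score`, and the pipeline takes care of the standardisation.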


[<a href="#content">Back to top</a>]

## Calculate the relative error of the Gradient Boosting Regressor model  <a name="4"></a>

In [11]:
import sklearn.metrics as metrics
from sklearn.metrics import r2_score

# Instantiate and train a model
model = GradientBoostingRegressor().fit(X_train, y_train)

# Predict 
pred = model.predict(X_test)

# Evaluate
print('Mean Absolute Error (MAE):', round(metrics.mean_absolute_error(y_test, pred),2))
print('Mean Squared Error (MSE):', round(metrics.mean_squared_error(y_test, pred),2))
print('Root Mean Squared Error (RMSE):', round(np.sqrt(metrics.mean_squared_error(y_test, pred))))
print("R2 score:", round(r2_score(y_test, pred), 2))

Mean Absolute Error (MAE): 21.93
Mean Squared Error (MSE): 770.7
Root Mean Squared Error (RMSE): 28
R2 score: 0.03


In [12]:
# Create a data frame with the test values 
data_verify=pd.DataFrame(y_test.tolist(),columns=["Real Values"])

# Create a data frame with the values predicted 
data_predicted=pd.DataFrame(pred.tolist(),columns=["Predicted Values"])

# Concatenate the data frames with the test and the values predicted
final_output=pd.concat([data_verify,data_predicted],axis=1)

# Create column with the difference between the test and prediction values
final_output["Difference"]= np.abs(final_output["Real Values"]-final_output["Predicted Values"])
final_output["Relative proportion Difference/Real Value"]= (final_output["Difference"]/final_output["Real Values"])

# Display the resulting data frame
final_output

Unnamed: 0,Real Values,Predicted Values,Difference,Relative proportion Difference/Real Value
0,133.9060,131.947811,1.958189,0.014624
1,106.7490,133.912759,27.163759,0.254464
2,73.8972,119.136865,45.239665,0.612197
3,131.3220,132.306512,0.984512,0.007497
4,128.1160,133.454229,5.338229,0.041667
...,...,...,...,...
773,152.0050,133.308785,18.696215,0.122997
774,138.4730,130.454499,8.018501,0.057907
775,134.5040,134.583149,0.079149,0.000588
776,114.2020,127.358077,13.156077,0.115200


In [13]:
# Mean of the relative error
df_reg_rel_mean = final_output["Relative proportion Difference/Real Value"].mean()
print("Relative error: ", df_reg_rel_mean)

Relative error:  0.18161245366016568
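
The mean of |real - predicted| / real computed above is the mean absolute percentage error (MAPE) expressed as a fraction, so the model is off by roughly 18% on average. scikit-learn's `mean_absolute_percentage_error` should give the same number as long as the real values are positive, which holds for this target (the minimum Efficacy in the summary is 62.0004). A short check using the first three rows of the table above:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# First three rows of the real/predicted table shown above
real = np.array([133.9060, 106.7490, 73.8972])
predicted = np.array([131.947811, 133.912759, 119.136865])

# Manual calculation, as in the notebook
manual_rel_error = np.mean(np.abs(real - predicted) / real)

# Same quantity from scikit-learn (returned as a fraction, not a percentage)
sklearn_mape = mean_absolute_percentage_error(real, predicted)

print(manual_rel_error, sklearn_mape)
```

Because each error is divided by the real value, the same absolute miss weighs more on low-efficacy compounds than on high-efficacy ones.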


[<a href="#content">Back to top</a>]