# Python Notebook 3: Optimization of Selected Algorithms

In [179]:
# Import Pandas and Numpy
import pandas as pd
import numpy as np

# Ability to export model 
import joblib

# Import fingerprint generator 
from padelpy import padeldescriptor

# Import algorithms 
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor


# Import Variance Threshold
from sklearn.feature_selection import VarianceThreshold


# Import Standard Scaler for target scaling
from sklearn.preprocessing import StandardScaler


# Import train-test split and grid search
from sklearn.model_selection import train_test_split, GridSearchCV


# Import metrics for model evaluation
from sklearn.metrics import r2_score, mean_squared_error


# Choose directory for CSV importing/exporting
import os
os.chdir("/Users/danigeiger/Desktop/Capstone/Analysis")
os.getcwd()


'/Users/danigeiger/Desktop/Capstone/Analysis'

## Part 1: Review Top 20 Algorithms with Feature and Target Selections
    

In [180]:
# Import top 20 best performing models
top20 = pd.read_csv("lazy_predict_best_models.csv")
top20

Unnamed: 0,Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken,Data Set,Feature Reduction,Target Variable
0,BaggingRegressor,1.704929,0.828709,0.689196,0.040054,Full Set,VarThres,Unstandardized
1,BaggingRegressor,1.715556,0.826127,0.432155,0.046716,Full set,VarThres,Standardized
2,RandomForestRegressor,1.023296,0.825953,0.442141,0.693931,Small Set,,Standardized
3,RandomForestRegressor,1.023738,0.822648,0.717056,0.685968,Small Set,,Unstandardized
4,RandomForestRegressor,1.738724,0.820497,0.439096,0.299595,Full set,VarThres,Standardized
5,RandomForestRegressor,1.744667,0.819053,0.708355,0.295899,Full Set,VarThres,Unstandardized
6,RandomForestRegressor,1.025351,0.816618,0.443815,0.687223,Full Set,,Standardized
7,RandomForestRegressor,1.025374,0.816454,0.713423,0.690787,Full Set,,Unstandardized
8,GradientBoostingRegressor,1.024913,0.813874,0.73458,0.25537,Small Set,,Unstandardized
9,GradientBoostingRegressor,1.766036,0.813861,0.447139,0.123674,Full set,VarThres,Standardized


### Algorithm Selection:
##### Bagging Regressor, Random Forest Regressor, and Gradient Boosting Regressor on the Full Dataset with Variance Threshold Feature Reduction and Standardized Target Variables. 

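##### These three algorithm-data pairings achieved R-squared values (0.814 to 0.826) and adjusted R-squared values among the highest in the table, with RMSE values (0.432 to 0.447) in line with the other standardized-target models. Note that RMSE is only comparable between rows that share the same target scaling. Additionally, the computation time for each algorithm was reasonable (under 0.3 in the Time Taken column).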


In [181]:
selected_models = top20.loc[[1, 4, 9]]
selected_models

Unnamed: 0,Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken,Data Set,Feature Reduction,Target Variable
1,BaggingRegressor,1.715556,0.826127,0.432155,0.046716,Full set,VarThres,Standardized
4,RandomForestRegressor,1.738724,0.820497,0.439096,0.299595,Full set,VarThres,Standardized
9,GradientBoostingRegressor,1.766036,0.813861,0.447139,0.123674,Full set,VarThres,Standardized
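
Note: The Adjusted R-Squared values above exceed 1, which the standard formula only produces when the number of predictors is at least as large as the number of rows minus one. The sketch below checks one possible explanation: that the metric was computed on the test split. The test-set size of 108 rows (20% of roughly 538 molecules, inferred from the 430 training rows printed in Part 3) and the predictor count of 133 (the reduced feature count printed in Part 3) are assumptions, not values stated in the table.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Standard adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Assumed sizes: 108 test rows and 133 retained fingerprint features.
N_TEST, N_FEATURES = 108, 133

# R-squared values copied from the three selected rows of the table above.
selected = [
    ("BaggingRegressor", 0.826127),
    ("RandomForestRegressor", 0.820497),
    ("GradientBoostingRegressor", 0.813861),
]

for model, r2 in selected:
    value = adjusted_r2(r2, N_TEST, N_FEATURES)
    print(f"{model}: adjusted R-squared = {value:.4f}")
```

Under these assumptions the formula lands within about 0.00001 of the tabulated values, because n - p - 1 is negative. In that regime adjusted R-squared is not a meaningful ranking criterion, so plain R-squared and RMSE are the safer basis for comparison.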


## Part 2: Repeat Data Preprocessing from Python Notebook 2

In [182]:
# Import data set
full_set = pd.read_csv("Part4_IC50_cleansed.csv")


# Convert the standard_value column to a numeric type
full_set['standard_value'] = pd.to_numeric(full_set['standard_value'], errors='coerce')

# Convert values from nanomolar to molar 
full_set['standard_value'] = full_set['standard_value']*10**-9 

# Take the negative log of the molar amount
full_set['standard_value']= -np.log10(full_set['standard_value']) 


# Select the required columns for smi file for padel processing
df_smiles_full_set = full_set[['canonical_smiles', 'molecule_chembl_id']]


# Save to a .smi file (PaDEL format)
df_smiles_full_set.to_csv('molecules_full.smi', sep='\t', index=False, header=False)


# Create fingerprints (features)
padeldescriptor(
    mol_dir='molecules_full.smi',          
    d_file='fingerprints_full_output.csv', 
    fingerprints=True,                # binary data: where 1 represents the presence of some characteristic, and 0 means lacking that characteristic
    retainorder=True,                 # you need this so we can tie labels/targets back to the correct row
)

# Import fingerprints / engineered features back in as dataframe
fingerprints_full = pd.read_csv('fingerprints_full_output.csv')


# Create a dataframe which includes targets, features and molecular identification
fingerprints_with_targets_full_df = pd.concat([fingerprints_full, full_set['standard_value']], axis =1)


# Export data for sharing and transparency
fingerprints_with_targets_full_df.to_csv("fingerprints_with_targets_full.csv")


# Split full molecule data set into fingerprints (features) and standard values (targets)
X_full_set = fingerprints_with_targets_full_df.drop(['Name', 'standard_value'], axis =1)
y_full_set = fingerprints_with_targets_full_df['standard_value']
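
Note: The target transformation in the cell above (nanomolar to molar, then negative base-10 logarithm) is the usual pIC50 conversion. The following is a minimal standalone sketch of the same arithmetic; the helper name `ic50_nm_to_pic50` and the example concentrations are illustrative and do not come from the data set.

```python
import math

def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    ic50_molar = ic50_nm * 1e-9
    return -math.log10(ic50_molar)

# Illustrative concentrations only; a lower IC50 gives a higher pIC50.
for ic50_nm in (1.0, 100.0, 10_000.0):
    print(f"{ic50_nm:>8} nM -> pIC50 = {ic50_nm_to_pic50(ic50_nm):.2f}")
```

Because the scale is logarithmic, a tenfold drop in IC50 raises pIC50 by exactly one unit. Any value that `pd.to_numeric(..., errors='coerce')` turned into NaN stays NaN after the transformation.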


## Part 3: Apply Data Preprocessing Specific to Chosen Models

### Apply Bernoulli Variance Formula for Binary Variance Threshold Selection

In [183]:
selector = VarianceThreshold(threshold=0.8*(1-0.8)) 
X_full_set_reduced = selector.fit_transform(X_full_set)
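
Note: For a binary (0/1) feature that equals 1 in a fraction p of the rows, the variance is p(1 - p), so the threshold 0.8 * (1 - 0.8) = 0.16 removes bits that are nearly constant. The sketch below reproduces that rule in plain Python on toy columns; the column names and values are hypothetical, and the helper only mimics the selection rule rather than calling `VarianceThreshold` itself.

```python
def bernoulli_variance(p):
    """Variance of a 0/1 feature that equals 1 with probability p."""
    return p * (1.0 - p)

# Same expression as in the cell above, approximately 0.16.
THRESHOLD = bernoulli_variance(0.8)

def is_retained(column, threshold=THRESHOLD):
    """Keep a binary column only if its variance exceeds the threshold."""
    p = sum(column) / len(column)
    return bernoulli_variance(p) > threshold

# Hypothetical fingerprint bits over ten molecules.
columns = {
    "bit_set_in_50pct": [1, 0] * 5,
    "bit_set_in_70pct": [1] * 7 + [0] * 3,
    "bit_set_in_90pct": [1] * 9 + [0] * 1,
    "bit_never_set": [0] * 10,
}

retained = {name: is_retained(values) for name, values in columns.items()}
print(retained)
```

In effect, a bit survives only when it is set in more than 20% and fewer than 80% of the molecules.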

### Perform Train-Test Split

In [184]:
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X_full_set_reduced, y_full_set, test_size = 0.2, random_state=42)

print(f"Data set dimensions are : {X_train_full.shape}")

Data set dimensions are : (430, 133)


### Convert Pandas DataFrames to NumPy Arrays for Efficient Data Handling

In [185]:
X_train_full = np.array(X_train_full, dtype=np.float64)
X_test_full = np.array(X_test_full, dtype=np.float64)
y_train_full = np.array(y_train_full, dtype=np.float64)
y_test_full = np.array(y_test_full, dtype=np.float64)


### Apply Standard Scaler to Target Values

In [186]:
scaler = StandardScaler()
y_train_full = scaler.fit_transform(y_train_full.reshape(-1, 1)).flatten()
y_test_full= scaler.transform(y_test_full.reshape(-1,1)).flatten()
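
Note: Because the target is standardized with the training mean and standard deviation, every RMSE reported below is in units of training-set standard deviations, not pIC50 units. The sketch below, on synthetic numbers rather than the notebook's data, shows that multiplying a standardized RMSE by the training standard deviation recovers the RMSE in original units. It reimplements the scaling with NumPy instead of calling `StandardScaler`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pIC50-like targets and predictions, for illustration only.
y_train = rng.normal(loc=6.0, scale=1.5, size=200)
y_test = rng.normal(loc=6.0, scale=1.5, size=50)
y_pred = y_test + rng.normal(scale=0.5, size=50)

# Population mean and standard deviation (ddof=0), fitted on the training targets only.
mean, std = y_train.mean(), y_train.std()

def standardize(y):
    """Scale targets with the training statistics."""
    return (y - mean) / std

def rmse(y_true, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

rmse_original = rmse(y_test, y_pred)
rmse_scaled = rmse(standardize(y_test), standardize(y_pred))

print(f"RMSE in original units:        {rmse_original:.4f}")
print(f"RMSE in standardized units:    {rmse_scaled:.4f}")
print(f"Standardized RMSE * train std: {rmse_scaled * std:.4f}")
```

With the scaler fitted in this notebook, the corresponding factor would be `scaler.scale_[0]`. R-squared is unaffected by this linear rescaling, which is why it is the easier metric to compare between standardized and unstandardized runs.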

## Part 4: Optimizing Algorithms Through Hyperparameter Tuning

### Optimize Bagging Regressor

#### Retrieve Default Parameters for Grid Search Optimization Baseline

In [187]:
bagging_reg = BaggingRegressor(random_state= 42)
print(bagging_reg.get_params())

{'bootstrap': True, 'bootstrap_features': False, 'estimator': None, 'max_features': 1.0, 'max_samples': 1.0, 'n_estimators': 10, 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}


#### Check RMSE and R-Squared Values on Default Parameters

In [188]:
bagging_reg.fit(X_train_full, y_train_full)
y_pred = bagging_reg.predict(X_test_full)

rmse_default = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_default = r2_score(y_test_full, y_pred)
print(f"Test RMSE: {rmse_default:.4f}")
print(f"Test R-Squared: {r2_default:.4f}")

Test RMSE: 0.4322
Test R-Squared: 0.8261


Note: According to the scikit-learn documentation, if 'estimator' is set to None, the base estimator is a decision tree regressor with default parameters.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html

#### First Pass Grid Search for Bagging Regressor (864 Combinations)

In [189]:
# Define parameter grid options
param_grid = {
    'bootstrap': [True, False],
    'bootstrap_features': [True, False], 
    'max_features': [0.6, 0.8, 1.0], 
    'max_samples': [0.6, 0.8, 1.0], 
    'n_estimators': [10, 50, 100], 
    'oob_score': [True, False], 
    'warm_start': [False],  
    'estimator': [
        DecisionTreeRegressor(max_depth=None, min_samples_split=2, min_samples_leaf=1),  # Tree with default parameters
        DecisionTreeRegressor(max_depth=3, min_samples_split=2, min_samples_leaf=1),
        DecisionTreeRegressor(max_depth=5, min_samples_split=4, min_samples_leaf=2),
        DecisionTreeRegressor(max_depth=10, min_samples_split=5, min_samples_leaf=4),
    ],
}

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(bagging_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=4, verbose=0)

# Fit Grid Search to the training data
grid_search.fit(X_train_full, y_train_full)

# Apply best parameters to Bagging Regressor
bagging_reg = BaggingRegressor(**grid_search.best_params_, random_state=42)
bagging_reg.fit(X_train_full, y_train_full)

# Generate predictions from test features
y_pred = bagging_reg.predict(X_test_full)

# Calculate model evaluation metrics
rmse_grid = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_grid = r2_score(y_test_full, y_pred)

1080 fits failed out of a total of 4320.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1080 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/danigeiger/opt/anaconda3/envs/Module_2_670/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/danigeiger/opt/anaconda3/envs/Module_2_6
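
Note: The failure count (1080 of 4320 fits, exactly one quarter) is consistent with the grid pairing `oob_score=True` with `bootstrap=False`, a combination that scikit-learn's bagging ensembles reject because out-of-bag estimation requires bootstrap sampling. One way to avoid those fits, sketched below under that assumption, is to give `GridSearchCV` a list of dictionaries so that each `bootstrap` setting gets only its valid `oob_score` options. The helper `count_combinations` is hypothetical, and the `estimator` list is left out of the dictionaries to keep the sketch dependency-free.

```python
from math import prod

# Options shared by both sub-grids (the four 'estimator' settings are omitted here).
common = {
    'bootstrap_features': [True, False],
    'max_features': [0.6, 0.8, 1.0],
    'max_samples': [0.6, 0.8, 1.0],
    'n_estimators': [10, 50, 100],
    'warm_start': [False],
}

# One sub-grid per bootstrap setting: oob_score=True is only paired with bootstrap=True.
param_grid_valid = [
    {**common, 'bootstrap': [True], 'oob_score': [True, False]},
    {**common, 'bootstrap': [False], 'oob_score': [False]},
]

def count_combinations(grid):
    """Count parameter combinations in a dict grid or a list of dict grids."""
    grids = grid if isinstance(grid, list) else [grid]
    return sum(prod(len(options) for options in g.values()) for g in grids)

N_ESTIMATOR_OPTIONS = 4  # the four DecisionTreeRegressor settings in the cell above
N_FOLDS = 5              # cv=5
ORIGINAL_COMBINATIONS = 864

valid = count_combinations(param_grid_valid) * N_ESTIMATOR_OPTIONS
print(f"Valid combinations: {valid} ({valid * N_FOLDS} fits)")
print(f"Invalid fits avoided: {(ORIGINAL_COMBINATIONS - valid) * N_FOLDS}")
```

The number of avoided fits matches the 1080 failures reported above, which supports the assumed cause. Adding the `estimator` list back into `common` would make `param_grid_valid` usable as the `param_grid` argument.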

#### Best Parameters and Performance from First Grid Pass

In [190]:
print("Best Parameters:", grid_search.best_params_)
print(f"Test RMSE: {rmse_grid:.4f}")
print(f"Test R-Squared: {r2_grid:.4f}")

Best Parameters: {'bootstrap': True, 'bootstrap_features': True, 'estimator': DecisionTreeRegressor(), 'max_features': 0.8, 'max_samples': 1.0, 'n_estimators': 50, 'oob_score': True, 'warm_start': False}
Test RMSE: 0.4333
Test R-Squared: 0.8252


#### Check Differences in Metrics from the Default Settings

In [191]:
diff_rmse = rmse_default - rmse_grid
diff_r2 =  r2_grid - r2_default

print(f"RMSE Difference: {diff_rmse:.6f}")
print(f"R-Squared Difference: {diff_r2:.6f}")

#Note: A positive value indicates that Grid Search has improved model performance.

RMSE Difference: -0.001119
R-Squared Difference: -0.000902


#### Second Pass Grid Search for Bagging Regressor (30 Combinations)

In [192]:
# Define parameter grid options
param_grid = {
    'bootstrap': [True],
    'bootstrap_features': [True],
    'max_features': [0.75, 0.80, 0.85], 
    'max_samples': [0.95, 1.0], 
    'n_estimators': [30, 40, 50, 60, 70],
    'estimator': [DecisionTreeRegressor(max_depth=None, min_samples_split=2, min_samples_leaf=1)],
    'oob_score': [True],
    'warm_start': [False]
}

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(bagging_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=4, verbose=0)

# Fit Grid Search to the training data
grid_search.fit(X_train_full, y_train_full)

# Apply best parameters to Bagging Regressor
bagging_reg = BaggingRegressor(**grid_search.best_params_, random_state=42)
bagging_reg.fit(X_train_full, y_train_full)

# Generate predictions from test features
y_pred = bagging_reg.predict(X_test_full)

# Calculate model evaluation metrics
rmse_grid = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_grid = r2_score(y_test_full, y_pred)


#### Best Parameters and Performance from Second Grid Pass

In [193]:
print("Best Parameters:", grid_search.best_params_)
print(f"Test RMSE: {rmse_grid:.4f}")
print(f"Test R-Squared: {r2_grid:.4f}")

Best Parameters: {'bootstrap': True, 'bootstrap_features': True, 'estimator': DecisionTreeRegressor(), 'max_features': 0.85, 'max_samples': 1.0, 'n_estimators': 30, 'oob_score': True, 'warm_start': False}
Test RMSE: 0.4439
Test R-Squared: 0.8165


#### Check Differences in Metrics from the Default Settings

In [194]:
diff_rmse = rmse_default - rmse_grid
diff_r2 =  r2_grid - r2_default

print(f"RMSE Difference: {diff_rmse:.6f}")
print(f"R-Squared Difference: {diff_r2:.6f}")

#Note: A positive value indicates that Grid Search has improved model performance.

RMSE Difference: -0.011763
R-Squared Difference: -0.009594


#### Third Pass Grid Search for Bagging Regressor (30 Combinations)

In [195]:
# Define parameter grid options
param_grid = {
    'bootstrap': [True],
    'bootstrap_features': [True],
    'max_features': [0.78, 0.80, 0.82], 
    'max_samples': [0.95, 1.0], 
    'n_estimators': [45, 47, 50, 53, 55],
    'estimator': [DecisionTreeRegressor(max_depth=None, min_samples_split=2, min_samples_leaf=1)],
    'oob_score': [True],
    'warm_start': [False]
}

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(bagging_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=4, verbose=0)

# Fit Grid Search to the training data
grid_search.fit(X_train_full, y_train_full)

# Apply best parameters to Bagging Regressor
bagging_reg = BaggingRegressor(**grid_search.best_params_, random_state=42)
bagging_reg.fit(X_train_full, y_train_full)

# Generate predictions from test features
y_pred = bagging_reg.predict(X_test_full)

# Calculate model evaluation metrics
rmse_grid = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_grid = r2_score(y_test_full, y_pred)

#### Best Parameters and Performance from Third Grid Pass

In [196]:
print("Best Parameters:", grid_search.best_params_)
print(f"Test RMSE: {rmse_grid:.4f}")
print(f"Test R-Squared: {r2_grid:.4f}")

Best Parameters: {'bootstrap': True, 'bootstrap_features': True, 'estimator': DecisionTreeRegressor(), 'max_features': 0.8, 'max_samples': 1.0, 'n_estimators': 53, 'oob_score': True, 'warm_start': False}
Test RMSE: 0.4326
Test R-Squared: 0.8258


#### Check Differences in Metrics from the Default Settings

In [197]:
diff_rmse = rmse_default - rmse_grid
diff_r2 =  r2_grid - r2_default

print(f"RMSE Difference: {diff_rmse:.6f}")
print(f"R-Squared Difference: {diff_r2:.6f}")

#Note: A positive value indicates that Grid Search has improved model performance.

RMSE Difference: -0.000445
R-Squared Difference: -0.000359


#### Compare Validation RMSE to Test Set RMSE for Model Fit Assessment 


Note: Grid Search returns negative mean squared error as its scoring metric.

In [198]:
print(f"The average validation RMSE is {np.sqrt(abs(grid_search.best_score_)):.4f} units and the test RMSE is {rmse_grid:.4f} units." )

The average validation RMSE is 0.4643 units and the test RMSE is 0.4326 units.


#### Despite a slight decrease of 0.000359 in the test R-squared value relative to the default model (and an increase of 0.000445 in test RMSE), cross-validation results indicate strong model performance. The small difference between the test RMSE (0.4326) and the average validation RMSE (0.4643) suggests that the model generalizes well.

#### We will use the final Grid Search parameters for our Bagging Regressor model. Although the trees have no maximum depth, the small gap between validation and test RMSE suggests that there is no significant overfitting.
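
Note: The validation figure printed above is the square root of the mean fold MSE, which is slightly different from the mean of the per-fold RMSE values. The sketch below illustrates the conversion and the difference using hypothetical fold scores, then expresses the validation/test gap reported above as a percentage. The helper name `rmse_from_neg_mse` and the fold values are assumptions made for illustration.

```python
import math

def rmse_from_neg_mse(neg_mse_score):
    """Convert a 'neg_mean_squared_error' score into an RMSE."""
    return math.sqrt(-neg_mse_score)

# Hypothetical per-fold MSE values, chosen only for illustration.
fold_mse = [0.18, 0.20, 0.22, 0.24, 0.25]

# best_score_ is the mean fold score, i.e. the negated mean MSE.
best_score = -sum(fold_mse) / len(fold_mse)
rmse_of_mean_mse = rmse_from_neg_mse(best_score)
mean_of_fold_rmse = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)

print(f"sqrt(mean fold MSE): {rmse_of_mean_mse:.4f}")
print(f"mean of fold RMSEs:  {mean_of_fold_rmse:.4f}")

# Relative gap between the validation and test RMSE reported above.
validation_rmse, test_rmse = 0.4643, 0.4326
relative_gap = (validation_rmse - test_rmse) / test_rmse
print(f"Validation RMSE exceeds test RMSE by {relative_gap:.1%}")
```

The square root of the mean MSE is never smaller than the mean of the fold RMSEs, so the printed validation value is, if anything, a slightly pessimistic summary of the folds.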

#### Fit Bagging Regressor Model

In [199]:
bagging_reg = BaggingRegressor(**grid_search.best_params_, random_state=42)
bagging_reg.fit(X_train_full, y_train_full)

#### Store Performance Metrics in a DataFrame (for Comparison of All Models at the End)

In [200]:
model_name = "Bagging Regressor"
results1 = pd.DataFrame([[model_name, round(rmse_grid,3), round(r2_grid,3)]], columns=["Model", "RMSE", "R-Squared"])
results1

Unnamed: 0,Model,RMSE,R-Squared
0,Bagging Regressor,0.433,0.826


### Optimize Random Forest Regressor

#### Retrieve Default Parameters for Grid Search Optimization Baseline

In [201]:
rf_reg = RandomForestRegressor(random_state= 42)
print(rf_reg.get_params())

{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}


#### Check RMSE and R-Squared Values on Default Parameters

In [202]:
rf_reg.fit(X_train_full, y_train_full)
y_pred = rf_reg.predict(X_test_full)

rmse_default = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_default = r2_score(y_test_full, y_pred)
print(f"Test RMSE: {rmse_default:.4f}")
print(f"Test R-Squared: {r2_default:.4f}")

Test RMSE: 0.4391
Test R-Squared: 0.8205


#### First Pass Grid Search for Random Forest Regressor (576 Combinations)

In [203]:
# Define parameter grid options
param_grid = {
    'bootstrap': [True, False], 
    'ccp_alpha': [0.0], 
    'criterion': ['squared_error'],
    'max_depth': [None, 10, 20],
    'max_features': [0.8, 1.0], 
    'max_leaf_nodes': [None],
    'max_samples': [None],
    'min_impurity_decrease': [0.0, 0.01],
    'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [2, 5],
    'min_weight_fraction_leaf': [0.0],
    'monotonic_cst': [None], 
    'n_estimators': [50, 100],
    'oob_score': [False, True],
    'verbose': [0],
    'warm_start': [False]
}


# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(rf_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=4, verbose=0)

# Fit Grid Search to the training data
grid_search.fit(X_train_full, y_train_full)

# Apply best parameters to Random Forest Regressor
rf_reg = RandomForestRegressor(**grid_search.best_params_, random_state=42)
rf_reg.fit(X_train_full, y_train_full)

# Generate predictions from test features
y_pred = rf_reg.predict(X_test_full)

# Calculate model evaluation metrics
rmse_grid = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_grid = r2_score(y_test_full, y_pred)

720 fits failed out of a total of 2880.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
720 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/danigeiger/opt/anaconda3/envs/Module_2_670/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/danigeiger/opt/anaconda3/envs/Module_2_670/lib/python3.9/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/Users/danigeiger/opt/anaconda3/envs/Module_2_670/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 448, in fit
    raise ValueError("Out of bag estimation only available if bootst

#### Best Parameters and Performance from First Grid Pass

In [204]:
print("Best Parameters:", grid_search.best_params_)
print(f"Test RMSE: {rmse_grid:.4f}")
print(f"Test R-Squared: {r2_grid:.4f}")

Best Parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 0.8, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 5, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 100, 'oob_score': False, 'verbose': 0, 'warm_start': False}
Test RMSE: 0.4380
Test R-Squared: 0.8214


#### Check Differences in Metrics from the Default Settings

In [205]:
diff_rmse = rmse_default - rmse_grid
diff_r2 =  r2_grid - r2_default

print(f"RMSE Difference: {diff_rmse:.6f}")
print(f"R-Squared Difference: {diff_r2:.6f}")

#Note: A positive value indicates that Grid Search has improved model performance.

RMSE Difference: 0.001142
R-Squared Difference: 0.000932


#### Second Pass Grid Search for Random Forest Regressor (1,728 Combinations)

In [206]:
# Define parameter grid options
param_grid = {
    'bootstrap': [True], 
    'ccp_alpha': [0.0], 
    'criterion': ['squared_error'],
    'max_depth': [None, 30, 50],
    'max_features': [0.7, 0.8, 0.9, 1.0], 
    'max_leaf_nodes': [None],
    'max_samples': [0.7, 0.8, 0.9, 1.0],
    'min_impurity_decrease': [0.0],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 5],
    'min_weight_fraction_leaf': [0.0],
    'monotonic_cst': [None], 
    'n_estimators': [100, 200, 300],
    'oob_score': [False, True],
    'verbose': [0],
    'warm_start': [False]
}


# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(rf_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=4, verbose=0)

# Fit Grid Search to the training data
grid_search.fit(X_train_full, y_train_full)

# Apply best parameters to Random Forest Regressor
rf_reg = RandomForestRegressor(**grid_search.best_params_, random_state=42)
rf_reg.fit(X_train_full, y_train_full)

# Generate predictions from test features
y_pred = rf_reg.predict(X_test_full)

# Calculate model evaluation metrics
rmse_grid = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_grid = r2_score(y_test_full, y_pred)

#### Best Parameters and Performance from Second Grid Pass

In [207]:
print("Best Parameters:", grid_search.best_params_)
print(f"Test RMSE: {rmse_grid:.4f}")
print(f"Test R-Squared: {r2_grid:.4f}")

Best Parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 0.7, 'max_leaf_nodes': None, 'max_samples': 1.0, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 5, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 200, 'oob_score': False, 'verbose': 0, 'warm_start': False}
Test RMSE: 0.4393
Test R-Squared: 0.8203


#### Check Differences in Metrics from the Default Settings

In [208]:
diff_rmse = rmse_default - rmse_grid
diff_r2 =  r2_grid - r2_default

print(f"RMSE Difference: {diff_rmse:.6f}")
print(f"R-Squared Difference: {diff_r2:.6f}")

#Note: A positive value indicates that Grid Search has improved model performance.

RMSE Difference: -0.000229
R-Squared Difference: -0.000187


#### Third Pass Grid Search for Random Forest Regressor (240 Combinations)

In [209]:
# Define parameter grid options
param_grid = {
    'bootstrap': [True], 
    'ccp_alpha': [0.0], 
    'criterion': ['squared_error'],
    'max_depth': [None],
    'max_features': [0.65, 0.7, 0.75, 0.8], 
    'max_leaf_nodes': [None],
    'max_samples': [0.95, 1.0],
    'min_impurity_decrease': [0.0],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [5],
    'min_weight_fraction_leaf': [0.0],
    'monotonic_cst': [None], 
    'n_estimators': [150, 175, 200, 225, 250],
    'oob_score': [False, True],
    'verbose': [0],
    'warm_start': [False]
}


# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(rf_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=4, verbose=0)

# Fit Grid Search to the training data
grid_search.fit(X_train_full, y_train_full)

# Apply best parameters to Random Forest Regressor
rf_reg = RandomForestRegressor(**grid_search.best_params_, random_state=42)
rf_reg.fit(X_train_full, y_train_full)

# Generate predictions from test features
y_pred = rf_reg.predict(X_test_full)

# Calculate model evaluation metrics
rmse_grid = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_grid = r2_score(y_test_full, y_pred)

#### Best Parameters and Performance from Third Grid Pass

In [210]:
print("Best Parameters:", grid_search.best_params_)
print(f"Test RMSE: {rmse_grid:.4f}")
print(f"Test R-Squared: {r2_grid:.4f}")

Best Parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 0.65, 'max_leaf_nodes': None, 'max_samples': 1.0, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 5, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 175, 'oob_score': False, 'verbose': 0, 'warm_start': False}
Test RMSE: 0.4454
Test R-Squared: 0.8153


#### Check Differences in Metrics from the Default Settings

In [211]:
diff_rmse = rmse_default - rmse_grid
diff_r2 =  r2_grid - r2_default

print(f"RMSE Difference: {diff_rmse:.6f}")
print(f"R-Squared Difference: {diff_r2:.6f}")

#Note: A positive value indicates that Grid Search has improved model performance.

RMSE Difference: -0.006341
R-Squared Difference: -0.005222


#### Compare Validation RMSE to Test Set RMSE for Model Fit Assessment 

In [212]:
print(f"The average validation RMSE is {np.sqrt(abs(grid_search.best_score_)):.4f} units and the test RMSE is {rmse_grid:.4f} units." )

The average validation RMSE is 0.4683 units and the test RMSE is 0.4454 units.


#### Despite a slight decrease of 0.005222 in the test R-squared value relative to the default model, cross-validation results indicate strong model performance. The small difference between the test RMSE (0.4454) and the average validation RMSE (0.4683) suggests that the model generalizes well.

#### We will use the final Grid Search parameters for our Random Forest Regressor model. Although several of the selected parameters carry a risk of overfitting (i.e., no maximum tree depth, no pruning, and no minimum impurity decrease), the small gap between validation and test RMSE suggests that there is no significant overfitting.
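
Note: A complementary overfitting check is to compare the score on the training data with the score on held-out data, since validation and test RMSE are both held-out estimates. The sketch below shows that comparison on synthetic data generated with `make_regression`, not on the notebook's fingerprints; the sample sizes, noise level, and number of trees are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook's fingerprints are not used here.
X, y = make_regression(n_samples=300, n_features=20, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Unrestricted trees, mirroring max_depth=None in the selected parameters.
model = RandomForestRegressor(n_estimators=50, max_depth=None, random_state=42)
model.fit(X_train, y_train)

train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

print(f"Train R-squared: {train_r2:.3f}")
print(f"Test R-squared:  {test_r2:.3f}")
print(f"Gap:             {train_r2 - test_r2:.3f}")
```

A large train/test gap would point to overfitting even when validation and test errors agree with each other. Within a grid search, passing `return_train_score=True` to `GridSearchCV` records the training scores for each fold so the same comparison can be made from `cv_results_`.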

#### Fit Random Forest Regressor Model

In [213]:
# Fit Random Forest Regressor using best parameters
rf_reg = RandomForestRegressor(**grid_search.best_params_, random_state=42)
rf_reg.fit(X_train_full, y_train_full)

#### Store Performance Metrics in a DataFrame (for Comparison of All Models at the End)

In [214]:
model_name = "Random Forest Regressor"
results2 = pd.DataFrame([[model_name, round(rmse_grid,3), round(r2_grid,3)]], columns=["Model", "RMSE", "R-Squared"])
results2

Unnamed: 0,Model,RMSE,R-Squared
0,Random Forest Regressor,0.445,0.815


### Optimize Gradient Boosting Regressor

#### Retrieve Default Parameters for Grid Search Optimization Baseline

In [215]:
gb_reg = GradientBoostingRegressor(random_state= 42)
print(gb_reg.get_params())

{'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'squared_error', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_iter_no_change': None, 'random_state': 42, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}


#### Check RMSE and R-Squared Values on Default Parameters

In [216]:
gb_reg.fit(X_train_full, y_train_full)
y_pred = gb_reg.predict(X_test_full)

rmse_default = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_default = r2_score(y_test_full, y_pred)
print(f"Test RMSE: {rmse_default:.4f}")
print(f"Test R-Squared: {r2_default:.4f}")

Test RMSE: 0.4471
Test R-Squared: 0.8139


#### First Pass Grid Search for Gradient Boosting Regressor (5,184 Combinations)

In [217]:
param_grid = {
    'alpha': [0.9], 
    'ccp_alpha': [0.0],  
    'criterion': ['friedman_mse'],  
    'init': [None],  
    'learning_rate': [0.01, 0.05, 0.1, 0.2],  
    'loss': ['squared_error'],  
    'max_depth': [3, 5, 7],  
    'max_features': ['sqrt', 'log2', None],  
    'max_leaf_nodes': [None],  
    'min_impurity_decrease': [0.0],  
    'min_samples_leaf': [1, 3, 5],  
    'min_samples_split': [2, 5, 7, 10],  
    'min_weight_fraction_leaf': [0.0],  
    'n_estimators': [50, 100, 200, 250],  
    'n_iter_no_change': [None],  
    'subsample': [0.8, 0.9, 1.0], 
    'tol': [0.0001],  
    'validation_fraction': [0.1],  
    'verbose': [0],  
    'warm_start': [False]
}

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(gb_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=4, verbose=0)

# Fit Grid Search to the training data
grid_search.fit(X_train_full, y_train_full)

# Apply best parameters to Gradient Boosting Regressor
gb_reg = GradientBoostingRegressor(**grid_search.best_params_, random_state=42)
gb_reg.fit(X_train_full, y_train_full)

# Generate predictions from test features
y_pred = gb_reg.predict(X_test_full)

# Calculate model evaluation metrics
rmse_grid = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_grid = r2_score(y_test_full, y_pred)
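
Note: The combination count in the heading above follows from multiplying the number of options for each tuned parameter, and the number of model fits is that count times the number of cross-validation folds. A quick arithmetic check, with the option counts transcribed by hand from the grid in the cell above:

```python
from math import prod

# Number of options per tuned parameter in the first-pass grid;
# parameters fixed to a single value contribute a factor of 1 and are omitted.
options_per_parameter = {
    'learning_rate': 4,
    'max_depth': 3,
    'max_features': 3,
    'min_samples_leaf': 3,
    'min_samples_split': 4,
    'n_estimators': 4,
    'subsample': 3,
}

N_FOLDS = 5  # cv=5 in the GridSearchCV call

n_combinations = prod(options_per_parameter.values())
n_fits = n_combinations * N_FOLDS

print(f"Combinations: {n_combinations:,}")
print(f"Model fits:   {n_fits:,}")
```

Since each extra option multiplies the total, trimming a single parameter from four options to three would cut this search by a quarter.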

#### Best Parameters and Performance from First Grid Pass

In [218]:
print("Best Parameters:", grid_search.best_params_)
print(f"Test RMSE: {rmse_grid:.4f}")
print(f"Test R-Squared: {r2_grid:.4f}")

Best Parameters: {'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.2, 'loss': 'squared_error', 'max_depth': 3, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 5, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 200, 'n_iter_no_change': None, 'subsample': 0.8, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
Test RMSE: 0.4461
Test R-Squared: 0.8147


#### Check Differences in Metrics from the Default Settings

In [219]:
diff_rmse = rmse_default - rmse_grid
diff_r2 =  r2_grid - r2_default

print(f"RMSE Difference: {diff_rmse:.6f}")
print(f"R-Squared Difference: {diff_r2:.6f}")

#Note: A positive value indicates that Grid Search has improved model performance.

RMSE Difference: 0.000989
R-Squared Difference: 0.000823


#### Second Pass Grid Search for Gradient Boosting Regressor (2,160 Combinations)

In [220]:
param_grid = {
    'alpha': [0.9], 
    'ccp_alpha': [0.0],  
    'criterion': ['friedman_mse'],  
    'init': [None],  
    'learning_rate': [0.1, 0.15, 0.2, 0.3],  
    'loss': ['squared_error'],  
    'max_depth': [2, 3, 4, 5], 
    'max_features': ['sqrt'],  
    'max_leaf_nodes': [None],  
    'min_impurity_decrease': [0.0],  
    'min_samples_leaf': [4, 5, 6],  
    'min_samples_split': [2, 3, 4],  
    'min_weight_fraction_leaf': [0.0],  
    'n_estimators': [150, 175, 200, 225, 250],  
    'n_iter_no_change': [None],  
    'subsample': [0.75, 0.8, 0.85], 
    'tol': [0.0001],  
    'validation_fraction': [0.1],  
    'verbose': [0],  
    'warm_start': [False]
}

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(gb_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=4, verbose=0)

# Fit Grid Search to the training data
grid_search.fit(X_train_full, y_train_full)

# Apply best parameters to Gradient Boosting Regressor
gb_reg = GradientBoostingRegressor(**grid_search.best_params_, random_state=42)
gb_reg.fit(X_train_full, y_train_full)

# Generate predictions from test features
y_pred = gb_reg.predict(X_test_full)

# Calculate model evaluation metrics
rmse_grid = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_grid = r2_score(y_test_full, y_pred)

#### Best Parameters and Performance from Second Grid Pass

In [221]:
print("Best Parameters:", grid_search.best_params_)
print(f"Test RMSE: {rmse_grid:.4f}")
print(f"Test R-Squared: {r2_grid:.4f}")

Best Parameters: {'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.2, 'loss': 'squared_error', 'max_depth': 3, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 5, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 200, 'n_iter_no_change': None, 'subsample': 0.8, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
Test RMSE: 0.4461
Test R-Squared: 0.8147


#### Check Differences in Metrics from the Default Settings

In [222]:
diff_rmse = rmse_default - rmse_grid
diff_r2 =  r2_grid - r2_default

print(f"RMSE Difference: {diff_rmse:.6f}")
print(f"R-Squared Difference: {diff_r2:.6f}")

#Note: A positive value indicates that Grid Search has improved model performance.

RMSE Difference: 0.000989
R-Squared Difference: 0.000823


#### Third Pass Grid Search for Gradient Boosting Regressor (360 Combinations)

In [223]:
param_grid = {
    'alpha': [0.9], 
    'ccp_alpha': [0.0, 0.0001, 0.001, 0.01],  
    'criterion': ['friedman_mse'],  
    'init': [None],  
    'learning_rate': [0.15, 0.18, 0.2, 0.22, 0.25],  
    'loss': ['squared_error'],  
    'max_depth': [3], 
    'max_features': ['sqrt', None],  
    'max_leaf_nodes': [None],  
    'min_impurity_decrease': [0.0],  
    'min_samples_leaf': [5],  
    'min_samples_split': [2],  
    'min_weight_fraction_leaf': [0.0],  
    'n_estimators': [190, 200, 210],  
    'n_iter_no_change': [None],  
    'subsample': [0.79, 0.8, 0.81], 
    'tol': [0.0001],  
    'validation_fraction': [0.1],  
    'verbose': [0],  
    'warm_start': [False]
}

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(gb_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=4, verbose=0)

# Fit Grid Search to the training data
grid_search.fit(X_train_full, y_train_full)

# Apply best parameters to Gradient Boosting Regressor
gb_reg = GradientBoostingRegressor(**grid_search.best_params_, random_state=42)
gb_reg.fit(X_train_full, y_train_full)

# Generate predictions from test features
y_pred = gb_reg.predict(X_test_full)

# Calculate model evaluation metrics
rmse_grid = np.sqrt(mean_squared_error(y_test_full, y_pred))
r2_grid = r2_score(y_test_full, y_pred)
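
The combination count in the heading for this pass can be checked by multiplying the number of candidate values per parameter. The sketch below is a minimal, standard-library check; the dictionary name `varied` is introduced here for illustration and lists only the parameters that have more than one candidate value.

```python
from math import prod

# Only parameters with more than one candidate value are listed;
# every single-valued entry in param_grid multiplies the count by 1.
varied = {
    "ccp_alpha": [0.0, 0.0001, 0.001, 0.01],
    "learning_rate": [0.15, 0.18, 0.2, 0.22, 0.25],
    "max_features": ["sqrt", None],
    "n_estimators": [190, 200, 210],
    "subsample": [0.79, 0.8, 0.81],
}

# Number of distinct parameter combinations in the grid
n_combinations = prod(len(values) for values in varied.values())

# Each combination is fit once per fold (cv=5 in the search above)
n_cv_fits = n_combinations * 5

print(f"{n_combinations} combinations, {n_cv_fits} cross-validation fits")
```

Because `GridSearchCV` refits the best combination on the full training set by default (`refit=True`), one additional fit follows the cross-validation fits.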

#### Best Parameters and Performance from Third Grid Pass

In [224]:
print("Best Parameters:", grid_search.best_params_)
print(f"Test RMSE: {rmse_grid:.4f}")
print(f"Test R-Squared: {r2_grid:.4f}")

Best Parameters: {'alpha': 0.9, 'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.2, 'loss': 'squared_error', 'max_depth': 3, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 5, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 210, 'n_iter_no_change': None, 'subsample': 0.8, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
Test RMSE: 0.4475
Test R-Squared: 0.8135


#### Check Differences in Metrics from the Default Settings

In [225]:
diff_rmse = rmse_default - rmse_grid
diff_r2 =  r2_grid - r2_default

print(f"RMSE Difference: {diff_rmse:.6f}")
print(f"R-Squared Difference: {diff_r2:.6f}")

# Note: A positive value indicates that Grid Search has improved model performance.

RMSE Difference: -0.000388
R-Squared Difference: -0.000323


#### Compare Validation RMSE to Test Set RMSE for Model Fit Assessment 


In [226]:
print(f"The average validation RMSE is {np.sqrt(abs(grid_search.best_score_)):.4f} units and the test RMSE is {rmse_grid:.4f} units." )

The average validation RMSE is 0.4433 units and the test RMSE is 0.4475 units.
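
A note on the conversion used above: with `scoring='neg_mean_squared_error'`, `best_score_` is the negated mean of the per-fold MSE values, so taking the square root of its absolute value gives the root of the mean MSE. This is close to, but not the same as, the mean of the per-fold RMSE values. The sketch below illustrates the difference with hypothetical fold scores that are not taken from this notebook.

```python
from math import sqrt
from statistics import mean

# Hypothetical per-fold MSE values for one parameter combination (cv=5)
fold_mse = [0.18, 0.21, 0.19, 0.22, 0.20]

# What GridSearchCV would report as the score: the negated mean MSE
best_score = -mean(fold_mse)

# Same conversion as in the cell above: root of the mean MSE
rmse_from_score = sqrt(abs(best_score))

# A different quantity: the mean of the per-fold RMSE values
mean_fold_rmse = mean(sqrt(m) for m in fold_mse)

print(f"Root of mean MSE:  {rmse_from_score:.4f}")
print(f"Mean of fold RMSE: {mean_fold_rmse:.4f}")
```

The root of the mean MSE is never smaller than the mean of the fold RMSE values, so the figure reported above is, if anything, a slightly conservative estimate of the validation error.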


#### Despite a slight decrease of 0.000323 in the test R-squared relative to the default settings, cross-validation results indicate strong model performance. The small difference between the test RMSE (0.4475) and the average validation RMSE (0.4433) suggests that the model generalizes well.

#### We will use the final Grid Search parameters for our Gradient Boosting Regressor model. Although some settings could make this model prone to overfitting (i.e., no pruning with ccp_alpha = 0.0, and min_samples_split = 2), the minimal difference between validation and test RMSE suggests that there is no significant overfitting.
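
To put the validation and test figures on a common scale, the gap can also be expressed relative to the validation RMSE. The sketch below uses the two values reported above; the variable names are introduced here for illustration.

```python
val_rmse = 0.4433   # average validation RMSE reported above
test_rmse = 0.4475  # test RMSE reported above

# Absolute and relative difference between test and validation error
abs_gap = test_rmse - val_rmse
rel_gap = abs_gap / val_rmse

print(f"Absolute gap: {abs_gap:.4f}, relative gap: {rel_gap:.2%}")
```

The test error exceeds the validation error by just under 1%, which is consistent with the reading that the model is not substantially overfit.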

#### Fit Gradient Boosting Regressor Model

In [227]:
# Fit Gradient Boosting Regressor using best parameters
gb_reg = GradientBoostingRegressor(**grid_search.best_params_, random_state=42)
gb_reg.fit(X_train_full, y_train_full)


#### Store Performance Metrics in a DataFrame for Comparison of All Models at the End

In [228]:
model_name = "Gradient Boosting Regressor"
results3 = pd.DataFrame([[model_name, round(rmse_grid,3), round(r2_grid,3)]], columns=["Model", "RMSE", "R-Squared"])
results3

Unnamed: 0,Model,RMSE,R-Squared
0,Gradient Boosting Regressor,0.448,0.814


## Part 5: Choosing a Single Model for Our User Interface

In [229]:
final_models = pd.concat([results1, results2, results3], axis=0)
final_models

Unnamed: 0,Model,RMSE,R-Squared
0,Bagging Regressor,0.433,0.826
0,Random Forest Regressor,0.445,0.815
0,Gradient Boosting Regressor,0.448,0.814
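
In the table above every row carries the index label 0, because `pd.concat` keeps each source frame's own index. A minimal sketch of one way to get a clean, ranked table is shown below; the three frames are rebuilt from the reported metrics, and sorting by RMSE is an assumption about the desired ordering.

```python
import pandas as pd

cols = ["Model", "RMSE", "R-Squared"]

# Rebuilt from the metrics reported above
results1 = pd.DataFrame([["Bagging Regressor", 0.433, 0.826]], columns=cols)
results2 = pd.DataFrame([["Random Forest Regressor", 0.445, 0.815]], columns=cols)
results3 = pd.DataFrame([["Gradient Boosting Regressor", 0.448, 0.814]], columns=cols)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0
final_models = pd.concat([results1, results2, results3], ignore_index=True)

# Rank from lowest to highest RMSE and renumber again after sorting
final_models = final_models.sort_values("RMSE").reset_index(drop=True)

print(final_models)
```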



#### Model Selection

While the Bagging Regressor achieved the lowest RMSE and highest R-squared, its performance was very similar to that of the Random Forest Regressor and the Gradient Boosting Regressor. To narrow the choice, I first eliminated the Gradient Boosting Regressor: because each boosted tree is fit to the errors of the trees before it, boosting can be more prone to overfitting, whereas this model needs to generalize well to novel molecules.
The goal is a model that can identify potential therapeutic agents beyond the historical structural patterns of gepants, so I prioritized generalization over a marginal gain in test-set metrics.

The Bagging Regressor and Random Forest Regressor models share many hyperparameter settings, including:

- Bootstrap sampling to create multiple training subsets
- No pruning (ccp_alpha=0)
- Squared error as the splitting criterion
- No maximum tree depth 
- No maximum number of leaf nodes 
- No minimum impurity decrease 
- No weight fraction constraints 
- No out-of-bag validation
- No warm-start 


Given these similarities, the main distinction lies in feature sampling and node-splitting constraints. The Bagging Regressor uses 100% of the features in every tree and allows a node to be split with as few as two samples. While this led to strong test performance, the concern is that the model may be memorizing patterns in the training data, making it more prone to overfitting.

The Random Forest Regressor, on the other hand, considers only a random 65% of the features when searching for a split and requires at least five samples to split a node. This introduces randomness and prevents over-reliance on any single feature, making the model more likely to generalize well to unseen data. Averaging the predictions of these less-correlated trees allows the model to capture meaningful patterns while avoiding overfitting to specific molecular features.
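
The contrast between the two configurations can be sketched as follows. This is an illustration under stated assumptions, not the notebook's fitted models: the data are synthetic (`make_regression` stands in for the fingerprint matrix), `n_estimators=50` is arbitrary, and only the 65% feature fraction and the minimum split sizes come from the discussion above. In scikit-learn, `RandomForestRegressor` applies `max_features` at each split, while `BaggingRegressor` draws its feature subset once per estimator.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# Synthetic data standing in for the fingerprint matrix (assumption)
X, y = make_regression(n_samples=200, n_features=20, noise=0.1, random_state=42)

# Bagging: every tree sees all features; the default base tree
# can split a node with as few as 2 samples
bagging = BaggingRegressor(n_estimators=50, max_features=1.0, random_state=42)

# Random forest: 65% of features considered per split,
# and at least 5 samples required to split a node
forest = RandomForestRegressor(
    n_estimators=50, max_features=0.65, min_samples_split=5, random_state=42
)

for name, model in [("bagging", bagging), ("forest", forest)]:
    model.fit(X, y)
    # Average number of leaves per tree as a rough measure of tree complexity
    leaves = [tree.get_n_leaves() for tree in model.estimators_]
    print(f"{name}: mean leaves per tree = {sum(leaves) / len(leaves):.1f}")
```

Comparing the average leaf counts gives a rough sense of how much the larger minimum split size constrains each tree.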



### Export All Models as .joblib Files

In [230]:
joblib.dump(bagging_reg, 'bagging_reg.joblib')  # save the fitted Bagging Regressor
joblib.dump(rf_reg, 'rf_reg.joblib') 
joblib.dump(gb_reg, 'gb_reg.joblib') 

['gb_reg.joblib']
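
Before relying on an exported file in the user interface, it can be worth confirming that a reloaded model reproduces the original predictions. The sketch below shows one way to do that round trip. It uses a small synthetic model and a temporary directory as stand-ins for `rf_reg` and the working directory (both assumptions), so it does not touch the files written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for rf_reg (assumption)
rng = np.random.default_rng(42)
X = rng.random((50, 4))
y = X.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Write the model to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rf_reg.joblib")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions
same = np.allclose(model.predict(X), reloaded.predict(X))
print("Predictions match after reload:", same)
```

Persisted models are best loaded with the same scikit-learn version that was used to save them, so it helps to record that version alongside the exported files.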