<a href="https://colab.research.google.com/github/bnsreenu/python_for_microscopists/blob/master/317_HyperParameter_Optimization_using_Genetic_algo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://youtu.be/51sdLTNP1O8

**Hyperparameter optimization using metaheuristic algorithms such as the Genetic Algorithm**

In this example, we will use the same dataset (steel alloy strength) from the previous tutorial to fit and tune a Random Forest Regressor. <br>
The dataset can be downloaded from here: https://www.kaggle.com/datasets/fuarresvij/steel-test-data

<p>
The data set contains the elemental composition of different alloys and their respective yield and tensile strengths. A machine learning model can be trained on this data, allowing us to predict the strength of an alloy based on its chemical composition.


In [None]:
#Read the csv file and capture data into a pandas dataframe
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/ColabNotebooks/data/steel_strength.csv")

#Understand the data

In [None]:
df.head()

Unnamed: 0,formula,c,mn,si,cr,ni,mo,v,n,nb,co,w,al,ti,yield strength,tensile strength,elongation
0,Fe0.620C0.000953Mn0.000521Si0.00102Cr0.000110N...,0.02,0.05,0.05,0.01,19.7,2.95,0.01,0.0,0.01,15.0,0.0,0.15,1.55,2411.5,2473.5,7.0
1,Fe0.623C0.00854Mn0.000104Si0.000203Cr0.147Ni0....,0.18,0.01,0.01,13.44,0.01,3.01,0.46,0.04,0.01,19.46,2.35,0.04,0.0,1123.1,1929.2,8.0
2,Fe0.625Mn0.000102Si0.000200Cr0.0936Ni0.129Mo0....,0.0,0.01,0.01,8.67,13.45,0.82,0.01,0.0,0.01,13.9,0.0,0.39,0.57,1736.3,1871.8,
3,Fe0.634C0.000478Mn0.000523Si0.00102Cr0.000111N...,0.01,0.05,0.05,0.01,17.7,3.95,0.01,0.0,0.01,15.0,0.0,0.13,1.47,2487.3,2514.9,9.0
4,Fe0.636C0.000474Mn0.000518Si0.00101Cr0.000109N...,0.01,0.05,0.05,0.01,19.4,1.45,0.01,0.0,0.01,14.9,0.0,0.13,1.55,2249.6,2315.0,8.5


In [None]:
#Check if there is any null data. There are a few missing from the elongation column but we will not use it for our exercise.
df.isna().sum()

formula             0
c                   0
mn                  0
si                  0
cr                  0
ni                  0
mo                  0
v                   0
n                   0
nb                  0
co                  0
w                   0
al                  0
ti                  0
yield strength      0
tensile strength    0
elongation          9
dtype: int64

In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
c,312.0,0.096442,0.109008,0.0,0.01,0.03,0.1825,0.43
mn,312.0,0.14625,0.397102,0.01,0.01,0.01,0.08,3.0
si,312.0,0.221218,0.580796,0.01,0.01,0.01,0.11,4.75
cr,312.0,8.04383,5.426169,0.01,3.1,9.05,12.52,17.5
ni,312.0,8.184006,6.337055,0.01,0.96,8.5,12.1175,21.0
mo,312.0,2.76609,1.832908,0.02,1.5,2.21,4.09,9.67
v,312.0,0.18375,0.452462,0.0,0.01,0.01,0.1275,4.32
n,312.0,0.005545,0.018331,0.0,0.0,0.0,0.0,0.15
nb,312.0,0.035449,0.161537,0.0,0.01,0.01,0.01,2.5
co,312.0,7.008782,6.254431,0.01,0.01,7.085,13.48,20.1


Assign all chemical composition columns to X

In [None]:
X = df.drop(columns=["formula", "elongation", "tensile strength", "yield strength"])

In [None]:
X.head()

Unnamed: 0,c,mn,si,cr,ni,mo,v,n,nb,co,w,al,ti
0,0.02,0.05,0.05,0.01,19.7,2.95,0.01,0.0,0.01,15.0,0.0,0.15,1.55
1,0.18,0.01,0.01,13.44,0.01,3.01,0.46,0.04,0.01,19.46,2.35,0.04,0.0
2,0.0,0.01,0.01,8.67,13.45,0.82,0.01,0.0,0.01,13.9,0.0,0.39,0.57
3,0.01,0.05,0.05,0.01,17.7,3.95,0.01,0.0,0.01,15.0,0.0,0.13,1.47
4,0.01,0.05,0.05,0.01,19.4,1.45,0.01,0.0,0.01,14.9,0.0,0.13,1.55


Assign the yield strength column to y

In [None]:
y = df['yield strength']

Split data into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
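As a quick sanity check on the split sizes, here is a minimal sketch that uses placeholder arrays with the same shape as the steel data (312 rows according to `df.describe()`, 13 composition columns). The names `X_demo`, `y_demo`, `X_tr` and `X_te` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays: same number of rows (312) and feature columns (13)
# as the steel data set; the values themselves are irrelevant here.
X_demo = np.zeros((312, 13))
y_demo = np.zeros(312)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# scikit-learn rounds the test share up: ceil(0.2 * 312) = 63 test rows.
print(X_tr.shape, X_te.shape)  # (249, 13) (63, 13)
```

With only 63 test rows, the test RMSE reported later can shift noticeably with a different `random_state`.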

Import the ML algorithm for regression. Here, we will import the Random Forest Regressor from scikit-learn. <br>
Please note that Colab comes with most of the required libraries preinstalled. If you are running the code locally, make sure to install them first.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
#Instantiate the model by defining the appropriate hyperparameters
model = RandomForestRegressor(n_estimators=10, max_depth=10, max_features='log2', min_samples_split=10, min_samples_leaf=4)

Are these the right hyperparameters? Well, this is the whole point of this exercise so we will soon find out.

In [None]:
#Fit the model to our training data
model.fit(X_train, y_train)

Test the model on our test data and check the RMSE values. We will later compare RMSE from this model to our optimized model.

In [None]:
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error
import numpy as np

rmse_original_model = np.sqrt(mean_squared_error(y_test, y_pred))

print("RMSE: ", rmse_original_model)

RMSE:  136.77210992979678
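RMSE is the square root of the mean squared difference between the true and predicted values, so it is expressed in the same units as the target (here, yield strength). The sketch below computes it by hand on three made-up values and compares the result with the scikit-learn route used above. The arrays are purely illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values, only to illustrate the arithmetic.
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([110.0, 190.0, 320.0])

# Errors are -10, 10 and -20, so the squared errors are 100, 100 and 400.
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(rmse_manual, rmse_sklearn)  # both are sqrt(200), about 14.14
```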


# Hyperparameter optimization using a Genetic Algorithm <br>
There are many Python libraries for this task, but we will use the TPOT library for this tutorial. <br>
https://pypi.org/project/TPOT/
<br>
Please note that there are many approaches to hyperparameter tuning, including Bayesian optimization and grid search. Each makes different trade-offs between search cost and thoroughness, and none is a clear winner for every problem.
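For comparison, here is a minimal sketch of one of the alternative approaches: a randomized search with scikit-learn's `RandomizedSearchCV`. It runs on synthetic data from `make_regression` rather than the steel data set, and the small grid is an assumption chosen to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, a stand-in for the steel data set.
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

# Sample 5 random combinations from a small, illustrative grid.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [10, 20, 40],
        "max_depth": [3, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # negative MSE, so closer to zero is better
```

A randomized search samples candidates independently, whereas a genetic algorithm uses the scores of earlier candidates to decide what to try next.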

In [None]:
!pip install TPOT

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting TPOT
  Downloading TPOT-0.11.7-py3-none-any.whl (87 kB)
Collecting deap>=1.2
  Downloading deap-1.3.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (139 kB)
Collecting stopit>=1.1.1
  Downloading stopit-1.1.2.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Collecting update-checker>=0.16
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Building wheels for collected packages: stopit
  Building wheel for stopit (setup.py) ... done
  Created wheel for stopit: filename=stopit-1.1.2-py3-none-any.whl size=11954 sha256=af010248ac6cd808e1e26b4f0d3736258e

In [None]:
from tpot import TPOTRegressor

In [None]:
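#Define the grid of hyperparameters that we'd like to search
#Note: max_features='auto' was deprecated in scikit-learn 1.1 and removed in 1.3.
#For regressors it meant "use all features", which is written as 1.0 here.
params_grid = {'n_estimators': [20, 60, 100, 150],
               'max_features': [1.0, 'sqrt', 'log2'],
               'max_depth': [10, 50, 100, 200],
               'min_samples_split': [2, 4, 8, 10],
               'min_samples_leaf': [1, 2, 4, 6]}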
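To get a feel for the size of this search space, the sketch below counts the combinations for a single Random Forest, using only the number of options per hyperparameter in the grid above. The variable names are illustrative.

```python
import math

# Number of candidate values per hyperparameter in the grid above.
grid_sizes = {
    "n_estimators": 4,
    "max_features": 3,
    "max_depth": 4,
    "min_samples_split": 4,
    "min_samples_leaf": 4,
}

n_combinations = math.prod(grid_sizes.values())
n_fits_exhaustive = n_combinations * 3  # with 3-fold cross-validation

print(n_combinations)     # 768
print(n_fits_exhaustive)  # 2304
```

An exhaustive grid search would fit every one of these combinations, while a genetic algorithm evaluates only a subset and relies on selection and variation to steer toward good regions.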

Define the TPOT regressor object that performs the hyperparameter search using the Genetic Algorithm approach.

In [None]:
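#generations, population_size and offspring_size control how many pipelines are evaluated.
#early_stop=12 is larger than generations=10, so early stopping cannot trigger in this run.
tpot_regressor = TPOTRegressor(generations=10, population_size=50, offspring_size=12,
                               verbosity=2, early_stop=12,
                               config_dict={'sklearn.ensemble.RandomForestRegressor': params_grid},
                               cv=3, scoring='neg_mean_squared_error')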
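To illustrate the general idea behind a genetic algorithm, here is a toy sketch in plain Python. It is not TPOT's implementation, which is built on the DEAP library and is considerably more elaborate. The search space, the made-up `TARGET` optimum and the match-counting `fitness` function are all assumptions, chosen so the example runs instantly without training any model.

```python
import random

# Hypothetical search space shaped like a hyperparameter grid.
SPACE = {
    "n_estimators": [20, 60, 100, 150],
    "max_depth": [10, 50, 100, 200],
    "min_samples_split": [2, 4, 8, 10],
    "min_samples_leaf": [1, 2, 4, 6],
}
# Made-up optimum; a real fitness would be a cross-validated model score.
TARGET = {"n_estimators": 100, "max_depth": 50,
          "min_samples_split": 4, "min_samples_leaf": 2}


def fitness(individual):
    """Toy fitness: number of genes that match TARGET (higher is better)."""
    return sum(individual[k] == TARGET[k] for k in SPACE)


def random_individual(rng):
    return {k: rng.choice(values) for k, values in SPACE.items()}


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each gene is copied from one of the two parents."""
    return {k: rng.choice([parent_a[k], parent_b[k]]) for k in SPACE}


def mutate(individual, rng, rate=0.2):
    """Resample each gene from its allowed values with probability `rate`."""
    return {k: rng.choice(SPACE[k]) if rng.random() < rate else v
            for k, v in individual.items()}


def run_ga(generations=10, population_size=20, offspring_size=10, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Selection: the fittest half of the population become parents.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]
        # Variation: create offspring by crossover followed by mutation.
        offspring = []
        for _ in range(offspring_size):
            parent_a, parent_b = rng.sample(parents, 2)
            offspring.append(mutate(crossover(parent_a, parent_b, rng), rng))
        # Survival: keep the best individuals from old population plus offspring.
        population = sorted(population + offspring, key=fitness,
                            reverse=True)[:population_size]
        history.append(fitness(population[0]))
    return population[0], history


best, history = run_ga()
print(best)
print(history)  # best fitness per generation; never decreases
```

Because the best individuals always survive into the next generation, the best score can stay flat for several generations but cannot get worse.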


Fit the TPOT regressor to our training data. This task may take a while based on the size of your data set.

In [None]:
tpot_regressor.fit(X_train,y_train)

Optimization Progress:   0%|          | 0/170 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -18818.887198181696

Generation 2 - Current best internal CV score: -18818.887198181696

Generation 3 - Current best internal CV score: -18818.887198181696

Generation 4 - Current best internal CV score: -18818.887198181696

Generation 5 - Current best internal CV score: -18818.887198181696

Generation 6 - Current best internal CV score: -18818.887198181696

Generation 7 - Current best internal CV score: -18818.887198181696

Generation 8 - Current best internal CV score: -18818.887198181696

Generation 9 - Current best internal CV score: -18818.887198181696

Generation 10 - Current best internal CV score: -18818.887198181696

Best pipeline: RandomForestRegressor(RandomForestRegressor(input_matrix, max_depth=50, max_features=log2, min_samples_leaf=2, min_samples_split=4, n_estimators=150), max_depth=50, max_features=log2, min_samples_leaf=1, min_samples_split=4, n_estimators=60)
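The scores in this log are negative mean squared errors, because scikit-learn scorers follow a "higher is better" convention. The short sketch below converts the reported best score into an RMSE and reproduces the total of 170 pipelines shown in the progress bar from the settings passed to `TPOTRegressor`. The variable names are illustrative.

```python
import math

# Best internal CV score reported in the log (negative MSE).
best_cv_score = -18818.887198181696
cv_rmse = math.sqrt(-best_cv_score)
print(round(cv_rmse, 2))  # 137.18

# TPOT evaluates population_size + generations * offspring_size pipelines.
population_size, generations, offspring_size = 50, 10, 12
total_pipelines = population_size + generations * offspring_size
print(total_pipelines)  # 170
```

The score is identical in all ten generations, which suggests that the best pipeline was found early and no later offspring improved on it. Note that this RMSE comes from 3-fold cross-validation on the training data, so it is not directly comparable with an RMSE measured on the held-out test set.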


The hyperparameter search result can be exported as a ready-to-use Python script with the best model parameters already defined. Calling `export()` without a file name returns the script as a string, which we print here.

In [None]:
print(tpot_regressor.export())

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: -18818.887198181696
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=RandomForestRegressor(max_depth=50, max_features="log2", min_samples_leaf=2, min_samples_split=4, n_estimators=150)),
    RandomForestRegressor(max_depth=50, max_features="log2", min_samples_leaf=1, min_samples_split=4, n_estimators=60)
)

exported_pipeline.fit(t

Let us test the performance of the best pipeline found by TPOT, with its tuned hyperparameters, on the test data.

In [None]:
tpot_pred = tpot_regressor.predict(X_test)



Calculate and print RMSE for the optimized model (and for the original model).

In [None]:
rmse_optimized_model = np.sqrt(mean_squared_error(y_test, tpot_pred))

print("RMSE using the original model: ", rmse_original_model)
print("RMSE using the optimized model: ", rmse_optimized_model)

RMSE using the original model:  136.77210992979678
RMSE using the optimized model:  104.1389767400783


The optimized model performs better: the test RMSE drops from about 136.8 to about 104.1, a reduction of roughly 24%.

# Classification example <br>
using the digits data set.

In [None]:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)



In [None]:
print(X_train.shape, y_train.shape)

(1347, 64) (1347,)


In [None]:
#Let us take the first 500 data points for fast experimentation.
X_train = X_train[:500]
y_train = y_train[:500]

In [None]:
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.962

Generation 2 - Current best internal CV score: 0.966

Generation 3 - Current best internal CV score: 0.968

Generation 4 - Current best internal CV score: 0.968

Generation 5 - Current best internal CV score: 0.974

Best pipeline: KNeighborsClassifier(input_matrix, n_neighbors=3, p=2, weights=distance)
0.98
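The best pipeline reported here is a plain scikit-learn estimator, so it can be refit without TPOT. The sketch below rebuilds it with the reported settings and the same split and 500-sample subset. The variable names (`digits_demo`, `Xd_train`, `knn`, and so on) are illustrative, and the exact accuracy is left unstated since it may differ slightly from TPOT's reported score.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits_demo = load_digits()
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    digits_demo.data, digits_demo.target,
    train_size=0.75, test_size=0.25, random_state=42)

# Settings taken from the "Best pipeline" line in the TPOT output.
knn = KNeighborsClassifier(n_neighbors=3, p=2, weights="distance")
knn.fit(Xd_train[:500], yd_train[:500])  # same 500-sample subset as above

knn_accuracy = knn.score(Xd_test, yd_test)
print(knn_accuracy)
```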


