In [1]:
!conda activate surya

# Gradient Boosting Regressor

First, the libraries needed to run the main program are imported.
Pandas is used for data frame creation and handling.
NumPy is used for numerical analysis and multidimensional array handling.
Scikit-learn provides various submodules to preprocess the data, create and train the machine learning model, and test its performance.
Matplotlib is used for plotting graphs.

In [2]:
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate

The observational data set is read from the CSV file and stored as a pandas data frame in the variable `grd`.


In [3]:
grd = pd.read_csv("../data/graphene_data_final.csv")

The first four columns, `Graphene_percentage`, `FEED`, `RPM` and `DOC`, are taken as inputs. The material removal rate (`MRR_gm_per_sec`) is the first output and the surface roughness (`Ra`) is the second output.

In [4]:
X, Y = grd[['Graphene_percentage', 'FEED', 'RPM', 'DOC']], grd['MRR_gm_per_sec']
Y2 = grd['Ra']

Now the input and output data are split into separate train and test data sets.
X_train is the input and Y_train is the output data set used to train the model to predict `MRR`.
X_test and Y_test are the input and output data sets used to test the performance of the `MRR` predictor model.
X_train2 is the input and Y2_train is the output data set used to train the model to predict `Ra`.
X_test2 and Y2_test are the input and output data sets used to test the performance of the `Ra` predictor model.

In [5]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=39)
X_train2, X_test2, Y2_train, Y2_test = train_test_split(X, Y2, test_size=0.2, random_state=23)
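As a rough illustration of what `train_test_split` does, the sketch below runs it on a synthetic stand-in for the data set. The 50 rows and all values are made up; only the column names are taken from this notebook. Note that the two splits above use different `random_state` values, so the `MRR` and `Ra` models are trained and tested on different rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data set (all values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Graphene_percentage": rng.uniform(0, 1, 50),
    "FEED": rng.uniform(0.1, 0.3, 50),
    "RPM": rng.uniform(500, 1500, 50),
    "DOC": rng.uniform(0.5, 1.5, 50),
})
demo["MRR_gm_per_sec"] = 0.01 * demo["FEED"] * demo["RPM"] * demo["DOC"]

X_demo = demo[["Graphene_percentage", "FEED", "RPM", "DOC"]]
y_demo = demo["MRR_gm_per_sec"]

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=39
)
print(Xa_train.shape, Xa_test.shape)  # (40, 4) (10, 4)

# Calling it again with the same random_state reproduces the same split.
Xb_train, Xb_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=39
)
same_split = Xa_test.index.equals(Xb_test.index)
print(same_split)
```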

Here a basic gradient boosting model is created, trained and tested. `GradientBoostingRegressor()` is used to create a gradient boosting model object.
The `fit` method trains the model on the train data sets passed to it, and then the `score` method tests the performance of the model on the test data sets.

The following program predicts the material removal rate.

In [6]:
gbr = GradientBoostingRegressor(random_state=0)
gbr.fit(X_train, Y_train)
gbr.score(X_test,Y_test)

0.8825338498834934
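For a regressor, `score` returns the coefficient of determination R², so the value above is best read as the share of variance in the test targets that the model explains. The sketch below checks this on synthetic data (the arrays and the target formula are made up for illustration) by computing R² by hand and comparing it with `score` and `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: 4 inputs, 1 noisy target (made up for illustration).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(200, 4))
y_demo = 3 * X_demo[:, 0] + np.sin(4 * X_demo[:, 1]) + rng.normal(0, 0.1, 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
pred = model.predict(X_te)
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# All three values should agree up to floating point error.
print(model.score(X_te, y_te), r2_manual, r2_score(y_te, pred))
```

A score of 1 means perfect prediction, and 0 means the model does no better than always predicting the mean of the test targets.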

The following program predicts surface roughness.

In [7]:
gbr2 = GradientBoostingRegressor(random_state=51)
gbr2.fit(X_train2, Y2_train)
gbr2.score(X_test2,Y2_test)

0.3862120620566041

The above models are produced without any parameter tuning.
The models can be improved by training and testing them with many combinations of parameters and keeping the best performing one to predict the output.
For that purpose, a parameter grid is now created. It lists the candidate values for each parameter, from which the best combination is to be found.

In [8]:
param_grid = { 
    'n_estimators': [10,20,30,50,100],
    'max_depth' : [1,2,3,4,6,8],
    'min_samples_leaf' : [1,2,3,4],
    'min_samples_split' : [2,3,4]
}
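To get a feel for the cost of the search, the number of candidate combinations in this grid can be counted with scikit-learn's `ParameterGrid`. This is a small sketch; the grid is repeated so that the block is self-contained, and `n_folds = 3` assumes the 3-fold cross validation used later in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated so this block runs on its own.
param_grid = {
    "n_estimators": [10, 20, 30, 50, 100],
    "max_depth": [1, 2, 3, 4, 6, 8],
    "min_samples_leaf": [1, 2, 3, 4],
    "min_samples_split": [2, 3, 4],
}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 6 * 4 * 3
n_folds = 3  # assumed to match cv=3 used in the grid search
print(n_candidates)            # 360
print(n_candidates * n_folds)  # 1080 model fits during the search
```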

The grid search algorithm streamlines the process of finding the best set of parameters: given a basic model and a parameter grid, it evaluates every combination in the grid and reports the best one.

In [9]:
gbr_cv = GradientBoostingRegressor(random_state=7)

`GridSearchCV` creates a grid search object that takes a model object, a parameter grid and the number of cross-validation folds as input.
When the grid search object is fitted with the train data, it trains and validates a model for every set of parameters in the given parameter grid. It is worth mentioning that grid search uses cross validation: the train data is split into a given number of parts, and each part in turn is used to validate the model while the remaining parts are used to train it. At the end it identifies the highest scoring set of parameters and the corresponding model.
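The splitting described above can be sketched with `KFold`, which is what an integer `cv` value resolves to when the estimator is a regressor. The six samples below are made up purely to keep the printed indices readable.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(12).reshape(6, 2)  # 6 made-up samples

kf = KFold(n_splits=3)  # 3 folds, no shuffling
folds = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in kf.split(X_demo)
]
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
# train: [2, 3, 4, 5] validate: [0, 1]
# train: [0, 1, 4, 5] validate: [2, 3]
# train: [0, 1, 2, 3] validate: [4, 5]
```

Each sample is used for validation exactly once, and the score of a parameter set is the mean of its validation scores over the folds.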

In [10]:
CV_gbr = GridSearchCV(estimator=gbr_cv, param_grid=param_grid, cv= 3)
CV_gbr.fit(X_train, Y_train)

GridSearchCV(cv=3, estimator=GradientBoostingRegressor(random_state=7),
             param_grid={'max_depth': [1, 2, 3, 4, 6, 8],
                         'min_samples_leaf': [1, 2, 3, 4],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [10, 20, 30, 50, 100]})

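The fitted grid search object acts as a model itself: by default it refits the best set of parameters on the whole train data set and passes calls such as `score` and `predict` on to that refitted model.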

In [11]:
CV_gbr.best_params_

{'max_depth': 3,
 'min_samples_leaf': 4,
 'min_samples_split': 2,
 'n_estimators': 30}

In [12]:
CV_gbr.score(X_test,Y_test)

0.9083613133714951
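Besides `best_params_`, a fitted `GridSearchCV` object also exposes `best_score_` (the mean cross-validated score of the best candidate), `best_estimator_` (the model refitted on all the train data) and `cv_results_`. The sketch below shows them on synthetic data with a reduced grid; the data and the name `small_grid` are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (made up) so the block runs without the CSV file.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 1, size=(120, 4))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 2] + rng.normal(0, 0.05, 120)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Reduced, hypothetical grid to keep the example fast.
small_grid = {"n_estimators": [10, 30], "max_depth": [1, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=7), small_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)       # mean validation R^2 of the best candidate
best = search.best_estimator_   # refitted on all of X_tr, y_tr

# search.score delegates to the refitted best estimator.
same_score = search.score(X_te, y_te) == best.score(X_te, y_te)
print(same_score)
```

Comparing `best_score_` with the test score is a quick way to see whether the parameters chosen on the validation folds carry over to unseen data.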

The same procedure is now repeated to predict surface roughness (`Ra`).

In [13]:
gbr_cv2 = GradientBoostingRegressor(random_state=19)

In [14]:
CV_gbr2 = GridSearchCV(estimator=gbr_cv2, param_grid=param_grid, cv= 3)
CV_gbr2.fit(X_train2, Y2_train)


GridSearchCV(cv=3, estimator=GradientBoostingRegressor(random_state=19),
             param_grid={'max_depth': [1, 2, 3, 4, 6, 8],
                         'min_samples_leaf': [1, 2, 3, 4],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [10, 20, 30, 50, 100]})

In [15]:
print(CV_gbr2.score(X_test2,Y2_test))

0.3524667469551527


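For `MRR`, the grid search model performs better than the original basic model (test score 0.908 against 0.883). The improvement is small, probably because the train data set is small and the default parameters of the basic model are already reasonable. For `Ra`, the grid search model scores 0.352 on the test set, which is slightly below the 0.386 of the basic model.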

In [16]:
CV_gbr2.best_params_

{'max_depth': 4,
 'min_samples_leaf': 3,
 'min_samples_split': 2,
 'n_estimators': 20}

It can be seen that the test score for `Ra` is quite low. This suggests that the gradient boosting model, even with grid search, is not able to capture the relation between the input variables and surface roughness from this data set.
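With a small data set, a single test split can be noisy, so one way to check a low score is to cross validate over all the data with `cross_validate`, which is already imported at the top of this notebook. The sketch below uses synthetic data with a deliberately weak signal (the data are made up, not the `Ra` measurements) and borrows the parameter values found above only as an example setting.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

# Made-up target with a weak signal and a lot of noise.
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 1, size=(100, 4))
y_demo = 0.5 * X_demo[:, 1] + rng.normal(0, 0.3, 100)

model = GradientBoostingRegressor(
    n_estimators=20, max_depth=4, min_samples_leaf=3, random_state=19
)
cv_out = cross_validate(
    model, X_demo, y_demo, cv=5, scoring="r2", return_train_score=True
)

# A large gap between train and test R^2 points to overfitting or noisy targets.
print("train R^2:", cv_out["train_score"].mean())
print("test  R^2:", cv_out["test_score"].mean(), "+/-", cv_out["test_score"].std())
```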

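Now the best performing models are saved for further use: the grid search model for `MRR` and the basic model for `Ra`, which scored higher on its test set than the tuned one.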

In [17]:
import pickle
with open('../trained_models/gradient_boosting_MRR.pkl','wb') as f:
    pickle.dump(CV_gbr,f)
with open('../trained_models/gradient_boosting_RA.pkl','wb') as f:
    pickle.dump(gbr2,f)
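A saved model can later be restored with `pickle.load` and used for prediction. The following is a minimal sketch of that round trip; it trains a small model on made-up data and writes to a temporary file that stands in for the `.pkl` paths above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Small model trained on made-up data, standing in for the saved models.
rng = np.random.default_rng(4)
X_demo = rng.uniform(0, 1, size=(60, 4))
y_demo = X_demo.sum(axis=1)
model = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# The temporary file is a placeholder for the real .pkl paths.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model gives the same predictions as the original.
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickled models should be loaded with the same scikit-learn version that was used to save them, and pickle files should only be loaded from trusted sources.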