## OPTIMIZING ML MODELS

**This notebook teaches you how to optimize your ML models once you have built, trained, and tested them.**

Optimization is typically the last step of the machine learning workflow before results are presented to the user.

#### So what are we going to optimize?

We are going to optimize ***model hyperparameters***.
A model hyperparameter is a configuration value that is external to the model and cannot be estimated from the data. It is set before training, unlike model parameters (such as coefficients), which are learned during fitting.

There are many strategies for tuning model hyperparameters. In this workshop we will discuss one technique: ***Grid Search***.
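Before using a library for it, it can help to see what "the grid" in grid search is: every combination of the candidate values for each hyperparameter. The following is a minimal sketch using only the standard library; the `search_space` dictionary and its values are illustrative assumptions, and the step that trains and scores a model at each grid point is left out.

```python
from itertools import product

# Hypothetical search space: a few candidate values for each hyperparameter
search_space = {
    "dual": [True, False],
    "max_iter": [100, 200, 300, 400, 500],
}

names = list(search_space)

# Each combination of candidate values is one point on the grid
grid_points = [dict(zip(names, combo)) for combo in product(*search_space.values())]

print(len(grid_points))  # 2 * 5 = 10 combinations
print(grid_points[0])    # {'dual': True, 'max_iter': 100}
```

Grid search trains and evaluates the model once per grid point (times the number of cross-validation folds), so the cost grows multiplicatively with each hyperparameter added.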

## What dataset are we using for this workshop?

We will use the Pima Indian diabetes dataset. It poses a binary classification problem: given the 8 features in the dataset, predict whether a person has diabetes. You can find the complete description of the dataset [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

In [1]:
# Import all required libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Reading and displaying the head of the data
data = pd.read_csv("http://bit.ly/opt-data")
data.head()

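|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |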


### Some basic data cleaning: treat zero values as missing and fill them with column means

In [3]:
# In these columns a 0 stands for a missing measurement, so mark it as NaN
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
data[zero_as_missing] = data[zero_as_missing].replace(0, np.nan)

# Count the number of NaN values in each column
print(data.isnull().sum())

# Fill missing values with the mean of each column
data.fillna(data.mean(), inplace=True)

# Confirm that no NaN values remain
print(data.isnull().sum())

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


#### After cleaning, no missing values remain.

### Now let's quickly train a model with arbitrarily chosen hyperparameter values
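To make the two cleaning steps concrete, here is a small sketch on a toy column. The values are made up for illustration and are not taken from the dataset; only the `replace` and `fillna` calls mirror the cell above.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) where 0 stands in for "not measured"
toy = pd.DataFrame({"Insulin": [0.0, 94.0, 168.0, 0.0]})

# Step 1: mark zeros as missing
toy["Insulin"] = toy["Insulin"].replace(0, np.nan)

# Step 2: fill missing entries with the mean of the observed values,
# here (94 + 168) / 2 = 131
toy["Insulin"] = toy["Insulin"].fillna(toy["Insulin"].mean())

print(toy["Insulin"].tolist())  # [131.0, 94.0, 168.0, 131.0]
```

Note that the mean is computed only from the observed (non-NaN) values, which is why the zeros have to be converted to NaN first.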

In [11]:
# Split dataset into inputs and outputs
values = data.values
X = values[:,0:8]
y = values[:,8]

# Initialize the LR model with arbitrarily chosen hyperparameter values
lr = LogisticRegression(dual=False, max_iter=110)

# We will tune these hyperparameters later using Grid Search

In [12]:
# Pass data to train the LR Model
lr.fit(X,y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(max_iter=110)

In [13]:
# Let's check the accuracy of the model on the training data (the same data it was fit on)
lr.score(X,y)

0.7721354166666666
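The score above is measured on the same rows the model was trained on, so it tends to be optimistic. A common alternative is to hold out part of the data for evaluation. The sketch below assumes synthetic data from `make_classification` with the same shape as the Pima data (768 rows, 8 features) so that it runs without a download; the split ratio, `max_iter=1000`, and the variable names are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the diabetes data; NOT the real dataset
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# Hold out 25% of the rows, keeping the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on rows seen during training
test_acc = model.score(X_test, y_test)     # accuracy on held-out rows
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The cross-validation used by grid search later in this notebook applies the same idea, repeated over several train/validation splits.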

### Now let's build the model using hyperparameter optimization

In [16]:
from sklearn.model_selection import GridSearchCV

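# Define the candidate values for each hyperparameter in the grid
# Note: dual=True is only supported by the liblinear solver. With the default
# solver, the dual=True candidates fail to fit and are scored as NaN.
dual = [True, False]
max_iter = [100, 200, 300, 400, 500]
param_grid = dict(dual=dual, max_iter=max_iter)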

In [17]:
lr = LogisticRegression(penalty='l2')
# cv=3 runs 3-fold cross-validation per grid point; n_jobs=-1 uses all available CPU cores
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv=3, n_jobs=-1)

grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.773438 using {'dual': False, 'max_iter': 200}
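To see how every grid point scored rather than only the best one, a fitted `GridSearchCV` object exposes a `cv_results_` dictionary. The sketch below runs on synthetic data from `make_classification` (an assumption, so that it runs without a download) and uses the `liblinear` solver, which accepts both `dual` settings; names such as `demo_grid` and the reduced grid are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; NOT the diabetes dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# liblinear supports both dual=True and dual=False with the default L2 penalty
demo_grid = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear"),
    param_grid={"dual": [True, False], "max_iter": [100, 200]},
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# One row per grid point, sorted so the best-ranked combination comes first
results = pd.DataFrame(demo_grid.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.to_string(index=False))
```

Looking at `std_test_score` alongside `mean_test_score` helps judge whether the gap between the top candidates is larger than the fold-to-fold variation.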


#### You can experiment with more hyperparameters to tune the model further.
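As one possible direction, the sketch below tunes the regularization strength `C` and scales the features inside a pipeline, which is what the convergence warning earlier in this notebook suggests. It is an illustration under assumptions rather than the notebook's own method: the data is synthetic, and the candidate values for `C`, the fold count, and the variable names are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; NOT the diabetes dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Putting the scaler in the pipeline means it is refit on each training fold,
# so the validation fold does not influence the scaling statistics
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name,
# so the model's C is addressed as "logisticregression__C"
c_values = [0.01, 0.1, 1.0, 10.0, 100.0]
wider_grid = {"logisticregression__C": c_values}

search = GridSearchCV(pipeline, param_grid=wider_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best: %f using %s" % (search.best_score_, search.best_params_))
```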

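You have now seen the role hyperparameter optimization plays in building better-performing machine learning models. Keep in mind that the grid search score (0.773438) is a mean 3-fold cross-validated accuracy, whereas the earlier 0.772135 was measured on the training data, so the two numbers are not directly comparable.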