In [1]:
%load_ext nb_black

This tutorial follows [DataCamp's example](https://www.datacamp.com/community/tutorials/parameter-optimization-machine-learning-models).

A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data. Hyperparameters are typically specified by the practitioner and are often used in the processes that estimate model parameters.

Model hyperparameters are often referred to as model parameters, which can be confusing. A good rule of thumb is: “If you have to specify a model parameter manually, then it is probably a model hyperparameter.” Some examples of model hyperparameters include:

- The learning rate for training a neural network.
- The C and sigma hyperparameters for support vector machines.
- The k in k-nearest neighbors.
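
To make the distinction concrete, here is a minimal sketch using scikit-learn's `LogisticRegression` on a tiny made-up dataset (the arrays `X_toy` and `y_toy` are illustrative assumptions, not the diabetes data). `C` is set by hand before fitting, so it is a hyperparameter; the coefficients are estimated from the data, so they are parameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, linearly separable toy data (assumption for illustration only).
X_toy = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, 3.0], [3.0, 2.5]])
y_toy = np.array([0, 0, 1, 1])

# Hyperparameter: chosen by the practitioner before training.
model = LogisticRegression(C=0.5)
model.fit(X_toy, y_toy)

# Hyperparameters are readable before and after fitting.
print("C (hyperparameter):", model.get_params()["C"])

# Parameters exist only after fitting, because they are learned from data.
print("coefficients (parameters):", model.coef_)
print("intercept (parameter):", model.intercept_)
```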

Grid search is an approach to hyperparameter tuning that methodically builds and evaluates a model for each combination of hyperparameter values specified in a grid. Random search differs from grid search in that you no longer provide a discrete set of values to explore for each hyperparameter; rather, you provide a statistical distribution for each hyperparameter from which values are randomly sampled.
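
The difference can be sketched with the standard library alone. The hyperparameter names, ranges, and the choice of a log-uniform distribution below are assumptions picked for illustration; the point is only that a grid enumerates every combination, while random search draws a fixed number of candidates from distributions.

```python
import itertools
import random

# Grid search: a discrete list of candidate values per hyperparameter.
grid = {"C": [0.1, 1.0, 10.0], "max_iter": [100, 200]}
grid_candidates = [
    dict(zip(grid.keys(), values))
    for values in itertools.product(*grid.values())
]
print(len(grid_candidates), "grid candidates")  # every combination: 3 * 2

# Random search: a distribution per hyperparameter, sampled n_iter times.
rng = random.Random(0)  # seeded so the sketch is reproducible
n_iter = 4
random_candidates = [
    {
        "C": 10 ** rng.uniform(-1, 1),      # log-uniform between 0.1 and 10
        "max_iter": rng.randint(100, 200),  # uniform integer in [100, 200]
    }
    for _ in range(n_iter)
]
for candidate in random_candidates:
    print(candidate)
```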

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

In [3]:
df = pd.read_csv("diabetes.csv")
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


So you can see 8 different features, each row labeled with an outcome of 1 or 0, where 1 means the observation has diabetes and 0 means it does not.

In [4]:
df.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [5]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [6]:
values = df.values

In [7]:
X = values[:, 0:8]
y = values[:, 8]
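
Slicing by position works only as long as `Outcome` stays in the last column. A slightly more robust alternative, sketched here on a two-row stand-in frame (`demo` is a hypothetical name; the values are copied from the first rows shown above), selects the target by name:

```python
import pandas as pd

# Two-row stand-in for the diabetes frame, with only a few of its columns.
demo = pd.DataFrame(
    {"Glucose": [148, 85], "BMI": [33.6, 26.6], "Outcome": [1, 0]}
)

# Features: every column except the target. Target: the Outcome column.
X_demo = demo.drop(columns="Outcome").to_numpy()
y_demo = demo["Outcome"].to_numpy()

print(X_demo.shape, y_demo.shape)
```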

In [8]:
lr = LogisticRegression(penalty="l1", dual=False, max_iter=110, solver="liblinear")

- penalty : Specifies the norm used in the penalization (regularization).
- dual : Dual or primal formulation. The dual formulation is only implemented for the l2 penalty with the liblinear solver. Prefer dual=False when n_samples > n_features.
- max_iter : Maximum number of iterations taken for the solver to converge.

In [9]:
lr.fit(X, y)

LogisticRegression(max_iter=110, penalty='l1', solver='liblinear')

In [10]:
lr.score(X, y)

0.7799479166666666
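
For a classifier, `score` returns the mean accuracy on the data it is given, which here is the same data the model was trained on. As a rough illustration of what that number means, the sketch below computes accuracy by hand on made-up labels (`accuracy` is a hypothetical helper, not part of scikit-learn):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels: 3 of the 4 predictions are correct.
acc = accuracy([1, 0, 1, 0], [1, 0, 0, 0])
print(acc)
```

Because the score above is measured on the training data, it tends to be optimistic, which is why the next cells switch to cross-validation.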

In [11]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [12]:
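# Build a 3-fold cross-validator. random_state only has an effect when
# shuffle=True (recent scikit-learn versions raise an error otherwise),
# so it is omitted here; the folds are contiguous and deterministic.
kfold = KFold(n_splits=3)
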
In [13]:
result = cross_val_score(lr, X, y, cv=kfold, scoring="accuracy")
print(result.mean())

0.76953125
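
To show what the cross-validator does, here is a simplified re-implementation of unshuffled k-fold splitting (the function `kfold_indices` is a hypothetical stand-in, not scikit-learn's code). Each fold is held out once as the test set while the model trains on the rest; `cross_val_score` then returns one score per fold, and the cell above averages them.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 768 rows, as in the diabetes data, split into 3 folds.
folds = list(kfold_indices(768, 3))
for train, test in folds:
    print(len(train), "train rows,", len(test), "test rows")
```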


In [14]:
from sklearn.model_selection import GridSearchCV

In [15]:
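# Note: dual=True is only supported by the liblinear solver, so with the
# default lbfgs solver used in the next cell the dual=True candidates fail to fit.
dual = [True, False]
max_iter = [100, 110, 120, 130, 140]
param_grid = dict(dual=dual, max_iter=max_iter)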

In [16]:
import time

lr = LogisticRegression(penalty="l2")
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv=3, n_jobs=-1)

start_time = time.time()
grid_result = grid.fit(X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [17]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.772135 using {'dual': False, 'max_iter': 100}
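
The winning configuration uses `dual=False`. One likely reason, hedged because the notebook does not show the per-candidate scores: scikit-learn only supports `dual=True` with the `liblinear` solver, and `LogisticRegression(penalty="l2")` uses the default `lbfgs` solver, so those candidates cannot be fit. The sketch below checks this on made-up data (`fits_ok`, `X_small`, and `y_small` are hypothetical names; the penalty is left at its default, which is l2):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up two-class problem, purely illustrative.
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([0, 0, 1, 1])

def fits_ok(solver, dual):
    """Return True if LogisticRegression can be fit with this solver/dual pair."""
    try:
        LogisticRegression(solver=solver, dual=dual).fit(X_small, y_small)
        return True
    except ValueError:
        return False

lbfgs_dual = fits_ok("lbfgs", True)
liblinear_dual = fits_ok("liblinear", True)
print("lbfgs with dual=True fits:", lbfgs_dual)
print("liblinear with dual=True fits:", liblinear_dual)
```

If the dual formulation is meant to be part of the search, passing `solver="liblinear"` to the estimator would make every candidate in the grid valid.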


In [18]:
# time.time() returns seconds, not milliseconds.
print("Execution time: " + str(time.time() - start_time) + " s")

Execution time: 2.455432176589966 s
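
Because `start_time` is set in one cell and read in another, the figure above also includes the time spent between cell executions. A minimal sketch of a tighter measurement, using `time.perf_counter` around the call itself (`timed` and `slow_sum` are hypothetical helpers):

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def slow_sum(n):
    """Stand-in workload; in the notebook this would be grid.fit(X, y)."""
    return sum(range(n))

total, seconds = timed(slow_sum, 100_000)
print(f"Execution time: {seconds:.4f} s")
```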


Next, define a larger grid of hyperparameters and apply grid search again.

In [19]:
dual = [True, False]
max_iter = [100, 110, 120, 130, 140]
C = [1.0, 1.5, 2.0, 2.5]
param_grid = dict(dual=dual, max_iter=max_iter, C=C)
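
Adding `C` multiplies the number of candidates. As a quick check, the sketch below counts the combinations in both grids with `itertools.product` (`count_candidates` is a hypothetical helper); with `cv=3`, each candidate is fit three times.

```python
import itertools

small_grid = {"dual": [True, False], "max_iter": [100, 110, 120, 130, 140]}
large_grid = {**small_grid, "C": [1.0, 1.5, 2.0, 2.5]}

def count_candidates(grid):
    """Number of hyperparameter combinations in a grid of lists."""
    return sum(1 for _ in itertools.product(*grid.values()))

cv = 3
n_small = count_candidates(small_grid)
n_large = count_candidates(large_grid)
print(n_small, "candidates,", n_small * cv, "fits")
print(n_large, "candidates,", n_large * cv, "fits")
```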

In [20]:
lr = LogisticRegression(penalty="l2")
grid = GridSearchCV(estimator=lr, param_grid=param_grid, cv=3, n_jobs=-1)

start_time = time.time()
grid_result = grid.fit(X, y)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.773438 using {'C': 2.5, 'dual': False, 'max_iter': 100}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [21]:
# time.time() returns seconds, not milliseconds.
print("Execution time: " + str(time.time() - start_time) + " s")

Execution time: 0.5774538516998291 s


The best accuracy score rises slightly with the larger grid, from 0.772135 to 0.773438.

In [22]:
from sklearn.model_selection import RandomizedSearchCV

In [23]:
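# Passing lists as param_distributions samples uniformly from them;
# n_iter defaults to 10, so only 10 of the 40 combinations are tried.
random_search = RandomizedSearchCV(
    estimator=lr, param_distributions=param_grid, cv=3, n_jobs=-1
)

start_time = time.time()
random_result = random_search.fit(X, y)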

print("Best: %f using %s" % (random_result.best_score_, random_result.best_params_))

Best: 0.773438 using {'max_iter': 100, 'dual': False, 'C': 2.5}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [24]:
# time.time() returns seconds, not milliseconds.
print("Execution time: " + str(time.time() - start_time) + " s")

Execution time: 0.2044544219970703 s


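Random search reaches the same best accuracy as the larger grid search (0.773438) in less measured time (about 0.20 versus 0.58 on the same clock), since by default it fits only 10 sampled combinations instead of all 40.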
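
The search above samples from the same lists used for the grid. To sample from a statistical distribution instead, as described at the start of this tutorial, a distribution object can be passed in place of a list. The following is a minimal sketch under stated assumptions: synthetic data from `make_classification` stands in for the diabetes data, and the log-uniform range for `C` and the `liblinear` solver are illustrative choices, not values from the original tutorial.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption); swap in X and y from the notebook.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

# C is drawn from a log-uniform distribution instead of a fixed list.
param_distributions = {"C": loguniform(1e-2, 1e2)}

search = RandomizedSearchCV(
    estimator=LogisticRegression(solver="liblinear"),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled candidates
    cv=3,
    random_state=0,  # makes the sampled values reproducible
)
search.fit(X_syn, y_syn)

print("Best: %f using %s" % (search.best_score_, search.best_params_))
```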