# 6.4. Fine-tuning machine learning models via grid search

## Comment
This section is about grid search. Grid search is a basic but powerful hyperparameter tuning method that evaluates every combination of the given hyperparameter values. The drawback is that it requires a lot of computational power because it performs an exhaustive search.
* from sklearn.model_selection import GridSearchCV

For details, refer to [sklearn.model_selection.GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

An alternative method is randomized search, which evaluates a fixed number of randomly sampled combinations of the provided hyperparameters within the given constraints.
* from sklearn.model_selection import RandomizedSearchCV

For details, refer to [sklearn.model_selection.RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html).
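
As a rough illustration of randomized search, the sketch below samples 20 combinations of C and gamma from log-uniform distributions instead of walking a full grid. It is a minimal sketch under a few assumptions that are not in the original notes: it loads scikit-learn's bundled breast cancer dataset rather than the UCI URL, and the ranges, `n_iter=20`, and `cv=5` are arbitrary illustrative choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: use scikit-learn's bundled breast cancer data (no download needed)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))

# Distributions instead of fixed lists: each candidate draws its own C and gamma
param_distributions = {
    'svc__C': loguniform(1e-4, 1e4),
    'svc__gamma': loguniform(1e-4, 1e4),
    'svc__kernel': ['rbf'],
}

rs = RandomizedSearchCV(estimator=pipe_svc,
                        param_distributions=param_distributions,
                        n_iter=20,          # number of sampled combinations (illustrative)
                        scoring='accuracy',
                        cv=5,
                        random_state=1)
rs.fit(X_train, y_train)
print(rs.best_score_, rs.best_params_)
```

The cost is controlled by `n_iter` rather than by the size of the grid, which is the main practical difference from grid search.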


## Sources:
New part
* [Fine-tuning machine learning models via grid search](https://render.githubusercontent.com/view/ipynb?commit=1b01e733d15a1808ebdb0e07e46dbb9cb1634323&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f72617362742f707974686f6e2d6d616368696e652d6c6561726e696e672d626f6f6b2d326e642d65646974696f6e2f316230316537333364313561313830386562646230653037653436646262396362313633343332332f636f64652f636830362f636830362e6970796e62&nwo=rasbt%2Fpython-machine-learning-book-2nd-edition&path=code%2Fch06%2Fch06.ipynb&repository_id=81413897&repository_type=Repository#Fine-tuning-machine-learning-models-via-grid-search)

Related parts
* [Debugging algorithms with learning and validation curves](https://render.githubusercontent.com/view/ipynb?commit=1b01e733d15a1808ebdb0e07e46dbb9cb1634323&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f72617362742f707974686f6e2d6d616368696e652d6c6561726e696e672d626f6f6b2d326e642d65646974696f6e2f316230316537333364313561313830386562646230653037653436646262396362313633343332332f636f64652f636830362f636830362e6970796e62&nwo=rasbt%2Fpython-machine-learning-book-2nd-edition&path=code%2Fch06%2Fch06.ipynb&repository_id=81413897&repository_type=Repository#Debugging-algorithms-with-learning-and-validation-curves)
* [Streamlining workflows with pipelines](https://render.githubusercontent.com/view/ipynb?commit=1b01e733d15a1808ebdb0e07e46dbb9cb1634323&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f72617362742f707974686f6e2d6d616368696e652d6c6561726e696e672d626f6f6b2d326e642d65646974696f6e2f316230316537333364313561313830386562646230653037653436646262396362313633343332332f636f64652f636830362f636830362e6970796e62&nwo=rasbt%2Fpython-machine-learning-book-2nd-edition&path=code%2Fch06%2Fch06.ipynb&repository_id=81413897&repository_type=Repository#Streamlining-workflows-with-pipelines)
* [Using k-fold cross-validation to assess model performance](https://render.githubusercontent.com/view/ipynb?commit=1b01e733d15a1808ebdb0e07e46dbb9cb1634323&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f72617362742f707974686f6e2d6d616368696e652d6c6561726e696e672d626f6f6b2d326e642d65646974696f6e2f316230316537333364313561313830386562646230653037653436646262396362313633343332332f636f64652f636830362f636830362e6970796e62&nwo=rasbt%2Fpython-machine-learning-book-2nd-edition&path=code%2Fch06%2Fch06.ipynb&repository_id=81413897&repository_type=Repository#Using-k-fold-cross-validation-to-assess-model-performance)

## Summary
### Part 1: Tuning hyperparameters via grid search
There are two types of parameters related to a model:
* the model's weights, which are learned from the training data
* hyperparameters that control the model and are set before training, e.g. C of LogisticRegression or the depth of a Decision Tree
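
The distinction can be seen on a fitted model. The following is a small sketch, assuming scikit-learn's bundled breast cancer dataset and an arbitrary value `C=0.1`: the hyperparameter is passed to the constructor, while the weights only exist after `fit`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled breast cancer data, used only for illustration
X, y = load_breast_cancer(return_X_y=True)

# Hyperparameter: C is chosen by us before training (0.1 is an arbitrary value)
pipe_lr = make_pipeline(StandardScaler(), LogisticRegression(C=0.1))
pipe_lr.fit(X, y)

# Weights: coef_ and intercept_ are learned from the data during fit
lr = pipe_lr.named_steps['logisticregression']
print('hyperparameter C:', lr.C)
print('learned weights shape:', lr.coef_.shape)
```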

Grid search is an exhaustive search over the given hyperparameter values.
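
To get a feel for the cost of an exhaustive search, the number of candidates can be counted before fitting anything. This sketch uses `ParameterGrid` on the same grid as in the Code section; the fold count of 10 matches `cv=10` there.

```python
from sklearn.model_selection import ParameterGrid

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']},
              {'svc__C': param_range, 'svc__kernel': ['rbf'], 'svc__gamma': param_range}]

# 9 linear candidates + 9 * 9 rbf candidates
n_candidates = len(ParameterGrid(param_grid))
n_folds = 10
n_fits = n_candidates * n_folds  # the final refit on the full training set adds one more
print(n_candidates, n_fits)
```

Adding one more value to `param_range` grows the rbf part quadratically, which is why grid search becomes expensive quickly.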

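### Part 2: Algorithm selection with nested cross-validation
* Nested cross-validation method: an inner loop tunes the hyperparameters and an outer loop estimates the performance of the tuned model
* Example of 5x2 cross-validation: 5 outer folds and 2 inner folds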
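
The two loops can be written out explicitly. This is only a sketch of the idea, not the code used later in these notes: it assumes scikit-learn's bundled breast cancer dataset, and it uses shuffled `StratifiedKFold` splitters with an arbitrary seed, so the numbers will differ from the results reported in the Code section.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled breast cancer data, used only for illustration
X, y = load_breast_cancer(return_X_y=True)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # 5 outer folds
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)  # 2 inner folds

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: tune max_depth using only the outer training fold
    gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': [1, 2, 3, 4, 5, 6, 7, None]},
                      scoring='accuracy',
                      cv=inner_cv)
    gs.fit(X[train_idx], y[train_idx])
    # Outer loop: score the tuned model on data the inner loop never saw
    outer_scores.append(gs.score(X[test_idx], y[test_idx]))

print('Nested CV accuracy %.3f +/- %.3f' % (np.mean(outer_scores), np.std(outer_scores)))
```

Passing a `GridSearchCV` object to `cross_val_score` performs the same two-level procedure in fewer lines.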


## Code

## Part 1: Tuning hyperparameters via grid search
### Load the Dataset

In [7]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Fetch the dataset
url2breast_cancer_wisconsin_dataset='https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
url2dataset = url2breast_cancer_wisconsin_dataset
df = pd.read_csv( url2dataset, header=None )

# Get the actual data & label from the dataset
X = df.loc[:, 2:].values  # actual data
y = df.loc[:, 1].values   # label

# Encode B to 0 and M to 1
le = LabelEncoder()
y  = le.fit_transform(y)

# Split the dataset into train & test data (test set = 20% of all samples)
test_fraction = 0.2
random_seed = 1
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=test_fraction, stratify=y, random_state=random_seed)

### Prepare the Model

In [8]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC  # Support Vector Machine, Support Vector Classifier

pipe_svc = make_pipeline( StandardScaler(), SVC(random_state=1) )

### Prepare the Grid Search

In [9]:
from sklearn.model_selection import GridSearchCV

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]  # I added 10000.0
param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']},
              {'svc__C': param_range, 'svc__kernel': ['rbf'], 'svc__gamma': param_range}
             ]
gs = GridSearchCV( estimator=pipe_svc,
                   param_grid=param_grid,
                   scoring='accuracy',
                   cv=10,
                   n_jobs=-1
                 )

### Train the Model

In [11]:
gs = gs.fit( X_train, y_train )

### Display the Performance

In [12]:
print( gs.best_score_, gs.best_params_)

0.9846153846153847 {'svc__C': 100.0, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}


### Test the Best Model's Performance

In [13]:
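# Fetch the best model (the pipeline with the best hyperparameters)
clf = gs.best_estimator_

# Train the best model on the whole training set
# (redundant here: GridSearchCV already refits it because refit=True by default)
clf.fit( X_train, y_train )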
test_accuracy = clf.score( X_test, y_test )
print( 'Test Accuracy %.3f' % test_accuracy )

Test Accuracy 0.974
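
Because GridSearchCV refits the best configuration on the full training set by default (`refit=True`), the search object can also score the test set directly. The sketch below checks that both routes give the same accuracy. It assumes scikit-learn's bundled breast cancer dataset and a deliberately small, illustrative grid so that it runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled breast cancer data, used only for illustration
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Deliberately small grid (illustrative values) so the sketch runs quickly
gs = GridSearchCV(estimator=make_pipeline(StandardScaler(), SVC(random_state=1)),
                  param_grid={'svc__C': [0.1, 1.0, 10.0]},
                  scoring='accuracy',
                  cv=5)
gs.fit(X_train, y_train)

# With refit=True, best_estimator_ is already trained on all of X_train
score_via_search = gs.score(X_test, y_test)
score_via_estimator = gs.best_estimator_.score(X_test, y_test)
print(score_via_search, score_via_estimator)
```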


## Part 2: Algorithm selection with nested cross-validation
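Let's compare the accuracy of two algorithms, SVM and Decision Tree, with nested cross-validation: GridSearchCV (cv=2) is the inner loop that tunes the hyperparameters, and cross_val_score (cv=5) is the outer loop that estimates the accuracy.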

### SVC (Support Vector Classifier)

In [16]:
from sklearn.model_selection import cross_val_score
import numpy as np

gs = GridSearchCV( estimator=pipe_svc,
                   param_grid=param_grid,
                   scoring='accuracy',
                   cv=2  # Cross-validation k-fold k=2
                 )
accuracy_scores = cross_val_score( gs,
                                   X_train, y_train,
                                   scoring='accuracy',
                                   cv=5)
mean_accuracy = np.mean(accuracy_scores)
std_accuracy = np.std(accuracy_scores)
print( 'Cross-Validation Accuracy %.3f +/- %.3f' % (mean_accuracy, std_accuracy) )

Cross-Validation Accuracy 0.974 +/- 0.015


### Decision Tree

In [18]:
from sklearn.tree import DecisionTreeClassifier

param_grid = [ {'max_depth':[1,2,3,4,5,6,7,None]} ]
gs = GridSearchCV( estimator=DecisionTreeClassifier(random_state=0),
                   param_grid=param_grid,
                   scoring='accuracy',
                   cv=2
                 )
accuracy_scores = cross_val_score( gs, X_train, y_train, scoring='accuracy', cv=5 )
mean_accuracy = np.mean(accuracy_scores)
std_accuracy = np.std(accuracy_scores)
print( 'Cross-Validation Accuracy %.3f +/- %.3f' % (mean_accuracy, std_accuracy) )

Cross-Validation Accuracy 0.934 +/- 0.016


### SVC > Decision Tree by 0.04
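The nested cross-validation accuracies of SVM and Decision Tree are:
* mean: 0.974 and 0.934
* standard deviation: 0.015 and 0.016

The gap in mean accuracy (0.040) is larger than either standard deviation, so SVM is a better choice than Decision Tree for this dataset.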
