# The story so far

![](../images/ml_workflow.png)

## Evaluation Metrics:

* Machine learning algorithms learn from feedback.
* Evaluation metrics are used to decide how well they performed on a given dataset with given hyper-parameters.
* Evaluation metrics should always be tied to the *business objectives*.

## Example

For example,

* We are trying to detect credit card fraud.
* Occurrence rate of fraud is 3 in 1000.
* Let's say our model predicted the following:

| |Fraud|Not Fraud|
|---|---|---|
|Predicted Fraud|2|1|
|Predicted not Fraud|1|996|

* If we calculate the accuracy of the model, it will be 99.8%. However, that is not an accurate representation of the model's performance: one of the three actual fraud cases was missed.
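
To see why accuracy is misleading here, the counts from the table can be plugged into a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and recall (the share of actual fraud cases that were caught) is used as one possible alternative view.

```python
# Counts taken from the fraud table above
tp = 2    # fraud, predicted fraud
fn = 1    # fraud, predicted not fraud
fp = 1    # not fraud, predicted fraud
tn = 996  # not fraud, predicted not fraud

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)   # share of actual fraud cases that were caught

print("accuracy: {:.3f}".format(accuracy))  # accuracy: 0.998
print("recall:   {:.3f}".format(recall))    # recall:   0.667
```

The model looks almost perfect by accuracy, yet it misses a third of the fraud cases.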

## Choosing a Metric

Let us try to design an evaluation metric that is more appropriate for our problem.

* It is crucial to identify fraud cases correctly, hence every fraud case identified as *not fraud* should be heavily penalised.
* Predicting a genuine case as fraud, although undesirable, does not have as severe an impact on the business.


* Hence, let's assign a penalty of 5 for every fraud case classified as not fraud,
* and a penalty of 1 for every genuine case identified as fraud.

* Hence, our new metric could be 

> metric = (5 * false negative + 1 * false positive) / 6
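
As a rough illustration, the metric above can be written as a small function. The name `weighted_penalty` and its parameters are hypothetical, chosen only for this sketch; lower values are better.

```python
def weighted_penalty(false_negatives, false_positives, fn_weight=5, fp_weight=1):
    """Weighted error count, normalised by the sum of the weights."""
    return (fn_weight * false_negatives + fp_weight * false_positives) / (fn_weight + fp_weight)

# Model from the example above: 1 missed fraud, 1 false alarm
print(weighted_penalty(false_negatives=1, false_positives=1))  # 1.0

# Hypothetical model that misses 3 frauds but raises no false alarms
print(weighted_penalty(false_negatives=3, false_positives=0))  # 2.5
```

The second (hypothetical) model makes the same kind of mistake the business cares most about, so it scores worse even though it raises no false alarms.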

* Let's look at some of the metrics that are used in real-life machine learning problems.

## Classification Metrics

1. Classification Accuracy.
2. Logarithmic Loss.
3. F1 Score.
4. Area Under ROC Curve.
5. Confusion Matrix.
6. Classification Report.

## Classification Accuracy

Classification accuracy is **the number of correct predictions made as a ratio of all predictions made.**

* It is the most common evaluation metric for classification problems.
* It is suitable only when both of the following hold, which is often not the case:
    * There are an equal number of observations in each class.
    * All predictions and prediction errors are equally important.

Below is an example of calculating classification accuracy.

* sklearn returns a ratio. 
* This can be converted into a percentage by multiplying the value by 100.

Let's see how to use accuracy with sklearn.

`sklearn`'s `metrics` module provides easy-to-use APIs for these evaluation metrics.

`DEMO 2.1`

In [None]:
from sklearn.metrics import accuracy_score

y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)

## Example with a Dataframe

Let's see a dataframe example.

`DEMO 2.2`

In [None]:
# Logistic Regression
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# load the iris datasets
dataset = datasets.load_iris()

# fit a logistic regression model to the data
model = LogisticRegression()
model.fit(dataset.data, dataset.target)
print(model)

# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)

# summarize the fit of the model
print(metrics.accuracy_score(expected, predicted))

## With Cross-validation

Quite often, we use cross-validation techniques, and use the evaluation metrics to evaluate each round's performance. 

`DEMO 2.3`

In [None]:
# Cross Validation Classification Accuracy
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)

array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7

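# shuffle=True is required for random_state to have an effect in recent scikit-learn versions
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)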
model = LogisticRegression()

scoring = 'accuracy'

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("Accuracy: {:.3f} ({:.3f})".format(results.mean(), results.std()))

## Logarithmic Loss

Logarithmic loss (or logloss) is a performance metric for evaluating the predictions of probabilities of membership to a given class.

> Log Loss = $- \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^M y_{ij} \log \, p_{ij}$

where 

* N is the number of samples or instances, 
* M is the number of possible labels, 
* $y_{ij}$ is a binary indicator of whether or not label j is the correct classification for instance i,
* $p_{ij}$ is the model probability of assigning label j to instance i.

For two classes (M = 2) this reduces to $- \frac{1}{N} \sum_{i=1}^N [y_{i} \log \, p_{i} + (1 - y_{i}) \log \, (1 - p_{i})]$.


* The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm.
* Predictions that are correct or incorrect are rewarded or punished proportionally to the confidence of the prediction.
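
The effect of confidence on the score can be sketched in plain Python. This is an illustrative implementation, not scikit-learn's: because the indicator is 1 only for the correct label, each instance contributes the log of the probability assigned to its true class. The labels and probabilities below are made up for the example.

```python
import math

def log_loss(y_true, probs):
    """Mean negative log-probability assigned to the correct label.

    y_true: correct label index for each instance
    probs:  for each instance, a list with one probability per label
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        total += math.log(p[label])
    return -total / len(y_true)

y_true = [0, 1, 1]

# Same first two predictions; the third is confident and right vs. confident and wrong
confident_right = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
confident_wrong = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]

print(round(log_loss(y_true, confident_right), 3))  # 0.145
print(round(log_loss(y_true, confident_wrong), 3))  # 0.607
```

A single confident mistake roughly quadruples the loss in this toy case.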

Below is an example of calculating logloss for logistic regression predictions on the Pima Indians onset of diabetes dataset.

* logloss nearer to 0 is better, with 0 representing a perfect logloss. 

`DEMO 2.4`

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

X = array[:,0:8]
Y = array[:,8]
seed = 7

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = LogisticRegression()

# scikit-learn reports the negated log loss, so that higher is always better
scoring = 'neg_log_loss'

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("Logloss: {:.3f} ({:.3f})".format(results.mean(), results.std()))

## F-1 score

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/7d63c1f5c659f95b5dfe5893213cc8ea7f8bea0a)

Where

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/26106935459abe7c266f7b1ebfa2a824b334c807)



![](https://wikimedia.org/api/rest_v1/media/math/render/svg/4c233366865312bc99c832d1475e152c5074891b)

* tp = true positive
* tn = true negative
* fp = false positive
* fn = false negative

![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/440px-Precisionrecall.svg.png)
[source](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/440px-Precisionrecall.svg.png)
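
A small worked example may help connect the counts to the score. The sketch below assumes the usual definitions (precision = tp / (tp + fp), recall = tp / (tp + fn), F1 as their harmonic mean); the counts are hypothetical, and the function does not guard against division by zero.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # how many predicted positives were right
    recall = tp / (tp + fn)      # how many actual positives were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts chosen for illustration
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, round(f1, 3))  # 0.75 0.6 0.667
```

Note that true negatives do not appear anywhere in the calculation, which is why F1 is often preferred over accuracy when the negative class dominates.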

`DEMO 2.5`

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

X = array[:,0:8]
Y = array[:,8]
seed = 7

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = LogisticRegression()

scoring = 'f1'

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("f1 score: {:.3f} {:.3f}".format(results.mean(), results.std()))

## Area Under ROC Curve

* Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems.


* The AUC represents a model’s ability to discriminate between positive and negative classes. 
* An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random. 


![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/ROC_curves.svg/709px-ROC_curves.svg.png)
[source](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/ROC_curves.svg/709px-ROC_curves.svg.png)

* ROC can be broken down into sensitivity and specificity. 
* A binary classification problem is really a trade-off between sensitivity and specificity.


* Sensitivity is the true positive rate, also called recall. 
* It is the proportion of instances from the positive (first) class that were predicted correctly.


* Specificity is also called the true negative rate. 
* It is the proportion of instances from the negative (second) class that were predicted correctly.
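
The following sketch, using made-up labels and scores, shows how sensitivity and specificity follow from a thresholded prediction, and one way to compute AUC directly: as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The helper names are illustrative, not part of any library.

```python
def sensitivity_specificity(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0))
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Turn scores into hard predictions with a 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(sensitivity_specificity(y_true, y_pred))  # (0.5, 1.0)
print(auc(y_true, scores))                      # 0.75
```

Changing the threshold trades sensitivity against specificity, while the AUC summarises the ranking quality across all thresholds.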

`DEMO 2.6`

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)

array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = LogisticRegression()

scoring = 'roc_auc'

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("AUC: {:.3f} ({:.3f})".format(results.mean(), results.std()))

## Confusion Matrix

* The confusion matrix is a handy presentation of the accuracy of a model with two or more classes.

![](http://www.dataschool.io/content/images/2015/01/confusion_matrix_simple2.png)
[source](http://www.dataschool.io/content/images/2015/01/confusion_matrix_simple2.png)

* **true positives** (TP): We predicted yes, and it was actually yes.
* **true negatives** (TN): We predicted no, and it was actually no.
* **false positives** (FP): We predicted yes, but it was actually no. (Also known as a "Type I error.")
* **false negatives** (FN): We predicted no, but it was actually yes. (Also known as a "Type II error.")



* The table presents predictions on the x-axis and actual outcomes on the y-axis. 
* The cells of the table are the number of predictions made by a machine learning algorithm.

True Negative and True Positive predictions fall on the diagonal of the matrix.
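
To make the layout concrete, here is a minimal sketch that builds a 2x2 confusion matrix by counting, with actual classes as rows and predicted classes as columns. The function name and the labels are hypothetical.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are actual classes (0, 1); columns are predicted classes (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Hypothetical labels for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

(tn, fp), (fn, tp) = confusion_matrix_2x2(y_true, y_pred)
print(tn, fp, fn, tp)  # 3 1 1 3
```

The correct predictions (tn and tp) sit on the diagonal; everything off the diagonal is an error.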

`DEMO 2.7`

In [None]:
from sklearn.metrics import confusion_matrix

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)

array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
test_size = 0.33
seed = 7

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)

predicted = model.predict(X_test)
matrix = confusion_matrix(Y_test, predicted)
matrix

## Classification Report

* Scikit-learn provides a convenience report for classification problems that gives a quick idea of a model's performance using several measures: precision, recall, F1 score and support for each class.

`DEMO 2.8`

In [None]:
from sklearn.metrics import classification_report

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)

array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
test_size = 0.33
seed = 7

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
predicted = model.predict(X_test)

report = classification_report(Y_test, predicted)
print(report)

# Regression Metrics

## Overview

In this section we will review 3 of the most common metrics for evaluating predictions on regression machine learning problems:

1. Mean Absolute Error.
2. Mean Squared Error.
3. $R^2$.

## Mean Absolute Error

* The Mean Absolute Error (or MAE) is the average of the absolute differences between predictions and actual values. 
* It gives an idea of how wrong the predictions were.

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/3ef87b78a9af65e308cf4aa9acf6f203efbdeded)

* The measure gives an idea of the magnitude of the error, but no idea of the direction (e.g. over or under predicting).

A value of 0 indicates no error or perfect predictions.
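
As a quick illustration, MAE can be computed by hand in a couple of lines. The values below are made up for the example.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual values and predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(mean_absolute_error(y_true, y_pred))  # 0.5
```

The first prediction is too low and the last is too high, but both simply add to the error: the sign of each difference is discarded.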

`DEMO 2.9`

In [None]:
from sklearn.linear_model import LinearRegression

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, sep=r'\s+', names=names)

array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
seed = 7

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = LinearRegression()

scoring = 'neg_mean_absolute_error'

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MAE: {:.3f} ({:.3f})".format(results.mean(), results.std()))

## Mean Squared Error

* The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a general idea of the magnitude of error.

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/67b9ac7353c6a2710e35180238efe54faf4d9c15)

* Taking the square root of the mean squared error converts the units back to the original units of the output variable and can be meaningful for description and presentation. 
* This is called the Root Mean Squared Error (or RMSE).

* scikit-learn's `neg_mean_squared_error` scorer returns the negated MSE, so that higher scores are always better. 
* Remember to take the absolute value before taking the square root if you are interested in calculating the RMSE.
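
The relationship between MSE, RMSE and a negated score can be sketched as follows. The data is hypothetical, and `neg_mse` merely imitates a score that has been sign-flipped as described above.

```python
import math

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual values and predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
rmse = math.sqrt(mse)
print(mse)             # 0.375
print(round(rmse, 3))  # 0.612

# A negated score needs its sign removed before taking the square root
neg_mse = -mse
print(round(math.sqrt(abs(neg_mse)), 3))  # 0.612
```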

`DEMO 2.10`

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, sep=r'\s+', names=names)

array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
seed = 7

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = LinearRegression()

scoring = 'neg_mean_squared_error'

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MSE: {:.3f} ({:.3f})".format(results.mean(), results.std()))

## $R^2$ Metric

* The $R^2$ (or R Squared) metric provides an indication of the goodness of fit of a set of predictions to the actual values. 
* In statistical literature, this measure is called the coefficient of determination.

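This is typically a value between 0 and 1, for no fit and perfect fit respectively. It can be negative when the model fits worse than always predicting the mean of the target.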
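
A hand-rolled version of $R^2$ shows where the reference points come from. This is a sketch using the standard definition (one minus the ratio of residual to total sum of squares) on made-up data; it does not handle a constant target, where the denominator would be zero.

```python
def r2_score(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

print(r2_score(y_true, [1.0, 2.0, 3.0, 4.0]))  # 1.0  (perfect fit)
print(r2_score(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0  (always predicts the mean)
print(r2_score(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0 (worse than predicting the mean)
```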

`DEMO 2.11`

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, sep=r'\s+', names=names)

array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
seed = 7

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = LinearRegression()

scoring = 'r2'

results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("R sq: {:.3f} ({:.3f})".format(results.mean(), results.std()))

---

# Sampling for Machine Learning

## Overview

* Validation is a technique which involves reserving a particular sample of a data set on which you do not train the model.
* Later, you test the model on this sample before finalizing the model (model validation).

Methods for model validation:

1. Holdout sets
2. Cross-validation
    - k-fold validation
3. Leave-one-out

## Why we validate

* Preventing information leakage
* Measuring the model's ability to generalise

## Validation through Holdout Set

* Keep a portion of the training data separate for validation.

`sklearn`'s `model_selection` module provides the `train_test_split` API. 

## Example

In the following example, the nearest-neighbor classifier is about 90% accurate on this hold-out set.

The hold-out set is similar to unknown data, because the model has not "seen" it before.

`DEMO 2.12`

In [None]:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target

model = KNeighborsClassifier(n_neighbors=1)

# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(X, y, random_state=0,
                                  train_size=0.5)

# fit the model on one set of data
model.fit(X1, y1)

# evaluate the model on the second set of data
y2_model = model.predict(X2)
accuracy_score(y2, y2_model)

## Model validation via cross-validation

* A disadvantage of using a holdout set for model validation is that a portion of our data is lost to model training.

* This is not optimal, and can cause problems – especially if the initial set of training data is small.

* One way to address this is to use *cross-validation*; do a sequence of fits where each subset of the data is used both as a training set and as a validation set.

![](https://github.com/jakevdp/PythonDataScienceHandbook/raw/475499f1464bcdf96e618c922a8e6c92b190ee9a/notebooks/figures/05.03-2-fold-CV.png)
[source](https://github.com/jakevdp/PythonDataScienceHandbook/raw/475499f1464bcdf96e618c922a8e6c92b190ee9a/notebooks/figures/05.03-2-fold-CV.png)

Here we do two validation trials, alternately using each half of the data as a holdout set.
Using the split data from before, we could implement it like this:

`DEMO 2.13`

In [None]:
y2_model = model.fit(X1, y1).predict(X2)
y1_model = model.fit(X2, y2).predict(X1)
accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)

## Notes

* Accuracy scores could be combined (by, say, taking the mean) to get a better measure of the global model performance.
* This particular form of cross-validation is a *two-fold cross-validation*

## Model validation through k-fold validation

We could expand on this idea to use even more trials, and more folds in the data:

![](https://github.com/jakevdp/PythonDataScienceHandbook/raw/475499f1464bcdf96e618c922a8e6c92b190ee9a/notebooks/figures/05.03-5-fold-CV.png)
[source](https://github.com/jakevdp/PythonDataScienceHandbook/raw/475499f1464bcdf96e618c922a8e6c92b190ee9a/notebooks/figures/05.03-5-fold-CV.png)

* Here we split the data into five groups, and use each of them in turn to evaluate the model fit on the other 4/5 of the data.
* We can use Scikit-Learn's ``cross_val_score`` convenience routine to do it succinctly:

`DEMO 2.14`

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
scores.mean()
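
The splitting behind k-fold validation can be sketched without any library. The generator below is a simplified, illustrative version (consecutive folds, no shuffling); `k_fold_indices` is a hypothetical helper, not a scikit-learn function.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1   # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(n_samples=10, n_splits=5):
    print(test, train)
# first line printed: [0, 1] [2, 3, 4, 5, 6, 7, 8, 9]
```

Every sample lands in exactly one test fold, so each data point is used for validation once and for training in the remaining rounds.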

## Model validation using Leave-one-out validation (optional)

* In this approach, we reserve only one data point of the available data set and train the model on the rest. This process is repeated for each data point.
* We make use of all data points, hence low bias.
* This approach leads to higher variation in testing model effectiveness because we test against one data point. 
* So, our estimate gets highly influenced by that data point. 
* If the data point turns out to be an outlier, it can lead to higher variation.

`DEMO 2.15`

In [None]:
from sklearn.model_selection import LeaveOneOut, cross_val_score
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
scores

## Notes 

* Because we have 150 samples, the leave-one-out cross-validation yields scores for 150 trials, and each score indicates either a successful (1.0) or unsuccessful (0.0) prediction.
* Taking the mean of these gives an estimate of the accuracy (one minus the error rate):

`DEMO 2.16`

In [None]:
scores.mean()

---

# Training Models

1. Get Data
2. Choose a class of model
3. Choose model hyperparameters
4. Arrange data into a features matrix and target vector
5. Fit the model to your data
6. Predict labels for unknown data
7. Assess Model

* Collecting Data
* Exploratory Data Visualization
* Data pre-processing
* Feature Extraction
* Sampling
* Model Training and Tuning
* Model Assessment
* Model Deployment
* Importance of having a Pipeline

# Hyperparameter Tuning

## Parameters vs Hyperparameters

* Hyperparameters are parameters whose values are set prior to the commencement of the learning process.
* By contrast, the value of other parameters is derived via training.

## What is Hyperparameter Tuning

Hyperparameter optimization or model selection is the problem of choosing a set of optimal hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set.

## What is Hyperparameter Tuning about

To summarize, hyperparameters:

* Define higher level concepts about the model such as complexity, or capacity to learn.
* Cannot be learned directly from the data in the standard model training process and need to be predefined.
* Can be decided by setting different values, training different models, and choosing the values that test better.

## Some examples of hyperparameters

* Number of leaves or depth of a tree
* Number of latent factors in a matrix factorization
* Learning rate (in many models)
* Number of hidden layers in a deep neural network
* Number of clusters in a k-means clustering

## Overview of Methods

* Grid Search
* Random Search
* Bayesian Search (we'll not cover this one)


## Grid Search

* Machine learning models are parameterized so that their behavior can be tuned for a given problem. 
* Models can have many parameters and finding the best combination of parameters can be treated as a search problem.

* Grid search is an approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.
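
The idea can be sketched in plain Python before turning to scikit-learn. Here `validation_score` is a hypothetical stand-in for training a model and scoring it on validation data, and the parameter names `alpha` and `depth` are chosen only for illustration.

```python
from itertools import product

def validation_score(alpha, depth):
    """Hypothetical stand-in for 'train a model and score it on validation data'."""
    return -((alpha - 0.1) ** 2) - 0.01 * (depth - 3) ** 2

param_grid = {"alpha": [1, 0.1, 0.01], "depth": [2, 3, 4]}

best_score, best_params = None, None
names = list(param_grid)
# Try every combination in the grid: 3 x 3 = 9 candidates
for values in product(*(param_grid[name] for name in names)):
    params = dict(zip(names, values))
    score = validation_score(**params)
    if best_score is None or score > best_score:
        best_score, best_params = score, params

print(best_params)  # {'alpha': 0.1, 'depth': 3}
```

The number of candidates is the product of the list lengths, which is why grids grow quickly as more parameters are added.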

The following code evaluates different alpha values for the Ridge Regression algorithm on the diabetes dataset.

`DEMO 2.17`

In [None]:
# Grid Search for Algorithm Tuning
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# load the diabetes datasets
dataset = datasets.load_diabetes()

# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])

# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(dataset.data, dataset.target)

print(grid)

# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

## Random Search

* Random search is an approach to parameter tuning that samples algorithm parameters from a random distribution (e.g. uniform) for a fixed number of iterations. 
* A model is constructed and evaluated for each combination of parameters chosen. 
 
The following code evaluates random alpha values between 0 and 1 for the Ridge Regression algorithm on the diabetes dataset. 
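
A bare-bones version of the procedure might look like this. `validation_score` is again a hypothetical stand-in for training and scoring a model, and the seed and iteration count are arbitrary choices for the sketch.

```python
import random

def validation_score(alpha):
    """Hypothetical stand-in for 'train a model and score it on validation data'."""
    return -((alpha - 0.3) ** 2)

rng = random.Random(7)   # fixed seed so the run is repeatable

best_score, best_alpha = None, None
for _ in range(100):                 # fixed number of iterations
    alpha = rng.uniform(0.0, 1.0)    # sample from a uniform distribution
    score = validation_score(alpha)
    if best_score is None or score > best_score:
        best_score, best_alpha = score, alpha

print(round(best_alpha, 2))
```

Unlike a grid, the budget (here 100 evaluations) is set independently of how many parameters are being searched.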

`DEMO 2.18`

In [None]:
# Randomized Search for Algorithm Tuning
import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# load the diabetes datasets
dataset = datasets.load_diabetes()

# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}

# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(dataset.data, dataset.target)
print(rsearch)

# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

## Grid Search vs Random Search: Which one is better?

* http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
* https://medium.com/rants-on-machine-learning/smarter-parameter-sweeps-or-why-grid-search-is-plain-stupid-c17d97a0e881#.z27w2p3sr
* http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html
* Hyperparameter Tuning - http://blog.sigopt.com/post/144221180573/evaluating-hyperparameter-optimization-strategies

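Bergstra and Bengio (the first link above) found that, for the same computational budget, random search often finds models as good as or better than grid search, because typically only a few hyperparameters really matter.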

![](https://cdn-images-1.medium.com/max/936/1*ZTlQm_WRcrNqL-nLnx6GJA.png)
source: http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf

`DEMO 2.19`

In [None]:
import numpy as np

from time import time
from scipy.stats import randint as sp_randint

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# get some data
digits = load_digits()
X, y = digits.data, digits.target

# build a classifier
clf = RandomForestClassifier(n_estimators=20)

# Utility function to report best scores from a search's cv_results_
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        # all candidates that share rank i (ties are possible)
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 100
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

In [None]:
# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)

## Summary

* The randomized search and the grid search explore exactly the same space of parameters.
* The result in parameter settings is quite similar, while the run time for randomized search can be drastically lower.
* The performance is slightly worse for the randomized search, though this is most likely a noise effect and would not carry over to a held-out test set.
* Note that in practice, one would not search over this many different parameters simultaneously using grid search, but pick only the ones deemed most important.

# Parameter Estimation with grid search and cross-validation

* The development set comprises only half of the available labeled data.
* The performance of the selected hyper-parameters and trained model is then measured on a dedicated evaluation set that was not used during the model selection step.

`DEMO 2.20`

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    # cv_results_ holds per-candidate mean and std of the test scores
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()