# How “good” is your model, and how can you make it better? 


What distinguishes “true artists” from “one-hit wonders” in machine learning is an understanding of how a model performs with respect to different data. This hands-on tutorial will show you how to use scikit-learn’s model evaluation functions to evaluate different models in terms of accuracy and generalisability, and search for optimal parameter configurations.

## 1. Load the required libraries

In [None]:
import scipy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.cross_validation as cv

# Extra plotting functionality
import visplots

from sklearn import preprocessing, metrics 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from scipy.stats.distributions import randint
from multilayer_perceptron import multilayer_perceptron

%matplotlib inline

## 2. Exploring and pre-processing data

The dataset we will be using throughout this workshop is an adapted version of the wine quality case study, available from the UCI Machine Learning repository at https://archive.ics.uci.edu/ml/datasets/Wine+Quality. The first thing you will need to do in order to work with the wine dataset is to read the contents from the provided wine.csv data file using the `read_csv` command:

In [None]:
wine   = pd.read_csv("data/wine.csv")
header = wine.columns.values

At this point, you should try to explore the first few rows of the imported wine DataFrame using the "`head`" function from the `pandas` package (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html):

In [None]:
### Write your code here ###

# Solution #
wine.head() 

In order to feed the data into our classification models, the imported wine DataFrame needs to be converted into a `numpy` array. Subsequently, we need to split our initial dataset into the data matrix X (independent variables) and the associated class vector y (dependent or target variable).

In [None]:
# Convert to numpy array
npArray = np.array(wine)

X = npArray[:,:-1]
y = npArray[:,-1].astype(int)

It is always a good practice to check the dimensionality of the imported data prior to constructing any Machine Learning models. <br/> Try printing the size of the input matrix X and class vector y using the "`shape`" command: 

In [None]:
print "X dimensions:", ### Write your code here ###
print "y dimensions:", ### Write your code here ###

# Solution #
# print "X dimensions:", X.shape 
# print "y dimensions:", y.shape 

Based on the class vector y, the wine samples are classified into two distinct categories of high quality (class 1) and low quality (class 0).
<br/><br/>An important thing to understand before applying any classification algorithms is how the output labels are distributed. Are they evenly distributed? Imbalances in the distribution of labels can often lead to poor classification results for the minority class even if the classification results for the majority class are very good. For the purposes of this workshop, the ratio between the two classes has been kept constant.

In [None]:
yFreq = scipy.stats.itemfreq(y)
print yFreq
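The class distribution can also be inspected with `numpy` alone, which may be handy if `itemfreq` is not available in your SciPy version. This is only an illustrative sketch: `labels` is a made-up toy vector standing in for `y`.

```python
import numpy as np

# Toy label vector standing in for y (hypothetical values)
labels = np.array([0, 1, 1, 0, 1, 0, 0, 1])

# Unique class labels and how many times each one occurs
classes, counts = np.unique(labels, return_counts=True)

# Proportion of samples that fall into each class
proportions = counts / float(len(labels))

for c, n, p in zip(classes, counts, proportions):
    print("class %d: %d samples (%.0f%%)" % (c, n, 100 * p))
```

Proportions far from an even split would be a warning sign that accuracy alone may hide poor performance on the minority class.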

It is usually advisable to scale your data prior to fitting a classification model. The main advantage of scaling is to avoid attributes of greater numeric ranges dominating those in smaller numeric ranges. For the purposes of this case study, we are applying auto-scaling on the whole X dataset. (Auto-scaling: mean-centering is initially applied per column, followed by scaling where the centered columns are divided by their standard deviation). 

Use as a reference the `sklearn` preprocessing documentation page in order to scale your data (http://scikit-learn.org/stable/modules/preprocessing.html)  

In [None]:
### Write your code here ###

# Solution #
X = preprocessing.StandardScaler().fit_transform(X) 
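To make the auto-scaling step concrete, the sketch below applies the same two operations (mean-centring, then division by the standard deviation) by hand with `numpy`. The small `data` matrix is a hypothetical example, not the wine data.

```python
import numpy as np

# Toy data matrix with two columns on very different scales (hypothetical values)
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

# Mean-centre each column, then divide by the column standard deviation
centred = data - data.mean(axis=0)
scaled = centred / data.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # 1 for every column
```

After scaling, both columns vary over the same range, so neither dominates a distance calculation.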

You can visualise the relationship between two variables (features) using a simple scatter plot. This step can give you a good first indication of the model and the complexity (linear vs. non-linear) of the algorithm you may need to investigate. At this stage, let’s plot the first two variables against each other:

In [None]:
f0 = 0 
f1 = 1

plt.scatter(X[y==0, f0], X[y==0, f1], color = 'b', edgecolors='black', label='Low Quality')
plt.scatter(X[y==1, f0], X[y==1, f1], color = 'r', edgecolors='black', label='High Quality')
plt.xlabel(header[f0])
plt.ylabel(header[f1])
plt.legend()
plt.show()

You can change the values of *f0* and *f1* to values of your own choice in order to investigate the relationship between different features. 

## 3. Training and testing a classifier

Training and testing a classification model on the same dataset is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data (poor generalisation). To use different datasets for training and testing, we need to split the wine dataset into two disjoint sets: train and test (**Holdout method**). <br/> 

In [None]:
XTrain, XTest, yTrain, yTest = cv.train_test_split(X, y, test_size= 0.3, random_state=1)

XTrain and yTrain are the two arrays you use to train your model. XTest and yTest are the two arrays that you use to evaluate your model. By default, scikit-learn splits the data so that 25% of it is used for testing, but you can also specify the proportion of data you want to use for training and testing (in this case, 30% is used for testing).

You can check the sizes of the different training and test sets by using the shape attribute:

In [None]:
print "XTrain dimensions:", XTrain.shape
print "yTrain dimensions:", yTrain.shape
print "XTest dimensions:",  XTest.shape
print "yTest dimensions:",  yTest.shape
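The holdout idea itself is simple enough to sketch by hand: shuffle the row indices and cut them into two disjoint groups. The helper `holdout_split` below is a hypothetical illustration, not scikit-learn's implementation, and its rounding rule for the test set size is an assumption.

```python
import numpy as np

def holdout_split(n_samples, test_size=0.3, seed=1):
    """Return disjoint arrays of train and test row indices."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(test_size * n_samples))
    # The first n_test shuffled rows form the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = holdout_split(10, test_size=0.3)
print("train rows: %s" % np.sort(train_idx))
print("test rows: %s" % np.sort(test_idx))
```

Because every row index appears in exactly one of the two arrays, no sample used for training can leak into the evaluation.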

## 4. KNN

To build KNN models using scikit-learn, you will be using the `KNeighborsClassifier` class, which allows you to set the value of K using the `n_neighbors` parameter. The optimal choice of the value K is highly data-dependent: in general a larger K suppresses the effects of noise, but makes the classification boundaries less distinct. <br/>

### 4.1 Uniform weights

We are going to start by trying two arbitrary values of K and comparing their performance. For every classification model built with sklearn, we will come across four main steps: 1) Building the classification model using default, pre-defined or optimised parameters, 2) Training, 3) Testing, and 4) Reporting and evaluating the performance metrics.

In [None]:
# Build the classifier 
knn3 = KNeighborsClassifier(n_neighbors=3)

# Train (fit) the model
knn3.fit(XTrain, yTrain)

# Test (predict)
yPredK3 = knn3.predict(XTest)

# Report the performance metrics
print metrics.classification_report(yTest, yPredK3)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, yPredK3), 2)
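The numbers in the classification report can be reproduced by hand from counts of correct and incorrect predictions. The sketch below does this for the positive class (class 1) on a pair of made-up label vectors; `y_true` and `y_pred` are hypothetical and unrelated to the wine data.

```python
import numpy as np

# Hypothetical true and predicted labels for eight test samples
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Counts for the positive class (class 1)
tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives

accuracy = np.mean(y_true == y_pred)         # fraction of correct predictions
precision = tp / float(tp + fp)              # how many predicted positives are right
recall = tp / float(tp + fn)                 # how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f"
      % (accuracy, precision, recall, f1))
```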

We can visualise the classification boundary created by the KNN classifier using the built-in function `visplots.knnDecisionPlot`. For easier visualisation, only the test samples are depicted in the plot. Remember though that the decision boundary has been built using the _training_ data! <br/> 

In [None]:
visplots.knnDecisionPlot(XTrain, yTrain, XTest, yTest, n_neighbors= 3, weights="uniform")

Let us try a larger value of K, for instance K = 99 (or an *odd* number of your own choice). Can you generate the KNN model and print the metrics for a larger K, using the previous example as guidance?

In [None]:
############################################  
# Write your code here 
# 1. Build the KNN classifier for larger K
# 2. Train (fit) the model
# 3. Test (predict)
# 4. Report the performance metrics
############################################


###  Solution ### 
knn99 = KNeighborsClassifier(n_neighbors=99)
knn99.fit(XTrain, yTrain)
yPredK99 = knn99.predict(XTest)

print metrics.classification_report(yTest, yPredK99)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, yPredK99), 2)

Try to visualise the boundaries as before using the K neighbours of your choice and the `knnDecisionPlot` command from  `visplots`. What do you observe? 

In [None]:
### Write your code here ### 

###  Solution ### 
visplots.knnDecisionPlot(XTrain, yTrain, XTest, yTest, n_neighbors= 99, weights="uniform")

**Answer: <br/> For smaller values of K, the decision boundaries present many "creases", and the models may overfit the training data. For larger values of K, the decision boundaries are less distinct and tend towards linearity; such boundaries may be too simple to capture the structure of the data, leading to underfitting.**

### 4.2 Distance weights

Under some circumstances, it is better to give more importance ("weight" in computing terms) to nearer neighbours. When weights = "distance", weights are assigned to the training data points in a way that is proportional to the inverse of the distance from the query point. In other words, nearer neighbours contribute more to the fit. <br/>

What if we use weights based on distance? Does it improve the overall performance?

In [None]:
# Build the classifier with two parameters
knnW3 = KNeighborsClassifier(n_neighbors=3, weights='distance')

# Train (fit) the model
knnW3.fit(XTrain, yTrain)

# Test (predict)
predictedW3 = knnW3.predict(XTest)

# Report the performance metrics
print metrics.classification_report(yTest, predictedW3)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, predictedW3), 2)
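The effect of the two weighting schemes can be seen on a tiny example. The function `knn_predict` below is a simplified, hypothetical re-implementation of the voting step for a single query point; it is not how scikit-learn is implemented, and the toy data are invented so that the two schemes disagree.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3, weights="uniform"):
    """Predict the class of one query point by a (weighted) vote of its k nearest neighbours."""
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    if weights == "distance":
        # Inverse-distance weights; the small constant avoids division by zero
        w = 1.0 / (dists[nearest] + 1e-12)
    else:
        w = np.ones(k)
    classes = np.unique(train_y)
    # Sum the weights of the neighbours belonging to each class
    votes = [w[train_y[nearest] == c].sum() for c in classes]
    return classes[int(np.argmax(votes))]

# Hypothetical 1-D training set: one class-0 point very close to the query,
# two class-1 points further away
train_X = np.array([[0.9], [2.0], [2.1], [5.0]])
train_y = np.array([0, 1, 1, 1])
query = np.array([1.0])

print(knn_predict(train_X, train_y, query, k=3, weights="uniform"))
print(knn_predict(train_X, train_y, query, k=3, weights="distance"))
```

With uniform weights the two more distant class-1 neighbours outvote the single close class-0 neighbour; with distance weights the close neighbour dominates.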

### 4.3 Tuning KNN 

The sklearn library provides the grid search class `GridSearchCV` (http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html), which allows us to search for the optimum combination of parameters by evaluating models trained with a particular algorithm with all provided parameter combinations. Further details and examples on grid search with sklearn can be found at http://scikit-learn.org/stable/modules/grid_search.html <br/>

You can use `GridSearchCV` to search for a parameterisation of the KNN algorithm that gives a better model:

In [None]:
# Define the parameters to be optimised and their values/ranges
n_neighbors = np.arange(1, 51, 2)  
weights     = ['uniform','distance']

# Construct a dictionary of hyperparameters
parameters = [{'n_neighbors': n_neighbors, 'weights': weights}]

# Apply a grid search with 10-fold cross-validation using the dictionary of parameters
grid = GridSearchCV(KNeighborsClassifier(), parameters, cv=10)
grid.fit(XTrain, yTrain)

# Print the optimal parameters
print "Best parameters: n_neighbors=", grid.best_params_['n_neighbors'], "and weight=", grid.best_params_['weights']
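Conceptually, a cross-validated grid search is two nested loops: one over every parameter combination and one over the folds. The sketch below spells this out with `numpy` and `itertools`. The helpers `knn_accuracy` and `grid_search` and the toy data are hypothetical, and the fold construction is a simplification of what scikit-learn does.

```python
import itertools
import numpy as np

def knn_accuracy(train_X, train_y, test_X, test_y, n_neighbors):
    """Accuracy of a plain majority-vote KNN (binary labels 0/1)."""
    correct = 0
    for x, label in zip(test_X, test_y):
        dists = ((train_X - x) ** 2).sum(axis=1)
        nearest = np.argsort(dists)[:n_neighbors]
        predicted = int(train_y[nearest].mean() > 0.5)
        correct += int(predicted == label)
    return correct / float(len(test_y))

def grid_search(X, y, grid, n_folds=5):
    """Return (best_params, best_score) over every combination in grid."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for i in range(n_folds):
            # Fold i is held out; all remaining folds are used for training
            test = folds[i]
            train = np.hstack([folds[j] for j in range(n_folds) if j != i])
            scores.append(knn_accuracy(X[train], y[train], X[test], y[test], **params))
        if np.mean(scores) > best_score:
            best_params, best_score = params, np.mean(scores)
    return best_params, best_score

# Hypothetical toy data: two well-separated clusters, shuffled
rng = np.random.RandomState(0)
toy_X = np.vstack([rng.normal(-3, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
toy_y = np.array([0] * 20 + [1] * 20)
order = rng.permutation(40)
toy_X, toy_y = toy_X[order], toy_y[order]

best_params, best_score = grid_search(toy_X, toy_y, {"n_neighbors": [1, 3, 5]})
print("best: %s, mean CV accuracy: %.2f" % (best_params, best_score))
```

The cost grows with the product of the number of values per parameter and the number of folds, which is why coarse grids are usually tried first.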

<br/> Let us graphically represent the results using a heatmap:

In [None]:
# grid_scores_ contains parameter settings and scores
scores = [x[1] for x in grid.grid_scores_]
scores = np.array(scores).reshape(len(n_neighbors), len(weights))
scores = np.transpose(scores)

# Make a heatmap with the performance
plt.figure(figsize=(12, 6))
plt.imshow(scores, interpolation='nearest', origin='lower', cmap=plt.cm.get_cmap('jet_r'))
plt.xticks(np.arange(len(n_neighbors)), n_neighbors)
plt.yticks(np.arange(len(weights)), weights)
plt.xlabel('Number of K nearest neighbors')
plt.ylabel('Weights')

# Add the colorbar
cbar = plt.colorbar()
cbar.set_label('Classification Accuracy', rotation=270, labelpad=20)

plt.show()

When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process (XTest). <Br/>
So, we are testing our independent XTest dataset using the optimised model:

In [None]:
# Build the classifier using the optimal parameters detected by grid search 
knn = KNeighborsClassifier(n_neighbors=grid.best_params_['n_neighbors'], weights = grid.best_params_['weights'])

# Train (fit) the model
knn.fit(XTrain, yTrain)

# Test (predict)
yPredKnn = knn.predict(XTest)

# Report the performance metrics
print metrics.classification_report(yTest, yPredKnn)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, yPredKnn), 2)

#### Randomized search on hyperparameters
Unlike `GridSearchCV`, `RandomizedSearchCV` does not exhaustively try all the parameter settings. Instead, it samples a fixed number of parameter settings from the specified distributions. The number of parameter settings that are tried is given by `n_iter`. If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used. You should use continuous distributions for continuous parameters. Further details can be found at http://scikit-learn.org/stable/modules/grid_search.html

In [None]:
param_dist = {'n_neighbors': randint(1,200)}
random_search = RandomizedSearchCV(KNeighborsClassifier(), param_distributions=param_dist, n_iter=20)
random_search.fit(XTrain, yTrain)

print "Best parameters: n_neighbors=", random_search.best_params_['n_neighbors']

neig = [score_tuple[0]['n_neighbors'] for score_tuple in random_search.grid_scores_]
res = [score_tuple[1] for score_tuple in random_search.grid_scores_]
plt.scatter(neig, res)
plt.xlabel('Number of K nearest neighbors')
plt.ylabel('Classification Accuracy')
plt.xlim(0,200)
plt.show()
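The difference between the two sampling modes described above can be illustrated with `numpy` alone. This is a rough sketch of the idea, not of scikit-learn's internals; `candidates` and the variable names are made up for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
n_iter = 20

# Parameter given as a distribution: draw n_iter values (repeats are possible)
from_distribution = rng.randint(1, 200, size=n_iter)  # integers in [1, 200)

# Parameter given as a list: draw distinct settings without replacement
candidates = np.arange(1, 200, 2)
from_list = rng.choice(candidates, size=n_iter, replace=False)

print("sampled from distribution: %s" % np.sort(from_distribution))
print("sampled from list: %s" % np.sort(from_list))
```

In both cases only `n_iter` settings are evaluated, regardless of how many values the parameter could take.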

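## 5. Final Exercise

At this point, you need to choose one or more of the classification techniques below, apply them to the wine dataset following the same four steps as for KNN, and compare their performance.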

### 5.1 Random Forests

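The random forests model aggregates a group of decision trees into an ensemble. Each tree is trained on a bootstrap sample of the training data, and the predictions of the individual trees are combined into a single prediction.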

One of the most important tuning parameters in building a random forest is the number of trees to construct.

In [None]:
#############################################################   
# Write your code here 
# 1. Build the RF classifier using the default parameters
# 2. Train (fit) the model
# 3. Test (predict)
# 4. Report the performance metrics
#############################################################

## Solution ## 
clf = RandomForestClassifier()
clf.fit(XTrain, yTrain)
predRF = clf.predict(XTest)

print metrics.classification_report(yTest, predRF)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, predRF),2)
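The aggregation idea behind an ensemble of trees can be sketched in a few lines. The example below shows a bootstrap sample of row indices and a simplified hard majority vote over made-up tree predictions. Note that this is an illustration only: scikit-learn's random forest averages the trees' predicted class probabilities rather than counting hard votes, and `tree_votes` is entirely hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 8

# Each tree is trained on a bootstrap sample: rows drawn with replacement
bootstrap_rows = rng.randint(0, n_samples, size=n_samples)

# Hypothetical class predictions of three trees (rows) for four test samples (columns)
tree_votes = np.array([[1, 0, 1, 0],
                       [1, 1, 0, 0],
                       [0, 1, 1, 0]])

# Simplified aggregation: predict class 1 when more than half of the trees vote for it
forest_prediction = (tree_votes.mean(axis=0) > 0.5).astype(int)
print(forest_prediction)
```

Adding more trees makes the combined vote more stable, which is why the number of trees is a key tuning parameter.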

In [None]:
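# Define the parameters to be optimised and their values/ranges
parameters = [{"n_estimators": [250, 500, 1000]}]
sample_leaf_options = [1, 5, 10, 50, 100, 200, 500]  # candidate leaf sizes (not yet used in the search)

# Apply a grid search with 10-fold cross-validation, as in task 4.3
grid = GridSearchCV(RandomForestClassifier(), parameters, cv=10)
grid.fit(XTrain, yTrain)

# Print the optimal parameters
print "Best parameters: n_estimators=", grid.best_params_['n_estimators']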


### 5.2 Support Vector Machines (SVMs)

SVMs attempt to build a decision boundary that accurately separates the samples of different classes by *maximizing* the margin between them.

#### Linear SVMs

The parameter C, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low C tolerates training misclassifications and allows softer margins, while for high C the misclassifications become more significant leading to hard-margin SVMs and potentially cases of overfitting. 

In this example, we will use linear SVMs with the default value for C:

In [None]:
################################################################### 
# Write your code here 
# 1. Build a linear SVM classifier using the default parameters
# 2. Train (fit) the model
# 3. Test (predict)
# 4. Report the performance metrics
##################################################################

## Solution ## 
linearSVM = SVC(kernel='linear')
linearSVM.fit(XTrain, yTrain)
yPredLinear = linearSVM.predict(XTest)

print metrics.classification_report(yTest, yPredLinear)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, yPredLinear),2)

We can visualise the classification boundary created by the linear SVM using the following function. For easier visualisation, only the test samples have been included in the plot. And remember that the decision boundary has been built using the _training_ data!

In [None]:
visplots.svmDecisionPlot(XTrain, yTrain, XTest, yTest, 'linear')

#### Non-linear SVMs

In addition to C, which is common for all types of SVM, the gamma parameter in the RBF kernel controls the nonlinearity of the SVM boundaries. The larger the gamma, the more nonlinear the boundaries surrounding individual samples. Lower values of gamma lead to broader, more linear boundaries. <br/><br/> In this example, we will use non-linear SVMs with the default values for C and gamma:

In [None]:
#################################################################  
# Write your code here 
# 1. Build the RBF SVM classifier using the default parameters
# 2. Train (fit) the model
# 3. Test (predict)
# 4. Report the performance metrics
################################################################# 

## Solution ## 
rbfSVM = SVC(kernel='rbf', C=1.0, gamma=0.0)  # gamma=0.0 selects the default of 1/n_features in this version of sklearn
rbfSVM.fit(XTrain, yTrain)
yPredRBF = rbfSVM.predict(XTest)

print metrics.classification_report(yTest, yPredRBF)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, yPredRBF),2)

We can visualise the classification boundary created by the RBF SVM using the following function. Once more, for easier visualisation, only the test samples have been included in the plot. And remember that the decision boundary has been built using the _training_ data!

In [None]:
visplots.svmDecisionPlot(XTrain, yTrain, XTest, yTest, 'rbf')
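To see why gamma controls how local the decision boundary is, it helps to evaluate the RBF kernel directly. The kernel value is exp(-gamma * ||a - b||^2), which acts as a similarity between two points. The helper `rbf_kernel` and the two points below are a hypothetical illustration written with `numpy`.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel value exp(-gamma * ||a - b||^2) for two points."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])   # squared distance to a is 2

for gamma in [0.01, 0.1, 1.0, 10.0]:
    print("gamma=%5.2f  similarity=%.4f" % (gamma, rbf_kernel(a, b, gamma)))
```

As gamma grows, the similarity between the same two points shrinks towards zero, so each training sample influences only its immediate surroundings and the boundary can wrap tightly around individual samples.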

#### Hyperparameter Tuning for non-linear SVMs

Proper choice of C and gamma is critical for the performance of SVMs. Optimisation (tuning) of the hyperparameters can be achieved by applying a coarse grid search, often followed by a finer search in the "neighbourhood" of good parameters.

In [None]:
# Define the parameters to be optimised and their values/ranges
# Range for gamma and Cost hyperparameters
g_range = 2. ** np.arange(-15, 5, step=2)
C_range = 2. ** np.arange(-5, 15, step=2)

# Construct a dictionary of hyperparameters as in task 4.3
parameters = [{'gamma': g_range, 'C': C_range}] # Solution 

# Apply a grid search with 10-fold cross-validation using the dictionary of parameters
grid = GridSearchCV(SVC(), parameters, cv= 10) # Solution 
grid.fit(XTrain, yTrain) # Solution 

# Print the optimal parameters (on a log2 scale, matching the search ranges)
bestG = np.log2(grid.best_params_['gamma'])
bestC = np.log2(grid.best_params_['C'])

print "The best parameters are: log2(gamma)=", bestG, " and log2(Cost)=", bestC
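A finer search in the neighbourhood of the coarse optimum could be set up along the following lines. This is only a sketch: the helper `refine_range`, its default width and step, and the example exponents are assumptions chosen for illustration, not results from the wine data.

```python
import numpy as np

def refine_range(best_log2, half_width=2.0, step=0.5):
    """Finer grid of values 2**e for exponents e within half_width of best_log2."""
    exponents = np.arange(best_log2 - half_width,
                          best_log2 + half_width + step / 2.0, step)
    return 2.0 ** exponents

# Hypothetical outcome of the coarse search: log2(gamma) = -5 and log2(C) = 3
fine_g_range = refine_range(-5)
fine_C_range = refine_range(3)

print(np.log2(fine_g_range))
print(np.log2(fine_C_range))
```

The refined ranges can then be passed to a second grid search in the same way as `g_range` and `C_range`.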

Plot the results of the grid search using a heatmap:

In [None]:
### Write your code here ### 


###  Solution ### 
# grid_scores_ contains parameter settings and scores
scores = [x[1] for x in grid.grid_scores_]
scores = np.array(scores).reshape(len(C_range), len(g_range))

# Make a heatmap with the performance
plt.figure(figsize=(10, 6))
plt.imshow(scores, interpolation='nearest', origin='lower', cmap=plt.cm.get_cmap('jet_r'))
plt.xticks(np.arange(len(g_range)), np.log2(g_range))
plt.yticks(np.arange(len(C_range)), np.log2(C_range))
plt.xlabel('gamma (log2)')
plt.ylabel('Cost (log2)')

# Add the colorbar

cbar = plt.colorbar()
cbar.set_label('Classification Accuracy', rotation=270, labelpad=20)

plt.show()

Finally, testing with the optimised model (best hyperparameters) for C and gamma:

In [None]:
####################################################################################  
# Write your code here 
# 1. Build the classifier using the optimal parameters detected by grid search 
# 2. Train (fit) the model
# 3. Test (predict)
# 4. Report the performance metrics
####################################################################################  


## Solution ## 
rbfSVM = SVC(kernel='rbf', C=grid.best_params_['C'], gamma=grid.best_params_['gamma'])
rbfSVM.fit(XTrain, yTrain)
predictions = rbfSVM.predict(XTest) 

print metrics.classification_report(yTest, predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, predictions),2)

### 5.3 Logistic Regression

Logistic regression uses a linear model of the input variables to predict the probability that a sample belongs to a class. For classification, we then assign the sample to the most likely class.

We start by building a logistic regression model with the default parameters:

In [None]:
#############################################################################
# Write your code here 
# 1. Build the Logistic Regression classifier using the default parameters
# 2. Train (fit) the model
# 3. Test (predict)
# 4. Report the performance metrics
#############################################################################

## Solution ## 
l_regression = LogisticRegression()
l_regression.fit(XTrain, yTrain)
l_prediction = l_regression.predict(XTest)

print metrics.classification_report(yTest, l_prediction)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, l_prediction),2)
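The prediction step of a fitted logistic regression model can be written out by hand: a linear combination of the inputs is passed through the logistic (sigmoid) function to obtain a probability, which is then thresholded. The coefficients, intercept and samples below are hypothetical values chosen for illustration, not the ones fitted on the wine data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients and intercept for two features
coef = np.array([1.5, -2.0])
intercept = 0.5

samples = np.array([[1.0, 0.0],
                    [0.0, 1.0]])

# Probability of class 1, then the most likely class at a 0.5 threshold
prob_class1 = sigmoid(samples.dot(coef) + intercept)
predicted = (prob_class1 >= 0.5).astype(int)

print(prob_class1)
print(predicted)
```

Because the probability crosses 0.5 exactly where the linear combination is zero, the resulting decision boundary is a straight line (or hyperplane) in feature space.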

We can visualise the classification boundary created by the logistic regression model using the built-in function `visplots.logregDecisionPlot`. <br/> As with the above examples, only the test samples have been included in the plot. Remember that the decision boundary has been built using the _training_ data!

In [None]:
### Write your code here ### 

## Solution ## 
visplots.logregDecisionPlot(XTrain, yTrain, XTest, yTest)

#### Tuning Logistic Regression

Two hyperparameters that are often tuned for logistic regression models are the norm used in penalisation (`penalty`), which can be either `l1` or `l2` (default `l2`) and the inverse of regularisation strength, `C` (default `1.0`). 

In [None]:
# Define the parameters to be optimised and their values/ranges
# Range for pen and C hyperparameters
pen = ['l1','l2']
C_range = 2. ** np.arange(-5, 15, step=2)

# Construct a dictionary of hyperparameters as in task 4.3
parameters = [{'C': C_range, 'penalty': pen}]


# Apply a grid search with 10-fold cross-validation using the dictionary of parameters
grid = GridSearchCV(LogisticRegression(), parameters, cv= 10)
grid.fit(XTrain, yTrain)


# Print the optimal parameters
print "The best parameters are: cost=", grid.best_params_['C'], " and penalty=", grid.best_params_['penalty']

Plot the results of the grid search with a heatmap.

In [None]:
# grid_scores_ contains parameter settings and scores
# The grid iterates over C in the outer loop and penalty in the inner loop
scores = [x[1] for x in grid.grid_scores_]
scores = np.array(scores).reshape(len(C_range), len(pen))

# Make a heatmap with the performance
plt.figure(figsize=(12, 6))
plt.imshow(scores, interpolation='nearest', origin='lower', cmap=plt.cm.get_cmap('jet_r'))
plt.xticks(np.arange(len(pen)), pen)
plt.yticks(np.arange(len(C_range)), C_range)
plt.xlabel('penalisation norm')
plt.ylabel('inv regularisation strength')

cbar = plt.colorbar()
cbar.set_label('Classification Accuracy', rotation=270, labelpad=20)

plt.show()

Now try these out to see how the performance metrics are affected.

In [None]:
l_regression = LogisticRegression(C=grid.best_params_['C'], penalty=grid.best_params_['penalty'])
l_regression.fit(XTrain, yTrain)
l_prediction = l_regression.predict(XTest)

print metrics.classification_report(yTest, l_prediction)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, l_prediction),2)

For more details on cross-validating and tuning logistic regression models, see: <br/>
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
and <br/>
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

### 5.4 Neural Networks

A neural network is a set of connected input-output units. During training, the connections are assigned different weights. This allows the classification function to take on highly complex "shapes" (equivalent to complicated mathematical expressions that go beyond the linear or polynomial models of logistic regression). This might also mean that the resulting model is difficult to interpret and map to domain knowledge. (NB. even though you might think of the second layer of a neural network as just a logistic regression model, the non-linear transformation in the hidden units gives the input to output mapping a non-linear decision boundary.)


In [None]:
###############################################################################  
# Write your code here 
# 1. Build the Neural Net classifier using the default parameters
# 2. Train (fit) the model
# 3. Test (predict)
# 4. Report the performance metrics
###############################################################################


# Solution #
nnet = multilayer_perceptron.MultilayerPerceptronClassifier(activation='logistic',
                                                            hidden_layer_sizes=2, learning_rate_init=.5)
nnet.fit(XTrain, yTrain)
net_prediction = nnet.predict(XTest)

print metrics.classification_report(yTest, net_prediction)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, net_prediction),2)

We can visualise the classification boundary created by the neural network using the built in visualisation function `nnDecisionPlot`. As with the above examples, only the test samples have been included in the plot. And remember that the decision boundary has been built using the _training_ data!

In [None]:
visplots.nnDecisionPlot(XTrain, yTrain, XTest, yTest, 2, .5)

In [None]:
visplots.nnDecisionPlot(XTrain, yTrain, XTest, yTest, (2,3,6), .5)

#### Tuning Neural Nets 

Neural networks have many hyperparameters, all of which could potentially be tuned, including learning rate, loss function, number of training iterations, number of hidden layers and number of units within each of them, nonlinearity function, and weight initialisation. 

Here's a worked-through example which explores the set of parameter configurations with different numbers of hidden layers and units within them (`hidden_layer_sizes`), and learning rates (`learning_rate_init`).

Note the syntax to specify the number of hidden layers and the units within them. If a tuple is given, each value in the tuple stands for the number of units in a hidden layer, e.g. the tuple `(2,3,4)` would mean a network with two units in the first hidden layer, three units in the second, and four in the third. If a single value is given, then there is only one hidden layer, and the value stands for the number of units in this layer.
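One way to read a `hidden_layer_sizes` value is to list the shapes of the weight matrices it implies between consecutive layers. The helper `layer_shapes` below is a hypothetical illustration; the feature count of 11 and the single output unit are assumptions made for the example.

```python
import numpy as np

def layer_shapes(n_features, hidden_layer_sizes, n_outputs=1):
    """Shapes of the weight matrices linking consecutive layers."""
    if isinstance(hidden_layer_sizes, int):
        # A single value means one hidden layer with that many units
        hidden_layer_sizes = (hidden_layer_sizes,)
    sizes = [n_features] + list(hidden_layer_sizes) + [n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))

print(layer_shapes(11, (2, 3, 4)))
print(layer_shapes(11, 10))

# Total number of connection weights for the three-hidden-layer network
n_weights = int(np.sum([rows * cols for rows, cols in layer_shapes(11, (2, 3, 4))]))
print(n_weights)
```

Deeper or wider settings add more weight matrices or larger ones, which increases both the flexibility of the model and the time needed to train it.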

In [None]:
# Define the parameters to be optimised and their values/ranges
# Ranges for the hidden layer sizes and learning rate hyperparameters
layer_size_range = [(3,2),(10,10),(2,2,2),10,5] # different networks shapes
learning_rate_range = np.linspace(.1,1,3)

parameters = [{'hidden_layer_sizes': layer_size_range, 'learning_rate_init': learning_rate_range}]

grid = GridSearchCV(multilayer_perceptron.MultilayerPerceptronClassifier(), parameters, cv= 10)
grid.fit(XTrain, yTrain)

best_size    = grid.best_params_['hidden_layer_sizes']
best_best_lr = grid.best_params_['learning_rate_init']

print "The best parameters are: hidden_layer_sizes=", best_size, " and learning_rate_init=", best_best_lr

Now try these out to see how the performance metrics are affected.

In [None]:
nnet = multilayer_perceptron.MultilayerPerceptronClassifier(hidden_layer_sizes=best_size, learning_rate_init=best_best_lr)
nnet.fit(XTrain, yTrain)
net_prediction = nnet.predict(XTest)

print metrics.classification_report(yTest, net_prediction)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, net_prediction),2)

Plot the results of the grid search using a heatmap.

In [None]:
# grid_scores_ contains parameter settings and scores
scores = [x[1] for x in grid.grid_scores_]
scores = np.array(scores).reshape(len(layer_size_range), len(learning_rate_range))
scores = np.transpose(scores)

# Make a heatmap with the performance
plt.figure(figsize=(12, 6))
plt.imshow(scores, interpolation='nearest', origin='lower', cmap=plt.cm.get_cmap('jet_r'))
plt.xticks(np.arange(len(layer_size_range)), layer_size_range)
plt.yticks(np.arange(len(learning_rate_range)), learning_rate_range)
plt.xlabel('hidden layer topology')
plt.ylabel('learning rate')

cbar = plt.colorbar()
cbar.set_label('Classification Accuracy', rotation=270, labelpad=20)

plt.show()

- *So, what is your best technique and why?*