# Digits Sample Data

For these examples we'll use scikit-learn's built-in digits data.  This is a dataset of 8x8 pixel handwritten digits such as the following:

![digits](digits.png)

Each sample is a 64-element array of integer-valued grayscale intensities (stored as floats, ranging from 0 to 16), one per pixel of the flattened 8x8 image.  Each sample also has a label with the true value of the digit drawn.

In [1]:
from sklearn.datasets import load_digits

# Load all the samples for all digits 0-9
digits = load_digits()

# Assign the matrices to a variable `data`
data = digits.data

# Assign the labels to a variable `target`
target = digits.target

So for example, the first element is a 0 and the array of pixel values is:

In [2]:
print(target[0])
data[0]

0


array([  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
        15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
         8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
         5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
         1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
         0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.])
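
Because each sample is a flattened 8x8 image, regrouping the 64 values into rows of 8 recovers the picture. The sketch below does this in plain Python using the pixel values printed above; the `SHADES` character ramp and the `render` helper are arbitrary choices made for this illustration. With NumPy, `data[0].reshape(8, 8)` gives the same grid, and `digits.images[0]` holds it ready-made.

```python
# Pixel values of the first sample, copied from the output above.
pixels = [
    0, 0, 5, 13, 9, 1, 0, 0,
    0, 0, 13, 15, 10, 15, 5, 0,
    0, 3, 15, 2, 0, 11, 8, 0,
    0, 4, 12, 0, 0, 8, 8, 0,
    0, 5, 8, 0, 0, 9, 8, 0,
    0, 4, 11, 0, 1, 12, 7, 0,
    0, 2, 14, 5, 10, 12, 0, 0,
    0, 0, 6, 13, 10, 0, 0, 0,
]

# Regroup the flat list into 8 rows of 8 pixels each.
rows = [pixels[i:i + 8] for i in range(0, len(pixels), 8)]

# Hypothetical character ramp: darker characters for higher intensities.
SHADES = " .:*#"


def render(rows, max_value=16):
    """Draw the grid as text, mapping 0..max_value onto the SHADES ramp."""
    lines = []
    for row in rows:
        chars = (SHADES[value * (len(SHADES) - 1) // max_value] for value in row)
        lines.append("".join(chars))
    return "\n".join(lines)


print(render(rows))
```

The printed grid shows a ring of bright pixels around an empty center, which is consistent with the label 0.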

# Splitting Data Across Training and Testing Sets

scikit-learn offers an easy way to randomly split the data into training and testing sets:

In [3]:
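# `train_test_split` lives in `sklearn.model_selection`; the older
# `sklearn.cross_validation` module has been removed from scikit-learn
from sklearn.model_selection import train_test_split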

# Split the data into 75% train, 25% test
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=.25, random_state=0
)
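
To make the 75/25 split concrete, here is a minimal pure-Python sketch of what a seeded random split does to the sample indices. It is not scikit-learn's implementation and will not select the same rows as `train_test_split`; the `split_indices` helper is hypothetical, and rounding the test count up is an assumption of this sketch. The digits dataset has 1797 samples.

```python
import math
import random


def split_indices(n_samples, test_size, seed):
    """Return (train_indices, test_indices) for a shuffled split."""
    # Assumption of this sketch: round the number of test samples up.
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    # A fixed seed makes the shuffle, and so the split, repeatable.
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_samples=1797, test_size=0.25, seed=0)
print(len(train_idx), len(test_idx))  # 1347 450
```

Fixing the seed plays the same role as `random_state=0` above: rerunning the cell yields the same split, so accuracy numbers are comparable between runs.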

# [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
[`sklearn.naive_bayes.GaussianNB`](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)

The first classifier we'll look at is Gaussian Naive Bayes.  This is a simple model that can be used without setting any configuration parameters.

In [4]:
from sklearn.naive_bayes import GaussianNB

# Create a gaussian naive bayes model with no parameters
model = GaussianNB()

# Fit the model to our training data passing the features as the first parameter and the labels as the second
model.fit(data_train, target_train)

# Use the model to predict labels for our training set
pred_train = model.predict(data_train)

# And for the test set
pred_test = model.predict(data_test)

# Evaluating Our Model

There are a few ways we can evaluate the model.  For example:

* Accuracy: The percentage of all predictions that are correct
* Precision: Of the samples predicted to belong to a class, the percentage that actually belong to it
* Recall: Of the samples that actually belong to a class, the percentage that were predicted to belong to it
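
As a worked illustration of these three definitions, the sketch below computes them by hand for a small made-up binary example. The `binary_metrics` helper and the label lists are hypothetical and exist only for this illustration. For the ten-class digits problem, precision and recall are computed per class and then averaged.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    return {
        "accuracy": correct / len(pairs),
        # Fall back to 0.0 when there is nothing to divide by.
        "precision": true_pos / predicted_pos if predicted_pos else 0.0,
        "recall": true_pos / actual_pos if actual_pos else 0.0,
    }


# Made-up labels: 4 actual positives, 5 predicted positives.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)  # {'accuracy': 0.625, 'precision': 0.6, 'recall': 0.75}
```

Here 3 of the 5 predicted positives are correct (precision 0.6), and 3 of the 4 actual positives are found (recall 0.75), while 5 of the 8 predictions overall are right (accuracy 0.625).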

scikit-learn offers a number of other [metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).  For these examples, we'll just stick with accuracy, but usage is generally similar across the different metrics.

In [5]:
from sklearn.metrics import accuracy_score

# Print the accuracy for the training set
print("Training Accuracy:", accuracy_score(target_train, pred_train))

# Print the accuracy for the test set
print("Testing Accuracy:", accuracy_score(target_test, pred_test))

Training Accuracy: 0.857461024499
Testing Accuracy: 0.833333333333


The Gaussian naive Bayes classifier gives us about 86% accuracy on the training set and 83% on the test set.  The test accuracy is slightly lower, as we would expect, but not so much lower that we should be concerned about overfitting.

All the supervised classification algorithms in scikit-learn follow the same pattern as above: create the model, fit it to the data, predict labels and evaluate the results.  Because we'll be doing much the same process repeatedly, let's create a function to encapsulate all of those steps:

In [6]:
def run_model(model):
    # Fit the model to our training data passing the features as the first parameter and the labels as the second
    model.fit(data_train, target_train)

    # Use the model to predict labels for our training set
    pred_train = model.predict(data_train)

    # And for the test set
    pred_test = model.predict(data_test)
    
    # Print the accuracy for the training set
    print("Training Accuracy:", accuracy_score(target_train, pred_train))
    
    # Print the accuracy for the test set
    print("Testing Accuracy:", accuracy_score(target_test, pred_test))
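
The create, fit, predict, evaluate pattern only relies on a model exposing `fit` and `predict`. The sketch below illustrates that with a toy baseline written in plain Python; `MajorityClassifier`, the `accuracy` helper and the tiny dataset are all made up for this illustration and are not part of scikit-learn.

```python
from collections import Counter


class MajorityClassifier:
    """Toy model: always predicts the most common training label."""

    def fit(self, features, labels):
        # "Training" just remembers the most frequent label.
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        # Ignore the features and repeat the remembered label.
        return [self.majority_ for _ in features]


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)


# Tiny made-up dataset, purely for illustration.
toy_train_x = [[0], [1], [2], [3], [4]]
toy_train_y = [1, 1, 1, 0, 0]
toy_test_x = [[5], [6], [7], [8]]
toy_test_y = [1, 0, 1, 1]

baseline = MajorityClassifier()
baseline.fit(toy_train_x, toy_train_y)
print("Training Accuracy:", accuracy(toy_train_y, baseline.predict(toy_train_x)))
print("Testing Accuracy:", accuracy(toy_test_y, baseline.predict(toy_test_x)))
```

A baseline like this is also a useful reference point: a real classifier should comfortably beat the accuracy obtained by always guessing the most common label.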

# [K-Nearest Neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
[`sklearn.neighbors.KNeighborsClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)

K-nearest neighbors has additional parameters that can be provided when creating the model, such as the number of neighbors to use (`n_neighbors`) or the search algorithm to use (`algorithm`).

In [7]:
from sklearn.neighbors import KNeighborsClassifier

# Create a model using the 10 nearest neighbors and make the weights uniform
model = KNeighborsClassifier(n_neighbors=10, weights='uniform')

# Train the model, predict labels and evaluate
run_model(model)

Training Accuracy: 0.986636971047
Testing Accuracy: 0.975555555556


That worked pretty well.  But let's see how changing the model parameters affects the results:

In [8]:
# This time, make the distance between neighbors affect the weight
model = KNeighborsClassifier(n_neighbors=10, weights='distance')

# Train the model, predict labels and evaluate
run_model(model)

Training Accuracy: 1.0
Testing Accuracy: 0.98


The result is _ever so slightly_ better.
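
To see how the `weights` setting can change a prediction, here is a minimal pure-Python sketch of k-nearest-neighbors voting. It is an illustration of the idea, not scikit-learn's implementation; the `knn_predict` helper and the four 2-D points are made up for this example.

```python
import math
from collections import defaultdict


def knn_predict(points, labels, query, k, weights="uniform"):
    """Predict a label for `query` from its k nearest labeled points."""
    # Sort all training points by Euclidean distance and keep the k closest.
    neighbors = sorted(
        (math.dist(point, query), label) for point, label in zip(points, labels)
    )[:k]

    votes = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0             # every neighbor counts equally
        elif distance == 0:
            votes[label] = math.inf         # an exact match dominates the vote
        else:
            votes[label] += 1.0 / distance  # closer neighbors count more
    return max(votes, key=votes.get)


# One very close "a" point, two moderately close "b" points, one far "a".
points = [(0.1, 0.0), (1.5, 0.0), (0.0, 1.6), (5.0, 5.0)]
labels = ["a", "b", "b", "a"]
query = (0.0, 0.0)

print(knn_predict(points, labels, query, k=3, weights="uniform"))   # b
print(knn_predict(points, labels, query, k=3, weights="distance"))  # a
```

With uniform weights the two "b" neighbors outvote the single "a" neighbor; with distance weights the much closer "a" point wins. Under distance weighting, each training sample is also its own nearest neighbor at distance zero, which is a plausible reason the training accuracy above reaches 1.0.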

**Note**: any time you tweak the model parameters to improve the test accuracy, you should also hold out a separate [validation](https://en.wikipedia.org/wiki/Test_set#Validation_set) set; otherwise the chosen parameters end up overfitting to the test set as well.
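
One way to follow that advice is sketched below: split off the test set first, carve a validation set out of what remains, choose the parameter on the validation set, and touch the test set only once at the end. The variable names (`data_fit`, `data_val`, `data_holdout`, and so on) and the second 25% split are assumptions made for this illustration, chosen so that the variables used elsewhere in this notebook are not overwritten.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()

# Hold out 25% of the data as the final test set.
data_rest, data_holdout, target_rest, target_holdout = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Carve a validation set out of the remaining data.
data_fit, data_val, target_fit, target_val = train_test_split(
    data_rest, target_rest, test_size=0.25, random_state=0
)

# Pick the weighting scheme using only the validation set.
best_weights, best_score = None, -1.0
for weights in ("uniform", "distance"):
    candidate = KNeighborsClassifier(n_neighbors=10, weights=weights)
    candidate.fit(data_fit, target_fit)
    score = accuracy_score(target_val, candidate.predict(data_val))
    if score > best_score:
        best_weights, best_score = weights, score

# Evaluate the chosen configuration once on the untouched test set.
final_model = KNeighborsClassifier(n_neighbors=10, weights=best_weights)
final_model.fit(data_fit, target_fit)
holdout_accuracy = accuracy_score(target_holdout, final_model.predict(data_holdout))

print("Chosen weights:", best_weights)
print("Validation Accuracy:", best_score)
print("Testing Accuracy:", holdout_accuracy)
```

Because the test set plays no part in choosing `weights`, the final testing accuracy remains an honest estimate of how the model does on unseen data.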

# [Support Vector Classification](https://en.wikipedia.org/wiki/Support_vector_machine)
[`sklearn.svm.SVC`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)

First, a linear kernel:

In [9]:
from sklearn.svm import SVC

# Create a SVC model with a linear kernel
model = SVC(kernel='linear')

run_model(model)

Training Accuracy: 1.0
Testing Accuracy: 0.971111111111


And a polynomial model:

In [10]:
# Create a 5th degree polynomial model
model = SVC(kernel='poly', degree=5)
run_model(model)

Training Accuracy: 1.0
Testing Accuracy: 0.982222222222


# [Decision Trees](https://en.wikipedia.org/wiki/Decision_tree_learning)
[`sklearn.tree.DecisionTreeClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

In [11]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
run_model(model)

Training Accuracy: 1.0
Testing Accuracy: 0.851111111111


With the default parameters, there appears to be some overfitting since the training accuracy is perfect but the testing accuracy is ~85%.  We can adjust parameters such as the depth of the tree or the number of features to use to try to mitigate some of that:

In [12]:
model = DecisionTreeClassifier(max_depth=8, max_features=8)
run_model(model)

Training Accuracy: 0.881217520416
Testing Accuracy: 0.724444444444

In this run, these particular settings lower both scores: the gap between training and testing accuracy stays roughly the same while the testing accuracy drops to ~72%, so the restricted tree is simply less accurate rather than better at generalizing.


# [Random Forest](https://en.wikipedia.org/wiki/Random_forest)
[`sklearn.ensemble.RandomForestClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

In [13]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
run_model(model)

Training Accuracy: 1.0
Testing Accuracy: 0.928888888889


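The default number of estimators (trees) was 10 in the scikit-learn release that produced these outputs; it has been 100 since version 0.22.  We can tweak settings such as this one: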

In [14]:
model = RandomForestClassifier(n_estimators=40)
run_model(model)

Training Accuracy: 1.0
Testing Accuracy: 0.973333333333


# Further Reading
* [scikit-learn supervised learning documentation](http://scikit-learn.org/stable/supervised_learning.html)
* [Classification Examples](http://scikit-learn.org/stable/auto_examples/#classification)