# Manipulating data with Python and Numpy: The Digits Dataset

Author: [Alexandre Gramfort](http://alexandre.gramfort.net/) (Telecom ParisTech)
with some modifications by Chloé-Agathe Azencott `chloe-agathe.azencott@mines-paristech.fr`.

The goal of this notebook is to start manipulating data with Python and Numpy. We will use scikit-learn only to load the data.

The data you'll be working with today is the `digits` dataset. It contains digital images of handwritten digits.

## Getting to know the data

### Imports

In [None]:
%matplotlib inline

# Explicit imports, used throughout as np and plt
import numpy as np
import matplotlib.pyplot as plt

### Loading the data
The data is available from scikit-learn (import name `sklearn`).

In [None]:
# Load data
from sklearn.datasets import load_digits 

digits = load_digits()

# Get descriptors and target to predict
X, y = digits.data, digits.target

# Get the shape of the data
print("Number of samples: %d" % X.shape[0])
print("Number of pixels: %d" % X.shape[1])
print("Number of classes: %d" % len(np.unique(y)))  # number of unique values in y

In [None]:
# Pick one sample to visualize it
sample_idx = 42

print(X[sample_idx, :])

print(y[sample_idx])

### Problem 3.1
* What is the type of X? Of its entries?
* What is the type of y? Of its entries?
* Play with different values for `sample_idx`. Can you guess `y[sample_idx]`?
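
As a hint for the first two questions, here is a minimal sketch of how a NumPy array can be inspected. The array `toy` and its values are made up for illustration and are not part of the dataset; the same attributes can be queried on `X` and `y`.

```python
import numpy as np

# Hypothetical toy array standing in for X (values are made up)
toy = np.array([[0.0, 5.0, 13.0],
                [0.0, 16.0, 2.0]])

print(type(toy))        # type of the container
print(toy.dtype)        # type of the entries
print(toy.shape)        # (number of rows, number of columns)
print(type(toy[0, 1]))  # type of a single entry
```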

In [None]:
# TODO

### Visualizing samples

Each sample is a scanned 8x8 image, that is, 64 pixels. Each image has been flattened into a vector of size 64, such as `X[sample_idx, :]`. Each entry of that vector is the intensity of the corresponding pixel.

Let us now visualize the original image.

In [None]:
# Reshape the vector X[sample_idx] into a 2D, 8x8 matrix
sample_image = np.reshape(X[sample_idx, :], (8, 8))
print(sample_image.shape)

In [None]:
# Display the corresponding image
plt.imshow(sample_image)

In [None]:
# Let us improve visualization by using grayscale plotting 
plt.imshow(sample_image, cmap='gray')

# Give the plot a title
plt.title('The digit of index %d is a %d' % (sample_idx, y[sample_idx]))

### Problem 3.2
* Plot only half of the pixels of the previous image, taking every other one.
* Remove the 1-pixel-wide border of the image.
* Plot the histogram of the values of the pixels
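
The three tasks above all rely on array slicing and flattening. The sketch below illustrates these operations on a made-up 8x8 array. The name `toy_image` and its random values are assumptions for illustration, and "every other pixel" is read here as "every other row", which is only one possible interpretation.

```python
import numpy as np

# Hypothetical 8x8 image with made-up integer intensities in [0, 16]
rng = np.random.default_rng(0)
toy_image = rng.integers(0, 17, size=(8, 8)).astype(float)

# Keep every other row: half of the pixels remain (4 rows of 8)
every_other_row = toy_image[::2, :]

# Drop the first and last row and column (1-pixel-wide border)
no_border = toy_image[1:-1, 1:-1]

# Count how many pixels take each integer intensity from 0 to 16
counts, bin_edges = np.histogram(toy_image.ravel(), bins=np.arange(18))

print(every_other_row.shape, no_border.shape, counts.sum())
```

With matplotlib, `plt.imshow(no_border, cmap='gray')` would display a sliced image and `plt.hist(toy_image.ravel())` would draw the histogram directly.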

### Some statistics
To better understand the data, we will compute some basic statistics (the mean and the variance of the images of each digit class) and visualize them as images.

In [None]:
# Get all possible classes
classes_list = np.unique(y).astype(int)
print("Classes in our data: ", classes_list)

### Problem 3.3
* Compute the mean of all images representing a 0: the pixel of coordinates (i, j) takes the average value of all (i, j) pixels among images of 0.
* In a for loop, do the same for every digit from 0 to 9.
* Repeat, replacing mean by standard deviation.
* Use `plt.subplots` to visualize all of those on the same plot.
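
One way to approach the per-class statistics is boolean masking followed by a reduction along the sample axis. The sketch below shows the idea on made-up data with 4 "pixels" and 2 classes. The names `toy_X`, `toy_y`, `means` and `stds` are hypothetical, and this is not necessarily the intended solution.

```python
import numpy as np

# Hypothetical toy data: 6 samples of 4 "pixels", with labels 0 and 1
toy_X = np.array([[0., 2., 4., 6.],
                  [2., 4., 6., 8.],
                  [1., 1., 1., 1.],
                  [3., 3., 3., 3.],
                  [4., 6., 8., 10.],
                  [5., 5., 5., 5.]])
toy_y = np.array([0, 0, 1, 1, 0, 1])

means, stds = {}, {}
for label in np.unique(toy_y):
    members = toy_X[toy_y == label]           # rows belonging to this class
    means[int(label)] = members.mean(axis=0)  # per-pixel mean
    stds[int(label)] = members.std(axis=0)    # per-pixel standard deviation

# Each statistic is a flat vector; reshape it to view it as an image
print(means[0].reshape(2, 2))
```

On the real data, each class statistic is a vector of size 64, so it would be reshaped to 8x8 before being passed to `imshow`.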

In [None]:
# For plotting
fig = plt.figure()

for idx in classes_list:
    ax = fig.add_subplot(3, 4, idx+1) # plot number (idx+1) on a 3x4 grid
    
    ax.imshow(TODO)
        
# TODO

## Classification by nearest centroid

The goal of this part is to have you implement your own classifier, based on a simple concept: for a new digit, return the class whose mean image (its centroid) is nearest.

### Problem 3.4
* Split the dataset into two parts: `X_train`, `y_train` and `X_test`, `y_test`.
* For each digit, compute on `X_train` its mean representation. Store those in `centroids_train`, which is a 10x64 array.
* For each image of the test set, compute its nearest centroid. This is the prediction for this image. Store whether this prediction is correct.
* What is the overall percentage of correct predictions with this method?
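
The centroid and prediction steps above can be sketched on a tiny made-up problem with 2 classes and 2 features. All `toy_*` names and values are hypothetical, the train/test split is assumed to have been done already, and Euclidean distance is an assumption, since the text does not name a distance.

```python
import numpy as np

# Hypothetical toy data: 2 classes, 2 features (all values are made up)
toy_X_train = np.array([[0., 0.], [0., 2.], [10., 10.], [10., 12.]])
toy_y_train = np.array([0, 0, 1, 1])
toy_X_test = np.array([[1., 1.], [9., 11.], [6., 6.]])
toy_y_test = np.array([0, 1, 0])

# One centroid per class: the mean of that class's training samples
classes = np.unique(toy_y_train)
centroids = np.array([toy_X_train[toy_y_train == c].mean(axis=0)
                      for c in classes])

# Squared Euclidean distance from every test sample to every centroid,
# computed with broadcasting; the result has shape (n_test, n_classes)
diffs = toy_X_test[:, np.newaxis, :] - centroids[np.newaxis, :, :]
distances = (diffs ** 2).sum(axis=2)

# Predict the class of the closest centroid, then measure accuracy
predictions = classes[distances.argmin(axis=1)]
accuracy = (predictions == toy_y_test).mean()
print(predictions, "accuracy: %.1f%%" % (100 * accuracy))
```

Broadcasting avoids an explicit double loop over test samples and centroids; an equivalent loop-based version may be easier to read when first writing the classifier.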