# bcdata Workshop 2017

## Basic machine learning with `scikit-learn`

* Machine learning terminology
* Example: hand-written digits
* PCA with `sklearn`
* Build our own PCA function
* $k$-means clustering using `sklearn`
* $k$NN classification using `sklearn`

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

### Machine learning terminology

* Sample: element in the dataset represented by a numeric vector (a measurement)
* Feature: a value in the vector representing a sample (a co-variate)
* Target: the output label attached to a sample (a label, response)
* Supervised learning: training data is labelled
    * classification: output is discrete
    * regression: output is continuous
* Unsupervised learning: training data is not labelled
    * Clustering
* Train and test data: training data is what you build your model with, and test data is what you use to evaluate it (it's common to split the original dataset into a training set and a separate test set).
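
The train/test split described above can be sketched with `train_test_split` from `sklearn.model_selection`. The toy arrays, the `test_size` fraction and the `random_state` seed below are made up for illustration; they are not values used in the workshop.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset (hypothetical): 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
# One target (label) per sample
y = np.array([0, 1] * 5)

# Hold out a fraction of the samples for testing; the seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

Each row of `X` stays paired with its entry in `y`, so the model is built on `(X_train, y_train)` and evaluated on `(X_test, y_test)`.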

### Example: Hand-written digits

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

In [None]:
dir(digits)

In [None]:
print(digits.DESCR)

In [None]:
type(digits.data)

In [None]:
digits.data.shape

In [None]:
digits.images.shape

In [None]:
plt.imshow(digits.images[0,...], cmap='binary', interpolation='gaussian')
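
The two shapes printed above suggest how `digits.data` and `digits.images` relate: each row of `data` appears to be an 8x8 image flattened into 64 features. A minimal check, assuming that reading is right:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Reshape the first sample (a 64-vector) back into an 8x8 grid
first_as_image = digits.data[0].reshape(8, 8)

# Compare with the image stored for the same sample
same = np.array_equal(first_as_image, digits.images[0])
print(same)
```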

### Principal component analysis

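The idea is: suppose that we have a dataset stored in a matrix $X$ where each row is a sample and the columns are normalized (such that the mean value in each column is 0). PCA finds orthonormal directions in feature space, the principal components, ordered so that the first captures the largest share of the variance in the data, the second the largest share of what remains, and so on. Projecting the samples onto the first few components gives a lower-dimensional representation that keeps most of the variance.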
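
The zero-mean assumption on the columns of $X$ can be sketched in NumPy as follows. This is only an illustration of the centering step on the digits data; the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data

# Subtract each column's mean so every feature has mean 0
X_centered = X - X.mean(axis=0)

# The column means should now be zero up to floating-point error
print(np.allclose(X_centered.mean(axis=0), 0))
```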

In [None]:
from sklearn.decomposition import PCA

Note: I'm going to choose `n_components = 29` because that is the smallest number of components that explains $\geq 95 \%$ of the total variance.
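
One way to check a choice like this is to fit PCA with all components and look at the cumulative `explained_variance_ratio_`. The sketch below is an assumption about how the number could be found, not necessarily how it was found for this workshop; `pca_full`, `cumulative` and `n_needed` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Fit with the default n_components, which keeps every component
pca_full = PCA().fit(X)

# Running total of the fraction of variance explained
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Index of the first entry reaching 95%, plus one to turn it into a count
n_needed = int(np.argmax(cumulative >= 0.95)) + 1

print(n_needed, cumulative[n_needed - 1])
```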

In [None]:
pca = PCA(n_components=29)
pca.fit(digits.data)

In [None]:
pca.components_.shape

In [None]:
plt.figure(figsize=(12,6))
# Each component is a vector of 64 values; reshape it to 8x8 to view it as an image
plt.subplot(121)
plt.imshow(pca.components_[0,:].reshape((8,8)), cmap='binary', interpolation='gaussian')
plt.subplot(122)
plt.imshow(pca.components_[1,:].reshape((8,8)), cmap='binary', interpolation='gaussian')

Are they orthogonal to each other? 

In [None]:
print('<PCA1, PCA2> = {}'.format(np.dot(pca.components_[0,:], pca.components_[1,:]).round(3)))
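
The same question can be asked of all 29 components at once. If the rows of `pca.components_` are orthonormal, the matrix of all pairwise dot products should be the identity. A minimal sketch (the name `gram` is hypothetical):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

pca = PCA(n_components=29).fit(load_digits().data)

# Entry (i, j) is the dot product of component i with component j
gram = pca.components_ @ pca.components_.T

# Orthonormal rows give ones on the diagonal and zeros elsewhere
print(np.allclose(gram, np.eye(29)))
```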