# bcdata Workshop 2017

## Basic machine learning with `scikit-learn`

* Machine learning terminology
* Example: hand-written digits
* PCA with `sklearn`
* Build our own PCA function
* $k$-means clustering using `sklearn`
* $k$NN classification using `sklearn`

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

### Machine learning terminology

* Sample: element in the dataset represented by a numeric vector (a measurement)
* Feature: a value in the vector representing a sample (a co-variate)
* Target: the output label attached to a sample (a label, response)
* Supervised learning: training data is labelled
    * classification: output is discrete
    * regression: output is continuous
* Unsupervised learning: training data is not labelled
    * Clustering
* Train and test data: the training data is what you build your model with, and the test data is what you use to evaluate it (it's common to split the original dataset into a training set and a separate test set).
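
To make the terminology concrete, here is a minimal sketch using the digits dataset that the rest of this notebook works with. The names `X` and `y`, the 25% test fraction, and the fixed `random_state` are choices made for this illustration only.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X = digits.data      # one row per sample, one column per feature
y = digits.target    # one target (the digit 0-9) per sample

n_samples, n_features = X.shape
print('samples: {}, features: {}'.format(n_samples, n_features))

# Hold out a quarter of the samples for testing (random_state fixed for repeatability)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print('train: {}, test: {}'.format(X_train.shape[0], X_test.shape[0]))
```

Because the targets are known, this is a supervised setting; since they are discrete, it is a classification problem.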

### Example: Hand-written digits

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

In [None]:
dir(digits)

In [None]:
print(digits.DESCR)

In [None]:
type(digits.data)

In [None]:
digits.data.shape

In [None]:
digits.images.shape

In [None]:
plt.imshow(digits.images[0,...], cmap='binary', interpolation='gaussian')

### Principal component analysis

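The idea is: suppose that we have a dataset stored in a matrix $X$ where each row is a sample and the columns are normalized (such that the mean value in each column is 0). The principal components are then the eigenvectors of $X^T X$, ordered by decreasing eigenvalue: the first component is the direction along which the data varies the most, the second is the direction of greatest variance orthogonal to the first, and so on.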

In [None]:
from sklearn.decomposition import PCA

Note: I'm going to choose `n_components = 29` because that is the lowest number of components that explains $\geq 95 \%$ of the variance.
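
One way to check a choice like this is to fit PCA with all components and look at the cumulative explained variance. The sketch below assumes the 95% threshold from the note; the names `pca_full`, `cumulative` and `n_needed` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

# Fit PCA with all components so the full variance spectrum is available
pca_full = PCA().fit(digits.data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Index of the first entry reaching 0.95, converted to a component count
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print('Components needed for 95% of the variance: {}'.format(n_needed))
```

`sklearn` can also make this selection itself when `n_components` is given as a fraction between 0 and 1, e.g. `PCA(n_components=0.95)`.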

In [None]:
pca = PCA(n_components=29)
pca.fit(digits.data)

In [None]:
pca.components_.shape

In [None]:
plt.figure(figsize=(12,6))
plt.subplot(121)
plt.imshow(pca.components_[0,:].reshape((8,8)), cmap='binary', interpolation='gaussian')
plt.subplot(122)
plt.imshow(pca.components_[1,:].reshape((8,8)), cmap='binary', interpolation='gaussian')

Are they orthogonal to each other? 

In [None]:
print('<PCA1, PCA2> = {}'.format(np.dot(pca.components_[0,:], pca.components_[1,:]).round(3)))
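
The same question can be asked of all 29 components at once. A minimal sketch, assuming the same `PCA(n_components=29)` fit as above: if the rows of `components_` are orthonormal, their Gram matrix is the identity.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(n_components=29).fit(digits.data)

# Gram matrix of the components: entry (i, j) is <PCi, PCj>
gram = pca.components_ @ pca.components_.T

# Orthonormal rows give the identity matrix, up to floating-point error
print(np.allclose(gram, np.eye(29)))
```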

In [None]:
data2d = pca.transform(digits.data)

In [None]:
discrete_colours = plt.get_cmap('viridis', 10)
plt.figure(figsize=(12,6));
plt.scatter(data2d[:,0], data2d[:,1], c=digits.target, cmap=discrete_colours, vmin=-0.5, vmax=9.5);
plt.colorbar(ticks=np.arange(10));
plt.xlabel('Looks like 3 not 4')
plt.ylabel('Looks like 0 not 1');

In [None]:
plt.figure(figsize=(20,8))
for i in range(5):
    plt.subplot(1,5,i+1)
    plt.imshow(pca.components_[i,:].reshape((8,8)), cmap='binary', interpolation='gaussian')
    plt.axis('off')

### Rolling our own PCA function

In [None]:
from scipy.linalg import eig

In [None]:
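def ourPCA(data, n_components=2):
    # normalize the input data (mean value of 0 down the columns)
    means = data.mean(axis=0)
    X = data - means
    # eigen-decomposition of X^T X; its eigenvectors are the principal directions
    evals, evecs = eig(X.T @ X)
    # indices of the n_components largest eigenvalues, in decreasing order
    indices = np.argsort(evals.real)[::-1][:n_components]
    # eig returns complex arrays, so keep the real parts; one component per row
    return evecs[:, indices].T.real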

In [None]:
pc5 = ourPCA(digits.data, 5)

In [None]:
plt.figure(figsize=(20,8))
for j in range(5):
    plt.subplot(1,5,j+1)
    plt.imshow(pc5[j,:].reshape((8,8)), cmap='gray', interpolation='gaussian')
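
As a sanity check, the eigenvector approach can be compared against `sklearn`. The sketch below is a variant of `ourPCA`, not the function above: it is renamed `pca_via_eigh` (a hypothetical name) and swaps `scipy.linalg.eig` for `numpy.linalg.eigh`, which is intended for symmetric matrices and returns real eigenvalues in ascending order. Principal components are only determined up to sign, so the comparison uses absolute values of inner products. Passing `svd_solver='full'` is an assumption made here so that `sklearn` computes an exact decomposition.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

def pca_via_eigh(data, n_components=2):
    # Centre the columns, then diagonalize the symmetric matrix X^T X
    X = data - data.mean(axis=0)
    evals, evecs = np.linalg.eigh(X.T @ X)
    # eigh sorts eigenvalues in ascending order; take the largest ones first
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order].T

digits = load_digits()
ours = pca_via_eigh(digits.data, 5)
theirs = PCA(n_components=5, svd_solver='full').fit(digits.data).components_

# |<ours_i, theirs_i>| should be 1 when the directions agree up to sign
agreement = np.abs(np.sum(ours * theirs, axis=1))
print(agreement.round(6))
```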

**Exercise:** Let $v, w$ be unit eigenvectors of $X^T X$ corresponding to its two largest eigenvalues $\lambda_1 \geq \lambda_2$ respectively.
In PCA, $v$ is the *first principal component*. To find the second principal component, one proceeds by projecting the rows of $X$ onto $v^\perp$ to obtain $Y := \mathcal{P}_{v^\perp} (X) = X (I - v v^T)$ and then computes the leading eigenvector of $Y^T Y$. Prove that this leading eigenvector is $w$.
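
Before attempting the proof, the claim can be checked numerically. This is only a sketch on synthetic data, not a proof: the random matrix, the column scales and the seed are all made up for the illustration, and the projection is taken row by row, i.e. $Y = X - (Xv)v^T$ for a unit vector $v$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Columns get different scales so the eigenvalues are well separated
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])
X = X - X.mean(axis=0)

evals, evecs = np.linalg.eigh(X.T @ X)   # ascending eigenvalues
v, w = evecs[:, -1], evecs[:, -2]        # first and second principal components

# Project every row of X onto the orthogonal complement of v
Y = X - np.outer(X @ v, v)

_, evecs_Y = np.linalg.eigh(Y.T @ Y)
leading = evecs_Y[:, -1]

# Agreement up to sign: |<leading, w>| should be close to 1
print(abs(leading @ w).round(6))
```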

### $k$-means clustering

In [None]:
ones = digits.data[digits.target == 1]
ones.shape

In [None]:
plt.imshow(ones[20,:].reshape((8,8)))

In [None]:
plt.imshow(ones.mean(axis=0).reshape((8,8)))

In [None]:
onesProj = pca.transform(ones)

In [None]:
plt.figure(figsize=(6,6))
plt.scatter(onesProj[:,0], onesProj[:,1])
plt.axis('equal');

In [None]:
index1 = np.argmax(onesProj[:,0])
index2 = np.argmax(onesProj[:,1])
outlier1 = ones[index1, :].reshape((8,8))
outlier2 = ones[index2, :].reshape((8,8))
plt.figure(figsize=(15,8))
plt.subplot(121)
plt.imshow(outlier1)
plt.subplot(122)
plt.imshow(outlier2)

In [None]:
from sklearn.cluster import KMeans

In [None]:
N = 4
clf = KMeans(n_clusters=N, n_init=10)

In [None]:
clf.fit(ones)

In [None]:
clf.cluster_centers_.shape

In [None]:
plt.figure(figsize=(np.minimum(20, 4*N),4))

for i in range(N):
    plt.subplot(1,N,i+1)
    plt.imshow(clf.cluster_centers_[i,:].reshape((8,8)))
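
Besides looking at the cluster centres, it can be useful to see how many samples fall into each cluster. A minimal sketch, assuming four clusters as above; the name `km` and the fixed `random_state` are choices made here so that the result is repeatable.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

digits = load_digits()
ones = digits.data[digits.target == 1]

# n_init and random_state are set explicitly here for repeatable results
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(ones)

# labels_ holds the cluster index assigned to each sample
sizes = np.bincount(km.labels_, minlength=4)
print(sizes)
```

A very small cluster may simply be capturing a handful of unusual samples, such as the outliers plotted earlier.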


### $k$-nearest neighbours

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [None]:
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=.25)

In [None]:
for var in (x_train, x_test, y_train, y_test):
    print(var.shape)

In [None]:
knn = KNeighborsClassifier(n_neighbors=10)

In [None]:
knn.fit(x_train, y_train)

In [None]:
y_pred = knn.predict(x_test)

In [None]:
from sklearn.metrics import accuracy_score, adjusted_mutual_info_score, classification_report, confusion_matrix

In [None]:
acc_score = accuracy_score(y_test, y_pred)
I_mutual = adjusted_mutual_info_score(y_test, y_pred)
print('Accuracy score = {:.3f};\nAdjusted Mutual Information score = {:.3f}'.format(acc_score, I_mutual))

In [None]:
print('Classification report')
print(classification_report(y_test, y_pred))

In [None]:
print('Confusion matrix')
print(confusion_matrix(y_test, y_pred))
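
To see what the classifier is doing under the hood, here is a rough from-scratch sketch of $k$-nearest-neighbour prediction: find the $k$ closest training samples in Euclidean distance and take a majority vote over their labels. The function `knn_predict` and its parameters are hypothetical names for this illustration, and ties are broken in favour of the smaller label, which may differ from how `sklearn` handles them.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

def knn_predict(x_train, y_train, x_test, k=10, n_classes=10):
    # Euclidean distance between every test sample and every training sample
    distances = cdist(x_test, x_train)
    # Indices of the k closest training samples for each test sample
    nearest = np.argsort(distances, axis=1)[:, :k]
    # Majority vote over the neighbours' labels (ties go to the smaller label)
    votes = y_train[nearest]
    return np.array([np.bincount(row, minlength=n_classes).argmax() for row in votes])

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

y_pred = knn_predict(x_train, y_train, x_test, k=10)
print('Accuracy = {:.3f}'.format(np.mean(y_pred == y_test)))
```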