## Set Up

We import our code and any frequently used libraries, and set up our data.

In [24]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path) 
    
DATA_PATH = '../data/zipcombo.dat'

In [25]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [26]:
import pandas as pd
import numpy as np
from tqdm import tqdm

In [27]:
from src.kernels import polynomial_kernel
from src.perceptrons import VectorizedOneVsAllKernelPerceptron, OneVsAllKernelPerceptron
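
The `src.kernels` and `src.perceptrons` modules imported above are not shown in this notebook. As a rough illustration of what a one-vs-all kernel perceptron does, the sketch below trains one binary perceptron per class on a precomputed Gram matrix and predicts with the argmax over the class scores. The names `poly_kernel_matrix` and `SketchOneVsAllKernelPerceptron` are hypothetical, and this is not the implementation used in the rest of the notebook.

```python
import numpy as np

def poly_kernel_matrix(A, B, d):
    """Polynomial kernel (a . b)^d between every row of A and every row of B."""
    return (A @ B.T) ** d

class SketchOneVsAllKernelPerceptron:
    """Hypothetical one-vs-all kernel perceptron (illustration only)."""

    def __init__(self, X, y, d, n_classes):
        self.X, self.d = X, d
        # +1 where the sample belongs to the class, -1 otherwise
        self.Y = np.where(y[:, None] == np.arange(n_classes)[None, :], 1.0, -1.0)
        # one coefficient column per binary classifier
        self.alpha = np.zeros_like(self.Y)
        # Gram matrix on the training set, computed once
        self.K = poly_kernel_matrix(X, X, d)

    def train_for_epochs(self, epochs):
        for _ in range(epochs):
            for t in range(len(self.X)):
                scores = self.K[t] @ self.alpha  # shape (n_classes,)
                # update every binary classifier that misclassifies sample t
                wrong = self.Y[t] * scores <= 0
                self.alpha[t, wrong] += self.Y[t, wrong]

    def predict_all(self, X_new):
        scores = poly_kernel_matrix(X_new, self.X, self.d) @ self.alpha
        return np.argmax(scores, axis=1)
```

Precomputing the Gram matrix trades memory for speed: each online update then costs one row-by-matrix product instead of re-evaluating the kernel.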

<b>Let's initially work with a smaller dataset until we sort out the inner and outer efficiency issues (i.e. using one-vs-all and making the individual perceptrons more efficient).</b>

In [28]:
df = pd.read_csv(DATA_PATH, sep=' ', header=None).drop(columns=[257])
df.rename(columns={0: 'label'}, inplace=True)
X = df[list(range(1, 257))].values
y = df['label'].values.astype(int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent

In [29]:
# we do not currently use subsampling, but we do still need the function for testing purposes

def subsample(df, classes, sample_size=100, label_col='label'):
    # draw sample_size rows from each class
    samples = [df[df[label_col] == clazz].sample(sample_size) for clazz in classes]

    # concatenate (DataFrame.append was removed in pandas 2.0) and shuffle
    df_small = pd.concat(samples).sample(frac=1.)

    X_small = df_small.drop(columns=label_col).values
    y_small = df_small[label_col].values

    return X_small, y_small

## Utils

In [30]:
# functions for creating confusion error matrix

def conf_mat(X, y, model):
    cats = 10
    con_mat = np.zeros((cats,cats))
    x_pred = model.predict_all(X)
    for i in range(len(y)):
        con_mat[y[i], x_pred[i]] += 1
    return con_mat

def confusion_error(X, y, model):
    cats = 10
    # raw counts: rows are true labels, columns are predicted labels
    con_mat = conf_mat(X, y, model)

    # express each off-diagonal entry as a percentage of its row's errors
    for j in range(cats):
        tot = con_mat[j, :].sum() - con_mat[j, j]
        for col in range(cats):
            if col != j and con_mat[j, col] != 0:
                con_mat[j, col] = (con_mat[j, col] / tot) * 100
        con_mat[j, j] = 0
    return con_mat
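
The counting loop in `conf_mat` and the normalisation in `confusion_error` can also be written without explicit Python loops. The sketch below assumes integer labels in `0..cats-1` and takes the predictions directly rather than a model; `conf_mat_vectorized` and `off_diagonal_percent` are hypothetical names, not part of `src`.

```python
import numpy as np

def conf_mat_vectorized(y_true, y_pred, cats=10):
    """Count (true, predicted) pairs; rows are true labels."""
    con_mat = np.zeros((cats, cats))
    # np.add.at accumulates correctly when the same (row, col) pair repeats
    np.add.at(con_mat, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return con_mat

def off_diagonal_percent(con_mat):
    """Express each off-diagonal entry as a percentage of that row's errors."""
    errors = con_mat.copy()
    np.fill_diagonal(errors, 0)
    row_totals = errors.sum(axis=1, keepdims=True)
    # rows with no errors stay all-zero instead of dividing by zero
    return np.divide(errors * 100, row_totals,
                     out=np.zeros_like(errors), where=row_totals != 0)
```

A plain `con_mat[y_true, y_pred] += 1` would silently drop repeated index pairs, which is why `np.add.at` is used here.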

## Exercise

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def error_score(y, y_pred):
    return 1 - accuracy_score(y, y_pred)

### 1. Basic Results
We split our data into 80%/20% train and test. We perform 20 runs for $d = 1, ..., 7$, and report the mean test and training errors with their standard deviations.

In [34]:
# define basic run for part 1.1

def basic_run(X_train, X_test, y_train, y_test, kernel, epochs=2, progress=False):    
    #fit model
    mkp = VectorizedOneVsAllKernelPerceptron(X_train, y_train, kernel)
    mkp.train_for_epochs(epochs, progress=progress)
    
    #return errors
    error_train = error_score(y_train, mkp.predict_all(X_train))
    error_test = error_score(y_test, mkp.predict_all(X_test))
    
    return {'err_train': error_train, 'err_test': error_test, 'model': mkp}

In [35]:
# perform basic runs
iterations = 20
ds = list(range(1, 8))
err_train = {d: [] for d in ds}
err_test = {d: [] for d in ds}

for iteration in tqdm(list(range(iterations))):
    # split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)
    
    for d in ds:
        # fit and evaluate a perceptron with polynomial degree d
        results = basic_run(X_train, X_test, y_train, y_test, polynomial_kernel(d), epochs=5)
        err_train[d].append(results['err_train'])
        err_test[d].append(results['err_test'])
    
err_train_mean = {d: np.mean(errs) for d, errs in err_train.items()}
err_test_mean = {d: np.mean(errs) for d, errs in err_test.items()}
err_train_std = {d: np.std(errs) for d, errs in err_train.items()}
err_test_std = {d: np.std(errs) for d, errs in err_test.items()}

100%|██████████| 20/20 [10:57<00:00, 31.34s/it]


In [23]:
# display in dataframe
df_err = pd.DataFrame([err_train_mean, err_test_mean,
                       err_train_std, err_test_std], 
                       index=['train_mean', 'test_mean', 'train_std', 'test_std'], 
                       columns=ds).T
df_err

| d | train_mean | test_mean | train_std | test_std |
|---|---|---|---|---|
| 1 | 0.929739 | 0.907231 | 0.00808 | 0.010948 |
| 2 | 0.993063 | 0.961962 | 0.002525 | 0.005137 |
| 3 | 0.998118 | 0.967554 | 0.000893 | 0.003703 |
| 4 | 0.999106 | 0.97172 | 0.000899 | 0.004483 |
| 5 | 0.999597 | 0.971371 | 0.000195 | 0.004287 |
| 6 | 0.999718 | 0.972043 | 0.000194 | 0.003896 |
| 7 | 0.999751 | 0.971102 | 0.000129 | 0.002379 |
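
The exercise asks for each cell of the results table to contain a mean±std. A minimal sketch of that formatting with pandas follows; `format_mean_std` is a hypothetical helper, and the numbers in the demo are made up purely to show the layout, not results from this notebook.

```python
import pandas as pd

def format_mean_std(means, stds, decimals=4):
    """Combine two {d: value} dicts into a Series of 'mean ± std' strings."""
    return pd.Series(
        {d: f"{means[d]:.{decimals}f} ± {stds[d]:.{decimals}f}" for d in means}
    )

# illustrative numbers only (not measured values)
demo_mean = {1: 0.0712, 2: 0.0381}
demo_std = {1: 0.0102, 2: 0.0049}

# one row per quantity, one column per d, matching the 2 x 7 layout requested
table = pd.DataFrame({'train': format_mean_std(demo_mean, demo_std)}).T
print(table)
```

In the notebook, the same helper could be applied to `err_train_mean`/`err_train_std` and `err_test_mean`/`err_test_std` to build the two rows.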


___

### 2. Cross-validation

We split our data into 80%/20% train and test. We then use 5-fold cross-validation on the training split to find the best parameter $d^* \in \{1, \ldots, 7\}$. We then retrain the kernelised perceptron with $d^*$ on the full training set and record its test error, repeating this over 20 runs. We report the mean test error and the mean $d^*$, each with its standard deviation.

In [None]:
def make_fold_indices(n, k=5):
    ixs = np.array(range(n))
    np.random.shuffle(ixs)
    folds = np.array_split(ixs, k)
    fold_ixs = np.zeros(n)
    for i in range(k):
        fold_ixs[folds[i]] = i
    return fold_ixs

In [None]:
# generate k folds and perform cross-validation on them, returning the mean validation error across folds.
def cross_validation_error(X, y, kernel, epochs=2, k=5):
    fold_ixs = make_fold_indices(len(X), k=k)

    cv_errs = []
    for fold_ix in np.unique(fold_ixs):
        X_val = X[fold_ixs == fold_ix]
        y_val = y[fold_ixs == fold_ix]
        X_train = X[fold_ixs != fold_ix]
        y_train = y[fold_ixs != fold_ix]
        
        # fit model on the remaining k-1 folds (train once, for the requested number of epochs)
        mkp = VectorizedOneVsAllKernelPerceptron(X_train, y_train, kernel)
        mkp.train_for_epochs(epochs)
        
        # record validation fold error
        cv_errs.append(error_score(y_val, mkp.predict_all(X_val)))
        
    return np.mean(cv_errs)

In [None]:
# perform cross-validation runs

iterations = 5  # note: the exercise statement specifies 20 runs
ds = list(range(1, 8))
errs_cv = {}

d_stars = []
errs_test = []
confusion_matrices = []
for iteration in tqdm(list(range(iterations))):
    # split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)
    
    # perform cross validations
    for d in ds:
        errs_cv[d] = cross_validation_error(X_train, y_train, polynomial_kernel(d))
        
    # get best parameter
    d_star = min(errs_cv, key=errs_cv.get)
    d_stars.append(d_star)
    
    # get final error
    results = basic_run(X_train, X_test, y_train, y_test, polynomial_kernel(d_star))
    errs_test.append(results['err_test'])
    
    # compute confusion matrices too (so as to avoid recomputing in Q3)
    confusion_matrices.append(confusion_error(X_test, y_test, results['model']))
    
# compute results   
err_test_mean = np.mean(errs_test)
d_star_mean = np.mean(d_stars)
err_test_std = np.std(errs_test)
d_star_std = np.std(d_stars)

In [None]:
# display in dataframe
df_err = pd.DataFrame([[err_test_mean, err_test_std],
                       [d_star_mean, d_star_std]], 
                       columns=['mean', 'std'], index=['err_test', 'd_star']).T
print("Answer to 2:")
df_err

In [None]:
d_stars

In [None]:
errs_test


### 3. Confusion Matrix

We compute the confusion matrix for the above perceptron. It is not clear whether Mark means a confusion matrix for the training set or the test set; it probably makes sense to look at both. The cells below use the test-set matrices collected during the cross-validation runs.

In [None]:
# display in dataframe
confusion_matrix_array = np.array(confusion_matrices)
confus_error_mean = np.mean(confusion_matrix_array, axis=0)
confus_error_std = np.std(confusion_matrix_array, axis=0)

df_cm_means = pd.DataFrame(confus_error_mean)
print(df_cm_means)

df_cm_std = pd.DataFrame(confus_error_std)
print(df_cm_std)

In [None]:
break

### 4. Hardest Predictions

We define "hardest to predict" as: ... 

We then print out the five hardest to predict images, and discuss each. 

___

### 5. Gaussian Kernel

Repeat 1 with a Gaussian kernel.

In [None]:
width_list = np.arange(0.5, 1.7, 0.2)
df_gaus = basic_res(X, y, 3, width_list, kernel='gaussian')
df_gaus

Repeat 2 with a Gaussian kernel.

In [None]:
q2_gaussian = cross_val_run(X, y, runs=3, epochs=2, kernel='gaussian')
q2_gaussian

___

1. Basic Results: Perform 20 runs for $d = 1, \ldots, 7$; each run should randomly split zipcombo into 80%
train and 20% test. Report the mean test and train errors as well as standard deviations.
Thus your data table, here, will be 2 × 7 with each “cell” containing a mean±std.
2. Cross-validation: Perform 20 runs: when using the 80% training data split from within to perform
5-fold cross-validation to select the “best” parameter $d^*$, then retrain on full 80% training set using
$d^*$ and then record the test errors on the remaining 20%. Thus you will find 20 $d^*$ and 20 test errors.
Your final result will consist of a mean test error±std and a mean $d^*$ with std.
3. Confusion matrix: Perform 20 runs: when using the 80% training data split that further to perform
5-fold cross-validation to select the “best” parameter $d^*$, retrain on the 80% training using $d^*$ and
produce a confusion matrix. Here the goal is to find “confusions” thus if the true label was “7” and
“2” was predicted then a “mistake” should be recorded for “(7,2)”; the final output will be a 10 × 10
matrix where each cell contains a confusion error and its standard deviation. Note the diagonal will
be 0.
4. Within the dataset relative to your experiments there will be five hardest to predict correctly “data
items.” Print out the visualisation of these five digits along with their labels. Is it surprising that
these are hard to predict?
5. Repeat 1 and 2 ($d^*$ is now $c$ and $\{1, \ldots, 7\}$ is now $S$) above with a Gaussian kernel
$K(p, q) = e^{-c\|p - q\|^2}$,
$c$ the width of the kernel is now a parameter which must be optimised during cross-validation however,
you will also need to perform some initial experiments to decide a reasonable set $S$ of values to cross-validate $c$ over.
6. Choose (research) an alternate method to generalise to k-classes then repeat 1 and 2.
7. Choose two more algorithms to compare to the kernel perceptron each of these algorithms will have
a parameter vector θ and you will need to determine a cross-validation set Sθ with this information
repeat 1 and 2 for your new algorithms.
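
The Gaussian kernel from item 5, $K(p, q) = e^{-c\|p - q\|^2}$, can be evaluated for all pairs of points at once. The sketch below is one way to do that with NumPy; `gaussian_kernel_matrix` is a hypothetical name, and how `src.kernels` exposes a Gaussian kernel (if at all) is not shown in this notebook.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, c):
    """K[i, j] = exp(-c * ||A[i] - B[j]||^2) for every pair of rows."""
    sq_a = np.sum(A ** 2, axis=1)[:, None]
    sq_b = np.sum(B ** 2, axis=1)[None, :]
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b; clip tiny negatives from rounding
    sq_dists = np.maximum(sq_a + sq_b - 2 * A @ B.T, 0.0)
    return np.exp(-c * sq_dists)
```

Larger values of `c` make the kernel more local (similarity decays faster with distance), which is why a sensible candidate set $S$ needs some initial experimentation.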