# Linear Classification Project

In [150]:
# Packages
import numpy as np
import pandas as pd

# Optimisation
from scipy.optimize import minimize

# Models
from sklearn.linear_model import LinearRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# To partition data
from sklearn.model_selection import train_test_split

# For validation and parameter tuning
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import GridSearchCV

# Metrics
from sklearn.metrics import accuracy_score

The goal of this project is to get experience applying the methods from Chapter 4.

Things I would like to try:
- Linear regression of indicator matrix
- Linear discriminant analysis
    - LDA but choosing cut-point to minimise training error (validation?)
    - Comparing QDA with LDA on quadratic terms
    - Regularised LDA (validation)
    - Projection onto canonical variates
    - Reduced-Rank LDA (validation)
- Logistic Regression
    - Hypothesis testing using Z-scores
    - Subset selection using Wald Test, Likelihood Ratio Test, or Rao's Score Test
    - $L^1$-regularised Logistic Regression (validation)
- Separating hyperplanes? Presumably won't actually be possible

Some of these would work better on binary classification problems (such as spam data) and some would work better on classification problems with more than two classes (such as vowel data). I could always do a mixture but that wouldn't allow me to compare performance on test data.

The logistic regression hypothesis tests have only been described for binary classification problems. The disadvantage of the vowel data is that there aren't very many data points. A downside with binary classification is that a lot of these techniques become very similar.

I will move forward with vowel data.

## Data Preparation

This is the same dataset as in the enclosing folder. To get a better split in the data I have recombined it and split it into training, validation, and test sets in the proportion 60-20-20.
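
The split itself is not shown in this notebook. Below is a minimal sketch of one way to produce a 60-20-20 partition with `train_test_split`, assuming the recombined data has 990 rows (594 + 198 + 198). The synthetic arrays, the stratification, and the `random_state` are assumptions for illustration, not necessarily the procedure that produced the CSV files.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the recombined vowel data: 990 rows, 11 classes, 10 features
rng = np.random.default_rng(0)
y_all = np.repeat(np.arange(11), 90)
X_all = rng.normal(size=(990, 10))

# Hold out 198 rows for testing, then 198 of the remaining 792 for validation
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=198, stratify=y_all, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=198, stratify=y_rest, random_state=0)

split_sizes = (len(y_tr), len(y_va), len(y_te))
print(split_sizes)
```

Passing integer sizes rather than fractions avoids any rounding ambiguity in the partition sizes.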

In [52]:
# Import data
train = pd.read_csv('vowel-train.csv', header=None)
val = pd.read_csv('vowel-val.csv', header=None)
test = pd.read_csv('vowel-test.csv', header=None)

# Combine the training/validation set into a single dataframe
trainval = pd.concat([train, val]).reset_index(drop = True)

# Take a look at the data
trainval.head(5)

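| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | -3.725 | 1.904 | -0.737 | 0.433 | -0.369 | 1.047 | 0.12 | 0.425 | -0.678 | -0.391 |
| 1 | 5 | -2.967 | 2.781 | -1.277 | 0.354 | -0.936 | 1.505 | -0.004 | -0.418 | -0.56 | 0.725 |
| 2 | 8 | -4.175 | 3.32 | -0.446 | 0.988 | -1.48 | 0.133 | 0.507 | 0.605 | 0.691 | -0.462 |
| 3 | 4 | -3.194 | 1.589 | -0.774 | 0.814 | -1.087 | 0.618 | 0.218 | -0.45 | -0.003 | 0.526 |
| 4 | 4 | -2.03 | 1.764 | -0.386 | -0.249 | 0.18 | 0.117 | 0.096 | -0.121 | 0.067 | -0.552 |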


In [53]:
# Display descriptive statistics
trainval.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 792.0 | 6.039141 | 3.180175 | 1.0 | 3.0 | 6.0 | 9.0 | 11.0 |
| 1 | 792.0 | -3.220919 | 0.866414 | -5.211 | -3.90375 | -3.171 | -2.6265 | -0.961 |
| 2 | 792.0 | 1.897836 | 1.176604 | -1.274 | 1.0705 | 1.907 | 2.7855 | 5.074 |
| 3 | 792.0 | -0.529126 | 0.706045 | -2.487 | -0.99925 | -0.586 | -0.10225 | 1.431 |
| 4 | 792.0 | 0.530307 | 0.743921 | -1.247 | -0.04725 | 0.4675 | 1.0985 | 2.377 |
| 5 | 792.0 | -0.306451 | 0.67609 | -2.127 | -0.794 | -0.312 | 0.17625 | 1.831 |
| 6 | 792.0 | 0.641318 | 0.59295 | -0.836 | 0.22975 | 0.5625 | 1.0225 | 2.327 |
| 7 | 792.0 | -0.005274 | 0.449826 | -1.454 | -0.31225 | 0.033 | 0.28275 | 1.286 |
| 8 | 792.0 | 0.333104 | 0.578171 | -1.293 | -0.10575 | 0.322 | 0.77325 | 1.972 |
| 9 | 792.0 | -0.308096 | 0.568027 | -1.613 | -0.708 | -0.3005 | 0.09125 | 1.309 |


In [54]:
# Print values
print('{} training samples'.format(train.shape[0]))
print('{} validation samples'.format(val.shape[0]))
print('{} test samples'.format(test.shape[0]))
print('{} features'.format(train.shape[1] - 1))

# Classes
print('Class labels: ', np.sort(train.iloc[:, 0].unique()))

594 training samples
198 validation samples
198 test samples
10 features
Class labels:  [ 1  2  3  4  5  6  7  8  9 10 11]


It would be more convenient if our class labels were 0, ..., 10, so I'll subtract 1 from the outputs.

In [55]:
# Relabel classes by subtracting one
train.iloc[:, 0] = train.iloc[:, 0] - 1
trainval.iloc[:, 0] = trainval.iloc[:, 0] - 1
val.iloc[:, 0] = val.iloc[:, 0] - 1
test.iloc[:, 0] = test.iloc[:, 0] - 1

In [94]:
# Occurrences of different classes in training/validation data
class_counts = trainval.iloc[:, 0].value_counts()
class_counts[list(range(K))]

0     72
1     67
2     77
3     71
4     75
5     65
6     69
7     73
8     74
9     75
10    74
Name: 0, dtype: int64

The classes occur in approximately equal proportions, so the base error rate (from always predicting the most common class) is around $1 - \frac{1}{K}$ with $K = 11$.

In [97]:
print('Base error rate: {:.4f}'.format(1 - 1/11))

Base error rate: 0.9091
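
Since the counts are not exactly equal, the majority-class error rate can also be computed directly from them. A small sketch using the class counts printed above (the array is copied by hand from that output):

```python
import numpy as np

# Class counts for labels 0..10, copied from the value_counts output above
counts = np.array([72, 67, 77, 71, 75, 65, 69, 73, 74, 75, 74])

# Always predicting the most frequent class misclassifies everything else
base_err = 1 - counts.max() / counts.sum()
print('Majority-class error rate: {:.4f}'.format(base_err))
```

On the training/validation data this comes to roughly 0.903, close to the $1 - \frac{1}{11}$ approximation.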


Now we prepare the data for use.

In [56]:
# Split into inputs and outputs
y_train = train.iloc[:, 0].to_numpy()
X_train = train.iloc[:, 1:].to_numpy()
N_train = len(y_train)

y_trainval = trainval.iloc[:, 0].to_numpy()
X_trainval = trainval.iloc[:, 1:].to_numpy()
N_trainval = len(y_trainval)

y_val = val.iloc[:, 0].to_numpy()
X_val = val.iloc[:, 1:].to_numpy()
N_val = len(y_val)

y_test = test.iloc[:, 0].to_numpy()
X_test = test.iloc[:, 1:].to_numpy()
N_test = len(y_test)

# Some useful constants
p = X_train.shape[1]
K = len(np.unique(y_train))

# Predefined split to aid validation with sklearn
test_fold = [-1] * N_train + [0] * N_val
ps = PredefinedSplit(test_fold)
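
A sketch of how a split like `ps` could be passed to `GridSearchCV` so that every candidate is fitted on the training rows and scored on the validation rows. It assumes the rows of the combined array are ordered training first, validation second, as in `trainval`. The synthetic data, the shrinkage grid, and the `lsqr` solver are illustrative choices, not results from this project.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-in: 60 "training" rows followed by 20 "validation" rows
rng = np.random.default_rng(0)
n_tr, n_va = 60, 20
y_demo = rng.integers(0, 3, size=n_tr + n_va)
X_demo = rng.normal(size=(n_tr + n_va, 4)) + y_demo[:, None]

# -1 marks rows that are only ever used for fitting; 0 marks the single validation fold
demo_fold = [-1] * n_tr + [0] * n_va
demo_ps = PredefinedSplit(demo_fold)

# Shrinkage requires the 'lsqr' or 'eigen' solver; the grid values are arbitrary
search = GridSearchCV(
    LinearDiscriminantAnalysis(solver='lsqr'),
    param_grid={'shrinkage': [0.0, 0.1, 0.5, 1.0]},
    cv=demo_ps,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print(demo_ps.get_n_splits())
print(search.best_params_)
```

By default `GridSearchCV` refits the best candidate on all rows passed to `fit`, so the final estimator here is trained on training and validation data combined.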

## Linear Regression of Indicator Matrix

We begin by simply performing ordinary least squares regression on the indicator matrix.

In [98]:
def gen_indicator_responses(y, K):
    'Turn output vector with entries 0,..,K-1 into indicator response matrix'
    N = y.shape[0]

    Y = np.zeros(shape=(N, K))
    Y[range(N), y] = 1

    return Y

In [128]:
# Indicator response matrices
Y_train = gen_indicator_responses(y_train, K)
Y_trainval = gen_indicator_responses(y_trainval, K)
#Y_val = gen_indicator_responses(y_val, K)
#Y_test = gen_indicator_responses(y_test, K)

We fit on the training set and use the validation set to get an estimate of prediction error.

In [117]:
# Fit regression model to training data
ols = LinearRegression()
ols.fit(X_train, Y_train)

# Generate classification predictions for training data
y_train_pred_ols = ols.predict(X_train)
y_train_pred_ols = np.argmax(y_train_pred_ols, axis=1)
train_err_ols = 1 - accuracy_score(y_train, y_train_pred_ols)
print('Ordinary least squares training error: {:.3f}'.format(train_err_ols))

# Classification predictions for validation data
y_val_pred_ols = ols.predict(X_val)
y_val_pred_ols = np.argmax(y_val_pred_ols, axis=1)
val_err_ols = 1 - accuracy_score(y_val, y_val_pred_ols)
print('Ordinary least squares validation error: {:.3f}'.format(val_err_ols))

Ordinary least squares training error: 0.540
Ordinary least squares validation error: 0.591
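
A small numpy sketch of the same procedure on made-up data, to make the mechanics explicit: regress each indicator column on the inputs (with an intercept), then classify to the column with the largest fitted value. The toy data and the `_demo` names are hypothetical. The sketch also checks a known property of this fit: with an intercept in the model, the fitted values in each row sum to one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3 classes, 2 features, class means spread apart
K_demo, N_demo = 3, 90
y_demo = np.repeat(np.arange(K_demo), N_demo // K_demo)
X_demo = rng.normal(size=(N_demo, 2)) + 3 * np.eye(K_demo)[y_demo][:, :2]

# Indicator response matrix: row i has a 1 in column y_demo[i]
Y_demo = np.eye(K_demo)[y_demo]

# Least squares fit with an explicit intercept column
X1 = np.column_stack([np.ones(N_demo), X_demo])
B_hat, *_ = np.linalg.lstsq(X1, Y_demo, rcond=None)

# Fitted values, then classification by the largest fitted component
F_hat = X1 @ B_hat
y_hat = np.argmax(F_hat, axis=1)

row_sums = F_hat.sum(axis=1)
demo_err = np.mean(y_hat != y_demo)
print(row_sums[:3], demo_err)
```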


## Discriminant Analysis

We begin with standard linear discriminant analysis.

In [122]:
# Fit model
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Training error
y_train_pred_lda = lda.predict(X_train)
train_err_lda = 1 - accuracy_score(y_train, y_train_pred_lda)
print('Linear discriminant analysis training error: {:.3f}'.format(train_err_lda))

# Validation error
y_val_pred_lda = lda.predict(X_val)
val_err_lda = 1 - accuracy_score(y_val, y_val_pred_lda)
print('Linear discriminant analysis validation error: {:.3f}'.format(val_err_lda))

Linear discriminant analysis training error: 0.315
Linear discriminant analysis validation error: 0.409
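
For reference, LDA classifies to the class with the largest linear discriminant $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$. The sketch below computes these by hand in numpy on made-up data. It is not sklearn's implementation, and the toy data and `_demo` names are hypothetical. The vector `b` collects the intercept terms, which are the quantities a cut-point adjustment would change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3 classes sharing one covariance matrix (the LDA assumption)
K_demo, n_per = 3, 50
means_true = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
y_demo = np.repeat(np.arange(K_demo), n_per)
X_demo = means_true[y_demo] + rng.normal(size=(K_demo * n_per, 2))
N_demo = len(y_demo)

# Estimates: class priors, class means, pooled within-class covariance
priors = np.bincount(y_demo) / N_demo
mu = np.array([X_demo[y_demo == k].mean(axis=0) for k in range(K_demo)])
resid = X_demo - mu[y_demo]
Sigma = resid.T @ resid / (N_demo - K_demo)

# Linear discriminants: delta_k(x) = x^T a_k + b_k
A = np.linalg.solve(Sigma, mu.T).T                   # row k is Sigma^{-1} mu_k
b = -0.5 * np.sum(mu * A, axis=1) + np.log(priors)   # intercept terms

delta = X_demo @ A.T + b
y_hat = np.argmax(delta, axis=1)
lda_demo_err = np.mean(y_hat != y_demo)
print(lda_demo_err)
```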


HTF suggest setting the cut-point to minimise misclassification error over the training data.

Optimising this isn't working at the moment.

**Tomorrow**:
- Fix this
- Think about when I want to be using just training data and when I want to use training and validation

In [159]:
# Baseline: keep the LDA coefficients but set every intercept to zero
intercept = np.zeros(shape=K)
y_pred = np.argmax(intercept + X_train @ lda.coef_.T, axis=1)
err = 1 - accuracy_score(y_train, y_pred)
print(err)

0.7861952861952862


In [157]:
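def discriminant(intercept, gradient, y):
    'Misclassification error with the first K-1 intercepts free and the last fixed at 0'
    # gradient holds the linear scores X @ coef.T, one column per class
    intercept = np.append(intercept, 0)
    y_pred = np.argmax(intercept + gradient, axis=1)
    err = 1 - accuracy_score(y, y_pred)

    return err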

In [184]:
discriminant(x0, gradient, y)

0.40909090909090906

In [185]:
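# Start from the fitted LDA intercepts, expressed relative to the last class
x0 = [lda.intercept_[i] - lda.intercept_[K-1] for i in range(K-1)]
# Linear scores without intercepts (computed here on the validation inputs)
gradient = X_val @ lda.coef_.T
y = y_val

# No method is given, so scipy uses its default for unconstrained problems
res = minimize(discriminant, x0=x0, args=(gradient, y))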

In [186]:
res

      fun: 0.40909090909090906
 hess_inv: array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
      jac: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
  message: 'Optimization terminated successfully.'
     nfev: 12
      nit: 0
     njev: 1
   status: 0
  success: True
        x: array([ -6.0295333 ,   1.35236471,  12.34263333,  16.32726963,
         4.36711261,   9.36933876,  -2.53462038, -26.1909396 ,
       -19.85781028, -26.60995777])

In [183]:
x0

[-6.029533296027886,
 1.3523647123435274,
 12.34263332893547,
 16.327269631518224,
 4.367112606697166,
 9.36933875805241,
 -2.5346203799362574,
 -26.190939600022908,
 -19.85781028463713,
 -26.609957765278388]