# Classification

###### COMP4670/8600 - Introduction to Statistical Machine Learning - Tutorial 3

$\newcommand{\trace}[1]{\operatorname{tr}\left\{#1\right\}}$
$\newcommand{\Norm}[1]{\lVert#1\rVert}$
$\newcommand{\RR}{\mathbb{R}}$
$\newcommand{\inner}[2]{\langle #1, #2 \rangle}$
$\newcommand{\DD}{\mathscr{D}}$
$\newcommand{\grad}[1]{\operatorname{grad}#1}$
$\DeclareMathOperator*{\argmin}{arg\,min}$

Setting up the environment

In [29]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.optimize as opt
from scipy.special import expit # The logistic sigmoid function 

%matplotlib inline

## The data set

We will predict the incidence of diabetes based on various measurements (see [description](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes)). Instead of directly using the raw data, we use a normalised version, where the label to be predicted (the incidence of diabetes) is in the first column. Download the data from [mldata.org](http://mldata.org/repository/data/download/csv/diabetes_scale/).

Read in the data using pandas.

In [30]:
names = ['diabetes', 'num preg', 'plasma', 'bp', 'skin fold', 'insulin', 'bmi', 'pedigree', 'age']
data = pd.read_csv('diabetes_scale.csv', header=None, names=names)
data['diabetes'].replace(-1, 0, inplace=True) # The target variable need be 1 or 0, not 1 or -1
data.head()

Unnamed: 0,diabetes,num preg,plasma,bp,skin fold,insulin,bmi,pedigree,age
0,0,-0.294118,0.487437,0.180328,-0.292929,-1.0,0.00149,-0.53117,-0.033333
1,1,-0.882353,-0.145729,0.081967,-0.414141,-1.0,-0.207153,-0.766866,-0.666667
2,0,-0.058824,0.839196,0.04918,-1.0,-1.0,-0.305514,-0.492741,-0.633333
3,1,-0.882353,-0.105528,0.081967,-0.535354,-0.777778,-0.162444,-0.923997,-1.0
4,0,-1.0,0.376884,-0.344262,-0.292929,-0.602837,0.28465,0.887276,-0.6


## Classification via Logistic Regression

Implement binary classification using logistic regression for a data set with two classes. Make sure you use appropriate [python style](https://www.python.org/dev/peps/pep-0008/) and [docstrings](https://www.python.org/dev/peps/pep-0257/).

Use ```scipy.optimize.fmin_bfgs``` to optimise your cost function. ```fmin_bfgs``` requires the cost function to be optimised, and the gradient of this cost function. Implement these two functions as ```cost``` and ```grad``` by following the equations in the lectures.

Implement the function ```train``` that takes a matrix of examples, and a vector of labels, and returns the maximum likelihood weight vector for logistic regresssion. Also implement a function ```test``` that takes this maximum likelihood weight vector and the a matrix of examples, and returns the predictions. See the section **Putting everything together** below for expected usage.

We add an extra column of ones to represent the constant basis.

In [31]:
data['ones'] = np.ones((data.shape[0], 1)) # Add a column of ones
data.head()

Unnamed: 0,diabetes,num preg,plasma,bp,skin fold,insulin,bmi,pedigree,age,ones
0,0,-0.294118,0.487437,0.180328,-0.292929,-1.0,0.00149,-0.53117,-0.033333,1.0
1,1,-0.882353,-0.145729,0.081967,-0.414141,-1.0,-0.207153,-0.766866,-0.666667,1.0
2,0,-0.058824,0.839196,0.04918,-1.0,-1.0,-0.305514,-0.492741,-0.633333,1.0
3,1,-0.882353,-0.105528,0.081967,-0.535354,-0.777778,-0.162444,-0.923997,-1.0,1.0
4,0,-1.0,0.376884,-0.344262,-0.292929,-0.602837,0.28465,0.887276,-0.6,1.0


In [32]:
def cost(w, X, y):
    """
    Returns the cross-entropy error function.
    
    w -- parameters
    X -- dataset of features where each row corresponds to a single sample
    y -- dataset of labels where each row corresponds to a single sample
    """
    outputs = expit(X.dot(w)) # Vector of outputs (or predictions)
    return -( y.transpose().dot(np.log(outputs)) + (1-y).transpose().dot(np.log(1-outputs)) )

def grad(w, X, y):
    """
    Returns the gradient of the cross-entropy error function.
    """
    outputs = expit(X.dot(w))
    return X.transpose().dot(outputs-y)
    
def train(X, y):
    """
    Returns the vector of parameters which minimizes the cross-entropy error function via the BFGS algorithm.
    """
    initial_values = np.zeros(X.shape[1]) # Error occurs if inital_values is set too high
    return opt.fmin_bfgs(cost, initial_values, fprime=grad, args=(X,y))

def predict(w, X):
    """
    Returns a vector of predictions. 
    """
    return expit(X.dot(w))

## Performance measure

There are many ways to compute the [performance of a binary classifier](http://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers). The key concept is the idea of a confusion matrix or contingency table:

|              |    | Label |    |
|:-------------|:--:|:-----:|:--:|
|              |    |  +1   | -1 |
|**Prediction**| +1 |    TP | FP |
|              | -1 |    FN | TN |

where
* TP - true positive
* FP - false positive
* FN - false negative
* TN - true negative

Implement three functions, the first one which returns the confusion matrix for comparing two lists (one set of predictions, and one set of labels). Then implement two functions that take the confusion matrix as input and returns the **accuracy** and **balanced accuracy** respectively. The [balanced accuracy](http://en.wikipedia.org/wiki/Accuracy_and_precision) is the average accuracy of each class.


In [33]:
def confusion_matrix(predictions, y): 
    """
    Returns the confusion matrix [[tp, fp], [fn, tn]].
    
    predictions -- dataset of predictions (or outputs) from a model
    y -- dataset of labels where each row corresponds to a single sample
    """
    tp, fp, fn, tn = 0, 0, 0, 0
    predictions = predictions.round().values # Converts to numpy.ndarray
    y = y.values
    for prediction, label in zip(predictions, y):
        if prediction == label: 
            if prediction == 1:
                tp += 1
            else:
                tn += 1
        else: 
            if prediction == 1:
                fp += 1
            else: 
                fn += 1
    return np.array([[tp, fp], [fn, tn]])

def accuracy(cm):
    """
    Returns the accuracy, (tp + fp)/(tp + fp + fn + tn).  
    """
    return cm.trace()/cm.sum()

def balanced_accuracy(cm): 
    """
    Returns the balanced accuracy, (tp/p + tn/n)/2.
    """
    return (cm[0,0]/(cm[0,0] + cm[0,1]) +  cm[1,1]/(cm[1,0] + cm[1,1]))/2
    

## Putting everything together

Consider the following code, which trains on all the examples, and predicts on the training set. Discuss the results.

In [34]:
y = data['diabetes']
X = data[['num preg', 'plasma', 'bp', 'skin fold', 'insulin', 'bmi', 'pedigree', 'age', 'ones']]
theta_best = train(X, y)
pred = predict(theta_best, X)
cmatrix = confusion_matrix(pred, y)
[accuracy(cmatrix), balanced_accuracy(cmatrix)]

Optimization terminated successfully.
         Current function value: 361.722693
         Iterations: 18
         Function evaluations: 30
         Gradient evaluations: 30


[0.78255208333333337, 0.76912964680456408]

### Solution description

## Fisher's discriminant

In the lectures, you saw that the Fisher criterion
$$
J(w) = \frac{w^T S_B w}{w^T S_W w}
$$
is maximum for Fisher's linear discriminant.

Define $S_B$ and $S_W$ as in the lectures and prove this result.

### Solution description

### Solution description

## (optional) Effect of regularization parameter

By splitting the data into two halves, train on one half and report performance on the second half. By repeating this experiment for different values of the regularization parameter $\lambda$ we can get a feeling about the variability in the performance of the classifier due to regularization. Plot the values of accuracy and balanced accuracy for at least 3 different choices of $\lambda$. Note that you may have to update your implementation of logistic regression to include the regularisation parameter.


In [35]:
### Solution

def split_data(data):
    """Randomly split data into two equal groups"""
    np.random.seed(1)
    N = len(data)
    idx = np.arange(N)
    np.random.shuffle(idx)
    train_idx = idx[:int(N/2)]
    test_idx = idx[int(N/2):]

    X_train = data.loc[train_idx].drop('diabetes', axis=1)
    t_train = data.loc[train_idx]['diabetes']
    X_test = data.loc[test_idx].drop('diabetes', axis=1)
    t_test = data.loc[test_idx]['diabetes']
    
    return X_train, t_train, X_test, t_test


### Solution description