Dave Brunner
Group G27

In [108]:
import torch
import numpy as np
import matplotlib.pyplot as plt

## Binary Classification

Here, we use a tabular dataset from Kaggle (https://www.kaggle.com/sammy123/lower-back-pain-symptoms-dataset) with features describing patients' physical spine measurements. It is possibly suited for classifying whether a person is 'abnormal' or 'normal', i.e. possibly suffers from back pain or not.

Here we only want to see how training works with logistic regression (binary case). We set aside a proper treatment of the learning experiment, which would split the data into a train and a test partition (and, in general, a validation partition as well). The focus is on making the system learn something.

1. Download the dataset from kaggle (see the link in the notebook). Load it into a pandas dataframe (see the code in the notebook). Normalise the data.
2. Complete the code for the implementation of the methods `predict`, `cost`, `gradient_cost`, `accuracy`. As a test, just invoke the methods with suitable dummy values.
3. Implement (full) batch GD for minimizing the CE cost (without autograd). Plot cost vs the number of epochs.
4. Implement (full) batch GD for minimizing the CE cost, this time with autograd. Show that you obtain consistent results.
5. Tune the learning rate. What is a reasonable learning rate?

### 1. Load Data

In [109]:
import pandas as pd
df = pd.read_csv("./data/Dataset_spine.csv") # possibly modify!
df = df.drop(columns=['Unnamed: 13'])
N  = df.shape[0]
df.head()

Unnamed: 0,Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8,Col9,Col10,Col11,Col12,Class_att
0,63.027817,22.552586,39.609117,40.475232,98.672917,-0.2544,0.744503,12.5661,14.5386,15.30468,-28.658501,43.5123,Abnormal
1,39.056951,10.060991,25.015378,28.99596,114.405425,4.564259,0.415186,12.8874,17.5323,16.78486,-25.530607,16.1102,Abnormal
2,68.832021,22.218482,50.092194,46.613539,105.985135,-3.530317,0.474889,26.8343,17.4861,16.65897,-29.031888,19.2221,Abnormal
3,69.297008,24.652878,44.311238,44.64413,101.868495,11.211523,0.369345,23.5603,12.7074,11.42447,-30.470246,18.8329,Abnormal
4,49.712859,9.652075,28.317406,40.060784,108.168725,7.918501,0.54336,35.494,15.9546,8.87237,-16.378376,24.9171,Abnormal


#### Normalization and Turning into Torch Tensors

In [110]:
x0 = torch.from_numpy(df.values[:,0:-1].astype(np.float64))
X = (x0-torch.mean(x0, dim=0))/torch.std(x0,dim=0)
Y = torch.tensor(('Abnormal'==df.values[:,-1])).int().reshape(-1,1)
print(X.shape, Y.shape)

torch.Size([310, 12]) torch.Size([310, 1])


### 2. Implement the Model for (Binary) Logistic Regression

Data:  $\,\qquad X = \left(\begin{array}{cccc} 1 & X_{11} & \dots & X_{1n} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & X_{N1} & \dots & X_{Nn}\end{array}\right)\qquad$ and $\qquad Y = \left(\begin{array}{c} Y_{1} \\ \vdots \\ Y_{N} \end{array}\right)$

Model: $\qquad\hat{Y}(X;W) = \sigma\left(X W^\intercal\right) \qquad$ where $\qquad W = \left(\begin{array}{cccc} W_0 & W_1 & \dots & W_n \end{array}\right)$ is a row vector of shape $(1, n+1)$, matching the shape used in the code.

For a sample $x$, the model outputs the probability of observing the label '1' (Abnormal).

Cost:  $\,\qquad C(W) = -\frac{1}{N}\sum_j \left(Y_j\log(\hat{Y}_j(X;W)) + (1-Y_j)\log(1-\hat{Y}_j(X;W))\right)$
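
For reference, the gradient needed for gradient descent can be derived from this cost. The following is a short sketch, assuming $W$ is stored as a row of shape $(1, n+1)$ as in the code, writing $X_{j0} = 1$ for the bias column and using $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

$$
\frac{\partial C}{\partial W_k}
= -\frac{1}{N}\sum_j \left(\frac{Y_j}{\hat{Y}_j} - \frac{1-Y_j}{1-\hat{Y}_j}\right)\hat{Y}_j\,(1-\hat{Y}_j)\,X_{jk}
= \frac{1}{N}\sum_j \left(\hat{Y}_j - Y_j\right) X_{jk}
$$

In matrix form this reads $\nabla_W C = \frac{1}{N}\left(\hat{Y} - Y\right)^\intercal X$, which has the same shape as $W$.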

__Remark:__ Note that the logarithm diverges at arguments approaching 0. Make sure that you don't run into numerical issues.

In [111]:
# compose torch tensor X of shape (N,13) by inserting a column of 1's as first column
# (the ones get the same dtype as X, so the result stays float64)
X = torch.cat((torch.ones(N,1,dtype=X.dtype),X), dim=1)

In [112]:
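# implement methods for predicting the probability of having label 1 (W with shape (1,13))
def predict(X,W):
    # sigmoid of the linear scores; result has shape (N,1)
    return torch.sigmoid(torch.mm(X,W.T))

def cost(X,Y,W):
    # cross-entropy cost; clamp the probabilities so that log never sees exactly 0
    Yhat = predict(X,W).clamp(1e-12, 1-1e-12)
    return -torch.mean(Y*torch.log(Yhat) + (1-Y)*torch.log(1-Yhat)).item()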

In [113]:
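def gradient_cost(X,Y,W):
    # gradient of the CE cost w.r.t. W; shape (1,13), the same shape as W
    return (torch.sigmoid(X@W.T)-Y).T@X/X.shape[0]

def accuracy(Y,Yhat):
    # fraction of samples whose prediction, thresholded at 0.5, matches the label
    return ((Yhat>=0.5).int()==Y).double().mean().item()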
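
A quick way to test a hand-written gradient with dummy values is to compare it against central finite differences of the cost. The sketch below restates the cost and gradient under the hypothetical names `ce_cost` and `ce_gradient` so that it runs on its own. The dummy data, the step size `h`, and the clamp value are assumptions chosen for illustration.

```python
import torch

def ce_cost(X, Y, W, eps=1e-12):
    # Cross-entropy cost for a row vector W of shape (1, n+1).
    Yhat = torch.sigmoid(X @ W.T).clamp(eps, 1 - eps)
    return -(Y * torch.log(Yhat) + (1 - Y) * torch.log(1 - Yhat)).mean()

def ce_gradient(X, Y, W):
    # Analytic gradient (1/N) (Yhat - Y)^T X, same shape as W.
    return (torch.sigmoid(X @ W.T) - Y).T @ X / X.shape[0]

torch.manual_seed(0)
# Dummy data (assumption): 5 samples, a bias column plus 2 features.
X_dummy = torch.cat((torch.ones(5, 1, dtype=torch.double),
                     torch.randn(5, 2, dtype=torch.double)), dim=1)
Y_dummy = torch.tensor([[1.0], [0.0], [1.0], [1.0], [0.0]], dtype=torch.double)
W_dummy = torch.randn(1, 3, dtype=torch.double)

# Central finite differences, one component of W at a time.
h = 1e-6
numeric = torch.zeros_like(W_dummy)
for k in range(W_dummy.shape[1]):
    dW = torch.zeros_like(W_dummy)
    dW[0, k] = h
    numeric[0, k] = (ce_cost(X_dummy, Y_dummy, W_dummy + dW)
                     - ce_cost(X_dummy, Y_dummy, W_dummy - dW)) / (2 * h)

max_err = (numeric - ce_gradient(X_dummy, Y_dummy, W_dummy)).abs().max().item()
print("max abs difference:", max_err)
```

A small difference suggests that the analytic gradient and the cost agree. A large one usually points to a sign, shape, or normalisation error.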


Just for testing:

In [114]:
W = torch.randn((1,13), dtype=torch.double)
print(predict(X,W))
print(cost(X,Y,W))
print(gradient_cost(X,Y,W))
print(accuracy(Y,predict(X,W)))

[output truncated in the source: the printed tensor of shape (310, 1) returned by predict(X,W)]

### 3. Implement Full Batch Gradient Descent

In [116]:
# adjust if needed
n_epochs = 1000
learning_rate = 1.0

## initial parameter
W = torch.randn((1,13), dtype=torch.double)

# track the costs and accuracies, starting with the values at the initial W
# (this gives n_epochs+1 entries, matching the x-axis of the plots below)
costs = [cost(X,Y,W)]
accuracies = [accuracy(Y,predict(X,W))]

for epoch in range(n_epochs):
    # one full-batch gradient descent step
    W = W - learning_rate * gradient_cost(X,Y,W)
    costs.append(cost(X,Y,W))
    accuracies.append(accuracy(Y,predict(X,W)))

# some output
accuracies = np.array(accuracies)

print("Training Accuracy (max,end): %f, %f"%(np.max(accuracies), accuracies[-1]))
print("Training Cost (end): %f"%costs[-1])
plt.figure(1)
plt.plot(range(n_epochs+1),costs)
plt.xlabel("epoch")
plt.ylabel("cost")
plt.figure(2)
plt.plot(range(n_epochs+1),accuracies)
plt.xlabel("epoch")
plt.ylabel("accuracy")


### 4. Implement Full Batch Gradient Descent with PyTorch's autograd
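
A minimal sketch of how this step could look is given below. It is not the notebook's own solution. To stay runnable without the CSV file it uses synthetic stand-in data (`X_syn`, `Y_syn`), and the helper names `ce_cost` and `ce_gradient`, the epoch count, and the learning rate are assumptions. The same loop is run twice from the same initial `W`, once with the hand-written gradient and once with `loss.backward()`, and the resulting parameters are compared.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 200 samples, a bias column plus 3 features.
N, n = 200, 3
X_syn = torch.cat((torch.ones(N, 1, dtype=torch.double),
                   torch.randn(N, n, dtype=torch.double)), dim=1)
W_gen = torch.tensor([[0.5, 2.0, -1.0, 0.0]], dtype=torch.double)
Y_syn = (torch.rand(N, 1, dtype=torch.double)
         < torch.sigmoid(X_syn @ W_gen.T)).double()

def ce_cost(X, Y, W, eps=1e-12):
    # Cross-entropy cost, returned as a tensor so autograd can differentiate it.
    Yhat = torch.sigmoid(X @ W.T).clamp(eps, 1 - eps)
    return -(Y * torch.log(Yhat) + (1 - Y) * torch.log(1 - Yhat)).mean()

def ce_gradient(X, Y, W):
    # Hand-written gradient (1/N) (Yhat - Y)^T X, same shape as W.
    return (torch.sigmoid(X @ W.T) - Y).T @ X / X.shape[0]

n_epochs, learning_rate = 200, 0.5
W0 = torch.randn(1, n + 1, dtype=torch.double)

W_manual = W0.clone()
W_auto = W0.clone().requires_grad_(True)
costs_auto = []

for epoch in range(n_epochs):
    # Manual update with the analytic gradient.
    W_manual = W_manual - learning_rate * ce_gradient(X_syn, Y_syn, W_manual)

    # Autograd update: backward() fills W_auto.grad.
    loss = ce_cost(X_syn, Y_syn, W_auto)
    loss.backward()
    with torch.no_grad():
        W_auto -= learning_rate * W_auto.grad
    W_auto.grad.zero_()  # gradients accumulate otherwise
    costs_auto.append(loss.item())

max_diff = (W_manual - W_auto.detach()).abs().max().item()
print("max |W_manual - W_auto| =", max_diff)
print("cost (first, last):", costs_auto[0], costs_auto[-1])
```

If both implementations are correct, the two parameter vectors should agree up to floating point rounding, which is one way to show that the results are consistent.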

### 5. Tune Learning Rates

Play with different learning rates. Explore for which learning rates
- the learning is most efficient
- the learning still works
- the learning no longer works (learning rate too large)

Explain the different scenarios.
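
One possible way to organise this exploration is a small sweep that reruns gradient descent for several learning rates and compares the final cost. The sketch below is an illustration only: the data are synthetic stand-ins and the candidate rates, the epoch count, and the helper names `ce_cost` and `run_gd` are assumptions. On the real dataset the thresholds between the three scenarios will differ.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 200 samples, a bias column plus 3 features.
N, n = 200, 3
X_syn = torch.cat((torch.ones(N, 1, dtype=torch.double),
                   torch.randn(N, n, dtype=torch.double)), dim=1)
W_gen = torch.tensor([[0.5, 2.0, -1.0, 0.0]], dtype=torch.double)
Y_syn = (torch.rand(N, 1, dtype=torch.double)
         < torch.sigmoid(X_syn @ W_gen.T)).double()

def ce_cost(X, Y, W, eps=1e-12):
    # Cross-entropy cost with clamped probabilities.
    Yhat = torch.sigmoid(X @ W.T).clamp(eps, 1 - eps)
    return -(Y * torch.log(Yhat) + (1 - Y) * torch.log(1 - Yhat)).mean()

def run_gd(X, Y, lr, n_epochs=100):
    # Full-batch GD from W = 0; returns the cost after every epoch.
    W = torch.zeros(1, X.shape[1], dtype=torch.double)
    costs = [ce_cost(X, Y, W).item()]
    for _ in range(n_epochs):
        grad = (torch.sigmoid(X @ W.T) - Y).T @ X / X.shape[0]
        W = W - lr * grad
        costs.append(ce_cost(X, Y, W).item())
    return costs

# Candidate learning rates spanning several orders of magnitude (assumption).
final_costs = {lr: run_gd(X_syn, Y_syn, lr)[-1]
               for lr in (0.01, 0.1, 1.0, 10.0, 100.0)}
for lr, c in final_costs.items():
    print(f"lr={lr:<6} final cost={c:.4f}")
```

Plotting the full cost curves returned by `run_gd` instead of only the final values makes it easier to see slow convergence at small rates and oscillation or divergence at large ones.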