# Project 3: Classification
---

This notebook provides the solution to Project 3 of the module Introduction to Machine Learning 2019 at ETHZ.

---


## Environmental Set-Up

We first set up the environment, load the packages required later on, and fix the random seed globally.

In [0]:
import warnings
import pandas as pd
import numpy as np
import seaborn as sn
import sklearn as sl
import datetime
import random
import matplotlib.pyplot as plt
import time
import copy

random_seed = 1993

%matplotlib inline
sn.set_context('notebook')
%config InlineBackend.figure_format = 'retina'
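random.seed(random_seed)
# also seed NumPy, which scikit-learn falls back to when no random_state is given
np.random.seed(random_seed)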
warnings.filterwarnings('ignore')

After loading the basic packages, we now install PyTorch on the virtual machine, since we are going to use neural networks to solve the project, as suggested. We chose PyTorch because, in the author's subjective opinion, it offers a nicer interface than TensorFlow while supposedly outperforming Keras in terms of speed.

In [3]:
!pip3 install pandas==0.24.2
!pip3 install torch==1.0.1

Since the Google Colab platform offers us a GPU, we make sure to tell PyTorch to use it, as this speeds up the training of our neural network significantly. Unfortunately, at the time of writing PyTorch does not support TPUs, which Google Colab offers as well.

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch import optim
import torch.utils.data as data_utils
use_cuda = True

torch.manual_seed(1993)

---

## Load in the data

We now use the Google Colab API to load the data and the sample submission from disk into the temporary cloud storage attached to this PaaS (platform as a service) solution, to make them accessible.

In [5]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))


---
## Project 3

The following section solves Project 3 of the Introduction to Machine Learning course 2019.

---

### Formatting the data

Now that the data is uploaded, we read it into pandas data frames, a format that is convenient to work with.

In [6]:
# Get train data
train = pd.read_hdf("train.h5", "train")
train.head()

In [7]:
# Get the test data
test = pd.read_hdf("test.h5", "test")
test.head()

We quickly inspect the shape of the data to make sure it has been loaded correctly and cast into pandas data frames.

In [8]:
print("train shape: ", np.array(train).shape)
print("test shape: ", np.array(test).shape)

In [9]:
'''
Get sample prediction file format.
Sample predictions will be simply replaced with the ones obtained from the
custom model.
''' 

submission = pd.read_csv('sample.csv', index_col=0, float_precision='high')
submission.head()

That looks very good. We separate the labels from the features to simplify the implementation and data handling in what follows.

In [0]:
X_train = train.iloc[:, 1:]
y_train = train.iloc[:, 0]

---

### Exploratory Data Analysis

Before trying to model the data, we take a first look at it. We start with the distribution of the labels in the training data, since whether the data is balanced heavily influences the choice of algorithms we consider later on.

In [11]:
n, bins, patches = plt.hist(np.array(y_train), [-0.25, 0.25, 0.75,1.25, 1.75, 
                                                2.25, 2.75, 3.25, 3.75, 4.25],
                            facecolor='b', alpha=0.75, align="mid")


plt.xlabel('Class Label')
plt.ylabel('Number of Samples')
plt.title('Histogram of Class Labels in the Training Data')
plt.axis([-0.5, 4.5, 0, 15000])
plt.grid(True)
plt.show()

We see that the number of training samples differs quite significantly between classes; for instance, we have roughly three times as many samples for class 1 as for class 0 or class 4. We should keep that in mind, as it might hurt the performance of our classifier, especially on the minority classes. If this turns out to be severe, we can consider undersampling or other techniques to overcome it. For now, however, we proceed for simplicity as if the class imbalance were not a severe issue.
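
One of the "other techniques" mentioned above is to weight each class inversely to its frequency in the loss. The following is a minimal sketch of that idea, not something this notebook actually uses; the helper `inverse_frequency_weights` and the toy labels are made up for illustration, and the formula is the common "balanced" heuristic `n_samples / (n_classes * count)`.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Return one weight per class, proportional to 1 / class frequency."""
    counts = np.bincount(np.asarray(labels, dtype=int), minlength=n_classes)
    # Guard against empty classes to avoid a division by zero.
    safe_counts = np.maximum(counts, 1)
    # "Balanced" heuristic: n_samples / (n_classes * count)
    return len(labels) / (n_classes * safe_counts)

# Toy labels (hypothetical, not the project data): class 1 dominates.
toy_labels = np.array([0, 1, 1, 1, 2, 3, 4, 1, 1, 1, 0, 4])
weights = inverse_frequency_weights(toy_labels, n_classes=5)
print(weights)
```

Such a weight vector could, for example, be converted to a tensor and passed as the `weight` argument of `nn.NLLLoss` or `nn.CrossEntropyLoss`, so that errors on rare classes count more.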

Although it might not be that informative given the relatively high number of features, let us quickly inspect the correlation structure of the features.

In [34]:
corr = X_train.corr()
print(X_train.shape)

f, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Heatmap of the correlation structure")
sn.heatmap(
    corr,
    mask=np.zeros_like(corr, dtype=bool),
    cmap=sn.diverging_palette(220, 10, as_cmap=True),
    square=True,
    ax=ax)
plt.subplots_adjust(bottom=0.25)
plt.show()

At first sight it seems that we have rather strong correlations but at least no 1-to-1 mappings of the individual features. We will confirm this by looking at the 10 feature-pairs that are the most correlated.

In [0]:
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

In [36]:
print("Top Absolute Correlations")
print(get_top_abs_correlations(X_train, 10))

As the heat map indicated, no pair of features has a correlation of 1, and hence for now there is no reason to exclude any features from the start.
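
The pair search above builds the set of redundant pairs with two nested Python loops. As a sketch of an alternative, the same ranking can be obtained by masking the correlation matrix with its strict upper triangle. The function name `top_abs_correlations` and the toy data frame are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

def top_abs_correlations(df, n=5):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.corr().abs()
    # Keep only entries strictly above the diagonal, so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # stack() flattens to a Series indexed by (row, column);
    # dropna() removes the masked entries.
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy data (hypothetical): column "e" is a noisy copy of column "a".
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
toy["e"] = toy["a"] + 0.1 * rng.normal(size=200)
top = top_abs_correlations(toy, n=3)
print(top)
```

In this toy example the pair `("a", "e")` is ranked first, since one column was constructed from the other.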

---

### Initial Experiments: 5 Layer Pytorch NN

In the following, we set up a basic 5-layer feed-forward neural network using the PyTorch framework to classify the individual data points based on the 120 features. This network is obviously simplistic and not tuned, but it serves as a starting point and as an easy way to familiarize oneself with the framework.

1. Let us first transform the data in a format that is compatible with the pytorch framework and hence enables us later to fit the network to the data.

In [37]:
# check if GPU is available and set the device accordingly
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')  
torch.cuda.get_device_name(0)

In [0]:
train_tensors = data_utils.TensorDataset(torch.cuda.FloatTensor(np.array(X_train)), torch.cuda.LongTensor(np.array(y_train)))
train_loader = data_utils.DataLoader(train_tensors, batch_size = 200, shuffle = True)
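
The cell above uses `torch.cuda.FloatTensor` and therefore requires a GPU. A minimal sketch of a device-agnostic variant, which falls back to the CPU when no GPU is present, could look as follows. The helper `make_loader` and the random stand-in data are assumptions for illustration; only the 120 features and 5 classes are taken from the notebook.

```python
import numpy as np
import torch
import torch.utils.data as data_utils

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

def make_loader(features, labels, batch_size=200, shuffle=True):
    """Wrap NumPy arrays in a DataLoader whose tensors live on `device`."""
    x = torch.as_tensor(np.asarray(features), dtype=torch.float32).to(device)
    y = torch.as_tensor(np.asarray(labels), dtype=torch.long).to(device)
    dataset = data_utils.TensorDataset(x, y)
    return data_utils.DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

# Random stand-in data with 120 features and 5 classes.
toy_X = np.random.rand(450, 120)
toy_y = np.random.randint(0, 5, size=450)
toy_loader = make_loader(toy_X, toy_y)

xb, yb = next(iter(toy_loader))
print(xb.shape, yb.dtype, xb.device)
```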

2. Let us formally define the network. We use a 5-layer structure with ReLU activations in the hidden layers and a log-softmax activation at the output layer. The input and output sizes are determined by the number of features (120) and classes (5), respectively; for this simplistic approach the hidden layers shrink from 128 over 64 and 32 to 16 units, with dropout applied after each hidden layer.

In [39]:
class SimpleNet(nn.Module):

    def __init__(self):
      super(SimpleNet, self).__init__()
      self.fc1 = nn.Linear(120, 128)
      self.fc2 = nn.Linear(128, 64)
      self.fc3 = nn.Linear(64,32)
      self.fc4 = nn.Linear(32, 16)
      self.fc5 = nn.Linear(16, 5)
      self.dropout = nn.Dropout(0.6)
      #self.smax = nn.LogSoftmax()

    def forward(self, x):
      x = F.relu(self.fc1(x))
      x = self.dropout(x)
      x = F.relu(self.fc2(x))
      x = self.dropout(x)
      x = F.relu(self.fc3(x))
      x = self.dropout(x)
      x = F.relu(self.fc4(x))
      x = self.dropout(x)
      x = self.fc5(x)
      x = F.log_softmax(x, dim=1)
      return x


snet = SimpleNet()
snet = snet.cuda()
snet = snet.to(device)
print(snet)

3. We now define the loss function and the optimizer we intend to use.

In [0]:
learning_rate = 0.01
optimizer = optim.SGD(snet.parameters(), lr=learning_rate, momentum=0.9)
criterion = nn.NLLLoss()

4. Finally, since everything is set up, let us train the model.

In [41]:
torch.backends.cudnn.benchmark = True
epochs=20
log_interval=1000

for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
      data, target = Variable(data).to(device), Variable(target).to(device)
      optimizer.zero_grad()
      net_out = snet(data)
      loss = criterion(net_out, target)
      loss.backward()
      optimizer.step()
      if batch_idx % log_interval == 0:
         print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.dataset),
                           100. * batch_idx / len(train_loader), loss.data.item()))

Let us now get the predictions and submit them to get a first idea of our performance.

In [0]:
test_set = torch.from_numpy(np.array(test))
preds = []

# switch to evaluation mode so that dropout is disabled during prediction
snet.eval()

with torch.no_grad():
  for value in test_set:
    # put it on the GPU, make it float and insert a fake batch dimension
    test_value = value.cuda().float().unsqueeze(0)

    # pass it through the model
    prediction = snet(test_value)

    # get the result out and reshape it
    cpu_pred = prediction.cpu()
    result = np.argmax(cpu_pred.numpy())
    preds.append(result)

preds = np.array(preds)

In [43]:
preds

We see that we only predict the majority classes. That is not satisfactory.
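
The prediction loop above passes one test sample at a time through the network. As a sketch under stated assumptions, the same predictions can be computed in batches, with dropout switched off via `eval()` and gradient tracking disabled. The helper `predict_classes` and the stand-in model are hypothetical and only mirror the notebook's 120 inputs and 5 outputs.

```python
import numpy as np
import torch
import torch.nn as nn

def predict_classes(model, features, batch_size=1024):
    """Return the arg-max class for every row of `features`."""
    device = next(model.parameters()).device
    x = torch.as_tensor(np.asarray(features), dtype=torch.float32)
    model.eval()  # disable dropout for inference
    preds = []
    with torch.no_grad():  # no gradients are needed when predicting
        for start in range(0, len(x), batch_size):
            batch = x[start:start + batch_size].to(device)
            preds.append(model(batch).argmax(dim=1).cpu())
    return torch.cat(preds).numpy()

# Hypothetical stand-in model and random features, for illustration only.
toy_model = nn.Sequential(
    nn.Linear(120, 16), nn.ReLU(), nn.Dropout(0.6), nn.Linear(16, 5))
toy_features = np.random.rand(2500, 120)
toy_preds = predict_classes(toy_model, toy_features)
print(toy_preds.shape)
```

Because the model is in evaluation mode, repeated calls on the same input return the same predictions, which is not guaranteed while dropout is active.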

---

### More Sophisticated Experiments: 4-Layer ANN

We will now perform some more sophisticated trials to get a better performance.
In particular, so far we neither tracked the performance of our network with respect to the desired metric nor obtained a less biased estimate by monitoring the performance on a validation set.

We will do so in order to get a better idea of what the issues with our configuration might be. To this end, we implement a couple of generic functions that do the job, using the following as inspiration: https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html#model-training-and-validation-code .

1. We split our loaded data into a training and a validation set. The former is used for training, while the latter is used to monitor the performance of our network. As we have seen several times, a performance estimate obtained on the training set is too optimistic; for model tuning we therefore rely on the estimate from the validation set instead. The scikit-learn library provides all we need to realize the split. We then transform the data so that it fits into the PyTorch framework.

In [44]:
# Create train val split and create the data loader

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

sc = StandardScaler().fit(X_train)
X_train_standardized = sc.transform(X_train)

data_train, data_val, label_train, label_val = train_test_split(X_train_standardized, 
                                                                y_train, test_size = 0.175, 
                                                                random_state=1993)
print(np.array(data_train).shape)
print(np.array(label_train).shape)
print(np.array(data_val).shape)
print(np.array(label_val).shape)

# Note that from here on we expect a GPU to be available; if that is not the case,
# use torch.xxxTensor instead of torch.cuda.xxxTensor

train_tensors = data_utils.TensorDataset(
    torch.cuda.FloatTensor(np.array(data_train)), 
    torch.cuda.LongTensor(np.array(label_train)))

train_loader = data_utils.DataLoader(train_tensors, 
                                     batch_size = 256, shuffle = True)

val_tensors = data_utils.TensorDataset(
    torch.cuda.FloatTensor(np.array(data_val)), 
    torch.cuda.LongTensor(np.array(label_val)))

val_loader = data_utils.DataLoader(val_tensors, 
                                   batch_size = 256, shuffle = True)

data_loaders_dict = {'train':train_loader, 'val':val_loader}
data_loaders_dict

2. Now we define the function that trains a given torch.nn model and monitors its performance along the way. This is again inspired by the previously referenced official PyTorch tutorial. The nice thing about this function is that it lets us monitor the model performance on both the training and the validation set at each epoch, and it returns the best model found, i.e. the one with the highest validation accuracy.

In [0]:
def train_model(model, dataloaders, criterion, optimizer, num_epochs=100):
    since = time.time()

    val_acc_history = []

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.type(torch.FloatTensor).to(device)
                labels = labels.type(torch.LongTensor).to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                  # Get model outputs and calculate loss
                  outputs = model(inputs)
                  loss = criterion(outputs, labels)
                  _, preds = torch.max(outputs, 1)

                # backward + optimize only if in training phase
                if phase == 'train':
                  loss.backward()
                  optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = running_corrects.double() / len(dataloaders[phase].dataset)

            print('{} Loss: {:.6f} Acc: {:.6f}'.format(phase, epoch_loss, epoch_acc))

            # deep copy the model if it has the best val accuracy
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())
            if phase == 'val':
                val_acc_history.append(epoch_acc)

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, 
                                                        time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model, val_acc_history

3. We now set up the model, i.e. the network structure. We use a very basic model with only two hidden layers, but with sufficiently many neurons in each of them to represent different transformations of the input features.

In [85]:
class EasyNet(nn.Module):

    def __init__(self):
      super(EasyNet, self).__init__()
      self.fc1 = nn.Linear(120, 1024)
      self.fc2 = nn.Linear(1024, 256)
      self.fc3 = nn.Linear(256,5)
      self.dropout = nn.Dropout(0.5)

    def forward(self, x):
      x = F.relu(self.fc1(x))
      x = self.dropout(x)
      x = F.relu(self.fc2(x))
      x = self.dropout(x)
      x = self.fc3(x)
      return x


enet = EasyNet()
enet = enet.cuda()
enet = enet.to(device)
print(enet)

4. We now set up the optimizer and the criterion. We use the Adam optimizer, as it automatically adapts the learning rate. Since our output layer consists of five neurons outputting values in $\mathbb{R}$, we use the CrossEntropyLoss criterion provided by PyTorch, which expects such raw scores. Additionally, we set weight_decay to 1e-5, which adds a weight regularization based on the $l_2$-norm to prevent overfitting. Note that we also used dropout layers in our network structure for the same purpose.

In [0]:
params_to_update = enet.parameters()

optimizer_ft = optim.Adam(params_to_update, lr=1e-3, weight_decay=1e-5)

criterion = nn.CrossEntropyLoss(reduction='mean')
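
The cross-entropy criterion applied to raw outputs corresponds to a log-softmax followed by the negative log-likelihood, which is the combination used explicitly for the first network. The following NumPy sketch spells out that computation on made-up scores; the helper names `log_softmax` and `nll` and the example numbers are assumptions for illustration and are not part of PyTorch.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll(log_probs, targets):
    """Mean negative log-likelihood of the target classes."""
    return -log_probs[np.arange(len(targets)), targets].mean()

# Two made-up samples with five raw class scores each.
logits = np.array([[2.0, 0.5, 0.1, -1.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0]])
targets = np.array([0, 3])

loss = nll(log_softmax(logits), targets)
print(loss)
```

For the second sample all five scores are equal, so every class receives probability 1/5 and its contribution to the loss is log(5), regardless of the target.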

Before we start training our model, let us, as a last sanity check, verify that the train/validation split was done such that the distribution of the class labels is roughly the same in both sets.

In [68]:
counts_train = np.unique(label_train, return_counts=True)[1]
counts_val = np.unique(label_val, return_counts=True)[1]
(counts_train, counts_val)

That is the case.
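
Here the random split happened to preserve the class proportions. If one wants this to hold by construction, `train_test_split` accepts a `stratify` argument. The sketch below shows this on synthetic labels; the data and the class probabilities are invented for the example, and the notebook itself does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (hypothetical class probabilities).
rng = np.random.RandomState(1993)
toy_X = rng.normal(size=(1000, 120))
toy_y = rng.choice(5, size=1000, p=[0.1, 0.3, 0.25, 0.25, 0.1])

# stratify=toy_y keeps the class proportions equal in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.175, random_state=1993, stratify=toy_y)

train_share = np.bincount(y_tr, minlength=5) / len(y_tr)
val_share = np.bincount(y_val, minlength=5) / len(y_val)
print(np.round(train_share, 3))
print(np.round(val_share, 3))
```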

5. Having done all of this we are good to go and can train our model.

In [87]:
num_epochs=200
enet_fit, hist = train_model(enet, data_loaders_dict, 
                             criterion, optimizer_ft, 
                             num_epochs=num_epochs)

6. The validation accuracy looks promising. However, to get an idea of how representative that score is for the performance on the test set, we now predict the labels for the test set and create a submission from those predictions.

In [54]:
test_standardized = sc.transform(test)
test_set = torch.from_numpy(np.array(test_standardized))
predictions = []

for value in test_set:
  # then put it on the GPU, make it float and insert a fake batch dimension
  test_value = Variable(value.cuda())
  test_value = test_value.float()
  test_value = test_value.unsqueeze(0)

  # pass it through the model
  outputs = enet_fit(test_value)
  _, preds = torch.max(outputs, 1)

  # get the result out and reshape it
  predictions.append(preds.cpu().numpy()[0])

predictions = np.array(predictions)
predictions

Let us quickly check the distribution of the predicted class labels.

In [55]:
unique, counts = np.unique(predictions, return_counts=True)
dict(zip(unique, counts))

Well, that looks quite nice when we recall the distribution of the labels in our training and validation data. Since we assume that the test data comes from the same distribution as our training and validation data, this is exactly what we would expect.
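
Comparing raw counts by eye is harder when the two sets differ in size. A small sketch of how the comparison could be made on relative frequencies instead is given below; the helper `class_shares` and both label arrays are hypothetical and do not reproduce the notebook's actual numbers.

```python
import numpy as np

def class_shares(labels, n_classes=5):
    """Return the relative frequency of each class label."""
    counts = np.bincount(np.asarray(labels, dtype=int), minlength=n_classes)
    return counts / counts.sum()

# Hypothetical labels, for illustration only.
toy_train_labels = np.array([0, 1, 1, 1, 2, 2, 3, 3, 4, 1])
toy_predictions = np.array([1, 1, 2, 3, 1, 1, 4, 2, 3, 1])

# Absolute difference in class share between the two sets.
gap = np.abs(class_shares(toy_train_labels) - class_shares(toy_predictions))
print(gap)
```

Note that matching class shares are only a plausibility check: they do not show that the individual predictions are correct.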

Let us now finally create a submission based on those predictions and download it, such that we potentially could hand it in.

---
### Create submission

In [56]:
submission['y'] = predictions
submission.head()

---

## Export data

We finally use the Google Colab API to download our submission data frame in the form of a CSV file that we can upload to the submission platform.

In [0]:
from google.colab import files

ts = str(datetime.datetime.utcnow())
ts = ts.replace(' ', '_')
Filename = 'ANN_hand_in' #@param {type:"string"}
fname = Filename+ts+'.csv'

with open(fname, 'w') as f:
  submission.to_csv(f, float_format='%.64f', index=True, header=True)

files.download(fname)
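
The timestamp built above via `str(datetime.datetime.utcnow())` contains colons and a fractional-seconds part, which some file systems and tools handle poorly. A minimal sketch of a more portable file name, using `strftime`, is shown below; the helper `submission_filename` is an assumption for illustration and not part of the Colab API.

```python
import datetime

def submission_filename(prefix='ANN_hand_in', now=None):
    """Build a CSV file name with a filesystem-friendly UTC timestamp."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    # Only digits, dashes and underscores end up in the timestamp.
    return '{}_{}.csv'.format(prefix, now.strftime('%Y-%m-%d_%H-%M-%S'))

# Fixed example time so that the result is reproducible.
example = submission_filename(now=datetime.datetime(2019, 5, 1, 12, 30, 0))
print(example)  # ANN_hand_in_2019-05-01_12-30-00.csv
```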