# Activity Classifier

We've explored the data, examined the literature, chosen our features, and pre-processed all the data. Now it's time to finally build the classifier!

Import the libraries that we will need.

In [None]:
import os

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp
import scipy.signal
import scipy.stats

import activity_classifier_utils

Load the data

In [None]:
fs = 256
data = activity_classifier_utils.LoadWristPPGDataset()

### Feature Extraction

Train on 10-second-long, non-overlapping windows. Setting the window shift equal to the window length means consecutive windows do not overlap.

In [None]:
window_length_s = 10
window_shift_s = 10

In [None]:
window_length = window_length_s * fs   # window length in samples
window_shift = window_shift_s * fs     # window shift in samples
labels, subjects, features = [], [], []
for subject, activity, df in data:
    # Slide a fixed-length window over each recording.
    for i in range(0, len(df) - window_length, window_shift):
        window = df.iloc[i: i + window_length]
        accx = window.accx.values
        accy = window.accy.values
        accz = window.accz.values
        # One feature vector, one label, and one subject ID per window.
        features.append(activity_classifier_utils.Featurize(accx, accy, accz, fs=fs))
        labels.append(activity)
        subjects.append(subject)
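
The windowing above can be sketched on a synthetic signal length. This is a minimal sketch, not the notebook's pipeline: `window_starts` is a hypothetical helper and the 30-second recording is made up. It also highlights one detail: the loop above stops at `len(df) - window_length`, so when a recording's length is an exact multiple of the window length, the last full window is skipped. Adding 1 to the stop value would include it.

```python
def window_starts(n_samples, window_length, window_shift, include_last=False):
    """Return the start index of each window in a signal of n_samples samples."""
    stop = n_samples - window_length + (1 if include_last else 0)
    return list(range(0, stop, window_shift))

fs = 256                       # sampling rate in Hz, as in the notebook
window_length = 10 * fs        # 10 s window -> 2560 samples
window_shift = 10 * fs         # shift == length -> no overlap

n_samples = 3 * window_length  # hypothetical 30 s recording

# Same stop value as the loop in the notebook.
starts_as_written = window_starts(n_samples, window_length, window_shift)
# Stop value increased by 1 so the final full window is kept.
starts_inclusive = window_starts(n_samples, window_length, window_shift,
                                 include_last=True)

print(starts_as_written)  # [0, 2560]
print(starts_inclusive)   # [0, 2560, 5120]
```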

In [None]:
labels = np.array(labels)
subjects = np.array(subjects)
features = np.array(features)

## Build a Random Forest Classifier using sklearn

If you've done machine learning in Python before, you've more than likely used `sklearn`. ML for wearable data is no different. Let's use sklearn to train a random forest to classify our data.

In [None]:
from sklearn.ensemble import RandomForestClassifier

### Define hyperparameters

Let's build a forest with 100 trees, each with a maximum depth of 4.

In [None]:
n_estimators = 100
max_tree_depth = 4

### Build and train the model

In [None]:
clf = RandomForestClassifier(n_estimators=n_estimators,
                             max_depth=max_tree_depth,
                             random_state=42)
clf.fit(features, labels)

## Performance Evaluation

### Confusion Matrix

One way to evaluate the performance of a multi-class classifier is to look at a confusion matrix. The confusion matrix shows how many datapoints were misclassified and what they were misclassified as.

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
y_true = ['bike', 'run', 'run', 'walk']
y_pred = ['run', 'run', 'bike', 'walk']
class_names = ['bike', 'run', 'walk']
cm = confusion_matrix(y_true, y_pred, labels=class_names)
activity_classifier_utils.PlotConfusionMatrix(cm, class_names)
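
To make the layout of the matrix concrete, the toy example above can be tallied by hand. This is a minimal sketch using only NumPy; `tally_confusion` is a hypothetical helper, not part of `sklearn` or `activity_classifier_utils`. It follows the same convention as `sklearn.metrics.confusion_matrix`: rows are true classes and columns are predicted classes.

```python
import numpy as np

def tally_confusion(y_true, y_pred, class_names):
    """Count (true, predicted) pairs; rows are true classes, columns are predictions."""
    index = {name: i for i, name in enumerate(class_names)}
    cm = np.zeros((len(class_names), len(class_names)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

y_true = ['bike', 'run', 'run', 'walk']
y_pred = ['run', 'run', 'bike', 'walk']
class_names = ['bike', 'run', 'walk']

toy_cm = tally_confusion(y_true, y_pred, class_names)
print(toy_cm)
# [[0 1 0]    true bike: predicted as run once
#  [1 1 0]    true run: predicted as bike once and as run once
#  [0 0 1]]   true walk: predicted as walk once
```

The diagonal holds the correct classifications, so every off-diagonal count is a mistake.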

### Leave-One-Subject-Out Cross Validation

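You may have seen leave-one-out cross validation. Leave-one-subject-out cross validation is similar, except that each fold holds out all the datapoints from one subject rather than a single datapoint.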

In many biomedical signal applications you have many datapoints per subject. Here we have 611 datapoints from only 8 subjects. An individual is likely to perform a specific activity in a similar way each time, so if some of a person's data is in the training set and the rest is in the test set, the measured performance can overstate how well the model would generalize to a brand-new person it has never seen.

For this reason we do leave-one-subject-out cross validation. This is why we kept track of which subject each datapoint belonged to in the `subjects` array.
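
The splitting idea can be sketched without `sklearn`. This is a minimal illustration on made-up subject labels; `leave_one_subject_out` is a hypothetical helper that mimics the kind of index pairs `LeaveOneGroupOut.split` yields, not its actual implementation.

```python
import numpy as np

def leave_one_subject_out(subjects):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    subjects = np.asarray(subjects)
    for held_out in np.unique(subjects):
        test_mask = subjects == held_out
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Made-up subject labels: seven datapoints from three subjects.
toy_subjects = np.array(['s1', 's1', 's2', 's2', 's2', 's3', 's3'])

folds = list(leave_one_subject_out(toy_subjects))
for held_out, train_ind, test_ind in folds:
    # No subject ever appears on both sides of the split.
    assert not set(toy_subjects[train_ind]) & set(toy_subjects[test_ind])
    print(held_out, train_ind, test_ind)
```

Each fold tests on every datapoint from exactly one subject and trains on all the others, so the number of folds equals the number of subjects.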

In [None]:
from sklearn.model_selection import LeaveOneGroupOut

In [None]:
class_names = np.array(['bike', 'run', 'walk'])
logo = LeaveOneGroupOut()
cm = np.zeros((3, 3), dtype='int')

In [None]:
for train_ind, test_ind in logo.split(features, labels, subjects):
    # For each cross-validation fold...
    
    # Split up the dataset into a training and test set.
    # The test set has all the data from just one subject
    X_train, y_train = features[train_ind], labels[train_ind]
    X_test, y_test = features[test_ind], labels[test_ind]
    
    # Train the classifier
    clf.fit(X_train, y_train)
    
    # Run the classifier on the test set
    y_pred = clf.predict(X_test)
    
    # Compute the confusion matrix for the test predictions
    c = confusion_matrix(y_test, y_pred, labels=class_names)
    
    # Aggregate this confusion matrix with the ones from previous
    # folds.
    cm += c

### Plot Confusion Matrices

In [None]:
class_names = ['bike', 'run', 'walk']

In [None]:
activity_classifier_utils.PlotConfusionMatrix(cm, class_names,
                                              title='classifier performance', normalize=False)

In [None]:
activity_classifier_utils.PlotConfusionMatrix(cm, class_names, 
                                              title='normalized classifier performance',
                                              normalize=True)

We seem to be really good at classifying `run`. We don't really mistake `run` for either `bike` or `walk` and don't misclassify the other classes as `run` often.

Our biggest mistake seems to be misclassifying `bike` as `walk`. We do that 42% of the time.
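
A percentage like this comes from dividing each row of the confusion matrix by the number of true examples in that class. The sketch below shows the arithmetic on made-up counts (not the results of this notebook). It assumes that `PlotConfusionMatrix(..., normalize=True)` normalizes by row, which the wording above suggests but the source does not show.

```python
import numpy as np

# Made-up counts for illustration only; rows are true classes, columns are predictions.
class_names = ['bike', 'run', 'walk']
example_cm = np.array([[50,  0, 50],
                       [ 0, 90, 10],
                       [20,  0, 80]])

# Divide each row by its total so that every row sums to 1.
row_totals = example_cm.sum(axis=1, keepdims=True)
normalized = example_cm / row_totals

# Fraction of true `bike` windows that were predicted as `walk`.
bike_as_walk = normalized[class_names.index('bike'), class_names.index('walk')]
print(bike_as_walk)  # 0.5
```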

### Compute Classification Accuracy

An overall measure of classifier performance is the classification accuracy: the fraction of datapoints that we classify correctly. There are other metrics for evaluating classifier performance, and relying on a single metric can be misleading depending on your dataset. See the further resources section for this lesson to learn more.

We can compute the classification accuracy from the confusion matrix by dividing the sum of the diagonal (the correct classifications) by the sum of all entries:

In [None]:
print(np.trace(cm) / np.sum(cm))
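
To see why a single number can mislead, the sketch below computes accuracy alongside per-class recall and precision from a confusion matrix. The counts are made up and deliberately imbalanced; they are not results from this notebook.

```python
import numpy as np

# Made-up, imbalanced counts; rows are true classes, columns are predictions.
class_names = ['bike', 'run', 'walk']
example_cm = np.array([[ 2,  0,   8],
                       [ 0, 90,   0],
                       [ 0,  0, 100]])

accuracy = np.trace(example_cm) / example_cm.sum()

# Recall: of the true examples of a class, the fraction predicted correctly.
recall = np.diag(example_cm) / example_cm.sum(axis=1)
# Precision: of the predictions of a class, the fraction that were correct.
precision = np.diag(example_cm) / example_cm.sum(axis=0)

print(f'accuracy: {accuracy:.2f}')
for name, r, p in zip(class_names, recall, precision):
    print(f'{name}: recall={r:.2f}, precision={p:.2f}')
```

In this made-up case the overall accuracy is 0.96 even though only 20% of the true `bike` examples are recognized, because `bike` contributes so few datapoints.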

We've built an activity classifier. This is a good first step. Can we do better?