In [56]:
import importer
import numpy as np
import matplotlib.pyplot as plt

# Creating a Classifier

The goal of this notebook is to create a highly accurate classifier that distinguishes MTB rides from road rides. Although this distinction is not recorded explicitly in the dataset, the 'Activity Gear' field records which bike I was riding. When I go for an MTB ride I always use my mountain bike, the 'Roscoe 8'. Similarly, for road rides I always use the 'Carrera Zelos'. So if the classifier can model which bike I am riding, it can model which type of ride I am doing.

Achieving this goal requires a three-stage approach: pre-processing, feature selection, and classification.


## Pre-processing

In the interest of brevity, all pre-processing for this dataset has been done in the importer module inside this repository. This module serves a few purposes:
- Removes every row that has no data in one or more columns
- Allows partitioning of the data
- Converts the 'Activity Date' field into seconds since the Unix epoch

A limitation of this classifier is that it can only use numerical data, so all attributes that are strings or booleans can be removed immediately. We can also remove any attributes which were only recorded for a few activities.
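The two data-cleaning steps listed above can be sketched in a few lines. This is only an illustration, not the importer module's actual code: the helper names `to_epoch_seconds` and `drop_incomplete_rows`, the date format, the UTC assumption, and the sample rows are all hypothetical.

```python
from datetime import datetime, timezone

def to_epoch_seconds(date_string, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a date string to seconds since the Unix epoch.

    Assumption: the string is in UTC and matches `fmt`; the real
    'Activity Date' format may differ.
    """
    parsed = datetime.strptime(date_string, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

def drop_incomplete_rows(rows, columns):
    """Keep only the rows that have a non-empty value in every requested column."""
    return [
        row for row in rows
        if all(row.get(column) not in (None, "") for column in columns)
    ]

# Hypothetical sample rows; the second one has no distance recorded.
rows = [
    {"Activity Date": "2020-01-01 00:00:00", "Distance": "12.5"},
    {"Activity Date": "2020-01-02 09:30:00", "Distance": ""},
]

complete = drop_incomplete_rows(rows, ["Activity Date", "Distance"])
epochs = [to_epoch_seconds(row["Activity Date"]) for row in complete]
```

Dropping incomplete rows relative to the requested columns explains why asking for more headers can shrink the number of usable activities.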

In [59]:
data_set = importer.Data()

# Get all headers that contain numeric data
headers = [
    header
    for header, dtype in zip(data_set.headers, data_set.types)
    if dtype not in ('string', 'bool')
]

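# Remove headers which have not been recorded for many activities.
# A new list is built because calling remove() on a list while looping
# over it skips the element that follows each removal.
headers = [header for header in headers
           if len(data_set.get_data([header])[header]) >= 120]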

features = data_set.get_data(headers)
print('This leaves us with the following list of', len(headers), 'headers: ', headers, '\n')
print('This reduces the number of activities to ', len(features[headers[0]]))

This leaves us with the following list of 36 headers:  ['Activity ID', 'Elapsed Time', 'Distance', 'Elapsed Time2', 'Moving Time', 'Distance3', 'Max Speed', 'Average Speed', 'Elevation Gain', 'Elevation Loss', 'Elevation Low', 'Elevation High', 'Max Grade', 'Average Grade', 'Calories', 'Perceived Relative Effort', 'From Upload', 'Weather Observation Time', 'Weather Condition', 'Weather Temperature', 'Apparent Temperature', 'Dewpoint', 'Humidity', 'Weather Pressure', 'Wind Speed', 'Wind Gust', 'Wind Bearing', 'Precipitation Intensity', 'Sunrise Time', 'Sunset Time', 'Moon Phase', 'Precipitation Probability', 'Cloud Cover', 'Weather Visibility', 'UV Index', 'Weather Ozone'] 

This reduces the number of activities to  97


## Feature Selection

Now that we have a pre-processed subset of the dataset, we need to perform **dimensionality reduction** to remove the features that are less useful to the classifier. This can be done via **Heuristic Feature Selection**. The features are assumed to be independent (although in reality this is not the case), which allows a top-down approach: start with all features and remove the worst ones.
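As a rough illustration of the top-down idea, the sketch below starts from every feature and repeatedly drops the one that separates the two classes worst. The scoring rule (a Fisher-style ratio of squared mean gap to summed variance), the function names, and the synthetic data are assumptions made for this example; they are not necessarily the criterion used in this notebook.

```python
import numpy as np

def fisher_score(values, labels):
    """Score one feature on its own: squared gap between the class means
    divided by the sum of the class variances (larger is better)."""
    a, b = values[labels == 0], values[labels == 1]
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + 1e-12)

def top_down_selection(X, y, names, n_keep):
    """Start with all features and drop the lowest-scoring one until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        scores = [fisher_score(X[:, j], y) for j in remaining]
        remaining.pop(int(np.argmin(scores)))
    return [names[j] for j in remaining]

# Synthetic two-class data: one feature that separates the classes, one that is pure noise.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
informative = np.where(y == 0, 0.0, 5.0) + rng.normal(size=100)
noise = rng.normal(size=100)
X = np.column_stack([informative, noise])

selected = top_down_selection(X, y, ["informative", "noise"], n_keep=1)
```

Because each feature is scored independently, the scores never change between iterations, so this loop is equivalent to ranking the features once and keeping the best `n_keep`. That simplicity is what the independence assumption buys, at the cost of ignoring redundancy between correlated features.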