# Predicting the Type of Physical Activity from Tri-Axial Smartphone Accelerometer Data

In this investigation, I aim to build a classification model that predicts the type of physical activity from tri-axial smartphone accelerometer data with high accuracy. This model emulates the real-world models used by mobile apps like Fitbit and Apple Fitness to automatically record physical activity, such as the number of steps taken and the number of stairs climbed, without user input. For a particular subset of accelerometer data, the model will output an integer between `1` and `4` inclusive as its prediction, with `1` = standing, `2` = walking, `3` = stairs down, and `4` = stairs up.
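The label encoding above can be captured in a small lookup table. This is only a minimal sketch; the names `ACTIVITY_NAMES` and `describe_label` are hypothetical and do not appear elsewhere in this notebook.

```python
# Hypothetical lookup table for the label encoding described above.
ACTIVITY_NAMES = {1: "standing", 2: "walking", 3: "stairs down", 4: "stairs up"}


def describe_label(label: float) -> str:
    """Translate a numeric label (for example 1 or 1.0) into its activity name."""
    return ACTIVITY_NAMES[int(label)]


print(describe_label(2))    # walking
print(describe_label(4.0))  # stairs up
```

Accepting floats is deliberate: once missing labels are represented as `NaN`, pandas stores the label column as floating-point values such as `1.0`.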

## Overall Procedure
1. Import the `train_time_series.csv` and `train_labels.csv` files into dataframes.
2. Extract the predictor variables from the `train_time_series.csv` dataframe and the outcomes from the `train_labels.csv` dataframe.
3. Create a logistic regression classification model with the training data and evaluate the accuracy and speed of this model.
4. Create a random forest classification model with the training data and evaluate the accuracy and speed of this model.
5. Compare the accuracy and speed of the logistic regression classification model and the random forest classification model.
6. Create a suite of functions for the more accurate model.
7. Report the final time and accuracy of the model.
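Steps 3 to 5 rely on measuring two things for each model: accuracy and speed. The sketch below shows one way these could be measured using only the standard library. The helper names `accuracy` and `timed` are hypothetical, and the "model" is a toy stand-in rather than one of the classifiers built later.

```python
import time
from typing import Callable, Sequence, Tuple


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def timed(func: Callable, *args, **kwargs) -> Tuple[object, float]:
    """Call func and return (result, elapsed time in seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Toy example: a "model" that always predicts walking (label 2).
actual = [1, 2, 2, 3, 4, 2]
predicted, seconds = timed(lambda labels: [2] * len(labels), actual)

print(accuracy(predicted, actual))  # 0.5
print(f"{seconds:.6f} s")
```

`time.perf_counter()` is the same clock used for the timer in the first code cell, so timings from this helper would be comparable with the notebook's overall run time.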

The first step is to import `train_time_series.csv` into a `pandas` dataframe. This file contains the timestamped accelerometer readings recorded during the four activities (standing, walking, stairs down, and stairs up). These data will become part of the feature matrix $X$ for the classification model.

Let us display the first five rows of the `train_time_series` dataframe as well as its shape.

In [110]:
import pandas as pd
import time

start_time = time.perf_counter()  # Start timer before importing data

train_time_series = pd.read_csv('train_time_series.csv', index_col=0)
train_time_series.head()  # Get the first five rows of data

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z
20586,1565109930787,2019-08-06T16:45:30.787,unknown,-0.006485,-0.93486,-0.069046
20587,1565109930887,2019-08-06T16:45:30.887,unknown,-0.066467,-1.015442,0.089554
20588,1565109930987,2019-08-06T16:45:30.987,unknown,-0.043488,-1.021255,0.178467
20589,1565109931087,2019-08-06T16:45:31.087,unknown,-0.053802,-0.987701,0.068985
20590,1565109931188,2019-08-06T16:45:31.188,unknown,-0.054031,-1.003616,0.12645


In [111]:
print(train_time_series.shape)  # Get the shape of the dataframe

(3744, 6)


From above, the dataframe has the following columns: `timestamp`, `UTC time`, `accuracy`, `x`, `y`, and `z`. We also know it contains $3744$ rows and $6$ columns.

Let us now import the `train_labels.csv` file into a `pandas` dataframe. The `train_labels.csv` file contains the activity label (standing, walking, stairs down, or stairs up) recorded at a subset of the timestamps. These data will become the outcome vector $y$ for the classification model.

Let us find the first five rows of data in the `train_labels` dataframe as well as its shape.

In [112]:
train_labels = pd.read_csv('train_labels.csv', index_col=0)
train_labels.head()  # Get the first five rows of data

Unnamed: 0,timestamp,UTC time,label
20589,1565109931087,2019-08-06T16:45:31.087,1
20599,1565109932090,2019-08-06T16:45:32.090,1
20609,1565109933092,2019-08-06T16:45:33.092,1
20619,1565109934094,2019-08-06T16:45:34.094,1
20629,1565109935097,2019-08-06T16:45:35.097,1


In [113]:
print(train_labels.shape)  # Get the shape of the dataframe

(375, 3)


From above, the dataframe has the following columns: `timestamp`, `UTC time`, and `label`.

It is immediately clear that this dataframe is much smaller than the first dataframe: only 375 rows and 3 columns, compared with 3744 rows and 6 columns. Upon closer inspection of the first five rows, it can be seen that some timestamps from the `train_time_series.csv` file are missing from the `train_labels.csv` file, such as the timestamp `1565109930787`. In the rows shown, the labeled indices (`20589`, `20599`, `20609`, ...) are ten apart, so only about one in ten accelerometer readings has a label. This distinction is important because it means that a label is not available for every accelerometer reading.
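The gap between the two files can also be checked programmatically rather than by eye. The sketch below uses small toy dataframes in place of the real ones and assumes, as the merging code in this notebook does, that rows are matched on the dataframe index rather than on the `timestamp` column.

```python
import pandas as pd

# Toy stand-ins for the two dataframes, indexed the same way as the real ones.
time_series = pd.DataFrame(
    {"x": [0.1, 0.2, 0.3, 0.4, 0.5]},
    index=[20586, 20587, 20588, 20589, 20590],
)
labels = pd.DataFrame({"label": [1]}, index=[20589])

# Boolean mask: True where a time-series row has a matching row in labels.
has_label = time_series.index.isin(labels.index)

print(has_label.sum())                      # number of labeled rows
print(list(time_series.index[~has_label]))  # indices without a label
```

Applied to the real dataframes, the same mask would show how many of the 3744 readings have a label.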

It would be useful to have a single dataframe to contain both the time series and the labels of each timestamp of recorded accelerometer data. We can do this by iterating through the rows of the `train_time_series` dataframe and adding the corresponding label from the `train_labels` dataframe to a new dataframe called `train_data`. If the timestamp is not found in the `train_labels` dataframe, then the label will be set to `NaN`.

In [114]:
from numpy import nan

train_data = train_time_series.copy()  # Create a copy of the train_time_series dataframe as train_data

for index in train_data.index:
    if index in train_labels.index:
        train_data.loc[index, 'label'] = train_labels.loc[index, 'label']
    else:
        train_data.loc[index, 'label'] = nan

train_data.head()  # Get the first five rows of data

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z,label
20586,1565109930787,2019-08-06T16:45:30.787,unknown,-0.006485,-0.93486,-0.069046,
20587,1565109930887,2019-08-06T16:45:30.887,unknown,-0.066467,-1.015442,0.089554,
20588,1565109930987,2019-08-06T16:45:30.987,unknown,-0.043488,-1.021255,0.178467,
20589,1565109931087,2019-08-06T16:45:31.087,unknown,-0.053802,-0.987701,0.068985,1.0
20590,1565109931188,2019-08-06T16:45:31.188,unknown,-0.054031,-1.003616,0.12645,
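The row-by-row loop above works, but the same result can usually be obtained with a single index-aligned left join. The following is a sketch on toy data, not a replacement for the cell above; the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for train_time_series and train_labels.
time_series = pd.DataFrame(
    {"x": [-0.006485, -0.066467, -0.043488, -0.053802, -0.054031]},
    index=[20586, 20587, 20588, 20589, 20590],
)
labels = pd.DataFrame({"label": [1]}, index=[20589])

# Left join on the index: rows with no match in labels receive NaN.
merged = time_series.join(labels["label"], how="left")

print(merged)
```

As in the loop version, the `label` column becomes floating-point because `NaN` cannot be stored in an integer column.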


We can now drop the rows with a `NaN` label from the `train_data` dataframe.

In [117]:
train_data.dropna(inplace=True)  # Drop the NaN rows from the dataframe
train_data.head()  # Get the first five rows of data

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z,label
20589,1565109931087,2019-08-06T16:45:31.087,unknown,-0.053802,-0.987701,0.068985,1.0
20599,1565109932090,2019-08-06T16:45:32.090,unknown,0.013718,-0.852371,-0.00087,1.0
20609,1565109933092,2019-08-06T16:45:33.092,unknown,0.145584,-1.007843,-0.036819,1.0
20619,1565109934094,2019-08-06T16:45:34.094,unknown,-0.09938,-1.209686,0.304489,1.0
20629,1565109935097,2019-08-06T16:45:35.097,unknown,0.082794,-1.001434,-0.025375,1.0
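Calling `dropna` without arguments removes every row that has a `NaN` in any column. That is fine here if only the `label` column contains missing values, which appears to be the case in the rows shown. If other columns could contain gaps, restricting the check to `label` is safer. The sketch below illustrates the difference on toy data.

```python
import numpy as np
import pandas as pd

# Toy data: one row is missing x, another is missing its label.
data = pd.DataFrame(
    {
        "x": [0.1, np.nan, 0.3],
        "label": [1.0, 2.0, np.nan],
    },
    index=[20589, 20599, 20600],
)

# No arguments: any NaN in any column removes the row.
print(data.dropna().shape)  # (1, 2)

# subset=["label"]: only rows with a missing label are removed.
# The labels can then be stored as integers again.
labeled = data.dropna(subset=["label"]).astype({"label": int})
print(labeled.shape)  # (2, 2)
```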


We can now check the shape of the `train_data` dataframe.

In [118]:
print(train_data.shape)  # Get the shape of the dataframe

(375, 7)


As expected, there are $375$ rows (the number of rows in the `train_labels` dataframe) and $7$ columns in the `train_data` dataframe.

Right now, the `UTC time` values are strings, as can be seen below.

In [115]:
# Check whether every value in a pandas Series (a DataFrame column) is of a specified data type, such as str.
def is_column_values_datatype(column: pd.Series, datatype: type) -> bool:
    """
    Returns `True` if all values in a pandas DataFrame column are of the specified datatype, `False` otherwise.

    :param column: A pandas DataFrame column.
    :param datatype: The type of data to check for.
    :return: `True` if all values in the column are instances of `datatype`, `False` otherwise.
    """
    return all(isinstance(value, datatype) for value in column)

is_column_values_datatype(train_time_series['UTC time'], str)

True

Therefore, each value in the `UTC time` column will be converted to a pandas `Timestamp` object (a subclass of `datetime.datetime`) to make it easier to generate a time elapsed column later.

In [116]:
import datetime

train_time_series['UTC time'] = pd.to_datetime(train_time_series['UTC time'])
# Check if the UTC time column values are now all DateTime objects
is_column_values_datatype(train_time_series['UTC time'], datetime.datetime)

True
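Once the values are timestamps, a time elapsed column can be computed by subtracting the first timestamp from every row. The sketch below shows one possible approach on three of the timestamps from the output above. The name `elapsed_seconds` is hypothetical, and the notebook's own implementation may differ.

```python
import pandas as pd

# Three UTC time strings taken from the labeled rows shown earlier.
utc = pd.to_datetime(
    pd.Series(
        [
            "2019-08-06T16:45:31.087",
            "2019-08-06T16:45:32.090",
            "2019-08-06T16:45:33.092",
        ]
    )
)

# A dtype-level check avoids looping over every value.
print(pd.api.types.is_datetime64_any_dtype(utc))  # True

# Seconds elapsed since the first reading.
elapsed_seconds = (utc - utc.iloc[0]).dt.total_seconds()
print(elapsed_seconds.tolist())
```

Subtracting two datetime columns yields timedeltas, and `.dt.total_seconds()` turns those into plain floats that can be used directly as a model feature.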