# Predicting the Type of Physical Activity from Tri-Axial Smartphone Accelerometer Data

In this investigation, I aim to create a classification model that can predict the type of physical activity from an input set of tri-axial smartphone accelerometer data with high accuracy. This model emulates real-world models used by mobile apps like Fitbit and Apple Fitness to automatically record physical activity, such as the number of steps taken and the number of stairs climbed, without user input. For each labeled accelerometer reading, the model outputs an integer between `1` and `4` inclusive as its prediction, with `1` = standing, `2` = walking, `3` = stairs down, and `4` = stairs up.
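
As a small illustration of the label encoding described above, the integer predictions can be translated back into activity names with a lookup table. The names `ACTIVITY_NAMES` and `decode_labels` are hypothetical helpers introduced here for readability; they are not part of the notebook.

```python
# Mapping taken from the label description above.
# ACTIVITY_NAMES and decode_labels are illustrative names, not from the notebook.
ACTIVITY_NAMES = {1: "standing", 2: "walking", 3: "stairs down", 4: "stairs up"}


def decode_labels(labels):
    """Translate integer labels (1-4) into human-readable activity names."""
    return [ACTIVITY_NAMES[int(label)] for label in labels]


print(decode_labels([1, 2, 4]))
```

A plain dictionary is enough here because the label set is small and fixed; an unknown label raises a `KeyError`, which makes encoding mistakes visible early.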

## Overall Procedure
1. Import the `train_time_series.csv` and `train_labels.csv` files into dataframes.
2. Extract the predictor variables from the `train_time_series.csv` dataframe and the outcomes from the `train_labels.csv` dataframe.
3. Create a logistic regression classification model with the training data and evaluate the accuracy and speed of this model.
4. Create a random forest classification model with the training data and evaluate the accuracy and speed of this model.
5. Compare the accuracies of the logistic regression classification model and random forest classification model and analyze the accuracy and speed of both models.
6. Create a suite of functions for the more accurate model.
7. Report the final time and accuracy of the model.
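
The accuracy comparison in steps 3 to 5 can be sketched with a small helper that computes the fraction of correct predictions. This is only a sketch under assumptions: the function name `accuracy` and the toy labels are hypothetical, and the notebook itself may evaluate accuracy differently.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))


# Toy labels for illustration only (not taken from the dataset):
# three of the four predictions match, so the accuracy is 0.75.
print(accuracy([1, 2, 2, 4], [1, 2, 3, 4]))
```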

The first step is to import `train_time_series.csv` into a `pandas` dataframe. This file contains the time series of tri-axial accelerometer readings recorded during the four activities (standing, walking, stairs down, and stairs up). These data will become part of the feature matrix $X$ for the classification model.

Let us find the first five rows of the `train_time_series` dataframe as well as its shape.

In [321]:
import pandas as pd
import time

start_time = time.perf_counter()  # Start timer before importing data

train_time_series = pd.read_csv('train_time_series.csv', index_col=0)
train_time_series.head()  # Get the first five rows of data

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z
20586,1565109930787,2019-08-06T16:45:30.787,unknown,-0.006485,-0.93486,-0.069046
20587,1565109930887,2019-08-06T16:45:30.887,unknown,-0.066467,-1.015442,0.089554
20588,1565109930987,2019-08-06T16:45:30.987,unknown,-0.043488,-1.021255,0.178467
20589,1565109931087,2019-08-06T16:45:31.087,unknown,-0.053802,-0.987701,0.068985
20590,1565109931188,2019-08-06T16:45:31.188,unknown,-0.054031,-1.003616,0.12645


In [322]:
print(train_time_series.shape)  # Get the shape of the dataframe

(3744, 6)


From above, the dataframe has the following columns: `timestamp`, `UTC time`, `accuracy`, `x`, `y`, and `z`. We also know it contains $3744$ rows and $6$ columns.

Let us now import the `train_labels.csv` file into a `pandas` dataframe. The `train_labels.csv` file contains the activity labels (standing, walking, stairs down, and stairs up) together with the timestamps of the accelerometer readings they belong to. These data will become the outcomes vector $y$ for the classification model.

Let us find the first five rows of data in the `train_labels` dataframe as well as its shape.

In [323]:
train_labels = pd.read_csv('train_labels.csv', index_col=0)
train_labels.head()  # Get the first five rows of data

Unnamed: 0,timestamp,UTC time,label
20589,1565109931087,2019-08-06T16:45:31.087,1
20599,1565109932090,2019-08-06T16:45:32.090,1
20609,1565109933092,2019-08-06T16:45:33.092,1
20619,1565109934094,2019-08-06T16:45:34.094,1
20629,1565109935097,2019-08-06T16:45:35.097,1


In [324]:
print(train_labels.shape)  # Get the shape of the dataframe

(375, 3)


From above, the dataframe has the following columns: `timestamp`, `UTC time`, and `label`.

It is immediately clear that this dataframe is much smaller than the first dataframe: only $375$ rows and $3$ columns. Upon closer inspection of the first five rows, it can be seen that some timestamps from the `train_time_series.csv` file are missing from the `train_labels.csv` file, such as the timestamp `1565109930787`. This distinction is important because it indicates that most accelerometer readings have no label. This behavior is expected because the problem statement says that labels are given only to select accelerometer readings: the indices above (`20589`, `20599`, `20609`, ...) show that every tenth reading is labeled, which corresponds to roughly one label per second.
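
The spacing between labeled readings can be checked directly. The sketch below rebuilds the five `train_labels` rows displayed above by hand, so it runs without the CSV file, and assumes the `timestamp` column is in milliseconds (consistent with the `UTC time` column).

```python
import pandas as pd

# First five rows of train_labels, copied from the output shown above
labels = pd.DataFrame(
    {
        "timestamp": [1565109931087, 1565109932090, 1565109933092,
                      1565109934094, 1565109935097],
        "label": [1, 1, 1, 1, 1],
    },
    index=[20589, 20599, 20609, 20619, 20629],
)

# Gap between consecutive labeled rows, measured in row indices
index_gap = labels.index.to_series().diff().dropna()

# Gap between consecutive labeled rows, converted from milliseconds to seconds
time_gap_s = labels["timestamp"].diff().dropna() / 1000

print(index_gap.unique())
print(time_gap_s.round(1).unique())
```

For these five rows, consecutive labels are ten indices and about one second apart.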

It would be useful to have a single dataframe to contain both the time series and the labels of each timestamp of recorded accelerometer data. We can do this by iterating through the rows of the `train_time_series` dataframe and adding the corresponding label from the `train_labels` dataframe to a new dataframe called `train_data`. If the timestamp is not found in the `train_labels` dataframe, then the label will be set to `NaN`.

In [325]:
from numpy import nan

train_data = train_time_series.copy()  # Create a copy of the train_time_series dataframe as train_data

for index in train_data.index:
    if index in train_labels.index:
        train_data.loc[index, 'label'] = train_labels.loc[index, 'label']
    else:
        train_data.loc[index, 'label'] = nan

train_data.head()  # Get the first five rows of data

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z,label
20586,1565109930787,2019-08-06T16:45:30.787,unknown,-0.006485,-0.93486,-0.069046,
20587,1565109930887,2019-08-06T16:45:30.887,unknown,-0.066467,-1.015442,0.089554,
20588,1565109930987,2019-08-06T16:45:30.987,unknown,-0.043488,-1.021255,0.178467,
20589,1565109931087,2019-08-06T16:45:31.087,unknown,-0.053802,-0.987701,0.068985,1.0
20590,1565109931188,2019-08-06T16:45:31.188,unknown,-0.054031,-1.003616,0.12645,


We can now drop the NaN rows from the `train_data` dataframe.

In [326]:
train_data.dropna(inplace=True)  # Drop the NaN rows from the dataframe
train_data = train_data.astype({'label': 'int'})  # Convert the label column to an integer
train_data.head()  # Get the first five rows of data

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z,label
20589,1565109931087,2019-08-06T16:45:31.087,unknown,-0.053802,-0.987701,0.068985,1
20599,1565109932090,2019-08-06T16:45:32.090,unknown,0.013718,-0.852371,-0.00087,1
20609,1565109933092,2019-08-06T16:45:33.092,unknown,0.145584,-1.007843,-0.036819,1
20619,1565109934094,2019-08-06T16:45:34.094,unknown,-0.09938,-1.209686,0.304489,1
20629,1565109935097,2019-08-06T16:45:35.097,unknown,0.082794,-1.001434,-0.025375,1


We can now check the shape of the `train_data` dataframe.

In [327]:
print(train_data.shape)  # Get the shape of the dataframe

(375, 7)


As expected, there are $375$ rows (the number of rows in the `train_labels` dataframe) and $7$ columns in the `train_data` dataframe.
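
The merge-then-drop steps above can also be expressed as a single inner join on the index, which avoids the explicit Python loop. The following is a minimal sketch using small hand-built stand-ins for `train_time_series` and `train_labels` (values copied from the tables above); the variable names `time_series`, `labels`, and `merged` are illustrative.

```python
import pandas as pd

# Small stand-ins for train_time_series and train_labels
time_series = pd.DataFrame(
    {
        "x": [-0.043488, -0.053802, -0.054031],
        "y": [-1.021255, -0.987701, -1.003616],
        "z": [0.178467, 0.068985, 0.12645],
    },
    index=[20588, 20589, 20590],
)
labels = pd.DataFrame({"label": [1]}, index=[20589])

# An inner join on the index keeps only the rows present in both dataframes,
# so unlabeled readings never enter the result and no NaN labels are created.
merged = time_series.join(labels["label"], how="inner")
print(merged)
```

Because no `NaN` values are introduced, the `label` column keeps its integer type and the later `astype` conversion becomes unnecessary. One difference to keep in mind: `dropna` also removes rows with missing values in other columns, whereas the join only filters on the presence of a label.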

The `test_time_series.csv` file contains new accelerometer data with the same columns as `train_time_series.csv`. We can now import the `test_time_series.csv` file into a `pandas` dataframe.

In [328]:
test_time_series = pd.read_csv('test_time_series.csv', index_col=0)
test_time_series.head()  # Get the first five rows of data

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z
24330,1565110306139,2019-08-06T16:51:46.139,unknown,0.034286,-1.504456,0.157623
24331,1565110306239,2019-08-06T16:51:46.239,unknown,0.409164,-1.038544,0.030975
24332,1565110306340,2019-08-06T16:51:46.340,unknown,-0.23439,-0.984558,0.124771
24333,1565110306440,2019-08-06T16:51:46.440,unknown,0.251114,-0.787003,0.05481
24334,1565110306540,2019-08-06T16:51:46.540,unknown,0.109924,-0.16951,0.23555


The `test_labels.csv` file is a CSV file with labels corresponding to the rows of the `test_time_series.csv` file. This file contains the actual labels for each test data point. We can now import the `test_labels.csv` file into a `pandas` dataframe. After that, we can merge `test_time_series` and `test_labels` into a single dataframe called `test_data` as before.

In [329]:
test_labels = pd.read_csv('test_labels.csv', index_col=0)
test_labels.head()  # Get the first five rows of data

Unnamed: 0,timestamp,UTC time,label
24339,1565110307041,2019-08-06T16:51:47.041,2
24349,1565110308043,2019-08-06T16:51:48.043,4
24359,1565110309046,2019-08-06T16:51:49.046,2
24369,1565110310048,2019-08-06T16:51:50.048,4
24379,1565110311050,2019-08-06T16:51:51.050,2


In [330]:
# To begin merging, create a copy of the test_labels dataframe as test_data
test_data = test_labels.copy()

# Copy the x, y, and z readings from the test_time_series dataframe into the matching row of test_data.
# If a labeled row has no matching reading in test_time_series, set its label to NaN instead.
for index in test_data.index:
    if index in test_time_series.index:
        test_data.loc[index, ['x', 'y', 'z']] = test_time_series.loc[index, ['x', 'y', 'z']]
    else:
        test_data.loc[index, 'label'] = nan

# Display the dataframe (pandas truncates the middle rows)
test_data

Unnamed: 0,timestamp,UTC time,label,x,y,z
24339,1565110307041,2019-08-06T16:51:47.041,2,0.098282,-0.833771,0.118042
24349,1565110308043,2019-08-06T16:51:48.043,4,0.348465,-0.946701,-0.051041
24359,1565110309046,2019-08-06T16:51:49.046,2,0.377335,-0.849243,-0.026474
24369,1565110310048,2019-08-06T16:51:50.048,4,0.110077,-0.520325,0.312714
24379,1565110311050,2019-08-06T16:51:51.050,2,0.283478,-0.892548,-0.085876
...,...,...,...,...,...,...
25539,1565110427366,2019-08-06T16:53:47.366,2,-0.043915,-0.242416,0.068802
25549,1565110428369,2019-08-06T16:53:48.369,2,0.118271,-1.212097,0.357468
25559,1565110429371,2019-08-06T16:53:49.371,2,0.667404,-0.978851,0.171906
25569,1565110430373,2019-08-06T16:53:50.373,2,0.371384,-1.021927,-0.244446


In [331]:
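from xgboost import XGBClassifier

# XGBClassifier expects class labels that start at 0, so shift the labels from 1-4 to 0-3 before fitting
xgb_model = XGBClassifier().fit(train_data[['timestamp', 'x', 'y', 'z']], train_data['label'] - 1)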

In [332]:
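# Keep only the identifying columns and the label column for the output file
xgb_output_data = test_data.copy()[['timestamp', 'UTC time', 'label']]

# Predict on the test features and shift the predictions back from 0-3 to 1-4
xgb_output_data['label'] = xgb_model.predict(test_data[['timestamp', 'x', 'y', 'z']]) + 1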

In [333]:
print(list(xgb_output_data['label']))

[2, 4, 2, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 4, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]


In [334]:
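# Write the predictions to disk. Note that this overwrites the test_labels.csv file that was read earlier.
xgb_output_data.to_csv('test_labels.csv')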

In [335]:
runtime = time.perf_counter() - start_time
print('Run time:', runtime, "seconds")

Run time: 0.6409250830001838 seconds
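
To time individual stages (for example, fitting versus predicting) rather than the whole notebook, the same `time.perf_counter` approach can be wrapped in a context manager. This is a sketch only; `timed` and `timings` are hypothetical names, and the summation stands in for a real model-fitting step.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock time of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("sum", timings):
    total = sum(range(1000))  # placeholder for a step such as model fitting

print(timings)
```

Collecting the timings in a dictionary makes it straightforward to compare the speed of several models side by side.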
