### Introduction
In this final project, we'll attempt to predict the type of physical activity (e.g., walking, climbing stairs) from tri-axial smartphone accelerometer data. 


In [None]:
#Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.layers import Conv2D, MaxPool2D
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import scipy.stats as stats
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import confusion_matrix
import time

In [None]:
# start=time.process_time() #Run Time

In [None]:
#load the datasets
train_df = pd.read_csv('train_time_series.csv')
train_labels_df = pd.read_csv('train_labels.csv')
test_df = pd.read_csv('test_time_series.csv')
test_labels_df = pd.read_csv('test_labels.csv')

#### Clean the data
The labels need to be merged into the training data. Since train_labels_df only labels every 10th data point, the missing labels in the concatenated dataframe need to be filled in.
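
The align-and-fill step can be illustrated on a toy frame. This is only a minimal sketch with made-up values: the names `readings` and `sparse_labels` are hypothetical, and it assumes that each label applies to the unlabeled rows leading up to it.

```python
import pandas as pd

# Hypothetical readings: six consecutive samples indexed 0..5.
readings = pd.DataFrame({"x": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})

# Hypothetical sparse labels: only rows 2 and 5 are labeled.
sparse_labels = pd.Series([1, 2], index=[2, 5], name="label")

# Align on the shared index; unlabeled rows become NaN.
combined = pd.concat([readings, sparse_labels], axis=1)

# Back-fill: each NaN takes the next available label below it.
combined["label"] = combined["label"].bfill().astype(int)

print(combined["label"].tolist())
```

Rows 0 to 2 end up with the first label and rows 3 to 5 with the second, which is the behaviour the real data relies on.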

In [None]:
# clean the training data
df1 = train_df.copy()
df2 = train_labels_df.copy()
df1 = df1.set_index('Unnamed: 0')
df2 = df2.set_index('Unnamed: 0')
train_df_with_labels = pd.concat([df1,df2['label']],axis=1)
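# back-fill so each unlabeled row takes the next available label
# (DataFrame.bfill() replaces the deprecated fillna(method='bfill'))
train_df_with_labels = train_df_with_labels.bfill()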

# train_df_with_labels.info()


In [None]:
# clean the test data
df1 = test_df.copy()
df2 = test_labels_df.copy()
df1 = df1.set_index('Unnamed: 0')
df2 = df2.set_index('Unnamed: 0')
test_df_with_labels = pd.concat([df1,df2['label']],axis=1)

# test_df_with_labels.info()

In [None]:
# quick glance at the training data
plt.plot(train_df_with_labels['timestamp'],train_df_with_labels['x'], label='x')
plt.plot(train_df_with_labels['timestamp'],train_df_with_labels['y'], label='y')
plt.plot(train_df_with_labels['timestamp'],train_df_with_labels['z'], label='z')
plt.plot(train_df_with_labels['timestamp'],train_df_with_labels['label'])
plt.legend()
plt.show()

The x and z data seem to be mostly centered around 0, while the y accelerometer data is slightly offset. There is also an uneven distribution of the different activities (1 = standing, 2 = walking, 3 = stairs down, 4 = stairs up). I will filter the data by activity to examine it more closely.
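
The imbalance can be quantified rather than read off the plot. The sketch below uses a hypothetical label column with made-up values; with the real data, the same calls would be applied to the label column of train_df_with_labels.

```python
import pandas as pd

# Hypothetical label column standing in for the real training labels.
labels = pd.Series([2, 2, 2, 3, 4, 2, 1, 4, 2, 3], name="label")

# Map the numeric codes to the activity names given in the text.
activity_names = {1: "standing", 2: "walking", 3: "stairs down", 4: "stairs up"}

# Count observations per activity and convert to shares of the total.
counts = labels.map(activity_names).value_counts()
shares = (counts / counts.sum()).round(2)

print(pd.DataFrame({"count": counts, "share": shares}))
```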

In [None]:
standing = train_df_with_labels[train_df_with_labels.label == 1]
walking = train_df_with_labels[train_df_with_labels.label == 2]
downstairs = train_df_with_labels[train_df_with_labels.label == 3]
upstairs = train_df_with_labels[train_df_with_labels.label == 4]

Examine the filtered dataframes.

In [None]:
#set up multiindex df to display summary stats
activities = ['standing','walking','downstairs','upstairs']
features = ['x','y','z']
index = pd.MultiIndex.from_product([activities, features])

dfs = [standing, walking, downstairs, upstairs]
features_summary_df = pd.DataFrame(index = ['count','mean','stdev', 'min','max'], columns=index)
for i in range(len(dfs)):
    for j in range(len(features)):
        features_summary_df[activities[i],features[j]] = [dfs[i][features[j]].count(),
                                                    dfs[i][features[j]].mean(), 
                                                  dfs[i][features[j]].std(), 
                                                  dfs[i][features[j]].min(), 
                                                  dfs[i][features[j]].max()] 

features_summary_df

In [None]:
# make 3d plots to display the accelerometer data filtered by activity
fig = plt.figure(figsize=(10,10))

def subplot(df,label,position):
    ax = fig.add_subplot(position,projection='3d')
    ax.scatter(df['x'],df['y'],df['z'],label=label)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.legend(loc='upper right')
    
subplot(standing,'standing',221)
subplot(walking,'walking',222)
subplot(downstairs,'downstairs',223)
subplot(upstairs,'upstairs',224)

At a glance, the standing data is visibly different from the other activities, but the other activities are not easily distinguished from each other. The other obvious feature is the very small number of observations for standing. Ideally I would balance the data so that we work with an equal number of observations for each activity, but I think that limiting the size of the dataset that much would hurt more than help.
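
To see how much data balancing by undersampling would discard, the effect can be sketched on a toy frame. The class sizes here are invented for illustration and do not reflect the real counts; the approach (sampling every class down to the size of the smallest) is one common way to balance, not necessarily the only option.

```python
import pandas as pd

# Hypothetical labeled frame with a rare class (label 1).
df = pd.DataFrame({
    "x": range(20),
    "label": [1] * 2 + [2] * 10 + [3] * 4 + [4] * 4,
})

# The smallest class sets the per-class sample size.
n_min = df["label"].value_counts().min()

# Draw n_min rows from every class without replacement.
balanced = df.groupby("label").sample(n=n_min, random_state=0)

print(f"rows before: {len(df)}, rows after: {len(balanced)}")
```

In this toy case the balanced frame keeps only 8 of 20 rows, which is the kind of loss the paragraph above is concerned about.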

### Testing different models

We covered several classification models during the course, but it is not clear which one is best suited to this data.
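
One way to make such a comparison less dependent on a single random split is to score every model on the same stratified folds. The following is a rough sketch on synthetic data, not the notebook's actual procedure: `X_demo`, `y_demo` and the hyperparameters are assumptions chosen only so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the (x, y, z) features and four activity labels.
rng = np.random.default_rng(0)
y_demo = np.repeat([1, 2, 3, 4], 50)
X_demo = rng.normal(size=(200, 3)) + y_demo[:, None]  # class-dependent offset

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Use the same stratified folds for every model so the scores are comparable.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=folds).mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Stratified folds keep the share of each activity roughly constant across folds, which matters when one class is rare.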

In [None]:
# Define the data I'm going to use for the different models

X=np.array(train_df_with_labels[['x','y','z']])
y=np.array(train_df_with_labels['label'])
test=np.array(test_df_with_labels[['x','y','z']])

#### Logistic Regression

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.5) # split the data

logmodel=LogisticRegression()
logmodel.fit(X_train,y_train) # fit the model
print(f'Score: {logmodel.score(X_test,y_test)}')

#### Random Forest

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.5)

forestmodel=RandomForestClassifier()
forestmodel.fit(X_train,y_train)
print(f'Score: {forestmodel.score(X_test,y_test)}')
print(classification_report(y_test,forestmodel.predict(X_test)))

#### KNN

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.5)

knnmodel=KNeighborsClassifier(n_neighbors=5)
knnmodel.fit(X_train,y_train)
print(f'Score: {knnmodel.score(X_test,y_test)}')
print(classification_report(y_test,knnmodel.predict(X_test)))

#### CNN
Convolutional neural networks (CNNs) were not covered in the course, but I found an example of a CNN being used to predict outcomes from similar data, so I will compare its results to those of the models above.

In [None]:
# while the x,y and z data all seem to be of a similar scale, there are some discrepancies in the mean and variance
# so I will standardize the data before running the CNN

X = train_df_with_labels[['x', 'y', 'z']]
y = train_df_with_labels['label']

scaler = StandardScaler()
X = scaler.fit_transform(X)

scaled_X = pd.DataFrame(data = X, columns = ['x', 'y', 'z'])
scaled_X['label'] = y.values

scaled_X.head()
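
What the scaler does to each axis can be checked by hand. This is a small sketch on hypothetical readings (the offsets and spreads are made up); it relies on StandardScaler using the per-column mean and the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical readings for three axes with different offsets and spreads.
rng = np.random.default_rng(0)
raw = rng.normal(loc=[0.0, -1.0, 0.2], scale=[0.5, 0.3, 0.8], size=(100, 3))

# StandardScaler subtracts each column's mean and divides by its
# population standard deviation (ddof=0), which is NumPy's default.
scaled = StandardScaler().fit_transform(raw)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```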

In [None]:
#divide data into a series of individual timeframes
Fs = 10 # 10 data points per second
frame_size = Fs*4 # 4 second frames
step_size = Fs*2 # 2 second steps

In [None]:
def get_frames(df, frame_size, step_size):
    # Slice the dataframe passed in (not the unscaled global one) into overlapping frames

    frames = []
    labels = []
    for i in range(0, len(df) - frame_size, step_size):
        # Rows are time steps, columns are the x, y, z axes: shape (frame_size, 3)
        window = df[['x', 'y', 'z']].values[i: i + frame_size]

        # Retrieve the most often used label in this segment
        label = stats.mode(df['label'].values[i: i + frame_size], keepdims=True)[0][0]
        frames.append(window)
        labels.append(label)

    # Stack into arrays of shape (n_frames, frame_size, 3) and (n_frames,).
    # Stacking avoids reshaping a (n_frames, 3, frame_size) array, which would mix up the axes.
    frames = np.asarray(frames)
    labels = np.asarray(labels)

    return frames, labels

X,y = get_frames(scaled_X, frame_size, step_size)

X.shape, y.shape
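
The number and shape of the frames follow directly from the frame and step sizes. The sketch below reproduces the windowing on a hypothetical array; the sample count `n_samples` is an assumption chosen for illustration, and `signal` is a made-up stand-in for the three accelerometer columns.

```python
import numpy as np

Fs = 10                 # samples per second
frame_size = Fs * 4     # 40 samples per frame
step_size = Fs * 2      # 20 samples between frame starts (50% overlap)

# Assumed sample count, chosen only for illustration.
n_samples = 3744

# Hypothetical signal: three columns standing in for x, y and z.
signal = np.arange(n_samples * 3, dtype=float).reshape(n_samples, 3)

# Same start indices as a loop over range(0, len(df) - frame_size, step_size).
starts = range(0, n_samples - frame_size, step_size)
frames = np.stack([signal[i:i + frame_size] for i in starts])

print(frames.shape)
```

Each frame keeps time along its first axis and the three sensor axes along its second, so consecutive frames share half of their samples.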

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)

In [None]:
print(X_train.shape, X_test.shape)
print(X_train[0].shape, X_test[0].shape)

In [None]:
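# Add a trailing channel axis so each frame has shape (frame_size, 3, 1) for Conv2D
# (-1 lets NumPy infer the number of frames instead of hard-coding it)
X_train = X_train.reshape(-1, frame_size, 3, 1)
X_test = X_test.reshape(-1, frame_size, 3, 1)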
print(X_train[0].shape, X_test[0].shape)

In [None]:
#Define the CNN model
model = Sequential()
model.add(Conv2D(16, (2, 2), activation = 'relu', input_shape = X_train[0].shape))
model.add(Dropout(0.1))

model.add(Conv2D(32, (2, 2), activation='relu'))
model.add(Dropout(0.2))

model.add(Flatten())

model.add(Dense(32, activation = 'relu'))
model.add(Dropout(0.5))

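# Labels run from 1 to 4, so sparse_categorical_crossentropy needs at least 5 output units;
# 6 works, but outputs 0 and 5 are never used
model.add(Dense(6, activation='softmax'))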

print(model.output_shape)
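
The size of the output layer interacts with how the labels are encoded. With a sparse categorical cross-entropy loss, each integer label is used as an index into the softmax output, so labels 1 to 4 need an output vector with at least 5 entries. The sketch below shows the per-sample computation in plain NumPy; `sparse_ce_single` is a hypothetical helper written for illustration, not a Keras function, and the probabilities are made up.

```python
import numpy as np

def sparse_ce_single(probs, label):
    """Loss for one sample: negative log of the probability at index `label`."""
    return -np.log(probs[label])

# Hypothetical softmax output with 6 units (indices 0..5).
probs = np.array([0.05, 0.10, 0.60, 0.10, 0.10, 0.05])

# Labels 1..4 index directly into the output vector, so a layer with
# only 4 units (indices 0..3) would have no slot for label 4.
for label in [1, 2, 3, 4]:
    print(label, round(float(sparse_ce_single(probs, label)), 3))
```

An alternative would be to shift the labels to 0 to 3 and use 4 output units, at the cost of mapping predictions back afterwards.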

In [None]:
# compile and fit the model
model.compile(optimizer=Adam(learning_rate = 0.001), loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
history = model.fit(X_train, y_train, epochs = 20, validation_data= (X_test, y_test), verbose=1)

In [None]:

def plot_learningCurve(history, epochs):
  # Plot training & validation accuracy values
  epoch_range = range(1, epochs+1)
  plt.plot(epoch_range, history.history['accuracy'])
  plt.plot(epoch_range, history.history['val_accuracy'])
  plt.title('Model accuracy')
  plt.ylabel('Accuracy')
  plt.xlabel('Epoch')
  plt.legend(['Train', 'Val'], loc='upper left')
  plt.show()

  # Plot training & validation loss values
  plt.plot(epoch_range, history.history['loss'])
  plt.plot(epoch_range, history.history['val_loss'])
  plt.title('Model loss')
  plt.ylabel('Loss')
  plt.xlabel('Epoch')
  plt.legend(['Train', 'Val'], loc='upper left')
  plt.show()

In [None]:
plot_learningCurve(history, 20)

The CNN model does not appear to perform any better on this data than the other models.

### Training and fitting the model
Overall, the KNN model gave the best results, so I will use that model to predict the activities for the test data.
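
A single accuracy score can hide which activities a model confuses, so a confusion matrix is a useful complement when choosing between models. The sketch below uses hypothetical true and predicted labels rather than output from any of the models above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted activity labels (1..4).
y_true = np.array([1, 2, 2, 2, 3, 3, 4, 4, 2, 3])
y_pred = np.array([1, 2, 2, 3, 3, 2, 4, 2, 2, 3])

# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])
print(cm)

# Per-class recall: correct predictions divided by the row totals.
recall = cm.diagonal() / cm.sum(axis=1)
print(recall.round(2))
```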

In [None]:
#Train the knn model
X_train=np.array(train_df_with_labels[['x','y','z']])
y_train=np.array(train_df_with_labels['label'])
# test data
test=np.array(test_df_with_labels[['x','y','z']])

final_model=KNeighborsClassifier(n_neighbors=5)
final_model.fit(X_train,y_train)

final_model.predict(test)

In [None]:
#Predict labels for test data based on trained model
label_predictions=final_model.predict(test)
test_df_with_labels['label']=label_predictions
#save as csv for submission
test_df_with_labels.to_csv('test labels.csv')

### Conclusion
I tried several approaches but did not see any large differences in outcome between them. The same CNN model I used here produced much higher accuracies with a different data set (TensorFlow analysis of accelerometer data: https://kgptalkie.com/human-activity-recognition-using-accelerometer-data/). As best I can tell, the main limiting factor here is the small size of the data set. Normally it would be better to use a balanced data set (similar numbers of observations for each activity), but the very small number of observations for standing means that balancing would have made the data set too small to be reliable. It is possible that tuning the parameters of the CNN model would have improved the outcome, but I need to spend more time understanding that model first.