Human Activity Recognition with Smartphones

https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013.

In [2]:
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
from sklearn.utils import shuffle

# Load up the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train = shuffle(train)
test = shuffle(test)

# let's take a gander
display(train.head())

print(train.shape)
print(test.shape)

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject,Activity
846,0.268721,0.002852,-0.101237,-0.077811,0.066019,-0.323281,-0.155269,0.029815,-0.286556,0.404905,...,-0.189048,0.251453,-0.608322,0.949318,-0.940855,-0.784688,0.245746,0.037306,5,WALKING_UPSTAIRS
5662,0.199756,-0.05445,-0.126101,-0.116552,-0.013445,-0.389103,-0.208698,-0.040712,-0.381358,0.080198,...,-0.527984,0.286606,0.645512,0.918708,-0.813764,-0.765804,0.25961,0.029168,26,WALKING_UPSTAIRS
4638,0.291675,-0.020475,-0.07995,0.001924,-0.026864,-0.123359,0.011103,-0.154024,-0.173451,0.000478,...,0.34876,-0.175459,-0.080655,0.714433,-0.206603,-0.751445,0.266426,0.053042,22,WALKING
3192,0.217206,-0.021366,-0.122693,0.244988,-0.13017,0.060327,0.173675,-0.156793,0.0792,0.497257,...,-0.650273,0.66662,-0.891268,0.954843,-0.206973,-0.784206,0.013188,0.148256,16,WALKING_DOWNSTAIRS
1590,0.27588,-0.024806,-0.107179,0.257385,-0.082598,-0.39574,0.182759,-0.168866,-0.432667,0.585873,...,-0.850583,-0.013511,-0.706398,-0.895567,-0.388917,-0.883673,0.177604,0.004475,7,WALKING_DOWNSTAIRS


(7352, 563)
(2947, 563)


In [4]:
# Separate subject information
subject_training_data = train['subject']
subject_testing_data = test['subject']

# Separate labels
training_labels = train['Activity']
testing_labels = test['Activity']

# Drop labels and subject info from data
train = train.drop(['subject', 'Activity'], axis=1)
test = test.drop(['subject', 'Activity'], axis=1)

# Print some information about our data
print("Training data consists of {} instances of data with {} total features".format(train.shape[0], train.shape[1]))
print("Training data includes value counts of\n{}".format(training_labels.value_counts()))
print("Testing data consists of {} instances of data".format(test.shape[0]))
print("Testing data includes value counts of\n{}".format(testing_labels.value_counts()))

Training data consists of 7352 instances of data with 561 total features
Training data includes value counts of
LAYING                1407
STANDING              1374
SITTING               1286
WALKING               1226
WALKING_UPSTAIRS      1073
WALKING_DOWNSTAIRS     986
Name: Activity, dtype: int64
Testing data consists of 2947 instances of data
Testing data includes value counts of
LAYING                537
STANDING              532
WALKING               496
SITTING               491
WALKING_UPSTAIRS      471
WALKING_DOWNSTAIRS    420
Name: Activity, dtype: int64
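
The class counts above give a handy reference point for the accuracies reported later: a classifier that always predicts the most frequent training class. The sketch below is only an illustration; the counts are copied from the output above, and the variable names are made up for this example.

```python
# Class counts copied from the value_counts() output above.
train_counts = {
    "LAYING": 1407, "STANDING": 1374, "SITTING": 1286,
    "WALKING": 1226, "WALKING_UPSTAIRS": 1073, "WALKING_DOWNSTAIRS": 986,
}
test_counts = {
    "LAYING": 537, "STANDING": 532, "WALKING": 496,
    "SITTING": 491, "WALKING_UPSTAIRS": 471, "WALKING_DOWNSTAIRS": 420,
}

# The majority class is picked from the training set only.
majority_class = max(train_counts, key=train_counts.get)

# Accuracy obtained by always predicting that class, on each split.
train_baseline = train_counts[majority_class] / float(sum(train_counts.values()))
test_baseline = test_counts[majority_class] / float(sum(test_counts.values()))

print("Majority class: {}".format(majority_class))
print("Baseline accuracy: train {:.3f}, test {:.3f}".format(train_baseline, test_baseline))
```

The classes are close to balanced, so this baseline sits a little under 20% and plain accuracy is a reasonable first metric here.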


In [5]:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
#from sklearn.manifold import TSNE

#scaler = MinMaxScaler()
#scaled_trainingdata = scaler.fit_transform(train)

# Encode our categorical labels into numerical target labels
le = LabelEncoder()
le = le.fit(["WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS", "SITTING", "STANDING", "LAYING"])
enc_training_labels = le.transform(training_labels)
enc_testing_labels = le.transform(testing_labels)


#tsne = TSNE(init = 'pca')
#tsne_vis = tsne.fit_transform(scaled_trainingdata)
#plt.scatter(tsne_vis[:,0], tsne_vis[:,1], c=encodedlabels)
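
One detail of the encoding step that is easy to miss: `LabelEncoder` stores the class names in sorted order, so the order of the list passed to `fit` does not decide which integer each activity gets. A small sketch of the resulting mapping and of decoding integers back to names (`demo_encoder` is a name used only in this example):

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
demo_encoder.fit(["WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
                  "SITTING", "STANDING", "LAYING"])

# classes_ holds the sorted class names; a name's position is its integer code.
for code, name in enumerate(demo_encoder.classes_):
    print("{} -> {}".format(code, name))

# Round trip: names to codes and back again.
codes = demo_encoder.transform(["WALKING", "LAYING"])
names = demo_encoder.inverse_transform(codes)
print(codes, names)
```

The same `inverse_transform` call is what turns integer predictions back into readable activity names later on.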

In [6]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier 

#Let's try out some out-of-the-box classifiers and see how they perform
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
xt = ExtraTreesClassifier()
kn = KNeighborsClassifier()

def evaluateclf(clf):
    # Cross-validate on the training set only (default number of folds)
    # and report the per-fold accuracies together with their mean.
    scores = cross_val_score(clf, train, enc_training_labels)
    avg = scores.mean()
    return "performances: {}, \nAverage: {}".format(scores, avg)

print("Decision Tree {}".format(evaluateclf(dt)))

print("Random Forest {}".format(evaluateclf(rf)))

print("Extra Trees {}".format(evaluateclf(xt)))

print("K Neighbors {}".format(evaluateclf(kn)))

Decision Tree performances: [ 0.92903752  0.93923328  0.93872549], 
Average: 0.935665429848
Random Forest performances: [ 0.96411093  0.96859706  0.97589869], 
Average: 0.969535562095
Extra Trees performances: [ 0.96982055  0.96982055  0.97385621], 
Average: 0.971165772816
K Neighbors performances: [ 0.96451876  0.96329527  0.96732026], 
Average: 0.965044763601
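
The three scores per classifier come from the default fold count of the scikit-learn version used for this run (newer releases default to five folds), and the tree ensembles are unseeded, so the numbers above will shift a little between runs. A minimal sketch of pinning both, using synthetic data as a stand-in for `train` and `enc_training_labels`; the `_demo` names and the chosen sizes are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the notebook's feature matrix and encoded labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=8, n_classes=3,
                                     random_state=0)

# Explicit, seeded splitter: always 3 stratified folds, same split every run.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Seeding the classifier removes the remaining run-to-run variation.
clf_demo = ExtraTreesClassifier(n_estimators=50, random_state=0)

scores_demo = cross_val_score(clf_demo, X_demo, y_demo, cv=cv)
print("performances: {}, \nAverage: {}".format(scores_demo, scores_demo.mean()))
```

In the notebook itself, the same idea would amount to passing `cv=cv` inside `evaluateclf` and giving each classifier a `random_state`.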


In [9]:
from sklearn.model_selection import RandomizedSearchCV

# The Extremely Randomized Trees (extra trees) classifier looks promising; let's fine-tune some hyper-parameters and see how much we can improve
parameters = {'n_estimators': np.arange(20,200,20), 'min_samples_split': np.arange(2,10,2)}

randgrid = RandomizedSearchCV(xt, parameters, n_iter = 30, n_jobs = 4, verbose = 3)

randgrid = randgrid.fit(train, enc_training_labels)

Fitting 3 folds for each of 30 candidates, totalling 90 fits


[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:  1.3min
[Parallel(n_jobs=4)]: Done  90 out of  90 | elapsed:  2.5min finished
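
To put `n_iter = 30` in context, it helps to count how many distinct settings the search space holds. The sketch below rebuilds the same `parameters` dictionary and counts the combinations with `ParameterGrid`. As far as I understand the scikit-learn behaviour, when every parameter is given as a list of values (no distributions) the randomized search samples without replacement, so here it would try 30 distinct combinations.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same search space as in the cell above.
parameters = {'n_estimators': np.arange(20, 200, 20),
              'min_samples_split': np.arange(2, 10, 2)}

# Every combination an exhaustive grid search would have to fit.
n_combinations = len(ParameterGrid(parameters))
n_iter = 30

print("n_estimators values:      {}".format(list(parameters['n_estimators'])))
print("min_samples_split values: {}".format(list(parameters['min_samples_split'])))
print("{} of {} combinations sampled".format(n_iter, n_combinations))
```

With a space this small, an exhaustive `GridSearchCV` would cost only a few more fits than the randomized search; the randomized version pays off mainly when the space grows.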


In [10]:
print(randgrid.best_estimator_)
print(randgrid.best_score_)

# Keep the estimator whose hyper-parameters produced the best 3-fold cross-validation score.
# RandomizedSearchCV already refits it on the full training set by default (refit=True);
# the explicit fit below just makes that step visible.
xt = randgrid.best_estimator_
xt.fit(train, enc_training_labels)

# Check the performance of the tuned and trained model on the testing set
print("Testing score for extra random trees is {:.4f}".format(xt.score(test, enc_testing_labels)))

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=4, min_weight_fraction_leaf=0.0,
           n_estimators=60, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
0.984902067465




Testing score for extra random trees is 0.9365
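
The cross-validated score (about 0.985) is well above the test score (0.9365). One plausible contributor, offered as an assumption to check rather than a diagnosis: the shuffled 3-fold split puts windows from the same volunteer on both sides of every fold, whereas the test file may hold different volunteers (the `subject` column set aside earlier would confirm this). A sketch of subject-aware cross-validation with `GroupKFold`, on synthetic data with hypothetical subjects:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.RandomState(0)

# Synthetic stand-in: 6 hypothetical subjects, 40 windows each, 10 features.
n_subjects, per_subject, n_features = 6, 40, 10
groups_demo = np.repeat(np.arange(n_subjects), per_subject)
X_demo = rng.normal(size=(n_subjects * per_subject, n_features))
y_demo = rng.randint(0, 3, size=n_subjects * per_subject)

cv = GroupKFold(n_splits=3)

# In every split, a subject's windows are all in the fitting part or all held out.
for fit_idx, held_out_idx in cv.split(X_demo, y_demo, groups=groups_demo):
    assert set(groups_demo[fit_idx]).isdisjoint(groups_demo[held_out_idx])

clf_demo = ExtraTreesClassifier(n_estimators=60, min_samples_split=4, random_state=0)
group_scores = cross_val_score(clf_demo, X_demo, y_demo, groups=groups_demo, cv=cv)
print("Subject-wise CV scores: {}".format(group_scores))
```

In the notebook, the equivalent call would be `cross_val_score(xt, train, enc_training_labels, groups=subject_training_data, cv=GroupKFold(n_splits=3))`. If the gap to the test score narrows under that scheme, subject leakage is a likely explanation.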


In [11]:
from keras.utils.np_utils import to_categorical

# Now let's experiment with a neural network to classify this data and see if we can improve our accuracy even further
# First we need to encode our targets as one-hot label vectors
oh_training_labels = to_categorical(enc_training_labels)
oh_testing_labels = to_categorical(enc_testing_labels)

Using Theano backend.
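
For readers who have not met one-hot encoding before, here is what the conversion amounts to, sketched in plain NumPy rather than through Keras. The integer codes are made-up examples, and this mirrors the idea behind `to_categorical` rather than its exact implementation.

```python
import numpy as np

# Hypothetical integer codes for four windows (six activity classes in total).
enc_labels_demo = np.array([3, 0, 5, 1])
n_classes = 6

# Row i of the identity matrix is the one-hot vector for class i.
one_hot_demo = np.eye(n_classes)[enc_labels_demo]
print(one_hot_demo)

# argmax along each row recovers the original integer codes.
recovered = one_hot_demo.argmax(axis=1)
print(recovered)
```

The softmax output layer used below produces one probability per class, which is why the targets need this shape.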


In [19]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD
from keras.regularizers import l2

# Build a network for this classification task
model = Sequential()
model.add(Dense(90, input_dim = train.shape[1], activation = 'tanh'))
model.add(Dropout(0.5))
model.add(Dense(30, activation = 'tanh'))
model.add(Dropout(0.3))
model.add(Dense(12, activation = 'tanh'))
model.add(Dense(6, activation = 'softmax'))

sgd = SGD(lr = .12, momentum = .1, decay = 1e-4)
model.compile(optimizer = sgd, loss = 'categorical_crossentropy', metrics = ['accuracy'])

model.fit(train.values, oh_training_labels, nb_epoch = 200, batch_size = 35, verbose = 2,
          validation_split = .15, shuffle=True)

Train on 6249 samples, validate on 1103 samples
Epoch 1/200
0s - loss: 0.8396 - acc: 0.6179 - val_loss: 0.9833 - val_acc: 0.5675
Epoch 2/200
0s - loss: 0.5055 - acc: 0.7622 - val_loss: 0.3805 - val_acc: 0.7616
Epoch 3/200
0s - loss: 0.4176 - acc: 0.7969 - val_loss: 0.3023 - val_acc: 0.8241
Epoch 4/200
0s - loss: 0.3663 - acc: 0.8205 - val_loss: 0.3405 - val_acc: 0.8676
Epoch 5/200
0s - loss: 0.3334 - acc: 0.8361 - val_loss: 0.2345 - val_acc: 0.9121
Epoch 6/200
0s - loss: 0.3069 - acc: 0.8648 - val_loss: 0.2241 - val_acc: 0.9148
Epoch 7/200
0s - loss: 0.2933 - acc: 0.8697 - val_loss: 0.1696 - val_acc: 0.9347
Epoch 8/200
0s - loss: 0.2825 - acc: 0.8774 - val_loss: 0.2015 - val_acc: 0.9266
Epoch 9/200
0s - loss: 0.2514 - acc: 0.8944 - val_loss: 0.2551 - val_acc: 0.8749
Epoch 10/200
0s - loss: 0.2571 - acc: 0.8931 - val_loss: 0.2261 - val_acc: 0.8966
Epoch 11/200
0s - loss: 0.2480 - acc: 0.8987 - val_loss: 0.2137 - val_acc: 0.9148
Epoch 12/200
0s - loss: 0.2150 - acc: 0.9147 - val_loss: 0.

<keras.callbacks.History at 0x1708c438>
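
The `decay = 1e-4` argument is easy to overlook, so here is a rough calculation of what it does to the learning rate over this run. It assumes the time-based rule of the legacy Keras `SGD` optimizer, `lr / (1 + decay * updates)`, with one update per batch; treat both the rule and the function name `lr_after` as assumptions of this sketch.

```python
import math

initial_lr = 0.12
decay = 1e-4
n_train = 6249      # "Train on 6249 samples" in the log above
batch_size = 35

# One parameter update per batch; the last batch of an epoch may be smaller.
updates_per_epoch = int(math.ceil(n_train / float(batch_size)))

def lr_after(updates):
    """Effective learning rate under the assumed rule lr / (1 + decay * updates)."""
    return initial_lr / (1.0 + decay * updates)

for epoch in (0, 1, 10, 100, 200):
    print("after epoch {:3d}: lr ~ {:.4f}".format(epoch, lr_after(epoch * updates_per_epoch)))
```

Under this assumption the step size shrinks to roughly a fifth of its starting value by the end of the 200 epochs.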

In [20]:
# Add regularization?

nn_test_score = model.evaluate(test.values, oh_testing_labels, verbose=2)
print("Our Neural network achieves a score of {} on the test set".format(nn_test_score[1]))

# Feature Selection and redo neural net?

Our Neural network achieves a score of 0.960637936885 on the test set
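
A single accuracy figure hides which activities get confused with each other. As a possible next step, the sketch below turns softmax outputs into class predictions and builds a confusion matrix. The probabilities are invented for illustration (they are not output from the model above), and the class order follows the sorted order used by `LabelEncoder`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

class_names = ["LAYING", "SITTING", "STANDING", "WALKING",
               "WALKING_DOWNSTAIRS", "WALKING_UPSTAIRS"]

# Hypothetical softmax outputs for five windows (each row sums to 1).
probs_demo = np.array([
    [0.90, 0.02, 0.02, 0.02, 0.02, 0.02],
    [0.05, 0.70, 0.20, 0.02, 0.02, 0.01],
    [0.05, 0.55, 0.35, 0.02, 0.02, 0.01],
    [0.01, 0.01, 0.03, 0.80, 0.05, 0.10],
    [0.02, 0.60, 0.30, 0.03, 0.03, 0.02],
])
true_codes = np.array([0, 1, 2, 3, 1])

# The predicted class is the column with the highest probability.
pred_codes = probs_demo.argmax(axis=1)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(true_codes, pred_codes, labels=np.arange(len(class_names)))
accuracy_demo = np.trace(cm) / float(cm.sum())

print(cm)
print("Accuracy: {:.2f}".format(accuracy_demo))
```

In the notebook, `model.predict(test.values)` would supply the probabilities and `enc_testing_labels` the true codes.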
