Human Activity Recognition from Smartphone Accelerometer Data
This project demonstrates how to predict the type of physical activity (e.g., walking, climbing stairs) from tri-axial smartphone accelerometer data using supervised machine learning. Smartphone accelerometers are very precise, and different physical activities give rise to unique patterns of acceleration.
The input data used for training in this project consists of two files.
timestamp, UTC time, accuracy, x, y, z
The time series signals are sampled at 10 Hz (0.1 seconds per sample) and contains total 3744 samples and 3 components.
timestampcolumn is the time variable. The last three columns labeled
zcorrespond to measurements of linear acceleration along each of the three orthogonal axes.
The second file, train_labels.csv, contains the activity labels. Different activities have been encoded with integers as follows:
1 = standing,
2 = walking,
3 = stairs down,
4 = stairs up.
The activity labels are sampled at 1 Hz (1 second per sample) and contains 375 samples. Because the accelerometers are sampled at high frequency, the labels in train_labels.csv are only provided for every 10th observation in train_time_series.csv.
Here, the goal is to classify the four physical activities from smartphone accelerometer signals as accurately as possible. My approach to build a machine learning classifier is as follows:
training_labels.csvcontain labels for every 10 observations in
training_time_series.csv. This implies each labeled signal is sampled from 10 signals.
- I combined 3 axes components in training and test time series into a 4th component of combined magnitude by taking a square root of summation of their squares: sqrt(x^2+y^2+z^2)
- From 1 & 2 above, I transformed the
training_time_seriesdataframe of shape (3744, 3) into a numpy array of shape (375, 10, 4)
- From this array, I extracted features for each of the 4 components using frequency transformations e.g. Fast fourier transformation values, Power spectral density values, Autocorrelation
- I split this array of all features into training and validation array in 80:20 ratio. The training set (300 observations) is used to train the classifier. I randomized the training set to avoid classifier getting biased to a particular pattern in time-series
- To address the activity class imbalance, I used oversampling with SMOTE on the training set
- The validation set (75 observations) is used to get a sense of validation accuracy as an indicator of test accuracy