# Final Project Template 

## 1) Get your data
You may use any data set(s) you like, so long as they meet these criteria:

* Your data must be publicly available for free.
* Your data should be interesting to _you_. You want your final project to be something you're proud of.
* Your data should be "big enough":
    - It should have at least 1,000 rows.
    - It should have enough columns to be interesting.
    - If you have questions, contact a member of the instructional team.

## 2) Provide a link to your data
Your data is required to be free and open to anyone.
As such, you should have a URL which anyone can use to download your data:

https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones

## 3) Import your data
In the space below, import your data.
If your data span multiple files, read them all in.
If applicable, merge or append them as needed.

In [1]:
import pandas as pd
import numpy as np

Function to load file as a numpy array:

In [2]:
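# load a single whitespace-delimited txt file as a numpy array
def load_file(filepath):
    # sep=r'\s+' splits on runs of whitespace; it replaces delim_whitespace=True,
    # which is deprecated in recent pandas releases
    df = pd.read_csv(filepath, header=None, sep=r'\s+')
    return df.values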
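As a quick sanity check on the loading approach, here is a minimal sketch that parses a small in-memory string instead of a real file. The two rows are made-up values; they only imitate a layout with leading spaces and uneven gaps between numbers, which is an assumption about how the text files are formatted.

```python
import io

import pandas as pd

# Two illustrative rows (made-up values) with leading spaces and a
# variable number of spaces between the numbers.
sample_text = "  0.2886  -0.0203\n  0.2784   -0.0164\n"

# Same call pattern as load_file, but reading from a string buffer.
sample = pd.read_csv(io.StringIO(sample_text), header=None, sep=r"\s+").values
print(sample.shape)
```

If the parse works as intended, the result is a 2 x 2 float array with no extra empty column from the leading spaces.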

Function to load a group of variables into a 3D numpy array:

In [3]:
# load a group of files, such as x y and z data for a given variable
def load_group(filenames, directory=''):
    loaded = []
    for name in filenames:
        data = load_file(directory + name)
        loaded.append(data)
    # stack group so that features are in the 3rd dimension
    loaded = np.dstack(loaded)
    return loaded
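To see what the stacking step does, here is a small sketch with made-up arrays. The sizes and fill values are placeholders, not the real signals: each 2D array of shape (windows, time steps) becomes one slice along a new third axis.

```python
import numpy as np

n_windows, n_steps = 4, 128

# Three hypothetical per-axis arrays, each of shape (windows, time steps)
axis_arrays = [np.full((n_windows, n_steps), fill) for fill in (0.0, 1.0, 2.0)]

# np.dstack places each 2D array along a new third (feature) dimension
stacked = np.dstack(axis_arrays)
print(stacked.shape)  # (4, 128, 3)

# The k-th input array is the k-th slice along the last axis
print(np.array_equal(stacked[:, :, 1], axis_arrays[1]))  # True
```

With the nine signal files per group, the same call yields an array of shape (windows, 128, 9).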

Function to load all inertial signals in a group, train or test, as a 3D numpy array:

In [4]:
# load a dataset group, such as train or test
def load_dataset(group, main_data_dir=''):
    filepath = main_data_dir + group + '/Inertial Signals/'
    # get all 9 files in Inertial Signals into a single filename list
    filenames = []
    # total acceleration
    filenames += ['total_acc_x_'+group+'.txt', 'total_acc_y_'+group+'.txt', 'total_acc_z_'+group+'.txt']
    # body acceleration
    filenames += ['body_acc_x_'+group+'.txt', 'body_acc_y_'+group+'.txt', 'body_acc_z_'+group+'.txt']
    # body gyroscope
    filenames += ['body_gyro_x_'+group+'.txt', 'body_gyro_y_'+group+'.txt', 'body_gyro_z_'+group+'.txt']
    # load input data
    X = load_group(filenames, filepath)
    # load class output 
    y = load_file(main_data_dir + group + '/y_'+group+'.txt')
    return X, y

Training data raw inertial signals

In [5]:
# load raw training data
raw_train_X, raw_train_y = load_dataset('train', 'UCI HAR Dataset/')
train_sub_map = load_file("UCI HAR Dataset/train/subject_train.txt")

Test data raw inertial signals

In [6]:
# load raw test data
raw_test_X, raw_test_y = load_dataset('test', 'UCI HAR Dataset/')
test_sub_map = load_file("UCI HAR Dataset/test/subject_test.txt")

Feature engineered training data

In [7]:
# load engineered training data
eng_train_X = load_file("UCI HAR Dataset/train/X_train.txt")
eng_train_y = load_file("UCI HAR Dataset/train/y_train.txt")

Feature engineered test data

In [8]:
# load engineered test data
eng_test_X = load_file("UCI HAR Dataset/test/X_test.txt")
eng_test_y = load_file("UCI HAR Dataset/test/y_test.txt")

## 4) Show me the head of your data.

I only show the .head() of the engineered data, because the raw data is a 3D numpy array, which a DataFrame cannot hold directly. Below, I convert the train and test X and y data into DataFrames and display the head of each.

In [9]:
train_X = pd.DataFrame(eng_train_X)
train_y = pd.DataFrame(eng_train_y)
test_X = pd.DataFrame(eng_test_X)
test_y = pd.DataFrame(eng_test_y)

In [10]:
train_X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,551,552,553,554,555,556,557,558,559,560
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.074323,-0.298676,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,0.158075,-0.595051,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,0.414503,-0.390748,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,0.404573,-0.11729,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,0.087753,-0.351471,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892


In [11]:
train_y.head()

Unnamed: 0,0
0,5
1,5
2,5
3,5
4,5


In [12]:
test_X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,551,552,553,554,555,556,557,558,559,560
0,0.257178,-0.023285,-0.014654,-0.938404,-0.920091,-0.667683,-0.952501,-0.925249,-0.674302,-0.894088,...,0.071645,-0.33037,-0.705974,0.006462,0.16292,-0.825886,0.271151,-0.720009,0.276801,-0.057978
1,0.286027,-0.013163,-0.119083,-0.975415,-0.967458,-0.944958,-0.986799,-0.968401,-0.945823,-0.894088,...,-0.401189,-0.121845,-0.594944,-0.083495,0.0175,-0.434375,0.920593,-0.698091,0.281343,-0.083898
2,0.275485,-0.02605,-0.118152,-0.993819,-0.969926,-0.962748,-0.994403,-0.970735,-0.963483,-0.93926,...,0.062891,-0.190422,-0.640736,-0.034956,0.202302,0.064103,0.145068,-0.702771,0.280083,-0.079346
3,0.270298,-0.032614,-0.11752,-0.994743,-0.973268,-0.967091,-0.995274,-0.974471,-0.968897,-0.93861,...,0.116695,-0.344418,-0.736124,-0.017067,0.154438,0.340134,0.296407,-0.698954,0.284114,-0.077108
4,0.274833,-0.027848,-0.129527,-0.993852,-0.967445,-0.978295,-0.994111,-0.965953,-0.977346,-0.93861,...,-0.121711,-0.534685,-0.846595,-0.002223,-0.040046,0.736715,-0.118545,-0.692245,0.290722,-0.073857


In [13]:
test_y.head()

Unnamed: 0,0
0,5
1,5
2,5
3,5
4,5


## 5) Show me the shape of your data

In [14]:
print("Raw train X:", raw_train_X.shape)
print("Raw train y:", raw_train_y.shape)
print("Engineered train X:", eng_train_X.shape)
print("Engineered train y:", eng_train_y.shape)
print("Train subject mapping list:", train_sub_map.shape)
print("Raw test X:", raw_test_X.shape)
print("Raw test y:", raw_test_y.shape)
print("Engineered test X:", eng_test_X.shape)
print("Engineered test y:", eng_test_y.shape)
print("Test subject mapping list:", test_sub_map.shape)

Raw train X: (7352, 128, 9)
Raw train y: (7352, 1)
Engineered train X: (7352, 561)
Engineered train y: (7352, 1)
Train subject mapping list: (7352, 1)
Raw test X: (2947, 128, 9)
Raw test y: (2947, 1)
Engineered test X: (2947, 561)
Engineered test y: (2947, 1)
Test subject mapping list: (2947, 1)
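The shapes above can be tied back to the windowing with a little arithmetic. This sketch assumes the 50 Hz sampling rate documented for this dataset; the row counts are copied from the printed shapes.

```python
sampling_rate_hz = 50       # assumed from the dataset description
samples_per_window = 128    # second dimension of the raw X arrays

# Duration of one window in seconds
window_seconds = samples_per_window / sampling_rate_hz
print(window_seconds)  # 2.56

# Row counts taken from the shapes printed above
n_train, n_test = 7352, 2947
total_windows = n_train + n_test
train_share = n_train / total_windows
print(total_windows)          # 10299
print(round(train_share, 3))  # 0.714
```

So each row of the raw data is one 2.56-second window, and about 71% of the windows fall in the provided training split.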


## 6) Show me the proportion of missing observations for each column of your data

The data has no missing observations.

In [15]:
percent_missing1 = train_X.isnull().sum() * 100 / len(train_X)
missing_value_df1 = pd.DataFrame({'column_name': train_X.columns,
                                 'percent_missing': percent_missing1})
missing_value_df1.head()

Unnamed: 0,column_name,percent_missing
0,0,0.0
1,1,0.0
2,2,0.0
3,3,0.0
4,4,0.0


In [16]:
percent_missing2 = train_y.isnull().sum() * 100 / len(train_y)
missing_value_df2 = pd.DataFrame({'column_name': train_y.columns,
                                 'percent_missing': percent_missing2})
missing_value_df2.head()

Unnamed: 0,column_name,percent_missing
0,0,0.0


In [17]:
percent_missing3 = test_X.isnull().sum() * 100 / len(test_X)
missing_value_df3 = pd.DataFrame({'column_name': test_X.columns,
                                 'percent_missing': percent_missing3})
missing_value_df3.head()

Unnamed: 0,column_name,percent_missing
0,0,0.0
1,1,0.0
2,2,0.0
3,3,0.0
4,4,0.0


In [18]:
percent_missing4 = test_y.isnull().sum() * 100 / len(test_y)
missing_value_df4 = pd.DataFrame({'column_name': test_y.columns,
                                 'percent_missing': percent_missing4})
missing_value_df4.head()

Unnamed: 0,column_name,percent_missing
0,0,0.0
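Because `.head()` only displays the first five columns, a compact summary over every column may be a useful complement. The helper below, `missing_summary`, is a hypothetical name introduced here for illustration; it is demonstrated on two tiny made-up frames rather than the real data.

```python
import numpy as np
import pandas as pd


def missing_summary(frames):
    """For each named DataFrame, report the worst per-column missing
    percentage and the total number of missing cells."""
    rows = []
    for name, df in frames.items():
        percent_missing = df.isnull().mean() * 100  # per-column percentage
        rows.append({
            "frame": name,
            "max_percent_missing": percent_missing.max(),
            "n_missing_cells": int(df.isnull().sum().sum()),
        })
    return pd.DataFrame(rows)


# Made-up demo frames: one complete, one with a single gap
demo_frames = {
    "complete": pd.DataFrame(np.zeros((4, 3))),
    "with_gap": pd.DataFrame([[1.0, np.nan], [2.0, 3.0]]),
}
summary = missing_summary(demo_frames)
print(summary)
```

In this notebook the call would be `missing_summary({"train_X": train_X, "train_y": train_y, "test_X": test_X, "test_y": test_y})`; a `max_percent_missing` of 0 in every row would confirm that no column has gaps.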


## 7) Give me a problem statement.
Below, write a problem statement. Keep in mind that your task is to tease out relationships in your data and eventually build a predictive model. Your problem statement can be vague, but you should have a goal in mind. Your problem statement should be between one sentence and one paragraph.

The goal of this project is to train a model that predicts which activity someone is performing, given a 2.56 s window of smartphone sensor data sampled at 50 Hz. Each window holds 9 inertial signals: total acceleration and body acceleration from the accelerometer, and angular velocity from the gyroscope, each along the x, y, and z axes. I hope to classify each window as 1 of 6 activities: walking, walking upstairs, walking downstairs, sitting, standing, or laying (lying down). I will train models on features that I extract from the raw signals myself, and compare them to models trained on the engineered features that the authors provided. The authors have already split the data into train and test sets, but I will combine those sets and cross-validate on the combined data.
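One possible starting point for the self-extracted features is sketched below. It assumes simple per-channel means and standard deviations, which are only an illustrative choice and not a settled feature set. The helper name `summarize_windows` is hypothetical, and random arrays stand in for the raw train and test windows.

```python
import numpy as np


def summarize_windows(windows):
    """Collapse (n_windows, n_steps, n_channels) into
    (n_windows, 2 * n_channels): per-channel mean and std over time."""
    means = windows.mean(axis=1)
    stds = windows.std(axis=1)
    return np.hstack([means, stds])


# Random stand-ins for the raw train and test arrays (not real data)
rng = np.random.default_rng(0)
fake_train = rng.normal(size=(6, 128, 9))
fake_test = rng.normal(size=(4, 128, 9))

# Combine the provided splits along the window axis before cross-validating
combined = np.concatenate([fake_train, fake_test], axis=0)
features = summarize_windows(combined)
print(features.shape)  # (10, 18)
```

Applied to `raw_train_X` and `raw_test_X`, the same steps would give a (10299, 18) feature matrix, one row per window.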

## 8) What is your _y_-variable?
For your final project, you will need to build a statistical model. This means you will have to accurately predict some y-variable from some combination of x-variables. From your problem statement in part 7, what is that y-variable?

The y-variable is the activity label for each window of sensor data. Each window consists of movement data recorded while a subject was performing exactly one activity (walking, walking upstairs, walking downstairs, sitting, standing, or laying down).
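The labels are stored as integer codes, so a mapping to activity names can make class counts easier to read. The mapping below follows the `activity_labels.txt` file bundled with the dataset as I understand it; treat it as an assumption and verify it against that file. The code series is made up for the demo.

```python
import pandas as pd

# Assumed code-to-name mapping; check against activity_labels.txt
ACTIVITY_NAMES = {
    1: "WALKING",
    2: "WALKING_UPSTAIRS",
    3: "WALKING_DOWNSTAIRS",
    4: "SITTING",
    5: "STANDING",
    6: "LAYING",
}

# Made-up label codes standing in for the real y column
codes = pd.Series([5, 5, 4, 1, 6, 2, 3, 5])
names = codes.map(ACTIVITY_NAMES)

# Class counts by readable name
print(names.value_counts())
```

In this notebook, `train_y[0].map(ACTIVITY_NAMES).value_counts()` would show how balanced the six classes are in the training split.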

## 9) Sources

* **Dataset:**
Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013.
https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones
* **Data exploration:**
Brownlee, Jason. “How to model human activity from smartphone data.” Machine Learning Mastery, 17 September 2018, https://machinelearningmastery.com/how-to-model-human-activity-from-smartphone-data/.
* **Ideas on how to select features of the data:** 
Martin, et al. “Methods for Real-Time Prediction of the Mode of Travel Using Smartphone-Based GPS and Accelerometer Data.” MDPI, Multidisciplinary Digital Publishing Institute, 8 Sept. 2017, www.mdpi.com/1424-8220/17/9/2058/htm.
* **Class notes and sklearn documentation**