# Exploration of raw data

Gathering the raw data, reading it into Python, and exploring it.

## Loading and processing data

This notebook is intended to be run from the root directory of the project, so we change into it first.

In [1]:
%cd ..

/project


In [2]:
import os
from pathlib import Path
from pandas_profiling import ProfileReport
from src.data import load_har_data_from_repo, load_raw_data, load_feature_data

Make sure the data is downloaded.

In [3]:
load_har_data_from_repo('data/raw')

Data appears to be already obtained.
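
The body of `load_har_data_from_repo` is not shown in this notebook. As a rough illustration of the check-then-download pattern that the message above suggests, here is a minimal sketch. `ensure_data` and `fake_fetch` are hypothetical names introduced for this example, and the real loader may behave differently.

```python
from pathlib import Path
import tempfile


def ensure_data(target_dir, fetch):
    """Call fetch(target) only when target_dir is missing or empty.

    Hypothetical helper, not the project's loader.
    Returns True if fetch was called, False if data was already present.
    """
    target = Path(target_dir)
    if target.is_dir() and any(target.iterdir()):
        print("Data appears to be already obtained.")
        return False
    target.mkdir(parents=True, exist_ok=True)
    fetch(target)
    return True


def fake_fetch(target):
    # Stand-in for a real download: writes a single placeholder file.
    (target / "placeholder.txt").write_text("demo")


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "data" / "raw"
    first = ensure_data(raw_dir, fake_fetch)   # directory is missing, so fetch runs
    second = ensure_data(raw_dir, fake_fetch)  # directory is populated, so fetch is skipped
```

Making the step idempotent means the cell can be re-run safely without downloading the data again.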


### Loading the raw data

The raw recorded data contains 9 signals, stored as 2.56 s windows (128 points in each row). We load the data for either `train` or `test`.

In [4]:
raw_df = load_raw_data('test')
raw_df.shape

(188608, 12)

In [5]:
raw_df.head()

Unnamed: 0,subject_id,time_exp,body_acc_x,body_acc_y,body_acc_z,body_gyro_x,body_gyro_y,body_gyro_z,total_acc_x,total_acc_y,total_acc_z,activity_id
0,2,0.0,0.011653,-0.029399,0.106826,0.437464,0.531349,0.136528,1.041216,-0.269796,0.02378,5
1,2,0.02,0.013109,-0.039729,0.152455,0.468264,0.721069,0.097622,1.041803,-0.280025,0.076293,5
2,2,0.04,0.011269,-0.052406,0.216846,0.498257,0.520328,0.083556,1.039086,-0.292663,0.147475,5
3,2,0.06,0.027831,-0.052106,0.202581,0.479396,0.372625,0.022861,1.054768,-0.292384,0.139906,5
4,2,0.08,0.002318,-0.04547,0.17601,0.389894,0.414541,-0.025939,1.028376,-0.285826,0.119934,5
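
The window parameters quoted above imply a sample spacing of 2.56 s / 128 = 0.02 s (50 Hz), which matches the `time_exp` steps in the table. The sketch below checks that arithmetic. The row counts are the ones reported in this notebook for the `test` split; how the loader arrives at them is an assumption, since its code is not shown here.

```python
import math

N_POINTS = 128         # samples per window, from the text above
WINDOW_SECONDS = 2.56  # window duration, from the text above

dt = WINDOW_SECONDS / N_POINTS    # spacing between consecutive samples, in seconds
sampling_rate_hz = round(1 / dt)  # samples per second
matches_table = math.isclose(dt, 0.02)  # compare with the time_exp step in the table

# Time axis of a single window, rounded to match the table's precision.
time_axis = [round(i * dt, 2) for i in range(N_POINTS)]

# Shapes reported in this notebook: 188608 raw rows, 2947 feature rows (windows).
rows_per_window, remainder = divmod(188608, 2947)
# 64 rows per window is half of 128. This would be consistent with overlapping
# windows having been de-duplicated on load, but that is an assumption.
print(dt, sampling_rate_hz, rows_per_window, remainder)
```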


### Loading the feature data

The dataset also provides a set of standard signal-processing features, 561 in total. Some feature names are duplicated, so we append the column index to those names to keep them unique. The loaded frame has 563 columns: the 561 features plus `subject_id` and `activity_id`.

In [6]:
features_df = load_feature_data('test')
features_df.shape

(2947, 563)

In [7]:
features_df.head()

Unnamed: 0,subject_id,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,...,fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",activity_id
0,2,0.257178,-0.023285,-0.014654,-0.938404,-0.920091,-0.667683,-0.952501,-0.925249,-0.674302,...,-0.33037,-0.705974,0.006462,0.16292,-0.825886,0.271151,-0.720009,0.276801,-0.057978,5
1,2,0.286027,-0.013163,-0.119083,-0.975415,-0.967458,-0.944958,-0.986799,-0.968401,-0.945823,...,-0.121845,-0.594944,-0.083495,0.0175,-0.434375,0.920593,-0.698091,0.281343,-0.083898,5
2,2,0.275485,-0.02605,-0.118152,-0.993819,-0.969926,-0.962748,-0.994403,-0.970735,-0.963483,...,-0.190422,-0.640736,-0.034956,0.202302,0.064103,0.145068,-0.702771,0.280083,-0.079346,5
3,2,0.270298,-0.032614,-0.11752,-0.994743,-0.973268,-0.967091,-0.995274,-0.974471,-0.968897,...,-0.344418,-0.736124,-0.017067,0.154438,0.340134,0.296407,-0.698954,0.284114,-0.077108,5
4,2,0.274833,-0.027848,-0.129527,-0.993852,-0.967445,-0.978295,-0.994111,-0.965953,-0.977346,...,-0.534685,-0.846595,-0.002223,-0.040046,0.736715,-0.118545,-0.692245,0.290722,-0.073857,5
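
The renaming of duplicate fields mentioned above can be sketched as follows. This is a minimal illustration rather than the code in `src.data`: the helper name `dedupe_names`, the `_` separator, and the zero-based index are all assumptions, and the example names are illustrative.

```python
from collections import Counter


def dedupe_names(names):
    """Append the column index to every name that occurs more than once.

    Hypothetical helper: names that are already unique are left unchanged.
    """
    counts = Counter(names)
    return [
        f"{name}_{i}" if counts[name] > 1 else name
        for i, name in enumerate(names)
    ]


example = [
    "fBodyAcc-bandsEnergy()-1,8",
    "tBodyAcc-mean()-X",
    "fBodyAcc-bandsEnergy()-1,8",
]
unique = dedupe_names(example)
print(unique)
```

Using the column index as the suffix keeps the new names traceable to their position in the original feature list.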


## Standard EDA report via `pandas-profiling`

We compile an EDA report for both the raw and the feature data.

In [8]:
profile_raw = ProfileReport(raw_df, title="Pandas Profiling Report: raw signal data")

In [9]:
os.makedirs('reports', exist_ok=True)
profile_raw.to_file("reports/raw-profiling.html")


For the feature data, we compute only the minimal report, since there are many variables (563 columns).

In [10]:
profile = ProfileReport(features_df, title="Pandas Profiling Report: feature signal data", minimal=True)

In [11]:
os.makedirs('reports', exist_ok=True)
profile.to_file("reports/features-profiling.html")
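
The choice between the full and the minimal report could also be made explicit from the table width. The sketch below is one way to do that. `report_options` and its `max_full_columns` threshold are hypothetical and not part of `pandas-profiling`; only the `minimal` keyword comes from the cells above.

```python
def report_options(n_columns, max_full_columns=50):
    """Return keyword arguments for ProfileReport based on table width.

    max_full_columns is a hypothetical threshold, not a library default.
    """
    return {"minimal": n_columns > max_full_columns}


raw_opts = report_options(12)       # raw signal table: 12 columns
feature_opts = report_options(563)  # feature table: 563 columns
print(raw_opts, feature_opts)

# Intended use (not run here):
# ProfileReport(features_df, title="...", **feature_opts)
```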
