# Data Wrangling

## Introduction


## Data Landscape

The provided data (CSV) has the following features:
+ timestamp (milliseconds, with microsecond precision);
+ channel 0 raw;
+ channel 1 raw;
+ channel 0 high-passed;
+ channel 1 high-passed (all channel values in ADC counts);
+ quaternion x, y, z, and w. The quaternion coordinates follow the convention:
    + the X-axis points West;
    + the Y-axis points South;
    + the Z-axis points Up;
+ gyroscope x, y, and z, in degrees per second;
+ accelerometer x, y, and z, in meters per second squared. The accelerometer and gyroscope coordinates follow the ENU convention:
    + the X-axis points East;
    + the Y-axis points North;
    + the Z-axis points Up;
+ body movement label, coded as 0 = standing #1, 1 = standing #2, 2 = walking, 3 = walking fast, 4 = running;
+ repetition number (there is one "repetition" per prompting window).
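
The body movement coding listed above can be turned into readable labels with a simple lookup. The following is a minimal sketch and not part of the original notebook: the `MOVEMENT_LABELS` dictionary, the `body_movement_label` column name, and the toy DataFrame are illustrative assumptions; only the code-to-label pairs come from the description above.

```python
import pandas as pd

# Code-to-label pairs taken from the feature description.
MOVEMENT_LABELS = {
    0: "standing #1",
    1: "standing #2",
    2: "walking",
    3: "walking fast",
    4: "running",
}

# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({"body_movement": [0, 2, 4, 3, 1]})

# Map integer codes to labels; a code missing from the dictionary would become NaN.
toy["body_movement_label"] = toy["body_movement"].map(MOVEMENT_LABELS)
print(toy)
```

Keeping the integer column and adding the label as a separate column leaves the original coding intact for any later modeling step.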



In [1]:
# Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Import raw data 

filepath = './raw/pison_data_interview.csv'
col_names = ['timestamp', 'channel_0', 'channel_1', 'channel_0_hp', 'channel_1_hp', 'q_x', 'q_y', 'q_z', 'q_w', 'g_x', 'g_y', 'g_z', 'a_x', 'a_y', 'a_z', 'body_movement', 'repetition']

df = pd.read_csv(filepath, names=col_names)

df.head()

| | timestamp | channel_0 | channel_1 | channel_0_hp | channel_1_hp | q_x | q_y | q_z | q_w | g_x | g_y | g_z | a_x | a_y | a_z | body_movement | repetition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1514824.503 | 12535249 | 12566283 | -11889 | 17295 | 0.32196 | -0.596619 | -0.621826 | 0.39209 | -1.34287 | 1.063105 | 0.503576 | -0.660156 | -10.003906 | 1.21875 | 0 | 1 |
| 1 | 1514827.496 | 12536264 | 12559246 | -2063 | 13384 | 0.32196 | -0.596741 | -0.621826 | 0.392029 | -1.510729 | 1.175011 | 0.0 | -0.660156 | -10.003906 | 1.21875 | 0 | 1 |
| 2 | 1514830.493 | 12538584 | 12565279 | 2757 | 16008 | 0.32196 | -0.596741 | -0.621765 | 0.391968 | -1.230964 | 1.175011 | -0.279765 | -0.660156 | -10.003906 | 1.21875 | 0 | 1 |
| 3 | 1514833.5 | 12546745 | 12567024 | 7504 | 5644 | 0.32196 | -0.596741 | -0.621765 | 0.391968 | -1.230964 | 1.175011 | -0.279765 | -0.660156 | -10.003906 | 1.21875 | 0 | 1 |
| 4 | 1514836.498 | 12537375 | 12545467 | -3855 | -15893 | 0.32196 | -0.596802 | -0.621765 | 0.391907 | -1.063105 | 0.839294 | -0.727388 | -0.660156 | -10.003906 | 1.21875 | 0 | 1 |
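
The timestamps in the first rows are roughly 3 ms apart, which suggests a way to estimate the sampling rate. The snippet below is a rough sketch using only the five timestamps shown by `df.head()`; the variable names are illustrative, and an estimate on the full data would use the whole `timestamp` column instead, which may reveal gaps or jitter not visible here.

```python
import numpy as np

# First five timestamps from df.head(), in milliseconds.
timestamps_ms = np.array(
    [1514824.503, 1514827.496, 1514830.493, 1514833.5, 1514836.498]
)

# Differences between consecutive samples, in milliseconds.
dt_ms = np.diff(timestamps_ms)

# The median is less sensitive to an occasional dropped sample than the mean.
median_dt_ms = np.median(dt_ms)
rate_hz = 1000.0 / median_dt_ms

print(f"median interval: {median_dt_ms:.4f} ms, approximate rate: {rate_hz:.1f} Hz")
```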


In [3]:
# info is a method, so it must be called; a bare `df.info` only echoes the bound method's repr
df.info()

The DataFrame has 14,981 rows (index 0 to 14980) and 17 columns.

In [4]:
df.dtypes

timestamp        float64
channel_0          int64
channel_1          int64
channel_0_hp       int64
channel_1_hp       int64
q_x              float64
q_y              float64
q_z              float64
q_w              float64
g_x              float64
g_y              float64
g_z              float64
a_x              float64
a_y              float64
a_z              float64
body_movement      int64
repetition         int64
dtype: object

In [5]:
df.describe()

| | timestamp | channel_0 | channel_1 | channel_0_hp | channel_1_hp | q_x | q_y | q_z | q_w | g_x | g_y | g_z | a_x | a_y | a_z | body_movement | repetition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 | 14981.0 |
| mean | 1566353.0 | 12537650.0 | 12512630.0 | 62.550898 | -257.1735 | 0.263423 | 0.051271 | 0.111606 | 0.444427 | 6.40726 | 5.18396 | 2.237941 | 0.564972 | -10.547218 | 2.064593 | 2.003071 | 1.991456 |
| std | 30967.5 | 415846.5 | 562667.3 | 35397.290087 | 65238.63 | 0.320534 | 0.395624 | 0.588643 | 0.335159 | 119.806613 | 346.98696 | 110.392775 | 7.242565 | 6.138921 | 6.852529 | 1.414352 | 0.815279 |
| min | 1514825.0 | 10402010.0 | 10212680.0 | -422438.0 | -1053204.0 | -0.897766 | -0.777832 | -1.0 | 0.0 | -481.75455 | -1830.8911 | -623.76306 | -34.382812 | -48.960938 | -27.765625 | 0.0 | 1.0 |
| 25% | 1538782.0 | 12459080.0 | 12335640.0 | -9221.0 | -10419.0 | 0.032471 | -0.085632 | -0.304993 | 0.107849 | -36.425343 | -50.077854 | -21.709728 | -2.011719 | -13.261719 | -0.542969 | 1.0 | 1.0 |
| 50% | 1566517.0 | 12560780.0 | 12548050.0 | 411.0 | 398.0 | 0.230042 | 0.072815 | 0.036194 | 0.401611 | -0.671435 | -0.727388 | 1.678587 | -0.363281 | -9.816406 | 1.457031 | 2.0 | 2.0 |
| 75% | 1594422.0 | 12658770.0 | 12693200.0 | 9792.0 | 11275.0 | 0.584961 | 0.329468 | 0.712097 | 0.754333 | 47.727833 | 50.41357 | 39.278942 | 3.582031 | -7.398438 | 5.097656 | 3.0 | 3.0 |
| max | 1618179.0 | 14193520.0 | 15041100.0 | 640848.0 | 1726175.0 | 0.905945 | 0.738708 | 1.0 | 0.999268 | 427.31238 | 1828.3173 | 425.13022 | 37.671875 | 7.691406 | 33.898438 | 4.0 | 3.0 |
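
The summary statistics show every quaternion component bounded within [-1, 1], which is consistent with unit quaternions. One possible follow-up check, assuming the quaternions are meant to be unit-norm rotations (the source does not state this), is to compute the norm of each row. The sketch below uses the first two rows from `df.head()` as a stand-in for the full data; the `toy` frame, `q_cols`, `q_norm`, and the tolerance are illustrative choices.

```python
import numpy as np
import pandas as pd

# Quaternion components from the first two rows of df.head().
toy = pd.DataFrame(
    {
        "q_x": [0.32196, 0.32196],
        "q_y": [-0.596619, -0.596741],
        "q_z": [-0.621826, -0.621826],
        "q_w": [0.39209, 0.392029],
    }
)

q_cols = ["q_x", "q_y", "q_z", "q_w"]

# Euclidean norm of each quaternion (one value per row).
q_norm = np.linalg.norm(toy[q_cols].to_numpy(), axis=1)

# Flag rows whose norm deviates from 1 by more than an assumed tolerance.
tolerance = 1e-3
off_unit = np.abs(q_norm - 1.0) > tolerance

print(q_norm)
print("rows outside tolerance:", int(off_unit.sum()))
```

On the real DataFrame the same two lines would be applied to `df[q_cols]`; rows flagged by `off_unit` would deserve a closer look before any orientation analysis.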


In [6]:
# check for rows with NaNs
df_NaNs = df[df.isna().any(axis=1)]
df_NaNs.head()

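| | timestamp | channel_0 | channel_1 | channel_0_hp | channel_1_hp | q_x | q_y | q_z | q_w | g_x | g_y | g_z | a_x | a_y | a_z | body_movement | repetition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|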


There are no rows with missing values.
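
A per-column count is a compact alternative to inspecting the filtered rows, since it also shows which column any missing values would come from. This is a small sketch on a toy DataFrame with one deliberately missing value (the values and the `nan_counts` name are illustrative); on the real data the same call would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (hypothetical data).
toy = pd.DataFrame({"a_x": [0.1, np.nan, 0.3], "a_y": [1.0, 2.0, 3.0]})

# Number of missing values in each column.
nan_counts = toy.isna().sum()

# Total across all columns; zero means the frame has no missing values.
total_missing = int(nan_counts.sum())

print(nan_counts)
print("total missing:", total_missing)
```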

In [7]:
# check for duplicated rows 
df_dups = df[df.duplicated()]
df_dups.head()

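| | timestamp | channel_0 | channel_1 | channel_0_hp | channel_1_hp | q_x | q_y | q_z | q_w | g_x | g_y | g_z | a_x | a_y | a_z | body_movement | repetition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|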


There are no duplicated rows.

## Save Copy to Interim

In [8]:
# Save a copy to the interim folder for EDA use. Do not modify the raw data!
# The call is commented out to avoid saving multiple copies or overwriting the saved copy.

# to_csv returns None when given a path, so no assignment is needed
# df.to_csv('./interim/pison_data.csv')
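
An alternative to commenting the save out is to guard it so that an existing file is never overwritten. The following is a minimal sketch under stated assumptions: `save_if_absent` is a hypothetical helper, not part of the notebook, and it writes with `index=False`, which differs from the default behaviour of the commented-out call above (that call would also write the row index as an extra column). The demonstration writes a toy frame to a temporary directory rather than to `./interim`.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_if_absent(frame, path):
    """Write frame to path as CSV only if no file exists there yet.

    Returns True if the file was written, False if it was left untouched.
    """
    path = Path(path)
    if path.exists():
        return False
    # Create the target folder if needed, then write without the row index.
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return True


# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({"timestamp": [1.0, 2.0], "body_movement": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "interim" / "toy_copy.csv"
    first = save_if_absent(toy, target)   # file does not exist yet
    second = save_if_absent(toy, target)  # file exists, so nothing is written
    reloaded = pd.read_csv(target)

print(first, second, reloaded.shape)
```

With a guard like this the cell can be re-run safely, at the cost of having to delete the interim file by hand when a fresh copy is actually wanted.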