## Feature engineering, overlapping window

### Loading dataset

In [3]:
import pandas as pd
import numpy as np

folderpath = '../data'

grenade = pd.read_csv(f'{folderpath}/grenade.txt', header=None)
grenade.columns = ['Acc_X', 'Acc_Y', 'Acc_Z', 'Gyr_X', 'Gyr_Y', 'Gyr_Z']
grenade['label'] = 0

grenade.head()

   Acc_X  Acc_Y  Acc_Z  Gyr_X  Gyr_Y  Gyr_Z  label
0  -0.06   8.36   4.98  -0.04  -0.02   0.04      0
1   0.11   8.39   5.02  -0.00  -0.01   0.01      0
2  -0.01   8.51   4.86  -0.05  -0.00   0.00      0
3   0.01   8.29   4.97  -0.08  -0.02   0.00      0
4   0.04   8.33   5.06  -0.02  -0.03   0.01      0


## Performing feature engineering

We now compute the engineered features over a window of 50 frames.

In [4]:
"""
Features computed per window (11 in total):
- Mean along the z-axis (1 feature)
- STD along the x-, y- and z-axis (3)
- RMS along the x-, y- and z-axis (3)
- Correlation between the z-axis and the y-axis (1)
- MinMax along the x-, y- and z-axis (3)

Source: "A Study on Human Activity Recognition Using Accelerometer Data
from Smartphones", Akram Bayat, Marc Pomplun, Duc A. Tran
"""

def engineer_features(frame):
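    # takes in a window of 50 frames with six columns (Acc_XYZ, Gyr_XYZ) and
    # adds 11 feature columns to it in place; each feature is a single value
    # for the whole window, so it is repeated on every row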
    frame['Z_mean'] = frame['Acc_Z'].mean()
    
    frame['X_std'] = frame['Acc_X'].std()
    frame['Y_std'] = frame['Acc_Y'].std()
    frame['Z_std'] = frame['Acc_Z'].std()
    
    frame['X_rms'] = np.sqrt(np.mean(frame['Acc_X']**2))
    frame['Y_rms'] = np.sqrt(np.mean(frame['Acc_Y']**2))
    frame['Z_rms'] = np.sqrt(np.mean(frame['Acc_Z']**2))
    
    frame['ZY_corr'] = frame['Acc_Z'].corr(frame['Acc_Y'])
    
    frame['X_minmax'] = frame['Acc_X'].max() - frame['Acc_X'].min()
    frame['Y_minmax'] = frame['Acc_Y'].max() - frame['Acc_Y'].min()
    frame['Z_minmax'] = frame['Acc_Z'].max() - frame['Acc_Z'].min()
    return frame
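
Because every engineered feature is one number per window, the same 11 values can also be returned as a single row instead of being repeated on all 50 rows. The following is a minimal sketch under that assumption; `window_features` is a hypothetical name that does not appear elsewhere in this notebook, and the demo window is synthetic rather than taken from grenade.txt.

```python
import numpy as np
import pandas as pd

def window_features(frame: pd.DataFrame) -> pd.Series:
    """Return the 11 accelerometer features of one window as a single row.

    Hypothetical helper: computes the same quantities as engineer_features,
    but without repeating them on every row of the window.
    """
    feats = {'Z_mean': frame['Acc_Z'].mean()}
    for axis in 'XYZ':
        col = frame[f'Acc_{axis}']
        feats[f'{axis}_std'] = col.std()                   # sample std (ddof=1)
        feats[f'{axis}_rms'] = np.sqrt(np.mean(col ** 2))  # root mean square
        feats[f'{axis}_minmax'] = col.max() - col.min()    # range of the axis
    feats['ZY_corr'] = frame['Acc_Z'].corr(frame['Acc_Y'])
    return pd.Series(feats)

# Synthetic stand-in for one 50-frame window (not real sensor data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(50, 6)),
    columns=['Acc_X', 'Acc_Y', 'Acc_Z', 'Gyr_X', 'Gyr_Y', 'Gyr_Z'],
)
features = window_features(demo)
print(features.shape)  # (11,)
```

Returning a `pd.Series` keeps the feature names attached, which makes it straightforward to stack one row per window into a feature table later.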
    


In [5]:
# applying the function to the grenade dataset

# generate sliding windows of seq_length frames (stride 1) - rehashed from the hwai_notes notebook
def gen_sequence(df, seq_length, seq_cols):
    data_array = df[seq_cols].values
    num_elements = data_array.shape[0]
    # + 1 so that the final window, which ends on the last row, is included
    for start in range(num_elements - seq_length + 1):
        yield data_array[start:start + seq_length, :]
        
sequence_cols = ['Acc_X', 'Acc_Y', 'Acc_Z', 'Gyr_X', 'Gyr_Y', 'Gyr_Z']

seq_len = 50

# stack all windows into one array of shape (num_windows, seq_len, 6)
seq_gen = np.stack(list(gen_sequence(grenade, seq_len, sequence_cols))).astype(np.float32)

for seq in seq_gen:
    # wrap each (50, 6) window in a DataFrame so the features can use column names
    window = pd.DataFrame(seq, columns=sequence_cols)
    engineer_features(window)  # adds the feature columns in place
    print(window.head())


   Acc_X  Acc_Y  Acc_Z  Gyr_X  Gyr_Y  Gyr_Z  Z_mean     X_std     Y_std  \
0  -0.06   8.36   4.98  -0.04  -0.02   0.04  5.1096  0.107499  0.198143   
1   0.11   8.39   5.02  -0.00  -0.01   0.01  5.1096  0.107499  0.198143   
2  -0.01   8.51   4.86  -0.05  -0.00   0.00  5.1096  0.107499  0.198143   
3   0.01   8.29   4.97  -0.08  -0.02   0.00  5.1096  0.107499  0.198143   
4   0.04   8.33   5.06  -0.02  -0.03   0.01  5.1096  0.107499  0.198143   

      Z_std   X_rms     Y_rms     Z_rms   ZY_corr  X_minmax  Y_minmax  \
0  0.232598  0.1197  8.288521  5.114786 -0.875161      0.52      0.81   
1  0.232598  0.1197  8.288521  5.114786 -0.875161      0.52      0.81   
2  0.232598  0.1197  8.288521  5.114786 -0.875161      0.52      0.81   
3  0.232598  0.1197  8.288521  5.114786 -0.875161      0.52      0.81   
4  0.232598  0.1197  8.288521  5.114786 -0.875161      0.52      0.81   

   Z_minmax  
0      1.09  
1      1.09  
2      1.09  
3      1.09  
4      1.09  
   Acc_X  Acc_Y  Acc_Z  Gy

## Performance improvements and overhead

In the output above, the engineered features such as the standard deviation and RMS change only slightly between overlapping sequences. This is expected: with a stride of one frame, consecutive windows share 49 of their 50 frames.
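
The slow drift between overlapping windows can be seen directly by computing the features for every stride-1 window at once. The sketch below uses pandas rolling windows as an alternative to the explicit loop; `rolling_features` is a hypothetical helper, and the data is synthetic (values loosely centred on the readings shown earlier), not the grenade recording.

```python
import numpy as np
import pandas as pd

def rolling_features(df: pd.DataFrame, window: int = 50) -> pd.DataFrame:
    """Compute the 11 window features for every stride-1 window.

    Row i of the result describes the window that ends at sample i.
    The first window - 1 rows are dropped because their windows are incomplete.
    """
    out = pd.DataFrame(index=df.index)
    out['Z_mean'] = df['Acc_Z'].rolling(window).mean()
    for axis in 'XYZ':
        col = df[f'Acc_{axis}']
        roll = col.rolling(window)
        out[f'{axis}_std'] = roll.std()                                  # ddof=1, like Series.std
        out[f'{axis}_rms'] = np.sqrt((col ** 2).rolling(window).mean())  # root mean square
        out[f'{axis}_minmax'] = roll.max() - roll.min()
    out['ZY_corr'] = df['Acc_Z'].rolling(window).corr(df['Acc_Y'])
    return out.iloc[window - 1:]

# Synthetic stand-in for the accelerometer columns (not real sensor data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(loc=[0.0, 8.3, 5.0], scale=0.2, size=(200, 3)),
    columns=['Acc_X', 'Acc_Y', 'Acc_Z'],
)
rolled = rolling_features(demo)

# Largest change of each feature between two consecutive windows
print(rolled.diff().abs().max())
```

Each row of `rolled` corresponds to one window, so the printed maxima show how little a feature moves when the window slides by a single frame.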

However, note that features will only be engineered and packaged for prediction once a valid action is detected (refer to the action_detect notebook). Windows that contain an actual action should therefore carry more useful information and correlation for improving predictions than the ones shown here.

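Additionally, each window holds 50 * 6 = 300 raw values and we add only 11 engineered features, an overhead of roughly 3.7% (11/300), which is not bad! This assumes each feature is stored once per window; engineer_features as written repeats every feature value on all 50 rows.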
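
The overhead estimate above can be checked with a few lines of NumPy. This is only a sketch of the arithmetic: the arrays are zero-filled placeholders, and the two packaging options (`packed`, `broadcast`) are illustrative assumptions rather than the format the prediction code actually uses.

```python
import numpy as np

window_len, n_channels, n_features = 50, 6, 11

raw = np.zeros((window_len, n_channels), dtype=np.float32)  # placeholder window
feats = np.zeros(n_features, dtype=np.float32)              # placeholder features

# Option 1: flatten the window and append each feature once
packed = np.concatenate([raw.ravel(), feats])

# Option 2: repeat every feature on each row, as scalar DataFrame columns do
broadcast = np.hstack([raw, np.tile(feats, (window_len, 1))])

print(packed.size, broadcast.size)     # 311 850
print(f"{n_features / raw.size:.1%}")  # 3.7%
```

Storing the features once per window keeps the input at 311 values, whereas repeating them on every row grows it to 50 * 17 = 850.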

## Further improvements

- test the engineered features on the trained model and see how much they improve accuracy
- plot a confusion matrix once the model is done training
- try keeping a DataFrame-formatted window in the queue of the main Python file
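
For the last item, one possible shape for a DataFrame-backed queue is sketched below. It assumes readings arrive one at a time as six-value tuples and that the main file keeps the latest 50 of them; `push_reading`, `buffer`, `COLUMNS` and `WINDOW` are hypothetical names, and the stream is simulated.

```python
from collections import deque

import pandas as pd

COLUMNS = ['Acc_X', 'Acc_Y', 'Acc_Z', 'Gyr_X', 'Gyr_Y', 'Gyr_Z']
WINDOW = 50

# A bounded deque drops the oldest reading automatically once it is full
buffer = deque(maxlen=WINDOW)

def push_reading(reading):
    """Append one 6-value reading; return the window as a DataFrame once full."""
    buffer.append(reading)
    if len(buffer) < WINDOW:
        return None
    return pd.DataFrame(list(buffer), columns=COLUMNS)

# Simulated stream of 60 readings (synthetic values, not sensor data)
frames = [push_reading((float(i), 0.0, 9.8, 0.0, 0.0, 0.0)) for i in range(60)]
ready = [f for f in frames if f is not None]
print(len(ready))  # 11
```

Each returned DataFrame has the same six columns as the training data, so it could be passed to the feature-engineering step without further conversion.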