# Feature Engineering

In [1]:
import pandas as pd

# import out own modules
from modules.FeatureBuilder import *
from modules.DataHandler import *
from modules.SignalTransform import *

### Why we don't simply fit intertial signals to the model?

Although they showed relatively different trends between walking and stationary activity groups, we observed that most of the signals overlap within the same group activities. Hence, it's not clear whether they would be helpful for modelling activity detecting problem. Nevertheless, it's important to add that total acceleration signals showed considerable distinction even within the activities of same group. But, still, the total acceleration signals alone isn't enough to capture all the information about the event.

### What we can do?

Most of the signal show similar behaviour (periodic patterns) over a period of time. From the initial investigation we did on inertial signals, we know that signal values change when activity changes at some moment and it shows similar flow during this activity. Especially, we observed some fluctuations or rapid changes at the signal values for some short frames in which transaction from one activity to another activity occurs. It's important to find when these changes happen. Thus, transforming signals between time and frequency domain, we'll try to decompose the frequency of these periodic components and later find possible significant peaks in frequency spectra. For this we'll use these concepts:
    
   1. **Fast Fourier Transform (FFT)** - Used for finding frequency of periodical components
   2. **Power Spectral Density (PSD)** - Finds peaks corrensponds to power distribution at that frequency
   3. **Atocorrelation (aCORR)** - Caculates the serial correlation of a signal with its lagged signal
   
   
   For  implemataion of these mehotds go to *'modules.SignalTransform'*.

### What is next now?

Now we have extra signals to conduct more detailed feature engineering which may convey better information. We'll use these transformation signals addition to initial inertial signals for doing feature extraction. Our total features set we'll be consist of features calculated on this newly created signal and those calculated on normal inertial signals.


### Main features

main_features: those are the features that we'll be calculated on original inertial signals.

In [2]:
featureBuilder = FeatureBuilder(n_peaks=2)
featureBuilder.init_features()

In [3]:
featureBuilder.get_main_features()

['std', 'mean', 'mad', 'max', 'min', 'iqr', 'correlation-1', 'correlation-2']

#### Descrioption of main features

* 'mean' mean vale
* 'std': standart deviation
* 'mad': median absolute deviation
* 'max': larget value in array
* 'min': smallest value in array
* 'iqr': interquartile range
* 'correlation-1': correlation
* 'correlation-2': correlation

We know that we have 3 different signals and each signal is represented on 3-axis. So if the given signal is on the x-axis, then *'correlation-1'* and *'correlation-2'* show correlation between x-y and x-z. It the same for the other axes.


### Domain  features

domain_features: those are the features that we'll be calculated from signal transformations.

In [4]:
featureBuilder.get_domain_features()

['aCORR-peaks-mean',
 'aCORR-peak-value-0',
 'aCORR-peak-value-1',
 'aCORR-peak-domain-0',
 'aCORR-peak-domain-1',
 'PSD-peaks-mean',
 'PSD-peak-value-0',
 'PSD-peak-value-1',
 'PSD-peak-domain-0',
 'PSD-peak-domain-1',
 'FFT-peaks-mean',
 'FFT-peak-value-0',
 'FFT-peak-value-1',
 'FFT-peak-domain-0',
 'FFT-peak-domain-1']

#### Description of domain featues

for each of FFT, PSD, and aCORR we have the same structue below:

* 'peaks-mean': mean of the first n selected peaks-value (not domains)
* 'peak-value':{} : e.g. for the first 2 selected peaks, it will be like {'1':0, '2':0} 
* 'peak-domain': {}: e.g for the first 2 selected peaks, it will be like {'1':0, '2':0}


For each transformation, we'll look at the first n peaks in the signal. We're not only the interested in the amplitude of these peaks happened, but also where/when this peaks happened in the t/f-domains. Because moving where this peak occurs can also be helpful for discrimination period pattern. Thus we'll not only take first n peaks of transformations, but also consider their t/f-domains.

So *'peak-value'* stores the amplitude of the first n peaks, *'peak-domain'* stores the information about at what frequency/time domain this peaks happens. In the above example n_peak=2, therefore, two peak-related features has generated.

## Feature Generation

Now, here comes the fun part, finally! In this part, will carry out feature engineering on test/train internal signals.

#### Init our modules

In [5]:
# init datahandler module
dataHandler = DataHandler()

# init SignalTransform
# N: reading size for given window
# F: sampling rate with Hz
# t: size of fixed-width sliding wondow in seconds
signalTransform = SignalTransform(N=128, F=50, t=2.56)

# init FeatureBuilder
featureBuilder = FeatureBuilder(n_peaks=5)
featureBuilder.init_features()

#### Create feature labels
Create features labels from the list of features that are initialized in FutureBuilder model. We will use them as dataset labels.

In [6]:
# get the name of signal files in the path
signal_names = os.listdir('UCI HAR Dataset/train/Inertial Signals/')

signal_names

['body_acc_x_train.txt',
 'body_acc_y_train.txt',
 'body_acc_z_train.txt',
 'body_gyro_x_train.txt',
 'body_gyro_y_train.txt',
 'body_gyro_z_train.txt',
 'total_acc_x_train.txt',
 'total_acc_y_train.txt',
 'total_acc_z_train.txt']

In [21]:
# list stores feature labels
feature_labels = []

for signal in signal_names:

    main_features = [(signal[:-9]+name).upper() for name in featureBuilder.get_main_features()]
    domain_features = [(signal[:-9]+name).upper() for name in featureBuilder.get_domain_features()]
    
    feature_labels.extend(main_features)
    feature_labels.extend(domain_features)
    
# also add target label
feature_labels.append('ACTIVITY')
    
feature_labels[:10]

['BODY_ACC_X_STD',
 'BODY_ACC_X_MEAN',
 'BODY_ACC_X_MAD',
 'BODY_ACC_X_MAX',
 'BODY_ACC_X_MIN',
 'BODY_ACC_X_IQR',
 'BODY_ACC_X_CORRELATION-1',
 'BODY_ACC_X_CORRELATION-2',
 'BODY_ACC_X_ACORR-PEAKS-MEAN',
 'BODY_ACC_X_ACORR-PEAK-VALUE-0']

In [22]:
# given signal calculate both main and domain features and return values
def get_all_features(signal, corr_signals):
    '''
    input:
        signal: signal array for a axis in which caculation is will be done
        corr_signals: list that contains signal arrays for other two axes,
                      these will be used for calculation correlations.
                      e.g. if signal is x, then corr_signals=[z,y]
                      
    output: return 1D array which contains both main and domain features 
    '''

    # init/reste fature values before start
    featureBuilder.init_features()
    
    # calculate main features
    featureBuilder.calculate_main_features(x_signal, corr_signals)

    # get signals transformaiton
    domain_fft, signal_fft = signalTransform.fft_transform(x_signal)
    domain_psd, signal_psd = signalTransform.psd_transform(x_signal)
    domain_aCorr, signal_aCorr = signalTransform.aCorr_transform(x_signal)

    # calculate domain features on differen signal transformations
    featureBuilder.calculate_domain_features(domain_fft, signal_fft, t_name='FFT')
    featureBuilder.calculate_domain_features(domain_psd, signal_psd, t_name='PSD')
    featureBuilder.calculate_domain_features(domain_aCorr, signal_aCorr, t_name='aCORR')

    # get features and concotanate them
    main_features = featureBuilder.get_main_features(return_values=True)
    domain_features = featureBuilder.get_domain_features(return_values=True)
    
    return np.concatenate((main_features, domain_features))

### Generate features

In [25]:
# accumulator for sotring data

for prefix in ['train', 'test']:
    
    data = []
    
    # load initial data
    X_data = dataHandler.load_files('UCI HAR Dataset/{p}/Inertial Signals/'.format(p=prefix))
    y_data = dataHandler.load_txt('UCI HAR Dataset/{p}/y_{p}.txt'.format(p=prefix)).values

    for row in range(X_data.shape[0]):

        # accumulator for storing features at each row
        features = []

        # iterate over signal types by +3, becase for each signal type we have 3 differnet axes values
        for signal in range(0, X_data.shape[2], 3):

            # fet the signal for each axis
            x_signal = X_data[row][:, signal]
            y_signal = X_data[row][:, signal+1]
            z_signal = X_data[row][:, signal+2]

            # GIVEN SIGNAL CALCULATE BOTH MAIN AND DOMAIN FEATURES 
            # corr_siganls is used to calculate corrlation of theese singals with the given signal
            x_features = get_all_features(x_signal, corr_signals=[y_signal, z_signal])
            y_features = get_all_features(y_signal, corr_signals=[x_signal, z_signal])
            z_features = get_all_features(z_signal, corr_signals=[x_signal, y_signal])

            # complete features for each row
            features.append(np.concatenate((x_features, y_features, z_features)))

        # add new feture row to the data list 
        data.append(np.array(features).flatten())

    # save data, add y_data to data and ACTIVITY to labes as well
    dataframe = pd.DataFrame(np.hstack((data, y_data)), columns=feature_labels)
    dataframe.to_csv('dataset/{}.csv'.format(prefix), index=False, header=True)

#### Engineered data

Feature engineering part is now done. Our data looks like below.

In [26]:
pd.read_csv('dataset/train.csv').head(20)

Unnamed: 0,BODY_ACC_X_STD,BODY_ACC_X_MEAN,BODY_ACC_X_MAD,BODY_ACC_X_MAX,BODY_ACC_X_MIN,BODY_ACC_X_IQR,BODY_ACC_X_CORRELATION-1,BODY_ACC_X_CORRELATION-2,BODY_ACC_X_ACORR-PEAKS-MEAN,BODY_ACC_X_ACORR-PEAK-VALUE-0,...,TOTAL_ACC_Z_FFT-PEAK-VALUE-1,TOTAL_ACC_Z_FFT-PEAK-VALUE-2,TOTAL_ACC_Z_FFT-PEAK-VALUE-3,TOTAL_ACC_Z_FFT-PEAK-VALUE-4,TOTAL_ACC_Z_FFT-PEAK-DOMAIN-0,TOTAL_ACC_Z_FFT-PEAK-DOMAIN-1,TOTAL_ACC_Z_FFT-PEAK-DOMAIN-2,TOTAL_ACC_Z_FFT-PEAK-DOMAIN-3,TOTAL_ACC_Z_FFT-PEAK-DOMAIN-4,ACTIVITY
0,0.002941,0.002269,0.002025,0.01081,-0.004294,0.004812,0.374934,0.433372,0.001125,0.001205,...,0.000781,0.000479,0.000556,0.000576,1.190476,3.174603,3.968254,4.761905,5.952381,5.0
1,0.001981,0.000174,0.00011,0.005251,-0.006706,0.00197,-0.011562,-0.071672,3.8e-05,3.1e-05,...,0.00066,0.000661,0.000582,0.000858,0.793651,1.984127,2.777778,3.968254,5.15873,5.0
2,0.002908,0.000428,0.000627,0.008167,-0.010483,0.003138,-0.121905,-0.179492,9.7e-05,0.000255,...,0.000832,0.000626,0.00079,0.001252,0.793651,1.587302,2.380952,4.365079,5.952381,5.0
3,0.002678,0.000329,0.000269,0.008167,-0.010483,0.003128,-0.301393,-0.360048,0.000181,0.000304,...,0.000749,0.000358,0.000727,0.00089,1.190476,1.984127,3.571429,4.365079,5.952381,5.0
4,0.002015,-0.000195,-0.000144,0.00565,-0.006847,0.002622,-0.152752,-0.188102,5.2e-05,6e-05,...,0.00089,0.00066,0.000693,0.000565,1.190476,2.777778,3.968254,4.761905,5.952381,5.0
5,0.002276,-7.8e-05,-0.000182,0.00565,-0.006847,0.002883,-0.206476,-0.149615,4.5e-05,-6e-05,...,0.001155,0.001041,0.00069,0.000885,1.587302,2.777778,3.571429,5.555556,7.142857,5.0
6,0.002409,0.000387,0.000461,0.006637,-0.005558,0.003317,-0.093299,-0.134364,0.000103,0.000155,...,0.001078,0.001735,0.000898,0.000974,1.190476,2.777778,3.571429,6.746032,9.126984,5.0
7,0.002527,-3e-05,-0.000261,0.006637,-0.00603,0.003928,-0.158534,-0.016404,9.3e-05,0.00012,...,0.000659,0.001169,0.00067,0.000786,1.587302,2.380952,3.571429,5.15873,6.349206,5.0
8,0.002278,-5.8e-05,-0.000261,0.006897,-0.00603,0.002976,-0.161664,-0.03296,0.000111,0.000138,...,0.000448,0.001132,0.000402,0.000287,1.190476,2.777778,5.15873,5.952381,6.746032,5.0
9,0.003095,0.00062,0.000495,0.007276,-0.009268,0.003705,-0.173019,-0.44458,0.000278,0.000231,...,0.001386,0.001759,0.000503,0.00071,1.984127,3.174603,4.761905,6.349206,7.142857,5.0
