# Preprocessing Raw Data

Here we apply a few simple preprocessing steps to the raw data to make sure it is ready for the feature engineering phase.

## 1. Balance of target classes

First, we check whether the raw data is balanced. Unbalanced data can bias a predictive model toward the majority classes, so we inspect the class distribution in each dataset (train and test). If the distribution turns out to be unbalanced, we can apply corrective steps such as oversampling or undersampling after feature engineering.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# import our modules here
from modules.DataHandler import *
from modules.utils import *

# DataHandler: downloads the dataset (txt) files and handles the data we load
datahandler = DataHandler()

In [2]:
# load the subject IDs of the training set and count the rows recorded per subject
data = datahandler.load_txt('UCI HAR Dataset/train/subject_train.txt')
subject_count = data.iloc[:, 0].value_counts()

In [3]:
prefix = ['train', 'test']

for p in prefix:
    print('\n{}-SET\nclass\tnumber\t%-wise'.format(p.upper()))

    # load the activity labels of this split
    data = datahandler.load_txt('UCI HAR Dataset/{p}/y_{p}.txt'.format(p=p))
    # count the number of occurrences of each class
    # (Series.value_counts replaces the deprecated pd.value_counts)
    class_count = data.iloc[:, 0].value_counts()
    total = class_count.sum()

    # print each class with its absolute count and its share in percent
    for label, count in class_count.items():
        print('%d\t%d\t%.2f' % (label, count, 100 * count / total))


TRAIN-SET
class	number	%-wise
6	1407	19.14
5	1374	18.69
4	1286	17.49
1	1226	16.68
2	1073	14.59
3	986	13.41

TEST-SET
class	number	%-wise
6	537	18.22
5	532	18.05
1	496	16.83
4	491	16.66
2	471	15.98
3	420	14.25


*The classes are fairly evenly distributed in both sets (each class accounts for roughly 13% to 19% of the samples), so it is safe to continue.*
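
Had the classes been noticeably unbalanced, random oversampling is one of the corrective steps mentioned above. The following is only a minimal sketch on toy data, not part of this project's pipeline: the helper `random_oversample` and the column names `feature` and `label` are hypothetical, and in practice it would be applied to the training set only, after feature engineering.

```python
import pandas as pd


def random_oversample(df, label_col, random_state=0):
    """Duplicate random rows of the smaller classes until every class
    has as many rows as the largest one (hypothetical helper)."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)
        missing = target - len(group)
        if missing > 0:
            # draw the missing rows with replacement from the same class
            parts.append(group.sample(n=missing, replace=True,
                                      random_state=random_state))
    # shuffle so that the duplicated rows are not grouped together
    return (pd.concat(parts)
              .sample(frac=1, random_state=random_state)
              .reset_index(drop=True))


# toy data: class 1 has four rows, classes 2 and 3 have one row each
toy = pd.DataFrame({'feature': range(6), 'label': [1, 1, 1, 1, 2, 3]})
balanced = random_oversample(toy, 'label')
print(balanced['label'].value_counts().sort_index())
```

Undersampling would work the other way around: sample every class down to the size of the smallest one, at the cost of discarding data.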

## 2. Missing values

Check whether any raw data file contains missing values. If so, remove or replace them.

In [4]:
# iterate over the train and test raw files (datasets)
for prefix in ['train', 'test']:

    print('\n'+prefix.upper())
    parentdir = 'UCI HAR Dataset/{}/Inertial Signals/'.format(prefix)
    # get the names of the signal files in parentdir (sorted for a stable order)
    filelist = sorted(os.listdir(parentdir))

    # load every file in filelist
    for filename in filelist:
        # load data
        data = datahandler.load_txt(parentdir + filename)
        # check whether the file contains any missing value
        if data.isna().any().any():
            print(filename[:-4] + ': Missing value found')
        else:
            print(filename[:-4] + ': No missing value')


TRAIN
body_acc_x_train: No missing value
body_acc_y_train: No missing value
body_acc_z_train: No missing value
body_gyro_x_train: No missing value
body_gyro_y_train: No missing value
body_gyro_z_train: No missing value
total_acc_x_train: No missing value
total_acc_y_train: No missing value
total_acc_z_train: No missing value

TEST
body_acc_x_test: No missing value
body_acc_y_test: No missing value
body_acc_z_test: No missing value
body_gyro_x_test: No missing value
body_gyro_y_test: No missing value
body_gyro_z_test: No missing value
total_acc_x_test: No missing value
total_acc_y_test: No missing value
total_acc_z_test: No missing value
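
None of the files needs fixing, but for completeness the "remove/replace" step could look roughly like the sketch below. It uses a made-up toy frame (rows as windows, columns as readings) rather than the real signal files, and shows two common options with pandas: dropping incomplete rows, or filling gaps by linear interpolation along each row.

```python
import numpy as np
import pandas as pd

# toy signal matrix with two gaps (rows = windows, columns = readings)
signals = pd.DataFrame([[0.1, 0.2, 0.3],
                        [0.4, np.nan, 0.6],
                        [np.nan, 0.8, 0.9]])

# option 1: remove every row that contains at least one missing value
dropped = signals.dropna(axis=0)

# option 2: replace gaps by interpolating linearly along each row;
# limit_direction='both' also fills gaps at the start or end of a row
filled = signals.interpolate(axis=1, limit_direction='both')

print(dropped.shape)               # rows left after dropping
print(filled.isna().any().any())   # whether any gap remains
```

Dropping rows is the simpler choice but also requires dropping the matching rows in the label and subject files, while interpolation keeps all windows aligned.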


## 3. Noise removal

The dataset description states that the sensor signals (accelerometer and gyroscope) were already pre-processed with noise filters, so we skip this step.