# Data cleaning

## Overview
The data consists of two data sets: an **accelerometer** data set and a **gyroscope** data set.
The first step is to merge these two data sets into a new, merged data set and to remove outliers and missing values.
To do this, each accelerometer data entry has to be matched with its corresponding gyroscope data entry.

First, let's look at the accelerometer and gyroscope data attributes (gt is the label):

**accelerometer data attributes**: Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt

**gyroscope data attributes**: Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt

Both data sets contain the same attributes.
The attributes **Index, Arrival_Time and Creation_Time** carry no relevant information for the classification algorithm because their values are mostly unique identifiers. They are not copied into the merged data set (**Index** is still used to match rows, see Data pruner).
To learn a general model that can classify data from all devices and users, the attributes **User, Model and Device** are ignored as well.
This leaves four attributes in each data set that need to be merged: **x,y,z,gt**

To avoid confusing the two sets of x, y and z values, the three accelerometer values are named **aX, aY and aZ** and the three gyroscope values **gX, gY and gZ**. For readability, the attribute **gt** is renamed to **label**.
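The renaming can be sketched in a few lines of Python. This is only an illustration: the function name `merge_pair`, the two mapping tables and the sample values are assumptions made for this example and are not taken from the project code. The label is taken from the accelerometer row here.

```python
# Hypothetical mapping tables for the renaming described above.
ACC_NAMES = {"x": "aX", "y": "aY", "z": "aZ"}
GYRO_NAMES = {"x": "gX", "y": "gY", "z": "gZ"}

def merge_pair(acc_row, gyro_row):
    """Combine one accelerometer row and one gyroscope row into a merged entry."""
    merged = {new: float(acc_row[old]) for old, new in ACC_NAMES.items()}
    merged.update({new: float(gyro_row[old]) for old, new in GYRO_NAMES.items()})
    merged["label"] = acc_row["gt"]  # gt is renamed to label
    return merged

# Made-up sample rows; all other attributes are simply not copied.
acc_row = {"Index": "0", "x": "0.1", "y": "0.2", "z": "0.3", "User": "u", "gt": "stand"}
gyro_row = {"Index": "0", "x": "9.1", "y": "9.2", "z": "9.3", "User": "u", "gt": "stand"}
merged = merge_pair(acc_row, gyro_row)
print(merged)
```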

The merged data set therefore has seven attributes: **aX,aY,aZ,gX,gY,gZ,label**.
Because the merged file is very large, it is split in a further step into ten subsets. This also simplifies cross-validation: nine of the subsets are used to train the classifier and one to test it.
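How ten subsets map onto cross-validation rounds can be sketched as follows. The function name `cross_validation_splits` and the tiny stand-in subsets are assumptions for illustration; the real subsets are files, not in-memory lists.

```python
def cross_validation_splits(subsets):
    """Yield (train_rows, test_rows) pairs, holding out each subset exactly once."""
    for held_out in range(len(subsets)):
        train = [row
                 for i, subset in enumerate(subsets) if i != held_out
                 for row in subset]
        test = list(subsets[held_out])
        yield train, test

# Ten tiny stand-in subsets with one made-up row each.
subsets = [[(float(i), "walk")] for i in range(10)]
splits = list(cross_validation_splits(subsets))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each round trains on the rows of nine subsets and tests on the remaining one, so every row is used for testing exactly once.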

## Data pruner
The class **DataPruner** was implemented to prune and merge the two data sets.
For this purpose we decided to match entries by their **Index** and **gt (label)** attributes.
The main problem is that the two data sets cannot be loaded into main memory because they are too big (2.48GB for both CSV files combined).
To solve this problem, the **DataPruner** method **dataPruning** uses two CSV readers, one for each data set, which go through all entries row by row.
**Index** and **gt (label)** are used to find a matching row because the other candidate attributes, **Arrival_Time** and **Creation_Time**, do not allow an unambiguous assignment: the time values in **accelerometer** and **gyroscope** differ by about two and often do not match exactly. With the **Index** value an exact assignment is possible.
In addition, the number of entries with a specific label differs between the two data sets, so the last **Index** value of a run of entries with the same label differs as well.
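Reading both files row by row, without loading either of them completely, could look like the following Python sketch. It is not the **DataPruner** code: `iter_rows` is a hypothetical helper, and the in-memory `io.StringIO` objects with made-up values stand in for the two large CSV files.

```python
import csv
import io

def iter_rows(stream):
    """Yield one CSV row at a time as a dict; the whole file is never held in memory."""
    yield from csv.DictReader(stream)

HEADER = "Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt\n"
# Made-up sample data; with real data these would be open(...) file handles.
acc_file = io.StringIO(HEADER
                       + "0,10,10,0.1,0.2,0.3,u,m,d,stand\n"
                       + "1,11,11,0.4,0.5,0.6,u,m,d,stand\n")
gyro_file = io.StringIO(HEADER
                        + "0,12,12,9.1,9.2,9.3,u,m,d,stand\n"
                        + "1,13,13,9.4,9.5,9.6,u,m,d,stand\n")

# One reader per data set, advanced independently of each other.
acc_rows, gyro_rows = iter_rows(acc_file), iter_rows(gyro_file)
first_acc, first_gyro = next(acc_rows), next(gyro_rows)
print(first_acc["Index"], first_acc["x"], first_gyro["x"])
```

Because each reader is a generator, either one can be advanced on its own, which is what makes skipping rows in only one of the data sets possible.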

The implemented strategy is:
Go through all data entries (rows) in both data sets, **accelerometer** and **gyroscope**.

**Do** while rows exist:
    1. Find the next matching Index entries:
        a. Load the next parsable and valid rows from accelerometer and gyroscope.
           (Valid means that x,y,z can be parsed as floats and gt as a string.)
        b. If the Index values match: go to 2.
        c. If the Index values are unequal and one of them is zero:
            Skip rows of the other reader until its Index value is also zero and go to 2.
        d. Else:
            Skip one row of the reader with the lower Index value and go to 1.b.
    2. Matching rows found:
        a. Read the x,y,z values of the accelerometer as aX,aY,aZ and the x,y,z values of the gyroscope as gX,gY,gZ and write them with the corresponding label (gt) into the merged data set.
        b. Go to 1.

Data entries with label (gt) **null** will be ignored.
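The strategy above can be sketched in Python as follows. This is an illustration under several assumptions, not the **DataPruner** implementation: the names `valid_rows`, `merge` and `sample` are hypothetical, rows with label **null** are dropped while reading, the merged label is taken from the accelerometer row, and the additional comparison of the labels is left out. The sample values are made up.

```python
import csv
import io

def valid_rows(stream):
    """Yield (index, x, y, z, label) for each parsable row with a usable label."""
    for row in csv.DictReader(stream):
        try:
            parsed = (int(row["Index"]), float(row["x"]), float(row["y"]),
                      float(row["z"]), row["gt"])
        except (KeyError, TypeError, ValueError):
            continue  # unparsable row: skip it
        if parsed[4] and parsed[4] != "null":
            yield parsed

def merge(acc_stream, gyro_stream):
    """Yield (aX, aY, aZ, gX, gY, gZ, label) for rows with matching Index."""
    acc, gyro = valid_rows(acc_stream), valid_rows(gyro_stream)
    a, g = next(acc, None), next(gyro, None)
    while a is not None and g is not None:
        if a[0] == g[0]:                      # step 1.b and step 2
            yield (a[1], a[2], a[3], g[1], g[2], g[3], a[4])
            a, g = next(acc, None), next(gyro, None)
        elif a[0] == 0:                       # step 1.c, accelerometer restarted
            while g is not None and g[0] != 0:
                g = next(gyro, None)
        elif g[0] == 0:                       # step 1.c, gyroscope restarted
            while a is not None and a[0] != 0:
                a = next(acc, None)
        elif a[0] < g[0]:                     # step 1.d
            a = next(acc, None)
        else:
            g = next(gyro, None)

HEADER = "Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt\n"

def sample(lines):
    """Build an in-memory stand-in for one of the CSV files."""
    return io.StringIO(HEADER + "".join(line + "\n" for line in lines))

acc = sample([
    "0,10,10,0.1,0.2,0.3,u,m,d,stand",
    "1,11,11,0.4,0.5,0.6,u,m,d,stand",
    "2,12,12,bad,0.5,0.6,u,m,d,stand",   # unparsable x value
    "3,13,13,0.7,0.8,0.9,u,m,d,stand",
    "0,20,20,1.0,1.1,1.2,u,m,d,walk",    # Index starts again at zero
])
gyro = sample([
    "0,10,10,9.1,9.2,9.3,u,m,d,stand",
    "2,12,12,9.4,9.5,9.6,u,m,d,stand",
    "3,13,13,9.7,9.8,9.9,u,m,d,stand",
    "4,14,14,9.0,9.0,9.0,u,m,d,stand",
    "0,22,22,8.0,8.1,8.2,u,m,d,walk",
])
merged = list(merge(acc, gyro))
for row in merged:
    print(row)
```

In this sample, the Index values 0 and 3 of the first run and the restarted Index 0 of the second run are matched; the rows with Index 1, 2 and 4 have no partner and are skipped.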

# Convert data into a format useable for machine learning algorithms

## Overview
Because the merged and pruned data set was still too large to handle, it was split into ten subsets. Each subset should contain the labels in the same proportions as the full data set, so we first split the data set by label (Pruned data separator) and then distribute the rows randomly over the subsets with respect to the label proportions (Classificator data generator).

## Pruned data separator
The class **PrunedDataSeparator** was implemented to split the merged data set into ten subsets.
For each label one file was created (**label file**), and then the merged data set was parsed row by row.
For each row the label was determined and the row was copied into the corresponding label file.
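A minimal Python sketch of this separation step is shown below. It is not the **PrunedDataSeparator** code: the function name `separate_by_label`, the file naming scheme (`<label>.csv`) and the sample rows are assumptions for illustration.

```python
import csv
import os
import tempfile

def separate_by_label(merged_rows, out_dir):
    """Write every merged row into the file that belongs to its label."""
    handles, writers = {}, {}
    try:
        for row in merged_rows:
            label = row[-1]  # label is the last of the seven attributes
            if label not in writers:
                path = os.path.join(out_dir, f"{label}.csv")
                handles[label] = open(path, "w", newline="")
                writers[label] = csv.writer(handles[label])
            writers[label].writerow(row)
    finally:
        for handle in handles.values():
            handle.close()
    return {label: os.path.join(out_dir, f"{label}.csv") for label in handles}

# Made-up merged rows: aX, aY, aZ, gX, gY, gZ, label
rows = [
    (0.1, 0.2, 0.3, 9.1, 9.2, 9.3, "stand"),
    (1.0, 1.1, 1.2, 8.0, 8.1, 8.2, "walk"),
    (0.7, 0.8, 0.9, 9.7, 9.8, 9.9, "stand"),
]
out_dir = tempfile.mkdtemp()
label_files = separate_by_label(rows, out_dir)
print(sorted(label_files))
```

Only one row is processed at a time, so `merged_rows` can be any iterator over the merged file instead of a list.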

## Classificator data generator
The class **ClassificatorDataGenerator** was implemented to create the data sets that are used to train and test the learning algorithm. In the previous section the pruned and merged data set was split into ten **label files**, each of which contains only entries with the same label.
To get a random order of labeled data entries in each cross-validation subset, a label is chosen at random according to the label proportions. The chosen label determines the **label file** from which the next data entry is read, and that entry is written into one of the ten cross-validation subsets, which is also chosen at random.
Each cross-validation subset should now reflect the label proportions of the original data sets, and all subsets should contain approximately the same number of data entries.
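The random distribution can be sketched in Python like this. The sketch makes assumptions that the text does not confirm: the function name `distribute` is hypothetical, the label is drawn with a probability proportional to the number of rows still unread for that label, the target subset is drawn uniformly, and in-memory lists with made-up rows stand in for the label files.

```python
import random

def distribute(label_rows, n_subsets=10, seed=0):
    """Spread rows over n_subsets, picking label and target subset at random."""
    rng = random.Random(seed)
    remaining = {label: len(rows) for label, rows in label_rows.items() if rows}
    readers = {label: iter(rows) for label, rows in label_rows.items()}
    subsets = [[] for _ in range(n_subsets)]
    while remaining:
        labels = list(remaining)
        # Assumption: weight each label by the number of rows it still has left.
        weights = [remaining[label] for label in labels]
        label = rng.choices(labels, weights=weights)[0]
        # Assumption: every subset is equally likely to receive the row.
        subsets[rng.randrange(n_subsets)].append(next(readers[label]))
        remaining[label] -= 1
        if remaining[label] == 0:
            del remaining[label]
    return subsets

# Made-up label files: 300 "stand" rows and 100 "walk" rows.
label_rows = {
    "stand": [(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, "stand")] * 300,
    "walk": [(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, "walk")] * 100,
}
subsets = distribute(label_rows)
print([len(subset) for subset in subsets])
```

With a purely random assignment the subset sizes and label proportions are only approximately equal; the approximation improves as the number of rows grows.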