In [1]:
import os
import sys

import numpy as np

# Move up one directory so that the `src` package and the data paths resolve.
os.chdir("..")
ML_FOLDER_PATH = os.path.abspath(os.curdir)
sys.path.append(ML_FOLDER_PATH)

import src.helpers as hlp

## Data loading

In the first part of the processing, we divide the training dataset into three datasets based on the value of the feature 'PRI_jet_num'.
We made this choice because this feature takes values in {0, 1, 2, 3} and, depending on its value, some other features of the sample are undefined (denoted by -999.0).
We therefore split the dataset into three subsets (values 2 and 3 are combined, as they have the same number of defined features) and drop from each subset the features that are not defined.

In [2]:
path = 'data/train.csv'
data_0, data_1, data_2_3 = hlp.load_split_data(path)
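
The split described above can be sketched on a toy array. This is only an illustration under assumptions: `split_by_jet_num`, `UNDEFINED` and the toy data are hypothetical names, and the sketch is not the implementation of `hlp.load_split_data`, which may differ (for example in how it handles the index, the labels, or the jet-number column itself).

```python
import numpy as np

UNDEFINED = -999.0  # sentinel used for undefined values, as described above


def split_by_jet_num(data, jet_col):
    """Split rows by the jet-number column and drop all-undefined columns.

    Hypothetical helper for illustration only.
    """
    groups = [(0,), (1,), (2, 3)]  # values 2 and 3 are merged
    subsets = []
    for values in groups:
        rows = data[np.isin(data[:, jet_col], values)]
        # Keep a column only if at least one of its values is defined.
        keep = ~np.all(rows == UNDEFINED, axis=0)
        subsets.append(rows[:, keep])
    return subsets


# Toy data: column 0 plays the role of 'PRI_jet_num'.
toy = np.array([
    [0.0, 1.5, -999.0, -999.0],
    [0.0, 2.5, -999.0, -999.0],
    [1.0, 0.5, 3.0, -999.0],
    [2.0, 0.1, 4.0, 7.0],
    [3.0, 0.2, 5.0, 8.0],
])
toy_0, toy_1, toy_2_3 = split_by_jet_num(toy, jet_col=0)
print(toy_0.shape, toy_1.shape, toy_2_3.shape)
```

Each subset keeps a different number of columns, which mirrors the different feature counts printed for the real datasets.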

In [3]:
print('********** Number of features per dataset (including index and labels) **********')
print('Number of defined features for dataset 0: ', data_0.shape[1])
print('Number of defined features for dataset 1: ', data_1.shape[1])
print('Number of defined features for dataset 2_3: ', data_2_3.shape[1])

********** Number of features per dataset (including index and labels) **********
Number of defined features for dataset 0:  20
Number of defined features for dataset 1:  24
Number of defined features for dataset 2_3:  31


## Data processing
### Dealing with remaining undefined values
Now that all features containing only undefined values have been removed from each dataset, we still need to deal with the undefined values that occur occasionally in individual samples.
We choose to replace these remaining undefined values with the mean of the corresponding feature.

In [4]:
data_0 = hlp.nan_to_mean(data_0)
data_1 = hlp.nan_to_mean(data_1)
data_2_3 = hlp.nan_to_mean(data_2_3)
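
As a rough sketch of mean imputation, the snippet below replaces each undefined entry with the mean of the defined entries in the same column. The function name `replace_undefined_with_mean` and the toy array are hypothetical, and the sketch assumes that undefined values are still marked with -999.0 and that every column has at least one defined value; `hlp.nan_to_mean` may work differently.

```python
import numpy as np


def replace_undefined_with_mean(data, undefined=-999.0):
    """Replace undefined entries by the column mean of the defined entries.

    Hypothetical helper for illustration only.
    """
    data = data.astype(float)  # astype returns a copy, the input is untouched
    mask = data == undefined
    # Hide undefined entries from the mean by turning them into NaN.
    col_means = np.nanmean(np.where(mask, np.nan, data), axis=0)
    # col_means has one entry per column and broadcasts across rows.
    return np.where(mask, col_means, data)


toy = np.array([
    [1.0, -999.0],
    [3.0, 4.0],
    [-999.0, 6.0],
])
imputed = replace_undefined_with_mean(toy)
print(imputed)
```

Computing the mean over defined entries only matters here: including the -999.0 sentinel in the average would pull the mean far below the real values.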

### Polynomial expansion
To increase the dimension of our dataset and better approximate the relationship between the dependent and independent variables, we apply a polynomial expansion to the dataset.
This means that to each sample $\textbf x$ we append the powers $\textbf x^n$ for every degree $n$ in the list of degrees we want to add.

In [5]:
deg = [2, 3, 4, 5, 6, 7]
data_0 = hlp.poly_expansion(data_0, deg)
data_1 = hlp.poly_expansion(data_1, deg)
data_2_3 = hlp.poly_expansion(data_2_3, deg)
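
A minimal sketch of such an expansion, assuming the powers are taken feature by feature and appended as new columns. `poly_expand` is a hypothetical name, and the sketch works on a pure feature matrix; `hlp.poly_expansion` presumably has to leave the index and label columns alone, which is not shown here.

```python
import numpy as np


def poly_expand(x, degrees):
    """Append element-wise powers of the features as new columns.

    Hypothetical helper for illustration only.
    """
    return np.hstack([x] + [x ** d for d in degrees])


x = np.array([
    [1.0, 2.0],
    [3.0, -1.0],
])
expanded = poly_expand(x, [2, 3])
print(expanded.shape)
print(expanded)
```

With two original features and two extra degrees, each sample ends up with six columns: the original values, their squares, and their cubes.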

### Arcsinh transformation
To deal with outliers, we apply an arcsinh transformation. We chose this function because, for large positive values, $\sinh^{-1}(x)$ closely follows $\log(2x)$, so it compresses large values the way the logarithm does, while also being defined for negative values (which are present in our dataset).

In [6]:
data_0[:, 2:] = hlp.arcsinh_transform(data_0[:, 2:])
data_1[:, 2:] = hlp.arcsinh_transform(data_1[:, 2:])
data_2_3[:, 2:] = hlp.arcsinh_transform(data_2_3[:, 2:])
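
The behaviour of the transformation can be checked directly with NumPy's `np.arcsinh`. The sample values below are made up for illustration, and whether `hlp.arcsinh_transform` is a plain wrapper around `np.arcsinh` is an assumption.

```python
import numpy as np

# Made-up values including a large outlier on each side.
x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
y = np.arcsinh(x)

# For large positive x, arcsinh(x) is very close to log(2x).
gap = abs(y[-1] - np.log(2 * x[-1]))

print(y)
print(gap)
```

The function is odd, so negative outliers are compressed symmetrically, and zero is mapped to zero, which a plain logarithm cannot do.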

### Standardization

In [7]:
data_0[:, 2:] = hlp.std_data(data_0[:, 2:])
data_1[:, 2:] = hlp.std_data(data_1[:, 2:])
data_2_3[:, 2:] = hlp.std_data(data_2_3[:, 2:])
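
Standardization usually means subtracting the column mean and dividing by the column standard deviation. The sketch below shows that convention; `standardize` is a hypothetical name, and it is an assumption that `hlp.std_data` follows the same definition.

```python
import numpy as np


def standardize(x):
    """Give each column zero mean and unit standard deviation.

    Hypothetical helper for illustration only.
    """
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Avoid dividing by zero for constant columns.
    std = np.where(std == 0, 1.0, std)
    return (x - mean) / std


x = np.array([
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 5.0],
    [3.0, 60.0, 5.0],
])
z = standardize(x)
print(z.mean(axis=0))
print(z.std(axis=0))
```

In the cells above, only the columns from index 2 onward are transformed, which suggests the first two columns hold the index and the labels.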