## Denoising the Data

- EDA suggests that noise has been added to the features (noise ~ [0, 0.01]).
- Monetary values are roughly discrete, so peaks at 0, at other integers, and at multiples of 0.01 are expected, even if the variables have been transformed.
- Features that look categorical in the plots have far more unique values than a categorical feature should. The following function is therefore applied to denoise the data:
**np.floor(x*100)**
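As a rough illustration of why flooring helps, the sketch below builds a synthetic feature (not the real dataset) whose true values sit on a 0.01 grid, adds uniform noise assumed to stay strictly inside (0, 0.01), and applies `np.floor(x*100)`. The names `grid`, `noise`, and `x` are hypothetical and exist only for this example.

```python
import numpy as np

# Synthetic stand-in for one feature: values on a 0.01 grid plus small noise.
rng = np.random.default_rng(0)
grid = rng.integers(0, 100, size=1000)        # true level, in hundredths
noise = rng.uniform(0.001, 0.009, size=1000)  # assumed noise, inside (0, 0.01)
x = grid / 100 + noise

# The denoising step from the notes above: scale to hundredths, drop the rest.
denoised = np.floor(x * 100)

# The noisy column has (almost) as many unique values as rows;
# the denoised column has at most 100.
n_unique_before = np.unique(x).size
n_unique_after = np.unique(denoised).size
print(n_unique_before, n_unique_after)

# Floating-point caveat: a noise-free value that sits exactly on the grid
# can land one bin too low, because 0.29 * 100 is slightly below 29.
edge = np.floor(0.29 * 100)
print(edge)
```

In this toy setup the floor recovers the grid level exactly. On real data, values with zero noise can fall into the bin below because of the floating-point effect shown in the last lines.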

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import pickle
import joblib
import matplotlib.pyplot as plt
from tqdm import tqdm

In [21]:
X = pd.read_csv('datasets/Modelling_students/train_allx.csv',memory_map=True)
y = pd.read_csv('datasets/Modelling_students/train_y.csv',memory_map=True)
X_val = pd.read_csv('datasets/Modelling_students/val_allx.csv',memory_map=True)

In [22]:
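def denoise(df):
    """One-hot encode D_36 and D_44, then floor every non-ID column to hundredths.

    Note: `df` is modified in place and also returned.
    """
    # Columns D_36 and D_44 hold text values and are categorical,
    # so they are one-hot encoded (column names become e.g. 'onehot__CL').
    dummies1 = pd.get_dummies(df['D_36'], prefix='onehot_')
    dummies2 = pd.get_dummies(df['D_44'], prefix='onehot_')

    # Replace the original text columns with their one-hot columns
    df.drop(['D_36', 'D_44'], axis=1, inplace=True)
    df[dummies1.keys()] = dummies1.values
    df[dummies2.keys()] = dummies2.values

    # Apply floor(x * 100) to every column except the identifier.
    # This includes the one-hot columns, which therefore end up as 0 / 100.
    for col in tqdm(df.columns):
        if col != 'ID':
            df[col] = np.floor(df[col] * 100)
    return df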
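Because `denoise` calls `pd.get_dummies` separately on each frame, the training and validation outputs only share the same one-hot columns if both contain the same categories. A minimal sketch of one way to guard against a mismatch is shown below. The helper `align_columns` and the toy category values are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def align_columns(reference, other, fill_value=0):
    """Return `other` with exactly the columns of `reference`, in the same order.

    Hypothetical helper: columns missing from `other` are filled with
    `fill_value`, and columns not present in `reference` are dropped.
    """
    return other.reindex(columns=reference.columns, fill_value=fill_value)

# Toy data: the validation sample happens to contain no 'XZ' rows.
train_cat = pd.get_dummies(pd.Series(['CL', 'CO', 'XZ']), prefix='onehot_')
val_cat = pd.get_dummies(pd.Series(['CO', 'CO', 'CL']), prefix='onehot_')

val_aligned = align_columns(train_cat, val_cat)
print(list(val_aligned.columns))
```

In the run shown in this notebook both frames report 196 columns, so the check may be unnecessary here. It becomes relevant when a category is absent from one of the files.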

In [23]:
train = denoise(X)

100%|██████████| 196/196 [00:14<00:00, 13.62it/s]


In [24]:
# Saved in Feather format because it is faster to read/write and smaller to store
train.to_feather('datasets/dataset1/train.feather')

In [25]:
test = denoise(X_val)

100%|██████████| 196/196 [00:15<00:00, 12.50it/s]


In [26]:
test.to_feather('datasets/dataset1/test.feather')

In [2]:
train = pd.read_feather('datasets/dataset1/train.feather')

In [3]:
train.head()

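|   | ID | B_37 | S_24 | S_4 | S_14 | B_25 | D_38 | B_30 | D_138 | P_2 | ... | S_5 | onehot__CL | onehot__CO | onehot__CR | onehot__XL | onehot__XM | onehot__XZ | onehot__O | onehot__R | onehot__U |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3337446730 | 0.0 | 0.0 | 85.0 | 55.0 | 100.0 | NaN | 2.0 | 51.0 | 100.0 | ... | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 |
| 1 | 7888784125 | 0.0 | 0.0 | 75.0 | 68.0 | 0.0 | NaN | 8.0 | 55.0 | 68.0 | ... | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 |
| 2 | 9871378905 | 0.0 | 0.0 | 76.0 | 28.0 | 100.0 | NaN | 0.0 | 85.0 | 92.0 | ... | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 |
| 3 | 8891869609 | 0.0 | 0.0 | 36.0 | 0.0 | 53.0 | NaN | 0.0 | 52.0 | 70.0 | ... | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 |
| 4 | 2006443827 | 0.0 | 0.0 | 75.0 | 42.0 | 100.0 | NaN | 0.0 | 82.0 | 100.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 100.0 | 0.0 | 0.0 |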
