# Data preprocessing

With least squares we can find the optimal w that minimizes the loss (mean squared error) of a linear regression model on our training data. This way we reached 75% accuracy on the test data. Since least squares already finds the optimal w, we already have the "best" linear model for our data; if we want to improve the model, we have to "improve" the data by cleaning and handling it.
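
As a reminder of what "optimal w" means here, the following is a minimal sketch of least squares via the normal equations on toy data. The function name `least_squares_sketch` and the toy arrays are assumptions made for illustration; the `least_squares` function imported later from `implementations` is not shown in this notebook and may compute its loss differently.

```python
import numpy as np

def least_squares_sketch(y, tx):
    """Solve the normal equations (tx^T tx) w = tx^T y; return (w, mse)."""
    w = np.linalg.solve(tx.T @ tx, tx.T @ y)
    mse = np.mean((y - tx @ w) ** 2)
    return w, mse

# Toy data generated exactly by y = 1 + 2 * x (first column of tx is the bias term)
x = np.array([0.0, 1.0, 2.0, 3.0])
tx = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

w, mse = least_squares_sketch(y, tx)
print(w, mse)
```

Because the solution is the global minimizer of the squared error for the given `tx`, any further gain has to come from changing `tx` itself, which is what the rest of the notebook does.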

In this notebook we try to find a proper way to preprocess our data in order to achieve higher accuracy in linear regression models.

In [1]:
import numpy  as np
import pandas as pd

In [2]:
TRAIN_DATA_PATH = '../data/train.csv'
TEST_DATA_PATH  = '../data/test.csv'
OUTPUT_PATH     = '../data/output.csv'

In [3]:
df_train = pd.read_csv(TRAIN_DATA_PATH, index_col = 0)
df_test  = pd.read_csv(TEST_DATA_PATH,  index_col = 0)

In [4]:
# List all variables and types
df_train.dtypes

Prediction                      object
DER_mass_MMC                   float64
DER_mass_transverse_met_lep    float64
DER_mass_vis                   float64
DER_pt_h                       float64
DER_deltaeta_jet_jet           float64
DER_mass_jet_jet               float64
DER_prodeta_jet_jet            float64
DER_deltar_tau_lep             float64
DER_pt_tot                     float64
DER_sum_pt                     float64
DER_pt_ratio_lep_tau           float64
DER_met_phi_centrality         float64
DER_lep_eta_centrality         float64
PRI_tau_pt                     float64
PRI_tau_eta                    float64
PRI_tau_phi                    float64
PRI_lep_pt                     float64
PRI_lep_eta                    float64
PRI_lep_phi                    float64
PRI_met                        float64
PRI_met_phi                    float64
PRI_met_sumet                  float64
PRI_jet_num                      int64
PRI_jet_leading_pt             float64
PRI_jet_leading_eta      

In [5]:
# Brief overview of our train data
df_train.describe()

Unnamed: 0,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,DER_sum_pt,...,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt
count,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0,...,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0,250000.0
mean,-49.023079,49.239819,81.181982,57.895962,-708.420675,-601.237051,-709.356603,2.3731,18.917332,158.432217,...,-0.010119,209.797178,0.979176,-348.329567,-399.254314,-399.259788,-692.381204,-709.121609,-709.118631,73.064591
std,406.345647,35.344886,40.828691,63.655682,454.480565,657.972302,453.019877,0.782911,22.273494,115.706115,...,1.812223,126.499506,0.977426,532.962789,489.338286,489.333883,479.875496,453.384624,453.389017,98.015662
min,-999.0,0.0,6.329,0.0,-999.0,-999.0,-999.0,0.208,0.0,46.104,...,-3.142,13.678,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0
25%,78.10075,19.241,59.38875,14.06875,-999.0,-999.0,-999.0,1.81,2.841,77.55,...,-1.575,123.0175,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0
50%,105.012,46.524,73.752,38.4675,-999.0,-999.0,-999.0,2.4915,12.3155,120.6645,...,-0.024,179.739,1.0,38.96,-1.872,-2.093,-999.0,-999.0,-999.0,40.5125
75%,130.60625,73.598,92.259,79.169,0.49,83.446,-4.593,2.961,27.591,200.47825,...,1.561,263.37925,2.0,75.349,0.433,0.503,33.703,-2.457,-2.275,109.93375
max,1192.026,690.075,1349.351,2834.999,8.503,4974.979,16.69,5.684,2834.999,1852.462,...,3.142,2003.976,3.0,1120.573,4.499,3.141,721.456,4.5,3.142,1633.433


As we are trying to improve our **linear regression** model as much as possible, outliers can significantly affect the precision we can obtain. If we don't remove them, the model will try to fit the outliers, which can pull the fit away from the rest of the (more "correct") training data. Another, equally important, issue is the huge number of missing values (those encoded as -999) in certain variables. As seen in the table above, some variables have missing values in more than half of the data. We have to decide whether these variables can be dropped or whether they contribute to the model accuracy. In other words, are they providing noise or information? Is there any way we can manipulate these missing values to minimize the noise and maximize the information?

Therefore the first issues we have to solve are:
* **Outliers (OL)**
* **Missing values (MV)**

As an important final remark: even though the decisions about how to handle these issues are made by analyzing their effect on the model fitted to the training data, we always have to keep an eye on the test data and make sure it has the same properties.

In [6]:
# Brief overview of our test data
df_test.describe()

Unnamed: 0,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,DER_sum_pt,...,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt
count,568238.0,568238.0,568238.0,568238.0,568238.0,568238.0,568238.0,568238.0,568238.0,568238.0,...,568238.0,568238.0,568238.0,568238.0,568238.0,568238.0,568238.0,568238.0,568238.0,568238.0
mean,-48.729241,49.258387,81.122338,57.829094,-707.4418,-599.731058,-708.384205,2.374211,18.99262,158.668286,...,-0.007981,209.957809,0.980251,-348.946261,-399.886426,-399.899229,-691.293904,-708.143299,-708.146201,73.267629
std,406.018702,35.393465,40.474035,63.30445,454.931763,659.054554,453.464437,0.779978,21.76045,116.258246,...,1.812916,126.95606,0.979394,533.156405,489.468578,489.458204,480.450337,453.837535,453.832741,98.470522
min,-999.0,0.0,6.81,0.0,-999.0,-999.0,-999.0,0.237,0.0,46.103,...,-3.142,13.847,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0
25%,78.191,19.33,59.425,14.20225,-999.0,-999.0,-999.0,1.815,2.838,77.463,...,-1.574,122.97225,0.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0
50%,105.08,46.467,73.74,38.472,-999.0,-999.0,-999.0,2.492,12.413,120.666,...,-0.016,179.94,1.0,38.968,-1.862,-2.11,-999.0,-999.0,-999.0,40.504
75%,130.7755,73.63,92.16275,79.256,0.503,84.3055,-4.532,2.962,27.651,201.073,...,1.559,264.02475,2.0,75.52,0.431,0.483,33.838,-2.427,-2.26,110.5665
max,1949.261,968.669,1264.965,1337.187,8.724,4794.827,17.65,5.751,759.363,2079.162,...,3.142,2190.275,3.0,1163.439,4.5,3.142,817.801,4.5,3.142,1860.175


Given the information in the following table, we can relax: at least in general terms (and apart from the maxima), train and test data seem to be very similarly distributed.

In [7]:
# Difference of descriptive statistics in train and test data
(df_train.describe() - df_test.describe()).drop(['count'], axis=0)

Unnamed: 0,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,DER_sum_pt,...,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt
mean,-0.293839,-0.018568,0.059644,0.066868,-0.978876,-1.505992,-0.972398,-0.001111,-0.075288,-0.236069,...,-0.002138,-0.160631,-0.001075,0.616694,0.632112,0.639441,-1.0873,-0.97831,-0.97243,-0.203037
std,0.326944,-0.048579,0.354655,0.351232,-0.451198,-1.082251,-0.444561,0.002933,0.513043,-0.55213,...,-0.000693,-0.456555,-0.001968,-0.193616,-0.130292,-0.124321,-0.574841,-0.452911,-0.443723,-0.45486
min,0.0,0.0,-0.481,0.0,0.0,0.0,0.0,-0.029,0.0,0.001,...,0.0,-0.169,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.09025,-0.089,-0.03625,-0.1335,0.0,0.0,0.0,-0.005,0.003,0.087,...,-0.001,0.04525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-0.068,0.057,0.012,-0.0045,0.0,0.0,0.0,-0.0005,-0.0975,-0.0015,...,-0.008,-0.201,0.0,-0.008,-0.01,0.017,0.0,0.0,0.0,0.0085
75%,-0.16925,-0.032,0.09625,-0.087,-0.013,-0.8595,-0.061,-0.001,-0.06,-0.59475,...,0.002,-0.6455,0.0,-0.171,0.002,0.02,-0.135,-0.03,-0.015,-0.63275
max,-757.235,-278.594,84.386,1497.812,-0.221,180.152,-0.96,-0.067,2075.636,-226.7,...,0.0,-186.299,0.0,-42.866,-0.001,-0.001,-96.345,0.0,0.0,-226.742


## 1.- Missing values

It's very important to remark that the **order** (what to handle first, MV or OL?) **matters**. We have decided to handle the MV problem first because otherwise we could confuse outliers with data points containing missing values (since MV entries are -999, far from the range of the valid values in most cases).
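
The effect of the order can be illustrated with a small sketch. The series below is invented toy data, not taken from the dataset; it only assumes the -999 sentinel described above. With the sentinels left in, the mean and standard deviation are distorted so much that a genuinely extreme value looks unremarkable.

```python
import pandas as pd

MV = -999  # sentinel used for missing values
s = pd.Series([50.0, 52.0, 48.0, 51.0, 49.0, 300.0, MV, MV, MV, MV])

# z-scores computed with the sentinels still present
z_raw = (s - s.mean()) / s.std()

# z-scores computed on valid measurements only
valid = s[s != MV]
z_valid = (valid - valid.mean()) / valid.std()

# The value 300.0 (label 5) stands out far more once the sentinels are excluded
print(abs(z_raw.loc[5]), abs(z_valid.loc[5]))
```

This is why an outlier filter based on standard deviations is only meaningful after the missing values have been dealt with.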

#### 1.1 Removing MV

The most drastic remedy is to remove the problem at its root: we can dismiss every feature that contains missing values. This might not be a good choice, since we seem to be losing plenty of information. However, it could be that these features provide more noise than information to our model. It costs nothing to try, so we have no excuse for not doing it.

In [8]:
# Features including missing values
features_with_MV = ['DER_mass_MMC','DER_deltaeta_jet_jet','DER_lep_eta_centrality','DER_mass_jet_jet','DER_prodeta_jet_jet','PRI_jet_leading_pt','PRI_jet_leading_eta','PRI_jet_leading_phi','PRI_jet_subleading_pt','PRI_jet_subleading_eta','PRI_jet_subleading_phi']

In [9]:
# Dismiss features with missing values
df_train_without_MV_columns = df_train.drop(features_with_MV, axis=1)
df_test_without_MV_columns  = df_test.drop( features_with_MV, axis=1)

#df_train_without_MV_columns.describe()
#df_test_without_MV_columns.describe()  # <- Important to check!

With the previous approach we lose the information carried by the deleted features in the cases where they do have a valid measurement. We should analyze the following:

* Was the dropped data correlated with the predicted variable? For example, if these features could not get a valid measurement precisely when a boson appears, we would be losing crucial information!
* How does suppressing these features affect the data distribution?
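
One way to probe the first question is to compare the fraction of signal events between rows where a feature is missing and rows where it is present. The sketch below uses an invented toy frame (the real check would run on `df_train`); the variable names are assumptions for illustration.

```python
import pandas as pd

MV = -999
toy = pd.DataFrame({
    'Prediction':   ['s', 'b', 'b', 's', 'b', 'b', 's', 'b'],
    'DER_mass_MMC': [120.0, MV, 95.0, 130.0, MV, 88.0, 125.0, MV],
})

is_missing = toy['DER_mass_MMC'] == MV
is_signal = toy['Prediction'] == 's'

# Fraction of 's' labels among rows with and without a valid measurement
rate_when_missing = is_signal[is_missing].mean()
rate_when_present = is_signal[~is_missing].mean()
print(rate_when_missing, rate_when_present)
```

A large gap between the two rates would suggest that the fact of being missing is itself informative, so dropping the feature outright would discard that signal.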

We can also try dropping the points (rows) with MV instead. However, with this approach we would be training our model without noise, and therefore we can't predict how it will behave with test data entries that do have missing values. Still, if we are lucky, this model may reach a very high accuracy on data without MV, in which case we can use it as part of our final model.

In [10]:
# Dismiss every point (row) with a missing value in any of the features listed in features_with_MV
df_train_without_MV_points = df_train[(df_train[features_with_MV] != -999).all(axis=1)]
df_test_without_MV_points  = df_test[ (df_test[ features_with_MV] != -999).all(axis=1)]

#df_train_without_MV_points.describe()
#df_test_without_MV_points.describe()  # <- Important to check!

Removing points with MV of course drastically affects the data distribution of the features with MV, but how does it affect the other features? Again, we can get a general idea by analyzing the difference in the descriptive statistics of our data.

In [11]:
# Show difference in descriptive statistics for original and without MV points distributions
#(df_train.describe() - df_train_without_MV_points.describe()).drop(features_with_MV, axis=1).drop(['count'], axis=0)
#(df_test.describe()  - df_test_without_MV_points.describe( )).drop(features_with_MV, axis=1).drop(['count'], axis=0)

#### 1.2 Replacing MV

Invalid measurements take the value -999 (a totally arbitrary choice). This happens both in features that can only take positive values (where the MV entries therefore become outliers in that dimension) and in features that can also take negative values.

For this reason it makes total sense to replace these MV measurements with a more neutral value, so that their impact on the decision making is smaller. We can try replacing them with the average of the feature's valid measurements (or the mode, for discrete variables with a short range) and, since it also costs nothing, we can try replacing them with a biased but more neutral value (e.g. 0 for positive features).
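
Both replacement strategies can be sketched on a toy frame by first turning the sentinel into `NaN`, which makes the valid-measurement mean trivial to compute. The values below are invented for illustration; only the column names and the -999 sentinel come from the dataset.

```python
import numpy as np
import pandas as pd

MV = -999
toy = pd.DataFrame({
    'PRI_jet_leading_pt':  [40.0, MV, 60.0, MV],
    'PRI_jet_leading_eta': [-1.0, MV, 1.0, MV],
})

# Mark missing values as NaN so that mean() ignores them
as_nan = toy.replace(MV, np.nan)

filled_mean = as_nan.fillna(as_nan.mean())  # per-column mean of valid entries
filled_zero = as_nan.fillna(0.0)            # fixed neutral value

print(filled_mean)
print(filled_zero)
```

Note that for a feature centred around zero, such as an `eta` angle, the two strategies give nearly the same result, whereas for a strictly positive feature such as a `pt` they differ considerably.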

In [12]:
# features_with_MV = ['DER_mass_MMC','DER_deltaeta_jet_jet','DER_lep_eta_centrality','DER_mass_jet_jet','DER_prodeta_jet_jet','PRI_jet_leading_pt','PRI_jet_leading_eta','PRI_jet_leading_phi','PRI_jet_subleading_pt','PRI_jet_subleading_eta','PRI_jet_subleading_phi']

# Features taking also negative values:
# 'DER_prodeta_jet_jet'
# 'PRI_jet_leading_eta'
# 'PRI_jet_leading_phi'
# 'PRI_jet_subleading_eta'
# 'PRI_jet_subleading_phi'

In [13]:
# Replace MV with 0, working on copies so df_train and df_test stay untouched
# (only the features with MV are modified; all other columns are kept)
df_train_MV_replaced_0 = df_train.copy()
df_train_MV_replaced_0[features_with_MV] = df_train_MV_replaced_0[features_with_MV].replace(-999, 0)
df_test_MV_replaced_0  = df_test.copy()
df_test_MV_replaced_0[features_with_MV]  = df_test_MV_replaced_0[features_with_MV].replace(-999, 0)

# df_train_MV_replaced_0.describe()
# df_test_MV_replaced_0.describe()

In [14]:
# Replace MV with avg in train data
DER_mass_MMC_nonMV_avg           = df_train[df_train.DER_mass_MMC           != -999]['DER_mass_MMC'          ].mean()
DER_deltaeta_jet_jet_nonMV_avg   = df_train[df_train.DER_deltaeta_jet_jet   != -999]['DER_deltaeta_jet_jet'  ].mean()
DER_lep_eta_centrality_nonMV_avg = df_train[df_train.DER_lep_eta_centrality != -999]['DER_lep_eta_centrality'].mean()
DER_mass_jet_jet_nonMV_avg       = df_train[df_train.DER_mass_jet_jet       != -999]['DER_mass_jet_jet'      ].mean()
DER_prodeta_jet_jet_nonMV_avg    = df_train[df_train.DER_prodeta_jet_jet    != -999]['DER_prodeta_jet_jet'   ].mean()
PRI_jet_leading_pt_nonMV_avg     = df_train[df_train.PRI_jet_leading_pt     != -999]['PRI_jet_leading_pt'    ].mean()
PRI_jet_leading_eta_nonMV_avg    = df_train[df_train.PRI_jet_leading_eta    != -999]['PRI_jet_leading_eta'   ].mean()
PRI_jet_leading_phi_nonMV_avg    = df_train[df_train.PRI_jet_leading_phi    != -999]['PRI_jet_leading_phi'   ].mean()
PRI_jet_subleading_pt_nonMV_avg  = df_train[df_train.PRI_jet_subleading_pt  != -999]['PRI_jet_subleading_pt' ].mean()
PRI_jet_subleading_eta_nonMV_avg = df_train[df_train.PRI_jet_subleading_eta != -999]['PRI_jet_subleading_eta'].mean() 
PRI_jet_subleading_phi_nonMV_avg = df_train[df_train.PRI_jet_subleading_phi != -999]['PRI_jet_subleading_phi'].mean()

df_train_MV_replaced_avg = df_train.copy()         # IMPORTANT! .copy()
df_train_MV_replaced_avg['DER_mass_MMC']           = df_train_MV_replaced_avg['DER_mass_MMC'          ].replace(-999,DER_mass_MMC_nonMV_avg)
df_train_MV_replaced_avg['DER_deltaeta_jet_jet']   = df_train_MV_replaced_avg['DER_deltaeta_jet_jet'  ].replace(-999,DER_deltaeta_jet_jet_nonMV_avg)
df_train_MV_replaced_avg['DER_lep_eta_centrality'] = df_train_MV_replaced_avg['DER_lep_eta_centrality'].replace(-999,DER_lep_eta_centrality_nonMV_avg)
df_train_MV_replaced_avg['DER_mass_jet_jet']       = df_train_MV_replaced_avg['DER_mass_jet_jet'      ].replace(-999,DER_mass_jet_jet_nonMV_avg)
df_train_MV_replaced_avg['DER_prodeta_jet_jet']    = df_train_MV_replaced_avg['DER_prodeta_jet_jet'   ].replace(-999,DER_prodeta_jet_jet_nonMV_avg)
df_train_MV_replaced_avg['PRI_jet_leading_pt']     = df_train_MV_replaced_avg['PRI_jet_leading_pt'    ].replace(-999,PRI_jet_leading_pt_nonMV_avg)
df_train_MV_replaced_avg['PRI_jet_leading_eta']    = df_train_MV_replaced_avg['PRI_jet_leading_eta'   ].replace(-999,PRI_jet_leading_eta_nonMV_avg)
df_train_MV_replaced_avg['PRI_jet_leading_phi']    = df_train_MV_replaced_avg['PRI_jet_leading_phi'   ].replace(-999,PRI_jet_leading_phi_nonMV_avg)
df_train_MV_replaced_avg['PRI_jet_subleading_pt']  = df_train_MV_replaced_avg['PRI_jet_subleading_pt' ].replace(-999,PRI_jet_subleading_pt_nonMV_avg)
df_train_MV_replaced_avg['PRI_jet_subleading_eta'] = df_train_MV_replaced_avg['PRI_jet_subleading_eta'].replace(-999,PRI_jet_subleading_eta_nonMV_avg)
df_train_MV_replaced_avg['PRI_jet_subleading_phi'] = df_train_MV_replaced_avg['PRI_jet_subleading_phi'].replace(-999,PRI_jet_subleading_phi_nonMV_avg)

In [15]:
# This function generalizes the behaviour of the cell above
# WARNING! Modifies the input parameter df
def replace_MV_by_average(df, features, MV_value):
    for feature in features:
        df[feature] = df[feature].replace(MV_value, (df[df[feature] != MV_value][feature].mean()))

In [16]:
# Replace MV with avg in test data (1, using the new function)
df_test_MV_replaced_avg = df_test.copy()
replace_MV_by_average(df_test_MV_replaced_avg, features_with_MV, -999)

In [17]:
# Replace MV with avg in test data (2, ad hoc)
DER_mass_MMC_nonMV_avg           = df_test[df_test.DER_mass_MMC           != -999]['DER_mass_MMC'          ].mean()
DER_deltaeta_jet_jet_nonMV_avg   = df_test[df_test.DER_deltaeta_jet_jet   != -999]['DER_deltaeta_jet_jet'  ].mean()
DER_lep_eta_centrality_nonMV_avg = df_test[df_test.DER_lep_eta_centrality != -999]['DER_lep_eta_centrality'].mean()
DER_mass_jet_jet_nonMV_avg       = df_test[df_test.DER_mass_jet_jet       != -999]['DER_mass_jet_jet'      ].mean()
DER_prodeta_jet_jet_nonMV_avg    = df_test[df_test.DER_prodeta_jet_jet    != -999]['DER_prodeta_jet_jet'   ].mean()
PRI_jet_leading_pt_nonMV_avg     = df_test[df_test.PRI_jet_leading_pt     != -999]['PRI_jet_leading_pt'    ].mean()
PRI_jet_leading_eta_nonMV_avg    = df_test[df_test.PRI_jet_leading_eta    != -999]['PRI_jet_leading_eta'   ].mean()
PRI_jet_leading_phi_nonMV_avg    = df_test[df_test.PRI_jet_leading_phi    != -999]['PRI_jet_leading_phi'   ].mean()
PRI_jet_subleading_pt_nonMV_avg  = df_test[df_test.PRI_jet_subleading_pt  != -999]['PRI_jet_subleading_pt' ].mean()
PRI_jet_subleading_eta_nonMV_avg = df_test[df_test.PRI_jet_subleading_eta != -999]['PRI_jet_subleading_eta'].mean() 
PRI_jet_subleading_phi_nonMV_avg = df_test[df_test.PRI_jet_subleading_phi != -999]['PRI_jet_subleading_phi'].mean()

df_test_MV_replaced_avg1 = df_test.copy()          # IMPORTANT! .copy()
df_test_MV_replaced_avg1['DER_mass_MMC']           = df_test_MV_replaced_avg1['DER_mass_MMC'          ].replace(-999,DER_mass_MMC_nonMV_avg)
df_test_MV_replaced_avg1['DER_deltaeta_jet_jet']   = df_test_MV_replaced_avg1['DER_deltaeta_jet_jet'  ].replace(-999,DER_deltaeta_jet_jet_nonMV_avg)
df_test_MV_replaced_avg1['DER_lep_eta_centrality'] = df_test_MV_replaced_avg1['DER_lep_eta_centrality'].replace(-999,DER_lep_eta_centrality_nonMV_avg)
df_test_MV_replaced_avg1['DER_mass_jet_jet']       = df_test_MV_replaced_avg1['DER_mass_jet_jet'      ].replace(-999,DER_mass_jet_jet_nonMV_avg)
df_test_MV_replaced_avg1['DER_prodeta_jet_jet']    = df_test_MV_replaced_avg1['DER_prodeta_jet_jet'   ].replace(-999,DER_prodeta_jet_jet_nonMV_avg)
df_test_MV_replaced_avg1['PRI_jet_leading_pt']     = df_test_MV_replaced_avg1['PRI_jet_leading_pt'    ].replace(-999,PRI_jet_leading_pt_nonMV_avg)
df_test_MV_replaced_avg1['PRI_jet_leading_eta']    = df_test_MV_replaced_avg1['PRI_jet_leading_eta'   ].replace(-999,PRI_jet_leading_eta_nonMV_avg)
df_test_MV_replaced_avg1['PRI_jet_leading_phi']    = df_test_MV_replaced_avg1['PRI_jet_leading_phi'   ].replace(-999,PRI_jet_leading_phi_nonMV_avg)
df_test_MV_replaced_avg1['PRI_jet_subleading_pt']  = df_test_MV_replaced_avg1['PRI_jet_subleading_pt' ].replace(-999,PRI_jet_subleading_pt_nonMV_avg)
df_test_MV_replaced_avg1['PRI_jet_subleading_eta'] = df_test_MV_replaced_avg1['PRI_jet_subleading_eta'].replace(-999,PRI_jet_subleading_eta_nonMV_avg)
df_test_MV_replaced_avg1['PRI_jet_subleading_phi'] = df_test_MV_replaced_avg1['PRI_jet_subleading_phi'].replace(-999,PRI_jet_subleading_phi_nonMV_avg)

In [18]:
# df_train_MV_replaced_avg.describe()
# df_test_MV_replaced_avg.describe()

# To check function 'replace_MV_by_average' correctness
# df_test_MV_replaced_avg.describe() - df_test_MV_replaced_avg1.describe()

#### 1.3 Combination of both

We have features with a lot of MV and others with just a few. It makes sense, then, to keep the features with few MV and remove those that are likely to add more noise than information. In other words, removing every feature with at least one MV is a really drastic approach that loses a lot of information, but replacing MV with the average of the valid measurements in features with more than 70% of MV mostly adds noise.

In this section we are trying to obtain a more intelligent approach for data preprocessing based on these ideas.

In [19]:
# Get for every feature with MV their frequency in train and test data
train_MV_freqs = []
test_MV_freqs  = []
N_train = df_train.shape[0]
N_tests = df_test.shape[ 0]

for feature in features_with_MV:
    train_MV_freqs.append((df_train[df_train[feature]==-999]).shape[0]/N_train)
    test_MV_freqs.append( (df_test[ df_test[ feature]==-999]).shape[0]/N_tests)

features_with_MV_freq = pd.DataFrame({'MV_freq_train':train_MV_freqs, 'MV_freq_test':test_MV_freqs},features_with_MV)
features_with_MV_freq

Unnamed: 0,MV_freq_train,MV_freq_test
DER_mass_MMC,0.152456,0.152204
DER_deltaeta_jet_jet,0.709828,0.708851
DER_lep_eta_centrality,0.709828,0.708851
DER_mass_jet_jet,0.709828,0.708851
DER_prodeta_jet_jet,0.709828,0.708851
PRI_jet_leading_pt,0.399652,0.400286
PRI_jet_leading_eta,0.399652,0.400286
PRI_jet_leading_phi,0.399652,0.400286
PRI_jet_subleading_pt,0.709828,0.708851
PRI_jet_subleading_eta,0.709828,0.708851


Looking at the frequencies of the missing values in the different features, we can decide which features are more reasonable to keep. We have to be aware, however, that knowing only the frequency of the MV is not enough to make a fully well-founded decision: there could be underlying relations between the desired prediction and one or more of these features. Still, a priori it is reasonable to suppose that the huge amount of noise in these features is penalizing our model.
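
The split into "high" and "reasonable" MV frequency can also be derived from the data rather than typed by hand. The following is a sketch on an invented toy frame; the 0.5 threshold is an assumption chosen for illustration, not a value stated in this notebook.

```python
import pandas as pd

MV = -999
toy = pd.DataFrame({
    'DER_mass_MMC':         [120.0, MV, 95.0, 130.0, 110.0],
    'DER_deltaeta_jet_jet': [MV, MV, 2.5, MV, MV],
    'PRI_tau_pt':           [30.0, 25.0, 41.0, 22.0, 35.0],
})

# Fraction of missing entries per column; keep only columns that have any
mv_freq = (toy == MV).mean()
mv_freq = mv_freq[mv_freq > 0]

threshold = 0.5  # hypothetical cut-off between "drop" and "impute"
high_mv_features = mv_freq[mv_freq > threshold].index.tolist()
reasonable_mv_features = mv_freq[mv_freq <= threshold].index.tolist()

print(high_mv_features, reasonable_mv_features)
```

Applied to the real data, the same computation run on `df_train` and `df_test` separately would also make it easy to confirm that both sets agree on which features fall on each side of the threshold.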

In [20]:
features_with_high_MV_freq       = ['DER_deltaeta_jet_jet','DER_lep_eta_centrality','DER_mass_jet_jet','DER_prodeta_jet_jet','PRI_jet_subleading_pt','PRI_jet_subleading_eta','PRI_jet_subleading_phi']
features_with_reasonable_MV_freq = ['DER_mass_MMC','PRI_jet_leading_pt','PRI_jet_leading_eta','PRI_jet_leading_phi']

In [21]:
df_train_MV_cleaned = df_train.copy()
df_train_MV_cleaned = df_train_MV_cleaned.drop(features_with_high_MV_freq, axis=1)
replace_MV_by_average(df_train_MV_cleaned, features_with_reasonable_MV_freq, -999)

df_test_MV_cleaned = df_test.copy()
df_test_MV_cleaned = df_test_MV_cleaned.drop(features_with_high_MV_freq, axis=1)
replace_MV_by_average(df_test_MV_cleaned, features_with_reasonable_MV_freq, -999)

In [22]:
# df_train_MV_cleaned.describe()
# df_test_MV_cleaned.describe()

## 2.- Outliers

As we said, a linear regression model is determined by w, the coefficient vector that minimizes the loss, that is, the mean squared error over all points. Having outliers leads us to a w that also tries to fit these points (and, since we are using the **squared** error, gives them considerable weight), even at the cost of reducing the precision on the more correctly measured, more common data points.

In this section we'll try to remove outliers from our training dataset so that we can find a w that fits our data better.
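
To see how strongly a single outlier can pull a least squares fit, here is a small sketch on synthetic data (the arrays are invented for illustration). It fits the same line twice with `np.linalg.lstsq`, once on clean data and once after corrupting one measurement.

```python
import numpy as np

# Clean data lying exactly on y = 1 + 2 * x
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
tx = np.column_stack([np.ones_like(x), x])  # bias column plus the feature

w_clean, *_ = np.linalg.lstsq(tx, y, rcond=None)

# Corrupt a single measurement at the edge of the range
y_out = y.copy()
y_out[-1] += 100.0
w_out, *_ = np.linalg.lstsq(tx, y_out, rcond=None)

print(w_clean)  # intercept and slope of the clean fit
print(w_out)    # the slope is pulled strongly towards the outlier
```

One corrupted point out of ten is enough to change the slope by several units here, which degrades the fit on the nine remaining, correct points.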

In [23]:
df_train_MV_cleaned.describe().drop(['count'])

Unnamed: 0,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltar_tau_lep,DER_pt_tot,DER_sum_pt,DER_pt_ratio_lep_tau,DER_met_phi_centrality,PRI_tau_pt,...,PRI_lep_eta,PRI_lep_phi,PRI_met,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_all_pt
mean,121.858528,49.239819,81.181982,57.895962,2.3731,18.917332,158.432217,1.437609,-0.128305,38.707419,...,-0.019507,0.043543,41.717235,-0.010119,209.797178,0.979176,84.822105,-0.003275,-0.012393,73.064591
std,52.749898,35.344886,40.828691,63.655682,0.782911,22.273494,115.706115,0.844743,1.193585,22.412081,...,1.264982,1.816611,32.894693,1.812223,126.499506,0.977426,47.002359,1.382702,1.405048,98.015662
min,9.044,0.0,6.329,0.0,0.208,0.0,46.104,0.047,-1.414,20.0,...,-2.505,-3.142,0.109,-3.142,13.678,0.0,30.0,-4.499,-3.142,0.0
25%,95.665,19.241,59.38875,14.06875,1.81,2.841,77.55,0.883,-1.371,24.59175,...,-1.014,-1.522,21.398,-1.575,123.0175,0.0,57.439,-0.433,-0.556,0.0
50%,119.958,46.524,73.752,38.4675,2.4915,12.3155,120.6645,1.28,-0.356,31.804,...,-0.045,0.086,34.802,-0.024,179.739,1.0,84.822105,-0.003275,-0.012393,40.5125
75%,130.60625,73.598,92.259,79.169,2.961,27.591,200.47825,1.777,1.225,45.017,...,0.959,1.618,51.895,1.561,263.37925,2.0,84.822105,0.433,0.503,109.93375
max,1192.026,690.075,1349.351,2834.999,5.684,2834.999,1852.462,19.773,1.414,764.408,...,2.503,3.142,2842.617,3.142,2003.976,3.0,1120.573,4.499,3.141,1633.433


A first analysis can be done by checking, for each feature, the difference between the minimum and the 25% quantile, and between the maximum and the 75% quantile. This difference should be analyzed in relation to the standard deviation.

For example, look at the variable DER_pt_tot: the difference between the maximum value and the 75th percentile is ~2800, more than 100 times the standard deviation.
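
This check can be computed for all features at once from the output of `describe()`. The sketch below runs on an invented toy frame; the names `upper_gap` and `lower_gap` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'well_behaved': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'heavy_tail':   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 800.0],
})

summary = toy.describe()

# Distance from the quartiles to the extremes, in units of standard deviation
upper_gap = (summary.loc['max'] - summary.loc['75%']) / summary.loc['std']
lower_gap = (summary.loc['25%'] - summary.loc['min']) / summary.loc['std']

print(upper_gap.sort_values(ascending=False))
print(lower_gap.sort_values(ascending=False))
```

Sorting the ratios puts the features with the longest tails first, which are the natural candidates for outlier handling.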

#### 2.1- Remove outliers

In [24]:
from scipy import stats

In [25]:
z_train          = np.abs(stats.zscore(df_train_MV_cleaned.drop(['Prediction'], axis=1)))
df_train_cleaned = df_train_MV_cleaned[(z_train < 4).all(axis=1)]
df_train_cleaned.describe()

# In case we want to analyze outliers in test dataset
#z_test           = np.abs(stats.zscore(df_test_MV_cleaned.drop(['Prediction'], axis=1)))
#df_test_cleaned  = df_test_MV_cleaned[(z_test < 3).all(axis=1)]
#df_test_cleaned.describe()

Unnamed: 0,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltar_tau_lep,DER_pt_tot,DER_sum_pt,DER_pt_ratio_lep_tau,DER_met_phi_centrality,PRI_tau_pt,...,PRI_lep_eta,PRI_lep_phi,PRI_met,PRI_met_phi,PRI_met_sumet,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_all_pt
count,237376.0,237376.0,237376.0,237376.0,237376.0,237376.0,237376.0,237376.0,237376.0,237376.0,...,237376.0,237376.0,237376.0,237376.0,237376.0,237376.0,237376.0,237376.0,237376.0,237376.0
mean,117.834739,48.431308,78.210731,51.498414,2.387252,17.55051,144.549141,1.399114,-0.154108,37.064706,...,-0.020969,0.044199,38.825513,-0.011071,196.65044,0.9287,79.78236,-0.002967,-0.013894,62.862284
std,38.209407,32.67829,30.792343,50.925703,0.738511,18.026591,89.311476,0.705681,1.194827,17.280008,...,1.265165,1.815857,25.49306,1.812351,103.666211,0.951625,33.739601,1.3813,1.386539,78.120344
min,9.044,0.0,6.329,0.0,0.208,0.0,46.104,0.203,-1.414,20.0,...,-2.505,-3.142,0.109,-3.142,13.678,0.0,30.0,-4.499,-3.142,0.0
25%,95.244,19.43975,59.37975,11.83925,1.857,2.76,76.063,0.89,-1.374,24.513,...,-1.018,-1.521,20.95,-1.577,120.67475,0.0,56.322,-0.38,-0.488,0.0
50%,119.1065,46.5985,73.3735,36.46,2.505,11.096,115.712,1.276,-0.4725,31.481,...,-0.047,0.088,33.966,-0.025,174.808,1.0,84.822105,-0.003275,-0.012393,37.9545
75%,129.46625,73.148,91.037,73.32,2.958,27.01225,187.099,1.753,1.22,44.021,...,0.959,1.617,50.12,1.559,251.34325,2.0,84.822105,0.377,0.421,100.12925
max,332.838,190.471,244.387,312.113,5.331,108.004,621.165,4.816,1.414,128.319,...,2.503,3.142,173.242,3.142,712.239,3.0,272.812,4.499,3.141,464.95


A better approach we could consider is to first remove outliers in a more restrictive way (removing less data) and then "clean" the remaining outliers in each variable by replacing them with a biased but less extreme value.
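
The "cleaning" step could be sketched as clipping each feature to a band around its mean. This is only one possible reading of the idea above: the helper `clip_outliers`, the choice of mean plus or minus `k` standard deviations, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def clip_outliers(df, features, k=3.0):
    """Return a copy of df where each feature is clipped to mean +/- k * std."""
    out = df.copy()
    for feature in features:
        mean, std = out[feature].mean(), out[feature].std()
        out[feature] = out[feature].clip(lower=mean - k * std, upper=mean + k * std)
    return out

# Nineteen ordinary values and one extreme one
toy = pd.DataFrame({'x': [10.0] * 19 + [1000.0]})
clipped = clip_outliers(toy, ['x'], k=3.0)

print(toy['x'].max(), clipped['x'].max())
```

Unlike row removal, clipping keeps every data point, so the other features of an affected row still contribute to the fit.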

## 3.- Increase regression order

Linear regression using only the given features has given us a maximum accuracy of 75%. In this section we create artificial features from the original data so that we can build a more complex model that fits our data better.
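
As a compact illustration of the idea, squared features can be generated in a loop. The helper `add_squared_features` and the toy frame are assumptions for illustration; the `aux_i_i` naming mirrors the one used in this notebook, but the sketch squares every feature it is given.

```python
import pandas as pd

def add_squared_features(df, features):
    """Return a copy of df with one extra column aux_i_i = feature_i ** 2 per feature."""
    out = df.copy()
    for i, feature in enumerate(features):
        out[f'aux_{i}_{i}'] = out[feature] ** 2
    return out

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [-1.0, 0.5, 2.0]})
extended = add_squared_features(toy, ['a', 'b'])

print(extended.columns.tolist())
```

The model stays linear in w, so least squares still applies unchanged; only the design matrix grows.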

In [26]:
df_train_extended = df_train_cleaned.copy()
df_test_extended  = df_test_MV_cleaned.copy()

features = list(df_train_extended.columns)[1:]

In [27]:
# This function should generalize the behaviour of the cells below but it needs to be fixed!
#
#for i in range(len(features)):
#    for j in range(i,len(features)):
#        df_train_extended = df_train_extended.assign(aux=lambda x: (x[features[i]]*x[features[j]]))

In [28]:
df_train_extended = df_train_extended.assign(aux_1_1  =lambda x: (x[features[1]]**2))
df_train_extended = df_train_extended.assign(aux_2_2  =lambda x: (x[features[2]]**2))
df_train_extended = df_train_extended.assign(aux_3_3  =lambda x: (x[features[3]]**2))
df_train_extended = df_train_extended.assign(aux_4_4  =lambda x: (x[features[4]]**2))
df_train_extended = df_train_extended.assign(aux_5_5  =lambda x: (x[features[5]]**2))
df_train_extended = df_train_extended.assign(aux_6_6  =lambda x: (x[features[6]]**2))
df_train_extended = df_train_extended.assign(aux_7_7  =lambda x: (x[features[7]]**2))
df_train_extended = df_train_extended.assign(aux_8_8  =lambda x: (x[features[8]]**2))
df_train_extended = df_train_extended.assign(aux_9_9  =lambda x: (x[features[9]]**2))
df_train_extended = df_train_extended.assign(aux_10_10=lambda x: (x[features[10]]**2))
df_train_extended = df_train_extended.assign(aux_11_11=lambda x: (x[features[11]]**2))
df_train_extended = df_train_extended.assign(aux_12_12=lambda x: (x[features[12]]**2))
df_train_extended = df_train_extended.assign(aux_13_13=lambda x: (x[features[13]]**2))
df_train_extended = df_train_extended.assign(aux_14_14=lambda x: (x[features[14]]**2))
df_train_extended = df_train_extended.assign(aux_15_15=lambda x: (x[features[15]]**2))
df_train_extended = df_train_extended.assign(aux_16_16=lambda x: (x[features[16]]**2))
df_train_extended = df_train_extended.assign(aux_17_17=lambda x: (x[features[17]]**2))
df_train_extended = df_train_extended.assign(aux_18_18=lambda x: (x[features[18]]**2))
df_train_extended = df_train_extended.assign(aux_19_19=lambda x: (x[features[19]]**2))
df_train_extended = df_train_extended.assign(aux_20_20=lambda x: (x[features[20]]**2))
df_train_extended = df_train_extended.assign(aux_21_21=lambda x: (x[features[21]]**2))
df_train_extended = df_train_extended.assign(aux_22_22=lambda x: (x[features[22]]**2))

In [29]:
df_test_extended = df_test_extended.assign(aux_1_1  =lambda x: (x[features[1]]**2))
df_test_extended = df_test_extended.assign(aux_2_2  =lambda x: (x[features[2]]**2))
df_test_extended = df_test_extended.assign(aux_3_3  =lambda x: (x[features[3]]**2))
df_test_extended = df_test_extended.assign(aux_4_4  =lambda x: (x[features[4]]**2))
df_test_extended = df_test_extended.assign(aux_5_5  =lambda x: (x[features[5]]**2))
df_test_extended = df_test_extended.assign(aux_6_6  =lambda x: (x[features[6]]**2))
df_test_extended = df_test_extended.assign(aux_7_7  =lambda x: (x[features[7]]**2))
df_test_extended = df_test_extended.assign(aux_8_8  =lambda x: (x[features[8]]**2))
df_test_extended = df_test_extended.assign(aux_9_9  =lambda x: (x[features[9]]**2))
df_test_extended = df_test_extended.assign(aux_10_10=lambda x: (x[features[10]]**2))
df_test_extended = df_test_extended.assign(aux_11_11=lambda x: (x[features[11]]**2))
df_test_extended = df_test_extended.assign(aux_12_12=lambda x: (x[features[12]]**2))
df_test_extended = df_test_extended.assign(aux_13_13=lambda x: (x[features[13]]**2))
df_test_extended = df_test_extended.assign(aux_14_14=lambda x: (x[features[14]]**2))
df_test_extended = df_test_extended.assign(aux_15_15=lambda x: (x[features[15]]**2))
df_test_extended = df_test_extended.assign(aux_16_16=lambda x: (x[features[16]]**2))
df_test_extended = df_test_extended.assign(aux_17_17=lambda x: (x[features[17]]**2))
df_test_extended = df_test_extended.assign(aux_18_18=lambda x: (x[features[18]]**2))
df_test_extended = df_test_extended.assign(aux_19_19=lambda x: (x[features[19]]**2))
df_test_extended = df_test_extended.assign(aux_20_20=lambda x: (x[features[20]]**2))
df_test_extended = df_test_extended.assign(aux_21_21=lambda x: (x[features[21]]**2))
df_test_extended = df_test_extended.assign(aux_22_22=lambda x: (x[features[22]]**2))

## 4.- Predictor

In [35]:
from proj1_helpers import *
from implementations import *

#### 4.1- Train, generate model

In [36]:
df = df_train_extended

y  = np.array([1 if x=='s' else -1 for x in df['Prediction'].values]) # Labels 's' and 'b' to 1 and -1
tx = df.drop(['Prediction'], axis=1).values

In [37]:
w, loss = least_squares(y, tx)
loss

0.31734953607527555

#### 4.2- Test, generate predictions

In [38]:
df = df_test_extended

tx_test = df.drop(['Prediction'], axis=1).values
ids_test= df.index

In [39]:
y_pred  = predict_labels(w, tx_test)
create_csv_submission(ids_test, y_pred, OUTPUT_PATH)