# Task 1: Data preprocessing
#### Data Analytics

### Introduction

The dataset named *task1_data* (*task1_data.csv*) has around 130K samples and 65 features.

The main objective of the task is to preprocess the training data, designing a complete preprocessing scheme, and to test that scheme on the test data.

Take into account that this is not a toy dataset, so its size may matter.

The function "automatic_scoring" provides a way to compare different schemes using a classifier, by means of 10-Fold CV with AUC as the metric. You will need to set the right seed as requested. Notice that the function only needs the inputs (X) and target (y) arrays. NOTE: AUC takes values in [0, 1]; higher is better.

If at any point you try several options, it is important to show the results of the discarded trials as well, because what is not visible cannot be evaluated.

The function "automatic_testing" trains the model on the train data and applies it to the test data. Do not change the classification algorithm, its parameters or the scoring choice. Those are fixed, and their optimization (model selection) is out of the scope of this task.

The deliverable of this task is this Jupyter Notebook containing the code, plus some short answers in markdown cells if required. All the cells in the notebook should be run. You should also upload the downloaded html and pdf formats (*File > Export > Files or selection to HTML... (.html)*)

NOTE: Keep in mind that some functions accept both Pandas dataframes and Numpy arrays, while others accept only one of them. You should therefore know how to convert from one to the other and vice versa.
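
As an illustration of the conversion mentioned in the note, here is a minimal sketch. The toy frame and the names `df`, `arr`, `df_back` and `y` are hypothetical and only stand in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
df = pd.DataFrame({"X1": [1.0, 2.0, 3.0], "X2": [4.0, 5.0, 6.0]})

# DataFrame -> NumPy array (column names and index are dropped)
arr = df.to_numpy()

# NumPy array -> DataFrame (column names must be supplied again)
df_back = pd.DataFrame(arr, columns=df.columns)

# A single-column target frame becomes a 1-D array with ravel()
y = pd.DataFrame({"Y": [0, 1, 0]}).to_numpy().ravel()
```

Many scikit-learn estimators accept either type, but a 1-D target array is usually the safest choice for `y`.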

NOTE: Keep in mind that some functions will take some time to run. You can continue working on other cells during the run to avoid wasting time waiting.

NOTE: If you work in pairs, please add the name of your partner in brackets beside yours in the *Name & Surnames* field.


### Exercises:

* (i) Split the data into 4 parts, i.e. train inputs and target (xtr, ytr) and test inputs and target (xte and yte), in such a way that the proportion of the classes is kept constant in the train and test parts. The training set must be twice the size of the test set. ***The random seed must be kept throughout the task wherever it can be set***. [5%] <br><br><br>

* (ii) Check for missing values and outliers. If there are any, considering (a), (b) and (c) below, treat the data as you see fit, justifying your decisions. [20%] <br>
    - (a) Are there any missing values? If so, given the characteristics of the data, decide what to do and justify your answer. Modify your data accordingly if necessary.  <br>
    - (b) Are there any collective outliers? If so, given the characteristics of the data, decide what to do and justify your answer. Modify your data accordingly if necessary. <br>
    - (c) From now on, this is your basic data. Therefore, it is safe to overwrite the names of the data parts.
<br><br>

* (iii) The feature selection algorithm *SelectPercentile* (sklearn.feature_selection.SelectPercentile) uses different scores (f_classif, mutual_info_classif, chi2, f_regression, etc) in order to select the most relevant features. In the Scikit-Learn documentation
(https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile)
you have the function info and an example of use of *chi2* score. Use the feature selection method SelectPercentile with the *mutual_info_classif* score, and *percentile* parameter 20. [30%] <br>
    - (a) Which is the compression ratio you obtained? (Note: Compression ratio is the proportion of the original variables kept after the selection). <br>
    - (b) Compare the performance (with *automatic_scoring*) with and without feature selection with the right scheme and function. Is selecting those variables a good idea? Argue your response. <br>
    - (c) Regarding the answer to (b), get your current data in order to continue preprocessing.
<br><br>

* (iv) Apply principal component analysis to your data for compression, capturing at least 95% of the cumulative variance. How many extracted variables do you have? What reduction percentage would you get if you applied it? Compare the performance with that of your current non-compressed data. Would you use the PCA compression here? Act in accordance with your answer, and keep the data, overwriting the names. [15%] <br> 
<br>

* (v) Check the balance of your current dataset. What is its imbalance ratio? Discuss whether it makes sense to apply imbalanced data treatments, considering the size of the data and the performance you have obtained for the data you currently have after (iv). Act accordingly; you are free to choose the sampling method if you decide you need one.

__Hint__:  We can understand the imbalance ratio both as the number of times the majority class is bigger than the minority class, or the proportion of the samples that are from minority class. If imbalance ratio is 49 to 1, then it is equivalent to having 2% of minority class samples. [20%] <br>
<br>

* (vi) Once you are here, you have final preprocessed data using the definitive preprocessing scheme you have reasonably chosen. Check now the performance using the test data (with *automatic_testing*). Comment on the result you have obtained for test data compared to the one for train data in (v). [10%]

#### Auxiliary functions

In [107]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


seed = 44837 # Your student number, without letters or leading zeros, if there are any


def automatic_scoring(x, y):
    average_score = cross_val_score(estimator=RandomForestClassifier(n_estimators=100, random_state=seed), X=x, y=y, cv=5, scoring='roc_auc').mean()
    return average_score

In [108]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score


def automatic_testing(xtr, ytr, xte, yte):
    auc_score = roc_auc_score(yte, RandomForestClassifier(n_estimators=100, random_state=seed).fit(xtr, ytr).predict_proba(xte)[:,1])
    return auc_score

### Solution:

In [109]:
# (i)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import pandas as pd

# Read data
init_data = pd.read_csv('task1_data.csv')
init_data.describe()

# Split x and y
x = init_data.values[:,:-1]
y = init_data.values[:,-1]

# 2/3 train, 1/3 test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=seed)

# Convert to data frame
columnas = [f'X{i}' for i in range(1, 66)]  # 'X1', 'X2', ..., 'X65'

x_train_df = pd.DataFrame(x_train, columns=columnas)
y_train_df = pd.DataFrame(y_train, columns=['Y'])
x_test_df = pd.DataFrame(x_test, columns=columnas)
y_test_df = pd.DataFrame(y_test, columns=['Y'])

y_test_df.describe()

| | Y |
|---|---|
| count | 43632.0 |
| mean | 0.009305 |
| std | 0.096014 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.0 |
| max | 1.0 |
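
Exercise (i) asks for the class proportions to be kept constant across the train and test parts. One way to obtain this with `train_test_split` is its `stratify` parameter. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo` and the seed value 0 are assumptions made for illustration, not the task data or the requested seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3000 rows, 1% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3000, 5))
y_demo = np.zeros(3000, dtype=int)
y_demo[:30] = 1

# stratify=y_demo keeps the class proportions equal in both parts;
# test_size=1/3 makes the training part twice as large as the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, stratify=y_demo, random_state=0
)

print('Train positive rate:', y_tr.mean())
print('Test positive rate: ', y_te.mean())
```

With a minority class this small, an unstratified split can leave the two parts with noticeably different positive rates, which makes train and test AUC harder to compare.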


In [110]:
from sklearn.impute import SimpleImputer
# (ii)
# a) Many columns contain at most one NaN. I replace each NaN with the column mean, because a single row is unlikely to alter the result.
x_train_df.describe()
x_train_df.isna().sum().sum()

imp = SimpleImputer(strategy='mean')
x_train_df_nan = pd.DataFrame(data=imp.fit_transform(x_train_df),columns=columnas)
y_train_df_nan = y_train_df
x_train_df_nan.describe()


| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | ... | X56 | X57 | X58 | X59 | X60 | X61 | X62 | X63 | X64 | X65 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 | ... | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 | 87263.0 |
| mean | 61.176479 | 26.493558 | 0.187255 | 1.814486 | 18.264829 | 0.00079 | 0.208647 | 1.112092 | -73.837701 | 768.048879 | ... | 0.338574 | 2.937087 | 1823.106158 | 0.02896 | 0.540669 | 0.319887 | -64.810811 | 472.46519 | 0.475946 | 0.260109 |
| std | 19.032381 | 4.529705 | 1.275754 | 32.549068 | 81.31391 | 1.028372 | 1.481605 | 21.4321 | 27.254139 | 525.730088 | ... | 17.965777 | 60.90677 | 1698.770121 | 1.181238 | 1.667215 | 9.728382 | 36.899961 | 406.227381 | 1.068878 | 0.187573 |
| min | 2.68 | 12.0 | -3.77 | -125.5 | -1082.0 | -6.12 | -2.86 | -83.0 | -1082.0 | -716.3 | ... | -133.0 | -319.0 | -668.0 | -7.78 | -10.0 | -63.0 | -322.0 | -509.2 | -10.0 | -0.55 |
| 25% | 47.89 | 23.57 | -0.57 | -17.5 | -13.0 | -0.65 | -0.6 | -9.5 | -86.0 | 389.5 | ... | -10.0 | -24.0 | 863.4 | -0.71 | -0.54 | -5.0 | -82.0 | 174.9 | -0.16 | 0.14 |
| 50% | 62.4 | 25.78 | 0.1 | 1.0 | 11.5 | 0.04 | 0.03 | 0.5 | -69.5 | 644.2 | ... | 0.0 | 0.0 | 1436.7 | 0.08 | 0.4 | 0.0 | -55.0 | 378.0 | 0.56 | 0.26 |
| 75% | 75.34 | 28.57 | 0.82 | 19.5 | 40.0 | 0.7 | 0.77 | 10.5 | -56.5 | 1019.7 | ... | 10.0 | 27.0 | 2335.05 | 0.82 | 1.46 | 5.0 | -38.0 | 669.0 | 1.21 | 0.39 |
| max | 100.0 | 100.0 | 50.38 | 1059.5 | 3355.0 | 5.99 | 72.28 | 973.5 | -23.0 | 4423.9 | ... | 171.0 | 2502.0 | 64129.4 | 5.57 | 18.85 | 146.0 | 0.0 | 4197.9 | 6.6 | 1.0 |


In [111]:
# b) Use the EllipticEnvelope class to detect the positions of the outliers so they can be deleted

from sklearn.covariance import EllipticEnvelope
elip_env = EllipticEnvelope().fit(x_train_df_nan)
detection = elip_env.predict(x_train_df_nan)  # -1 marks an outlier, 1 an inlier
outlier_positions_mah = [i for i in range(x_train_df_nan.shape[0]) if detection[i] == -1]
# Test the list for emptiness; `detection is []` is always False
if not outlier_positions_mah:
    print("There aren't outliers.")
else:
    print('Outlier positions: ', len(outlier_positions_mah))


Outlier positions:  8727
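
The count above, 8727 of 87263 rows, is almost exactly 10%. `EllipticEnvelope` has a `contamination` parameter with a default of 0.1, which sets the fraction of training rows it flags, so a 10% result may reflect that default rather than the data. The sketch below illustrates this on clean synthetic Gaussian data; `X_demo` and the seed value 0 are assumptions made for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic Gaussian data with no injected outliers
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 4))

# Default contamination=0.1: roughly 10% of the rows are flagged anyway
labels_default = EllipticEnvelope(random_state=0).fit(X_demo).predict(X_demo)
flagged_default = np.mean(labels_default == -1)

# A smaller contamination flags proportionally fewer rows
labels_small = EllipticEnvelope(contamination=0.01, random_state=0).fit(X_demo).predict(X_demo)
flagged_small = np.mean(labels_small == -1)

print('Flagged with default contamination:', flagged_default)
print('Flagged with contamination=0.01:   ', flagged_small)
```

If the fraction of removed rows tracks `contamination` this closely, it is worth checking how many minority-class rows are among them before dropping anything.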


In [112]:
# c) New datasets x_train_free_data and y_train_free_data

# drop() without inplace returns new frames, so the imputed frames are not modified through an alias
x_train_free_data = x_train_df_nan.drop(x_train_df_nan.index[outlier_positions_mah])
y_train_free_data = y_train_df_nan.drop(y_train_df_nan.index[outlier_positions_mah])

x_train_free_data.describe() # 78536
y_train_free_data.describe() # 78536

| | Y |
|---|---|
| count | 78536.0 |
| mean | 0.004826 |
| std | 0.069301 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.0 |
| max | 1.0 |




In [None]:
# (iii)
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

x_train_free_data_ndarray = x_train_free_data.to_numpy()
y_train_free_data_ndarray = y_train_free_data.to_numpy()

per = SelectPercentile(mutual_info_classif, percentile=20)
x_train_new = per.fit_transform(x_train_free_data_ndarray, y_train_free_data_ndarray)
x_train_new.shape
y_train_new = y_train_free_data_ndarray

In [101]:
# a) Compression ratio
ratio = x_train_new.shape[1]/x_train_free_data_ndarray.shape[1]
print('Number of columns before: ', x_train_free_data_ndarray.shape[1]) # 65
print('Number of columns after: ', x_train_new.shape[1]) # 13
print(ratio)
print('Compression ratio: ', (x_train_new.shape[1]/x_train_free_data_ndarray.shape[1])*100, '%') # 20% compression

Number of columns before:  65
Number of columns after:  13
0.2
Compression ratio:  20.0 %


In [102]:
# b) Performance comparison using the automatic_scoring function

y_train_free_data_ndarray_flatten = y_train_free_data_ndarray.ravel()
print('Before compression: ', automatic_scoring(x_train_free_data_ndarray, y_train_free_data_ndarray_flatten)) # 0.95280259820858

Before compression:  0.95280259820858


In [103]:
y_train_new = y_train_free_data_ndarray
y_train_new_flatten = y_train_new.ravel()
print('After compression: ', automatic_scoring(x_train_new, y_train_new_flatten)) # 0.9241574332891569

# AUC is higher without feature selection (about 0.953 vs 0.924), but the difference is not huge; I think it is better with fewer variables (here from 65 to 13).

After compression:  0.9241574332891569
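
When feature selection is compared by cross-validation, one common way to keep the selector from seeing the validation rows is to put it in a pipeline together with the classifier, so that it is refitted inside every fold. The sketch below is only an illustration on synthetic data and not a replacement for `automatic_scoring`; `X_demo`, `y_demo`, the reduced `n_estimators=50` and the seed value 0 are assumptions. It also passes `random_state` to `mutual_info_classif`, whose estimate is otherwise randomized and can differ between runs.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Small synthetic problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=4, random_state=0
)

# Fix the seed of the mutual information estimator so scores are reproducible
mi_score = partial(mutual_info_classif, random_state=0)

# The selector is fitted on the training part of each fold only
pipe = make_pipeline(
    SelectPercentile(mi_score, percentile=20),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='roc_auc')
print('Mean AUC with selection inside CV:', scores.mean())
```

Selecting features on the full training set first and cross-validating afterwards tends to give an optimistic score, because the selection step has already used the target values of the validation rows.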


In [104]:
# (iv) Applying Principal Component Analysis

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)

pca.fit(x_train_free_data_ndarray)
X_reduced_raw = pca.transform(x_train_free_data_ndarray)
raw_pca_data = pd.DataFrame(data=X_reduced_raw)

print("There have been selected " + str(X_reduced_raw.shape[1]) + " principal components.") # 6 components

print('After PCA: ', automatic_scoring(X_reduced_raw, y_train_new_flatten)) # 0.6875731402686054

# Better without PCA: AUC drops from about 0.953 to about 0.688, so the uncompressed data is kept

There have been selected 6 principal components.
After PCA:  0.6875731402686054
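
Exercise (iv) also asks for the reduction percentage, which follows directly from the 6 components kept out of 65 features. In addition, the `describe()` output earlier shows feature standard deviations ranging from about 0.19 to about 1700, and PCA on unscaled data is dominated by the largest-variance features. The sketch below computes the percentage and illustrates the scaling effect on synthetic data; `X_demo` and the use of `StandardScaler` are assumptions for illustration, not a step of the original solution.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reduction percentage for 6 components kept out of 65 features
reduction_pct = (1 - 6 / 65) * 100
print('Reduction: %.1f %%' % reduction_pct)

# Ten independent synthetic features; the first has a far larger scale
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 10))
X_demo[:, 0] *= 1000.0

# Without scaling, the large-scale feature alone explains almost all variance
n_raw = PCA(n_components=0.95).fit(X_demo).n_components_

# After standardization every feature contributes comparably
X_scaled = StandardScaler().fit_transform(X_demo)
n_scaled = PCA(n_components=0.95).fit(X_scaled).n_components_

print('Components without scaling:', n_raw)
print('Components with scaling:   ', n_scaled)
```

Whether standardizing first would change the conclusion for this dataset can only be checked by rerunning the comparison on the real data.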


In [105]:
# (v) Balance

from collections import Counter  # For checking the imbalance ratio

RANDOM_STATE = seed

count = Counter(y_train_new_flatten)
print('Zeros: ', count[0.0], '\nOnes: ', count[1.0]) # 78157, 379
print('Minority percentage: ', (count[1.0]/(count[0.0]+count[1.0]))*100, '%') # 0.4825812366303351
print('Majority percentage: ', (count[0.0]/(count[1.0]+count[0.0]))*100, '%') # 99.51741876336968
# There is a big imbalance

Zeros:  78157 
Ones:  379
Minority percentage:  0.4825812366303351 %
Majority percentage:  99.51741876336968 %
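
The imbalance ratio in the first sense of the hint (majority size divided by minority size) follows from the counts printed above. If a sampling treatment is chosen, random undersampling of the majority class is one simple option. The sketch below uses NumPy only; the helper `random_undersample`, its `ratio` parameter and the toy arrays are hypothetical, and it assumes that label 1 is the minority class.

```python
import numpy as np

# Counts printed above
n_major, n_minor = 78157, 379
imbalance_ratio = n_major / n_minor                  # majority rows per minority row
minority_pct = 100 * n_minor / (n_major + n_minor)
print('Imbalance ratio: %.1f to 1' % imbalance_ratio)
print('Minority share: %.2f %%' % minority_pct)


def random_undersample(X, y, ratio=1.0, seed=0):
    """Keep every minority row (label 1) and a random subset of majority rows.

    ratio is the desired majority/minority size after sampling.
    """
    rng = np.random.default_rng(seed)
    minor_idx = np.flatnonzero(y == 1)
    major_idx = np.flatnonzero(y == 0)
    n_keep = min(len(major_idx), int(ratio * len(minor_idx)))
    kept_major = rng.choice(major_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([minor_idx, kept_major]))
    return X[keep], y[keep]


# Toy data: 95 majority rows and 5 minority rows
y_demo = np.array([0] * 95 + [1] * 5)
X_demo = np.arange(100).reshape(-1, 1)
X_bal, y_bal = random_undersample(X_demo, y_demo, ratio=2.0)
print('Rows after undersampling:', len(y_bal))
```

Any resampling should be applied to the training part only, so that the test data keeps the original class distribution.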


In [None]:
# (vi) Performance check

x_test_nan = pd.DataFrame(data=imp.transform(x_test_df),columns=columnas)
x_test_nan.describe()

# We had some problems with the size of the data, so for the test set we neither removed the outliers
# nor applied the percentile-based feature selection

'''
test_detection = elip_env.predict(x_test_nan)
outlier_positions_mah_test = [i for i in range(x_test_nan.shape[0]) if test_detection[i] == -1]
if not outlier_positions_mah_test:
    print("There aren't outliers.")
else:
    print('Outlier positions: ', len(outlier_positions_mah_test))
'''

x_test_free_data = x_test_nan
y_test_free_data = y_test_df

#x_test_free_data.describe()
#y_test_free_data.describe()

#x_test_free_data = x_test_free_data.drop(x_test_free_data.index[outlier_positions_mah_test])
#y_test_free_data = y_test_free_data.drop(y_test_free_data.index[outlier_positions_mah_test])

x_test_free_data_ndarray = x_test_free_data.to_numpy()
y_test_free_data_ndarray = y_test_free_data.to_numpy().ravel()  # 1-D target array

# Pass NumPy arrays for both train and test so the model is fitted and applied on the same input type
print(automatic_testing(x_train_free_data_ndarray, y_train_free_data_ndarray_flatten, x_test_free_data_ndarray, y_test_free_data_ndarray))
# 0.9736808591526858