# Task 2: Model selection
#### Data Analytics

### Introduction
The dataset named *task2_data* (*task2_data.csv*) has 515 samples and 8 features.

The main objective of the task is to **preprocess** the train data as indicated, perform a guided **train/validation model selection** step, and **test the final winning model on the test data**.

Whenever you try several options, show the results of the discarded trials as well, because what is not visible cannot be evaluated. If you leave cells that are not meant to be executed in order to document those trials, comment out their code so that a full run is still possible.

The deliverable for this task is this Jupyter Notebook containing the code, plus short answers in markdown cells where required. All the cells in the notebook should be run. You should also upload the exported HTML and PDF versions (*File > Export > Files or selection to HTML... (.html)*).

NOTE: Keep in mind that some functions accept both Pandas dataframes and Numpy arrays, while others accept only one of them. In any case, we should know how to convert from one to the other and vice versa.

NOTE: If you work in pairs, please add the name of your partner in brackets beside yours in the *Name & Surnames* field.

In [41]:
seed = 26

## Problem:

Consider the dataset *task2_data.csv*, which is a regression dataset that we will transform into a binary classification problem by binarizing the target.


* (a) We split the target into *high values* and *low values*. Binarize the target so that it is possible to determine whether a sample is high (*target* $>$ 4.0) or not (*target* $\leq$ 4.0), overwriting your dataframe. [5%]


* (b) Split the data into train, validation and test sets, so that the test and validation sets have the same size, each being one third of the train set size. Make sure that the proportion of the classes is the same in all parts. [5%]


**Preprocessing**
We will perform three preprocessing steps:
* (c.1) Scale the features to a [0, 1] range. [5%]
* (c.2) Perform a *recursive feature elimination* (RFE) in order to reduce the dimensionality of our data by at least 20%. You are free to choose the *estimator* you consider the right one (argue your choice), and you **do not need** to use cross validation (CV). [10%]
* (c.3) Check the imbalance degree of your data. If the imbalance is higher than 4 to 1, then reduce it to 2 to 1 using the ADASYN algorithm. [5%]


**Model selection**
We will perform a 2-step model selection strategy. First, we will select a promising family of models by comparing their default configurations. Then, once we have a preferred family, we will search for the best parameters we can in order to get a final winning model.
* (d.1) Consider the default *support vector machines* (SVM), *random forests* (RF), and *multilayer perceptron* (MLP) algorithms. Using *f1_macro* as score and 4-fold CV, determine which is the most promising family of algorithms. [15%]

In this task, the most relevant parameters for each family are:
- **SVM**:  *C* (unlimited options), *kernel* (5 options, but we ignore *linear* and *precomputed*), and *gamma* (2 options).
- **RF**: *n_estimators* (unlimited options), *min_samples_split* (unlimited options if using float), *max_features* (3 options if we ignore integers and floats, considering that *auto* and *sqrt* are the same).
- **MLP**: *hidden_layer_sizes* (unlimited in number of layers and neurons per layer), *solver* (use *'sgd'*), *learning_rate* (3 options), and *learning_rate_init* (unlimited options).


* (d.2.1) For the most promising family found in (d.1), taking into account the info above, we will consider all tunable parameters (remember that in *MLP*, the *solver* parameter must be *'sgd'*, so it is not tunable). For each of them, consider at least two options in such a way that the total number of possible models is at least 20. Once you have made your choices, exactly how many possible models could you have? [10%]

* (d.2.2) Use a train/validation strategy to check all models. The best of all will be the final winning model. Which are the best parameters? And the best score? [30%]


**Model validation**
We will obtain the test score for the winning model and comment about the results achieved.
* (e.1) Taking into account the final best parameters obtained in (d), train the final model with the right data. [5%]

* (e.2) Calculate the final test score. [5%]

* (e.3) Comparing this test score with the train/validation score obtained in (d.2.2) for that model, would you say that the winner model overfits? [5%]


__Solution:__

(a)

In [42]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


data = pd.read_csv('task2_data.csv')

# Change V6 column data type from int64 to float64
data['V6'] = data['V6'].astype('float64')

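# (a) We split the target into high values and low values. Binarize the target so that it is possible to determine whether a sample is high (target > 4.0) or not
# (target <= 4.0), overwriting the dataframe.

# Vectorized binarization: 1.0 if target > 4.0, else 0.0. Assigning through the
# column name overwrites the dataframe without chained-assignment warnings.
target_col = data.columns[-1]
data[target_col] = (data[target_col] > 4.0).astype('float64')

x = data.iloc[:, :-1]
y = data.iloc[:, -1]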


In [43]:
x.dtypes

V1    float64
V2    float64
V3    float64
V4    float64
V5    float64
V6    float64
V7    float64
V8    float64
dtype: object

In [44]:
# Share of the positive (minority) class in the whole dataset (for information purposes only)
imb_rate = sum(data.iloc[:, -1])/len(data.iloc[:, -1])
print('Imbalance rate: ',imb_rate*100,'%')

Imbalance rate:  4.466019417475728 %
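
The percentage above is the share of positive samples. Step (c.3) is phrased as a majority-to-minority ratio instead, so a small helper can make the comparison with the 4-to-1 threshold explicit. This is a minimal sketch: `imbalance_ratio` and `y_demo` are hypothetical names, and the 492/23 class counts are an assumption derived from the 4.466 % share of 515 samples, not read from the data file.

```python
from collections import Counter

import numpy as np


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Assumed labels: 23 positives out of 515 samples gives the 4.466 % share printed above.
y_demo = np.array([0] * 492 + [1] * 23)

ratio = imbalance_ratio(y_demo)
print(f"positive share: {y_demo.mean():.3%}")
print(f"majority to minority: {ratio:.1f} to 1")
print("above the 4-to-1 threshold:", ratio > 4)
```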


(b)

In [45]:
# (b) Split the data into train, validation and test sets, so that the test and validation sets have the same size, each being one third of the train set size. Make sure that
# the proportion of the classes is the same in all parts.

# Note: test_size=1/3 leaves 343 train samples against 86 validation and 86 test samples, so each held-out set is
# about one quarter of the train set. A held-out fraction of 0.4 would be needed for the requested one-third ratio.
xtr, xtv, ytr, ytv = train_test_split(x, y, test_size=1/3, random_state=seed, stratify=data.iloc[:,-1])

xval, xtest, yval, ytest = train_test_split(xtv, ytv, test_size=1/2, random_state=seed, stratify=ytv)

print(len(xtr))
print(len(xval))
print(len(xtest))

print('Sum of xval and xtest: ', len(xval)+len(xtest))
print('(xval + xtest) * 2 = ', len(xtv)*2)

343
86
86
Sum of xval and xtest:  172
(xval + xtest) * 2 =  344
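
If validation and test must each be one third of the train set, the three parts are in proportion 3:1:1, that is 60 % / 20 % / 20 %. The sketch below shows one way to obtain that with two stratified calls to `train_test_split`. It runs on synthetic data (`X_demo`, `y_demo` and the 500-sample size are assumptions chosen so the arithmetic is exact) and passes integer sizes to avoid rounding surprises.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 500 samples, 8 features, 5 % positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 8))
y_demo = np.array([0] * 475 + [1] * 25)

# Hold out 2/5 of the samples, then cut the held-out part in half.
n_holdout = 2 * (len(X_demo) // 5)
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=n_holdout, stratify=y_demo, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=n_holdout // 2, stratify=y_hold, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
print("positive share per part:", y_train.mean(), y_val.mean(), y_test.mean())
```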


In [46]:
xtest.describe()

| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 |
|---|---|---|---|---|---|---|---|---|
| count | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 | 86.0 |
| mean | 90.037209 | 110.068605 | 526.576744 | 8.563953 | 18.298837 | 45.116279 | 4.09186 | 0.02093 |
| std | 5.668241 | 70.369304 | 268.976847 | 4.546179 | 5.627485 | 18.70855 | 1.9563 | 0.156518 |
| min | 50.4 | 3.0 | 7.9 | 0.4 | 4.2 | 18.0 | 0.9 | 0.0 |
| 25% | 89.325 | 47.7 | 353.775 | 6.3 | 14.825 | 29.0 | 2.7 | 0.0 |
| 50% | 91.0 | 112.95 | 668.0 | 7.5 | 18.9 | 40.5 | 3.6 | 0.0 |
| 75% | 92.575 | 139.075 | 709.9 | 10.275 | 21.525 | 58.75 | 5.275 | 0.0 |
| max | 96.1 | 290.0 | 855.3 | 22.7 | 33.1 | 99.0 | 9.4 | 1.4 |


(c)

In [47]:
from sklearn import preprocessing
#(c.1) Scale the features to a [0, 1] range.

# Fit the min max scaler with the train dataset and transform it, and then only transform the test and validation datasets

xtr_values = xtr.values # numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(xtr_values)
xtr = pd.DataFrame(x_scaled, columns=xtr.columns)

xval_values = xval.values # numpy array
x_scaled = min_max_scaler.transform(xval_values)
xval = pd.DataFrame(x_scaled, columns=xval.columns)

xtest_values = xtest.values # numpy array
x_scaled = min_max_scaler.transform(xtest_values)
xtest = pd.DataFrame(x_scaled, columns=xtest.columns)
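
Because the scaler is fitted on the train set only, validation and test values are mapped with the train minimum and maximum, so they are not guaranteed to stay inside [0, 1]. The toy example below illustrates this; the arrays `train_demo` and `val_demo` are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One toy feature: the train range is [0, 10].
train_demo = np.array([[0.0], [5.0], [10.0]])
# Validation values that fall outside the train range.
val_demo = np.array([[-2.0], [12.0]])

scaler = MinMaxScaler().fit(train_demo)      # learn min and max from train only
train_scaled = scaler.transform(train_demo)
val_scaled = scaler.transform(val_demo)      # reuse the train min and max

print(train_scaled.ravel())
print(val_scaled.ravel())
```

Fitting on train only is still the right choice, since it keeps information from the held-out sets out of the preprocessing.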


In [48]:
xtest

| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.911215 | 0.399792 | 0.772098 | 0.531532 | 0.552632 | 0.231707 | 0.600000 | 0.00000 |
| 1 | 0.857477 | 0.129110 | 0.084132 | 0.261261 | 0.427632 | 0.146341 | 0.300000 | 0.00000 |
| 2 | 0.904206 | 0.295258 | 0.834931 | 0.409910 | 0.523026 | 0.475610 | 0.300000 | 0.00000 |
| 3 | 0.890187 | 0.442021 | 0.936694 | 0.319820 | 0.605263 | 0.268293 | 0.155556 | 0.00000 |
| 4 | 0.892523 | 0.619245 | 0.707017 | 0.324324 | 0.562500 | 0.560976 | 0.500000 | 0.00000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 81 | 0.871495 | 0.465559 | 0.955272 | 0.288288 | 0.351974 | 0.292683 | 0.255556 | 0.00000 |
| 82 | 0.878505 | 0.569401 | 0.872204 | 0.301802 | 0.621711 | 0.682927 | 0.800000 | 0.21875 |
| 83 | 0.922897 | 0.453098 | 0.809490 | 0.396396 | 0.565789 | 0.048780 | 0.100000 | 0.00000 |
| 84 | 0.913551 | 0.296296 | 0.808307 | 0.301802 | 0.595395 | 0.365854 | 0.300000 | 0.00000 |


In [49]:
# Outliers in xtrain
from sklearn.covariance import EllipticEnvelope
elip_env = EllipticEnvelope(random_state=seed).fit(xtr)

detection = elip_env.predict(xtr)
outlier_positions_mah = [x for x in range(xtr.shape[0]) if detection[x] == -1]
print("Outliers: " + str(len(outlier_positions_mah)))
print("From 0 (majority) class: " + str(sum(ytr.iloc[outlier_positions_mah] == 0)))
print("From 1 (minority) class: " + str(sum(ytr.iloc[outlier_positions_mah] == 1)))
outlier_positions_mah_major = [x for x in range(xtr.shape[0]) if (detection[x] == -1 and ytr.iloc[x] == 0)]
print(len(outlier_positions_mah_major) == sum(ytr.iloc[outlier_positions_mah] == 0))


Outliers: 35
From 0 (majority) class: 34
From 1 (minority) class: 1
True


In [50]:
# Outliers in xtest
detection_test = elip_env.predict(xtest)
outlier_positions_mah_test = [x for x in range(xtest.shape[0]) if detection_test[x] == -1]
# Total amount of outliers in test
print("Outliers: " + str(len(outlier_positions_mah_test)))
# Those from majority class (0.0)
print("From 0 (majority) class: " + str(sum(ytest.iloc[outlier_positions_mah_test] == 0)))
# and minority class (1.0)
print("From 1 (minority) class: " + str(sum(ytest.iloc[outlier_positions_mah_test] == 1)))
# Positions of the majority class outliers in test
outlier_positions_mah_major_test = [x for x in range(xtest.shape[0]) if (detection_test[x] == -1 and ytest.iloc[x] == 0)]
# Check
print(len(outlier_positions_mah_major_test) == sum(ytest.iloc[outlier_positions_mah_test] == 0))


Outliers: 13
From 0 (majority) class: 13
From 1 (minority) class: 0
True


In [51]:
# Outliers in xval
detection_val = elip_env.predict(xval)
outlier_positions_mah_val = [x for x in range(xval.shape[0]) if detection_val[x] == -1]
# Total amount of outliers in validation
print("Outliers: " + str(len(outlier_positions_mah_val)))
# Those from majority class (0.0)
print("From 0 (majority) class: " + str(sum(yval.iloc[outlier_positions_mah_val] == 0)))
# and minority class (1.0)
print("From 1 (minority) class: " + str(sum(yval.iloc[outlier_positions_mah_val] == 1)))
# Positions of the majority class outliers in validation
outlier_positions_mah_major_val = [x for x in range(xval.shape[0]) if (detection_val[x] == -1 and yval.iloc[x] == 0)]
# Check
print(len(outlier_positions_mah_major_val) == sum(yval.iloc[outlier_positions_mah_val] == 0))

Outliers: 11
From 0 (majority) class: 10
From 1 (minority) class: 1
True


In [52]:
# Majority class outliers deletion in train
xtr.drop(xtr.index[outlier_positions_mah_major], inplace=True)
ytr.drop(ytr.index[outlier_positions_mah_major], inplace=True)
# Majority class outliers deletion in validation
xval.drop(xval.index[outlier_positions_mah_major_val], inplace=True)
yval.drop(yval.index[outlier_positions_mah_major_val], inplace=True)
# Majority class outliers deletion test (only for informative purposes at the end)
# Note*: we are performing the preprocessing tasks on test at the same time for simplicity.
# Nevertheless, we will not use the test set for prediction until last section.
xtest.drop(xtest.index[outlier_positions_mah_major_test], inplace=True)
ytest.drop(ytest.index[outlier_positions_mah_major_test], inplace=True)
# Check after deleting the majority outliers
print([xtr.shape, ytr.shape, xtest.shape, ytest.shape, xval.shape, yval.shape])

[(309, 8), (309,), (73, 8), (73,), (76, 8), (76,)]


In [53]:
#(c.2) Perform a recursive feature elimination (RFE) in order to reduce the dimensionality of our data in at least 20%.
# You are free to choose the estimator you consider as the right one (argue your choice), and you do not need to use cross validation (CV)

from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

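# RFE needs an estimator that exposes coef_ or feature_importances_. We use a DecisionTreeClassifier: it provides feature_importances_,
# it is fast to refit on a dataset of this size, and it does not depend on feature scaling. Keeping 6 of the 8 features is a 25% reduction.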

selectorTrain = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=6, step=1)

xtr_np = xtr.to_numpy()
ytr_np = ytr.to_numpy()
xtest_np = xtest.to_numpy()
xval_np = xval.to_numpy()

selectorTrain.fit(xtr_np, ytr_np)

xtr_sel_features = selectorTrain.transform(xtr_np)
xtest_sel_features = selectorTrain.transform(xtest_np)
xval_sel_features = selectorTrain.transform(xval_np)

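# Label the reduced data with the names of the features RFE actually kept,
# instead of assuming that the first six columns were the ones selected
selected_cols = xtr.columns[selectorTrain.support_]

xtr = pd.DataFrame(xtr_sel_features, columns=selected_cols)
xtest = pd.DataFrame(xtest_sel_features, columns=selected_cols)
xval = pd.DataFrame(xval_sel_features, columns=selected_cols)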
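
RFE records which features survive in the boolean mask `support_` and their elimination order in `ranking_`. The sketch below, on synthetic data, shows how to read the kept and dropped column names from that mask. The data from `make_classification` and the names `X_demo`, `kept` and `dropped` are assumptions for illustration only; the features selected on the real data may differ.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same column names as the notebook.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"V{i}" for i in range(1, 9)])

selector = RFE(
    estimator=DecisionTreeClassifier(random_state=0),
    n_features_to_select=6,
    step=1,
)
selector.fit(X_demo, y_demo)

kept = X_demo.columns[selector.support_]       # features that survived
dropped = X_demo.columns[~selector.support_]   # features that were eliminated
X_reduced = pd.DataFrame(selector.transform(X_demo), columns=kept)

print("kept:", list(kept))
print("dropped:", list(dropped))
print("ranking:", dict(zip(X_demo.columns, selector.ranking_)))
```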


In [54]:
#(c.3) Check the imbalance degree of your data. If the imbalance is higher than 4 to 1, then reduce it to 2 to 1 using ADASYN algorithm. [5%]

from collections import Counter
from imblearn.over_sampling import ADASYN

# Imbalance rate before resample
imb_rate = sum(ytr)/len(ytr)
print('Imbalance rate train: ',imb_rate*100,'%')

print('Original dataset shape %s' % Counter(ytr))

# 294 majority samples against 15 minority samples is roughly 19.6 to 1, above the 4-to-1 threshold,
# so we oversample the minority class towards a 2-to-1 ratio (sampling_strategy=1/2)
ada_train = ADASYN(random_state=seed, sampling_strategy=1/2, n_neighbors=3)
xtr, ytr = ada_train.fit_resample(xtr, ytr)
print()

# Imbalance rate after resample
imb_rate = sum(ytr)/len(ytr)
print('Imbalance rate train: ',imb_rate*100,'%')

print('Resampled dataset shape %s' % Counter(ytr))


Imbalance rate train:  4.854368932038835 %
Original dataset shape Counter({0.0: 294, 1.0: 15})

Imbalance rate train:  33.63431151241535 %
Resampled dataset shape Counter({0.0: 294, 1.0: 149})
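
With `sampling_strategy=1/2` the nominal minority target is half the majority count, yet the resampled set holds 149 minority samples rather than 147. As far as I understand, ADASYN allocates synthetic samples per minority point and rounds each allocation, so the total only approximates the target. The arithmetic below uses the counts printed above; the variable names are hypothetical.

```python
# Counts taken from the printed output of the resampling cell.
n_majority, n_minority_before, n_minority_after = 294, 15, 149
sampling_strategy = 1 / 2  # desired minority / majority ratio

nominal_target = round(sampling_strategy * n_majority)
nominal_new_samples = nominal_target - n_minority_before
actual_new_samples = n_minority_after - n_minority_before
achieved_ratio = n_majority / n_minority_after

print("nominal minority target:", nominal_target)
print("nominal vs actual synthetic samples:", nominal_new_samples, actual_new_samples)
print(f"achieved ratio: {achieved_ratio:.2f} to 1")
```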


(d)

In [55]:
# (d.1) Consider the default support vector machines (SVM), random forests (RF), and multilayer perceptron (MLP) algorithms. Using f1_macro as score and 4-fold CV,
# determine which is the most promising family of algorithms. [15%]
from sklearn.metrics import f1_score


# Random Forest
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

average_score = cross_val_score(estimator=RandomForestClassifier(), X=xtr, y=ytr, cv=4, scoring='f1_macro').mean()
print(average_score)


0.8516516952005198


In [56]:
# SVM
from sklearn import svm

average_score = cross_val_score(estimator=svm.SVC(), X=xtr, y=ytr, cv=4, scoring='f1_macro').mean()
print(average_score)


0.7733366160759356


In [None]:
# MLP
from sklearn.neural_network import MLPClassifier

average_score = cross_val_score(estimator=MLPClassifier(), X=xtr, y=ytr, cv=4, scoring='f1_macro').mean()
print(average_score)
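
The three comparisons in (d.1) follow the same pattern, so they can be written as one loop that keeps every score side by side. The sketch below runs on synthetic data and fixes `random_state` so that repeated runs agree. The dataset, the dictionary names and the suppression of `ConvergenceWarning` (the default MLP may stop before converging) are assumptions, and the scores it prints are unrelated to the ones reported in this notebook.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in: 6 features, about 2 to 1 class ratio.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.67, 0.33], random_state=0
)

families = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(random_state=0),
    "MLP": MLPClassifier(random_state=0),
}

cv_scores = {}
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    for name, model in families.items():
        # Mean f1_macro over 4 folds for the default model of each family
        cv_scores[name] = cross_val_score(
            model, X_demo, y_demo, cv=4, scoring="f1_macro"
        ).mean()

best_family = max(cv_scores, key=cv_scores.get)
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
print("most promising family:", best_family)
```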


In [58]:
# In this task, the most relevant parameters for each family are:
# -SVM: C (unlimited options), kernel (5 options, but we ignore linear and precomputed), and gamma (2 options).
# -RF: n_estimators (unlimited options), min_samples_split (unlimited options if using float), max_features (3 options if we ignore integers and floats, considering that auto and sqrt are the same).
# -MLP: hidden_layer_sizes (unlimited in number of layers and neurons per layer), solver (use ‘sgd’), learning_rate (3 options), and learning_rate_init (unlimited options).

# (d.2.1) For the most promising family found in (d.1), taking into account the info above, we will consider all tunable parameters (remember that in MLP, solver parameter
# must be ‘sgd’ so it is not tunable). For each of them, consider at least two options in such a way that the total number of possible models is at least 20. Once you made
# your choices, exactly how many possible models could you have? [10%]

# (d.2.2) Use a train/validation strategy to check all models. The best of all will be the final winning model. Which are the best parameters? And the best score? [30%]

from sklearn.ensemble import RandomForestClassifier

# Random forest obtained the highest CV f1_macro among the scores shown in (d.1) (0.852 against 0.773 for SVM), so we tune this family

n_estimators_try = [20, 60, 100]
min_samples_split_try = [2, 3, 4]
max_features_try = [2, 4, 6]

fitted_estimators = []
f1_scores = []
params = []

# First we fit one estimator per combination of parameters.
# (d.2.1) With 3 options for each of the 3 parameters there are 3 x 3 x 3 = 27 possible models.

for i in n_estimators_try:
    for j in min_samples_split_try:
        for k in max_features_try:
            rnd_clf = RandomForestClassifier(n_estimators=i, min_samples_split=j, max_features=k)
            fitted_estimators.append(rnd_clf.fit(xtr, ytr))
            params.append((i,j,k))

# Now we get the f1 scores for all the fitted estimators with the validation dataset

for model_idx, estimator in enumerate(fitted_estimators):
    print('Iteration ', model_idx)

    predicted = estimator.predict(xval)
    expected = yval

    f1_partial = f1_score(expected, predicted, average='macro')
    print(f1_partial)
    f1_scores.append(f1_partial)

    print()

# Lastly, we look for the best validation f1 score among the fitted estimators.
# np.argmax returns the first position of the maximum, so ties go to the earliest combination.

index_max = int(np.argmax(f1_scores))

print('The estimator with highest cv is in the index ', index_max, ', with score of ', f1_scores[index_max])
print('The best params are: \n\tn_estimators = ',params[index_max][0], '\n\tmin_samples_split = ',params[index_max][1],'\n\tmax_features = ',params[index_max][2])



Iteration  0
0.4722222222222222

Iteration  1
0.5866355866355867

Iteration  2
0.4722222222222222

Iteration  3
0.5866355866355867

Iteration  4
0.5866355866355867

Iteration  5
0.571830985915493

Iteration  6
0.5866355866355867

Iteration  7
0.571830985915493

Iteration  8
0.47586206896551725

Iteration  9
0.4722222222222222

Iteration  10
0.4722222222222222

Iteration  11
0.47586206896551725

Iteration  12
0.5866355866355867

Iteration  13
0.5866355866355867

Iteration  14
0.7047397047397047

Iteration  15
0.5866355866355867

Iteration  16
0.6041666666666667

Iteration  17
0.7047397047397047

Iteration  18
0.5866355866355867

Iteration  19
0.4722222222222222

Iteration  20
0.6041666666666667

Iteration  21
0.5866355866355867

Iteration  22
0.5866355866355867

Iteration  23
0.6041666666666667

Iteration  24
0.5866355866355867

Iteration  25
0.5866355866355867

Iteration  26
0.5866355866355867

The estimator with highest cv is in the index  14 , with score of  0.7047397047397047
The best params are: 
	n_estimators =  60 
	min_samples_split =  3 
	max_features =  6
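
The nested loops can also be expressed with scikit-learn's `ParameterGrid`, which enumerates every combination and reports how many there are. The sketch below uses the same three option lists on synthetic data. The dataset, the fixed `random_state` and the names `grid`, `results` and `params_demo` are assumptions, and `ParameterGrid` iterates in its own order, so its positions do not match the iteration numbers printed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in for the train and validation sets.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

grid = ParameterGrid({
    "n_estimators": [20, 60, 100],
    "min_samples_split": [2, 3, 4],
    "max_features": [2, 4, 6],
})
print("number of candidate models:", len(grid))

# Fit each candidate on train and score it on validation.
results = []
for params_demo in grid:
    model = RandomForestClassifier(random_state=0, **params_demo).fit(X_tr, y_tr)
    score = f1_score(y_va, model.predict(X_va), average="macro")
    results.append((score, params_demo))

best_score, best_params = max(results, key=lambda item: item[0])
print("best validation f1_macro:", best_score)
print("best parameters:", best_params)
```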

(e)

In [59]:
from sklearn.metrics import roc_auc_score

# (e.1) Taking into account the final best parameters obtained in (d), train the final model with the right data. [5%]

# First we concatenate the train and validation datasets
sumax = [xtr, xval]
xtrain_val = pd.concat(sumax)

sumay = [ytr, yval]
ytrain_val = pd.concat(sumay)

# Then we create the estimator with the best parameters obtained, and then fit
rnd_clf = RandomForestClassifier(n_estimators=params[index_max][0], min_samples_split=params[index_max][1], max_features=params[index_max][2])
rnd_clf.fit(xtrain_val, ytrain_val)

# (e.2) Calculate the final test score. [5%]

predicted = rnd_clf.predict(xtest)
expected = ytest

f1_partial = f1_score(expected, predicted, average='macro')
print('f1 macro score: ' + str(f1_partial))


#(e.3) Comparing this test score with the train/validation score obtained in (d.2.2) for that model, would you say that the winner model overfits? [5%]

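# Concatenating the oversampled train set with the untouched validation set changes the class ratio of the fitting data,
# because only the train set was rebalanced with ADASYN; this may hurt the final model.

# The winner reached an f1_macro of 0.705 on validation but only 0.479 on test. Such a drop means the validation score
# does not generalize, so the model overfits. Picking the best of 27 candidates on a small validation set also makes that score optimistic.
# As an extra check we compute the area under the ROC curve from the hard class predictions; a value near 0.5 is no better than chance.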
auc = roc_auc_score(expected, predicted)
print('The area under the ROC curve obtained is: ' + str(auc))


f1 macro score: 0.47857142857142854
The area under the ROC curve obtained is: 0.4855072463768116
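
Both test scores can be reproduced by a single confusion pattern, which helps to interpret them. The sketch below assumes the 73 test samples contain 69 negatives and 4 positives (an assumption based on the stratified split, not a value printed in the notebook) and that the model raises two false positives while missing every positive. Under that assumption the macro F1 and the AUC match the reported values, and the AUC computed from hard predictions equals the balanced accuracy.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

# Assumed test labels: 69 negatives followed by 4 positives.
y_true_demo = np.array([0] * 69 + [1] * 4)
# Assumed predictions: 2 false positives, 67 true negatives, 4 missed positives.
y_pred_demo = np.array([1] * 2 + [0] * 67 + [0] * 4)

f1_demo = f1_score(y_true_demo, y_pred_demo, average="macro")
auc_demo = roc_auc_score(y_true_demo, y_pred_demo)
bal_acc_demo = balanced_accuracy_score(y_true_demo, y_pred_demo)

print("f1 macro:", f1_demo)
print("AUC from hard predictions:", auc_demo)
print("balanced accuracy:", bal_acc_demo)
```

If this reconstruction is right, the final model detects none of the high-value samples in the test set. Passing `predict_proba(xtest)[:, 1]` to `roc_auc_score` instead of class labels would give a ranking-based AUC, which is the more usual way to report it.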
