### Feature Selection

In this part we are going to try to reduce the complexity of our dataset and the correlation between its features.

In [10]:
import pandas as pd
import seaborn as sns
import numpy as np
import sklearn

In [11]:
data_train = pd.read_csv("transformed_train.csv")
data_test = pd.read_csv("transformed_test.csv")

In [12]:
data_train

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Sex_male,Pclass_1.0,Pclass_2.0,Pclass_3.0,Embarked_C,Embarked_Q,Embarked_S
0,0.0,0.566474,0.000,0.000000,0.055628,1.0,1.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.283740,0.000,0.000000,0.025374,1.0,0.0,1.0,0.0,0.0,0.0,1.0
2,0.0,0.396833,0.000,0.000000,0.015469,1.0,0.0,0.0,1.0,0.0,0.0,1.0
3,0.0,0.321438,0.125,0.000000,0.015330,1.0,0.0,0.0,1.0,0.0,0.0,1.0
4,0.0,0.070118,0.500,0.333333,0.061045,0.0,0.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
707,1.0,0.258608,0.000,0.000000,0.014932,0.0,0.0,0.0,1.0,0.0,0.0,1.0
708,0.0,0.484795,0.000,0.000000,0.060508,1.0,1.0,0.0,0.0,0.0,0.0,1.0
709,0.0,0.509927,0.250,0.000000,0.027538,1.0,0.0,0.0,1.0,0.0,0.0,1.0
710,1.0,0.170646,0.125,0.333333,0.234224,0.0,1.0,0.0,0.0,0.0,0.0,1.0


At this point we have two alternative ways of doing feature selection.

* Selecting among the columns we have
* Dimensionality reduction techniques

In this notebook I am going to follow the first approach.

For this, we are going to employ the minimum redundancy maximum relevance (mRMR) method, with mutual information as the relevance criterion and the covariance between features as the redundancy measure.
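Written out, one common form of the greedy criterion (the difference form, with redundancy accumulated as a sum over the features already selected) is shown here. The symbols are notation assumed for this note: $S$ is the set of selected features, $C$ the set of remaining candidates, $I(x_j; y)$ the mutual information between feature $x_j$ and the target $y$, and $R(x_s, x_j)$ the chosen redundancy measure between two features.

```latex
$$
f^{*} = \arg\max_{j \in C} \left[ I(x_j; y) - \sum_{s \in S} R(x_s, x_j) \right]
$$
```

After each step, $f^{*}$ is moved from $C$ to $S$. Because the redundancy term is a sum, the score of later picks tends to decrease as $S$ grows.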

In [13]:
from sklearn.feature_selection import mutual_info_regression

# Precompute the relevance of each feature and the redundancy matrix
relevance = mutual_info_regression(data_train.drop("Survived", axis=1), data_train.Survived)
# Note: np.cov treats each row as a variable unless rowvar=False is passed
redundancy = np.abs(np.cov(data_train.drop("Survived", axis=1)))

m = len(relevance)

# Start from the most relevant feature
f = np.argmax(relevance)
features = [f]
MRmr_score = [relevance[f]]

candidates = np.arange(m)

# Move the first feature from the candidates to the selected set
selected = np.array([candidates[f]])
candidates = np.delete(candidates, f)



for i in range(m-1):
    rel = relevance[candidates]
    
    # Sum of the redundancy between each current candidate and the already selected features
    red = np.sum(redundancy[np.ix_(selected, candidates)], axis=0)
    
    mrmr = rel-red
    
    # Select the next feature
    f = np.argmax(mrmr)
    
    #Update the candidates and selected features
    features.append(candidates[f])
    MRmr_score.append(mrmr[f])
    selected = np.hstack([selected, candidates[f]])
    candidates = np.delete(candidates, f)


In [14]:
MRmr_features = pd.Series(MRmr_score, index=data_train.drop("Survived", axis=1).columns[features]) 
MRmr_features

Sex_male      0.162580
Pclass_2.0    0.043688
Pclass_3.0   -0.058944
Fare         -0.112925
Embarked_S   -0.154490
Pclass_1.0   -0.293595
Parch        -0.435539
Embarked_Q   -0.571901
SibSp        -0.689790
Age          -0.893840
Embarked_C   -1.021141
dtype: float64

We can see that this only partly confirms what we saw in the exploratory part. Sex is the most important variable, followed by the passenger class and the fare. Age, however, ends up near the bottom of this ranking.
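As a cross-check of a ranking like this one, here is a minimal sketch of the same greedy procedure written as a function. It is not the notebook's code. It assumes `mutual_info_classif` as the relevance (the target is a class label) and the mean absolute Pearson correlation between feature columns as the redundancy, and it assumes no column is constant. The helper name `mrmr_rank` and the synthetic data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def mrmr_rank(X, y, random_state=0):
    """Greedy mRMR ranking (hypothetical helper, for illustration only).

    Relevance: mutual information between each feature and the class label.
    Redundancy: mean absolute Pearson correlation with the selected features.
    """
    relevance = mutual_info_classif(X, y, random_state=random_state)
    # rowvar=False makes np.corrcoef treat columns (features) as variables
    redundancy = np.abs(np.corrcoef(X.to_numpy(dtype=float), rowvar=False))

    # Start from the most relevant feature
    selected = [int(np.argmax(relevance))]
    scores = [float(relevance[selected[0]])]
    candidates = [j for j in range(X.shape[1]) if j != selected[0]]

    while candidates:
        # Mean redundancy of every candidate with the selected features
        red = redundancy[np.ix_(selected, candidates)].mean(axis=0)
        mrmr = relevance[candidates] - red
        best = int(np.argmax(mrmr))
        scores.append(float(mrmr[best]))
        selected.append(candidates.pop(best))

    return pd.Series(scores, index=X.columns[selected])


# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
signal = rng.normal(size=500)
X_demo = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal + 0.01 * rng.normal(size=500),  # nearly redundant
    "noise": rng.normal(size=500),
})
y_demo = (signal > 0).astype(int)

ranking = mrmr_rank(X_demo, y_demo)
print(ranking)
```

Using the mean instead of the sum keeps the redundancy term on the same scale no matter how many features have already been selected, so the scores of late picks are easier to compare with those of early ones.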

#### Comments
Given that we do not have that many features, removing some of them is not critical. We are going to remove features only if that increases the performance. For now we keep them all and use these results as a reference.

We should also mention that there are many techniques for feature selection. Most of the time they produce a similar ranking. However, there is no clear evidence that one performs better than the others.

We are going to define a function which measures the score of the model given a set of features for training.

As the model we are going to use a random forest classifier, since it is one of the most robust models we can use.
In addition, we are going to perform a grid search over its maximum depth in order to tune it.

The score will be the best cross-validated ROC AUC obtained from the grid search with the given set of features.

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

def score_model(data_train, features):
    """Return the best mean cross-validated ROC AUC over the max_depth grid."""
    # max_depth is set by the grid search, so it is not fixed here
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    gs = GridSearchCV(model, param_grid={'max_depth': list(range(1, 11))}, scoring='roc_auc', cv=5)
    gs.fit(data_train[features], data_train["Survived"])
    return gs.best_score_
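To make explicit what `best_score_` holds, here is a small self-contained sketch on synthetic data (made up for illustration; the reduced grid, `n_estimators=20` and `cv=3` are assumptions chosen only to keep it fast). It shows that `best_score_` is the highest mean cross-validated score across the grid, measured with whatever metric is passed as `scoring`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, made up for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

gs = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4]},
    scoring="roc_auc",  # area under the ROC curve, not accuracy
    cv=3,
)
gs.fit(X_demo, y_demo)

# One mean cross-validated score per grid point
mean_scores = gs.cv_results_["mean_test_score"]
print(gs.best_params_, gs.best_score_)
```

Since the same folds are used both to pick `max_depth` and to report the score, the value is slightly optimistic as an estimate of performance on unseen data.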

We are going to perform a recursive feature elimination.
In other words:
* We score our model trained with all the features.
* Then we score every model that can be trained with one feature fewer.
* We keep the feature set of the model that maximizes the score, provided it improves on the current best score.

We repeat the last two steps until the score can no longer be improved.

In [16]:
# inverse recursive feature elimination
def inverse_rfe(data_train):
    features = data_train.columns.values
    features = np.delete(features, np.where(features == "Survived"))
    best_score = score_model(data_train, features)
    best_features = features

    updated = True
    while len(features) > 1 and updated:
        updated = False
        for feature in features:
            features_ = features.copy()
            features_ = np.delete(features_, np.where(features_ == feature))
            score = score_model(data_train, features_)
            if score > best_score:
                best_score = score
                best_features = features_
                updated = True
        if updated:
            print(f"Current best score is {best_score} \n Current selected features are {best_features}")
            features = best_features

    return features

best_features = inverse_rfe(data_train)


Current best score is 0.8628692407789291 
 Current selected features are ['Age' 'SibSp' 'Fare' 'Sex_male' 'Pclass_1.0' 'Pclass_2.0' 'Pclass_3.0'
 'Embarked_C' 'Embarked_Q' 'Embarked_S']
Current best score is 0.8653449733002414 
 Current selected features are ['Age' 'SibSp' 'Fare' 'Sex_male' 'Pclass_1.0' 'Pclass_3.0' 'Embarked_C'
 'Embarked_Q' 'Embarked_S']
Current best score is 0.8654001457226494 
 Current selected features are ['Age' 'SibSp' 'Sex_male' 'Pclass_1.0' 'Pclass_3.0' 'Embarked_C'
 'Embarked_Q' 'Embarked_S']
Current best score is 0.8670145201574077 
 Current selected features are ['Age' 'SibSp' 'Sex_male' 'Pclass_1.0' 'Pclass_3.0' 'Embarked_C'
 'Embarked_Q']
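For comparison, scikit-learn ships `SequentialFeatureSelector`, which performs a similar greedy backward search when `direction="backward"`. The sketch below is an alternative and not the method used in this notebook. It differs in that it stops at a fixed number of features (`n_features_to_select`) rather than when the score stops improving. The synthetic data, the logistic regression estimator and the target of 3 features are assumptions made only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data, made up for illustration only
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,   # stop when 3 features are left
    direction="backward",     # start from all features and remove one at a time
    scoring="roc_auc",
    cv=3,
)
sfs.fit(X_demo, y_demo)

kept = sfs.get_support()      # boolean mask over the original columns
print(kept)
```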


### Conclusions
In the end, the features we are going to keep are: 'Age', 'SibSp', 'Sex_male', 'Pclass_1.0', 'Pclass_3.0', 'Embarked_C' and 'Embarked_Q'.


This gives us a cross-validated ROC AUC of about 0.867.

We can see that recursive feature elimination (RFE) and minimum redundancy maximum relevance (MRmr) do not quite coincide. However, the features each method removes are related to the ones it keeps. For instance, MRmr ranks Embarked_C last, while RFE removes Embarked_S. Something similar happens in the case of Pclass.

On reflection, this makes perfect sense. In the feature transformation part, when we encoded our categorical variables (Embarked and Pclass; in the case of Sex we kept just one column), we got three binary columns for each. We do not need three binary columns, because two of them carry the same information: if a passenger is neither in Pclass_1 (0) nor in Pclass_3 (0), then they are in Pclass_2.
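The redundancy of the third dummy column can be checked with a short sketch. The values below are made up for illustration and the column names only mimic the ones in this dataset.

```python
import pandas as pd

# Toy passenger classes, made up for illustration only
pclass = pd.Series([1, 2, 3, 3, 1, 2], name="Pclass")

# Full one-hot encoding: three binary columns
full = pd.get_dummies(pclass, prefix="Pclass").astype(int)

# Keep only two of the three columns
reduced = full.drop(columns="Pclass_2")

# The dropped column is 1 exactly when both remaining columns are 0
recovered = 1 - reduced["Pclass_1"] - reduced["Pclass_3"]
print((recovered == full["Pclass_2"]).all())
```

`pd.get_dummies` also accepts `drop_first=True`, which drops the first category of each encoded variable at encoding time.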


This is the end of the feature selection part.
In the next notebook, we are going to do a more thorough grid search and create our pipeline. That way we can automate all the steps we have done throughout these notebooks.

Let's store our results:

In [17]:
data_train[best_features].to_csv("transformed_train2.csv")
data_test[best_features].to_csv("transformed_test2.csv")