### Model Selection

With feature selection done, we have our final training data.

In this part we fit a machine learning model.
We consider two candidates: a decision tree and a random forest.

We tune the decision tree first because it is the simpler model, and then train a random forest in the hope of better performance.
The random forest is not guaranteed to improve on the decision tree, which is why we evaluate both.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
train_df = pd.read_csv('transformed_train2.csv')
test_df = pd.read_csv('transformed_test2.csv')

Until now we have not used the test data; we use it only at the end, to evaluate the final models. For tuning we use 5-fold cross-validation, which helps control overfitting to a single validation split. This is affordable because we do not have many samples; with a much larger dataset, refitting every candidate five times would be computationally costly.

We are going to use sklearn's GridSearchCV, which combines grid search with cross-validation.
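As a minimal sketch of what 5-fold cross-validation does on its own, the snippet below scores a single decision tree on synthetic data. The data from `make_classification` is only a stand-in (an assumption), not the Titanic training set, and the `max_depth=5` setting is chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 samples with 7 features
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# cv=5 splits the data into 5 folds; each fold is held out once
# while the model is fitted on the remaining 4 folds
model = DecisionTreeClassifier(max_depth=5, random_state=0)
scores = cross_val_score(model, X, y, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

GridSearchCV repeats this procedure for every parameter combination and keeps the combination with the best mean score.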

In [3]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [4]:
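def grid_search(data, model, param_grid, verbose=0):
    """Run a 5-fold cross-validated grid search and report the best result."""
    # Split the features from the 'Survived' target column
    X = data.drop(['Survived'], axis=1)
    y = data['Survived']

    grid = GridSearchCV(model, param_grid, cv=5, n_jobs=4, verbose=verbose)
    grid.fit(X, y)

    print("Best parameters: ", grid.best_params_)
    print("Best score: ", grid.best_score_)
    return grid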

#### Decision tree

In the case of the decision tree, [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) lets us play with:

* criterion{“gini”, “entropy”}: The function used to measure the quality of a split.
* splitter{"best", "random"}: The strategy for choosing the split at each node: the best possible split, or the best of a set of random splits (less prone to overfitting).
* max_depth{int}: Controls the maximum depth of the tree.
* min_samples_split{int}: Minimum number of samples needed to split a node.
* min_samples_leaf{int}: Minimum number of samples needed to be a leaf.
* min_weight_fraction_leaf{float}: In this case (assuming all samples have the same weight) it is the fraction of all samples needed to be a leaf.
* random_state{int}: Sets a random seed.
* max_leaf_nodes{int}: Sets an upper bound on the size of the tree, measured in leaves.
* min_impurity_decrease{float}: A node is split only if the split decreases the impurity by at least this threshold.
* ccp_alpha{float}: Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.
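To make the effect of ccp_alpha more concrete, here is a small sketch that uses `cost_complexity_pruning_path` to list the alpha values at which a fully grown tree loses a subtree, and then refits the tree at each of them. The data is synthetic (an assumption), so the numbers say nothing about the Titanic data; the point is only that larger values of ccp_alpha give smaller trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the Titanic training set
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

# Effective alphas at which the fully grown tree would lose a subtree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Guard against tiny negative values caused by floating-point error
alphas = np.clip(path.ccp_alphas, 0.0, None)

node_counts = []
for alpha in alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    node_counts.append(pruned.tree_.node_count)

print("Number of candidate alphas:", len(alphas))
print("Nodes in the unpruned tree:", node_counts[0])
print("Nodes at the largest alpha:", node_counts[-1])
```

The alphas returned by the pruning path could also serve as grid values for ccp_alpha, instead of an evenly spaced range.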

Running a grid search over all of these parameters at once would be far too expensive.

From [towardsdatascience](https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680) we found out that:
* max_depth, max_leaf_nodes, min_samples_split, and min_samples_leaf are all stopping criteria
* min_weight_fraction_leaf and min_impurity_decrease are pruning methods.

Of these, we will consider only max_depth and min_impurity_decrease.


In the end we want to tune:
* criterion
* splitter
* max_depth
* min_impurity_decrease
* ccp_alpha


In [5]:
# First grid
parm_grid = {
    'max_depth': [None, 5, 10, 20, 50, 100],
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'min_impurity_decrease': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'ccp_alpha': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}

tree = DecisionTreeClassifier()
grid = grid_search(train_df, tree, parm_grid)

Best parameters:  {'ccp_alpha': 0.0, 'criterion': 'gini', 'max_depth': 5, 'min_impurity_decrease': 0.0, 'splitter': 'best'}
Best score:  0.825844577957254
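
To get a feel for the cost of the first grid above, the sketch below counts its parameter combinations with sklearn's `ParameterGrid`. The name `first_grid` is introduced here only for illustration; the values are copied from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the first decision tree grid
first_grid = {
    'max_depth': [None, 5, 10, 20, 50, 100],
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'min_impurity_decrease': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'ccp_alpha': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}

n_candidates = len(ParameterGrid(first_grid))  # 6 * 2 * 2 * 11 * 11
n_fits = n_candidates * 5                      # one fit per candidate and fold

print("Candidates:", n_candidates)  # 2904
print("Model fits:", n_fits)        # 14520
```

Every extra parameter multiplies this count, which is why tuning all ten parameters together is impractical.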


In [6]:
# Second grid
parm_grid = {
    'max_depth': range(1, 10),
    'ccp_alpha': np.arange(0.0, 0.1, 0.01),
    'min_impurity_decrease': np.arange(0.0, 1.0, 0.01),
}
grid = grid_search(train_df, tree, parm_grid)


Best parameters:  {'ccp_alpha': 0.0, 'max_depth': 5, 'min_impurity_decrease': 0.0}
Best score:  0.825844577957254


This gives us our best tree:

In [7]:
best_tree = grid.best_estimator_
best_tree

DecisionTreeClassifier(max_depth=5)

#### Random Forest

In this case we have mostly the same parameters as in the decision tree (splitter is not available for the forest). But in addition we have:
* n_estimators{int}: The number of trees we want

There are a few more, but we are not going to take them into account.
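
One quick way to see which parameters the two models share is to compare the keys returned by `get_params()`. This is only a sketch; the exact lists depend on the installed scikit-learn version, so no output is shown.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

tree_params = set(DecisionTreeClassifier().get_params())
forest_params = set(RandomForestClassifier().get_params())

# Parameters that exist only for the forest (n_estimators is one of them)
print("Forest only:", sorted(forest_params - tree_params))

# Parameters that exist only for the single tree (splitter is one of them)
print("Tree only:", sorted(tree_params - forest_params))

# Parameters that both models accept
print("Shared:", sorted(tree_params & forest_params))
```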

In [8]:
# First grid
parm_grid = {
    'n_estimators': [10, 20, 30, 40, 50],
    'max_depth': [None, 5, 10, 20, 50, 100],
    'criterion': ['gini', 'entropy'],
    'min_impurity_decrease': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    'ccp_alpha': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
}

forest = RandomForestClassifier()
grid = grid_search(train_df, forest, parm_grid)

Best parameters:  {'ccp_alpha': 0.0, 'criterion': 'gini', 'max_depth': 5, 'min_impurity_decrease': 0.0, 'n_estimators': 30}
Best score:  0.8300403821530582


In [11]:
# Second grid
parm_grid = {
    'n_estimators': range(25, 35),
    'max_depth': range(2, 10),
    'criterion': ['entropy'],
    'min_impurity_decrease': np.arange(0.0, .1, 0.05),
    'ccp_alpha': np.arange(0.0, .1, 0.05),
}

forest = RandomForestClassifier()
grid = grid_search(train_df, forest, parm_grid)


Best parameters:  {'ccp_alpha': 0.0, 'criterion': 'entropy', 'max_depth': 4, 'min_impurity_decrease': 0.0, 'n_estimators': 32}
Best score:  0.8328572835615089


Finally, we have our best random forest:

In [13]:
best_forest = grid.best_estimator_

#### Comparison

Now let's compare the performance of both models on the actual test set.

In [14]:
print("Tree accuracy:", best_tree.score(test_df.drop(['Survived'], axis=1), test_df['Survived']))
print("RF accuracy", best_forest.score(test_df.drop(['Survived'], axis=1), test_df['Survived']))

Tree accuracy: 0.8100558659217877
RF accuracy 0.8212290502793296
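
To judge whether this gap is meaningful, a rough binomial standard error of the accuracy can help. The sketch below assumes the test set has 179 rows; that number is not stated in the notebook and is only inferred from the reported accuracies (145/179 and 147/179). Both models are scored on the same rows, so a paired comparison would be more appropriate; treat this as a back-of-the-envelope check.

```python
import math

n_test = 179  # assumption: inferred from the reported accuracies
tree_acc = 0.8100558659217877
forest_acc = 0.8212290502793296


def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(acc * (1 - acc) / n)


gap = forest_acc - tree_acc
se = accuracy_std_error(tree_acc, n_test)

print(f"Gap between the models: {gap:.4f}")
print(f"Standard error of the tree accuracy: {se:.4f}")
```

Under that assumption the gap corresponds to two passengers and is smaller than one standard error, so the test set alone does not clearly separate the two models.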


In the end we are not that happy with the results: the accuracy is well below what we expected. We will probably try a neural net later, but for now let's build the pipeline and submit our predictions to Kaggle.

### Pipeline

Now that we have our best model, we are going to create a pipeline that takes the raw original dataset as input and outputs the prediction. For this we use sklearn's Pipeline. To include all our data transformations in the pipeline, we need to create some classes that are compatible with sklearn's pipeline environment.

To make these classes compatible, each one needs to implement the fit and transform methods.

First we create a class that implements data imputation.

In [15]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import KNNImputer, SimpleImputer

class DataFrameImputer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.catImputer = SimpleImputer(strategy='most_frequent')
        self.numImputer = KNNImputer(n_neighbors=5)

    def fit(self, X, y=None):
        # Missing values are assumed to be np.nan already, which is how
        # pd.read_csv loads empty cells, so no conversion is done here.

        # Creating a list with the numerical features
        self.num = [col for col in X.columns if X[col].dtype != object]

        # Creating a list with all the categorical features
        self.cat = [col for col in X.columns if X[col].dtype == object]

        self.catImputer.fit(X[self.cat])
        self.numImputer.fit(X[self.num])
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not modified in place
        # (assigning to a slice triggers pandas' SettingWithCopyWarning)
        X = X.copy()
        X[self.cat] = self.catImputer.transform(X[self.cat])
        X[self.num] = self.numImputer.transform(X[self.num])
        return X

We also create a class that selects a subset of the dataset's features.

In [16]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

In addition, we need a helper class that wraps the encoder and joins the encoded columns back onto the original DataFrame.

In [27]:
from sklearn.preprocessing import OneHotEncoder

class DataFrameEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, categories = ["Sex", "Pclass", "Embarked"]):
        # sparse_output replaced the sparse argument in scikit-learn 1.2;
        # on older versions use sparse=False instead
        self.encoder = OneHotEncoder(drop="if_binary", sparse_output=False)
        self.categories = categories

    def fit(self, X, y=None):
        self.encoder.fit(X[self.categories])
        return self

    def transform(self, X):
        # One-hot encode the categorical columns, keeping the original index
        # so the encoded columns can be joined back onto X
        encoded = pd.DataFrame(
            self.encoder.transform(X[self.categories]),
            columns=self.encoder.get_feature_names_out(),
            index=X.index,
        )
        return X.join(encoded)


In [47]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

selector_1 = DataFrameSelector(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'])
imputer = DataFrameImputer()
encoder = DataFrameEncoder()
selector_2 = DataFrameSelector(['Age', 'SibSp', 'Sex_male', 'Pclass_1.0', 'Pclass_3.0','Embarked_C', 'Embarked_Q'])
scaler = MinMaxScaler()

pipeline = Pipeline([
    ("selector_1", selector_1),
    ("imputer", imputer),
    ("encoder", encoder),
    ("selector_2", selector_2),
    ("scaler", scaler),
    ("forest", best_forest),
])
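
For comparison, similar preprocessing can be sketched with sklearn's built-in `ColumnTransformer` instead of custom classes. This is an alternative, not the approach used in this notebook. The toy DataFrame, the column split, and the names `numeric`, `categorical`, `preprocess` and `model` are all assumptions made for illustration, and the forest settings simply mirror the best parameters found above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for the raw data (assumption): a few Titanic-like rows
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, np.nan],
    "SibSp": [1, 1, 0, 1, 0, 3, 0, 1],
    "Sex": pd.Series(["male", "female", "female", "female",
                      "male", "male", "female", "male"], dtype=object),
    "Embarked": pd.Series(["S", "C", "S", np.nan,
                           "S", "Q", "S", "C"], dtype=object),
})
target = pd.Series([0, 1, 1, 1, 0, 0, 1, 0])

# Numerical columns: KNN imputation followed by min-max scaling
numeric = Pipeline([
    ("impute", KNNImputer(n_neighbors=2)),
    ("scale", MinMaxScaler()),
])

# Categorical columns: most-frequent imputation followed by one-hot encoding
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(drop="if_binary")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "SibSp"]),
    ("cat", categorical, ["Sex", "Embarked"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=32, max_depth=4,
                                      criterion="entropy", random_state=0)),
])

model.fit(toy, target)
preds = model.predict(toy)
print(preds)
```

The custom classes above keep everything as a DataFrame between steps, which makes the second feature selection by column name possible; the `ColumnTransformer` version trades that for less code.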

In [48]:
data = pd.read_csv("train.csv")
pipeline.fit(data.drop(['Survived'], axis=1), data['Survived'])


Pipeline(steps=[('selector_1',
                 DataFrameSelector(attribute_names=['Pclass', 'Sex', 'Age',
                                                    'SibSp', 'Parch', 'Fare',
                                                    'Embarked'])),
                ('imputer', DataFrameImputer()),
                ('encoder', DataFrameEncoder()),
                ('selector_2',
                 DataFrameSelector(attribute_names=['Age', 'SibSp', 'Sex_male',
                                                    'Pclass_1.0', 'Pclass_3.0',
                                                    'Embarked_C',
                                                    'Embarked_Q'])),
                ('scaler', MinMaxScaler()),
                ('forest',
                 RandomForestClassifier(criterion='entropy', max_depth=4,
                                        n_estimators=32))])

### Prediction
Finally, we reach the last step. We are going to predict Kaggle's test set and store the predictions, along with the passenger ids, in prediction.csv.

In [59]:
test = pd.read_csv("test.csv")

prediction = pipeline.predict(test)
ids = test['PassengerId']

prediction_df = pd.DataFrame({'PassengerId': ids, 'Survived': prediction})
prediction_df


| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 0 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |


In [60]:
prediction_df.to_csv('prediction.csv', index=False)

### Results

We got an accuracy of 0.77990, which placed us 3254th on the leaderboard at the time of writing this notebook. To be honest, it is a rather frustrating result.
We will try a neural network to beat it.