# Select Best Model

We use the data prepared during the EDA to find the model that fits best.

## Import libraries

In [1]:
# Linear algebra
import numpy as np 

# Data processing
import pandas as pd 

# Regex
import re

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 8, 6
plt.style.use('tableau-colorblind10')

# Warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Machine learning
from sklearn.model_selection import train_test_split

# Logistic Regression
from sklearn.linear_model import LogisticRegression
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
# k-nearest neighbors
from sklearn.neighbors import KNeighborsClassifier
# Support Vector Machine (SVC uses an RBF kernel unless kernel='linear' is passed)
from sklearn.svm import SVC
from sklearn import svm
# Random Forest
from sklearn.ensemble import RandomForestClassifier

# Metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

## Load data

In [3]:
df = pd.read_csv('output/data_label.csv', index_col = 0)

We reorder the dataset, placing the target column ('survived') at the end.

In [4]:
df = df[["pclass","deck", "fare_range","sex","title","age_range", "fam_size","embarked","survived"]]
df.head()

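|   | pclass | deck | fare_range | sex | title | age_range | fam_size | embarked | survived |
|---|--------|------|------------|-----|-------|-----------|----------|----------|----------|
| 0 | 1 | 1 | 4 | 1 | 0 | 2 | 1 | 0 | 1 |
| 1 | 1 | 2 | 4 | 0 | 3 | 0 | 4 | 0 | 1 |
| 2 | 1 | 2 | 4 | 1 | 0 | 0 | 4 | 0 | 0 |
| 3 | 1 | 2 | 4 | 0 | 1 | 2 | 4 | 0 | 0 |
| 4 | 1 | 2 | 4 | 1 | 2 | 2 | 4 | 0 | 0 |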


In [5]:
print(f'The dataset has {df.shape[0]} examples and {df.shape[1]-1} features + the target variable (survived)')

The dataset has 1309 examples and 8 features + the target variable (survived)


### Save csv

In [6]:
df.to_csv('output/data_model.csv')

## Models

A **classification model** is a **supervised learning model** that attempts to draw some conclusion from observed values. Given one or more inputs, a classification model tries to predict the value of one or more outcomes. Outcomes are labels that can be applied to a dataset; here the label is binary (survived or not survived).

We create a function to compare different classification algorithms:

- **Logistic regression**: the simplest one; it estimates the probability of a discrete outcome (binary 0/1) from a given set of independent variables.

- **Gaussian Naive Bayes**: assumes that, within a class, each feature is independent of every other feature, and models the values of each feature with a Gaussian distribution.

- **K-nearest neighbors**: it stores all available cases and classifies a new case by a majority vote of its k nearest neighbors.

- **Linear Support Vector Machine**: each data item is treated as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. A linear SVM then looks for the hyperplane that separates the classes with the widest margin. Note that `svm.SVC()` with default arguments, as used in this notebook, applies an RBF kernel rather than a linear one.

- **Random Forest**: an ensemble learning method that trains many decision trees and combines their predictions (by majority vote for classification). It can be used for classification, regression and other tasks.
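The linear SVM described above can be requested explicitly in scikit-learn. The sketch below is illustrative only: the toy points are made up, and `kernel='linear'` is an assumption about what a linear variant of this notebook might use, since `SVC()` with no arguments falls back to an RBF kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable points (hypothetical data, not the Titanic set)
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [3, 3], [3, 4], [4, 3], [4, 4]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

default_svc = SVC()                 # kernel defaults to 'rbf'
linear_svc = SVC(kernel='linear')   # explicit linear kernel

default_svc.fit(X_toy, y_toy)
linear_svc.fit(X_toy, y_toy)

print('Default kernel:', default_svc.kernel)
# coef_ (the hyperplane weights) is only available for a linear kernel
print('Hyperplane weights shape:', linear_svc.coef_.shape)

new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
pred = linear_svc.predict(new_points)
print('Predictions:', pred)
```

Only the linear model exposes `coef_`, which is one practical way to check which kind of SVM is actually being trained.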


### Split dataset

We split the prepared dataset into train and test sets:

- Train = 70% of the data
- Test = 30% of the data

On the training data, the Stratified K-Folds cross-validator (`StratifiedKFold`) splits the samples into K subsets called folds (K=5 by default). The folds are stratified: each one preserves the percentage of samples of each class. With the default settings used here, the samples are not shuffled again before being assigned to folds.
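To make the stratification concrete, here is a minimal sketch on synthetic data (an assumption: the array shapes, class balance and `random_state` values are invented for illustration and are not taken from the notebook). It checks that each validation fold keeps roughly the same share of positive labels as the whole training set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data: 100 samples, 8 integer-coded features,
# 40% positive class (hypothetical, not the Titanic dataset)
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 8))
y = np.array([1] * 40 + [0] * 60)

# 70/30 split, as described in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Default StratifiedKFold: 5 folds, no shuffling
skf = StratifiedKFold()
fold_ratios = []
for train_idx, val_idx in skf.split(X_train, y_train):
    # Share of positive labels in this validation fold
    fold_ratios.append(y_train[val_idx].mean())

overall_ratio = y_train.mean()
print(f'Positive share in training set: {overall_ratio:.2f}')
print('Positive share per validation fold:', np.round(fold_ratios, 2))
```

Passing `stratify=y` to `train_test_split` would apply the same idea to the train/test split itself; the notebook does not do this, so it is mentioned only as an option.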

### Metrics

The metric we use to compare the performance of the models is **accuracy**: the ratio of correct predictions to the total number of predictions.
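As a small worked example, accuracy can be computed by hand from the confusion matrix as (TP + TN) / (TP + TN + FP + FN) and compared with `accuracy_score`. The label vectors below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (0 = not survived, 1 = survived)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(f'Manual accuracy:  {manual_accuracy:.2f}')
print(f'sklearn accuracy: {sklearn_accuracy:.2f}')
```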

In [9]:
def run_multiple_models(df, target, split):
    
    labels = ['Not Survived', 'Survived']
    
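    # Features: every column except the target; target: the named column
    X = df.drop(columns = [target]).values
    y = df[target]
    
    # Hold out a fraction `split` of the data as the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = split)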
    
    dfs = []
    models = [
        ('LR_model', LogisticRegression()),
        ('GNB_model', GaussianNB()),      
        ('KNN_model', KNeighborsClassifier(n_neighbors = 5)),
        ('SVC_model', svm.SVC()), 
        ('RFC_model', RandomForestClassifier(n_estimators=100))]
    
    results = []
    
    names = []
    score = {}
    
    target_names = labels

    for name, model in models:
        
        # Cross validation
        cv = cross_val_score(model, X_train, y_train, cv = StratifiedKFold(), scoring = 'accuracy')
        # Fit
        clf = model.fit(X_train, y_train)
        # Predict
        y_pred = clf.predict(X_test)
        # Score: accuracy on the training set (not on the held-out test set)
        acc = clf.score(X_train, y_train)
        
        print(f'---- {name} ----')
        print(f'Accuracy: {round(acc * 100, 2)}%')
        print()
        print(classification_report(y_test, y_pred, target_names = target_names))
        
        results.append(cv)
        names.append(name)
        score[name] = round(acc*100,2)
        
        
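        # Keep the per-fold cross-validation accuracies of this model
        final_df = pd.DataFrame(cv)
        final_df['model'] = name
        dfs.append(final_df)

    # Combine the cross-validation results of all models once, after the loop
    final = pd.concat(dfs, ignore_index = True)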
    
    print()
    max_key = max(score, key = lambda key: score[key])
    print(f'The model with best performance is: {max_key}')
    
    return final
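
The function above ranks models by `clf.score(X_train, y_train)`, which is accuracy on the training set and tends to favor models that memorize the data. A possible alternative, sketched here on synthetic data, is to rank by mean cross-validation accuracy and to report training and test accuracy side by side. The names `summary` and `best_by_cv`, the `make_classification` data and the fixed `random_state` values are all assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

# Synthetic binary classification data (stand-in for the prepared dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    'LR_model': LogisticRegression(max_iter=1000),
    'RFC_model': RandomForestClassifier(n_estimators=100, random_state=0),
}

summary = {}
for name, model in models.items():
    # Mean accuracy over 5 stratified folds of the training data
    cv_scores = cross_val_score(model, X_train, y_train,
                                cv=StratifiedKFold(), scoring='accuracy')
    model.fit(X_train, y_train)
    summary[name] = {
        'cv_mean': cv_scores.mean(),
        'train': model.score(X_train, y_train),
        'test': model.score(X_test, y_test),
    }

# Rank on cross-validation accuracy rather than training accuracy
best_by_cv = max(summary, key=lambda n: summary[n]['cv_mean'])

for name, scores in summary.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
print(f'Best model by mean CV accuracy: {best_by_cv}')
```

A large gap between the `train` and `cv_mean` values for a model is a hint that its training accuracy overstates how well it generalizes.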

In [10]:
res = run_multiple_models(df, 'survived', 0.3)

---- LR_model ----
Accuracy: 80.46%

              precision    recall  f1-score   support

Not Survived       0.82      0.88      0.85       249
    Survived       0.76      0.66      0.71       144

    accuracy                           0.80       393
   macro avg       0.79      0.77      0.78       393
weighted avg       0.80      0.80      0.80       393

---- GNB_model ----
Accuracy: 74.24%

              precision    recall  f1-score   support

Not Survived       0.79      0.78      0.79       249
    Survived       0.63      0.65      0.64       144

    accuracy                           0.73       393
   macro avg       0.71      0.71      0.71       393
weighted avg       0.73      0.73      0.73       393

---- KNN_model ----
Accuracy: 83.52%

              precision    recall  f1-score   support

Not Survived       0.84      0.86      0.85       249
    Survived       0.74      0.71      0.73       144

    accuracy                           0.80       393
   macro avg   

## Conclusions

The random forest classifier is the best-performing model for our supervised learning problem. We will create the model in the next notebook: `2_Random_Forest`

### Reference

- https://data-flair.training/blogs/machine-learning-classification-algorithms/