______________________
# Summary
- **Compare the baseline cross-validated accuracy of several models and select the best one for further optimization**

*Note: For the preprocessing, EDA and feature engineering used in this notebook, see [EDA & FE](https://www.kaggle.com/abhinavnayak/eda-4-insights-fe-2-new-features).*

<a id='content-table'></a>
## Table of Contents
1. [Fill Missing values, EDA and FE](#tag1)
2. [Define all the models to test](#tag2)   
3. [Using 5-fold cross validation to get accuracy scores](#tag3)      
4. [Print accuracy values](#tag4)
5. [Use best model to predict on test data](#tag5)
6. [Submit your prediction](#tag6)

In [1]:
import numpy as np 
import pandas as pd 

train = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/test.csv')
submission = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2021/sample_submission.csv')

<a id='tag1'></a>
## 1) [Fill Missing values, EDA and FE](#content-table)

- All missing values filled
- Created 3 new features : 'Cabin_type', 'Ticket_type', 'total_members'
- Dropped 6 features : 'Cabin', 'Ticket', 'SibSp', 'Parch', 'Name', 'PassengerId'
- One hot encoding 
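
The steps listed above can be illustrated on a toy frame. This is a minimal sketch rather than the notebook's pipeline: the `toy` DataFrame and its values are made up for illustration, and only the group-wise mean fill for `Fare`, the placeholder fill for `Embarked`, and the one-hot encoding with `drop_first=True` mirror what the next cell does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the competition files)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3],
    "Fare":     [80.0, np.nan, 10.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Fill missing 'Fare' with the mean fare of the same 'Pclass'
toy["Fare"] = toy.groupby("Pclass")["Fare"].transform(lambda s: s.fillna(s.mean()))

# Mark missing 'Embarked' with a placeholder category
toy["Embarked"] = toy["Embarked"].fillna("X")

# One-hot encode; drop_first=True drops one level per categorical column
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping the first level avoids a redundant column per categorical feature, since that level is implied when all the remaining indicator columns are zero.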

In [2]:
# Combine train and test
test['Survived'] = -1
all_data = pd.concat([train, test])

# Fill 'Age', 'Embarked'
for col in ['Age']:
    all_data[col] = all_data[col].fillna(all_data[col].mean())
    
for col in ['Embarked']:
    all_data[col] = all_data[col].fillna('X') 

# Fill 'Fare'
all_data['Fare'] = all_data.groupby('Pclass')['Fare'].transform(lambda x: x.fillna(x.mean()))
    
# New feature 'Cabin_type' (first character of 'Cabin', 'X' if missing), drop 'Cabin'
all_data['Cabin_type'] = all_data['Cabin'].fillna('X').map(lambda x: x[0].split()[0])    
all_data.drop('Cabin', axis = 1, inplace = True)

# New feature 'Ticket_type', drop 'Ticket'
import re

def fn1(x):
    """Map a raw ticket string to a coarse ticket-type label based on its prefix."""
    if isinstance(x, str):
        # Raw strings avoid invalid-escape warnings for sequences such as \d and \.
        if len(re.findall(r"^\d+$", x))>0:
            return 'type1'
        if len(re.findall(r"^(A\.|A/S|A/5|A/4|AQ/4|AQ/3|A4)", x))>0:
            return 'type2'
        if len(re.findall(r"^(C|CA|CA\.|C\.A\.)", x))>0:
            return 'type3'
        if len(re.findall(r"^(SC|S\.C\.|SC/PARIS|S\.C\./PARIS|SC/Paris|SC/AH|S\.C\./A\.4)", x))>0:
            return 'type4'
        if len(re.findall(r"^(PC|PP|P\.P|P/PP)", x))>0:
            return 'type5'
        if len(re.findall(r"^(W\.C\.|W./C\.|W/C)", x))>0:
            return 'type6'
        if len(re.findall(r"^(SOTON/O\.Q|SOTON/OQ|STON/O|STON/O2|SOTON/O2)", x))>0:
            return 'type7'
        if len(re.findall(r"^(WE/P|W\.E\.P)", x))>0:
            return 'type8'
        if len(re.findall(r"^(F\.C|F\.C\.C|Fa)", x))>0:
            return 'type9'
        if len(re.findall(r"^(LP)", x))>0:
            return 'type10'
        if len(re.findall(r"^(S\.O\.C|S\.P|S\.O|P\.P|SO/C)", x))>0:
            return 'type11'
        if len(re.findall(r"^(S\.W\./PP|SW/PP)", x))>0:
            return 'type12'
        # No known prefix matched: keep the raw ticket string as its own category
        return x
    else:
        # Missing (non-string) tickets are grouped with the purely numeric ones
        return 'type1'
    
all_data['Ticket_type'] = all_data['Ticket'].apply(lambda x: fn1(x))
all_data.drop('Ticket', axis = 1, inplace = True)

# New feature 'total_members', drop 'SibSp' and 'Parch'
all_data['total_members'] = all_data['SibSp'] + all_data['Parch'] + 1
all_data.drop(['SibSp', 'Parch'], axis = 1, inplace = True)

# drop 'Name' and 'PassengerId'
all_data.drop(['Name', 'PassengerId'], axis = 1, inplace = True)

# Check missing values
print("% Missing values in each column :")
# Divide by the number of rows in the combined frame (train + test), not by len(train)
print(all_data.isnull().sum()/len(all_data)*100)

# Get the categorical columns
cols = [col for col in all_data.columns if all_data[col].dtype == 'object']

# One-hot-encode
all_data = pd.get_dummies(all_data, drop_first = True)

# all_data --> train, test
n_train = len(train)
train = all_data.iloc[:n_train].copy()   # This will create copy of the df. Done to avoid future warnings
test = all_data.iloc[n_train:].copy()

# Remove 'Survived' column from test data
test.drop('Survived', axis = 1,inplace = True)

% Missing values in each column :
Survived         0.0
Pclass           0.0
Sex              0.0
Age              0.0
Fare             0.0
Embarked         0.0
Cabin_type       0.0
Ticket_type      0.0
total_members    0.0
dtype: float64


<a id='tag2'></a>
## 2) [Define all the models to test](#content-table)

- Import libraries
- Create a list of models

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier


models = [
          LogisticRegression(solver = 'liblinear', random_state = 42), 
          RidgeClassifier(random_state = 42),
          SGDClassifier(random_state = 42),
          DecisionTreeClassifier(random_state = 42),
          RandomForestClassifier(random_state = 42),
          ExtraTreesClassifier(random_state = 42),
          LGBMClassifier(verbose = -1),
          XGBClassifier(verbosity = 0, use_label_encoder = False),
          CatBoostClassifier(verbose = 0)
]

model_names = []           #store the names of all the models.

for model in models:
    model_names.append(type(model).__name__)
    print(f"{type(model).__name__} :- {model}")

LogisticRegression :- LogisticRegression(random_state=42, solver='liblinear')
RidgeClassifier :- RidgeClassifier(random_state=42)
SGDClassifier :- SGDClassifier(random_state=42)
DecisionTreeClassifier :- DecisionTreeClassifier(random_state=42)
RandomForestClassifier :- RandomForestClassifier(random_state=42)
ExtraTreesClassifier :- ExtraTreesClassifier(random_state=42)
LGBMClassifier :- LGBMClassifier(verbose=-1)
XGBClassifier :- XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              random_state=None, reg_alpha=None, reg_lambda=None,
              scale_pos_weight=None, subsample=None, tree_met

<a id='tag3'></a>
## 3) [Using 5-fold cross validation to get accuracy scores](#content-table)
- Use stratified k fold to train and get accuracy score
- Save all the models in each fold to use later for prediction of test data
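
As a rough illustration of what stratification provides, the sketch below splits a made-up imbalanced label vector with `StratifiedKFold` and records the share of positives in each validation fold. The arrays `X_toy` and `y_toy` are hypothetical; the splitter settings are the same as in the training cell that follows.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 30 positives, 70 negatives
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)   # dummy single feature

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_rates = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(X_toy, y_toy)):
    rate = y_toy[valid_idx].mean()             # share of positives in this validation fold
    fold_rates.append(rate)
    print(f"Fold : {fold}, validation size : {len(valid_idx)}, positive rate : {rate:.2f}")
```

Because every fold keeps roughly the class balance of the full data, per-fold accuracies are comparable with each other and with the average.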

In [4]:
# Get the feature and target data
X = train.drop('Survived', axis = 1)
y = train['Survived'].copy()

In [5]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import joblib
import time

accuracy = []  
time_taken = []

for model in models:
    
    print(f"\nTraining {type(model).__name__} : ")
       
    skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)

    acc = 0
    
    start = time.time()

    for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
        X_train, X_valid = X.iloc[train_idx, :], X.iloc[valid_idx, :]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]
        model.fit(X_train.values, y_train)
        y_pred = model.predict(X_valid.values)
        acc += accuracy_score(y_valid, y_pred)
        
        print(f"Fold : {fold}, Accuracy : {accuracy_score(y_valid, y_pred)}")
        joblib.dump(model, f"base_{type(model).__name__}_{fold}")        # save all the models
    
    end = time.time()
    
    time_taken.append((end-start)/5)  
    accuracy.append(acc/5*100)                # get the average accuracy of 5 folds
    
    print(f"Average accuracy : {acc/5: 0.4f},  Average time taken per fold: {(end-start)/5: 0.3f} s")


Training LogisticRegression : 
Fold : 0, Accuracy : 0.77665
Fold : 1, Accuracy : 0.77905
Fold : 2, Accuracy : 0.77595
Fold : 3, Accuracy : 0.77245
Fold : 4, Accuracy : 0.77195
Average accuracy :  0.7752,  Average time taken per fold:  0.716 s

Training RidgeClassifier : 
Fold : 0, Accuracy : 0.775
Fold : 1, Accuracy : 0.7772
Fold : 2, Accuracy : 0.7742
Fold : 3, Accuracy : 0.77275
Fold : 4, Accuracy : 0.771
Average accuracy :  0.7740,  Average time taken per fold:  0.157 s

Training SGDClassifier : 
Fold : 0, Accuracy : 0.69115
Fold : 1, Accuracy : 0.70845
Fold : 2, Accuracy : 0.6806
Fold : 3, Accuracy : 0.69755
Fold : 4, Accuracy : 0.76455
Average accuracy :  0.7085,  Average time taken per fold:  3.105 s

Training DecisionTreeClassifier : 
Fold : 0, Accuracy : 0.6897
Fold : 1, Accuracy : 0.68965
Fold : 2, Accuracy : 0.69395
Fold : 3, Accuracy : 0.69175
Fold : 4, Accuracy : 0.69155
Average accuracy :  0.6913,  Average time taken per fold:  0.582 s

Training RandomForestClassifier : 


<a id='tag4'></a>
## 4) [Print accuracy values](#content-table)
- Print accuracy values in descending order

In [6]:
df = pd.DataFrame({'Model': model_names, 'Accuracy': accuracy, 'Time per fold': time_taken})
ranked = df.sort_values('Accuracy', ascending = False).reset_index(drop = True)
ranked.index += 1    # start the ranking at 1 instead of 0
ranked

|   | Model | Accuracy (%) | Time per fold (s) |
|---|---|---|---|
| 1 | LGBMClassifier | 77.99 | 0.636119 |
| 2 | CatBoostClassifier | 77.917 | 18.582003 |
| 3 | XGBClassifier | 77.768 | 5.043561 |
| 4 | LogisticRegression | 77.521 | 0.715711 |
| 5 | RidgeClassifier | 77.403 | 0.157305 |
| 6 | RandomForestClassifier | 74.065 | 13.15104 |
| 7 | ExtraTreesClassifier | 72.119 | 13.560322 |
| 8 | SGDClassifier | 70.846 | 3.104784 |
| 9 | DecisionTreeClassifier | 69.132 | 0.58215 |


<a id='tag5'></a>
## 5) [Use best model to predict on test data](#content-table)
- Load the best model 
- Predict on test data
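
The prediction step averages the positive-class probabilities of the saved fold models and thresholds the mean at 0.5. A minimal NumPy sketch of that averaging, using made-up probabilities (three hypothetical fold models, four rows) in place of real `predict_proba` output:

```python
import numpy as np

# Hypothetical positive-class probabilities from 3 fold models for 4 test rows
fold_probs = [
    np.array([0.9, 0.4, 0.2, 0.6]),
    np.array([0.8, 0.6, 0.1, 0.3]),
    np.array([0.7, 0.2, 0.3, 0.3]),
]

mean_probs = np.mean(fold_probs, axis=0)     # average across models, one value per row
labels = np.where(mean_probs > 0.5, 1, 0)    # same 0.5 threshold as the notebook cell
print(mean_probs, labels)
```

Averaging probabilities instead of counting hard votes lets a confident fold model outweigh uncertain ones, as in the last row, where one model predicts 0.6 but the mean stays below the threshold.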

In [7]:
model_name = model_names[np.argmax(accuracy)]
print(f"Predicting using {model_name}...")

preds = []

for fold in range(5):
    model = joblib.load(f"base_{model_name}_{fold}")
    preds.append(model.predict_proba(test.values)[:, 1])

preds = np.mean(preds, axis = 0)    # Average probabilities of 5 models
y_pred = np.where(preds>0.5, 1, 0)  # Predictions

print("Predictions saved.")

Predicting using LGBMClassifier...
Predictions saved.


<a id='tag6'></a>
## 6) [Submit your prediction](#content-table)

In [8]:
submission.loc[:, 'Survived'] = y_pred
submission.to_csv('submission_base_models.csv', index = False)
pd.read_csv("submission_base_models.csv")

|   | PassengerId | Survived |
|---|---|---|
| 0 | 100000 | 0 |
| 1 | 100001 | 1 |
| 2 | 100002 | 1 |
| 3 | 100003 | 0 |
| 4 | 100004 | 1 |
| ... | ... | ... |
| 99995 | 199995 | 1 |
| 99996 | 199996 | 0 |
| 99997 | 199997 | 0 |
| 99998 | 199998 | 1 |


______________