## MLBootcamp21 Week 4 Assignment: Grid Search and Cross-Validation
- This week we focus on improving our existing models with these two techniques.
- Besides feature engineering, they are good ways to improve your model's performance.

## Q1. Use GridSearchCV, and explore the other search techniques provided by scikit-learn, to tune your hyperparameters and obtain the best model

- Try out Decision Trees, Random Forest, Gradient Boosting, AdaBoost and XGBoost.
- First run each model with its default parameters on the features you have generated so far, to get a baseline result.
- Then apply GridSearchCV or another search technique provided by scikit-learn to tune the hyperparameters, and report the results of the best model.
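
One of the "other" search techniques in scikit-learn is `RandomizedSearchCV`, which samples a fixed number of parameter combinations instead of trying every one. The sketch below is only an illustration: the synthetic data from `make_classification`, the parameter ranges, and `n_iter=10` are assumptions, not part of the assignment.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption); swap in your own features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Distributions (or lists) to sample from, instead of an exhaustive grid.
param_distributions = {
    "n_estimators": randint(10, 100),   # integers in [10, 100)
    "max_leaf_nodes": randint(2, 11),   # integers in [2, 11)
    "criterion": ["gini", "entropy"],   # lists are sampled uniformly
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,          # number of sampled combinations
    cv=5,
    random_state=42,    # makes the sampled combinations reproducible
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"mean CV accuracy of best model: {search.best_score_:.3f}")
print(f"held-out accuracy: {search.score(X_test, y_test):.3f}")
```

With `n_iter=10` and `cv=5` this fits 50 models however wide the ranges are, which is why a randomized search is often cheaper than a full grid.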

## Keep "random_state" at 42 or any number of your choice, and report that number so that I can reproduce your results

- Report your performance on the test set after making the submission on Kaggle.

- **Do not copy an existing notebook from Kaggle to get the best results: you will not learn anything that way, and we can easily tell when that has been done. Do whatever you can.**

In [None]:
# Write your code from this cell
# It need not be in a single cell

# IMPORTS
import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

In [None]:
# READ FILE AND WEEK 3 FEATURES
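df = pd.read_csv("./train.csv")
# One-hot encode Sex and Embarked (rows with a missing Embarked get all zeros)
df[['female','male']] = pd.get_dummies(df['Sex'])
df[["C","Q","S"]] = pd.get_dummies(df["Embarked"])
# Forward-fill the remaining missing values; fillna(method="ffill") is deprecated in recent pandas releases
df.ffill(inplace=True)
# Drop the raw columns that are already encoded or not used as features
drop_features = ["Sex",'Ticket','Name','Cabin',"Embarked"]
df.drop(drop_features,inplace=True,axis=1)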

In [None]:
# TRAIN TEST SPLIT
# Features are every column from 'Pclass' onwards; the target is 'Survived'
x_train, x_test, y_train, y_test = train_test_split(
    df.loc[:, 'Pclass':], df.Survived, test_size=0.2, random_state=42)

In [None]:
# DECISION TREE MODEL
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(x_train,y_train)
dt_model_predictions = dt_model.predict(x_test)
print(classification_report(y_test,dt_model_predictions))

              precision    recall  f1-score   support

           0       0.82      0.82      0.82       105
           1       0.74      0.74      0.74        74

    accuracy                           0.79       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179



In [None]:
# DECISION TREE MODEL WITH GRIDSEARCHCV
parameter_grid = {
    'criterion': ['gini','entropy'],
    'splitter': ['best','random'],
    'max_leaf_nodes': [2,4,5,7,10],
    'max_features': ['sqrt','log2']  # 'auto' (same as 'sqrt' here) was removed in newer scikit-learn releases
}
dt_model_gridsearched = GridSearchCV(cv=5,estimator=DecisionTreeClassifier(random_state=42),
                                     param_grid = parameter_grid)
dt_model_gridsearched.fit(x_train,y_train)
dt_model_gridsearched.best_params_
print(classification_report(y_test,dt_model_gridsearched.predict(x_test)))

              precision    recall  f1-score   support

           0       0.80      0.88      0.84       105
           1       0.80      0.69      0.74        74

    accuracy                           0.80       179
   macro avg       0.80      0.78      0.79       179
weighted avg       0.80      0.80      0.80       179
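
In a notebook, only the last expression of a cell is echoed, so a bare `best_params_` line followed by a `print(...)` call displays nothing for it. A minimal sketch of how a fitted search can be inspected explicitly is shown here; the synthetic data and the small grid are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); use x_train / y_train in the notebook.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={
        "criterion": ["gini", "entropy"],
        "max_leaf_nodes": [2, 4, 5, 7, 10],
    },
    cv=5,
)
grid.fit(X, y)

# Print explicitly so the values appear even when they are not the last line.
print(grid.best_params_)
print(f"best mean CV accuracy: {grid.best_score_:.3f}")

# cv_results_ holds one row per parameter combination.
cv_table = (
    pd.DataFrame(grid.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(cv_table.head())
```

Note that `best_score_` is the mean cross-validated score on the training folds, so it will usually differ somewhat from the accuracy in the classification report on the held-out test split.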



In [None]:
# RANDOM FOREST MODEL
rf_model = RandomForestClassifier(random_state=2)
rf_model.fit(x_train,y_train)
rf_model_predictions = rf_model.predict(x_test)
print(classification_report(y_test,rf_model_predictions))

              precision    recall  f1-score   support

           0       0.83      0.87      0.85       105
           1       0.80      0.76      0.78        74

    accuracy                           0.82       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179



In [None]:
# RANDOM FOREST MODEL WITH GRIDSEARCHCV
parameter_grid = {
    'bootstrap': [True, False],
    'criterion': ['gini','entropy'],
    'n_estimators': [10,20,50,100,500],
    'max_leaf_nodes': [2,4,5,7,10],
    'max_features': ['sqrt','log2']  # 'auto' (same as 'sqrt' here) was removed in newer scikit-learn releases
}
rf_model_gridsearched = GridSearchCV(cv=5,estimator=RandomForestClassifier(random_state=42),
                                     param_grid = parameter_grid)
rf_model_gridsearched.fit(x_train,y_train)
rf_model_gridsearched.best_params_
print(classification_report(y_test,rf_model_gridsearched.predict(x_test)))

              precision    recall  f1-score   support

           0       0.80      0.88      0.84       105
           1       0.80      0.69      0.74        74

    accuracy                           0.80       179
   macro avg       0.80      0.78      0.79       179
weighted avg       0.80      0.80      0.80       179



In [None]:
# GRADIENT BOOSTING MODEL
gb_model = GradientBoostingClassifier(learning_rate=0.5,random_state=42)
gb_model.fit(x_train,y_train)
gb_model_predictions = gb_model.predict(x_test)
print(classification_report(y_test,gb_model_predictions))

              precision    recall  f1-score   support

           0       0.83      0.82      0.83       105
           1       0.75      0.77      0.76        74

    accuracy                           0.80       179
   macro avg       0.79      0.79      0.79       179
weighted avg       0.80      0.80      0.80       179



In [None]:
# GRADIENT BOOSTING MODEL WITH GRIDSEARCHCV
parameter_grid = {
    'loss': ['log_loss','exponential'],  # 'deviance' was renamed to 'log_loss'; keep 'deviance' on scikit-learn < 1.1
    'learning_rate': [0.1,0.3,0.5,0.7],
    'n_estimators': [10,20,50,100],
    'max_leaf_nodes': [2,3,5,7,10]
}
gb_model_gridsearched = GridSearchCV(cv=5,estimator=GradientBoostingClassifier(random_state=42),
                                     param_grid = parameter_grid)
gb_model_gridsearched.fit(x_train,y_train)
gb_model_gridsearched.best_params_
print(classification_report(y_test,gb_model_gridsearched.predict(x_test)))

              precision    recall  f1-score   support

           0       0.79      0.88      0.83       105
           1       0.79      0.68      0.73        74

    accuracy                           0.79       179
   macro avg       0.79      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179



In [None]:
# ADA BOOST MODEL
ab_model = AdaBoostClassifier(random_state=42)
ab_model.fit(x_train,y_train)
ab_model_predictions = ab_model.predict(x_test)
print(classification_report(y_test,ab_model_predictions))

              precision    recall  f1-score   support

           0       0.82      0.79      0.81       105
           1       0.72      0.76      0.74        74

    accuracy                           0.78       179
   macro avg       0.77      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179



In [None]:
# ADA BOOST MODEL WITH GRIDSEARCHCV
parameter_grid = {
    'algorithm': ['SAMME','SAMME.R'],  # 'SAMME.R' is deprecated in recent scikit-learn releases; drop it if it raises an error
    'n_estimators': [5,10,20,50,100],
    'learning_rate': [0.5,1.0,1.5]
}
ab_model_gridsearched = GridSearchCV(cv=5,estimator=AdaBoostClassifier(random_state=42),
                                     param_grid = parameter_grid)
ab_model_gridsearched.fit(x_train,y_train)
ab_model_gridsearched.best_params_
print(classification_report(y_test,ab_model_gridsearched.predict(x_test)))

              precision    recall  f1-score   support

           0       0.83      0.84      0.83       105
           1       0.77      0.76      0.76        74

    accuracy                           0.80       179
   macro avg       0.80      0.80      0.80       179
weighted avg       0.80      0.80      0.80       179



In [None]:
# XGBOOST MODEL
xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(x_train,y_train)
xgb_model_predictions = xgb_model.predict(x_test)
print(classification_report(y_test,xgb_model_predictions))

              precision    recall  f1-score   support

           0       0.80      0.89      0.84       105
           1       0.81      0.69      0.74        74

    accuracy                           0.80       179
   macro avg       0.81      0.79      0.79       179
weighted avg       0.80      0.80      0.80       179



In [None]:
# XGBOOST MODEL WITH GRIDSEARCHCV
parameter_grid = {
    'max_depth': [2,3,4,5,6],
    'learning_rate': [0.1,0.2,0.3],
    'n_estimators': [10,20,50,100],
    'min_child_weight': [1,2,5,10],
    'max_delta_step': [0,1,2]
}
xgb_model_gridsearched = GridSearchCV(cv=5,estimator=XGBClassifier(random_state=42),
                                      param_grid=parameter_grid)
xgb_model_gridsearched.fit(x_train,y_train)
xgb_model_gridsearched.best_params_
print(classification_report(y_test,xgb_model_gridsearched.predict(x_test)))

              precision    recall  f1-score   support

           0       0.81      0.90      0.86       105
           1       0.84      0.70      0.76        74

    accuracy                           0.82       179
   macro avg       0.83      0.80      0.81       179
weighted avg       0.82      0.82      0.82       179



In [None]:
# RESULT FROM KAGGLE
best_submission = 0.78468   #from rf model

## Q2. There might be times when you would only like to do K-Fold Cross-Validation and not run the time-consuming grid search every time. That is what you will be doing in this question.

- Read the documentation for K-Fold cross-validation provided by scikit-learn here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

- Try to understand and run the code provided by scikit-learn. This class splits your dataset into "K" folds and gives you the train and test indices for each of them.

- Then use the K-Fold cross-validation technique to generate the "K" folds and report the accuracy obtained on each of them.

- You should get "K" different accuracy values. Take the average of all "K" accuracies to get your final model performance.

- This exercise will help you really understand K-Fold cross-validation.
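
The steps above can be sketched as an explicit loop over the indices that `KFold.split` yields. This is a minimal illustration, not the required solution: the synthetic data, `n_splits=5`, and the random forest settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption); use your own features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = []

# split() yields one (train indices, validation indices) pair per fold.
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Fit a fresh model on each fold so nothing leaks between folds.
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    acc = accuracy_score(y[val_idx], model.predict(X[val_idx]))
    fold_accuracies.append(acc)
    print(f"fold {fold}: accuracy = {acc:.3f}")

print(f"mean accuracy over {kf.get_n_splits()} folds: {np.mean(fold_accuracies):.3f}")
```

With a pandas DataFrame, index the rows positionally with `.iloc[train_idx]` and `.iloc[val_idx]`. `cross_val_score` with the same `KFold` object wraps this loop in a single call.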

In [None]:
## Write your code from here
kf = KFold(n_splits=10, random_state=42, shuffle=True)
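# Using the best params; max_features='sqrt' is what 'auto' meant for this estimator
# ('auto' is rejected by newer scikit-learn releases)
clf = RandomForestClassifier(bootstrap=False, criterion='entropy', max_features='sqrt',
                             max_leaf_nodes=10, n_estimators=500, random_state=42)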

X = df.loc[:,'Pclass':]
Y = df.loc[:,'Survived']
scoring = 'accuracy'
results = cross_val_score(clf, X.values, Y.values, cv=kf, n_jobs=1, scoring=scoring)
results

array([0.82222222, 0.78651685, 0.84269663, 0.76404494, 0.88764045,
       0.87640449, 0.79775281, 0.7752809 , 0.76404494, 0.88764045])

In [None]:
print(results.mean())

0.8204244694132334


In [None]:
## That's it for this week. There are only two questions, but they are a bit time-consuming, so enjoy! I hope you reach at least 85-90% on Kaggle before tapping into Deep Learning.