## MLBootcamp21 Week 4 Assignment: Gridsearch and CrossValidation
- This week we will focus on improving our existing models using these two techniques.
- Alongside feature engineering, both are effective ways to improve your model's performance.
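
As a rough illustration of how the two techniques combine, the sketch below runs a small grid search in which every candidate is scored by 5-fold cross-validation. It is only a sketch under assumptions: the data is synthetic (`make_classification`), and the grid values are arbitrary, not recommendations for the assignment dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; use your own features and labels).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# 3 depths x 2 criteria = 6 candidate settings.
param_grid = {"max_depth": [2, 4, 6], "criterion": ["gini", "entropy"]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)  # 6 candidates x 5 folds = 30 fits, plus one refit on all data

# cv_results_ holds the per-fold and mean scores of every candidate.
results = pd.DataFrame(grid.cv_results_)
n_candidates = len(results)
print(results[["params", "mean_test_score", "rank_test_score"]])
print(grid.best_params_, round(grid.best_score_, 3))
```

Looking at the whole `cv_results_` table, rather than only `best_params_`, shows how close the runner-up settings are to the winner.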

## Q1. Use GridSearchCV and also explore other GridSearch techniques provided by scikit-learn to tune your hyperparameters and obtain the best model

- Try out Decision Trees, Random Forest, GradientBoost, AdaBoost and also XGBoost
- First run each model with its default parameters on the features you have generated so far, to get a baseline result
- Then apply GridSearchCV or other GridSearch techniques provided by scikit-learn to tune the hyperparameters and get results on the best model
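
One of the other search techniques scikit-learn provides is `RandomizedSearchCV`, which samples a fixed number of candidates from parameter distributions instead of trying every combination in a grid. The sketch below is a minimal illustration under assumptions: the data is synthetic (`make_classification`), and the parameter ranges and `n_iter` are arbitrary choices, not tuned values for this assignment.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; replace with your own features and labels).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Distributions (or lists) to sample from, instead of an exhaustive grid.
param_distributions = {
    "n_estimators": randint(10, 101),   # integers from 10 to 100
    "max_leaf_nodes": randint(2, 21),   # integers from 2 to 20
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,         # number of sampled candidates
    cv=5,
    random_state=42,   # makes the sampled candidates reproducible
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

The cost is controlled by `n_iter` rather than by the size of the grid, which can help when a full grid over several parameters takes too long.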

## Keep the "random_state" number as 42 or any number of your choice, and report that number so that I can reproduce the same results

- Report your performance on the test set after making the submission on Kaggle.

- **Do not use some random existing notebook on Kaggle to get the best results, as you will not learn anything that way and we will be able to easily know if that has been done. Do whatever you can.**
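
On the `random_state` requirement above: fixing the seed is what makes a shuffled split repeatable. The toy arrays below are illustrative only and stand in for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # toy features (illustrative only)
y = np.arange(20) % 2              # toy binary labels

# Two calls with the same seed produce the same shuffled split.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

same_seed_equal = np.array_equal(X_test_a, X_test_b)
print(same_seed_equal)  # True
```

The same idea applies to the `random_state` argument of the models and of `KFold`: report the value used so the results can be reproduced.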

In [None]:
# Write your code from this cell
# It need not be in a single cell

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

df_train = pd.read_csv("./train.csv")

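# Assign the filled column back: fillna(inplace=True) on a column selection
# may not update df_train in recent pandas versions (copy-on-write).
df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode()[0])
df_train['Fare'] = df_train['Fare'].fillna(df_train['Fare'].median())
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].median())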

df_train[['female','male']] = pd.get_dummies(df_train['Sex'])
df_train[["C","Q","S"]] = pd.get_dummies(df_train["Embarked"])
df_train.ffill(inplace=True)  # fillna(method="ffill") is deprecated in recent pandas
drop_features = ["Sex",'Ticket','Name','Cabin',"Embarked",'female']
df_train.drop(drop_features,inplace=True,axis=1)

x_train, x_test, y_train, y_test = train_test_split(
    df_train.loc[:, 'Pclass':], df_train.Survived, random_state=42, test_size=0.2)
#Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(x_train,y_train)
dt_model_predictions = dt_model.predict(x_test)
print("Decision Tree Orginal")
print(classification_report(y_test,dt_model_predictions))

#Decision Tree GridSearch CV
parameter_grid = {
    'criterion' : ['gini','entropy'],
    'max_leaf_nodes' : [2,4,5,7,10,15,17,20],
    'max_features' : ['sqrt','log2'],  # 'auto' (same as 'sqrt' here) was removed in scikit-learn 1.3
}

dt_model_gridsearched = GridSearchCV(cv=5,estimator=DecisionTreeClassifier(random_state=42), param_grid = parameter_grid)
dt_model_gridsearched.fit(x_train,y_train)
print(dt_model_gridsearched.best_params_)
print("Decision Tree GridSearched")
print(classification_report(y_test,dt_model_gridsearched.predict(x_test)))

#Random Forest
rf_model = RandomForestClassifier(random_state=42,bootstrap=False,criterion='entropy',max_features='log2',max_leaf_nodes=10,n_estimators=100)
rf_model.fit(x_train,y_train)
rf_model_predictions = rf_model.predict(x_test)
print("Random Forest Orginal")
print(classification_report(y_test,rf_model_predictions))

#Random Forest GridSearch CV
parameter_grid = {
    'bootstrap' : [True,False],
    'criterion' : ['gini','entropy'],
    'n_estimators' : [10,20,50,100],
    'max_leaf_nodes' : [2,4,5,7,10],
    'max_features' : ['sqrt','log2']  # 'auto' (same as 'sqrt' here) was removed in scikit-learn 1.3
}

rf_model_gridsearched = GridSearchCV(cv=5,estimator=RandomForestClassifier(random_state=42), param_grid = parameter_grid)
rf_model_gridsearched.fit(x_train,y_train)
print(rf_model_gridsearched.best_params_)
print("Random Forest GridSearched")
print(classification_report(y_test,rf_model_gridsearched.predict(x_test)))

#Gradient Boost
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(x_train,y_train)
gb_model_predictions = gb_model.predict(x_test)
print("Gradient Boost Original")
print(classification_report(y_test,gb_model_predictions))

#Gradient Boost GridSearch CV
parameter_grid = {
    'loss' : ['log_loss','exponential'],  # 'deviance' was renamed to 'log_loss' (scikit-learn >= 1.1)
    'learning_rate' : [0.1,0.3,0.5,0.7,0.9],
    'n_estimators' : [10,20,50,100],
    'max_leaf_nodes' : [2,4,5,7,10],
}

gb_model_gridsearched = GridSearchCV(cv=5,estimator=GradientBoostingClassifier(random_state=42), param_grid = parameter_grid)
gb_model_gridsearched.fit(x_train,y_train)
print(gb_model_gridsearched.best_params_)
print("Gradient Boost GridSearched")
print(classification_report(y_test,gb_model_gridsearched.predict(x_test)))

#AdaBoost
ab_model = AdaBoostClassifier(random_state=42)
ab_model.fit(x_train,y_train)
ab_model_predictions = ab_model.predict(x_test)
print("AdaBoost Original")
print(classification_report(y_test,ab_model_predictions))

#AdaBoost GridSearch CV
parameter_grid = {
    'learning_rate' : [0.1,0.3,0.5,0.7,0.9],
    'n_estimators' : [10,20,50,100],
    'algorithm' : ['SAMME','SAMME.R']  # 'SAMME.R' is deprecated since scikit-learn 1.4; drop it on newer versions
}

ab_model_gridsearched = GridSearchCV(cv=5,estimator=AdaBoostClassifier(random_state=42), param_grid = parameter_grid)
ab_model_gridsearched.fit(x_train,y_train)
print(ab_model_gridsearched.best_params_)
print("AdaBoost GridSearched")
print(classification_report(y_test,ab_model_gridsearched.predict(x_test)))

#XGBoost
xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(x_train,y_train)
xgb_model_predictions = xgb_model.predict(x_test)
print("XGBoost Original")
print(classification_report(y_test,xgb_model_predictions))

#XGBoost GridSearch CV
parameter_grid = {
    'learning_rate' : [0.1,0.3,0.5,0.7,0.9],
    'n_estimators' : [10,20,50,100],
    'booster' : ['gbtree','gblinear','dart'],
    'max_depth' : [1,3,5,7]
}

xgb_model_gridsearched = GridSearchCV(cv=5,estimator=XGBClassifier(random_state=42), param_grid = parameter_grid)
xgb_model_gridsearched.fit(x_train,y_train)
print(xgb_model_gridsearched.best_params_)
print("XGBoost GridSearched")
print(classification_report(y_test,xgb_model_gridsearched.predict(x_test)))

#Test Submission
df_test = pd.read_csv("./test.csv")

# Assign the filled column back (see the note in the training preprocessing).
df_test['Embarked'] = df_test['Embarked'].fillna(df_test['Embarked'].mode()[0])
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].median())
df_test['Age'] = df_test['Age'].fillna(df_test['Age'].median())

df_test[['female','male']] = pd.get_dummies(df_test['Sex'])
df_test[["C","Q","S"]] = pd.get_dummies(df_test["Embarked"])
df_test.ffill(inplace=True)  # fillna(method="ffill") is deprecated in recent pandas
drop_features = ["Sex",'Ticket','Name','Cabin',"Embarked",'female']
df_test.drop(drop_features,inplace=True,axis=1)

#Gradient Boosting got the best results
predictions_for_submission = gb_model_gridsearched.predict(df_test.loc[:,'Pclass':])
df_submission = df_test[['PassengerId']].copy()
df_submission['Survived'] = predictions_for_submission
# index=False keeps the file to the two expected columns: PassengerId and Survived
df_submission.to_csv("submission_v1.csv", index=False)

#Kaggle Submission
results = 0.79425


Decision Tree Orginal
              precision    recall  f1-score   support

           0       0.84      0.78      0.81       105
           1       0.72      0.78      0.75        74

    accuracy                           0.78       179
   macro avg       0.78      0.78      0.78       179
weighted avg       0.79      0.78      0.78       179

Random Forest Orginal
              precision    recall  f1-score   support

           0       0.79      0.91      0.85       105
           1       0.84      0.66      0.74        74

    accuracy                           0.81       179
   macro avg       0.82      0.79      0.80       179
weighted avg       0.81      0.81      0.81       179

Gradient Boost Original
              precision    recall  f1-score   support

           0       0.82      0.90      0.85       105
           1       0.83      0.72      0.77        74

    accuracy                           0.82       179
   macro avg       0.82      0.81      0.81       179
weight

## Q2. There might be times when you would only like to do K-Fold Cross Validation and not run the time-consuming GridSearch every time. That is what you will be doing in this question.

- Read the documentation for K-Fold cross-validation provided by scikit-learn here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

- Try to understand and run the code provided by scikit-learn. `KFold` splits your dataset into "K" folds and gives you the train and test indices for each of the "K" folds

- Then use the K-Fold cross-validation technique to generate the "K" folds and report the accuracy obtained for each of the "K" folds

- You should get "K" different accuracy values; the average of all "K" accuracies is your final model performance

- This exercise will help you really understand K-Fold CrossValidation
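
The steps above can be sketched as an explicit loop over the folds. This is a minimal illustration under assumptions: the data is synthetic (`make_classification`), and the classifier settings and `n_splits=5` are arbitrary stand-ins for whatever model and "K" you choose.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; use your own features and labels).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = []

# kf.split yields one (train indices, validation indices) pair per fold.
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    clf = DecisionTreeClassifier(random_state=42, max_leaf_nodes=20)
    clf.fit(X[train_idx], y[train_idx])
    acc = accuracy_score(y[val_idx], clf.predict(X[val_idx]))
    fold_accuracies.append(acc)
    print(f"Fold {fold}: accuracy = {acc:.4f}")

# The final performance estimate is the mean of the K fold accuracies.
mean_accuracy = float(np.mean(fold_accuracies))
print(f"Mean accuracy over {kf.get_n_splits()} folds: {mean_accuracy:.4f}")
```

With a pandas DataFrame instead of NumPy arrays, the rows would be selected with `.iloc[train_idx]` and `.iloc[val_idx]`.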

In [None]:
## Write your code from here
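# Note: kf is defined here, but the cross_val_score calls below pass cv=5
# (stratified 5-fold for classifiers); pass cv=kf to use this 10-fold splitter instead.
kf = KFold(n_splits=10,shuffle=True,random_state=42)
# max_features='sqrt' replaces 'auto' (same behavior; 'auto' was removed in scikit-learn 1.3)
clf = DecisionTreeClassifier(random_state=42,criterion='gini',max_features='sqrt',max_leaf_nodes=20)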
scores = cross_val_score(clf, x_train, y_train, cv=5)
print(scores.mean())

clf = RandomForestClassifier(random_state=42,bootstrap=True,criterion='entropy',max_features='sqrt',max_leaf_nodes=10,n_estimators=10)
scores = cross_val_score(clf, x_train, y_train, cv=5)
print(scores.mean())

clf = GradientBoostingClassifier(random_state=42,learning_rate=0.5,loss='exponential',max_leaf_nodes=7,n_estimators=20)
scores = cross_val_score(clf, x_train, y_train, cv=5)
print(scores.mean())

clf = AdaBoostClassifier(random_state=42,algorithm='SAMME.R',learning_rate=0.9,n_estimators=100)
scores = cross_val_score(clf, x_train, y_train, cv=5)
print(scores.mean())

clf = XGBClassifier(random_state=42,booster='gbtree',learning_rate=0.3,n_estimators=20)
scores = cross_val_score(clf, x_train, y_train, cv=5)
print(scores.mean())

0.8118290160543682
0.832847434255885
0.8328474342558849
0.8103713188220232
0.8328474342558849


In [None]:
## That's it for this week. There are only two questions, but they are a bit time-consuming, so enjoy! I hope you reach at least 85-90% on Kaggle before tapping into Deep Learning