### This notebook shows the grid searches for the different model types.

For this project, the models were optimized with respect to precision, because it is important not to tell a company that its campaign will most likely succeed when it then fails. We therefore aim to reduce the false positives while still classifying a sufficient number of campaigns as 'most_likely' successful. Nevertheless, assuming that accuracy is easier for a potential stakeholder to interpret, the models' accuracy scores are also saved for comparison.
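To make this trade-off concrete, the following minimal sketch computes precision and accuracy from the four cells of a binary confusion matrix. The counts are made-up illustrative numbers, not results from this project, and the helper names `precision` and `accuracy` are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """Share of campaigns predicted as successful that really succeeded."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all campaigns that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical confusion-matrix counts for two classifiers on 200 campaigns.
model_a = dict(tp=60, fp=40, tn=60, fn=40)   # predicts 'successful' 100 times
model_b = dict(tp=30, fp=10, tn=90, fn=70)   # predicts 'successful' only 40 times

for name, m in [("A", model_a), ("B", model_b)]:
    print(
        f"model {name}: accuracy={accuracy(**m):.2f}, "
        f"precision={precision(m['tp'], m['fp']):.2f}, "
        f"predicted successful={m['tp'] + m['fp']}"
    )
```

In this toy case both classifiers have the same accuracy, but the second one makes fewer false promises at the cost of flagging fewer campaigns as successful, which is the balance described above.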

Not all cells in this notebook need to be executed, because the best models resulting from the grid searches are saved in the models folder. Running the grid searches, especially the one for the KNN algorithm, takes many hours.
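As a rough guide to the cost of re-running the searches, the sketch below counts how many model fits each grid implies. The grids are copied from the commented-out cells further down in this notebook and the five folds match `cv = 5` there; the helper name `count_candidates` is hypothetical.

```python
from math import prod

CV_FOLDS = 5  # matches cv = 5 in the GridSearchCV calls of this notebook

# Parameter grids as defined in the grid-search cells below.
grids = {
    "LogReg": {"penalty": ["l1", "l2"],
               "solver": ["saga", "liblinear"],
               "C": [100.0, 1000.0, 10000.0]},
    "KNN": {"n_neighbors": [3, 5, 7],
            "weights": ["uniform", "distance"],
            "p": [1, 2, 3],
            "algorithm": ["ball_tree", "brute"]},
    "RandomForest": {"n_estimators": range(70, 110, 10),
                     "criterion": ["gini", "entropy", "log_loss"],
                     "min_samples_leaf": range(7, 15, 2),
                     "max_features": ["sqrt", "log2"],
                     "class_weight": ["balanced", "balanced_subsample"],
                     "max_samples": [0.6, 0.7, 0.8]},
    "XGBoost": {"n_estimators": [100, 200, 300, 400, 500],
                "learning_rate": [0.01, 0.1, 1],
                "subsample": [0.25, 0.5, 0.75],
                "max_depth": [7, 13, 23],
                "colsample_bytree": [0.3, 0.5, 0.85, 1]},
}


def count_candidates(grid) -> int:
    """Number of parameter combinations an exhaustive grid search evaluates."""
    return prod(len(values) for values in grid.values())


candidates = {name: count_candidates(grid) for name, grid in grids.items()}
fits = {name: n * CV_FOLDS for name, n in candidates.items()}

for name in grids:
    print(f"{name}: {candidates[name]} candidates, {fits[name]} cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training set. The fit count is only part of the picture: the KNN grid is one of the smaller ones here, so its long run time presumably comes from the cost of each individual fit and prediction rather than from the size of the grid.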

In [None]:
import pandas as pd 
import numpy as np 
import pickle

from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

from utils.evaluation import *                  #contains functions used for model evaluation
from utils.Preprocessing_funct import *         #contains functions used for pre-processing
RSEED = 42


#### Load the differently scaled train- and test-sets

In [None]:
#Load X_train, X_test, y_train, y_test 

y_train = pd.read_csv('data/ytrain.csv')
y_test = pd.read_csv('data/y_test.csv')

X_train = pd.read_csv('data/unscaled_Xtrain.csv')
X_test = pd.read_csv('data/unscaled_Xtest.csv')

X_train_stdscaled = pd.read_csv('data/stdscaled_Xtrain.csv')
X_test_stdscaled = pd.read_csv('data/stdscaled_Xtest.csv')

X_train_mmscaled = pd.read_csv('data/mmscaled_Xtrain.csv')
X_test_mmscaled = pd.read_csv('data/mmscaled_Xtest.csv')


### Models

Various models, including logistic regression, k-nearest neighbours (KNN), random forest and XGBoost, were optimized by performing a grid search. The accuracy scores of the respective best models were saved in the 'accuracies.txt' file and the confusion matrices in the images folder.

In [None]:
accuracies = {}

#### Logistic Regression - Gridsearch 

In [None]:
# ###############################################################################################
# ###### Don't execute this cell if not intending to repeat grid search #########################
# ###############################################################################################

# # Logistic Regression
# logreg_clf = LogisticRegression(max_iter=100000)

# # define parameter grid for grid search
# # params_log_reg_1st_try = {"penalty": ["l1","l2"],
# #                "solver": ["saga", "liblinear"],
# #                "C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]           
# # }

# # the first grid selected the largest C, so larger values of C are tried here

# params_log_reg = {"penalty": ["l1","l2"],
#                "solver": ["saga", "liblinear"],
#                "C": [100.0, 1000.0, 10000.0]           
# }

# # Instantiate gridsearch
# gs_log_reg = GridSearchCV(logreg_clf, param_grid=params_log_reg, cv = 5,n_jobs = -1, scoring='precision')

# # fit gridsearch object to data
# gs_log_reg.fit(X_train_stdscaled, np.ravel(y_train))

In [None]:
# # Best score
# print('Best score:', round(gs_log_reg.best_score_, 3))
# # Best parameters
# print('Best parameters:', gs_log_reg.best_params_)
# print('----------'*8)

# #save best model

# # best_lr = gs_log_reg.best_estimator_
# # pickle.dump(best_lr, open('./models/best_lr_model', 'wb'))

#### Results Logistic Regression Gridsearch
**Best score:** 0.672 \
**Best parameters:** {'C': 1000.0, 'penalty': 'l1', 'solver': 'liblinear'}


#### Load the best Logistic Regression model

In [None]:
with open('./models/best_lr_model', 'rb') as f:
    best_lr = pickle.load(f)     # load best model from the models folder

In [None]:
# Predict on best model
y_pred_best_lr = best_lr.predict(X_test_stdscaled)

# Plot classification report and confusion matrix
vis_results(y_test, y_pred_best_lr, 'LogReg', savefig=True)

# Accuracy score
accuracies['LogReg'] = accuracy_score(y_test, y_pred_best_lr)


#### KNN - Gridsearch

In [None]:
# ###############################################################################################
# ###### Don't execute this cell if not intending to repeat grid search #########################
# ###############################################################################################

# # Train model
# knn_clf = KNeighborsClassifier()

# params_KNN = {"n_neighbors" : [3,5,7], #this actually defines the model you use
#               "weights" : ["uniform", "distance"],
#               "p" : [1, 2, 3],
#               "algorithm": ["ball_tree", "brute"]
#              }

# # Instantiate gridsearch and define the metric to optimize 
# gs_KNN = GridSearchCV(knn_clf, param_grid=params_KNN, cv = 5, n_jobs = -1, scoring='precision', verbose=5) 


# # fit gridsearch object to data
# gs_KNN.fit(X_train_mmscaled, np.ravel(y_train))

In [None]:
# # Best score
# print('Best score:', round(gs_KNN.best_score_, 3))
# # Best parameters
# print('Best parameters:', gs_KNN.best_params_)
# print('----------'*8)

# # save best model

# #best_knn = gs_KNN.best_estimator_
# #pickle.dump(best_knn, open('./models/best_knn_model', 'wb'))  

#### Results KNN Gridsearch 

**Best score:** 0.678 \
**Best parameters:** {'algorithm': 'ball_tree', 'n_neighbors': 5, 'p': 2, 'weights': 'distance'}

#### Load the best KNN model

In [None]:
with open('./models/best_knn_model', 'rb') as f:
    best_knn = pickle.load(f)

In [None]:
# Predict on best model
y_pred_best_knn = best_knn.predict(X_test_mmscaled)

# Plot classification report and confusion matrix
vis_results(y_test, y_pred_best_knn, 'KNN', savefig=True)

# Accuracy score
accuracies['KNN'] = accuracy_score(y_test, y_pred_best_knn)

#### Random Forest - Gridsearch 

In [None]:
# ###############################################################################################
# ###### Don't execute this cell if not intending to repeat grid search #########################
# ###############################################################################################


# # Train model
# rf_clf = RandomForestClassifier()

# params_rf = {"n_estimators": range(70,110,10),
#              "criterion": ['gini', 'entropy', 'log_loss'],
#              "min_samples_leaf": range(7,15,2),
#              "max_features": ['sqrt', 'log2'],
#              "class_weight": ['balanced','balanced_subsample'],
#              "max_samples": [0.6, 0.7, 0.8]
#              }

# # Instantiate gridsearch and define the metric to optimize 
# gs_rf = GridSearchCV(rf_clf, param_grid=params_rf, cv = 5, n_jobs = -1, scoring='precision')


# # fit gridsearch object to data
# gs_rf.fit(X_train, np.ravel(y_train))

In [None]:
# # Best score
# print('Best score:', round(gs_rf.best_score_, 3))
# # Best parameters
# print('Best parameters:', gs_rf.best_params_)
# print('----------'*8)

# #save best model

# # best_rf = gs_rf.best_estimator_
# # pickle.dump(best_rf, open('./models/best_rf_model', 'wb'))

#### Results Random Forest Gridsearch

**Best score:** 0.722 \
**Best parameters:** {'class_weight': 'balanced', 'criterion': 'log_loss', 'max_features': 'sqrt', 'max_samples': 0.8, 'min_samples_leaf': 11, 'n_estimators': 100}

#### Load the best Random Forest model

In [None]:
with open('./models/best_rf_model', 'rb') as f:
    best_rf = pickle.load(f)

In [None]:
# Predict on best model
y_pred_best_rf = best_rf.predict(X_test)

# Plot classification report and confusion matrix
vis_results(y_test, y_pred_best_rf, 'RandomForest', savefig=True)

# Accuracy score
accuracies['RandomForest'] = accuracy_score(y_test, y_pred_best_rf)

#### XGBoost - Gridsearch

In [None]:
# ###############################################################################################
# ###### Don't execute this cell if not intending to repeat grid search #########################
# ###############################################################################################

# # Train model
# xgb_clf = XGBClassifier()

# params_xgb = {"n_estimators": [100, 200, 300, 400, 500],
#              "learning_rate": [0.01, 0.1, 1],
#              "subsample": [0.25, 0.5, 0.75],
#              "max_depth": [7, 13, 23],
#              "colsample_bytree": [0.3, 0.5, 0.85, 1]
#              }

# # Tested a second set of params because the best parameters were at the maximum of max_depth and subsample
# # and at the minimum of n_estimators; however, no improvement was seen.
# # params_xgb2 = {"n_estimators": [50, 100, 200],
#             #  "learning_rate": [0.01, 0.1],
#             #  "subsample": [0.5, 0.75, 1],
#             #  "max_depth": [25, 30, 50],
#             #  "colsample_bytree": [0.85]
#             #  }


# # Instantiate gridsearch and define the metric to optimize 
# gs_xgb = GridSearchCV(xgb_clf, param_grid=params_xgb, cv = 5, n_jobs = -1, scoring='precision', verbose=5)


# # fit gridsearch object to data
# gs_xgb.fit(X_train, np.ravel(y_train))

In [None]:
# # Best score
# print('Best score:', round(gs_xgb.best_score_, 3))
# # Best parameters
# print('Best parameters:', gs_xgb.best_params_)
# print('----------'*8)

# # save best model

# # best_xgb = gs_xgb.best_estimator_
# # pickle.dump(best_xgb, open('./models/best_xgb_model', 'wb'))

#### Results XGBoost Gridsearch 

**Best score:** 0.705 \
**Best parameters:** {'colsample_bytree': 0.85, 'learning_rate': 0.1, 'max_depth': 23, 'n_estimators': 100, 'subsample': 0.75}
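Whether a grid should be extended can be checked by testing if a best parameter sits at the edge of its searched range. The following is a minimal sketch using the first XGBoost grid and the best parameters reported above; the helper `params_on_grid_edge` is hypothetical and only looks at numeric parameters.

```python
def params_on_grid_edge(grid, best_params):
    """Return {param: 'lower' | 'upper'} for numeric best values at the ends of their grid."""
    edges = {}
    for name, best in best_params.items():
        numeric = sorted(v for v in grid.get(name, []) if isinstance(v, (int, float)))
        if len(numeric) < 2 or best not in numeric:
            continue  # nothing to compare against, or a non-numeric parameter
        if best == numeric[0]:
            edges[name] = "lower"
        elif best == numeric[-1]:
            edges[name] = "upper"
    return edges


# First XGBoost grid from the grid-search cell above.
params_xgb = {"n_estimators": [100, 200, 300, 400, 500],
              "learning_rate": [0.01, 0.1, 1],
              "subsample": [0.25, 0.5, 0.75],
              "max_depth": [7, 13, 23],
              "colsample_bytree": [0.3, 0.5, 0.85, 1]}

# Best parameters as reported in the results above.
best_xgb_params = {"colsample_bytree": 0.85, "learning_rate": 0.1, "max_depth": 23,
                   "n_estimators": 100, "subsample": 0.75}

edge_params = params_on_grid_edge(params_xgb, best_xgb_params)
print(edge_params)
```

For these inputs the check flags `max_depth` and `subsample` at the upper end and `n_estimators` at the lower end of their ranges, which matches the motivation given in the grid-search cell for trying a second set of parameters.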

#### Load the best XGBoost model

In [None]:
best_xgb = pickle.load(open('./models/best_xgb_model', 'rb'))

In [None]:
# Predict on best model
y_pred_best_xgb = best_xgb.predict(X_test)

# Plot classification report and confusion matrix
vis_results(y_test, y_pred_best_xgb, 'XGBoost', savefig=True)

# Accuracy score
accuracies['XGBoost'] = accuracy_score(y_test, y_pred_best_xgb)

In [None]:
# save the accuracies of the different models

sorted_accuracies = dict(sorted(accuracies.items(), key=lambda x:x[1]))
# pickle.dump(sorted_accuracies, open('accuracies.txt', 'wb')) 
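Note that `pickle.dump` writes a binary file, so 'accuracies.txt' cannot be read in a text editor. If a human-readable file is preferred, one option is JSON, as in the sketch below. This is an alternative shown under assumptions, not what the notebook does: the model names and scores are placeholders, and the file is written to a temporary directory.

```python
import json
import os
import tempfile

# Placeholder scores for illustration only, not results from this project.
accuracies_demo = {"model_a": 0.3, "model_b": 0.1, "model_c": 0.2}

# Same ascending sort by score as in the cell above.
sorted_demo = dict(sorted(accuracies_demo.items(), key=lambda item: item[1]))

# Write to a temporary directory so the sketch does not touch project files.
path = os.path.join(tempfile.mkdtemp(), "accuracies.json")
with open(path, "w") as f:
    json.dump(sorted_demo, f, indent=2)

# Reading the file back restores the scores in the same order.
with open(path) as f:
    restored = json.load(f)

print(restored)
```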
