# 1. GridSearch

By Alberto Valdés 

**Mail 1:** anvaldes@uc.cl 

**Mail 2:** alberto.valdes.gonzalez.96@gmail.com

In this notebook we discuss GridSearch, a useful tool for hyperparameter tuning: it evaluates a grid of candidate values and identifies the best hyperparameters.

### Hyperparameter Tuning

When training machine learning models, each dataset and each model requires a different set of hyperparameters: settings that are fixed before training rather than learned from the data. In practice, the only way to determine them is to run multiple experiments, each of which trains the model with one set of hyperparameters and measures its performance. This is called hyperparameter tuning. In essence, you train your model repeatedly with different sets of hyperparameters. The process can be manual, or you can use one of several automated tuning methods.

We choose the hyperparameter values that optimize performance on the **validation set**.
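The idea can be sketched in a few lines of plain Python. This is only an illustration: `validation_score` is a made-up stand-in for "train the model and score it on the validation set", and its formula is an assumption chosen so the best combination is easy to spot.

```python
from itertools import product

# Hypothetical grid; the names mirror common tree-ensemble hyperparameters.
grid = {"n_estimators": [10, 20, 30, 40], "max_depth": [1, 2, 3, 4]}


def validation_score(params):
    """Toy stand-in for 'train the model, then score it on the validation set'."""
    # Illustrative formula only: it peaks at n_estimators=30, max_depth=3.
    return -abs(params["n_estimators"] - 30) / 10 - abs(params["max_depth"] - 3)


# Every combination of values is one experiment.
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

# Keep the candidate with the highest validation score.
best = max(candidates, key=validation_score)

print(len(candidates))  # 16 combinations (4 x 4)
print(best)
```

The number of experiments is the product of the list lengths, so adding a hyperparameter or a value multiplies the training cost.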

### Import the libraries

In [1]:
!pip install -q xgboost

In [2]:
import time
import warnings
import numpy as np
import pandas as pd
from sklearn import metrics
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from matplotlib.pyplot import figure
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.exceptions import DataConversionWarning
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier


from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer

warnings.simplefilter("ignore")
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

In [3]:
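def compute_auc(y, y_pred):
    """Return the area under the ROC curve, treating class 1 as positive.

    y_pred is expected to hold scores or probabilities for class 1.
    """
    fpr, tpr, thresholds = metrics.roc_curve(y, y_pred, pos_label=1)

    return metrics.auc(fpr, tpr)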

In [4]:
start = time.time()

### i. Load the dataset

This dataset covers fraud detection in credit card transactions: `Class` is 1 for a fraudulent transaction and 0 otherwise.

In [5]:
df = pd.read_csv('creditcard.csv')

In [6]:
X_cols = [f'V{i}' for i in range(1, 28 + 1)]
X_cols = X_cols + ['Amount']

y_col = ['Class']

In [7]:
X = df[X_cols].copy()
y = df[y_col].copy()

In [8]:
X.isna().sum()

V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
dtype: int64

In [9]:
y.isna().sum()

Class    0
dtype: int64

In [10]:
round(y.value_counts(normalize = True)*100, 2)

Class
0        99.83
1         0.17
dtype: float64

In [11]:
y.value_counts()

Class
0        284315
1           492
dtype: int64

**Note:** This is an imbalanced problem: only 0.17% of the rows belong to class 1.
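To see why the imbalance matters when choosing a metric, consider a trivial baseline that predicts class 0 for every row. The short calculation below uses only the class counts reported above; the baseline itself is a hypothetical illustration, not a model trained in this notebook.

```python
n_negative = 284_315  # class 0 count reported above
n_positive = 492      # class 1 count reported above

# A trivial "model" that predicts class 0 for every row.
true_positives = 0
true_negatives = n_negative
false_negatives = n_positive

accuracy = (true_positives + true_negatives) / (n_negative + n_positive)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy: {accuracy:.2%}")  # 99.83%
print(f"recall:   {recall:.2%}")    # 0.00%
```

Accuracy looks excellent even though the baseline never detects a positive case, which is why a metric such as recall is more informative here.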

### ii. Prepare the data

GridSearchCV incorporates K-fold cross-validation, which makes a separate validation set unnecessary.
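A minimal sketch of the K-fold idea follows: each fold serves once as the validation set while the remaining folds are used for training. The helper `k_fold_indices` is hypothetical and uses plain contiguous folds. For a classifier with an integer `cv`, scikit-learn uses stratified folds, which also preserve the class proportions in every fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)
```

Every sample is used for validation exactly once, so the averaged score uses all of the training data without a dedicated hold-out split.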

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 10)

In [13]:
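# Compute the scaling statistics on the training split only, so that no
# information from the test set leaks into preprocessing.
mean_X = X_train.mean()
std_X = X_train.std()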

In [14]:
X_train = (X_train - mean_X)/std_X
X_test = (X_test - mean_X)/std_X

### iii. Hyperparameter Tuning

In [15]:
parameters = {
    "n_estimators":[10, 20, 30, 40],
    'max_depth': [1, 2, 3, 4]
    }

In [16]:
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score),
}

In [17]:
clf = GridSearchCV(XGBClassifier(), parameters, scoring = scoring, refit = False, cv = 5, n_jobs = 4)

In [18]:
clf = clf.fit(X_train, y_train)

In [19]:
report = pd.DataFrame.from_dict(clf.cv_results_)

In [20]:
# Pick the combination with the highest mean cross-validated recall.
best_row = report.loc[report['mean_test_recall'].idxmax()]
n_est_opt = int(best_row['param_n_estimators'])
max_depth_opt = int(best_row['param_max_depth'])

In [21]:
print('N estimators:', n_est_opt)
print('Max Depth:', max_depth_opt)

N estimators: 10
Max Depth: 4
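As an alternative to reading `cv_results_` by hand, `GridSearchCV` can select the best combination itself when `refit` names one of the scoring metrics. The sketch below shows that option on a small synthetic dataset. The data, the `DecisionTreeClassifier`, and the grid are assumptions chosen to keep the example fast; they are not the setup used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced toy data (an assumption; not the credit card dataset).
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=10
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=10),
    param_grid={"max_depth": [1, 2, 3, 4]},
    scoring=["accuracy", "precision", "recall"],  # built-in scorer names
    refit="recall",  # metric used to pick best_params_ and refit the model
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_)           # combination with the best mean recall
print(round(search.best_score_, 3))  # its mean cross-validated recall
```

With `refit="recall"`, the fitted object also exposes `best_estimator_`, already retrained on all of the data passed to `fit`, so a separate final training step becomes optional.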


### iv. Final Training

In [22]:
clf = XGBClassifier(n_estimators = n_est_opt, max_depth = max_depth_opt, random_state = 10)

In [23]:
clf = clf.fit(X_train, y_train)

In [24]:
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

In [25]:
recall_train = recall_score(y_train, y_train_pred)
recall_test = recall_score(y_test, y_test_pred)

In [26]:
print('Recall Train:', round(recall_train*100, 2))
print('Recall Test:', round(recall_test*100, 2))

Recall Train: 82.34
Recall Test: 78.72


### v. Execution time

In [27]:
end = time.time()

In [28]:
delta = (end - start)

hours = int(delta/3600)
mins = int((delta - hours*3600)/60)
segs = int(delta - hours*3600 - mins*60)
print(f'Executing this notebook took {hours} hours, {mins} minutes and {segs} seconds.')

Executing this notebook took 0 hours, 1 minutes and 50 seconds.
