# GridSearchCV 

Here I define a function that tunes hyperparameters with `GridSearchCV` and returns the grid search's `best_params_` and `best_estimator_`.

I will use the data from Kaggle's Titanic competition.

In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Importing the data
data = pd.read_csv('../Kaggle competitions/Titanic/input/train.csv')
data.pop('Cabin') # Dropping 'Cabin' column as too many entries are missing
data.pop('Name') # Dropping the 'Name' column as it requires preprocessing
data.pop('Ticket') # Dropping the 'Ticket' column as it requires preprocessing
data.pop('PassengerId') # Dropping the 'PassengerId' column as it is equal to row.index+1 
data['Age'] = data['Age'].fillna(data['Age'].mean()) # Filling missing Age values with mean
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0]) # Filling missing 'Embarked' values with the mode (most frequent value); mode() returns a Series, so take its first entry
data['Pclass'] = data['Pclass'].apply(str) # Converting the entries of the categorical column 'Pclass' into strings
data = pd.get_dummies(data, prefix_sep='_') # Encoding categorical data
y = data['Survived'] # Target array
X = data.drop(['Survived'], axis=1) # Features matrix

# Splitting the data into train and test subsets
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
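A point worth checking when filling missing values with the mode: `Series.mode()` returns a Series, not a scalar, and `fillna` aligns a Series argument on the index. The sketch below uses a small made-up column (not the Titanic data) to show the difference between passing the Series and passing its first entry.

```python
import pandas as pd

# Toy column with one missing value at index 2 (illustrative data only)
s = pd.Series(['S', 'C', None, 'S'])

# mode() returns a Series indexed from 0; fillna aligns on the index,
# so a missing value at index 2 finds no matching entry and stays missing
filled_with_series = s.fillna(s.mode())

# Taking the first entry gives a scalar, which fills every missing value
filled_with_scalar = s.fillna(s.mode()[0])

print(filled_with_series.isna().sum())
print(filled_with_scalar.isna().sum())
```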

Here I define the `parameter_tuning` function:

In [30]:
from sklearn.model_selection import GridSearchCV

def parameter_tuning(model, Xtrain, ytrain, param_grid, cv=5, n_jobs=1):
    """Return the best parameters and the best estimator found by the grid search.
    Note: the param_grid must be defined according to the model tested"""
    
    gs = GridSearchCV(model, param_grid, cv=cv, verbose=1, n_jobs=n_jobs)
    gs.fit(Xtrain, ytrain)
    
    return gs.best_params_, gs.best_estimator_
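The note in the docstring matters in practice: `GridSearchCV` rejects grid keys that are not parameters of the estimator. One way to check a grid before running the search is to compare its keys against the estimator's `get_params()`. This is only a sketch; the grid below is a shortened, illustrative one.

```python
from sklearn.neighbors import KNeighborsClassifier

# Names that can legally appear as keys in a grid for this estimator
valid_names = sorted(KNeighborsClassifier().get_params().keys())
print(valid_names)

# Illustrative grid; any key missing from valid_names would make the search fail
param_grid = {'n_neighbors': [3, 5], 'weights': ['uniform', 'distance']}
unknown = set(param_grid) - set(valid_names)
print(unknown)
```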

As an example I will use the `KNeighborsClassifier` estimator:

In [39]:
from sklearn.neighbors import KNeighborsClassifier

# Candidate values for the number of neighbors used in kneighbors queries
n_neighbors = [3, 5, 11, 19, 21]
# Weight function used in prediction
weights = ['uniform', 'distance']
# Distance metric; with the default p=2, 'minkowski' is the same as 'euclidean'
metrics = ['euclidean', 'minkowski', 'manhattan']
# Algorithm used to compute the nearest neighbors
algorithms = ['ball_tree', 'kd_tree', 'brute']


param_grid = {'n_neighbors': n_neighbors,
              'weights': weights,
              'metric': metrics,
              'algorithm': algorithms}

best_params, KNN_tuned = parameter_tuning(KNeighborsClassifier(), Xtrain, ytrain, param_grid, cv=5, n_jobs=-1)

Fitting 5 folds for each of 90 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed:    1.6s finished
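The "90 candidates" in the log is the product of the lengths of the four lists in the grid (5 × 2 × 3 × 3), and each candidate is fitted once per fold. A quick way to confirm this before launching a long search is `ParameterGrid`, as sketched here with the same grid:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'n_neighbors': [3, 5, 11, 19, 21],
              'weights': ['uniform', 'distance'],
              'metric': ['euclidean', 'minkowski', 'manhattan'],
              'algorithm': ['ball_tree', 'kd_tree', 'brute']}

n_candidates = len(ParameterGrid(param_grid))  # every combination of the listed values
n_fits = n_candidates * 5                      # one fit per candidate per fold (cv=5)
print(n_candidates, n_fits)
```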


In [40]:
y_model = KNN_tuned.predict(Xtest) # best_estimator_ is already refit on the whole training set (refit=True by default)
best_params

{'algorithm': 'brute',
 'metric': 'manhattan',
 'n_neighbors': 19,
 'weights': 'distance'}
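The best parameters alone do not say how close the other candidates came. That information lives in the `cv_results_` attribute of the fitted `GridSearchCV` object, which `parameter_tuning` does not return. The sketch below runs a small search directly, on synthetic data from `make_classification` (an assumption for illustration, not the Titanic set) and with a reduced grid, and ranks the candidates by mean cross-validated score.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_grid = {'n_neighbors': [3, 5, 11], 'weights': ['uniform', 'distance']}
gs = GridSearchCV(KNeighborsClassifier(), demo_grid, cv=5)
gs.fit(X_demo, y_demo)

# One row per candidate, best mean cross-validated score first
results = pd.DataFrame(gs.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
ranking = results[columns].sort_values('rank_test_score')
print(ranking.head())
print('Best CV accuracy:', gs.best_score_)
```

If several candidates score within one standard deviation of each other, the choice among them is largely arbitrary.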

In [41]:
from sklearn.metrics import classification_report 
print(classification_report(ytest, y_model))

              precision    recall  f1-score   support

           0       0.81      0.77      0.79       140
           1       0.64      0.70      0.67        83

    accuracy                           0.74       223
   macro avg       0.73      0.74      0.73       223
weighted avg       0.75      0.74      0.75       223
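To read the report, it helps to see where precision and recall come from. The sketch below uses a handful of made-up labels (not the Titanic predictions) and derives both figures for class 1 from the confusion matrix, then compares them with scikit-learn's own scorers.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels, for illustration only
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)  # share of predicted 1s that are truly 1
recall_1 = tp / (tp + fn)     # share of true 1s that were predicted as 1

print(cm)
print(precision_1, precision_score(y_true, y_pred))
print(recall_1, recall_score(y_true, y_pred))
```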

