## GridSearchCV

`GridSearchCV` is a class in scikit-learn, a popular machine learning library for Python. It's used for hyperparameter optimization: it tries every combination in a grid of candidate hyperparameter values, evaluates each one with cross-validation, and keeps the best-scoring combination.
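
To make "every combination in a grid" concrete, here is a minimal sketch using scikit-learn's `ParameterGrid` helper. The grid itself (`n_neighbors` and `weights` with these values) is a made-up example, not the one tuned later in this section.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: two hyperparameters with 3 and 2 candidate values.
param_grid = {
    "n_neighbors": [1, 5, 10],
    "weights": ["uniform", "distance"],
}

# ParameterGrid expands the grid into every combination (3 * 2 = 6).
candidates = list(ParameterGrid(param_grid))

for params in candidates:
    print(params)

print("Number of candidates:", len(candidates))
```

A grid search evaluates each of these candidates in turn, so the cost grows with the product of the list lengths.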

### Let's import some packages

We begin by importing necessary packages and modules. The `KNeighborsRegressor` model is imported from the `sklearn.neighbors` module.

In [1]:
# Let's import some packages

from dataidea.packages import * # imports np, pd, plt, etc
from dataidea.datasets import loadDataset
from sklearn.neighbors import KNeighborsRegressor


### Let's import necessary components from sklearn

We import essential components from `sklearn`, including `Pipeline`, `ColumnTransformer`, `StandardScaler`, and `OneHotEncoder`. These components are pivotal for preprocessing data and building machine learning pipelines.

In [7]:
# Let's import the Pipeline and the preprocessing components from sklearn

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

### Loading the dataset

We load the dataset named 'boston' using the `loadDataset` function, which is built into the `dataidea` package. The loaded dataset is stored in the variable `data`.

In [3]:
# loading the data set

data = loadDataset('boston')

In [4]:
# looking at the top part

data.head()

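| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |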


### Selecting features (X) and target variable (y)

We separate the features (X) from the target variable (y). `X` holds every column except the target, 'MEDV', which is stored in `y`.

In [5]:
# Selecting our X set and y

X = data.drop('MEDV', axis=1)
y = data.MEDV

### Defining numeric and categorical columns

We define lists of the column names that represent the numeric and categorical features in the dataset. We identified these columns as the best features in the previous section of this week's material.

In [6]:
numeric_cols = ['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT']
categorical_cols = ['CHAS', 'RAD']

### Preprocessing steps

We define transformers for preprocessing numeric and categorical features. `StandardScaler` standardizes the numeric features, while `OneHotEncoder` one-hot encodes the categorical features; `handle_unknown='ignore'` tells the encoder not to raise an error when it meets a category it did not see during fitting. These transformers are applied to their respective feature types using `ColumnTransformer`, as we learned in the previous section.

In [47]:
# Preprocessing steps
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine preprocessing steps
column_transformer = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_cols),
        ('categorical', categorical_transformer, categorical_cols)
    ])
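
To see what this preprocessing produces, here is a small sketch on a toy frame. The two-column data and the name `toy_transformer` are made up for illustration, and `sparse_threshold=0` is added only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (illustrative values, not the real dataset).
train = pd.DataFrame({"RM": [6.0, 7.0, 8.0], "RAD": [1, 2, 3]})

toy_transformer = ColumnTransformer(
    transformers=[
        ("numeric", StandardScaler(), ["RM"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["RAD"]),
    ],
    sparse_threshold=0,  # always return a dense array so it prints nicely
)

# One scaled column plus one indicator column per RAD category seen in fit.
train_out = toy_transformer.fit_transform(train)
print(train_out.shape)

# RAD=9 was never seen during fit; with handle_unknown="ignore" its
# indicator columns are all zeros instead of raising an error.
unseen_out = toy_transformer.transform(pd.DataFrame({"RM": [7.0], "RAD": [9]}))
print(unseen_out)
```

Columns that are not listed in any transformer are dropped by default, which is why only the selected numeric and categorical columns reach the model.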


### Defining the pipeline

We construct a machine learning pipeline using `Pipeline`. The pipeline consists of preprocessing steps (defined in `column_transformer`) and a `KNeighborsRegressor` model with 10 neighbors.

In [None]:
# Pipeline
pipe = Pipeline([
    ('column_transformer', column_transformer),
    ('model', KNeighborsRegressor(n_neighbors=10))
])

# print(pipe)


### Fitting the pipeline

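As we learned, a `Pipeline` has `fit`, `score` and `predict` methods. We fit the pipeline on the dataset (`X`, `y`), evaluate its performance with `score()`, and finally make predictions with `predict()`. Note that the score here is computed on the same data the pipeline was fitted on, so it is an optimistic estimate.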

In [36]:
# Fit the pipeline
pipe.fit(X, y)

# Score the pipeline
pipe_score = pipe.score(X, y)

# Predict using the pipeline
pipe_predicted_y = pipe.predict(X)

print('Pipe Score:', pipe_score)

Pipe Score: 0.818140222027107
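
For a regressor, `score()` returns the coefficient of determination, R². As a rough sketch of what that number means, the block below computes R² by hand and compares it with `r2_score`. The targets are the first five `MEDV` values shown earlier; the predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Targets from the first five rows shown earlier; predictions are invented.
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred = np.array([25.0, 22.0, 33.0, 34.0, 35.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print("Manual R^2:  ", manual_r2)
print("r2_score R^2:", r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible for very poor models.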



### Hyperparameter tuning using GridSearchCV

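We perform hyperparameter tuning using `GridSearchCV`. The pipeline (`pipe`) serves as the base estimator, and we define a grid of hyperparameters to search through, focusing on the number of neighbors for the KNN model. Because the model sits inside a pipeline, the parameter is addressed as `model__n_neighbors`: the step name (`model`), a double underscore, then the parameter name.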

In [37]:
from sklearn.model_selection import GridSearchCV

In [41]:
model = GridSearchCV(
    estimator=pipe,
    param_grid={
        'model__n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    },
    cv=3
    )
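
The `step__parameter` naming used in `param_grid` can be checked directly on a pipeline. The sketch below uses a simplified stand-in pipeline (`demo_pipe`, with only a scaler and the model) rather than the full `pipe` from this section.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in for the pipeline used in this section.
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=10)),
])

# get_params() lists every settable name; nested ones use "step__param".
tunable = [name for name in demo_pipe.get_params() if name.startswith("model__")]
print(tunable)

# set_params() accepts the same names that appear as keys in param_grid.
demo_pipe.set_params(model__n_neighbors=3)
print(demo_pipe.named_steps["model"].n_neighbors)
```

If a key in `param_grid` is misspelled, it will not appear in `get_params()`, so printing the names is a quick way to debug an "invalid parameter" error.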


### Fitting the model for hyperparameter tuning

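We fit the `GridSearchCV` model on the dataset to find the optimal hyperparameters. This involves preprocessing the data and training the model multiple times using cross-validation: with 10 candidate values and `cv=3`, that is 30 fits, followed by one final refit of the best candidate on the full dataset.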

In [42]:
model.fit(X, y)
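
As a rough illustration of what `cv=3` does with the rows, the sketch below splits nine placeholder rows with `KFold`. For a regressor, an integer `cv` corresponds to an unshuffled `KFold`; the nine-row array is an assumption made purely for readability.

```python
import numpy as np
from sklearn.model_selection import KFold

# Nine placeholder rows standing in for a dataset.
rows = np.arange(9)

# Integer cv=3 on a regressor corresponds to KFold(n_splits=3), no shuffling.
folds = list(KFold(n_splits=3).split(rows))

for fold_number, (train_idx, test_idx) in enumerate(folds):
    print(f"split{fold_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Because the folds are contiguous blocks, the order of the rows in the dataset can influence the per-fold scores.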


### Extracting and displaying cross-validation results

We extract the results of cross-validation performed during hyperparameter tuning and present them in a tabular format using a DataFrame.

In [48]:
cv_results = pd.DataFrame(model.cv_results_)
cv_results

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_model__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.005611 | 0.001182 | 0.005552 | 0.002497 | 1 | {'model__n_neighbors': 1} | 0.347172 | 0.56178 | 0.295295 | 0.401415 | 0.115356 | 10 |
| 1 | 0.008659 | 0.00054 | 0.007201 | 0.002532 | 2 | {'model__n_neighbors': 2} | 0.404829 | 0.612498 | 0.27669 | 0.431339 | 0.138369 | 9 |
| 2 | 0.007307 | 0.001737 | 0.007313 | 0.004104 | 3 | {'model__n_neighbors': 3} | 0.466325 | 0.590333 | 0.243375 | 0.433345 | 0.143552 | 8 |
| 3 | 0.004802 | 0.000561 | 0.003383 | 0.000331 | 4 | {'model__n_neighbors': 4} | 0.569672 | 0.619854 | 0.246539 | 0.478688 | 0.165428 | 4 |
| 4 | 0.004051 | 0.00015 | 0.003302 | 0.000101 | 5 | {'model__n_neighbors': 5} | 0.6139 | 0.600994 | 0.23032 | 0.481738 | 0.177857 | 2 |
| 5 | 0.004078 | 0.000268 | 0.003345 | 0.00029 | 6 | {'model__n_neighbors': 6} | 0.620587 | 0.607083 | 0.225238 | 0.484302 | 0.183269 | 1 |
| 6 | 0.004307 | 0.000472 | 0.003678 | 0.000273 | 7 | {'model__n_neighbors': 7} | 0.639693 | 0.583685 | 0.218612 | 0.480663 | 0.186704 | 3 |
| 7 | 0.005612 | 0.001025 | 0.004131 | 0.000574 | 8 | {'model__n_neighbors': 8} | 0.636143 | 0.567841 | 0.209472 | 0.471152 | 0.187125 | 5 |
| 8 | 0.004556 | 9.9e-05 | 0.003832 | 0.000258 | 9 | {'model__n_neighbors': 9} | 0.649335 | 0.542624 | 0.197917 | 0.463292 | 0.192639 | 6 |
| 9 | 0.006855 | 0.001309 | 0.004405 | 0.000803 | 10 | {'model__n_neighbors': 10} | 0.65337 | 0.535112 | 0.191986 | 0.460156 | 0.195674 | 7 |
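
Rather than reading the winner off the full table, the fitted search object also exposes it through `best_params_` and `best_score_`. The sketch below shows this on synthetic data from `make_regression` with a simplified pipeline; the data, the names `X_demo`, `y_demo` and `search`, and the shorter grid are all assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the dataset used in this section.
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=Pipeline([("scaler", StandardScaler()), ("model", KNeighborsRegressor())]),
    param_grid={"model__n_neighbors": [1, 3, 5, 7]},
    cv=3,
)
search.fit(X_demo, y_demo)

# Sort candidates so the best-ranked one comes first, and keep the key columns.
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["param_model__n_neighbors", "mean_test_score", "std_test_score"]])

# The best candidate and its mean cross-validated score.
print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)
```

`best_score_` is the `mean_test_score` of the candidate ranked first, so it matches the top row of the sorted table.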



### Scoring the final model

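We score the final model after hyperparameter tuning on the entire dataset. By default, `GridSearchCV` refits the best candidate on all of `X` and `y`, so this score is measured on the same data the model was trained on; it does not show how well the model generalizes to unseen data. The cross-validated `mean_test_score` of the best candidate (about 0.484 in the table above) is a more realistic estimate.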

In [46]:
model.score(X, y)

0.8661624926868122
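
Scoring on the rows used for fitting tends to be optimistic. One common alternative is to hold out a test set before tuning; the block below is a minimal sketch of that idea on synthetic data. The data, the 25% split, and the variable names are assumptions for illustration, not part of the workflow above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the dataset used in this section.
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# Keep 25% of the rows aside; the search never sees them.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

search = GridSearchCV(
    estimator=Pipeline([("scaler", StandardScaler()), ("model", KNeighborsRegressor())]),
    param_grid={"model__n_neighbors": [1, 3, 5, 7, 9]},
    cv=3,
)
search.fit(X_train, y_train)

# Compare the score on training rows with the score on held-out rows.
train_score = search.score(X_train, y_train)
test_score = search.score(X_test, y_test)
print("Train R^2:", train_score)
print("Test R^2: ", test_score)
```

The held-out score is the one to report when estimating performance on new data.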