# GridSearchCV & Hyperparameter Tuning

GridSearchCV in Scikit-Learn is a tool for hyperparameter tuning that performs an exhaustive search over specified parameter values for an estimator, scoring each combination with cross-validation.
It automates a search that would otherwise be done by hand, which makes it a common step in the machine learning pipeline. Because every combination is tried, the cost grows quickly with the size of the grid.
Hyperparameters are parameters of an algorithm that are set by the user before training and control the model training process.
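
To make "exhaustive search" concrete, the sketch below does by hand roughly what GridSearchCV automates: it loops over every combination in a small grid and scores each one with cross-validation. The iris dataset, the decision tree estimator, and the grid values are assumptions chosen for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative grid (assumed values): 3 * 2 = 6 combinations.
grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

results = []
for params in ParameterGrid(grid):
    model = DecisionTreeClassifier(random_state=0, **params)
    # Mean accuracy over 5 cross-validation folds for this combination.
    mean_score = cross_val_score(model, X, y, cv=5).mean()
    results.append((mean_score, params))

# Keep the combination with the highest mean cross-validated score.
best_score, best_params = max(results, key=lambda item: item[0])
print(best_params, round(best_score, 3))
```

GridSearchCV wraps this loop and adds conveniences such as parallel execution, refitting, and a structured record of results.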

## Parameters of GridSearchCV

<b>estimator</b>: The model or pipeline to be optimized. Examples are GaussianNB() and LogisticRegression().

<b>param_grid</b>: A dictionary or list of dictionaries with parameter names (as strings) as keys and lists of parameter settings to try as values. Using param_grid, you can specify the hyperparameters for various models to find the optimal combination.

Examples of param_grid values for several models:

<b>Logistic Regression</b>: When tuning a logistic regression model, GridSearchCV can search through different values of C, penalty, and solver to find the best parameters.


```python
parameters = {'C': [0.01, 0.1, 1],
              'penalty': ['l2'],
              'solver': ['lbfgs']}
```

* C: Inverse of regularization strength; smaller values specify stronger regularization. Regularization is a technique used to prevent a model from overfitting.
* penalty: Specifies the norm of the penalty; 'l2' is the squared-magnitude penalty, the same one used in ridge regression.
* solver: Algorithm to use in the optimization problem.
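
As a minimal sketch of how such a grid is used, the example below searches over C with the lbfgs solver. The iris dataset and the `max_iter=1000` setting are assumptions added for illustration; the penalty is left at the estimator's default (L2) rather than listed in the grid.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Only C and solver are searched; the penalty keeps its default (L2).
parameters = {"C": [0.01, 0.1, 1], "solver": ["lbfgs"]}

# max_iter is raised (an assumed value) so lbfgs has room to converge.
search = GridSearchCV(LogisticRegression(max_iter=1000), parameters, cv=5)
search.fit(X, y)

print(search.best_params_)          # best combination found in the grid
print(round(search.best_score_, 3)) # its mean cross-validated accuracy
```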

<b>Support Vector Machine</b>: For SVM, GridSearchCV can explore different kernels, C values, and gamma settings to optimize the model.


```python
import numpy as np

# np.logspace(-3, 3, 5) gives 5 values evenly spaced on a log scale from 1e-3 to 1e3.
parameters = {'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
              'C': np.logspace(-3, 3, 5),
              'gamma': np.logspace(-3, 3, 5)}
```


* kernel: Specifies the kernel type to be used in the algorithm.
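* C: Regularization parameter; the strength of the regularization is inversely proportional to C.
* gamma: Kernel coefficient for the 'rbf', 'poly', and 'sigmoid' kernels; the 'linear' kernel does not use it.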
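
Because the linear kernel does not use gamma, a single dictionary spends candidates on combinations that cannot differ. The sketch below uses `ParameterGrid` from `sklearn.model_selection` to count candidates for the grid above and for an alternative list-of-dictionaries layout; the split layout is a suggestion, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.logspace(-3, 3, 5)
gamma_values = np.logspace(-3, 3, 5)

# Single dictionary, as in the grid above: 4 * 5 * 5 combinations.
full_grid = {"kernel": ["linear", "rbf", "poly", "sigmoid"],
             "C": C_values,
             "gamma": gamma_values}

# List of dictionaries: gamma is only varied for kernels that use it.
split_grid = [
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["rbf", "poly", "sigmoid"], "C": C_values, "gamma": gamma_values},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)  # 100 80
```

Either layout can be passed directly as `param_grid`; the split version simply skips the 20 redundant linear-kernel candidates.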

<b>Decision Tree Classifier</b>: In the case of a decision tree, GridSearchCV can test various criteria, splitters, depths, and other parameters to find the best configuration.


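```python
parameters = {'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random'],
              # Even depths from 2 to 18.
              'max_depth': [2*n for n in range(1, 10)],
              # 'auto' is not accepted by recent scikit-learn releases;
              # for classifiers it was equivalent to 'sqrt'.
              'max_features': ['sqrt', 'log2'],
              'min_samples_leaf': [1, 2, 4],
              'min_samples_split': [2, 5, 10]}
```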

* criterion: The function to measure the quality of a split.
* splitter: The strategy used to choose the split at each node.
* max_depth: The maximum depth of the tree.
* max_features: The number of features to consider when looking for the best split.
* min_samples_leaf: The minimum number of samples required to be at a leaf node.
* min_samples_split: The minimum number of samples required to split an internal node.
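
The full grid above expands to 2 × 2 × 9 × 2 × 3 × 3 = 648 candidates, or 3,240 fits with 5-fold cross-validation. The sketch below therefore uses a trimmed grid and holds out a test set to check the selected tree on data the search never saw. The iris dataset, the split settings, and the trimmed values are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Trimmed grid (assumed values): 2 * 3 * 3 = 18 candidates.
parameters = {"criterion": ["gini", "entropy"],
              "max_depth": [2, 4, 6],
              "min_samples_leaf": [1, 2, 4]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), parameters, cv=5)
search.fit(X_train, y_train)

# With the default refit=True, score() uses the refitted best estimator.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```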

<b>K-Nearest Neighbors</b>: For KNN, GridSearchCV can try different numbers of neighbors, algorithms, and power parameters to determine the best model.


```python
parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'p': [1, 2]}
```

* n_neighbors: Number of neighbors to use.
* algorithm: Algorithm used to compute the nearest neighbors.
* p: Power parameter for the Minkowski metric.
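
KNN is distance-based, so it is often combined with feature scaling. The sketch below, under that assumption, wraps a scaler and the classifier in a `Pipeline` and searches over `n_neighbors` and `p`. The wine dataset and the step names `scaler` and `knn` are choices made for this example; the `algorithm` entry is left out here to keep the grid small.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# "scaler" and "knn" are step names chosen for this example.
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
parameters = {"knn__n_neighbors": list(range(1, 11)),
              "knn__p": [1, 2]}

search = GridSearchCV(pipeline, parameters, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Putting the scaler inside the pipeline means it is refit on the training portion of each fold, so the validation portion does not influence the scaling.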

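* <b>scoring</b>: The metric used to evaluate predictions on the held-out fold. It can be a single string or callable; several metrics can also be passed as a list, tuple, or dictionary. Common options include 'accuracy', 'f1', and 'roc_auc'. If None, the estimator's own score method is used.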


* <b>n_jobs</b>: The number of jobs to run in parallel. -1 means using all processors.

* <b>pre_dispatch</b>: Controls the number of jobs that get dispatched during parallel execution. It can be an integer or a string expression in terms of n_jobs, such as '2*n_jobs' (the default) or '3*n_jobs', to limit the number of jobs dispatched at once and so limit memory use.

* <b>refit</b>: If True, refits an estimator on the entire dataset using the best parameters found. The refitted estimator is stored in the best_estimator_ attribute. Default is True.

* <b>cv</b>: Determines the cross-validation splitting strategy. It can be an integer to specify the number of folds, a cross-validation generator, or an iterable. Default is 5-fold cross-validation.


* <b>verbose</b>: Controls the verbosity level. Higher values indicate more messages. verbose=0 is silent, verbose=1 shows some messages, and verbose=2 shows more messages.

* <b>return_train_score</b>: If False, the cv_results_ attribute will not include training scores. Default is False.


* <b>error_score</b>: Value to assign to the score if an error occurs in estimator fitting. np.nan is the default; it can also be set to a specific number, or to 'raise' to raise the error instead.
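
The sketch below passes most of the parameters described above in a single call, to show how they fit together. The breast cancer dataset, the decision tree estimator, and every value chosen are assumptions for illustration, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},  # assumed values
    scoring="f1",                         # metric used to rank candidates
    n_jobs=1,                             # set to -1 to use all processors
    refit=True,                           # refit with the best parameters on all data
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    verbose=0,                            # no progress messages
    return_train_score=True,              # also record training scores
    error_score="raise",                  # raise instead of recording NaN
)
search.fit(X, y)

print(search.best_params_)
print("mean_train_score" in search.cv_results_)
```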

## Applications and Advantages of GridSearchCV


* Model Selection: GridSearchCV enables the comparison of multiple models and facilitates the selection of the best-performing one for a given data set.

* Hyperparameter Tuning: It automates the process of finding the optimal hyperparameters, which can significantly improve the performance of machine learning models.

* Pipeline Optimization: GridSearchCV can be applied to complex pipelines involving multiple preprocessing steps and models to optimize the entire workflow.

* Cross-Validation: It incorporates cross-validation in the parameter search process, so each candidate is scored on several train-test splits rather than on one particular split, which makes the comparison less sensitive to how the data happened to be divided.

* Exhaustive Search: GridSearchCV evaluates every combination in the specified parameter grid, so the best combination within that grid is found. Values that are not listed in the grid are never tried.

* Parallel Execution: With the n_jobs parameter, it can leverage multiple processors to speed up the search process.

* Automatic Refit: With refit=True (the default), GridSearchCV automatically refits the model with the best parameters on the entire data set, making it ready for use.

* Detailed Output: The cv_results_ attribute provides detailed information about the performance of each parameter combination, including validation scores and, when return_train_score=True, training scores, which helps in understanding the model's behavior.
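
As a small illustration of the detailed output, the sketch below prints the rank, parameters, and mean training and validation scores for each candidate. The iris dataset, the KNN estimator, and the grid values are assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [1, 5, 15]},  # assumed values
                      cv=5,
                      return_train_score=True)
search.fit(X, y)

results = search.cv_results_
# One entry per candidate; rank 1 marks the best mean validation score.
for rank, params, train, test in zip(results["rank_test_score"],
                                     results["params"],
                                     results["mean_train_score"],
                                     results["mean_test_score"]):
    print(rank, params, round(train, 3), round(test, 3))
```

A large gap between the training and validation columns for a candidate is one sign that it may be overfitting.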