# 5. Hyperparameters and Model Validation

The first two pieces of this—the choice of model and choice of hyperparameters—are perhaps the most important part of using these tools and techniques effectively.
In order to make an informed choice, we need a way to *validate* that our model and our hyperparameters are a good fit to the data.
While this may sound simple, there are some pitfalls that you must avoid to do this effectively.

## Thinking about Model Validation

In principle, model validation is very simple: after choosing a model and its hyperparameters, we can estimate how effective it is by applying it to some of the training data and comparing the prediction to the known value.

The following sections first show a naive approach to model validation and why it
fails, before exploring the use of holdout sets and cross-validation for more robust
model evaluation.

### Model validation the wrong way

Let's demonstrate the naive approach to validation using the Iris data.
We will start by loading the data:

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

Next we choose a model and hyperparameters. Here we'll use a *k*-neighbors classifier with ``n_neighbors=1``.
This is a very simple and intuitive model that says "the label of an unknown point is the same as the label of its closest training point:"

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)

Then we train the model, and use it to predict labels for data we already know and compute the fraction of correctly labeled points:

In [None]:
model.fit(X, y)
y_model = model.predict(X)

from sklearn.metrics import accuracy_score
accuracy_score(y, y_model)

We see an accuracy score of 1.0, which indicates that 100% of points were correctly labeled by our model!
But is this truly measuring the expected accuracy? Have we really come upon a model that we expect to be correct 100% of the time?

As you may have gathered, the answer is no.
In fact, this approach contains a fundamental flaw: *it trains and evaluates the model on the same data*.
Furthermore, the nearest neighbor model is an *instance-based* estimator that simply stores the training data, and predicts labels by comparing new data to these stored points: except in contrived cases, it will get 100% accuracy *every time!*
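
To see the effect of this flaw, we can hold part of the data back from training and score the model only on that held-back part. The following is a minimal sketch of that idea; the 50/50 split and `random_state=0` are illustrative assumptions, not choices made in the text above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold back half of the data; the model never sees it during training.
# (The 50/50 ratio and the seed are assumptions made for illustration.)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, train_size=0.5, random_state=0
)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# Accuracy on the data the model was trained on (optimistic) ...
train_acc = accuracy_score(y_train, model.predict(X_train))
# ... versus accuracy on data the model has never seen.
holdout_acc = accuracy_score(y_holdout, model.predict(X_holdout))

print("training accuracy:", train_acc)
print("holdout accuracy: ", holdout_acc)
```

The holdout score is the more honest estimate of how the model behaves on new data, at the cost of training on fewer points.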

### Model validation via k-fold cross-validation


A more robust way to evaluate a model is *k-fold cross-validation*; that is, to do a sequence of fits where each subset of the data is used both as a training set and as a validation set.
Visually, it might look something like this:

![](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/figures/05.03-2-fold-CV.png?raw=1)



The score obtained from a single random split into training and test sets depends on which points happen to land in each set. To make the evaluation more reliable, we can use cross-validation. The main idea is to divide the dataset into k parts, or folds (usually k=5 or 10), and take one fold as the test set and the remaining k-1 folds as the training set. Repeating this with each fold in turn as the test set gives k train/test combinations; a score is calculated on the test fold of each, and the k scores are averaged.
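
The procedure just described can be sketched as an explicit loop. This is only an illustration: it assumes scikit-learn's `StratifiedKFold` splitter with shuffling and a fixed seed, and a 5-neighbor classifier, none of which are prescribed by the description above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Assumption: 5 stratified, shuffled folds with a fixed seed for repeatability.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in folds.split(X, y):
    model = KNeighborsClassifier(n_neighbors=5)
    # Train on the k-1 folds that are not held out ...
    model.fit(X[train_idx], y[train_idx])
    # ... and score (accuracy) on the single held-out fold.
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(fold_scores)           # one score per fold
print(np.mean(fold_scores))  # the cross-validated estimate
```

Every sample is used for testing exactly once, so the averaged score depends less on any one lucky or unlucky split.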

k-fold cross-validation can be implemented by hand, by splitting the data into folds and looping over them. However, scikit-learn provides a more convenient method: the `cross_val_score()` function.

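```python
cross_val_score(estimator, X, y=None, *, scoring=None, cv=None, n_jobs=None)
```

- estimator: the estimator object to evaluate (here, a classifier)
- X: data features (Features)
- y: data labels (Labels)
- scoring: the metric to compute, such as `'accuracy'` (`None` uses the estimator's default scorer)
- cv: the number of folds, or a cross-validation splitter object (`None` means 5-fold)
- n_jobs: number of CPUs working at the same time (`-1` means all)

Only the most commonly used parameters are listed here. Usually you only need to set estimator, X, y, and cv.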

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# load data
iris=load_iris()
x=iris.data
y=iris.target

# kNN using 5 neighbours
knn=KNeighborsClassifier(n_neighbors=5)

# 5-fold cross validation
scores=cross_val_score(knn, x, y, cv=5, scoring='accuracy')

print(scores)  # 5 values, one for each iteration
print(scores.mean())

Note that when we print `scores`, we get 5 values: one for each iteration of 5-fold cross-validation. Each iteration uses a different fold for validation and the remaining 4 folds for training.

**IMPORTANT**: Cross-validation does not by itself improve the accuracy of a model. It gives a more reliable estimate of how the model performs on unseen data, which helps us choose appropriate hyperparameters and detect overfitting, especially when the dataset is small.

Some hyperparameters of a machine learning model must be set manually. A common method for choosing them is grid search: step through the hyperparameter values within a reasonable range and observe how the model performs at each value (usually measured by the cross-validated score). The name "grid" is most intuitive when several hyperparameters need to be tuned: we traverse every combination of values and observe the model's performance for each combination.

Grid search can be implemented through the above-mentioned `cross_val_score()` function combined with loop statements. Take the hyperparameter k in knn as an example:

In [None]:
import matplotlib.pyplot as plt

k_range=range(1,31)  # possible values for k in the kNN algorithm
k_scores=[]
for k in k_range:
    knn=KNeighborsClassifier(n_neighbors=k)
    scores=cross_val_score(knn, x, y, cv=5, scoring='accuracy')  # for  classification
    #loss=-cross_val_score(knn, x, y, cv=5, scoring='neg_mean_squared_error') # for regression
    k_scores.append(scores.mean())

#plot
plt.plot(k_range, k_scores)
plt.xlabel('value of k for knn')
plt.ylabel('cross-validated accuracy')
plt.show()

# Best k for accuracy (several values of k may tie)
max_acc = max(k_scores)
print("Best values for k:")
print([k for k, score in zip(k_range, k_scores) if score == max_acc])
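
scikit-learn also ships a helper, `GridSearchCV`, that runs this kind of search for us. The sketch below is an illustration rather than a replacement for the loop above: it adds a second hyperparameter, `weights`, purely as an assumption to show how the grid covers every combination.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Every combination of these values is evaluated: 30 x 2 = 60 candidates.
# (Searching over `weights` is an illustrative assumption.)
param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
}

# Each candidate is scored with 5-fold cross-validated accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)  # the best-scoring combination
print(search.best_score_)   # its mean cross-validated accuracy
```

Note that `best_params_` reports a single combination even when several tie for the top score, so it is worth inspecting `search.cv_results_` before settling on a value.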

## Metrics

Once a model is fitted, we can compute its predictions on a test dataset. These predictions are used to compute the final metrics, such as the confusion matrix (which is plotted with the [`ConfusionMatrixDisplay`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html)), [`accuracy`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html), etc.

Let's first split the data (so that we have a test set) and train the model.

In [None]:
# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn_best = KNeighborsClassifier(n_neighbors=5)   # one of the best k values
knn_best.fit(X_train, y_train)

y_pred = knn_best.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
cm_display = ConfusionMatrixDisplay(cm).plot()

The scikit-learn function [`classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) is also commonly used to print a summary of model evaluation metrics.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
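
The per-class precision and recall shown in the report can be derived directly from the confusion matrix. The following sketch recomputes the same split and model so that it runs on its own; the variable names `true_pos`, `precision`, and `recall` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
y_pred = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train).predict(X_test)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred)

# The diagonal counts the correctly classified points of each class.
true_pos = np.diag(cm)

# Precision: of the points predicted as class c, the fraction that truly are c.
precision = true_pos / cm.sum(axis=0)
# Recall: of the points that truly are class c, the fraction predicted as c.
recall = true_pos / cm.sum(axis=1)

print("precision per class:", precision)
print("recall per class:   ", recall)
```

This simple division assumes every class is predicted at least once; otherwise a column sum is zero and the precision for that class is undefined.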