<img src="data/images/div/lecture-notebook-header.png" />

# Classification & Regression: Cross Validation & Hyperparameter Tuning

Cross-validation is a widely used technique in machine learning for evaluating the performance of a predictive model and estimating its generalization ability. It helps assess how well a model will perform on unseen data. Most classifiers or regressors feature a set of hyperparameters (e.g., the k in KNN) that can significantly affect the results. To find the best parameter settings, we have to train and evaluate for different parameter values.

However, this search for the best parameter values cannot be done using the test set. The test set has to remain unseen until the very end, when it is used for the final evaluation (once the hyperparameters have been fixed). Using the test set to tune the hyperparameters means that the test set has affected the training process.

The process of cross-validation involves splitting the available dataset into multiple subsets or "folds." One of the folds is used as the validation set, while the remaining folds are used for training the model. This process is repeated multiple times, with each fold serving as the validation set in a different iteration.

Here's a step-by-step explanation of a common cross-validation procedure called "k-fold cross-validation":

* The dataset is divided into k subsets or folds of approximately equal size.
* The model is trained k times, each time using k-1 folds as the training data and one fold as the validation data.
* The performance of the model is evaluated on each validation set, typically by calculating a performance metric such as accuracy, precision, recall, or F1 score.
* The performance scores obtained from each fold are averaged to get an overall performance estimate of the model.

The value of k can vary, but common choices include 5-fold or 10-fold cross-validation. Generally, a larger value of k leads to a more robust performance estimation but increases the computational cost. Once the model is validated using cross-validation and its performance is satisfactory, it can be trained on the entire dataset (without the need for validation sets) and used for making predictions on new, unseen data.
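
The splitting step described above can be sketched with plain `numpy`. This is only a minimal illustration of the idea, not how `scikit-learn` implements it; the helper name `k_fold_indices` and the toy size of 10 samples are assumptions made for this example.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle once up front
    folds = np.array_split(indices, k)     # k folds of (nearly) equal size
    for i in range(k):
        val_idx = folds[i]                 # fold i serves as the validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print("train:", np.sort(train_idx), "val:", np.sort(val_idx))
```

Every sample lands in exactly one validation fold, so each sample is used for validation once and for training k-1 times.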

## Setting up the Notebook

### Specify How Plots Get Rendered

In [None]:
%matplotlib inline

### Import Required Packages

In [None]:
import numpy as np
import pandas as pd

from tqdm import tqdm

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV

import warnings
warnings.filterwarnings('ignore')

---

## Logistic Regression with Cross-Validation (Single-Parameter Tuning)

Let's first consider only a single hyperparameter for which we want to find the best value. This helps to focus on understanding the idea behind Cross-Validation. Throughout this notebook, we use the "Vessels Details" dataset and the task is to predict the `Type` of a vessel based on (some of) its features (e.g., `Length`, `Width`, etc.).

### Load Data from File

In [None]:
df = pd.read_csv('data/datasets/vessels/vessel-details.csv')

# Shuffling is often a good idea; the data might be sorted in some way
df = df.sample(frac=1).reset_index(drop=True)

# Show the first 5 rows
df.head()

#### Data Selection

To skip any more sophisticated data preprocessing steps, we consider only the convenient features -- that is, we consider only a subset of numerical features for our model. This particularly means that we do not have to consider any encoding strategies for categorical features. To keep it even simpler, we also remove all rows containing any missing value.

In [None]:
# Keep only the numerical attributes to keep it simple here + Type as our class label
df = df[['Length', 'Width', 'Gross Tonnage', 'Deadweight Tonnage', 'Type']]

# Remove all rows with any NaN values; again, just to keep it simple
df = df.dropna()

df.head()

### Convert Class Labels

Most classification algorithms assume that the class labels are integers in the range 0..C-1, where C is the number of classes. Using `pandas`, this conversion is easy to do. After the conversion, all rows with the class label, say, "Oil Tanker" will have the same numerical (integer) class label from the range 0..C-1. For our dataset, the number of classes is `C=15`.

In [None]:
df['Type'] = pd.factorize(df['Type'])[0]

df.head()
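
As a small illustration of what `pd.factorize` returns, the sketch below runs it on a handful of made-up labels (the values are hypothetical and not taken from the vessel dataset). The second return value, which the cell above discards via `[0]`, holds the original names.

```python
import pandas as pd

# Hypothetical toy labels, not taken from the vessel dataset
types = pd.Series(["Oil Tanker", "Cargo", "Oil Tanker", "Tug", "Cargo"])

codes, uniques = pd.factorize(types)
print(codes)            # integer label per row, assigned in order of first appearance
print(list(uniques))    # position i holds the original name for code i

# Map integer codes back to the original names
recovered = uniques[codes]
print(list(recovered))
```

Keeping `uniques` around is one way to translate predicted integer labels back into readable class names later on.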

### Generate Training & Test Data

As usual, we convert the dataframe into numpy arrays for further processing, including splitting the dataset into training and test data.

In [None]:
# Convert data to numpy arrays
X = df[['Length', 'Width', 'Gross Tonnage', 'Deadweight Tonnage']].to_numpy()
y = df[['Type']].to_numpy().squeeze()

# Split dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("Size of training set: {}".format(len(X_train)))
print("Size of test: {}".format(len(X_test)))

#### Normalize Data via Standardization

Since we want to consider different polynomial degrees, it is strongly recommended – and almost required – to normalize/standardize the data. As the [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) implementation also applies regularization by default, we do normalize the data via standardization.

In [None]:
# We fit the scaler based on the training data only
scaler = StandardScaler().fit(X_train)

# Of course, we need to convert both training and test data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
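
To see what fitting on the training data only means in practice, here is a minimal sketch on made-up numbers (the arrays and variable names are assumptions for illustration). The scaler learns the mean and standard deviation from the training rows and reuses them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
X_trn_toy = np.array([[100.0, 1.0], [200.0, 2.0], [300.0, 3.0]])
X_tst_toy = np.array([[400.0, 4.0]])

toy_scaler = StandardScaler().fit(X_trn_toy)   # statistics from training data only
X_trn_std = toy_scaler.transform(X_trn_toy)
X_tst_std = toy_scaler.transform(X_tst_toy)    # reuses the training mean/std

print(toy_scaler.mean_)         # per-feature training mean
print(X_trn_std.mean(axis=0))   # approximately 0 for each feature
print(X_trn_std.std(axis=0))    # approximately 1 for each feature
print(X_tst_std)                # test values need not be centred at 0
```

The transformed test row is not centred at zero, which is expected: it is expressed relative to the training distribution.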

### Train and Test Logistic Classifier Using Cross-Validation

#### Semi-Manual K-Fold Cross-Validation

We first utilize `scikit-learn`'s [`KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) to split the training data into $k$ folds (here, $k=5$). The `KFold.split()` method generates the folds and lets us loop over all combinations of training and validation folds. Each combination contains $k-1$ training folds and 1 validation fold. For each combination, we retrain and validate the classifier.
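
Before running this on the real data, it can help to look at what `KFold.split()` actually yields. The sketch below uses 10 made-up samples (an assumption for illustration); each iteration returns two arrays of row indices, not the data itself.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)   # 10 hypothetical samples, 1 feature

kf_toy = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_index, val_index) in enumerate(kf_toy.split(X_toy)):
    fold_sizes.append((len(train_index), len(val_index)))
    print("fold {}: train={} val={}".format(fold, train_index, val_index))
```

With 10 samples and 5 folds, every split has 8 training indices and 2 validation indices; these index arrays are then used to slice both the features and the labels.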

In [None]:
# Initialize the best f1-score and respective p value
p_best, f1_best = None, 0.0

# Loop over a range of values for setting p
for p in tqdm(range(1, 10)):

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    f1_scores = []
    
    # Transform data w.r.t. the degree of polynomial p
    poly = PolynomialFeatures(p)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)  # reuse the fitted transformer; never fit on test data

    for train_index, val_index in kf.split(X_train_poly):
        
        # Create the next combination of training and validation folds
        X_trn, X_val = X_train_poly[train_index], X_train_poly[val_index]
        y_trn, y_val = y_train[train_index], y_train[val_index]
    
        # Train the classifier for the current training folds
        classifier = LogisticRegression(fit_intercept=False, max_iter=1000).fit(X_trn, y_trn)
        
        # Predict the labels for the validation fold
        y_pred = classifier.predict(X_val)

        # Calculate the f1-score for the validation fold
        f1_scores.append(f1_score(y_val, y_pred, average='micro'))
        
    # Calculate the f1-score for the current p value as the mean over all folds
    f1_fold_mean = np.mean(f1_scores)
    
    # Keep track of the best f1-score and the respective p value
    if f1_fold_mean > f1_best:
        p_best, f1_best = p, f1_fold_mean
        
        
print('The best average f1-score was {:.3f} for p={}'.format(f1_best, p_best))

#### Automatic Cross-Validation

`scikit-learn` provides the even more convenient method [`cross_val_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) that does the generation of folds and splitting them into training folds and validation folds, as well as the training of a classifier for all folds.

In [None]:
# Initialize the best f1-score and respective p value
p_best, f1_best = None, 0.0


# Loop over a range of values for setting p
for p in tqdm(range(1, 9)):
    
    # Transform data w.r.t. the degree of polynomial p
    poly = PolynomialFeatures(p)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)  # reuse the fitted transformer; never fit on test data
    
    # Specify type of classifier
    classifier = LogisticRegression(fit_intercept=False, max_iter=1000)
    
    # Perform cross-validation (here with 5 folds)
    # f1_scores is an array containing the 5 f1-scores
    f1_scores = cross_val_score(classifier, X_train_poly, y_train, cv=5, scoring='f1_micro')
    
    # Calculate the f1-score for the current p value as the mean over all 5 f1-scores
    f1_fold_mean = np.mean(f1_scores)
    
    # Keep track of the best f1-score and the respective p value
    if f1_fold_mean > f1_best:
        p_best, f1_best = p, f1_fold_mean
  

print('The best average f1-score was {:.3f} for a p={}'.format(f1_best, p_best))

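**Side note:** The 2 previous code cells might yield different results for the best f1-score and the corresponding best value for $p$. The first cell uses a shuffled `KFold` and checks $p$ from 1 to 9, whereas `cross_val_score()` with `cv=5` uses unshuffled, stratified folds for classifiers and its loop only checks $p$ from 1 to 8. In addition, the dataset is shuffled without a fixed seed when it is loaded, so the results can change between runs.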
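
To make the two approaches directly comparable, one option is to hand the same `KFold` object to `cross_val_score()` via its `cv` argument. The sketch below checks this on synthetic data from `make_classification`, which is a stand-in assumed for illustration and not the vessel dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the vessel dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, n_informative=3,
                                     n_redundant=0, n_classes=3, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop over the folds
manual_scores = []
for trn_idx, val_idx in kf_demo.split(X_demo):
    model = LogisticRegression(max_iter=1000).fit(X_demo[trn_idx], y_demo[trn_idx])
    y_val_pred = model.predict(X_demo[val_idx])
    manual_scores.append(f1_score(y_demo[val_idx], y_val_pred, average='micro'))

# Same splitter handed to cross_val_score instead of cv=5
auto_scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                              cv=kf_demo, scoring='f1_micro')

print(np.allclose(manual_scores, auto_scores))
```

Because the splitter has a fixed `random_state`, both code paths see identical folds and should produce the same per-fold scores.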

### Final Evaluation on Test Data

Now that we have identified the best value for $p$, we can perform the final evaluation using the test data. We can now also use the full training data, and don't need to split it into any folds.

In [None]:
# Transform data w.r.t. the degree of polynomial p
poly = PolynomialFeatures(p_best)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)  # reuse the fitted transformer; never fit on test data

classifier = LogisticRegression(fit_intercept=False, max_iter=1000).fit(X_train_poly, y_train)

y_pred = classifier.predict(X_test_poly)

f1_final = f1_score(y_test, y_pred, average='micro')

print('The final f1-score of the Logistic Regression classifier (p={}) is: {:.3f}'.format(p_best, f1_final))
        

This final score is the one to report when quantifying the quality of the classifier.
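
Since the notebook uses `average='micro'` throughout, it may be worth knowing that for single-label multiclass problems the micro-averaged f1-score coincides with plain accuracy. The sketch below illustrates this on made-up labels (hypothetical values, not results from the vessel data) and also prints a per-class breakdown using `classification_report`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical labels and predictions for a 3-class problem
y_true_toy = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0, 2, 0])

f1_micro = f1_score(y_true_toy, y_pred_toy, average='micro')
f1_macro = f1_score(y_true_toy, y_pred_toy, average='macro')
acc = accuracy_score(y_true_toy, y_pred_toy)

print("micro-F1: {:.3f}, accuracy: {:.3f}, macro-F1: {:.3f}".format(f1_micro, acc, f1_macro))
print(classification_report(y_true_toy, y_pred_toy))
```

If the classes are imbalanced, the macro average or the per-class rows of the report can reveal weaknesses that a single micro-averaged number hides.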

---

## Logistic Regression with Cross-Validation (Multi-Parameter Tuning)

So far, we have used Cross-Validation to find the best value for a single hyperparameter (here: the degree `p` of the polynomial). In practice, however, there might be many possible parameters we can and need to consider. This most commonly includes any hyperparameters of a classification model. For example, if you check the docs of [`sklearn.linear_model.LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) you can see hyperparameters such as

* `penalty: {'l1', 'l2', 'elasticnet', None}, default='l2'`
* `C: float, default=1.0` (inverse of regularization strength; must be a positive float)
* `max_iter: int, default=100` (maximum number of iterations taken for the solvers to converge)

and others.

In principle, we could check all possible parameter combinations we want to consider the same way we did above by using nested loops to generate all combinations. For example, the example below shows how we can generate various parameter combinations using nested loops, with one loop for each of the hyperparameters we have just mentioned above.

```
for p in range(1, 9):
    for penalty in ['l1', 'l2', 'elasticnet']:
        for C in [0.1, 1, 10]:
            for max_iter in [100, 1000, 2000]:
                ...
                # Fit and evaluate model with current parameter set
                ...
```

While this would work fine, it can be quite tedious and error prone to write such code. In practice, it is therefore much more convenient to use off-the-shelf auxiliary methods to simplify hyperparameter tuning. In the following, we go through some basic examples to illustrate this.
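
As a first step away from nested loops, the combinations themselves can be enumerated with `sklearn.model_selection.ParameterGrid`. The sketch below uses the same values as the nested loops above; the dictionary keys are just illustrative names chosen for this example.

```python
from sklearn.model_selection import ParameterGrid

grid = {'p': list(range(1, 9)),
        'penalty': ['l1', 'l2', 'elasticnet'],
        'C': [0.1, 1, 10],
        'max_iter': [100, 1000, 2000]}

combinations = list(ParameterGrid(grid))
print(len(combinations))    # 8 * 3 * 3 * 3 = 216
print(combinations[0])      # one dict per parameter combination
```

Each element is a dictionary with one value per hyperparameter, so a single flat loop over `combinations` replaces the four nested loops.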

### Create Model Pipeline

Creating a classifier typically involves many steps such as preprocessing (e.g., standardization, generating polynomial features) and selecting and training a model. By using a [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) we can wrap all involved steps into a single entity. The code cell below shows an example where we create a pipeline that includes

* `StandardScaler()` for standardizing the data
* `PolynomialFeatures()` for the generation of polynomial features
* `LogisticRegression` for training/fitting a Logistic Regression model

Note how we give each component a name (e.g., `stdscaler`). The reason for this will be clearer in the next step.

In [None]:
pipe = Pipeline(steps=[('stdscaler', StandardScaler()),
                       ('polyfeatures', PolynomialFeatures()),
                       ('logreg', LogisticRegression(fit_intercept=False))
                      ]) 

### Perform Cross-Validations

Another very useful auxiliary class is [`sklearn.model_selection.GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), which performs Cross-Validation "under the hood". `GridSearchCV` also takes as input the possible values for the specified hyperparameters; see the variable `param_grid` in the code cell below.

This dictionary allows us to specify the hyperparameter values. Hyperparameters are identified by name, where name is a combination of the component name in the pipeline and the parameter name of the component. For example, `logreg__penalty` refers to the `penalty` input parameter for the `LogisticRegression()` class we named `logreg` in our pipeline.
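
The naming scheme can be checked directly on a pipeline object. The sketch below rebuilds a pipeline with the same step names under a separate variable (`demo_pipe`, an assumption so that the notebook's `pipe` stays untouched) and lists all names of the form `<step>__<parameter>`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

demo_pipe = Pipeline(steps=[('stdscaler', StandardScaler()),
                            ('polyfeatures', PolynomialFeatures()),
                            ('logreg', LogisticRegression(fit_intercept=False))])

# List every tunable name of the form <step>__<parameter>
tunable = sorted(name for name in demo_pipe.get_params() if '__' in name)
print(tunable)

# The same names work with set_params
demo_pipe.set_params(polyfeatures__degree=3, logreg__C=10)
print(demo_pipe.named_steps['polyfeatures'].degree)
```

Any name printed here can be used as a key in `param_grid`; a typo in a key raises an error instead of being silently ignored.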

In the code cell below, we use this approach to mimic the tuning we did previously. Note that only `polyfeatures__degree` refers to a set of different values, while the remaining parameters `logreg__C`, `logreg__penalty`, and `logreg__max_iter` each refer to the single value we used above. With this, we have everything to perform Cross-Validation for our pipeline.

In [None]:
%%time 

# Define considered values for each considered hyperparameter
param_grid = {'polyfeatures__degree': range(1, 10),
              'logreg__C': [1],
              'logreg__penalty': ['l2'],
              'logreg__max_iter': [1000]
             }

# Perform 5-fold Cross-Validation for each possible parameter combination
grid_search_cv = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=3).fit(X_train, y_train)

The code cell above fits 45 models, since we have $9\cdot 1 \cdot 1 \cdot 1 = 9$ possible parameter combinations and use $5$-fold Cross-Validation; hence, $9 \cdot 5 = 45$.

Lastly, we can simply inspect which combination of values for all hyperparameter values showed the best performance. Since we only varied the degree of the polynomial, the result should match the one we got before, i.e., that `p=5` gives us the best results for our data and task here.

In [None]:
print(grid_search_cv.best_params_)

Of course, ideally we want to consider many more parameter combinations by also varying the values of other hyperparameters. The code cell below shows an example. Since we consider $3\cdot 3\cdot 3\cdot 3 = 81$ possible parameter combinations in this example, and still perform $5$-fold Cross-Validation, we now need to fit $81 \cdot 5 = 405$ models. While this is all done "under the hood" -- and as such does not require writing more code -- it naturally does greatly affect the overall runtime.

As such, the code cell below will take some time to complete. Note that here we set `verbose=1` to avoid printing a line for each model that is fitted. However, if you want to see those lines to better observe the progress, you can change it to `verbose=3` like in the previous example.

In [None]:
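%%time
# The default 'lbfgs' solver only supports the 'l2' penalty, so we switch to
# 'saga', which supports 'l1', 'l2', and 'elasticnet'; 'elasticnet' also needs
# an l1_ratio (ignored for the other penalties). Both have a single value,
# so the grid still contains 81 combinations.
param_grid = {'polyfeatures__degree': [4, 5, 6],
              'logreg__C': [0.1, 1, 10],
              'logreg__penalty': ['l1', 'l2', 'elasticnet'],
              'logreg__max_iter': [100, 1000, 2000],
              'logreg__solver': ['saga'],
              'logreg__l1_ratio': [0.5]
             }

grid_search_cv = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=1)

grid_search_cv.fit(X_train, y_train)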

And again, we can check out the best values for all hyperparameters:

In [None]:
print(grid_search_cv.best_params_)
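
Beyond `best_params_`, a fitted `GridSearchCV` object exposes the validation scores of all combinations and a model refit on the full training data. The sketch below shows one way to inspect these and to run the final test evaluation; it uses synthetic data and a small grid (both assumptions for illustration, chosen so the cell runs quickly).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption; not the vessel dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=4, n_informative=3,
                                   n_redundant=0, n_classes=3, random_state=0)
X_trn_syn, X_tst_syn, y_trn_syn, y_tst_syn = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0)

syn_pipe = Pipeline(steps=[('stdscaler', StandardScaler()),
                           ('polyfeatures', PolynomialFeatures()),
                           ('logreg', LogisticRegression(max_iter=1000))])
syn_grid = {'polyfeatures__degree': [1, 2, 3], 'logreg__C': [0.1, 1, 10]}

search = GridSearchCV(syn_pipe, param_grid=syn_grid, cv=5, scoring='f1_micro')
search.fit(X_trn_syn, y_trn_syn)

# Mean validation score per combination, best first
results = pd.DataFrame(search.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score').head())

# The best model was refit on the full training set (refit=True is the default)
test_score = search.score(X_tst_syn, y_tst_syn)
print(search.best_params_, round(search.best_score_, 3), round(test_score, 3))
```

Note that `best_score_` is the mean validation score from Cross-Validation, while `test_score` comes from data the search never saw; the latter is the number to report.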

---

## Summary

Cross-validation plays a crucial role in machine learning as it serves two primary purposes: model evaluation and hyperparameter tuning. By evaluating the model's performance using cross-validation, we can estimate its ability to generalize to unseen data and compare it against other models or algorithms. This evaluation helps in selecting the best-performing model for deployment.

One of the key challenges in machine learning is overfitting, where a model learns to perform well on the training data but fails to generalize to new data. Cross-validation helps mitigate this issue by providing a more reliable estimate of the model's performance. By repeatedly training and evaluating the model on different subsets of the data, cross-validation helps identify models that generalize well.

Another challenge in machine learning is the selection of optimal hyperparameters. Hyperparameters are parameters that are not learned from the data but are set manually by the user, such as the learning rate in neural networks or the depth of a decision tree. Cross-validation can be used to tune these hyperparameters by evaluating the model's performance with different combinations of hyperparameter values. This helps in finding the best set of hyperparameters that optimize the model's performance.

However, cross-validation also presents certain challenges. One such challenge is the potential for data leakage. Data leakage occurs when information from the validation set inadvertently influences the model during training, leading to overly optimistic performance estimates. Care must be taken to ensure that the validation set remains independent and untouched during the training process.
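
One common source of such leakage is fitting a preprocessing step on all rows before splitting into folds. The sketch below contrasts this with placing the scaler inside a pipeline, so that it is re-fit on the training folds of each split. The data is synthetic (an assumption for illustration), and for simple standardization the difference in scores is often small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the vessel dataset)
X_leak, y_leak = make_classification(n_samples=200, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)

# Scaler fit once on all rows: validation folds influence the mean/std
X_scaled_all = StandardScaler().fit_transform(X_leak)
scores_leaky = cross_val_score(LogisticRegression(max_iter=1000), X_scaled_all, y_leak, cv=5)

# Scaler inside the pipeline: it is re-fit on the training folds of each split
leak_free = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores_clean = cross_val_score(leak_free, X_leak, y_leak, cv=5)

print(scores_leaky.mean(), scores_clean.mean())
```

The pipeline variant is the safer default, since it stays correct when the preprocessing step depends more strongly on the data (e.g., feature selection).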

Another challenge is computational cost. Cross-validation involves training and evaluating the model multiple times, which can be time-consuming and computationally expensive, especially for large datasets or complex models. This challenge can be mitigated by reducing the number of folds or parameter combinations, or by running the fits in parallel (e.g., via the `n_jobs` parameter of `GridSearchCV`).

In summary, cross-validation is a vital technique in machine learning for model evaluation and hyperparameter tuning. It helps estimate a model's generalization performance, identify overfitting, and select optimal hyperparameters. While challenges like data leakage and computational cost exist, proper implementation and careful consideration of these challenges can ensure the effectiveness and reliability of cross-validation in improving model performance.