# Lab 4: Model evaluation

This lab covers several methods for evaluating the model created in the previous lab. So far, we have not used any automatic procedure to determine good parameters. In this lab, you will learn how to run a grid search to find good parameter settings and how to use k-fold cross-validation to estimate how well the model generalizes.

### Contents
- 4-1. Model evaluation methods in scikit-learn
  - K-fold
  - Grid search
  - Nested k-fold
- 4-2. Manual implementation
  - K-fold

## 4-1. Model evaluation methods in scikit-learn

The first validation method we will learn is **k-fold cross-validation**. This method is quite simple and is widely used in practice. It divides the dataset into k parts (folds) of nearly equal size, uses one fold ($\frac{1}{k}$ of the dataset) as a **validation set**, and trains the model on the remaining $\frac{k-1}{k}$ of the data, i.e., a (k-1):1 split. The validation fold is changed k times so that every fold serves as the validation set exactly once, and the k validation scores together give a more reliable estimate of the model's performance than a single split.
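As a small worked example of the arithmetic, assume the dataset has $n = 208$ samples (the size listed for the Sonar data) and $k = 5$, and write $s_i$ for the validation score obtained on fold $i$ (a symbol introduced here only for illustration):

$$
208 = 3 \cdot 42 + 2 \cdot 41, \qquad \text{CV score} = \frac{1}{k} \sum_{i=1}^{k} s_i
$$

Under these assumptions, three validation folds would hold 42 samples and two would hold 41, and the reported cross-validation score is the plain average of the five fold scores.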

#### Load the libraries

These are the basic libraries used throughout this lab session. The random seed is fixed so that your results match the instructor's.

In [None]:
import pandas as pd
import numpy as np
RANDOM_SEED = 12345

#### Load the dataset

In this lab, we will use the same data as in the previous lab: **Connectionist Bench** from the UCI Machine Learning Repository, which can be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data). This dataset has two classes, ***Mines*** and ***Rocks***, with 60 attributes describing each data point. More information can be found <a href="https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)">here</a>. If you have downloaded the whole lab package, the file is already included and you should have no problem loading it.

The first thing you must do is load the data and check that it is loaded correctly. We will use pandas to load and manipulate it. Since the file has no **header** row, you need to tell pandas not to use the first row as column names. You can refer to the previous lab to check whether the dataset is correctly loaded.

In [None]:
data = pd.read_csv("sonar.all-data", header=None)
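The effect of `header=None` can be seen on a tiny in-memory example. This is only a sketch: the two rows below are made up and use three attributes instead of 60, so that it runs without the lab file.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the lab file:
# numeric attributes first, then the class label in the last column.
toy_csv = "0.02,0.04,0.01,R\n0.05,0.03,0.08,M\n"

with_header_none = pd.read_csv(io.StringIO(toy_csv), header=None)
with_default_header = pd.read_csv(io.StringIO(toy_csv))

print(with_header_none.shape)             # both rows are kept as data
print(with_default_header.shape)          # the first row was used as column names
print(with_header_none.columns.tolist())  # integer column names
```

On the real data, `data.shape` and `data.head()` are quick ways to run the same kind of check.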

We will continue to use scikit-learn, in which we manage labels and data attributes separately. Let's separate the data labels from the dataset.

- Divide the dataset into two parts: attributes (`X`) and labels (`y`).

In [None]:
X = data.drop(60, axis=1)
y = data.iloc[:, -1]
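A quick sanity check after the separation is to look at the shapes and the label counts. The sketch below uses a made-up three-row table (`toy`, a hypothetical stand-in for `data`) so that it is self-contained.

```python
import pandas as pd

# Hypothetical stand-in for `data`: attribute columns 0..2 and a label column 3.
toy = pd.DataFrame([
    [0.1, 0.2, 0.3, "R"],
    [0.4, 0.5, 0.6, "M"],
    [0.7, 0.8, 0.9, "M"],
])

label_col = toy.columns[-1]          # name of the last column (3 here, 60 in the lab)
X_toy = toy.drop(label_col, axis=1)  # attributes only
y_toy = toy.iloc[:, -1]              # labels only

print(X_toy.shape, y_toy.shape)
print(y_toy.value_counts())          # number of samples per class
```

The same calls on `X` and `y` show whether the label column was removed from the attributes and how balanced the two classes are.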

In the previous lab, the next task was to split the dataset into training and test sets. In this lab, we will not hold out a test set. Instead, we will split the dataset into training and validation sets and use the validation set to estimate how well the model generalizes. Note that once the validation set is also used during model creation to determine good parameters (e.g., together with grid search), its score is no longer an unbiased estimate. In that case, we may need to split the dataset into three parts, including a test set, to get final performance measures.

Since we are not trying optimization at this stage, we will divide our dataset into two parts using the **k-fold cross-validation** method.

#### K-fold

Scikit-learn provides two types of k-fold methods: k-fold and stratified k-fold. As the name suggests, stratified k-fold preserves the proportions of labels when separating datasets. We will try both and see which produces a better model for our dataset. 

First, let's try a normal **k-fold** method. The first step is the same as other scikit-learn functions: import the class from the library package and create an instance. You can find k-fold in the *model_selection* package.

In [None]:
from sklearn.model_selection import KFold

Next, we will initialize our instance as we did before for classifiers. Here, we need to specify the number of splits (`n_splits`). Let's set it to five.

- Initialize a KFold instance.

In [None]:
kf = KFold(n_splits=5)

Now we can pass our dataset to the **split** method of our instance. It divides the dataset in a 4:1 ratio five times, following the original order of the dataset. This method only returns indices, so we need to use those indices to get the actual data points. If we want to shuffle the dataset before splitting, we must set `shuffle=True` when creating the instance.

- Use the split method to iterate over the indices of each fold, print them, and select the corresponding rows.

In [None]:
for train_index, test_index in kf.split(X):
    # CHECK THE ORDERS OF THE SPLIT TRAINING SET AND THE TEST SET
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
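The difference between the ordered and the shuffled variant is easiest to see on a tiny array. The following is a minimal sketch on ten made-up samples; the seed value is the one used for `RANDOM_SEED` in this lab.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(-1, 1)  # ten made-up samples

ordered = KFold(n_splits=5)
shuffled = KFold(n_splits=5, shuffle=True, random_state=12345)

# Collect only the test indices of each fold.
ordered_tests = [test for _, test in ordered.split(toy_X)]
shuffled_tests = [test for _, test in shuffled.split(toy_X)]
shuffled_again = [test for _, test in shuffled.split(toy_X)]

print("ordered: ", [t.tolist() for t in ordered_tests])
print("shuffled:", [t.tolist() for t in shuffled_tests])
```

In both cases every sample lands in exactly one test fold. With an integer `random_state`, calling `split` again reproduces the same shuffled folds.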

Next, let's try the **stratified k-fold** method. You can also find it in the model_selection package. We will use the same number of splits (five).

In [None]:
from sklearn.model_selection import StratifiedKFold

- Initialize a StratifiedKFold instance.

In [None]:
skf = StratifiedKFold(n_splits=5)

In this case, we also need to give the split method our original **y** value so that the algorithm knows the label distribution and keeps it in the divided dataset.

- Use the split method to iterate over the indices of each fold, print them, and select the corresponding rows.

In [None]:
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
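To see why stratification matters, consider a dataset whose rows are ordered by class. The sketch below uses 20 made-up labels (ten `"M"` followed by ten `"R"`) and a hypothetical helper, `mine_share`, that reports the share of `"M"` labels in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

toy_X = np.zeros((20, 1))                  # attribute values do not matter here
toy_y = np.array(["M"] * 10 + ["R"] * 10)  # labels sorted by class

def mine_share(cv):
    """Share of 'M' labels in each test fold produced by the splitter `cv`."""
    return [float(np.mean(toy_y[test] == "M")) for _, test in cv.split(toy_X, toy_y)]

plain_share = mine_share(KFold(n_splits=5))
strat_share = mine_share(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_share)
print("StratifiedKFold:", strat_share)
```

Without shuffling, plain k-fold produces test folds that contain a single class, while the stratified folds keep the 50/50 proportion of the toy labels.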

You can use these indices to get the training and validation partitions, but writing the training and scoring loop by hand for every fold is a lot of work. Scikit-learn provides a function called **cross_val_score** that automates the cross-validation process. It returns an array with the performance score from each of the k iterations (i.e., the cross-validation scores).

In [None]:
from sklearn.model_selection import cross_val_score

To use this method, we need to go through the same model creation process we learned last time. Let's make a basic SVC classifier with the RBF kernel.

In [None]:
from sklearn.svm import SVC

- Initialize an SVC instance.

In [None]:
clf = SVC()

When `cv` is an integer and the estimator is a classifier, this function uses **StratifiedKFold** internally, so you do not need to worry about the class distribution. If you want to use **KFold** instead of **StratifiedKFold**, create a **KFold** instance and pass it to the function as the `cv` parameter.

- Return a list of cross validation scores by using `cross_val_score`.

In [None]:
# NORMAL CASE: StratifiedKFold is applied
scores = cross_val_score(clf, X, y, cv=5) 

- Return a list of cross validation scores by using cross_val_score with a customized KFold saved as `kf`.

In [None]:
# SPECIAL CASE: Normal KFold is applied
kf = KFold(n_splits=5)
scores2 = cross_val_score(clf, X, y, cv=kf)

In [None]:
scores

In [None]:
scores2

In [None]:
np.mean(scores), np.mean(scores2)
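What `cross_val_score` automates can be written out as an explicit loop. The following is a sketch on synthetic data from `make_classification` (an assumption made so that the example runs without the lab file); it mirrors the default behaviour for a classifier with `cv=5`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the lab data.
toy_X, toy_y = make_classification(n_samples=100, n_features=10, random_state=0)

clf = SVC()
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_index, test_index in skf.split(toy_X, toy_y):
    # Fit a fresh copy of the classifier on the training folds only.
    model = clone(clf).fit(toy_X[train_index], toy_y[train_index])
    # Score it (accuracy for classifiers) on the held-out fold.
    manual_scores.append(model.score(toy_X[test_index], toy_y[test_index]))

auto_scores = cross_val_score(clf, toy_X, toy_y, cv=5)
print(np.round(manual_scores, 3), np.round(auto_scores, 3))
```

The two arrays agree because both use the same unshuffled stratified folds and the same scoring.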

For classifiers, the default score is **accuracy**, but you can also request other scores, such as precision, recall, and the F1 score, through the `scoring` parameter. Let's compute the F1 score instead of accuracy.

- Return a list of cross validation scores by using `cross_val_score` with a customized scoring option (`f1_macro`).

In [None]:
scores3 = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')

In [None]:
np.mean(scores3)
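If several metrics are needed at once, scikit-learn's `cross_validate` function accepts a list of scorer names and avoids re-running the folds for each metric. A minimal sketch, again on synthetic data rather than the lab file:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the lab data.
toy_X, toy_y = make_classification(n_samples=100, n_features=10, random_state=0)

results = cross_validate(SVC(), toy_X, toy_y, cv=5, scoring=["accuracy", "f1_macro"])

# One array of five fold scores per requested metric, stored under "test_<name>".
for name in ("test_accuracy", "test_f1_macro"):
    print(name, results[name].mean())
```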

#### Grid search

In the last exercise, we tried to increase the test accuracy by trying different parameter values by hand. However, this quickly becomes impractical: you cannot always wait for each model to finish training, and you cannot manually enter numerous parameter combinations. In this situation, **grid search** can be used to find the best combination among a given set of candidate parameter values.

In [None]:
from sklearn.model_selection import GridSearchCV

`GridSearchCV` receives the candidate parameters as a dictionary or as a list of dictionaries. Inside each dictionary, we list the candidate values for each parameter, and every combination of those values is tried.

- Define a parameter grid with two sub-grids: one containing C and kernel, and the other containing C, gamma, and kernel as options.

In [None]:
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
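Before running the search, it can help to know how many candidates the grid contains. The sketch below expands the same grid with scikit-learn's `ParameterGrid`; the fold count of five is an assumption matching the `cv=5` setting used in this lab.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)  # 4 linear + 4 * 2 rbf combinations
n_fits = n_candidates * 5       # one fit per candidate and fold (assuming cv=5)

print(n_candidates, n_fits)
print(candidates[0])
```

The first dictionary contributes 4 candidates and the second 8, so a 5-fold search trains 60 models, plus one final refit of the best candidate on the whole dataset under the default `refit=True`.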

First, as always, we need to create an instance with all our parameters.

- Define a grid search instance.

In [None]:
search = GridSearchCV(clf, param_grid, cv=5)

Next, we can directly fit this instance to our dataset. Since it runs cross-validation internally, we do not need to split the data ourselves; we pass the entire dataset.

- Fit the search to our dataset (`X`, `y`).

In [None]:
search.fit(X, y)

Now, our first grid search is done! You can find out the best score and the best estimator.

- Return the best estimator by using the attribute best_estimator_.

In [None]:
search.best_estimator_

- Return the best score by using the attribute best_score_.

In [None]:
search.best_score_
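Besides the best estimator and score, a fitted search also exposes `best_params_` and the full `cv_results_` table. A minimal sketch, using synthetic data and a reduced grid (`toy_grid`, chosen here only to keep the run short):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the lab data and a reduced, illustrative grid.
toy_X, toy_y = make_classification(n_samples=100, n_features=10, random_state=0)
toy_grid = {"C": [1, 10], "gamma": [0.001, 0.0001], "kernel": ["rbf"]}

search = GridSearchCV(SVC(), toy_grid, cv=5).fit(toy_X, toy_y)

print(search.best_params_)

# One row per candidate: its parameters, mean validation score, and rank.
table = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(table.sort_values("rank_test_score"))
```

Looking at the whole table shows how close the runner-up candidates are to the best one, which a single `best_score_` value hides.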

#### Nested k-fold

Nested k-fold is used when we want to estimate optimal parameters but do not have enough data points to separate the dataset into three parts (training, validation, and test). This method runs one k-fold to perform the grid search and another k-fold to measure the performance. Here, we shuffle the dataset with a different seed before each k-fold, since the strategy is to estimate the parameters and test them on different portions of the same dataset.

Here we are going to use an SVC classifier with the RBF kernel again!

- Initialize an SVC instance with the RBF kernel.

In [None]:
clf = SVC(kernel="rbf")

The basic idea of nested k-fold is to use one cross-validation (the inner one) to **create** the models and pick the best parameters, and the other cross-validation (the outer one) to **evaluate** the result. The outer cross-validation works like a test set.

We eventually need a loop, but let's learn about a basic structure first.

First, we need to set two different k-fold cross-validation instances.

- Initialize two KFold instances with the same options, except for the random seed.

In [None]:
model_cv = KFold(n_splits=4, shuffle=True, random_state=RANDOM_SEED) # inner k-fold
eval_cv = KFold(n_splits=4, shuffle=True, random_state=RANDOM_SEED+1) # outer k-fold

Next, we also need to create one grid search instance that uses the first k-fold instance.

- Initialize one grid search instance with `model_cv` as an option for cross validation.

In [None]:
search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=model_cv)

However, the best score of this search would not be a fair performance estimate, since it is computed on the same folds that were used to choose the parameters. Therefore, we now use our second cross-validation instance to get a more reasonable cross-validation score.

- Compute the mean cross-validation score of `search` with `eval_cv` as the outer cross-validation.

In [None]:
np.mean(cross_val_score(search, X=X, y=y, cv=eval_cv))
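The one-line call hides the outer loop. Written out explicitly, nested k-fold looks roughly like the sketch below, which assumes synthetic data and a reduced grid (`toy_grid`) so that it runs quickly without the lab file.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the lab data and a reduced, illustrative grid.
toy_X, toy_y = make_classification(n_samples=120, n_features=10, random_state=0)
toy_grid = {"C": [1, 10], "gamma": [0.01, 0.001]}

model_cv = KFold(n_splits=4, shuffle=True, random_state=12345)  # inner k-fold
eval_cv = KFold(n_splits=4, shuffle=True, random_state=12346)   # outer k-fold
search = GridSearchCV(SVC(kernel="rbf"), toy_grid, cv=model_cv)

outer_scores = []
for train_index, test_index in eval_cv.split(toy_X):
    # Inner loop: grid search with model_cv, using the outer training part only.
    fitted = clone(search).fit(toy_X[train_index], toy_y[train_index])
    # Outer loop: score the refitted best model on data the search never saw.
    outer_scores.append(fitted.score(toy_X[test_index], toy_y[test_index]))
    print(fitted.best_params_, round(outer_scores[-1], 3))

reference = cross_val_score(search, toy_X, toy_y, cv=eval_cv)
print(np.mean(outer_scores), np.mean(reference))
```

Each outer fold may select different parameters, so the result estimates the performance of the whole tuning procedure rather than of one fixed parameter setting.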

## 4-2. Manual implementation

Here, we are going to implement k-fold ourselves. It is a straightforward algorithm with only three steps: 1) divide the data into k folds, 2) choose one of the folds as one set and all the other folds as the other set, 3) repeat step 2 k times, choosing a different fold each time.

We will also give it the same structure as the class in the scikit-learn library so that we can quickly test and compare!

In [None]:
class KFold_Manual():
    def __init__(self, n_splits=5, shuffle=False, random_state=RANDOM_SEED):
        return
        
    def split(self, X):
        return

The answer is as follows:

In [None]:
class KFold_Manual():
    def __init__(self, n_splits=5, shuffle=False, random_state=RANDOM_SEED):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state

    def split(self, X):
        # use positional indices so that they work with .iloc
        indices = np.arange(len(X))
        # shuffle a copy of the indices with a seeded generator
        if self.shuffle:
            rng = np.random.RandomState(self.random_state)
            indices = rng.permutation(indices)

        # split the indices into n_splits folds of (almost) equal size
        split_indices = np.array_split(indices, self.n_splits)

        # fold i is the test set; all other folds form the training set
        results = []

        for i in range(self.n_splits):
            test_index = split_indices[i]
            train_index = np.concatenate(
                [val for idx, val in enumerate(split_indices) if idx != i]
            )
            results.append([train_index, test_index])

        return results

Now, let's copy and paste the code above and run it here!

In [None]:
kf = KFold_Manual()

In [None]:
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
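To compare a manual implementation against scikit-learn, one option is to check that both produce identical index arrays. The sketch below uses a compact, hypothetical helper (`kfold_indices`, not part of the lab code) that follows the same three steps without shuffling, and assumes 208 rows.

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_indices(n_samples, n_splits):
    """Hypothetical helper: unshuffled k-fold (train, test) index pairs."""
    folds = np.array_split(np.arange(n_samples), n_splits)
    pairs = []
    for i, test_index in enumerate(folds):
        # All folds except fold i form the training set.
        train_index = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pairs.append((train_index, test_index))
    return pairs

n_samples = 208  # assumed number of rows
manual = kfold_indices(n_samples, 5)
reference = list(KFold(n_splits=5).split(np.zeros((n_samples, 1))))

same = all(
    np.array_equal(m_train, r_train) and np.array_equal(m_test, r_test)
    for (m_train, m_test), (r_train, r_test) in zip(manual, reference)
)
print(same)
```

The match relies on `np.array_split` giving the extra samples to the first folds, which is also how `KFold` sizes its folds when the data does not divide evenly.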

# END OF LAB 4