# Lab 4: Model evaluation

### Practice notebook

Hello! This lab teaches several ways of evaluating the model we learned to build in Lab 3. In the last lab, we did not use any automatic procedure to find the optimal parameters; we simply tried values and checked whether the model performed better. In this lab, you will learn how to perform a grid search over parameter settings. In the assignment that follows, you will try various ways of evaluating a model. This notebook contains extra tasks to practice **after** completing the labs with videos.

- 4-3. Run more algorithms using scikit-learn
  - Hold-out
  - Repeated hold-out

- 4-4. Implement manually
  - Grid-search

## 4-3. Run more algorithms using scikit-learn

Now it's your turn. You will try two more methods for splitting the dataset, and then train a simple SVM on the resulting splits. Throughout this section, you will use the same dataset we prepared in the lab session. If you want to create it again, you can run the code below.

In [7]:
import pandas as pd
import numpy as np
RANDOM_SEED = 12345

In [4]:
data = pd.read_csv("sonar.all-data", header=None)
X = data.drop(60, axis=1)
y = data[60]

#### Holdout

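A normal holdout splits the dataset into two parts: a training set and a test set. Because there is no separate validation set, this method should not be used on its own when the task involves parameter tuning: tuning against the test set would leak information into the reported score. In that case, we should use pre-defined parameters that are not derived from our dataset.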

We already tried it once in the previous lab. Scikit-learn supports the holdout method through the **train_test_split** function in the **model_selection** module. Let's load it!

In [5]:
from sklearn.model_selection import train_test_split

You can run this function directly on our dataset (`X, y`). Let's set the test size to 0.2 (`test_size`) and stratify the split by the labels (`stratify=y`). Do not forget to set *random_state* to our `RANDOM_SEED` defined above!

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y)
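
To see what stratifying does, here is a minimal sketch on hypothetical toy data (the arrays `X_toy` and `y_toy` and the 70/30 class balance are assumptions for illustration, not the sonar dataset). It checks that the class ratio carries over to both parts of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 70 labelled "R" and 30 labelled "M".
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(["R"] * 70 + ["M"] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=12345, stratify=y_toy
)

# With stratify=y_toy, the share of "R" labels is preserved in both parts.
train_ratio = np.mean(y_tr == "R")
test_ratio = np.mean(y_te == "R")
print(len(y_tr), len(y_te), train_ratio, test_ratio)
```

Without `stratify`, the two ratios would generally differ from split to split, which matters most for small or imbalanced datasets.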

Now let's create an SVC instance.

In [11]:
from sklearn.svm import SVC

In [12]:
svc = SVC()

Let's fit the model with a *training* dataset and its label!

In [13]:
svc.fit(X_train, y_train)

SVC()

Let's compute the score of the fitted model on the *test* dataset and its labels, and assign it to a variable called `score`.

In [14]:
score = svc.score(X_test, y_test)
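
For a classifier such as SVC, `score` returns the mean accuracy on the given data. The sketch below illustrates this on synthetic data generated with `make_classification` (the data and the names `clf`, `score_toy`, and `manual_accuracy` are assumptions for illustration only).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the lab dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=12345)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=12345, stratify=y_toy
)

clf = SVC().fit(X_tr, y_tr)

# Three ways to get the same number: score(), accuracy_score(), and a manual mean.
score_toy = clf.score(X_te, y_te)
predictions = clf.predict(X_te)
manual_accuracy = np.mean(predictions == y_te)
library_accuracy = accuracy_score(y_te, predictions)
print(score_toy, manual_accuracy, library_accuracy)
```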

#### Repeated holdout

A repeated holdout is somewhat similar to k-fold cross-validation in that the model is trained and tested several times. The difference is that it runs the normal holdout test multiple times, reshuffling the dataset for each trial, so the test sets of different trials may overlap and the number of trials can be chosen independently of the test size. Averaging over the trials gives a more reliable performance estimate than a single holdout.

The repeated holdout is supported by the class called **ShuffleSplit** located in the **model_selection** package. Let's load it first!

In [15]:
from sklearn.model_selection import ShuffleSplit

The ShuffleSplit constructor takes the following parameters: the number of splits (`n_splits`), the proportion of the test set (`test_size`), and the random state (`random_state`).

In [16]:
rs = ShuffleSplit(n_splits=5, test_size=.20, random_state=RANDOM_SEED)
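
The following sketch contrasts the test sets produced by ShuffleSplit with those of KFold on a hypothetical 20-sample array (`X_toy`, `rs_toy`, and `kf_toy` are illustrative names, not part of the lab). It counts how often each sample lands in a test set.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

# Hypothetical toy data: 20 samples identified by their index.
X_toy = np.arange(20).reshape(-1, 1)

rs_toy = ShuffleSplit(n_splits=5, test_size=0.2, random_state=12345)
kf_toy = KFold(n_splits=5, shuffle=True, random_state=12345)

shuffle_tests = [test for _, test in rs_toy.split(X_toy)]
kfold_tests = [test for _, test in kf_toy.split(X_toy)]

# K-fold: every sample is tested exactly once across the five folds.
kfold_counts = np.bincount(np.concatenate(kfold_tests), minlength=20)
# Repeated holdout: each trial is drawn independently, so a sample
# may be tested several times or not at all.
shuffle_counts = np.bincount(np.concatenate(shuffle_tests), minlength=20)

print("k-fold test counts:         ", kfold_counts)
print("repeated holdout test counts:", shuffle_counts)
```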

In this round, let's run the **cross_val_score** function we learned in the lab session. You can pass a ShuffleSplit instance to it through the `cv` argument. You may find the information somewhere in this lab!

In [19]:
from sklearn.model_selection import cross_val_score

In [20]:
scores = cross_val_score(svc, X, y, cv=rs)
scores

array([0.76190476, 0.85714286, 0.73809524, 0.78571429, 0.83333333])

Let's take the maximum score and the mean cross-validation score and save them into the corresponding variables!

In [21]:
max_score = np.max(scores)
mean_score = np.mean(scores)
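
As a small addition, the sketch below summarizes the five scores printed above, including their standard deviation. The variable `std_score` is an assumption added for illustration; the lab itself only asks for the maximum and the mean.

```python
import numpy as np

# Scores copied from the cross_val_score output above.
scores = np.array([0.76190476, 0.85714286, 0.73809524, 0.78571429, 0.83333333])

max_score = np.max(scores)    # best single trial, an optimistic figure
mean_score = np.mean(scores)  # average over the five trials
std_score = np.std(scores)    # spread between trials

print(f"max={max_score:.3f}, mean={mean_score:.3f}, std={std_score:.3f}")
```

The mean together with the spread is usually a fairer summary than the maximum, which reflects only the luckiest split.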

## 4-4. Implement manually

Your last extra task in this lab is to implement grid search. For simplicity, we will not require you to implement cross-validation yourself, since the core function of grid search is to find the optimal parameters for a given dataset.

#### Grid-search

Here we pre-made the skeleton structure for you. You should implement the *fit* method and set two member variables (*best_estimator_, best_score_*). You may want to use Python's `itertools.product` to make your work easier. To evaluate each parameter combination, you should use **StratifiedKFold** inside to get the same result as scikit-learn.

In [22]:
import itertools

Here we give you a parameter grid variable that was used above in our lab session.

In [23]:
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]
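
The grid above can be expanded into individual parameter combinations with `itertools.product`. This is a minimal sketch; the helper name `expand_grid` is hypothetical and only one of several ways to write it.

```python
import itertools

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]


def expand_grid(grid):
    """Yield one dict per parameter combination (hypothetical helper)."""
    for param_set in grid:
        keys = list(param_set)
        # product(*lists) enumerates every way to pick one value per key.
        for values in itertools.product(*(param_set[k] for k in keys)):
            yield dict(zip(keys, values))


combinations = list(expand_grid(param_grid))
print(len(combinations))  # 4 linear + 4 * 2 rbf = 12
print(combinations[0])    # {'C': 1, 'kernel': 'linear'}
```

Each resulting dict can be passed to the estimator as keyword arguments, for example `SVC(**combinations[0])`.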

And we will again stick to a simple SVC classifier. We do not need to initialize it now, since each parameter combination needs its own instance. We use **class** notation with the **self** variable inside, which gives you the same experience as scikit-learn when testing. You only need **self** when you call methods defined in the class or access member variables that were set through self. The class-based structure will not appear in the assignment.

In [24]:
from sklearn.svm import SVC

In [25]:
class GridSearch_Manual():
    def __init__(self, estimator, param_grid, cv):
        self.estimator = estimator
        self.param_grid = param_grid
        self.best_estimator_ = None
        self.best_score_ = 0
        self.cv = cv
        self.skf = StratifiedKFold(n_splits=cv)
        
    def fit(self, X, y):
        
        for param_set in self.param_grid:
            keys, values = zip(*param_set.items())
            permutations_dicts = [dict(zip(keys, v)) for v in itertools.product(*values)]
            
            for param in permutations_dicts:
                classifier = self.estimator(**param)
                scores = []
                for train_index, test_index in self.skf.split(X, y):
                    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
                    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
                    
                    classifier.fit(X_train, y_train)
                    scores.append(classifier.score(X_test, y_test))
                
                mean_score = np.mean(scores)
                if mean_score > self.best_score_:
                    self.best_estimator_ = classifier
                    self.best_score_ = mean_score

Here we made some callers for you. You should be able to initialize your instance in the same way as scikit-learn's **GridSearchCV** works. 

In [27]:
from sklearn.model_selection import StratifiedKFold

In [28]:
grid = GridSearch_Manual(estimator=SVC, param_grid=param_grid, cv=5)

You should be able to fit and return the scores in the same way too!

In [29]:
grid.fit(X, y)

In [30]:
grid.best_score_

In [31]:
grid.best_estimator_

This extra task is not graded, but let's check that your result is the same as the one from scikit-learn!

In [None]:
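from sklearn.model_selection import GridSearchCV
# Compare these values with grid.best_score_ and grid.best_estimator_ above.
gs = GridSearchCV(estimator=svc, param_grid=param_grid, cv=5)
gs.fit(X, y)
print(gs.best_score_, gs.best_estimator_)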
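
The same kind of check can be rehearsed on synthetic data. The sketch below is not the lab's solution: it assumes a smaller hypothetical grid (`small_grid`) and data from `make_classification`, and it uses `ParameterGrid` and `cross_val_score` as a shortcut for the manual loop. It compares the best mean score with the one reported by GridSearchCV.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    GridSearchCV,
    ParameterGrid,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.svm import SVC

# Synthetic stand-in for the lab dataset and a smaller, hypothetical grid.
X_toy, y_toy = make_classification(n_samples=120, n_features=8, random_state=12345)
small_grid = [
    {"C": [0.1, 1, 10], "kernel": ["linear"]},
    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1], "kernel": ["rbf"]},
]

# Manual search: mean stratified 5-fold accuracy for every combination.
cv = StratifiedKFold(n_splits=5)
manual_best = max(
    np.mean(cross_val_score(SVC(**params), X_toy, y_toy, cv=cv))
    for params in ParameterGrid(small_grid)
)

# For classifiers, an integer cv in GridSearchCV also means stratified k-fold.
gs_toy = GridSearchCV(SVC(), small_grid, cv=5).fit(X_toy, y_toy)
print(manual_best, gs_toy.best_score_)
```

One difference worth keeping in mind: by default GridSearchCV refits the best parameter combination on the whole dataset, so its `best_estimator_` is trained on all samples rather than on the last training fold.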

# End of practice assignment 4