# Optional: Scikit-learn primer.
In this optional assignment, you will learn to use the scikit-learn library. We highly recommend working through this notebook before starting the final assignment.

$\newcommand{\q}[1]{\rightarrow \textbf{Question #1}}$
$\newcommand{\ex}[1]{\rightarrow \textbf{Exercise #1}}$

## Introduction
All algorithms in scikit-learn, both learning and pre-processing, are implemented behind the same `fit`, `predict` and `transform` API. Once you have learned this API, you can use any algorithm without having to implement it yourself, and you can apply different algorithms to a given learning problem in the same way. The API also hides the complex optimization choices that have to be made; you control these through the hyper-parameters. The effects of these choices are well documented in the scikit-learn API documentation and tutorials.
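
As a small illustration of how hyper-parameters are exposed, the sketch below sets them in the constructor and then reads and changes them through `get_params` and `set_params`. The values `C=0.5` and `max_iter=200` are arbitrary choices for this example, not recommendations.

```python
from sklearn.linear_model import LogisticRegression

# Hyper-parameters are passed to the constructor (values are illustrative)
clf = LogisticRegression(C=0.5, max_iter=200)

# Every estimator exposes its hyper-parameters as a plain dict
params = clf.get_params()
print("C =", params["C"], "max_iter =", params["max_iter"])

# They can also be changed after construction
clf.set_params(C=2.0)
print("C after set_params =", clf.get_params()["C"])
```

The same two methods exist on other estimators and transformers, which is what allows tools such as grid search to tune any of them in a uniform way.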



## Dataset

In this assignment, we will use the Iris dataset to keep things simple.

In [None]:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

## Using classifiers
Using a classifier in scikit-learn consists of three steps:
1. Initialize the model. During this step, you can already set its hyper-parameters.
2. Fit the model on the training data.
3. Make predictions and/or evaluate the model.
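
The three steps above can be sketched as follows. This is only a minimal example; the variable names (`X_demo`, `demo_model`, and so on) and the choice of `n_neighbors=5` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# 1. Initialize the model with its hyper-parameters
demo_model = KNeighborsClassifier(n_neighbors=5)

# 2. Fit the model on the training data
demo_model.fit(X_demo, y_demo)

# 3. Make predictions and/or evaluate the model
demo_predictions = demo_model.predict(X_demo[:5])
demo_accuracy = demo_model.score(X_demo, y_demo)  # mean accuracy for classifiers
print(demo_predictions, demo_accuracy)
```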

### Create
Creating models is very easy in scikit-learn. All you have to do is create a new instance of the model's class.

$ \ex{1} $ Extend the dictionary of models with the `SVC` and `LogisticRegression` algorithms, using the keys `"SVM"` and `"LogisticRegression"`. Give the SVM a `poly` kernel. Also, give both algorithms a regularization constant `C=0.5` and `random_state=42`.

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

random_state = 42

models = {
    "GaussianNB": GaussianNB(),
    "DummyClassifier": DummyClassifier(strategy="most_frequent"),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=None, min_samples_leaf=2, random_state=random_state),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=3, weights="distance"),
    # START ANSWER   
    # END ANSWER
}

assert "GaussianNB" in models and isinstance(models["GaussianNB"], GaussianNB), "There is no GaussianNB in models"
assert "DecisionTreeClassifier" in models and isinstance(models["DecisionTreeClassifier"], DecisionTreeClassifier), "There is no DecisionTreeClassifier in models"
assert "KNeighborsClassifier" in models and isinstance(models["KNeighborsClassifier"], KNeighborsClassifier), "There is no KNeighborsClassifier in models"
assert "SVM" in models and isinstance(models["SVM"], SVC), "There is no SVC in models"
assert "LogisticRegression" in models and isinstance(models["LogisticRegression"], LogisticRegression), "There is no LogisticRegression in models"

### Fit
$ \ex{2} $ Fit each of your models on the entire dataset (`X`, `y`) by calling the model's `.fit` method.

In [None]:
for name, model in models.items():
    # START ANSWER  
    # END ANSWER

In [None]:
from sklearn.utils.validation import check_is_fitted

for model in models.values():
    check_is_fitted(model)
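
To see what `check_is_fitted` does, the sketch below calls it on an unfitted and on a fitted model. Fitting stores learned attributes whose names end in an underscore (such as `classes_`), and their absence is what triggers the error. The variable names are chosen for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.exceptions import NotFittedError
from sklearn.naive_bayes import GaussianNB
from sklearn.utils.validation import check_is_fitted

X_demo, y_demo = load_iris(return_X_y=True)

# An unfitted estimator has no learned attributes yet
unfitted = GaussianNB()
try:
    check_is_fitted(unfitted)
    raised = False
except NotFittedError:
    raised = True
print("unfitted model raised NotFittedError:", raised)

# After fit, learned attributes such as classes_ are available
fitted = GaussianNB().fit(X_demo, y_demo)
check_is_fitted(fitted)  # passes silently
print("classes:", fitted.classes_)
```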

### Evaluate
The `sklearn.metrics` module has lots of metrics that can evaluate a model's predictions. Here is an example of how to calculate a model's F1 and accuracy score.

In [None]:
from sklearn.metrics import f1_score, accuracy_score

for name, model in models.items():
    prediction = model.predict(X)
    # scikit-learn metrics expect the true labels first, then the predictions
    f1_score_value = f1_score(y, prediction, average="weighted")
    accuracy = accuracy_score(y, prediction)
    print(name)
    print("- accuracy_score", accuracy)
    print("- f1_score", f1_score_value)
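
Scikit-learn metric functions take their arguments as `(y_true, y_pred)`. Accuracy is symmetric, but the weighted F1 score weights each class by its number of true samples, so the order matters. The toy labels below are made up purely to illustrate this.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: three balanced classes, and a model that always predicts class 0
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 0, 0, 0, 0]

acc = accuracy_score(y_true_demo, y_pred_demo)

# Intended order: true labels first
f1_correct_order = f1_score(y_true_demo, y_pred_demo, average="weighted", zero_division=0)

# Swapped order: the class weights now come from the predictions instead
f1_swapped = f1_score(y_pred_demo, y_true_demo, average="weighted", zero_division=0)

print("accuracy:", acc)
print("weighted F1 (y_true, y_pred):", f1_correct_order)
print("weighted F1 (swapped):", f1_swapped)
```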

## Data splitting
Models usually achieve a high evaluation score on the data they were trained on. However, this says nothing about how well they generalize to unseen data. We therefore usually evaluate models using either a held-out test (or validation) split or k-fold cross-validation. Scikit-learn makes our life easier here as well by implementing both for us.

### Test/validation split
We can split a dataset into a training and a test set using the `train_test_split` function. The `test_size` parameter sets the fraction of the data that goes to the test set. Passing the target labels `y` to the `stratify` parameter makes the split take the label distribution into account, so that the training and test sets have approximately the same distribution of target labels.
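
The effect of `stratify` can be checked by counting the labels on each side of the split. The sketch below uses `test_size=0.2` and `random_state=0`, which are arbitrary values chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Iris has 50 samples per class; a stratified 80/20 split keeps that balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# np.bincount counts how often each class label occurs
print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```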

$ \ex{3} $ The data has already been split into a training and a test set. Fit each model on the training set and evaluate it on the test set.

The result on the test set should roughly be equal to:

|                  Model | Accuracy |    F1 |
|-----------------------:|---------:|------:|
|             GaussianNB |     0.86 |  0.86 |
|        DummyClassifier |     0.33 |   0.5 |
| DecisionTreeClassifier |    0.866 | 0.866 |
|   KNeighborsClassifier |        1 |     1 |
|                    SVM |     0.93 | 0.934 |
|     LogisticRegression |    0.933 | 0.934 |

In [None]:
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, shuffle=True, stratify=y)

# START ANSWER 
# END ANSWER 

### K-fold validation
Setting up k-fold validation is a bit more work, but we can do it as follows:

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate

def k_fold_fit_and_evaluate(X, y, model, scoring_method, n_splits=5):
    """Return the per-fold test scores of `model` under shuffled k-fold validation."""
    # define the evaluation procedure
    cv = KFold(n_splits=n_splits, random_state=42, shuffle=True)
    # fit and evaluate the model on each fold
    scores = cross_validate(model, X, y, scoring=scoring_method, cv=cv, n_jobs=-1)
    return scores["test_score"]
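
To make the splitting itself more concrete, the sketch below runs the same kind of `KFold` object on ten dummy samples and prints the indices of each fold. The array `data_demo` is a stand-in invented for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

data_demo = np.arange(10)  # ten dummy samples, identified by their index
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(data_demo)):
    # each sample lands in the test part of exactly one fold
    print("fold", fold, "train:", train_idx, "test:", test_idx)
    test_folds.append(test_idx)

all_test = np.sort(np.concatenate(test_folds))
```

Every sample is used for testing exactly once and for training in the remaining folds, which is why the per-fold scores can be averaged into a single estimate.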

Note: the `scoring` argument of `cross_validate` (our `scoring_method`) accepts a scorer object. We can create such a scorer from a metric function using the `make_scorer` function from scikit-learn.
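
As a hedged illustration of `make_scorer`, the sketch below wraps a metric that is not used elsewhere in this notebook (macro-averaged recall) and passes it to `cross_validate`. Extra keyword arguments given to `make_scorer` are forwarded to the metric function. The names `recall_scorer` and `cv_demo` are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = load_iris(return_X_y=True)

# The scorer calls recall_score(y_true, y_pred, average="macro") on each fold
recall_scorer = make_scorer(recall_score, average="macro")

cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(GaussianNB(), X_demo, y_demo, scoring=recall_scorer, cv=cv_demo)

# One score per fold
print(cv_results["test_score"])
```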

$ \ex{4} $ Use the example below to calculate the mean and standard deviation of both the F1 and the accuracy score. The `k_fold_fit_and_evaluate` function returns the per-fold validation scores for the provided `scoring_method`.

Hint: use `np.mean` and `np.std`.

The result using k-fold validation should roughly be equal to:


|                  Model | mean F1 | std F1 | mean Accuracy | std Accuracy |
|-----------------------:|--------:|-------:|--------------:|-------------:|
|             GaussianNB |   0.959 | 0.0249 |         0.960 |       0.0249 |
|        DummyClassifier |   0.107 | 0.0187 |          0.26 |       0.0249 |
| DecisionTreeClassifier |   0.946 | 0.0338 |       0.94655 |       0.0338 |
|   KNeighborsClassifier |   0.966 | 0.0214 |        0.9663 |      0.02144 |
|                    SVM |   0.980 | 0.0163 |         0.980 |       0.0163 |
|     LogisticRegression |   0.966 | 0.0298 |         0.966 |       0.0298 |

In [None]:
from sklearn.metrics import make_scorer

n_splits = 5


# make_scorer calls the metric as score_func(y_true, y_pred, **kwargs)
scoring_method_f1 = make_scorer(f1_score, average="weighted")
# START ANSWER 
# END ANSWER 


for name, model in models.items():
    print(name)
    metrics_f1 = k_fold_fit_and_evaluate(X, y, model, scoring_method_f1, n_splits=n_splits) 
    # START ANSWER 
    # END ANSWER 

## Grid search
Scikit-learn also makes it easier to tune hyper-parameters: `GridSearchCV` evaluates every combination of values in a parameter grid using cross-validation and keeps the best one.
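
A parameter grid is a dict that maps each hyper-parameter name to a list of candidate values. To get a feel for how quickly a grid grows, the sketch below expands a small grid with `ParameterGrid`. The grid itself is only an illustration; the `min_samples_leaf` values are an assumption and not part of the assignment.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: 4 values x 2 values
demo_grid = {"max_depth": [None, 2, 5, 10], "min_samples_leaf": [1, 2]}

# ParameterGrid enumerates every combination of the listed values
combinations = list(ParameterGrid(demo_grid))
print("number of candidates:", len(combinations))
for candidate in combinations[:3]:
    print(candidate)
```

With k-fold validation, each candidate is fitted once per fold, so the total number of fits is the number of candidates times the number of folds.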

$ \ex{5} $ Extend the `model_parameters` dict with a parameter grid for the `KNeighborsClassifier`, `SVM` and `LogisticRegression` models.

In [None]:
from sklearn.model_selection import GridSearchCV

random_state = 42
n_splits = 5
# make_scorer calls the metric as score_func(y_true, y_pred, **kwargs)
scoring_method = make_scorer(f1_score, average="weighted")

model_parameters = {
    "GaussianNB": {
    
    },
    "DummyClassifier": {
        
    },
    "DecisionTreeClassifier": {
        'random_state': [random_state],
        'max_depth': [None, 2, 5, 10]
    },
    # START ANSWER
    # END ANSWER
}

for model_name, parameters in model_parameters.items():
    model = models[model_name]
    
    cv = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    grid_search = GridSearchCV(model, parameters, cv=cv, n_jobs=-1, verbose=False, scoring=scoring_method).fit(X, y)
    
    best_model = grid_search.best_estimator_
    best_score = grid_search.best_score_
    best_params = grid_search.best_params_
    
    print(model_name)
    print("- best_score =", best_score)
    print("best parameters:")
    for k, v in best_params.items():
        print("-", k, v)


## Using Transformers
Transformers have an API that is similar to, but slightly different from, that of the models. Transformers still have a `fit` method; the `StandardScaler`, for example, uses it to find the `mean` and `std` values of each feature. However, the `predict` method is replaced with the `transform` method. Scikit-learn did this to make it clear to users that this is not a model but a feature transformer.

In [None]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X_train)

scaler.mean_, scaler.scale_

After fitting the transformer, you can call the `transform` method, and it will transform the input features based on the parameters it found during the last `fit` call.

In [None]:
X_train_transformed = scaler.transform(X_train)
print("X_train")
print("mean", X_train.mean())
print("std", X_train.std())
print()
print("X_train_transformed")
print("mean", X_train_transformed.mean())
print("std", X_train_transformed.std())
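
The statistics found during `fit` are computed per feature, and the same statistics are reused when transforming new data. The sketch below checks both points; the variable names are assumptions made for this example, and the split mirrors the one used earlier in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo
)

# Fit on the training data only, then transform both parts
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# Per-feature statistics: the training columns end up with mean 0 and std 1
print("train mean per feature:", X_tr_scaled.mean(axis=0).round(3))
print("train std per feature: ", X_tr_scaled.std(axis=0).round(3))

# The test set is scaled with the training statistics, so its mean is only close to 0
print("test mean per feature: ", X_te_scaled.mean(axis=0).round(3))

# transform is equivalent to (x - mean_) / scale_
manual = (X_te - demo_scaler.mean_) / demo_scaler.scale_
```

Fitting the scaler on the training set only keeps information about the test set out of the pre-processing step.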

$ \ex{6} $ First, transform the dataset using the `Normalizer` transformer. Then fit and evaluate each model using the transformed features.
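
Unlike the `StandardScaler`, which works per feature (column), the `Normalizer` rescales each sample (row) to unit norm, using the L2 norm by default. A tiny sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two made-up samples with two features each
rows = np.array([[3.0, 4.0], [1.0, 0.0]])

# Each row is divided by its own L2 norm
unit_rows = Normalizer().fit_transform(rows)
print(unit_rows)
print("row norms:", np.linalg.norm(unit_rows, axis=1))
```

Because each row is normalized on its own, `fit` does not learn anything from the data here; it exists so that the transformer follows the common API.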

In [None]:
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, shuffle=True, stratify=y)

scaler = preprocessing.Normalizer()

# START ANSWER 
# END ANSWER 