# Lab3: Introduction to supervised learning
This lab will be separated into two parts:

1. First, we will code a random classifier ourselves and evaluate it using k-fold cross-validation on the Pokemon dataset.

2. Then, we will learn to do the same thing using the [sklearn](https://scikit-learn.org/stable/) library.

In [36]:
import pandas as pd
import numpy as np

## Loading the dataset

Load the Pokemon dataset (or the `pre_processed.csv` one we did in the previous session).

In [37]:
df = pd.read_csv("../pokemon.csv")

Extract 3 features of your choice into an array `X` and a target array `y` (conventional notations of `sklearn`).

In [38]:
# Only a small subset of variables is extracted here, because we are working on the random classifier,
# but you can take all the features we studied in the last lab
X = df[['sp_attack', 'sp_defense']].values
y = df["is_legendary"].values

## Coding our own solution

### Coding a random classifier

1. Implement the simplest possible classifier: given a numpy vector, return either 0 or 1 at random (use `numpy.random.binomial`). Make $p$ (the probability of being classified as 1) a variable.

In [60]:
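def random_classifier(X, p: float = .5):
    """Random classifier: given a numpy vector X, return 1 with probability p and 0 otherwise.

    The input X is ignored: the prediction is a single Bernoulli draw.

    Example:
        random_classifier(np.array([1, 2, 3])) returns either 0 or 1
    """
    # Without a size argument, binomial returns a scalar rather than an array of shape (1,)
    return int(np.random.binomial(1, p))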


2. Apply this classifier to every row of the `X` numpy matrix and store the predictions in `y_predict` (named `y_Binpredict` in the cell below).

In [40]:
y_Binpredict=[random_classifier(x) for x in X]
y_Binpredict


[0, 1, 1, 1, 0, 0, 1, 1, 0, 1, ...]
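The list comprehension above calls the classifier once per row. As a minimal sketch, the same kind of predictions can be drawn in a single call by passing a `size` argument. The helper name `random_classifier_batch`, the `seed` argument, and the toy matrix `X_demo` are assumptions made for illustration, and the sketch uses NumPy's `default_rng` generator rather than `numpy.random.binomial` directly.

```python
import numpy as np

def random_classifier_batch(X, p: float = 0.5, seed=None):
    """Draw one 0/1 prediction per row of X; each one is 1 with probability p."""
    rng = np.random.default_rng(seed)  # seed is optional, only for reproducibility
    return rng.binomial(1, p, size=len(X))

# Hypothetical toy matrix with 4 rows and 2 features
X_demo = np.array([[65, 65], [80, 80], [122, 120], [60, 50]])
y_demo_predict = random_classifier_batch(X_demo, p=0.5, seed=0)
print(y_demo_predict.shape)  # (4,)
```

The result is a flat integer array with one entry per row, which can be passed directly to metric functions.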


3. Create the four evaluation functions we saw during lecture 4, each taking as input the ground truth `y_true` and the predictions `y_pred`:
- `accuracy`
- `recall`
- `f1_score`
- `precision`

In [66]:
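def confMatrix(y_true, y_pred):
    """Return the counts (TP, TN, FP, FN) for binary labels in {0, 1}."""
    t, p = np.asarray(y_true).ravel(), np.asarray(y_pred).ravel()
    return (np.sum((t == 1) & (p == 1)), np.sum((t == 0) & (p == 0)),
            np.sum((t == 0) & (p == 1)), np.sum((t == 1) & (p == 0)))

def accuracy(y_true, y_pred):
    return np.sum(np.equal(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    TP, _, FP, _ = confMatrix(y_true, y_pred)
    return TP / (TP + FP) if TP + FP > 0 else 0.0

def recall(y_true, y_pred):
    TP, _, _, FN = confMatrix(y_true, y_pred)
    return TP / (TP + FN) if TP + FN > 0 else 0.0

def f1_score(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0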

In [42]:
import sklearn 
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(accuracy_score(y,y_Binpredict))
print(precision_score(y,y_Binpredict))
print(recall_score(y,y_Binpredict))
print(f1_score(y,y_Binpredict))

0.5081148564294632
0.07142857142857142
0.38571428571428573
0.12053571428571429


4. Apply these functions to `y` and `y_predict` and draw conclusion.

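Only the accuracy (about 0.51) is close to .5, as expected from a coin flip. Precision (0.07), and therefore F1 (0.12), are much lower because legendary Pokemon are a small minority of the dataset, while recall (0.39) stays near the chosen p = .5, up to sampling noise. These values are the baseline against which to compare the quality of our future algorithms.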
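To see how the four scores relate to the confusion-matrix counts, here is a small hand-checkable sketch. The toy vectors and the helper name `binary_counts` are assumptions made for illustration; they are not lab data.

```python
import numpy as np

def binary_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

# Toy example: 4 positives, 4 negatives
y_true_toy = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred_toy = [1, 0, 0, 1, 1, 0, 0, 0]
tp, tn, fp, fn = binary_counts(y_true_toy, y_pred_toy)

accuracy_toy = (tp + tn) / (tp + tn + fp + fn)  # 5 / 8
precision_toy = tp / (tp + fp)                  # 2 / 3
recall_toy = tp / (tp + fn)                     # 2 / 4
f1_toy = 2 * precision_toy * recall_toy / (precision_toy + recall_toy)  # 4 / 7
print(tp, tn, fp, fn)  # 2 3 1 2
```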

### Separation between tests and train
We will evaluate our algorithm by "training" it on a subset of the data (`X_train`, `y_train`), predicting on `X_test`, and comparing the predictions `y_test_predict` with the ground truth `y_test`.

1. Is there a training phase of the random classifier?

No: the random classifier ignores its input, so there is nothing to fit.

2. Create a function `split_train_test` that takes as input a matrix `X` and a target `y` and randomly splits them into two matrices `X_train` and `X_test` and two targets `y_train` and `y_test`. You can use the function `numpy.random.choice`.

In [43]:
def split_train_test(X, y, test_ratio=0.2):
    # test_ratio (an added parameter) is the fraction of rows kept for testing
    assert len(X) == len(y)
    # Draw the test rows at random, without replacement; the rest is the training set
    n_test = int(len(X) * test_ratio)
    test_idx = np.random.choice(len(X), size=n_test, replace=False)
    is_test = np.zeros(len(X), dtype=bool)
    is_test[test_idx] = True

    return X[~is_test], X[is_test], y[~is_test], y[is_test]

X_train, X_test, y_train, y_test = split_train_test(X, y)


3. Predict the values for the test dataset `X_test` and store them in `y_test_predict`.

4. Compute the accuracy, precision, recall, f1_score by comparing `y_test_predict` to `y_test`.

5. Can you see the limitation of using accuracy alone? What would be the problem if we had an unbalanced dataset?

Accuracy reflects the class distribution of the data: on an unbalanced dataset, always predicting the majority class gives a high accuracy even though the classifier is a constant and never detects the minority class.
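A minimal sketch of this limitation, on made-up labels (5 positives out of 100, an assumption chosen for illustration): a classifier that always predicts 0 reaches 95% accuracy while finding no positive at all.

```python
import numpy as np

# Toy labels: 5 positives out of 100 (illustrative, not the Pokemon data)
y_true_toy = np.array([1] * 5 + [0] * 95)
# Constant classifier: always predict the majority class 0
y_const = np.zeros_like(y_true_toy)

accuracy_const = np.mean(y_true_toy == y_const)
tp = np.sum((y_true_toy == 1) & (y_const == 1))
fn = np.sum((y_true_toy == 1) & (y_const == 0))
recall_const = tp / (tp + fn)

print(accuracy_const)  # 0.95
print(recall_const)    # 0.0
```

Recall exposes the problem that accuracy hides, which is why several metrics are reported together.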

### K-fold validation

The other, more robust approach we saw in class is k-fold validation, which consists of using *k-1* folds for training and 1 fold for testing, repeated so that each fold serves once as the test set. We then compute an average/median of the performance metrics over all experiments.

1. Create a function `k_fold_train_test` that will first shuffle an input matrix and then divide it into k folds, with the number of folds specified as input.

In [51]:
from numpy.random import shuffle

def pimp_my_k_fold(matrix, y, k=3):
    # Shuffle the row indexes so that the folds are random
    indexes = np.arange(len(matrix))
    shuffle(indexes)
    # np.array_split keeps every sample, even when len(matrix) is not a multiple of k
    return [(matrix[idx], y[idx]) for idx in np.array_split(indexes, k)]

folds = pimp_my_k_fold(X, y, k=4)
folds

[(array([[ 40,  62],
         [ 40,  79],
         [ 95,  60],
         ...

2. Use the k-fold algorithm to compute the average accuracy and recall over the k folds. The algorithm will:
    - Iterate over the k folds
    - Train the model on the k-1 other folds
    - Evaluate the performance on the 1 remaining fold and store it
    - Compute the average/median performance

3. What problem do you see with this approach?

In [67]:

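accuracies = []
for i in range(len(folds)):
    # Fold i is the test set; the k-1 other folds form the training set
    X_test, y_test = folds[i]
    trains = [folds[ix] for ix in range(len(folds)) if ix != i]

    X_train = np.concatenate([train[0] for train in trains])
    y_train = np.concatenate([train[1] for train in trains])
    # A trainable model would be fitted here:
    # model.fit(X_train, y_train)

    # One random prediction per test sample, flattened to a 1-D array
    y_pred = np.array([random_classifier(x) for x in X_test]).ravel()

    # Store the score under a new name: `accuracy = accuracy(...)` would overwrite
    # the function and raise "TypeError: 'numpy.float64' object is not callable"
    accuracies.append(accuracy(y_test, y_pred))
    # recalls.append(recall(y_test, y_pred))

print(np.mean(accuracies))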

## Using sklearn
scikit-learn (`sklearn`) is the standard Python library for classical machine learning (less so for deep learning). It comes with built-in methods for training, performance evaluation, and much more.

1. Import different performance evaluation metrics by reading the documentation [here](https://scikit-learn.org/stable/modules/model_evaluation.html). (it's too long a read for a lab, but it's definitely an interesting read). Compare the `balanced_accuracy` and `accuracy` to our previous implementation (see [here](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score) for more). Compute the scores on `y` for the random classifier we implemented.
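As a hedged sketch of the difference between the two scores, the example below uses made-up labels (5 positives out of 100, an assumption for illustration) and a constant classifier. `balanced_accuracy_score` averages the recall obtained on each class, so the constant classifier falls back to 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy labels: 5 positives out of 100 (illustrative, not the Pokemon data)
y_true_toy = np.array([1] * 5 + [0] * 95)
# Constant classifier: always predict the majority class 0
y_const = np.zeros_like(y_true_toy)

acc = accuracy_score(y_true_toy, y_const)               # 95 correct out of 100
bal_acc = balanced_accuracy_score(y_true_toy, y_const)  # (1.0 + 0.0) / 2

# Random predictions with p = 0.5; the seed is arbitrary
rng = np.random.default_rng(0)
y_random = rng.binomial(1, 0.5, size=len(y_true_toy))
print(acc, bal_acc)
print(accuracy_score(y_true_toy, y_random),
      balanced_accuracy_score(y_true_toy, y_random))
```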

2. Plenty of functions are available to split the dataset into train and test (see [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) for the complete list). Split `X` and `y` into train and test using the function `sklearn.model_selection.train_test_split`. What is the role of the `stratify` parameter? What problem does it solve?
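A minimal sketch of a stratified split, on toy arrays invented for illustration (10 positives out of 100). Passing `stratify=y_toy` asks `train_test_split` to keep the same class proportions in both subsets, which protects against a test set with few or no minority samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features, 10% positives (illustrative assumption)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Both subsets keep the 10% share of positives: 8 of 80 and 2 of 20
print(y_train.sum(), len(y_train))
print(y_test.sum(), len(y_test))
```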

3. Use the function `sklearn.model_selection.KFold` to get the proper indexes and perform cross validation on the random classifier using `balanced_accuracy`.
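One possible shape for this cross-validation loop is sketched below, under stated assumptions: the toy data, the seeds, and the choice of 5 folds are illustrative, and the random predictions are drawn with NumPy's `default_rng` instead of the `random_classifier` function from the first part.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Toy data: 100 rows, 2 features, 20% positives (illustrative assumption)
X_toy = rng.integers(10, 150, size=(100, 2))
y_toy = np.array([1] * 20 + [0] * 80)

# shuffle=True matters here because y_toy is sorted by class
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X_toy):
    # The training fold is unused: a random classifier has nothing to fit
    X_test, y_test = X_toy[test_idx], y_toy[test_idx]
    y_pred = rng.binomial(1, 0.5, size=len(X_test))
    scores.append(balanced_accuracy_score(y_test, y_pred))

print(np.mean(scores))
```

`KFold.split` yields index arrays rather than data, so the same indexes can be reused to slice both `X` and `y`.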

# Conclusion and further work
What do you think could be the use of this random classifier for the rest of our work on the titanic dataset?


**Highly advised bonus** (you will be able to use it during the exam): 
Create a Python module `utils.py` with the different functions and tools we coded today. We will re-use it throughout the rest of the labs.