# <center> TP2 - 01 Description

# Objective of TP

In this TP you will develop the full **supervised learning pipeline** including the *hyper-parameter tuning* and *model evalutaion*.  

You will then apply the pipeline to three algorithms
* nearest neighbour
* decision tree
* default classifier

Finally, you will perform *model comparison* and **discuss** its results.

### Recommendation:
The code you will develop in this TP is to be re-used in TP3 and the exam.  
Therefore we recommend you try to make it clear (use comments, when printing say what you print) so that next time it is easier for you to remember what it does.  
Also, try to make the code generic so that it can be easilly used for different datasets.   
Try to automate as much as possible so that the code does not require too much of your attention, finally you will need to do the same type of analysis not for 3 algorithms but for 5-6.

## Dataset

You will be workig with the same cars dataset as in TP1.  
Each group shall be using the same `brands` as in TP1.


In [1]:
# Load dataset and extract our part
import pandas as pd

# Reading csv file
autos = pd.read_csv('autos.csv',encoding='latin-1')

# Extracting the relevant part for our group
only_specific_brands = autos.brand.isin(['renault', 'peugeot', 'skoda', 'citroen', 'ford'])
autos = autos[only_specific_brands]

In [2]:
autos.head()

Unnamed: 0,price,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,fuelType,brand,notRepairedDamage,fast_sale
2,11400.0,limousine,2010.0,manuell,175.0,mondeo,125000.0,diesel,ford,nein,False
4,4100.0,kleinwagen,2009.0,manuell,68.0,1_reihe,90000.0,benzin,peugeot,nein,False
6,888.0,kombi,2000.0,manuell,115.0,mondeo,150000.0,benzin,ford,nein,True
7,13700.0,bus,2012.0,manuell,86.0,roomster,5000.0,benzin,skoda,nein,True
9,4299.0,kleinwagen,2010.0,manuell,75.0,2_reihe,125000.0,benzin,peugeot,nein,False


# Data preprocessing

Remember that after loading the dataset, there are several preprocessing steps you need to do before trainign the algorithm.
If you are not sure what these are, see *Course 8 - 02 Hyper-parameter tuning*.

When writing the code, **put short comments explaining what the pre-processing steps are and why you need to do them**.

### First we are changing the column names:

- We changed the style of writing the column names from camelcase to snakecase. This change will make column names easier to read, and we won't have to remember which letter is capitalized.
- We reworded some of the column names in order for them to be more descriptive and obvious.

In [3]:
# Changing the column names
autos.columns = ['price', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 
                 'kilometer', 'fuel_type', 'brand', 'unrepaired_damage', 'fast_sale']
autos.head()

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,fuel_type,brand,unrepaired_damage,fast_sale
2,11400.0,limousine,2010.0,manuell,175.0,mondeo,125000.0,diesel,ford,nein,False
4,4100.0,kleinwagen,2009.0,manuell,68.0,1_reihe,90000.0,benzin,peugeot,nein,False
6,888.0,kombi,2000.0,manuell,115.0,mondeo,150000.0,benzin,ford,nein,True
7,13700.0,bus,2012.0,manuell,86.0,roomster,5000.0,benzin,skoda,nein,True
9,4299.0,kleinwagen,2010.0,manuell,75.0,2_reihe,125000.0,benzin,peugeot,nein,False


### Next, we are going to convert some columns data type:

- We are changing data type in order to facilitate the data processing (classifying the data in numeric or categorical)
- We are converting column 'unrepaired_damage' from object to boolean type
- We are converting column 'registration_year' from float to int.

In [4]:
# Converting column 'unrepaired_damage' from object to boolean type
autos['unrepaired_damage'] = (autos['unrepaired_damage']
                 .str.replace('nein','')
                 .str.replace('ja','True')
                  .astype(bool)
                 )

In [5]:
# Converting column 'registration_year' from float to int.
autos['registration_year'] = (autos['registration_year'].astype(int))

# Is it relevant or we can simply keep it in float format ?

In [6]:
autos.info()
autos.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28401 entries, 2 to 166073
Data columns (total 11 columns):
price                28401 non-null float64
vehicle_type         28401 non-null object
registration_year    28401 non-null int64
gearbox              28401 non-null object
power_ps             28401 non-null float64
model                28401 non-null object
kilometer            28401 non-null float64
fuel_type            28401 non-null object
brand                28401 non-null object
unrepaired_damage    28401 non-null bool
fast_sale            28401 non-null bool
dtypes: bool(2), float64(3), int64(1), object(5)
memory usage: 2.2+ MB


Unnamed: 0,price,registration_year,power_ps,kilometer
count,28401.0,28401.0,28401.0,28401.0
mean,4177.515017,2003.965565,102.402979,121472.307313
std,4680.629533,5.87246,40.664873,39816.529262
min,1.0,1923.0,2.0,5000.0
25%,1199.0,2001.0,75.0,100000.0
50%,2500.0,2004.0,101.0,150000.0
75%,5400.0,2008.0,122.0,150000.0
max,73500.0,2016.0,952.0,150000.0


### Next, we are goint to decide what is output (target) and what is input:
- We do this because ...
- How we do it ...

In [7]:
# output dataframe
out_autos = autos['fast_sale']
# input dataframe
in_autos = autos.iloc[:,[0,1,2,3,4,5,6,7,8,9]]

print('Inputs')
display(in_autos.head())
print('Outputs')
display(out_autos.head())

Inputs


Unnamed: 0,price,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,fuel_type,brand,unrepaired_damage
2,11400.0,limousine,2010,manuell,175.0,mondeo,125000.0,diesel,ford,False
4,4100.0,kleinwagen,2009,manuell,68.0,1_reihe,90000.0,benzin,peugeot,False
6,888.0,kombi,2000,manuell,115.0,mondeo,150000.0,benzin,ford,False
7,13700.0,bus,2012,manuell,86.0,roomster,5000.0,benzin,skoda,False
9,4299.0,kleinwagen,2010,manuell,75.0,2_reihe,125000.0,benzin,peugeot,False


Outputs


2    False
4    False
6     True
7     True
9    False
Name: fast_sale, dtype: bool

### Next, we convert categorical to numeric (one-hot encoding)

- The algorithms we use (except the deciscion tree) cannot operate on label data directly. They require all input variables and output variables to be numeric. Therefore we need to convert categorical data to a numerical form.
- Since no ordinal relationship exists in the categorical variables in our dataset (and we don't want to allow the model to assume a natural ordering between categories), we use the one-hot encoding. 

In [8]:
print('Original input data')
print('-------------------')
display(in_autos.head())

# Create one-hot encoding
in_long = pd.get_dummies(in_autos)
print('\n Numerical input data')
print('-------------------')
display(in_long.head())

Original input data
-------------------


Unnamed: 0,price,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,fuel_type,brand,unrepaired_damage
2,11400.0,limousine,2010,manuell,175.0,mondeo,125000.0,diesel,ford,False
4,4100.0,kleinwagen,2009,manuell,68.0,1_reihe,90000.0,benzin,peugeot,False
6,888.0,kombi,2000,manuell,115.0,mondeo,150000.0,benzin,ford,False
7,13700.0,bus,2012,manuell,86.0,roomster,5000.0,benzin,skoda,False
9,4299.0,kleinwagen,2010,manuell,75.0,2_reihe,125000.0,benzin,peugeot,False



 Numerical input data
-------------------


Unnamed: 0,price,registration_year,power_ps,kilometer,unrepaired_damage,vehicle_type_andere,vehicle_type_bus,vehicle_type_cabrio,vehicle_type_coupe,vehicle_type_kleinwagen,...,fuel_type_cng,fuel_type_diesel,fuel_type_elektro,fuel_type_hybrid,fuel_type_lpg,brand_citroen,brand_ford,brand_peugeot,brand_renault,brand_skoda
2,11400.0,2010,175.0,125000.0,False,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
4,4100.0,2009,68.0,90000.0,False,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
6,888.0,2000,115.0,150000.0,False,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,13700.0,2012,86.0,5000.0,False,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9,4299.0,2010,75.0,125000.0,False,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


### Next, we normalize (bring all variables to same scale [0,1])

-  We need to normalize the data to ensure that all sources are treated equally, and that data-availability bias (and its corresponding misrepresentation of the data universe) is reduced.

In [9]:
# maximum of each attribute
maxs = in_long.max(axis=0) # max accross rows (axis=0)

print('Maximums')
print('----------')
print(maxs)

# minimum of each attribute
mins = in_long.min(axis=0) # max accross rows (axis=0)

print('\n Minimums')
print('----------')
print(mins)

# min-max normalization
norm_in = (in_long - mins) / (maxs-mins)

# new maximum of each attribute
maxs = norm_in.max(axis=0) # max accross rows (axis=0)

print('\n New maximums')
print('----------')
print(maxs)

# new minimum of each attribute
mins = norm_in.min(axis=0) # max accross rows (axis=0)

print('\n New minimums')
print('----------')
print(mins)

Maximums
----------
price                       73500.0
registration_year            2016.0
power_ps                      952.0
kilometer                  150000.0
unrepaired_damage               1.0
vehicle_type_andere             1.0
vehicle_type_bus                1.0
vehicle_type_cabrio             1.0
vehicle_type_coupe              1.0
vehicle_type_kleinwagen         1.0
vehicle_type_kombi              1.0
vehicle_type_limousine          1.0
vehicle_type_suv                1.0
gearbox_automatik               1.0
gearbox_manuell                 1.0
model_1_reihe                   1.0
model_2_reihe                   1.0
model_3_reihe                   1.0
model_4_reihe                   1.0
model_5_reihe                   1.0
model_andere                    1.0
model_b_max                     1.0
model_berlingo                  1.0
model_c1                        1.0
model_c2                        1.0
model_c3                        1.0
model_c4                        1.0
model_c5

### As of this step, no more use of original data and only use preprocessed

In our case above: `out_autos` and `norm_in`

# Prepare for model evaluation and hyper-parameter tuning

### Data splits for model evaluation (training and testing)

You will need to write the code splitting the data to training set (used for model learning and hyper-parameter tuning) and test set used for final model evaluation (test error).

Here, you can choose to **use either 5-folds cross-validation or 5 time repeated hold-out method.**

**Tell us what your choice is and why**. Both choices are good, we just want to know that you understand the differences and you have thought about them.

Remember that in the end this procedure will be used for all your algorithms and that these should work over the same train/test splits. You can make sure this will be the case by fixing the seed for the random sample generation.

### 5-folds cross-validation
- We chose cross-validation because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split (hold-out).
- The more folds we have, the more we will be reducing the error due the bias but increasing the error due to variance; the computational price would go up too, obviously — the more folds you have, the longer it would take to compute it and you would need more memory. With a lower number of folds, we’re reducing the error due to variance, but the error due to bias would be bigger. It’s would also computationally cheaper

In [278]:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn import neighbors, tree
from operator import itemgetter


# Supervised learning pipeline with hyper-parameter tuning¶

def run_classifier(hyper_params, in_data, out_data, clf_callback, n_split_outer=5, n_split_inner=3):    

    k_fold = KFold(n_splits=n_split_outer, random_state=123, shuffle=True)

    # Dictrionnary holding the cross-validation (cv) scores
    cv_scores = dict_filled_with_zeros_for_each(hyper_params)
    
    # Outter cross validation
    for train_idx, test_idx in k_fold.split(in_data, out_data):
        scores_by_fold = dict()
        in_train = norm_in.iloc[train_idx]
        out_train = out_autos.iloc[train_idx]
        
        # Inner cross validation
        for hp in hyper_params:
            clf = clf_callback(hp)
            scores = cross_val_score(clf, in_train, out_train, cv=n_split_inner)
            cv_scores[hp] += scores.mean() 
            scores_by_fold[hp] = scores.mean() 
        
        # Print best score by fold
        best_hp_by_fold = max(scores_by_fold.items(), key=operator.itemgetter(1))[0]
        print ("The optimal number of neighbors is %d" % best_hp_by_fold)
        
    # Calculate average of each hyper-parameter
    for hyper_param, accuracy in cv_scores.items():
        cv_scores[hyper_param] = accuracy / n_split_outer

    return cv_scores


def dict_filled_with_zeros_for_each(keys):
    _dict = dict()
    for key in keys:
        _dict[key] = 0.0
    return _dict


def get_best_hyper_param(scores):
    # Get the hyper-param with the highest accuracy in scores dict.
    return max(scores.items(), key=operator.itemgetter(1))[0]


def clf_neighbors(hyper_param):
    return neighbors.KNeighborsClassifier(n_neighbors=hyper_param)


def clf_tree(hyper_param):
    return tree.DecisionTreeClassifier(max_leaf_nodes=hyper_param)

In [279]:
#### DONE BY CROSS_VAL_SCORE ####
# Shuffle the dataset randomly.
# Split the dataset into k groups
# For each unique group:
    # Take the group as a hold out or test data set
    # Take the remaining groups as a training data set
    # Fit a model on the training set and evaluate it on the test set
    # Retain the evaluation score and discard the model
################################# 

In [280]:
knn_scores = run_classifier(
    hyper_params=[10, 15, 40, 400],
    n_split_outer=5,
    n_split_inner=3,
    in_data=norm_in[:2000],
    out_data=out_autos[:2000],
    clf_callback=clf_neighbors,
)

knn_scores

The optimal number of neighbors is 400
The optimal number of neighbors is 40
The optimal number of neighbors is 40
The optimal number of neighbors is 400
The optimal number of neighbors is 400


{10: 0.5164983961755968,
 15: 0.5242412928837472,
 40: 0.5318715572440458,
 400: 0.5363703878524089}

In [281]:
optimal_neighbors = get_best_hyper_param(knn_scores)
print ("The optimal number of neighbors is %d" % optimal_neighbors)

The optimal number of neighbors is 400


In [282]:
tree_scores = run_classifier(
    hyper_params=[3, 5, 10, 15, 20, 50],
    n_split_outer=5,
    n_split_inner=3,
    in_data=norm_in,
    out_data=out_autos,
    clf_callback=clf_tree,
)

The optimal number of neighbors is 3
The optimal number of neighbors is 10
The optimal number of neighbors is 20
The optimal number of neighbors is 15
The optimal number of neighbors is 5


In [283]:
tree_scores

{3: 0.5764499128681558,
 5: 0.5775942155387808,
 10: 0.5767755256650724,
 15: 0.5776734699442199,
 20: 0.5772949891005181,
 50: 0.5760625567603999}

In [284]:
optimal_leaf_nodes = get_best_hyper_param(tree_scores)
print ("The optimal number of leaf is %d" % optimal_leaf_nodes)

The optimal number of leaf is 15


In [None]:
# # Dans l'ideal on veut
# dt_clf.run()
# dt_clf.get_best_hyper_param()
# dt_clf.get_best_hyper_params()
# dt_clf.get_avg_accuracy()
# # dt_clf.get_avg_accuracies()

In [285]:
from sklearn.model_selection import cross_val_score
from sklearn import neighbors
from sklearn.model_selection import KFold
import numpy as np

def run_classifier_without_hp(in_data, out_data, n_split_outer=5):    

    k_fold = KFold(n_splits=n_split_outer, random_state=123, shuffle=True)

    test_accuracy = 0
    
    # outer kfold
    for train_idx,test_idx in k_fold.split(in_data, out_data):
        in_test = in_data.iloc[test_idx]
        out_test = out_data.iloc[test_idx]
        
        # get counts per class
        class_counts = out_train.value_counts()
        majority_class = class_counts.idxmax()

        pred_test = out_test.copy()
        pred_test[:] = majority_class

        # calculate test accuracy
        test_accuracy += calculate_pred_accuracy(pred_test, out_test)
        
  
    return test_accuracy/n_split_outer


def calculate_pred_accuracy(pred_data, out_data):
    match_test = (pred_data == out_data)
    accuracy = match_test.sum() / match_test.count()
    return accuracy

In [286]:
test_acc = run_classifier_without_hp(norm_in, out_autos)
print(train_acc)
print(test_acc)

0.5157564959759879
0.5157566238834167


In [305]:
def final_train(input_data, output_data, hyper_param, clf_callback, n_splits=5):
    # Use the learned model f to do predictions
    k_fold = KFold(n_splits = 5, random_state=123, shuffle=True)

    avg_test_accuracy = 0
    
    for train_idx, test_idx in k_fold.split(in_data, out_data):
        in_train = in_data.iloc[train_idx]
        out_train = out_data.iloc[train_idx]
        in_test = in_data.iloc[test_idx]
        out_test = out_data.iloc[test_idx]
        
        clf = clf_callback(hyper_param)
        clf.fit(in_train, out_train)

        # Prediction for all test data
        pred_test = clf.predict(in_test)

        # Get test set accuracy        
        test_accuracy = calculate_pred_accuracy(pred_test, out_test)
        print('Test accuracy', test_accuracy)
        avg_test_accuracy += test_accuracy
    
    # Use test sets to get estimate of generalization error (accuracy)
    print(avg_test_accuracy/n_splits)
    return avg_test_accuracy/n_splits

In [306]:
final_train(norm_in[:3000], out_autos[:3000], hyper_param=10, clf_callback=clf_neighbors)

Test accuracy 0.5382855131138884
Test accuracy 0.5466549295774648
Test accuracy 0.5383802816901408
Test accuracy 0.5464788732394367
Test accuracy 0.5431338028169014
0.5425866800875665


0.5425866800875665

### Data splits for hyper-parameter tuning

Here we want you to **use 3-folds inner cross validation**.

You will need to write the code to split each of the training sets above to train/validation accoridng to the 3-fold cross-validation strategy.

### Generalization accuracy

You will also need to prepere the code that will use the trained models to produce predictions for the test instances, calculate the accuracy of over each test set, and calculate the final average accuracy over all the test instances (estimate of generalization accuracy).

# Train and test nearest neigbour model

Once you have the general procedure in place, train the nearest neigbour model.

### Hyper-parameter search

Hyper-parameter in nearest neighbour algorithm is the number of neighbours to use.
We want you to try at least 5 different values. **Tell us which values you decide to try.** (There is no Why questoin here.)

Remember that for choosing the best hyper-parameter value, you use the the inner cross validation and the best hyper-parameter is the one with the highest average accuracy over the validation sets.

### Model lerning and test accuracy

Once you have the best value of the hyper-parameter, you use it to **retrain** the model over the merged train+validation (you do this 5 times, see above *Data splits for model evaluation*). You then use this **retrained** model to get the final test accuracy.

For each of the test samples (there should be 5, see above), report the test accuracy and the corresponding hyper-parameter setting (the one chosen as best for this specific split).

Are the hyper-parameter parameters the same for all the test sets? **Discuss** if you think this is  normal or not, why it happens and if it creates some difficulties for interpreting the model. **There is no single correct answer here!** We want to see that you undertand the procedure and that you use your brain.



### Comment
- We found that the best hyper-parameter value is 13 , because it has the highest average accuracy over the validation sets.

# Train and test decision tree

Use the same general procedure to train a decision tree.

Hyper-parameters for decision trees are the pre-prunning criteria such as maximum number of leafs (see *Course 5 - 02 Decision tree prunning*). 
Pick one of these and use at least 5 different values. **Tell us which one you pick and what values you are using.**

Calculate and report the test accuracies together with their corresponding heper-parameter values. (No more comments needed here.)

# Train and test default classifier

Default classifier has no hyper-parameters, so you can skip the inner-cross validatoin procedure.

Calculate and report the test accuracies for the 5 test sets from the part *Data splits for model evaluation*

#  /!\ THIS IS THE MOST IMPORTANT PART /!\

# Compare models

Once you have all your test accuracies for the nearest neighbour, decision trees and default classifier, calculate the estimated generalization accuracy of each (the everage accuracy accross the test sets).

Is any of the algorithms peforming better than the other two? **Discuss, comment.**

## Use the McNemar test 

Use the McNemar test to verify whether the differences in the generalization accuracy are significant. 

In McNemar you can always compare only two algorithms. Do all the pair-wise comparisons, present and **explain** the results. Are these what you would expect?

In [276]:
def mc_nemar_test(m1, m2, alpha=0.05):
    ctab = pd.crosstab(m1,m2)
    #display(ctab)
    # calculate mcnemar test
    result = mcnemar(ctab)
    # interpret test results
    if result.pvalue < alpha:
        print('p-value < alpha: at {} significance level we reject H0. We conclude that M1 and M2 differ.'.format(alpha))
    else:
        print('p-value >= alpha: at {} significance level we cannot reject H0. We cannot say that M1 and M2 differ.'.format(alpha))

# Comparing the models

### Write an observation for each algorithm and compare performance
#### ==>
### Explain WHY some algorithms are better than the others
#### ==>

# The McNemar test

### Nearest neighbour vs Decision trees 
#### Result
#### Explaination
### Nearest neighbour vs Default classifier
#### Result
#### Explaination
### Decision trees vs Default classifier
#### Result
#### Explaination