# <center> TP2 - 01 Description

# Objective of TP

In this TP you will develop the full **supervised learning pipeline**, including *hyper-parameter tuning* and *model evaluation*.  

You will then apply the pipeline to three algorithms:
* nearest neighbour
* decision tree
* default classifier

Finally, you will perform *model comparison* and **discuss** its results.

### Recommendation:
The code you develop in this TP will be re-used in TP3 and the exam.  
We therefore recommend making it clear (use comments, and label whatever you print) so that it is easier to remember later what it does.  
Also, try to make the code generic so that it can easily be used with different datasets.   
Automate as much as possible so that the code needs little of your attention: later you will need to run the same type of analysis for 5-6 algorithms rather than 3.

## Dataset

You will be working with the same cars dataset as in TP1.  
Each group shall use the same `brands` as in TP1.


In [2]:
# Load dataset and extract our part
import pandas as pd

# Reading csv file
autos = pd.read_csv('autos.csv',encoding='latin-1')

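# Extracting the relevant part for our group
# (.copy() gives an independent frame, so later column assignments do not touch a view of the full dataset)
only_specific_brands = autos.brand.isin(['renault', 'peugeot', 'skoda', 'citroen', 'ford'])
autos = autos[only_specific_brands].copy()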

In [3]:
autos.head()

| | price | vehicleType | yearOfRegistration | gearbox | powerPS | model | kilometer | fuelType | brand | notRepairedDamage | fast_sale |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 11400.0 | limousine | 2010.0 | manuell | 175.0 | mondeo | 125000.0 | diesel | ford | nein | False |
| 4 | 4100.0 | kleinwagen | 2009.0 | manuell | 68.0 | 1_reihe | 90000.0 | benzin | peugeot | nein | False |
| 6 | 888.0 | kombi | 2000.0 | manuell | 115.0 | mondeo | 150000.0 | benzin | ford | nein | True |
| 7 | 13700.0 | bus | 2012.0 | manuell | 86.0 | roomster | 5000.0 | benzin | skoda | nein | True |
| 9 | 4299.0 | kleinwagen | 2010.0 | manuell | 75.0 | 2_reihe | 125000.0 | benzin | peugeot | nein | False |


# Data preprocessing

Remember that after loading the dataset, there are several preprocessing steps you need to perform before training the algorithm.
If you are not sure what these are, see *Course 8 - 02 Hyper-parameter tuning*.

When writing the code, **put short comments explaining what the pre-processing steps are and why you need to do them**.

---------------------------
# Comments & Coding
- Renaming the columns makes the table easier to read.
- We convert the column types because this makes later processing easier (each column can then be treated as numeric or categorical).

In [4]:
# Changing the column names
autos.columns = ['price', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 
                 'kilometer', 'fuel_type', 'brand', 'unrepaired_damage', 'fast_sale']
# Converting column 'unrepaired_damage' from object to boolean type:
# 'ja' (yes) becomes True, any other value (here 'nein', no) becomes False
autos['unrepaired_damage'] = autos['unrepaired_damage'] == 'ja'
# Converting column 'registration_year' from float to int.
autos['registration_year'] = autos['registration_year'].astype(int)
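
One further step that scikit-learn estimators generally need is a numeric encoding of the categorical columns. The sketch below is only an illustration under stated assumptions: `toy` is a hypothetical three-row stand-in for the renamed `autos` frame, and one-hot encoding with `pd.get_dummies` is one possible choice, not necessarily the one prescribed in the course.

```python
import pandas as pd

# Hypothetical stand-in for the renamed `autos` frame (illustrative values only).
toy = pd.DataFrame({
    'price': [11400.0, 4100.0, 888.0],
    'gearbox': ['manuell', 'manuell', 'automatik'],
    'fuel_type': ['diesel', 'benzin', 'benzin'],
    'fast_sale': [False, False, True],
})

# Separate the target from the inputs.
y = toy['fast_sale']
X = toy.drop(columns='fast_sale')

# One-hot encode the categorical columns; numeric columns are left unchanged.
categorical_cols = ['gearbox', 'fuel_type']
X_encoded = pd.get_dummies(X, columns=categorical_cols)

print(X_encoded.columns.tolist())
```

For distance-based models such as nearest neighbour, rescaling the numeric columns is usually worth considering as well, since `price` and `kilometer` live on very different scales.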

In [7]:
# info() prints its summary itself and returns None, so it is not wrapped in display()
autos.info()
autos.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28401 entries, 2 to 166073
Data columns (total 11 columns):
price                28401 non-null float64
vehicle_type         28401 non-null object
registration_year    28401 non-null float64
gearbox              28401 non-null object
power_ps             28401 non-null float64
model                28401 non-null object
kilometer            28401 non-null float64
fuel_type            28401 non-null object
brand                28401 non-null object
unrepaired_damage    28401 non-null bool
fast_sale            28401 non-null bool
dtypes: bool(2), float64(4), object(5)
memory usage: 2.2+ MB

| | price | registration_year | power_ps | kilometer |
|---|---|---|---|---|
| count | 28401.0 | 28401.0 | 28401.0 | 28401.0 |
| mean | 4177.515017 | 2003.965565 | 102.402979 | 121472.307313 |
| std | 4680.629533 | 5.87246 | 40.664873 | 39816.529262 |
| min | 1.0 | 1923.0 | 2.0 | 5000.0 |
| 25% | 1199.0 | 2001.0 | 75.0 | 100000.0 |
| 50% | 2500.0 | 2004.0 | 101.0 | 150000.0 |
| 75% | 5400.0 | 2008.0 | 122.0 | 150000.0 |
| max | 73500.0 | 2016.0 | 952.0 | 150000.0 |


-----------------------------------------------------------------

# Prepare for model evaluation and hyper-parameter tuning

### Data splits for model evaluation (training and testing)

You will need to write the code that splits the data into a training set (used for model learning and hyper-parameter tuning) and a test set (used for the final model evaluation, i.e. the test error).

Here, you can choose to **use either 5-fold cross-validation or the hold-out method repeated 5 times.**

**Tell us what your choice is and why**. Both choices are good; we just want to know that you understand the differences and have thought about them.

Remember that in the end this procedure will be used for all your algorithms and that these should work over the same train/test splits. You can make sure this will be the case by fixing the seed for the random sample generation.

--------------------------------------
# Comments & Coding
We chose 5-fold cross-validation. It may take longer to run, but every instance is used for testing exactly once, so the accuracy estimate is based on the whole dataset.

In [26]:
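from sklearn.model_selection import KFold

# Target (output) column and input columns
out_autos = autos['fast_sale']
in_autos = autos.drop(columns='fast_sale')

# 5-fold cross-validation; the fixed seed keeps the splits identical for every algorithm
num_of_splits = 5
kf = KFold(n_splits=num_of_splits, random_state=123, shuffle=True) 

# Print the positional indices of each train/test split
for train_idx,test_idx in kf.split(in_autos,out_autos):
    print("train:", train_idx, "test:", test_idx)

# After the loop, test_idx holds the indices of the last fold
print('Test outputs')
display(out_autos.iloc[test_idx].head())
print('Test inputs')
display(in_autos.iloc[test_idx].head())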

train: [    0     1     2 ... 28398 28399 28400] test: [    8    20    23 ... 28386 28387 28393]
train: [    0     1     2 ... 28398 28399 28400] test: [    9    10    12 ... 28374 28376 28392]
train: [    0     1     4 ... 28398 28399 28400] test: [    2     3     7 ... 28362 28381 28396]
train: [    1     2     3 ... 28398 28399 28400] test: [    0     4     6 ... 28385 28388 28390]
train: [    0     2     3 ... 28392 28393 28396] test: [    1     5    13 ... 28398 28399 28400]
Test outputs


4     False
14    False
66    False
79    False
80    False
Name: fast_sale, dtype: bool

Test inputs


| | price | vehicle_type | registration_year | gearbox | power_ps | model | kilometer | fuel_type | brand | unrepaired_damage |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 4100.0 | kleinwagen | 2009.0 | manuell | 68.0 | 1_reihe | 90000.0 | benzin | peugeot | False |
| 14 | 6500.0 | kleinwagen | 2013.0 | manuell | 75.0 | citigo | 40000.0 | benzin | skoda | False |
| 66 | 1600.0 | limousine | 2002.0 | manuell | 116.0 | focus | 150000.0 | diesel | ford | False |
| 79 | 3900.0 | bus | 2002.0 | manuell | 90.0 | berlingo | 150000.0 | diesel | citroen | False |
| 80 | 3000.0 | suv | 1993.0 | automatik | 136.0 | andere | 150000.0 | benzin | ford | False |


### Data splits for hyper-parameter tuning

Here we want you to **use 3-fold inner cross-validation**.

You will need to write the code that splits each of the training sets above into train/validation parts according to the 3-fold cross-validation strategy.
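
A minimal sketch of the nested splitting, assuming 30 stand-in instances instead of the cars data. The names `outer_cv` and `inner_cv` are hypothetical. The point to notice is that the inner indices refer to positions within the outer training set, not within the full dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in data: 30 instances, 2 features (assumption, replaces the cars data).
X = np.arange(60).reshape(30, 2)
y = np.arange(30) % 2

outer_cv = KFold(n_splits=5, shuffle=True, random_state=123)  # train/test
inner_cv = KFold(n_splits=3, shuffle=True, random_state=123)  # train/validation

split_sizes = []
for outer_train_idx, test_idx in outer_cv.split(X):
    X_outer_train = X[outer_train_idx]
    y_outer_train = y[outer_train_idx]
    # Inner indices are positions within X_outer_train, not within X.
    for inner_train_idx, val_idx in inner_cv.split(X_outer_train):
        X_inner_train, X_val = X_outer_train[inner_train_idx], X_outer_train[val_idx]
        split_sizes.append((len(X_inner_train), len(X_val), len(test_idx)))

# One (train, validation, test) size triple per inner split: 5 x 3 = 15 in total.
print(len(split_sizes), split_sizes[0])
```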

### Generalization accuracy

You will also need to prepare the code that uses the trained models to produce predictions for the test instances, calculates the accuracy over each test set, and calculates the final average accuracy over all the test instances (the estimate of generalization accuracy).
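
One possible shape for this code is sketched below. The helper `evaluate` is a hypothetical name, the data come from `make_classification` as a stand-in for the cars data, and the nearest neighbour model with 5 neighbours is only a placeholder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier


def evaluate(model, X, y, cv):
    """Return the accuracy per test fold and the accuracy over all test instances."""
    fold_accuracies = []
    all_predictions = np.empty_like(y)
    for train_idx, test_idx in cv.split(X):
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        all_predictions[test_idx] = predictions  # keep them for later comparisons
        fold_accuracies.append(accuracy_score(y[test_idx], predictions))
    overall_accuracy = accuracy_score(y, all_predictions)
    return fold_accuracies, overall_accuracy


# Stand-in data (assumption): 200 instances, 5 numeric features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=123)

fold_acc, overall_acc = evaluate(KNeighborsClassifier(n_neighbors=5), X, y, outer_cv)
print('accuracy per test fold:', np.round(fold_acc, 3))
print('estimated generalization accuracy:', round(overall_acc, 3))
```

When all folds have the same size, the accuracy over all test instances equals the mean of the per-fold accuracies.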

# Train and test nearest neighbour model

Once you have the general procedure in place, train the nearest neighbour model.

### Hyper-parameter search

The hyper-parameter of the nearest neighbour algorithm is the number of neighbours to use.
We want you to try at least 5 different values. **Tell us which values you decide to try.** (There is no "why" question here.)

Remember that to choose the best hyper-parameter value you use the inner cross-validation: the best value is the one with the highest average accuracy over the validation sets.
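
The selection rule can be sketched as follows. Everything here is illustrative: `tune_k` is a hypothetical helper, the candidate values are arbitrary, and the data are a synthetic stand-in for one outer training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier


def tune_k(X_train, y_train, k_values, inner_cv):
    """Return the k with the highest mean validation accuracy, plus all the means."""
    mean_val_accuracy = {}
    for k in k_values:
        scores = []
        for inner_train_idx, val_idx in inner_cv.split(X_train):
            model = KNeighborsClassifier(n_neighbors=k)
            model.fit(X_train[inner_train_idx], y_train[inner_train_idx])
            scores.append(model.score(X_train[val_idx], y_train[val_idx]))
        mean_val_accuracy[k] = float(np.mean(scores))
    best_k = max(mean_val_accuracy, key=mean_val_accuracy.get)
    return best_k, mean_val_accuracy


# Stand-in for one outer training set (assumption).
X_train, y_train = make_classification(n_samples=150, n_features=5, random_state=0)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=123)
k_values = [1, 3, 5, 9, 15]  # illustrative candidates

best_k, mean_val_accuracy = tune_k(X_train, y_train, k_values, inner_cv)
print('mean validation accuracy per k:', mean_val_accuracy)
print('best k:', best_k)
```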

### Model learning and test accuracy

Once you have the best value of the hyper-parameter, you use it to **retrain** the model on the merged train+validation data (you do this 5 times, see *Data splits for model evaluation* above). You then use this **retrained** model to get the final test accuracy.

For each of the test samples (there should be 5, see above), report the test accuracy and the corresponding hyper-parameter setting (the one chosen as best for this specific split).
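
As a hedged alternative to hand-written loops, scikit-learn's `GridSearchCV` can run the inner cross-validation and, with its default `refit=True`, retrain the best setting on the whole training set it was given. The data and the candidate values below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption, replaces the cars data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=123)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=123)
param_grid = {'n_neighbors': [1, 3, 5, 9, 15]}  # illustrative candidates

results = []
for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X), start=1):
    search = GridSearchCV(KNeighborsClassifier(), param_grid,
                          cv=inner_cv, scoring='accuracy')
    # Inner CV picks the best k, then the model is refitted on all of X[train_idx].
    search.fit(X[train_idx], y[train_idx])
    test_accuracy = search.score(X[test_idx], y[test_idx])
    results.append((fold, search.best_params_['n_neighbors'], test_accuracy))

for fold, best_k, test_accuracy in results:
    print(f'test fold {fold}: best k = {best_k}, test accuracy = {test_accuracy:.3f}')
```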

Are the best hyper-parameter values the same for all the test sets? **Discuss** whether you think this is normal, why it happens, and whether it creates difficulties for interpreting the model. **There is no single correct answer here!** We want to see that you understand the procedure and that you use your brain.



# Train and test decision tree

Use the same general procedure to train a decision tree.

The hyper-parameters of decision trees are the pre-pruning criteria, such as the maximum number of leaves (see *Course 5 - 02 Decision tree prunning*). 
Pick one of these and use at least 5 different values. **Tell us which one you pick and what values you are using.**
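
As an illustration only, the sketch below assumes the chosen criterion is the maximum number of leaves, which scikit-learn exposes as `max_leaf_nodes`. The candidate values and the synthetic data (a stand-in for one outer training set) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in for one outer training set (assumption).
X_train, y_train = make_classification(n_samples=150, n_features=5, random_state=0)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=123)

leaf_limits = [2, 4, 8, 16, 32]  # illustrative candidates
mean_val_accuracy = {}
actual_leaves = {}
for limit in leaf_limits:
    tree = DecisionTreeClassifier(max_leaf_nodes=limit, random_state=0)
    # Mean accuracy over the 3 validation folds (accuracy is the default score).
    mean_val_accuracy[limit] = cross_val_score(tree, X_train, y_train, cv=inner_cv).mean()
    # The limit is an upper bound: the fitted tree may end up with fewer leaves.
    actual_leaves[limit] = tree.fit(X_train, y_train).get_n_leaves()

best_limit = max(mean_val_accuracy, key=mean_val_accuracy.get)
print('mean validation accuracy per limit:', mean_val_accuracy)
print('best max_leaf_nodes:', best_limit)
```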

Calculate and report the test accuracies together with their corresponding hyper-parameter values. (No more comments needed here.)

# Train and test default classifier

The default classifier has no hyper-parameters, so you can skip the inner cross-validation procedure.

Calculate and report the test accuracies for the 5 test sets from the part *Data splits for model evaluation*.
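
A minimal sketch, assuming the default classifier is the one that always predicts the most frequent class of the training set. scikit-learn provides this as `DummyClassifier(strategy='most_frequent')`. The labels below are a made-up stand-in with 70% negatives.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold

# Stand-in data (assumption): the inputs are ignored by the default classifier.
X = np.zeros((100, 1))
y = np.array([0] * 70 + [1] * 30)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=123)

default_accuracies = []
for train_idx, test_idx in outer_cv.split(X):
    default_model = DummyClassifier(strategy='most_frequent')
    default_model.fit(X[train_idx], y[train_idx])  # learns the majority class only
    default_accuracies.append(default_model.score(X[test_idx], y[test_idx]))

print('test accuracy per fold:', default_accuracies)
print('mean test accuracy:', np.mean(default_accuracies))
```

The mean accuracy is the share of the majority class, which gives the baseline that the other two models should beat.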

# Compare models

Once you have all your test accuracies for the nearest neighbour, decision tree and default classifier, calculate the estimated generalization accuracy of each (the average accuracy across the test sets).
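
One way to lay the numbers out side by side is sketched here. It is an illustration under assumptions: the data are synthetic, and the hyper-parameters are fixed placeholders, whereas in the TP they come from the inner cross-validation. Re-using the same `KFold` object with a fixed seed gives every model the same test sets.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption, replaces the cars data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=123)

# Placeholder hyper-parameters, chosen only for illustration.
models = {
    'nearest neighbour': KNeighborsClassifier(n_neighbors=5),
    'decision tree': DecisionTreeClassifier(max_leaf_nodes=8, random_state=0),
    'default classifier': DummyClassifier(strategy='most_frequent'),
}

fold_scores = {
    name: cross_val_score(model, X, y, cv=outer_cv, scoring='accuracy')
    for name, model in models.items()
}

# One row per model, one column per test fold, plus the average.
summary = pd.DataFrame(fold_scores, index=[f'fold {i}' for i in range(1, 6)]).T
summary['mean'] = summary.mean(axis=1)
print(summary.round(3))
```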

Is any of the algorithms performing better than the other two? **Discuss, comment.**

## Use the McNemar test 

Use the McNemar test to verify whether the differences in the generalization accuracy are significant. 

McNemar's test always compares exactly two algorithms. Do all the pair-wise comparisons, then present and **explain** the results. Are they what you would expect?
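
A minimal sketch of one pair-wise comparison, assuming the chi-square form of McNemar's test with continuity correction (the course may use a different variant, such as the exact binomial form). The function `mcnemar_test` and the toy predictions are hypothetical. In the TP, the inputs would be the predictions of two models for all test instances pooled over the 5 test sets.

```python
import math
import numpy as np


def mcnemar_test(y_true, pred_a, pred_b):
    """Chi-square McNemar test with continuity correction for two classifiers."""
    y_true, pred_a, pred_b = (np.asarray(v) for v in (y_true, pred_a, pred_b))
    a_correct = pred_a == y_true
    b_correct = pred_b == y_true
    only_a = int(np.sum(a_correct & ~b_correct))  # A right, B wrong
    only_b = int(np.sum(~a_correct & b_correct))  # A wrong, B right
    if only_a + only_b == 0:
        return 0.0, 1.0  # the models never disagree on correctness
    statistic = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    # Survival function of chi-square with 1 degree of freedom: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Toy example: 20 instances whose true class is 0.
y_true = np.zeros(20, dtype=int)
pred_a = y_true.copy()
pred_a[[0, 1]] = 1            # model A is wrong on instances 0 and 1
pred_b = y_true.copy()
pred_b[1:13] = 1              # model B is wrong on instances 1 to 12

statistic, p_value = mcnemar_test(y_true, pred_a, pred_b)
print('statistic:', statistic)
print('significant at the 5% level:', p_value < 0.05)
```

Only the instances on which exactly one of the two models is correct enter the statistic, so two models with similar accuracy can still differ significantly if they err on different instances.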