# <center> TP3 - 01 Description

# Objective of TP

In this TP you will practice your skills as an independent and by now (fairly) experienced data analyst.  
Being able to work independently and to build on your knowledge and experience to adapt to new algorithms is an essential part of the work of a data analyst.
There is **no single best** algorithm that you could learn in a data analysis class and live with it for the rest of your life. 
Rather the opposite: there are many algorithms (tens or even hundreds), each with its own pros and cons. 
No class can cover them all. But a class such as ours can teach you the basic principles on which you should build to be able to use new algorithms that you have never even heard of before.

In this TP, you will work with two new algorithms:
* you will implement your own **Naive Bayes classifier** (*Course 11 - 01 Naive Bayes algoritm*)
* you will use the scikit-learn implementation of **logistic regression** (*Course 12 - 02 Logistic regression*), relying on the official documentation together with any other information you can find (Google) to understand how to use it

You will reuse your work from TP2 on the full **supervised learning pipeline** to train, pick and evaluate your models.

### Recommendation:
As always, the code you will develop in this TP is to be re-used later (in the exam).  
Therefore we recommend you try to make it clear (use comments, when printing say what you print) so that next time it is easier for you to remember what it does.  
Also, try to make the code generic so that it can be easily used for different datasets.   
Try to automate as much as possible so that the code does not require too much of your attention.

# Reusing TP2 

## Dataset

You will be working with the same cars dataset as in TP1 and TP2.  
Each group shall be using the same `brands` as in TP1 and TP2.


```python
# Load the dataset and extract our group's part
import pandas as pd

# Read the CSV file
autos = pd.read_csv('autos.csv', encoding='latin-1')

# Keep only the brands assigned to our group
only_specific_brands = autos.brand.isin(['renault', 'peugeot', 'skoda', 'citroen', 'ford'])
autos = autos[only_specific_brands]
autos.head()
```

| Unnamed: 0 | price | vehicleType | yearOfRegistration | gearbox | powerPS | model | kilometer | fuelType | brand | notRepairedDamage | fast_sale |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 11400.0 | limousine | 2010.0 | manuell | 175.0 | mondeo | 125000.0 | diesel | ford | nein | False |
| 4 | 4100.0 | kleinwagen | 2009.0 | manuell | 68.0 | 1_reihe | 90000.0 | benzin | peugeot | nein | False |
| 6 | 888.0 | kombi | 2000.0 | manuell | 115.0 | mondeo | 150000.0 | benzin | ford | nein | True |
| 7 | 13700.0 | bus | 2012.0 | manuell | 86.0 | roomster | 5000.0 | benzin | skoda | nein | True |
| 9 | 4299.0 | kleinwagen | 2010.0 | manuell | 75.0 | 2_reihe | 125000.0 | benzin | peugeot | nein | False |


## Data preprocessing

Remember that after loading the dataset, there are several preprocessing steps you need to do before training the algorithms.
You already did all the necessary pre-processing steps in TP2 so you can simply reuse them.   
**Important note:** While in practice the step *'check and clean your data'* is super important, for our class (this TP and exam) consider the data to be checked and clean already so you can skip it.

Remember to comment in your code the pre-processing steps you do (this is important for you or any other user of your code).

## Prepare for model evaluation and hyper-parameter tuning

### Data splits for model evaluation (training and testing)

You have already written code for this in TP2. In this TP and the exam we will make the evaluation procedure somewhat simpler. Because our datasets are generally rather big, we do not need to repeat the hold-out several times. Instead we will use a **single hold-out** method. That is, we will split the data into training and test (hold-out) datasets only once. As a result, we will train only one final model and evaluate its accuracy over a single test set.  
Remember that the **accuracy over the test data serves as an estimate of the generalization accuracy** and that there is a relation between the confidence we can have in our estimate and the number of samples we have in the test set. A reasonable split of train vs test instances is 2/3 vs 1/3.
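As a reminder of what a single hold-out split amounts to, here is a minimal sketch using only the Python standard library. The function name `holdout_split`, the seed and the toy size of 9 samples are illustrative assumptions, not part of the TP2 code; your own TP2 implementation may well look different.

```python
import random

def holdout_split(n_samples, test_fraction=1/3, seed=0):
    """Return (train_idx, test_idx): one random split of the row positions 0..n_samples-1."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    n_test = round(n_samples * test_fraction)   # size of the hold-out set
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

# Toy usage: 9 samples split 2/3 vs 1/3
train_idx, test_idx = holdout_split(n_samples=9, test_fraction=1/3, seed=42)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

With a pandas DataFrame, the two index lists could then be used as `autos.iloc[train_idx]` and `autos.iloc[test_idx]`.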

### Data splits for hyper-parameter tuning

Again, you have already written code for this in TP2 and we will re-use the same procedure (here and in the exam): **use 5-fold inner cross-validation** to discover the best values of the hyper-parameters.

Remember that once you find the best hyper-parameter values, you should re-train your model with this hyper-parameter value fixed over the whole training set.

You then evaluate this final model by comparing its predictions over the test set (hold-out set never used in training) to the true values and establishing the model accuracy.
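The procedure above can be sketched roughly as follows. This is only an outline under stated assumptions: `kfold_indices`, `select_best` and the callback `cv_score` are hypothetical names, and the indices are positions within the training set only (the test set is never touched here).

```python
import random

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]   # near-equal fold sizes
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx

def select_best(candidates, n_samples, cv_score, n_folds=5):
    """Return the candidate value with the highest mean validation score.

    cv_score(value, train_idx, val_idx) is a hypothetical callback: it should
    train a model with hyper-parameter `value` on train_idx and return its
    accuracy on val_idx.
    """
    splits = list(kfold_indices(n_samples, n_folds))   # same folds for every candidate
    mean_scores = {}
    for value in candidates:
        scores = [cv_score(value, tr, va) for tr, va in splits]
        mean_scores[value] = sum(scores) / len(scores)
    return max(mean_scores, key=mean_scores.get), mean_scores

# Toy usage with a made-up scoring function that peaks at value 0.1
def fake_score(value, train_idx, val_idx):
    return 1.0 - abs(value - 0.1)

best_value, mean_scores = select_best([0.01, 0.1, 1.0], n_samples=20, cv_score=fake_score)
print("best hyper-parameter value:", best_value)
```

After the search, the final model is trained once more on the whole training set with `best_value` fixed, and only then evaluated on the hold-out test set.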

### Generalization accuracy

To estimate the generalization accuracy you will need to use the test-set accuracy. In TP2 you have already written code that uses a model to make predictions and calculates the accuracy, so you only need to re-use it in this TP (and the exam).

## Train and test default classifier

The default classifier has no hyper-parameters, so you can skip the inner cross-validation procedure.

**Calculate and report the test accuracy for the default classifier**
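For orientation, a minimal sketch of this baseline, assuming that "default classifier" means the classifier that always predicts the most frequent class of the training set (as is common; check against your TP2 definition). The helper names and the toy labels are made up for illustration.

```python
from collections import Counter

def majority_class(y_train):
    """Most frequent label in the training set."""
    return Counter(y_train).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels standing in for the fast_sale column
y_train = [False, False, True, False, True, False]
y_test = [False, True, False]

default_label = majority_class(y_train)
y_pred = [default_label] * len(y_test)       # same prediction for every test instance
default_accuracy = accuracy(y_test, y_pred)
print(f"default classifier test accuracy: {default_accuracy:.3f}")
```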

# New in TP3

## Train and test Naive Bayes (NB) classifier

All of the above steps are just re-using your work from TP2. Here begins the real added value of TP3.

You will need to implement the NB classifier. This will show that you really understand how the method works. The NB classifier is based on basic probability rules such as conditional and joint probability that we have seen in the beginning of the course, practiced in TP1 and reviewed later.

### Implement the NB classifier

We discussed the Naive Bayes classifier in *Course 11 - 01 Naive Bayes algoritm* so you will need to review the lecture to be able to implement the algorithm. The outline of the implementation steps was at the end of that lecture.

### A few more hints:

At the **training** step of the NB algorithm, you use the training data to calculate
* the prior probabilities $P(c_i)$ for each output class $c_i, \, i=1,2$
* conditional probabilities $P(x_j \, | \, c_i)$ for all discrete attributes and each output class
 * **hint 1:** use the pseudo-counts explained in *Course 11 - 01 Naive Bayes algoritm*
 * **hint 2:** to be sure you have all possible values $x_j$ for all discrete attributes, get the possible unique values from the full dataset, not just the training set.  
 Note: if your dataset is big this should not matter. This is just to make sure that you do not have a value $x_j$ in test that you haven't seen in training and therefore haven't calculated $P(x_j \, | \, c_i)$ for it.
* conditional means and variances for all continuous attributes and each output class
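To illustrate hints 1 and 2 for a single discrete attribute, here is a small sketch. It is not a full classifier, and it assumes add-one pseudo-counts (Laplace smoothing); the exact pseudo-count formula from the course sheet may differ. The function name and the toy values (including `'lpg'`) are made up.

```python
from collections import Counter

def conditional_probs(values, labels, all_values, pseudo_count=1):
    """Estimate P(x = v | c) for one discrete attribute using pseudo-counts.

    all_values should list every possible value of the attribute (hint 2),
    so that values unseen in training still get a non-zero probability.
    """
    probs = {}
    for c in sorted(set(labels)):
        counts = Counter(v for v, y in zip(values, labels) if y == c)
        n_c = sum(counts.values())                       # training instances in class c
        denom = n_c + pseudo_count * len(all_values)
        probs[c] = {v: (counts[v] + pseudo_count) / denom for v in all_values}
    return probs

# Toy training data: one attribute (fuel type) and the class label
fuel = ['diesel', 'benzin', 'benzin', 'benzin']
label = [False, False, True, True]
all_fuels = ['benzin', 'diesel', 'lpg']   # 'lpg' never appears in the training data

probs = conditional_probs(fuel, label, all_fuels)
print(probs[True])
```

Each class's probabilities still sum to one, and the unseen value receives a small but non-zero probability.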

At the **prediction** step of the NB algorithm, for each instance you want to predict you need to calculate 
* the conditional probabilities $P(x_j \, | \, c_i)$ of all the continuous attributes and each output class (using the Normal distribution with the means and variances calculated over the training data above)
* the likelihood as the product $P(\mathbf{x} \, | \, c_i) = \prod_{j=1}^d P(x_j \, | \, c_i)$ across all attributes and for each output class $c_i, \, i=1,2$
* the simplified posterior $P(c_i \, | \, \mathbf{x}) \propto P(\mathbf{x} \, | \, c_i) P(c_i)$

Finally, for each instance individually you use the Bayes decision rule: pick the class $c_1$ or $c_2$ with the higher posterior probability (that is, check whether $P(c_1 \, | \, \mathbf{x})$ is higher or lower than $P(c_2 \, | \, \mathbf{x})$).
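The prediction step for one instance might be sketched as below. Two things here are assumptions rather than course material: the computation is done in log space (the logarithm is monotonic, so the decision is the same as with the product, but long products of small numbers no longer underflow), and all names and numbers are made up for illustration.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log of the Normal density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_posterior(prior, discrete_probs, continuous_terms):
    """log P(c) + sum_j log P(x_j | c), i.e. the log of the simplified posterior.

    discrete_probs:   P(x_j | c) for each discrete attribute of the instance
    continuous_terms: (x_j, mean, var) for each continuous attribute
    """
    total = math.log(prior)
    total += sum(math.log(p) for p in discrete_probs)
    total += sum(gaussian_log_pdf(x, m, v) for x, m, v in continuous_terms)
    return total

# Made-up numbers: one discrete and one continuous attribute, two classes
scores = {
    "c1": log_posterior(0.6, [0.3], [(80.0, 70.0, 400.0)]),
    "c2": log_posterior(0.4, [0.5], [(80.0, 150.0, 400.0)]),
}
predicted = max(scores, key=scores.get)   # Bayes decision rule
print("predicted class:", predicted)
```

Taking the logarithm of the discrete probabilities is safe only if none of them is zero, which is what the pseudo-counts are for.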


<font color=red>**Note:** The Naive Bayes classifier has no hyper-parameters to be selected, therefore you do not need to perform the inner cross-validation.  
In this respect the NB classifier is easy.  
You only need to do the train/test split and perform the train and prediction steps described above.</font>

**Calculate and report the test accuracy of the NB classifier.**

## Train and test logistic regression

We discussed the theory of logistic regression in the course *Course 12 - 02 Logistic regression*.

Implementing logistic regression from scratch can get somewhat tedious.
Therefore we recommend you use the existing implementation in **scikit-learn**,
[sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
You can use the official documentation or any other information you can find (Google) to make it work correctly.

The scikit-learn implementation of logistic regression performs the optimisation steps for you, therefore you do **not need to implement the gradient descent** procedure.
The general steps for using the logistic regression model in scikit-learn are the same as we used for decision trees and nearest neighbour and are described **at the end of *Course 12 - 02 Logistic regression***.

*Though implementing logistic regression from scratch is rather more demanding, you should in fact be able to do it based on the information provided in the course sheet. If someone wants to give it a try let us know and we will help you get started.*

**We want you to use:**
* $\ell_2$ regularization
* perform a hyper-parameter search over a grid $\lambda \in \{0.0001, 0.001, 0.01, 0.1, 1, 10, 100 \}$ using 5-fold inner cross-validation (you can change the grid if you wish to achieve better prediction accuracy; let us know if you decide to do this).
* train final model over the full training data using the best $\lambda$ (write in your file which value you pick as the best)
* **calculate and report the test accuracy of the final logistic regression model**
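One possible way to wire these steps together with scikit-learn is sketched below; it uses `GridSearchCV` and `train_test_split` for brevity, whereas you may prefer to plug the model into your own TP2 pipeline. Note the assumptions: scikit-learn parameterises the regularization through `C`, the inverse of the regularization strength, and the sketch assumes $C = 1/\lambda$ (the exact mapping depends on how the course sheet scales the penalty term). The data are synthetic stand-ins, not the cars dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (the real features come from your pre-processing)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Single hold-out split: 2/3 training, 1/3 test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

lambdas = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
param_grid = {"C": [1.0 / lam for lam in lambdas]}   # assumption: C = 1 / lambda

search = GridSearchCV(
    LogisticRegression(max_iter=1000),   # the l2 penalty is scikit-learn's default
    param_grid,
    cv=5,                                # 5-fold inner cross-validation
    scoring="accuracy",
)
search.fit(X_train, y_train)             # refits on the full training set by default

best_lambda = 1.0 / search.best_params_["C"]
test_accuracy = search.score(X_test, y_test)
print(f"best lambda: {best_lambda:g}, test accuracy: {test_accuracy:.3f}")
```

Because `refit=True` is the default, `search` already holds the final model trained over the full training data with the best value, so `search.score` gives the test accuracy of that final model.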

## Compare models

Once you have the test accuracies for the Naive Bayes, logistic regression and default classifier, calculate the confidence intervals for the generalization accuracy of each algorithm at the *95%* confidence level (*Course 10 - 01 Confidence intervals*).

This step is similar to the McNemar test. If the intervals of two algorithms overlap, you cannot conclude that one is better than the other (with the given confidence).
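A minimal sketch of the interval and the overlap check, assuming the usual normal-approximation interval $acc \pm z \sqrt{acc\,(1-acc)/n}$ with $z = 1.96$ for 95% confidence; please check it against the formula in the course sheet. The function names and the accuracies are made up.

```python
import math

def accuracy_confidence_interval(accuracy, n_test, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n_test samples."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n_test)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)

def intervals_overlap(a, b):
    """True if the closed intervals a = (low, high) and b = (low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Made-up test accuracies, each measured on 1000 test instances
ci_a = accuracy_confidence_interval(0.80, 1000)
ci_b = accuracy_confidence_interval(0.83, 1000)
ci_c = accuracy_confidence_interval(0.60, 1000)

print("A vs B overlap:", intervals_overlap(ci_a, ci_b))
print("A vs C overlap:", intervals_overlap(ci_a, ci_c))
```

The interval narrows as the test set grows, which is the relation between test-set size and confidence mentioned earlier.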

**Is any of the three algorithms clearly better than the other two based on the generalization accuracy confidence intervals?**


## Precision and recall

In *Course 10 - 02 Classification performance measures* we discussed alternative measures for the performance of an algorithm. **Calculate and report the precision and recall (over the test data) of all three algorithms of this TP.**
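As a reminder of the definitions, precision is $TP / (TP + FP)$ and recall is $TP / (TP + FN)$. The sketch below assumes that the positive class is `True` (for example `fast_sale == True`) and, as a convention chosen here, returns 0 when a denominator is zero; both choices are assumptions you may want to change.

```python
def precision_recall(y_true, y_pred, positive=True):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)   # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)   # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)   # false negatives
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Made-up labels and predictions
y_true = [True, True, False, False, True]
y_pred = [True, False, True, False, True]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision: {precision:.3f}, recall: {recall:.3f}")

# A classifier that never predicts the positive class
always_negative = precision_recall(y_true, [False] * len(y_true))
print("always-negative classifier:", always_negative)
```

The second call shows why accuracy alone can be misleading: a classifier that never predicts the positive class can have a decent accuracy on imbalanced data while its recall is zero.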

**Looking at these, does any of the algorithms look better or worse than the others? Why? Explain and discuss.** (There is no correct or wrong answer, we want to see that you understand the concepts.)

# IMPORTANT!

This TP is not easy. You cannot simply re-use the information given in the course sheets and copy-paste or slightly adapt bits of code we have given you. Instead we ask you to use your experience, inventiveness, ability to combine information to arrive at new solutions and other skills you have acquired over your bachelor studies and in this course to do the exercise. We believe we have given you sufficient information to be able to solve the problems on your own. 

Nevertheless, we are of course ready to help you. 

**A few rules for asking for help:**
* As we expect several questions of a similar nature, we ask you to use the **Foire aux questions** on Cyberlearn. **We will generally not answer questions sent by direct emails.**
* We will **not give you the code** for the Naive Bayes and the logistic regression, not in the next class, not by email upon later request.
* We will **not check your code** (complete or partial) before the final submission date. Before the submission deadline, we will not do the **debugging** for you and we will not check whether the code **performs all the steps** it should. We will, however, answer specific technical questions or general questions related to the correct procedure to follow through the **Foire aux questions**.
* If you feel you need more personalised help, **fix a meeting as soon as possible**, preferably before the break. Prepare your questions and try to be efficient and conscious of your and our time.

If you see a question in the **Foire aux questions** for which **you know the answer**, feel free to reply. :)


### <font color=red> Deadline: 6/1/2019 23:59:59 submit to frantzeska.lavda@hesge.ch</font>