# <center> TP3 - 01 Description

# Objective of TP

In this TP you will practice your skills as an independent and by now (fairly) experienced data analyst.  
Being able to work independently and to build on your knowledge and experience to adapt to new algorithms is an essential part of the work of a data analyst.
There is **no single best** algorithm that you could learn in a data analysis class and rely on for the rest of your life. 
Quite the opposite: there are many algorithms (tens or even hundreds), each with its pros and cons. 
No class can cover them all. But a class such as ours can teach you the basic principles you need to build on in order to use new algorithms that you have never even heard of before.

In this TP, you will work with two new algorithms:
* you will implement the **Naive Bayes classifier** on your own (*Course 11 - 01 Naive Bayes algorithm*)
* you will use the scikit-learn implementation of **logistic regression** (*Course 12 - 02 Logistic regression*) and use the official documentation together with any other information you can find (google) to understand how to use it

You will reuse your work from TP2 on the full **supervised learning pipeline** to train, pick and evaluate your models.

### Recommendation:
As always, the code you will develop in this TP is to be re-used later (in the exam).  
Therefore we recommend you try to make it clear (use comments, when printing say what you print) so that next time it is easier for you to remember what it does.  
Also, try to make the code generic so that it can be easily used for different datasets.   
Try to automate as much as possible so that the code does not require too much of your attention.

# Reusing TP2 

## Dataset

You will be working with the same cars dataset as in TP1 and TP2.  
Each group shall be using the same `brands` as in TP1 and TP2.


In [1]:
# Load dataset and extract our part
import pandas as pd

# Reading csv file
autos = pd.read_csv('autos.csv',encoding='latin-1')

# Extracting the relevant part for our group
only_specific_brands = autos.brand.isin(['renault', 'peugeot', 'skoda', 'citroen', 'ford'])
# .copy() gives an independent DataFrame, so later column assignments are unambiguous
autos = autos[only_specific_brands].copy()
autos.head()

Unnamed: 0,price,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,fuelType,brand,notRepairedDamage,fast_sale
2,11400.0,limousine,2010.0,manuell,175.0,mondeo,125000.0,diesel,ford,nein,False
4,4100.0,kleinwagen,2009.0,manuell,68.0,1_reihe,90000.0,benzin,peugeot,nein,False
6,888.0,kombi,2000.0,manuell,115.0,mondeo,150000.0,benzin,ford,nein,True
7,13700.0,bus,2012.0,manuell,86.0,roomster,5000.0,benzin,skoda,nein,True
9,4299.0,kleinwagen,2010.0,manuell,75.0,2_reihe,125000.0,benzin,peugeot,nein,False


## Data preprocessing

Remember that after loading the dataset, there are several preprocessing steps you need to do before training the algorithms.
You already did all the necessary pre-processing steps in TP2 so you can simply reuse them.   
**Important note:** While in practice the step *'check and clean your data'* is super important, for our class (this TP and exam) consider the data to be checked and clean already so you can skip it.

Remember to comment in your code the pre-processing steps you do (this is important for you or any other user of your code).

### First we are changing the column names:

- We changed the style of the column names from camelCase to snake_case. This change makes the column names easier to read, and we won't have to remember which letter is capitalized.
- We reworded some of the column names in order for them to be more descriptive and obvious.

In [2]:
# Changing the column names
autos.columns = ['price', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 
                 'kilometer', 'fuel_type', 'brand', 'unrepaired_damage', 'fast_sale']
autos.head()

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,fuel_type,brand,unrepaired_damage,fast_sale
2,11400.0,limousine,2010.0,manuell,175.0,mondeo,125000.0,diesel,ford,nein,False
4,4100.0,kleinwagen,2009.0,manuell,68.0,1_reihe,90000.0,benzin,peugeot,nein,False
6,888.0,kombi,2000.0,manuell,115.0,mondeo,150000.0,benzin,ford,nein,True
7,13700.0,bus,2012.0,manuell,86.0,roomster,5000.0,benzin,skoda,nein,True
9,4299.0,kleinwagen,2010.0,manuell,75.0,2_reihe,125000.0,benzin,peugeot,nein,False


### Next, we are going to convert some columns data type:

- We are changing the data types in order to facilitate the data processing (classifying the data as numeric or categorical)
- We are converting column 'unrepaired_damage' from object to boolean type
- We are converting column 'registration_year' from float to int.

In [3]:
# Converting column 'unrepaired_damage' from object to boolean type
# ('ja' -> True, 'nein' -> False; the data is assumed to contain only these two values)
autos['unrepaired_damage'] = (autos['unrepaired_damage'] == 'ja')
# Converting column 'registration_year' from float to int.
autos['registration_year'] = (autos['registration_year'].astype(int))
autos.info()
autos.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28401 entries, 2 to 166073
Data columns (total 11 columns):
price                28401 non-null float64
vehicle_type         28401 non-null object
registration_year    28401 non-null int64
gearbox              28401 non-null object
power_ps             28401 non-null float64
model                28401 non-null object
kilometer            28401 non-null float64
fuel_type            28401 non-null object
brand                28401 non-null object
unrepaired_damage    28401 non-null bool
fast_sale            28401 non-null bool
dtypes: bool(2), float64(3), int64(1), object(5)
memory usage: 2.2+ MB


Unnamed: 0,price,registration_year,power_ps,kilometer
count,28401.0,28401.0,28401.0,28401.0
mean,4177.515017,2003.965565,102.402979,121472.307313
std,4680.629533,5.87246,40.664873,39816.529262
min,1.0,1923.0,2.0,5000.0
25%,1199.0,2001.0,75.0,100000.0
50%,2500.0,2004.0,101.0,150000.0
75%,5400.0,2008.0,122.0,150000.0
max,73500.0,2016.0,952.0,150000.0


### Next, we are going to decide what is the output (target) and what is the input:
- This step depends on what we are trying to predict. Here we want to predict whether a car will be sold quickly (fast_sale = True). Therefore we need to select the 'fast_sale' data as output data. 
- The other attributes will be the input data (the data that we will use in order to predict whether a car will be sold quickly).

In [4]:
# output dataframe
out_autos = autos['fast_sale']
# input dataframe
in_autos = autos.drop(columns='fast_sale')  # every column except the target

print('Inputs')
display(in_autos.head())
print('Outputs')
display(out_autos.head())

Inputs


Unnamed: 0,price,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,fuel_type,brand,unrepaired_damage
2,11400.0,limousine,2010,manuell,175.0,mondeo,125000.0,diesel,ford,False
4,4100.0,kleinwagen,2009,manuell,68.0,1_reihe,90000.0,benzin,peugeot,False
6,888.0,kombi,2000,manuell,115.0,mondeo,150000.0,benzin,ford,False
7,13700.0,bus,2012,manuell,86.0,roomster,5000.0,benzin,skoda,False
9,4299.0,kleinwagen,2010,manuell,75.0,2_reihe,125000.0,benzin,peugeot,False


Outputs


2    False
4    False
6     True
7     True
9    False
Name: fast_sale, dtype: bool

### Next, we convert categorical to numeric (one-hot encoding)

- The algorithms we use (except the decision tree) cannot operate on label data directly. They require all input variables and output variables to be numeric. Therefore we need to convert categorical data to a numerical form.
- Since no ordinal relationship exists in the categorical variables in our dataset (and we don't want to allow the model to assume a natural ordering between categories), we use the one-hot encoding. 

In [5]:
print('Original input data')
print('-------------------')
display(in_autos.head())

# Create one-hot encoding
in_long = pd.get_dummies(in_autos)
print('\n Numerical input data')
print('-------------------')
display(in_long.head())

Original input data
-------------------


Unnamed: 0,price,vehicle_type,registration_year,gearbox,power_ps,model,kilometer,fuel_type,brand,unrepaired_damage
2,11400.0,limousine,2010,manuell,175.0,mondeo,125000.0,diesel,ford,False
4,4100.0,kleinwagen,2009,manuell,68.0,1_reihe,90000.0,benzin,peugeot,False
6,888.0,kombi,2000,manuell,115.0,mondeo,150000.0,benzin,ford,False
7,13700.0,bus,2012,manuell,86.0,roomster,5000.0,benzin,skoda,False
9,4299.0,kleinwagen,2010,manuell,75.0,2_reihe,125000.0,benzin,peugeot,False



 Numerical input data
-------------------


Unnamed: 0,price,registration_year,power_ps,kilometer,unrepaired_damage,vehicle_type_andere,vehicle_type_bus,vehicle_type_cabrio,vehicle_type_coupe,vehicle_type_kleinwagen,...,fuel_type_cng,fuel_type_diesel,fuel_type_elektro,fuel_type_hybrid,fuel_type_lpg,brand_citroen,brand_ford,brand_peugeot,brand_renault,brand_skoda
2,11400.0,2010,175.0,125000.0,False,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
4,4100.0,2009,68.0,90000.0,False,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
6,888.0,2000,115.0,150000.0,False,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,13700.0,2012,86.0,5000.0,False,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9,4299.0,2010,75.0,125000.0,False,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


### Next, we normalize (bring all variables to same scale [0,1])

-  We need to normalize the data so that all attributes contribute on a comparable scale. Otherwise attributes with a large numeric range (such as `price` or `kilometer`) would dominate distance-based methods such as nearest neighbours and would be affected unevenly by regularization.

In [6]:
# maximum of each attribute
maxs = in_long.max(axis=0) # max of each column, taken across rows (axis=0)

print('Maximums')
print('----------')
display(maxs.head())

# minimum of each attribute
mins = in_long.min(axis=0) # min of each column, taken across rows (axis=0)

print('\n Minimums')
print('----------')
display(mins.head())

# min-max normalization
norm_in = (in_long - mins) / (maxs-mins)

# new maximum of each attribute
maxs = norm_in.max(axis=0) # max of each column, taken across rows (axis=0)

print('\n New maximums')
print('----------')
display(maxs.head())

# new minimum of each attribute
mins = norm_in.min(axis=0) # min of each column, taken across rows (axis=0)

print('\n New minimums')
print('----------')
display(mins.head())

Maximums
----------


price                 73500.0
registration_year      2016.0
power_ps                952.0
kilometer            150000.0
unrepaired_damage         1.0
dtype: float64


 Minimums
----------


price                   1.0
registration_year    1923.0
power_ps                2.0
kilometer            5000.0
unrepaired_damage       0.0
dtype: float64


 New maximums
----------


price                1.0
registration_year    1.0
power_ps             1.0
kilometer            1.0
unrepaired_damage    1.0
dtype: float64


 New minimums
----------


price                0.0
registration_year    0.0
power_ps             0.0
kilometer            0.0
unrepaired_damage    0.0
dtype: float64

### From this step on, use only the preprocessed data and no longer the original data

In our case above: `out_autos` and `norm_in`

## Prepare for model evaluation and hyper-parameter tuning

### Data splits for model evaluation (training and testing)

You have already written code for this in TP2. In this TP and the exam we will make the evaluation procedure somewhat simpler. Because our datasets are generally rather big, we do not need to repeat the hold-out several times. Instead we will use a **single hold-out** method. That is, we will split the data into training and test (hold-out) datasets only once. As a result, we will train only one final model and evaluate the model accuracy over a single test set only.  
Remember that the **accuracy over the test data serves as an estimate of the generalization accuracy** and that there is a relation between the confidence we can have in our estimate and the number of samples we have in the test set. A reasonable split of train vs test instances is 2/3 vs 1/3.

In [7]:
from sklearn.model_selection import train_test_split
import random


out_train, out_test, in_train, in_test = train_test_split(out_autos, norm_in, 
                                                          train_size=2/3, 
                                                          test_size=1/3, 
                                                          shuffle=True, 
                                                          random_state=random.randint(1, 10000))

print("train",in_train.shape)
print("test",in_test.shape)

train (18934, 67)
test (9467, 67)


### Data splits for hyper-parameter tuning

Again, you have already written code for this in TP2 and we will reuse the same procedure (here and in the exam): **use 5-fold inner cross-validation** to find the best values of the hyper-parameters.

Remember that once you find the best hyper-parameter values, you should retrain your model over the whole training set with these values fixed.

You then evaluate this final model by comparing its predictions over the test set (hold-out set never used in training) to the true values and establishing the model accuracy.

In [8]:
class DefaultClassifier:
    
    def __init__(self):
        self.majority_class = None
        
    def fit(self, in_train, out_train):
        # get counts per class
        class_counts = out_train.value_counts()
        self.majority_class = class_counts.idxmax()
        
    def predict(self, out_test):
        # predict
        pred_test = out_test.copy()
        pred_test[:] = self.majority_class
        return pred_test

In [9]:
from enum import Enum
from sklearn import neighbors, tree, linear_model, naive_bayes
    
    
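class ClassifierCallbacks(Enum):
    """A class containing all the callbacks for the Classifiers.

    Each callback takes a single hyper-parameter value and returns a new,
    untrained classifier. The callbacks are used as plain functions
    (e.g. ClassifierCallbacks.K_NEIGHBORS(5)), so none of them takes `self`.
    """
    
    def clf_neighbors(hyper_param):
        return neighbors.KNeighborsClassifier(n_neighbors=hyper_param)

    def clf_tree(hyper_param):
        return tree.DecisionTreeClassifier(max_leaf_nodes=hyper_param)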
    
    def clf_default(hyper_param):
        return DefaultClassifier()
    
    def clf_naive_bayes(hyper_param): #need to pass zero
        return naive_bayes.GaussianNB()
    
    def clf_log_regression(hyper_param):
        return linear_model.LogisticRegression(C=hyper_param, random_state=123, solver="liblinear")
    
    K_NEIGHBORS = clf_neighbors
    DECISION_TREE = clf_tree
    DEFAULT = clf_default
    NAIVE_BAYES = clf_naive_bayes
    LOGISTIC_REGRESSION = clf_log_regression

In [10]:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn import neighbors, tree, linear_model
import operator


class HyperParameterTuner:
    
    RANDOM_STATE = 123
    
    def __init__(self, in_data, out_data, hyper_params, 
                 clf_callback, n_split_outer=5, n_split_inner=3):
        self._in_data = in_data
        self._out_data = out_data
        self._hyper_params = hyper_params
        self.n_split_outer = n_split_outer
        self.n_split_inner = n_split_inner
        self._clf_callback = clf_callback
              
    def run(self):
        self._pre_run_init()
        self.cross_validation()
        self.calculate_avg_of_hyper_params()
        
    def _pre_run_init(self):
        # Dictionary holding the cross-validation (cv) scores.
        self._cv_scores = self.dict_filled_with_zeros_for_each(self._hyper_params)
        self._best_hyper_param_by_fold = dict()
    
    # We need this function to initiate the cross-validation scores dict with zeros
    def dict_filled_with_zeros_for_each(self, keys):
        _dict = dict()
        for key in keys:
            _dict[key] = 0.0
        return _dict   
            
    def cross_validation(self):
        k_fold = KFold(n_splits=self.n_split_outer, random_state=self.RANDOM_STATE, shuffle=True)
        for train_idx, test_idx in k_fold.split(self._in_data, self._out_data):
            in_train = self._in_data.iloc[train_idx]
            out_train = self._out_data.iloc[train_idx]
            out_test = self._out_data.iloc[test_idx]
            self.inner_cross_validation(in_train, out_train)
            self._print_best_hyper_param_by_fold()
    
    def inner_cross_validation(self, in_train, out_train):
        for hp in self._hyper_params:
            clf = self._clf_callback(hp)
            scores = cross_val_score(clf, in_train, out_train, cv=self.n_split_inner)
            self._cv_scores[hp] += scores.mean() 
            self._best_hyper_param_by_fold[hp] = scores.mean() 

    def _print_best_hyper_param_by_fold(self):
        best_hp = max(self._best_hyper_param_by_fold.items(), key=operator.itemgetter(1))[0]
        # f-string instead of %d, which would truncate fractional values such as 0.001 to 0
        print(f"The optimal hyper-parameter is {best_hp}")
                
    def calculate_avg_of_hyper_params(self):
        for hyper_param, accuracy in self._cv_scores.items():
            self._cv_scores[hyper_param] = accuracy / self.n_split_outer
    
    def get_best_hyper_param(self):
        # Get the hyper-param with the highest accuracy in scores dict.
        return max(self._cv_scores.items(), key=operator.itemgetter(1))[0]
        
    def get_hyper_params_avg_scores(self):
        return self._cv_scores

### Generalization accuracy

To estimate the generalization accuracy you will need to use the test-set accuracy. In TP2 you have already written code that uses a model to make predictions and calculates the accuracy, so you only need to reuse it in this TP (and the exam).

In [11]:
class ModelEvaluator:
    
    RANDOM_STATE = 123
    
    def __init__(self, in_data, out_data, hyper_param, clf_callback, n_splits=5):
        self.in_data=in_data
        self.out_data=out_data
        self.hyper_param=hyper_param
        self.clf_callback=clf_callback
        self.n_splits=n_splits
        self.out_test=list()
        self.pred_test=list()
    
    def run(self):
        # Use the learned model f to do predictions
        k_fold = KFold(n_splits=self.n_splits, random_state=self.RANDOM_STATE, shuffle=True)

        avg_test_accuracy = 0
        
        print("Test accuracies for each test set")
        print("---------------------------------")
        for train_idx, test_idx in k_fold.split(self.in_data, self.out_data):
            in_train = self.in_data.iloc[train_idx]
            out_train = self.out_data.iloc[train_idx]
            in_test = self.in_data.iloc[test_idx]
            self.out_test = self.out_data.iloc[test_idx]

            clf = self.clf_callback(self.hyper_param)
            clf.fit(in_train, out_train)

            # Prediction for all test data
            if isinstance(clf, DefaultClassifier):
                self.pred_test = clf.predict(self.out_test)
            else:
                self.pred_test = clf.predict(in_test)

            # Get test set accuracy        
            test_accuracy = self.calculate_pred_accuracy(self.pred_test, self.out_test)
            print('Test accuracy', test_accuracy)
            avg_test_accuracy += test_accuracy

        # Use test sets to get estimate of generalization error (accuracy)
        print()
        print(f'The test average accuracy is {avg_test_accuracy/self.n_splits}')
        return (avg_test_accuracy/self.n_splits)
 
    def get_test_data(self):
        return (self.out_test, self.pred_test)
    
    def calculate_pred_accuracy(self, pred_data, out_data):
        match_test = (pred_data == out_data)
        accuracy = match_test.sum() / match_test.count()
        return accuracy

## Train and test default classifier

The default classifier has no hyper-parameters, so you can skip the inner cross-validation procedure.

**Calculate and report the test accuracy for the default classifier**

In [48]:
dc_me = ModelEvaluator(
    norm_in, 
    out_autos, 
    hyper_param=0,
    clf_callback=ClassifierCallbacks.DEFAULT
)

dc_generalization_accuracy = dc_me.run()
# get_test_data() returns (true outputs, predictions), in that order
dc_out_test, dc_pred_test = dc_me.get_test_data()

Test accuracies for each test set
---------------------------------
Test accuracy 0.5118817109663791
Test accuracy 0.5121478873239437
Test accuracy 0.5184859154929577
Test accuracy 0.5126760563380282
Test accuracy 0.5235915492957747

The test average accuracy is 0.5157566238834167


# New in TP3

## Train and test Naive Bayes (NB) classifier

All of the above steps are just re-using your work from TP2. Here begins the real added value of TP3.

You will need to implement the NB classifier. This will show that you really understand how the method works. The NB classifier is based on basic probability rules such as conditional and joint probability that we have seen in the beginning of the course, practiced in TP1 and reviewed later.

### Implement the NB classifier

We discussed the Naive Bayes classifier in *Course 11 - 01 Naive Bayes algorithm* so you will need to review the lecture to be able to implement the algorithm. The outline of the implementation steps was at the end of that lecture.

### A few more hints:

At the **training** step of the NB algorithm you use the training data to calculate
* the prior probabilities $P(c_i)$ for each output class $c_i, \, i=1,2$
* conditional probabilities $P(x_j \, | \, c_i)$ for all discrete attributes and each output class
 * **hint 1:** use the pseudo-counts explained in *Course 11 - 01 Naive Bayes algorithm*
 * **hint 2:** to be sure you have all possible values $x_j$ for all discrete attributes, get the possible unique values from the full dataset, not just the training set.  
 Note: if your dataset is big this should not matter. This is just to make sure that you do not have a value $x_j$ in test that you haven't seen in training and therefore haven't calculated $P(x_j \, | \, c_i)$ for it.
* conditional means and variances for all continuous attributes and each output class

At the **prediction** step of the NB algorithm, for each instance you want to predict you need to calculate 
* the conditional probabilities $P(x_j \, | \, c_i)$ of all the continuous attributes and each output class (using the Normal distribution with the means and variances calculated over the training data above)
* the likelihood as the product $P(\mathbf{x} \, | \, c_i) = \prod_{j=1}^d P(x_j \, | \, c_i)$ across all attributes and for each output class $c_i, \, i=1,2$
* the simplified posterior $P(c_i \, | \, \mathbf{x}) \propto P(\mathbf{x} \, | \, c_i) P(c_i)$

Finally, for each instance individually you use the Bayes decision rule: pick the class $c_1$ or $c_2$ which has higher posterior probability (is $P(c_1 \, | \, \mathbf{x})$ higher or smaller than $P(c_2 \, | \, \mathbf{x})$)
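
The training and prediction steps above can be sketched on a toy example. This is only a minimal sketch, not the course's reference solution: the names `fit_nb`, `predict_nb` and `log_gaussian` and the toy data are hypothetical, the pseudo-count is assumed to be add-one (Laplace) smoothing with `alpha=1`, and the variance is assumed to be the population variance. The sketch also sums logarithms instead of multiplying probabilities, which picks the same class while avoiding numerical underflow when there are many attributes.

```python
import math
from collections import Counter


def fit_nb(rows, labels, discrete_values, alpha=1.0):
    """Training step. rows is a list of dicts (attribute -> value).

    discrete_values maps each discrete attribute to the set of ALL its
    possible values; every other attribute is treated as continuous.
    alpha is the pseudo-count (assumption: add-one smoothing).
    """
    model = {"priors": {}, "discrete": {}, "gaussian": {}}
    for c in sorted(set(labels)):
        rows_c = [r for r, y in zip(rows, labels) if y == c]
        model["priors"][c] = len(rows_c) / len(labels)          # P(c_i)
        for attr in rows[0]:
            if attr in discrete_values:
                counts = Counter(r[attr] for r in rows_c)
                k = len(discrete_values[attr])
                # P(x_j | c_i) with pseudo-counts, for every possible value
                model["discrete"][(attr, c)] = {
                    v: (counts[v] + alpha) / (len(rows_c) + alpha * k)
                    for v in discrete_values[attr]
                }
            else:
                xs = [r[attr] for r in rows_c]
                mean = sum(xs) / len(xs)
                var = sum((x - mean) ** 2 for x in xs) / len(xs)
                # small floor on the variance to avoid division by zero
                model["gaussian"][(attr, c)] = (mean, max(var, 1e-9))
    return model


def log_gaussian(x, mean, var):
    """Log of the Normal density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_nb(model, row):
    """Prediction step: class with the highest (log) simplified posterior."""
    scores = {}
    for c, prior in model["priors"].items():
        score = math.log(prior)                                  # log P(c_i)
        for attr, x in row.items():
            if (attr, c) in model["discrete"]:
                score += math.log(model["discrete"][(attr, c)][x])
            else:
                mean, var = model["gaussian"][(attr, c)]
                score += log_gaussian(x, mean, var)
        scores[c] = score
    return max(scores, key=scores.get)


# Toy data (hypothetical): one discrete and one continuous attribute
rows = [
    {"gearbox": "manuell", "price": 1000.0},
    {"gearbox": "manuell", "price": 1200.0},
    {"gearbox": "automatik", "price": 9000.0},
    {"gearbox": "automatik", "price": 11000.0},
]
labels = [True, True, False, False]
discrete_values = {"gearbox": {"manuell", "automatik"}}

model = fit_nb(rows, labels, discrete_values)
prediction = predict_nb(model, {"gearbox": "manuell", "price": 1100.0})
print("Predicted fast_sale:", prediction)
```

Note that thanks to the pseudo-counts, `automatik` still gets a non-zero probability in the class where it was never observed.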


<font color=red>**Note:** The Naive Bayes classifier has no hyper-parameters to be selected, therefore you do not need to perform the inner cross-validation.  
In this respect the NB classifier is easy.  
You only need to do the train/test split and perform the train and prediction steps described above.</font>

**Calculate and report the test accuracy of the NB classifier.**

In [13]:
#############################################################################
########### TODO ########### TODO ########### TODO ########### TODO #########
# It must be adapted because this classifier doesn't need to make a loop.
nb_clf_me = ModelEvaluator(
    norm_in, 
    out_autos, 
    hyper_param=0,
    clf_callback=ClassifierCallbacks.NAIVE_BAYES
)

nb_clf_generalization_accuracy = nb_clf_me.run() 
# get_test_data() returns (true outputs, predictions), in that order
nb_out_test, nb_pred_test = nb_clf_me.get_test_data()
#############################################################################

Test accuracies for each test set
---------------------------------
Test accuracy 0.5222672064777328
Test accuracy 0.5327464788732394
Test accuracy 0.5369718309859155
Test accuracy 0.5299295774647887
Test accuracy 0.5387323943661971

The test average accuracy is 0.5321294976335746


## Train and test logistic regression

We discussed the theory of logistic regression in the course *Course 12 - 02 Logistic regression*.

Implementing logistic regression from scratch can get somewhat tedious.
Therefore we recommend you use an existing implementation in **scikit-learn**:
[sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
You can use the official documentation or any other information you can find (google) to make it work correctly.

The scikit-learn implementation of logistic regression performs the optimisation steps for you, therefore you do **not need to implement the gradient descent** procedure.
The general steps for using the logistic regression model in scikit-learn are the same as we used for decision trees and nearest neighbours and are described **at the end of *Course 12 - 02 Logistic regression***.

*Though implementing logistic regression from scratch is rather more demanding, you should in fact be able to do it based on the information provided in the course sheet. If someone wants to give it a try let us know and we will help you get started.*

**We want you to use:**
* $\ell_2$ regularization
* perform a hyper-parameter search over the grid $\lambda \in \{0.0001, 0.001, 0.01, 0.1, 1, 10, 100 \}$ using 5-fold inner cross-validation (you can change the grid if you wish to achieve better prediction accuracy; let us know if you decide to do this). 
* train the final model over the full training data using the best $\lambda$ (write in your file which value you pick as the best)
* **calculate and report the test accuracy of the final logistic regression model**

In [14]:
logr_hpt = HyperParameterTuner(norm_in,
                              out_autos, 
                              hyper_params=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100], 
                              clf_callback=ClassifierCallbacks.LOGISTIC_REGRESSION)
logr_hpt.run()

print()
print("Average score for each hyper-parameter")
print("--------------------------------------")
display(logr_hpt.get_hyper_params_avg_scores())

best_hp_logr = logr_hpt.get_best_hyper_param()
print(f'The most optimal hyper-parameter is {best_hp_logr}')

The optimal hyper-parameter is 100
The optimal hyper-parameter is 100
The optimal hyper-parameter is 100
The optimal hyper-parameter is 100
The optimal hyper-parameter is 100

Average score for each hyper-parameter
--------------------------------------


{0.0001: 0.516407866559321,
 0.001: 0.5394087450937013,
 0.01: 0.552938365968489,
 0.1: 0.5616264429408085,
 1: 0.568351502968935,
 10: 0.5702088709985631,
 100: 0.5706225820873685}

The most optimal hyper-parameter is 100


In [26]:
logr_clf_me = ModelEvaluator(
    norm_in, 
    out_autos, 
    hyper_param=best_hp_logr,
    clf_callback=ClassifierCallbacks.LOGISTIC_REGRESSION
)

logr_clf_generalization_accuracy = logr_clf_me.run()
# get_test_data() returns (true outputs, predictions), in that order
logr_out_test, logr_pred_test = logr_clf_me.get_test_data()

Test accuracies for each test set
---------------------------------
Test accuracy 0.5645132899137476
Test accuracy 0.571830985915493
Test accuracy 0.5866197183098592
Test accuracy 0.577112676056338
Test accuracy 0.5653169014084507

The test average accuracy is 0.5730787143207776


## Compare models

Once you have the test accuracies for the Naive Bayes, logistic regression and default classifier, calculate the confidence intervals for the generalization accuracy of each of the algorithms at the *95%* confidence level (*Course 10 - 01 Confidence intervals*).

This step is similar to the McNemar test. If the intervals of two algorithms overlap, you cannot conclude that one is better than the other (with the given confidence).

**Is any of the three algorithms clearly better than the other two based on the generalization accuracy confidence intervals?**


In [72]:
# Normal-approximation confidence interval for an accuracy measured on n test instances.
# Note: z must match confidence_level (z=1.96 corresponds to the 95% level).
def confidence_interval(test_accuracy, test_size, z=1.96, confidence_level=0.95):
    accuracy = test_accuracy
    n = test_size
    variance = accuracy * (1 - accuracy) / n
    standard_deviation = variance**0.5
    conf_int_left = accuracy - z * standard_deviation
    conf_int_right = accuracy + z * standard_deviation
    print(f'P({conf_int_left} <= {accuracy} <= {conf_int_right}) = {confidence_level}')

print('Naive Bayes')
confidence_interval(nb_clf_generalization_accuracy, len(nb_pred_test)) 
print('Logistic regression')
confidence_interval(logr_clf_generalization_accuracy, len(logr_pred_test)) 
print('Default classifier')
confidence_interval(dc_generalization_accuracy, len(dc_pred_test)) 

Naive Bayes
P(0.5191531221699747 <= 0.5321294976335746 <= 0.5451058730971746) = 0.95
Logistic regression
P(0.560215101887784 <= 0.5730787143207776 <= 0.5859423267537712) = 0.95
Default classifier
P(0.5027598322873742 <= 0.5157566238834167 <= 0.5287534154794592) = 0.95
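
To answer the comparison question, the overlap rule can be checked mechanically. The following is a small sketch using the interval bounds printed above; the helper name `intervals_overlap` is hypothetical.

```python
from itertools import combinations

# 95% confidence intervals (lower, upper) copied from the output above
intervals = {
    "Naive Bayes": (0.5191531221699747, 0.5451058730971746),
    "Logistic regression": (0.560215101887784, 0.5859423267537712),
    "Default classifier": (0.5027598322873742, 0.5287534154794592),
}


def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


# Compare every pair of algorithms
for (name_a, ci_a), (name_b, ci_b) in combinations(intervals.items(), 2):
    verdict = "overlap" if intervals_overlap(ci_a, ci_b) else "do not overlap"
    print(f"{name_a} vs {name_b}: intervals {verdict}")
```

With these numbers, the logistic regression interval lies entirely above the other two, while the Naive Bayes and default classifier intervals overlap, so those two cannot be separated at this confidence level.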


## Precision and recall

In *Course 10 - 02 Classification performance measures* we discussed alternative measures for the performance of an algorithm. **Calculate and report the precision and recall (over the test data) of all three algorithms of this TP.**

**Looking at these, does any of the algorithms look better/worse than the others? Why? Explain, discuss.** (There is no correct or wrong answer, we want to see that you understand the concepts.)
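
As a reminder of the definitions, precision is TP / (TP + FP) and recall is TP / (TP + FN), where the positive class is taken here to be `fast_sale = True`. The sketch below computes both from true and predicted labels; the function name `precision_recall` and the toy labels are hypothetical, and returning `nan` when a denominator is zero is one possible convention, not necessarily the one used in the course.

```python
import math


def precision_recall(true_labels, predicted_labels, positive=True):
    """Return (precision, recall) for the given positive class."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1          # true positive
        elif pred == positive:
            fp += 1          # false positive
        elif truth == positive:
            fn += 1          # false negative
    # Undefined ratios (no positive predictions / no positive instances) -> nan
    precision = tp / (tp + fp) if (tp + fp) > 0 else math.nan
    recall = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    return precision, recall


# Toy labels (hypothetical)
y_true = [True, True, False, False, True, False]
y_pred = [True, False, True, True, True, False]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")

# A classifier that always predicts the negative class
always_negative = [False] * len(y_true)
print(precision_recall(y_true, always_negative))
```

The last call illustrates why accuracy alone can be misleading: a classifier that never predicts the positive class has a recall of 0 and an undefined precision, even if its accuracy is around 50%. This is the situation of the default classifier whenever the majority class is the negative one.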

# IMPORTANT!

This TP is not easy. You cannot simply re-use the information given in the course sheets and copy-paste or slightly adapt bits of code we have given you. Instead we ask you to use your experience, inventiveness, ability to combine information to arrive at new solutions, and the other skills you have acquired over your bachelor studies and in this course to do the exercise. We believe we have given you sufficient information to be able to solve the problems on your own. 

Nevertheless, we are of course ready to help you. 

**A few rules for asking for help:**
* As we expect several questions of a similar nature, we ask you to use the **Foire aux questions** on Cyberlearn. **We will generally not answer questions sent by direct email.**
* We will **not give you the code** for the Naive Bayes and the logistic regression, not in the next class, not by email upon later request.
* We will **not check your code** (complete or partial) before the final submission date. Before the submission deadline, we will not do the **debugging** for you and we will not check whether the code **performs all the steps** it should. We will, however, answer specific technical questions or general questions related to the correct procedure to follow through the **Foire aux questions**.
* If you feel you need more personalised help, **fix a meeting as soon as possible**, preferably before the break. Prepare your questions and try to be efficient and conscious of your time and ours.

If you see a question in the **Foire aux questions** for which **you know the answer**, feel free to reply. :)


### <font color=red> Deadline: 6/1/2019 23:59:59 submit to frantzeska.lavda@hesge.ch</font>