# Description

This exercise will help you understand the basic principles of predictive algorithms (classifiers), as well as how to train and compare them.

You will also learn how to deal with discrete attributes: how to prepare and binarize them.

We use the same dataset as before (**adult census**, available here: https://archive.ics.uci.edu/ml/datasets/adult).

## Your task

Your task is to fill in the parts of the notebook marked as

> ...

with your code. Sometimes you need to write everything yourself, sometimes you just fill in the blanks :)

Many new functions and techniques are introduced in this section, so consult the sklearn documentation often:

http://scikit-learn.org/stable/user_guide.html

## Dataset description

This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: `((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))`.

The prediction task is to determine **whether a person makes over $50K a year.**

# Libraries import

In [53]:
import pandas as pd
import numpy as np

import sklearn.preprocessing as preproc
import sklearn.model_selection as modsel

# Data reading

Read the dataset using pandas.

In [19]:
data = ...

In [20]:
data.head(10)

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | earnings |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|----------|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |


Most algorithms cannot handle text-based (discrete) attributes, so these need to be binarized.

Binarization builds, for each discrete attribute, a vector whose size equals the number of unique values of that attribute. For each row, a flag of **1** is placed in the position matching the row's value, and **0** everywhere else.

Example - original table:

| id | Occupation |
|----|------------|
| 1  | developer  |
| 2  | developer  |
| 3  | admin      |
| 4  | researcher |


Binarized version:


| id | Occupation developer | Occupation admin | Occupation researcher |
|----|----------------------|------------------|-----------------------|
| 1  | 1                    | 0                | 0                     |
| 2  | 1                    | 0                | 0                     |
| 3  | 0                    | 1                | 0                     |
| 4  | 0                    | 0                | 1                     |
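
The toy example above can be reproduced with `pd.get_dummies`. This is only a minimal sketch on the four-row table, not on the census data; the names `toy` and `toy_binarized` are made up for illustration. Note that pandas names the new columns `Occupation_<value>` and orders them alphabetically, so the column order differs from the table above.

```python
import pandas as pd

# Toy table from the example above (not the census data).
toy = pd.DataFrame(
    {"Occupation": ["developer", "developer", "admin", "researcher"]},
    index=pd.Index([1, 2, 3, 4], name="id"),
)

# One indicator column per unique value.
# dtype=int asks for 0/1 flags; recent pandas versions return booleans by default.
toy_binarized = pd.get_dummies(toy, columns=["Occupation"], dtype=int)
print(toy_binarized)
```

Each row has exactly one flag set, because every row holds exactly one occupation.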

We will do the following steps:

1. Copy the original data - always good practice, unless you are working with data so big that a copy does not fit into memory
2. Manually binarize the target category "earnings > 50K"
3. Divide the data into X and Y - independent and dependent variables
4. Use the built-in binarization procedure

In [81]:
# STEP 1
data_binarized = data.copy()

# STEP 2
# Turn the target (predicted) variable into a binary one manually - it will be easier

data_binarized.earnings = data_binarized.earnings == " >50K"
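
The comparison string `" >50K"` above starts with a space, which suggests that the raw text values are padded with a leading space (the dummy column names such as `workclass_ ?` point the same way). The sketch below uses a made-up four-element series, not the real column, to show what happens if the space is forgotten, and one possible way to make the comparison independent of padding.

```python
import pandas as pd

# Hypothetical mini-column mimicking raw labels that carry a leading space.
earnings = pd.Series([" <=50K", " >50K", " <=50K", " >50K"])

# Comparing without the space raises no error, it just yields all False.
naive = earnings == ">50K"

# Stripping whitespace first makes the comparison independent of padding.
robust = earnings.str.strip() == ">50K"

print(naive.sum(), robust.sum())
```

A silent all-`False` target is easy to miss, so it is worth checking the number of positive labels after this step.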

In [82]:
# STEP 3
# Divide the dataset into X and Y - independent and dependent variables:

X, Y = data_binarized.drop('earnings', axis=1), data_binarized.earnings

In [83]:
# STEP 4
X = pd.get_dummies(X)

In [84]:
X.head(3)

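|   | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | ... | native-country_ Portugal | native-country_ Puerto-Rico | native-country_ Scotland | native-country_ South | native-country_ Taiwan | native-country_ Thailand | native-country_ Trinadad&Tobago | native-country_ United-States | native-country_ Vietnam | native-country_ Yugoslavia |
|---|-----|--------|---------------|--------------|--------------|----------------|--------------|------------------------|----------------------|-------------------------|-----|--------------------------|-----------------------------|--------------------------|-----------------------|------------------------|--------------------------|---------------------------------|-------------------------------|-------------------------|----------------------------|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |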


# Holdout train-test splitting

The basic technique is to divide the dataset into a training part and a testing part. This is called the **holdout** validation procedure. It is very simple, but it will help you understand how different algorithms can be tested.

Keep 20% of your data for testing.
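
As an illustration only, here is a minimal holdout split on made-up toy arrays using `train_test_split` from `sklearn.model_selection`. The names `X_toy`, `y_toy` and the chosen `random_state` are assumptions for this sketch, and `stratify` is an optional extra that keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, 15 negatives and 5 positives (not the census data).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

# test_size is the fraction held out for testing;
# random_state makes the shuffle repeatable;
# stratify keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
```

With 20 samples and `test_size=0.2`, four samples are held out and sixteen remain for training.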

In [85]:
X_train, X_test, y_train, y_test = ...

Check dimensionality of your data.

In [86]:
print("train x shape: ", X_train.shape)
print("test x shape: ", X_test.shape)

print("train y shape: ", y_train.shape)
print("test y shape: ", y_test.shape)

train x shape:  (24420, 108)
test x shape:  (8141, 108)
train y shape:  (24420,)
test y shape:  (8141,)


## Algorithms testing

In this part, you will try different classification algorithms to check which one performs best.

Save the results to a common dataframe and compare them at the end.

In [87]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, f1_score

In [88]:
algorithms = [
    ("dec tree", DecisionTreeClassifier()),
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier())    
]


results = pd.DataFrame(np.zeros((len(algorithms), 2)), index=[name for name, _ in algorithms], columns=['accuracy', 'f1'])

    
for name, algo in algorithms:
    algo. ... # train algorithm here
    prediction = algo. ... # predict on test set here
    accuracy = ... # check algorithm accuracy here
    f1 = ... # check algorithm f1 here
    results.loc[name, "accuracy"] = accuracy
    results.loc[name, "f1"] = f1

Check the results and discuss the following questions:

* Which algorithm performed best? According to which metric?
* How can you explain discrepancy between f1 and accuracy? 
* Why is it so?
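
As a starting point for the discussion, the sketch below computes both metrics on made-up labels (not on the census predictions). It assumes an imbalanced target where the positive class is the minority, and a hypothetical classifier that finds only one of the two positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up ground truth: 8 negatives and 2 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Hypothetical predictions: every negative is right, one positive is missed.
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

# Accuracy: share of all labels predicted correctly (9 of 10).
acc = accuracy_score(y_true, y_pred)

# F1: harmonic mean of precision (1/1) and recall (1/2) for the positive class.
f1 = f1_score(y_true, y_pred)

print(acc, f1)
```

Accuracy counts the many easy negatives, while F1 only looks at how well the positive class is handled, so the two numbers can drift far apart.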

In [89]:
results

|          | accuracy | f1       |
|----------|----------|----------|
| dec tree | 0.812677 | 0.621494 |
| knn      | 0.774352 | 0.405694 |
| rf       | 0.847562 | 0.652673 |


# Cross-validation testing

A single split is not reliable. The reason is simple - you might get an 'unlucky' split that is not representative of the whole dataset.

Therefore the cross-validation procedure is preferred. As a reminder, it consists of several train-test splits with disjoint test parts (folds); the algorithm is trained and evaluated on each of them, and the scores are then averaged.

This gives a better approximation of an algorithm's behavior.

![Crossvalidation](https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg)
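
One possible way to run such a procedure is `cross_val_score` from `sklearn.model_selection`. The sketch below is only an illustration: it uses synthetic stand-in data from `make_classification` rather than the census set, and the names `X_toy`, `y_toy` as well as the choice of 5 folds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch, not the census set).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# cv=5 -> five folds; each sample lands in the test part exactly once.
# The classifier is re-fitted from scratch for every fold.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_toy, y_toy, cv=5, scoring="accuracy"
)

# The mean summarizes performance, the spread shows how much it depends on the split.
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean makes it visible when two algorithms differ by less than the split-to-split noise.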

In [90]:
#TODO: fill this