<a href="https://colab.research.google.com/github/yandexdataschool/MLatImperial2020/blob/master/02_lab/lab02_Data_preprocessing_and_knn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

## Metrics

Many metrics can be used to measure your algorithm's quality. In this exercise we will use ROC AUC, which is simply the area under the ROC curve (AUC stands for "area under curve"). The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold varies.
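
To make the definition concrete, here is a small sketch on made-up labels and scores (the numbers are illustrative only, not taken from the dataset). It builds the ROC curve with `roc_curve` and checks that the area under it matches `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

# Toy data (made up for illustration): true labels and predicted scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold on the score gives one (FPR, TPR) point of the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve, computed in two equivalent ways
auc_from_curve = auc(fpr, tpr)
auc_direct = roc_auc_score(y_true, y_score)
print(auc_from_curve, auc_direct)
```

Equivalently, ROC AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score; here 3 of the 4 pairs are ordered correctly.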

A very nice interactive demonstration of ROC curves can be found [here](http://arogozhnikov.github.io/2015/10/05/roc-curve.html).

Here is an example in sklearn [click](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py)

# Data

In [0]:
!wget https://github.com/yandexdataschool/MLatImperial2020/raw/master/02_lab/data.adult.csv

In [0]:
data = pd.read_csv('data.adult.csv')
data.shape

In [0]:
data.head(5)

### Task 1
- Find all features that have missing values. In this dataset, missing values are encoded as "?".
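
As a hint, here is a minimal sketch of one possible approach on a made-up toy frame (the `toy` frame and its values are hypothetical, chosen only for illustration). Comparing the whole frame with the placeholder string gives a boolean mask that can be summed per column.

```python
import pandas as pd

# Hypothetical toy frame; "?" marks a missing value, as in the task
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["?", "Sales", "?"],
})

# Count "?" entries per column, then keep the columns with at least one
missing_counts = (toy == "?").sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_missing)
```

Another option is to pass `na_values="?"` to `pd.read_csv`, so that the placeholders become `NaN` and `isna()` can be used instead.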

In [0]:
<YOUR CODE>

- Select the target variable (salary), remove it from the dataset and convert it to binary format.
- Select the real-valued features (the `DataFrame`'s `select_dtypes` method might help you here).
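
The sketch below illustrates these steps on a hypothetical toy frame. The label strings `"<=50K"` and `">50K"` are an assumption about how salary is written; check the actual values in `data` before reusing the idea.

```python
import pandas as pd

# Hypothetical toy frame; the salary labels are assumed, not read from the CSV
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad"],
    "salary": ["<=50K", ">50K", "<=50K", ">50K"],
})

# Binary target: 1 for the higher salary bracket, 0 otherwise
toy_target = (toy["salary"] == ">50K").astype(int)

# Features are everything except the target column
toy_features = toy.drop(columns="salary")

# Split the features by dtype
toy_numeric = toy_features.select_dtypes(include="number")
toy_string = toy_features.select_dtypes(exclude="number")
print(list(toy_numeric.columns), list(toy_string.columns))
```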

In [0]:
target = <YOUR CODE>
features_all = <YOUR CODE> # (data without the target column)
features_numeric = <YOUR CODE>
features_string = <YOUR CODE>

## Now we are going to train a KNN classifier

Recall that KNN is a metric-based algorithm. It calculates distances between points in the feature space and assigns each point the label chosen by a vote of its nearest neighbours. Variants that weight the votes by distance can also be used.
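
To illustrate the voting idea, here is a from-scratch sketch in NumPy. The helper `knn_predict` and the toy points are hypothetical and only meant to show the mechanics; this is not how sklearn implements the classifier internally.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Count votes per class label and return the most frequent one
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Two small, well-separated clusters (made-up data)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_toy = np.array([0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.2, 0.1]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([1.0, 0.9]), k=3)
print(pred_a, pred_b)
```

In sklearn, distance-weighted voting is available through the `weights="distance"` argument of `KNeighborsClassifier`.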

In [0]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [0]:
# A function to plot score vs param value

def plotting(X, Y, Error, model="KNN"):
    plt.figure(figsize=[12,5])
    plt.title("ROC_AUC score: " + model)
    plt.xlabel('param_value')
    plt.ylabel('ROC_AUC score')
    # 'o-' sets marker and line style only; the colour is given once via color=
    plt.plot(X, Y, 'o-', color='b', ms=5, label="ROC_AUC score")
    plt.fill_between(X, Y - 1.96 * Error, Y + 1.96 * Error, facecolor='g', alpha=0.6, label="95% confidence interval")
    plt.legend()
    plt.grid()
    plt.show()

Now, your task is to find the optimal number of neighbours using cross-validation with GridSearchCV (see the import above).

Use the function `plotting` defined above, where `X` holds the searched values of the number of neighbours,
`Y` is the mean test score and `Error` is its standard deviation across folds. All of this can be extracted from a fitted instance of the GridSearchCV class.
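
For orientation, here is a hedged sketch of the same workflow on synthetic data. The names `X_toy`, `y_toy`, `toy_grid` and `toy_search` are hypothetical, and the grid values are arbitrary rather than a recommendation for the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (not the adult dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

# Dictionary mapping the parameter name to the values to try
toy_grid = {"n_neighbors": [1, 5, 15, 30]}

toy_search = GridSearchCV(KNeighborsClassifier(), toy_grid, scoring="roc_auc", cv=5)
toy_search.fit(X_toy, y_toy)

# cv_results_ is a dict of arrays with one entry per grid point
results = toy_search.cv_results_
ks = np.array([p["n_neighbors"] for p in results["params"]])
mean_scores = results["mean_test_score"]  # mean ROC AUC over the 5 folds
std_scores = results["std_test_score"]    # standard deviation over the 5 folds
print(toy_search.best_params_)
```

The three arrays `ks`, `mean_scores` and `std_scores` have the shapes that the `plotting` function expects for `X`, `Y` and `Error`.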

In [0]:
base_model = KNeighborsClassifier(n_jobs=-1)

param_grid = <YOUR CODE> # GridSearchCV expects you to provide a dictionary
                         # mapping parameter names to arrays of values you
                         # want to do the search over. Here we'll do the search
                         # over only one parameter of KNeighborsClassifier - 'n_neighbors'

gscv = GridSearchCV(<YOUR CODE>) # Check the documentation to fill
                                 # in the GridSearchCV constructor call.
                                 # Use 'roc_auc' for the scoring function and
                                 # 5-fold cross-validation. Also, adding
                                 # n_jobs=-1 should make it run faster.


# Call the fit method to run the grid search
gscv.fit(features_numeric, target)

In [0]:
plotting(<YOUR CODE>) # Explore the contents of gscv.cv_results_ after fitting
                      # to find the resulting scores and their errors

What do you think? What could be causing such poor results?

#### Plot histograms for age, fnlwgt, capital-gain. What do you observe?

In [0]:
<YOUR CODE>

We now scale the data using the built-in `StandardScaler`, which standardises each feature to zero mean and unit variance.
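
As a quick check of what standardisation does, the sketch below compares `StandardScaler` with the manual formula `(x - mean) / std` on a made-up array (the values are illustrative only). The scaler uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
X_toy = np.array([[25.0, 100000.0],
                  [40.0, 250000.0],
                  [60.0, 50000.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Manual standardisation, column by column
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, both columns contribute comparably to Euclidean distances, which matters for a distance-based method such as KNN.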

In [0]:
from sklearn.preprocessing import StandardScaler
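# Build a new float DataFrame; assigning floats into an integer-typed copy via [:] can clash with the original dtypes
scaled_values = StandardScaler().fit_transform(features_numeric)
features_numeric_scaled = pd.DataFrame(scaled_values, columns=features_numeric.columns, index=features_numeric.index)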

Try running the grid search again, now on the scaled features.

In [0]:
<YOUR CODE>

## Remember we had categorical features? Let's try to use them with the best `n_neighbors` we have found. First we need to encode them by creating a one-hot representation.
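
As a small illustration of one-hot encoding, here is `pd.get_dummies` applied to a hypothetical two-column frame (the values are made up). Each distinct category becomes its own indicator column, named `<column>_<value>`.

```python
import pandas as pd

# Hypothetical categorical features
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "sex": ["Male", "Female", "Female"],
})

one_hot = pd.get_dummies(toy)
print(list(one_hot.columns))
print(one_hot.shape)
```

Every row has exactly one active indicator per original column, so the number of columns grows with the number of distinct categories.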

In [0]:
features_one_hot = pd.get_dummies(features_string)
print(features_one_hot.shape)
features_one_hot.head(5)

In [0]:
features_merged = pd.concat([features_one_hot, features_numeric_scaled], axis=1)

features_merged.head(3)

In [0]:
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=52, metric='minkowski', n_jobs=-1)

print("Result: {}".format(
    np.mean(cross_val_score(knn, features_merged, target, scoring='roc_auc', n_jobs=-1, cv=5))
))

In [0]:
%%html
<img src=http://dogr.io/wow/suchresult/sooneenhot.png width="300">