# The task

In this assignment you will implement features based on nearest neighbors.

A KNN classifier (or regressor) is a very powerful model when the features are homogeneous, and it is very common practice to use KNN as a first-level model. In this homework we will extend the KNN model and compute more features based on nearest neighbors and their distances.
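
To make "nearest neighbors and their distances" concrete, here is a minimal sketch that finds neighbors by brute-force Euclidean distance in plain NumPy. The helper name `k_nearest` and the toy data are illustrative assumptions; the assignment itself relies on `sklearn.neighbors.NearestNeighbors`.

```python
import numpy as np

def k_nearest(train, query, k):
    """Return (distances, indices) of the k training rows closest to `query`.

    Hypothetical helper for illustration: brute-force Euclidean search.
    """
    dists = np.sqrt(((train - query) ** 2).sum(axis=1))
    order = np.argsort(dists, kind="stable")[:k]
    return dists[order], order

# Toy data (assumed): four training points with binary labels
train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
labels = np.array([0, 1, 1, 0])

dists, idx = k_nearest(train, np.array([0.0, 0.0]), k=3)
print(idx)          # indices of the 3 nearest training rows
print(dists)        # their distances, sorted ascending
print(labels[idx])  # labels of those neighbors, the raw material for KNN features
```

Every feature in this assignment is some statistic of these three vectors: neighbor indices, neighbor distances, and neighbor labels.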

You will need to implement a number of features that were among the key features that led the instructors to prizes in the [Otto](https://www.kaggle.com/c/otto-group-product-classification-challenge) and [Springleaf](https://www.kaggle.com/c/springleaf-marketing-response) competitions. Of course, the list of features you will implement can be extended; in fact, in the competitions the list was at least 3 times larger. So when solving a real competition, do not hesitate to make up your own features.

You can optionally implement multicore feature computation.

# Check your versions

In [1]:
import numpy as np
import pandas as pd 
import sklearn
import scipy.sparse 

for p in [np, pd, sklearn, scipy]:
    print (p.__name__, p.__version__)

numpy 1.13.1
pandas 0.20.3
sklearn 0.19.0
scipy 0.19.1


The versions should be at least:

    numpy 1.13.1
    pandas 0.20.3
    sklearn 0.19.0
    scipy 0.19.1
   
**IMPORTANT!** The results with `scipy==1.0.0` will be different! Make sure you use _exactly_ version `0.19.1`.
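
One way to check these requirements programmatically is sketched below. The helpers `version_tuple` and `check_version` are hypothetical and only handle plain numeric versions such as `0.19.1`; suffixes like `rc1` or `.dev0` would need a proper version parser.

```python
def version_tuple(version):
    """Turn a plain numeric version string such as '0.19.1' into (0, 19, 1)."""
    return tuple(int(part) for part in version.split("."))

def check_version(installed, required, exact=False):
    """Return True if `installed` equals `required` (exact) or is at least `required`."""
    if exact:
        return version_tuple(installed) == version_tuple(required)
    return version_tuple(installed) >= version_tuple(required)

# "At least" check, as for numpy, pandas and sklearn
print(check_version("1.13.1", "1.13.1"))
# Exact check, as required for scipy
print(check_version("1.0.0", "0.19.1", exact=True))
```

In the notebook you would pass the real strings, for example `check_version(scipy.__version__, "0.19.1", exact=True)`.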

# Load data

Load features and labels. These features are actually out-of-fold (OOF) predictions of linear models.

In [2]:
train_path = '../readonly/KNN_features_data/X.npz'
train_labels = '../readonly/KNN_features_data/Y.npy'

test_path = '../readonly/KNN_features_data/X_test.npz'
test_labels = '../readonly/KNN_features_data/Y_test.npy'

# Train data
X = scipy.sparse.load_npz(train_path)
Y = np.load(train_labels)

# Test data
X_test = scipy.sparse.load_npz(test_path)
Y_test = np.load(test_labels)

# Out-of-fold features we loaded above were generated with n_splits=4 and skf seed 123
# So it is better to use seed 123 for generating KNN features as well 
skf_seed = 123
n_splits = 4

Below you need to implement features based on nearest neighbors.

In [3]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import NearestNeighbors
from multiprocessing import Pool

import numpy as np


class NearestNeighborsFeats(BaseEstimator, ClassifierMixin):
    def __init__(self, n_jobs, k_list, metric, n_classes=None, n_neighbors=None, eps=1e-6):
        self.n_jobs = n_jobs
        self.k_list = k_list
        self.metric = metric
        
        if n_neighbors is None:
            self.n_neighbors = max(k_list) 
        else:
            self.n_neighbors = n_neighbors
            
        self.eps = eps        
        self.n_classes_ = n_classes
    
    def fit(self, X, y):
        self.NN = NearestNeighbors(n_neighbors=max(self.k_list), 
                                      metric=self.metric, 
                                      n_jobs=1, 
                                      algorithm='brute' if self.metric=='cosine' else 'auto')
        self.NN.fit(X)
        
        # Store labels 
        self.y_train = y
        
        # Save number of classes
        self.n_classes = np.unique(y).shape[0] if self.n_classes_ is None else self.n_classes_

        # Return the fitted estimator, following the scikit-learn convention
        return self
        
        
    def predict(self, X):   
        #  Produces KNN features for every object of a dataset X
        if self.n_jobs == 1:
            test_feats = []
            for i in range(X.shape[0]):
                test_feats.append(self.get_features_for_one(X[i:i+1]))
        else:
            with Pool(self.n_jobs) as p:
                test_feats = p.map(self.get_features_for_one, [X[i:i+1] for i in range(X.shape[0])])
        return np.vstack(test_feats)
        
        
    def get_features_for_one(self, x):
        # Compute KNN features for a single object
        NN_output = self.NN.kneighbors(x)
        
        # Vector of size `n_neighbors`
        # Stores indices of the neighbors
        neighs = NN_output[1][0]
        
        # Vector of size `n_neighbors`
        # Stores distances to corresponding neighbors
        neighs_dist = NN_output[0][0] 

        # Vector of size `n_neighbors`
        # Stores labels of corresponding neighbors
        neighs_y = self.y_train[neighs] 

        return_list = [] 
                
        ''' 
            1. Fraction of objects of every class.
               These are basically the predictions of a KNN classifier.
        '''
        for k in self.k_list:
            feats = np.bincount(neighs_y[:k], minlength=self.n_classes)/k
            
            assert len(feats) == self.n_classes
            return_list += [feats]
        
        '''
            2. Same label streak: the largest number N, 
               such that N nearest neighbors have the same label.
        '''
        res = np.where(neighs_y==neighs_y[0], 1, 0)
        if np.argmax(res==0) == 0:
            feats = [sum(res)]
        else:
            feats = [np.argmax(res==0)]
        
        assert len(feats) == 1
        return_list += [feats]
        
        '''
            3. Minimum distance to objects of each class
        '''
        feats = []
        for c in range(self.n_classes):
            idx = np.where(neighs_y==c)
            if len(idx[0]) == 0:
                res = 999
            else:
                res = neighs_dist[idx[0][0]]
            feats.append(res)
        
        assert len(feats) == self.n_classes
        return_list += [feats]
        
        '''
            4. Minimum *normalized* distance to objects of each class
        '''
        feats = []
        for c in range(self.n_classes):
            idx = np.where(neighs_y==c)
            if len(idx[0]) == 0:
                res = 999
            else:
                res = neighs_dist[idx[0][0]]
                res = res/(neighs_dist[0]+self.eps)
            feats.append(res)
        
        assert len(feats) == self.n_classes
        return_list += [feats]
        
        '''
            5. 
               5.1 Distance to Kth neighbor
                   Like quantiles of a distribution
               5.2 Distance to Kth neighbor normalized by 
                   distance to the first neighbor
        '''
        for k in self.k_list:
            feat_51 = neighs_dist[k-1]
            feat_52 = neighs_dist[k-1]/(neighs_dist[0]+self.eps)
            return_list += [[feat_51, feat_52]]
        
        '''
            6. Mean distance to neighbors of each class for each K from `k_list` 
        '''
        for k in self.k_list:
            
            total_dist = np.bincount(neighs_y[:k], weights=neighs_dist[:k], minlength=self.n_classes)
            total_occur = np.bincount(neighs_y[:k], minlength=self.n_classes)
            
            feats = np.array([d/(o+self.eps) if o != 0 else 999 for d, o in zip(total_dist, total_occur)])
            
            assert len(feats) == self.n_classes
            return_list += [feats]
        
        # merge
        knn_feats = np.hstack(return_list)
        
        assert knn_feats.shape == (239,) or knn_feats.shape == (239, 1)
        return knn_feats
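
To see what some of these features look like on concrete numbers, here is a small worked sketch for sections 1, 2 and 6. The neighbor labels and distances are made-up toy values, and for readability the mean distance omits the `eps` term that the class above adds to the denominator.

```python
import numpy as np

# Toy inputs (assumed): labels and distances of 5 neighbors, sorted by distance
neighs_y = np.array([2, 2, 0, 2, 1])
neighs_dist = np.array([0.1, 0.2, 0.4, 0.5, 0.9])
n_classes, k = 3, 4

# Section 1: fraction of each class among the k nearest neighbors
fractions = np.bincount(neighs_y[:k], minlength=n_classes) / k

# Section 2: length of the initial run of neighbors sharing the first label
differs = neighs_y != neighs_y[0]
streak = int(np.argmax(differs)) if differs.any() else len(neighs_y)

# Section 6: mean distance to the neighbors of each class (999 if class absent)
total = np.bincount(neighs_y[:k], weights=neighs_dist[:k], minlength=n_classes)
count = np.bincount(neighs_y[:k], minlength=n_classes)
mean_dist = np.where(count > 0, total / np.maximum(count, 1), 999.0)

print(fractions)  # class 0: 0.25, class 1: 0.0, class 2: 0.75
print(streak)     # 2, because the third neighbor breaks the run of label 2
print(mean_dist)  # class 0: 0.4, class 1: 999 (absent), class 2: 0.8 / 3
```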

## Sanity check

To make sure you've implemented everything correctly, we provide the correct features for the first 50 test objects.

In [4]:
# List of K values (numbers of neighbors) to compute features for
k_list = [3, 8, 32]

# Load correct features
true_knn_feats_first50 = np.load('../readonly/KNN_features_data/knn_feats_test_first50.npy')

# Create instance of our KNN feature extractor
NNF = NearestNeighborsFeats(n_jobs=4, k_list=k_list, metric='minkowski')

# Fit on train set
NNF.fit(X, Y)

# Get features for test
test_knn_feats = NNF.predict(X_test[:50])

# This should be zero
print ('Deviation from ground truth features: %f' % np.abs(test_knn_feats - true_knn_feats_first50).sum())

deviation = np.abs(test_knn_feats - true_knn_feats_first50).sum(0)
for m in np.where(deviation > 1e-3)[0]: 
    p = np.where(np.array([87, 88, 117, 146, 152, 239]) > m)[0][0]
    print ('There is a problem in feature %d, which is a part of section %d.' % (m, p + 1))

predicting
Deviation from ground truth features: 0.000000
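
The boundaries `[87, 88, 117, 146, 152, 239]` used in the check above follow from the sizes of the six feature sections. The sketch below reconstructs them; note that `n_classes = 29` is an assumption inferred from the total of 239 features, since the number of classes is not stated explicitly in the text.

```python
import numpy as np

k_list = [3, 8, 32]
n_classes = 29  # assumption: inferred from the 239-feature total, not stated in the text

section_sizes = [
    len(k_list) * n_classes,  # 1. class fractions for every k
    1,                        # 2. same-label streak
    n_classes,                # 3. minimum distance to each class
    n_classes,                # 4. normalized minimum distance to each class
    2 * len(k_list),          # 5. distance to k-th neighbor, raw and normalized
    len(k_list) * n_classes,  # 6. mean distance to each class for every k
]

# Cumulative sums give the exclusive end index of each section
boundaries = np.cumsum(section_sizes)
print(boundaries.tolist())
```

A feature index `m` belongs to the first section whose boundary is greater than `m`, which is what the `np.where(... > m)[0][0]` lookup computes.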


Now implement parallel computations and compute features for the train and test sets. 
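
The class above sends one row per task to the pool. One possible variation, which may reduce inter-process overhead, is to hand each worker a chunk of rows instead. The sketch below is a hypothetical helper, not part of the assignment: `predict_in_chunks` and `row_stats` are made-up names, and the default `mapper` is the serial built-in `map` so the example runs anywhere.

```python
import numpy as np

def predict_in_chunks(feature_fn, X, n_chunks, mapper=map):
    """Apply `feature_fn` to row-chunks of X and stack the results.

    Hypothetical helper: `mapper` defaults to the built-in serial `map`;
    passing `pool.map` from a `multiprocessing.Pool` would parallelize it,
    provided `feature_fn` is picklable.
    """
    row_ids = np.arange(X.shape[0])
    # Drop empty chunks, which appear when n_chunks exceeds the number of rows
    chunks = [c for c in np.array_split(row_ids, n_chunks) if len(c)]
    results = mapper(feature_fn, [X[c] for c in chunks])
    return np.vstack(list(results))

def row_stats(block):
    """Stand-in feature function: per-row sum and max of a chunk."""
    return np.column_stack([block.sum(axis=1), block.max(axis=1)])

X_demo = np.arange(12, dtype=float).reshape(6, 2)
feats = predict_in_chunks(row_stats, X_demo, n_chunks=4)
print(feats.shape)  # one output row per input row
```

Because chunks are processed in order and stacked, the output rows stay aligned with the input rows, just as with the per-row version.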

## Get features for test

Now compute features for the whole test set.

In [None]:
for metric in ['minkowski', 'cosine']:
    print (metric)
    
    # Create instance of our KNN feature extractor
    NNF = NearestNeighborsFeats(n_jobs=4, k_list=k_list, metric=metric)
    
    # Fit on train set
    NNF.fit(X, Y)

    # Get features for test
    test_knn_feats = NNF.predict(X_test)
    
    # Dump the features to disk
    np.save('knn_feats_%s_test.npy' % metric , test_knn_feats)

minkowski
cosine


## Get features for train

Compute features for the train set using an out-of-fold strategy.

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import StratifiedKFold

# We will use two metrics for KNN
for metric in ['minkowski', 'cosine']:
    print (metric)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=skf_seed)
    NNF = NearestNeighborsFeats(n_jobs=8, k_list=k_list, metric=metric)
    preds = cross_val_predict(NNF, X, Y, cv=skf)
    np.save('knn_feats_%s_train.npy' % metric, preds)

minkowski
predicting
predicting
predicting
predicting
cosine
predicting
predicting
predicting


# Submit

Run the following cells to submit.

In [4]:
s = 0
for metric in ['minkowski', 'cosine']:
    knn_feats_train = np.load('knn_feats_%s_train.npy' % metric)
    knn_feats_test = np.load('knn_feats_%s_test.npy' % metric)

    s += knn_feats_train.mean() + knn_feats_test.mean()
    
answer = np.floor(s)
print (answer)

3838.0


Submit!

In [6]:
from grader import Grader
grader = Grader()

grader.submit_tag('statistic', answer)

STUDENT_EMAIL = 'dimistsaousis@gmail.com'
STUDENT_TOKEN = 'OtTMmp51v3J1ot34'
grader.status()

grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Current answer for task statistic is: 3838.0
You want to submit these numbers:
Task statistic: 3838.0
Submitted to Coursera platform. See results on assignment page!
