# Neighbors 

Here we look at neighbors algorithms for classification and regression. These algorithms make predictions by measuring the distance between data points. They do not scale well to medium or large amounts of data: to predict a new point we have to compute its distance to the stored training points, so prediction becomes computationally very expensive as the dataset grows.
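To make the cost concrete, here is a minimal pure-Python sketch of a brute-force neighbor search. It is not the scikit-learn implementation: the function name `brute_force_neighbors` and the toy points are made up for illustration. The sizes 426 and 143 are the train and test sizes of the Breast Cancer split used later in this notebook.

```python
import math

def brute_force_neighbors(train_points, query, k):
    """Return the indices of the k training points closest to `query`.

    One distance is computed per training point, so a single query
    costs len(train_points) distance evaluations.
    """
    distances = [(math.dist(point, query), idx)
                 for idx, point in enumerate(train_points)]
    distances.sort()  # closest first; ties are broken by index
    return [idx for _, idx in distances[:k]]

# Hypothetical toy data: four 2-D points.
train_points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
nearest = brute_force_neighbors(train_points, query=(0.9, 1.2), k=2)
print(nearest)

# Predicting m new points against n stored points needs n * m distances.
n_train, n_queries = 426, 143
print(n_train * n_queries)
```

The number of distance evaluations grows with the product of both sizes, which is why brute force becomes slow on larger datasets.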

For this purpose we are going to use the well-known Boston and Breast Cancer datasets included in sklearn.

## 1. Load Libraries

In [114]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston, load_breast_cancer
from sklearn.metrics import f1_score, confusion_matrix, r2_score, mean_squared_error, classification_report, mean_absolute_error
from sklearn.preprocessing import StandardScaler

## 2. Load and Preprocess the Datasets for the Regression and Classification Problems
Here we are going to load two datasets: Boston and Breast Cancer.

#### Boston
It is a dataset for predicting house prices in Boston from features such as the average number of rooms, the crime rate, and the location. The target is the median home value, expressed in thousands of dollars.

#### Breast Cancer
This dataset contains measurements of breast tumors, and we have to predict whether each tumor is cancer (malignant) or not (benign).

In [213]:
# One scaler per dataset; each is fitted on its training split only
scaler_cancer = StandardScaler()
scaler_boston = StandardScaler()
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
dataset_boston = load_boston()
dataset_breast_cancer = load_breast_cancer()

# Join the features and the target of each dataset into a single DataFrame
df_boston = pd.DataFrame(data=dataset_boston['data'], columns=dataset_boston['feature_names'])
df_boston_target = pd.DataFrame(dataset_boston['target'], columns=['target'])
df_boston = pd.merge(df_boston, df_boston_target, left_index=True, right_index=True)

df_breast_cancer = pd.DataFrame(data=dataset_breast_cancer['data'], columns=dataset_breast_cancer['feature_names'])
df_breast_target = pd.DataFrame(dataset_breast_cancer['target'], columns=['target'])
df_breast_cancer = pd.merge(df_breast_cancer, df_breast_target, left_index=True, right_index=True)

# train_test_split returns [train, test]; by default 25% of the rows go to test
train_test_boston = train_test_split(df_boston)
train_test_cancer = train_test_split(df_breast_cancer)

# Boston: fit the scaler on the training features, then reuse it on the test features
boston_X_train = train_test_boston[0].drop(columns='target')
boston_X_train_scale = pd.DataFrame(scaler_boston.fit_transform(boston_X_train), columns=boston_X_train.columns)
boston_y_train = train_test_boston[0]['target']
boston_X_test = train_test_boston[1].drop(columns='target')
boston_X_test_scale = pd.DataFrame(scaler_boston.transform(boston_X_test), columns=boston_X_test.columns)
boston_y_test = train_test_boston[1]['target']

# Breast Cancer: same steps with its own scaler
cancer_X_train = train_test_cancer[0].drop(columns='target')
cancer_X_train_scale = pd.DataFrame(scaler_cancer.fit_transform(cancer_X_train), columns=cancer_X_train.columns)
cancer_y_train = train_test_cancer[0]['target']
cancer_X_test = train_test_cancer[1].drop(columns='target')
cancer_X_test_scale = pd.DataFrame(scaler_cancer.transform(cancer_X_test), columns=cancer_X_test.columns)
cancer_y_test = train_test_cancer[1]['target']

## 3. Classification
The way this algorithm works is very simple. When a new data point arrives, the algorithm looks at the N closest points around it and holds a vote, in this case on whether a patient has breast cancer or not. If most of the points around the new point are labeled as cancer, then cancer wins and the patient is classified as a patient with cancer. If most of the surrounding points are labeled as no cancer, the new point is classified as no cancer.
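The vote described above can be sketched in a few lines of pure Python. This is only an illustration under simple assumptions (Euclidean distance, one vote per neighbor); the function `knn_classify`, the points, and the labels are hypothetical and are not taken from the Breast Cancer dataset.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Sort the training indices by distance to the query, closest first.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Collect the labels of the k closest points and count the votes.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data with two measurements per patient.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9)]
train_labels = ["no cancer", "no cancer", "no cancer", "cancer", "cancer"]

print(knn_classify(train_points, train_labels, query=(1.1, 1.0), k=3))
print(knn_classify(train_points, train_labels, query=(3.8, 4.0), k=3))
```

The first query sits among the "no cancer" points and the second one next to the two "cancer" points, so each takes the label of the majority of its three neighbors.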

In contrast with tree-based models, here we need to standardize the variables to achieve better results: distances are dominated by the features with the largest scale, so unscaled features would distort the vote.
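A small sketch of why scaling matters, assuming two made-up features on very different scales. The helper names `standardize` and `nearest_index` are hypothetical. The scaling mirrors what `StandardScaler` does in this notebook: the mean and standard deviation come from the training rows only and are then reused for new rows.

```python
import math
from statistics import mean, pstdev

def standardize(train_rows, rows):
    """Scale `rows` column by column with the mean and std of `train_rows`."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population standard deviation
    return [tuple((value - m) / s for value, m, s in zip(row, means, stds))
            for row in rows]

def nearest_index(train_rows, query):
    """Index of the training row closest to `query` (Euclidean distance)."""
    return min(range(len(train_rows)),
               key=lambda i: math.dist(train_rows[i], query))

# Hypothetical features: (area in the hundreds, smoothness between 0 and 1).
train = [(500.0, 0.10), (520.0, 0.90), (900.0, 0.12)]
query = (530.0, 0.11)

# Raw features: the area dominates the distance.
raw_nearest = nearest_index(train, query)

# Standardized features: both columns contribute on a comparable scale.
train_scaled = standardize(train, train)
query_scaled = standardize(train, [query])[0]
scaled_nearest = nearest_index(train_scaled, query_scaled)

print(raw_nearest, scaled_nearest)
```

On the raw data the closest row is the second one, even though its smoothness is very different from the query. After standardization the closest row is the first one, which is similar on both features.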

### Hyperparameters
* n_neighbors: the number of neighboring points the algorithm looks at to classify a new point (5 by default). The more points, the more time is needed for the calculations
* weights: how the vote of each neighbor is weighted. It can be:
    * uniform: one point, one vote
    * distance: each vote is weighted by the inverse of its distance, so the points nearest to the new point count more
* algorithm: the algorithm used to find the nearest neighbors
    * brute: a brute-force search, which means that the distances between the new point and all the training points are computed
    * kd_tree: recommended for medium-sized datasets with few features. It is based on trees, and the idea is simple: if we know that point **A** is very distant from point **B**, and point **B** is very close to point **C**, then we know that points **A** and **C** are very distant without computing that distance. This saves computation time.
    * ball_tree: very similar to kd_tree, but it tends to work better on bigger datasets and when there are many features
    * auto: scikit-learn tries to choose the most appropriate algorithm for the data passed to `fit`
* p: the power parameter of the Minkowski distance. It can be:
    * 1: Manhattan distance
    * 2: Euclidean distance (the default)
* leaf_size: for the tree-based algorithms, the number of points in a leaf, below which the search switches to brute force. It affects the speed and memory use of the tree, not the predictions
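The `p` and `weights` options can be illustrated with a short pure-Python sketch. The functions `minkowski` and `weighted_vote` are hypothetical helpers written for this example, not scikit-learn functions, and the neighbor labels and distances are made up.

```python
def minkowski(a, b, p=2):
    """Minkowski distance of order p between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def weighted_vote(neighbor_labels, neighbor_distances, weights="uniform"):
    """Tally the votes of the neighbors and return the winning label.

    This sketch assumes every distance is greater than zero.
    """
    tally = {}
    for label, distance in zip(neighbor_labels, neighbor_distances):
        # 'uniform': every neighbor counts 1; 'distance': closer counts more.
        vote = 1.0 if weights == "uniform" else 1.0 / distance
        tally[label] = tally.get(label, 0.0) + vote
    return max(tally, key=tally.get)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, p=1))  # Manhattan distance: 7.0
print(minkowski(a, b, p=2))  # Euclidean distance: 5.0

# Three hypothetical neighbors: two far benign points, one close malignant point.
labels = ["benign", "benign", "malignant"]
distances = [2.0, 2.5, 0.5]
print(weighted_vote(labels, distances, weights="uniform"))
print(weighted_vote(labels, distances, weights="distance"))
```

With uniform weights the two benign neighbors win by count. With distance weights the single malignant neighbor wins because it is much closer (a weight of 2.0 against 0.5 + 0.4).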

In [170]:
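# Baseline model with the default hyperparameters
KN_class_1 = KNeighborsClassifier()
KN_class_1.fit(cancer_X_train_scale, cancer_y_train)
preds_train = KN_class_1.predict(cancer_X_train_scale)
# Note: scikit-learn metrics expect (y_true, y_pred). Predictions are passed
# first here, so the rows of the confusion matrix are predictions and
# precision/recall are swapped in the report. The binary F1 score is unaffected.
print('CLASSIFICATION IN TRAIN')
print()
print('CONFUSION MATRIX:\n', confusion_matrix(preds_train, cancer_y_train))
print()
print('F1 SCORE:\n', f1_score(preds_train, cancer_y_train))
print()
print('CLASSIFICATION REPORT\n',classification_report(preds_train, cancer_y_train))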

CLASSIFICATION IN TRAIN

CONFUSION MATRIX:
 [[145   2]
 [ 10 269]]

F1 SCORE:
 0.9781818181818182

CLASSIFICATION REPORT
               precision    recall  f1-score   support

           0       0.94      0.99      0.96       147
           1       0.99      0.96      0.98       279

    accuracy                           0.97       426
   macro avg       0.96      0.98      0.97       426
weighted avg       0.97      0.97      0.97       426



In [171]:
preds_test = KN_class_1.predict(cancer_X_test_scale)
print('CLASSIFICATION IN TEST')
print()
print('CONFUSION MATRIX:\n', confusion_matrix(preds_test, cancer_y_test))
print()
print('F1 SCORE:\n', f1_score(preds_test, cancer_y_test))
print()
print('CLASSIFICATION REPORT\n',classification_report(preds_test, cancer_y_test))

CLASSIFICATION IN TEST

CONFUSION MATRIX:
 [[55  0]
 [ 2 86]]

F1 SCORE:
 0.9885057471264368

CLASSIFICATION REPORT
               precision    recall  f1-score   support

           0       0.96      1.00      0.98        55
           1       1.00      0.98      0.99        88

    accuracy                           0.99       143
   macro avg       0.98      0.99      0.99       143
weighted avg       0.99      0.99      0.99       143



Let's try to improve the accuracy of our model with parameter tuning.

In [211]:
KN_class_2 = KNeighborsClassifier(n_neighbors=3,
                                  algorithm='brute',
                                  p=1,
                                  weights='distance',
                                  n_jobs=-1)

KN_class_2.fit(cancer_X_train_scale, cancer_y_train)
preds_train = KN_class_2.predict(cancer_X_train_scale)
print('CLASSIFICATION IN TRAIN')
print()
print('CONFUSION MATRIX:\n', confusion_matrix(preds_train, cancer_y_train))
print()
print('F1 SCORE:\n', f1_score(preds_train, cancer_y_train))
print()
print('CLASSIFICATION REPORT\n',classification_report(preds_train, cancer_y_train))

CLASSIFICATION IN TRAIN

CONFUSION MATRIX:
 [[155   0]
 [  0 271]]

F1 SCORE:
 1.0

CLASSIFICATION REPORT
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       155
           1       1.00      1.00      1.00       271

    accuracy                           1.00       426
   macro avg       1.00      1.00      1.00       426
weighted avg       1.00      1.00      1.00       426



In [212]:
preds_test = KN_class_2.predict(cancer_X_test_scale)
print('CLASSIFICATION IN TEST')
print()
print('CONFUSION MATRIX:\n', confusion_matrix(preds_test, cancer_y_test))
print()
print('F1 SCORE:\n', f1_score(preds_test, cancer_y_test))
print()
print('CLASSIFICATION REPORT\n',classification_report(preds_test, cancer_y_test))

CLASSIFICATION IN TEST

CONFUSION MATRIX:
 [[56  0]
 [ 1 86]]

F1 SCORE:
 0.9942196531791908

CLASSIFICATION REPORT
               precision    recall  f1-score   support

           0       0.98      1.00      0.99        56
           1       1.00      0.99      0.99        87

    accuracy                           0.99       143
   macro avg       0.99      0.99      0.99       143
weighted avg       0.99      0.99      0.99       143



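Great, the tuned model makes only 1 error in 143 test samples instead of 2, which raises the test accuracy by a little less than 1 percentage point (from about 98.6% to 99.3%). The perfect score on the training set is largely expected with weights='distance': each training point is its own nearest neighbor at distance zero, so it decides its own vote.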
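The tuned values above were picked by hand and judged on the test set. One common alternative, which this notebook does not use, is to choose `n_neighbors` by cross-validation on the training data only. The following is a minimal pure-Python sketch of leave-one-out validation on a made-up toy dataset; `knn_predict` and `leave_one_out_accuracy` are hypothetical helpers, and the last point is a deliberately noisy label.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Majority vote among the k points closest to `query`."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def leave_one_out_accuracy(points, labels, k):
    """Predict each point from all the others and return the share of hits."""
    hits = 0
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_points, rest_labels, points[i], k) == labels[i]
    return hits / len(points)

# Hypothetical toy data: two clusters plus one noisy "A" inside the "B" cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (5.5, 5.5)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "A"]

scores = {k: leave_one_out_accuracy(points, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # the first k with the highest score
print(scores)
print(best_k)
```

With a single neighbor, the noisy point misleads every point of its cluster. With three or five neighbors it is outvoted, so the larger values of k score better on this toy data.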

## 4. Regression
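Regression is very similar to the classification algorithm: instead of holding a vote, the model averages the target values of the nearest neighbors. We therefore apply it directly to the regression problem.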
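As a rough illustration of the averaging step, here is a pure-Python sketch of neighbors regression. The function `knn_regress` and the one-feature toy data are hypothetical. The handling of exact matches is an assumption of this sketch, chosen so that a zero distance does not cause a division by zero.

```python
import math

def knn_regress(train_points, train_targets, query, k=3, weights="uniform"):
    """Predict a number as the (optionally distance-weighted) mean of k neighbors."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))[:k]
    dists = [math.dist(train_points[i], query) for i in order]
    targets = [train_targets[i] for i in order]
    if weights == "uniform":
        return sum(targets) / len(targets)
    # Exact matches (distance 0) take all the weight in this sketch.
    exact = [t for t, d in zip(targets, dists) if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    inverse = [1.0 / d for d in dists]
    return sum(w * t for w, t in zip(inverse, targets)) / sum(inverse)

# Hypothetical toy data: one feature and a price-like target.
train_points = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_targets = [10.0, 20.0, 36.0, 50.0]

# A new point halfway between two training points.
print(knn_regress(train_points, train_targets, (2.5,), k=2))

# A point that is already in the training set.
print(knn_regress(train_points, train_targets, (2.0,), k=3, weights="uniform"))
print(knn_regress(train_points, train_targets, (2.0,), k=3, weights="distance"))
```

For the training point, uniform weights average it with its neighbors, while distance weights return its own target. This is the reason a distance-weighted model can reach zero error on the training set without being perfect on new data.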

In [216]:
KN_reg_1 = KNeighborsRegressor()

KN_reg_1.fit(boston_X_train_scale, boston_y_train)
train_predictions = KN_reg_1.predict(boston_X_train_scale)
test_predictions = KN_reg_1.predict(boston_X_test_scale)
# np.sqrt of the MSE is the RMSE, so both splits are labeled RMSE
print('MAE in train:', mean_absolute_error(boston_y_train, train_predictions))
print('RMSE in train:', np.sqrt(mean_squared_error(boston_y_train, train_predictions)))
print('MAE in test:', mean_absolute_error(boston_y_test, test_predictions))
print('RMSE in test:', np.sqrt(mean_squared_error(boston_y_test, test_predictions)))

MAE in train: 2.3253825857519788
RMSE in train: 3.59266828907933
MAE in test: 3.2366929133858267
RMSE in test: 5.923985942658785


In [235]:
KN_reg_2 = KNeighborsRegressor(n_neighbors=3,
                               algorithm='brute',
                               p=1,
                               weights='distance',
                               n_jobs=-1)

KN_reg_2.fit(boston_X_train_scale, boston_y_train)
train_predictions = KN_reg_2.predict(boston_X_train_scale)
test_predictions = KN_reg_2.predict(boston_X_test_scale)
# np.sqrt of the MSE is the RMSE, so both splits are labeled RMSE
print('MAE in train:', mean_absolute_error(boston_y_train, train_predictions))
print('RMSE in train:', np.sqrt(mean_squared_error(boston_y_train, train_predictions)))
print('MAE in test:', mean_absolute_error(boston_y_test, test_predictions))
print('RMSE in test:', np.sqrt(mean_squared_error(boston_y_test, test_predictions)))

MAE in train: 0.0
RMSE in train: 0.0
MAE in test: 2.907951599903701
RMSE in test: 5.007447640755706


In [106]:
print(df_boston_target.min())
print(df_boston_target.max())

target    5.0
dtype: float64
target    50.0
dtype: float64


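We have improved our model: since the target is expressed in thousands of dollars, the MAE in test drops by about $330 and the RMSE by about $920. Even so, an average error of about $2,900 on prices that range from $5,000 to $50,000 means it is still not a very good model.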
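The improvement can be checked with a few lines of arithmetic on the test errors printed above. The only assumption is the unit of the target, taken here to be thousands of dollars, as in the scikit-learn description of the Boston dataset. The constant name `UNIT_IN_DOLLARS` is made up for this sketch.

```python
# Test-set errors printed above, in the units of the target.
mae_before, mae_after = 3.2366929133858267, 2.907951599903701
rmse_before, rmse_after = 5.923985942658785, 5.007447640755706

# Assumption: one unit of the target is 1,000 dollars.
UNIT_IN_DOLLARS = 1000

mae_gain = (mae_before - mae_after) * UNIT_IN_DOLLARS
rmse_gain = (rmse_before - rmse_after) * UNIT_IN_DOLLARS

print(f"MAE improvement:  about ${mae_gain:,.0f}")
print(f"RMSE improvement: about ${rmse_gain:,.0f}")
```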