# KNN (K-Nearest Neighbors)

*Author: Eda AYDIN*

Table of Contents
- [K-Nearest Neighbors](#k-nearest-neighbors)
- [Exploratory Data Analysis](#exploratory-data-analysis)
- [Data Pre-Processing](#data-pre-processing)
- [Modelling](#modelling)
- [Evaluation of Model Success](#evaluation-of-model-success)
- [Hyperparameter Optimizations](#hyperparameter-optimizations)
- [Final Model](#final-model)

## K-Nearest Neighbors

- Predictions are made from the similarity between observations.
- "Tell me who your friends are, and I will tell you who you are."
- ***K-Nearest Neighbors Regression***: If the target variable y is numeric, the problem is a regression problem.
     - The k observations closest to the query point are found using Euclidean (or a similar) distance.
     - The prediction for the query point is the average of the y values of those k closest observations.
- ***K-Nearest Neighbors Classification***: If the target variable y is categorical, the problem is a classification problem.
     - The class that occurs most often among the k closest observations becomes the predicted class.
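
The two prediction rules above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not how scikit-learn implements KNN: the helper names (`k_nearest_indices`, `knn_regress`, `knn_classify`) and the toy data are made up for this example, and plain Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def k_nearest_indices(X_train, x_query, k):
    """Return indices of the k training rows closest to x_query (Euclidean)."""
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    return np.argsort(distances)[:k]

def knn_regress(X_train, y_train, x_query, k=3):
    """Regression: average the y values of the k nearest neighbors."""
    idx = k_nearest_indices(X_train, x_query, k)
    return float(np.mean(y_train[idx]))

def knn_classify(X_train, y_train, x_query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    idx = k_nearest_indices(X_train, x_query, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]

# Toy data, invented for illustration only
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5]])
y_class = np.array([0, 0, 0, 1, 1])                 # categorical target
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0])  # numeric target

query = np.array([1.2, 1.1])
predicted_class = knn_classify(X_train, y_class, query, k=3)
predicted_value = knn_regress(X_train, y_value, query, k=3)
print(predicted_class, predicted_value)
```

The three rows nearest to the query all belong to class 0, so the vote returns 0, and the regression prediction is the mean of their three numeric targets.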


In [1]:
import pandas as pd
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

pd.set_option("display.max_columns",None)

## Exploratory Data Analysis

In [2]:
df = pd.read_csv("datasets/diabetes.csv")
df.head()

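| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |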


In [3]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


In [4]:
df["Outcome"].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

## Data Pre-Processing

In [5]:
y = df["Outcome"]
X = df.drop(["Outcome"], axis=1)

For distance-based methods and gradient-descent-based methods, standardizing the variables makes the results either faster to obtain or more accurate.
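
Standardization matters for KNN because a feature measured on a large scale would otherwise dominate the distance. The sketch below uses a small made-up array (not the diabetes data) to show that `StandardScaler` applies `(x - mean) / std` per column, and prints the distances before and after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales (illustrative values only)
X_raw = np.array([[100.0, 0.20],
                  [110.0, 0.90],
                  [300.0, 0.25]])

# StandardScaler computes (x - mean) / std per column, with the population std (ddof=0)
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X_scaled = StandardScaler().fit_transform(X_raw)

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(((a - b) ** 2).sum()))

# Distances from row 0 to rows 1 and 2, before and after scaling
raw_d01, raw_d02 = euclidean(X_raw[0], X_raw[1]), euclidean(X_raw[0], X_raw[2])
scaled_d01, scaled_d02 = euclidean(X_scaled[0], X_scaled[1]), euclidean(X_scaled[0], X_scaled[2])
print("raw:   ", raw_d01, raw_d02)
print("scaled:", scaled_d01, scaled_d02)
```

In the raw data the distances are determined almost entirely by the first column; after scaling, both columns contribute on a comparable footing.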

In [6]:
X_scaled = StandardScaler().fit_transform(X)

In [7]:
X = pd.DataFrame(X_scaled, columns=X.columns)

In [8]:
X.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.639947 | 0.848324 | 0.149641 | 0.90727 | -0.692891 | 0.204013 | 0.468492 | 1.425995 |
| 1 | -0.844885 | -1.123396 | -0.160546 | 0.530902 | -0.692891 | -0.684422 | -0.365061 | -0.190672 |
| 2 | 1.23388 | 1.943724 | -0.263941 | -1.288212 | -0.692891 | -1.103255 | 0.604397 | -0.105584 |
| 3 | -0.844885 | -0.998208 | -0.160546 | 0.154533 | 0.123302 | -0.494043 | -0.920763 | -1.041549 |
| 4 | -1.141852 | 0.504055 | -1.504687 | 0.90727 | 0.765836 | 1.409746 | 5.484909 | -0.020496 |


## Modelling

In [9]:
knn_model = KNeighborsClassifier().fit(X,y)

In [10]:
random_user = X.sample(1, random_state=45)
random_user

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 195 | 0.342981 | 1.161295 | 0.770014 | 1.283638 | 1.130518 | 0.940144 | -0.232176 | -0.360847 |


In [11]:
knn_model.predict(random_user)

array([1], dtype=int64)

## Evaluation of Model Success

In [12]:
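# y_pred: predicted class labels, used by classification_report
y_pred = knn_model.predict(X)

# y_prob: predicted probability of the positive class (last column), used for AUC
y_prob = knn_model.predict_proba(X)[:, -1]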

In [13]:
print(classification_report(y,y_pred))

              precision    recall  f1-score   support

           0       0.85      0.90      0.87       500
           1       0.79      0.70      0.74       268

    accuracy                           0.83       768
   macro avg       0.82      0.80      0.81       768
weighted avg       0.83      0.83      0.83       768



In [14]:
roc_auc_score(y, y_prob)

0.9017686567164179

In [15]:
cv_results = cross_validate(knn_model, X, y, cv=5, scoring=["accuracy", "f1", "roc_auc"])

In [16]:
cv_results['test_accuracy'].mean()

0.733112638994992

In [17]:
cv_results['test_f1'].mean()

0.5905780011534191

In [18]:
cv_results['test_roc_auc'].mean()

0.7805279524807827

- Accuracy: 0.83 on the training data vs. 0.73 with 5-fold cross-validation
- F1 score: 0.74 on the training data vs. 0.59 with 5-fold cross-validation
- ROC AUC: 0.90 on the training data vs. 0.78 with 5-fold cross-validation

- The first scores were computed on the same data the model was fitted on, so they are optimistically biased.
- cross_validate validates the model on held-out folds. It answers the question of whether we are on the right track with the model we built, or whether we made a mistake.
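
The gap between in-sample and cross-validated scores can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: it uses `make_classification` instead of the diabetes data, and `n_neighbors=1` is chosen deliberately because it makes the bias extreme (every training row is its own nearest neighbor).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy data standing in for a real dataset (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, flip_y=0.2, random_state=0
)

model = KNeighborsClassifier(n_neighbors=1)

# Scoring on the training rows: each row finds itself as its nearest neighbor
in_sample_acc = model.fit(X_demo, y_demo).score(X_demo, y_demo)

# 5-fold cross-validation: each row is scored by a model that never saw it
cv_acc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()

print("in-sample accuracy:      ", in_sample_acc)
print("cross-validated accuracy:", cv_acc)
```

With the default `n_neighbors=5` used in this notebook the effect is milder, but the direction is the same: scores measured on the fitting data overstate how well the model generalizes.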

***How can success scores be increased?***

1. The sample size can be increased.
2. Data preprocessing
3. Feature engineering
4. Optimizations can be made for the relevant algorithm.

In [19]:
knn_model.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

## Hyperparameter Optimizations

In [20]:
knn_model = KNeighborsClassifier()
knn_model.get_params() #Get parameters for this estimator.

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

In [21]:
knn_params = {"n_neighbors":range(2,50)}

In [22]:
# Exhaustive search over specified parameter values for an estimator.
knn_gs_best = GridSearchCV(estimator= knn_model,
                           param_grid= knn_params,
                           cv=5,
                           n_jobs=-1,
                           verbose=1).fit(X, y)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


In [23]:
knn_gs_best.best_params_

{'n_neighbors': 17}
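
Conceptually, the search above cross-validates the model once per candidate value and keeps the value with the best mean score. The following sketch shows that equivalence on synthetic data; the dataset, the smaller range of k, and the variable names are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real dataset (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
candidate_ks = range(2, 11)

# Manual version: mean 5-fold accuracy for every candidate k
mean_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5).mean()
    for k in candidate_ks
}
manual_best_score = max(mean_scores.values())

# The same search expressed with GridSearchCV
grid = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": candidate_ks}, cv=5
).fit(X_demo, y_demo)

print("manual best score:", manual_best_score)
print("grid best score:  ", grid.best_score_, grid.best_params_)
```

Besides `best_params_`, the fitted search object exposes `best_score_` (the mean cross-validated score of the winning candidate) and `cv_results_` (scores for every candidate), which help judge how much better the chosen k really is than its neighbors in the grid.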

## Final Model

In [24]:
knn_final = knn_model.set_params(**knn_gs_best.best_params_).fit(X, y)
# The ** operator unpacks the best-parameter dict found by GridSearchCV into keyword arguments for the model. When there are several values, this passes them all directly instead of typing each one by hand.
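
A small sketch of the dictionary unpacking used in the cell above. The `best_params` dict here is hypothetical (shaped like a `best_params_` result with two entries); it is not the output of the search in this notebook.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical best-parameter dict, shaped like GridSearchCV's best_params_
best_params = {"n_neighbors": 17, "weights": "distance"}

# These two calls configure the estimator identically
unpacked = KNeighborsClassifier().set_params(**best_params)
explicit = KNeighborsClassifier().set_params(n_neighbors=17, weights="distance")

print(unpacked.get_params() == explicit.get_params())
```

Note that `set_params` modifies the estimator in place and returns the same object, so in the cell above `knn_final` and `knn_model` refer to one and the same model.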

In [25]:
cv_results = cross_validate(estimator= knn_final,
                            X= X,
                            y= y,
                            cv=5,
                            scoring=["accuracy", "f1", "roc_auc"])

In [26]:
cv_results['test_accuracy'].mean()

0.7669892199303965

In [27]:
cv_results['test_f1'].mean()

0.6170909049720137

In [28]:
cv_results['test_roc_auc'].mean()

0.8127938504542278

In [29]:
random_user = X.sample(1)

In [30]:
knn_final.predict(random_user)

array([1], dtype=int64)