## **`K-Nearest Neighbors (k-NN)`** 

<img src="images/knn1.svg">

We want to guess if a new customer will sign up based on two pieces of information. We can do this by measuring how far away the new customer is from other customers on a graph and seeing which group they are closest to. If they are closest to the group that mostly signs up, we predict they will sign up too. If they are closest to the group that mostly doesn't sign up, we predict they won't sign up.
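To make "how far away" concrete, here is a minimal sketch that measures the straight-line (Euclidean) distance between a new customer and two existing ones. The two features (age and number of contacts) and all values are made up for illustration; they are not taken from the dataset used later.

```python
import math

# Hypothetical customers, each described by two made-up features:
# (age, number of contacts). These values are assumptions for illustration.
new_customer = (35, 2)
customer_a = (38, 6)   # assume this customer signed up
customer_b = (60, 2)   # assume this customer did not sign up

# math.dist returns the Euclidean distance between two points.
dist_a = math.dist(new_customer, customer_a)
dist_b = math.dist(new_customer, customer_b)

print(f"distance to A: {dist_a:.2f}")
print(f"distance to B: {dist_b:.2f}")
print("closest:", "A" if dist_a < dist_b else "B")
```

In this toy case the new customer is closer to customer A, so a one-neighbor prediction would be "signs up".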


<img src="images/knn.svg">

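The K-Nearest Neighbors algorithm classifies a data point by comparing its features with those of data points whose labels are already known: points with similar feature values lie close together, and a new point takes the label that is most common among its nearest neighbors. As a result, decision boundaries emerge from the data itself rather than from hand-written rules.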





## K-Nearest Neighbors Algorithm

To make a prediction about a new data point, the algorithm follows these steps:

1. Calculate the distance between the new data point and all the observations in the training dataset across all features.

2. Sort the distances in `ascending` order.
3. Select the `K` observations with the smallest distances from the previous step. These `K observations` are the `K-nearest neighbors` of the new data point.
   - Note that `K` must be at least 1 and no larger than the number of observations in the training dataset.
4. Find the most common label among those `K` neighbors and assign that label to the new data point.

This allows the algorithm to predict the label of a new data point based on the labels of its closest neighbors. The `K-Nearest Neighbors` algorithm is commonly used in machine learning for `classification tasks`.
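The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, numeric features, and a simple majority vote, and the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, new_point, k):
    """Predict a label for new_point by majority vote among its k nearest neighbors."""
    if not 1 <= k <= len(X_train):
        raise ValueError("k must be between 1 and the number of training observations")
    # Step 1: distance from the new point to every training observation.
    distances = [(math.dist(row, new_point), label)
                 for row, label in zip(X_train, y_train)]
    # Step 2: sort the distances in ascending order.
    distances.sort(key=lambda pair: pair[0])
    # Step 3: keep the labels of the k closest observations.
    nearest_labels = [label for _, label in distances[:k]]
    # Step 4: return the most common label among those neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data, made up for illustration: two numeric features per observation.
X_toy = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
y_toy = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(X_toy, y_toy, (2.0, 2.0), k=3)
pred_high = knn_predict(X_toy, y_toy, (6.0, 5.5), k=3)
print(pred_low, pred_high)
```

With an even `K`, the vote can tie; in this sketch the tie goes to the label that appears first among the nearest neighbors, which is one reason an odd `K` is often chosen for two-class problems.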


### **`Data Preparation`**

In [106]:
import pandas as pd

In [107]:
banking_df = pd.read_csv("data/bank_data - bank_data.csv.csv")
banking_df.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


In [108]:
print("Shape :",banking_df.shape)

Shape : (41188, 21)


In [109]:
print(banking_df.dtypes.value_counts())


object     11
int64       5
float64     5
Name: count, dtype: int64


In [110]:
banking_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [111]:
banking_df.isna().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

In [112]:
banking_df["y"].value_counts()

y
no     36548
yes     4640
Name: count, dtype: int64

In [113]:
banking_df.describe()

| | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| mean | 40.02406 | 258.28501 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 |
| min | 17.0 | 0.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 32.0 | 102.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 |
| 50% | 38.0 | 180.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 |
| 75% | 47.0 | 319.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 98.0 | 4918.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |


The target column `y` holds the strings `yes` and `no`. To convert the `yes` values to `1` and the `no` values to `0`, we can use the pandas `replace()` method.

In [114]:
banking_df["y"] = banking_df["y"].replace({"yes":1,"no":0})
banking_df["y"].value_counts()

y
0    36548
1     4640
Name: count, dtype: int64

In [115]:
# Randomly sample 85% of the rows for training (fixed seed for reproducibility)
train_df = banking_df.sample(frac=0.85, random_state=417)
# The remaining 15% of the rows form the test set
test_df = banking_df.drop(train_df.index)

In [116]:
print(train_df["y"].value_counts(normalize=True))
print(test_df["y"].value_counts(normalize=True))

y
0    0.887889
1    0.112111
Name: proportion, dtype: float64
y
0    0.884267
1    0.115733
Name: proportion, dtype: float64
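The split above relies on `sample()` to pick the training rows and on `drop()` to keep everything else for testing. The following minimal sketch repeats the same pattern on a small made-up frame (`toy_df` is a hypothetical stand-in for `banking_df`) and checks that the two parts share no rows and together cover the whole frame. It assumes the index is unique, as it is with the default `RangeIndex`.

```python
import pandas as pd

# Tiny made-up frame standing in for banking_df (values are illustrative only).
toy_df = pd.DataFrame({"age": range(20, 40), "y": [0, 1] * 10})

# Same pattern as above: sample the training rows, drop them to get the test rows.
toy_train = toy_df.sample(frac=0.85, random_state=417)
toy_test = toy_df.drop(toy_train.index)

# Rows that appear in both parts (should be none).
overlap = toy_train.index.intersection(toy_test.index)
print("shared rows:", len(overlap))
print("rows covered:", len(toy_train) + len(toy_test), "of", len(toy_df))
```

If the index contained duplicate labels, `drop()` would remove every row carrying a sampled label, so the test set could end up smaller than expected.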


In [117]:
X_train = train_df.loc[:,train_df.columns != "y"]
y_train = train_df["y"]



In [118]:
X_test = test_df.loc[:, test_df.columns != "y"]
y_test = test_df["y"]