## Problem Statement
The goal is to predict customer churn using the K-Nearest Neighbors (KNN)
algorithm and analyze the effect of K on model performance.


In [2]:
import pandas as pd

dataset = pd.read_csv("customer_churn.csv")
dataset.head()

Unnamed: 0,age,monthly_charges,tenure,churn
0,22,250,3,1
1,25,300,6,1
2,30,400,12,0
3,35,450,18,0
4,40,500,24,0


KNN is a distance-based algorithm, so feature scale and data distribution
are critical for model performance.
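To make the scale issue concrete, the sketch below computes the Euclidean distance between the first two rows of the sample output above, using only age, monthly_charges and tenure. The helper name `euclidean` is illustrative and not part of the original notebook.

```python
import numpy as np

# First two rows of the sample output: (age, monthly_charges, tenure)
a = np.array([22, 250, 3])
b = np.array([25, 300, 6])


def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((p - q) ** 2)))


d_ab = euclidean(a, b)

# Fraction of the squared distance contributed by each feature
contrib = (a - b) ** 2 / np.sum((a - b) ** 2)

print(d_ab)
print(dict(zip(["age", "monthly_charges", "tenure"], contrib.round(4))))
```

On these unscaled values, monthly_charges accounts for more than 99% of the squared distance, so age and tenure barely influence which neighbors are considered nearest.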


In [3]:
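# Drop the saved index column ("Unnamed: 0") and the target so only real features remain
x = dataset.drop(columns=["Unnamed: 0", "churn"])
y = dataset["churn"]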

In [4]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

Stratified splitting keeps the ratio of churn to non-churn samples
approximately the same in the training and test sets, so both classes
appear in each split.
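The effect of stratification can be checked on a small synthetic example. The labels below are hypothetical (80 non-churn, 20 churn) and are not taken from customer_churn.csv; the names `X_demo` and `y_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 non-churn (0) and 20 churn (1)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Churn rate in each split; stratification keeps both close to the overall 20%
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

Without the stratify argument, the churn rate in a small test set can drift noticeably from the overall rate, or a class can be missing from it entirely.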


In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

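Because KNN relies on distance calculations, feature scaling is essential:
without it, features with larger ranges (such as monthly_charges here)
dominate the distance computation.

The scaler is fit on the training set only and then applied to the test set,
so no test-set statistics leak into training.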
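As a rough illustration of what the scaler does, the sketch below standardizes a test row by hand using training-set statistics and compares the result with StandardScaler. The rows come from the sample output above, but the train/test assignment here is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative split of (age, monthly_charges, tenure) rows from the sample output
train = np.array(
    [[22, 250, 3], [30, 400, 12], [35, 450, 18], [40, 500, 24]], dtype=float
)
test = np.array([[25, 300, 6]], dtype=float)

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

manual_test = (test - mean) / std
sk_test = StandardScaler().fit(train).transform(test)

print(np.allclose(manual_test, sk_test))
```

After this transformation each feature has zero mean and unit variance on the training data, so no single feature dominates the distance by range alone.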


In [6]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train_scaled,y_train)

y_pred = knn.predict(x_test_scaled)

In [8]:
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[1 0]
 [0 1]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



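With only two samples in the test set, the perfect scores above say little
about how the model generalizes.

KNN classification performance depends heavily on the choice of K
and the distance metric.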
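One way to see the influence of the distance metric is to hold K fixed and vary the metric argument of KNeighborsClassifier. The sketch below uses synthetic data from make_classification rather than the churn dataset, so the scores it prints are not results for this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the churn dataset)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

metric_scores = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    # Same K for every run; only the distance metric changes
    model = make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5, metric=metric)
    )
    model.fit(X_tr, y_tr)
    metric_scores[metric] = f1_score(y_te, model.predict(X_te))

print(metric_scores)
```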


In [10]:
from sklearn.metrics import f1_score

k_values = range(1,8)
f1_scores = []

for k in k_values:
    # Refit KNN for each K and record the F1 score on the test set
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train_scaled, y_train)
    y_preds = knn.predict(x_test_scaled)
    f1_scores.append(f1_score(y_test, y_preds))

list(zip(k_values, f1_scores))

[(1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 0.0), (7, 0.0)]

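Small K values lead to high variance (overfitting),
while larger K values increase bias (underfitting).

In the results above, the F1 score falls from 1.0 to 0.0 at K = 6:
the single churn sample in the test set is no longer classified correctly.

Choosing K therefore means balancing bias and variance.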
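A common way to pick K is cross-validation over a grid of candidate values instead of a single train/test split. The sketch below is one possible setup, assuming synthetic data from make_classification and an arbitrary grid of odd K values; neither comes from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the churn dataset)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 22, 2))},  # odd K avoids tied votes
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_syn, y_syn)

best_k = search.best_params_["knn__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

Averaging the score over several folds gives a steadier estimate than a test set of a few samples, which matters when comparing neighboring K values.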


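KNN stores the training data and classifies a new point by the labels of its nearest neighbors.

Unlike Logistic Regression, it does not learn model parameters during training.

Because every prediction compares the query against the stored samples, KNN is sensitive to noisy points, and its prediction cost grows with dataset size.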
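The idea that KNN only stores data and votes at prediction time can be sketched in a few lines of NumPy. The class `TinyKNN` and the toy points below are hypothetical and meant only to illustrate the mechanism; this is not how scikit-learn implements KNeighborsClassifier.

```python
from collections import Counter

import numpy as np


class TinyKNN:
    """Minimal brute-force KNN classifier using Euclidean distance."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; no parameters are estimated
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        preds = []
        for row in np.asarray(X, dtype=float):
            # Distance from the query to every stored training point
            dists = np.linalg.norm(self.X_ - row, axis=1)
            nearest = np.argsort(dists)[: self.k]
            # Majority vote among the k nearest labels
            votes = Counter(self.y_[nearest].tolist())
            preds.append(votes.most_common(1)[0][0])
        return np.array(preds)


# Toy data: two well-separated clusters
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_toy = [0, 0, 0, 1, 1, 1]

model = TinyKNN(k=3).fit(X_toy, y_toy)
toy_preds = model.predict([[0.05, 0.05], [1.0, 0.95]])
print(toy_preds)  # [0 1]
```

Each call to predict scans all stored points, which is why prediction cost grows with the size of the training set and why a single mislabeled neighbor can flip a vote when K is small.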


## Conclusion
KNN is a simple, distance-based classification algorithm.

It performs well on small, clean datasets but scales poorly with data size.

Feature scaling and careful selection of K are critical for good performance.
