# Diabetes Prediction with KNN

***
**Business Problem:** We would like to develop a machine learning model that can predict whether individuals have diabetes based on specified characteristics.
***

## Data Set
The data set is part of a larger dataset held by the **National Institute of Diabetes and Digestive and Kidney Diseases in the United States**. The data was collected for a diabetes study of Pima Indian women aged 21 and over living in Phoenix, Arizona, the fifth-largest city in the United States. The data set consists of **768 observations** and **8 numerical independent variables**. The target variable, **Outcome**, takes the value 1 for a positive diabetes test result and 0 for a negative one.

- **Pregnancies:** Number of pregnancies
- **Glucose:** Glucose level
- **BloodPressure:** Blood pressure
- **SkinThickness:** Skin thickness
- **Insulin:** Insulin level
- **BMI:** Body mass index
- **DiabetesPedigreeFunction:** A score that summarizes the likelihood of developing diabetes based on family history
- **Age:** Age (in years)

*Target:*
- **Outcome:** Whether the person has diabetes (1) or not (0)

## Exploratory Data Analysis (EDA)

In [3]:
# Libraries

# for data manipulation
import pandas as pd

# for modelling
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

In [4]:
# setting configurations
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)

In [5]:
df = pd.read_csv("/content/diabetes.csv")

In [6]:
df.head()

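| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |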


In [7]:
df.shape

(768, 9)

In [8]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845 | 3.37 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.895 | 31.973 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105 | 19.356 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536 | 15.952 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799 | 115.244 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.993 | 7.884 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.472 | 0.331 | 0.078 | 0.244 | 0.372 | 0.626 | 2.42 |
| Age | 768.0 | 33.241 | 11.76 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.349 | 0.477 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
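
The summary statistics show a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. A value of 0 is not plausible for these measurements, so the zeros may be placeholders for missing values. That reading is an assumption, since the data description does not say so. A minimal sketch of how the zeros could be counted, using the five rows from `df.head()` as a small stand-in for the full data:

```python
import pandas as pd

# The five rows shown by df.head(), used here as a small stand-in for the full data set.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count the zeros in each column.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the full data, the same count could be taken with `(df[sample.columns] == 0).sum()`.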


In [9]:
df["Outcome"].value_counts()

0    500
1    268
Name: Outcome, dtype: int64
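
The classes are imbalanced (500 negative, 268 positive), which gives a useful reference point for the accuracy figures later in the notebook: a model that always predicts the majority class already reaches a fixed accuracy. A small sketch of that arithmetic, using only the counts printed above:

```python
# Class counts taken from the value_counts() output above.
counts = {0: 500, 1: 268}

total = sum(counts.values())

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(counts.values()) / total

# Share of positive cases in the data.
positive_rate = counts[1] / total

print(f"majority-class accuracy: {majority_baseline:.3f}")  # 0.651
print(f"positive rate: {positive_rate:.3f}")                # 0.349
```

Any reported accuracy is best read against this 0.651 baseline rather than against 0.5.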

## Data Preprocessing 

In [11]:
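# separate the target from the features
y = df["Outcome"]
X = df.drop(["Outcome"], axis=1)

# standardize each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# fit_transform returns a NumPy array; restore the column names
X = pd.DataFrame(X_scaled, columns=X.columns)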
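
KNN relies on distances, so features on large scales (such as Insulin) would otherwise dominate features on small scales (such as DiabetesPedigreeFunction). `StandardScaler` addresses this by replacing each value with its z-score. The sketch below checks that on a few Glucose and BMI values from the first rows of the data; the array name `X_demo` is only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Glucose and BMI values from the first five rows of the data set.
X_demo = np.array([
    [148, 33.6],
    [85, 26.6],
    [183, 23.3],
    [89, 28.1],
    [137, 43.1],
])

# Manual z-score: subtract the column mean, then divide by the
# population standard deviation (NumPy's default, ddof=0).
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

scaled = StandardScaler().fit_transform(X_demo)

# The two results agree up to floating-point error.
print(np.allclose(manual, scaled))
```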

## Modeling & Prediction

In [15]:
# model
knn_model = KNeighborsClassifier().fit(X, y)

In [16]:
# prediction
random_user = X.sample(1, random_state=45)

knn_model.predict(random_user)

array([1])
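
To make the prediction step less of a black box, here is a minimal sketch of the idea behind KNN classification: find the k closest training points and take a majority vote. It uses Euclidean distance and k=5 by default, matching the `KNeighborsClassifier` defaults listed later in the notebook. The function `knn_predict` and the toy data are hypothetical, and tie-breaking here is not guaranteed to match scikit-learn.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbours' labels.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.1, 0.1]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([3.0, 3.1]), k=3))  # 1
```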

### Model Evaluation

In [13]:
# y_pred for the classification report (predictions on the same data the model was fitted on)
y_pred = knn_model.predict(X)

# y_prob for AUC
y_prob = knn_model.predict_proba(X)[:, 1]

print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.90      0.87       500
           1       0.79      0.70      0.74       268

    accuracy                           0.83       768
   macro avg       0.82      0.80      0.81       768
weighted avg       0.83      0.83      0.83       768



In [17]:
# ROC AUC on the training data (optimistic; compare with the cross-validated value)
roc_auc_score(y, y_prob)

0.9017686567164179

In [18]:
# cross validation
cv_results = cross_validate(knn_model, X, y, cv=5, scoring=["accuracy", "f1", "roc_auc"])

In [19]:
# Accuracy
cv_results['test_accuracy'].mean()

0.733112638994992

In [20]:
# f1-score
cv_results['test_f1'].mean()

0.5905780011534191

In [21]:
# ROC AUC
cv_results['test_roc_auc'].mean()

0.7805279524807827
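
In the cells above, `StandardScaler` is fitted on all 768 rows before cross-validation, so the rows of each validation fold contribute to the scaling statistics. One common way to avoid this is to put the scaler and the classifier in a scikit-learn `Pipeline`, so the scaler is refitted on the training part of each fold. The sketch below shows the pattern on synthetic data from `make_classification`, a stand-in chosen for illustration, so its scores are not comparable with the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the diabetes features (768 rows, 8 columns).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# The scaler is refitted inside each training fold,
# so validation rows never influence the scaling.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

cv_demo = cross_validate(pipe, X_demo, y_demo, cv=5,
                         scoring=["accuracy", "f1", "roc_auc"])

for name in ("test_accuracy", "test_f1", "test_roc_auc"):
    print(name, round(cv_demo[name].mean(), 3))
```

With the real data, the unscaled features and `y` would be passed to `cross_validate` in place of `X_demo` and `y_demo`.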

### Hyperparameter Optimization

In [24]:
# hyperparameters
knn_model = KNeighborsClassifier()
knn_model.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

In [25]:
knn_params = {"n_neighbors": range(2, 50)}

In [26]:
# looking for the best number of neighbors
knn_gs_best = GridSearchCV(knn_model, knn_params, cv=5, n_jobs=-1, verbose=1).fit(X, y)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


In [27]:
knn_gs_best.best_params_

{'n_neighbors': 17}
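
When no `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's default score, which is accuracy for a classifier, so the value 17 above is the best by mean cross-validated accuracy. The sketch below shows how a different metric could be chosen and how the per-candidate scores could be inspected. It runs on synthetic data as a stand-in, so its best `n_neighbors` says nothing about the diabetes data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the diabetes data set.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": range(2, 20)},
    cv=5,
    scoring="roc_auc",  # rank candidates by ROC AUC instead of accuracy
).fit(X_demo, y_demo)

# One row per candidate, sorted so the best-ranked n_neighbors comes first.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score").head())
print(search.best_params_)
```

Looking at the full table, rather than only `best_params_`, shows whether neighbouring values of `n_neighbors` score almost as well as the winner.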

### Final Model


In [29]:
# final model with 17 neighbors (set_params returns the same estimator, so knn_final and knn_model are one object)
knn_final = knn_model.set_params(**knn_gs_best.best_params_).fit(X, y)

In [30]:
# cross validation
cv_results = cross_validate(knn_final, X, y, cv=5, scoring=["accuracy", "f1", "roc_auc"])

In [31]:
# accuracy
cv_results['test_accuracy'].mean()

0.7669892199303965

In [32]:
# f1 score
cv_results['test_f1'].mean()

0.6170909049720137

In [33]:
# ROC AUC
cv_results['test_roc_auc'].mean()

0.8127938504542278

In [34]:
# prediction (no random_state is set, so the sampled row can differ between runs)
random_user = X.sample(1)

knn_final.predict(random_user)

array([0])