# K-NN Classifier, supervised learning

We work with the Pima Indians Diabetes dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether a given female patient will develop diabetes, based on features such as BMI, age, and number of pregnancies. It is therefore a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that she does.

## Importing the necessary modules

In [15]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

## Preparing X, y

In [2]:
# We load the dataset
df = pd.read_csv('diabetes.csv')

In [3]:
df.head()

   pregnancies  glucose  diastolic  triceps  insulin   bmi    dpf  age  diabetes
0            6      148         72       35        0  33.6  0.627   50         1
1            1       85         66       29        0  26.6  0.351   31         0
2            8      183         64        0        0  23.3  0.672   32         1
3            1       89         66       23       94  28.1  0.167   21         0
4            0      137         40       35      168  43.1  2.288   33         1


In [4]:
X = df.drop('diabetes',axis=1)

In [5]:
X.shape

(768, 8)

In [18]:
y = df.diabetes.values

In [19]:
y.shape

(768,)

## Split into training and test sets

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=42, stratify=y)
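
The `stratify=y` argument asks `train_test_split` to keep the class proportions of `y` roughly the same in the training and test sets. A minimal sketch of that behaviour on synthetic labels follows; the 65/35 class mix and the names `labels` and `features` are made up for illustration and are not taken from the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 35% of them in class 1
labels = np.array([0] * 65 + [1] * 35)
features = np.arange(100).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.4, random_state=42, stratify=labels
)

# With stratification, the share of class 1 is preserved in both subsets
train_pos_share = l_train.mean()
test_pos_share = l_test.mean()
print(len(l_train), len(l_test))
print(train_pos_share, test_pos_share)
```

Without `stratify`, the two shares can drift apart by chance, which matters most when one class is much rarer than the other.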

## K-NN classifier

In [22]:
# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)
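
Conceptually, a k-NN classifier labels a query point by finding the k training points closest to it and taking a majority vote among their labels. The sketch below illustrates that idea with Euclidean distance on made-up 2-D points. The helper `knn_predict` and the toy data are hypothetical; scikit-learn's `KNeighborsClassifier` is far more optimized and has its own handling of ties and distance metrics.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote; on equal counts, most_common returns the label seen first
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
toy_labels = [0, 0, 0, 1, 1, 1]

pred_near_zero = knn_predict(toy_points, toy_labels, (1.1, 1.0), k=3)
pred_near_one = knn_predict(toy_points, toy_labels, (5.1, 5.0), k=3)
print(pred_near_zero, pred_near_one)
```

Because the vote depends on raw distances, features measured on large scales can dominate the neighbour search.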

## Confusion matrix and classification report

In [23]:
# Generate the confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

[[173  28]
 [ 57  50]]
             precision    recall  f1-score   support

          0       0.75      0.86      0.80       201
          1       0.64      0.47      0.54       107

avg / total       0.71      0.72      0.71       308



In [25]:
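# Rows of cm are true labels, columns are predicted labels.
# Class 0 (no diabetes) is treated as the positive class here.
tp = cm[0,0] # True 0, predicted 0: correctly predicted non-diabetes
tn = cm[1,1] # True 1, predicted 1: correctly predicted diabetes
fn = cm[0,1] # True 0, predicted 1: non-diabetic patients wrongly predicted as diabetic
fp = cm[1,0] # True 1, predicted 0: diabetic patients wrongly predicted as non-diabetic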

In [27]:
precision = tp / (tp + fp) # Precision for class 0: correctly predicted non-diabetes / total predicted non-diabetes
print(precision)

0.752173913043


In [30]:
recall = tp / (tp + fn) # Recall for class 0: correctly predicted non-diabetes / total actual non-diabetes
print(recall)

0.860696517413
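
scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so precision and recall for either class can be read directly off the matrix printed earlier. The sketch below recomputes them for both classes from those four counts; the helper `per_class_metrics` and the name `cm_counts` are hypothetical and introduced only for illustration.

```python
# Counts copied from the confusion matrix printed above
# (rows = true labels, columns = predicted labels)
cm_counts = [[173, 28],
             [57, 50]]

def per_class_metrics(matrix, positive):
    """Return (precision, recall, f1), treating class `positive` (0 or 1) as the positive class."""
    negative = 1 - positive
    tp = matrix[positive][positive]  # true positive class, predicted positive class
    fp = matrix[negative][positive]  # true negative class, predicted positive class
    fn = matrix[positive][negative]  # true positive class, predicted negative class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for cls in (0, 1):
    p, r, f1 = per_class_metrics(cm_counts, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Accuracy is the diagonal (correct predictions) over all predictions
accuracy = (cm_counts[0][0] + cm_counts[1][1]) / sum(sum(row) for row in cm_counts)
print(f"accuracy={accuracy:.2f}")
```

The class 0 values agree with the 0.752 and 0.861 printed above, and the class 1 values agree with the second row of the classification report (0.64 and 0.47).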
