## K-Nearest Neighbor (KNN)



* This KNN tutorial uses code from the post "Evaluating a Classification Model", available at http://www.ritchieng.com/machine-learning-evaluate-classification-model/

## The Data

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict whether or not a patient has diabetes from the diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Source: https://www.kaggle.com/uciml/pima-indians-diabetes-database/version/1#


In [1]:
# read the data into a Pandas DataFrame
import pandas as pd

df = pd.read_csv('pima_indians_diabetes.csv')
df.head()


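| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |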


In [2]:
# define X and y
X = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']]

y = df['Outcome']

In [3]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
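
To make the effect of `test_size=0.20` and `random_state=1` concrete, here is a minimal sketch on synthetic data. The 768-row, 8-feature shape is an assumption chosen to resemble the diabetes data, and the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are hypothetical stand-ins, not variables from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumed shape: 768 rows, 8 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = rng.integers(0, 2, size=768)

# Same split arguments as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)
print(X_tr.shape, X_te.shape)  # 20% of the rows (rounded up) go to the test set

# Calling again with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` is what keeps the results of later cells comparable between runs.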

In [6]:
from sklearn.neighbors import KNeighborsRegressor

# instantiate model
model = KNeighborsRegressor(n_neighbors=5)

# fit model
model.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform')

In [7]:
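# make predictions for the testing set
# note: KNeighborsRegressor returns the mean Outcome of the 5 nearest
# neighbors (a value between 0 and 1), not a 0/1 class label
y_pred_class = model.predict(X_test)
y_pred_class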

array([0.6, 0.8, 0.2, 0.2, 0.6, 0. , 0.6, 0.2, 0.4, 0.2, 0.2, 0. , 0.8,
       1. , 0. , 0.8, 0. , 0.4, 0.4, 0. , 0.6, 0. , 0.6, 0.4, 0. , 0.8,
       0.6, 0.8, 0. , 0.2, 0.4, 0.2, 0. , 0.8, 0. , 0.4, 0. , 0.4, 0.2,
       0.4, 0.4, 0. , 0. , 0. , 0.2, 0. , 0.8, 1. , 0.2, 0. , 0. , 0.2,
       0.8, 0. , 0.6, 0. , 1. , 0.2, 0.2, 0.2, 0.8, 0. , 0.6, 0.4, 0.4,
       0. , 0. , 1. , 0.4, 0.8, 1. , 0.8, 0.2, 0. , 0.6, 0.4, 0.8, 0.6,
       0.4, 0.4, 0.8, 0.2, 0.6, 0.2, 0. , 0.8, 0.6, 0. , 0.6, 0.2, 0.4,
       0. , 0. , 0. , 0.8, 0.2, 0.4, 0. , 0.6, 0.4, 0. , 0.6, 0.8, 0.6,
       0. , 0.2, 0.4, 0.6, 0.2, 0.4, 0.8, 0. , 0.4, 0.6, 0. , 0. , 0. ,
       0.4, 0.2, 0.2, 0. , 0.6, 0.4, 0.4, 0.2, 0.2, 0. , 0.2, 0.4, 0.2,
       1. , 0. , 0.8, 0.6, 0. , 0.2, 0.2, 0.6, 0.2, 0.4, 1. , 0. , 0.4,
       1. , 0.4, 0.4, 0. , 0.2, 0.4, 0.2, 0.4, 0.6, 0.4, 0.2])
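
Every value in this output is a multiple of 0.2 because, with `n_neighbors=5` and uniform weights, `KNeighborsRegressor` predicts the mean of the five nearest training labels. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and `X_new` are made-up arrays used only for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data with binary 0/1 targets (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = rng.integers(0, 2, size=200)
X_new = rng.normal(size=(10, 3))

reg = KNeighborsRegressor(n_neighbors=5).fit(X_demo, y_demo)
pred = reg.predict(X_new)

# Look up the 5 nearest training points for each query and average their labels
_, idx = reg.kneighbors(X_new)
manual = y_demo[idx].mean(axis=1)

print(pred)
print(np.allclose(pred, manual))
```

Each prediction can therefore be read as the share of the five neighbors whose label is 1.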

**Classification accuracy**: the proportion of predictions that are correct (`accuracy_score` reports it as a fraction between 0 and 1, not a percentage)
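
As a small illustration of the metric, the sketch below computes accuracy by hand on made-up labels and compares it with `accuracy_score`. The arrays `y_true` and `y_hat` are hypothetical and unrelated to the diabetes data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predicted labels (illustrative only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy is the number of matching positions divided by the total count
manual_acc = np.mean(y_true == y_hat)
print(manual_acc)                     # 6 of 8 labels match -> 0.75
print(accuracy_score(y_true, y_hat))  # same value from scikit-learn
```

Both inputs must be discrete class labels for this comparison to be meaningful.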

In [10]:
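# calculate accuracy
from sklearn import metrics
# the raw regressor output raises "ValueError: Classification metrics can't handle a mix
# of binary and continuous targets", so take the majority vote (mean > 0.5) first
y_pred_label = (y_pred_class > 0.5).astype(int)
print(metrics.accuracy_score(y_test, y_pred_label))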
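
The continuous values returned by `KNeighborsRegressor` are what `accuracy_score` rejects. Since `Outcome` is a 0/1 label, one alternative is to fit `KNeighborsClassifier`, which returns class labels directly. The sketch below uses synthetic data, so its accuracy says nothing about the diabetes dataset; `X_demo`, `y_demo`, and the other names are hypothetical. It also checks that, for five neighbors and binary labels, the classifier agrees with thresholding the regressor output at 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Synthetic binary classification data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)

# The classifier predicts 0/1 labels by majority vote among the 5 neighbors
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
acc = accuracy_score(y_te, y_pred)
print(acc)

# Thresholding the regressor's neighbor mean at 0.5 is the same majority vote
reg = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)
y_thresh = (reg.predict(X_te) > 0.5).astype(int)
print(np.array_equal(y_pred, y_thresh))
```

With an odd number of neighbors and two classes there are no tied votes, which is why the two approaches match here.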