
# **Pima Indians Diabetes Database**

**The Challenge:**

Build a machine learning model that accurately predicts whether or not the patients in the dataset have diabetes.

**Steps:**

1.   Question or problem definition
2.   Data Collection
3.   Exploratory Data Analysis
4.   Feature Engineering
5.   Modelling
6.   Testing
7.   Submitting











### **1. Defining the problem statement**

**Context:**

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

**Content:**

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

**Let us first learn about diabetes.**

In [0]:
# Display a video about diabetes for more information
from IPython.display import HTML
HTML('''<iframe width="560" height="315" src="https://www.youtube.com/embed/XfyGv-xwjlI" frameborder="0"
     allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>''')



---



### **2. Data Collection**
The data comes from Kaggle. You can download it from https://www.kaggle.com/uciml/pima-indians-diabetes-database#diabetes.csv



**2.1 Load libraries**

In [0]:
# pandas and NumPy for data manipulation and analysis
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

# Plotting library
import matplotlib.pyplot as plt
%matplotlib inline

# seaborn: a high-level interface on top of matplotlib for statistical graphics
import seaborn as sns
sns.set()  # apply seaborn's default theme to all plots

# Override the default plot style with the 'fivethirtyeight' style sheet
plt.style.use('fivethirtyeight')

# Suppress FutureWarning messages
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)



In [0]:
# load dataset
data = pd.read_csv('https://raw.githubusercontent.com/git-ashiq/kaggle-diabetes-prediction/master/data/diabetes.csv')



---



### **3. Exploratory Analysis**

In this stage, we analyze the data by describing and visualizing it, and try to understand its patterns.

#### **3.1 Analyze by Describing data**
Data dictionary, data types, general information, and summary statistics.

In [0]:
# head() method is used to return top n (5 by default) rows
data.head()

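| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |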




---



In [0]:
ProfileReport(data)

**Overview**

| Dataset info | Value |
|---|---|
| Number of variables | 9 |
| Number of observations | 768 |
| Total missing (%) | 0.0% |
| Total size in memory | 54.1 KiB |
| Average record size in memory | 72.2 B |

Variable types: 8 numeric and 1 boolean (`Outcome`); no categorical, date, text, rejected, or unsupported variables.

**Per-variable summary** (variables in column order; no variable has missing or infinite values; each column takes 6.1 KiB of memory)

| Variable | Distinct | Unique (%) | Zeros (%) | Min | 5% | Q1 | Median | Q3 | 95% | Max | Range | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 17 | 2.2% | 14.5% | 0 | 0 | 1 | 3 | 6 | 10 | 17 | 17 | 5 |
| Glucose | 137 | 17.8% | 0.0% | 44 | 80 | 99.75 | 117 | 141 | 181 | 199 | 155 | 41.25 |
| BloodPressure | 48 | 6.2% | 0.0% | 24 | 52 | 64 | 72 | 80 | 90 | 122 | 98 | 16 |
| SkinThickness | 51 | 6.6% | 0.0% | 7 | 14.35 | 25 | 28 | 33 | 44 | 99 | 92 | 8 |
| Insulin | 187 | 24.3% | 0.0% | 14 | 50 | 121.5 | 130.29 | 206.85 | 293 | 846 | 832 | 85.346 |
| BMI | 249 | 32.4% | 0.0% | 18.2 | 22.235 | 27.5 | 32.05 | 36.6 | 44.395 | 67.1 | 48.9 | 9.1 |
| DiabetesPedigreeFunction | 517 | 67.3% | 0.0% | 0.078 | 0.14035 | 0.24375 | 0.3725 | 0.62625 | 1.1328 | 2.42 | 2.342 | 0.3825 |
| Age | 52 | 6.8% | 0.0% | 21 | 21 | 24 | 29 | 41 | 58 | 81 | 60 | 17 |

| Variable | Mean | Std | Variance | Coef of variation | MAD | Skewness | Kurtosis | Sum |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 3.8451 | 3.3696 | 11.354 | 0.87634 | 2.7716 | 0.90167 | 0.15922 | 2953 |
| Glucose | 121.7 | 30.462 | 927.93 | 0.25031 | 24.57 | 0.53093 | -0.26869 | 93464 |
| BloodPressure | 72.428 | 12.106 | 146.56 | 0.16715 | 9.2757 | 0.13151 | 1.0837 | 55625 |
| SkinThickness | 29.247 | 8.9239 | 79.636 | 0.30512 | 6.6219 | 0.76182 | 4.8953 | 22462 |
| Insulin | 157 | 88.861 | 7896.3 | 0.56598 | 59.882 | 2.6227 | 12.086 | 120580 |
| BMI | 32.446 | 6.879 | 47.32 | 0.21201 | 5.4099 | 0.60214 | 0.91476 | 24919 |
| DiabetesPedigreeFunction | 0.47188 | 0.33133 | 0.10978 | 0.70215 | 0.24731 | 1.9199 | 5.595 | 362.4 |
| Age | 33.241 | 11.76 | 138.3 | 0.35379 | 9.5864 | 1.1296 | 0.64316 | 25529 |

**Outcome:** 2 distinct values, mean 0.34896. Class 0: 500 rows (65.1%); class 1: 268 rows (34.9%).

Note: this report appears to have been generated after the zero-value replacement in section 3.2. It shows 0.0% zeros for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, and the two most frequent Insulin values, 130.29 (236 rows) and 206.85 (138 rows), are the class means filled in for the 374 zero entries. Its figures therefore differ from the `data.describe()` output on the raw data.



In [0]:
# To display the column names of the dataset.
print(data.columns.values)

['Pregnancies' 'Glucose' 'BloodPressure' 'SkinThickness' 'Insulin' 'BMI'
 'DiabetesPedigreeFunction' 'Age' 'Outcome']


**Data Dictionary**

1.   Pregnancies : Number of times pregnant
2.   Glucose : Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3.   BloodPressure : Diastolic blood pressure (mm Hg)
4.   SkinThickness : Triceps skin fold thickness (mm)
5.   Insulin : 2-Hour serum insulin (mu U/ml)
6.   BMI : Body mass index (weight in kg/(height in m)^2)
7.   DiabetesPedigreeFunction : Diabetes pedigree function
8.   Age : Age (years)
9.   Outcome : Class variable (0 or 1) 


---


**Data Types:**

*   Nominal categorical features: Outcome.
*   Ordinal categorical features: -
*   Continuous numerical features: Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age.
*   Discrete numerical features : Pregnancies.
*   Alphanumeric: -
*   Typos: -


---

**Data Information:**

General data information

> No null values are present in the dataset.



In [0]:
# print the information of the datasets
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


**Data Statistical Information**



In [0]:
# To display statistical information
data.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |



**Distribution of numerical features across the samples**

*   **Outcome:** About 35% of the patients in the dataset have diabetes (the mean of `Outcome` is 0.349).
*   **Age:** Ranges from 21 to 81 with a mean of about 33 (no zero values).
*   **Pregnancies:** At least 75% of the patients have had one or more pregnancies (the 25th percentile is 1). Zero values are taken to mean no pregnancies.
*   The other variables are also important factors for diabetes and are explored in the following parts.
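
The summary figures above can be recomputed directly from the frame. A minimal sketch, assuming a small hand-built frame (`sample`, a hypothetical name) holding three columns of the five rows shown by `data.head()`; on the full `data` frame the same calls apply unchanged:

```python
import pandas as pd

# Hypothetical stand-in: three columns of the five rows displayed by data.head().
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Share of each outcome class (1 = diabetic, 0 = non-diabetic).
class_share = sample["Outcome"].value_counts(normalize=True)

# Share of patients with at least one pregnancy.
ever_pregnant = (sample["Pregnancies"] > 0).mean()

# Age range and mean.
age_summary = sample["Age"].agg(["min", "max", "mean"])

print(class_share)
print(f"At least one pregnancy: {ever_pregnant:.0%}")
print(age_summary)
```

On five rows the shares are only illustrative; the percentages quoted in the list come from the full 768-row dataset.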








---



#### **3.2 Handle Missing Values**

No values are formally missing from the dataset, but several columns contain zeros.

Measurements such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI cannot be zero for a living person, so these zeros most likely stand for missing measurements.


We replace the zero values of these features with the mean of the non-zero values for the same outcome class.

Credit to Jason Li for this [code](https://www.kaggle.com/dbsnail/diabetes-prediction-over-0-86-accuracy) and Mohamed L for his kernel [here](https://www.kaggle.com/momo062/d/uciml/pima-indians-diabetes-database/79-47-pima-indians-diabetes-log-reg-and-svc).

In [0]:
# Find Zero values in the features

# print("Total Zeros in Glucose: ",data[data.Glucose == 0].shape[0])

fields = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
for field in fields :
    print('%s : num of zero entries: %d' % (field, len(data.loc[ data[field] == 0, field ])))

Glucose : num of zero entries: 5
BloodPressure : num of zero entries: 35
SkinThickness : num of zero entries: 227
Insulin : num of zero entries: 374
BMI : num of zero entries: 11
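
Another way to surface these zeros is to recode them as `NaN`, so that pandas' own missing-value tools count them. A minimal sketch, assuming a hand-built frame (`sample`, a hypothetical name) holding the five rows shown by `data.head()`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: five columns of the five rows displayed by data.head().
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Recode physiologically impossible zeros as NaN so pandas treats them as missing.
marked = sample.copy()
marked[zero_as_missing] = marked[zero_as_missing].replace(0, np.nan)

missing_counts = marked[zero_as_missing].isna().sum()
print(missing_counts)
```

Once the zeros are `NaN`, `data.info()` and `isna()` report them as missing, which the raw frame does not.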


In [0]:
# Disabled alternative: replace zero values with the overall non-zero median

"""
def replace_zero_field(data, field):
    nonzero_vals = data.loc[data[field] != 0, field]
    avg = nonzero_vals.median()
    length = len(data.loc[ data[field] == 0, field])   # num of zero entries
    data.loc[ data[field] == 0, field ] = avg
    print('Field: %s; fixed %d entries with value: %.3f' % (field,length, avg))

for field in fields :
    replace_zero_field(data,field)
print()
for field in fields :
    print('Field %s : num 0-entries: %d' % (field, len(data.loc[ data[field] == 0, field ])))
"""

# Replace zero values with the non-zero mean of the same outcome class
def replace_zero(df, field, target):
    # Cast to float first so the class means can be stored without a dtype clash
    df[field] = df[field].astype(float)
    # Mean of the non-zero entries of `field` for each class of `target`
    mean_by_target = df.loc[df[field] != 0].groupby(target)[field].mean()
    for outcome, mean_value in mean_by_target.items():
        # Write into df itself rather than the global `data`
        df.loc[(df[field] == 0) & (df[target] == outcome), field] = mean_value

# Run the helper on every field where zero is not a valid measurement
for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']:
    replace_zero(data, col, 'Outcome')

In [0]:
data.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 206.846154 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 130.287879 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | 33.0 | 206.846154 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
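
The class-wise replacement can also be expressed with `groupby(...).transform`, which avoids indexing the class means by position. A minimal sketch, assuming a hypothetical helper `impute_zero_by_class` and a toy frame rather than the real data. Note that this kind of imputation reads the target label to fill in a feature, so when it is applied before the train/test split, the labels of the test rows influence their features:

```python
import numpy as np
import pandas as pd

def impute_zero_by_class(df, field, target):
    """Return df[field] with zeros replaced by the non-zero mean of the row's class."""
    # Treat zeros as missing; mean() then skips them automatically.
    values = df[field].replace(0, np.nan).astype(float)
    # Per-row class mean, aligned with the original index.
    class_means = values.groupby(df[target]).transform("mean")
    return values.fillna(class_means)

# Toy data (an assumption for illustration, not taken from the dataset).
toy = pd.DataFrame({
    "Insulin": [0, 100, 200, 0, 50, 150],
    "Outcome": [1, 1, 1, 0, 0, 0],
})
toy["Insulin"] = impute_zero_by_class(toy, "Insulin", "Outcome")
print(toy)
```

In the toy frame, the zero in class 1 becomes 150 (the mean of 100 and 200) and the zero in class 0 becomes 100 (the mean of 50 and 150).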




---

Split the dataset into input features (`X`) and the target (`y`), then into training and test sets for modelling.

In [0]:
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=100)
print(X_train.shape)
print(X_test.shape)
print(y_train.size)
print(y_test.size)

(614, 8)
(154, 8)
614
154
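
Because the classes are imbalanced (500 negatives to 268 positives), a plain random split can leave the two parts with slightly different class proportions. A minimal sketch of a stratified split using the `stratify` argument of `train_test_split`; the random features and the names `X_demo` and `y_demo` are assumptions standing in for the real `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: random features with the dataset's 500/268 class split.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = np.array([1] * 268 + [0] * 500)

# stratify=y_demo keeps the positive share nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(f"positive share: train={y_tr.mean():.3f}, test={y_te.mean():.3f}")
```

Switching the notebook's split to a stratified one would change which rows land in the test set, so the scores reported below would not carry over.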




---



### **Testing the Algorithms**


In [0]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.metrics import confusion_matrix, classification_report, f1_score

# Helper functions

# Fit a classifier on the training data
def train_clf(clf, X_train, y_train):
    return clf.fit(X_train, y_train)

# Predict on the given features and return the F1 score for the positive class
def pred_clf(clf, features, target):
    y_pred = clf.predict(features)
    return f1_score(target.values, y_pred, pos_label=1)

# Train a classifier, then report its F1 score on the training and test sets
def train_predict(clf, X_train, y_train, X_test, y_test):
    train_clf(clf, X_train, y_train)

    print("F1 score for training set is: {:.4f}".format(pred_clf(clf, X_train, y_train)))
    print("F1 score for testing set is: {:.4f}\n".format(pred_clf(clf, X_test, y_test)))
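
The helpers score every model with the F1 score of the positive class, which is the harmonic mean of precision and recall: F1 = 2 · P · R / (P + R). A minimal worked sketch on made-up labels (`y_true` and `y_hat` are illustrative, not model output) that computes it by hand and checks it against `f1_score`:

```python
from sklearn.metrics import f1_score

# Made-up labels for illustration: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count true positives, false positives and false negatives for class 1.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_hat, pos_label=1)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1_manual:.2f}")
```

Unlike accuracy, F1 ignores true negatives, so it is not inflated by the larger non-diabetic class.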


In [0]:
# Instantiate the classifiers to compare

nb = GaussianNB()
knn = KNeighborsClassifier()
svc = SVC()
dtc = DecisionTreeClassifier(random_state=0)
rfc = RandomForestClassifier(random_state=0)
abc = AdaBoostClassifier(random_state=0)
gbc = GradientBoostingClassifier(random_state=0)

algorithms = [nb, svc, knn, dtc, rfc, abc, gbc]

for clf in algorithms:
    print("{}:".format(clf))
    train_predict(clf, X_train, y_train, X_test, y_test)

GaussianNB(priors=None, var_smoothing=1e-09):
F1 score for training set is: 0.6954
F1 score for testing set is: 0.5490

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False):
F1 score for training set is: 0.7935
F1 score for testing set is: 0.6857

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform'):
F1 score for training set is: 0.8353
F1 score for testing set is: 0.7200

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,


In [0]:
# Optimizing the KNN model

for n in range(3,11):
    knn = KNeighborsClassifier(n_neighbors=n)
    print("Number of Neighbours is : {}".format(n))
    train_predict(knn, X_train, y_train, X_test, y_test)

Number of Neighbours is : 3
F1 score for training set is: 0.8792
F1 score for testing set is: 0.7184

Number of Neighbours is : 4
F1 score for training set is: 0.8564
F1 score for testing set is: 0.7143

Number of Neighbours is : 5
F1 score for training set is: 0.8353
F1 score for testing set is: 0.7200

Number of Neighbours is : 6
F1 score for training set is: 0.8050
F1 score for testing set is: 0.7010

Number of Neighbours is : 7
F1 score for training set is: 0.8095
F1 score for testing set is: 0.7200

Number of Neighbours is : 8
F1 score for training set is: 0.7971
F1 score for testing set is: 0.7143

Number of Neighbours is : 9
F1 score for training set is: 0.8141
F1 score for testing set is: 0.7327

Number of Neighbours is : 10
F1 score for training set is: 0.8019
F1 score for testing set is: 0.7010
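
The loop above compares values of `n_neighbors` by their test-set F1 score, which can make the score of the chosen model look better than it would on new data. A minimal sketch of selecting k by cross-validation on the training rows only, using `GridSearchCV`; the synthetic data from `make_classification` is an assumption standing in for the real features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X and y (an assumption, not the diabetes data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)

# Try k = 3..10 with 5-fold cross-validation, scored by F1.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(3, 11))},
    scoring="f1",
    cv=5,
)
search.fit(X_tr, y_tr)

print("best k:", search.best_params_["n_neighbors"])
print("cross-validated F1: {:.4f}".format(search.best_score_))
# The test set is touched once, after k has been fixed.
print("held-out F1: {:.4f}".format(search.score(X_te, y_te)))
```

With this arrangement the test set plays no part in choosing k, so the final held-out score remains an independent estimate.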



In [0]:
# find accuracy_score 
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=9)
clf_ = knn.fit(X_train, y_train)
y_pred = clf_.predict(X_test)
print('Accuracy is {}'.format(accuracy_score(y_test,y_pred )))

Accuracy is 0.8246753246753247
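
To put an accuracy figure in context, it helps to compare it with a classifier that always predicts the majority class. A minimal sketch using scikit-learn's `DummyClassifier` on labels with the dataset's 500/268 class split; `X_demo` and `y_demo` are hypothetical stand-ins, and the features are irrelevant to this baseline:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Labels with the dataset's class counts; a constant feature column as a placeholder.
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.zeros((len(y_demo), 1))

# Always predict the most frequent class (0, non-diabetic).
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

base_acc = accuracy_score(y_demo, y_base)
base_f1 = f1_score(y_demo, y_base, zero_division=0)

print("baseline accuracy: {:.4f}".format(base_acc))
print("baseline F1: {:.4f}".format(base_f1))
```

The baseline reaches roughly 65% accuracy without learning anything, while its F1 for the positive class is zero, which is one reason the notebook also tracks F1.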


### **Standardization**



In [0]:
from sklearn.preprocessing import StandardScaler
scaling=StandardScaler()

standardized_X = scaling.fit_transform(X)
X_train_sn, X_test_sn, y_train_sn, y_test_sn = train_test_split(standardized_X, y, test_size=.2, random_state=100)
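
In the cell above, `fit_transform` is called on all of `X` before the split, so the test rows contribute to the scaler's mean and standard deviation. A minimal sketch of an alternative that fits the scaler on the training rows only, by wrapping it with the classifier in `make_pipeline`; the synthetic data is an assumption standing in for `X` and `y`:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (an assumption, not the diabetes data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)

# fit() learns the scaling statistics from X_tr only;
# predict() reuses those statistics on X_te.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)

scaler = model.named_steps["standardscaler"]
held_out_f1 = f1_score(y_te, model.predict(X_te))

print("train-only feature means:", scaler.mean_.round(3))
print("held-out F1: {:.4f}".format(held_out_f1))
```

A pipeline also works inside cross-validation, where the scaler is refitted on each training fold.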

In [0]:
for clf in algorithms:
    print("{}:".format(clf))
    train_predict(clf, X_train_sn, y_train_sn, X_test_sn, y_test_sn)

GaussianNB(priors=None, var_smoothing=1e-09):
F1 score for training set is: 0.6954
F1 score for testing set is: 0.5490

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False):
F1 score for training set is: 0.8538
F1 score for testing set is: 0.6538

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform'):
F1 score for training set is: 0.8479
F1 score for testing set is: 0.5962

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,


In [0]:
# Optimizing the KNN model on standardized features

for n in range(3,11):
    knn = KNeighborsClassifier(n_neighbors=n)
    print("Number of Neighbours is : {}".format(n))
    train_predict(knn, X_train_sn, y_train_sn, X_test_sn, y_test_sn)

Number of Neighbours is : 3
F1 score for training set is: 0.8644
F1 score for testing set is: 0.6226

Number of Neighbours is : 4
F1 score for training set is: 0.8082
F1 score for testing set is: 0.5714

Number of Neighbours is : 5
F1 score for training set is: 0.8479
F1 score for testing set is: 0.5962

Number of Neighbours is : 6
F1 score for training set is: 0.8267
F1 score for testing set is: 0.6186

Number of Neighbours is : 7
F1 score for training set is: 0.8333
F1 score for testing set is: 0.6275

Number of Neighbours is : 8
F1 score for training set is: 0.8184
F1 score for testing set is: 0.6263

Number of Neighbours is : 9
F1 score for training set is: 0.8252
F1 score for testing set is: 0.6337

Number of Neighbours is : 10
F1 score for training set is: 0.8078
F1 score for testing set is: 0.6327





---

References: This notebook was created by learning from the following notebooks:

*   [Diabetic patients are known as Ex- Foodie](https://www.kaggle.com/pratirup/diabetic-patients-are-known-as-ex-foodie)
*   [Diabete classification](https://www.kaggle.com/bloobeey/diabete-classification)
*   [Jason Li - Diabetes Prediction](https://www.kaggle.com/dbsnail/diabetes-prediction-over-0-86-accuracy)
*   [Mohamed L - Pima Indians Diabetes Log Reg and SVC](https://www.kaggle.com/momo062/79-47-pima-indians-diabetes-log-reg-and-svc)