# Diabetes Challenge
#### Machine Learning in Health
Diabetes is a condition that impairs the body's ability to process blood glucose, otherwise known as blood sugar. In the United States, an estimated 30.2 million people over 18 years of age have diagnosed or undiagnosed diabetes. The estimate carries a confidence interval of 27.9 to 32.7 million people.

Without ongoing, careful management, diabetes can lead to a buildup of sugars in the blood, which can increase the risk of dangerous complications, including stroke and heart disease.

Different kinds of diabetes can occur, and managing the condition depends on the type. Not all forms of diabetes stem from a person being overweight or leading an inactive lifestyle. In fact, some are present from childhood.
<html>
    <div style="width:100%">
        <div style="width:150px;height:150px;margin:auto">
            <img src="https://image.flaticon.com/icons/svg/1399/1399301.svg" >
        </div>
    </div>
</html>

##### Challenge

In this problem you are given a diabetes data set with the following columns:
['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
The first eight columns are the features and 'Outcome' is the label. Your task is to predict whether a person has diabetes (binary classification).

##### Tasks
* Plot a bar graph showing the number of examples in each class.
* Classify each person as 1 (diabetic) or 0 (not diabetic) using a K-Nearest Neighbors classifier.
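
The first task can be approached by counting the labels and drawing one bar per class. The sketch below is one possible way to do it; the helper names `class_counts` and `plot_class_counts` and the toy labels are assumptions made for illustration and are not part of the challenge files.

```python
import numpy as np

def class_counts(labels):
    """Return (classes, counts) for a 1-D collection of class labels."""
    classes, counts = np.unique(np.asarray(labels).ravel(), return_counts=True)
    return classes, counts

def plot_class_counts(labels, ax=None):
    """Draw one bar per class; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt
    classes, counts = class_counts(labels)
    if ax is None:
        _, ax = plt.subplots()
    ax.bar([str(c) for c in classes], counts)
    ax.set_xlabel("Outcome")
    ax.set_ylabel("Number of examples")
    ax.set_title("Examples per class")
    return ax

# Toy labels for illustration only (not the challenge data)
toy_labels = [0, 1, 0, 0, 1, 0]
classes, counts = class_counts(toy_labels)
```

With the training labels loaded as a 1-D array `Y`, calling `plot_class_counts(Y)` would draw the chart for the real data.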

In [1]:
import pandas as pd
import numpy as np

In [4]:
# Load features and labels as DataFrames so the columns keep their names
X=pd.read_csv("Diabetes_XTrain.csv")
Y=pd.read_csv("Diabetes_YTrain.csv")

In [5]:
# Join features and the label column side by side for the correlation check
data=pd.concat([X,Y],axis=1)

In [6]:
data.corr()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1.0 | 0.175781 | 0.159623 | -0.089059 | -0.059903 | 0.043933 | -0.036454 | 0.555994 | 0.213015 |
| Glucose | 0.175781 | 1.0 | 0.151899 | 0.035213 | 0.332527 | 0.213883 | 0.15149 | 0.261131 | 0.473483 |
| BloodPressure | 0.159623 | 0.151899 | 1.0 | 0.235094 | 0.102192 | 0.272952 | 0.031765 | 0.254055 | 0.072045 |
| SkinThickness | -0.089059 | 0.035213 | 0.235094 | 1.0 | 0.456451 | 0.403305 | 0.179001 | -0.115892 | 0.067829 |
| Insulin | -0.059903 | 0.332527 | 0.102192 | 0.456451 | 1.0 | 0.183658 | 0.222323 | -0.049814 | 0.141941 |
| BMI | 0.043933 | 0.213883 | 0.272952 | 0.403305 | 0.183658 | 1.0 | 0.143271 | 0.051957 | 0.311717 |
| DiabetesPedigreeFunction | -0.036454 | 0.15149 | 0.031765 | 0.179001 | 0.222323 | 0.143271 | 1.0 | 0.034847 | 0.179672 |
| Age | 0.555994 | 0.261131 | 0.254055 | -0.115892 | -0.049814 | 0.051957 | 0.034847 | 1.0 | 0.204733 |
| Outcome | 0.213015 | 0.473483 | 0.072045 | 0.067829 | 0.141941 | 0.311717 | 0.179672 | 0.204733 | 1.0 |

In [7]:
# Switch to NumPy arrays for the distance computations that follow
X=pd.read_csv("Diabetes_XTrain.csv").values
Y=np.reshape(pd.read_csv("Diabetes_YTrain.csv").values,(-1,))
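
The Outcome row of the `data.corr()` output is the part most relevant to the classification task, since it shows how strongly each feature moves with the label. As a small sketch, the values from that row can be ranked by absolute correlation; the dictionary below is copied from the output above, and the variable names are illustrative.

```python
# Correlation of each feature with Outcome, copied from the data.corr() output
outcome_corr = {
    "Pregnancies": 0.213015,
    "Glucose": 0.473483,
    "BloodPressure": 0.072045,
    "SkinThickness": 0.067829,
    "Insulin": 0.141941,
    "BMI": 0.311717,
    "DiabetesPedigreeFunction": 0.179672,
    "Age": 0.204733,
}

# Strongest linear relationship with the label first
ranked = sorted(outcome_corr.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, r in ranked:
    print(f"{name:<26}{r:.3f}")
```

Glucose and BMI come out on top, while BloodPressure and SkinThickness are close to uncorrelated with Outcome. Correlation only captures linear relationships, so a low value does not by itself mean a feature is useless to a nearest-neighbor classifier.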


In [8]:
X_test=pd.read_csv("Diabetes_XTest.csv").values

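In [10]:
# Standardize with statistics computed on the training set only
mu  = X.mean(axis=0)
std = X.std(axis=0)
X=(X-mu)/(std)

In [11]:
# Apply the same training mean and std to the test set
X_test=(X_test-mu)/(std)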
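
The scaling step matters for K-Nearest Neighbors because Euclidean distance would otherwise be dominated by features with large numeric ranges. The sketch below wraps the same idea in a function and guards against a zero standard deviation; the name `standardize`, the `eps` threshold, and the toy arrays are assumptions for illustration.

```python
import numpy as np

def standardize(train, test, eps=1e-12):
    """Scale both arrays with the training mean and standard deviation."""
    mu = train.mean(axis=0)
    std = train.std(axis=0)
    # A constant column has std 0; divide by 1 instead to avoid NaN values
    std = np.where(std < eps, 1.0, std)
    return (train - mu) / std, (test - mu) / std

# Toy arrays for illustration only; the second column is constant
toy_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
toy_test = np.array([[3.0, 10.0]])
train_z, test_z = standardize(toy_train, toy_test)
```

The test set is scaled with the training statistics so that both sets live in the same coordinate system when distances are computed.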

In [12]:
X

array([[ 0.97457151,  1.52528095,  0.94599501, ...,  0.78036618,
         0.907501  ,  0.59363371],
       [ 1.27524274, -0.31683408,  0.35393439, ..., -0.49918316,
        -0.72639999,  2.11034006],
       [ 0.97457151,  0.85830827,  0.35393439, ...,  0.92800649,
        -0.66698541,  0.8464181 ],
       ...,
       [-0.52878465,  0.06429317, -0.43547977, ..., -0.25311598,
         0.69954996, -0.92307263],
       [ 0.07255781,  1.0806325 ,  0.15658085, ..., -0.06856559,
        -0.42635635,  0.34084932],
       [-0.82945588, -1.01556737,  0.45261116, ...,  0.01755792,
        -0.34614667, -0.33324239]])

In [13]:
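def Knn(X,Y,X_co,k=17):
    """Predict the label of query point X_co by majority vote of its k nearest training points."""
    # Euclidean distance from the query point to every training row
    distance=np.sqrt(np.sum((X-X_co)**2,axis=1))

    # Labels of the k closest training points
    nearest=np.argsort(distance)[:k]
    labels=Y[nearest]

    # Most frequent label wins; np.unique sorts, so ties go to the smaller label
    ans,cnt=np.unique(labels,return_counts=True)
    return ans[np.argmax(cnt)]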
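
To see the majority vote at work, it can help to run the same logic on a handful of points whose answer is obvious. The sketch below uses a hypothetical helper `knn_predict` that mirrors the function above with a smaller default `k`, together with two made-up clusters.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Majority vote among the k training points closest to query."""
    distance = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    nearest = np.argsort(distance)[:k]
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Two toy clusters: label 0 near the origin, label 1 near (5, 5)
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

near_origin = knn_predict(toy_X, toy_y, np.array([0.5, 0.5]))
near_five = knn_predict(toy_X, toy_y, np.array([5.5, 5.5]))
```

Each query takes the label of the cluster it sits in, because all three of its nearest neighbors belong to that cluster.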
    

In [15]:
y_pred=[]
for i in range(X_test.shape[0]):
    z=Knn(X,Y,X_test[i])
    y_pred.append(int(z))
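
The predictions above use a fixed k=17, and the test labels are not available to check that choice. One possible sanity check is to hold out part of the training data and compare accuracy for a few values of k. The sketch below assumes an 80/20 split and uses synthetic, well-separated clusters; the helper names `knn_predict` and `holdout_accuracy` and all data here are illustrative, not results from the challenge data.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k):
    """Majority vote among the k training points closest to query."""
    distance = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    nearest = np.argsort(distance)[:k]
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]

def holdout_accuracy(X, y, k, val_fraction=0.2, seed=0):
    """Accuracy on a random validation split carved out of the training data."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    preds = np.array([knn_predict(X[train_idx], y[train_idx], X[i], k)
                      for i in val_idx])
    return float(np.mean(preds == y[val_idx]))

# Synthetic clusters for illustration only (not the challenge data)
rng = np.random.default_rng(0)
toy_X = np.vstack([rng.normal(0.0, 1.0, (40, 2)),
                   rng.normal(10.0, 1.0, (40, 2))])
toy_y = np.array([0] * 40 + [1] * 40)

# Validation accuracy for a few candidate values of k
scores = {k: holdout_accuracy(toy_X, toy_y, k) for k in (1, 5, 17)}
```

On the real data, passing the standardized `X` and `Y` to `holdout_accuracy` for several values of k would give a rough basis for keeping or changing the default.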

In [16]:
Data=pd.DataFrame(y_pred)
Data.to_csv("ans.csv",index=False,header=["Outcome"])