> Welcome to the project! You will find tips in quoted sections like this to help organize your approach to your investigation.

# Project: Investigate a Pima Indians Diabetes Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#pre">Prediction analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> Diabetes is one of the deadliest diseases in the world. It is not only a disease in its own right but also a cause of other conditions, such as heart attack and blindness. Under the usual diagnostic process, patients must visit a diagnostic center, consult their doctor, and wait a day or more for their reports. The objective of this project is therefore to predict whether a patient has diabetes based on diagnostic measurements. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
>
> The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

> About Dataset:

>> Pregnancies: Number of times pregnant

>> Glucose: Plasma glucose concentration (mg/dl)

>> Blood Pressure: Diastolic blood pressure (mm Hg)

>> Skin Thickness: A value used to estimate body fat. Normal triceps skinfold thickness in women is 23 mm. Higher thickness is associated with obesity and an increased chance of diabetes.

>> Insulin: 2-hour serum insulin (mu U/ml)

>> BMI: Body mass index (weight in kg / (height in m)²)

>> Diabetes Pedigree Function: Provides information about the history of diabetes in relatives and the genetic relationship of those relatives to the patient. A higher pedigree function means the patient is more likely to have diabetes.

>> Age: Age (years)

>> Outcome: Class variable (0 or 1), where ‘0’ denotes a patient without diabetes and ‘1’ denotes a patient with diabetes.

In [1]:
# Use this cell to set up import statements for all of the packages that you plan to use.
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn import tree




import warnings
warnings.filterwarnings("ignore")

In [2]:
sns.set(rc={'figure.figsize':(12,10)})


<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

### General Properties

In [None]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.
file_name="diabetes.csv"
df=pd.read_csv(file_name)
print(df.info())

In [None]:
#Print a few samples from head
display(df.head(10))

In [None]:
#Check if there are any null values
display(df.isnull().sum())

In [None]:
#Check if there are any symbols like ? or strings in the dataframe
df.applymap(np.isreal).value_counts()

In [None]:
#Basic statistics for each column
display(df.describe())

The table above shows implausible values for Glucose, BloodPressure, SkinThickness and BMI: a value of 0 is not physiologically possible for these measurements, which suggests that zeros stand in for missing data.

In [None]:
#print some value counts for discrete columns
print(df['Pregnancies'].value_counts())  

In [None]:
#Check class imbalance
ar = df['Outcome'].value_counts()
print(ar)
#Multiply by 100 so that the printed numbers really are percentages
print("Class Doesn't have diabetes :", 100 * ar[0] / ar.sum(), '%')
print('Class Has diabetes :', 100 * ar[1] / ar.sum(), '%')

In [None]:
#sort_index() orders the counts as class 0, class 1 so that they match the labels
plt.pie(df['Outcome'].value_counts().sort_index(),labels=["Doesn't have diabetes","Has diabetes"],autopct='%1.1f%%' )

In [None]:
#Plot a histogram for each of the eight predictor columns (the 2x4 grid leaves out Outcome)
f,a = plt.subplots(2,4)
for ax,col in zip(a.ravel(), df.columns[:-1]):
  ax.hist(df[col])
  ax.set_title(col)
plt.tight_layout()

In [None]:
#Plot The Disease by each age group
((df['Age'] [(df['Outcome']==0)] )).plot.hist(bins=20,alpha=0.3)
((df['Age'] [(df['Outcome']==1)] )).plot.hist(bins=20)

In [None]:
#Quartile Plot for Age
ax = sns.boxplot(y=df['Age'],x=df['Outcome'])

In [None]:
#Plot The Disease by each number of pregnancies
((df['Pregnancies'] [(df['Outcome']==0)] )).plot.hist(bins=20,alpha=0.3)
((df['Pregnancies'] [(df['Outcome']==1)] )).plot.hist(bins=20)

> **Tip**: Make sure that you keep your reader informed on the steps that you are taking in your investigation. Follow every code cell, or every set of related code cells, with a markdown cell to describe to the reader what was found in the preceding cell(s). Try to make it so that the reader can then understand what they will be seeing in the following cell(s).

### Data Cleaning by replacing the missing values using interpolation method

In [None]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.
# BMI, glucose,SkinThickness, Bloodpressure

#check where the zero values lie
print((df==0).sum())


In [None]:

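#Treat zeros in these columns as missing values and fill them by linear interpolation
for col in ['BMI','Glucose', 'SkinThickness','BloodPressure']:
  print(col)
  af=(df[col]==0)
  #replace() avoids the chained assignment df[col][af]=np.nan, which may not modify df
  df[col]=df[col].replace(0, np.nan)
  print(df.loc[af, col])
  df[col]=df[col].interpolate()
  print(df.loc[af, col])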
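
To make the interpolation step concrete, here is a minimal sketch on a toy series. The values are made up for illustration and are not taken from the dataset. By default `Series.interpolate()` is linear, so each gap is filled from the nearest valid values before and after it in row order.

```python
import numpy as np
import pandas as pd

# Toy column in which 0 marks a missing measurement (hypothetical values)
toy = pd.Series([70, 0, 74, 0, 0, 80])

# Replace the zeros with NaN, then fill each gap linearly between its neighbors
filled = toy.replace(0, np.nan).interpolate()
print(filled.tolist())
```

Since the rows of a patient table have no natural order, the neighbors used for filling are arbitrary. Filling with the column median is a common alternative worth comparing.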


<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. Compute statistics and create visualizations with the goal of addressing the research questions that you posed in the Introduction section. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables.

### Research Question 1 : Data Correlation

Let's begin by looking at the correlations between the columns.

In [None]:
display(df.corr())

In [None]:
#Visualized
ax = sns.heatmap(df.corr(),annot=True)
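
The correlation matrix can also be reduced to a ranking of the predictors by how strongly they correlate with the target. The sketch below uses a small synthetic frame. The column names `signal` and `noise` are hypothetical; only `Outcome` mirrors the dataset.

```python
import pandas as pd

# Synthetic frame: "signal" rises with the outcome, "noise" does not
toy = pd.DataFrame({
    "signal": [1, 2, 3, 4, 5, 6],
    "noise": [3, 1, 2, 3, 1, 2],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Take the Outcome column of the correlation matrix, drop the self-correlation,
# and sort by absolute value so that strong negative correlations rank high too
ranking = toy.corr()["Outcome"].drop("Outcome").abs().sort_values(ascending=False)
print(ranking)
```

A high correlation with Outcome only indicates a linear association. It does not show that the feature causes diabetes.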

### Research Question 2 : Class Imbalance

How do we handle the class imbalance and keep the model from overfitting to the majority class and predicting that class exclusively?

There are two common methods: oversampling the minority class and undersampling the majority class. We will try oversampling.

In [None]:
from sklearn.utils import resample


# Separate majority and minority classes
df_majority = df[df.Outcome==0]
df_minority = df[df.Outcome==1]
 
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,                # sample with replacement
                                 n_samples=len(df_majority),  # match the majority class size
                                 random_state=1)              # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
#print((df_upsampled.iloc[:,-1].values.shape) )
df_upsampled.Outcome.value_counts()
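
One caveat with oversampling: if the duplicated rows are created before the train/test split, copies of the same patient can land in both parts and inflate the test scores. The sketch below shows one way around this, assuming a synthetic stand-in frame (`toy` and its column `x` are hypothetical): split first, then upsample the minority class inside the training part only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the dataset: 14 negatives and 6 positives
toy = pd.DataFrame({"x": range(20), "Outcome": [0] * 14 + [1] * 6})

# Split first, keeping the class ratio similar in both parts
train, test = train_test_split(toy, test_size=0.25, random_state=1,
                               stratify=toy["Outcome"])

# Upsample the minority class inside the training part only
majority = train[train["Outcome"] == 0]
minority = train[train["Outcome"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=1)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Outcome"].value_counts())
print(test["Outcome"].value_counts())
```

The test part keeps its original class ratio, so the reported metrics reflect the imbalance the model would face on new data.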

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(df_upsampled.iloc[:,:-1], df_upsampled.iloc[:,-1].values.reshape(-1,1), test_size=0.2, random_state=1,shuffle= True)
print(Y_test .shape, X_test.shape)
print(Y_train .shape, X_train.shape)

print(Y_train.sum() /len(X_train ))
print(Y_test.sum()/len(X_test ))


In [None]:

#Linear regression used as a classifier: predictions above 0.5 count as class 1
model = LinearRegression()
model.fit(X_train, Y_train )
print(model.score(X_test, Y_test))   #R^2 of the regression, not accuracy
predictions=1*(model.predict(X_test)>0.5)
accuracy=(predictions==Y_test).mean()*100   #named accuracy to avoid shadowing the built-in eval
print(accuracy, "%")
print(classification_report(Y_test,predictions))

print(confusion_matrix(Y_test,predictions))

**Please find Research Question 3 after the classifiers.**

> **Note: if you have more questions and insights don't hesitate to do it**

<a id='pre'></a>
## Build a Prediction Model


In [None]:
# Make a feature scaling
#Normalize Values
normalized_df=(df-df.min())/(df.max()-df.min())
#print(normalized_df)
display(normalized_df.head())
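
The normalization above uses the minimum and maximum of the whole table, so the test rows influence the scaling. A minimal sketch of an alternative, assuming synthetic data and scikit-learn's `MinMaxScaler`: fit the scaler on the training part and reuse its statistics for the test part.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic feature matrix standing in for the predictors
rng = np.random.default_rng(1)
X = rng.normal(loc=50, scale=10, size=(100, 3))
X_tr, X_te = train_test_split(X, test_size=0.2, random_state=1)

# Learn min and max from the training rows only, then apply them to both parts
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this approach, test values can fall slightly outside [0, 1]. That is expected and harmless for most models.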

In [None]:
# Split the data into train and test data (roughly 80/20)
rng = np.random.default_rng(1)   # seeded so that the split is reproducible
normalized_df= normalized_df.sample(frac = 1,random_state=1)
msk = rng.random(len(normalized_df)) < 0.8
train = normalized_df[msk]
test = normalized_df[~msk]

X_train=train.iloc[:, 0:-1].values
Y_train=train.iloc[:, -1].values.reshape(-1,1)
X_test=test.iloc[:, 0:-1].values
Y_test=test.iloc[:,-1].values.reshape(-1,1)

#print(X_train.shape,Y_train.shape, X_test.shape, Y_test.shape)
print(Y_test .shape, X_test.shape)
print(Y_train .shape, X_train.shape)

print(Y_train.sum() /len(X_train ))
print(Y_test.sum()/len(X_test ))

#print(test.info(), train.info())

## Compare the performance (Confusion matrix and classification report) of different classifiers (LR, KNN, SVM, DT and RF)

> **Note: use grid search with a suitable range of values to adjust the hyperparameters of DT and SVM and for loop to adjust the k value of KNN**
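
The tuning described in the note could be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the notebook's actual tuning run. The parameter grids and the range of k are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 8 features, like the predictors in this dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Grid search for the decision tree (illustrative grid)
dt_grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]}
dt_search = GridSearchCV(DecisionTreeClassifier(random_state=1), dt_grid, cv=5)
dt_search.fit(X_tr, y_tr)

# Grid search for the SVM (illustrative grid)
svm_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
svm_search = GridSearchCV(SVC(), svm_grid, cv=5)
svm_search.fit(X_tr, y_tr)

# Plain for loop over odd k values for KNN, scored by 5-fold cross-validation
knn_scores = {}
for k in range(1, 16, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn_scores[k] = cross_val_score(knn, X_tr, y_tr, cv=5).mean()
best_k = max(knn_scores, key=knn_scores.get)

print("DT :", dt_search.best_params_, dt_search.score(X_te, y_te))
print("SVM:", svm_search.best_params_, svm_search.score(X_te, y_te))
print("KNN: best k =", best_k)
```

All tuning uses the training part only. The held-out test part is scored once, after the best settings are chosen.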

In [None]:
#Linear regression used as a classifier: predictions above 0.5 count as class 1
model = LinearRegression()
model.fit(X_train, Y_train )
print(model.score(X_test, Y_test))   #R^2 of the regression, not accuracy
predictions=1*(model.predict(X_test)>0.5)
accuracy=(predictions==Y_test).mean()*100   #named accuracy to avoid shadowing the built-in eval
print('Accuracy : ',accuracy, "%")
print('classification_report\n',classification_report(Y_test,predictions))

print('confusion_matrix\n',confusion_matrix(Y_test,predictions))
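
The cell above thresholds the output of a linear regression at 0.5. If "LR" in the heading is read as logistic regression, a sketch along the following lines may be closer to the intent. It assumes synthetic data, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 8 features
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Logistic regression models the class probability directly
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_tr, y_tr)

proba = log_reg.predict_proba(X_te)[:, 1]   # estimated probability of class 1
pred = log_reg.predict(X_te)                # hard 0/1 class labels

print("Accuracy :", accuracy_score(y_te, pred) * 100, "%")
print(confusion_matrix(y_te, pred))
```

Unlike the regression output, the predicted probabilities always lie between 0 and 1.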

In [None]:
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, Y_train.ravel())   #ravel(): scikit-learn expects a 1-D label array

predict=neigh.predict(X_test).reshape(-1,1)

print(predict.shape, Y_test.shape)
accuracy=(predict==Y_test).mean()*100
print('Accuracy : ',accuracy, "%")
print('classification_report\n',classification_report(Y_test,predict))
print('confusion_matrix\n',confusion_matrix(Y_test,predict))

In [None]:
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, Y_train.ravel())   #ravel(): scikit-learn expects a 1-D label array
predict=clf.predict(X_test).reshape(-1,1)
print(predict.shape, Y_test.shape)
accuracy=(predict==Y_test).mean()*100
print('Accuracy : ',accuracy, "%")
print('classification_report\n',classification_report(Y_test,predict))
print('confusion_matrix\n',confusion_matrix(Y_test,predict))

In [None]:
clf = tree.DecisionTreeClassifier(random_state=1)   #fixed seed so that the tree is reproducible
clf = clf.fit(X_train, Y_train)
predict=clf.predict(X_test).reshape(-1,1)
accuracy=(predict==Y_test).mean()*100
print('Accuracy : ',accuracy, "%")
print('classification_report\n',classification_report(Y_test,predict))
print('confusion_matrix\n',confusion_matrix(Y_test,predict))

In [None]:
from sklearn.neural_network import MLPClassifier


clf = MLPClassifier( alpha=1e-3,
                    hidden_layer_sizes=(10, 20), random_state=1)
#One fit is enough: refitting with the same random_state restarts training and yields the same model
clf.fit(X_train, Y_train.ravel())
predict=clf.predict(X_test).reshape(-1,1)
accuracy=(predict==Y_test).mean()*100
print('Accuracy : ',accuracy, "%")
print('classification_report\n',classification_report(Y_test,predict))
print('confusion_matrix\n',confusion_matrix(Y_test,predict))



In [None]:

clf = RandomForestClassifier(max_depth=5, random_state=0)
#Fit on the training data; fitting on the test data would inflate the scores below
clf.fit(X_train, Y_train.ravel())
predict=clf.predict(X_test).reshape(-1,1)
accuracy=(predict==Y_test).mean()*100
print('Accuracy : ',accuracy, "%")
print('classification_report\n',classification_report(Y_test,predict))
print('confusion_matrix\n',confusion_matrix(Y_test,predict))
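
A single random split can favor one classifier by chance. As a possible complement to the cells above, the sketch below compares several classifiers with 5-fold cross-validation on synthetic data. The model settings are illustrative and not tuned for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 8 features
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "DT": DecisionTreeClassifier(random_state=1),
    "RF": RandomForestClassifier(max_depth=5, random_state=0),
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
results = {name: cross_val_score(m, X, y, cv=cv).mean() for name, m in models.items()}

for name, score in results.items():
    print(name, round(score * 100, 1), "%")
```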

<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work send it and Congratulations!

## Research Question 3

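Principal component analysis (PCA): project the predictors onto three principal components, then retrain the classifiers on the reduced data.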

In [None]:
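from sklearn.decomposition import PCA
pca = PCA(n_components=3)
df_pca = pca.fit_transform(normalized_df.iloc[:,:-1])
df_pca = pd.DataFrame(df_pca)
#Use .values so that labels are assigned by position. normalized_df was shuffled earlier,
#so assigning the Series directly would align on its index and pair rows with the wrong labels
df_pca['Outcome'] =  normalized_df.iloc[:,-1].values
print(df_pca.head())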
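
Before training on the reduced data, it helps to check how much of the variance the kept components retain. A minimal sketch, assuming synthetic data with 8 features (the real values would come from the `pca` object fitted above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with 8 features, two of them strongly correlated
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

pca_demo = PCA(n_components=3).fit(X)

# Fraction of the total variance captured by each kept component
ratios = pca_demo.explained_variance_ratio_
print(ratios, "total:", ratios.sum())
```

If the total is low, three components discard much of the information in the predictors, and lower accuracy after PCA would not be surprising.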

In [None]:
# Split the data into train and test data (roughly 80/20); seeded so that the split is reproducible
df_pca= df_pca.sample(frac = 1, random_state=1)
msk = np.random.default_rng(1).random(len(df_pca)) < 0.8
train = df_pca[msk]
test = df_pca[~msk]

X_train=train.iloc[:, 0:-1].values
Y_train=train.iloc[:, -1].values.reshape(-1,1)
X_test=test.iloc[:, 0:-1].values
Y_test=test.iloc[:,-1].values.reshape(-1,1)

#print(X_train.shape,Y_train.shape, X_test.shape, Y_test.shape)


In [None]:
#Linear regression used as a classifier: predictions above 0.5 count as class 1
model = LinearRegression()
model.fit(X_train, Y_train )
print(model.score(X_test, Y_test))   #R^2 of the regression, not accuracy
predictions=1*(model.predict(X_test)>0.5)
accuracy=(predictions==Y_test).mean()*100   #named accuracy to avoid shadowing the built-in eval
print('Accuracy : ',accuracy, "%")
print('classification_report\n',classification_report(Y_test,predictions))

print('confusion_matrix\n',confusion_matrix(Y_test,predictions))

In [None]:

clf = RandomForestClassifier(max_depth=5, random_state=0)
#Fit on the training data; fitting on the test data would inflate the scores below
clf.fit(X_train, Y_train.ravel())
predict=clf.predict(X_test).reshape(-1,1)
accuracy=(predict==Y_test).mean()*100
print('Accuracy : ',accuracy, "%")
print('classification_report\n',classification_report(Y_test,predict))
print('confusion_matrix\n',confusion_matrix(Y_test,predict))