**Reference**: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html<BR>
Copyright (c) 2007–2019 The scikit-learn developers.


# PIMA Indians Diabetes

# Disclaimers
## Authorship 

This notebook has been prepared by **Rishabh Pande** and modified to cope with the purpose of this class. Main modifications were inclusion of extra documentation of the functions and chunks of code, inclusions of references to 3rd part resources. If you are interested on the original file, please have a look at:
https://www.kaggle.com/rishpande/pima-indians-diabetes-beginner/data all credits should be given to [**Rishabh Pande**](https://www.kaggle.com/rishpande/pima-indians-diabetes-beginner/data) for his great work. 

## Liability

The material and information contained on this notebook is for general information and educational purposes only. You should not rely upon the material or information on the notebook as a basis for making any business, legal or any other decisions.

I am not be liable for any false, inaccurate, inappropriate or incomplete information presented on the notebook. Any reliance you place on such material is therefore strictly at your own risk.

## Datasets license

The **Deabetes Dataset** used in this notebook was obtained at Kaggle and originally published by [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/support/diabetes) no personal data that could be used to identify the subjects were included on the dataset.

Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.

Sincerely,

**Adriano Barbosa**

## Background

**Diabetes**, is a group of metabolic disorders in which there are high blood sugar levels over a prolonged period.  Symptoms of high blood sugar include frequent urination, increased thirst, and increased hunger.  If left untreated, diabetes can cause many complications.  Acute complications can include diabetic ketoacidosis, hyperosmolar hyperglycemic state, or death.  Serious long-term complications include cardiovascular disease, stroke, chronic kidney disease, foot ulcers, and damage to the eyes.

This **dataset** is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

## Objective

We will try to build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes or not?

## Data


The datasets consists of several medical predictor variables and one target variable, Outcome. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

* **Pregnancies**: Number of times pregnant
* **Glucose**: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
* **BloodPressure**: Diastolic blood pressure (mm Hg)
* **SkinThickness**: Triceps skin fold thickness (mm)
* **Insulin**: 2-Hour serum insulin (mu U/ml)
* **BMI**: Body mass index (weight in kg/(height in m)^2)
* **DiabetesPedigreeFunction**: Diabetes pedigree function
* **Age**: Age (years)
* **Outcome**: Class variable (0 or 1)




In [24]:
#Required before starting the next chunks, this takes a little while until the packages are installed.
#!pip install -U numpy
#!pip install -U pandas
#!pip install -U matplotlib
#!pip install -U sklearn
#!pip install -U seaborn
#!pip install -U plotly
#!pip install -U chart_studio

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from plotly import tools

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

#import plotly.plotly as py
from chart_studio import plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
from IPython.display import HTML, Image

df = pd.read_csv('./diabetes.csv')

In [3]:
df.head(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


# Predictive Modeling

In [4]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

X = df.iloc[:, :-1]
y = df.iloc[:, -1]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [17]:
X.head(5)
y.head(5)

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

## Scaling

In [12]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

## Logistic Regression

In [36]:
#Model
model = LogisticRegression()

#Scaled values
#fiting the model
model.fit(X_train_scaled, y_train)
#prediction
y_pred = model.predict(X_test_scaled)
#Accuracy
print("Scaled:\nAccuracy on training set: {:.3f}".format(model.score(X_train_scaled, y_train)*100))
print("Accuracy on prediction: ", model.score(X_test_scaled, y_test)*100)

#Not scaled values
#fiting the model
model.fit(X_train, y_train)
#prediction
y_pred = model.predict(X_test)
#Accuracy
print("\nUnscaled:\nAccuracy on training set: {:.3f}".format(model.score(X_train, y_train)*100))
print("Accuracy on prediction:", model.score(X_test, y_test)*100)

Scaled:
Accuracy on training set: 76.562
Accuracy on prediction:  79.16666666666666

Unscaled:
Accuracy on training set: 75.694
Accuracy on prediction: 80.72916666666666








## Decision Tree Classifier

In [35]:
#Model
model = DecisionTreeClassifier()

#Scaled values
#fiting the model
model.fit(X_train_scaled, y_train)
#prediction
y_pred = model.predict(X_test_scaled)
#Accuracy
print("Scaled:\nAccuracy on training set: {:.3f}".format(model.score(X_train_scaled, y_train)*100))
print("Accuracy on prediction: ", model.score(X_test_scaled, y_test)*100)

#Not scaled values
#fiting the model
model.fit(X_train, y_train)
#prediction
y_pred = model.predict(X_test)
#Accuracy
print("\nUnscaled:\nAccuracy on training set: {:.3f}".format(model.score(X_train, y_train)*100))
print("Accuracy on prediction:", model.score(X_test, y_test)*100)



Scaled:
Accuracy on training set: 100.000
Accuracy on prediction:  71.875

Unscaled:
Accuracy on training set: 100.000
Accuracy on prediction: 72.39583333333334


## Gradient Boosting Classifier

In [32]:
#Model
model = GradientBoostingClassifier()

#Scaled values
#fiting the model
model.fit(X_train_scaled, y_train)
#prediction
y_pred = model.predict(X_test_scaled)
#Accuracy
print("Scaled:\nAccuracy on training set: {:.3f}".format(model.score(X_train_scaled, y_train)*100))
print("Accuracy on prediction: ", model.score(X_test_scaled, y_test)*100)

#Not scaled values
#fiting the model
model.fit(X_train, y_train)
#prediction
y_pred = model.predict(X_test)
#Accuracy
print("\nUnscaled:\nAccuracy on training set: {:.3f}".format(model.score(X_train, y_train)*100))
print("Accuracy on prediction:", model.score(X_test, y_test)*100)


Scaled:
Accuracy on training set: 93.229
Accuracy on prediction:  78.64583333333334

Unscaled:
Accuracy on training set: 93.229
Accuracy on prediction: 81.25
