## The Data

This dataset comes from the National Institute of Diabetes and Digestive and Kidney Diseases. Its purpose is to predict whether a patient has diabetes from the diagnostic measurements it contains. The instances were selected from a larger database under several constraints; in particular, all patients here are females at least 21 years old of Pima Indian heritage.

Source: https://www.kaggle.com/uciml/pima-indians-diabetes-database/version/1#


In [None]:
# read the data into a Pandas DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('pima_indians_diabetes.csv')
df.head()


In [None]:
# define X and y
X = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']]

y = df['Outcome']

In [None]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
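
To make the effect of `test_size=0.20` concrete, here is a minimal sketch on a small synthetic array (hypothetical toy data, not the diabetes dataset). It also passes `stratify`, an optional argument the split above does not use, which keeps the class ratio similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, 35% positive labels
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 65 + [1] * 35)

# Hold out 20% of the rows; stratify keeps the share of positives roughly equal
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1, stratify=y_toy
)

print("train shape:", X_tr.shape, "test shape:", X_te.shape)
print("share of positives (train, test):", y_tr.mean(), y_te.mean())
```

Whether stratification matters here depends on how imbalanced `Outcome` is, so treat it as an option to try rather than a required change.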

In [None]:
from sklearn.ensemble import RandomForestClassifier

# instantiate model
model = RandomForestClassifier(random_state=1, max_depth=10)

# fit model
model.fit(X_train, y_train)

In [None]:
# make predictions for the testing set
y_pred = model.predict(X_test)
y_pred

**Let's check our model's accuracy**

In [None]:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))
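
As a small illustration of what `accuracy_score` measures, the sketch below uses made-up label arrays (not the notebook's `y_test` and `y_pred`) to show that accuracy is the fraction of predictions that match the true labels. It also prints a confusion matrix, which separates the errors into false positives and false negatives.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions for eight patients
y_true_toy = np.array([1, 0, 0, 1, 1, 0, 0, 0])
y_pred_toy = np.array([1, 0, 1, 0, 1, 0, 0, 0])

# Accuracy is the share of positions where prediction and truth agree
acc = metrics.accuracy_score(y_true_toy, y_pred_toy)
manual_acc = (y_true_toy == y_pred_toy).mean()

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = metrics.confusion_matrix(y_true_toy, y_pred_toy)

print("accuracy_score:", acc)
print("manual accuracy:", manual_acc)
print(cm)
```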

**Let's see how well our model will hold up with k-fold cross validation**

In [None]:
from sklearn.model_selection import KFold # import KFold
from sklearn.model_selection import cross_val_score, cross_val_predict

In [None]:
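# Refit the model on the whole dataset (optional: cross_val_score below
# clones and refits the estimator on each fold, so this fit does not affect the scores)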
model.fit(X, y)

In [None]:
# Perform 10-fold cross validation
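# random_state only has an effect when shuffle=True; recent scikit-learn
# versions raise an error if it is set together with shuffle=False
kf = KFold(n_splits=10, shuffle=False)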
scores = cross_val_score(model, X, y, cv=kf)
print('Cross-validated scores:', scores)
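
To show what the splitter hands to `cross_val_score`, here is a minimal sketch on a ten-row toy array (hypothetical data, with 5 folds instead of 10 to keep the output short). Without shuffling, each fold's test set is a contiguous block of rows, and every row is used for testing exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 rows, 2 columns
X_toy = np.arange(20).reshape(10, 2)

kf_toy = KFold(n_splits=5, shuffle=False)
folds = list(kf_toy.split(X_toy))

# Each fold is a pair of index arrays: rows to train on and rows to test on
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Because unshuffled folds follow the file order, any ordering in the CSV carries over into the folds; `shuffle=True` with a fixed `random_state` is the usual alternative.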


In [None]:
print(scores.mean())

In [None]:
# Make cross validated predictions
pred = cross_val_predict(model, X, y, cv=kf)
pred

**Now, let's try Leave-One-Out (LOO) cross-validation**

In [None]:
from sklearn.model_selection import LeaveOneOut 
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
print('Cross-validated scores:', scores)


In [None]:
print(scores.mean())
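
Leave-One-Out can be read as k-fold with one fold per sample. The sketch below checks this on a six-row toy array (hypothetical data). Since each test set holds a single row, every per-fold accuracy is either 0 or 1, and the mean printed above is simply the fraction of rows predicted correctly.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Hypothetical toy data: 6 rows, 2 columns
X_toy = np.arange(12).reshape(6, 2)

# Test indices produced by Leave-One-Out
loo_splits = [test.tolist() for _, test in LeaveOneOut().split(X_toy)]

# Test indices produced by unshuffled k-fold with as many folds as rows
kfold_splits = [test.tolist() for _, test in KFold(n_splits=len(X_toy)).split(X_toy)]

print(loo_splits)
print(loo_splits == kfold_splits)
```

The cost is one model fit per row, which is why the LOO cells take noticeably longer than the 10-fold ones.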

In [None]:
# Make cross validated predictions using Leave-One-Out
pred = cross_val_predict(model, X, y, cv=loo)
pred

**Let's build our final model on the entire dataset**

In [None]:
# instantiate model
model = RandomForestClassifier(random_state=1, max_depth=10)

# fit model
model.fit(X, y)

**Make predictions based on known inputs:**

* Predictors: 
    * Pregnancies: 6
    * Glucose: 148
    * BloodPressure: 72
    * SkinThickness: 35
    * Insulin: 0
    * BMI: 33.6
    * DiabetesPedigreeFunction: 0.627
    * Age: 50
* Response:
    * Outcome: 1

In [None]:
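# known inputs: the predictor values listed above, passed with the
# training column names so scikit-learn can match features by name
inputs = pd.DataFrame([[6, 148, 72, 35, 0, 33.6, 0.627, 50]], columns=X.columns)
outcome = model.predict(inputs)
print("Predicted outcome based on known data: ", outcome)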


In [None]:
# new inputs: a hypothetical patient, not a row taken from the dataset
inputs = pd.DataFrame([[3, 100, 120, 15, 1, 22, 0.5, 30]], columns=X.columns)
outcome = model.predict(inputs)
print("Predicted outcome based on new data: ", outcome)
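
A hard 0/1 label hides how confident the forest is. As a hedged sketch, the code below trains a forest with the same settings on synthetic data from `make_classification` (not the diabetes data) and compares `predict` with `predict_proba`, which returns one probability per class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data with 8 features, standing in for the real predictors
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=1)

clf = RandomForestClassifier(random_state=1, max_depth=10)
clf.fit(X_toy, y_toy)

# One row per sample, one column per class (ordered as in clf.classes_)
proba = clf.predict_proba(X_toy[:3])
labels = clf.predict(X_toy[:3])

print(proba)
print(labels)
```

The predicted label is the class with the highest probability, so calling `model.predict_proba(inputs)` on the notebook's model would show how close each prediction above is to the decision boundary.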
