This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Its objective is to predict whether or not a patient has diabetes, based on diagnostic measurements included in the dataset.
The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

In [2]:
import pandas as pd
df = pd.read_csv("C:\\Users\\SAKETH\\Desktop\\diabetes.csv")
print(df.shape)
# print head of data set
print(df.head())

(768, 9)
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


In [3]:
X = df.drop('Outcome', axis=1)
y = df['Outcome']

In [4]:
from sklearn.model_selection import train_test_split
# implementing train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)

In [5]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train,y_train)
# predictions
rfc_predict = rfc.predict(X_test)



In [6]:
#Evaluating Performance
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report,confusion_matrix

In [7]:
rfc_cv_score = cross_val_score(rfc, X, y, cv=10, scoring='roc_auc')

In [8]:
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict))
print('\n')
print("=== All AUC Scores ===")
print(rfc_cv_score)
print('\n')
print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest: ", rfc_cv_score.mean())

=== Confusion Matrix ===
[[154  22]
 [ 45  33]]


=== Classification Report ===
             precision    recall  f1-score   support

          0       0.77      0.88      0.82       176
          1       0.60      0.42      0.50        78

avg / total       0.72      0.74      0.72       254



=== All AUC Scores ===
[0.79296296 0.79222222 0.77851852 0.71222222 0.72814815 0.74074074
 0.83333333 0.85814815 0.74423077 0.81230769]


=== Mean AUC Score ===
Mean AUC Score - Random Forest:  0.7792834757834758


An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

True Positive Rate   - TPR = TP / (TP + FN)
False Positive Rate  - FPR = FP / (FP + TN)
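
As a quick check of these definitions, the sketch below computes both rates from the first confusion matrix printed above. It assumes scikit-learn's layout, in which rows are true classes and columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]. The variable names are illustrative only.

```python
# Confusion matrix of the untuned model, copied from the output above.
# Assumed layout (scikit-learn convention): rows = true class, columns = predicted class.
cm = [[154, 22],
      [45, 33]]

tn, fp = cm[0]
fn, tp = cm[1]

tpr = tp / (tp + fn)  # true positive rate (recall for class 1)
fpr = fp / (fp + tn)  # false positive rate

print(f"TPR = {tpr:.3f}")
print(f"FPR = {fpr:.3f}")
```

The TPR computed this way should agree with the recall reported for class 1 in the classification report (0.42).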

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the ROC curve (think integral calculus) from (0,0) to (1,1). AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
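
One way to build intuition for these extremes: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below uses that equivalence on a tiny made-up example. The helper `auc_by_ranking` is hypothetical and is not how scikit-learn computes the score; the labels and scores are invented for illustration.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]  # made-up labels
perfect = auc_by_ranking(labels, [0.1, 0.2, 0.8, 0.9])    # every positive outranks every negative
inverted = auc_by_ranking(labels, [0.9, 0.8, 0.2, 0.1])   # every positive is outranked
mixed = auc_by_ranking(labels, [0.1, 0.4, 0.35, 0.8])     # one of four pairs is misordered

print(perfect, inverted, mixed)
```

Note that AUC is computed from predicted scores (for example, class probabilities) rather than from hard 0/1 predictions, which is why `cross_val_score` with `scoring='roc_auc'` is used in the cells above.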

In [9]:
# Next, we tune the hyperparameters (parameters whose values are set before the learning process begins)
# to try to improve the performance of the model
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
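# number of features considered at every split
# note: 'auto' behaves like 'sqrt' for classifiers in older scikit-learn releases and was removed in 1.3
max_features = ['auto', 'sqrt']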

# max depth
max_depth = [int(x) for x in np.linspace(100, 500, num = 11)]
max_depth.append(None)
# create random grid
random_grid = {'n_estimators': n_estimators,'max_features': max_features,'max_depth': max_depth}
# Random search of parameters
rfc_random = RandomizedSearchCV(estimator = rfc, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the model
rfc_random.fit(X_train, y_train)
# print results
print(rfc_random.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   35.8s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  4.4min finished


{'n_estimators': 200, 'max_features': 'sqrt', 'max_depth': 140}
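
To get a feel for how much of the search space the random search covers, the sketch below rebuilds the same candidate values in plain Python, counts every possible combination, and draws 100 of them at random. This is only an illustration of the idea: the seed used here does not reproduce the combinations scikit-learn actually tried, and no model is fitted.

```python
import itertools
import random

# Same candidate values as the grid above, written without NumPy.
n_estimators = [200 + 200 * i for i in range(10)]       # 200, 400, ..., 2000
max_features = ['auto', 'sqrt']
max_depth = [100 + 40 * i for i in range(11)] + [None]  # 100, 140, ..., 500, None

# Every combination an exhaustive grid search would have to fit.
all_combinations = list(itertools.product(n_estimators, max_features, max_depth))

# Draw 100 distinct combinations, mimicking n_iter=100 (illustrative seed only).
rng = random.Random(42)
sampled = rng.sample(all_combinations, 100)

print("total combinations:", len(all_combinations))
print("sampled combinations:", len(sampled))
```

With 3-fold cross-validation, the 100 sampled candidates account for the 300 fits reported in the log, compared with the 720 fits an exhaustive search over the same grid would need.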


In [10]:
# The search returned: n_estimators = 200; max_features = 'sqrt'; max_depth = 140.
# Now we can plug these back into the model to see whether performance improves
rfc = RandomForestClassifier(n_estimators=200, max_depth=140, max_features='sqrt')
rfc.fit(X_train,y_train)
rfc_predict = rfc.predict(X_test)
rfc_cv_score = cross_val_score(rfc, X, y, cv=10, scoring='roc_auc')
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict))
print('\n')
print("=== All AUC Scores ===")
print(rfc_cv_score)
print('\n')
print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest: ", rfc_cv_score.mean())

=== Confusion Matrix ===
[[150  26]
 [ 34  44]]


=== Classification Report ===
             precision    recall  f1-score   support

          0       0.82      0.85      0.83       176
          1       0.63      0.56      0.59        78

avg / total       0.76      0.76      0.76       254



=== All AUC Scores ===
[0.78259259 0.83592593 0.82259259 0.74962963 0.80518519 0.86592593
 0.8562963  0.9062963  0.81038462 0.85192308]


=== Mean AUC Score ===
Mean AUC Score - Random Forest:  0.8286752136752137


Our mean ROC AUC score improved from about 0.78 to 0.83.
The downside is that the number of false positives increased slightly (from 22 to 26), although false negatives declined (from 45 to 34).
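
The trade-off can be read directly off the two confusion matrices printed above. The sketch below, which assumes scikit-learn's [[TN, FP], [FN, TP]] layout and uses a hypothetical helper `summarize`, compares the error counts and the precision and recall for the positive class before and after tuning.

```python
# Confusion matrices copied from the outputs above.
# Assumed layout (scikit-learn convention): [[TN, FP], [FN, TP]].
before = [[154, 22], [45, 33]]
after = [[150, 26], [34, 44]]


def summarize(cm):
    """Return error counts plus precision and recall for the positive class."""
    (tn, fp), (fn, tp) = cm
    return {
        "fp": fp,
        "fn": fn,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


b = summarize(before)
a = summarize(after)

print("false positives:", b["fp"], "->", a["fp"])
print("false negatives:", b["fn"], "->", a["fn"])
print(f"precision (class 1): {b['precision']:.2f} -> {a['precision']:.2f}")
print(f"recall (class 1):    {b['recall']:.2f} -> {a['recall']:.2f}")
```

For a screening task such as this one, trading a few extra false positives for noticeably fewer false negatives is often acceptable, though that judgment depends on the application.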

A random forest classifier is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. By default (bootstrap=True), each sub-sample has the same size as the original input sample but is drawn with replacement. Each tree uses a splitting criterion to measure the quality of a split. For classification, the supported criteria include “gini” for Gini impurity (the default) and “entropy” for information gain; “MSE” (mean squared error, which is equivalent to variance reduction) and “MAE” (mean absolute error) are the corresponding criteria for a random forest regressor.
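
The sampling step described above can be sketched in plain Python. This is a minimal illustration of drawing with replacement, not scikit-learn's internal code; the helper `bootstrap_sample`, the toy row indices, and the seed are all assumptions made for the example.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)   # illustrative seed
rows = list(range(10))   # stand-in for the indices of 10 training rows

# One bootstrap sample per tree; here, three hypothetical trees.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

for i, s in enumerate(samples):
    print(f"tree {i}: {sorted(s)} (unique rows: {len(set(s))})")
```

Because rows are drawn with replacement, each sample has the same length as the original data but typically repeats some rows and omits others, which is what makes the individual trees differ from one another.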