Adrien Clay

12/15/2020

Springboard Data Science

## Pima Indians Diabetes dataset

- Can we predict whether a patient has diabetes from a set of diagnostic measurements?

Features list:
Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure: Diastolic blood pressure (mm Hg)

SkinThickness: Triceps skin fold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction:  Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)

Age: Age in years

Outcome:

    - 0: Doesn't have Diabetes
    
    - 1: Has diabetes


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
df = pd.read_csv('diabetes.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.describe()

### Although there are no null values, several columns (such as SkinThickness and Insulin) contain zeros. A zero is not physiologically plausible for these measurements, so it most likely marks a missing value.

In [None]:
# Count the zeros in each column where a zero cannot be a real measurement
(df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] == 0).sum()

It looks like there is quite a bit of missing data. We will impute these values rather than drop them, since dropping every affected row would shrink the dataset by more than 300 rows.
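To gauge how many rows a drop would cost, one can count the rows that contain at least one zero in the affected columns. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from `diabetes.csv`); on the real data the same expressions would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "SkinThickness": [35, 0, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1],
})

zero_mask = toy == 0                              # True where a zero placeholder appears
zeros_per_column = zero_mask.sum()                # number of zeros in each column
rows_with_any_zero = zero_mask.any(axis=1).sum()  # rows that dropping would remove

print(zeros_per_column)
print(f"Rows with at least one zero: {rows_with_any_zero} of {len(toy)}")
```

Counting per row rather than per column matters here, because zeros in different columns often fall on the same row.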

In [None]:
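# Treat zeros in these columns as missing. Assigning the result back avoids calling
# replace(..., inplace=True) on a column selection, which newer pandas versions may not apply to df.
zero_as_missing = ['Glucose', 'SkinThickness', 'Insulin', 'BMI']
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)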

In [None]:
df.head()

In [None]:
def plot_scatter(x, y, color, size, xlabel, ylabel, twinylabel, title):
    """Scatter x against y, colored by `color` and sized by `size`, with dashed lines at the means."""
    fig, ax = plt.subplots(figsize=(9, 9))
    sc = ax.scatter(x, y, c=color, cmap='viridis', edgecolors='k', linewidths=.4, s=size * 10)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    # Dashed reference lines at the mean of each axis
    ax.axvline(x.mean(), c='k', linestyle='--', linewidth=3)
    ax.axhline(y.mean(), c='k', linestyle='--', linewidth=3)
    fig.colorbar(sc)
    # The twin axis only carries the label that names the colorbar variable
    ax2 = ax.twinx()
    ax2.set_ylabel(twinylabel, rotation=0)
    ax2.yaxis.set_label_coords(1.1, 1.08)
    fig.tight_layout()


In [None]:
fig, ax = plt.subplots()
sc = ax.scatter(df['Insulin'], df['Glucose'], c=df['BloodPressure'],cmap='viridis', edgecolors='k', linewidths=.4, s=df['Age'] * 10)
ax.set_xlabel("Insulin")
ax.set_ylabel("Glucose Levels")
ax.set_title("Glucose As a Function of Insulin")
ax.axvline(df['Insulin'].mean(), c='k', linestyle='--', linewidth=3)
ax.axhline(df['Glucose'].mean(), c='k', linestyle='--', linewidth=3)
fig.colorbar(sc)
ax2 = ax.twinx()
ax2.set_ylabel('Blood Pressure', rotation=0)
ax2.yaxis.set_label_coords(1.1, 1.08)
fig.tight_layout()
fig.set_figheight(9)
fig.set_figwidth(9)

In [None]:
fig, ax = plt.subplots()
sc = ax.scatter(df['Insulin'], df['BMI'], c=df['BloodPressure'],cmap='viridis', edgecolors='k', linewidths=.4, s=df['Age'] * 10)
ax.set_xlabel("Insulin")
ax.set_ylabel("BMI")
ax.set_title("BMI As a Function of Insulin")
ax.axvline(df['Insulin'].mean(), c='k', linestyle='--', linewidth=3)
ax.axhline(df['BMI'].mean(), c='k', linestyle='--', linewidth=3)
fig.colorbar(sc)
ax2 = ax.twinx()
ax2.set_ylabel('Blood Pressure', rotation=0)
ax2.yaxis.set_label_coords(1.1, 1.08)
fig.tight_layout()
fig.set_figheight(9)
fig.set_figwidth(9)

#### The variables appear to be correlated, even if not strongly, so imputing a single column mean could skew the results. Instead, I will sort the data by Glucose and Insulin and fill each gap from a neighboring row (a forward fill followed by a backfill).
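As a rough illustration of this sort-then-fill idea, the sketch below applies it to a tiny made-up frame (`toy` and its values are assumptions for demonstration only). After sorting, each missing Insulin value is copied from the row with the closest lower Glucose value, and the backfill only handles gaps at the very top of the sorted frame.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing Insulin reading.
toy = pd.DataFrame({
    "Glucose": [150.0, 90.0, 120.0, 95.0, 145.0],
    "Insulin": [np.nan, np.nan, 130.0, 60.0, 200.0],
})

# Sorting places rows with similar Glucose values next to each other.
toy_sorted = toy.sort_values(by=["Glucose", "Insulin"])

# Forward fill copies the previous row's value; the backfill then
# covers any gap at the top that has no previous row to copy from.
toy_filled = toy_sorted.ffill().bfill()
print(toy_filled)
```

Because the fill depends on row order, the result changes if the sort keys change, which is worth keeping in mind when comparing against mean or median imputation.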

In [None]:
df_sorted = df.sort_values(by=['Glucose', 'Insulin'], axis=0)

In [None]:
df_sorted.head()

In [None]:
# fillna(method=...) is deprecated in recent pandas; ffill()/bfill() are the direct equivalents.
df_sorted = df_sorted.ffill().bfill()

In [None]:
df_sorted[df_sorted == 0].count()

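BloodPressure still contains a few zeros because it was not part of the zero-to-NaN replacement above. We replace them with the column median. (Zeros in Pregnancies and Outcome are valid values.)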
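One variation worth considering is to compute the median from the non-zero readings only, so that the placeholder zeros do not pull it down. This is a sketch on made-up numbers (`bp` is a hypothetical series, not the notebook's data) and not necessarily what the next cell does.

```python
import pandas as pd

# Made-up diastolic readings; 0 stands in for a missing measurement.
bp = pd.Series([72, 0, 64, 80, 0, 70], name="BloodPressure")

median_nonzero = bp[bp != 0].median()     # median of the valid readings only
bp_clean = bp.replace(0, median_nonzero)  # swap each placeholder for that median

print(median_nonzero)
print(bp_clean.tolist())
```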

In [None]:
df_sorted['BloodPressure'] = df_sorted.BloodPressure.replace(0, df['BloodPressure'].median())

In [None]:
df_sorted[df_sorted == 0].count()

Let's have a look at how many people in the dataset have diabetes vs those that don't:

In [None]:
fig, ax = plt.subplots()
sns.countplot(x='Outcome', data=df_sorted, palette='viridis', ax=ax)

#### Proportion of people with Diabetes

In [None]:
# Outcome is coded 0/1, so its mean is the share of positive cases
df_sorted['Outcome'].mean() * 100

### In this dataset, about 35% of participants have diabetes
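This class balance also sets a reference point for the models further down: always predicting the majority class would already be right for the remaining share of participants. A minimal sketch with made-up labels (`outcome` is a hypothetical series built to have the same 35% positive rate):

```python
import pandas as pd

# Made-up labels with the same class balance as reported above (7 of 20 positive).
outcome = pd.Series([0] * 13 + [1] * 7, name="Outcome")

class_share = outcome.value_counts(normalize=True)  # fraction of each class
positive_rate = outcome.mean()                      # share of 1s, since labels are 0/1
baseline_accuracy = class_share.max()               # accuracy of always predicting the majority class

print(f"Positive rate: {positive_rate:.0%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.0%}")
```

An accuracy score is easier to interpret when read against this kind of baseline.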

In [None]:
plot_scatter(df_sorted['Insulin'], 
             df_sorted['Glucose'], 
             df_sorted['BloodPressure'], 
             df_sorted['Age'], 
             'Insulin', 
             'Glucose', 
             'Blood Pressure',
            'Glucose as a Function of Insulin (Sized by Age)')

In [None]:
plot_scatter(df_sorted['Insulin'], 
             df_sorted['BMI'], 
             df_sorted['BloodPressure'], 
             df_sorted['Age'], 
             'Insulin', 
             'BMI', 
             'Blood Pressure',
            'BMI as a Function of Insulin (Sized by Age)')

After imputation there are more points on the plots, and they still cluster around the means and follow the same overall shape as before.

### Distribution of Data:

In [None]:
sns.pairplot(df_sorted)

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(data=df_sorted, x='Glucose', hue='Outcome', palette='viridis', kde=True)
plt.title("Distribution of Glucose Levels based on Diabetes")

The figure above makes sense: glucose levels in those with diabetes tend to be higher than in those without, but the two distributions overlap considerably.

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(data=df_sorted, x='BloodPressure', hue='Outcome', palette='viridis', kde=True)
plt.title("Distribution of Blood Pressure based on Diabetes")

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(x='BMI', y='BloodPressure', data=df_sorted, hue='Outcome', palette= 'viridis', size='Age', sizes=(20, 500),
               alpha=.5)
plt.title('Blood Pressure VS. BMI')

It might be useful to visualize the correlation between variables:

In [None]:
sns.heatmap(df_sorted.corr(), cmap='viridis')

The heatmap shows that, although some of the scatter plots suggest strong relationships, the highest correlations are only around 0.7. That is closer to a moderate correlation than a strong one.
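Reading exact values off a heatmap is hard, so it can help to list the correlated pairs as numbers. The sketch below does this on synthetic data (the columns `a`, `b`, `c` are placeholders, not columns of the diabetes data); on the notebook's data the same steps would start from `df_sorted.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; column names are placeholders.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),  # strongly related to a
    "c": rng.normal(size=200),                 # unrelated noise
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()
# Order the pairs by absolute correlation, strongest first.
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs)
```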

In [None]:
# Map each outcome to a color: red = no diabetes (0), blue = diabetes (1)
color_map = {0: 'red', 1: 'blue'}
colors = df['Outcome'].map(color_map).tolist()

In [None]:
from matplotlib.lines import Line2D

fig, ax = plt.subplots(figsize=(9, 9))
# Point colors come from the list built above: red = no diabetes (0), blue = diabetes (1)
ax.scatter(df['Insulin'], df['Glucose'], c=colors, edgecolors='k', linewidths=.4,
           s=df['Age'] * 10, alpha=.5)
ax.set_xlabel("Insulin")
ax.set_ylabel("Glucose")
ax.set_title("Glucose As a Function of Insulin")
ax.axvline(df['Insulin'].mean(), c='k', linestyle='--', linewidth=3)
ax.axhline(df['Glucose'].mean(), c='k', linestyle='--', linewidth=3)
# Build the legend by hand so that both classes are labeled, not just one
handles = [Line2D([], [], marker='o', linestyle='', color='red', label='No Diabetes'),
           Line2D([], [], marker='o', linestyle='', color='blue', label='Diabetes')]
ax.legend(handles=handles)
ax.grid()

As expected, we can see that those with low insulin are more likely to have diabetes, and we saw from the heatmap that Glucose and Insulin production are moderately correlated.

### Split Data

In [None]:
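from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, RocCurveDisplay, confusion_matrix
from sklearn.preprocessing import StandardScaler

# plot_roc_curve was removed in scikit-learn 1.2; keep the old name as an alias for its replacement.
plot_roc_curve = RocCurveDisplay.from_estimator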

In [None]:
X = df_sorted.drop('Outcome', axis=1).values
y = df_sorted['Outcome'].values

In [None]:
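X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the scaler on the training split only so the test split does not leak into the scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)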

In [None]:
logreg = LogisticRegression(C = .015)

In [None]:
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)

In [None]:
print(accuracy_score(y_test, pred)*100)
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))

In [None]:
plot_roc_curve(logreg, X_test, y_test)

In [None]:
coeff = list(logreg.coef_[0])
labels = list(df_sorted.drop('Outcome', axis=1).columns)
features = pd.DataFrame()
features['Features'] = labels
features['importance'] = coeff
features.sort_values(by=['importance'], ascending=True, inplace=True)
features['positive'] = features['importance'] > 0
features.set_index('Features', inplace=True)
features.importance.plot(kind='barh', figsize=(11, 6),color = features.positive.map({True: 'blue', False: 'red'}))
plt.xlabel('Importance')

In [None]:
logreg_new = LogisticRegression()
logreg_new_cv = GridSearchCV(logreg_new, {"C":np.logspace(-10, 10, 50)}, cv=5)
logreg_new_cv.fit(X_train, y_train)
logreg_new_cv.score(X_test, y_test) * 100

### Ranked by coefficient size, Glucose and BMI are the most important features, followed by Pregnancies and Age. It might be useful to test a model that uses only Glucose and BMI.
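Treating logistic regression coefficients as importance is a rough proxy, and it is only meaningful when the features are on a common scale. The sketch below illustrates the idea on synthetic data (the feature names `strong`, `weak`, and `noise` are placeholders): after standardizing, the feature that drives the label ends up with the largest absolute coefficient.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration; the feature names are placeholders.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "strong": rng.normal(size=500),
    "weak": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
# The label depends heavily on "strong", slightly on "weak", and not on "noise".
logit = 3.0 * X_toy["strong"] + 0.5 * X_toy["weak"]
y_toy = (logit + rng.logistic(size=500) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X_toy)
model = LogisticRegression().fit(X_scaled, y_toy)

# With standardized inputs, |coefficient| is comparable across features.
ranking = pd.Series(model.coef_[0], index=X_toy.columns).abs().sort_values(ascending=False)
print(ranking)
```

Sorting by absolute value keeps strongly negative coefficients from being ranked as unimportant.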

In [None]:
log_reg2 = LogisticRegression(C=.015)
X = df_sorted[['Glucose', 'BMI']]
y = df_sorted.Outcome

In [None]:
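X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Fit the scaler on the training split only, then apply it to both splits
feature_scaler = StandardScaler().fit(X_train)
X_train = feature_scaler.transform(X_train)
X_test = feature_scaler.transform(X_test)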

In [None]:
log_reg2.fit(X_train, y_train)
pred = log_reg2.predict(X_test)

In [None]:
print(accuracy_score(y_test, pred))

##### With the same value of C, this two-feature model performs worse than the model trained on all features.

# Random Forest

In [None]:
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA

In [None]:
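X = df_sorted.drop('Outcome', axis=1)
y = df_sorted['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Tree ensembles do not need scaling; it is kept here for consistency with the models above
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)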

In [None]:
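def fit_and_score(classifier):
    """Fit on the current training split, then report accuracy, a classification report, and the ROC curve."""
    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test)
    # Format as a percentage directly; round(...) * 100 can print floating-point noise
    print(f"Accuracy Score: {accuracy_score(y_test, pred):.2%}")
    print(classification_report(y_test, pred))
    plot_roc_curve(classifier, X_test, y_test)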

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
forest = RandomForestClassifier()

In [None]:
fit_and_score(forest)

# Random Forest Two:

In [None]:
from sklearn.model_selection import GridSearchCV
params = {'n_estimators': range(100, 105), 'criterion': ['gini', 'entropy'], 'max_depth': range(2, 10)}
new_forest = RandomForestClassifier()
forest_cv = GridSearchCV(new_forest, params, scoring='accuracy', n_jobs=-1)
forest_cv.fit(X_train, y_train)

In [None]:
# A notebook cell only displays its last expression, so print both results
print(forest_cv.score(X_test, y_test))
print(forest_cv.best_params_)

In [None]:
# Hand-picked settings; note that n_estimators=300 is outside the 100-104 range searched above
forest_new = RandomForestClassifier(n_estimators=300, criterion='entropy', max_depth=5)
fit_and_score(forest_new)

# LightGBM (Light Gradient Boosting Machine)

In [None]:
from lightgbm import LGBMClassifier

In [None]:
X = df_sorted.drop('Outcome', axis=1)
y = df_sorted.Outcome

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
lgbm = LGBMClassifier()

In [None]:
fit_and_score(lgbm)

# XGBoost

In [None]:
from xgboost import XGBClassifier

In [None]:
# use_label_encoder is deprecated in recent XGBoost releases and is not needed for 0/1 integer labels
xgb = XGBClassifier()

In [None]:
fit_and_score(xgb)