In [23]:
import numpy as np
import pandas as pd
import altair as alt
import altair_ally as aly
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import (
    RandomizedSearchCV,
    cross_validate,
    cross_val_score,
    train_test_split,
)
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Methods

### Data

The data set used in this analysis was created by Jack W Smith, JE Everhart, WC Dickson, WC Knowler, and RS Johannes. It was sourced from the National Library of Medicine database of the National Institutes of Health. Their original analysis can be found [here](https://pmc.ncbi.nlm.nih.gov/articles/PMC2245318/), and the data set itself is available via [Kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database/data). Each row (observation) represents an individual who identifies as part of the Pima (also known as the Akimel O'odham) Indigenous group, located mainly in central and southern Arizona in the United States. Each observation records features that include Age, BMI, Blood Pressure, Number of Pregnancies, and the Diabetes Pedigree Function (a score that summarizes how strongly a person's family history is associated with diabetes).

In [3]:
df = pd.read_csv('../data/diabetes.csv')
df.head()

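|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |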


The `shape` attribute shows the number of observations (rows) and the number of columns in the data set: 768 observations and 9 columns (8 features plus the `Outcome` target).

In [4]:
df.shape

(768, 9)

The `info()` method shows that none of the columns contain null values. It further shows that all columns are numeric.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Using the `train_test_split()` function, we split the data set so that 70% is used to train the model and 30% is held out for testing.

In [6]:
train_df, test_df = train_test_split(df,
                                     train_size = 0.7, 
                                     random_state=123)
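As a quick sanity check on the split sizes, the sketch below applies the same `train_size` and `random_state` to a placeholder array with 768 rows (the row count reported by `df.shape`). The array `placeholder_rows` is a stand-in for the real data and is not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder standing in for the 768 rows of the real data set (assumption).
placeholder_rows = np.arange(768)

train_rows, test_rows = train_test_split(
    placeholder_rows, train_size=0.7, random_state=123
)

# scikit-learn rounds the training size down: floor(0.7 * 768) = 537.
print(len(train_rows), len(test_rows))
```

This gives 537 training rows and 231 test rows, so the split is very close to, but not exactly, 70/30.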

The `describe()` method shows the summary statistics of each feature and of the target, computed on the training set. We can see the mean as well as the spread (standard deviation) of each column. Together with the visualizations that follow, this helps us judge how skewed each feature is.

In [7]:
census_summary = train_df.describe()
census_summary

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 537.0 | 537.0 | 537.0 | 537.0 | 537.0 | 537.0 | 537.0 | 537.0 | 537.0 |
| mean | 3.810056 | 120.337058 | 69.247672 | 20.702048 | 81.960894 | 32.091806 | 0.463048 | 33.344507 | 0.335196 |
| std | 3.318488 | 31.744549 | 18.874886 | 15.677625 | 116.475625 | 7.56207 | 0.331082 | 11.851165 | 0.472499 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.2 | 0.237 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 37.0 | 32.0 | 0.366 | 29.0 | 0.0 |
| 75% | 6.0 | 140.0 | 80.0 | 33.0 | 128.0 | 36.6 | 0.6 | 41.0 | 1.0 |
| max | 15.0 | 199.0 | 122.0 | 63.0 | 744.0 | 59.4 | 2.42 | 81.0 | 1.0 |
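The summary above reports a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. Such values are not physiologically plausible, so they may be placeholders for unrecorded measurements even though `info()` reports no nulls. A minimal sketch of how one might count them is shown below. The helper `count_zeros` is a hypothetical name, and `sample` holds only the five rows printed by `df.head()`, so the counts are illustrative.

```python
import pandas as pd

# First five rows of the data set, copied from the `df.head()` output.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

def count_zeros(frame, columns):
    """Return the number of zero entries in each of the given columns."""
    return (frame[columns] == 0).sum()

zero_counts = count_zeros(sample, list(sample.columns))
print(zero_counts)
```

On the real data the same helper could be called as `count_zeros(train_df, [...])` with the columns of interest.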


In [8]:
features = census_summary.columns.tolist()
features

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age',
 'Outcome']

In [9]:
feature_histograms = alt.Chart(train_df).mark_bar(opacity=0.5).encode(
    # One binned histogram per repeated column
    x=alt.X(alt.repeat()).type('quantitative').bin(maxbins=30),
    # Overlay (rather than stack) the bars of the two Outcome classes
    y=alt.Y('count()').stack(False),
    color='Outcome:N'
).properties(
    height=250,
    width=250
).repeat(
    # One chart per entry in `features`, laid out in a single column
    features, columns=1
)

feature_histograms

The graphs above show the distribution of each feature. The bars are colored by `Outcome`, so we can compare each feature's distribution when the Outcome is 0 (Non-Diabetic) and when it is 1 (Diabetic). This gives us an indication of possible relationships between the features and the target.

For Glucose, the Non-Diabetic class follows a roughly normal distribution, whereas the Diabetic class leans heavily towards the middle to higher range. BMI for the Diabetic class looks roughly normal but skews slightly towards higher values. Interestingly, the BMI distribution for the Non-Diabetic class appears more bimodal.

For Age, the range from 20 to 32 is dominated by Non-Diabetics. After the age of 32, the counts of the two classes are close, and in some bins the Diabetic class even overtakes the Non-Diabetic class despite having fewer observations overall. The Non-Diabetic class leans towards lower ages, while the Diabetic class is spread fairly evenly across its age range.

For Pregnancies, the lower range is dominated by the Non-Diabetic class, while the Diabetic class has more observations in the higher range.

For Skin Thickness, both classes are close to a normal distribution, but the Non-Diabetic class skews slightly towards lower values and the Diabetic class skews towards higher values.
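The visual comparison between the two classes can be complemented with per-class summary statistics. The sketch below groups by `Outcome` and takes the mean of a few features. It uses only the five rows printed by `df.head()` as a stand-in, so the numbers are illustrative and say nothing about the full training set.

```python
import pandas as pd

# Five rows copied from the `df.head()` output (a small illustrative sample).
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Mean of each feature within the Non-Diabetic (0) and Diabetic (1) classes.
class_means = sample.groupby("Outcome").mean()
print(class_means)
```

On the real data, `train_df.groupby("Outcome").mean()` would give the corresponding table for the training set.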

In [10]:
aly.corr(train_df)

The graph above shows the pairwise correlations between the features. The main reason to analyze this is to check for multicollinearity, which is problematic for logistic regression because it makes the coefficients unstable and hard to interpret. The highest correlation is between Age and Pregnancies (0.53 by Pearson and 0.59 by Spearman). Since this is below the commonly used threshold of 0.7, we do not expect multicollinearity to be a problem in our model.
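The threshold check described above can also be done programmatically rather than by reading the plot. The following is a minimal sketch, assuming a hypothetical helper `high_corr_pairs` and a made-up `toy` data frame. It relies on pandas' `DataFrame.corr`, whose `method` argument accepts `"pearson"` or `"spearman"`.

```python
import itertools
import pandas as pd

def high_corr_pairs(frame, threshold=0.7, method="pearson"):
    """List column pairs whose absolute correlation exceeds `threshold`."""
    corr = frame.corr(method=method)
    pairs = []
    for a, b in itertools.combinations(corr.columns, 2):
        r = corr.loc[a, b]
        if abs(r) > threshold:
            pairs.append((a, b, round(float(r), 3)))
    return pairs

# Made-up data for illustration only.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 6, 8, 10, 12.5],   # almost a multiple of x
    "z": [1, -1, 1, -1, 1, -1],    # alternates, so only weakly related to x
})

flagged = high_corr_pairs(toy)
print(flagged)
```

Only the `x`/`y` pair is flagged here. Calling `high_corr_pairs(train_df.drop(columns=["Outcome"]))` on the real data would return an empty list if no feature pair exceeds 0.7.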

In [11]:
# Fix the random seed so the sampled rows are the same on every run
aly.pair(train_df[features].sample(300, random_state=123), color='Outcome:N')

The graphs above visualize the pairwise relationships between our features. For the most part, the features do not show any clear trends. The two features that do show somewhat of a visual relationship are Skin Thickness and BMI. This makes sense, as a higher body mass would, for the most part, come with greater skin thickness.

Looking back at the correlation graph, Skin Thickness and BMI have a Pearson correlation of 0.41, which is below the 0.7 threshold, so this pair should not cause multicollinearity in our model.

Here we further split the training and test sets into features (`X`) and target (`y`).

In [12]:
X_train = train_df.drop(columns = ['Outcome'])
y_train = train_df['Outcome']
X_test = test_df.drop(columns = ['Outcome'])
y_test = test_df['Outcome']

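We create a `DummyClassifier` to act as the baseline for our analysis.
The dummy baseline gives a mean cross-validation accuracy of around 0.6648, which matches the proportion of Non-Diabetic observations in the training set (1 - 0.3352, using the mean of `Outcome` from the summary table).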

In [25]:
dummy_clf = DummyClassifier()
mean_cv_score = cross_val_score(dummy_clf, 
                                X_train,
                                y_train).mean()
mean_cv_score

np.float64(0.6647975077881619)
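To see where a baseline score like this comes from, the sketch below runs the same `DummyClassifier` and `cross_val_score` calls on a made-up target. With its default settings, `DummyClassifier` always predicts the most frequent class, so its accuracy equals the share of that class. The arrays `X_toy` and `y_toy` are invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Toy target: 10 observations of class 0 and 5 of class 1 (made up).
y_toy = np.array([0] * 10 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Default 5-fold stratified cross-validation, scored by accuracy.
toy_score = cross_val_score(DummyClassifier(), X_toy, y_toy).mean()
majority_share = np.mean(y_toy == 0)

print(toy_score, majority_share)
```

Both numbers agree, which mirrors the notebook's result: the training set has a mean `Outcome` of about 0.335, so always predicting class 0 is right about 66.5% of the time.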

We will use a Logistic Regression model for our classification. Since our features are measured on very different scales and some contain extreme values, we apply `StandardScaler()` to standardize each feature to zero mean and unit variance before fitting the model.

In [41]:
log_pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000, random_state=123)
)
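For reference, the transformation performed by the scaling step can be reproduced by hand. The sketch below compares `StandardScaler` with the manual formula (subtract the column mean, divide by the column standard deviation) on the Insulin values from the first five rows of the data. Placing the scaler inside the pipeline means it is refit on the training portion of each cross-validation fold, so no information from the validation portion leaks into the scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Insulin values from the first five rows shown in the `df.head()` output.
insulin = np.array([[0.0], [0.0], [0.0], [94.0], [168.0]])

scaled = StandardScaler().fit_transform(insulin)

# Manual equivalent: subtract the mean, divide by the population standard deviation.
manual = (insulin - insulin.mean(axis=0)) / insulin.std(axis=0)

print(np.allclose(scaled, manual))
```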

We optimize the hyperparameter `C` of our logistic regression using a randomized search over 20 candidate values, from 10^-5 to 10^14 in powers of ten.

In [42]:
np.random.seed(123)
param_dist = {
    "logisticregression__C": [10**i for i in range(-5,15)] 
}
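Because `param_dist` is a plain list of 20 values, a search with `n_iter=20` samples the candidates without replacement and so ends up trying every one of them. The sketch below illustrates this with `ParameterSampler`, the scikit-learn class that `RandomizedSearchCV` uses to draw its candidates.

```python
from sklearn.model_selection import ParameterSampler

param_dist = {"logisticregression__C": [10**i for i in range(-5, 15)]}

# Draw 20 candidates from the list of 20 values.
sampled = list(ParameterSampler(param_dist, n_iter=20, random_state=123))
sampled_values = {s["logisticregression__C"] for s in sampled}

# Number of candidate values and number of distinct values drawn.
print(len(param_dist["logisticregression__C"]), len(sampled_values))
```

In this setting the randomized search behaves like an exhaustive grid search over `C`, only with the candidates visited in a shuffled order.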

In [43]:
random_search = RandomizedSearchCV(log_pipe, param_dist,
                                   n_iter=20,
                                   n_jobs=-1,
                                   return_train_score=True,
                                   random_state=123)

random_search.fit(X_train, y_train)

We retrieve the best value found for the hyperparameter `C`, which we will use in our model.

In [44]:
best_params = random_search.best_params_ 
best_params

{'logisticregression__C': 10}

In [45]:
pd.DataFrame(random_search.cv_results_).sort_values(
    "rank_test_score").head(3)[["mean_test_score",
                                "mean_train_score"]]

|   | mean_test_score | mean_train_score |
|---|---|---|
| 9 | 0.761717 | 0.771416 |
| 17 | 0.761717 | 0.771416 |
| 16 | 0.761717 | 0.771416 |
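The three top-ranked candidates above have identical mean scores, and the table does not show which values of `C` they correspond to. One way to tell them apart is to include the `param_logisticregression__C` column from `cv_results_`. The sketch below shows this on synthetic data from `make_classification`, with a shortened candidate list. The names `X_demo`, `y_demo`, `demo_pipe`, `demo_grid`, and `demo_search` are illustrative and not part of the original analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real training data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=123)

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
demo_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

demo_search = RandomizedSearchCV(
    demo_pipe, demo_grid, n_iter=5, return_train_score=True, random_state=123
)
demo_search.fit(X_demo, y_demo)

# Include the sampled C value so tied candidates can be told apart.
demo_results = pd.DataFrame(demo_search.cv_results_).sort_values("rank_test_score")[
    ["param_logisticregression__C", "mean_test_score", "mean_train_score"]
]
print(demo_results)
```

Applied to the notebook's `random_search`, the same column selection would show which values of `C` produce the tied scores.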
