# Diabetes dataset - EDA

## Abstract

1. Is this a classification or regression problem?
2. How many features are present? Are they all continuous, or are some categorical? If so, are any of them ordinal?
3. What is the target variable, also referred to as the outcome variable?
4. If it is a classification problem, is the dataset imbalanced?
5. Are there duplicate samples? If so, how should they be handled?
6. Are there missing values? If so, how should they be handled?
7. Are there outlier samples? How can they be identified?
8. Plot the distribution of the following features: Glucose, BP, Skin thickness, BMI. What distribution do you observe? How about the distribution of the features Pedigree and Age? How about the features Pregnancies and Insulin?

In [None]:
## Load essential libraries
import numpy as np
import pandas as pd
import sys
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
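# The 'seaborn-whitegrid' style name was renamed in newer Matplotlib releases;
# setting the equivalent style through seaborn works across versions
sns.set_style('whitegrid')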
%matplotlib inline

import scipy.stats
from scipy.stats import *

import warnings
warnings.filterwarnings("ignore")

In [None]:
import sklearn
sklearn.__version__

In [None]:
## Mount the Google Drive folder for accessing data etc
if('google.colab' in sys.modules):
    from google.colab import drive
    drive.mount('/content/drive', force_remount = True)
    # Change path below starting from /content/drive/MyDrive/Colab Notebooks/
    # depending on how data is organized inside your Colab Notebooks folder in
    # Google Drive
    DIR = '/content/drive/MyDrive/Colab Notebooks/MAHE/MSIS Coursework/Miniprojects/EDA'
    DATA_DIR = DIR+'/Data/'
else:
    DATA_DIR = 'Data/'

In [None]:
## Load the diabetes dataset
FILENAME = DATA_DIR + 'diabetes.csv'
df = pd.read_csv(FILENAME)
print('Diabetes dataset')
print('-----------')
print('Initial number of samples = %d'%(df.shape[0]))
print('Initial number of features = %d\n'%(df.shape[1]))
df.head(5)

## EDA Begins

In [None]:
## Check missing values across all features
df.isna().sum()

In [None]:
## How many missing values in total across all features?
df.isna().sum().sum() # There are no NaNs

In [None]:
## Plot percentage of missing values (NaNs) for each feature
cutoff = 30 # we will remove features missing in more than 30% of the samples
fig = plt.figure(figsize=(20,10))
percent_missing = (df.isna().sum() / df.shape[0]) * 100
percent_missing.plot(kind = 'bar', color = cm.rainbow(np.linspace(0, 1, 2))[(percent_missing <= cutoff).values.astype(int)])
plt.plot(np.arange(df.shape[1]), np.repeat(cutoff, df.shape[1]), 'g--')
fig.suptitle('Percentage Missing Values Across All Features', fontsize = 20)
plt.xlabel('Feature', fontsize = 16)
plt.ylabel('% Missing Values', fontsize = 16)
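
The plot above only visualizes the cutoff; the comment mentions removing features that exceed it. A minimal sketch of that removal step follows, run on a small made-up frame rather than the diabetes data. The helper name `drop_sparse_features` and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_features(frame, cutoff=30):
    """Return a copy of `frame` without columns whose NaN percentage exceeds `cutoff`."""
    percent_missing = frame.isna().mean() * 100
    keep = percent_missing[percent_missing <= cutoff].index
    return frame[keep].copy()

# Toy data (hypothetical): 'b' is missing in 75% of rows, 'c' in 25%
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})
reduced = drop_sparse_features(toy, cutoff=30)
print(list(reduced.columns))
```

Returning a copy leaves the original frame untouched, so the cutoff can be varied without reloading the data.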

In [None]:
## Find null columns
# On datasets with many columns, not every column is listed unless verbose=True and show_counts=True are specified
df.info(verbose = True, show_counts = True)

In [None]:
## Check for duplicate samples
dupsSeries = df.duplicated() # Boolean Series: True for every row that repeats an earlier row
print(f"Number of duplicates = {dupsSeries.sum()}") # 1256 duplicate rows, which need to be dropped
df.drop_duplicates(inplace=True)
df.info() # only 744 unique rows remain
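
Before dropping duplicates it can help to look at the full duplicate groups. The sketch below uses a small made-up frame (not the diabetes data) to show the difference between `duplicated()` and `duplicated(keep=False)`. Resetting the index after the drop is an optional extra, not something the cell above does.

```python
import pandas as pd

# Toy data (hypothetical): rows 0, 1 and 3 are identical
toy = pd.DataFrame({'Glucose': [100, 100, 120, 100],
                    'Outcome': [0, 0, 1, 0]})

# keep=False flags every member of a duplicate group, not just the repeats
all_dups = toy[toy.duplicated(keep=False)]

# The default (keep='first') flags only the repeats of an earlier row
n_repeats = toy.duplicated().sum()

# Drop the repeats and rebuild a contiguous 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(all_dups), n_repeats, len(deduped))
```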

In [None]:
## Print unique values across all features
df.nunique() # this is for identifying candidate features for encodings.
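
One possible way to turn the unique-value counts into a shortlist of encoding candidates is a simple threshold. This is a heuristic sketch on made-up data: the helper `candidate_categoricals`, the `max_unique` threshold, and the toy columns are all assumptions, and a low unique count does not by itself prove that a feature is categorical.

```python
import pandas as pd

def candidate_categoricals(frame, max_unique=10):
    """List columns with at most `max_unique` distinct values."""
    counts = frame.nunique()
    return counts[counts <= max_unique].index.tolist()

# Toy data (hypothetical)
toy = pd.DataFrame({
    'Age': [21, 35, 47, 52, 63, 29],
    'Outcome': [0, 1, 0, 1, 1, 0],
    'Grade': ['low', 'high', 'mid', 'low', 'mid', 'high'],
})
print(candidate_categoricals(toy, max_unique=3))
```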

In [None]:
## Check if this is an imbalanced dataset
zeroClassCount = (df["Outcome"] == 0).sum()
zeroClassCount/df.shape[0] # fraction of class 0; not a 50-50 split, but the imbalance is mild enough that drastic measures such as SMOTE are not needed
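
The same check can report both classes at once with `value_counts(normalize=True)`. The sketch below runs on a made-up target series, not the diabetes data, and the majority-to-minority ratio is just one possible summary.

```python
import pandas as pd

# Toy target (hypothetical): 7 negatives, 3 positives
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name='Outcome')

# Fraction of samples in each class, ordered by class label
proportions = outcome.value_counts(normalize=True).sort_index()

# Majority-to-minority ratio; 1.0 would mean perfectly balanced classes
imbalance_ratio = proportions.max() / proportions.min()
print(proportions)
print(imbalance_ratio)
```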

In [None]:
# Run this optional code only for visualization
# If you do not have yellowbrick installed, you can skip it
from yellowbrick.target import ClassBalance

visualizer = ClassBalance(labels=[0, 1]) # labels must follow the sorted order of the target values
visualizer.fit(df["Outcome"]) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure

In [None]:
## Plot distributions of the numeric features
# sns.distplot is deprecated; sns.histplot (seaborn >= 0.11) with kde=True and
# stat='density' draws a comparable histogram with a density curve
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
plt.figure(figsize=(12,12))
for i, feature in enumerate(features, start=1):
    plt.subplot(3,3,i)
    sns.histplot(df[feature], kde=True, stat='density')
plt.tight_layout()
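
To complement the visual impression of each distribution, sample skewness gives a rough numeric summary: values near 0 suggest a symmetric shape and positive values a longer right tail. A minimal sketch on made-up columns (not the diabetes features):

```python
import pandas as pd

# Toy data (hypothetical): one symmetric column, one with a long right tail
toy = pd.DataFrame({
    'Symmetric': [1, 2, 3, 4, 5],
    'RightSkewed': [1, 1, 1, 2, 10],
})

# Sample skewness per column; NaNs are skipped by default
skewness = toy.skew()
print(skewness)
```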

In [None]:
## Box-plot for detecting outliers across features
df.plot(kind="box",subplots=True,figsize=(15,5),title="Data with Outliers");
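
The points drawn beyond the whiskers of a default box plot are those lying more than 1.5 times the interquartile range outside the quartiles. The sketch below applies that same rule numerically to a made-up series. The helper name `iqr_outlier_mask` is an assumption, and the rule is one common convention rather than the only way to define outliers.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data (hypothetical): one value far above the rest
toy = pd.Series([20, 22, 23, 24, 25, 26, 27, 90], name='BMI')
mask = iqr_outlier_mask(toy)
print(toy[mask].tolist())
```

Comparisons against NaN evaluate to False, so missing values are never flagged by this mask.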

In [None]:
## Count the zeros in each column
for col in df.columns:
    count = (df[col] == 0).sum()
    percentage = (count * 100)/df.shape[0]
    print(f'Count of zeros in Column {col} : {count}, percentage 0s: {percentage:.2f}%')

In [None]:
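## High percentage of 0s in these columns. Replace 0 with NaN
# Note: 0 is a legitimate value for Pregnancies, so treating it as missing there is a judgment call
zero_cols = ['Pregnancies', 'BloodPressure','SkinThickness','Insulin','BMI']
df[zero_cols] = df[zero_cols].replace(0, np.nan) # np.nan instead of np.NaN, which was removed in NumPy 2.0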

In [None]:
## How many missing values now?
df.isna().sum()

In [None]:
## Distribution after 0 to NaN replacement
df.hist(figsize=(12,6))
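
Once zeros have been converted to NaN, the missing values still need handling. One common option is to fill each column with its own median, which is less sensitive to extreme values than the mean. The sketch below shows this on made-up data; it is an assumption about a possible next step, not a method this notebook applies.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with NaNs standing in for replaced zeros
toy = pd.DataFrame({
    'Insulin': [np.nan, 80.0, 100.0, np.nan, 300.0],
    'BMI': [22.0, np.nan, 30.0, 28.0, 26.0],
})

medians = toy.median()         # per-column medians; NaNs are skipped by default
imputed = toy.fillna(medians)  # fill each column with its own median
print(imputed)
```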