# Exploratory Data Analysis

## Description of Raw Data

In [1]:
from data.data_loader import load_data

In [2]:
# Load the raw dataset from the directory

raw_data = load_data("diabetes_prediction_dataset.csv")

In [3]:
# Check the head of the dataframe

raw_data.head()

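|   | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|--------|-----|--------------|---------------|-----------------|-----|-------------|---------------------|----------|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |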


In [4]:
# Check the shape of the dataframe

raw_data.shape

(100000, 9)

The raw dataset consists of 100,000 observations and 9 columns: one target variable, "diabetes", and 8 explanatory variables. The objective is to predict the presence of diabetes, which makes this a binary classification problem. A first step in the exploration is therefore to check the class proportions of the target variable, since a strong class imbalance affects both the choice of modeling technique and the choice of evaluation metric.

In [5]:
from data.preprocessing import check_binary_column

check_binary_column(raw_data, "diabetes")

Column diabetes contains exactly two unique values: [0 1].
Counts and Proportions in diabetes:
  0: 91500 (91.50%)
  1: 8500 (8.50%)



As expected, individuals with diabetes represent a minority in the dataset, comprising only 8.50% of the observations.
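
The implementation of `check_binary_column` is not shown in this notebook. As a rough illustration of the class-proportion check, the same counts and percentages could be obtained with plain pandas. The helper name `class_proportions` and the toy data below are assumptions for illustration only, not the project's code or data.

```python
import pandas as pd


def class_proportions(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the count and percentage share of each value in `column`."""
    counts = df[column].value_counts().sort_index()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})


# Toy frame with the same 91.5% / 8.5% split as the output above
# (hypothetical data, not the real dataset).
toy = pd.DataFrame({"diabetes": [0] * 915 + [1] * 85})
summary = class_proportions(toy, "diabetes")
print(summary)
```

With a split this uneven, a model that always predicts class 0 would already reach about 91.5% accuracy, which is why the class distribution is checked before any modeling.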

In [6]:
check_binary_column(raw_data, "gender")
check_binary_column(raw_data, "hypertension")
check_binary_column(raw_data, "heart_disease")

Column gender does not contain exactly two unique values: ['Female' 'Male' 'Other']
Counts and Proportions in gender:
  Female: 58552 (58.55%)
  Male: 41430 (41.43%)
  Other: 18 (0.02%)

Column hypertension contains exactly two unique values: [0 1].
Counts and Proportions in hypertension:
  0: 92515 (92.52%)
  1: 7485 (7.48%)

Column heart_disease contains exactly two unique values: [1 0].
Counts and Proportions in heart_disease:
  0: 96058 (96.06%)
  1: 3942 (3.94%)



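Both heart disease and hypertension are heavily skewed toward the negative class (96.06% and 92.52%, respectively). Gender is comparatively balanced between Female (58.55%) and Male (41.43%), but it is not strictly binary: a third category, "Other", accounts for 18 observations (0.02%).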

To address class imbalance in classification, common strategies include oversampling the minority class and undersampling the majority class. Given computational constraints and the size of the dataset (100,000 observations), undersampling was chosen, and a sampled dataset was constructed with a balanced distribution of the target variable, diabetes. A script in data.sample+data.py was executed to generate the sampled dataset of 5,000 observations. There is no need to rerun the script, as the sampled data is already stored in the raw_data folder.

In [7]:
stratified_sampled_data = load_data("stratified_diabetes_prediction_dataset.csv")
check_binary_column(stratified_sampled_data, "diabetes")

Column diabetes contains exactly two unique values: [0 1].
Counts and Proportions in diabetes:
  0: 2500 (50.00%)
  1: 2500 (50.00%)
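
The sampling script itself is not reproduced here. A minimal sketch of balanced undersampling, assuming an equal number of rows is drawn without replacement from each class (2,500 per class would match the output above), might look like the following. The function name `balanced_undersample`, the seed, and the toy data are hypothetical.

```python
import pandas as pd


def balanced_undersample(
    df: pd.DataFrame, target: str, n_per_class: int, seed: int = 42
) -> pd.DataFrame:
    """Draw `n_per_class` rows without replacement from each class of `target`."""
    parts = [
        df[df[target] == value].sample(n=n_per_class, random_state=seed)
        for value in sorted(df[target].unique())
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy imbalanced frame (hypothetical data): 915 negatives, 85 positives.
toy = pd.DataFrame({"diabetes": [0] * 915 + [1] * 85, "row_id": range(1000)})
balanced = balanced_undersample(toy, "diabetes", n_per_class=50)
print(balanced["diabetes"].value_counts())
```

Note that `n_per_class` cannot exceed the size of the smallest class when sampling without replacement; in the raw data the minority class has 8,500 rows, so 2,500 per class is feasible.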



## Check for Missing Values and Outliers

In [8]:
from data.preprocessing import check_missing_values

In [9]:
check_missing_values(raw_data)

{'gender': 'No missing values',
 'age': 'No missing values',
 'hypertension': 'No missing values',
 'heart_disease': 'No missing values',
 'smoking_history': 'Missing values (35816 = 35.82%)',
 'bmi': 'No missing values',
 'HbA1c_level': 'No missing values',
 'blood_glucose_level': 'No missing values',
 'diabetes': 'No missing values'}

All columns have complete data except smoking_history, for which 35,816 observations (35.82%) are missing.
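
How `check_missing_values` decides that a value is missing is not shown. The first rows of the dataframe contain the string "No Info" in smoking_history, so one plausible reading is that such placeholder strings are counted alongside true nulls. The sketch below works under that assumption. The function name `missing_value_report` and the `placeholders` parameter are hypothetical, and the toy data is not the real dataset.

```python
import pandas as pd


def missing_value_report(df: pd.DataFrame, placeholders=("No Info",)) -> dict:
    """Summarise missing values per column.

    A value counts as missing if it is null or equals one of `placeholders`
    (an assumption; the project's helper may use a different rule).
    """
    report = {}
    for column in df.columns:
        is_missing = df[column].isna() | df[column].isin(list(placeholders))
        n_missing = int(is_missing.sum())
        if n_missing == 0:
            report[column] = "No missing values"
        else:
            share = n_missing / len(df) * 100
            report[column] = f"Missing values ({n_missing} = {share:.2f}%)"
    return report


# Toy frame (hypothetical data) with one placeholder value in four rows.
toy = pd.DataFrame(
    {
        "age": [80.0, 54.0, 28.0, 36.0],
        "smoking_history": ["never", "No Info", "never", "current"],
    }
)
report = missing_value_report(toy)
print(report)
```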