# **Data Preprocessing for Diabetes Dataset**

In this notebook, we will apply preprocessing steps to the diabetes dataset. The dataset contains various health indicators and a target variable indicating the diabetes status. We will perform the following preprocessing steps:

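- Splitting the data into stratified training, validation, and test sets, then scaling and binning the features
- Reducing dimensionality with principal component analysis (PCA)
- Rebalancing the training set with random oversampling, SMOTE, and random undersampling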

This preprocessing will prepare the data for further analysis and modeling. 

## **Diabetes Dataset Description**
This dataset contains 253,680 entries and 22 columns: the target variable and 21 features. Of the features, 17 are categorical, such as 'HighBP', 'HighChol', and 'Smoker', and 4 are numerical: 'BMI', 'Age', 'MentHlth', and 'PhysHlth'.

### Target Variable
- **Diabetes_012**
    - 0 = no diabetes
    - 1 = prediabetes
    - 2 = diabetes

### Features

- **HighBP** (High Blood Pressure)
    - 0 = no high BP
    - 1 = high BP

- **HighChol** (High Cholesterol)
    - 0 = no high cholesterol
    - 1 = high cholesterol

- **CholCheck** (Cholesterol Check)
    - 0 = no cholesterol check in 5 years
    - 1 = yes cholesterol check in 5 years

- **BMI** (Body Mass Index)
    - Body Mass Index

- **Smoker**
    - Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]
    - 0 = no
    - 1 = yes

- **Stroke**
    - (Ever told) you had a stroke.
    - 0 = no
    - 1 = yes

- **HeartDiseaseorAttack** (Coronary Heart Disease or Myocardial Infarction)
    - 0 = no
    - 1 = yes

- **PhysActivity** (Physical Activity)
    - Physical activity in past 30 days - not including job
    - 0 = no
    - 1 = yes

- **Fruits**
    - Consume fruit 1 or more times per day
    - 0 = no
    - 1 = yes

- **Veggies** (Vegetables)
    - Consume vegetables 1 or more times per day
    - 0 = no
    - 1 = yes

- **HvyAlcoholConsump** (Heavy Alcohol Consumption)
    - Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
    - 0 = no
    - 1 = yes

- **AnyHealthcare** (Any Health Care Coverage)
    - Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc.
    - 0 = no
    - 1 = yes

- **NoDocbcCost** (No Doctor Because of Cost)
    - Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
    - 0 = no
    - 1 = yes

- **GenHlth** (General Health)
    - Would you say that in general your health is:
        - 1 = excellent
        - 2 = very good
        - 3 = good
        - 4 = fair
        - 5 = poor

- **MentHlth** (Mental Health)
    - Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
    - Scale: 0-30 days

- **PhysHlth** (Physical Health)
    - Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
    - Scale: 0-30 days

- **DiffWalk** (Difficulty Walking)
    - Do you have serious difficulty walking or climbing stairs?
    - 0 = no
    - 1 = yes

- **Sex**
    - 0 = female
    - 1 = male

- **Age**
    - 13-level age category (_AGEG5YR see codebook)
        -  1 = 18-24
        -  2 = 25-29
        -  3 = 30-34
        -  4 = 35-39
        -  5 = 40-44
        -  6 = 45-49
        -  7 = 50-54
        -  8 = 55-59
        -  9 = 60-64
        - 10 = 65-69
        - 11 = 70-74
        - 12 = 75-79
        - 13 = 80 or older

- **Education**
    - Education level (EDUCA see codebook)
        - 1 = Never attended school or only kindergarten
        - 2 = Grades 1 through 8 (Elementary)
        - 3 = Grades 9 through 11 (Some high school)
        - 4 = Grade 12 or GED (High school graduate)
        - 5 = College 1 year to 3 years (Some college or technical school)
        - 6 = College 4 years or more (College graduate)

- **Income**
    - Income scale (INCOME2 see codebook)
        - 1 = less than $10,000
        - 2 = less than $15,000
        - 3 = less than $20,000
        - 4 = less than $25,000
        - 5 = less than $35,000
        - 6 = less than $50,000
        - 7 = less than $75,000
        - 8 = $75,000 or more


## Splits and preprocessing

In [1]:
# imports
import os
import sys

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler  # KBinsDiscretizer

sys.path.append(os.path.abspath("../scripts"))

from binning import BinningTransformer

In [None]:
# Read data
df = pd.read_csv("../data/raw/diabetes_012_health_indicators_BRFSS2015.csv")

# Drop rows where the target variable is 1 (prediabetes), rename column and set values to 0, 1 ==> alternative to merge below
# df = df[df['Diabetes_012'] != 1]
# df = df.rename(columns={'Diabetes_012': 'Diabetes'})
# df['Diabetes'] = df['Diabetes'].apply(lambda x: 1 if x == 2 else 0)
# df.head()

# Merge the classes diabetes and prediabetes for the target variable
df["Diabetes"] = df["Diabetes_012"].apply(lambda x: 1 if x == 2 else x)
df = df.drop(columns=["Diabetes_012"])
df.head()
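
The merge above collapses the three-level target into a binary one. The following is a minimal sketch on a toy Series (the values are made up for illustration) that shows the mapping. The vectorized comparison is an assumed equivalent of the `apply` call, not part of the notebook itself.

```python
import pandas as pd

# Toy target column with the three original classes (illustrative values only)
diabetes_012 = pd.Series([0.0, 1.0, 2.0, 0.0, 2.0], name="Diabetes_012")

# Same mapping as the notebook cell: 2 (diabetes) -> 1, everything else unchanged
merged_apply = diabetes_012.apply(lambda x: 1 if x == 2 else x)

# Vectorized alternative: any non-zero class becomes 1
merged_vectorized = (diabetes_012 > 0).astype(float)

print(merged_apply.tolist())
print(merged_vectorized.tolist())
```

Both variants leave class 0 untouched and place prediabetes and diabetes in the same positive class.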

In [None]:
# Lists for different types of features
binary_features = [
    "HighBP",
    "HighChol",
    "CholCheck",
    "Smoker",
    "Stroke",
    "HeartDiseaseorAttack",
    "PhysActivity",
    "Fruits",
    "Veggies",
    "HvyAlcoholConsump",
    "AnyHealthcare",
    "NoDocbcCost",
    "DiffWalk",
    "Sex",
]  # no further preprocessing required
ordinal_features = [
    "GenHlth",
    "Age",
    "Education",
    "Income",
]  # no further preprocessing required
numerical_features = [
    "MentHlth",
    "PhysHlth",
]  # will be normalized
binned_features = ["BMI"]  # will be binned to 0-3
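
The `BinningTransformer` imported from `../scripts/binning.py` is not shown in this notebook. As a rough idea of what such a transformer could look like, here is a hypothetical stand-in built on `pd.cut`. The class name `SketchBinningTransformer`, the half-open bins (`right=False`), and the fixed upper edge of 100 are assumptions made for this sketch; the real implementation may differ.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SketchBinningTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for scripts/binning.py; the real class may differ."""

    def __init__(self, bins, labels):
        self.bins = bins
        self.labels = labels

    def fit(self, X, y=None):
        # Bin edges are fixed up front, so there is nothing to learn
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        # right=False makes each bin [low, high), so a BMI of 25 falls into bin 2
        columns = [
            pd.cut(X[col], bins=self.bins, labels=self.labels, right=False).astype(int)
            for col in X.columns
        ]
        return np.column_stack(columns)


bmi = pd.DataFrame({"BMI": [17.0, 22.0, 25.0, 29.9, 41.0]})
transformer = SketchBinningTransformer(bins=[0, 18.5, 25, 30, 100], labels=[0, 1, 2, 3])
binned_bmi = transformer.fit_transform(bmi).ravel()
print(binned_bmi)
```

With these edges the four bins correspond to the usual underweight, normal, overweight, and obese BMI ranges.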

In [None]:
# Create bins for the BMI
bin_edges = [0, 18.5, 25, 30, df["BMI"].max() + 1]
num_bins = len(bin_edges) - 1
labels = list(range(num_bins))

# Separate the features from the target variable
X = df.drop("Diabetes", axis=1)
y = df["Diabetes"]

# Split data into training, validation and test splits [80%, 10%, 10%] using stratified split
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Split the temp set into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# Define the preprocessing pipeline
binning_transformer = BinningTransformer(bins=bin_edges, labels=labels)
numerical_pipeline = Pipeline(steps=[("scaler", StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_pipeline, numerical_features),
        ("binned", binning_transformer, binned_features),
        ("binary", "passthrough", binary_features),
        ("ordinal", "passthrough", ordinal_features),
    ],
)

# Fit the preprocessing pipeline on the training data only, then apply it to the validation and test data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_val_preprocessed = preprocessor.transform(X_val)
X_test_preprocessed = preprocessor.transform(X_test)

# Display the shapes of the preprocessed datasets
print(f"X_train_preprocessed shape: {X_train_preprocessed.shape}")
print(f"X_val_preprocessed shape: {X_val_preprocessed.shape}")
print(f"X_test_preprocessed shape: {X_test_preprocessed.shape}")

# Display the shapes of the target variables
print(f"y_train shape: {y_train.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"y_test shape: {y_test.shape}")
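
The two-stage split above first holds out 20% of the rows and then halves that hold-out set, which yields an 80/10/10 partition. A small sketch on synthetic data (the 15% positive rate is made up for illustration) shows that stratification keeps the class ratio the same in every split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 150 + [0] * 850)  # 15% positives, illustrative only

# Stage 1: 80% training, 20% temporary hold-out
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
# Stage 2: split the hold-out set in half to get validation and test sets
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(name, len(part), round(part.mean(), 3))
```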

### Save preprocessed data as CSV

In [None]:
# Convert preprocessed training, validation, and test sets back into DataFrames with correct column names
column_names = numerical_features + binned_features + binary_features + ordinal_features
train_preprocessed_df = pd.DataFrame(X_train_preprocessed, columns=column_names, index=X_train.index)
val_preprocessed_df = pd.DataFrame(X_val_preprocessed, columns=column_names, index=X_val.index)
test_preprocessed_df = pd.DataFrame(X_test_preprocessed, columns=column_names, index=X_test.index)

# Include the y values in the DataFrames
train_preprocessed_df["Diabetes"] = y_train.values
val_preprocessed_df["Diabetes"] = y_val.values
test_preprocessed_df["Diabetes"] = y_test.values
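
The column names are concatenated in the order `numerical`, `binned`, `binary`, `ordinal` because a `ColumnTransformer` emits its output columns in the order of its `transformers` list, not in the order of the input DataFrame. A minimal sketch with a three-column toy frame (the values are made up) illustrates this:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame(
    {"HighBP": [0, 1, 1], "BMI": [20.0, 30.0, 40.0], "MentHlth": [0.0, 15.0, 30.0]}
)

ct = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["MentHlth"]),
        ("binary", "passthrough", ["HighBP"]),
    ]
)
out = ct.fit_transform(toy)

# Output follows the transformer order (MentHlth, then HighBP);
# BMI is dropped because remainder="drop" is the default
out_df = pd.DataFrame(out, columns=["MentHlth", "HighBP"], index=toy.index)
print(out_df)
```

Listing the names in any other order would silently mislabel the columns, since the transformer returns a plain array.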

In [None]:
train_preprocessed_df.head()

In [None]:
# Storing each dataset into .csv file
train_preprocessed_df.to_csv("../data/preprocessed/dataset_train.csv", index=False)
val_preprocessed_df.to_csv("../data/preprocessed/dataset_val.csv", index=False)
test_preprocessed_df.to_csv("../data/preprocessed/dataset_test.csv", index=False)

### How to load data with data_loader.py

In [None]:
import os
import sys

sys.path.append(os.path.abspath("../scripts"))
from data_loader import DataLoader

data_loader = DataLoader()
X_train, y_train = data_loader.training_data
X_val, y_val = data_loader.validation_data
X_test, y_test = data_loader.test_data

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

### How to load data in a DataFrame

In [None]:
import os
import sys

sys.path.append(os.path.abspath("../scripts"))
from data_loader import DataLoader

data_loader = DataLoader()
train_df = data_loader.training_dataframe
val_df = data_loader.validation_dataframe
test_df = data_loader.test_dataframe

train_df.head()
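
The `DataLoader` class lives in `../scripts/data_loader.py` and is not reproduced here. Purely as an illustration of the property-based access pattern used above, a loader of this kind might be organised roughly as follows. The class name `SketchDataLoader`, its constructor arguments, and the private `_load` helper are hypothetical.

```python
import os
import tempfile

import pandas as pd


class SketchDataLoader:
    """Hypothetical loader; the real scripts/data_loader.py may be organised differently."""

    def __init__(self, data_dir, target="Diabetes"):
        self.data_dir = data_dir
        self.target = target

    def _load(self, filename):
        return pd.read_csv(os.path.join(self.data_dir, filename))

    @property
    def training_dataframe(self):
        # Full DataFrame including the target column
        return self._load("dataset_train.csv")

    @property
    def training_data(self):
        # Features and target as a (X, y) pair
        df = self.training_dataframe
        return df.drop(columns=[self.target]), df[self.target]


# Demonstrate with a tiny temporary CSV instead of the real preprocessed file
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"BMI": [1, 2], "Diabetes": [0, 1]}).to_csv(
        os.path.join(tmp, "dataset_train.csv"), index=False
    )
    loader = SketchDataLoader(tmp)
    X_demo, y_demo = loader.training_data
    print(X_demo.shape, y_demo.shape)
```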

## PCA

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=None)  # n_components=None means that all components will be kept
X_train_pca = pca.fit_transform(X_train)
X_val_pca = pca.transform(X_val)
X_test_pca = pca.transform(X_test)

# Retrieve the eigenvectors (components)
eigenvectors = pca.components_

# Retrieve the eigenvalues (explained variance)
eigenvalues = pca.explained_variance_

print(eigenvalues)
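
The values in `explained_variance_` are the eigenvalues of the sample covariance matrix of the input data. The following sketch checks this on random toy data (not the diabetes dataset) by comparing scikit-learn's result with a direct eigendecomposition:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated toy data

pca_demo = PCA(n_components=None).fit(X_demo)

# Eigenvalues of the sample covariance matrix (np.cov divides by n - 1 by default)
cov = np.cov(X_demo, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order

print(np.allclose(eigvals, pca_demo.explained_variance_))
```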

Printing and plotting the explained variance ratio of each principal component

In [None]:
# Print the explained variance ratio of each component, numbered from 1
for num_comp, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"Ratio of Component No. {num_comp}: {ratio}")

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

# Plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.scatter(
    x=[i + 1 for i in range(len(pca.explained_variance_ratio_))],
    y=pca.explained_variance_ratio_,
    s=200,
    alpha=0.75,
    c="orange",
    edgecolor="k",
)
plt.grid(True)
plt.title("Explained variance ratio of the \nfitted principal component vector\n", fontsize=25)
plt.xlabel("Principal components", fontsize=15)
plt.xticks([i + 1 for i in range(len(pca.explained_variance_ratio_))], fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel("Explained variance ratio", fontsize=15)
plt.show()

The above plot illustrates that the $1^{st}$ principal component explains about 47% of the total variance in the data and the $2^{nd}$ component explains a further 24%. This means that the first two components alone explain about 71% of the total variance, while the remaining 19 components together explain only 29%.
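
The cumulative share can be computed directly with `np.cumsum`, which also makes it easy to pick the number of components needed for a given threshold. In the sketch below only the first two ratios correspond to the figures quoted above; the remaining values and the 80% threshold are made up for illustration.

```python
import numpy as np

# Illustrative ratios: the first two match the text, the rest are invented
ratios = np.array([0.47, 0.24, 0.10, 0.08, 0.06, 0.05])

cumulative = np.cumsum(ratios)
threshold = 0.80
# Number of leading components whose cumulative ratio first reaches the threshold
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(2))
print(n_needed)
```

On the real data, `ratios` would be replaced by `pca.explained_variance_ratio_`.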

In [None]:
X_train_pca

In [None]:
x_train_pca_df = pd.DataFrame(data=X_train_pca)
x_train_pca_df.head()

In [None]:
import numpy as np

plt.figure(figsize=(15, 10))

# 2D scatter plot for the first two principal components
plt.subplot(2, 2, 1)
plt.scatter(
    x_train_pca_df[0],
    x_train_pca_df[1],
    c=y_train,
    edgecolors="k",
    alpha=0.75,
    s=150,
)
plt.grid(True)
plt.title("Class separation using first two principal components\n", fontsize=20)
plt.xlabel("Principal component-1", fontsize=15)
plt.ylabel("Principal component-2", fontsize=15)

plt.tight_layout()
plt.show()

Save principal components as CSV

In [None]:
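# Convert the principal component splits into DataFrames
# (copy the training frame so that x_train_pca_df is not modified in place)
train_pca_df = x_train_pca_df.copy()
val_pca_df = pd.DataFrame(data=X_val_pca)
test_pca_df = pd.DataFrame(data=X_test_pca)

# Include the y values in the DataFrames; plain arrays are used so that a
# non-default index on y cannot misalign the rows
train_pca_df["Diabetes"] = np.asarray(y_train)
val_pca_df["Diabetes"] = np.asarray(y_val)
test_pca_df["Diabetes"] = np.asarray(y_test)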

In [None]:
train_pca_df.head()

In [None]:
# Storing each dataset into .csv file
train_pca_df.to_csv("../data/pca/dataset_train_pca.csv", index=False)
val_pca_df.to_csv("../data/pca/dataset_val_pca.csv", index=False)
test_pca_df.to_csv("../data/pca/dataset_test_pca.csv", index=False)

### How to load PCA data with data_loader.py

In [5]:
import os
import sys

sys.path.append(os.path.abspath("../scripts"))
from data_loader import DataLoader

data_loader = DataLoader()
# If n is omitted, all components are returned
X_train_pca, y_train_pca = data_loader.training_data_pca(n=3)
X_val_pca, y_val_pca = data_loader.validation_data_pca(3)
X_test_pca, y_test_pca = data_loader.test_data_pca(3)

print(f"X_train_pca shape: {X_train_pca.shape}")
print(f"y_train_pca shape: {y_train_pca.shape}")
print(f"X_val_pca shape: {X_val_pca.shape}")
print(f"y_val_pca shape: {y_val_pca.shape}")
print(f"X_test_pca shape: {X_test_pca.shape}")
print(f"y_test_pca shape: {y_test_pca.shape}")

X_train_pca shape: (202944, 3)
y_train_pca shape: (202944,)
X_val_pca shape: (25368, 3)
y_val_pca shape: (25368,)
X_test_pca shape: (25368, 3)
y_test_pca shape: (25368,)


### How to load PCA data in a DataFrame

In [4]:
import os
import sys

sys.path.append(os.path.abspath("../scripts"))
from data_loader import DataLoader

data_loader = DataLoader()
train_df_pca = data_loader.training_dataframe_pca(3)
val_df_pca = data_loader.validation_dataframe_pca(3)
test_df_pca = data_loader.test_dataframe_pca(3)

train_df_pca.head()

| Unnamed: 0 | 0 | 1 | 2 | Diabetes |
|---|---|---|---|---|
| 0 | 5.079228 | 0.569142 | 0.829476 | 1.0 |
| 1 | -0.44637 | 1.760168 | 1.215792 | 0.0 |
| 2 | 0.714603 | -2.786327 | 1.044002 | 0.0 |
| 3 | -3.475631 | -3.902256 | 1.998963 | 0.0 |
| 4 | 4.499275 | 2.595663 | -0.986308 | 0.0 |


## Over- and Undersampling

### Random Oversampling

In [None]:
# apply random oversampling to training set
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_oversampled, y_oversampled = ros.fit_resample(X_train_preprocessed, y_train)

print(f"X_resampled shape: {X_oversampled.shape}")
print(f"y_resampled shape: {y_oversampled.shape}")

# Show class distributions
print("\nClass distribution before oversampling:")
print(y_train.value_counts())

print("\nClass distribution after oversampling:")
print(y_oversampled.value_counts())

In [None]:
# Convert preprocessed resampled training set back into DataFrame with correct column names
train_preprocessed_oversampled_df = pd.DataFrame(X_oversampled, columns=column_names)

# Include the y values in the dataframe
train_preprocessed_oversampled_df["Diabetes"] = y_oversampled.values

# Storing each dataset into .csv file
train_preprocessed_oversampled_df.to_csv(
    "../data/resampling/dataset_train_oversampled.csv", index=False
)
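
Conceptually, random oversampling duplicates randomly chosen minority-class rows until the classes are balanced. The NumPy sketch below illustrates that mechanism on a tiny toy array. It is a simplified illustration of the idea, not the `RandomOverSampler` implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)  # 8 majority vs. 2 minority samples

minority_idx = np.flatnonzero(y_demo == 1)
n_extra = (y_demo == 0).sum() - minority_idx.size

# Draw minority rows with replacement until both classes have the same count
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
X_over = np.vstack([X_demo, X_demo[extra_idx]])
y_over = np.concatenate([y_demo, y_demo[extra_idx]])

print(np.bincount(y_over))
```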

### SMOTE Oversampling

In [None]:
# apply synthetic oversampling with SMOTE to training set
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
# Use the preprocessed features so that the columns match column_names below
X_oversampled_smote, y_oversampled_smote = smote.fit_resample(X_train_preprocessed, y_train)

print(f"X_resampled shape: {X_oversampled_smote.shape}")
print(f"y_resampled shape: {y_oversampled_smote.shape}")

# Show class distributions
print("\nClass distribution before oversampling:")
print(y_train.value_counts())

print("\nClass distribution after oversampling:")
print(y_oversampled_smote.value_counts())

In [None]:
# Convert preprocessed resampled training set back into DataFrame with correct column names
train_preprocessed_oversampled_smote_df = pd.DataFrame(X_oversampled_smote, columns=column_names)

# Include the y values in the dataframe
train_preprocessed_oversampled_smote_df["Diabetes"] = y_oversampled_smote.values

# Storing each dataset into .csv file
train_preprocessed_oversampled_smote_df.to_csv(
    "../data/resampling/dataset_train_oversampled_smote.csv", index=False
)
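
Unlike random oversampling, SMOTE creates new minority samples by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below generates a single synthetic point this way on random toy data. It is a simplified illustration of the core step, assuming 3 neighbours, and not the imbalanced-learn implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X_min = rng.normal(size=(6, 2))  # toy minority-class samples

# Nearest minority neighbours of each sample (column 0 is the sample itself)
nn = NearestNeighbors(n_neighbors=4).fit(X_min)
_, neighbours = nn.kneighbors(X_min)

i = 0
j = rng.choice(neighbours[i, 1:])  # pick one of the 3 true neighbours
gap = rng.uniform()  # position along the segment, in [0, 1)
synthetic = X_min[i] + gap * (X_min[j] - X_min[i])

print(synthetic)
```

Because the new point is interpolated, columns that only held 0 or 1 before resampling can take fractional values afterwards.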

### Random Undersampling

In [None]:
# apply random undersampling to training set
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_undersampled, y_undersampled = rus.fit_resample(X_train_preprocessed, y_train)

print(f"X_resampled shape: {X_undersampled.shape}")
print(f"y_resampled shape: {y_undersampled.shape}")

# Show class distributions
print("\nClass distribution before undersampling:")
print(y_train.value_counts())

print("\nClass distribution after undersampling:")
print(y_undersampled.value_counts())

In [None]:
# Convert preprocessed resampled training set back into DataFrame with correct column names
train_preprocessed_undersampled_df = pd.DataFrame(X_undersampled, columns=column_names)

# Include the y values in the dataframe
train_preprocessed_undersampled_df["Diabetes"] = y_undersampled.values

# Storing each dataset into .csv file
train_preprocessed_undersampled_df.to_csv(
    "../data/resampling/dataset_train_undersampled.csv", index=False
)
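
Random undersampling goes the other way and discards majority-class rows until every class has as many rows as the smallest one. A rough pandas equivalent on a toy frame (values made up for illustration) uses `groupby(...).sample`. This is an alternative sketch of the idea rather than what `RandomUnderSampler` does internally.

```python
import pandas as pd

toy = pd.DataFrame({"BMI": range(10), "Diabetes": [0] * 8 + [1] * 2})

n_min = toy["Diabetes"].value_counts().min()
# Keep n_min randomly chosen rows per class, discarding the rest of the majority class
undersampled = toy.groupby("Diabetes").sample(n=n_min, random_state=42)

print(undersampled["Diabetes"].value_counts().to_dict())
```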

### How to load resampled data with data_loader.py

Note that when training on resampled data, you should still evaluate on the unbalanced validation and test splits, because these represent the original, real-world class distribution.

The example below uses random undersampling via `training_data_undersampling_random`. You can also use `training_data_oversampling_random` and `training_data_oversampling_smote`, or load a DataFrame in the same way as before.

In [None]:
import os
import sys

sys.path.append(os.path.abspath("../scripts"))
from data_loader import DataLoader

data_loader = DataLoader()
X_train_undersampling_random, y_train_undersampling_random = (
    data_loader.training_data_undersampling_random
)
X_val, y_val = data_loader.validation_data
X_test, y_test = data_loader.test_data

print(f"X_train_undersampling_random shape: {X_train_undersampling_random.shape}")
print(f"y_train_undersampling_random shape: {y_train_undersampling_random.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")