# **Data Preprocessing for Diabetes Dataset**

In this notebook, we will apply preprocessing steps to the diabetes dataset. The dataset contains various health indicators and a target variable indicating the diabetes status. We will perform the following preprocessing steps:

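- Merge prediabetes and diabetes into a single binary target
- Bin `BMI` into four ordinal categories
- Create a stratified train/test split and standardize `MentHlth` and `PhysHlth`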

This preprocessing will prepare the data for further analysis and modeling.

In [None]:
# imports
import os
import sys

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler  # KBinsDiscretizer

sys.path.append(os.path.abspath("../scripts"))

## **Diabetes Dataset Description**
This dataset contains 253,680 entries and 22 columns: the target variable plus 21 features. Of the features, 17 are categorical, such as 'HighBP', 'HighChol', and 'Smoker', and 4 are numerical: 'BMI', 'Age', 'MentHlth', and 'PhysHlth'. Note that 'Age' is stored as a 13-level category code rather than as an age in years.

### Target Variable
- **Diabetes_012**
    - 0 = no diabetes
    - 1 = prediabetes
    - 2 = diabetes

### Features

- **HighBP** (High Blood Pressure)
    - 0 = no high BP
    - 1 = high BP

- **HighChol** (High Cholesterol)
    - 0 = no high cholesterol
    - 1 = high cholesterol

- **CholCheck** (Cholesterol Check)
    - 0 = no cholesterol check in 5 years
    - 1 = yes cholesterol check in 5 years

- **BMI** (Body Mass Index)
    - Body Mass Index

- **Smoker**
    - Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]
    - 0 = no
    - 1 = yes

- **Stroke**
    - (Ever told) you had a stroke.
    - 0 = no
    - 1 = yes

- **HeartDiseaseorAttack** (Coronary Heart Disease or Myocardial Infarction)
    - 0 = no
    - 1 = yes

- **PhysActivity** (Physical Activity)
    - Physical activity in past 30 days - not including job
    - 0 = no
    - 1 = yes

- **Fruits**
    - Consume fruit 1 or more times per day
    - 0 = no
    - 1 = yes

- **Veggies** (Vegetables)
    - Consume vegetables 1 or more times per day
    - 0 = no
    - 1 = yes

- **HvyAlcoholConsump** (Heavy Alcohol Consumption)
    - Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
    - 0 = no
    - 1 = yes

- **AnyHealthcare** (Any Health Care Coverage)
    - Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc.
    - 0 = no
    - 1 = yes

- **NoDocbcCost** (No Doctor Because of Cost)
    - Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
    - 0 = no
    - 1 = yes

- **GenHlth** (General Health)
    - Would you say that in general your health is:
        - 1 = excellent
        - 2 = very good
        - 3 = good
        - 4 = fair
        - 5 = poor

- **MentHlth** (Mental Health)
    - Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
    - Scale: 0-30 days

- **PhysHlth** (Physical Health)
    - Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
    - Scale: 0-30 days

- **DiffWalk** (Difficulty Walking)
    - Do you have serious difficulty walking or climbing stairs?
    - 0 = no
    - 1 = yes

- **Sex**
    - 0 = female
    - 1 = male

- **Age**
    - 13-level age category (_AGEG5YR see codebook)
        - 1 = 18-24
        - 9 = 60-64
        - 13 = 80 or older

- **Education**
    - Education level (EDUCA see codebook)
        - 1 = Never attended school or only kindergarten
        - 2 = Grades 1 through 8 (Elementary)
        - 3 = Grades 9 through 11 (Some high school)
        - 4 = Grade 12 or GED (High school graduate)
        - 5 = College 1 year to 3 years (Some college or technical school)
        - 6 = College 4 years or more (College graduate)

- **Income**
    - Income scale (INCOME2 see codebook)
        - 1 = less than $10,000
        - 5 = less than $35,000
        - 8 = $75,000 or more
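
The coding described above can be checked mechanically before any preprocessing. The following is a minimal sketch, not part of the original notebook: `EXPECTED_RANGES` and `find_out_of_range` are hypothetical names, only a few columns are listed, and the frame is made up for illustration rather than taken from the real data.

```python
import pandas as pd

# Hypothetical mapping of column name -> (min, max) allowed code, taken from
# the feature descriptions above. Only a few columns are listed for brevity.
EXPECTED_RANGES = {
    "HighBP": (0, 1),
    "GenHlth": (1, 5),
    "Age": (1, 13),
    "Education": (1, 6),
    "Income": (1, 8),
}


def find_out_of_range(df, expected_ranges):
    """Return {column: number of rows outside the allowed range}, omitting clean columns."""
    problems = {}
    for column, (low, high) in expected_ranges.items():
        # Series.between is inclusive on both ends by default.
        n_bad = int((~df[column].between(low, high)).sum())
        if n_bad:
            problems[column] = n_bad
    return problems


# Made-up rows for illustration; the last row has an invalid GenHlth code.
toy = pd.DataFrame(
    {
        "HighBP": [0, 1, 1],
        "GenHlth": [1, 5, 6],
        "Age": [1, 9, 13],
        "Education": [4, 6, 2],
        "Income": [1, 5, 8],
    }
)
print(find_out_of_range(toy, EXPECTED_RANGES))
```

An empty dictionary would mean every listed column stays within its documented range.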


In [None]:
# Read data
df = pd.read_csv("../data/raw/diabetes_012_health_indicators_BRFSS2015.csv")
# Convert all values in the dataframe to int
df = df.astype(int)

# Alternative (currently disabled): drop prediabetes rows (target == 1), rename the column, and map diabetes (2) to 1
# df = df[df['Diabetes_012'] != 1]
# df = df.rename(columns={'Diabetes_012': 'Diabetes'})
# df['Diabetes'] = df['Diabetes'].apply(lambda x: 1 if x == 2 else 0)
# df.head()

# Merge prediabetes (1) and diabetes (2) into a single positive class (1); 0 stays 0
df["Diabetes"] = df["Diabetes_012"].apply(lambda x: 1 if x == 2 else x)
df = df.drop(columns=["Diabetes_012"])
df.head()
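
To make the effect of the merge concrete, here is a small sketch on a made-up series (the values are illustrative, not taken from the dataset). It suggests that both prediabetes (1) and diabetes (2) end up as class 1, and that a vectorized comparison gives the same result.

```python
import pandas as pd

# Made-up target values covering all three original classes.
diabetes_012 = pd.Series([0, 1, 2, 0, 2], name="Diabetes_012")

# Same mapping as in the notebook cell: 2 becomes 1, everything else is unchanged.
merged = diabetes_012.apply(lambda x: 1 if x == 2 else x)

# An equivalent vectorized form: any non-zero class counts as positive.
vectorized = (diabetes_012 > 0).astype(int)

print(merged.tolist())
print(merged.tolist() == vectorized.tolist())
```

The vectorized form may be preferable on the full dataset because it avoids calling a Python function once per row.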

In [None]:
# Lists for different types of features
binary_features = [
    "HighBP",
    "HighChol",
    "CholCheck",
    "Smoker",
    "Stroke",
    "HeartDiseaseorAttack",
    "PhysActivity",
    "Fruits",
    "Veggies",
    "HvyAlcoholConsump",
    "AnyHealthcare",
    "NoDocbcCost",
    "DiffWalk",
    "Sex",
]  # no further preprocessing required
ordinal_features = [
    "GenHlth",
    "Age",
    "Education",
    "Income",
]  # no further preprocessing required
numerical_features = [
    "MentHlth",
    "PhysHlth",
]  # will be normalized
binned_features = ["BMI"]  # will be binned to 0-3


# Separate features and target, and set up a stratified 80/20 train/test split
X = df.drop("Diabetes", axis=1)
y = df["Diabetes"]
strat_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Create bins for the BMI
bin_edges = [0, 18.5, 25, 30, df["BMI"].max() + 1]
num_bins = len(bin_edges) - 1
labels = list(range(num_bins))
X["BMI"] = pd.cut(
    X["BMI"], bins=bin_edges, labels=labels, include_lowest=True, right=False
)

# n_splits=1, so take the single train/test index pair
train_index, test_index = next(strat_split.split(X, y))
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# Define the preprocessing pipeline
# binned_pipeline = Pipeline(steps=[
#     ('binner', KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans'))
# ]) # => this is now done with custom logic above

numerical_pipeline = Pipeline(steps=[("scaler", StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_pipeline, numerical_features),
        ("binned", "passthrough", binned_features),
        ("binary", "passthrough", binary_features),
        ("ordinal", "passthrough", ordinal_features),
    ],
)

# Apply the preprocessing pipeline to the training and testing data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

column_names = numerical_features + binned_features + binary_features + ordinal_features

# Convert preprocessed training and testing sets back into DataFrames with correct column names
X_train_preprocessed_df = pd.DataFrame(
    X_train_preprocessed, columns=column_names, index=X_train.index
)
X_test_preprocessed_df = pd.DataFrame(
    X_test_preprocessed, columns=column_names, index=X_test.index
)

# Convert the passthrough features (binary, binned, ordinal) back to int;
# the scaled numerical features stay as floats
int_columns = binary_features + binned_features + ordinal_features
X_train_preprocessed_df[int_columns] = X_train_preprocessed_df[int_columns].astype(int)
X_test_preprocessed_df[int_columns] = X_test_preprocessed_df[int_columns].astype(int)

# Display the shapes of the preprocessed datasets
print(f"X_train_preprocessed_df shape: {X_train_preprocessed_df.shape}")
print(f"X_test_preprocessed_df shape: {X_test_preprocessed_df.shape}")
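
The BMI binning in the cell above relies on `pd.cut` with `right=False`, which makes every interval left-closed. The sketch below uses made-up BMI values, chosen to sit on or near the bin edges, to illustrate which bin each value lands in. It is an illustration of the same call, not additional preprocessing.

```python
import pandas as pd

# Made-up BMI values, including ones that sit exactly on a bin edge.
bmi = pd.Series([17, 18, 24, 25, 29, 30, 45])

# Same edges as in the notebook; the last edge only needs to exceed the maximum.
bin_edges = [0, 18.5, 25, 30, bmi.max() + 1]
labels = list(range(len(bin_edges) - 1))  # [0, 1, 2, 3]

# right=False gives the intervals [0, 18.5), [18.5, 25), [25, 30), [30, max + 1).
binned = pd.cut(
    bmi, bins=bin_edges, labels=labels, include_lowest=True, right=False
)

print(binned.astype(int).tolist())
```

Values of exactly 25 and 30 fall into the upper of the two neighbouring bins, so the labels 0 to 3 line up with the usual underweight, normal, overweight, and obese BMI ranges.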

In [None]:
X_train_preprocessed_df.head().T

In [None]:
X_train_preprocessed_df.info()

In [None]:
# TODO
# PCA
# validate split
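
One way to approach the "validate split" item is to compare class proportions across the full data and both splits, and to confirm the index sets do not overlap. The sketch below does this on a made-up, imbalanced target. The data and the 80/20 class ratio are assumptions for illustration, not properties of the diabetes dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up imbalanced target: 80% class 0, 20% class 1.
y = pd.Series(np.repeat([0, 1], [800, 200]))
X = pd.DataFrame({"feature": np.arange(len(y))})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(splitter.split(X, y))
y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# The two index sets should be disjoint and together cover every row.
assert set(train_index).isdisjoint(test_index)
assert len(train_index) + len(test_index) == len(y)

# Class proportions should be (nearly) identical in the full data and both splits.
proportions = pd.DataFrame(
    {
        "full": y.value_counts(normalize=True),
        "train": y_train.value_counts(normalize=True),
        "test": y_test.value_counts(normalize=True),
    }
).sort_index()
print(proportions)
```

The same comparison could be run on the notebook's `y`, `y_train`, and `y_test` once the split cell has executed.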