# Data Processing

Diabetes is a serious chronic disease in which individuals lose the ability to effectively regulate blood glucose levels, which can lead to a reduction in quality of life and life expectancy.

The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey conducted annually by the U.S. Centers for Disease Control and Prevention (CDC). Each year, the survey gathers responses from more than 400,000 Americans about health-related risk behaviors, chronic health conditions, and the use of preventive services. This project uses a dataset built from the 2015 survey and available on Kaggle:

https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

Variable dictionary:

- `Diabetes_binary`: 0 = no diabetes, 1 = with diabetes
- `HighBP`: 0 = no high blood pressure, 1 = with high blood pressure
- `HighChol`: 0 = no high cholesterol, 1 = with high cholesterol
- `CholCheck`: 0 = no cholesterol check in the last 5 years, 1 = had a cholesterol check in the last 5 years
- `BMI`: Body Mass Index (BMI)
- `Smoker`: 0 = non-smoker, 1 = smoker
- `Stroke`: 0 = no history of stroke, 1 = with history of stroke
- `HeartDiseaseorAttack`: 0 = no history of heart disease or heart attack, 1 = with history of heart disease or heart attack
- `PhysActivity`: 0 = does not engage in physical activity, 1 = engages in physical activity
- `Fruits`: 0 = does not consume fruits, 1 = consumes fruits
- `Veggies`: 0 = does not consume vegetables, 1 = consumes vegetables
- `HvyAlcoholConsump`: 0 = does not consume alcohol in high amounts, 1 = consumes alcohol in high amounts
- `AnyHealthcare`: 0 = does not have health insurance, 1 = has health insurance
- `NoDocbcCost`: 0 = did not skip a doctor visit because of cost, 1 = needed to see a doctor but could not because of cost (last 12 months)
- `GenHlth`: General health (1 to 5) - 1 = Excellent, 2 = Very good, 3 = Good, 4 = Fair, 5 = Poor
- `MentHlth`: In the last 30 days, how many days was mental health not good (0 to 30)
- `PhysHlth`: In the last 30 days, how many days was physical health not good (0 to 30)
- `DiffWalk`: 0 = no difficulty walking, 1 = has difficulty walking
- `Sex`: 0 = female, 1 = male
- `Age`: Age range: 1 = 18-24; 2 = 25-29; 3 = 30-34; 4 = 35-39; 5 = 40-44; 6 = 45-49; 7 = 50-54; 8 = 55-59; 9 = 60-64; 10 = 65-69; 11 = 70-74; 12 = 75-79; 13 = 80+
- `Education`: Education level: 1 = never attended school; 2 = elementary school; 3 = incomplete high school; 4 = high school; 5 = incomplete college or technical course; 6 = completed college or higher degrees
- `Income`: Annual income range (USD): 1 = < 10,000; 2 = 10,000-14,999; 3 = 15,000-19,999; 4 = 20,000-24,999; 5 = 25,000-34,999; 6 = 35,000-49,999; 7 = 50,000-74,999; 8 = 75,000+
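
The value ranges in the dictionary above can be turned into a quick sanity check before any processing. The sketch below is only an illustration: `EXPECTED_RANGES` and `find_out_of_range` are hypothetical names that are not part of the original notebook, and the toy frame uses made-up values rather than the real data.

```python
import pandas as pd

# Allowed (min, max) per column, transcribed from the variable dictionary.
# Binary columns are omitted for brevity; they would all be (0, 1).
EXPECTED_RANGES = {
    "GenHlth": (1, 5),
    "MentHlth": (0, 30),
    "PhysHlth": (0, 30),
    "Age": (1, 13),
    "Education": (1, 6),
    "Income": (1, 8),
}


def find_out_of_range(df: pd.DataFrame, ranges: dict) -> dict:
    """Return {column: number of rows outside the expected range}."""
    problems = {}
    for column, (low, high) in ranges.items():
        if column not in df.columns:
            continue  # column absent from this frame, nothing to check
        # between() is inclusive on both ends; NaN counts as out of range
        n_bad = int((~df[column].between(low, high)).sum())
        if n_bad:
            problems[column] = n_bad
    return problems


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({"GenHlth": [1.0, 5.0, 6.0], "Age": [1.0, 13.0, 7.0]})
report = find_out_of_range(toy, EXPECTED_RANGES)
print(report)
```

An empty report would mean that every checked column stays within its documented range.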

In [1]:
import pandas as pd

from src.config import ORIGINAL_DATA, PROCESSED_DATA

df_diabetes = pd.read_csv(ORIGINAL_DATA)

df_diabetes.head()

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,3.0,5.0,30.0,0.0,1.0,4.0,6.0,8.0
1,0.0,1.0,1.0,1.0,26.0,1.0,1.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,12.0,6.0,8.0
2,0.0,0.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,10.0,0.0,1.0,13.0,6.0,8.0
3,0.0,1.0,1.0,1.0,28.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,0.0,3.0,0.0,1.0,11.0,6.0,8.0
4,0.0,0.0,0.0,1.0,29.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,8.0
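
The `src.config` module imported in the first cell is not shown in this notebook. A minimal sketch of what such a module might contain follows, assuming a `data/` folder under the project root; the folder layout and both file names are placeholders, not the ones used by the original project.

```python
from pathlib import Path

# Hypothetical layout: <project root>/data/... ; adjust to the real project.
PROJECT_ROOT = Path(".").resolve()
DATA_DIR = PROJECT_ROOT / "data"

# Placeholder file names (assumptions, not taken from the original project).
ORIGINAL_DATA = DATA_DIR / "diabetes_original.csv"
PROCESSED_DATA = DATA_DIR / "diabetes_processed.parquet"
```

Keeping paths in one module lets every notebook import the same constants instead of repeating string literals.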


In [2]:
with pd.option_context("display.max_columns", None):
    display(df_diabetes.head())

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,3.0,5.0,30.0,0.0,1.0,4.0,6.0,8.0
1,0.0,1.0,1.0,1.0,26.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,12.0,6.0,8.0
2,0.0,0.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,10.0,0.0,1.0,13.0,6.0,8.0
3,0.0,1.0,1.0,1.0,28.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,3.0,0.0,3.0,0.0,1.0,11.0,6.0,8.0
4,0.0,0.0,0.0,1.0,29.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,8.0


In [3]:
df_diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Diabetes_binary       70692 non-null  float64
 1   HighBP                70692 non-null  float64
 2   HighChol              70692 non-null  float64
 3   CholCheck             70692 non-null  float64
 4   BMI                   70692 non-null  float64
 5   Smoker                70692 non-null  float64
 6   Stroke                70692 non-null  float64
 7   HeartDiseaseorAttack  70692 non-null  float64
 8   PhysActivity          70692 non-null  float64
 9   Fruits                70692 non-null  float64
 10  Veggies               70692 non-null  float64
 11  HvyAlcoholConsump     70692 non-null  float64
 12  AnyHealthcare         70692 non-null  float64
 13  NoDocbcCost           70692 non-null  float64
 14  GenHlth               70692 non-null  float64
 15  MentHlth           

In [4]:
# Rename the columns positionally: the list order must match the 22 original columns.
df_diabetes.columns = [
    "Diabetes",
    "HighBloodPressure",
    "HighCholesterol",
    "CholesterolTest",
    "BodyMassIndex",
    "Smoker",
    "Stroke",
    "HeartProblem",
    "PhysicalActivity",
    "EatsFruits",
    "EatsVegetables",
    "HeavyDrinking",
    "HealthInsurance",
    "NoDoctorMoney",
    "GeneralHealth",
    "MentalHealthDays",
    "PhysicalHealthDays",
    "WalkingDifficulty",
    "Gender",
    "AgeRange",
    "EducationLevel",
    "IncomeRange",
]
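
Assigning a list to `columns` depends on the list order matching the file exactly. As an alternative sketch, a dictionary passed to `DataFrame.rename` ties each new name to its original name. `RENAME_MAP` is a hypothetical, partial mapping shown on a toy frame; a full version would list all 22 columns.

```python
import pandas as pd

# Partial mapping for illustration only (hypothetical name, not in the notebook).
RENAME_MAP = {
    "Diabetes_binary": "Diabetes",
    "HighBP": "HighBloodPressure",
    "BMI": "BodyMassIndex",
}

# Toy frame with the columns deliberately in a different order.
toy = pd.DataFrame({"BMI": [26.0], "Diabetes_binary": [0.0], "HighBP": [1.0]})

# Fail early if a source name is misspelled or missing.
missing = set(RENAME_MAP) - set(toy.columns)
if missing:
    raise KeyError(f"columns not found: {sorted(missing)}")

renamed = toy.rename(columns=RENAME_MAP)
print(renamed.columns.tolist())
```

With a mapping, reordering the columns in the source file does not silently attach a name to the wrong data.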

In [5]:
df_diabetes.nunique()

Diabetes               2
HighBloodPressure      2
HighCholesterol        2
CholesterolTest        2
BodyMassIndex         80
Smoker                 2
Stroke                 2
HeartProblem           2
PhysicalActivity       2
EatsFruits             2
EatsVegetables         2
HeavyDrinking          2
HealthInsurance        2
NoDoctorMoney          2
GeneralHealth          5
MentalHealthDays      31
PhysicalHealthDays    31
WalkingDifficulty      2
Gender                 2
AgeRange              13
EducationLevel         6
IncomeRange            8
dtype: int64

In [6]:
# Columns with exactly two distinct values are treated as binary flags.
unique_counts = df_diabetes.nunique()
binary_columns = unique_counts[unique_counts == 2].index.tolist()

binary_columns

['Diabetes',
 'HighBloodPressure',
 'HighCholesterol',
 'CholesterolTest',
 'Smoker',
 'Stroke',
 'HeartProblem',
 'PhysicalActivity',
 'EatsFruits',
 'EatsVegetables',
 'HeavyDrinking',
 'HealthInsurance',
 'NoDoctorMoney',
 'WalkingDifficulty',
 'Gender']

In [7]:
df_diabetes_processed = df_diabetes.copy()

df_diabetes_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Diabetes            70692 non-null  float64
 1   HighBloodPressure   70692 non-null  float64
 2   HighCholesterol     70692 non-null  float64
 3   CholesterolTest     70692 non-null  float64
 4   BodyMassIndex       70692 non-null  float64
 5   Smoker              70692 non-null  float64
 6   Stroke              70692 non-null  float64
 7   HeartProblem        70692 non-null  float64
 8   PhysicalActivity    70692 non-null  float64
 9   EatsFruits          70692 non-null  float64
 10  EatsVegetables      70692 non-null  float64
 11  HeavyDrinking       70692 non-null  float64
 12  HealthInsurance     70692 non-null  float64
 13  NoDoctorMoney       70692 non-null  float64
 14  GeneralHealth       70692 non-null  float64
 15  MentalHealthDays    70692 non-null  float64
 16  Phys

In [8]:
# Label the binary flags: 0/1 become No/Yes, except Gender (0 = Female, 1 = Male).
# rename_categories assigns labels to the sorted categories, so 0.0 comes first.
for column in binary_columns:
    labels = ["Female", "Male"] if column == "Gender" else ["No", "Yes"]
    df_diabetes_processed[column] = pd.Categorical(
        df_diabetes_processed[column]
    ).rename_categories(labels)
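
The positional labels work because `pd.Categorical` sorts the inferred categories, so `0.0` is relabeled first and `1.0` second. The sketch below, on made-up values, compares that positional form with an explicit dictionary mapping. Passing `categories=[0.0, 1.0]` is an addition not present in the notebook; it keeps both levels even when a column happens to contain only one of them.

```python
import pandas as pd

toy = pd.Series([1.0, 0.0, 1.0, 1.0])

# Positional form: sorted categories [0.0, 1.0] -> ["No", "Yes"].
positional = pd.Categorical(toy).rename_categories(["No", "Yes"])

# Dict form: each code is tied to its label explicitly.
LABELS = {0.0: "No", 1.0: "Yes"}
explicit = pd.Categorical(toy, categories=[0.0, 1.0]).rename_categories(LABELS)

# A column with a single observed value still gets both levels.
all_ones = pd.Categorical(
    pd.Series([1.0, 1.0]), categories=[0.0, 1.0]
).rename_categories(LABELS)

print(list(positional))
print(list(explicit))
print(list(all_ones.categories))
```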

In [9]:
with pd.option_context("display.max_columns", None):
    display(df_diabetes_processed.head())

Unnamed: 0,Diabetes,HighBloodPressure,HighCholesterol,CholesterolTest,BodyMassIndex,Smoker,Stroke,HeartProblem,PhysicalActivity,EatsFruits,EatsVegetables,HeavyDrinking,HealthInsurance,NoDoctorMoney,GeneralHealth,MentalHealthDays,PhysicalHealthDays,WalkingDifficulty,Gender,AgeRange,EducationLevel,IncomeRange
0,No,Yes,No,Yes,26.0,No,No,No,Yes,No,Yes,No,Yes,No,3.0,5.0,30.0,No,Male,4.0,6.0,8.0
1,No,Yes,Yes,Yes,26.0,Yes,Yes,No,No,Yes,No,No,Yes,No,3.0,0.0,0.0,No,Male,12.0,6.0,8.0
2,No,No,No,Yes,26.0,No,No,No,Yes,Yes,Yes,No,Yes,No,1.0,0.0,10.0,No,Male,13.0,6.0,8.0
3,No,Yes,Yes,Yes,28.0,Yes,No,No,Yes,Yes,Yes,No,Yes,No,3.0,0.0,3.0,No,Male,11.0,6.0,8.0
4,No,No,No,Yes,29.0,Yes,No,No,Yes,Yes,Yes,No,Yes,No,2.0,0.0,0.0,No,Female,8.0,5.0,8.0


In [10]:
df_diabetes_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Diabetes            70692 non-null  category
 1   HighBloodPressure   70692 non-null  category
 2   HighCholesterol     70692 non-null  category
 3   CholesterolTest     70692 non-null  category
 4   BodyMassIndex       70692 non-null  float64 
 5   Smoker              70692 non-null  category
 6   Stroke              70692 non-null  category
 7   HeartProblem        70692 non-null  category
 8   PhysicalActivity    70692 non-null  category
 9   EatsFruits          70692 non-null  category
 10  EatsVegetables      70692 non-null  category
 11  HeavyDrinking       70692 non-null  category
 12  HealthInsurance     70692 non-null  category
 13  NoDoctorMoney       70692 non-null  category
 14  GeneralHealth       70692 non-null  float64 
 15  MentalHealthDays    70692 non-null  

In [11]:
df_diabetes_processed["GeneralHealth"] = pd.Categorical(
    df_diabetes_processed["GeneralHealth"],
    ordered=True
).rename_categories(["Excellent", "Very good", "Good", "Fair", "Poor"])

df_diabetes_processed["GeneralHealth"].head()

0         Good
1         Good
2    Excellent
3         Good
4    Very good
Name: GeneralHealth, dtype: category
Categories (5, object): ['Excellent' < 'Very good' < 'Good' < 'Fair' < 'Poor']
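
Marking the column as ordered makes comparisons, sorting, and `min`/`max` follow the rank of the levels instead of alphabetical order. The sketch below uses made-up values; the explicit `categories` argument is an assumption added here so that the five labels still line up when a sample lacks some levels.

```python
import pandas as pd

LEVELS = ["Excellent", "Very good", "Good", "Fair", "Poor"]

toy = pd.Series([3.0, 1.0, 2.0])  # no 4 or 5 in this sample
health = pd.Series(
    pd.Categorical(toy, categories=[1.0, 2.0, 3.0, 4.0, 5.0], ordered=True)
    .rename_categories(LEVELS)
)

# Comparisons use the category rank: "Excellent" < "Very good" < "Good" ...
better_than_good = (health < "Good").tolist()
worst_reported = health.max()

print(better_than_good)
print(worst_reported)
```

Because the scale runs from 1 = Excellent to 5 = Poor, a "larger" category means worse self-reported health.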

In [12]:
df_diabetes_processed["AgeRange"] = pd.Categorical(
    df_diabetes_processed["AgeRange"],
    ordered=True
).rename_categories(
    [
        "18-24",
        "25-29",
        "30-34",
        "35-39",
        "40-44",
        "45-49",
        "50-54",
        "55-59",
        "60-64",
        "65-69",
        "70-74",
        "75-79",
        "80+",
    ]
)

df_diabetes_processed["EducationLevel"] = pd.Categorical(
    df_diabetes_processed["EducationLevel"],
    ordered=True
).rename_categories(
    [
        "No schooling",
        "Primary",
        "Secondary incomplete",
        "Secondary",
        "College incomplete or Technical",
        "College or higher"
    ]
)

df_diabetes_processed["IncomeRange"] = pd.Categorical(
    df_diabetes_processed["IncomeRange"],
    ordered=True
).rename_categories(
    [
        "< $10.000",
        "$10.000-$14.999",
        "$15.000-$19.999",
        "$20.000-$24.999",
        "$25.000-$34.999",
        "$35.000-$49.999",
        "$50.000-$74.999",
        "$75.000+",
    ]
)

In [13]:
df_diabetes_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Diabetes            70692 non-null  category
 1   HighBloodPressure   70692 non-null  category
 2   HighCholesterol     70692 non-null  category
 3   CholesterolTest     70692 non-null  category
 4   BodyMassIndex       70692 non-null  float64 
 5   Smoker              70692 non-null  category
 6   Stroke              70692 non-null  category
 7   HeartProblem        70692 non-null  category
 8   PhysicalActivity    70692 non-null  category
 9   EatsFruits          70692 non-null  category
 10  EatsVegetables      70692 non-null  category
 11  HeavyDrinking       70692 non-null  category
 12  HealthInsurance     70692 non-null  category
 13  NoDoctorMoney       70692 non-null  category
 14  GeneralHealth       70692 non-null  category
 15  MentalHealthDays    70692 non-null  

In [14]:
df_diabetes_processed.head()

Unnamed: 0,Diabetes,HighBloodPressure,HighCholesterol,CholesterolTest,BodyMassIndex,Smoker,Stroke,HeartProblem,PhysicalActivity,EatsFruits,...,HealthInsurance,NoDoctorMoney,GeneralHealth,MentalHealthDays,PhysicalHealthDays,WalkingDifficulty,Gender,AgeRange,EducationLevel,IncomeRange
0,No,Yes,No,Yes,26.0,No,No,No,Yes,No,...,Yes,No,Good,5.0,30.0,No,Male,35-39,College or higher,$75.000+
1,No,Yes,Yes,Yes,26.0,Yes,Yes,No,No,Yes,...,Yes,No,Good,0.0,0.0,No,Male,75-79,College or higher,$75.000+
2,No,No,No,Yes,26.0,No,No,No,Yes,Yes,...,Yes,No,Excellent,0.0,10.0,No,Male,80+,College or higher,$75.000+
3,No,Yes,Yes,Yes,28.0,Yes,No,No,Yes,Yes,...,Yes,No,Good,0.0,3.0,No,Male,70-74,College or higher,$75.000+
4,No,No,No,Yes,29.0,Yes,No,No,Yes,Yes,...,Yes,No,Very good,0.0,0.0,No,Female,55-59,College incomplete or Technical,$75.000+


In [15]:
df_diabetes_processed.describe()

Unnamed: 0,BodyMassIndex,MentalHealthDays,PhysicalHealthDays
count,70692.0,70692.0,70692.0
mean,29.856985,3.752037,5.810417
std,7.113954,8.155627,10.062261
min,12.0,0.0,0.0
25%,25.0,0.0,0.0
50%,29.0,0.0,0.0
75%,33.0,2.0,6.0
max,98.0,30.0,30.0


In [16]:
# BMI is stored as float; check whether every value is actually a whole number.
df_diabetes_processed["BodyMassIndex"].apply(float.is_integer).all()

True
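
The same whole-number check can be written without a per-element Python call. This is a minimal sketch on made-up values; `all_whole` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def all_whole(values: pd.Series) -> bool:
    """True when every value is a whole number (NaN counts as not whole)."""
    return bool(((values % 1) == 0).all())


whole = pd.Series([26.0, 28.0, 29.0])
with_fraction = pd.Series([26.0, 28.5])

print(all_whole(whole))
print(all_whole(with_fraction))
```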

In [17]:
numeric_columns = df_diabetes_processed.select_dtypes(include="number").columns.tolist()

numeric_columns

['BodyMassIndex', 'MentalHealthDays', 'PhysicalHealthDays']

In [18]:
# Downcast each numeric column to the smallest integer dtype that can hold its values.
for column in numeric_columns:
    df_diabetes_processed[column] = pd.to_numeric(
        df_diabetes_processed[column],
        downcast="integer"
    )

df_diabetes_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Diabetes            70692 non-null  category
 1   HighBloodPressure   70692 non-null  category
 2   HighCholesterol     70692 non-null  category
 3   CholesterolTest     70692 non-null  category
 4   BodyMassIndex       70692 non-null  int8    
 5   Smoker              70692 non-null  category
 6   Stroke              70692 non-null  category
 7   HeartProblem        70692 non-null  category
 8   PhysicalActivity    70692 non-null  category
 9   EatsFruits          70692 non-null  category
 10  EatsVegetables      70692 non-null  category
 11  HeavyDrinking       70692 non-null  category
 12  HealthInsurance     70692 non-null  category
 13  NoDoctorMoney       70692 non-null  category
 14  GeneralHealth       70692 non-null  category
 15  MentalHealthDays    70692 non-null  
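
`pd.to_numeric(..., downcast="integer")` only converts a float column when all of its values are whole numbers; otherwise the column keeps its float dtype. A small sketch on made-up values illustrates both cases:

```python
import pandas as pd

whole = pd.Series([12.0, 29.0, 98.0])
fractional = pd.Series([12.0, 29.5, 98.0])

# Whole-number floats that fit in 8 bits are downcast to int8.
whole_down = pd.to_numeric(whole, downcast="integer")

# A single fractional value prevents the conversion; the dtype stays float64.
fractional_down = pd.to_numeric(fractional, downcast="integer")

print(whole_down.dtype)
print(fractional_down.dtype)
```

This is why checking for whole numbers first is useful: it confirms that the downcast will not silently leave a column as float.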

In [19]:
df_diabetes_processed.describe()

Unnamed: 0,BodyMassIndex,MentalHealthDays,PhysicalHealthDays
count,70692.0,70692.0,70692.0
mean,29.856985,3.752037,5.810417
std,7.113954,8.155627,10.062261
min,12.0,0.0,0.0
25%,25.0,0.0,0.0
50%,29.0,0.0,0.0
75%,33.0,2.0,6.0
max,98.0,30.0,30.0


In [20]:
df_diabetes_processed.describe(exclude="number")

Unnamed: 0,Diabetes,HighBloodPressure,HighCholesterol,CholesterolTest,Smoker,Stroke,HeartProblem,PhysicalActivity,EatsFruits,EatsVegetables,HeavyDrinking,HealthInsurance,NoDoctorMoney,GeneralHealth,WalkingDifficulty,Gender,AgeRange,EducationLevel,IncomeRange
count,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692
unique,2,2,2,2,2,2,2,2,2,2,2,2,2,5,2,2,13,6,8
top,No,Yes,Yes,Yes,No,No,No,Yes,Yes,Yes,No,Yes,No,Good,No,Female,65-69,College or higher,$75.000+
freq,35346,39832,37163,68943,37094,66297,60243,49699,43249,55760,67672,67508,64053,23427,52826,38386,10856,26020,20646
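
Besides readable labels, the categorical dtype usually takes less memory than `float64`, because each row stores a small integer code instead of an 8-byte float. The sketch below compares the two on synthetic data with the same number of rows as the dataset; the values are randomly generated, not taken from the survey.

```python
import numpy as np
import pandas as pd

n = 70_692  # same row count as the dataset; values are synthetic
rng = np.random.default_rng(0)

as_float = pd.Series(rng.integers(0, 2, size=n).astype("float64"))
as_category = pd.Series(
    pd.Categorical(as_float, categories=[0.0, 1.0]).rename_categories(["No", "Yes"])
)

# deep=True also counts the memory used by the category labels.
float_bytes = as_float.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(float_bytes, category_bytes)
```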


In [21]:
# Uncomment to write the processed data to PROCESSED_DATA.
# df_diabetes_processed.to_parquet(PROCESSED_DATA, index=False)