# Data processing of 2020 annual survey data from the CDC

## Import modules

First, I load the modules and define the constants: the path of the original file to read and the path of the cleaned file to write.

In [1]:
import polars as pl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
READ_PATH = "heart_2020.csv"
WRITE_PATH = "heart_2020_cleaned.csv"

I then compare how long pandas and polars take to read the dataset. pandas is comparatively slow at loading and operating on large datasets. polars, which builds on the Apache Arrow columnar format, is a faster alternative, so it is used to load the data. This dataset is not so large that the cleaning operations take a long time, however, so the loaded frame is converted to pandas for the cleaning itself.

In [None]:
%timeit pd.read_csv(READ_PATH)

In [None]:
%timeit pl.read_csv(READ_PATH)

In [3]:
heart = pl.read_csv(READ_PATH, null_values="")
heart = heart.to_pandas()
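
`%timeit` is an IPython magic, so the comparison above only runs inside a notebook. As a minimal sketch of the same measurement in a plain Python script, the standard-library `timeit` module can be used. The helper name `time_reader` and the synthetic CSV are assumptions made for illustration; they are not part of the original notebook.

```python
import io
import timeit

import pandas as pd


def time_reader(reader, make_source, repeats=3):
    """Return the best time in seconds of reader(make_source()) over `repeats` runs."""
    timings = timeit.repeat(lambda: reader(make_source()), number=1, repeat=repeats)
    return min(timings)


# Small synthetic CSV so the sketch runs without heart_2020.csv.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

# A fresh buffer is created for every run because a buffer can only be read once.
best = time_reader(pd.read_csv, lambda: io.StringIO(csv_text))
print(f"pandas, best of 3 runs: {best:.6f} s")
```

For the real file, `make_source` could simply return `READ_PATH`, and `pl.read_csv` could be passed as `reader` to time polars in the same way.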

The dataset contains 401,958 rows and 279 columns (variables). [The CDC codebook](https://www.cdc.gov/brfss/annual_data/2020/pdf/codebook20_llcp-v2-508.pdf) describes all of them. Some variables are irrelevant to an analysis of heart disease (e.g. Interview Month), while others carry some information about the respondent's health but most likely do not affect the disease itself (e.g. *What is his or her relationship to you?*).

In [None]:
heart.head()

In [None]:
heart.info()

According to the [CDC](https://www.cdc.gov/heartdisease/risk_factors.htm), several key factors strongly influence the likelihood of heart disease. They write: *About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking.* Major health factors include the following:
*  high blood pressure,
*  high blood cholesterol levels,
*  diabetes mellitus,
*  obesity.

Heart disease also depends on habits and behaviors. Here, the CDC lists the following:
*  eating a diet high in saturated fats, trans fat, and cholesterol,
*  not getting enough physical activity,
*  drinking too much alcohol,
*  tobacco use.

The risk of heart disease also increases with age. Heart disease is the leading cause of death in most racial and ethnic groups (e.g. African Americans, American Indians and Alaska Natives), while in others (Asian Americans and Pacific Islanders, and Hispanics) it is second to cancer.

Based on this information, the variables with a scientifically confirmed, strong impact on heart disease were selected from the dataset first. After these were extracted and recoded, other variables that are not leading factors in heart disease but may contribute to it indirectly were added to the final dataset.

**Dependent variable**:
*  **_MICHD** - Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI);

**Independent variables**:
*  **_BMI5CAT** - Body Mass Index (BMI) category;
*  **SMOKE100** - Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes];
*  **_RFDRHV7** - Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week);
* **CVDSTRK3** - (Ever told) (you had) a stroke;
* **PHYSHLTH** - Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?;
* **MENTHLTH** - Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?;
* **DIFFWALK** - Do you have serious difficulty walking or climbing stairs?;
* **SEXVAR** - Are you male or female?;
* **_AGEG5YR** - Fourteen-level age category;
* **_IMPRACE** - Imputed race/ethnicity value (This value is the reported race/ethnicity or an imputed race/ethnicity, if the respondent refused to give a race/ethnicity. The value of the imputed race/ethnicity will be the most common race/ethnicity response for that region of the state);
* **DIABETE4** - (Ever told) (you had) diabetes? (If 'Yes' and respondent is female, ask 'Was this only when you were pregnant?'. If Respondent says pre-diabetes or borderline diabetes, use response code 4.);
* **_TOTINDA** - Adults who reported doing physical activity or exercise during the past 30 days other than their regular job;
* **GENHLTH** - Would you say that in general your health is;
* **SLEPTIM1** - On average, how many hours of sleep do you get in a 24-hour period?;
* **ASTHMA3** - (Ever told) (you had) asthma?;
* **CHCKDNY2** - Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?;
* **CHCSCNCR** - (Ever told) (you had) skin cancer?

In [35]:
used_vars = ["_MICHD", "_BMI5CAT", "SMOKE100", "_RFDRHV7", "CVDSTRK3", "PHYSHLTH",
             "MENTHLTH", "DIFFWALK", "SEXVAR", "_AGEG5YR", "_IMPRACE", "DIABETE4",
            "_TOTINDA", "GENHLTH", "SLEPTIM1", "ASTHMA3", "CHCKDNY2", "CHCSCNCR"]

heart_final = heart[used_vars].copy()
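
Selecting columns with a list raises a `KeyError` if any name is misspelled or missing from the file. As a hedged sketch, the check can be made explicit so the error names every missing variable. The helper `select_columns` and the toy frame are hypothetical and only stand in for the full table.

```python
import pandas as pd


def select_columns(df, wanted):
    """Return a copy of df[wanted], raising a readable error if any column is absent."""
    missing = [col for col in wanted if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in the dataset: {missing}")
    return df[wanted].copy()


# Toy frame standing in for the full survey table.
toy = pd.DataFrame({"_MICHD": [1, 2], "SEXVAR": [1, 2], "IMONTH": [1, 2]})

subset = select_columns(toy, ["_MICHD", "SEXVAR"])
print(list(subset.columns))  # ['_MICHD', 'SEXVAR']
```

Because the result is a copy, later recoding does not modify the original frame.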

In [36]:
heart_final["_MICHD"] = heart_final["_MICHD"].replace({2: "No", 1: "Yes"})

heart_final["_BMI5CAT"] = heart_final["_BMI5CAT"].replace({
    1: "Underweight (BMI < 18.5)",
    2: "Normal weight (18.5 <= BMI < 25.0)",
    3: "Overweight (25.0 <= BMI < 30.0)",
    4: "Obese (30.0 <= BMI < +Inf)"
})

binary_vars = ["SMOKE100", "CVDSTRK3", "DIFFWALK", "_TOTINDA", "ASTHMA3", "CHCKDNY2", "CHCSCNCR"]
heart_final[binary_vars] = heart_final[binary_vars].replace({
    1: "Yes",
    2: "No",
    7: np.nan,
    9: np.nan
})

heart_final["_RFDRHV7"] = heart_final["_RFDRHV7"].replace({
    1: "No",
    2: "Yes",
    9: np.nan
})

multi_vars = ["PHYSHLTH", "MENTHLTH"]
heart_final[multi_vars] = heart_final[multi_vars].replace({
    88: 0,
    77: np.nan,
    99: np.nan
})

heart_final["SEXVAR"] = heart_final["SEXVAR"].replace({1: "Male", 2: "Female"})

heart_final["_AGEG5YR"] = heart_final["_AGEG5YR"].replace({
    1: "18-24",
    2: "25-29",
    3: "30-34",
    4: "35-39",
    5: "40-44",
    6: "45-49",
    7: "50-54",
    8: "55-59",
    9: "60-64",
    10: "65-69",
    11: "70-74",
    12: "75-79",
    13: "80 or older",
    14: np.nan
})

heart_final["_IMPRACE"] = heart_final["_IMPRACE"].replace({
    1: "White",
    2: "Black",
    3: "Asian",
    4: "American Indian/Alaskan Native",
    5: "Hispanic",
    6: "Other"
})

heart_final["DIABETE4"] = heart_final["DIABETE4"].replace({
    1: "Yes",
    2: "Yes (during pregnancy)",
    3: "No",
    4: "No, borderline diabetes",
    7: np.nan,
    9: np.nan
})

heart_final["GENHLTH"] = heart_final["GENHLTH"].replace({
    1: "Excellent",
    2: "Very good",
    3: "Good",
    4: "Fair",
    5: "Poor",
    7: np.nan,
    9: np.nan
})

heart_final["SLEPTIM1"] = heart_final["SLEPTIM1"].replace({
    77: np.nan,
    99: np.nan
})
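
To make the effect of the recoding above concrete, here is a small sketch on toy data. The two Series are made up for illustration; only the code values (1/2/7/9 for yes/no questions, 88/77/99 for day counts) follow the mappings used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy yes/no column: 1 and 2 are answers, 7 and 9 are treated as missing.
answers = pd.Series([1, 2, 7, 9, 1], name="SMOKE100")
answers_recoded = answers.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})

# Toy day-count column: 88 is recoded to 0 days, 77 and 99 are treated as missing.
days = pd.Series([88, 3, 77, 30, 99], name="PHYSHLTH")
days_recoded = days.replace({88: 0, 77: np.nan, 99: np.nan})

print(answers_recoded.tolist())
print(days_recoded.tolist())
```

Values that are not keys of the dictionary, such as 3 and 30 in the day counts, pass through unchanged.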

In [37]:
heart_final = heart_final.dropna()
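
`dropna()` with default arguments removes every row that has at least one missing value. Before dropping, it can be useful to see which variables cause the loss and how many rows disappear. The following is a sketch on a made-up frame, not output from the survey data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Smoking": ["Yes", np.nan, "No", "No"],
    "SleepTime": [7.0, 8.0, np.nan, 6.0],
})

# Missing values per column show which variables drive the row loss.
missing_per_column = toy.isna().sum()

# Rows removed when every row with at least one missing value is dropped.
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_per_column)
print(rows_before - rows_after)  # 2
```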

In [38]:
heart_final = heart_final.rename(columns={
    "_MICHD": "HeartDisease",
    "_BMI5CAT": "BMICategory",
    "SMOKE100": "Smoking",
    "_RFDRHV7": "AlcoholDrinking",
    "CVDSTRK3": "Stroke",
    "PHYSHLTH": "PhysicalHealth",
    "MENTHLTH": "MentalHealth",
    "DIFFWALK": "DiffWalking",
    "SEXVAR": "Sex",
    "_AGEG5YR": "AgeCategory",
    "_IMPRACE": "Race",
    "DIABETE4": "Diabetic",
    "_TOTINDA": "PhysicalActivity",
    "GENHLTH": "GenHealth",
    "SLEPTIM1": "SleepTime",
    "ASTHMA3": "Asthma",
    "CHCKDNY2": "KidneyDisease",
    "CHCSCNCR": "SkinCancer"
})

In [42]:
heart_final.head()

| | HeartDisease | BMICategory | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | Underweight (BMI < 18.5) | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
| 4 | No | Normal weight (18.5 <= BMI < 25.0) | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
| 5 | No | Overweight (25.0 <= BMI < 30.0) | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
| 6 | No | Normal weight (18.5 <= BMI < 25.0) | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
| 8 | No | Normal weight (18.5 <= BMI < 25.0) | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |


In [41]:
heart_final.shape

(319795, 18)

In [40]:
heart_final.to_csv(WRITE_PATH, encoding="UTF-8", index=False)
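
Writing with `index=False` keeps the row index out of the file, so reading it back does not produce an extra `Unnamed: 0` column. As a quick sanity check, the round trip can be sketched on a toy frame with an in-memory buffer. The frame is invented for illustration, and the buffer stands in for `WRITE_PATH`.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "HeartDisease": ["No", "Yes"],
    "SleepTime": [7.0, 5.0],
})

# Write to an in-memory buffer the same way as to a file: no index column.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Read the data back and compare it with what was written.
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))       # ['HeartDisease', 'SleepTime']
print(reloaded.shape == toy.shape)  # True
```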