<img src="https://github.com/akthammomani/akthammomani/assets/67468718/8d1f93b4-2270-477b-bd76-f9ec1c075307" width="1700"/>

# Data Wrangling: AI-Powered Heart Disease Risk Assessment

* **Name:** Aktham Almomani
* **Course:** Probability and Statistics for Artificial Intelligence (MS-AAI-500-02) / University of San Diego
* **Semester:** Summer 2024
* **Group:** 8

<center>
    <img src="https://github.com/akthammomani/AI_powered_heart_disease_risk_assessment_app/assets/67468718/2cab2215-ce7f-4951-a43a-02b88a5b9fa9" alt="wrangling">
</center>

## **Contents**<a id='Contents'></a>
* [Introduction](#Introduction)
* [Dataset](#Dataset)
* [Setup and Preliminaries](#Setup_and_preliminaries)
  * [Import Libraries](#Import_libraries)
  * [Necessary Functions](#Necessary_Functions)
* [Extracting descriptive column names for the dataset](#Extracting_descriptive_column_names_for_the_dataset)
* [Importing dataset](#Importing_dataset)
* [Validating the dataset](#Validating_the_dataset)
* [Correcting dataset column names](#Correcting_dataset_column_names)
* [Heart Disease related features](#Heart_Disease_related_features)
* [Selection Heart disease related features](#Selection_Heart_disease_related_features)
* [Imputing Missing Data, Transforming Columns and Features Engineering](#Imputing_missing_Data_and_transforming_columns)
  * [Distribution-Based Imputation](#Distribution_Based_Imputation)
  * [Column 1: Are_you_male_or_female](#Column_1_Are_you_male_or_female)
  * [Column 2: Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease](#Column_2_Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease)
  * [Column 3: Computed_race_groups_used_for_internet_prevalence_tables](#Column_3_Computed_race_groups_used_for_internet_prevalence_tables)
  * [Column 4: Imputed_Age_value_collapsed_above_80](#Column_4_Imputed_Age_value_collapsed_above_80)
  * [Column 5: General_Health](#Column_5_General_Health)
  * [Column 6: Have_Personal_Health_Care_Provider](#Column_6_Have_Personal_Health_Care_Provider)
  * [Column 7: Could_Not_Afford_To_See_Doctor](#Column_7_Could_Not_Afford_To_See_Doctor)
  * [Column 8: Length_of_time_since_last_routine_checkup](#Column_8_Length_of_time_since_last_routine_checkup)
  * [Column 9: Ever_Diagnosed_with_Heart_Attack](#Column_9_Ever_Diagnosed_with_Heart_Attack)
  * [Column 10: Ever_Diagnosed_with_a_Stroke](#Column_10_Ever_Diagnosed_with_a_Stroke)
  * [Column 11: Ever_told_you_had_a_depressive_disorder](#Column_11_Ever_told_you_had_a_depressive_disorder)
  * [Column 12: Ever_told_you_have_kidney_disease](#Column_12_Ever_told_you_have_kidney_disease)
  * [Column 13: Ever_told_you_had_diabetes](#Column_13_Ever_told_you_had_diabetes)
  * [Column 14: Computed_body_mass_index_categories](#Column_14_Computed_body_mass_index_categories)
  * [Column 15: Difficulty_Walking_or_Climbing_Stairs](#Column_15_Difficulty_Walking_or_Climbing_Stairs)
  * [Column 16: Computed_Physical_Health_Status](#Column_16_Computed_Physical_Health_Status)
  * [Column 17: Computed_Mental_Health_Status](#Column_17_Computed_Mental_Health_Status)
  * [Column 18: Computed_Asthma_Status](#Column_18_Computed_Asthma_Status)	
  * [Column 19: Exercise_in_Past_30_Days](#Column_19_Exercise_in_Past_30_Days)
  * [Column 20: Computed_Smoking_Status](#Column_20_Computed_Smoking_Status)
  * [Column 21: Binge_Drinking_Calculated_Variable](#Column_21_Binge_Drinking_Calculated_Variable)	
  * [Column 22: How_Much_Time_Do_You_Sleep](#Column_22_How_Much_Time_Do_You_Sleep)	
  * [Column 23: Computed_number_of_drinks_of_alcohol_beverages_per_week](#Column_23_Computed_number_of_drinks_of_alcohol_beverages_per_week)
* [Dropping unnecessary columns](#Dropping_unnecessary_columns)
* [Review final structure of the cleaned dataframe](#Review_final_structure_of_the_cleaned_dataframe)
* [Saving the cleaned dataframe](#Saving_the_cleaned_dataframe)

## **Introduction**<a id='Introduction'></a>
[Contents](#Contents)

In this notebook, I have undertaken a series of data wrangling steps to prepare our dataset for analysis. **Data wrangling** is a crucial step in the data science process, involving the transformation and mapping of raw data into a more usable format. Here's a summary of the key steps taken in this notebook:

* **Dealing with Missing Data:** Identified and imputed missing values in critical columns, such as the gender column, ensuring the dataset's completeness.
* **Data Mapping:** Transformed categorical variables into more meaningful representations, making the data easier to analyze and interpret.
* **Data Cleaning:** Removed or corrected inconsistent and erroneous entries to improve data quality.
* **Feature Engineering:** Created new features that may enhance the predictive power of our models.

These steps are essential for building a reliable and robust model for heart disease prediction.

## **Dataset**<a id='Dataset'></a>
[Contents](#Contents)

* The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. CDC BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
* The dataset was sourced from Kaggle [(Behavioral Risk Factor Surveillance System (BRFSS) 2022)](https://www.kaggle.com/datasets/ariaxiong/behavioral-risk-factor-surveillance-system-2022/data) and it was originally downloaded from the [CDC BRFSS 2022 website.](https://www.cdc.gov/brfss/annual_data/annual_2022.html)
* For more details about the dataset, see the [data_directory](https://github.com/akthammomani/AI_powered_health_risk_assessment_app/tree/main/data_directory) folder in my [GitHub](https://github.com/akthammomani) repository.

## **Setup and preliminaries**<a id='Setup_and_preliminaries'></a>
[Contents](#Contents)

### Import libraries<a id='Import_libraries'></a>
[Contents](#Contents)

In [1]:
#Let's import the necessary packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import scipy.stats as stats
from scipy.stats import gamma, linregress
from bs4 import BeautifulSoup
import re
from fancyimpute import KNN
import dask.dataframe as dd

# let's run below to customize notebook display:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# format floating-point numbers to 2 decimal places: we'll adjust below requirement as needed for specific answers during this assignment:
pd.set_option('float_format', '{:.2f}'.format)

### **Necessary functions**<a id='Necessary_Functions'></a>
[Contents](#Contents)

In [2]:
def summarize_df(df):
    """
    Generate a summary DataFrame for an input DataFrame.   
    Parameters:
    df (pd.DataFrame): The DataFrame to summarize.
    Returns:
    pd.DataFrame: A summary containing the following columns:
              - 'unique_count': No. unique values in each column.
              - 'data_types': Data types of each column.
              - 'missing_counts': No. of missing (NaN) values in each column.
              - 'missing_percentage': Percentage of missing values in each column.
    """
    # No. of unique values for each column:
    unique_counts = df.nunique()    
    # Data types of each column:
    data_types = df.dtypes    
    # No. of missing (NaN) values in each column:
    missing_counts = df.isnull().sum()    
    # Percentage of missing values in each column:
    missing_percentage = 100 * df.isnull().mean()    
    # Concatenate the above metrics:
    summary_df = pd.concat([unique_counts, data_types, missing_counts, missing_percentage], axis=1)    
    # Rename the columns for better readability
    summary_df.columns = ['unique_count', 'data_types', 'missing_counts', 'missing_percentage']   
    # Return summary df
    return summary_df
#-----------------------------------------------------------------------------------------------------------------#
# Function to clean and format the label
def clean_label(label):
    # Remove every character that is not a letter, digit, or whitespace
    label = re.sub(r'[^a-zA-Z0-9\s]', '', label)
    # Replace spaces with underscores
    label = re.sub(r'\s+', '_', label)
    return label
#-----------------------------------------------------------------------------------------------------------------#

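# Function to impute missing values based on the observed distribution.
# Note: it reads a global `value_counts` Series that must be defined before use;
# its values are passed as probabilities, so they must sum to 1.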
def impute_missing(row):
    if pd.isna(row['Are_you_male_or_female_3']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Are_you_male_or_female_3']


#-----------------------------------------------------------------------------------------------------------------#
def value_counts_with_percentage(df, column_name):
    # Calculate value counts
    counts = df[column_name].value_counts(dropna=False)
    
    # Calculate percentages
    percentages = df[column_name].value_counts(dropna=False, normalize=True) * 100
    
    # Combine counts and percentages into a DataFrame
    result = pd.DataFrame({
        'Count': counts,
        'Percentage': percentages
    })
    
    return result
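
The `impute_missing` helper above depends on a `value_counts` object that is not defined in this cell. The sketch below shows one way such a distribution could be built and used, assuming it is a normalized `value_counts` of the observed categories. The toy column, the `dist` name, and the seeded generator are illustrative assumptions, not the notebook's actual data or code.

```python
import numpy as np
import pandas as pd

# Toy column with missing entries (hypothetical data, not BRFSS values).
toy = pd.DataFrame({"sex": ["male", "female", "female", np.nan, "female", np.nan]})

# Observed distribution: normalize=True makes the values sum to 1,
# which the p argument of a random choice requires. NaN is excluded by default.
dist = toy["sex"].value_counts(normalize=True)

# Draw one replacement per missing row from the observed categories.
rng = np.random.default_rng(seed=0)  # seeded only to make the sketch repeatable
missing = toy["sex"].isna()
toy.loc[missing, "sex"] = rng.choice(
    dist.index.to_numpy(), size=int(missing.sum()), p=dist.to_numpy()
)

print(toy["sex"].value_counts())
```

Drawing all replacements in one vectorized call avoids a row-by-row `apply`, which matters on a dataset with several hundred thousand rows.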

## **Extracting descriptive column Names for the dataset**<a id='Extracting_descriptive_column_names_for_the_dataset'></a>
[Contents](#Contents)

The Behavioral Risk Factor Surveillance System (BRFSS) dataset, available on [Kaggle](https://www.kaggle.com/datasets/ariaxiong/behavioral-risk-factor-surveillance-system-2022/data), contains a wealth of information collected through surveys. However, its columns are identified by short labels or codes (e.g., `_STATE`, `FMONTH`, `IDATE`), which are difficult to interpret without additional context.

To ensure we fully understand what each column in the dataset represents, it is crucial to replace these short codes with their corresponding descriptive names. These descriptive names provide clear insights into the type of data each column holds, making the dataset easier to understand and analyze.

**Process Overview:**
* **Identify the Source for Descriptive Names:** The descriptive names corresponding to these short labels are typically documented in the [codebook in HTML](https://github.com/akthammomani/AI_powered_health_risk_assessment_app/tree/main/data_directory) or metadata provided by the data collection authority. In this case, the descriptive names are found in an HTML document provided by the BRFSS.
* **Parse the HTML Document:** Using an HTML parsing library such as BeautifulSoup in Python, we parse the HTML document to extract the relevant information. Specifically, we look for tables or sections in the HTML that list the short labels alongside their descriptive names.
* **Match and Replace:** We create a mapping of short labels to their descriptive names. This mapping is then applied to our dataset to replace the short labels with more meaningful descriptive names.
* **Save the Enhanced Dataset:** The dataset with descriptive column names is saved for subsequent analysis, ensuring that all users can easily interpret the columns.

In [3]:
# Path to the HTML file:
file_path = 'USCODE22_LLCP_102523.HTML'

# Read the HTML file:
with open(file_path, 'r', encoding='windows-1252') as file:
    html_content = file.read()

# Parse the HTML content using BeautifulSoup:
soup = BeautifulSoup(html_content, 'html.parser')

# Find all the tables that contain the required information:
tables = soup.find_all('table', class_='table')

# Initialize lists to store the extracted data:
labels = []
sas_variable_names = []

# Loop through each table to extract 'Label' and 'SAS Variable Name':
for table in tables:
    # Find all 'td' elements in the table:
    cells = table.find_all('td', class_='l m linecontent')
    
    # Loop through each cell to find 'Label' and 'SAS Variable Name':
    for cell in cells:
        text = cell.get_text(separator="\n")
        label = None
        sas_variable_name = None
        for line in text.split('\n'):
            if line.strip().startswith('Label:'):
                label = line.split('Label:')[1].strip()
            elif line.strip().startswith('SAS\xa0Variable\xa0Name:'):
                sas_variable_name = line.split('SAS\xa0Variable\xa0Name:')[1].strip()
        if label and sas_variable_name:
            labels.append(label)
            sas_variable_names.append(sas_variable_name)
        else:
            print("Label or SAS Variable Name not found in the text:")
            print(text)

# Create a DataFrame:
data = {'SAS Variable Name': sas_variable_names, 'Label': labels}
cols_df = pd.DataFrame(data)

# Save the DataFrame to a CSV file:
output_file_path = 'extracted_data.csv'
cols_df.to_csv(output_file_path, index=False)

print(f"Data has been successfully extracted and saved to {output_file_path}")

cols_df.head()


Data has been successfully extracted and saved to extracted_data.csv


| | SAS Variable Name | Label |
|---|---|---|
| 0 | _STATE | State FIPS Code |
| 1 | FMONTH | File Month |
| 2 | IDATE | Interview Date |
| 3 | IMONTH | Interview Month |
| 4 | IDAY | Interview Day |


In [4]:
# Let's examine each feature's missing-data count and percentage, unique count, and data type:
summarize_df(cols_df)

| | unique_count | data_types | missing_counts | missing_percentage |
|---|---|---|---|---|
| SAS Variable Name | 324 | object | 0 | 0.0 |
| Label | 317 | object | 0 | 0.0 |


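There is no missing data. The codebook yields 324 distinct SAS variable names but only 317 distinct labels, so some labels are shared by more than one variable.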
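
Because the summary shows fewer unique labels (317) than variable names (324), it can help to list which labels are shared before using them as column names. The following is a minimal sketch on a toy stand-in for `cols_df`. The rows and the `toy_cols` and `shared` names are assumptions for illustration, not the extracted codebook.

```python
import pandas as pd

# Toy stand-in for cols_df (hypothetical rows modeled on the codebook layout).
toy_cols = pd.DataFrame({
    "SAS Variable Name": ["COLGSEX1", "LANDSEX1", "CELLSEX1", "GENHLTH"],
    "Label": ["Are you male or female", "Are you male or female",
              "Are you male or female", "General Health"],
})

# keep=False flags every member of a duplicate group, not just the repeats.
shared = toy_cols[toy_cols["Label"].duplicated(keep=False)]

# Group the variable names under each shared label for inspection.
print(shared.groupby("Label")["SAS Variable Name"].apply(list))
```

Running the same two lines against the real `cols_df` would show which variables end up with identical column names after renaming.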

## **Importing dataset**<a id='Importing_dataset'></a>
[Contents](#Contents)

In [5]:
#First, let's load the main dataset BRFSS 2022:
df = pd.read_csv('brfss2022.csv')

## **Validating the dataset**<a id='Validating_the_dataset'></a>
[Contents](#Contents)

In [6]:
# Now, let's look at the top 5 rows of the df:
df.head()

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,PVTRESD1,COLGHOUS,STATERE1,CELPHON1,LADULT1,COLGSEX1,NUMADULT,LANDSEX1,NUMMEN,NUMWOMEN,RESPSLCT,SAFETIME,CTELNUM1,CELLFON5,CADULT1,CELLSEX1,PVTRESD3,CCLGHOUS,CSTATE1,LANDLINE,HHADULT,SEXVAR,GENHLTH,PHYSHLTH,MENTHLTH,POORHLTH,PRIMINSR,PERSDOC3,MEDCOST1,CHECKUP1,EXERANY2,SLEPTIM1,LASTDEN4,RMVTETH4,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,ASTHNOW,CHCSCNC1,CHCOCNC1,CHCCOPD3,ADDEPEV3,CHCKDNY2,HAVARTH4,DIABETE4,DIABAGE4,MARITAL,EDUCA,RENTHOM1,NUMHHOL4,NUMPHON4,CPDEMO1C,VETERAN3,EMPLOY1,CHILDREN,INCOME3,PREGNANT,WEIGHT2,HEIGHT3,DEAF,BLIND,DECIDE,DIFFWALK,DIFFDRES,DIFFALON,HADMAM,HOWLONG,CERVSCRN,CRVCLCNC,CRVCLPAP,CRVCLHPV,HADHYST2,HADSIGM4,COLNSIGM,COLNTES1,SIGMTES1,LASTSIG4,COLNCNCR,VIRCOLO1,VCLNTES2,SMALSTOL,STOLTEST,STOOLDN2,BLDSTFIT,SDNATES1,SMOKE100,SMOKDAY2,USENOW3,ECIGNOW2,LCSFIRST,LCSLAST,LCSNUMCG,LCSCTSC1,LCSSCNCR,LCSCTWHN,ALCDAY4,AVEDRNK3,DRNK3GE5,MAXDRNKS,FLUSHOT7,FLSHTMY3,PNEUVAC4,TETANUS1,HIVTST7,HIVTSTD3,HIVRISK5,COVIDPOS,COVIDSMP,COVIDPRM,PDIABTS1,PREDIAB2,DIABTYPE,INSULIN1,CHKHEMO3,EYEEXAM1,DIABEYE1,DIABEDU1,FEETSORE,TOLDCFS,HAVECFS,WORKCFS,IMFVPLA3,HPVADVC4,HPVADSHT,SHINGLE2,COVIDVA1,COVIDNU1,COVIDFS1,COVIDSE1,COPDCOGH,COPDFLEM,COPDBRTH,COPDBTST,COPDSMOK,CNCRDIFF,CNCRAGE,CNCRTYP2,CSRVTRT3,CSRVDOC1,CSRVSUM,CSRVRTRN,CSRVINST,CSRVINSR,CSRVDEIN,CSRVCLIN,CSRVPAIN,CSRVCTL2,PSATEST1,PSATIME1,PCPSARS2,PSASUGST,PCSTALK1,CIMEMLOS,CDHOUSE,CDASSIST,CDHELP,CDSOCIAL,CDDISCUS,CAREGIV1,CRGVREL4,CRGVLNG1,CRGVHRS1,CRGVPRB3,CRGVALZD,CRGVPER1,CRGVHOU1,CRGVEXPT,ACEDEPRS,ACEDRINK,ACEDRUGS,ACEPRISN,ACEDIVRC,ACEPUNCH,ACEHURT1,ACESWEAR,ACETOUCH,ACETTHEM,ACEHVSEX,ACEADSAF,ACEADNED,LSATISFY,EMTSUPRT,SDHISOLT,SDHEMPLY,FOODSTMP,SDHFOOD1,SDHBILLS,SDHUTILS,SDHTRNSP,SDHSTRE1,MARIJAN1,MARJSMOK,MARJEAT,MARJVAPE,MARJDAB,MARJOTHR,USEMRJN4,LASTSMK2,STOPSMK2,MENTCIGS,MENTECIG,HEATTBCO,ASBIALCH,ASBIDRNK,ASBIBING,ASBIADVC,ASBIRDUC,FIREARM5,GUNLOAD,LOADULK2,RCSGEND1,RCSXBRTH,RCSRLTN2,CASTHDX2,CASTHNO2,BIRTHSEX,SOMALE,SOFEMALE
,TRNSGNDR,HADSEX,PFPPRVN4,TYPCNTR9,BRTHCNT4,WHEREGET,NOBCUSE8,BCPREFER,RRCLASS3,RRCOGNT2,RRTREAT,RRATWRK2,RRHCARE4,RRPHYSM2,QSTVER,QSTLANG,_METSTAT,_URBSTAT,MSCODE,_STSTR,_STRWT,_RAWRAKE,_WT2RAKE,_IMPRACE,_CHISPNC,_CRACE2,_CPRACE2,CAGEG,_CLLCPWT,_DUALUSE,_DUALCOR,_LLCPWT2,_LLCPWT,_RFHLTH,_PHYS14D,_MENT14D,_HLTHPLN,_HCVU652,_TOTINDA,_EXTETH3,_ALTETH3,_DENVST3,_MICHD,_LTASTH1,_CASTHM1,_ASTHMS1,_DRDXAR2,_PRACE2,_MRACE2,_HISPANC,_RACE1,_RACEG22,_RACEGR4,_RACEPR1,_SEX,_AGEG5YR,_AGE65YR,_AGE80,_AGE_G,HTIN4,HTM4,WTKG3,_BMI5,_BMI5CAT,_RFBMI5,_CHLDCNT,_EDUCAG,_INCOMG1,_RFMAM22,_MAM5023,_HADCOLN,_CLNSCP1,_HADSIGM,_SGMSCP1,_SGMS101,_RFBLDS5,_STOLDN1,_VIRCOL1,_SBONTI1,_CRCREC2,_SMOKER3,_RFSMOK3,_CURECI2,_YRSSMOK,_PACKDAY,_PACKYRS,_YRSQUIT,_SMOKGRP,_LCSREC,DRNKANY6,DROCDY4_,_RFBING6,_DRNKWK2,_RFDRHV8,_FLSHOT7,_PNEUMO3,_AIDTST4
0,1.0,1.0,2032022,2,3,2022,1100.0,2022000001,2022000001.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,2.0,88.0,88.0,,99.0,1.0,2.0,1.0,2.0,8.0,,,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,1.0,80.0,1.0,6.0,1.0,1.0,1.0,2.0,2.0,7.0,88.0,99.0,,9999.0,9999.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,,,,2.0,1.0,3.0,2.0,3.0,,2.0,,,,,,,,2.0,,3.0,4.0,,,,2.0,,,888.0,,,,1.0,92021.0,2.0,3.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,5.0,2.0,2.0,5.0,2.0,2.0,2.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,2.0,11011.0,37.42,2.0,74.84,1.0,9.0,,,,,1.0,0.52,813.92,487.61,1.0,1.0,1.0,9.0,9.0,2.0,9.0,9.0,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,13.0,2.0,80.0,6.0,,,,,,9.0,1.0,4.0,9.0,1.0,,1.0,,1.0,,,,,,,,4.0,1.0,1.0,,,,,4.0,,2.0,0.0,1.0,0.0,1.0,1.0,2.0,2.0
1,1.0,1.0,2042022,2,4,2022,1100.0,2022000002,2022000002.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,1.0,88.0,88.0,,3.0,2.0,2.0,8.0,2.0,6.0,,,2.0,2.0,2.0,2.0,,1.0,1.0,2.0,2.0,2.0,2.0,3.0,,3.0,4.0,1.0,1.0,2.0,1.0,2.0,2.0,88.0,5.0,,150.0,503.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,4.0,2.0,,,,1.0,1.0,1.0,4.0,,,2.0,,,,,,,,2.0,,3.0,1.0,,,,2.0,,,888.0,,,,2.0,,2.0,4.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,5.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,2.0,1.0,5.0,11011.0,37.42,1.0,37.42,1.0,9.0,,,,,1.0,0.52,406.96,432.1,1.0,1.0,1.0,1.0,9.0,2.0,9.0,9.0,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,13.0,2.0,80.0,6.0,63.0,160.0,6804.0,2657.0,3.0,2.0,1.0,2.0,3.0,2.0,,1.0,,2.0,,,,,,,,4.0,1.0,1.0,,,,,4.0,,2.0,0.0,1.0,0.0,1.0,2.0,2.0,2.0
2,1.0,1.0,2022022,2,2,2022,1100.0,2022000003,2022000003.0,1.0,1.0,,1.0,2.0,1.0,,1.0,2.0,,,,,,,,,,,,,,2.0,2.0,2.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,5.0,,,2.0,2.0,2.0,2.0,,1.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,6.0,1.0,2.0,,1.0,2.0,7.0,88.0,10.0,,140.0,502.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,,,,1.0,2.0,,,,,2.0,,,,,,,,2.0,,3.0,1.0,,,,2.0,,,888.0,,,,2.0,,2.0,7.0,2.0,,2.0,1.0,1.0,9.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,2.0,3.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,2.0,11011.0,37.42,1.0,37.42,1.0,9.0,,,,,1.0,0.52,406.96,366.74,1.0,2.0,2.0,1.0,1.0,1.0,9.0,,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,8.0,1.0,56.0,5.0,62.0,157.0,6350.0,2561.0,3.0,2.0,1.0,4.0,6.0,1.0,1.0,2.0,3.0,2.0,3.0,3.0,,,,2.0,2.0,4.0,1.0,1.0,,,,,4.0,,2.0,0.0,1.0,0.0,1.0,,,2.0
3,1.0,1.0,2032022,2,3,2022,1100.0,2022000004,2022000004.0,1.0,1.0,,1.0,2.0,1.0,,3.0,,2.0,1.0,2.0,,,,,,,,,,,2.0,1.0,88.0,88.0,,99.0,1.0,2.0,1.0,1.0,7.0,,,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,,1.0,4.0,1.0,2.0,,1.0,2.0,7.0,88.0,77.0,2.0,140.0,505.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,7.0,2.0,1.0,1.0,3.0,,,2.0,,,,,,,,1.0,2.0,3.0,1.0,17.0,999.0,2.0,1.0,2.0,,888.0,,,,1.0,102021.0,1.0,4.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,3.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,1.0,11011.0,37.42,3.0,112.26,1.0,9.0,,,,,1.0,0.52,1220.88,1681.79,1.0,1.0,1.0,9.0,9.0,1.0,9.0,9.0,9.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,14.0,3.0,73.0,6.0,65.0,165.0,6350.0,2330.0,2.0,1.0,1.0,2.0,9.0,,,1.0,,2.0,,,,,,,,2.0,2.0,1.0,56.0,0.1,6.0,,3.0,2.0,2.0,0.0,1.0,0.0,1.0,9.0,9.0,2.0
4,1.0,1.0,2022022,2,2,2022,1100.0,2022000005,2022000005.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,4.0,2.0,88.0,88.0,7.0,2.0,2.0,1.0,1.0,9.0,,,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,5.0,1.0,2.0,,2.0,2.0,5.0,88.0,5.0,2.0,119.0,502.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,,,,,,,,,,,,,,2.0,,3.0,1.0,,,,1.0,2.0,,203.0,2.0,88.0,2.0,2.0,,1.0,4.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,5.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,1.0,11011.0,37.42,2.0,74.84,1.0,9.0,,,,,1.0,0.52,813.92,2111.21,2.0,2.0,1.0,1.0,1.0,1.0,9.0,,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,5.0,1.0,43.0,3.0,62.0,157.0,5398.0,2177.0,2.0,1.0,1.0,3.0,3.0,1.0,,,,,,,,,,,,4.0,1.0,1.0,,,,,4.0,,1.0,10.0,1.0,140.0,1.0,,,2.0


In [7]:
#now, let's look at the shape of df:
shape = df.shape
print("Number of rows:", shape[0], "\nNumber of columns:", shape[1])

Number of rows: 445132 
Number of columns: 326


## **Correcting dataset column names**<a id='Correcting_dataset_column_names'></a>
[Contents](#Contents)

To replace the SAS variable names in the dataset with their corresponding labels (with special characters removed and whitespace replaced by underscores), we follow these steps:

* Create a mapping from the SAS Variable Names to the modified labels.
* Use this mapping to rename the columns in your dataset.
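
The two steps above can be sketched on toy data, along with two checks that are worth running after a rename of this kind: which columns had no entry in the mapping, and which new names occur more than once. The `to_column_name` helper, the toy frames, and the `unmapped` and `repeated` names are hypothetical. The helper is only assumed to mirror the notebook's `clean_label` logic.

```python
import re
import pandas as pd

def to_column_name(label):
    # Hypothetical helper mirroring the notebook's clean_label logic:
    # drop everything except letters, digits, and whitespace, then
    # collapse whitespace runs into underscores.
    label = re.sub(r"[^a-zA-Z0-9\s]", "", label)
    return re.sub(r"\s+", "_", label)

# Toy codebook and data (hypothetical values).
toy_cols = pd.DataFrame({
    "SAS Variable Name": ["_STATE", "COLGSEX1", "LANDSEX1"],
    "Label": ["State FIPS Code", "Are you male or female", "Are you male or female"],
})
toy_df = pd.DataFrame(
    [[1.0, 2.0, 1.0, 80.0]],
    columns=["_STATE", "COLGSEX1", "LANDSEX1", "DIABAGE4"],
)

# Step 1: build the mapping from short codes to cleaned labels.
mapping = dict(zip(toy_cols["SAS Variable Name"],
                   toy_cols["Label"].map(to_column_name)))

# Columns absent from the mapping keep their short code after renaming.
unmapped = [c for c in toy_df.columns if c not in mapping]

# Step 2: apply the mapping.
renamed = toy_df.rename(columns=mapping)

# Shared labels produce repeated column names that need disambiguating.
repeated = renamed.columns[renamed.columns.duplicated()].tolist()

print("Unmapped columns:", unmapped)
print("Repeated names:", repeated)
```

Building the dictionary with `zip` is equivalent to iterating over rows and is a common alternative when only two columns are involved.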


In [8]:
# Function to clean and format the label
def clean_label(label):
    # Remove every character that is not a letter, digit, or whitespace
    label = re.sub(r'[^a-zA-Z0-9\s]', '', label)
    # Replace spaces with underscores
    label = re.sub(r'\s+', '_', label)
    return label

# Create a dictionary for mapping SAS Variable Names to cleaned Labels
mapping = {row['SAS Variable Name']: clean_label(row['Label']) for _, row in cols_df.iterrows()}

# Print the mapping dictionary to verify the changes
#print("Column Renaming Mapping:")
#for k, v in mapping.items():
#    print(f"{k}: {v}")
# Rename the columns in the actual data DataFrame
df.rename(columns=mapping, inplace=True)
df.head()

Unnamed: 0,State_FIPS_Code,File_Month,Interview_Date,Interview_Month,Interview_Day,Interview_Year,Final_Disposition,Annual_Sequence_Number,Primary_Sampling_Unit,Correct_telephone_number,Private_Residence,Do_you_live_in_college_housing,Resident_of_State,Cellular_Telephone,Are_you_18_years_of_age_or_older,Are_you_male_or_female,Number_of_Adults_in_Household,Are_you_male_or_female.1,Number_of_Adult_men_in_Household,Number_of_Adult_women_in_Household,Respondent_selection,Safe_time_to_talk,Correct_Phone_Number,Is_this_a_cell_phone,Are_you_18_years_of_age_or_older.1,Are_you_male_or_female.2,Do_you_live_in_a_private_residence,Do_you_live_in_college_housing.1,Do_you_currently_live_in_state,Do_you_also_have_a_landline_telephone,Number_of_Adults_in_Household.1,Sex_of_Respondent,General_Health,Number_of_Days_Physical_Health_Not_Good,Number_of_Days_Mental_Health_Not_Good,Poor_Physical_or_Mental_Health,What_is_Primary_Source_of_Health_Insurance,Have_Personal_Health_Care_Provider,Could_Not_Afford_To_See_Doctor,Length_of_time_since_last_routine_checkup,Exercise_in_Past_30_Days,How_Much_Time_Do_You_Sleep,Last_Visited_Dentist_or_Dental_Clinic,Number_of_Permanent_Teeth_Removed,Ever_Diagnosed_with_Heart_Attack,Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease,Ever_Diagnosed_with_a_Stroke,Ever_Told_Had_Asthma,Still_Have_Asthma,Ever_told_you_had_skin_cancer_that_is_not_melanoma,Ever_told_you_had_melanoma_or_any_other_types_of_cancer,Ever_told_you_had_COPD_emphysema_or_chronic_bronchitis,Ever_told_you_had_a_depressive_disorder,Ever_told_you_have_kidney_disease,Told_Had_Arthritis,Ever_told_you_had_diabetes,DIABAGE4,Marital_Status,Education_Level,Own_or_Rent_Home,Household_Landline_Telephones,NUMPHON4,CPDEMO1C,Are_You_A_Veteran,Employment_Status,Number_of_Children_in_Household,Income_Level,Pregnancy_Status,Reported_Weight_in_Pounds,Reported_Height_in_Feet_and_Inches,Are_you_deaf_or_do_you_have_serious_difficulty_hearing,Blind_or_Difficulty_seeing,Difficulty_Concentrating_or_Remembering,
Difficulty_Walking_or_Climbing_Stairs,Difficulty_Dressing_or_Bathing,Difficulty_Doing_Errands_Alone,Have_You_Ever_Had_a_Mammogram,How_Long_since_Last_Mammogram,Have_you_ever_had_a_cervical_cancer_screening_test,Time_since_last_cervical_cancer_screening_test,Have_a_PAP_test_and_recent_cervical_cancer_screening,Have_an_HPV_test_and_recent_cervical_cancer_screening,Had_Hysterectomy,Ever_Had_SigmoidoscopyColonoscopy,Ever_had_a_colonoscopy_sigmoidoscopy_or_both,How_long_since_you_had_colonoscopy,How_long_since_you_had_sigmoidoscopy,Time_Since_Last_SigmoidoscopyColonoscopy,Ever_had_any_other_kind_of_test_for_colorectal_cancer,Ever_had_a_virtual_colonoscopy,How_long_since_you_had_virtual_colonoscopy,Ever_had_stool_test,How_long_since_you_had_stool_test,Ever_had_stool_DNA_test,Was_test_part_of_Cologuard_test,How_long_since_you_had_stool_DNA,Smoked_at_Least_100_Cigarettes,Frequency_of_Days_Now_Smoking,Use_of_Smokeless_Tobacco_Products,Do_you_now_use_ecigarettes_or_vaping_products_every_day_some_days_or_not_at_all,How_old_when_you_first_started_smoking,How_old_when_you_last_smoked,On_average_how_many_cigarettes_do_you_smoke_each_day,Did_you_have_a_CT_or_CAT_scan,Were_any_CT_or_CAT_scans_done_to_check_for_lung_cancer,When_did_you_have_your_most_recent_CT_or_CAT_scan,Days_in_past_30_had_alcoholic_beverage,Avg_alcoholic_drinks_per_day_in_past_30,Binge_Drinking,Most_drinks_on_single_occasion_past_30_days,Adult_flu_shotspray_past_12_mos,When_did_you_receive_your_most_recent_seasonal_flu_shotspray,Pneumonia_shot_ever,Received_Tetanus_Shot_Since_2005,Ever_tested_HIV,Month_and_Year_of_Last_HIV_Test,Do_Any_High_Risk_Situations_Apply,Have_you_ever_been_told_you_tested_positive_for_COVID_19,Have_an_3_month_or_longer_covid_symptoms,Which_was_the_primary_symptom_that_you_experienced,When_was_your_last_blood_test_for_high_blood_sugar,Ever_been_told_by_a_doctor_or_other_health_professional_that_you_have_prediabetes_or_borderline_diabetes,What_type_of_diabetes_do_you_have,Now_Taking_Insulin,
Times_Checked_for_Glycosylated_Hemoglobin,Last_Eye_Exam_Where_Pupils_Were_Dilated,When_was_the_last_time_a_they_took_a_photo_of_the_back_of_your_eye,When_was_the_last_time_you_took_a_course_or_class_in_how_to_manage_your_diabetes,Ever_Had_Feet_Sores_or_Irritations_Lasting_More_Than_Four_Weeks,Told_had_Chronic_Fatigue_Syndrome_CFS_or_Myalgic_Encephalomyelitis_ME,Still_have_Chronic_Fatigue_Syndrome_or_Myalgic_Encephalomyelitis,How_many_hours_a_week_are_you_been_able_to_work,Where_did_you_get_your_last_flu_shotvaccine,Have_you_ever_had_an_HPV_vaccination,How_many_HPV_shots_did_you_receive,Have_you_ever_had_the_shingles_or_zoster_vaccine,Received_at_least_one_COVID19_vaccination,Number_of_COVID19_vaccinations_received,MonthYear_of_first_COVID19_vaccination,MonthYear_of_second_COVID19_vaccination,Did_you_have_a_cough,Did_you_cough_up_phlegm,Did_you_have_shortness_of_breath,Have_you_ever_been_given_a_breathing_test,How_many_years_have_you_smoked_tobacco_products,How_Many_Types_of_Cancer,Age_Told_Had_Cancer,Type_of_Cancer,Currently_Receiving_Treatment_for_Cancer,What_Type_of_Doctor_Provides_Majority_of_Your_Care,Did_You_Receive_a_Summary_of_Cancer_Treatments_Received,Ever_Receive_Instructions_From_A_Doctor_For_FollowUp_CheckUps,Instructions_Written_or_Printed,Did_Health_Insurance_Pay_For_All_Of_Your_Cancer_Treatment,Ever_Denied_Insurance_Coverage_Because_Of_Your_Cancer,Participate_In_Clinical_Trial_As_Part_Of_Cancer_Treatment,Currently_Have_Physical_Pain_From_Cancer_Or_Treatment,Is_Pain_Under_Control,Ever_Had_PSA_Test,Time_Since_Most_Recent_PSA_Test,What_was_the_MAIN_reason_you_had_this_PSA_test,Who_first_suggested_this_PSA_test,Did_you_talk_about_the_advantages_or_disadvantages_of_PSA_test,Have_you_experienced_confusion_or_memory_loss_that_is_happening_more_often_or_is_getting_worse,Given_up_daytoday_chores_due_to_confusion_or_memory_loss,Need_assistance_with_daytoday_activities_due_to_confusion_or_memory_loss,When_you_need_help_with_daytoday_activities_are_you_able_to_ge
t_it,Does_confusion_or_memory_loss_interfere_with_work_or_social_activities,Have_you_discussed_your_confusion_or_memory_loss_with_a_health_care_professional,Provided_regular_care_for_family_or_friend,Relationship_Of_Person_To_Whom_You_Are_Giving_Care,How_Long_Provided_Care_For_Person,How_Many_Hours_Do_You_Provide_Care_For_Person,What_Is_The_Major_Health_Problem_Illness_Disability_For_Care_For_Person,Does_Person_Being_Cared_For_Have_Alzheimers_Disease,Managed_personal_care,Managed_household_tasks,Do_you_expect_to_have_a_relative_you_will_need_to_provide_care_for,Live_With_Anyone_Depressed_Mentally_Ill_Or_Suicidal,Live_With_a_Problem_DrinkerAlcoholic,Live_With_Anyone_Who_Used_Illegal_Drugs_or_Abused_Prescriptions,Live_With_Anyone_Who_Served_TIme_in_Prison_or_Jail,Were_Your_Parents_DivorcedSeperated,How_Often_Did_Your_Parents_Beat_Each_Other_Up,How_Often_Did_A_Parent_Physically_Hurt_You_In_Any_Way,How_Often_Did_A_Parent_Swear_At_You,How_Often_Did_Anyone_Ever_Touch_You_Sexually,How_Often_Did_Anyone_Make_You_Touch_Them_Sexually,How_Often_Did_Anyone_Ever_Force_You_to_Have_Sex,Did_an_adult_make_you_feel_safe_and_protected,Did_an_adult_make_sure_basic_needs_were_met,Satisfaction_with_life,How_often_get_emotional_support_needed,How_often_do_you_feel_socially_isolated_from_others,Have_you_lost_employment_or_had_hours_reduced,During_the_past_12_months_have_you_received_food_stamps,How_often_did_the_food_that_you_bought_not_last_and_you_didnt_have_money_to_get_more,Were_you_not_able_to_pay_your_bills,Were_you_not_able_to_pay_utility_bills_or_threatened_to_lose_service,Has_a_lack_of_reliable_transportation_kept_you_from_appointments_meetings_work_or_getting_things_needed,How_often_have_you_felt_this_kind_of_stress,During_the_past_30_days_on_how_many_days_did_you_use_marijuana_or_hashish,Did_you_smoke_marijuana_or_cannabis,Did_you_eat_marijuana_or_cannabis,Did_you_vape_marijuana_or_cannabis,Did_you_dab_marijuana_or_cannabis,Did_you_use_marijuana_or_cannabis_some_other_way,USEMRJN
4,Interval_Since_Last_Smoked,Stopped_Smoking_in_past_12_months,Do_you_usually_smoke_menthol_cigarettes,Do_you_usually_use_menthol_ecigarettes,Have_you_heard_of_heated_tobacco_products,Asked_during_checkup_if_you_drink_alchohol,Asked_in_person_or_by_form_how_much_you_drink,Asked_whether_you_drank_5_FOR_MEN_4_FOR_WOMEN_or_more_alcoholic_drinks_on_an_occasion,Offered_advice_about_what_level_of_drinking_is_harmful_or_risky,Were_you_advised_to_reduce_or_quit_your_drinking,Any_Firearms_in_Home,Any_Firearms_Loaded,Any_Loaded_Firearms_Also_Unlocked,Gender_of_child,Childs_sex_at_birth,Relationship_to_child,Hlth_pro_ever_said_child_has_asthma,Child_still_have_asthma,Are_you_male_or_female.3,Sexual_orientation,Sexual_orientation.1,Do_you_consider_yourself_to_be_transgender,Have_you_have_sexual_intercourse,Did_you_do_anything_to_keep_from_getting_pregnant,What_did_you_do_to_keep_you_from_getting_pregnant,Are_You_Doing_Anything_to_Keep_From_Getting_Pregnant,Where_did_you_get_what_you_used_to_prevent_pregnancy,What_was_main_reason_for_not_doing_anything_to_keep_you_from_getting_pregnant,What_is_your_preferred_birth_control_method,How_do_other_people_usually_classify_you_in_this_country,How_often_do_you_think_about_your_race,Were_you_treated_worse_than_the_same_or_better_than_people_of_other_races,How_do_you_feel_you_were_treated_at_work_compared_to_people_of_other_races_in_past_12_months,When_seeking_health_care_past_12_months_was_experience_worse_same_better_than_people_of_other_races,Times_past_30_days_felt_physical_symptoms_because_of_treatment_due_to_your_race,Questionnaire_Version_Identifier,Language_identifier,Metropolitan_Status,UrbanRural_Status,Metropolitan_Status_Code,Sample_Design_Stratification_Variable,Stratum_weight,Raw_weighting_factor_used_in_raking,Design_weight_use_in_raking,Imputed_raceethnicity_value,Child_Hispanic_Latinoa_or_Spanish_origin_calculated_variable,Child_NonHispanic_Race_including_Multiracial,Preferred_Child_Race_Categories,Four_level_child_age,Fin
al_child_weight_Landline_and_CellPhone_data,Dual_Phone_Use_Categories,Dual_Phone_Use_Correction_Factor,Truncated_design_weight_used_in_adult_combined_land_line_and_cell_phone_raking,Final_weight_Landline_and_cellphone_data,Adults_with_good_or_better_health,Computed_Physical_Health_Status,Computed_Mental_Health_Status,Have_any_health_insurance,Respondents_aged_1864_with_health_insurance,Leisure_Time_Physical_Activity_Calculated_Variable,Adults_aged_18_that_have_had_permanent_teeth_extracted,Adults_aged_65_who_have_had_all_their_natural_teeth_extracted,Adults_that_have_visited_a_dentist_dental_hygenist_or_dental_clinic_within_the_past_year,Ever_had_CHD_or_MI,Lifetime_Asthma_Calculated_Variable,Current_Asthma_Calculated_Variable,Computed_Asthma_Status,Respondents_diagnosed_with_arthritis,Computed_Preferred_Race,Calculated_nonHispanic_Race_including_multiracial,Hispanic_Latinoa_or_Spanish_origin_calculated_variable,Computed_RaceEthnicity_grouping,Create_Computed_NonHispanic_WhitesAll_Others_Race_Categories_RaceEthnic_Group_Codes_Used_In_PostStratification_Variable,Computed_Five_level_raceethnicity_category,Computed_race_groups_used_for_internet_prevalence_tables,Calculated_sex_variable,Reported_age_in_fiveyear_age_categories_calculated_variable,Reported_age_in_two_age_groups_calculated_variable,Imputed_Age_value_collapsed_above_80,Imputed_age_in_six_groups,Computed_Height_in_Inches,Computed_Height_in_Meters,Computed_Weight_in_Kilograms,Computed_body_mass_index,Computed_body_mass_index_categories,Overweight_or_obese_calculated_variable,Computed_number_of_children_in_household,Computed_level_of_education_completed_categories,Computed_income_categories,Women_respondents_aged_40_who_have_had_a_mammogram_in_the_past_two_years,Women_respondents_aged_5074_that_have_had_a_mammogram_in_the_past_two_years,Had_colonoscopy_calculated_variable,Respondents_aged_4575_who_have_had_a_colonoscopy_within_the_past_ten_years,Had_sigmoidoscopy_calculated_variable,Respondents_aged_4575_who_ha
ve_had_a_sigmoidoscopy_within_the_past_five_years,Respondents_aged_4575_who_have_had_a_sigmoidoscopy_within_the_past_ten_years,Respondents_aged_4575_who_have_had_a_stool_test_within_the_past_year,Respondents_aged_4575_who_have_had_a_stool_DNA_test_within_the_past_three_years,Respondents_aged_4575_who_have_had_a_virtual_colonoscopy_within_the_past_five_years,Respondents_aged_4575_who_have_had_a_sigmoidoscopy_within_the_past_ten_years_and_a_blood_stool_test_in_the_past_year,Respondents_aged_4575_who_have_fully_met_the_USPSTF_recommendations,Computed_Smoking_Status,Current_Smoking_Calculated_Variable,Current_Ecigarette_User_Calculated_Variable,Number_of_years_smoked_cigarettes,Number_of_packs_of_cigarettes_smoked_per_day,Years_smoked_reported_packs_per_day,Number_of_years_since_quit_smoking_cigarettes,Smoking_Group,Lung_cancer_screening_recommendation_status,Drink_any_alcoholic_beverages_in_past_30_days,Computed_drinkoccasionsperday,Binge_Drinking_Calculated_Variable,Computed_number_of_drinks_of_alcohol_beverages_per_week,Heavy_Alcohol_Consumption_Calculated_Variable,Flu_Shot_Calculated_Variable,Pneumonia_Vaccination_Calculated_Variable,Ever_been_tested_for_HIV_calculated_variable
0,1.0,1.0,2032022,2,3,2022,1100.0,2022000001,2022000001.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,2.0,88.0,88.0,,99.0,1.0,2.0,1.0,2.0,8.0,,,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,1.0,80.0,1.0,6.0,1.0,1.0,1.0,2.0,2.0,7.0,88.0,99.0,,9999.0,9999.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,,,,2.0,1.0,3.0,2.0,3.0,,2.0,,,,,,,,2.0,,3.0,4.0,,,,2.0,,,888.0,,,,1.0,92021.0,2.0,3.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,5.0,2.0,2.0,5.0,2.0,2.0,2.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,2.0,11011.0,37.42,2.0,74.84,1.0,9.0,,,,,1.0,0.52,813.92,487.61,1.0,1.0,1.0,9.0,9.0,2.0,9.0,9.0,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,13.0,2.0,80.0,6.0,,,,,,9.0,1.0,4.0,9.0,1.0,,1.0,,1.0,,,,,,,,4.0,1.0,1.0,,,,,4.0,,2.0,0.0,1.0,0.0,1.0,1.0,2.0,2.0
1,1.0,1.0,2042022,2,4,2022,1100.0,2022000002,2022000002.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,1.0,88.0,88.0,,3.0,2.0,2.0,8.0,2.0,6.0,,,2.0,2.0,2.0,2.0,,1.0,1.0,2.0,2.0,2.0,2.0,3.0,,3.0,4.0,1.0,1.0,2.0,1.0,2.0,2.0,88.0,5.0,,150.0,503.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,4.0,2.0,,,,1.0,1.0,1.0,4.0,,,2.0,,,,,,,,2.0,,3.0,1.0,,,,2.0,,,888.0,,,,2.0,,2.0,4.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,5.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,2.0,1.0,5.0,11011.0,37.42,1.0,37.42,1.0,9.0,,,,,1.0,0.52,406.96,432.1,1.0,1.0,1.0,1.0,9.0,2.0,9.0,9.0,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,13.0,2.0,80.0,6.0,63.0,160.0,6804.0,2657.0,3.0,2.0,1.0,2.0,3.0,2.0,,1.0,,2.0,,,,,,,,4.0,1.0,1.0,,,,,4.0,,2.0,0.0,1.0,0.0,1.0,2.0,2.0,2.0
2,1.0,1.0,2022022,2,2,2022,1100.0,2022000003,2022000003.0,1.0,1.0,,1.0,2.0,1.0,,1.0,2.0,,,,,,,,,,,,,,2.0,2.0,2.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,5.0,,,2.0,2.0,2.0,2.0,,1.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,6.0,1.0,2.0,,1.0,2.0,7.0,88.0,10.0,,140.0,502.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,,,,1.0,2.0,,,,,2.0,,,,,,,,2.0,,3.0,1.0,,,,2.0,,,888.0,,,,2.0,,2.0,7.0,2.0,,2.0,1.0,1.0,9.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,2.0,3.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,2.0,11011.0,37.42,1.0,37.42,1.0,9.0,,,,,1.0,0.52,406.96,366.74,1.0,2.0,2.0,1.0,1.0,1.0,9.0,,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,8.0,1.0,56.0,5.0,62.0,157.0,6350.0,2561.0,3.0,2.0,1.0,4.0,6.0,1.0,1.0,2.0,3.0,2.0,3.0,3.0,,,,2.0,2.0,4.0,1.0,1.0,,,,,4.0,,2.0,0.0,1.0,0.0,1.0,,,2.0
3,1.0,1.0,2032022,2,3,2022,1100.0,2022000004,2022000004.0,1.0,1.0,,1.0,2.0,1.0,,3.0,,2.0,1.0,2.0,,,,,,,,,,,2.0,1.0,88.0,88.0,,99.0,1.0,2.0,1.0,1.0,7.0,,,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,,1.0,4.0,1.0,2.0,,1.0,2.0,7.0,88.0,77.0,2.0,140.0,505.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,1.0,7.0,2.0,1.0,1.0,3.0,,,2.0,,,,,,,,1.0,2.0,3.0,1.0,17.0,999.0,2.0,1.0,2.0,,888.0,,,,1.0,102021.0,1.0,4.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,3.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,1.0,11011.0,37.42,3.0,112.26,1.0,9.0,,,,,1.0,0.52,1220.88,1681.79,1.0,1.0,1.0,9.0,9.0,1.0,9.0,9.0,9.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,14.0,3.0,73.0,6.0,65.0,165.0,6350.0,2330.0,2.0,1.0,1.0,2.0,9.0,,,1.0,,2.0,,,,,,,,2.0,2.0,1.0,56.0,0.1,6.0,,3.0,2.0,2.0,0.0,1.0,0.0,1.0,9.0,9.0,2.0
4,1.0,1.0,2022022,2,2,2022,1100.0,2022000005,2022000005.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,4.0,2.0,88.0,88.0,7.0,2.0,2.0,1.0,1.0,9.0,,,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,5.0,1.0,2.0,,2.0,2.0,5.0,88.0,5.0,2.0,119.0,502.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,,,,,,,,,,,,,,2.0,,3.0,1.0,,,,1.0,2.0,,203.0,2.0,88.0,2.0,2.0,,1.0,4.0,2.0,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,5.0,2.0,2.0,5.0,2.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,1.0,11011.0,37.42,2.0,74.84,1.0,9.0,,,,,1.0,0.52,813.92,2111.21,2.0,2.0,1.0,1.0,1.0,1.0,9.0,,9.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,5.0,1.0,43.0,3.0,62.0,157.0,5398.0,2177.0,2.0,1.0,1.0,3.0,3.0,1.0,,,,,,,,,,,,4.0,1.0,1.0,,,,,4.0,,1.0,10.0,1.0,140.0,1.0,,,2.0


## **Heart Disease related features**<a id='Heart_Disease_related_features'></a>
[Contents](#Contents)

After several days of research and analysis of the dataset's features, we have identified the following key features for heart disease assessment:

* **Target Variable (Dependent Variable):**
    * Heart_disease: "Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease"
* **Demographics:**
    * Gender: Are_you_male_or_female
    * Race: Computed_race_groups_used_for_internet_prevalence_tables
    * Age: Imputed_Age_value_collapsed_above_80
* **Medical History:**
    * General_Health
    * Have_Personal_Health_Care_Provider
    * Could_Not_Afford_To_See_Doctor
    * Length_of_time_since_last_routine_checkup
    * Ever_Diagnosed_with_Heart_Attack
    * Ever_Diagnosed_with_a_Stroke
    * Ever_told_you_had_a_depressive_disorder
    * Ever_told_you_have_kidney_disease
    * Ever_told_you_had_diabetes
    * Reported_Weight_in_Pounds
    * Reported_Height_in_Feet_and_Inches
    * Computed_body_mass_index_categories
    * Difficulty_Walking_or_Climbing_Stairs
    * Computed_Physical_Health_Status
    * Computed_Mental_Health_Status
    * Computed_Asthma_Status
* **Life Style:**
    * Leisure_Time_Physical_Activity_Calculated_Variable
    * Smoked_at_Least_100_Cigarettes
    * Computed_Smoking_Status
    * Binge_Drinking_Calculated_Variable
    * Computed_number_of_drinks_of_alcohol_beverages_per_week
    * Exercise_in_Past_30_Days
    * How_Much_Time_Do_You_Sleep


## **Selection Heart disease related features**<a id='Selection_Heart_disease_related_features'></a>
[Contents](#Contents)

In [9]:
# Select the main features directly related to heart disease:
df = df[["Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease", # Target Variable
         "Are_you_male_or_female", #Demographics
         "Computed_race_groups_used_for_internet_prevalence_tables",#Demographics
         "Imputed_Age_value_collapsed_above_80",#Demographics
         "General_Health", #Medical History
         "Have_Personal_Health_Care_Provider",#Medical History
         "Could_Not_Afford_To_See_Doctor",#Medical History
         "Length_of_time_since_last_routine_checkup",#Medical History
         "Ever_Diagnosed_with_Heart_Attack",#Medical History
         "Ever_Diagnosed_with_a_Stroke",#Medical History
         "Ever_told_you_had_a_depressive_disorder",#Medical History
         "Ever_told_you_have_kidney_disease",#Medical History
         "Ever_told_you_had_diabetes",#Medical History
         "Reported_Weight_in_Pounds",#Medical History
         "Reported_Height_in_Feet_and_Inches",#Medical History
         "Computed_body_mass_index_categories",#Medical History
         "Difficulty_Walking_or_Climbing_Stairs",#Medical History
         "Computed_Physical_Health_Status",#Medical History
         "Computed_Mental_Health_Status",#Medical History
         "Computed_Asthma_Status",#Medical History
         "Leisure_Time_Physical_Activity_Calculated_Variable",#Life Style
         "Smoked_at_Least_100_Cigarettes",#Life Style
         "Computed_Smoking_Status",#Life Style
         "Binge_Drinking_Calculated_Variable",#Life Style
         "Computed_number_of_drinks_of_alcohol_beverages_per_week",#Life Style
         "Exercise_in_Past_30_Days",#Life Style
         "How_Much_Time_Do_You_Sleep"#Life Style
        ]]
df.head()

Unnamed: 0,Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease,Are_you_male_or_female,Are_you_male_or_female.1,Are_you_male_or_female.2,Are_you_male_or_female.3,Computed_race_groups_used_for_internet_prevalence_tables,Imputed_Age_value_collapsed_above_80,General_Health,Have_Personal_Health_Care_Provider,Could_Not_Afford_To_See_Doctor,Length_of_time_since_last_routine_checkup,Ever_Diagnosed_with_Heart_Attack,Ever_Diagnosed_with_a_Stroke,Ever_told_you_had_a_depressive_disorder,Ever_told_you_have_kidney_disease,Ever_told_you_had_diabetes,Reported_Weight_in_Pounds,Reported_Height_in_Feet_and_Inches,Computed_body_mass_index_categories,Difficulty_Walking_or_Climbing_Stairs,Computed_Physical_Health_Status,Computed_Mental_Health_Status,Computed_Asthma_Status,Leisure_Time_Physical_Activity_Calculated_Variable,Smoked_at_Least_100_Cigarettes,Computed_Smoking_Status,Binge_Drinking_Calculated_Variable,Computed_number_of_drinks_of_alcohol_beverages_per_week,Exercise_in_Past_30_Days,How_Much_Time_Do_You_Sleep
0,2.0,,,,,1.0,80.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,9999.0,9999.0,,2.0,1.0,1.0,3.0,2.0,2.0,4.0,1.0,0.0,2.0,8.0
1,2.0,,,,,1.0,80.0,1.0,2.0,2.0,8.0,2.0,2.0,2.0,2.0,3.0,150.0,503.0,3.0,2.0,1.0,1.0,3.0,2.0,2.0,4.0,1.0,0.0,2.0,6.0
2,2.0,,2.0,,,1.0,56.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,3.0,140.0,502.0,3.0,2.0,2.0,2.0,3.0,1.0,2.0,4.0,1.0,0.0,1.0,5.0
3,2.0,,,,,1.0,73.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,3.0,140.0,505.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,0.0,1.0,7.0
4,2.0,,,,,1.0,43.0,4.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,3.0,119.0,502.0,2.0,2.0,2.0,1.0,3.0,1.0,2.0,4.0,1.0,140.0,1.0,9.0


In [10]:
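# Now let's look at the shape of df after feature selection
# (30 columns rather than the 27 names listed, because four columns share the name "Are_you_male_or_female"):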
shape = df.shape
print("Number of rows:", shape[0], "\nNumber of columns:", shape[1])

Number of rows: 445132 
Number of columns: 30


## **Imputing Missing Data, Transforming Columns and Features Engineering**<a id='Imputing_missing_Data_and_transforming_columns'></a>
[Contents](#Contents)

In this step, we address missing data, map categorical values, and rename columns for improved data quality and clarity. The key actions taken are as follows:

* Replace Specific Values with NaN: Identify and replace erroneous or placeholder values with NaN to standardize missing data representation.
* Calculate Value Distribution: Determine the distribution of existing values to understand the data's baseline state.
* Impute Missing Values: Use a function to impute missing values based on the calculated distribution, ensuring the data remains representative of its original characteristics.
* Map Categorical Values: Apply a mapping dictionary to convert numeric codes into meaningful categorical labels.
* Rename Columns: Update column names to reflect their contents accurately and improve dataset readability.
* Feature Engineering: Create new features that may enhance the predictive power of our models.

These steps are essential for building a reliable and robust model for heart disease prediction.


### **Distribution-Based Imputation**<a id='Distribution_Based_Imputation'></a>
[Contents](#Contents)

To deal with missing data in this project, we'll use **Distribution-Based Imputation**:

* **Introduction**
    * Distribution-Based: The imputation process relies on the existing distribution of the categories in the dataset.
    * Imputation: The act of filling in missing values.
* **Why This Method Works**
    * Preserves Original Distribution: By using the observed proportions to guide the imputation, the method maintains the original distribution of each variable's categories.
    * Random Imputation: Randomly selecting values based on the existing distribution prevents systematic biases that could arise from deterministic imputation methods.
    * Scalability: This approach can be easily scaled to larger datasets and applied to other categorical variables with missing values.
* **Advantages**
    * Bias Minimization: Ensures that the imputed values do not skew the dataset in favor of any particular category.
    * Simplicity: The method is straightforward to implement and understand.
    * Flexibility: Can be adapted to any categorical variable with missing values.

This method is particularly useful in scenarios where preserving the natural distribution of data is crucial for subsequent analysis or modeling tasks. 
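
As a rough illustration of the idea, the sketch below fills the gaps of a single categorical column in one vectorized step. It is a minimal example under stated assumptions, not the notebook's own implementation: the helper name `impute_from_distribution`, the `seed` parameter, and the toy series are all hypothetical.

```python
import numpy as np
import pandas as pd


def impute_from_distribution(series: pd.Series, seed: int = 42) -> pd.Series:
    """Fill NaNs by sampling from the observed category shares (hypothetical helper)."""
    missing = series.isna()
    if not missing.any():
        return series.copy()

    # Observed shares of each category, ignoring missing values.
    shares = series.value_counts(normalize=True, dropna=True)

    # Draw one replacement per missing entry, weighted by the observed shares.
    rng = np.random.default_rng(seed)
    draws = rng.choice(shares.index.to_numpy(), size=int(missing.sum()), p=shares.to_numpy())

    filled = series.copy()
    filled[missing] = draws
    return filled


# Toy column: codes 1 and 2 with two missing entries.
toy = pd.Series([1.0, 2.0, 2.0, np.nan, 1.0, np.nan, 2.0, 2.0])
filled = impute_from_distribution(toy)
print(filled.value_counts(dropna=False))
```

Seeding the generator makes the imputed values reproducible across runs. Applying `np.random.choice` row by row through `df.apply` gives the same kind of result, but it is typically slower on a dataset of this size.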

In [11]:
# Re-examine each feature: missing-data count and percentage, unique count, and data type:
summarize_df(df)

Unnamed: 0,unique_count,data_types,missing_counts,missing_percentage
Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease,4,float64,2,0.0
Are_you_male_or_female,2,float64,445111,100.0
Are_you_male_or_female,6,float64,401436,90.18
Are_you_male_or_female,5,float64,96053,21.58
Are_you_male_or_female,4,float64,365705,82.16
Computed_race_groups_used_for_internet_prevalence_tables,7,float64,0,0.0
Imputed_Age_value_collapsed_above_80,63,float64,0,0.0
General_Health,7,float64,3,0.0
Have_Personal_Health_Care_Provider,5,float64,2,0.0
Could_Not_Afford_To_See_Doctor,4,float64,4,0.0


### **Column 1: Are_you_male_or_female**<a id='Column_1_Are_you_male_or_female'></a>
[Contents](#Contents)

We have four versions of the same column, so let's keep the one with the least missing data (`21.58%` missing).
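
One way to make that choice programmatically is to compare the share of missing values across all columns that carry the same label. The sketch below is only an illustration on toy data; the helper name `least_missing_position` and the column name `sex` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def least_missing_position(frame: pd.DataFrame, name: str) -> int:
    """Return the positional index of the copy of `name` with the fewest NaNs."""
    # Positions of every column whose label equals `name`.
    positions = np.flatnonzero(frame.columns == name)
    # Share of missing values for each of those columns, in positional order.
    missing_share = frame.iloc[:, positions].isna().mean().to_numpy()
    return int(positions[missing_share.argmin()])


# Toy frame with three columns that share the label "sex".
toy = pd.DataFrame(
    [[1.0, np.nan, 2.0, np.nan],
     [2.0, np.nan, 1.0, 1.0],
     [1.0, np.nan, np.nan, np.nan],
     [2.0, 2.0, 1.0, np.nan]],
    columns=["target", "sex", "sex", "sex"],
)
best = least_missing_position(toy, "sex")
print(best)
```

Working with positions rather than labels avoids the ambiguity of duplicate column names, which is also why the notebook renames the four columns before dropping three of them.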

In [12]:
# let's get the column names in a list:
print(df.columns)

Index(['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease',
       'Are_you_male_or_female', 'Are_you_male_or_female',
       'Are_you_male_or_female', 'Are_you_male_or_female',
       'Computed_race_groups_used_for_internet_prevalence_tables',
       'Imputed_Age_value_collapsed_above_80', 'General_Health',
       'Have_Personal_Health_Care_Provider', 'Could_Not_Afford_To_See_Doctor',
       'Length_of_time_since_last_routine_checkup',
       'Ever_Diagnosed_with_Heart_Attack', 'Ever_Diagnosed_with_a_Stroke',
       'Ever_told_you_had_a_depressive_disorder',
       'Ever_told_you_have_kidney_disease', 'Ever_told_you_had_diabetes',
       'Reported_Weight_in_Pounds', 'Reported_Height_in_Feet_and_Inches',
       'Computed_body_mass_index_categories',
       'Difficulty_Walking_or_Climbing_Stairs',
       'Computed_Physical_Health_Status', 'Computed_Mental_Health_Status',
       'Computed_Asthma_Status',
       'Leisure_Time_Physical_Activity_Calculated_Variable',
       'Smoked_at_Le

In [13]:
# Give the four duplicated gender columns unique names (all other names are unchanged):
df.columns = ['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease', # This is my target variable!!!
       'Are_you_male_or_female_1', 'Are_you_male_or_female_2',
       'Are_you_male_or_female_3', 'Are_you_male_or_female_4',
       'Computed_race_groups_used_for_internet_prevalence_tables',
       'Imputed_Age_value_collapsed_above_80', 'General_Health',
       'Have_Personal_Health_Care_Provider', 'Could_Not_Afford_To_See_Doctor',
       'Length_of_time_since_last_routine_checkup',
       'Ever_Diagnosed_with_Heart_Attack', 'Ever_Diagnosed_with_a_Stroke',
       'Ever_told_you_had_a_depressive_disorder',
       'Ever_told_you_have_kidney_disease', 'Ever_told_you_had_diabetes',
       'Reported_Weight_in_Pounds', 'Reported_Height_in_Feet_and_Inches',
       'Computed_body_mass_index_categories',
       'Difficulty_Walking_or_Climbing_Stairs',
       'Computed_Physical_Health_Status', 'Computed_Mental_Health_Status',
       'Computed_Asthma_Status',
       'Leisure_Time_Physical_Activity_Calculated_Variable',
       'Smoked_at_Least_100_Cigarettes', 'Computed_Smoking_Status',
       'Binge_Drinking_Calculated_Variable',
       'Computed_number_of_drinks_of_alcohol_beverages_per_week',
       'Exercise_in_Past_30_Days', 'How_Much_Time_Do_You_Sleep']
# Re-examine each feature: missing-data count and percentage, unique count, and data type:
summarize_df(df)


Unnamed: 0,unique_count,data_types,missing_counts,missing_percentage
Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease,4,float64,2,0.0
Are_you_male_or_female_1,2,float64,445111,100.0
Are_you_male_or_female_2,6,float64,401436,90.18
Are_you_male_or_female_3,5,float64,96053,21.58
Are_you_male_or_female_4,4,float64,365705,82.16
Computed_race_groups_used_for_internet_prevalence_tables,7,float64,0,0.0
Imputed_Age_value_collapsed_above_80,63,float64,0,0.0
General_Health,7,float64,3,0.0
Have_Personal_Health_Care_Provider,5,float64,2,0.0
Could_Not_Afford_To_See_Doctor,4,float64,4,0.0


As shown above, `Are_you_male_or_female_3` has the least missing data, so let's drop 'Are_you_male_or_female_1', 'Are_you_male_or_female_2' and 'Are_you_male_or_female_4':

In [14]:
#Let's drop the unnecessary columns:
columns_to_drop = ['Are_you_male_or_female_1', 'Are_you_male_or_female_2', 'Are_you_male_or_female_4']
df = df.drop(columns=columns_to_drop)

# Re-examine each feature: missing-data count and percentage, unique count, and data type:
summarize_df(df)

Unnamed: 0,unique_count,data_types,missing_counts,missing_percentage
Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease,4,float64,2,0.0
Are_you_male_or_female_3,5,float64,96053,21.58
Computed_race_groups_used_for_internet_prevalence_tables,7,float64,0,0.0
Imputed_Age_value_collapsed_above_80,63,float64,0,0.0
General_Health,7,float64,3,0.0
Have_Personal_Health_Care_Provider,5,float64,2,0.0
Could_Not_Afford_To_See_Doctor,4,float64,4,0.0
Length_of_time_since_last_routine_checkup,7,float64,3,0.0
Ever_Diagnosed_with_Heart_Attack,4,float64,4,0.0
Ever_Diagnosed_with_a_Stroke,4,float64,2,0.0


In [15]:
# View the column's value counts:
df.Are_you_male_or_female_3.value_counts(dropna=False)

2.00    174948
1.00    173639
NaN      96053
3.00       328
9.00       113
7.00        51
Name: Are_you_male_or_female_3, dtype: int64

**Are_you_male_or_female_3:**
* 2: Female
* 1: Male
* 3: Nonbinary
* 7: Don’t know/Not Sure
* 9: Refused

Based on the above, let's change 7 and 9 to NaN:

In [16]:
# Replace 7 and 9 with NaN (assign the result back instead of calling inplace on a column selection)
df['Are_you_male_or_female_3'] = df['Are_you_male_or_female_3'].replace([7, 9], np.nan)
df.Are_you_male_or_female_3.value_counts(dropna=False)

2.00    174948
1.00    173639
NaN      96217
3.00       328
Name: Are_you_male_or_female_3, dtype: int64
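
The 7/9 recoding shown above has to be limited to the columns where those codes really mean "don't know" or "refused". The sketch below illustrates that with a small hypothetical helper, `codes_to_nan`, on toy data; the column names `gender_code` and `sleep_hours` are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def codes_to_nan(frame: pd.DataFrame, columns: list, codes=(7, 9)) -> pd.DataFrame:
    """Return a copy of `frame` with the given codes set to NaN in `columns` only."""
    out = frame.copy()
    out[columns] = out[columns].replace(list(codes), np.nan)
    return out


toy = pd.DataFrame({
    "gender_code": [1.0, 2.0, 7.0, 9.0, 3.0],  # 7 and 9 are non-answers here
    "sleep_hours": [7.0, 9.0, 8.0, 6.0, 5.0],  # 7 and 9 are real values here
})
cleaned = codes_to_nan(toy, ["gender_code"])
print(cleaned)
```

Because the replacement is scoped to the listed columns, legitimate 7s and 9s elsewhere (such as hours of sleep) are left untouched.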

In [17]:
# Calculate the distribution of existing values
value_counts = df['Are_you_male_or_female_3'].value_counts(normalize=True, dropna=True)
print("Original distribution:\n", value_counts)

Original distribution:
 2.00   0.50
1.00   0.50
3.00   0.00
Name: Are_you_male_or_female_3, dtype: float64


In [18]:
# Function to impute missing values based on distribution
def impute_missing_gender(row):
    if pd.isna(row['Are_you_male_or_female_3']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Are_you_male_or_female_3']

# Apply the imputation function
df['Are_you_male_or_female_3'] = df.apply(impute_missing_gender, axis=1)

In [19]:
# Verify the imputation
imputed_value_counts = df['Are_you_male_or_female_3'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)


Distribution after imputation:
 2.00    223217
1.00    221503
3.00       412
Name: Are_you_male_or_female_3, dtype: int64


As we can see above, this column no longer has missing data and the category proportions are preserved (random imputation worked as expected).
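
To put a number on "preserved", the counts printed before and after imputation can be turned into shares and compared directly. This is a small sketch that copies the counts from the outputs above; the variable names are hypothetical.

```python
import pandas as pd

# Counts copied from the notebook outputs (codes: 1 = male, 2 = female, 3 = nonbinary).
before = pd.Series({2: 174948, 1: 173639, 3: 328})  # observed values only
after = pd.Series({2: 223217, 1: 221503, 3: 412})   # after imputation

before_share = before / before.sum()
after_share = after / after.sum()

# Largest absolute change in any category's share.
max_shift = (after_share - before_share).abs().max()
print(pd.DataFrame({"before": before_share, "after": after_share}))
print("Largest shift in share:", max_shift)
```

The largest shift is well under one tenth of a percentage point, which is the kind of small sampling noise expected from random draws.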

In [20]:
# Create a mapping dictionary:
gender_mapping = {2: 'female', 1: 'male', 3: 'nonbinary'}

# Apply the mapping to the gender column:
df['Are_you_male_or_female_3'] = df['Are_you_male_or_female_3'].map(gender_mapping)

# Rename the column:
df.rename(columns={'Are_you_male_or_female_3': 'gender'}, inplace=True)

df.head()

Unnamed: 0,Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease,gender,Computed_race_groups_used_for_internet_prevalence_tables,Imputed_Age_value_collapsed_above_80,General_Health,Have_Personal_Health_Care_Provider,Could_Not_Afford_To_See_Doctor,Length_of_time_since_last_routine_checkup,Ever_Diagnosed_with_Heart_Attack,Ever_Diagnosed_with_a_Stroke,Ever_told_you_had_a_depressive_disorder,Ever_told_you_have_kidney_disease,Ever_told_you_had_diabetes,Reported_Weight_in_Pounds,Reported_Height_in_Feet_and_Inches,Computed_body_mass_index_categories,Difficulty_Walking_or_Climbing_Stairs,Computed_Physical_Health_Status,Computed_Mental_Health_Status,Computed_Asthma_Status,Leisure_Time_Physical_Activity_Calculated_Variable,Smoked_at_Least_100_Cigarettes,Computed_Smoking_Status,Binge_Drinking_Calculated_Variable,Computed_number_of_drinks_of_alcohol_beverages_per_week,Exercise_in_Past_30_Days,How_Much_Time_Do_You_Sleep
0,2.0,female,1.0,80.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,9999.0,9999.0,,2.0,1.0,1.0,3.0,2.0,2.0,4.0,1.0,0.0,2.0,8.0
1,2.0,male,1.0,80.0,1.0,2.0,2.0,8.0,2.0,2.0,2.0,2.0,3.0,150.0,503.0,3.0,2.0,1.0,1.0,3.0,2.0,2.0,4.0,1.0,0.0,2.0,6.0
2,2.0,male,1.0,56.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,3.0,140.0,502.0,3.0,2.0,2.0,2.0,3.0,1.0,2.0,4.0,1.0,0.0,1.0,5.0
3,2.0,female,1.0,73.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,3.0,140.0,505.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,0.0,1.0,7.0
4,2.0,male,1.0,43.0,4.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,3.0,119.0,502.0,2.0,2.0,2.0,1.0,3.0,1.0,2.0,4.0,1.0,140.0,1.0,9.0


In [21]:
# Re-examine each feature: missing-data count and percentage, unique count, and data type:
summarize_df(df)

Unnamed: 0,unique_count,data_types,missing_counts,missing_percentage
Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease,4,float64,2,0.0
gender,3,object,0,0.0
Computed_race_groups_used_for_internet_prevalence_tables,7,float64,0,0.0
Imputed_Age_value_collapsed_above_80,63,float64,0,0.0
General_Health,7,float64,3,0.0
Have_Personal_Health_Care_Provider,5,float64,2,0.0
Could_Not_Afford_To_See_Doctor,4,float64,4,0.0
Length_of_time_since_last_routine_checkup,7,float64,3,0.0
Ever_Diagnosed_with_Heart_Attack,4,float64,4,0.0
Ever_Diagnosed_with_a_Stroke,4,float64,2,0.0


### **Column 2: Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease**<a id='Column_2_Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease'></a>
[Contents](#Contents)

In [22]:
#view column counts:
df.Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease.value_counts(dropna=False)

2.00    414176
1.00     26551
7.00      4044
9.00       359
NaN          2
Name: Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease, dtype: int64

**Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease:**
* 2: No
* 1: Yes
* 7: Don’t know/Not Sure
* 9: Refused

Next, let's change 7 and 9 to NaN:

In [23]:
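# Replace 7 and 9 with NaN (assign the result back instead of calling inplace on a column selection)
df['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease'] = df['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease'].replace([7, 9], np.nan)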
df.Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease.value_counts(dropna=False)

2.00    414176
1.00     26551
NaN       4405
Name: Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease, dtype: int64

Again, let's use **Distribution-Based Imputation** for the missing data above:

In [24]:
# Calculate the distribution of existing values
value_counts = df['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease'].value_counts(normalize=True, dropna=True)
print("Original distribution:\n", value_counts)

Original distribution:
 2.00   0.94
1.00   0.06
Name: Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease, dtype: float64


In [25]:
# Function to impute missing values based on distribution
def impute_missing(row):
    if pd.isna(row['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease']

In [26]:
# Apply the imputation function
df['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease'] = df.apply(impute_missing, axis=1)

In [27]:
# Verify the imputation
imputed_value_counts = df['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00    418331
1.00     26801
Name: Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease, dtype: int64


In [28]:
# Verify the imputation again, this time as proportions
imputed_value_counts = df['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease'].value_counts(dropna=False, normalize=True)
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00   0.94
1.00   0.06
Name: Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease, dtype: float64


In [29]:
# Create a mapping dictionary:
heart_disease_mapping = {2: 'no', 1: 'yes'}

# Apply the mapping to the "Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease" column:
df['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease'] = df['Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease'].map(heart_disease_mapping)

# Rename the column:
df.rename(columns={'Ever_Diagnosed_with_Angina_or_Coronary_Heart_Disease': 'heart_disease'}, inplace=True)

In [30]:
# Re-examine each feature: missing-data count and percentage, unique count, and data type:
summarize_df(df)

Unnamed: 0,unique_count,data_types,missing_counts,missing_percentage
heart_disease,2,object,0,0.0
gender,3,object,0,0.0
Computed_race_groups_used_for_internet_prevalence_tables,7,float64,0,0.0
Imputed_Age_value_collapsed_above_80,63,float64,0,0.0
General_Health,7,float64,3,0.0
Have_Personal_Health_Care_Provider,5,float64,2,0.0
Could_Not_Afford_To_See_Doctor,4,float64,4,0.0
Length_of_time_since_last_routine_checkup,7,float64,3,0.0
Ever_Diagnosed_with_Heart_Attack,4,float64,4,0.0
Ever_Diagnosed_with_a_Stroke,4,float64,2,0.0


### **Column 3: Computed_race_groups_used_for_internet_prevalence_tables**<a id='Column_3_Computed_race_groups_used_for_internet_prevalence_tables'></a>
[Contents](#Contents)

In [31]:
#view column counts:
df.Computed_race_groups_used_for_internet_prevalence_tables.value_counts(dropna=False)

1.00    333514
7.00     42977
2.00     35876
4.00     13487
6.00      9744
3.00      7120
5.00      2414
Name: Computed_race_groups_used_for_internet_prevalence_tables, dtype: int64

The good news is that there's no missing data in this column.

**Computed_race_groups_used_for_internet_prevalence_tables:**
* 1: white_only_non_hispanic
* 2: black_only_non_hispanic
* 3: american_indian_or_alaskan_native_only_non_hispanic
* 4: asian_only_non_hispanic
* 5: native_hawaiian_or_other_pacific_islander_only_non_hispanic
* 6: multiracial_non_hispanic
* 7: hispanic


In [32]:
# Create a mapping dictionary:
race_mapping = {1: 'white_only_non_hispanic',
2: 'black_only_non_hispanic',
3: 'american_indian_or_alaskan_native_only_non_hispanic',
4: 'asian_only_non_hispanic',
5: 'native_hawaiian_or_other_pacific_islander_only_non_hispanic',
6: 'multiracial_non_hispanic',
7: 'hispanic'}

# Apply the mapping to the race column:
df['Computed_race_groups_used_for_internet_prevalence_tables'] = df['Computed_race_groups_used_for_internet_prevalence_tables'].map(race_mapping)

# Rename the column:
df.rename(columns={'Computed_race_groups_used_for_internet_prevalence_tables': 'race'}, inplace=True)

In [33]:
#view column counts:
df.race.value_counts(dropna=False)

white_only_non_hispanic                                        333514
hispanic                                                        42977
black_only_non_hispanic                                         35876
asian_only_non_hispanic                                         13487
multiracial_non_hispanic                                         9744
american_indian_or_alaskan_native_only_non_hispanic              7120
native_hawaiian_or_other_pacific_islander_only_non_hispanic      2414
Name: race, dtype: int64

### **Column 4: Imputed_Age_value_collapsed_above_80**<a id='Column_4_Imputed_Age_value_collapsed_above_80'></a>
[Contents](#Contents)

In [34]:
#view column counts:
df.Imputed_Age_value_collapsed_above_80.value_counts(dropna=False)

80.00    36253
65.00    10421
70.00    10371
67.00     9652
68.00     9351
62.00     9262
72.00     9212
66.00     9191
64.00     9174
60.00     9155
69.00     9027
75.00     8920
63.00     8874
71.00     8686
73.00     8404
74.00     8267
52.00     8256
61.00     8216
55.00     7976
58.00     7922
59.00     7843
53.00     7711
50.00     7456
54.00     7222
57.00     7187
56.00     7131
40.00     6917
51.00     6759
42.00     6604
76.00     6513
45.00     6332
38.00     6015
49.00     5927
47.00     5905
77.00     5893
35.00     5819
43.00     5817
48.00     5805
78.00     5763
39.00     5710
36.00     5614
37.00     5613
46.00     5611
44.00     5573
79.00     5527
41.00     5492
30.00     5426
32.00     5426
34.00     5312
33.00     5008
31.00     4668
28.00     4586
29.00     4449
25.00     4423
27.00     4302
24.00     4256
26.00     4240
23.00     4086
22.00     3997
20.00     3847
21.00     3837
19.00     3558
18.00     3362
Name: Imputed_Age_value_collapsed_above_80, dtype: int64

In [35]:
# Define bins and labels:
bins = [17, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 79, 99]
labels = [
    'Age_18_to_24', 'Age_25_to_29', 'Age_30_to_34', 'Age_35_to_39',
    'Age_40_to_44', 'Age_45_to_49', 'Age_50_to_54', 'Age_55_to_59',
    'Age_60_to_64', 'Age_65_to_69', 'Age_70_to_74', 'Age_75_to_79',
    'Age_80_or_older'
         ]

In [36]:
# Categorize the age values into bins:
df['age_category'] = pd.cut(df['Imputed_Age_value_collapsed_above_80'], bins=bins, labels=labels, right=True)
df.age_category.value_counts(dropna=False)

Age_65_to_69       47642
Age_70_to_74       44940
Age_60_to_64       44681
Age_55_to_59       38059
Age_50_to_54       37404
Age_80_or_older    36253
Age_75_to_79       32616
Age_40_to_44       30403
Age_45_to_49       29580
Age_35_to_39       28771
Age_18_to_24       26943
Age_30_to_34       25840
Age_25_to_29       22000
Name: age_category, dtype: int64
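
To make the edge handling of `pd.cut` concrete, here is a minimal sketch that bins a few hand-picked boundary ages. The `sample_ages` values are illustrative assumptions, not rows from the dataset. With `right=True` each interval includes its upper edge, so 24 falls in `Age_18_to_24` and 25 in `Age_25_to_29`, while the collapsed value 80 lands in the last bin.

```python
import pandas as pd

# Same edges and labels as in the notebook cell above.
bins = [17, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 79, 99]
labels = [
    'Age_18_to_24', 'Age_25_to_29', 'Age_30_to_34', 'Age_35_to_39',
    'Age_40_to_44', 'Age_45_to_49', 'Age_50_to_54', 'Age_55_to_59',
    'Age_60_to_64', 'Age_65_to_69', 'Age_70_to_74', 'Age_75_to_79',
    'Age_80_or_older',
]

# pd.cut needs exactly one label per interval.
assert len(labels) == len(bins) - 1

# Illustrative boundary ages (hypothetical sample, not taken from df).
sample_ages = pd.Series([18.0, 24.0, 25.0, 79.0, 80.0])
sample_categories = pd.cut(sample_ages, bins=bins, labels=labels, right=True)
print(sample_categories.tolist())
```

Starting the first edge at 17 rather than 18 matters here: with right-closed intervals, an edge of 18 would leave respondents aged exactly 18 outside every bin.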

### **Column 5: General_Health**<a id='Column_5_General_Health'></a>
[Contents](#Contents)

In [37]:
#view column counts:
df.General_Health.value_counts(dropna=False)

2.00    148444
3.00    143598
1.00     71878
4.00     60273
5.00     19741
7.00       810
9.00       385
NaN          3
Name: General_Health, dtype: int64

**General_Health:**
* 1: excellent
* 2: very_good
* 3: good
* 4: fair
* 5: poor
* 7: dont_know
* 9: refused

Codes 7 (dont_know) and 9 (refused) are non-answers, so convert them to NaN:

In [38]:
# Replace 7 and 9 with NaN, assigning the result back to the column:
df['General_Health'] = df['General_Health'].replace([7, 9], np.nan)
df.General_Health.value_counts(dropna=False)

2.00    148444
3.00    143598
1.00     71878
4.00     60273
5.00     19741
NaN       1198
Name: General_Health, dtype: int64

In [39]:
# Calculate the distribution of existing values
value_counts = df['General_Health'].value_counts(normalize=True, dropna=True)
print("Original General_Health:\n", value_counts)

Original General_Health:
 2.00   0.33
3.00   0.32
1.00   0.16
4.00   0.14
5.00   0.04
Name: General_Health, dtype: float64


In [40]:
# Function to impute missing values based on distribution
def impute_missing(row):
    if pd.isna(row['General_Health']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['General_Health']

In [41]:
# Apply the imputation function
df['General_Health'] = df.apply(impute_missing, axis=1)
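
The row-wise `df.apply` above calls `np.random.choice` once per row, which can be slow on a frame of this size and gives different fills on every run. As an alternative sketch (not the notebook's method), the same distribution-based fill can be done in one vectorized draw with a seeded generator. The helper name `impute_from_distribution`, the seed, and the toy series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def impute_from_distribution(series, rng):
    """Fill NaN values by sampling from the observed category frequencies."""
    # Relative frequency of each observed (non-missing) value.
    freqs = series.value_counts(normalize=True, dropna=True)
    missing = series.isna()
    # Draw one value per missing entry in a single call.
    draws = rng.choice(
        freqs.index.to_numpy(), size=int(missing.sum()), p=freqs.to_numpy()
    )
    filled = series.copy()
    filled[missing] = draws
    return filled


# Toy column standing in for df['General_Health'] (illustrative values only).
toy = pd.Series([1.0, 2.0, 2.0, 3.0, np.nan, 2.0, np.nan, 5.0])
rng = np.random.default_rng(42)  # fixed seed so the fill is reproducible
toy_filled = impute_from_distribution(toy, rng)
print(toy_filled.isna().sum())
```

Passing the generator in explicitly keeps the randomness in one place, so a single seed at the top of a notebook would be enough to reproduce every imputed column.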

In [42]:
# Verify the imputation
imputed_value_counts = df['General_Health'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00    148843
3.00    143983
1.00     72064
4.00     60440
5.00     19802
Name: General_Health, dtype: int64


In [43]:
# Create a mapping dictionary:
health_mapping = {1: 'excellent',
                  2: 'very_good',
                  3: 'good',
                  4: 'fair',
                  5: 'poor'
                 }


# Apply the mapping to the health column:
df['General_Health'] = df['General_Health'].map(health_mapping)

# Rename the column:
df.rename(columns={'General_Health': 'general_health'}, inplace=True)

In [44]:
#view column counts:
df.general_health.value_counts(dropna=False)

very_good    148843
good         143983
excellent     72064
fair          60440
poor          19802
Name: general_health, dtype: int64
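
One thing worth guarding against in the mapping step: `Series.map` with a dictionary returns NaN for any code that is not a key, so a code missed during cleaning would silently become a new missing value. The sketch below shows that failure mode on a toy series (illustrative values, with code 7 left in on purpose) and one possible check; `unmapped_codes` is a hypothetical name.

```python
import pandas as pd

health_mapping = {1: 'excellent', 2: 'very_good', 3: 'good', 4: 'fair', 5: 'poor'}

# Toy codes (illustrative): 7 is deliberately left in to show the failure mode.
codes = pd.Series([1.0, 2.0, 5.0, 7.0])
mapped = codes.map(health_mapping)

# Codes that are present in the data but absent from the mapping.
unmapped_codes = sorted(set(codes.dropna().unique()) - set(health_mapping))
print(mapped.tolist())
print(unmapped_codes)
```

Running a check like this before `map` (and expecting an empty list) would confirm that every remaining code has a label.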

### **Column 6: Have_Personal_Health_Care_Provider**<a id='Column_6_Have_Personal_Health_Care_Provider'></a>
[Contents](#Contents)

In [45]:
#view column counts:
df.Have_Personal_Health_Care_Provider.value_counts(dropna=False)

1.00    246967
2.00    136685
3.00     57105
7.00      3270
9.00      1103
NaN          2
Name: Have_Personal_Health_Care_Provider, dtype: int64

**Have_Personal_Health_Care_Provider:**
* 1: yes_only_one
* 2: more_than_one
* 3: no
* 7: dont_know
* 9: refused

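Codes 7 (dont_know) and 9 (refused) are non-answers, so convert them to NaN:

In [46]:
# Replace 7 and 9 with NaN, assigning the result back to the column:
df['Have_Personal_Health_Care_Provider'] = df['Have_Personal_Health_Care_Provider'].replace([7, 9], np.nan)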
df.Have_Personal_Health_Care_Provider.value_counts(dropna=False)

1.00    246967
2.00    136685
3.00     57105
NaN       4375
Name: Have_Personal_Health_Care_Provider, dtype: int64

In [47]:
# Calculate the distribution of existing values
value_counts = df['Have_Personal_Health_Care_Provider'].value_counts(normalize=True, dropna=True)
print("Original Have_Personal_Health_Care_Provider:\n", value_counts)

Original Have_Personal_Health_Care_Provider:
 1.00   0.56
2.00   0.31
3.00   0.13
Name: Have_Personal_Health_Care_Provider, dtype: float64


In [48]:
# Function to impute missing values based on distribution
def impute_missing(row):
    if pd.isna(row['Have_Personal_Health_Care_Provider']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Have_Personal_Health_Care_Provider']

In [49]:
# Apply the imputation function
df['Have_Personal_Health_Care_Provider'] = df.apply(impute_missing, axis=1)

In [50]:
# Verify the imputation
imputed_value_counts = df['Have_Personal_Health_Care_Provider'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00    249393
2.00    138061
3.00     57678
Name: Have_Personal_Health_Care_Provider, dtype: int64


In [51]:
# Verify the imputation
imputed_value_counts = df['Have_Personal_Health_Care_Provider'].value_counts(dropna=False,normalize=True) # 
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00   0.56
2.00   0.31
3.00   0.13
Name: Have_Personal_Health_Care_Provider, dtype: float64
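
The rounded proportions printed above look unchanged after imputation. A rough way to quantify this is to compute the largest absolute shift in any category's share. The sketch below does so on synthetic data; the sample size, the 1% missing rate, and the seed are assumptions chosen to loosely resemble this column, not values read from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a coded survey column with some missing answers.
before = pd.Series(rng.choice([1.0, 2.0, 3.0], size=10_000, p=[0.56, 0.31, 0.13]))
before[rng.random(before.size) < 0.01] = np.nan  # about 1% missing

# Fill the gaps by sampling from the observed proportions.
observed = before.value_counts(normalize=True, dropna=True)
after = before.copy()
n_missing = int(after.isna().sum())
after[after.isna()] = rng.choice(
    observed.index.to_numpy(), size=n_missing, p=observed.to_numpy()
)

# Largest absolute change in any category's share.
max_shift = (after.value_counts(normalize=True) - observed).abs().max()
print(f"largest shift in proportion: {max_shift:.4f}")
```

Because the fill is drawn from the observed shares, the shift is bounded by the fraction of missing values and is usually far smaller.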


In [52]:
# Create a mapping dictionary:
provider_mapping = {1: 'yes_only_one',
                    2: 'more_than_one',
                    3: 'no'
                   }

# Apply the mapping to the provider column:
df['Have_Personal_Health_Care_Provider'] = df['Have_Personal_Health_Care_Provider'].map(provider_mapping)

# Rename the column:
df.rename(columns={'Have_Personal_Health_Care_Provider': 'health_care_provider'}, inplace=True)

### **Column 7: Could_Not_Afford_To_See_Doctor**<a id='Column_7_Could_Not_Afford_To_See_Doctor'></a>
[Contents](#Contents)

In [53]:
#view column counts:
df.Could_Not_Afford_To_See_Doctor.value_counts(dropna=False)

2.00    406296
1.00     37227
7.00      1157
9.00       448
NaN          4
Name: Could_Not_Afford_To_See_Doctor, dtype: int64

**Could_Not_Afford_To_See_Doctor:**
* 1: yes
* 2: no
* 7: dont_know
* 9: refused

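Codes 7 (dont_know) and 9 (refused) are non-answers, so convert them to NaN:

In [54]:
# Replace 7 and 9 with NaN, assigning the result back to the column:
df['Could_Not_Afford_To_See_Doctor'] = df['Could_Not_Afford_To_See_Doctor'].replace([7, 9], np.nan)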
df.Could_Not_Afford_To_See_Doctor.value_counts(dropna=False)

2.00    406296
1.00     37227
NaN       1609
Name: Could_Not_Afford_To_See_Doctor, dtype: int64

In [55]:
# Calculate the distribution of existing values
value_counts = df['Could_Not_Afford_To_See_Doctor'].value_counts(normalize=True, dropna=True)
print("Original Could_Not_Afford_To_See_Doctor:\n", value_counts)

Original Could_Not_Afford_To_See_Doctor:
 2.00   0.92
1.00   0.08
Name: Could_Not_Afford_To_See_Doctor, dtype: float64


In [56]:
# Function to impute missing values based on distribution
def impute_missing(row):
    if pd.isna(row['Could_Not_Afford_To_See_Doctor']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Could_Not_Afford_To_See_Doctor']

In [57]:
# Apply the imputation function
df['Could_Not_Afford_To_See_Doctor'] = df.apply(impute_missing, axis=1)

In [58]:
# Verify the imputation
imputed_value_counts = df['Could_Not_Afford_To_See_Doctor'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00    407773
1.00     37359
Name: Could_Not_Afford_To_See_Doctor, dtype: int64


In [59]:
# Verify the imputation
imputed_value_counts = df['Could_Not_Afford_To_See_Doctor'].value_counts(dropna=False,normalize=True) # 
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00   0.92
1.00   0.08
Name: Could_Not_Afford_To_See_Doctor, dtype: float64


In [60]:
# Create a mapping dictionary:
doctor_mapping = {1: 'yes',
                  2: 'no'
                 }

# Apply the mapping to the doctor column:
df['Could_Not_Afford_To_See_Doctor'] = df['Could_Not_Afford_To_See_Doctor'].map(doctor_mapping)

# Rename the column:
df.rename(columns={'Could_Not_Afford_To_See_Doctor': 'could_not_afford_to_see_doctor'}, inplace=True)

### **Column 8: Length_of_time_since_last_routine_checkup**<a id='Column_8_Length_of_time_since_last_routine_checkup'></a>
[Contents](#Contents)

In [61]:
#view column counts:
df.Length_of_time_since_last_routine_checkup.value_counts(dropna=False)

1.00    350944
2.00     41919
3.00     24882
4.00     19079
7.00      5063
8.00      2509
9.00       733
NaN          3
Name: Length_of_time_since_last_routine_checkup, dtype: int64

**Length_of_time_since_last_routine_checkup:**
* 1: past_year
* 2: past_2_years
* 3: past_5_years
* 4: 5+_years_ago
* 7: dont_know
* 8: never
* 9: refused

Codes 7 (dont_know) and 9 (refused) are non-answers, so convert them to NaN; code 8 (never) is a valid response and is kept:

In [62]:
# Replace 7 and 9 with NaN, assigning the result back to the column:
df['Length_of_time_since_last_routine_checkup'] = df['Length_of_time_since_last_routine_checkup'].replace([7, 9], np.nan)
df.Length_of_time_since_last_routine_checkup.value_counts(dropna=False)

1.00    350944
2.00     41919
3.00     24882
4.00     19079
NaN       5799
8.00      2509
Name: Length_of_time_since_last_routine_checkup, dtype: int64

In [63]:
# Calculate the distribution of existing values:
value_counts = df['Length_of_time_since_last_routine_checkup'].value_counts(normalize=True, dropna=True)
print("Original Length_of_time_since_last_routine_checkup:\n", value_counts)

Original Length_of_time_since_last_routine_checkup:
 1.00   0.80
2.00   0.10
3.00   0.06
4.00   0.04
8.00   0.01
Name: Length_of_time_since_last_routine_checkup, dtype: float64


In [64]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Length_of_time_since_last_routine_checkup']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Length_of_time_since_last_routine_checkup']

In [65]:
# Apply the imputation function:
df['Length_of_time_since_last_routine_checkup'] = df.apply(impute_missing, axis=1)

In [66]:
# Verify the imputation:
imputed_value_counts = df['Length_of_time_since_last_routine_checkup'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00    355585
2.00     42502
3.00     25212
4.00     19296
8.00      2537
Name: Length_of_time_since_last_routine_checkup, dtype: int64


In [67]:
# Verify the imputation:
imputed_value_counts = df['Length_of_time_since_last_routine_checkup'].value_counts(dropna=False,normalize=True) # 
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00   0.80
2.00   0.10
3.00   0.06
4.00   0.04
8.00   0.01
Name: Length_of_time_since_last_routine_checkup, dtype: float64


In [68]:
# Create a mapping dictionary:
checkup_mapping = {1: 'past_year',
                   2: 'past_2_years',
                   3: 'past_5_years',
                   4: '5+_years_ago',
                   8: 'never',
                 }

# Apply the mapping to the checkup column:
df['Length_of_time_since_last_routine_checkup'] = df['Length_of_time_since_last_routine_checkup'].map(checkup_mapping)

# Rename the column:
df.rename(columns={'Length_of_time_since_last_routine_checkup': 'length_of_time_since_last_routine_checkup'}, inplace=True)

In [69]:
#view column counts:
df['length_of_time_since_last_routine_checkup'].value_counts(dropna=False,normalize=True)

past_year      0.80
past_2_years   0.10
past_5_years   0.06
5+_years_ago   0.04
never          0.01
Name: length_of_time_since_last_routine_checkup, dtype: float64

### **Column 9: Ever_Diagnosed_with_Heart_Attack**<a id='Column_9_Ever_Diagnosed_with_Heart_Attack'></a>
[Contents](#Contents)

In [70]:
#view column counts:
df['Ever_Diagnosed_with_Heart_Attack'].value_counts(dropna=False)

2.00    416959
1.00     25108
7.00      2731
9.00       330
NaN          4
Name: Ever_Diagnosed_with_Heart_Attack, dtype: int64

**Ever_Diagnosed_with_Heart_Attack:**
* 1: yes
* 2: no
* 7: dont_know
* 9: refused

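Codes 7 (dont_know) and 9 (refused) are non-answers, so convert them to NaN:

In [71]:
# Replace 7 and 9 with NaN, assigning the result back to the column:
df['Ever_Diagnosed_with_Heart_Attack'] = df['Ever_Diagnosed_with_Heart_Attack'].replace([7, 9], np.nan)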
df.Ever_Diagnosed_with_Heart_Attack.value_counts(dropna=False)

2.00    416959
1.00     25108
NaN       3065
Name: Ever_Diagnosed_with_Heart_Attack, dtype: int64

In [72]:
# Calculate the distribution of existing values:
value_counts = df['Ever_Diagnosed_with_Heart_Attack'].value_counts(normalize=True, dropna=True)
print("Original Ever_Diagnosed_with_Heart_Attack:\n", value_counts)

Original Ever_Diagnosed_with_Heart_Attack:
 2.00   0.94
1.00   0.06
Name: Ever_Diagnosed_with_Heart_Attack, dtype: float64


In [73]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Ever_Diagnosed_with_Heart_Attack']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Ever_Diagnosed_with_Heart_Attack']

In [74]:
# Apply the imputation function:
df['Ever_Diagnosed_with_Heart_Attack'] = df.apply(impute_missing, axis=1)

In [75]:
# Verify the imputation:
imputed_value_counts = df['Ever_Diagnosed_with_Heart_Attack'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00    419856
1.00     25276
Name: Ever_Diagnosed_with_Heart_Attack, dtype: int64


In [76]:
# Verify the imputation:
imputed_value_counts = df['Ever_Diagnosed_with_Heart_Attack'].value_counts(dropna=False,normalize=True) # 
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00   0.94
1.00   0.06
Name: Ever_Diagnosed_with_Heart_Attack, dtype: float64


In [77]:
# Create a mapping dictionary:
heart_attack_mapping = {1: 'yes',
                        2: 'no'
                       }

# Apply the mapping to the heart attack column:
df['Ever_Diagnosed_with_Heart_Attack'] = df['Ever_Diagnosed_with_Heart_Attack'].map(heart_attack_mapping)

# Rename the column:
df.rename(columns={'Ever_Diagnosed_with_Heart_Attack': 'ever_diagnosed_with_heart_attack'}, inplace=True)

In [78]:
#view column counts:
df['ever_diagnosed_with_heart_attack'].value_counts(dropna=False,normalize=True) # 

no    0.94
yes   0.06
Name: ever_diagnosed_with_heart_attack, dtype: float64
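
The same four steps (replace non-answer codes, impute from the observed distribution, map codes to labels, rename) are repeated for each column. As a sketch only, they could be wrapped in one helper. The function `recode_column`, its parameters, the seed, and the toy frame below are hypothetical and not part of the notebook; the helper also uses a single vectorized draw instead of the row-wise `apply`.

```python
import numpy as np
import pandas as pd


def recode_column(df, column, mapping, new_name, missing_codes=(7, 9), seed=None):
    """Replace non-answers with NaN, impute from the observed distribution,
    map codes to labels, and rename the column. Returns a new DataFrame."""
    rng = np.random.default_rng(seed)
    # Step 1: treat non-answer codes as missing.
    values = df[column].replace(list(missing_codes), np.nan)
    # Step 2: sample replacements from the observed relative frequencies.
    freqs = values.value_counts(normalize=True, dropna=True)
    missing = values.isna()
    values[missing] = rng.choice(
        freqs.index.to_numpy(), size=int(missing.sum()), p=freqs.to_numpy()
    )
    # Steps 3 and 4: rename the column in a copy, then store the labels.
    out = df.rename(columns={column: new_name})
    out[new_name] = values.map(mapping)
    return out


# Toy frame (illustrative values only).
toy_df = pd.DataFrame(
    {'Ever_Diagnosed_with_Heart_Attack': [2.0, 2.0, 1.0, 7.0, np.nan, 2.0]}
)
toy_clean = recode_column(
    toy_df,
    column='Ever_Diagnosed_with_Heart_Attack',
    mapping={1: 'yes', 2: 'no'},
    new_name='ever_diagnosed_with_heart_attack',
    seed=0,
)
print(toy_clean['ever_diagnosed_with_heart_attack'].value_counts(dropna=False))
```

Columns with different non-answer codes, such as one where only 9 means "don't know", would pass their own `missing_codes`.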

### **Column 10: Ever_Diagnosed_with_a_Stroke**<a id='Column_10_Ever_Diagnosed_with_a_Stroke'></a>
[Contents](#Contents)

In [79]:
#view column counts:
df['Ever_Diagnosed_with_a_Stroke'].value_counts(dropna=False)

2.00    424336
1.00     19239
7.00      1274
9.00       281
NaN          2
Name: Ever_Diagnosed_with_a_Stroke, dtype: int64

**Ever_Diagnosed_with_a_Stroke:**
* 1: yes
* 2: no
* 7: dont_know
* 9: refused

Codes 7 (dont_know) and 9 (refused) are non-answers, so convert them to NaN:

In [80]:
# Replace 7 and 9 with NaN, assigning the result back to the column:
df['Ever_Diagnosed_with_a_Stroke'] = df['Ever_Diagnosed_with_a_Stroke'].replace([7, 9], np.nan)
df.Ever_Diagnosed_with_a_Stroke.value_counts(dropna=False)

2.00    424336
1.00     19239
NaN       1557
Name: Ever_Diagnosed_with_a_Stroke, dtype: int64

In [81]:
# Calculate the distribution of existing values:
value_counts = df['Ever_Diagnosed_with_a_Stroke'].value_counts(normalize=True, dropna=True)
print("Original Ever_Diagnosed_with_a_Stroke:\n", value_counts)

Original Ever_Diagnosed_with_a_Stroke:
 2.00   0.96
1.00   0.04
Name: Ever_Diagnosed_with_a_Stroke, dtype: float64


In [82]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Ever_Diagnosed_with_a_Stroke']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Ever_Diagnosed_with_a_Stroke']

In [83]:
# Apply the imputation function:
df['Ever_Diagnosed_with_a_Stroke'] = df.apply(impute_missing, axis=1)

In [84]:
# Verify the imputation:
imputed_value_counts = df['Ever_Diagnosed_with_a_Stroke'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00    425821
1.00     19311
Name: Ever_Diagnosed_with_a_Stroke, dtype: int64


In [85]:
# Verify the imputation:
imputed_value_counts = df['Ever_Diagnosed_with_a_Stroke'].value_counts(dropna=False,normalize=True) # 
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00   0.96
1.00   0.04
Name: Ever_Diagnosed_with_a_Stroke, dtype: float64


In [86]:
# Create a mapping dictionary:
stroke_mapping = {1: 'yes',
                  2: 'no'
                 }

# Apply the mapping to the stroke column:
df['Ever_Diagnosed_with_a_Stroke'] = df['Ever_Diagnosed_with_a_Stroke'].map(stroke_mapping)

# Rename the column:
df.rename(columns={'Ever_Diagnosed_with_a_Stroke': 'ever_diagnosed_with_a_stroke'}, inplace=True)

In [87]:
#view column counts:
df['ever_diagnosed_with_a_stroke'].value_counts(dropna=False,normalize=True) # 

no    0.96
yes   0.04
Name: ever_diagnosed_with_a_stroke, dtype: float64

### **Column 11: Ever_told_you_had_a_depressive_disorder**<a id='Column_11_Ever_told_you_had_a_depressive_disorder'></a>
[Contents](#Contents)

In [88]:
#view column counts:
value_counts_with_percentage(df, 'Ever_told_you_had_a_depressive_disorder')

| Value | Count | Percentage |
|---|---|---|
| 2.0 | 350910 | 78.83 |
| 1.0 | 91410 | 20.54 |
| 7.0 | 2140 | 0.48 |
| 9.0 | 665 | 0.15 |
| NaN | 7 | 0.0 |
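
`value_counts_with_percentage` is defined earlier in the notebook and its body is not shown in this section. The block below is only a hypothetical stand-in that would produce a table of the same shape, assuming the percentage is each count divided by the total number of rows (NaN included) and rounded to two decimals. The name `counts_with_percentage_sketch` and the toy data are illustrative.

```python
import numpy as np
import pandas as pd


def counts_with_percentage_sketch(df, column):
    """Hypothetical stand-in: counts and percentages per value, NaN included."""
    counts = df[column].value_counts(dropna=False)
    percentages = (counts / len(df) * 100).round(2)
    return pd.DataFrame(
        {'Count': counts.to_numpy(), 'Percentage': percentages.to_numpy()},
        index=counts.index,
    )


# Toy data (illustrative only).
toy_df = pd.DataFrame({'answer': [2.0, 2.0, 2.0, 1.0, np.nan]})
summary = counts_with_percentage_sketch(toy_df, 'answer')
print(summary)
```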


**Ever_told_you_had_a_depressive_disorder:**
* 1: yes
* 2: no
* 7: dont_know
* 9: refused

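Codes 7 (dont_know) and 9 (refused) are non-answers, so convert them to NaN:

In [89]:
# Replace 7 and 9 with NaN, assigning the result back to the column:
df['Ever_told_you_had_a_depressive_disorder'] = df['Ever_told_you_had_a_depressive_disorder'].replace([7, 9], np.nan)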
df.Ever_told_you_had_a_depressive_disorder.value_counts(dropna=False)

2.00    350910
1.00     91410
NaN       2812
Name: Ever_told_you_had_a_depressive_disorder, dtype: int64

In [90]:
# Calculate the distribution of existing values:
value_counts = df['Ever_told_you_had_a_depressive_disorder'].value_counts(normalize=True, dropna=True)
print("Original Ever_told_you_had_a_depressive_disorder:\n", value_counts)

Original Ever_told_you_had_a_depressive_disorder:
 2.00   0.79
1.00   0.21
Name: Ever_told_you_had_a_depressive_disorder, dtype: float64


In [91]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Ever_told_you_had_a_depressive_disorder']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Ever_told_you_had_a_depressive_disorder']

In [92]:
# Apply the imputation function:
df['Ever_told_you_had_a_depressive_disorder'] = df.apply(impute_missing, axis=1)

In [93]:
# Verify the imputation:
imputed_value_counts = df['Ever_told_you_had_a_depressive_disorder'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00    353187
1.00     91945
Name: Ever_told_you_had_a_depressive_disorder, dtype: int64


In [94]:
# Verify the imputation:
imputed_value_counts = df['Ever_told_you_had_a_depressive_disorder'].value_counts(dropna=False,normalize=True) # 
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00   0.79
1.00   0.21
Name: Ever_told_you_had_a_depressive_disorder, dtype: float64


In [95]:
# Create a mapping dictionary:
depressive_disorder_mapping = {1: 'yes',
                               2: 'no'
                              }

# Apply the mapping to the depressive_disorder column:
df['Ever_told_you_had_a_depressive_disorder'] = df['Ever_told_you_had_a_depressive_disorder'].map(depressive_disorder_mapping)

# Rename the column:
df.rename(columns={'Ever_told_you_had_a_depressive_disorder': 'ever_told_you_had_a_depressive_disorder'}, inplace=True)

In [96]:
#view column counts & percentage:
value_counts_with_percentage(df, 'ever_told_you_had_a_depressive_disorder')

| Value | Count | Percentage |
|---|---|---|
| no | 353187 | 79.34 |
| yes | 91945 | 20.66 |


### **Column 12: Ever_told_you_have_kidney_disease**<a id='Column_12_Ever_told_you_have_kidney_disease'></a>
[Contents](#Contents)

In [97]:
#view column counts & percentage:
value_counts_with_percentage(df, 'Ever_told_you_have_kidney_disease')

| Value | Count | Percentage |
|---|---|---|
| 2.0 | 422891 | 95.0 |
| 1.0 | 20315 | 4.56 |
| 7.0 | 1581 | 0.36 |
| 9.0 | 343 | 0.08 |
| NaN | 2 | 0.0 |


**Ever_told_you_have_kidney_disease:**
* 1: yes
* 2: no
* 7: dont_know
* 9: refused

Codes 7 (dont_know) and 9 (refused) are non-answers, so convert them to NaN:

In [98]:
# Replace 7 and 9 with NaN, assigning the result back to the column:
df['Ever_told_you_have_kidney_disease'] = df['Ever_told_you_have_kidney_disease'].replace([7, 9], np.nan)
df.Ever_told_you_have_kidney_disease.value_counts(dropna=False)

2.00    422891
1.00     20315
NaN       1926
Name: Ever_told_you_have_kidney_disease, dtype: int64

In [99]:
# Calculate the distribution of existing values:
value_counts = df['Ever_told_you_have_kidney_disease'].value_counts(normalize=True, dropna=True)
print("Original Ever_told_you_have_kidney_disease:\n", value_counts)

Original Ever_told_you_have_kidney_disease:
 2.00   0.95
1.00   0.05
Name: Ever_told_you_have_kidney_disease, dtype: float64


In [100]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Ever_told_you_have_kidney_disease']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Ever_told_you_have_kidney_disease']

In [101]:
# Apply the imputation function:
df['Ever_told_you_have_kidney_disease'] = df.apply(impute_missing, axis=1)

In [102]:
# Verify the imputation:
imputed_value_counts = df['Ever_told_you_have_kidney_disease'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00    424726
1.00     20406
Name: Ever_told_you_have_kidney_disease, dtype: int64


In [103]:
# Verify the imputation:
imputed_value_counts = df['Ever_told_you_have_kidney_disease'].value_counts(dropna=False,normalize=True) # 
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00   0.95
1.00   0.05
Name: Ever_told_you_have_kidney_disease, dtype: float64


In [104]:
# Create a mapping dictionary:
kidney_mapping = {1: 'yes',
                  2: 'no'
                 }

# Apply the mapping to the kidney column:
df['Ever_told_you_have_kidney_disease'] = df['Ever_told_you_have_kidney_disease'].map(kidney_mapping)

# Rename the column:
df.rename(columns={'Ever_told_you_have_kidney_disease': 'ever_told_you_have_kidney_disease'}, inplace=True)

In [105]:
#view column counts & percentage:
value_counts_with_percentage(df, 'ever_told_you_have_kidney_disease')

| Value | Count | Percentage |
|---|---|---|
| no | 424726 | 95.42 |
| yes | 20406 | 4.58 |


### **Column 13: Ever_told_you_had_diabetes**<a id='Column_13_Ever_told_you_had_diabetes'></a>
[Contents](#Contents)

In [106]:
#view column counts & percentage:
value_counts_with_percentage(df, 'Ever_told_you_had_diabetes')

| Value | Count | Percentage |
|---|---|---|
| 3.0 | 368722 | 82.83 |
| 1.0 | 61158 | 13.74 |
| 4.0 | 10329 | 2.32 |
| 2.0 | 3836 | 0.86 |
| 7.0 | 763 | 0.17 |
| 9.0 | 321 | 0.07 |
| NaN | 3 | 0.0 |


**Ever_told_you_had_diabetes:**
* 1: yes
* 2: yes_during_pregnancy
* 3: no
* 4: no_prediabetes
* 7: dont_know
* 9: refused

Codes 7 (dont_know) and 9 (refused) are non-answers, so convert them to NaN:

In [107]:
# Replace 7 and 9 with NaN, assigning the result back to the column:
df['Ever_told_you_had_diabetes'] = df['Ever_told_you_had_diabetes'].replace([7, 9], np.nan)
df.Ever_told_you_had_diabetes.value_counts(dropna=False)

3.00    368722
1.00     61158
4.00     10329
2.00      3836
NaN       1087
Name: Ever_told_you_had_diabetes, dtype: int64

In [108]:
# Calculate the distribution of existing values:
value_counts = df['Ever_told_you_had_diabetes'].value_counts(normalize=True, dropna=True)
print("Original Ever_told_you_had_diabetes:\n", value_counts)

Original Ever_told_you_had_diabetes:
 3.00   0.83
1.00   0.14
4.00   0.02
2.00   0.01
Name: Ever_told_you_had_diabetes, dtype: float64


In [109]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Ever_told_you_had_diabetes']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Ever_told_you_had_diabetes']

In [110]:
# Apply the imputation function:
df['Ever_told_you_had_diabetes'] = df.apply(impute_missing, axis=1)

In [111]:
# Verify the imputation:
imputed_value_counts = df['Ever_told_you_had_diabetes'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 3.00    369615
1.00     61307
4.00     10362
2.00      3848
Name: Ever_told_you_had_diabetes, dtype: int64


In [112]:
# Verify the imputation:
imputed_value_counts = df['Ever_told_you_had_diabetes'].value_counts(dropna=False,normalize=True) # 
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 3.00   0.83
1.00   0.14
4.00   0.02
2.00   0.01
Name: Ever_told_you_had_diabetes, dtype: float64


In [113]:
# Create a mapping dictionary:
diabetes_mapping = {1: 'yes',
                    2: 'yes_during_pregnancy',
                    3: 'no',
                    4: 'no_prediabetes'
                   }

# Apply the mapping to the diabetes column:
df['Ever_told_you_had_diabetes'] = df['Ever_told_you_had_diabetes'].map(diabetes_mapping)

# Rename the column:
df.rename(columns={'Ever_told_you_had_diabetes': 'ever_told_you_had_diabetes'}, inplace=True)

In [114]:
#view column counts & percentage:
value_counts_with_percentage(df, 'ever_told_you_had_diabetes')

| Value | Count | Percentage |
|---|---|---|
| no | 369615 | 83.03 |
| yes | 61307 | 13.77 |
| no_prediabetes | 10362 | 2.33 |
| yes_during_pregnancy | 3848 | 0.86 |


### **Column 14: Computed_body_mass_index_categories**<a id='Column_14_Computed_body_mass_index_categories'></a>
[Contents](#Contents)

In [115]:
#view column counts & percentage:
value_counts_with_percentage(df, 'Computed_body_mass_index_categories')

| Value | Count | Percentage |
|---|---|---|
| 3.0 | 139995 | 31.45 |
| 4.0 | 132577 | 29.78 |
| 2.0 | 116976 | 26.28 |
| NaN | 48806 | 10.96 |
| 1.0 | 6778 | 1.52 |
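
Almost 11% of the BMI categories are missing, noticeably more than in the columns handled above, so a larger share of this column will consist of sampled values. A quick way to spot such columns before imputing is to compute the missing share per column. The sketch below uses a toy frame with made-up values; only the two column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only, not the survey data).
toy_df = pd.DataFrame({
    'Computed_body_mass_index_categories': [3.0, 4.0, np.nan, 2.0, np.nan, 3.0, 1.0, 4.0],
    'General_Health': [1.0, 2.0, 2.0, 3.0, 3.0, np.nan, 4.0, 5.0],
})

# Share of missing values per column, in percent, largest first.
missing_pct = (toy_df.isna().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```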


**Computed_body_mass_index_categories:**
* 1: underweight_bmi_less_than_18_5
* 2: normal_weight_bmi_18_5_to_24_9
* 3: overweight_bmi_25_to_29_9
* 4: obese_bmi_30_or_more


In [116]:
# Calculate the distribution of existing values:
value_counts = df['Computed_body_mass_index_categories'].value_counts(normalize=True, dropna=True)
print("Original Computed_body_mass_index_categories:\n", value_counts)

Original Computed_body_mass_index_categories:
 3.00   0.35
4.00   0.33
2.00   0.30
1.00   0.02
Name: Computed_body_mass_index_categories, dtype: float64


In [117]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Computed_body_mass_index_categories']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Computed_body_mass_index_categories']

In [118]:
# Apply the imputation function:
df['Computed_body_mass_index_categories'] = df.apply(impute_missing, axis=1)

In [119]:
# Verify the imputation:
imputed_value_counts = df['Computed_body_mass_index_categories'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 3.00    157226
4.00    148919
2.00    131365
1.00      7622
Name: Computed_body_mass_index_categories, dtype: int64


In [120]:
# Verify the imputation:
imputed_value_counts = df['Computed_body_mass_index_categories'].value_counts(dropna=False,normalize=True) # 
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 3.00   0.35
4.00   0.33
2.00   0.30
1.00   0.02
Name: Computed_body_mass_index_categories, dtype: float64


In [121]:
# Create a mapping dictionary:
bmi_mapping = {1: 'underweight_bmi_less_than_18_5',
               2: 'normal_weight_bmi_18_5_to_24_9',
               3: 'overweight_bmi_25_to_29_9',
               4: 'obese_bmi_30_or_more'
              }

# Apply the mapping to the bmi column:
df['Computed_body_mass_index_categories'] = df['Computed_body_mass_index_categories'].map(bmi_mapping)

# Rename the column:
df.rename(columns={'Computed_body_mass_index_categories': 'BMI'}, inplace=True)

In [122]:
#view column counts & percentage:
value_counts_with_percentage(df, 'BMI')

| Value | Count | Percentage |
|---|---|---|
| overweight_bmi_25_to_29_9 | 157226 | 35.32 |
| obese_bmi_30_or_more | 148919 | 33.46 |
| normal_weight_bmi_18_5_to_24_9 | 131365 | 29.51 |
| underweight_bmi_less_than_18_5 | 7622 | 1.71 |


### **Column 15: Difficulty_Walking_or_Climbing_Stairs**<a id='Column_15_Difficulty_Walking_or_Climbing_Stairs'></a>
[Contents](#Contents)

In [123]:
#view column counts & percentage:
value_counts_with_percentage(df, 'Difficulty_Walking_or_Climbing_Stairs')

| Value | Count | Percentage |
|---|---|---|
| 2.0 | 353039 | 79.31 |
| 1.0 | 68081 | 15.29 |
| NaN | 22155 | 4.98 |
| 7.0 | 1221 | 0.27 |
| 9.0 | 636 | 0.14 |


**Difficulty_Walking_or_Climbing_Stairs:**
* 1: yes
* 2: no
* 7: dont_know
* 9: refused

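Codes 7 (dont_know) and 9 (refused) are non-answers, so convert them to NaN:

In [124]:
# Replace 7 and 9 with NaN, assigning the result back to the column:
df['Difficulty_Walking_or_Climbing_Stairs'] = df['Difficulty_Walking_or_Climbing_Stairs'].replace([7, 9], np.nan)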
df.Difficulty_Walking_or_Climbing_Stairs.value_counts(dropna=False)

2.00    353039
1.00     68081
NaN      24012
Name: Difficulty_Walking_or_Climbing_Stairs, dtype: int64

In [125]:
# Calculate the distribution of existing values:
value_counts = df['Difficulty_Walking_or_Climbing_Stairs'].value_counts(normalize=True, dropna=True)
print("Original Difficulty_Walking_or_Climbing_Stairs:\n", value_counts)

Original Difficulty_Walking_or_Climbing_Stairs:
 2.00   0.84
1.00   0.16
Name: Difficulty_Walking_or_Climbing_Stairs, dtype: float64


In [126]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Difficulty_Walking_or_Climbing_Stairs']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Difficulty_Walking_or_Climbing_Stairs']

In [127]:
# Apply the imputation function:
df['Difficulty_Walking_or_Climbing_Stairs'] = df.apply(impute_missing, axis=1)

In [128]:
# Verify the imputation:
imputed_value_counts = df['Difficulty_Walking_or_Climbing_Stairs'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00    373082
1.00     72050
Name: Difficulty_Walking_or_Climbing_Stairs, dtype: int64


In [129]:
# Verify the imputation:
imputed_value_counts = df['Difficulty_Walking_or_Climbing_Stairs'].value_counts(dropna=False,normalize=True) # 
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 2.00   0.84
1.00   0.16
Name: Difficulty_Walking_or_Climbing_Stairs, dtype: float64


In [130]:
# Create a mapping dictionary:
climbing_mapping = {1: 'yes',
                    2: 'no'
                   }

# Apply the mapping to the difficulty walking or climbing stairs column:
df['Difficulty_Walking_or_Climbing_Stairs'] = df['Difficulty_Walking_or_Climbing_Stairs'].map(climbing_mapping)

# Rename the column:
df.rename(columns={'Difficulty_Walking_or_Climbing_Stairs': 'difficulty_walking_or_climbing_stairs'}, inplace=True)

In [131]:
#view column counts & percentage:
value_counts_with_percentage(df, 'difficulty_walking_or_climbing_stairs')

| Value | Count | Percentage |
|---|---|---|
| no | 373082 | 83.81 |
| yes | 72050 | 16.19 |


### **Column 16: Computed_Physical_Health_Status**<a id='Column_16_Computed_Physical_Health_Status'></a>
[Contents](#Contents)

In [132]:
#view column counts & percentage:
value_counts_with_percentage(df, 'Computed_Physical_Health_Status')

| Value | Count | Percentage |
|---|---|---|
| 1.0 | 267819 | 60.17 |
| 2.0 | 108312 | 24.33 |
| 3.0 | 58074 | 13.05 |
| 9.0 | 10927 | 2.45 |


**Computed_Physical_Health_Status:**
* 1: zero_days_not_good
* 2: 1_to_13_days_not_good
* 3: 14_plus_days_not_good
* 9: dont_know

Code 9 (dont_know) is a non-answer, so convert it to NaN:

In [133]:
# Replace 9 with NaN, assigning the result back to the column:
df['Computed_Physical_Health_Status'] = df['Computed_Physical_Health_Status'].replace([9], np.nan)
df.Computed_Physical_Health_Status.value_counts(dropna=False)

1.00    267819
2.00    108312
3.00     58074
NaN      10927
Name: Computed_Physical_Health_Status, dtype: int64

In [134]:
# Calculate the distribution of existing values:
value_counts = df['Computed_Physical_Health_Status'].value_counts(normalize=True, dropna=True)
print("Original Computed_Physical_Health_Status:\n", value_counts)

Original Computed_Physical_Health_Status:
 1.00   0.62
2.00   0.25
3.00   0.13
Name: Computed_Physical_Health_Status, dtype: float64


In [135]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Computed_Physical_Health_Status']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Computed_Physical_Health_Status']

In [136]:
# Apply the imputation function:
df['Computed_Physical_Health_Status'] = df.apply(impute_missing, axis=1)
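The row-wise `df.apply` above calls `np.random.choice` once per row, which is slow on roughly 445k rows and gives different results on each run unless a seed was set earlier. Below is a minimal sketch of a vectorized alternative. The helper name `impute_from_distribution` and the seed value are assumptions for illustration, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def impute_from_distribution(series, seed=42):
    """Fill NaNs by sampling from the observed value distribution (hypothetical helper)."""
    # Proportions of the non-missing values, same as value_counts(normalize=True) above.
    probs = series.value_counts(normalize=True, dropna=True)
    missing = series.isna()
    rng = np.random.default_rng(seed)  # seed is an assumption, added for reproducibility
    filled = series.copy()
    # Draw all replacements in one call instead of one call per row.
    filled.loc[missing] = rng.choice(
        probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy()
    )
    return filled

# Small illustrative series (not real survey data):
demo = pd.Series([1.0, 1.0, 2.0, np.nan, 3.0, np.nan])
demo_filled = impute_from_distribution(demo)
print(demo_filled.isna().sum())
```

Under these assumptions the same call could be reused for every column in this section, for example `df[col] = impute_from_distribution(df[col])`.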

In [137]:
# Verify the imputation:
imputed_value_counts = df['Computed_Physical_Health_Status'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00    274525
2.00    111056
3.00     59551
Name: Computed_Physical_Health_Status, dtype: int64


In [138]:
# Verify the imputation (proportions):
imputed_value_counts = df['Computed_Physical_Health_Status'].value_counts(dropna=False, normalize=True)
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00   0.62
2.00   0.25
3.00   0.13
Name: Computed_Physical_Health_Status, dtype: float64
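The rounded proportions printed above look identical before and after imputation. A slightly stricter check is to compare the unrounded shares directly. The sketch below does that with the counts shown in the outputs above; the variable names are illustrative only:

```python
import pandas as pd

# Counts copied from the outputs above (before: non-missing only; after: all rows).
before = pd.Series({1.0: 267819, 2.0: 108312, 3.0: 58074})
after = pd.Series({1.0: 274525, 2.0: 111056, 3.0: 59551})

# Largest absolute change in any category's share of the column.
max_shift = (after / after.sum() - before / before.sum()).abs().max()
print(f"Largest change in any category's share: {max_shift:.5f}")
```

A small value here indicates that sampling from the observed distribution left the category shares essentially unchanged.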


In [139]:
# Create a mapping dictionary:
health_status_mapping = {
    1: 'zero_days_not_good',
    2: '1_to_13_days_not_good',
    3: '14_plus_days_not_good',
}

# Map the numeric codes to their labels:
df['Computed_Physical_Health_Status'] = df['Computed_Physical_Health_Status'].map(health_status_mapping)

# Rename the column:
df.rename(columns={'Computed_Physical_Health_Status': 'physical_health_status'}, inplace=True)

In [140]:
#view column counts & percentage:
value_counts_with_percentage(df, 'physical_health_status')

Unnamed: 0,Count,Percentage
zero_days_not_good,274525,61.67
1_to_13_days_not_good,111056,24.95
14_plus_days_not_good,59551,13.38


### **Column 17: Computed_Mental_Health_Status**<a id='Column_17_Computed_Mental_Health_Status'></a>
[Contents](#Contents)

In [141]:
#view column counts & percentage:
value_counts_with_percentage(df, 'Computed_Mental_Health_Status')

Unnamed: 0,Count,Percentage
1.0,265229,59.58
2.0,110616,24.85
3.0,60220,13.53
9.0,9067,2.04


**Computed_Mental_Health_Status:**
* 1: 'zero_days_not_good',
* 2: '1_to_13_days_not_good',
* 3: '14_plus_days_not_good',
* 9: 'dont_know'

Code 9 ('dont_know') carries no usable information, so let's convert it to NaN:

In [142]:
# Replace 9 ('dont_know') with NaN:
df['Computed_Mental_Health_Status'] = df['Computed_Mental_Health_Status'].replace(9, np.nan)
df.Computed_Mental_Health_Status.value_counts(dropna=False)

1.00    265229
2.00    110616
3.00     60220
NaN       9067
Name: Computed_Mental_Health_Status, dtype: int64

In [143]:
# Calculate the distribution of existing values:
value_counts = df['Computed_Mental_Health_Status'].value_counts(normalize=True, dropna=True)
print("Original Computed_Mental_Health_Status:\n", value_counts)

Original Computed_Mental_Health_Status:
 1.00   0.61
2.00   0.25
3.00   0.14
Name: Computed_Mental_Health_Status, dtype: float64


In [144]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Computed_Mental_Health_Status']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Computed_Mental_Health_Status']

In [145]:
# Apply the imputation function:
df['Computed_Mental_Health_Status'] = df.apply(impute_missing, axis=1)

In [146]:
# Verify the imputation:
imputed_value_counts = df['Computed_Mental_Health_Status'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00    270792
2.00    112907
3.00     61433
Name: Computed_Mental_Health_Status, dtype: int64


In [147]:
# Verify the imputation (proportions):
imputed_value_counts = df['Computed_Mental_Health_Status'].value_counts(dropna=False, normalize=True)
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00   0.61
2.00   0.25
3.00   0.14
Name: Computed_Mental_Health_Status, dtype: float64


In [148]:
# Create a mapping dictionary:
m_health_status_mapping = {
    1: 'zero_days_not_good',
    2: '1_to_13_days_not_good',
    3: '14_plus_days_not_good',
}

# Map the numeric codes to their labels:
df['Computed_Mental_Health_Status'] = df['Computed_Mental_Health_Status'].map(m_health_status_mapping)

# Rename the column:
df.rename(columns={'Computed_Mental_Health_Status': 'mental_health_status'}, inplace=True)
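One caveat with `Series.map` and a dictionary: any code that is not a key in the dictionary silently becomes NaN, which would reintroduce missing values after imputation. The sketch below adds a guard before mapping. `map_codes_strict` is a hypothetical helper name, not something defined in the original notebook:

```python
import pandas as pd

def map_codes_strict(series, mapping):
    """Map codes to labels; raise if a non-missing code has no label (hypothetical helper)."""
    unmapped = set(series.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"Codes without a label: {sorted(unmapped)}")
    return series.map(mapping)

# Small illustrative series (not real survey data):
codes = pd.Series([1.0, 2.0, 3.0, 1.0])
labels = map_codes_strict(
    codes,
    {1: 'zero_days_not_good', 2: '1_to_13_days_not_good', 3: '14_plus_days_not_good'},
)
print(labels.tolist())
```

If a sentinel such as 9 were left in the column, the helper would raise instead of quietly producing NaN.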

In [149]:
#view column counts & percentage:
value_counts_with_percentage(df, 'mental_health_status')

Unnamed: 0,Count,Percentage
zero_days_not_good,270792,60.83
1_to_13_days_not_good,112907,25.36
14_plus_days_not_good,61433,13.8


### **Column 18: Computed_Asthma_Status**<a id='Column_18_Computed_Asthma_Status'></a>
[Contents](#Contents)

In [150]:
#view column counts & percentage:
value_counts_with_percentage(df, 'Computed_Asthma_Status')

Unnamed: 0,Count,Percentage
3.0,376665,84.62
1.0,45659,10.26
2.0,18948,4.26
9.0,3860,0.87


**Computed_Asthma_Status:**
* 1: 'current_asthma',
* 2: 'former_asthma',
* 3: 'never_asthma',
* 9: 'dont_know_refused_missing'

Code 9 ('dont_know_refused_missing') carries no usable information, so let's convert it to NaN:

In [151]:
# Replace 9 ('dont_know_refused_missing') with NaN:
df['Computed_Asthma_Status'] = df['Computed_Asthma_Status'].replace(9, np.nan)
df.Computed_Asthma_Status.value_counts(dropna=False)

3.00    376665
1.00     45659
2.00     18948
NaN       3860
Name: Computed_Asthma_Status, dtype: int64

In [152]:
# Calculate the distribution of existing values:
value_counts = df['Computed_Asthma_Status'].value_counts(normalize=True, dropna=True)
print("Original Computed_Asthma_Status:\n", value_counts)

Original Computed_Asthma_Status:
 3.00   0.85
1.00   0.10
2.00   0.04
Name: Computed_Asthma_Status, dtype: float64


In [153]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Computed_Asthma_Status']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Computed_Asthma_Status']

In [154]:
# Apply the imputation function:
df['Computed_Asthma_Status'] = df.apply(impute_missing, axis=1)

In [155]:
# Verify the imputation:
imputed_value_counts = df['Computed_Asthma_Status'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 3.00    379980
1.00     46035
2.00     19117
Name: Computed_Asthma_Status, dtype: int64


In [156]:
# Verify the imputation (proportions):
imputed_value_counts = df['Computed_Asthma_Status'].value_counts(dropna=False, normalize=True)
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 3.00   0.85
1.00   0.10
2.00   0.04
Name: Computed_Asthma_Status, dtype: float64


In [157]:
# Create a mapping dictionary:
Asthma_Status_mapping = {
    1: 'current_asthma',
    2: 'former_asthma',
    3: 'never_asthma',
}

# Map the numeric codes to their labels:
df['Computed_Asthma_Status'] = df['Computed_Asthma_Status'].map(Asthma_Status_mapping)

# Rename the column:
df.rename(columns={'Computed_Asthma_Status': 'asthma_Status'}, inplace=True)

In [158]:
#view column counts & percentage:
value_counts_with_percentage(df, 'asthma_Status')

Unnamed: 0,Count,Percentage
never_asthma,379980,85.36
current_asthma,46035,10.34
former_asthma,19117,4.29


### **Column 19: Exercise_in_Past_30_Days**<a id='Column_19_Exercise_in_Past_30_Days'></a>
[Contents](#Contents)

In [159]:
#view column counts & percentage:
value_counts_with_percentage(df, 'Exercise_in_Past_30_Days')

Unnamed: 0,Count,Percentage
1.0,337559,75.83
2.0,106480,23.92
7.0,724,0.16
9.0,367,0.08
,2,0.0


**Exercise_in_Past_30_Days:**
* 1: 'yes',
* 2: 'no',
* 7: 'dont_know'
* 9: 'refused_missing'

Codes 7 ('dont_know') and 9 ('refused_missing') carry no usable information, so let's convert them to NaN:

In [160]:
# Replace 7 and 9 with NaN:
df['Exercise_in_Past_30_Days'] = df['Exercise_in_Past_30_Days'].replace([7, 9], np.nan)
df.Exercise_in_Past_30_Days.value_counts(dropna=False)

1.00    337559
2.00    106480
NaN       1093
Name: Exercise_in_Past_30_Days, dtype: int64

In [161]:
# Calculate the distribution of existing values:
value_counts = df['Exercise_in_Past_30_Days'].value_counts(normalize=True, dropna=True)
print("Original Exercise_in_Past_30_Days:\n", value_counts)

Original Exercise_in_Past_30_Days:
 1.00   0.76
2.00   0.24
Name: Exercise_in_Past_30_Days, dtype: float64


In [162]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Exercise_in_Past_30_Days']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Exercise_in_Past_30_Days']

In [163]:
# Apply the imputation function:
df['Exercise_in_Past_30_Days'] = df.apply(impute_missing, axis=1)

In [164]:
# Verify the imputation:
imputed_value_counts = df['Exercise_in_Past_30_Days'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00    338376
2.00    106756
Name: Exercise_in_Past_30_Days, dtype: int64


In [165]:
# Verify the imputation (proportions):
imputed_value_counts = df['Exercise_in_Past_30_Days'].value_counts(dropna=False, normalize=True)
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00   0.76
2.00   0.24
Name: Exercise_in_Past_30_Days, dtype: float64


In [166]:
# Create a mapping dictionary:
exercise_Status_mapping = {
    1: 'yes',
    2: 'no',
}

# Map the numeric codes to their labels:
df['Exercise_in_Past_30_Days'] = df['Exercise_in_Past_30_Days'].map(exercise_Status_mapping)

# Rename the column:
df.rename(columns={'Exercise_in_Past_30_Days': 'exercise_status_in_past_30_Days'}, inplace=True)

In [167]:
#view column counts & percentage:
value_counts_with_percentage(df, 'exercise_status_in_past_30_Days')

Unnamed: 0,Count,Percentage
yes,338376,76.02
no,106756,23.98


### **Column 20: Computed_Smoking_Status**<a id='Column_20_Computed_Smoking_Status'></a>
[Contents](#Contents)

In [168]:
#view column counts & percentage:
value_counts_with_percentage(df, 'Computed_Smoking_Status')

Unnamed: 0,Count,Percentage
4.0,245955,55.25
3.0,113774,25.56
1.0,36003,8.09
9.0,35462,7.97
2.0,13938,3.13


**Computed_Smoking_Status:**
* 1: 'current_smoker_every_day',
* 2: 'current_smoker_some_days',
* 3: 'former_smoker',
* 4: 'never_smoked',
* 9: 'dont_know_refused_missing'

Code 9 ('dont_know_refused_missing') carries no usable information, so let's convert it to NaN:

In [169]:
# Replace 9 ('dont_know_refused_missing') with NaN:
df['Computed_Smoking_Status'] = df['Computed_Smoking_Status'].replace(9, np.nan)
df.Computed_Smoking_Status.value_counts(dropna=False)

4.00    245955
3.00    113774
1.00     36003
NaN      35462
2.00     13938
Name: Computed_Smoking_Status, dtype: int64

In [170]:
# Calculate the distribution of existing values:
value_counts = df['Computed_Smoking_Status'].value_counts(normalize=True, dropna=True)
print("Original Computed_Smoking_Status:\n", value_counts)

Original Computed_Smoking_Status:
 4.00   0.60
3.00   0.28
1.00   0.09
2.00   0.03
Name: Computed_Smoking_Status, dtype: float64


In [171]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Computed_Smoking_Status']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Computed_Smoking_Status']

In [172]:
# Apply the imputation function:
df['Computed_Smoking_Status'] = df.apply(impute_missing, axis=1)

In [173]:
# Verify the imputation:
imputed_value_counts = df['Computed_Smoking_Status'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 4.00    267178
3.00    123743
1.00     39107
2.00     15104
Name: Computed_Smoking_Status, dtype: int64


In [174]:
# Verify the imputation (proportions):
imputed_value_counts = df['Computed_Smoking_Status'].value_counts(dropna=False, normalize=True)
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 4.00   0.60
3.00   0.28
1.00   0.09
2.00   0.03
Name: Computed_Smoking_Status, dtype: float64


In [175]:
# Create a mapping dictionary:
smoking_Status_mapping = {1: 'current_smoker_every_day',
                           2: 'current_smoker_some_days',
                           3: 'former_smoker',
                           4: 'never_smoked'
                          }

# Map the numeric codes to their labels:
df['Computed_Smoking_Status'] = df['Computed_Smoking_Status'].map(smoking_Status_mapping)

# Rename the column:
df.rename(columns={'Computed_Smoking_Status': 'smoking_status'}, inplace=True)

In [176]:
#view column counts & percentage:
value_counts_with_percentage(df, 'smoking_status')

Unnamed: 0,Count,Percentage
never_smoked,267178,60.02
former_smoker,123743,27.8
current_smoker_every_day,39107,8.79
current_smoker_some_days,15104,3.39


### **Column 21: Binge_Drinking_Calculated_Variable**<a id='Column_21_Binge_Drinking_Calculated_Variable'></a>
[Contents](#Contents)

In [177]:
#view column counts & percentage:
value_counts_with_percentage(df, 'Binge_Drinking_Calculated_Variable')

Unnamed: 0,Count,Percentage
1.0,337114,75.73
2.0,56916,12.79
9.0,51102,11.48


**Binge_Drinking_Calculated_Variable:**
* 1: 'no',
* 2: 'yes',
* 9: 'dont_know_refused_missing'

Code 9 ('dont_know_refused_missing') carries no usable information, so let's convert it to NaN:

In [178]:
# Replace 9 ('dont_know_refused_missing') with NaN:
df['Binge_Drinking_Calculated_Variable'] = df['Binge_Drinking_Calculated_Variable'].replace(9, np.nan)
df.Binge_Drinking_Calculated_Variable.value_counts(dropna=False)

1.00    337114
2.00     56916
NaN      51102
Name: Binge_Drinking_Calculated_Variable, dtype: int64

In [179]:
# Calculate the distribution of existing values:
value_counts = df['Binge_Drinking_Calculated_Variable'].value_counts(normalize=True, dropna=True)
print("Original Binge_Drinking_Calculated_Variable:\n", value_counts)

Original Binge_Drinking_Calculated_Variable:
 1.00   0.86
2.00   0.14
Name: Binge_Drinking_Calculated_Variable, dtype: float64


In [180]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['Binge_Drinking_Calculated_Variable']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['Binge_Drinking_Calculated_Variable']

In [181]:
# Apply the imputation function:
df['Binge_Drinking_Calculated_Variable'] = df.apply(impute_missing, axis=1)

In [182]:
# Verify the imputation:
imputed_value_counts = df['Binge_Drinking_Calculated_Variable'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00    380851
2.00     64281
Name: Binge_Drinking_Calculated_Variable, dtype: int64


In [183]:
# Verify the imputation (proportions):
imputed_value_counts = df['Binge_Drinking_Calculated_Variable'].value_counts(dropna=False, normalize=True)
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 1.00   0.86
2.00   0.14
Name: Binge_Drinking_Calculated_Variable, dtype: float64


In [184]:
# Create a mapping dictionary:
binge_drinking_status = {1: 'no',
                           2: 'yes'
                          }

# Map the numeric codes to their labels:
df['Binge_Drinking_Calculated_Variable'] = df['Binge_Drinking_Calculated_Variable'].map(binge_drinking_status)

# Rename the column:
df.rename(columns={'Binge_Drinking_Calculated_Variable': 'binge_drinking_status'}, inplace=True)

In [185]:
#view column counts & percentage:
value_counts_with_percentage(df, 'binge_drinking_status')

Unnamed: 0,Count,Percentage
no,380851,85.56
yes,64281,14.44


### **Column 22: How_Much_Time_Do_You_Sleep**<a id='Column_22_How_Much_Time_Do_You_Sleep'></a>
[Contents](#Contents)

In [186]:
#view column counts & percentage:
value_counts_with_percentage(df, 'How_Much_Time_Do_You_Sleep')

Unnamed: 0,Count,Percentage
7.0,132927,29.86
8.0,125442,28.18
6.0,95880,21.54
5.0,30122,6.77
9.0,21210,4.76
4.0,12433,2.79
10.0,10459,2.35
77.0,4792,1.08
3.0,3260,0.73
12.0,3004,0.67


In [187]:
def categorize_sleep_hours(df, column_name):
    # Define the mapping dictionary for known values
    sleep_mapping = {
        77: 'dont_know',
        99: 'refused_to_answer',
        np.nan: 'missing'
    }
    
    # Categorize hours of sleep
    for hour in range(0, 4):
        sleep_mapping[hour] = 'very_short_sleep_0_to_3_hours'
    for hour in range(4, 6):
        sleep_mapping[hour] = 'short_sleep_4_to_5_hours'
    for hour in range(6, 9):
        sleep_mapping[hour] = 'normal_sleep_6_to_8_hours'
    for hour in range(9, 11):
        sleep_mapping[hour] = 'long_sleep_9_to_10_hours'
    for hour in range(11, 25):
        sleep_mapping[hour] = 'very_long_sleep_11_or_more_hours'

    # Map the values to their categories
    df['sleep_category'] = df[column_name].map(sleep_mapping)

    return df
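The dictionary built by the loops above is equivalent to binning the hours into fixed ranges. A compact sketch of the same idea with `pd.cut` follows. It differs from the notebook in one respect, stated here as an assumption: the codes 77 and 99 and missing values are turned into NaN immediately, rather than being labeled first and replaced in a later cell:

```python
import numpy as np
import pandas as pd

# Small illustrative series (not real survey data); 77 and 99 are the survey's special codes.
hours = pd.Series([3, 5, 7, 10, 12, 77, 99, np.nan])

# Keep only plausible hour values; everything else becomes NaN.
valid_hours = hours.where(hours.between(0, 24))

sleep_category = pd.cut(
    valid_hours,
    bins=[-1, 3, 5, 8, 10, 24],  # right-closed: (-1, 3], (3, 5], (5, 8], (8, 10], (10, 24]
    labels=[
        'very_short_sleep_0_to_3_hours',
        'short_sleep_4_to_5_hours',
        'normal_sleep_6_to_8_hours',
        'long_sleep_9_to_10_hours',
        'very_long_sleep_11_or_more_hours',
    ],
)
print(sleep_category.value_counts(dropna=False))
```

The result is a categorical column with an explicit order, which can be convenient for plotting later.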

In [188]:
# Apply the function to categorize sleep hours
df = categorize_sleep_hours(df, 'How_Much_Time_Do_You_Sleep')

In [189]:
#view column counts & percentage:
value_counts_with_percentage(df, 'sleep_category')

Unnamed: 0,Count,Percentage
normal_sleep_6_to_8_hours,354249,79.58
short_sleep_4_to_5_hours,42555,9.56
long_sleep_9_to_10_hours,31669,7.11
very_short_sleep_0_to_3_hours,5963,1.34
very_long_sleep_11_or_more_hours,5243,1.18
dont_know,4792,1.08
refused_to_answer,658,0.15
missing,3,0.0


In [190]:
# Replace the non-informative categories with NaN:
df['sleep_category'] = df['sleep_category'].replace(['missing', 'dont_know', 'refused_to_answer'], np.nan)
df.sleep_category.value_counts(dropna=False)

normal_sleep_6_to_8_hours           354249
short_sleep_4_to_5_hours             42555
long_sleep_9_to_10_hours             31669
very_short_sleep_0_to_3_hours         5963
NaN                                   5453
very_long_sleep_11_or_more_hours      5243
Name: sleep_category, dtype: int64

In [191]:
# Calculate the distribution of existing values:
value_counts = df['sleep_category'].value_counts(normalize=True, dropna=True)
print("Original sleep_category:\n", value_counts)

Original sleep_category:
 normal_sleep_6_to_8_hours          0.81
short_sleep_4_to_5_hours           0.10
long_sleep_9_to_10_hours           0.07
very_short_sleep_0_to_3_hours      0.01
very_long_sleep_11_or_more_hours   0.01
Name: sleep_category, dtype: float64


In [192]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['sleep_category']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['sleep_category']

In [193]:
# Apply the imputation function:
df['sleep_category'] = df.apply(impute_missing, axis=1)

In [194]:
# Verify the imputation:
imputed_value_counts = df['sleep_category'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 normal_sleep_6_to_8_hours           358636
short_sleep_4_to_5_hours             43092
long_sleep_9_to_10_hours             32046
very_short_sleep_0_to_3_hours         6049
very_long_sleep_11_or_more_hours      5309
Name: sleep_category, dtype: int64


In [195]:
# Verify the imputation (proportions):
imputed_value_counts = df['sleep_category'].value_counts(dropna=False, normalize=True)
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 normal_sleep_6_to_8_hours          0.81
short_sleep_4_to_5_hours           0.10
long_sleep_9_to_10_hours           0.07
very_short_sleep_0_to_3_hours      0.01
very_long_sleep_11_or_more_hours   0.01
Name: sleep_category, dtype: float64


In [196]:
#view column counts & percentage:
value_counts_with_percentage(df, 'sleep_category')

Unnamed: 0,Count,Percentage
normal_sleep_6_to_8_hours,358636,80.57
short_sleep_4_to_5_hours,43092,9.68
long_sleep_9_to_10_hours,32046,7.2
very_short_sleep_0_to_3_hours,6049,1.36
very_long_sleep_11_or_more_hours,5309,1.19


### **Column 23: Computed_number_of_drinks_of_alcohol_beverages_per_week**<a id='Column_23_Computed_number_of_drinks_of_alcohol_beverages_per_week'></a>
[Contents](#Contents)

In [197]:
#view column counts & percentage:
value_counts_with_percentage(df, 'Computed_number_of_drinks_of_alcohol_beverages_per_week')

Unnamed: 0,Count,Percentage
0.0,188832,42.42
99900.0,49705,11.17
23.0,20646,4.64
47.0,18325,4.12
93.0,12104,2.72
200.0,10153,2.28
700.0,10143,2.28
100.0,9507,2.14
400.0,8845,1.99
70.0,8312,1.87


In [198]:
# Divide by 100 to get the number of drinks per week
df['drinks_per_week'] = df['Computed_number_of_drinks_of_alcohol_beverages_per_week'] / 100

In [199]:
# Define the function to categorize the drink consumption
def categorize_drinks(drinks_per_week):
    # 99900 is the "do not know" sentinel (999.00 after dividing by 100):
    if drinks_per_week == 99900 / 100:
        return 'do_not_know'
    elif 0.01 <= drinks_per_week <= 1:
        return 'very_low_consumption_0.01_to_1_drinks'
    elif 1.01 <= drinks_per_week <= 5:
        return 'low_consumption_1.01_to_5_drinks'
    elif 5.01 <= drinks_per_week <= 10:
        return 'moderate_consumption_5.01_to_10_drinks'
    elif 10.01 <= drinks_per_week <= 20:
        return 'high_consumption_10.01_to_20_drinks'
    elif drinks_per_week > 20:
        return 'very_high_consumption_more_than_20_drinks'
    else:
        return 'did_not_drink'
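Note that the final `else` branch above also catches NaN, because every comparison with NaN is false, so a missing value would be labeled `did_not_drink`. A minimal sketch of an alternative that keeps NaN as NaN is shown below, using `pd.cut`. It masks the 99900 sentinel before scaling, which is a simplification of the notebook's label-then-replace approach, and the sample values are illustrative only:

```python
import numpy as np
import pandas as pd

# Illustrative raw values in the "drinks x 100" encoding; 99900 is the sentinel.
raw = pd.Series([0, 23, 100, 470, 1000, 1500, 2500, 99900, np.nan])

# Sentinel -> NaN first, then scale to drinks per week.
drinks_per_week = raw.mask(raw == 99900) / 100

drinks_category = pd.cut(
    drinks_per_week,
    bins=[-np.inf, 0, 1, 5, 10, 20, np.inf],  # right-closed intervals
    labels=[
        'did_not_drink',
        'very_low_consumption_0.01_to_1_drinks',
        'low_consumption_1.01_to_5_drinks',
        'moderate_consumption_5.01_to_10_drinks',
        'high_consumption_10.01_to_20_drinks',
        'very_high_consumption_more_than_20_drinks',
    ],
)
print(drinks_category.value_counts(dropna=False))
```

With this version both the sentinel and genuinely missing values stay NaN and can then go through the same distribution-based imputation.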

In [200]:
# Apply the categorization function
df['drinks_category'] = df['drinks_per_week'].apply(categorize_drinks)


In [201]:
#view column counts & percentage:
value_counts_with_percentage(df, 'drinks_category')

Unnamed: 0,Count,Percentage
did_not_drink,188832,42.42
low_consumption_1.01_to_5_drinks,72539,16.3
very_low_consumption_0.01_to_1_drinks,68894,15.48
do_not_know,49705,11.17
moderate_consumption_5.01_to_10_drinks,33916,7.62
high_consumption_10.01_to_20_drinks,19734,4.43
very_high_consumption_more_than_20_drinks,11512,2.59


In [202]:
# Replace 'do_not_know' with NaN:
df['drinks_category'] = df['drinks_category'].replace('do_not_know', np.nan)
df.drinks_category.value_counts(dropna=False)

did_not_drink                                188832
low_consumption_1.01_to_5_drinks              72539
very_low_consumption_0.01_to_1_drinks         68894
NaN                                           49705
moderate_consumption_5.01_to_10_drinks        33916
high_consumption_10.01_to_20_drinks           19734
very_high_consumption_more_than_20_drinks     11512
Name: drinks_category, dtype: int64

In [203]:
# Calculate the distribution of existing values:
value_counts = df['drinks_category'].value_counts(normalize=True, dropna=True)
print("Original drinks_category:\n", value_counts)

Original drinks_category:
 did_not_drink                               0.48
low_consumption_1.01_to_5_drinks            0.18
very_low_consumption_0.01_to_1_drinks       0.17
moderate_consumption_5.01_to_10_drinks      0.09
high_consumption_10.01_to_20_drinks         0.05
very_high_consumption_more_than_20_drinks   0.03
Name: drinks_category, dtype: float64


In [204]:
# Function to impute missing values based on distribution:
def impute_missing(row):
    if pd.isna(row['drinks_category']):
        return np.random.choice(value_counts.index, p=value_counts.values)
    else:
        return row['drinks_category']

In [205]:
# Apply the imputation function:
df['drinks_category'] = df.apply(impute_missing, axis=1)

In [206]:
# Verify the imputation:
imputed_value_counts = df['drinks_category'].value_counts(dropna=False) # normalize=True
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 did_not_drink                                212603
low_consumption_1.01_to_5_drinks              81487
very_low_consumption_0.01_to_1_drinks         77630
moderate_consumption_5.01_to_10_drinks        38269
high_consumption_10.01_to_20_drinks           22197
very_high_consumption_more_than_20_drinks     12946
Name: drinks_category, dtype: int64


In [207]:
# Verify the imputation (proportions):
imputed_value_counts = df['drinks_category'].value_counts(dropna=False, normalize=True)
print("Distribution after imputation:\n", imputed_value_counts)

Distribution after imputation:
 did_not_drink                               0.48
low_consumption_1.01_to_5_drinks            0.18
very_low_consumption_0.01_to_1_drinks       0.17
moderate_consumption_5.01_to_10_drinks      0.09
high_consumption_10.01_to_20_drinks         0.05
very_high_consumption_more_than_20_drinks   0.03
Name: drinks_category, dtype: float64


In [208]:
#Final check after imputation:
value_counts_with_percentage(df, 'drinks_category')

Unnamed: 0,Count,Percentage
did_not_drink,212603,47.76
low_consumption_1.01_to_5_drinks,81487,18.31
very_low_consumption_0.01_to_1_drinks,77630,17.44
moderate_consumption_5.01_to_10_drinks,38269,8.6
high_consumption_10.01_to_20_drinks,22197,4.99
very_high_consumption_more_than_20_drinks,12946,2.91


## **Dropping unnecessary columns**<a id='Dropping_unnecessary_columns'></a>
[Contents](#Contents)

In [209]:
# Drop the columns that are no longer needed (raw inputs already replaced by derived features):
columns_to_drop = ['Imputed_Age_value_collapsed_above_80', 'Reported_Weight_in_Pounds', 
                   'Reported_Height_in_Feet_and_Inches', 'Leisure_Time_Physical_Activity_Calculated_Variable',
                  'Smoked_at_Least_100_Cigarettes', 'Computed_number_of_drinks_of_alcohol_beverages_per_week',
                  'How_Much_Time_Do_You_Sleep', 'drinks_per_week']
df = df.drop(columns=columns_to_drop)

## **Review the final structure of the cleaned dataframe**<a id='Review_final_structure_of_the_cleaned_dataframe'></a>
[Contents](#Contents)

In [210]:
#now, let's look at the shape of df:
shape = df.shape
print("Number of rows:", shape[0], "\nNumber of columns:", shape[1])

Number of rows: 445132 
Number of columns: 23


In [211]:
summarize_df(df)

Unnamed: 0,unique_count,data_types,missing_counts,missing_percentage
heart_disease,2,object,0,0.0
gender,3,object,0,0.0
race,7,object,0,0.0
general_health,5,object,0,0.0
health_care_provider,3,object,0,0.0
could_not_afford_to_see_doctor,2,object,0,0.0
length_of_time_since_last_routine_checkup,5,object,0,0.0
ever_diagnosed_with_heart_attack,2,object,0,0.0
ever_diagnosed_with_a_stroke,2,object,0,0.0
ever_told_you_had_a_depressive_disorder,2,object,0,0.0


Awesome, there's no missing data left. **As we can see above, we cleaned the data and imputed the missing values while keeping all 445,132 rows of the dataset.**

## **Saving the clean dataframe**<a id='Saving_the_cleaned_dataframe'></a>
[Contents](#Contents)

In [212]:
output_file_path = "./brfss2022_data_wrangling_output.csv"

df.to_csv(output_file_path, index=False)
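After saving, it can be worth reloading the file once to confirm that the shape survived the round trip and that no label was parsed back as a missing value. The sketch below shows the idea on a tiny stand-in dataframe written to a temporary directory. The column values and file name are illustrative and do not come from the real output file:

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the cleaned dataframe (illustrative values only).
demo = pd.DataFrame({
    'heart_disease': ['no', 'yes', 'no'],
    'smoking_status': ['never_smoked', 'former_smoker', 'never_smoked'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_output.csv")
    demo.to_csv(demo_path, index=False)
    reloaded = pd.read_csv(demo_path)

# Same shape and columns, and nothing turned into NaN on reload.
print(reloaded.shape == demo.shape, int(reloaded.isna().sum().sum()))
```

The same two checks could be run against `output_file_path` and `df` in the notebook itself.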