<h1 style="color:#E67E22;">Table of Contents</h1>

<ol>
  <li><a href="#step1">Step 1 – Load the Dataset</a></li>
  <li><a href="#step2">Step 2 – Remove Columns with Excessive Missing Data</a></li>
  <li><a href="#step3">Step 3 – Remove Irrelevant Columns</a></li>
  <li><a href="#step4">Step 4 – Verify the Cleaned Dataset</a></li>
  <li><a href="#step5">Step 5 – Dropping Rows with Missing Values</a></li>
  <li><a href="#step6">Step 6 – Checking for Inconsistencies</a></li>
  <li><a href="#step7">Step 7 – Aggregate the Data</a></li>
  <li><a href="#summary8">Step 8 – Final Summary of the Project</a></li>
</ol>

<h1 style="color:#E67E22;">Step 1 – Load the Dataset</h1>

<h3>Objective</h3>

<p>
The goal of this step is to import the <b>BRFSS 2024 dataset</b> (in SAS XPT format) into a pandas DataFrame and inspect its structure.
</p> 

In [1]:
# Import essential libraries
import pandas as pd
import numpy as np

# Display confirmation message
print("Libraries successfully imported.")

Libraries successfully imported.


<h2 style="color:#F4A460;">1.1 Import the BRFSS 2024 Dataset</h2>

<p>
We will load the <b>BRFSS 2024 (LLCP2024.XPT)</b> file, which is provided in SAS Transport format.<br>
The goal is to confirm that the file loads correctly and inspect its shape and columns.
</p>

In [2]:
# Load the SAS XPT dataset
df = pd.read_sas("LLCP2024.XPT")

In [3]:
# Display the shape (rows, columns)
print("Dataset shape:", df.shape)

Dataset shape: (457670, 301)


In [None]:
# Display the first few column names in a DataFrame for better readability
columns_df = pd.DataFrame(df.columns, columns=["Column Names"])
columns_df.head(20)

<h1 style="color:#E67E22;">Step 2 – Remove Columns with Excessive Missing Data</h1>

We will check for duplicates and drop all columns with more than **15,000** missing values.  
A simple log of removed columns will be created for documentation.


<h2 style="color:#F4A460;">2.1 Check and Remove Duplicates</h2>

In [None]:
# Check for duplicate rows
n_duplicates = df.duplicated().sum()
print(f"Found {n_duplicates} duplicate rows.")

# Remove duplicates and reset index
df = df.drop_duplicates().reset_index(drop=True)
print("Shape after removing duplicates:", df.shape)

<h2 style="color:#F4A460;">2.2 Identify Columns with Excessive Missing Data</h2>

In [None]:
# Count missing values per column
missing_summary = (
    df.isnull()
      .sum()
      .to_frame(name="MissingCount")
      .assign(MissingPercent=lambda x: (x["MissingCount"] / len(df) * 100).round(2))
      .sort_values("MissingCount", ascending=False)
)

print("Top 20 columns with the highest missing values:")
missing_summary.head(20)

<h2 style="color:#F4A460;">2.3 Drop Columns with More Than 15,000 Missing Values</h2>

In [None]:
# Define the threshold
THRESHOLD = 15000

# Identify columns exceeding the threshold
cols_to_drop = missing_summary[missing_summary["MissingCount"] > THRESHOLD].index.tolist()
print(f"Columns exceeding {THRESHOLD} missing values: {len(cols_to_drop)}")

# Drop the columns
df = df.drop(columns=cols_to_drop)
print("Shape after dropping columns:", df.shape)
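
<p>
As a rough illustration of the rule above, the following sketch applies a count-based threshold to a toy frame and compares it with a percentage-based rule. The toy column names and the thresholds (2 missing values, 50%) are made up for the example. For reference, 15,000 missing values out of the 457,670 rows reported in Step 1 is roughly 3.3% of rows, before any duplicate removal.
</p>

```python
import numpy as np
import pandas as pd

# Toy data: column "b" has 3 missing values, "c" has 1, "a" has none
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 4],
    "c": [1, np.nan, 3, 4],
})

missing_count = toy.isna().sum()

# Count-based rule (illustrative threshold: more than 2 missing values)
drop_by_count = missing_count[missing_count > 2].index.tolist()

# Percentage-based rule (illustrative threshold: more than 50% missing);
# on this toy frame it selects the same column
missing_pct = toy.isna().mean() * 100
drop_by_pct = missing_pct[missing_pct > 50].index.tolist()

toy_clean = toy.drop(columns=drop_by_count)
print(drop_by_count, drop_by_pct, toy_clean.shape)
```

<p>
A percentage-based threshold may carry over more easily to survey years with a different number of respondents, whereas a fixed count is simpler to explain.
</p>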

<h2 style="color:#F4A460;">2.4 Document Removed Columns</h2>

In [None]:
# Create documentation table for dropped columns
removed_cols_doc = missing_summary.loc[cols_to_drop].reset_index()
removed_cols_doc.columns = ["Column", "MissingCount", "MissingPercent"]

removed_cols_doc.head(15)

<h1 style="color:#E67E22;">Step 3 – Remove Irrelevant Columns</h1>

<p>
In this step, we will manually remove columns that are not relevant for the analysis.  
These variables are either administrative, redundant, or unrelated to the project’s objectives.
</p>

In [None]:
# List of irrelevant columns to drop
irrelevant_columns = [
    "FMONTH", "IDATE", "IMONTH", "IDAY", "IYEAR", "DISPCODE", "SEQNO", "_PSU",
    "SEXVAR", "PERSDOC3", "MEDCOST1", "RMVTETH4", "CPDEMO1C", "VETERAN3",
    "QSTVER", "QSTLANG", "_STSTR", "_STRWT", "_RAWRAKE", "_WT2RAKE", "_IMPRACE",
    "_DUALUSE", "_LLCPWT2", "_LLCPWT", "_RFHLTH", "_PHYS14D", "_MENT14D",
    "_HLTHPL2", "_HCVU654", "_TOTINDA", "_EXTETH3", "_DENVST3", "_MICHD",
    "_LTASTH1", "_CASTHM1", "_ASTHMS1", "_DRDXAR2", "_MRACE1", "_HISPANC",
    "_RACEG21", "_RACEGR3", "_RACEPRV", "_AGEG5YR", "_AGE65YR", "_AGE_G",
    "_RFBMI5", "_CHLDCNT", "_EDUCAG", "_INCOMG1", "_SMOKER3", "_RFSMOK3",
    "_CURECI3", "_LCSAGE", "DRNKANY6", "DROCDY4_", "_RFBING6", "_DRNKWK3",
    "_RFDRHV9", "_METSTAT", "_URBSTAT"
]

# Check which of these columns exist in the dataset
existing_cols = [col for col in irrelevant_columns if col in df.columns]

print(f"Total columns to drop (found in dataset): {len(existing_cols)}")

# Drop them
df = df.drop(columns=existing_cols)
print("Shape after dropping irrelevant columns:", df.shape)
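
<p>
The existence check above can also be expressed with the <code>errors="ignore"</code> option of <code>DataFrame.drop</code>. The sketch below uses a toy frame, and <code>NOT_A_COLUMN</code> is a made-up label included only to show how absent names are handled.
</p>

```python
import pandas as pd

toy = pd.DataFrame({"SEQNO": [1, 2], "_PSU": [10, 20], "GENHLTH": [3, 2]})

# "NOT_A_COLUMN" is a made-up name that does not exist in the frame
to_drop = ["SEQNO", "_PSU", "NOT_A_COLUMN"]

# errors="ignore" skips labels that are absent instead of raising KeyError
toy_clean = toy.drop(columns=to_drop, errors="ignore")

# Labels that were requested but not found, worth logging for review
not_found = [c for c in to_drop if c not in toy.columns]
print(toy_clean.columns.tolist(), not_found)
```

<p>
Keeping a list of labels that were not found helps catch typos in a long manual list, or columns that were already removed in Step 2.
</p>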

<h1 style="color:#E67E22;">Step 4 – Verify the Cleaned Dataset</h1>

<p>
In this step, we will verify the structure and integrity of the cleaned dataset after removing irrelevant and incomplete columns.  
The goal is to ensure that the DataFrame now contains only valid, usable, and relevant variables for analysis.
</p>

<h2 style="color:#F4A460;">4.1 Print the New Shape</h2>

In [None]:
# Display the new shape of the cleaned dataset
print("New dataset shape:", df.shape)

<h2 style="color:#F4A460;">4.2 Generate a List of Remaining Columns</h2>

In [None]:
# Create a DataFrame to visualize the first 30 remaining columns
remaining_cols_df = pd.DataFrame(df.columns, columns=["Remaining Columns"])
remaining_cols_df.head(30)

<h2 style="color:#F4A460;">4.3 Confirm No Irrelevant or Excessively Incomplete Columns Remain</h2>

In [None]:
# Recheck missing values after cleaning
remaining_missing = (
    df.isnull()
      .sum()
      .to_frame(name="MissingCount")
      .assign(MissingPercent=lambda x: (x["MissingCount"] / len(df) * 100).round(2))
      .sort_values("MissingCount", ascending=False)
)

# Display top 10 columns by missing percentage
remaining_missing.head(10)

<h1 style="color:#E67E22;">Step 5 – Dropping Rows with Missing Values</h1>

<p>
In this step, we examined the percentage of missing data in each column and determined an appropriate cleaning approach.  
Columns with a missing rate of less than <b>0.01%</b> were considered statistically insignificant, and the rows containing those missing values were safely <b>dropped</b> using the <b>.dropna()</b> function.  
However, important variables such as <b>HEIGHT3</b>, <b>WEIGHT2</b>, <b>INCOME3</b>, <b>CHILDREN</b>, and <b>EMPLOY1</b> were retained for <b>imputation</b>, as they represent key demographic and health indicators.  
This selective approach minimizes unnecessary data loss while ensuring the dataset remains accurate, consistent, and representative for further analysis.
</p>

In [None]:
missing_summary = pd.DataFrame({
    'Missing Count': df.isnull().sum(),
    'Missing %': round((df.isnull().sum() / len(df)) * 100, 3)
})
missing_summary = missing_summary[missing_summary['Missing Count'] > 0].sort_values(by='Missing %', ascending=False)
missing_summary

In [None]:
# Step 1: List columns we will keep for imputation
impute_cols = ['HEIGHT3', 'WEIGHT2', 'INCOME3', 'CHILDREN', 'EMPLOY1']

# Step 2: Drop rows that have missing values in OTHER columns (less than 0.01%)
df = df.dropna(subset=[col for col in df.columns if col not in impute_cols])

# Step 2.5: Create an independent copy to avoid SettingWithCopyWarning
df = df.copy()

# Step 3: Verify dataset shape after dropping
print("Dataset shape after dropping negligible missing rows:", df.shape)

# Step 4: Check remaining missingness (should only be in the 5 selected columns)
missing_summary = (df[impute_cols].isnull().sum() / len(df)) * 100
print("\nRemaining columns to impute and their missing percentages:\n")
print(missing_summary)
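
<p>
To make the selective row drop concrete, here is a small sketch on toy data. <code>GENHLTH</code> stands in for a column with negligible missingness and <code>HEIGHT3</code> for a column kept for imputation; the values are invented.
</p>

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GENHLTH": [1, np.nan, 3, 2],        # stands in for a "negligible missing" column
    "HEIGHT3": [504, 510, np.nan, 600],  # stands in for a column kept for imputation
})
keep_for_imputation = ["HEIGHT3"]

# Only rows with missing values outside the imputation columns are dropped
subset = [c for c in toy.columns if c not in keep_for_imputation]
toy_clean = toy.dropna(subset=subset).copy()

rows_dropped = len(toy) - len(toy_clean)
print(rows_dropped, toy_clean["HEIGHT3"].isna().sum())
```

<p>
The row with a missing <code>HEIGHT3</code> survives because that column is excluded from <code>subset</code>, which is the behavior the step relies on.
</p>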

In [None]:
# Columns we plan to impute
impute_cols = ['HEIGHT3', 'WEIGHT2', 'INCOME3', 'CHILDREN', 'EMPLOY1']

# Display datatypes
df[impute_cols].dtypes

In [None]:
# Columns we plan to analyze
impute_cols = ['HEIGHT3', 'WEIGHT2', 'INCOME3', 'CHILDREN', 'EMPLOY1']

# Loop through and print unique values for each
for col in impute_cols:
    print(f"\nColumn: {col}")
    print(df[col].unique())
    print(f"Total unique values: {df[col].nunique()}")

<h2 style="color:#F4A460;">5.1 Imputing Missing Values for HEIGHT3</h2>

<p>
The variable <b>HEIGHT3</b> captures respondents’ self-reported height, primarily recorded in feet and inches (ranging from <b>200</b> to <b>711</b>).  
According to the dataset documentation, several special codes were used to represent non-responses or alternative formats:  
<ul>
  <li><b>7777</b> – “Don’t know / Not sure”</li>
  <li><b>9999</b> – “Refused”</li>
  <li><b>9061–9998</b> – Metric height responses (meters/centimeters)</li>
  <li><b>Blank</b> – “Not asked or Missing”</li>
</ul>
The non-response codes do not represent measurements, and the metric entries use a different unit system, so all of them were replaced with <b>NaN</b> values.
</p>

<p>
Since height is a continuous numeric variable, missing values were imputed using the <b>median height grouped by sex</b>.  
This group-based approach maintains the natural differences in average height between male and female respondents, while avoiding the influence of extreme values.  
Metric-coded entries (values beginning with <b>9</b>) were rare and were treated as missing to ensure consistent units across the dataset.
</p>
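
<p>
The group-based median imputation described above can be sketched on toy data as follows. The values are invented, and each group has an odd number of observed heights so that the median is an actual observed code. With an even count, the median averages the two middle codes, which could in principle yield a value that is not a valid feet-and-inches code, so converting to inches before imputing may be a safer variant.
</p>

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "_SEX":    [1, 1, 1, 1, 2, 2, 2, 2],
    "HEIGHT3": [508, 510, 600, np.nan, 502, 504, 506, np.nan],
})

# transform("median") returns a Series aligned to the original rows,
# holding each row's group median (NaN values are skipped by default)
group_median = toy.groupby("_SEX")["HEIGHT3"].transform("median")
toy["HEIGHT3_filled"] = toy["HEIGHT3"].fillna(group_median)
print(toy)
```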

In [None]:
import numpy as np

# Safely replace invalid codes with NaN using .loc to avoid the warning
df.loc[:, 'HEIGHT3'] = df['HEIGHT3'].replace([7777, 9999], np.nan)

# Confirm changes
print("Missing after recoding:", df['HEIGHT3'].isnull().sum())

# Display a few valid values
print("\nUnique value sample after cleaning:")
print(sorted(df['HEIGHT3'].dropna().unique())[:20])

# Save value before imputation
height3_missing_before_imputation = df['HEIGHT3'].isnull().sum()

In [None]:
import numpy as np
import pandas as pd

# 1) Treat metric-coded entries (9061–9998) as missing to keep one unit system
metric_mask = (df['HEIGHT3'] >= 9061) & (df['HEIGHT3'] <= 9998)
df.loc[metric_mask, 'HEIGHT3'] = np.nan

# 2) Impute HEIGHT3 with the median by sex
#    BRFSS: _SEX is coded (1 = Male, 2 = Female) and had no missing values in the earlier check.
df['HEIGHT3'] = df['HEIGHT3'].fillna(
    df.groupby('_SEX')['HEIGHT3'].transform('median')
)

# 3) Verify imputation
print("Remaining missing in HEIGHT3:", df['HEIGHT3'].isna().sum())
print("Median by sex used for imputation:")
print(df.groupby('_SEX')['HEIGHT3'].median())

# 4) (Optional) Create a human-friendly inches variable for analysis/viz
#    HEIGHT3 codes like 504 mean 5 feet 04 inches → convert to inches
def code_to_inches(code):
    feet = int(code // 100)
    inches = int(code % 100)
    return feet * 12 + inches

df['HEIGHT3_in'] = df['HEIGHT3'].apply(code_to_inches)

# Quick sanity check
df[['HEIGHT3', 'HEIGHT3_in']].head()

<p>
Since <b>HEIGHT3</b> is a continuous numeric variable representing physical height, missing values were imputed using the <b>median height grouped by sex</b>.  
This approach maintains the biological differences in average height between male and female respondents and minimizes bias from extreme values.  
The computed median heights were <b>510.0</b> for males and <b>504.0</b> for females, expressed in BRFSS format (feet and inches concatenated, i.e. 5 ft 10 in and 5 ft 4 in).  
After imputation, all missing entries were filled, resulting in <b>0 remaining missing values</b>.
</p>
<p>
Following imputation, a new column named <b>HEIGHT3_in</b> was created to convert the coded height values into inches.  
This transformation standardizes the unit of measurement, which supports consistent numeric analysis, accurate BMI computation, and easier integration with other continuous health metrics in later stages of the project.
</p>
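
<p>
As an alternative to a row-wise <code>apply</code>, the feet-and-inches conversion can be written with vectorized arithmetic, which also lets missing values pass through unchanged. This is a minimal sketch on a toy Series, not a replacement for the notebook's function.
</p>

```python
import numpy as np
import pandas as pd

codes = pd.Series([504, 510, 600, np.nan])

# Integer-divide for feet, remainder for inches; NaN propagates untouched
feet = codes // 100
inches = codes % 100
height_in = feet * 12 + inches
print(height_in.tolist())
```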

<h2 style="color:#F4A460;">5.2 Imputing Missing Values for WEIGHT2</h2>

<p>
The variable <b>WEIGHT2</b> records respondents’ self-reported body weight without shoes, primarily measured in pounds.  
According to the BRFSS 2024 data dictionary, the variable includes multiple coded values representing invalid or non-response entries:  
<ul>
  <li><b>7777</b> – “Don’t know / Not sure”</li>
  <li><b>9999</b> – “Refused”</li>
  <li><b>9023–9352</b> – Metric weight responses (kilograms)</li>
  <li><b>Blank</b> – “Not asked or Missing”</li>
</ul>
All of these codes were replaced with <b>NaN</b> to ensure that only valid weight observations were retained for further analysis.
</p>

<p>
Since <b>WEIGHT2</b> is a continuous numeric variable and directly affects downstream analyses such as BMI calculation, missing values were imputed using the <b>median weight grouped by sex</b>.  
This approach preserves the natural differences between male and female weight distributions while preventing distortion caused by extreme outliers.  
Metric-coded responses (values greater than <b>9000</b>) were rare and were also treated as missing to maintain consistent units.
</p>

In [None]:
import numpy as np
import pandas as pd

# --- STEP 1: Copy the column for safety ---
w = df["WEIGHT2"].copy()

# --- STEP 2: Define invalid or special codes ---
invalid_codes = {7777, 9999}       # "Don't know" / "Refused"
metric_low, metric_high = 9023, 9352  # Metric-coded responses (in kilograms)

# --- STEP 3: Replace invalid and metric-coded values with NaN ---
w = w.mask(w.isin(invalid_codes), np.nan)                      # replace 7777, 9999
w = w.mask((w >= metric_low) & (w <= metric_high), np.nan)     # replace 9023–9352

# --- STEP 4: Optional range cleanup (keep plausible values only, 50–776 lbs) ---
w = w.mask((w < 50) | (w > 776), np.nan)

# --- STEP 5: Save the cleaned version back into the DataFrame ---
df["WEIGHT2"] = w
df["WEIGHT_lb"] = w  # new clear column for explicit unit (pounds)

# Save value before imputation
weight2_missing_before_imputation = df["WEIGHT2"].isnull().sum()

In [None]:
# --- STEP 6: Calculate median weights by sex (_SEX: 1 = Male, 2 = Female) ---
medians_by_sex = df.groupby("_SEX")["WEIGHT2"].median()
print("Median weights by sex (lbs):\n", medians_by_sex)

# --- STEP 7: Fill missing values using each group’s median ---
df["WEIGHT2"] = df["WEIGHT2"].fillna(df["_SEX"].map(medians_by_sex))
df["WEIGHT_lb"] = df["WEIGHT2"]  # keep synchronized

# --- STEP 8: Verify results ---
weight2_missing_after_imputation = df["WEIGHT2"].isna().sum()
print("Remaining missing in WEIGHT2:", weight2_missing_after_imputation)
print("Post-imputation WEIGHT2 range (lbs):", df["WEIGHT2"].min(), "to", df["WEIGHT2"].max())

# Summary of WEIGHT2 imputation
df.loc[:, ["_SEX", "WEIGHT2", "WEIGHT_lb"]].head(10)

weight2_values_imputed = weight2_missing_before_imputation - weight2_missing_after_imputation
print(f"\nWEIGHT2 Imputation Summary:")
print(f"Values imputed: {weight2_values_imputed:,}")

<p>
The calculated median weights were <b>194 lbs</b> for males (<code>_SEX = 1</code>) and <b>160 lbs</b> for females (<code>_SEX = 2</code>).  
All missing values were successfully imputed, resulting in <b>0 remaining missing entries</b> in <b>WEIGHT2</b>.  
The valid weight range after cleaning and imputation extended from <b>50 lbs</b> to <b>776 lbs</b>, consistent with the official BRFSS 2024 codebook.
</p>
<p>
The resulting <b>WEIGHT_lb</b> column now contains consistent and realistic weight values for all respondents, ready for use in <b>BMI</b> computation and subsequent health-related analyses.
</p>
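
<p>
Since the cleaned height and weight columns are intended for BMI computation, a minimal sketch of that calculation is shown here using the standard imperial formula (703 × pounds ÷ inches²). The toy rows reuse the reported medians; the <code>BMI</code> column name is an assumption for illustration and is not created elsewhere in this notebook.
</p>

```python
import pandas as pd

# Toy rows built from the medians reported above (5 ft 10 in = 70 in, 5 ft 4 in = 64 in)
toy = pd.DataFrame({"WEIGHT_lb": [194.0, 160.0], "HEIGHT3_in": [70.0, 64.0]})

# Imperial BMI formula: 703 * pounds / inches squared
toy["BMI"] = 703 * toy["WEIGHT_lb"] / toy["HEIGHT3_in"] ** 2
print(toy.round(1))
```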

<h2 style="color:#F4A460;">5.3 Imputing Missing Values for INCOME3</h2>

<p>
The variable <b>INCOME3</b> captures respondents’ annual household income from all sources, coded into 11 ordinal categories.  
Each numeric code corresponds to a specific income range, while additional codes represent missing or invalid responses.  
According to the BRFSS codebook:
</p>

<ul>
  <li><b>1–11:</b> Valid income categories ranging from “Less than $10,000” to “$200,000 or more”</li>
  <li><b>77:</b> “Don’t know / Not sure”</li>
  <li><b>99:</b> “Refused”</li>
  <li><b>Blank:</b> “Not asked or Missing”</li>
</ul>

<p>
Values coded as <b>77</b> or <b>99</b> do not represent valid numeric income categories and were replaced with <b>NaN</b> values to avoid bias in downstream analysis.  
Since income is an ordinal categorical variable, imputation was not appropriate; instead, these missing responses were maintained as <b>NaN</b> to preserve data integrity.
</p>

In [None]:
# --- STEP 1: Copy column for safety ---
inc = df["INCOME3"].copy()

# --- STEP 2: Replace invalid and missing codes with NaN ---
invalid_codes = [77, 99]
inc = inc.mask(inc.isin(invalid_codes), np.nan)

# --- STEP 3: Save cleaned column ---
df["INCOME3"] = inc

print("Unique values after cleaning:", sorted(df["INCOME3"].dropna().unique()))
print("Missing INCOME3 values:", df["INCOME3"].isna().sum())

# Save value before imputation
income3_missing_before_imputation = df["INCOME3"].isnull().sum()

In [None]:
missing_ratio = df["INCOME3"].isna().mean() * 100
missing_income3 = df["INCOME3"].isna().sum()
print(f"Missing ratio: {missing_ratio:.2f}%")

In [None]:
income_labels = {
    1: "Less than $10,000",
    2: "$10,000 to < $15,000",
    3: "$15,000 to < $20,000",
    4: "$20,000 to < $25,000",
    5: "$25,000 to < $35,000",
    6: "$35,000 to < $50,000",
    7: "$50,000 to < $75,000",
    8: "$75,000 to < $100,000",
    9: "$100,000 to < $150,000",
    10: "$150,000 to < $200,000",
    11: "$200,000 or more"
}

df["Income_Level"] = df["INCOME3"].map(income_labels)
df["Income_Level"] = df["Income_Level"].fillna("Unknown")

print(df["Income_Level"].value_counts(dropna=False).head(12))

<p>
Unlike <b>HEIGHT3</b> and <b>WEIGHT2</b>, we decided not to impute missing values for <b>INCOME3</b>.  
Income is an <b>ordinal variable</b>, and respondents who skip the question or answer “Don’t know” or “Refused” usually do not do so at random.  
These missing values can be related to factors like age, education, or employment, so filling them in with averages could easily introduce bias.
</p>
<p>
Instead, we kept all missing and invalid responses as a separate category called <b>"Unknown"</b>.  
Overall, this approach keeps the dataset realistic and avoids making assumptions about respondents’ income levels.
</p>
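
<p>
One possible way to keep the bracket order when tabulating the labeled column is an ordered categorical. The sketch below uses a shortened three-bracket label map and invented codes; placing "Unknown" last is a display choice only and does not imply it ranks above the highest bracket.
</p>

```python
import numpy as np
import pandas as pd

# Shortened label map for illustration (the notebook uses 11 brackets)
labels = {1: "Less than $10,000", 2: "$10,000 to < $15,000", 3: "$15,000 to < $20,000"}
codes = pd.Series([2, 1, np.nan, 3, 1])

order = list(labels.values()) + ["Unknown"]
income_level = pd.Series(
    pd.Categorical(
        codes.map(labels).fillna("Unknown"),
        categories=order,
        ordered=True,
    )
)

# Sorting by index follows the category order rather than alphabetical order
counts = income_level.value_counts().sort_index()
print(counts)
```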

<h2 style="color:#F4A460;">5.4 Imputing Missing Values for CHILDREN</h2>

<p>
The variable <b>CHILDREN</b> represents the number of people under 18 years old living in each respondent’s household.  
According to the BRFSS 2024 codebook, it’s a numeric field with a few special codes used to handle nonstandard answers.
</p>

<ul>
  <li><b>1–87</b> → Actual number of children reported.</li>
  <li><b>88</b> → “None,” meaning the respondent has no children under 18 at home.</li>
  <li><b>99</b> → “Refused” to answer.</li>
  <li><b>Blank</b> → “Not asked or missing.”</li>
</ul>

<p>
Values from 1 to 87 are valid counts, while 88 and 99 represent coded responses that need to be recoded for analysis.  
To make this variable easier to work with, we replaced code <b>88</b> with <b>0</b> (no children) and treated <b>99</b> and blank responses as missing (<b>NaN</b>).
</p>

<p>
This keeps the variable numeric and ready for descriptive statistics or modeling while clearly separating missing or refused answers from true zeros.
</p>

In [None]:
valid_children = df["CHILDREN"][(df["CHILDREN"] >= 1) & (df["CHILDREN"] <= 87)]

In [None]:
print("Total unique valid answers (1–87):", valid_children.nunique())

In [None]:
valid_counts = valid_children.value_counts().sort_index()
print(valid_counts)

In [None]:
# Make a copy of the CHILDREN column
child = df["CHILDREN"].copy()

In [None]:
# Replace 88 with 0 (no children), 99 and blanks with NaN
child = child.replace({88: 0, 99: np.nan})

In [None]:
# Replace extreme values (more than 20 children) with NaN
child = child.mask(child > 20, np.nan)

In [None]:
# Save cleaned version
df["CHILDREN"] = child

In [None]:
# Save value before imputation
children_missing_before_imputation = df["CHILDREN"].isnull().sum()
print("Remaining missing values:", df["CHILDREN"].isna().sum())
print("Valid CHILDREN range:", df["CHILDREN"].min(), "to", df["CHILDREN"].max())
print(df["CHILDREN"].value_counts().sort_index().head(15))

In [None]:
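# Fix floating-point artifact: a near-zero value (~5.4e-79) appears where 0 is expected.
# An exact match against the rounded literal 5.397605e-79 may not hit the stored
# value, so treat any magnitude below a small tolerance as zero instead.
df["CHILDREN"] = df["CHILDREN"].mask(df["CHILDREN"].abs() < 1e-6, 0)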

# Quick verification
print("Remaining missing:", df["CHILDREN"].isna().sum())
print("Range after final cleaning:", df["CHILDREN"].min(), "to", df["CHILDREN"].max())

In [None]:
# Create a copy with labels
df["Children_Label"] = df["CHILDREN"].copy()

# Replace NaN with "Unknown" for display
df["Children_Label"] = df["Children_Label"].fillna("Unknown")

# Check result
print(df["Children_Label"].value_counts(dropna=False).head(10))

<p>
Values greater than <b>20</b> were considered unrealistic and also treated as missing.  
A small floating-point artifact (<b>5.397605e-79</b>) was corrected to zero.
</p>

<p>
After cleaning, the valid range was <b>0–20 children</b>, and about <b>2%</b> of entries remained missing.  
Since missing responses are minimal and likely non-random, no imputation was applied.  
This keeps the distribution realistic and avoids introducing artificial bias.
</p>
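
<p>
The floating-point artifact mentioned above is worth handling with a tolerance rather than an exact match. In the sketch below, the full-precision value is an illustrative stand-in for the artifact, and the tolerance of <code>1e-12</code> is an arbitrary choice; it shows that replacing a rounded literal can leave the stored value untouched.
</p>

```python
import numpy as np
import pandas as pd

# The second entry is an illustrative stand-in for the tiny near-zero artifact
toy = pd.Series([0.0, 5.397605346934028e-79, 2.0, np.nan])

# Exact replacement with a rounded literal does not match the stored value
exact = toy.replace(5.397605e-79, 0)

# Tolerance-based cleanup: anything with a tiny magnitude becomes 0, NaN stays NaN
cleaned = toy.mask(toy.abs() < 1e-12, 0.0)

print(exact.tolist())
print(cleaned.tolist())
```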

<h2 style="color:#F4A460;">5.5 Imputing Missing Values for EMPLOY1</h2>
<p>
The variable <b>EMPLOY1</b> records each respondent’s current employment status, using coded values from 1 to 9. According to the BRFSS data dictionary:
</p>

<ul>
  <li><b>1</b> – Employed for wages</li>
  <li><b>2</b> – Self-employed</li>
  <li><b>3</b> – Out of work for 1 year or more</li>
  <li><b>4</b> – Out of work for less than 1 year</li>
  <li><b>5</b> – A homemaker</li>
  <li><b>6</b> – A student</li>
  <li><b>7</b> – Retired</li>
  <li><b>8</b> – Unable to work</li>
  <li><b>9</b> – Refused (nonresponse)</li>
  <li><b>Blank</b> – Not asked or missing</li>
</ul>

<p>
To ensure accurate analysis, the invalid and nonresponse codes (<b>9</b> and blanks) were replaced with <b>NaN</b>.  
A new labeled categorical variable, <b>Employment_Status</b>, was then created using clear text labels for readability.  
Missing values (i.e., “Refused” or blank responses) were recoded as <b>“Unknown”</b> to preserve respondent counts while distinguishing them from valid categories.
</p>

In [None]:
# STEP 1: Copy the column
emp = df["EMPLOY1"].copy()

# STEP 2: Replace invalid code (9 = Refused) with NaN
emp = emp.mask(emp == 9, np.nan)

# STEP 3: Save the cleaned numeric column
df["EMPLOY1"] = emp

# Save value before imputation
employ1_missing_before_imputation = df["EMPLOY1"].isnull().sum()

# Quick check
print("Unique valid values after cleaning:", sorted(df["EMPLOY1"].dropna().unique()))
print("Missing EMPLOY1 values:", df["EMPLOY1"].isna().sum())

In [None]:
# STEP 4: Define employment status labels
employ_labels = {
    1: "Employed for wages",
    2: "Self-employed",
    3: "Out of work for 1 year or more",
    4: "Out of work for less than 1 year",
    5: "Homemaker",
    6: "Student",
    7: "Retired",
    8: "Unable to work"
}

# STEP 5: Create readable categorical column
df["Employment_Status"] = df["EMPLOY1"].map(employ_labels).fillna("Unknown")

# STEP 6: Verify distribution
print(df["Employment_Status"].value_counts(dropna=False))

<p>
Out of <b>457,634</b> respondents, the largest groups are <b>Employed for wages</b> (≈<b>40.7%</b>) and <b>Retired</b> (≈<b>32.1%</b>), showing a mix of active workers and retirees.  
Smaller segments include <b>Self-employed</b> (≈<b>8.6%</b>), <b>Unable to work</b> (≈<b>6.1%</b>), <b>Homemaker</b> (≈<b>3.9%</b>), and <b>Student</b> (≈<b>2.5%</b>).  
Unemployment is relatively low: <b>Out of work &lt; 1 year</b> (≈<b>2.4%</b>) and <b>≥ 1 year</b> (≈<b>2.0%</b>).  
Nonresponses are minimal, with <b>Unknown</b> at ≈<b>1.8%</b>.
</p>
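
<p>
Percentages like those quoted above can be derived directly from the labeled column with <code>value_counts(normalize=True)</code>. The sketch uses ten invented respondents, so its numbers are not the survey's.
</p>

```python
import pandas as pd

# Ten invented respondents, not survey data
status = pd.Series(
    ["Employed for wages"] * 4 + ["Retired"] * 3 + ["Self-employed"] * 2 + ["Unknown"]
)

# normalize=True returns proportions; multiply by 100 for percentages
share = (status.value_counts(normalize=True) * 100).round(1)
print(share)
```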

<h2 style="color:#F4A460;">5.6 Summary of Cleaned Variables</h2>

<p>
The <b>HEIGHT3</b> variable contained self-reported height values, with unrealistic or metric entries replaced by the <b>median height by sex</b>.  
This approach ensured a balanced and realistic height distribution between male and female respondents.
</p>

<p>
For <b>WEIGHT2</b>, most responses were within a typical adult range (100–300 lbs).  
Unrealistic outliers and invalid codes were replaced with <b>median weights by sex</b>, helping maintain accuracy and consistency in the data.
</p>

<p>
The <b>INCOME3</b> variable was cleaned and recoded into clear income brackets for easier interpretation.  
Approximately <b>19%</b> of responses were missing or uncertain, which were labeled as <b>“Unknown”</b> to preserve respondent counts without introducing bias.
</p>

<p>
The <b>CHILDREN</b> variable captured the number of people under 18 in each household.  
Responses above 20 were treated as invalid and replaced with missing values, while “Refused” and blank entries were labeled as <b>“Unknown”</b> to keep the dataset complete.
</p>

<p>
Finally, <b>EMPLOY1</b> described each respondent’s employment status.  
After cleaning, most respondents were classified as <b>“Employed for wages”</b> or <b>“Retired”</b>, with smaller groups representing self-employed, students, or those unable to work.  
Nonresponses were recorded as <b>“Unknown”</b> to maintain consistency with other demographic variables.
</p>

<h1 style="color:#E67E22;">Step 6 – Checking for Inconsistencies</h1>

<p>
In this step, the dataset will be examined to identify possible inconsistencies or implausible values. Numeric fields will be checked to ensure that ranges and units are logically consistent. Any detected issues will be addressed and documented to preserve data accuracy.
</p>

<h2 style="color:#F4A460;">6.1 Identifying numeric columns</h2>

<p>
This section performs an exploratory scan of all numeric variables, computing basic statistics and identifying extreme values that fall outside the 1st–99th percentile range. The goal is to flag potential anomalies before applying logical range validation in later steps.
</p>

In [None]:
# ------------------------------------------------------------
# STEP 6.1 — Extended numeric inconsistency scan (detection only)
# ------------------------------------------------------------

# --- Identify numeric columns (post-cleaning) ---
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
n_rows = len(df)

print(f"[Step 6.1] Scanning {len(num_cols)} numeric columns on {n_rows} rows...")

records = []
for col in num_cols:
    s = df[col]

    # --- Basic stats (ignore inf/-inf) ---
    s_valid = s.replace([np.inf, -np.inf], np.nan)

    q01 = s_valid.quantile(0.01)
    q99 = s_valid.quantile(0.99)
    col_min = s_valid.min()
    col_max = s_valid.max()
    n_missing = int(s_valid.isna().sum())

    # --- Proposed logical envelope (quantile-based, detection only) ---
    # This is a generic flag to surface extreme values for review.
    # Domain-specific ranges will be applied in the next sub-steps.
    in_bounds = s_valid.between(q01, q99)
    n_oob = int((~in_bounds & s_valid.notna()).sum())
    pct_oob = (n_oob / n_rows) * 100 if n_rows else 0.0

    # --- Extra quick checks ---
    n_negative = int((s_valid < 0).sum())

    records.append({
        "variable": col,
        "dtype": str(s.dtype),
        "min": col_min,
        "q01": q01,
        "q99": q99,
        "max": col_max,
        "n_missing": n_missing,
        "n_outside_q01_q99": n_oob,
        "pct_outside_q01_q99": round(pct_oob, 4),
        "n_negative": n_negative
    })



<h2 style="color:#F4A460;">6.2 Displaying top 15 variables outside of logical bounds</h2>

<p>
These results highlight the numeric variables with the highest proportions of values outside the 1st–99th percentile range. They provide an exploratory overview to guide which variables may require further logical validation.
</p>

In [None]:
# ------------------------------------------------------------
# STEP 6.2 — Summary table sorted by percent outside logical bounds
# -----------------------------------------------------------
df_inconsistency_allnum = (
    pd.DataFrame.from_records(records)
      .sort_values(by=["pct_outside_q01_q99", "n_outside_q01_q99"], ascending=False)
      .reset_index(drop=True)
)

print("[Step 6.2] Extended numeric scan completed.")
print("Top 15 variables by % outside [q01, q99]:")
display_cols = [
    "variable", "dtype", "min", "q01", "q99", "max",
    "n_missing", "n_negative", "n_outside_q01_q99", "pct_outside_q01_q99"
]
df_inconsistency_allnum[display_cols].head(15)
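
<p>
The percentile-based flag can be isolated into a small helper to see how it behaves on a single column. <code>count_outside_quantiles</code> is a hypothetical name used only for this sketch, and the data are invented. Note that for a continuous column roughly 2% of values fall outside the 1st–99th percentile band by construction, so the flag is a prompt for review rather than evidence of an error.
</p>

```python
import numpy as np
import pandas as pd

def count_outside_quantiles(s, lower=0.01, upper=0.99):
    """Count non-missing values outside the [lower, upper] quantile band."""
    s = s.replace([np.inf, -np.inf], np.nan)
    lo, hi = s.quantile(lower), s.quantile(upper)
    outside = ~s.between(lo, hi) & s.notna()
    return int(outside.sum())

# 1..99 plus one extreme value and one missing value
values = pd.Series(list(range(1, 100)) + [10_000, np.nan])
n_flagged = count_outside_quantiles(values)
print(n_flagged)
```

<p>
Here the smallest value and the extreme value are flagged, while the missing entry is not counted.
</p>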


<h2 style="color:#F4A460;">6.3 Corrections & unit standardization</h2>

<p>
In this step, the detected inconsistencies were addressed by creating adjusted versions of key demographic and anthropometric variables. Height and weight were consolidated from multiple sources and converted into consistent units, while implausible values were replaced with missing entries. Age and number of children were also validated against logical population ranges. All adjustments were stored in new <b>*_adj</b> columns to preserve the integrity of the original data.
</p>

In [None]:
# ------------------------------------------------------------
# STEP 6.3 — Corrections & unit standardization (apply changes)
# ------------------------------------------------------------

# --- Helpers ---
def first_non_null(*series):
    """Return the first non-null value across multiple aligned Series."""
    out = series[0].copy()
    for s in series[1:]:
        out = out.where(out.notna(), s)
    return out

def to_numeric_safe(s):
    return pd.to_numeric(s, errors="coerce")
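
<p>
A short usage sketch for <code>first_non_null</code> follows, with the helper repeated so the block runs on its own. The three toy Series are invented, and the centimeter source is converted with 1 in = 2.54 cm. For two Series, pandas' built-in <code>Series.combine_first</code> gives a similar result.
</p>

```python
import numpy as np
import pandas as pd

def first_non_null(*series):
    """Return, row by row, the first non-null value across aligned Series."""
    out = series[0].copy()
    for s in series[1:]:
        out = out.where(out.notna(), s)
    return out

inches_primary = pd.Series([70.0, np.nan, np.nan])
inches_backup = pd.Series([71.0, 64.0, np.nan])
cm = pd.Series([np.nan, np.nan, 180.0])

# Earlier arguments win; later ones only fill remaining gaps
combined = first_non_null(inches_primary, inches_backup, cm / 2.54)
print(combined.round(2).tolist())
```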



<h2 style="color:#F4A460;">6.3.1 Height (inches)</h2>

<p>
This block consolidates height information from all available sources, converts centimeters to inches when necessary, and creates an adjusted height variable. Implausible values outside the 48–84 inch range are set to missing to ensure logical consistency.
</p>

In [None]:
# --- Height (inches): build HEIGHT3_in_adj using best available source ---
h_in  = to_numeric_safe(df.get("HEIGHT3_in")) if "HEIGHT3_in" in df.columns else None
htin4 = to_numeric_safe(df.get("HTIN4")) if "HTIN4" in df.columns else None
htm4  = to_numeric_safe(df.get("HTM4")) if "HTM4" in df.columns else None  # centimeters

height_from_cm = htm4 * 0.3937007874 if htm4 is not None else None
height_candidates = [s for s in [h_in, htin4, height_from_cm] if s is not None]

if height_candidates:
    df["HEIGHT3_in_adj"] = first_non_null(*height_candidates)
    # Range enforcement (adult plausible): 48–84 inches
    before_valid = df["HEIGHT3_in_adj"].notna().sum()
    df.loc[~df["HEIGHT3_in_adj"].between(48, 84), "HEIGHT3_in_adj"] = np.nan
    after_valid = df["HEIGHT3_in_adj"].notna().sum()
    print(f"[Step 6.3] HEIGHT3_in_adj: valid before={before_valid}, after={after_valid}")
else:
    print("[Step 6.3] HEIGHT3_in_adj: no height sources available (skipped)")

<h2 style="color:#F4A460;">6.3.2 Weight (pounds)</h2>

<p>
Weight information is consolidated from pounds and kilograms, converting kilograms to pounds when available and selecting the first non-missing value across sources. Implausible adult weights outside the 70–700 lb range are corrected when possible or set to missing to maintain logical consistency.
</p>

In [None]:
# --- Weight (pounds): build WEIGHT_lb_adj using WEIGHT_lb or WTKG3 ---
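w_lb   = to_numeric_safe(df.get("WEIGHT_lb")) if "WEIGHT_lb" in df.columns else None
# WTKG3 is documented in the BRFSS codebook as kilograms with two implied decimal
# places (e.g. 8165 = 81.65 kg), so rescale to kilograms before converting.
# Verify this against the codebook for the survey year in use.
wtkg3  = to_numeric_safe(df.get("WTKG3")) / 100 if "WTKG3" in df.columns else None
w_from_kg = wtkg3 * 2.2046226218 if wtkg3 is not None else None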

weight_candidates = []
# Prefer existing pounds; if missing or implausible, fallback to kg-converted
if w_lb is not None:
    weight_candidates.append(w_lb)
if w_from_kg is not None:
    weight_candidates.append(w_from_kg)

if weight_candidates:
    base_w = first_non_null(*weight_candidates)
    # Try to salvage implausible values using kg source when available
    # Plausible adult weight in lb: 70–700
    w_adj = base_w.copy()
    impl_mask = ~(w_adj.between(70, 700))
    if w_from_kg is not None:
        # Replace implausible pounds with kg-converted where kg is plausible (30–250 kg)
        kg_ok = wtkg3.between(30, 250)
        # Series.mask keeps the result a pandas Series aligned on the original index
        w_adj = w_adj.mask(impl_mask & kg_ok, w_from_kg)

    df["WEIGHT_lb_adj"] = w_adj
    before_valid = w_adj.notna().sum()
    # Final range enforcement
    df.loc[~df["WEIGHT_lb_adj"].between(70, 700), "WEIGHT_lb_adj"] = np.nan
    after_valid = df["WEIGHT_lb_adj"].notna().sum()
    print(f"[Step 6.3] WEIGHT_lb_adj: valid before={before_valid}, after={after_valid}")
else:
    print("[Step 6.3] WEIGHT_lb_adj: no weight sources available (skipped)")
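
<p>
As a minimal sketch of the salvage logic above on toy data: the values and the names <code>lb</code> and <code>kg</code> are illustrative assumptions, not taken from the BRFSS file, and <code>fillna</code> stands in for the notebook's <code>first_non_null</code> helper, whose definition lies outside this section.
</p>

```python
import numpy as np
import pandas as pd

# Toy inputs (hypothetical values): pounds as reported, kilograms from a second source
lb = pd.Series([150.0, 9999.0, np.nan, 5.0])
kg = pd.Series([68.0, 80.0, 90.0, 500.0])
lb_from_kg = kg * 2.2046226218

# Start from pounds, falling back to converted kilograms where pounds is missing
w = lb.fillna(lb_from_kg)

# Replace implausible values with the kg-derived weight when the kg value is plausible
implausible = ~w.between(70, 700)
kg_ok = kg.between(30, 250)
w = w.mask(implausible & kg_ok, lb_from_kg)

# Final range enforcement: anything still outside 70–700 lb becomes missing
w = w.mask(~w.between(70, 700))
print(w.round(1).tolist())
```

<p>
The last row shows the case where neither source is plausible, so the value ends up missing rather than being guessed.
</p>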

<h2 style="color:#F4A460;">6.3.3 Children</h2>

<p>
The reported number of children is validated by enforcing a logical range of 0–20. Values outside this interval are set to missing, and the adjusted results are stored in a separate column to preserve the original data.
</p>

In [None]:
# --- CHILDREN: non-negative integers within a reasonable cap ---
if "CHILDREN" in df.columns:
    ch = to_numeric_safe(df["CHILDREN"])
    before_valid = ch.notna().sum()
    ch = ch.mask(~ch.between(0, 20))  # set implausible counts to NaN
    df["CHILDREN_adj"] = ch
    after_valid = df["CHILDREN_adj"].notna().sum()
    print(f"[Step 6.3] CHILDREN_adj: valid before={before_valid}, after={after_valid}")
else:
    print("[Step 6.3] CHILDREN_adj: column not found (skipped)")


<h2 style="color:#F4A460;">6.3.4 Age</h2>

<p>
Age is taken from the primary source <b>_AGE80</b> (falling back to <b>_AGE</b> when it is absent) and validated to fall within the adult range of 18–99 years. Any values outside this interval are set to missing, and the cleaned version is stored in a separate adjusted column.
</p>

In [None]:
# --- Age: prefer _AGE80 when available; enforce adult range 18–99 ---
age_series = None
if "_AGE80" in df.columns:
    age_series = to_numeric_safe(df["_AGE80"])
elif "_AGE" in df.columns:
    age_series = to_numeric_safe(df["_AGE"])  # fallback if exists

if age_series is not None:
    before_valid = age_series.notna().sum()
    age_series = age_series.mask(~age_series.between(18, 99))
    df["AGE_adj"] = age_series
    after_valid = df["AGE_adj"].notna().sum()
    print(f"[Step 6.3] AGE_adj: valid before={before_valid}, after={after_valid}")
else:
    print("[Step 6.3] AGE_adj: no age source available (skipped)")
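
<p>
The range checks for CHILDREN and age follow the same pattern, which could be factored into a small helper. The sketch below is one possible form; <code>enforce_range</code> is a hypothetical name, and <code>pd.to_numeric</code> stands in for the notebook's <code>to_numeric_safe</code>. It also illustrates that <code>between</code> is inclusive on both ends and treats missing values as out of range.
</p>

```python
import numpy as np
import pandas as pd

def enforce_range(values: pd.Series, low: float, high: float) -> pd.Series:
    """Return a copy where entries outside [low, high] are replaced by NaN."""
    numeric = pd.to_numeric(values, errors="coerce")
    # between() is inclusive by default and returns False for NaN
    return numeric.mask(~numeric.between(low, high))

# Toy ages (hypothetical): boundary values 18 and 99 are kept, 17 and 100 are not
ages = pd.Series([17, 18, 45, 99, 100, np.nan])
cleaned = enforce_range(ages, 18, 99)
print(cleaned.tolist())
```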

<h2 style="color:#F4A460;">6.4 Summary of data adjustments</h2>

<p>
A compact summary is generated to document the number of missing and non-missing values for each adjusted variable. This table provides a clear overview of the impact of the corrections applied during the inconsistency handling process.
</p>

In [None]:
# --- Compact summary of adjustments (for documentation) ---
summary_cols = [c for c in ["HEIGHT3_in_adj", "WEIGHT_lb_adj", "BMI_adj", "CHILDREN_adj", "AGE_adj"] if c in df.columns]
adj_summary = (
    pd.DataFrame({
        "variable": summary_cols,
        "n_missing": [df[c].isna().sum() for c in summary_cols],
        "n_non_missing": [df[c].notna().sum() for c in summary_cols]
    })
    .assign(rows_total=len(df))
)
print("[Step 6.4] Adjustment summary:")
adj_summary

<p>
All numeric fields were first screened using quantile-based diagnostics to identify potential anomalies. Logical validations were then applied to height, weight, age, and number of children, ensuring consistent units and enforcing plausible human ranges. Adjusted variables were created to preserve original data, and a summary table documented the impact of these corrections.
</p>

<h1 style="color:#E67E22;">Step 7 – Aggregate the Data</h1>

<p>
In this step, we will create derived variables and summarize the data cleaning process. 
This includes calculating Body Mass Index (BMI), creating categorical variables for income and education levels, 
and providing a comprehensive summary of the rows dropped and the values imputed during the cleaning process.
</p>

<h2 style="color:#F4A460;">7.1 Create Body Mass Index (BMI) Column</h2>

<p>
Body Mass Index (BMI) is a key health indicator that relates an individual's weight to their height. 
It is calculated using the formula: <b>BMI = weight (kg) / height (m)<sup>2</sup></b>.
</p>
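
<p>
As a quick worked example of the formula, the sketch below converts a hypothetical respondent (180 lb, 70 in; values chosen for illustration only) to metric units and computes the BMI. The constant names are assumptions introduced here.
</p>

```python
# Worked example with hypothetical values: 180 lb and 70 in
LB_TO_KG = 0.45359237   # exact definition of the international pound
IN_TO_M = 0.0254        # exact definition of the inch

weight_kg = 180 * LB_TO_KG        # about 81.65 kg
height_m = 70 * IN_TO_M           # 1.778 m
bmi = weight_kg / height_m ** 2
print(round(bmi, 1))
```

<p>
The result, about 25.8, sits just above the 25 threshold used for the "Overweight" category.
</p>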

In [None]:
# Check if we have the adjusted variables from Step 6
height_col = 'HEIGHT3_in_adj' if 'HEIGHT3_in_adj' in df.columns else 'HEIGHT3_in'
weight_col = 'WEIGHT_lb_adj' if 'WEIGHT_lb_adj' in df.columns else 'WEIGHT_lb'

print(f"Using {height_col} and {weight_col} for BMI calculation")

# Convert to metric units and calculate BMI
# Weight: pounds to kilograms (same factor as Step 6.3.2: 1 kg = 2.2046226218 lb)
# Height: inches to meters (multiply by 0.0254)
df['weight_kg'] = df[weight_col] / 2.2046226218
df['height_m'] = df[height_col] * 0.0254

# Calculate BMI = weight(kg) / height(m)²
df['BMI'] = df['weight_kg'] / (df['height_m'] ** 2)

# Check for valid BMI range (typically 10-80)
valid_bmi_mask = df['BMI'].between(10, 80)
df.loc[~valid_bmi_mask, 'BMI'] = np.nan

print(f"BMI Statistics:")
print(f"Valid BMI values: {df['BMI'].notna().sum():,}")
print(f"Missing BMI values: {df['BMI'].isna().sum():,}")
print(f"BMI Range: {df['BMI'].min():.1f} to {df['BMI'].max():.1f}")
print(f"Mean BMI: {df['BMI'].mean():.1f}")
print(f"Median BMI: {df['BMI'].median():.1f}")

# Clean up temporary columns
df.drop(['weight_kg', 'height_m'], axis=1, inplace=True)

In [None]:
# Create BMI categories according to WHO standards
def categorize_bmi(bmi):
    if pd.isna(bmi):
        return "Unknown"
    elif bmi < 18.5:
        return "Underweight"
    elif 18.5 <= bmi < 25:
        return "Normal weight"
    elif 25 <= bmi < 30:
        return "Overweight"
    else:
        return "Obese"

df['BMI_Category'] = df['BMI'].apply(categorize_bmi)

# Display BMI category distribution
print("BMI Category Distribution:")
print(df['BMI_Category'].value_counts(dropna=False))
print(f"\nPercentage Distribution:")
bmi_pct = (df['BMI_Category'].value_counts(dropna=False) / len(df) * 100).round(2)
for category, percentage in bmi_pct.items():
    print(f"{category}: {percentage}%")
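
<p>
On a dataset of this size, a vectorized alternative to <code>apply</code> may be preferable. The sketch below uses <code>pd.cut</code> with the same cut points on toy values (hypothetical); it is one possible equivalent, not the method used above.
</p>

```python
import numpy as np
import pandas as pd

bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 30.0, np.nan])  # toy values

# right=False makes each bin closed on the left: [18.5, 25), [25, 30), ...
categories = pd.cut(
    bmi,
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal weight", "Overweight", "Obese"],
    right=False,
)

# Missing BMI falls in no bin; label it explicitly to match categorize_bmi
categories = categories.cat.add_categories("Unknown").fillna("Unknown")
print(categories.tolist())
```

<p>
Setting <code>right=False</code> matters here: with the default right-closed bins, a BMI of exactly 25 would be labeled "Normal weight" instead of "Overweight".
</p>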

<h2 style="color:#F4A460;">7.2 Create Appropriate Categories for Income and Education Levels</h2>

<p>
We will recode income and education into a smaller number of meaningful groups for analysis.
</p>

<h2 style="color:#F4A460;">7.2.1 Create Income Level Categories</h2>

<p>
We will create an income category variable based on INCOME3, consolidating its codes into more interpretable groups 
that facilitate statistical analysis and visualization.
</p>

In [None]:
# Create consolidated income categories
def categorize_income(income_code):
    if pd.isna(income_code):
        return "Unknown"
    elif income_code in [1, 2, 3]:  # Less than $20,000
        return "Low Income"
    elif income_code in [4, 5]:     # $20,000 to < $35,000
        return "Lower-Middle Income"
    elif income_code in [6, 7]:     # $35,000 to < $75,000
        return "Middle Income"
    elif income_code in [8, 9]:     # $75,000 to < $150,000
        return "Upper-Middle Income"
    elif income_code in [10, 11]:   # $150,000 or more
        return "High Income"
    else:
        return "Unknown"

# Apply income categorization
df['Income_Category'] = df["INCOME3"].apply(categorize_income)

# Display income category distribution
print("Income Category Distribution:")
income_dist = df['Income_Category'].value_counts(dropna=False)
print(income_dist)

print(f"\nPercentage Distribution:")
income_pct = (income_dist / len(df) * 100).round(2)
for category, percentage in income_pct.items():
    print(f"{category}: {percentage}%")
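
<p>
The same grouping can also be written as a lookup table, which keeps the code-to-label mapping in one place. This is a sketch on toy codes (hypothetical values); <code>INCOME_GROUPS</code> is a name introduced here, and any code outside 1–11 (here 77 and 99) or a missing value falls through to "Unknown", as in <code>categorize_income</code>.
</p>

```python
import numpy as np
import pandas as pd

# Same grouping as categorize_income, expressed as a code-to-label lookup
INCOME_GROUPS = {
    1: "Low Income", 2: "Low Income", 3: "Low Income",
    4: "Lower-Middle Income", 5: "Lower-Middle Income",
    6: "Middle Income", 7: "Middle Income",
    8: "Upper-Middle Income", 9: "Upper-Middle Income",
    10: "High Income", 11: "High Income",
}

codes = pd.Series([1.0, 5.0, 7.0, 11.0, 77.0, 99.0, np.nan])  # toy codes

# Unmapped codes and NaN become NaN after map(), then "Unknown"
income_category = codes.map(INCOME_GROUPS).fillna("Unknown")
print(income_category.tolist())
```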

<h2 style="color:#F4A460;">7.2.2 Create Education Level Categories</h2>

<p>
Education levels will be categorized into meaningful groups based on the EDUCA variable.
This categorization supports the analysis of health outcomes by educational attainment and 
provides insight into socioeconomic patterns in the data.
</p>

In [None]:
# Create education categories based on standard coding
def categorize_education(edu_code):
    if pd.isna(edu_code) or edu_code == 9:
        return "Unknown"
    elif edu_code == 1:
        return "No Formal Education"
    elif edu_code == 2:
        return "Elementary"
    elif edu_code == 3:
        return "Some High School"
    elif edu_code == 4:
        return "High School Graduate"
    elif edu_code == 5:
        return "Some College/Technical"
    elif edu_code == 6:
        return "College Graduate"
    else:
        return "Unknown"

# Apply education categorization
df['Education_Category'] = df["EDUCA"].apply(categorize_education)

# Display education category distribution
print("Education Category Distribution:")
edu_dist = df['Education_Category'].value_counts(dropna=False)
print(edu_dist)

print(f"\nPercentage Distribution:")
edu_pct = (edu_dist / len(df) * 100).round(2)
for category, percentage in edu_pct.items():
    print(f"{category}: {percentage}%")
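
<p>
Because educational attainment has a natural order, it may help to store the labels as an ordered categorical so that tables and plots follow that order instead of sorting by frequency or alphabetically. The sketch below uses toy labels (hypothetical); <code>EDU_ORDER</code> is a name introduced here, and placing "Unknown" last is an assumption.
</p>

```python
import pandas as pd

# Attainment order, lowest to highest, with "Unknown" last
EDU_ORDER = [
    "No Formal Education", "Elementary", "Some High School",
    "High School Graduate", "Some College/Technical",
    "College Graduate", "Unknown",
]

# Toy labels (hypothetical), as produced by categorize_education
labels = pd.Series([
    "College Graduate", "Unknown", "Elementary", "High School Graduate",
    "Some High School", "No Formal Education", "Some College/Technical",
    "College Graduate",
])

edu_ordered = pd.Series(pd.Categorical(labels, categories=EDU_ORDER, ordered=True))

# sort_index follows the category order rather than alphabetical order
counts = edu_ordered.value_counts().sort_index()
print(counts)
```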

<h2 style="color:#F4A460;">7.3 Summary of Rows Dropped and Imputed</h2>

<p>
This section provides a summary of all data cleaning operations performed throughout the analysis,
including the number of rows dropped, columns removed, and values imputed.
</p>

In [None]:
# Variables for original data shape
original_rows = 457670
original_cols = 301

# Variables for transformed data shape
current_rows = len(df)
current_cols = len(df.columns)

print(f"\n DATASET OVERVIEW:")
print(f"{'Original dataset shape:':<30} {original_rows:,} rows × {original_cols} columns")
print(f"{'Final dataset shape:':<30} {current_rows:,} rows × {current_cols} columns")
print(f"{'Rows retained:':<30} {current_rows:,} ({(current_rows/original_rows)*100:.1f}%)")
print(f"{'Columns retained:':<30} {current_cols} ({(current_cols/original_cols)*100:.1f}%)")

print(f"\n ROWS DROPPED:")
rows_dropped = original_rows - current_rows
print(f"{'Total rows dropped:':<30} {rows_dropped:,} ({(rows_dropped/original_rows)*100:.1f}%)")
print(f"{'- Duplicate rows:':<30} 0 (no duplicates found)")
print(f"{'- Rows with missing values:':<30} {rows_dropped:,} (negligible missing data)")

print(f"\n COLUMNS DROPPED:")
cols_dropped = original_cols - current_cols
print(f"{'Total columns dropped:':<30} {cols_dropped} ({(cols_dropped/original_cols)*100:.1f}%)")
print(f"{'- Excessive missing data (>15k):':<30} {cols_dropped-60} columns")
print(f"{'- Irrelevant Columns:':<30} 60 columns")

print(f"\n VALUES IMPUTED:")
height3_imputed = height3_missing_before_imputation - df['HEIGHT3'].isna().sum()
weight2_imputed = weight2_missing_before_imputation - df['WEIGHT2'].isna().sum()
income3_processed = income3_missing_before_imputation - (df['Income_Level'] == 'Unknown').sum()
children_processed = children_missing_before_imputation - (df['Children_Label'] == 'Unknown').sum()
employ1_processed = employ1_missing_before_imputation - (df['Employment_Status'] == 'Unknown').sum()

print(f"{'HEIGHT3 (median by sex):':<35} {height3_imputed:,} values imputed")
print(f"{'WEIGHT2 (median by sex):':<35} {weight2_imputed:,} values imputed")
print(f"{'INCOME3 (set to Unknown):':<35} {income3_processed:,} values processed (labeled as Unknown)")
print(f"{'CHILDREN (kept as missing):':<35} {children_processed:,} values processed (labeled as Unknown)")
print(f"{'EMPLOY1 (set to Unknown):':<35} {employ1_processed:,} values processed (labeled as Unknown)")

# Save cleaned dataset to a new CSV file
df.to_csv("LLCP2024_Cleaned.csv", index=False)


<h1 style="color:#E67E22;">Step 8 – Final Summary of the Project</h1>

<p>
This section gives a general overview of the work done during the project using the CRISP-DM approach. We started with a clear idea of our analytical goals, then examined the dataset to understand its structure, quality, and main features. In the Data Understanding phase, we identified important issues such as missing values, outliers, and the distribution of categories, using descriptive statistics and visual aids.
</p>

<p>
During the Data Preparation phase, we carried out the necessary cleaning tasks: removing unnecessary columns, checking for duplicates, dropping rows with a small number of missing values, and imputing key variables. This left the dataset consistent and ready to use. We also applied additional transformations and recoding to prepare it for future modeling.
</p>

<p>
By fully understanding and prepping the data, this project lays a dependable foundation for more in-depth statistical or predictive work down the line in the CRISP-DM process.
</p>