# Diabetes Risk Prediction
This project uses Behavioral Risk Factor Surveillance System (BRFSS) survey data from [the CDC's 2024 BRFSS annual data page](https://www.cdc.gov/brfss/annual_data/annual_2024.html) to predict the probability of developing different types of diabetes. The features describe U.S. residents and include demographic data (e.g., income level, education, race) as well as data on health-related risk behaviors, chronic health conditions, and use of preventive services.

This is the first notebook of the project. It parses the raw ASCII data file, available at the link above, and extracts the relevant target and feature variables for subsequent EDA and modeling in another notebook in this folder.

The dataset contains 2 identifier columns, 3 target variable candidates, and a total of 26 potential features, as described below:
- Each row is uniquely identified by (i.e., the table's grain):
  1. "State FIPS Code"
  2. "Annual Sequence Number" 
- Target variable candidates related to Diabetes:
  1. "(Ever told) you had diabetes"
  2. "Ever been told by a doctor or other health professional that you have pre-diabetes or borderline diabetes?"
  3. "What type of diabetes do you have?"
- Demographic features:
  1. "Urban/Rural Status"
  2. "Reported age in five-year age categories calculated variable"
  3. "Sex of Respondent"
  4. "Computed Race-Ethnicity grouping"
  5. "Education Level"
  6. "Income Level"
- Personal health features:
  1. "Have Personal Health Care Provider?"
  2. "Could Not Afford To See Doctor"
  3. "Computed Weight in Kilograms"
  4. "Computed Height in Meters"
  5. "Computed body mass index"
  6. "Exercise in Past 30 Days"
  7. "How often did you drink regular soda or pop that contains sugar?"
  8. "How often did you drink sugar-sweetened drinks?"
  9. "Computed Smoking Status"
  10. "Computed number of drinks of alcohol beverages per week"
  11. "Drink any alcoholic beverages in past 30 days"
  12. "Heavy Alcohol Consumption  Calculated Variable"
  13. "General Health"
- Other disease indicator features:
  1. "Ever Diagnosed with Heart Attack"
  2. "Ever Diagnosed with Angina or Coronary Heart Disease"
  3. "Ever Diagnosed with a Stroke"
  4. "Ever told you have kidney disease?"
  5. "Ever Told Had Asthma"
  6. "(Ever told) you had a depressive disorder"
  7. "Told Had Arthritis"

## Setup
### Define parameters

In [1]:
# URLs with input data
raw_data_url = "https://www.cdc.gov/brfss/annual_data/2024/files/LLCP2024ASC.zip"
data_dict_url = "https://www.cdc.gov/brfss/annual_data/2024/zip/codebook24_llcp-v2-508.zip"

# Define columns to extract using labels from HTML file
columns_to_extract = [
    "State FIPS Code",
    "Annual Sequence Number",
    "(Ever told) you had diabetes",
    "Ever been told by a doctor or other health professional that you have pre-diabetes or borderline diabetes?",
    "What type of diabetes do you have?",
    "Urban/Rural Status",
    "Reported age in five-year age categories calculated variable",
    "Sex of Respondent",
    "Computed Race-Ethnicity grouping",
    "Education Level",
    "Income Level",
    "Have Personal Health Care Provider?",
    "Could Not Afford To See Doctor",
    "Computed Weight in Kilograms",
    "Computed Height in Meters",
    "Computed body mass index",
    "Exercise in Past 30 Days",
    "How often did you drink regular soda or pop that contains sugar?",
    "How often did you drink sugar-sweetened drinks?",
    "Computed Smoking Status",
    "Computed number of drinks of alcohol beverages per week",
    "Drink any alcoholic beverages in past 30 days",
    "Heavy Alcohol Consumption  Calculated Variable",
    "General Health",
    "Ever Diagnosed with Heart Attack",
    "Ever Diagnosed with Angina or Coronary Heart Disease",
    "Ever Diagnosed with a Stroke",
    "Ever told you have kidney disease?",
    "Ever Told Had Asthma",
    "(Ever told) you had a depressive disorder",
    "Told Had Arthritis"
]

# Output file for writing final dataframe
output_file = "diabetes_data.pickle"

### Import packages

In [2]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import pickle
import re

pd.set_option('display.max_columns', None)

### Define Functions

In [None]:
def parse_brfss_dictionary(html_file):
    """
    Parse HTML data dictionary to extract both column definitions and value-to-label mappings
    in a single pass through the file.
    
    Parameters:
    -----------
    html_file : str
        Path to the HTML data dictionary file
    
    Returns:
    --------
    tuple : (column_lookup, codebook)
        - column_lookup: dict mapping variable labels to metadata
          Format: {label: {'column_range': str, 'type': str, 'sas_name': str}}
        - codebook: dict mapping SAS variable names to value-label mappings
          Format: {sas_variable_name: {value: label}}
    """
    with open(html_file, 'r', encoding='windows-1252') as f:
        soup = BeautifulSoup(f, 'html.parser')
    
    column_lookup = {}
    codebook = {}
    
    # Find all variable tables (one pass through HTML)
    tables = soup.find_all('table', {'class': 'table'})
    
    for table in tables:
        # Extract metadata from header cell
        metadata_cell = table.find('td', {'class': 'l m linecontent'})
        if not metadata_cell:
            continue
        
        metadata_text = metadata_cell.get_text()
        
        # Only process cells that contain variable definitions
        if 'Label:' not in metadata_text or 'Column:' not in metadata_text:
            continue
        
        # Extract label (between "Label:" and "Section Name:")
        # Note: HTML uses \xa0 (non-breaking spaces)
        label_match = re.search(r'Label:[\s\xa0]+(.+?)Section[\s\xa0]+Name:', metadata_text)
        
        # Extract column range (format: "N" or "N-M")
        column_match = re.search(r'Column:[\s\xa0]+(\d+(?:-\d+)?)', metadata_text)
        
        # Extract variable type (Num or Char)
        type_match = re.search(r'Type[\s\xa0]+of[\s\xa0]+Variable:[\s\xa0]+(Num|Char)', metadata_text)
        
        # Extract SAS variable name (stops before "Question")
        varname_match = re.search(r'SAS[\s\xa0]+Variable[\s\xa0]+Name:[\s\xa0]+(\w+?)(?=Question)', metadata_text)
        
        if not (label_match and column_match and varname_match):
            continue
        
        # Store column metadata
        label = label_match.group(1).strip().replace('\xa0', ' ')
        column_range = column_match.group(1)
        var_type = type_match.group(1) if type_match else None
        var_name = varname_match.group(1)
        
        column_lookup[label] = {
            'column_range': column_range,
            'type': var_type,
            'sas_name': var_name
        }
        
        # Calculate column width for this variable
        # This determines the zero-padding needed for values
        if '-' in column_range:
            start, end = map(int, column_range.split('-'))
            column_width = end - start + 1
        else:
            column_width = 1
        
        # Extract value-label mappings from table body
        tbody = table.find('tbody')
        if not tbody:
            continue
        
        value_labels = {}
        has_categorical_values = False
        has_range_values = False  # Track if this variable has any range values
        
        for row in tbody.find_all('tr'):
            cells = row.find_all('td')
            if len(cells) < 2:
                continue
            
            # Extract value (first column)
            value = cells[0].get_text(strip=True)
            
            # Check if this is a range value (e.g., "1 - 97", "100 - 999")
            # Range values indicate continuous/numeric variables
            if ' - ' in value:
                has_range_values = True
                continue  # Skip adding ranges to the codebook
            
            # Extract label (second column)
            label_html = cells[1]
            
            # Get text and split by line breaks to separate notes
            label_text = label_html.get_text(separator='|')
            label_parts = label_text.split('|')
            
            # Take first part (before notes)
            value_label = label_parts[0].strip()
            
            # Clean skip logic (remove "→Go to..." instructions)
            value_label = re.sub(r'→Go to.*$', '', value_label).strip()
            
            # Clean encoding issues: Replace "Donï¿½t know" and similar patterns with "Unknown"
            # This handles character encoding issues from the HTML
            if re.match(r'^Don.{1,3}t know', value_label, re.IGNORECASE):
                value_label = 'Unknown'
            # Also catch "Refused" variations and other missing data indicators
            elif value_label.lower().startswith('refused'):
                value_label = 'Refused'
            
            # Remove encoding error characters and everything after them (e.g., "Noï¿½Go" → "No")
            # "ï¿½" is how the UTF-8 bytes of U+FFFD (the Unicode replacement character)
            # appear when the file is decoded as windows-1252
            value_label = re.sub(r'ï¿½.*$', '', value_label).strip()
            
            # Check if this is a categorical value (not just special codes)
            # Special codes: HIDDEN, BLANK, and codes like 7/9/77/99/777/999
            is_special = value in ['HIDDEN', 'BLANK'] or re.match(r'^[79]+$', value)
            if not is_special:
                has_categorical_values = True
            
            # Pad numeric values to match the column width in the ASCII file
            # This ensures codebook values match the fixed-width format
            if value.isdigit() and len(value) < column_width:
                value_padded = value.zfill(column_width)
            else:
                value_padded = value
            
            # Store mapping with properly padded value
            value_labels[value_padded] = value_label
        
        # Only add to codebook if:
        # 1. Has meaningful categorical values (not just HIDDEN/BLANK/special codes)
        # 2. Does NOT have any range values (which indicate continuous variables)
        # Variables with ranges (like _DRNKWK3, SSBSUGR2) should remain numeric
        if value_labels and has_categorical_values and not has_range_values:
            codebook[var_name] = value_labels
    
    print(f"Successfully parsed {len(column_lookup)} variable definitions and {len(codebook)} value label mappings from HTML dictionary")
    return column_lookup, codebook


def apply_value_labels(df, codebook, columns_to_label=None):
    """
    Apply value-to-label mappings to DataFrame columns.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with numeric/coded values
    codebook : dict
        Value label mappings from parse_brfss_dictionary()
    columns_to_label : list, optional
        Specific columns to label. If None, attempts to label all columns.
    
    Returns:
    --------
    pd.DataFrame : DataFrame with values replaced by labels
    """
    df_labeled = df.copy()
    
    if columns_to_label is None:
        columns_to_label = df.columns
    
    labeled_count = 0
    skipped_vars = []
    
    for col in columns_to_label:
        # Check if column has value labels in codebook
        if col not in codebook:
            continue
        
        value_map = codebook[col]
        
        # Test mapping on a sample to see if it's appropriate
        # If less than 50% of non-null values can be mapped, skip this variable
        # (it's likely a continuous variable or identifier)
        sample = df[col].dropna().head(1000)
        if len(sample) > 0:
            test_mapped = sample.astype(str).str.strip().map(value_map)
            mapping_success_rate = test_mapped.notna().sum() / len(sample)
            
            if mapping_success_rate < 0.5:
                skipped_vars.append(f"{col} ({mapping_success_rate:.1%} mappable)")
                continue
        
        # Apply mapping
        df_labeled[col] = df[col].astype(str).str.strip().map(value_map)
        
        # Count how many values were successfully mapped
        mapped = df_labeled[col].notna().sum()
        if mapped > 0:
            labeled_count += 1
            print(f"  Labeled {col}: {mapped:,} / {df[col].notna().sum():,} values ({mapped/df[col].notna().sum()*100:.1f}%)")
    
    if skipped_vars:
        print(f"\n  Skipped (continuous/identifier): {', '.join(skipped_vars)}")
    
    print(f"\nSuccessfully labeled {labeled_count} columns")
    return df_labeled


def parse_data_file(html_file, asc_file, columns_to_extract):
    """
    Parse BRFSS ASCII data file using HTML data dictionary.
    
    Parameters:
    -----------
    html_file : str
        Path to HTML data dictionary file
    asc_file : str
        Path to ASCII data file
    columns_to_extract : list
        List of variable labels to extract from the data
    
    Returns:
    --------
    pd.DataFrame : Parsed data with value labels applied
    """
    # Step 1: Parse HTML data dictionary (single pass for both metadata and value labels)
    print("Parsing HTML data dictionary...")
    column_lookup, codebook = parse_brfss_dictionary(html_file)
    
    # Step 2: Convert labels to colspecs for pd.read_fwf()
    colspecs = []
    column_names = []
    dtypes = {}
    
    print("\nMapping columns:")
    for label in columns_to_extract:
        if label in column_lookup:
            col_info = column_lookup[label]
            col_range = col_info['column_range']
            
            # Parse "1-2" or "149" format
            if '-' in col_range:
                start, end = map(int, col_range.split('-'))
            else:
                start = end = int(col_range)
            
            # Convert the 1-based inclusive range to the 0-based half-open
            # (start, end) pair that pd.read_fwf() expects
            colspecs.append((start - 1, end))
            
            # Use SAS variable name for column name (to match codebook keys)
            col_name = col_info['sas_name']
            column_names.append(col_name)
            
            # Set dtype (start with string for safety, can convert later)
            dtypes[col_name] = str
            
            print(f"  {label} -> {col_name} (columns {col_range})")
        else:
            print(f"  WARNING: '{label}' not found in data dictionary")
    
    print(f"\nPrepared to extract {len(colspecs)} columns from ASCII file")
    
    # Step 3: Read the ASCII file using pd.read_fwf()
    print(f"\nReading ASCII file: {asc_file}")
    df = pd.read_fwf(
        asc_file, 
        colspecs=colspecs,
        names=column_names,
        dtype=dtypes,
        encoding='ascii'
    )
    
    print(f"Successfully loaded {len(df):,} rows and {len(df.columns)} columns")
    
    # Step 4: Apply value labels to DataFrame
    print("\nApplying value labels to DataFrame...")
    df = apply_value_labels(df, codebook)
    return df


def apply_decimal_transformations(df):
    """
    Apply decimal transformations to numeric columns with implied decimal places.
    
    Transforms columns by dividing by 10^decimal_places to convert from
    fixed-width integer representation to proper decimal values.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with columns requiring decimal transformations
    
    Returns:
    --------
    pd.DataFrame : Transformed DataFrame with proper decimal values
    """
    # Create a copy for safe rerunning
    df_transformed = df.copy()
    
    # Define columns with implied decimal places (all have 2 decimal places per data dictionary)
    decimal_transforms = {
        'WTKG3': {
            'name': 'Weight (kg)',
            'decimal_places': 2,
            'description': 'Computed Weight in Kilograms'
        },
        'HTM4': {
            'name': 'Height (m)',
            'decimal_places': 2,
            'description': 'Computed Height in Meters'
        },
        '_BMI5': {
            'name': 'BMI',
            'decimal_places': 2,
            'description': 'Computed Body Mass Index'
        },
        '_DRNKWK3': {
            'name': 'Drinks/week',
            'decimal_places': 2,
            'description': 'Computed number of drinks per week'
        }
    }
    
    print("Applying decimal place transformations:\n")
    
    for col, config in decimal_transforms.items():
        if col in df_transformed.columns:
            # Convert to numeric, coercing non-numeric values to NaN
            numeric_vals = pd.to_numeric(df_transformed[col], errors='coerce')
            
            # Apply decimal transformation (divide by 10^decimal_places)
            divisor = 10 ** config['decimal_places']
            df_transformed[col] = numeric_vals / divisor
            
            # Report transformation
            non_null = df_transformed[col].notna().sum()
            print(f"{config['description']} ({col}):")
            print(f"  Transformed {non_null:,} values (÷{divisor})")
            print()
    
    print("✓ Decimal transformations complete")
    return df_transformed


def clean_special_codes(df):
    """
    Clean special codes and outliers in numeric variables.
    
    Handles special missing codes and unrealistic outliers in the _DRNKWK3
    (drinks per week) column.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with columns requiring special code cleaning
    
    Returns:
    --------
    pd.DataFrame : Cleaned DataFrame with special codes handled
    """
    df_cleaned = df.copy()
    
    # Clean special codes in _DRNKWK3 (Drinks per week)
    # Note: This runs AFTER decimal transformation, so 99900 is now 999.00
    if '_DRNKWK3' in df_cleaned.columns:
        print("Cleaning _DRNKWK3 special codes:\n")
        
        # Get numeric values (already divided by 100 from decimal transformation)
        drnkwk3_numeric = df_cleaned['_DRNKWK3'].copy()
        
        # Count values before cleaning
        total_count = drnkwk3_numeric.notna().sum()
        
        # Identify special codes and value ranges
        # After ÷100: 99900 → 999.00 (Don't know/Refused/Missing)
        missing_code = (drnkwk3_numeric == 999.00).sum()
        zero_drinks = (drnkwk3_numeric == 0.0).sum()  # Did not drink
        over_100 = ((drnkwk3_numeric > 100) & (drnkwk3_numeric != 999.00)).sum()  # High values to cap
        valid_range = ((drnkwk3_numeric > 0) & (drnkwk3_numeric <= 100)).sum()  # Above 0, up to and including 100
        
        print(f"  Values before cleaning:")
        print(f"    Total non-null: {total_count:,}")
        print(f"    Zero (did not drink): {zero_drinks:,}")
        print(f"    Valid range (0.01-100.00): {valid_range:,}")
        print(f"    Over 100 (will cap at 100): {over_100:,}")
        print(f"    Special code 999.00 (missing): {missing_code:,}")
        
        if over_100 > 0:
            # Show distribution of high values before capping
            high_vals = drnkwk3_numeric[(drnkwk3_numeric > 100) & (drnkwk3_numeric != 999.00)]
            print(f"\n  Distribution of values over 100 (before capping):")
            print(f"    Min: {high_vals.min():.2f}")
            print(f"    Max: {high_vals.max():.2f}")
            print(f"    Mean: {high_vals.mean():.2f}")
            print(f"    Median: {high_vals.median():.2f}")
        
        # Apply cleaning transformations:
        # 1. Map 999.00 (special missing code) to NaN
        df_cleaned.loc[drnkwk3_numeric == 999.00, '_DRNKWK3'] = np.nan
        
        # 2. Cap values over 100 at 100 (data quality issue: >14 drinks/day is unreasonable)
        df_cleaned.loc[(drnkwk3_numeric > 100) & (drnkwk3_numeric != 999.00), '_DRNKWK3'] = 100.0
        
        # Report results
        remaining = df_cleaned['_DRNKWK3'].notna().sum()
        removed = total_count - remaining
        
        print(f"\n  Cleaning applied:")
        print(f"    Mapped 999.00 → NaN: {missing_code:,} values")
        print(f"    Capped >100 → 100.0: {over_100:,} values")
        print(f"    Total removed (NaN): {removed:,} ({removed/total_count*100:.2f}%)")
        print(f"    Remaining valid values: {remaining:,}")
        
        if remaining > 0:
            print(f"    Range: {df_cleaned['_DRNKWK3'].min():.2f} - {df_cleaned['_DRNKWK3'].max():.2f}")
            print(f"    Mean: {df_cleaned['_DRNKWK3'].mean():.2f} drinks/week")
        print()
    
    print("✓ Special value cleaning complete")
    return df_cleaned


def validate_grain(df, key_cols):
    """
    Validate that the specified key columns uniquely identify each row.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to validate
    key_cols : list
        List of column names that should form a unique key
    
    Returns:
    --------
    bool : True if key is unique, False otherwise
    """
    is_unique = len(df) == len(df.drop_duplicates(subset=key_cols))
    key_str = ' and '.join([f"`{col}`" for col in key_cols])
    print(f"Every row is uniquely defined by the {key_str} columns: {is_unique}")
    return is_unique


def analyze_missing_values(df, threshold=5.0):
    """
    Analyze missing values across all columns in the DataFrame.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to analyze
    threshold : float, optional
        Percentage threshold for highlighting columns (default: 5.0)
    
    Returns:
    --------
    pd.DataFrame : Summary of missing values by column
    """
    # Calculate missing values and percentages
    missing_df = pd.DataFrame({
        'Column': df.columns,
        'Missing_Count': df.isnull().sum(),
        'Missing_Percent': (df.isnull().sum() / len(df) * 100).round(2)
    })
    
    # Sort by missing count descending
    missing_df = missing_df.sort_values('Missing_Count', ascending=False)
    
    print("Missing Value Summary:")
    print(f"Total rows: {len(df):,}\n")
    print(missing_df.to_string(index=False))
    
    # Highlight columns with significant missing data
    high_missing = missing_df[missing_df['Missing_Percent'] > threshold]
    if len(high_missing) > 0:
        print(f"\n⚠️  Columns with >{threshold}% missing data:")
        print(high_missing.to_string(index=False))
    else:
        print(f"\n✓ No columns have >{threshold}% missing data")
    
    return missing_df


def verify_data_types(df, categorical_cols, numeric_cols):
    """
    Verify that columns have the expected data types.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to verify
    categorical_cols : list
        List of columns expected to be categorical/string
    numeric_cols : list
        List of columns expected to be numeric
    
    Returns:
    --------
    pd.DataFrame : Summary of data types by column
    """
    print("Data Type Summary:\n")
    print(f"{'Column':<15} {'Current Type':<15} {'Expected Category':<20}")
    print("=" * 50)
    
    type_summary = []
    
    for col in df.columns:
        current_type = str(df[col].dtype)
        if col in categorical_cols:
            expected = "Categorical/String"
        elif col in numeric_cols:
            expected = "Numeric"
        else:
            expected = "Unknown"
        
        print(f"{col:<15} {current_type:<15} {expected:<20}")
        type_summary.append({
            'Column': col,
            'Current_Type': current_type,
            'Expected_Category': expected
        })
    
    print(f"\n✓ Data type summary complete")
    print(f"  Categorical columns: {len(categorical_cols)} (human-readable labels)")
    print(f"  Numeric columns: {len(numeric_cols)} (transformed with proper decimals)")
    
    return pd.DataFrame(type_summary)


def check_numeric_ranges(df, numeric_check_cols):
    """
    Check that numeric columns have values within expected ranges.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to check
    numeric_check_cols : dict
        Dictionary mapping column names to range specifications
        Format: {col_name: {'name': str, 'min': float, 'max': float}}
    
    Returns:
    --------
    pd.DataFrame : Summary of numeric ranges with out-of-range counts
    """
    print("Numeric Variable Range Summary:\n")
    print(f"{'Variable':<15} {'Description':<20} {'Count':<10} {'Min':<10} {'Max':<10} {'Mean':<10} {'Valid Range':<20}")
    print("=" * 105)
    
    range_summary = []
    
    for col, info in numeric_check_cols.items():
        if col in df.columns:
            # Convert to numeric, coercing errors to NaN
            numeric_vals = pd.to_numeric(df[col], errors='coerce')
            
            count = numeric_vals.notna().sum()
            min_val = numeric_vals.min() if count > 0 else None
            max_val = numeric_vals.max() if count > 0 else None
            mean_val = numeric_vals.mean() if count > 0 else None
            
            # Format values
            min_str = f"{min_val:.2f}" if min_val is not None else "N/A"
            max_str = f"{max_val:.2f}" if max_val is not None else "N/A"
            mean_str = f"{mean_val:.2f}" if mean_val is not None else "N/A"
            range_str = f"{info['min']}-{info['max']}"
            
            print(f"{col:<15} {info['name']:<20} {count:<10,} {min_str:<10} {max_str:<10} {mean_str:<10} {range_str:<20}")
            
            # Check for out-of-range values
            out_of_range = 0
            if min_val is not None and max_val is not None:
                out_of_range = ((numeric_vals < info['min']) | (numeric_vals > info['max'])).sum()
                if out_of_range > 0:
                    print(f"  ⚠️  {out_of_range:,} values outside expected range")
            
            range_summary.append({
                'Variable': col,
                'Count': count,
                'Min': min_val,
                'Max': max_val,
                'Mean': mean_val,
                'Out_of_Range': out_of_range
            })
    
    print("\n✓ Range check complete")
    return pd.DataFrame(range_summary)


def remove_out_of_range_records(df, numeric_check_cols):
    """
    Remove records where any numeric variable is outside its expected range.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to clean
    numeric_check_cols : dict
        Dictionary mapping column names to range specifications
        Format: {col_name: {'name': str, 'min': float, 'max': float}}
    
    Returns:
    --------
    pd.DataFrame : Cleaned DataFrame with out-of-range records removed
    """
    # Create a copy for safe rerunning
    df_clean = df.copy()
    
    # Count rows before removal
    rows_before = len(df_clean)
    
    # Track which records to remove (combine all out-of-range conditions)
    records_to_remove = pd.Series([False] * len(df_clean), index=df_clean.index)
    
    print("Checking numeric variables for out-of-range values:\n")
    
    for col, info in numeric_check_cols.items():
        if col in df_clean.columns:
            # Convert to numeric, coercing errors to NaN
            numeric_vals = pd.to_numeric(df_clean[col], errors='coerce')
            
            # Identify out-of-range values (excluding NaN)
            out_of_range = ((numeric_vals < info['min']) | (numeric_vals > info['max'])) & numeric_vals.notna()
            out_of_range_count = out_of_range.sum()
            
            if out_of_range_count > 0:
                print(f"  {col} ({info['name']}): {out_of_range_count:,} out-of-range values")
                print(f"    Range: {info['min']}-{info['max']}")
                
                # Add to removal mask
                records_to_remove = records_to_remove | out_of_range
            else:
                print(f"  {col} ({info['name']}): ✓ All values in range")
    
    # Count total unique records to remove
    total_to_remove = records_to_remove.sum()
    
    print(f"\nTotal unique records to remove: {total_to_remove:,}")
    
    if total_to_remove > 0:
        # Show sample of records being removed
        print(f"\nSample of records being removed (first 10):")
        sample_cols = ['_STATE', 'SEQNO'] + [col for col in numeric_check_cols.keys() if col in df_clean.columns]
        print(df_clean[records_to_remove][sample_cols].head(10).to_string(index=False))
        
        # Remove the records
        df_clean = df_clean[~records_to_remove].copy()
        
        rows_after = len(df_clean)
        rows_removed = rows_before - rows_after
        
        print(f"\n✓ Removed {rows_removed:,} records ({rows_removed/rows_before*100:.3f}%)")
        print(f"  Rows before: {rows_before:,}")
        print(f"  Rows after: {rows_after:,}")
    else:
        print(f"\n✓ No out-of-range values found in any numeric column")
    
    return df_clean


def write_data(df, output_file):
    """
    Write DataFrame to file in specified format.
    
    Supports CSV, Parquet, and Pickle formats based on file extension.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to write
    output_file : str
        Output file path (extension determines format)
    
    Returns:
    --------
    None
    """
    if not output_file:
        print("⚠️  No output file specified")
        print(f"\nDataset available in memory:")
        print(f"  Rows: {len(df):,}")
        print(f"  Columns: {len(df.columns)}")
        return
    
    # Determine file format from extension
    if output_file.endswith('.csv'):
        df.to_csv(output_file, index=False)
        print(f"✓ Saved cleaned dataset to CSV: {output_file}")
    elif output_file.endswith('.parquet'):
        df.to_parquet(output_file, index=False)
        print(f"✓ Saved cleaned dataset to Parquet: {output_file}")
    elif output_file.endswith('.pkl') or output_file.endswith('.pickle'):
        df.to_pickle(output_file)
        print(f"✓ Saved cleaned dataset to Pickle: {output_file}")
    else:
        print(f"⚠️  Unsupported file format: {output_file}")
        print(f"   Supported formats: .csv, .parquet, .pkl, .pickle")
        return
    
    # Print summary
    print(f"\nDataset summary:")
    print(f"  Rows: {len(df):,}")
    print(f"  Columns: {len(df.columns)}")
    print(f"  Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

## Download data
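
The cells in this section rely on IPython shell escapes (`!ls`, `!wget`, `!unzip`), which assume a Unix-like environment. As a rough, portable alternative, the same "download only if missing, unzip only if missing" logic can be sketched with the standard library. The helper name `ensure_extracted` and its parameters are illustrative assumptions and are not used elsewhere in this notebook.

```python
import glob
import os
import urllib.request
import zipfile


def ensure_extracted(url, zip_pattern, file_pattern, workdir="."):
    """Return the first file matching file_pattern, downloading and unzipping only if needed."""
    matches = glob.glob(os.path.join(workdir, file_pattern))
    if not matches:
        zips = glob.glob(os.path.join(workdir, zip_pattern))
        if not zips:
            # Hypothetical download step: requires network access
            zip_path = os.path.join(workdir, url.rsplit("/", 1)[-1])
            urllib.request.urlretrieve(url, zip_path)
            zips = [zip_path]
        with zipfile.ZipFile(zips[0]) as archive:
            archive.extractall(workdir)
        matches = glob.glob(os.path.join(workdir, file_pattern))
    if not matches:
        raise FileNotFoundError(f"No file matching {file_pattern!r} in {workdir!r}")
    return matches[0]
```

Under these assumptions, a call such as `ensure_extracted(raw_data_url, "LLCP2024ASC*.zip", "LLCP2024*.ASC*")` would play the role of the ASCII cell below.
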
### Raw data file (ASCII)

In [4]:
# Redirect stderr so that a failed `ls` yields an empty list instead of its error message
asc_zip = !ls LLCP2024ASC*.zip 2>/dev/null
# If ascii zip file not present, download it and get zip file name
if not asc_zip:
    !wget {raw_data_url}
    asc_zip = !ls LLCP2024ASC*.zip 2>/dev/null

asc_file = !ls LLCP2024*.ASC* 2>/dev/null
# If raw ascii file not present, unzip ascii zip file and get raw ascii file name
if not asc_file:
    !unzip {asc_zip[0]}
    asc_file = !ls LLCP2024*.ASC* 2>/dev/null
asc_file = asc_file[0]

### Data dictionary file (HTML)

In [5]:
# Redirect stderr so that a failed `ls` yields an empty list instead of its error message
dict_zip = !ls codebook24_llcp*.zip 2>/dev/null
# If html zip file not present, download it and get html zip file name
if not dict_zip:
    !wget {data_dict_url}
    dict_zip = !ls codebook24_llcp*.zip 2>/dev/null

html_file = !ls USCODE24_LLCP*.HTML 2>/dev/null
# If raw html file not present, unzip html zip file and get raw html file name
if not html_file:
    !unzip {dict_zip[0]}
    html_file = !ls USCODE24_LLCP*.HTML 2>/dev/null
html_file = html_file[0]

## Parse data
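
`parse_data_file` turns each codebook column range, written as a 1-based inclusive string such as "1-2" or "149", into the 0-based half-open `(start, end)` pair that `pd.read_fwf` expects. The sketch below shows that conversion in isolation. The helper `to_colspec`, the column name `ANSWER`, and the two sample records are made up for illustration and do not come from the BRFSS files.

```python
import io

import pandas as pd


def to_colspec(column_range):
    """Convert a 1-based inclusive range ("N" or "N-M") to a 0-based half-open tuple."""
    if "-" in column_range:
        start, end = map(int, column_range.split("-"))
    else:
        start = end = int(column_range)
    return (start - 1, end)


# Made-up fixed-width records: columns 1-2 hold a state code, column 3 a one-digit answer
sample = io.StringIO("011\n482\n")
colspecs = [to_colspec("1-2"), to_colspec("3")]

# Reading as strings keeps leading zeros such as "01"
demo_df = pd.read_fwf(
    sample, colspecs=colspecs, names=["_STATE", "ANSWER"], header=None, dtype=str
)
print(demo_df)
```

Reading every column as a string first, as the notebook does, keeps zero-padded codes intact so they can be matched against the zero-padded codebook keys.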

In [None]:
data_df = parse_data_file(html_file, asc_file, columns_to_extract)

## Transform
Apply transformations to convert raw values to their proper formats with correct decimal places and units.

### Decimal Transformations
Apply decimal place transformations to numeric variables with implied decimal places.
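
As a small illustration of what "implied decimal places" means, the sketch below converts a few made-up raw strings the way `apply_decimal_transformations` does: coerce to numeric, then divide by 10 raised to the number of implied decimals. The sample values are assumptions chosen for the example.

```python
import pandas as pd

# Made-up raw strings as they might be read from the fixed-width file
raw_weights = pd.Series(["08165", "07257", None, "BLANK"])

# Two implied decimal places: "08165" represents 81.65 kg
decimal_places = 2
weights_kg = pd.to_numeric(raw_weights, errors="coerce") / 10 ** decimal_places

# Non-numeric entries become NaN instead of raising an error
print(weights_kg.tolist())
```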

In [None]:
transform_df = apply_decimal_transformations(data_df)

### Special Code Cleaning
Clean special codes and outliers in numeric variables.
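
The order of the two cleaning steps matters: if values were capped at 100 first, the 999.00 missing-data code would be turned into a valid-looking 100. A compact sketch of the same idea, using `Series.mask` and `Series.clip` rather than the `.loc` assignments in `clean_special_codes`, might look like this (the sample values are made up):

```python
import pandas as pd

# Made-up drinks-per-week values after the division by 100
drinks = pd.Series([0.0, 7.0, 150.0, 999.0])

# Replace the 999.00 sentinel with NaN first, then cap what remains at 100
cleaned = drinks.mask(drinks == 999.0).clip(upper=100.0)
print(cleaned.tolist())
```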

In [None]:
transform_df = clean_special_codes(transform_df)

## Validation
### Grain Check
Confirm that the `_STATE` and `SEQNO` columns together uniquely identify each row of the dataset.
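
If the check ever fails, it helps to see which rows collide. A minimal sketch on made-up data, assuming the same two key columns:

```python
import pandas as pd

demo = pd.DataFrame({
    "_STATE": ["01", "01", "02", "02"],
    "SEQNO": ["1", "2", "1", "1"],  # the last two rows share the same key
})

key_cols = ["_STATE", "SEQNO"]
# keep=False flags every member of a duplicated group, not only the repeats
dupes = demo[demo.duplicated(subset=key_cols, keep=False)]
print(dupes)
```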

In [None]:
validate_grain(transform_df, ['_STATE', 'SEQNO'])

### Missing Value Analysis
Check for missing values across all columns to identify data completeness issues.
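
The per-column percentage reported by `analyze_missing_values` can also be computed in one step with `isna().mean()`, as in the sketch below on made-up data. Note that responses labeled "Unknown" or "Refused" are ordinary strings after value labeling, so they are not counted as missing here.

```python
import pandas as pd

demo = pd.DataFrame({
    "GENHLTH": ["Good", None, "Refused", "Good"],
    "_BMI5": [27.5, None, None, 31.2],
})

# isna().mean() is the fraction of missing values per column
missing_pct = (demo.isna().mean() * 100).round(2)
print(missing_pct)
```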

In [None]:
analyze_missing_values(transform_df)

### Data Type Verification
Verify that columns have the expected data types after parsing and value label mapping.
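
Because every column is initially read as a string, a column expected to be numeric keeps a string dtype until it is explicitly converted. A quick programmatic check, sketched on made-up data with `pandas.api.types.is_numeric_dtype`, could flag such columns:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

demo = pd.DataFrame({
    "GENHLTH": ["Good", "Fair"],
    "_BMI5": [27.5, 31.2],
    "SEQNO": ["0001", "0002"],  # still stored as strings
})

expected_numeric = ["_BMI5", "SEQNO"]
# Columns expected to be numeric whose dtype is not numeric yet
mismatches = [col for col in expected_numeric if not is_numeric_dtype(demo[col])]
print(mismatches)
```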

In [None]:
# Define expected types for different column categories
categorical_cols = ['_STATE', 'DIABETE4', 'PREDIAB2', 'DIABTYPE', '_URBSTAT', '_AGEG5YR', 
                    'SEXVAR', '_RACE', 'EDUCA', 'INCOME3', 'PERSDOC3', 'MEDCOST1', 
                    'EXERANY2', '_SMOKER3', 'DRNKANY6', '_RFDRHV9', 'GENHLTH', 'CVDINFR4',
                    'CVDCRHD4', 'CVDSTRK3', 'CHCKDNY2', 'ASTHMA3', 'ADDEPEV3', 'HAVARTH4']

numeric_cols = ['SEQNO', 'WTKG3', 'HTM4', '_BMI5', 'SSBSUGR2', 'SSBFRUT3', '_DRNKWK3']

verify_data_types(transform_df, categorical_cols, numeric_cols)

### Range Checks for Numeric Variables
Verify that numeric columns have reasonable values within expected ranges.
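
For a single column, the out-of-range test boils down to an inclusive interval check that ignores missing values. A minimal sketch, using the height bounds from the cell below on made-up values:

```python
import pandas as pd

heights_m = pd.Series([1.78, 0.50, 2.60, None])
lower, upper = 0.91, 2.44

# between() is inclusive on both ends by default and returns False for NaN,
# so missing values are excluded explicitly before flagging
out_of_range = heights_m.notna() & ~heights_m.between(lower, upper)
print(out_of_range.tolist())
```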

In [None]:
# Define numeric columns to check
numeric_check_cols = {
    'WTKG3': {'name': 'Weight (kg)', 'min': 23.00, 'max': 295.00},
    'HTM4': {'name': 'Height (m)', 'min': 0.91, 'max': 2.44},
    '_BMI5': {'name': 'BMI', 'min': 0.01, 'max': 99.99},
    'SSBSUGR2': {'name': 'Sugar soda freq', 'min': 0, 'max': 999},
    'SSBFRUT3': {'name': 'Sugar drink freq', 'min': 0, 'max': 999},
    '_DRNKWK3': {'name': 'Drinks/week', 'min': 0.0, 'max': 100.0}  # After ÷100 and capping at 100
}

check_numeric_ranges(transform_df, numeric_check_cols)

## Data Removal
Remove records with data quality issues identified during validation.

### Remove Out-of-Range Numeric Records
Remove records where any numeric variable is outside its expected range as specified in the data dictionary.
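
The removal combines the per-column flags with a logical OR, so a record is dropped if any one of its numeric values is out of range, while records with missing values are kept. A small sketch of that mask logic on made-up data (the `ranges` dictionary here is a simplified stand-in for `numeric_check_cols`):

```python
import pandas as pd

demo = pd.DataFrame({
    "WTKG3": [81.65, 70.00, 72.57],
    "HTM4": [1.78, None, 3.00],
})
ranges = {"WTKG3": (23.00, 295.00), "HTM4": (0.91, 2.44)}

# Start with "drop nothing" and OR in each column's out-of-range flags
drop_mask = pd.Series(False, index=demo.index)
for col, (lower, upper) in ranges.items():
    values = demo[col]
    drop_mask |= values.notna() & ~values.between(lower, upper)

kept = demo[~drop_mask]
print(kept)
```

Here the second row survives despite its missing height, and only the third row (height 3.00 m) is dropped.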

In [None]:
# Use the same range definitions from the validation section
numeric_check_cols = {
    'WTKG3': {'name': 'Weight (kg)', 'min': 23.00, 'max': 295.00},
    'HTM4': {'name': 'Height (m)', 'min': 0.91, 'max': 2.44},
    '_BMI5': {'name': 'BMI', 'min': 0.01, 'max': 99.99},
    'SSBSUGR2': {'name': 'Sugar soda freq', 'min': 0, 'max': 999},
    'SSBFRUT3': {'name': 'Sugar drink freq', 'min': 0, 'max': 999},
    '_DRNKWK3': {'name': 'Drinks/week', 'min': 0.0, 'max': 100.0}  # After ÷100 and capping at 100
}

clean_df = remove_out_of_range_records(transform_df, numeric_check_cols)

## Write Data
Save the cleaned and transformed dataset to file for use in EDA and modeling.
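
One reason a pickle file is convenient here is that it preserves dtypes, so string codes keep their leading zeros when the next notebook reads the file back, whereas a CSV round trip would need explicit dtypes to achieve the same. A minimal round-trip sketch on made-up data, written to a temporary directory:

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({"_STATE": ["01", "48"], "_BMI5": [27.5, 31.2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pickle")
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame has the same values and dtypes as the original
print(restored.dtypes)
```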

In [None]:
write_data(clean_df, output_file)