# Preparing Data for Supervised Learning

This notebook creates a version of `merged_clean.csv` specifically prepared for supervised modeling. The output file is saved as `merged_clean_sl.csv` in the `data/clean/` directory.

Steps performed:

- Recoded yes/no fields into binary values (1 = yes, 0 = no or blank)
- Converted insurance coverage variables to binary flags (1 = any recorded value, 0 = blank)
- Dropped columns that were completely missing
- Recoded gender as a binary variable (0 = male, 1 = female)
- Converted relevant columns from float to integer (e.g., binary, ordinal, and count-based fields like age and household size)
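
The steps above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the column names (`has_insurance`, `medicare`, `gender`, `age`) and values are hypothetical and do not come from `merged_clean.csv`.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names; 1 = yes, 2 = no, NaN = blank.
toy = pd.DataFrame({
    "has_insurance": [1.0, 2.0, np.nan],
    "medicare": [3.0, np.nan, np.nan],
    "gender": [1.0, 2.0, 2.0],
    "age": [34.0, 51.0, 67.0],
})

# Yes/no field: 1 -> 1, 2 -> 0; blanks (and any other code) become NaN.
toy["has_insurance"] = toy["has_insurance"].map({1: 1, 2: 0})

# Coverage flag: any recorded value counts as covered, blank as not covered.
toy["medicare"] = toy["medicare"].notna().astype(int)

# Gender: 1 (male) -> 0, 2 (female) -> 1.
toy["gender"] = toy["gender"].map({1: 0, 2: 1})

# Fill the remaining blanks with 0, then cast the float columns to integers.
toy = toy.fillna(0).astype(int)
print(toy)
```

Note that the final `fillna(0)` makes a blank yes/no answer indistinguishable from an explicit "no".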

This version avoids one-hot encoding: categorical and ordinal variables keep their integer codes, which keeps the column count small and the variables easy to interpret when training classifiers or regressors.
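
To make the trade-off concrete, here is a minimal sketch comparing integer codes with one-hot encoding. The `education_level` column and its values are hypothetical stand-ins, not taken from the dataset.

```python
import pandas as pd

# Hypothetical ordinal field stored as integer codes.
toy = pd.DataFrame({"education_level": [1, 3, 5, 3]})

# Integer codes: a single column that preserves the ordering of the levels.
as_codes = toy.copy()

# One-hot encoding: one indicator column per observed level.
as_dummies = pd.get_dummies(toy["education_level"], prefix="education_level")

print(as_codes.shape[1])    # 1
print(as_dummies.shape[1])  # 3
```

Keeping the codes leaves one column per variable. The cost is that a model will treat the codes as numeric, which suits ordinal fields better than unordered ones.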


In [2]:
import pandas as pd
import os

# Load cleaned dataset
df = pd.read_csv("../data/clean/merged_clean.csv")

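# Recode yes/no variables: 1 = yes -> 1, 2 = no -> 0.
# Blanks (and any other codes) become NaN here and are set to 0 by the
# integer conversion further down.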
binary_vars = [
    'Covered by health insurance',
    'Time when no insurance in past year?',
    'Routine place to go for healthcare',
    'Past 12 months had video conf w/Dr?',
    'Seen mental health professional/past yr'
]

for col in binary_vars:
    df[col] = df[col].map({1: 1, 2: 0})

# Recode insurance type variables: any recorded value = 1, blank = 0
insurance_cols = [
    'Covered by private insurance',
    'Covered by Medicare',
    'Covered by Medi-Gap',
    'Covered by Medicaid',
    'Covered by CHIP',
    'Covered by military health care',
    'Covered by state-sponsored health plan',
    'Covered by other government insurance'
]

for col in insurance_cols:
    df[col] = df[col].notna().astype(int)

# Recode gender: 1 = male becomes 0, 2 = female becomes 1
df['Gender'] = df['Gender'].map({1: 0, 2: 1})

# Drop fully missing or unused columns
df = df.drop(columns=['Covered by CHIP', 'Covered by other government insurance'], errors='ignore')

# Convert selected columns from float to int
columns_to_convert = binary_vars + insurance_cols + [
    'Gender',
    'Education level - Adults 20+',
    'Difficulty these problems have caused',
    'Difficulty with self-care',
    'How often feel worried/nervous/anxious',
    'Level of feeling worried/nervous/anxious',
    'Type place most often go for healthcare',
    'Age in years at screening',
    'Total number of people in the Household'
]

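# Note: fillna(0) codes every remaining blank as 0, so missing values become
# indistinguishable from "no" in the binary fields and from "male" in Gender.
for col in columns_to_convert:
    if col in df.columns:  # skips the columns dropped above
        df[col] = df[col].fillna(0).astype(int)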

# Save output
output_path = "../data/clean/merged_clean_sl.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df.to_csv(output_path, index=False)

print(f"{output_path} saved successfully.")


../data/clean/merged_clean_sl.csv saved successfully.
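
A quick sanity check after saving could confirm that the recoded fields really contain only 0 and 1. The sketch below uses a hypothetical helper, `find_non_binary`, and a toy frame in place of the saved file.

```python
import pandas as pd


def find_non_binary(df, cols):
    """Return the columns in `cols` that hold values other than 0 and 1.

    Hypothetical helper for illustration; columns absent from `df` are skipped.
    """
    return [c for c in cols if c in df.columns and not df[c].isin([0, 1]).all()]


# Toy stand-in for the saved file, with one deliberately bad column.
toy = pd.DataFrame({"flag_ok": [0, 1, 1], "flag_bad": [0, 2, 1]})
print(find_non_binary(toy, ["flag_ok", "flag_bad", "not_present"]))  # ['flag_bad']
```

On the real output, the same helper could be called as `find_non_binary(pd.read_csv(output_path), binary_vars + insurance_cols + ['Gender'])`. An empty list would indicate that every surviving binary column is clean.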
