# Predicting Individual Health Outcomes from Dietary Patterns Across Diverse Populations

**Using Machine Learning to Forecast Obesity Risk Based on Food Consumption and Lifestyle Factors**

---

**Team Members:**
- Jenner Dulce (jdulce@sandiego.edu)
- Gable Krich (gkrich@sandiego.edu)
- Francis Ortiz (fortiz@sandiego.edu)
- Christian Webb (christianw@sandiego.edu)

**Course:** COMP 352 - Data Science  
**Professor:** Daniel Matlock  
**Date:** November 24, 2025

---

## Dataset Information

**Data Source:** NHANES 2013-2018 (Cycles H, I, J)  
**Official Source:** https://www.cdc.gov/nchs/nhanes/  
**Format:** XPT files (SAS transport format) parsed with pandas

**Survey Cycles:**
- **Cycle H:** 2013-2014
- **Cycle I:** 2015-2016
- **Cycle J:** 2017-2018

**Files Per Cycle (7 per cycle, 21 XPT files in total):**
1. DEMO - Demographics (age, gender, race, education, income)
2. DR1TOT - Dietary intake (24-hour recall)
3. BMX - Body measurements (for target variable only)
4. PAQ - Physical activity levels
5. SMQ - Smoking behaviors
6. SLQ - Sleep patterns
7. ALQ - Alcohol consumption

---

## Project Overview

**Research Question:**  
Can we predict individual obesity risk (BMI ≥ 30) from dietary intake patterns, physical activity, and lifestyle factors, WITHOUT using body measurements or laboratory values?

**Why This Matters:**
- Enables early intervention BEFORE weight gain occurs
- Identifies modifiable risk factors
- Provides actionable insights for prevention
- Tests whether lifestyle data alone contains a useful predictive signal

**Critical Scientific Approach:**  
We explicitly EXCLUDE all physical measurements (BMI, weight, height, waist circumference) and laboratory biomarkers from our features to avoid data leakage. We predict obesity from its CAUSES (diet, activity), not its MEASUREMENTS.

---

# SECTION 1: Data Importing and Pre-processing (100 Points)

---

## 1.1 Import Required Packages

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import os

# Notebook display helper (used later to render DataFrames)
from IPython.display import display

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning - preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Machine learning - models
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
try:
    from xgboost import XGBClassifier
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False
    print("XGBoost not available, will use other models")

# Machine learning - evaluation
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix,
                             classification_report, roc_curve)

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("✓ All packages imported successfully!")
print(f"✓ Pandas version: {pd.__version__}")
print(f"✓ NumPy version: {np.__version__}")
print("✓ Random seed set to: 42")

## 1.2 Import and Parse NHANES XPT Files

We will import **21 XPT files** (SAS transport format) from three NHANES survey cycles:
- **Cycle H (2013-2014):** 7 files
- **Cycle I (2015-2016):** 7 files
- **Cycle J (2017-2018):** 7 files

### Import Method: pd.read_sas() for XPT Format

Each cycle contains:
1. **DEMO** - Demographics
2. **DR1TOT** - Dietary intake (24-hour recall)
3. **BMX** - Body measurements
4. **PAQ** - Physical activity
5. **SMQ** - Smoking
6. **SLQ** - Sleep
7. **ALQ** - Alcohol

In [None]:
# Define data directory path - ADJUST THIS to your file location
data_dir = 'data_files/'  # Change to where you saved the XPT files

# Define the three cycles
cycles = {
    'H': '2013-2014',
    'I': '2015-2016',
    'J': '2017-2018'
}

# Define file types
file_types = ['DEMO', 'DR1TOT', 'BMX', 'PAQ', 'SMQ', 'SLQ', 'ALQ']

print("="*80)
print("IMPORTING NHANES XPT FILES - Parsing SAS Transport Format")
print("="*80)
print(f"\nData directory: {data_dir}")
print(f"Cycles to import: {list(cycles.keys())}")
print(f"Files per cycle: {len(file_types)}")
print(f"Total files to process: {len(cycles) * len(file_types)}")
print("="*80)
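
Before parsing, it can help to confirm that every expected file is actually on disk. The sketch below assumes the `{FILE}_{CYCLE}.xpt` naming used in this notebook. The helper name `find_missing_files` is hypothetical, and the demo runs in a temporary directory with an empty placeholder file rather than real NHANES data.

```python
import os
import tempfile


def find_missing_files(data_dir, cycles, file_types):
    """Return the expected XPT file names that are not present in data_dir."""
    expected = [f"{ft}_{cycle}.xpt" for cycle in cycles for ft in file_types]
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]


# Demo: a temporary directory holding one empty placeholder file
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "DEMO_H.xpt"), "wb").close()
    missing = find_missing_files(tmp, ["H"], ["DEMO", "BMX"])

print(missing)
```

In the notebook, the same helper could be called as `find_missing_files(data_dir, cycles, file_types)` to list gaps before any parsing starts.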

### Step 1: Parse Each XPT File Using pd.read_sas()

This step reads each XPT file from disk and parses it into a DataFrame.

In [None]:
# Dictionary to store all loaded dataframes
data_dict = {}

# Import each cycle
for cycle, years in cycles.items():
    print(f"\n{'='*80}")
    print(f"PARSING CYCLE {cycle} ({years})")
    print('='*80)

    cycle_data = {}

    for file_type in file_types:
        filename = f"{file_type}_{cycle}.xpt"
        filepath = os.path.join(data_dir, filename)

        try:
            # Parse XPT file with pandas
            if file_type == 'DR1TOT':
                # Dietary files sometimes need encoding specification
                df = pd.read_sas(filepath, encoding='latin1')
            else:
                df = pd.read_sas(filepath)

            cycle_data[file_type] = df
            print(f"✓ {filename:20s} | {len(df):>6,} rows | {len(df.columns):>3} columns | Parsed successfully")

        except FileNotFoundError:
            print(f"⚠ {filename:20s} | FILE NOT FOUND - Will skip this file")
            cycle_data[file_type] = None
        except Exception as e:
            print(f"✗ {filename:20s} | ERROR: {str(e)[:50]}")
            cycle_data[file_type] = None

    data_dict[cycle] = cycle_data

print("\n" + "="*80)
print("XPT FILE PARSING COMPLETE")
print("="*80)

### Step 2: Examine Parsed Data Structure

Let's look at the actual NHANES variable names (they use codes, not readable names).

In [None]:
# Show example of raw NHANES data structure
print("\n" + "="*80)
print("EXAMPLE: RAW NHANES DATA STRUCTURE (Cycle H Demographics)")
print("="*80)

if data_dict['H']['DEMO'] is not None:
    demo_h = data_dict['H']['DEMO']
    print(f"\nShape: {demo_h.shape}")
    print("\nFirst 5 rows:")
    print(demo_h.head())

    print("\nColumn names (NHANES codes):")
    for i, col in enumerate(demo_h.columns[:20], 1):
        print(f"  {i:2d}. {col}")
    if len(demo_h.columns) > 20:
        print(f"  ... and {len(demo_h.columns) - 20} more columns")

    print("\nKey NHANES variable codes:")
    print("  SEQN      = Participant sequence number (unique ID)")
    print("  RIDAGEYR  = Age in years")
    print("  RIAGENDR  = Gender (1=Male, 2=Female)")
    print("  RIDRETH3  = Race/ethnicity")
    print("  DMDEDUC2  = Education level")
    print("  INDHHIN2  = Annual household income")
else:
    print("\n⚠ Demographics file not loaded")

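## 1.3 Merge Files Within Each Cycle

Merge all 7 files for each cycle on SEQN (participant ID).

**Critical:** We use an **inner join**, which keeps only participants who appear in ALL files. A participant kept by the join can still have missing values in individual columns; those are handled in Section 1.8.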

In [None]:
# Merge files within each cycle
merged_cycles = {}

for cycle, years in cycles.items():
    print(f"\n{'='*80}")
    print(f"MERGING CYCLE {cycle} ({years}) FILES")
    print('='*80)

    cycle_data = data_dict[cycle]

    # Start with demographics (should have most participants)
    if cycle_data['DEMO'] is not None:
        merged = cycle_data['DEMO'].copy()
        print(f"\nStarting with DEMO: {len(merged):,} participants")

        # Merge each additional file
        for file_type in ['DR1TOT', 'BMX', 'PAQ', 'SMQ', 'SLQ', 'ALQ']:
            if cycle_data[file_type] is not None:
                before_count = len(merged)
                merged = merged.merge(cycle_data[file_type], on='SEQN', how='inner')
                after_count = len(merged)
                lost = before_count - after_count
                print(f"After merging {file_type:7s}: {after_count:>6,} participants (lost {lost:>5,})")
            else:
                print(f"⚠ Skipping {file_type:7s}: File not available")

        merged_cycles[cycle] = merged
        print(f"\n✓ Cycle {cycle} complete: {len(merged):,} participants present in all files")
        print(f"  Total columns: {len(merged.columns)}")
    else:
        print(f"\n✗ Cycle {cycle}: Demographics file not available, cannot merge")
        merged_cycles[cycle] = None

print("\n" + "="*80)
print("WITHIN-CYCLE MERGING COMPLETE")
print("="*80)
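
To see what the inner join on SEQN does and does not guarantee, here is a minimal sketch on made-up toy frames (the values are illustrative, not NHANES data). It uses two standard `DataFrame.merge` options: `indicator=True` to audit which participants are missing from a file, and `validate="one_to_one"` to fail loudly if SEQN is duplicated.

```python
import pandas as pd

# Toy stand-ins for two NHANES files (values are made up)
demo = pd.DataFrame({"SEQN": [1, 2, 3, 4], "RIDAGEYR": [34, 51, 8, 67]})
diet = pd.DataFrame({"SEQN": [1, 2, 4], "DR1TKCAL": [2100.0, None, 1850.0]})

# An outer merge with indicator=True labels where each participant was found
audit = demo.merge(diet, on="SEQN", how="outer", indicator=True)
only_in_demo = audit.loc[audit["_merge"] == "left_only", "SEQN"].tolist()

# The inner join drops participant 3; validate raises if SEQN repeats
inner = demo.merge(diet, on="SEQN", how="inner", validate="one_to_one")

print("Dropped by inner join:", only_in_demo)
print("Rows kept:", len(inner))
print("Kept rows still missing calories:", int(inner["DR1TKCAL"].isna().sum()))
```

Participant 2 survives the join even though the calorie value is missing, which is why a separate missing-data step is still needed after merging.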

## 1.4 Combine All Cycles

Concatenate the three cycles into one dataset.

In [None]:
# Combine all cycles
print("\n" + "="*80)
print("COMBINING ALL CYCLES")
print("="*80)

# Collect all valid cycle dataframes
valid_cycles = []
for cycle, years in cycles.items():
    if merged_cycles[cycle] is not None:
        # Add cycle identifier columns
        merged_cycles[cycle]['CYCLE'] = cycle
        merged_cycles[cycle]['SURVEY_YEARS'] = years
        valid_cycles.append(merged_cycles[cycle])
        print(f"✓ Cycle {cycle} ({years}): {len(merged_cycles[cycle]):,} participants")

if valid_cycles:
    # Concatenate all cycles
    df_all = pd.concat(valid_cycles, ignore_index=True)

    print(f"\n{'='*80}")
    print("✓ COMBINED DATASET CREATED")
    print(f"{'='*80}")
    print(f"  Total participants: {len(df_all):,}")
    print(f"  Total columns: {len(df_all.columns)}")
    print(f"  Cycles included: {df_all['CYCLE'].unique()}")
    print(f"  Date range: {df_all['SURVEY_YEARS'].unique()}")
else:
    print("\n✗ No valid cycles to combine!")
    df_all = pd.DataFrame()  # Empty dataframe
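
After concatenation, a quick sanity check is that no participant ID appears twice. The sketch below uses toy frames with illustrative SEQN values. It assumes SEQN is meant to be unique across cycles, which should be confirmed against the NHANES documentation.

```python
import pandas as pd

# Toy per-cycle frames (SEQN values are illustrative only)
cycle_h = pd.DataFrame({"SEQN": [73557, 73558], "CYCLE": "H"})
cycle_i = pd.DataFrame({"SEQN": [83732, 83733], "CYCLE": "I"})

combined = pd.concat([cycle_h, cycle_i], ignore_index=True)

# Count repeated participant IDs and rows per cycle
n_duplicates = int(combined["SEQN"].duplicated().sum())
per_cycle = combined["CYCLE"].value_counts().to_dict()

print("Duplicate SEQN values:", n_duplicates)
print("Rows per cycle:", per_cycle)
```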

## 1.5 Extract and Rename Key Variables

NHANES uses cryptic codes. Let's extract the variables we need and give them human-readable names.

In [None]:
print("\n" + "="*80)
print("EXTRACTING AND RENAMING VARIABLES")
print("="*80)

# Define variable mappings (NHANES code -> Readable name)
variable_mappings = {
    # Identifier
    'SEQN': 'SEQN',

    # Demographics
    'RIDAGEYR': 'Age',
    'RIAGENDR': 'Gender',
    'RIDRETH3': 'Race_Ethnicity',
    'DMDEDUC2': 'Education_Level',
    'INDHHIN2': 'Income_Bracket',
    'DMDMARTL': 'Marital_Status',

    # Dietary Intake (24-hour recall)
    'DR1TKCAL': 'Total_Calories_kcal',
    'DR1TPROT': 'Protein_g',
    'DR1TCARB': 'Carbohydrates_g',
    'DR1TTFAT': 'Fat_g',
    'DR1TSFAT': 'Saturated_Fat_g',
    'DR1TFIBE': 'Fiber_g',
    'DR1TSODI': 'Sodium_mg',
    'DR1TCHOL': 'Cholesterol_mg',
    'DR1TSUGR': 'Sugar_g',

    # Body Measurements (for target only!)
    'BMXBMI': 'BMI',
    'BMXWT': 'Weight_kg',
    'BMXHT': 'Height_cm',

    # Physical Activity (example variables - may need adjustment)
    # NOTE: verify these codes against the PAQ codebook for each cycle.
    # In the 2013-2018 documentation PAD680 appears to be minutes of
    # sedentary activity, not vigorous activity, and PAD320 may not be
    # present in these cycles.
    'PAD680': 'Vigorous_Activity_Min_Per_Week',
    'PAD320': 'Moderate_Activity_Min_Per_Week',

    # Smoking
    'SMQ020': 'Ever_Smoked_100_Cigarettes',
    'SMQ040': 'Current_Smoking_Status',

    # Sleep
    'SLD012': 'Sleep_Hours_Per_Night',

    # Alcohol
    'ALQ130': 'Avg_Alcohol_Drinks_Per_Day'
}

# Extract only the columns we need (that exist)
available_cols = []
missing_cols = []

for nhanes_code in variable_mappings:
    if nhanes_code in df_all.columns:
        available_cols.append(nhanes_code)
    else:
        missing_cols.append(nhanes_code)

print(f"\n✓ Available variables: {len(available_cols)}/{len(variable_mappings)}")
print(f"⚠ Missing variables: {len(missing_cols)}")

if missing_cols:
    print("\nMissing NHANES codes (may need different variable names):")
    for code in missing_cols[:10]:
        print(f"  - {code} ({variable_mappings[code]})")
    if len(missing_cols) > 10:
        print(f"  ... and {len(missing_cols) - 10} more")

# Create simplified dataset with renamed columns
df_selected = df_all[available_cols + ['CYCLE', 'SURVEY_YEARS']].copy()
df_renamed = df_selected.rename(columns=variable_mappings)

print("\n✓ Dataset with renamed variables created")
print(f"  Shape: {df_renamed.shape}")
print("\nRenamed columns:")
for i, col in enumerate(df_renamed.columns, 1):
    print(f"  {i:2d}. {col}")

### Dataset Characteristics

Now let's examine the renamed dataset.

In [None]:
print("\n" + "="*80)
print("DATASET CHARACTERISTICS")
print("="*80)

print("\n📊 Dimensions:")
print(f"   Total Records: {df_renamed.shape[0]:,} participants")
print(f"   Total Features: {df_renamed.shape[1]} variables")

print("\n📁 File Format: XPT (SAS Transport Format)")
print("   Parsed with: pd.read_sas()")

print("\n📅 Survey Cycles Included:")
for cycle in df_renamed['CYCLE'].unique():
    count = (df_renamed['CYCLE'] == cycle).sum()
    years = df_renamed[df_renamed['CYCLE'] == cycle]['SURVEY_YEARS'].iloc[0]
    print(f"   Cycle {cycle} ({years}): {count:,} participants")

print("\n📋 Data Types:")
print(df_renamed.dtypes.value_counts())

print("\n💾 Memory Usage:")
print(f"   {df_renamed.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\n🔍 First 5 Rows:")
display(df_renamed.head())

print("\n📈 Statistical Summary:")
display(df_renamed.describe())

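## 1.6 Filter for Adults (18+) and Create Target Variable

**Important:** We only want adults for obesity prediction. The BMI ≥ 30 cutoff is an adult definition; obesity in children is assessed with age- and sex-specific BMI percentiles instead.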

In [None]:
print("\n" + "="*80)
print("FILTERING FOR ADULTS AND CREATING TARGET VARIABLE")
print("="*80)

# Filter for adults (18+)
df_adults = df_renamed[df_renamed['Age'] >= 18].copy()
print("\n✓ Filtered for adults (Age ≥ 18)")
print(f"  Before: {len(df_renamed):,} participants")
print(f"  After: {len(df_adults):,} adults")
print(f"  Removed: {len(df_renamed) - len(df_adults):,} participants under 18")

# Remove rows with missing BMI first (can't create target).
# A missing BMI compared with >= 30 evaluates to False, so it would
# otherwise be labeled 0 (not obese) before being dropped.
df_adults = df_adults[df_adults['BMI'].notna()].copy()
print("\n✓ Removed participants with missing BMI")
print(f"  Final dataset: {len(df_adults):,} adults with BMI data")

# Create binary obesity target variable
# Obesity defined as BMI ≥ 30
df_adults['Obesity_Status'] = (df_adults['BMI'] >= 30).astype(int)
print("\n✓ Created target variable: Obesity_Status")
print("  Definition: BMI ≥ 30 = Obese (1), BMI < 30 = Not Obese (0)")

# Check obesity prevalence
obesity_count = df_adults['Obesity_Status'].sum()
obesity_pct = obesity_count / len(df_adults) * 100
print("\n📊 Obesity Prevalence:")
print(f"  Obese (BMI ≥ 30): {obesity_count:,} participants ({obesity_pct:.1f}%)")
print(f"  Not Obese (BMI < 30): {len(df_adults) - obesity_count:,} participants ({100-obesity_pct:.1f}%)")

# Now assign to main dataframe
df = df_adults.copy()
print(f"\n✓ Final working dataset: {len(df):,} adults")
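
The target is built by comparing BMI with 30, and a missing BMI compares as `False`. A small sketch on made-up values shows why rows without BMI must be excluded rather than silently labeled 0:

```python
import numpy as np
import pandas as pd

# Made-up BMI values, one of them missing
bmi = pd.Series([22.5, 31.0, np.nan, 30.0])

# Naive labeling: NaN >= 30 is False, so the missing BMI becomes class 0
naive = (bmi >= 30).astype(int)

# Safer: drop missing BMI, then apply the BMI >= 30 definition
valid = bmi.dropna()
target = (valid >= 30).astype(int)

print("Naive labels:", naive.tolist())
print("Labels after dropping missing BMI:", target.tolist())
```

Note that a BMI of exactly 30.0 is labeled obese, matching the ≥ 30 definition used throughout.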

## 1.7 CRITICAL: Remove Data Leakage Variables

**MOST IMPORTANT STEP:** Remove BMI, weight, and height from the features!

We can ONLY use these to create the target variable. Because BMI is computed from weight and height, and the target is a threshold on BMI, using any of them as features would be circular reasoning.

In [None]:
print("\n" + "="*80)
print("REMOVING DATA LEAKAGE VARIABLES")
print("="*80)

# Variables that would cause data leakage
leakage_vars = ['BMI', 'Weight_kg', 'Height_cm']

print("\n⚠️  Removing variables that would cause data leakage:")
for var in leakage_vars:
    if var in df.columns:
        print(f"  ✗ {var}")
        df = df.drop(var, axis=1)
    else:
        print(f"  - {var} (not in dataset)")

print("\n✓ Data leakage variables removed")
print("✓ We will predict Obesity_Status from CAUSES (diet, activity), not MEASUREMENTS")

# Minus the 4 non-feature columns: SEQN, CYCLE, SURVEY_YEARS, Obesity_Status
print(f"\n✓ Remaining features: {len(df.columns) - 4}")

# Verify no leakage
leakage_keywords = ['bmi', 'weight', 'height', 'waist']
potential_leakage = []
for col in df.columns:
    if any(keyword in col.lower() for keyword in leakage_keywords):
        if col != 'Obesity_Status':  # Target is okay
            potential_leakage.append(col)

if potential_leakage:
    print("\n⚠️  WARNING: Potential data leakage detected:")
    for col in potential_leakage:
        print(f"  - {col}")
else:
    print("\n✓ NO DATA LEAKAGE DETECTED - Safe to proceed!")
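
The keyword check above can also be packaged as a small reusable function, so it can be re-run after later steps such as feature engineering. This is a sketch only: `find_leakage_columns` is a hypothetical helper, and the column list is made up for the demo.

```python
def find_leakage_columns(columns,
                         keywords=("bmi", "weight", "height", "waist"),
                         allowed=("Obesity_Status",)):
    """Return column names containing a leakage keyword (case-insensitive)."""
    return [c for c in columns
            if c not in allowed and any(k in c.lower() for k in keywords)]


# Made-up column list: two of these names should be flagged
cols = ["Age", "Weight_kg", "Fiber_g", "Obesity_Status", "BMXWAIST"]
flagged = find_leakage_columns(cols)
print(flagged)
```

A name-based check only catches columns whose names contain these keywords. A measurement stored under an unrelated code would need to be added to `keywords` by hand.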

## 1.8 Handle Missing Data

Check for missing values and implement an imputation strategy.

In [None]:
print("\n" + "="*80)
print("MISSING DATA ANALYSIS")
print("="*80)

# Calculate missing values
missing_summary = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum().values,
    'Missing_Percentage': (df.isnull().sum().values / len(df) * 100).round(2)
})

# Show only columns with missing data
missing_data = missing_summary[missing_summary['Missing_Count'] > 0].sort_values(
    'Missing_Percentage', ascending=False)

print("\n📊 Missing Values Summary:")
if len(missing_data) > 0:
    print("\nColumns with missing data:")
    print(missing_data.to_string(index=False))
else:
    print("\n✓ No missing values found!")

# Count by percentage ranges
if len(missing_data) > 0:
    low_missing = len(missing_data[missing_data['Missing_Percentage'] < 10])
    med_missing = len(missing_data[(missing_data['Missing_Percentage'] >= 10) &
                                   (missing_data['Missing_Percentage'] < 20)])
    high_missing = len(missing_data[missing_data['Missing_Percentage'] >= 20])

    print("\nMissing data breakdown:")
    print(f"  < 10% missing: {low_missing} columns")
    print(f"  10-20% missing: {med_missing} columns")
    print(f"  ≥ 20% missing: {high_missing} columns")

In [None]:
# Handle missing data
print("\n" + "="*80)
print("HANDLING MISSING DATA")
print("="*80)

# Strategy 1: Remove rows with missing target
if df['Obesity_Status'].isnull().sum() > 0:
    before = len(df)
    df = df[df['Obesity_Status'].notna()]
    print(f"✓ Removed {before - len(df)} rows with missing target variable")

# Strategy 2: Impute numeric variables with median (grouped by gender if available)
numeric_cols = df.select_dtypes(include=[np.number]).columns
numeric_cols_with_missing = [col for col in numeric_cols if df[col].isnull().sum() > 0]

if numeric_cols_with_missing:
    print(f"\n✓ Imputing {len(numeric_cols_with_missing)} numeric columns with median:")
    for col in numeric_cols_with_missing:
        if col not in ['SEQN', 'Obesity_Status']:
            missing_count = df[col].isnull().sum()
            if 'Gender' in df.columns and df['Gender'].notna().all():
                df[col] = df.groupby('Gender')[col].transform(lambda x: x.fillna(x.median()))
            else:
                df[col] = df[col].fillna(df[col].median())
            print(f"  - {col}: {missing_count} values imputed")

# Strategy 3: Impute categorical with mode or 'Unknown'
cat_cols = df.select_dtypes(include=['object', 'category']).columns
cat_cols_with_missing = [col for col in cat_cols if df[col].isnull().sum() > 0]

if cat_cols_with_missing:
    print(f"\n✓ Imputing {len(cat_cols_with_missing)} categorical columns:")
    for col in cat_cols_with_missing:
        missing_count = df[col].isnull().sum()
        missing_pct = missing_count / len(df) * 100
        if missing_pct < 20:
            df[col] = df[col].fillna(df[col].mode()[0])
            print(f"  - {col}: {missing_count} values → mode")
        else:
            df[col] = df[col].fillna('Unknown')
            print(f"  - {col}: {missing_count} values → 'Unknown'")

print("\n✓ Missing data handling complete")
print(f"  Remaining missing values: {df.isnull().sum().sum()}")
print(f"  Final dataset size: {len(df):,} participants")
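
The cell above computes medians on the full dataset, before the train-test split in Section 1.11. One alternative, sketched here on made-up data and not the approach this notebook takes, is to learn the gender-specific medians from the training rows only and reuse them for the test rows, so no test-set information enters the imputation.

```python
import numpy as np
import pandas as pd

# Made-up train/test frames with missing fiber values
train = pd.DataFrame({"Gender": [1, 1, 2, 2],
                      "Fiber_g": [10.0, np.nan, 20.0, 30.0]})
test = pd.DataFrame({"Gender": [1, 2],
                     "Fiber_g": [np.nan, np.nan]})

# Learn one median per gender from the training rows only
medians = train.groupby("Gender")["Fiber_g"].median()

# Fill both sets with the training medians for each row's gender
train["Fiber_g"] = train["Fiber_g"].fillna(train["Gender"].map(medians))
test["Fiber_g"] = test["Fiber_g"].fillna(test["Gender"].map(medians))

print(train["Fiber_g"].tolist())
print(test["Fiber_g"].tolist())
```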

## 1.9 Encode Categorical Variables

In [None]:
print("\n" + "="*80)
print("ENCODING CATEGORICAL VARIABLES")
print("="*80)

# Gender: Binary encoding
if 'Gender' in df.columns:
    # NHANES: 1=Male, 2=Female → Convert to: 1=Male, 0=Female
    df['Gender'] = (df['Gender'] == 1).astype(int)
    print("✓ Gender encoded: Male=1, Female=0")

# Smoking: Create a simplified three-level status
if 'Ever_Smoked_100_Cigarettes' in df.columns and 'Current_Smoking_Status' in df.columns:
    # Never smoked = 0, Former = 1, Current = 2
    ever_smoked = df['Ever_Smoked_100_Cigarettes'] == 1
    df['Smoking_Status'] = 0  # Default: Never
    df.loc[ever_smoked, 'Smoking_Status'] = 1  # Former
    # Require ever_smoked as well: Current_Smoking_Status may have been imputed
    # in Section 1.8 for participants who never smoked
    df.loc[ever_smoked & df['Current_Smoking_Status'].isin([1, 2]), 'Smoking_Status'] = 2  # Current
    print("✓ Smoking_Status created: Never=0, Former=1, Current=2")

# Education: Ordinal encoding
if 'Education_Level' in df.columns:
    # NHANES codes: 1=<9th, 2=9-11, 3=HS, 4=Some college, 5=College grad
    # Already ordinal, just handle missing
    df['Education_Level'] = df['Education_Level'].fillna(0)
    print("✓ Education_Level: ordinal (0=Unknown, 1-5=levels)")

# Income: Ordinal encoding
if 'Income_Bracket' in df.columns:
    # Keep NHANES income codes as ordinal values, with 0 for missing
    df['Income_Bracket'] = df['Income_Bracket'].fillna(0)
    print("✓ Income_Bracket: ordinal encoding applied")

# Race/Ethnicity: One-hot encoding
if 'Race_Ethnicity' in df.columns:
    race_dummies = pd.get_dummies(df['Race_Ethnicity'], prefix='Race')
    df = pd.concat([df, race_dummies], axis=1)
    df = df.drop('Race_Ethnicity', axis=1)
    print(f"✓ Race/Ethnicity one-hot encoded: {len(race_dummies.columns)} binary columns")

# Marital Status: One-hot encoding (if available)
if 'Marital_Status' in df.columns:
    marital_dummies = pd.get_dummies(df['Marital_Status'], prefix='Marital')
    df = pd.concat([df, marital_dummies], axis=1)
    df = df.drop('Marital_Status', axis=1)
    print(f"✓ Marital_Status one-hot encoded: {len(marital_dummies.columns)} binary columns")

print("\n✓ Categorical encoding complete")
print(f"  Current dataset shape: {df.shape}")
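
Treating NHANES codes as ordinal values only works if the "Refused" and "Don't know" responses are removed first, because they are stored as large numeric codes that would otherwise rank above every real category. The sketch below shows one way to turn such codes into missing values. The specific codes in `special_codes` are assumptions to be checked against each variable's codebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# ASSUMED special codes per column; confirm each against the NHANES codebook
special_codes = {"Education_Level": [7, 9], "Income_Bracket": [77, 99]}

# Made-up responses, including some special codes
toy = pd.DataFrame({"Education_Level": [3.0, 7.0, 5.0, 9.0],
                    "Income_Bracket": [6.0, 77.0, 99.0, 15.0]})

cleaned = toy.copy()
for col, codes in special_codes.items():
    # mask() sets matching entries to NaN and leaves other values unchanged
    cleaned[col] = cleaned[col].mask(cleaned[col].isin(codes))

print(cleaned)
```

For this to take effect, the recoding has to run before imputation and encoding, so the special codes are handled like any other missing value.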

## 1.10 Feature Engineering

Create derived features from dietary and lifestyle variables.

In [None]:
print("\n" + "="*80)
print("FEATURE ENGINEERING")
print("="*80)

features_created = []

# 1. Total Physical Activity (moderate + vigorous, with vigorous minutes counted double)
if 'Moderate_Activity_Min_Per_Week' in df.columns and 'Vigorous_Activity_Min_Per_Week' in df.columns:
    df['Total_Physical_Activity'] = (df['Moderate_Activity_Min_Per_Week'].fillna(0) +
                                     df['Vigorous_Activity_Min_Per_Week'].fillna(0) * 2)
    features_created.append('Total_Physical_Activity')
elif 'Moderate_Activity_Min_Per_Week' in df.columns:
    df['Total_Physical_Activity'] = df['Moderate_Activity_Min_Per_Week'].fillna(0)
    features_created.append('Total_Physical_Activity')
elif 'Vigorous_Activity_Min_Per_Week' in df.columns:
    df['Total_Physical_Activity'] = df['Vigorous_Activity_Min_Per_Week'].fillna(0) * 2
    features_created.append('Total_Physical_Activity')

# 2. Macronutrient Percentages (4 kcal/g for protein and carbohydrate, 9 kcal/g for fat)
if 'Total_Calories_kcal' in df.columns:
    if 'Protein_g' in df.columns:
        df['Protein_Pct_Calories'] = (df['Protein_g'] * 4 / df['Total_Calories_kcal']) * 100
        features_created.append('Protein_Pct_Calories')
    if 'Fat_g' in df.columns:
        df['Fat_Pct_Calories'] = (df['Fat_g'] * 9 / df['Total_Calories_kcal']) * 100
        features_created.append('Fat_Pct_Calories')
    if 'Carbohydrates_g' in df.columns:
        df['Carb_Pct_Calories'] = (df['Carbohydrates_g'] * 4 / df['Total_Calories_kcal']) * 100
        features_created.append('Carb_Pct_Calories')

# 3. Energy Balance proxy (calories relative to activity; +100 avoids division by zero)
if 'Total_Calories_kcal' in df.columns and 'Total_Physical_Activity' in df.columns:
    df['Energy_Balance'] = df['Total_Calories_kcal'] / (df['Total_Physical_Activity'] + 100)
    features_created.append('Energy_Balance')

# 4. Nutrient Density
if 'Fiber_g' in df.columns and 'Protein_g' in df.columns and 'Total_Calories_kcal' in df.columns:
    df['Nutrient_Density'] = (df['Fiber_g'] + df['Protein_g']/10) / (df['Total_Calories_kcal'] / 1000)
    features_created.append('Nutrient_Density')

# 5. Sugar Ratio
if 'Sugar_g' in df.columns and 'Total_Calories_kcal' in df.columns:
    df['Sugar_Pct_Calories'] = (df['Sugar_g'] * 4 / df['Total_Calories_kcal']) * 100
    features_created.append('Sugar_Pct_Calories')

# 6. Age-Calorie Interaction
if 'Age' in df.columns and 'Total_Calories_kcal' in df.columns:
    df['Age_Calorie_Interaction'] = df['Age'] * df['Total_Calories_kcal'] / 1000
    features_created.append('Age_Calorie_Interaction')

# 7. Sodium Risk Indicator
if 'Sodium_mg' in df.columns:
    df['High_Sodium_Risk'] = (df['Sodium_mg'] > 2300).astype(int)
    features_created.append('High_Sodium_Risk')

# 8. Active Lifestyle Score
if ('Total_Physical_Activity' in df.columns and 'Sleep_Hours_Per_Night' in df.columns
        and 'Smoking_Status' in df.columns):
    df['Active_Lifestyle_Score'] = ((df['Total_Physical_Activity'] / 150) +
                                    (df['Sleep_Hours_Per_Night'] / 8) +
                                    (2 - df['Smoking_Status']))
    features_created.append('Active_Lifestyle_Score')

# 9. Saturated Fat Ratio (+0.1 avoids division by zero)
if 'Saturated_Fat_g' in df.columns and 'Fat_g' in df.columns:
    df['Saturated_Fat_Ratio'] = df['Saturated_Fat_g'] / (df['Fat_g'] + 0.1)
    features_created.append('Saturated_Fat_Ratio')

# 10. Adequate Sleep Indicator (7 to 9 hours)
if 'Sleep_Hours_Per_Night' in df.columns:
    df['Adequate_Sleep'] = ((df['Sleep_Hours_Per_Night'] >= 7) &
                            (df['Sleep_Hours_Per_Night'] <= 9)).astype(int)
    features_created.append('Adequate_Sleep')

print(f"\n✓ Created {len(features_created)} engineered features:")
for i, feature in enumerate(features_created, 1):
    print(f"  {i:2d}. {feature}")

print("\n✓ Feature engineering complete")
print(f"  Current dataset shape: {df.shape}")
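
As a worked example of the macronutrient percentages: 50 g of protein at 4 kcal/g is 200 kcal, which is 10% of a 2,000 kcal day. The sketch below reproduces that arithmetic on made-up rows. It also adds a guard that the cell above does not have: a recorded intake of 0 kcal is treated as missing, because dividing by zero would otherwise produce infinite values that break scaling later.

```python
import numpy as np
import pandas as pd

# Made-up intakes; the last row has zero recorded calories
toy = pd.DataFrame({"Protein_g": [50.0, 80.0, 10.0],
                    "Total_Calories_kcal": [2000.0, 1600.0, 0.0]})

# Treat 0 kcal as missing so the ratio becomes NaN instead of inf
kcal = toy["Total_Calories_kcal"].replace(0, np.nan)

# Protein provides about 4 kcal per gram
toy["Protein_Pct_Calories"] = toy["Protein_g"] * 4 / kcal * 100

print(toy)
```

The resulting NaN rows would still need to be imputed or dropped before modeling.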

## 1.11 Prepare for Modeling

Separate features and target, then create a stratified train-test split.

In [None]:
print("\n" + "="*80)
print("PREPARING FOR MODELING")
print("="*80)

# Columns to exclude from features
exclude_cols = ['SEQN', 'CYCLE', 'SURVEY_YEARS', 'Obesity_Status']

# Get feature columns
feature_cols = [col for col in df.columns if col not in exclude_cols]

# Separate features (X) and target (y)
X = df[feature_cols].copy()
y = df['Obesity_Status'].copy()

print(f"\n✓ Features (X): {X.shape}")
print(f"✓ Target (y): {y.shape}")
print("\nTarget distribution:")
print(y.value_counts())
print(f"\nObesity prevalence: {y.mean()*100:.1f}%")

# Train-test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,
    stratify=y
)

print("\n✓ Train-Test Split (80/20):")
print(f"  Training set: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"  Testing set: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"\n  Training obesity rate: {y_train.mean()*100:.1f}%")
print(f"  Testing obesity rate: {y_test.mean()*100:.1f}%")
print("  ✓ Stratification successful!")
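
To illustrate what `stratify=y` buys, here is a minimal sketch on synthetic labels with 30% positives (the data is made up). It uses the same `train_test_split` settings as the cell above and checks that both subsets keep roughly the original class proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 30% positive class
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

print(f"Overall positive rate: {y_toy.mean():.2f}")
print(f"Train positive rate:   {y_tr.mean():.2f}")
print(f"Test positive rate:    {y_te.mean():.2f}")
```

Without stratification, a small test set can drift away from the overall prevalence by chance, which makes metrics such as precision and recall harder to compare across runs.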

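## 1.12 Feature Standardization

Apply Z-score normalization to all features. The scaler is fitted on the training set only and then applied to both sets, so test-set statistics do not influence the transformation.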

In [None]:
# Standardize features (fit on training data only)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("\n" + "="*80)
print("FEATURE STANDARDIZATION COMPLETE")
print("="*80)
print("\n✓ Applied Z-score normalization: (x - mean) / std")
print("\nVerification (training set):")
print(f"  Mean: {X_train_scaled.mean().mean():.6f} (should be ~0)")
print(f"  Std: {X_train_scaled.std().mean():.6f} (should be ~1)")
print("\n✓ Data ready for modeling!")
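
The transformation itself is z = (x - mean) / std, with the mean and standard deviation taken from the training set. The sketch below reproduces it by hand in NumPy on made-up numbers. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` computes. pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the verification printed above can land slightly above 1.

```python
import numpy as np

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

# Statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0)

train_z = (train - mean) / std
test_z = (test - mean) / std  # test data reuses the training statistics

print("Train mean/std after scaling:", train_z.mean(), train_z.std())
print("Scaled test value:", test_z[0, 0])
```

The scaled test value lies outside the training range, which is expected: test data is transformed with the training statistics and is not re-centered on its own.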

## 1.13 Section 1 Summary

In [None]:
print("\n" + "="*80)
print("SECTION 1 COMPLETE: DATA PREPROCESSING SUMMARY")
print("="*80)

print("\n✓ Imported and parsed 21 XPT files (3 cycles × 7 files)")
print("✓ Merged data by participant ID (SEQN) - inner join across all files")
print("✓ Filtered for adults (18+) with BMI data")
print("✓ Created binary obesity target (BMI ≥ 30)")
print("✓ Removed data leakage variables (BMI, weight, height)")
print("✓ Handled missing data (imputation)")
print("✓ Encoded categorical variables")
print(f"✓ Created {len(features_created)} engineered features")
print("✓ Split into train/test sets (80/20, stratified)")
print("✓ Standardized all features (Z-score, fitted on the training set)")

print("\n📊 Final Dataset Statistics:")
print(f"  Adults in final dataset: {len(df):,}")
print(f"  Survey cycles: {', '.join(sorted(df['CYCLE'].unique()))}")
print("  Years covered: 2013-2018")
print(f"  Total features: {X.shape[1]}")
print(f"  Training samples: {X_train.shape[0]:,}")
print(f"  Testing samples: {X_test.shape[0]:,}")
print(f"  Obesity prevalence: {y.mean()*100:.1f}%")

print("\n✓ NO DATA LEAKAGE: Predicting from causes, not measurements!")
print(f"\n{'='*80}")
print("Ready for Section 2: Data Analysis and Visualization")
print("="*80)

---

# SECTION 2: Data Analysis and Visualization (100 Points)

Complete analysis coming...

## 2.1 Variable Classification

## 2.2 Descriptive Statistics

## 2.3 Correlation Analysis

## 2.4 Exploratory Visualizations

---

# SECTION 3: Machine Learning Models (100 Points)

Complete ML pipeline coming...

## 3.1 Model Training

## 3.2 Cross-Validation

## 3.3 Performance Metrics

## 3.4 Feature Importance

---

# SECTION 4: Conclusions (50 Points)

Final interpretation and results...