# FINAL PROJECT

## 1. Data Collection
The dataset used in this project is **CDC Diabetes Health Indicators**, which contains healthcare statistics and lifestyle survey information about individual respondents, along with their diabetes diagnosis.

The dataset is available from the **UC Irvine Machine Learning Repository** ([UCI Dataset URL](https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators)) and from **Kaggle** ([Kaggle Dataset URL](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset)). It was cleaned and consolidated by Alex Teboul from the original **CDC BRFSS 2015** survey data.

Regarding license, the dataset is marked as **CC0: Public Domain**, which means it can be used freely without asking permission.

The original data were collected in 2015 through a telephone survey of a random sample of more than 400,000 U.S. adults (one per household), covering their health-related risk behaviors, chronic health conditions, and use of preventive services. The cleaned dataset keeps only the features related to diabetes and other chronic health conditions.

Our group is interested in medical and healthcare-related datasets, so we chose this one. It has the potential to show which factors contribute to chronic health conditions, and how.

## 2. Data Exploration

### 2.1 Load the Dataset

In [None]:
%pip install ucimlrepo
%pip install pandas
%pip install numpy
%pip install matplotlib
%pip install seaborn
%pip install scikit-learn

In [None]:
from ucimlrepo import fetch_ucirepo 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# fetch dataset 
cdc_diabetes_health_indicators = fetch_ucirepo(id=891).data.original

In [None]:
# Display the first few rows of the original dataset
cdc_diabetes_health_indicators.head()

### 2.2 Dataset Overview - Basic Information

In [None]:
cdc_diabetes_health_indicators.shape

Each row represents one survey respondent.

In [None]:
cdc_diabetes_health_indicators.memory_usage(deep=True).sum()

In [None]:
cdc_diabetes_health_indicators.dtypes

In [None]:
cdc_diabetes_health_indicators.info()

### 2.3 Data Integrity

In [None]:
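# Exclude the unique ID column; with it included, no two rows can ever compare equal
cdc_diabetes_health_indicators.drop(columns=['ID']).duplicated().sum()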

In [None]:
cdc_diabetes_health_indicators.isnull().sum()

### 2.4 Column Inventory

| Column Name            | Data Type | Meaning/Definition                                                                 | Relevant for Analysis?          | Should it be Dropped?      |
|------------------------|-----------|------------------------------------------------------------------------------------|---------------------------------|---------------------------|
| ID                     | int64     | Unique identifier for each respondent                                              | No                              | Yes (identifier only)     |
| Diabetes_binary        | int64     | Target variable: 0 = no diabetes, 1 = prediabetes or diabetes                      | Yes                             | No (target variable)      |
| HighBP                 | int64     | High blood pressure: 0 = no, 1 = yes                                               | Yes                             | No (predictor)            |
| HighChol               | int64     | High cholesterol: 0 = no, 1 = yes                                                  | Yes                             | No (predictor)            |
| CholCheck              | int64     | Cholesterol check in past 5 years: 0 = no, 1 = yes                                 | Yes                             | No (predictor)            |
| BMI                    | int64     | Body Mass Index (weight in kg / height in m²)                                       | Yes                             | No (predictor)            |
| Smoker                 | int64     | Have smoked at least 100 cigarettes in lifetime: 0 = no, 1 = yes                   | Yes                             | No (predictor)            |
| Stroke                 | int64     | Ever told had a stroke: 0 = no, 1 = yes                                            | Yes                             | No (predictor)            |
| HeartDiseaseorAttack   | int64     | Coronary heart disease or myocardial infarction: 0 = no, 1 = yes                   | Yes                             | No (predictor)            |
| PhysActivity           | int64     | Physical activity in past 30 days (not job-related): 0 = no, 1 = yes                | Yes                             | No (predictor)            |
| Fruits                 | int64     | Consume fruit 1+ times per day: 0 = no, 1 = yes                                    | Yes                             | No (predictor)            |
| Veggies                | int64     | Consume vegetables 1+ times per day: 0 = no, 1 = yes                                | Yes                             | No (predictor)            |
| HvyAlcoholConsump      | int64     | Heavy alcohol consumption (adult men >14 drinks/week, women >7 drinks/week): 0 = no, 1 = yes | Yes                             | No (predictor)            |
| AnyHealthcare          | int64     | Have any health care coverage: 0 = no, 1 = yes                                     | Yes                             | No (predictor)            |
| NoDocbcCost            | int64     | Could not see doctor due to cost in past 12 months: 0 = no, 1 = yes                | Yes                             | No (predictor)            |
| GenHlth                | int64     | General health rating: scale 1-5 (1 = excellent, 5 = poor)                          | Yes                             | No (predictor)            |
| MentHlth               | int64     | Number of days mental health was not good in past 30 days (0-30)                   | Yes                             | No (predictor)            |
| PhysHlth               | int64     | Number of days physical health was not good in past 30 days (0-30)                 | Yes                             | No (predictor)            |
| DiffWalk               | int64     | Serious difficulty walking or climbing stairs: 0 = no, 1 = yes                      | Yes                             | No (predictor)            |
| Sex                    | int64     | Sex: 0 = female, 1 = male                                                          | Yes                             | No (predictor)            |
| Age                    | int64     | Age category: 13-level age category (1 = 18-24, 13 = 80+)                          | Yes                             | No (predictor)            |
| Education              | int64     | Education level: scale 1-6 (1 = never attended school, 6 = college graduate)       | Yes                             | No (predictor)            |
| Income                 | int64     | Income level: scale 1-8 (1 = <$10,000, 8 = $75,000+)                                | Yes                             | No (predictor)            |

In [None]:
cdc_diabetes_health_indicators.drop(columns=['ID'], inplace=True)

### 2.5 Numerical Columns Analysis

In [None]:
plain_numeric_columns = ['BMI', 'PhysHlth', 'MentHlth']
categorial_like_numerics = [ 'GenHlth',  'Age', 'Education', 'Income']

#### 1. Distribution & Central Tendency

In [None]:
cdc_diabetes_health_indicators[plain_numeric_columns].describe()

In [None]:
for col in categorial_like_numerics:
    sns.countplot(data=cdc_diabetes_health_indicators, x=col)
    plt.title(f'Distribution of {col}')
    plt.show()

#### 2. Range & Outliers

In [None]:
# check min/max values for plain numeric columns
for col in plain_numeric_columns:
    print(f"{col} - min: {cdc_diabetes_health_indicators[col].min()}, max: {cdc_diabetes_health_indicators[col].max()}")   
    

In [None]:
# Create box plots for plain numeric columns to identify outliers
for col in plain_numeric_columns:
    sns.boxplot(data=cdc_diabetes_health_indicators, x=col)
    plt.title(f'Box plot of {col}')
    plt.show()

All three columns exhibit strongly positively skewed distributions.

BMI:
- The interquartile range (IQR) is narrow and centered on typical adult values (~20-30).
- Median ≈ 26–27.
- Long right tail with numerous points flagged as outliers extending to ~100.
- Reflects the common pattern in population data: most individuals have moderate BMI, while a smaller group has severe obesity.

PhysHlth and MentHlth:
- Both are nearly identical in shape.
- Majority of values cluster at or near 0 (most people report few or no bad health days).
- Very small IQR (typically 0 to ~2–3).
- Median close to 0.
- Long upper whisker and many outliers extending to the maximum of 30.
- Indicates that while most respondents enjoy good physical/mental health, a subset experiences frequent poor health days.

The numerous points marked as outliers are valid and expected. These represent genuine cases (severe obesity, chronic conditions, mental health challenges) rather than data errors.
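
The box plots above flag points using the usual 1.5 × IQR rule. As a minimal sketch of that rule, the snippet below counts flagged points in a small series. The helper name `iqr_outlier_mask` and the toy values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical BMI-like values, for illustration only (not from the dataset)
toy_bmi = pd.Series([22, 24, 25, 26, 27, 28, 29, 31, 33, 98])
mask = iqr_outlier_mask(toy_bmi)
flagged = toy_bmi[mask].tolist()
print(f"{mask.sum()} value(s) flagged: {flagged}")
```

Applied to a real column, the same mask could be used to report what share of respondents falls outside the whiskers before deciding whether any treatment is needed.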


#### 3. Data Quality

In [None]:
# check for impossible values in plain numeric columns
for col in plain_numeric_columns:
    if col == 'BMI':
        invalid_values = cdc_diabetes_health_indicators[(cdc_diabetes_health_indicators[col] < 10) | (cdc_diabetes_health_indicators[col] > 100)]
    else:
        invalid_values = cdc_diabetes_health_indicators[(cdc_diabetes_health_indicators[col] < 0) | (cdc_diabetes_health_indicators[col] > 30)]
    print(f"Invalid values in {col}:")
    print(invalid_values)

### 2.6 Categorical Columns Analysis

In [None]:
# all columns except the above numeric ones 
categorial_columns = [ col for col in cdc_diabetes_health_indicators.columns if col not in plain_numeric_columns and col not in categorial_like_numerics ]

#### 1. Value Distribution

In [None]:
for col in categorial_columns:
    sns.countplot(data=cdc_diabetes_health_indicators, x=col)
    plt.title(f'Distribution of {col}')
    plt.show()

In [None]:
# calculate percentages of each category in categorical columns
for col in categorial_columns:
    print(f"Value counts for {col}:")
    print(cdc_diabetes_health_indicators[col].value_counts(normalize=True) * 100)
    print()

#### 2. Data Quality

In [None]:
# Verify all values are 0 and 1 only
for col in categorial_columns:
    unique_values = cdc_diabetes_health_indicators[col].unique()
    print(f"Unique values in {col}: {unique_values}")

In [None]:
# Check for class imbalance in target variable 'Diabetes_binary'
target_counts = cdc_diabetes_health_indicators['Diabetes_binary'].value_counts(normalize=True) * 100
print("Class distribution in 'Diabetes_binary':")
print(target_counts)

### 2.7 Missing Data Analysis

In [None]:
# A heatmap to visualize missing data
plt.figure(figsize=(10, 6))
sns.heatmap(cdc_diabetes_health_indicators.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()

In [None]:
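# Count missing values per column and keep only columns that have any
missing_counts = cdc_diabetes_health_indicators.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts if not missing_counts.empty else "No columns with missing values")

No missing data found - the dataset is clean.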

### 2.8 Relationships & Correlations

#### 1. Correlation Matrix

In [None]:
# Create a heatmap to visualize correlations
plt.figure(figsize=(12, 8))
correlation_matrix = cdc_diabetes_health_indicators.corr()
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

Features correlated with Diabetes_binary:
- HighBP (0.26)
- HighChol (0.20)
- BMI (0.22)
- Age (0.18)
- DiffWalk (0.22)
- PhysHlth (0.17)
- HeartDiseaseorAttack (0.18)
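
A list like the one above can be read off the heatmap by eye, but it can also be pulled out programmatically. The following is a rough sketch on synthetic data; the column `Noise` and all generated values are hypothetical, and only the pattern of selecting the target column from `corr()` is the point.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey data (hypothetical values, illustration only)
rng = np.random.default_rng(0)
n = 2000
high_bp = rng.integers(0, 2, size=n)
noise = rng.integers(0, 2, size=n)
# The toy target depends on high_bp, so a correlation exists by construction
target = ((high_bp + rng.random(n)) > 1.2).astype(int)
toy = pd.DataFrame({"Diabetes_binary": target, "HighBP": high_bp, "Noise": noise})

# Take the target's column of the correlation matrix, drop the self-correlation,
# and order the remaining features by absolute strength
target_corr = (
    toy.corr()["Diabetes_binary"]
    .drop("Diabetes_binary")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr.round(2))
```

Sorting by absolute value keeps strongly negative correlations from being pushed to the bottom of the list.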

#### 2. Cross-tabulations

In [None]:
# Create crosstabs for:
# Diabetes vs HighBP
# Diabetes vs HighChol
# Diabetes vs Smoker
# Diabetes vs PhysActivity
crosstab_highbp = pd.crosstab(cdc_diabetes_health_indicators['Diabetes_binary'], cdc_diabetes_health_indicators['HighBP'])
crosstab_highchol = pd.crosstab(cdc_diabetes_health_indicators['Diabetes_binary'], cdc_diabetes_health_indicators['HighChol'])
crosstab_smoker = pd.crosstab(cdc_diabetes_health_indicators['Diabetes_binary'], cdc_diabetes_health_indicators['Smoker'])
crosstab_physactivity = pd.crosstab(cdc_diabetes_health_indicators['Diabetes_binary'], cdc_diabetes_health_indicators['PhysActivity'])

# show crosstabs
print("Crosstab: Diabetes vs HighBP")
print(crosstab_highbp)
print("\nCrosstab: Diabetes vs HighChol")
print(crosstab_highchol)
print("\nCrosstab: Diabetes vs Smoker")
print(crosstab_smoker)
print("\nCrosstab: Diabetes vs PhysActivity")
print(crosstab_physactivity)

### 2.9 Initial Observations and Insights

**Key Observations:**

1. **Severe class imbalance:** Dataset contains ~86% non-diabetic vs ~14% diabetic/prediabetic individuals, which will require careful handling in machine learning models (stratified sampling, appropriate metrics).

2. **Strong health condition correlations:** High blood pressure (0.26), high cholesterol (0.20), and BMI (0.22) show the strongest correlations with diabetes, suggesting these are primary physiological risk factors.

3. **Lifestyle factor patterns:** Physical activity shows an inverse relationship with diabetes: 75.7% of the overall dataset is physically active, but only 63.0% of diabetics are active (from crosstab: 22,287/(13,059+22,287)), which is consistent with a protective effect.

4. **Smoking paradox:** Smoking prevalence differs only moderately between the diabetic (51.8%) and non-diabetic (43.1%) groups, suggesting a weaker individual effect than the health conditions. However, interactions with other factors need investigation.

5. **BMI distribution differences:** Mean BMI in dataset is 28.4 (overweight category), with high variance (SD=6.6) and extreme outliers up to 98, reflecting U.S. obesity epidemic patterns.

6. **Age correlation:** Age shows moderate correlation (0.18) with diabetes, confirming diabetes as age-related chronic condition.
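
Within-group percentages such as the share of diabetics who are physically active can be obtained directly from a row-normalized crosstab instead of dividing counts by hand. A small sketch, using hypothetical toy records rather than the real survey counts:

```python
import pandas as pd

# Hypothetical toy records (not the real survey counts)
toy = pd.DataFrame({
    "Diabetes_binary": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "PhysActivity":    [1, 1, 1, 1, 1, 0, 1, 1, 0, 0],
})

# normalize="index" divides each row by its row total, so every row sums to 1
row_share = pd.crosstab(toy["Diabetes_binary"], toy["PhysActivity"], normalize="index")
print((row_share * 100).round(1))
```

Each row then reads as "of the respondents with this diabetes status, what percentage is inactive or active".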

**Patterns Leading to Research Questions:**

- The combination of health conditions (HighBP, HighChol) suggests potential synergistic effects worth investigating
- Physical activity's protective pattern raises questions about its interaction with smoking and health conditions
- Different correlation strengths across features motivate feature importance analysis through ML

**Red Flags:**

- **Critical: Severe class imbalance (86:14 ratio)** requires:
  - Stratified train/test splitting (implemented in Section 4 ✓)
  - ROC-AUC as primary metric instead of accuracy
  - Potential use of class weights or SMOTE in models

- No missing values is surprisingly clean for health survey data - suggests heavy preprocessing by data authors

- Extreme BMI outliers (up to 98) are valid but rare cases that might disproportionately influence models

## 3. Question Formulation

Question 1: Which features are most important for predicting diabetes risk?
1. The Question: "Which features (among lifestyle factors, health conditions, and demographic variables) are most important for predicting diabetes risk?" 
2. Motivation & Benefits:
- Why worth investigating? Understanding which features drive diabetes risk can help healthcare providers prioritize screening and interventions
- Benefits:
    - Identify high-risk individuals more efficiently
    - Allocate resources to most impactful preventive measures
    - Inform public health campaigns
- Who cares? Healthcare providers, public health officials, insurance companies, at-risk individuals
- Real-world impact: Can improve early detection and prevention strategies, reducing healthcare costs

Question 2: How do lifestyle factors and health conditions interact to affect diabetes prevalence?
1. The Question: "How do lifestyle factors (smoking, physical activity) and health conditions (high BP, high cholesterol) interact to affect diabetes prevalence? Are combined effects more significant than individual factors?" 
2. Motivation & Benefits:
- Why worth investigating? Understanding interaction effects reveals whether combinations of risk factors multiply risk beyond individual contributions
- Benefits:
    - Identify highest-risk combinations (e.g., smoker + high BP + no physical activity)
    - Develop targeted intervention programs
    - Understand whether changing one factor (e.g., increasing physical activity) can offset other risks
- Who cares? Public health researchers, policy makers, primary care physicians, individuals managing multiple risk factors
- Real-world impact: Can inform personalized risk assessment and lifestyle modification recommendations

## 4. Data Analysis

### Question 1:

#### A. Preprocessing

Create a stratified 80/20 train/test split.

Note: All features are already numerical, so no encoding is needed.

In [None]:
# Split data
from sklearn.model_selection import train_test_split
X = cdc_diabetes_health_indicators.drop(columns=['Diabetes_binary'])
y = cdc_diabetes_health_indicators['Diabetes_binary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

#### B. Analysis

Train multiple classification models (Logistic Regression, Decision Tree, Random Forest, etc.) to predict Diabetes_binary

Evaluate models using accuracy, precision, recall, F1-score, ROC-AUC, Confusion Matrix

Identify top features using feature importance from tree-based models or coefficients from logistic regression

Compare importance rankings across models

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

In [None]:
lr_model = LogisticRegression(max_iter=1000)
dt_model = DecisionTreeClassifier(random_state=42)
rf_model = RandomForestClassifier(random_state=42)
lr_model.fit(X_train, y_train)
dt_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)

models = {
    'Logistic Regression': lr_model,
    'Decision Tree': dt_model,
    'Random Forest': rf_model
}
results = {}
for model_name, model in models.items():
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    results[model_name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_proba),
        'Confusion Matrix': confusion_matrix(y_test, y_pred)
    }


In [None]:
# Display results
for model_name, metrics in results.items():
    print(f"Results for {model_name}:")
    # Iterate over this model's metrics, not over the outer results dict
    for metric_name, value in metrics.items():
        print(f"  {metric_name}: {value}")
    print()

In [None]:
# Extract feature importances
feature_importances = {}
for model_name, model in models.items():
    if model_name == 'Logistic Regression':
        # Coefficients are signed and unscaled, so they are not directly comparable to tree importances
        importances = model.coef_[0]
    else:
        importances = model.feature_importances_
    feature_importances[model_name] = pd.Series(importances, index=X.columns).sort_values(ascending=False)

# Display feature importances
for model_name, importances in feature_importances.items():
    print(f"Feature importances for {model_name}:")
    print(importances)
    print()

In [None]:
# Create visualizations of feature importances
for model_name, importances in feature_importances.items():
    plt.figure(figsize=(10, 6))
    sns.barplot(x=importances.values, y=importances.index)
    plt.title(f'Feature Importances from {model_name}')
    plt.xlabel('Importance Score')
    plt.ylabel('Features')
    plt.show()

#### C. Results and Interpretation

In [None]:
# Feature importance bar charts (top 15 features)
for model_name, importances in feature_importances.items():
    plt.figure(figsize=(10, 6))
    sns.barplot(x=importances.values[:15], y=importances.index[:15])
    plt.title(f'Top 15 Feature Importances from {model_name}')
    plt.xlabel('Importance Score')
    plt.ylabel('Features')
    plt.show()


In [None]:
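# ROC Curves
from sklearn.metrics import roc_curve, auc
plt.figure(figsize=(10, 8))
for model_name, model in models.items():
    y_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.2f})')
# Diagonal reference line: a classifier with no discriminative ability
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random guess (AUC = 0.50)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.legend()
plt.show()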


In [None]:
# Confusion Matrices
for model_name, metrics in results.items():
    cm = metrics['Confusion Matrix']
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0, 1], yticklabels=[0, 1])
    plt.title(f'Confusion Matrix for {model_name}')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()


In [None]:
# Comparison plot of feature rankings across models
feature_rankings = pd.DataFrame()
for model_name, importances in feature_importances.items():
    feature_rankings[model_name] = importances.rank(ascending=False)    
feature_rankings.plot(kind='bar', figsize=(12, 8))
plt.title('Feature Rankings Across Models')
plt.xlabel('Features')
plt.ylabel('Rank')
plt.legend(title='Models')
plt.show()

### Written Analysis - Question 1 Results

#### **Model Performance Comparison:**

| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|-------|----------|-----------|--------|----------|---------|
| Logistic Regression | 86.2% | 51.7% | 15.8% | 24.2% | **0.819** |
| Decision Tree | 79.8% | 29.7% | 32.8% | 31.2% | 0.599 |
| Random Forest | 86.0% | 48.9% | 17.9% | 26.2% | 0.796 |

**Key Findings:**

1. **Logistic Regression performs best** with ROC-AUC of 0.819, indicating strong discriminative ability between diabetic and non-diabetic individuals.

2. **Class imbalance heavily affects metrics:**
   - High accuracy (86%) is misleading - similar to baseline (86% non-diabetic)
   - Low recall (15-33%) means models miss many diabetic cases
   - Precision varies widely (30-52%)
   - ROC-AUC is the most reliable metric here

3. **Decision Tree overfits** with lowest AUC (0.599), barely better than random guessing (0.5).
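
To make the point about misleading accuracy concrete, the sketch below scores a majority-class baseline on hypothetical labels with roughly the 86:14 split described above. The toy arrays are assumptions for illustration; `DummyClassifier` is scikit-learn's built-in baseline estimator.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels with roughly the 86:14 split described above
y_toy = np.array([0] * 86 + [1] * 14)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, y_pred)
rec = recall_score(y_toy, y_pred)
print(f"baseline accuracy: {acc:.2f}, baseline recall: {rec:.2f}")
```

A model whose accuracy is close to this baseline has learned little that accuracy can reveal, which is why recall and ROC-AUC carry more weight in the comparison.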

#### **Feature Importance Analysis:**

**Top 5 Features by Model:**

**Logistic Regression** (coefficients):
1. CholCheck (1.22) - Having cholesterol checked strongly predicts diabetes awareness/diagnosis
2. HighBP (0.76) - High blood pressure is major indicator
3. HighChol (0.58) - High cholesterol contributes significantly  
4. GenHlth (0.54) - Self-rated general health
5. Sex (0.25) - Gender differences in diabetes risk

**Random Forest** (feature importance):
1. **BMI (0.183)** - Body Mass Index is strongest predictor
2. **Age (0.123)** - Age is second most important
3. Income (0.098) - Socioeconomic factor
4. PhysHlth (0.084) - Physical health days
5. Education (0.070) - Education level

**Decision Tree** (for comparison):
1. BMI (0.141)
2. Income (0.105)
3. Age (0.097)
4. PhysHlth (0.090)
5. Education (0.080)

#### **Critical Insights:**

1. **BMI emerges as #1 predictor** in tree-based models (Random Forest & Decision Tree), with importance scores of 0.183 and 0.141 respectively. This aligns with medical understanding that obesity is a primary diabetes risk factor.

2. **Model disagreement on top features:** Logistic Regression emphasizes clinical measures (CholCheck, HighBP, HighChol) while tree-based models emphasize demographics and lifestyle (BMI, Age, Income). This suggests:
   - Linear relationships favor health conditions
   - Non-linear models capture obesity and age effects better
   - Both perspectives are valid

3. **Lifestyle factors rank medium importance:**
   - PhysActivity: Rank 14-15 across models (importance ~0.026-0.033)
   - Smoker: Rank 9-10 (importance ~0.033-0.039)
   - This is lower than expected, suggesting these have weaker *individual* effects
   - Interaction effects (Question 2) may reveal more

4. **Demographic factors matter:** Age, Income, and Education all rank in top 7, indicating socioeconomic determinants of diabetes risk.

5. **Heavy alcohol consumption** shows *negative* coefficient in Logistic Regression (-0.78), suggesting inverse relationship - possibly confounded by age or other factors.
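
Part of the disagreement between models may come from feature scale: raw logistic regression coefficients are expressed per unit of each feature, so a 0/1 flag and a BMI-like column are not directly comparable. One possible check, shown here as a sketch on synthetic data and not as the method used for the results above, is to standardize features first and rank by absolute coefficient. All column names and values in the snippet are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the survey)
rng = np.random.default_rng(42)
n = 5000
X_toy = pd.DataFrame({
    "binary_flag": rng.integers(0, 2, size=n),   # 0/1 scale
    "wide_scale": rng.normal(30, 7, size=n),     # BMI-like scale
    "pure_noise": rng.normal(0, 1, size=n),
})
# Toy target driven mostly by wide_scale and only weakly by binary_flag
logit = 0.3 * X_toy["binary_flag"] + 0.15 * (X_toy["wide_scale"] - 30)
y_toy = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Scaling inside the pipeline puts every coefficient on a per-standard-deviation basis
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_toy, y_toy)

coefs = pd.Series(pipe[-1].coef_[0], index=X_toy.columns)
# Rank by magnitude but keep the sign so the direction of the effect stays visible
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```

In this toy setup the raw coefficient of `wide_scale` is the smaller of the two by construction, yet it ranks first after standardization, which illustrates how unscaled coefficients can understate continuous features.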

#### **Practical Implications:**

**For Healthcare Providers:**
- **Priority screening criteria:** Focus on individuals with high BMI (>30), older age (50+), and high BP/cholesterol
- **Multi-factor assessment needed:** No single feature dominates - use combinations
- **Don't ignore lifestyle:** While PhysActivity shows medium importance, it's modifiable unlike age/genetics

**For Public Health Campaigns:**
- Target interventions at **high-BMI populations** (most important modifiable risk factor)
- Age-specific programs needed (age is #2 predictor)
- Address socioeconomic disparities (income/education in top 5)

**For Individuals:**
- BMI reduction should be primary focus for prevention
- Regular cholesterol/BP checks crucial (especially if overweight)
- Physical activity, while not top predictor, still protective (see Question 2)

#### **Model Limitations:**

1. **Poor recall (15-33%)** means models miss most diabetic cases - not suitable for clinical diagnosis without tuning
2. **Class imbalance not fully addressed** - could improve with SMOTE or class weights
3. **Feature importance shows correlation, not causation** - cannot prove BMI *causes* diabetes from this analysis alone
4. **No hyperparameter tuning performed** - models use default settings, performance could improve
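
As a sketch of the class-weight idea mentioned in the limitations, the snippet below compares recall with and without `class_weight="balanced"` on synthetic imbalanced data. The data generation is hypothetical and the numbers it prints say nothing about how the real models would respond.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical; roughly 13-14% positives)
rng = np.random.default_rng(0)
n = 6000
x = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(x[:, 0] - 2.2)))
y = (rng.random(n) < p).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# "balanced" reweights samples inversely to class frequency during training
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall without weights: {recall_plain:.2f}")
print(f"recall with class_weight='balanced': {recall_weighted:.2f}")
```

Reweighting typically trades precision for recall, so precision and ROC-AUC should be reported alongside recall when comparing the two variants.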

### Question 2:

#### A. Preprocessing

- Filter the dataset to focus on key variables: Diabetes_binary, lifestyle factors (Smoker, PhysActivity), and health conditions (HighBP, HighChol)
- Create interaction groups/categories:
    - Combine Smoker + PhysActivity to create 4 groups:
        - Non-smoker + Active
        - Non-smoker + Inactive
        - Smoker + Active
        - Smoker + Inactive
    - Combine HighBP + HighChol to create 4 groups:
        - Neither condition
        - HighBP only
        - HighChol only
        - Both conditions
- No missing data to handle

In [None]:
questions_analysis = cdc_diabetes_health_indicators[['Diabetes_binary', 'Smoker', 'PhysActivity', 'HighBP', 'HighChol']].copy()

questions_analysis['Smoker_PhysActivity'] = questions_analysis.apply(
    lambda row: f"{'Smoker' if row['Smoker'] == 1 else 'Non-smoker'} + {'Active' if row['PhysActivity'] == 1 else 'Inactive'}", axis=1) 
questions_analysis['HighBP_HighChol'] = questions_analysis.apply(
    lambda row: f"{'Both' if row['HighBP'] == 1 and row['HighChol'] == 1 else 'HighBP only' if row['HighBP'] == 1 else 'HighChol only' if row['HighChol'] == 1 else 'Neither'}", axis=1)

# Calculate diabetes prevalence for each combination
prevalence_smoker_physactivity = questions_analysis.groupby('Smoker_PhysActivity')['Diabetes_binary'].mean().reset_index()
prevalence_highbp_highchol = questions_analysis.groupby('HighBP_HighChol')['Diabetes_binary'].mean().reset_index()

# Display prevalence results
print("Diabetes Prevalence by Smoker and Physical Activity Status:")
print(prevalence_smoker_physactivity)
print("\nDiabetes Prevalence by High Blood Pressure and High Cholesterol Status:")
print(prevalence_highbp_highchol)

#### B. Analysis
- Calculate diabetes prevalence rates within each interaction group
- Use chi-square tests to assess the statistical significance of differences in prevalence between groups
- Look for synergistic effects (combined risk > sum of individual risks)
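
With several hundred thousand rows, a chi-square test will report a tiny p-value for almost any difference, so it can help to report an effect size alongside it. The sketch below computes Cramér's V from a contingency table. The counts are hypothetical, and Cramér's V is an addition suggested here, not something computed elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts (rows: risk group, columns: diabetes 0/1); not real survey counts
table = pd.DataFrame(
    [[900, 100],
     [800, 200]],
    index=["Group A", "Group B"],
    columns=[0, 1],
)

chi2, p_value, dof, expected = chi2_contingency(table)

# Cramér's V rescales chi-square to [0, 1] so strength is comparable across sample sizes
n_obs = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n_obs * (min(table.shape) - 1)))

print(f"chi2 = {chi2:.2f}, p = {p_value:.3e}, Cramér's V = {cramers_v:.3f}")
```

Note that `chi2_contingency` applies Yates' continuity correction by default for 2×2 tables, which slightly lowers the statistic compared with the uncorrected formula.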

In [None]:
# Group by interaction categories
grouped_interactions = questions_analysis.groupby(['Smoker_PhysActivity', 'HighBP_HighChol'])['Diabetes_binary'].mean().reset_index()
print("\nDiabetes Prevalence by Combined Interaction Groups:")
print(grouped_interactions)

# Calculate prevalence percentages
grouped_interactions['Diabetes_Prevalence (%)'] = grouped_interactions['Diabetes_binary'] * 100
print("\nDiabetes Prevalence Percentages by Combined Interaction Groups:")
print(grouped_interactions[['Smoker_PhysActivity', 'HighBP_HighChol', 'Diabetes_Prevalence (%)']])

# Perform chi-square tests
from scipy.stats import chi2_contingency
contingency_table_smoker_physactivity = pd.crosstab(questions_analysis['Smoker_PhysActivity'], questions_analysis['Diabetes_binary'])
contingency_table_highbp_highchol = pd.crosstab(questions_analysis['HighBP_HighChol'], questions_analysis['Diabetes_binary'])
chi2_smoker_physactivity, p_smoker_physactivity, _, _ = chi2_contingency(contingency_table_smoker_physactivity)
chi2_highbp_highchol, p_highbp_highchol, _, _ = chi2_contingency(contingency_table_highbp_highchol)
# Scientific notation keeps very small p-values from printing as 0.0000
print(f"\nChi-square test for Smoker and Physical Activity groups vs diabetes: chi2 = {chi2_smoker_physactivity:.2f}, p-value = {p_smoker_physactivity:.3e}")
print(f"Chi-square test for High Blood Pressure and High Cholesterol groups vs diabetes: chi2 = {chi2_highbp_highchol:.2f}, p-value = {p_highbp_highchol:.3e}")

# Create interaction effect visualizations
plt.figure(figsize=(10, 6))
sns.barplot(data=grouped_interactions, x='Smoker_PhysActivity', y='Diabetes_Prevalence (%)', hue='HighBP_HighChol')
plt.title('Diabetes Prevalence by Interaction Groups')
plt.xlabel('Smoker and Physical Activity Status')
plt.ylabel('Diabetes Prevalence (%)')
plt.legend(title='High BP and High Cholesterol Status')
plt.show()

#### C. Results and Interpretation

In [None]:
grouped_bar_charts = questions_analysis.groupby(['Smoker', 'PhysActivity'])['Diabetes_binary'].mean().reset_index()
grouped_bar_charts['Diabetes_Prevalence (%)'] = grouped_bar_charts['Diabetes_binary'] * 100
plt.figure(figsize=(8, 6))
sns.barplot(data=grouped_bar_charts, x='Smoker', y='Diabetes_Prevalence (%)', hue='PhysActivity')
plt.title('Diabetes Prevalence by Smoker and Physical Activity Status')
plt.xlabel('Smoker Status (0=Non-smoker, 1=Smoker)')
plt.ylabel('Diabetes Prevalence (%)')
plt.legend(title='Physical Activity (0=Inactive, 1=Active)')
plt.show()

heatmap_data = questions_analysis.pivot_table(index='Smoker', columns='PhysActivity', values='Diabetes_binary', aggfunc='mean') * 100
plt.figure(figsize=(6, 5))
sns.heatmap(heatmap_data, annot=True, fmt=".2f", cmap='YlGnBu')
plt.title('Diabetes Prevalence Heatmap (Smoker x PhysActivity)')
plt.xlabel('Physical Activity (0=Inactive, 1=Active)')
plt.ylabel('Smoker Status (0=Non-smoker, 1=Smoker)')
plt.show()

stacked_bar_data = questions_analysis.groupby(['HighBP', 'HighChol'])['Diabetes_binary'].mean().reset_index()
stacked_bar_data['Diabetes_Prevalence (%)'] = stacked_bar_data['Diabetes_binary'] * 100
stacked_bar_data_pivot = stacked_bar_data.pivot(index='HighBP', columns='HighChol', values='Diabetes_Prevalence (%)')
stacked_bar_data_pivot.plot(kind='bar', stacked=True, figsize=(8, 6))
plt.title('Stacked Bar Chart of Diabetes Prevalence (HighBP x HighChol)')
plt.xlabel('High Blood Pressure (0=No, 1=Yes)')
plt.ylabel('Diabetes Prevalence (%)')
plt.legend(title='High Cholesterol (0=No, 1=Yes)')
plt.show()  

# Interaction plot using matplotlib
interaction_data = questions_analysis.groupby(['Smoker', 'PhysActivity'])['Diabetes_binary'].mean().unstack()
plt.figure(figsize=(8, 6))
for col in interaction_data.columns:
    plt.plot(interaction_data.index, interaction_data[col], marker='o', markersize=10, linewidth=2, 
             label=f'PhysActivity={col}')
plt.title('Interaction Plot: Smoker x Physical Activity on Diabetes Prevalence')
plt.xlabel('Smoker Status (0=Non-smoker, 1=Smoker)')
plt.ylabel('Mean Diabetes Prevalence')
plt.xticks([0, 1])
plt.legend(title='Physical Activity')
plt.grid(True, alpha=0.3)
plt.show()


### Written Analysis - Question 2 Results

#### **Individual Factor Effects (Baseline):**

**Lifestyle Factors:**
- **Smoking**, considered on its own, is associated with higher diabetes prevalence: 12.1% (non-smokers) vs 16.3% (smokers)
  - Absolute difference: **+4.2 percentage points**

- **Physical inactivity**, considered on its own, is associated with higher prevalence: 11.6% (active) vs 21.1% (inactive)
  - Absolute difference: **+9.5 percentage points**
  - This gap is about **2.3x** the smoking gap (9.5 / 4.2)

**Health Conditions:**
- **High BP**, considered on its own, is associated with higher prevalence: 6.0% (normal BP) vs 24.2% (high BP)
  - Absolute difference: **+18.2 percentage points**

- **High cholesterol**, considered on its own, is associated with higher prevalence: 8.0% (normal) vs 22.0% (high)
  - Absolute difference: **+14.0 percentage points**

#### **Interaction Effects Analysis:**

**Smoking × Physical Activity:**

From our results:

| Group | Diabetes Prevalence |
|-------|---------------------|
| Non-smoker + Active | 10.0% |
| Non-smoker + Inactive | 19.9% |
| Smoker + Active | 13.9% |
| Smoker + Inactive | **22.3%** |

**Analysis:**
- **Expected if additive:** 10.0% (baseline) + (13.9% - 10.0%) + (19.9% - 10.0%) = 23.8%
- **Observed:** 22.3%
- **Interpretation:** The observed value is 1.5 percentage points below the additive expectation, so the effects are roughly additive with **no sign of synergy**

**Key insight:** Physical activity is associated with **lower prevalence even among smokers** - smokers who are active have only 13.9% prevalence vs 22.3% for inactive smokers, a **38% relative reduction**.

**High BP × High Cholesterol:**

From our results:

| Group | Diabetes Prevalence |
|-------|---------------------|
| Neither | 4.2% |
| HighBP only | 16.7% |
| HighChol only | 10.4% |
| Both | **29.7%** |

**Analysis:**
- **Expected if additive:** 4.2% (baseline) + (16.7%-4.2%) + (10.4%-4.2%) = 22.9%
- **Observed:** 29.7%
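- **Difference:** +6.8 percentage points **beyond additive expectation** (about 30% above it)
- **Interpretation:** **Clear more-than-additive pattern** - prevalence with both conditions exceeds the sum of the two separate differences. This holds on the absolute (percentage-point) scale; if the two relative risks simply multiplied, the expectation would be 16.7% × 10.4% / 4.2% ≈ 41.4%, which is above the observed 29.7%

**This is a critical finding:** People with both high BP and high cholesterol show a markedly higher diabetes prevalence than the two conditions would predict separately, consistent with compounded cardiometabolic risk.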
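The additive comparison used for both tables can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the helper name `joint_expectations` is hypothetical, the prevalences are copied from the two tables above, and the multiplicative expectation (the product of the two relative risks applied to the baseline) is an extra reference point, not something the analysis above relies on.

```python
def joint_expectations(neither, a_only, b_only):
    """Expected prevalence (%) in the doubly exposed group under two null models."""
    additive = neither + (a_only - neither) + (b_only - neither)
    multiplicative = a_only * b_only / neither
    return additive, multiplicative


# Prevalences (%) copied from the tables above.
tables = {
    "Smoker x Inactive": {"neither": 10.0, "a_only": 13.9, "b_only": 19.9, "both": 22.3},
    "HighBP x HighChol": {"neither": 4.2, "a_only": 16.7, "b_only": 10.4, "both": 29.7},
}

results = {}
for name, t in tables.items():
    additive, multiplicative = joint_expectations(t["neither"], t["a_only"], t["b_only"])
    results[name] = {
        "additive": additive,
        "multiplicative": multiplicative,
        "excess_vs_additive": t["both"] - additive,
    }
    print(
        f"{name}: observed {t['both']:.1f}%, "
        f"additive {additive:.1f}%, multiplicative {multiplicative:.1f}%, "
        f"excess over additive {t['both'] - additive:+.1f} pp"
    )
```

For HighBP x HighChol the observed value sits above the additive expectation but below the multiplicative one, so whether the pattern counts as "synergy" depends on the scale chosen.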

#### **Most Dangerous Combinations:**

From the combined interaction data, the most dangerous combination is:
- **Smoker + Inactive + Both HighBP & HighChol:** Shows very high diabetes prevalence
- This represents individuals with all four risk factors combined
- Prevalence **rises further** when lifestyle and health condition risk factors are combined

#### **Protective Effects of Physical Activity:**

Comparing across smoking status:
- Among non-smokers: Active 10.0% vs Inactive 19.9% → **49.7% relative reduction**
- Among smokers: Active 13.9% vs Inactive 22.3% → **37.7% relative reduction**

**Physical activity is associated with much lower prevalence among smokers**, but the gap with non-smokers remains (active smokers still have 39% higher prevalence than active non-smokers).
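The relative figures quoted above follow from one formula, (comparison - reference) / reference. A minimal sketch, assuming the prevalences from the Smoker × PhysActivity table; the helper `relative_change` and the dictionary layout are hypothetical names chosen for illustration.

```python
def relative_change(reference, comparison):
    """Percent change of `comparison` relative to `reference`."""
    return (comparison - reference) / reference * 100


# Prevalences (%) from the Smoker x PhysActivity table.
prevalence = {
    ("non-smoker", "active"): 10.0,
    ("non-smoker", "inactive"): 19.9,
    ("smoker", "active"): 13.9,
    ("smoker", "inactive"): 22.3,
}

comparisons = {
    "active vs inactive (non-smokers)": relative_change(
        prevalence[("non-smoker", "inactive")], prevalence[("non-smoker", "active")]
    ),
    "active vs inactive (smokers)": relative_change(
        prevalence[("smoker", "inactive")], prevalence[("smoker", "active")]
    ),
    "smoker vs non-smoker (active)": relative_change(
        prevalence[("non-smoker", "active")], prevalence[("smoker", "active")]
    ),
}

for label, value in comparisons.items():
    print(f"{label}: {value:+.1f}%")
```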

#### **Statistical Significance:**

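**Chi-square test results:**
- **Smoker × PhysActivity groups:** χ² = 3309.47, p < 0.0001 (highly significant)
- **HighBP × HighChol groups:** χ² = 15891.62, p < 0.0001 (highly significant)

Both tests show **extremely strong statistical significance** (p-values essentially zero), so the prevalence differences between groups are very unlikely to be chance findings. Two cautions apply: with a sample of this size even small differences reach significance, so the effect sizes above matter more than the p-values; and a chi-square test of association shows that prevalence differs across groups, not that the combined effect departs from additivity.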
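For reference, one common way to run such a test is `scipy.stats.chi2_contingency` on a groups-by-outcome table. The sketch below is not the computation behind the statistics reported above: the counts are hypothetical (1,000 people per group at the Smoker × PhysActivity prevalences), so the resulting χ² is far smaller than the reported one.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: the four Smoker x PhysActivity groups; columns: [no diabetes, diabetes].
# Counts are hypothetical: 1,000 people per group at the reported prevalences.
observed = np.array([
    [900, 100],  # non-smoker + active
    [801, 199],  # non-smoker + inactive
    [861, 139],  # smoker + active
    [777, 223],  # smoker + inactive
])

# Test of independence between group membership and diabetes status.
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
```

A significant result here only says that prevalence is not the same in all four groups; it would be significant even if the two factors combined in a purely additive way.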

#### **Practical Implications:**

**For High-Risk Individuals:**
1. **If you have both HighBP and HighChol:** Extremely high priority for diabetes screening (29.7% prevalence in this group) - aggressive management of both conditions needed

2. **If you smoke and are inactive:** Groups that differ on either factor show lower prevalence:
   - Active instead of inactive (among smokers): 22.3% → 13.9% (38% lower)
   - Non-smoker instead of smoker (among inactive people): 22.3% → 19.9% (11% lower)
   - **Physical activity shows the larger gap** for diabetes prevalence specifically; because the data are cross-sectional, these are differences between groups, not measured effects of changing behavior

3. **If you have one health condition:** Preventing the second condition is a priority, since prevalence with both conditions exceeds the additive expectation

**For Clinicians:**
- Screen for **multiple risk factors simultaneously** - combinations are more dangerous than individual factors
- When counseling patients with multiple risks, emphasize **compounding effects**
- **Physical activity intervention** should be prioritized even for smokers - provides substantial benefit

**For Public Health:**
- Programs targeting **multiple lifestyle changes** (diet + exercise + smoking cessation) likely more effective than single-factor interventions
- **Cardiovascular health** (managing BP & cholesterol) is closely tied to diabetes prevalence in this data and deserves emphasis in diabetes prevention
- Target highest-risk subgroups (smokers + inactive + CVD conditions) for intensive intervention

#### **Limitations:**

**Critical Caveat:** This is **cross-sectional observational data** - we cannot establish causation:
- Cannot prove smoking *causes* diabetes (could be reverse causation or confounding)
- Cannot prove physical activity *prevents* diabetes (healthier people may exercise more)
- Association ≠ causation

**Other limitations:**
- Binary variables lose nuance (how much exercise? how many cigarettes?)
- Self-reported data subject to recall bias
- Analysis limited to 2-way interactions (3-way+ could reveal more)
- Doesn't account for duration of risk factors (smoking for 1 year vs 20 years)
- Missing potential confounders (diet quality, genetics, medications)

**Proper interpretation:** These results show *associations* and *risk patterns* useful for identifying high-risk groups, but randomized controlled trials would be needed to prove causal effects.

## 5. Project Summary and Reflections

### 5.1 Key Findings

1. **BMI, Age, and HighBP are the top 3 predictors of diabetes risk** (Question 1)
   - BMI shows the strongest importance (0.183) in Random Forest model, confirming obesity as primary modifiable risk factor
   - Age ranks second (0.123), reflecting diabetes as age-related chronic disease
   - High blood pressure shows both high correlation (0.26) and high importance across models

2. **Logistic Regression achieves best performance with ROC-AUC of 0.819** (Question 1)
   - Substantially outperforms Decision Tree (AUC 0.600) and slightly outperforms Random Forest (AUC 0.796)
   - However, all models suffer from low recall (15-33%) due to class imbalance
   - Suggests that linear relationships capture much of the signal for this classification task

3. **Health conditions show strong synergistic interaction effects** (Question 2)
   - High BP + High Cholesterol together create 29.7% diabetes prevalence
   - Expected additive effect: 22.9%, Observed: 29.7% (+6.8 pp beyond additive)
   - **Observed prevalence is about 30% higher than the additive expectation (6.8 / 22.9), a more-than-additive pattern on the absolute scale**

4. **Physical activity provides substantial protection even among smokers** (Question 2)
   - Among smokers: active 13.9% vs inactive 22.3% prevalence (38% relative reduction)
   - Among non-smokers: active 10.0% vs inactive 19.9% (50% relative reduction)
   - The prevalence gap for physical inactivity (+9.5 pp) is about 2.3x the gap for smoking (+4.2 pp)

5. **Most surprising: Lifestyle factors show weaker individual importance than expected** (Question 1)
   - Physical activity ranked #14-15 in feature importance (despite clear protective effect in Question 2)
   - Smoking ranked #9-10 in importance
   - However, their **interaction effects** (Question 2) are highly significant statistically

**Most Critical Insight:** Diabetes risk is best understood through **multiple interacting factors** rather than single predictors - the combination of obesity (BMI), cardiovascular conditions (HighBP + HighChol), and lifestyle factors (physical inactivity) is associated with the highest prevalence, and the HighBP + HighChol combination exceeds what additive effects would predict.

### 5.2 Limitations

#### **Dataset Limitations:**

1. **Severe class imbalance (86% non-diabetic vs 14% diabetic)**
   - Causes models to achieve high accuracy by simply predicting majority class
   - Results in low recall (15-33%) - models miss most diabetic cases
   - Partially mitigated by stratified sampling and using ROC-AUC metric, but still affects model utility

2. **Cross-sectional data from single time point (2015 only)**
   - Cannot establish causation - only associations and correlations
   - Cannot analyze trends over time or disease progression
   - Results may not reflect current patterns (the survey data are from 2015)

3. **Self-reported survey data inherently biased**
   - Recall bias: participants may not accurately remember behaviors
   - Social desirability bias: may under-report unhealthy behaviors (smoking, alcohol)
   - Measurement error in subjective assessments (GenHlth, MentHlth)

4. **Binary diabetes variable lacks granularity**
   - Combines prediabetes and diabetes (very different conditions)
   - Doesn't distinguish Type 1 vs Type 2 diabetes (completely different etiologies)
   - No information on diabetes duration or severity

5. **Missing important risk factors**
   - No genetic/family history data (strong diabetes predictor)
   - No detailed diet information beyond fruit/vegetable binary
   - No medication use data (confounds health condition measures)
   - No waist circumference (better obesity measure than BMI alone)
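The first limitation above (high accuracy from simply predicting the majority class) is easy to demonstrate. A minimal sketch, assuming synthetic labels with the same 86/14 split rather than the real dataset; `DummyClassifier` is scikit-learn's built-in baseline that ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with the same 86/14 split as the dataset (1 = diabetic).
y = np.array([0] * 86 + [1] * 14)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# Always predict the most frequent class (non-diabetic).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
recall = recall_score(y, y_pred)
print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The baseline reaches 86% accuracy while identifying no diabetic cases at all, which is why recall and ROC-AUC are more informative than accuracy here.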

#### **Analysis Limitations:**

1. **Feature importance shows correlation, not causation**
   - High BMI importance doesn't prove BMI *causes* diabetes
   - Could have reverse causation (diabetes causes weight gain) or confounding
   - Observational analysis cannot replace randomized controlled trials

2. **Interaction analysis limited to 2-way interactions**
   - Real world likely involves 3-way+ interactions (e.g., BMI × PhysActivity × Age)
   - Did not explore continuous variable interactions (e.g., BMI value × Age value)
   - Binary grouping loses information about dose-response relationships

3. **No hyperparameter tuning or model optimization**
   - Used default sklearn parameters for all models
   - Did not apply techniques to handle class imbalance (SMOTE, class weights, threshold tuning)
   - No cross-validation beyond simple train/test split
   - Likely not achieving best possible model performance

4. **Some confounding variables not controlled**
   - Interaction analysis doesn't control for age, BMI, or other factors
   - Differences between groups could be due to unmeasured variables
   - More sophisticated methods (regression with interaction terms) could isolate effects better

#### **Scope Limitations:**

1. **Did not analyze temporal trends** - single time point limits understanding of evolving risk patterns

2. **Did not segment analysis by demographics:**
   - Interactions might differ by age group (e.g., smoking worse for younger people)
   - Gender differences not explored (diabetes affects men/women differently)
   - No regional or ethnic subgroup analysis

3. **Did not explore non-linear relationships in depth:**
   - BMI likely has threshold effects (risk jumps at certain BMI levels)
   - Age effects probably non-linear (risk accelerates after 50)
   - Simple models may miss these patterns

4. **No external validation:**
   - Results not tested on independent dataset
   - May not generalize to other populations or countries
   - Findings specific to U.S. population in 2015

### 5.3 Future Directions (If We Had More Time)

#### **Advanced Modeling:**
- **Apply techniques for class imbalance:**
  - SMOTE (Synthetic Minority Over-sampling) to balance classes
  - Class weights to penalize misclassifying diabetic cases more
  - Threshold tuning to optimize for recall rather than accuracy
  - Hoped-for improvement (a target, not a tested result): Recall 15-33% → 50-60% while maintaining reasonable precision

- **Hyperparameter optimization:**
  - Grid search or random search for Random Forest (n_estimators, max_depth, min_samples_split)
  - Regularization tuning for Logistic Regression (C parameter)
  - Hoped-for improvement (a target, not a tested result): ROC-AUC 0.82 → 0.85-0.87

- **Try advanced models:**
  - Gradient Boosting (XGBoost, LightGBM) - often best for tabular data
  - Neural networks - can capture complex non-linear patterns
  - Ensemble stacking - combine predictions from multiple models
  - Hoped-for improvement (a target, not a tested result): ROC-AUC 0.82 → 0.87-0.90
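Two of the imbalance techniques listed above, class weights and threshold tuning, need nothing beyond scikit-learn. The sketch below is a starting point under stated assumptions: it uses synthetic data from `make_classification` with a similar 86/14 split instead of the real features, and the 0.3 threshold is a hypothetical value that would normally be chosen on a validation set. SMOTE is not shown because it needs the separate `imbalanced-learn` package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features, with a similar 86/14 class split.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.86, 0.14], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Option 1: default model with the default 0.5 decision threshold.
default_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
recall_default = recall_score(y_test, default_model.predict(X_test))

# Option 2: weight classes inversely to their frequency.
weighted_model = LogisticRegression(max_iter=1000, class_weight="balanced")
weighted_model.fit(X_train, y_train)
recall_weighted = recall_score(y_test, weighted_model.predict(X_test))

# Option 3: keep the default model but lower the decision threshold.
threshold = 0.3  # hypothetical value; tune on a validation set in practice
proba = default_model.predict_proba(X_test)[:, 1]
recall_threshold = recall_score(y_test, (proba >= threshold).astype(int))

print(f"recall (default):        {recall_default:.2f}")
print(f"recall (class weights):  {recall_weighted:.2f}")
print(f"recall (threshold 0.3):  {recall_threshold:.2f}")
```

Both adjustments trade precision for recall, so precision should be reported alongside any recall gain.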

#### **Deeper Interaction Analysis:**
- **3-way and 4-way interactions:**
  - BMI × PhysActivity × Age (does exercise help more for obese individuals?)
  - Smoker × PhysActivity × HighBP × HighChol (4-way full interaction)
  - Use log-linear models or decision trees to identify highest-order interactions

- **Regression with interaction terms:**
  - Logistic regression with explicit Smoker*PhysActivity interaction term
  - Allows isolating pure interaction effect while controlling for confounders
  - Can test statistical significance of interaction beyond main effects

- **Stratified analysis:**
  - Analyze interactions separately by age groups (<50 vs 50+)
  - By gender (male vs female)
  - By BMI categories (normal vs overweight vs obese)
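As a sketch of the regression approach described above, the `statsmodels` formula interface expands `Smoker * PhysActivity` into both main effects plus their interaction term. The data here are hypothetical: 1,000 people per cell, built to match the Smoker × PhysActivity prevalences reported in Question 2 (PhysActivity coded 1 = active), with no confounders included.

```python
import pandas as pd
import statsmodels.formula.api as smf

# (Smoker, PhysActivity, diabetics per 1,000): hypothetical cell sizes that
# reproduce the prevalences reported in the Question 2 table.
cells = [(0, 1, 100), (0, 0, 199), (1, 1, 139), (1, 0, 223)]

rows = []
for smoker, active, n_pos in cells:
    rows += [{"Smoker": smoker, "PhysActivity": active, "Diabetes_binary": 1}] * n_pos
    rows += [{"Smoker": smoker, "PhysActivity": active, "Diabetes_binary": 0}] * (1000 - n_pos)
df = pd.DataFrame(rows)

# "Smoker * PhysActivity" = Smoker + PhysActivity + Smoker:PhysActivity
model = smf.logit("Diabetes_binary ~ Smoker * PhysActivity", data=df).fit(disp=False)

interaction_coef = model.params["Smoker:PhysActivity"]
interaction_p = model.pvalues["Smoker:PhysActivity"]
print(model.params)
print(f"interaction p-value: {interaction_p:.3f}")

# Fitted prevalence for active smokers (should match the table cell).
fitted_active_smoker = model.predict(
    pd.DataFrame({"Smoker": [1], "PhysActivity": [1]})
).iloc[0]
```

The interaction coefficient is on the log-odds scale, so it tests departure from multiplicative odds rather than from the additive percentage-point expectation used earlier. Covariates such as age or BMI could be added to the formula to adjust for confounding.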

#### **Non-Linear Relationship Exploration:**
- **Generalized Additive Models (GAMs):**
  - Model BMI and age as smooth curves rather than linear effects
  - Identify threshold effects (e.g., risk jumps at BMI>30)
  - Visualize exact shape of relationship

- **Binned analysis:**
  - Divide BMI into deciles and plot prevalence by bin
  - Same for age, PhysHlth, MentHlth
  - Reveals if relationships are linear, exponential, or threshold-based
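The binned analysis could look roughly like the following. This is a sketch on synthetic data (BMI drawn from a normal distribution and a diabetes flag whose probability rises with BMI), so the output says nothing about the real dataset; only the `pd.qcut` plus `groupby` pattern carries over.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: the diabetes probability rises smoothly with BMI.
bmi = rng.normal(loc=28, scale=6, size=5000)
p_diabetes = 1 / (1 + np.exp(-(-4.5 + 0.1 * bmi)))
df = pd.DataFrame({"BMI": bmi, "Diabetes_binary": rng.binomial(1, p_diabetes)})

# Split BMI into deciles; duplicates="drop" guards against repeated bin
# edges, which can occur when values are coarsely recorded with many ties.
df["BMI_decile"] = pd.qcut(df["BMI"], q=10, duplicates="drop")

# Diabetes prevalence (%) within each BMI decile.
prevalence_by_decile = (
    df.groupby("BMI_decile", observed=True)["Diabetes_binary"].mean() * 100
)
print(prevalence_by_decile.round(1))
```

Plotting `prevalence_by_decile` as a line or bar chart would show whether prevalence rises steadily or jumps at particular BMI ranges.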

#### **Temporal and External Validation:**
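- **Use multiple survey years:**
  - BRFSS data available for multiple years - analyze 2010-2020 trends
  - Allows seeing if risk factors are changing over time
  - BRFSS surveys a new sample each year (repeated cross-sections), so survival analysis (time until diabetes onset) would need a separate cohort dataset that follows the same individuals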

- **External validation:**
  - Test models on NHANES dataset (different U.S. survey)
  - Test on international datasets (European health surveys)
  - Assess if findings generalize beyond 2015 CDC sample

### 5.4 Individual Reflections

#### **Challenges & Difficulties Encountered:**

1. **Understanding and handling class imbalance was conceptually challenging**
   - Initially confused why 86% accuracy wasn't good performance
   - Learned that accuracy is misleading for imbalanced data
   - Discovered importance of stratified sampling, ROC-AUC, and precision-recall tradeoffs
   - Realized that imbalanced datasets need different evaluation approaches than balanced ones

2. **Distinguishing additive vs synergistic interaction effects required careful thinking**
   - Initially unclear what "interaction" meant statistically
   - Learned to calculate expected additive effects and compare to observed
   - Understanding when to use chi-square tests vs regression with interaction terms
   - Grasping that correlation between two factors ≠ interaction in affecting outcome

3. **Choosing appropriate visualizations for complex interactions was iterative**
   - First attempts at interaction plots were confusing
   - Tried multiple approaches (grouped bars, heatmaps, line plots) before finding clearest
   - Learned that different visualizations emphasize different patterns
   - Realized importance of clear labels and legends for interpretability

4. **Interpreting different types of feature importance was nuanced**
   - Logistic regression coefficients vs Random Forest importance scores measure different things
   - Coefficients show linear effect holding others constant
   - Importance scores show predictive value (includes non-linear effects and interactions)
   - Both valid but answer slightly different questions

5. **Technical challenges with sklearn and data processing**
   - Dealing with RuntimeWarnings from Logistic Regression (overflow in computation)
   - Understanding when to use `.copy()` to avoid SettingWithCopyWarning
   - Learning optimal parameters for large dataset performance (n_jobs=-1 for parallelization)

#### **Learning & Growth:**

**Technical Skills Gained:**
- Machine learning pipeline: Train/test splitting, model training, evaluation, comparison
- Feature importance extraction: From both coefficient-based (LR) and tree-based (RF, DT) models
- Handling imbalanced data: Stratified sampling, ROC-AUC metric, understanding precision-recall tradeoffs
- Statistical testing: Chi-square tests for categorical interactions, interpreting p-values
- Advanced visualization: Heatmaps, grouped bar charts, interaction plots with matplotlib
- Large dataset optimization: Using efficient pandas operations, parallel processing in sklearn

**Analytical Skills Developed:**
- Interaction effect analysis: Calculating additive vs synergistic effects, interpreting combinations
- Critical evaluation: Identifying when high accuracy is misleading, recognizing model limitations
- Translating findings to practice: Converting statistical results into actionable health recommendations
- Distinguishing correlation from causation: Understanding observational data limits, and when causal claims are appropriate

**Domain Knowledge Acquired:**
- Diabetes risk factors: BMI, age, cardiovascular health as primary predictors
- Public health patterns: Class imbalance reflects real disease prevalence, lifestyle vs health conditions
- Healthcare screening: Why multiple risk factors matter, value of preventive interventions
- Epidemiology basics: Cross-sectional vs longitudinal studies, self-report biases

#### **What Surprised Me Most:**

1. **Physical activity ranked surprisingly low in individual feature importance (#14-15)**
   - Expected it to be top 5 given public health emphasis
   - But showed very strong protective effect in interaction analysis (Question 2)
   - Learned that low individual importance ≠ not important - context matters
   - Revealed importance of looking at problems from multiple angles (ML + statistical analysis)

2. **The magnitude of synergistic effects between HighBP and HighChol**
   - Expected some interaction but not 30% higher than additive (6.8 pp excess)
   - 29.7% prevalence with both conditions is shockingly high
   - Really emphasizes compound cardiovascular risk
   - Made me appreciate why metabolic syndrome is such a major concern

3. **Dataset was completely clean with zero missing values**
   - Extremely rare in real-world health data
   - Health surveys usually contain a noticeable share of missing responses
   - Indicates heavy preprocessing by data authors - good for learning, but less realistic
   - Makes me wonder what imputation methods they used and how that affects results

4. **Logistic Regression outperformed Random Forest**
   - Expected complex tree-based model to win
   - LR achieved 0.819 AUC vs RF 0.796 AUC
   - Suggests diabetes risk has strong linear patterns
   - Learned that simpler models can outperform complex ones (Occam's Razor applies)

5. **How different the top features were between Logistic Regression vs Random Forest**
   - LR: CholCheck, HighBP, HighChol (clinical measures)
   - RF: BMI, Age, Income (demographics + lifestyle)
   - Both are valid - measuring different types of relationships
   - Showed me importance of using multiple model types to get complete picture

#### **How This Project Shaped My Understanding of Data Science:**

**1. Data science is iterative storytelling, not just technical execution**
   - The workflow isn't linear - constantly revisiting earlier steps based on later findings
   - Asking good questions (Section 3) is as important as running models
   - Results need interpretation for non-technical stakeholders - numbers alone aren't enough
   - Visualizations communicate insights faster than tables of metrics

**2. Domain knowledge is critical for meaningful analysis**
   - Understanding diabetes epidemiology helped interpret feature importance patterns
   - Knowing BMI categories explained outlier distributions
   - Medical context necessary to distinguish causation from correlation
   - Best data science happens at intersection of statistics + domain expertise

**3. Limitations matter as much as results**
   - Every analysis has constraints - acknowledging them builds credibility
   - Class imbalance, cross-sectional design, self-report bias all affect conclusions
   - Being transparent about what analysis *cannot* prove is crucial
   - Limitations section might be most important for preventing misuse of findings

**4. Multiple analytical approaches reveal more than single method**
   - Feature importance (ML) + interaction analysis (statistical) gave complementary insights
   - Physical activity appeared less important in Q1 but very important in Q2
   - Using different models (LR, DT, RF) showed which patterns were robust
   - Triangulation across methods builds confidence in findings

**5. Real-world impact requires actionable translation**
   - Statistical significance alone doesn't help patients or doctors
   - Need to translate "HighBP × HighChol interaction: χ²=15891, p<0.0001" into actionable insights
   - Data science value comes from informing decisions, not just producing metrics