# 📋 Notebook 02: Data Preparation

**Objective:** Prepare two versions of the dataset for modeling

**What we'll do:**
1. Load the dataset used in Notebook 01
2. Verify data quality and remove duplicate rows
3. Create Dataset B (Full) - all 21 features
4. Create Dataset A (Clean) - remove potentially leaky features
5. Prepare the scaling strategy
6. Save both datasets

**Why two datasets?**
- Dataset B (Full): Shows maximum predictive power (but might include target leakage)
- Dataset A (Clean): More realistic for preventive screening (removes consequences of diabetes)
- Comparing them shows how much of the model's performance depends on the potentially leaky features

---

## 📦 Step 1: Imports and Setup

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For scaling (we'll prepare the strategy, not fit yet)
from sklearn.preprocessing import StandardScaler

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Imports complete")

## 📊 Step 2: Load Data from Notebook 01

In [None]:
# Load the dataset
df = pd.read_csv('CDC_Diabetes_Dataset.csv')

print(f"Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")

In [None]:
# Comprehensive data quality check
print("=" * 60)
print("DATA QUALITY VERIFICATION")
print("=" * 60)

# 1. Basic checks
print("\n1️⃣ BASIC CHECKS")
print("-" * 60)
print(f"Total rows: {len(df):,}")
print(f"Total columns: {len(df.columns)}")
print(f"Missing values: {df.isnull().sum().sum()}")
print(f"Duplicate rows: {df.duplicated().sum()}")

# 2. Data types check
print("\n2️⃣ DATA TYPES")
print("-" * 60)
print(df.dtypes.value_counts())
print(f"\n⚠️ All columns should be numeric (float64 or int64)")

# 3. Check for any non-numeric values
print("\n3️⃣ NON-NUMERIC VALUES CHECK")
print("-" * 60)
non_numeric_cols = df.select_dtypes(exclude=['number']).columns.tolist()
if non_numeric_cols:
    print(f"⚠️ Non-numeric columns found: {non_numeric_cols}")
else:
    print("✅ All columns are numeric")

In [None]:
# 4. Check value ranges for each feature
print("\n4️⃣ VALUE RANGE VERIFICATION")
print("-" * 60)
print("Checking if values are within expected ranges...\n")

# Expected ranges based on dataset description
expected_ranges = {
    'Diabetes_012': (0, 2),
    'HighBP': (0, 1),
    'HighChol': (0, 1),
    'CholCheck': (0, 1),
    'BMI': (12, 98),  # Reasonable human BMI range
    'Smoker': (0, 1),
    'Stroke': (0, 1),
    'HeartDiseaseorAttack': (0, 1),
    'PhysActivity': (0, 1),
    'Fruits': (0, 1),
    'Veggies': (0, 1),
    'HvyAlcoholConsump': (0, 1),
    'AnyHealthcare': (0, 1),
    'NoDocbcCost': (0, 1),
    'GenHlth': (1, 5),
    'MentHlth': (0, 30),
    'PhysHlth': (0, 30),
    'DiffWalk': (0, 1),
    'Sex': (0, 1),
    'Age': (1, 13),
    'Education': (1, 6),
    'Income': (1, 8)
}

range_issues = []

for col, (min_val, max_val) in expected_ranges.items():
    actual_min = df[col].min()
    actual_max = df[col].max()
    
    if actual_min < min_val or actual_max > max_val:
        range_issues.append(col)
        print(f"⚠️ {col}: Expected [{min_val}-{max_val}], Got [{actual_min}-{actual_max}]")

if not range_issues:
    print("✅ All features are within expected ranges")
else:
    print(f"\n⚠️ Found {len(range_issues)} features with unexpected ranges")

In [None]:
# 5. Check for outliers in continuous features
print("\n5️⃣ OUTLIER DETECTION (Continuous Features)")
print("-" * 60)

continuous_cols = ['BMI', 'MentHlth', 'PhysHlth']

for col in continuous_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    outlier_pct = (len(outliers) / len(df)) * 100
    
    print(f"\n{col}:")
    print(f"  Range: [{df[col].min():.1f} - {df[col].max():.1f}]")
    print(f"  Mean: {df[col].mean():.1f}, Median: {df[col].median():.1f}")
    print(f"  IQR bounds: [{lower_bound:.1f} - {upper_bound:.1f}]")
    print(f"  Outliers: {len(outliers):,} ({outlier_pct:.2f}%)")
    
    if outlier_pct > 5:
        print(f"  ⚠️ High percentage of outliers (>5%)")
    else:
        print(f"  ✅ Outlier percentage acceptable")

In [None]:
# 6. Check for unexpected value distributions in binary features
print("\n6️⃣ BINARY FEATURE DISTRIBUTION CHECK")
print("-" * 60)

binary_cols = ['HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke', 
               'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
               'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 
               'DiffWalk', 'Sex']

print("\nChecking if binary features only contain 0 and 1...\n")

binary_issues = []
for col in binary_cols:
    unique_vals = df[col].unique()
    if not set(unique_vals).issubset({0.0, 1.0}):
        binary_issues.append(col)
        print(f"⚠️ {col}: Contains values other than 0/1: {unique_vals}")

if not binary_issues:
    print("✅ All binary features contain only 0 and 1")
else:
    print(f"\n⚠️ Found {len(binary_issues)} binary features with unexpected values")

In [None]:
# 7. Target variable distribution check
print("\n7️⃣ TARGET VARIABLE CHECK")
print("-" * 60)

target_col = 'Diabetes_012'
target_counts = df[target_col].value_counts().sort_index()

print("\nClass distribution:")
for cls, count in target_counts.items():
    pct = (count / len(df)) * 100
    print(f"  Class {int(cls)}: {count:6,} ({pct:5.2f}%)")

# Calculate imbalance ratio
majority_class = target_counts.max()
minority_class = target_counts.min()
imbalance_ratio = majority_class / minority_class

print(f"\nImbalance ratio: {imbalance_ratio:.1f}:1")
if imbalance_ratio > 10:
    print("⚠️ SEVERE class imbalance detected (>10:1)")
    print("   → We'll need to handle this in modeling phase")
elif imbalance_ratio > 3:
    print("⚠️ Moderate class imbalance detected (>3:1)")
else:
    print("✅ Classes are relatively balanced")
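
One possible way to handle the imbalance later is class weighting. The sketch below is only an illustration on a hypothetical toy target (`y_toy`), not a decision made in this notebook; it uses scikit-learn's `compute_class_weight` with the `'balanced'` heuristic, which weights each class by `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy target: 8 samples of class 0, 1 of class 1, 1 of class 2
y_toy = np.array([0] * 8 + [1] + [2])
classes = np.unique(y_toy)

# 'balanced' gives rare classes proportionally larger weights
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight_map = dict(zip(classes.tolist(), weights.tolist()))

for cls, w in class_weight_map.items():
    print(f"Class {cls}: weight {w:.3f}")
```

A mapping like `class_weight_map` can be passed to estimators that accept a `class_weight` argument; resampling is an alternative that would be evaluated in the modeling notebooks.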

In [None]:
# 8. Final summary
print("\n" + "=" * 60)
print("FINAL DATA QUALITY SUMMARY")
print("=" * 60)

quality_checks = {
    'No missing values': df.isnull().sum().sum() == 0,
    'No duplicates': df.duplicated().sum() == 0,
    'All numeric types': len(non_numeric_cols) == 0,
    'Values in expected ranges': len(range_issues) == 0,
    'Binary features valid': len(binary_issues) == 0,
}

print()
for check, passed in quality_checks.items():
    status = "✅" if passed else "⚠️"
    print(f"{status} {check}")

all_passed = all(quality_checks.values())

print("\n" + "=" * 60)
if all_passed:
    print("✅ ALL DATA QUALITY CHECKS PASSED!")
    print("✅ Dataset is ready for preprocessing and modeling")
else:
    print("⚠️ SOME ISSUES DETECTED - Review above for details")
    print("   (Note: Some issues, such as duplicate rows, are handled in Step 3)")
print("=" * 60)

## 🧹 Step 3: Remove Duplicates FIRST

**Strategy:** Remove duplicates from the original dataset BEFORE creating Dataset A and Dataset B.

**Why this order?**
- Ensures both datasets have the same sample size
- Makes model comparison fair (only difference is features, not samples)
- Cleaner methodology

**Important:** After this step, we work with deduplicated data for ALL subsequent analysis.

In [None]:
print("=" * 60)
print("STEP 3: REMOVE DUPLICATES FROM ORIGINAL DATASET")
print("=" * 60)

# Count duplicates in original data
duplicates_original = df.duplicated().sum()
duplicate_pct = (duplicates_original / len(df)) * 100

print(f"\n📊 BEFORE deduplication:")
print(f"Total rows: {len(df):,}")
print(f"Duplicate rows: {duplicates_original:,} ({duplicate_pct:.2f}%)")
print(f"Unique rows: {len(df) - duplicates_original:,}")

In [None]:
# Remove duplicates and store in NEW variable
df_deduplicated = df.drop_duplicates().copy()

print(f"\n✅ AFTER deduplication:")
print(f"Before: {len(df):,} rows")
print(f"After:  {len(df_deduplicated):,} rows")
print(f"Removed: {len(df) - len(df_deduplicated):,} duplicate rows")

# Verify no duplicates remain
remaining_duplicates = df_deduplicated.duplicated().sum()
print(f"\n✅ Verification: {remaining_duplicates} duplicates remaining (should be 0)")

In [None]:
# Check class balance is preserved
print(f"\n📊 Class distribution AFTER deduplication:")
print("-" * 60)

for cls, count in df_deduplicated['Diabetes_012'].value_counts().sort_index().items():
    pct = (count / len(df_deduplicated)) * 100
    original_pct = (df['Diabetes_012'].value_counts().sort_index()[cls] / len(df)) * 100
    print(f"Class {int(cls)}: {count:6,} ({pct:5.2f}%) - Original: {original_pct:5.2f}%")

print("\nCompare the percentages above: class proportions should stay close to the original after deduplication.")
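
The comparison above is visual. A minimal sketch of a numeric check follows, assuming a hypothetical helper `class_share_shift` and toy data standing in for `df` and `df_deduplicated`; it reports how many percentage points each class's share moved after deduplication.

```python
import pandas as pd

def class_share_shift(before: pd.Series, after: pd.Series) -> pd.Series:
    """Percentage-point change in each class's share (after minus before)."""
    share_before = before.value_counts(normalize=True) * 100
    share_after = after.value_counts(normalize=True) * 100
    return (share_after - share_before).sort_index()

# Hypothetical toy frame; four rows are exact duplicates of (0, 25)
toy = pd.DataFrame({
    "Diabetes_012": [0, 0, 0, 0, 1, 2, 2, 0],
    "BMI":          [25, 25, 25, 30, 28, 33, 35, 25],
})
toy_dedup = toy.drop_duplicates()

shift = class_share_shift(toy["Diabetes_012"], toy_dedup["Diabetes_012"])
print(shift.round(2))
print(f"Largest shift: {shift.abs().max():.2f} percentage points")
```

On the real data, what counts as an acceptable shift is a judgment call; the point is to look at the number before asserting that proportions are preserved.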

In [None]:
# FROM NOW ON, use df_deduplicated for ALL subsequent work
print("\n" + "=" * 60)
print("✅ DEDUPLICATED DATASET READY")
print("=" * 60)
print(f"\nWorking dataset: {len(df_deduplicated):,} rows, {len(df_deduplicated.columns)} columns")
print(f"\n⚠️ IMPORTANT: All subsequent steps will use df_deduplicated")

## 🔍 Step 4: Identify Potentially Leaky Features

**Target Leakage** occurs when a feature is a *consequence* of the target variable, rather than a *cause*.

**Why this matters:**
- If we include leaky features, our model might look great in testing...
- But it won't work for **preventive screening** (before diabetes develops)
- It would only work for **diagnostic confirmation** (after symptoms appear)

**Potentially leaky features in this dataset:**

| Feature | Description | Why It Might Be Leaky |
|---------|-------------|----------------------|
| `DiffWalk` | Difficulty walking or climbing stairs | Often a **consequence** of diabetes (neuropathy, poor circulation) |
| `GenHlth` | Self-reported general health (1-5 scale) | People with diabetes naturally rate their health lower |
| `PhysHlth` | Days of poor physical health (0-30) | Similar to GenHlth - likely consequence of diabetes |

**Our strategy:**
1. Create **Dataset B (Full)** - keep all features (shows maximum predictive power)
2. Create **Dataset A (Clean)** - remove these 3 features (more realistic for prevention)
3. Compare model performance on both
4. Discuss implications in final report
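
As a small illustration of step 2, the sketch below wraps the column removal in a hypothetical helper, `drop_candidate_features`, that refuses to drop the target and fails loudly on misspelled names. The helper and the toy frame are assumptions for illustration only.

```python
import pandas as pd

def drop_candidate_features(df, candidates, target="Diabetes_012"):
    """Return a copy of df without the candidate columns, guarding the target."""
    if target in candidates:
        raise ValueError("The target column must not be dropped")
    missing = [c for c in candidates if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return df.drop(columns=candidates)

# Hypothetical two-row frame with the three candidate features
toy = pd.DataFrame({
    "Diabetes_012": [0, 2],
    "BMI": [24.0, 33.0],
    "DiffWalk": [0, 1],
    "GenHlth": [2, 4],
    "PhysHlth": [0, 15],
})
toy_clean = drop_candidate_features(toy, ["DiffWalk", "GenHlth", "PhysHlth"])
print(list(toy_clean.columns))
```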

In [None]:
# Define features to remove for clean dataset
potentially_leaky_features = ['DiffWalk', 'GenHlth', 'PhysHlth']

print("Potentially leaky features identified:")
for feature in potentially_leaky_features:
    print(f"  - {feature}")

print(f"\nThese will be removed in Dataset A (Clean)")

## 📊 Step 5: Create Dataset B (Full) - All Features

**Dataset B includes all 21 features from the DEDUPLICATED data.**

This represents the "best case scenario" where we have access to all available information.

In [None]:
print("=" * 60)
print("STEP 5: CREATING DATASET B (FULL) - ALL FEATURES")
print("=" * 60)

# CRITICAL: Use df_deduplicated (not df!)
df_full = df_deduplicated.copy()

print(f"\n✅ Created Dataset B from deduplicated data")
print(f"Source: df_deduplicated ({len(df_deduplicated):,} rows)")
print(f"Result: df_full ({len(df_full):,} rows)")

# Verify
assert len(df_full) == len(df_deduplicated), "❌ ERROR: df_full has different size!"
print(f"\n✅ Verification passed: Same size as deduplicated data")

In [None]:
# Separate features and target for Dataset B
X_full = df_full.drop('Diabetes_012', axis=1)
y_full = df_full['Diabetes_012']

print("=== Dataset B (Full) ===")
print(f"Samples: {len(df_full):,}")
print(f"Features shape: {X_full.shape}")
print(f"Target shape: {y_full.shape}")
print(f"\nNumber of features: {X_full.shape[1]}")
print(f"\nFeature list:")
for i, col in enumerate(X_full.columns, 1):
    print(f"  {i:2d}. {col}")

## 🧹 Step 6: Create Dataset A (Clean) - Remove Leaky Features

**Dataset A removes potentially leaky features from the DEDUPLICATED data.**

This represents a more realistic scenario for preventive screening.

In [None]:
print("=" * 60)
print("STEP 6: CREATING DATASET A (CLEAN) - REMOVE LEAKY FEATURES")
print("=" * 60)

print(f"\n🚫 Removing features: {potentially_leaky_features}")

# CRITICAL: Use df_deduplicated (not df!)
df_clean = df_deduplicated.drop(columns=potentially_leaky_features).copy()

print(f"\n✅ Created Dataset A from deduplicated data")
print(f"Source: df_deduplicated ({len(df_deduplicated):,} rows)")
print(f"Result: df_clean ({len(df_clean):,} rows)")

# Verify
assert len(df_clean) == len(df_deduplicated), "❌ ERROR: df_clean has different size!"
print(f"\n✅ Verification passed: Same size as deduplicated data")

In [None]:
# Separate features and target for Dataset A
X_clean = df_clean.drop('Diabetes_012', axis=1)
y_clean = df_clean['Diabetes_012']

print("=== Dataset A (Clean) ===")
print(f"Samples: {len(df_clean):,}")
print(f"Features shape: {X_clean.shape}")
print(f"Target shape: {y_clean.shape}")
print(f"\nNumber of features: {X_clean.shape[1]}")
print(f"\nRemoved features: {potentially_leaky_features}")
print(f"\nRemaining feature list:")
for i, col in enumerate(X_clean.columns, 1):
    print(f"  {i:2d}. {col}")

## ✅ Step 7: Verify Both Datasets Are Identical in Size

In [None]:
print("\n" + "=" * 60)
print("VERIFICATION: BOTH DATASETS SAME SIZE")
print("=" * 60)

print(f"\n📊 Dataset Comparison:")
print(f"Dataset B (Full):  {len(df_full):,} samples, {X_full.shape[1]} features")
print(f"Dataset A (Clean): {len(df_clean):,} samples, {X_clean.shape[1]} features")

# Critical checks
same_size = len(df_full) == len(df_clean)
same_target = y_full.equals(y_clean)

print(f"\n✅ Same sample size? {same_size}")
if not same_size:
    print(f"   ❌ ERROR: Dataset sizes don't match!")
    print(f"   Dataset B: {len(df_full):,}")
    print(f"   Dataset A: {len(df_clean):,}")
    print(f"   Difference: {abs(len(df_full) - len(df_clean)):,}")
    raise AssertionError("Dataset sizes must be identical!")

print(f"✅ Same target values? {same_target}")
if not same_target:
    print(f"   ❌ ERROR: Targets don't match!")
    raise AssertionError("Targets must be identical!")

print(f"\n📊 Target distribution (both datasets):")
print("-" * 60)
for cls, count in y_full.value_counts().sort_index().items():
    pct = (count / len(y_full)) * 100
    print(f"Class {int(cls)}: {count:6,} ({pct:5.2f}%)")

print("\n" + "=" * 60)
print("✅ BOTH DATASETS READY FOR MODELING")
print("=" * 60)
print(f"\nDataset B: {len(df_full):,} samples, {X_full.shape[1]} features (all features)")
print(f"Dataset A: {len(df_clean):,} samples, {X_clean.shape[1]} features (removed 3 leaky features)")
print(f"\n💡 Same sample size ensures fair model comparison!")

## ⚖️ Step 8: Feature Scaling Strategy

**Why do we need scaling?**

Different features have different ranges:
- `BMI`: ranges from 12 to 98
- `Age`: ranges from 1 to 13
- Binary features: only 0 or 1

**Which algorithms need scaling?**
- ✅ **Need scaling:** Logistic Regression, SVM, KNN (distance- or gradient-based, so sensitive to feature magnitude)
- ❌ **Don't need scaling:** Random Forest, Decision Trees, XGBoost (tree-based, split on thresholds)

**Our approach:**
- We'll use `StandardScaler` (mean=0, std=1)
- Apply it ONLY to continuous features: `BMI`, `MentHlth`, `PhysHlth` (if present)
- Leave binary/ordinal features as-is

**IMPORTANT:** We'll fit the scaler later (in training pipeline) to avoid data leakage!
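
A minimal sketch of what "fit the scaler later" could look like, assuming a scikit-learn `Pipeline` with a `ColumnTransformer`. The toy data, the choice of `LogisticRegression`, and the 75/25 split are illustrative assumptions, not the project's final setup. Because the scaler lives inside the pipeline, its mean and standard deviation are learned from the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two continuous and two binary/ordinal columns
rng = np.random.default_rng(42)
n = 200
X_toy = pd.DataFrame({
    "BMI": rng.normal(28, 6, n),
    "MentHlth": rng.integers(0, 31, n).astype(float),
    "HighBP": rng.integers(0, 2, n),
    "Age": rng.integers(1, 14, n),
})
y_toy = rng.integers(0, 2, n)

continuous = ["BMI", "MentHlth"]

# Scale only the continuous columns; pass the rest through unchanged
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous)],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)
model.fit(X_train, y_train)  # scaler statistics come from X_train only

fitted_scaler = model.named_steps["preprocess"].named_transformers_["scale"]
print("Scaler means (train only):", fitted_scaler.mean_)
```

Calling `model.predict(X_test)` then reuses the training statistics, so no information from the test split reaches the scaler.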

In [None]:
# Identify feature types for scaling
print("=== Feature Types for Scaling ===")

# Continuous features that need scaling
continuous_features_full = ['BMI', 'MentHlth', 'PhysHlth']  # For Dataset B
continuous_features_clean = ['BMI', 'MentHlth']             # For Dataset A (PhysHlth removed)

print(f"\nDataset B (Full) - Continuous features to scale:")
for feat in continuous_features_full:
    if feat in X_full.columns:
        print(f"  - {feat}: range [{X_full[feat].min():.0f} - {X_full[feat].max():.0f}]")

print(f"\nDataset A (Clean) - Continuous features to scale:")
for feat in continuous_features_clean:
    if feat in X_clean.columns:
        print(f"  - {feat}: range [{X_clean[feat].min():.0f} - {X_clean[feat].max():.0f}]")

print(f"\n✅ We'll apply StandardScaler to these features in the modeling pipeline")

## 💾 Step 9: Save Prepared Datasets

We'll save both datasets for use in future notebooks.

In [None]:
# Save Dataset B (Full)
df_full.to_csv('dataset_B_full.csv', index=False)
print("✅ Saved: dataset_B_full.csv")
print(f"   Shape: {df_full.shape}")
print(f"   Features: {df_full.shape[1] - 1} (+ 1 target)")

# Save Dataset A (Clean)
df_clean.to_csv('dataset_A_clean.csv', index=False)
print("\n✅ Saved: dataset_A_clean.csv")
print(f"   Shape: {df_clean.shape}")
print(f"   Features: {df_clean.shape[1] - 1} (+ 1 target)")

print("\n" + "="*60)
print("✅ Data preparation complete!")
print("="*60)
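
A quick round-trip check can confirm that a saved file reloads to the same data. The sketch below is an illustration on a hypothetical toy frame written to a temporary directory (the file name `toy_dataset.csv` is made up); the same pattern could be applied to the two CSV files saved above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame with float-coded columns, like the real dataset
toy = pd.DataFrame({
    "Diabetes_012": [0.0, 2.0, 1.0],
    "BMI": [24.0, 31.0, 28.0],
    "HighBP": [0.0, 1.0, 1.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_dataset.csv"
    toy.to_csv(path, index=False)   # same call pattern as the cell above
    reloaded = pd.read_csv(path)

# Raises AssertionError if values, dtypes, or column order differ
pd.testing.assert_frame_equal(toy, reloaded)
print("Round trip OK:", reloaded.shape)
```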

## 📊 Step 10: Summary Comparison

In [None]:
# Create comparison table
comparison_data = {
    'Dataset': ['Dataset B (Full)', 'Dataset A (Clean)'],
    'Total Samples': [len(df_full), len(df_clean)],
    'Features': [X_full.shape[1], X_clean.shape[1]],
    'Removed Features': ['-', ', '.join(potentially_leaky_features)],
    'Use Case': ['Maximum predictive power', 'Realistic preventive screening']
}

comparison_df = pd.DataFrame(comparison_data)
print("\n=== Dataset Comparison ===")
print(comparison_df.to_string(index=False))

---

## 🔍 Critical Analysis: Data Preparation Decisions

### **What We Did:**
1. Created two versions of the dataset
2. Identified potentially leaky features based on clinical reasoning
3. Prepared scaling strategy for distance-based algorithms
4. Kept data in raw form (no derived features yet)

### **Why We Made These Choices:**

#### **1. Two Datasets Approach**
**Rationale:**
- **Dataset B (Full)** allows us to see maximum achievable performance
- **Dataset A (Clean)** ensures our model works for real-world prevention
- Comparing them reveals the impact of potentially leaky features

**Theory (from lectures):**
- "Data Understanding" phase in CRISP-DM requires understanding causal relationships
- Target leakage occurs when features are consequences rather than causes
- Models with leakage may fail in production even with high test accuracy

#### **2. Features Identified as Potentially Leaky**

**`DiffWalk` (Difficulty Walking):**
- **Clinical reasoning:** Diabetic neuropathy causes nerve damage → difficulty walking
- **Risk:** High - this is a known complication of uncontrolled diabetes
- **Decision:** Remove in Dataset A

**`GenHlth` (General Health Rating):**
- **Clinical reasoning:** Self-reported health naturally decreases after diabetes diagnosis
- **Risk:** Medium - could be both cause and consequence
- **Decision:** Remove in Dataset A to be conservative

**`PhysHlth` (Days of Poor Physical Health):**
- **Clinical reasoning:** Similar to GenHlth - likely affected by diabetes symptoms
- **Risk:** Medium - measures consequences of disease
- **Decision:** Remove in Dataset A

#### **3. Why NOT Remove Other Features?**

**`Stroke` and `HeartDiseaseorAttack` - Why we kept them:**
- While diabetes increases cardiovascular risk, these can occur independently
- They represent comorbidities rather than direct consequences
- Removing them might hurt model performance without clear benefit
- If results show issues, we can revisit this decision

#### **4. No Derived Features (Yet)**
**Rationale:**
- Start simple - raw features first
- Tree-based models (Random Forest, XGBoost) can capture non-linearities automatically
- Feature engineering adds complexity - only worth it if baseline results are poor
- Easier to debug and interpret with original features

**Potential future features (if needed):**
- BMI categories (WHO standard: Underweight, Normal, Overweight, Obese)
- Age groups (Young, Middle-age, Senior)
- Interaction terms (e.g., Age × BMI)
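
If derived features become necessary, two of the ideas above could be sketched as follows. This is a hypothetical illustration on toy values: the BMI cut-offs are the standard WHO thresholds (18.5, 25, 30), and the new column names `BMI_category` and `Age_x_BMI` are made up. Age groups are left out because they would require choosing cut-offs on the 1-13 age code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; Age uses the dataset's 1-13 category code
toy = pd.DataFrame({
    "BMI": [17.0, 22.0, 25.0, 29.9, 30.0, 41.0],
    "Age": [1, 5, 8, 9, 11, 13],
})

# WHO categories: <18.5, 18.5 to <25, 25 to <30, >=30 (left-closed bins)
toy["BMI_category"] = pd.cut(
    toy["BMI"],
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)

# Simple multiplicative interaction term
toy["Age_x_BMI"] = toy["Age"] * toy["BMI"]

print(toy)
```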

### **Strengths of Our Approach:**
- ✅ **Transparent:** Clear documentation of which features removed and why
- ✅ **Scientific:** Based on clinical knowledge and causal reasoning
- ✅ **Flexible:** Can easily test both datasets and compare results
- ✅ **Practical:** Dataset A addresses real-world preventive screening use case
- ✅ **Simple:** No premature feature engineering

### **Limitations:**
- ⚠️ **Uncertainty:** We can't be 100% certain which features are truly leaky without domain expert validation
- ⚠️ **Trade-off:** Dataset A might have lower accuracy, but is more ethically sound for prevention
- ⚠️ **Binary decision:** We're either keeping or removing features - no "partial" use
- ⚠️ **Other potential leakage:** Features like `Stroke` or `HeartDiseaseorAttack` might also have some leakage

### **Implications for Model Development:**

**Expected outcomes:**
1. **Dataset B** will likely show higher accuracy (especially if leakage exists)
2. **Dataset A** will show more realistic performance for preventive screening
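3. A large performance difference means the model leans heavily on the removed features, which is consistent with leakage
4. A small performance difference means the removed features add little beyond the remaining ones, so the conservative removal costs little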

**Next steps:**
1. Exploratory analysis to understand feature relationships
2. Clustering to identify risk segments
3. Classification on BOTH datasets
4. Compare results and discuss implications
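
One way the comparison in steps 3 and 4 could be set up is sketched below, on synthetic data. Everything here is an assumption for illustration: the data from `make_classification`, the column names `f0` to `f7`, the choice of Random Forest, and macro-F1 as the metric (picked because accuracy can be misleading on imbalanced classes). The key idea is to reuse the same cross-validation splitter for both feature sets so that only the features differ.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced 3-class data standing in for the real datasets
X_arr, y_syn = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3,
    weights=[0.8, 0.05, 0.15], random_state=0,
)
X_syn = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
dropped = ["f0", "f1", "f2"]  # stand-ins for the three removed features

# Same splitter (fixed seed) for both runs keeps the folds identical
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def macro_f1_scores(X):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X, y_syn, cv=cv, scoring="f1_macro")

scores_full = macro_f1_scores(X_syn)
scores_reduced = macro_f1_scores(X_syn.drop(columns=dropped))

print(f"Full:    {scores_full.mean():.3f} +/- {scores_full.std():.3f}")
print(f"Reduced: {scores_reduced.mean():.3f} +/- {scores_reduced.std():.3f}")
```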

### **Ethical Considerations:**
- Using leaky features in production could lead to **false confidence** in predictions
- Healthcare systems need models that work for **early detection**, not just diagnosis confirmation
- Transparent documentation allows future researchers to make informed decisions
- Our two-dataset approach balances academic rigor with practical applicability

---

## ✅ Summary

**What we accomplished:**
- ✅ Created Dataset B (Full) - 21 features, all information
- ✅ Created Dataset A (Clean) - 18 features, removed potential leakage
- ✅ Prepared scaling strategy for modeling pipeline
- ✅ Saved both datasets for future analysis

**Ready for:**
- 📊 Notebook 03: Exploratory Analysis
- 🔵 Notebook 04: Clustering
- 🎯 Notebook 05: Classification

---