# 📋 Notebook 02: Data Preparation IMPROVED

**Objective:** Prepare two versions of the dataset for modeling

**What we'll do:**
1. Load the clean data from Notebook 01
2. Create Dataset B (Full) - all 21 features
3. Create Dataset A (Clean) - remove potentially leaky features
4. Prepare scaling strategy
5. Save both datasets

**Why two datasets?**
- Dataset B (Full): Shows maximum predictive power (but might include target leakage)
- Dataset A (Clean): More realistic for preventive screening (removes consequences of diabetes)
- Comparing them shows how much predictive performance depends on features that may leak the target.

---

## 📦 Step 1: Imports and Setup

In [26]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For scaling (we'll prepare the strategy, not fit yet)
from sklearn.preprocessing import StandardScaler

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Imports complete")

✅ Imports complete
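
The import comment above notes that the scaling strategy is prepared here but the scaler is not fitted yet. A minimal sketch of the reason, assuming a later train/test split: the mean and standard deviation should be learned from the training rows only and then reused on the test rows. The helper names `fit_standardizer` and `apply_standardizer` and the BMI values are hypothetical. The sketch uses only the standard library; `StandardScaler` follows the same fit-then-transform pattern with a population standard deviation.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn mean and population std from training values only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # Guard against constant columns, which would otherwise divide by zero.
    return mean, (std if std > 0 else 1.0)

def apply_standardizer(values, mean, std):
    """Scale any split with statistics learned from the training split."""
    return [(v - mean) / std for v in values]

# Hypothetical BMI values, for illustration only.
train_bmi = [22.0, 27.0, 31.0, 40.0]
test_bmi = [25.0, 35.0]

mean, std = fit_standardizer(train_bmi)
train_scaled = apply_standardizer(train_bmi, mean, std)
test_scaled = apply_standardizer(test_bmi, mean, std)
```

Fitting on the full dataset before splitting would let test-set statistics influence the training features, which is why the fit is deferred to the modeling stage.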


## 📊 Step 2: Load Data from Notebook 01

In [27]:
# Load the dataset
df = pd.read_csv('C:\\Users\\yaros\\Desktop\\python\\faidm\\individual_project\\diabetes-classification-ml\\data\\CDC Diabetes Dataset.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")

Dataset loaded successfully!
Shape: (253680, 22)

Columns: ['Diabetes_012', 'HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education', 'Income']
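
The absolute Windows path in the loading cell only resolves on one machine. A sketch of a more portable alternative, assuming the notebook lives in a `notebooks/` folder next to `data/` (this layout and the names `PROJECT_ROOT` and `DATA_PATH` are assumptions, not taken from the project):

```python
from pathlib import Path

# Assumed layout: <project>/notebooks/<this notebook> and <project>/data/<csv>.
PROJECT_ROOT = Path.cwd().parent
DATA_PATH = PROJECT_ROOT / "data" / "CDC Diabetes Dataset.csv"

print(f"Expecting dataset at: {DATA_PATH}")
# In the notebook, the path can be passed straight to pandas:
# df = pd.read_csv(DATA_PATH)
```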


In [28]:
# Comprehensive data quality check
print("=" * 60)
print("DATA QUALITY VERIFICATION")
print("=" * 60)

# 1. Basic checks
print("\n1️⃣ BASIC CHECKS")
print("-" * 60)
print(f"Total rows: {len(df):,}")
print(f"Total columns: {len(df.columns)}")
print(f"Missing values: {df.isnull().sum().sum()}")
print(f"Duplicate rows: {df.duplicated().sum()}")

# 2. Data types check
print("\n2️⃣ DATA TYPES")
print("-" * 60)
print(df.dtypes.value_counts())
print(f"\n⚠️ All columns should be numeric (float64 or int64)")

# 3. Check for any non-numeric values
print("\n3️⃣ NON-NUMERIC VALUES CHECK")
print("-" * 60)
non_numeric_cols = df.select_dtypes(exclude=['number']).columns.tolist()
if non_numeric_cols:
    print(f"⚠️ Non-numeric columns found: {non_numeric_cols}")
else:
    print("✅ All columns are numeric")

DATA QUALITY VERIFICATION

1️⃣ BASIC CHECKS
------------------------------------------------------------
Total rows: 253,680
Total columns: 22
Missing values: 0
Duplicate rows: 23899

2️⃣ DATA TYPES
------------------------------------------------------------
float64    22
Name: count, dtype: int64

⚠️ All columns should be numeric (float64 or int64)

3️⃣ NON-NUMERIC VALUES CHECK
------------------------------------------------------------
✅ All columns are numeric


In [29]:
# 4. Check value ranges for each feature
print("\n4️⃣ VALUE RANGE VERIFICATION")
print("-" * 60)
print("Checking if values are within expected ranges...\n")

# Expected ranges based on dataset description
expected_ranges = {
    'Diabetes_012': (0, 2),
    'HighBP': (0, 1),
    'HighChol': (0, 1),
    'CholCheck': (0, 1),
    'BMI': (12, 98),  # Reasonable human BMI range
    'Smoker': (0, 1),
    'Stroke': (0, 1),
    'HeartDiseaseorAttack': (0, 1),
    'PhysActivity': (0, 1),
    'Fruits': (0, 1),
    'Veggies': (0, 1),
    'HvyAlcoholConsump': (0, 1),
    'AnyHealthcare': (0, 1),
    'NoDocbcCost': (0, 1),
    'GenHlth': (1, 5),
    'MentHlth': (0, 30),
    'PhysHlth': (0, 30),
    'DiffWalk': (0, 1),
    'Sex': (0, 1),
    'Age': (1, 13),
    'Education': (1, 6),
    'Income': (1, 8)
}

range_issues = []

for col, (min_val, max_val) in expected_ranges.items():
    actual_min = df[col].min()
    actual_max = df[col].max()
    
    if actual_min < min_val or actual_max > max_val:
        range_issues.append(col)
        print(f"⚠️ {col}: Expected [{min_val}-{max_val}], Got [{actual_min}-{actual_max}]")

if not range_issues:
    print("✅ All features are within expected ranges")
else:
    print(f"\n⚠️ Found {len(range_issues)} features with unexpected ranges")


4️⃣ VALUE RANGE VERIFICATION
------------------------------------------------------------
Checking if values are within expected ranges...

✅ All features are within expected ranges


In [30]:
# 5. Check for outliers in continuous features
print("\n5️⃣ OUTLIER DETECTION (Continuous Features)")
print("-" * 60)

continuous_cols = ['BMI', 'MentHlth', 'PhysHlth']

for col in continuous_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    outlier_pct = (len(outliers) / len(df)) * 100
    
    print(f"\n{col}:")
    print(f"  Range: [{df[col].min():.1f} - {df[col].max():.1f}]")
    print(f"  Mean: {df[col].mean():.1f}, Median: {df[col].median():.1f}")
    print(f"  IQR bounds: [{lower_bound:.1f} - {upper_bound:.1f}]")
    print(f"  Outliers: {len(outliers):,} ({outlier_pct:.2f}%)")
    
    if outlier_pct > 5:
        print(f"  ⚠️ High percentage of outliers (>5%)")
    else:
        print(f"  ✅ Outlier percentage acceptable")


5️⃣ OUTLIER DETECTION (Continuous Features)
------------------------------------------------------------

BMI:
  Range: [12.0 - 98.0]
  Mean: 28.4, Median: 27.0
  IQR bounds: [13.5 - 41.5]
  Outliers: 9,847 (3.88%)
  ✅ Outlier percentage acceptable

MentHlth:
  Range: [0.0 - 30.0]
  Mean: 3.2, Median: 0.0
  IQR bounds: [-3.0 - 5.0]
  Outliers: 36,208 (14.27%)
  ⚠️ High percentage of outliers (>5%)

PhysHlth:
  Range: [0.0 - 30.0]
  Mean: 4.2, Median: 0.0
  IQR bounds: [-4.5 - 7.5]
  Outliers: 40,949 (16.14%)
  ⚠️ High percentage of outliers (>5%)
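
The high outlier share for `MentHlth` and `PhysHlth` follows from their shape rather than from bad data. The median is 0, so the 1.5 × IQR fence sits only a few days above zero and every respondent reporting more days is flagged. A small sketch on made-up day counts shows the effect (the values and the `iqr_upper_fence` helper are illustrative assumptions):

```python
from statistics import quantiles

def iqr_upper_fence(values):
    """Upper Tukey fence: Q3 + 1.5 * IQR (inclusive quartile method)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)

# Made-up zero-inflated sample: most respondents report 0 bad days.
days = [0] * 70 + [1, 2, 2, 3, 3, 5, 5, 7, 10, 10] + [14] * 5 + [30] * 15

fence = iqr_upper_fence(days)
flagged = [d for d in days if d > fence]
zero_share = days.count(0) / len(days)

print(f"Share of zeros: {zero_share:.0%}")
print(f"Upper fence: {fence:.1f} days")
print(f"Flagged as outliers: {len(flagged) / len(days):.0%}")
```

For count-of-days features like these, a flagged value such as 30 is a valid answer, so the IQR rule is better read as a description of skew than as a reason to drop rows.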


In [31]:
# 6. Check for unexpected value distributions in binary features
print("\n6️⃣ BINARY FEATURE DISTRIBUTION CHECK")
print("-" * 60)

binary_cols = ['HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke', 
               'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
               'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 
               'DiffWalk', 'Sex']

print("\nChecking if binary features only contain 0 and 1...\n")

binary_issues = []
for col in binary_cols:
    unique_vals = df[col].unique()
    if not set(unique_vals).issubset({0.0, 1.0}):
        binary_issues.append(col)
        print(f"⚠️ {col}: Contains values other than 0/1: {unique_vals}")

if not binary_issues:
    print("✅ All binary features contain only 0 and 1")
else:
    print(f"\n⚠️ Found {len(binary_issues)} binary features with unexpected values")


6️⃣ BINARY FEATURE DISTRIBUTION CHECK
------------------------------------------------------------

Checking if binary features only contain 0 and 1...

✅ All binary features contain only 0 and 1


In [32]:
# 7. Target variable distribution check
print("\n7️⃣ TARGET VARIABLE CHECK")
print("-" * 60)

target_col = 'Diabetes_012'
target_counts = df[target_col].value_counts().sort_index()

print("\nClass distribution:")
for cls, count in target_counts.items():
    pct = (count / len(df)) * 100
    print(f"  Class {int(cls)}: {count:6,} ({pct:5.2f}%)")

# Calculate imbalance ratio
majority_class = target_counts.max()
minority_class = target_counts.min()
imbalance_ratio = majority_class / minority_class

print(f"\nImbalance ratio: {imbalance_ratio:.1f}:1")
if imbalance_ratio > 10:
    print("⚠️ SEVERE class imbalance detected (>10:1)")
    print("   → We'll need to handle this in modeling phase")
elif imbalance_ratio > 3:
    print("⚠️ Moderate class imbalance detected (>3:1)")
else:
    print("✅ Classes are relatively balanced")


7️⃣ TARGET VARIABLE CHECK
------------------------------------------------------------

Class distribution:
  Class 0: 213,703 (84.24%)
  Class 1:  4,631 ( 1.83%)
  Class 2: 35,346 (13.93%)

Imbalance ratio: 46.1:1
⚠️ SEVERE class imbalance detected (>10:1)
   → We'll need to handle this in modeling phase
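
One common way to handle this imbalance in the modeling phase is to weight classes inversely to their frequency. The sketch below applies the `n_samples / (n_classes * class_count)` heuristic to the counts printed above, which is the formula scikit-learn uses for `class_weight='balanced'`. Whether the later notebooks use class weights, resampling, or another approach is not stated here, so treat this as one option and not as the chosen method.

```python
# Class counts copied from the target distribution output above.
class_counts = {0: 213_703, 1: 4_631, 2: 35_346}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Balanced heuristic: rarer classes receive proportionally larger weights.
class_weights = {
    cls: n_samples / (n_classes * count)
    for cls, count in class_counts.items()
}

for cls, weight in class_weights.items():
    print(f"Class {cls}: weight {weight:.2f}")
```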


In [9]:
# 8. Final summary
print("\n" + "=" * 60)
print("FINAL DATA QUALITY SUMMARY")
print("=" * 60)

quality_checks = {
    'No missing values': df.isnull().sum().sum() == 0,
    'No duplicates': df.duplicated().sum() == 0,
    'All numeric types': len(non_numeric_cols) == 0,
    'Values in expected ranges': len(range_issues) == 0,
    'Binary features valid': len(binary_issues) == 0,
}

print()
for check, passed in quality_checks.items():
    status = "✅" if passed else "⚠️"
    print(f"{status} {check}")

all_passed = all(quality_checks.values())

print("\n" + "=" * 60)
if all_passed:
    print("✅ ALL DATA QUALITY CHECKS PASSED!")
    print("✅ Dataset is ready for preprocessing and modeling")
else:
    print("⚠️ SOME ISSUES DETECTED - Review above for details")
    print("   (Note: Some issues like outliers may be expected)")
print("=" * 60)


FINAL DATA QUALITY SUMMARY

✅ No missing values
⚠️ No duplicates
✅ All numeric types
✅ Values in expected ranges
✅ Binary features valid

⚠️ SOME ISSUES DETECTED - Review above for details
   (Note: Some issues like outliers may be expected)


## 🔍 Step 2.5: INVESTIGATE DUPLICATES (CRITICAL!)

**We found duplicates in the data quality check!**

Before proceeding, we need to understand:
1. How many duplicates are there?
2. What do these duplicates look like?
3. Which diabetes classes do they belong to?
4. Are they TRUE duplicates (same person surveyed twice) or COINCIDENTAL duplicates (different people with identical responses)?
5. Should we remove them or keep them?

**Why this matters:**
- TRUE duplicates = data collection error → MUST remove
- COINCIDENTAL duplicates = different people with same characteristics → Can keep
- This is a survey dataset with 253,680 responses - some identical responses are statistically expected!

In [33]:
# 1. Count duplicates
print("=" * 60)
print("DUPLICATE INVESTIGATION")
print("=" * 60)

num_duplicates = df.duplicated().sum()
num_unique = len(df) - num_duplicates
duplicate_pct = (num_duplicates / len(df)) * 100

print(f"\n📊 DUPLICATE STATISTICS:")
print(f"Total rows: {len(df):,}")
print(f"Unique rows: {num_unique:,}")
print(f"Duplicate rows: {num_duplicates:,} ({duplicate_pct:.2f}%)")

if num_duplicates > 0:
    print(f"\n⚠️ Found {num_duplicates:,} duplicate rows!")
    print(f"   This means {duplicate_pct:.2f}% of the dataset is duplicated")
else:
    print("\n✅ No duplicates found!")

DUPLICATE INVESTIGATION

📊 DUPLICATE STATISTICS:
Total rows: 253,680
Unique rows: 229,781
Duplicate rows: 23,899 (9.42%)

⚠️ Found 23,899 duplicate rows!
   This means 9.42% of the dataset is duplicated


In [34]:
# 2. Examine duplicate rows
if num_duplicates > 0:
    print("\n" + "=" * 60)
    print("EXAMINING DUPLICATE ROWS")
    print("=" * 60)
    
    # Get duplicate rows
    duplicate_rows = df[df.duplicated(keep=False)]  # keep=False shows ALL duplicates
    
    print(f"\nTotal rows involved in duplication: {len(duplicate_rows):,}")
    print(f"(This includes both original and duplicate copies)\n")
    
    # Show a few examples
    print("\n📋 EXAMPLE DUPLICATE ROWS (First 10):")
    print("-" * 60)
    
    # Get first few duplicate groups
    example_duplicates = duplicate_rows.head(10)
    print(example_duplicates)
    
    # Check how many times each duplicate appears
    print("\n🔢 DUPLICATION FREQUENCY:")
    print("-" * 60)
    
    # Count how many times each unique row appears
    duplication_counts = df.groupby(list(df.columns)).size().reset_index(name='count')
    duplication_counts = duplication_counts[duplication_counts['count'] > 1].sort_values('count', ascending=False)
    
    print(f"\nNumber of unique patterns that are duplicated: {len(duplication_counts):,}")
    print(f"\nTop 5 most duplicated patterns:")
    print(duplication_counts[['count']].head())
    
    max_duplicates = duplication_counts['count'].max()
    print(f"\nMost duplicated pattern appears {max_duplicates} times")


EXAMINING DUPLICATE ROWS

Total rows involved in duplication: 35,086
(This includes both original and duplicate copies)


📋 EXAMPLE DUPLICATE ROWS (First 10):
------------------------------------------------------------
     Diabetes_012  HighBP  HighChol  CholCheck   BMI  Smoker  Stroke  \
5             0.0     1.0       1.0        1.0  25.0     1.0     0.0   
25            0.0     0.0       0.0        1.0  32.0     0.0     0.0   
29            0.0     0.0       1.0        1.0  31.0     1.0     0.0   
44            0.0     0.0       1.0        1.0  31.0     1.0     0.0   
52            2.0     1.0       1.0        1.0  27.0     1.0     0.0   
53            0.0     0.0       0.0        1.0  31.0     0.0     0.0   
57            0.0     0.0       1.0        1.0  24.0     1.0     0.0   
70            0.0     1.0       1.0        1.0  27.0     1.0     0.0   
80            0.0     1.0       0.0        1.0  28.0     0.0     0.0   
113           0.0     1.0       0.0        1.0  27.0    

In [13]:
# 3. Check duplicate distribution across diabetes classes
if num_duplicates > 0:
    print("\n" + "=" * 60)
    print("DUPLICATES BY DIABETES CLASS")
    print("=" * 60)
    
    # Get only the duplicate rows (not originals)
    duplicate_only = df[df.duplicated(keep='first')]  # Keep first occurrence, mark rest as duplicates
    
    print("\n📊 Diabetes class distribution in DUPLICATE rows:")
    print("-" * 60)
    
    duplicate_class_counts = duplicate_only['Diabetes_012'].value_counts().sort_index()
    
    for cls, count in duplicate_class_counts.items():
        pct = (count / len(duplicate_only)) * 100
        print(f"Class {int(cls)}: {count:6,} ({pct:5.2f}% of duplicates)")
    
    print("\n📊 Comparison: Overall dataset distribution:")
    print("-" * 60)
    
    overall_class_counts = df['Diabetes_012'].value_counts().sort_index()
    
    for cls, count in overall_class_counts.items():
        pct = (count / len(df)) * 100
        print(f"Class {int(cls)}: {count:6,} ({pct:5.2f}% of total)")
    
    print("\n💡 INTERPRETATION:")
    print("-" * 60)
    print("If duplicate distribution matches overall distribution:")
    print("  → Duplicates are likely COINCIDENTAL (random chance)")
    print("  → Safe to keep them (different people with same responses)\n")
    print("If duplicate distribution is very different:")
    print("  → Might indicate data collection issues")
    print("  → Should remove duplicates")


DUPLICATES BY DIABETES CLASS

📊 Diabetes class distribution in DUPLICATE rows:
------------------------------------------------------------
Class 0: 23,648 (98.95% of duplicates)
Class 1:      2 ( 0.01% of duplicates)
Class 2:    249 ( 1.04% of duplicates)

📊 Comparison: Overall dataset distribution:
------------------------------------------------------------
Class 0: 213,703 (84.24% of total)
Class 1:  4,631 ( 1.83% of total)
Class 2: 35,346 (13.93% of total)

💡 INTERPRETATION:
------------------------------------------------------------
If duplicate distribution matches overall distribution:
  → Duplicates are likely COINCIDENTAL (random chance)
  → Safe to keep them (different people with same responses)

If duplicate distribution is very different:
  → Might indicate data collection issues
  → Should remove duplicates
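
The interpretation above compares the two distributions by eye. A minimal sketch that puts a number on the gap uses total variation distance (half the sum of absolute differences between class shares) on the counts printed above. The `total_variation` helper is illustrative, and the source does not imply any particular threshold:

```python
def total_variation(counts_a, counts_b):
    """Half the L1 distance between two class-share distributions."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    classes = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(c, 0) / total_a - counts_b.get(c, 0) / total_b)
        for c in classes
    )

# Counts copied from the output above.
duplicate_counts = {0: 23_648, 1: 2, 2: 249}
overall_counts = {0: 213_703, 1: 4_631, 2: 35_346}

tvd = total_variation(duplicate_counts, overall_counts)
print(f"Total variation distance: {tvd:.3f}")
```

A value of 0 means identical class shares and 1 means no overlap. The duplicated rows are almost entirely Class 0, so the distance is clearly above zero and the two distributions do not match.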


In [14]:
# 4. Statistical analysis: Are duplicates expected?
if num_duplicates > 0:
    print("\n" + "=" * 60)
    print("STATISTICAL ANALYSIS: ARE DUPLICATES EXPECTED?")
    print("=" * 60)
    
    print("\n🧮 CALCULATING EXPECTED DUPLICATES:")
    print("-" * 60)
    
    # Calculate number of possible unique combinations
    # Most features are binary (0/1), some are categorical
    
    binary_features = 14  # Features with 2 values
    
    # Non-binary features
    non_binary_values = {
        'Diabetes_012': 3,
        'BMI': 87,  # 98-12+1 possible values
        'GenHlth': 5,
        'MentHlth': 31,  # 0-30
        'PhysHlth': 31,  # 0-30
        'Age': 13,
        'Education': 6,
        'Income': 8
    }
    
    # Calculate total possible combinations (rough estimate)
    total_combinations = (2 ** binary_features)
    for feature, values in non_binary_values.items():
        total_combinations *= values
    
    print(f"\nEstimated possible unique combinations: {total_combinations:,.0e}")
    print(f"Actual dataset size: {len(df):,}")
    print(f"Ratio: {len(df) / total_combinations:.2e}")
    
    if len(df) < total_combinations * 0.001:  # Less than 0.1% of possible combinations
        print("\n✅ CONCLUSION: Duplicates are LIKELY COINCIDENTAL")
        print("   → Dataset is much smaller than possible combinations")
        print("   → With 253k responses and limited response options (many binary),")
        print("     it's statistically EXPECTED to have some identical responses")
        print("   → These are different people with the same characteristics")
        print("   → RECOMMENDATION: KEEP duplicates")
    else:
        print("\n⚠️ CONCLUSION: Duplicates might be PROBLEMATIC")
        print("   → Unusually high duplication rate")
        print("   → RECOMMENDATION: REMOVE duplicates")


STATISTICAL ANALYSIS: ARE DUPLICATES EXPECTED?

🧮 CALCULATING EXPECTED DUPLICATES:
------------------------------------------------------------

Estimated possible unique combinations: 1e+13
Actual dataset size: 253,680
Ratio: 1.98e-08

✅ CONCLUSION: Duplicates are LIKELY COINCIDENTAL
   → Dataset is much smaller than possible combinations
   → With 253k responses and limited response options (many binary),
     it's statistically EXPECTED to have some identical responses
   → These are different people with the same characteristics
   → RECOMMENDATION: KEEP duplicates
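
A caveat on the reasoning above: if every combination were equally likely, a sample far smaller than the number of combinations would produce almost no identical rows. Under that uniform assumption, the expected number of identical pairs is roughly `N * (N - 1) / (2 * M)` for `N` rows and `M` combinations. The sketch below evaluates it with the figures from this cell; the uniform model is an assumption made only to obtain a baseline.

```python
# Figures taken from the cell above.
n_rows = 253_680
binary_features = 14
non_binary_levels = [3, 87, 5, 31, 31, 13, 6, 8]

n_combinations = 2 ** binary_features
for levels in non_binary_levels:
    n_combinations *= levels

# Expected number of identical pairs if all combinations were equally likely.
expected_pairs = n_rows * (n_rows - 1) / (2 * n_combinations)

print(f"Possible combinations: {n_combinations:.2e}")
print(f"Expected identical pairs (uniform baseline): {expected_pairs:.4f}")
```

The baseline is well below one pair, while the data contain 23,899 duplicate rows. The duplicates therefore come from responses concentrating on a small number of common profiles, not from the size of the combination space, and this calculation alone cannot separate coincidental duplicates from repeated records.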


In [15]:
# 5. Decision: Remove or keep duplicates?
print("\n" + "=" * 60)
print("FINAL DECISION: HANDLING DUPLICATES")
print("=" * 60)

if num_duplicates > 0:
    print("\n🤔 CONSIDERATIONS:")
    print("-" * 60)
    print("\n✅ REASONS TO KEEP DUPLICATES:")
    print("   1. This is survey data from CDC BRFSS (Behavioral Risk Factor Surveillance System)")
    print("   2. 253,680 different people surveyed")
    print("   3. Many binary questions (limited response options)")
    print("   4. Statistically EXPECTED to have some identical response patterns")
    print("   5. Removing them would artificially reduce sample size")
    print("   6. No evidence these are data entry errors")
    
    print("\n⚠️ REASONS TO REMOVE DUPLICATES:")
    print("   1. Standard data cleaning practice")
    print("   2. Could inflate model performance (same pattern seen multiple times)")
    print("   3. Could cause data leakage in train/test split")
    
    print("\n" + "=" * 60)
    print("📋 OUR STRATEGY:")
    print("=" * 60)
    print("\nWe will create TWO versions to compare:")
    print("\n1️⃣ Keep duplicates (original dataset)")
    print("   → Preserves sample size")
    print("   → More realistic representation of population")
    print("   → This is our PRIMARY approach")
    print("\n2️⃣ Remove duplicates (deduplicated dataset)")
    print("   → Conservative approach")
    print("   → Ensures no duplicate patterns in train/test")
    print("   → Use this if model performance seems suspiciously high")
    
    print("\n💡 We'll train models on BOTH versions and compare results!")
    print("   If results are very similar → duplicates are fine (coincidental)")
    print("   If results differ significantly → duplicates were problematic")
    
    # Create deduplicated version
    df_no_duplicates = df.drop_duplicates()
    print(f"\n✅ Created deduplicated version: {len(df_no_duplicates):,} rows")
    print(f"   (Removed {len(df) - len(df_no_duplicates):,} duplicate rows)")
    
    # Save it for later use
    df_no_duplicates.to_csv('dataset_no_duplicates.csv', index=False)
    print(f"\n💾 Saved as: dataset_no_duplicates.csv")
    print(f"   (We can use this later if needed)")
else:
    print("\n✅ No duplicates found - no action needed!")


FINAL DECISION: HANDLING DUPLICATES

🤔 CONSIDERATIONS:
------------------------------------------------------------

✅ REASONS TO KEEP DUPLICATES:
   1. This is survey data from CDC BRFSS (Behavioral Risk Factor Surveillance System)
   2. 253,680 different people surveyed
   3. Many binary questions (limited response options)
   4. Statistically EXPECTED to have some identical response patterns
   5. Removing them would artificially reduce sample size
   6. No evidence these are data entry errors

⚠️ REASONS TO REMOVE DUPLICATES:
   1. Standard data cleaning practice
   2. Could inflate model performance (same pattern seen multiple times)
   3. Could cause data leakage in train/test split

📋 OUR STRATEGY:

We will create TWO versions to compare:

1️⃣ Keep duplicates (original dataset)
   → Preserves sample size
   → More realistic representation of population
   → This is our PRIMARY approach

2️⃣ Remove duplicates (deduplicated dataset)
   → Conservative a
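
The concern about leakage in the train/test split can be checked directly by counting test rows whose full feature-and-target pattern also appears in the training split. A sketch on made-up rows follows; the tiny dataset, the split ratio, and the `count_test_rows_seen_in_train` helper are all illustrative assumptions.

```python
import random

def count_test_rows_seen_in_train(rows, test_fraction=0.25, seed=0):
    """Shuffle, split, and count test rows identical to some training row."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    train_patterns = set(train)
    return sum(row in train_patterns for row in test), n_test

# Made-up rows: (HighBP, HighChol, BMI, Diabetes_012); one profile repeats often.
rows = [(0, 0, 25, 0)] * 12 + [(1, 1, 31, 2), (1, 0, 28, 0), (0, 1, 40, 2), (1, 1, 27, 1)]

overlap, n_test = count_test_rows_seen_in_train(rows)
print(f"{overlap} of {n_test} test rows also appear in train")

# After dropping exact duplicates, no test row can match a training row.
unique_rows = list(dict.fromkeys(rows))
overlap_dedup, _ = count_test_rows_seen_in_train(unique_rows)
print(f"After deduplication: {overlap_dedup} overlapping test rows")
```

Running the same count on the real split would show how many test rows the model has effectively already seen when duplicates are kept.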

In [16]:
# 6. For this notebook, we'll KEEP duplicates (standard practice for survey data)
print("\n" + "=" * 60)
print("DECISION FOR THIS ANALYSIS:")
print("=" * 60)

if num_duplicates > 0:
    print("\n✅ We will KEEP duplicates in our main analysis")
    print("\nRationale:")
    print("  • This is CDC survey data (BRFSS 2015)")
    print("  • Each row represents a different person's survey response")
    print("  • Identical responses are statistically expected (many binary questions)")
    print("  • Removing them would bias our dataset toward rare response patterns")
    print("  • Standard practice in survey analysis is to keep all responses")
    
    print("\n📊 We'll proceed with the FULL dataset:")
    print(f"   Total rows: {len(df):,}")
    print(f"   Including {num_duplicates:,} rows with identical response patterns")
    
    print("\n🔄 If model performance seems unrealistic, we can:")
    print("   1. Rerun analysis with deduplicated version (already saved)")
    print("   2. Compare results between both versions")
    print("   3. Discuss implications in final report")
else:
    print("\n✅ No duplicates to handle!")

print("\n" + "=" * 60)
print("✅ Duplicate investigation complete!")
print("=" * 60)


DECISION FOR THIS ANALYSIS:

✅ We will KEEP duplicates in our main analysis

Rationale:
  • This is CDC survey data (BRFSS 2015)
  • Each row represents a different person's survey response
  • Identical responses are statistically expected (many binary questions)
  • Removing them would bias our dataset toward rare response patterns
  • Standard practice in survey analysis is to keep all responses

📊 We'll proceed with the FULL dataset:
   Total rows: 253,680
   Including 23,899 rows with identical response patterns

🔄 If model performance seems unrealistic, we can:
   1. Rerun analysis with deduplicated version (already saved)
   2. Compare results between both versions
   3. Discuss implications in final report

✅ Duplicate investigation complete!


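**Correction:** The decision above is superseded. Duplicates are removed before either dataset is built, so every step from here on uses the deduplicated data: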

In [35]:
## 🧹 Step 3: Remove Duplicates FIRST

print("=" * 60)
print("REMOVING DUPLICATES FROM ORIGINAL DATASET")
print("=" * 60)

# Count duplicates in original data
duplicates_original = df.duplicated().sum()
duplicate_pct = (duplicates_original / len(df)) * 100

print(f"\n📊 Original dataset:")
print(f"Total rows: {len(df):,}")
print(f"Duplicate rows: {duplicates_original:,} ({duplicate_pct:.2f}%)")

# Remove duplicates
df_dedup = df.drop_duplicates()

print(f"\n✅ After deduplication:")
print(f"Original: {len(df):,} rows")
print(f"Deduplicated: {len(df_dedup):,} rows")
print(f"Removed: {len(df) - len(df_dedup):,} duplicate rows")

# Report class balance after deduplication
print(f"\n📊 Class distribution after deduplication:")
print("-" * 60)
for cls, count in df_dedup['Diabetes_012'].value_counts().sort_index().items():
    pct = (count / len(df_dedup)) * 100
    print(f"Class {int(cls)}: {count:6,} ({pct:5.2f}%)")

print("\n✅ Class proportions roughly preserved (Class 0 share drops slightly)")

# Update df to use deduplicated version
df = df_dedup

print(f"\n" + "=" * 60)
print(f"✅ Working with deduplicated dataset: {len(df):,} rows")
print("=" * 60)

REMOVING DUPLICATES FROM ORIGINAL DATASET

📊 Original dataset:
Total rows: 253,680
Duplicate rows: 23,899 (9.42%)

✅ After deduplication:
Original: 253,680 rows
Deduplicated: 229,781 rows
Removed: 23,899 duplicate rows

📊 Class distribution after deduplication:
------------------------------------------------------------
Class 0: 190,055 (82.71%)
Class 1:  4,629 ( 2.01%)
Class 2: 35,097 (15.27%)

✅ Class proportions roughly preserved (Class 0 share drops slightly)

✅ Working with deduplicated dataset: 229,781 rows


## Step 4: Now Create Dataset A and Dataset B (Using Deduplicated Data)

In [39]:
## 📊 Step 4: Create Dataset B (Full) - All Features FROM DEDUPLICATED DATA

print("\n" + "=" * 60)
print("CREATING DATASET B (FULL) - ALL FEATURES")
print("=" * 60)

# IMPORTANT: df is already deduplicated from Step 3
# Dataset B: Keep all features
df_full = df.copy()  # df should be deduplicated at this point!

# Separate features and target
X_full = df_full.drop('Diabetes_012', axis=1)
y_full = df_full['Diabetes_012']

print(f"\n=== Dataset B (Full) ===")
print(f"Samples: {len(df_full):,}")
print(f"Features: {X_full.shape[1]}")
print(f"Target: {y_full.shape[0]:,}")
print(f"\nFeature list:")
for i, col in enumerate(X_full.columns, 1):
    print(f"  {i:2d}. {col}")


CREATING DATASET B (FULL) - ALL FEATURES

=== Dataset B (Full) ===
Samples: 229,781
Features: 21
Target: 229,781

Feature list:
   1. HighBP
   2. HighChol
   3. CholCheck
   4. BMI
   5. Smoker
   6. Stroke
   7. HeartDiseaseorAttack
   8. PhysActivity
   9. Fruits
  10. Veggies
  11. HvyAlcoholConsump
  12. AnyHealthcare
  13. NoDocbcCost
  14. GenHlth
  15. MentHlth
  16. PhysHlth
  17. DiffWalk
  18. Sex
  19. Age
  20. Education
  21. Income


In [40]:
## 🧹 Step 5: Create Dataset A (Clean) - Remove Leaky Features FROM DEDUPLICATED DATA

print("\n" + "=" * 60)
print("CREATING DATASET A (CLEAN) - REMOVE LEAKY FEATURES")
print("=" * 60)

# Define features to remove
potentially_leaky_features = ['DiffWalk', 'GenHlth', 'PhysHlth']

print(f"\n🚫 Removing potentially leaky features:")
for feature in potentially_leaky_features:
    print(f"  - {feature}")

# IMPORTANT: Start from df (which is DEDUPLICATED), not original df!
df_clean = df.drop(columns=potentially_leaky_features)  # df is already deduplicated!

# Separate features and target
X_clean = df_clean.drop('Diabetes_012', axis=1)
y_clean = df_clean['Diabetes_012']

print(f"\n=== Dataset A (Clean) ===")
print(f"Samples: {len(df_clean):,}")
print(f"Features: {X_clean.shape[1]}")
print(f"Target: {y_clean.shape[0]:,}")
print(f"\nRemaining feature list:")
for i, col in enumerate(X_clean.columns, 1):
    print(f"  {i:2d}. {col}")


CREATING DATASET A (CLEAN) - REMOVE LEAKY FEATURES

🚫 Removing potentially leaky features:
  - DiffWalk
  - GenHlth
  - PhysHlth

=== Dataset A (Clean) ===
Samples: 229,781
Features: 18
Target: 229,781

Remaining feature list:
   1. HighBP
   2. HighChol
   3. CholCheck
   4. BMI
   5. Smoker
   6. Stroke
   7. HeartDiseaseorAttack
   8. PhysActivity
   9. Fruits
  10. Veggies
  11. HvyAlcoholConsump
  12. AnyHealthcare
  13. NoDocbcCost
  14. MentHlth
  15. Sex
  16. Age
  17. Education
  18. Income


In [41]:
## ✅ Step 5.5: Verify Both Datasets Have Same Sample Size

print("\n" + "=" * 60)
print("VERIFICATION: BOTH DATASETS SAME SIZE")
print("=" * 60)

print(f"\n📊 Dataset Comparison:")
print(f"Dataset B (Full):  {len(df_full):,} samples, {X_full.shape[1]} features")
print(f"Dataset A (Clean): {len(df_clean):,} samples, {X_clean.shape[1]} features")

print(f"\n✅ Same sample size? {len(df_full) == len(df_clean)}")
print(f"✅ Same target? {y_full.equals(y_clean)}")

print(f"\n📊 Target distribution (both datasets):")
print("-" * 60)
for cls, count in y_full.value_counts().sort_index().items():
    pct = (count / len(y_full)) * 100
    print(f"Class {int(cls)}: {count:6,} ({pct:5.2f}%)")

print("\n" + "=" * 60)
print("✅ BOTH DATASETS READY FOR MODELING")
print("=" * 60)
print(f"\nDataset B: {len(df_full):,} samples, 21 features (all features)")
print(f"Dataset A: {len(df_clean):,} samples, 18 features (removed 3 leaky features)")
print(f"\n💡 Same sample size ensures fair model comparison!")



VERIFICATION: BOTH DATASETS SAME SIZE

📊 Dataset Comparison:
Dataset B (Full):  229,781 samples, 21 features
Dataset A (Clean): 229,781 samples, 18 features

✅ Same sample size? True
✅ Same target? True

📊 Target distribution (both datasets):
------------------------------------------------------------
Class 0: 190,055 (82.71%)
Class 1:  4,629 ( 2.01%)
Class 2: 35,097 (15.27%)

✅ BOTH DATASETS READY FOR MODELING

Dataset B: 229,781 samples, 21 features (all features)
Dataset A: 229,781 samples, 18 features (removed 3 leaky features)

💡 Same sample size ensures fair model comparison!


In [43]:
# Emergency fix - verify df is deduplicated
print("🔍 CHECKING df STATUS:")
print(f"df shape: {df.shape}")
print(f"df has duplicates? {df.duplicated().sum()}")

if df.duplicated().sum() > 0:
    print("⚠️ WARNING: df still has duplicates! Deduplicating now...")
    df = df.drop_duplicates()
    print(f"✅ Fixed! df now has {len(df):,} rows")
else:
    print(f"✅ df is already deduplicated ({len(df):,} rows)")

# NOW recreate both datasets from the clean df
df_full = df.copy()
X_full = df_full.drop('Diabetes_012', axis=1)
y_full = df_full['Diabetes_012']

df_clean = df.drop(columns=['DiffWalk', 'GenHlth', 'PhysHlth'])
X_clean = df_clean.drop('Diabetes_012', axis=1)
y_clean = df_clean['Diabetes_012']

print(f"\n✅ FIXED:")
print(f"Dataset B (Full):  {len(df_full):,} samples, {X_full.shape[1]} features")
print(f"Dataset A (Clean): {len(df_clean):,} samples, {X_clean.shape[1]} features")
print(f"Same size? {len(df_full) == len(df_clean)} ✅")


🔍 CHECKING df STATUS:
df shape: (229781, 22)
df has duplicates? 0
✅ df is already deduplicated (229,781 rows)

✅ FIXED:
Dataset B (Full):  229,781 samples, 21 features
Dataset A (Clean): 229,781 samples, 18 features
Same size? True ✅


## 🔍 Step 3: Identify Potentially Leaky Features

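**Target leakage** occurs when a feature carries information that would not be available at prediction time. In this dataset the concern is features that are a *consequence* of diabetes rather than a *cause*.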

**Why this matters:**
- If we include leaky features, our model might look great in testing...
- But it won't work for **preventive screening** (before diabetes develops)
- It would only work for **diagnostic confirmation** (after symptoms appear)

**Potentially leaky features in this dataset:**

| Feature | Description | Why It Might Be Leaky |
|---------|-------------|----------------------|
| `DiffWalk` | Difficulty walking or climbing stairs | Often a **consequence** of diabetes (neuropathy, poor circulation) |
| `GenHlth` | Self-reported general health (1-5 scale) | People already living with diabetes tend to rate their health lower |
| `PhysHlth` | Days of poor physical health (0-30) | Similar to GenHlth - likely consequence of diabetes |

**Our strategy:**
1. Create **Dataset B (Full)** - keep all features (shows maximum predictive power)
2. Create **Dataset A (Clean)** - remove these 3 features (more realistic for prevention)
3. Compare model performance on both
4. Discuss implications in final report

In [17]:
# Define features to remove for clean dataset
potentially_leaky_features = ['DiffWalk', 'GenHlth', 'PhysHlth']

print("Potentially leaky features identified:")
for feature in potentially_leaky_features:
    print(f"  - {feature}")

print(f"\nThese will be removed in Dataset A (Clean)")

Potentially leaky features identified:
  - DiffWalk
  - GenHlth
  - PhysHlth

These will be removed in Dataset A (Clean)


## 📊 Step 4: Create Dataset B (Full) - All Features

**Dataset B includes all 21 features.**

This represents the "best case scenario" where we have access to all available information, even if some features might be consequences of diabetes.

In [18]:
# Dataset B: Keep all features
df_full = df.copy()

# Separate features and target
X_full = df_full.drop('Diabetes_012', axis=1)
y_full = df_full['Diabetes_012']

print("=== Dataset B (Full) ===")
print(f"Features shape: {X_full.shape}")
print(f"Target shape: {y_full.shape}")
print(f"\nNumber of features: {X_full.shape[1]}")
print(f"\nFeature list:")
for i, col in enumerate(X_full.columns, 1):
    print(f"  {i}. {col}")

=== Dataset B (Full) ===
Features shape: (253680, 21)
Target shape: (253680,)

Number of features: 21

Feature list:
  1. HighBP
  2. HighChol
  3. CholCheck
  4. BMI
  5. Smoker
  6. Stroke
  7. HeartDiseaseorAttack
  8. PhysActivity
  9. Fruits
  10. Veggies
  11. HvyAlcoholConsump
  12. AnyHealthcare
  13. NoDocbcCost
  14. GenHlth
  15. MentHlth
  16. PhysHlth
  17. DiffWalk
  18. Sex
  19. Age
  20. Education
  21. Income


## 🧹 Step 5: Create Dataset A (Clean) - Remove Leaky Features

**Dataset A removes potentially leaky features.**

This represents a more realistic scenario for preventive screening where we want to predict diabetes risk BEFORE symptoms appear.

In [19]:
# Dataset A: Remove potentially leaky features
df_clean = df.drop(columns=potentially_leaky_features)

# Separate features and target
X_clean = df_clean.drop('Diabetes_012', axis=1)
y_clean = df_clean['Diabetes_012']

print("=== Dataset A (Clean) ===")
print(f"Features shape: {X_clean.shape}")
print(f"Target shape: {y_clean.shape}")
print(f"\nNumber of features: {X_clean.shape[1]}")
print(f"\nRemoved features: {potentially_leaky_features}")
print(f"\nRemaining feature list:")
for i, col in enumerate(X_clean.columns, 1):
    print(f"  {i}. {col}")

=== Dataset A (Clean) ===
Features shape: (253680, 18)
Target shape: (253680,)

Number of features: 18

Removed features: ['DiffWalk', 'GenHlth', 'PhysHlth']

Remaining feature list:
  1. HighBP
  2. HighChol
  3. CholCheck
  4. BMI
  5. Smoker
  6. Stroke
  7. HeartDiseaseorAttack
  8. PhysActivity
  9. Fruits
  10. Veggies
  11. HvyAlcoholConsump
  12. AnyHealthcare
  13. NoDocbcCost
  14. MentHlth
  15. Sex
  16. Age
  17. Education
  18. Income


In [20]:
# Verify targets are identical
print("=== Target Variable Verification ===")
print(f"Both datasets have same target? {y_full.equals(y_clean)}")
print(f"\nTarget distribution:")
print(y_full.value_counts().sort_index())

=== Target Variable Verification ===
Both datasets have same target? True

Target distribution:
Diabetes_012
0.0    213703
1.0      4631
2.0     35346
Name: count, dtype: int64


In [21]:
# Calculate the count of people with BMI > 60
high_bmi_count = (df['BMI'] > 60).sum()
total_people = len(df)
percentage = (high_bmi_count / total_people) * 100

print(f"=== BMI Analysis ===")
print(f"Number of people with BMI > 60: {high_bmi_count}")
print(f"Percentage of total dataset: {percentage:.2f}%")

# Optional: Show the top 5 highest BMI values to see the extremes
print("\nTop 5 highest BMI values in dataset:")
print(df['BMI'].nlargest(5))

=== BMI Analysis ===
Number of people with BMI > 60: 805
Percentage of total dataset: 0.32%

Top 5 highest BMI values in dataset:
76370    98.0
76394    98.0
76396    98.0
76532    98.0
79478    98.0
Name: BMI, dtype: float64


**Decision:** We remove duplicate rows from both datasets before modeling.

In [25]:
## 🧹 Step 5.5: Remove Duplicates (Final Decision)

print("=" * 60)
print("REMOVING DUPLICATES FROM DATASETS")
print("=" * 60)

# Count duplicates SEPARATELY for each dataset
duplicates_full = df_full.duplicated().sum()
duplicates_clean = df_clean.duplicated().sum()

print(f"\n📊 Duplicates found:")
print(f"Dataset B (Full - 21 features): {duplicates_full:,} duplicate rows")
print(f"Dataset A (Clean - 18 features): {duplicates_clean:,} duplicate rows")

print(f"\n💡 Why different counts?")
print(f"   Dataset A has FEWER features (removed DiffWalk, GenHlth, PhysHlth)")
print(f"   → More rows appear identical when comparing fewer columns")
print(f"   → This is EXPECTED and CORRECT!")

# Remove duplicates from EACH dataset independently
df_full_dedup = df_full.drop_duplicates()
df_clean_dedup = df_clean.drop_duplicates()

print(f"\n✅ After deduplication:")
print(f"Dataset B (Full):  {len(df_full):,} → {len(df_full_dedup):,} rows (removed {len(df_full) - len(df_full_dedup):,})")
print(f"Dataset A (Clean): {len(df_clean):,} → {len(df_clean_dedup):,} rows (removed {len(df_clean) - len(df_clean_dedup):,})")

print(f"\n⚠️ IMPORTANT NOTE:")
print(f"   Dataset A and B now have DIFFERENT sample sizes!")
print(f"   This is because removing features created more duplicates in Dataset A")

# Update X and y for both datasets
X_full = df_full_dedup.drop('Diabetes_012', axis=1)
y_full = df_full_dedup['Diabetes_012']

X_clean = df_clean_dedup.drop('Diabetes_012', axis=1)
y_clean = df_clean_dedup['Diabetes_012']

print(f"\n📊 Updated shapes:")
print(f"Dataset B - Features: {X_full.shape}, Target: {y_full.shape}")
print(f"Dataset A - Features: {X_clean.shape}, Target: {y_clean.shape}")


REMOVING DUPLICATES FROM DATASETS

📊 Duplicates found:
Dataset B (Full - 21 features): 23,899 duplicate rows
Dataset A (Clean - 18 features): 52,235 duplicate rows

💡 Why different counts?
   Dataset A has FEWER features (removed DiffWalk, GenHlth, PhysHlth)
   → More rows appear identical when comparing fewer columns
   → This is EXPECTED and CORRECT!

✅ After deduplication:
Dataset B (Full):  253,680 → 229,781 rows (removed 23,899)
Dataset A (Clean): 253,680 → 201,445 rows (removed 52,235)

⚠️ IMPORTANT NOTE:
   Dataset A and B now have DIFFERENT sample sizes!
   This is because removing features created more duplicates in Dataset A

📊 Updated shapes:
Dataset B - Features: (229781, 21), Target: (229781,)
Dataset A - Features: (201445, 18), Target: (201445,)
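
The effect is easy to reproduce on a toy frame. In this sketch (made-up values and a hypothetical three-column subset), rows 0 and 1 differ only in `GenHlth`. They are distinct when all columns are compared but become duplicates once that column is dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "Diabetes_012": [0, 0, 2],
    "BMI":          [27, 27, 31],
    "GenHlth":      [2, 3, 4],   # the only column that separates rows 0 and 1
})

# Count duplicate rows with and without the dropped column
dupes_all_columns = int(toy.duplicated().sum())
dupes_fewer_columns = int(toy.drop(columns=["GenHlth"]).duplicated().sum())

print(f"Duplicates with all columns: {dupes_all_columns}")
print(f"Duplicates without GenHlth:  {dupes_fewer_columns}")
```

Because each dataset is deduplicated independently, the two no longer contain exactly the same respondents, which is worth keeping in mind when comparing model scores later.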


In [23]:
# Verify class balance is preserved after deduplication
print("\n" + "=" * 60)
print("VERIFYING CLASS BALANCE AFTER DEDUPLICATION")
print("=" * 60)

print("\n📊 Dataset B (Full) - Target distribution:")
print("-" * 60)
for cls, count in y_full.value_counts().sort_index().items():
    pct = (count / len(y_full)) * 100
    print(f"Class {int(cls)}: {count:6,} ({pct:5.2f}%)")

print("\n📊 Dataset A (Clean) - Target distribution:")
print("-" * 60)
for cls, count in y_clean.value_counts().sort_index().items():
    pct = (count / len(y_clean)) * 100
    print(f"Class {int(cls)}: {count:6,} ({pct:5.2f}%)")

print("\n⚠️ Proportions shift slightly: most removed duplicates belong to class 0")


VERIFYING CLASS BALANCE AFTER DEDUPLICATION

📊 Dataset B (Full) - Target distribution:
------------------------------------------------------------
Class 0: 190,055 (82.71%)
Class 1:  4,629 ( 2.01%)
Class 2: 35,097 (15.27%)

📊 Dataset A (Clean) - Target distribution:
------------------------------------------------------------
Class 0: 162,791 (80.81%)
Class 1:  4,606 ( 2.29%)
Class 2: 34,048 (16.90%)

⚠️ Proportions shift slightly: most removed duplicates belong to class 0
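
To see how much the class mix actually moves, the counts printed above can be set against the pre-deduplication distribution from the target verification cell. A minimal sketch, with column labels that are our own naming:

```python
import pandas as pd

# Class counts copied from the notebook outputs above
counts = pd.DataFrame(
    {
        "original":      [213703, 4631, 35346],
        "B_full_dedup":  [190055, 4629, 35097],
        "A_clean_dedup": [162791, 4606, 34048],
    },
    index=pd.Index([0, 1, 2], name="class"),
)

# Convert counts to percentages within each column
pct = counts / counts.sum() * 100

# Shift in percentage points relative to the original data
shift = pct[["B_full_dedup", "A_clean_dedup"]].sub(pct["original"], axis=0)
print(shift.round(2))
```

The class 0 share falls by roughly 1.5 percentage points in Dataset B and roughly 3.4 in Dataset A, so the imbalance is slightly reduced but the ranking of the classes is unchanged.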


## ⚖️ Step 6: Feature Scaling Strategy

**Why do we need scaling?**

Different features have different ranges:
- `BMI`: ranges from 14 to 98
- `Age`: ordinal age-group code from 1 to 13
- Binary features: only 0 or 1

**Which algorithms need scaling?**
- ✅ **Need scaling:** Logistic Regression (regularization and solver convergence), SVM, KNN (distance-based)
- ❌ **Don't need scaling:** Random Forest, Decision Trees, XGBoost (tree-based)

**Our approach:**
- We'll use `StandardScaler` (mean=0, std=1)
- Apply it ONLY to continuous features: `BMI`, `MentHlth`, `PhysHlth` (if present)
- Leave binary/ordinal features as-is

**IMPORTANT:** We'll fit the scaler later (in training pipeline) to avoid data leakage!
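
The "fit later" rule can be sketched as follows. This is an illustration on random synthetic data standing in for `X_clean`, not the final training pipeline. The split ratio, the seed and the use of `ColumnTransformer` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_clean: values are random, for illustration only
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "BMI":      rng.integers(14, 60, size=200).astype(float),
    "MentHlth": rng.integers(0, 31, size=200).astype(float),
    "HighBP":   rng.integers(0, 2, size=200),
})
y = rng.integers(0, 3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scale only the continuous columns; pass binary/ordinal columns through
scaler = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["BMI", "MentHlth"])],
    remainder="passthrough",
)

# Fit on the training split only, then reuse those statistics on the test split
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Scaled training columns should have a mean of approximately 0
print(X_train_scaled[:, :2].mean(axis=0).round(6))
```

Calling `fit_transform` on the full data before splitting would let test-set statistics influence the scaler, which is the leakage this step is meant to avoid.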

In [None]:
# Identify feature types for scaling
print("=== Feature Types for Scaling ===")

# Continuous features that need scaling
continuous_features_full = ['BMI', 'MentHlth', 'PhysHlth']  # For Dataset B
continuous_features_clean = ['BMI', 'MentHlth']             # For Dataset A (PhysHlth removed)

print(f"\nDataset B (Full) - Continuous features to scale:")
for feat in continuous_features_full:
    if feat in X_full.columns:
        print(f"  - {feat}: range [{X_full[feat].min():.0f} - {X_full[feat].max():.0f}]")

print(f"\nDataset A (Clean) - Continuous features to scale:")
for feat in continuous_features_clean:
    if feat in X_clean.columns:
        print(f"  - {feat}: range [{X_clean[feat].min():.0f} - {X_clean[feat].max():.0f}]")

print(f"\n✅ We'll apply StandardScaler to these features in the modeling pipeline")

## 💾 Step 7: Save Prepared Datasets

We'll save both datasets for use in future notebooks.

In [None]:
# Save Dataset B (Full), using the deduplicated frame from Step 5.5
df_full_dedup.to_csv('dataset_B_full.csv', index=False)
print("✅ Saved: dataset_B_full.csv")
print(f"   Shape: {df_full_dedup.shape}")
print(f"   Features: {df_full_dedup.shape[1] - 1} (+ 1 target)")

# Save Dataset A (Clean), using the deduplicated frame from Step 5.5
df_clean_dedup.to_csv('dataset_A_clean.csv', index=False)
print("\n✅ Saved: dataset_A_clean.csv")
print(f"   Shape: {df_clean_dedup.shape}")
print(f"   Features: {df_clean_dedup.shape[1] - 1} (+ 1 target)")

print("\n" + "="*60)
print("✅ Data preparation complete!")
print("="*60)
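
Before moving on to later notebooks, it can help to confirm that a saved file reads back with the expected shape. The helper below, `save_and_verify`, is a hypothetical addition, demonstrated on a toy frame written to a temporary directory rather than on the real datasets.

```python
import os
import tempfile

import pandas as pd

def save_and_verify(df, path):
    """Write df to CSV, read it back, and confirm the shape survived."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    assert reloaded.shape == df.shape, "Shape changed during round trip"
    return reloaded

# Toy frame standing in for a prepared dataset (made-up values)
toy = pd.DataFrame({"Diabetes_012": [0.0, 2.0], "BMI": [27.0, 31.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_verify(toy, os.path.join(tmp_dir, "toy.csv"))

print(reloaded.shape)
```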

## 📊 Step 8: Summary Comparison

In [None]:
# Create comparison table
comparison_data = {
    'Dataset': ['Dataset B (Full)', 'Dataset A (Clean)'],
    'Total Samples': [len(X_full), len(X_clean)],  # X_* reflect the deduplicated data from Step 5.5
    'Features': [X_full.shape[1], X_clean.shape[1]],
    'Removed Features': ['-', ', '.join(potentially_leaky_features)],
    'Use Case': ['Maximum predictive power', 'Realistic preventive screening']
}

comparison_df = pd.DataFrame(comparison_data)
print("\n=== Dataset Comparison ===")
print(comparison_df.to_string(index=False))

---

## 🔍 Critical Analysis: Data Preparation Decisions

### **What We Did:**
1. Created two versions of the dataset
2. Identified potentially leaky features based on clinical reasoning
3. Prepared scaling strategy for distance-based algorithms
4. Kept data in raw form (no derived features yet)

### **Why We Made These Choices:**

#### **1. Two Datasets Approach**
**Rationale:**
- **Dataset B (Full)** allows us to see maximum achievable performance
- **Dataset A (Clean)** ensures our model works for real-world prevention
- Comparing them reveals the impact of potentially leaky features

**Theory (from lectures):**
- "Data Understanding" phase in CRISP-DM requires understanding causal relationships
- Target leakage occurs when features are consequences rather than causes
- Models with leakage may fail in production even with high test accuracy

#### **2. Features Identified as Potentially Leaky**

**`DiffWalk` (Difficulty Walking):**
- **Clinical reasoning:** Diabetic neuropathy causes nerve damage → difficulty walking
- **Risk:** High - this is a known complication of uncontrolled diabetes
- **Decision:** Remove in Dataset A

**`GenHlth` (General Health Rating):**
- **Clinical reasoning:** Self-reported health naturally decreases after diabetes diagnosis
- **Risk:** Medium - could be both cause and consequence
- **Decision:** Remove in Dataset A to be conservative

**`PhysHlth` (Days of Poor Physical Health):**
- **Clinical reasoning:** Similar to GenHlth - likely affected by diabetes symptoms
- **Risk:** Medium - measures consequences of disease
- **Decision:** Remove in Dataset A

#### **3. Why NOT Remove Other Features?**

**`Stroke` and `HeartDiseaseorAttack` - Why we kept them:**
- While diabetes increases cardiovascular risk, these can occur independently
- They represent comorbidities rather than direct consequences
- Removing them might hurt model performance without clear benefit
- If results show issues, we can revisit this decision

#### **4. No Derived Features (Yet)**
**Rationale:**
- Start simple - raw features first
- Tree-based models (Random Forest, XGBoost) can capture non-linearities automatically
- Feature engineering adds complexity - only worth it if baseline results are poor
- Easier to debug and interpret with original features

**Potential future features (if needed):**
- BMI categories (WHO standard: Underweight, Normal, Overweight, Obese)
- Age groups (Young, Middle-age, Senior)
- Interaction terms (e.g., Age × BMI)
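
If the baseline results do call for it, the BMI categories could be derived with `pd.cut`. A minimal sketch on made-up BMI values, assuming the standard WHO adult cut-offs of 18.5, 25 and 30:

```python
import pandas as pd

# Made-up BMI values for illustration
bmi = pd.Series([17.0, 22.0, 27.0, 35.0], name="BMI")

# WHO adult cut-offs: <18.5, 18.5-24.9, 25-29.9, >=30
bmi_category = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, float("inf")],
    right=False,  # intervals are [lower, upper)
    labels=["Underweight", "Normal", "Overweight", "Obese"],
)

print(bmi_category.tolist())
```

Age groups are left out of this sketch because `Age` is already an ordinal code, and any grouping would depend on how those codes map to age ranges.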

### **Strengths of Our Approach:**
- ✅ **Transparent:** Clear documentation of which features were removed and why
- ✅ **Scientific:** Based on clinical knowledge and causal reasoning
- ✅ **Flexible:** Can easily test both datasets and compare results
- ✅ **Practical:** Dataset A addresses the real-world preventive screening use case
- ✅ **Simple:** No premature feature engineering

### **Limitations:**
- ⚠️ **Uncertainty:** We can't be 100% certain which features are truly leaky without domain expert validation
- ⚠️ **Trade-off:** Dataset A might have lower accuracy, but is more ethically sound for prevention
- ⚠️ **Binary decision:** We're either keeping or removing features - no "partial" use
- ⚠️ **Other potential leakage:** Features like `Stroke` or `HeartDiseaseorAttack` might also have some leakage

### **Implications for Model Development:**

**Expected outcomes:**
1. **Dataset B** will likely show higher accuracy (especially if leakage exists)
2. **Dataset A** will show more realistic performance for preventive screening
3. Large performance difference suggests significant leakage in removed features
4. Small performance difference suggests the removed features add little beyond the remaining ones, so the conservative removal costs little

**Next steps:**
1. Exploratory analysis to understand feature relationships
2. Clustering to identify risk segments
3. Classification on BOTH datasets
4. Compare results and discuss implications

### **Ethical Considerations:**
- Using leaky features in production could lead to **false confidence** in predictions
- Healthcare systems need models that work for **early detection**, not just diagnosis confirmation
- Transparent documentation allows future researchers to make informed decisions
- Our two-dataset approach balances academic rigor with practical applicability

---

## ✅ Summary

**What we accomplished:**
- ✅ Created Dataset B (Full) - 21 features, all information
- ✅ Created Dataset A (Clean) - 18 features, removed potential leakage
- ✅ Prepared scaling strategy for modeling pipeline
- ✅ Saved both datasets for future analysis

**Ready for:**
- 📊 Notebook 03: Exploratory Analysis
- 🔵 Notebook 04: Clustering
- 🎯 Notebook 05: Classification

---