# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 7 - Lab 03: Mini-Project Challenge
**Instructor:** Amir Charkhi | **Type:** Integrated Practice

> Apply everything you learned in Week 7!

## 🎯 Challenge Objectives

Build a complete ML pipeline from scratch:
- Load and explore data
- Engineer features
- Compare multiple models
- Select and evaluate the best one
- Save your model

**Time**: 40-50 minutes  
**Difficulty**: ⭐⭐⭐⭐☆ (Challenge)

---

## 📋 Your Mission

**Scenario**: You're a data scientist at a hospital. You need to predict whether a patient has heart disease based on medical measurements.

**Business Impact**: Early detection can save lives!

**Success Criteria**:
- High **recall** (can't miss patients with disease!)
- Good **precision** (don't want too many false alarms)
- Model ready to save for deployment
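
To make the first two criteria concrete, here is a minimal sketch that computes recall and precision from confusion counts. The counts are hypothetical numbers chosen for illustration, not results from the lab data.

```python
# Hypothetical screening outcome for 100 patients (illustrative numbers only)
tp, fn = 40, 10   # patients with disease: caught vs. missed
fp, tn = 15, 35   # healthy patients: false alarms vs. correctly cleared

recall = tp / (tp + fn)      # share of diseased patients the model catches
precision = tp / (tp + fp)   # share of disease predictions that are correct

print(f"Recall:    {recall:.2f}")
print(f"Precision: {precision:.2f}")
```

Lowering the number of missed patients (`fn`) raises recall, usually at the cost of more false alarms (`fp`) and therefore lower precision.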

---

In [None]:
# Setup - Run this first!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, ConfusionMatrixDisplay
)
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
print("🏥 Heart Disease Prediction Challenge")
print("✅ Setup complete! Let's save some lives!\n")

---

## 📊 Step 1: Load and Explore Data

**Estimated time**: 5 minutes

### Task 1.1: Create the Dataset

In [None]:
# Create synthetic heart disease dataset
np.random.seed(42)
n_patients = 500

# Generate features
data = {
    'age': np.random.randint(30, 80, n_patients),
    'sex': np.random.choice([0, 1], n_patients),  # 0=Female, 1=Male
    'chest_pain': np.random.choice([0, 1, 2, 3], n_patients),
    'resting_bp': np.random.randint(90, 200, n_patients),
    'cholesterol': np.random.randint(120, 400, n_patients),
    'fasting_sugar': np.random.choice([0, 1], n_patients, p=[0.85, 0.15]),
    'max_heart_rate': np.random.randint(70, 200, n_patients),
    'exercise_angina': np.random.choice([0, 1], n_patients, p=[0.7, 0.3]),
    'oldpeak': np.random.uniform(0, 6, n_patients),
}

df = pd.DataFrame(data)

# Create target with realistic patterns
risk_score = (
    (df['age'] > 55) * 0.2 +
    (df['sex'] == 1) * 0.15 +
    (df['chest_pain'] > 0) * 0.25 +
    (df['cholesterol'] > 240) * 0.2 +
    (df['max_heart_rate'] < 120) * 0.15 +
    (df['exercise_angina'] == 1) * 0.3 +
    (df['oldpeak'] > 2) * 0.2
)

df['heart_disease'] = (risk_score > np.random.uniform(0.3, 0.7, n_patients)).astype(int)

# TODO 1.1: Display basic information
# - Shape of dataset
# - First few rows
# - Class distribution
# - Check for missing values

print("Heart Disease Dataset")
print("="*50)
# Your code here:


print("\n✅ Task 1.1 Complete!")
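
If you are unsure where to start, the four checks in the TODO map onto standard pandas calls. The sketch below runs them on a small stand-in frame named `demo` (a hypothetical name, with made-up values); in the lab you would apply the same calls to `df`.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with made-up values; in the lab, use `df` instead
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(30, 80, 20),
    "cholesterol": rng.integers(120, 400, 20),
    "heart_disease": rng.integers(0, 2, 20),
})

print("Shape:", demo.shape)   # (rows, columns)
print(demo.head())            # first five rows
print(demo["heart_disease"].value_counts(normalize=True))  # class shares
print("Missing values per column:")
print(demo.isnull().sum())
```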

### Task 1.2: Quick EDA

In [None]:
# TODO 1.2: Create visualizations to understand the data
# Requirements:
#   1. Plot disease rate by age groups
#   2. Compare cholesterol levels for disease vs no disease
#   3. Any other interesting pattern

# Your code here:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Your visualization


# Plot 2: Your visualization


plt.tight_layout()
plt.show()

print("💡 What patterns do you notice?")
print("✅ Task 1.2 Complete!")
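
One possible way to prepare the numbers behind both plots is sketched below. The frame `demo`, its random values, and the decade-wide bin edges are all assumptions made for illustration; pick whatever grouping suits your data.

```python
import numpy as np
import pandas as pd

# Stand-in data with random values; in the lab, use `df` instead
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "age": rng.integers(30, 80, 200),
    "cholesterol": rng.integers(120, 400, 200),
    "heart_disease": rng.integers(0, 2, 200),
})

# Bin ages into decades: [30, 40), [40, 50), ... (bin edges are an arbitrary choice)
demo["age_group"] = pd.cut(demo["age"], bins=[30, 40, 50, 60, 70, 80], right=False)

# The mean of a 0/1 target within a group is that group's disease rate
rate_by_age = demo.groupby("age_group", observed=False)["heart_disease"].mean()

# Summary statistics of cholesterol for each class
chol_by_class = demo.groupby("heart_disease")["cholesterol"].describe()

print(rate_by_age)
print(chol_by_class[["mean", "50%"]])
```

A series like `rate_by_age` can be drawn with `rate_by_age.plot(kind="bar", ax=axes[0])`, and the cholesterol comparison works well as `sns.boxplot(data=df, x="heart_disease", y="cholesterol", ax=axes[1])`.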

---

## ✂️ Step 2: Prepare Data

**Estimated time**: 5 minutes

### Task 2.1: Feature Engineering (Optional)

In [None]:
# TODO 2.1: Create new features if you want
# Ideas:
#   - Age groups (e.g., is_senior = age > 60)
#   - High cholesterol flag (cholesterol > 240)
#   - Ratio features (e.g., cholesterol / age); the data has no height/weight for BMI
#   - Interaction features

# Your code here (optional):


print("✅ Task 2.1 Complete!")
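
As a rough illustration of the ideas listed in the TODO, the sketch below derives two threshold flags and one interaction feature. The column names `is_senior`, `high_chol`, and `senior_angina` are hypothetical, and the four rows are made up.

```python
import pandas as pd

# Four made-up patients; in the lab, add the columns to `df` instead
demo = pd.DataFrame({
    "age": [45, 62, 71, 38],
    "cholesterol": [210, 255, 300, 180],
    "exercise_angina": [0, 1, 1, 0],
})

# Threshold flags: boolean comparison cast to 0/1
demo["is_senior"] = (demo["age"] > 60).astype(int)
demo["high_chol"] = (demo["cholesterol"] > 240).astype(int)

# Interaction: 1 only when the patient is senior AND has exercise angina
demo["senior_angina"] = demo["is_senior"] * demo["exercise_angina"]

print(demo)
```

Create any new columns before Task 2.2 so they are included in `X` when you split.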

### Task 2.2: Train/Test Split

In [None]:
# TODO 2.2: Split the data
# Requirements:
#   - Separate features (X) and target (y)
#   - 80/20 split
#   - Stratified
#   - random_state=42

# Your code here:
X = # Select all columns except 'heart_disease'
y = # Select 'heart_disease' column

X_train, X_test, y_train, y_test = # Your split

# Validation
print(f"Training set: {len(X_train)} patients")
print(f"Test set: {len(X_test)} patients")
print(f"\nClass balance in training:")
print(y_train.value_counts(normalize=True))

assert len(X_train) + len(X_test) == len(df), "❌ Split error!"
print("\n✅ Task 2.2 Complete!")
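
The four requirements correspond directly to arguments of `train_test_split`. Here is a minimal sketch on a stand-in frame (`demo`, `X_demo`, `y_demo` and the 70/30 class mix are assumptions for illustration) showing that stratifying keeps the class ratio the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: 100 patients, 30% with disease
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "age": rng.integers(30, 80, 100),
    "cholesterol": rng.integers(120, 400, 100),
    "heart_disease": np.repeat([0, 1], [70, 30]),
})

X_demo = demo.drop(columns="heart_disease")  # every column except the target
y_demo = demo["heart_disease"]               # the target column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,       # 80/20 split
    stratify=y_demo,     # keep the class ratio in both parts
    random_state=42,     # reproducible shuffle
)

print(f"Train: {len(X_tr)}, Test: {len(X_te)}")
print(f"Disease rate, train: {y_tr.mean():.2f}, test: {y_te.mean():.2f}")
```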

---

## 🤖 Step 3: Build and Compare Models

**Estimated time**: 10 minutes

### Task 3.1: Compare Multiple Models

In [None]:
# TODO 3.1: Compare at least 3 different models
# Use cross-validation on TRAINING data only
# Evaluate with multiple metrics: accuracy, precision, recall, F1

print("🔬 Model Comparison with Cross-Validation\n")

# Define your models
models = {
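    # Add at least 3 models here
    # Example: 'Logistic Regression': LogisticRegression(max_iter=1000)
    # Tip: Logistic Regression usually benefits from scaled features (StandardScaler is imported above)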
}

# Your code here:
cv = # Create StratifiedKFold
results = []

for name, model in models.items():
    # Calculate multiple metrics
    acc_scores = # accuracy
    prec_scores = # precision
    rec_scores = # recall
    f1_scores = # f1
    
    results.append({
        'Model': name,
        'Accuracy': f"{acc_scores.mean():.3f} ± {acc_scores.std():.3f}",
        'Precision': f"{prec_scores.mean():.3f} ± {prec_scores.std():.3f}",
        'Recall': f"{rec_scores.mean():.3f} ± {rec_scores.std():.3f}",
        'F1': f"{f1_scores.mean():.3f} ± {f1_scores.std():.3f}",
        'Recall_mean': rec_scores.mean()  # For sorting
    })

# Create results DataFrame
results_df = pd.DataFrame(results).sort_values('Recall_mean', ascending=False)
print(results_df[['Model', 'Accuracy', 'Precision', 'Recall', 'F1']].to_string(index=False))

print("\n💡 Remember: For disease detection, RECALL is most important!")
print("   We can't afford to miss patients with heart disease.")
print("\n✅ Task 3.1 Complete!")
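
The skeleton above calls for one cross-validation run per metric. As an alternative, scikit-learn's `cross_validate` accepts several scorers at once. The sketch below uses that swapped-in approach on synthetic data from `make_classification`; the names `demo_models`, `cv_summary`, `X_demo`, `y_demo` and all hyperparameters are assumptions for illustration, not the lab's expected answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=42)

demo_models = {
    # Scaling lives inside the pipeline, so it is re-fit on each training fold
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1"]

cv_summary = {}
for name, model in demo_models.items():
    scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=scoring)
    # cross_validate stores each metric's fold scores under the key "test_<metric>"
    cv_summary[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in cv_summary[name].items()})
```

Putting the scaler inside the pipeline keeps validation-fold statistics out of the scaling step, which matters when you compare models fairly.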

### Task 3.2: Select Best Model

In [None]:
# TODO 3.2: Based on CV results, select the best model
# Consider: What metric is most important for this problem?

# Your decision:
best_model_name = # Name of model you chose
best_model = # Create instance of that model

print(f"🏆 Selected Model: {best_model_name}")
print(f"\n📋 Reasoning:")
# Write your reasoning here (why this model?)

print("\n✅ Task 3.2 Complete!")

---

## 📊 Step 4: Final Evaluation

**Estimated time**: 10 minutes

### Task 4.1: Train and Evaluate Final Model

In [None]:
# TODO 4.1: Train your chosen model on full training set
# Then evaluate on test set (ONCE!)

print("🎯 Final Model Training & Evaluation\n")

# Train on full training set
# Your code here:


# Predict on test set
y_pred = # Your predictions

# Calculate all metrics
accuracy = # Calculate
precision = # Calculate
recall = # Calculate
f1 = # Calculate

# Display results
print("Test Set Performance:")
print("="*50)
print(f"Accuracy:  {accuracy:.3f} ({accuracy:.1%})")
print(f"Precision: {precision:.3f} ({precision:.1%})")
print(f"Recall:    {recall:.3f} ({recall:.1%}) ⭐")
print(f"F1-Score:  {f1:.3f} ({f1:.1%})")

print("\n💡 What this means:")
print(f"  - We catch {recall:.1%} of patients with heart disease")
print(f"  - When we predict disease, we're right {precision:.1%} of the time")

if recall > 0.7:
    print("\n✅ Good recall! We're catching most disease cases.")
else:
    print("\n⚠️ Recall could be higher - consider model tuning")

print("\n✅ Task 4.1 Complete!")
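
The fit, predict, and score pattern for this task looks roughly like the sketch below. It runs on synthetic data, and the choice of a random forest, along with the names `clf`, `pred`, and `demo_metrics`, is an assumption for illustration; use whichever model you selected in Task 3.2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab's train/test split
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_tr, y_tr)        # train on the full training set
pred = clf.predict(X_te)   # predict on the held-out test set, once

# Each metric takes the true labels first, then the predictions
demo_metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
for metric_name, value in demo_metrics.items():
    print(f"{metric_name:>9}: {value:.3f}")
```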

### Task 4.2: Confusion Matrix Analysis

In [None]:
# TODO 4.2: Create and analyze confusion matrix
# Show both raw counts and percentages

# Your code here:
cm = # Calculate confusion matrix
tn, fp, fn, tp = # Extract values

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot confusion matrices (counts and normalized)
# Your plotting code


plt.tight_layout()
plt.show()

# Analyze
print("\n🔍 Analysis:")
print(f"  Correctly identified healthy: {tn}")
print(f"  Correctly identified disease: {tp}")
print(f"  False alarms (predicted disease, was healthy): {fp}")
print(f"  MISSED cases (predicted healthy, had disease): {fn} ⚠️")

if fn < fp:
    print("\n✅ Good! We're prioritizing catching disease cases.")

print("\n✅ Task 4.2 Complete!")
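
For a binary problem with labels 0 and 1, `confusion_matrix` returns the layout `[[tn, fp], [fn, tp]]`, so `ravel()` unpacks the four counts in that order. A small sketch with hand-made labels (the arrays `y_true` and `y_hat` are invented for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Ten hand-made labels: 5 healthy (0) and 5 with disease (1)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_hat)   # rows = actual, columns = predicted
tn, fp, fn, tp = cm.ravel()

# normalize="true" divides each row by its total, giving per-class rates
cm_norm = confusion_matrix(y_true, y_hat, normalize="true")

print(cm)                                      # [[4 1], [1 4]]
print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")   # tn=4, fp=1, fn=1, tp=4
print(cm_norm)                                 # [[0.8 0.2], [0.2 0.8]]
```

For the plots, `ConfusionMatrixDisplay(confusion_matrix=cm).plot(ax=axes[0])` draws a matrix onto an existing axis; the same call with `cm_norm` covers the percentage view.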

---

## 💾 Step 5: Save Your Model

**Estimated time**: 5 minutes

### Task 5.1: Pickle the Model

In [None]:
# TODO 5.1: Save your model with pickle
# Include: model, feature names, performance metrics

# Your code here:
model_package = {
    'model': # Your trained model
    'feature_names': # List of feature names
    'metrics': {
        # Your performance metrics
    },
    'model_name': best_model_name,
    'training_date': pd.Timestamp.now().strftime('%Y-%m-%d')
}

# Save to file
with open('heart_disease_model.pkl', 'wb') as f:
    pickle.dump(model_package, f)

print("💾 Model saved as: heart_disease_model.pkl")
print("\n✅ Task 5.1 Complete!")

### Task 5.2: Test Loading the Model

In [None]:
# TODO 5.2: Load the model and make a test prediction

# Your code here:
with open('heart_disease_model.pkl', 'rb') as f:
    loaded_model = # Load the saved package (a dict; the estimator is under 'model')

# Test with first 5 test samples
test_predictions = # Predict with the estimator taken from the loaded package

print("🔍 Testing loaded model:")
print(f"Predictions: {test_predictions}")
print(f"Actual:      {y_test.iloc[:5].values}")
print("\n✅ Model loads and works correctly!")
print("✅ Task 5.2 Complete!")
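
Because the saved object is a dictionary rather than a bare estimator, loading it takes one extra step: pull the model out of the dict before predicting. The round trip is sketched below on synthetic data. The names `package`, `restored`, the feature names, and the temporary file path are all hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small stand-in model on synthetic data
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Bundle the model with the metadata needed to use it later
package = {
    "model": clf,
    "feature_names": ["f0", "f1", "f2", "f3"],
    "metrics": {"train_accuracy": clf.score(X_demo, y_demo)},
}

# Write to a temporary directory so the sketch leaves the lab's own file alone
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(package, f)

with open(path, "rb") as f:
    restored = pickle.load(f)    # this is the dict, not the estimator

restored_model = restored["model"]   # extract the estimator before predicting
same = bool((restored_model.predict(X_demo[:5]) == clf.predict(X_demo[:5])).all())
print("Round-trip predictions match:", same)
```

Only unpickle files you created or trust, since loading a pickle can execute arbitrary code.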

---

## 📝 Step 6: Write Your Report

**Estimated time**: 5 minutes

### Task 6.1: Summary Report

In [None]:
# TODO 6.1: Complete this summary report

print("🏥 HEART DISEASE PREDICTION - PROJECT SUMMARY")
print("="*60)

print("\n📊 DATASET:")
print(f"  Total patients: {len(df)}")
print(f"  Features used: {len(X.columns)}")
print(f"  Disease prevalence: {(y==1).mean():.1%}")

print("\n🤖 MODEL SELECTION:")
print(f"  Models compared: {len(models)}")
print(f"  Selected model: {best_model_name}")
print(f"  Selection criteria: [YOUR REASONING HERE]")

print("\n📈 PERFORMANCE:")
print(f"  Test Accuracy:  {accuracy:.1%}")
print(f"  Test Precision: {precision:.1%}")
print(f"  Test Recall:    {recall:.1%} ⭐")
print(f"  Test F1-Score:  {f1:.1%}")

print("\n💡 BUSINESS IMPACT:")
print(f"  ✅ Catches {recall:.0%} of disease cases")
print(f"  ✅ {precision:.0%} of positive predictions are correct")
print(f"  ⚠️ Misses {(1-recall)*100:.0f}% of disease cases (false negatives)")
print(f"  ⚠️ {(1-precision)*100:.0f}% of positive predictions are false alarms")

print("\n🎯 RECOMMENDATIONS:")
# TODO: Add your recommendations
# - Should this model be deployed?
# - What could improve performance?
# - What are the risks?

print("\n✅ Task 6.1 Complete!")

---

## 🏆 Challenge Complete!

### What You Accomplished:

✅ **Loaded and explored** medical data  
✅ **Prepared data** with proper train/test split  
✅ **Compared multiple models** using cross-validation  
✅ **Selected best model** based on business needs  
✅ **Evaluated thoroughly** with multiple metrics  
✅ **Saved model** for deployment  
✅ **Documented results** professionally  

### Skills Demonstrated:

1. **End-to-end ML workflow** from data to model
2. **Proper evaluation** using train/test split and CV
3. **Metric selection** based on business context
4. **Model comparison** with stratified CV
5. **Production readiness** with model persistence
6. **Communication** through clear reporting

### Reflection Questions:

1. **Why did you choose your final model?**
2. **Is your test-set recall acceptable for this problem?**
3. **What would you do to improve performance?**
4. **Would you deploy this model? Why or why not?**

### Next Steps:

- Try different hyperparameters
- Add more feature engineering
- Experiment with ensemble methods
- Build a simple Streamlit app (see Notebook 04)

---

## 🎉 Congratulations!

You've successfully completed an end-to-end ML project on synthetic data, following the best practices from Week 7!

**You're now ready to tackle real ML challenges! 🚀**