# 👩‍💻 Using Boosting Models to Predict Heart Disease

## 📋 Overview
In this lab, you are tasked with using boosting models to predict the presence of heart disease. Each heartbeat, each pause, tells a story, and as data scientists, it's our job to listen. The UCI Heart Disease dataset is your patient today, and boosted trees are your diagnostic tools. Through this hands-on activity, you will harness the power of XGBoost and LightGBM to improve model accuracy and gain insights into key predictive features. By the end of this lab, you will be able to explore and preprocess data, train and tune XGBoost and LightGBM models, and compare their performance.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- ✅ Apply XGBoost and LightGBM to classification problems
- ✅ Perform hyperparameter tuning to enhance model performance
- ✅ Evaluate and compare model metrics

## Task 1: Load and Explore Data

**Context:** Understanding the data structure is crucial before jumping into model training.

**Steps:**

1. **Conduct exploratory data analysis (EDA) including:**
    - Displaying summary statistics
    - Checking for missing values
    - Identifying categorical features

In [None]:
# Install the boosting libraries (run once per environment)
!pip install xgboost lightgbm

# Required Imports
import os
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load Data
df = pd.read_csv('heart.csv')
df = df.drop('id', axis=1)

# Conduct EDA: summary stats, missing values, identifying categorical features
# ... your code here

💡 **Tip:** Use `pd.read_csv()` to load the data and `df.describe()` to get summary statistics.

**⚙️ Test Your Work:**
- Display the first 5 rows of the dataset.

**Expected output:** A preview of the data with columns such as 'age', 'sex', 'cp', etc.
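
The three EDA steps can be sketched on a tiny made-up DataFrame rather than on `heart.csv`. The column names and values below are invented for illustration, so treat this as a pattern to adapt, not as the expected answer.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration; not the real heart-disease file)
toy = pd.DataFrame({
    "age": [63, 41, 57, 50],
    "sex": ["Male", "Female", "Male", "Female"],
    "chol": [233.0, np.nan, 354.0, 219.0],
})

# 1. Summary statistics for the numeric columns
print(toy.describe())

# 2. Number of missing values in each column
missing = toy.isnull().sum()
print(missing)

# 3. Non-numeric columns, which are the candidates for encoding later
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)

# First rows, as asked for in "Test Your Work"
print(toy.head())
```

Selecting by `exclude="number"` is one possible way to find categorical columns; checking `df.dtypes` by eye works just as well on a dataset this small.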

## Task 2: Data Preprocessing

**Context:** Ensuring data is clean and in a suitable format for model training.

**Steps:**

1. Handle any missing data appropriately.
2. Encode non-numeric columns using `LabelEncoder`.
3. Standardize the features using `StandardScaler`.

In [None]:
# Task 2: Data Preprocessing

💡 **Tip:** Use `LabelEncoder` for categorical variables and `StandardScaler` for standardization.
    
⚙️ **Test Your Work:**
- Check the transformed dataset to ensure all features are numeric.

**Expected output:** All columns should have numeric types.
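
A minimal sketch of the three preprocessing steps on invented toy data follows. Filling with the median and the most frequent value is an assumption made for the example; other imputation strategies may suit the real dataset better.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data (invented for illustration)
toy = pd.DataFrame({
    "age": [63.0, 41.0, np.nan, 50.0],
    "cp": ["typical", "atypical", "typical", None],
})

# 1. Handle missing data: median for numeric, most frequent value for text
toy["age"] = toy["age"].fillna(toy["age"].median())
toy["cp"] = toy["cp"].fillna(toy["cp"].mode()[0])

# 2. Encode the text column as integer codes (one encoder per column)
toy["cp"] = LabelEncoder().fit_transform(toy["cp"])

# 3. Standardize every feature to zero mean and unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# All columns should now be numeric
print(scaled.dtypes)
```

Wrapping the scaler output back into a DataFrame keeps the column names, which is convenient when feature importances are inspected later.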

## Task 3: Split Data

**Context:** It is essential to split your data to train and test the models effectively.

**Steps:**

1. Divide the dataset into training and testing sets with an 80-20 split.

In [None]:
# Task 3: Split Data

**💡 Tip:** Use `train_test_split` with a `test_size` of 0.2 and a `random_state` for reproducibility.

**⚙️ Test Your Work:**
- Print the shapes of the training and testing sets.

**Expected output:** Shapes that reflect the 80-20 split.
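
As a hedged illustration, the split below uses a synthetic array in place of the heart features. The `stratify=y` argument is an optional addition not required by the task; it keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and binary target (invented for illustration)
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# 80-20 split; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```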

## Task 4: Model Training with XGBoost

**Context:** XGBoost is a gradient boosting library that builds an ensemble of decision trees, with each new tree fitted to correct the errors of the trees before it.

**Steps:**

1. Instantiate an `XGBClassifier` from `xgboost`.
2. Train the model on the training data.
3. Adjust hyperparameters like `n_estimators` and `learning_rate`.

In [None]:
# Task 4: Model Training with XGBoost

**💡 Tip:** Use `fit` to train the model and `predict` to make predictions.

**⚙️ Test Your Work:**
- Print the classification report for XGBoost model predictions.

**Expected output:** Metrics such as accuracy, precision, recall, F1 score.

## Task 5: Model Training with LightGBM

**Context:** LightGBM is another gradient boosting library. It grows trees leaf-wise and bins feature values into histograms, which usually makes training fast.

**Steps:**

1. Instantiate an `LGBMClassifier` from `lightgbm`.
2. Train the model on the training data.
3. Adjust hyperparameters like `max_depth` and `subsample` (in LightGBM, `subsample` only takes effect when `subsample_freq` is greater than 0).

In [None]:
# Task 5: Model Training with LightGBM

**💡 Tip:** Similar to XGBoost, use `fit` to train and `predict` to predict with LightGBM.

**⚙️ Test Your Work:**
- Print the classification report for LightGBM model predictions.

**Expected output:** Metrics similar to those of the XGBoost model.

## Task 6: Evaluate and Compare Models

**Context:** Evaluation helps to understand the strengths and weaknesses of each model.

**Steps:**

1. Compare metrics (accuracy, precision, recall, F1 score) of both models.
2. Analyze feature importances from each model.

In [None]:
# Task 6: Evaluate and Compare Models

**💡 Tip:** Use `feature_importances_` attribute for both models to extract feature importances.

**⚙️ Test Your Work:**
- Display a comparison table of model metrics.

**Expected output:** A clear comparison of both models’ performances.
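
One way to build the comparison table is to compute each metric per model and collect the results in a DataFrame. The labels and predictions below are hypothetical, and macro averaging is an assumption that also works if the target has more than two classes.

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and model predictions (invented for illustration)
y_test = [0, 1, 1, 0, 1, 0, 1, 1]
predictions = {
    "XGBoost": [0, 1, 1, 0, 0, 0, 1, 1],
    "LightGBM": [0, 1, 0, 0, 1, 1, 1, 1],
}

# One row of metrics per model
rows = {}
for name, y_pred in predictions.items():
    rows[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_test, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_test, y_pred, average="macro", zero_division=0),
    }

comparison = pd.DataFrame(rows).T.round(3)
print(comparison)
```

In the lab itself, the dictionary would hold the test-set predictions from the two trained models in place of the hard-coded lists.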

### ✅ Success Checklist

- Successfully loaded and explored the dataset
- Preprocessed the data correctly (handled missing data, encoded categorical variables, normalized data)
- Split the data into training and testing sets
- Trained both XGBoost and LightGBM models
- Evaluated and compared model performances using relevant metrics
- Documented reflections and insights

### 🔍 Common Issues & Solutions

**Problem:** Dataset file not found.  
**Solution:** Ensure the dataset file is in the correct folder.

**Problem:** Categorical encoding errors.   
**Solution:** Double-check the columns being encoded.

**Problem:** Model training errors.   
**Solution:** Verify that data preprocessing steps were correctly applied.
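
For the first problem, a quick check before calling `pd.read_csv()` can make the failure easier to diagnose. The helper name `find_dataset` is hypothetical, and the sketch assumes the file is expected in the notebook's working directory.

```python
from pathlib import Path


def find_dataset(filename="heart.csv", search_dir="."):
    """Return the path to the dataset, or raise a descriptive error."""
    path = Path(search_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"{filename} not found in {Path(search_dir).resolve()}"
        )
    return path


# Report where the file was expected if it is missing
try:
    csv_path = find_dataset("heart.csv")
    print(f"Found {csv_path}")
except FileNotFoundError as err:
    print(err)
```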

### 🔑 Key Points

- Data preprocessing is critical for model performance.
- Hyperparameter tuning can significantly impact the performance of boosted models.
- Evaluating both model metrics and feature importances helps in understanding model behavior.

## 💻 Exemplar Solution

<details>    
<summary><strong>Click HERE to see an exemplar solution</strong></summary>

```python
# Install the boosting libraries (run once per environment)
!pip install xgboost lightgbm

# Required Imports
import os
import pandas as pd
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load Data
df = pd.read_csv('heart.csv')
df = df.drop('id', axis=1)

# Data Exploration
# Display summary statistics, column types, and missing-value counts
print(df.describe())
print(df.dtypes)
print(df.isnull().sum())

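# Data Preprocessing
# Separate the target so it is not imputed, encoded, or scaled
y = df.pop('num')

# Handle missing data (one simple choice: median for numeric columns)
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Fill non-numeric columns with their most frequent value, then encode them
for column in df.select_dtypes(include=['object']).columns:
    df[column] = df[column].fillna(df[column].mode()[0])
    df[column] = LabelEncoder().fit_transform(df[column])

# Standardizing data
scaler = StandardScaler()
X = scaler.fit_transform(df)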

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost Model
# use_label_encoder is omitted: it is deprecated and ignored by recent XGBoost releases
xgb_clf = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, eval_metric='mlogloss')
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)
print(f"XGBoost Model Report:\n {classification_report(y_test, y_pred_xgb)}")

# Train LightGBM Model
# subsample only takes effect in LightGBM when subsample_freq > 0
lgb_clf = lgb.LGBMClassifier(n_estimators=100, max_depth=4, subsample=0.8, subsample_freq=1)
lgb_clf.fit(X_train, y_train)
y_pred_lgb = lgb_clf.predict(X_test)
print(f"LightGBM Model Report:\n {classification_report(y_test, y_pred_lgb)}")

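# Feature Importance
# XGBoost reports normalized scores; LightGBM reports split counts by default
xgb_importances = pd.Series(xgb_clf.feature_importances_, index=df.columns).sort_values(ascending=False)
lgb_importances = pd.Series(lgb_clf.feature_importances_, index=df.columns).sort_values(ascending=False)
print(xgb_importances)
print(lgb_importances)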
```