<a href="https://colab.research.google.com/github/anandchauhan21/Machine_Learning/blob/main/Labs/Lab7_Random_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧪 Lab 7: Random Forest — Disease Likelihood Prediction

## 🎯 Objective
To implement a **Random Forest Classifier** to predict the likelihood of a disease
using patient demographic, lifestyle, and clinical history data.

---

## 📚 Learning Outcomes
By the end of this lab, you will be able to:
- Build and train a Random Forest classifier.
- Evaluate its performance using classification metrics.
- Visualize feature importance.
- Tune hyperparameters for optimal results.
- Predict disease likelihood for new patient data.


# 🧠 Concept Recap — Random Forest Classifier

## 🌳 What is a Random Forest?
- **Random Forest** is an **ensemble learning method** that combines multiple **Decision Trees**.
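- Each tree is trained on a bootstrap sample of the rows and considers a random subset of features at each split, which decorrelates the trees and helps reduce **overfitting**.
- The final prediction is made by **majority voting** (for classification) or **averaging** (for regression). scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.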

---

## ⚙️ How Random Forest Works
1. **Bootstrap Sampling**: Each tree trains on a random sample of the training rows, drawn with replacement.
2. **Feature Randomness**: At each split, only a random subset of features is considered.
3. **Aggregation**: Predictions from all trees are combined to form the final output.
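
The three steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn builds `RandomForestClassifier` internally. The toy data from `make_classification`, the tree count, and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data (assumption: stands in for the patient table built later in the lab)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for i in range(n_trees):
    # 1. Bootstrap sampling: draw row indices with replacement
    idx = rng.integers(0, len(X_tr_demo), size=len(X_tr_demo))
    # 2. Feature randomness: max_features="sqrt" limits the features tried at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_tr_demo[idx], y_tr_demo[idx])
    trees.append(tree)

# 3. Aggregation: average per-tree class probabilities, then pick the likelier class
avg_proba = np.mean([t.predict_proba(X_te_demo) for t in trees], axis=0)
manual_pred = avg_proba.argmax(axis=1)
manual_acc = (manual_pred == y_te_demo).mean()
print("Hand-rolled forest accuracy:", round(manual_acc, 3))
```

In practice you would use `RandomForestClassifier` directly, as the lab does; the loop only makes the three steps visible.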

---

## 📊 Why Random Forest for Health Data?
- Handles **nonlinear** relationships well.
- Fairly robust to **noisy data**; **class imbalance** still needs handling (for example resampling or class weights).
- Gives **feature importance** (helps interpret which features matter most).
- Suitable for **tabular clinical data**.
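
On the imbalance point, one option scikit-learn provides is `class_weight="balanced"`, which reweights classes inversely to their frequency instead of duplicating rows. The sketch below compares recall with and without it on toy data. The dataset, the 10% positive rate, and the variable names are assumptions for illustration, and whether weighting helps depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data (assumption: roughly 10% positives, not the lab's dataset)
X_imb, y_imb = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_imb_tr, X_imb_te, y_imb_tr, y_imb_te = train_test_split(
    X_imb, y_imb, test_size=0.25, stratify=y_imb, random_state=0
)

# Same forest twice; the second reweights classes inversely to their frequency
rf_plain = RandomForestClassifier(n_estimators=100, random_state=0)
rf_weighted = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

recalls = {}
for name, model in [("plain", rf_plain), ("balanced", rf_weighted)]:
    model.fit(X_imb_tr, y_imb_tr)
    recalls[name] = recall_score(y_imb_te, model.predict(X_imb_te))

print("Recall on the positive class:", recalls)
```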

---

## ⚖️ Common Evaluation Metrics
| Metric | Meaning | Ideal Value |
|---------|----------|-------------|
| Accuracy | (TP + TN) / all predictions | High |
| Precision | TP / (TP + FP) | High |
| Recall (Sensitivity) | TP / (TP + FN) | High |
| F1-score | Harmonic mean of Precision & Recall | High |
| ROC-AUC | Probability that the model ranks a random positive above a random negative | Close to 1 |
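
The formulas in the table can be checked by hand against scikit-learn. The ten labels and scores below are made-up values for illustration, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score,
)

# Made-up labels and scores for ten patients (illustrative values only)
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score_demo = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05])
y_pred_demo = (y_score_demo >= 0.5).astype(int)

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels 0 and 1
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count half
pos = y_score_demo[y_true_demo == 1]
neg = y_score_demo[y_true_demo == 0]
diff = pos[:, None] - neg[None, :]
auc_pairs = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Each hand computation should agree with scikit-learn
assert np.isclose(accuracy, accuracy_score(y_true_demo, y_pred_demo))
assert np.isclose(precision, precision_score(y_true_demo, y_pred_demo))
assert np.isclose(recall, recall_score(y_true_demo, y_pred_demo))
assert np.isclose(f1, f1_score(y_true_demo, y_pred_demo))
assert np.isclose(auc_pairs, roc_auc_score(y_true_demo, y_score_demo))

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} auc={auc_pairs:.3f}")
```

Note that ROC-AUC is computed from the scores, while the other four metrics depend on the chosen threshold.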

---

## 🧩 Example Use Case
Predict if a patient has a **high likelihood of hypertension or diabetes**
based on features like **Age, BMI, BP, Cholesterol, Glucose, Smoking, Family History**, etc.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    confusion_matrix, classification_report, roc_auc_score, roc_curve,
    accuracy_score, precision_score, recall_score, f1_score
)
from sklearn.utils import resample
import joblib
import warnings

sns.set(style="whitegrid")
warnings.filterwarnings("ignore")
np.random.seed(42)


In [None]:
# Synthetic patient dataset
def create_synthetic_patient_history(n=1000):
    np.random.seed(42)
    age = np.random.randint(18, 85, n)
    bmi = np.round(np.random.normal(27, 4, n), 1)
    systolic = np.round(110 + 0.5*age + 0.4*bmi + np.random.normal(0, 10, n), 1)
    diastolic = np.round(70 + 0.2*age + 0.2*bmi + np.random.normal(0, 6, n), 1)
    cholesterol = np.round(np.random.normal(200, 30, n), 1)
    glucose = np.round(np.random.normal(100, 20, n), 1)
    smoking = np.random.binomial(1, 0.25, n)
    family_history = np.random.binomial(1, 0.2, n)
    physical_activity = np.random.randint(0, 5, n)

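    # Linear risk score; Glucose and DiastolicBP are generated above but do not enter it
    risk_score = (0.03*age + 0.06*bmi + 0.02*systolic + 0.015*cholesterol +
                  0.5*smoking + 0.8*family_history - 0.2*physical_activity)
    # Logistic link turns the score into a per-patient probability, then labels are sampled
    prob = 1 / (1 + np.exp(-0.08*(risk_score - 12)))
    disease = np.random.binomial(1, prob)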

    df = pd.DataFrame({
        'Age': age,
        'BMI': bmi,
        'SystolicBP': systolic,
        'DiastolicBP': diastolic,
        'Cholesterol': cholesterol,
        'Glucose': glucose,
        'Smoking': smoking,
        'FamilyHistory': family_history,
        'PhysicalActivity': physical_activity,
        'Disease': disease
    })
    return df

df = create_synthetic_patient_history(1200)
print("✅ Dataset created. Shape:", df.shape)
df.head()


In [None]:
print(df['Disease'].value_counts(normalize=True).round(3))
print("\nSummary Statistics:")
display(df.describe().T)

# Correlation heatmap
plt.figure(figsize=(8,6))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()


In [None]:
features = ['Age','BMI','SystolicBP','DiastolicBP','Cholesterol','Glucose',
            'Smoking','FamilyHistory','PhysicalActivity']
X = df[features]
y = df['Disease']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

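# Upsample the minority class in the training split only, and only if it is
# under 70% of the majority size; the test set keeps its original distribution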
train = pd.concat([X_train, y_train], axis=1)
minority = train[train['Disease']==1]
majority = train[train['Disease']==0]
print("Before resampling:", train['Disease'].value_counts().to_dict())

if len(minority) < 0.7 * len(majority):
    minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
    train = pd.concat([majority, minority_upsampled])
    X_train, y_train = train[features], train['Disease']
    print("After upsampling:", y_train.value_counts().to_dict())

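# Scaling is optional for Random Forest (tree splits are unaffected by monotonic
# rescaling); it is kept so the same preprocessing works for scale-sensitive models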
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train_scaled, y_train)

# Predictions
y_pred = rf.predict(X_test_scaled)
y_proba = rf.predict_proba(X_test_scaled)[:,1]

# Evaluate
print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
print("Precision:", round(precision_score(y_test, y_pred), 3))
print("Recall:", round(recall_score(y_test, y_pred), 3))
print("F1 Score:", round(f1_score(y_test, y_pred), 3))
print("ROC AUC:", round(roc_auc_score(y_test, y_proba), 3))


In [None]:
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted"); plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_proba):.3f}")
plt.plot([0,1],[0,1],'--', color='gray')
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()


In [None]:
importances = pd.Series(rf.feature_importances_, index=features).sort_values(ascending=True)
plt.figure(figsize=(7,5))
importances.plot(kind='barh', color='teal')
plt.title("Feature Importance")
plt.show()

display(importances.sort_values(ascending=False))


In [None]:
from scipy.stats import randint as sp_randint

param_dist = {
    'n_estimators': sp_randint(50, 300),
    'max_depth': sp_randint(3, 20),
    'min_samples_split': sp_randint(2, 10),
    'min_samples_leaf': sp_randint(1, 6),
    'max_features': ['sqrt', 'log2', 0.5, None]
}

rsearch = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20, cv=3, scoring='roc_auc', n_jobs=-1, random_state=42
)
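# Note: if the minority class was upsampled above, duplicated rows can fall into
# both sides of a CV split, so the CV score may be optimistic
rsearch.fit(X_train_scaled, y_train)

print("Best parameters:", rsearch.best_params_)
best_rf = rsearch.best_estimator_
print("Best CV ROC AUC:", round(rsearch.best_score_, 3))

# Check the tuned model on the held-out test set, comparable to the baseline ROC AUC
best_proba = best_rf.predict_proba(X_test_scaled)[:, 1]
print("Tuned test ROC AUC:", round(roc_auc_score(y_test, best_proba), 3))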


In [None]:
# Save model
joblib.dump(best_rf, "rf_disease_model.joblib")
joblib.dump(scaler, "scaler.joblib")
print("✅ Model saved successfully.")

# Predict for a new patient
new_patient = pd.DataFrame([{
    'Age': 52, 'BMI': 31.5, 'SystolicBP': 145, 'DiastolicBP': 90,
    'Cholesterol': 250, 'Glucose': 120, 'Smoking': 1,
    'FamilyHistory': 1, 'PhysicalActivity': 1
}])

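# Reload the saved artifacts, as an inference script in a fresh session would
scaler = joblib.load("scaler.joblib")
model = joblib.load("rf_disease_model.joblib")
# Select columns in training order so the scaler sees the layout it was fitted on
X_new = scaler.transform(new_patient[features])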
prob = model.predict_proba(X_new)[0,1]
pred = model.predict(X_new)[0]

print(f"Predicted Probability: {prob:.3f}")
print(f"Disease Likelihood: {'High (1)' if pred==1 else 'Low (0)'}")


# 🧩 Recap — Random Forest for Disease Prediction

### What we did
✅ Generated a synthetic patient-history dataset  
✅ Balanced the training classes (when needed) and scaled the features  
✅ Trained a **Random Forest** model  
✅ Evaluated using Accuracy, Precision, Recall, F1, and ROC-AUC  
✅ Visualized feature importance and confusion matrix  
✅ Tuned model hyperparameters  
✅ Predicted disease likelihood for new patients  

---

### Key Insights
- Random Forest can model complex, nonlinear health patterns.  
- Important predictors might include **Age**, **BMI**, **Blood Pressure**, **Cholesterol**, etc.  
- Ensemble averaging reduces overfitting and improves generalization.  
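- ROC-AUC is threshold-independent, which makes it more informative than accuracy under class imbalance; with severe imbalance, also check precision and recall.  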

---

### Limitations
- Less interpretable than single Decision Trees.  
- May need many trees for stability.  
- Tuning hyperparameters can be computationally expensive.
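
On interpretability: the impurity-based `feature_importances_` plotted earlier are derived from the training data and can favor high-cardinality features. A common cross-check is permutation importance on held-out data, sketched below with `sklearn.inspection.permutation_importance`. The toy dataset and the feature names `f0` to `f5` are hypothetical, not the lab's patient features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy data: with shuffle=False the first 3 columns are informative, the last 3 are noise
X_pi, y_pi = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
names_pi = [f"f{i}" for i in range(6)]  # hypothetical feature names
X_pi_tr, X_pi_te, y_pi_tr, y_pi_te = train_test_split(
    X_pi, y_pi, test_size=0.3, stratify=y_pi, random_state=0
)

rf_pi = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_pi_tr, y_pi_tr)

# Shuffle one column at a time on held-out data and record the drop in ROC AUC
result = permutation_importance(
    rf_pi, X_pi_te, y_pi_te, scoring="roc_auc", n_repeats=10, random_state=0
)
for i in result.importances_mean.argsort()[::-1]:
    print(f"{names_pi[i]}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

To apply the same idea in this lab, one could pass `best_rf`, `X_test_scaled`, and `y_test` to the same function.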


## ✅ Viva Questions

1. What is the main idea behind Random Forest?  
2. How does Random Forest reduce overfitting?  
3. Why use ROC-AUC over Accuracy in medical problems?  
4. What is the difference between bagging and boosting?  
5. Which features had the most influence in your model?  
6. How would you explain this model’s prediction to a clinician?  
7. Why do we split data into training and testing sets?  
8. What is the role of `max_depth` and `n_estimators` in Random Forest?
