<a href="https://colab.research.google.com/github/anandchauhan21/Machine_Learning/blob/main/Labs/Lab8_Random_Forest_for_Disease_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧪 Lab 8: Random Forest for Disease Prediction

## 🎯 Objective
Train and evaluate a **Random Forest** model to predict the likelihood of a disease using patient demographic, clinical and lifestyle features.

## ✅ Learning outcomes
- Understand Random Forest basics (bagging, feature randomness)
- Preprocess and balance clinical data
- Train, evaluate, and tune a Random Forest classifier
- Interpret feature importance and save the trained model


# 🧠 Concept Recap — Random Forest

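- **Random Forest**: an ensemble of decision trees, each trained on a bootstrap sample of the rows and limited to a random subset of features at every split.
- **Prediction (classification)**: majority vote across trees (scikit-learn averages the trees' predicted class probabilities and picks the most probable class).
- **Advantages**: robust to noise, captures nonlinear relationships, provides feature importances.
- **Limitations**: less interpretable than a single tree; training and prediction cost grow with the number of trees.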
- **Useful metrics for medical classification**: Precision, Recall, F1, ROC-AUC (esp. for imbalanced data).
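
The two sources of randomness and the vote described above can be sketched by hand with plain decision trees. This is an illustrative toy, not how `RandomForestClassifier` is implemented internally; the `make_classification` data and the names `trees`, `votes`, and `forest_pred` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for patient features (an assumption, not the lab dataset)
X, y = make_classification(n_samples=600, n_features=9, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary vote cannot tie
trees = []
for _ in range(n_trees):
    # Bagging: draw a bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Feature randomness: each split considers only sqrt(n_features) candidate features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Majority vote: every tree casts one vote per test row
votes = np.stack([t.predict(X_te) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_tree_acc = (trees[0].predict(X_te) == y_te).mean()
forest_acc = (forest_pred == y_te).mean()
print(f"Single tree accuracy:        {single_tree_acc:.3f}")
print(f"Hand-rolled forest accuracy: {forest_acc:.3f}")
```

Comparing the two printed accuracies gives a feel for how averaging many decorrelated trees behaves relative to one tree trained on a single bootstrap sample.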


In [None]:
# Run this cell first
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, classification_report, roc_auc_score,
                             roc_curve, accuracy_score, precision_score, recall_score, f1_score)
from sklearn.utils import resample
import joblib
import warnings
warnings.filterwarnings("ignore")
sns.set(style="whitegrid")
np.random.seed(42)


In [None]:
# OPTION A: Synthetic dataset (quick start)
def create_synthetic_patient_history(n=1000):
    np.random.seed(42)
    age = np.random.randint(18, 85, n)
    bmi = np.round(np.random.normal(27, 4, n), 1)
    systolic = np.round(110 + 0.5*age + 0.4*bmi + np.random.normal(0, 10, n), 1)
    diastolic = np.round(70 + 0.2*age + 0.2*bmi + np.random.normal(0, 6, n), 1)
    cholesterol = np.round(np.random.normal(200, 30, n), 1)
    glucose = np.round(np.random.normal(100, 20, n), 1)
    smoking = np.random.binomial(1, 0.25, n)
    family_history = np.random.binomial(1, 0.2, n)
    physical_activity = np.random.randint(0, 5, n)

    # risk → probability → label
    risk_score = (0.03*age + 0.06*bmi + 0.02*systolic + 0.015*cholesterol +
                  0.5*smoking + 0.8*family_history - 0.2*physical_activity)
    prob = 1 / (1 + np.exp(-0.08*(risk_score - 12)))
    disease = np.random.binomial(1, prob)

    df = pd.DataFrame({
        'Age': age, 'BMI': bmi, 'SystolicBP': systolic, 'DiastolicBP': diastolic,
        'Cholesterol': cholesterol, 'Glucose': glucose,
        'Smoking': smoking, 'FamilyHistory': family_history, 'PhysicalActivity': physical_activity,
        'Disease': disease
    })
    return df

# Create dataset
df = create_synthetic_patient_history(1500)
print("Dataset shape:", df.shape)
df.head()


> OPTION B (use your own CSV): replace the synthetic block with:
```python
# df = pd.read_csv('/path/to/patient_history.csv')
# df = df[['Age','BMI','SystolicBP','DiastolicBP','Cholesterol','Glucose','Smoking','FamilyHistory','PhysicalActivity','Disease']]
```


In [None]:
# Quick EDA
# Target balance
print("Target distribution:")
print(df['Disease'].value_counts(), "\n")

# Summary statistics
display(df.describe().T)

# Correlation heatmap
plt.figure(figsize=(8,6))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation")
plt.show()


In [None]:
features = ['Age','BMI','SystolicBP','DiastolicBP','Cholesterol','Glucose',
            'Smoking','FamilyHistory','PhysicalActivity']
X = df[features].copy()
y = df['Disease'].copy()

# Train-test split (stratify to keep class ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("Train class counts:", y_train.value_counts().to_dict())
print("Test class counts:", y_test.value_counts().to_dict())

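# OPTIONAL: Upsample the positive class in training (simple random oversampling).
# Assumes class 1 is the minority; it only triggers if positives are < 70% of negatives.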
train = pd.concat([X_train, y_train], axis=1)
minority = train[train['Disease']==1]
majority = train[train['Disease']==0]

if len(minority) > 0 and len(minority) < 0.7 * len(majority):
    minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
    train_bal = pd.concat([majority, minority_upsampled])
    X_train = train_bal[features]
    y_train = train_bal['Disease']
    print("After upsampling (train):", y_train.value_counts().to_dict())
else:
    print("No upsampling performed.")

# Scale features (Random Forest does not require scaling, but the scaler is useful if you swap in other models)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
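
Duplicating minority rows is one way to rebalance the training set. Another option, not used in this lab, is to leave the data as it is and pass `class_weight="balanced"` to `RandomForestClassifier`, which reweights classes inversely to their frequency. The sketch below compares the two settings on hypothetical imbalanced data (about 10% positives); the dataset and the `min_samples_leaf=5` choice are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 90% negatives, 10% positives
X, y = make_classification(n_samples=2000, n_features=9, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# Same forest twice; only the class weighting differs
plain = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0)
weighted = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                                  class_weight="balanced", random_state=0)

recalls = {}
for name, model in [("no weighting", plain), ("class_weight='balanced'", weighted)]:
    model.fit(X_tr, y_tr)
    recalls[name] = recall_score(y_te, model.predict(X_te))
    print(f"{name:>24}: recall on positives = {recalls[name]:.3f}")
```

Whether weighting or resampling works better depends on the data, so it is worth checking recall and precision for both on a held-out set.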


In [None]:
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train_scaled, y_train)

# Predict
y_pred = rf.predict(X_test_scaled)
y_proba = rf.predict_proba(X_test_scaled)[:,1]

# Metrics
print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
print("Precision:", round(precision_score(y_test, y_pred), 3))
print("Recall:", round(recall_score(y_test, y_pred), 3))
print("F1:", round(f1_score(y_test, y_pred), 3))
print("ROC AUC:", round(roc_auc_score(y_test, y_proba), 3))

print("\nClassification report:")
print(classification_report(y_test, y_pred))


In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4,3))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=[0,1], yticklabels=[0,1])
plt.xlabel("Predicted"); plt.ylabel("Actual"); plt.title("Confusion Matrix")
plt.show()

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_proba):.3f}")
plt.plot([0,1],[0,1],'--', color='gray')
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()


In [None]:
feat_imp = pd.Series(rf.feature_importances_, index=features).sort_values(ascending=True)
plt.figure(figsize=(7,5))
feat_imp.plot(kind='barh', color='teal')
plt.title("Feature Importances (Random Forest)")
plt.xlabel("Importance")
plt.show()

display(feat_imp.sort_values(ascending=False).round(3))
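
The `feature_importances_` attribute plotted above is impurity-based: it is computed from the training data and can favor features with many distinct values. A complementary check, not part of the original lab, is permutation importance, which measures how much a held-out score drops when one feature's values are shuffled. The sketch below uses toy data and placeholder feature names (`f0` to `f8`), both assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_names = [f"f{i}" for i in range(9)]  # placeholder names
X, y = make_classification(n_samples=800, n_features=9, n_informative=4,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances (normalized to sum to 1)
mdi = pd.Series(rf_demo.feature_importances_, index=feature_names)

# Permutation importances: mean drop in test ROC AUC over 5 shuffles per feature
result = permutation_importance(rf_demo, X_te, y_te, scoring="roc_auc",
                                n_repeats=5, random_state=0)
perm = pd.Series(result.importances_mean, index=feature_names)

comparison = pd.DataFrame({"impurity_based": mdi, "permutation": perm})
print(comparison.sort_values("permutation", ascending=False).round(3))
```

Features that rank high under both measures are safer to discuss with clinicians than features that rank high under only one.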


In [None]:
from scipy.stats import randint as sp_randint

param_dist = {
    'n_estimators': sp_randint(50, 300),
    'max_depth': sp_randint(3, 25),
    'min_samples_split': sp_randint(2, 12),
    'min_samples_leaf': sp_randint(1, 6),
    'max_features': ['sqrt','log2', None, 0.5]
}

rsearch = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20, cv=3, scoring='roc_auc', random_state=42, n_jobs=-1
)

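rsearch.fit(X_train_scaled, y_train)
print("Best params:", rsearch.best_params_)
# Note: if the training set was upsampled above, duplicated rows can land in both the
# training and validation folds, so this CV score may be optimistic; rely on the test-set metrics.
print("Best CV ROC AUC:", round(rsearch.best_score_, 3))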

best_rf = rsearch.best_estimator_

# Evaluate tuned model on test set
y_pred_t = best_rf.predict(X_test_scaled)
y_proba_t = best_rf.predict_proba(X_test_scaled)[:,1]
print("\nTuned model metrics on test set:")
print("ROC AUC:", round(roc_auc_score(y_test, y_proba_t), 3))
print("Accuracy:", round(accuracy_score(y_test, y_pred_t), 3))
print("F1:", round(f1_score(y_test, y_pred_t), 3))


In [None]:
joblib.dump(best_rf, "lab8_rf_model.joblib")
joblib.dump(scaler, "lab8_scaler.joblib")
print("Saved: lab8_rf_model.joblib, lab8_scaler.joblib")


In [None]:
# Example new patient
new_patient = pd.DataFrame([{
    'Age': 60, 'BMI': 30.2, 'SystolicBP': 145, 'DiastolicBP': 92,
    'Cholesterol': 240, 'Glucose': 118, 'Smoking': 1, 'FamilyHistory': 1, 'PhysicalActivity': 1
}])

scaler = joblib.load("lab8_scaler.joblib")
model = joblib.load("lab8_rf_model.joblib")
Xn = scaler.transform(new_patient[features])
prob = model.predict_proba(Xn)[0,1]
pred = model.predict(Xn)[0]

print(f"Predicted disease probability: {prob:.3f}   Predicted label: {pred}")


In [None]:
from sklearn.calibration import calibration_curve
prob_pos = best_rf.predict_proba(X_test_scaled)[:,1]
frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)

plt.figure(figsize=(6,4))
plt.plot(mean_pred, frac_pos, "s-", label="RandomForest")
plt.plot([0,1],[0,1],"--", color='gray')
plt.xlabel("Mean predicted probability"); plt.ylabel("Fraction of positives")
plt.title("Calibration plot")
plt.legend(); plt.show()
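
If the calibration curve strays far from the diagonal, one possible follow-up (not part of the original lab) is to wrap the forest in scikit-learn's `CalibratedClassifierCV` and compare Brier scores, where lower is better. The sketch below is self-contained and uses toy data; the sigmoid method and `cv=3` are illustrative choices, not recommendations.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Toy data standing in for the patient features (an assumption)
X, y = make_classification(n_samples=1000, n_features=9, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# Uncalibrated forest
raw_rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Calibrated forest: internal 3-fold CV fits the forest and a sigmoid mapping
# on separate folds of the training data
cal_rf = CalibratedClassifierCV(RandomForestClassifier(n_estimators=50, random_state=0),
                                method="sigmoid", cv=3)
cal_rf.fit(X_tr, y_tr)

raw_brier = brier_score_loss(y_te, raw_rf.predict_proba(X_te)[:, 1])
cal_brier = brier_score_loss(y_te, cal_rf.predict_proba(X_te)[:, 1])
print(f"Brier score, raw forest:        {raw_brier:.4f}")
print(f"Brier score, calibrated forest: {cal_brier:.4f}")
```

Calibration does not always help; if the raw forest already tracks the diagonal, the two scores will be close.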


# 🔁 Recap — Lab 8: Random Forest for Disease Prediction

**What we did**
- Built or loaded a patient dataset and explored the features and the label.
- Handled class imbalance (optional upsampling of the training set).
- Trained a baseline Random Forest and evaluated it with Precision, Recall, F1, and ROC-AUC.
- Tuned hyperparameters with RandomizedSearchCV and saved the best model.
- Interpreted feature importances, predicted for a new patient, and checked probability calibration.

**Key reminders**
- Accuracy alone can mislead when classes are imbalanced; report recall, precision, and ROC-AUC as well.
- Feature importance helps clinical interpretability, but for per-prediction explanations consider SHAP/LIME.
- Always validate model decisions with clinicians before deployment.
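
To make the first reminder concrete, here is a small sketch with a hypothetical test set of 950 healthy and 50 diseased patients and a "model" that always predicts healthy. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Hypothetical test set: 950 healthy (0) and 50 diseased (1) patients
y_true = np.array([0] * 950 + [1] * 50)

# A useless model: always predicts "healthy" and gives every patient the same score
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)

print("Accuracy :", accuracy_score(y_true, y_pred))                    # 0.95
print("Recall   :", recall_score(y_true, y_pred))                      # 0.0
print("Precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("ROC AUC  :", roc_auc_score(y_true, y_score))                    # 0.5
```

Accuracy looks excellent even though the model misses every diseased patient, which is why recall, precision, and ROC-AUC belong in the report.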


## ✅ Viva Questions

1. Why is Random Forest suitable for disease prediction?  
2. What does ROC-AUC represent and why is it useful for imbalanced datasets?  
3. How does Random Forest reduce overfitting compared to a single decision tree?  
4. When might you prefer calibration of predicted probabilities?  
5. How would you explain the model prediction for a single patient to a clinician?
