<div align="center">

# MLOps (S1-25_AIMLCZG523) ASSIGNMENT - 1

## Group - 126

</div>

| Name | ID | Contribution |
|------|----|----|
| DEVAPRASAD P | 2023AA05069 | 100% |
| DEVENDER KUMAR | 2024AA05065 | 100% |
| ROHAN TIRTHANKAR BEHERA | 2024AA05607 | 100% |
| PALAKOLANU PREETHI | 2024AA05608 | 100% |
| CHAVALI AMRUTHA VALLI | 2024AA05610 | 100% |

---

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import pickle
import joblib
import warnings
warnings.filterwarnings('ignore')

# Sklearn imports
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, 
    roc_auc_score, confusion_matrix, classification_report, roc_curve
)
from sklearn.pipeline import Pipeline

# MLflow
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

%matplotlib inline

## 2. Load Cleaned Data

In [2]:
# Load the cleaned dataset
data_path = Path('../data/processed/heart_disease_clean.csv')
df = pd.read_csv(data_path)

print(f"Dataset shape: {df.shape}")
print(f"\nTarget distribution:")
print(df['target'].value_counts())
df.head()

Dataset shape: (303, 14)

Target distribution:
target
0    164
1    139
Name: count, dtype: int64


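| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 1 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |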


## 3. Feature Engineering & Data Preparation

In [3]:
# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeatures: {list(X.columns)}")

Features shape: (303, 13)
Target shape: (303,)

Features: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']


In [4]:
# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"\nTrain target distribution:\n{y_train.value_counts()}")
print(f"\nTest target distribution:\n{y_test.value_counts()}")

Training set size: (242, 13)
Test set size: (61, 13)

Train target distribution:
target
0    131
1    111
Name: count, dtype: int64

Test target distribution:
target
0    33
1    28
Name: count, dtype: int64
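
The `stratify=y` argument is what keeps the class balance nearly identical in both partitions. As a quick sanity check, the counts printed above can be turned into proportions. This is a minimal sketch in plain Python; the helper name `class_share` is introduced here only for illustration and is not part of the notebook.

```python
# Class counts copied from the notebook output above.
full = {0: 164, 1: 139}
train = {0: 131, 1: 111}
test = {0: 33, 1: 28}


def class_share(counts, label=1):
    """Return the fraction of samples that carry `label`."""
    return counts[label] / sum(counts.values())


# Positive-class share in each partition.
shares = {
    "full": class_share(full),
    "train": class_share(train),
    "test": class_share(test),
}

for name, share in shares.items():
    print(f"{name:>5}: positive share = {share:.4f}")
```

The three shares agree to within a fraction of a percentage point, which is the behaviour a stratified split is expected to give.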


## 4. Setup MLflow Experiment Tracking

In [5]:
# Set up MLflow
mlflow.set_tracking_uri("file:../mlruns")
mlflow.set_experiment("heart_disease_prediction")

print("MLflow tracking URI:", mlflow.get_tracking_uri())
print("MLflow experiment:", mlflow.get_experiment_by_name("heart_disease_prediction"))

MLflow tracking URI: file:../mlruns
MLflow experiment: <Experiment: artifact_location='file:///Users/I339667/Library/CloudStorage/OneDrive-SAPSE/Documents/Personal/BITS/MLOPS_Assignment1_2025/notebooks/../mlruns/155980306092719807', creation_time=1767158961162, experiment_id='155980306092719807', last_update_time=1767158961162, lifecycle_stage='active', name='heart_disease_prediction', tags={'mlflow.experimentKind': 'custom_model_development'}>


## 5. Model Development & Training

### 5.1 Helper Functions

In [6]:
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Evaluate model and return metrics"""
    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    y_test_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    metrics = {
        'train_accuracy': accuracy_score(y_train, y_train_pred),
        'test_accuracy': accuracy_score(y_test, y_test_pred),
        'precision': precision_score(y_test, y_test_pred),
        'recall': recall_score(y_test, y_test_pred),
        'f1_score': f1_score(y_test, y_test_pred),
        'roc_auc': roc_auc_score(y_test, y_test_proba)
    }
    
    # Print results
    print(f"\n{'='*60}")
    print(f"{model_name} Results")
    print(f"{'='*60}")
    print(f"Train Accuracy: {metrics['train_accuracy']:.4f}")
    print(f"Test Accuracy:  {metrics['test_accuracy']:.4f}")
    print(f"Precision:      {metrics['precision']:.4f}")
    print(f"Recall:         {metrics['recall']:.4f}")
    print(f"F1-Score:       {metrics['f1_score']:.4f}")
    print(f"ROC-AUC:        {metrics['roc_auc']:.4f}")
    print(f"{'='*60}")
    
    return metrics, y_test_pred, y_test_proba


def plot_confusion_matrix(y_true, y_pred, model_name):
    """Plot confusion matrix"""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['No Disease', 'Disease'],
                yticklabels=['No Disease', 'Disease'])
    plt.title(f'{model_name} - Confusion Matrix', fontsize=14, fontweight='bold')
    plt.ylabel('True Label', fontsize=12)
    plt.xlabel('Predicted Label', fontsize=12)
    plt.tight_layout()
    return plt.gcf()


def plot_roc_curve(y_true, y_proba, model_name, roc_auc):
    """Plot ROC curve"""
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2, 
             label=f'ROC curve (AUC = {roc_auc:.4f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate', fontsize=12)
    plt.ylabel('True Positive Rate', fontsize=12)
    plt.title(f'{model_name} - ROC Curve', fontsize=14, fontweight='bold')
    plt.legend(loc='lower right', fontsize=11)
    plt.grid(alpha=0.3)
    plt.tight_layout()
    return plt.gcf()

### 5.2 Model 1: Logistic Regression

In [7]:
# Start MLflow run for Logistic Regression
with mlflow.start_run(run_name="logistic_regression_baseline"):
    
    # Create pipeline
    lr_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(random_state=RANDOM_STATE, max_iter=1000))
    ])
    
    # Log parameters
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_param("scaler", "StandardScaler")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_param("random_state", RANDOM_STATE)
    
    # Train model
    lr_pipeline.fit(X_train, y_train)
    
    # Evaluate
    metrics, y_pred, y_proba = evaluate_model(
        lr_pipeline, X_train, X_test, y_train, y_test, "Logistic Regression"
    )
    
    # Log metrics
    for metric_name, metric_value in metrics.items():
        mlflow.log_metric(metric_name, metric_value)
    
    # Cross-validation
    cv_scores = cross_val_score(lr_pipeline, X_train, y_train, cv=5, scoring='roc_auc')
    mlflow.log_metric("cv_roc_auc_mean", cv_scores.mean())
    mlflow.log_metric("cv_roc_auc_std", cv_scores.std())
    print(f"\nCross-validation ROC-AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
    
    # Plot and log confusion matrix
    cm_fig = plot_confusion_matrix(y_test, y_pred, "Logistic Regression")
    mlflow.log_figure(cm_fig, "confusion_matrix.png")
    plt.savefig('../figures/lr_confusion_matrix.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # Plot and log ROC curve
    roc_fig = plot_roc_curve(y_test, y_proba, "Logistic Regression", metrics['roc_auc'])
    mlflow.log_figure(roc_fig, "roc_curve.png")
    plt.savefig('../figures/lr_roc_curve.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # Log model
    signature = infer_signature(X_train, lr_pipeline.predict(X_train))
    mlflow.sklearn.log_model(lr_pipeline, "model", signature=signature)
    
    print(f"\nMLflow Run ID: {mlflow.active_run().info.run_id}")


Logistic Regression Results
Train Accuracy: 0.8512
Test Accuracy:  0.8689
Precision:      0.8125
Recall:         0.9286
F1-Score:       0.8667
ROC-AUC:        0.9513

Cross-validation ROC-AUC: 0.8893 (+/- 0.0422)





MLflow Run ID: ab0d49b2f6684e48a655dc08161ed493
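
The four test metrics reported above all follow from a single confusion matrix. The sketch below recomputes them from raw counts. The counts are not printed by the notebook: they are inferred here from the reported precision and recall together with the test class sizes (33 negative, 28 positive), so treat them as an assumption. The function name `metrics_from_counts` is likewise illustrative.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1_score": 2 * precision * recall / (precision + recall),
    }


# Assumed counts, inferred from the reported metrics (not notebook output):
# 26 of 28 diseased patients detected, 6 of 33 healthy patients flagged.
lr_counts = {"tp": 26, "fp": 6, "fn": 2, "tn": 27}
lr_metrics = metrics_from_counts(**lr_counts)

for name, value in lr_metrics.items():
    print(f"{name:<10} {value:.4f}")
```

Under these assumed counts the recomputed values match the reported Logistic Regression results to four decimal places.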


### 5.3 Model 2: Random Forest

In [8]:
# Start MLflow run for Random Forest
with mlflow.start_run(run_name="random_forest_baseline"):
    
    # Create pipeline
    rf_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', RandomForestClassifier(
            n_estimators=100,
            max_depth=10,
            min_samples_split=5,
            min_samples_leaf=2,
            random_state=RANDOM_STATE,
            n_jobs=-1
        ))
    ])
    
    # Log parameters
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("scaler", "StandardScaler")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("min_samples_split", 5)
    mlflow.log_param("min_samples_leaf", 2)
    mlflow.log_param("random_state", RANDOM_STATE)
    
    # Train model
    rf_pipeline.fit(X_train, y_train)
    
    # Evaluate
    metrics, y_pred, y_proba = evaluate_model(
        rf_pipeline, X_train, X_test, y_train, y_test, "Random Forest"
    )
    
    # Log metrics
    for metric_name, metric_value in metrics.items():
        mlflow.log_metric(metric_name, metric_value)
    
    # Cross-validation
    cv_scores = cross_val_score(rf_pipeline, X_train, y_train, cv=5, scoring='roc_auc')
    mlflow.log_metric("cv_roc_auc_mean", cv_scores.mean())
    mlflow.log_metric("cv_roc_auc_std", cv_scores.std())
    print(f"\nCross-validation ROC-AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
    
    # Feature importance
    feature_importance = pd.DataFrame({
        'feature': X.columns,
        'importance': rf_pipeline.named_steps['classifier'].feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("\nFeature Importance:")
    print(feature_importance)
    
    # Plot feature importance
    plt.figure(figsize=(10, 6))
    sns.barplot(data=feature_importance, x='importance', y='feature', palette='viridis')
    plt.title('Random Forest - Feature Importance', fontsize=14, fontweight='bold')
    plt.xlabel('Importance', fontsize=12)
    plt.ylabel('Feature', fontsize=12)
    plt.tight_layout()
    mlflow.log_figure(plt.gcf(), "feature_importance.png")
    plt.savefig('../figures/rf_feature_importance.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # Plot and log confusion matrix
    cm_fig = plot_confusion_matrix(y_test, y_pred, "Random Forest")
    mlflow.log_figure(cm_fig, "confusion_matrix.png")
    plt.savefig('../figures/rf_confusion_matrix.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # Plot and log ROC curve
    roc_fig = plot_roc_curve(y_test, y_proba, "Random Forest", metrics['roc_auc'])
    mlflow.log_figure(roc_fig, "roc_curve.png")
    plt.savefig('../figures/rf_roc_curve.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # Log model
    signature = infer_signature(X_train, rf_pipeline.predict(X_train))
    mlflow.sklearn.log_model(rf_pipeline, "model", signature=signature)
    
    print(f"\nMLflow Run ID: {mlflow.active_run().info.run_id}")


Random Forest Results
Train Accuracy: 0.9711
Test Accuracy:  0.8852
Precision:      0.8387
Recall:         0.9286
F1-Score:       0.8814
ROC-AUC:        0.9491

Cross-validation ROC-AUC: 0.8882 (+/- 0.0400)

Feature Importance:
     feature  importance
12      thal    0.156067
2         cp    0.132357
7    thalach    0.123803
11        ca    0.120198
9    oldpeak    0.099370
0        age    0.076512
4       chol    0.067811
3   trestbps    0.058517
8      exang    0.054999
10     slope    0.045303
1        sex    0.039948
6    restecg    0.018501
5        fbs    0.006617





MLflow Run ID: 8b3b505645204058a0d65dae5c7e18dc
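
To make the importance ranking easier to read, the values above can be accumulated to see how much of the total a few top features cover. This is a small sketch in plain Python using the numbers printed by the notebook; the variable names are illustrative.

```python
# Feature importances copied from the Random Forest output above,
# already sorted from most to least important.
importances = [
    ("thal", 0.156067), ("cp", 0.132357), ("thalach", 0.123803),
    ("ca", 0.120198), ("oldpeak", 0.099370), ("age", 0.076512),
    ("chol", 0.067811), ("trestbps", 0.058517), ("exang", 0.054999),
    ("slope", 0.045303), ("sex", 0.039948), ("restecg", 0.018501),
    ("fbs", 0.006617),
]

total = sum(value for _, value in importances)

# Running share of total importance after each feature.
cumulative = []
running = 0.0
for name, value in importances:
    running += value
    cumulative.append((name, running / total))

top5_share = cumulative[4][1]
print(f"Sum of importances: {total:.6f}")
print(f"Share covered by the top 5 features: {top5_share:.1%}")
```

Impurity-based importances sum to 1 (up to rounding in the printed values), so the running share can be read directly as a fraction of the model's total importance.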


### 5.4 Model 3: Gradient Boosting (Bonus)

In [9]:
# Start MLflow run for Gradient Boosting
with mlflow.start_run(run_name="gradient_boosting_baseline"):
    
    # Create pipeline
    gb_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', GradientBoostingClassifier(
            n_estimators=100,
            learning_rate=0.1,
            max_depth=3,
            random_state=RANDOM_STATE
        ))
    ])
    
    # Log parameters
    mlflow.log_param("model_type", "GradientBoosting")
    mlflow.log_param("scaler", "StandardScaler")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_param("max_depth", 3)
    mlflow.log_param("random_state", RANDOM_STATE)
    
    # Train model
    gb_pipeline.fit(X_train, y_train)
    
    # Evaluate
    metrics, y_pred, y_proba = evaluate_model(
        gb_pipeline, X_train, X_test, y_train, y_test, "Gradient Boosting"
    )
    
    # Log metrics
    for metric_name, metric_value in metrics.items():
        mlflow.log_metric(metric_name, metric_value)
    
    # Cross-validation
    cv_scores = cross_val_score(gb_pipeline, X_train, y_train, cv=5, scoring='roc_auc')
    mlflow.log_metric("cv_roc_auc_mean", cv_scores.mean())
    mlflow.log_metric("cv_roc_auc_std", cv_scores.std())
    print(f"\nCross-validation ROC-AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
    
    # Plot and log confusion matrix
    cm_fig = plot_confusion_matrix(y_test, y_pred, "Gradient Boosting")
    mlflow.log_figure(cm_fig, "confusion_matrix.png")
    plt.savefig('../figures/gb_confusion_matrix.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # Plot and log ROC curve
    roc_fig = plot_roc_curve(y_test, y_proba, "Gradient Boosting", metrics['roc_auc'])
    mlflow.log_figure(roc_fig, "roc_curve.png")
    plt.savefig('../figures/gb_roc_curve.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # Log model
    signature = infer_signature(X_train, gb_pipeline.predict(X_train))
    mlflow.sklearn.log_model(gb_pipeline, "model", signature=signature)
    
    print(f"\nMLflow Run ID: {mlflow.active_run().info.run_id}")


Gradient Boosting Results
Train Accuracy: 0.9917
Test Accuracy:  0.8525
Precision:      0.7879
Recall:         0.9286
F1-Score:       0.8525
ROC-AUC:        0.9459

Cross-validation ROC-AUC: 0.8536 (+/- 0.0492)





MLflow Run ID: 5dfbe1c043b64ed4a74c422d5dec8177
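
A simple way to compare how the three baseline models generalize is the gap between train and test accuracy. The sketch below uses only the accuracies printed above; the dictionary layout is an illustrative choice, not something the notebook defines.

```python
# (train accuracy, test accuracy) copied from the three baseline runs above.
results = {
    "Logistic Regression": (0.8512, 0.8689),
    "Random Forest": (0.9711, 0.8852),
    "Gradient Boosting": (0.9917, 0.8525),
}

# A large positive gap means the model fits the training data
# much better than unseen data, a common sign of overfitting.
gaps = {name: train - test for name, (train, test) in results.items()}
largest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name:<20} train-test gap = {gap:+.4f}")
print(f"Largest gap: {largest}")
```

With these numbers Gradient Boosting shows the widest gap, while Logistic Regression scores slightly higher on the test set than on the training set.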


## 6. Hyperparameter Tuning (Random Forest)

In [10]:
# Hyperparameter tuning for Random Forest
with mlflow.start_run(run_name="random_forest_tuned"):
    
    # Define parameter grid
    param_grid = {
        'classifier__n_estimators': [50, 100, 200],
        'classifier__max_depth': [5, 10, 15],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__min_samples_leaf': [1, 2, 4]
    }
    
    # Create base pipeline
    rf_pipeline_tuned = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=-1))
    ])
    
    # Grid search
    grid_search = GridSearchCV(
        rf_pipeline_tuned, param_grid, cv=5, 
        scoring='roc_auc', n_jobs=-1, verbose=1
    )
    
    print("Starting Grid Search...")
    grid_search.fit(X_train, y_train)
    
    # Best parameters
    print(f"\nBest Parameters: {grid_search.best_params_}")
    print(f"Best CV Score: {grid_search.best_score_:.4f}")
    
    # Log parameters
    mlflow.log_param("model_type", "RandomForest_Tuned")
    mlflow.log_param("scaler", "StandardScaler")
    for param_name, param_value in grid_search.best_params_.items():
        mlflow.log_param(param_name, param_value)
    
    # Get best model
    best_rf = grid_search.best_estimator_
    
    # Evaluate
    metrics, y_pred, y_proba = evaluate_model(
        best_rf, X_train, X_test, y_train, y_test, "Random Forest (Tuned)"
    )
    
    # Log metrics
    for metric_name, metric_value in metrics.items():
        mlflow.log_metric(metric_name, metric_value)
    mlflow.log_metric("cv_roc_auc_best", grid_search.best_score_)
    
    # Plot and log confusion matrix
    cm_fig = plot_confusion_matrix(y_test, y_pred, "Random Forest (Tuned)")
    mlflow.log_figure(cm_fig, "confusion_matrix.png")
    plt.savefig('../figures/rf_tuned_confusion_matrix.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # Plot and log ROC curve
    roc_fig = plot_roc_curve(y_test, y_proba, "Random Forest (Tuned)", metrics['roc_auc'])
    mlflow.log_figure(roc_fig, "roc_curve.png")
    plt.savefig('../figures/rf_tuned_roc_curve.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # Log model
    signature = infer_signature(X_train, best_rf.predict(X_train))
    mlflow.sklearn.log_model(best_rf, "model", signature=signature)
    
    # Save best model
    model_path = Path('../models/best_model.pkl')
    model_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(best_rf, model_path)
    mlflow.log_artifact(str(model_path))
    
    print(f"\nBest model saved to: {model_path}")
    print(f"MLflow Run ID: {mlflow.active_run().info.run_id}")

Starting Grid Search...
Fitting 5 folds for each of 81 candidates, totalling 405 fits

Best Parameters: {'classifier__max_depth': 5, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 200}
Best CV Score: 0.8940

Random Forest (Tuned) Results
Train Accuracy: 0.9380
Test Accuracy:  0.9016
Precision:      0.8667
Recall:         0.9286
F1-Score:       0.8966
ROC-AUC:        0.9567





Best model saved to: ../models/best_model.pkl
MLflow Run ID: 93528a42e208414e8e855a282ea2fac6
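
The "81 candidates, totalling 405 fits" line in the output follows directly from the grid: four parameters with three values each, evaluated on five folds. The sketch below enumerates the same grid with `itertools.product`, without scikit-learn, purely to show where those numbers come from.

```python
from itertools import product

# Same grid as in the tuning cell above.
param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [5, 10, 15],
    "classifier__min_samples_split": [2, 5, 10],
    "classifier__min_samples_leaf": [1, 2, 4],
}

# Every combination of one value per parameter is one candidate.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_folds = 5
n_candidates = len(candidates)
n_fits = n_candidates * n_folds

# Best parameters as reported in the notebook output.
best_params = {
    "classifier__max_depth": 5,
    "classifier__min_samples_leaf": 1,
    "classifier__min_samples_split": 2,
    "classifier__n_estimators": 200,
}

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
print("Best parameters are in the grid:", best_params in candidates)
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the whole training set; that final fit is not part of the 405.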


## 7. Model Comparison

In [11]:
# Load all runs from MLflow
experiment = mlflow.get_experiment_by_name("heart_disease_prediction")
runs_df = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# Display comparison
comparison_cols = ['run_id', 'params.model_type', 'metrics.test_accuracy', 
                   'metrics.precision', 'metrics.recall', 'metrics.f1_score', 
                   'metrics.roc_auc']
print("\nModel Comparison:")
print("=" * 100)
runs_df[comparison_cols].sort_values('metrics.roc_auc', ascending=False)


Model Comparison:


| | run_id | params.model_type | metrics.test_accuracy | metrics.precision | metrics.recall | metrics.f1_score | metrics.roc_auc |
|---|---|---|---|---|---|---|---|
| 0 | 93528a42e208414e8e855a282ea2fac6 | RandomForest_Tuned | 0.901639 | 0.866667 | 0.928571 | 0.896552 | 0.95671 |
| 8 | fe660c93049b470d99bdae5ac3164729 | RandomForest_Tuned | 0.901639 | 0.866667 | 0.928571 | 0.896552 | 0.95671 |
| 15 | dc68cfe6769742bea6517ea6c9f207e8 | RandomForest_Tuned | 0.901639 | 0.866667 | 0.928571 | 0.896552 | 0.95671 |
| 4 | 1dea29a657d149c0a38ccec4f482702b | RandomForest_Tuned | 0.901639 | 0.866667 | 0.928571 | 0.896552 | 0.95671 |
| 12 | 651646e3082c423f9b76919e9e14b64e | RandomForest_Tuned | 0.901639 | 0.866667 | 0.928571 | 0.896552 | 0.95671 |
| 14 | f69fdf45568c4282a6050470de3f76c7 | LogisticRegression | 0.868852 | 0.8125 | 0.928571 | 0.866667 | 0.951299 |
| 11 | c701256fb0f048128850f2eb5d4d13d9 | LogisticRegression | 0.868852 | 0.8125 | 0.928571 | 0.866667 | 0.951299 |
| 18 | a5cd16c95cba48e4a1dee5243a7b032c | LogisticRegression | 0.868852 | 0.8125 | 0.928571 | 0.866667 | 0.951299 |
| 7 | d8cb931e8f0549b8813d28521e2cedb6 | LogisticRegression | 0.868852 | 0.8125 | 0.928571 | 0.866667 | 0.951299 |
| 3 | ab0d49b2f6684e48a655dc08161ed493 | LogisticRegression | 0.868852 | 0.8125 | 0.928571 | 0.866667 | 0.951299 |
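
The comparison above lists each model type several times with identical metrics, apparently because the experiment holds runs from earlier executions of the notebook. One way to get a single row per model is to keep the first run seen for each `params.model_type` after sorting by ROC-AUC. The sketch below does this in plain Python on the rows shown above; it is an illustration, not what the notebook itself does.

```python
# (run_id, model_type, roc_auc) copied from the comparison output above.
rows = [
    ("93528a42e208414e8e855a282ea2fac6", "RandomForest_Tuned", 0.95671),
    ("fe660c93049b470d99bdae5ac3164729", "RandomForest_Tuned", 0.95671),
    ("dc68cfe6769742bea6517ea6c9f207e8", "RandomForest_Tuned", 0.95671),
    ("1dea29a657d149c0a38ccec4f482702b", "RandomForest_Tuned", 0.95671),
    ("651646e3082c423f9b76919e9e14b64e", "RandomForest_Tuned", 0.95671),
    ("f69fdf45568c4282a6050470de3f76c7", "LogisticRegression", 0.951299),
    ("c701256fb0f048128850f2eb5d4d13d9", "LogisticRegression", 0.951299),
    ("a5cd16c95cba48e4a1dee5243a7b032c", "LogisticRegression", 0.951299),
    ("d8cb931e8f0549b8813d28521e2cedb6", "LogisticRegression", 0.951299),
    ("ab0d49b2f6684e48a655dc08161ed493", "LogisticRegression", 0.951299),
]

# Sort by ROC-AUC (highest first) and keep the first run per model type.
best_per_model = {}
for run_id, model_type, roc_auc in sorted(rows, key=lambda r: r[2], reverse=True):
    best_per_model.setdefault(model_type, (run_id, roc_auc))

for model_type, (run_id, roc_auc) in best_per_model.items():
    print(f"{model_type:<20} {roc_auc:.4f}  {run_id}")
```

On the `runs_df` DataFrame, a similar result could likely be obtained by sorting on `metrics.roc_auc` and then calling `drop_duplicates(subset='params.model_type')`.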


## 8. Conclusions

### Models Trained:
1. **Logistic Regression** - Baseline linear model providing interpretable results
2. **Random Forest** - Ensemble tree-based model showing strong performance
3. **Gradient Boosting** - Boosting ensemble model; lowest test accuracy of the four (85.25%) despite 99.17% train accuracy
4. **Random Forest (Tuned)** - Hyperparameter optimized model selected for deployment

### Key Findings:
- Random Forest (Tuned) achieved the best performance with **90.16% test accuracy** and **95.67% ROC-AUC**
- Random Forest baseline model achieved 88.52% accuracy, Logistic Regression 86.89%, and Gradient Boosting 85.25%
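- Cross-validated ROC-AUC on the training set was close for Logistic Regression (0.8893) and Random Forest (0.8882) but lower for Gradient Boosting (0.8536), which also had the largest gap between train accuracy (99.17%) and test accuracy (85.25%), suggesting overfitting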
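- Hyperparameter tuning improved Random Forest test accuracy from 88.52% to 90.16% (one additional correct prediction on the 61-sample test set) and CV ROC-AUC from 0.8882 to 0.8940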
- Random Forest feature importance ranked `thal`, `cp`, `thalach`, `ca`, and `oldpeak` as the top five predictors, together accounting for about 63% of total importance
- Model tracking with MLflow enabled systematic comparison and selection
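
With only 61 test samples, small accuracy differences should be read with care. The sketch below estimates the binomial standard error of the tuned model's test accuracy and compares it with the baseline-to-tuned improvement. The fractions 54/61 and 55/61 are inferred from the reported accuracies (88.52% and 90.16%) and the test set size; the function name `accuracy_std_error` is illustrative.

```python
import math

n_test = 61


def accuracy_std_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


# Inferred from the reported accuracies: 54 and 55 correct out of 61.
baseline_acc = 54 / n_test
tuned_acc = 55 / n_test

std_error = accuracy_std_error(tuned_acc, n_test)
improvement = tuned_acc - baseline_acc

print(f"Tuned accuracy:  {tuned_acc:.4f} +/- {std_error:.4f} (1 std. error)")
print(f"Improvement:     {improvement:.4f}")
print(f"Within one standard error: {improvement < std_error}")
```

This treats the test predictions as independent draws and ignores that both models were scored on the same samples, so it is only a rough gauge; a paired test or repeated splits would give a firmer answer.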

### Experiment Tracking:
- All runs logged to a local MLflow file store (`../mlruns`) with a fixed random seed (`RANDOM_STATE = 42`) for reproducibility
- Parameters, metrics, and model artifacts systematically captured
- Confusion matrices and ROC curves generated for performance analysis
- Model signatures created for deployment readiness

### Learnings:
- Structured experiment tracking is essential for model comparison
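- Hyperparameter tuning gave a modest gain for Random Forest (test accuracy 88.52% to 90.16%, test ROC-AUC 0.9491 to 0.9567); with only 61 test samples, the difference should be read cautiously
- Cross-validation gave more conservative estimates (ROC-AUC 0.85 to 0.89) than the single test split (0.95 to 0.96), so both should be reported when judging generalization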
- MLflow integration enables seamless transition from development to deployment