{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Indonesia Heart Attack Prediction\n",
    "## Notebook 6: Predictive Modeling\n",
    "\n",
    "---\n",
    "### Stage 6 of the Data Science Life Cycle\n",
    "Build several classification models, run cross-validation, do simple hyperparameter tuning, and save the best model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Import libraries and load the data and preprocessor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report\n",
    "import joblib\n",
    "\n",
    "# Load the data and separate features from the target\n",
    "df = pd.read_csv('../data/heart.csv')\n",
    "target_col = 'target'\n",
    "X = df.drop(columns=[target_col])\n",
    "y = df[target_col]\n",
    "\n",
    "# Load the saved preprocessor (if it exists)\n",
    "preprocessor = None\n",
    "try:\n",
    "    preprocessor = joblib.load('../models/preprocessor.pkl')\n",
    "    print('Loaded preprocessor from ../models/preprocessor.pkl')\n",
    "except Exception as e:\n",
    "    # preprocessor stays None here, so the pipelines below would run without preprocessing\n",
    "    print('Could not load preprocessor, please run notebook 05 to create it. Error:', e)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Split data (train / test)\n",
    "Use a stratified split to preserve the class proportions in both sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)\n",
    "X_train.shape, X_test.shape, y_train.value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Baseline model: Logistic Regression\n",
    "Build a pipeline that combines the preprocessor and the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipe_lr = Pipeline(steps=[\n",
    "    ('preprocessor', preprocessor),\n",
    "    ('clf', LogisticRegression(max_iter=1000))\n",
    "])\n",
    "pipe_lr.fit(X_train, y_train)\n",
    "y_pred = pipe_lr.predict(X_test)\n",
    "y_proba = pipe_lr.predict_proba(X_test)[:,1]\n",
    "print('Accuracy:', accuracy_score(y_test, y_pred))\n",
    "print('Recall:', recall_score(y_test, y_pred))\n",
    "print('Precision:', precision_score(y_test, y_pred))\n",
    "print('ROC AUC:', roc_auc_score(y_test, y_proba))"
   ]
  },
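  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The single train/test split above gives only one estimate of baseline performance. A minimal sketch of a cross-validated estimate, assuming `pipe_lr`, `X_train`, and `y_train` from the cells above. The names `cv_lr` and `scores_lr` are illustrative, and the 5-fold stratified setup is an assumption, not a setting prescribed by this notebook:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import StratifiedKFold, cross_val_score\n",
    "\n",
    "# Assumed setup: 5 stratified folds, shuffled with a fixed seed for reproducibility\n",
    "cv_lr = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)\n",
    "# cross_val_score clones pipe_lr for every fold, so the fitted pipeline above is left untouched\n",
    "scores_lr = cross_val_score(pipe_lr, X_train, y_train, cv=cv_lr, scoring='roc_auc')\n",
    "print('LR CV ROC AUC: %.3f +/- %.3f' % (scores_lr.mean(), scores_lr.std()))"
   ]
  },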
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Tree-based model: Random Forest\n",
    "Try a RandomForest with a small hyperparameter grid, then compare it with the baseline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipe_rf = Pipeline(steps=[\n",
    "    ('preprocessor', preprocessor),\n",
    "    ('clf', RandomForestClassifier(random_state=42))\n",
    "])\n",
    "param_grid = {\n",
    "    'clf__n_estimators':[100,200],\n",
    "    'clf__max_depth':[None,10,20]\n",
    "}\n",
    "cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)\n",
    "gs_rf = GridSearchCV(pipe_rf, param_grid, cv=cv, scoring='roc_auc', n_jobs=-1)\n",
    "gs_rf.fit(X_train, y_train)\n",
    "print('Best RF params:', gs_rf.best_params_)\n",
    "best_rf = gs_rf.best_estimator_\n",
    "y_pred_rf = best_rf.predict(X_test)\n",
    "y_proba_rf = best_rf.predict_proba(X_test)[:,1]\n",
    "print('RF ROC AUC:', roc_auc_score(y_test, y_proba_rf))"
   ]
  },
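  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To judge how far apart the candidate settings were, the grid search results can be tabulated. A small sketch assuming `gs_rf` from the cell above; `cv_table` is an illustrative name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# cv_results_ holds one row per parameter combination tried by the grid search\n",
    "cv_table = pd.DataFrame(gs_rf.cv_results_)[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]\n",
    "cv_table.sort_values('rank_test_score')"
   ]
  },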
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Model Evaluation & Comparison\n",
    "Compute the key metrics for each model (accuracy, precision, recall, f1, roc_auc) on the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "models = {\n",
    "    'LogisticRegression': pipe_lr,\n",
    "    'RandomForest': best_rf\n",
    "}\n",
    "results = []\n",
    "for name, m in models.items():\n",
    "    y_pred = m.predict(X_test)\n",
    "    y_proba = m.predict_proba(X_test)[:,1]\n",
    "    res = {\n",
    "        'model': name,\n",
    "        'accuracy': accuracy_score(y_test, y_pred),\n",
    "        'precision': precision_score(y_test, y_pred),\n",
    "        'recall': recall_score(y_test, y_pred),\n",
    "        'f1': f1_score(y_test, y_pred),\n",
    "        'roc_auc': roc_auc_score(y_test, y_proba)\n",
    "    }\n",
    "    results.append(res)\n",
    "pd.DataFrame(results).sort_values(by='roc_auc', ascending=False)"
   ]
  },
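  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Summary metrics hide which kind of error each model makes. A short sketch, assuming the `models` dict and the test split from the cells above, that prints a confusion matrix and a per-class report for every model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import classification_report, confusion_matrix\n",
    "\n",
    "for name, m in models.items():\n",
    "    y_pred = m.predict(X_test)\n",
    "    print(name)\n",
    "    # Rows are true classes, columns are predicted classes\n",
    "    print(confusion_matrix(y_test, y_pred))\n",
    "    print(classification_report(y_test, y_pred))"
   ]
  },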
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Saving the best model\n",
    "Save the best model (for example, the tuned RandomForest) for deployment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_model = best_rf\n",
    "joblib.dump(best_model, '../models/best_model.pkl')\n",
    "print('Best model saved to ../models/best_model.pkl')"
   ]
  },
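  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before relying on the saved file, it may be worth checking that it loads and behaves like the in-memory model. A sketch assuming `best_model` and `X_test` from the cells above; `loaded_model` is an illustrative name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import joblib\n",
    "import numpy as np\n",
    "\n",
    "loaded_model = joblib.load('../models/best_model.pkl')\n",
    "# The reloaded pipeline should reproduce the in-memory model's probabilities\n",
    "assert np.allclose(loaded_model.predict_proba(X_test), best_model.predict_proba(X_test))\n",
    "print('Reloaded model reproduces the original predictions')"
   ]
  },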
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Notes\n",
    "- If the dataset is small, guard against overfitting with cross-validation and regularization.\n",
    "- Consider SMOTE or class weights if the class imbalance is significant.\n",
    "\n",
    "### Next Steps:"
   ]
  }
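  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to try the class-weight idea is the `class_weight='balanced'` option of `LogisticRegression`. A hedged sketch, assuming `preprocessor` was loaded successfully and the train/test split from the earlier cells is available; `pipe_lr_bal` is an illustrative name, and whether reweighting helps depends on how imbalanced the target actually is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import precision_score, recall_score\n",
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "pipe_lr_bal = Pipeline(steps=[\n",
    "    ('preprocessor', preprocessor),\n",
    "    # 'balanced' weights each class inversely to its frequency in y_train\n",
    "    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))\n",
    "])\n",
    "pipe_lr_bal.fit(X_train, y_train)\n",
    "y_pred_bal = pipe_lr_bal.predict(X_test)\n",
    "print('Recall (balanced):', recall_score(y_test, y_pred_bal))\n",
    "print('Precision (balanced):', precision_score(y_test, y_pred_bal))"
   ]
  }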