{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Indonesia Heart Attack Prediction\n",
    "## Notebook 5: Feature Engineering\n",
    "\n",
    "---\n",
    "### Stage 5 of the Data Science Life Cycle\n",
    "This notebook prepares the features so the model can learn effectively: encoding, scaling, creating derived features, and selecting features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Import libraries and data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.feature_selection import SelectKBest, f_classif\n",
    "\n",
    "# Load the dataset (adjust the path if needed)\n",
    "df = pd.read_csv('../data/heart.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Inspect features and data types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()\n",
    "df.describe().T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Define numerical and categorical features\n",
    "Adjust the lists below if your dataset uses different column names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example feature lists (change them to match the actual columns)\n",
    "target_col = 'target'  # change if the target column has a different name\n",
    "num_features = ['age','trestbps','chol','thalach','oldpeak']\n",
    "cat_features = ['sex','fbs','restecg','exang','slope','ca','thal']\n",
    "\n",
    "print('Numerical:', num_features)\n",
    "print('Categorical:', cat_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Preprocessing pipeline\n",
    "Steps: impute missing values, one-hot encode the categorical features, and standardize the numerical features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numeric_transformer = Pipeline(steps=[\n",
    "    ('imputer', SimpleImputer(strategy='median')),\n",
    "    ('scaler', StandardScaler())\n",
    "])\n",
    "categorical_transformer = Pipeline(steps=[\n",
    "    ('imputer', SimpleImputer(strategy='most_frequent')),\n",
    "    ('encoder', OneHotEncoder(handle_unknown='ignore'))\n",
    "])\n",
    "preprocessor = ColumnTransformer(\n",
    "    transformers=[\n",
    "        ('num', numeric_transformer, num_features),\n",
    "        ('cat', categorical_transformer, cat_features)\n",
    "    ]\n",
    ")\n",
    "preprocessor"
   ]
  },
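  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what a preprocessor of this shape produces, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame, its values, and the name `toy_pre` are assumptions for illustration and do not come from heart.csv; only the transformer structure mirrors the cell above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
    "\n",
    "# Hypothetical toy data (not from heart.csv): one numeric and one categorical column, each with a gap\n",
    "toy = pd.DataFrame({\n",
    "    'age': [40.0, 50.0, np.nan, 60.0],\n",
    "    'sex': ['M', 'F', 'F', np.nan],\n",
    "})\n",
    "\n",
    "# Same structure as the preprocessor above, restricted to the two toy columns\n",
    "toy_pre = ColumnTransformer(transformers=[\n",
    "    ('num', Pipeline(steps=[\n",
    "        ('imputer', SimpleImputer(strategy='median')),\n",
    "        ('scaler', StandardScaler())\n",
    "    ]), ['age']),\n",
    "    ('cat', Pipeline(steps=[\n",
    "        ('imputer', SimpleImputer(strategy='most_frequent')),\n",
    "        ('encoder', OneHotEncoder(handle_unknown='ignore'))\n",
    "    ]), ['sex'])\n",
    "])\n",
    "\n",
    "# One scaled numeric column plus one column per category level\n",
    "toy_out = toy_pre.fit_transform(toy)\n",
    "print(toy_out.shape)"
   ]
  },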
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Derived features (optional)\n",
    "Example: the ratio of cholesterol to age, or interactions between features where relevant."
   ]
  },
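  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example derived feature: cholesterol-to-age ratio (+1 avoids division by zero)\n",
    "df['chol_age_ratio'] = df['chol'] / (df['age'] + 1)\n",
    "# Guard the append so re-running this cell does not add a duplicate column name\n",
    "if 'chol_age_ratio' not in num_features:\n",
    "    num_features.append('chol_age_ratio')\n",
    "df[['chol','age','chol_age_ratio']].head()"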
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Initial feature selection\n",
    "We use SelectKBest with the ANOVA F-test (f_classif) as an example to rank the numerical features before modeling. The value of k can be adjusted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Temporary split (before any transform) for feature selection\n",
    "X = df[num_features + cat_features].copy()\n",
    "y = df[target_col].copy()\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)\n",
    "\n",
    "# f_classif rejects missing values, so fill them temporarily with the training medians\n",
    "X_train_num = X_train[num_features].fillna(X_train[num_features].median())\n",
    "selector = SelectKBest(score_func=f_classif, k=min(8, X_train_num.shape[1]))\n",
    "selector.fit(X_train_num, y_train)\n",
    "scores = pd.DataFrame({'feature': X_train_num.columns, 'score': selector.scores_}).sort_values(by='score', ascending=False)\n",
    "scores"
   ]
  },
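  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides the scores, a fitted SelectKBest can report which columns it kept through `get_support()`. The sketch below uses synthetic data; `demo_X`, `demo_y`, and the column names `informative` and `noise` are hypothetical and unrelated to the heart dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.feature_selection import SelectKBest, f_classif\n",
    "\n",
    "# Synthetic example: one column tracks the label closely, the other is pure noise\n",
    "rng = np.random.default_rng(0)\n",
    "demo_y = np.array([0, 1] * 20)\n",
    "demo_X = pd.DataFrame({\n",
    "    'informative': demo_y + rng.normal(scale=0.1, size=40),\n",
    "    'noise': rng.normal(size=40),\n",
    "})\n",
    "\n",
    "# Keep only the single highest-scoring column\n",
    "demo_selector = SelectKBest(score_func=f_classif, k=1).fit(demo_X, demo_y)\n",
    "\n",
    "# get_support() returns a boolean mask aligned with the input columns\n",
    "selected = demo_X.columns[demo_selector.get_support()].tolist()\n",
    "print(selected)"
   ]
  },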
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Save the preprocessor for the modeling pipeline\n",
    "Saving the preprocessor built above is recommended so that the same transformations are applied consistently at inference time."
   ]
  },
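  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import joblib\n",
    "\n",
    "# Fit on the training split only, so no test-set statistics leak into the saved transform\n",
    "preprocessor.fit(X_train)\n",
    "joblib.dump(preprocessor, '../models/preprocessor.pkl')\n",
    "print('Preprocessor saved to ../models/preprocessor.pkl')"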
   ]
  },
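  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to check a saved transformer is to reload it and compare its output with the original. This is a minimal sketch that uses a stand-in StandardScaler and a temporary directory; `demo_scaler`, `demo_X`, and the file name `demo_preprocessor.pkl` are hypothetical and are not the project's actual artifacts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import tempfile\n",
    "\n",
    "import joblib\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Hypothetical stand-in for a fitted preprocessor and its input data\n",
    "demo_X = np.array([[1.0], [2.0], [3.0]])\n",
    "demo_scaler = StandardScaler().fit(demo_X)\n",
    "\n",
    "# Save to a temporary location and load it back\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    demo_path = os.path.join(tmp_dir, 'demo_preprocessor.pkl')\n",
    "    joblib.dump(demo_scaler, demo_path)\n",
    "    reloaded = joblib.load(demo_path)\n",
    "\n",
    "# The reloaded object should reproduce the original transformation\n",
    "same_output = np.allclose(demo_scaler.transform(demo_X), reloaded.transform(demo_X))\n",
    "print(same_output)"
   ]
  },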
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "- Built a preprocessing pipeline for numerical and categorical features.\n",
    "- Created an example derived feature, chol_age_ratio.\n",
    "- Ranked the numerical features with SelectKBest as a reference.\n",
    "- Saved the preprocessor for use in modeling and deployment.\n",
    "\n",
    "### Next Steps:\n",
    "Continue to *Notebook 6: Predictive Modeling* to build and compare models."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}