In [1]:
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Indonesia Heart Attack Prediction\n",
    "## Notebook 2: Data Mining\n",
    "\n",
    "---\n",
    "\n",
    "### Tahap 2 dari Data Science Life Cycle\n",
    "\n",
    "Pada tahap ini, kita akan:\n",
    "1. Load dataset dari file CSV\n",
    "2. Melihat struktur dan karakteristik data\n",
    "3. Memahami setiap variabel dalam dataset\n",
    "4. Melakukan initial assessment terhadap kualitas data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data manipulation\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Visualization\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# System utilities\n",
    "import sys\n",
    "import os\n",
    "\n",
    "# Add src to path\n",
    "sys.path.append('../src')\n",
    "\n",
    "# Import custom modules\n",
    "from data_preprocessing import DataPreprocessor, get_column_types, data_quality_report\n",
    "\n",
    "# Settings\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_rows', 100)\n",
    "plt.style.use('seaborn-v0_8-darkgrid')\n",
    "sns.set_palette('husl')\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "print(\"Libraries imported successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize preprocessor\n",
    "preprocessor = DataPreprocessor()\n",
    "\n",
    "# Load data\n",
    "df = preprocessor.load_data('../data/heart_attack_data.csv')\n",
    "\n",
    "print(f\"\\nDataset shape: {df.shape}\")\n",
    "print(f\"Total records: {df.shape[0]}\")\n",
    "print(f\"Total features: {df.shape[1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Initial Data Inspection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 First Few Rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"First 10 rows of the dataset:\")\n",
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Last Few Rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Last 5 rows of the dataset:\")\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 Dataset Information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Dataset Information:\")\n",
    "print(\"=\"*60)\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.4 Column Names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"All column names:\")\n",
    "print(\"=\"*60)\n",
    "for i, col in enumerate(df.columns, 1):\n",
    "    print(f\"{i:2d}. {col}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Data Dictionary\n",
    "\n",
    "### Demographics (4 features)\n",
    "1. **age**: Usia individu (25-90 tahun)\n",
    "2. **gender**: Jenis kelamin (Male, Female)\n",
    "3. **region**: Area tempat tinggal (Urban, Rural)\n",
    "4. **income_level**: Status sosial ekonomi (Low, Middle, High)\n",
    "\n",
    "### Clinical Risk Factors (6 features)\n",
    "5. **hypertension**: Tekanan darah tinggi (1=Yes, 0=No)\n",
    "6. **diabetes**: Diabetes terdiagnosis (1=Yes, 0=No)\n",
    "7. **cholesterol_level**: Total kadar kolesterol (mg/dL)\n",
    "8. **obesity**: BMI > 30 (1=Yes, 0=No)\n",
    "9. **waist_circumference**: Lingkar pinggang (cm)\n",
    "10. **family_history**: Riwayat keluarga penyakit jantung (1=Yes, 0=No)\n",
    "\n",
    "### Lifestyle & Behavioral Factors (4 features)\n",
    "11. **smoking_status**: Kebiasaan merokok (Never, Past, Current)\n",
    "12. **alcohol_consumption**: Konsumsi alkohol (None, Moderate, High)\n",
    "13. **physical_activity**: Tingkat aktivitas fisik (Low, Moderate, High)\n",
    "14. **dietary_habits**: Kualitas diet (Healthy, Unhealthy)\n",
    "\n",
    "### Environmental & Social Factors (3 features)\n",
    "15. **air_pollution_exposure**: Paparan polusi udara (Low, Moderate, High)\n",
    "16. **stress_level**: Tingkat stress (Low, Moderate, High)\n",
    "17. **sleep_hours**: Rata-rata jam tidur per malam (3-9 jam)\n",
    "\n",
    "### Medical Screening & Health System Factors (10 features)\n",
    "18. **blood_pressure_systolic**: Tekanan darah sistolik (mmHg)\n",
    "19. **blood_pressure_diastolic**: Tekanan darah diastolik (mmHg)\n",
    "20. **fasting_blood_sugar**: Kadar gula darah (mg/dL)\n",
    "21. **cholesterol_hdl**: Kadar HDL cholesterol (mg/dL)\n",
    "22. **cholesterol_ldl**: Kadar LDL cholesterol (mg/dL)\n",
    "23. **triglycerides**: Kadar trigliserida (mg/dL)\n",
    "24. **EKG_results**: Hasil elektrokardiogram (Normal, Abnormal)\n",
    "25. **previous_heart_disease**: Penyakit jantung sebelumnya (1=Yes, 0=No)\n",
    "26. **medication_usage**: Penggunaan obat jantung (1=Yes, 0=No)\n",
    "27. **participated_in_free_screening**: Ikut program screening gratis (1=Yes, 0=No)\n",
    "\n",
    "### Target Variable (1 feature)\n",
    "28. **heart_attack**: Kejadian serangan jantung (1=Yes, 0=No)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Data Types Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get column types\n",
    "column_types = get_column_types(df)\n",
    "\n",
    "print(\"Numerical Columns:\")\n",
    "print(\"=\"*60)\n",
    "for i, col in enumerate(column_types['numerical'], 1):\n",
    "    print(f\"{i:2d}. {col}\")\n",
    "\n",
    "print(f\"\\nTotal numerical columns: {len(column_types['numerical'])}\")\n",
    "\n",
    "print(\"\\n\" + \"=\"*60)\n",
    "print(\"Categorical Columns:\")\n",
    "print(\"=\"*60)\n",
    "for i, col in enumerate(column_types['categorical'], 1):\n",
    "    print(f\"{i:2d}. {col}\")\n",
    "\n",
    "print(f\"\\nTotal categorical columns: {len(column_types['categorical'])}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Basic Statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.1 Numerical Features Statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Statistical Summary of Numerical Features:\")\n",
    "print(\"=\"*60)\n",
    "df.describe().T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 Categorical Features Distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Categorical Features Summary:\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "categorical_cols = column_types['categorical']\n",
    "\n",
    "for col in categorical_cols:\n",
    "    print(f\"\\n{col}:\")\n",
    "    print(\"-\" * 40)\n",
    "    print(df[col].value_counts())\n",
    "    print(f\"Unique values: {df[col].nunique()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Target Variable Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Target Variable (heart_attack) Distribution:\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "target_counts = df['heart_attack'].value_counts()\n",
    "target_percentages = df['heart_attack'].value_counts(normalize=True) * 100\n",
    "\n",
    "target_summary = pd.DataFrame({\n",
    "    'Count': target_counts,\n",
    "    'Percentage': target_percentages\n",
    "})\n",
    "\n",
    "target_summary.index = ['No Heart Attack (0)', 'Heart Attack (1)']\n",
    "print(target_summary)\n",
    "\n",
    "# Visualize\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Count plot\n",
    "target_counts.plot(kind='bar', ax=axes[0], color=['lightgreen', 'salmon'])\n",
    "axes[0].set_title('Heart Attack Distribution (Count)', fontsize=14, fontweight='bold')\n",
    "axes[0].set_xlabel('Heart Attack')\n",
    "axes[0].set_ylabel('Count')\n",
    "axes[0].set_xticklabels(['No (0)', 'Yes (1)'], rotation=0)\n",
    "\n",
    "# Add value labels\n",
    "for i, v in enumerate(target_counts):\n",
    "    axes[0].text(i, v + 5, str(v), ha='center', va='bottom', fontweight='bold')\n",
    "\n",
    "# Pie chart\n",
    "axes[1].pie(target_counts, labels=['No Heart Attack', 'Heart Attack'], \n",
    "           autopct='%1.1f%%', colors=['lightgreen', 'salmon'],\n",
    "           startangle=90)\n",
    "axes[1].set_title('Heart Attack Distribution (Percentage)', fontsize=14, fontweight='bold')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Check for class imbalance\n",
    "imbalance_ratio = target_counts.max() / target_counts.min()\n",
    "print(f\"\\nClass imbalance ratio: {imbalance_ratio:.2f}\")\n",
    "\n",
    "if imbalance_ratio > 1.5:\n",
    "    print(\"⚠️  Warning: Class imbalance detected. Consider using resampling techniques.\")\n",
    "else:\n",
    "    print(\"✓ Classes are relatively balanced.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Data Quality Assessment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 8.1 Missing Values Check"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Missing Values Analysis:\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "missing_data = preprocessor.check_missing_values(df)\n",
    "\n",
    "if len(missing_data) == 0:\n",
    "    print(\"✓ No missing values found in the dataset!\")\n",
    "else:\n",
    "    print(\"Missing values found:\")\n",
    "    print(missing_data)\n",
    "    \n",
    "    # Visualize missing values\n",
    "    plt.figure(figsize=(12, 6))\n",
    "    plt.bar(missing_data.index, missing_data['Percentage'])\n",
    "    plt.title('Missing Values Percentage by Column')\n",
    "    plt.xlabel('Columns')\n",
    "    plt.ylabel('Missing Percentage (%)')\n",
    "    plt.xticks(rotation=45, ha='right')\n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 8.2 Duplicate Records Check"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Duplicate Records Check:\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "duplicates = df.duplicated().sum()\n",
    "print(f\"Number of duplicate records: {duplicates}\")\n",
    "\n",
    "if duplicates > 0:\n",
    "    print(f\"Percentage of duplicates: {(duplicates/len(df))*100:.2f}%\")\n",
    "    print(\"⚠️  Duplicates found. Will be removed in data cleaning stage.\")\n",
    "else:\n",
    "    print(\"✓ No duplicate records found!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 8.3 Data Quality Report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Comprehensive Data Quality Report:\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "quality_report = data_quality_report(df)\n",
    "\n",
    "print(f\"\\nTotal Rows: {quality_report['total_rows']}\")\n",
    "print(f\"Total Columns: {quality_report['total_columns']}\")\n",
    "print(f\"Missing Values: {quality_report['missing_values']}\")\n",
    "print(f\"Duplicate Rows: {quality_report['duplicate_rows']}\")\n",
    "print(f\"Memory Usage: {quality_report['memory_usage']:.2f} MB\")\n",
    "print(f\"\\nNumerical Columns: {len(quality_report['column_types']['numerical'])}\")\n",
    "print(f\"Categorical Columns: {len(quality_report['column_types']['categorical'])}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Sample Data Inspection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 9.1 Random Sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Random sample of 10 records:\")\n",
    "df.sample(10, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 9.2 Cases with Heart Attack"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Sample of cases with heart attack (heart_attack = 1):\")\n",
    "df[df['heart_attack'] == 1].head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 9.3 Cases without Heart Attack"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Sample of cases without heart attack (heart_attack = 0):\")\n",
    "df[df['heart_attack'] == 0].head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. Data Overview Visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create summary visualization\n",
    "fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
    "\n",
    "# 1. Data types distribution\n",
    "type_counts = pd.Series({\n",
    "    'Numerical': len(column_types['numerical']),\n",
    "    'Categorical': len(column_types['categorical'])\n",
    "})\n",
    "type_counts.plot(kind='pie', ax=axes[0, 0], autopct='%1.1f%%', startangle=90)\n",
    "axes[0, 0].set_title('Feature Types Distribution', fontsize=12, fontweight='bold')\n",
    "axes[0, 0].set_ylabel('')\n",
    "\n",
    "# 2. Target variable distribution\n",
    "target_counts.plot(kind='bar', ax=axes[0, 1], color=['lightgreen', 'salmon'])\n",
    "axes[0, 1].set_title('Heart Attack Distribution', fontsize=12, fontweight='bold')\n",
    "axes[0, 1].set_xlabel('Heart Attack')\n",
    "axes[0, 1].set_ylabel('Count')\n",
    "axes[0, 1].set_xticklabels(['No', 'Yes'], rotation=0)\n",
    "\n",
    "# 3. Data quality metrics\n",
    "quality_metrics = pd.Series({\n",
    "    'Complete Records': len(df) - quality_report['duplicate_rows'] - quality_report['missing_values'],\n",
    "    'Duplicates': quality_report['duplicate_rows'],\n",
    "    'Missing Values': quality_report['missing_values']\n",
    "})\n",
    "quality_metrics.plot(kind='bar', ax=axes[1, 0], color=['green', 'orange', 'red'])\n",
    "axes[1, 0].set_title('Data Quality Overview', fontsize=12, fontweight='bold')\n",
    "axes[1, 0].set_ylabel('Count')\n",
    "axes[1, 0].set_xticklabels(quality_metrics.index, rotation=45, ha='right')\n",
    "\n",
    "# 4. Feature categories\n",
    "feature_categories = pd.Series({\n",
    "    'Demographics': 4,\n",
    "    'Clinical': 6,\n",
    "    'Lifestyle': 4,\n",
    "    'Environmental': 3,\n",
    "    'Medical Screening': 10,\n",
    "    'Target': 1\n",
    "})\n",
    "feature_categories.plot(kind='barh', ax=axes[1, 1], color='steelblue')\n",
    "axes[1, 1].set_title('Features by Category', fontsize=12, fontweight='bold')\n",
    "axes[1, 1].set_xlabel('Number of Features')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "Pada tahap Data Mining ini, kita telah:\n",
    "\n",
    "1. ✅ **Loaded Dataset**: Berhasil memuat dataset dengan 500 records dan 28 features\n",
    "2. ✅ **Inspected Structure**: Memahami struktur data, tipe data, dan karakteristik setiap feature\n",
    "3. ✅ **Analyzed Data Types**: Mengidentifikasi 19 numerical features dan 9 categorical features\n",
    "4. ✅ **Examined Target Variable**: Menganalisis distribusi target variable (heart_attack)\n",
    "5. ✅ **Assessed Data Quality**: \n",
    "   - Missing values: Check ✓\n",
    "   - Duplicate records: Check ✓\n",
    "   - Data consistency: Check ✓\n",
    "6. ✅ **Created Data Dictionary**: Dokumentasi lengkap untuk setiap variabel\n",
    "7. ✅ **Generated Statistics**: Basic statistics untuk numerical dan categorical features\n",
    "\n",
    "### Key Findings:\n",
    "- Dataset terdiri dari 500 individu dengan 28 features\n",
    "- Target variable: heart_attack (binary: 0=No, 1=Yes)\n",
    "- Data quality: [Sesuaikan dengan hasil actual - misal: No missing values, No duplicates]\n",
    "- Feature categories:\n",
    "  - Demographics: 4 features\n",
    "  - Clinical Risk Factors: 6 features\n",
    "  - Lifestyle Factors: 4 features\n",
    "  - Environmental Factors: 3 features\n",
    "  - Medical Screening: 10 features\n",
    "  - Target: 1 feature\n",
    "\n",
    "### Next Steps:\n",
    "Lanjut ke **Notebook 3: Data Cleaning** untuk membersihkan dan mempersiapkan data untuk analisis.\n",
    "\n",
    "---"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}

NameError: name 'null' is not defined