# **1. Dataset Introduction**

This document contains a manual experiment on the Heart Disease dataset.

As a first step, find and use a dataset that meets the following requirement:

1. **Dataset Source**:  
   The dataset can be obtained from public repositories (*Kaggle*, *UCI ML Repository*, *Open Data*) or from primary data that you collect yourself.


# **2. Import Libraries**

In this step, import the Python libraries needed for data analysis and for building machine learning or deep learning models.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
import os

# **3. Load the Dataset**

In this step, load the dataset into the notebook.

In [None]:
# Load dataset
df = pd.read_csv('../heart_disease_raw/heart_disease.csv')
df.head()
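
One detail worth checking at load time: by default `pd.read_csv` treats a built-in list of strings as missing-value markers, and in recent pandas versions that list includes the literal text `None`. Because `Alcohol Consumption` uses `None` as a real category later in this notebook, such rows may be read as NaN and later filled by the imputer. The sketch below uses a small in-memory CSV with made-up rows (not the real file) and assumes that only empty fields should count as missing; whether that matches the actual dataset needs to be verified.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for heart_disease.csv
sample_csv = io.StringIO(
    "Age,Alcohol Consumption\n"
    "56,None\n"
    "69,High\n"
    "46,\n"
)

# keep_default_na=False disables the built-in marker list;
# na_values=[""] then treats only empty fields as missing.
sample = pd.read_csv(sample_csv, keep_default_na=False, na_values=[""])

print(sample["Alcohol Consumption"].tolist())
print(sample["Alcohol Consumption"].isna().sum())
```

If the real file also marks missing values with other tokens, those tokens would have to be added to `na_values` explicitly.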

# **4. Exploratory Data Analysis (EDA)**

In this step, you perform **Exploratory Data Analysis (EDA)** to understand the characteristics of the dataset.

In [None]:
# Data info (df.info() prints its report directly and returns None, so it is not wrapped in print)
df.info()

# Descriptive statistics
print(df.describe())

# Check missing values
print(df.isnull().sum())

# Target visualization
sns.countplot(x='Heart Disease Status', data=df)
plt.title('Distribution of Heart Disease Status')
plt.show()

# Correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
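
The raw counts and the count plot can be complemented with relative figures, which are often easier to compare across columns. The following is a minimal sketch on a hypothetical toy frame (the column names mirror the notebook, the values are made up) that computes the percentage of missing values per column and the class proportions of the target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook, values are made up
toy = pd.DataFrame({
    "Age": [56, 69, np.nan, 32, 60],
    "Smoking": ["Yes", "No", "Yes", None, "No"],
    "Heart Disease Status": ["No", "No", "Yes", "No", "No"],
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean().mul(100).round(1)

# Relative class frequencies of the target
class_share = toy["Heart Disease Status"].value_counts(normalize=True)

print(missing_pct)
print(class_share)
```

Applied to the real data, the same two expressions would run on `df` instead of `toy`; a strongly skewed `class_share` is a hint that a stratified split may be worth considering.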

# **5. Data Preprocessing**

Data preprocessing is an important step for ensuring data quality before the data is used in a machine learning model.

In [None]:
# 1. Handling Missing Values
numeric_features = df.select_dtypes(include=['float64', 'int64']).columns
categorical_features = df.select_dtypes(include=['object']).columns

# Median imputation for numeric features
imputer_num = SimpleImputer(strategy='median')
df[numeric_features] = imputer_num.fit_transform(df[numeric_features])

# Mode (most frequent value) imputation for categorical features
imputer_cat = SimpleImputer(strategy='most_frequent')
df[categorical_features] = imputer_cat.fit_transform(df[categorical_features])

# 2. Encoding Categorical Data
# Manual Mapping for Binary/Ordinal
binary_cols = ['Smoking', 'Family Heart Disease', 'Diabetes', 'High Blood Pressure', 
               'Low HDL Cholesterol', 'High LDL Cholesterol', 'Heart Disease Status']

for col in binary_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0})

df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})

ordinal_cols = {'Exercise Habits': ['Low', 'Medium', 'High'],
                'Alcohol Consumption': ['None', 'Low', 'Medium', 'High'],
                'Stress Level': ['Low', 'Medium', 'High'],
                'Sugar Consumption': ['Low', 'Medium', 'High']}

for col, order in ordinal_cols.items():
    # Encode each category by its position in the ordered list; unexpected values become -1
    df[col] = df[col].apply(lambda x: order.index(x) if x in order else -1)

# 3. Splitting Data
X = df.drop('Heart Disease Status', axis=1)
y = df['Heart Disease Status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Shape X_train:", X_train_scaled.shape)
print("Shape X_test:", X_test_scaled.shape)

# Save preprocessed data (optional, for verification)
# Create the output directory first; to_csv fails if the directory does not exist
os.makedirs('heart_disease_preprocessing', exist_ok=True)
pd.DataFrame(X_train_scaled, columns=X.columns).join(y_train.reset_index(drop=True)).to_csv('heart_disease_preprocessing/train.csv', index=False)
pd.DataFrame(X_test_scaled, columns=X.columns).join(y_test.reset_index(drop=True)).to_csv('heart_disease_preprocessing/test.csv', index=False)
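
The preprocessing cell above fits the imputers on the full DataFrame before splitting, so statistics from the test rows flow into the training data. A common alternative, sketched here on synthetic numeric data rather than the real dataset, is to split first (optionally with `stratify` to keep the class ratio similar in both parts) and to wrap imputation and scaling in a `Pipeline` that is fitted on the training part only. The arrays and variable names below are assumptions for illustration, not part of the original experiment.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 rows, 3 numeric features, roughly 10% missing, 20/80 target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan
y_demo = np.array([1] * 20 + [0] * 80)

# Split first; stratify keeps the 20/80 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Median imputation followed by scaling, learned from the training rows only
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_tr_ready = numeric_pipeline.fit_transform(X_tr)
X_te_ready = numeric_pipeline.transform(X_te)  # reuses the training statistics

print(X_tr_ready.shape, X_te_ready.shape)
print(y_tr.mean(), y_te.mean())
```

For the mixed numeric and categorical columns of the real dataset, the same idea would need a separate branch per column type (for example via scikit-learn's `ColumnTransformer`), which this sketch leaves out.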