In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

In [4]:
df = pd.read_csv('CAD.csv')

Displaying the Initial Data

In [5]:
print("Initial data:")
print(df.head())

Initial data:
   No  Age  Weight  Length    Sex  BMI DM HTN Current Smoker EX-Smoker  ...  \
0   1   53      90     175   Male    ?  n   Y              Y         n  ...   
1   2   67      70     157  Fmale  NaN  n   Y              n         n  ...   
2   3   54      54     164   Male  NaN  n   n              Y         n  ...   
3   4   66      67     158  Fmale  NaN  n   Y              n         n  ...   
4   5   50      87     153  Fmale  NaN  n   Y              n         n  ...   

     K   Na    WBC Lymph Neut  PLT EF-TTE Region RWMA     VHD    Cath  
0  4.7  141   5700    39   52  261     50           0       N     Cad  
1  4.7  156   7700    38   55  165     40           4       N     Cad  
2  4.7  139   7400    38   60  230     40           2    mild     Cad  
3  4.4  142  13000    18   72    ?     55           0  Severe  Normal  
4  4.0  140   9200    55   39  274     50           0  Severe  Normal  

[5 rows x 56 columns]


Data Preprocessing

In [6]:
# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing values per column:")
print(missing_values)


Missing values per column:
No                         0
Age                        0
Weight                     0
Length                     0
Sex                        0
BMI                      302
DM                         1
HTN                        0
Current Smoker             4
EX-Smoker                  0
FH                         0
Obesity                    0
CRF                        0
CVA                        0
Airway disease             0
Thyroid Disease            0
CHF                        0
DLP                        0
BP                         0
PR                         0
Edema                      0
Weak Peripheral Pulse      0
Lung rales                 0
Systolic Murmur            0
Diastolic Murmur           0
Typical Chest Pain         0
Dyspnea                    0
Function Class             0
Atypical                   0
Nonanginal                 1
Exertional CP              0
LowTH Ang                  2
Q Wave                     0
St Elevat

In [9]:
# Handle missing values
for col in df.columns:
    if df[col].dtype == 'object':  # Categorical: fill with the mode
        df[col] = df[col].fillna(df[col].mode()[0])
    else:  # Numeric: fill with the median
        df[col] = df[col].fillna(df[col].median())

In [10]:
print("\nMissing values after handling:")
print(df.isnull().sum())


Missing values after handling:
No                       0
Age                      0
Weight                   0
Length                   0
Sex                      0
BMI                      0
DM                       0
HTN                      0
Current Smoker           0
EX-Smoker                0
FH                       0
Obesity                  0
CRF                      0
CVA                      0
Airway disease           0
Thyroid Disease          0
CHF                      0
DLP                      0
BP                       0
PR                       0
Edema                    0
Weak Peripheral Pulse    0
Lung rales               0
Systolic Murmur          0
Diastolic Murmur         0
Typical Chest Pain       0
Dyspnea                  0
Function Class           0
Atypical                 0
Nonanginal               0
Exertional CP            0
LowTH Ang                0
Q Wave                   0
St Elevation             0
St Depression            0
Tinversion  

`df.isnull().sum()` counts the missing values in each column.
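
The preview of the raw data shows `?` in columns such as BMI and PLT. `isnull()` does not treat that string as missing, so those cells are not included in the counts. A minimal sketch of one way to catch them at read time, using the `na_values` parameter of `pd.read_csv`; the three-row CSV below is a made-up sample, not an excerpt from `CAD.csv`:

```python
import io

import pandas as pd

# Hypothetical sample that mimics the `?` placeholders seen in the preview.
csv_text = "Age,BMI,PLT\n53,?,261\n67,,165\n66,27.5,?\n"

# Read as-is: `?` stays a string, so only the truly empty cell counts as missing.
raw = pd.read_csv(io.StringIO(csv_text))
raw_missing = raw.isnull().sum()

# Read again, telling pandas to treat `?` as a missing value.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["?"])
clean_missing = clean.isnull().sum()

print(raw_missing)
print(clean_missing)
print(clean.dtypes)  # BMI and PLT are now parsed as numeric columns
```

With the placeholder declared up front, the affected columns are parsed as numbers, so they would later be imputed with the median instead of being treated as categorical.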

In [11]:
# Remove duplicate rows, if any
duplicate_rows = df.duplicated().sum()
print("\nNumber of duplicate rows:", duplicate_rows)
if duplicate_rows > 0:
    df.drop_duplicates(inplace=True)
    print("Duplicates removed.")


Number of duplicate rows: 0


- Categorical columns (object) → Missing values are filled with the mode (the most frequent value).
- Numeric columns (int or float) → Missing values are filled with the median.
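
The two rules above can be sketched on a small frame. The column names follow the notebook, but the values are made up, and `pd.api.types.is_numeric_dtype` is used here as an assumption in place of the notebook's `dtype == 'object'` check. Each column is assigned back to the frame rather than filled with `inplace=True` on a selected column:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
sample = pd.DataFrame({
    "DM": ["n", "Y", None, "n"],
    "Weight": [90.0, None, 54.0, 67.0],
})

for col in sample.columns:
    if pd.api.types.is_numeric_dtype(sample[col]):
        fill_value = sample[col].median()   # numeric: median
    else:
        fill_value = sample[col].mode()[0]  # categorical: most frequent value
    # Assign the result back to the column.
    sample[col] = sample[col].fillna(fill_value)

print(sample)
```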

In [12]:
# Count outliers in each numeric column using the IQR rule
for col in df.select_dtypes(include=['int64', 'float64']).columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = ((df[col] < lower_bound) | (df[col] > upper_bound)).sum()
    print(f"Outliers in {col}: {outliers}")

Outliers in No: 0
Outliers in Age: 0
Outliers in Weight: 3
Outliers in Length: 0
Outliers in PR: 13
Outliers in Typical Chest Pain: 0
Outliers in Function Class: 0
Outliers in Tinversion: 0
Outliers in FBS: 30
Outliers in CR: 8
Outliers in HDL: 6
Outliers in ESR: 17
Outliers in HB: 7
Outliers in K: 2
Outliers in Na: 15
Outliers in WBC: 9
Outliers in Lymph: 3
Outliers in Neut: 1
Outliers in EF-TTE: 15
Outliers in Region RWMA: 28


`df.duplicated().sum()` counts the duplicate rows.
If any are found, `df.drop_duplicates(inplace=True)` removes them.
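
One caveat: `duplicated()` compares whole rows by default. If `No` is a running row number, as the preview suggests, no two rows can ever match in full. A minimal sketch, on made-up data, of checking the remaining columns only via the `subset` parameter:

```python
import pandas as pd

# Hypothetical sample: rows 1 and 3 describe the same patient values.
sample = pd.DataFrame({
    "No": [1, 2, 3],
    "Age": [53, 67, 53],
    "Sex": ["Male", "Fmale", "Male"],
})

# Whole-row comparison: the unique `No` hides the repeat.
full_row_duplicates = int(sample.duplicated().sum())

# Compare every column except the identifier.
feature_cols = [c for c in sample.columns if c != "No"]
feature_duplicates = int(sample.duplicated(subset=feature_cols).sum())

deduplicated = sample.drop_duplicates(subset=feature_cols)
print(full_row_duplicates, feature_duplicates, len(deduplicated))
```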

In [14]:
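# Clip outliers to the IQR bounds. This runs outside the loop above, so `col` and
# the bounds still hold the last column's values: only Region RWMA is clipped.
df[col] = df[col].clip(lower_bound, upper_bound)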
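
To cap every numeric column rather than only the last one, the bounds and the clip need to be computed together for each column. A minimal sketch on made-up data; `clip_iqr` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd


def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical sample with one extreme value per column.
sample = pd.DataFrame({
    "Weight": [60, 62, 65, 67, 70, 150],
    "PR": [70, 72, 75, 78, 80, 30],
})

clipped = sample.copy()
for col in clipped.select_dtypes(include="number").columns:
    # Bounds are recomputed for each column before clipping it.
    clipped[col] = clip_iqr(clipped[col])

print(clipped)
```

Applying this to the notebook's `df` would change the values that reach the later encoding and normalization steps, so the outputs shown further down would no longer match.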

In [15]:
# Convert any remaining categorical columns to numeric codes
label_encoders = {}
for col in df.columns:
    if df[col].dtype == 'object':  # Categorical: encode as integers
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
        label_encoders[col] = le
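
`LabelEncoder` assigns codes in sorted order of the labels, and uppercase letters sort before lowercase ones. That is consistent with the encoded preview, where `Y` becomes 0 and `n` becomes 1 in DM and HTN. A minimal sketch of reading the mapping back from the fitted encoder's `classes_` attribute; the label list is made up:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels with inconsistent capitalization.
values = ["n", "Y", "n", "N", "Y"]

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ holds the sorted labels; the position of a label is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
print(list(codes))
```

In the notebook, the same lookup is available per column, for example `label_encoders["Sex"].classes_`. Inspecting it helps confirm what each code means before interpreting the encoded table.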

In [16]:
print("\nData after preprocessing:")
print(df.head())


Data after preprocessing:
   No  Age  Weight  Length  Sex  BMI  DM  HTN  Current Smoker  EX-Smoker  ...  \
0   1   53      90     175    1    0   1    0               1          1  ...   
1   2   67      70     157    0    0   1    0               3          1  ...   
2   3   54      54     164    1    0   1    1               1          1  ...   
3   4   66      67     158    0    0   1    0               3          1  ...   
4   5   50      87     153    0    0   1    0               3          1  ...   

     K   Na    WBC  Lymph  Neut  PLT  EF-TTE  Region RWMA  VHD  Cath  
0  4.7  141   5700     39    52   92      50          0.0    1     0  
1  4.7  156   7700     38    55   10      40          2.5    1     0  
2  4.7  139   7400     38    60   67      40          2.0    3     0  
3  4.4  142  13000     18    72  134      55          0.0    2     1  
4  4.0  140   9200     55    39  102      50          0.0    2     1  

[5 rows x 56 columns]


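- Age, Weight, Length → Numeric values that keep the patients' original measurements.
- Sex → Converted to a number (0 for Fmale, the dataset's spelling of Female, and 1 for Male).
- BMI, DM, HTN, Current Smoker, EX-Smoker → All are now numeric, but the values are label codes rather than measurements. BMI is 0 in every row shown because the previewed rows held only `?` or NaN.
- Other features such as WBC, Lymph, Neut, and EF-TTE hold the patients' clinical measurements. PLT was read as text because it contains `?`, so it was label-encoded: its values (for example, 261 became 92) are category codes, not platelet counts.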

In [17]:
# Normalize every attribute to the [0, 1] range
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

In [18]:
print("\nData after normalization:")
print(df_normalized.head())


Data after normalization:
         No       Age    Weight    Length  Sex  BMI   DM  HTN  Current Smoker  \
0  0.000000  0.410714  0.583333  0.729167  1.0  0.0  1.0  0.0        0.333333   
1  0.003311  0.660714  0.305556  0.354167  0.0  0.0  1.0  0.0        1.000000   
2  0.006623  0.428571  0.083333  0.500000  1.0  0.0  1.0  1.0        0.333333   
3  0.009934  0.642857  0.263889  0.375000  0.0  0.0  1.0  0.0        1.000000   
4  0.013245  0.357143  0.541667  0.270833  0.0  0.0  1.0  0.0        1.000000   

   EX-Smoker  ...         K        Na       WBC     Lymph      Neut       PLT  \
0        0.5  ...  0.472222  0.464286  0.139860  0.603774  0.350877  0.686567   
1        0.5  ...  0.472222  1.000000  0.279720  0.584906  0.403509  0.074627   
2        0.5  ...  0.472222  0.392857  0.258741  0.584906  0.491228  0.500000   
3        0.5  ...  0.388889  0.500000  0.650350  0.207547  0.701754  1.000000   
4        0.5  ...  0.277778  0.428571  0.384615  0.905660  0.122807  0.761194   
