# Data Preprocessing

Data Name : Cirrhosis
Data Link : https://archive.ics.uci.edu/dataset/878/cirrhosis+patient+survival+prediction+dataset-1

In [141]:
# Import the libraries first

import pandas as pd
import numpy as np

# Handling Missing Values

In [142]:
# Load the cirrhosis data from cirrhosis_data.csv
df = pd.read_csv('./cirrhosis_data.csv')

# Show the first 5 records
df.head()

Unnamed: 0,ID,N_Days,Status,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin
0,1,400,D,,21464,F,Y,,Y,Y,145,,26,156.0,1718,13795,172,190.0,122
1,2,4500,C,D-penicillamine,20617,F,N,,Y,N,11,302.0,414,54.0,73948,11352,88,221.0,106
2,3,1012,D,D-penicillamine,25594,M,N,N,N,S,14,176.0,348,210.0,516,961,55,151.0,12
3,4,1925,D,D-penicillamine,19994,F,N,Y,Y,S,18,,254,,61218,6063,92,183.0,103
4,5,1504,CL,Placebo,40 Years old,F,N,Y,Y,N,34,279.0,353,143.0,671,11315,72,136.0,109


In [143]:
# Show the DataFrame info to identify missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ID             10 non-null     int64  
 1   N_Days         10 non-null     int64  
 2   Status         10 non-null     object 
 3   Drug           8 non-null      object 
 4   Age            10 non-null     object 
 5   Sex            10 non-null     object 
 6   Ascites        8 non-null      object 
 7   Hepatomegaly   8 non-null      object 
 8   Spiders        9 non-null      object 
 9   Edema          10 non-null     object 
 10  Bilirubin      10 non-null     int64  
 11  Cholesterol    8 non-null      float64
 12  Albumin        10 non-null     int64  
 13  Copper         8 non-null      float64
 14  Alk_Phos       10 non-null     int64  
 15  SGOT           10 non-null     int64  
 16  Tryglicerides  10 non-null     int64  
 17  Platelets      9 non-null      float64
 18  Prothrombin  

The output shows the type of each feature and how many of its values are missing, so we can first list the affected columns:
Object:
1. Drug (2 null)
2. Ascites (2 null)
3. Hepatomegaly (2 null)
4. Spiders (1 null)

Float:
1. Cholesterol (2 null)
2. Copper (2 null)
3. Platelets (1 null)

We can now handle these missing values.
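
The tally above can also be derived programmatically instead of being read off `df.info()`. The following is a minimal sketch on a hypothetical toy frame (the `toy` DataFrame and its values are made up for illustration and are not the cirrhosis data); it counts nulls per column and splits the affected columns into non-numeric and float groups.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "Drug": ["Placebo", None, "D-penicillamine", None],
    "Spiders": ["Y", "N", None, "N"],
    "Cholesterol": [302.0, np.nan, 176.0, np.nan],
    "N_Days": [400, 4500, 1012, 1925],
})

# Count nulls per column and keep only the columns that have at least one.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0]

# Split the affected columns by type, mirroring the Object/Float lists.
object_cols = [c for c in null_counts.index
               if not pd.api.types.is_numeric_dtype(toy[c])]
float_cols = [c for c in null_counts.index
              if pd.api.types.is_float_dtype(toy[c])]

print(null_counts)
print("Object:", object_cols)
print("Float:", float_cols)
```

Running the same three steps on `df` should reproduce the lists above, assuming the columns were loaded with the dtypes shown by `df.info()`.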

In [144]:
# Replace missing float values with the column mean.
# The result is assigned back to the column because calling fillna(..., inplace=True)
# on a column selection raises a FutureWarning and will stop working in pandas 3.0.
df['Cholesterol'] = df['Cholesterol'].fillna(df['Cholesterol'].mean())
df['Copper'] = df['Copper'].fillna(df['Copper'].mean())
df['Platelets'] = df['Platelets'].fillna(df['Platelets'].mean())

# Replace missing object values with the mode
df['Drug'] = df['Drug'].fillna(df['Drug'].mode()[0])
df['Ascites'] = df['Ascites'].fillna(df['Ascites'].mode()[0])
df['Hepatomegaly'] = df['Hepatomegaly'].fillna(df['Hepatomegaly'].mode()[0])
df['Spiders'] = df['Spiders'].fillna(df['Spiders'].mode()[0])
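
The mean-for-floats, mode-for-objects strategy used in this cell can also be written once and applied by column type. This is only a sketch under assumptions: the helper name `impute_simple` and the `toy` frame are hypothetical, and the notebook itself fills each column explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Drug": ["Placebo", None, "Placebo", "D-penicillamine"],
    "Copper": [156.0, 54.0, np.nan, 210.0],
    "N_Days": [400, 4500, 1012, 1925],
})

def impute_simple(frame):
    """Return a copy with numeric nulls set to the column mean
    and all other nulls set to the column mode."""
    out = frame.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() ignores nulls by default; take the first mode on ties.
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

filled = impute_simple(toy)
print(filled)
```

Assigning the filled column back (`out[col] = ...`) avoids the chained `inplace=True` pattern that pandas warns about, and working on a copy leaves the input frame unchanged.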

In [145]:
df.head()

Unnamed: 0,ID,N_Days,Status,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin
0,1,400,D,D-penicillamine,21464,F,Y,N,Y,Y,145,296.125,26,156.0,1718,13795,172,190.0,122
1,2,4500,C,D-penicillamine,20617,F,N,N,Y,N,11,302.0,414,54.0,73948,11352,88,221.0,106
2,3,1012,D,D-penicillamine,25594,M,N,N,N,S,14,176.0,348,210.0,516,961,55,151.0,12
3,4,1925,D,D-penicillamine,19994,F,N,Y,Y,S,18,296.125,254,110.5,61218,6063,92,183.0,103
4,5,1504,CL,Placebo,40 Years old,F,N,Y,Y,N,34,279.0,353,143.0,671,11315,72,136.0,109


In [146]:
# Show the DataFrame info to confirm that no missing values remain
df.info()

# Convert the '40 Years old' entry in Age to days (40 * 365 = 14600).
# An exact match is used so that other ages containing the digits "40" are left untouched.
df.loc[df['Age'] == '40 Years old', 'Age'] = str(40 * 365)

# Check the data again
# df.head()

# Show the whole DataFrame
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ID             10 non-null     int64  
 1   N_Days         10 non-null     int64  
 2   Status         10 non-null     object 
 3   Drug           10 non-null     object 
 4   Age            10 non-null     object 
 5   Sex            10 non-null     object 
 6   Ascites        10 non-null     object 
 7   Hepatomegaly   10 non-null     object 
 8   Spiders        10 non-null     object 
 9   Edema          10 non-null     object 
 10  Bilirubin      10 non-null     int64  
 11  Cholesterol    10 non-null     float64
 12  Albumin        10 non-null     int64  
 13  Copper         10 non-null     float64
 14  Alk_Phos       10 non-null     int64  
 15  SGOT           10 non-null     int64  
 16  Tryglicerides  10 non-null     int64  
 17  Platelets      10 non-null     float64
 18  Prothrombin  

Unnamed: 0,ID,N_Days,Status,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin
0,1,400,D,D-penicillamine,21464,F,Y,N,Y,Y,145,296.125,26,156.0,1718,13795,172,190.0,122
1,2,4500,C,D-penicillamine,20617,F,N,N,Y,N,11,302.0,414,54.0,73948,11352,88,221.0,106
2,3,1012,D,D-penicillamine,25594,M,N,N,N,S,14,176.0,348,210.0,516,961,55,151.0,12
3,4,1925,D,D-penicillamine,19994,F,N,Y,Y,S,18,296.125,254,110.5,61218,6063,92,183.0,103
4,5,1504,CL,Placebo,14600,F,N,Y,Y,N,34,279.0,353,143.0,671,11315,72,136.0,109
5,6,2503,D,Placebo,24201,F,N,Y,Y,N,8,248.0,398,50.0,944,93,63,223.444444,11
6,7,1832,C,D-penicillamine,20284,F,N,Y,N,N,1,322.0,409,52.0,824,6045,213,204.0,97
7,8,2466,D,Placebo,19379,F,N,N,N,N,3,280.0,1256,110.5,46512,2838,189,373.0,11
8,9,2400,D,D-penicillamine,15526,F,N,N,Y,N,32,562.0,308,79.0,2276,14415,88,251.0,11
9,10,51,D,Placebo,25772,F,N,N,Y,Y,126,200.0,274,140.0,918,14725,143,302.0,115
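
The Age fix above targets the single value `'40 Years old'`. If other rows carried ages in years, the conversion could be generalized along the following lines. This is a hedged sketch: the helper `age_to_days`, the constant `DAYS_PER_YEAR`, and the sample `ages` Series are hypothetical, and the 365-day year is the same approximation as the `40 * 365` conversion.

```python
import pandas as pd

DAYS_PER_YEAR = 365  # same approximation as the 40 * 365 conversion

def age_to_days(value):
    """Convert '<n> Years old' to days; return plain day counts as integers."""
    text = str(value).strip()
    if text.lower().endswith("years old"):
        years = int(text.split()[0])  # leading token is the number of years
        return years * DAYS_PER_YEAR
    return int(text)

# Hypothetical sample mixing day counts and a value expressed in years.
ages = pd.Series(["21464", "40 Years old", "24201"])
ages_in_days = ages.map(age_to_days)
print(ages_in_days.tolist())  # [21464, 14600, 24201]
```

Unlike a substring replacement, this only rewrites values that end in "Years old", and it also leaves the column numeric rather than as strings.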
