## **Application for analyzing cirrhosis**



**DATA**

The data analyzed here describe cirrhosis, a liver disease in which healthy liver cells are damaged and, over time, replaced by scar tissue.

The dataset contains 418 records with 20 features:

1. ID: unique identifier
2. N_Days: number of days between registration and the earlier of death, transplantation, or study analysis time in July 1986
3. Status: patient status, C (censored), CL (censored due to liver transplant), or D (death)
4. Drug: type of drug, D-penicillamine or placebo
5. Age: age in [days]
6. Sex: M (male) or F (female)
7. Ascites: presence of ascites, N (No) or Y (Yes)
8. Hepatomegaly: presence of hepatomegaly, N (No) or Y (Yes)
9. Spiders: presence of spiders (spider angiomas), N (No) or Y (Yes)
10. Edema: presence of edema, N (no edema and no diuretic therapy for edema), S (edema present without diuretics, or edema resolved by diuretics), or Y (edema despite diuretic therapy)
11. Bilirubin: serum bilirubin in [mg/dl]
12. Cholesterol: serum cholesterol in [mg/dl]
13. Albumin: albumin in [gm/dl]
14. Copper: urine copper in [ug/day]
15. Alk_Phos: alkaline phosphatase in [U/liter]
16. SGOT: SGOT in [U/ml]
17. Tryglicerides: triglycerides in [mg/dl]
18. Platelets: platelets per cubic [ml/1000]
19. Prothrombin: prothrombin time in seconds [s]
20. Stage: histologic stage of the disease (1, 2, 3, or 4)

Five features are not used: ID, N_Days (the follow-up time in days), Status (whether the patient is still alive or has died), Drug (the drug taken), and Age (age in days). The goal of this analysis is to predict the stage of a patient's cirrhosis, so only the features related to symptoms are needed.

Rows 314 through 418 have NA (empty) values in 9 features, so those rows are not used. Missing values in rows 1 through 313 are filled with the mean of the feature in which they occur.
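The notebook does not show this cleaning step (the CSV loaded below appears to be already cleaned), so the following is only a minimal sketch of how it could be done with pandas. The toy frame, its values, and the names `raw`, `kept`, `n_keep`, and `numeric_cols` are assumptions made for illustration; with the real file the cutoff would be 313 rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw file (values are made up for illustration).
raw = pd.DataFrame({
    "Cholesterol": [261.0, np.nan, 176.0, 244.0, np.nan],
    "Copper": [156.0, 54.0, np.nan, 64.0, np.nan],
    "Stage": [4, 3, 4, 4, 3],
})

n_keep = 4  # hypothetical cutoff; the text above keeps the first 313 rows

# Keep only the leading rows and discard the mostly empty tail.
kept = raw.iloc[:n_keep].copy()

# Fill the remaining gaps in each numeric column with that column's mean.
numeric_cols = ["Cholesterol", "Copper"]
kept[numeric_cols] = kept[numeric_cols].fillna(kept[numeric_cols].mean())

print(kept)
```

The means are computed after the tail rows are dropped, so the discarded rows do not influence the imputed values.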

In [28]:
# import all required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing
from pandas import DataFrame
import pickle


In [29]:
# read the cirrhosis data
pd_crs= pd.read_csv('/content/drive/MyDrive/datamining/tugas/cirrhosis.csv')
# show the first 10 rows
pd_crs.head(10)

Unnamed: 0,ID,N_Days,Status,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
0,1,400,D,D-penicillamine,21464,F,Y,Y,Y,Y,14.5,261,2.6,156,1718.0,137.95,172,190,12.2,4
1,2,4500,C,D-penicillamine,20617,F,N,Y,Y,N,1.1,302,4.14,54,7394.8,113.52,88,221,10.6,3
2,3,1012,D,D-penicillamine,25594,M,N,N,N,S,1.4,176,3.48,210,516.0,96.1,55,151,12.0,4
3,4,1925,D,D-penicillamine,19994,F,N,Y,Y,S,1.8,244,2.54,64,6121.8,60.63,92,183,10.3,4
4,5,1504,CL,Placebo,13918,F,N,Y,Y,N,3.4,279,3.53,143,671.0,113.15,72,136,10.9,3
5,6,2503,D,Placebo,24201,F,N,Y,N,N,0.8,248,3.98,50,944.0,93.0,63,262,11.0,3
6,7,1832,C,Placebo,20284,F,N,Y,N,N,1.0,322,4.09,52,824.0,60.45,213,204,9.7,3
7,8,2466,D,Placebo,19379,F,N,N,N,N,0.3,280,4.0,52,4651.2,28.38,189,373,11.0,3
8,9,2400,D,D-penicillamine,15526,F,N,N,Y,N,3.2,562,3.08,79,2276.0,144.15,88,251,11.0,2
9,10,51,D,Placebo,25772,F,Y,N,Y,Y,12.6,200,2.74,140,918.0,147.25,143,302,11.5,4


### **Preprocessing**

The dataset contains two kinds of columns, categorical (mostly binary) and numeric, so the categorical columns are encoded as integers and the numeric columns are normalized before the data are passed to the models.
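As a rough illustration of those two transformations, the sketch below encodes a three-level column and min-max scales a numeric column on a toy frame. `LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `Edema` ends up as N = 0, S = 1, Y = 2 in the outputs further down; the explicit dictionary reproduces that mapping. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame (illustrative values only, not the real dataset).
toy = pd.DataFrame({
    "Edema": ["Y", "N", "S", "N"],
    "Bilirubin": [14.5, 1.1, 1.4, 0.3],
})

# Integer encoding in sorted label order: N -> 0, S -> 1, Y -> 2.
edema_codes = {"N": 0, "S": 1, "Y": 2}
toy["Edema"] = toy["Edema"].map(edema_codes)

# Min-max scaling: (x - min) / (max - min) maps the column onto [0, 1].
col = toy["Bilirubin"]
toy["Bilirubin"] = (col - col.min()) / (col.max() - col.min())

print(toy)
```

An explicit mapping makes the code assigned to each category visible in the notebook, which helps when a new record has to be encoded by hand later.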

In [30]:
pd_crs.rename(columns = {"Sex": "gender"}, inplace=True)

del(pd_crs['ID'], pd_crs['Status'], pd_crs['Drug'], pd_crs['N_Days'], pd_crs['Age'])
pd_crs.head()

Unnamed: 0,gender,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
0,F,Y,Y,Y,Y,14.5,261,2.6,156,1718.0,137.95,172,190,12.2,4
1,F,N,Y,Y,N,1.1,302,4.14,54,7394.8,113.52,88,221,10.6,3
2,M,N,N,N,S,1.4,176,3.48,210,516.0,96.1,55,151,12.0,4
3,F,N,Y,Y,S,1.8,244,2.54,64,6121.8,60.63,92,183,10.3,4
4,F,N,Y,Y,N,3.4,279,3.53,143,671.0,113.15,72,136,10.9,3


In [31]:
# binary conversion values (manual encoding, disabled in favor of LabelEncoder below)
# one = 1
# zero = 0

# def biner_gender(gender):
#   return one if gender == 'M' else zero
# def biner_ascites(Ascites):
#   return one if Ascites == 'Y' else zero
# def biner_hepa(Hepatomegaly):
#   return one if Hepatomegaly == 'Y' else zero
# def biner_spi(Spiders):
#   return one if Spiders == 'Y' else zero
# def biner_ede(Edema):
#   return one if Edema == 'Y' else zero


# # Update the values of the categorical columns
# pd_crs['gender'] = pd_crs['gender'].apply(biner_gender)
# pd_crs['Ascites'] = pd_crs['Ascites'].apply(biner_ascites)
# pd_crs['Hepatomegaly'] = pd_crs['Hepatomegaly'].apply(biner_hepa)
# pd_crs['Spiders'] = pd_crs['Spiders'].apply(biner_spi)
# pd_crs['Edema'] = pd_crs['Edema'].apply(biner_ede)

# pd_crs.head()


In [32]:
# label_encoder object
label_encoder = preprocessing.LabelEncoder()
  
# Encode labels in column 'gender'.
pd_crs['gender']= label_encoder.fit_transform(pd_crs['gender'])
pd_crs['gender'].unique()

# Encode labels in column 'Ascites'.
pd_crs['Ascites']= label_encoder.fit_transform(pd_crs['Ascites'])
pd_crs['Ascites'].unique()

# Encode labels in column 'Hepatomegaly'.
pd_crs['Hepatomegaly']= label_encoder.fit_transform(pd_crs['Hepatomegaly'])
pd_crs['Hepatomegaly'].unique()

# Encode labels in column 'Spiders'.
pd_crs['Spiders']= label_encoder.fit_transform(pd_crs['Spiders'])
pd_crs['Spiders'].unique()

# Encode labels in column 'Edema'.
pd_crs['Edema']= label_encoder.fit_transform(pd_crs['Edema'])
pd_crs['Edema'].unique()

pd_crs.head(100)

Unnamed: 0,gender,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
0,0,1,1,1,2,14.5,261,2.60,156,1718.0,137.95,172,190,12.2,4
1,0,0,1,1,0,1.1,302,4.14,54,7394.8,113.52,88,221,10.6,3
2,1,0,0,0,1,1.4,176,3.48,210,516.0,96.10,55,151,12.0,4
3,0,0,1,1,1,1.8,244,2.54,64,6121.8,60.63,92,183,10.3,4
4,0,0,1,1,0,3.4,279,3.53,143,671.0,113.15,72,136,10.9,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,1.0,370,3.70,33,1258.0,99.20,125,338,10.4,3
96,1,0,1,0,1,2.0,420,3.26,62,3196.0,77.50,91,344,11.4,3
97,0,0,0,0,0,1.0,239,3.77,77,1877.0,97.65,101,312,10.2,1
98,1,0,0,0,0,1.8,460,3.35,148,1472.0,108.50,118,172,10.2,2


In [33]:
scaler = MinMaxScaler()
crs_new = pd.DataFrame(pd_crs, columns=['Bilirubin', 'Cholesterol', 'Albumin', 'Copper',
                                        'Alk_Phos', 'SGOT', 'Tryglicerides', 'Platelets', 'Prothrombin'])
scaler.fit(crs_new)
crs_new = scaler.transform(crs_new)

crs_new = DataFrame(crs_new)


In [34]:
pd_crs_stage = pd.DataFrame(pd_crs, columns = ['Stage'])
del(pd_crs['Stage'], pd_crs['Bilirubin'], pd_crs['Cholesterol'], pd_crs['Albumin'], pd_crs['Copper'],
    pd_crs['Alk_Phos'], pd_crs['SGOT'], pd_crs['Tryglicerides'], pd_crs['Platelets'], pd_crs['Prothrombin'])

pd_crs_new = pd.concat([pd_crs,crs_new], axis=1)

pd_crs_new.rename(columns = {0: "Bilirubin", 1: "Cholesterol", 2: "Albumin", 3: "Copper", 
                         4: "Alk_Phos", 5: "SGOT", 6: "Tryglicerides", 7: "Platelets", 
                         8: "Prothrombin"}, inplace=True)

data_crs = pd.concat([pd_crs_new,pd_crs_stage], axis=1)

data_crs.head()

Unnamed: 0,gender,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
0,0,1,1,1,2,0.512635,0.085196,0.238806,0.260274,0.105279,0.258993,0.246018,0.255489,0.395062,4
1,0,0,1,1,0,0.028881,0.10997,0.813433,0.085616,0.523509,0.202298,0.097345,0.317365,0.197531,3
2,1,0,0,0,1,0.039711,0.033837,0.567164,0.35274,0.016724,0.161871,0.038938,0.177645,0.37037,4
3,0,0,1,1,1,0.054152,0.074924,0.216418,0.10274,0.429723,0.079554,0.104425,0.241517,0.160494,4
4,0,0,1,1,0,0.111913,0.096073,0.585821,0.238014,0.028143,0.201439,0.069027,0.147705,0.234568,3


### **Model**

**Naive Bayes**

In [35]:
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
from sklearn.preprocessing import StandardScaler

feature=data_crs.iloc[:,0:14].values
label=data_crs.iloc[:,14].values



In [36]:
x_train, x_test, y_train, y_test = train_test_split(feature, label, test_size = 0.24, random_state = 1)

classifier = GaussianNB()

classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

accuracy_score(y_test, y_pred)

0.44

In [37]:
# save the model to disk
filename = 'NaiveBayes_model.pkl'
pickle.dump(classifier, open(filename, 'wb'))

# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))

# predict on one record that is already encoded and scaled (row 0 of data_crs)
dataArray = [0, 1, 1, 1, 2, 0.512635, 0.085196, 0.238806, 0.260274,
             0.105279, 0.258993, 0.246018, 0.255489, 0.395062]
pred = loaded_model.predict([dataArray])
print(pred)

[4]
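The record passed to `predict` here matches the first row of `data_crs`, so it has already been encoded and scaled. A raw patient record would need the same scaling first, which means the fitted scaler has to be kept alongside the model. The notebook does not save `scaler`, so the sketch below is an assumption about how that could be done; `train_numeric`, `restored`, and `raw_record` are hypothetical names with made-up values.

```python
import pickle

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training values for two numeric columns (made up for illustration).
train_numeric = np.array([[0.3, 120.0], [14.5, 261.0], [28.0, 1775.0]])

# Fit the scaler once on the training values.
scaler = MinMaxScaler().fit(train_numeric)

# Round-trip through pickle; writing to a file works the same way.
restored = pickle.loads(pickle.dumps(scaler))

# Scale a raw (unscaled) record with the restored scaler before predicting.
raw_record = np.array([[14.5, 261.0]])
scaled_record = restored.transform(raw_record)
print(scaled_record)
```

The scaled values can then be appended to the encoded categorical values, in the same column order used for training, to form the input to `predict`.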


**Random Forest**

In [38]:
import pandas as pd
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.metrics import *
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
import pickle

feature=data_crs.iloc[:,0:14].values
label=data_crs.iloc[:,14].values
# split the data into training and test sets
X_train,X_test,Y_train,Y_test=train_test_split(feature, label, test_size=0.3,random_state=0)

In [39]:
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 100) 

# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, Y_train)
 
# performing predictions on the test dataset
y_pred = clf.predict(X_test)

# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(Y_test, y_pred))

ACCURACY OF THE MODEL:  0.4574468085106383


In [40]:
# save the model to disk
filename = 'RandomForest_model.pkl'
pickle.dump(clf, open(filename, 'wb'))
 
# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))

# predict on one record that is already encoded and scaled (row 0 of data_crs)
dataArray = [0, 1, 1, 1, 2, 0.512635, 0.085196, 0.238806, 0.260274,
             0.105279, 0.258993, 0.246018, 0.255489, 0.395062]
pred = loaded_model.predict([dataArray])
print(pred)

[4]


**KNN**

In [41]:
import numpy as np
import pandas as pd 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
   
  
# separate the features (first 14 columns) from the label (Stage)
feature=data_crs.iloc[:,0:14].values
label=data_crs.iloc[:,14].values

# split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(feature, label, test_size=0.25, random_state=0)

# KNN classifier with k = 5
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(x_train, y_train)

# predict on the test set and compute the accuracy
y_pred = classifier.predict(x_test)
accuracy_score(y_test, y_pred)

0.48717948717948717

In [42]:
# save the model to disk
filename = 'KNN_model.pkl'
pickle.dump(classifier, open(filename, 'wb'))  # save the KNN model, not the random forest still held in clf

# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))

# predict on one record that is already encoded and scaled (row 0 of data_crs)
dataArray = [0, 1, 1, 1, 2, 0.512635, 0.085196, 0.238806, 0.260274,
             0.105279, 0.258993, 0.246018, 0.255489, 0.395062]
pred = loaded_model.predict([dataArray])
print(pred)

[4]


**Decision Tree**

In [43]:
from sklearn import tree

# separate the features from the label
feature=data_crs.iloc[:,0:14].values
label=data_crs.iloc[:,14].values

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(feature, label, test_size=0.3, random_state=1)

# classify with a decision tree
clf = tree.DecisionTreeClassifier(random_state=3, max_depth=1)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy_score(y_test, y_pred)

0.5425531914893617

In [44]:
# save the model to disk
filename = 'DecisionTree.pkl'
pickle.dump(clf, open(filename, 'wb'))
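
The four accuracies above come from different splits (test sizes of 0.24, 0.3, 0.25, and 0.3 with different `random_state` values), so they are not directly comparable. One possible way to put the models on the same footing is k-fold cross-validation with `KFold` and `cross_val_score`, which the notebook imports but never uses. The sketch below is an illustration on synthetic data only: `make_classification` stands in for the 14 features and 4 stages, and the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for 14 features and 4 stage labels (not the real data).
X, y = make_classification(n_samples=300, n_features=14, n_informative=6,
                           n_classes=4, random_state=0)

# The same five folds are reused for every model.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(max_depth=1, random_state=3),
}

# One accuracy per fold for each model.
scores = {name: cross_val_score(model, X, y, cv=folds, scoring="accuracy")
          for name, model in models.items()}

for name, vals in scores.items():
    print(name, round(vals.mean(), 3))
```

With `data_crs`, `X` and `y` would be replaced by the `feature` and `label` arrays used in the cells above.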
