## Liver Cirrhosis Stage Prediction using Machine Learning
**Author:** Carlos Alfredo Hernández Alvarez

**Github:** [carloshdez522](https://github.com/carloshdez522?tab=repositories)

**ORCID:** [https://orcid.org/0009-0006-6749-1686](https://orcid.org/0009-0006-6749-1686)
### Introduction
#### Disease Description
**Liver Cirrhosis:** Cirrhosis is a chronic condition of the liver characterized by the formation of scars (fibrosis) and regenerative nodules, resulting from prolonged damage. This condition alters the structure and function of the liver, impairing its ability to perform vital functions, such as detoxification of the blood, production of essential proteins and regulation of body energy.
#### Dataset Description
**Context:**
This dataset originates from a Mayo Clinic study of primary biliary cirrhosis (PBC) of the liver, conducted between 1974 and 1984. This dataset contains 25000 rows and 19 columns, providing a solid basis for exploring and predicting the stages of liver cirrhosis using machine learning techniques.

### Attribute Information:
- `N_Days`: Number of days between registration and the earlier of death, transplantation, or study analysis time in 1986
- `Status`: status of the patient: C (censored), CL (censored due to liver transplant), or D (death)
- `Drug`: type of drug D-penicillamine or placebo
- `Age`: age in days
- `Sex`: M (male) or F (female)
- `Ascites`: presence of ascites N (No) or Y (Yes)
- `Hepatomegaly`: presence of hepatomegaly N (No) or Y (Yes)
- `Spiders`: presence of spiders N (No) or Y (Yes)
- `Edema`: presence of edema N (no edema and no diuretic therapy for edema), S (edema present without diuretics, or edema resolved by diuretics), or Y (edema despite diuretic therapy)
- `Bilirubin`: serum bilirubin in [mg/dl]
- `Cholesterol`: serum cholesterol in [mg/dl]
- `Albumin`: albumin in [gm/dl]
- `Copper`: urine copper in [ug/day]
- `Alk_Phos`: alkaline phosphatase in [U/liter]
- `SGOT`: SGOT in [U/ml]
- `Tryglicerides`: triglycerides in [mg/dl]
- `Platelets`: platelets per cubic [ml/1000]
- `Prothrombin`: prothrombin time in seconds [s]
- `Stage`: histologic stage of disease ( 1, 2, or 3 )

### **Objective**
The objective of this notebook is to demonstrate how machine learning can contribute significantly in the field of medicine, specifically in predicting the stage of liver cirrhosis. Several machine learning models will be used to make predictions and their performances will be compared.

### Data Loading and Preprocessing
Data were loaded for further analysis.

In [21]:
import pandas as pd

df = pd.read_csv('Liver Cirrhosis Stage Data.csv')
df

Unnamed: 0,N_Days,Status,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
0,2221,C,Placebo,18499,F,N,Y,N,N,0.5,149.000000,4.04,227.0,598.0,52.70,57.000000,256.0,9.9,1
1,1230,C,Placebo,19724,M,Y,N,Y,N,0.5,219.000000,3.93,22.0,663.0,45.00,75.000000,220.0,10.8,2
2,4184,C,Placebo,11839,F,N,N,N,N,0.5,320.000000,3.54,51.0,1243.0,122.45,80.000000,225.0,10.0,2
3,2090,D,Placebo,16467,F,N,N,N,N,0.7,255.000000,3.74,23.0,1024.0,77.50,58.000000,151.0,10.2,2
4,2105,D,Placebo,21699,F,N,Y,N,N,1.9,486.000000,3.54,74.0,1052.0,108.50,109.000000,151.0,11.5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,3584,D,D-penicillamine,23612,F,N,N,N,N,0.8,231.000000,3.87,173.0,9009.8,127.71,96.000000,295.0,11.0,2
24996,3584,D,D-penicillamine,23612,F,N,N,N,N,0.8,231.000000,3.87,173.0,9009.8,127.71,96.000000,295.0,11.0,2
24997,971,D,D-penicillamine,16736,F,N,Y,Y,Y,5.1,369.510563,3.23,18.0,790.0,179.80,124.702128,104.0,13.0,3
24998,3707,C,D-penicillamine,16990,F,N,Y,N,N,0.8,315.000000,4.24,13.0,1637.0,170.50,70.000000,426.0,10.9,2
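
Rows 24995 and 24996 in the preview above are identical, so it may be worth counting exact duplicates alongside missing values before any split. If many rows repeat, identical records can land in both the training and test sets, which would make test accuracy look better than it is. The sketch below uses a small toy frame (`toy`, illustrative values only) in place of `df`; the variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for df; the values are illustrative only
toy = pd.DataFrame({
    'N_Days': [3584, 3584, 971],
    'Bilirubin': [0.8, 0.8, 5.1],
    'Stage': [2, 2, 3],
})

n_missing = int(toy.isna().sum().sum())      # total number of missing cells
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_missing, n_duplicates, len(deduplicated))
```

Running the same three calls on `df` would show how much of the 25000-row table is repeated.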


There are no missing values in any of the columns, which facilitates the preprocessing of the data for further analysis.

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   N_Days         25000 non-null  int64  
 1   Status         25000 non-null  object 
 2   Drug           25000 non-null  object 
 3   Age            25000 non-null  int64  
 4   Sex            25000 non-null  object 
 5   Ascites        25000 non-null  object 
 6   Hepatomegaly   25000 non-null  object 
 7   Spiders        25000 non-null  object 
 8   Edema          25000 non-null  object 
 9   Bilirubin      25000 non-null  float64
 10  Cholesterol    25000 non-null  float64
 11  Albumin        25000 non-null  float64
 12  Copper         25000 non-null  float64
 13  Alk_Phos       25000 non-null  float64
 14  SGOT           25000 non-null  float64
 15  Tryglicerides  25000 non-null  float64
 16  Platelets      25000 non-null  float64
 17  Prothrombin    25000 non-null  float64
 18  Stage 

The dataset contains 7 categorical variables that need to be encoded before being used in machine learning models. These variables are: `Sex`, `Ascites`, `Hepatomegaly`, `Spiders`, `Edema`, `Drug` and `Status`.

`LabelEncoder` from the sklearn library will be used to convert these categorical variables into numerical values.

In [23]:
from sklearn.preprocessing import LabelEncoder
data = df.copy()

for column in ['Sex', 'Ascites', 'Hepatomegaly', 'Spiders', 'Edema', 'Drug', 'Status']:
  data[column] = LabelEncoder().fit_transform(data[column])
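
The loop above discards each fitted encoder, so the integer assigned to each category is not recorded. As a minimal sketch, assuming a toy frame (`toy`) with the category codes from the attribute list, the fitted encoders can be kept in a dictionary (`encoders`, a hypothetical name) and their `classes_` attribute used to recover the mapping:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame using category codes from the attribute list (illustrative rows only)
toy = pd.DataFrame({
    'Sex': ['F', 'M', 'F'],
    'Edema': ['N', 'S', 'Y'],
})

encoders = {}
for column in ['Sex', 'Edema']:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder  # keep the fitted encoder so the mapping can be inspected

# classes_ is sorted, so a category's position in the array is its integer code
mapping = {
    column: {str(label): code for code, label in enumerate(encoder.classes_)}
    for column, encoder in encoders.items()
}
print(mapping)
```

Because `LabelEncoder` sorts the categories, `Edema` values `N`, `S` and `Y` become 0, 1 and 2 in that order.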

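The data are divided into training, validation and test sets in a 60/20/20 ratio. First, the feature set (`x`) is separated from the target variable (`y`), which in this case is `Stage`. The stage labels are shifted from 1-3 to 0-2 because `XGBClassifier` expects class labels that start at 0. The remaining 40% left over from the first split is then halved into test and validation sets.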

In [24]:
from sklearn.model_selection import train_test_split
x = data.drop('Stage', axis=1)
y = data['Stage'] - 1

x_train, x_test_val, y_train, y_test_val = train_test_split(x, y, test_size=0.4, random_state=42)
x_test, x_val, y_test, y_val = train_test_split(x_test_val, y_test_val, test_size=0.5, random_state=42)
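
The two calls above produce a 60/20/20 split. The following sketch reproduces the same two-step split on synthetic data (`x_demo`, `y_demo`, both hypothetical stand-ins) and checks the resulting sizes. It also passes `stratify`, which is an assumption not present in the original code; it keeps the stage proportions similar across the three sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the 0-based stage labels
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 3, size=1000)

# Same two-step split as above, with stratify added (not in the original code)
x_tr, x_rest, y_tr, y_rest = train_test_split(
    x_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo)
x_te, x_va, y_te, y_va = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

print(len(x_tr), len(x_te), len(x_va))
```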

### Model Training and Evaluation

To evaluate the performance of different machine learning models in predicting liver cirrhosis stage, several algorithms will be used. The models selected for this analysis are `RandomForestClassifier`, `GradientBoostingClassifier`, `KNeighborsClassifier`, `DecisionTreeClassifier` and `XGBClassifier`, all with their default hyperparameters. In addition, `AutoML` from the FLAML library will be used to tune a model automatically; here its search is restricted to random forests.



In [25]:
!pip install -q flaml
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from flaml import AutoML

models = {
    'XGBoost': XGBClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'Random Forest': RandomForestClassifier(),
    'KNN': KNeighborsClassifier()
}
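
`KNeighborsClassifier` is distance-based, and the features in this dataset sit on very different scales (for example, `Alk_Phos` is in the hundreds to thousands while `Bilirubin` is around 1). One option, shown here only as a sketch on synthetic data, is to wrap KNN in a pipeline with `StandardScaler`. The arrays and the names `plain_knn` and `scaled_knn` are assumptions for illustration; the original notebook uses KNN without scaling.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales
small = rng.normal(size=200)
large = rng.normal(loc=1000, scale=500, size=200)
X = np.column_stack([small, large])
y = (small > 0).astype(int)  # the label depends only on the small-scale feature

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Unscaled KNN: distances are dominated by the large-scale feature
plain_knn = KNeighborsClassifier().fit(X_train, y_train)

# Scaled KNN: the scaler is fit on the training rows only, then applied before KNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_train, y_train)

plain_score = plain_knn.score(X_test, y_test)
scaled_score = scaled_knn.score(X_test, y_test)
print(plain_score, scaled_score)
```

A pipeline like `scaled_knn` could replace the `'KNN'` entry in `models` without changing the rest of the loop, since it exposes the same `fit` and `predict` methods.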

Each model is fit on the training set and scored on the test set. The accuracies are then organized in a DataFrame for easy comparison and visualization.

In [26]:
from sklearn.metrics import accuracy_score
accuracy = {}

for name, model in models.items():
  model.fit(x_train, y_train)
  accuracy[name] = accuracy_score(y_test, model.predict(x_test))

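# FLAML AutoML, restricted to random forest ("rf") so the search covers a single estimator family
automl = AutoML()
automl.fit(x_train, y_train, task='classification', estimator_list=["rf"], verbose=0)  # verbose is an integer level; 0 silences output
accuracy['AutoML'] = accuracy_score(y_test, automl.predict(x_test))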

accuracy = pd.DataFrame({
    'Model': list(accuracy.keys()),
    'Accuracy': list(accuracy.values())
})

INFO:flaml.default.suggest:metafeature distance: 0.2942364131561268
INFO:flaml.default.suggest:metafeature distance: 0.2942364131561268
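
Accuracy is a single number and does not show which stages are confused with each other. As a hedged sketch on synthetic three-class data (generated with `make_classification`, not the cirrhosis dataset), a confusion matrix and a per-class report can be computed with scikit-learn's metrics:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the three stages
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred)
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_te, pred))
```

The diagonal of `cm` divided by its total equals the accuracy, so the matrix breaks the same result down by class.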


The `matplotlib` library is used to create a bar chart showing the accuracies of the different models.

In [None]:
import matplotlib.pyplot as plt

accuracy_melted = accuracy.melt(id_vars='Model', var_name='Dataset', value_name='Value')

fig, ax = plt.subplots(figsize=(10, 6))

bars = ax.bar(accuracy_melted['Model'], accuracy_melted['Value'], color='skyblue')

for bar in bars:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval, round(yval, 4), va='bottom') # use 'va' for vertical alignment

ax.set_title('Models Accuracy', fontsize=15)
ax.set_xlabel('Model', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)

plt.xticks(rotation=45)
plt.ylim(0, 1)

### Results

Several machine learning models were trained and evaluated using a liver cirrhosis dataset. The models used were `XGBoost`, `Decision Tree`, `Gradient Boosting`, `Random Forest`, `KNeighborsClassifier` and `AutoML`. The accuracy results of each model are presented below:

In [28]:
plt.show()

In [29]:
accuracy

Unnamed: 0,Model,Accuracy
0,XGBoost,0.9602
1,Decision Tree,0.9088
2,Gradient Boosting,0.8636
3,Random Forest,0.9508
4,KNN,0.8196
5,AutoML,0.9444
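
The split earlier in the notebook also created a validation set (`x_val`, `y_val`) that the comparison above does not use. One common pattern, sketched here on synthetic data with two of the model types, is to pick the best model on the validation set and score only that model on the test set. The names `candidates`, `val_accuracy` and `best_name` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data, split 60/20/20 like the notebook
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_te, X_va, y_te, y_va = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

candidates = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'KNN': KNeighborsClassifier(),
}

# Compare the candidates on the validation set only
val_accuracy = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    val_accuracy[name] = accuracy_score(y_va, model.predict(X_va))

# Score the selected model once on the untouched test set
best_name = max(val_accuracy, key=val_accuracy.get)
test_accuracy = accuracy_score(y_te, candidates[best_name].predict(X_te))
print(best_name, round(test_accuracy, 4))
```

Selecting on validation data keeps the test score as an estimate for a model that was not chosen by looking at the test set.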


### Conclusions

This notebook met its objective of showing that machine learning algorithms can predict the stage of liver cirrhosis from clinical data. On the held-out test set, accuracy ranged from 0.8196 (KNN) to 0.9602 (XGBoost), with XGBoost and Random Forest both above 0.95. These results underscore the effectiveness of these approaches in predicting disease stage from clinical data.

Furthermore, this analysis can serve as a robust basis for the application of machine learning techniques in the prediction of other diseases. The methodology followed in this notebook, from data loading and preprocessing to model training and evaluation, is applicable to a wide variety of medical datasets. This highlights the versatility and potential of machine learning algorithms in the field of bioinformatics.

Bioinformatics, which combines biology, computer science and statistics, benefits greatly from machine learning techniques to analyze complex biological data and large volumes of clinical data. These methods enable:

- Identification of hidden patterns in the data.
- Improved diagnostic accuracy.
- Personalization of medical treatments.


The use of machine learning in bioinformatics is not only limited to predicting the stage of diseases such as liver cirrhosis, but can also be applied in areas such as:

- The early detection of genetic diseases.
- The identification of biomarkers for various medical conditions.
- Optimization of drug development processes.

This analysis has shown that machine learning models can predict the stage of liver cirrhosis with high accuracy on this dataset, and it has laid out a methodology that can be reused for the analysis and prediction of other diseases. This adaptability suggests that the techniques presented here could contribute to improving medical diagnostics and personalized treatment in the future.

### Citation:
Dickson, E., Grambsch, P., Fleming, T., Fisher, L., and Langworthy, A. (2023). Cirrhosis Patient Survival Prediction. UCI Machine Learning Repository. https://doi.org/10.24432/C5R02G