# Multi-Class Prediction of Cirrhosis Outcomes

### **Problem Statement**: 
#### The task is to predict the survival status of patients with liver cirrhosis from the clinical features described below.
> Cirrhosis results from prolonged liver damage, leading to extensive scarring, often due to conditions like hepatitis or chronic alcohol consumption.

The data description is adapted from this [notebook](https://www.kaggle.com/code/markuslill/s3e26-xgbclassifer) and the original data [source](https://www.kaggle.com/datasets/joebeachcapital/cirrhosis-patient-survival-prediction).

The data we are currently using differs from the original source, so the types and missing-value flags in the table below may not match what we actually load.

| Variable Name | Role | Type | Encoding | Description | Missing Values |
| --- | --- | --- | --- | --- | --- |
| ID | ID | Int | drop | Unique identifier | No |
| N_Days | Feature? | Int | numeric | Number of days between registration and the earlier of death (D), transplantation (CL), or being alive at the study analysis time in July 1986 (C) | No |
| Drug | Feature | Categorical | Binary | Type of drug: D-penicillamine or placebo. Type of medication may impact the effectiveness of treatment, thus affecting status. | Yes |
| Age | Feature | Int | numeric | Age. Age could be related to disease progression; older patients may have a different status trajectory. | No |
| Sex | Feature | Categorical | Binary | Gender: M (male) or F (female). Biological sex may influence disease patterns and response to treatment, thereby affecting status. | No |
| Ascites | Feature | Categorical | Binary | Presence of ascites: N (No) or Y (Yes). The accumulation of fluid in the abdomen, often a sign of advanced liver disease, which could indicate a poorer status. | Yes |
| Hepatomegaly | Feature | Categorical | Binary | Presence of hepatomegaly: N (No) or Y (Yes). Enlargement of the liver. If present, it might suggest more serious liver disease and potentially a poorer status. | Yes |
| Spiders | Feature | Categorical | Binary | Presence of spiders: N (No) or Y (Yes). Spider angiomas are small, spider-like capillaries visible under the skin, associated with liver disease and could indicate a more advanced disease affecting status. | Yes |
| Edema | Feature | Categorical | One-Hot | Presence of edema: N (no edema and no diuretic therapy for edema), S (edema present without diuretics, or edema resolved by diuretics), or Y (edema despite diuretic therapy). Swelling caused by excess fluid trapped in the body's tissues, often worsening the prognosis and indicating poorer status. | No |
| Bilirubin | Feature | Continuous(Int) | numeric | Serum bilirubin. High levels can indicate liver dysfunction and may correlate with more advanced disease and poorer status. | No |
| Cholesterol | Feature | Continuous(Int)? | numeric | Serum cholesterol. While not directly related to liver function, abnormal levels can be associated with certain liver conditions and overall health status. | Yes |
| Albumin | Feature | Continuous(Int) | numeric | Albumin. Low levels can be a sign of liver disease and can indicate a poorer status due to the liver's reduced ability to synthesize proteins. | No |
| Copper | Feature | Int | numeric | Urine copper. Elevated in certain liver diseases (like Wilson's disease), and could affect status if levels are abnormally high. | Yes |
| Alk_Phos | Feature | Continuous | numeric | Alkaline phosphatase. An enzyme related to the bile ducts; high levels might indicate blockage or other issues related to the liver. | Yes |
| SGOT | Feature | Continuous | numeric | An enzyme that, when elevated, can indicate liver damage and could correlate with a worsening status. | Yes |
| Triglycerides | Feature | Int | numeric | Triglycerides. Though mainly a cardiovascular risk indicator, they can be affected by liver function and, by extension, the status of the patient. | Yes |
| Platelets | Feature | Int | numeric | Platelets per cubic. Low platelet count can be a result of advanced liver disease and can indicate a poorer status. | Yes |
| Prothrombin | Feature | Int | numeric | Prothrombin time. A measure of how long it takes blood to clot. Liver disease can cause increased times, indicating poorer status. | Yes |
| Stage | Feature | Categorical | Ordinal | Histologic stage of disease (1, 2, 3, or 4). The stage of liver disease, which directly correlates with the patient's status: the higher the stage, the more serious the condition. | Yes |
| Status | Target | Categorical | Label | C (censored) indicates the patient was alive at N_Days, CL indicates the patient was alive at N_Days due to a liver transplant, and D indicates the patient was deceased at N_Days | No |
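
The Encoding column above can be sketched in plain pandas. This is only an illustration on a made-up three-row frame (`toy` and its values are assumptions, not rows from train.csv), and the integer codes chosen for each category are arbitrary; the notebook may well use a different encoder later on.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative only.
toy = pd.DataFrame({
    "Sex": ["F", "M", "F"],
    "Ascites": ["N", "Y", "N"],
    "Edema": ["N", "S", "Y"],
    "Stage": [1, 3, 4],
    "Status": ["C", "CL", "D"],
})

encoded = toy.copy()

# Binary encoding: map the two categories to 0/1.
encoded["Sex"] = encoded["Sex"].map({"F": 0, "M": 1})
encoded["Ascites"] = encoded["Ascites"].map({"N": 0, "Y": 1})

# Ordinal encoding: Stage is already an ordered integer code, so it stays as-is.

# Label encoding of the target: one integer per class.
encoded["Status"] = encoded["Status"].map({"C": 0, "CL": 1, "D": 2})

# One-hot encoding: Edema has three unordered levels (N, S, Y).
encoded = pd.get_dummies(encoded, columns=["Edema"], dtype=int)

print(encoded)
```

The same mappings could be applied to the remaining binary columns (Drug, Hepatomegaly, Spiders) in the same way.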

#### Glaring Issues in Dataset: 
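1. Continuous vs. Int: the types listed for several numeric features are ambiguous.
2. N_Days is not treated as a feature in the original description but as an 'other' type. Its meaning is confusing, and it needs some feature engineering to be put to better use.
3. Age is recorded in days, for no obvious reason; we could probably bin it.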
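
One possible way to handle the Age issue is to convert days to years and then bin. The sketch below uses made-up ages, and the bin edges and labels are assumptions chosen for illustration, not values taken from the data.

```python
import pandas as pd

# Hypothetical ages in days, standing in for train_df["Age"].
ages_days = pd.Series([7300, 18250, 21900, 27375], name="Age")

# Convert to years (365.25 accounts for leap years on average).
age_years = ages_days / 365.25

# Assumed bin edges; right=False makes each bin [lower, upper).
bins = [0, 40, 55, 70, 120]
labels = ["<40", "40-54", "55-69", "70+"]
age_group = pd.cut(age_years, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age_years": age_years.round(1), "age_group": age_group}))
```

Whether binning helps at all depends on the model; tree-based models can usually split on the raw value directly.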

## EDA

In [4]:
import warnings
import pandas as pd
import numpy as np
import altair as alt
import matplotlib.pyplot as plt

alt.data_transformers.enable("vegafusion")

warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
train_df = pd.read_csv("data/train.csv")

In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7905 entries, 0 to 7904
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             7905 non-null   int64  
 1   N_Days         7905 non-null   int64  
 2   Drug           7905 non-null   object 
 3   Age            7905 non-null   int64  
 4   Sex            7905 non-null   object 
 5   Ascites        7905 non-null   object 
 6   Hepatomegaly   7905 non-null   object 
 7   Spiders        7905 non-null   object 
 8   Edema          7905 non-null   object 
 9   Bilirubin      7905 non-null   float64
 10  Cholesterol    7905 non-null   float64
 11  Albumin        7905 non-null   float64
 12  Copper         7905 non-null   float64
 13  Alk_Phos       7905 non-null   float64
 14  SGOT           7905 non-null   float64
 15  Tryglicerides  7905 non-null   float64
 16  Platelets      7905 non-null   float64
 17  Prothrombin    7905 non-null   float64
 18  Stage   

#### After a quick preliminary examination, we found no missing values in the provided train.csv.
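
A more explicit check than reading the non-null counts is to count missing entries per column. The sketch below runs on a small made-up frame that deliberately contains gaps, so the output shows what a non-zero count looks like; on `train_df` the same call would be expected to return all zeros, given the `info()` output above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberate gaps; not real rows from train.csv.
toy = pd.DataFrame({
    "Bilirubin": [1.1, np.nan, 0.9],
    "Drug": ["Placebo", "D-penicillamine", None],
    "Age": [18250, 21900, 27375],
})

# Count missing values per column; isna() treats both NaN and None as missing.
missing_per_column = toy.isna().sum()

print(missing_per_column)
print("any missing:", bool(missing_per_column.any()))
```

For the real data, the equivalent call would be `train_df.isna().sum()`.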

### Categorical Features 