# Multi-Class Prediction of Cirrhosis Outcomes

### **Problem Statement**:
#### The task is to predict the survival status of patients with liver cirrhosis from the features described below.
> Cirrhosis results from prolonged liver damage, leading to extensive scarring, often due to conditions like hepatitis or chronic alcohol consumption.

The original data description is adapted from this [notebook](https://www.kaggle.com/code/markuslill/s3e26-xgbclassifer) and the original data [source](https://www.kaggle.com/datasets/joebeachcapital/cirrhosis-patient-survival-prediction).

The data we are using currently differs from the original source.

| Variable Name | Role | Type | Encoding | Description | Missing Values |
| --- | --- | --- | --- | --- | --- |
| ID | ID | Int | drop | Unique identifier | No |
| N_Days | Feature? | Int | numeric | Number of days between registration and the earlier of death(D), transplantation(CL), or being alive at the study analysis time in July 1986(C) | No |
| Drug | Feature | Categorical | Binary | Type of drug: D-penicillamine or placebo. Type of medication may impact the effectiveness of treatment, thus affecting status. | Yes |
| Age | Feature | Int | numeric | Age. Age could be related to disease progression; older patients may have a different status trajectory. | No |
| Sex | Feature | Categorical | Binary | Gender: M (male) or F (female). Biological sex may influence disease patterns and response to treatment, thereby affecting status. | No |
| Ascites | Feature | Categorical | Binary | Presence of ascites: N (No) or Y (Yes). The accumulation of fluid in the abdomen, often a sign of advanced liver disease, which could indicate a poorer status. | Yes |
| Hepatomegaly | Feature | Categorical | Binary | Presence of hepatomegaly: N (No) or Y (Yes). Enlargement of the liver. If present, it might suggest more serious liver disease and potentially a poorer status. | Yes |
| Spiders | Feature | Categorical | Binary | Presence of spiders: N (No) or Y (Yes). Spider angiomas are small, spider-like capillaries visible under the skin, associated with liver disease and could indicate a more advanced disease affecting status. | Yes |
| Edema | Feature | Categorical | One-Hot | Presence of edema: N (no edema and no diuretic therapy for edema), S (edema present without diuretics, or edema resolved by diuretics), or Y (edema despite diuretic therapy). Swelling caused by excess fluid trapped in the body's tissues, often worsening the prognosis and indicating poorer status. | No |
| Bilirubin | Feature | Continuous(Int) | numeric | Serum bilirubin. High levels can indicate liver dysfunction and may correlate with more advanced disease and poorer status. | No |
| Cholesterol | Feature | Continuous(Int)? | numeric | Serum cholesterol. While not directly related to liver function, abnormal levels can be associated with certain liver conditions and overall health status. | Yes |
| Albumin | Feature | Continuous(Int) | numeric | Low levels can be a sign of liver disease and can indicate a poorer status due to the liver's reduced ability to synthesize proteins. | No |
| Copper | Feature | Int | numeric | Urine copper. Elevated in certain liver diseases (like Wilson's disease), and could affect status if levels are abnormally high. | Yes |
| Alk_Phos | Feature | Continuous | numeric | Alkaline phosphatase. An enzyme related to the bile ducts; high levels might indicate blockage or other issues related to the liver. | Yes |
| SGOT | Feature | Continuous | numeric | An enzyme that, when elevated, can indicate liver damage and could correlate with a worsening status. | Yes |
| Triglycerides | Feature | Int | numeric | Triglycerides. Though mainly a cardiovascular risk indicator, they can be affected by liver function and, by extension, the status of the patient. | Yes |
| Platelets | Feature | Int | numeric | Platelets per cubic. Low platelet count can be a result of advanced liver disease and can indicate a poorer status. | Yes |
| Prothrombin | Feature | Int | numeric | Prothrombin time. A measure of how long it takes blood to clot. Liver disease can cause increased times, indicating poorer status. | Yes |
| Stage | Feature | Categorical | Ordinal | Histologic stage of disease (1, 2, 3, or 4). The stage of liver disease, which directly correlates with the patient's status - the higher the stage, the more serious the condition. | Yes |
| Status | Target | Categorical | Label | C (censored) indicates the patient was alive at N_Days, CL indicates the patient was alive at N_Days due to a liver transplant, and D indicates the patient was deceased at N_Days | No |

#### Glaring Issues in Dataset:
1. The distinction between continuous and integer types is unclear for several columns (see the Type column above).
2. `N_Days` is listed as an 'other' type rather than a feature. Its meaning is confusing, and it needs some feature engineering before it can be put to good use.
3. `Age` is recorded in days rather than years; binning it is one option.
4. The target class is imbalanced.
5. Feature engineering is needed.
6. Feature selection (e.g. RFE) should be applied.
7. Feature importance (e.g. SHAP) should be examined.

Feature Engineering ideas:

| Transformer | Type | Description |
| --- | --- | --- |
| DiagnosisDateTransformer | num | Calculates 'Diagnosis_Date' by subtracting 'N_Days' from 'Age'. This could provide a more direct measure of time since diagnosis, relevant for analysis. |
| AgeBinsTransformer | cat | Categorizes 'Age' into bins (19, 29, 49, 64, 99), converting a continuous variable into a categorical one for simplified analysis. |
| BilirubinAlbuminTransformer | num | Creates a new feature 'Bilirubin_Albumin' by multiplying 'Bilirubin' and 'Albumin', potentially highlighting interactions between these two variables. |
| NormalizeLabValuesTransformer | num | Normalizes laboratory values (like 'Bilirubin', 'Cholesterol', etc.) to their z-scores, standardizing these features for modeling purposes. |
| DrugEffectivenessTransformer | num | Generates a new feature 'Drug_Effectiveness' by combining 'Drug' and 'Bilirubin' levels. This assumes that changes in 'Bilirubin' reflect drug effectiveness. |
| SymptomScore(Cat)Transformer | num | Summarizes the presence of symptoms ('Ascites', 'Hepatomegaly', etc.) into a single 'Symptom_Score', simplifying the representation of patient symptoms. |
| LiverFunctionTransformer | num | Computes 'Liver_Function_Index' as the average of key liver function tests, providing a comprehensive metric for liver health. |
| RiskScoreTransformer | num | Calculates 'Risk_Score' using a combination of 'Bilirubin', 'Albumin', and 'Alk_Phos', potentially offering a composite risk assessment for patients. |
| TimeFeaturesTransformer | num | Extracts 'Year' and 'Month' from 'N_Days', converting a continuous time measure into more interpretable categorical time units. |
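
As a rough illustration of how two of these ideas might look in pandas, the sketch below builds `Bilirubin_Albumin` and `Symptom_Score`. The function name `add_engineered_features`, the choice of symptom columns, and the toy rows are assumptions made for this example; they are not the pipeline used later in the notebook.

```python
import pandas as pd

# Binary Y/N symptom columns used for the score (an assumed selection).
SYMPTOM_COLS = ["Ascites", "Hepatomegaly", "Spiders"]


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative engineered columns."""
    out = df.copy()
    # Interaction term: product of the two lab values.
    out["Bilirubin_Albumin"] = out["Bilirubin"] * out["Albumin"]
    # Count how many of the binary symptoms are present ("Y").
    out["Symptom_Score"] = (out[SYMPTOM_COLS] == "Y").sum(axis=1)
    return out


# Toy rows for illustration only (not taken from train.csv).
toy = pd.DataFrame(
    {
        "Bilirubin": [1.0, 4.0],
        "Albumin": [3.5, 2.5],
        "Ascites": ["N", "Y"],
        "Hepatomegaly": ["N", "Y"],
        "Spiders": ["Y", "N"],
    }
)
features = add_engineered_features(toy)
print(features[["Bilirubin_Albumin", "Symptom_Score"]])
```

Returning a copy keeps the input frame unchanged, which makes the function easy to wrap in a scikit-learn style transformer later if that is the route taken.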


## EDA

In [4]:
import warnings
import pandas as pd
import numpy as np
import altair as alt
import matplotlib.pyplot as plt

alt.data_transformers.enable("vegafusion")

warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
train_df = pd.read_csv("data/train.csv")

In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7905 entries, 0 to 7904
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             7905 non-null   int64  
 1   N_Days         7905 non-null   int64  
 2   Drug           7905 non-null   object 
 3   Age            7905 non-null   int64  
 4   Sex            7905 non-null   object 
 5   Ascites        7905 non-null   object 
 6   Hepatomegaly   7905 non-null   object 
 7   Spiders        7905 non-null   object 
 8   Edema          7905 non-null   object 
 9   Bilirubin      7905 non-null   float64
 10  Cholesterol    7905 non-null   float64
 11  Albumin        7905 non-null   float64
 12  Copper         7905 non-null   float64
 13  Alk_Phos       7905 non-null   float64
 14  SGOT           7905 non-null   float64
 15  Tryglicerides  7905 non-null   float64
 16  Platelets      7905 non-null   float64
 17  Prothrombin    7905 non-null   float64
 18  Stage   

##### After a quick preliminary examination, we found no missing values in the provided `train.csv`, even though the original data description lists missing values for several columns.
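
A more direct check than reading the `Non-Null Count` column is to count nulls per column. The sketch below uses a small stand-in frame with made-up values (`demo_df` is hypothetical) so that it runs without `data/train.csv`; in the notebook the same calls would be made on `train_df`.

```python
import numpy as np
import pandas as pd

# Stand-in frame with made-up values; use train_df in the notebook instead.
demo_df = pd.DataFrame(
    {
        "Bilirubin": [0.8, 2.3, np.nan],
        "Drug": ["Placebo", None, "D-penicillamine"],
        "Age": [21532, 19237, 13918],
    }
)

# Number of missing entries in each column.
missing_per_column = demo_df.isna().sum()
print(missing_per_column)

# True only when no column contains a missing value.
has_no_missing = not demo_df.isna().any().any()
print(has_no_missing)
```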

### Target

In [48]:
target = ['Status']

In [30]:
alt.Chart(train_df).mark_bar().encode(
    y = alt.Y('Status').sort('x'),
    x = 'count()'
).properties(
    title = "Histogram of Status, the target"
)

In [31]:
train_df['Status'].value_counts()

Status
C     4965
D     2665
CL     275
Name: count, dtype: int64

##### From the plot above, it's evident that there is a significant class imbalance, with only 3.48% of the data labeled as 'CL'. To address this issue, techniques such as over-sampling should be considered to ensure a more balanced dataset.
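
Over-sampling is one option. A lighter alternative, shown here only as a sketch, is to weight each class by its inverse frequency, using the same formula that scikit-learn applies for `class_weight="balanced"`. The counts are copied from the `value_counts()` output above; the variable names are illustrative.

```python
from collections import Counter

# Class counts copied from the value_counts() output above.
counts = Counter({"C": 4965, "D": 2665, "CL": 275})
n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the training data.
shares = {label: n / n_samples for label, n in counts.items()}

# Inverse-frequency weights: n_samples / (n_classes * class_count).
class_weights = {
    label: n_samples / (n_classes * n) for label, n in counts.items()
}

for label in counts:
    print(f"{label}: share={shares[label]:.4f}, weight={class_weights[label]:.3f}")
```

The rarest class, `CL`, receives the largest weight, so errors on it cost more during training without duplicating any rows.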

##### Let's examine the distribution of the specified features in relation to the target variable to identify any notable patterns.

### Categorical Features 

In [54]:
cat_col = ["Drug", "Sex", "Ascites", "Hepatomegaly", "Edema", "Spiders", "Stage"]

In [55]:
chart = alt.Chart(train_df).mark_bar().encode(
    alt.X(alt.repeat()),
    y = 'count()',
    color = 'Status'
).properties(
    width = 100,
    height = 200
).repeat(
        cat_col
)

chart.properties(
    title="Histograms of Categorical Features relative to Status"
)

### Numerical Features 

In [44]:
numeric_col = ["N_Days", "Age", "Bilirubin", "Albumin", "Copper", "Alk_Phos", "SGOT", "Tryglicerides", "Platelets", "Prothrombin"]

In [45]:
chart = alt.Chart(train_df).mark_boxplot().encode(
    x = alt.X(alt.repeat()).type('quantitative'),
    color = 'Status'
).properties(
    width = 300,
    height = 200
).repeat(
        numeric_col
)

chart.properties(
    title="Boxplots of Numerical Features relative to Status"
)

From the plots above, no single feature clearly separates the three `Status` classes. <br>
Among the numerical features, `N_Days` is the number of days between registration and the earlier of death, transplantation, or the study analysis time. Because it is defined in terms of the outcome itself, it is confusing and could be misleading to interpret, so we decided to remove it. <br>

We also observed that `Age` is recorded in days, not years. We could bin the data into the following categories, using 365-day years:

- Infants and Toddlers: 0 to 1,094 days (0 - 2 years)
- Children: 1,095 to 4,014 days (3 - 10 years)
- Adolescents: 4,015 to 6,569 days (11 - 17 years)
- Young Adults: 6,570 to 10,949 days (18 - 29 years)
- Adults: 10,950 to 14,599 days (30 - 39 years)
- Middle-Aged Adults: 14,600 to 18,249 days (40 - 49 years)
- Pre-Senior Adults: 18,250 to 21,899 days (50 - 59 years)
- Senior Adults: 21,900 to 25,549 days (60 - 69 years)
- Elderly: 25,550 to 29,199 days (70 - 79 years)
- Advanced Elderly: 29,200 days and above (80 years and above)
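
One way to implement this binning, sketched under the assumption of 365-day years, is to convert days to whole years and pass left-closed edges to `pd.cut`, so that every age lands in exactly one group. The helper name `bin_age` and the sample ages are hypothetical.

```python
import numpy as np
import pandas as pd

# Left-closed bin edges in whole years, matching the age groups listed above.
AGE_EDGES = [0, 3, 11, 18, 30, 40, 50, 60, 70, 80, np.inf]
AGE_LABELS = [
    "Infants and Toddlers", "Children", "Adolescents", "Young Adults",
    "Adults", "Middle-Aged Adults", "Pre-Senior Adults", "Senior Adults",
    "Elderly", "Advanced Elderly",
]


def bin_age(age_days: pd.Series) -> pd.Series:
    """Map age in days to an ordered age-group category."""
    # 365-day years are assumed; leap days are ignored.
    age_years = age_days // 365
    return pd.cut(age_years, bins=AGE_EDGES, labels=AGE_LABELS, right=False)


# Hypothetical ages in days, for illustration only.
ages = pd.Series([800, 10949, 10950, 21532, 29200])
age_groups = bin_age(ages)
print(age_groups.tolist())
```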
    
We also spotted outliers in several of the numerical features in the boxplots above. Removing them may give a cleaner analysis.
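
The notebook does not fix an outlier rule at this point. As one possible sketch, the snippet below flags values outside 1.5 times the interquartile range, a common convention; the helper `iqr_outlier_mask`, the factor `k`, and the sample values are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up values; the last one is deliberately extreme.
sample = pd.Series([0.6, 0.8, 1.0, 1.1, 1.3, 1.4, 2.0, 25.0])
mask = iqr_outlier_mask(sample)
print(sample[mask].tolist())
```

Returning a boolean mask rather than a filtered series leaves the choice between dropping, capping, or simply inspecting the flagged rows to the caller.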

### Correlation Matrix

##### Let's examine the correlation matrix of the numerical features to obtain a clearer insight into their interrelationships.

In [51]:
correlation_matrix_numeric = (
    train_df[numeric_col + target]
    .corr(numeric_only=True)
    .style.background_gradient(cmap="viridis")
    .set_table_styles(
        [
            {
                "selector": "table",
                "props": [
                    ("max-width", "50px"),
                    ("max-height", "50px"),
                    ("font-size", "2px"),
                ],
            }
        ]
    )
)
correlation_matrix_numeric

| | N_Days | Age | Bilirubin | Albumin | Copper | Alk_Phos | SGOT | Tryglicerides | Platelets | Prothrombin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N_Days | 1.0 | -0.102354 | -0.346434 | 0.255724 | -0.284355 | -0.030874 | -0.240918 | -0.186453 | 0.147626 | -0.156032 |
| Age | -0.102354 | 1.0 | 0.099016 | -0.114848 | 0.095199 | 0.025879 | -0.020768 | 0.021767 | -0.094822 | 0.141705 |
| Bilirubin | -0.346434 | 0.099016 | 1.0 | -0.303191 | 0.442223 | 0.131317 | 0.368653 | 0.315681 | -0.081987 | 0.294325 |
| Albumin | 0.255724 | -0.114848 | -0.303191 | 1.0 | -0.218479 | -0.083582 | -0.200928 | -0.112304 | 0.141284 | -0.2046 |
| Copper | -0.284355 | 0.095199 | 0.442223 | -0.218479 | 1.0 | 0.124058 | 0.323226 | 0.290435 | -0.107894 | 0.238771 |
| Alk_Phos | -0.030874 | 0.025879 | 0.131317 | -0.083582 | 0.124058 | 1.0 | 0.128746 | 0.087789 | 0.047869 | 0.079517 |
| SGOT | -0.240918 | -0.020768 | 0.368653 | -0.200928 | 0.323226 | 0.128746 | 1.0 | 0.155287 | -0.042004 | 0.136766 |
| Tryglicerides | -0.186453 | 0.021767 | 0.315681 | -0.112304 | 0.290435 | 0.087789 | 0.155287 | 1.0 | 0.006511 | 0.063582 |
| Platelets | 0.147626 | -0.094822 | -0.081987 | 0.141284 | -0.107894 | 0.047869 | -0.042004 | 0.006511 | 1.0 | -0.169741 |
| Prothrombin | -0.156032 | 0.141705 | 0.294325 | -0.2046 | 0.238771 | 0.079517 | 0.136766 | 0.063582 | -0.169741 | 1.0 |


##### From the correlation matrix, we can observe that `Bilirubin` and `Copper` exhibit the strongest correlation, with a coefficient of 0.442223. `Bilirubin` is also moderately correlated with `SGOT` (0.368653) and `Albumin` (-0.303191), each exceeding 0.3 in absolute value. None of these correlations is strong, but further analysis is needed to confirm that the features are not affected by multicollinearity.
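
One possible follow-up check, sketched here rather than taken from the notebook, is the variance inflation factor (VIF). For standardized features, the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix, so it can be computed directly from the values in the output above. A commonly cited rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.

```python
import numpy as np

cols = ["N_Days", "Age", "Bilirubin", "Albumin", "Copper",
        "Alk_Phos", "SGOT", "Tryglicerides", "Platelets", "Prothrombin"]

# Correlation matrix copied from the output above.
R = np.array([
    [1.0, -0.102354, -0.346434, 0.255724, -0.284355, -0.030874, -0.240918, -0.186453, 0.147626, -0.156032],
    [-0.102354, 1.0, 0.099016, -0.114848, 0.095199, 0.025879, -0.020768, 0.021767, -0.094822, 0.141705],
    [-0.346434, 0.099016, 1.0, -0.303191, 0.442223, 0.131317, 0.368653, 0.315681, -0.081987, 0.294325],
    [0.255724, -0.114848, -0.303191, 1.0, -0.218479, -0.083582, -0.200928, -0.112304, 0.141284, -0.2046],
    [-0.284355, 0.095199, 0.442223, -0.218479, 1.0, 0.124058, 0.323226, 0.290435, -0.107894, 0.238771],
    [-0.030874, 0.025879, 0.131317, -0.083582, 0.124058, 1.0, 0.128746, 0.087789, 0.047869, 0.079517],
    [-0.240918, -0.020768, 0.368653, -0.200928, 0.323226, 0.128746, 1.0, 0.155287, -0.042004, 0.136766],
    [-0.186453, 0.021767, 0.315681, -0.112304, 0.290435, 0.087789, 0.155287, 1.0, 0.006511, 0.063582],
    [0.147626, -0.094822, -0.081987, 0.141284, -0.107894, 0.047869, -0.042004, 0.006511, 1.0, -0.169741],
    [-0.156032, 0.141705, 0.294325, -0.2046, 0.238771, 0.079517, 0.136766, 0.063582, -0.169741, 1.0],
])

# VIF_j is the j-th diagonal entry of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(R))

# Print features from highest to lowest VIF.
for name, value in sorted(zip(cols, vif), key=lambda pair: -pair[1]):
    print(f"{name:>14}: {value:.2f}")
```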

### Summary of EDA

- Among the numerical columns, `id` and `N_Days` carry little useful information and could hinder the performance of our model, so we decided to drop them from our analysis.
- The `Age` column should be binned into age groups for better interpretation of the data.
- We saw no obvious strong relationship between any single feature and the target, nor any high correlation among the features themselves. Among all the correlations, those of `Bilirubin` with `Copper`, `SGOT`, and `Albumin` stood out and can be analyzed further given enough time.

As required, our model is evaluated using the log-loss metric.
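
For reference, multi-class log loss is the mean negative log of the probability that the model assigns to the true class. The sketch below is a minimal pure-Python version; the class order in `CLASSES`, the clipping constant `eps`, and the example predictions are assumptions for illustration, and each probability row is assumed to sum to 1.

```python
import math

CLASSES = ["C", "CL", "D"]  # assumed column order of each probability row


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, y_prob):
        p = row[CLASSES.index(label)]
        # Clip so that log(0) never occurs.
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


# Hypothetical predictions for two patients.
y_true = ["C", "D"]
y_prob = [[0.7, 0.1, 0.2], [0.2, 0.1, 0.7]]
loss = multiclass_log_loss(y_true, y_prob)
print(round(loss, 4))
```

Because the metric scores probabilities rather than hard labels, confident wrong predictions are penalized heavily, which is worth keeping in mind given how rare the `CL` class is.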

## Feature Engineering