## Liver Cirrhosis Prediction

Dataset found here: https://www.kaggle.com/datasets/fedesoriano/cirrhosis-prediction-dataset

The goal is to predict the stage (severity) of a patient's cirrhosis from the 18 clinical features identified below.

In [None]:
import pandas as pd

df = pd.read_csv('/content/cirrhosis.csv')
# drop the row identifier, and drop Status (the follow-up outcome) to avoid data leakage
df = df.drop(columns=['ID', 'Status'])
df.head()

In [None]:
df.info()

In [None]:
# identify numerical/categorical features
numerical = list(df.select_dtypes(['float64', 'int64']).columns)
categorical = list(df.select_dtypes(['object']).columns)

In [None]:
import numpy as np

# convert age from days to whole years
df['Age'] = (df['Age'] / 365).astype('int64')

# replace missing categorical values with the most frequent value
for col in categorical:
  df[col] = df[col].fillna(df[col].mode()[0])

# replace missing numerical values with the median
# (numerical still includes the target here, so missing Stage values are imputed too)
for col in numerical:
  df[col] = df[col].fillna(df[col].median())

# drop target from list
numerical = [col for col in numerical if col != 'Stage']

# simplify to binary classification problem: stage 4 vs. stages 1-3
df['Stage'] = np.where(df['Stage'] == 4, 1, 0)

In [None]:
import seaborn as sns

# plot target
_ = sns.countplot(data=df, x='Stage')

Findings:
- Several numerical and categorical features have missing values
- Target classes are imbalanced; converting to a binary classification problem reduces the imbalance slightly, though the negative class (stages 1-3) still dominates
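
The two checks behind these findings can be sketched on a toy table. The values below are made up for illustration and are not taken from the dataset; only the column names mirror it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cirrhosis table (values are made up for illustration).
toy = pd.DataFrame({
    "Drug": ["Placebo", None, "D-penicillamine", "Placebo", None, "Placebo"],
    "Cholesterol": [261.0, np.nan, 176.0, np.nan, 279.0, 248.0],
    "Stage": [4, 3, 4, 2, 1, 4],
})

# Missing values per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)

# Class proportions before and after collapsing to "stage 4 vs. the rest".
multi_class = toy["Stage"].value_counts(normalize=True).sort_index()
binary = pd.Series(np.where(toy["Stage"] == 4, 1, 0)).value_counts(normalize=True)
print(multi_class)
print(binary)
```

Comparing `multi_class` with `binary` shows how much of the imbalance the binary conversion removes.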

In [None]:
# count unique values in categorical columns
for col in categorical:
  print(f"{col} has {df[col].nunique()} values")

In [None]:
# describe numerical statistics
df.describe()

In [None]:
# analyze skew
df[numerical].skew()

Findings:
- Bilirubin, cholesterol, copper, alkaline phosphatase (Alk_Phos), and triglyceride levels are heavily right-skewed, but only the cholesterol levels (mean 350.3 mg/dL, median 309.5 mg/dL) are of particular concern
- SGOT is right-skewed, with a mean of 120.6 U/mL and a median of 114.7 U/mL, both outside the normal range
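
As a rough illustration of what right skew does to the mean and median, the sketch below uses a synthetic log-normal sample, not the dataset. The `np.log1p` transform is one common way to reduce such skew; it is shown here as an option and is not a step this notebook applies.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal); not taken from the dataset.
rng = np.random.default_rng(14)
values = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=1000), name="toy_lab_value")

# Right skew pulls the mean above the median.
print(f"mean={values.mean():.2f}, median={values.median():.2f}, skew={values.skew():.2f}")

# log1p compresses the long right tail.
logged = np.log1p(values)
print(f"skew after log1p={logged.skew():.2f}")
```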

In [None]:
def plot_continuous_distribution1(ax, data, column):
  _ = sns.histplot(ax=ax, data=data, x=column, kde=True).set(title='Distribution of ' + column)

def plot_categorical_distribution1(ax, data, column):
  _ = sns.countplot(ax=ax, data=data, x=column).set(title='Distribution of ' + column)

def plot_continuous_distribution2(ax, data, column):
  _ = sns.histplot(ax=ax, data=data, x=column, kde=True, hue='Stage').set(title='Distribution of ' + column)

def plot_categorical_distribution2(ax, data, column):
  _ = sns.countplot(ax=ax, data=data, x=column, hue='Stage').set(title='Distribution of ' + column)

def plot_outlier_check(data, column):
  # use the data argument rather than the global df
  _ = sns.boxplot(data=data, x=column)

In [None]:
import matplotlib.pyplot as plt

# plot box-and-whisker plots to check for outliers
plt.figure(figsize=(12, 8))
for i, col in enumerate(numerical):
  plt.subplot(4, 3, i + 1)
  plot_outlier_check(df, col)
  plt.title(col, size=18)
plt.tight_layout()

In [None]:
# count outliers per feature using the 1.5*IQR rule
# rows are only dropped when remove_outliers is True, since removal shrinks an already small dataset
remove_outliers = False
for col in numerical:
  q3 = round(df[col].quantile(0.75), 4)
  q1 = round(df[col].quantile(0.25), 4)
  upper_lim = round(q3 + 1.5*(q3-q1), 4)
  lower_lim = round(q1 - 1.5*(q3-q1), 4)
  is_outlier = (df[col] >= upper_lim) | (df[col] <= lower_lim)
  print(f"Number of Outliers in {col}: {is_outlier.sum()}")
  if remove_outliers:
    df = df[~is_outlier]

Findings:
- Several features have outliers, but removing them would further shrink an already small dataset
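
One way to handle outliers without losing rows is to clip values at the IQR fences instead of dropping them. This is a minimal sketch on made-up values; the helper `iqr_bounds` is hypothetical, and clipping is an alternative this notebook does not apply.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy column with one obvious outlier (values are made up).
toy = pd.DataFrame({"Bilirubin": [0.5, 0.7, 0.9, 1.1, 1.3, 1.4, 1.8, 28.0]})

lower, upper = iqr_bounds(toy["Bilirubin"])
n_outliers = int((~toy["Bilirubin"].between(lower, upper)).sum())

# Clipping caps extreme values at the fences instead of dropping the row.
clipped = toy["Bilirubin"].clip(lower=lower, upper=upper)

print(f"outliers: {n_outliers}, rows kept: {len(clipped)}, max after clipping: {clipped.max():.3f}")
```

Whether capping is appropriate depends on whether the extreme lab values are measurement errors or genuine signs of advanced disease.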

In [None]:
# plot numerical feature distributions
n_cols = 5
n_rows = -(-len(numerical) // n_cols)  # ceiling division so every feature gets an axis
fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 15), squeeze=False)
for ax, col in zip(axes.flat, numerical):
  #plot_continuous_distribution1(ax, df, col)
  plot_continuous_distribution2(ax, df, col)
# hide any unused axes
for ax in axes.ravel()[len(numerical):]:
  ax.set_visible(False)

In [None]:
# find categorical value proportions
for col in categorical:
  print(f"{round(df[col].value_counts(normalize=True), 4)}\n")

In [None]:
# plot categorical feature distributions
fig, axes = plt.subplots(2, 3, figsize=(16, 9))
for ax, col in zip(axes.flat, categorical):
  #plot_categorical_distribution1(ax, df, col)
  plot_categorical_distribution2(ax, df, col)

In [None]:
# find share of stage 4 patients for each categorical value
for col in categorical:
  for val in df[col].unique():
    print(f"{col} = {val} -> {round(df[df[col] == val]['Stage'].mean()*100, 2)}% with stage 4 cirrhosis")
  print('\n')

Findings:
- Patients with ascites are more likely to have severe (stage 4) cirrhosis than those without
- Patients with spiders (spider angiomas) are more likely to have severe cirrhosis than those without
- Patients with edema are more likely to have severe cirrhosis than those without
- Males are slightly more likely to have severe cirrhosis
- Outcome differs little between drug groups
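
Per-category rates are easier to compare next to the group sizes, since a percentage from a handful of patients is noisy. The sketch below uses made-up rows and a hypothetical helper `stage4_rate`; it assumes the binary target coding used above (1 = stage 4).

```python
import pandas as pd

# Toy data; 1 = stage 4, 0 = stages 1-3 (values are made up).
toy = pd.DataFrame({
    "Ascites": ["Y", "Y", "N", "N", "N", "Y", "N", "N"],
    "Sex":     ["F", "M", "F", "F", "M", "F", "F", "M"],
    "Stage":   [1,   1,   0,   0,   1,   0,   0,   0],
})

def stage4_rate(data, column, target="Stage"):
    """Share of rows with target == 1 and group size, per category level."""
    summary = data.groupby(column)[target].agg(rate="mean", n="size")
    return summary.sort_values("rate", ascending=False)

for col in ["Ascites", "Sex"]:
    print(stage4_rate(toy, col), end="\n\n")
```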

In [None]:
# plot numerical feature correlations (categorical columns are still strings at this point)
corr = df.corr(numeric_only=True)
corr.style.background_gradient(cmap='coolwarm')

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

# convert categorical to integer codes (the encoder is refit on each column)
le = LabelEncoder()
df[categorical] = df[categorical].apply(le.fit_transform)

# prepare data for partitioning
X1 = df.drop(columns=['Stage'])
y1 = df['Stage']

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# score each feature against the target with the ANOVA F-test
selector = SelectKBest(score_func=f_classif, k='all')
results = selector.fit(X1, y1)

fscores = pd.DataFrame({'Attribute': X1.columns, 'Score': results.scores_})
# sort_values returns a new frame, so print the sorted result directly
print(fscores.sort_values(by='Score', ascending=False))

Findings:
- Albumin level and presence of hepatomegaly appear to be the most relevant features for predicting cirrhosis severity
- Prothrombin time and the presence of ascites and edema are also relevant, supporting the findings above
- Number of days between registration and the earliest of death, transplant, or end of study appears relevant, but this value is only known after follow-up, which suggests some data leakage
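
To show how an F-score ranking like the one above is read, here is a small sketch on synthetic data from `make_classification`, where the informative columns are known in advance. The column names `f0` to `f5` are placeholders, and the p-values are added only to show that `f_classif` returns them alongside the scores.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Synthetic data: with shuffle=False the first 2 columns are informative, the rest are noise.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=14,
)
X_toy = pd.DataFrame(X_toy, columns=[f"f{i}" for i in range(6)])

# f_classif returns one F-statistic and one p-value per feature.
f_stat, p_val = f_classif(X_toy, y_toy)
ranking = (
    pd.DataFrame({"Attribute": X_toy.columns, "Score": f_stat, "p_value": p_val})
    .sort_values("Score", ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```

The F-test only measures differences in class means for each feature on its own, so it can miss features that matter only in combination with others.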

In [None]:
from imblearn.over_sampling import SMOTE

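# toggle to train only on features with an F-score above 5
feature_select = False
ncols = fscores[fscores['Score'] > 5]['Attribute']

# oversample to address class imbalance
sm = SMOTE(random_state=14)

# standardize data
ss = StandardScaler()

# note: resampling and scaling the full dataset before cross-validation
# lets information from the test folds leak into training
if feature_select:
  X, y = sm.fit_resample(X1[ncols], y1)
else:
  X, y = sm.fit_resample(X1, y1)
X = ss.fit_transform(X)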

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.metrics import accuracy_score, classification_report

# use k-fold cross validation
skf = StratifiedKFold(n_splits=5, random_state=14, shuffle=True)
#model = RandomForestClassifier()
#model = SVC(kernel='rbf')
model = xgb.XGBClassifier()
#model = LogisticRegression()
fold = 1
acc = []

# partition data
for train_i, test_i in skf.split(X, y):
  X_train, X_test = X[train_i], X[test_i]
  y_train, y_test = y[train_i], y[test_i]
  model.fit(X_train, y_train)

  y_pred = model.predict(X_test)
  fold_acc = accuracy_score(y_test, y_pred)
  print(f"Accuracy Score on Fold {fold}: {round(fold_acc * 100, 2)}%")
  fold += 1
  acc.append(fold_acc)

print(f"Mean Accuracy Score: {round(np.mean(acc) * 100, 2)}%")

In [None]:
from sklearn.metrics import classification_report, roc_auc_score

# evaluate the model from the final fold on that fold's held-out data only
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)

print(classification_report(y_test, y_pred))
print(f"AUC: {roc_auc_score(y_test, y_prob[:, 1])}")

Findings:
- XGBoost classifier outperformed the other classifiers, but not by much
- Model performance is relatively poor, which can perhaps be attributed to the small dataset
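
With so little data, the evaluation protocol also affects how far the reported scores can be trusted. A minimal sketch of a fold-wise protocol follows, in which oversampling and scaling are fit on the training fold only and the AUC is averaged over all folds. To keep it dependent on scikit-learn alone, it swaps in random oversampling (`sklearn.utils.resample`) for SMOTE and logistic regression for XGBoost, and it runs on synthetic data, so the numbers say nothing about the cirrhosis dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the cirrhosis features (class 1 is the minority).
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=14
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)
aucs = []
for train_i, test_i in skf.split(X_toy, y_toy):
    X_tr, y_tr = X_toy[train_i], y_toy[train_i]
    X_te, y_te = X_toy[test_i], y_toy[test_i]

    # Oversample the minority class using training rows only.
    minority = y_tr == 1
    n_extra = int((~minority).sum() - minority.sum())
    X_extra, y_extra = resample(
        X_tr[minority], y_tr[minority], n_samples=n_extra, replace=True, random_state=14
    )
    X_bal = np.vstack([X_tr, X_extra])
    y_bal = np.concatenate([y_tr, y_extra])

    # Fit the scaler on the training fold, then apply it to the held-out fold.
    scaler = StandardScaler().fit(X_bal)
    clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_bal), y_bal)

    # Collect out-of-fold AUC so every fold contributes to the final estimate.
    prob = clf.predict_proba(scaler.transform(X_te))[:, 1]
    aucs.append(roc_auc_score(y_te, prob))

print(f"Mean out-of-fold AUC: {np.mean(aucs):.3f}")
```

Because the held-out fold is never resampled or used to fit the scaler, each fold's score reflects data the model has not seen in any form.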