<a href="https://www.kaggle.com/code/vidhikishorwaghela/prediction-of-cirrhosis-multiclass-approach?scriptVersionId=153866056" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Multi-class prediction model for cirrhosis outcomes:

### Import Necessary Libraries:

Here, we import essential libraries for data manipulation, model training, and evaluation.

In [1]:
#Importing the necessary libraries:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, accuracy_score, classification_report



### Loading the dataset and exploring it at the same time:

We load the training and test datasets from Kaggle, which contain information about patients with cirrhosis.

In [2]:
#Loading the datasets:
train_df = pd.read_csv('/kaggle/input/playground-series-s3e26/train.csv')
test_df = pd.read_csv('/kaggle/input/playground-series-s3e26/test.csv')

In [3]:
train_df.columns

Index(['id', 'N_Days', 'Drug', 'Age', 'Sex', 'Ascites', 'Hepatomegaly',
       'Spiders', 'Edema', 'Bilirubin', 'Cholesterol', 'Albumin', 'Copper',
       'Alk_Phos', 'SGOT', 'Tryglicerides', 'Platelets', 'Prothrombin',
       'Stage', 'Status'],
      dtype='object')

In [4]:
test_df.columns

Index(['id', 'N_Days', 'Drug', 'Age', 'Sex', 'Ascites', 'Hepatomegaly',
       'Spiders', 'Edema', 'Bilirubin', 'Cholesterol', 'Albumin', 'Copper',
       'Alk_Phos', 'SGOT', 'Tryglicerides', 'Platelets', 'Prothrombin',
       'Stage'],
      dtype='object')
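
The two column listings differ only in `Status`, the target, which is absent from the test set. A minimal sketch of checking this programmatically, using tiny in-memory CSVs with hypothetical rows in place of the real files:

```python
import io
import pandas as pd

# Tiny in-memory CSVs standing in for train.csv and test.csv (hypothetical rows).
train_csv = io.StringIO("id,Age,Stage,Status\n0,21532,3.0,D\n1,19237,3.0,C\n")
test_csv = io.StringIO("id,Age,Stage\n7905,19724,2.0\n")

toy_train = pd.read_csv(train_csv)
toy_test = pd.read_csv(test_csv)

# Columns present in train but not in test: this should be only the target.
only_in_train = toy_train.columns.difference(toy_test.columns).tolist()
print(only_in_train)
```

The same `columns.difference` call could be applied to `train_df` and `test_df` directly.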

In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7905 entries, 0 to 7904
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             7905 non-null   int64  
 1   N_Days         7905 non-null   int64  
 2   Drug           7905 non-null   object 
 3   Age            7905 non-null   int64  
 4   Sex            7905 non-null   object 
 5   Ascites        7905 non-null   object 
 6   Hepatomegaly   7905 non-null   object 
 7   Spiders        7905 non-null   object 
 8   Edema          7905 non-null   object 
 9   Bilirubin      7905 non-null   float64
 10  Cholesterol    7905 non-null   float64
 11  Albumin        7905 non-null   float64
 12  Copper         7905 non-null   float64
 13  Alk_Phos       7905 non-null   float64
 14  SGOT           7905 non-null   float64
 15  Tryglicerides  7905 non-null   float64
 16  Platelets      7905 non-null   float64
 17  Prothrombin    7905 non-null   float64
 18  Stage   

In [6]:
train_df.head(3)

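| | id | N_Days | Drug | Age | Sex | Ascites | Hepatomegaly | Spiders | Edema | Bilirubin | Cholesterol | Albumin | Copper | Alk_Phos | SGOT | Tryglicerides | Platelets | Prothrombin | Stage | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 999 | D-penicillamine | 21532 | M | N | N | N | N | 2.3 | 316.0 | 3.35 | 172.0 | 1601.0 | 179.8 | 63.0 | 394.0 | 9.7 | 3.0 | D |
| 1 | 1 | 2574 | Placebo | 19237 | F | N | N | N | N | 0.9 | 364.0 | 3.54 | 63.0 | 1440.0 | 134.85 | 88.0 | 361.0 | 11.0 | 3.0 | C |
| 2 | 2 | 3428 | Placebo | 13727 | F | N | Y | Y | Y | 3.3 | 299.0 | 3.55 | 131.0 | 1029.0 | 119.35 | 50.0 | 199.0 | 11.7 | 4.0 | D |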


In [7]:
train_df.tail(3)

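| | id | N_Days | Drug | Age | Sex | Ascites | Hepatomegaly | Spiders | Edema | Bilirubin | Cholesterol | Albumin | Copper | Alk_Phos | SGOT | Tryglicerides | Platelets | Prothrombin | Stage | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7902 | 7902 | 1576 | D-penicillamine | 25873 | F | N | N | Y | S | 2.0 | 225.0 | 3.19 | 51.0 | 933.0 | 69.75 | 62.0 | 200.0 | 12.7 | 2.0 | D |
| 7903 | 7903 | 3584 | D-penicillamine | 22960 | M | N | Y | N | N | 0.7 | 248.0 | 2.75 | 32.0 | 1003.0 | 57.35 | 118.0 | 221.0 | 10.6 | 4.0 | D |
| 7904 | 7904 | 1978 | D-penicillamine | 19237 | F | N | N | N | N | 0.7 | 256.0 | 3.23 | 22.0 | 645.0 | 74.4 | 85.0 | 336.0 | 10.3 | 3.0 | C |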


In [8]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5271 entries, 0 to 5270
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             5271 non-null   int64  
 1   N_Days         5271 non-null   int64  
 2   Drug           5271 non-null   object 
 3   Age            5271 non-null   int64  
 4   Sex            5271 non-null   object 
 5   Ascites        5271 non-null   object 
 6   Hepatomegaly   5271 non-null   object 
 7   Spiders        5271 non-null   object 
 8   Edema          5271 non-null   object 
 9   Bilirubin      5271 non-null   float64
 10  Cholesterol    5271 non-null   float64
 11  Albumin        5271 non-null   float64
 12  Copper         5271 non-null   float64
 13  Alk_Phos       5271 non-null   float64
 14  SGOT           5271 non-null   float64
 15  Tryglicerides  5271 non-null   float64
 16  Platelets      5271 non-null   float64
 17  Prothrombin    5271 non-null   float64
 18  Stage   

In [9]:
test_df.head(3)

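| | id | N_Days | Drug | Age | Sex | Ascites | Hepatomegaly | Spiders | Edema | Bilirubin | Cholesterol | Albumin | Copper | Alk_Phos | SGOT | Tryglicerides | Platelets | Prothrombin | Stage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7905 | 3839 | D-penicillamine | 19724 | F | N | Y | N | N | 1.2 | 546.0 | 3.37 | 65.0 | 1636.0 | 151.9 | 90.0 | 430.0 | 10.6 | 2.0 |
| 1 | 7906 | 2468 | D-penicillamine | 14975 | F | N | N | N | N | 1.1 | 660.0 | 4.22 | 94.0 | 1257.0 | 151.9 | 155.0 | 227.0 | 10.0 | 2.0 |
| 2 | 7907 | 51 | Placebo | 13149 | F | N | Y | N | Y | 2.0 | 151.0 | 2.96 | 46.0 | 961.0 | 69.75 | 101.0 | 213.0 | 13.0 | 4.0 |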


In [10]:
test_df.tail(3)

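| | id | N_Days | Drug | Age | Sex | Ascites | Hepatomegaly | Spiders | Edema | Bilirubin | Cholesterol | Albumin | Copper | Alk_Phos | SGOT | Tryglicerides | Platelets | Prothrombin | Stage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5268 | 13173 | 3707 | D-penicillamine | 16990 | F | N | Y | N | N | 0.8 | 315.0 | 4.09 | 13.0 | 1637.0 | 170.5 | 70.0 | 426.0 | 10.9 | 3.0 |
| 5269 | 13174 | 1216 | Placebo | 11773 | F | N | N | N | N | 0.7 | 329.0 | 3.8 | 52.0 | 678.0 | 57.0 | 126.0 | 306.0 | 10.2 | 1.0 |
| 5270 | 13175 | 2272 | D-penicillamine | 21600 | F | N | N | N | N | 2.0 | 232.0 | 3.42 | 18.0 | 1636.0 | 170.5 | 83.0 | 213.0 | 13.6 | 2.0 |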


### Feature Engineering and Preprocessing:

We drop the `id` and `N_Days` columns, then convert the six categorical variables (`Drug`, `Sex`, `Ascites`, `Hepatomegaly`, `Spiders`, `Edema`) to numerical indicator columns using one-hot encoding.
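
One-hot encoding the training and test frames separately can produce different columns if a category appears in only one of them. The sketch below shows one way to guard against that, assuming toy frames with illustrative values (`toy_train`, `toy_test`, and the missing `S` category are hypothetical, not taken from the competition data):

```python
import pandas as pd

# Toy frames standing in for the competition data (values are illustrative only).
toy_train = pd.DataFrame({"Edema": ["N", "Y", "S"], "Bilirubin": [2.3, 0.9, 3.3]})
toy_test = pd.DataFrame({"Edema": ["N", "Y"], "Bilirubin": [1.2, 1.1]})  # no "S" here

# One-hot encode each frame separately.
train_enc = pd.get_dummies(toy_train, columns=["Edema"])
test_enc = pd.get_dummies(toy_test, columns=["Edema"])

# Align the test columns to the training columns: categories missing from the
# test frame become all-zero columns, and the column order matches training.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

If both frames already contain every category, the `reindex` call changes nothing, so it is a cheap safety check rather than a required step.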

In [11]:
#Feature engineering and preprocessing:
cols_to_drop = ['id', 'N_Days']

In [12]:
#Dropping the unnecessary columns:
train_df = train_df.drop(cols_to_drop, axis=1)
test_df = test_df.drop(cols_to_drop, axis=1)

In [13]:
#Convert categorical variables to numerical using one-hot encoding:
train_df = pd.get_dummies(train_df, columns=['Drug', 'Sex', 'Ascites', 'Hepatomegaly', 'Spiders', 'Edema'])
test_df = pd.get_dummies(test_df, columns=['Drug', 'Sex', 'Ascites', 'Hepatomegaly', 'Spiders', 'Edema'])

In [14]:
#Separating the features (X) from the target column 'Status' (y):
X_train = train_df.drop('Status', axis=1)
y_train = train_df['Status']

### Split Data for Training and Validation:

The training data is split into training and validation sets to evaluate the model's performance.
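
The split in the cell below is purely random. When classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps each class's share the same in both parts. A minimal sketch on synthetic labels (the 60/10/30 proportions are hypothetical and chosen only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (hypothetical proportions, not the real data).
rng = np.random.default_rng(42)
y = np.array(["C"] * 60 + ["CL"] * 10 + ["D"] * 30)
X = rng.normal(size=(len(y), 4))

# stratify=y keeps each class's share the same in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

labels, counts = np.unique(y_va, return_counts=True)
print(dict(zip(labels, counts)))
```

This is an optional variation, not what the notebook does; whether it helps here would need to be checked against the validation metrics.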

In [15]:
#Splitting the data for training and validation:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [16]:
import warnings

# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf_model = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf_model, param_grid, cv=5, scoring='neg_log_loss')
grid_search.fit(X_train, y_train)


In [17]:
# Get the best model from the grid search
best_rf_model = grid_search.best_estimator_
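
The grid above contains 3 × 3 × 3 × 3 = 81 parameter combinations, each fitted on 5 folds, for 405 fits before the final refit. The sketch below shows how the winning parameters and score can be inspected, using synthetic data and a deliberately tiny grid (`X_demo`, `y_demo`, and `small_grid` are hypothetical stand-ins) so that it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the preprocessed training set.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# A deliberately tiny grid so the sketch runs quickly.
small_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    small_grid,
    cv=3,
    scoring="neg_log_loss",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated *negative* log loss, so flip the sign.
print("Best parameters:", search.best_params_)
print("Best CV log loss:", -search.best_score_)
```

The same two attributes, `best_params_` and `best_score_`, are available on `grid_search` after it has been fitted.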

### Make Predictions on the Validation Set:

Predictions are made on the validation set, and probabilities are obtained for each class.
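
Each column of the `predict_proba` output corresponds to one entry of the fitted model's `classes_` attribute, in that order. A small sketch on synthetic data (the features and the classifier `clf` are hypothetical; only the three label names come from the `Status` column):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic example with the same three string labels as 'Status'.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 3))
y_toy = np.array(["D", "C", "CL"] * 10)

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy[:5])

# Column j of `proba` is the probability of clf.classes_[j];
# classes_ is sorted, so the order here is C, CL, D.
print(clf.classes_)
print(proba.shape)
print(proba.sum(axis=1))  # each row sums to 1

# Map the most probable column back to its label.
pred_labels = clf.classes_[np.argmax(proba, axis=1)]
print(pred_labels)
```

Reading the order from `classes_` avoids relying on an assumed column order when labelling probabilities.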

In [18]:
# Make predictions on the validation set
y_pred_proba = best_rf_model.predict_proba(X_val)

### Evaluate the Model:

The model's performance on the validation set is evaluated using log loss (the metric used for hyperparameter tuning), along with accuracy and a per-class classification report.
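
Multi-class log loss is the mean negative logarithm of the probability the model assigned to each sample's true class. The following sketch computes it by hand with NumPy on three hypothetical predictions, to make clear what the metric rewards:

```python
import numpy as np

# Hypothetical predictions for three samples over the classes (C, CL, D).
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
])
y_true = np.array([0, 1, 2])  # integer index of the true class per row

# Pick out the probability assigned to the true class of each row,
# then average the negative logs.
p_true = proba[np.arange(len(y_true)), y_true]
manual_log_loss = -np.mean(np.log(p_true))

print(round(float(manual_log_loss), 4))
```

Confident wrong predictions are penalised heavily, because the log of a small probability is a large negative number.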



In [19]:
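# Convert labels to integer codes using LabelEncoder.
# LabelEncoder sorts the classes alphabetically (C, CL, D), which matches the
# column order of predict_proba (best_rf_model.classes_), so the argmax of each
# probability row is directly comparable with the encoded labels.
label_encoder = LabelEncoder()
y_val_encoded = label_encoder.fit_transform(y_val)
y_pred_encoded = np.argmax(y_pred_proba, axis=1)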


In [20]:
# Evaluate the model
log_loss_score = log_loss(y_val_encoded, y_pred_proba)
accuracy = accuracy_score(y_val_encoded, y_pred_encoded)
classification_rep = classification_report(y_val_encoded, y_pred_encoded)

print(f'Log Loss Score on Validation Set: {log_loss_score}')
print(f'Accuracy on Validation Set: {accuracy}')
print('Classification Report:')
print(classification_rep)

Log Loss Score on Validation Set: 0.4729549340409714
Accuracy on Validation Set: 0.8228969006957622
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.93      0.87       966
           1       0.83      0.10      0.17        52
           2       0.83      0.71      0.77       563

    accuracy                           0.82      1581
   macro avg       0.83      0.58      0.60      1581
weighted avg       0.82      0.82      0.81      1581



### Making the Predictions on the Test Set:

The trained model is used to make predictions on the test set.


In [21]:
# Use the tuned model to predict class probabilities for the test set
# (test_df must have the same columns, in the same order, as X_train)
test_predictions = best_rf_model.predict_proba(test_df)

In [22]:
submission_df = pd.DataFrame(test_predictions, columns=['Status_C', 'Status_CL', 'Status_D'])


### Creating the Submission File:

In [23]:
# 'id' was dropped from test_df during preprocessing, so the else branch runs here
if 'id' in test_df.columns:
    submission_df.insert(0, 'id', test_df['id'])
else:
    # Rebuild the ids: test ids start at 7905 and are consecutive
    # (see test_df.head() and test_df.tail() above)
    submission_df.insert(0, 'id', range(7905, 7905 + len(test_df)))

# Save the submission file
submission_df.to_csv('/kaggle/working/submission.csv', index=False)

The final step involves creating a submission file in the required format for Kaggle, including 'id', 'Status_C', 'Status_CL', and 'Status_D' columns.
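
As a minimal sketch of that format, the column names can be derived from the class labels instead of being typed by hand. The arrays below are hypothetical stand-ins for the model's `classes_`, its predicted probabilities, and the test ids:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: class labels in the model's column order,
# predicted probabilities, and the ids of the test rows.
classes = np.array(["C", "CL", "D"])
proba = np.array([[0.6, 0.1, 0.3], [0.2, 0.1, 0.7]])
test_ids = pd.Series([7905, 7906], name="id")

# Name each probability column after its class, then put 'id' first.
submission = pd.DataFrame(proba, columns=[f"Status_{c}" for c in classes])
submission.insert(0, "id", test_ids.to_numpy())

print(submission.columns.tolist())
```

Keeping the ids in a separate variable before dropping the `id` column would remove the need to rebuild them from a hard-coded starting value.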

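This notebook uses a Random Forest classifier, tuned with a grid search, to predict one of three outcomes (`C`, `CL`, `D`) for patients with cirrhosis. Preprocessing puts the data into a numeric format suitable for training, and the model is assessed on a held-out validation set (log loss of about 0.473, accuracy of about 0.823) before predicting on the test set for submission to Kaggle. Recall for the minority class (label 1, 52 validation samples) is only 0.10, so the overall accuracy mostly reflects the two larger classes.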