# **TFG**  

***Machine Learning-Based Classification of Hospital Discharge Diagnoses Using SNOMED-CT Encoded Health Problems and Clinical Data***  

Cindy Chen

Universitat de Barcelona

2024-2025


In [1]:
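# Import modules
import os

import gradio as gr
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display

# Import only the PyCaret functions used in this notebook (replaces the
# wildcard import and the two redundant import lines that followed it)
from pycaret.classification import (
    setup, get_config, compare_models, create_model, evaluate_model,
    tune_model, predict_model, plot_model, pull, save_model, load_model,
)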

%matplotlib inline

## **Data Importation**

Import Subset 1, which is used to find the best model, train it, tune it and evaluate it.

In [2]:
# Import data
data = pd.read_csv("C:/Users/Cindy Chen/Desktop/TFG/data/04_data/subset_1.csv")
data

| Unnamed: 0 | sex_atr | age | death | episode_duration | care_level_duration | num_health_issues | ongoing | health_issue_motive | health_issue_ou_med_ref | snomed_code | ... | prescription_phform_ref | drg_weight | drg_ref | drg_soi_ref | drg_rom_ref | drg_mdc_ref | diag_class_ref_S | diag_class_ref_H | diag_class_ref_P | icd10_capitulo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 81.0 | 1.0 | 706.924438 | 706.924438 | 3.0 | 1.0 | 1.0 | 11.859204 | 240131006.0 | ... | 152.0 | 1.7913 | 469.0 | 4.0 | 4.0 | 11.0 | 1.0 | 0.0 | 0.0 | 17 |
| 1 | 1.0 | 47.0 | 0.0 | 287.425842 | 146.750000 | 3.0 | 1.0 | 0.0 | 10.077813 | 109989006.0 | ... | 207.0 | 0.9884 | 662.0 | 3.0 | 3.0 | 16.0 | 1.0 | 0.0 | 0.0 | 17 |
| 2 | 1.0 | 73.0 | 0.0 | 2222.074951 | 1703.783569 | 5.0 | 1.0 | 4.0 | 9.377647 | 439740005.0 | ... | 152.0 | 5.9280 | 260.0 | 4.0 | 4.0 | 7.0 | 1.0 | 0.0 | 0.0 | 1 |
| 3 | 2.0 | 74.0 | 0.0 | 1358.753296 | 433.244171 | 8.0 | 1.0 | 0.0 | 10.600160 | 433146000.0 | ... | 152.0 | 2.3206 | 231.0 | 3.0 | 2.0 | 6.0 | 1.0 | 0.0 | 0.0 | 9 |
| 4 | 2.0 | 74.0 | 0.0 | 1358.753296 | 481.936676 | 8.0 | 1.0 | 0.0 | 10.600160 | 307496006.0 | ... | 207.0 | 2.3206 | 231.0 | 3.0 | 2.0 | 6.0 | 1.0 | 0.0 | 0.0 | 16 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1045979 | 2.0 | 67.0 | 1.0 | 1842.586670 | 2.855278 | 4.0 | 1.0 | 1.0 | 10.965332 | 409623005.0 | ... | 152.0 | 4.4717 | 710.0 | 4.0 | 4.0 | 18.0 | 1.0 | 0.0 | 0.0 | 14 |
| 1045980 | 1.0 | 77.0 | 0.0 | 163.000000 | 94.935280 | 2.0 | 1.0 | 0.0 | 12.826443 | 230690007.0 | ... | 110.0 | 2.0237 | 260.0 | 2.0 | 1.0 | 7.0 | 1.0 | 0.0 | 0.0 | 14 |
| 1045981 | 2.0 | 45.0 | 0.0 | 1512.166626 | 107.043053 | 3.0 | 1.0 | 0.0 | 10.077813 | 307651005.0 | ... | 110.0 | 17.9363 | 7.0 | 4.0 | 3.0 | 17.0 | 1.0 | 0.0 | 0.0 | 1 |
| 1045982 | 2.0 | 54.0 | 0.0 | 1311.468872 | 1306.877197 | 1.0 | 1.0 | 0.0 | 9.260414 | 77493009.0 | ... | 207.0 | 1.3415 | 320.0 | 2.0 | 1.0 | 8.0 | 0.0 | 1.0 | 0.0 | 15 |
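
The `Unnamed: 0` column in the preview above is what pandas typically produces when a CSV was written together with its row index. If that is what happened here (an assumption, since the file itself is not shown), the column carries no clinical information and would otherwise be passed to the model as a feature. A minimal sketch on a made-up three-row CSV shows how `index_col=0` keeps it out of the columns:

```python
import io

import pandas as pd

# Made-up CSV text mimicking a file saved with DataFrame.to_csv() (index included).
csv_text = ",sex_atr,age,icd10_capitulo\n0,2.0,81.0,17\n1,1.0,47.0,17\n2,1.0,73.0,1\n"

# Default read: the unnamed index column becomes a regular column "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead of as a feature.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Dropping the column after loading (`data.drop(columns="Unnamed: 0")`) would have the same effect on the feature set.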


Set up the PyCaret environment and select the target column, which in our case is 'icd10_capitulo'. By default, the ```setup``` function also splits the data into a training set (70%) and a test set (30%).
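
To make the 70/30 split concrete, here is a small pure-Python sketch of a stratified hold-out split on made-up labels. It only illustrates the idea: PyCaret performs the split internally, whether it stratifies depends on its `data_split_stratify` setting, and the helper name `stratified_split` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_size=0.7, seed=123):
    """Return (train_idx, test_idx), keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Cut each class separately so both sets keep the class balance.
        cut = round(len(indices) * train_size)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Made-up target: 60 rows of chapter 1 and 40 rows of chapter 17.
labels = [1] * 60 + [17] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```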

In [None]:
# Set up the PyCaret environment
clf = setup(data=data, target='icd10_capitulo', session_id=123, verbose=True)

![setup](./images/S1_setup.png)

In [None]:
# Get preprocessing pipeline
get_config('pipeline')

## **Model Comparison**

We use the ```compare_models``` function to compare the selected models: Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM, KNN and MLP.  

By default, this function trains and evaluates all estimators available in the model library using cross-validation; with the ```include``` argument, only the listed estimators are compared. The output of this function is a scoring grid with average cross-validated scores.

In [None]:
# Compare models
model_comparison = compare_models(include=['lr', 'dt', 'rf', 'xgboost', 'svm', 'knn', 'mlp'])

In [None]:
# Save results table
model_comparison_df = pull()
model_comparison_df.to_csv('S1_model_comparison_df.csv', index=False)

# Save the best model returned by compare_models
save_model(model_comparison, 'S1_model_comparison')

## **Best Model**

Once we have compared the different models, we select the best one, which here is the decision tree (```dt```). Then, using the ```create_model``` function, we train the model and evaluate its performance using cross-validation. By default, it uses 10 folds.
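
As a reminder of what 10-fold cross-validation does, the sketch below builds the folds by hand on made-up row indices: each row lands in exactly one validation fold, and the model would be fitted ten times, each time on the other nine folds. The helper `k_fold_indices` is hypothetical; PyCaret handles this internally and may use a stratified variant.

```python
def k_fold_indices(n_rows, n_folds=10):
    """Yield (train_idx, valid_idx) pairs for plain k-fold cross-validation."""
    indices = list(range(n_rows))
    # Spread the remainder so that fold sizes differ by at most one row.
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(n_rows=25, n_folds=10))
for train_idx, valid_idx in folds[:3]:
    print(len(train_idx), len(valid_idx))
```

The reported cross-validated score is then the average of the ten per-fold scores.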

In [None]:
# Model evaluation: cross validation 10 folds
decision_tree_model = create_model('dt')  
evaluate_model(decision_tree_model)

In [None]:
# Save model
save_model(decision_tree_model, 'S1_best_model_dt')

In [None]:
# Load model
decision_tree_model = load_model("C:/Users/Cindy Chen/Desktop/TFG/plots/04/S1/S1_best_model_dt")

## **Optimize and Tune Model**

The ```tune_model``` function tunes the hyperparameters of the model. The output of this function is a scoring grid with cross-validated scores by fold.  

By default, it uses random search from scikit-learn (```RandomizedSearchCV```), and the number of iterations (```n_iter```) is set to 10.

In [None]:
# Optimize and tune model using random search (default settings)
decision_tree_model_tuned = tune_model(decision_tree_model)

As we can see, random search did not obtain better results than the original model, so let's try increasing the number of iterations to 100.

In [None]:
# Optimize and tune model using random search with more iterations
decision_tree_model_tuned = tune_model(decision_tree_model, n_iter=100) # choose_better=False

Even with more iterations the model doesn't improve, so let's define our own search space using ```custom_grid```.  

For the decision tree, the hyperparameters we chose to tune are:
- max_depth
- min_samples_split
- min_samples_leaf
- criterion
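
To get a sense of the size of the search space, the sketch below enumerates a grid over these four hyperparameters (with the candidate values used in the tuning cells that follow) and contrasts a random sample of 10 candidates with the full enumeration. The fold count of 10 is an assumption taken from the cross-validation default mentioned earlier.

```python
import itertools
import random

grid = {
    "max_depth": [None, 3, 5, 10, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "criterion": ["gini", "entropy", "log_loss"],
}

# Every combination an exhaustive grid search would evaluate.
all_candidates = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]

# A random search with n_iter=10 evaluates only a sample of them.
rng = random.Random(123)
sampled = rng.sample(all_candidates, k=10)

n_folds = 10  # assumed number of cross-validation folds
print(len(all_candidates))            # 135 combinations
print(len(all_candidates) * n_folds)  # 1350 model fits for the full grid
print(len(sampled) * n_folds)         # 100 model fits for the random search
```

This is why an exhaustive grid search can find a combination that a 10-iteration random search misses, at the cost of many more model fits.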

In [None]:
# Optimize and tune model using a custom grid and random search
params = {'max_depth': [None, 3, 5, 10, 15],
          'min_samples_split': [2, 5, 10],
          'min_samples_leaf': [1, 2, 5],
          'criterion': ['gini', 'entropy', 'log_loss']
} 

decision_tree_model_tuned = tune_model(decision_tree_model, custom_grid=params) # choose_better=False

In [None]:
# Optimize and tune model using custom grid and GridSearchCV
params = {'max_depth': [None, 3, 5, 10, 15],
          'min_samples_split': [2, 5, 10],
          'min_samples_leaf': [1, 2, 5],
          'criterion': ['gini', 'entropy', 'log_loss']
}

decision_tree_model_tuned = tune_model(decision_tree_model, custom_grid=params, search_library='scikit-learn', search_algorithm='grid')

As we can see, the model improved compared to the original one.

In [None]:
# Save tuned model
save_model(decision_tree_model_tuned, 'S1_best_model_dt_tuned')

In [None]:
# Load tuned model
decision_tree_model_tuned = load_model("C:/Users/Cindy Chen/Desktop/TFG/plots/04/S1/S1_best_model_dt_tuned")

In [None]:
# Hyperparameters before tuning
print(decision_tree_model)

In [None]:
# Hyperparameters after tuning
print(decision_tree_model_tuned)

In this case, the ```criterion``` hyperparameter changed from ```gini``` to ```log_loss```.
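
The two criteria measure node impurity differently: Gini impurity is 1 - sum(p_k^2), while entropy is sum(p_k * log2(1 / p_k)). In scikit-learn's decision trees, `log_loss` and `entropy` refer to the same Shannon-entropy measure. A small sketch with made-up class counts, using hypothetical helper names:

```python
import math


def gini(counts):
    """Gini impurity: 1 - sum(p_k^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k)), skipping empty classes."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


# Made-up class counts in three tree nodes: balanced, skewed and pure.
for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, round(gini(counts), 3), round(entropy(counts), 3))
```

Both measures are zero for a pure node and largest for a balanced one; they differ in how strongly they penalise intermediate mixtures, which can lead to different splits.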

## **Predictions**

Using the ```predict_model``` function, we can evaluate the performance of the model on the test set.

In [None]:
# Predict on the test set
predict_model(decision_tree_model_tuned)

In [None]:
# Get the performance metrics
pull()

## **Analyse Model**  

Plots to analyse the performance of the model on the test set.

### **Confusion Matrix**

In [None]:
# Confusion Matrix
plt.figure(figsize=(25, 25))
plot_model(decision_tree_model_tuned, plot = 'confusion_matrix') # save=True, use_train_data=False
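
For reference, a confusion matrix counts, for every true class, how often each class was predicted; the diagonal holds the correct predictions. A minimal pure-Python sketch on made-up labels (the chapter numbers are illustrative only, and the helper is not the PyCaret implementation):

```python
def confusion_matrix(y_true, y_pred):
    """Return (classes, matrix) with rows = true class, columns = predicted class."""
    classes = sorted(set(y_true) | set(y_pred))
    position = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[position[true_label]][position[pred_label]] += 1
    return classes, matrix


# Made-up true and predicted ICD-10 chapters.
y_true = [1, 1, 9, 9, 9, 17, 17, 17]
y_pred = [1, 9, 9, 9, 17, 17, 17, 1]

classes, matrix = confusion_matrix(y_true, y_pred)
# Accuracy is the share of predictions on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(y_true)
print(classes)
for row in matrix:
    print(row)
print(accuracy)
```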

### **Area Under the Curve (AUC)**

In [None]:
# Area Under the Curve (AUC)
plt.figure(figsize=(25, 25))
plot_model(decision_tree_model_tuned, plot = 'auc') # save=True
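
For a multiclass target, ROC curves are typically drawn one-vs-rest: each class in turn is treated as positive and all others as negative. The AUC of one such curve equals the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes it that way for a single class on made-up probabilities; the helper `binary_auc` is hypothetical.

```python
def binary_auc(is_positive, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up true classes and predicted probabilities for class 17.
y_true = [17, 17, 1, 9, 17, 1]
prob_17 = [0.9, 0.6, 0.7, 0.2, 0.8, 0.1]

# One-vs-rest: class 17 is positive, every other class is negative.
auc_17 = binary_auc([label == 17 for label in y_true], prob_17)
print(round(auc_17, 3))
```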

### **Class Prediction Error**

In [None]:
# Class Prediction Error
plt.figure(figsize=(10, 5))
plot_model(decision_tree_model_tuned, plot = 'error') # save=True

### **Classification Report**

In [None]:
# Classification Report
plt.figure(figsize=(10, 10))
plot_model(decision_tree_model_tuned, plot = 'class_report') # save=True
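
The classification report summarises precision, recall and F1 per class. As a reminder of how these are derived, the sketch below computes them from made-up labels; the helper `per_class_report` is hypothetical and not part of PyCaret.

```python
def per_class_report(y_true, y_pred):
    """Return {class: (precision, recall, f1)} computed from raw label lists."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1)
    return report


# Made-up true and predicted ICD-10 chapters.
y_true = [1, 1, 9, 9, 9, 17, 17, 17]
y_pred = [1, 9, 9, 9, 17, 17, 17, 1]

report = per_class_report(y_true, y_pred)
for label, (precision, recall, f1) in report.items():
    print(label, round(precision, 2), round(recall, 2), round(f1, 2))
```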

### **Feature Importance**

In [None]:
# Feature importance
plt.figure(figsize=(10, 10))
plot_model(decision_tree_model_tuned, plot = 'feature_all') # save=True

In [None]:
# Feature importance (top 10)
plt.figure(figsize=(10, 10))
plot_model(decision_tree_model_tuned, plot = 'feature') # save=True