In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from imblearn.over_sampling import SMOTENC

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, f1_score

from sklweka.classifiers import WekaEstimator
from sklweka.dataset import to_nominal_labels

import sklweka.jvm as jvm
jvm.start(packages=True)

SEED = 42

# Use Case 1: Patient Triage in the Emergency Room

This notebook is dedicated to the analysis of datasets for patient triage in emergency departments.

In this notebook, students are asked to load a synthetic dataset of fictitious patients, to analyse it with statistical and/or machine learning methods, and to apply pre- or post-processing if they wish.

## Dataset

To illustrate, we’ll use a dataset containing the following variables:

| Variable                  | Description                                                   |
|---------------------------|---------------------------------------------------------------|
| Age                       | Age of the patient (in years)                                 |
| BMI                       | Body Mass Index (BMI) of the patient (in kg/m²)               |
| Gender                    | Gender of the patient (M/F/O/U)                               |
| Chief_Complaint           | Reason for the patient's visit to the emergency department    |
| Chief_Complaint_Severity  | Severity of the patient’s chief complaint                     |
| Stress_Level              | General state of the patient according to the clinician       |
| Dolor_Degree              | Pain level self-estimated by the patient (0-4)                |
| D_Blood_Pressure          | Diastolic blood pressure of the patient (in mm Hg)            |
| S_Blood_Pressure          | Systolic blood pressure of the patient (in mm Hg)             |
| Heart_Rate                | Heart rate of the patient (in beats per minute)               |
| Respiratory_Rate          | Respiratory rate of the patient (in breaths per minute)       |
| Triage_Priority           | The triage priority assigned to the patient (0-10)            |

To use this dataset, we first need to load it from the file **UC1-dataset.csv** using the *read_csv* function from the *pandas* library.

In [None]:
df = pd.read_csv("UC1-dataset.csv")
df

We can also summarize the values of the different variables using the *describe* function.

In [None]:
df.describe()
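
By default, *describe* only summarizes numerical columns. The sketch below uses a small made-up DataFrame (the values are hypothetical and not taken from the dataset) to show how `include="all"` also covers non-numerical columns such as *Gender*:

```python
import pandas as pd

# Hypothetical toy data, only used to illustrate the call
toy_df = pd.DataFrame({
    "Age": [34, 67, 58, 72],
    "Gender": ["F", "M", "F", "F"],
})

# Default behaviour: numerical columns only
numeric_summary = toy_df.describe()

# All columns: adds count / unique / top / freq for non-numerical ones
full_summary = toy_df.describe(include="all")
print(full_summary)
```

On the real dataset, the equivalent call would be `df.describe(include="all")`.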

## Preliminary analysis

Now that we have loaded our dataset, we can compute some statistics to analyze it before training.

### Variable distributions

First, let’s look at how the values of each variable are distributed in our dataset.

For example, with the age of patients:

In [None]:
plt.hist(df["Age"])
plt.show()

We can observe that the majority of patients coming to the emergency department are between 50 and 80 years old. Patients between 20 and 40 years old are underrepresented.

You can perform additional analyses below.
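
A histogram suits numerical variables. For a categorical variable, counting the values is often easier to read. A minimal sketch, using made-up values for *Gender* (hypothetical data, not the real dataset):

```python
import pandas as pd

# Hypothetical values for a categorical variable
toy_gender = pd.Series(["F", "M", "F", "O", "F", "M"], name="Gender")

# Absolute number of patients per category
counts = toy_gender.value_counts()

# Same information expressed as proportions of the total
proportions = toy_gender.value_counts(normalize=True)

print(counts)
print(proportions)
```

On the real dataset, `df["Gender"].value_counts().plot(kind="bar")` would draw the same counts as a bar chart.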

In [None]:
# your code here

### Correlation between variables

We can also compute a correlation matrix to detect whether some variables are correlated (positively or negatively).

To do so, we use the *corr* function of *pandas* on our dataset (we also use the *factorize* function to encode non-numerical variables as integers).

We then obtain a correlation matrix in which each cell holds a correlation coefficient between -1 and 1. A value close to 1 means the two variables are positively correlated, a value close to -1 means they are negatively correlated, and a value close to 0 means they are not linearly correlated.
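
To make these three cases concrete, here is a small sketch on made-up columns (the column names and values are hypothetical and chosen so that the coefficients are exactly 1, -1, and 0):

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "same_direction": [2, 4, 6, 8, 10],      # increases when x increases
    "opposite_direction": [10, 8, 6, 4, 2],  # decreases when x increases
    "unrelated": [1, -2, 0, 2, -1],          # no linear link with x
})

# Pearson correlation between every pair of columns
toy_corr = toy.corr()
print(toy_corr["x"])
```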

In [None]:
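# factorize only the non-numerical columns; numerical columns keep their original values
encoded_df = df.apply(lambda x: x if pd.api.types.is_numeric_dtype(x) else pd.Series(x.factorize()[0], index=x.index))
corr_df = encoded_df.corr()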
corr_df

We can also plot this correlation matrix as a heatmap using the *heatmap* function from the *seaborn* package to observe correlations between variables more easily.

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(corr_df, cmap="viridis")
plt.show()

We can observe that the variables *Chief_Complaint* and *Chief_Complaint_Severity* are slightly positively correlated, as are *D_Blood_Pressure* and *S_Blood_Pressure*, which makes sense.

What do you also observe?

*add your observations here*

### Additional preliminary analyses

If you want to perform additional preliminary analyses, feel free to do so here.

In [None]:
# your code here

## Preprocessing

Before training a classifier, we need to split our dataset into a training dataset (90% of the original dataset) and a test dataset (the remaining 10%), using the *train_test_split* function of the *scikit-learn* library.

In [None]:
X = df.drop("Triage_Priority", axis=1)
Y = df["Triage_Priority"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=SEED)

X_train
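
When some triage priorities are rare, a purely random split can leave the test dataset with very few examples of them. One possible option, not used in the cell above, is the `stratify` parameter of *train_test_split*, which keeps the class proportions the same in both parts. A minimal sketch on made-up data (the values are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 patients, 75% of class 0 and 25% of class 1
toy_X = pd.DataFrame({"Age": range(20, 60)})
toy_Y = pd.Series([0] * 30 + [1] * 10, name="Triage_Priority")

# stratify=toy_Y asks for the same class proportions in both splits
X_tr, X_te, Y_tr, Y_te = train_test_split(
    toy_X, toy_Y, test_size=0.2, random_state=42, stratify=toy_Y
)

print(Y_tr.value_counts())
print(Y_te.value_counts())
```

Note that stratification requires at least two examples of every class.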

If you want, you can also apply some preprocessing to the dataset.

For example, you can perform data augmentation (oversampling) using the [SMOTENC](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html#smotenc) algorithm, a variant of the [SMOTE](https://www.jair.org/index.php/jair/article/view/10302) algorithm for datasets with mixed numerical and categorical variables, which is available in the [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn) library.

**n.b.:** be careful to resample only your training dataset, so that the test dataset contains only original patients.

In [None]:
# positions of the categorical columns in X_train
categorical_features = [2, 3, 4, 6]
oversampler = SMOTENC(categorical_features=categorical_features, random_state=SEED)
resampled_X_train, resampled_Y_train = oversampler.fit_resample(X_train, Y_train)

resampled_X_train

As we can observe above, this method allows us to oversample our dataset with additional synthetic data for the underrepresented classes.

For example, if we compare classes before and after resampling:

In [None]:
plt.hist(Y_train)
plt.show()

plt.hist(resampled_Y_train)
plt.show()

You can also try other methods from [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn/wiki) or [scikit-learn](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#standardscaler) to under-sample, over-sample, or rescale the dataset.

In [None]:
# your code here

## Build your classifier

We can now build a classifier using well-known machine learning algorithms.

To do so, we’ll use the [scikit-learn](https://scikit-learn.org/stable/) library and the [sklearn-weka-plugin](https://fracpete.github.io/sklearn-weka-plugin/), which provide a range of machine learning algorithms and methods.

We propose three well-known machine learning algorithms below:

* C4.5, a learning algorithm based on decision trees
* Naive Bayes, a learning algorithm based on probability theory and Bayes’ theorem
* Multi-layer Perceptron, a learning algorithm based on artificial neural networks

Just uncomment the one you want to use (don’t forget to justify your choice).

You can also try other [machine learning algorithms for classification](https://weka.sourceforge.io/doc.dev/weka/classifiers/Classifier.html) available via *weka*, or [machine learning algorithms](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) proposed by *scikit-learn*.

Feel free to modify parameters of these algorithms.

In [None]:
# to use weka algorithms, we first need to transform classes into nominal labels
Y_test = to_nominal_labels(Y_test)
resampled_Y_train = to_nominal_labels(resampled_Y_train)

In [None]:
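# C4.5 (J48 in Weka); -M 3: at least 3 instances per leaf
clf = WekaEstimator(classname="weka.classifiers.trees.J48", options=["-M", "3"])

# Naive Bayes; -K: use kernel density estimation for numerical attributes
#clf = WekaEstimator(classname="weka.classifiers.bayes.NaiveBayes", options=["-K"])

# Multi-layer Perceptron; -L: learning rate, -N: number of training epochs, -S: random seed
#clf = WekaEstimator(classname="weka.classifiers.functions.MultilayerPerceptron", options=["-L", "0.1", "-N", "100", "-S", str(SEED)])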

Finally, we can train the chosen algorithm, using the *fit* function on the training dataset, to obtain a classifier.

In [None]:
clf = clf.fit(resampled_X_train, resampled_Y_train)

## Test your classifier

Now that we have trained a classifier, we must test its performance.

To do so, we first use the *predict* function on the test dataset to obtain the classifier’s predictions on data it has not seen yet.

In [None]:
Y_pred = clf.predict(X_test)
print(Y_pred)

We can then compute some metrics commonly used in machine learning to evaluate the performance of a classifier (TP, FP, and FN denote the numbers of true positives, false positives, and false negatives):

* Precision: $\frac{TP}{TP + FP}$
* Recall: $\frac{TP}{TP + FN}$
* F1-score: $2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

To do so, we’ll use the *precision_score*, *recall_score*, and *f1_score* functions of the *scikit-learn* library.

**n.b.:** because this is a multi-class classification problem, we have to choose how these metrics are averaged over the classes, for example with a micro or macro average.
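
The difference between the two averages can be seen on a small made-up example (the labels below are hypothetical). The micro average pools the TP, FP, and FN counts of all classes before computing the metric, so frequent classes weigh more. The macro average computes the metric per class and then takes the unweighted mean, so every class counts equally:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: class 0 is frequent, classes 1 and 2 are rare
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2]

# Per-class precision: 4/5 for class 0, 1/2 for class 1, 1/1 for class 2
per_class_precision = precision_score(y_true, y_pred, average=None)

# Micro: 6 correct predictions out of 8
micro_precision = precision_score(y_true, y_pred, average="micro")

# Macro: unweighted mean of the per-class values
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")

print(per_class_precision, micro_precision, macro_precision, macro_recall)
```

If rare triage priorities matter as much as frequent ones, the macro average is usually the more informative of the two.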

In [None]:
avg = 'micro'
print("Precision:", precision_score(Y_test, Y_pred, average=avg))
print("Recall:", recall_score(Y_test, Y_pred, average=avg))
print("F1-score:", f1_score(Y_test, Y_pred, average=avg))

Finally, we can compare the predictions of the classifier with the expected classes by using the *confusion_matrix* function from the *scikit-learn* library.

In [None]:
cm = confusion_matrix(Y_test, Y_pred, normalize="true")
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
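
To help read the plot, the sketch below shows the effect of `normalize="true"` on made-up labels (hypothetical values). Each row corresponds to an expected class and is divided by the number of examples of that class, so every row sums to 1 and the diagonal holds the recall of each class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical expected and predicted classes
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2]

# Raw counts: rows are expected classes, columns are predicted classes
cm_counts = confusion_matrix(y_true, y_pred)

# Row-normalized version, as used in the cell above
cm_norm = confusion_matrix(y_true, y_pred, normalize="true")

print(cm_counts)
print(np.round(cm_norm, 2))
```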

## Use your classifier

Now that you have a working classifier, you can use it to estimate the triage priority of a new patient arriving at the emergency department.

In [None]:
new_patient = pd.DataFrame([{
    "Age": 52,
    "BMI": 22,
    "Gender": "F",
    "Race": "black",
    "Chief_Complaint": "pain with bright lights",
    "Chief_Complaint_Severity": 150,
    "Stress_Level": "Very much",
    "Dolor_Degree": 4,
    "D_Blood_Pressure": 90,
    "S_Blood_Pressure": 120,
    "Heart_Rate": 90,
    "Respiratory_Rate": 15
}])

In [None]:
clf.predict(new_patient)

## Bonus: Explain your classifier

Finally, it can be interesting to extract explanations from your classifier, both for the reason behind an individual prediction and for its general classification process.

In this section, you are free to use any tool you want.

For example, see the LIME tutorial: https://marcotcr.github.io/lime/tutorials/Tutorial%20-%20continuous%20and%20categorical%20features.html

In [None]:
import lime
from lime import lime_tabular
import numpy as np

In [None]:
explainer = lime_tabular.LimeTabularExplainer(
    training_data=np.array(X_train),
    feature_names=X_train.columns,
    categorical_features=categorical_features,
    class_names=['_0', '_1', '_2', '_3', '_4', '_5', '_6', '_7', '_8', '_9', '_10'],
    mode='classification'
)
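
In the tutorial linked above, categorical features are encoded as integers before being given to the explainer, because LIME perturbs a numerical array. A minimal sketch of such an encoding with scikit-learn's `OrdinalEncoder`, on a made-up DataFrame (the data and the names `toy_X`, `cat_cols`, and `encoded_X` are hypothetical, and this is only one possible approach):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data with one numerical and one categorical column
toy_X = pd.DataFrame({
    "Age": [34, 67, 58],
    "Gender": ["F", "M", "F"],
})
cat_cols = ["Gender"]

# Replace each category by an integer code (stored as a float)
encoder = OrdinalEncoder()
encoded_X = toy_X.copy()
encoded_X[cat_cols] = encoder.fit_transform(toy_X[cat_cols])

# Map each categorical column position to its category names,
# in the shape expected by LIME's categorical_names argument
categorical_names = {
    toy_X.columns.get_loc(col): list(cats)
    for col, cats in zip(cat_cols, encoder.categories_)
}

print(encoded_X)
print(categorical_names)
```

The classifier given to LIME must then accept this encoded representation, or be wrapped in a function that decodes it first.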

In [None]:
# your code here