In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from imblearn.over_sampling import SMOTENC

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, f1_score

from sklearn.preprocessing import LabelEncoder

from sklweka.classifiers import WekaEstimator
from sklweka.dataset import to_nominal_labels

import lime
import lime.lime_tabular

import sklweka.jvm as jvm

SEED = 42
np.random.seed(SEED)

In [None]:
# you may need to run this twice
jvm.start(packages=True)

# Use Case 1: Patient Triage in the Emergency Room

This notebook is dedicated to the analysis of datasets for patient triage in emergency departments.

In this notebook, students are asked to load a synthetic dataset of fictitious patients, analyze it with statistical and/or machine learning methods, and apply pre- or post-processing if they wish.

## Dataset

To illustrate, we’ll use a dataset containing the following variables:

| Variable                  | Description                                                   |
|---------------------------|---------------------------------------------------------------|
| Age                       | Age of the patient (in years)                                 |
| BMI                       | Body Mass Index (BMI) of the patient (in kg/m²)               |
| Gender                    | Gender of the patient (M/F/O/U)                               |
| Race                      | "race" of the patient                                         |
| Chief_Complaint           | Reason for the patient's visit to the emergency department    |
| Chief_Complaint_Severity  | Severity of the patient’s chief complaint                     |
| Stress_Level              | General state of the patient according to the clinician       |
| Dolor_Degree              | Pain level self-reported by the patient (0-4)                 |
| D_Blood_Pressure          | Diastolic blood pressure of the patient (in mm Hg)            |
| S_Blood_Pressure          | Systolic blood pressure of the patient (in mm Hg)             |
| Heart_Rate                | Heart rate of the patient (in beats per minute)               |
| Respiratory_Rate          | Respiratory rate of the patient (in breaths per minute)       |
| Triage_Priority           | The triage priority assigned to the patient (0-10)            |

To use this dataset, we first need to load it from the file **UC-dataset.csv** using the *read_csv* function from the *pandas* library.

In [31]:
df = pd.read_csv("UC-dataset.csv")
df

| Unnamed: 0 | Age | BMI | Gender | Race | Chief_Complaint | Chief_Complaint_Severity | Stress_Level | Dolor_Degree | D_Blood_Pressure | S_Blood_Pressure | Heart_Rate | Respiratory_Rate | Triage_Priority |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 24.4 | M | white | cough | 73 | A little bit | 2.0 | 88.0 | 127.0 | 83.0 | 13.0 | 5 |
| 1 | 94 | 35.3 | F | white | fatigue | 60 | Not at all | 1.0 | 57.0 | 114.0 | 78.0 | 15.0 | 0 |
| 2 | 27 | 24.1 | F | white | swollen tonsils | 69 | Somewhat | 3.0 | 78.0 | 116.0 | 82.0 | 15.0 | 4 |
| 3 | 29 | 24.9 | F | white | shortness of breath | 69 | A little bit | 3.0 | 87.0 | 113.0 | 96.0 | 16.0 | 7 |
| 4 | 81 | 27.5 | M | white | swollen lymph nodes | 46 | A little bit | 1.0 | 85.0 | 146.0 | 99.0 | 12.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17660 | 73 | 29.0 | M | white | sinus pain | 59 | Somewhat | 2.0 | 88.0 | 102.0 | 75.0 | 14.0 | 1 |
| 17661 | 73 | 29.1 | M | white | sinus pain | 59 | Not at all | 0.0 | 91.0 | 111.0 | 94.0 | 14.0 | 0 |
| 17662 | 74 | 29.1 | M | white | sinus pain | 76 | Somewhat | 3.0 | 85.0 | 112.0 | 71.0 | 15.0 | 3 |
| 17663 | 74 | 29.5 | M | white | sinus pain | 76 | Not at all | 1.0 | 88.0 | 109.0 | 68.0 | 14.0 | 1 |


We can also summarize the values of the different variables using the *describe* function.

In [None]:
df.describe()
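
By default, *describe* summarizes only the numerical columns, so it can help to also count missing values and check column types before training. The following is a minimal sketch on a made-up three-row frame; the values, including the missing ones, are illustrative assumptions and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the NaN values are invented for illustration
toy = pd.DataFrame({
    "Age": [19, 94, 27],
    "Dolor_Degree": [2.0, np.nan, 3.0],
    "Heart_Rate": [83.0, 78.0, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Column types: a float column holding whole numbers can be a hint of missing values
print(toy.dtypes)
```

The same two calls can be applied to `df`. Whether the real dataset contains missing values is not stated here, so treat this as a check to run rather than a known result.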

## Preliminary analyses

Now that we have loaded our dataset, we can compute some statistics to analyze it before training.

### Variable distributions

First, let’s look at the distribution of each variable’s values in our dataset.

For example, with the age of patients:

In [None]:
plt.hist(df["Age"])
plt.show()

We can observe that a majority of patients coming to emergency services are between 50 and 80 years old. Patients between 20 and 40 years old are underrepresented.

You can perform additional analyses below.

In [None]:
# your code here

### Correlation between variables

We can also compute a correlation matrix to detect whether some variables are correlated (positively or negatively).

To do so, we use the *corr* function of *pandas* on our dataset, after encoding the values as integer codes with the *factorize* function (the code below applies *factorize* to every column so that non-numerical variables can be included).

This gives a correlation matrix in which each cell holds a correlation coefficient between -1 and 1. A value close to 1 means the two variables are positively correlated, a value close to -1 means they are negatively correlated, and a value close to 0 means they are not linearly correlated.

In [None]:
corr_df = df.apply(lambda x: x.factorize()[0]).corr()
corr_df

We can also plot this correlation matrix as a heatmap, using the *heatmap* function from the *seaborn* package, to observe correlations between variables more easily.

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(corr_df, cmap="viridis")
plt.show()

We can observe that the variables *Chief_Complaint* and *Chief_Complaint_Severity* are slightly positively correlated, as are *D_Blood_Pressure* and *S_Blood_Pressure*, which makes sense.

What else do you observe?

*add your observations here*
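
One caveat about the correlation code in this section: *factorize* replaces each value with an integer code assigned in order of first appearance, and here it is applied to numerical columns too. The sketch below uses a made-up frame (the values and the helper name `factorize_non_numeric` are assumptions for illustration) to show how this can hide the sign of a correlation, together with one possible way to encode only the non-numerical columns.

```python
import pandas as pd

# Hypothetical toy data in which Heart_Rate decreases linearly with Age
toy = pd.DataFrame({
    "Age": [20, 80, 40, 60],
    "Heart_Rate": [95, 65, 85, 75],
    "Gender": ["M", "F", "F", "M"],
})


def factorize_non_numeric(frame):
    """Return a copy where only non-numerical columns are replaced by integer codes."""
    encoded = frame.copy()
    for col in encoded.select_dtypes(exclude="number").columns:
        encoded[col] = encoded[col].factorize()[0]
    return encoded


# Same approach as the notebook cell: every column is factorized
corr_all_factorized = toy.apply(lambda x: x.factorize()[0]).corr()

# Alternative: numerical columns keep their real values
corr_numeric_kept = factorize_non_numeric(toy).corr()

print(corr_all_factorized.loc["Age", "Heart_Rate"])  # close to +1
print(corr_numeric_kept.loc["Age", "Heart_Rate"])    # close to -1
```

Even with this variant, the integer codes of a nominal variable such as *Chief_Complaint* have no natural order, so its coefficients should be read with caution.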

### Additional preliminary analyses

If you want to carry out additional preliminary analyses, feel free to do so here.

In [None]:
# your code here

## Preprocessing

Before training a classifier, we need to split our dataset into a training dataset (90% of the original dataset) and a test dataset (the remaining 10%), using the *train_test_split* function of the *scikit-learn* library.

In [None]:
X = df.drop("Triage_Priority", axis=1)
Y = df["Triage_Priority"]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=SEED)

X_train
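
If the triage classes are imbalanced, a plain random split can leave a rare class with very few test samples. One option, shown here only as a sketch on made-up labels, is the `stratify` parameter of *train_test_split*, which keeps class proportions roughly equal in both parts. The toy arrays below are assumptions for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1
toy_y = np.array([0] * 90 + [1] * 10)
toy_X = np.arange(100).reshape(-1, 1)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=42, stratify=toy_y
)

# Class counts in each part keep the original 9:1 ratio
print(Counter(toy_y_train.tolist()))
print(Counter(toy_y_test.tolist()))
```

On the real data this would correspond to passing `stratify=Y`. Note that stratification requires every class to have at least two samples.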

If you want, you can also apply some preprocessing to the dataset.

For example, you can augment the data by oversampling with the [SMOTENC](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html#smotenc) algorithm, a variant of the [SMOTE](https://www.jair.org/index.php/jair/article/view/10302) algorithm for datasets with mixed numerical and categorical variables, which is available in the [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn) library.

**N.B.:** be careful to resample only the training dataset, so that the test dataset still reflects the original class distribution.

In [None]:
categorical_features=[2, 3, 4, 6]
oversampler = SMOTENC(categorical_features=categorical_features, random_state=SEED)
resampled_X_train, resampled_Y_train = oversampler.fit_resample(X_train, Y_train)

resampled_X_train
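
The positional indices `[2, 3, 4, 6]` depend on the column order of `X_train`. As a sketch, assuming the feature order given in the Dataset table (without the target) and assuming that *Gender*, *Race*, *Chief_Complaint*, and *Stress_Level* are the categorical variables, the same indices can be derived from column names, which is less fragile if a column is added or removed.

```python
import pandas as pd

# Assumed feature order, copied from the Dataset table (target column excluded)
feature_columns = [
    "Age", "BMI", "Gender", "Race", "Chief_Complaint",
    "Chief_Complaint_Severity", "Stress_Level", "Dolor_Degree",
    "D_Blood_Pressure", "S_Blood_Pressure", "Heart_Rate", "Respiratory_Rate",
]
toy_X = pd.DataFrame(columns=feature_columns)

# Assumed list of categorical variables
categorical_columns = ["Gender", "Race", "Chief_Complaint", "Stress_Level"]

# Positional index of each categorical column
categorical_feature_indices = [toy_X.columns.get_loc(name) for name in categorical_columns]
print(categorical_feature_indices)  # [2, 3, 4, 6]
```

If the loaded data frame contains an extra leading column, all positions shift by one, so it is worth comparing these indices with `X_train.columns` before resampling.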

As we can observe above, this method oversamples our dataset by adding synthetic samples for the underrepresented classes.

For example, if we compare classes before and after resampling:

In [None]:
plt.hist(Y_train)
plt.show()

plt.hist(resampled_Y_train)
plt.show()
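
Class counts can complement the histograms with exact numbers. The sketch below uses made-up labels and plain random oversampling (duplicating existing samples), which is a simpler stand-in and not the SMOTENC algorithm; the function name `random_oversample` is hypothetical.

```python
import random
from collections import Counter


def random_oversample(labels, seed=42):
    """Return sample indices in which every class is as frequent as the largest one.

    Minority classes are completed by drawing their own samples with replacement.
    """
    rng = random.Random(seed)
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in indices_by_class.values())
    selected = []
    for indices in indices_by_class.values():
        selected.extend(indices)
        selected.extend(rng.choices(indices, k=target - len(indices)))
    return selected


# Hypothetical imbalanced labels
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
resampled_indices = random_oversample(toy_labels)
resampled_labels = [toy_labels[i] for i in resampled_indices]

print(Counter(toy_labels))        # counts before resampling
print(Counter(resampled_labels))  # counts after resampling
```

On the notebook’s data, `Counter(Y_train)` and `Counter(resampled_Y_train)` would give the same kind of comparison.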

You can also try other methods from [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn/wiki) for under-sampling or over-sampling the dataset, or preprocessing methods from [scikit-learn](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#standardscaler) such as scaling.

In [None]:
# your code here

## Build your classifier

We can now build a classifier by using well-known Machine Learning algorithms.

To do so, we’ll use the [scikit-learn](https://scikit-learn.org/stable/) library and the [sklearn-weka-plugin](https://fracpete.github.io/sklearn-weka-plugin/), which contain different algorithms and methods for machine learning.

We propose three well-known ML algorithms:

* C4.5, a learning algorithm based on decision trees
* Naive Bayes, a learning algorithm based on probability theory and Bayes’ theorem
* Multi-layer Perceptron, a learning algorithm based on artificial neural networks

Just uncomment the one you want to use (don’t forget to justify your choice).

You can also try other [machine learning algorithms for classification](https://weka.sourceforge.io/doc.dev/weka/classifiers/Classifier.html) available via *weka*, or [machine learning algorithms](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) proposed by *scikit-learn*.

Feel free to modify the parameters of these algorithms.

In [None]:
# to use weka algorithms, we first need to transform classes into nominal labels
Y_test = to_nominal_labels(Y_test)
resampled_Y_train = to_nominal_labels(resampled_Y_train)

In [None]:
# C4.5 (decision trees)
clf = WekaEstimator(classname="weka.classifiers.trees.J48", options=["-M", "3"])

# Naive Bayes
#clf = WekaEstimator(classname="weka.classifiers.bayes.NaiveBayes", options=["-K"])

# Multi-layer Perceptron
#clf = WekaEstimator(classname="weka.classifiers.functions.MultilayerPerceptron", options=["-L", "0.1", "-N", "100", "-S", str(SEED)])

Finally, we can train the chosen algorithm, using the *fit* function on the training dataset, to obtain a classifier.

In [None]:
clf = clf.fit(resampled_X_train, resampled_Y_train)

## Test your classifier

Now that we have trained a classifier, we must evaluate its performance.

To do so, we first need to use the *predict* function on the test dataset to obtain the classifier’s predictions on data it has not seen yet.

In [None]:
Y_pred = clf.predict(X_test)
print(Y_pred)

We can then compute some standard machine learning metrics to evaluate the performance of the classifier, where TP, FP, and FN are the numbers of true positives, false positives, and false negatives:

* Precision: $\frac{TP}{TP + FP}$
* Recall: $\frac{TP}{TP + FN}$
* F1-score: $2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

To do so, we’ll use the *precision_score*, *recall_score*, and *f1_score* functions of the *scikit-learn* library.

**N.B.:** because this is a multi-class classification problem, we have to choose how these metrics are averaged across classes, for example with micro or macro averages.

In [None]:
avg = 'micro'
print("Precision:", precision_score(Y_test, Y_pred, average=avg))
print("Recall:", recall_score(Y_test, Y_pred, average=avg))
print("F1-score:", f1_score(Y_test, Y_pred, average=avg))
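
To make the difference between the two averages concrete, here is a small hand-written sketch on made-up labels. The helper names are hypothetical, and this is not how *scikit-learn* implements the metrics; it only applies the precision formula in two ways. Micro averaging pools TP and FP over all classes, whereas macro averaging computes precision per class and then takes the unweighted mean.

```python
def per_class_counts(y_true, y_pred):
    """Return {label: (TP, FP, FN)} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        counts[label] = (tp, fp, fn)
    return counts


def micro_precision(y_true, y_pred):
    """Pool TP and FP over all classes, then apply the precision formula once."""
    counts = list(per_class_counts(y_true, y_pred).values())
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    return tp / (tp + fp)


def macro_precision(y_true, y_pred):
    """Compute precision per class (0 if a class is never predicted), then average."""
    counts = per_class_counts(y_true, y_pred).values()
    per_class = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, _ in counts]
    return sum(per_class) / len(per_class)


# Hypothetical labels: class 0 is frequent, classes 1 and 2 are rare
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 2]

print("micro:", micro_precision(toy_true, toy_pred))  # 0.75
print("macro:", macro_precision(toy_true, toy_pred))
```

When each sample has exactly one label, micro-averaged precision, recall, and F1-score all equal the accuracy (6 correct out of 8 here), so the macro average is often more informative about rare classes.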

Finally, we can compare the classifier’s predictions with the expected classifications by using the *confusion_matrix* function from the *scikit-learn* library.

In [None]:
cm = confusion_matrix(Y_test, Y_pred, normalize="true")
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
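
With `normalize="true"`, each row of the matrix (one row per true class) is divided by its total, so every row sums to 1 and the diagonal holds the per-class recall. The following sketch reproduces that normalization by hand with NumPy on made-up labels; it is an illustration, not the library’s code.

```python
import numpy as np

# Hypothetical labels for a 3-class problem
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 2]
n_classes = 3

# Rows are true classes, columns are predicted classes
counts = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(toy_true, toy_pred):
    counts[t, p] += 1

# Divide each row by the number of samples of that true class
row_normalized = counts / counts.sum(axis=1, keepdims=True)

print(counts)
print(row_normalized)
```

Reading the matrix row by row therefore shows, for each true triage priority, the share of patients assigned to each predicted priority.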

## Use your classifier

Now that you have a working classifier, you can use it to estimate the triage priority of new patients coming to emergency departments.

In [None]:
new_patient = pd.DataFrame([{
    "Age": 52,
    "BMI": 22,
    "Gender": "F",
    "Race": "white",
    "Chief_Complaint": "pain with bright lights",
    "Chief_Complaint_Severity": 150,
    "Stress_Level": "Very much",
    "Dolor_Degree": 4,
    "D_Blood_Pressure": 90,
    "S_Blood_Pressure": 120,
    "Heart_Rate": 90,
    "Respiratory_Rate": 15
}])

In [None]:
clf.predict(new_patient)

## Bonus: Explain your classifier

Finally, it can be useful to extract explanations from your classifier about the reasons for a prediction, so that end users can interpret the model’s results.

Several methods exist for [Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/), such as [Partial Dependence Plot (PDP)](https://hastie.su.domains/ElemStatLearn/) or [Individual Conditional Expectation (ICE)](https://arxiv.org/abs/1309.6392). Both are included in the [*scikit-learn*](https://scikit-learn.org/stable/modules/partial_dependence.html) library.

In this section, we’ll use the Local Interpretable Model-agnostic Explanations (LIME) method proposed by [Ribeiro, Singh, and Guestrin (2016)](https://arxiv.org/abs/1602.04938). This method is designed to work with any machine learning model and to provide explanations that end users without a computer science background can interpret (by giving the main reasons for a prediction).

To do so, we’ll use the [*lime*](https://github.com/marcotcr/lime) library developed by the authors.

However, this library works on NumPy arrays and expects the data in a specific format, so we’ll need to transform the dataset.

First, we need to convert the dataset from a data frame to a NumPy array.

In [None]:
np_resampled_X_train = resampled_X_train.to_numpy(copy=True, dtype=str)
np_resampled_X_train

Then, because *LIME* does not work with strings, we’ll need to encode the classes of the *Triage_Priority* variable with the *LabelEncoder* of the *scikit-learn* library.

In [None]:
labels = resampled_Y_train.copy()
le = LabelEncoder()
le.fit(labels)
labels = le.transform(labels)
class_names = le.classes_ # we keep names of each class for later
print(class_names)

We’ll also need to get the names of each feature to run *LIME*.

In [None]:
feature_names = resampled_X_train.columns.tolist()
print(feature_names)

Because *LIME* cannot handle textual values, we’ll also need to encode the categorical features.

In [None]:
categorical_names = {}
categorical_encoders = {}
for feature in categorical_features:
    categorical_encoders[feature] = LabelEncoder() # we keep the encoder for later
    categorical_encoders[feature].fit(np_resampled_X_train[:, feature]) # setup the encoder
    # we take the opportunity to transform our dataset
    np_resampled_X_train[:, feature] = categorical_encoders[feature].transform(np_resampled_X_train[:, feature])
    categorical_names[feature] = categorical_encoders[feature].classes_ # we keep variables domain for later
print(categorical_names)

Finally, we need to cast all values of the dataset to float to allow *LIME* to work.

In [None]:
np_resampled_X_train = np_resampled_X_train.astype(float)

Now, we have to define a way to encode, in the *LIME* format, new data to explain the result of the model.

In [None]:
def encode_data(data, categorical_features, categorical_encoders):
    np_data = data.to_numpy(copy=True, dtype=str)
    for feature in categorical_features:
        np_data[:, feature] = categorical_encoders[feature].transform(np_data[:, feature])
    return np_data.astype(float)

In [None]:
encoded_new_patient = encode_data(new_patient, categorical_features, categorical_encoders)
encoded_new_patient
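
One thing to watch for when encoding new data: *LabelEncoder* raises a `ValueError` when asked to transform a value it did not see during fitting. The sketch below fits an encoder on three made-up complaints (an assumption for illustration; whether a given complaint appears in the real training data is not checked here) and shows one possible way to detect unknown values beforehand. The helper name `unseen_values` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical encoder fitted on only three complaints
encoder = LabelEncoder().fit(np.array(["cough", "fatigue", "sinus pain"]))


def unseen_values(encoder, values):
    """Return the sorted values that the fitted encoder does not know."""
    return sorted(set(values) - set(encoder.classes_.tolist()))


known = unseen_values(encoder, ["cough"])
unknown = unseen_values(encoder, ["pain with bright lights"])
print(known)
print(unknown)

# Transforming an unknown value fails instead of silently producing a code
try:
    encoder.transform(np.array(["pain with bright lights"]))
except ValueError as error:
    print("LabelEncoder refused the value:", error)
```

Running such a check on each categorical column of a new patient record, using the encoders stored in `categorical_encoders`, gives a clearer error message than the exception raised inside `encode_data`.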

Because our model was trained on the dataset in its original format, we also need a way to decode the data that *LIME* produces before passing it to our model to compute predictions.

In [None]:
def decode_data(np_data, feature_names, categorical_features, categorical_encoders):
    df_data = pd.DataFrame(np_data, columns = feature_names)
    for feature in categorical_features:
        le = categorical_encoders[feature]
        fe = np_data[:, feature]
        tmp = le.inverse_transform(fe.astype(int))
        df_data[feature_names[feature]] = tmp
    return df_data

In [None]:
decode_data(encoded_new_patient, feature_names, categorical_features, categorical_encoders)

We can now convert the data given to *LIME* back into our dataset’s original format and define the function that *LIME* uses to call our model.

In [None]:
predict_fn = lambda x: clf.predict_proba(decode_data(x, feature_names, categorical_features, categorical_encoders)).astype(float)

Finally, we can declare *LIME*’s Explainer for tabular data.

In [None]:
explainer = lime.lime_tabular.LimeTabularExplainer(
    np_resampled_X_train,
    feature_names=feature_names,
    class_names=class_names,
    categorical_features=categorical_features,
    categorical_names=categorical_names,
    kernel_width=3,
)

We can then use it to highlight the main reasons behind the model’s prediction.

In [None]:
exp = explainer.explain_instance(encoded_new_patient[0], predict_fn, num_features=5, top_labels=2)
exp.show_in_notebook(show_table=False, show_all=False)

Now, feel free to modify *LIME*’s parameters or to test other explanation methods for machine learning.

In [None]:
# your code here