# Feature Selection with a Random Forest Classifier

I want to use some patient info as input to a CNN that will classify ECG rhythm types. This is the process I use to find the best features from our given data.
I am using the Kaggle dataset [PTB-XL - Atrial Fibrillation Detection](https://www.kaggle.com/datasets/arjunascagnetto/ptbxl-atrial-fibrillation-detection).

The description of this dataset states that it is intended for detecting three ECG rhythms: Normal, Atrial Fibrillation (AF), and all other arrhythmias. However, the dataset contains multiple rhythm types, and I will attempt to classify all of them.

Rhythm types and support:

|Rhythm|# examples|
|:-----|---------:|
|Atrial Fibrillation| 1514|
|Atrial Flutter| 73|
|Bigeminal Pattern (Unknown Origin, SV or Ventricular)| 82|
|Normal Functioning Artificial Pacemaker| 296|
|Paroxysmal Supraventricular Tachycardia| 24|
|Sinus Arrhythmia| 772|
|Sinus Bradycardia| 637|
|Sinus Rhythm| 2000|
|Sinus Tachycardia| 826|
|Supraventricular Arrhythmia| 157|
|Supraventricular Tachycardia| 27|
|Trigeminal Pattern (Unknown Origin, SV or Ventricular)| 20|

To simplify things a little, I will consider sinus rhythm (SR), sinus bradycardia (SB), and sinus tachycardia (ST) as normal sinus rhythm (NSR). My thought is that with a good beat detector the heart rate can be found separately, and a brady or tachy determination can be made based on user-provided thresholds.
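A minimal sketch of that grouping and the separate rate check. The codes SR, SBRAD, and STACH are the ones that appear in the dataset's `diagnosi` column; the function names and the default thresholds of 60 and 100 bpm are assumptions for illustration, not part of feature_selection.py.

```python
# Rhythm codes that are treated as normal sinus rhythm (NSR).
SINUS_CODES = {"SR", "SBRAD", "STACH"}


def merge_sinus(code: str) -> str:
    """Map SR, SBRAD, and STACH to NSR; leave every other code unchanged."""
    return "NSR" if code in SINUS_CODES else code


def rate_category(heart_rate_bpm: float,
                  brady_below: float = 60.0,
                  tachy_above: float = 100.0) -> str:
    """Label a heart rate using user-provided thresholds (defaults are assumed)."""
    if heart_rate_bpm < brady_below:
        return "brady"
    if heart_rate_bpm > tachy_above:
        return "tachy"
    return "normal"
```

With this split, the classifier only has to recognize NSR, and the brady or tachy label comes from the measured rate.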

## Import the Data
Read the csv from PTB-XL as a pandas dataframe.

In [11]:
%load_ext autoreload
%autoreload 2


In [12]:
import pandas as pd

csv_fn = "coorteeqsrafva.csv"
df = pd.read_csv(csv_fn, sep=";", index_col=0,)

df.head()

Unnamed: 0,diagnosi,ecg_id,ritmi,patient_id,age,sex,height,weight,nurse,site,...,validated_by_human,baseline_drift,static_noise,burst_noise,electrodes_problems,extra_beats,pacemaker,strat_fold,filename_lr,filename_hr
0,STACH,10900,VA,15654.0,54.0,0,,,0.0,0.0,...,False,,,,,,,6,records100/10000/10900_lr,records500/10000/10900_hr
1,AFLT,10900,AF,15654.0,54.0,0,,,0.0,0.0,...,False,,,,,,,6,records100/10000/10900_lr,records500/10000/10900_hr
2,SR,8209,SR,12281.0,55.0,0,,,1.0,2.0,...,True,,,,,,,10,records100/08000/08209_lr,records500/08000/08209_hr
3,STACH,17620,VA,2007.0,29.0,1,164.0,56.0,7.0,1.0,...,True,,,,,,,1,records100/17000/17620_lr,records500/17000/17620_hr
4,SBRAD,12967,VA,8685.0,57.0,0,,,0.0,0.0,...,False,,", I-AVR,",,,,,1,records100/12000/12967_lr,records500/12000/12967_hr


## Format the Data
To use this data in a random forest classifier, it first needs to be formatted. The script, feature_selection.py, contains the function I use to do that. It keeps the following columns:

- diagnosi: Rhythm type
- age: Patient age
- sex: Patient sex
- height: Patient height
- weight: Patient weight
- heart_axis: The direction of the overall electrical activity of the heart
- pacemaker: If the patient has an implanted pacemaker

Most of the formatting consists of filling in missing values and encoding text categories as integers. For example, heart_axis can have values of LAD, AXL, MID, RAD, etc., which are converted to 0, 1, 2, 3, 4, ...
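As a rough illustration of this kind of encoding, the sketch below uses `pd.factorize` on a toy column. The placeholder label for missing values and the order of the resulting codes are assumptions; `format_df` may do this differently.

```python
import pandas as pd

# Toy column; the real heart_axis values come from the PTB-XL csv.
axis = pd.Series(["LAD", "AXL", "MID", None, "LAD", "RAD"], name="heart_axis")

# Fill missing values with a placeholder label (an assumed choice), then
# assign integer codes in order of first appearance.
codes, categories = pd.factorize(axis.fillna("UNKNOWN"))

print(list(codes))
print(list(categories))
```

Keeping `categories` around makes it possible to map the integer codes back to their text labels later.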

Some notable formatting happens with age, height, and weight.

- Age: a missing age is set to the mean age of patients within +/- 2.5 cm of the patient's height and +/- 2 kg of the patient's weight.
    - If both height and weight are missing, the age is set to the mean age of the entire dataset.
- Height: a missing height is set to the mean height of the patient's age group, binned as 0-4 yrs, 5-9 yrs, 10-19 yrs, and >20 yrs.
- Weight: a missing weight is set to the mean weight of the patient's age group, binned in 5-year increments from 0 to 10 yrs and in 10-year increments from 10 yrs onwards.
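The height rule above can be sketched with `pd.cut` and a grouped mean. This is a toy example, not the code in feature_selection.py; the data values are made up, and putting age 20 in the last bin is an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [3.0, 4.0, 7.0, 15.0, 30.0, 54.0, 55.0],
    "height": [95.0, np.nan, 120.0, 160.0, 170.0, np.nan, 180.0],
})

# Age bins from the text: 0-4, 5-9, 10-19, and 20+ years (right edge exclusive).
age_bins = [0, 5, 10, 20, np.inf]
age_group = pd.cut(toy["age"], bins=age_bins, right=False)

# Replace each missing height with the mean height of the same age group.
group_mean = toy.groupby(age_group, observed=True)["height"].transform("mean")
toy["height"] = toy["height"].fillna(group_mean)

print(toy)
```

The same pattern applies to weight by swapping the column name and the bin edges.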

In [13]:
from feature_selection import format_df

df = format_df(df)
df.head()

Unnamed: 0,diagnosi,age,sex,height,weight,heart_axis,pacemaker
0,0,54.0,0,173.459649,84.785714,0,0
1,6,54.0,0,173.459649,84.785714,0,0
2,0,55.0,0,173.459649,84.785714,1,0
3,0,29.0,1,164.0,56.0,2,0
4,0,57.0,0,173.459649,84.785714,0,0


## Train, Test, and Evaluate
A random forest classifier from scikit-learn is used to classify the data by "diagnosi".

- The data is split into train and test datasets. This is done by withholding 20% of each rhythm type for testing.
- Hyperparameter tuning is performed.
- A classifier is trained with the best hyperparameters on the full training dataset.
- The classifier is tested and its results are reported.
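The helpers in feature_selection.py are not shown here, so the following is only a rough equivalent of these four steps written directly against scikit-learn. The synthetic data, the parameter grid, and the 3-fold cross-validation are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 300 samples, 6 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = rng.integers(0, 3, size=300)

# stratify=y withholds 20% of each class for testing.
X_trn, X_tst, y_trn, y_tst = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Small, assumed hyperparameter grid searched with 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_trn, y_trn)

# refit=True (the default) retrains the best model on the full training set.
best_rf = search.best_estimator_
y_pred = best_rf.predict(X_tst)

# zero_division=0 avoids the "ill-defined" warning for classes never predicted.
report = classification_report(y_tst, y_pred, output_dict=True, zero_division=0)
cm = confusion_matrix(y_tst, y_pred)
print(report["accuracy"])
```

Passing `zero_division` explicitly is also the fix that the scikit-learn warning itself suggests for labels with no predicted samples.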

In [14]:
from feature_selection import (
    train_test_split_by_category,
    train,
    test
)

(X_train, y_train), (X_test, y_test) = train_test_split_by_category(df, "diagnosi")
print(X_train.shape)
print(X_test.shape)

rf = train(X_train, y_train)
report, cm = test(rf, X_test, y_test)

(5138, 6)
(1290, 6)



Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





In [15]:
from IPython.display import display

from feature_selection import (
    select_features,
    create_confusion_matrix,
    create_report_table,
    test
)

report, cm = test(rf, X_test, y_test)

fig1 = create_report_table(report)
fig2 = create_confusion_matrix(cm)

display(fig1)
display(fig2)





## Feature Selection

The features are scored with impurity-based feature importances: the higher the score, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature, also known as the Gini importance.
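In scikit-learn these scores are exposed as the `feature_importances_` attribute of a fitted forest. The sketch below shows one way to rank features with it; `rank_features`, the synthetic data, and the column names are hypothetical and are not the `select_features` implementation.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def rank_features(model, X: pd.DataFrame) -> list:
    """Pair each column name with its Gini importance, highest first."""
    pairs = zip(X.columns, model.feature_importances_)
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Synthetic stand-in data with hypothetical column names.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
ranking = rank_features(model, X_demo)
print(ranking)
```

Because the importances are normalized, they sum to 1, so each score can be read as that feature's share of the total impurity reduction.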

### Accuracy
We can see from our accuracy measures that patient demographic information, by itself, isn't enough to predict what type of rhythm an ECG strip is presenting (except for paced rhythm, but that info is usually in the patient info). However, accuracy is not the point of this experiment. I just wanted to see what patient info is most indicative of rhythm types.

In [16]:
features = select_features(rf, X_train)
features

[('age', 0.40534650622274254),
 ('weight', 0.21370480224947813),
 ('pacemaker', 0.14582698726070603),
 ('height', 0.14013217086754282),
 ('heart_axis', 0.08061118885735595),
 ('sex', 0.014378344542174477)]

I'm surprised that sex is not a more important feature. Let's continue this experiment by performing binary classifications for each rhythm type and see if feature importance changes by rhythm.

In [17]:
import numpy as np
from feature_selection import CM_LABELS

orig_trn_lbls = y_train
orig_tst_lbls = y_test

# Labels are 0-indexed, so the number of classes is max + 1.
n_lbls = orig_trn_lbls.max() + 1

for i in range(n_lbls):
    curr_y_trn = pd.Series(np.where(orig_trn_lbls == i, 1, 0))
    curr_y_tst = pd.Series(np.where(orig_tst_lbls == i, 1, 0))

    rf = train(X_train, curr_y_trn)
    report, cm = test(rf, X_test, curr_y_tst)
    features = select_features(rf, X_train)
    print(CM_LABELS[i], report["accuracy"], features)
    

NSR 0.6263565891472869 [('age', 0.4073111570626327), ('weight', 0.2470546280431409), ('height', 0.15697779479950738), ('heart_axis', 0.08926976906706484), ('pacemaker', 0.08454175446532865), ('sex', 0.014844896562325566)]
AFIB 0.7542635658914729 [('age', 0.4737558391343282), ('weight', 0.2109322039154629), ('height', 0.16058781716387296), ('heart_axis', 0.10863000797186372), ('pacemaker', 0.030600830909381253), ('sex', 0.015493300905091014)]
SARRH 0.8697674418604651 [('age', 0.5037351378206264), ('weight', 0.27494626660326626), ('height', 0.12893133732810608), ('heart_axis', 0.07056078894170664), ('sex', 0.014995789766459362), ('pacemaker', 0.006830679539835362)]
PACE 0.9968992248062015 [('pacemaker', 0.8926332719582352), ('heart_axis', 0.03980415158779869), ('height', 0.02513683943591719), ('age', 0.022580873650903128), ('weight', 0.01782374299730337), ('sex', 0.0020211203698424)]






SVARR 0.9751937984496124 [('age', 0.6696414975998143), ('heart_axis', 0.16315576233293125), ('weight', 0.12275713097948528), ('height', 0.030939891145682637), ('pacemaker', 0.0069277705270664475), ('sex', 0.0065779474150202055)]
BIGU 0.986046511627907 [('age', 0.38026545661276506), ('weight', 0.2479253662659412), ('height', 0.22711161330480964), ('heart_axis', 0.11451368316665986), ('sex', 0.024511325061290404), ('pacemaker', 0.005672555588533842)]
AFLT 0.9852713178294573 [('age', 0.38880910932772184), ('weight', 0.22906977004932533), ('height', 0.1827470392144765), ('heart_axis', 0.1687134542369751), ('sex', 0.017805539214813267), ('pacemaker', 0.01285508795668805)]
SVTAC 0.9922480620155039 [('age', 0.35243784974405024), ('height', 0.28637023028296515), ('weight', 0.2745527867709492), ('heart_axis', 0.0454357006081656), ('sex', 0.03105272487354569), ('pacemaker', 0.01015070772032418)]
PSVT 0.9953488372093023 [('age', 0.4124952011655212), ('weight', 0.26454851648086186), ('height', 0.2

## Results
Wow! That works surprisingly well. It's likely that these results are a product of the dataset and would not generalize to a real-world population. Nevertheless, what is clear is that age, height, and weight are important for all rhythm types. I'm still surprised that sex is always one of the least important features.
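One way to put the per-rhythm accuracies in context: in a one-vs-rest setup with rare classes, a model that always predicts the more common class already scores high accuracy. The sketch below computes that baseline from the support table at the top of this notebook. The dictionary keys are shortened table names, and the baseline is computed on the full dataset rather than on the test split, so treat the numbers as approximate.

```python
# Support counts from the "Rhythm types and support" table, with SR, SB,
# and ST merged into NSR.
support = {
    "Normal Sinus Rhythm (SR + SB + ST)": 2000 + 637 + 826,
    "Atrial Fibrillation": 1514,
    "Sinus Arrhythmia": 772,
    "Normal Functioning Artificial Pacemaker": 296,
    "Supraventricular Arrhythmia": 157,
    "Bigeminal Pattern": 82,
    "Atrial Flutter": 73,
    "Supraventricular Tachycardia": 27,
    "Paroxysmal Supraventricular Tachycardia": 24,
    "Trigeminal Pattern": 20,
}
total = sum(support.values())


def majority_baseline(n_positive: int, n_total: int) -> float:
    """Accuracy of always predicting the more common of the two classes."""
    return max(n_positive, n_total - n_positive) / n_total


baselines = {name: majority_baseline(n, total) for name, n in support.items()}
for name, acc in baselines.items():
    print(f"{name}: {acc:.4f}")
```

A binary accuracy close to its baseline says little about whether the features separate that rhythm, which is worth keeping in mind when reading the numbers above.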

In an effort to choose features that would be easily available for most patients, I will be using age, height, weight, pacemaker, and sex as input to the CNN. Even though sex does not seem to be an important feature in this dataset, it is considered by most physicians to be important.