# Data Mining / Prospecção de Dados

## Sara C. Madeira, 2024/2025

# Project 2 - Classification in Temporal Data using Sequential Pattern Mining

## Logistics 
**_Read Carefully_**

**Students should work in teams of 3 people**. 

Groups with fewer than 3 people might be allowed (with valid justification), but will not receive better grades for this reason. 

The quality of the project will dictate its grade, not the number of people working.

**The project's solution should be uploaded in Moodle before the end of `June, 8th (23:59)`.** 

Students should **upload a `.zip` file** containing a folder with all the files necessary for project evaluation. 
Groups should be registered in [Moodle](https://moodle.ciencias.ulisboa.pt/mod/groupselect/view.php?id=139096) and the `zip` file should be identified as `PDnn.zip` where `nn` is the number of your group.

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc describing the solution and the results. Projects not delivered in this format will not be graded. You can use `PD_202425_P2.ipynb` as template. In your `.zip` folder you should also include an HTML version of your notebook with all the outputs.**

**Decisions should be justified and results should be critically discussed.** 

Remember that **your notebook should be as clear and organized as possible**, that is, **only the relevant code and experiments should be presented, not everything you tried and did not work, or is not relevant** (that can be discussed in the text, if relevant)! Tables and figures can be used together with text to summarize results and conclusions, improving understanding, readability and concision. **More does not mean better! The target is quality not quantity!**

_**Project solutions containing only code and outputs without discussions will achieve a maximum grade of 10 out of 20.**_

## Dataset and Tools

Amyotrophic Lateral Sclerosis (ALS) is a devastating neurodegenerative disease causing rapid degeneration of motor neurons and usually leading to death by respiratory failure. Since there is no cure, the goal of treatment is to improve symptoms and prolong survival. Non-invasive Ventilation (NIV) has been shown to extend life expectancy and improve quality of life, thus it is key to effectively predict if ALS patients will be eligible for NIV in the near future based on disease progression. In this context, Martins et al. (2021) proposed to learn prognostic models using disease progression patterns (https://ieeexplore.ieee.org/document/9426397), and formulated the following prognostic prediction problem (schematized in the Figure below): given a specific ALS patient's static data collected at diagnosis and temporal data from disease follow-up, can we effectively use these clinical evaluations to predict if this patient will require NIV within k days of the last evaluation?

<img src="prognostic_problem.png" alt="Prognostic Prediction" style="width: 500px;"/>

In this project, we will perform a reduced part of the work published by Martins et al. to learn a machine learning model (classifier) able to predict the need for NIV in a time window of 180 days given static and temporal data and sequential patterns as features.

The dataset to be analysed was obtained from the Lisbon ALS database, containing clinical data from ALS patients collected during their follow-up at the hospital. To reduce the preprocessing steps relative to what was done in the paper and what should be done in a real scenario, the following **preprocessed datasets** are already provided:

1. `Dataset_Static_Features.csv` - each row is a patient with a REF id described by a set of features collected at diagnosis time. These features are called static since they are not collected over time. Duplicated patients should be deleted.

<img src="static_example.png" alt="Temporal Data" style="width: 1000px;"/>

2. `Dataset_Temporal_Features.csv` - each patient has a set of rows (snapshots), each corresponding to a visit at the hospital and the values of a set of temporal features collected over time. The rows per patient REF are sorted in chronological order, such that the first row and the last row of a REF correspond, respectively, to the data collected in the first and last visit at the hospital. In the example below, the patient REF=9 has temporal data describing 5 clinical evaluations (time-points).

<img src="temporal_example.png" alt="Temporal Data" style="width: 500px;"/>

3. `Dataset_NIV_Evolution_180.csv` - each patient has a set of rows with the true value of NIV (yes or no) 180 days in the future. These are the class labels to be later used to train the classifier. Patient REF=9 does not evolve to need NIV within 180 days after time-points 1 to 4 but evolves to NIV within 180 days after time-point 5.

<img src="NIV_evolution_example.png" alt="Temporal Data" style="width: 100px;"/>

In this context, the project has **2 main tasks**:
1. Learn disease progression patterns from the temporal data using temporal pattern mining
2. Learn a classifier to predict NIV using temporal patterns as features together with the static features


**In this project you should use [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org), [`SPMF`](https://www.philippe-fournier-viger.com/software.php) for temporal pattern mining, and [`Scikit-learn`](https://scikit-learn.org/stable/) for classification.**


## Team Identification

**GROUP 07**

Students:

* Daniel Carvalho - 64350
* Gabriel Meirinho - 64873
* Rita Silva - 56798

## 1. Learn disease progression patterns from the temporal data using temporal pattern mining

In this first task you should load and preprocess the dataset **`Dataset_Temporal_Features.csv`** in order to compute sequential patterns for each patient. 

For that, you first need to load and preprocess this dataset and then transform the temporal data into a **sequence database**.

You should consider **a minimum of 2 time-points** and a **maximum of 5 time-points** per patient.

The sequential pattern mining algorithm `Fournier08` (https://www.philippe-fournier-viger.com/spmf/ClosedSequentialPatterns_TimeConstraints.php), an extension of `PrefixSpan` able to **deal with time** and compute **closed patterns**, should be used.

In [423]:
import pandas as pd
import os
from spmf import Spmf
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import re

In [424]:
#increase the display width to see the full sequence
pd.set_option('display.max_colwidth', None)

In [425]:
def alsfrs_intervals(val):
    if val < 4:
        return '<4'
    elif 4 <= val < 8:
        return '[4,8['
    elif 8 <= val < 12:
        return '[8,12['
    else:
        return '12'
    
def alsfrs_r(val):
    if val < 12:
        return '<12'
    elif 12 <= val < 24:
        return '[12,24['
    elif 24 <= val < 36:
        return '[24,36['
    else:
        return '>=36'

def create_sequence_db_df(df):
    records = []
    item_mapping = {}
    item_counter = 1

    # Define all possible feature-value pairs
    all_features = {
        'ALSFRSb': ['<4', '[4,8[', '[8,12[', '12'],
        'ALSFRSsUL': ['<4', '[4,8[', '[8,12[', '12'],
        'ALSFRSsT': ['<4', '[4,8[', '[8,12[', '12'],
        'ALSFRSsLL': ['<4', '[4,8[', '[8,12[', '12'],
        'R': ['<4', '[4,8[', '[8,12[', '12'],
        'ALSFRS-R': ['<12', '[12,24[', '[24,36[', '>=36']
    }

    # Pre-populate item_mapping with all possible items
    for feature, values in all_features.items():
        for value in values:
            key = f"{feature}={value}"
            if key not in item_mapping:
                item_mapping[key] = item_counter
                item_counter += 1

    for ref, group in df.groupby('REF'):
        parts = []       # for SPMF encoding
        parts_raw = []   # for human‐readable
        for t, (_, row) in enumerate(group.iterrows()):
            # build encoded item list
            items = []
            for col in df.columns:
                if col == 'REF':
                    continue
                value = str(row[col])
                key = f"{col}={value}"
                items.append(str(item_mapping[key]))
            parts.append(f"<{t}> " + " ".join(items) + " -1")

            # build raw tuple
            raw_items = " ".join(f"{col} = {row[col]}"
                                 for col in df.columns if col != 'REF')
            parts_raw.append(f"({t}, {raw_items} )")

        sequence_str = " ".join(parts) + " -2"
        sequence_raw = " ".join(parts_raw)
        records.append({
            'REF': ref,
            'Sequence': sequence_str,
            'Sequence_raw': sequence_raw
        })

    # build mapping header
    mapping_str = "@CONVERTED_FROM_TEXT\n"
    for key, number in item_mapping.items():
        mapping_str += f"@ITEM={number}={key}\n"

    return pd.DataFrame(records), mapping_str

def save_sequences_and_map(sequences_df, mapping_str, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(mapping_str)
        for seq in sequences_df['Sequence']:
            f.write(seq.strip() + '\n')

def filter_outputs_by_len(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as fin, \
     open(output_file, 'w', encoding='utf-8') as fout:
        for line in fin:
            # collect all time‐point tags like <0>, <1>
            tps = re.findall(r'<\s*(\d+)\s*>', line)
            # if exactly one time‐point and it's "0", skip this line
            if len(tps) == 1 and tps[0] == '0':
                continue
            fout.write(line)
    print(f"Filtered file written to {output_file}")

def is_subsequence(pat, seq):
    i = 0
    for pset in pat:
        while i < len(seq) and not pset.issubset(seq[i]):
            i += 1
        if i == len(seq): return False
        i += 1
    return True

def parse_sequence(raw):
    out = []
    # find every "(t, …)" chunk
    for content in re.findall(r'\(\s*\d+,\s*(.*?)\s*\)', raw):
        toks = content.split()
        s = set()
        # every three tokens are feat, '=', val
        for i in range(0, len(toks), 3):
            feat, _, val = toks[i : i+3]
            s.add(f"{feat}={val}")
        out.append(s)
    return out

def extract_pattern_matches(filtered_file: str,
                            sequences: pd.DataFrame) -> pd.DataFrame:
    records = []
    with open(filtered_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line.startswith('<'):
                continue

            pat_str, sup_str = line.split('#SUP:')
            enc_pattern = pat_str.strip()
            support     = int(sup_str.strip())

            # decode for human readability
            sets = re.findall(r'<(\d+)>\s+(.*?)\s+-1', enc_pattern)
            decoded = " ".join(
                f"({t}, " +
                  ", ".join(f"{feat} = {val}" for feat,val in 
                            (item.split('=',1) for item in items.split())) +
                " )"
                for t,items in sets
            )

            # build set‐of‐sets for subsequence test
            pat_sets = [ set(s.split()) for _,s in sets ]

            hits = [
                row.REF
                for _, row in sequences.iterrows()
                if is_subsequence(pat_sets,
                                  parse_sequence(row.Sequence_raw))
            ]

            records.append({
                'pattern':         enc_pattern,
                'decoded_pattern': decoded,
                'support':         support,
                'REF':             hits
            })

    df = pd.DataFrame(records)
    df = df.sort_values(by='support', ascending=False).reset_index(drop=True)
    total = len(sequences)
    df['support_pct'] = (df['support'] / total * 100).round(2)
    return df



### 1.1. Load and Preprocess Dataset

- Remove all patients (REFs) with missing values in the temporal features.
- Remove all patients with less than 2 time-points.
- For the patients with more than 5 time-points, keep only the first 5.
- In order to reduce the number of patterns to be generated, aggregate the values of the temporal features using the following intervals (4 intervals for each feature as in Martins et al.): 
  1) Intervals for `ALSFRSb, ALSFRSsUL, ALSFRSsT, ALSFRSsLL, and R: <4 [4,8[ [8,12[ 12`;
  2) Intervals for `ALSFRS-R: <12  [12,24[  [24,36[  >=36`

In [426]:
temporal_df = pd.read_csv('Dataset_Temporal_Features.csv', delimiter=';')

In [427]:
temporal_df.head(3)

Unnamed: 0,REF,ALSFRSb,ALSFRSsUL,ALSFRSsT,ALSFRSsLL,R,ALSFRS-R
0,2,12.0,8.0,6.0,4.0,12.0,42.0
1,2,12.0,8.0,6.0,4.0,12.0,42.0
2,2,12.0,8.0,6.0,3.0,12.0,41.0


In [428]:
print(f"Number of nulls per column\n{temporal_df.isnull().sum()}")

Number of nulls per column
REF            0
ALSFRSb      318
ALSFRSsUL    329
ALSFRSsT     329
ALSFRSsLL    329
R            366
ALSFRS-R     381
dtype: int64


#### Remove all patients with missing values in the temporal features

In [429]:
temporal_df = temporal_df.dropna()
print(f"\nNumber of nulls per column\n{temporal_df.isnull().sum()}")


Number of nulls per column
REF          0
ALSFRSb      0
ALSFRSsUL    0
ALSFRSsT     0
ALSFRSsLL    0
R            0
ALSFRS-R     0
dtype: int64


In [430]:
ref_counts = temporal_df['REF'].value_counts()
print(f"\nReference counts:\n{ref_counts}")


Reference counts:
REF
723     26
269     25
608     24
395     24
981     23
        ..
1162     1
1453     1
110      1
519      1
521      1
Name: count, Length: 957, dtype: int64


#### Remove all patients with less than 2 time-points

In [431]:
temporal_df = temporal_df.groupby('REF').filter(lambda x: len(x) >= 2)

#### For the patients with more than 5 time-points, keep only the first 5

In [432]:
temporal_df = temporal_df.groupby('REF').head(5)
print(f"\nNumber of visits per REF after filtering:\n{temporal_df['REF'].value_counts()}")


Number of visits per REF after filtering:
REF
571     5
920     5
922     5
924     5
1539    5
       ..
1027    2
1034    2
1039    2
1043    2
1878    2
Name: count, Length: 691, dtype: int64


In [433]:
#print the unique values in all columns except 'REF'
for col in temporal_df.columns:
    if col != 'REF':
        unique_values = temporal_df[col].unique()
        print(f"\nUnique values in column '{col}':\n{unique_values}")


Unique values in column 'ALSFRSb':
[12. 10.  9. 11.  7.  8.  6.  4.  5.  2.  0.  1.  3.]

Unique values in column 'ALSFRSsUL':
[8. 5. 6. 3. 4. 2. 7. 1. 0.]

Unique values in column 'ALSFRSsT':
[6. 4. 2. 8. 7. 3. 5. 1. 0.]

Unique values in column 'ALSFRSsLL':
[4. 3. 8. 5. 1. 0. 6. 7. 2.]

Unique values in column 'R':
[12. 11. 10.  9.  8.]

Unique values in column 'ALSFRS-R':
[42. 41. 36. 37. 38. 32. 46. 45. 44. 31. 27. 39. 35. 30. 33. 47. 43. 40.
 29. 28. 34. 25. 23. 26. 19. 24. 48. 16. 22. 21. 20. 18. 15. 17. 13. 14.
 11. 12.]


#### Intervals for the features 'ALSFRSb', 'ALSFRSsUL', 'ALSFRSsT', 'ALSFRSsLL' and 'R'

In [434]:
cols = ['ALSFRSb', 'ALSFRSsUL', 'ALSFRSsT', 'ALSFRSsLL', 'R']
for col in cols:
    temporal_df[col] = temporal_df[col].apply(alsfrs_intervals)

In [435]:
temporal_df.head(3)

Unnamed: 0,REF,ALSFRSb,ALSFRSsUL,ALSFRSsT,ALSFRSsLL,R,ALSFRS-R
0,2,12,"[8,12[","[4,8[","[4,8[",12,42.0
1,2,12,"[8,12[","[4,8[","[4,8[",12,42.0
2,2,12,"[8,12[","[4,8[",<4,12,41.0


#### Intervals for feature 'ALSFRS-R'

In [436]:
temporal_df['ALSFRS-R'] = temporal_df['ALSFRS-R'].apply(alsfrs_r)

In [437]:
temporal_df.head(3)

Unnamed: 0,REF,ALSFRSb,ALSFRSsUL,ALSFRSsT,ALSFRSsLL,R,ALSFRS-R
0,2,12,"[8,12[","[4,8[","[4,8[",12,>=36
1,2,12,"[8,12[","[4,8[","[4,8[",12,>=36
2,2,12,"[8,12[","[4,8[",<4,12,>=36


In [438]:
#print the unique values in all columns except 'REF'
for col in temporal_df.columns:
    if col != 'REF':
        unique_values = temporal_df[col].unique()
        print(f"\nUnique values in column '{col}':\n{unique_values}")


Unique values in column 'ALSFRSb':
['12' '[8,12[' '[4,8[' '<4']

Unique values in column 'ALSFRSsUL':
['[8,12[' '[4,8[' '<4']

Unique values in column 'ALSFRSsT':
['[4,8[' '<4' '[8,12[']

Unique values in column 'ALSFRSsLL':
['[4,8[' '<4' '[8,12[']

Unique values in column 'R':
['12' '[8,12[']

Unique values in column 'ALSFRS-R':
['>=36' '[24,36[' '[12,24[' '<12']


This code implements a comprehensive data preprocessing pipeline for a temporal medical dataset, specifically focused on ALS (Amyotrophic Lateral Sclerosis) functional rating scale data. The preprocessing follows a systematic approach to clean and transform the data for pattern analysis.

**Data Loading and Initial Exploration.** The pipeline begins by loading a CSV file containing temporal features using a semicolon delimiter, which is common in European datasets. The code immediately explores the data structure by displaying the first few rows and checking for missing values across all columns. This initial inspection is crucial for understanding data quality issues before proceeding with cleaning steps.

**Sequential Data Cleaning Process.** The cleaning steps are applied in a fixed order. First, all rows with missing values are removed using `dropna()`, so only complete snapshots remain; note that this drops individual snapshots, not every row of a patient that has a missing value. The code then filters patients by their number of time-points. Patients with fewer than 2 observations are removed, since temporal analysis requires multiple data points, while those with more than 5 time-points are truncated to their first 5 observations.

**Grouping Operations.** Two pandas group operations do the filtering. `groupby('REF').filter()` applies a condition to entire groups, keeping only the patients that meet the minimum time-point requirement. `groupby('REF').head(5)` limits each patient to the first 5 rows, which are the first 5 chronological observations because the rows of each REF are already sorted by visit.
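The two group operations can be illustrated on a toy table. This is a minimal sketch, assuming a hypothetical `visit` column and a cap of 2 snapshots instead of the 5 used in the notebook, so that the effect is visible on six rows.

```python
import pandas as pd

# Toy data: patient 1 has 1 snapshot, patient 2 has 2, patient 3 has 3.
toy = pd.DataFrame({"REF": [1, 2, 2, 3, 3, 3],
                    "visit": [0, 0, 1, 0, 1, 2]})

# Keep only patients with at least 2 snapshots (patient 1 is removed).
enough = toy.groupby("REF").filter(lambda g: len(g) >= 2)

# Keep at most the first 2 snapshots per patient (the notebook uses 5).
capped = enough.groupby("REF").head(2)

print(capped)
```

Both operations preserve the original row order, which is why the chronological order of the snapshots survives the filtering.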

**Feature Discretization Strategy.** The final preprocessing step transforms the numerical scores into categorical intervals, a technique called discretization or binning. This reduces the number of distinct items, and therefore the number of patterns generated. Two interval schemes are applied: the five sub-scores `ALSFRSb`, `ALSFRSsUL`, `ALSFRSsT`, `ALSFRSsLL` and `R` use `alsfrs_intervals` (`<4`, `[4,8[`, `[8,12[`, `12`), while the total `ALSFRS-R` score uses `alsfrs_r` (`<12`, `[12,24[`, `[24,36[`, `>=36`). These are the intervals given in the task statement, following Martins et al.
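As a cross-check on the hand-written binning functions, the same four sub-score intervals can be sketched with `pandas.cut`. This is only an illustration on toy values, not the code used in the notebook; the open-ended edges are an assumption chosen so that every value of 12 or more falls in the `12` bin, as `alsfrs_intervals` does.

```python
import pandas as pd

scores = pd.Series([0.0, 3.0, 4.0, 7.0, 8.0, 11.0, 12.0])

# Left-closed, right-open bins: (-inf,4) [4,8) [8,12) [12,+inf)
edges = [float("-inf"), 4, 8, 12, float("inf")]
labels = ["<4", "[4,8[", "[8,12[", "12"]

binned = pd.cut(scores, bins=edges, right=False, labels=labels).astype(str)
print(binned.tolist())
```

Using `right=False` matters here: with the default right-closed bins, a score of 4 would land in the `<4` interval.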

**Potential Gotchas and Considerations.** The functions `alsfrs_intervals` and `alsfrs_r` are defined in the helper cell at the start of Section 1, so that cell must be run before the discretization cells. Additionally, `dropna()` removes an entire row when any column has missing data, which could lead to significant data loss if missingness is high. The truncation to 5 time-points per patient also discards longitudinal information for patients with longer follow-up periods.
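The task statement asks to remove patients (REFs) with missing values, whereas `dropna()` removes individual snapshots. The sketch below contrasts the two readings on toy data. It is an illustration only: switching the notebook to the patient-wise variant would change the patient counts reported here.

```python
import pandas as pd

# Toy data: patient 1 has one incomplete snapshot, patient 2 is complete.
toy = pd.DataFrame({
    "REF": [1, 1, 2, 2],
    "ALSFRSb": [12.0, None, 10.0, 9.0],
})

# Row-wise removal (what dropna() does): patient 1 keeps one snapshot.
rows_dropped = toy.dropna()

# Patient-wise removal: drop every REF that has at least one missing value.
incomplete_refs = toy.loc[toy.isnull().any(axis=1), "REF"].unique()
patients_dropped = toy[~toy["REF"].isin(incomplete_refs)]

print(rows_dropped["REF"].tolist(), patients_dropped["REF"].tolist())
```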

### 1.2. Compute the Sequence Database

Note that items have now the form `Feature=value` and you should have a sequence database with as many sequences as patients. 

Each sequence encodes the several time-points (maximum 5) of each patient.

See the `Fournier08` example (https://www.philippe-fournier-viger.com/spmf/ClosedSequentialPatterns_TimeConstraints.php) to understand the format received by the algorithm, especially the time information. 

Remember also the end of the example with PrefixSpan (https://www.philippe-fournier-viger.com/spmf/PrefixSpan.php) to understand how to use strings instead of integers to encode items.

In [439]:
sequences, map = create_sequence_db_df(temporal_df)

In [440]:
#print the number of patients
print(len(sequences['REF'].unique()))

691


In [441]:
sequences.head()

Unnamed: 0,REF,Sequence,Sequence_raw
0,2,<0> 4 7 10 14 20 24 -1 <1> 4 7 10 14 20 24 -1 <2> 4 7 10 13 20 24 -1 <3> 4 6 10 13 20 24 -1 -2,"(0, ALSFRSb = 12 ALSFRSsUL = [8,12[ ALSFRSsT = [4,8[ ALSFRSsLL = [4,8[ R = 12 ALSFRS-R = >=36 ) (1, ALSFRSb = 12 ALSFRSsUL = [8,12[ ALSFRSsT = [4,8[ ALSFRSsLL = [4,8[ R = 12 ALSFRS-R = >=36 ) (2, ALSFRSb = 12 ALSFRSsUL = [8,12[ ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = 12 ALSFRS-R = >=36 ) (3, ALSFRSb = 12 ALSFRSsUL = [4,8[ ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = 12 ALSFRS-R = >=36 )"
1,8,<0> 4 6 10 15 20 24 -1 <1> 4 5 9 15 20 24 -1 -2,"(0, ALSFRSb = 12 ALSFRSsUL = [4,8[ ALSFRSsT = [4,8[ ALSFRSsLL = [8,12[ R = 12 ALSFRS-R = >=36 ) (1, ALSFRSb = 12 ALSFRSsUL = <4 ALSFRSsT = <4 ALSFRSsLL = [8,12[ R = 12 ALSFRS-R = >=36 )"
2,9,<0> 4 6 10 13 20 24 -1 <1> 4 5 10 14 20 24 -1 <2> 4 6 9 13 19 23 -1 <3> 4 5 10 13 20 24 -1 <4> 4 5 10 13 19 23 -1 -2,"(0, ALSFRSb = 12 ALSFRSsUL = [4,8[ ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = 12 ALSFRS-R = >=36 ) (1, ALSFRSb = 12 ALSFRSsUL = <4 ALSFRSsT = [4,8[ ALSFRSsLL = [4,8[ R = 12 ALSFRS-R = >=36 ) (2, ALSFRSb = 12 ALSFRSsUL = [4,8[ ALSFRSsT = <4 ALSFRSsLL = <4 R = [8,12[ ALSFRS-R = [24,36[ ) (3, ALSFRSb = 12 ALSFRSsUL = <4 ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = 12 ALSFRS-R = >=36 ) (4, ALSFRSb = 12 ALSFRSsUL = <4 ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = [8,12[ ALSFRS-R = [24,36[ )"
3,10,<0> 3 7 11 15 20 24 -1 <1> 3 7 10 15 20 24 -1 <2> 3 7 11 15 19 24 -1 -2,"(0, ALSFRSb = [8,12[ ALSFRSsUL = [8,12[ ALSFRSsT = [8,12[ ALSFRSsLL = [8,12[ R = 12 ALSFRS-R = >=36 ) (1, ALSFRSb = [8,12[ ALSFRSsUL = [8,12[ ALSFRSsT = [4,8[ ALSFRSsLL = [8,12[ R = 12 ALSFRS-R = >=36 ) (2, ALSFRSb = [8,12[ ALSFRSsUL = [8,12[ ALSFRSsT = [8,12[ ALSFRSsLL = [8,12[ R = [8,12[ ALSFRS-R = >=36 )"
4,14,<0> 3 6 10 13 19 23 -1 <1> 3 5 9 13 19 23 -1 -2,"(0, ALSFRSb = [8,12[ ALSFRSsUL = [4,8[ ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = [8,12[ ALSFRS-R = [24,36[ ) (1, ALSFRSb = [8,12[ ALSFRSsUL = <4 ALSFRSsT = <4 ALSFRSsLL = <4 R = [8,12[ ALSFRS-R = [24,36[ )"


In [442]:
save_sequences_and_map(sequences, map, 'sequences.txt')



This code section transforms the preprocessed temporal data into a specialized sequence database format required for sequential pattern mining algorithms, particularly those in the SPMF (Sequential Pattern Mining Framework) library.

**Sequence Database Transformation.** The core operation `create_sequence_db_df(temporal_df)` converts each patient's temporal data into a structured sequence format. Each patient becomes a single sequence in the database, where their multiple time-points are encoded as ordered events within that sequence. This transformation is crucial for sequential pattern mining, which aims to discover common temporal patterns across patients. The function returns both the sequences DataFrame and a mapping string that translates between human-readable feature-value pairs and numeric identifiers.

**SPMF Format Compliance.** The sequence encoding follows the specific format requirements of SPMF algorithms. Each sequence uses time-stamped itemsets with special delimiters: `<t>` indicates the time-point, `-1` marks the end of an itemset (all features at one time-point), and `-2` marks the end of a complete sequence (one patient's entire timeline). For example, a patient with two visits might be encoded as `<0> 1 5 9 -1 <1> 2 6 10 -1 -2`, where the numbers represent encoded feature-value combinations. This format allows algorithms to understand both the temporal ordering and the co-occurrence of features at each time-point.
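The encoding described above can be sketched in a few lines. This is a simplified illustration, not the notebook's `create_sequence_db_df`; the `item_codes` dictionary and the `encode_patient` helper are hypothetical and only cover three items.

```python
# Hypothetical item codes; the notebook builds its own, complete mapping.
item_codes = {"R=12": 20, "R=[8,12[": 19, "ALSFRS-R=>=36": 24}

def encode_patient(visits, codes):
    """Encode a list of visits (each a list of 'Feature=value' items) as one SPMF line."""
    parts = []
    for t, items in enumerate(visits):
        ids = " ".join(str(codes[item]) for item in items)
        parts.append(f"<{t}> {ids} -1")   # <t> opens the itemset, -1 closes it
    return " ".join(parts) + " -2"        # -2 closes the whole sequence

line = encode_patient(
    [["R=12", "ALSFRS-R=>=36"], ["R=[8,12[", "ALSFRS-R=>=36"]],
    item_codes,
)
print(line)
```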

**Dual Representation Strategy.** The implementation keeps both a machine-readable and a human-readable version of each sequence. The `Sequence` column contains the numeric encoding required by the mining algorithm, while `Sequence_raw` provides a human-interpretable version like `(0, ALSFRSb = [4,8[ ALSFRSsUL = <4 ...)`. This facilitates debugging and result interpretation while ensuring algorithm compatibility.

**Data Validation and Export.** The code includes validation steps to verify the transformation was successful. It prints the number of unique patients to confirm that each patient generated exactly one sequence, and displays the first few sequences to inspect the format. Finally, `save_sequences_and_map()` exports both the sequences and the item mapping to a text file, creating a complete package that external sequential pattern mining tools can process.

**Key Considerations.** The pre-population of all possible feature-value combinations in the item mapping ensures consistent encoding across the entire dataset, even if some combinations don't appear in the actual data. This approach prevents issues where different subsets of data might generate different numeric codes for the same feature-value pair. However, the reliance on the SPMF format makes this code highly specialized for sequential pattern mining and not easily adaptable to other types of temporal analysis.

### 1.3. Compute Sequential Patterns

Use `Fournier08` to compute the closed sequential patterns. Trivial patterns of length 1 should be discarded.

Note that later you need to know what are the sequences (patients) where the patterns occur (the algorithm can output that info).

In [443]:
# Parameters: min_support, min and max time interval between consecutive itemsets (the jumps between time-points), min and max time interval between the first and last itemsets of a pattern
os.system("java -jar spmf.jar run Fournier08-Closed+time sequences.txt output.txt 25% 0 5 0 5")

0

In [444]:
filter_outputs_by_len(input_file='output.txt', output_file='filtered_output.txt')

Filtered file written to filtered_output.txt


In [445]:
df_matches = extract_pattern_matches('filtered_output.txt', sequences)

In [446]:
df_matches.head(3)

Unnamed: 0,pattern,decoded_pattern,support,REF,support_pct
0,<0> R=12 -1 <1> R=12 -1,"(0, R = 12 ) (1, R = 12 )",524,"[2, 8, 9, 10, 17, 18, 20, 21, 24, 30, 34, 35, 36, 39, 40, 42, 43, 45, 46, 49, 50, 54, 55, 56, 64, 66, 67, 72, 74, 78, 79, 82, 85, 88, 91, 93, 94, 100, 103, 104, 105, 111, 113, 115, 119, 122, 125, 126, 133, 136, 137, 141, 144, 145, 151, 153, 156, 161, 162, 164, 165, 166, 167, 169, 171, 173, 174, 176, 177, 178, 179, 180, 185, 196, 197, 200, 201, 202, 205, 207, 210, 211, 212, 213, 214, 215, 216, 219, 220, 227, 236, 238, 241, 242, 247, 250, 253, 256, 259, 261, ...]",75.83
1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,"(0, ALSFRS-R = >=36 ) (1, ALSFRS-R = >=36 )",511,"[2, 8, 9, 10, 17, 21, 24, 26, 30, 34, 35, 39, 40, 42, 45, 46, 49, 50, 54, 55, 56, 60, 61, 63, 64, 66, 72, 74, 78, 79, 82, 85, 88, 91, 93, 94, 99, 100, 103, 104, 113, 115, 119, 122, 126, 134, 136, 137, 141, 151, 155, 156, 160, 161, 162, 164, 166, 167, 169, 171, 173, 174, 176, 177, 179, 185, 196, 197, 200, 202, 203, 205, 207, 208, 210, 211, 212, 213, 214, 215, 217, 220, 227, 236, 238, 241, 242, 243, 244, 247, 250, 253, 256, 259, 261, 262, 263, 265, 267, 269, ...]",73.95
2,<0> ALSFRS-R=>=36 -1 <1> R=12 -1,"(0, ALSFRS-R = >=36 ) (1, R = 12 )",502,"[2, 8, 9, 10, 17, 21, 24, 30, 34, 35, 39, 40, 42, 45, 46, 49, 50, 54, 55, 56, 64, 66, 67, 72, 74, 78, 79, 82, 85, 88, 91, 93, 94, 100, 103, 104, 105, 113, 115, 119, 122, 126, 134, 136, 137, 141, 144, 145, 151, 153, 156, 161, 162, 164, 165, 166, 167, 169, 171, 173, 174, 176, 177, 178, 179, 180, 185, 196, 197, 200, 202, 205, 207, 210, 211, 212, 213, 214, 215, 220, 227, 236, 238, 241, 242, 247, 250, 253, 256, 259, 261, 262, 263, 265, 267, 269, 271, 273, 277, 278, ...]",72.65


In [447]:
# Explode the REF column to have one row per REF-pattern combination
df_exploded = df_matches[['pattern', 'REF']].explode('REF')

# Create binary matrix using crosstab
binary_df = pd.crosstab(df_exploded['REF'], df_exploded['pattern'])

# Convert to integer type (crosstab returns counts, but since each REF-pattern pair appears once, it's already binary)
binary_df = binary_df.astype(int)

# Reset index to make REF a column instead of index
binary_df = binary_df.reset_index()

print(f"Binary matrix shape: {binary_df.shape}")
print(f"Number of patients: {len(binary_df)}")
print(f"Number of patterns: {len(binary_df.columns) - 1}")

binary_df.head()

Binary matrix shape: (691, 5163)
Number of patients: 691
Number of patterns: 5162


pattern,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 R=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ -1",...,<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSb=12 R=12 -1,"<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ ALSFRS-R=>=36 -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ R=12 -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ R=12 ALSFRS-R=>=36 -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsUL=[4,8[ -1",<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 -1,<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 -1 <4> R=12 -1,<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 ALSFRS-R=>=36 -1,<0> R=12 ALSFRS-R=>=36 -1 <4> R=12 -1
0,2,1,1,1,0,0,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,8,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,1,1
2,9,1,1,0,0,0,1,0,0,1,...,1,1,1,1,1,1,1,1,1,1
3,10,1,1,0,0,0,0,0,0,0,...,0,1,1,1,1,0,1,0,1,1
4,14,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0




This code section executes sequential pattern mining using the SPMF framework and transforms the results into a patient-pattern binary matrix suitable for further analysis or machine learning applications.

**Sequential Pattern Mining Execution.** The mining step runs the Fournier08 algorithm by calling the SPMF jar through `os.system`. The parameters are: a minimum support of 25%, so a pattern must occur in at least a quarter of the patients; a minimum and maximum time interval (0 and 5) between consecutive itemsets of a pattern; and a minimum and maximum time interval (0 and 5) between the first and last itemsets of a pattern. Since each patient has at most 5 time-points, tagged 0 to 4, these time bounds are not restrictive here. The "Closed+time" variant returns closed sequential patterns, that is, patterns that have no super-pattern with the same support, which reduces redundancy in the results.
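`os.system` only returns the exit code, so a failed Java call can go unnoticed. A possible alternative, sketched below, builds the same command as an argument list and runs it with `subprocess.run(..., check=True)`, which raises an exception on a non-zero exit code. The helper `build_fournier08_command` and its parameter names are hypothetical labels introduced for this sketch, not part of SPMF.

```python
import subprocess

def build_fournier08_command(input_path, output_path, minsup="25%",
                             min_gap=0, max_gap=5, min_span=0, max_span=5,
                             jar="spmf.jar"):
    """Return the argument list for the SPMF command line used in this notebook."""
    return ["java", "-jar", jar, "run", "Fournier08-Closed+time",
            input_path, output_path, minsup,
            str(min_gap), str(max_gap), str(min_span), str(max_span)]

cmd = build_fournier08_command("sequences.txt", "output.txt")
print(" ".join(cmd))

# To actually run it (requires Java and spmf.jar in the working directory):
# subprocess.run(cmd, check=True)
```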

**Pattern Filtering and Validation.** The `filter_outputs_by_len()` function removes trivial single-timepoint patterns that occur only at time `<0>`. This filtering is crucial because patterns appearing at only the first visit don't represent temporal progression and lack predictive value for understanding disease evolution. The filtering uses regex pattern matching to identify and exclude these trivial cases while preserving meaningful multi-timepoint patterns.
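The filtering rule can be checked in isolation on two lines in the SPMF output format. This is a minimal sketch with a hypothetical `is_trivial` helper; the first line and its support value are toy data made up for the illustration.

```python
import re

def is_trivial(pattern_line):
    """True when the pattern has a single itemset located at time-point 0."""
    time_points = re.findall(r"<\s*(\d+)\s*>", pattern_line)
    return time_points == ["0"]

lines = [
    "<0> R=12 -1 #SUP: 600",               # toy single-itemset pattern
    "<0> R=12 -1 <1> R=12 -1 #SUP: 524",   # two itemsets: kept
]
kept = [ln for ln in lines if not is_trivial(ln)]
print(kept)
```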

**Pattern Matching and Decoding.** Because the input file carries the `@CONVERTED_FROM_TEXT` item mapping, the SPMF output already lists items as `Feature=value` strings. The `extract_pattern_matches()` function reads each pattern and its support, reformats the pattern into the same readable layout as `Sequence_raw`, and identifies which patients exhibit it by testing whether the pattern's itemsets occur, in order, in the patient's sequence. This test checks itemset order only and ignores the time tags, so the `REF` list of a pattern can be larger than the support reported by SPMF. The step loops over all patients for every pattern, which is computationally intensive but needed to know which patients contribute to each pattern.
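The parsing half of this step can be sketched on a single output line. This is a minimal illustration with a hypothetical `parse_pattern_line` helper; unlike the notebook's function, it keeps the time tag next to each itemset, which would be needed for a time-aware match.

```python
import re

def parse_pattern_line(line):
    """Split an SPMF output line into [(time, itemset), ...] and its support."""
    pattern_part, support_part = line.split("#SUP:")
    itemsets = [
        (int(t), frozenset(items.split()))
        for t, items in re.findall(r"<(\d+)>\s+(.*?)\s+-1", pattern_part)
    ]
    return itemsets, int(support_part.strip())

itemsets, support = parse_pattern_line(
    "<0> ALSFRS-R=>=36 -1 <1> R=12 -1 #SUP: 502"
)
print(itemsets, support)
```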

**Binary Matrix Construction.** The final transformation creates a patient-by-pattern binary matrix using pandas operations. The `explode('REF')` operation transforms the list of matching patients for each pattern into individual rows, creating a long-format representation. The `crosstab()` function then pivots this data into a wide binary matrix where rows represent patients and columns represent patterns, with 1 indicating the patient exhibits that pattern and 0 otherwise. This matrix format is ideal for clustering, classification, or other machine learning analyses.
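The explode-then-crosstab step can be followed on a toy table with two patterns and three patients. This is a minimal sketch; the REF lists are shortened and are not the real matches.

```python
import pandas as pd

p1 = "<0> R=12 -1 <1> R=12 -1"
p2 = "<0> ALSFRS-R=>=36 -1 <1> R=12 -1"

# One row per pattern, with the list of matching patients (toy lists).
matches = pd.DataFrame({"pattern": [p1, p2], "REF": [[2, 8, 9], [2, 9]]})

# Long format: one row per (pattern, patient) pair.
long_form = matches.explode("REF")

# Wide format: patients as rows, patterns as columns, counts as cells.
binary = pd.crosstab(long_form["REF"], long_form["pattern"])
print(binary)
```

A patient that matches no pattern never appears in the long format and is therefore missing from the matrix. If that can happen, `binary.reindex(all_refs, fill_value=0)` would add the all-zero rows, where `all_refs` is a hypothetical list of every patient kept in step 1.1.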

**Data Structure Insights.** The resulting binary matrix is a compact representation of temporal patterns across the patient population. Each column is one discovered sequential pattern (for example `<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1`, an ALSFRS-R score of at least 36 at time-points 0 and 1), and each row records which patterns occur in one patient's disease progression. This representation can be used to identify patient subgroups with similar temporal trajectories, to predict future progression, or to explore disease phenotypes based on how features evolve over time.

## 2.  Learn a classifier to predict NIV using temporal patterns as features together with the static features

In this task you should create a training set where the features are 1) the original static features (`Dataset_static_features.csv`) and 2) the sequential patterns computed above. The class labels to be used for each patient are in the file `Dataset_NIV_Evolution_180.csv`.

### 2.1. Load/Preprocess the Dataset

- Remember to delete from `Dataset_static_features.csv` the patients you deleted in step 1.1. You should only have one row per patient, thus remove repetitions.
- Remember to delete from `Dataset_NIV_Evolution_180.csv` the patients you deleted in step 1.1.
- Note that for each patient the class label you need from `Dataset_NIV_Evolution_180.csv` is the one corresponding to the last time-point you considered in step 1.1.

In [448]:
evolution_df = pd.read_csv('Dataset_NIV_Evolution_180.csv', delimiter=';')
static_df = pd.read_csv('Dataset_Static_Features.csv', delimiter=';')

In [449]:
# keep only the rows whose 'REF' is among the patients retained in `sequences`
evolution_df = evolution_df[evolution_df['REF'].isin(sequences['REF'])]
static_df = static_df[static_df['REF'].isin(sequences['REF'])]

# sanity check: count rows still outside the cohort (0 by construction after the filter above)
print(len(evolution_df[~evolution_df['REF'].isin(sequences['REF'])]))
print(len(static_df[~static_df['REF'].isin(sequences['REF'])]))

0
0


In [450]:
binary_df

pattern,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 R=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ -1",...,<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSb=12 R=12 -1,"<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ ALSFRS-R=>=36 -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ R=12 -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ R=12 ALSFRS-R=>=36 -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsUL=[4,8[ -1",<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 -1,<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 -1 <4> R=12 -1,<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 ALSFRS-R=>=36 -1,<0> R=12 ALSFRS-R=>=36 -1 <4> R=12 -1
0,2,1,1,1,0,0,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,8,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,1,1
2,9,1,1,0,0,0,1,0,0,1,...,1,1,1,1,1,1,1,1,1,1
3,10,1,1,0,0,0,0,0,0,0,...,0,1,1,1,1,0,1,0,1,1
4,14,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
686,1795,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
687,1814,1,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,0,1,1
688,1818,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
689,1877,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [451]:
static_df.head(3)

Unnamed: 0,REF,Gender,Age at onset (years),Revised El Escorial Criteria,Onset,Diagnostic delay (months),BMI at 1st visit,MND familiar history,C9orf72
0,2,Male,54.92,Probable,Limbs,5.22,17.9,No,
1,2,Male,54.92,Probable,Limbs,5.22,17.9,No,
2,2,Male,54.92,Probable,Limbs,5.22,17.9,No,


In [452]:
#indicate null values in the static_df
print(f"Number of nulls per column in static_df:\n{static_df.isnull().sum()}")

Number of nulls per column in static_df:
REF                                0
Gender                             0
Age at onset (years)               0
Revised El Escorial Criteria       0
Onset                              0
Diagnostic delay (months)          0
BMI at 1st visit                 305
MND familiar history             140
C9orf72                         1634
dtype: int64


In [453]:
static_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3853 entries, 0 to 4321
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   REF                           3853 non-null   int64  
 1   Gender                        3853 non-null   object 
 2   Age at onset (years)          3853 non-null   float64
 3   Revised El Escorial Criteria  3853 non-null   object 
 4   Onset                         3853 non-null   object 
 5   Diagnostic delay (months)     3853 non-null   float64
 6   BMI at 1st visit              3548 non-null   float64
 7   MND familiar history          3713 non-null   object 
 8   C9orf72                       2219 non-null   object 
dtypes: float64(3), int64(1), object(5)
memory usage: 301.0+ KB


In [454]:
static_df = static_df.dropna(subset=['BMI at 1st visit'])

In [455]:
static_df = static_df.fillna('unknown')
print(f"\nNumber of nulls per column in static_df after filling:\n{static_df.isnull().sum()}")


Number of nulls per column in static_df after filling:
REF                             0
Gender                          0
Age at onset (years)            0
Revised El Escorial Criteria    0
Onset                           0
Diagnostic delay (months)       0
BMI at 1st visit                0
MND familiar history            0
C9orf72                         0
dtype: int64


In [456]:
evolution_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3853 entries, 0 to 4321
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   REF        3853 non-null   int64 
 1   Evolution  3853 non-null   object
dtypes: int64(1), object(1)
memory usage: 90.3+ KB


In [458]:
len(static_df), len(evolution_df)

(3548, 3853)

In [459]:
evolution_df.head(3)

Unnamed: 0,REF,Evolution
0,2,N
1,2,N
2,2,N


In [460]:
static_df.head(3)

Unnamed: 0,REF,Gender,Age at onset (years),Revised El Escorial Criteria,Onset,Diagnostic delay (months),BMI at 1st visit,MND familiar history,C9orf72
0,2,Male,54.92,Probable,Limbs,5.22,17.9,No,unknown
1,2,Male,54.92,Probable,Limbs,5.22,17.9,No,unknown
2,2,Male,54.92,Probable,Limbs,5.22,17.9,No,unknown


In [461]:
binary_df

pattern,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 R=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ -1",...,<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSb=12 R=12 -1,"<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ ALSFRS-R=>=36 -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ R=12 -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ R=12 ALSFRS-R=>=36 -1","<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsUL=[4,8[ -1",<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 -1,<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 -1 <4> R=12 -1,<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 ALSFRS-R=>=36 -1,<0> R=12 ALSFRS-R=>=36 -1 <4> R=12 -1
0,2,1,1,1,0,0,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,8,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,1,1
2,9,1,1,0,0,0,1,0,0,1,...,1,1,1,1,1,1,1,1,1,1
3,10,1,1,0,0,0,0,0,0,0,...,0,1,1,1,1,0,1,0,1,1
4,14,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
686,1795,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
687,1814,1,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,0,1,1
688,1818,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
689,1877,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [462]:
#for each patient in the evolution_df and static_df, keep the last record for each REF
evolution_df = evolution_df.groupby('REF').tail(1)
static_df = static_df.groupby('REF').tail(1)

In [463]:
# 'inner' join because the patients with null BMI were dropped from static_df
binary_df = binary_df.merge(static_df, on='REF', how='inner')

In [464]:
binary_df

Unnamed: 0,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 R=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ -1",...,<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 ALSFRS-R=>=36 -1,<0> R=12 ALSFRS-R=>=36 -1 <4> R=12 -1,Gender,Age at onset (years),Revised El Escorial Criteria,Onset,Diagnostic delay (months),BMI at 1st visit,MND familiar history,C9orf72
0,2,1,1,1,0,0,1,1,1,1,...,1,1,Male,54.92,Probable,Limbs,5.22,17.90,No,unknown
1,8,1,0,0,0,0,0,0,0,0,...,1,1,Male,84.80,PMA,Limbs,8.90,29.38,No,unknown
2,9,1,1,0,0,0,1,0,0,1,...,1,1,Male,72.17,Possible,Limbs,27.93,26.61,No,No
3,10,1,1,0,0,0,0,0,0,0,...,1,1,Male,65.76,Probable,Bulbar,1.18,25.96,No,unknown
4,14,0,0,0,0,0,0,0,0,0,...,0,0,Female,61.63,Definite,Limbs,26.02,21.08,No,unknown
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
623,1794,1,0,0,0,0,0,0,0,0,...,0,0,Male,67.60,Possible,Limbs,6.05,21.26,No,No
624,1795,0,0,0,0,0,0,0,0,0,...,0,0,Female,37.22,Probable,Limbs,3.02,24.92,No,No
625,1814,1,0,0,0,0,0,0,0,0,...,1,1,Male,70.06,Prob. Lab Sup,Limbs,8.05,26.88,No,No
626,1818,0,0,0,0,0,0,0,0,0,...,0,0,Male,67.80,Probable,Bulbar,6.01,25.80,No,No


In [465]:
#one hot encode the static features with 0 and 1
binary_df = pd.get_dummies(binary_df, columns=['Gender', 'Revised El Escorial Criteria', 'Onset', 'MND familiar history', 'C9orf72'], dtype=int)

In [466]:
binary_df.head(3)

Unnamed: 0,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 R=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ -1",...,Onset_Limbs,Onset_Respiratory,Onset_axial,Onset_diffuse,MND familiar history_No,MND familiar history_Yes,MND familiar history_unknown,C9orf72_No,C9orf72_Yes,C9orf72_unknown
0,2,1,1,1,0,0,1,1,1,1,...,1,0,0,0,1,0,0,0,0,1
1,8,1,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,1
2,9,1,1,0,0,0,1,0,0,1,...,1,0,0,0,1,0,0,1,0,0


In [467]:
#assign the evolution features to the binary_df
binary_df = binary_df.merge(evolution_df, on='REF', how='left')

In [468]:
binary_df.head(3)

Unnamed: 0,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 R=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ -1",...,Onset_Respiratory,Onset_axial,Onset_diffuse,MND familiar history_No,MND familiar history_Yes,MND familiar history_unknown,C9orf72_No,C9orf72_Yes,C9orf72_unknown,Evolution
0,2,1,1,1,0,0,1,1,1,1,...,0,0,0,1,0,0,0,0,1,N
1,8,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,Y
2,9,1,1,0,0,0,1,0,0,1,...,0,0,0,1,0,0,1,0,0,Y


In [469]:
binary_df['Evolution'] = (binary_df['Evolution'] == 'Y').astype(int)

In [470]:
binary_df.head(3)

Unnamed: 0,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 R=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ -1",...,Onset_Respiratory,Onset_axial,Onset_diffuse,MND familiar history_No,MND familiar history_Yes,MND familiar history_unknown,C9orf72_No,C9orf72_Yes,C9orf72_unknown,Evolution
0,2,1,1,1,0,0,1,1,1,1,...,0,0,0,1,0,0,0,0,1,0
1,8,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,1
2,9,1,1,0,0,0,1,0,0,1,...,0,0,0,1,0,0,1,0,0,1




This code section creates a comprehensive machine learning dataset by combining temporal sequential patterns with static patient features to predict NIV (Non-Invasive Ventilation) outcomes in ALS patients. The process involves careful data alignment, preprocessing, and feature engineering to prepare for classification modeling.

**Dataset Loading and Patient Alignment.** The code loads two additional datasets: the static patient features and the NIV evolution outcomes. Both are filtered to the patients retained after the temporal preprocessing in step 1.1, since patients with insufficient temporal data or missing values were excluded there and every downstream step must use the same cohort. The two printed zeros confirm that no rows from excluded patients remain after filtering; because the check runs after the filter, it cannot detect retained patients that are missing from either file.

**Missing Data Handling.** Missing values are treated differently by column. Rows with a missing `BMI at 1st visit` (305 of 3853) are dropped, since BMI is a numeric baseline measurement with no natural placeholder. For the categorical variables `MND familiar history` (140 missing) and `C9orf72` (1634 missing), missing values are recoded as "unknown" rather than deleted. This preserves sample size and lets the model treat missingness as information in its own right; for instance, an untested genetic status might reflect a different clinical pathway.
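The two-step treatment can be sketched on a toy frame. The values below are invented, and restricting `fillna` to an explicit list of categorical columns (`cat_cols`) is an assumption that differs slightly from the notebook, which fills the whole frame once no numeric nulls remain.

```python
import numpy as np
import pandas as pd

# Toy static table with one missing BMI and one missing genetic test.
static = pd.DataFrame({
    "REF": [1, 2, 3],
    "BMI at 1st visit": [22.5, np.nan, 27.0],
    "C9orf72": ["No", "Yes", np.nan],
})

# Step 1: drop rows where the numeric baseline measurement is missing.
static = static.dropna(subset=["BMI at 1st visit"])

# Step 2: recode missing categorical values as an explicit category.
cat_cols = ["C9orf72"]
static = static.fillna({col: "unknown" for col in cat_cols})
print(static)
```

Filling only named columns guards against accidentally writing the string "unknown" into a numeric column that still contains nulls.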

**Temporal Consistency and Deduplication.** The `groupby('REF').tail(1)` operation keeps exactly one record per patient: the last row for each `REF` in file order. For the static features the repeated rows of a patient appear identical in the output shown, so this amounts to plain deduplication. For the evolution dataset the project statement asks for the label corresponding to the last time-point considered in step 1.1; taking the last row satisfies this only if the rows are in chronological order and the last row in the file is that time-point, which is assumed here.
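A small sketch of what `groupby('REF').tail(1)` selects, using made-up labels. It assumes rows are already ordered chronologically within each patient; `drop_duplicates(keep="last")` is shown as an equivalent formulation, not as the notebook's code.

```python
import pandas as pd

# Several evaluations per patient, assumed to be in chronological order.
evolution = pd.DataFrame({
    "REF": [2, 2, 2, 8, 8],
    "Evolution": ["N", "N", "Y", "N", "N"],
})

# Last row of each patient, keeping the original row order and index.
last_per_patient = evolution.groupby("REF").tail(1)

# Same selection expressed as deduplication on the patient identifier.
same_result = evolution.drop_duplicates(subset="REF", keep="last")
print(last_per_patient)
```

If the file were not sorted by time, an explicit sort on the time-point column would be needed before either call.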

**Feature Engineering Pipeline.** The merges combine three kinds of columns: the sequential patterns (from the binary pattern matrix), the static demographic and clinical features, and the target variable. The inner join with the static features also removes the patients whose BMI was missing. One-hot encoding then turns each categorical variable into binary indicator columns that the classifier can use directly. This matters for variables such as "Revised El Escorial Criteria" (diagnostic categories) and the genetic marker, where each category may relate differently to the NIV outcome.
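The encoding step can be illustrated on three rows taken from the static table shown above; the reduced column set is an assumption made to keep the sketch short.

```python
import pandas as pd

# Three patients with one categorical and one numeric static feature.
df = pd.DataFrame({
    "REF": [2, 8, 10],
    "Onset": ["Limbs", "Limbs", "Bulbar"],
    "Age at onset (years)": [54.92, 84.80, 65.76],
})

# Each category of 'Onset' becomes its own 0/1 column; other columns are kept.
encoded = pd.get_dummies(df, columns=["Onset"], dtype=int)
print(encoded)
```

The untouched columns come first and the indicator columns follow in alphabetical order of category, which explains the column layout seen in the notebook output.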

**Binary Target Transformation and Class Balance.** The final step converts the string-based evolution indicator ('Y'/'N') into a binary numeric target (1/0), the usual form for classification algorithms. The class distribution is not printed in this section; the counts reported under Analysis and Discussion (321 negative, 307 positive) show that the classes are nearly balanced. Accuracy is therefore a reasonable metric here, and resampling or cost-sensitive learning, which would be needed under severe imbalance, is not required.
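A short sketch of the target conversion and of the majority-class baseline it implies. The label series is synthetic, built only from the class counts printed under Analysis and Discussion (321 `N`, 307 `Y`).

```python
import pandas as pd

# Synthetic labels reproducing the reported class counts.
labels = pd.Series(["N"] * 321 + ["Y"] * 307, name="Evolution")

# 'Y' -> 1, anything else -> 0.
y = (labels == "Y").astype(int)

# Share of each class; the largest share is the accuracy of always
# predicting the majority class.
shares = y.value_counts(normalize=True)
majority_baseline = shares.max()
print(f"majority-class baseline: {majority_baseline:.3f}")
```

Any cross-validated accuracy should be read against this baseline rather than against 0.5.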

### 2.2. Create the Training Set

See Section **3.4 Training Set Creation and Model Learning** of the paper Martins et al. (2021). 

Note that in this project the original static features are used, so you only need to compute the distance matrix for the sequential patterns (which are now features); the static features are used as they are.

**Perform the experiments only for binary matrices**.

In [472]:
X = binary_df.drop(columns='Evolution')
y = binary_df['Evolution']

# one‐hot encode all categorical columns
#X_enc = pd.get_dummies(X, drop_first=True, dtype=int)



In [473]:
X.head(2)

Unnamed: 0,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 R=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSsT=[4,8[ -1",...,Onset_Limbs,Onset_Respiratory,Onset_axial,Onset_diffuse,MND familiar history_No,MND familiar history_Yes,MND familiar history_unknown,C9orf72_No,C9orf72_Yes,C9orf72_unknown
0,2,1,1,1,0,0,1,1,1,1,...,1,0,0,0,1,0,0,0,0,1
1,8,1,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,1


In [474]:
#print the distribution of the target variable
# print(full_df['Evolution'].value_counts(normalize=True))
# print(y.value_counts(normalize=True))
print(f"Number of refs: {len(binary_df)}")

Number of refs: 628




This code section creates the training dataset for a machine learning classifier by separating features from the target variable, following a specific methodological approach for combining temporal patterns with static clinical features.

**Feature-Target Variable Separation.** The data is split into features (`X`) and target (`y`). `drop(columns='Evolution')` removes the outcome from the feature matrix and keeps every other column: the sequential patterns found by temporal mining and the static patient characteristics (demographics, clinical measurements, genetic markers). The identifier column `REF` also remains in `X`; it carries no clinical meaning and would normally be dropped before training. The target `y` holds the binary NIV evolution outcome that the model learns to predict.

**Hybrid Feature Integration Strategy.** Following Martins et al. (2021), two feature types are combined. For the sequential patterns, represented as binary presence indicators, the project statement asks for a distance matrix that captures pattern similarity between patients, while the static features (age, BMI, genetic markers, diagnostic criteria) are used as they are. In the cells above no distance matrix is computed: the binary pattern indicators enter `X` directly alongside the static features.

**Binary Matrix Constraint.** The restriction to binary matrices simplifies the pipeline and keeps it interpretable. With patterns represented as presence/absence indicators, patient similarity can be computed with metrics such as Jaccard or Hamming distance, and each pattern column states directly whether a patient exhibits that temporal progression pattern.
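To make the Jaccard option concrete, here is a minimal NumPy sketch of a patient-by-patient distance matrix over binary pattern indicators. It is an illustration only, not the procedure of Martins et al. (2021) and not something the notebook computes; the function name `jaccard_distance_matrix` and the convention that two all-zero rows have distance 0 are assumptions.

```python
import numpy as np

def jaccard_distance_matrix(binary):
    """Pairwise Jaccard distance between the rows of a 0/1 matrix."""
    b = np.asarray(binary, dtype=int)
    inter = b @ b.T                      # patterns shared by each pair of rows
    sizes = b.sum(axis=1)                # patterns present in each row
    union = sizes[:, None] + sizes[None, :] - inter
    # Similarity is |A ∩ B| / |A ∪ B|; rows with an empty union are treated as identical.
    sim = np.divide(inter, union,
                    out=np.ones(inter.shape, dtype=float),
                    where=union > 0)
    return 1.0 - sim

toy = [[1, 1, 0],
       [1, 0, 0],
       [0, 0, 1]]
D = jaccard_distance_matrix(toy)
print(D)
```

Using the matrix product for the intersections keeps memory proportional to the number of patient pairs rather than to pairs times patterns.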

**Data Validation and Exploration.** `X.head(2)` shows the first two rows of the feature matrix for a quick check of column names, data types and value ranges. The printed size (628 patients) is the number of instances available for training, which matters when judging how much the cross-validated estimates can vary on a dataset this small.

### 2.3. Learn the Model

See section **4.1 Model Evaluation** of the paper Martins et al. (2021).

Use only a `Random-Forest` with default parameters and present the results for **5-fold cross-validation** (mean ± std). 

Note that the problem is difficult, so don't expect high performance.

In [475]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# define RF and stratified 5‐fold CV
rf = RandomForestClassifier(random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# run CV on encoded data
#scores = cross_val_score(rf, X_enc, y, cv=cv, scoring='accuracy')
scores = cross_val_score(rf, X, y, cv=cv, scoring='accuracy')
print(f"5‐fold CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

5‐fold CV accuracy: 0.557 ± 0.058



This code section implements the final machine learning evaluation phase, using Random Forest classification with stratified cross-validation to assess the predictive performance of the combined temporal patterns and static features for NIV outcome prediction.

**Model Selection and Configuration.** A Random Forest classifier with default parameters is used, as required by the project statement. Random Forest handles mixed feature types (binary pattern indicators alongside continuous clinical variables), provides feature importance measures, and is comparatively resistant to overfitting, although with a small cohort (628 patients) it is not immune to it. Setting `random_state=42` makes the results reproducible across runs.

**Stratified Cross-Validation Strategy.** `StratifiedKFold` with 5 folds keeps the proportion of positive and negative NIV outcomes in each fold close to that of the full dataset, so no fold is dominated by one class. `shuffle=True` randomizes patient order before splitting, which reduces bias from the order in which patients appear in the data, and the fixed random state keeps the splits reproducible.
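The effect of stratification can be checked with a short sketch. The labels are synthetic, built from the class counts printed under Analysis and Discussion, and the single all-zero feature column is a placeholder because the splitter only needs the labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with the reported class counts (321 negative, 307 positive).
y = np.array([0] * 321 + [1] * 307)
X = np.zeros((len(y), 1))  # placeholder features, not used by the splitter

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of positive patients in the test part of each fold.
fold_pos_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
print([round(r, 3) for r in fold_pos_rates])
```

Every fold stays close to the overall positive rate of 307/628, which would not be guaranteed with a plain `KFold` on unshuffled data.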

**Performance Evaluation Framework.** `cross_val_score` trains and tests the model once per fold. In each of the 5 iterations the model is fitted on about 80% of the patients and its accuracy is measured on the remaining 20%, so every patient is used for testing exactly once. The spread of the five scores indicates how much performance depends on the particular split.

**Result Interpretation and Expectations.** The result is reported as mean ± standard deviation over the folds: 0.557 ± 0.058. With the class counts reported under Analysis and Discussion (321 negative, 307 positive), always predicting the majority class would reach about 0.51, so the model is only modestly above that baseline and the difference is within one standard deviation. This is consistent with the project statement's warning that the problem is difficult. Accuracy in this kind of medical prediction task is limited by disease complexity, measurement noise and the variability of individual progression.
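One way to put the reported accuracy in context is to run a trivial baseline through the same cross-validation setup. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with the class counts printed under Analysis and Discussion; it is an added illustration, not a cell from the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic labels with the reported class counts; features are a placeholder.
y = np.array([0] * 321 + [1] * 307)
X = np.zeros((len(y), 1))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# A classifier that always predicts the most frequent training class.
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"),
                           X, y, cv=cv, scoring="accuracy")
print(f"baseline accuracy: {baseline.mean():.3f} ± {baseline.std():.3f}")
```

Reporting this baseline next to the Random Forest score makes it easier to judge how much the features actually contribute.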

**Statistical Rigor and Clinical Context.** Stratified k-fold cross-validation is standard practice for estimating performance on small datasets. Combining temporal progression patterns with baseline patient characteristics goes beyond snapshot-based models and may capture disease dynamics relevant to the timing of NIV. With the accuracy obtained here, however, the result is too close to the baseline to support clinical use without further improvement.

### Analysis and Discussion

In [493]:
#count the number of patients with evolution 0 and 1
evolution_counts = binary_df['Evolution'].value_counts()
print(f"\nEvolution counts:\n{evolution_counts}")


Evolution counts:
Evolution
0    321
1    307
Name: count, dtype: int64


In [494]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.model_selection import cross_val_predict
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## Experiments

### Minimum Support 20%

In [476]:
os.system("java -jar spmf.jar run Fournier08-Closed+time sequences.txt output_20.txt 20% 0 5 0 5")

0

In [477]:
filter_outputs_by_len(input_file='output_20.txt', output_file='filtered_output_20.txt')

Filtered file written to filtered_output_20.txt


In [478]:
df_matches = extract_pattern_matches('filtered_output_20.txt', sequences)

In [479]:
df_matches.head(3)

Unnamed: 0,pattern,decoded_pattern,support,REF,support_pct
0,<0> R=12 -1 <1> R=12 -1,"(0, R = 12 ) (1, R = 12 )",524,"[2, 8, 9, 10, 17, 18, 20, 21, 24, 30, 34, 35, 36, 39, 40, 42, 43, 45, 46, 49, 50, 54, 55, 56, 64, 66, 67, 72, 74, 78, 79, 82, 85, 88, 91, 93, 94, 100, 103, 104, 105, 111, 113, 115, 119, 122, 125, 126, 133, 136, 137, 141, 144, 145, 151, 153, 156, 161, 162, 164, 165, 166, 167, 169, 171, 173, 174, 176, 177, 178, 179, 180, 185, 196, 197, 200, 201, 202, 205, 207, 210, 211, 212, 213, 214, 215, 216, 219, 220, 227, 236, 238, 241, 242, 247, 250, 253, 256, 259, 261, ...]",75.83
1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,"(0, ALSFRS-R = >=36 ) (1, ALSFRS-R = >=36 )",511,"[2, 8, 9, 10, 17, 21, 24, 26, 30, 34, 35, 39, 40, 42, 45, 46, 49, 50, 54, 55, 56, 60, 61, 63, 64, 66, 72, 74, 78, 79, 82, 85, 88, 91, 93, 94, 99, 100, 103, 104, 113, 115, 119, 122, 126, 134, 136, 137, 141, 151, 155, 156, 160, 161, 162, 164, 166, 167, 169, 171, 173, 174, 176, 177, 179, 185, 196, 197, 200, 202, 203, 205, 207, 208, 210, 211, 212, 213, 214, 215, 217, 220, 227, 236, 238, 241, 242, 243, 244, 247, 250, 253, 256, 259, 261, 262, 263, 265, 267, 269, ...]",73.95
2,<0> ALSFRS-R=>=36 -1 <1> R=12 -1,"(0, ALSFRS-R = >=36 ) (1, R = 12 )",502,"[2, 8, 9, 10, 17, 21, 24, 30, 34, 35, 39, 40, 42, 45, 46, 49, 50, 54, 55, 56, 64, 66, 67, 72, 74, 78, 79, 82, 85, 88, 91, 93, 94, 100, 103, 104, 105, 113, 115, 119, 122, 126, 134, 136, 137, 141, 144, 145, 151, 153, 156, 161, 162, 164, 165, 166, 167, 169, 171, 173, 174, 176, 177, 178, 179, 180, 185, 196, 197, 200, 202, 205, 207, 210, 211, 212, 213, 214, 215, 220, 227, 236, 238, 241, 242, 247, 250, 253, 256, 259, 261, 262, 263, 265, 267, 269, 271, 273, 277, 278, ...]",72.65


In [480]:
# Explode the REF column to have one row per REF-pattern combination
df_exploded = df_matches[['pattern', 'REF']].explode('REF')

# Create binary matrix using crosstab
binary_df = pd.crosstab(df_exploded['REF'], df_exploded['pattern'])

# Convert to integer type (crosstab returns counts, but since each REF-pattern pair appears once, it's already binary)
binary_df = binary_df.astype(int)

# Reset index to make REF a column instead of index
binary_df = binary_df.reset_index()

print(f"Binary matrix shape: {binary_df.shape}")
print(f"Number of patients: {len(binary_df)}")
print(f"Number of patterns: {len(binary_df.columns) - 1}")

binary_df.head()

Binary matrix shape: (691, 15698)
Number of patients: 691
Number of patterns: 15697


pattern,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRSb=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRSsT=[4,8[ -1",<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,...,"<0> R=12 ALSFRS-R=>=36 -1 <3> ALSFRSsUL=[4,8[ R=12 -1",<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 -1,<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 -1 <4> ALSFRSb=12 -1,"<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 -1 <4> ALSFRSsT=[4,8[ -1",<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 -1 <4> R=12 -1,<0> R=12 ALSFRS-R=>=36 -1 <3> R=12 ALSFRS-R=>=36 -1,<0> R=12 ALSFRS-R=>=36 -1 <4> ALSFRSb=12 -1,"<0> R=12 ALSFRS-R=>=36 -1 <4> ALSFRSsT=[4,8[ -1",<0> R=12 ALSFRS-R=>=36 -1 <4> R=12 -1,"<0> R=[8,12[ -1 <1> R=[8,12[ -1"
0,2,1,1,1,0,0,0,0,0,1,...,1,1,1,1,1,1,1,1,1,0
1,8,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,1,0,1,0
2,9,1,1,0,0,0,0,0,0,1,...,0,1,1,1,1,1,1,1,1,1
3,10,1,1,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,1,0
4,14,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [481]:
binary_df = binary_df.merge(static_df, on='REF', how='inner')

In [482]:

binary_df = pd.get_dummies(binary_df, columns=['Gender', 'Revised El Escorial Criteria', 'Onset', 'MND familiar history', 'C9orf72'], dtype=int)

In [483]:
binary_df.head(3)

Unnamed: 0,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRSb=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRSsT=[4,8[ -1",<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,...,Onset_Limbs,Onset_Respiratory,Onset_axial,Onset_diffuse,MND familiar history_No,MND familiar history_Yes,MND familiar history_unknown,C9orf72_No,C9orf72_Yes,C9orf72_unknown
0,2,1,1,1,0,0,0,0,0,1,...,1,0,0,0,1,0,0,0,0,1
1,8,1,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,1
2,9,1,1,0,0,0,0,0,0,1,...,1,0,0,0,1,0,0,1,0,0


In [484]:
binary_df = binary_df.merge(evolution_df, on='REF', how='left')

In [485]:
binary_df.head(3)

Unnamed: 0,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRSb=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRSsT=[4,8[ -1",<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,...,Onset_Respiratory,Onset_axial,Onset_diffuse,MND familiar history_No,MND familiar history_Yes,MND familiar history_unknown,C9orf72_No,C9orf72_Yes,C9orf72_unknown,Evolution
0,2,1,1,1,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,N
1,8,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,Y
2,9,1,1,0,0,0,0,0,0,1,...,0,0,0,1,0,0,1,0,0,Y


In [486]:
# Encode the target: 1 for 'Y', 0 for anything else (including a missing label)
binary_df['Evolution'] = (binary_df['Evolution'] == 'Y').astype(int)

In [487]:
binary_df.head(3)

Unnamed: 0,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRSb=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRSsT=[4,8[ -1",<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,...,Onset_Respiratory,Onset_axial,Onset_diffuse,MND familiar history_No,MND familiar history_Yes,MND familiar history_unknown,C9orf72_No,C9orf72_Yes,C9orf72_unknown,Evolution
0,2,1,1,1,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,0
1,8,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,1
2,9,1,1,0,0,0,0,0,0,1,...,0,0,0,1,0,0,1,0,0,1
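
A left merge keeps every row of `binary_df`, so any `REF` with no match in `evolution_df` ends up with a missing `Evolution`, and the `== 'Y'` comparison then turns that missing value into 0. A minimal sketch of a check for this, on toy frames whose values are invented for illustration (only the column names `REF` and `Evolution` come from the notebook; `features`, `labels` and `p1` are hypothetical):

```python
import pandas as pd

# Toy stand-ins for binary_df and evolution_df (values are made up).
features = pd.DataFrame({"REF": [2, 8, 9], "p1": [1, 1, 0]})
labels = pd.DataFrame({"REF": [2, 8], "Evolution": ["N", "Y"]})

# indicator=True adds a _merge column saying where each row came from.
merged = features.merge(labels, on="REF", how="left", indicator=True)
unlabelled = merged.loc[merged["_merge"] == "left_only", "REF"].tolist()
print(f"REFs without a label: {unlabelled}")

# Without the check, the missing label is silently encoded as class 0.
encoded = (merged["Evolution"] == "Y").astype(int)
print(encoded.tolist())
```

If the list of unlabelled REFs is empty on the real data, the encoding step is safe as written; otherwise those rows may need to be dropped before training.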


In [488]:
evolution_counts = binary_df['Evolution'].value_counts()
print(f"\nEvolution counts:\n{evolution_counts}")


Evolution counts:
Evolution
0    321
1    307
Name: count, dtype: int64


In [489]:
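# Features: every column except the target.
# NOTE: X still contains REF, a patient identifier rather than a clinical feature.
X = binary_df.drop(columns='Evolution')
y = binary_df['Evolution']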


In [490]:
X.head(2)

Unnamed: 0,REF,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRSb=12 -1,"<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> ALSFRSsT=[4,8[ -1",<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRS-R=>=36 -1 <4> R=12 ALSFRS-R=>=36 -1,<0> ALSFRS-R=>=36 -1 <1> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1 <3> ALSFRSb=12 -1,...,Onset_Limbs,Onset_Respiratory,Onset_axial,Onset_diffuse,MND familiar history_No,MND familiar history_Yes,MND familiar history_unknown,C9orf72_No,C9orf72_Yes,C9orf72_unknown
0,2,1,1,1,0,0,0,0,0,1,...,1,0,0,0,1,0,0,0,0,1
1,8,1,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,1
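
The `X` shown above still includes `REF`, which identifies the patient rather than describing them. One option is to keep the identifier aside and leave it out of the feature matrix. A minimal sketch on a toy frame (values invented; `pattern_a`, `ids`, `X_features` and `y_target` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for binary_df (values are made up).
df = pd.DataFrame({
    "REF": [2, 8, 9, 10],
    "pattern_a": [1, 1, 1, 0],
    "Onset_Limbs": [1, 1, 0, 1],
    "Evolution": [0, 1, 1, 0],
})

# Keep the identifier aside (e.g. for error analysis),
# but leave it out of the feature matrix.
ids = df["REF"]
X_features = df.drop(columns=["REF", "Evolution"])
y_target = df["Evolution"]

print(list(X_features.columns))
```

Whether removing `REF` changes the reported accuracy would have to be checked by re-running the cross-validation.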


In [491]:
print(f"Number of refs: {len(binary_df)}")

Number of refs: 628


In [492]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Random forest evaluated with stratified 5-fold cross-validation
rf = RandomForestClassifier(random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Run CV on the feature matrix X
scores = cross_val_score(rf, X, y, cv=cv, scoring='accuracy')
print(f"5‐fold CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

5‐fold CV accuracy: 0.572 ± 0.045
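
To put this accuracy in context, it can be compared with the score obtained by always predicting the most frequent class. A minimal sketch using the class counts printed earlier (321 for class 0, 307 for class 1); the variable names are illustrative:

```python
from collections import Counter

# Class counts as printed by value_counts() earlier in the notebook.
counts = Counter({0: 321, 1: 307})

majority_class, majority_n = counts.most_common(1)[0]
baseline_acc = majority_n / sum(counts.values())
cv_acc = 0.572  # mean 5-fold CV accuracy reported above

print(f"Majority-class baseline: {baseline_acc:.3f}")
print(f"Gain over baseline: {cv_acc - baseline_acc:.3f}")
```

The baseline is about 0.511, so the random forest gains roughly 0.06 over it, which is of the same order as the 0.045 standard deviation across folds.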
