# Data Mining / Prospecção de Dados

## Sara C. Madeira, 2024/2025

# Project 2 - Classification in Temporal Data using Sequential Pattern Mining

## Logistics 
**_Read Carefully_**

**Students should work in teams of 3 people**. 

Groups with fewer than 3 people might be allowed (with valid justification), but will not receive better grades for this reason. 

The quality of the project will dictate its grade, not the number of people working.

**The project's solution should be uploaded in Moodle before the end of `June, 8th (23:59)`.** 

Students should **upload a `.zip` file** containing a folder with all the files necessary for project evaluation. 
Groups should be registered in [Moodle](https://moodle.ciencias.ulisboa.pt/mod/groupselect/view.php?id=139096) and the `zip` file should be identified as `PDnn.zip` where `nn` is the number of your group.

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc. describing the solution and the results. Projects not delivered in this format will not be graded. You can use `PD_202425_P2.ipynb` as a template. In your `.zip` folder you should also include an HTML version of your notebook with all the outputs.**

**Decisions should be justified and results should be critically discussed.** 

Remember that **your notebook should be as clear and organized as possible**, that is, **only the relevant code and experiments should be presented, not everything you tried and did not work, or is not relevant** (that can be discussed in the text, if relevant)! Tables and figures can be used together with text to summarize results and conclusions, improving understanding, readability and concision. **More does not mean better! The target is quality not quantity!**

_**Project solutions containing only code and outputs without discussions will achieve a maximum grade of 10 out of 20.**_

## Dataset and Tools

Amyotrophic Lateral Sclerosis (ALS) is a devastating neurodegenerative disease causing rapid degeneration of motor neurons and usually leading to death by respiratory failure. Since there is no cure, the goal of treatment is to improve symptoms and prolong survival. Non-invasive Ventilation (NIV) has been shown to extend life expectancy and improve quality of life, so it is key to effectively predict whether ALS patients will be eligible for NIV in the near future based on disease progression. In this context, Martins et al. (2021) proposed to learn prognostic models using disease progression patterns (https://ieeexplore.ieee.org/document/9426397), and formulated the following prognostic prediction problem (schematized in the figure below): given a specific ALS patient's static data collected at diagnosis and temporal data from disease follow-up, can we effectively use these clinical evaluations to predict if this patient will require NIV within k days of the last evaluation?

<img src="prognostic_problem.png" alt="Prognostic Prediction" style="width: 500px;"/>

In this project, we will perform a reduced part of the work published by Martins et al. to learn a machine learning model (classifier) able to predict the need for NIV in a time window of 180 days given static and temporal data and sequential patterns as features.

The dataset to be analysed was obtained from the Lisbon ALS database, containing clinical data from ALS patients collected during their follow-up at the hospital. To reduce the preprocessing steps relative to what was done in the paper (and what should be done in a real scenario), the following **preprocessed datasets** are already provided:

1. `Dataset_Static_Features.csv` - each row is a patient with a REF id described by a set of features collected at diagnosis time. These features are called static since they are not collected over time. Duplicated patients should be deleted.

<img src="static_example.png" alt="Temporal Data" style="width: 1000px;"/>

2. `Dataset_Temporal_Features.csv` - each patient has a set of rows (snapshots), each corresponding to a visit to the hospital and holding the values of a set of temporal features collected over time. The rows per patient REF are sorted in chronological order, such that the first and last rows of a REF correspond, respectively, to the data collected in the first and last visits to the hospital. In the example below, the patient REF=9 has temporal data describing 5 clinical evaluations (time-points).

<img src="temporal_example.png" alt="Temporal Data" style="width: 500px;"/>

3. `Dataset_NIV_Evolution_180.cvs` - each patient has a set of rows with the true value of NIV (yes or no) 180 days in the future. These are the class labels to be used later to train the classifier. Patient REF=9 does not evolve to need NIV in the 180 days after time-points 1 to 4, but evolves to NIV in the 180 days after time-point 5.

<img src="NIV_evolution_example.png" alt="Temporal Data" style="width: 100px;"/>

In this context, the project has **2 main tasks**:
1. Learn disease progression patterns from the temporal data using temporal pattern mining
2. Learn a classifier to predict NIV using temporal patterns as features together with the static features


**In this project you should use [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org), [`SPMF`](https://www.philippe-fournier-viger.com/software.php) for temporal pattern mining, and [`Scikit-learn`](https://scikit-learn.org/stable/) for classification.**


## Team Identification

**GROUP NN**

Students:

* Student 1 - n_student1
* Student 2 - n_student2
* Student 3 - n_student3

## 1. Learn disease progression patterns from the temporal data using temporal pattern mining

In this first task you should load and preprocess the dataset **`Dataset_Temporal_Features.csv`** in order to compute sequential patterns for each patient. 

For that, you first need to load and preprocess this dataset and then transform the temporal data into a **sequence database**.

You should consider **a minimum of 2 time-points** and a **maximum of 5 time-points** per patient.

The sequential pattern mining algorithm `Fournier08` (https://www.philippe-fournier-viger.com/spmf/ClosedSequentialPatterns_TimeConstraints.php), an extension of `PrefixSpan` able to **deal with time** and compute **closed patterns**, should be used.

### 1.1. Load and Preprocess Dataset

- Remove all patients (REFs) with missing values in the temporal features.
- Remove all patients with fewer than 2 time-points.
- For the patients with more than 5 time-points, keep only the first 5.
- In order to reduce the number of patterns to be generated, aggregate the values of the temporal features using the following intervals (4 intervals for each feature, as in Martins et al.): 
  1) Intervals for `ALSFRSb, ALSFRSsUL, ALSFRSsT, ALSFRSsLL, and R: <4 [4,8[ [8,12[ 12`;
  2) Intervals for `ALSFRS-R: <12  [12,24[  [24,36[  >=36`
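
The first three bullets can be sketched on toy data as follows. This is only a minimal illustration, assuming a frame with a `REF` column whose rows are already in chronological order; the helper name `filter_patients` and the toy values are hypothetical. Note that it drops every patient that has at least one missing value, which is not the same as dropping only the incomplete rows.

```python
import pandas as pd

def filter_patients(df, min_points=2, max_points=5):
    """Keep complete patients with at least `min_points` visits, truncated to `max_points`."""
    # REFs that have a missing value in any column of any of their rows
    bad_refs = df.loc[df.isnull().any(axis=1), "REF"].unique()
    df = df[~df["REF"].isin(bad_refs)]
    # Drop patients with too few time-points
    df = df.groupby("REF").filter(lambda g: len(g) >= min_points)
    # Keep only the first `max_points` rows of each remaining patient
    return df.groupby("REF").head(max_points).reset_index(drop=True)

# Hypothetical toy data: REF 1 has a missing value, REF 3 has a single visit
toy = pd.DataFrame({
    "REF": [1, 1, 1, 2, 2, 3] + [4] * 7,
    "ALSFRSb": [12.0, None, 10.0, 9.0, 8.0, 7.0] + [6.0] * 7,
})
clean = filter_patients(toy)
print(clean.groupby("REF").size())
```

On this toy input only patients 2 and 4 survive, with 2 and 5 rows respectively.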

In [85]:
import pandas as pd
import os
from spmf import Spmf

In [86]:
temporal_df = pd.read_csv('Dataset_Temporal_Features.csv', delimiter=';')

In [87]:
temporal_df.head(3)

Unnamed: 0,REF,ALSFRSb,ALSFRSsUL,ALSFRSsT,ALSFRSsLL,R,ALSFRS-R
0,2,12.0,8.0,6.0,4.0,12.0,42.0
1,2,12.0,8.0,6.0,4.0,12.0,42.0
2,2,12.0,8.0,6.0,3.0,12.0,41.0


In [88]:
# Remove rows with NaN values (this drops individual rows, not whole patients)
print(f"Number of nulls per column\n{temporal_df.isnull().sum()}")
temporal_df = temporal_df.dropna()
print(f"\nNumber of nulls per column\n{temporal_df.isnull().sum()}")

Number of nulls per column
REF            0
ALSFRSb      318
ALSFRSsUL    329
ALSFRSsT     329
ALSFRSsLL    329
R            366
ALSFRS-R     381
dtype: int64

Number of nulls per column
REF          0
ALSFRSb      0
ALSFRSsUL    0
ALSFRSsT     0
ALSFRSsLL    0
R            0
ALSFRS-R     0
dtype: int64


In [89]:
ref_counts = temporal_df['REF'].value_counts()
print(f"\nReference counts:\n{ref_counts}")


Reference counts:
REF
723     26
269     25
608     24
395     24
981     23
        ..
1162     1
1453     1
110      1
519      1
521      1
Name: count, Length: 957, dtype: int64


In [90]:
#print the unique values in all columns except 'REF'
for col in temporal_df.columns:
    if col != 'REF':
        unique_values = temporal_df[col].unique()
        print(f"\nUnique values in column '{col}':\n{unique_values}")


Unique values in column 'ALSFRSb':
[12.  3. 10.  9.  4. 11.  7.  8.  6.  5.  2.  0.  1.]

Unique values in column 'ALSFRSsUL':
[8. 5. 6. 3. 4. 2. 7. 1. 0.]

Unique values in column 'ALSFRSsT':
[6. 4. 7. 2. 8. 3. 5. 1. 0.]

Unique values in column 'ALSFRSsLL':
[4. 3. 6. 8. 5. 7. 1. 0. 2.]

Unique values in column 'R':
[12. 11. 10.  7.  9.  8.]

Unique values in column 'ALSFRS-R':
[42. 41. 36. 37. 38. 32. 46. 45. 44. 31. 27. 40. 39. 35. 30. 33. 47. 43.
 29. 28. 26. 34. 25. 22. 24. 23. 17. 20. 19. 21. 48. 16. 18. 15. 14. 12.
 11. 13.]


In [91]:
temporal_df = temporal_df.groupby('REF').filter(lambda x: len(x) >= 2)

In [92]:
temporal_df = temporal_df.groupby('REF').head(5)

In [93]:
def alsfrs_intervals(val):
    if val < 4:
        return '<4'
    elif 4 <= val < 8:
        return '[4,8['
    elif 8 <= val < 12:
        return '[8,12['
    else:
        return '12'

In [94]:
cols = ['ALSFRSb', 'ALSFRSsUL', 'ALSFRSsT', 'ALSFRSsLL', 'R']
for col in cols:
    temporal_df[col] = temporal_df[col].apply(alsfrs_intervals)

In [95]:
temporal_df.head(3)

Unnamed: 0,REF,ALSFRSb,ALSFRSsUL,ALSFRSsT,ALSFRSsLL,R,ALSFRS-R
0,2,12,"[8,12[","[4,8[","[4,8[",12,42.0
1,2,12,"[8,12[","[4,8[","[4,8[",12,42.0
2,2,12,"[8,12[","[4,8[",<4,12,41.0


In [96]:
def alsfrs_r(val):
    if val < 12:
        return '<12'
    elif 12 <= val < 24:
        return '[12,24['
    elif 24 <= val < 36:
        return '[24,36['
    else:
        return '>=36'

In [97]:
temporal_df['ALSFRS-R'] = temporal_df['ALSFRS-R'].apply(alsfrs_r)
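
The two mapping functions `alsfrs_intervals` and `alsfrs_r` follow the same rule: find the half-open interval that contains the value. As an alternative sketch (not the approach used in the cells above), both can be expressed with one helper based on the standard `bisect` module. The helper name `to_interval` and the constant names are hypothetical; the cut points and labels are the ones listed in the task statement.

```python
from bisect import bisect_right

# Cut points and labels taken from the intervals listed in the task statement
SUBSCORE_CUTS, SUBSCORE_LABELS = [4, 8, 12], ["<4", "[4,8[", "[8,12[", "12"]
TOTAL_CUTS, TOTAL_LABELS = [12, 24, 36], ["<12", "[12,24[", "[24,36[", ">=36"]

def to_interval(value, cuts, labels):
    """Return the label of the half-open interval that contains `value`."""
    # bisect_right counts how many cut points are <= value
    return labels[bisect_right(cuts, value)]

subscores = [to_interval(v, SUBSCORE_CUTS, SUBSCORE_LABELS) for v in (3, 4, 7.5, 8, 12)]
totals = [to_interval(v, TOTAL_CUTS, TOTAL_LABELS) for v in (11, 12, 35, 36, 48)]
print(subscores)
print(totals)
```

With pandas, the same helper could be applied per column through `Series.map`, for example `temporal_df[col].map(lambda v: to_interval(v, SUBSCORE_CUTS, SUBSCORE_LABELS))`.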

In [98]:
temporal_df.head(3)

Unnamed: 0,REF,ALSFRSb,ALSFRSsUL,ALSFRSsT,ALSFRSsLL,R,ALSFRS-R
0,2,12,"[8,12[","[4,8[","[4,8[",12,>=36
1,2,12,"[8,12[","[4,8[","[4,8[",12,>=36
2,2,12,"[8,12[","[4,8[",<4,12,>=36


In [99]:
#print the unique values in all columns except 'REF'
for col in temporal_df.columns:
    if col != 'REF':
        unique_values = temporal_df[col].unique()
        print(f"\nUnique values in column '{col}':\n{unique_values}")


Unique values in column 'ALSFRSb':
['12' '[8,12[' '[4,8[' '<4']

Unique values in column 'ALSFRSsUL':
['[8,12[' '[4,8[' '<4']

Unique values in column 'ALSFRSsT':
['[4,8[' '<4' '[8,12[']

Unique values in column 'ALSFRSsLL':
['[4,8[' '<4' '[8,12[']

Unique values in column 'R':
['12' '[8,12[']

Unique values in column 'ALSFRS-R':
['>=36' '[24,36[' '[12,24[' '<12']


In [100]:
def create_sequence_db_df(df):
    records = []
    item_mapping = {}
    item_counter = 1

    for ref, group in df.groupby('REF'):
        parts = []       # for SPMF encoding
        parts_raw = []   # for human-readable
        for t, (_, row) in enumerate(group.iterrows()):
            # build encoded item list
            items = []
            for col in df.columns:
                if col == 'REF':
                    continue
                value = str(row[col])
                key = f"{col}={value}"
                if key not in item_mapping:
                    item_mapping[key] = item_counter
                    item_counter += 1
                items.append(str(item_mapping[key]))
            parts.append(f"<{t}> " + " ".join(items) + " -1")

            # build raw tuple
            raw_items = " ".join(f"{col} = {row[col]}"
                                 for col in df.columns if col != 'REF')
            parts_raw.append(f"({t}, {raw_items} )")

        sequence_str = " ".join(parts) + " -2"
        sequence_raw = " ".join(parts_raw)
        records.append({
            'REF': ref,
            'Sequence': sequence_str,
            'Sequence_raw': sequence_raw
        })

    # build mapping header
    mapping_str = "@CONVERTED_FROM_TEXT\n"
    for key, number in item_mapping.items():
        mapping_str += f"@ITEM={number}={key}\n"

    return pd.DataFrame(records), mapping_str

In [101]:
sequences, map = create_sequence_db_df(temporal_df)

In [102]:
sequences.head()

Unnamed: 0,REF,Sequence,Sequence_raw
0,2,<0> 1 2 3 4 5 6 -1 <1> 1 2 3 4 5 6 -1 <2> 1 2 3 7 5 6 -1 <3> 1 8 3 7 5 6 -1 -2,"(0, ALSFRSb = 12 ALSFRSsUL = [8,12[ ALSFRSsT = [4,8[ ALSFRSsLL = [4,8[ R = 12 ALSFRS-R = >=36 ) (1, ALSFRSb = 12 ALSFRSsUL = [8,12[ ALSFRSsT = [4,8[ ALSFRSsLL = [4,8[ R = 12 ALSFRS-R = >=36 ) (2, ALSFRSb = 12 ALSFRSsUL = [8,12[ ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = 12 ALSFRS-R = >=36 ) (3, ALSFRSb = 12 ALSFRSsUL = [4,8[ ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = 12 ALSFRS-R = >=36 )"
1,8,<0> 1 8 3 9 5 6 -1 <1> 1 10 11 9 5 6 -1 -2,"(0, ALSFRSb = 12 ALSFRSsUL = [4,8[ ALSFRSsT = [4,8[ ALSFRSsLL = [8,12[ R = 12 ALSFRS-R = >=36 ) (1, ALSFRSb = 12 ALSFRSsUL = <4 ALSFRSsT = <4 ALSFRSsLL = [8,12[ R = 12 ALSFRS-R = >=36 )"
2,9,<0> 1 8 3 7 5 6 -1 <1> 1 10 3 4 5 6 -1 <2> 1 8 11 7 12 13 -1 <3> 1 10 3 7 5 6 -1 <4> 1 10 3 7 12 13 -1 -2,"(0, ALSFRSb = 12 ALSFRSsUL = [4,8[ ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = 12 ALSFRS-R = >=36 ) (1, ALSFRSb = 12 ALSFRSsUL = <4 ALSFRSsT = [4,8[ ALSFRSsLL = [4,8[ R = 12 ALSFRS-R = >=36 ) (2, ALSFRSb = 12 ALSFRSsUL = [4,8[ ALSFRSsT = <4 ALSFRSsLL = <4 R = [8,12[ ALSFRS-R = [24,36[ ) (3, ALSFRSb = 12 ALSFRSsUL = <4 ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = 12 ALSFRS-R = >=36 ) (4, ALSFRSb = 12 ALSFRSsUL = <4 ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = [8,12[ ALSFRS-R = [24,36[ )"
3,10,<0> 14 2 15 9 5 6 -1 <1> 14 2 3 9 5 6 -1 <2> 14 2 15 9 12 6 -1 -2,"(0, ALSFRSb = [8,12[ ALSFRSsUL = [8,12[ ALSFRSsT = [8,12[ ALSFRSsLL = [8,12[ R = 12 ALSFRS-R = >=36 ) (1, ALSFRSb = [8,12[ ALSFRSsUL = [8,12[ ALSFRSsT = [4,8[ ALSFRSsLL = [8,12[ R = 12 ALSFRS-R = >=36 ) (2, ALSFRSb = [8,12[ ALSFRSsUL = [8,12[ ALSFRSsT = [8,12[ ALSFRSsLL = [8,12[ R = [8,12[ ALSFRS-R = >=36 )"
4,14,<0> 14 8 3 7 12 13 -1 <1> 14 10 11 7 12 13 -1 -2,"(0, ALSFRSb = [8,12[ ALSFRSsUL = [4,8[ ALSFRSsT = [4,8[ ALSFRSsLL = <4 R = [8,12[ ALSFRS-R = [24,36[ ) (1, ALSFRSb = [8,12[ ALSFRSsUL = <4 ALSFRSsT = <4 ALSFRSsLL = <4 R = [8,12[ ALSFRS-R = [24,36[ )"


In [103]:
def save_sequences_and_map(sequences_df, mapping_str, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(mapping_str)
        for seq in sequences_df['Sequence']:
            f.write(seq.strip() + '\n')

In [104]:
save_sequences_and_map(sequences, map, 'sequences.txt')

In [105]:
os.system("java -jar spmf.jar run Fournier08-Closed+time sequences.txt output.txt 50% 2 10 2 10")

0
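
The call above passes a support threshold and four numbers to SPMF. The sketch below names them, assuming the argument order given in the SPMF documentation for this algorithm: minimum support, minimum and maximum time interval between consecutive itemsets, then minimum and maximum time interval for the whole pattern. The function and parameter names are hypothetical. Using `subprocess.run` with `check=True` raises an error when Java fails, instead of returning a status code that is easy to overlook.

```python
import subprocess

def fournier08_command(input_path, output_path, minsup="50%",
                       min_gap=2, max_gap=10, min_span=2, max_span=10,
                       jar="spmf.jar"):
    """Build the argument list for SPMF's Fournier08-Closed+time algorithm."""
    # Assumed order: minsup, min/max gap between itemsets, min/max whole span
    return ["java", "-jar", jar, "run", "Fournier08-Closed+time",
            input_path, output_path, minsup,
            str(min_gap), str(max_gap), str(min_span), str(max_span)]

def run_fournier08(*args, **kwargs):
    """Run SPMF and raise CalledProcessError if Java exits with a non-zero status."""
    cmd = fournier08_command(*args, **kwargs)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

command_line = " ".join(fournier08_command("sequences.txt", "output.txt"))
print(command_line)
```

With the default arguments the printed command is identical to the string passed to `os.system` above.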

In [None]:
import re

def parse_pattern(line):
    """Return the itemsets of an SPMF output line as a list of sets (support is dropped)."""
    pat = line.split('#SUP:')[0]
    sets = re.findall(r'<\d+>\s+(.*?)\s+-1', pat)
    return [set(s.split()) for s in sets]

def is_subsequence(pat, seq):
    """Greedy check that the itemsets of `pat` occur, in order, inside `seq`.

    Both arguments are lists of sets of 'Feature=value' items.
    Timestamps are ignored: only the order of the itemsets is checked.
    """
    i = 0
    for pset in pat:
        # advance to the next time-point that contains every item of pset
        while i < len(seq) and not pset.issubset(seq[i]):
            i += 1
        if i == len(seq):
            return False
        i += 1
    return True


In [108]:
import re

def parse_sequence(raw):
    """
    raw is a single string like:
      "(0, A = 1 B = 2 ) (1, B = 2 ) (2, C = 3 )"
    This will return:
      [ {'A=1','B=2'}, {'B=2'}, {'C=3'} ]
    """
    out = []
    # find every "(t, …)" chunk
    for content in re.findall(r'\(\s*\d+,\s*(.*?)\s*\)', raw):
        toks = content.split()
        s = set()
        # every three tokens are feat, '=', val
        for i in range(0, len(toks), 3):
            feat, _, val = toks[i : i+3]
            s.add(f"{feat}={val}")
        out.append(s)
    return out

# match every mined pattern against every patient sequence
pattern_records = []
with open('output.txt','r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line.startswith('<'): continue

        pat_str, sup_str = line.split('#SUP:')
        enc_pattern = pat_str.strip()
        support     = int(sup_str.strip())

        # build decoded pattern: put spaces around the first '=' of every item
        sets = re.findall(r'<(\d+)>\s+(.*?)\s+-1', enc_pattern)
        decoded = " ".join(
            "({}, {} )".format(
                t, " ".join(" = ".join(tok.split('=', 1)) for tok in items.split())
            )
            for t, items in sets
        )
        # parse the encoded sets for subsequence
        pat_sets = [ set(s.split()) for _,s in sets ]

        hits = [
            seq_row.REF
            for _, seq_row in sequences.iterrows()
            if is_subsequence(pat_sets, parse_sequence(seq_row.Sequence_raw))
        ]

        pattern_records.append({
            'pattern':         enc_pattern,
            'decoded_pattern': decoded,
            'support':         support,
            'REFs':            hits
        })

df_matches = pd.DataFrame(pattern_records)
df_matches.head(50)

Unnamed: 0,pattern,decoded_pattern,support,REFs
0,<0> R=12 -1 <2> R=12 -1,"(0, R = 12 ) (2, R = 12 )",389,"[2, 8, 9, 10, 17, 18, 20, 21, 24, 30, 34, 35, 36, 39, 40, 42, 43, 45, 46, 49, 50, 54, 55, 56, 64, 66, 67, 72, 74, 78, 79, 82, 85, 88, 91, 93, 94, 100, 103, 104, 105, 111, 113, 115, 119, 122, 125, 126, 133, 136, 137, 141, 144, 145, 151, 153, 156, 161, 162, 164, 165, 166, 167, 169, 171, 173, 174, 176, 177, 178, 179, 180, 185, 196, 197, 200, 201, 202, 205, 207, 210, 211, 212, 213, 214, 215, 216, 219, 220, 227, 236, 238, 241, 242, 247, 250, 253, 256, 259, 261, ...]"
1,<0> R=12 ALSFRS-R=>=36 -1 <2> R=12 -1,"(0, R = 12 ALSFRS-R = >=36 ) (2, R = 12 )",364,"[2, 8, 9, 10, 17, 21, 24, 30, 34, 35, 39, 40, 42, 45, 46, 49, 50, 54, 55, 56, 64, 66, 67, 72, 74, 78, 79, 82, 85, 88, 91, 93, 94, 100, 103, 104, 105, 113, 115, 119, 122, 126, 136, 137, 141, 144, 145, 151, 153, 156, 161, 162, 164, 165, 166, 167, 169, 171, 173, 174, 176, 177, 178, 179, 180, 185, 196, 197, 200, 202, 205, 207, 210, 211, 212, 213, 214, 215, 220, 227, 236, 238, 241, 242, 247, 250, 253, 256, 259, 261, 262, 263, 265, 267, 269, 271, 273, 277, 278, 280, ...]"
2,<0> ALSFRS-R=>=36 -1 <2> R=12 -1,"(0, ALSFRS-R = >=36 ) (2, R = 12 )",388,"[2, 8, 9, 10, 17, 21, 24, 30, 34, 35, 39, 40, 42, 45, 46, 49, 50, 54, 55, 56, 64, 66, 67, 72, 74, 78, 79, 82, 85, 88, 91, 93, 94, 100, 103, 104, 105, 113, 115, 119, 122, 126, 134, 136, 137, 141, 144, 145, 151, 153, 156, 161, 162, 164, 165, 166, 167, 169, 171, 173, 174, 176, 177, 178, 179, 180, 185, 196, 197, 200, 202, 205, 207, 210, 211, 212, 213, 214, 215, 220, 227, 236, 238, 241, 242, 247, 250, 253, 256, 259, 261, 262, 263, 265, 267, 269, 271, 273, 277, 278, ...]"
3,"<0> ALSFRS-R=>=36 -1 <2> ALSFRSsT=[4,8[ -1","(0, ALSFRS-R = >=36 ) (2, ALSFRSsT = [4,8[ )",347,"[2, 9, 10, 24, 26, 30, 34, 35, 39, 40, 42, 45, 46, 49, 50, 54, 55, 56, 60, 61, 63, 64, 66, 67, 72, 74, 78, 82, 85, 88, 91, 94, 97, 100, 104, 105, 113, 115, 119, 122, 126, 134, 136, 137, 145, 151, 153, 155, 160, 162, 164, 165, 167, 169, 171, 173, 174, 177, 178, 180, 196, 197, 200, 202, 205, 207, 210, 211, 212, 214, 215, 217, 220, 222, 236, 238, 240, 241, 243, 247, 250, 256, 259, 261, 262, 263, 265, 267, 269, 271, 273, 277, 282, 285, 286, 291, 293, 305, 308, 313, ...]"
4,<0> ALSFRS-R=>=36 -1 <2> ALSFRS-R=>=36 -1,"(0, ALSFRS-R = >=36 ) (2, ALSFRS-R = >=36 )",360,"[2, 8, 9, 10, 17, 21, 24, 26, 30, 34, 35, 39, 40, 42, 45, 46, 49, 50, 54, 55, 56, 60, 61, 63, 64, 66, 72, 74, 78, 79, 82, 85, 88, 91, 93, 94, 99, 100, 103, 104, 113, 115, 119, 122, 126, 134, 136, 137, 141, 151, 155, 156, 160, 161, 162, 164, 166, 167, 169, 171, 173, 174, 176, 177, 179, 185, 196, 197, 200, 202, 203, 205, 207, 208, 210, 211, 212, 213, 214, 215, 217, 220, 227, 236, 238, 241, 242, 243, 244, 247, 250, 253, 256, 259, 261, 262, 263, 265, 267, 269, ...]"
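
The matching above relies on `is_subsequence`, which checks only the order of the itemsets and ignores the `<t>` timestamps. A time-aware check could look like the sketch below. It rests on an assumption that should be verified against the SPMF documentation: that pattern timestamps are offsets from the first itemset and that an occurrence must reproduce those offsets exactly. The function name `occurs_with_time` and the toy sequences are hypothetical.

```python
def occurs_with_time(pattern, sequence):
    """Both arguments are lists of (time, set_of_items), sorted by time."""
    by_time = {t: items for t, items in sequence}
    first_offset, first_items = pattern[0]
    for start, items in sequence:
        if not first_items <= items:
            continue
        # Shift every pattern offset so that the first itemset lands on `start`
        if all(p_items <= by_time.get(start + (t - first_offset), set())
               for t, p_items in pattern):
            return True
    return False

pattern = [(0, {"R=12"}), (2, {"R=12"})]
seq_a = [(0, {"R=12"}), (1, {"R=[8,12["}), (2, {"R=12", "ALSFRS-R=>=36"})]
seq_b = [(0, {"R=12"}), (1, {"R=12"})]  # no visit two time units after any R=12
result_a = occurs_with_time(pattern, seq_a)
result_b = occurs_with_time(pattern, seq_b)
print(result_a, result_b)
```

An order-only check would accept `seq_b` as well, so comparing the number of matched patients with the support reported by SPMF is a useful sanity check for whichever definition is chosen.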


### 1.2. Compute the Sequence Database

Note that items now have the form `Feature=value` and you should have a sequence database with as many sequences as patients. 

Each sequence encodes the several time-points (maximum 5) of each patient.

See the `Fournier08` example (https://www.philippe-fournier-viger.com/spmf/ClosedSequentialPatterns_TimeConstraints.php) to understand the format expected by the algorithm, especially the time information. 

Also recall the end of the PrefixSpan example (https://www.philippe-fournier-viger.com/spmf/PrefixSpan.php) to understand how to use strings instead of integers to encode items.
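
A quick way to check a sequence database written in this format is to decode one line back into `Feature=value` items using the `@ITEM` header. The sketch below assumes the layout used in this notebook: a `@CONVERTED_FROM_TEXT` line, one `@ITEM=<number>=<name>` line per item, and sequences made of `<t> items -1` blocks ending in `-2`. The helper name `decode_sequence` and the toy header are hypothetical.

```python
import re

def decode_sequence(line, header):
    """Translate an integer-encoded, time-extended line back to Feature=value items."""
    names = {}
    for h in header.splitlines():
        if h.startswith("@ITEM="):
            # "@ITEM=2=ALSFRS-R=>=36" -> number "2", name "ALSFRS-R=>=36"
            number, name = h[len("@ITEM="):].split("=", 1)
            names[number] = name
    decoded = []
    # Each itemset is "<t> item item ... -1"; the trailing -2 is ignored
    for t, items in re.findall(r"<(\d+)>\s+(.*?)\s+-1", line):
        decoded.append((int(t), [names[i] for i in items.split()]))
    return decoded

header = "@CONVERTED_FROM_TEXT\n@ITEM=1=R=12\n@ITEM=2=ALSFRS-R=>=36\n@ITEM=3=R=[8,12[\n"
line = "<0> 1 2 -1 <1> 3 2 -1 -2"
decoded = decode_sequence(line, header)
print(decoded)
```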

In [12]:
# Write code in cells like this
# ....

Write text in cells like this ...


### 1.3. Compute Sequential Patterns

Use `Fournier08` to compute the closed sequential patterns. Trivial patterns of length 1 should be discarded.

Note that later you will need to know in which sequences (patients) each pattern occurs (the algorithm can output that information).
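
Reading the SPMF output and discarding trivial patterns can be sketched as follows. The sketch takes "length 1" to mean a pattern with a single item overall, which is an assumption; adjust `is_trivial` if length is meant as the number of time-points. The function names are hypothetical, and the first sample line is made up for illustration, while the other two have the form of the patterns shown earlier in this notebook.

```python
import re

def parse_pattern_line(line):
    """Split '<t> items -1 ... #SUP: n' into ([(t, [items]), ...], n)."""
    pattern_part, support_part = line.split("#SUP:")
    itemsets = [(int(t), items.split())
                for t, items in re.findall(r"<(\d+)>\s+(.*?)\s+-1", pattern_part)]
    # Take the first token only, so any trailing fields are tolerated
    support = int(support_part.split()[0])
    return itemsets, support

def is_trivial(itemsets):
    """A pattern is treated as trivial when it contains a single item overall."""
    return sum(len(items) for _, items in itemsets) <= 1

lines = [
    "<0> R=12 -1 #SUP: 450",  # hypothetical single-item pattern
    "<0> R=12 -1 <2> R=12 -1 #SUP: 389",
    "<0> R=12 ALSFRS-R=>=36 -1 <2> R=12 -1 #SUP: 364",
]
kept = [p for p in map(parse_pattern_line, lines) if not is_trivial(p[0])]
for itemsets, support in kept:
    print(support, itemsets)
```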

In [16]:
# Write code in cells like this
# ....

Write text in cells like this ...


## 2.  Learn a classifier to predict NIV using temporal patterns as features together with the static features

In this task you should create a training set where the features are 1) the original static features (`Dataset_Static_Features.csv`) and 2) the sequential patterns computed above. The class labels to be used for each patient are in file `Dataset_NIV_Evolution_180.cvs`.

### 2.1. Load/Preprocess the Dataset

- Remember to delete from `Dataset_static_features.csv` the patients you deleted in step 1.1. You should only have one row per patient, thus remove repetitions.
- Remember to delete from `Dataset_NIV_Evolution_180.cvs` the patients you deleted in step 1.1.
- Note that for each patient the class label you need from `Dataset_NIV_Evolution_180.cvs` is the one corresponding to the last time-point you considered in step 1.1.
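
Selecting the label of the last time-point considered can be sketched as below, assuming the label file has one row per patient and time-point, in the same chronological order as the temporal data. The column name `Evolution`, the `Y`/`N` coding and the toy values are hypothetical; the real file may differ.

```python
import pandas as pd

# Hypothetical label table: one row per (patient, time-point), chronological order
niv = pd.DataFrame({
    "REF": [9] * 5 + [14] * 2 + [20] * 7,
    "Evolution": ["N", "N", "N", "N", "Y", "N", "N"] + ["N"] * 5 + ["Y", "Y"],
})

def last_considered_label(labels, max_points=5):
    """Label at the last time-point kept in step 1.1 (at most `max_points` per patient)."""
    kept = labels.groupby("REF").head(max_points)
    return kept.groupby("REF").tail(1).set_index("REF")["Evolution"]

y = last_considered_label(niv)
print(y)
```

Patient 20 has 7 time-points, so its label is taken from the fifth one, not from the last row of the file. The alignment holds only if exactly the same rows were kept in the temporal data.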

In [20]:
# Write code in cells like this
# ....

Write text in cells like this ...


### 2.2. Create the Training Set

See Section **3.4 Training Set Creation and Model Learning** of the paper Martins et al (2021). 

Note that in this project the original static features are used as they are, so you only need to compute the distance matrix for the sequential patterns (which are now features).

**Perform the experiments only for binary matrices**.
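
One possible reading of a binary matrix is a patient-by-pattern table holding 1 when the pattern occurs in the patient's sequence and 0 otherwise. The sketch below builds such a table from a pattern-to-patients mapping and joins it to the static features. The function name, the pattern names `P0`/`P1` and the static column `Age` are hypothetical; see the paper for the exact definition used by Martins et al.

```python
import pandas as pd

def binary_pattern_matrix(pattern_refs, all_refs):
    """Rows are patients, columns are patterns; 1 if the pattern occurs in the patient."""
    matrix = pd.DataFrame(0, index=pd.Index(all_refs, name="REF"),
                          columns=list(pattern_refs), dtype=int)
    for pattern, refs in pattern_refs.items():
        matrix.loc[matrix.index.isin(refs), pattern] = 1
    return matrix

# Hypothetical toy inputs: two patterns and three patients
pattern_refs = {"P0": [2, 9], "P1": [9]}
static = pd.DataFrame({"REF": [2, 8, 9], "Age": [61, 70, 55]}).set_index("REF")

# Static features are kept as they are; pattern columns are appended by REF
X = static.join(binary_pattern_matrix(pattern_refs, static.index))
print(X)
```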

In [24]:
# Write code in cells like this
# ....

Write text in cells like this ...


### 2.3. Learn the Model

See section **4.1 Model Evaluation** of the paper Martins et al (2021).

Use only a `Random-Forest` with default parameters and present the results for **5-fold cross-validation** (mean ± std). 

Note that the problem is difficult, so don't expect high performance.
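
A minimal sketch of the evaluation loop with scikit-learn is shown below. Several choices are assumptions rather than requirements of the assignment: the data is random and only stands in for the real training set, the two metrics are placeholders for those reported in the paper, the folds are stratified, and `random_state` is fixed purely for reproducibility (all other Random Forest parameters stay at their defaults).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the real training set (random data, illustration only)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 8))
y = rng.integers(0, 2, size=100)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(RandomForestClassifier(random_state=0), X, y,
                        cv=cv, scoring=["roc_auc", "accuracy"])

# Report mean and standard deviation over the 5 folds
for metric in ("roc_auc", "accuracy"):
    values = scores[f"test_{metric}"]
    print(f"{metric}: {values.mean():.3f} +- {values.std():.3f}")
```

Because the labels here are random, the scores should hover around chance level; the point is only the shape of the loop and of the report.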

In [27]:
# Write code in cells like this
# ....

Write text in cells like this 