# Extract labels from BIDS metadata

This notebook explains how to extract diagnostic labels from the metadata of a BIDS dataset.

## Label definition

Five labels can be extracted depending on the diagnostic status of the participants during the follow-up:
- **CN**: sessions of subjects who were diagnosed as _cognitively normal_ throughout their follow-up;
- **AD**: sessions of subjects who were diagnosed as _demented_ throughout their follow-up;
- **MCI**: sessions of subjects who were diagnosed as _prodromal_ at baseline, who did not undergo multiple reversions and conversions, and who did not revert to cognitively normal status;
- **pMCI**: sessions of subjects who were diagnosed as _prodromal_ at baseline, and _progressed to dementia_ during the <font color='red'> 36 months (time horizon)</font> following the current visit;
- **sMCI**: sessions of subjects who were diagnosed as _prodromal_ at baseline, _remained stable_ during the <font color='red'> 36 months (time horizon)</font> following the current visit and _never progressed to dementia_.

<div class="alert alert-block alert-info">
<b>Time horizon:</b><p>
    In this notebook, we set the time horizon to 36 months.
    The time horizon is used to assess the stability of MCI patients and to sort them into the sMCI and pMCI classes.</p>
    <img src="./images/MCI_stability.png">
</div>
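
The way the time horizon sorts MCI sessions can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the code used by `clinicadl`: the helper name `mci_stability` and the month-to-diagnosis dictionary are hypothetical, and two rules are assumptions made for the sketch (the current visit must itself be MCI, and follow-up must reach the end of the time horizon for a session to count as stable).

```python
from typing import Dict, Optional


def mci_stability(diagnoses: Dict[int, str], current_month: int,
                  horizon: int = 36) -> Optional[str]:
    """Hypothetical helper: label one session of a subject who was MCI at baseline.

    `diagnoses` maps a visit month (0 = baseline) to the diagnosis at that visit.
    Returns "pMCI", "sMCI", or None when the session cannot be sorted.
    """
    # Assumption: only sessions at which the subject is still MCI are sorted.
    if diagnoses.get(current_month) != "MCI":
        return None

    window_end = current_month + horizon
    later = {m: d for m, d in diagnoses.items() if m > current_month}

    # Progression to dementia within the time horizon gives pMCI.
    if any(d == "AD" for m, d in later.items() if m <= window_end):
        return "pMCI"

    # Stable: no dementia at any later visit, and (assumption) at least one
    # visit at or beyond the end of the horizon to confirm stability.
    never_demented = all(d != "AD" for d in later.values())
    covered = any(m >= window_end for m in later)
    if never_demented and covered:
        return "sMCI"

    return None  # follow-up too short, or late conversion outside the horizon


progressive = {0: "MCI", 12: "MCI", 24: "AD"}
stable = {0: "MCI", 12: "MCI", 36: "MCI", 48: "MCI"}
print(mci_stability(progressive, 0))
print(mci_stability(stable, 0))
print(mci_stability(stable, 36))
```

With these rules, the last call cannot be sorted because only 12 months of follow-up remain after month 36, which is shorter than the time horizon.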


## Merge information with `clinica iotools`

BIDS information is spread across several TSV files, depending on whether it concerns a subject, a session or a scan.
To gather all the data from the TSV files of a BIDS dataset into a single file, use the command:

In [None]:
!clinica iotools merge-tsv <bids_directory> data/test_BIDS.tsv
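
Conceptually, gathering subject-level and session-level information amounts to a table join. The sketch below illustrates the idea with `pandas` on toy data; it is not how `clinica` implements `merge-tsv`. The values are made up, and the session table is assumed to already carry a `participant_id` column.

```python
import pandas as pd

# Toy stand-ins for BIDS TSV files (values are made up for illustration).
participants = pd.DataFrame({
    "participant_id": ["sub-01", "sub-02"],
    "sex": ["F", "M"],
})
sessions = pd.DataFrame({
    "participant_id": ["sub-01", "sub-01", "sub-02"],
    "session_id": ["ses-M00", "ses-M12", "ses-M00"],
    "diagnosis": ["CN", "CN", "AD"],
})

# One row per (participant, session); subject-level columns are repeated.
merged = sessions.merge(participants, on="participant_id", how="left")
print(merged)
```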

A second command is needed to identify which modalities are present for each subject. Its output makes it possible to restrict the list of sessions to those that include a T1-MRI.

This is the purpose of the following command:

In [None]:
!clinica iotools check-missing-modalities <bids_directory> data/test_missing_mods
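
Restricting the sessions to those with a T1-MRI then becomes a simple filter. The sketch below uses a toy table; the column name `t1w` and the 0/1 encoding are assumptions made for the example and should be checked against the files actually written to the output directory.

```python
import pandas as pd

# Hypothetical per-session table: 1 if the modality is present, 0 otherwise.
missing_mods = pd.DataFrame({
    "participant_id": ["sub-01", "sub-02", "sub-03"],
    "t1w": [1, 0, 1],
})

# Keep only the participants for which a T1-MRI is available at this session.
with_t1 = missing_mods.loc[missing_mods["t1w"] == 1, "participant_id"].tolist()
print(with_t1)
```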

## Extract labels with `clinicadl tsvtool` on OASIS

In this section we use the OASIS dataset, on which `clinica iotools merge-tsv` and `clinica iotools check-missing-modalities` have already been run.

The outputs of these two commands can be found at `data/OASIS_BIDS.tsv` and `data/OASIS_missing_mods`, respectively.

In [None]:
ls data

### Restrictions on OASIS dataset

In OASIS, the age distribution of CN participants differs from that of AD participants. This is a problem because a classifier trained on such a dataset may learn to separate age groups instead of learning the atrophy patterns caused by Alzheimer's disease.
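
One way to look for such a difference is to summarise age per diagnosis group. The sketch below uses made-up ages, not OASIS values, and the column names `diagnosis` and `age` are assumptions; the same `groupby` could be applied to the merged TSV once its actual column names are known.

```python
import pandas as pd

# Toy data for illustration only (these are not OASIS values).
toy = pd.DataFrame({
    "participant_id": ["sub-01", "sub-02", "sub-03", "sub-04", "sub-05"],
    "diagnosis": ["CN", "CN", "CN", "AD", "AD"],
    "age": [22.0, 45.0, 78.0, 70.0, 82.0],
})

# Count, minimum, mean and maximum age for each diagnosis group.
age_summary = toy.groupby("diagnosis")["age"].agg(["count", "min", "mean", "max"])
print(age_summary)
```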

To avoid this bias, CN participants younger than the youngest AD patient are removed. This restriction can be applied with the following command:

In [None]:
!clinicadl tsvtool restrict OASIS data/OASIS_BIDS.tsv data/OASIS_restricted_BIDS.tsv
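
The rule applied by this restriction can be sketched with `pandas`. This is a minimal illustration on made-up data, not the implementation of `clinicadl tsvtool restrict`; the column names `diagnosis` and `age` are assumptions.

```python
import pandas as pd

# Toy data for illustration only.
sessions = pd.DataFrame({
    "participant_id": ["sub-01", "sub-02", "sub-03", "sub-04", "sub-05"],
    "diagnosis": ["CN", "CN", "CN", "AD", "AD"],
    "age": [22.0, 45.0, 78.0, 70.0, 82.0],
})

# Age of the youngest AD patient.
min_ad_age = sessions.loc[sessions["diagnosis"] == "AD", "age"].min()

# Keep every non-CN row, and CN rows at least as old as the youngest AD patient.
keep = (sessions["diagnosis"] != "CN") | (sessions["age"] >= min_ad_age)
restricted = sessions[keep].reset_index(drop=True)
print(restricted)
```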

In [None]:
ls data

Some other sessions were also excluded because their preprocessing failed. The list of images kept after preprocessing is stored at `data/OASIS_qc_output.tsv`.

### Get the labels

The OASIS dataset only includes **AD** and **CN** labels. They can be extracted from the restricted dataset into a new folder `labels_lists` with the following command:

In [None]:
!clinicadl tsvtool getlabels data/OASIS_restricted_BIDS.tsv data/OASIS_missing_mods data/labels_lists --restriction_path data/OASIS_after_qc.tsv

In [None]:
ls data/labels_lists

For each diagnosis label, a separate file has been created, listing all the sessions that can be included in the classification task.
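
The effect of restricting the labels to the sessions that passed quality control can be pictured as an inner join on the participant and session identifiers. The sketch below is an assumption about what `--restriction_path` achieves, shown on made-up data, and not a description of the `clinicadl` code.

```python
import pandas as pd

# Toy label list and toy list of sessions kept after quality control.
labels = pd.DataFrame({
    "participant_id": ["sub-01", "sub-02", "sub-03"],
    "session_id": ["ses-M00", "ses-M00", "ses-M00"],
    "diagnosis": ["CN", "AD", "AD"],
})
qc_passed = pd.DataFrame({
    "participant_id": ["sub-01", "sub-03"],
    "session_id": ["ses-M00", "ses-M00"],
})

# Keep only the labelled sessions that also appear in the quality-control list.
kept = labels.merge(qc_passed, on=["participant_id", "session_id"], how="inner")
print(kept)
```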

### Demographic analysis

The age bias in OASIS is well known, which is why the youngest CN participants were excluded in a previous section. However, other biases may exist, in particular after the preprocessing step, which removed sessions from the dataset. It is therefore crucial to check that the dataset has no other biases before going further.

The following command computes summary statistics of the population for each diagnosis label. These statistics can be used to check that the dataset is suitable for the classification task.

In [None]:
!clinicadl tsvtool analysis data/OASIS_BIDS.tsv data/labels_lists data/OASIS_analysis.tsv

In [None]:
import pandas as pd

# Load the statistics written by `clinicadl tsvtool analysis`
OASIS_analysis_df = pd.read_csv('data/OASIS_analysis.tsv', sep='\t')

# Columns to display, in the order expected by the format string below
columns = ["diagnosis", "n_subjects",
           "mean_age", "std_age", "min_age", "max_age",
           "sexF", "sexM",
           "mean_MMSE", "std_MMSE", "min_MMSE", "max_MMSE",
           "CDR_0", "CDR_0.5", "CDR_1", "CDR_2", "CDR_3"]

# Print formatted table: one row per diagnosis label
print("label \t| N\t| age \t\t\t  | sex \t| MMSE\t\t\t  | CDR \t\t\t")
print("-" * 115)
row_format = ("%s \t| %i\t| %.1f ± %.1f [%.1f, %.1f] | %iF / %iM\t| "
              "%.1f ± %.1f [%.1f, %.1f] | 0: %i, 0.5: %i, 1: %i, 2: %i, 3: %i")
for idx in OASIS_analysis_df.index:
    print(row_format % tuple(OASIS_analysis_df.loc[idx, col] for col in columns))

There is no longer a significant age bias, but do you notice any other problems? What kind of procedure could be applied a posteriori to check that no bias was learnt?