# How to use `tsvtools`

In this tutorial, we rely on the wonderful ADNI data set, as every preprocessing step needed by ClinicaDL has already been performed on it. The goal is to differentiate men from women in the cognitively normal population using T1w MRI, and then infer the results on other 

BIDS data can be found at: `/network/lustre/dtlake01/aramis/datasets/adni/bids/BIDS`

Corresponding CAPS is at: `/network/lustre/dtlake01/aramis/datasets/adni/caps/caps_v2021`

## Find diagnosis labels

First, we will use the `getlabels` function of ClinicaDL to identify which participants are cognitively normal or demented.
For this we need clinical information stored in the BIDS, and already preprocessed by Clinica:
- summary TSV file merging all information of the BIDS (`clinica iotools merge-tsv`)
- missing imaging modalities (`clinica iotools check-missing-modalities`)

Fortunately, these two steps were already completed on ADNI, so we can directly apply `clinicadl tsvtool getlabels`.

```{note}
If you have other labels, you can skip this step and directly go to the next one!
```

In [7]:
!clinicadl tsvtool getlabels \
    "/Volumes/dtlake01.aramis/datasets/adni/bids/ADNI_BIDS_clean.tsv" \
    "/Volumes/dtlake01.aramis/datasets/adni/bids/missing_mods" \
    "../data/labels_list/"



One TSV file is created for each diagnosis label: CN (cognitively normal) and AD (Alzheimer's disease). The options used to create these files are stored in the JSON file `getlabels.json`.

In [9]:
!tree ../data/labels_list/

../data/labels_list/
├── AD.tsv
├── CN.tsv
└── getlabels.json

0 directories, 3 files
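
The generated files can also be inspected from Python. The sketch below counts rows and unique participants in each label file and loads the stored options. It assumes that each TSV file has a `participant_id` column, which is not shown in this tutorial, and the helper names (`count_participants`, `summarize_labels`, `read_options`) are hypothetical, not part of ClinicaDL.

```python
import csv
import json
from pathlib import Path


def count_participants(tsv_path, id_column="participant_id"):
    """Return (number of rows, number of unique participants) of a TSV file.

    `id_column` is an assumption: adapt it to the header of your files.
    """
    with open(tsv_path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    return len(rows), len({row[id_column] for row in rows})


def summarize_labels(labels_dir):
    """Map each label name (e.g. "AD", "CN") to its (rows, participants) counts."""
    labels_dir = Path(labels_dir)
    return {
        tsv.stem: count_participants(tsv)
        for tsv in sorted(labels_dir.glob("*.tsv"))
    }


def read_options(labels_dir, json_name="getlabels.json"):
    """Load the options stored next to the label files."""
    with open(Path(labels_dir) / json_name) as f:
        return json.load(f)
```

For example, `summarize_labels("../data/labels_list")` would return one entry per TSV file found in the directory.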


Then we can analyse our populations with the `analysis` tool:

In [10]:
!clinicadl tsvtool analysis \
    "/Volumes/dtlake01.aramis/datasets/adni/bids/ADNI_BIDS_clean.tsv" \
    "../data/labels_list" \
    "../data/analysis.tsv"



In [13]:
import pandas as pd
df = pd.read_csv("../data/analysis.tsv", sep="\t")
display(df)

| | diagnosis | n_subjects | mean_age | std_age | min_age | max_age | sexF | sexM | mean_MMSE | std_MMSE | min_MMSE | max_MMSE | CDR_0 | CDR_0.5 | CDR_1 | CDR_2 | CDR_3 | mean_scans | std_scans | n_scans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AD | 390.0 | 74.879487 | 7.796958 | 55.1 | 90.9 | 172.0 | 218.0 | 23.133333 | 2.140613 | 17.0 | 29.0 | 0.0 | 185.0 | 202.0 | 3.0 | 0.0 | 3.05641 | 1.279783 | 1192.0 |
| 1 | CN | 595.0 | 72.530976 | 6.352996 | 55.1 | 90.3 | 343.0 | 252.0 | 29.122689 | 1.087781 | 24.0 | 30.0 | 594.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.880672 | 3.075664 | 2309.0 |


To display the output more nicely, we implement a `display_table` function in this notebook:

In [14]:
def display_table(table_path):
    """Custom function to display the clinicadl tsvtool analysis output"""
    import pandas as pd
    from IPython.display import display

    # Load the analysis and index it by diagnosis label (AD, CN, ...)
    analysis_df = pd.read_csv(table_path, sep='\t')
    analysis_df.set_index("diagnosis", drop=True, inplace=True)
    columns = ["n_subjects", "n_scans",
               "mean_age", "std_age", "min_age", "max_age",
               "sexF", "sexM",
               "mean_MMSE", "std_MMSE", "min_MMSE", "max_MMSE",
               "CDR_0", "CDR_0.5", "CDR_1", "CDR_2", "CDR_3"]

    # Build one formatted row per diagnosis: the 17 raw columns are
    # condensed into 6 human-readable cells separated by "; "
    format_columns = ["subjects", "scans", "age", "sex", "MMSE", "CDR"]
    format_df = pd.DataFrame(index=analysis_df.index, columns=format_columns)
    for idx in analysis_df.index.values:
        row_str = "%i; %i; %.1f ± %.1f [%.1f, %.1f]; %iF / %iM; %.1f ± %.1f [%.1f, %.1f]; 0: %i, 0.5: %i, 1: %i, 2:%i, 3:%i" % tuple([analysis_df.loc[idx, col] for col in columns])
        row_list = row_str.split('; ')
        format_df.loc[idx] = row_list

    format_df.index.name = None
    display(format_df)

In [18]:
display_table("../data/analysis.tsv")

| | subjects | scans | age | sex | MMSE | CDR |
|---|---|---|---|---|---|---|
| AD | 390 | 1192 | 74.9 ± 7.8 [55.1, 90.9] | 172F / 218M | 23.1 ± 2.1 [17.0, 29.0] | 0: 0, 0.5: 185, 1: 202, 2:3, 3:0 |
| CN | 595 | 2309 | 72.5 ± 6.4 [55.1, 90.3] | 343F / 252M | 29.1 ± 1.1 [24.0, 30.0] | 0: 594, 0.5: 1, 1: 0, 2:0, 3:0 |
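
A quick way to sanity-check such a table is to verify that its counts are mutually consistent. The sketch below does this on the values reported above, copied by hand into a dictionary. The `check_row` helper is hypothetical. On another data set these checks could legitimately fail, for instance when sex or CDR is missing for some participants.

```python
# Values copied from the analysis output shown in this tutorial.
analysis = {
    "AD": {"n_subjects": 390, "n_scans": 1192, "sexF": 172, "sexM": 218,
           "CDR": [0, 185, 202, 3, 0], "mean_scans": 3.05641},
    "CN": {"n_subjects": 595, "n_scans": 2309, "sexF": 343, "sexM": 252,
           "CDR": [594, 1, 0, 0, 0], "mean_scans": 3.880672},
}


def check_row(row, tol=1e-5):
    """Return named consistency checks for one diagnosis row."""
    n = row["n_subjects"]
    return {
        # Every participant should be counted in exactly one sex category
        "sex_counts_sum_to_n": row["sexF"] + row["sexM"] == n,
        # Every participant should be counted in exactly one CDR category
        "cdr_counts_sum_to_n": sum(row["CDR"]) == n,
        # The mean number of scans should equal n_scans / n_subjects
        "mean_scans_matches": abs(row["n_scans"] / n - row["mean_scans"]) < tol,
    }


checks = {diagnosis: check_row(row) for diagnosis, row in analysis.items()}
for diagnosis, result in checks.items():
    print(diagnosis, result)
```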


## Create the test set

We put 100 participants in the test set with the `split` function of ClinicaDL.
This function ensures that there is no significant difference in the age and sex distributions between the train and test sets.

![split](../images/test_split.png)

In [20]:
!clinicadl tsvtool split ../data/labels_list --subset_name test --n_test 100

In [21]:
!tree ../data/labels_list

../data/labels_list
├── AD.tsv
├── CN.tsv
├── getlabels.json
├── split.json
├── test
│   ├── AD.tsv
│   ├── AD_baseline.tsv
│   ├── CN.tsv
│   └── CN_baseline.tsv
└── train
    ├── AD.tsv
    ├── AD_baseline.tsv
    ├── CN.tsv
    └── CN_baseline.tsv

2 directories, 12 files
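
To illustrate what "no significant difference in the age and sex distributions" means, the sketch below compares two sets with a chi-square test on sex and a permutation test on mean age. This is an illustration written with the standard library only. It is not ClinicaDL's implementation, and the tests and thresholds that ClinicaDL applies may differ. The function names are hypothetical.

```python
import math
import random


def chi2_sex_pvalue(train_sex, test_sex):
    """Chi-square test of independence on the 2x2 table (set x sex).

    Assumes both "F" and "M" occur at least once overall. With one degree
    of freedom, the p-value equals erfc(sqrt(chi2 / 2)).
    """
    table = [
        [train_sex.count("F"), train_sex.count("M")],
        [test_sex.count("F"), test_sex.count("M")],
    ]
    total = sum(map(sum, table))
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2))


def permutation_age_pvalue(train_age, test_age, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of mean ages."""
    rng = random.Random(seed)
    n_train = len(train_age)

    def mean_gap(values):
        first, second = values[:n_train], values[n_train:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    pooled = list(train_age) + list(test_age)
    observed = mean_gap(pooled)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean_gap(pooled) >= observed:
            extreme += 1
    # Add-one correction so that the estimated p-value is never exactly 0
    return (extreme + 1) / (n_permutations + 1)
```

A high p-value for both tests means that no difference between the two sets was detected. It does not prove that the distributions are identical.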


## Create the cross-validation

We use a 2-fold cross-validation to avoid spending too much time on training.
Sex is used as the stratification variable.

In [26]:
!clinicadl tsvtool kfold ../data/labels_list/train --n_splits 2 --stratification sex

Label CN
Split 0
Split 1
Label AD
Split 0
Split 1


In [27]:
!tree ../data/labels_list

../data/labels_list
├── AD.tsv
├── CN.tsv
├── getlabels.json
├── split.json
├── test
│   ├── AD.tsv
│   ├── AD_baseline.tsv
│   ├── CN.tsv
│   └── CN_baseline.tsv
└── train
    ├── AD.tsv
    ├── AD_baseline.tsv
    ├── CN.tsv
    ├── CN_baseline.tsv
    ├── kfold.json
    ├── train_splits-2
    │   ├── split-0
    │   │   ├── AD.tsv
    │   │   ├── AD_baseline.tsv
    │   │   ├── CN.tsv
    │   │   └── CN_baseline.tsv
    │   └── split-1
    │       ├── AD.tsv
    │       ├── AD_baseline.tsv
    │       ├── CN.tsv
    │       └── CN_baseline.tsv
    └── validation_splits-2
        ├── split-0
        │   ├── AD_baseline.tsv
        │   └── CN_baseline.tsv
        └── split-1
            ├── AD_baseline.tsv
            └── CN_baseline.tsv

8 directories, 25 files
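
Once the splits exist, it can be worth checking that no participant appears in both the training and the validation set of the same split. The sketch below walks the directory layout shown above. It assumes a `participant_id` column in the TSV files, which is not shown in this tutorial, and `find_leaks` is a hypothetical helper.

```python
import csv
from pathlib import Path


def read_ids(tsv_path, id_column="participant_id"):
    """Return the set of participant IDs of a TSV file.

    `id_column` is an assumption: adapt it to the header of your files.
    """
    with open(tsv_path, newline="") as f:
        return {row[id_column] for row in csv.DictReader(f, delimiter="\t")}


def find_leaks(train_dir, n_splits=2, labels=("AD", "CN")):
    """Return {(split, label): shared IDs} for each overlapping train/validation pair."""
    train_dir = Path(train_dir)
    leaks = {}
    for split in range(n_splits):
        for label in labels:
            name = f"{label}_baseline.tsv"
            train_ids = read_ids(
                train_dir / f"train_splits-{n_splits}" / f"split-{split}" / name
            )
            validation_ids = read_ids(
                train_dir / f"validation_splits-{n_splits}" / f"split-{split}" / name
            )
            shared = train_ids & validation_ids
            if shared:
                leaks[(split, label)] = shared
    return leaks
```

Calling `find_leaks("../data/labels_list/train")` should return an empty dictionary if the splits are disjoint.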
