## Summarize the contents of the event files in a BIDS dataset.

A first step in annotating a BIDS dataset is to find out what the dataset's
event files actually contain.
Event files sometimes have a few unexpected or incorrect codes,
so it is usually a good idea to review their contents
before starting the annotation process.

This notebook traverses the BIDS dataset and gathers the unique
values in each column, along with the number of times each value appears in the dataset.

The script creates dictionaries that map a `key` to the full path
for each type of file. The `key` is of the form `sub-xxx_run-y`, which
uniquely specifies each event file in the dataset. If a dataset contains
multiple sessions for each subject, the `key` should include additional
parts of the file name to uniquely specify each event file.
Keys are specified by an `entities` tuple that lists the BIDS entity names
to include in the key.
BIDS base file names are constructed of entity *name*-*value* pairs separated
by underscores and followed by an ending *_suffix*.

For a file name `sub-001_ses-3_task-target_run-01_events.tsv`,
the tuple `('sub', 'task')` gives a key of `sub-001_task-target`,
while the tuple `('sub', 'ses', 'run')` gives a key of `sub-001_ses-3_run-01`.
The use of dictionaries of file names with such keys makes it
easier to associate related files in the BIDS naming structure.
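
The key construction described above can be sketched in plain Python. The helper `make_key` is hypothetical and shown only for illustration; it is not part of HEDTools, whose own implementation may differ.

```python
import os


def make_key(file_path, entities):
    """Build a key such as 'sub-001_task-target' from a BIDS file name.

    Hypothetical helper for illustration; not part of HEDTools.
    """
    base = os.path.basename(file_path)
    # Drop the extension, then split into underscore-separated pieces.
    stem = base.split('.', 1)[0]
    pieces = stem.split('_')
    # Keep only name-value pieces; the trailing suffix (e.g. 'events') has no hyphen.
    pairs = dict(piece.split('-', 1) for piece in pieces if '-' in piece)
    # Assemble the key in the order given by entities, skipping absent entities.
    return '_'.join(f"{name}-{pairs[name]}" for name in entities if name in pairs)


file_name = 'sub-001_ses-3_task-target_run-01_events.tsv'
print(make_key(file_name, ('sub', 'task')))        # sub-001_task-target
print(make_key(file_name, ('sub', 'ses', 'run')))  # sub-001_ses-3_run-01
```

If two files in a dataset produce the same key, the chosen entities are not sufficient and another entity (for example `ses` or `run`) needs to be added to the tuple.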

The setup requires setting the following variables for your dataset:

| Variable | Purpose |
| -------- | ------- |
| bids_root_path | Full path to root directory of dataset.|
| exclude_dirs | List of directories to exclude when constructing file lists. |
| entities  | Tuple of entity names used to construct unique keys representing filenames. <br>(See [Dictionaries of filenames](https://hed-examples.readthedocs.io/en/latest/HedInPython.html#dictionaries-of-filenames-anchor) for examples of how to choose the keys.)|
| skip_columns  |  List of column names in the `events.tsv` files to skip in the analysis. |
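
To illustrate how the root path, the excluded directories, and the `_events` suffix work together when a file list is built, here is a minimal sketch using only the standard library. The function `find_event_files` is a hypothetical stand-in written for this example; the `get_file_list` function in HEDTools may behave differently.

```python
import os


def find_event_files(root, name_suffix="_events", extensions=(".tsv",), exclude_dirs=()):
    """Return sorted full paths of files under root with the given suffix and extension.

    Hypothetical stand-in for illustration; not the HEDTools implementation.
    """
    matches = []
    for dir_path, dir_names, file_names in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dir_names[:] = [d for d in dir_names if d not in exclude_dirs]
        for file_name in file_names:
            stem, ext = os.path.splitext(file_name)
            if ext in extensions and stem.endswith(name_suffix):
                matches.append(os.path.realpath(os.path.join(dir_path, file_name)))
    return sorted(matches)


# Example call on the current directory (the result depends on what is there).
event_files = find_event_files('.', exclude_dirs=['derivatives', 'models'])
print(f"Found {len(event_files)} event files")
```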

For large datasets, be sure to skip columns such as `onset` and `sample`.
Their values are nearly all distinct, and the summary counts the number of times
each unique value appears in the event files, so including them produces
very long and uninformative output.

The notebook creates a `TabularSummary` object to handle the summarization.
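
To make concrete what kind of counts such a summary holds, here is a minimal sketch using only the standard library. It is not the `TabularSummary` implementation: the function `count_column_values` and the in-memory sample data are hypothetical. Pairing each total count with the number of files that contain the value is an assumption based on how the printed summaries in this notebook appear to be laid out.

```python
import csv
import io
from collections import defaultdict


def count_column_values(tsv_texts, skip_columns=()):
    """Return {column: {value: [total_count, file_count]}} for tab-separated texts.

    Hypothetical helper for illustration; not the TabularSummary implementation.
    """
    summary = defaultdict(dict)
    for text in tsv_texts:
        seen = set()  # (column, value) pairs already counted for this file
        for row in csv.DictReader(io.StringIO(text), delimiter='\t'):
            for column, value in row.items():
                if column in skip_columns:
                    continue
                counts = summary[column].setdefault(value, [0, 0])
                counts[0] += 1  # total occurrences across all files
                if (column, value) not in seen:
                    seen.add((column, value))
                    counts[1] += 1  # number of files containing this value
    return dict(summary)


# Two tiny made-up event files (not taken from the AOMIC dataset).
file_a = "onset\ttrial_type\n0.5\tgo\n1.5\tgo\n2.5\tstop\n"
file_b = "onset\ttrial_type\n0.4\tgo\n"
print(count_column_values([file_a, file_b], skip_columns=['onset']))
# {'trial_type': {'go': [3, 2], 'stop': [1, 1]}}
```

Without `skip_columns`, each distinct `onset` value in this sketch would become its own entry, which is why columns with mostly unique values are best excluded.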

The example below uses a
[small version](https://github.com/hed-standard/hed-examples/tree/main/datasets/fmri_ds002790s_hed_aomic)
of an AOMIC dataset available on OpenNeuro as
[ds002790](https://openneuro.org/datasets/ds002790).
The small version is distributed as part of the hed-examples repository.

In [1]:
from hed.tools import TabularSummary, BidsTabularDictionary, get_file_list

bids_root_path = '../../../datasets/fmri_ds002790s_hed_aomic'
exclude_dirs = ['derivatives', 'models']
entities = ('sub', 'task')
skip_columns = ['onset', 'duration', 'response_time', 'stop_signal_delay']
tasks = ['emomatching', 'restingstate', 'stopsignal', 'workingmemory']
# tasks = ['stopsignal']
name = 'aomic-piop2'
# Construct the dictionary of BIDS event files, keyed by the chosen entities
event_files = get_file_list(bids_root_path, extensions=[".tsv"], name_suffix="_events",
                            exclude_dirs=exclude_dirs)
bids_dict = BidsTabularDictionary(name, event_files, entities=entities)

In [2]:
task_dicts, leftovers = bids_dict.split_by_entity('task')
print(f"Dataset tasks are [{str(task_dicts.keys())}]")
for task, task_dict in task_dicts.items():
    print(f"\nBIDS-style event files for task {task}")
    for filename in task_dict.file_dict.values():
        print(f"\t{filename}")

if leftovers:
    print(f"\nThese files did not have a task entity\n{leftovers}")


Dataset tasks are [dict_keys(['stopsignal', 'workingmemory', 'emomatching'])]

BIDS-style event files for task stopsignal
	H:\HEDExamples\hed-examples\datasets\fmri_ds002790s_hed_aomic\sub-0001\func\sub-0001_task-stopsignal_acq-seq_events.tsv
	H:\HEDExamples\hed-examples\datasets\fmri_ds002790s_hed_aomic\sub-0002\func\sub-0002_task-stopsignal_acq-seq_events.tsv

BIDS-style event files for task workingmemory
	H:\HEDExamples\hed-examples\datasets\fmri_ds002790s_hed_aomic\sub-0001\func\sub-0001_task-workingmemory_acq-seq_events.tsv
	H:\HEDExamples\hed-examples\datasets\fmri_ds002790s_hed_aomic\sub-0002\func\sub-0002_task-workingmemory_acq-seq_events.tsv

BIDS-style event files for task emomatching
	H:\HEDExamples\hed-examples\datasets\fmri_ds002790s_hed_aomic\sub-0002\func\sub-0002_task-emomatching_acq-seq_events.tsv


In [3]:
print(f"\nBIDS-style event file columns:")
for task, task_dict in task_dicts.items():
    print(f"\nTask {task} event file columns:")
    for key, file, rowcount, columns in task_dict.iter_files():
        print(f"{key} [{rowcount} events]: {str(columns)}")


BIDS-style event file columns:

Task stopsignal event file columns:
sub-0001_task-stopsignal [100 events]: ['onset', 'duration', 'trial_type', 'stop_signal_delay', 'response_time', 'response_accuracy', 'response_hand', 'sex']
sub-0002_task-stopsignal [100 events]: ['onset', 'duration', 'trial_type', 'stop_signal_delay', 'response_time', 'response_accuracy', 'response_hand', 'sex']

Task workingmemory event file columns:
sub-0001_task-workingmemory [40 events]: ['onset', 'duration', 'trial_type', 'response_accuracy', 'response_time', 'response_hand']
sub-0002_task-workingmemory [40 events]: ['onset', 'duration', 'trial_type', 'response_accuracy', 'response_time', 'response_hand']

Task emomatching event file columns:
sub-0002_task-emomatching [48 events]: ['onset', 'duration', 'trial_type', 'response_time', 'response_hand', 'response_accuracy', 'ori_match', 'sex', 'ethn_target', 'ethn_match', 'ethn_distractor', 'emo_match']


In [4]:
print('\nBIDS events summary counts:')
for task, task_dict in task_dicts.items():
    dicts_all, dicts_sep = TabularSummary.make_combined_dicts(task_dict.file_dict, skip_cols=skip_columns)
    print(f"\nBIDS-style event info for task {task}:\n{dicts_all}")


BIDS events summary counts:

BIDS-style event info for task stopsignal:
Summary for column dictionary :
   Categorical columns (4):
      response_accuracy (4 distinct values):
         correct: [144, 2]
         incorrect: [5, 1]
         miss: [1, 1]
         n/a: [50, 2]
      response_hand (2 distinct values):
         left: [81, 2]
         right: [119, 2]
      sex (2 distinct values):
         female: [97, 2]
         male: [103, 2]
      trial_type (3 distinct values):
         go: [139, 2]
         succesful_stop: [50, 2]
         unsuccesful_stop: [11, 1]
   Value columns (0):

BIDS-style event info for task workingmemory:
Summary for column dictionary :
   Categorical columns (3):
      response_accuracy (4 distinct values):
         correct: [30, 2]
         incorrect: [34, 2]
         miss: [1, 1]
         n/a: [15, 2]
      response_hand (2 distinct values):
         left: [31, 2]
         right: [49, 2]
      trial_type (3 distinct values):
         active_change: [32, 