## Summarize the contents of the event files in a BIDS dataset.

A first step in annotating a BIDS dataset is to find out what the dataset's
event files contain.
Event files sometimes have a few unexpected or incorrect codes,
so it is usually a good idea to check their actual contents
before starting the annotation process.

This notebook traverses the BIDS dataset and gathers the unique
values in each column, along with the number of times each value appears in the dataset.
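
The traversal step can be pictured as a directory walk that keeps only files ending in `_events.tsv` and skips unwanted directories. The sketch below uses only the Python standard library. The helper name `find_event_files` and its parameters are hypothetical, chosen for illustration, and it is not a description of how the HED tools gather files.

```python
import os

def find_event_files(root, exclude_dirs=(), name_suffix='_events', extension='.tsv'):
    """Return sorted full paths under root whose names end in name_suffix + extension.

    Hypothetical helper for illustration only; not part of the HED tools.
    """
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for filename in filenames:
            if filename.endswith(name_suffix + extension):
                matches.append(os.path.join(dirpath, filename))
    return sorted(matches)
```

Calling `find_event_files(root, exclude_dirs=['stimuli'])` on a dataset root would return the event files while ignoring anything under a `stimuli` directory.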

The script creates dictionaries that map a `key` to the full path
of each file of a given type. The `key` has the form `sub-xxx_run-y`, which
uniquely specifies each event file in the dataset. If a dataset contains
multiple sessions for each subject, the `key` should include additional
parts of the file name to uniquely specify each file.
Keys are specified by an `entities` tuple that lists the BIDS entity names
to include in the key.
BIDS base file names consist of entity *name*-*value* pairs separated
by underscores and followed by an ending *_suffix*.

For a file name `sub-001_ses-3_task-target_run-01_events.tsv`,
the tuple `('sub', 'task')` gives a key of `sub-001_task-target`,
while the tuple `('sub', 'ses', 'run')` gives a key of `sub-001_ses-3_run-01`.
The use of dictionaries of file names with such keys makes it
easier to associate related files in the BIDS naming structure.
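
The key construction described above can be sketched in a few lines of plain Python. The function `make_key` is a hypothetical helper written for this illustration. It is not the method `BidsTabularDictionary` uses internally, and it assumes that every entity in the file name is a `name-value` pair.

```python
import os

def make_key(file_path, entities):
    """Build a key such as 'sub-001_run-01' from a BIDS-style file name.

    Hypothetical helper for illustration only; not part of the HED tools.
    """
    base = os.path.basename(file_path)
    # Drop the extension, then split into underscore-separated parts.
    stem = base.split('.', 1)[0]
    parts = stem.split('_')
    # Keep only name-value pairs; the trailing suffix (e.g. 'events') has no hyphen.
    pairs = dict(part.split('-', 1) for part in parts if '-' in part)
    # Assemble the key in the order given by the entities tuple.
    return '_'.join(f"{name}-{pairs[name]}" for name in entities if name in pairs)

file_name = 'sub-001_ses-3_task-target_run-01_events.tsv'
print(make_key(file_name, ('sub', 'task')))        # sub-001_task-target
print(make_key(file_name, ('sub', 'ses', 'run')))  # sub-001_ses-3_run-01
```

Because two files with the same subject and run but different suffixes produce the same key, dictionaries built this way for different file types can be joined on the key.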

Set the following variables for your dataset:

| Variable | Purpose |
| -------- | ------- |
| bids_root_path | Full path to root directory of dataset.|
| exclude_dirs | List of directories to exclude when constructing file lists. |
| entities  | Tuple of entity names used to construct unique keys representing filenames. <br>(See [Dictionaries of filenames](https://hed-examples.readthedocs.io/en/latest/HedInPython.html#dictionaries-of-filenames-anchor) for examples of how to choose the keys.)|
| skip_columns  |  List of column names in the `events.tsv` files to skip in the analysis. |

For large datasets, be sure to exclude columns such as
`onset` and `sample`. The summary counts the number of times
each unique value appears in the event files, and these columns
have mostly distinct values in every row.
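
To make the counting concrete, the sketch below tallies the values in each column of some tab-separated text using only the standard library. It is a conceptual illustration, not the implementation of `TabularSummary`. The function name `count_column_values` is hypothetical and the sample rows are invented for the example.

```python
import csv
import io
from collections import Counter, defaultdict

def count_column_values(tsv_texts, skip_columns=()):
    """Count how often each value appears in each column across several TSV texts.

    Hypothetical helper for illustration only; not part of the HED tools.
    """
    counts = defaultdict(Counter)
    for text in tsv_texts:
        reader = csv.DictReader(io.StringIO(text), delimiter='\t')
        for row in reader:
            for column, value in row.items():
                # Skipped columns never enter the summary.
                if column not in skip_columns:
                    counts[column][value] += 1
    return counts

# Invented sample content standing in for two event files.
sample = ("onset\tevent_type\tface_type\n"
          "0.0\tshow_face\tfamous_face\n"
          "1.5\tshow_face\tunfamiliar_face\n"
          "3.0\tleft_press\tn/a\n")
summary = count_column_values([sample, sample], skip_columns=('onset',))
for column, counter in summary.items():
    print(column, dict(counter))
```

Without the skip list, a column like `onset` would add roughly one entry per event, so the summary would grow with the size of the dataset instead of staying compact.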

The notebook creates a `TabularSummary` object to handle the summarization.

The example below uses a
[small version](https://github.com/hed-standard/hed-examples/tree/main/datasets/eeg_ds003645s_hed)
of the Wakeman-Henson face-processing dataset, which is available on OpenNeuro as
[ds003645](https://openneuro.org/datasets/ds003645/versions/2.0.0).
The small version is distributed as part of the hed-examples repository.

In [1]:
import os
from hed.tools import TabularSummary, BidsTabularDictionary, get_file_list

# Variables to set for the specific dataset
bids_root_path = os.path.realpath('../../../datasets/eeg_ds003645s_hed')
name = 'eeg_ds003645s_hed'
exclude_dirs = ['stimuli']
entities = ('sub', 'run')
skip_columns = ["onset", "duration", "sample", "stim_file", "trial", "response_time"]

# Construct the file dictionary for the BIDS event files
event_files = get_file_list(bids_root_path, extensions=[".tsv"], name_suffix="_events", exclude_dirs=exclude_dirs)
file_dict = BidsTabularDictionary(name, event_files, entities=entities)

# Output the column names for each type of event file
print("\nBIDS style event file columns:")
for key, file, rowcount, columns in file_dict.iter_files():
    print(f"{key} [{rowcount} events]: {str(columns)}")

# Create a summary of the original BIDS events file content
bids_dicts_all, bids_dicts = TabularSummary.make_combined_dicts(file_dict, skip_cols=skip_columns)
print(f"\nSummary of all BIDS events files:\n{bids_dicts_all}")



BIDS style event file columns:
sub-002_run-1 [200 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']
sub-002_run-2 [200 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']
sub-002_run-3 [200 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']
sub-003_run-1 [200 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']
sub-003_run-2 [200 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']
sub-003_run-3 [200 events]: ['onset', 'duration', 'sample', 'event_type', 'face_type', 'rep_status', 'trial', 'rep_lag', 'value', 'stim_file']

Summary of all BIDS events files:
Summary for column dictionary :
   Categorical columns (5):
      event_typ