## Summarize the contents of the event files in a dataset

A first step in annotating a dataset is to find out what its event files actually contain. Event files sometimes have a few unexpected or incorrect codes, so it is a good idea to check their contents before starting the annotation process.

This notebook traverses the dataset and gathers the unique values in each column, along with the number of times each value appears in the dataset.
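
To make the idea concrete, here is a minimal sketch of this kind of counting that uses only the Python standard library. It is not how `TabularSummary` is implemented. The function name `count_column_values` and the in-memory sample tables are hypothetical and exist only for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def count_column_values(tsv_streams):
    """Count how often each value appears in each column across several TSV tables."""
    counts = defaultdict(Counter)
    for stream in tsv_streams:
        reader = csv.DictReader(stream, delimiter="\t")
        for row in reader:
            for column, value in row.items():
                counts[column][value] += 1
    return counts


# Two tiny in-memory tables standing in for events.tsv files (made-up data)
file_a = "onset\tevent_type\n0.5\tshow_face\n1.2\tleft_press\n"
file_b = "onset\tevent_type\n0.7\tshow_face\n"

counts = count_column_values([io.StringIO(file_a), io.StringIO(file_b)])
for column, counter in counts.items():
    print(column, dict(counter))
```

In this toy input, `event_type` collapses to two distinct values, while every `onset` value is distinct.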

Set the following variables for your dataset:

| Variable | Purpose |
| -------- | ------- |
| `dataset_path` | Full path to the root directory of the dataset. |
| `output_path` | Path of the file in which to save the summary. If empty, the summary is printed instead. |
| `name` | Name used to label the summary. |
| `exclude_dirs` | List of directories to exclude when constructing the list of event files. |
| `skip_columns` | List of column names in the `events.tsv` files to skip in the summary. |
| `value_columns` | List of column names in the `events.tsv` files whose values are only counted, not listed individually. |

For large datasets, be sure to skip columns such as `onset` and `sample`. The summary counts the number of times each unique value appears across the event files, and these columns hold a different value in nearly every row.
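
If it is unclear which columns to skip, one possible heuristic is to flag columns in which almost every row has a different value. The sketch below is an assumption for illustration and is not part of the HED tools: the helper `nearly_unique_columns`, its `threshold` parameter, and the sample rows are all hypothetical.

```python
def nearly_unique_columns(rows, threshold=0.9):
    """Return names of columns whose share of distinct values is at least threshold."""
    if not rows:
        return []
    flagged = []
    for column in rows[0]:
        values = [row[column] for row in rows]
        if len(set(values)) / len(values) >= threshold:
            flagged.append(column)
    return flagged


# Made-up rows standing in for the contents of an events.tsv file
rows = [
    {"onset": "0.5", "sample": "125", "event_type": "show_face"},
    {"onset": "1.2", "sample": "300", "event_type": "show_face"},
    {"onset": "1.9", "sample": "475", "event_type": "left_press"},
    {"onset": "2.4", "sample": "600", "event_type": "show_face"},
]
print(nearly_unique_columns(rows))
```

Columns flagged this way are only candidates for `skip_columns`. A column with few rows can look nearly unique by chance, so the result is worth checking by eye.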

The notebook creates a `TabularSummary` object to handle the summarization.

The example below uses a
[small version](https://github.com/hed-standard/hed-examples/tree/main/datasets/eeg_ds003645s_hed)
of the Wakeman-Henson face-processing dataset, which is available on OpenNeuro as
[ds003645](https://openneuro.org/datasets/ds003645/versions/2.0.0).
The small version is distributed as part of the hed-examples repository.

In [1]:
import os
from hed.tools import TabularSummary, get_file_list

# Variables to set for the specific dataset
dataset_path = os.path.realpath('../../../datasets/eeg_ds003645s_hed')
output_path = ''
name = 'eeg_ds003645s_hed'
exclude_dirs = ['stimuli', 'code', 'derivatives', 'sourcedata', 'phenotype']
skip_columns = ["onset", "duration", "sample", "trial", "response_time"]
value_columns = ["stim_file"]

# Construct the list of BIDS event files
event_files = get_file_list(dataset_path, extensions=[".tsv"], name_suffix="_events", exclude_dirs=exclude_dirs)
print(f"Processing {len(event_files)} files...")

# Create a tabular summary object
tab_sum = TabularSummary(value_cols=value_columns, skip_cols=skip_columns, name=name)

# Update the tabular summary with the information from each event file
print("Updating the summaries")
for events in event_files:
    tab_sum.update(events)

# Save the summary to a file if output_path is set; otherwise print it
if output_path:
    with open(output_path, 'w') as fp:
        fp.write(str(tab_sum))
else:
    print(tab_sum)


Processing 6 files...
Updating the summaries
Summary for column dictionary eeg_ds003645s_hed:
   Categorical columns (5):
      event_type (8 distinct values):
         double_press: [1, 1]
         left_press: [83, 4]
         right_press: [168, 6]
         setup_right_sym: [6, 6]
         show_circle: [316, 6]
         show_cross: [310, 6]
         show_face: [310, 6]
         show_face_initial: [6, 6]
      face_type (4 distinct values):
         famous_face: [108, 6]
         nan: [884, 6]
         scrambled_face: [103, 6]
         unfamiliar_face: [105, 6]
      rep_lag (12 distinct values):
         1.0: [77, 6]
         10.0: [15, 6]
         11.0: [13, 5]
         12.0: [9, 5]
         13.0: [7, 6]
         14.0: [6, 4]
         15.0: [2, 2]
         6.0: [1, 1]
         7.0: [2, 2]
         8.0: [6, 4]
         9.0: [10, 6]
         nan: [1052, 6]
      rep_status (4 distinct values):
         delayed_repeat: [71, 6]
         first_show: [168, 6]
         immediate_repeat: [77