Discuss BEP021 derivatives for electrophys #5

robertoostenveld · 2021-02-15T10:15:20Z

This continues the issue started here bids-standard/bids-specification#733. This BEP021 is a better place to continue the discussion, as we can also use projects and share files here which would not fit under bids-specification directly.

robertoostenveld · 2021-02-15T10:16:25Z

On 12 February 2021 we had a Zoom call to discuss the progress on BEP021 for which the draft is on http://bids.neuroimaging.io/bep021 on google docs. Some of the people that attended were @arnodelorme, @dorahermes, @guiomar, @jasmainak, @agramfort, but do not know the github handle of all. Please help to attend the others with a github presence by mentioning them here.

robertoostenveld · 2021-02-15T10:16:42Z

@jasmainak mentioned

The example dataset:

bids-standard/bids-examples#171
bids-standard/bids-examples#161

I haven't looked too closely but I see that it was merged by Robert :) Perhaps a dataset to use for discussion?

robertoostenveld · 2021-02-15T10:16:59Z

The pull requests bids-standard/bids-examples#171 and bids-standard/bids-examples#161 refer to “derivatives”, but looking at the dataset at https://github.com/bids-standard/bids-examples/tree/master/eeg_face13 there is nothing that sets it apart from a regular raw BIDS EEG dataset.

An example dataset that is a proper derivatives is https://github.com/bids-standard/bids-examples/tree/master/ds000001-fmriprep. Note, however, that there are quite some files in that derived dataset that have to be ignored as they are not standardized (yet).

Just some general pointers that relate to some topics we discussed:

On https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#derived-dataset-and-pipeline-description it is specified that a derived dataset is also a BIDS dataset. The derived dataset has some required and some recommended extra elements in the dataset_description. When you prepare an example/draft derived ieeg/eeg/meg dataset, please keep these in mind.

Also, on https://bids-specification.readthedocs.io/en/stable/05-derivatives/01-introduction.html it states "Derivatives are outputs of common processing pipelines, capturing data and meta-data sufficient for a researcher to understand and (critically) reuse those outputs in subsequent processing.” That lines up with what Scott and I were saying that a (derived) dataset should be (re)usable in its own right.

robertoostenveld · 2021-02-15T10:17:10Z

ping @adam2392 @hoechenberger

robertoostenveld · 2021-02-15T10:18:10Z

@guimar wrote:

Hi! I copy here the email too :)

Thanks everyone for the meeting today!

We thought that for the next meeting some of us could present some practical examples so we can have a starting point to discuss.
Preferably we could share the datasets in advance so everyone that has time can go over them and focus the discussion in the more problematic things.

Here's the link to the common derivatives specs merged in 1.4.0:
https://bids-specification.readthedocs.io/en/stable/05-derivatives/01-introduction.html
There, "derivative" is defined as:

Derivatives are outputs of common processing pipelines, capturing data and meta-data sufficient for a researcher to understand and (critically) reuse those outputs in subsequent processing.

In line with what we have discussed.

Also note that there are some metadata fields to point to source data:
https://bids-specification.readthedocs.io/en/stable/05-derivatives/02-common-data-types.html

It's important to distinguish between two types of derivatives:

pre-processed (in essence they are similar to source data - datatype remains unchanged)
and processed (they are substantially different to source data - datatype changed)
So I think pre-processed derivatives can be easily accessible at this point (e.g. detrending, filtering, downsampling, re-referencing, etc).

In this logic I don't know where annotations may exactly fall. But they are also an important step (and I think that can be also interesting for other modalities and extensions).

Talk very soon!

robertoostenveld · 2021-02-15T10:18:36Z

and @sappelhoff wrote that

As far as I heard, @smakeig (Scott Makeig) @nucleuscub (Ramon Martinez-Cancino) also attended

arnodelorme · 2021-02-15T18:36:02Z

Thank you, Robert, for this summary. We will generate test datasets as discussed and then we should reconvene.

guiomar · 2021-12-01T13:03:00Z

Hi @sappelhoff @robertoostenveld @arnodelorme, @dorahermes, @jasmainak, @agramfort @hoechenberger @smakeig @nucleuscub @CPernet !

I want to retake the effort on ephys derivatives.

Having a new look at the document, I see there are 3 main blocks:

Annotations
Preprocessing (derivatives that doesn't change datatype)
Procesing (new datatype)

I would incline to divide the work into independent lines, to avoid getting stuck due to the big amount of work ahead.
And I would start by dealing with: 2) pre-processing.
We have almost all the details needed to takle this one. I'm preparing some examples.

Would you like to meet and move this part forward?
https://www.when2meet.com/?13797922-eLdhm

hoechenberger · 2021-12-01T13:25:19Z

And I would start by dealing with: 2) pre-processing.

Preprocessing sounds good to me, I'm not sure about the point in the parens though ("derivatives that doesn't change datatype"). To me, preprocessing also includes epoching the data, which will create new data types too. But I'm also happy to just limit the next discussion to continuous data only (Maxwell-filtering, frequency filtering, …)

I will share some datasets we process using our BIDS pipeline shortly so you all can see how we're currently approaching things.

Cheers,
Richard

cc @agramfort

guiomar · 2021-12-01T13:35:25Z

Thanks @hoechenberger !!

I think this definition comes from the generic derivatives specification, but I'm not able to find it anymore.
Still it makes sense to differenciate between the two at this first approach, until we agree on the data format to be used when the datatype from the raw source is changed, to avoid getting blocked by this decision.
The annotations part is also more prone to debate. So let's focus on the more objective and straight forward!

This sounds awesome Richard! Thanks!

adam2392 · 2021-12-01T15:43:58Z

I have read over the BEP021 derivatives and also added my availability, although I'll be in California at the time, so could possibly be hard to overlap.

I also utilized the "annotations" derivatives framework to create a dataset of event markings found in the iEEG via an automated algorithm. Specifically, these are "high-frequency oscillation" (HFO) onset/duration/channels stored as a tsv file. It works well for my use-case, but the tsv file does "explode" in length. Dataset is in dropbox link.

https://www.dropbox.com/sh/5ih5ay9fvo3q12s/AADBY5eDc_SmszHGyC3Mn6QJa?dl=0

jasmainak · 2021-12-01T16:33:14Z

I'm hammered that week. Unlikely to be able to join :(

I agree about the "explosion" issue that @adam2392 pointed out. I think I've seen it before. Probably it might help to limit the vocabulary of annotations so it is machine-readable not just machine-writeable.

guiomar · 2021-12-02T08:56:00Z

Hi @adam2392! Awesome! We can review the annotations with your example as well :)

guiomar · 2021-12-06T15:45:24Z

It seems the most preferred day to meet is Wed 15th Dec from 5 to 6pm CET.
I'll send you calendar invites with hangout link :)

hoechenberger · 2021-12-06T15:53:46Z

Thank you, @guiomar!

guiomar · 2021-12-06T16:48:21Z

For those who didn't receive the invitation and still wan to join, this is the link to hangouts:

bep021 - ephys derivatives
Wednesday, 15 December · 17:00 – 18:00
Google Meet joining info
Video call link: https://meet.google.com/ccc-uzuc-nyg
Or dial: ‪(US) +1 716-249-4224‬ PIN: ‪807 934 263‬#

guiomar · 2021-12-16T19:36:51Z

Thank you all who joined yesterday!
@hoechenberger @robertoostenveld @adam2392 @smakeig @tpatpa

I would like to sumarize here some of the main points we discussed in the meeting:

1) Annotations:

Use HED Tags: HED 0.0.1-score –> EEG-artifacts
https://www.hedtags.org/display_hed_library.html?fireglass_rsn=true#fireglass_params&tabid=af6ad83fcc95e6c2&start_with_session_counter=2&application_server_address=mc5.prod.fire.glass
Try to stick with events.tsv files, since they already support HED tags and have very similar purpose. Try some examples and suggent any new field that could be nice to include so it can also cover annotations properly.

2) Preprocessing steps:

When ever possible use the BIDS Derived dataset and Pipeline Description. Described in a general dataset_description.json file
https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#derived-dataset-and-pipeline-description
This may not include specific metadata for main parameters of toolboxes functions that may run more individually, and that could be of interest reporting for reproducibility purposes and easy metadata search

I have dedicated some time to reorganize the BEP021 documentation accordingly, since it was becoming very messy:
https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/

Please, add any other point I may have forgoten and you considered important :)

We planned to do another meeting in January to show some examples and continue further discussing the remaining issues.
If you are interested, you can mark your availabilities here:
https://www.when2meet.com/?13923648-HOHfD

Talk soon!

smakeig · 2021-12-18T22:36:10Z

Attached fyi is the paper in press in NeuroImage on HED event annotation in BIDS EEG (or MEG) data - a how-to and Guidelines expo. Scott

…

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

guiomar · 2021-12-20T11:07:59Z

Thnaks @smakeig! I can't see any attachment, maybe it's easier if you share the links?

smakeig · 2021-12-20T15:48:25Z

The paper in press is available at https://drive.google.com/file/d/1c8WaTSBgHgOmARX3g8e3LpD7izfDlG9l/view?usp=sharing Scott

…

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

guiomar · 2022-01-10T10:39:58Z

Hello @sappelhoff @robertoostenveld @arnodelorme, @dorahermes, @jasmainak, @agramfort @hoechenberger @smakeig @nucleuscub @CPernet @adam2392 @tpatpa!

We planned to do another meeting these days, if you are interested, you can mark your availabilities here:
https://www.when2meet.com/?13923648-HOHfD

guiomar · 2022-01-11T11:55:07Z

Thanks a lot! Let's make it Tuesday 25th January at 1:00pm CET
I'll send and invitation shortly

guiomar · 2022-01-11T11:59:05Z

Details for joining:

bep021 - ephys derivatives
Tuesday, 25 January · 13:00 – 15:00
Google Meet joining info
Video call link: https://meet.google.com/ccc-uzuc-nyg
Or dial: ‪(US) +1 716-249-4224‬ PIN: ‪807 934 263‬#

robertoostenveld · 2022-01-28T10:47:41Z

Let me repost the content of an email here that I already sent to the invitees of the most recent BEP021 meeting. I'll reformat it slightly. Having the discussion here rather than via email keeps it better in the open for everyone to follow and contribute.

Following the discussion of 25 Jan, in which we updated the BEP021 google doc to indicate what is in and out of scope, I have worked on some example pipelines and derivatives corresponding to sections 6.2, 6.3, 6.4, 6.5 and 6.6 in the google doc.

I started with doi:10.18112/openneuro.ds003645.v1.0.0, downloaded it (partially) and wrote a script to make a selection. I do not consider this part of the pipeline yet (although it could have been), so my starting point is “ds003645_selection” (selection code included).

Starting from “ds003645_selection", I ran the following pipelines

filtering
downsampling (without explicit filtering, i.e. startting from ds003645_selection)
rereferencing
filtering, followed by downsampling, followed by rereferencing (as three pipelines)
filtering and downsampling and rereferencing (as one preproc pipeline)

This results in 6 derivatives (indicated in bold, also below). The code for each is in the respective “code” directory. Note that there are more lines of code needed for data handling than for the actual pipeline (which I implemented with FieldTrip).

The resulting directory tree (only showing directories, not the files) looks like this (see below) and can be browsed on google drive (no login needed, view only).

ds003645_selection
├── code
├── derivatives
│   ├── downsampling
│   │   ├── code
│   │   ├── sub-002
│   │   │   └── eeg
│   │   ├── sub-003
│   │   │   └── eeg
│   │   └── sub-004
│   │       └── eeg
│   ├── filter_and_downsample_and_rereference
│   │   ├── code
│   │   ├── sub-002
│   │   │   └── eeg
│   │   ├── sub-003
│   │   │   └── eeg
│   │   └── sub-004
│   │       └── eeg
│   ├── filtering
│   │   ├── code
│   │   ├── derivatives
│   │   │   └── downsampling
│   │   │       ├── code
│   │   │       ├── derivatives
│   │   │       │   └── rereference
│   │   │       │       ├── code
│   │   │       │       ├── sub-002
│   │   │       │       │   └── eeg
│   │   │       │       ├── sub-003
│   │   │       │       │   └── eeg
│   │   │       │       └── sub-004
│   │   │       │           └── eeg
│   │   │       ├── sub-002
│   │   │       │   └── eeg
│   │   │       ├── sub-003
│   │   │       │   └── eeg
│   │   │       └── sub-004
│   │   │           └── eeg
│   │   ├── sub-002
│   │   │   └── eeg
│   │   ├── sub-003
│   │   │   └── eeg
│   │   └── sub-004
│   │       └── eeg
│   └── rereference
│       ├── code
│       ├── sub-002
│       │   └── eeg
│       ├── sub-003
│       │   └── eeg
│       └── sub-004
│           └── eeg
├── sub-002
│   └── eeg
├── sub-003
│   └── eeg
└── sub-004
    └── eeg

As discussed yesterday, your review of these examples can be used to get concretre ideas what needs to be done to extend the specification. Note that AFAIK I have now created derivative datasets that are compliant according to BIDS version 1.6.0. I did not run the validator (as it does not work yet on derivatives).

Some known issues at the moment I created the derivatives:

README and CHANGES are not updated
Some of the metadata fields have suboptimal values (e.g. because the derived datasets are only on my disk)
the eeg.json metadata is very minimal

robertoostenveld · 2022-01-28T10:48:31Z

@arnodelorme wrote in a reply to my email

I feel for reproducibility purposes, we should indicate the version of MATLAB (and Fieldtrip) in the script. Also, the script could be named with a standard name (like derivative.m) to avoid confusion if there are several scripts in the folder.

For the script:

Add the DOI of the parent BIDS datasets (as a comment). There can be several snapshots for a dataset so the name only is not enough. NOTE THIS MIGHT BE A MAJOR FLAW OF THE CURRENT DERIVATIVE FORMAT AS THE “SourceDatasets” field of “dataset_description.json” does not allow to uniquely identify the dataset.

Add the version of MATLAB as a comment

Add the version of Fieldtrip as a comment

This could be an optional recommendation for people, so we maximize reproducibility. Or we could imagine a specific JSON file for that in the code folder (derivative.json).

I would not cascade BIDS derivative repositories if there is no intention of assigning a DOI to each one. In that case, (downsampling plus filtering), then create a single dataset that is both downsampled and filtered. This could avoid the proliferation of datasets and confusion.

Alternatively, we could assign a DOI to a cascaded BIDS dataset (so parent folder plus derivative folder plus the derivative of the derivative folder). However, adding a derivative folder means that we have to create a new DOI for the whole cascaded BIDS dataset, which does not seem like a good idea.

Also, we should consider in the derivative specification folders of name derivative, derivative2, derivative3, etc.…

Note the proposed architecture is opposite the behavior of the cfg var in Fieldtrip, where parents BIDS are stored in substructure cfg.cfg (unless I am mistaken). In BIDS, parent BIDS datasets are stored in the parent folder. I personally prefer when parents are stored in a subfolder (more intuitive). Also, this would avoid the problem of having multiple “derivative” folders (a dataset usually has a single parent, but a parent can have multiple children). That's a broader discussion, of course.

robertoostenveld · 2022-01-28T11:35:22Z

Looking at the nested derivatives data structure that I created, I realize two aspects. These are more fundamental that the discussion on which specific metadata fields are needed (like matlab and fieldtrip version, to which I agree).

Do we want to store provenance (filter settings etc.) along with each individual derivatives data file (i.e. duplicated), or do we want to store it along with the information about the pipeline? Note that if we were to store it with each datafile in the eeg.json, we could make use of inheritance and deduplicate it at the top level.
Do we want to store the provenance of the previous steps along with every derivative? This touches upon the data.cfg.previous.previous.previous strategy used in FieldTrip.

Regarding 1: it makes sense (and is needed) if each file were processed differently. But along the processing pipeline we usually make raw data that might be inconsistent at the individual participant level more and more consistent. If the same pipeline is used on all datafiles, then documenting the pipeline metadata once should suffice. Note that documenting it along with the data is like data.cfg.previous.previous.previous, whereas documenting it with (or in) the code is more similar to having the script available (e.g. on github, in the code directory, or as documented in the GeneratedBy metadata field).

Regarding 2: Assuming that we would only store provenance of the last step, then in my example the metadata of ds003645_selection/derivatives/rereference could not be distinguished from the metadata of ds003645_selection/derivatives/filtering/derivatives/downsampling/derivatives/rereference, as both have the same step as the last. Although @arnodelorme mentions that my sequential cascade of pipelines is not optimal w.r.t. .data handling (and I agree), the principle is that pipelines can be cascaded. My cascade should therefor be considered as an example that should be supported if a researcher would want to do so.

My item 2 also relates to what Arno mentions w.r.t. pointing in the derivative to its source. He discusses it in relation to DOIs and hence published/persistent datasets, whereas I was in creating the examples not thinking about publishing and hence more thinking about referencing the local (on my hard disk or my labs network disk) relation between datasets. This also relates to PR bids-standard/bids-specification#820 and bids-standard/bids-specification#821.

I think we all have some implicit expectation about how people (including ourselves) work and when to start a new derivative, or when to "weave" additional files into an existing derivatives. In general my directory organization looks like this

projectId/raw       # might be a symbolic link to data on another volume
projectId/code      # ideally also on github for version control and collaboration 
projectId/results

and as long as I keep on working on the pipelines in the code directory, the intermediate and final files continue to go into the results directory (which has some internal directory structure similar to BIDS). Once the analysis is finalized, I would clean up the code and results (prune dead branches, rename files for improved consistency, possibly rerun it all once more to ensure consistency and reproducibility) and consider publishing the results+code directories together. Those would then comprise the BIDS derivative. An example is this with raw and derivatives data, plus a copy of the code on github (a static version of the code is also included in the derivative).

The example that I prepared is however at a larger collaborative (and more FAIR) scale, where Daniel and Rik prepared the initial multimodal data ds000117, which was then handled by Dung and Arno, resulting in ds003645, which was handled by me resulting in ds003645_selection, which was then handled by the "filtering guy" resulting in derivatives/filtering, which was then handled by the "downsampling guy", etc, etc. That leads me to the question: which information do we want to be present at which level (e.g. is the EEG amplifier brand still relevant after 5 stages of processing?), and which information do we expect re-users of a dataset to look up at with the ancestor dataset (for which those need to be accessible).

agramfort · 2022-02-02T20:53:25Z

I have not looked in details but browsing the google drive I see that the entity desc is used while I would have imagine proc is more natural. But it's not a strong feeling here. The derivatives > derivatives > nested structure is coherent but maybe hard to navigate. If you have 6 steps of processing (maxfilter, temporal filtering, artifact rejection, epochs, averaging ...) it's a lot and I would have considered putting all the produced files in the same folder of derivatives for a given subject. With the proposal here is what I say a valid option? thx @robertoostenveld for the coordination

robertoostenveld · 2022-02-02T21:57:30Z

From the spec (emphasis mine): "The proc label is analogous to rec for MR and denotes a variant of a file that was a result of particular processing performed on the device. This is useful for files produced in particular by Elekta’s MaxFilter (for example, sss, tsss, trans, quat or mc), which some installations impose to be run on raw data because of active shielding software corrections before the MEG data can actually be exploited."

The existing entities res/den/label/desc are specifically applicable to derivative data, and desc is used "[w]hen necessary to distinguish two files that do not otherwise have a distinguishing entity...", which I why I used it with the short description of the pipeline.

My example pipelines only produced a single result per raw input file; the sequential application served to have us think about what happens if we pass derivatives around between each other or on openneuro. If you have a pipeline that produces multiple results (which also makes sense), then those can be placed next to each other. I imagine that could result in

sub-002_task-FacePerception_run-1_desc-pipelinenameresult1_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinenameresult2_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinenameresult3_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinenameresult4_eeg.eeg + vhdr + vmrk

where the description (hence desc) pipelinenameresultN combines both the pipeline name and the specific result. However, you could also imagine

sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result1_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result2_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result3_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result4_eeg.eeg + vhdr + vmrk

where xxx codes another to-be-defined entity to split the pipeline from the result.

I don't think that we will benefit from long file names with specific entities for specific pipelines (such as filt-lp and filt-hp for different types of filters, or ref-avg and ref-common for different rereferencing schemes). The entity list is already very long and if we want to have a unique entity for each step, combined with prescribed values for the label, the specification would have to become very elaborate. I rather think that we would benefit from a simple desc-pipelinenameresultN where we would specify a very clear/explicit mapping of pipelinenameresultN to a metadata field inside the corresponding JSON. A bit similar to how we have the required json metadata field TaskName for functional data (which maps onto task-<TaskName> in the filename), combined with the recommended TaskDescription metadata field. It could then be DescName and DescDescription (which sounds silly) or so. However, I think the existing GeneratedBy also serves the purpose just fine.

An important difference between GeneratedBy and extending the filename of individual files (and/or the JSON metadata of individual files) is whether we document the pipeline once in dataset_description.json, or many times for each resulting file.

robertoostenveld · 2022-02-18T13:48:03Z

Yesterday we have a BIDS steering group meeting (@guiomar also attended) and it was mentioned there that BEP028 is making good progress. I will study that, you might want to look at that as well.

Furthermore, we also shortly touched upon the two tangential motivations for BIDS in making the results of analysis replicable (e.g. be able to recompute) and making raw or derived data reusable (for follow-up analyses). The first requires extensive details, the second can also be achieved with minimal metadata.

Also (as I was reminded in the meeting yesterday), the overarching BIDS strategy is to keep things as simple and small as possible, and we consider the 80/20 pareto principle.

SaraStephenson · 2022-03-09T18:52:30Z

Hello All,

With the help of @CPernet and @jesscall, @jadesjardins and I have prepared an example of the Face13 dataset with annotations stored in .tsv files and described in .json files. This current example is for discussion surrounding how to store continuous time annotations.

These files are located within the bids-examples/eeg_face13/derivatives/BIDS-Lossless-EEG/sub-*/eeg folders.

The annotations in this example were produced by the EEG-IP-L pipeline. There are several different types of annotations from this pipeline, including channel annotations, component annotations, binary time annotations and non-binary (continuous) time annotations.

The EEG-IP-L pipeline currently produces an annotations.json, annotations.tsv, and annotations.mat file. The .json describes all of the pipeline annotations. The .tsv contains the channel, component and binary time annotations. The .mat file contains the continuous time annotations. Since the .mat file is not a part of the BIDS specification, this current example has added a ‘recording-marks_annotation.tsv.gz’ and an accompanying 'recording-marks_annotation.json' for continuous time annotations. The 'recording-marks_annotation.tsv.gz' and the .json file were created based on the BIDS spec for storing physiological and other continuous recordings.

If we are to store continuous time annotations in a tsv file, one concern we have is the need for two annotations tsv files because the non-binary time annotations are stored differently than the binary time annotations, component annotations, and channel annotations. As all of these annotation types are important for the EEG-IP-L pipeline, we are looking forward to some suggestions around how they can best be stored in BIDS.

Thanks,
Sara Stephenson

robertoostenveld · 2022-03-10T08:30:08Z

Thanks @SaraStephenson!

To help others that want to look at it on their own computer: I just did this to get the changes (which are on a branch that is 160 commits behind and 1 commit ahead of HEAD)

cd  bids-examples
git checkout master
git pull origin master # where origin points to git@github.com:bids-standard/bids-examples.git
git checkout -b BUCANL-bep021_ephys_derivatives
git checkout 2105c6d23eddd657e2305fbb5181242c1b1b1545 # go back in time, to avoid conflicts elsewhere
git pull --ff-only git@github.com:BUCANL/bids-examples.git bep021_ephys_derivatives

CPernet · 2022-03-10T08:33:44Z

YES Robert! thx -- I'll also have a look as well
now of course raw or derivatives - even if pushed in bep21, we went for raw here

robertoostenveld · 2022-03-10T09:08:24Z

@SaraStephenson Let me comment on what I encounter while going through the data.

first in derivatives/BIDS-Lossless-EEG

dataset_description.json

even though it is now nested in a dataset, I would prefer SourceDatasets still to be specified with a DOI.
I cannot find version 0.1.0 Alpha, better would be to either tag that and/or make a github release, or use a git SHA as version

README and README.md are duplicates.

The LICENSE file applies to the code, but seems inappropriate to the data, i.e. it is not a data use agreement. The source eeg_face13 is ODbL, that also applies to derivatives (since share-alike). I recommend to add the license not only as a file, but also to dataset_description.json. Perhaps you want to move the LICENSE file to the code directory.

Rather than linking to https://jov.arvojournals.org/article.aspx?articleid=2121634 I recommend to link to https://doi.org/10.1167/13.5.22

The file eeg_face13/task-faceFO_events.json can be an empty object {} but not an empty list []. Better would be if it were to explain a bit about the events.tsv files.

The electrode files are identical for all subjects; that suggests that they are not measured but a template. It is not recommended to add template data to the individual subjects. If you want to add a single template to all subjects, better put it at the top level (i.e. following the inheritance principle).

The IntendedFor path is inconsistent between sub-001_task-faceFO_annotations.json and sub-001_task-faceFO_desc-qc_annotations.json (one has ./ in front, the other not).

It is not clear to me what the difference is between sub-001_task-faceFO_annotations.tsv and sub-001_task-faceFO_desc-qc_annotations.tsv.

I don't think that the SamplingFrequency field in sub-002_task-faceFO_annotations.json is needed. The corresponding TSV file is not expressed in samples, but in seconds.

I don't think that EDF is the optimal binary format for processed EEG data. EDF is limited to 16 bits, whereas the data was recorded with 24 bit (since Biosemi) and subsequently processed as single or even double precision. I recommend writing to the BrainVision format, that allows single precision floats to be represented. Or to EEGLAB .set.

The way you coded two things in the annotations.tsv files to me appear to be nearly orthogonal and not relating to each other: the first few rows (with chan_ and comp_) don't relate to onset and duration, and the latter rows don't relate to channels. Each row has a label, but the chan_xxx and comp_xxx labels appear to be very different from all others. Would it not be better to have those in two TSV files? Or possibly even three: desc-chan_annotations.tsv, desc-comp_annotations.tsv and desc-task_annotations.tsv file?

I am not sure (cannot check, since zero bytes) what is in the mat files.

There is a sub-001_task-faceFO_recording-marks_annotation.json file with a data dictionary, but no corresponding data. I would expect that to come with a TSV file (even when it would be empty, i.e. only with the first header row).

Not related to the derivative, but I noticed a typo in eeg_face13/sub-003/eeg/sub-003_task-faceFO_eeg.json: McMaster Univertisy rather than University.

Moving on to BIDS-Seg-Face13-EEGLAB:

Again duplicate README files.

Again PipelineDescription.Version being unfindable on github. Also here the dataset_description could contain more info. There is no license (should be ODbL, since derivatives from the original data should be share-alike).

I cannot review anything else at this level any more (since only binary files), which is not a problem per see.

arnodelorme · 2022-03-10T21:44:48Z

I think the discussion is fruitful. There is the issue of annotation and then there is the issue of derivatives data structures Let me address the issue of derivative. I think it is fine to generate the full derivative tree as long as it can be cleaned up. The alternative to having the full hierarchy is to have pipelines as described in Robert’s email of Feb 2, 2022 so I think this strategy covers both approaches. Three comments 1. Hierarchy. It is a detail, but I would prefer a hierarchy where the name of the derivate is appended to derivative folder. For example derivative-filtering, then subfolder, derivative-downsampling. I think it is closer to the current BIDS implementation (simply need to allow wildcard after “derivative”). Also simpler for user browsing (half the number of sub-folders to dig into). So instead of ds003645_selection/derivative/filtering/derivative/downsampling we would have ds003645_selection/derivative-filtering/derivative-downsampling. I am expecting Robert might have resistance to that (he always have a very good reason to do thing the way he does :-). Maybe we can vote? 2. Reproducibility. For the final derivative tree (to be published) - each branch should have a DOI and can reference the DOI of the parent instead of being embedded in it (so you can share the derivative folder directly without loosing tracking) - We need tools that can reiterate the tree from the raw data and the code in “code” folders and subfolders for quality control - Maybe in the code folder a JSON file, with field, software (i.e. Fieldtrip, EEGLAB, MNE), language (python, MATLAB), and dependencies which would contain a list as well (with name and version, for example for EEGLAB plugins, or other dependencies), and then a field with “script” that could contain the script name in the same folder to execute using the parent BIDS to obtain the current derivative { "software": { "name": "EEGLAB", "version": "2022.0", "url": "xxxx" }, "language": { "name": "MATLAB", "version": "2021b" }, "dependencies": [{ "name": "bids-mjatlab-tools", "version": "6.1" }, { "name": "Fieldtrip", "version": "2022_03_10" } ], "script": { "name": "my_pipeline.m" } } 3. New data file types. We need to define new EEG data files which can be reused (for group analysis etc…) in addition to the processed EEG. For example, Liedfield matrix, ERP/ERSP results, ICA, custom results, etc... Arno

…

On Mar 10, 2022, at 1:08 AM, Robert Oostenveld ***@***.***> wrote: @SaraStephenson Let me comment on what I encounter while going through the data. first in derivatives/BIDS-Lossless-EEG dataset_description.json • even though it is now nested in a dataset, I would prefer SourceDatasets still to be specified with a DOI. • I cannot find version 0.1.0 Alpha, better would be to either tag that and/or make a github release, or use a git SHA as version README and README.md are duplicates. The LICENSE file applies to the code, but seems inappropriate to the data, i.e. it is not a data use agreement. The source eeg_face13 is ODbL, that also applies to derivatives (since share-alike). I recommend to add the license not only as a file, but also to dataset_description.json. Perhaps you want to move the LICENSE file to the code directory. Rather than linking to https://jov.arvojournals.org/article.aspx?articleid=2121634 I recommend to link to https://doi.org/10.1167/13.5.22 The file eeg_face13/task-faceFO_events.json can be an empty object {} but not an empty list []. Better would be if it were to explain a bit about the events.tsv files. The electrode files are identical for all subjects; that suggests that they are not measured but a template. It is not recommended to add template data to the individual subjects. If you want to add a single template to all subjects, better put it at the top level (i.e. following the inheritance principle). The IntendedFor path is inconsistent between sub-001_task-faceFO_annotations.json and sub-001_task-faceFO_desc-qc_annotations.json (one has ./ in front, the other not). It is not clear to me what the difference is between sub-001_task-faceFO_annotations.tsv and sub-001_task-faceFO_desc-qc_annotations.tsv. I don't think that the SamplingFrequency field in sub-002_task-faceFO_annotations.json is needed. The corresponding TSV file is not expressed in samples, but in seconds. I don't think that EDF is the optimal binary format for processed EEG data. EDF is limited to 16 bits, whereas the data was recorded with 24 bit (since Biosemi) and subsequently processed as single or even double precision. I recommend writing to the BrainVision format, that allows single precision floats to be represented. Or to EEGLAB .set. The way you coded two things in the annotations.tsv files to me appear to be nearly orthogonal and not relating to each other: the first few rows (with chan_ and comp_) don't relate to onset and duration, and the latter rows don't relate to channels. Each row has a label, but the chan_xxx and comp_xxx labels appear to be very different from all others. Would it not be better to have those in two TSV files? Or possibly even three: desc-chan_annotations.tsv, desc-comp_annotations.tsv and desc-task_annotations.tsv file? I am not sure (cannot check, since zero bytes) what is in the mat files. There is a sub-001_task-faceFO_recording-marks_annotation.json file with a data dictionary, but no corresponding data. I would expect that to come with a TSV file (even when it would be empty, i.e. only with the first header row). Not related to the derivative, but I noticed a typo in eeg_face13/sub-003/eeg/sub-003_task-faceFO_eeg.json: McMaster Univertisy rather than University. Moving on to BIDS-Seg-Face13-EEGLAB: Again duplicate README files. Again PipelineDescription.Version being unfindable on github. Also here the dataset_description could contain more info. There is no license (should be ODbL, since derivatives from the original data should be share-alike). I cannot review anything else at this level any more (since only binary files), which is not a problem per see. — Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you were mentioned.

SaraStephenson · 2022-03-22T22:52:07Z

Thank you for all your comments about the Face13 example @robertoostenveld, I will look into making the appropriate corrections. I want to provide some clarifying information about annotations in the Face13 example dataset so that the discussion about how to best store the different types of annotations (component, channel, binary time and continuous time annotations) in BIDS can continue.

The sub-001_task-faceFO_annotations.tsv file contains the annotations that were produced by the EEG-IP-L pipeline. The sub-001_task-faceFO_desc-qc_annotations.tsv file contains annotations after the manual quality control (QC) procedure has been completed. During the QC procedure, the reviewer can modify some time and component annotations (particularly the ‘manual’ mark) based on visual inspection of the data.

The formatting of our current annotations.tsv files (that contains component, channel, and binary time annotations in one file) are based on a combination of Examples 2 and 3 in Section 5.1: Sidecar TSV Document in the BEP 021 google doc.

I have a few concerns about storing the chan, comp, and time annotations in separate files. One concern is that this will result in a large number of annotation files considering there would also be multiple versions of each file (one for the EEG-IP-L pipeline output and at least one for the QC’ed (desc-qc) data). Another concern is the naming of these annotation files. Currently we use desc-qc to indicate if the annotations are associated with QC’ed data, but would this complicate naming the different types of annotation files with desc-chan, desc-comp and desc-task?

The .mat file contains the continuous time annotations (such as the AMICA log likelihood). Since the .mat file is not a part of the BIDS specification, this current example has added a recording-marks_annotation.tsv.gz and an accompanying recording-marks_annotation.json for continuous time annotations. The recording-marks_annotation.tsv.gz and the .json file were created based on the BIDS spec for storing physiological and other continuous recordings. The recording-marks_annotation.tsv.gz in the Face13 example contains 100 rows for each of the annotations listed in the recording-marks_annotation.json. These new files were created because the continuous time annotations can not be stored in the same way the component, channel, and binary time annotations are currently stored.

Hopefully this example can help move the discussion on how to store annotations (particularly continuous time annotations) in BIDS forward.

Thanks,
Sara

smakeig · 2022-03-24T15:59:33Z

Sara - I wonder if it would be productive to call what you refer to as 'continuous time annotations' as, rather, 'continuous time data measures' - you give the example of AMICA model likelihoods; other measures could include RMS amplitude, "theta/beta ratio", etc. (any of which might be used in some data quality, cleaning, or evaluation pipeline). In other words, I 'd suggest treating the AMICA likelihood index as a derived data channel time sync'ed with the original data channels - reserving the term 'annotation' for text or numeric markers of facts pertaining either to the whole run (as with basic metadata) or to some portion of it (as with event annotations). Scott

…

On Tue, Mar 22, 2022 at 6:52 PM Sara Stephenson ***@***.***> wrote: Thank you for all your comments about the Face13 example @robertoostenveld <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_robertoostenveld&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=RwFiAbQuvIn9iB_IeUl19NBp7OtEj8oUwlSAyaawuOs&e=>, I will look into making the appropriate corrections. I want to provide some clarifying information about annotations in the Face13 example dataset so that the discussion about how to best store the different types of annotations (component, channel, binary time and continuous time annotations) in BIDS can continue. The sub-001_task-faceFO_annotations.tsv file contains the annotations that were produced by the EEG-IP-L pipeline <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.sciencedirect.com_science_article_pii_S0165027020303848&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=Z8Ogw_P1AhwacLF3_0aijaH89R_Kak1Rm3KGWU-QhVc&e=>. The sub-001_task-faceFO_desc-qc_annotations.tsv file contains annotations after the manual quality control (QC) procedure has been completed. During the QC procedure, the reviewer can modify some time and component annotations (particularly the ‘manual’ mark) based on visual inspection of the data. The formatting of our current annotations.tsv files (that contains component, channel, and binary time annotations in one file) are based on a combination of Examples 2 and 3 in Section 5.1: Sidecar TSV Document in the BEP 021 google doc <https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.google.com_document_d_1PmcVs7vg7Th-2DcGC-2DUrX8rAhKUHIzOI-2DuIOh69-5Fmvdlw_edit-23heading-3Dh.begtazq5lz86&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=3KYY2QUgq1Zd0hGMFGsUe3avKxWr4-ryJLpjKToA6_Q&e=> . I have a few concerns about storing the chan, comp, and time annotations in separate files. One concern is that this will result in a large number of annotation files considering there would also be multiple versions of each file (one for the EEG-IP-L pipeline output and at least one for the QC’ed (desc-qc) data). Another concern is the naming of these annotation files. Currently we use desc-qc to indicate if the annotations are associated with QC’ed data, but would this complicate naming the different types of annotation files with desc-chan, desc-comp and desc-task? The .mat file contains the continuous time annotations (such as the AMICA log likelihood). Since the .mat file is not a part of the BIDS specification, this current example has added a recording-marks_annotation.tsv.gz and an accompanying recording-marks_annotation.json for continuous time annotations. The recording-marks_annotation.tsv.gz and the .json file were created based on the BIDS spec for storing physiological and other continuous recordings <https://urldefense.proofpoint.com/v2/url?u=https-3A__bids-2Dspecification.readthedocs.io_en_stable_04-2Dmodality-2Dspecific-2Dfiles_06-2Dphysiological-2Dand-2Dother-2Dcontinuous-2Drecordings.html&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=DW3z-6NI2tAjm_R3O1YbI8UCbzBZvoqJf1t53ixqp-I&e=>. The recording-marks_annotation.tsv.gz in the Face13 example contains 100 rows for each of the annotations listed in the recording-marks_annotation.json. These new files were created because the continuous time annotations can not be stored in the same way the component, channel, and binary time annotations are currently stored. Hopefully this example can help move the discussion on how to store annotations (particularly continuous time annotations) in BIDS forward. Thanks, Sara — Reply to this email directly, view it on GitHub <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_bids-2Dstandard_bep021_issues_5-23issuecomment-2D1075728495&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=9wZxec_T6vXmANINfG5W1rPzwqMGqVARU9GbI5rt0MU&e=>, or unsubscribe <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AKN2SFWI5H7Z7SSSWGOM5MDVBJFKHANCNFSM4XUNFFSA&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=cC2TsLzL9KCPRV5UYYh9_xtdEiXYHV3qmLSg8Pf3kgwQciIi1h1TgiIXQz5W612D&s=dYo2eHv8Ui2wQ-n7YSY-F7g2cg0ms3z_m85ItIbam1M&e=> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

robertoostenveld · 2022-03-25T09:27:50Z

I'd suggest treating the AMICA likelihood index as a derived data channel time sync'ed with the original data channels

Continuous data that is time synched with other data is already part of the BIDS specification and documented here https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/06-physiological-and-other-continuous-recordings.html. A very similar approach (again with TSV files) is used for PET blood recording data.

arnodelorme · 2022-03-25T19:10:11Z

Thanks Robert, Do you know of an EEG dataset (joint EEG, eye-tracking, or accelerometer etc…) that you know of that uses this synchronization scheme? Cheers, Arno

…

On Mar 25, 2022, at 2:28 AM, Robert Oostenveld ***@***.***> wrote: I'd suggest treating the AMICA likelihood index as a derived data channel time sync'ed with the original data channels Continuous data that is time synched with other data is already part of the BIDS specification and documented here https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/06-physiological-and-other-continuous-recordings.html. A very similar approach (again with TSV files) is used for PET blood recording data. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.

jesscall · 2022-05-10T14:26:11Z

Hi BEP021 community, I'm looking to move forward on a few points from @SaraStephenson 's thread above:

The formatting of our current annotations.tsv files (that contains component, channel, and binary time annotations in one file) are based on a combination of Examples 2 and 3 in Section 5.1: Sidecar TSV Document in the BEP 021 google doc.

[...]
To store them in separate files, a few questions/concerns pop up:
a. This creates a large number of annotation files - multiplied by pipeline (EEG-IP-L) and QC outputs (desc-qc).
b. Naming all these files - Currently we find it Very useful to use desc-qc to indicate the annotations are associated with QC'ed data, but would this be possible with the convention of desc-chan, desc-comp and desc-task?

@robertoostenveld What are your thoughts on this? Should this be revisited?

We'd like to avoid unnecessary complexities in file naming, and Sara's example follows examples 2 and 3 of BEP021 5.1: Sidecar TSV.

...

Second,

I wonder if it would be productive to call what you refer to as 'continuous time annotations' as, rather, 'continuous time data measures' -

@smakeig thank you -- calling them "measures" rather than annotations can address this nicely, and works with @robertoostenveld and @CPernet's prior comments on storing in TSV files as continuous recordings. see spec here in previous comments.

@SaraStephenson, I think we'll try moving away from "annotations" - perhaps recording-marks_amica.tsv.gz could work?

dorahermes · 2022-05-10T17:35:16Z

I agree with differentiating continuous time measurements versus event based annotations with an onset and duration.

@tpatpa and I are making some updates to the examples for event based annotations with an onset and duration that use HED/SCORE tags.

smakeig · 2022-05-10T20:05:09Z

@SaraStephenson

<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_SaraStephenson&d=DwMFaQ&c=-35OiAkTchMrZOngvJPOeA&r=KEnFjcsfiKF_BPOsgvPP912y1yQ0q05CJ14uAvMNdNQ&m=B0p-o128i25Ju_i50YAxnA9QpGJt_qRFdoRVsVkT6piETXz2z95aHIaiobZG9UgD&s=86RU_LLZNnz0ZLS5ZIsPZWQQtmp4t-SUcTJCWGYmRHU&e=>, I think we'll try moving away from "annotations" - perhaps recording-marks_amica.tsv.gz could work? Sara - What is it you propose to label 'marks' ? To me, this term would most naturally suggest individual time markers (event onset markers, etc). -Scott

…

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

jesscall · 2022-05-18T18:41:43Z

Hi all!

@robertoostenveld @CPernet @arnodelorme
I'm hoping to move the discussion forward for EEGNet for the first point in this comment. We'd appreciate your thoughts on avoiding file splitting to this extent.

Thanks!

CC @SaraStephenson @christinerogers

PS: @smakeig we're happy to workshop the naming of those TSV files - you're right that marks still suggests an individual time marker.

arnodelorme · 2022-05-18T19:28:57Z

It is unclear to me how the annotation file differs from the event.tsv file.

CPernet · 2022-06-02T12:14:30Z

Have you tried @robertoostenveld proposal for the continuous measurement/annotation?

jesscall · 2022-06-02T13:54:34Z

Have you tried @robertoostenveld proposal for the continuous measurement/annotation?

@CPernet
Yes, @SaraStephenson has already committed the suggested changes here.

Can we go ahead and make the necessary adjustments to the BEP021 spec to reflect this suggestion? This paragraph in section 5.1 of the doc still proposes the alternative method of creating a synthetic data channel. Perhaps it can be updated with information on storing continuous recordings as described here?

robertoostenveld · 2022-06-02T19:18:12Z

It is unclear to me how the annotation file differs from the event.tsv file.

I don't think they technically differ (except for the name), but conceptually I see it like this

events is for the observable things that happened during acquisition independent of the EEG data (like stimuli, button presses) and that could also have happened and be observed if the EEG were not recorded at all
annotations is for things that are observed specifically in the EEG data and hence requires (re)viewing the data

For many things this works, although I realize that there are some situations where it might not be clear. But as long as it covers 80% of the use-cases, I think it is useful. No need to get 99.99% coverage.

jesscall · 2022-06-08T16:42:02Z

Hi all,

I made a suggestion in the BEP021 document in order to reflect the discussion in this thread (RE: continuous annotations / data measures). The suggestion is based on feedback from @CPernet and @robertoostenveld.

Please have a look and let me know if further amendments are needed.

smakeig · 2022-06-08T17:28:51Z

Robert and all - In HED-3G we have made a deliberate choice to include 'data feature' events as a primary event type (along with 'sensory presentation', 'agent action', etc.). The SCORE library schema for neurophysiologist annotation of events in clinical data annotates data feature events noted in events.tsv. As I understand it (correctly?) the EEGNet project proposed use of 'annotation.tsv' files is to apply flags to *each* EEG frame - therefore out of the event framework - which requires that an 'event' is something that unfolds over a definite time period in the experiment. Events are marked in events.tsv by 'event onset markers' - plus possibly other 'event phase markers' -- foremost, 'event offset markers', but also possibly markers of other critical points in the event process, for example 'max volume', 'motion course correction','seizure process shift', etc. I would not support the term 'annotations' for the proposed purpose of flagging, e.g., data frames for rejection, etc. A more specific term should be used - for example 'data_flags.tsv' or other ... Scott

…

On Thu, Jun 2, 2022 at 12:18 PM Robert Oostenveld ***@***.***> wrote: It is unclear to me how the annotation file differs from the event.tsv file. I don't think they technically differ (except for the name), but conceptually I see it like this - events is for the observable things that happened during acquisition independent of the EEG data (like stimuli, button presses) and that could also have happened and be observed if the EEG were not recorded at all - annotations is for things that are observed specifically in the EEG data and hence requires (re)viewing the data For many things this works, although I realize that there are some situations where it might not be clear. But as long as it covers 80% of the use-cases, I think it is useful. No need to get 99.99% coverage. — Reply to this email directly, view it on GitHub <https://urldefense.com/v3/__https://github.com/bids-standard/bep021/issues/5*issuecomment-1145225333__;Iw!!Mih3wA!GlxM8tl6qtOmomxPCAriXw-Nqrxbw5kDjmKLK4okV9rtldg4AnZa9B4bZXoVFUEHl6kIFdj2E_P6ebJOP-8BBIsY$>, or unsubscribe <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AKN2SFXYOMSQQI3PDKG5KD3VNECH7ANCNFSM4XUNFFSA__;!!Mih3wA!GlxM8tl6qtOmomxPCAriXw-Nqrxbw5kDjmKLK4okV9rtldg4AnZa9B4bZXoVFUEHl6kIFdj2E_P6ebJOP0qpMMVZ$> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

Andesha · 2022-06-16T18:20:50Z

Hello community!

It's been awhile since I have been involved in these discussions, but I am once again contributing to some BIDS EEG efforts on behalf of EEGNet.

I've taken some time to read over this thread and want to state here as part of my post what I believe to be the current understanding of annotations so that I can update Face13 (and its various example repositories) as necessary once again.

If there's anything wrong with my broad statements of the current efforts, please let me know and hopefully it can restart some discussion here.

@robertoostenveld's definition here of events being acquisition time versus annotations being specific data properties after recording is what sets the two apart.
The AMICA log likelihood is the only "continuous annotation" being put forward by Face13, and isn't used for segmentation, and rarely for decision making during review. As such, it is being handled similar to eye tracking data and kept an external zipped tsv.
- Open to discussion from others if the names of these files need any review.
Annotations are currently (in Face13) being kept all in one file so as to not overpopulate the derivative folders. In the case of the Lossless Pipeline, a dozen or more could be created. I believe having one larger file as opposed to numerous small ones benefits the standard best.
@smakeig raised the point of if these are just additional "flags", why are we using the term "annotations". I have personally inherited the term from working with @jadesjardins. If what has previously been discussed as "continuous annotations" become standardized to be stored much like eye tracking data, I feel like this may be a positive change to broaden understanding of what their use is at a first glance.
- Changing this may require broader community discussion, the name may have stuck in some places...

I'll be coordinating offline with @SaraStephenson and @jadesjardins as necessary for changes to Face13, so no worries. I think I own a few of the commits too...

CCing @jesscall and @christinerogers as well.

Thanks!

CPernet · 2022-06-19T12:56:37Z

we can raise those points at OHBM2022

@jesscall please make the suggestion on the google doc
@.all I think annotations as in continuous annotation via tsv.gz works great
any other 'events' that work with HED should be marked that way not creating new data type

christinerogers · 2022-07-15T21:28:00Z

Some updates to move forward -
First, our team (thanks @jesscall) added a suggestion to the BEP021 spec here - Thanks for integrating or responding.

Second, to get unstuck on annotation nomenclature, @sappelhoff could you confirm / advise on best BIDS practice /other specs? (We've looked at cardio but would appreciate more eyes on this, as previously discussed.)

Non-continuous annotations could be renamed flags per @smakeig's suggestion (above)
Continuous annotations could be retained as "annotations"

Thanks - and cc Tyler @Andesha Collins, @SaraStephenson

@CPernet to your last bullet above - I can shoot you the latest on events/HED creation next week.

smakeig · 2022-07-16T03:15:19Z

Christine - Before studying the attachment (it's bedtime...), I don't agree with appropriating the term 'annotation' for such a specialized part of HED/BIDS annotation per se. Flags or markers can easily be understood as pointing to single time points on the exp timeline. But continuous measures for each time point, however derived from the data, are derived data channels - here might be called 'indicator channels' as they hold statistical indicators of data quality or etc. Scott

…

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

dorahermes · 2022-08-04T22:06:15Z

Another important point for discussion here is that it will be necessary to have a recommendation for how to indicate a list in a .tsv file, here for the channels. This is not specified in BIDS now. Similar point raised in BEP021 spec

Would there be any issue with a simple comma separated list? (e.g. O1, Oz, O2)

CPernet · 2022-08-05T04:37:18Z

'Would there be any issue with a simple comma separated list?' just that BIDS has stuck with tsv :-;
now that I am back in a country that uses commas in numbers I appreciate tsv more ...

sappelhoff · 2022-08-05T07:01:01Z

Can somebody briefly summarize what the issue is with the two terms "annotations" versus "flags", please? From my perspective, we are talking about data and event annotations.

Re: "continuous annotations", I would (as many others already have) advise to follow the specification on physio and stim data. That entails:

data is saved in a .tsv.gz file without headers
headers and additional metadata are desribed in an accompanying .json file
different kinds of continuous annotations may be distinguished via entities such as recording
This last point leads me to the "problem": the terminology ("physio" and "stim" suffixes, "recording" entity) is not very much in line with annotations. We may consider:
- adding a new suffix to "physio"+"stim", which however would follow the same rules
- just reusing "physio" or "stim" for our purposes at the expense of the term being confusing
- same deliberations go for the "recording" entity

Alternative: Not sure if Scott proposed this above, but we could also specify that continuous annotations must be formatted as a data channel and included in the (derivative) neural data file (BrainVision, EDF, ...). In that case all we might have to do is to come up with a good way of how to specify such "channel types" (e.g., see table in this section)

I find this alternative very attractive, because it makes the "name finding" issue easier (see "problem" / 3 "considerations" above)

Re: specifying which particular "channels" a given event pertains to --> there is a discussion thread in the Gdoc draft: https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/edit?disco=AAAAc8E_W-g

Incorporating comments in that thread, my suggestion is to:

have two columns: "channels" and "electrodes", because events may pertain to either channels or electrodes
all events that happen may pertain to either:
- a channel in the data and/or
- an electrode used in the recording (part of the metadata / data)
- neither
For example if an event pertains to a bipolar channel "C3-F3", then that channel must be included in the data
the format of specifying channels and electrodes in their respective TSV columns is "pythonic", that is, we have a list of comma-separated, double-quoted strings, see the example below. I formulate it with "channels" in mind, but suggest the same principles for electrodes:

# two channels are affected
["Cz", "Fpz"]

# no channel is affected 
[]

# the notion of channel is not applicable for this data point
"n/a"

# all channels are affects
"all"

# a channel with the name "all" is affected
["all"]

# a channel that contains double quotes in its name is affected
# note the backslash as escape character
["my_weird\"channel_name", "FCz", "C3"]

# channel names MUST NOT contain "tab characters"
# this would break the TSV format / make it uncomfortable

☝️ depending on the event to be annotated, the above format may be used to indicate channels and/or electrodes in their respective columns

Finally, re: non-continuous annotations --> I am fine with both: including them in the standard *_events.tsv or a separate *_desc-annotations_events.tsv. I think it'd be a good idea to not invent a new datatype / suffix for this (we may already need to do that for continuous annotations, see above)

Andesha · 2022-08-08T15:08:02Z

Thanks for the input @sappelhoff - I'm going to respond in order to your sections...

For my lab, "annotations" vs "flags" is mostly just terminology. We used annotations as a term to encompass all post processed derivative data properties like bridged channels (computationally discovered), rejected ICs, or ICA breakdown quality. Flags to me implies things that are just boolean. I believe on this point we're simply just discussing phrasing.

I believe at this point we can lock in the idea of the continuous annotations being in the external files. There appears to be significant agreement. I would however suggest the adding of a new term as it will likely age better in the long term and may help inform inform the process.

As above, we are likely past packaging things into the data as either "status" channels or other. I'm fine with the external personally.

I am in favour of the idea of adding this information of channel/electrode as an optional column. It would allow our process to fairly easily encode whatever "annotation" (or your chosen term) fairly simply, as long as there's no notion of a "simplest form" requirement. I'm not sure, but how would you envision this extending to ICs @sappelhoff ?

Lastly, I agree. Inventing a new datatype/suffix is not strictly needed.

Some extra thoughts as I was responding:

How does this relate to the provenance (lifecycle?) of a file within BIDS. Does it work naturally or is there things we should be considering from BEP028? Consider the case of a pipeline doing processing, going back and marking up events, adding reaction time info, marking study things, etc... thoughts @christinerogers ?

Thanks!

smakeig · 2022-08-08T15:16:25Z

It is best not to try to give a more generally applied term in widespread/longstanding use ('annotation') in a quite more limited meaning in the same application (e.g., BIDS). The continuous measures you introduce in LP are in a broader sense like any other continuous data measure (say, RMS, likelihood of x, etc.). I suggest the term in BIDS should be chosen to apply equally to all these that users may want to include in the dataset in an external file. Perhaps 'derived data'? Scott

…

On Mon, Aug 8, 2022 at 11:08 AM Tyler Collins ***@***.***> wrote: Thanks for the input @sappelhoff <https://urldefense.com/v3/__https://github.com/sappelhoff__;!!Mih3wA!CXjZZ_eofzmLug6NdUwXhUi0ggwEDEYc1NYVItTR7esBTOzvVGZbcx13Bi8RyWR5iTv2tNzaEbwCrTGPUGEz9MQI$> - I'm going to respond in order to your sections... For my lab, "annotations" vs "flags" is mostly just terminology. We used annotations as a term to encompass all post processed derivative data properties like bridged channels (computationally discovered), rejected ICs, or ICA breakdown quality. Flags to me implies things that are just boolean. I believe on this point we're simply just discussing phrasing. I believe at this point we can lock in the idea of the continuous annotations being in the external files. There appears to be significant agreement. I would however suggest the adding of a new term as it will likely age better in the long term and may help inform inform the process. As above, we are likely past packaging things into the data as either "status" channels or other. I'm fine with the external personally. I am in favour of the idea of adding this information of channel/electrode as an optional column. It would allow our process to fairly easily encode whatever "annotation" (or your chosen term) fairly simply, as long as there's no notion of a "simplest form" requirement. I'm not sure, but how would you envision this extending to ICs @sappelhoff <https://urldefense.com/v3/__https://github.com/sappelhoff__;!!Mih3wA!CXjZZ_eofzmLug6NdUwXhUi0ggwEDEYc1NYVItTR7esBTOzvVGZbcx13Bi8RyWR5iTv2tNzaEbwCrTGPUGEz9MQI$> ? Lastly, I agree. Inventing a new datatype/suffix is not strictly needed. Some extra thoughts as I was responding: How does this relate to the provenance (lifecycle?) of a file within BIDS. Does it work naturally or is there things we should be considering from BEP028? Consider the case of a pipeline doing processing, going back and marking up events, adding reaction time info, marking study things, etc... thoughts @christinerogers <https://urldefense.com/v3/__https://github.com/christinerogers__;!!Mih3wA!CXjZZ_eofzmLug6NdUwXhUi0ggwEDEYc1NYVItTR7esBTOzvVGZbcx13Bi8RyWR5iTv2tNzaEbwCrTGPUNceXHV5$> ? Thanks! — Reply to this email directly, view it on GitHub <https://urldefense.com/v3/__https://github.com/bids-standard/bep021/issues/5*issuecomment-1208252806__;Iw!!Mih3wA!CXjZZ_eofzmLug6NdUwXhUi0ggwEDEYc1NYVItTR7esBTOzvVGZbcx13Bi8RyWR5iTv2tNzaEbwCrTGPUOfRmH-2$>, or unsubscribe <https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AKN2SFWGSSNZW72UBXVSI3TVYEPF5ANCNFSM4XUNFFSA__;!!Mih3wA!CXjZZ_eofzmLug6NdUwXhUi0ggwEDEYc1NYVItTR7esBTOzvVGZbcx13Bi8RyWR5iTv2tNzaEbwCrTGPUKz1AVxO$> . You are receiving this because you were mentioned.Message ID: ***@***.***>

-- Scott Makeig, Research Scientist and Director, Swartz Center for Computational Neuroscience, Institute for Neural Computation, University of California San Diego, La Jolla CA 92093-0559, http://sccn.ucsd.edu/~scott

robertoostenveld mentioned this issue Feb 15, 2021

Discuss BEP021 derivatives for electrophys bids-standard/bids-specification#733

Closed

adam2392 mentioned this issue Dec 11, 2021

Writing events/annotations to an existing BIDS dataset mne-tools/mne-bids#928

Open

robertoostenveld mentioned this issue Jul 1, 2022

Examples on how to read and process BIDS data with fieldtrip for the website fieldtrip/fieldtrip#2067

Open

Discuss BEP021 derivatives for electrophys #5

Discuss BEP021 derivatives for electrophys #5

Comments

robertoostenveld commented Feb 15, 2021

robertoostenveld commented Feb 15, 2021

robertoostenveld commented Feb 15, 2021

robertoostenveld commented Feb 15, 2021

robertoostenveld commented Feb 15, 2021

robertoostenveld commented Feb 15, 2021

robertoostenveld commented Feb 15, 2021

arnodelorme commented Feb 15, 2021

guiomar commented Dec 1, 2021 • edited

hoechenberger commented Dec 1, 2021 • edited

guiomar commented Dec 1, 2021 • edited

adam2392 commented Dec 1, 2021

jasmainak commented Dec 1, 2021

guiomar commented Dec 2, 2021

guiomar commented Dec 6, 2021

hoechenberger commented Dec 6, 2021

guiomar commented Dec 6, 2021

guiomar commented Dec 16, 2021

smakeig commented Dec 18, 2021 via email

guiomar commented Dec 20, 2021

smakeig commented Dec 20, 2021 via email

guiomar commented Jan 10, 2022

guiomar commented Jan 11, 2022

guiomar commented Jan 11, 2022

robertoostenveld commented Jan 28, 2022

robertoostenveld commented Jan 28, 2022

robertoostenveld commented Jan 28, 2022 • edited

agramfort commented Feb 2, 2022 • edited

robertoostenveld commented Feb 2, 2022 • edited

robertoostenveld commented Feb 18, 2022

SaraStephenson commented Mar 9, 2022 • edited

robertoostenveld commented Mar 10, 2022

CPernet commented Mar 10, 2022

robertoostenveld commented Mar 10, 2022

first in derivatives/BIDS-Lossless-EEG

Moving on to BIDS-Seg-Face13-EEGLAB:

arnodelorme commented Mar 10, 2022 via email

SaraStephenson commented Mar 22, 2022

smakeig commented Mar 24, 2022 via email

robertoostenveld commented Mar 25, 2022

arnodelorme commented Mar 25, 2022 via email

jesscall commented May 10, 2022

dorahermes commented May 10, 2022

smakeig commented May 10, 2022 via email

jesscall commented May 18, 2022

arnodelorme commented May 18, 2022

CPernet commented Jun 2, 2022

jesscall commented Jun 2, 2022

robertoostenveld commented Jun 2, 2022

jesscall commented Jun 8, 2022

smakeig commented Jun 8, 2022 via email

Andesha commented Jun 16, 2022 • edited

CPernet commented Jun 19, 2022 • edited by sappelhoff

christinerogers commented Jul 15, 2022

smakeig commented Jul 16, 2022 via email

dorahermes commented Aug 4, 2022

CPernet commented Aug 5, 2022

sappelhoff commented Aug 5, 2022 • edited

Andesha commented Aug 8, 2022

smakeig commented Aug 8, 2022 via email

guiomar commented Dec 1, 2021 •

edited

hoechenberger commented Dec 1, 2021 •

edited

guiomar commented Dec 1, 2021 •

edited

robertoostenveld commented Jan 28, 2022 •

edited

agramfort commented Feb 2, 2022 •

edited

robertoostenveld commented Feb 2, 2022 •

edited

SaraStephenson commented Mar 9, 2022 •

edited

Andesha commented Jun 16, 2022 •

edited

CPernet commented Jun 19, 2022 •

edited by sappelhoff

sappelhoff commented Aug 5, 2022 •

edited