
Update tuh.py #431

Merged

merged 19 commits into from Mar 10, 2023

Conversation

MohammadJavadD
Contributor

Adaptation to TUH v3; might resolve issue #430.

but I'm getting this error in braindecode.preprocessing.preprocess: RuntimeError: info["meas_date"] seconds must be between "(-2147483648, 0)" and "(2147483647, 0)", got "-2209161600"
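A minimal workaround sketch (not part of this PR, and assuming braindecode's Preprocessor accepts a callable on the Raw object): the offending header date is around the year 1900, before MNE's int32-representable range, so clearing the measurement date before any other preprocessing step avoids the RuntimeError.

from braindecode.preprocessing import Preprocessor, preprocess

# clear the out-of-range measurement date before any other step
drop_meas_date = Preprocessor(
    lambda raw: raw.set_meas_date(None),  # mne.io.Raw.set_meas_date
    apply_on_array=False,  # operate on the Raw object, not the numpy array
)
# preprocess(dataset, [drop_meas_date, ...])  # prepend to your pipeline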

@MohammadJavadD
Contributor Author

I think the tests failed because the unit tests are based on the old TUH version.

@bruAristimunha
Collaborator

I will update the test, thank you so much @MohammadJavadD!!

@codecov

codecov bot commented Feb 5, 2023

Codecov Report

Merging #431 (acdd6d3) into master (dad38a8) will decrease coverage by 0.29%.
The diff coverage is 63.82%.

@@            Coverage Diff             @@
##           master     #431      +/-   ##
==========================================
- Coverage   84.60%   84.31%   -0.29%     
==========================================
  Files          59       59              
  Lines        4221     4253      +32     
==========================================
+ Hits         3571     3586      +15     
- Misses        650      667      +17     

@bruAristimunha
Collaborator

Hi @MohammadJavadD,

I maintained compatibility with the old versions, but also prioritized the new version of the dataset. I don't have much experience with this dataset, so I decided not to create new tests for version 3.0. We can do that in the future.

Can you test with the new version of the dataset? If it's good, LGTM.

@MohammadJavadD
Contributor Author

MohammadJavadD commented Feb 5, 2023 via email

I noticed that even version 2 has been updated; see [here](https://isip.piconepress.com/projects/tuh_eeg/downloads/tuh_eeg/v2.0.0/AAREADME.txt) for more details. So this works for me now!

        version = tokens[-6]
    else:
        version = tokens[-7]
    subject_id = tokens[-1].split('_')[-2].split('s')[-1]
Contributor

Is this correct? The new subject identifier is a combination of letters.
e.g. 'aaaaaaav_s004_t000.edf'

From the readme:

The last segment is the filename ("aaaaamye_s001_t000.edf"). This
includes the subject identifier ("aaaaamye"), the session number
("s001") and a token number ("t000"). EEGs are split into a series of
files starting with *t000.edf, *t001.edf, ...  These represent pruned
EEGs, so the original EEG is split into these segments, and
uninteresting parts of the original recording were deleted (common in
clinical practice).

To make it work on my side, I avoided conversion to integer downstream.
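For illustration, the new filename splits cleanly on underscores (the example filename is from the readme excerpt above):

fname = 'aaaaaaav_s004_t000.edf'
subject_id, session, token = fname.split('.')[0].split('_')
print(subject_id, session, token)  # -> aaaaaaav s004 t000
# subject_id is now a letter sequence, so int(subject_id) raises ValueError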

@ostormer left a comment

Suggested changes to fix subject_id, session and segment extraction from file path

Comment on lines 135 to 137:

    subject_id = tokens[-1].split('_')[-2].split('s')[-1]
    session = tokens[-2].split('_')[0]
    segment = tokens[-1].split('_')[-1].split('.')[-2]

Suggested change:

    subject_id = tokens[-1].split('_')[0]
    session = tokens[-1].split('_')[1]
    segment = tokens[-1].split('_')[2].split('.')[0]

Fixes subject_id, session, and segment extraction from the filename.


This does not retain compatibility with older versions; it needs to be added after a dataset and version check if compatibility is desired.

    'year': int(year),
    'month': int(month),
    'day': int(day),
    'subject': int(subject_id),

Suggested change:

    'subject': subject_id,

The subject ID is no longer a number, as noted by @dengemann.

@ostormer

This may need some more work to be usable with the different TUH datasets for versions both before and after the December 2022 update. For 'tuh_eeg' the newest version number is v2.0.0, while for 'tuh_eeg_abnormal' the newest version is v3.0.0.

Checking the exact version to determine the file structure is a short-term solution; instead, the version number could be read as a number and checked to be >= 2 for tuh_eeg and >= 3 for abnormal.
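For instance, a sketch of that comparison (assuming version tokens like 'v3.0.0'; names are illustrative):

def major_version(version_token):
    # 'v3.0.0' -> 3
    return int(version_token.lstrip('v').split('.')[0])

def uses_new_layout(dataset_name, version_token):
    # new layout from v3 for tuh_eeg_abnormal, from v2 for tuh_eeg
    threshold = 3 if dataset_name == 'tuh_eeg_abnormal' else 2
    return major_version(version_token) >= threshold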

@dengemann
Contributor

Could it make sense to introduce an explicit version parameter, defaulting to the latest version?
In any case, one is not supposed to use older versions any longer.
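A sketch of how such a parameter might look (the name and default are hypothetical, not braindecode's actual API):

class TUH:
    def __init__(self, path, version='v2.0.0', **kwargs):
        # default to the latest release; older layouts stay reachable
        # by passing e.g. version='v1.1.0' explicitly
        self.version = version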

@bruAristimunha
Collaborator

Hey @dengemann and @ostormer,

Hope you're both doing great. I was wondering if you could help me out and take over this PR? I know you both have a lot of expertise with the dataset, and your approval would be super helpful for the project.

Thanks so much for taking the time to review the PR; I really appreciate any feedback you can give, considering I'm not that familiar with the dataset.

@dengemann
Contributor

dengemann commented Feb 21, 2023

Hey @bruAristimunha, happy to help. I have already solved the problem for myself. In case it helps, check out this gist (in/out paths need to be set + the way I do it, the alphabetical ordering of the subject names made of letter sequences becomes a subject number): https://gist.github.com/dengemann/c2a411f50b7888d34ccd298cdfdf05c3
The question is, what design / functionality is good enough for braindecode.
What does it take to decide how we want it done here?

Edit: the gist does two things; it creates the braindecode dataset, but it also converts to BIDS, based on the code from our brain-age benchmarks: https://github.com/meeg-ml-benchmarks/brain-age-benchmark-paper
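For illustration, a tiny sketch of the alphabetical-ordering idea from the gist (the subject IDs are just examples following the naming scheme):

subject_ids = ['aaaaamye', 'aaaaaaav', 'aaaaaaaa']
subject_number = {s: i for i, s in enumerate(sorted(set(subject_ids)))}
print(subject_number)  # {'aaaaaaaa': 0, 'aaaaaaav': 1, 'aaaaamye': 2}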

@MohammadJavadD
Contributor Author

> Hey @bruAristimunha, happy to help. I have already solved the problem for myself. In case it helps, check out this gist [...]: https://gist.github.com/dengemann/c2a411f50b7888d34ccd298cdfdf05c3

The gist worked for me as well, and the conversion makes things way more organized. Since there are some problems with the save function, it would be great if Braindecode could integrate the BIDS format.

@ostormer

Hello! I have been traveling and so have been offline for a couple of days. Before seeing @dengemann's answer I rewrote the _parse_description_from_file_path function to work with both old and new versions of both datasets, but I have not been able to test it with old versions, as I don't have access to any myself.

I see that the _create_chronological_description function could be updated to omit dates only in the case of abnormal v3, which is the only version without dates in the file path.

import os


def _parse_description_from_file_path(file_path):
    # stackoverflow.com/questions/3167154/how-to-split-a-dos-path-into-its-components-in-python  # noqa
    file_path = os.path.normpath(file_path)
    tokens = file_path.split(os.sep)
    # Extract version number and tuh_eeg_abnormal/tuh_eeg from file path
    if ('train' in tokens) or ('eval' in tokens):  # tuh_eeg_abnormal
        abnormal = True
        # tokens[-2] is the channel configuration (always 01_tcp_ar in
        # abnormal) on new versions, or the session
        # (e.g. s004_2013_08_15) on old versions
        if tokens[-2].split('_')[0][0] == 's':  # s denoting session number
            version = tokens[-9]  # before the Dec 2022 update
        else:
            version = tokens[-6]  # after the Dec 2022 update
    else:  # tuh_eeg
        abnormal = False
        version = tokens[-7]
    v_number = int(version[1])

    if (abnormal and v_number >= 3) or ((not abnormal) and v_number >= 2):
        # New file path structure for versions after the December 2022 update,
        # expect file paths as
        # tuh_eeg/v2.0.0/edf/000/aaaaaaaa/
        #     s001_2015_12_30/01_tcp_ar/aaaaaaaa_s001_t000.edf
        # or for abnormal:
        # tuh_eeg_abnormal/v3.0.0/edf/train/normal/
        #     01_tcp_ar/aaaaaaav_s004_t000.edf
        subject_id = tokens[-1].split('_')[0]
        session = tokens[-1].split('_')[1]
        segment = tokens[-1].split('_')[2].split('.')[0]

        description = {
            'path': file_path,
            'version': version,
            'subject': subject_id,
            'session': int(session[1:]),
            'segment': int(segment[1:]),
        }
        if not abnormal:
            year, month, day = tokens[-3].split('_')[1:]
            description['year'] = int(year)
            description['month'] = int(month)
            description['day'] = int(day)
        return description
    else:  # old file path structure
        # expect file paths as tuh_eeg/version/file_type/reference/data_split/
        #                          subject/recording session/file
        # e.g.                 tuh_eeg/v1.1.0/edf/01_tcp_ar/027/00002729/
        #                          s001_2006_04_12/00002729_s001.edf
        # or for abnormal:
        # version/file type/data_split/pathology status/
        #     reference/subset/subject/recording session/file
        # v2.0.0/edf/train/normal/01_tcp_ar/000/00000021/
        #     s004_2013_08_15/00000021_s004_t000.edf
        subject_id = tokens[-1].split('_')[0]
        session = tokens[-2].split('_')[0]  # string of format 's000'
        # Note: in the old tuh_eeg example above the file name ends without
        # a t-token, so this parsing may need testing with tuh_eeg
        segment = tokens[-1].split('_')[-1].split('.')[0]  # TODO: test with tuh_eeg
        year, month, day = tokens[-2].split('_')[1:]
        return {
            'path': file_path,
            'version': version,
            'year': int(year),
            'month': int(month),
            'day': int(day),
            'subject': subject_id,
            'session': int(session[1:]),
            'segment': int(segment[1:]),
        }
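For example, on a new-style abnormal path (a hypothetical path following the structure in the comments) the function would return:

desc = _parse_description_from_file_path(
    'tuh_eeg_abnormal/v3.0.0/edf/train/normal/01_tcp_ar/'
    'aaaaaaav_s004_t000.edf')
print(desc['subject'], desc['session'], desc['segment'])  # aaaaaaav 4 0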

@robintibor
Contributor

robintibor commented Feb 24, 2023

So @ostormer I have tried your code; we only have the old version of TUH here. It seems to be fine, except we still need to convert subject_id to int

            'subject': int(subject_id)

in the else clause for the old TUH to keep everything the same as before and for the tests to run.

In general, you can run pytest test/unit_tests/datasets/test_tuh.py -v to check that the new parsing code still works for the old style.

Using your function to replace the existing one should make everything work for the new TUH too, right? Then maybe we can just do that?

@robintibor
Contributor

robintibor commented Feb 24, 2023

Just to clarify, you would need to use the updated test_tuh.py that I pushed into this branch @ostormer

@robintibor
Contributor

robintibor commented Feb 27, 2023

So I have now pushed some changes that should work for the old and new versions, based on the existing parts from @MohammadJavadD and @ostormer. However, it will load the edfs twice for the new abnormal version, and even load them the first time without parallelization (which would slow loading down).

After some discussion with @gemeinl, I think the better way would be:

  1. Parse all information that can be extracted from the path, without opening the edf, into per-dataset descriptions (as happened before, without the new line that reads the edf to get year/month/day)
  2. Load the edfs and add additional information such as year/month/day to the description (also as is now)
  3. Sort the datasets chronologically based on the now-complete descriptions (would need to be added; basically moving the sorting)

This would circumvent loading the edf twice. What do you think? @bruAristimunha
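A rough sketch of what step 3 could look like (names are illustrative, not the code actually pushed):

def sort_chronologically(datasets):
    # sort once at the end, using the now-complete descriptions,
    # instead of re-opening any edf
    return sorted(
        datasets,
        key=lambda ds: (ds.description['year'],
                        ds.description['month'],
                        ds.description['day']),
    )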

@bruAristimunha
Collaborator

Great, as always! I like this proposition, and it's a great alternative; we'll deal with version compatibility and try to keep performance in line with the parallelization tutorials.

In the future, we can generate new tests for version 3.0.

LGTM =)

@ostormer

ostormer commented Mar 1, 2023

I like your suggestions @robintibor, seems to be a good solution! My code was written while traveling, without testing anything other than the simple file path parsing. I agree that the full loading process was not ideal with just those changes.

I don't know enough about the rest of the Braindecode library as I just recently started using it, so thanks and great work to you guys more familiar with the whole picture :)

@gemeinl gemeinl self-assigned this Mar 2, 2023
@gemeinl gemeinl linked an issue Mar 2, 2023 that may be closed by this pull request
@gemeinl
Collaborator

gemeinl commented Mar 3, 2023

Updated as discussed with Robin above.
We do not yet have the new data release. Can someone please check whether this update works? @ostormer @MohammadJavadD @dengemann

@gemeinl
Collaborator

gemeinl commented Mar 3, 2023

Unfortunately, the medical text reports have been removed by TUH with the new release.
So trying to add them with add_physician_reports=True should and will raise an error.
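For reference, a hedged usage sketch (the local path is hypothetical):

from braindecode.datasets import TUHAbnormal

ds = TUHAbnormal(
    path='/data/tuh_eeg_abnormal/v3.0.0/',  # hypothetical local copy
    recording_ids=list(range(10)),
    add_physician_reports=False,  # True raises on the new release
)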

@MohammadJavadD
Contributor Author

> Updated as discussed with Robin above. We do not yet have the new data release. Can someone please check whether this update works @ostormer @MohammadJavadD @dengemann

I checked with TUAB and TUEG 2022 and both worked! Good job! Thank you, everyone!

@bruAristimunha
Collaborator

I am delighted with the outcome of this pull request! The whole team has done an excellent job, and I would like to extend my sincere appreciation to all of you, @MohammadJavadD, @dengemann, @ostormer, @gemeinl and @robintibor.

@robintibor, would you be so kind as to consider merging the changes into the main branch?

@robintibor
Contributor

Thanks for the kind words @bruAristimunha. We did unfortunately realize one more issue:
due to the chronological sorting after loading the data, the recording_ids parameter will now behave differently, and in some cases this can be a bigger problem, e.g. loading the first 300 recordings will now load only abnormal recordings.

We want to solve this by doing the sorting again before loading; in case the date cannot be read from the path, we load the edf but store a text file with the date alongside the edf file, so that next time only the text file has to be read.

We will either implement this tomorrow or, if we don't manage, first just add a warning and merge this so people can load the new TUH from master, and then do the rest in a separate PR.
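A sketch of how such a date cache could look (the sidecar name and format are assumptions, not the final implementation):

import os
import mne

def get_recording_date(edf_path):
    sidecar = edf_path + '.date.txt'  # hypothetical sidecar file
    if os.path.exists(sidecar):
        with open(sidecar) as f:
            return f.read().strip()
    # slow path: open the edf once and cache the date for next time
    raw = mne.io.read_raw_edf(edf_path, preload=False, verbose='error')
    date = str(raw.info['meas_date'])
    with open(sidecar, 'w') as f:
        f.write(date)
    return date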

@robintibor robintibor merged commit bab64c7 into braindecode:master Mar 10, 2023
3 of 4 checks passed
@robintibor
Contributor

So merging this now; @gemeinl will tackle the aforementioned problem in a separate PR. Also thanks from my side to everybody involved: @MohammadJavadD, @dengemann, @ostormer, @gemeinl, @bruAristimunha


Successfully merging this pull request may close these issues.

TUAB data version 3.0.0
6 participants