
Update tuh.py #431

Merged

merged 19 commits into from Mar 10, 2023

Conversation

MohammadJavadD
Contributor

Adaptation to TUH v3; might resolve issue #430.

but I'm getting this error in braindecode.preprocessing.preprocess: RuntimeError: info["meas_date"] seconds must be between "(-2147483648, 0)" and "(2147483647, 0)", got "-2209161600"
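A minimal workaround sketch (not part of this PR, and assuming braindecode's Preprocessor accepts a callable on the Raw object): the offending header date is around the year 1900, before MNE's int32-representable range, so clearing the measurement date before any other preprocessing step avoids the RuntimeError.

from braindecode.preprocessing import Preprocessor, preprocess

# clear the out-of-range measurement date before any other step
drop_meas_date = Preprocessor(
    lambda raw: raw.set_meas_date(None),  # mne.io.Raw.set_meas_date
    apply_on_array=False,  # operate on the Raw object, not the numpy array
)
# preprocess(dataset, [drop_meas_date, ...])  # prepend to your pipeline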

@MohammadJavadD
Contributor Author

I think the tests failed because the unit tests are based on the old TUH version.

@bruAristimunha
Collaborator

I will update the test, thank you so much @MohammadJavadD!!

@codecov

codecov bot commented Feb 5, 2023

Codecov Report

Merging #431 (acdd6d3) into master (dad38a8) will decrease coverage by 0.29%.
The diff coverage is 63.82%.

@@            Coverage Diff             @@
##           master     #431      +/-   ##
==========================================
- Coverage   84.60%   84.31%   -0.29%     
==========================================
  Files          59       59              
  Lines        4221     4253      +32     
==========================================
+ Hits         3571     3586      +15     
- Misses        650      667      +17     

@bruAristimunha
Collaborator

Hi @MohammadJavadD,

I maintained compatibility with the old versions, but also prioritized the new version of the dataset. I don't have much experience with this dataset, so I decided not to create new tests for version 3.0. We can do that in the future.

Can you test with the new version of the dataset? If it's good, LGTM.

@MohammadJavadD
Contributor Author

MohammadJavadD commented Feb 5, 2023 via email

I noticed that even version 2 has been updated; see [here](https://isip.piconepress.com/projects/tuh_eeg/downloads/tuh_eeg/v2.0.0/AAREADME.txt) for more details. So this works for me now!

        version = tokens[-6]
    else:
        version = tokens[-7]
    subject_id = tokens[-1].split('_')[-2].split('s')[-1]
Contributor

Is this correct? The new subject identifier is a combination of letters.
e.g. 'aaaaaaav_s004_t000.edf'

From the readme:

The last segment is the filename ("aaaaamye_s001_t000.edf"). This
includes the subject identifier ("aaaaamye"), the session number
("s001") and a token number ("t000"). EEGs are split into a series of
files starting with *t000.edf, *t001.edf, ...  These represent pruned
EEGs, so the original EEG is split into these segments, and
uninteresting parts of the original recording were deleted (common in
clinical practice).

To make it work on my side, I avoided conversion to integer downstream.
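For illustration, the new filename splits cleanly on underscores (the example filename is from the readme excerpt above):

fname = 'aaaaaaav_s004_t000.edf'
subject_id, session, token = fname.split('.')[0].split('_')
print(subject_id, session, token)  # -> aaaaaaav s004 t000
# subject_id is now a letter sequence, so int(subject_id) raises ValueError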

@ostormer left a comment

Suggested changes to fix subject_id, session and segment extraction from file path

Comment on lines 135 to 137:

    subject_id = tokens[-1].split('_')[-2].split('s')[-1]
    session = tokens[-2].split('_')[0]
    segment = tokens[-1].split('_')[-1].split('.')[-2]

Suggested change:

    subject_id = tokens[-1].split('_')[0]
    session = tokens[-1].split('_')[1]
    segment = tokens[-1].split('_')[2].split('.')[0]

Fixes subject_id, session, and segment extraction from the filename.


This does not retain compatibility with older versions; it needs to be added after a dataset and version check if compatibility is desired.

    'year': int(year),
    'month': int(month),
    'day': int(day),
    'subject': int(subject_id),

Suggested change:

    'subject': subject_id,

The subject ID is no longer a number, as noted by @dengemann.

@ostormer

This may need some more work to be usable with the different TUH datasets for versions both before and after the December 2022 update. For 'tuh_eeg' the newest version number is v2.0.0, while for 'tuh_eeg_abnormal' the newest version is v3.0.0.

Checking the exact version to determine the file structure is a short-term solution; instead, the version number could be read as a number and checked to be >= 2 for tuh_eeg and >= 3 for abnormal.
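For instance, a sketch of that comparison (assuming version tokens like 'v3.0.0'; names are illustrative):

def major_version(version_token):
    # 'v3.0.0' -> 3
    return int(version_token.lstrip('v').split('.')[0])

def uses_new_layout(dataset_name, version_token):
    # new layout from v3 for tuh_eeg_abnormal, from v2 for tuh_eeg
    threshold = 3 if dataset_name == 'tuh_eeg_abnormal' else 2
    return major_version(version_token) >= threshold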

@dengemann
Contributor

Could it make sense to introduce an explicit version parameter, defaulting to the latest version?
In any case, one is not supposed to use older versions any longer.
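A sketch of how such a parameter might look (the name and default are hypothetical, not braindecode's actual API):

class TUH:
    def __init__(self, path, version='v2.0.0', **kwargs):
        # default to the latest release; older layouts stay reachable
        # by passing e.g. version='v1.1.0' explicitly
        self.version = version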

@bruAristimunha
Collaborator

Hey @dengemann and @ostormer,

Hope you're both doing great. I was wondering if you could help me out and take over this PR? I know you both have a lot of expertise with the dataset, and your approval would be super helpful for the project.

Thanks so much for taking the time to review the PR; I really appreciate any feedback you can give, considering I'm not that familiar with the dataset.

@dengemann
Contributor

dengemann commented Feb 21, 2023

Hey @bruAristimunha, happy to help. I have already solved the problem for myself. In case it helps, check out this gist (in/out paths need to be set + the way I do it, the alphabetical ordering of the subject names made of letter sequences becomes a subject number): https://gist.github.com/dengemann/c2a411f50b7888d34ccd298cdfdf05c3
The question is, what design / functionality is good enough for braindecode.
What does it take to decide how we want it done here?

Edit: the gist does two things; it creates the braindecode dataset, but it also converts to BIDS, based on the code from our brain-age benchmarks: https://github.com/meeg-ml-benchmarks/brain-age-benchmark-paper
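For illustration, a tiny sketch of the alphabetical-ordering idea from the gist (the subject IDs are just examples following the naming scheme):

subject_ids = ['aaaaamye', 'aaaaaaav', 'aaaaaaaa']
subject_number = {s: i for i, s in enumerate(sorted(set(subject_ids)))}
print(subject_number)  # {'aaaaaaaa': 0, 'aaaaaaav': 1, 'aaaaamye': 2}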

@MohammadJavadD
Contributor Author

> Hey @bruAristimunha, happy to help. I have already solved the problem for myself. In case it helps, check out this gist [...]: https://gist.github.com/dengemann/c2a411f50b7888d34ccd298cdfdf05c3

The gist worked for me as well, and the conversion makes things way more organized. Since there are some problems with the save function, it would be great if Braindecode could integrate the BIDS format.

@ostormer

Hello! I have been traveling and so have been offline for a couple of days. Before seeing @dengemann's answer I rewrote the _parse_description_from_file_path function to work with both old and new versions of both datasets, but I have not been able to test it with old versions, as I don't have access to any myself.

I see that the _create_chronological_description function could be updated to omit dates only in the case of abnormal v3, which is the only version without dates in the file path.

import os


def _parse_description_from_file_path(file_path):
    # stackoverflow.com/questions/3167154/how-to-split-a-dos-path-into-its-components-in-python  # noqa
    file_path = os.path.normpath(file_path)
    tokens = file_path.split(os.sep)
    # Extract version number and tuh_eeg_abnormal/tuh_eeg from file path
    if ('train' in tokens) or ('eval' in tokens):  # tuh_eeg_abnormal
        abnormal = True
        # tokens[-2] is the channel configuration (always 01_tcp_ar in
        # abnormal) on new versions, or the session
        # (e.g. s004_2013_08_15) on old versions
        if tokens[-2].split('_')[0][0] == 's':  # s denoting session number
            version = tokens[-9]  # before the Dec 2022 update
        else:
            version = tokens[-6]  # after the Dec 2022 update
    else:  # tuh_eeg
        abnormal = False
        version = tokens[-7]
    v_number = int(version[1])

    if (abnormal and v_number >= 3) or ((not abnormal) and v_number >= 2):
        # New file path structure for versions after the December 2022 update,
        # expect file paths as
        # tuh_eeg/v2.0.0/edf/000/aaaaaaaa/
        #     s001_2015_12_30/01_tcp_ar/aaaaaaaa_s001_t000.edf
        # or for abnormal:
        # tuh_eeg_abnormal/v3.0.0/edf/train/normal/
        #     01_tcp_ar/aaaaaaav_s004_t000.edf
        subject_id = tokens[-1].split('_')[0]
        session = tokens[-1].split('_')[1]
        segment = tokens[-1].split('_')[2].split('.')[0]

        description = {
            'path': file_path,
            'version': version,
            'subject': subject_id,
            'session': int(session[1:]),
            'segment': int(segment[1:]),
        }
        if not abnormal:
            year, month, day = tokens[-3].split('_')[1:]
            description['year'] = int(year)
            description['month'] = int(month)
            description['day'] = int(day)
        return description
    else:  # old file path structure
        # expect file paths as tuh_eeg/version/file_type/reference/data_split/
        #                          subject/recording session/file
        # e.g.                 tuh_eeg/v1.1.0/edf/01_tcp_ar/027/00002729/
        #                          s001_2006_04_12/00002729_s001.edf
        # or for abnormal:
        # version/file type/data_split/pathology status/
        #     reference/subset/subject/recording session/file
        # v2.0.0/edf/train/normal/01_tcp_ar/000/00000021/
        #     s004_2013_08_15/00000021_s004_t000.edf
        subject_id = tokens[-1].split('_')[0]
        session = tokens[-2].split('_')[0]  # string of format 's000'
        # Note: in the old tuh_eeg example above the file name ends without
        # a t-token, so this parsing may need testing with tuh_eeg
        segment = tokens[-1].split('_')[-1].split('.')[0]  # TODO: test with tuh_eeg
        year, month, day = tokens[-2].split('_')[1:]
        return {
            'path': file_path,
            'version': version,
            'year': int(year),
            'month': int(month),
            'day': int(day),
            'subject': subject_id,
            'session': int(session[1:]),
            'segment': int(segment[1:]),
        }
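For example, on a new-style abnormal path (a hypothetical path following the structure in the comments) the function would return:

desc = _parse_description_from_file_path(
    'tuh_eeg_abnormal/v3.0.0/edf/train/normal/01_tcp_ar/'
    'aaaaaaav_s004_t000.edf')
print(desc['subject'], desc['session'], desc['segment'])  # aaaaaaav 4 0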

@robintibor
Contributor

robintibor commented Feb 24, 2023

So @ostormer I have tried your code; we only have the old version of TUH here. It seems to be fine, except we still need to convert subject_id to int

            'subject': int(subject_id)

in the else clause for the old TUH to keep everything the same as before and for the tests to run.

In general, you can run pytest test/unit_tests/datasets/test_tuh.py -v to check that the new parsing code still works for the old style.

Using your function to replace the existing one should make everything work for the new TUH too, right? Then maybe we can just do that?

@robintibor
Contributor

robintibor commented Feb 24, 2023

Just to clarify, you would need to use the updated test_tuh.py that I pushed into this branch @ostormer

@robintibor
Contributor

robintibor commented Feb 27, 2023

So I have now pushed some changes that should work for the old and new versions, based on the existing parts from @MohammadJavadD and @ostormer. However, it will load the edfs twice for the new abnormal version, and even load them the first time without parallelization (which would slow loading down).

After some discussion with @gemeinl, I think the better way would be:

  1. Parse all information that can be extracted from the path, without opening the edf, into per-dataset descriptions (as happened before, without the new line that reads the edf to get year/month/day)
  2. Load the edfs and add additional information such as year/month/day to the description (also as is now)
  3. Sort the datasets chronologically based on the now-complete descriptions (would need to be added; basically moving the sorting)

This would circumvent loading the edf twice. What do you think? @bruAristimunha
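A rough sketch of what step 3 could look like (names are illustrative, not the code actually pushed):

def sort_chronologically(datasets):
    # sort once at the end, using the now-complete descriptions,
    # instead of re-opening any edf
    return sorted(
        datasets,
        key=lambda ds: (ds.description['year'],
                        ds.description['month'],
                        ds.description['day']),
    )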

@bruAristimunha
Collaborator

Great, as always! I like this proposition, and it's a great alternative; we'll deal with version compatibility and try to keep performance in line with the parallelization tutorials.

In the future, we can generate new tests for version 3.0.

LGTM =)

@ostormer

ostormer commented Mar 1, 2023

I like your suggestions @robintibor, seems to be a good solution! My code was written while traveling, without testing anything other than the simple file path parsing. I agree that the full loading process was not ideal with just those changes.

I don't know enough about the rest of the Braindecode library as I just recently started using it, so thanks and great work to you guys more familiar with the whole picture :)

@gemeinl gemeinl self-assigned this Mar 2, 2023
@gemeinl gemeinl linked an issue Mar 2, 2023 that may be closed by this pull request
@gemeinl
Collaborator

gemeinl commented Mar 3, 2023

Updated as discussed with Robin above.
We do not yet have the new data release. Can someone please check whether this update works? @ostormer @MohammadJavadD @dengemann

@gemeinl
Collaborator

gemeinl commented Mar 3, 2023

Unfortunately, the medical text reports have been removed by TUH with the new release.
So trying to add them with add_physician_reports=True should and will raise an error.
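For reference, a hedged usage sketch (the local path is hypothetical):

from braindecode.datasets import TUHAbnormal

ds = TUHAbnormal(
    path='/data/tuh_eeg_abnormal/v3.0.0/',  # hypothetical local copy
    recording_ids=list(range(10)),
    add_physician_reports=False,  # True raises on the new release
)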

@MohammadJavadD
Contributor Author

> Updated as discussed with Robin above. We do not yet have the new data release. Can someone please check whether this update works @ostormer @MohammadJavadD @dengemann

I checked with TUAB and TUEG 2022 and both worked! Good job! Thank you, everyone!

@bruAristimunha
Collaborator

I am delighted with the outcome of this pull request! The whole team has done an excellent job, and I would like to extend my sincere appreciation to all of you, @MohammadJavadD, @dengemann, @ostormer, @gemeinl and @robintibor.

@robintibor, would you be so kind as to consider merging the changes into the main branch?

@robintibor
Contributor

Thanks for the kind words @bruAristimunha. We did unfortunately realize one more issue:
due to the chronological sorting after loading the data, the recording_ids parameter will now behave differently, and in some cases this can be a bigger problem, e.g. loading the first 300 recordings will now load only abnormal recordings.

We want to solve this by doing the sorting again before loading; in case the date cannot be read from the path, we load the edf but store a text file with the date alongside the edf file, so that next time only the text file has to be read.

We will either implement this tomorrow or, if we don't manage, first just add a warning and merge this so people can load the new TUH from master, and then do the rest in a separate PR.
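A sketch of how such a date cache could look (the sidecar name and format are assumptions, not the final implementation):

import os
import mne

def get_recording_date(edf_path):
    sidecar = edf_path + '.date.txt'  # hypothetical sidecar file
    if os.path.exists(sidecar):
        with open(sidecar) as f:
            return f.read().strip()
    # slow path: open the edf once and cache the date for next time
    raw = mne.io.read_raw_edf(edf_path, preload=False, verbose='error')
    date = str(raw.info['meas_date'])
    with open(sidecar, 'w') as f:
        f.write(date)
    return date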

@robintibor robintibor merged commit bab64c7 into braindecode:master Mar 10, 2023
3 of 4 checks passed
@robintibor
Contributor

So merging this now; @gemeinl will tackle the aforementioned problem in a separate PR. Also thanks from my side to everybody involved: @MohammadJavadD, @dengemann, @ostormer, @gemeinl, @bruAristimunha


Successfully merging this pull request may close these issues.

TUAB data version 3.0.0
6 participants