Update tuh.py #431
Conversation
Adaptation to the TUH v3
I think the tests failed because the unit tests are based on the old TUH version.
I will update the tests, thank you so much @MohammadJavadD!!
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #431      +/-   ##
==========================================
- Coverage   84.60%   84.31%   -0.29%
==========================================
  Files          59       59
  Lines        4221     4253      +32
==========================================
+ Hits         3571     3586      +15
- Misses        650      667      +17
```
Hi @MohammadJavadD, I maintained compatibility with the old versions, but also prioritized the new version of the dataset. I don't have much experience with this dataset, so I decided not to create new tests for version 3.0. We can do that in the future. Can you test with the new version of the dataset? If it's good, LGTM.
Thank you for your time. AFAIK the old version should not be in use anymore, as the data authorities asked everyone to delete all the old data. Sure, I'll try the code. Also, as I said before, I had problems with saving the data.
- Mohammad
I noticed that even version 2 has been updated; see [here](https://isip.piconepress.com/projects/tuh_eeg/downloads/tuh_eeg/v2.0.0/AAREADME.txt) for more details. So this works for me now!
braindecode/datasets/tuh.py
Outdated
```python
    version = tokens[-6]
else:
    version = tokens[-7]
subject_id = tokens[-1].split('_')[-2].split('s')[-1]
```
Is this correct? The new subject identifier is a combination of letters.
e.g. 'aaaaaaav_s004_t000.edf'
From the readme:
The last segment is the filename ("aaaaamye_s001_t000.edf"). This
includes the subject identifier ("aaaaamye"), the session number
("s001") and a token number ("t000"). EEGs are split into a series of
files starting with *t000.edf, *t001.edf, ... These represent pruned
EEGs, so the original EEG is split into these segments, and
uninteresting parts of the original recording were deleted (common in
clinical practice).
To make it work on my side, I avoided conversion to integer downstream.
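For illustration, the README's naming scheme can be checked with a few lines of plain Python; the filename is the example quoted above, and nothing here is braindecode API. Note that `int(subject)` would raise a `ValueError` on the letter-based identifier, which is why the conversion had to be avoided:

```python
# Split the new-style TUH filename from the README example.
# "aaaaamye_s001_t000.edf" -> subject "aaaaamye", session "s001", token "t000"
fname = "aaaaamye_s001_t000.edf"
stem = fname.rsplit(".", 1)[0]       # drop the ".edf" extension
subject, session, token = stem.split("_")
print(subject, session, token)       # aaaaamye s001 t000
```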
Suggested changes to fix subject_id, session and segment extraction from file path
braindecode/datasets/tuh.py
Outdated
```python
subject_id = tokens[-1].split('_')[-2].split('s')[-1]
session = tokens[-2].split('_')[0]
segment = tokens[-1].split('_')[-1].split('.')[-2]
```
```diff
- subject_id = tokens[-1].split('_')[-2].split('s')[-1]
- session = tokens[-2].split('_')[0]
- segment = tokens[-1].split('_')[-1].split('.')[-2]
+ subject_id = tokens[-1].split('_')[0]
+ session = tokens[-1].split('_')[1]
+ segment = tokens[-1].split('_')[2].split('.')[0]
```
Fixes subject_id, session, and sample extraction from filename.
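As a sanity check, the suggested splits behave as follows on the example filename from this thread; the `tokens` list is a hypothetical path split constructed for this demo, not taken from the library:

```python
# Hypothetical token list for a new-style abnormal path,
# ending in the example filename "aaaaaaav_s004_t000.edf".
tokens = ["tuh_eeg_abnormal", "v3.0.0", "edf", "train", "normal",
          "01_tcp_ar", "aaaaaaav_s004_t000.edf"]
subject_id = tokens[-1].split('_')[0]
session = tokens[-1].split('_')[1]
segment = tokens[-1].split('_')[2].split('.')[0]
print(subject_id, session, segment)  # aaaaaaav s004 t000
```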
Does not retain compatibility with older versions; this needs to be added after the dataset and version check if compatibility is desired.
braindecode/datasets/tuh.py
Outdated
```python
'year': int(year),
'month': int(month),
'day': int(day),
'subject': int(subject_id),
```
```diff
- 'subject': int(subject_id),
+ 'subject': subject_id,
```
It is no longer a number, as noted by @dengemann.
This may need some more work to be usable with the different TUH datasets for versions both before and after the December 2022 update. For 'tuh_eeg' the newest version number is v2.0.0, while for 'tuh_eeg_abnormal' the newest version is v3.0.0. Checking the exact version to determine the file structure is a short-term solution; the version number could instead be read as a number and compared against a threshold.
Could it make sense to introduce an explicit version parameter, defaulting to the latest version?
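A minimal sketch of the version-as-number idea, assuming the version string always looks like `vX.Y.Z`; the function name and thresholds are illustrative (following the versions mentioned above: `tuh_eeg` >= v2, `tuh_eeg_abnormal` >= v3), not existing braindecode code:

```python
def is_new_layout(version: str, abnormal: bool) -> bool:
    """Return True if `version` (e.g. "v3.0.0") uses the post-Dec-2022
    file layout, assuming abnormal switched at v3 and tuh_eeg at v2."""
    major = int(version.lstrip('v').split('.')[0])
    return major >= (3 if abnormal else 2)

print(is_new_layout("v3.0.0", abnormal=True))   # True
print(is_new_layout("v2.0.0", abnormal=False))  # True
print(is_new_layout("v1.1.0", abnormal=False))  # False
```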
Hey @dengemann and @ostormer, hope you're both doing great. I was wondering if you could help me out and take over this PR? I know you both have a lot of expertise with the dataset, and your approval would be super helpful for the project. Thanks so much for taking the time to review the PR; I really appreciate any feedback you can give me, considering I'm not that familiar with the dataset.
Hey @bruAristimunha, happy to help. I have already solved the problem for myself. In case it helps, check out this gist (in/out paths need to be set; also, the way I do it, the alphabetical ordering of the subject names made of letter sequences becomes a subject number): https://gist.github.com/dengemann/c2a411f50b7888d34ccd298cdfdf05c3 Edit: the gist does two things: it creates the braindecode dataset, and it also converts to BIDS, based on the code from our brain-age benchmarks https://github.com/meeg-ml-benchmarks/brain-age-benchmark-paper
The gist worked for me as well, and the conversion makes things way more organized. As there are some problems with the save function, it would be great if Braindecode could integrate the BIDS format.
Hello! I have been traveling and so have been offline for a couple of days. Before seeing @dengemann's answer I rewrote the function as below. I see that the '_create_chronological_description' function could be updated to only omit dates in the case of abnormal v3, which is the only version without dates in the file path.

```python
def _parse_description_from_file_path(file_path):
    # stackoverflow.com/questions/3167154/how-to-split-a-dos-path-into-its-components-in-python  # noqa
    file_path = os.path.normpath(file_path)
    tokens = file_path.split(os.sep)
    # Extract version number and tuh_eeg_abnormal/tuh_eeg from file path
    if ('train' in tokens) or ('eval' in tokens):  # tuh_eeg_abnormal
        abnormal = True
        # tokens[-2] is the channel configuration (always 01_tcp_ar in
        # abnormal) on new versions, or the session
        # (e.g. s004_2013_08_15) on old versions
        if tokens[-2].split('_')[0][0] == 's':  # s denoting session number
            version = tokens[-9]  # Before the Dec 2022 update
        else:
            version = tokens[-6]  # After the Dec 2022 update
    else:  # tuh_eeg
        abnormal = False
        version = tokens[-7]
    v_number = int(version[1])
    if (abnormal and v_number >= 3) or ((not abnormal) and v_number >= 2):
        # New file path structure for versions after the December 2022 update,
        # expect file paths as
        # tuh_eeg/v2.0.0/edf/000/aaaaaaaa/
        #     s001_2015_12_30/01_tcp_ar/aaaaaaaa_s001_t000.edf
        # or for abnormal:
        # tuh_eeg_abnormal/v3.0.0/edf/train/normal/
        #     01_tcp_ar/aaaaaaav_s004_t000.edf
        subject_id = tokens[-1].split('_')[0]
        session = tokens[-1].split('_')[1]
        segment = tokens[-1].split('_')[2].split('.')[0]
        description = {
            'path': file_path,
            'version': version,
            'subject': subject_id,
            'session': int(session[1:]),
            'segment': int(segment[1:]),
        }
        if not abnormal:
            year, month, day = tokens[-3].split('_')[1:]
            description['year'] = int(year)
            description['month'] = int(month)
            description['day'] = int(day)
        return description
    else:  # Old file path structure
        # expect file paths as tuh_eeg/version/file_type/reference/data_split/
        #     subject/recording session/file
        # e.g. tuh_eeg/v1.1.0/edf/01_tcp_ar/027/00002729/
        #     s001_2006_04_12/00002729_s001.edf
        # or for abnormal:
        # version/file type/data_split/pathology status/
        #     reference/subset/subject/recording session/file
        # e.g. v2.0.0/edf/train/normal/01_tcp_ar/000/00000021/
        #     s004_2013_08_15/00000021_s004_t000.edf
        subject_id = tokens[-1].split('_')[0]
        session = tokens[-2].split('_')[0]  # string on format 's000'
        # According to the tuh_eeg example path in the comment above,
        # segment is not always included in the file name
        segment = tokens[-1].split('_')[-1].split('.')[0]  # TODO: test with tuh_eeg
        year, month, day = tokens[-2].split('_')[1:]
        return {
            'path': file_path,
            'version': version,
            'year': int(year),
            'month': int(month),
            'day': int(day),
            'subject': subject_id,
            'session': int(session[1:]),
            'segment': int(segment[1:]),
        }
```
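To exercise the parsing logic above without the full dataset, here is a condensed, self-contained sketch of only the new-style branch, run on the two example paths from the function's comments; `parse_new_style` is a name made up for this demo, not braindecode API:

```python
import os

def parse_new_style(file_path):
    """Condensed sketch of the new-style (post-Dec-2022) path parsing."""
    tokens = os.path.normpath(file_path).split(os.sep)
    fname = tokens[-1]                      # e.g. "aaaaaaaa_s001_t000.edf"
    subject_id, session, token = fname.split('_')
    segment = token.split('.')[0]           # strip the ".edf" extension
    desc = {'subject': subject_id,
            'session': int(session[1:]),
            'segment': int(segment[1:])}
    # tuh_eeg (no train/eval split) carries the date in the session folder
    if 'train' not in tokens and 'eval' not in tokens:
        year, month, day = tokens[-3].split('_')[1:]
        desc.update(year=int(year), month=int(month), day=int(day))
    return desc

print(parse_new_style(
    "tuh_eeg/v2.0.0/edf/000/aaaaaaaa/s001_2015_12_30/01_tcp_ar/"
    "aaaaaaaa_s001_t000.edf"))
print(parse_new_style(
    "tuh_eeg_abnormal/v3.0.0/edf/train/normal/01_tcp_ar/"
    "aaaaaaav_s004_t000.edf"))
```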
So @ostormer, I have tried your code; we have only the old version of TUH here. It seems to be fine, except we still need to convert subject_id to int in the else clause for the old TUH, to keep everything the same as before and for the tests to run. Using your function to replace the existing one should make everything work also for the new TUH, right? Then maybe we can just do that?
Just to clarify, you would need to use the updated
So I have now pushed some changes that should work for the old and new versions, based on the existing parts from @MohammadJavadD and @ostormer. However, it will load the edfs twice for the new abnormal version, and even load them the first time without parallelization (which would slow loading down). After some discussion with @gemeinl, I think the better way would be:
This would circumvent loading the edf twice. What do you think? @bruAristimunha
Great, as always! I like this proposition; it's a great alternative. We'll deal with version compatibility and try to maintain performance in line with the parallelization tutorials. In the future, we can generate new tests for version 3.0. LGTM =)
I like your suggestions @robintibor, seems to be a good solution! My code was written while traveling, without testing anything other than the simple file path parsing. I agree that the full loading process was not ideal with just those changes. I don't know enough about the rest of the Braindecode library, as I just recently started using it, so thanks and great work to you guys more familiar with the whole picture :)
Updated as discussed with Robin above. |
Unfortunately, the medical text reports have been removed by TUH with the new release. |
I checked with TUAB and TUEG 2022 and both worked! Good job! Thank you, everyone!
I am delighted with the outcome of this pull request! The whole team has done an excellent job, and I would like to extend my sincere appreciation to all of you, @MohammadJavadD, @dengemann, @ostormer, @gemeinl and @robintibor. @robintibor, would you be so kind as to consider merging the changes into the main branch?
Thanks for the kind words @bruAristimunha. We did unfortunately realize one more issue: We want to solve this by doing the sorting again before loading, and, in case the date cannot be read from the path, loading the edf but storing a text file with the date alongside the edf file, so that next time only the text file has to be read. We will either implement this tomorrow, or, if we don't manage, we will first just add a warning and merge this so people can load the new TUH from master, and then do this in a separate PR.
So merging this now; @gemeinl will tackle the aforementioned problem in a separate PR. Also thanks from my side to everybody involved: @MohammadJavadD, @dengemann, @ostormer, @gemeinl, @bruAristimunha
Adaptation to the TUH v3; might resolve issue #430. But I am getting this error in `braindecode.preprocessing.preprocess`:

```
RuntimeError: info["meas_date"] seconds must be between "(-2147483648, 0)" and "(2147483647, 0)", got "-2209161600"
```
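For what it's worth, the offending value can be decoded with the standard library: it falls below MNE's signed 32-bit lower bound of -2147483648 because it corresponds to a placeholder-like recording date in 1899. The `raw.set_meas_date(None)` workaround mentioned in the comment is a suggestion, not something tested in this thread:

```python
from datetime import datetime, timedelta, timezone

# Decode the epoch seconds from the error message into a calendar date.
meas_seconds = -2209161600
dt = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(seconds=meas_seconds)
print(dt.date())  # 1899-12-30, outside MNE's supported int32 window

# A possible workaround (untested here) is to drop the bogus date
# before preprocessing:
#     raw.set_meas_date(None)
```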