
[WIP] Self-supervised learning example on Sleep Physionet #178

Merged
merged 5 commits into from
Nov 16, 2020

Conversation

hubertjb
Collaborator

@hubertjb hubertjb commented Nov 4, 2020

This PR introduces a new example, along with a new samplers module, to demonstrate how self-supervised learning can be used to learn representations in an unsupervised manner on raw EEG. The example implements the relative positioning task described in this preprint.
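For readers new to the pretext task, here is a minimal sketch of relative positioning sampling: pairs of EEG windows get a binary label depending on whether they are close in time (within `tau_pos`) or far apart (beyond `tau_neg`). Function and parameter names are illustrative, not this PR's actual API:

```python
import numpy as np

def sample_rp_pairs(starts, tau_pos, tau_neg, n_pairs, rng):
    """Sample (anchor, other, label) window pairs for relative positioning.

    label 1: the two windows start within tau_pos seconds of each other
    ("close in time"); label 0: they are at least tau_neg seconds apart.
    Pairs falling between the two thresholds are discarded.
    """
    pairs = []
    while len(pairs) < n_pairs:
        i, j = rng.randint(0, len(starts), size=2)
        dist = abs(starts[i] - starts[j])
        if dist <= tau_pos:
            pairs.append((i, j, 1))
        elif dist >= tau_neg:
            pairs.append((i, j, 0))
    return pairs

rng = np.random.RandomState(87)
starts = np.arange(0, 1000, 30)  # window onsets every 30 s (sleep epochs)
pairs = sample_rp_pairs(starts, tau_pos=60, tau_neg=300, n_pairs=10, rng=rng)
```

An embedder then maps each window of a pair to a feature vector, and a small head is trained to predict the label from the pair of features.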

I've kept the number of recordings used in the example as well as the number of self-supervised samples and training epochs pretty low, so the example could run relatively quickly. The results are already interesting in that setting, although nicer results are obtained when more examples are included. For instance, with 40 recordings:

[Figures: relative positioning learning curves and PCA of the learned features, 40 subjects]

@hubertjb
Collaborator Author

hubertjb commented Nov 5, 2020

It looks like the failed CI is caused by differences in the recorded loss of other tests (test_cropped_decoding.py etc.), similar to what happened some time ago due to a change in skorch version. I'm not sure there's a link with the changes of this PR.

In any case, the rendered example can be found here!

Comment on lines 163 to 185

@property
def metadata(self):
    """Concatenate the metadata and description of the wrapped Epochs.

    NOTE: This is implemented as a property to avoid having to keep a very
    large DataFrame in case the dataset contains very long or many
    recordings.

    Returns
    -------
    pd.DataFrame:
        DataFrame containing as many rows as there are windows in the
        BaseConcatDataset, with the metadata and description information
        for each window.
    """
    if not all(isinstance(ds, WindowsDataset) for ds in self.datasets):
        raise TypeError('Property metadata can only be computed when all '
                        'datasets are WindowsDataset.')

    all_dfs = list()
    for ds in self.datasets:
        df = ds.windows.metadata
        for k, v in ds.description.items():
            df[k] = v
        all_dfs.append(df)

    return pd.concat(all_dfs)
Contributor

I guess this is the critical part of this PR as far as discussion goes. I am not sure I have a good mental overview of the dataset part; as far as I can see, metadata here is concatenated from the windows datasets' metadata.
Regarding the implementation as a property: it could be fine, but I am a bit worried, as I don't think we use properties in other places, and it can always cause some confusion if there is non-negligible execution time on access to a member (it looks like plain member access from the caller's side). But as said, it could be fine.

@gemeinl do you think this metadata here fits in nicely with rest of data structures as you see them? or do you envision any potential conflicts/confusions?

Collaborator Author

I agree with you, this part requires some discussion. I guess this could instead be a method get_metadata(), which would make the fact that this requires computations more explicit.

Contributor

@gemeinl can maybe also give his view tomorrow on metadata vs get_metadata(). I am not sure: metadata is more consistent with windows.metadata, while get_metadata() is more explicit that there is a computation... what does yoda @agramfort think?

Comment on lines 18 to 21
metadata : pd.DataFrame
DataFrame describing at least one of {subject, session, run}
information for each window in the BaseConcatDataset to sample examples
from. Normally obtained as the `metadata` propery of BaseConcatDataset.
Contributor

Typo on "propery".
Also, this is still not so clear to me; maybe you can also put a minimal example in the docstring? It does not have to be something that runs by itself, just show the relevant lines for understanding what is in metadata and what would happen.
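Such a minimal docstring example might look like the following. The window bookkeeping columns are the usual braindecode/MNE ones, and `subject` stands in for a description column; exact column names are an assumption for illustration:

```python
import pandas as pd

# Illustrative `metadata` DataFrame: one row per window. The i_* columns
# come from the windowing step; `subject` was broadcast from each
# recording's description.
metadata = pd.DataFrame({
    "i_window_in_trial": [0, 1, 0, 1],
    "i_start_in_trial":  [0, 3000, 0, 3000],
    "i_stop_in_trial":   [3000, 6000, 3000, 6000],
    "target":            [0, 1, 0, 2],
    "subject":           [1, 1, 2, 2],
})

# A sampler can then group window indices by recording:
windows_per_subject = metadata.groupby("subject").indices
```

With this structure, a sampler only needs the `{subject, session, run}` columns to know which windows belong to the same recording.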

@robintibor
Contributor

> It looks like the failed CI is caused by differences in the recorded loss of other tests (test_cropped_decoding.py etc.), similar to what happened some time ago due to a change in skorch version. I'm not sure there's a link with the changes of this PR.
>
> In any case, the rendered example can be found here!

Hm does it not help to rebase from master? like I thought the problem is solved now.

@robintibor
Contributor

In any case thanks for sharing self-supervised code, really appreciate it!

@hubertjb
Collaborator Author

hubertjb commented Nov 5, 2020

> It looks like the failed CI is caused by differences in the recorded loss of other tests (test_cropped_decoding.py etc.), similar to what happened some time ago due to a change in skorch version. I'm not sure there's a link with the changes of this PR.
> In any case, the rendered example can be found here!

> Hm does it not help to rebase from master? like I thought the problem is solved now.

It looks like it's already based on master actually. I haven't looked yet but maybe there's been another skorch update in the meantime...?

@robintibor
Contributor

> It looks like the failed CI is caused by differences in the recorded loss of other tests (test_cropped_decoding.py etc.), similar to what happened some time ago due to a change in skorch version. I'm not sure there's a link with the changes of this PR.
> In any case, the rendered example can be found here!

> Hm does it not help to rebase from master? like I thought the problem is solved now.

> It looks like it's already based on master actually. I haven't looked yet but maybe there's been another skorch update in the meantime...?

Yeah, since sliwy's PR has the same problem, feel free to ignore. We will have to look at it separately again :(

@robintibor
Contributor

Rebasing on master should now make all tests run again @hubertjb. There seem to be some strange Travis UI delays atm, but I received a mail that they pass.
Then you could also fix the typo, and we can merge once the decision about metadata vs get_metadata() is made (within the next few days?).

@agramfort
Collaborator

agramfort commented Nov 11, 2020 via email

- adding metadata property to BaseConcatDataset
- adding samplers module, with base class RecordingSampler and implementation of a relative positioning sampler
- changing default on_missing in SleepPhysionet from raise to warn
- adding batch norm option to SleepStagerChambon2018 model
- adding tests
- adding example of `metadata` in docstring of `Sampler` class
- changing `metadata` property of `BaseConcatDataset` to a method `get_metadata`
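As a rough sketch of the property-to-method change listed above, with a minimal stand-in class rather than braindecode's actual BaseConcatDataset (the fake datasets and column names here are only for illustration):

```python
import pandas as pd
from types import SimpleNamespace


class ConcatDatasetSketch:
    """Minimal stand-in for BaseConcatDataset, only to illustrate
    get_metadata(); the real braindecode class does much more."""

    def __init__(self, datasets):
        self.datasets = datasets

    def get_metadata(self):
        # Concatenate each sub-dataset's per-window metadata, broadcasting
        # its per-recording description entries into new columns.
        all_dfs = []
        for ds in self.datasets:
            df = ds.windows.metadata.copy()
            for k, v in ds.description.items():
                df[k] = v
            all_dfs.append(df)
        return pd.concat(all_dfs, ignore_index=True)


# Two fake "recordings" with two windows each and a per-recording subject id.
def fake_ds(subject):
    windows = SimpleNamespace(metadata=pd.DataFrame({"i_window": [0, 1]}))
    return SimpleNamespace(windows=windows, description={"subject": subject})

concat = ConcatDatasetSketch([fake_ds(1), fake_ds(2)])
md = concat.get_metadata()  # 4 rows, columns: i_window, subject
```

Being a method rather than a property makes the concatenation cost explicit at the call site, which was the main argument in the discussion above.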
@hubertjb
Collaborator Author

@robintibor I just pushed the latest changes, including changing the metadata property to a get_metadata method, and making sure that the model that is used for visualization is the one that achieved the best valid loss.

@gemeinl
Collaborator

gemeinl commented Nov 12, 2020

I am asking myself if it is necessary or good that the dataset implements get_metadata. From what I see, all the information is accessible from the outside, dataset.description as well as dataset.windows.metadata so we could just move the code into the RecordingSampler and RelativePositionSampler. This would avoid all the problems regarding class properties or naming and could also avoid potential confusion about the term metadata and what it is. Any thoughts on this?

preload: bool
If True, preload the data of the Raw objects.
load_eeg_only: bool
Collaborator

It seems that this is currently not used. Do you think it is necessary to have this option? It looks like a shortcut to avoid using mne.Raw.drop_channels or mne.Raw.pick / mne.Raw.pick_channels / mne.Raw.pick_types in a preprocessing step.

Collaborator

@gemeinl gemeinl Nov 12, 2020

Ok, nevermind. It is used and channels are already excluded on read from HDD.

@robintibor
Contributor

> I am asking myself if it is necessary or good that the dataset implements get_metadata. From what I see, all the information is accessible from the outside, dataset.description as well as dataset.windows.metadata, so we could just move the code into the RecordingSampler and RelativePositionSampler. This would avoid all the problems regarding class properties or naming and could also avoid potential confusion about the term metadata and what it is. Any thoughts on this?

Hm, get_metadata still seems ok to me: it makes sense to have it in the class, since it uses a lot of the class members and logically belongs there. It also does not introduce any changes outside the new function itself...

@hubertjb
Collaborator Author

> I am asking myself if it is necessary or good that the dataset implements get_metadata. From what I see, all the information is accessible from the outside, dataset.description as well as dataset.windows.metadata, so we could just move the code into the RecordingSampler and RelativePositionSampler. This would avoid all the problems regarding class properties or naming and could also avoid potential confusion about the term metadata and what it is. Any thoughts on this?

> Hm, get_metadata still seems ok to me: it makes sense to have it in the class, since it uses a lot of the class members and logically belongs there. It also does not introduce any changes outside the new function itself...

@gemeinl I think it makes sense to have BaseConcatDataset return a metadata DataFrame directly, as it's pretty much what we're already doing with the description attribute (the description attributes of the individual WindowsDatasets are also concatenated into a single DataFrame, also called description). I think there are other scenarios in which it might be convenient to have direct access to this concatenated metadata DataFrame too, e.g., for splitting datasets, converting to an (X, y) dataset, sanity checking, etc. For that reason I would at least implement get_metadata outside of the Sampler objects so it can be easily reused down the line.
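One of those downstream uses, a subject-wise split, could look like the following. The DataFrame here is a hand-built stand-in for the concatenated metadata, and the column names are assumptions for illustration:

```python
import pandas as pd

# Concatenated metadata as it might come out of get_metadata(): one row
# per window, with a `subject` column broadcast from each description.
md = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3],
    "target":  [0, 1, 0, 1, 0, 1],
})

# Subject-wise split: no subject appears in both train and valid,
# avoiding leakage of subject-specific features across the split.
train_subjects = {1, 2}
train_idx = md.index[md["subject"].isin(train_subjects)].tolist()
valid_idx = md.index[~md["subject"].isin(train_subjects)].tolist()
```

The resulting index lists can then be used to build per-split subsets of the windows.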

@robintibor
Contributor

ok, will merge this!

Keep in mind:

  1. If we agree on some specific data augmentation implementation
    AND
  2. this way of implementing augmentations can also very nicely cover the case here (I am aware this is not the same as augmentation here)
    AND
  3. it is different from what you have done

We may want to adapt the code at that point.

But until then:
https://twitter.com/malonebarry/status/1325284260777955328 👍 💯 🥳 😃

@robintibor robintibor merged commit 12334c4 into braindecode:master Nov 16, 2020
@agramfort
Collaborator

🍾 🎉 🍺

@hubertjb hubertjb deleted the rel_pos branch November 16, 2020 14:21
@hubertjb
Collaborator Author

Awesome, thanks @robintibor! :)

Yes, I agree, if this can be implemented with data augmentations it will be worth it to look into changing the implementation of this example. To keep in mind though, I think using a Sampler object is pretty natural and intuitive here, so this could be a reason to keep this implementation.
