
pieces for multitask learning #2369

Merged
merged 6 commits into allenai:master on Jan 17, 2019

Conversation

@joelgrus (Contributor) commented Jan 16, 2019

As discussed at happy hour, this adds an `InterleavingDatasetReader`, which wraps multiple other dataset readers and interleaves their instances (adding a `MetadataField` indicating which dataset each instance came from), and a `HomogeneousBatchIterator`, which assumes such a `MetadataField` exists and constructs batches that are homogeneous with respect to its value.
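For intuition, here is a minimal sketch of the interleave-and-tag idea (not the PR's actual implementation; the helper name and the round-robin policy are illustrative):

```python
from allennlp.data.fields import MetadataField

def interleave_instances(readers, file_paths, key="dataset"):
    """Round-robin over {name: reader}, tagging each instance with its origin."""
    iterators = {name: iter(reader.read(file_paths[name]))
                 for name, reader in readers.items()}
    while iterators:
        for name in list(iterators):
            try:
                instance = next(iterators[name])
            except StopIteration:
                del iterators[name]  # this reader is exhausted
                continue
            # Record which dataset the instance came from, so a downstream
            # iterator can build batches that are homogeneous in this key.
            instance.add_field(key, MetadataField(name))
            yield instance
```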

The only "weird" thing is that the `file_path` passed to `InterleavingDatasetReader.read()` needs to be a JSON-serialized dict `{wrapped_reader_key -> file_path}`. We discussed alternative designs, like passing in a directory and requiring each wrapped reader to look for a specific file under it; that seemed too prescriptive about data layout and harder to configure. A concrete example of the contract is sketched below.
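Concretely, the contract looks something like this (the reader keys and paths below are made up for illustration):

```python
import json

# Keys must match the names of the wrapped readers; paths are hypothetical.
file_path = json.dumps({
    "ner": "/data/ner/train.jsonl",
    "parsing": "/data/ptb/train.trees",
})
# reader.read(file_path) then hands each wrapped reader its own path.
```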

I believe that with these pieces most of the multitask things that S2 research wants to do should be relatively easy. (Notably, with the file-path-as-JSON-dict innovation, we can just use the usual `Trainer`.)

FYI @amandalynne

@joelgrus joelgrus requested a review from DeNeutoy Jan 16, 2019

joelgrus added some commits Jan 16, 2019

@DeNeutoy (Contributor) left a comment

LGTM

```python
from allennlp.data.instance import Instance
from allennlp.data.iterators.data_iterator import DataIterator

@DataIterator.register("homogeneous-batch")
```

@DeNeutoy (Contributor) commented Jan 17, 2019

By convention this should be registered as `homogeneous_batch`.

```python
        If false, it will do the tensorization anew each iteration.
    track_epoch : ``bool``, optional, (default = False)
        If true, each instance will get a ``MetadataField`` containing the epoch number.
    partition_key : ``str``, optional, (default = "dataset")
```

@DeNeutoy (Contributor) commented Jan 17, 2019
This is a bit gross; I wonder if it's better to allow setting an "origin" attribute on an `Instance` or something. Maybe not for this PR.

@joelgrus (Author, Contributor) commented Jan 17, 2019

It needs to make it into the model, so if you did that you'd have to change all the "batch to tensor" logic to account for it, and then make sure it doesn't somehow collide with other tensors, and so on.
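For readers following along, here is a rough sketch of the grouping idea under discussion (not the PR's implementation; the function name is made up, and `partition_key="dataset"` follows the docstring above):

```python
from collections import defaultdict

def homogeneous_batches(instances, batch_size, partition_key="dataset"):
    """Group instances by their MetadataField value, batching within each group."""
    buckets = defaultdict(list)
    for instance in instances:
        origin = instance.fields[partition_key].metadata
        buckets[origin].append(instance)
        if len(buckets[origin]) == batch_size:
            yield buckets.pop(origin)
    # Flush any partially filled buckets at the end.
    yield from (bucket for bucket in buckets.values() if bucket)
```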

joelgrus added some commits Jan 17, 2019

@joelgrus joelgrus merged commit 385e66e into allenai:master Jan 17, 2019

3 checks passed:

- Pull Requests (AllenNLP Library): TeamCity build finished
- codecov/patch: 92% of diff hit (target 90%)
- codecov/project: 92% (-1%) compared to ae63a2a