This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Multitask example #4898

Merged
83 commits merged on Jan 19, 2021
Commits
83 commits
cc083d0
Make the VQA reader work for the other datasets
dirkgr Nov 26, 2020
5333901
Also find pngs
dirkgr Nov 26, 2020
8808b7d
Really support pngs
dirkgr Nov 26, 2020
37cc69c
Remove debug code
dirkgr Nov 26, 2020
4668bb1
More logging
dirkgr Nov 26, 2020
1844709
Unexpected formatting
dirkgr Nov 26, 2020
4d1cd4e
Respect the device
dirkgr Nov 26, 2020
2351649
This is how you replace things in named tuples.
dirkgr Nov 26, 2020
9c6e505
Remove unused import
dirkgr Nov 26, 2020
3fe3bd3
This is how you override a method properly.
dirkgr Nov 26, 2020
1ef85c5
This is how you set parameters in detectron.
dirkgr Nov 26, 2020
1f1abe4
Also set the device for the region detector
dirkgr Nov 26, 2020
53ba25c
Training configs for all three datasets contained in VQA
dirkgr Nov 30, 2020
947735e
Bigger batches
dirkgr Nov 30, 2020
c561870
Bigger batches for image processing
dirkgr Nov 30, 2020
9cd8cb1
Fix vilbert-from-huggingface config
dirkgr Dec 1, 2020
8432834
Make the config switch modes for constructing vocab
dirkgr Dec 1, 2020
4db41b0
Merge remote-tracking branch 'origin/vision' into OtherVQAData2
dirkgr Dec 1, 2020
9cea943
More vocab, more docs, better way of deriving vocab
dirkgr Dec 2, 2020
1eeeed5
Modernize the from_huggingface config
dirkgr Dec 2, 2020
11fd67c
More updates to the from_huggingface config
dirkgr Dec 2, 2020
7aa7e0e
Better hyperparameters stolen from another project
dirkgr Dec 2, 2020
08a498c
Fix for inverted parameter
dirkgr Dec 2, 2020
a12769c
Formatting
dirkgr Dec 2, 2020
16f1d12
Throw a meaningful error message when we don't have images
dirkgr Dec 2, 2020
b363af5
Add a warning that includes instructions for how to fix things
dirkgr Dec 2, 2020
bcdc63c
Merge branch 'vision' into OtherVQAData2
dirkgr Dec 2, 2020
639d043
Remove unused script
dirkgr Dec 2, 2020
a55f141
Merge branch 'OtherVQAData2' of https://github.com/allenai/allennlp i…
dirkgr Dec 2, 2020
184c45f
Merge issue
dirkgr Dec 2, 2020
a1367cf
Adds named splits to the SNLI-VE reader
dirkgr Dec 3, 2020
16dc3c4
Make the multitask data loader discoverable
dirkgr Dec 4, 2020
e37dd4a
Formatting
dirkgr Dec 4, 2020
ac1a2d9
More flexible inputs to the dataset readers
dirkgr Dec 4, 2020
0a9ac4e
Prototype config for the multitask training job
dirkgr Dec 4, 2020
866d43d
Merge remote-tracking branch 'origin/vision' into MultitaskExample
dirkgr Dec 16, 2020
6aa7e0e
Merge remote-tracking branch 'origin/vision' into MultitaskExample
dirkgr Jan 4, 2021
ff35247
json_lines_from_file() already calls cached_path()
dirkgr Jan 6, 2021
626c509
Visual entailment should track accuracy
dirkgr Jan 6, 2021
67f85d5
Switching to torch
dirkgr Jan 6, 2021
140bbd3
Fixing VE image paths
dirkgr Jan 6, 2021
8b7668e
Formatting
dirkgr Jan 6, 2021
a633e67
Experimentally use threaded_generator to read instances from readers …
dirkgr Jan 6, 2021
67b942b
Vilbert backbone
dirkgr Jan 6, 2021
54c8a46
Fixed paths
dirkgr Jan 6, 2021
ddb0501
Formatting
dirkgr Jan 6, 2021
5e30634
Merge remote-tracking branch 'origin/main' into MultitaskExample
dirkgr Jan 8, 2021
42d0a65
Adds heads
dirkgr Jan 8, 2021
73111ad
Revert "Experimentally use threaded_generator to read instances from …
dirkgr Jan 8, 2021
74c8637
Multitask trains now!
dirkgr Jan 12, 2021
6a79238
Remove useless parameter from GQA reader
dirkgr Jan 12, 2021
85c2329
Updated multitask config
dirkgr Jan 12, 2021
493712f
Schedulers produce batches, not instances
dirkgr Jan 13, 2021
6dac87d
Track multiple metrics
dirkgr Jan 13, 2021
0858d8d
Make mypy happy
dirkgr Jan 13, 2021
1c064b7
Formatting
dirkgr Jan 13, 2021
ce20246
Keep better track of which heads have been called
dirkgr Jan 13, 2021
58b6745
Merge remote-tracking branch 'origin/vision' into MultitaskExample
dirkgr Jan 13, 2021
0a4db8f
Fix the merge
dirkgr Jan 13, 2021
9ae0555
We have more than strings for input
dirkgr Jan 13, 2021
7d420bb
Remove unused imports
dirkgr Jan 13, 2021
79baa38
-1 is CPU
dirkgr Jan 13, 2021
4b52a56
Go back to tracking instances per epoch so that the samplers can work
dirkgr Jan 14, 2021
a9ddd17
Better error message
dirkgr Jan 14, 2021
50baedc
A useful sampler to have
dirkgr Jan 14, 2021
da29019
We haven't indexed until we've indexed
dirkgr Jan 14, 2021
25dad23
Makes tests pass
dirkgr Jan 14, 2021
1258dd0
Formatting
dirkgr Jan 14, 2021
0da384f
Fine-tuning the metric tracker
dirkgr Jan 14, 2021
6e7030b
Update model configs for my changes
dirkgr Jan 14, 2021
0d011a3
Fixing model configs for Akshita's changes
dirkgr Jan 14, 2021
7165edf
Implement VisionTextModel in terms of VilbertBackbone
dirkgr Jan 14, 2021
8ccb259
Formatting
dirkgr Jan 14, 2021
8a5d8cf
Merge remote-tracking branch 'origin/vision' into MultitaskExample
dirkgr Jan 14, 2021
b926016
Fix stale comment
dirkgr Jan 14, 2021
18a6772
Use the server paths by default, not Dirk's desktop
dirkgr Jan 14, 2021
931af0d
Fix tests
dirkgr Jan 14, 2021
9c0e950
Merge branch 'MultitaskExample' of https://github.com/allenai/allennl…
dirkgr Jan 14, 2021
f11a0f4
Formatting again
dirkgr Jan 14, 2021
440c7aa
Merge branch 'vision' into MultitaskExample
dirkgr Jan 16, 2021
e771934
Merge branch 'vision' into MultitaskExample
epwalsh Jan 19, 2021
bd1a2f1
Removed data loader parameters that don't exist anymore
dirkgr Jan 19, 2021
7c145b8
Clarified comment
dirkgr Jan 19, 2021
9 changes: 3 additions & 6 deletions allennlp/commands/train.py
@@ -483,9 +483,6 @@ def _train_worker(
return None


DataPath = Union[str, List[str], Dict[str, str]]


class TrainModel(Registrable):
"""
This class exists so that we can easily read a configuration file with the `allennlp train`
@@ -557,16 +554,16 @@ def from_partial_objects(
serialization_dir: str,
local_rank: int,
dataset_reader: DatasetReader,
train_data_path: DataPath,
train_data_path: Any,
dirkgr (Member, Author): Inputs for multitask are dicts. Some other readers do lists.
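For illustration, here are the shapes train_data_path can now take under the Any annotation; all paths and task names below are made up.

# Illustrative only -- paths and task names are hypothetical.
train_data_path = "/data/vqa/train.jsonl"                           # most readers: a single path
train_data_path = ["/data/train-00.jsonl", "/data/train-01.jsonl"]  # some readers: a list of files
train_data_path = {                                                 # MultiTaskDatasetReader: one path per task,
    "vqa": "/data/vqa/train",                                       # keyed like its underlying readers
    "ve": "/data/snli_ve/train.jsonl",
}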

model: Lazy[Model],
data_loader: Lazy[DataLoader],
trainer: Lazy[Trainer],
vocabulary: Lazy[Vocabulary] = Lazy(Vocabulary),
datasets_for_vocab_creation: List[str] = None,
validation_dataset_reader: DatasetReader = None,
validation_data_path: DataPath = None,
validation_data_path: Any = None,
validation_data_loader: Lazy[DataLoader] = None,
test_data_path: DataPath = None,
test_data_path: Any = None,
evaluate_on_test: bool = False,
batch_weight_key: str = "",
) -> "TrainModel":
2 changes: 1 addition & 1 deletion allennlp/data/data_loaders/multiprocess_data_loader.py
@@ -231,7 +231,7 @@ def __init__(
raise ValueError("max_instances_in_memory must be at least 1")

self.reader = reader
self.data_path = str(data_path)
self.data_path = data_path
self.batch_size = batch_size
self.drop_last = drop_last
self.shuffle = shuffle
151 changes: 59 additions & 92 deletions allennlp/data/data_loaders/multitask_data_loader.py
@@ -1,18 +1,16 @@
from typing import Any, Dict, Iterable, Iterator, List, Union, Optional
from typing import Any, Dict, Iterable, Iterator, Union, Optional
import itertools
import math

import torch
from overrides import overrides

from allennlp.common import util
from allennlp.data.dataset_readers.dataset_reader import DatasetReader, DatasetReaderInput
from allennlp.data.batch import Batch
from allennlp.data.data_loaders.data_loader import DataLoader, TensorDict
from allennlp.data.data_loaders.multiprocess_data_loader import MultiProcessDataLoader
from allennlp.data.data_loaders.multitask_scheduler import (
MultiTaskScheduler,
HomogeneousRoundRobinScheduler,
)
from allennlp.data.data_loaders.multitask_scheduler import MultiTaskScheduler
from allennlp.data.data_loaders.multitask_epoch_sampler import MultiTaskEpochSampler
from allennlp.data.dataset_readers.multitask import MultiTaskDatasetReader
from allennlp.data.instance import Instance
@@ -54,10 +52,6 @@ class MultiTaskDataLoader(DataLoader):
data_path: `Dict[str, str]`
One file per underlying dataset reader in the `MultiTaskDatasetReader`, which will be passed
to those readers to construct one `DataLoader` per dataset.
batch_size: `int`
The number of instances (from any dataset) that should be combined together into a single
batch. See also the `batch_size_multiplier` argument for additional control over exactly
how batch size is computed.
scheduler: `MultiTaskScheduler`, optional (default = `HomogeneousRoundRobinScheduler`)
The `scheduler` determines how instances are ordered within an epoch. By default, we'll
select one batch of instances from each dataset in turn, trying to ensure as uniform a mix
@@ -71,25 +65,9 @@
for an epoch, this `sampler` will tell us with what proportion we should sample from each
dataset. For instance, we might want to focus more on datasets that are underperforming in
some way, by having those datasets contribute more instances this epoch than other datasets.
batch_size_multiplier: `Dict[str, float]`, optional (default = `None`)
If this is not `None`, it specifies how much of the batch an instance from each dataset
takes up. That is, if this is 1 for every dataset (which is the default), then batch size
is computed as normal. If dataset "A" has a value of 1.5 in this dictionary, than each
instance from dataset "A" counts as 1.5 instances for the purposes of computing batch size.
This option is available to you to account for the fact that some operations might be *much*
less costly than others (e.g., if you are multitasking a coref model with a simple document
classification model). If you use this, you're on your own as far as figuring out how it
interacts with optimization behavior.
instances_per_epoch: `int`, optional (default = `None`)
If not `None`, we will use this many instances per epoch of training, drawing from the
underlying datasets with proportions given by the `scheduler`. Note that this is
_instances_, not _batches_, because if you're using batch size multipliers we don't know how
many batches the instances specified by the `scheduler` will turn out to be.
drop_last: `bool`, optional (default = `False`)
If this is `True`, we will not return the last batch if it is smaller than `batch_size`.
Note that this is kind of nonsensical to use if you're using `batch_size_multipliers`, as
you are not guaranteed to get an optimal packing, so you will likely have batches that don't
fill up the `batch_size` in that case, anyway.
underlying datasets according to the `sampler`.
num_workers: `Dict[str, int]`, optional (default = `None`)
Used when creating one `MultiProcessDataLoader` per dataset. If you want non-default
behavior for this parameter in the `DataLoader` for a particular dataset, pass the
@@ -126,13 +104,10 @@ def __init__(
self,
reader: MultiTaskDatasetReader,
data_path: Dict[str, str],
batch_size: int,
scheduler: MultiTaskScheduler,
dirkgr (Member, Author): The biggest change here is that the MultiTaskScheduler is now responsible for making batches, not the data loader. When the data loader makes the batches but the scheduler only sorts the instances, we can't guarantee things like homogeneous batches, so I changed it.

dirkgr (Member, Author): This also means that the scheduler is no longer optional. You have to specify one; otherwise we don't even know a batch size.
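To make the new contract concrete, here is a minimal sketch of a scheduler that builds batches itself, implementing the batch_instances/count_batches interface the loader calls below. It is an assumed illustration, not the HomogeneousRoundRobinScheduler shipped in this PR.

from typing import Dict, Iterable, List
import itertools
import math

from allennlp.data.instance import Instance
from allennlp.data.data_loaders.multitask_scheduler import MultiTaskScheduler


class ToyRoundRobinScheduler(MultiTaskScheduler):
    """Cycles through the datasets, emitting homogeneous fixed-size batches."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size

    def batch_instances(
        self, epoch_instances: Dict[str, Iterable[Instance]]
    ) -> Iterable[List[Instance]]:
        iterators = {name: iter(instances) for name, instances in epoch_instances.items()}
        while iterators:
            # Take one batch from each dataset that still has instances left.
            for name in list(iterators):
                batch = list(itertools.islice(iterators[name], self.batch_size))
                if not batch:
                    del iterators[name]  # this dataset is exhausted
                    continue
                yield batch

    def count_batches(self, dataset_counts: Dict[str, int]) -> int:
        # One batch per batch_size instances, rounded up per dataset.
        return sum(math.ceil(count / self.batch_size) for count in dataset_counts.values())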

*,
scheduler: MultiTaskScheduler = None,
sampler: MultiTaskEpochSampler = None,
instances_per_epoch: int = None,
batch_size_multiplier: Dict[str, float] = None,
drop_last: bool = False,
num_workers: Dict[str, int] = None,
max_instances_in_memory: Dict[str, int] = None,
start_method: Dict[str, str] = None,
@@ -143,7 +118,7 @@
) -> None:
self.readers = reader.readers
self.data_paths = data_path
self.scheduler = scheduler or HomogeneousRoundRobinScheduler(batch_size=batch_size)
self.scheduler = scheduler
self.sampler = sampler
self.cuda_device: Optional[torch.device] = None
if cuda_device is not None:
@@ -152,20 +127,12 @@
else:
self.cuda_device = cuda_device

self._batch_size = batch_size
self._instances_per_epoch = instances_per_epoch
self._batch_size_multiplier = batch_size_multiplier or {}
for multiplier in self._batch_size_multiplier.values():
if multiplier > batch_size:
raise ValueError(
f"Multiplier value ({multiplier}) is larger than batch size ({batch_size})"
)
self._drop_last = drop_last
self._shuffle = shuffle

if instances_per_epoch is not None and sampler is None:
raise ValueError(
"You must provide an EpochSampler if you want to not use all instances every epoch"
"You must provide an EpochSampler if you want to not use all instances every epoch."
)

self._num_workers = num_workers or {}
@@ -176,8 +143,8 @@

if self.readers.keys() != self.data_paths.keys():
raise ValueError(
f"Mismatch between readers ({self.readers.keys()}) and data paths"
" ({self.data_paths.keys()})"
f"Mismatch between readers ({self.readers.keys()}) and data paths "
f"({self.data_paths.keys()})"
)
self._loaders = {key: self._make_data_loader(key) for key in self.readers}

@@ -197,7 +164,7 @@ def __init__(
# for the loader variable, so a _different_ loader gets saved for every iterator.
# Dictionary comprehensions don't create new scopes in python. If you don't have
# this loader, you end up with `loader` always referring to the last loader in the
# iteration... mypy also doesn't know what to do with this, for some reason I can't
# iteration... mypy also doesn't know what to do with this, for some reason I can't
# figure out.
lambda l=loader: maybe_shuffle_instances(l, self._shuffle) # type: ignore
)
@@ -206,56 +173,27 @@

@overrides
def __len__(self) -> int:
if self._instances_per_epoch is not None:
return self._instances_per_epoch

# Here we try to estimate the actual length. If you are using varying batch size
# multipliers per task, we may get batch packing orders that make this an underestimate, as
# this assumes that all batches are full, which may not be true.
total_instances = 0.0
for key, loader in self._loaders.items():
if self._instances_per_epoch is None:
# This will raise a TypeError if any of the underlying loaders doesn't have a length,
# which is actually what we want. If the loader has a length, we set batch_size = 1, so
# this will give us the right number of instances.
total_instances += self._batch_size_multiplier.get(key, 1.0) * len(loader)
if self._drop_last or total_instances % self._batch_size == 0:
return int(total_instances) // self._batch_size
# which is actually what we want.
return self.scheduler.count_batches(
{dataset: len(loader) for dataset, loader in self._loaders.items()}
)
dirkgr (Member, Author), Jan 14, 2021, on lines +179 to +181: Because schedulers now make batches, only schedulers know how many batches they are making, so the responsibility for counting is devolved to them.
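Continuing the toy scheduler sketched above, __len__ now reduces to a count the scheduler derives from per-loader instance totals; the dataset names and sizes here are made up.

toy = ToyRoundRobinScheduler(batch_size=32)
sizes = {"vqa": 10_000, "ve": 4_000}          # i.e. {dataset: len(loader)} for each underlying loader
assert toy.count_batches(sizes) == 313 + 125  # ceil(10000/32) + ceil(4000/32) = 438 batches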

else:
return int(1 + total_instances) // self._batch_size
return self.scheduler.count_batches(
{dataset: self._instances_per_epoch for dataset in self._loaders.keys()}
)

@overrides
def __iter__(self) -> Iterator[TensorDict]:
# Basic outline: first we _sample_ the instances that we're going to be using for this
# epoch, which relies on the scheduler if `self._instances_per_epoch` is not None. This is
# basically just saying how many instances we should use this epoch for each dataset, and we
# grab bounded-length iterators over that many instances for each dataset. Second, we
# _schedule_ the epoch's instances into a single list, again relying on the scheduler.
# Finally, we take that combined list and yield `batch_size` batches from it.
epoch_instances = self._get_instances_for_epoch()
scheduled_instances = self.scheduler.order_epoch_instances(epoch_instances)
batch_instances: List[Instance] = []
current_batch_size = 0.0
for dataset, instance in scheduled_instances:
current_batch_size += self._batch_size_multiplier.get(dataset, 1.0)
if current_batch_size > self._batch_size:
batch = Batch(batch_instances)
tensor_dict = batch.as_tensor_dict()
if self.cuda_device is not None:
tensor_dict == nn_util.move_to_device(tensor_dict, self.cuda_device)
yield tensor_dict
batch_instances = [instance]
current_batch_size = self._batch_size_multiplier.get(dataset, 1.0)
else:
batch_instances.append(instance)

# Based on how we yield batches above, we are guaranteed to always have leftover instances,
# so we don't need a check for that here.
if not self._drop_last or current_batch_size == self._batch_size:
batch = Batch(batch_instances)
tensor_dict = batch.as_tensor_dict()
if self.cuda_device is not None:
tensor_dict == nn_util.move_to_device(tensor_dict, self.cuda_device)
yield tensor_dict
return (
nn_util.move_to_device(
Member: Calling move_to_device here isn't necessary as long as we pass the target device through to self._loaders in set_target_device().

dirkgr (Member, Author): That sounds correct. I will check.

Member: Oh, never mind. This is definitely necessary.

dirkgr (Member, Author): No, wait. We read Instances from the child loaders, not batches. Instances don't have devices. This is the place where we make batches, and only batches have devices, so this is where we have to set the device.

Batch(instances).as_tensor_dict(),
-1 if self.cuda_device is None else self.cuda_device,
)
for instances in self.scheduler.batch_instances(epoch_instances)
)

@overrides
def iter_instances(self) -> Iterator[Instance]:
@@ -271,7 +209,7 @@ def iter_instances(self) -> Iterator[Instance]:
# complex, configurable scheduling.
#
# The underlying data loaders here could be using multiprocessing; we don't need to worry
# about that in this class. Caching is also handled by the underlying data loaders.
# about that in this class. Caching is also handled by the underlying data loaders.
for loader in self._loaders.values():
yield from loader.iter_instances()

@@ -292,9 +230,9 @@ def _get_instances_for_epoch(self) -> Dict[str, Iterable[Instance]]:
}
if self.sampler is None:
# We already checked for this in the constructor, so this should never happen unless you
# modified the object after creation. But mypy is complaining, so here's another check.
# modified the object after creation. But mypy is complaining, so here's another check.
raise ValueError(
"You must specify an EpochSampler if self._instances_per_epoch is not None"
"You must specify an EpochSampler if self._instances_per_epoch is not None."
)
dataset_proportions = self.sampler.get_task_proportions(self._loaders)
proportion_sum = sum(dataset_proportions.values())
Expand All @@ -309,9 +247,9 @@ def _get_instances_for_epoch(self) -> Dict[str, Iterable[Instance]]:

def _make_data_loader(self, key: str) -> MultiProcessDataLoader:
Member: Lines 259-262 can be removed. Those params to the data loader don't exist anymore.

dirkgr (Member, Author): Done.

kwargs: Dict[str, Any] = {}
kwargs["reader"] = self.readers[key]
kwargs["reader"] = _MultitaskDatasetReaderShim(self.readers[key], key)
kwargs["data_path"] = self.data_paths[key]
kwargs["batch_size"] = 1
kwargs["batch_size"] = 1 # So that the loader gives us one instance at a time.
Member: We only use it to call iter_instances(), right? So batch_size won't matter.

dirkgr (Member, Author): True, but we have to set something. I clarified the comment.

if key in self._num_workers:
kwargs["num_workers"] = self._num_workers[key]
if key in self._max_instances_in_memory:
Expand All @@ -323,3 +261,32 @@ def _make_data_loader(self, key: str) -> MultiProcessDataLoader:
if key in self._instance_chunk_size:
kwargs["instance_chunk_size"] = self._instance_chunk_size[key]
return MultiProcessDataLoader(**kwargs)


@DatasetReader.register("multitask_shim")
class _MultitaskDatasetReaderShim(DatasetReader):
"""This dataset reader wraps another dataset reader and adds the name of the "task" into
each instance as a metadata field. This exists only to support `MultitaskDataLoader`. You
should not have to use this yourself."""

def __init__(self, inner: DatasetReader, head: str, **kwargs):
super().__init__(**kwargs)
self.inner = inner
self.head = head

def read(self, file_path: DatasetReaderInput) -> Iterator[Instance]:
from allennlp.data.fields import MetadataField

for instance in self.inner.read(file_path):
instance.add_field("task", MetadataField(self.head))
yield instance

def text_to_instance(self, *inputs) -> Instance:
from allennlp.data.fields import MetadataField

instance = self.inner.text_to_instance(*inputs)
instance.add_field("task", MetadataField(self.head))
return instance

def apply_token_indexers(self, instance: Instance) -> None:
self.inner.apply_token_indexers(instance)
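
A small usage sketch of what the shim adds to each instance. The wrapped reader, head name, and file path are placeholders; in practice the shim is constructed internally by MultiTaskDataLoader rather than by user code.

from allennlp.data.dataset_readers import TextClassificationJsonReader

# Hypothetical wrapping of an ordinary reader under the task name "ve".
wrapped = _MultitaskDatasetReaderShim(TextClassificationJsonReader(), head="ve")
for instance in wrapped.read("snli_ve_train.jsonl"):  # hypothetical path
    # Every instance now carries the name of the head it should be routed to.
    assert instance.fields["task"].metadata == "ve"
    break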
17 changes: 17 additions & 0 deletions allennlp/data/data_loaders/multitask_epoch_sampler.py
@@ -46,6 +46,23 @@ def get_task_proportions(self, data_loaders: Mapping[str, DataLoader]) -> Dict[s
return {key: 1 / len(data_loaders) for key in data_loaders}


@MultiTaskEpochSampler.register("weighted")
class WeightedSampler(MultiTaskEpochSampler):
"""
Returns a weighted distribution over datasets at every epoch, where every
task has a weight.

Registered as a `MultiTaskEpochSampler` with name "weighted".
"""

def __init__(self, weights: Dict[str, float]):
self.weights = weights

def get_task_proportions(self, data_loaders: Mapping[str, DataLoader]) -> Dict[str, float]:
total = sum(self.weights[task] for task in data_loaders.keys())
return {task: self.weights[task] / total for task in data_loaders.keys()}
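
A quick check of the proportion math. Only the keys of the mapping are consulted, so placeholders stand in for the real DataLoader objects; the task names and weights are made up.

sampler = WeightedSampler({"vqa": 3.0, "ve": 1.0})
proportions = sampler.get_task_proportions({"vqa": None, "ve": None})  # type: ignore[arg-type]
assert proportions == {"vqa": 0.75, "ve": 0.25}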


@MultiTaskEpochSampler.register("proportional")
class ProportionalSampler(MultiTaskEpochSampler):
"""