
Fix handling of "datasets_for_vocab_creation" param #4350

Merged: @epwalsh merged 10 commits into allenai:master from epwalsh:fix-data-loading on Jun 12, 2020

Conversation

@epwalsh (Member) commented Jun 10, 2020

This fixes a couple of bugs related to the datasets_for_vocab_creation config parameter.

The main bug is that datasets_for_vocab_creation: [] ends up being treated like datasets_for_vocab_creation: null, due to this generator expression in commands.train.TrainModel.from_partial_objects:

        instance_generator = (
            instance
            for key, dataset in datasets.items()
            if not datasets_for_vocab_creation or key in datasets_for_vocab_creation
            for instance in dataset
        )

which should be

        instance_generator = (
            instance
            for key, dataset in datasets.items()
            if datasets_for_vocab_creation is None or key in datasets_for_vocab_creation
            for instance in dataset
        )
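To make the distinction concrete, here is a minimal, self-contained sketch (the dataset names are made up for illustration) of why the original condition treats an empty list like null: an empty list is falsy, so not datasets_for_vocab_creation evaluates to True for [] exactly as it does for None.

    datasets = {"train": ["a", "b"], "validation": ["c"]}

    for datasets_for_vocab_creation in (None, [], ["train"]):
        included = [
            key
            for key in datasets
            if not datasets_for_vocab_creation or key in datasets_for_vocab_creation
        ]
        print(datasets_for_vocab_creation, "->", included)

    # None      -> ['train', 'validation']   (intended: use every dataset)
    # []        -> ['train', 'validation']   (the bug: should be no datasets)
    # ['train'] -> ['train']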

The second bug occurs when training.util.vocab_from_params(...) is called: even if datasets_for_vocab_creation is set to [], all of the datasets are initialized, which means that with non-lazy readers you read all of the data unnecessarily.
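A rough sketch of the kind of guard that addresses this (the helper and reader objects here are hypothetical, not the actual allennlp code): only call .read() on the datasets the vocab will actually use.

    from typing import Dict, Iterable, List, Optional

    def instances_for_vocab(
        readers: Dict[str, "DatasetReader"],  # hypothetical reader objects with a .read() method
        data_paths: Dict[str, str],
        datasets_for_vocab_creation: Optional[List[str]],
    ) -> Iterable:
        # None means "use every dataset"; an explicit [] means "use none",
        # in which case no reader's .read() is ever called.
        keys = list(data_paths) if datasets_for_vocab_creation is None else datasets_for_vocab_creation
        for key in keys:
            yield from readers[key].read(data_paths[key])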

@epwalsh changed the title from Fix data loading to Fix handling of "datasets_for_vocab_creation" param on Jun 10, 2020
@dirkgr (Member) commented Jun 11, 2020

This is exactly the kind of bug that makes me stay away from Python falsiness in virtually all cases.

Review thread on allennlp/training/util.py:

        )
        # Do a quick sanity check here. There's no need to load any datasets if the vocab
        # type is "empty".
        if datasets_for_vocab_creation is None and vocab_params.get("type") == "empty":

@epwalsh (Member, Author):

I'm not crazy about this ad hoc solution here, but I don't see another way. At least I added a test for this case to make sure we don't get a regression in the future.

Contributor:

A possible alternative is to build a lazy generator here that doesn't actually call .read() on anything unless it's needed. Not sure how feasible that is, though.
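A sketch of what that alternative might look like (names are illustrative, mirroring the snippet from the PR description): since generator expressions are evaluated lazily, .read() is only reached for datasets the vocab actually iterates over.

    instance_generator = (
        instance
        for key, reader in dataset_readers.items()
        if datasets_for_vocab_creation is None or key in datasets_for_vocab_creation
        for instance in reader.read(data_paths[key])
    )
    # Nothing has been read yet. If the vocab type is "empty", the generator
    # is simply never consumed and no dataset gets loaded.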

Contributor:

One obvious concern is that it requires changes to datasets_from_params too, in ways that change the return type of that method. I'm pretty sure, though, that this is the only use of datasets_from_params, and this function is only ever called on the master process in a distributed setting; in every other configuration we go through a different code path.

@epwalsh (Member, Author):

It's also called from the find_learning_rate command, but that's it.

@epwalsh (Member, Author):

This is another reason why I think readers should always be lazy. Leave the lazy/non-lazy decision up to the dataset.

Contributor:

Yeah, I can see that. Where do you make the lazy/non-lazy decision, then? You have to make a choice once you get to the DataLoader, because of how batching and sampling work. Do you just base the decision of which dataset to use on how you've decided to do data loading?

@epwalsh (Member, Author):

I was thinking we'd either make that decision part of the DataLoader API, or have a separate configuration for the type of Dataset to use.

I like the first option because lazy vs. non-lazy is literally a data-loading issue. The downside, though, is that we'd be breaking from the PyTorch API. But having to separately configure a DatasetReader, Dataset, and DataLoader seems overboard.
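For illustration, a bare-bones sketch of what folding that choice into the DataLoader could look like (purely hypothetical, not the allennlp or PyTorch API): the loader decides whether to materialize the dataset up front or stream it on each pass.

    from typing import Iterator, List

    class LazyAwareDataLoader:
        def __init__(self, reader, data_path: str, lazy: bool = False, batch_size: int = 32):
            self._reader = reader
            self._data_path = data_path
            self._batch_size = batch_size
            # Non-lazy: materialize everything up front so batching/sampling
            # can shuffle or index freely. Lazy: re-read from the reader each epoch.
            self._cached = None if lazy else list(reader.read(data_path))

        def __iter__(self) -> Iterator[List]:
            instances = self._cached if self._cached is not None else self._reader.read(self._data_path)
            batch: List = []
            for instance in instances:
                batch.append(instance)
                if len(batch) == self._batch_size:
                    yield batch
                    batch = []
            if batch:
                yield batch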

@schmmd added this to the 1.0.0 milestone Jun 12, 2020
@epwalsh merged commit 87c23e4 into allenai:master Jun 12, 2020
@epwalsh deleted the fix-data-loading branch June 12, 2020 19:36
@elkotito:

Lol, I discovered this some time ago, but I had thought it was on purpose. @epwalsh What is the best way to avoid building the vocabulary when I use prebuilt transformers? Should I use datasets_for_vocab_creation: [], or is there something else?

@epwalsh (Member, Author) commented Jun 29, 2020

@mateuszpieniak just using {vocab: {type: empty}} will work now.
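For anyone landing here later, a minimal config sketch of that suggestion (the reader, model, and paths are placeholders; in full configs the top-level section is spelled vocabulary, which the comment above abbreviates as vocab):

    {
      "dataset_reader": {"type": "my_reader"},   // placeholder reader
      "train_data_path": "data/train.jsonl",     // placeholder path
      "model": {"type": "my_model"},             // placeholder model
      "data_loader": {"batch_size": 32},
      "trainer": {"optimizer": {"type": "adam"}, "num_epochs": 5},
      "vocabulary": {"type": "empty"}            // no data is read to build the vocab
    }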
