
Add support for extending Embeddings/Embedders with Extended Vocabulary. #2374

Merged: 27 commits into allenai:master on Jan 18, 2019

Conversation

@HarshTrivedi (Member) commented on Jan 16, 2019:

Fixes #2363.

@HarshTrivedi (Member, Author) commented on Jan 17, 2019:

Should we make vocab expansion the default in fine-tune? The default was changed to not expand in #1623, due to #1596, which this PR fixes. I am fine either way.

@matt-gardner (Member) commented on Jan 17, 2019:

Until we fix the general problem of loading pretrained embeddings from an archived file, I'd vote to keep the default as it is.

@matt-gardner (Member) left a review:

Looks really good, thanks for doing this! These are mostly minor style / documentation fixes, except for moving a test to the right place.

# config_file used: str(self.FIXTURES_ROOT / 'decomposable_attention' / 'experiment.json')
# dataset used: str(self.FIXTURES_ROOT / 'data' / 'snli.jsonl')

original_weight = trained_model._text_field_embedder.token_embedder_tokens.weight # pylint: disable=protected-access

@matt-gardner (Member) commented on Jan 17, 2019:

In a test file, it's fine to put the # pylint: disable=protected-access at the top, so you don't have to repeat it throughout. For this particular file, because it's technically a test helper and not a test file, I might put it at the top of the function, at least.

Oh, wait, this won't get run as you expect. I'm not sure if it's running at all; if it is, it's probably running once for every model test that we have, which is not what you want. [I just checked the build log, and yes, it's getting run many times.] Just make a new model_test.py file in the right place, and put this in it (and then the pylint command can go at the top of the file).

@HarshTrivedi (Author, Member) commented on Jan 17, 2019:

Ok, I see. Will fix this. 👍
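For context, here is a minimal sketch of what such a standalone model_test.py could look like. Only the idea of a dedicated test file with the pylint directive at the top comes from the review; the class name, test name, and body are hypothetical.

# pylint: disable=protected-access
# Hypothetical sketch of a standalone model_test.py; not the PR's actual test.
from allennlp.common.testing import AllenNlpTestCase


class TestModelVocabExtension(AllenNlpTestCase):
    def test_embedding_weight_grows_after_vocab_extension(self):
        # Load a small fixture model, extend its vocabulary with unseen
        # tokens, and assert that the embedding matrix gained rows while
        # the original rows kept their weights.
        ...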

extended vocabulary (original + new-data vocabulary)
"""
for _, module in self._modules.items():
if isinstance(module, TextFieldEmbedder):

@matt-gardner (Member) commented on Jan 17, 2019:

It might be nice to also check if the module is an Embedding, in case someone used an embedding layer directly.

@HarshTrivedi (Author, Member) commented on Jan 17, 2019:

Sure 👍
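As a rough illustration of the suggestion above, a minimal sketch (not the merged implementation) of walking a model's submodules and extending any embedding layers found. The helper name and the recursive modules() walk are assumptions.

import torch

from allennlp.data import Vocabulary
from allennlp.modules.token_embedders import Embedding


def extend_embedder_vocab(model: torch.nn.Module, extended_vocab: Vocabulary) -> None:
    # Because modules() recurses through all submodules, matching the concrete
    # Embedding class catches both an Embedding used directly and embeddings
    # nested inside a TextFieldEmbedder.
    for module in model.modules():
        if isinstance(module, Embedding):
            module.extend_vocab(extended_vocab)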

construct it in the original training. We store vocab_namespace used during the original
training as an attribute, so that it can be retrieved during fine-tuning.
Caveat: This would prevent models trained before this commit to be unextendedable.
Any other way to know the namespace used by token_embedder in original training?
Returns

@matt-gardner (Member) commented on Jan 17, 2019:

Replace the blank line here after the parameters.

@HarshTrivedi (Author, Member) commented on Jan 17, 2019:

👍

@matt-gardner (Member) commented on Jan 18, 2019:

Looks like you forgot this one.

@HarshTrivedi (Author, Member) commented on Jan 18, 2019:

Oops, I thought you were talking about the blank line before """ and not after """. Anyway, it's fixed now.

@matt-gardner (Member) commented on Jan 18, 2019:

Sorry I wasn't clear - I meant the line right before the line I commented on. There should be a blank line before the "Returns" block. I'm pretty sure sphinx is picky about this.

@HarshTrivedi (Author, Member) commented on Jan 18, 2019:

Ok, I see. Sure, no problem. Fixed now.
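To make the convention concrete, here is an illustrative numpydoc-style docstring (not the PR's actual text) showing the blank line before the Returns header that sphinx expects; the parameter descriptions are placeholders.

def extend_vocab(self, extended_vocab, vocab_namespace: str = None):
    """
    Extends the embedding matrix to cover the tokens in ``extended_vocab``.

    Parameters
    ----------
    extended_vocab : ``Vocabulary``
        The extended vocabulary (original + new-data vocabulary).
    vocab_namespace : ``str``, optional (default=``None``)
        Namespace of ``extended_vocab`` to use for the extension.

    Returns
    -------
    None
    """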

extend the embedding-matrix, it's necessary to know which vocab_namespace was used to
construct it in the original training. We store vocab_namespace used during the original
training as an attribute, so that it can be retrieved during fine-tuning.
Caveat: This would prevent models trained before this commit to be unextendedable.

@matt-gardner (Member) commented on Jan 17, 2019:

This caveat belongs in a TODO in a comment, not in the documentation.

@HarshTrivedi (Author, Member) commented on Jan 17, 2019:

I don't understand. I meant that earlier we didn't store _vocab_namespace as an attribute, so if we want to extend an old model, there is no way to find the original namespace used for its Embedding objects. So it's not extendable (and won't be extendable even in the future), unless the archived model's embedding uses the default namespace 'tokens'. I will move the caveat to a comment, but it should still be a caveat, and not a TODO, right?

@matt-gardner (Member) commented on Jan 17, 2019:

I guess I was looking at the question below. That question is definitely a TODO, or a PR comment. Maybe it could just be removed. We could do crazy things to try to recover it (like look through the original config file), but that's not worth the effort and complexity.

Then the caveat can stay there, but you can't reference "this commit" in a docstring or a comment, because it doesn't mean anything. Maybe: "Added in version 0.9; models trained with 0.8.1 or below need special treatment when extending their vocabularies."

@HarshTrivedi (Author, Member) commented on Jan 17, 2019:

Ohh right, my bad! The question below was only there to remind me to have this discussion with you, just to check whether there is an obvious way of knowing the vocab_namespace used that I am missing. I was going to remove it before merging anyway. About the caveat line, I will change the reference from "this commit" to a version.

@overrides
def extend_vocab(self, extended_vocab: Vocabulary, vocab_namespace: Optional[str] = None):
if not vocab_namespace:
vocab_namespace = getattr(self, "_vocab_namespace", None)

@matt-gardner (Member) commented on Jan 17, 2019:

Can't you just do self._vocab_namespace? This is still backwards compatible, because an old model will still use the current constructor, which sets self._vocab_namespace = None.

@HarshTrivedi (Author, Member) commented on Jan 17, 2019:

Ohh right! This would still be backward compatible, will fix. 👍

@matt-gardner (Member) commented on Jan 17, 2019:

This also means you can just do vocab_namespace = vocab_namespace or self._vocab_namespace


if not vocab_namespace:
vocab_namespace = "tokens"
logging.warning("Neither original training store vocab_namespace for Embedding class "

@matt-gardner (Member) commented on Jan 17, 2019:

Wording tweak: just say "No vocab_namespace provided to Embedder.extend_vocab. Defaulting to 'tokens'."

@HarshTrivedi (Author, Member) commented on Jan 17, 2019:

Sure 👍
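Putting the last few suggestions together, a minimal sketch of how Embedding.extend_vocab might resolve the namespace, assuming the constructor stores self._vocab_namespace (None for models trained before this change). This is illustrative, not the exact merged code.

import logging

from allennlp.data import Vocabulary

logger = logging.getLogger(__name__)


def extend_vocab(self, extended_vocab: Vocabulary, vocab_namespace: str = None) -> None:
    # Prefer the caller's namespace, then the one recorded at original training time.
    vocab_namespace = vocab_namespace or self._vocab_namespace
    if not vocab_namespace:
        # Archives from before this change never stored a namespace.
        vocab_namespace = "tokens"
        logger.warning("No vocab_namespace provided to Embedder.extend_vocab. "
                       "Defaulting to 'tokens'.")
    extended_num_embeddings = extended_vocab.get_vocab_size(vocab_namespace)
    # ... grow self.weight to extended_num_embeddings rows here, copying the
    # original rows and randomly initialising the new ones ...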

@@ -140,6 +149,25 @@ def forward(self, inputs): # pylint: disable=arguments-differ
embedded = projection(embedded)
return embedded

@overrides
def extend_vocab(self, extended_vocab: Vocabulary, vocab_namespace: Optional[str] = None):

@matt-gardner (Member) commented on Jan 17, 2019:

The Optional annotation on vocab_namespace is redundant, because you immediately follow it with = None, which means exactly the same thing. Just do vocab_namespace: str = None.

@HarshTrivedi (Author, Member) commented on Jan 17, 2019:

Ok, sure. I've made this mistake in a couple of places. Will fix them.
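For reference, the two spellings being discussed; both accept None, with the shorter one relying on the implicit-Optional convention this review asks for. The function names here are just placeholders.

from typing import Optional

from allennlp.data import Vocabulary


def extend_vocab_explicit(extended_vocab: Vocabulary,
                          vocab_namespace: Optional[str] = None) -> None:
    ...  # Optional spelled out; redundant once the default is None


def extend_vocab_implicit(extended_vocab: Vocabulary,
                          vocab_namespace: str = None) -> None:
    ...  # the shorter spelling suggested in the review; same meaning here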

Resolved: allennlp/modules/token_embedders/embedding.py
@@ -35,6 +37,10 @@ def forward(self, token_characters: torch.Tensor) -> torch.Tensor: # pylint: di
mask = (token_characters != 0).long()
return self._dropout(self._encoder(self._embedding(token_characters), mask))

@overrides
def extend_vocab(self, extended_vocab: Vocabulary, vocab_namespace: Optional[str] = "token_characters"):

@matt-gardner (Member) commented on Jan 17, 2019:

Again here I'd remove the Optional annotation.

@@ -24,3 +27,10 @@ def get_output_dim(self) -> int:
token. This is `not` the shape of the returned tensor, but the last element of that shape.
"""
raise NotImplementedError

def extend_vocab(self, extended_vocab: Vocabulary, vocab_namespace: Optional[str]):

@matt-gardner (Member) commented on Jan 17, 2019:

Here you say it's optional, but you don't actually give a default of None. Just remove the Optional.

@HarshTrivedi (Author, Member) commented on Jan 17, 2019:

Actually, I intended to make it optional; it's a typo. The tests are not giving an error because the overridden methods fix it.

@HarshTrivedi (Member, Author) commented on Jan 17, 2019:

@matt-gardner, this is ready for another look.

@HarshTrivedi (Member, Author) commented on Jan 17, 2019:

Minor update: I updated Embedding's extend_vocab to allow a pretrained_file (in a separate branch). I'll keep that in a separate PR, though.

@matt-gardner (Member) left a review:

Looks great, thanks! One very minor fix, and then this is good to merge.

construct it in the original training. We store vocab_namespace used during the original
training as an attribute, so that it can be retrieved during fine-tuning.
Caveat: This would prevent models trained before this commit to be unextendedable.
Any other way to know the namespace used by token_embedder in original training?
Returns

@matt-gardner (Member) commented on Jan 18, 2019:

Looks like you forgot this one.

HarshTrivedi added some commits Jan 18, 2019

@matt-gardner merged commit 2d29736 into allenai:master on Jan 18, 2019.

3 checks passed:
Pull Requests (AllenNLP Library): TeamCity build finished
codecov/patch: 91% of diff hit (target 90%)
codecov/project: 92% (-1%) compared to 7cc8ba0
@matt-gardner (Member) commented on Jan 18, 2019:

Thanks!
