This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[API] Better handle case when backoff is not possible in TokenEmbedding #459

Merged
astonzhang merged 8 commits into dmlc:master from leezu:backoff on Jan 9, 2019

Conversation

@leezu (Contributor) commented Dec 8, 2018

Description

Previously, if TokenEmbedding.unknown_lookup was set, it was always used, even for words whose vectors could not be inferred by unknown_lookup, which raised a KeyError. Instead, if TokenEmbedding.unknown_token exists, the unknown vector should be returned. Tests are added.
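
For illustration, a minimal sketch of the intended lookup order (simplified, not the actual gluonnlp implementation; the attribute names follow the TokenEmbedding API):

# Simplified sketch of the lookup order after this PR.
# `embedding` stands for a gluonnlp TokenEmbedding instance.
def lookup_vector(embedding, token):
    if token in embedding.token_to_idx:
        # Known token: return its stored vector.
        return embedding.idx_to_vec[embedding.token_to_idx[token]]
    if embedding.unknown_lookup is not None and token in embedding.unknown_lookup:
        # Backoff: infer a vector, e.g. from subword units.
        return embedding.unknown_lookup[[token]][0]
    if embedding.unknown_token is not None:
        # No backoff possible: fall back to the unknown token's vector.
        return embedding.idx_to_vec[embedding.token_to_idx[embedding.unknown_token]]
    raise KeyError(token)  # previously raised even when unknown_token existed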

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Only look up unknown words via TokenEmbedding.unknown_lookup if the word is contained in TokenEmbedding.unknown_lookup; fall back to TokenEmbedding.unknown_token otherwise
  • Remove automatic extension of TokenEmbedding when looking up unknown words (TokenEmbedding.unknown_autoextend). To extend a TokenEmbedding, be explicit instead: embedding[new_tokens] = embedding.unknown_lookup[new_tokens] (see the sketch after this list). Extending a TokenEmbedding assigns new indices to the previously unknown words and saves their vectors, as computed by the unknown_lookup, to the embedding.idx_to_vec matrix
  • Make sure all embeddings passed to vocab.set_embedding are valid, i.e., embedding.idx_to_vec is not None. Invalid embeddings were silently ignored before, which could leave vocab.embedding invalid after calling vocab.set_embedding
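
A hedged sketch of the explicit extension pattern (eval_tokens is a placeholder name; assumes the embedding allows extension):

# Explicitly extend the embedding with vectors computed by unknown_lookup.
new_tokens = [t for t in eval_tokens if t not in embedding.token_to_idx]
embedding[new_tokens] = embedding.unknown_lookup[new_tokens]
# The tokens now have indices and rows in embedding.idx_to_vec.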

@mli (Member) commented Dec 8, 2018

Job PR-459/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/1/index.html

@codecov bot commented Dec 8, 2018

Codecov Report

Merging #459 into master will increase coverage by 0.26%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #459      +/-   ##
==========================================
+ Coverage   72.42%   72.69%   +0.26%     
==========================================
  Files         113      113              
  Lines        9788     9616     -172     
==========================================
- Hits         7089     6990      -99     
+ Misses       2699     2626      -73
Flag         Coverage Δ
#PR435       ?
#PR456       ?
#PR459       72.69% <100%> (?)
#master      ?
#notserial   47.41% <100%> (+0.69%) ⬆️
#py2         72.46% <100%> (+0.26%) ⬆️
#py3         72.54% <100%> (+0.26%) ⬆️
#serial      58.47% <33.33%> (-0.02%) ⬇️

@codecov bot commented Dec 8, 2018

Codecov Report

Merging #459 into master will increase coverage by 0.08%.
The diff coverage is 98.3%.

@@            Coverage Diff             @@
##           master     #459      +/-   ##
==========================================
+ Coverage   70.05%   70.13%   +0.08%     
==========================================
  Files         122      122              
  Lines       10461    10485      +24     
==========================================
+ Hits         7328     7354      +26     
+ Misses       3133     3131       -2
Flag         Coverage Δ
#PR459       70.13% <98.3%> (?)
#master      ?
#notserial   46.69% <90.9%> (+0.02%) ⬆️
#py2         69.9% <98.3%> (+0.13%) ⬆️
#py3         69.94% <98.3%> (+0.04%) ⬆️
#serial      54.83% <94.91%> (+0.14%) ⬆️

@leezu (Contributor, Author) commented Dec 8, 2018

Updated to also remove the unknown_autoextend feature. Its motivation was to allow easy caching of the computed vectors of unknown words. But explicit caching is easy too, and explicit is better than implicit, so the implicit option is removed to keep things simple.

Example of caching explicitly: embedding[new_tokens] = embedding.unknown_lookup[new_tokens]. Note that unknown_autoextend was disabled by default before.
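
A sketch of the explicit caching pattern (corpus_tokens is a placeholder name):

# Compute vectors for the unknown tokens once and store them in idx_to_vec;
# subsequent lookups are then plain matrix indexing with no recomputation.
unknown = [t for t in corpus_tokens if t not in embedding.token_to_idx]
if unknown:
    embedding[unknown] = embedding.unknown_lookup[unknown]
vectors = embedding[corpus_tokens]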

@mli (Member) commented Dec 8, 2018

Job PR-459/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/3/index.html

@szha added the API change label Dec 9, 2018

@szha (Member) commented Dec 9, 2018

The changes make sense, though since it's an API change we need to make sure that we clearly document it in the release note.

@szha requested a review from astonzhang, December 9, 2018 02:23
@eric-haibin-lin (Member) left a comment

The change makes sense. +1 on Sheng's comment.

embedding[idx_to_token] = model[idx_to_token]
# If there are any remaining tokens we may precompute
if idx_to_token:
    with utils.print_time('compute vectors from subwords '
Member:

Is this util only available for embedding training? It looks very useful and should probably be added to the gluonnlp API.

Contributor (Author):

It's only in the scripts folder. It's very simple, but I agree it may be useful to add to the main API.

from contextlib import contextmanager
import logging
import time

@contextmanager
def print_time(task):
    start_time = time.time()
    logging.info('Starting to %s', task)
    yield
    logging.info('Finished to %s in %.2f seconds', task,
                 time.time() - start_time)
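
For illustration, a usage sketch (embedding and unknown_tokens are placeholder names):

# Hypothetical usage: time how long a subword-based lookup takes.
with print_time('compute vectors from subwords'):
    vectors = embedding.unknown_lookup[unknown_tokens]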

Do you have any suggestions on the logging format used here?

Member:

Sorry for the late reply. The format looks good. I think this can be added to the src/gluonnlp/utils folder

@eric-haibin-lin changed the title from "Better handle case when backoff is not possible in TokenEmbedding" to "[API] Better handle case when backoff is not possible in TokenEmbedding", Dec 10, 2018
@leezu (Contributor, Author) commented Dec 10, 2018

I edited #459 (comment) so that the "Changes" entry better explains the API change.

@astonzhang (Member) left a comment

Since this is an API change, the current descriptions may be hard to understand for people who do not have background knowledge of unknown_lookup. Can you add some examples to explain the change?

@@ -198,14 +198,16 @@ def enforce_max_size(token_embedding, size):

 enforce_max_size(token_embedding_, args_.max_vocab_size)
 known_tokens = set(token_embedding_.idx_to_token)
-# Auto-extend token_embedding with unknown extra eval tokens
+# Extend token_embedding with unknown extra eval tokens
Member:

Add a period.

Member:

LGTM

@@ -173,23 +173,15 @@ class TokenEmbedding(object):
tokens can be updated.
unknown_lookup : object subscriptable with list of tokens returning nd.NDarray, default None
If not None, unknown_lookup[tokens] is called for any unknown tokens.
Member:

The description of unknown_lookup may be a bit vague for people without background information. Can you improve it with better illustrations?

Contributor (Author):

I added some more comments.
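
For instance, a minimal sketch of the interface an unknown_lookup object needs (illustrative only, not code from this PR):

from mxnet import nd

class ConstantLookup:
    """Toy unknown_lookup: claims to know any lowercase token and
    returns a constant vector for each; the dimension is illustrative."""
    dim = 300

    def __contains__(self, token):
        return token.islower()

    def __getitem__(self, tokens):
        # List of tokens -> NDArray of shape (len(tokens), dim).
        return nd.zeros((len(tokens), self.dim))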

@mli (Member) commented Dec 11, 2018

Job PR-459/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/4/index.html

@@ -307,8 +311,7 @@ def set_embedding(self, *embeddings):
         new_embedding._token_to_idx = self.token_to_idx
         new_embedding._idx_to_token = self.idx_to_token

-        new_vec_len = sum(embs.idx_to_vec.shape[1] for embs in embeddings
-                          if embs and embs.idx_to_vec is not None)
+        new_vec_len = sum(embs.idx_to_vec.shape[1] for embs in embeddings)
Contributor (Author):

@astonzhang I am removing if embs and embs.idx_to_vec is not None. It does not make sense to call set_embedding with an embedding that does not have embs.idx_to_vec, as we do not know the embedding dimensionality. Instead of silently ignoring the error, I added an assertion above.

A consequence of silently ignoring it is that new_vec_len may be 0 and consequently vocab.embedding.idx_to_vec.shape == (X, 0), which is invalid. See also the test_vocab_set_embedding_with_subword_lookup_only_token_embedding test below.
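
A simplified sketch of the stricter check (not the exact code added in the PR):

# Fail loudly instead of silently skipping embeddings without idx_to_vec.
for embs in embeddings:
    assert embs is not None and embs.idx_to_vec is not None, \
        'set_embedding requires embeddings with an initialized idx_to_vec; ' \
        'otherwise the embedding dimensionality is unknown'
new_vec_len = sum(embs.idx_to_vec.shape[1] for embs in embeddings)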

Member:

LGTM

@mli (Member) commented Dec 11, 2018

Job PR-459/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/5/index.html

@mli (Member) commented Dec 11, 2018

Job PR-459/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/6/index.html

@szha (Member) commented Jan 4, 2019

@leezu it seems that the tests are failing.

@leezu (Contributor, Author) commented Jan 8, 2019

Rebased and fixed the test.

@mli (Member) commented Jan 8, 2019

Job PR-459/12 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/12/index.html

@leezu (Contributor, Author) commented Jan 9, 2019

@astonzhang do you have any other suggestions?

@astonzhang merged commit f4275c0 into dmlc:master Jan 9, 2019
@leezu deleted the backoff branch January 9, 2019 10:51
paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019
…ng (dmlc#459)

* Handle case when backoff is not possible

* Remove unknown_autoextend

* Improve doc

* Improve Vocab error checking if any idx_to_vec is None in set_embedding

* Improve path detection for fastText bin models

* Small refactor of evaluate_pretrained.py

* Add missing import

* Fix test