[API] Better handle case when backoff is not possible in TokenEmbedding #459
Conversation
Job PR-459/1 is complete.
Codecov Report

@@            Coverage Diff             @@
##           master     #459      +/-   ##
==========================================
+ Coverage   72.42%   72.69%   +0.26%
==========================================
  Files         113      113
  Lines        9788     9616     -172
==========================================
- Hits         7089     6990      -99
+ Misses       2699     2626      -73
Codecov Report

@@            Coverage Diff             @@
##           master     #459      +/-   ##
==========================================
+ Coverage   70.05%   70.13%   +0.08%
==========================================
  Files         122      122
  Lines       10461    10485      +24
==========================================
+ Hits         7328     7354      +26
+ Misses       3133     3131       -2
Updated to also remove the example of caching explicitly.
Job PR-459/3 is complete.
The changes make sense, though since it's an API change we need to make sure that we clearly document it in the release note.
Change makes sense. +1 on Sheng's comment
embedding[idx_to_token] = model[idx_to_token]
# If there are any remaining tokens we may precompute
if idx_to_token:
    with utils.print_time('compute vectors from subwords '
Is this util only available for embedding training? It looks very useful and should probably be added to the gluonnlp API.
It's only in the scripts folder. It's very simple, but I agree it may be useful to add to the main API:

from contextlib import contextmanager
import logging
import time

@contextmanager
def print_time(task):
    """Log the wall-clock time taken by the enclosed block."""
    start_time = time.time()
    logging.info('Starting to %s', task)
    yield
    logging.info('Finished %s in %.2f seconds', task, time.time() - start_time)
Do you have any suggestions on the logging format used here?
Sorry for the late reply. The format looks good. I think this can be added to the src/gluonnlp/utils folder.
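For illustration, a minimal usage sketch of print_time, relying on the definition quoted above (the task name, the sleep stand-in, and the logging setup are just examples):

import logging
import time

logging.basicConfig(level=logging.INFO)

with print_time('compute vectors from subwords'):  # print_time as defined above
    time.sleep(0.1)  # stand-in for the actual work being timed

# Logs roughly:
#   INFO:root:Starting to compute vectors from subwords
#   INFO:root:Finished compute vectors from subwords in 0.10 seconds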
I edited #459 (comment) so that the "Changes" entry better explains the API change.
Besides, since this is an API change, it may be hard for people who do not have background knowledge of unknown_lookup to understand the current descriptions. Can you add some examples to explain the change?
@@ -198,14 +198,16 @@ def enforce_max_size(token_embedding, size):

     enforce_max_size(token_embedding_, args_.max_vocab_size)
     known_tokens = set(token_embedding_.idx_to_token)
-    # Auto-extend token_embedding with unknown extra eval tokens
+    # Extend token_embedding with unknown extra eval tokens
Add a period.
We should follow PEP 8 https://www.python.org/dev/peps/pep-0008/#inline-comments
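For reference, the inline-comment convention from that section of PEP 8 (at least two spaces before the #, and a comment that reads as a sentence):

x = x + 1  # Compensate for border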
LGTM
@@ -173,23 +173,15 @@ class TokenEmbedding(object):
        tokens can be updated.
    unknown_lookup : object subscriptable with list of tokens returning nd.NDarray, default None
        If not None, unknown_lookup[tokens] is called for any unknown tokens.
The description of unknown_lookup may be a bit vague for people without background information. Can you improve it with better illustrations?
I added some more comments.
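For readers without that background, a toy sketch of what an unknown_lookup can be: any object that is subscriptable with a list of tokens and returns an NDArray of vectors. The class below is a hypothetical stand-in, not the real fastText-based lookup or the exact docstring wording:

import mxnet as mx

class ToySubwordLookup:
    """Hypothetical stand-in for a subword model such as fastText."""

    def __init__(self, dim=5):
        self.dim = dim

    def __getitem__(self, tokens):
        if isinstance(tokens, str):
            tokens = [tokens]
        # A real lookup would compute vectors from character n-grams;
        # here every token simply maps to a zero vector of width self.dim.
        return mx.nd.zeros((len(tokens), self.dim))

# With such an object attached as unknown_lookup, looking up a token that is
# not in the embedding backs off to the lookup, and explicit extension is
# spelled: embedding[new_tokens] = embedding.unknown_lookup[new_tokens]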
Job PR-459/4 is complete.
@@ -307,8 +311,7 @@ def set_embedding(self, *embeddings):
     new_embedding._token_to_idx = self.token_to_idx
     new_embedding._idx_to_token = self.idx_to_token

-    new_vec_len = sum(embs.idx_to_vec.shape[1] for embs in embeddings
-                      if embs and embs.idx_to_vec is not None)
+    new_vec_len = sum(embs.idx_to_vec.shape[1] for embs in embeddings)
@astonzhang I am removing if embs and embs.idx_to_vec is not None. It does not make sense to call set_embedding with an embedding that does not have embs.idx_to_vec, as we do not know the embedding dimensionality. Instead of silently ignoring the error, I added an assertion above. The consequence of silently ignoring it is that new_vec_len may be 0 and consequently vocab.embedding.idx_to_vec.shape == (X, 0), which is invalid. See also the test_vocab_set_embedding_with_subword_lookup_only_token_embedding test below.
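A toy illustration of that failure mode, using plain numpy and illustrative shapes rather than the actual Vocab code:

import numpy as np

# Pretend every embedding passed to set_embedding had idx_to_vec == None and
# was silently skipped, as under the old behaviour.
valid_embeddings = []

# The sum over an empty generator is 0.
new_vec_len = sum(e.idx_to_vec.shape[1] for e in valid_embeddings)

idx_to_vec = np.zeros((4, new_vec_len))
print(idx_to_vec.shape)  # (4, 0): an embedding matrix with zero-width vectors, i.e. invalid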
LGTM
Job PR-459/5 is complete.
Job PR-459/6 is complete.
@leezu seems that the tests are failing.
Rebased and fixed the test.
Job PR-459/12 is complete.
@astonzhang do you have any other suggestions?
[API] Better handle case when backoff is not possible in TokenEmbedding (dmlc#459)

* Handle case when backoff is not possible
* Remove unknown_autoextend
* Improve doc
* Improve Vocab error checking if any idx_to_vec is None in set_embedding
* Improve path detection for fastText bin models
* Small refactor of evaluate_pretrained.py
* Add missing import
* Fix test
Description

Previously, if TokenEmbedding.unknown_lookup was set, it was always used, even for words whose vectors could not be inferred by unknown_lookup. This would raise a KeyError. But if TokenEmbedding.unknown_token exists, the unknown vector should be returned instead (see the sketch after the Changes list). Tests are added.

Checklist

Essentials

Changes

* Automatic extension of a TokenEmbedding via unknown_lookup is removed (the TokenEmbedding.unknown_autoextend option is gone). If you want to extend a TokenEmbedding, be explicit instead: embedding[new_tokens] = embedding.unknown_lookup[new_tokens]. Extending a TokenEmbedding assigns new indices to the previously unknown words and saves their vectors, as computed by the unknown_lookup, to the embedding.idx_to_vec matrix.
* All embeddings passed to vocab.set_embedding must now be valid, i.e. embedding.idx_to_vec is not None. Invalid embeddings were silently ignored before, which could lead to vocab.embedding being invalid after calling vocab.set_embedding.
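To make the behaviour change concrete, here is a simplified sketch of the intended lookup order in plain Python. It is not the actual TokenEmbedding implementation; the function and variable names are illustrative:

def lookup(vectors, token, unknown_lookup=None, unknown_vector=None):
    """Simplified order: known vector -> unknown_lookup backoff ->
    unknown_token vector -> KeyError only when no fallback exists."""
    if token in vectors:
        return vectors[token]
    if unknown_lookup is not None:
        try:
            return unknown_lookup[token]
        except KeyError:
            pass  # Backoff is not possible for this token.
    if unknown_vector is not None:
        return unknown_vector
    raise KeyError(token)

vectors = {'hello': [1.0, 2.0]}
# Backoff is impossible (empty lookup), but an unknown vector exists,
# so the unknown vector is returned instead of raising KeyError:
print(lookup(vectors, 'unseen-token', unknown_lookup={}, unknown_vector=[0.0, 0.0]))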