This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[API] Better handle case when backoff is not possible in TokenEmbedding #459

Merged
astonzhang merged 8 commits into dmlc:master from leezu:backoff on Jan 9, 2019

Conversation

@leezu (Contributor) commented Dec 8, 2018

Description

Previously, if TokenEmbedding.unknown_lookup was set, it was always used, even for words whose vectors could not be inferred by unknown_lookup, which raised a KeyError. Instead, if TokenEmbedding.unknown_token exists, the unknown vector should be returned. Tests are added.
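
For illustration, a minimal sketch of the intended lookup order (simplified, not the actual gluonnlp implementation; the attribute names follow the TokenEmbedding API):

# Simplified sketch of the lookup order after this PR.
# `embedding` stands for a gluonnlp TokenEmbedding instance.
def lookup_vector(embedding, token):
    if token in embedding.token_to_idx:
        # Known token: return its stored vector.
        return embedding.idx_to_vec[embedding.token_to_idx[token]]
    if embedding.unknown_lookup is not None and token in embedding.unknown_lookup:
        # Backoff: infer a vector, e.g. from subword units.
        return embedding.unknown_lookup[[token]][0]
    if embedding.unknown_token is not None:
        # No backoff possible: fall back to the unknown token's vector.
        return embedding.idx_to_vec[embedding.token_to_idx[embedding.unknown_token]]
    raise KeyError(token)  # previously raised even when unknown_token existed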

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Only look up unknown words via TokenEmbedding.unknown_lookup if the word is contained in TokenEmbedding.unknown_lookup; fall back to TokenEmbedding.unknown_token otherwise
  • Remove automatic extension of TokenEmbedding when looking up unknown words (TokenEmbedding.unknown_autoextend). To extend a TokenEmbedding, be explicit instead: embedding[new_tokens] = embedding.unknown_lookup[new_tokens] (see the sketch after this list). Extending a TokenEmbedding assigns new indices to the previously unknown words and saves their vectors, as computed by the unknown_lookup, to the embedding.idx_to_vec matrix
  • Make sure all embeddings passed to vocab.set_embedding are valid, i.e., embedding.idx_to_vec is not None. Invalid embeddings were silently ignored before, which could leave vocab.embedding invalid after calling vocab.set_embedding
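
A hedged sketch of the explicit extension pattern (eval_tokens is a placeholder name; assumes the embedding allows extension):

# Explicitly extend the embedding with vectors computed by unknown_lookup.
new_tokens = [t for t in eval_tokens if t not in embedding.token_to_idx]
embedding[new_tokens] = embedding.unknown_lookup[new_tokens]
# The tokens now have indices and rows in embedding.idx_to_vec.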

@mli (Member) commented Dec 8, 2018

Job PR-459/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/1/index.html

@codecov bot commented Dec 8, 2018

Codecov Report

Merging #459 into master will increase coverage by 0.26%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #459      +/-   ##
==========================================
+ Coverage   72.42%   72.69%   +0.26%     
==========================================
  Files         113      113              
  Lines        9788     9616     -172     
==========================================
- Hits         7089     6990      -99     
+ Misses       2699     2626      -73
Flag         Coverage Δ
#PR435       ?
#PR456       ?
#PR459       72.69% <100%> (?)
#master      ?
#notserial   47.41% <100%> (+0.69%) ⬆️
#py2         72.46% <100%> (+0.26%) ⬆️
#py3         72.54% <100%> (+0.26%) ⬆️
#serial      58.47% <33.33%> (-0.02%) ⬇️

@codecov bot commented Dec 8, 2018

Codecov Report

Merging #459 into master will increase coverage by 0.08%.
The diff coverage is 98.3%.

@@            Coverage Diff             @@
##           master     #459      +/-   ##
==========================================
+ Coverage   70.05%   70.13%   +0.08%     
==========================================
  Files         122      122              
  Lines       10461    10485      +24     
==========================================
+ Hits         7328     7354      +26     
+ Misses       3133     3131       -2
Flag         Coverage Δ
#PR459       70.13% <98.3%> (?)
#master      ?
#notserial   46.69% <90.9%> (+0.02%) ⬆️
#py2         69.9% <98.3%> (+0.13%) ⬆️
#py3         69.94% <98.3%> (+0.04%) ⬆️
#serial      54.83% <94.91%> (+0.14%) ⬆️

@leezu (Contributor, Author) commented Dec 8, 2018

Updated to also remove the unknown_autoextend feature. Its motivation was to allow easy caching of the computed vectors of unknown words. But explicit caching is easy too, and explicit is better than implicit, so the implicit option is removed to keep things simple.

Example of caching explicitly: embedding[new_tokens] = embedding.unknown_lookup[new_tokens]. Note that unknown_autoextend was disabled by default before.
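
A sketch of the explicit caching pattern (corpus_tokens is a placeholder name):

# Compute vectors for the unknown tokens once and store them in idx_to_vec;
# subsequent lookups are then plain matrix indexing with no recomputation.
unknown = [t for t in corpus_tokens if t not in embedding.token_to_idx]
if unknown:
    embedding[unknown] = embedding.unknown_lookup[unknown]
vectors = embedding[corpus_tokens]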

@mli (Member) commented Dec 8, 2018

Job PR-459/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/3/index.html

@szha added the API change label Dec 9, 2018

@szha (Member) commented Dec 9, 2018

The changes make sense, though since it's an API change we need to make sure that we clearly document it in the release note.

@szha requested a review from astonzhang, December 9, 2018 02:23
@eric-haibin-lin (Member) left a comment

The change makes sense. +1 on Sheng's comment.

embedding[idx_to_token] = model[idx_to_token]
# If there are any remaining tokens we may precompute
if idx_to_token:
    with utils.print_time('compute vectors from subwords '
Member:

Is this util only available for embedding training? It looks very useful and should probably be added to the gluonnlp API.

Contributor (Author):

It's only in the scripts folder. It's very simple, but I agree it may be useful to add to the main API.

from contextlib import contextmanager
import logging
import time

@contextmanager
def print_time(task):
    start_time = time.time()
    logging.info('Starting to %s', task)
    yield
    logging.info('Finished to %s in %.2f seconds', task,
                 time.time() - start_time)
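
For illustration, a usage sketch (embedding and unknown_tokens are placeholder names):

# Hypothetical usage: time how long a subword-based lookup takes.
with print_time('compute vectors from subwords'):
    vectors = embedding.unknown_lookup[unknown_tokens]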

Do you have any suggestions on the logging format used here?

Member:

Sorry for the late reply. The format looks good. I think this can be added to the src/gluonnlp/utils folder

@eric-haibin-lin changed the title from "Better handle case when backoff is not possible in TokenEmbedding" to "[API] Better handle case when backoff is not possible in TokenEmbedding", Dec 10, 2018
@leezu (Contributor, Author) commented Dec 10, 2018

I edited #459 (comment) so that the "Changes" entry better explains the API change.

@astonzhang (Member) left a comment

Since this is an API change, the current descriptions may be hard to understand for people who do not have background knowledge of unknown_lookup. Can you add some examples to explain the change?

@@ -198,14 +198,16 @@ def enforce_max_size(token_embedding, size):

 enforce_max_size(token_embedding_, args_.max_vocab_size)
 known_tokens = set(token_embedding_.idx_to_token)
-# Auto-extend token_embedding with unknown extra eval tokens
+# Extend token_embedding with unknown extra eval tokens
Member:

Add a period.

Member:

LGTM

@@ -173,23 +173,15 @@ class TokenEmbedding(object):
tokens can be updated.
unknown_lookup : object subscriptable with list of tokens returning nd.NDarray, default None
If not None, unknown_lookup[tokens] is called for any unknown tokens.
Member:

The description of unknown_lookup may be a bit vague for people without background information. Can you improve it with better illustrations?

Contributor (Author):

I added some more comments.
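
For instance, a minimal sketch of the interface an unknown_lookup object needs (illustrative only, not code from this PR):

from mxnet import nd

class ConstantLookup:
    """Toy unknown_lookup: claims to know any lowercase token and
    returns a constant vector for each; the dimension is illustrative."""
    dim = 300

    def __contains__(self, token):
        return token.islower()

    def __getitem__(self, tokens):
        # List of tokens -> NDArray of shape (len(tokens), dim).
        return nd.zeros((len(tokens), self.dim))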

@mli (Member) commented Dec 11, 2018

Job PR-459/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/4/index.html

@@ -307,8 +311,7 @@ def set_embedding(self, *embeddings):
         new_embedding._token_to_idx = self.token_to_idx
         new_embedding._idx_to_token = self.idx_to_token

-        new_vec_len = sum(embs.idx_to_vec.shape[1] for embs in embeddings
-                          if embs and embs.idx_to_vec is not None)
+        new_vec_len = sum(embs.idx_to_vec.shape[1] for embs in embeddings)
Contributor (Author):

@astonzhang I am removing if embs and embs.idx_to_vec is not None. It does not make sense to call set_embedding with an embedding that does not have embs.idx_to_vec, as we do not know the embedding dimensionality. Instead of silently ignoring the error, I added an assertion above.

A consequence of silently ignoring it is that new_vec_len may be 0 and consequently vocab.embedding.idx_to_vec.shape == (X, 0), which is invalid. See also the test_vocab_set_embedding_with_subword_lookup_only_token_embedding test below.
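
A simplified sketch of the stricter check (not the exact code added in the PR):

# Fail loudly instead of silently skipping embeddings without idx_to_vec.
for embs in embeddings:
    assert embs is not None and embs.idx_to_vec is not None, \
        'set_embedding requires embeddings with an initialized idx_to_vec; ' \
        'otherwise the embedding dimensionality is unknown'
new_vec_len = sum(embs.idx_to_vec.shape[1] for embs in embeddings)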

Member:

LGTM

@mli (Member) commented Dec 11, 2018

Job PR-459/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/5/index.html

@mli (Member) commented Dec 11, 2018

Job PR-459/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/6/index.html

@szha (Member) commented Jan 4, 2019

@leezu it seems that the tests are failing.

@leezu (Contributor, Author) commented Jan 8, 2019

Rebased and fixed the test.

@mli (Member) commented Jan 8, 2019

Job PR-459/12 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-459/12/index.html

@leezu (Contributor, Author) commented Jan 9, 2019

@astonzhang do you have any other suggestions?

@astonzhang merged commit f4275c0 into dmlc:master Jan 9, 2019
@leezu deleted the backoff branch January 9, 2019 10:51
paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019
…ng (dmlc#459)

* Handle case when backoff is not possible

* Remove unknown_autoextend

* Improve doc

* Improve Vocab error checking if any idx_to_vec is None in set_embedding

* Improve path detection for fastText bin models

* Small refactor of evaluate_pretrained.py

* Add missing import

* Fix test