
Word embeddings update #159

Merged
szha merged 38 commits into dmlc:master from leezu:wembtrainingfixes on Jul 1, 2018

Conversation

@leezu (Contributor) commented Jun 22, 2018

Description

This PR contains a few improvements for word embedding training and inference.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Support masking of accidental hits in UnigramCandidateSampler (moved to scripts)
  • TokenEmbedding.__setitem__ now allows setting vectors for new/unknown tokens by default. This can be disabled by setting allow_extend=False.
  • TokenEmbedding supports unknown_lookup and unknown_autoextend arguments. If an unknown token is encountered and unknown_lookup is specified, unknown_lookup[tokens] is called to obtain an embedding for the unknown token. If unknown_autoextend is True, a new index is assigned to the token and the embedding is saved in the TokenEmbedding. (A usage sketch follows this list.)
  • Remove EmbeddingModel.to_token_embedding and introduce EmbeddingModel.__getitem__, making trainable EmbeddingModels valid arguments for TokenEmbedding's unknown_lookup.
  • Add load_fasttext_format to FasttextEmbeddingModel, which supports reading a model.bin file created by the facebookresearch/fastText library.
  • Change the ngram hash function for non-ASCII ngrams to match the facebookresearch/fastText implementation compiled with GCC/clang on x86 (previously it matched the version compiled on ARM).
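
A minimal usage sketch of these changes (the model path, the exact form of the load_fasttext_format call, the constructor-argument placement, and the vector dimensionality are illustrative assumptions based on the description above, not verified against the final code):

```python
import mxnet as mx
import gluonnlp as nlp

# Trainable fastText model read from a model.bin produced by
# facebookresearch/fastText; the path is a placeholder.
model = nlp.model.train.FasttextEmbeddingModel.load_fasttext_format('wiki.simple.bin')

# An EmbeddingModel now supports __getitem__ and can therefore act as
# unknown_lookup: out-of-vocabulary tokens are resolved via
# unknown_lookup[tokens]; with unknown_autoextend=True the resulting vector
# is stored in the TokenEmbedding under a newly assigned index.
embedding = nlp.embedding.TokenEmbedding(unknown_lookup=model,
                                         unknown_autoextend=True)

# __setitem__ can now also extend the vocabulary with new tokens
# (controlled by the allow_extend setting of the TokenEmbedding).
embedding['brand-new-token'] = mx.nd.zeros(300)  # 300 = assumed vector size
vec = embedding['never-seen-token']              # resolved via unknown_lookup
```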

Comments

@leezu leezu requested a review from szha as a code owner June 22, 2018 15:20
@leezu leezu force-pushed the wembtrainingfixes branch 2 times, most recently from 8fb84d4 to 6168374 Compare June 22, 2018 16:54


@numba_njit
def _candidates_mask(negatives, true_samples, true_samples_mask):
@leezu (Contributor, author):

@eric-haibin-lin you mentioned you also need the accidental hits masking feature. Please take a look at whether this would work for you and let me know if you have any suggestions.

Member:

For my case my true and negatives are all reshaped to 1-D, and I just do a simple broadcast comparison.
https://github.com/apache/incubator-mxnet/blob/master/example/rnn/large_word_lm/model.py#L115-L118
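
A rough sketch of that broadcast-comparison approach (names and shapes here are illustrative, not code from this PR or from the linked example):

```python
import mxnet as mx

def broadcast_hit_mask(negatives, true_samples):
    """negatives: (k,) candidates shared by the whole batch;
    true_samples: (n,) flattened true labels.

    Returns an (n, k) mask that is 0 wherever a negative equals the label.
    """
    hits = mx.nd.broadcast_equal(true_samples.reshape((-1, 1)),
                                 negatives.reshape((1, -1)))
    return 1 - hits
```

Sharing one set of negatives across the flattened batch is what makes the single broadcast comparison sufficient.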

@leezu (Contributor, author):

I'll move the candidate sampler for word embedding learning to the scripts folder, similar to your PR.

@leezu leezu force-pushed the wembtrainingfixes branch 2 times, most recently from a2bdd5f to f8ef830 Compare June 22, 2018 17:35
@mli (Member) commented Jun 22, 2018

Job PR-159/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/6/index.html

"""
# Set a few mxnet specific environment variables
import os
os.environ['MXNET_FORCE_ADDTAKEGRAD'] = '1' # Workaround for #11314
Member:

do you mind adding the actual link to the issue?





# Remove accidental hits
if true_samples is not None:
    candidates_np = candidates.asnumpy()
Contributor:

@leezu is it more performant to use numpy than nd for sampling here?

@leezu (Contributor, author) commented Jun 25, 2018:

The conversion to numpy is necessary for using just-in-time compilation with numba for the _candidates_mask function. Alternatively, the negative samples could be shared among all words in a batch (as commonly done for language modeling, https://arxiv.org/abs/1602.02410, http://www.aclweb.org/anthology/N16-1145.pdf), allowing the use of a simple broadcast comparison as in https://github.com/apache/incubator-mxnet/blob/master/example/rnn/large_word_lm/model.py#L115-L118 for computing the mask. On the other hand, this would implicitly change the sampling distribution, as unrelated words (i.e. words that seldom co-occur in a context) would be more likely to be masked (given that they may still occur in the same batch).
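
For reference, a rough sketch of what a numba-jitted accidental-hit mask can look like; the signature follows the diff above, but the body, shapes, and dtype are illustrative rather than the PR's actual implementation:

```python
import numpy as np
from numba import njit

@njit
def _candidates_mask(negatives, true_samples, true_samples_mask):
    """Zero out negatives that collide with a (non-padded) true sample.

    Assumed shapes: negatives (batch, k), true_samples (batch, t),
    true_samples_mask (batch, t) with 0 marking padding.
    """
    batch, k = negatives.shape
    mask = np.ones((batch, k), dtype=np.float32)
    for i in range(batch):
        for j in range(k):
            for t in range(true_samples.shape[1]):
                if (true_samples_mask[i, t] != 0
                        and negatives[i, j] == true_samples[i, t]):
                    mask[i, j] = 0.0
    return mask
```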

@leezu leezu force-pushed the wembtrainingfixes branch 2 times, most recently from e9f6198 to 2d2d47c Compare June 25, 2018 18:17
@leezu leezu mentioned this pull request Jun 25, 2018
@mli (Member) commented Jun 25, 2018

Job PR-159/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/9/index.html

@leezu (Contributor, author) commented Jun 25, 2018

(This failed a few times due to the pandoc OSError when building the docs. It passed after rebasing.)

@mli (Member) commented Jun 25, 2018

Job PR-159/10 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/10/index.html

@leezu leezu force-pushed the wembtrainingfixes branch 2 times, most recently from e121bc9 to 941f6b1 Compare June 26, 2018 01:49
@mli (Member) commented Jun 26, 2018

Job PR-159/12 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/12/index.html

@leezu (Contributor, author) commented Jun 26, 2018

I suggest reviewing this PR together with the changes from #160, as they are intertwined. I have closed #160 and pushed its commits here.

@szha (Member) commented Jun 26, 2018

Please document all the API changes at the top. We should start providing this information in releases.

@mli (Member) commented Jun 26, 2018

Job PR-159/13 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/13/index.html

@leezu leezu force-pushed the wembtrainingfixes branch 3 times, most recently from 256d68e to 79e1d60 Compare June 27, 2018 00:11
@codecov (bot) commented Jun 27, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@2be7ce2). Click here to learn what that means.
The diff coverage is 56.55%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #159   +/-   ##
=========================================
  Coverage          ?   67.41%           
=========================================
  Files             ?       62           
  Lines             ?     5109           
  Branches          ?        0           
=========================================
  Hits              ?     3444           
  Misses            ?     1665           
  Partials          ?        0
Impacted Files Coverage Δ
gluonnlp/embedding/evaluation.py 90.9% <0%> (ø)
gluonnlp/data/batchify.py 90.76% <100%> (ø)
gluonnlp/data/sampler.py 95.65% <100%> (ø)
gluonnlp/model/train/embedding.py 20.79% <15.38%> (ø)
gluonnlp/data/candidate_sampler.py 22.5% <66.66%> (ø)
gluonnlp/vocab.py 90.29% <87.5%> (ø)
gluonnlp/embedding/token_embedding.py 94.97% <94.23%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2be7ce2...612b35b. Read the comment docs.

@leezu leezu force-pushed the wembtrainingfixes branch 2 times, most recently from ae83748 to 1b4b6f9 Compare June 27, 2018 02:42
@mli (Member) commented Jul 1, 2018

Job PR-159/37 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/37/index.html

@szha szha merged commit 8859c20 into dmlc:master Jul 1, 2018
@leezu leezu deleted the wembtrainingfixes branch July 1, 2018 22:58
leezu added a commit to leezu/gluon-nlp that referenced this pull request Jul 11, 2018
* Mask accidental hits

* Simplify frequent token subsampling

* Remove tqdm dependency

* Simplifications

* Support read from vec format

* Add back DeduplicatedFasttext

* Average the subword embeddings for FastText

* Fix Fasttext hash function for ngrams containing non-ASCII data

std::string in C++ uses signed char on most implementations. While the behavior
is implementation defined and binary Fasttext models trained after compiling
Fasttext with different compilers may result in different behavior, let's match
the behavior of the officially distributed binary models here.

* Merge train_word2vec and train_fasttext

* Clean up fasttext evaluation binary script

- Fix support of loading bin Fasttext models without subwords

* Remove waitall

* Only evaluate at end of training by default

* Set mxnet env variables

* Increase number of subword units considered by default

* Update hyperparameters

* Fix cbow

* Use separate batch-size for evaluation

* Fix lint

* Rerun extended_results.ipynb and commit dependent results/*tvs files to repo

* Refactor TokenEmbedding OOV inference

* Clean up TokenEmbedding API docs

* Use GluonNLP load_fasttext_model for word embeddings evaluation script

Instead of custom evaluate_fasttext_bin script

* Add tests

* Remove deprecated to_token_embedding method from train/embedding.py

* Merge TokenEmbedding.extend in TokenEmbedding.__setitem__

Previously __setitem__ was only allowed to update known tokens.

* Use full link to #11314

* Improve test coverage

* Update notebook

* Fix doc

* Cache word ngram hashes

* Move results to dmlc/web-data

* Move candidate_sampler to scripts

* Update --negative doc

* Match old default behavior of TokenEmbedding and add warnings

* Match weight context in UnigramCandidateSampler

* Add Pad test case with empty ndarray input

* Address review comments

* Fix doc and superfluous inheritance
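
The "Fix Fasttext hash function for ngrams containing non-ASCII data" commit above refers to fastText's FNV-1a ngram hash. A rough Python sketch of the x86 (signed-char) behavior being matched; the function name and byte handling are written for illustration, not copied from the PR:

```python
def fasttext_ngram_hash(ngram):
    """FNV-1a over the UTF-8 bytes of `ngram`, emulating a signed-char build.

    On x86 builds of fastText (GCC/clang), bytes >= 0x80, i.e. non-ASCII,
    are sign-extended before the XOR, so non-ASCII ngrams hash differently
    than in a build where plain char is unsigned (e.g. on ARM).
    """
    h = 2166136261
    for b in ngram.encode('utf-8'):
        if b >= 128:
            b -= 256                      # emulate signed char: sign-extend
        h = (h ^ (b & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF   # FNV prime, 32-bit wraparound
    return h

# ASCII ngrams are unaffected by the signedness; non-ASCII ones are not.
print(fasttext_ngram_hash('where'), fasttext_ngram_hash('wäre'))
```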
paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019
6 participants