
Word embeddings update #159

Merged
szha merged 38 commits into dmlc:master from leezu:wembtrainingfixes on Jul 1, 2018

Conversation

@leezu (Contributor) commented Jun 22, 2018

Description

This PR contains a few improvements for word embedding training and inference.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Support masking of accidental hits in UnigramCandidateSampler (moved to scripts)
  • TokenEmbedding.__setitem__ now allows setting vectors for new/unknown tokens by default. This can be disabled by setting allow_extend=False.
  • TokenEmbedding supports unknown_lookup and unknown_autoextend arguments. If an unknown token is encountered and unknown_lookup is specified, unknown_lookup[tokens] is called to obtain an embedding for the unknown token. If unknown_autoextend is True, a new index is assigned to the token and the embedding is saved in the TokenEmbedding. (A usage sketch follows this list.)
  • Remove EmbeddingModel.to_token_embedding and introduce EmbeddingModel.__getitem__, making trainable EmbeddingModels valid arguments for TokenEmbedding's unknown_lookup.
  • Add load_fasttext_format to FasttextEmbeddingModel, which supports reading a model.bin file created by the facebookresearch/fastText library.
  • Change the ngram hash function for non-ASCII ngrams to match the facebookresearch/fastText implementation compiled with GCC/clang on x86 (previously it matched the version compiled on ARM).
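
A minimal usage sketch of these changes (the model path, the exact form of the load_fasttext_format call, the constructor-argument placement, and the vector dimensionality are illustrative assumptions based on the description above, not verified against the final code):

```python
import mxnet as mx
import gluonnlp as nlp

# Trainable fastText model read from a model.bin produced by
# facebookresearch/fastText; the path is a placeholder.
model = nlp.model.train.FasttextEmbeddingModel.load_fasttext_format('wiki.simple.bin')

# An EmbeddingModel now supports __getitem__ and can therefore act as
# unknown_lookup: out-of-vocabulary tokens are resolved via
# unknown_lookup[tokens]; with unknown_autoextend=True the resulting vector
# is stored in the TokenEmbedding under a newly assigned index.
embedding = nlp.embedding.TokenEmbedding(unknown_lookup=model,
                                         unknown_autoextend=True)

# __setitem__ can now also extend the vocabulary with new tokens
# (controlled by the allow_extend setting of the TokenEmbedding).
embedding['brand-new-token'] = mx.nd.zeros(300)  # 300 = assumed vector size
vec = embedding['never-seen-token']              # resolved via unknown_lookup
```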

Comments

@leezu leezu requested a review from szha as a code owner June 22, 2018 15:20
@leezu leezu force-pushed the wembtrainingfixes branch 2 times, most recently from 8fb84d4 to 6168374 Compare June 22, 2018 16:54


@numba_njit
def _candidates_mask(negatives, true_samples, true_samples_mask):
@leezu (Contributor, author):

@eric-haibin-lin you mentioned you also need the accidental hits masking feature. Please take a look at whether this would work for you and let me know if you have any suggestions.

Member:

For my case my true and negatives are all reshaped to 1-D, and I just do a simple broadcast comparison.
https://github.com/apache/incubator-mxnet/blob/master/example/rnn/large_word_lm/model.py#L115-L118
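
A rough sketch of that broadcast-comparison approach (names and shapes here are illustrative, not code from this PR or from the linked example):

```python
import mxnet as mx

def broadcast_hit_mask(negatives, true_samples):
    """negatives: (k,) candidates shared by the whole batch;
    true_samples: (n,) flattened true labels.

    Returns an (n, k) mask that is 0 wherever a negative equals the label.
    """
    hits = mx.nd.broadcast_equal(true_samples.reshape((-1, 1)),
                                 negatives.reshape((1, -1)))
    return 1 - hits
```

Sharing one set of negatives across the flattened batch is what makes the single broadcast comparison sufficient.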

@leezu (Contributor, author):

I'll move the candidate sampler for word embedding learning to the scripts folder, similar to your PR.

@leezu leezu force-pushed the wembtrainingfixes branch 2 times, most recently from a2bdd5f to f8ef830 Compare June 22, 2018 17:35
@mli (Member) commented Jun 22, 2018

Job PR-159/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/6/index.html

"""
# Set a few mxnet specific environment variables
import os
os.environ['MXNET_FORCE_ADDTAKEGRAD'] = '1' # Workaround for #11314
Member:

do you mind adding the actual link to the issue?





# Remove accidental hits
if true_samples is not None:
    candidates_np = candidates.asnumpy()
Contributor:

@leezu is it more performant to use numpy than nd for sampling here?

@leezu (Contributor, author) commented Jun 25, 2018:

The conversion to numpy is necessary for using just-in-time compilation with numba for the _candidates_mask function. Alternatively, the negative samples could be shared among all words in a batch (as commonly done for language modeling, https://arxiv.org/abs/1602.02410, http://www.aclweb.org/anthology/N16-1145.pdf), allowing the use of a simple broadcast comparison as in https://github.com/apache/incubator-mxnet/blob/master/example/rnn/large_word_lm/model.py#L115-L118 for computing the mask. On the other hand, this would implicitly change the sampling distribution, as unrelated words (i.e. words that seldom co-occur in a context) would be more likely to be masked (given that they may still occur in the same batch).
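
For reference, a rough sketch of what a numba-jitted accidental-hit mask can look like; the signature follows the diff above, but the body, shapes, and dtype are illustrative rather than the PR's actual implementation:

```python
import numpy as np
from numba import njit

@njit
def _candidates_mask(negatives, true_samples, true_samples_mask):
    """Zero out negatives that collide with a (non-padded) true sample.

    Assumed shapes: negatives (batch, k), true_samples (batch, t),
    true_samples_mask (batch, t) with 0 marking padding.
    """
    batch, k = negatives.shape
    mask = np.ones((batch, k), dtype=np.float32)
    for i in range(batch):
        for j in range(k):
            for t in range(true_samples.shape[1]):
                if (true_samples_mask[i, t] != 0
                        and negatives[i, j] == true_samples[i, t]):
                    mask[i, j] = 0.0
    return mask
```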

@leezu leezu force-pushed the wembtrainingfixes branch 2 times, most recently from e9f6198 to 2d2d47c Compare June 25, 2018 18:17
@leezu leezu mentioned this pull request Jun 25, 2018
@mli (Member) commented Jun 25, 2018

Job PR-159/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/9/index.html

@leezu (Contributor, author) commented Jun 25, 2018

(This failed a few times due to the pandoc OSError when building the docs. It passed after rebasing.)

@mli (Member) commented Jun 25, 2018

Job PR-159/10 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/10/index.html

@leezu leezu force-pushed the wembtrainingfixes branch 2 times, most recently from e121bc9 to 941f6b1 Compare June 26, 2018 01:49
@mli (Member) commented Jun 26, 2018

Job PR-159/12 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/12/index.html

@leezu (Contributor, author) commented Jun 26, 2018

I suggest reviewing this PR together with the changes from #160, as they are intertwined. I have closed #160 and pushed its commits here.

@szha (Member) commented Jun 26, 2018

Please document all the API changes at the top. We should start providing this information in releases.

@mli (Member) commented Jun 26, 2018

Job PR-159/13 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/13/index.html

@leezu leezu force-pushed the wembtrainingfixes branch 3 times, most recently from 256d68e to 79e1d60 Compare June 27, 2018 00:11
@codecov (bot) commented Jun 27, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@2be7ce2). Click here to learn what that means.
The diff coverage is 56.55%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #159   +/-   ##
=========================================
  Coverage          ?   67.41%           
=========================================
  Files             ?       62           
  Lines             ?     5109           
  Branches          ?        0           
=========================================
  Hits              ?     3444           
  Misses            ?     1665           
  Partials          ?        0
Impacted Files Coverage Δ
gluonnlp/embedding/evaluation.py 90.9% <0%> (ø)
gluonnlp/data/batchify.py 90.76% <100%> (ø)
gluonnlp/data/sampler.py 95.65% <100%> (ø)
gluonnlp/model/train/embedding.py 20.79% <15.38%> (ø)
gluonnlp/data/candidate_sampler.py 22.5% <66.66%> (ø)
gluonnlp/vocab.py 90.29% <87.5%> (ø)
gluonnlp/embedding/token_embedding.py 94.97% <94.23%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2be7ce2...612b35b. Read the comment docs.

@leezu leezu force-pushed the wembtrainingfixes branch 2 times, most recently from ae83748 to 1b4b6f9 Compare June 27, 2018 02:42
@mli (Member) commented Jul 1, 2018

Job PR-159/37 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-159/37/index.html

@szha szha merged commit 8859c20 into dmlc:master Jul 1, 2018
@leezu leezu deleted the wembtrainingfixes branch July 1, 2018 22:58
leezu added a commit to leezu/gluon-nlp that referenced this pull request Jul 11, 2018
* Mask accidental hits

* Simplify frequent token subsampling

* Remove tqdm dependency

* Simplifications

* Support read from vec format

* Add back DeduplicatedFasttext

* Average the subword embeddings for FastText

* Fix Fasttext hash function for ngrams containing non-ASCII data

std::string in C++ uses signed char on most implementations. While the behavior
is implementation defined and binary Fasttext models trained after compiling
Fasttext with different compilers may result in different behavior, let's match
the behavior of the officially distributed binary models here.

* Merge train_word2vec and train_fasttext

* Clean up fasttext evaluation binary script

- Fix support of loading bin Fasttext models without subwords

* Remove waitall

* Only evaluate at end of training by default

* Set mxnet env variables

* Increase number of subword units considered by default

* Update hyperparameters

* Fix cbow

* Use separate batch-size for evaluation

* Fix lint

* Rerun extended_results.ipynb and commit dependent results/*tvs files to repo

* Refactor TokenEmbedding OOV inference

* Clean up TokenEmbedding API docs

* Use GluonNLP load_fasttext_model for word embeddings evaluation script

Instead of custom evaluate_fasttext_bin script

* Add tests

* Remove deprecated to_token_embedding method from train/embedding.py

* Merge TokenEmbedding.extend in TokenEmbedding.__setitem__

Previously __setitem__ was only allowed to update known tokens.

* Use full link to #11314

* Improve test coverage

* Update notebook

* Fix doc

* Cache word ngram hashes

* Move results to dmlc/web-data

* Move candidate_sampler to scripts

* Update --negative doc

* Match old default behavior of TokenEmbedding and add warnings

* Match weight context in UnigramCandidateSampler

* Add Pad test case with empty ndarray input

* Address review comments

* Fix doc and superfluous inheritance
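
The "Fix Fasttext hash function for ngrams containing non-ASCII data" commit above refers to fastText's FNV-1a ngram hash. A rough Python sketch of the x86 (signed-char) behavior being matched; the function name and byte handling are written for illustration, not copied from the PR:

```python
def fasttext_ngram_hash(ngram):
    """FNV-1a over the UTF-8 bytes of `ngram`, emulating a signed-char build.

    On x86 builds of fastText (GCC/clang), bytes >= 0x80, i.e. non-ASCII,
    are sign-extended before the XOR, so non-ASCII ngrams hash differently
    than in a build where plain char is unsigned (e.g. on ARM).
    """
    h = 2166136261
    for b in ngram.encode('utf-8'):
        if b >= 128:
            b -= 256                      # emulate signed char: sign-extend
        h = (h ^ (b & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF   # FNV prime, 32-bit wraparound
    return h

# ASCII ngrams are unaffected by the signedness; non-ASCII ones are not.
print(fasttext_ngram_hash('where'), fasttext_ngram_hash('wäre'))
```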
paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019
6 participants