
@szha released this Aug 8, 2019 · 7 commits to master since this release

News

Models

RoBERTa

  • RoBERTa is now available in the GluonNLP BERT model zoo. (#870)
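
A minimal loading sketch; the model name `roberta_12_768_12`, the dataset name, and the `use_decoder` option are assumptions based on the 0.8 model zoo:

```python
import gluonnlp as nlp

# Names below are assumptions based on the 0.8 model zoo, not confirmed above.
model, vocab = nlp.model.get_model(
    'roberta_12_768_12',
    dataset_name='openwebtext_ccnews_stories_books_cased',
    use_decoder=False)
```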

Transformer-XL


@eric-haibin-lin released this Jul 17, 2019 · 3 commits to v0.7.x since this release

News

Models and Scripts

BERT

  • A BERT BASE model pre-trained on a large corpus including the OpenWebText Corpus, BooksCorpus, and English Wikipedia, with performance comparable to Google's BERT LARGE model. Test scores on the GLUE benchmark are reported below, followed by a loading sketch. This release also improves the usability of the BERT pre-training script: on-the-fly training data generation, sentencepiece, horovod, etc. (#799, #687, #806, #669, #665). Thank you @davisliang @vanyacohen @Skylion007
| Source | GluonNLP | google-research/bert | google-research/bert |
| --- | --- | --- | --- |
| Model | bert_12_768_12 | bert_12_768_12 | bert_24_1024_16 |
| Dataset | openwebtext_book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased |
| SST-2 | 95.3 | 93.5 | 94.9 |
| RTE | 73.6 | 66.4 | 70.1 |
| QQP | 72.3 | 71.2 | 72.1 |
| SQuAD 1.1 | 91.0/84.4 | 88.5/80.8 | 90.9/84.1 |
| STS-B | 87.5 | 85.8 | 86.5 |
| MNLI-m/mm | 85.3/84.9 | 84.6/83.4 | 86.7/85.9 |
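
A minimal usage sketch: the dataset name comes from the table above; the keyword arguments are assumptions based on the BERT `get_model` API.

```python
import gluonnlp as nlp

# Load the new checkpoint by its dataset name (taken from the table above).
# use_classifier/use_decoder are assumed BERT get_model options.
bert, vocab = nlp.model.get_model(
    'bert_12_768_12',
    dataset_name='openwebtext_book_corpus_wiki_en_uncased',
    pretrained=True,
    use_classifier=False,
    use_decoder=False)
```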

GPT-2

ESIM

Data

  • Natural language understanding datasets from the GLUE benchmark: CoLA, SST-2, MRPC, STS-B, MNLI, QQP, QNLI, WNLI, RTE (#682); a loading sketch follows this list
  • Sentiment analysis datasets: CR, MPQA (#663)
  • Intent classification and slot labeling datasets: ATIS and SNIPS (#816)
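
A minimal loading sketch for the new GLUE datasets; the class name `GlueSST2` and its `segment` argument are assumptions based on the `nlp.data` naming convention:

```python
import gluonnlp as nlp

# GlueSST2 is assumed to follow the Glue* dataset naming convention.
dev = nlp.data.GlueSST2(segment='dev')
sentence, label = dev[0]  # each sample is assumed to pair a sentence with a label
print(sentence, label)
```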

New Features

  • [Feature] Support saving model/trainer states to S3 (#700)
  • [Feature] Support loading model/trainer states from S3 (#702)
  • [Feature] Add SentencePieceTokenizer for BERT (#669)
  • [FEATURE] Flexible vocabulary (#732)
  • [API] Moving MaskedSoftmaxCELoss and LabelSmoothing to model API (#754) thanks @ThomasDelteil
  • [Feature] add the List batchify function (#812) thanks @ThomasDelteil; see the sketch after this list
  • [FEATURE] Add LAMB optimizer (#733)
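
For example, the new List batchify can be combined with the existing batchify helpers. A minimal sketch, assuming the new helper is exposed as `nlp.data.batchify.List` and keeps raw fields as plain Python lists:

```python
import gluonnlp as nlp

# Pad the token ids into a batch tensor, but keep the raw strings as a list.
# nlp.data.batchify.List is the new helper from #812 (name assumed).
batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(),
    nlp.data.batchify.List())
batch_ids, batch_text = batchify_fn([([1, 2, 3], 'a b c'),
                                     ([4, 5], 'd e')])
```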

Bug Fixes

  • [BUGFIX] Fixes for BERT embedding, pretraining scripts (#640) thanks @Deseaus
  • [BUGFIX] Update hash of wiki_cn_cased and wiki_multilingual_cased vocab (#655)
  • fix bert forward call parameter mismatch (#695) thanks @paperplanet
  • [BUGFIX] Fix mlm_loss reporting for eval dataset (#696)
  • Fix _get_rnn_cell (#648) thanks @MarisaKirisame
  • [BUGFIX] fix mrpc dataset idx (#708)
  • [bugfix] fix hybrid beam search sampler (#710)
  • [BUGFIX] [DOC] Update nlp.model.get_model documentation and get_model API (#734)
  • [BUGFIX] Fix handling of duplicate special tokens in Vocabulary (#749)
  • [BUGFIX] Fix TokenEmbedding serialization with emb[emb.unknown_token] != 0 (#763)
  • [BUGFIX] Fix glue test result serialization (#773)
  • [BUGFIX] Fix init bug for multilevel BiLMEncoder (#783) thanks @Ishitori

API Changes

  • [API] Dropping support for wiki_multilingual and wiki_cn (#764)
  • [API] Remove get_bert_model from the public API list (#767)

Enhancements

  • [FEATURE] offer load_w2v_binary method to load w2v binary file (#620)
  • [Script] Add inference function for BERT classification (#639) thanks @TaoLv
  • [SCRIPT] - Add static BERT base export script (for use with MXNet Module API) (#672)
  • [Enhancement] One script to export bert for classification/regression/QA (#705)
  • [enhancement] refactor bert finetuning script (#692)
  • [Enhancement] only use the best model for inference for bert classification (#716)
  • [Dataset] redistribute conll2004 (#719)
  • [Enhancement] add periodic evaluation for BERT pre-training (#720)
  • [FEATURE] add XNLI task (#717)
  • [refactor] Refactor BERT script folder (#744)
  • [Enhancement] BERT pre-training data generation from sentencepiece vocab (#743)
  • [REFACTOR] Refactor TokenEmbedding to reduce number of places that initialize internals (#750)
  • [Refactor] Refactor BERT SQuAD inference code (#758)
  • [Enhancement] Fix dtype conversion, add sentencepiece support for SQuAD (#766)
  • [Dataset] Move MRPC dataset to API (#780)
  • [BiDAF-QANet] Common data processing logic for BiDAF and QANet (#739) thanks @Ishitori
  • [DATASET] add LCQMC, ChnSentiCorp dataset (#774) thanks @paperplanet
  • [Improvement] Implement parser evaluation in Python (#772)
  • [Enhancement] Add whole word masking for BERT (#770) thanks @basicv8vc
  • [Enhancement] Mixed precision support for BERT finetuning (#793)
  • Generate BERT training samples in compressed format (#651)

Minor Fixes

Continuous Integration

  • skip failing tests in mxnet master (#685)
  • [CI] update nodes for CI (#686)
  • [CI] CI refactoring to speed up tests (#566)
  • [CI] fix codecov (#693)
  • use fixture for squad dataset tests (#699)
  • [CI] create zipped notebooks for link check (#712)
  • Fix test infrastructure for pytest > 4 and bump CI pytest version (#728)
  • [CI] set root in BERT tests (#738)
  • Fix conftest.py function_scope_seed (#748)
  • [CI] Fix links in contribute.rst (#752)
  • [CI] Update CI dependencies (#756)
  • Revert "[CI] Update CI dependencies (#756)" (#769)
  • [CI] AWS Batch serverless CI Pipeline for parallel notebook execution during website build step (#791)
  • [CI] Don't exit pipeline before displaying AWS Batch logfiles (#801)
  • [CI] Fix for "Don't exit pipeline before displaying AWS Batch logfiles" (#803)
  • add license checker (#804)
  • enable timeout (#813)
  • Fix website build on master branch (#819)

@eric-haibin-lin released this Mar 18, 2019 · 138 commits to master since this release

News

  • The GluonNLP tutorial proposal was accepted at EMNLP 2019 (Hong Kong) and KDD 2019 (Anchorage).

Models and Scripts

  • BERT pre-training on BooksCorpus and English Wikipedia with mixed precision and gradient accumulation on GPUs. We achieved the following fine-tuning results on validation sets with the resulting checkpoint (#482, #505, #489). Thank you @haven-jeon

      | Dataset | MRPC | SQuAD 1.1 | SST-2 | MNLI-mm |
      | --- | --- | --- | --- | --- |
      | Score | 87.99% | 80.99/88.60 | 93% | 83.6% |
  • BERT fine-tuning on various sentence classification datasets with checkpoints converted from the official repository (#600, #571, #481). Thank you @kenjewu @haven-jeon

      | Dataset | MRPC | RTE | SST-2 | MNLI-m/mm |
      | --- | --- | --- | --- | --- |
      | Score | 88.7% | 70.8% | 93% | 84.55%/84.66% |
  • BERT fine-tuning on question answering datasets with checkpoints converted from the official repository (#493). Thank you @fierceX

      | Dataset | SQuAD 1.1 | SQuAD 1.1 | SQuAD 2.0 |
      | --- | --- | --- | --- |
      | Model | bert_12_768_12 | bert_24_1024_16 | bert_24_1024_16 |
      | F1/EM | 88.53/80.98 | 90.97/84.05 | 77.96/81.02 |
  • BERT model conversion scripts for checkpoints from the original TensorFlow repository, and more converted models (#456, #461, #449). Thank you @fierceX:

    • Multilingual Wikipedia (cased, BERT Base)
    • Chinese Wikipedia (cased, BERT Base)
    • Books Corpus & English Wikipedia (uncased, BERT Large)
  • Scripts and a command-line interface for BERT embeddings of raw sentences (#587, #618). Thank you @imgarylai

  • Scripts for exporting BERT model for deployment (#624)

New Features

  • [API] Add BERTVocab (#509) thanks @kenjewu
  • [API] Add Transforms for BERT (#526) thanks @kenjewu; see the sketch after this list
  • [API] add data parallel for transformer (#387)
  • [FEATURE] Add squad2.0 Dataset (#551) thanks @fierceX
  • [FEATURE] Add NumpyDataset (#498)
  • [FEATURE] Add TruncNorm initializer for BERT (#548) thanks @Ishitori
  • [FEATURE] Add split sampler for distributed training (#494)
  • [FEATURE] Custom metric for masked accuracy (#503)
  • [FEATURE] Support custom sampler in SimpleDatasetStream (#507)
  • [FEATURE] clip gradient norm by parameter (#470)
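
A minimal sketch of the new BERT transforms (#526). The class names `BERTTokenizer` and `BERTSentenceTransform` and their arguments are assumptions based on this release's data API:

```python
import gluonnlp as nlp

# Grab the BERT vocabulary that ships with the pretrained checkpoints.
_, vocab = nlp.model.get_model('bert_12_768_12',
                               dataset_name='book_corpus_wiki_en_uncased',
                               pretrained=False)
tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)
# Converts one sentence into ids, valid length, and segment ids (pair=False).
transform = nlp.data.BERTSentenceTransform(tokenizer, max_seq_length=32,
                                           pair=False)
token_ids, valid_length, segment_ids = transform(('gluonnlp is great .',))
```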

Bug Fixes

  • [BUGFIX] Fix Data Preprocessing for Translation Data (#568)
  • [FIX] fix parameter clip (#527)
  • [FIX] Fix divergence of the training of transformer (#543)
  • [FIX] Fix documentation and a bug in NCE Block (#558)
  • [FIX] Fix hashing single ngrams in NGramHashes (#450)
  • [FIX] Fix weight dying in BERTModel.decoder for BERT pre-training (#500)
  • [BUGFIX] Modifying the FastText Classification training for accurate mean pooling (#529) thanks @sravanbabuiitm

API Changes

  • [API] BERT return intermediate encodings per layer (#606) thanks @Ishitori
  • [API] Better handle case when backoff is not possible in TokenEmbedding (#459)
  • [FIX] Rename wiki_cn/wiki_multilingual to wiki_cn_cased/wiki_multilingual_uncased (#594) thanks @kenjewu
  • [FIX] Update default value of BERTAdam epsilon to 1e-6 (#601); see the sketch after this list
  • [FIX] Fix BERT decoder API for masked language model prediction (#501)
  • [FIX] Remove bias correction term in BERTAdam (#499)
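
With the updated default, fine-tuning code can also pass epsilon explicitly. A sketch, assuming BERTAdam is registered with MXNet under the name 'bertadam' when gluonnlp is imported:

```python
import gluonnlp  # noqa: F401 -- assumed to register BERTAdam with MXNet
from mxnet import gluon

net = gluon.nn.Dense(2)
net.initialize()
# epsilon now defaults to 1e-6 (#601); shown explicitly here for clarity.
trainer = gluon.Trainer(net.collect_params(), 'bertadam',
                        {'learning_rate': 5e-5, 'epsilon': 1e-6, 'wd': 0.01})
```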

Enhancements

  • [BUGFIX] use glove.840B.300d for NLI experiments (#567)
  • [API] Add debug option for parallel (#584)
  • [FEATURE] Skip dropout layer in Transformer when rate=0 (#597) thanks @TaoLv
  • [FEATURE] update sharded loader (#468)
  • [FIX] Update BERTLayerNorm Implementation (#485)
  • [TUTORIAL] Use FixedBucketSampler in BERT tutorial for better performance (#506) thanks @Ishitori
  • [API] Add Bert tokenizer to transforms.py (#464) thanks @fierceX
  • [FEATURE] Add data parallel to big rnn lm script (#564)

Minor Fixes


@eric-haibin-lin released this Nov 27, 2018 · 229 commits to master since this release

Highlights

Models

New Tutorials

New Datasets

  • Sentiment Analysis
    • MR, a movie-review data set of 10,662 sentences labeled with respect to their overall sentiment polarity (positive or negative). (#391)
    • SST_1, an extension of the MR data set with fine-grained labels (#391)
    • SST_2, an extension of the MR data set with binary sentiment polarity labels (#391)
    • SUBJ, a subjectivity data set of 10,000 sentences labeled as subjective or objective (#391)
    • TREC, a question classification data set of questions labeled with their question type (#391); a loading sketch follows this list
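
The new datasets follow the usual `nlp.data` loading pattern. A minimal sketch; the class name `MR` and the sample layout are assumptions:

```python
import gluonnlp as nlp

# Download and load the movie-review dataset; each sample is assumed to pair
# a sentence with its polarity label.
mr = nlp.data.MR()
print(len(mr), mr[0])
```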

API Updates

  • Changed Vocab constructor from staticmethod to classmethod to handle inheritance (#386)
  • Added Transformer Encoder APIs (#409)
  • Added pre-trained ELMo model to model.get_model API (#227)
  • Added pre-trained BERT model to model.get_model API (#409); a loading sketch for ELMo and BERT follows this list
  • Added unknown_lookup setter to TokenEmbedding (#429)
  • Added dtype support to EmbeddingCenterContextBatchify (#416)
  • Propagated exceptions from PrefetchingStream (#406)
  • Added sentencepiece tokenizer and detokenizer (#380)
  • Added CSR format for variable length data in embedding training (#384)
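
A minimal loading sketch for the new model.get_model entry points; the exact model and dataset names are assumptions based on the model zoo:

```python
import gluonnlp as nlp

# ELMo: pretrained on the 1-billion-word benchmark ('gbw', name assumed).
elmo, _ = nlp.model.get_model('elmo_2x1024_128_2048cnn_1xhighway',
                              dataset_name='gbw', pretrained=True)

# BERT: base model together with its WordPiece vocabulary.
bert, vocab = nlp.model.get_model('bert_12_768_12',
                                  dataset_name='book_corpus_wiki_en_uncased',
                                  pretrained=True)
```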

Fixes & Small Changes

  • Included output of nlp.embedding.list_sources() in API docs (#421)
  • Supported symlinks in examples and scripts (#403)
  • Fixed weight tying in GNMT and Transformer (#413)
  • Simplified transformer notebook (#400)
  • Fixed LazyTransformDataStream prefetching (#397)
  • Adopted src/gluonnlp folder layout (#390)
  • Fixed text8 archive file name for downloads from S3 (#388) Thanks @bkktimber!
  • Fixed ppl reporting for multi-GPU training in the language model notebook (#365). Thanks @ThomasDelteil!
  • Fixed a spelling mistake in QA script. (#379) Thanks @qyhfbqz!

@cgraywang released this Oct 24, 2018 · 276 commits to master since this release

Highlights

Models

  • Language Model
  • Document Classification
    • The classification model introduced by Joulin, Armand, et al. "Bag of Tricks for Efficient Text Classification" achieved 98% validation accuracy on the Yelp review dataset (#258, #297)
  • Question Answering
    • QANet, as introduced by Yu, Adams Wei, et al. "QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension", ICLR 2018, achieved an F1 score of 79.5 on the SQuAD 1.1 dataset (#339) (coming soon to the master branch)

New Tutorials

  • Machine Translation
    • The Google NMT model, as introduced by Wu, Yonghui, et al. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", arXiv preprint arXiv:1609.08144 (2016), is introduced as part of the gluonnlp tutorial (#261)
    • The Transformer-based machine translation model by Vaswani, Ashish, et al. "Attention Is All You Need", Advances in Neural Information Processing Systems, 2017, is introduced as part of the gluonnlp tutorial (#279)
  • Sentence Embedding

New Datasets

API Updates

  • Added dataloader that allows multi-shard sampling (#237, #280, #285)
  • Simplified DataStream, added DatasetStream, refactored and extended PrefetchingStream (#235)
  • Unified BPTT batchify for dataset and stream (#246)
  • Added symbolic beam search (#233)
  • Added SequenceSampler (#272)
  • Refactored Transform APIs (#282)
  • Reorganized index of the repo and model zoo page (#357)

Fixes & Small Changes

  • Fixed module name in batchify.py example (#239)
  • Improved imports structure (#248)
  • Added test for nmt scripts (#234)
  • Sped up batchify.Pad (#249)
  • Fixed LanguageModelDataset.bptt_batchify (#243)
  • Fixed weight drop and added tests (#268)
  • Fixed relative links that PyPI doesn't handle (#293)
  • Updated notebook build logic (#309)
  • Added community link (#313)
  • Enabled running tests in parallel (#317)
  • Enabled word embedding scripts tests (#321)

@leezu released this Jun 13, 2018 · 366 commits to master since this release

GluonNLP v0.3 contains many exciting new features.
(depends on MXNet 1.3.0b20180725)

Models

  • Language Models
  • Machine Translation
    • The Transformer model, as introduced by Vaswani, Ashish, et al. "Attention Is All You Need", Advances in Neural Information Processing Systems, 2017, is introduced as part of the gluonnlp nmt scripts (#133)
  • Word embeddings
    • Trainable word embedding models are introduced as part of gluonnlp.model.train (#136)
      • Word2Vec by Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
      • FastText models by Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135-146.

New Datasets

API Changes

  • The download directory for datasets and other artifacts can now be specified
    via the MXNET_HOME environment variable. (#106)
  • TokenEmbedding class now exposes the Inverse Vocab as well (#123)
  • SortedSampler now supports use_average_length option (#135)
  • Add more strategies for bucket creation (#145)
  • Add tokenizer to bleu (#154)
  • Add Convolutional Encoder and Highway Layer (#129) (#186)
  • Add plain text of translation data. (#158)
  • Use Sherlock Holmes dataset instead of PTB for language model notebook (#174)
  • Add classes JiebaTokenizer and NLTKStanfordSegmenter for Chinese word segmentation (#164)
  • Allow toggling output and prompt in documentation website (#184)
  • Add shape assertion statements for better user experience to some attention cells (#201)
  • Add support for computation of word embeddings for unknown words in TokenEmbedding class (#185)
  • Distribute subword vectors for pretrained fastText embeddings, enabling embeddings for unknown words (#185); see the sketch after this list
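
A minimal sketch of embeddings for unknown words via fastText subwords; the `load_ngrams` flag is an assumption based on the fastText embedding API:

```python
import gluonnlp as nlp

# Load pretrained fastText vectors together with their subword ngram table
# (load_ngrams=True is assumed to enable it), so unseen words still get a
# non-zero vector computed from their character ngrams.
ft = nlp.embedding.create('fasttext', source='wiki.simple', load_ngrams=True)
print(ft['gluonnlpxyz'][:5])  # an out-of-vocabulary token
```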

Fixes & Small Changes

  • Fixed bptt_batchify sometimes returning an invalid last batch (#120)
  • Fixed wrong PPL calculation in word language model script for multi-GPU (#150)
  • Fix split compound words and wmt16 results (#151)
  • Adapt pretrained word embeddings example notebook for nd.topk change in mxnet 1.3 (#153)
  • Fix beam search script (#175)
  • Fix small bugs in parser (#183)
  • TokenEmbedding: Skip lines with invalid bytes instead of crashing (#188)
  • Fix overly large memory use in TokenEmbedding serialization/deserialization when some tokens are very long (e.g. 50k characters) (#187)
  • Remove duplicates in WordSim353 when combining segments (#192)
