
[SCRIPT] Pre-training script for BERT #505

Merged
merged 130 commits into dmlc:master from eric-haibin-lin:pretrain on Mar 3, 2019

Conversation

@eric-haibin-lin (Member) commented Jan 2, 2019

Description

Requires #499, #500, #501, #503, #504, and #489.
Please review after these PRs are merged; otherwise the diff contains duplicate code changes.

Summary of changes:

  • enhanced the training data generation script to produce npz files using multiprocessing
  • added a pre-training script for BERT_base. Training 1M steps from scratch currently takes about 8-9 days with fp16 on 8 V100 GPUs. The resulting checkpoint achieves 93% on SST-2, 87.99% on MRPC, and 80.99/88.6 on SQuAD 1.1
  • added dynamic loss scaling and an fp16 trainer, which avoid gradient overflow/underflow when training with fp16 (see the sketch after this list)
  • modified a few BERT-related blocks to support the fp16 dtype; some of them require explicit casting to avoid loss of precision (e.g. LayerNorm, norm)
  • added a sample text dataset, plus tests for training data generation and pre-training on this sample dataset
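
For context, here is a minimal sketch of the dynamic loss-scaling logic mentioned above, written as framework-agnostic Python. The class name, the initial scale, and the growth/backoff schedule are illustrative assumptions, not the exact values or API used in scripts/bert/fp16_utils.py.

```python
# Hypothetical sketch of dynamic loss scaling for fp16 training.
# The scale multiplies the loss before backward() so that small fp16 gradients
# do not underflow; on overflow the step is skipped and the scale is reduced.

class DynamicLossScaler:
    def __init__(self, init_scale=2.0**16, scale_factor=2.0, scale_window=2000):
        self.scale = init_scale           # current loss scale (assumed initial value)
        self.scale_factor = scale_factor  # grow/shrink factor (assumed value)
        self.scale_window = scale_window  # overflow-free steps before growing (assumed)
        self._good_steps = 0              # consecutive overflow-free steps

    def scale_loss(self, loss):
        """Multiply the loss before calling backward()."""
        return loss * self.scale

    def update(self, has_overflow):
        """Call once per step after checking the unscaled gradients for inf/nan."""
        if has_overflow:
            # Skip the optimizer step and back off the scale.
            self.scale = max(self.scale / self.scale_factor, 1.0)
            self._good_steps = 0
            return False  # caller should skip the parameter update
        self._good_steps += 1
        if self._good_steps % self.scale_window == 0:
            # No overflow for a while: try a larger scale.
            self.scale *= self.scale_factor
        return True  # safe to apply the (rescaled) gradients
```

In a training loop, gradients would be divided by `scaler.scale` before clipping, and the parameter update would be applied only when `update()` returns True.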

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

Ubuntu and others added some commits Dec 25, 2018


Ubuntu added some commits Feb 21, 2019


Ubuntu and others added some commits Feb 25, 2019

@mli (Member) commented Feb 27, 2019 (this comment has been minimized)
szha approved these changes Mar 2, 2019

@mli (Member) commented Mar 3, 2019 (this comment has been minimized)

eric-haibin-lin merged commit def3431 into dmlc:master on Mar 3, 2019

1 check passed

continuous-integration/jenkins/pr-merge This commit looks good

BERT automation moved this from In progress to Done Mar 3, 2019

eric-haibin-lin referenced this pull request on Mar 4, 2019

Closed

[feature] support for BERT fp16 inference #578

0 of 6 tasks complete

paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019

[SCRIPT] Pre-training script for BERT (dmlc#505)
* use adam_w and bucketing

* add missing code

* add helper msg for OOM error

* move optimizer def to gluonnlp. also fix padding token

* update documentation

* fix lint

* address CR comments

* revert unintended changes

* dataset preparation script for pre-training

* fix pylint

* fix object inheritance

* remove mem pool env var

* Revert "remove mem pool env var"

This reverts commit 60b8fdd.

* remove mem pool env var

* Add split sampler for distributed training

* update unit test

* [Feature] Add NumpyDataset

* test enhancement

* fix ci

* stage draft

* support custom sampler in simpledataset stream

* runnable script

* Revert "support custom sampler in simpledataset stream"

This reverts commit f346eb1.

* Fix pylint and apply code reviews

* add test for adam bert

* update doc

* update doc

* doc enhancement

* remove unused code

* move mx.test_utils.compare_optimizer to gluonnlp since it is not available in mxnet 1.3 yet

* cleanup

* fix lint

* fix test and doc

* fix lint

* fix import error

* fix test

* add pretraining data creation script

* add sample text

* doc for pre training

* custom metric for masked accuracy

* fix lint, and more test cases

* use generic bert api

* merge with metric branch

* remove bert_for_pretraining from the commit

* remove unused file

* add sample text

* add numpy output support

* update doc

* update throughput calculation

* remove padding from numpy format

* remove object array

* remove segment length

* add back weight

* add h5py

* h5py stream/array dataset

* add multiprocessing and tokenized option to create data

* add cmd

* fix parameter clip (dmlc#527)

* multi-gpu

* update run.sh

* save and load

* loop forever

* improve grad clipping

* add kvstore type

* fix acc and loss calculation

* support dist kvstore

* update run.sh

* by token

* global norm

* use global norm api

* Fp16 (dmlc#8)

* support fp16 inference for transformer

* more fixes for dtype

* stage code

* stash transformer related changes

* Revert "stash transformer related changes"

This reverts commit 9ca65c5.

* revert transformer related changes

* cleanup

* fix layernorm dtype

* mp adamw

* add missing file

* fp16 support

* cleanup fp16 utils

* remove unused file

* fix lint

* fix lints

* fix lints

* fix lint

* fix lint

* fix lint

* remove unused code

* fix test

* bug fix

* use os.path.expanduser

* remove unused inputs

* add test for bert create_pretraining_data

* Many minor changes

* add test

* update test

* fix lint

* fix test

* fix lint

* update default val

* use float val mask

* Add CI to track MXNet master

* add test for mx-master

* add doc

* fix overflow and support general optimizers

* refactor forward_backward function

* address CR comments

* fix lint

* fix test

* update default epsilon val in test

* fix lint

* update with pre-training result
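
As a side note on the "global norm" and "use global norm api" commits above, here is a minimal NumPy sketch of gradient clipping by global norm. The function name and the way gradients are passed in are assumptions for illustration; the script itself relies on the MXNet/GluonNLP utilities rather than this code.

```python
import numpy as np

def clip_grad_global_norm(grads, max_norm):
    """Scale all gradients in place so their joint L2 norm is at most max_norm.

    `grads` is assumed to be a list of NumPy arrays holding the gradient of
    every parameter (illustrative; the real script operates on MXNet NDArrays).
    """
    # Global norm computed over all parameters jointly, not per-array.
    total_norm = np.sqrt(sum(float(np.sum(g.astype(np.float64) ** 2)) for g in grads))
    ratio = max_norm / (total_norm + 1e-6)
    if ratio < 1.0:  # only scale down, never up
        for g in grads:
            g *= ratio
    return total_norm
```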

eric-haibin-lin deleted the eric-haibin-lin:pretrain branch on Jun 12, 2019
