
[FEATURE] Add transformer inference code #852

Merged
merged 20 commits into from Sep 8, 2019

Conversation

@pengxin99 (Contributor) commented Jul 29, 2019

Description

Add transformer inference code to make inference easy and to make it convenient to analyze the performance of transformer inference.
@TaoLv @juliusshufan @pengzhao-intel

The following command can be used to run inference:
python inference_transformer.py --dataset WMT2014BPE --src_lang en --tgt_lang de --batch_size 2700 --scaled --average_start 5 --num_buckets 20 --bucket_scheme exp --bleu 13a --model_parameter PATH/TO/valid_best.params

which produces output like:

2019-08-19 22:03:57,600 - root - batch id=10, batch_bleu=26.0366
2019-08-19 22:04:45,904 - root - batch id=20, batch_bleu=30.8409
2019-08-19 22:05:26,991 - root - batch id=30, batch_bleu=25.3955
2019-08-19 22:06:11,089 - root - batch id=40, batch_bleu=21.9322
2019-08-19 22:06:58,313 - root - batch id=50, batch_bleu=29.7584
2019-08-19 22:07:49,634 - root - batch id=60, batch_bleu=26.5373
2019-08-19 22:08:33,846 - root - batch id=70, batch_bleu=23.2735
2019-08-19 22:09:24,003 - root - batch id=80, batch_bleu=22.8065
2019-08-19 22:10:03,324 - root - batch id=90, batch_bleu=26.0000
2019-08-19 22:10:41,997 - root - batch id=100, batch_bleu=27.7887
2019-08-19 22:11:26,346 - root - batch id=110, batch_bleu=22.6277
2019-08-19 22:12:10,353 - root - batch id=120, batch_bleu=25.9580
2019-08-19 22:12:47,614 - root - batch id=130, batch_bleu=22.6479
2019-08-19 22:13:20,316 - root - batch id=140, batch_bleu=26.6224
2019-08-19 22:13:54,895 - root - batch id=150, batch_bleu=30.2036
2019-08-19 22:14:32,938 - root - batch id=160, batch_bleu=22.4694
2019-08-19 22:15:09,624 - root - batch id=170, batch_bleu=26.4245
2019-08-19 22:15:39,387 - root - batch id=180, batch_bleu=28.8940
2019-08-19 22:16:11,217 - root - batch id=190, batch_bleu=26.2148
2019-08-19 22:16:47,089 - root - batch id=200, batch_bleu=24.3723
2019-08-19 22:17:22,472 - root - batch id=210, batch_bleu=27.1375
2019-08-19 22:18:00,030 - root - batch id=220, batch_bleu=25.5695
2019-08-19 22:18:32,847 - root - batch id=230, batch_bleu=25.9404
2019-08-19 22:19:01,637 - root - batch id=240, batch_bleu=25.6699
2019-08-19 22:19:29,690 - root - batch id=250, batch_bleu=22.1795
2019-08-19 22:19:58,859 - root - batch id=260, batch_bleu=21.1670
2019-08-19 22:20:28,113 - root - batch id=270, batch_bleu=24.0742
2019-08-19 22:20:53,027 - root - batch id=280, batch_bleu=27.6126
2019-08-19 22:21:20,014 - root - batch id=290, batch_bleu=25.6340
2019-08-19 22:21:50,416 - root - batch id=300, batch_bleu=22.7178
2019-08-19 22:22:14,171 - root - batch id=310, batch_bleu=30.1331
2019-08-19 22:22:37,462 - root - batch id=320, batch_bleu=23.2388
2019-08-19 22:23:01,075 - root - batch id=330, batch_bleu=27.9605
2019-08-19 22:23:22,236 - root - batch id=340, batch_bleu=23.9418
2019-08-19 22:23:40,851 - root - batch id=350, batch_bleu=22.2135
2019-08-19 22:24:01,679 - root - batch id=360, batch_bleu=23.6225
2019-08-19 22:24:15,178 - root - Inference at test dataset. inference bleu=26.0137, throughput=0.1236K wps

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@pengxin99 requested a review from szha as a code owner on July 29, 2019 01:33
codecov bot commented Jul 29, 2019

Codecov Report

❗ No coverage uploaded for pull request head (inference-transformer@55b7693).
The diff coverage is n/a.

codecov bot commented Jul 29, 2019

Codecov Report

Merging #852 into master will increase coverage by 0.08%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #852      +/-   ##
==========================================
+ Coverage   90.38%   90.47%   +0.08%     
==========================================
  Files          66       66              
  Lines        6367     6405      +38     
==========================================
+ Hits         5755     5795      +40     
+ Misses        612      610       -2
Impacted Files Coverage Δ
src/gluonnlp/data/utils.py 70.27% <0%> (-0.48%) ⬇️
src/gluonnlp/data/batchify/language_model.py 96.26% <0%> (ø) ⬆️
src/gluonnlp/model/bert.py 99.45% <0%> (+0.06%) ⬆️
src/gluonnlp/data/batchify/batchify.py 96.59% <0%> (+0.16%) ⬆️
src/gluonnlp/model/transformer.py 91.2% <0%> (+1.07%) ⬆️

@leezu (Contributor) left a comment:

Please update the module docstring and the argparse description to reflect this is only for inference. Thanks!

@szha added the release focus label on Jul 31, 2019
@pengzhao-intel commented:

@pengxin99 please help resolve the comments; we need to merge this soon.

@mli (Member) commented Aug 6, 2019

Job PR-852/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/3/index.html

@pengxin99 (Contributor, Author) commented:

Thanks for your comments; the code has been modified accordingly. Please review. @leezu @ciyongch @eric-haibin-lin

@mli (Member) commented Aug 6, 2019

Job PR-852/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/4/index.html

@mli (Member) commented Aug 7, 2019

Job PR-852/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/5/index.html

@mli (Member) commented Aug 7, 2019

Job PR-852/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/6/index.html

@ciyongch (Contributor) commented Aug 8, 2019

@pengxin99 please take a look at the CI failure (pylint/format error).

@ciyongch (Contributor) left a comment:

For inference, only the test dataset is required (both src_test and target_test), while the training and validation datasets are for the training phase. Loss is also useless in inference mode; BLEU/PPL is enough.

Please clean up all the training-related arguments/datasets/metrics (loss).


parser = argparse.ArgumentParser(description='Neural Machine Translation Example.'
'We use this script only for transformer inference.')
parser.add_argument('--dataset', type=str, default='WMT2016BPE', help='Dataset to use.')
Review comment: Set default value to "WMT2014BPE"?

parser.add_argument('--dataset', type=str, default='WMT2016BPE', help='Dataset to use.')
parser.add_argument('--src_lang', type=str, default='en', help='Source language')
parser.add_argument('--tgt_lang', type=str, default='de', help='Target language')
parser.add_argument('--epochs', type=int, default=10, help='upper epoch limit')
Review comment: Do we really need --epochs for inference mode?

help='Dimension of the hidden state in position-wise feed-forward networks.')
parser.add_argument('--dropout', type=float, default=0.1,
help='dropout applied to layers (0 = no dropout)')
parser.add_argument('--epsilon', type=float, default=0.1,
Review comment: Training-only parameters?

parser.add_argument('--num_heads', type=int, default=8,
help='number of heads in multi-head attention')
parser.add_argument('--scaled', action='store_true', help='Turn on to use scale in attention')
parser.add_argument('--batch_size', type=int, default=1024,
Review comment: --batch_size should not be tied to the hardware back-end.

parser.add_argument('--lp_alpha', type=float, default=0.6,
help='Alpha used in calculating the length penalty')
parser.add_argument('--lp_k', type=int, default=5, help='K used in calculating the length penalty')
parser.add_argument('--test_batch_size', type=int, default=256, help='Test batch size')
Review comment: Redundant parameter?

help='Perform final testing based on the '
'average of last num_averages checkpoints. '
'This is only used if average_checkpoint is True')
parser.add_argument('--average_start', type=int, default=5,
Review comment: --optimizer through --average_start are all training-only parameters; please remove all of them from the inference script.

@mli (Member) commented Aug 8, 2019

Job PR-852/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/7/index.html

@pengxin99 (Contributor, Author) commented:

> For inference, only the test dataset is required (both src_test and target_test), while the training and validation datasets are for the training phase. Loss is also useless in inference mode; BLEU/PPL is enough.
> Please clean up all the training-related arguments/datasets/metrics (loss).

@ciyongch thanks for the review :)
I will clean up these arguments, but some of them are tied to the loss and the dataset:

  • PPL calculation: ppl = np.exp(ave_loss). If we remove the loss calculation, we remove PPL too, but the BLEU score is still there.
  • Dataset: the transformer script builds the data loaders for train_dataset, val_dataset, and test_dataset together, so once we use one of them we build all of them, unless we also change the make_dataloader code:

    def make_dataloader(data_train, data_val, data_test, args,
                        use_average_length=False, num_shards=0, num_workers=8):
        """Create data loaders for training/validation/test."""

@ciyongch (Contributor) commented Aug 8, 2019

@pengxin99, only the BLEU score should be fine for inference. It's OK to reuse make_dataloader to get the dataset and then use only the test dataset for inference.
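
A minimal sketch of that approach (assuming make_dataloader returns the train/val/test loaders in that order, that the script's beam-search translator and compute_bleu helpers are in scope, that test_references holds the tokenized reference translations, and that to_sentences is a hypothetical detokenization helper):

    # Reuse make_dataloader unchanged; discard the train/val loaders and
    # score the test set with BLEU alone (no loss, hence no PPL).
    _, _, test_loader = make_dataloader(data_train, data_val, data_test, args)

    translations = []
    for src_seq, tgt_seq, src_valid_len, tgt_valid_len in test_loader:
        samples, _scores, sample_valid_len = translator.translate(
            src_seq.as_in_context(ctx), src_valid_len.as_in_context(ctx))
        translations.extend(to_sentences(samples, sample_valid_len))

    # BLEU only: with the loss gone, PPL (= np.exp(ave_loss)) is dropped too.
    bleu_score, *_ = compute_bleu([test_references], translations, bpe=True)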

@ciyongch (Contributor) commented:

@pengxin99 please take a look at the failure and check whether it's related to your latest code changes.

@leezu (Contributor) commented Aug 14, 2019

There were some unrelated CI failures. They will go away once #875 is merged and this PR is merged or rebased on current master.

@ciyongch (Contributor) left a comment:
LGTM.

@mli (Member) commented Sep 2, 2019

Job PR-852/16 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/16/index.html

@szha (Member) commented Sep 3, 2019

@eric-haibin-lin (Member) left a comment:

@ciyongch the test failed because the param file is not available.

mxnet.base.MXNetError: [23:00:08] src/io/local_filesys.cc:209: Check failed: allow_null:  LocalFileSystem::Open "./scripts/machine_translation/transformer_en_de_u512/valid_best.params": No such file or directory

Can you make sure the checkpoint is downloaded in the test?

@ciyongch (Contributor) commented Sep 3, 2019

> Can you make sure the checkpoint is downloaded in the test?

@eric-haibin-lin Sure, I will add this.

@ciyongch (Contributor) commented Sep 4, 2019

@eric-haibin-lin, I've added support for downloading the params file if needed. Do you have a preferred location to store this file (~387 MB)? Currently I've put it on Google Drive.

@mli (Member) commented Sep 4, 2019

Job PR-852/17 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/17/index.html

@eric-haibin-lin (Member) left a comment:

@szha shall we host it on s3?
@ciyongch would you mind fixing the lint error?


param_name = args.model_parameter
if (not os.path.exists(param_name)):
    download("https://drive.google.com/open?id=1588i6OoaL8qC0K8gI3p2iFOYY5AEuRIN", fname=param_name)
Review comment: Can you add a warning that the provided file does not exist and that the download will happen?
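
One possible shape for that (a sketch only, keeping the download call from the diff unchanged and adding a logging warning in front of it):

    import logging
    import os

    param_name = args.model_parameter
    if not os.path.exists(param_name):
        # Warn that the provided file is missing and a download will happen.
        logging.warning('Provided param file %s does not exist; '
                        'downloading the pre-trained checkpoint instead.',
                        param_name)
        download('https://drive.google.com/open?id=1588i6OoaL8qC0K8gI3p2iFOYY5AEuRIN',
                 fname=param_name)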

Reply: @eric-haibin-lin I've created a new commit to address your comments :).
I also think it's better to host these params on s3, to stay aligned with the other dataset/param files.

@mli (Member) commented Sep 5, 2019

Job PR-852/18 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/18/index.html

@ciyongch (Contributor) commented Sep 5, 2019

@eric-haibin-lin it looks like CI failed to download the complete params file (Invalid NDArray file format). Could we upload the file to s3? I will also switch to gluon.utils.download, which supports a sha1_hash check.
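
For reference, mxnet.gluon.utils.download accepts a sha1_hash argument: it skips the fetch when the file at path already matches the hash, and verifies the checksum after downloading, which catches truncated transfers. A sketch with placeholder URL and hash (both hypothetical; the real s3 link and sha1 were added later in this PR):

    from mxnet.gluon.utils import download

    # Placeholder values for illustration only.
    PARAMS_URL = 'https://example-bucket.s3.amazonaws.com/transformer_en_de_u512/valid_best.params'
    PARAMS_SHA1 = '<sha1 of valid_best.params>'

    download(PARAMS_URL, path=param_name, sha1_hash=PARAMS_SHA1)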

@szha (Member) commented Sep 5, 2019

@ciyongch I can help upload the file. Just let me know where I can download the complete file, and I'll share the link with you once done.

@ciyongch (Contributor) commented Sep 5, 2019

Thanks @szha :)
Please get the trained params file from google drive: https://drive.google.com/open?id=1588i6OoaL8qC0K8gI3p2iFOYY5AEuRIN
md5sum is: d02cbb76349e7ffddca3b3e719a56a82

@mli (Member) commented Sep 8, 2019

Job PR-852/19 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/19/index.html

@mli (Member) commented Sep 8, 2019

Job PR-852/20 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/20/index.html

@ciyongch (Contributor) commented Sep 8, 2019

@szha @eric-haibin-lin the params file is updated to use the s3 link, and a sha1_hash is added to check the file. Please help take a look.

@eric-haibin-lin merged commit 5059208 into dmlc:master on Sep 8, 2019
@eric-haibin-lin (Member) commented:

@ciyongch @pengxin99 nice work. Thanks!

@leezu (Contributor) commented Jan 15, 2020

@ciyongch (Contributor) commented:

@leezu @eric-haibin-lin the new params file 97ffd554a was introduced to test the transformer inference script when it was added, as we didn't notice the already-available pre-trained params.
After trying both versions of the params (97ffd554a vs e25287c5) locally, the latter showed a better BLEU number than the former, so I think it's better to update the inference script to use e25287c5 by default, which will also align it with the website.
