
[FEATURE] Add transformer inference code #852

Merged
merged 20 commits into from Sep 8, 2019

Conversation

@pengxin99 (Contributor) commented Jul 29, 2019

Description

Add transformer inference code to make inference easy and to make it convenient to analyze the performance of transformer inference.
@TaoLv @juliusshufan @pengzhao-intel

The following command can be used to run inference:
python inference_transformer.py --dataset WMT2014BPE --src_lang en --tgt_lang de --batch_size 2700 --scaled --average_start 5 --num_buckets 20 --bucket_scheme exp --bleu 13a --model_parameter PATH/TO/valid_best.params

which produces output like:

2019-08-19 22:03:57,600 - root - batch id=10, batch_bleu=26.0366
2019-08-19 22:04:45,904 - root - batch id=20, batch_bleu=30.8409
2019-08-19 22:05:26,991 - root - batch id=30, batch_bleu=25.3955
2019-08-19 22:06:11,089 - root - batch id=40, batch_bleu=21.9322
2019-08-19 22:06:58,313 - root - batch id=50, batch_bleu=29.7584
2019-08-19 22:07:49,634 - root - batch id=60, batch_bleu=26.5373
2019-08-19 22:08:33,846 - root - batch id=70, batch_bleu=23.2735
2019-08-19 22:09:24,003 - root - batch id=80, batch_bleu=22.8065
2019-08-19 22:10:03,324 - root - batch id=90, batch_bleu=26.0000
2019-08-19 22:10:41,997 - root - batch id=100, batch_bleu=27.7887
2019-08-19 22:11:26,346 - root - batch id=110, batch_bleu=22.6277
2019-08-19 22:12:10,353 - root - batch id=120, batch_bleu=25.9580
2019-08-19 22:12:47,614 - root - batch id=130, batch_bleu=22.6479
2019-08-19 22:13:20,316 - root - batch id=140, batch_bleu=26.6224
2019-08-19 22:13:54,895 - root - batch id=150, batch_bleu=30.2036
2019-08-19 22:14:32,938 - root - batch id=160, batch_bleu=22.4694
2019-08-19 22:15:09,624 - root - batch id=170, batch_bleu=26.4245
2019-08-19 22:15:39,387 - root - batch id=180, batch_bleu=28.8940
2019-08-19 22:16:11,217 - root - batch id=190, batch_bleu=26.2148
2019-08-19 22:16:47,089 - root - batch id=200, batch_bleu=24.3723
2019-08-19 22:17:22,472 - root - batch id=210, batch_bleu=27.1375
2019-08-19 22:18:00,030 - root - batch id=220, batch_bleu=25.5695
2019-08-19 22:18:32,847 - root - batch id=230, batch_bleu=25.9404
2019-08-19 22:19:01,637 - root - batch id=240, batch_bleu=25.6699
2019-08-19 22:19:29,690 - root - batch id=250, batch_bleu=22.1795
2019-08-19 22:19:58,859 - root - batch id=260, batch_bleu=21.1670
2019-08-19 22:20:28,113 - root - batch id=270, batch_bleu=24.0742
2019-08-19 22:20:53,027 - root - batch id=280, batch_bleu=27.6126
2019-08-19 22:21:20,014 - root - batch id=290, batch_bleu=25.6340
2019-08-19 22:21:50,416 - root - batch id=300, batch_bleu=22.7178
2019-08-19 22:22:14,171 - root - batch id=310, batch_bleu=30.1331
2019-08-19 22:22:37,462 - root - batch id=320, batch_bleu=23.2388
2019-08-19 22:23:01,075 - root - batch id=330, batch_bleu=27.9605
2019-08-19 22:23:22,236 - root - batch id=340, batch_bleu=23.9418
2019-08-19 22:23:40,851 - root - batch id=350, batch_bleu=22.2135
2019-08-19 22:24:01,679 - root - batch id=360, batch_bleu=23.6225
2019-08-19 22:24:15,178 - root - Inference at test dataset. inference bleu=26.0137, throughput=0.1236K wps

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@pengxin99 requested a review from szha as a code owner on July 29, 2019 01:33
codecov bot commented Jul 29, 2019

Codecov Report

❗ No coverage uploaded for pull request head (inference-transformer@55b7693).
The diff coverage is n/a.

codecov bot commented Jul 29, 2019

Codecov Report

Merging #852 into master will increase coverage by 0.08%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #852      +/-   ##
==========================================
+ Coverage   90.38%   90.47%   +0.08%     
==========================================
  Files          66       66              
  Lines        6367     6405      +38     
==========================================
+ Hits         5755     5795      +40     
+ Misses        612      610       -2
Impacted Files Coverage Δ
src/gluonnlp/data/utils.py 70.27% <0%> (-0.48%) ⬇️
src/gluonnlp/data/batchify/language_model.py 96.26% <0%> (ø) ⬆️
src/gluonnlp/model/bert.py 99.45% <0%> (+0.06%) ⬆️
src/gluonnlp/data/batchify/batchify.py 96.59% <0%> (+0.16%) ⬆️
src/gluonnlp/model/transformer.py 91.2% <0%> (+1.07%) ⬆️

@leezu (Contributor) left a comment:

Please update the module docstring and the argparse description to reflect this is only for inference. Thanks!

@szha added the release focus label on Jul 31, 2019
@pengzhao-intel commented:

@pengxin99 please help resolve the comments; we need to merge this soon.

@mli (Member) commented Aug 6, 2019

Job PR-852/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/3/index.html

@pengxin99 (Contributor, Author) commented:

Thanks for your comments; the code has been modified accordingly. Please review. @leezu @ciyongch @eric-haibin-lin

@mli (Member) commented Aug 6, 2019

Job PR-852/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/4/index.html

@mli (Member) commented Aug 7, 2019

Job PR-852/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/5/index.html

@mli (Member) commented Aug 7, 2019

Job PR-852/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/6/index.html

@ciyongch (Contributor) commented Aug 8, 2019

@pengxin99 please take a look at the CI failure (pylint/format error).

@ciyongch (Contributor) left a comment:

For inference, only the test dataset is required (both src_test and target_test), while the training and validation datasets are for the training phase. Loss is also useless in inference mode; BLEU/PPL is enough.

Please clean up all the training-related arguments/datasets/metrics (loss).


parser = argparse.ArgumentParser(description='Neural Machine Translation Example.'
'We use this script only for transformer inference.')
parser.add_argument('--dataset', type=str, default='WMT2016BPE', help='Dataset to use.')
Review comment: Set default value to "WMT2014BPE"?

parser.add_argument('--dataset', type=str, default='WMT2016BPE', help='Dataset to use.')
parser.add_argument('--src_lang', type=str, default='en', help='Source language')
parser.add_argument('--tgt_lang', type=str, default='de', help='Target language')
parser.add_argument('--epochs', type=int, default=10, help='upper epoch limit')
Review comment: Do we really need --epochs for inference mode?

help='Dimension of the hidden state in position-wise feed-forward networks.')
parser.add_argument('--dropout', type=float, default=0.1,
help='dropout applied to layers (0 = no dropout)')
parser.add_argument('--epsilon', type=float, default=0.1,
Review comment: Training-only parameters?

parser.add_argument('--num_heads', type=int, default=8,
help='number of heads in multi-head attention')
parser.add_argument('--scaled', action='store_true', help='Turn on to use scale in attention')
parser.add_argument('--batch_size', type=int, default=1024,
Review comment: --batch_size should not be tied to the hardware back-end.

parser.add_argument('--lp_alpha', type=float, default=0.6,
help='Alpha used in calculating the length penalty')
parser.add_argument('--lp_k', type=int, default=5, help='K used in calculating the length penalty')
parser.add_argument('--test_batch_size', type=int, default=256, help='Test batch size')
Review comment: Redundant parameter?

help='Perform final testing based on the '
'average of last num_averages checkpoints. '
'This is only used if average_checkpoint is True')
parser.add_argument('--average_start', type=int, default=5,
Review comment: --optimizer through --average_start are all training-only parameters; please remove all of them from the inference script.

@mli (Member) commented Aug 8, 2019

Job PR-852/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/7/index.html

@pengxin99 (Contributor, Author) commented:

> For inference, only the test dataset is required (both src_test and target_test), while the training and validation datasets are for the training phase. Loss is also useless in inference mode; BLEU/PPL is enough.
> Please clean up all the training-related arguments/datasets/metrics (loss).

@ciyongch thanks for the review :)
I will clean up these arguments, but some of them are tied to the loss and the dataset:

  • PPL calculation: ppl = np.exp(ave_loss). If we remove the loss calculation, we remove PPL too, but the BLEU score is still there.
  • Dataset: the transformer script builds the data loaders for train_dataset, val_dataset, and test_dataset together, so once we use one of them we build all of them, unless we also change the make_dataloader code:

    def make_dataloader(data_train, data_val, data_test, args,
                        use_average_length=False, num_shards=0, num_workers=8):
        """Create data loaders for training/validation/test."""

@ciyongch (Contributor) commented Aug 8, 2019

@pengxin99, only the BLEU score should be fine for inference. It's OK to reuse make_dataloader to get the dataset and then use only the test dataset for inference.
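
A minimal sketch of that approach (assuming make_dataloader returns the train/val/test loaders in that order, that the script's beam-search translator and compute_bleu helpers are in scope, that test_references holds the tokenized reference translations, and that to_sentences is a hypothetical detokenization helper):

    # Reuse make_dataloader unchanged; discard the train/val loaders and
    # score the test set with BLEU alone (no loss, hence no PPL).
    _, _, test_loader = make_dataloader(data_train, data_val, data_test, args)

    translations = []
    for src_seq, tgt_seq, src_valid_len, tgt_valid_len in test_loader:
        samples, _scores, sample_valid_len = translator.translate(
            src_seq.as_in_context(ctx), src_valid_len.as_in_context(ctx))
        translations.extend(to_sentences(samples, sample_valid_len))

    # BLEU only: with the loss gone, PPL (= np.exp(ave_loss)) is dropped too.
    bleu_score, *_ = compute_bleu([test_references], translations, bpe=True)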

@ciyongch (Contributor) commented:

@pengxin99 please take a look at the failure and check whether it's related to your latest code changes.

@leezu (Contributor) commented Aug 14, 2019

There were some unrelated CI failures. They will go away once #875 is merged and this PR is merged or rebased on current master.

@ciyongch (Contributor) left a comment:
LGTM.

@mli (Member) commented Sep 2, 2019

Job PR-852/16 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/16/index.html

@szha (Member) commented Sep 3, 2019

@eric-haibin-lin (Member) left a comment:

@ciyongch the test failed because the param file is not available.

mxnet.base.MXNetError: [23:00:08] src/io/local_filesys.cc:209: Check failed: allow_null:  LocalFileSystem::Open "./scripts/machine_translation/transformer_en_de_u512/valid_best.params": No such file or directory

Can you make sure the checkpoint is downloaded in the test?

@ciyongch (Contributor) commented Sep 3, 2019

> Can you make sure the checkpoint is downloaded in the test?

@eric-haibin-lin Sure, I will add this.

@ciyongch (Contributor) commented Sep 4, 2019

@eric-haibin-lin, I've added support for downloading the params file if needed. Do you have a preferred location to store this file (~387 MB)? Currently I've put it on Google Drive.

@mli (Member) commented Sep 4, 2019

Job PR-852/17 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/17/index.html

@eric-haibin-lin (Member) left a comment:

@szha shall we host it on s3?
@ciyongch would you mind fixing the lint error?


param_name = args.model_parameter
if (not os.path.exists(param_name)):
    download("https://drive.google.com/open?id=1588i6OoaL8qC0K8gI3p2iFOYY5AEuRIN", fname=param_name)
Review comment: Can you add a warning that the provided file does not exist and that the download will happen?
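
One possible shape for that (a sketch only, keeping the download call from the diff unchanged and adding a logging warning in front of it):

    import logging
    import os

    param_name = args.model_parameter
    if not os.path.exists(param_name):
        # Warn that the provided file is missing and a download will happen.
        logging.warning('Provided param file %s does not exist; '
                        'downloading the pre-trained checkpoint instead.',
                        param_name)
        download('https://drive.google.com/open?id=1588i6OoaL8qC0K8gI3p2iFOYY5AEuRIN',
                 fname=param_name)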

Reply: @eric-haibin-lin I've created a new commit to address your comments :).
I also think it's better to host these params on s3, to stay aligned with the other dataset/param files.

@mli (Member) commented Sep 5, 2019

Job PR-852/18 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/18/index.html

@ciyongch (Contributor) commented Sep 5, 2019

@eric-haibin-lin it looks like CI failed to download the complete params file (Invalid NDArray file format). Could we upload the file to s3? I will also switch to gluon.utils.download, which supports a sha1_hash check.
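
For reference, mxnet.gluon.utils.download accepts a sha1_hash argument: it skips the fetch when the file at path already matches the hash, and verifies the checksum after downloading, which catches truncated transfers. A sketch with placeholder URL and hash (both hypothetical; the real s3 link and sha1 were added later in this PR):

    from mxnet.gluon.utils import download

    # Placeholder values for illustration only.
    PARAMS_URL = 'https://example-bucket.s3.amazonaws.com/transformer_en_de_u512/valid_best.params'
    PARAMS_SHA1 = '<sha1 of valid_best.params>'

    download(PARAMS_URL, path=param_name, sha1_hash=PARAMS_SHA1)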

@szha (Member) commented Sep 5, 2019

@ciyongch I can help upload the file. Just let me know where I can download the complete file, and I'll share the link with you once done.

@ciyongch (Contributor) commented Sep 5, 2019

Thanks @szha :)
Please get the trained params file from google drive: https://drive.google.com/open?id=1588i6OoaL8qC0K8gI3p2iFOYY5AEuRIN
md5sum is: d02cbb76349e7ffddca3b3e719a56a82

@mli (Member) commented Sep 8, 2019

Job PR-852/19 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/19/index.html

@mli (Member) commented Sep 8, 2019

Job PR-852/20 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-852/20/index.html

@ciyongch (Contributor) commented Sep 8, 2019

@szha @eric-haibin-lin the params file is updated to use the s3 link, and a sha1_hash is added to check the file. Please help take a look.

@eric-haibin-lin merged commit 5059208 into dmlc:master on Sep 8, 2019
@eric-haibin-lin (Member) commented:

@ciyongch @pengxin99 nice work. Thanks!

@leezu (Contributor) commented Jan 15, 2020

@ciyongch (Contributor) commented:

@leezu @eric-haibin-lin the new params file 97ffd554a was introduced to test the transformer inference script when it was added, as we didn't notice the already-available pre-trained params.
After trying both versions of the params (97ffd554a vs e25287c5) locally, the latter showed a better BLEU number than the former, so I think it's better to update the inference script to use e25287c5 by default, which will also align it with the website.
