Hi Deepspeed team,
The BERT performance results from the blog post "Microsoft DeepSpeed achieves the fastest BERT training time" are very impressive. However, I couldn't reproduce the results shown in Figure 1 of the blog.
For example, using the latest nvbert code with the BERT-Large model and max-seq-len=128, the maximum batch size I got is 136, at 194.57 examples/s. In Figure 1, however, nvbert's maximum batch size is about 82 at 215 examples/s, so the reported nvbert throughput is about 10% better than what I measured.
Could you share the detailed parameters (and/or code) used to produce the nvbert and HuggingFace BERT performance results in Figure 1, so I can reproduce them?
Thanks.
Liwei