
Speeding Up Transformer Inference

Transfer learning with large-scale pre-trained models based on the transformer architecture has become popular for downstream applications due to its promising gains in model performance. In real-world applications, however, we often face additional constraints around latency, throughput, and memory. This folder showcases several techniques commonly used to speed up transformer model inference while retaining the majority of the model's performance.

  • Response Knowledge Distillation for Training a Student Model. [nbviewer][html] (a loss-function sketch follows this list)
  • Finetuning a Pre-trained BERT Model on a Text Classification Task and Running Inference with ONNX Runtime. [nbviewer][html] (an ONNX Runtime inference sketch follows this list)
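
The distillation notebook trains a smaller student model on a blend of the teacher's soft predictions and the ground-truth labels. The snippet below is a minimal sketch of that response-based distillation loss, not the notebook's exact implementation; `temperature` and `alpha` are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,  # assumed default
                      alpha: float = 0.5) -> torch.Tensor:  # assumed default
    # Soft targets: KL divergence between the temperature-scaled student
    # and teacher distributions. The T^2 factor keeps gradient magnitudes
    # comparable across different temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

During training, the teacher's logits are computed under `torch.no_grad()` (or detached) so that gradients only flow through the student.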
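
The ONNX notebook exports the fine-tuned classifier and serves it with ONNX Runtime, which typically reduces CPU inference latency compared to eager PyTorch. Below is a minimal inference sketch; the file name `bert_classifier.onnx` and the graph input names `input_ids`/`attention_mask` are assumptions and must match whatever was passed to `torch.onnx.export`.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Assumed artifact path and graph input names; adjust to match the export.
MODEL_PATH = "bert_classifier.onnx"

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
session = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])

encoded = tokenizer("the movie was surprisingly good", return_tensors="np")
outputs = session.run(
    None,  # None fetches every graph output
    {
        "input_ids": encoded["input_ids"].astype(np.int64),
        "attention_mask": encoded["attention_mask"].astype(np.int64),
    },
)
logits = outputs[0]
print("predicted class:", logits.argmax(axis=-1))
```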
