## Mobile Deployment for Hugging Face Models

Models on the Hugging Face Hub are often large, which makes deploying them on mobile or edge devices with limited memory challenging.

Let's try to reduce the size of a lightweight model such as distilbert-base-uncased.

In [None]:
!pip install onnx onnxruntime onnxruntime-tools

In [1]:
from transformers import DistilBertModel 

model = DistilBertModel.from_pretrained('distilbert-base-uncased')

model.eval()




DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Li
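
As a rough check on why memory is a concern, the parameter count can be tallied from the layer shapes in the printout. The sketch below assumes that the part of the printout that is cut off follows the standard DistilBERT block (a `lin2` layer mapping 3072 back to 768, then a second LayerNorm); those two entries are assumptions, not values read from the output above.

```python
# Tally DistilBERT parameters from the layer shapes in the printout.
VOCAB, MAX_POS, DIM, HIDDEN, LAYERS = 30522, 512, 768, 3072, 6


def linear(n_in, n_out):
    """Weights plus bias of a Linear layer."""
    return n_in * n_out + n_out


layer_norm = 2 * DIM  # one scale vector and one shift vector

embeddings = VOCAB * DIM + MAX_POS * DIM + layer_norm

block = (
    4 * linear(DIM, DIM)     # q_lin, k_lin, v_lin, out_lin
    + layer_norm             # sa_layer_norm
    + linear(DIM, HIDDEN)    # lin1
    + linear(HIDDEN, DIM)    # lin2 (assumed: truncated in the printout)
    + layer_norm             # second LayerNorm (assumed: truncated in the printout)
)

total_params = embeddings + LAYERS * block
fp32_mib = total_params * 4 / 1024 / 1024  # 4 bytes per float32 weight

print(f"{total_params:,} parameters, about {fp32_mib:.0f} MiB in float32")
```

Under these assumptions the model holds roughly 66 million weights, so the float32 weights alone need on the order of 250 MiB before any activations are allocated.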

We will use the ONNX (Open Neural Network Exchange) format to export the model.

In [5]:
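import torch

# Sample input used to trace the model: one sequence of 512 token IDs
dummy_input = torch.ones(1, 512, dtype=torch.long)

# Export the traced graph to an ONNX file
torch.onnx.export(model, dummy_input, 'distilbert-base-uncased.onnx',
                  input_names=['input_ids'],
                  output_names=['output'],
                  opset_version=11)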



The export traces the model with a sample input, which also fixes the input shape and the input and output names recorded in the ONNX graph.
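
Because the export passes a `(1, 512)` sample and no dynamic axes, the resulting graph most likely expects exactly 512 token IDs per call. A minimal sketch of preparing inputs for that fixed shape follows. `pad_or_truncate` is a hypothetical helper, the token IDs are made up, and the pad ID of 0 is taken from `padding_idx=0` in the model printout.

```python
MAX_LEN = 512  # sequence length of the sample input used for the export
PAD_ID = 0     # padding_idx of word_embeddings in the model printout


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return a list of exactly max_len token IDs (hypothetical helper)."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


# Made-up token IDs, for illustration only
example = pad_or_truncate([101, 7592, 2088, 102])
print(len(example), example[:6])  # 512 [101, 7592, 2088, 102, 0, 0]
```

Note that the exported graph takes only `input_ids`, so padded positions are not masked out and can influence the output. Exporting with an `attention_mask` input, or passing the `dynamic_axes` argument to `torch.onnx.export`, would avoid this; neither is done in this notebook.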

Next, we quantize the model to reduce its size. Dynamic quantization stores the weights as 8-bit integers instead of 32-bit floats and quantizes activations at run time.

In [6]:
from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = "distilbert-base-uncased.onnx"
model_quantized = "distilbert-base-uncased-quantized.onnx"

quantize_dynamic(model_fp32, model_quantized, weight_type=QuantType.QUInt8)
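
To illustrate the idea behind 8-bit weight quantization, here is a minimal pure-Python sketch of affine quantization to unsigned 8-bit values. It is a simplified illustration, not ONNX Runtime's implementation; the function names and sample weights are made up.

```python
def quantize_uint8(values):
    """Map floats to integers in [0, 255] using a scale and a zero point."""
    lo = min(min(values), 0.0)  # include 0.0 in the range so it maps to an integer
    hi = max(max(values), 0.0)
    span = hi - lo
    scale = span / 255 if span > 0 else 1.0  # size of one integer step
    zero_point = round(-lo / scale)          # integer that represents 0.0
    q = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the 8-bit integers."""
    return [(x - zero_point) * scale for x in q]


weights = [-0.8, -0.1, 0.0, 0.3, 1.2]  # made-up sample weights
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q, scale, zero_point)
print(max_error)  # rounding error, at most about half an integer step
```

Each weight then needs one byte instead of four, at the cost of a small rounding error per value.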



Let's check the model sizes.

In [8]:
import os

# File sizes in MiB (FP32 model, quantized model), rounded to two decimals
round(os.path.getsize(model_fp32) / 1024 / 1024, 2), round(os.path.getsize(model_quantized) / 1024 / 1024, 2)

(253.24, 63.62)
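
The two sizes can be turned into a compression ratio with a little arithmetic. The sketch below uses the measured values rounded to two decimals; the variable names are only for illustration.

```python
fp32_mib = 253.24  # size of the exported FP32 model, in MiB
int8_mib = 63.62   # size of the quantized model, in MiB

ratio = fp32_mib / int8_mib      # how many times smaller the quantized file is
saved = 1 - int8_mib / fp32_mib  # fraction of the original size removed

print(f"{ratio:.2f}x smaller, {saved:.0%} saved")
```

The ratio comes out just under 4, which is consistent with replacing 32-bit weights by 8-bit ones; the small gap is plausibly due to parts of the file that are not quantized.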

Next, we check that the quantized model loads and runs with the onnxruntime package. The cell below verifies the output shape only; it does not measure inference speed.

In [9]:
import onnxruntime as ort
import numpy as np

ort_session = ort.InferenceSession(model_quantized)

dummy_input = np.ones((1, 512), dtype=np.int64)

outputs = ort_session.run(None, {'input_ids': dummy_input})
print(outputs[0].shape)

(1, 512, 768)
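
To go from a shape check to an actual speed measurement, the inference call can be wrapped in a small timing loop. The following is a minimal sketch using only the standard library; `measure_latency` is a hypothetical helper, and the warm-up and run counts are arbitrary choices.

```python
import statistics
import time


def measure_latency(fn, warmup=3, runs=20):
    """Call fn repeatedly and return (mean, median) latency in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return statistics.mean(timings_ms), statistics.median(timings_ms)


# In this notebook the timed call would be, for example:
#   measure_latency(lambda: ort_session.run(None, {'input_ids': dummy_input}))
# A stand-in workload keeps this sketch runnable on its own.
mean_ms, median_ms = measure_latency(lambda: sum(range(10_000)))
print(f"mean {mean_ms:.3f} ms, median {median_ms:.3f} ms")
```

Running the same helper on sessions created from the FP32 and the quantized files gives a like-for-like comparison on the machine at hand; numbers on a phone or edge device will differ.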


The quantized model is now ready for mobile deployment.