<a href="https://colab.research.google.com/github/bhadreshpsavani/UnderstandingNLP/blob/master/TFLiteExperiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import


In [1]:
!pip install -q transformers



## Get Model

In [45]:
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')
print(tf.__version__)

2.3.0


In [25]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', return_dict=True)
inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
inputs["labels"] = tf.reshape(tf.constant(1), (-1, 1)) # Batch size 1

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_projector', 'vocab_transform', 'activation_13', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'dropout_38', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [27]:
inputs

{'input_ids': <tf.Tensor: shape=(1, 8), dtype=int32, numpy=
array([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 8), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

In [50]:
%%time
outputs = model(inputs)

CPU times: user 139 ms, sys: 16.4 ms, total: 155 ms
Wall time: 133 ms


In [51]:
loss = outputs.loss
logits = outputs.logits

In [28]:
logits

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[0.11225172, 0.07725379]], dtype=float32)>
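The logits above are raw scores, not probabilities. As a minimal sketch, they can be turned into class probabilities with a softmax; the helper name `softmax` and the plain-Python implementation are illustrative assumptions, not part of the notebook. Because the classification head is newly initialized (see the warning above), the resulting probabilities are close to 0.5 and carry no real meaning until the model is fine-tuned.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the notebook output above (batch size 1, two classes).
logits_row = [0.11225172, 0.07725379]
probs = softmax(logits_row)

# Index of the most probable class.
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```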

In [30]:
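# Declare a static input signature: one int32 tensor of shape [batch=1, seq_len=8].
input_spec = tf.TensorSpec([1, 8], tf.int32)
# model._set_inputs(input_spec, training=False)
# Clear any cached save spec, then register the new one so that conversion
# traces the model with this fixed shape. Both are private Keras members
# and may change between TensorFlow versions.
model._saved_model_inputs_spec = None
model._set_save_spec(input_spec)
input_spec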

TensorSpec(shape=(1, 8), dtype=tf.int32, name=None)

In [31]:
model.inputs

In [49]:
model.save_weights('./tensorflow_distilbert/checkpoint')

# TensorFlow Lite:

## With Normal Conversion:

In [32]:
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# For normal conversion:
converter.target_spec.supported_ops = [tf.lite.OpsSet.SELECT_TF_OPS]

In [33]:
tflite_model = converter.convert()
open("distilbert.tflite", "wb").write(tflite_model)

INFO:tensorflow:Assets written to: /tmp/tmphe9e_3tb/assets
INFO:absl:Using experimental converter: If you encountered a problem please file a bug. You can opt-out by setting experimental_new_converter=False


266480416
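The number printed above is the byte count returned by `write()`. A small sketch of converting it to mebibytes follows; the helper names `bytes_to_mib` and `file_size_mib` are hypothetical and only meant for illustration.

```python
import os


def bytes_to_mib(num_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 ** 2)


def file_size_mib(path):
    """Size of a file on disk in MiB (hypothetical convenience helper)."""
    return bytes_to_mib(os.path.getsize(path))


# Byte count copied from the output of the cell above.
tflite_size_bytes = 266480416
print(f"{bytes_to_mib(tflite_size_bytes):.2f} MiB")
```

In the notebook, `file_size_mib("distilbert.tflite")` would give the same figure for the file just written.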

In [55]:
# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="distilbert.tflite")
interpreter.allocate_tensors()

In [56]:
# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

In [57]:
input_details

[{'dtype': numpy.int32,
  'index': 0,
  'name': 'args_0',
  'quantization': (0.0, 0),
  'quantization_parameters': {'quantized_dimension': 0,
   'scales': array([], dtype=float32),
   'zero_points': array([], dtype=int32)},
  'shape': array([1, 8], dtype=int32),
  'shape_signature': array([1, 8], dtype=int32),
  'sparsity_parameters': {}}]
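The input details show a fixed shape of `[1, 8]`, so any other sentence has to be padded or truncated to exactly 8 token ids before calling `set_tensor`. The sketch below is one way to do that in plain Python. The function name `fit_to_length` is hypothetical, and the defaults `pad_id=0` and `sep_id=102` are assumptions based on the `distilbert-base-uncased` vocabulary. Note that the converted model takes only `input_ids` and no attention mask, so padding tokens will likely influence the logits.

```python
def fit_to_length(token_ids, length=8, pad_id=0, sep_id=102):
    """Return a copy of token_ids padded or truncated to exactly `length` ids."""
    ids = list(token_ids)
    if len(ids) > length:
        # Truncate, but keep a [SEP] token in the final position.
        ids = ids[: length - 1] + [sep_id]
    # Right-pad with pad_id until the target length is reached.
    return ids + [pad_id] * (length - len(ids))


# A short sequence gets padded on the right.
print(fit_to_length([101, 7592, 102]))

# The sequence from this notebook already has 8 ids and is left unchanged.
print(fit_to_length([101, 7592, 1010, 2026, 3899, 2003, 10140, 102]))
```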

In [58]:
output_details

[{'dtype': numpy.float32,
  'index': 638,
  'name': 'Identity',
  'quantization': (0.0, 0),
  'quantization_parameters': {'quantized_dimension': 0,
   'scales': array([], dtype=float32),
   'zero_points': array([], dtype=int32)},
  'shape': array([1, 2], dtype=int32),
  'shape_signature': array([1, 2], dtype=int32),
  'sparsity_parameters': {}}]

In [59]:
list(inputs['input_ids'].numpy()[0])

[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]

In [63]:
%%time
interpreter.set_tensor(input_details[0]['index'], inputs['input_ids'])
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
output_data

CPU times: user 67.6 ms, sys: 2.62 ms, total: 70.3 ms
Wall time: 53.9 ms
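A single `%%time` run can be noisy, since the first call often includes one-off setup cost. As a rough sketch, the measurement can be repeated after a few warm-up calls and summarized. The `benchmark` helper and its parameters are assumptions for illustration; the lambda is a stand-in workload so the block runs on its own.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Call fn repeatedly and return timing statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "runs": runs,
    }


# Stand-in workload; in the notebook this could be `interpreter.invoke`.
stats = benchmark(lambda: sum(range(10_000)))
print(stats)
```

Passing `interpreter.invoke` after `set_tensor` has been called once would time only the inference step.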


## FP16 Quantization:

In [46]:
# FP16 quantization roughly halves the model size
# (hybrid quantization, in the next section, roughly quarters it).
# For conversion with FP16 quantization:
# supports CPUs and GPUs
converter_16 = tf.lite.TFLiteConverter.from_keras_model(model)
converter_16.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
converter_16.target_spec.supported_types = [tf.float16]
converter_16.optimizations = [tf.lite.Optimize.DEFAULT]
converter_16.experimental_new_converter = True

tflite_model_fp16 = converter_16.convert()
open("distilbert_fp16.tflite", "wb").write(tflite_model_fp16)

# Load the TFLite model and allocate tensors.
interpreter_fp16 = tf.lite.Interpreter(model_path="distilbert_fp16.tflite")
interpreter_fp16.allocate_tensors()

# Get input and output tensors.
input_details = interpreter_fp16.get_input_details()
output_details = interpreter_fp16.get_output_details()

INFO:tensorflow:Assets written to: /tmp/tmp_7o63ngk/assets
INFO:absl:Using experimental converter: If you encountered a problem please file a bug. You can opt-out by setting experimental_new_converter=False


[[0.11221965 0.07723911]]


In [72]:
%%time
interpreter_fp16.set_tensor(input_details[0]['index'], inputs['input_ids'])
interpreter_fp16.invoke()
output_data_fp16 = interpreter_fp16.get_tensor(output_details[0]['index'])
output_data_fp16

CPU times: user 50.6 ms, sys: 1.05 ms, total: 51.7 ms
Wall time: 54.2 ms


## Hybrid Quantization:

In [47]:
# For conversion with hybrid quantization:
# This only supports CPUs
converter_hy = tf.lite.TFLiteConverter.from_keras_model(model)
converter_hy.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
# OPTIMIZE_FOR_SIZE is deprecated in newer TensorFlow releases and behaves like Optimize.DEFAULT.
converter_hy.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
converter_hy.experimental_new_converter = True

tflite_model_hy = converter_hy.convert()
open("distilbert_hy.tflite", "wb").write(tflite_model_hy)

# Load the TFLite model and allocate tensors.
interpreter_hy = tf.lite.Interpreter(model_path="distilbert_hy.tflite")
interpreter_hy.allocate_tensors()

# Get input and output tensors.
input_details = interpreter_hy.get_input_details()
output_details = interpreter_hy.get_output_details()

INFO:tensorflow:Assets written to: /tmp/tmpkf129zvj/assets
INFO:absl:Using experimental converter: If you encountered a problem please file a bug. You can opt-out by setting experimental_new_converter=False


[[0.1182588  0.08871834]]


In [67]:
%%time
interpreter_hy.set_tensor(input_details[0]['index'], inputs['input_ids'])
interpreter_hy.invoke()
output_data_hy = interpreter_hy.get_tensor(output_details[0]['index'])
output_data_hy

CPU times: user 37.4 ms, sys: 34 µs, total: 37.5 ms
Wall time: 38.6 ms
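To get a feel for how much quantization changes the predictions, the outputs can be compared against the original Keras logits. The sketch below uses the values printed in this notebook, assuming the arrays shown in the FP16 and hybrid sections are the outputs of those two models for the same input. The helper `max_abs_diff` is a hypothetical name.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


# Values copied from the notebook outputs (single example, two classes).
keras_logits = [0.11225172, 0.07725379]
fp16_logits = [0.11221965, 0.07723911]
hybrid_logits = [0.1182588, 0.08871834]

fp16_error = max_abs_diff(keras_logits, fp16_logits)
hybrid_error = max_abs_diff(keras_logits, hybrid_logits)
print(f"FP16 max abs diff:   {fp16_error:.6f}")
print(f"Hybrid max abs diff: {hybrid_error:.6f}")
```

On this one example the FP16 deviation is on the order of 1e-5 and the hybrid deviation on the order of 1e-2. A single sentence is not enough to judge accuracy; a proper comparison would use a labeled evaluation set.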


## Observations:
For the DistilBERT sequence classification model, we observed the following:

|Method|Model Size|Inference Time (CPU total)|
|----- | ----| ----- |
| TensorFlow (Keras) model, saved weights | 255 MB | 155 ms |
| TFLite without quantization | 254 MB | 70 ms |
| TFLite with FP16 quantization | 127.28 MB | 52 ms |
| TFLite with hybrid quantization | 63.97 MB | 37 ms |

* The TFLite converter only supports TensorFlow and Keras models
* TFLite might not support complex model inputs at inference time
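The figures in the table can be turned into relative gains. The sketch below takes the unquantized TFLite model as the baseline (an arbitrary choice made here for illustration) and computes the size reduction and speedup of each quantized variant. The dictionary keys are informal labels, not identifiers from the notebook.

```python
# (size in MB, inference time in ms), copied from the table above.
results = {
    "tflite": (254.0, 70.0),
    "tflite_fp16": (127.28, 52.0),
    "tflite_hybrid": (63.97, 37.0),
}

baseline_size, baseline_time = results["tflite"]

ratios = {}
for name, (size_mb, time_ms) in results.items():
    ratios[name] = {
        "size_reduction": baseline_size / size_mb,  # >1 means smaller
        "speedup": baseline_time / time_ms,         # >1 means faster
    }

for name, r in ratios.items():
    print(f"{name}: {r['size_reduction']:.2f}x smaller, {r['speedup']:.2f}x faster")
```

With these numbers FP16 comes out at roughly half the size and hybrid quantization at roughly a quarter, while the timings come from single runs and should be read as indicative only.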

Reference:
* https://discuss.huggingface.co/t/how-can-we-test-transformer-models-after-converting-it-to-tflite-format/1670/4
* https://www.tensorflow.org/lite/guide/inference
* https://github.com/huggingface/tflite-android-transformers/blob/master/models_generation/distilbert.py
* https://www.tensorflow.org/lite/performance/post_training_quantization
* https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_lite/tflite_c02_transfer_learning.ipynb
* https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_lite/tflite_c04_exercise_convert_model_to_tflite_solution.ipynb