[RFC][Quantization] Support quantized models from TensorflowLite #2351

FrozenGene opened this issue Dec 28, 2018 · 3 comments

FrozenGene commented Dec 28, 2018

Let me first reference @ajtulloch's comment about the quantization workflow:

  1. Implement a model in a standard ML framework, generally using fp16/bfloat16/fp32 compute precision as this has highest throughput on most commonly-used training hardware.

  2. (optionally) insert fake quantization (here, called simulated quantization) nodes at quantization boundaries (i.e. if your backend implements a fused Int8Conv + Int8Relu, you'd insert them after a Conv + Relu block), to simulate the quantization numerics at training time.

  3. Train the model as usual

  4. Implement a graph rewriting pass (i.e. TF's toco, C2's int8_converter, MXNet's quantization, etc) that rewrites the graph to target the int8 operators directly — i.e. remapping subgraphs of e.g. FP32Conv + FP32Relu to be a fused Int8ConvRelu operator. This requires computing output quantization parameters at requantization boundaries, which can be done either by

  • calibration to an example set of activations, via e.g. l-p norm or kl minimization (c2/tf/mxnet/tensorrt)
  • using activation ranges learned during training (c2/tf).
  5. Using this quantized graph, evaluate various metrics to verify the quantization-induced error/loss is acceptable.

  6. Deploy the quantized graph.
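
For context, the "output quantization parameters" in step 4 are just a scale and zero point per tensor, and the arithmetic is the standard affine scheme TFLite uses. A minimal numeric sketch (all names here are illustrative, not any framework's API):

```python
import numpy as np

def quantize(x, scale, zero_point, dtype=np.uint8):
    # real -> integer: q = round(x / scale) + zero_point, clamped to the dtype range
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize(q, scale, zero_point):
    # integer -> approximate real value
    return scale * (q.astype(np.float32) - zero_point)

def requantize(acc_int32, input_scale, weight_scale, output_scale, output_zero_point):
    # At a requantization boundary (e.g. after an int8 conv), the int32 accumulator
    # is rescaled into the output tensor's (scale, zero_point) domain. Real kernels
    # fold the float multiplier into a fixed-point multiply + shift.
    multiplier = (input_scale * weight_scale) / output_scale
    q = np.round(acc_int32 * multiplier) + output_zero_point
    return np.clip(q, 0, 255).astype(np.uint8)  # uint8 output assumed
```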

However, some frameworks, such as TensorFlow, already handle steps 1 through 5 well. For example, TensorFlow offers quantization-aware training, which takes care of step 2 and ends up with good accuracy.
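
Concretely, the quantization-aware training rewrite in TF 1.x is a one-line graph transform. A rough sketch, assuming the tf.contrib.quantize API that was current at the time:

```python
import tensorflow as tf

# Build the FP32 model (e.g. MobileNet V1) and its loss as usual; this is step 1.
train_graph = tf.get_default_graph()

# Step 2: insert fake-quantization (simulated quantization) nodes. quant_delay
# lets the network train in float for a while before quantization kicks in.
tf.contrib.quantize.create_training_graph(input_graph=train_graph, quant_delay=2000)

# ... step 3: train as usual ...

# For export (steps 4-6), build a separate inference graph, call
# tf.contrib.quantize.create_eval_graph(input_graph=eval_graph) on it, freeze it,
# and run toco to produce a fully quantized .tflite model.
```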

In industry development, one common scenario is that a company splits the algorithm work and the engine / framework work across two different teams. The algorithm team simply hands a model to the engine team to boost its performance. So if the algorithm team uses TensorFlow's quantization-aware training, they already know the accuracy before delivering the model to the engine team, and the engine team is only responsible for boosting performance.

For the reason above, I will make several PRs to support importing existing quantized models (TFLite INT8 models) into TVM. This is not a replacement for #2116; it is just a supplement to TVM's quantization.

After an initial investigation and effort, INT8 gives a speedup of about 30% over FP32 for the MobileNet V1 model on an ARM CPU.
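
From the user's side, the import path could look roughly like the following once these PRs land. Treat the exact names and signatures (especially relay.frontend.from_tflite) as tentative, since the frontend is precisely what these PRs add:

```python
import tflite.Model          # flatbuffers-generated TFLite schema package
from tvm import relay

# Load an already-quantized TFLite flatbuffer, e.g. the quantized MobileNet V1.
buf = open("mobilenet_v1_1.0_224_quant.tflite", "rb").read()
tflite_model = tflite.Model.Model.GetRootAsModel(buf, 0)

# Proposed frontend entry point (tentative interface).
func, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "uint8"},
)

# Compile for an ARM CPU target, reusing the normal Relay build pipeline.
graph, lib, params = relay.build(func, target="llvm -device=arm_cpu", params=params)
```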

  • Support TFLite FP32 Relay frontend. PR: #2365

  • Support TFLite INT8 Relay frontend

  • Extend the attributes of convolution and related ops to support quantization (an illustrative sketch follows this list)

  • Make AutoTVM work with INT8 on ARM CPU
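
To make the third item concrete, the extra information a quantized convolution has to carry is roughly the set of quantization parameters below. Every name here is hypothetical and only meant to illustrate the attribute set, not a settled Relay design:

```python
# Purely illustrative: what a quantized conv2d would need in addition to its
# existing attributes (strides, padding, channels, ...). Names are hypothetical.
quantized_conv2d_attrs = {
    "input_scale": 0.0078125,     # scale of the uint8 input tensor
    "input_zero_point": 128,
    "weight_scale": 0.02,         # per-tensor (or per-channel) weight scale
    "weight_zero_point": 0,
    "output_scale": 0.023528,     # output quantization parameters, used by the
    "output_zero_point": 0,       # requantize step sketched earlier
    "out_dtype": "int32",         # accumulate in int32 before requantizing
}
```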

Welcome any feedback.

tqchen changed the title from "[Quantization] Support quantized models of Tensorflow" to "[RFC][Quantization] Support quantized models from Tensorflow" on Dec 28, 2018

tqchen added the status: RFC label on Dec 28, 2018


tqchen commented Dec 31, 2018

Starting from a TFLite importer to Relay sounds great. cc @jroesch @ajtulloch @yzhliu


ZihengJiang commented Jan 4, 2019

If you want to support transforming quantized models, be careful to transform ops like quantize into small ops like multiply and add, so that existing kernels can be reused and optimizations like fusion still apply.
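
For illustration, one way such a lowering could look, expressed with existing Relay elementwise ops so the fusion pass can merge it with its neighbours. This is a rough sketch, not a committed design:

```python
from tvm import relay

def lower_quantize(x_fp32, scale, zero_point):
    # quantize(x) = clip(round(x / scale) + zero_point, 0, 255), written as small
    # elementwise ops so existing kernels are reused and fusion still applies.
    q = relay.multiply(x_fp32, relay.const(1.0 / scale, "float32"))
    q = relay.round(q)
    q = relay.add(q, relay.const(float(zero_point), "float32"))
    q = relay.clip(q, a_min=0.0, a_max=255.0)
    return relay.cast(q, "uint8")
```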


FrozenGene commented Jan 4, 2019

> If you want to support transforming quantized models, be careful to transform ops like quantize into small ops like multiply and add, so that existing kernels can be reused and optimizations like fusion still apply.

Thanks for the reminder. However, I don't fully understand it. Do you mean I should be careful with the quantize op, or with the multiply / add ops? If we import an existing quantized model like a TFLite one, we shouldn't see quantize ops any more.

tqchen changed the title from "[RFC][Quantization] Support quantized models from Tensorflow" to "[RFC][Quantization] Support quantized models from TensorflowLite" on Jan 8, 2019
