
[RFC][Quantization] Support quantized models from TensorflowLite #2351

Closed
1 of 4 tasks
FrozenGene opened this issue Dec 28, 2018 · 119 comments

@FrozenGene
Member

FrozenGene commented Dec 28, 2018

Let me first reference @ajtulloch's comment about the quantization workflow:

  1. Implement a model in a standard ML framework, generally using fp16/bfloat16/fp32 compute precision as this has highest throughput on most commonly-used training hardware.

  2. (optionally) insert fake quantization (here, called simulated quantization) nodes at quantization boundaries (i.e. if your backend implements a fused Int8Conv + Int8Relu, you'd insert them after a Conv + Relu block), to simulate the quantization numerics at training time.

  3. Train the model as usual

  4. Implement a graph rewriting pass (i.e. TF's toco, C2's int8_converter, MXNet's quantization, etc) that rewrites the graph to target the int8 operators directly — i.e. remapping subgraphs of e.g. FP32Conv + FP32Relu to be a fused Int8ConvRelu operator. This requires computing output quantization parameters at requantization boundaries, which can be done either by

  • calibration to an example set of activations, via e.g. l-p norm or kl minimization (c2/tf/mxnet/tensorrt)
  • using activation ranges learned during training (c2/tf).
  5. Using this quantized graph, evaluate various metrics to verify the quantization-induced error/loss is acceptable.

  6. Deploy the quantized graph.

However, we already have frameworks that can do steps 1 through 5 well, such as TensorFlow. For example, TensorFlow has quantization-aware training, which handles step 2 and ends up with good accuracy.

In industry development, one common scenario is that a company splits the algorithm work and the engine / framework work into two different teams. The algorithm team just hands a model to the engine team to boost its performance. So if the algorithm team uses TensorFlow's quantization-aware training, they already know the accuracy before delivering the model to the engine team, and the engine team is only responsible for boosting the performance.

For this reason, I will make several PRs to support importing existing quantized models (TFLite INT8 models) into TVM. This is not a replacement of #2116; it is just a supplement to TVM's quantization.

After an initial investigation and effort, on the MobileNet V1 model, INT8 gives a speedup of about 30% compared with FP32 on ARM CPU.

Welcome any feedback.

@tqchen tqchen changed the title [Quantization] Support quantized models of Tensorflow [RFC][Quantization] Support quantized models from Tensorflow Dec 28, 2018
@tqchen
Member

tqchen commented Dec 31, 2018

Starting from TFLite importer to relay sounds great. cc @jroesch @ajtulloch @yzhliu

@ZihengJiang
Contributor

If you want to support transforming quantized models, be careful to transform ops like quantize into small ops like multiply and add, so that kernels can be reused and optimizations like fusion still apply

@FrozenGene
Member Author

If you want to support transforming quantized models, be careful to transform ops like quantize into small ops like multiply and add, so that kernels can be reused and optimizations like fusion still apply

Thanks for the reminder. However, I don't fully understand it. Do you mean I should be careful with the quantize op, or with the multiply / add ops? If we import an existing quantized model like TFLite's, we shouldn't see quantize ops any more.

@tqchen tqchen changed the title [RFC][Quantization] Support quantized models from Tensorflow [RFC][Quantization] Support quantized models from TensorflowLite Jan 8, 2019
@FrozenGene FrozenGene mentioned this issue Feb 19, 2019
@jnorwood

Hi, I recently wrote some code to read in the tflite quantized examples and translate them to nnef output. Their operations are pretty similar to nnvm ops. I translated the two mobilenets and the four inception models. There's a cmake config that pulls down all the models and converts them. Please feel free to use whatever you want from it. I forked the NNEF Tools project, https://github.com/jnorwood and put the converter under the contrib/converters/tflite_converters/tflite_to_nnef

I only added processing for the ops I needed, and I only did quantized data. tflite uses uint8 quantization, btw, with offsets for both weights and features. Biases are int32. NNEF passes quantization configuration in a separate file from the graph. Also, note that tflite uses nhwc everywhere.

@anijain2305
Contributor

@FrozenGene I am interested in contributing to this Issue. Is it possible to share the progress?

@FrozenGene
Member Author

Hey @anijain2305, thanks for your interest. Currently I am working on #3141; after that, I will start on this. BTW, our internal support is based on NNVM and is complete: we get the same results as TFLite and better performance than TFLite. However, I will have to spend some time translating it to Relay when making the PR. I also have to say that I am busy this month with our product development, and the code still has to go through my company's open-source process. I will @ you when that PR is ready.

@anijain2305
Contributor

Thanks. Let's lay down the high-level API design for some of the quantized operators. A large portion of this is coming from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also @shoubhik for helping design this RFC.

Other non-TVM related links that were used to understand quantization

  • GemmLowP - Doc
  • TFlite reference code

Covered frameworks for now - TFLite and MxNet
Target network for now - Inception V3 from TFLite. (I will create one for Mxnet)
Target platforms for now - ARM and Intel (will create separate Issue as the project progresses)


List of required operators - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize


It will be good if we can agree on the Relay ops - their inputs/outputs and attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (other quantized_* operators will be along the same lines as quantized_conv2d)

Op quantize

def quantize(data, scale, zero_point, out_dtype):
    """
    Quantize takes the scale and zero_point attributes and quantizes the 
    FP32 input data to int8/uint8 tensor.

    Parameters
    -----------
    data: FP32 tensor
           The input tensor in FP32.
    
    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be int8/uint8

    Returns
    -------
    quantized_data: int8/uint8 tensor
           The quantized tensor.

    """

Key points to discuss

  • The scale and zero_point calculations happen outside the Relay graph, i.e., the framework parsers will have to compute the scale and zero_point if only min and max are provided (reference implementation in TFLite). This can also be thought of as a framework parser utility where we handle min/max, symmetric/asymmetric, etc. and generate the scale and zero_point the way each framework expects; a minimal sketch of such a utility follows below.
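
A minimal sketch of such a parser utility for asymmetric uint8 quantization, with hypothetical names (not part of the proposed API):

def min_max_to_scale_zero_point(rmin, rmax, qmin=0, qmax=255):
    # Ensure the representable range contains 0.0, as the TFLite tooling does.
    rmin = min(rmin, 0.0)
    rmax = max(rmax, 0.0)
    scale = (rmax - rmin) / float(qmax - qmin)
    if scale == 0.0:
        # Degenerate range (e.g. an all-zero tensor); any scale works.
        return 1.0, 0
    # zero_point is the quantized value that represents real 0.0 exactly.
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))

# Example: a ReLU6-style range [0, 6] gives scale ~0.0235 and zero_point 0.
print(min_max_to_scale_zero_point(0.0, 6.0))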

Op quantized_conv2d

def quantized_conv2d(quantized_data, quantized_kernel, 
        input_scale, input_zero_point,
        kernel_scale, kernel_zero_point,
        output_scale, output_zero_point,
        out_dtype,

        # All the old remaining ones from conv2d
        strides=(1, 1),
        padding=(0, 0),
        dilation=(1, 1),
        groups=1,
        channels=None,
        kernel_size=None,
        data_layout="NCHW",
        kernel_layout="OIHW",
        out_layout=""):
    """
    
    Quantized conv2d takes quantized input data and a quantized kernel, together with
    their scale and zero_point attributes, and produces a quantized output. The scale
    and zero_point calculations happen outside the relay graph, i.e., the framework
    parsers will have to compute the scale and offset if only min and max are provided.

    Parameters
    -----------
    quantized_data: int8/uint8 tensor
           The quantized input tensor in int8/uint8.

    quantized_kernel: int8/uint8 tensor
           The quantized kernel tensor in int8/uint8.
    
    input_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_data int8 values back to FP32.

    input_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_data distribution.

    kernel_scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the quantized_kernel int8 values back to FP32.

    kernel_zero_point: Int32 zero point (An attribute of the op)
           The zero point of the quantized_kernel distribution.

    output_scale: FP32 scalar (An attribute of the op)
           The output scale is set during the quantization process using training/calibration.
           The float scalar to scale the quantized_output int8 values back to FP32.

    output_zero_point: Int32 zero point (An attribute of the op)
           The output zero point is set during the quantization process using training/calibration.
           The zero point of the quantized_output distribution.

    out_dtype: String
           The dtype of the quantized_output. Can only be int8/uint8.
           The requantization from int32 to int8/uint8 is a part of the op compute.

    ..... Other attributes are the same as before.


    Returns
    -------
    quantized_output: int8/uint8 tensor
           The quantized tensor.

    """

Key points to discuss further

  • This op has a set of computations that could ideally be pre-computed, but that is difficult because constant folding only works on Relay ops and not within a Relay op. This has been discussed in more detail on the discuss forum.
    • First pre-computable - the core computation includes some terms that involve only the kernel (Term 2 and Term 4 in the above link) and that will be part of the TVM compute. This is very hard to avoid; we need a fused compute to get the best performance.
    • Second pre-computable - the output scale and zero_point are used to calculate an integer multiplier and shift so that all the computation stays in the integer domain (a sketch of this conversion follows this list). This computation changes for each op (e.g., concat will handle it differently than conv), so it is also kept inside the quantized_conv2d op. It could be avoided by changing the API and replacing output_scale with output_multiplier and output_shift, but that seems very specific to TFLite, and one might want to handle the output_scale and output_offset in a different manner. I am not sure about this part, so please comment.
  • The op already accounts for the requantization portion. As far as I understand, the requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and output_shift, as mentioned above, is for the calculation of the output quantized tensor and not for requantization.)
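
As a reference for the second pre-computable item, here is a rough sketch of the gemmlowp/TFLite-style conversion of a real-valued multiplier into an int32 fixed-point multiplier plus a shift; the function name and return convention are illustrative only, not the exact TFLite API:

import math

def quantize_multiplier(real_multiplier):
    # Decompose so that real_multiplier ~= (quantized_multiplier / 2**31) * 2**exponent.
    if real_multiplier == 0.0:
        return 0, 0
    mantissa, exponent = math.frexp(real_multiplier)      # mantissa in [0.5, 1)
    quantized_multiplier = int(round(mantissa * (1 << 31)))
    if quantized_multiplier == (1 << 31):                  # rounding pushed it up to 2**31
        quantized_multiplier //= 2
        exponent += 1
    return quantized_multiplier, exponent                  # negative exponent means a right shift

# A typical conv downscale multiplier (input_scale * kernel_scale / output_scale) of
# about 0.00725 decomposes into an int32 multiplier near 1.99e9 and an exponent of -7.
print(quantize_multiplier(0.00725))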

Op dequantize

Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.

def dequantize(quantized_data, scale, zero_point, out_dtype):
    """
    Dequantize takes the scale and zero_point attributes and dequantizes the 
    int8/uint8 tensor to FP32 tensor.

    Parameters
    -----------
    quantized_data: int8/uint8 quantized input tensor
           The input tensor in int8/uint8.
    
    scale: FP32 scalar (An attribute of the op)
           The float scalar to scale the int8 values back to FP32.

    zero_point: Int32 zero point (An attribute of the op)
           The zero point of the distribution.

    out_dtype: String
           The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
           The dequantized tensor.

    """

@FrozenGene
Member Author

@anijain2305

For the q_conv2d, we will add two more arguments.

  output_min=0, 
  output_max=0

These will be used to restrict the output range and can be calculated ahead of time; see TFLite's CalculateActivationRangeUint8 function.
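
For context, this is roughly what a CalculateActivationRangeUint8-style helper computes: the real activation bounds (e.g. 0/6 for RELU6) are mapped into the quantized domain with the output scale and zero point, then intersected with the uint8 range. A hedged sketch (names and structure are illustrative, not the exact TFLite code):

def activation_range_uint8(fused_activation, output_scale, output_zero_point):
    qmin, qmax = 0, 255
    def quant(r):
        return output_zero_point + int(round(r / output_scale))
    if fused_activation == "RELU":
        return max(qmin, quant(0.0)), qmax
    if fused_activation == "RELU6":
        return max(qmin, quant(0.0)), min(qmax, quant(6.0))
    return qmin, qmax   # no fused activation: just saturate to the uint8 range

# With output_scale ~6/255 and zero_point 0, RELU6 maps to (0, 255).
print(activation_range_uint8("RELU6", 6.0 / 255.0, 0))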

From my experience, we don't need q_relu, but we do need q_add / q_concat and so on. I suggest we use the MobilenetV2 quant model as the example, since it is used very widely and has the common ops we should consider, e.g. depthwise convolution / add / pool.

@jnorwood

jnorwood commented May 29, 2019

From my experience, we don't need q_relu, but we do need q_add / q_concat and so on. I suggest we use the MobilenetV2 quant model as the example,

Yes, I believe the MobilenetV2 relu_6 is effectively fused in by the downscale saturation. You might need it if you want to support their way of training, though.

Yes Mobilenet has the q_add, but I suggest the Inceptionv3 for q_concatenate, since it also has concat nodes feeding into concat nodes, and tflite also has to rescale inputs inside the concat operations.

Also, the MobilenetV2 q_add inputs require rescale... but in both q_concat and q_add you can recalculate the prior op downscale multipliers so you can eliminate the extra rescales.

Also, depending on your allocation capabilities, you can get rid of all concats.

@zhenhuaw-me
Contributor

Hi @anijain2305, regarding the requantization: if it is not going to be put in the conv op, the op should probably output FP32, otherwise the semantics are confusing. The requantization can then convert FP32 to INT8. The multiplier/shift based requantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.

And maybe we can put the quantization parameters in the tensor, as the scale and zero point describe the INT8 tensor data rather than the op. The ops are then supposed to read these parameters and get things done.

@anijain2305
Contributor

For the q_conv2d, we will add two more arguments.

  output_min=0,
  output_max=0

These will be used to restrict the output range and can be calculated ahead of time.

I see what you are saying, but I am not sure if this is the right approach. In my opinion, it would be better to keep it out of conv. The reason we have these 2 extra min/max values is the fused activation in TFLite. It seems better to keep them separate so that both MxNet and TFLite can share quantized_conv2d. In the case of TFLite, when we see a fused conv, we can add one more clamp operator at the end of the sequence of ops.

@anijain2305
Contributor

Yes, I believe the MobilenetV2 relu_6 is effectively fused in by the downscale saturation. You might need it if you want to support their way of training, though.

Yes Mobilenet has the q_add, but I suggest the Inceptionv3 for q_concatenate, since it also has concat nodes feeding into concat nodes, and tflite also has to rescale inputs inside the concat operations.

Makes sense. For now, I was thinking of not worrying about depthwise conv, so I decided to take Inception V3 into account. Given that we are just starting, I don't have a strong inclination towards any particular network. My motive is to focus on getting the right infrastructure first and showcase it with one large network. The performance micro-optimizations can then be phased in.

@anijain2305
Contributor

Hi @anijain2305, regarding the requantization: if it is not going to be put in the conv op, the op should probably output FP32, otherwise the semantics are confusing. The requantization can then convert FP32 to INT8. The multiplier/shift based requantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.

Makes sense. Does it make sense to add accumulator_dtype as one of the attributes of quantized_conv2d? It would be set to int32 for TFLite, Caffe2 and QNNPACK, but if some network needs accumulation in FP32, that would be supported as well.

And maybe we can put the quantization parameters in the tensor, as the scale and zero point describe the INT8 tensor data rather than the op. The ops are then supposed to read these parameters and get things done.

Not sure about this. The good thing is that the conv2d Relay operator can be shared across FP32 and quantized tensor types. The bad thing is that the compute would now depend on the quantized tensor type. This might require new Relay optimizations, preventing us from fully using the existing infrastructure.

@FrozenGene
Member Author

For the q_conv2d, we will add two more arguments.

  output_min=0,
  output_max=0

These will be used to restrict the output range and can be calculated ahead of time.

I see what you are saying, but I am not sure if this is the right approach. In my opinion, it would be better to keep it out of conv. The reason we have these 2 extra min/max values is the fused activation in TFLite. It seems better to keep them separate so that both MxNet and TFLite can share quantized_conv2d. In the case of TFLite, when we see a fused conv, we can add one more clamp operator at the end of the sequence of ops.

No matter whether we have a fused activation function, we always need output_min / output_max. We will get an int32 conv result, but we need a uint8 result, so we must restrict the int32 values to the uint8 range. If we don't have a fused activation function (and for many TFLite quantized models we don't), output_min / output_max will be 0 / 255 to restrict the int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we had better put these two into the conv arguments. That way we avoid producing another clamp; it is just computed in the conv2d int32 -> uint8 requantization, which is natural.

@anijain2305
Contributor

For the q_conv2d, we will add two more arguments.

  output_min=0,
  output_max=0

These will be used to restrict the output range and can be calculated ahead of time.

I see what you are saying, but I am not sure if this is the right approach. In my opinion, it would be better to keep it out of conv. The reason we have these 2 extra min/max values is the fused activation in TFLite. It seems better to keep them separate so that both MxNet and TFLite can share quantized_conv2d. In the case of TFLite, when we see a fused conv, we can add one more clamp operator at the end of the sequence of ops.

No matter whether we have a fused activation function, we always need output_min / output_max. We will get an int32 conv result, but we need a uint8 result, so we must restrict the int32 values to the uint8 range. If we don't have a fused activation function (and for many TFLite quantized models we don't), output_min / output_max will be 0 / 255 to restrict the int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we had better put these two into the conv arguments. That way we avoid producing another clamp; it is just computed in the conv2d int32 -> uint8 requantization, which is natural.

In the case where the activation is not fused, the values have to be clamped to 0/255, i.e. the uint8 range, which is basically given by out_dtype. So we do not need any extra information for quantized_conv2d to go back to uint8/int8 other than out_dtype. Correct?

Now, if the activation is fused, I agree that we will have two clamps: one inside quantized_conv2d (0/255), and one for the relu6 (0/6). I think this is fine. We can also write a Relay pass that replaces two back-to-back clamps with one clamp operator.

The reason I am saying this is that TFLite chooses one way to handle things, which other frameworks might not. So it is necessary to come up with the right abstractions first. The performance can then be achieved by writing Relay passes.
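
For what it's worth, two back-to-back clamps compose into a single clamp whenever their ranges overlap, which is exactly what such a pass would exploit; a tiny illustrative check (the inner bound 152 is arbitrary):

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

# clamp(clamp(x, a1, b1), a2, b2) == clamp(x, max(a1, a2), min(b1, b2)) when the ranges overlap.
for x in range(-20, 300):
    assert clamp(clamp(x, 0, 255), 0, 152) == clamp(x, 0, 152)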

@FrozenGene
Member Author

For the q_conv2d, we will add two more arguments.

  output_min=0,
  output_max=0

These will be used to restrict the output range and can be calculated ahead of time.

I see what you are saying, but I am not sure if this is the right approach. In my opinion, it would be better to keep it out of conv. The reason we have these 2 extra min/max values is the fused activation in TFLite. It seems better to keep them separate so that both MxNet and TFLite can share quantized_conv2d. In the case of TFLite, when we see a fused conv, we can add one more clamp operator at the end of the sequence of ops.

No matter whether we have a fused activation function, we always need output_min / output_max. We will get an int32 conv result, but we need a uint8 result, so we must restrict the int32 values to the uint8 range. If we don't have a fused activation function (and for many TFLite quantized models we don't), output_min / output_max will be 0 / 255 to restrict the int32 result. If we have relu6, output_min / output_max will be 0 / 6. So I think we had better put these two into the conv arguments. That way we avoid producing another clamp; it is just computed in the conv2d int32 -> uint8 requantization, which is natural.

In the case where the activation is not fused, the values have to be clamped to 0/255, i.e. the uint8 range, which is basically given by out_dtype. So we do not need any extra information for quantized_conv2d to go back to uint8/int8 other than out_dtype. Correct?

Now, if the activation is fused, I agree that we will have two clamps: one inside quantized_conv2d (0/255), and one for the relu6 (0/6). I think this is fine. We can also write a Relay pass that replaces two back-to-back clamps with one clamp operator.

The reason I am saying this is that TFLite chooses one way to handle things, which other frameworks might not. So it is necessary to come up with the right abstractions first. The performance can then be achieved by writing Relay passes.

Yes, I agree that when we don't have an activation, we don't need anything extra. However, another thing we should consider is how to integrate with other libraries, such as QNNPACK. QNNPACK also needs output min / output max: https://github.com/pytorch/QNNPACK/blob/master/include/qnnpack.h#L62-L63

@tqchen
Member

tqchen commented May 30, 2019

Here are some points to discuss:

  • namespace for the tflite quantize style dialect
  • List of ops that might need tvm's compute declaration
  • set of possible passes that lower the rest into the core ops

Some of the discussions involve fusion, and that is something where TVM might be able to help. For example, in the current symmetric scheme, clip, relu6, and the subsequent downcasting ops are automatically fused into the conv2d op, while the conv2d op can simply output int32 (because the follow-up ops will get fused).

I agree with @anijain2305 that we could try to get something minimal working first, then start thinking about possible rewriting rules to get to some useful patterns if we decide that manual intervention is necessary.

Ideally, we should have a generic schedule template that works for any fused patterns, just as those in the current symmetric version, so we do not need to have all the different variants of fused conv2d ops

also cc @vinx13 @ZihengJiang

@jnorwood

jnorwood commented May 30, 2019

I want to point out that the min and max values you mentioned are not related to the activation range in the original model; they are saturation values. In the case of mobilenet, for example, which uses relu_6 everywhere, I'm printing out the min and max activation values from the tflite mobilenet V2 below. The model uses a uint8 downscale between layers, and uses the min and max values to clamp/saturate the values to 0..255 for all layers in that model. What they could be used for (but aren't here) is more or fewer quantization bits, or signed int quantization ... but tflite is using all uint8 quantization for MobilenetV2.

The amin and amax values below are tflite's output_activation_min and output_activation_max from their quantized reference ops for conv and dw_conv.

(base) jay@jay-desktop:/tensorflow/tensorflow/lite/dbg$ grep conv mod2.log
---------conv in_h=224, in_w=224,out_h=112,out_w=112,f_h=3,f_w=3,mpy=1992157658,shft=-7,amin=0, amax=255
-------dwconv in_h=112, in_w=112,out_h=112,out_w=112,f_h=3,f_w=3,mpy=1254985768,shft=-1,amin=0, amax=255
---------conv in_h=112, in_w=112,out_h=112,out_w=112,f_h=1,f_w=1,mpy=2090511665,shft=-5,amin=0, amax=255
-------dwconv in_h=112, in_w=112,out_h=56,out_w=56,f_h=3,f_w=3,mpy=1729896231,shft=-1,amin=0, amax=255
---------conv in_h=56, in_w=56,out_h=56,out_w=56,f_h=1,f_w=1,mpy=2081950125,shft=-6,amin=0, amax=255
-------dwconv in_h=56, in_w=56,out_h=56,out_w=56,f_h=3,f_w=3,mpy=2080045879,shft=-4,amin=0, amax=255
---------conv in_h=56, in_w=56,out_h=56,out_w=56,f_h=1,f_w=1,mpy=1890535782,shft=-6,amin=0, amax=255
-------dwconv in_h=56, in_w=56,out_h=28,out_w=28,f_h=3,f_w=3,mpy=1151606277,shft=-5,amin=0, amax=255
---------conv in_h=28, in_w=28,out_h=28,out_w=28,f_h=1,f_w=1,mpy=2089579858,shft=-7,amin=0, amax=255
-------dwconv in_h=28, in_w=28,out_h=28,out_w=28,f_h=3,f_w=3,mpy=1410648286,shft=-4,amin=0, amax=255
---------conv in_h=28, in_w=28,out_h=28,out_w=28,f_h=1,f_w=1,mpy=1767908551,shft=-7,amin=0, amax=255
-------dwconv in_h=28, in_w=28,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1850037283,shft=-6,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1260482936,shft=-6,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1269068532,shft=-4,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1456865727,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1464063813,shft=-4,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1364297475,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1948805937,shft=-5,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=2136047634,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1671906928,shft=-5,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1327474777,shft=-6,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=14,out_w=14,f_h=3,f_w=3,mpy=1330877207,shft=-5,amin=0, amax=255
---------conv in_h=14, in_w=14,out_h=14,out_w=14,f_h=1,f_w=1,mpy=1497258311,shft=-7,amin=0, amax=255
-------dwconv in_h=14, in_w=14,out_h=7,out_w=7,f_h=3,f_w=3,mpy=1076915935,shft=-6,amin=0, amax=255
---------conv in_h=7, in_w=7,out_h=7,out_w=7,f_h=1,f_w=1,mpy=1124144746,shft=-6,amin=0, amax=255
-------dwconv in_h=7, in_w=7,out_h=7,out_w=7,f_h=3,f_w=3,mpy=1083785823,shft=-2,amin=0, amax=255
---------conv in_h=7, in_w=7,out_h=7,out_w=7,f_h=1,f_w=1,mpy=1240259613,shft=-5,amin=0, amax=255
---------conv in_h=1, in_w=1,out_h=1,out_w=1,f_h=1,f_w=1,mpy=1553319078,shft=-10,amin=0, amax=255


@jnorwood

Similarly, for the tflite quantized Inception V3 model, all of the output_activation_min and output_activation_max values are 0 and 255.
I'll attach a zip file with the log.
inv3.zip

@jnorwood

To explain a little further: during training they determine the range of input values, and they determine the downscale multiplier that will shrink the observed range to 0..255 (for the uint8 quantization). The FP downscale multiplier is converted to integer multiplier and right-shift constants, which are the mpy and shft values in my log. At inference time, the downscaled accumulator (after applying the downscale) may be outside the uint8 quantization range, so they clamp/saturate to that range. In these current models they are using uint8 quantization, so the range is 0..255, but it appears to me they provide the min and max to support other numbers of quantization bits. I see support for several 4-bit GPU implementations recently, so maybe this is to support something like that.
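
A rough sketch of that inference-time downscale, assuming a gemmlowp-style rounding fixed-point multiply on an int64 intermediate followed by a rounding right shift and saturation (names and constants are illustrative, and some of TFLite's rounding corner cases are glossed over; right_shift is the positive shift amount):

def downscale_accumulator(acc_int32, mpy, right_shift, zero_point=0, amin=0, amax=255):
    # Fixed-point multiply: (acc * mpy) / 2**31, rounded to nearest (int64 intermediate).
    prod = int(acc_int32) * int(mpy)
    prod = (prod + (1 << 30)) >> 31
    # Rounding right shift by the precomputed shift amount.
    if right_shift > 0:
        prod = (prod + (1 << (right_shift - 1))) >> right_shift
    # Add the output zero point, then clamp/saturate to the quantized range.
    return min(amax, max(amin, prod + zero_point))

# E.g. an int32 accumulator of 12345 with mpy=1992157658 and shift 7 downscales to ~89.
print(downscale_accumulator(12345, 1992157658, 7))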

@zhenhuaw-me
Contributor

zhenhuaw-me commented May 30, 2019

Some comments for @anijain2305 's reply :)

Hi @anijain2305, regarding the requantization: if it is not going to be put in the conv op, the op should probably output FP32, otherwise the semantics are confusing. The requantization can then convert FP32 to INT8. The multiplier/shift based requantization approach introduced by TFLite is also adopted by Caffe2/QNNPACK.

Makes sense. Does it make sense to add accumulator_dtype as one of the attributes of quantized_conv2d? It would be set to int32 for TFLite, Caffe2 and QNNPACK, but if some network needs accumulation in FP32, that would be supported as well.

A network uses operators (or layers, or whatever we'd like to call them) regardless of the accumulation format; the format is part of the software system's mechanism. So I guess we don't need an accumulator_dtype, and out_dtype is what we want. The discussion is really about whether we put requantization inside the conv2d op.

And maybe we can put the quantization parameters in the tensor, as the scale and zero point describe the INT8 tensor data rather than the op. The ops are then supposed to read these parameters and get things done.

Not sure about this. The good thing is that the conv2d Relay operator can be shared across FP32 and quantized tensor types. The bad thing is that the compute would now depend on the quantized tensor type. This might require new Relay optimizations, preventing us from fully using the existing infrastructure.

I was suggesting extending the existing tensor rather than introducing a new tensor type. I assume this won't lead to new Relay optimizations :)

EDIT: BTW, channel-wise quantization parameters are likely to be included in TensorFlow/TFLite, and in the TVM stack as a roadmap item. In that case it could be easier to manage parameters that are described on the tensor.

@zhenhuaw-me
Contributor

zhenhuaw-me commented May 30, 2019

Regarding @jnorwood 's comments on output min/max of conv2d.

Your observations about the output min/max values are correct, but they are still activations. One thing I always try to convey is that the INT8 values in quantization are a representation of the original FP32 values.

When we talk about ReLU6 activations, it means that in FP32 format the op outputs values in the range [0, 6]. For INT8 quantization, the INT8 data is a representation of the FP32 values, which means the output min/max (typically [0, 255] of INT8 type in the pre-provided quantized MobileNet) represents [0, 6] of FP32 type - the INT8 0/255 is actually FP32 0/6. Multiply the output scale (0.023528477177023888) by the activation min/max and we get a value range like [0, 5.999761581420898] (from the output of the first conv of the pre-provided quantized MobileNet).
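
The arithmetic behind that, for anyone who wants to check it (just the affine mapping, nothing TVM-specific):

scale, zero_point = 0.023528477177023888, 0
print(scale * (0 - zero_point), scale * (255 - zero_point))   # 0.0 and ~5.9998, i.e. the FP32 ReLU6 range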

Conclusions can easily be drawn once we have this in mind :)

@anijain2305
Contributor

anijain2305 commented May 31, 2019

I would suggest designing the infrastructure so that it supports both symmetric and asymmetric quantization. We can certainly start with symmetric to flush the flow, while keeping in mind that we can share as much infrastructure as possible between them.

  • namespace for the tflite quantize style dialect

I think this is required for both asymmetric and symmetric quantization. These ops will be rewritten to low-level instructions by a Relay pass. How about using relay.op._quantization as the namespace? So, the operations can be relay.op._quantization.conv2d or relay.op._quantization.quantize.

  • List of ops that might need tvm's compute declaration

I am not sure yet. The only unknowns to me are the special rounding operations used when converting the floating-point scale into an integer multiplication for scaling the quantized conv matrix. But they might already be covered by the current low-level ops.

  • set of possible passes that lower the rest into the core ops

I was hoping to re-use the FForwardRewrite infrastructure to lower the ops. Do you anticipate more passes here?

@jnorwood

We can certainly start with symmetric to flush the flow, while keeping in mind that we can share as much infrastructure as possible between them.

All the tflite quantized models I've tested use the asymmetric uint8 quantization. If you are planning to use those as examples, it will be hard to debug if you throw in the change to symmetric.

@FrozenGene
Member Author

slight difference in a single point(0.5) is fine and likely won’t have an impact on final acc

Yeah, I was planning to add a rounding param to the op. For "ceil", we could just add a 0.5 rounding without worrying about negative values. For "round', we can be more precise. By default, we can choose "ceil". What do you think?

Update - Maybe not, "ceil" is confusing. Let me think and come up with better terms (like round-away-from-zero etc.).

If your "round" is the concept from my previous comment, maybe "round" is better since it is the same as TFLite. IMO, if we cannot get the same results as TFLite, we cannot tell where things go wrong when the model is large, and we could have problems when we deploy it in an industry environment, because the algorithm team often verifies accuracy in TFLite, not in TVM.
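
To make the rounding terminology concrete, a small sketch contrasting the two behaviours being discussed, "add 0.5 and floor" versus round-half-away-from-zero (std::round); the function names here are mine:

import math

def round_half_up(x):              # the "just add 0.5" approach
    return math.floor(x + 0.5)

def round_away_from_zero(x):       # std::round-style behaviour
    return math.floor(x + 0.5) if x >= 0 else math.ceil(x - 0.5)

for v in (2.5, 1.5, -1.5, -2.5):
    print(v, round_half_up(v), round_away_from_zero(v))
# The two only disagree for negative halfway values: -1.5 -> -1 vs -2, and -2.5 -> -2 vs -3.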

@jnorwood

jnorwood commented Jul 8, 2019

tflite computes the output_multiplier and output_shift integer parameters from a double input in the call to QuantizeMultiplier. These are the integer downscale multiplier and right-shift divider parameters.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/experimental/micro/kernels/conv.cc

    double real_multiplier = 0.0;
    TF_LITE_ENSURE_STATUS(GetQuantizedConvolutionMultipler(
        context, input, filter, bias, output, &real_multiplier));
    int exponent;
    QuantizeMultiplier(real_multiplier, &data->output_multiplier, &exponent);
    data->output_shift = -exponent;

I'd suggest you print out their output_multiplier and output_shift values for comparison, since errors can start there.

Their downscale operations are implemented in int64.

anijain2305 added a commit to anijain2305/tvm that referenced this issue Jul 8, 2019
Goal - Act as medium of discussion for pull request apache#2351

Features
- New quantized conv2D op in Relay
- Python API interface to instantiate the Relay op
- Infer Type implemented
- Lowering of quantized_conv op to low-level Relay ops

Discussion points
- Does the namespace look correct?
    - Relay op is called 'relay.op.nn._quantize.quantized_conv2d'
    - Idea is that any op under '_quantize' namespace will go through rewrite.
- Should we reuse Conv2DRel and Conv2DAttrs
    - Tried prototyping. Found it hard to derive from Conv2DAttr struct
    - Infer Type has a param field. This needs to come from the right datatype.

Missing implementation
    - Lowering of quantized conv into conv+cast is incomplete.
    - Will work on it async. This is orthogonal to the discussion.
anijain2305 added a commit to anijain2305/tvm that referenced this issue Jul 10, 2019
Requantize converts one quantized tensor representation to another quantized
representation. The PR has following implementation features

- Requantize operator defined in qnn namespace - relay.qnn.requantize
- Lowering of the requantize to existing Relay operators
- Integer fixed point implementation of requantize
    - Two rounding modes - FE_UPWARDS (round towards infinity) and
    FE_AWAY_FROM_ZERO (std::round behavior)
- Floating point implementation as well, that can act as reference or can be
used for devices when FP32 computation is not used.
- Unit test cases

Relevant Issue - apache#2351

Credit to TFLite and GemmLowp to provide reference implementations.
@tqchen
Member

tqchen commented Jul 18, 2019

The discussion in this thread has gotten quite long, and it seems we are converging. I recommend we close this thread and open a new RFC thread, "QNN Dialect", with the latest proposals of the APIs that @anijain2305 and @shoubhik are putting together (please also include the related APIs in TF/QNN for reference to back the decisions).

This way we keep the community informed and we can move forward with these implementations. I hope we can get +1 representation from different groups who are interested in this direction, in particular @jnorwood @ajtulloch @FrozenGene @yzhliu

Thanks everyone for the hard work.

@tqchen
Member

tqchen commented Jul 18, 2019

@anijain2305 can you lead the proposal discussion?

@anijain2305
Contributor

I agree, we should move the proposal to a new thread.
Yes, I can lead the proposal discussion.

@tqchen
Member

tqchen commented Jul 19, 2019

@anijain2305 can you open the RFC thread? Sorry for being a bit formal in this case, we want to set an example for the first dialect public discussions.

@anijain2305
Contributor

@tqchen Thanks for reminding. Just created one :)

@tqchen
Member

tqchen commented Jul 20, 2019

Let us move to #3591

@tqchen tqchen closed this as completed Jul 20, 2019
@u99127
Contributor

u99127 commented Jul 29, 2019

Covered frameworks for now - TFLite and MxNet
Target network for now - Inception V3 from TFLite. (I will create one for Mxnet)
Target platforms for now - ARM and Intel (will create separate Issue as the project progresses)

A quick question here since I can't see this mentioned on #3591

Is this network going to be quantized per-tensor, as well as with the new per-channel quantization that is appearing in TFLite 2.0? IIUC, TF 1.13 has per-tensor quantization rather than per-channel quantization. More interestingly, can the Relay design support both?

https://www.tensorflow.org/lite/performance/quantization_spec?source=post_page---------------------------#per-axis_vs_per-tensor

regards
Ramana


@FrozenGene
Member Author

Covered frameworks for now - TFLite and MxNet
Target network for now - Inception V3 from TFLite. (I will create one for Mxnet)
Target platforms for now - ARM and Intel (will create separate Issue as the project progresses)

A quick question here since I can't see this mentioned on #3591

Is this network going to be quantized per-tensor, as well as with the new per-channel quantization that is appearing in TFLite 2.0? IIUC, TF 1.13 has per-tensor quantization rather than per-channel quantization. More interestingly, can the Relay design support both?

https://www.tensorflow.org/lite/performance/quantization_spec?source=post_page---------------------------#per-axis_vs_per-tensor

regards
Ramana


Good question. We have only supported TF 1.13 quantization so far; TF 2.0 has separate (per-channel) scales and was not considered in the previous discussion. It seems there is a gap here. cc @anijain2305
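
For reference, the gap being described: per-tensor quantization carries one scale for the whole weight tensor, while the TFLite 2.0 per-axis scheme carries one scale per output channel. A small NumPy sketch of the difference (layout and values are made up):

import numpy as np

weights_q = np.random.randint(-128, 128, size=(8, 3, 3, 3), dtype=np.int8)   # 8 output channels

# Per-tensor: a single scale shared by every value in the tensor.
per_tensor_scale = 0.02
w_fp32 = per_tensor_scale * weights_q.astype(np.float32)

# Per-channel (per-axis 0): one scale per output channel, broadcast over the rest.
per_channel_scales = np.linspace(0.01, 0.03, num=8).astype(np.float32)
w_fp32_per_channel = per_channel_scales[:, None, None, None] * weights_q.astype(np.float32)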

tqchen pushed a commit that referenced this issue Aug 8, 2019
* [Relay] [Quantization] WIP - Common files for the quantization work.

* [Relay] [Quantization] WIP - Prototyping requantize op.

* Requantize operator implementation.

Requantize converts one quantized tensor representation to another quantized
representation. The PR has following implementation features

- Requantize operator defined in qnn namespace - relay.qnn.requantize
- Lowering of the requantize to existing Relay operators
- Integer fixed point implementation of requantize
    - Two rounding modes - FE_UPWARDS (round towards infinity) and
    FE_AWAY_FROM_ZERO (std::round behavior)
- Floating point implementation as well, that can act as reference or can be
used for devices when FP32 computation is not used.
- Unit test cases

Relevant Issue - #2351

Credit to TFLite and GemmLowp to provide reference implementations.

* Typo and lint fixes.

* Doc fix.

* Uncommenting the lint script (fixing mistake).

* Modifying the unit tests.

* Moving C++ files into src/relay/qnn

* Moving python files to python/tvm/relay/qnn. Some minor fixes.

* Moving the attrs.h inside the include directory.

* Pushing files that I forgot earlier. Changing util location.

* Incorporating comments. API change. Lint fixes.

* Modifying the GetFixedPointMultiplierShift API as per comments.

* Forgot the dialect change.

* Changing rewrite to qnn_lower.

* Renaming Quantize to Qnn for clarity.

* Remove use_int_domain.

* Incorporating review comments.

* Adding API doc for QNN dialect.

* Move the qnn_lower pass to transform namespace.

* Moving from expr to module. Adding namespace in C++.

* Minor sentence rewrites. Added qnn namespace.

* Added the API doc.

* Changing default out_dtype to int8. Adding a test with in/out_dtype as uint8.

* Style fixes. Better error messages.

* Adding documentation.

* More documentation fixes.

* Adding out dtype check for requantize.

* Adding corner case for FP32 to fixed point conversion.

* Adding extra line.

* Documentation fix.

* Adding static inline.

* Incorporating jackwish comment. Removed idtype from requantize lowering.

* Removing Quantize/Dequantize code. Restricting Requantize to (u)int8/int32.

* Style fixes.

* Fix the docs.

* Move to Legalize API.