
[WIP] [Quantization] Quantization in TVM #7474

Closed

Conversation

electriclilies (Contributor)

This PR introduces a new framework for quantization to TVM. For details and explanation of the code structure, please see https://discuss.tvm.apache.org/t/rfc-quantization-quantization-in-tvm/9161.

Please let me know if you have any questions or comments.

Also, I'd like to thank @mbrookhart and @jwfromm for their mentorship on this project!


  TVM_DECLARE_ATTRS(DequantizeAttrs, "relay.attrs.DequantizeAttrs") {
    TVM_ATTR_FIELD(axis)
        .describe(
            "The channel axis for channel wise dequantization. Default value is -1,"
            "which corresponds to the last axis.")
        .set_default(-1);
    TVM_ATTR_FIELD(out_dtype)
        .describe(
            "The datatype we are dequantizing to (float32 or int32). Defaults to float32.")
Member

The output of dequantize is always float, so this change doesn't make sense to me.

Contributor Author

The reason I did this is that nn.conv2d sometimes has int32 as an output dtype. Since we are replacing nn.conv2d with a pattern whose final op is dequantize, dequantize also needs to be able to output int32. If I don't introduce an out_dtype, the QNN graph sometimes won't pass the type checker, because the consumer of dequantize expects an int32 while dequantize can only produce float32.

Here's where I use out_dtype='int32': https://github.com/apache/tvm/blob/49801e85ff6244b1d0567965fec18544ce51dd70/python/tvm/relay/transform/quantize/_quantizer_patterns.py#L325-#L328

Contributor

Introducing a new quantization-related operator like simulated_quantize might be better. This op could take any input dtype and any output dtype, and you could handle all the cases internally. You could use this op for calibration, and once you have figured out good scales and zero points, replace it with its QNN counterpart depending on the input and output dtypes.
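A rough NumPy sketch of that idea (the names here are hypothetical, not an existing TVM op): a simulated quantize works entirely in float32, so it can stand in for any input/output dtype during calibration and be replaced by the real QNN ops once scales and zero points are chosen.

import numpy as np

def simulated_quantize(data, scale, zero_point, out_dtype="int8"):
    # Quantize in float space: scale, round, shift by the zero point, and clamp
    # to the range of out_dtype, but keep the result as float32 so no dtype is
    # committed yet.
    info = np.iinfo(np.dtype(out_dtype))
    q = np.round(data / scale) + zero_point
    return np.clip(q, info.min, info.max).astype("float32")

def simulated_dequantize(qdata, scale, zero_point):
    # Inverse mapping, also in float space.
    return ((qdata - zero_point) * scale).astype("float32")

x = np.random.randn(4).astype("float32")
print(simulated_dequantize(simulated_quantize(x, 0.05, 0), 0.05, 0))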

def quantize_args(self):
    """Helper to quantize the arguments to the qnn.conv2d."""
    quantized_data = relay.qnn.op.quantize(
        self.args[0], self.scale_zps[0], self.scale_zps[1], axis=self.data_channel_axis
Member

You should be explicit about what data type you are quantizing to. Rewritten this way, you are implicitly assuming that we are always doing symmetric quantization. Asymmetric quantization should also be supported.

Contributor Author

Thanks for pointing this out. I'll add a dtype parameter to QuantizerPattern so we can pass in uint8 as well as int8.
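For reference, a small NumPy sketch of the distinction (illustrative only, not TVM API): symmetric int8 quantization pins the zero point at 0, while asymmetric uint8 quantization derives a non-zero zero point from the observed range.

import numpy as np

def symmetric_int8_params(x):
    # Zero point fixed at 0; scale chosen from the maximum absolute value.
    scale = float(np.abs(x).max()) / 127.0
    return scale, 0

def asymmetric_uint8_params(x):
    # Zero point shifts the range so min(x) maps to 0 and max(x) maps to 255.
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0
    zero_point = int(round(-lo / scale))
    return scale, zero_point

x = np.random.randn(1000).astype("float32")
print(symmetric_int8_params(x), asymmetric_uint8_params(x))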

(0, 0),
(0, 0),
)
pad_val = 0
Member

Again, this assumes that the zero point is always 0, so you are only supporting symmetric quantization.

    )
    dequantized_rhs = relay.qnn.op.dequantize(
        quantized_rhs, rhs_scale, relay.const(0, dtype="int32"), out_dtype=out_dtype
    )
Member

I have no idea what's going on here

Contributor Author

I will add a comment here explaining what is going on. The main reason I do it this way is that the qnn.add op implicitly quantizes the args, uses requantize inside it, and generally does some weird stuff.

We want to quantize lhs and rhs to lhs_scale and rhs_scale, respectively, and then requantize to lhs_scale + rhs_scale before adding and dequantizing. I have simply broken the requantize into dequantize -> quantize.
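A NumPy sketch of that decomposition (illustrative only, not the pattern code): requantizing from one scale to another is the same as dequantizing with the first scale and re-quantizing with the second, which is all the rewrite described above relies on.

import numpy as np

def quantize(x, scale, zp=0, dtype="int8"):
    info = np.iinfo(np.dtype(dtype))
    return np.clip(np.round(x / scale) + zp, info.min, info.max).astype(dtype)

def dequantize(q, scale, zp=0):
    return (q.astype("float32") - zp) * scale

lhs_scale, rhs_scale = 0.02, 0.03
out_scale = lhs_scale + rhs_scale  # the scale both add inputs are brought to, per the comment above

lhs_q = quantize(np.random.randn(8).astype("float32"), lhs_scale)

# requantize(lhs_scale -> out_scale), written as the dequantize -> quantize pair described above
lhs_requantized = quantize(dequantize(lhs_q, lhs_scale), out_scale)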

| is_op("copy")
| is_op("mean")
| is_op("sqrt")
)
@masahi (Member) Feb 19, 2021

This is too ad hoc: it can easily break, and it leaves more dequantize/quantize pairs than necessary. The patterns are also not correct.

Contributor Author

Can you provide more feedback about how it can break and which quantizes and dequantizes it leaves behind? On all the graphs I've tried so far, I've been able to remove all the extraneous quantizes and dequantizes.

            transformed_data = self.op_map[call.op](transformed_data, **call.attrs)
        else:
            raise ValueError(
                "Uh oh, %s is not copied properly in the requantizer. " % str(call.op)
Member

Even though you are using a dominator pattern, rewritten this way it only supports a linear-path graph. It breaks for diamond structures, e.g. if the output of dequantize is consumed by multiple nodes.

Contributor Author

The allow_overlapping_patterns option lets me match dequantizes which are consumed by two quantizes.


class ConsolidateRequantizeandQuantize(DFPatternCallback):
    """Gets rid of unnecessary requantizes directly following a quantize. Takes
    quantize(scale_a, zp_a) -> requantize(scale_a, zp_a, scale_b, zp_b) to
Member

I don't see how this pattern could arise in practice. Do you have an example?

Contributor Author

A good example is actually the result of the add pattern.

quantize -> dequantize -> quantize -> 
quantize -> dequantize -> quantize -> add -> dequantize

becomes

quantize -> requantize -> 
quantize -> requantize -> add -> dequantize

Then we need to combine the quantize and requantize into one quantize:

quantize -> 
quantize -> add -> dequantize


class RequantizeChainCallback(DFPatternCallback):
    """Folds chains of requantizes into one requantize.
    requantize(scale_a, zp_a, scale_b, zp_b) -> requantize(scale_b, zp_b, scale_c, zp_c) becomes
Member

I don't see how this pattern could arise in practice. Do you have an example?

Contributor Author

I have definitely seen it, which is why I put it in, but I'll have to find the example again. Let me get back to you about this.

@@ -105,6 +115,10 @@ Expr DequantizeLower(const Expr& input_tensor, const Expr& input_scale,

  auto shift = Subtract(Cast(input_tensor, DataType::Int(32)), expanded_input_zero_point);
  auto scaled_output = Multiply(Cast(shift, DataType::Float(32)), expanded_input_scale);

  if (out_dtype != DataType::Float(32)) {
    scaled_output = Cast(scaled_output, out_dtype);
Member

Casting the output of float multiply to int is wrong.

Contributor Author

What would be correct?

@@ -46,19 +47,22 @@ bool QuantizeRel(const Array<Type>& types, int num_inputs, const Attrs& attrs,
  }

  const auto input_dtype = data->dtype;
  ICHECK(input_dtype == DataType::Float(32))
      << "Input type should be one of float32 but was " << input_dtype;
  ICHECK(input_dtype == DataType::Float(32) || input_dtype == DataType::Int(32))
Member

I have no idea why the input to quantize could be int32.

Contributor Author

When quantizing nn.conv2d -> nn.bias_add, we quantize the bias weight to int32. This is what other quantization frameworks do, including pytorch (see https://stackoverflow.com/questions/63132181/how-does-bias-work-in-pytorch-quantized-convolution).

See this snippet from https://github.com/apache/tvm/blob/49801e85ff6244b1d0567965fec18544ce51dd70/python/tvm/relay/transform/quantize/_quantizer_patterns.py#L352-#L355

quantized_bias = relay.qnn.op.quantize(
    self.args[2], self.scale_zps[0], self.scale_zps[1], axis=0, out_dtype="int32"
)
self.quantized_args.append(quantized_bias)
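For reference, a small NumPy sketch of the convention being cited (PyTorch-style bias handling; not this PR's code): the float bias is quantized to int32 with a scale equal to the product of the input and weight scales, so it can be added directly to the int8 x int8 -> int32 convolution accumulator.

import numpy as np

data_scale, weight_scale = 0.02, 0.005
bias_fp32 = np.random.randn(16).astype("float32")

# Bias scale is data_scale * weight_scale with zero point 0, and the result is
# stored as int32 to match the conv2d accumulator dtype.
bias_scale = data_scale * weight_scale
bias_int32 = np.round(bias_fp32 / bias_scale).astype("int32")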

@masahi (Member) Feb 23, 2021

Sure but it is the output that needs to be int32. Here, you are allowing inputs to be int32. Bias is always float. Am I missing something?


def initialize_scale_zp_map(self):
    """Initializes scales to 1 and zero points to zero. These values will only be used
    to calculate values in the tuple subgraph that are not returned to the user."""
Member

We should be careful with how we initialize the params here. A scale of 1 doesn't make sense, since it would essentially clamp the entire floating point range to [-128, 127]. So the outputs from the first run will likely be garbage.

Does the choice of initialization affect the accuracy of the final model? If so, we should use more sensible values by default.

Contributor Author

These initial values aren't exposed; right now, you have to pass scale and zero point values into get_quantized_layer_inputs. 1 is just a placeholder for scales in the parts of the graph that we haven't called a calibration callback for yet.
However, in case someone forgets to set a scale or zero point, maybe I should pick a better default value. I actually think that a lower default scale might be better, since most ML activation values are very small (maybe something like 0.05?). In my experience that scale at least produces non-zero outputs of the same or similar magnitude as the non-quantized graph.
Also, as a quick note, @jwfromm (as well as you) suggested changing the API a little bit so that you don't have to pass scale and zero point into get_quantized_layer_inputs, so I'm going to do that. It makes this problem less pressing and also lets people directly use the values of the quantized data to calculate scales.
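A rough sketch of that kind of placeholder initialization (hypothetical function and names, not the PR's implementation; 0.05 is the default scale floated above):

def initialize_scale_zp_map(scale_zp_names, default_scale=0.05, default_zp=0):
    # Placeholder parameters for scale/zero-point variables that have not been
    # calibrated yet: a small default scale keeps uncalibrated outputs at a
    # magnitude closer to the fp32 graph than a scale of 1 would.
    params = {}
    for scale_name, zp_name in scale_zp_names:
        params[scale_name] = default_scale
        params[zp_name] = default_zp
    return params

print(initialize_scale_zp_map([("conv0_scale", "conv0_zp")]))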

@masahi (Member) commented Feb 19, 2021

@electriclilies Can you add an end-to-end runnable example, like importing a PyTorch or ONNX graph and quantizing it?

version of the expression.
"""
pattern_ffi.rewrite(
[_DFPatternCallback(self.pattern, self.extract_attrs, self.require_type)],
Member

Are the methods in this class tested? I don't find definitions of self.pattern, self.scale, etc. in this class or its derived classes.

Contributor Author

Yes, they are. This class is only meant to be used via multiple inheritance with a QuantizerPattern, which is why it looks a little weird. When I test AverageMaxPerChannelConv2dPattern and AverageMaxPerChannelDensePattern, all of these methods are exercised.
I will make PerChannelPattern and AverageMaxPerChannelPattern abstract to make it clear that they should not be instantiated.

"""Helper to quantize the arguments to the qnn.conv2d."""
quantized_data = relay.qnn.op.quantize(
self.args[0], self.scale_zps[0], self.scale_zps[1], axis=self.data_channel_axis
)
Member

If you quantize this way, I think an argument that is consumed by multiple nodes will get quantized multiple times. I don't think CSE would help since you are using separate scale and zp vars. You should cache the quantized values and avoid quantizing the same node twice.
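One way that caching could look (a hypothetical sketch, not code from this PR): memoize the quantize call per argument so an expression consumed by several patterns reuses a single quantize node and a single scale/zero-point pair.

from tvm import relay

class QuantizeCache:
    """Hypothetical helper: reuse one quantize node (and its scale/zp variables)
    per distinct argument, so the same expression is never quantized twice with
    independent parameters."""

    def __init__(self):
        self._cache = {}

    def quantize(self, arg, name_hint, axis=-1, dtype="int8"):
        key = (arg, dtype)
        if key not in self._cache:
            scale = relay.var(name_hint + "_scale", shape=(), dtype="float32")
            zp = relay.var(name_hint + "_zp", shape=(), dtype="int32")
            self._cache[key] = relay.qnn.op.quantize(
                arg, scale, zp, axis=axis, out_dtype=dtype
            )
        return self._cache[key]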

Contributor Author

This is an interesting suggestion. It's important to introduce a new quantize node for every input so that we can correctly requantize later. Instead of caching the whole quantize node, I could cache the scale and zero point from the previous time the node was quantized. However, I'm not sure how we would deal with this during calibration, since we are calibrating on a per-pattern basis. Either the second value set for that scale and zero point would be the actual value, or we could try to remove the scale and zero point from the second pattern. But removing a scale and zero point from a pattern would introduce inconsistency and complexity to the callbacks that I think is not good.

CSE does work for this when we have integer constants as scales and zero points; however, it might not work for floats. There are only two cases where this really becomes an issue, though: (1) a weight gets quantized multiple times, and (2) we end up with multiple requantize ops that requantize the same node to slightly different values. I think dealing with this in post-processing (i.e., the requantizer) is the best decision. Then we can combine scales and zero points through some rule, perhaps by averaging them.

from . import _ffi as ffi


class Quantizer:
@masahi (Member) Feb 19, 2021

I think this class is redundant, since all it does is some setup in the constructor before the object is immediately passed to QuantizationCalibrator. It would be better to do the same initialization directly in the QuantizationCalibrator constructor.
And I'd probably rename QuantizationCalibrator to Quantizer.

Contributor Author

There is definitely an argument to combine the Quantizer and the QuantizationCalibrator into one class, since they are pretty closely linked.

@jwfromm, @mbrookhart and I went back and forth about this. Initially, we made them separate because they were both large classes and did logically different things, and combining them would have resulted in a lot of code in one monolithic class. However, now that I have the QuantizerPattern class and the CalibrationInfo class, which contain a lot of the code, it could make sense to combine them into one class.

One argument to keep them separate is that the Quantizer does a lot of work through the C++ backend, and keeping that separate from the Calibrator is kind of nice for readability.

@jwfromm @mbrookhart What do you think?

electriclilies marked this pull request as draft on February 22, 2021 17:03
electriclilies changed the title from "Quantization in TVM" to "[WIP] [Quantization] Quantization in TVM" on Feb 24, 2021