
[WIP] [Quantization] Quantization in TVM #7474

Closed

Conversation

electriclilies (Contributor)

This PR introduces a new framework for quantization to TVM. For details and explanation of the code structure, please see https://discuss.tvm.apache.org/t/rfc-quantization-quantization-in-tvm/9161.

Please let me know if you have any questions or comments.

Also, I'd like to thank @mbrookhart and @jwfromm for their mentorship on this project!


  TVM_DECLARE_ATTRS(DequantizeAttrs, "relay.attrs.DequantizeAttrs") {
    TVM_ATTR_FIELD(axis)
        .describe(
            "The channel axis for channel wise dequantization. Default value is -1,"
            "which corresponds to the last axis.")
        .set_default(-1);
    TVM_ATTR_FIELD(out_dtype)
        .describe(
            "The datatype we are dequantizing to (float32 or int32). Defaults to float32.")
Member

The output of dequantize is always float, so this change doesn't make sense to me.

Contributor Author

The reason I did this is that nn.conv2d sometimes has int32 as an output dtype. Since we are replacing nn.conv2d with a pattern whose final op is dequantize, dequantize also needs to be able to output int32. If I don't introduce an out_dtype, the QNN graph sometimes won't pass the type checker, because the consumer of dequantize expects an int32 while dequantize can only produce float32.

Here's where I use out_dtype='int32': https://github.com/apache/tvm/blob/49801e85ff6244b1d0567965fec18544ce51dd70/python/tvm/relay/transform/quantize/_quantizer_patterns.py#L325-#L328

Contributor

Introducing a new quantization-related operator like simulated_quantize might be better. This op could take any input dtype and any output dtype, and you could handle all the cases internally. You could use this op for calibration, and once you have figured out good scales and zero points, replace it with its QNN counterpart depending on the input and output dtypes.
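A rough NumPy sketch of that idea (the names here are hypothetical, not an existing TVM op): a simulated quantize works entirely in float32, so it can stand in for any input/output dtype during calibration and be replaced by the real QNN ops once scales and zero points are chosen.

import numpy as np

def simulated_quantize(data, scale, zero_point, out_dtype="int8"):
    # Quantize in float space: scale, round, shift by the zero point, and clamp
    # to the range of out_dtype, but keep the result as float32 so no dtype is
    # committed yet.
    info = np.iinfo(np.dtype(out_dtype))
    q = np.round(data / scale) + zero_point
    return np.clip(q, info.min, info.max).astype("float32")

def simulated_dequantize(qdata, scale, zero_point):
    # Inverse mapping, also in float space.
    return ((qdata - zero_point) * scale).astype("float32")

x = np.random.randn(4).astype("float32")
print(simulated_dequantize(simulated_quantize(x, 0.05, 0), 0.05, 0))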

def quantize_args(self):
    """Helper to quantize the arguments to the qnn.conv2d."""
    quantized_data = relay.qnn.op.quantize(
        self.args[0], self.scale_zps[0], self.scale_zps[1], axis=self.data_channel_axis
Member

You should be explicit about what data type you are quantizing to. Rewritten this way, you are implicitly assuming that we are always doing symmetric quantization. Asymmetric quantization should also be supported.

Contributor Author

Thanks for pointing this out. I'll add a dtype parameter to QuantizerPattern so we can pass in uint8 as well as int8.
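For reference, a small NumPy sketch of the distinction (illustrative only, not TVM API): symmetric int8 quantization pins the zero point at 0, while asymmetric uint8 quantization derives a non-zero zero point from the observed range.

import numpy as np

def symmetric_int8_params(x):
    # Zero point fixed at 0; scale chosen from the maximum absolute value.
    scale = float(np.abs(x).max()) / 127.0
    return scale, 0

def asymmetric_uint8_params(x):
    # Zero point shifts the range so min(x) maps to 0 and max(x) maps to 255.
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0
    zero_point = int(round(-lo / scale))
    return scale, zero_point

x = np.random.randn(1000).astype("float32")
print(symmetric_int8_params(x), asymmetric_uint8_params(x))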

(0, 0),
(0, 0),
)
pad_val = 0
Member

Again, this assumes that the zero point is always 0, so you are only supporting symmetric quantization.

    )
    dequantized_rhs = relay.qnn.op.dequantize(
        quantized_rhs, rhs_scale, relay.const(0, dtype="int32"), out_dtype=out_dtype
    )
Member

I have no idea what's going on here

Contributor Author

I will add a comment here explaining what is going on. The main reason I do it this way is that the qnn.add op implicitly quantizes the args, uses requantize inside it, and generally does some weird stuff.

We want to quantize lhs and rhs to lhs_scale and rhs_scale, respectively, and then requantize to lhs_scale + rhs_scale before adding and dequantizing. I have simply broken the requantize into dequantize -> quantize.
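A NumPy sketch of that decomposition (illustrative only, not the pattern code): requantizing from one scale to another is the same as dequantizing with the first scale and re-quantizing with the second, which is all the rewrite described above relies on.

import numpy as np

def quantize(x, scale, zp=0, dtype="int8"):
    info = np.iinfo(np.dtype(dtype))
    return np.clip(np.round(x / scale) + zp, info.min, info.max).astype(dtype)

def dequantize(q, scale, zp=0):
    return (q.astype("float32") - zp) * scale

lhs_scale, rhs_scale = 0.02, 0.03
out_scale = lhs_scale + rhs_scale  # the scale both add inputs are brought to, per the comment above

lhs_q = quantize(np.random.randn(8).astype("float32"), lhs_scale)

# requantize(lhs_scale -> out_scale), written as the dequantize -> quantize pair described above
lhs_requantized = quantize(dequantize(lhs_q, lhs_scale), out_scale)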

| is_op("copy")
| is_op("mean")
| is_op("sqrt")
)
@masahi (Member) Feb 19, 2021

This is too ad hoc: it can easily break, and it leaves more dequantize/quantize pairs than necessary. The patterns are also not correct.

Contributor Author

Can you provide more feedback about how it can break and which quantizes and dequantizes it leaves behind? On all the graphs I've tried so far, I've been able to remove all the extraneous quantizes and dequantizes.

            transformed_data = self.op_map[call.op](transformed_data, **call.attrs)
        else:
            raise ValueError(
                "Uh oh, %s is not copied properly in the requantizer. " % str(call.op)
Member

Even though you are using a dominator pattern, rewritten this way it only supports a linear-path graph. It breaks for diamond structures, e.g. if the output of dequantize is consumed by multiple nodes.

Contributor Author

The allow_overlapping_patterns option lets me match dequantizes which are consumed by two quantizes.


class ConsolidateRequantizeandQuantize(DFPatternCallback):
    """Gets rid of unnecessary requantizes directly following a quantize. Takes
    quantize(scale_a, zp_a) -> requantize(scale_a, zp_a, scale_b, zp_b) to
Member

I don't see how this pattern could arise in practice. Do you have an example?

Contributor Author

A good example is actually the result of the add pattern.

quantize -> dequantize -> quantize -> 
quantize -> dequantize -> quantize -> add -> dequantize

becomes

quantize -> requantize -> 
quantize -> requantize -> add -> dequantize

Then we need to combine the quantize and requantize into one quantize:

quantize -> 
quantize -> add -> dequantize


class RequantizeChainCallback(DFPatternCallback):
    """Folds chains of requantizes into one requantize.
    requantize(scale_a, zp_a, scale_b, zp_b) -> requantize(scale_b, zp_b, scale_c, zp_c) becomes
Member

I don't see how this pattern could arise in practice. Do you have an example?

Contributor Author

I have definitely seen it, which is why I put it in, but I'll have to find the example again. Let me get back to you about this.

@@ -105,6 +115,10 @@ Expr DequantizeLower(const Expr& input_tensor, const Expr& input_scale,

  auto shift = Subtract(Cast(input_tensor, DataType::Int(32)), expanded_input_zero_point);
  auto scaled_output = Multiply(Cast(shift, DataType::Float(32)), expanded_input_scale);

  if (out_dtype != DataType::Float(32)) {
    scaled_output = Cast(scaled_output, out_dtype);
Member

Casting the output of float multiply to int is wrong.

Contributor Author

What would be correct?

@@ -46,19 +47,22 @@ bool QuantizeRel(const Array<Type>& types, int num_inputs, const Attrs& attrs,
  }

  const auto input_dtype = data->dtype;
  ICHECK(input_dtype == DataType::Float(32))
      << "Input type should be one of float32 but was " << input_dtype;
  ICHECK(input_dtype == DataType::Float(32) || input_dtype == DataType::Int(32))
Member

I have no idea why the input to quantize could be int32.

Contributor Author

When quantizing nn.conv2d -> nn.bias_add, we quantize the bias weight to int32. This is what other quantization frameworks do, including pytorch (see https://stackoverflow.com/questions/63132181/how-does-bias-work-in-pytorch-quantized-convolution).

See this snippet from https://github.com/apache/tvm/blob/49801e85ff6244b1d0567965fec18544ce51dd70/python/tvm/relay/transform/quantize/_quantizer_patterns.py#L352-#L355

quantized_bias = relay.qnn.op.quantize(
    self.args[2], self.scale_zps[0], self.scale_zps[1], axis=0, out_dtype="int32"
)
self.quantized_args.append(quantized_bias)
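For reference, a small NumPy sketch of the convention being cited (PyTorch-style bias handling; not this PR's code): the float bias is quantized to int32 with a scale equal to the product of the input and weight scales, so it can be added directly to the int8 x int8 -> int32 convolution accumulator.

import numpy as np

data_scale, weight_scale = 0.02, 0.005
bias_fp32 = np.random.randn(16).astype("float32")

# Bias scale is data_scale * weight_scale with zero point 0, and the result is
# stored as int32 to match the conv2d accumulator dtype.
bias_scale = data_scale * weight_scale
bias_int32 = np.round(bias_fp32 / bias_scale).astype("int32")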

@masahi (Member) Feb 23, 2021

Sure but it is the output that needs to be int32. Here, you are allowing inputs to be int32. Bias is always float. Am I missing something?


def initialize_scale_zp_map(self):
    """Initializes scales to 1 and zero points to zero. These values will only be used
    to calculate values in the tuple subgraph that are not returned to the user."""
Member

We should be careful with how we initialize the params here. A scale of 1 doesn't make sense, since it would essentially clamp the entire floating point range to [-128, 127]. So the outputs from the first run will likely be garbage.

Does the choice of initialization affect the accuracy of the final model? If so, we should use more sensible values by default.

Contributor Author

These initial values aren't exposed; right now, you have to pass scale and zero point values into get_quantized_layer_inputs. 1 is just a placeholder for scales in the parts of the graph that we haven't called a calibration callback for yet.
However, in case someone forgets to set a scale or zero point, maybe I should pick a better default value. I actually think that a lower default scale might be better, since most ML activation values are very small (maybe something like 0.05?). In my experience that scale at least produces non-zero outputs of the same or similar magnitude as the non-quantized graph.
Also, as a quick note, @jwfromm (as well as you) suggested changing the API a little bit so that you don't have to pass scale and zero point into get_quantized_layer_inputs, so I'm going to do that. It makes this problem less pressing and also lets people directly use the values of the quantized data to calculate scales.
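A rough sketch of that kind of placeholder initialization (hypothetical function and names, not the PR's implementation; 0.05 is the default scale floated above):

def initialize_scale_zp_map(scale_zp_names, default_scale=0.05, default_zp=0):
    # Placeholder parameters for scale/zero-point variables that have not been
    # calibrated yet: a small default scale keeps uncalibrated outputs at a
    # magnitude closer to the fp32 graph than a scale of 1 would.
    params = {}
    for scale_name, zp_name in scale_zp_names:
        params[scale_name] = default_scale
        params[zp_name] = default_zp
    return params

print(initialize_scale_zp_map([("conv0_scale", "conv0_zp")]))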

@masahi (Member) commented Feb 19, 2021

@electriclilies Can you add an end-to-end runnable example, like importing a PyTorch or ONNX graph and quantizing it?

version of the expression.
"""
pattern_ffi.rewrite(
[_DFPatternCallback(self.pattern, self.extract_attrs, self.require_type)],
Member

Are the methods in this class tested? I don't find definitions of self.pattern, self.scale, etc. in this class or its derived classes.

Contributor Author

Yes, they are. This class is only meant to be used via multiple inheritance with a QuantizerPattern, which is why it looks a little weird. When I test AverageMaxPerChannelConv2dPattern and AverageMaxPerChannelDensePattern, all of these methods are exercised.
I will make PerChannelPattern and AverageMaxPerChannelPattern abstract to make it clear that they should not be instantiated.

"""Helper to quantize the arguments to the qnn.conv2d."""
quantized_data = relay.qnn.op.quantize(
self.args[0], self.scale_zps[0], self.scale_zps[1], axis=self.data_channel_axis
)
Member

If you quantize this way, I think an argument that is consumed by multiple nodes will get quantized multiple times. I don't think CSE would help since you are using separate scale and zp vars. You should cache the quantized values and avoid quantizing the same node twice.
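One way that caching could look (a hypothetical sketch, not code from this PR): memoize the quantize call per argument so an expression consumed by several patterns reuses a single quantize node and a single scale/zero-point pair.

from tvm import relay

class QuantizeCache:
    """Hypothetical helper: reuse one quantize node (and its scale/zp variables)
    per distinct argument, so the same expression is never quantized twice with
    independent parameters."""

    def __init__(self):
        self._cache = {}

    def quantize(self, arg, name_hint, axis=-1, dtype="int8"):
        key = (arg, dtype)
        if key not in self._cache:
            scale = relay.var(name_hint + "_scale", shape=(), dtype="float32")
            zp = relay.var(name_hint + "_zp", shape=(), dtype="int32")
            self._cache[key] = relay.qnn.op.quantize(
                arg, scale, zp, axis=axis, out_dtype=dtype
            )
        return self._cache[key]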

Contributor Author

This is an interesting suggestion. It's important to introduce a new quantize node for every input so that we can correctly requantize later. Instead of caching the whole quantize node, I could cache the scale and zero point from the previous time the node was quantized. However, I'm not sure how we would deal with this during calibration, since we are calibrating on a per-pattern basis. Either the second value set for that scale and zero point would be the actual value, or we could try to remove the scale and zero point from the second pattern. But removing a scale and zero point from a pattern would introduce inconsistency and complexity to the callbacks that I think is not good.

CSE does work for this when we have integer constants as scales and zero points; however, it might not work for floats. There are only two cases where this really becomes an issue, though: (1) a weight gets quantized multiple times, and (2) we end up with multiple requantize ops that requantize the same node to slightly different values. I think dealing with this in post-processing (i.e., the requantizer) is the best decision. Then we can combine scales and zero points through some rule, perhaps by averaging them.

from . import _ffi as ffi


class Quantizer:
@masahi (Member) Feb 19, 2021

I think this class is redundant, since all it does is some setup in the constructor before the object is immediately passed to QuantizationCalibrator. It would be better to do the same initialization directly in the QuantizationCalibrator constructor.
And I'd probably rename QuantizationCalibrator to Quantizer.

Contributor Author

There is definitely an argument to combine the Quantizer and the QuantizationCalibrator into one class, since they are pretty closely linked.

@jwfromm, @mbrookhart and I went back and forth about this. Initially, we made them separate because they were both large classes and did logically different things, and combining them would have resulted in a lot of code in one monolithic class. However, now that I have the QuantizerPattern class and the CalibrationInfo class, which contain a lot of the code, it could make sense to combine them into one class.

One argument to keep them separate is that the Quantizer does a lot of work through the C++ backend, and keeping that separate from the Calibrator is kind of nice for readability.

@jwfromm @mbrookhart What do you think?

electriclilies marked this pull request as draft on February 22, 2021 17:03
electriclilies changed the title from "Quantization in TVM" to "[WIP] [Quantization] Quantization in TVM" on Feb 24, 2021