
Add extra output for inference #23

Closed
liamsun2019 opened this issue Dec 30, 2021 · 44 comments
Labels
question Further information is requested

Comments

@liamsun2019

Hi Author,

I need to add some extra output tensors that are used only for inference. These tensors are not referenced during training; they are only needed for inference after the conversion to tflite. My intention is to move as many operations as possible into forward so as to reduce the post-processing load that would otherwise have to be implemented in C/C++ code.

For instance, matrix ops such as reshape/sigmoid/multiply are better done on the GPU/NPU than on the CPU.

I added some logic in forward to implement this requirement. The training goes well, but the conversion to tflite fails with the following error message:
File "/usr/local/lib/python3.6/dist-packages/torch/nn/quantized/modules/functional_modules.py", line 160, in mul
r = ops.quantized.mul(x, y, scale=self.scale, zero_point=self.zero_point)
RuntimeError: Mul operands should have same data type.

Is there any feasible way or workaround for this scenario?
The script is attached. Thanks.
movenet_qat.zip

@liamsun2019
Author

source code snippet:

center = ret[head]
center_max = torch.sigmoid(center)
center_max = self.maxpool(center_max)
center_peaks = (center_max == center).float()
center = center * center_peaks
ret['filtered_hm'] = center

where ret['filtered_hm'] is one of the extra outputs. The error message is presumably related to that.

@liamsun2019
Author

If I change the code as follows:
center = ret[head]
center_max = torch.sigmoid(center)
center_max = self.maxpool(center_max)
ret['hm_hmax'] = center_max

Another error comes up:
assert tensor.q_zero_point() == 128, "As for symmetric quantization, "
AssertionError: As for symmetric quantization, the zero point of the u8 tensors should be 128. This could happen if you didn't train the model after QAT preparation.

The script is attached.
movenet_qat.zip

@liamsun2019
Author

The above experiments are based on the recent version.

@liamsun2019
Author

BTW, the following line
center = ret[head]

is better rewritten as:
center = ret[head].clone()

to avoid being overwritten later. It does not affect the experiment results.

@dinghuanghao added the "question (Further information is requested)" label and removed the "help wanted (Extra attention is needed)" label on Dec 30, 2021
@peterjc123
Collaborator

@liamsun2019

I add some logic in forward to implement this requirement and the training goes well but the conversion to tflite fails with following error message:
File "/usr/local/lib/python3.6/dist-packages/torch/nn/quantized/modules/functional_modules.py", line 160, in mul
r = ops.quantized.mul(x, y, scale=self.scale, zero_point=self.zero_point)
RuntimeError: Mul operands should have same data type.

This is because the graph rewriter for quantization doesn't properly handle type casting functions like .float(). At this point, you may rewrite it yourself.

The diff to the model that I made to get it to work is shown below.

317c317,318
<         mul_1 = self.float_functional_simple_13.mul(hm_3, float_1)
---
>         fake_dequant_0 = self.fake_dequant_0(hm_3)
>         mul_1 = fake_dequant_0 * float_1
340,341c341,342
<         fake_dequant_0 = self.fake_dequant_0(hm_3)
<         fake_dequant_1 = self.fake_dequant_1(mul_1)
---
>         # fake_dequant_1 = self.fake_dequant_1(mul_1)
>         fake_dequant_1 = mul_1
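Reading the diff as Python, the rewritten section of the generated forward looks roughly like the sketch below (the names hm_3, float_1, fake_dequant_0, fake_dequant_1 and float_functional_simple_13 come from the generated model file and will differ in your copy):

# before: quantized multiply, which requires both operands to be quantized
#   mul_1 = self.float_functional_simple_13.mul(hm_3, float_1)
# after: dequantize the heatmap first, then multiply in floating point
fake_dequant_0 = self.fake_dequant_0(hm_3)
mul_1 = fake_dequant_0 * float_1

# further down, the extra dequantize of mul_1 is no longer needed,
# because mul_1 is already a float tensor at this point
#   fake_dequant_1 = self.fake_dequant_1(mul_1)
fake_dequant_1 = mul_1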

If I changed the codes like following: center = ret[head] center_max = torch.sigmoid(center) center_max = self.maxpool(center_max) ret['hm_hmax'] = center_max

Another error rises up: assert tensor.q_zero_point() == 128, "As for symmetric quantization, " AssertionError: As for symmetric quantization, the zero point of the u8 tensors should be 128. This could happen if you didn't train the model after QAT preparation. Attached the script. movenet_qat.zip

The error message is legit. For symmetric QAT, you must train the model for at least one iteration or invoke the forward function of the model at least once, otherwise the zero point will remain zero, which leads to this error.

from movenet_qat import MoveNet_qat
model = MoveNet_qat()

dummy_input = torch.ones((1, 3, 256, 256), dtype=torch.float32)

# QAT prep
quantizer = QATQuantizer(model, dummy_input, work_dir='out', config={'rewrite_graph': False, ...})
qat_model = quantizer.quantize()

# Invoke once
qat_model(dummy_input) 

# Conversion goes here
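For reference, a minimal sketch of the conversion step that would follow, based on the converter usage that appears later in this thread (not a prescribed recipe):

with torch.no_grad():
    qat_model.eval()
    qat_model.cpu()
    qat_model = torch.quantization.convert(qat_model)
    torch.backends.quantized.engine = 'qnnpack'
    converter = TFLiteConverter(qat_model, dummy_input, tflite_path='test.tflite', asymmetric=False)
    converter.convert()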

@liamsun2019
Author

Big Thanks. I'll try it out later and let you know the results.

@liamsun2019
Author

One more question: is there a simple way for forward to output different tensors under different conditions? For instance, I need o1 and o2 for training, while o3 and o4 are for inference. At the stage of conversion to tflite, I just want o3 and o4 to be output. Do I have to manually edit the .py file to achieve this?

@liamsun2019
Author

The error message is legit. For symmetric QAT, you must train the model for at least one iteration or invoke the forward function of the model at least once, otherwise the zero point will remain zero, which leads to this error.

==> Based on my experiments, this error only comes up after I add some extra outputs to the forward operation. If I remove these extra outputs, the error disappears. I set the max epoch to 2 to conduct the experiments.

@liamsun2019
Author

My script is attached.
movenet_qat.zip

@liamsun2019
Author

My guess is that since the added outputs do not participate in training, the tracer (just a guess) might not trace them correctly.

@liamsun2019
Author

Any updates?
^_^

@peterjc123
Collaborator

Any updates? ^_^

Sorry for the late reply; we were working on something else.

The error message is legit. For symmetric QAT, you must train the model for at least one iteration or invoke the forward function of the model at least once, otherwise the zero point will remain zero, which leads to this error.

==> Based on my experiments, this error only comes up after I add some extra outputs to the forward operation. If I remove these extra outputs, the error disappears. I set the max epoch to 2 to conduct the experiments.

As can be seen in
https://github.com/pytorch/pytorch/blob/master/torch/ao/quantization/observer.py#L272-L273 and https://github.com/pytorch/pytorch/blob/402f2934bf380964a403d2e139ec529d1f5bac0e/torch/ao/quantization/utils.py#L148-L176, if you don't run inference once, the min/max values of the observers will remain -inf and inf, so the scale and the zero point will be set to 1 and 0 respectively, which leads to the failed asserts in the converter. There's nothing more I can say without the details of your experiment.
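A minimal illustration of that fallback behaviour (a sketch using the observer class linked above; the exact import path may differ between PyTorch versions):

import torch
from torch.ao.quantization.observer import MinMaxObserver

obs = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)
print(obs.calculate_qparams())  # no data observed yet: falls back to scale=1.0, zero_point=0
obs(torch.randn(4, 4))          # observe some data
print(obs.calculate_qparams())  # now a real scale and zero point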

One more question: is there a simple way for forward to output different tensors under different conditions? For instance, I need o1 and o2 for training, while o3 and o4 are for inference. At the stage of conversion to tflite, I just want o3 and o4 to be output. Do I have to manually edit the .py file to achieve this?

This problem is already covered in our FAQ. So the brief answer is yes, because that's how tracing works.
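As a hypothetical illustration of what the manual edit amounts to (the toy module and the names o1..o4 below are placeholders, not taken from the generated movenet file):

import torch
import torch.nn as nn

class ToyHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, 3, padding=1)
        self.maxpool = nn.MaxPool2d(3, stride=1, padding=1)

    def forward(self, x):
        o1 = self.conv(x)                 # training output
        o2 = torch.sigmoid(o1)            # training output
        o3 = self.maxpool(o2)             # inference output
        o4 = (o3 == o2).float() * o2      # inference output
        # return o1, o2                   # the return used while QAT training
        return o3, o4                     # the return used for the tflite conversion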

@liamsun2019
Author

Got it, I will try the approach proposed in the FAQ. Big thanks for your help.

@liamsun2019
Author

I followed the method in the FAQ:

  1. Generate the script for inference. (It looks like quantizer.quantize() forces training mode, so I had to hack my code to generate the script.)

  2. QAT train the model and get the qat_last_model.pth

  3. Based on the script for inference, convert to tflite like the following:

     if __name__ == "__main__":
         qat_model = MoveNet_qat()
         qat_model.load_state_dict(torch.load('qat_last_model.pth'), strict=False)

         dummy_input = torch.ones((1, 3, 256, 256), dtype=torch.float32)
         with torch.no_grad():
             qat_model.eval()
             qat_model.cpu()
             qat_model(dummy_input.to('cpu'))
             torch.backends.quantized.engine = 'qnnpack'
             converter = TFLiteConverter(qat_model, dummy_input, tflite_path="test.tflite", asymmetric=False)
             converter.convert()

The tflite can be generated, but the weights/biases in it have already been converted to float32. I actually need a quantized tflite whose weights/biases are int8/int32. How can I achieve that?

@peterjc123
Collaborator

peterjc123 commented Jan 6, 2022

@liamsun2019 You need to go through the quantizer again even though your model has already been QAT-rewritten (because a QAT-rewritten model is still a float model, not a quantized one).

qat_model = MoveNet_qat()
qat_model.load_state_dict(torch.load('qat_last_model.pth'), strict=False)

dummy_input = torch.ones((1, 3, 256, 256), dtype=torch.float32)
quantizer = QATQuantizer(qat_model, dummy_input, work_dir='out', config={'rewrite_graph': False, ...})
qat_model = quantizer.quantize()

@liamsun2019
Author

if __name__ == "__main__":
    model = MoveNet_qat()
    model.load_state_dict(torch.load('qat_last_model.pth'), strict=False)
    dummy_input = torch.ones((1, 3, 256, 256), dtype=torch.float32)
    quantizer = QATQuantizer(model, dummy_input, work_dir='./', config={'backend': "qnnpack", 'force_overwrite': False, 'asymmetric': False, 'per_tensor': False, 'rewrite_graph': False})
    qat_model = quantizer.quantize()
    qat_model(dummy_input.to('cpu'))

    with torch.no_grad():
        qat_model.eval()
        qat_model.cpu()
        qat_model = torch.quantization.convert(qat_model)
        torch.backends.quantized.engine = 'qnnpack'
        converter = TFLiteConverter(qat_model, dummy_input, tflite_path="ohyeah.tflite", asymmetric=False)
        converter.convert()

The following error comes up:
assert tensor.q_zero_point() == 128, "As for symmetric quantization, "
AssertionError: As for symmetric quantization, the zero point of the u8 tensors should be 128. This could happen if you didn't train the model after QAT preparation.

I do not need to train anymore, but it looks like I have to. Any suggestions?

@peterjc123
Collaborator

@liamsun2019 Would you please share the code of the class MoveNet_qat?

@liamsun2019
Author

Sure, FYR
test.zip

@peterjc123
Collaborator

peterjc123 commented Jan 7, 2022

@liamsun2019 I can reproduce it locally. Looking into it now.

@peterjc123
Collaborator

It seems the problem is with torch.sigmoid. A tensor with a per-tensor symmetric qscheme becomes per-tensor affine after running through this op, so we need to insert a (re)quantize op after it.
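A small demonstration of this behaviour (a sketch; on recent PyTorch versions the quantized sigmoid kernel uses fixed affine output qparams, roughly scale=1/256 and zero_point=0 for quint8, regardless of the symmetric qparams of its input):

import torch

x = torch.randn(1, 4, 8, 8)
qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=128, dtype=torch.quint8)  # symmetric u8: zp=128
qy = torch.sigmoid(qx)
print(qy.q_scale(), qy.q_zero_point())  # affine output qparams, no longer zp=128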

@liamsun2019
Author

A tensor with the qscheme of per tensor and symmetric will become per tensor and affine after running through this op, so we need to insert a (re)quantize op after it.
==> I apply a per-channel symmetric qscheme instead of per-tensor. So does this issue also exist for the per-channel symmetric qscheme?

@peterjc123
Collaborator

peterjc123 commented Jan 7, 2022

Just wondering what your target platform is. Taking NNAPI as an example, it supports these common qschemes:

  1. ANEURALNETWORKS_TENSOR_QUANT8_ASYMM (uint8)
  2. ANEURALNETWORKS_TENSOR_QUANT8_ASYMM_SIGNED (int8)
  3. ANEURALNETWORKS_TENSOR_QUANT8_SYMM (int8 with zero point=0)
  4. ANEURALNETWORKS_TENSOR_QUANT8_SYMM_PER_CHANNEL (int8 with zero point=0)

For ops that support per-channel quantization (e.g. Conv2D), you should use (4) for the weight and (2) or (3) for the input. As for other ops, you use (2) or (3). But I don't think the support for (2) is broad enough; usually there is only support for (3).

@peterjc123
Collaborator

peterjc123 commented Jan 7, 2022

A tensor with a per-tensor symmetric qscheme becomes per-tensor affine after running through this op, so we need to insert a (re)quantize op after it. ==> I apply a per-channel symmetric qscheme instead of per-tensor. So does this issue also exist for the per-channel symmetric qscheme?

The activations always use the per-tensor qscheme. That's why I asked the previous question. If your target platform supports (2), then we can just lift the limitation. But if it only supports (3), then we need to insert requantize nodes during QAT graph rewriting.

@liamsun2019
Author

For per-channel QAT, our target platform supports asymmetric_affine int8 for activations, and the weights support perchannel_symmetric_affine int8.

@peterjc123
Collaborator

For per-channel QAT, our target platform supports asymmetric_affine int8 for activations, and the weights support perchannel_symmetric_affine int8.

OK, we will work on it.

@peterjc123
Collaborator

@liamsun2019 I've uploaded the related changes. You may have to use the following line instead when defining the converter object.

converter = TFLiteConverter(qat_model, dummy_input, tflite_path="ohyeah.tflite", asymmetric=True, quantize_target_type='int8')

@liamsun2019
Author

OK. Will update and try it out.

@liamsun2019
Author

I tried the recent version and can get the outputs for inference. Thanks a lot.

@peterjc123
Collaborator

@liamsun2019 Looks like this issue is resolved. I'll close it. Please feel free to open a new issue when you encounter new problems. Again, thanks for supporting our project.

@peterjc123
Collaborator

@liamsun2019 FYI, we've decoupled asymmetric and per_tensor in the quantizer so you are now free to do asymmetric per-channel quantization. Please read here for more details.

@liamsun2019
Author

My understanding is that you support asymmetric per-channel QAT now, right?

@peterjc123
Collaborator

peterjc123 commented Jan 10, 2022

My understanding is that you support asymmetric per-channel QAT now, right?

Yes. But to be precise, with config={'asymmetric': True, 'per_tensor': False} it is:

  1. OPs that support per-channel: weight is symmetric int8, per-channel
  2. OPs that support per-channel: activation is asymmetric int8, per-tensor
  3. Other OPs: weight is symmetric int8, per-tensor
  4. Other OPs: activation is asymmetric int8, per-tensor

Previously, with config={'asymmetric': False, 'per_tensor': False}, we had:

  1. OPs that support per-channel: weight is symmetric int8, per-channel
  2. OPs that support per-channel: activation is symmetric int8, per-tensor
  3. Other OPs: weight is symmetric int8, per-tensor
  4. Other OPs: activation is symmetric int8, per-tensor

@liamsun2019
Author

I just tried with config:

quantizer = QATQuantizer(model, dummy_input, work_dir='out', config={'backend': "qnnpack", 'force_overwrite': True, 'asymmetric': True, 'per_tensor': False, 'rewrite_graph': True})

and converter:
converter = TFLiteConverter(qat_model, dummy_input, tflite_path='out' + '/qat_model.tflite', asymmetric=True)

The following error is output:
assert tensor.q_zero_point() == asym_s8_offset, "As for asymmetric quantization, "
RuntimeError: Expected quantizer->qscheme() == kPerTensorAffine to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

Based on the above message, it looks like some PyTorch ops do not support asymmetric per-channel. As you mentioned before, all the activations only support per-tensor QAT, so the asymmetric per-channel scheme probably cannot be applied to my case.

@peterjc123
Collaborator

@liamsun2019 Please try the following configuration.

quantizer = QATQuantizer(model, dummy_input, work_dir='out', config={'backend': "qnnpack", 'force_overwrite': True, 'asymmetric': True, 'per_tensor': False, 'rewrite_graph': True})

converter = TFLiteConverter(qat_model, dummy_input, tflite_path='out' + '/qat_model.tflite', asymmetric=True, quantize_target_type='int8')

@liamsun2019
Author

Yes, it works this way, and the resulting QAT tflite seems to always hold weights in the int8 data type.

@peterjc123
Collaborator

@liamsun2019 As far as I know, a model with u8 weights or inputs doesn't support per-channel quantization.

@liamsun2019
Author

Got it. But with the recent version, I encountered the following error when doing symmetric per-channel QAT, which works fine with the old version:

WARNING (tinynn.converter.base) Symmetric quantized model with uint8 is unsupported in most backends of TFLite
Traceback (most recent call last):
File "main.py", line 243, in
main_quant(opt)
File "main.py", line 234, in main_quant
converter.convert()
File "/usr/local/lib/python3.6/dist-packages/TinyNeuralNetwork-0.1.0+5fe672e315215dc3914d264d5f7756d783b5addb-py3.6.egg/tinynn/converter/base.py", line 285, in convert
self.init_operations()
File "/usr/local/lib/python3.6/dist-packages/TinyNeuralNetwork-0.1.0+5fe672e315215dc3914d264d5f7756d783b5addb-py3.6.egg/tinynn/converter/base.py", line 250, in init_operations
converter.parse(node, attrs, args, self.common_graph)
File "/usr/local/lib/python3.6/dist-packages/TinyNeuralNetwork-0.1.0+5fe672e315215dc3914d264d5f7756d783b5addb-py3.6.egg/tinynn/converter/operators/torch/quantized.py", line 116, in parse
self.parse_common(graph_converter)
File "/usr/local/lib/python3.6/dist-packages/TinyNeuralNetwork-0.1.0+5fe672e315215dc3914d264d5f7756d783b5addb-py3.6.egg/tinynn/converter/operators/torch/quantized.py", line 74, in parse_common
weight_tensor = self.create_attr_tensor(weight)
File "/usr/local/lib/python3.6/dist-packages/TinyNeuralNetwork-0.1.0+5fe672e315215dc3914d264d5f7756d783b5addb-py3.6.egg/tinynn/converter/operators/torch/base.py", line 198, in create_attr_tensor
return tfl.Tensor(tensor, name, has_buffer=True, asymmetric=self.asymmetric, q_type=self.q_type)
File "/usr/local/lib/python3.6/dist-packages/TinyNeuralNetwork-0.1.0+5fe672e315215dc3914d264d5f7756d783b5addb-py3.6.egg/tinynn/converter/operators/tflite/base.py", line 198, in init
asym_s8_offset = tensor.q_zero_point()
RuntimeError: Expected quantizer->qscheme() == kPerTensorAffine to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

I wonder if something has regressed in the latest code. My config is:

quantizer = QATQuantizer(model, dummy_input, work_dir='out', config={'backend': "qnnpack", 'force_overwrite': True, 'asymmetric': False, 'per_tensor': False, 'rewrite_graph': True})

converter = TFLiteConverter(qat_model, dummy_input, tflite_path='out' + '/qat_model.tflite', asymmetric=False)

I wonder whether the usage has changed. The above config works well with the old version.

@liamsun2019
Author

One thing to point out: I used exactly the same outputs for training and inference in the above experiment.

@liamsun2019
Author

Looks like I need to set quantize_target_type='int8' explicitly, which was not required before.

@peterjc123
Collaborator

peterjc123 commented Jan 11, 2022

Looks like I need to set quantize_target_type='int8' explicitly, which was not required before.

Yes, the changes in 9b656ce are not backward compatible. You now have to set it explicitly when you use per-channel quantization.

@liamsun2019
Author

Thanks. Another question that may not be related to this issue: I notice that 'Dequantize' nodes exist in some QAT tflite models, while the tflite model converted via tinynn does not have them.

[screenshots: graph of the first model and graph of the second model]

I just wonder how this can happen?

@peterjc123
Collaborator

@liamsun2019 Just curious how you got the model? It seems that you used onnx2tf and then the TFLiteConverter from official TF to get the first model.
As for the first model, it was converted via dynamic range quantization, which tries to quantize all the weights and biases. But since Conv2D is not an op that supports this kind of inference (internally these are called hybrid kernels), the weights will be converted back to floating point, which is why you see the Dequantize nodes in the graph. So it only reduces the size of the model.
The second model is converted via quantization-aware training. As you can see, the weights and biases are quantized, so they actually go through the quantized kernels, and you are likely to see a speedup in model inference.
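For context, dynamic range (post-training) quantization with the official TF converter typically looks like the following sketch, where 'saved_model_dir' is a placeholder path:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight-only / dynamic range quantization
tflite_model = converter.convert()

with open('dynamic_range.tflite', 'wb') as f:
    f.write(tflite_model)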

@liamsun2019
Author

Big thanks for your detailed explanation. I made a mistake when introducing the two graphs. In fact, the first graph comes from a tflite model that is not in QAT representation, but I mistakenly assumed it was QAT.
