AssertionError while annotating deeplabv3 model from pytorch #33
@abdulazizm This line checks whether the dimensions that are not concatenated over have the same value for all inputs. Could you share the Relay representation of the model (mod['main'])?
@jtuyls Thanks for the reply. Here are the requested details. Please let me know if anything else is needed.
%0 = nn.conv2d(%data, %model.backbone.conv1.weight, strides=[2, 2], padding=[3, 3, 3, 3], channels=64, kernel_size=[7, 7]) /* ty=Tensor[(1, 64, 112, 112), float32] */;
Recopied - same as the previous one (just with proper alignment/indentation)
@abdulazizm Thanks for posting the Relay expression. The issue seems to be in this concatenate layer:
To log debug info, add this at the top of your script:
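(The snippet itself was lost in the page capture; a minimal sketch of a standard Python logging setup, assuming PyXIR logs through the stdlib logging module:)

```python
import logging

# Route PyXIR's log records to the console at INFO level
logging.basicConfig()
logging.getLogger().setLevel(logging.INFO)
```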
@jtuyls Hope you are getting closer to solving this issue. I had enabled logging earlier too. Attaching the logs without printing mod['main'], just for your reference. Let me know if you are worried about my model tracing.
@abdulazizm Could you do the same, but with the DEBUG flag? The INFO flag is not showing enough detail to debug this part of the codebase. My bad for asking you to use the INFO flag earlier.
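(With the setup sketched above, switching verbosity is a one-line change:)

```python
import logging

# DEBUG emits the per-layer detail that INFO omits
logging.getLogger().setLevel(logging.DEBUG)
```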
@jtuyls No issues, here is the requested log (attaching as a file, which seems to have more details). Hope this helps.
@abdulazizm This issue is caused by the conv2d shape inference not having been added for dilation > 1. It's still WIP, but you could try this branch of PyXIR: https://github.com/Xilinx/pyxir/tree/dev-rf-test-0
@jtuyls Thanks for the time and your code change. Tried the pyxir branch (dev-rf-test-0); the annotation() API worked fine without the assertion error, but I'm getting a seg fault during the PartitionGraph() call.
Code used for reference:
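(The original snippet was lost in the capture; a sketch of the TVM Vitis-AI flow of that era, matching the annotation() and PartitionGraph() calls mentioned in the thread. The mod/params inputs and the DPU target string are assumptions:)

```python
from tvm import relay
from tvm.relay.op.contrib.vitis_ai import annotation

# mod, params come from relay.frontend.from_pytorch (see the issue body below)
mod = annotation(mod, params, "DPUCZDX8G-zcu104")   # mark DPU-supported ops
mod = relay.transform.MergeCompilerRegions()(mod)   # merge adjacent DPU regions
mod = relay.transform.PartitionGraph()(mod)         # split DPU/CPU subgraphs
```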
@abdulazizm This looks like a partitioning issue. Could you provide the output of the annotated module (mod['main'])?
@jtuyls Yeah, it seems to be a partitioning issue. (Maybe this is why we didn't have dilation/strides implemented earlier?) Here is the requested output. Thanks
@abdulazizm I think the issue is this dropout layer: |
Hi @jtuyls,
Do we have any quick fix for this segfault issue (in the previously mentioned branch)?
So we may not be using the DPU efficiently for this purpose? That may be acceptable for my purposes; do you want me to proceed or hold on? Is there any workaround you recommend we try for the deeplab model? I guess trying deeplab with frameworks other than PyTorch may not help, right?
Yes, I am working on it and will push it to the same branch. I will ping you when it's in.
Yes, the performance will suffer from this. You could remove the dropout layers from the model before passing it to TVM to avoid this.
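(A minimal sketch of one way to do that on the PyTorch side, assuming the dropouts are plain nn.Dropout modules; nn.Identity keeps the module tree intact:)

```python
import torch.nn as nn

def strip_dropout(module: nn.Module) -> nn.Module:
    """Recursively replace every Dropout with a no-op Identity."""
    for name, child in module.named_children():
        if isinstance(child, nn.Dropout):
            setattr(module, name, nn.Identity())
        else:
            strip_dropout(child)
    return module
```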
@jornt-xilinx That sounds great. Thanks.
@abdulazizm I think it was the dilated conv2d with large padding, which is unsupported on the DPU, that was causing incorrect partitioning (DPUCZDX8G supports padding sizes in the range [0, kernel_size-1], but the kernel size was 3 and the padding size 4). I recreated a small test case and fixed the partitioning issue, but this means that this specific conv2d and the following convolutions will be executed on the CPU. I pushed the fix to this branch again: https://github.com/Xilinx/pyxir/tree/dev-rf-test-0. Could you try it out to check whether it works for you?
@jtuyls That works great!! I was able to build the deeplabv3 model successfully. Thanks. Will try the inference results and update benchmarks here shortly. When can we expect these changes in the release branch? Shall we close this issue?
@abdulazizm Great that it's working. I will move the changes into dev shortly and will probably get them released this or next week. Feel free to close the issue.
@jtuyls Couldn't load the generated library module on the EDGE device. Not sure why. Am I missing something?
@abdulazizm I think this might be caused by the TVM GraphRuntime having been renamed to GraphExecutor. You probably have different TVM versions installed on the host machine and the board; you will have to pull in the latest TVM version on the host to resolve this.
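(For reference, the rename on the runtime side; the library file name below is a placeholder:)

```python
import tvm
# Old import: from tvm.contrib import graph_runtime
from tvm.contrib import graph_executor

dev = tvm.cpu(0)
lib = tvm.runtime.load_module("deeplabv3.so")  # placeholder path
module = graph_executor.GraphModule(lib["default"](dev))
```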
@jtuyls Tried with the latest TVM and PyXIR (dev-rf-test-0 git branch), but now it couldn't compile. Has any other part of the workflow changed? (It's popping up a message not to use the annotation API, but TVM's vitis-ai example and the pyxir example still use the annotation API.) Has loading the build config changed too? (Couldn't find the recent way to load the build config.)
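(The deprecation message presumably points at the newer single-call partitioning helper in TVM; a sketch, with the dpu identifier as an assumption:)

```python
from tvm.relay.op.contrib.vitis_ai import partition_for_vitis_ai

# Replaces the manual annotation / merge / partition sequence
mod = partition_for_vitis_ai(mod, params, dpu="DPUCZDX8G-zcu104")
```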
@abdulazizm Were you also in the vitis-ai-pytorch environment earlier? You should use the vitis-ai-tensorflow environment. |
@jtuyls Yeah, I was using the pytorch environment earlier too, since I was using the deeplab model from the PyTorch framework; someone suggested using this conda environment (reference attached below). Will try the tensorflow environment too and keep you posted.
@abdulazizm I see, yeah, for the TVM work it is always the tensorflow environment. Not sure why the pytorch environment was working for you earlier. |
@jtuyls That worked. With the tensorflow conda environment I managed to build the library file. But on the EDGE device it throws an error like the one below at runtime while loading the lib. Any idea?
Tried tvm.runtime.load_module(lib_path) instead of tvm.runtime.module.load_module(lib_path), but that didn't help - same errors.
@abdulazizm I think your model didn't get quantized and compiled properly on the host machine. Make sure that you are providing enough calibration images and, to verify, check that the quantizer and compiler got called in the console output. Also, build and export for aarch64 afterwards; here is a full example: https://github.com/Xilinx/pyxir/blob/master/examples/tvm/edge_resnet_18_host.py.
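(The host-side build and export step looks roughly like this; the target strings and file names are assumptions, see the linked example for the full flow:)

```python
import tvm
from tvm import relay

# Build for the aarch64 board, then cross-compile the shared object
tvm_target = "llvm -mtriple=aarch64-linux-gnu"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, tvm_target, params=params)
lib.export_library("deeplabv3.so", cc="aarch64-linux-gnu-gcc")
```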
@jtuyls With the default 128 quant_size I was able to compile the model with quantization. It took around 3.85 seconds for an image from the COCO dataset (earlier it was 8.5 seconds; I guess it used the CPU completely). Hopefully once I remove the dropout layer I will see drastic improvements. Thanks for the time, support and patience. Closing this issue.
@abdulazizm That isn't good performance; it means a lot of the convolutions are still executed on the CPU. Removing the dropout might help, but there are also dilated convolutions with large padding values that are breaking up the DPU partition and causing those and subsequent convolutions to be offloaded to the CPU. Like these ones:
It looks like those dilated convolutions with large padding values are hurting performance more than the dropout, so to achieve good performance these operations will also have to be adjusted to DPU-supported dilated convolutions. For your reference, on page 23 of the Zynq DPU product guide you can find a table of DPU-supported operations. As an example, the above convolutions are not supported because the padding values are larger than kernel_w/h - 1.
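(That constraint as a quick check; the function is just for illustration:)

```python
def dpu_supports_padding(kernel_hw, pad_hw):
    # DPUCZDX8G Conv2d constraint: padding must lie in [0, kernel_size - 1]
    return all(0 <= p <= k - 1 for k, p in zip(kernel_hw, pad_hw))

print(dpu_supports_padding((3, 3), (1, 1)))    # True: runs on the DPU
print(dpu_supports_padding((3, 3), (12, 12)))  # False: falls back to the CPU
```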
@jtuyls Tried with padding and dilation = (2,2) instead of (12,12), (24,24), (36,36), but it didn't help much. It took around the same ~4 seconds for single-image inference; not sure why.
@abdulazizm I guess there are still unsupported convolutions around then. You also replaced the one with (4, 4)?
You can look at the
@jtuyls You are right, I missed (4,4). Now tried removing those too, but couldn't get the library exported even after quantization was done.
@jornt-xilinx Any suggestions for the above issue? Has dev-rf-test-0 been merged into the master branch?
@abdulazizm I merged
Debug_log_pyxir_deeplab_less_dilated_conv.txt |
@abdulazizm I think we are running into this DPU Conv2d constraint:
kernel_w, kernel_h = 3, 3
So,
pyxir/python/pyxir/graph/ops/l1_basic_nn.py, line 167 (commit 7d73925)
Trying to build deeplabv3 from PyTorch to deploy on a Xilinx EDGE device (ZCU104). Facing an issue while annotating. Code snippets here:
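(The snippets did not survive the capture; a sketch of how deeplabv3 is typically traced and imported into Relay. torchvision's model returns a dict, so a small wrapper is needed; the input size and names are assumptions:)

```python
import torch
import torchvision
from tvm import relay

class DeepLabWrapper(torch.nn.Module):
    """torchvision's deeplabv3 returns a dict; tracing needs a plain tensor."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        return self.model(x)["out"]

model = torchvision.models.segmentation.deeplabv3_resnet101(pretrained=True)
wrapper = DeepLabWrapper(model).eval()
inp = torch.randn(1, 3, 224, 224)  # assumed input size
traced = torch.jit.trace(wrapper, inp)
mod, params = relay.frontend.from_pytorch(traced, [("data", list(inp.shape))])
```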
Two doubtful things in the output:
/home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/frontend/tvm/relay_tools/relay_l10_temporary.py:64: UserWarning: Convert Relay Adaptive Avg pool2d layer into normal average pool2d layer
warnings.warn("Convert Relay Adaptive Avg pool2d layer into normal"
File "/home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/frontend/tvm/relay_tools/relay_l1_basic.py", line 414, in concatenate
X = px.ops.concat(op_name, data_layers, axis, relay_id=relay_idx)
File "/home/vitis-ai-user/.local/lib/python3.6/site-packages/pyxir-0.1.6-py3.6-linux-x86_64.egg/pyxir/graph/ops/l1_basic_nn.py", line 167, in concat
assert i == axis or len(check) == 1
AssertionError
While printing/debugging some values near the above-referenced line, I noticed that there is a TODO tag. Is anyone working on this implementation? (FYI: it seems i=2 & axis=1 and check={132,108,156,86,28} when it throws the assertion error.)