This issue was moved to a discussion.

Unable to load custom model on firmware 11 #160

Closed
Duckypu opened this issue Sep 26, 2023 · 14 comments
Labels
bug Something isn't working

Comments

@Duckypu

Duckypu commented Sep 26, 2023

Description

Thank you for your attention. I've trained a custom ssdlite_mobiledet model using the TensorFlow API. Following earlier work, I changed the paths in Dockerfile.model and env.aarch64.artpec8, and I was able to run it successfully in the following environment:

  • Axis device model: P3265-LVE
  • Axis device firmware version: 10.11.76
  • SDK VERSION: 1.2.1
  • docker daemon with Compose: 1.2.3
  • acap-runtime : 1.2.0

However, when I upgraded Axis firmware to version 11, I encountered the following issue during inference:


inference-server_1            | ERROR in Inference: Failed to load model model.tflite (Could not send message: Transport endpoint is not connected)
object-detector-api-python_1  | <_InactiveRpcError of RPC that terminated with:
object-detector-api-python_1  |         status = StatusCode.CANCELLED
object-detector-api-python_1  |         details = ""
object-detector-api-python_1  |         debug_error_string = "{"created":"@1695725084.818779200","description":"Error received from peer unix:/tmp/acap-runtime.sock","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"","grpc_status":1}"

Issue environment

  • Axis device model: P3265-LVE
  • Axis device firmware and CV SDK: 11.3.70 vs 1.9
    (also tried 11.2.68 vs 1.6, 11.4.63 vs 1.8, 11.5.64 vs 1.10)
  • docker daemon with Compose: 1.2.3
  • acap-runtime : 1.2.0 (also tried 1.3.1)

Please help me, thanks in advance.

@Duckypu Duckypu added the bug Something isn't working label Sep 26, 2023
@Corallo
Contributor

Corallo commented Sep 26, 2023

Hi @Duckypu

First, I'd recommend making sure that you are using the correct firmware version with the matching SDK version, since we only test those combinations:
https://axiscommunications.github.io/acap-documentation/docs/api/computer-vision-sdk-apis.html

Do you have this issue also when you try the model provided in the example?

For debugging, first upload your model to the camera and run this on the device:
larod-client -g model_path -c axis-a8-dlpu-tflite
This will test the loading of the model.
If it fails, run
journalctl -u larod
and check the output.
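The two steps above can be combined into one small helper; this is only a sketch. The function name `check_model_load` and the override variables `LAROD_CLIENT` and `JOURNALCTL` are hypothetical conveniences, not part of larod. The commands and flags themselves are exactly the ones quoted above.

```shell
#!/bin/sh
# Sketch: try to load a model with larod-client; if that fails, print the larod journal.
# LAROD_CLIENT and JOURNALCTL are hypothetical overrides (handy for dry runs);
# by default the real larod-client and journalctl binaries are used.

check_model_load() {
    model_path="$1"
    chip="${2:-axis-a8-dlpu-tflite}"   # chip name taken from the command above

    if "${LAROD_CLIENT:-larod-client}" -g "$model_path" -c "$chip"; then
        echo "model loaded: $model_path"
        return 0
    fi

    echo "model load failed: $model_path" >&2
    "${JOURNALCTL:-journalctl}" -u larod
    return 1
}
```

On the device this would be called as `check_model_load /path/to/model.tflite`, where the path is whatever location the model was uploaded to.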

@Duckypu
Author

Duckypu commented Sep 27, 2023

Hi @Corallo
Thank you for your prompt response. I have considered version compatibility and paired each firmware version with its corresponding SDK version, as listed under "Issue environment" above (in fact, I have tried even more combinations than those listed). What I can confirm is that firmware 10 loads the model successfully, while firmware 11 does not.

I ran the command you mentioned:

larod-client -g model_path  -c axis-a8-dlpu-tflite

I got:

2023-09-27T09:44:58.848 Connecting to larod...
2023-09-27T09:44:58.863 Connected
2023-09-27T09:44:59.295 ERROR: When loading model synchronously (-6): Could not load model: Asynchronous connection has been closed

And then

journalctl -u larod

I got:

Sep 27 09:44:58 axis-b8a44f495376 larod[73380]: Created a new session ID: 1, client: :1.519
Sep 27 09:44:58 axis-b8a44f495376 sh[73380]: WARNING: Fallback unsupported op 32 to TfLite
Sep 27 09:44:59 axis-b8a44f495376 sh[73380]: double free or corruption (out)
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: larod.service: Main process exited, code=killed, status=6/ABRT
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: larod.service: Failed with result 'signal'.
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: larod.service: Scheduled restart job, restart counter is at 9.
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: Stopped Machine learning service.
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: Starting Machine learning service...
Sep 27 09:44:59 axis-b8a44f495376 systemd[1]: Started Machine learning service.
Sep 27 09:44:59 axis-b8a44f495376 larod[73542]: Service started
Sep 27 09:44:59 axis-b8a44f495376 larod[73542]: Created a new session ID: 0, client: :1.523
Sep 27 09:44:59 axis-b8a44f495376 larod[73542]: Session 0 killed since client's (:1.523) connection has been lost
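For longer journals, a quick summary of the crash markers seen above can help. The following is a minimal sketch: the function name `summarize_larod_journal` and its output format are made up for illustration, and the patterns are simply the strings that appear in the excerpt above.

```shell
# Sketch: count larod crash markers in journal text read from stdin.
# The function name and the output format are illustrative only.
summarize_larod_journal() {
    awk '
        /double free or corruption/            { heap++ }
        /larod\.service: Main process exited/  { exits++ }
        /restart counter is at/                { restarts = $NF; sub(/\.$/, "", restarts) }
        END {
            printf "heap_errors=%d service_exits=%d last_restart_counter=%s\n",
                   heap, exits, (restarts == "" ? "n/a" : restarts)
        }
    '
}
```

Usage on the device would be `journalctl -u larod | summarize_larod_journal`. For the excerpt above it prints `heap_errors=1 service_exits=1 last_restart_counter=9`.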

In addition, I compared "ssd_mobilenet_v2_coco_quant_postprocess.tflite" and "my_custom.tflite" in Netron.
I compared the inputs, the outputs, and even the structure, and I can't see anything unusual.

ssd_mobilenet_v2_coco_quant_postprocess.tflite:
image

my_custom.tflite:
image

Lastly, I'm happy to provide my unweighted model to you personally if you want.
Thank you!

@Corallo
Contributor

Corallo commented Sep 27, 2023

Thanks for the detailed report. This looks like a bug on our side.
We would have to investigate and try to replicate.

If you can't share your model publicly, the best option is to open a ticket here:
https://www.axis.com/support/helpdesk/cases
Attach an unweighted version of your model, and possibly add a link to this Issue for reference.

@Duckypu
Author

Duckypu commented Sep 27, 2023

Thank you, I've already opened a Ticket (#02150437)
I provided two versions of models for you (Tensorflow 1 and Tensorflow 2)

If you have any problems with the attached models, please let me know.

@Corallo
Contributor

Corallo commented Oct 2, 2023

Hi @Duckypu

Could you try running the larod-client command again, like this:
larod-client -g model_path -c axis-a8-dlpu-tflite -R 10 -i ''
And this time provide the system log?
It might be that you are experiencing an out of memory issue.

You can find the system log in the GUI under System -> Logs -> View the system log
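One rough way to look at the out-of-memory hypothesis, alongside the system log, is to watch available memory while the load command runs. This is a sketch under assumptions: it relies on the standard Linux `/proc/meminfo` file being readable on the device, and the names `mem_available_kb`, `watch_mem_during` and `MEM_SAMPLE_INTERVAL` are hypothetical.

```shell
# Sketch: print MemAvailable (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo; exits non-zero if the field is missing.
mem_available_kb() {
    awk '$1 == "MemAvailable:" { print $2; found = 1 } END { exit !found }' \
        "${1:-/proc/meminfo}"
}

# Sketch: run a command in the background and sample available memory until it ends.
# MEM_SAMPLE_INTERVAL is a hypothetical knob (seconds between samples, default 1).
watch_mem_during() {
    "$@" &
    pid=$!
    while kill -0 "$pid" 2>/dev/null; do
        echo "MemAvailable: $(mem_available_kb) kB"
        sleep "${MEM_SAMPLE_INTERVAL:-1}"
    done
    wait "$pid"   # propagate the exit status of the command
}
```

For example: `watch_mem_during larod-client -g model_path -c axis-a8-dlpu-tflite`. A load that fails in well under a second may finish between samples, so a missing dip does not rule out a short memory peak.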

@Duckypu
Author

Duckypu commented Oct 3, 2023

Hi @Corallo

I'm glad to receive your messages.

I ran the command you mentioned:

larod-client -g model_path  -c axis-a8-dlpu-tflite -R 10 -i ''

I got:

2023-10-03T14:47:12.703 Connecting to larod...
2023-10-03T14:47:12.719 Connected
2023-10-03T14:47:13.382 ERROR: When loading model synchronously (-6): Could not load model: Asynchronous connection has been closed

Then I checked the GUI under System -> Logs -> View the system log

I got:

2023-10-03T14:47:12.902+08:00 axis-b8a44f495376 [ INFO ] sh[1841]: WARNING: Fallback unsupported op 32 to TfLite
2023-10-03T14:47:13.325+08:00 axis-b8a44f495376 [ INFO ] sh[1841]: double free or corruption (out)
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ ERR ] kernel: [ 225.208127][ T1845] larod: singleprocq: potentially unexpected fatal signal 6.
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208168][ T1845] CPU: 0 PID: 1845 Comm: singleprocq Kdump: loaded Tainted: G O 5.15.13-axis9 #1
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208181][ T1845] Hardware name: AXIS P3265/P3267/P3268 Dome Camera (DT)
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208190][ T1845] pstate: 60000000 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208202][ T1845] pc : 0000007f9a7af0f8
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208209][ T1845] lr : 0000007f9a7af0e4
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208216][ T1845] sp : 0000007f987c8100
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208223][ T1845] x29: 0000007f987c8100 x28: 0000007f987c88b8 x27: 0000007f987c85a8
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208242][ T1845] x26: 0000007f9a8b7a60 x25: 0000007f987c8ba8 x24: 0000007f9a880d8a
2023-10-03T14:47:13.327+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208259][ T1845] x23: 0000007f94026000 x22: 0000000000000001 x21: 0000000000000006
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208275][ T1845] x20: 0000007f9a8b96e0 x19: 0000000000000735 x18: 000000004c41bed4
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208291][ T1845] x17: 0000000000000000 x16: 0000000000000000 x15: 000000005636a287
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208307][ T1845] x14: 0000000000000000 x13: 2974756f28206e6f x12: 6974707572726f63
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208323][ T1845] x11: 6333323930316363 x10: 000000000000000a x9 : 0000007f987c8440
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208339][ T1845] x8 : 0000000000000083 x7 : 6320726f20656572 x6 : 0000000000000020
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208355][ T1845] x5 : 0000000000000001 x4 : 0000007f9a8b96e0 x3 : 0000007f987ca0c0
2023-10-03T14:47:13.328+08:00 axis-b8a44f495376 [ WARNING ] kernel: [ 225.208371][ T1845] x2 : 0000000000000006 x1 : 0000000000000735 x0 : 0000000000000000
2023-10-03T14:47:13.360+08:00 axis-b8a44f495376 [ INFO ] dbg-cgi[1141]: Core dump ID: axis-b8a44f495376_1696315633_1841.core
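As a side note on the register dump above, some of the values (for example x12 and x13) look like little-endian ASCII text rather than addresses. The sketch below decodes such a value byte by byte, starting from the least significant byte. `decode_le_reg` is a hypothetical helper name, and the input is assumed to be an even number of hex digits.

```shell
# Sketch: decode a hex register value as little-endian ASCII text.
# Assumes the argument is an even number of hex digits (as in the kernel dump).
decode_le_reg() {
    hex="$1"
    out=""
    while [ -n "$hex" ]; do
        byte="${hex#"${hex%??}"}"          # last two hex digits = lowest remaining byte
        hex="${hex%??}"                    # drop them from the value
        oct=$(printf '%03o' "0x$byte")     # hex byte -> octal digits
        out="$out$(printf '%b' "\\0$oct")" # octal escape -> character
    done
    printf '%s\n' "$out"
}
```

`decode_le_reg 6974707572726f63` (x12) prints `corrupti` and `decode_le_reg 2974756f28206e6f` (x13) prints `on (out)`, which matches fragments of the `double free or corruption (out)` message logged just before the abort.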

I don't think the issue is related to memory. After all, on OS 10 the model loads successfully. Or is there a default model running on the device from OS 11 onward?

@Corallo
Contributor

Corallo commented Oct 4, 2023

I have tried your model on a P3265-LVE with 11.5 and 11.6, and it works fine for me.
Because that camera model has only 1 GB of RAM, I was expecting an out-of-memory issue: a known difference between 10.x and 11.x is that peak memory use during model loading is higher. But looking at the log you provided, that doesn't seem to be the case. Did you try the command with the latest firmware?

The only thing I can reproduce is the warning message about op 32, even though it doesn't result in a crash for me.
Can you elaborate on how you do the quantization and the conversion to TFLite?

@Duckypu
Author

Duckypu commented Oct 11, 2023

Hi @Corallo

After digging deeper into this, I found a mistake on my end. My camera is actually a P3265-LV, not a P3265-LVE.
Could you also successfully load the model on this camera model?

I have now managed to load the model with firmware 11.5.64 and SDK 1.9, but only randomly. However, even after making sure I have the correct firmware version, I'm still unable to load the model with other versions:

  • 11.4.63 SDK 1.8
  • 11.3.70 SDK 1.7

Additionally, I'd be happy to share the conversion method with you privately. Can I send it to you through a private channel?

@Corallo
Contributor

Corallo commented Oct 11, 2023

Hi @Duckypu

Yes, I actually tested on a P3265-LV too, but the two devices should be equivalent.

What do you mean by "randomly"? Is it not consistent/reproducible?

@Duckypu
Author

Duckypu commented Oct 12, 2023

Hi @Corallo
I've already sent the email. Please let me know if you didn't receive it.

@Corallo
Contributor

Corallo commented Oct 12, 2023

@Duckypu
Hi, I am sorry for the mistake, I had a typo in my mail.

@ThenoobMario

Hi @Corallo,

I am facing the same issue when I try to load my custom model as well.

@Corallo
Contributor

Corallo commented Oct 18, 2023

@ThenoobMario Please open a new discussion or issue and provide some more context :)

@Corallo
Contributor

Corallo commented Oct 18, 2023

Moving this issue into discussions, as for now it doesn't seem like a bug.

@AxisCommunications AxisCommunications locked and limited conversation to collaborators Oct 18, 2023
@Corallo Corallo converted this issue into discussion #168 Oct 18, 2023
