
Cannot load MXNet-trained model #189

Closed
thhart opened this issue Oct 4, 2020 · 19 comments
Labels
bug Something isn't working


@thhart
Contributor

thhart commented Oct 4, 2020

Description

Cannot load a pre-trained YOLO model from MXNet. I have a .params file and a symbol.json. MxModel seems to fail to handle the params file. If interested, I can share the model privately on request.
The model was trained in an mxnet/gluoncv Python environment.

Debugging the code, I can see the key value is stages.0.0.0.weight, which is supposed to be split by ":"; this obviously fails.
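The failure mode can be sketched in isolation (a plain Python illustration based on my reading of the stack trace, not DJL code): the loader apparently expects parameter keys in the classic `arg:name` / `aux:name` symbol format, so splitting a Gluon-style key such as `stages.0.0.0.weight` on `":"` yields only one element, and indexing the second fails:

```python
# Sketch of the suspected failure (assumption: the loader splits parameter
# keys on ":" expecting the classic "arg:name" / "aux:name" form).
def parse_key(key):
    parts = key.split(":")
    return parts[0], parts[1]  # index 1 fails for Gluon-style keys

print(parse_key("arg:conv0_weight"))  # classic symbol format: OK

try:
    parse_key("stages.0.0.0.weight")  # Gluon-style key: no ":" present
except IndexError as e:
    print("IndexError:", e)
```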

Error Message

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
	at ai.djl.mxnet.engine.MxModel.loadParameters(MxModel.java:201)
	at ai.djl.mxnet.engine.MxModel.load(MxModel.java:119)
	at ai.djl.repository.zoo.BaseModelLoader.loadModel(BaseModelLoader.java:142)
	at ai.djl.repository.zoo.ModelZoo.loadModel(ModelZoo.java:162)
	at com.itth.okra.axle.AxleDetectorMxnet.<init>(AxleDetectorMxnet.java:29)
	at com.itth.okra.axle.AxleDetectorMxnet.main(AxleDetectorMxnet.java:42)

How to Reproduce?

I try to load the model with the following code:

         Criteria<Image, DetectedObjects> criteria = Criteria.builder()
               .setTypes(Image.class, DetectedObjects.class) // defines input and output data type
               .optDevice(Device.cpu())
               .optTranslator(new YoloTranslator(new Builder()))
               .optModelUrls("file:///tmp/mxnet") // search models in the specified path
               .optModelName("model")
               .build();
         final ZooModel<Image, DetectedObjects> model = ModelZoo.loadModel(criteria);

Environment Info

djl: 0.8.0
mxnet-engine: 0.8.0
mxnet-native-mkl: 1.7.0

@thhart thhart added the bug Something isn't working label Oct 4, 2020
@lanking520
Member

Hi, can you share the model or the code used to obtain it? I can try to reproduce this.

@thhart
Contributor Author

thhart commented Oct 5, 2020

@lanking520
Copy link
Member

@thhart Do you know which MXNet version you used to train the model (pip package)? Is it MXNet 1.7 or MXNet 1.5 (or lower)?

@thhart
Contributor Author

thhart commented Oct 6, 2020

Name: mxnet-cu102
Version: 1.7.0
Summary: MXNet is an ultra-scalable deep learning framework. This version uses CUDA-10.2.
Home-page: https://github.com/apache/incubator-mxnet
Author: None
Author-email: None
License: Apache 2.0
Location: /usr/local/lib/python3.8/dist-packages
Requires: requests, graphviz, numpy

Name: gluoncv
Version: 0.8.0
Summary: MXNet Gluon CV Toolkit
Home-page: https://github.com/dmlc/gluon-cv
Author: Gluon CV Toolkit Contributors
Author-email: UNKNOWN
License: Apache-2.0
Location: /usr/local/lib/python3.8/dist-packages
Requires: matplotlib, requests, numpy, tqdm, portalocker, Pillow, scipy

@lanking520
Member

lanking520 commented Oct 6, 2020

After trying to load this model in Python, I got the following issue:

AssertionError: Parameter 'darknetv30_conv0_weight' is missing in file: yolo/model-0000.params, which contains parameters: 'stages.0.0.0.weight', 'stages.0.0.1.gamma', 'stages.0.0.1.beta', ..., 'yolo_outputs.2.anchors', 'yolo_outputs.2.offsets', 'yolo_outputs.2.prediction.weight', 'yolo_outputs.2.prediction.bias'. Please make sure source and target networks have the same prefix.For more info on naming, please see https://mxnet.io/api/python/docs/tutorials/packages/gluon/blocks/naming.html

It seems some of the layer weights were not saved.

There is something you can do; try following the steps at

http://docs.djl.ai/docs/mxnet/how_to_convert_your_model_to_symbol.html#how-to-convert-your-gluon-model-to-an-mxnet-symbol

to save your model in a Symbol-compatible way.

To reproduce the above Python issue:

import mxnet as mx
from mxnet import gluon

model_prefix = "yolo/model" # your model path

model = gluon.nn.SymbolBlock.imports(model_prefix + "-symbol.json", ['data'], model_prefix + "-0000.params")

If the problem persists, I guess we have to dig through the block to see which part may not be hybridized.

@thhart
Contributor Author

thhart commented Oct 7, 2020

Cool, thanks for checking. I was under the assumption the net was already hybridized, but in fact it wasn't. So I converted it now and it is loading. Sorry for not having checked carefully.

However, now I run into another problem: the layers are not fed correctly. I have changed the criteria like this:

         Criteria<Image, DetectedObjects> criteria = Criteria.builder()
               .setTypes(Image.class, DetectedObjects.class) // defines input and output data type
               .optDevice(Device.cpu())
               .optTranslator(new YoloTranslator(
                     YoloTranslator.builder()
                           .optSynsetArtifactName("synset.txt")
                           .setPipeline(new Pipeline())))
               .optModelUrls("file:///tmp") // search models in the specified path
               .optModelName("yolo3_darknet53")
               .build();

This is the error I receive:

Exception in thread "main" ai.djl.engine.EngineException: MXNet engine call failed: MXNetError: Error in operator darknetv30_conv0_fwd: Shape inconsistent, Provided = [32,3,3,3], inferred shape=(32,608,3,3)
	at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1808)
	at ai.djl.mxnet.jna.JnaUtils.cachedOpInvoke(JnaUtils.java:1785)
	at ai.djl.mxnet.engine.CachedOp.forward(CachedOp.java:135)
	at ai.djl.mxnet.engine.MxSymbolBlock.forward(MxSymbolBlock.java:178)
	at ai.djl.nn.Block.forward(Block.java:117)
	at ai.djl.inference.Predictor.predict(Predictor.java:117)
	at ai.djl.inference.Predictor.batchPredict(Predictor.java:157)
	at ai.djl.inference.Predictor.predict(Predictor.java:112)
	at com.itth.okra.axle.AxleDetectorMxnetDjl.<init>(AxleDetectorMxnetDjl.java:38)
	at com.itth.okra.axle.AxleDetectorMxnetDjl.main(AxleDetectorMxnetDjl.java:48)

This is how the image (608x608) is loaded:

         final ZooModel<Image, DetectedObjects> model = ModelZoo.loadModel(criteria);
         Predictor<Image, DetectedObjects> predictor = model.newPredictor();
         final File input = new File("/tmp/sample.jpg");
         BufferedImageFactory factory = new BufferedImageFactory();
         DetectedObjects detection = predictor.predict(factory.fromImage(ImageIO.read(input)));
         for (Classification item : detection.items()) {
            System.err.println(item.getClassName() + ": " + item.getProbability());
         }
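The inferred shape (32, 608, 3, 3) in the error suggests the conv layer saw 608 input channels instead of 3, which typically happens when an HWC image is fed where NCHW layout is expected. A minimal numpy sketch of the layout mismatch (an illustration only, not DJL code):

```python
import numpy as np

# A 608x608 RGB image as loaded from disk: height x width x channels (HWC).
img_hwc = np.zeros((608, 608, 3), dtype=np.uint8)

# Fed directly with a batch axis, the conv layer sees 608 "channels"...
batch_wrong = img_hwc[np.newaxis, ...]   # shape (1, 608, 608, 3)
print(batch_wrong.shape[1])              # 608, not the expected 3

# ...whereas NCHW layout puts the 3 color channels first.
batch_right = img_hwc.transpose(2, 0, 1)[np.newaxis, ...]
print(batch_right.shape)                 # (1, 3, 608, 608)
```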

@lanking520
Member

I can take a look too. It seems the input shape is not what the YOLO network is looking for. What is the shape of your input? Usually during training we normalize and resize the image. 608 seems to be the upper limit for the model; maybe you can add Resize() to the pipeline to resize it to some value between 320 and 608.

Can you send me the model files again? I can help investigate in the meantime.

@lanking520 lanking520 self-assigned this Oct 7, 2020
@lanking520
Member

BTW, we do have some pretrained gluoncv YOLO models in DJL. You may want to take a look. One trick we do for hybridization is, before the export, to do a forward pass with a dummy image (nd.ones((1, 3, size, size))), and then use that size as the standard image size to feed. I assume you are doing something similar.

@zachgk
Contributor

zachgk commented Oct 7, 2020

You can also find the script that was used for the model zoo yolo models at https://github.com/awslabs/djl/blob/master/mxnet/mxnet-model-zoo/src/main/scripts/exportYolo.py

@thhart
Contributor Author

thhart commented Nov 3, 2020

Hi Lanking, sorry for this late answer, but in the meantime I checked an alternative approach with YOLOv5. With this I achieved inference over an ONNX bridge (1.5.1).
This is already working very well; however, DJL is of course a more sophisticated framework, and I am convinced support for YOLOv5 would be good progress and an enhancement for it. Maybe the built-in ONNX support in DJL could also be the key. The only bottleneck I see is probably the output-layer parsing, but this should not be too hard. I have ported this to Java, but I am not aware of the internal DJL structures. Maybe your YOLOv3 version might be of use, but I don't know the differences in the input/output layers.
Although I have a solution, I could offer testing capabilities and share my small test code if you are interested; otherwise feel free to close this issue.
BR
Thomas

@lanking520
Member

@thhart Hi Thomas, we do have ONNX Runtime support: http://docs.djl.ai/onnxruntime/onnxruntime-engine/index.html

Could you please try it out? This should work with the majority of ONNX models designed for deep learning.

Thanks

@thhart
Contributor Author

thhart commented Nov 3, 2020

Using the following code produces the error below; maybe a simple input-encoding problem? Any hint? YoloTranslator looks the same as in my solution, by the way, but it did not get this far yet...

         Criteria<Image, DetectedObjects> criteria = Criteria.builder()
               .setTypes(Image.class, DetectedObjects.class) // defines input and output data type
               .optDevice(Device.cpu())
               .optTranslator(new YoloTranslator(YoloTranslator.builder().optSynsetArtifactName("synset.txt").setPipeline(new Pipeline())))
               .optModelUrls("file:///home/th/dev/itth/okraLearn/yolov5/") // search models in specified path
               .optModelName("axle-model-20201102-1024.onnx")
               .optEngine("OnnxRuntime")
               .build();
         final ZooModel<Image, DetectedObjects> model = ModelZoo.loadModel(criteria);
         Predictor<Image, DetectedObjects> predictor = model.newPredictor();
         final File input = new File("/opt/axle/images/JPEGImages/20200928-115130997-6.jpg");
         BufferedImageFactory factory = new BufferedImageFactory();
         DetectedObjects detection = predictor.predict(factory.fromImage(HelperImage.scaleImage(ImageIO.read(input), 1024, 1024)));
         for (Classification item : detection.items()) {
            System.err.println(item.getClassName() + ": " + item.getProbability());
         }

Exception in thread "main" ai.djl.engine.EngineException: ai.onnxruntime.OrtException: Error code - ORT_INVALID_ARGUMENT - message: Unexpected input data type. Actual: (N11onnxruntime17PrimitiveDataTypeIaEE) , expected: (N11onnxruntime17PrimitiveDataTypeIfEE)
	at ai.djl.onnxruntime.engine.OrtSymbolBlock.forward(OrtSymbolBlock.java:102)
	at ai.djl.nn.Block.forward(Block.java:117)
	at ai.djl.inference.Predictor.predict(Predictor.java:117)
	at ai.djl.inference.Predictor.batchPredict(Predictor.java:157)
	at ai.djl.inference.Predictor.predict(Predictor.java:112)
	at com.itth.okra.axle.AxleDetectorOnnxDjl.<init>(AxleDetectorOnnxDjl.java:34)
	at com.itth.okra.axle.AxleDetectorOnnxDjl.main(AxleDetectorOnnxDjl.java:44)
Caused by: ai.onnxruntime.OrtException: Error code - ORT_INVALID_ARGUMENT - message: Unexpected input data type. Actual: (N11onnxruntime17PrimitiveDataTypeIaEE) , expected: (N11onnxruntime17PrimitiveDataTypeIfEE)
	at ai.onnxruntime.OrtSession.run(Native Method)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:288)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:231)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:200)
	at ai.djl.onnxruntime.engine.OrtSymbolBlock.forward(OrtSymbolBlock.java:99)
	... 6 more

@thhart
Contributor Author

thhart commented Nov 3, 2020

I should add I feed the model with normalized floats.

@lanking520
Member

lanking520 commented Nov 3, 2020

I haven't seen this before:

N11onnxruntime17PrimitiveDataTypeIaEE
N11onnxruntime17PrimitiveDataTypeIfEE

I will take a look.

[Update]
onnx/models#257

The data type might be the problem. Try converting the data to float32.

@thhart
Contributor Author

thhart commented Nov 3, 2020

Sure, but how do I feed data or influence the conversion when using your framework's image input chain as above?

@thhart
Contributor Author

thhart commented Nov 3, 2020

Looks like BaseImageTranslator is feeding INT by default; maybe worth checking whether it can be overridden...

@lanking520
Member

@thhart You can use the pipeline to add a ToTensor transform.

Like we tried here: #238 (comment)
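What a ToTensor-style step does can be sketched in plain numpy (an illustration of the conversion, not the DJL implementation): the uint8 HWC image becomes a float32 CHW tensor scaled to [0, 1], matching the float input the ONNX model expects.

```python
import numpy as np

def to_tensor(img_hwc_uint8):
    """Sketch of a ToTensor-style conversion: HWC uint8 -> CHW float32 in [0, 1]."""
    chw = img_hwc_uint8.transpose(2, 0, 1)
    return chw.astype(np.float32) / 255.0

img = np.full((1024, 1024, 3), 255, dtype=np.uint8)
tensor = to_tensor(img)
print(tensor.dtype, tensor.shape, tensor.max())  # float32 (3, 1024, 1024) 1.0
```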

@thhart
Contributor Author

thhart commented Nov 3, 2020

Got it working. I still need some NMS (non-max suppression); is there something available already in DJL for DetectedObjects?
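For reference, the idea of NMS can be sketched in a few lines (a plain numpy sketch with corner-format boxes, not the DJL API): keep the highest-scoring box, drop every remaining box whose IoU with it exceeds a threshold, and repeat.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        # IoU of the best box with all remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + area_rest - inter)
        order = order[1:][iou <= iou_threshold]  # drop overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]: the near-duplicate of the best box is suppressed
```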

@thhart
Contributor Author

thhart commented Nov 4, 2020

Please check the following PR:

#272

Tested successfully with a custom-trained network.
