
Cannot load MXNet-trained model #189

Closed
thhart opened this issue Oct 4, 2020 · 19 comments
Labels
bug Something isn't working


@thhart
Contributor

thhart commented Oct 4, 2020

Description

Cannot load a pre-trained YOLO model from MXNet. I have a .params file and a symbol.json. MxModel seems to fail to handle the params file. If interested, I can share the model privately on request.
The model was trained in an mxnet/gluoncv Python environment.

Debugging the code, I can see the key value is stages.0.0.0.weight, which is supposed to be split by ":"; this obviously fails.
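The failure mode can be sketched in isolation (a plain Python illustration based on my reading of the stack trace, not DJL code): the loader apparently expects parameter keys in the classic `arg:name` / `aux:name` symbol format, so splitting a Gluon-style key such as `stages.0.0.0.weight` on `":"` yields only one element, and indexing the second fails:

```python
# Sketch of the suspected failure (assumption: the loader splits parameter
# keys on ":" expecting the classic "arg:name" / "aux:name" form).
def parse_key(key):
    parts = key.split(":")
    return parts[0], parts[1]  # index 1 fails for Gluon-style keys

print(parse_key("arg:conv0_weight"))  # classic symbol format: OK

try:
    parse_key("stages.0.0.0.weight")  # Gluon-style key: no ":" present
except IndexError as e:
    print("IndexError:", e)
```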

Error Message

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
	at ai.djl.mxnet.engine.MxModel.loadParameters(MxModel.java:201)
	at ai.djl.mxnet.engine.MxModel.load(MxModel.java:119)
	at ai.djl.repository.zoo.BaseModelLoader.loadModel(BaseModelLoader.java:142)
	at ai.djl.repository.zoo.ModelZoo.loadModel(ModelZoo.java:162)
	at com.itth.okra.axle.AxleDetectorMxnet.<init>(AxleDetectorMxnet.java:29)
	at com.itth.okra.axle.AxleDetectorMxnet.main(AxleDetectorMxnet.java:42)

How to Reproduce?

I try to load the model with the following code:

         Criteria<Image, DetectedObjects> criteria = Criteria.builder()
               .setTypes(Image.class, DetectedObjects.class) // defines input and output data type
               .optDevice(Device.cpu())
               .optTranslator(new YoloTranslator(new Builder()))
               .optModelUrls("file:///tmp/mxnet") // search models in the specified path
               .optModelName("model")
               .build();
         final ZooModel<Image, DetectedObjects> model = ModelZoo.loadModel(criteria);

Environment Info

djl: 0.8.0
mxnet-engine: 0.8.0
mxnet-native-mkl: 1.7.0

@thhart thhart added the bug Something isn't working label Oct 4, 2020
@lanking520
Member

Hi, can you share the model or the code used to obtain it? I can try to reproduce this.

@thhart
Contributor Author

thhart commented Oct 5, 2020

@lanking520
Copy link
Member

@thhart Do you know which MXNet version you used to train the model (pip package)? Is it MXNet 1.7 or MXNet 1.5 (or lower)?

@thhart
Contributor Author

thhart commented Oct 6, 2020

Name: mxnet-cu102
Version: 1.7.0
Summary: MXNet is an ultra-scalable deep learning framework. This version uses CUDA-10.2.
Home-page: https://github.com/apache/incubator-mxnet
Author: None
Author-email: None
License: Apache 2.0
Location: /usr/local/lib/python3.8/dist-packages
Requires: requests, graphviz, numpy

Name: gluoncv
Version: 0.8.0
Summary: MXNet Gluon CV Toolkit
Home-page: https://github.com/dmlc/gluon-cv
Author: Gluon CV Toolkit Contributors
Author-email: UNKNOWN
License: Apache-2.0
Location: /usr/local/lib/python3.8/dist-packages
Requires: matplotlib, requests, numpy, tqdm, portalocker, Pillow, scipy

@lanking520
Member

lanking520 commented Oct 6, 2020

After trying to load this model in Python, I got the following issue:

AssertionError: Parameter 'darknetv30_conv0_weight' is missing in file: yolo/model-0000.params, which contains parameters: 'stages.0.0.0.weight', 'stages.0.0.1.gamma', 'stages.0.0.1.beta', ..., 'yolo_outputs.2.anchors', 'yolo_outputs.2.offsets', 'yolo_outputs.2.prediction.weight', 'yolo_outputs.2.prediction.bias'. Please make sure source and target networks have the same prefix.For more info on naming, please see https://mxnet.io/api/python/docs/tutorials/packages/gluon/blocks/naming.html

It seems some of the layer weights were not saved.

There is something you can do; try following the steps at

http://docs.djl.ai/docs/mxnet/how_to_convert_your_model_to_symbol.html#how-to-convert-your-gluon-model-to-an-mxnet-symbol

to save your model in a Symbol-compatible way.

To reproduce the above Python issue:

import mxnet as mx
from mxnet import gluon

model_prefix = "yolo/model" # your model path

model = gluon.nn.SymbolBlock.imports(model_prefix + "-symbol.json", ['data'], model_prefix + "-0000.params")

If the problem persists, I guess we have to dig through the block to see which part may not be hybridized.

@thhart
Contributor Author

thhart commented Oct 7, 2020

Cool, thanks for checking. I was under the assumption the net was already hybridized, but in fact it wasn't. So I converted it now and it is loading. Sorry for not having checked carefully.

However, now I run into another problem: the layers are not fed correctly. I have changed the criteria like this:

         Criteria<Image, DetectedObjects> criteria = Criteria.builder()
               .setTypes(Image.class, DetectedObjects.class) // defines input and output data type
               .optDevice(Device.cpu())
               .optTranslator(new YoloTranslator(
                     YoloTranslator.builder()
                           .optSynsetArtifactName("synset.txt")
                           .setPipeline(new Pipeline())))
               .optModelUrls("file:///tmp") // search models in the specified path
               .optModelName("yolo3_darknet53")
               .build();

This is the error I receive:

Exception in thread "main" ai.djl.engine.EngineException: MXNet engine call failed: MXNetError: Error in operator darknetv30_conv0_fwd: Shape inconsistent, Provided = [32,3,3,3], inferred shape=(32,608,3,3)
	at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1808)
	at ai.djl.mxnet.jna.JnaUtils.cachedOpInvoke(JnaUtils.java:1785)
	at ai.djl.mxnet.engine.CachedOp.forward(CachedOp.java:135)
	at ai.djl.mxnet.engine.MxSymbolBlock.forward(MxSymbolBlock.java:178)
	at ai.djl.nn.Block.forward(Block.java:117)
	at ai.djl.inference.Predictor.predict(Predictor.java:117)
	at ai.djl.inference.Predictor.batchPredict(Predictor.java:157)
	at ai.djl.inference.Predictor.predict(Predictor.java:112)
	at com.itth.okra.axle.AxleDetectorMxnetDjl.<init>(AxleDetectorMxnetDjl.java:38)
	at com.itth.okra.axle.AxleDetectorMxnetDjl.main(AxleDetectorMxnetDjl.java:48)

This is how the image (608x608) is loaded:

         final ZooModel<Image, DetectedObjects> model = ModelZoo.loadModel(criteria);
         Predictor<Image, DetectedObjects> predictor = model.newPredictor();
         final File input = new File("/tmp/sample.jpg");
         BufferedImageFactory factory = new BufferedImageFactory();
         DetectedObjects detection = predictor.predict(factory.fromImage(ImageIO.read(input)));
         for (Classification item : detection.items()) {
            System.err.println(item.getClassName() + ": " + item.getProbability());
         }
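The inferred shape (32, 608, 3, 3) in the error suggests the conv layer saw 608 input channels instead of 3, which typically happens when an HWC image is fed where NCHW layout is expected. A minimal numpy sketch of the layout mismatch (an illustration only, not DJL code):

```python
import numpy as np

# A 608x608 RGB image as loaded from disk: height x width x channels (HWC).
img_hwc = np.zeros((608, 608, 3), dtype=np.uint8)

# Fed directly with a batch axis, the conv layer sees 608 "channels"...
batch_wrong = img_hwc[np.newaxis, ...]   # shape (1, 608, 608, 3)
print(batch_wrong.shape[1])              # 608, not the expected 3

# ...whereas NCHW layout puts the 3 color channels first.
batch_right = img_hwc.transpose(2, 0, 1)[np.newaxis, ...]
print(batch_right.shape)                 # (1, 3, 608, 608)
```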

@lanking520
Member

I can take a look too. It seems the input shape is not what the YOLO network is looking for. What is the shape of your input? Usually during training we normalize and resize the image. 608 seems to be the upper limit for the model; maybe you can add Resize() to the pipeline to resize it to some value between 320 and 608.

Can you send me the model files again? I can help investigate in the meantime.

@lanking520 lanking520 self-assigned this Oct 7, 2020
@lanking520
Member

BTW, we do have some pretrained gluoncv YOLO models in DJL. You may want to take a look. One trick we do for hybridization is, before the export, to do a forward pass with a dummy image (nd.ones((1, 3, size, size))), and then use that size as the standard image size to feed. I assume you are doing something similar.

@zachgk
Contributor

zachgk commented Oct 7, 2020

You can also find the script that was used for the model zoo yolo models at https://github.com/awslabs/djl/blob/master/mxnet/mxnet-model-zoo/src/main/scripts/exportYolo.py

@thhart
Contributor Author

thhart commented Nov 3, 2020

Hi Lanking, sorry for this late answer, but in the meantime I checked an alternative approach with YOLOv5. With this I achieved inference over an ONNX bridge (1.5.1).
This is already working very well; however, DJL is of course a more sophisticated framework, and I am convinced support for YOLOv5 would be good progress and an enhancement for it. Maybe the built-in ONNX support in DJL could also be the key. The only bottleneck I see is probably the output-layer parsing, but this should not be too hard. I have ported this to Java, but I am not aware of the internal DJL structures. Maybe your YOLOv3 version might be of use, but I don't know the differences in the input/output layers.
Although I have a solution, I could offer testing capabilities and share my small test code if you are interested; otherwise feel free to close this issue.
BR
Thomas

@lanking520
Member

@thhart Hi Thomas, we do have ONNX Runtime support: http://docs.djl.ai/onnxruntime/onnxruntime-engine/index.html

Could you please try it out? This should work with the majority of ONNX models designed for deep learning.

Thanks

@thhart
Contributor Author

thhart commented Nov 3, 2020

Using the following code produces the error below; maybe a simple input-encoding problem? Any hint? YoloTranslator looks the same as in my solution, by the way, but it did not get this far yet...

         Criteria<Image, DetectedObjects> criteria = Criteria.builder()
               .setTypes(Image.class, DetectedObjects.class) // defines input and output data type
               .optDevice(Device.cpu())
               .optTranslator(new YoloTranslator(YoloTranslator.builder().optSynsetArtifactName("synset.txt").setPipeline(new Pipeline())))
               .optModelUrls("file:///home/th/dev/itth/okraLearn/yolov5/") // search models in specified path
               .optModelName("axle-model-20201102-1024.onnx")
               .optEngine("OnnxRuntime")
               .build();
         final ZooModel<Image, DetectedObjects> model = ModelZoo.loadModel(criteria);
         Predictor<Image, DetectedObjects> predictor = model.newPredictor();
         final File input = new File("/opt/axle/images/JPEGImages/20200928-115130997-6.jpg");
         BufferedImageFactory factory = new BufferedImageFactory();
         DetectedObjects detection = predictor.predict(factory.fromImage(HelperImage.scaleImage(ImageIO.read(input), 1024, 1024)));
         for (Classification item : detection.items()) {
            System.err.println(item.getClassName() + ": " + item.getProbability());
         }

Exception in thread "main" ai.djl.engine.EngineException: ai.onnxruntime.OrtException: Error code - ORT_INVALID_ARGUMENT - message: Unexpected input data type. Actual: (N11onnxruntime17PrimitiveDataTypeIaEE) , expected: (N11onnxruntime17PrimitiveDataTypeIfEE)
	at ai.djl.onnxruntime.engine.OrtSymbolBlock.forward(OrtSymbolBlock.java:102)
	at ai.djl.nn.Block.forward(Block.java:117)
	at ai.djl.inference.Predictor.predict(Predictor.java:117)
	at ai.djl.inference.Predictor.batchPredict(Predictor.java:157)
	at ai.djl.inference.Predictor.predict(Predictor.java:112)
	at com.itth.okra.axle.AxleDetectorOnnxDjl.<init>(AxleDetectorOnnxDjl.java:34)
	at com.itth.okra.axle.AxleDetectorOnnxDjl.main(AxleDetectorOnnxDjl.java:44)
Caused by: ai.onnxruntime.OrtException: Error code - ORT_INVALID_ARGUMENT - message: Unexpected input data type. Actual: (N11onnxruntime17PrimitiveDataTypeIaEE) , expected: (N11onnxruntime17PrimitiveDataTypeIfEE)
	at ai.onnxruntime.OrtSession.run(Native Method)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:288)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:231)
	at ai.onnxruntime.OrtSession.run(OrtSession.java:200)
	at ai.djl.onnxruntime.engine.OrtSymbolBlock.forward(OrtSymbolBlock.java:99)
	... 6 more

@thhart
Contributor Author

thhart commented Nov 3, 2020

I should add I feed the model with normalized floats.

@lanking520
Member

lanking520 commented Nov 3, 2020

I haven't seen this before:

N11onnxruntime17PrimitiveDataTypeIaEE
N11onnxruntime17PrimitiveDataTypeIfEE

I will take a look.

[Update]
onnx/models#257

The data type might be the problem. Try converting the data to float32.

@thhart
Contributor Author

thhart commented Nov 3, 2020

Sure, but how do I feed data or influence the conversion when using your framework's image input chain as above?

@thhart
Contributor Author

thhart commented Nov 3, 2020

Looks like BaseImageTranslator is feeding INT by default; maybe worth checking whether it can be overridden...

@lanking520
Member

@thhart You can use the pipeline to add a ToTensor transform.

Like we tried here: #238 (comment)
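What a ToTensor-style step does can be sketched in plain numpy (an illustration of the conversion, not the DJL implementation): the uint8 HWC image becomes a float32 CHW tensor scaled to [0, 1], matching the float input the ONNX model expects.

```python
import numpy as np

def to_tensor(img_hwc_uint8):
    """Sketch of a ToTensor-style conversion: HWC uint8 -> CHW float32 in [0, 1]."""
    chw = img_hwc_uint8.transpose(2, 0, 1)
    return chw.astype(np.float32) / 255.0

img = np.full((1024, 1024, 3), 255, dtype=np.uint8)
tensor = to_tensor(img)
print(tensor.dtype, tensor.shape, tensor.max())  # float32 (3, 1024, 1024) 1.0
```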

@thhart
Contributor Author

thhart commented Nov 3, 2020

Got it working. I still need some NMS (non-max suppression); is there something available already in DJL for DetectedObjects?
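For reference, the idea of NMS can be sketched in a few lines (a plain numpy sketch with corner-format boxes, not the DJL API): keep the highest-scoring box, drop every remaining box whose IoU with it exceeds a threshold, and repeat.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        # IoU of the best box with all remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + area_rest - inter)
        order = order[1:][iou <= iou_threshold]  # drop overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]: the near-duplicate of the best box is suppressed
```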

@thhart
Contributor Author

thhart commented Nov 4, 2020

Please check the following PR:

#272

Tested successfully with a custom-trained network.
