Keras model serving for inference with dl4j #8298

Closed
guigautier opened this issue Oct 18, 2019 · 15 comments · Fixed by KonduitAI/deeplearning4j#12

@guigautier guigautier commented Oct 18, 2019

  1. I saved a Keras model together with its weights from Python, then loaded it in Java as a DL4J ComputationGraph using

model = KerasModelImport.importKerasModelAndWeights(unet, enforceTrainingConfig=false)

  2. I create my input INDArray with Nd4j.create(floats)
    and run the inference:

INDArray output = model.output(input)

  3. I retrieve the output:

float[][] x = output.reshape(new int[]{floats.length, 256 * 256}).toFloatMatrix();

There is an output, but it looks similar to the input with some different patterns. The result is not correct.

I tested this model: it works well in Python with Keras, and in Java with a frozen model (freezegraph) on TensorFlow.
Has anyone tried to use a trained Keras model with DL4J?

I expect DL4J to reproduce the inference results I get with TensorFlow.

Have you ever encountered this issue? Any hints would be helpful, thanks.
https://stackoverflow.com/questions/58434187/keras-model-serving-for-inference-with-dl4j

@guigautier guigautier commented Oct 18, 2019

@eraly eraly commented Oct 21, 2019

Are you using beta4? There is a bug that was fixed in beta5.

@guigautier guigautier commented Oct 22, 2019

I'm using beta5.

These are my Maven dependencies:

    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-api</artifactId>
        <version>1.0.0-beta5</version>
    </dependency>

    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>1.0.0-beta5</version>
    </dependency>

    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-nn</artifactId>
        <version>1.0.0-beta5</version>
    </dependency>

    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-modelimport</artifactId>
        <version>1.0.0-beta5</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-cuda-10.0</artifactId>
        <version>1.0.0-beta5</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-parallel-wrapper</artifactId>
        <version>1.0.0-beta5</version>
    </dependency>

@eraly eraly commented Oct 22, 2019

Thank you for the update. Will you please upload either your model or the model architecture (json)? Also, what OS and what version of Keras?

@guigautier guigautier commented Oct 23, 2019

I'm using Windows 10 and Keras 2.2.4 with the TensorFlow 1.13.1 backend.
Here is my model:
unet.zip

Thank you.

@AlexDBlack AlexDBlack commented Oct 23, 2019

Here's my attempt to reproduce this from the provided JSON configuration.
I basically overfit the provided image, then compared results in DL4J vs. Keras.

I am seeing a difference...
Absolute difference over all output pixels:

Min diff: 1.1920928955078125E-7
Max diff: 0.4378690719604492
Avg diff: 0.027073459699749947
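
For anyone wanting to run the same kind of check on their own model, here is a minimal sketch of how min/max/avg absolute difference could be computed over two flattened output arrays. DiffStats is a hypothetical helper written for illustration; it is not part of DL4J and not the code from the gist.

```java
// Hypothetical helper (not a DL4J API): summarizes element-wise absolute differences
// between a reference output (e.g. from Keras) and a candidate output (e.g. from DL4J).
final class DiffStats {
    final double min;
    final double max;
    final double avg;

    private DiffStats(double min, double max, double avg) {
        this.min = min;
        this.max = max;
        this.avg = avg;
    }

    static DiffStats of(float[] expected, float[] actual) {
        if (expected.length == 0 || expected.length != actual.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double min = Double.POSITIVE_INFINITY;
        double max = 0.0;
        double sum = 0.0;
        for (int i = 0; i < expected.length; i++) {
            double d = Math.abs((double) expected[i] - (double) actual[i]);
            min = Math.min(min, d);
            max = Math.max(max, d);
            sum += d;
        }
        return new DiffStats(min, max, sum / expected.length);
    }

    public static void main(String[] args) {
        // Made-up values, only to show the call.
        float[] keras = {0.0f, 0.5f, 1.0f};
        float[] dl4j = {0.0f, 0.25f, 0.5f};
        DiffStats s = DiffStats.of(keras, dl4j);
        System.out.println("Min diff: " + s.min);
        System.out.println("Max diff: " + s.max);
        System.out.println("Avg diff: " + s.avg);
    }
}
```

With real outputs, the two arrays would be the flattened per-pixel predictions from each framework for the same input image.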

image

So, not as extreme as your images, but still slightly "washed out" and with a corner artifact...

Code to reproduce:
https://gist.github.com/AlexDBlack/3a4f58edcf243ef4d3faa73ee176eb04

So, I'd say this is confirmed as a bug of some description. Will look into it further and try to isolate it.

@AlexDBlack AlexDBlack commented Oct 23, 2019

So, it's definitely coming from deconv layers. If I use the same input image as the keras model in DL4J, activation differences are around 1e-8 on average for the conv layers... until the first deconv (conv2d_transpose_1)
Code and output: https://gist.github.com/AlexDBlack/abb3bf0a6f4f384863fa467f498447a4
I'm using Keras 2.3.1 btw.
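
The layer-by-layer comparison described above can be sketched roughly as follows. This is an illustration only: FirstDivergence and its methods are hypothetical names, the activations are assumed to have been dumped already as flat float arrays keyed by layer name in forward order, and the tolerance is an arbitrary choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a DL4J API): walks layers in insertion order and reports
// the first one whose mean absolute activation difference exceeds a tolerance.
final class FirstDivergence {

    static double meanAbsDiff(float[] a, float[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs((double) a[i] - (double) b[i]);
        }
        return sum / a.length;
    }

    // Returns the name of the first diverging layer, or null if every layer agrees.
    static String firstDivergentLayer(Map<String, float[]> reference,
                                      Map<String, float[]> candidate,
                                      double tolerance) {
        for (Map.Entry<String, float[]> e : reference.entrySet()) {
            float[] other = candidate.get(e.getKey());
            if (other == null) {
                throw new IllegalArgumentException("missing layer: " + e.getKey());
            }
            if (meanAbsDiff(e.getValue(), other) > tolerance) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up activations; LinkedHashMap keeps the forward-pass order.
        Map<String, float[]> keras = new LinkedHashMap<>();
        Map<String, float[]> dl4j = new LinkedHashMap<>();
        keras.put("conv2d_1", new float[]{0.1f, 0.2f});
        dl4j.put("conv2d_1", new float[]{0.1f, 0.2f});
        keras.put("conv2d_transpose_1", new float[]{0.5f, 0.5f});
        dl4j.put("conv2d_transpose_1", new float[]{0.9f, 0.1f});
        System.out.println(firstDivergentLayer(keras, dl4j, 1e-6));
    }
}
```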

Note difference is a little bigger with NativeImageLoader, not sure why (probably just slightly different resize and/or grayscale conversion algorithms or something).

Will look at this more tomorrow, unless @eraly debugs this first.

@AlexDBlack AlexDBlack commented Oct 23, 2019

So, a few things I've noticed here

First: this padding calculation is wrong, but it's unused anyway, and will be calculated/replaced in the C++ op instead:
https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/convolution/Deconvolution2DLayer.java#L202
(we end up with "padding" calculated as [24, 24] in this case)

Second: bias add appears to be hardcoded to NCHW format:
https://github.com/eclipse/deeplearning4j/blob/master/libnd4j/include/ops/declarable/generic/nn/convo/deconv2d.cpp#L84

I believe this should be helpers::addBias(block, *output, *bias, *output, isNCHW);

@eraly eraly commented Oct 24, 2019

The C++ implementation gives incorrect answers. Here is a simple test case. The expected answers are calculated via Keras, and the Keras code is included in the gist. I'm attaching the h5 model from Keras in case someone wants to debug the Java side. For the C++ side you don't need anything other than what is in the gist, since I manually recreate the call to libnd4j.

https://gist.github.com/eraly/49dfdb401b0347d8183027edae462f3e

Keras Model:
de_conv.h5.zip

@eraly eraly commented Oct 24, 2019

For a case with biases not set to zero:
https://gist.github.com/eraly/0824294d453261c58ef0585d70496464

Keras Model:
de_conv.h5.zip

@eraly eraly commented Oct 24, 2019

Also of note: for the case here (i.e. this combination of kernel, stride and input sizes), the output shape in same mode equals the output shape with zero padding in non-same mode. The latter gives the right answer. So it might be an issue only with same mode?
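
To make the shape coincidence concrete, here is a small sketch of the output-length rule that Keras applies to Conv2DTranspose along one spatial dimension, as far as I understand it, assuming no dilation and no explicit output padding. DeconvShape is a hypothetical name used only for this illustration.

```java
// Hypothetical helper: output length of a transposed convolution along one dimension,
// following the Keras-style rule (assumption: no dilation, no explicit output_padding).
final class DeconvShape {

    static int outputLength(int inputLength, int kernel, int stride, boolean sameMode) {
        if (sameMode) {
            // "same": output is the input scaled by the stride.
            return inputLength * stride;
        }
        // "valid": equals (inputLength - 1) * stride + kernel whenever kernel >= stride.
        return inputLength * stride + Math.max(kernel - stride, 0);
    }

    public static void main(String[] args) {
        // kernel == stride: both modes produce the same output length.
        System.out.println(outputLength(8, 2, 2, true));
        System.out.println(outputLength(8, 2, 2, false));
        // kernel > stride: the two modes differ, so same mode has to crop the result.
        System.out.println(outputLength(8, 3, 2, true));
        System.out.println(outputLength(8, 3, 2, false));
    }
}
```

Under this rule the two modes agree whenever the kernel is no larger than the stride, which would explain a case where the shapes match.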

@AlexDBlack AlexDBlack self-assigned this Oct 24, 2019

@AlexDBlack AlexDBlack commented Oct 24, 2019

Update here: it looks like our shape and padding calculation may not be right in C++, in particular the args passed to col2im. I'll go through this and work out what it should be...
See pp. 24-26 here, in case anyone wants the details: https://arxiv.org/pdf/1603.07285.pdf
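
As a rough illustration of what the shape and padding arithmetic has to achieve in same mode, here is a naive 1D transposed convolution that scatters each input into the output and crops according to the TensorFlow-style SAME convention (total padding max(kernel - stride, 0), with the smaller half at the start). This is a sketch under that assumption, with a hypothetical class name; it is not the libnd4j implementation or the eventual fix.

```java
// Hypothetical sketch: naive 1D transposed convolution in "same" mode.
// Assumption: TensorFlow-style SAME padding, i.e. total padding max(kernel - stride, 0),
// split so that the start receives floor(total / 2).
final class Deconv1dSketch {

    static float[] transposeSame(float[] input, float[] kernel, int stride) {
        int k = kernel.length;
        int outLen = input.length * stride;           // same mode: in * stride
        int padBefore = Math.max(k - stride, 0) / 2;  // elements cropped from the start
        float[] out = new float[outLen];
        for (int i = 0; i < input.length; i++) {
            for (int j = 0; j < k; j++) {
                // Position in the uncropped result is i * stride + j; shift by the crop.
                int pos = i * stride + j - padBefore;
                if (pos >= 0 && pos < outLen) {
                    out[pos] += input[i] * kernel[j];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[] y = transposeSame(new float[]{1f, 2f}, new float[]{1f, 1f, 1f}, 2);
        System.out.println(java.util.Arrays.toString(y));
    }
}
```

The scatter-add loop plays the role that col2im plays in an im2col-based implementation, so a wrong crop offset or output length at that step shifts or truncates the result.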

@raver119 raver119 commented Oct 24, 2019

addBias being hardcoded to NCHW is fine, since the temporary output array is always in NCHW format.
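
For readers less familiar with the two layouts, here is a small sketch of why a bias add needs to know the layout of the buffer it is given: the channel index of a flat element is computed differently in NCHW and NHWC. BiasLayout is a hypothetical illustration and has no relation to the actual helpers::addBias code.

```java
// Hypothetical sketch: per-channel bias add over a flat buffer in either layout.
final class BiasLayout {

    static void addBias(float[] data, int n, int c, int h, int w, float[] bias, boolean nchw) {
        if (data.length != n * c * h * w || bias.length != c) {
            throw new IllegalArgumentException("shape mismatch");
        }
        for (int idx = 0; idx < data.length; idx++) {
            // NCHW: channel varies every h*w elements. NHWC: channel is the fastest axis.
            int channel = nchw ? (idx / (h * w)) % c : idx % c;
            data[idx] += bias[channel];
        }
    }

    public static void main(String[] args) {
        float[] bias = {10f, 20f};
        float[] asNchw = new float[4];
        float[] asNhwc = new float[4];
        addBias(asNchw, 1, 2, 1, 2, bias, true);
        addBias(asNhwc, 1, 2, 1, 2, bias, false);
        System.out.println(java.util.Arrays.toString(asNchw));
        System.out.println(java.util.Arrays.toString(asNhwc));
    }
}
```

If the buffer is always NCHW at the point of the bias add, as stated above, then a fixed NCHW indexing is correct regardless of the layout the user requested.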

@AlexDBlack AlexDBlack commented Oct 25, 2019

Fixed here: KonduitAI#12
I also added a whole lot of test cases pulled from Keras here: KonduitAI/dl4j-test-resources#2
2 test cases are still left there (valid mode, where the output size differs; this is a separate issue from the same-mode problem here).

I should be able to merge that tomorrow, then push that back to Eclipse. New snapshots with the fix will likely be up some time next week.

@AlexDBlack AlexDBlack commented Oct 26, 2019

Thanks for reporting this. The fix has been merged here: KonduitAI#12
This will be merged back to the Eclipse repo shortly, and will be available on snapshots some time next week.

image

Keras vs. DL4J layer activations comparison from earlier:
https://gist.github.com/AlexDBlack/fc25ead7c88f207fa7ca9f43154014c7
