Outputs differ with cuDNN enabled. #5836

Closed
ClnSchlssr opened this Issue Jul 7, 2018 · 19 comments

@ClnSchlssr

ClnSchlssr commented Jul 7, 2018

I've been trying to get some code that was originally tested on a GPU-less machine to run on one that has an Nvidia GTX 1080ti with CUDA 9.0 and cuDNN 7.1 installed. While it appears to run fine, I'm getting different output when cuDNN is enabled (I'm enabling/disabling it by removing the dependency from the pom file).
To check the output I'm comparing against the same image run through the same model loaded using the Keras library in R. The CPU/CUDA-only setup returns approximately the same output as the model loaded in R, but with cuDNN enabled the outputs are off (generally they are larger).
Additionally (perhaps related, perhaps not), whether cuDNN is enabled or not, I'm not seeing a big difference in speed when running restoredModel.outputSingle(...). The Resnet model classifies one example very quickly, and the Xception model takes on the order of 20 seconds either way (I can add more precise timings if that's valuable). This might suggest that cuDNN isn't being loaded properly at all, but no errors are logged. Is there a good way to check that cuDNN has been loaded correctly and to confirm which version is being used?
Any other thoughts as to why this might be happening?
I've attached a small test program and the pom file, along with outputs when the

    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-cuda-9.0</artifactId>
        <version>${dl4j.version}</version>
    </dependency>

dependency in the POM is commented out. I haven't looked in detail at where in the networks the outputs diverge, but I could start doing that if no better clues come from this information. I'm not currently building from source, so I haven't tried CUDA 9.2 yet, but I could do that if it seems like a good starting point.

System Information:
DL4J: 1.0.0-SNAPSHOT
Operating System: Win10
CUDA Version: 9.0
cuDNN Version: 7.1
GPU: Nvidia GTX 1080ti
GPU Driver: 398.36

Keras Resnet Output: 0.3022
Keras Xception Output: 0.00292

import-cuda-test.txt
pom_cuDNN_disabled.txt
resnet-cuDNN_disabled.txt
xception-cuDNN_enabled.txt
xception-cuDNN_disabled.txt
resnet-cuDNN_enabled.txt

@maxpumperla

Contributor

maxpumperla commented Jul 7, 2018

Given that this model works fine on CPU and CUDA (thanks @ClnSchlssr for working with me on this), I don't consider this an import issue. @AlexDBlack have we seen issues like this before? Any ideas?

@AlexDBlack

Member

AlexDBlack commented Jul 8, 2018

We haven't seen any recent issues like this. But maybe check snapshots just in case: https://deeplearning4j.org/snapshots

The only other idea I have is that cuDNN is non-deterministic, but the scale of the differences is larger than I would expect from that - #5089

I'll need to look into this in detail and work out what's up. But I agree that this probably isn't related to import.

@AlexDBlack AlexDBlack self-assigned this Jul 17, 2018

@AlexDBlack

Member

AlexDBlack commented Jul 17, 2018

I can confirm that I've been able to reproduce this, and it's still present on current master/snapshots.
Initial investigation suggests there are slight differences in the CuDNN batch norm forward pass. It's not clear why yet, especially given we've got other unit tests that validate that our and CuDNN's batch norm implementations match...

@AlexDBlack

Member

AlexDBlack commented Jul 17, 2018

So the first issue here is that we aren't adding the epsilon value (0.001) during inference.
That has been fixed here: #5917

There are, however, some other slight differences between our implementation and CuDNN's in later layers of the net that I'm still looking into.
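
For anyone following along, here is a minimal standalone sketch (plain Java, not DL4J internals; the numbers are made up) of why the missing epsilon term matters: batch norm at inference divides by sqrt(var + eps), so dropping the 0.001 term changes the scale factor, and noticeably so when the stored variance is small.

// Standalone illustration only - not DL4J code. Batch norm inference output
// with and without the epsilon term, using made-up statistics.
public class BatchNormEpsilonDemo {

    // y = gamma * (x - mean) / sqrt(var + eps) + beta
    static double bnInference(double x, double mean, double var,
                              double gamma, double beta, double eps) {
        return gamma * (x - mean) / Math.sqrt(var + eps) + beta;
    }

    public static void main(String[] args) {
        double x = 0.8, mean = 0.5, var = 0.01, gamma = 1.0, beta = 0.0;
        double withEps    = bnInference(x, mean, var, gamma, beta, 0.001); // post-fix behaviour
        double withoutEps = bnInference(x, mean, var, gamma, beta, 0.0);   // pre-fix behaviour
        // sqrt(0.011) ~= 0.1049 vs sqrt(0.010) = 0.1000, so the outputs differ by a few percent
        System.out.printf("with eps: %.5f, without eps: %.5f%n", withEps, withoutEps);
    }
}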

@AlexDBlack

Member

AlexDBlack commented Jul 20, 2018

OK, finally got back to this. Full fix has been implemented here: #5917

The second issue was an incorrect forward pass under the following limited circumstances:
(a) max pooling layer
(b) same convolution mode
(c) all values in convolution window less than 0

Basically, CuDNN doesn't support SAME padding mode (i.e., asymmetric padding). Consequently, we (like TF etc.) pad manually to handle this. But CuDNN doesn't know the zeros are padding values - so if all 'real' inputs are less than 0, the maximum is the padding value (0). We now use -infinity for the max pooling padding instead, so the padding values will always be excluded.
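
To make the failure mode concrete, here's a small standalone sketch (plain Java, not the actual DL4J/CuDNN code path) of a single SAME-mode pooling window whose real values are all negative: with zero padding the padding cell wins the max, with -infinity padding it doesn't.

// Standalone illustration only. A 2x2 max pooling window at the image edge:
// three real (all-negative) activations plus one manually-added padding cell.
public class MaxPoolPaddingDemo {

    static double max(double... window) {
        double m = Double.NEGATIVE_INFINITY;
        for (double v : window) m = Math.max(m, v);
        return m;
    }

    public static void main(String[] args) {
        double a = -0.7, b = -0.3, c = -0.9; // real inputs, all < 0

        double zeroPadded = max(a, b, c, 0.0);                       // old behaviour
        double infPadded  = max(a, b, c, Double.NEGATIVE_INFINITY);  // fixed behaviour

        System.out.println("zero padding      -> " + zeroPadded); // 0.0  (padding leaks in)
        System.out.println("-infinity padding -> " + infPadded);  // -0.3 (correct maximum)
    }
}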

@AlexDBlack AlexDBlack closed this Jul 20, 2018

@ClnSchlssr

ClnSchlssr commented Jul 23, 2018

Really appreciate you tracking that down, Alex. I've been away recently, but I'm looking forward to running it on my end. Also thanks for the explanation - interesting bug.

@ClnSchlssr

ClnSchlssr commented Jul 24, 2018

The fix definitely works. While the Resnet model now runs fast, curiously the Xception model is still slower with CUDA (with or without CuDNN) than with CPU alone. @AlexDBlack, is this a known issue, or are there other settings that might lead to this? If not, perhaps I should investigate in more detail and open a separate issue.

@agibsonccc

Member

agibsonccc commented Jul 25, 2018

@ClnSchlssr some profiling would help here. Fast/slow is always relative to something.

@AlexDBlack

Member

AlexDBlack commented Jul 25, 2018

@ClnSchlssr Yeah, that's not expected.
I haven't profiled Xception specifically, but the other nets I have profiled (vgg16, resnet50, etc.) have all been fast.
If you can't work it out (or suspect there's a bottleneck or something), open a new issue with details and your benchmark code and we'll take a look.

@ClnSchlssr

ClnSchlssr commented Jul 25, 2018

Here is a quick run of the models with their outputs and timings; the code is attached. Note that I'm changing the platform by switching/removing the dependencies from the pom file (is there a better way to do that?). Two things are strange: the outputs for a given channel order differ across platforms (the CuDNN results look swapped relative to CPU/CUDA), and the Xception model is very slow compared to Resnet on CUDA/CuDNN.

Using SNAPSHOTS with CUDA 9.2, but it looks the same with 9.0.

Platform  Model     RGB Output / Time (ms)                  BGR Output / Time (ms)
CPU       Resnet    0.3022676706314087 / 551.458241         0.9479966163635254 / 308.752494
CPU       Xception  0.012912283651530743 / 603.140759       0.002920493483543396 / 574.26103
CUDA      Resnet    0.3022680878639221 / 541.13487          0.9479965567588806 / 326.245233
CUDA      Xception  0.01291227899491787 / 32549.983197      0.00292049627751112 / 32530.766062
CuDNN     Resnet    0.9479965567588806 / 279.979155         0.3022678792476654 / 154.730324
CuDNN     Xception  0.0029204979073256254 / 33250.771669    0.012912283651530743 / 33030.097111

Is this worth two separate new issues? LMK and I can do that and add a bit more context.

keras-import_platform-benchmarks.txt

@AlexDBlack

Member

AlexDBlack commented Jul 25, 2018

Thanks for the results and code, that's quite helpful. I'll have a look in the profiler and see what's going on here.

Note I'm changing the platform by switching/removing the dependencies from the pom file

Maven profiles are another approach (but require rebuilding).
Alternatively, include both the native and CUDA backends, and set the "BACKEND_PRIORITY_CPU" and "BACKEND_PRIORITY_GPU" environment variables to determine which one gets loaded.
The CPU value needs to be bigger than the GPU value for the CPU backend to be loaded in preference.

String s = System.getenv("BACKEND_PRIORITY_CPU");
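
Building on that one-liner, here's a small hedged sketch of checking both variables at runtime and applying the "larger value wins" rule described above (the fallback value of 0 is arbitrary, not ND4J's actual default):

// Sketch only: report which backend would be preferred under the rule that the
// larger priority value wins. The fallback of 0 is arbitrary, not ND4J's real default.
public class BackendPriorityCheck {

    public static void main(String[] args) {
        int cpu = parseOr(System.getenv("BACKEND_PRIORITY_CPU"), 0);
        int gpu = parseOr(System.getenv("BACKEND_PRIORITY_GPU"), 0);
        System.out.println("BACKEND_PRIORITY_CPU=" + cpu + ", BACKEND_PRIORITY_GPU=" + gpu);
        System.out.println("Preferred backend: " + (cpu > gpu ? "CPU (nd4j-native)" : "GPU (nd4j-cuda)"));
    }

    private static int parseOr(String s, int fallback) {
        try {
            return s == null ? fallback : Integer.parseInt(s.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}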

@AlexDBlack

Member

AlexDBlack commented Jul 26, 2018

@ClnSchlssr OK, looking at these benchmarks in detail: there are some issues with them - no warmup and only a single iteration. I'll direct you to this for details: https://deeplearning4j.org/benchmark
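
For illustration, a rough sketch of the kind of loop that page recommends - warmup iterations discarded before computing mean/stdev; runInference() is a hypothetical placeholder for the model call being timed (e.g. outputSingle):

import java.util.ArrayList;
import java.util.List;

// Sketch only: time many iterations and exclude the warmup runs from the stats.
public class WarmupBenchmarkSketch {

    static void runInference() {
        // hypothetical placeholder for the forward pass being benchmarked
    }

    public static void main(String[] args) {
        int warmup = 10, measured = 100;
        List<Double> timesMs = new ArrayList<>();
        for (int i = 0; i < warmup + measured; i++) {
            long start = System.nanoTime();
            runInference();
            double ms = (System.nanoTime() - start) / 1_000_000.0;
            if (i >= warmup) timesMs.add(ms); // discard warmup iterations
        }
        double mean = timesMs.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double variance = timesMs.stream().mapToDouble(t -> (t - mean) * (t - mean)).average().orElse(0);
        System.out.printf("mean: %.3f ms, stdev: %.3f ms (n=%d)%n",
                mean, Math.sqrt(variance), timesMs.size());
    }
}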

As for the models themselves - I don't have the exact files you are using (do you have a link?) but I'll be using the xception and resnet models here to benchmark this:
https://github.com/deeplearning4j/dl4j-test-resources/tree/master/src/main/resources/modelimport/keras/examples/xception
https://github.com/deeplearning4j/dl4j-test-resources/tree/master/src/main/resources/modelimport/keras/examples/resnet

@AlexDBlack

Member

AlexDBlack commented Jul 26, 2018

Here's an improved benchmark with multiple iterations and warmup (i.e., excluding the first 10 runs from the mean/stdev):
Resnet50 results look fine to me.
(Windows 10, Titan X, CUDA 9.1 + CuDNN)
https://gist.github.com/AlexDBlack/7f8ed83fc3660f545abc67983997e894

RESNET
Runtimes, RGB: [197.663918, 129.812707, 49.692795, 29.732299, 47.941929, 27.802181, 29.0508, 47.923833, 27.107707, 47.134099, 25.603355, 47.179168, 26.663844, 26.845486, 27.704531, 25.584576, 25.419322, 24.853227, 46.803251, 24.418924, 45.217979, 25.428882, 44.056084, 29.394623, 44.923664, 24.07476, 45.016192, 24.814645, 45.69974, 25.982686, 27.53757, 24.557546, 25.09223, 24.820108, 22.74249, 21.951731, 43.842347, 22.216001, 44.545015, 22.021384, 43.297078, 22.239218, 21.987924, 22.192783, 22.145666, 22.2778, 22.459783, 22.29214, 21.982802, 43.812983, 21.824718, 44.174902, 21.997825, 42.524416, 22.382961, 21.795014, 23.935797, 23.911556, 23.769519, 23.89107, 22.260728, 22.887257, 22.034359, 22.336185, 22.059625, 21.67961, 43.670265, 21.932953, 43.772694, 21.680976, 21.856472, 22.030944, 21.625663, 21.732874, 21.387002, 21.492504, 21.276036, 43.431604, 21.528355, 43.030419, 21.594935, 21.942855, 21.875592, 21.77248, 21.668001, 21.394855, 42.436669, 21.414317, 43.197722, 21.38871, 38.007263, 21.486701, 21.594593, 21.583667, 21.469287, 23.08324, 22.001922, 44.183438, 21.833937, 44.47468]
Runtimes, BGR: [167.192416, 56.609198, 31.562379, 30.143384, 27.29208, 64.79334, 28.485046, 27.771793, 26.762859, 26.21247, 25.655935, 26.541611, 48.052895, 26.369871, 45.233343, 25.564431, 46.336171, 24.550376, 26.701743, 24.031398, 26.530344, 25.11852, 27.05137, 29.005731, 24.433606, 24.160119, 24.297374, 24.385805, 25.739245, 43.628952, 25.918838, 44.736559, 24.424046, 43.328491, 22.376132, 22.298627, 22.177077, 22.591577, 22.33038, 21.954463, 22.322527, 22.146348, 43.439115, 21.969145, 43.805814, 22.485391, 42.736105, 22.243998, 22.065087, 21.897103, 23.028269, 22.08011, 21.673805, 22.315016, 21.900175, 44.430294, 23.700891, 45.05136, 23.473497, 43.769621, 23.192498, 43.313467, 22.675228, 43.373559, 22.204734, 21.530403, 22.20644, 21.955487, 21.75814, 22.209513, 44.19812, 21.910077, 42.346189, 21.409878, 43.847468, 21.383929, 21.335787, 21.700779, 21.593569, 21.680976, 21.486701, 44.798017, 21.918271, 44.334009, 22.053137, 21.381197, 21.558743, 21.484651, 21.476458, 21.321105, 25.941032, 21.418755, 43.751867, 21.590838, 42.085676, 21.891981, 21.868764, 21.995777, 22.128594, 22.181516]
Mean RGB: 28.1335391998291 (stdev 9.24053955078125)
Mean BGR: 27.93946647644043 (stdev 9.136126518249512)

Looking into xception results now, first guess is probably a CPU-only op being used there.

@AlexDBlack

Member

AlexDBlack commented Jul 26, 2018

Yeah, the cause here is a CPU-only op - sconv2d:

// sconv2d (separable conv2d) currently has only a CPU implementation, so
// Xception's separable convolution layers run on the CPU even with the CUDA backend:
CustomOp op = DynamicCustomOp.builder("sconv2d")
        .addInputs(opInputs)            // input activations and weight arrays
        .addIntegerArguments(args)      // convolution configuration (kernel, stride, padding, ...)
        .addOutputs(output)
        .callInplace(false)
        .build();
Nd4j.getExecutioner().exec(op);

Not much we can do about that right now. We've added hundreds of new ops (CPU-only) as part of SameDiff and TF import in the last few months; this is one of them. The next step is to adapt or create CUDA versions of all of these ops. We're starting on that now, but it'll take time.

I consider this issue resolved for now. The CPU-only op is a known issue.

@maxpumperla

Contributor

maxpumperla commented Jul 26, 2018

@ClnSchlssr About the other issue you mentioned, I'd need more details. It seems a little strange that the results for CuDNN and CUDA are just swapped. Did you feed RGB and BGR input to the same model? That won't work. If you go back to Keras and specify data_format correctly, this should be reflected upon import and produce consistent results. Please provide me with the four models (2x Resnet, 2x Xception) you're using here and indicate which results in the list above are expected.

@maxpumperla

Contributor

maxpumperla commented Jul 26, 2018

Maybe open another issue for that.

@ClnSchlssr

ClnSchlssr commented Jul 27, 2018

@AlexDBlack Thanks for that link. It definitely would have been helpful to read that before posting, but thanks for going ahead and running more rigorous benchmarks. The CPU-only op is totally understandable. It would be nice to understand this library well enough to dig up that answer myself.

@maxpumperla I've been flipping the channels and feeding the same image with different channel orders into the networks because, when I first tried this, it seemed like the default order wasn't giving me the output I expected. I realize that feeding the image in with the wrong channel ordering "won't work" in the sense that it gives a completely different prediction, but since it's the same data structure it does work in the sense that it's a valid input to the network. I also find it strange that the results are swapped. Is there an operation at the start of the feedforward pass (or some preprocessing) that differs when using CuDNN and might flip the order of the channels? I'll open a new issue and attach the models with a test image.

@AlexDBlack

Member

AlexDBlack commented Jul 27, 2018

Yeah, re: CPU-only ops - unfortunately it's not really documented, and there are some exceptions even for things using DynamicCustomOp - at present it's really hard for most users to know until they run into it and ask... It's a short-term thing though, so we haven't put much effort into docs there.

@lock

lock bot commented Sep 21, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Sep 21, 2018
