DL4J: LRN doesn't match MKL-DNN implementation, libnd4j implementation incorrect #7272

Open
AlexDBlack opened this issue Mar 9, 2019 · 2 comments

AlexDBlack commented Mar 9, 2019

While implementing MKL-DNN support for DL4J, I found that the MKL-DNN implementation doesn't produce the same result as DL4J for backprop; the forward pass is fine.
After digging further, I can't rule out that both the DL4J and the libnd4j LRN backprop implementations are incorrect here.
DL4J: for example, setting alpha=1, k=0, beta=1 (about the simplest possible LRN configuration) causes gradient checks to fail when MKL-DNN is not used. Smaller values of alpha (<= 1e-4) pass gradient checks; I suspect that with a small enough alpha, the incorrect component of the gradient calculation is simply too small to trigger a test failure.
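
For anyone wanting to reproduce: a minimal sketch of the failing configuration. Exact builder/utility signatures vary a bit between releases (GradientCheckUtil later gained config-object overloads), so treat this as illustrative rather than a drop-in test:

```java
import org.deeplearning4j.gradientcheck.GradientCheckUtil;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.layers.LocalResponseNormalization;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.buffer.DataType;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.learning.config.NoOp;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class LrnGradCheck {
    public static void main(String[] args) {
        // Gradient checks require double precision
        Nd4j.setDefaultDataTypes(DataType.DOUBLE, DataType.DOUBLE);

        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .seed(12345)
                .updater(new NoOp())    // params must stay fixed during the numerical check
                .list()
                // The "simplest possible" LRN case described above: alpha=1, k=0, beta=1
                .layer(new LocalResponseNormalization.Builder().k(0.0).alpha(1.0).beta(1.0).build())
                .layer(new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                        .activation(Activation.SOFTMAX).nOut(10).build())
                .setInputType(InputType.convolutional(4, 4, 3))
                .build();

        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();

        // Minibatch of 2, NCHW input matching the convolutional input type above
        INDArray input = Nd4j.rand(new int[]{2, 3, 4, 4});
        INDArray labels = Nd4j.zeros(2, 10);
        labels.putScalar(0, 0, 1.0);
        labels.putScalar(1, 1, 1.0);

        boolean ok = GradientCheckUtil.checkGradients(net, 1e-6, 1e-3, 1e-8,
                false, false, input, labels);
        System.out.println("Gradient check passed: " + ok);   // false for alpha=1, k=0, beta=1
    }
}
```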

When MKL-DNN is enabled, I see no evidence that it is actually used in double precision (as required for gradient checks), only in float precision - i.e., there is no output when the MKLDNN_VERBOSE=1 environment variable is set. This suggests the libnd4j helper implementation is being used instead, and it is also incorrect: gradient checks fail when it is used, including cases that DL4J itself passes.
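
For reference, this is how the precision gets switched for that check (a sketch; the Nd4j method naming differs across releases). MKLDNN_VERBOSE=1 has to be set in the environment before the JVM starts:

```java
// Launch with the env variable set, e.g.:  MKLDNN_VERBOSE=1 java -cp ... LrnGradCheck
// mkl-dnn prints one line per primitive execution when verbose mode is enabled.
Nd4j.setDefaultDataTypes(DataType.FLOAT, DataType.FLOAT);   // float: verbose output appears (MKL-DNN used)
Nd4j.setDefaultDataTypes(DataType.DOUBLE, DataType.DOUBLE); // double: no output (libnd4j helper fallback)
```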

http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf Section 3.3
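
For reference, the forward pass being differentiated is (in the paper's notation):

```latex
% LRN forward pass (Krizhevsky et al., Sec. 3.3): b^i_{x,y} is the normalized
% activity at channel i, spatial position (x,y); the sum runs over n adjacent
% channels out of N total. k, alpha, beta, n are the hyperparameters above.
b^i_{x,y} = a^i_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big( a^j_{x,y} \big)^2 \Big)^{\beta}
```

With k=0, alpha=1, beta=1 this reduces to b^i = a^i / sum_j (a^j)^2 over the channel window, which is exactly the failing case above.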

Aha! Link: https://skymindai.aha.io/features/DL4J-5

AlexDBlack added the Bug and DL4J labels Mar 9, 2019
AlexDBlack self-assigned this Mar 9, 2019

raver119 commented Mar 9, 2019

The libnd4j LRN backprop being incorrect is a known issue, and a fix is a work in progress.

raver119 added the C++ label Mar 9, 2019

AlexDBlack (author) commented:

Right, I'm aware; I need to derive the math for that. DL4J also being wrong is the main new thing here.

AlexDBlack added a commit that referenced this issue Mar 12, 2019
* Fix issues with shared context for MKLDNNConvHelper

* Test tweaks

* MKL-DNN subsampling layer helper implementation (not yet passing)

* First validation test + validation utils

* Fixes

* Conv2d MKL-DNN helper

* Batch norm helper + test (fwd only so far, but passing)

* MKLDNN LRN helper

* Fix MKL-DNN conv helper

* Javadoc; disable LRN until fixed

* Properly check for conv/sub helper supported

* MKLDNNConv OpContext

* OpContext changes for batch norm, subsampling

* LRN op context

* Disable LRN helper for now

* Disable LRN test for now - issue #7272

* Batch norm fixes

* Various final (hopefully) fixes - MKL-DNN

* Last fix

raver119 removed the C++ label Nov 10, 2019