Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

v1.7.x head commit with MKL failure #19150

Closed
haohuanw opened this issue Sep 15, 2020 · 12 comments
Closed

v1.7.x head commit with MKL failure #19150

haohuanw opened this issue Sep 15, 2020 · 12 comments

Comments

@haohuanw
Copy link
Contributor

Description

when running with built mxnet-1.7 with head of v1.7.x; if I create a model with even number of channels in batch norm, I got the is_view check failure when using mkldnn.

Error Message

dev-dsk-haohuw-2c-3d588fe6 % MKLDNN_VERBOSE=1 bte python minimum_repro.py          
input channel of 45
dnnl_verbose,info,oneDNN v1.6.0 (commit N/A)
dnnl_verbose,info,cpu,runtime:sequential
dnnl_verbose,info,cpu,isa:Intel AVX2
dnnl_verbose,info,gpu,runtime:none
dnnl_verbose,exec,cpu,convolution,gemm:jit,forward_inference,src_f32::blocked:abcde:f0 wei_f32::blocked:abcde:f0 bia_undef::undef::f0 dst_f32::blocked:abcde:f0,,alg:convolution_direct,mb1_ic3oc45_id8od8kd1sd1dd0pd0_ih160oh80kh7sh2dh0ph3_iw160ow80kw7sw2dw0pw3,46.7029
dnnl_verbose,exec,cpu,batch_normalization,ncsp_bnorm:any,forward_inference,data_f32::blocked:abcd:f0 diff_undef::undef::f0,,flags:GS,mb1ic45ih1iw51200,8.98315
input channel of 45, (1, 45, 8, 80, 80)
input channel of 64
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcde:f0 dst_f32::blocked:Acdeb8a:f0,,,64x3x1x7x7,0.0170898
dnnl_verbose,exec,cpu,convolution,jit:avx2,forward_inference,src_f32::blocked:abcde:f0 wei_f32::blocked:Acdeb8a:f0 bia_undef::undef::f0 dst_f32::blocked:aBcde8b:f0,,alg:convolution_direct,mb1_ic3oc64_id8od8kd1sd1dd0pd0_ih160oh80kh7sh2dh0ph3_iw160ow80kw7sw2dw0pw3,17.5981
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcde:f0 dst_f32::blocked:Acdeb8a:f0,,,64x3x1x7x7,0.0258789
Traceback (most recent call last):
  File "minimum_repro.py", line 53, in <module>
    out = net(input_data).asnumpy()
  File "/home/haohuw/vision-only-workspace/c2h_mainline/env/IhmPoseStateLib-1.0/test-runtime/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 2566, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/haohuw/vision-only-workspace/c2h_mainline/env/IhmPoseStateLib-1.0/test-runtime/lib/python3.6/site-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "/home/haohuw/vision-only-workspace/c2h_mainline/src/IhmMXNet/build/private/src/src/ndarray/ndarray.cc", line 650
MXNetError: Check failed: !is_view:

To Reproduce

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

Steps to reproduce

import logging
import os

from mxnet import init
from mxnet.context import cpu
from mxnet.gluon import nn
from mxnet.gluon.block import HybridBlock
from mxnet.gluon.nn import BatchNorm

import mxnet as mx

class BuggyModel(HybridBlock):

    def __init__(
        self,
        channels,
        norm_layer=BatchNorm,
        norm_kwargs=None,
        in_channels=3,
        **kwargs
    ):
        super(BuggyModel, self).__init__(**kwargs)
        self.in_channels = in_channels
        with self.name_scope():
            self.conv1 = nn.Conv3D(
                    in_channels=self.in_channels,
                    channels=channels,
                    kernel_size=(1, 7, 7),
                    strides=(1, 2, 2),
                    padding=(0, 3, 3),
                    use_bias=False,
                    )
            self.bn1 = norm_layer(in_channels=channels, **({} if norm_kwargs is None else norm_kwargs))

    def hybrid_forward(self, F, x):
        """Hybrid forward of R2+1D net"""
        x = self.conv1(x)
        x = self.bn1(x)
        return x


print(f"input channel of 45")
net = BuggyModel(channels=45)
net.initialize(init=init.Constant(1))
input_data = mx.nd.zeros((1, 3, 8, 160, 160), ctx=mx.cpu())
out = net(input_data).asnumpy()
print(f"input channel of 45, {out.shape}")

print(f"input channel of 64")
net = BuggyModel(channels=64)
net.initialize(init=init.Constant(1))
input_data = mx.nd.zeros((1, 3, 8, 160, 160), ctx=mx.cpu())
out = net(input_data).asnumpy()
print(f"input channel of 64, {out.shape}")

What have you tried to solve it?

  1. build with 64f737c solves the issue for now

Environment

We recommend using our script for collecting the diagnositc information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python

dev-dsk-haohuw-2c-3d588fe6 % curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python
----------Python Info----------
Version      : 3.6.11
Compiler     : GCC 7.5.0
Build        : ('default', 'Sep 11 2020 22:03:53')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
No corresponding pip install for current python.
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
Platform     : Linux-4.9.217-0.1.ac.205.84.332.metal1.x86_64-x86_64-with-redhat-5.3-Tikanga
system       : Linux
node         : dev-dsk-haohuw-2c-3d588fe6.us-west-2.amazon.com
release      : 4.9.217-0.1.ac.205.84.332.metal1.x86_64
version      : #1 SMP Thu Apr 2 15:19:24 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Stepping:              1
CPU MHz:               2699.945
BogoMIPS:              4600.13
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0018 sec, LOAD: 0.5492 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0429 sec, LOAD: 0.1730 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.34314703941345215 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0204 sec, LOAD: 0.1066 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0234 sec, LOAD: 0.4174 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.010480880737304688 sec.
----------Environment----------

@pengzhao-intel
Copy link
Contributor

@anko-intel

@pengzhao-intel pengzhao-intel added this to To do in MKLDNN improvements via automation Sep 16, 2020
@ciyongch
Copy link
Contributor

Hi @haohuanw Does this failure only happen in v1.7.x branch or the others as well?

@haohuanw
Copy link
Contributor Author

I wasn't got a chance to test other branch; I did couple build using different commit in v1.7.x branch on commit 651e24b and it still fails.

My suspicion is one of:

introduces the issue.

@ciyongch
Copy link
Contributor

Thanks for your inputs, @anko-intel will help to take a look.

@akarbown
Copy link
Contributor

I've got repro and will look into it.

@szha szha removed the needs triage label Sep 18, 2020
akarbown added a commit to akarbown/incubator-mxnet that referenced this issue Oct 6, 2020
Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.
@akarbown
Copy link
Contributor

akarbown commented Oct 9, 2020

The issue seems to be present on the other branches as well.
As far as I've checked it has been caused by the 6a5a946.

szha pushed a commit that referenced this issue Oct 25, 2020
…19299)

* Fix MKLDNN BatchNorm with even number of channels (#19150)

Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.

* Add or updated test to verify Batchnorm odd & even number of channels

* Fix for Batchnorm odd & even chnls number context
akarbown added a commit to akarbown/incubator-mxnet that referenced this issue Oct 26, 2020
Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.
akarbown added a commit to akarbown/incubator-mxnet that referenced this issue Oct 26, 2020
Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.
akarbown added a commit to akarbown/incubator-mxnet that referenced this issue Oct 29, 2020
Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.
akarbown added a commit to akarbown/incubator-mxnet that referenced this issue Oct 29, 2020
Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.
samskalicky pushed a commit that referenced this issue Oct 29, 2020
…19150) #19299 #19425 (#19428)

* Fix MKLDNN BatchNorm with even number of channels (#19150)

Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.

* Add or updated test to verify Batchnorm odd & even number of channels

* Fix for Batchnorm odd & even chnls number context
samskalicky pushed a commit that referenced this issue Oct 29, 2020
…) #19299 #19425 (#19445)

* Fix MKLDNN BatchNorm with even number of channels (#19150)

Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.

* Add or updated test to verify Batchnorm odd & even number of channels

* Fix for Batchnorm odd & even chnls number context
samskalicky pushed a commit that referenced this issue Oct 30, 2020
…19299 (#19425)

* Fix MKLDNN BatchNorm with even number of channels (#19150)

Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.

* Add or updated test to verify Batchnorm odd & even number of channels

* Fix for Batchnorm odd & even chnls number context
@akarbown
Copy link
Contributor

akarbown commented Nov 2, 2020

Hi @haohuanw, could you please verify the fix on your side?

@akarbown
Copy link
Contributor

akarbown commented Nov 9, 2020

@samskalicky, could you please tell me who should I ask for verification of that bug in order to close it?

@samskalicky
Copy link
Contributor

Hi @akarbown I pinged @haohuanw with a reminder. We can close this issue since the fix is in #19299 if we dont hear from him by tomorrow.

@akarbown
Copy link
Contributor

akarbown commented Nov 9, 2020

Hi @akarbown I pinged @haohuanw with a reminder. We can close this issue since the fix is in #19299 if we dont hear from him by tomorrow.

Thank you!

@haohuanw
Copy link
Contributor Author

sorry I don't really have time on my side to test on the fix but I am also not blocked on this change. feel free to resolve it if we have unit tests that can verify change works.

vidyaravipati pushed a commit to vidyaravipati/incubator-mxnet that referenced this issue Nov 11, 2020
) apache#19299 (apache#19425)

* Fix MKLDNN BatchNorm with even number of channels (apache#19150)

Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.

* Add or updated test to verify Batchnorm odd & even number of channels

* Fix for Batchnorm odd & even chnls number context
@samskalicky
Copy link
Contributor

Thanks @haohuanw, ill close this issue for now. we can reopen it if you find further issues.

MKLDNN improvements automation moved this from To do to Done Nov 14, 2020
chinakook pushed a commit to chinakook/mxnet that referenced this issue Nov 17, 2020
) apache#19299 (apache#19425)

* Fix MKLDNN BatchNorm with even number of channels (apache#19150)

Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.

* Add or updated test to verify Batchnorm odd & even number of channels

* Fix for Batchnorm odd & even chnls number context
chinakook pushed a commit to chinakook/mxnet that referenced this issue Nov 19, 2020
) apache#19299 (apache#19425)

* Fix MKLDNN BatchNorm with even number of channels (apache#19150)

Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.

* Add or updated test to verify Batchnorm odd & even number of channels

* Fix for Batchnorm odd & even chnls number context
josephevans pushed a commit to josephevans/mxnet that referenced this issue Dec 8, 2020
…he#19150) apache#19299 apache#19425 (apache#19445)

* Fix MKLDNN BatchNorm with even number of channels (apache#19150)

Even number of channels results in data reordering before batch
norm operation. Therefore, if BatchNorm data array is view of
another array and the data is stored in MKLDNN format, the data
needs to be converted to the default format.

* Add or updated test to verify Batchnorm odd & even number of channels

* Fix for Batchnorm odd & even chnls number context
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
No open projects
Development

No branches or pull requests

6 participants