
Memory layout in the LSTM operator #15745

Closed
eloi-loomai opened this issue Aug 3, 2019 · 9 comments

Comments

@eloi-loomai

eloi-loomai commented Aug 3, 2019

Description

Suspicious bug in the LSTM RNN operator

Environment info (Required)

----------Python Info----------
Version      : 3.7.2
Compiler     : Clang 4.0.1 (tags/RELEASE_401/final)
Build        : ('default', 'Dec 29 2018 00:00:04')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.0.1
Directory    : /Users/edubois/anaconda3/envs/py36/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /Users/edubois/_DEV/3rdParties/incubator-mxnet/python/mxnet
Commit hash file "/Users/edubois/_DEV/3rdParties/incubator-mxnet/python/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/Users/edubois/_DEV/3rdParties/incubator-mxnet/python/mxnet/../../build/libmxnet.dylib']
Build features:
✖ CUDA
✖ CUDNN
✖ NCCL
✖ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✔ CPU_SSE4_2
✖ CPU_SSE4A
✔ CPU_AVX
✖ CPU_AVX2
✖ OPENMP
✖ SSE
✔ F16C
✖ JEMALLOC
✖ BLAS_OPEN
✖ BLAS_ATLAS
✔ BLAS_MKL
✖ BLAS_APPLE
✔ LAPACK
✖ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✖ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform     : Darwin-18.2.0-x86_64-i386-64bit
system       : Darwin
node         : MBP-de-Eloi
release      : 18.2.0
version      : Darwin Kernel Version 18.2.0: Thu Dec 20 20:46:53 PST 2018; root:xnu-4903.241.1~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0223 sec, LOAD: 0.7915 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1524 sec, LOAD: 0.5313 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.2115 sec, LOAD: 0.6373 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0434 sec, LOAD: 0.3803 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0296 sec, LOAD: 0.6332 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0392 sec, LOAD: 0.2061 sec.

Package used (Python/R/Scala/Julia):
C++

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): clang

MXNet commit hash:
24cce9e3c99e499b696b779cbb3b863145f473f1

Build config:

# choice of compiler
#--------------------

ifndef CC
export CC = gcc
endif
ifndef CXX
export CXX = g++
endif
ifndef NVCC
export NVCC = nvcc
endif

# whether compile with options for MXNet developer
DEV = 0

# whether compile with debug
DEBUG = 0

# whether to turn on segfault signal handler to log the stack trace
USE_SIGNAL_HANDLER =

# the additional link flags you want to add
ADD_LDFLAGS =

# the additional compile flags you want to add
ADD_CFLAGS =

# whether to build operators written in TVM
USE_TVM_OP = 0

#---------------------------------------------
# matrix computation libraries for CPU/GPU
#---------------------------------------------

# whether use CUDA during compile
USE_CUDA = 0

# add the path to CUDA library to link and compile flag
# if you have already add them to environment variable, leave it as NONE
# USE_CUDA_PATH = /usr/local/cuda
USE_CUDA_PATH = NONE

# whether to enable CUDA runtime compilation
ENABLE_CUDA_RTC = 1

# whether use CuDNN R3 library
USE_CUDNN = 0

# whether to use NVTX when profiling
USE_NVTX = 0

#whether to use NCCL library
USE_NCCL = 0
#add the path to NCCL library
USE_NCCL_PATH = NONE

# whether use opencv during compilation
# you can disable it, however, you will not able to use
# imbin iterator
USE_OPENCV = 1
# Add OpenCV include path, in which the directory `opencv2` exists
USE_OPENCV_INC_PATH = NONE
# Add OpenCV shared library path, in which the shared library exists
USE_OPENCV_LIB_PATH = NONE

#whether use libjpeg-turbo for image decode without OpenCV wrapper
USE_LIBJPEG_TURBO = 0
#add the path to libjpeg-turbo library
USE_LIBJPEG_TURBO_PATH = NONE

# use openmp for parallelization
USE_OPENMP = 1

# whether use MKL-DNN library: 0 = disabled, 1 = enabled
# if USE_MKLDNN is not defined, MKL-DNN will be enabled by default on x86 Linux.
# you can disable it explicity with USE_MKLDNN = 0
USE_MKLDNN =

# whether use NNPACK library
USE_NNPACK = 0

# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S), Darwin)
USE_BLAS = apple
else
USE_BLAS = atlas
endif

# whether use lapack during compilation
# only effective when compiled with blas versions openblas/apple/atlas/mkl
USE_LAPACK = 1

# path to lapack library in case of a non-standard installation
USE_LAPACK_PATH =

# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
USE_INTEL_PATH = NONE

# If use MKL only for BLAS, choose static link automatically to allow python wrapper
ifeq ($(USE_BLAS), mkl)
USE_STATIC_MKL = 1
else
USE_STATIC_MKL = NONE
endif

#----------------------------
# Settings for power and arm arch
#----------------------------
ARCH := $(shell uname -a)
ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
	USE_SSE=0
	USE_F16C=0
else
	USE_SSE=1
endif

#----------------------------
# F16C instruction support for faster arithmetic of fp16 on CPU
#----------------------------
# For distributed training with fp16, this helps even if training on GPUs
# If left empty, checks CPU support and turns it on.
# For cross compilation, please check support for F16C on target device and turn off if necessary.
USE_F16C =

#----------------------------
# distributed computing
#----------------------------

# whether or not to enable multi-machine supporting
USE_DIST_KVSTORE = 0

# whether or not allow to read and write HDFS directly. If yes, then hadoop is
# required
USE_HDFS = 0

# path to libjvm.so. required if USE_HDFS=1
LIBJVM=$(JAVA_HOME)/jre/lib/amd64/server

# whether or not allow to read and write AWS S3 directly. If yes, then
# libcurl4-openssl-dev is required, it can be installed on Ubuntu by
# sudo apt-get install -y libcurl4-openssl-dev
USE_S3 = 0

#----------------------------
# performance settings
#----------------------------
# Use operator tuning
USE_OPERATOR_TUNING = 1

# Use gperftools if found
# Disable because of #8968
USE_GPERFTOOLS = 0

# path to gperftools (tcmalloc) library in case of a non-standard installation
USE_GPERFTOOLS_PATH =

# Link gperftools statically
USE_GPERFTOOLS_STATIC =

# Use JEMalloc if found, and not using gperftools
USE_JEMALLOC = 1

# path to jemalloc library in case of a non-standard installation
USE_JEMALLOC_PATH =

# Link jemalloc statically
USE_JEMALLOC_STATIC =

#----------------------------
# additional operators
#----------------------------

# path to folders containing projects specific operators that you don't want to put in src/operators
EXTRA_OPERATORS =

#----------------------------
# other features
#----------------------------

# Create C++ interface package
USE_CPP_PACKAGE = 0

# Use int64_t type to represent the total number of elements in a tensor
# This will cause performance degradation reported in issue #14496
# Set to 1 for large tensor with tensor size greater than INT32_MAX i.e. 2147483647
# Note: the size of each dimension is still bounded by INT32_MAX
USE_INT64_TENSOR_SIZE = 0

# Python executable. Needed for cython target
PYTHON = python

#----------------------------
# plugins
#----------------------------

# whether to use caffe integration. This requires installing caffe.
# You also need to add CAFFE_PATH/build/lib to your LD_LIBRARY_PATH
# CAFFE_PATH = $(HOME)/caffe
# MXNET_PLUGINS += plugin/caffe/caffe.mk

# WARPCTC_PATH = $(HOME)/warp-ctc
# MXNET_PLUGINS += plugin/warpctc/warpctc.mk

# whether to use sframe integration. This requires build sframe
# git@github.com:dato-code/SFrame.git
# SFRAME_PATH = $(HOME)/SFrame
# MXNET_PLUGINS += plugin/sframe/plugin.mk

Error Message:

This line looks wrong
https://github.com/apache/incubator-mxnet/blob/24cce9e3c99e499b696b779cbb3b863145f473f1/src/operator/rnn.cc#L320
DType* bias_n = weight_iter_n + L * H * ngates * H;
Shouldn't it be:
DType* bias_n = weight_iter_n + L * ngates * H;

Just trying to understand the memory order of the weights.

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature, Bug

@pengzhao-intel
Contributor

Could you take a look at the CPU implementation as well?

@zixuanweeei can help you understand it from the formula.

@eloi-loomai
Author

@zixuanweeei
Contributor

@eloi-loomai The memory layout of weights is:

     L * H * ngates * H     L * H * ngates * H     L * nbias * H
  +----------------------+----------------------+-----------------+
  ^                      ^                      ^                 ^
  workptr            weight_iter_n            bias_n            others
  (weight_layer_n)

So DType* bias_n = weight_iter_n + L * H * ngates * H; is correct. Note also that the LSTM formulation in MXNet differs from that of MKL-DNN: MXNet uses two bias terms per gate in its RNN variants, while MKL-DNN uses only a single bias, except for the bias on the current memory content in LBR-GRU.
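
For reference, here is a minimal C++ sketch of the pointer arithmetic this layout implies. The struct, function, and parameter names below are hypothetical, not the actual rnn.cc code: the point is only that the flat weight buffer is laid out as [weight_layer | weight_iter | bias], so each section starts where the previous one ends.

#include <cstddef>

// Hypothetical views into the flat LSTM weight buffer described above.
template <typename DType>
struct LSTMWeightViews {
  DType* weight_layer;  // input-to-hidden weights, L * H * ngates * H elements
  DType* weight_iter;   // hidden-to-hidden weights, L * H * ngates * H elements
  DType* bias;          // biases, L * nbias * H elements (nbias = 2 * ngates in MXNet)
};

template <typename DType>
LSTMWeightViews<DType> SliceLSTMWeights(DType* workptr,
                                        std::size_t L,        // number of layers
                                        std::size_t H,        // hidden size
                                        std::size_t ngates) { // 4 for LSTM
  LSTMWeightViews<DType> v;
  v.weight_layer = workptr;                              // weight_layer_n
  v.weight_iter  = v.weight_layer + L * H * ngates * H;  // skip the weight_layer block
  v.bias         = v.weight_iter  + L * H * ngates * H;  // skip the weight_iter block
  return v;
}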

@eloi-loomai
Author

Ok, thanks!

@eloi-loomai
Author

I also realized that the order of gates in the planes is:
I, F, G, O
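
As a rough illustration of that ordering (the enum and helper below are hypothetical, not taken from rnn.cc), indexing a single gate's H x H block within a layer could look like this, assuming a contiguous [layer][gate][H x H] layout consistent with the diagram above:

#include <cstddef>

// Gate order within each plane: I (input), F (forget), G (cell candidate), O (output).
enum LSTMGate { kInput = 0, kForget = 1, kCellCandidate = 2, kOutput = 3 };

// Element offset of gate `g` in layer `l` inside the hidden-to-hidden weight section.
inline std::size_t GateOffset(std::size_t l, LSTMGate g,
                              std::size_t H, std::size_t ngates = 4) {
  return (l * ngates + static_cast<std::size_t>(g)) * H * H;
}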

@zixuanweeei
Contributor

Is there any problem with the order? MXNet's native LSTM implementation shares the same gate order as MKL-DNN, but differs in the number of biases. Their GRU implementations, however, use different gate orders, which is worth keeping in mind.

@eloi-loomai
Author

eloi-loomai commented Aug 6, 2019

Not sure, LSTM seems to work.

@zixuanweeei
Contributor

Feel free to mention me directly here if you have any questions 😃.

BTW, we are working on integrating MKL-DNN's LBR-GRU into MXNet. It should be completed in the coming days.
