Error when executing "example-rnn-regression" juice example on aarch64 target #134

Closed
Etienne-56 opened this issue Mar 31, 2021 · 16 comments · Fixed by #139
Labels: bug (Something doesn't quite look right)

Comments

@Etienne-56

Describe the bug

Hello everyone,

I'm new to GitHub; this is my first post :)
My objective is to use the juice library on an Nvidia Jetson Nano board. To do that I managed to cross-compile the juice repository (after much struggling) using the Rust tool "cross" (https://github.com/rust-embedded/cross) and a Docker container:
Dockerfile.jetson-balena.txt

The Jetson Nano has the following setup: Linux droopi-desktop 4.9.201-tegra #1 SMP PREEMPT Fri Feb 19 08:40:32 PST 2021 aarch64 aarch64 aarch64 GNU/Linux

I built the Docker image with:
[screenshot of the docker build command]
In the juice repo I created a Cross.toml file containing:
[screenshot of the Cross.toml contents]

I started cross compilation with:

[screenshot of the cross build command]

The example "mnist-image-multiclass-classification" works fine on the jetson but the issue is that example "example-rnn-regression" fails at execution and I want to be shure that everything works fine in the library before going further.
The error message is :
example-rnn-regression.log.txt

Any idea why this happens? Is it even possible to have every juice feature working on an arm64 target?

Thank you very much,

Etienne

@Etienne-56 Etienne-56 added the bug Something doesn't quite look right label Mar 31, 2021
@drahnr
Member

drahnr commented Mar 31, 2021

Thanks for reporting the issue! Much appreciated.

Unfortunately the error swallows a bunch of information in coaster-nn/src/frameworks/cuda/mod.rs:
[screenshot of the error-mapping code]
so I have to ask you to insert a println!("cudnn busted: {:?}", e) to show the inner error message, or to add a breakpoint and print it there (using lldb). I will look into improving the error message over the next few days.
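For illustration, here is a self-contained sketch of that pattern; the actual coaster-nn code differs, and CudnnError, PluginError and rnn_forward below are placeholders standing in for the real types:

// Print the inner cuDNN error before it gets mapped into the coarser
// plugin error. All names here are placeholders for illustration only.
#[derive(Debug)]
struct CudnnError(&'static str);

#[derive(Debug)]
struct PluginError(&'static str);

fn rnn_forward() -> Result<(), CudnnError> {
    Err(CudnnError("BadParam: invalid descriptor"))
}

fn main() {
    let result: Result<(), PluginError> = rnn_forward().map_err(|e| {
        println!("cudnn busted: {:?}", e); // surface the otherwise-swallowed error
        PluginError("Unable to perform RNN Forward")
    });
    println!("{:?}", result);
}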

@lissahyacinth I remember vaguely we had a similar issue with the GeForce GTX 1050 GPU in the CI not supporting a certain parameterization of the rnn layer. This might be the case for the jetson nano.

Which one do you have? B01 or A02 - @lissahyacinth do you still have access to one?

Technically, the only limitations are those imposed by cuDNN support for your device.

@drahnr
Member

drahnr commented Mar 31, 2021

It could be #106

@Etienne-56
Author

Thanks for your answer. I'm using a Jetson Nano B01 with CUDA 10.2. I will try to add the prints tomorrow and keep you up to date!

@drahnr
Member

drahnr commented Mar 31, 2021

See #135; that should help with the error printing and alleviate your pain :)

@lissahyacinth
Contributor

Which one do you have? B01 or A02 - @lissahyacinth do you still have access to one?

I've got a B01 hanging around, but I haven't used it in a while. Didn't realise not propagating the error would come back to haunt me!

@Etienne-56
Author

Etienne-56 commented Apr 1, 2021

Hello,

Here is the new backtrace with a better error message:

./example-rnn-regression train --file=SavedRNNNetwork.juice --learningRate=0.01 --batchSize=40
cudnn busted: BadParam("At least one of the following conditions was met: rnnDesc is invalid, hx_desc, w_desc, hy_desc, cy_desc, or one of the x_desc or y_desc is invalid. The descriptors for x_desc, cx_desc, _hx_desc, w_desc, y_desc, hy_desc, cy_desc have incorrect strides/diemnsions. Workspace size is too small. Reserve space size is too small.")
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Plugin(Plugin("Unable to perform RNN Forward"))', /project/juice/src/layers/common/rnn.rs:203:14
stack backtrace:
   0:       0x5594df77b0 - std::backtrace_rs::backtrace::libunwind::trace::h9084dae7a7332a8c
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
   1:       0x5594df77b0 - std::backtrace_rs::backtrace::trace_unsynchronized::h2377cbb17216fe80
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:       0x5594df77b0 - std::sys_common::backtrace::_print_fmt::he896a6d420ef4507
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/sys_common/backtrace.rs:67:5
   3:       0x5594df77b0 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hd4ed931a697aec4d
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/sys_common/backtrace.rs:46:22
   4:       0x5594e11700 - core::fmt::write::hce490866ebb7b066
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/fmt/mod.rs:1096:17
   5:       0x5594df5644 - std::io::Write::write_fmt::hc305428e324737a5
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/io/mod.rs:1568:15
   6:       0x5594df9588 - std::sys_common::backtrace::_print::h699e50fd0c0840c8
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/sys_common/backtrace.rs:49:5
   7:       0x5594df9588 - std::sys_common::backtrace::print::h717bd21ff780b90e
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/sys_common/backtrace.rs:36:9
   8:       0x5594df9588 - std::panicking::default_hook::{{closure}}::h6874e771e1f74ac7
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:208:50
   9:       0x5594df90fc - std::panicking::default_hook::hc618054a7378fa55
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:225:9
  10:       0x5594df9d08 - std::panicking::rust_panic_with_hook::h75d510e06b5ae0d5
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:591:17
  11:       0x5594df9888 - std::panicking::begin_panic_handler::{{closure}}::h85464207e28d186e
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:497:13
  12:       0x5594df7c3c - std::sys_common::backtrace::__rust_end_short_backtrace::h84f042552d2bbf2e
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/sys_common/backtrace.rs:141:18
  13:       0x5594df97f0 - rust_begin_unwind
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
  14:       0x5594e0f830 - core::panicking::panic_fmt::h34a018aff744b57d
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/panicking.rs:92:14
  15:       0x5594e0f6fc - core::option::expect_none_failed::hf07bd559510f7ff7
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/option.rs:1300:5
  16:       0x5594b0dccc - core::result::Result<T,E>::unwrap::h1603c9ed2080196d
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/result.rs:1037:23
  17:       0x5594ac9f08 - <juice::layers::common::rnn::Rnn<B> as juice::layer::ComputeOutput<f32,B>>::compute_output::hbbefeb29906b93b8
                               at /project/juice/src/layers/common/rnn.rs:201:9
  18:       0x5594b2f070 - juice::layer::ILayer::forward::h169291a1b7f35c5e
                               at /project/juice/src/layer.rs:1095:9
  19:       0x5594b1fc24 - juice::layer::Layer<B>::forward::hd8b6afa6b5d69270
                               at /project/juice/src/layer.rs:557:17
  20:       0x5594acd7a4 - <juice::layers::container::sequential::Sequential<B> as juice::layer::ILayer<B>>::forward::h441f881a79ca76b3
                               at /project/juice/src/layers/container/sequential.rs:301:13
  21:       0x5594b1fc24 - juice::layer::Layer<B>::forward::hd8b6afa6b5d69270
                               at /project/juice/src/layer.rs:557:17
  22:       0x5594b36d3c - juice::solver::Solver<SolverB,B>::train_minibatch::h6fb1716633c1febd
                               at /project/juice/src/solver/mod.rs:79:27
  23:       0x5594b5905c - example_rnn_regression::train::h1669d3452fa52618
                               at /project/juice-examples/mackey-glass-rnn-regression/src/main.rs:194:28
  24:       0x5594b5a110 - example_rnn_regression::main::h5074d67af69ef82c
                               at /project/juice-examples/mackey-glass-rnn-regression/src/main.rs:270:9
  25:       0x5594ae1ba8 - core::ops::function::FnOnce::call_once::h5fdc404ae2610bb1
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/ops/function.rs:227:5
  26:       0x5594ad84ac - std::sys_common::backtrace::__rust_begin_short_backtrace::h5b5ee4a7839e0526
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/sys_common/backtrace.rs:125:18
  27:       0x5594ad9254 - std::rt::lang_start::{{closure}}::h84bedbe026d25f3a
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/rt.rs:66:18
  28:       0x5594dfa084 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hd34202797d0377fb
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/ops/function.rs:259:13
  29:       0x5594dfa084 - std::panicking::try::do_call::hc0849ce796e43a92
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:379:40
  30:       0x5594dfa084 - std::panicking::try::h8a3066153f72d673
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:343:19
  31:       0x5594dfa084 - std::panic::catch_unwind::hb68ca084e373d0e1
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panic.rs:431:14
  32:       0x5594dfa084 - std::rt::lang_start_internal::ha7e3915fcdc7c7d8
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/rt.rs:51:25
  33:       0x5594ad922c - std::rt::lang_start::hcd9fc42f0d477ddd
                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/rt.rs:65:5
  34:       0x5594b5ab44 - main
  35:       0x7f8438e720 - __libc_start_main
  36:       0x55949f18d4 - <unknown>

I wonder about the workspace size?

Etienne

@lissahyacinth
Contributor

That is really helpful!

CUDA is pretty difficult to debug here; as the error says, it could be any of those issues. I can't imagine memory is causing it, but it could be.

If you run
watch -d -n 0.5 nvidia-smi
in a separate terminal, you should be able to see whether it's getting close to any memory limits.

If not, I guess I'll have to debug it here :)

@drahnr
Member

drahnr commented Apr 1, 2021

@Etienne-56 this is not related to your cargo or disk workspace, but only to the scratch memory for the cuDNN RNN layer (a chunk of GPU-mapped memory, IIRC).

I think the most likely cause is one of those two:

workSpaceSizeInBytes is too small.
reserveSpaceSizeInBytes is too small.  

according to https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnRNNForwardTraining . @lissahyacinth this is a hunch, nothing more, nothing less :) I have a feeling the size of our scratch space is just too small depending on the input and output, but I'd have to read up on that.

Work and reserve space buffer sizes should be computed by the cudnnGetRNNTempSpaceSizes() function with the same fwdMode setting as used in the cudnnRNNForward() call. 

☝️ which is not what we do 😰
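For reference, a rough Rust sketch of how those sizes would be queried through the cuDNN v8 call the documentation mentions. The extern declaration mirrors the documented C signature of cudnnGetRNNTempSpaceSizes(); the opaque pointer types and the CUDNN_FWD_MODE_TRAINING value are assumptions here, and this is not the rcudnn binding juice actually uses:

use std::os::raw::c_void;

type CudnnStatus = i32; // CUDNN_STATUS_SUCCESS == 0

// Assumed value of cudnnForwardMode_t::CUDNN_FWD_MODE_TRAINING.
const CUDNN_FWD_MODE_TRAINING: u32 = 1;

#[link(name = "cudnn")]
extern "C" {
    // Mirrors the cuDNN 8 API reference for cudnnGetRNNTempSpaceSizes().
    fn cudnnGetRNNTempSpaceSizes(
        handle: *mut c_void,            // cudnnHandle_t
        rnn_desc: *mut c_void,          // cudnnRNNDescriptor_t
        f_mode: u32,                    // cudnnForwardMode_t
        x_desc: *mut c_void,            // cudnnRNNDataDescriptor_t
        work_space_size: *mut usize,
        reserve_space_size: *mut usize,
    ) -> CudnnStatus;
}

/// Query the workspace/reserve sizes for a training forward pass, using the
/// same fwdMode that will later be passed to cudnnRNNForward().
unsafe fn rnn_temp_space_sizes(
    handle: *mut c_void,
    rnn_desc: *mut c_void,
    x_desc: *mut c_void,
) -> Result<(usize, usize), CudnnStatus> {
    let (mut workspace, mut reserve) = (0usize, 0usize);
    let status = cudnnGetRNNTempSpaceSizes(
        handle,
        rnn_desc,
        CUDNN_FWD_MODE_TRAINING,
        x_desc,
        &mut workspace,
        &mut reserve,
    );
    if status == 0 { Ok((workspace, reserve)) } else { Err(status) }
}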

@Etienne-56
Author

Etienne-56 commented Apr 1, 2021

@lissahyacinth Thanks for the suggestion but nvidia-smi binary is not on the jetson
Just for completeness, I ran tegrastats command while running example-rnn-regression:

RAM 566/3964MB (lfb 706x4MB) SWAP 0/1982MB (cached 0MB) CPU [2%@102,1%@102,0%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@26.5C CPU@30.5C iwlwifi@41C PMIC@100C GPU@28.5C AO@35C thermal@29.75C
RAM 566/3964MB (lfb 706x4MB) SWAP 0/1982MB (cached 0MB) CPU [1%@102,3%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@26.5C CPU@31C iwlwifi@41C PMIC@100C GPU@28.5C AO@35C thermal@29.75C
RAM 566/3964MB (lfb 706x4MB) SWAP 0/1982MB (cached 0MB) CPU [3%@102,1%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@26.5C CPU@31C iwlwifi@41C PMIC@100C GPU@28.5C AO@34.5C thermal@29.75C
RAM 566/3964MB (lfb 706x4MB) SWAP 0/1982MB (cached 0MB) CPU [2%@102,0%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@26.5C CPU@31.5C iwlwifi@41C PMIC@100C GPU@29C AO@35C thermal@29.75C
RAM 566/3964MB (lfb 706x4MB) SWAP 0/1982MB (cached 0MB) CPU [1%@102,3%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@26C CPU@31C iwlwifi@41C PMIC@100C GPU@28.5C AO@35C thermal@29.75C
RAM 598/3964MB (lfb 693x4MB) SWAP 0/1982MB (cached 0MB) CPU [25%@1479,23%@1479,22%@1479,34%@1479] EMC_FREQ 0% GR3D_FREQ 0% PLL@26.5C CPU@31C iwlwifi@41C PMIC@100C GPU@29.5C AO@35.5C thermal@30C
RAM 687/3964MB (lfb 659x4MB) SWAP 0/1982MB (cached 0MB) CPU [10%@1428,5%@1428,29%@1428,29%@1428] EMC_FREQ 0% GR3D_FREQ 21% PLL@27.5C CPU@32C iwlwifi@41C PMIC@100C GPU@29.5C AO@35C thermal@30.5C
RAM 794/3964MB (lfb 620x4MB) SWAP 0/1982MB (cached 0MB) CPU [14%@1479,12%@1479,27%@1479,30%@1479] EMC_FREQ 0% GR3D_FREQ 15% PLL@27C CPU@31.5C iwlwifi@41C PMIC@100C GPU@29.5C AO@35C thermal@30.75C
RAM 866/3964MB (lfb 589x4MB) SWAP 0/1982MB (cached 0MB) CPU [10%@1132,18%@1132,25%@1132,6%@1132] EMC_FREQ 0% GR3D_FREQ 10% PLL@27C CPU@31.5C iwlwifi@41C PMIC@100C GPU@29.5C AO@34.5C thermal@30.25C
RAM 981/3964MB (lfb 553x4MB) SWAP 0/1982MB (cached 0MB) CPU [10%@1479,24%@1479,47%@1479,6%@1479] EMC_FREQ 0% GR3D_FREQ 0% PLL@27C CPU@31.5C iwlwifi@41C PMIC@100C GPU@29.5C AO@35C thermal@30.75C
RAM 683/3964MB (lfb 564x4MB) SWAP 0/1982MB (cached 0MB) CPU [24%@1479,14%@1479,14%@1479,15%@1479] EMC_FREQ 0% GR3D_FREQ 0% PLL@27C CPU@32C iwlwifi@41C PMIC@100C GPU@29.5C AO@35C thermal@30.75C
RAM 683/3964MB (lfb 564x4MB) SWAP 0/1982MB (cached 0MB) CPU [1%@102,2%@102,0%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@26.5C CPU@31.5C iwlwifi@41C PMIC@100C GPU@29C AO@34.5C thermal@30C
RAM 683/3964MB (lfb 564x4MB) SWAP 0/1982MB (cached 0MB) CPU [1%@102,3%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@26.5C CPU@31C iwlwifi@41C PMIC@100C GPU@29C AO@35C thermal@30.25C
RAM 683/3964MB (lfb 564x4MB) SWAP 0/1982MB (cached 0MB) CPU [7%@102,1%@102,3%@102,1%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@26.5C CPU@31.5C iwlwifi@41C PMIC@100C GPU@29C AO@35C thermal@30.25C
RAM 683/3964MB (lfb 564x4MB) SWAP 0/1982MB (cached 0MB) CPU [2%@102,3%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@27C CPU@31.5C iwlwifi@41C PMIC@100C GPU@29C AO@35C thermal@30C

@drahnr
Member

drahnr commented Apr 7, 2021

This has nothing to do with the available memory; that would yield an allocation error, which would produce a different error message.
It seems some APIs changed and a few deprecated APIs are currently in use. The next step is to create a unit test and add additional debug_assert! checks to verify the sizes stay constant as expected.
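A test of that shape might look roughly like the sketch below; expected_workspace_size and the constant it is compared against are placeholders, since the real size computation lives in coaster-nn:

// Placeholder size computation standing in for whatever coaster-nn derives
// from the RNN descriptors; the formula is made up purely for illustration.
fn expected_workspace_size(seq_len: usize, batch: usize, hidden: usize) -> usize {
    seq_len * batch * hidden * std::mem::size_of::<f32>()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn workspace_size_stays_constant() {
        let size = expected_workspace_size(10, 40, 8);
        // debug_assert_eq! is active in debug/test builds and compiled out in release.
        debug_assert_eq!(size, 10 * 40 * 8 * 4);
    }
}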

@drahnr drahnr mentioned this issue Apr 8, 2021
4 tasks
@drahnr
Member

drahnr commented Apr 9, 2021

@Etienne-56 I got a simple example test case working, see #139 / https://ci.spearow.io/teams/spearow/pipelines/juice/jobs/pr-test-juice-fedora-cuda/builds/41 - it was a hidden-layer dimensionality mismatch that caused this, unrelated to the temporary space allocations.

Hoping to fix the RNN layer tonight.

@Etienne-56
Author

OK, thanks for the feedback @drahnr :)

@drahnr
Member

drahnr commented Apr 12, 2021

Quick update: I got a simple layer working now, but it seems the mackey-glass example is either triggering an edge case, using an invalid parameterization, or hitting another issue when embedded within multiple layers.

@drahnr
Member

drahnr commented Apr 15, 2021

It's fixed for the most part now; training should be working just fine. There is one tensor size mismatch left in the example-rnn-regression test trained_net.capnp, which is currently being investigated. Funny how much one simple, badly parameterized test case can hide.

@drahnr drahnr self-assigned this Apr 15, 2021
@Etienne-56
Author

@drahnr Thanks for the support. I recently switched from an R&D project to a client project, so I can't rerun the test right now, but a new colleague is going to take my place next week. I will pass on to him what to do and keep you informed of the result!

@drahnr
Member

drahnr commented Apr 18, 2021

The initial issue is resolved with the pending merge of #139; the remaining task is described in #140.
