fig_15_3 example hangs on Iris Pro Graphics 580 #21

Open
jnorwood opened this issue Jul 15, 2021 · 8 comments

Comments

@jnorwood

I'm using a Skull Canyon NUC box with Iris Pro Graphics 580. Most of the examples run OK on this, but the fig_15_3 example hangs with the current matrixSize=128 setting. If I lower matrixSize to 96, it executes very slowly, at over 6 secs per iteration. At matrixSize=100 it seemingly stalls after a couple of iterations. I modified the code to catch both async and queue exceptions, but no error is caught. I'll attach my code, the stack backtraces, and the system monitor showing the hang after a couple of iterations when matrixSize=100. It looks like the CPU goes to 100% and stays there, with only 3 threads running. This is the single task matrix multiplication. The parallel versions in the following examples work OK on the GPU. All examples work OK on the CPU.

fig_15_3_call_stack

fig_15_3_cpu_100

fig_15_3_single_task_matrix_multiplication_mod.zip

@jnorwood
Author

On the same GPU, the fig_15_5 example runs at about 0.11 sec per iteration and fig_15_7 at about 0.035 sec per iteration, so the 7 secs per iteration of the fig_15_3 single task example seems extremely slow.
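
For readers following along, the gap is easier to picture with the loop nest written out. The sketch below is plain C++, not the book's SYCL source, and `matmul_serial` and `flop_count` are illustrative names rather than identifiers from the sample. It assumes square, row-major matrices. A single-task kernel runs a loop nest of roughly this shape in one work-item, which is presumably why the parallel fig_15_5 and fig_15_7 versions are so much faster on the same GPU.

```cpp
#include <cstddef>
#include <vector>

// Serial reference: C = A * B for square, row-major n x n matrices.
// Hypothetical helper for illustration; the book sample may differ in detail.
void matmul_serial(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c, std::size_t n) {
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            float sum = 0.0f;
            // One multiply and one add per step of the inner loop.
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[row * n + k] * b[k * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}

// Floating-point operations in one n x n multiplication: 2 * n^3.
double flop_count(std::size_t n) {
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}
```

With matrixSize=128 this comes to 2 * 128^3 = 4,194,304 operations per iteration, all executed one after another.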

@bashbaug
Collaborator

Interesting, the "single task" version is not going to run very well on most GPUs, but the time you are seeing is excessive.

Could you please include:

  • What version of the dpcpp compiler you are using, from dpcpp --version?
  • What driver versions you have installed, from sycl-ls or sycl-ls --verbose?

As a data point, you may also want to try using the OpenCL GPU backend instead of the Level Zero GPU backend. You can do this with the SYCL_BE or SYCL_DEVICE_FILTER environment variables - see here. I don't think this will make a difference (it doesn't on my similar Intel(R) HD Graphics 620 system), but it's worth a try.

Thanks!

@jnorwood
Author

I'm using the most recent docker released version

root@33541cf26757:/workspaces/data-parallel-CPP-main/build# dpcpp --version
Intel(R) oneAPI DPC++ Compiler 2021.2.0 (2021.2.0.20210317)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2021.2.0/linux/bin

root@33541cf26757:/workspaces/data-parallel-CPP-main/build# sycl-ls
ACC : Intel(R) FPGA Emulation Platform for OpenCL(TM) 1.2 [2021.11.3.0.17_160000]
CPU : Intel(R) OpenCL 2.1 [2021.11.3.0.17_160000]
GPU : Intel(R) OpenCL HD Graphics 3.0 [21.11.19310]
GPU : Intel(R) Level-Zero 1.0 [1.0.19310]
HOST: SYCL host platform 1.2 [1.2]

I retried using
export SYCL_DEVICE_FILTER=opencl:gpu:2
based on the documentation on GitHub.
It still hangs at the original matrixSize=128, but it made it through four iterations at matrixSize=100, at about 6.3 sec per iteration.

@bashbaug
Collaborator

I got the most recent docker version working on my system as well. Note that there appears to be a slightly newer version than the one you are using. I'm not able to reproduce this issue on my end:

root@55f9fbe3ec3b:/workspaces/sycl-book-samples/build# dpcpp --version
Intel(R) oneAPI DPC++/C++ Compiler 2021.3.0 (2021.3.0.20210619)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2021.3.0/linux/bin
root@55f9fbe3ec3b:/workspaces/sycl-book-samples/build# sycl-ls
0. ACC : Intel(R) FPGA Emulation Platform for OpenCL(TM) 1.2 [2021.12.6.0.19_160000]
1. CPU : Intel(R) OpenCL 2.1 [2021.12.6.0.19_160000]
2. GPU : Intel(R) OpenCL HD Graphics 3.0 [21.23.20043]
3. GPU : Intel(R) Level-Zero 1.1 [1.1.20043]
4. HOST: SYCL host platform 1.2 [1.2]
root@55f9fbe3ec3b:/workspaces/sycl-book-samples/build# samples/Ch15_gpus/fig_15_3_single_task_matrix_multiplication 
Running on device: Intel(R) HD Graphics 620 [0x5916]
Success!
GFlops: 0.0249472
root@55f9fbe3ec3b:/workspaces/sycl-book-samples/build# SYCL_DEVICE_FILTER=opencl:gpu samples/Ch15_gpus/fig_15_3_single_task_matrix_multiplication 
Running on device: Intel(R) HD Graphics 620 [0x5916]
Success!
GFlops: 0.0249608
root@55f9fbe3ec3b:/workspaces/sycl-book-samples/build# 

A couple of possibilities:

  1. Perhaps there was an issue in the older docker image that has been fixed? This would be the best-case scenario. Can you please try grabbing the latest docker image and give it a try?
  2. Maybe there is an issue with your Iris Pro Graphics 580 that does not appear on my HD Graphics 620? I think this is unlikely - if anything your GPU should be faster! - but I suppose it is possible.
  3. Could there be anything else odd happening with your system? Is everything else running OK?

Since (1) is the easiest to check, let's start there first.

@jnorwood
Author

OK, thanks. Yes, I had pulled the latest docker images, but neglected to rebuild my docker environment in vscode and update its compiler paths.

After doing that, I deleted my build directory and re-created the makefiles. Here are my cmake configure options as well: no optimizations, and debug enabled. Maybe that has something to do with the issue.

mkdir build
cd build
CXX=dpcpp cmake -D CMAKE_BUILD_TYPE=Debug -D CMAKE_CXX_FLAGS="-O0" -D NODPL=1 ../

Here are the compiler version, showing the update to the latest version, and the sycl-ls versions:
root@4578405bdff6:/workspaces/data-parallel-CPP-main/build# dpcpp --version
Intel(R) oneAPI DPC++/C++ Compiler 2021.3.0 (2021.3.0.20210619)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2021.3.0/linux/bin
root@4578405bdff6:/workspaces/data-parallel-CPP-main/build#

root@4578405bdff6:/workspaces/data-parallel-CPP-main/build# sycl-ls
0. ACC : Intel(R) FPGA Emulation Platform for OpenCL(TM) 1.2 [2021.12.6.0.19_160000]
1. CPU : Intel(R) OpenCL 2.1 [2021.12.6.0.19_160000]
2. GPU : Intel(R) OpenCL HD Graphics 3.0 [21.23.20043]
3. GPU : Intel(R) Level-Zero 1.1 [1.1.20043]
4. HOST: SYCL host platform 1.2 [1.2]

However, the end result with matrixSize=128 is still a hang.
With matrixSize=100 it completes, although with long iterations.
With matrixSize=110 it hangs after one 9.2 second iteration.

I'm attaching the screen captures with the iterations showing the matrixSize for 100 and 110 cases.

fig_15_3_single_task_mat100
fig_15_3_single_task_mat110

@jnorwood
Author

I confirmed that the problem is associated with the disabled optimization. If I override the build optimization to -O2 with
make CXX_FLAGS="-O2"
then the MatrixSize:128 case completes.
I normally build with -O0 due to the poor debugger support with -O2 optimization.

Running on device: Intel(R) Iris(TM) Pro Graphics 580 [0x193b]
MatrixSize:128
time:0.979357
time:0.807278
time:0.823998
time:0.853383
Success!
GFlops: 0.00519562

@bashbaug
Collaborator

Thanks for investigating further. I can reproduce the excessive execution time using -O0 also.

I'm checking whether there is a way to compile the host code with -O0 for easier debugging while compiling the device code (which executes on the GPU and is what leads to the excessive execution time) at a different optimization level.

Would this satisfy your use case? I see you mentioned above:

"I normally build with -O0 due to the poor debugger support with -O2 optimization."

@jnorwood
Author

I already have the work-arounds of reducing MatrixSize and/or using Q{cpu_selector{}}.

With MatrixSize==128 and using gpu_selector I can wait for 4 minutes without executing a single iteration, so I presume something is hung.

Using cpu_selector, fig_15_3_single_task completes an iteration in about 0.4 sec.

There is a document on using gdb with the GPU, gpu_debug, which I'm linking here for reference. It mentions the heartbeat_interval, enable_hangcheck and preempt_timeout settings, which I haven't explicitly set.

I'll come back to this problem after finishing the dpc++ book examples and see if I can debug the gpu hang further.
