fig_15_3 example hangs on Iris Pro Graphics 580 #21
Comments
On the same GPU, the fig_15_5 example runs at about 0.11 s per iteration and fig_15_7 at about 0.035 s per iteration, so the 7 s per iteration of the fig_15_3 single-task example seems extremely slow.
Interesting. The "single task" version is not going to run very well on most GPUs, but the time you are seeing is excessive. Could you please include:
As a data point, you may also want to try using the OpenCL GPU backend instead of the Level Zero GPU backend. You can do this by setting the device-filter environment variable. Thanks!
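For context on why the single-task version fares so poorly on GPUs: it runs the whole multiplication in one work-item, so every multiply-add executes serially on a single GPU thread. The sketch below (mine, not the book's code; the function name and row-major layout are assumptions) shows the equivalent serial work in plain C++:

```cpp
#include <cassert>  // for the usage check below
#include <vector>

// Serial matrix multiply: roughly the work a single_task kernel's lone
// work-item must do by itself, O(n^3) multiply-adds with no parallelism.
void serial_matmul(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];  // row-major layout assumed
            c[i * n + j] = sum;
        }
}
```

At matrixSize = 128 that is roughly two million multiply-adds per iteration, which is trivial for one CPU core but punishing for a single GPU thread with no latency hiding.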
I'm using the most recent released docker version; I ran `dpcpp --version` and `sycl-ls` from `root@33541cf26757:/workspaces/data-parallel-CPP-main/build`. I also retried using the OpenCL GPU backend.
I got the most recent docker version working on my system also. Note that it appears there is a slightly newer version than the one you are using. I'm not able to reproduce this issue on my end:
A couple of possibilities:
Since (1) is the easiest to check, let's start there first.
OK, thanks. Yes, I had pulled the latest docker images, but neglected to rebuild my docker environment in vscode and update its compiler paths. After doing that, I deleted my build directory and then re-created the makefiles. Here are also my cmake configure options: no optimizations, and debug enabled. Maybe that has something to do with the issue.
41 mkdir build
Here is the compiler version showing the update to the latest version, and the `sycl-ls` output from `root@4578405bdff6:/workspaces/data-parallel-CPP-main/build`:
However, the end result with matrixSize=128 is still a hang. I'm attaching the screen captures of the iterations, showing the matrixSize=100 and 110 cases.
I checked, and the problem is associated with the disabled optimization. If I override the build optimization to -O2, the example runs: Running on device: Intel(R) Iris(TM) Pro Graphics 580 [0x193b]
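For anyone else hitting this, a possible way to rebuild with optimization while keeping debug info is sketched below (the target name and options are my assumptions, not verified against this repo's CMakeLists):

```shell
# Assumed invocations; the CMAKE_BUILD_TYPE values are standard CMake.
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..       # -O2 plus debug info
# or override the flags directly:
cmake -DCMAKE_CXX_FLAGS="-O2 -g" ..
make fig_15_3_single_task_matrix_multiplication  # assumed target name
```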
Thanks for investigating further. I can reproduce the excessive execution time using an unoptimized build. I'm checking to see if there is a way to compile the host code without optimization while still optimizing the device code. Would this satisfy your use-case? I see you mentioned above:
I already have the work-arounds of reducing matrixSize and/or using Q{cpu_selector{}}. With matrixSize==128 and the gpu_selector I can wait 4 minutes without a single iteration executing, so I presume something is hung. Using cpu_selector, fig_15_3_single_task completes an iteration in about 0.4 s. There is a document on gdb for GPU debugging (gpu_debug), which I'm linking here for reference. It mentions the heartbeat_interval, enable_hangcheck, and preempt_timeout settings, which I haven't explicitly set. I'll come back to this problem after finishing the DPC++ book examples and see if I can debug the GPU hang further.
I'm using a Skull Canyon NUC with Iris Pro Graphics 580. Most of the examples run OK on it, but the fig_15_3 example hangs with the current matrixSize=128 setting. If I lower matrixSize to 96, it executes very slowly, over 6 s per iteration. At matrixSize=100 it seemingly stalls after a couple of iterations. I modified the code to add both async and queue exception catching, but no error is caught. I'll attach my code, the stack backtraces, and the system monitor showing the hang after a couple of iterations at matrixSize=100. The CPU goes to 100% and stays there, with only 3 threads running. This is the single-task matrix multiplication; the parallel versions in the following examples work OK on the GPU. All examples work OK on the CPU.
fig_15_3_single_task_matrix_multiplication_mod.zip
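For what it's worth, the per-iteration numbers quoted in this thread can be collected with a small timing harness like the following plain C++ sketch (the function name and iteration count are mine, not from the book's example; the real code would launch the kernel and wait inside `work`):

```cpp
#include <cassert>     // for the usage check below
#include <chrono>
#include <cstdio>
#include <functional>
#include <vector>

// Run `work` `iterations` times and report wall-clock seconds per iteration,
// the way the "secs per iteration" figures above are presumably measured.
std::vector<double> time_iterations(int iterations,
                                    const std::function<void()>& work) {
    std::vector<double> secs;
    for (int it = 0; it < iterations; ++it) {
        auto t0 = std::chrono::steady_clock::now();
        work();  // placeholder for "submit kernel + wait" in the real example
        auto t1 = std::chrono::steady_clock::now();
        secs.push_back(std::chrono::duration<double>(t1 - t0).count());
        std::printf("iteration %d: %.3f s\n", it, secs.back());
    }
    return secs;
}
```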