Question and errors compiling for -fopenmp-targets=nvptx64-nvidia-cuda #847

justxi opened this issue Dec 31, 2019 · 5 comments


justxi commented Dec 31, 2019

I compiled flang following this guide: https://github.com/flang-compiler/flang/wiki/Building-Flang, except that I used release_90, added the NVPTX target, set LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES="35,61", and built with GCC 9.2.
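(For reference, a rough sketch of the non-default CMake options described above; the exact command line was not posted, and other options and paths are omitted:)

cmake ... \
  -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
  -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES="35,61" \
  ...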

When I use the following program:

program hello
  use omp_lib
  implicit none

  integer, parameter :: N = 1024
  integer :: i

  real, dimension(N) :: x
  real, dimension(N) :: sum

  integer, dimension(N) :: thn
  integer, dimension(N) :: ten

  do i = 1, N 
    x(i) = 1
    sum(i) = 1
  end do

  print *, "omp_get_num_devices = ", omp_get_num_devices()

  !$omp parallel do 
  do i = 1, N
    sum(i) = sum(i) + x(i) *x(i)
    thn(i) = omp_get_thread_num()
    ten(i) = omp_get_team_num()
  end do
  !$omp end parallel do

  do i = 1, N
    print *, "team num = ", ten(i), ", thread num= ",thn(i), ", result: ", sum(i)
  end do

end program hello

Compiling it with

flang -fopenmp test0.f90 -o test0_cpu
flang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda test0.f90 -Xopenmp-target -march=sm_61 -o test0_gpu

It works as expected on the CPU with 8 threads, but with the GPU build I also see only 8 threads.
Is this correct?

I modified the example to use target:

  print *, "omp_get_num_devices = ", omp_get_num_devices()
 
 !$omp target teams distribute parallel do 
  do i = 1, N
    sum(i) = sum(i) + x(i) *x(i)
    thn(i) = omp_get_thread_num()
    ten(i) = omp_get_team_num()
  end do
  !$omp end target teams distribute parallel do

Then I get the following error:

/pathto/flang/tools/flang2/flang2exe/verify.cpp:80: DEBUG_ASSERT 0 < ilix failed
F90-F-0000-Internal compiler error. internal error in verifier itself       0  (test1.f90: 33)
/pathto/flang/tools/flang2/flang2exe/verify.cpp:80: DEBUG_ASSERT 0 < ilix failed
F90-F-0000-Internal compiler error. internal error in verifier itself       0  (test1.f90: 33)

When I comment out the omp_get_num_devices print statement:

!  print *, "omp_get_num_devices = ", omp_get_num_devices()

  !$omp target teams distribute parallel do 
  do i = 1, N
    sum(i) = sum(i) + x(i) *x(i)
    thn(i) = omp_get_thread_num()
    ten(i) = omp_get_team_num()
  end do
  !$omp end target teams distribute parallel do

I get:

/tmp/test1a-7ce91a.ll:30:82: error: initializer with struct type has wrong # elements
@.openmp.offload.entry.__nv_MAIN__F1L21_1_ = weak global %struct.__tgt_bin_desc  {  i8* getelementptr(i8, i8* @.openmp.offload.region.__nv_MAIN__F1L21_1_, i32 0),  i8* getelementptr(i8, i8* bitcast([19 x i8]* @.C421_MAIN_ to i8*), i32 0) ,i64 0, i32 0, i32 0 }, section ".omp_offloading.entries", align 1

I saved the programs as test{0,1,1a}.f90.
What am I doing wrong?

Programs with the same functionality, written in C and compiled with Clang 9.0.1, work as expected.
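(For comparison, a minimal sketch of what such a C program might look like; the actual C source was not posted, so the file name test0.c and the map clauses here are illustrative:)

#include <stdio.h>

#define N 1024

int main(void) {
  float x[N], sum[N];

  for (int i = 0; i < N; ++i) {
    x[i] = 1.0f;
    sum[i] = 1.0f;
  }

  /* Same construct as the Fortran version: offload the loop to the device. */
  #pragma omp target teams distribute parallel for map(to: x) map(tofrom: sum)
  for (int i = 0; i < N; ++i)
    sum[i] = sum[i] + x[i] * x[i];

  for (int i = 0; i < N; ++i)
    printf("result: %f\n", sum[i]);

  return 0;
}

compiled with the analogous command:

clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_61 test0.c -o test0_gpu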

@gklimowicz

@grypp Güray, do you have any information about this?


grypp commented Jan 3, 2020

Hello @justxi
For the first example, how did you observe 8 threads on the GPU? I would not expect that code to run on the GPU, since there is no target region.
For the second example, unfortunately, the OpenMP API functions are not implemented for the GPU device. Some of them might work, but none of them has been tested. Does your code work properly if you remove the API calls?


justxi commented Jan 3, 2020

Hi @grypp

> Hello @justxi
> For the first example, how did you observe 8 threads on the GPU? I would not expect that code to run on the GPU, since there is no target region.

OK, that would explain why I see the same number of threads as CPU cores.

> For the second example, unfortunately, the OpenMP API functions are not implemented for the GPU device. Some of them might work, but none of them has been tested. Does your code work properly if you remove the API calls?

I modified the program:

program hello
  implicit none

  integer, parameter :: N = 1024
  integer :: i

  real, dimension(N) :: x
  real, dimension(N) :: sum

  do i = 1, N 
    x(i) = 1
    sum(i) = 1
  end do

  !$omp target teams distribute parallel do 
  do i = 1, N
    sum(i) = sum(i) + x(i) * x(i)
  end do
  !$omp end target teams distribute parallel do

  do i = 1, N
    print *, "result: ", sum(i)
  end do
end program hello

Compiling with:

flang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_61 test1b.f90 -o test1b_gpu

But I get this error again:

/tmp/test1b-e4125d.ll:30:86: error: initializer with struct type has wrong # elements
@.openmp.offload.entry.__nv_MAIN__F1L15_1_ = weak global %struct.__tgt_device_image  {  i8* getelementptr(i8, i8* @.openmp.offload.region.__nv_MAIN__F1L15_1_, i32 0),  i8* getelementptr(i8, i8* bitcast([19 x i8]* @.C406_MAIN_ to i8*), i32 0) ,i64 0, i32 0, i32 0 }, section ".omp_offloading.entries", align 1


justxi commented Jan 11, 2020

@grypp Is there an example that is known to work for offloading Fortran to an NVIDIA GPU?


vfdff commented Feb 2, 2021

@justxi can you try the following simpler case and check whether it has the same issue?

PROGRAM OFFLOADING_DEMO
USE OMP_LIB

INTEGER :: isHost = -1
character*16 :: name

!$OMP TARGET MAP (from: isHost)
  ! OMP_IS_INITIAL_DEVICE() returns a LOGICAL; store it as 1/0
  isHost = MERGE(1, 0, OMP_IS_INITIAL_DEVICE())
!$OMP END TARGET

if (isHost < 0) then
  PRINT *, "Runtime error, isHost = ", isHost
end if

! CHECK: Target region executed on the device
if (isHost /= 0) then
  name = "host"
else
  name = "device"
endif

PRINT *,"Target region executed on the ", name

END
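(It should build with the same command as the earlier tests; the file name and the sm_61 architecture below are assumed from the commands above:)

flang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_61 offloading_demo.f90 -o offloading_demo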
