You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is a reproducer on an issue which has bit me when using SCC to transform the ACRANEB2 dwarf.
Consider the following Fortran module, highlighting the issue:
module mod3
implicit none
contains
subroutine wrapper
integer :: h_start = 1
integer, parameter :: h_dim = 10
integer :: h_end
integer :: v_start = 1
integer, parameter :: v_dim = 20
integer, parameter :: block_size = 30
integer :: block_index
real :: arr(h_dim, v_dim, block_size)
h_end = h_dim
do block_index = 1, block_size
call depth1(h_start, h_end, h_dim, v_start, v_dim, &
arr(:, :, block_index))
end do
print *, "Sum of array is ", sum(arr)
end subroutine wrapper
subroutine depth1(h_start, h_end, h_dim, v_start, v_dim, array)
! Arguments.
integer, intent(in) :: h_start
integer, intent(in) :: h_end
integer, intent(in) :: h_dim
integer, intent(in) :: v_start
integer, intent(in) :: v_dim
real, intent(inout) :: array(h_dim, v_dim)
! Non-array local variables.
integer :: v_index
integer :: h_index
! Temporary arrays.
real :: tmp_literal(3)
real :: tmp1(h_dim, v_dim), tmpx(h_dim, v_dim)
tmp_literal(1) = 1.0
tmp_literal(2) = 2.0
tmp_literal(3) = 3.0
do v_index = v_start, v_dim
do h_index = h_start, h_end
array(h_index, v_index) = array(h_index, v_index) + tmp_literal(1)
end do
end do
do v_index = v_start, v_dim
do h_index = h_start, h_end
array(h_index, v_index) = exp(tmp1(h_index, v_index) + 0.25) + tmp_literal(2)
end do
end do
call somesubroutine(h_start, h_end, h_dim, v_start, v_dim, array, tmp_literal)
end subroutine depth1
subroutine somesubroutine(h_start, h_end, h_dim, v_start, v_dim, array, tmp_literal)
integer, intent(in) :: h_start
integer, intent(in) :: h_end
integer, intent(in) :: h_dim
integer, intent(in) :: v_start
integer, intent(in) :: v_dim
real, intent(inout) :: array(h_dim, v_dim)
real, intent(in) :: tmp_literal(3)
integer :: v_index
integer :: h_index
do v_index = v_start, v_dim
do h_index = h_start, h_end
array(h_index, v_index) = array(h_index, v_index) + tmp_literal(1)
end do
end do
end subroutine
end module mod3
Above, pay close attention to depth1, where there is a "literal" temporary tmp_literal. In particular, this array is:
initialised at the top level of depth1
used in the subsequent loops of depth1
used to modify array in the call to somesubroutine after the loops.
Now, consider the transformation of the above program using SCC, first without setting --trim_vector_sections, using the call: loki-transform.py convert --out-path scc-no-trim --path src --directive openacc --mode scc --config loki-config
which results in depth1_scc being transformed to:
SUBROUTINE depth1_SCC (h_start, h_end, h_dim, v_start, v_dim, array)
! Arguments.
INTEGER, INTENT(IN) :: h_start
INTEGER, INTENT(IN) :: h_end
INTEGER, INTENT(IN) :: h_dim
INTEGER, INTENT(IN) :: v_start
INTEGER, INTENT(IN) :: v_dim
REAL, INTENT(INOUT) :: array(h_dim, v_dim)
! Non-array local variables.
INTEGER :: v_index
INTEGER :: h_index
! Temporary arrays.
REAL :: tmp_literal(3)
REAL :: tmp1(h_dim, v_dim), tmpx(h_dim, v_dim)
!$acc routine vector
!$acc data present( array )
!$acc loop vector private( tmp_literal )
DO h_index=h_start,h_end
tmp_literal(1) = 1.0
tmp_literal(2) = 2.0
tmp_literal(3) = 3.0
!$acc loop seq
DO v_index=v_start,v_dim
array(h_index, v_index) = array(h_index, v_index) + tmp_literal(1)
END DO
!$acc loop seq
DO v_index=v_start,v_dim
array(h_index, v_index) = EXP(tmp1(h_index, v_index) + 0.25) + tmp_literal(2)
END DO
END DO
CALL somesubroutine_SCC(h_start, h_end, h_dim, v_start, v_dim, array, tmp_literal)
!$acc end data
END SUBROUTINE depth1_SCC
Above, see how the initialisation of tmp_literal has been moved inside the vector loop, and a private (tmp_literal) has been added. The issue here is that now in the call to somesubroutine, tmp_literal will be uninitialised, as the loop operates with private copies of it.
Consider then, adding the --trim_vector_sections flag to the above call to loki-transform.py as means of fixing this issue:
SUBROUTINE depth1_SCC (h_start, h_end, h_dim, v_start, v_dim, array)
! Arguments.
INTEGER, INTENT(IN) :: h_start
INTEGER, INTENT(IN) :: h_end
INTEGER, INTENT(IN) :: h_dim
INTEGER, INTENT(IN) :: v_start
INTEGER, INTENT(IN) :: v_dim
REAL, INTENT(INOUT) :: array(h_dim, v_dim)
! Non-array local variables.
INTEGER :: v_index
INTEGER :: h_index
! Temporary arrays.
REAL :: tmp_literal(3)
REAL :: tmp1(h_dim, v_dim), tmpx(h_dim, v_dim)
!$acc routine vector
!$acc data present( array )
tmp_literal(1) = 1.0
tmp_literal(2) = 2.0
tmp_literal(3) = 3.0
!$acc loop vector private( tmp_literal )
DO h_index=h_start,h_end
!$acc loop seq
DO v_index=v_start,v_dim
array(h_index, v_index) = array(h_index, v_index) + tmp_literal(1)
END DO
!$acc loop seq
DO v_index=v_start,v_dim
array(h_index, v_index) = EXP(tmp1(h_index, v_index) + 0.25) + tmp_literal(2)
END DO
END DO
CALL somesubroutine_SCC(h_start, h_end, h_dim, v_start, v_dim, array, tmp_literal)
!$acc end data
END SUBROUTINE depth1_SCC
Now, tmp_literal will be initialised in the call to somesubroutine, but uninitialised in the vector loop (because 'private' doesn't copy values, only allocates). This highlights that there will be an issue regardless of whether --trim_vector_sections is set.
If, however, the private( tmp_literal ) is removed from the version above with trim_vector_sections enabled, results will in this case be correct.
When I run these different versions on ATOS HPC compiled using nvfortran (with a main program calling 'wrapper') I get the following results:
CPU VERSION (OpenACC disabled)
Sum of array is 25702.77
GPU (OpenACC) VERSION no-trim (version without --trim-vector-sections set)
Sum of array is 19703.85
GPU (OpenACC) VERSION trim (version with --trim-vector-sections set)
Sum of array is NaN
GPU (OpenACC) VERSION trim with private statement removed (version with --trim-vector-sections set and private statement removed manually)
Sum of array is 25702.77
As you can see, the results of the CPU version and the last version coincide.
In all of the GPU versions above, the code in somesubroutine gets transformed to:
SUBROUTINE somesubroutine_SCC (h_start, h_end, h_dim, v_start, v_dim, array, tmp_literal)
INTEGER, INTENT(IN) :: h_start
INTEGER, INTENT(IN) :: h_end
INTEGER, INTENT(IN) :: h_dim
INTEGER, INTENT(IN) :: v_start
INTEGER, INTENT(IN) :: v_dim
REAL, INTENT(INOUT) :: array(h_dim, v_dim)
REAL, INTENT(IN) :: tmp_literal(3)
INTEGER :: v_index
INTEGER :: h_index
!$acc routine vector
!$acc data present( array, tmp_literal )
!$acc loop vector
DO h_index=h_start,h_end
!$acc loop seq
DO v_index=v_start,v_dim
array(h_index, v_index) = array(h_index, v_index) + tmp_literal(1)
END DO
END DO
!$acc end data
END SUBROUTINE somesubroutine_SCC
and the content of loki-config is:
[default]
# Specifies the behaviour of auto-expanded routines
role = 'kernel'
expand = true # Automatically expand subroutine calls
strict = true # Throw exceptions during discovery
# Define entry point for call-tree transformation
[[routine]]
name = 'wrapper'
role = 'driver'
expand = true
[[dimension]]
name = 'horizontal'
size = 'h_dim'
index = 'h_index'
bounds = ['h_start', 'h_end']
[[dimension]]
name = 'vertical'
size = 'v_dim'
index = 'v_index'
[[dimension]]
name = 'block_dim'
size = 'block_size'
index = 'block_index'
I am not really sure what is the proper way to fix this issue.
The text was updated successfully, but these errors were encountered:
This is a reproducer on an issue which has bit me when using SCC to transform the ACRANEB2 dwarf.
Consider the following Fortran module, highlighting the issue:
Above, pay close attention to
depth1
, where there is a "literal" temporarytmp_literal
. In particular, this array is:depth1
depth1
array
in the call tosomesubroutine
after the loops.Now, consider the transformation of the above program using SCC, first without setting
--trim_vector_sections
, using the call:loki-transform.py convert --out-path scc-no-trim --path src --directive openacc --mode scc --config loki-config
which results in
depth1_scc
being transformed to:Above, see how the initialisation of
tmp_literal
has been moved inside the vector loop, and aprivate (tmp_literal)
has been added. The issue here is that now in the call tosomesubroutine
,tmp_literal
will be uninitialised, as the loop operates with private copies of it.Consider then, adding the
--trim_vector_sections
flag to the above call toloki-transform.py
as means of fixing this issue:Now,
tmp_literal
will be initialised in the call tosomesubroutine
, but uninitialised in the vector loop (because 'private' doesn't copy values, only allocates). This highlights that there will be an issue regardless of whether--trim_vector_sections
is set.If, however, the
private( tmp_literal )
is removed from the version above withtrim_vector_sections
enabled, results will in this case be correct.When I run these different versions on ATOS HPC compiled using nvfortran (with a main program calling 'wrapper') I get the following results:
CPU VERSION (OpenACC disabled)
Sum of array is 25702.77
GPU (OpenACC) VERSION no-trim (version without --trim-vector-sections set)
Sum of array is 19703.85
GPU (OpenACC) VERSION trim (version with --trim-vector-sections set)
Sum of array is NaN
GPU (OpenACC) VERSION trim with private statement removed (version with --trim-vector-sections set and private statement removed manually)
Sum of array is 25702.77
As you can see, the results of the CPU version and the last version coincide.
In all of the GPU versions above, the code in
somesubroutine
gets transformed to:and the content of
loki-config
is:I am not really sure what is the proper way to fix this issue.
The text was updated successfully, but these errors were encountered: