Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SCC: wrong results with or without 'trim_vector_sections' in a certain case with literal temporary arrays #184

Open
skarppinen opened this issue Nov 1, 2023 · 0 comments

Comments

@skarppinen
Copy link
Contributor

This is a reproducer on an issue which has bit me when using SCC to transform the ACRANEB2 dwarf.
Consider the following Fortran module, highlighting the issue:

module mod3
    implicit none 
contains
    subroutine wrapper
        integer :: h_start = 1
        integer, parameter :: h_dim = 10
        integer :: h_end 
        integer :: v_start = 1
        integer, parameter :: v_dim = 20
        integer, parameter :: block_size = 30
        integer :: block_index
        

        real :: arr(h_dim, v_dim, block_size)

        h_end = h_dim
        do block_index = 1, block_size 
        call depth1(h_start, h_end, h_dim, v_start, v_dim, &
             arr(:, :, block_index))
        end do
        print *, "Sum of array is ", sum(arr)
    end subroutine wrapper

    subroutine depth1(h_start, h_end, h_dim, v_start, v_dim, array)
        ! Arguments.
        integer, intent(in) :: h_start
        integer, intent(in) :: h_end
        integer, intent(in) :: h_dim
        integer, intent(in) :: v_start
        integer, intent(in) :: v_dim
        real, intent(inout) :: array(h_dim, v_dim)

        ! Non-array local variables.
        integer :: v_index
        integer :: h_index

        ! Temporary arrays.
        real :: tmp_literal(3)
        real :: tmp1(h_dim, v_dim), tmpx(h_dim, v_dim)

        tmp_literal(1) = 1.0
        tmp_literal(2) = 2.0
        tmp_literal(3) = 3.0

        do v_index = v_start, v_dim
            do h_index = h_start, h_end
                array(h_index, v_index) = array(h_index, v_index) + tmp_literal(1) 
            end do
        end do

        do v_index = v_start, v_dim
            do h_index = h_start, h_end
                array(h_index, v_index) = exp(tmp1(h_index, v_index) + 0.25) + tmp_literal(2)
            end do
        end do

        call somesubroutine(h_start, h_end, h_dim, v_start, v_dim, array, tmp_literal)

    end subroutine depth1
    subroutine somesubroutine(h_start, h_end, h_dim, v_start, v_dim, array, tmp_literal)
        integer, intent(in) :: h_start
        integer, intent(in) :: h_end
        integer, intent(in) :: h_dim
        integer, intent(in) :: v_start
        integer, intent(in) :: v_dim 
        real, intent(inout) :: array(h_dim, v_dim)
        real, intent(in) :: tmp_literal(3)

        integer :: v_index
        integer :: h_index

        do v_index = v_start, v_dim
            do h_index = h_start, h_end
                array(h_index, v_index) = array(h_index, v_index) + tmp_literal(1)
            end do
        end do
    end subroutine
end module mod3

Above, pay close attention to depth1, where there is a "literal" temporary tmp_literal. In particular, this array is:

  • initialised at the top level of depth1
  • used in the subsequent loops of depth1
  • used to modify array in the call to somesubroutine after the loops.

Now, consider the transformation of the above program using SCC, first without setting --trim_vector_sections, using the call:
loki-transform.py convert --out-path scc-no-trim --path src --directive openacc --mode scc --config loki-config
which results in depth1_scc being transformed to:

SUBROUTINE depth1_SCC (h_start, h_end, h_dim, v_start, v_dim, array)
    ! Arguments.
    INTEGER, INTENT(IN) :: h_start
    INTEGER, INTENT(IN) :: h_end
    INTEGER, INTENT(IN) :: h_dim
    INTEGER, INTENT(IN) :: v_start
    INTEGER, INTENT(IN) :: v_dim
    REAL, INTENT(INOUT) :: array(h_dim, v_dim)
    
    ! Non-array local variables.
    INTEGER :: v_index
    INTEGER :: h_index
    
    ! Temporary arrays.
    REAL :: tmp_literal(3)
    REAL :: tmp1(h_dim, v_dim), tmpx(h_dim, v_dim)
!$acc routine vector
!$acc data present( array )
    
!$acc loop vector private( tmp_literal )
    DO h_index=h_start,h_end
      
      tmp_literal(1) = 1.0
      tmp_literal(2) = 2.0
      tmp_literal(3) = 3.0
      
!$acc loop seq
      DO v_index=v_start,v_dim
        array(h_index, v_index) = array(h_index, v_index) + tmp_literal(1)
      END DO
      
!$acc loop seq
      DO v_index=v_start,v_dim
        array(h_index, v_index) = EXP(tmp1(h_index, v_index) + 0.25) + tmp_literal(2)
      END DO
      
    END DO
    
    CALL somesubroutine_SCC(h_start, h_end, h_dim, v_start, v_dim, array, tmp_literal)
    
    
!$acc end data
  END SUBROUTINE depth1_SCC

Above, see how the initialisation of tmp_literal has been moved inside the vector loop, and a private (tmp_literal) has been added. The issue here is that now in the call to somesubroutine, tmp_literal will be uninitialised, as the loop operates with private copies of it.

Consider then, adding the --trim_vector_sections flag to the above call to loki-transform.py as means of fixing this issue:

SUBROUTINE depth1_SCC (h_start, h_end, h_dim, v_start, v_dim, array)
    ! Arguments.
    INTEGER, INTENT(IN) :: h_start
    INTEGER, INTENT(IN) :: h_end
    INTEGER, INTENT(IN) :: h_dim
    INTEGER, INTENT(IN) :: v_start
    INTEGER, INTENT(IN) :: v_dim
    REAL, INTENT(INOUT) :: array(h_dim, v_dim)
    
    ! Non-array local variables.
    INTEGER :: v_index
    INTEGER :: h_index
    
    ! Temporary arrays.
    REAL :: tmp_literal(3)
    REAL :: tmp1(h_dim, v_dim), tmpx(h_dim, v_dim)
!$acc routine vector
!$acc data present( array )
    
    tmp_literal(1) = 1.0
    tmp_literal(2) = 2.0
    tmp_literal(3) = 3.0
    
    
!$acc loop vector private( tmp_literal )
    DO h_index=h_start,h_end
!$acc loop seq
      DO v_index=v_start,v_dim
        array(h_index, v_index) = array(h_index, v_index) + tmp_literal(1)
      END DO
      
!$acc loop seq
      DO v_index=v_start,v_dim
        array(h_index, v_index) = EXP(tmp1(h_index, v_index) + 0.25) + tmp_literal(2)
      END DO
    END DO
    
    
    CALL somesubroutine_SCC(h_start, h_end, h_dim, v_start, v_dim, array, tmp_literal)
    
    
!$acc end data
  END SUBROUTINE depth1_SCC

Now, tmp_literal will be initialised in the call to somesubroutine, but uninitialised in the vector loop (because 'private' doesn't copy values, only allocates). This highlights that there will be an issue regardless of whether --trim_vector_sections is set.
If, however, the private( tmp_literal ) is removed from the version above with trim_vector_sections enabled, results will in this case be correct.

When I run these different versions on ATOS HPC compiled using nvfortran (with a main program calling 'wrapper') I get the following results:

CPU VERSION (OpenACC disabled)
Sum of array is 25702.77
GPU (OpenACC) VERSION no-trim (version without --trim-vector-sections set)
Sum of array is 19703.85
GPU (OpenACC) VERSION trim (version with --trim-vector-sections set)
Sum of array is NaN
GPU (OpenACC) VERSION trim with private statement removed (version with --trim-vector-sections set and private statement removed manually)
Sum of array is 25702.77

As you can see, the results of the CPU version and the last version coincide.

In all of the GPU versions above, the code in somesubroutine gets transformed to:

SUBROUTINE somesubroutine_SCC (h_start, h_end, h_dim, v_start, v_dim, array, tmp_literal)
    INTEGER, INTENT(IN) :: h_start
    INTEGER, INTENT(IN) :: h_end
    INTEGER, INTENT(IN) :: h_dim
    INTEGER, INTENT(IN) :: v_start
    INTEGER, INTENT(IN) :: v_dim
    REAL, INTENT(INOUT) :: array(h_dim, v_dim)
    REAL, INTENT(IN) :: tmp_literal(3)
    
    INTEGER :: v_index
    INTEGER :: h_index
!$acc routine vector
!$acc data present( array, tmp_literal )
    
    
!$acc loop vector
    DO h_index=h_start,h_end
!$acc loop seq
      DO v_index=v_start,v_dim
        array(h_index, v_index) = array(h_index, v_index) + tmp_literal(1)
      END DO
    END DO
    
    
!$acc end data
  END SUBROUTINE somesubroutine_SCC

and the content of loki-config is:

[default]
# Specifies the behaviour of auto-expanded routines
role = 'kernel'
expand = true  # Automatically expand subroutine calls
strict = true # Throw exceptions during discovery

# Define entry point for call-tree transformation
[[routine]]
name = 'wrapper'
role = 'driver'
expand = true

[[dimension]]
name = 'horizontal'
size = 'h_dim'
index = 'h_index'
bounds = ['h_start', 'h_end']

[[dimension]]
name = 'vertical'
size = 'v_dim'
index = 'v_index'

[[dimension]]
name = 'block_dim'
size = 'block_size'
index = 'block_index'

I am not really sure what is the proper way to fix this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant