Skip to content

Latest commit

 

History

History
215 lines (149 loc) · 6.61 KB

vectorized.md

File metadata and controls

215 lines (149 loc) · 6.61 KB
benchmark_name cpp_code accera_code
Accera_Vectorized
src/reduction/accera_vectorized.cpp
src/reduction/vectorized.py

Vectorized Accera

Note

The following shows the implementation of the {{benchmark_name}}. The full source code listing of the Accera code generator can be found in {{accera_code}} :fas fa-code: and the benchmark runner is found in {{cpp_code}} :fas fa-code: .

The following shows how to implement tree reduction using Accera. There are only a few tweaks that we need to make to the naive Accera implementation to enable vectorization.

Overview

The idea of tree reduction is rather simple. Suppose we are given a vector of length $N$:

<------ N ------>
+--+--+--+   +--+
|  |  |  |...|  |
+--+--+--+   +--+

We first partition the input into chunks of size vector_size (we use vector_size=2 in the visualization)

<---------- N/2 ---------->
+--+--+ +--+--+     +--+--+
|  |  | |  |  | ... |  |  |
+--+--+ +--+--+     +--+--+
<- 2 -> <- 2 -> ... <- 2 ->

There are $N/2$ of these $2$-element chunks. This is just a different view of the input vector and does not change the layout of the data. We then proceed by adding the $2$-element chunks with each other:

+--+--+   +--+--+             +--+--+
|  |  | + |  |  | ...  +  ... |  |  |
+--+--+   +--+--+             +--+--+

The result is a $2$-element vector.

+--+--+ 
|  |  |  
+--+--+ 

We finally perform a horizontal add on the $2$-element vector to get the output scalar value:

+--+   +--+ 
|  | + |  |  
+--+   +--+ 

The pseudocode of the above strategy can be expressed as:

\begin{algorithm} 
\begin{algorithmic} 
\PROCEDURE{VectorizedReduction}{$Input$}
    \STATE sumVec = \{0, $\ldots$, 0\}
    \FOR{$i$ = 0 \TO $\frac{N}{VecSize}$}  
        \STATE sumVec = sumVec + Input[$i$ * VecSize : $(i+1)$ * VecSize]
    \ENDFOR
    \STATE sum = 0 
    \FOR{$k$ = 0 \TO VecSize} 
        \STATE sum = sum + sumVec[$k$]
    \ENDFOR
    \RETURN sum
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}

Implementation

Writing vectorized reduction in Accera follows from the pseudocode above. First, we need to import the accera package:

We then define some constants which are derived from the number of vector bytes on the host system. The system which the benchmark is run on has AVX-2 support, therefore the vector bytes is 32. Since the byte count of a single-precision float is 4, we divide the vector bytes by 4 to get the number of single-precision elements in an AVX-2 vector (which is 8 single-precision floating point values). using the vector_size, we derive two other constants (vector_units and split_size) which we picked based on properties of the host system. The reader is encouraged to try other multiples of vector_size for these parameters and observe performance differences.

As in the naive case, we define our inputs

We also define an auxillary array which has the same number of elements as the split_size. The SumVec will be used to store intermediate sums during the tree reduction.

We then define the loop nest. The loop nest has a $2$-dimensional iteration space. The outer dimension corresponds to the number of splits we are processing and the inner dimension corresponds to the size of the split. We then perform the computation. Since Input is a $1$-dimensional vector, we need to map the $2$-dimensional indices into the vector (i.e., the index computation i * split_size + j).

The above is equivalent to the following pseudocode:

\begin{algorithm} 
\begin{algorithmic}  
\FOR{$i$ = 0 \TO $\frac{N}{\texttt{split\_size}}$}  
    \FOR{$j$ = 0 \TO $\texttt{split\_size}$} 
        \STATE sumVec[j] += Input[$i$ * \texttt{split\_size} + $j$]
    \ENDFOR
\ENDFOR 
\end{algorithmic}
\end{algorithm}

We also define the final reduction nest. This nest adds the elements of SumVec together.

We create a schedule of the two nests.

We then perform concatenation fusing (i.e., fusing with partial=0).

The above would effectively concatenate the two loops generating the following equivalent pseudocode code:

\begin{algorithm} 
\begin{algorithmic}  
\FOR{$i$ = 0 \TO $\frac{N}{\texttt{split\_size}}$}  
    \FOR{$j$ = 0 \TO $\texttt{split\_size}$} 
        \STATE sumVec[j] += Input[$i$ * \texttt{split\_size} + $j$]
    \ENDFOR
\ENDFOR 
\FOR{$k$ = 0 \TO $\texttt{split\_size}$} 
    \STATE Sum[0] += sumVec[$k$]
\ENDFOR 
\end{algorithmic}
\end{algorithm}

Querying for the indices of the fused schedule returns:

  • The fused dimension f
  • The indices of the vec_accum_schedule (i,j)
  • The index of finalize_accum_schedule (k)

We can use the indices to perform a split based on the number of vector units:

I.e., the above would transform our code into:

for i in range(0, N // split_size, vector_units):
    for ii in range(i, i + vector_units):
        for j in range(split_size):
            SumVec[j] += Input[ii * split_size + j]
for k in range(split_size):
    Sum[0] += SumVec[k]

We use the fused_schedule to create an action plan

To increase performance, we apply vectorization and unrolling on the indices

We then add the fused plan to our package as a function called vectorized

We export our package as an object file called vectorized.

Usage

The package can then be used within the C code base. Here we benchmark the vectorized function: