benchmark_name | cpp_code | accera_code |
---|---|---|
Accera_Vectorized |
src/reduction/accera_vectorized.cpp |
src/reduction/vectorized.py |
Note
The following shows the implementation of the {{benchmark_name}}
.
The full source code listing of the Accera code generator can be found in {{accera_code}} :fas fa-code: and the benchmark runner is found in {{cpp_code}} :fas fa-code: .
The following shows how to implement tree reduction using Accera. There are only a few tweaks that we need to make to the naive Accera implementation to enable vectorization.
The idea of tree reduction is rather simple.
Suppose we are given a vector of length
<------ N ------>
+--+--+--+ +--+
| | | |...| |
+--+--+--+ +--+
We first partition the input into chunks of size vector_size
(we use vector_size=2
in the visualization)
<---------- N/2 ---------->
+--+--+ +--+--+ +--+--+
| | | | | | ... | | |
+--+--+ +--+--+ +--+--+
<- 2 -> <- 2 -> ... <- 2 ->
There are
+--+--+ +--+--+ +--+--+
| | | + | | | ... + ... | | |
+--+--+ +--+--+ +--+--+
The result is a
+--+--+
| | |
+--+--+
We finally perform a horizontal add on the
+--+ +--+
| | + | |
+--+ +--+
The pseudocode of the above strategy can be expressed as:
\begin{algorithm}
\begin{algorithmic}
\PROCEDURE{VectorizedReduction}{$Input$}
\STATE sumVec = \{0, $\ldots$, 0\}
\FOR{$i$ = 0 \TO $\frac{N}{VecSize}$}
\STATE sumVec = sumVec + Input[$i$ * VecSize : $(i+1)$ * VecSize]
\ENDFOR
\STATE sum = 0
\FOR{$k$ = 0 \TO VecSize}
\STATE sum = sum + sumVec[$k$]
\ENDFOR
\RETURN sum
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}
Writing vectorized reduction in Accera follows from the pseudocode above.
First, we need to import the accera
package:
We then define some constants which are derived from the number of vector bytes on the host system.
The system which the benchmark is run on has AVX-2 support, therefore the vector bytes is 32
.
Since the byte count of a single-precision float is 4
, we divide the vector bytes by 4
to get the number of single-precision elements in an AVX-2 vector (which is 8
single-precision floating point values).
using the vector_size
, we derive two other constants (vector_units
and split_size
) which we picked based on properties of the host system.
The reader is encouraged to try other multiples of vector_size
for these parameters and observe performance differences.
As in the naive case, we define our inputs
We also define an auxillary array which has the same number of elements as the split_size
.
The SumVec
will be used to store intermediate sums during the tree reduction.
We then define the loop nest.
The loop nest has a i * split_size + j
).
The above is equivalent to the following pseudocode:
\begin{algorithm}
\begin{algorithmic}
\FOR{$i$ = 0 \TO $\frac{N}{\texttt{split\_size}}$}
\FOR{$j$ = 0 \TO $\texttt{split\_size}$}
\STATE sumVec[j] += Input[$i$ * \texttt{split\_size} + $j$]
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
We also define the final reduction nest.
This nest adds the elements of SumVec
together.
We create a schedule of the two nests.
We then perform concatenation fusing (i.e., fusing with partial=0
).
The above would effectively concatenate the two loops generating the following equivalent pseudocode code:
\begin{algorithm}
\begin{algorithmic}
\FOR{$i$ = 0 \TO $\frac{N}{\texttt{split\_size}}$}
\FOR{$j$ = 0 \TO $\texttt{split\_size}$}
\STATE sumVec[j] += Input[$i$ * \texttt{split\_size} + $j$]
\ENDFOR
\ENDFOR
\FOR{$k$ = 0 \TO $\texttt{split\_size}$}
\STATE Sum[0] += sumVec[$k$]
\ENDFOR
\end{algorithmic}
\end{algorithm}
Querying for the indices of the fused schedule returns:
- The fused dimension
f
- The indices of the
vec_accum_schedule
(i,j
) - The index of
finalize_accum_schedule
(k
)
We can use the indices to perform a split based on the number of vector units:
I.e., the above would transform our code into:
for i in range(0, N // split_size, vector_units):
for ii in range(i, i + vector_units):
for j in range(split_size):
SumVec[j] += Input[ii * split_size + j]
for k in range(split_size):
Sum[0] += SumVec[k]
We use the fused_schedule
to create an action plan
To increase performance, we apply vectorization and unrolling on the indices
We then add the fused plan to our package as a function called vectorized
We export our package as an object file called vectorized
.
The package can then be used within the C code base.
Here we benchmark the vectorized
function: