Submission and reviewing guidelines and methodology: http://cTuning.org/ae/submission-20160509.html
To reproduce the results presented in our work, we provide an artifact that consists of two parts:
- lms-intrinsics, a jar library that includes all Intel-based SIMD intrinsic functions, implemented as Scala eDSLs in LMS.
- NGen, a runtime implemented in Scala and Java that enables the use of lms-intrinsics in the JVM and includes the experiments discussed in our work.
The SIMD-based eDSLs follow the modular design of the LMS framework and are implemented as an external LMS library, separated from the JVM runtime. This allows stand-alone use of lms-intrinsics, enabling LMS to generate x86 vectorized code outside the context of the JVM. The JVM runtime (NGen) demonstrates the use of the lms-intrinsics library, providing the compiler pipeline to generate, compile, link and execute the LMS-generated SIMD code, and has a strong dependency on this library.
The experiments included in the artifact come in the form of microbenchmarks. While the most convenient deployment for this artifact would have been a Docker image through Collective Knowledge, we decided to eliminate the overhead imposed by containers and provide a bare-metal deployment that aims at results that are as precise as possible for our tests. To achieve that, we use SBT (Simple Build Tool) to build and execute our experiments.
Check-list (artifact meta information)
- Algorithm: Using SIMD intrinsics in the JVM. Experiments include dot-product on quantized arrays, BLAS routines: SAXPY and Matrix-Matrix-Multiplication.
- Compilation: lms-intrinsics is a precompiled library, compiled with Scala 2.11, and is available as a jar bundle, accessible through Maven. NGen requires Scala 2.11 and Java 1.8 for compilation. Both generate C code that is compiled with a C compiler.
- Transformations: To make SIMD instructions available in the JVM, NGen uses LMS as a staging framework. The user writes vectorized code as an eDSL in Scala and NGen stages the code through multiple compile phases before execution.
- Binary: NGen includes binaries for SBT v0.13.6, as well as a small library for CPUID inspection and Sigar v1.6.5_01 (System Information Gatherer And Reporter, https://github.com/hyperic/sigar) binaries. NGen has various dependencies on precompiled libraries that include BridJ, Apache Commons, ScalaMeter, Scala Virtualized, LMS and finally lms-intrinsics. SBT automatically pulls all dependencies and their corresponding versions.
- Data set: Our experiments operate with random data, requiring no data set.
- Run-time environment: lms-intrinsics can run on any JVM that supports LMS and any operating system supported by the same JVM. Similarly, NGen could work in any JVM that supports LMS, reflection and native code invocation; however, our focus has been on the HotSpot JVM only, supporting Windows, Linux and Mac OS X. Our results are most conveniently replicated in a Unix environment.
- Hardware: The code generated by lms-intrinsics can run on any x86-64 architecture that supports at least one subset of the Intel intrinsic functions. We recommend a Haswell machine to obtain results comparable to those presented in the paper.
- Run-time state: We perform our tests using a warm cache scenario, warming the code and data cache many times before measurements begin. We advise replicating our experiments with minimal interference from other applications running on the system, and with technologies for frequency scaling and resource sharing disabled.
- Output: NGen generates a performance profile of each algorithm presented in this paper.
- Experiment workflow: We use SBT not only to compile the code, but also to run the experiments.
- Experiment customization: Customization is possible by implementing any vectorized code as a Scala eDSL.
- Publicly available: Yes
The precompiled SIMD eDSL library, as well as our JVM runtime, including the supporting experiments, are publicly available through GitHub at the following links:
lms-intrinsics is also available through Maven, and can be used through SBT directly:
libraryDependencies += "ch.ethz.acl" %% "lms-intrinsics" % "0.0.3-SNAPSHOT"
lms-intrinsics as well as NGen are able to generate C code that can run on any x86-64 architecture supporting Intel ISAs. However, the full set of our experiments requires at least a Haswell machine:
- SAXPY and MMM algorithms are implemented using FMA ISAs, and therefore require at least a Haswell-enabled processor. Broadwell, Skylake, Kaby Lake or later would also work.
- The dot product of the quantized arrays relies on FMA flags, but also uses the hardware random number generator, requiring the RDRAND ISA, as well as FP16C to deal with half-precision floats.
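The ISA requirements above can also be checked by hand before running the experiments. The following is a minimal sketch and not part of the artifact (NGen performs its own CPUID inspection): it assumes a Linux machine that exposes CPU features in /proc/cpuinfo, and it assumes that FMA, RDRAND and FP16C appear there under the flag names fma, rdrand and f16c. The class name CpuFlagCheck and its methods are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical helper, not part of NGen or lms-intrinsics.
final class CpuFlagCheck {
    // Assumed /proc/cpuinfo names for the FMA, RDRAND and FP16C ISAs.
    static final String[] REQUIRED = {"fma", "rdrand", "f16c"};

    /** Returns the required flags that are absent from a "flags : ..." line. */
    static Set<String> missingFlags(String flagsLine) {
        int colon = flagsLine.indexOf(':');
        String list = colon >= 0 ? flagsLine.substring(colon + 1) : flagsLine;
        Set<String> present = new HashSet<>(Arrays.asList(list.trim().split("\\s+")));
        Set<String> missing = new LinkedHashSet<>();
        for (String flag : REQUIRED) {
            if (!present.contains(flag)) {
                missing.add(flag);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        Path cpuinfo = Paths.get("/proc/cpuinfo");
        if (!Files.isReadable(cpuinfo)) {
            System.out.println("/proc/cpuinfo is not available on this system");
            return;
        }
        try (Stream<String> lines = Files.lines(cpuinfo)) {
            // Use the flags line of the first listed core.
            String flags = lines.filter(l -> l.startsWith("flags")).findFirst().orElse("");
            System.out.println("Missing ISAs: " + missingFlags(flags));
        }
    }
}
```

An empty result suggests that the full set of experiments can run; a machine that lacks only rdrand or f16c should still be able to run the SAXPY and MMM experiments.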
We recommend disabling Intel Turbo Boost and Hyper-Threading technologies to avoid the effects of frequency scaling and resource sharing on the measurements. On machines with BIOS firmware, both can be disabled in the BIOS settings. Many Apple machines, such as the MacBook, do not have user-accessible BIOS firmware, and can only disable Turbo Boost using external kernel modules such as Turbo Boost Switcher (https://github.com/rugarciap/Turbo-Boost-Switcher).
lms-intrinsics is a self-contained precompiled library and all of its software dependencies are handled automatically through Maven tools such as SBT. To build and run NGen, the following dependencies must be met:
- Git client, used by SBT to resolve dependencies.
- Java Development Kit (JDK) 1.8 or later.
- C compiler such as GCC.
After installing the dependencies, the binary executables must be available in the $PATH, so that SBT can process all compilation phases and execute the experiments. Make sure that the following commands work:
git --version
gcc --version
java -version
javac -version
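The same check can be scripted. The sketch below is illustrative only and is not part of the artifact: it looks for each required executable in the directories listed in the PATH environment variable, also trying common Windows suffixes. The class name PathCheck and its methods are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of NGen.
final class PathCheck {
    // Suffixes tried for each tool; the non-empty ones cover Windows.
    private static final String[] SUFFIXES = {"", ".exe", ".bat", ".cmd"};

    /** Returns true if an executable file named tool exists in a directory of pathValue. */
    static boolean onPath(String tool, String pathValue) {
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String suffix : SUFFIXES) {
                File candidate = new File(dir, tool + suffix);
                if (candidate.isFile() && candidate.canExecute()) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Returns the tools that were not found in pathValue, in the given order. */
    static List<String> missingTools(String pathValue, String... tools) {
        List<String> missing = new ArrayList<>();
        for (String tool : tools) {
            if (!onPath(tool, pathValue)) {
                missing.add(tool);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        if (path == null) {
            path = "";
        }
        System.out.println("Not found in PATH: "
                + missingTools(path, "git", "gcc", "java", "javac"));
    }
}
```

Finding a file is weaker than running the command: a tool may be present but broken, so the version commands above remain the authoritative check.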
It is also important to ensure that the installed JVM has an architecture that GCC can compile to. This is particularly important on Windows, where the MinGW port of GCC will fail to compile code for a 64-bit JVM.
The artifact can be cloned from the GitHub repository:
git clone https://github.com/astojanov/NGen
The artifact already includes a precompiled version of SBT. Therefore, to start the SBT console, we run:
cd NGen
# For Unix users:
./bin/sbt/bin/sbt
# For Windows users:
bin\sbt\bin\sbt.bat
Once started, we can compile the code using:
Once invoked, SBT will automatically pull lms-intrinsics as well as all other dependencies and start the compilation.
Once SBT compiles the code, we can proceed with evaluating our experiments. We do this through the SBT console. To inspect the testing platform with the NGen runtime, we use:
> test-only cgo.TestPlatform
The runtime will be able to inspect the CPU, identify available ISAs and compilers and inspect the current JDK. If the test platform is successfully identified, we can continue with the experiments.
Generating SIMD eDSLs.
The lms-intrinsics bundle includes an automatic generator of SIMD eDSLs, invoked by:
> test-only cgo.GenerateIntrinsics
The Scala eDSLs (coupled with statistics) will be generated in
Explicit vectorization in the JVM.
To run the experiments depicted in our work, we use:
> test-only cgo.TestSaxpy
> test-only cgo.TestMMM
> test-only cgo.TestPrecision
For SAXPY, if the testing machine is not Haswell-based, we provide an architecture-independent implementation:
> test-only cgo.TestMultiSaxpy
Each result shows the size of our microbenchmarks, and the obtained performance in flops/cycle.
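To make the flops/cycle unit concrete, the sketch below shows one way such a figure can be derived. It is an illustration under stated assumptions and not the artifact's measurement code: it assumes the conventional operation counts of 2n flops for SAXPY on vectors of length n and 2n^3 flops for the multiplication of n x n matrices, and it approximates the cycle count as elapsed time multiplied by a fixed clock frequency, which is only meaningful when frequency scaling is disabled. The class name FlopsPerCycle and its methods are hypothetical.

```java
// Hypothetical helper, not part of NGen.
final class FlopsPerCycle {
    /** Assumed flop count for SAXPY: one multiply and one add per element. */
    static long saxpyFlops(long n) {
        return 2L * n;
    }

    /** Assumed flop count for multiplying two n x n matrices. */
    static long mmmFlops(long n) {
        return 2L * n * n * n;
    }

    /** Converts a flop count and a runtime into flops/cycle at a fixed clock frequency. */
    static double flopsPerCycle(long flops, double seconds, double frequencyHz) {
        double cycles = seconds * frequencyHz;
        return flops / cycles;
    }

    public static void main(String[] args) {
        // Example numbers only: SAXPY of size 1024 taking 1 microsecond at 2 GHz.
        double result = flopsPerCycle(saxpyFlops(1024), 1e-6, 2e9);
        System.out.println("flops/cycle: " + result);
    }
}
```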
Evaluation and expected result
In the evaluation of the experiment workflow, we expect LMS to produce correct vectorized code using lms-intrinsics. Furthermore, we expect our performance results to be consistent with the results shown in this work, outperforming the JVM on the microarchitectures that support our experiments. Finally, we expect the automatic generation of eDSLs to be easily adjustable to subsequent updates of the Intel intrinsics.
There are many opportunities for customization. We can use lms-intrinsics to easily develop vectorized code, and we can use ScalaMeter to adjust the benchmarks.
Developing SIMD code.
The NSaxpy.scala class provides detailed guidelines for the usage of SIMD in Scala. Following the comments in the file, as well as the structural flow of the program, one can easily modify the skeleton to perform other types of vectorized computation.
Each performance experiment uses ScalaMeter and is implemented as a Scala class. The Matrix-Matrix-Multiplication experiment resides in src/ch/ethz/acl/ngen/mmm/. The implementation allows changes to various aspects of the benchmarks, including the size and the values of the input data, warm-up times, different JVM invocations, etc.
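The role of these parameters can be illustrated without ScalaMeter. The following plain-Java sketch is not the artifact's benchmark code: it times a scalar SAXPY on random data of a configurable size, runs a configurable number of unmeasured warm-up iterations first, and reports the median of the measured runs. The class name WarmSaxpy, its methods and the fixed random seed are hypothetical choices for this example.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-in for a benchmark class; the artifact uses ScalaMeter instead.
final class WarmSaxpy {
    /** Scalar SAXPY: y[i] += a * x[i] for every element. */
    static void saxpy(float a, float[] x, float[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] += a * x[i];
        }
    }

    /** Runs warmups unmeasured iterations, then returns the median time of runs iterations. */
    static long medianNanos(int n, int warmups, int runs) {
        Random random = new Random(42); // fixed seed, chosen arbitrarily
        float[] x = new float[n];
        float[] y = new float[n];
        for (int i = 0; i < n; i++) {
            x[i] = random.nextFloat();
            y[i] = random.nextFloat();
        }
        // Warm-up: lets the JIT compile the loop and brings the data into cache.
        for (int i = 0; i < warmups; i++) {
            saxpy(1.5f, x, y);
        }
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            saxpy(1.5f, x, y);
            times[i] = System.nanoTime() - start;
        }
        Arrays.sort(times);
        return times[runs / 2];
    }

    public static void main(String[] args) {
        System.out.println("median ns: " + medianNanos(1 << 16, 1000, 101));
    }
}
```

Changing the size, the number of warm-up iterations or the number of measured runs corresponds to the kinds of adjustment described above; a dedicated framework additionally handles separate JVM invocations and outlier treatment, which this sketch does not attempt.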