
This experiment explores the performance differences between three approaches to parallelism: OpenMP, SIMD, and OpenCL.


Parallel-Programming-Strategy-Comparison

Erik Heaney

June 13, 2017

Methodology

Four test programs were created to compare the performance of several parallelism strategies. A single-threaded OpenMP program served as the scientific control, while the multi-threaded OpenMP program, the Intel SSE SIMD program, and the OpenCL program running on a GPU served as the independent variables. Each of the four programs performed an autocorrelation calculation on every datum within a data set of size 32,768. Autocorrelation is a statistical method of measuring how well a random data set can be expressed as a wide-sense stationary random process, or in other words, whether the data have periodicity. It requires calculating the correlation between the signal and a copy of itself shifted by a given time difference. Since each datum in the set had to be correlated at every possible discrete time difference, the computation required 32,768² multiply-and-add operations, or 1,073,741,824 in total. Thus, autocorrelation on a large data set was both a realistic operation for a computer scientist to perform and an accurate measure of parallelism performance.

The OpenMP solution used a simple pragma statement before the primary for loop, with eight threads. The SIMD solution used the Streaming SIMD Extensions (SSE), an Intel instruction set extension for the x86 architecture, and ran on Oregon State University's Flip server. Finally, the OpenCL solution used an NVIDIA Titan Black GPU as the compute device with a local group size of 256; this program was executed on Oregon State University's Rabbit server. The results were collected in separate CSV files, and the graphs were produced using Microsoft Excel.

Discussion

The performance results of this experiment demonstrate the varying degrees of parallelism achievable across different computing strategies. The OpenMP multi-threaded solution achieved a 2.6× speedup. Since eight threads were used, this corresponds to a parallel efficiency of roughly 0.32 (2.6/8) and, by Amdahl's law, a parallel fraction of about 0.70. This relatively modest parallelizability may be explained by a few factors. First, the large number of calculations could lead to temporal incoherence: the pre-fetching distance of the CPU's L2 cache may cause performance degradation as the program executes. Furthermore, the thread count was selected arbitrarily. In further experiments, it would be advisable to measure performance across a range of thread counts, so that the highest-performing configuration could be used for cross-strategy comparison. Nevertheless, it is this author's assumption that any improvement in parallelizability via OpenMP would be marginal, and that this experiment provides sufficient evidence.

The SSE SIMD solution experienced a 5.7× speedup in comparison to the single-threaded solution, and a 2.17× speedup in comparison to the multi-threaded solution. This increase in performance can be explained by the test program's embarrassing parallelism: since the data set is large, it lends itself to data parallelism. Furthermore, SSE packs four single-precision values into each 128-bit register, so one packed multiply and one packed add replace four scalar multiplies and four scalar adds. (Note that SSE itself does not provide a fused multiply-add instruction; FMA arrived in a later x86 extension.) The SIMD solution could possibly achieve greater performance by adjusting the pre-fetching distance in order to achieve greater temporal coherence, although again it may be assumed that the performance-enhancing effects would be marginal.

Finally, the OpenCL solution experienced a 19.7× speedup. This impressive performance increase can be explained by two factors: the embarrassing parallelism of the test program and the raw power of the NVIDIA Titan Black. As mentioned above, autocorrelation lends itself to data parallelism, and a GPU's architecture, with its many hundreds of arithmetic units, is designed specifically to perform data-parallel work on large data sets. Furthermore, the GTX Titan Black is a powerful, enthusiast-level GeForce 700-series GPU with 2,880 CUDA cores. Although comparisons between CPUs and GPUs are notoriously fraught, it would be inappropriate to perform this cross-strategy comparison without acknowledging the hardware differences between the three tests. Simply put, the Titan Black is a more advanced piece of hardware than the Intel® Core™ i7 CPU in this author's laptop. However, it is likely that similar performance results would be achieved with a weaker graphics card. In a more thorough experiment, performance would be measured across a variety of graphics cards, alongside a variety of local group sizes. Uncontrolled variability aside, this author argues that for identical operations performed on a large data set, an OpenCL solution using a GPU as the compute device yields the greatest performance.
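On the device side, a kernel along these lines would assign one work-item per lag. The kernel and buffer names are hypothetical, and the host-side setup (a global size of 32,768 and the local group size of 256 used in the experiment) is omitted:

```c
/* OpenCL kernel sketch: one work-item computes one lag t.
   The host doubles dA so that dA[i + t] never reads out of bounds. */
__kernel void AutoCorrelate(__global const float *dA, __global float *dSums) {
    int t = get_global_id(0);        /* this work-item's lag  */
    int n = get_global_size(0);      /* total number of lags  */
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += dA[i] * dA[i + t];    /* same multiply-add core as the CPU versions */
    dSums[t] = sum;
}
```

Because every work-item runs the same straight-line loop over independent data, the kernel maps cleanly onto the Titan Black's thousands of arithmetic units, which is exactly the data parallelism the discussion above describes.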
