
SIMD Programming Project

Members: Guillermo, Ty, Ughoc

CSC612M G03

Introduction

This project explores different ways to efficiently calculate the operation A[i] = B[i] + C[i] × D[i] for large arrays of floating-point numbers. Four kernels are implemented:

  • Standard C code
  • x86-64 assembly (non-SIMD)
  • SIMD AVX2 assembly using XMM registers
  • SIMD AVX2 assembly using YMM registers

All kernels were benchmarked in both DEBUG and RELEASE build modes to compare performance.

For each method, we:

  • Initialize vectors with known values
  • Measure execution time over large input sizes and multiple runs
  • Check that results match the C reference output
  • Demonstrate how each version deals with vector sizes that aren't a perfect fit for SIMD instructions

All source code, project files, timing results, screenshots, and correctness checks are included in this repository. From these results, we compare performance and discuss which implementation works best and why.
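To keep the Debug and Release comparisons fair, only the kernel itself should sit inside the timed region. The following is a minimal C sketch of that measurement step; the clock()-based timer, the shared function-pointer signature, and the averaging scheme are assumptions for illustration, not the project's actual harness.

```c
/* Hedged timing sketch: the timer choice (clock()), the run count, and the
 * averaging scheme are assumptions; only the kernel call is timed. */
#include <stddef.h>
#include <time.h>

/* Assumed common signature for the four kernels (C, x86-64, XMM, YMM). */
typedef void (*triad_fn)(size_t n, float *A, const float *B,
                         const float *C, const float *D);

/* Returns the average kernel time in milliseconds over `runs` invocations. */
double time_kernel(triad_fn kernel, size_t n, float *A,
                   const float *B, const float *C, const float *D, int runs)
{
    double total_ms = 0.0;
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();        /* start just before the kernel       */
        kernel(n, A, B, C, D);          /* A[i] = B[i] + C[i] * D[i]          */
        total_ms += (double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC;
    }
    return total_ms / runs;
}
```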

Program Overview

  • Write the kernel in (1) C; (2) x86-64 assembly (non-SIMD); (3) x86-64 SIMD AVX2 assembly using XMM registers; (4) x86-64 SIMD AVX2 assembly using YMM registers. The kernel performs a vector triad on vectors B, C, and D and places the result in vector A. Each kernel version is timed for vector sizes n = {2^20, 2^26, 2^30}.

  • Input: Scalar variable n (integer) contains the length of the vectors; vectors B, C, and D are single-precision floats.

    • Input Initialization: Arrays B, C, and D were initialized using trigonometric functions (sin, cos, tan), as shown in the screenshot below (a C sketch of the full kernel, including this initialization, follows this list):

      (screenshot: array initialization code)

  • Process: A[i] = B[i] + C[i] * D[i]

  • Output: Store the result in vector A and display its first 5 and last 5 elements for verification.
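For reference, here is a minimal C sketch of the scalar kernel described above. The exact trigonometric arguments and the print format are assumptions, since the actual initialization code appears only in the screenshot.

```c
/* Hedged C sketch of the reference kernel; initialization expressions and
 * output format are illustrative assumptions. Link with -lm. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1u << 20;                            /* n = 2^20, 2^26, or 2^30 */
    float *A = malloc(n * sizeof *A), *B = malloc(n * sizeof *B);
    float *C = malloc(n * sizeof *C), *D = malloc(n * sizeof *D);
    if (!A || !B || !C || !D) return 1;

    for (size_t i = 0; i < n; i++) {                /* trig-based initialization */
        B[i] = sinf((float)i);
        C[i] = cosf((float)i);
        D[i] = tanf((float)i);
    }

    for (size_t i = 0; i < n; i++)                  /* vector triad */
        A[i] = B[i] + C[i] * D[i];

    for (size_t i = 0; i < 5; i++)                  /* first 5 elements */
        printf("A[%zu] = %f\n", i, A[i]);
    for (size_t i = n - 5; i < n; i++)              /* last 5 elements */
        printf("A[%zu] = %f\n", i, A[i]);

    free(A); free(B); free(C); free(D);
    return 0;
}
```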

Program output with Execution Time and Correctness Check

Debug Mode

  • 2^20: (screenshot: program output, execution time, and correctness check)

  • 2^26: (screenshot: program output, execution time, and correctness check)

  • 2^30: (screenshot: program output, execution time, and correctness check)

Release Mode

  • 2^20: (screenshot: program output, execution time, and correctness check)

  • 2^26: (screenshot: program output, execution time, and correctness check)

  • 2^30: (screenshot: program output, execution time, and correctness check)

Boundary Check

XMM

(screenshot: XMM loop setup and boundary-check assembly)

rax takes its value from rcx, which holds the total number of elements in the array. shr rcx, 2 divides that count into groups of 4, one SIMD iteration each. The boundary check comes from and rax, 3, which keeps the remainder of the element count divided by 4 (rax % 4) so that the leftover elements can be processed afterward.

(screenshot: XMM scalar cleanup loop assembly)

This part of the program computes the result one element at a time instead of four at a time, looping at most 3 times. (A C sketch of this quotient/remainder split follows the sample below.)

Sample:

  • Array Size = 4099 (4099 / 4 = 1024 remainder 3)
    (screenshot: run with array size 4099, showing the 3 remainder elements handled correctly)
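The quotient/remainder split can also be expressed as a C sketch that mirrors the shr/and logic above; the variable names and loop structure are illustrative, not a transcription of the actual assembly. For YMM the group size becomes 8 and the mask 7.

```c
/* Hedged C sketch of the SIMD boundary handling.
 * Group size 4 mirrors XMM (shr rcx, 2 / and rax, 3); YMM uses 8 and 7. */
#include <stddef.h>

void triad_with_cleanup(size_t n, float *A, const float *B, const float *C, const float *D)
{
    size_t groups    = n >> 2;   /* shr rcx, 2: number of full 4-element groups */
    size_t remainder = n & 3;    /* and rax, 3: leftover elements (n % 4)       */
    size_t i = 0;

    for (size_t g = 0; g < groups; g++, i += 4) {
        /* In the assembly, one SIMD iteration processes these four elements at once. */
        for (size_t k = 0; k < 4; k++)
            A[i + k] = B[i + k] + C[i + k] * D[i + k];
    }

    for (size_t r = 0; r < remainder; r++, i++)   /* scalar cleanup, at most 3 iterations */
        A[i] = B[i] + C[i] * D[i];
}
```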

YMM

A similar process can be done for YMM.

(screenshot: YMM loop setup and boundary-check assembly)

rax takes its value from rcx, which holds the total number of elements in the array. shr rcx, 3 divides that count into groups of 8, one SIMD iteration each. The boundary check comes from and rax, 7, which keeps the remainder of the element count divided by 8 (rax % 8) so that the leftover elements can be processed afterward.

(screenshot: YMM scalar cleanup loop assembly)

This part of the program computes the result one element at a time instead of eight at a time, looping at most 7 times.

Sample:

  • Array Size = 24007 (24007 / 8 = 3000 remainder 7)

    (screenshot: run with array size 24007, showing the 7 remainder elements handled correctly)

Summary Table of Execution Time and Comparative Result

Debug

In Debug mode, execution time increases as the array size increases. The order from fastest to slowest is SIMD YMM, SIMD XMM, x86-64 assembly, and C.

(screenshot: Debug mode summary table of execution times)

Release

As in Debug mode, execution time in Release mode increases with array size. SIMD YMM is still the fastest. However, C became faster than SIMD XMM and x86-64 assembly, so the order from fastest to slowest is SIMD YMM, C, SIMD XMM, and x86-64 assembly.

(screenshot: Release mode summary table of execution times)

SIMD Speed-up (XMM vs YMM)

  • Debug

    In Debug mode, the SIMD YMM program is 1.084 times faster than SIMD XMM.

    (screenshot: Debug speed-up table)

  • Release

    In Release mode, the SIMD YMM program is 1.051 times faster than SIMD XMM.

    (screenshot: Release speed-up table)
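These ratios are read here as ratios of the measured average kernel times (our assumption about how the tables were computed):

$\text{Speed-up}_{\text{YMM vs XMM}} = \frac{T_{\text{XMM}}}{T_{\text{YMM}}}$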

Analysis of Results

Overall Performance Trend

From the results, we observe a clear performance hierarchy in the Debug build: SIMD (XMM and YMM) provided up to a 4.4× speedup over the baseline C implementation, showing the strong advantage of vectorization when compiler optimizations are disabled. In contrast, in Release mode all implementations converge to nearly the same performance, with differences shrinking to within ±5%. This confirms that the Release compiler automatically vectorized and optimized the C version, minimizing the advantage of manual SIMD.

The x86-64 assembly implementation is faster than the plain C version because it removes overhead added by the compiler, which includes redundant instructions, bounds checks, and function call abstractions, especially in Debug mode. This allows for direct control over registers and memory addressing. Additionally, these minor performance differences may also result from how files are loaded or handled dynamically at runtime. For example, external file access or dynamic page loading can introduce slight variations in execution time.

Low Arithmetic Intensity

The kernel performs only two floating-point operations per iteration (one multiplication and one addition) but transfers 16 bytes of memory: three loads (B[i], C[i], and D[i]) and one store (A[i]), at 4 bytes per single-precision element.

That yields a low arithmetic intensity of:

$\text{Arithmetic Intensity} = \frac{2 \text{ flops}}{16 \text{ bytes}} = 0.125 \text{ flop/byte}$

Because the arithmetic intensity is low, performance is limited by memory bandwidth, not by ALU throughput. Once the arrays exceed cache capacity (at n = 2^26 or 2^30), data must be fetched from DRAM. Even if the CPU executes wider SIMD operations, the bottleneck shifts to how fast data can be supplied to the registers. Thus, YMM cannot be 2× faster than XMM, since both are constrained by the same DRAM bandwidth.
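As a rough illustration (a back-of-the-envelope bound, not a measured figure), with 16 bytes moved per element and a sustained memory bandwidth $B_{\text{mem}}$, every kernel needs at least

$T_{\text{kernel}} \geq \frac{16n \text{ bytes}}{B_{\text{mem}}}$

regardless of SIMD width, which is why the XMM and YMM versions converge for large n.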

YMM and XMM

Although YMM registers are 256 bits wide (processing 8 floats) versus XMM’s 128 bits (4 floats), the observed speedup is only around 1.08× in practice. Several factors explain this:

  1. Memory Bandwidth Saturation. Doubling the vector width does not double the available bandwidth. The memory subsystem already operates near its limit, so YMM gains little in sustained throughput.
  2. Same Memory Bottleneck. Once memory throughput is saturated, instruction-level parallelism or SIMD width no longer impacts total runtime.

In short, even though YMM processes more elements per instruction, the bottleneck is not the compute pipeline but data movement, which results in only a slight speedup.

Debug vs. Release Behavior

In Debug mode, there is a large difference between the C and SIMD implementations. Debug builds disable optimizations such as loop unrolling, inlining, and auto-vectorization. This forces the C version into simple scalar operations, while the SIMD assembly manually leverages the hardware vector units, which explains the roughly 4× gain in Debug mode.

In Release mode, however, the compiler automatically vectorizes and optimizes the C code using SSE or AVX instructions internally. As a result, manually written SIMD code provides no major advantage, and all four implementations achieve similar performance. The small performance differences are dominated by memory latency variations and cache behavior, not by arithmetic throughput.

Vector Size

When comparing array sizes, the smallest size (2^20) largely fits in cache. SIMD achieves its best relative speed here because the working set resides mostly in the L2/L3 caches, allowing the CPU to exploit full vector throughput.

At 2^26 and 2^30, execution times scale almost linearly with array size, indicating that performance is dominated by memory access, not computation.

These results suggest that beyond a certain point, increasing SIMD width offers diminishing returns due to saturated memory bandwidth.

This behavior closely matches the Roofline performance model, where the compute-bound region (small data) transitions to a memory-bound region (large data).
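In Roofline terms, attainable performance is bounded by

$\text{Attainable FLOP/s} = \min\left(\text{Peak FLOP/s},\ \text{Arithmetic Intensity} \times \text{Memory Bandwidth}\right)$

and with an arithmetic intensity of only 0.125 flop/byte, the memory term dominates, placing this kernel on the memory-bound side of the roof for the large array sizes.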

Boundary Handling

Boundary handling was correctly implemented for cases where the number of elements isn’t divisible by the SIMD width (4 for XMM, 8 for YMM).

For large arrays like 2^30, the remainder cost is negligible (<0.01%), but the correctness check confirms that both SIMD implementations handle the remainder elements properly.
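As a quick check of that estimate: for n = 2^30 with YMM's width of 8, at most 7 elements fall into the scalar cleanup loop, i.e.

$\frac{7}{2^{30}} \approx 6.5 \times 10^{-9}$

of the total work, far below 0.01%.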

Key Insights

The vector triad is memory-bound, meaning that beyond a certain point, performance is determined by memory throughput, not CPU compute power.

SIMD improves efficiency by processing multiple elements per instruction, but its advantage caps out once data movement dominates.

For workloads with higher arithmetic intensity (e.g., matrix multiplication or Fast Fourier Transform), YMM would most likely show a much larger speedup.

Discussion & Conclusion

Team Insights and Challenges

During this project, our group faced several key challenges:

  • Implementing multiple kernels: Ensuring that every version (C, x86-64 assembly, SIMD XMM, SIMD YMM) produced correct and consistent results across large input sizes. Handling boundary elements not aligned with the SIMD register width was especially tricky.
  • Debugging assembly code: Proper memory management and interfacing C with assembly were sources of bugs and required careful debugging.
  • Accurate measurement: Measuring only the kernel runtime (not setup or I/O) was essential, especially for a clear comparison between Debug and Release builds.
  • Accessing the 5th parameter past the shadow space: Troubleshooting issues with accessing parameters, specifically the 5th argument passed on the stack just above the shadow space when calling functions from assembly, required extra care to follow the calling convention.
  • Memory limitations with large vector sizes: Running tests with the 2^30 vector size was difficult or impossible on some machines due to hardware memory limits. We had to accommodate this either by reducing the test size or using systems with higher memory capacity.

AHA Moments

  • Handling the SIMD "remainder" cases (when vector size isn’t a perfect multiple of register width) was a crucial technical realization that improved both correctness and efficiency.

  • Realizing the kernel is memory-bound: Our timing and analysis confirmed that for large arrays, RAM speed, not computation throughput, is the main bottleneck, so SIMD helps only up to a point.

  • Will SIMD YMM be faster than optimized C?
    Our results show that SIMD YMM is faster than the C implementation for large array sizes in both Debug and Release builds. For example, in Release mode, SIMD YMM reached a relative speed of 1.041x compared to C, while in Debug mode it was 4.382x faster. This showed that using wider YMM registers helps the program finish faster, even when memory speed is the main limiting factor.

  • Unique Methodology: Our boundary check/cleanup method for leftover elements after SIMD is one standard approach; other solutions can work as well, depending on hardware and coding preferences.

Lessons Learned

  • SIMD and register width matter: Wider registers (YMM) significantly improve performance for big data sets.
  • Test in both Debug and Release: Performance in Release builds is much closer to what modern compilers and hardware can really achieve.
  • Early correctness checks save time: Always use a C version as the reference for result validation.

References

Improving performance with SIMD intrinsics in three use cases. (2020, July 8). Stack Overflow Blog. https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases

Jelínek, J. (2023, December 8). Vectorization optimization in GCC. Red Hat Developer. https://developers.redhat.com/articles/2023/12/08/vectorization-optimization-gcc

NERSC. (n.d.). Roofline performance model. NERSC Documentation. https://docs.nersc.gov/tools/performance/roofline/

Roofline model: An overview. (n.d.). ScienceDirect Topics. https://www.sciencedirect.com/topics/computer-science/roofline-model

What is arithmetic intensity? (2025, April 6). Iguazio. https://www.iguazio.com/glossary/arithmetic-intensity/
