An attempt at achieving the theoretical best memory bandwidth of my machine.

Memory Bandwidth Demo

This project was written to support my quest to achieve the theoretical best memory bandwidth for reads and writes on my machine, as described in my blog post. For a Retina MacBook Pro, I expect 25.6 GB/s (23.8 GiB/s).
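The two figures differ only in units (10^9 bytes versus 2^30 bytes). Here is a minimal sketch of the arithmetic, assuming a dual-channel DDR3-1600 configuration with 64-bit channels. That configuration and the helper names are illustrative assumptions, not something stated in this README:

```c
#include <assert.h>
#include <stdio.h>

/* Peak bandwidth in bytes per second for a memory configuration.
 * The values used in sketch_print_peak() are ASSUMED for illustration:
 * 1600e6 transfers/s (DDR3-1600), 8 bytes per transfer, 2 channels. */
static double sketch_peak_bandwidth(double transfers_per_sec,
                                    int bytes_per_transfer, int channels) {
    assert(transfers_per_sec > 0 && bytes_per_transfer > 0 && channels > 0);
    return transfers_per_sec * bytes_per_transfer * channels;
}

/* Decimal gigabytes: 10^9 bytes. */
static double sketch_bytes_to_gb(double bytes) { return bytes / 1e9; }

/* Binary gibibytes: 2^30 bytes. */
static double sketch_bytes_to_gib(double bytes) {
    return bytes / (1024.0 * 1024.0 * 1024.0);
}

/* Prints the peak in both units for the assumed configuration. */
static void sketch_print_peak(void) {
    double peak = sketch_peak_bandwidth(1600e6, 8, 2);
    printf("%.1f GB/s (%.1f GiB/s)\n",
           sketch_bytes_to_gb(peak), sketch_bytes_to_gib(peak));
}
```

Under those assumptions, calling sketch_print_peak() reproduces the two numbers quoted above.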

I've tried a number of approaches:

  • read_memory_loop does a simple for (i = 0; i < size; *i++);
  • read_memory_sse uses SSE packed aligned loads to read 16 bytes at a time.
  • read_memory_avx uses AVX packed aligned loads to read 32 bytes at a time.
  • write_memory_loop does a simple for (i = 0; i < size; *i++ = 1);
  • write_memory_rep_stosq forces the use of the rep stosq instruction.
  • write_memory_sse uses SSE packed aligned stores to write 16 bytes at a time.
  • write_memory_nontemporal_sse uses nontemporal SSE packed aligned stores to write 16 bytes at a time and bypass the cache.
  • write_memory_avx uses AVX packed aligned stores to write 32 bytes at a time.
  • write_memory_nontemporal_avx uses nontemporal AVX packed aligned stores to write 32 bytes at a time and bypass the cache.
  • write_memory_memset is merely a wrapper for memset.
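To make the difference between the SSE variants concrete, here is a minimal sketch of an aligned SSE read loop and a nontemporal SSE write loop using SSE2 intrinsics. The function names and signatures are hypothetical, and this is not necessarily how functions.c implements them:

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics (x86 only) */
#include <stddef.h>
#include <stdint.h>

/* Reads `size` bytes, 16 at a time, with aligned SSE loads.
 * The loaded values are OR-ed together and returned so that the
 * compiler cannot discard the loads. */
static uint64_t sketch_read_sse(const void *buf, size_t size) {
    assert((uintptr_t)buf % 16 == 0 && size % 16 == 0);
    const __m128i *p = (const __m128i *)buf;
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < size / 16; i++) {
        acc = _mm_or_si128(acc, _mm_load_si128(&p[i]));
    }
    uint64_t halves[2];
    _mm_storeu_si128((__m128i *)halves, acc);
    return halves[0] | halves[1];
}

/* Writes `size` bytes, 16 at a time, with nontemporal stores.
 * These stores carry a hint to bypass the cache, so large writes
 * do not evict other cached data. */
static void sketch_write_nontemporal_sse(void *buf, size_t size, char value) {
    assert((uintptr_t)buf % 16 == 0 && size % 16 == 0);
    __m128i *p = (__m128i *)buf;
    const __m128i v = _mm_set1_epi8(value);
    for (size_t i = 0; i < size / 16; i++) {
        _mm_stream_si128(&p[i], v);
    }
    /* Nontemporal stores are weakly ordered; the fence makes them
     * globally visible before the function returns. */
    _mm_sfence();
}
```

Both functions assume a 16-byte aligned buffer whose size is a multiple of 16. A cached (temporal) write variant would use _mm_store_si128 in place of _mm_stream_si128 and would not need the fence.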

In addition, I tried wrapping all of the above in OpenMP to use multiple cores. Each function *_omp is the OpenMP-wrapped version of the function *. To enable this, compile with the flags -DWITH_OPENMP -fopenmp.
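One way such a wrapper could look is sketched below: each thread gets one contiguous chunk of the buffer and runs the single-threaded function on it. The function-pointer signature and all names here are hypothetical and chosen for illustration; the real wrappers may be structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical signature for a single-threaded write function. */
typedef void (*sketch_write_fn)(void *buf, size_t size);

/* Stand-in for one of the write functions: fills the buffer with 1s. */
static void sketch_write_memset(void *buf, size_t size) {
    memset(buf, 1, size);
}

/* Runs `fn` on one contiguous chunk of the buffer per thread.
 * Without -fopenmp the pragma is ignored and a single thread
 * processes the whole buffer. */
static void sketch_write_omp(sketch_write_fn fn, void *buf, size_t size) {
    assert(fn != NULL);
#pragma omp parallel
    {
#ifdef _OPENMP
        size_t nthreads = (size_t)omp_get_num_threads();
        size_t tid = (size_t)omp_get_thread_num();
#else
        size_t nthreads = 1, tid = 0;
#endif
        size_t chunk = size / nthreads;
        size_t begin = tid * chunk;
        /* The last thread also takes the remainder. */
        size_t len = (tid == nthreads - 1) ? size - begin : chunk;
        fn((char *)buf + begin, len);
    }
}
```

Note that the chunk boundaries here are not rounded to 16 or 32 bytes, so a wrapper around the aligned SSE or AVX variants would additionally need to keep each chunk suitably aligned.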

Compiling

Compiling this code requires a reasonably recent version of gcc or clang (although older versions of clang do not support OpenMP).

Results

./memory_profiler
           read_memory_rep_lodsl:  4.80 GiB/s
                read_memory_loop: 10.66 GiB/s
                 read_memory_sse: 13.44 GiB/s
                 read_memory_avx: 13.60 GiB/s
        read_memory_prefetch_avx: 15.06 GiB/s
               write_memory_loop: 12.84 GiB/s
          write_memory_rep_stosl: 19.22 GiB/s
                write_memory_sse:  8.93 GiB/s
    write_memory_nontemporal_sse: 12.83 GiB/s
                write_memory_avx:  8.91 GiB/s
    write_memory_nontemporal_avx: 12.65 GiB/s
             write_memory_memset: 12.84 GiB/s
       read_memory_rep_lodsl_omp: 19.01 GiB/s
            read_memory_loop_omp: 22.03 GiB/s
             read_memory_sse_omp: 22.18 GiB/s
             read_memory_avx_omp: 22.21 GiB/s
    read_memory_prefetch_avx_omp: 22.19 GiB/s
           write_memory_loop_omp: 22.13 GiB/s
      write_memory_rep_stosl_omp: 21.25 GiB/s
            write_memory_sse_omp:  9.70 GiB/s
write_memory_nontemporal_sse_omp: 22.13 GiB/s
            write_memory_avx_omp:  9.70 GiB/s
write_memory_nontemporal_avx_omp: 22.13 GiB/s
         write_memory_memset_omp: 22.14 GiB/s
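
Figures like these come from dividing the number of bytes touched by the elapsed wall-clock time. Here is a minimal sketch of that measurement, assuming a POSIX clock_gettime with CLOCK_MONOTONIC is available. The helper names are hypothetical and this is not the code in monotonic_timer.c:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Current time in seconds from a monotonic clock (assumes POSIX). */
static double sketch_now_seconds(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Converts a byte count and a duration into GiB/s (2^30 bytes). */
static double sketch_gib_per_second(size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return (double)bytes / (1024.0 * 1024.0 * 1024.0) / seconds;
}

/* Times `iterations` memset passes over the buffer and returns GiB/s,
 * or 0.0 if the timer was too coarse to measure the work. */
static double sketch_measure_memset(void *buf, size_t size, int iterations) {
    double start = sketch_now_seconds();
    for (int i = 0; i < iterations; i++) {
        memset(buf, i & 0xff, size);
        /* Compiler barrier (GCC/Clang syntax) so the stores are kept. */
        __asm__ volatile("" : : "r"(buf) : "memory");
    }
    double elapsed = sketch_now_seconds() - start;
    if (elapsed <= 0.0) {
        return 0.0;
    }
    return sketch_gib_per_second(size * (size_t)iterations, elapsed);
}
```

For a meaningful number the buffer should be much larger than the last-level cache; otherwise the loop measures cache bandwidth rather than memory bandwidth.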