Unpredictable execution time of BLASFEO functions caused by uninitialized padding elements of the matrix #103
I am not sure whether the issue is related to the details of how the time is measured in the benchmark. But the difference is surprising, and I have not seen such strange results when benchmarking non-BLASFEO code.
Command: ./bench --benchmark_filter=BM_gemm_blasfeo* --benchmark_repetitions=5 --v=2
You can see that the benchmark was run with an increasing iteration count to determine the required number of iterations, 75769. It then ran 5 batches of 75769 calls each.
The issue is not reproduced if BLASFEO is compiled with
The prefetch instructions are my suspect:
I commented out all the prefetch instructions.
Is the issue happening for
The important difference between the benchmark runs is where the memory is allocated. I found out that the alignment of the array data has a dramatic effect on the time. Below are the results from the benchmark run with different allocated-memory alignments (last parameter). You can see that when the alignment is 0x20 the time varies a lot; when it is 0x1000, it does not.
For the BLASFEO API, 32 bytes is the minimum alignment to avoid segfaults, but 64 bytes is the alignment assumed for performance. Still, performance deviations should not be that large. On top of that, what you benchmark is not just the call to the linear algebra routine, but also the memory allocation, and in your script the memory allocation happens differently for the BLASFEO case. In summary, it is not clear to me whether the difference in computational time comes from the execution of the BLASFEO routine or from the memory allocation routine.
Here is the updated benchmark, where alignment is added as an extra parameter. The results from the run ./bench --benchmark_filter=BM_gemm_blasfeo* --benchmark_repetitions=5 --benchmark_report_aggregates_only=true are below:
You can see that when the alignment is 32 the standard deviation of the time is huge, and with the alignment 4096 it is very small. However, surprisingly, with the large alignment the performance is often many times worse than with the small one; compare:
This is very puzzling.
No, it is not. It benchmarks only the call to the routine. If you look in the source, only what is inside the benchmark loop is measured.
OK, sorry, I didn't get how that smart auto-benchmarking loop works. Alright then, memory allocation is not benchmarked, only the linear algebra routine call.
It seems that it is not the alignment itself that matters, but the actual address of the data arrays. This is confirmed by the following test, where the alignment is kept at 0x20 bytes, but the arrays are reused between different runs of the benchmark with the same sizes:
You can see that keeping the location of the data fixed removes the variability of the timings. However, one would expect the same performance regardless of the location of the data, which is not the case. Interestingly, you can also see from the result that the time does not necessarily increase with the size of the problem. This is probably caused by the large variation of the time due to different data locations.
Although re-running the benchmarks gives very similar results, running them in reverse order gives very different results, probably due to the changed placement of memory chunks:
The current version of the benchmark sources: gemm-benchmark.zip
The same behavior is observed on a machine with AMD Bulldozer and BLASFEO built with
For syscop members, the benchmark code is accessible in the tmpc repo: git clone git@gitlab.syscop.de:mikhail.katliar/tmpc.git -b 44-mpipm
cd tmpc/bench/blasfeo
make
./bench
You need to install Google Benchmark first, see https://github.com/google/benchmark#installation
An explanation of the difference between iterations and repetitions: the benchmark is run as follows. For every combination of parameters (m, n, k, alignment), a loop in which the function is called (let's call it a batch) is run several times. Then the mean, median, and standard deviation across the batches are computed.
This makes it clear that the performance penalty comes from the choice of the specific memory address. I would like to clarify that your chosen alignment, 32 bytes, is the minimum for correct behavior (i.e. avoiding segmentation faults), but it is not the one recommended in the BLASFEO API. As you can see e.g. here https://github.com/giaf/blasfeo/blob/master/auxiliary/blasfeo_stdlib.c#L36, if you use the BLASFEO functionality to allocate aligned memory, the address is aligned to typical cache-line boundaries, i.e. 64 bytes. Having a memory address aligned to 32 bytes but not to 64 bytes is expected to slightly increase the computational cost, since the matrix then occupies more cache lines, and therefore more memory has to be moved through the cache hierarchy. But the effect seen in some of your tests is much larger than the expected one. A possible explanation is that sometimes you also hit memory page boundaries, or something like that.
The same behavior is observed with an alignment of 0x40 bytes.
"Sometimes you also hit memory page boundaries" would also apply to BM_gemm_cblas, but we don't see such an effect there.
That depends on the implementation. E.g., OpenBLAS typically packs all data into internally allocated buffers (likely properly aligned), so unaligned data is accessed only once. Conversely, BLASFEO (and especially the BLASFEO API) operates natively on some or all of the matrix arguments, so unaligned data is accessed more times.
But do you agree that the described behavior of BLASFEO should be fixed?
You can try out MKL instead of OpenBLAS, for the variants 'NN' and 'NT' in the Fortran BLAS API (so check which variants correspond to these in the CBLAS API). Depending on the outcome, we can talk about "fixes".
Well, once we know about this issue, one can no longer say anything meaningful about BLASFEO performance. Instead of saying "this function executes in 30 ns", one should say "this function executes in between 30 and 300, or maybe 500 ns, depending on where you allocate memory, and we don't know how it depends on that".
Before undertaking a complete reimplementation of BLASFEO (since the choices about internal memory formats are at its very foundations), I will investigate the issue personally and with independent benchmarks. If there is some "bug" to be fixed (i.e. something specific and non-fundamental that triggers the behavior), it will be fixed; but if this comes only from the implementation choices, I guess a note about providing a minimum alignment for maximum performance will have to do. In some benchmarks I did in the past, I encountered some performance degradation in MKL if memory is not aligned, so if MKL is fine with that, I'm fine too ;) Thanks for the deep investigation of the issue, and I'll keep you updated on the next steps!
Results with MKL BLAS (Fortran API, although I don't see why the Fortran API vs. the CBLAS API would make a difference):
@giaf I would appreciate it if you did independent benchmarking. I have created a separate repo with my benchmark code here: https://github.com/mkatliar/blasfeo_benchmark. Maybe it is a good place to start and to add your independent benchmarks.
As investigated here mkatliar/blasfeo_benchmark#4 (comment), the issue is due to the presence of denormals in the memory passed to BLASFEO to create the matrices. In the case of matrices whose size is not a multiple of the minimum kernel size (usually 4x4), a matrix padded up to a multiple of 4x4 is actually used for the computation, and the extra computed elements are discarded when writing the result from registers to memory. Therefore, if there are denormals in the "dark" part of the memory of the padded matrix, they do not affect the correctness of the result, but they may severely slow down computations, as in this case. As to why there are denormals in the memory at all (and consistently across different test machines and OSs), this may be due to what Google Benchmark itself does.
It is BLASFEO's responsibility that the matrix structure is in a valid state after creation. I don't consider this issue as "Closed". To fix the issue, the matrix creation function should initialize the padding elements.
We will write a "best practice" note in the BLASFEO repo, but what you propose is not the solution. If you know what you are doing, you can put the data in the memory already and then pass it to the create routine.
On top of that,
"What to do" depends on the internal storage format of the matrices and on the panel size. These can change depending on the BLASFEO target architecture and on the internal implementation of BLASFEO, which as we know can undergo changes. The layout of the allocated memory is not documented and is unknown to the library user. The user should be abstracted from this knowledge, and this is the purpose of the API. The way you do it creates potential problems for potentially many users in many places. Instead of solving the problem in one place and making the user safe (and saving his or her time), you create these potential problems for the sake of vague performance considerations.
Sure, it is undocumented exactly because it is unsafe. But nonetheless, since we are the developers of BLASFEO, we can use knowledge about its internals in our own software.
Just to be clear, the results computed by the routine are CORRECT. This does not create correctness issues. Computational speed can be lower if some good-practice rules are not followed. We were aware of slight performance drops in the case of unaligned memory, but they are on the order of a few percent, and therefore not of interest to the normal user. We were not aware of the large performance penalty (on the order of 10x or more) that can arise from having denormals in the padding memory. Now that we are aware of this, we will add a note to make sure that other users can avoid this issue. We thank you for your performance issue report and for the effort and help in finding the reason for it. The time you invested will not be wasted.
Attached is the source code of a benchmark which measures the execution time of the gemm BLAS routine implemented in BLASFEO and in a BLAS library of choice for different matrix sizes: gemm-benchmark.zip

The benchmark uses the Google Benchmark library. Running the benchmark with the following command line gives the following output:

One can see that, according to the benchmark, the execution time of BLASFEO dgemm() varies a lot, whereas the execution time of the openblas implementation is very stable.

BLASFEO was built for the INTEL_HASWELL architecture. Compiler: g++ (Ubuntu 8.3.0-6ubuntu1) 8.3.0. CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz (Skylake architecture). OS: Ubuntu Linux 19.04 with the 5.0.0-29-generic kernel.