
Possible performance regression in Debug mode on the stable branch #8520

Closed
rezarastak opened this issue Aug 8, 2019 · 5 comments

@rezarastak
Contributor

Recently I upgraded my candi installation of deal.II (from 9.0.1 to 9.1.0) on my desktop and noticed a significant performance degradation (~6-7x) in debug mode in my code (no change in release-mode performance). Using a profiler, I narrowed down the source of the slowness: the tensor operations (double_contract etc.) are slowed down the most. My hypothesis is that the regression comes mostly from the Assert statements in the tensor operations (specifically Tensor::operator()).

To test this hypothesis, I created a simple code that checks the speed of tensor operations.

#include <deal.II/base/tensor.h>

using namespace dealii;

int main()
{
  double values[3][3] = {{2., 5., 1.},
                         {6., 3., 8.},
                         {-2., 5.5, 1.7}};
  const Tensor<2, 3> t(values);
  double sum = 0.;
  for (int i = 0; i < 100000; ++i)
    sum += scalar_product(t, t);
  return static_cast<int>(sum);
}

The timing results in debug mode (elapsed time as reported by /usr/bin/time) are:

  • 0.01s on stable version 9.0.1
  • 0.15s on stable version 9.1.0
  • 0.01s on current master branch (7d1f025)
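The elapsed times above were collected along these lines (the binary name is a placeholder):

```shell
# Report elapsed (wall-clock) time for the debug-mode benchmark
/usr/bin/time ./tensor_benchmark
```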

Please let me know if you can reproduce these results. It seems that most of the regression has been resolved on the master branch, but maybe we should update the stable branch as well.

I used callgrind to profile it. The numbers of instruction-read (Ir) events are:

  • 31,930,901 on stable version 9.0.1
  • 608,772,206 on stable version 9.1.0
  • 65,562,021 on current master branch (7d1f025)

One piece of evidence suggesting that the Assert macro is causing the performance regression is the following callgrind output for v9.1.0:

-- line 1084 ----------------------------------------
         .      template <typename ArrayElementType, int dim>
         .      inline DEAL_II_ALWAYS_INLINE DEAL_II_CUDA_HOST_DEV ArrayElementType &
         .                                                         subscript(ArrayElementType * values,
         .                                                                   const unsigned int i,
         .                                                                   std::integral_constant<int, dim>)
         .      {
         .        // We cannot use Assert in a CUDA kernel
         .  #ifndef __CUDA_ARCH__
12,000,000        Assert(i < dim, ExcIndexRange(i, 0, dim));
         .  #endif
13,800,000        return values[i];
         .      }
         .  
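For reference, annotated output like the above can be produced as follows (assuming valgrind is installed; the binary name is a placeholder):

```shell
# Run the benchmark under callgrind; writes callgrind.out.<pid>
valgrind --tool=callgrind ./tensor_benchmark
# Annotate instruction-read (Ir) counts per source line
callgrind_annotate --auto=yes callgrind.out.*
```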

The deal.II installation logs and the callgrind profiling data are attached:

detailed-7d1f025430.log
detailed-v9.0.1.log
detailed-v9.1.0.log

profile-results.7d1f025430.txt
profile-results.v9.0.1.txt
profile-results.v9.1.0.txt

@drwells
Member

drwells commented Aug 8, 2019

Here is a slightly modified version of the benchmark. This takes me 140 ms with 9.1.1 and 150 ms with the current master branch.

#include <deal.II/base/tensor.h>

#include <chrono>
#include <iostream>

int main()
{
  using namespace dealii;

  const auto start = std::chrono::high_resolution_clock::now();

  double sum = 0.;
  for (int i = 0; i < 10; ++i)
    {
      double values[3][3] = {{2., 5., 1.},
                             {6., 3., 8.},
                             {-2., 5.5, 1.7}};
      const Tensor<2,3> t(values);
      for (int j = 0; j < 100000; ++j)
        sum += scalar_product(t, t);
    }
  const auto end = std::chrono::high_resolution_clock::now();

  std::cout << "took " << std::chrono::duration_cast<std::chrono::milliseconds>
    (end - start).count() << " milliseconds\n";

  return static_cast<int>(sum);
}

@rezarastak
Contributor Author

@drwells I ran your modified benchmark again on my machine (all in debug mode):

  • stable installation 9.0.1: 9 milliseconds
  • stable installation 9.1.0: 112 milliseconds
  • master branch (7d1f025): 53 milliseconds

@kronbichler
Member

kronbichler commented Aug 9, 2019

I guess one would need to look into the assembly code to see what has changed. I am willing to believe that the assertion could be part of the reason. However, I want to point out that it is not the check itself that is the problem, but rather the compiler moving content back and forth between registers and the stack (because every variable needs to remain accessible).

Background information: if we had those checks also in release mode (not counting some of the more expensive ones we do), I would expect the slowdown to be less than a factor of 2, often less than a factor of 1.2. The reason is that almost all branches due to assertions should be easy to predict (CPU branch prediction). What could cause a slowdown is missed optimization opportunities due to the check (say, less loop unrolling, or additional dependencies that harm out-of-order execution) and, sometimes, simply the fact that the CPU has to decode and execute more instructions. But as I said, I believe this is less than 2x.

Now to the actual cause: I believe the huge difference between versions 9.0 and 9.1 comes from #7207. If you check your detailed.log, we have

#        DEAL_II_CXX_FLAGS_DEBUG:      -Og -ggdb -Wa,--compress-debug-sections

in the case of 9.0.1 and

#        DEAL_II_CXX_FLAGS_DEBUG:      -O0 -ggdb -Wa,--compress-debug-sections

for 9.1.0. Note the switch from -Og to -O0. This was intentional, so I guess we need to live with it. After all, debug mode is supposed to perform many checks, and we need to compromise between the ability to debug and read all variables, e.g. in gdb (which requires moving tons of content back and forth between the stack and registers, the -O0 case), and some more speed through the elimination of those operations and some more inlining (the -Og case).
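For what it's worth, users who prefer the old behavior can override the debug flags at configure time (a sketch; the source path is a placeholder):

```shell
# Configure deal.II with -Og instead of the default -O0 debug flags
cmake -DDEAL_II_CXX_FLAGS_DEBUG="-Og -ggdb" /path/to/dealii
```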

@rezarastak
Contributor Author

@kronbichler Thanks for the explanation. It now makes perfect sense to me. I also remember that in previous deal.II versions gdb used to unnecessarily inline or optimize away variables that I wanted to inspect, so I support more powerful debugging in debug mode even if it is slower. I think there are many opportunities to improve the Assert statements, such as converting some of them to static_assert, to gain some speedup.

@bangerth
Member

Yes, -Og made debugging impossible on many platforms and I thought that was unacceptable.

Since there is little we can do in concrete terms about your observation, should we close this issue? Of course, we'd be happy to convert Asserts to static_asserts if you can think of any that qualify.
