
Possible performance regression in Debug mode on the stable branch #8520

Closed
rezarastak opened this issue Aug 8, 2019 · 5 comments

@rezarastak
Contributor

Recently I upgraded my candi installation of deal.II (from 9.0.1 to 9.1.0) on my desktop and noticed a significant performance degradation (~6-7x) in debug mode in my code (no change in release-mode performance). Using a profiler, I narrowed down the source of the slowness: the tensor operations (double_contract etc.) are slowed down the most. My hypothesis is that the regression comes mostly from the Assert statements in the tensor operations (specifically Tensor::operator()).

To test this hypothesis, I created a simple code that checks the speed of tensor operations.

#include <deal.II/base/tensor.h>

using namespace dealii;

int main()
{
  double values[3][3] = {{2., 5., 1.},
                         {6., 3., 8.},
                         {-2., 5.5, 1.7}};
  const Tensor<2, 3> t(values);
  double sum = 0.;
  for (int i = 0; i < 100000; ++i)
    sum += scalar_product(t, t);
  return static_cast<int>(sum);
}

The timing results in debug mode (elapsed time as reported by /usr/bin/time) are:

  • 0.01s on stable version 9.0.1
  • 0.15s on stable version 9.1.0
  • 0.01s on current master branch (7d1f025)
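The elapsed times above were collected along these lines (the binary name is a placeholder):

```shell
# Report elapsed (wall-clock) time for the debug-mode benchmark
/usr/bin/time ./tensor_benchmark
```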

Please let me know if you can reproduce these results. It seems that most of the regression has been resolved on the master branch, but maybe we should update the stable branch as well.

I used callgrind to profile it. The numbers of instruction-read (Ir) events are:

  • 31,930,901 on stable version 9.0.1
  • 608,772,206 on stable version 9.1.0
  • 65,562,021 on current master branch (7d1f025)

One piece of evidence suggesting that the Assert macro is causing the performance regression is the following callgrind output for v9.1.0:

-- line 1084 ----------------------------------------
         .      template <typename ArrayElementType, int dim>
         .      inline DEAL_II_ALWAYS_INLINE DEAL_II_CUDA_HOST_DEV ArrayElementType &
         .                                                         subscript(ArrayElementType * values,
         .                                                                   const unsigned int i,
         .                                                                   std::integral_constant<int, dim>)
         .      {
         .        // We cannot use Assert in a CUDA kernel
         .  #ifndef __CUDA_ARCH__
12,000,000        Assert(i < dim, ExcIndexRange(i, 0, dim));
         .  #endif
13,800,000        return values[i];
         .      }
         .  
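For reference, annotated output like the above can be produced as follows (assuming valgrind is installed; the binary name is a placeholder):

```shell
# Run the benchmark under callgrind; writes callgrind.out.<pid>
valgrind --tool=callgrind ./tensor_benchmark
# Annotate instruction-read (Ir) counts per source line
callgrind_annotate --auto=yes callgrind.out.*
```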

The deal.II installation logs and the callgrind profiling data are attached:

detailed-7d1f025430.log
detailed-v9.0.1.log
detailed-v9.1.0.log

profile-results.7d1f025430.txt
profile-results.v9.0.1.txt
profile-results.v9.1.0.txt

@drwells
Member

drwells commented Aug 8, 2019

Here is a slightly modified version of the benchmark. This takes me 140 ms with 9.1.1 and 150 ms with the current master branch.

#include <deal.II/base/tensor.h>

#include <chrono>
#include <iostream>

int main()
{
  using namespace dealii;

  const auto start = std::chrono::high_resolution_clock::now();

  double sum = 0.;
  for (int i = 0; i < 10; ++i)
    {
      double values[3][3] = {{2., 5., 1.},
                             {6., 3., 8.},
                             {-2., 5.5, 1.7}};
      const Tensor<2,3> t(values);
      for (int j = 0; j < 100000; ++j)
        sum += scalar_product(t, t);
    }
  const auto end = std::chrono::high_resolution_clock::now();

  std::cout << "took " << std::chrono::duration_cast<std::chrono::milliseconds>
    (end - start).count() << " milliseconds\n";

  return static_cast<int>(sum);
}

@rezarastak
Contributor Author

@drwells I ran your modified benchmark again on my machine (all in debug mode):

  • stable installation 9.0.1: 9 milliseconds
  • stable installation 9.1.0: 112 milliseconds
  • master branch (7d1f025): 53 milliseconds

@kronbichler
Member

kronbichler commented Aug 9, 2019

I guess one would need to look into the assembly code to see what has changed. I am willing to believe that the assertion could be part of the reason. However, I want to point out that it is not the check itself that is the problem, but rather the compiler moving content back and forth between registers and the stack (because every variable needs to remain accessible).

Background information: if we had those checks also in release mode (not counting some of the more expensive ones we do), I would expect the slowdown to be less than a factor of 2, often less than a factor of 1.2. The reason is that almost all branches due to assertions should be easy to predict (CPU branch prediction). What could cause a slowdown is missed optimization opportunities due to the check (say, less loop unrolling, or additional dependencies that harm out-of-order execution) and, sometimes, simply the fact that the CPU has to decode and execute more instructions. But as I said, I believe this is less than 2x.

Now to the actual cause: I believe the huge difference between versions 9.0 and 9.1 comes from #7207. If you check your detailed.log, we have

#        DEAL_II_CXX_FLAGS_DEBUG:      -Og -ggdb -Wa,--compress-debug-sections

in the case of 9.0.1 and

#        DEAL_II_CXX_FLAGS_DEBUG:      -O0 -ggdb -Wa,--compress-debug-sections

for 9.1.0. Note the switch from -Og to -O0. This was intentional, so I guess we need to live with it. After all, debug mode is supposed to perform many checks, and we need to compromise between the ability to debug and read all variables, e.g. in gdb (which requires moving tons of content back and forth between the stack and registers, the -O0 case), and some more speed through the elimination of those operations and some more inlining (the -Og case).
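For what it's worth, users who prefer the old behavior can override the debug flags at configure time (a sketch; the source path is a placeholder):

```shell
# Configure deal.II with -Og instead of the default -O0 debug flags
cmake -DDEAL_II_CXX_FLAGS_DEBUG="-Og -ggdb" /path/to/dealii
```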

@rezarastak
Contributor Author

@kronbichler Thanks for the explanation. It now makes perfect sense to me. I also remember that in previous deal.II versions gdb used to unnecessarily inline or optimize away variables that I wanted to inspect, so I support more powerful debugging in debug mode even if it is slower. I think there are many opportunities to improve the Assert statements, such as converting some of them to static_assert, to gain some speedup.

@bangerth
Member

Yes, -Og made debugging impossible on many platforms and I thought that was unacceptable.

Since there is little we can do in concrete terms about your observation, should we close this issue? Of course, we'd be happy to convert Asserts to static_asserts if you can think of any that qualify.
