
Switched to in-place update of the diagonal Hessian #337

Merged: 14 commits from feature/fast_hessian into develop on Jun 4, 2020

Conversation

@ProfFan (Collaborator) commented Jun 2, 2020:

Benchmark with SEQUENTIAL_CHOLESKY

Warmed-up timing

Before:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    7.43e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    7.27e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    7.28e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    7.31e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    7.34e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    7.35e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    7.36e+00
-Total: 0 CPU (0 times, 0 wall, 74.61 children, min: 0 max: 0)
|   -optimize: 74.61 CPU (1 times, 74.7746 wall, 74.61 children, min: 74.61 max: 74.61)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  74.23s user 0.88s system 99% cpu 1:15.27 total

After:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    7.47e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    7.30e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    7.29e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    7.29e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    7.32e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    7.35e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    7.37e+00
-Total: 0 CPU (0 times, 0 wall, 71.2 children, min: 0 max: 0)
|   -optimize: 71.2 CPU (1 times, 71.3531 wall, 71.2 children, min: 71.2 max: 71.2)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  70.95s user 0.74s system 99% cpu 1:11.84 total

Cold timing

Before:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    1.03e+01
   1  6.149065e+04    4.16e+04    3.33e-05     1    8.06e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    8.07e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    7.77e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    8.16e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    7.77e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    7.80e+00
-Total: 0 CPU (0 times, 0 wall, 82.97 children, min: 0 max: 0)
|   -optimize: 82.97 CPU (1 times, 85.3366 wall, 82.97 children, min: 82.97 max: 82.97)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  81.95s user 1.64s system 97% cpu 1:26.04 total

After:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    8.29e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    8.29e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    7.48e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    7.69e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    7.83e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    7.48e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    7.58e+00
-Total: 0 CPU (0 times, 0 wall, 75.98 children, min: 0 max: 0)
|   -optimize: 75.98 CPU (1 times, 76.699 wall, 75.98 children, min: 75.98 max: 75.98)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  75.47s user 1.07s system 98% cpu 1:17.34 total


@dellaert (Member) left a comment:

Hmmm, the speedup is not as large as we expected: 3-4% rather than 10%? Although, especially in JacobianFactor, there is room to be even smarter and reduce the number of mallocs to n (the number of variables) rather than m (the number of factors).

(Resolved review threads on gtsam/linear/HessianFactor.cpp and gtsam/linear/JacobianFactor.cpp)
@@ -554,9 +560,12 @@ VectorValues JacobianFactor::hessianDiagonal() const {
       model_->whitenInPlace(column_k);
       dj(k) = dot(column_k, column_k);
     }
-    d.emplace(j, dj);
+    if(d.exists(j)) {
+      d.at(j) += dj;
@dellaert (Member):

I think we could be a bit smarter still and avoid the malloc on line 556. Since we know we're going to add, the vector is either already allocated, or we'll have to emplace it, which incurs a new malloc.

@ProfFan (Collaborator, Author) commented Jun 2, 2020:

It's 3%-4% of the SEQUENTIAL_CHOLESKY time; it should be more with EIGEN_CHOLESKY.

@@ -560,8 +560,9 @@ void JacobianFactor::hessianDiagonalAdd(VectorValues& d) const {
       model_->whitenInPlace(column_k);
       dj(k) = dot(column_k, column_k);
     }
-    if(d.exists(j)) {
-      d.at(j) += dj;
+    auto item = d.find(j);
@dellaert (Member):

So, I think we can do even better! Currently, we do a find (one traversal) and another if it's not there (emplace). By always doing emplace and inspecting its return value we only do one traversal per key:

If the function successfully inserts the element (because no equivalent element existed already in the map), the function returns a pair of an iterator to the newly inserted element and a value of true.
Otherwise, it returns an iterator to the equivalent element within the container and a value of false.
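For reference, here is a minimal standalone sketch of that single-traversal pattern, using a plain std::map with Eigen vectors rather than GTSAM's actual VectorValues (the names are illustrative only):

#include <map>
#include <Eigen/Core>

using Key = std::size_t;

// Accumulate dj into d[j] with a single tree traversal: emplace either
// inserts a zero vector of the right size or hands back the existing entry.
void hessianDiagonalAddSketch(std::map<Key, Eigen::VectorXd>& d, Key j,
                              const Eigen::VectorXd& dj) {
  auto result = d.emplace(j, Eigen::VectorXd::Zero(dj.size()));
  // result.second is true if the zero vector was just inserted, false if the
  // key already existed; result.first points at the entry either way.
  result.first->second += dj;
}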

@ProfFan (Collaborator, Author):

  /* ************************************************************************* */
  VectorValues::iterator VectorValues::emplace(Key j, const Vector& value) {
#ifdef TBB_GREATER_EQUAL_2020
    std::pair<iterator, bool> result = values_.emplace(j, value);
#else
    std::pair<iterator, bool> result = values_.insert(std::make_pair(j, value));
#endif
    if(!result.second)
      throw std::invalid_argument(
      "Requested to emplace variable '" + DefaultKeyFormatter(j)
      + "' already in this VectorValues.");
    return result.first;
  }

This will throw if the key already exists. Should I make a function to access the inner values_?

@dellaert (Member):

Oh, dang! I had no idea the semantics of our emplace were different. That's actually terrible :-) I think you should change this to an inline, straight call to emplace in the header, and just return the result. We might have to check current uses of emplace, but I suspect none of them actually use the return value.

@ProfFan (Collaborator, Author):

I think the aim here is to avoid the allocation of Vector dj(nj)? If so, then the emplace will need a new Vector, thus reintroducing the allocation?

@dellaert (Member):

Yes - either use the memory already found by emplace in the tree, or use the newly allocated memory it just created (if the key was not in the tree yet). Either way, emplace will give you a reference to the memory, so no allocation should be necessary in your code, and the tree is traversed only once.

@ProfFan (Collaborator, Author):

There is try_emplace, but only with C++17.
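For reference, the C++17 version on a plain std::map would look roughly like this (a sketch, not GTSAM code); try_emplace only constructs the mapped vector when the key is absent:

#include <map>
#include <Eigen/Core>

using Key = std::size_t;

// C++17: try_emplace forwards nj to the VectorXd constructor only if j is
// not in the map yet, so hitting an existing entry costs one traversal and
// no allocation at all.
Eigen::VectorXd& diagonalEntry(std::map<Key, Eigen::VectorXd>& d, Key j,
                               Eigen::Index nj) {
  auto item = d.try_emplace(j, nj);               // VectorXd(nj) is uninitialized
  if (item.second) item.first->second.setZero();  // zero only a fresh entry
  return item.first->second;
}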

@dellaert (Member):

I don't think you need it. emplace will forward arguments, right? So, for normal maps this should work:

size_t nj = ...
auto item = emplace(j, nj);
auto& dj = item.first->second;
if (item.second) dj.setZero();
for () {
  dj(k) += ...
}

Try and work it out entirely before sending next comment?

@ProfFan (Collaborator, Author):

It turns out the way we are using emplace is probably wrong. The arguments to emplace are the arguments to the object's constructor, so basically we are calling the copy constructor all the time when we go through that VectorValues::emplace indirection :\

@dellaert (Member):

Exactly
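To illustrate the distinction with simplified types (a sketch using a plain std::map; VectorValues itself wraps a possibly-TBB map, and the function names here are made up):

#include <map>
#include <tuple>
#include <utility>
#include <Eigen/Core>

using Key = std::size_t;
using Values = std::map<Key, Eigen::VectorXd>;

// A wrapper that takes `const Vector&` forces the caller to build the vector
// first, and the map then copy-constructs it into the node: two allocations.
void addViaWrapper(Values& values, Key j, Eigen::Index nj) {
  Eigen::VectorXd tmp = Eigen::VectorXd::Zero(nj);  // allocation 1
  values.emplace(j, tmp);                           // allocation 2 (the copy)
}

// Forwarding the constructor argument lets the map build the vector directly
// inside the node, so there is only one allocation.
void addDirect(Values& values, Key j, Eigen::Index nj) {
  auto item = values.emplace(std::piecewise_construct,
                             std::forward_as_tuple(j),
                             std::forward_as_tuple(nj));  // VectorXd(nj)
  if (item.second) item.first->second.setZero();          // zero a fresh entry
}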

@ProfFan (Collaborator, Author):

done

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

There are some strange errors:

94/181 Test  #94: testJacobianFactor .................***Failed    0.09 sec
Not equal:
expected:
: 3 elements
  5: 1 1 1
  10: 4 4 4
  15: 9 9 9
actual:
: 3 elements
  5: 0.999998 0.999998 0.999998
  10: 4 4 4
  15: 9 9 9
 98/181 Test  #98: testRegularJacobianFactor ..........***Failed    0.08 sec
Not equal:
expected:
: 3 elements
  0: 4 4 4
  1: 16 16 16
  2: 36 36 36
actual:
: 3 elements
  0: 4 4 4
  1: 16 16 16
  2: 36 36 36
131/181 Test #131: testRegularImplicitSchurFactor .....***Failed    0.09 sec
Not equal:
expected:
: 3 elements
  0: 1.35714 1.35714 1.35714 1.35714 1.35714 1.35714
  1: 0.380951 0.380951 0.380951 0.380951 0.380951 0.380951
  3: 12.2143 12.2143 12.2143 12.2143 12.2143 12.2143
actual:
: 3 elements
  0: 1.35714 1.35714 1.35714 1.35714 1.35714 1.35714
  1: 0.380952 0.380952 0.380952 0.380952 0.380952 0.380952
  3: 12.2143 12.2143 12.2143 12.2143 12.2143 12.2143

@dellaert (Member) commented Jun 3, 2020 via email

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert Fixed. Now all unit tests should pass.

ProfFan requested a review from dellaert on June 3, 2020 at 02:07
@dellaert (Member) commented Jun 3, 2020:

New timing?

@dellaert (Member) left a comment:

Nice! I think we see eye to eye now ;-) Curious to see whether all this skullduggery made any difference in timing :-)
Some more things to clean up, but feel free to merge once CI for those fixes pans out.

(Resolved review threads on gtsam/config.h.in, gtsam/linear/HessianFactor.cpp, gtsam/linear/JacobianFactor.cpp, and gtsam/linear/VectorValues.h)
    auto result = collectedResult.emplace(*frontal, solution.segment(vectorPosition, c.getDim(frontal)));
    if(!result.second)
      throw std::invalid_argument(
          "Requested to emplace variable '" + DefaultKeyFormatter(*frontal)
@dellaert (Member):

I think this error message is unhelpful in this context. Maybe: std::runtime_error("Internal error while optimizing clique.")

@ProfFan (Collaborator, Author):

done

return d;
}

/// Return the diagonal of the Hessian for this factor
@dellaert (Member):

Fix comment

@ProfFan (Collaborator, Author):

done

(Resolved review thread on gtsam/slam/RegularImplicitSchurFactor.h)
@ProfFan (Collaborator, Author) commented Jun 3, 2020:

-Total: 0 CPU (0 times, 0 wall, 73.65 children, min: 0 max: 0)
|   -optimize: 73.65 CPU (1 times, 73.9181 wall, 73.65 children, min: 73.65 max: 73.65)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  73.23s user 0.94s system 99% cpu 1:14.43 total

@dellaert 1:14 with SEQUENTIAL_CHOLESKY :)

@dellaert (Member) commented Jun 3, 2020:

> -Total: 0 CPU (0 times, 0 wall, 73.65 children, min: 0 max: 0)
> |   -optimize: 73.65 CPU (1 times, 73.9181 wall, 73.65 children, min: 73.65 max: 73.65)
> LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  73.23s user 0.94s system 99% cpu 1:14.43 total
>
> @dellaert 1:14 with SEQUENTIAL_CHOLESKY :)

Wait, I don't understand. It's 73s now, compared to 82s when you started the PR (yes!!!! > 10% saving !?), but with what solver? Note Eigen is not in develop yet, so the description of this PR should state the savings for using an existing solver, e.g. SEQUENTIAL_CHOLESKY. Could you update the PR description with a before and after for SEQUENTIAL_CHOLESKY?

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert I updated the benchmark with a proper warm-up (to fill the CPU cache). You can see about 1s of improvement in runtime.

@dellaert (Member) commented Jun 3, 2020:

> @dellaert I updated the benchmark with a proper warm-up (to fill the CPU cache). You can see about 1s of improvement in runtime.

Hmm. That's disappointing. But, on the other hand, why is "warm-started" the right benchmark? People typically run optimizations once.

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert Because that will eliminate the disturbance caused by other processes filling the CPU cache, etc. https://engineering.appfolio.com/appfolio-engineering/2017/5/2/what-about-warmup

@dellaert (Member) commented Jun 3, 2020:

> @dellaert Because that will eliminate the disturbance caused by other processes filling the CPU cache, etc. https://engineering.appfolio.com/appfolio-engineering/2017/5/2/what-about-warmup

I understand warm-up :-) But my comment is that it does not apply, and we should cold-start.

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

I did another optimization to further reduce heap allocation. Now the benchmark:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    7.47e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    7.30e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    7.29e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    7.29e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    7.32e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    7.35e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    7.37e+00
-Total: 0 CPU (0 times, 0 wall, 71.2 children, min: 0 max: 0)
|   -optimize: 71.2 CPU (1 times, 71.3531 wall, 71.2 children, min: 71.2 max: 71.2)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  70.95s user 0.74s system 99% cpu 1:11.84 total

@dellaert (Member) commented Jun 3, 2020:

@ProfFan awesome! But (a) could you update the description with cold timing? (b) the build seems to be failing :-/

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

Updated and fixed the GCC issues :)

ProfFan requested a review from dellaert on June 3, 2020 at 20:47
@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert Should I merge this in?

@dellaert (Member) left a comment:

Please update the description with cold timing. I'll then do a last review.

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert Already done, see top

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

BTW, in the profile, the share of total execution time consumed by hessianDiagonal() dropped from 10% to 5%.

@dellaert (Member) left a comment:

OK, I noticed something in this last review:

    virtual VectorValues hessianDiagonal() const = 0;

This no longer needs to be abstract - it's the same in all derived classes :-) Just make it concrete and remove all the copies in the derived classes?

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert It's possible but not trivial - VectorValues is only forward-declared in GaussianFactor, so we cannot actually use it there. Also, GaussianFactor is header-only, so we cannot put implementations in it.

@dellaert (Member) commented Jun 3, 2020:

> @dellaert It's possible but not trivial - VectorValues is only forward-declared in GaussianFactor, so we cannot actually use it there. Also, GaussianFactor is header-only, so we cannot put implementations in it.

Can you not add a .cpp file?
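For what it's worth, the concrete base-class version being discussed would presumably look something like this once a GaussianFactor.cpp exists (a sketch only; it assumes the header lives at gtsam/linear/GaussianFactor.h and that hessianDiagonalAdd is a virtual on the base class):

// GaussianFactor.cpp (new file), where VectorValues is a complete type:
#include <gtsam/linear/GaussianFactor.h>
#include <gtsam/linear/VectorValues.h>

namespace gtsam {

VectorValues GaussianFactor::hessianDiagonal() const {
  VectorValues d;
  hessianDiagonalAdd(d);  // in-place virtual implemented by each derived class
  return d;
}

}  // namespace gtsam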

@ProfFan (Collaborator, Author) commented Jun 4, 2020:

@dellaert Done.

@dellaert (Member) left a comment:

Yay!

ProfFan merged commit 75ce6d6 into develop on Jun 4, 2020
ProfFan deleted the feature/fast_hessian branch on June 4, 2020 at 03:13
@dellaert (Member) commented Jun 4, 2020:

Awesome! Let's do some SwiftFusion now, and pick Eigen back up after we meet with those folks :-)
