
Switched to in-place update of the diagonal Hessian #337

Merged: 14 commits from feature/fast_hessian into develop on Jun 4, 2020

Conversation

@ProfFan (Collaborator) commented Jun 2, 2020:

Benchmark with SEQUENTIAL_CHOLESKY

Warmed-up timing

Before:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    7.43e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    7.27e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    7.28e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    7.31e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    7.34e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    7.35e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    7.36e+00
-Total: 0 CPU (0 times, 0 wall, 74.61 children, min: 0 max: 0)
|   -optimize: 74.61 CPU (1 times, 74.7746 wall, 74.61 children, min: 74.61 max: 74.61)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  74.23s user 0.88s system 99% cpu 1:15.27 total

After:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    7.47e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    7.30e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    7.29e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    7.29e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    7.32e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    7.35e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    7.37e+00
-Total: 0 CPU (0 times, 0 wall, 71.2 children, min: 0 max: 0)
|   -optimize: 71.2 CPU (1 times, 71.3531 wall, 71.2 children, min: 71.2 max: 71.2)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  70.95s user 0.74s system 99% cpu 1:11.84 total

Cold timing

Before:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    1.03e+01
   1  6.149065e+04    4.16e+04    3.33e-05     1    8.06e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    8.07e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    7.77e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    8.16e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    7.77e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    7.80e+00
-Total: 0 CPU (0 times, 0 wall, 82.97 children, min: 0 max: 0)
|   -optimize: 82.97 CPU (1 times, 85.3366 wall, 82.97 children, min: 82.97 max: 82.97)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  81.95s user 1.64s system 97% cpu 1:26.04 total

After:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    8.29e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    8.29e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    7.48e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    7.69e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    7.83e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    7.48e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    7.58e+00
-Total: 0 CPU (0 times, 0 wall, 75.98 children, min: 0 max: 0)
|   -optimize: 75.98 CPU (1 times, 76.699 wall, 75.98 children, min: 75.98 max: 75.98)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  75.47s user 1.07s system 98% cpu 1:17.34 total


@dellaert (Member) left a comment:

Hmmm, the speedup is not as large as we expected: 3-4% rather than 10%? Although, especially in JacobianFactor, there is room to be even smarter and reduce the number of mallocs to n (the number of variables) rather than m (the number of factors).

(Resolved review threads on gtsam/linear/HessianFactor.cpp and gtsam/linear/JacobianFactor.cpp)
@@ -554,9 +560,12 @@ VectorValues JacobianFactor::hessianDiagonal() const {
       model_->whitenInPlace(column_k);
       dj(k) = dot(column_k, column_k);
     }
-    d.emplace(j, dj);
+    if(d.exists(j)) {
+      d.at(j) += dj;
@dellaert (Member):

I think we could be a bit smarter still and avoid the malloc on line 556. Since we know we're going to add, the vector is either already allocated, or we'll have to emplace it, which incurs a new malloc.

@ProfFan (Collaborator, Author) commented Jun 2, 2020:

It's 3%-4% of the SEQUENTIAL_CHOLESKY time; it should be more with EIGEN_CHOLESKY.

@@ -560,8 +560,9 @@ void JacobianFactor::hessianDiagonalAdd(VectorValues& d) const {
       model_->whitenInPlace(column_k);
       dj(k) = dot(column_k, column_k);
     }
-    if(d.exists(j)) {
-      d.at(j) += dj;
+    auto item = d.find(j);
@dellaert (Member):

So, I think we can do even better! Currently, we do a find (one traversal) and another if it's not there (emplace). By always doing emplace and inspecting its return value we only do one traversal per key:

If the function successfully inserts the element (because no equivalent element existed already in the map), the function returns a pair of an iterator to the newly inserted element and a value of true.
Otherwise, it returns an iterator to the equivalent element within the container and a value of false.
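For reference, here is a minimal standalone sketch of that single-traversal pattern, using a plain std::map with Eigen vectors rather than GTSAM's actual VectorValues (the names are illustrative only):

#include <map>
#include <Eigen/Core>

using Key = std::size_t;

// Accumulate dj into d[j] with a single tree traversal: emplace either
// inserts a zero vector of the right size or hands back the existing entry.
void hessianDiagonalAddSketch(std::map<Key, Eigen::VectorXd>& d, Key j,
                              const Eigen::VectorXd& dj) {
  auto result = d.emplace(j, Eigen::VectorXd::Zero(dj.size()));
  // result.second is true if the zero vector was just inserted, false if the
  // key already existed; result.first points at the entry either way.
  result.first->second += dj;
}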

@ProfFan (Collaborator, Author):

  /* ************************************************************************* */
  VectorValues::iterator VectorValues::emplace(Key j, const Vector& value) {
#ifdef TBB_GREATER_EQUAL_2020
    std::pair<iterator, bool> result = values_.emplace(j, value);
#else
    std::pair<iterator, bool> result = values_.insert(std::make_pair(j, value));
#endif
    if(!result.second)
      throw std::invalid_argument(
      "Requested to emplace variable '" + DefaultKeyFormatter(j)
      + "' already in this VectorValues.");
    return result.first;
  }

This will throw if the key already exists. Should I make a function to access the inner values_?

@dellaert (Member):

Oh, dang! I had no idea the semantics of our emplace were different. That's actually terrible :-) I think you should change this to an inline, straight call to emplace in the header, and just return the result. We might have to check current uses of emplace, but I suspect none of them actually use the return value.

@ProfFan (Collaborator, Author):

I think the aim here is to avoid the allocation of Vector dj(nj)? If so, then the emplace will need a new Vector, thus reintroducing the allocation?

@dellaert (Member):

Yes - either use the memory already found by emplace in the tree, or use the newly allocated memory it just created (if the key was not in the tree yet). Either way, emplace will give you a reference to the memory, so no allocation should be necessary in your code, and the tree is traversed only once.

@ProfFan (Collaborator, Author):

There is try_emplace, but only with C++17.
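For reference, the C++17 version on a plain std::map would look roughly like this (a sketch, not GTSAM code); try_emplace only constructs the mapped vector when the key is absent:

#include <map>
#include <Eigen/Core>

using Key = std::size_t;

// C++17: try_emplace forwards nj to the VectorXd constructor only if j is
// not in the map yet, so hitting an existing entry costs one traversal and
// no allocation at all.
Eigen::VectorXd& diagonalEntry(std::map<Key, Eigen::VectorXd>& d, Key j,
                               Eigen::Index nj) {
  auto item = d.try_emplace(j, nj);               // VectorXd(nj) is uninitialized
  if (item.second) item.first->second.setZero();  // zero only a fresh entry
  return item.first->second;
}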

@dellaert (Member):

I don't think you need it. emplace will forward arguments, right? So, for normal maps this should work:

size_t nj = ...
auto item = emplace(j, nj);
auto& dj = item.first->second;
if (item.second) dj.setZero();
for () {
  dj(k) += ...
}

Try and work it out entirely before sending next comment?

@ProfFan (Collaborator, Author):

It turns out the way we are using emplace is probably wrong. The arguments to emplace are the arguments to the object's constructor, so basically we are calling the copy constructor all the time when we go through that VectorValues::emplace indirection :\

@dellaert (Member):

Exactly
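To illustrate the distinction with simplified types (a sketch using a plain std::map; VectorValues itself wraps a possibly-TBB map, and the function names here are made up):

#include <map>
#include <tuple>
#include <utility>
#include <Eigen/Core>

using Key = std::size_t;
using Values = std::map<Key, Eigen::VectorXd>;

// A wrapper that takes `const Vector&` forces the caller to build the vector
// first, and the map then copy-constructs it into the node: two allocations.
void addViaWrapper(Values& values, Key j, Eigen::Index nj) {
  Eigen::VectorXd tmp = Eigen::VectorXd::Zero(nj);  // allocation 1
  values.emplace(j, tmp);                           // allocation 2 (the copy)
}

// Forwarding the constructor argument lets the map build the vector directly
// inside the node, so there is only one allocation.
void addDirect(Values& values, Key j, Eigen::Index nj) {
  auto item = values.emplace(std::piecewise_construct,
                             std::forward_as_tuple(j),
                             std::forward_as_tuple(nj));  // VectorXd(nj)
  if (item.second) item.first->second.setZero();          // zero a fresh entry
}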

@ProfFan (Collaborator, Author):

done

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

There are some strange errors:

94/181 Test  #94: testJacobianFactor .................***Failed    0.09 sec
Not equal:
expected:
: 3 elements
  5: 1 1 1
  10: 4 4 4
  15: 9 9 9
actual:
: 3 elements
  5: 0.999998 0.999998 0.999998
  10: 4 4 4
  15: 9 9 9
 98/181 Test  #98: testRegularJacobianFactor ..........***Failed    0.08 sec
Not equal:
expected:
: 3 elements
  0: 4 4 4
  1: 16 16 16
  2: 36 36 36
actual:
: 3 elements
  0: 4 4 4
  1: 16 16 16
  2: 36 36 36
131/181 Test #131: testRegularImplicitSchurFactor .....***Failed    0.09 sec
Not equal:
expected:
: 3 elements
  0: 1.35714 1.35714 1.35714 1.35714 1.35714 1.35714
  1: 0.380951 0.380951 0.380951 0.380951 0.380951 0.380951
  3: 12.2143 12.2143 12.2143 12.2143 12.2143 12.2143
actual:
: 3 elements
  0: 1.35714 1.35714 1.35714 1.35714 1.35714 1.35714
  1: 0.380952 0.380952 0.380952 0.380952 0.380952 0.380952
  3: 12.2143 12.2143 12.2143 12.2143 12.2143 12.2143

@dellaert (Member) commented Jun 3, 2020 via email

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert Fixed. Now all unit tests should pass.

ProfFan requested a review from dellaert on June 3, 2020 at 02:07
@dellaert (Member) commented Jun 3, 2020:

New timing?

@dellaert (Member) left a comment:

Nice! I think we see eye to eye now ;-) Curious to see whether all this skullduggery made any difference in timing :-)
Some more things to clean up, but feel free to merge once CI for those fixes pans out.

(Resolved review threads on gtsam/config.h.in, gtsam/linear/HessianFactor.cpp, gtsam/linear/JacobianFactor.cpp, and gtsam/linear/VectorValues.h)
    auto result = collectedResult.emplace(*frontal, solution.segment(vectorPosition, c.getDim(frontal)));
    if(!result.second)
      throw std::invalid_argument(
          "Requested to emplace variable '" + DefaultKeyFormatter(*frontal)
@dellaert (Member):

I think this error message is unhelpful in this context. Maybe: std::runtime_error("Internal error while optimizing clique.")

@ProfFan (Collaborator, Author):

done

return d;
}

/// Return the diagonal of the Hessian for this factor
@dellaert (Member):

Fix comment

@ProfFan (Collaborator, Author):

done

(Resolved review thread on gtsam/slam/RegularImplicitSchurFactor.h)
@ProfFan (Collaborator, Author) commented Jun 3, 2020:

-Total: 0 CPU (0 times, 0 wall, 73.65 children, min: 0 max: 0)
|   -optimize: 73.65 CPU (1 times, 73.9181 wall, 73.65 children, min: 73.65 max: 73.65)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  73.23s user 0.94s system 99% cpu 1:14.43 total

@dellaert 1:14 with SEQUENTIAL_CHOLESKY :)

@dellaert (Member) commented Jun 3, 2020:

> -Total: 0 CPU (0 times, 0 wall, 73.65 children, min: 0 max: 0)
> |   -optimize: 73.65 CPU (1 times, 73.9181 wall, 73.65 children, min: 73.65 max: 73.65)
> LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  73.23s user 0.94s system 99% cpu 1:14.43 total
>
> @dellaert 1:14 with SEQUENTIAL_CHOLESKY :)

Wait, I don't understand. It's 73s now, compared to 82s when you started the PR (yes!!!! > 10% saving !?), but with what solver? Note Eigen is not in develop yet, so the description of this PR should state the savings for using an existing solver, e.g. SEQUENTIAL_CHOLESKY. Could you update the PR description with a before and after for SEQUENTIAL_CHOLESKY?

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert I updated the benchmark with a proper warm-up (to fill the CPU cache). You can see about 1s of improvement in runtime.

@dellaert (Member) commented Jun 3, 2020:

> @dellaert I updated the benchmark with a proper warm-up (to fill the CPU cache). You can see about 1s of improvement in runtime.

Hmm. That's disappointing. But, on the other hand, why is "warm-started" the right benchmark? People typically run optimizations once.

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert Because that will eliminate the disturbance caused by other processes filling the CPU cache, etc. https://engineering.appfolio.com/appfolio-engineering/2017/5/2/what-about-warmup

@dellaert (Member) commented Jun 3, 2020:

> @dellaert Because that will eliminate the disturbance caused by other processes filling the CPU cache, etc. https://engineering.appfolio.com/appfolio-engineering/2017/5/2/what-about-warmup

I understand warm-up :-) But my comment is that it does not apply, and we should cold-start.

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

I did another optimization to further reduce heap allocation. Now the benchmark:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    7.47e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    7.30e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    7.29e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    7.29e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    7.32e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    7.35e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    7.37e+00
-Total: 0 CPU (0 times, 0 wall, 71.2 children, min: 0 max: 0)
|   -optimize: 71.2 CPU (1 times, 71.3531 wall, 71.2 children, min: 71.2 max: 71.2)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  70.95s user 0.74s system 99% cpu 1:11.84 total

@dellaert (Member) commented Jun 3, 2020:

@ProfFan awesome! But (a) could you update the description with cold timing? (b) the build seems to be failing :-/

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

Updated and fixed the GCC issues :)

ProfFan requested a review from dellaert on June 3, 2020 at 20:47
@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert Should I merge this in?

@dellaert (Member) left a comment:

Please update the description with cold timing. I'll then do a last review.

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert Already done, see top

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

BTW, in the profile, the share of total execution time consumed by hessianDiagonal() dropped from 10% to 5%.

@dellaert (Member) left a comment:

OK, I noticed something in this last review:

    virtual VectorValues hessianDiagonal() const = 0;

This no longer needs to be abstract - it's the same in all derived classes :-) Just make it concrete and remove all the copies in the derived classes?

@ProfFan (Collaborator, Author) commented Jun 3, 2020:

@dellaert It's possible but not trivial - VectorValues is only forward-declared in GaussianFactor, so we cannot actually use it there. Also, GaussianFactor is header-only, so we cannot put implementations in it.

@dellaert (Member) commented Jun 3, 2020:

> @dellaert It's possible but not trivial - VectorValues is only forward-declared in GaussianFactor, so we cannot actually use it there. Also, GaussianFactor is header-only, so we cannot put implementations in it.

Can you not add a .cpp file?
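For what it's worth, the concrete base-class version being discussed would presumably look something like this once a GaussianFactor.cpp exists (a sketch only; it assumes the header lives at gtsam/linear/GaussianFactor.h and that hessianDiagonalAdd is a virtual on the base class):

// GaussianFactor.cpp (new file), where VectorValues is a complete type:
#include <gtsam/linear/GaussianFactor.h>
#include <gtsam/linear/VectorValues.h>

namespace gtsam {

VectorValues GaussianFactor::hessianDiagonal() const {
  VectorValues d;
  hessianDiagonalAdd(d);  // in-place virtual implemented by each derived class
  return d;
}

}  // namespace gtsam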

@ProfFan (Collaborator, Author) commented Jun 4, 2020:

@dellaert Done.

@dellaert (Member) left a comment:

Yay!

ProfFan merged commit 75ce6d6 into develop on Jun 4, 2020
ProfFan deleted the feature/fast_hessian branch on June 4, 2020 at 03:13
@dellaert (Member) commented Jun 4, 2020:

Awesome! Let's do some SwiftFusion now, and pick Eigen back up after we meet with those folks :-)
