
Refactor/cvrp speedup #146

Merged: 11 commits, Aug 17, 2018

Conversation

@krypt-n (Contributor) commented Aug 13, 2018

Issue

Fixes #144

Implements expression templates for amount to reduce heap allocations.

Tasks

  • Update CHANGELOG.md (remove if irrelevant)
  • review

Benchmarks

All CVRP instances, 8 threads, exploration level = 1

origin/master:

Total jobs: 44993
Total jobs assigned: 44908 (99.81%)

Instances: 189
All jobs solutions: 165 (87.3%)
Optimal solutions: 10 (5.29%)

                Gaps   Computing times
Min             0.0    1
First decile    0.62   6
Lower quartile  1.82   13
Median          3.46   109
Upper quartile  6.17   988
Ninth decile    9.11   4614
Max             19.47  24572

refactor/cvrp-speedup:

Total jobs: 44993
Total jobs assigned: 44908 (99.81%)

Instances: 189
All jobs solutions: 165 (87.3%)
Optimal solutions: 10 (5.29%)

                Gaps   Computing times
Min             0.0    0
First decile    0.62   4
Lower quartile  1.82   7
Median          3.46   52
Upper quartile  6.17   423
Ninth decile    9.11   2247
Max             19.47  17857

Solutions do not change, everything gets a bit faster.

@krypt-n (Contributor, Author) commented Aug 13, 2018

A similar speedup can be observed with x=3:

origin/master:

                Gaps   Computing times
Min             0.0    2
First decile    0.0    22
Lower quartile  0.64   48
Median          2.06   375
Upper quartile  3.78   4540
Ninth decile    6.28   23736
Max             19.47  116124

refactor/cvrp-speedup:

                Gaps   Computing times
Min             0.0    2
First decile    0.0    13
Lower quartile  0.64   26
Median          2.06   192
Upper quartile  3.78   2139
Ninth decile    6.28   15662
Max             19.47  90277

How did you calculate the 35% number for my last PR? I'd like to add a similarly calculated number to the changelog for this PR.

@jcoupey (Collaborator) commented Aug 14, 2018

Thanks for the implementation; it's great to see you managed to keep the amounts usage unchanged. I'll probably come up with a few questions once I have time to look into the details.

How did you calculate the 35% number for my last PR?

As I recall, this was simply based on an overall computing time comparison, summing values in the "computing times" column of the files produced by compare_to_BKS.py. The -35% value is (PR_time / master_time - 1) * 100. This should be the same as comparing the total durations reported when running time ./run.sh [...].
The gain for the last PR is actually a bit higher than that, thanks to commit 8cb214f you pushed after my test.

I'll also run the same comparisons for the current PR with various x values on my usual benchmarking machine.

@jcoupey jcoupey added this to the v1.3.0 milestone Aug 14, 2018
@jcoupey (Collaborator) commented Aug 14, 2018

Running on all CVRP instances with 8 threads, I'm getting the following total computing times comparisons:

x   master      this PR     Gain (%)
0   2m23.429s   0m47.793s   -66.7
1   4m21.055s   1m51.477s   -57.3
2   10m15.670s  4m28.015s   -56.5
3   14m31.113s  6m19.620s   -56.4
4   27m20.504s  12m2.998s   -55.9
5   53m57.172s  25m37.835s  -52.5

I confirm that all solutions are exactly the same. This is great! 🎉

@krypt-n (Contributor, Author) commented Aug 15, 2018

Apologies for the unorganized work, but I added two commits that improve computing times for instances with a lot of vehicles, i.e. those with a bottleneck in cvrp_local_search::try_job_additions. This does not affect most instances, but for the ones with many vehicles the improvement is quite noticeable.

@jcoupey (Collaborator) commented Aug 16, 2018

Did you spot the modifications from dc640a2 and aca17de because some overhead was showing up when profiling? By the way (out of curiosity), what profiling tool(s) did you use?

@krypt-n (Contributor, Author) commented Aug 16, 2018

I'm using callgrind with kcachegrind as a UI. My workflow:

  1. Add -ggdb as a compilation flag in the makefile so the profiler can read line numbers, etc.
  2. Find something I want to profile that runs in 1-5 s: long enough that improvements don't get lost in noise, and short enough that I don't have to wait for results. For the first commits in this PR this was X-n344-k43.
  3. Get a baseline measurement with time vroom -i X-n344-k43.json -t 1 (I'm using a single thread since I think that could be less susceptible to noise).
  4. Run callgrind: valgrind --tool=callgrind --dump-instrs=yes vroom -i X-n344-k43.json -t 1. --dump-instrs=yes makes profiling information available at the assembly-instruction level, which can be useful to find out where calls to inlined functions come from.
  5. View the profiling result in kcachegrind and try to find something that looks like a hotspot (e.g. allocation of amounts).
  6. Try to make performance improvements to the code and check with time vroom -i X-n344-k43.json -t 1 whether something changed.

For every change with a big impact like this PR, there are many changes that don't yield a noticeable improvement.

Did you spot the modifications from dc640a2 and aca17de because some overhead was showing up when profiling?

I did not spot it with X-n344-k43, since that instance does not spend much time in cvrp_local_search::try_job_additions. However, when running the first commits on all CVRP instances, I noticed that the computing time of instances with a lot of vehicles did not improve as much as expected (less than 50%). So I repeated the above workflow with X-n219-k73, which spent around 40% of its computing time in cvrp_local_search::try_job_additions. The hot code there was again the allocation of amounts (dc640a2), and std::min, since the loops were pretty deeply nested and did not scale well with input size (aca17de).

@jcoupey (Collaborator) commented Aug 16, 2018

Thanks for sharing. I've been really focused on the big-O complexity (based on number of jobs/vehicles) while designing the solving approach, but this is a great (and complementary) way to spot improvements. What I find really interesting is that (except for aca17de) the overall algorithmic complexity is untouched, yet the gains are stunning.

@jcoupey jcoupey self-requested a review August 16, 2018 14:40
CHANGELOG.md Outdated
@@ -6,6 +6,7 @@

- Update `clang-format` to 6.0 for automatic code formatting (#143)
- Improve performance of the TSP heuristic by ~35% on all TSPLIB instances (#142)
- Improve performance of the CVRP heuristic by ~55% on all CVRPLIB instances (#146)
Collaborator:

I think the term "performance" for the solving approach is expected to relate to solution quality. What about using "speed-up" instead? (The same applies to the entry for your latest PR on the TSP heuristic.)

Also you might want to replace "heuristic" with "approach" for the CVRP entry as your fix is useful throughout the whole solving phase (heuristic + local search).

Contributor Author:

I replaced "heuristic" with "solving", since I don't think that "approach" fits very well.

@@ -348,7 +349,7 @@ void cvrp_local_search::update_amounts(index_t v) {
std::transform(_sol_state.fwd_amounts[v].cbegin(),
_sol_state.fwd_amounts[v].cend(),
_sol_state.bwd_amounts[v].begin(),
[&](const auto& a) {
[&](const auto& a) -> amount_t {
Collaborator:

I guess this is necessary so the compiler can resolve the return type, due to the template change for amount_t?

Contributor Author:

No, it's actually a bit worse. With this annotation the return type of the lambda is amount_t, and an implicit conversion at the return statement turns the amount_diff_t into an amount_t by actually copying the values, so everything is fine.

Without it, the return type is deduced as amount_diff_t and the return statement performs no implicit conversion. This is a problem since amount_diff_t holds references to the lhs and rhs subexpressions, so undefined behaviour occurs and the calculation is not carried out correctly. Now that I think about it, I understand the exact reason: the auto total_amount = line copies into a local amount object. The lifetime of this local object ends after the return, but the amount_diff_t still holds a reference to it. I pushed a commit that fixes this issue a bit more elegantly.

This is the part I don't like about this PR: it makes it a bit harder to use amount_t correctly.

@@ -494,6 +495,20 @@ void cvrp_local_search::try_job_additions(const std::vector<index_t>& routes,
}
}

auto smallest = std::numeric_limits<gain_t>::max();
Collaborator:

Well done for using smallest and second smallest, that's a clever way to get rid of the inner loop!

@@ -494,6 +495,20 @@ void cvrp_local_search::try_job_additions(const std::vector<index_t>& routes,
}
}

auto smallest = std::numeric_limits<gain_t>::max();
auto second_smallest = std::numeric_limits<gain_t>::max();
size_t smallest_idx = std::numeric_limits<gain_t>::max();
Collaborator:

Should be std::size_t for consistency with other occurrences.

Contributor Author:

Done

capacity_t operator[](std::size_t i) const {
return lhs[i] - rhs[i];
};

Collaborator:

The formatting script complains about spaces here.

Contributor Author:

Done

@@ -31,51 +25,3 @@ amount_t& amount_t::operator-=(const amount_t& rhs) {
return *this;
}

Collaborator:

The formatting script complains about newlines here.

Contributor Author:

Done

using parent::push_back;
std::size_t size() const {
return elems.size();
};

amount_t& operator+=(const amount_t& rhs);
Collaborator:

I guess the reason for having all implementations for expression templates in the .h file is the same as for my question in the previous PR (allows the compiler to have the details from other compilation units)?

If so, wouldn't that make sense for amount_t::operator+= and amount_t::operator-= too?

Contributor Author:

I moved them to the header and deleted the .cpp file, since it is now empty.

@jcoupey jcoupey merged commit d9ed6bb into VROOM-Project:master Aug 17, 2018
jcoupey added a commit that referenced this pull request Aug 17, 2018