
Infinite fall memory fix for #14 #30

Merged: 9 commits, Nov 10, 2020

Conversation

robert-hrusecky
Contributor

This is just a simple de-allocation strategy that removes SAPRegions immediately when all proxies exit them. This appears to plug up the memory leak.

Since the regions are stored in a HashMap, I have a suspicion that this may be all we need performance-wise. New allocations of regions should only require actual memory allocations when the HashMap internally resizes. So if the number of regions stays mostly constant, I expect to see little overhead.
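
To make the general idea concrete, here is a minimal sketch (not the actual rapier code; `RegionKey`, `Region`, and the `proxy_count` field are hypothetical stand-ins for the real broad-phase types):

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real broad-phase types; the actual
// SAPRegion in rapier stores more state than a proxy count.
type RegionKey = (i32, i32, i32);

struct Region {
    proxy_count: usize,
    // endpoints, proxy lists, etc. would live here as well
}

struct BroadPhase {
    regions: HashMap<RegionKey, Region>,
}

impl BroadPhase {
    fn update_regions(&mut self) {
        // Drop a region as soon as its last proxy has left it. Because the
        // regions live in a HashMap, removing and later re-inserting a key
        // does not allocate for the map itself unless it has to resize.
        self.regions.retain(|_, region| region.proxy_count > 0);
    }
}
```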

I'm making this PR now mostly to get the benchmarks to run. If you are happy with it already, go ahead and merge it. I can also work on adding the opt-out-of-de-allocation feature discussed in #14.

Closes #14

@robert-hrusecky
Contributor Author

Since the CI benchmark failed and I'm not quite sure how to rerun it or what the issue is, I ran a few of the benchmarks locally with the `simd-stable` feature enabled. The change seems to have little impact on performance. Let me know what you think, thanks.

```diff
@@ -640,11 +651,18 @@ impl BroadPhase {
 }
 }

 fn update_regions(&mut self) {
```

Just a thought/question about this: it seems that currently, as soon as `deleted_any` is true, you go through all regions to check which need an update and which should be discarded. That seems like a good start, but isn't there a scalability issue for a large world in which many regions don't change at all while a small set changes a lot? An alternative could be to collect only the regions that need an update or deletion, and touch only those. Since I don't have the full design in mind, maybe there is a trivial reason why that would not work, but this question came to mind when I saw this code, so I'm writing it down here in case it is useful.

Contributor Author

Unless I am misunderstanding you, I don't believe this change adds a major scalability issue, at least from an algorithmic viewpoint. The previous version and the version with my changes both iterate over all the regions, which is of course O(number of regions). Perhaps you mean that we could avoid making multiple passes? That is probably correct, but it would require a mutate-while-iterating pattern, which is usually tricky to do in Rust.


Indeed, your change does not change the complexity. I put a note here because, while reading this code, I realised this potential scalability problem, which was already present. Sorry, I should have commented on the original code instead. I have now created an issue with that thought.

What prompted this note was that, assuming we only process objects that are awake, if a large world has a lot of objects and only a subset of them is awake, we still process all regions. This is in contrast to a tree structure that keeps track of whether there is any awake object in its sub-tree, which could have better scalability properties. But that is probably an optimisation for later...


Regarding the double loop, we could avoid it by storing the points (keys) of regions with 0 proxies into a temporary vector, then iterating over this vector and deleting the corresponding entries from the regions hashmap. I am not sure it would be more performant; I guess it depends on the ratio of regions with 0 proxies to the total number of regions.
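
Roughly something like this (just a sketch; `RegionKey`, `Region`, and `proxy_count` are hypothetical names, not the real types):

```rust
use std::collections::HashMap;

// Hypothetical types; the real SAPRegion and its keys look different.
type RegionKey = (i32, i32, i32);

struct Region {
    proxy_count: usize,
}

// Collect the keys of empty regions first, then remove them in a second
// pass, so the map is never mutated while it is being iterated.
fn remove_empty_regions(regions: &mut HashMap<RegionKey, Region>) {
    let empty_keys: Vec<RegionKey> = regions
        .iter()
        .filter(|(_, region)| region.proxy_count == 0)
        .map(|(&key, _)| key)
        .collect();

    for key in empty_keys {
        regions.remove(&key);
    }
}
```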

Contributor Author

Ah yes! Sorry I misunderstood. That's definitely a good thought in my opinion.

@sebcrozet
Member

Thank you for this PR!
I triggered the benchmark runs manually because there seems to be an issue with our GitHub Actions workflow: it does not seem to work for pull requests originating from forks.

On this PR's benchmark run, two graphs stand out:

[two benchmark graphs: the Boxes scenario and the Ball-Joint scenario]

The first one (the Boxes scenario) is really odd because it means that this change causes the boxes to fall asleep sooner than before. We could see this as an improvement, though I will need to investigate exactly why it happens to make sure we are not just breaking something.

The second graph shows that there appears to be a small performance regression in the Ball-Joint scenario. This scenario is interesting because it should result in lots of broad-phase cells being freed and then reallocated later, because it simulates some kind of hanging tissue (left: initial configuration, right: during the simulation):

[screenshots of the Ball-Joint scenario: initial configuration and during the simulation]

So maybe we are paying the price of the additional allocations here? This is worth investigating more closely as well, since this may be a case where the automatic deallocation behavior is not desired (just like @stephanemagnenat suggested in #14 (comment)).

@robert-hrusecky
Contributor Author

I think that something to keep in mind here is that the initial cost of allocation/deallocation of regions should be fairly cheap for the reason I mentioned above. I have a feeling the performance regressions we are seeing in the jointBall scenario come after reallocation of regions, when the vectors inside newly allocated regions need to be resized to add more elements. Therefore I think we should try having a pool of "unused" regions that have been moved out of the region HashMap. Although these regions are empty, their internal vectors would hold onto any capacity generated during the simulation. When allocating, we can check the pool first for regions before allocating fresh ones. Kinda like a thrift store for memory allocations.
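
Very roughly, what I have in mind is something like this (again only a sketch with hypothetical names, not the actual rapier types):

```rust
use std::collections::HashMap;

// Hypothetical stand-ins: the real SAPRegion owns vectors whose capacity
// we would like to keep around after the region empties out.
type RegionKey = (i32, i32, i32);

#[derive(Default)]
struct Region {
    proxies: Vec<u32>, // the capacity survives a `clear()`
}

struct BroadPhase {
    regions: HashMap<RegionKey, Region>,
    region_pool: Vec<Region>, // emptied regions kept for reuse
}

impl BroadPhase {
    // When a region empties out, move it into the pool instead of dropping
    // it, so its internal allocations can be reused later.
    fn recycle_region(&mut self, key: RegionKey) {
        if let Some(mut region) = self.regions.remove(&key) {
            region.proxies.clear(); // drops the contents, keeps the capacity
            self.region_pool.push(region);
        }
    }

    // When a proxy enters a cell with no region yet, reuse a pooled region
    // if one is available before allocating a fresh one.
    fn get_or_insert_region(&mut self, key: RegionKey) -> &mut Region {
        let pool = &mut self.region_pool;
        self.regions
            .entry(key)
            .or_insert_with(|| pool.pop().unwrap_or_default())
    }
}
```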

A simpler option would be to just allocate the vectors with larger initial capacity, but this does not avoid the initial memory allocations and adds memory usage to simulations which may not need it.

I could be completely wrong though, as my changes do cause a few more region updates to happen. We could just be seeing the effects of those. Perhaps deallocation is actually more costly than I'm realizing. I'll try out option 1 soon and see if it helps.

For the boxes scenario, I'll double check that there isn't any obvious visual difference in simulation playback.

@robert-hrusecky
Contributor Author

Hm... I am unable to reproduce the issue locally.

[local benchmark plot: jointBall_nightly]

@robert-hrusecky
Contributor Author

robert-hrusecky commented Nov 3, 2020

Hello!

I tried out my region pool idea, and it seems to improve performance slightly on a few of the benchmarks, so I went ahead and pushed the changes here.

I am still unable to reproduce the issues showing up in the public benchmarks. When I run them on my machine, I get these results (averaged over 10 runs to reduce noise):

Boxes benchmark

[benchmark plot: boxes3]

Joint Ball benchmark

[benchmark plot: jointBall3]

Here are a few benchmarks where the region pool appears to slightly help performance:

Joint Revolute benchmark

[benchmark plot: jointRevolute3]

Keva Tower benchmark

[benchmark plot: kevaTower3]

Here, the green lines are the current code on this branch, the blue lines are the code before adding the region pool, and the yellow lines are the current master.

@sebcrozet, would you trigger the benchmarks to run again for this PR, please? I think we should make sure the initial run was not some weird fluke. I noticed you merged a fix for a determinism bug recently. I haven't looked at it in any amount of depth, but perhaps that could be the cause of the discrepancy?

Also, do you think it would be a good idea to include some sort of cap on the number of empty regions stored in the pool to avoid hanging on to too much memory? Perhaps regions in the pool could be deallocated if they are unused for a certain number of timesteps. This could eventually be made configurable by the user, similar to what @stephanemagnenat suggested.
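
For the cap, even something as simple as this sketch could work (`MAX_POOLED_REGIONS` is an arbitrary placeholder value, not something that exists in the code today):

```rust
// A hard cap on the pool size; the value is an arbitrary placeholder and
// could eventually be made configurable by the user.
const MAX_POOLED_REGIONS: usize = 256;

fn trim_region_pool<R>(region_pool: &mut Vec<R>) {
    // Drop any pooled regions beyond the cap so their memory is released
    // instead of being held indefinitely.
    region_pool.truncate(MAX_POOLED_REGIONS);
}
```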

@sebcrozet
Member

sebcrozet commented Nov 3, 2020

> @sebcrozet, would you trigger the benchmarks to run again for this PR, please? I think we should make sure the initial run was not some weird fluke. I noticed you merged a fix for a determinism bug recently. I haven't looked at it in any amount of depth, but perhaps that could be the cause of the discrepancy?

@robert-hrusecky Unfortunately, I'm having some random issues with my benchmarking hardware recently (it is moderately old; I may have to replace it completely or switch to a VPS), so I can't start a new benchmark right now. I will probably investigate this hardware problem next week.

@sebcrozet
Member

@robert-hrusecky Alright, I changed my benchmarking server. I tried a VPS (with reserved vcores), but the performance differences between two benchmark runs were too large for it to be useful. So I built a new home server, deleted all the previous benchmark results, ran the benchmarks again for all tags on master, and manually started the benchmark for your PR.

Now the only weird result I see in the benchmarks is that the objects appear to fall asleep sooner than before in the Boxes demo, though I am not sure why your modification would have this effect. One guess is that, with your changes, some contacts are deleted sooner than before (because the `self.update_count = 2` results in some other contact pairs being correctly removed). This may produce a slightly different simulation where things end up sleeping slightly sooner…

Anyway, this all looks OK to me. I don't think there is a reason to make this optional for the end user, considering it does not affect performance.

Merging this now.

sebcrozet merged commit 2102aeb into dimforge:master on Nov 10, 2020
Successfully merging this pull request may close these issues.

Unbounded memory usage when an object falls indefinitely