Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve lifetime management of ledger objects (SLEs) to prevent runaway memory usage. AKA "Is it caching? It's always caching." #4822

Merged
merged 11 commits into from Jan 12, 2024

Conversation

ximinez
Copy link
Collaborator

@ximinez ximinez commented Nov 20, 2023

High Level Overview of Change

This PR, if merged, changes CachedView to only cache the key needed to find each ledger object (SLE) in the global CachedSLEs, instead of a strong reference to the SLE itself. Further, it releases a hard reference to the last Ledger used for path finding work as soon as there is no more work to be done. Additionally, a bunch of logging is added to monitor cache sizes before and after sweeping for expired objects. The logging is not strictly necessary for the solution, but it doesn't appear to hurt, and can be useful if there are similar issues in the future. Finally, several more objects, and some statistics for the CachedSLEs, are tracked for the get_counts admin command.

Resolves #4224

Context of Change

Currently, the CachedView object holds and keeps strong references (std::shared_ptr) to objects being loaded from the ledger. This prevents those objects from being completely released from the global cache (CachedSLEs, which is a TaggedCache) and releasing their memory until the owning Ledger is released.

This is not a significant problem during typical server operations, but can lead to memory exhaustion in some circumstances, including path finding. This looks like a memory leak, but isn't really, because the memory would be recovered eventually if rippled was able to continue.

It needs to be noted that this is not a guaranteed complete solution for all path finding memory issues. It's still possible that sufficiently large path_search configuration values, or a sufficiently large set of steps, accounts, and trust lines could still consume enough memory to cause problems.

Problems with memory usage during path finding has been an ongoing issue. See also:

I hope that this is the last word on this issue, but real world performance remains to be seen.

Type of Change

  • [X ] Bug fix (non-breaking change which fixes an issue)
  • [X ] Refactor (non-breaking change that only restructures code)

API Impact

  • [X ] Public API: New feature (new methods and/or new fields)
    • Potentially adds several new key-value pairs to the output of get_counts.

Before / After

Before

Certain combinations of path finding requests could cause rippled to exhaust available memory and crash, or be killed by the operating system.

After

The same scenarios will not cause rippled to crash, or at worst will allow rippled to service many more requests and run a lot longer before crashing.

Test Plan

This is an optimization that shouldn't affect anything other than this particular edge case. Thus existing test cases should cover it.

Beta testers are encouraged to build this branch, put it on public facing servers with path finding configured, and report back on their results.

scottschurr and others added 2 commits November 20, 2023 15:30
* Only store SLE digest in CachedView; get SLEs from CachedSLEs
* Also force release of last ledger used for path finding if there are
  no path finding requests to process
* Count more ST objects (derive from `CountedObject`)
* Track CachedView stats in CountedObjects
…emory

* upstream/develop:
  Set version to 2.0.0-rc3
  docs(API-CHANGELOG): add extra bullet about DeliverMax (4784)
  Update API-CHANGELOG.md for release 2.0 (4828)
  Add Debian 12 Bookworm; ignore core-utils in almalinux (4836)
  Set version to 2.0.0-rc2
  Optimize calculation of close time to avoid impasse and minimize gratuitous proposal changes (4760)
  Fix 2.0 regression in tx method with binary output (4812)
  Update Linux smoketest distros (4813)
  Promote API version 2 to supported (4803)
  Support for the mold linker (4807)
  Proposed 2.0.0-rc2 (4818)
  Set version to 2.0.0-rc1
@intelliot
Copy link
Collaborator

  • Requires additional testing since this makes changes that aren't necessarily specific to pathfinding.
  • There's a smaller / more targeted change that could be made, which cleans out the pointers: just the changes in src/ripple/app/ledger/impl/LedgerMaster.cpp.
  • Would like node operators (supporting pathfinding) to run this change to confirm that it resolves the problem.

…emory

* upstream/develop:
  APIv2: show DeliverMax in submit, submit_multisigned (4827)
  APIv2: consistently return ledger_index as integer (4820)
@shortthefomo
Copy link

shortthefomo commented Dec 4, 2023

Feed back so far.

  1. follow up responses to the initial path_find request seem slower than what we have today.. the production build seems to return results much more frequently.
  2. memory seems to behave for the 2-3 hrs I let node run on "normal" requests and some light pathing requests.
Screenshot 2023-12-04 at 18 47 28 3. asked few a few people to hit a front end that adds a bunch of load on the box and managed to bring it down. Here is error log...
2023-Dec-04 21:38:43.756954454 UTC NetworkOPs:WRN Missing offer
2023-Dec-04 21:38:43.757092579 UTC NetworkOPs:WRN Missing offer
2023-Dec-04 21:38:43.758713948 UTC NetworkOPs:WRN Missing offer
2023-Dec-04 21:38:43.759163464 UTC NetworkOPs:WRN Missing offer
2023-Dec-04 21:38:43.829825677 UTC Resource:WRN Consumer entry 127.0.0.1 dropped with balance 15010 at or above drop threshold 15000
2023-Dec-04 21:38:43.830112890 UTC Resource:WRN Consumer entry 127.0.0.1 dropped with balance 15104 at or above drop threshold 15000
rippled: /root/.conan/data/boost/1.82.0/_/_/package/3a52104b31a9c302344831090b42e066cefe22fe/include/boost/beast/websocket/detail/soft_mutex.hpp:89: bool boost::beast::websocket::detail::soft_mutex::try_lock(const T*) [with T = boost::beast::websocket::stream<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy> >::close_op<boost::asio::executor_binder<ripple::BaseWSPeer<ripple::ServerHandler, ripple::PlainWSPeer<ripple::ServerHandler> >::close(const boost::beast::websocket::close_reason&)::<lambda(const error_code&)>, boost::asio::strand<boost::asio::executor> > >]: Assertion `id_ != T::id' failed.
Aborted



@ximinez
Copy link
Collaborator Author

ximinez commented Dec 4, 2023

  1. asked few a few people to hit a front end that adds a bunch of load on the box and managed to bring it down. Here is error log...

That looks like a problem at the websocket layer, which path finding shouldn't affect. Can you tell if the same issue occurs under heavy load without this change?

@shortthefomo
Copy link

shortthefomo commented Dec 5, 2023

Can you tell if the same issue occurs under heavy load without this change?

(Read the request wrong) *edited

I will build develop tomorrow and check latest less this, but I run that interface off the 1.12.0 prod build and it does not fail with this error under decent load.

@shortthefomo
Copy link

right did some more testing off the develop branch this morning @ximinez

Screenshot 2023-12-05 at 11 31 03

can validate that the issue is present outside of this change as well.

…emory

* upstream/develop:
  Set version to 2.0.0-rc5
  Workarounds for gcc-13 compatibility (4817)
  Revert 4505, 4760 (4842)
  Set version to 2.0.0-rc4
@ximinez
Copy link
Collaborator Author

ximinez commented Dec 5, 2023

I run that interface off the 1.12.0 prod build and it does not fail with this error under decent load.

When you say "prod build", do you mean the published packages? Those have asserts disabled, so even if the issue exists there, you wouldn't see it. Could you build https://github.com/XRPLF/rippled/tree/1.12.0 the same way you built develop, and see if the issue exists there, too?

@intelliot
Copy link
Collaborator

right did some more testing off the develop branch this morning @ximinez

can validate that the issue is present outside of this change as well.

Given the issue appears on develop (without #4822 - and therefore doesn't seem to be related to this PR), can you open a new issue so we can track/discuss your findings there? @lathanbritz

…emory

* upstream/develop:
  Set version to 2.0.0-rc6
  Revert "Apply transaction batches in periodic intervals (4504)" (4852)
  Revert "Add ProtocolStart and GracefulClose P2P protocol messages (3839)" (4850)
  docs(API-CHANGELOG): clarify changes for V2 (4773)
  fix typo: 'of' instead of 'on' (4821)
@intelliot intelliot requested a review from Bronek January 2, 2024 18:56
src/ripple/ledger/impl/CachedView.cpp Outdated Show resolved Hide resolved
src/ripple/ledger/impl/CachedView.cpp Show resolved Hide resolved
* Rename the CachedView counters
* Fix the scope of the digest lookup lock
@ximinez ximinez added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Jan 5, 2024
@ximinez
Copy link
Collaborator Author

ximinez commented Jan 5, 2024

@Bronek @thejohnfreeman I pushed a commit to change the "miss" label to be lower-case. I don't think it requires re-review, but I wanted to give you a heads up.

@intelliot intelliot added Perf Attn Needed Attention needed from RippleX Performance Team and removed Performance/Resource Improvement labels Jan 8, 2024
@intelliot
Copy link
Collaborator

@sophiax851 this may require perf signoff

Copy link
Collaborator

@mtrippled mtrippled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@intelliot
Copy link
Collaborator

Internal tracker: RPFC-79

…emory

* upstream/develop:
  Set version to 2.0.0-rc7
…emory

* upstream/develop:
  Set version to 2.0.0
ximinez added a commit that referenced this pull request Jan 12, 2024
Prevent WebSocket connections from trying to close twice.

The issue only occurs in debug builds (assertions are disabled in
release builds, including published packages), and when the WebSocket
connections are unprivileged. The assert (and WRN log) occurs when a
client drives up the resource balance enough to be forcibly disconnected
while there are still messages pending to be sent.

Thanks to @lathanbritz for discovering this issue in #4822.
…emory

* upstream/develop:
  WebSocket should only call async_close once (4848)
@seelabs seelabs merged commit d9f90c8 into XRPLF:develop Jan 12, 2024
16 checks passed
@ximinez ximinez deleted the fix-cachedview-memory branch January 12, 2024 17:43
This was referenced Jan 18, 2024
@intelliot intelliot added Perf SignedOff RippleX Performance Team has approved and removed Perf Attn Needed Attention needed from RippleX Performance Team labels Feb 2, 2024
@intelliot
Copy link
Collaborator

Perf SignedOff: Examined the tagged cache sweeping during the test during lab testing and using a Mainnet node, didn’t observe abnormal behavior. No performance regression observed either.

Released in 2.0.1.

@sophiax851
Copy link
Collaborator

It's worth mentioning that this "Perf SignedOff" testing was conducted to ensure that there is no performance regression in the normal transaction processing with the changes made in this PR which was intended to address the root cause of the issue that I identified for #4224 . There was separate testing conducted in verifying the #4224.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. Perf SignedOff RippleX Performance Team has approved
Projects
Status: 🚢 Released in 2.0.1
Development

Successfully merging this pull request may close these issues.

Rippled locks up when client use pathing (all versions)
9 participants