
Improve AccountID string conversion caching #4181

Closed · wants to merge 1 commit into from

Conversation

@nbougalis (Contributor) commented May 27, 2022

Caching the base58check-encoded version of an AccountID has performance advantages because of the computationally heavy cost of the conversion, which requires applying SHA-256 twice.

This commit makes the cache significantly more efficient in terms of memory used: it maps an AccountID to a char* instead of a std::string (eliminating the need for dynamic memory allocation and the fixed overhead of std::string) and effectively eliminates runtime allocation and deallocation by grabbing a single, large block of memory at initialization.

The eviction policy is also improved and ensures that the cache, once full, will always remain full. Although not currently coded, the structure of the cache is such that it could easily evict the least-recently-used entry without additional overhead.
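To make the design concrete, here is a minimal sketch of a cache along the lines discussed in this PR (illustrative only, not the actual code under review; AccountID and encodeBase58Token are the existing rippled names used elsewhere in this thread, while AccountIDCacheSketch and someHash are hypothetical):

#include <cstddef>
#include <cstring>
#include <mutex>
#include <string>
#include <vector>

// Illustrative sketch only: a direct-mapped cache that maps an AccountID
// to a fixed-size char slot inside one pre-allocated block of memory.
class AccountIDCacheSketch
{
    static constexpr std::size_t slotSize_ = 40;  // longest encoding + NUL
    std::size_t const slots_;
    std::vector<AccountID> keys_;  // one key per slot
    std::vector<char> text_;       // single allocation for all cached strings
    std::mutex mutex_;

public:
    explicit AccountIDCacheSketch(std::size_t slots)
        : slots_(slots), keys_(slots), text_(slots * slotSize_)
    {
    }

    std::string
    toBase58(AccountID const& id)
    {
        std::size_t const slot = someHash(id) % slots_;  // hypothetical hash
        char* const s = &text_[slot * slotSize_];

        std::lock_guard lock(mutex_);
        if (s[0] != '\0' && keys_[slot] == id)
            return s;  // hit: copy the cached characters into the result

        // Miss or collision: encode and overwrite whatever held this slot,
        // so the cache stays full once it has filled up.
        std::string const enc =
            encodeBase58Token(TokenType::AccountID, id.data(), id.size());
        keys_[slot] = id;
        std::memcpy(s, enc.c_str(), enc.size() + 1);
        return enc;
    }
};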


@seelabs (Collaborator) left a comment


This looks good. I left a suggestion that might help with lock contention.

Review thread (outdated): src/ripple/protocol/impl/AccountID.cpp
@scottschurr (Collaborator)

I'm assuming that the caching is purely to reduce execution time. To that end, I timed execution of std::string toBase58(AccountID const& v) in the application with an optimized build. I started rippled and just let it run until this method had been called 1003520 times.

I took measurements with three different sets of code:

  1. The code that is in the 1.9.1 release of develop.
  2. This pull request.
  3. An optimized version of the base58 encoding developed recently by @seelabs.

For timing measurements I sorted all of the execution times into buckets. I also summed all of the execution times and computed an average at the very end. I also kept track of the longest execution time. Here are the results I got on my Mac mini:

                      Old code  This Pull Req  Seelabs' Code
       0-    9 ns            0              0              0
      10-   99 ns      1000999         999060         995478
     100-  999 ns         2295           4274           7830
    1000- 9999 ns          183             85            123
   10000-99999 ns           43            101             89
  100000 and over            0              0              0
total conversions      1003520        1003520        1003520

           avg ns           49             37             39
           max ns        40773          54775          34444

Both this pull request and @seelabs' code came pretty close to one another. This pull request averaged 2 ns less than @seelabs' version. But the worst case conversion time was significantly shorter with @seelabs' code, presumably because it doesn't take any locks.

I think the measurement results are worth considering while we're deciding whether to proceed with this pull request.

Just in case anyone has questions about the timing measurement methodology, I've appended the code.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string_view>

// Running totals and histogram buckets for the timing measurements.
static std::atomic_int64_t total_ns = {};
static std::atomic_int64_t longest_conversion_ns = {};
static std::atomic_uint64_t total_conversions = {};
static std::atomic_uint64_t bucket_0_9 = {};
static std::atomic_uint64_t bucket_10_99 = {};
static std::atomic_uint64_t bucket_100_999 = {};
static std::atomic_uint64_t bucket_1000_9999 = {};
static std::atomic_uint64_t bucket_10000_99999 = {};
static std::atomic_uint64_t bucket_over_100000 = {};

class MyTimer
{
    std::string_view name_;
    std::chrono::steady_clock::time_point start_;

public:
    explicit MyTimer(char const* name)
        : name_(name), start_(std::chrono::steady_clock::now())
    {
    }

    ~MyTimer()
    {
        std::int64_t const dur =
            (std::chrono::steady_clock::now() - start_).count();
        total_ns += dur;
        ++total_conversions;
        for (std::int64_t longest = longest_conversion_ns; dur > longest;
             longest = longest_conversion_ns)

            if (longest_conversion_ns.compare_exchange_weak(longest, dur))
                break;

        // Keep statistics on ranges.
        if (dur < 10)
            ++bucket_0_9;
        else if (dur < 100)
            ++bucket_10_99;
        else if (dur < 1000)
            ++bucket_100_999;
        else if (dur < 10000)
            ++bucket_1000_9999;
        else if (dur < 100000)
            ++bucket_10000_99999;
        else
            ++bucket_over_100000;

        // Dump results to console sporadically
        std::uint64_t const conversions = total_conversions;
        if ((conversions & 0x00000FFFu) == 0)
        {
            std::uint64_t avg_ns = total_ns / conversions;
            std::cout << name_ << " -- conversions: " << conversions
                      << "; avg ns: " << avg_ns
                      << "; longest ns: " << longest_conversion_ns << std::endl;
            std::cout << "    ranges -- 0-9ns: " << bucket_0_9
                      << "; 10-99ns: " << bucket_10_99
                      << "; 100-999ns: " << bucket_100_999
                      << "; 1000-9999ns: " << bucket_1000_9999
                      << "; 10000-99999ns: " << bucket_10000_99999
                      << "; 100000ns and over: " << bucket_over_100000
                      << std::endl;
        }
    }
};

std::string
toBase58(AccountID const& v)
{
    MyTimer("AccountID to string");
    return encodeBase58Token(TokenType::AccountID, v.data(), v.size());
}

@scottschurr (Collaborator) commented Jun 2, 2022

Gah. There is, of course, a bug in my timing code. The good news is that it does not obviate the results. All of the results above are in units of steady_clock ticks, not nanoseconds. But all of the relative values are legit.

Here's the fix in case anyone cares:

        using namespace std::chrono;
        std::int64_t const dur = duration_cast<nanoseconds>(
                                     steady_clock::now() - start_)
                                     .count();

@scottschurr (Collaborator)

Having thought about this for a little while, I'm inclined toward adopting @seelabs' change rather than this one. @seelabs' change will improve both encoding and decoding of all base58 values, not just AccountIDs. The change in this pull request only improves to_string conversions for AccountIDs. So, at least for an initial take, the other change provides more utility. But, of course, @seelabs' code has not yet passed code review. We should wait to see how that goes.

@nbougalis (Contributor, Author) commented Jun 3, 2022

Having thought about this for a little while, I'm inclined toward adopting @seelabs' change rather than this one. @seelabs' change will improve both encoding and decoding of all base58 values, not just AccountIDs. The change in this pull request only improves to_string conversions for AccountIDs. So, at least for an initial take, the other change provides more utility. But, of course, @seelabs' code has not yet passed code review. We should wait to see how that goes.

I hear what you're saying about @seelabs's change and agree that it makes the encoder better, which, of course, improves all base58 encodings.

Taking a step back, I think we agree that converting an AccountID to its corresponding Base58Check format is expensive. This is something that servers which service clients do a lot; think how many times it would be invoked while servicing a single account_lines call for Bitstamp or GateHub, for example. That's the scenario the cache helps with, and why I think the two changes are orthogonal.

With that in mind, I decided to tweak this a bit further to eliminate the need for and overhead of the hardened_hash_map and (imo) reduce complexity but I'm sure it will be controversial (I'm not 100% sold myself).

@greg7mdp (Contributor) left a comment


Love the new design! Indeed it is better than using an unordered_map.

Review threads (outdated): src/ripple/protocol/impl/AccountID.cpp (×3), src/ripple/core/impl/Config.cpp
@nbougalis (Contributor, Author)

@greg7mdp, @scottschurr, @seelabs: I took your feedback to heart and updated the code. I think it's dramatically simpler and cleaner now. I would appreciate your feedback. This PR is now on top of #4201, to leverage the new packed_spinlock code, which should (in theory) help avoid contention for the cache lock under most scenarios.

@greg7mdp (Contributor) commented Jun 9, 2022

Looks great and very simple indeed. I don't know enough about spinlocks to comment on using it instead of std::mutex.

@greg7mdp closed this Jun 9, 2022
@greg7mdp reopened this Jun 9, 2022
@greg7mdp (Contributor) commented Jun 9, 2022

the new packed_spinlock code, which should (in theory) help avoid contention for the cache lock under most scenarios.

Just a thought. Maybe we never need to wait? We could just try to acquire the mutex (try_lock), and if it fails just call encodeBase58Token. In any case the function is not that long, and the cases where that happens should be quite rare.
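A rough sketch of that idea, just to make it concrete (cacheMutex, cacheLookup, and cacheStore are hypothetical placeholders, not names from this PR):

std::string
toBase58(AccountID const& v)
{
    std::unique_lock lock(cacheMutex, std::try_to_lock);
    if (lock.owns_lock())
    {
        if (char const* cached = cacheLookup(v))  // hypothetical lookup
            return cached;
        auto s = encodeBase58Token(TokenType::AccountID, v.data(), v.size());
        cacheStore(v, s);  // hypothetical store
        return s;
    }
    // Lock is contended: skip the cache and just do the conversion.
    return encodeBase58Token(TokenType::AccountID, v.data(), v.size());
}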

@scottschurr (Collaborator)

I really like the changes that were made to this pull request. I especially like replacing the hash table with an array. Very clever.

Given the changes I decided that it made sense to redo the timings I did previously. I used the same technique, but I included the fix so actual nanoseconds are reported, not steady_clock ticks... Then I looked at the results and was startled by how similar the results were between the approaches.

So I decided there must be something wrong with my measurement technique. Ah ha! The optimizer must be moving where steady_clock::now() is being read. So I added an atomic access just after the constructor's read of now() and just before the destructor's read of now(). The atomic accesses act as fences to prevent code execution from being moved to either side. That will make sure the two reads of now() bracket the code of interest. But it also added two atomic accesses that would now be included in the time measurement.
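Roughly, the fencing looks like this (a sketch of the approach described above, not the exact instrumented code; fence is a dummy counter touched only for its ordering side effect):

static std::atomic_uint64_t fence = {};

// Modified constructor and destructor of the MyTimer class shown earlier.
explicit MyTimer(char const* name)
    : name_(name), start_(std::chrono::steady_clock::now())
{
    ++fence;  // atomic access just after reading the start time
}

~MyTimer()
{
    ++fence;  // atomic access just before reading the end time
    using namespace std::chrono;
    std::int64_t const dur =
        duration_cast<nanoseconds>(steady_clock::now() - start_).count();
    // ... bucket and statistics updates as in the original timer ...
}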

What I learned from this is that accessing atomic variables is expensive. Adding those two atomic accesses added about 10 ns to the average result on all three techniques. But adding the fences did not change the outcome very much beyond adding about 10 ns. Here are the results I got with real nanoseconds and with the atomic accesses in place acting as fences:

Timing of develop (1.9.1) with the fences in place:

AccountID to string -- conversions: 1003520; avg ns: 53; longest ns: 65882
  ranges -- 0-9ns:      0
          10-99ns: 968040
        100-999ns:  35332
      1000-9999ns:     73
    10000-99999ns:     75 
100000ns and over:      0

Timing of Nik's latest code with fences in place:

AccountID to string -- conversions: 1003520; avg ns: 49; longest ns: 31047
  ranges -- 0-9ns:      0
          10-99ns: 993192
        100-999ns:  10065
      1000-9999ns:    151
    10000-99999ns:    112
100000ns and over:      0

Timing of Scott D's base58 work with the fences in place:

AccountID to string -- conversions: 1003520; avg ns: 51; longest ns: 47268
  ranges -- 0-9ns:      0
          10-99ns: 974942
        100-999ns:  28359
      1000-9999ns:    125
    10000-99999ns:     94
100000ns and over:      0

Meh. The cache does give the best result. But it's still only saving 4 ns over the non-cached code in 1.9.1. If we subtract out the (approximately) 10 ns added by the fences, then we'd be going from an average of 43 ns to an average of 39 ns. About a 10% improvement. I'm not convinced the cache is worth it for a reduction that small.

It's certainly possible there's a problem with my measurement technique. Good measurements are hard. But I can't figure out what that problem might be.

Thoughts?

@greg7mdp (Contributor) commented Jun 11, 2022

Thanks @scottschurr, very useful. Some random thoughts I have.

  1. Strange that the longest ns is longer for Scott D's branch than develop. I would assume these occur when we don't have a cache hit, right?
  2. I wonder if the cache is wildly too large. How many different AccountIDs does the test use? Are they randomly accessed? I'd suggest trying with a small cache size (maybe 100 or 200 entries) so the cache can reside in the L3 processor cache; otherwise the whole benchmark is probably limited by the speed of memory accesses, leading to similar results (see the rough footprint numbers below).
  3. I think it might be the case that the cost of doing the base58 encode is not that much more than doing an out-of-cache memory access.
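To put rough numbers on point 2 (back-of-the-envelope figures using sizes mentioned elsewhere in this thread, not measurements): each entry is roughly a 20-byte AccountID key plus a 40-byte encoded string, about 60 bytes in all. A 200-entry cache is then on the order of 200 × 60 = 12,000 bytes and sits comfortably in the processor cache, while a cache with hundreds of thousands of entries will mostly live in main memory, so even a hit pays for a memory access that may cost nearly as much as redoing the encode.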

@nbougalis (Contributor, Author)

I appreciate you redoing the timings, @scottschurr.

There are a couple of things I'd like to point out: first, I suspect that we're already reaping benefits from the existing cache. An interesting approach would be to disable it entirely and run measurements then.

Second, even if this change is, roughly, as fast as the existing mechanism, I still think it's a net gain: the code is simpler and avoids doing work at runtime, and that can only be a good thing. And I don't think that this is such a huge hot spot that it's worth squeezing every last drop of performance out of it.

With that said, why is the performance not significantly better? A couple of thoughts there:

First, despite optimizing the cache itself, we can't just hand out a raw pointer into the cache without pinning the memory, so we still end up having to allocate a std::string to return. Perhaps we could use boost::static_string<40>? Also, we could simplify the hashing; the existing code uses the full 160 bits but, as @greg7mdp pointed out, perhaps that's overkill, and using the first 64 bits is sufficient.
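As an illustration of that simplification (an assumption about how it might look, not code from this PR), the slot index could be derived from just the first eight bytes; AccountIDs are themselves hash outputs, so the prefix is already well distributed:

#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch only: index a cache slot using the first 64 bits of the AccountID
// instead of hashing all 160 bits.
std::size_t
slotFor(AccountID const& id, std::size_t slots)
{
    std::uint64_t prefix;
    std::memcpy(&prefix, id.data(), sizeof(prefix));  // first 8 of 20 bytes
    return static_cast<std::size_t>(prefix % slots);
}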

Second, I suspect the test scenario you're running is probably not very representative, despite being run against real data. I think that the biggest impact will likely be felt on servers that handle a large volume of RPC requests and have to do these conversions hundreds of times per RPC call (imagine, for example, responding to an account_lines request), multiplied by the number of RPC requests handled per second. The same AccountID might be needed tens, or maybe even hundreds, of times per call, and so even with a moderate hit ratio the cache may actually add up.

Apropos of this, I think it's worth examining the size of the cache. Greg has suggested that the defined sizes may be too large, and perhaps that's true. If a cache that's half the size is, say, 90% as efficient, why go larger?

So to sum up my thoughts on this: I feel this commit actually cleans up the code a lot, and that alone may be sufficient to get it merged. We could make the cache a runtime-configurable option so operators can choose whether to use it, but that may be overkill. Lastly, I'm not opposed to just removing the cache entirely, but before we do that we ought to be very sure that performance doesn't suffer on the servers which actually stand to benefit the most from the cache; I think it's worth running this branch on s1 or s2, capturing performance over a large number of queries, and comparing results.

@scottschurr (Collaborator)

@greg7mdp:

strange that longest ns is longer for Scott D's branch than develop. I would assume these occur when we don't have a cache hit, right?

My guess is that the longest ns can be attributed to at least a couple of sources:

  • Multiprocessing/thread swapping can swap a thread out. Where, when, and how often that happens, and how long the thread stays swapped out, is non-deterministic from the perspective of this test.
  • Having code not in cache can cause delays while that code is fetched into cache. So code with a larger footprint may show longer spikes.

I would expect thread swapping to have the larger impact of the two. So the longest ns value may effectively be noise. But it's hard to know for sure.

How many different AccountIDs does the test use? Are they randomly accessed?

The test case is not really a test case as such. I'm installing the timer in the rippled source code and then running that instance of rippled against the main network. The AccountIDs being used should be representative of the AccountIDs seen by a normally running rippled instance. Similarly, the randomness of the access should be the same as what a regular rippled instance would see.

I take over 1 million samples to try to average out the uses from test run to test run. But it's also true that the specifics of the test are far from identical between test cases and from run to run.

@nbougalis noted:

I suspect that we're already reaping benefits from the existing cache. An interesting approach would be to disable it entirely and run measurements then.

Just to be clear, the timings I made of develop (1.9.1) do not benefit from any cache. The call being timed is to...

  1. encodeBase58Token(TokenType::AccountID, v.data(), v.size()) in AccountID.cpp. That calls
  2. encodeBase58Token(TokenType type, void const* token, std::size_t size) in tokens.cpp. That, in turn, calls
  3. detail::detail::encodeBase58(buf.data(), expanded, buf.data() + expanded, bufsize - expanded)

There's no AccountID caching in any of those calls.

I think that the biggest impact will likely be felt on servers that handle a large volume of RPC requests and have to do these conversions hundreds of times per one RPC call

That's a worthwhile point. Can we demonstrate the benefit? Could we, for example, install an instrumented 1.9.1 rippled and an instrumented version of the code-with-caching in two of our client handlers?

I'd prefer to actually see the results of the measurements than to shrug and say we're measuring the wrong stuff.

I'm not saying the code isn't clean. But if we're going to put in a cache it seems like we ought to demonstrate that we get significant benefit from it.

@nbougalis (Contributor, Author)

That's a worthwhile point. Can we demonstrate the benefit? Could we, for example, install an instrumented 1.9.1 rippled and an instrumented version of the code-with-caching in two of our client handlers?

I think this should be doable. Can you link me to your code please?

@scottschurr (Collaborator)

@nbougalis wrote...

I think this should be doable. Can you link me to your code please?

Sure thing. Also feel free to adjust the code in case you think there might be problems with the way it is doing the timing.

Here's a branch that does timing for 1.9.1: https://github.com/scottschurr/rippled/tree/to-base-58-timing-1-9-1

Here's a branch that does identical timing for the most recent version of this aidc branch: https://github.com/scottschurr/rippled/tree/to-base-58-timing-nik-aidc

Be aware that the timing results are written to standard out, not to the log. The log is not available at the point we are measuring the timing. So you may need to redirect the server's output to a file or the like.

Thanks for doing the timing. I'm curious to see what the results tell us.

@nbougalis (Contributor, Author)

So, I've been running this branch against a server that's been serving public RPC/WebSocket queries for a little over 2 weeks. Over that time, the cache has a hit ratio that's just shy of 89%, which is significant. About 25% of the calls were ≤ 50ns; over 96% of the rest were ≤ 100 ns.

Caching the base58check encoded version of an `AccountID` has
performance advantages because of the computationally heavy cost
associated with the conversion, which requires the application of
SHA-256 twice.

This commit makes the cache significantly more efficient in terms
of memory used: it eliminates the map, using a vector with a size
that is determined by the configured size of the node, and a hash
function to directly map any given `AccountID` to a specific slot
in the cache; the eviction policy is simple: in case of collision
the existing entry is removed and replaced with the new data.

Previously, use of the cache was optional and required additional
effort by the programmer. Now the cache is automatic and does not
require any additional work or information.

The new cache also utilizes a 64-way spinlock, to help reduce any
contention that the pressure on the cache would impose.
This was referenced Aug 4, 2022
@scottschurr (Collaborator)

@nbougalis, thanks for making the measurements you did. A cache hit ratio of 89% is significant. What I've not yet seen from your timings is evidence of a speed up. For that evidence we'd need to see timings without the cache on the same machine and a comparable workload. If the timings with the cache are significantly faster than the timings without the cache then the complexity introduced by the cache is justified.

Make sense?

@intelliot (Collaborator)

Opening for review since this may be included in 1.10. e2eed96 (The contributing process requires approvals from at least two reviewers)

@intelliot reopened this Feb 21, 2023
@intelliot (Collaborator) commented Feb 21, 2023

(Ignore the merge conflicts mentioned by GitHub below.)

@scottschurr (Collaborator) left a comment


Although I'm not convinced (by measurements taken) that this approach provides significant speedups, I don't think it does any harm. 👍 from me.


// The maximum size (in bytes) of an item, along with a null terminator
// and aligned at 8 bytes
std::size_t const size_ = 40;
Collaborator:

Consider making this constexpr.
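For illustration, assuming size_ is a class data member, the suggested change would read something like:

    static constexpr std::size_t size_ = 40;  // max item size, with NUL, 8-byte aligned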

Collaborator:

@scottschurr given I'll be closing this PR soon (and it might not get looked at again after that), would it be worth opening an issue for this?

If not, we can leave as-is.

Collaborator:

It's not worth an issue. We can leave it as is.

@intelliot added this to the 1.10.0 milestone Feb 21, 2023

return *raSrcAccount;
}();

Contributor:

Nitpick: I am not sure what the benefit of a lambda is in this case.

@intelliot (Collaborator)

Thank you both for the reviews!

@intelliot closed this Feb 23, 2023
@nbougalis deleted the aidc branch October 16, 2023 06:03