
perf: Address performance of EthGetTransactionCount #10700

Merged: 3 commits into master on Apr 21, 2023

Conversation

@fridrik01 (Contributor) commented Apr 19, 2023

Fixes: #10538

Context

We have observed that EthGetTransactionCount is one of the hotspots on Glif production nodes, where we see regular 10-20 second latencies when calling this RPC method. We have also been able to replicate the issue on local mainnet nodes on our devboxes by calling the method repeatedly.

The fix

I tracked the high latency spikes and found they were correlated with runs of ExecuteTipSet while following the chain.

To address this, we should not rely on tipset computation to get the nonce. Instead, as suggested by steb, we look at the parent tipset and then count the messages sent from `addr`.
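As a minimal sketch of that approach (not the exact lotus code; `stateNonceAt` and `messagesIn` are hypothetical stand-ins for the message pool's provider calls):

```go
package mpoolsketch

import (
	"context"

	"github.com/filecoin-project/go-address"
	"github.com/filecoin-project/lotus/chain/types"
)

// stateNonceAt and messagesIn are hypothetical placeholders, defined only
// so the sketch is self-contained; the real lookups go through the message
// pool's Provider.
func stateNonceAt(ctx context.Context, addr address.Address, parents types.TipSetKey) (uint64, error) {
	return 0, nil // would load the actor's nonce from the parent state tree
}

func messagesIn(ctx context.Context, ts *types.TipSet) ([]*types.Message, error) {
	return nil, nil // would load all messages included in ts's blocks
}

// nonceForAddr reads the nonce from the parent tipset's state, which is
// already computed while following the chain, then bumps it once for every
// message in the current tipset sent from addr. This avoids triggering
// ExecuteTipSet on the current tipset just to answer a nonce query.
func nonceForAddr(ctx context.Context, addr address.Address, ts *types.TipSet) (uint64, error) {
	nonce, err := stateNonceAt(ctx, addr, ts.Parents())
	if err != nil {
		return 0, err
	}
	msgs, err := messagesIn(ctx, ts)
	if err != nil {
		return 0, err
	}
	for _, m := range msgs {
		if m.From == addr {
			nonce++
		}
	}
	return nonce, nil
}
```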

Test plan

Sequential stress test

I started and synced my lotus node on mainnet, then called the EthGetTransactionCount RPC method in a loop.
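A rough Go reproduction of that loop, assuming a local node serving the JSON-RPC API on port 1234 (the address and the slow-call threshold are just example values):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	payload := []byte(`{"jsonrpc":"2.0","id":1,"method":"eth_getTransactionCount",` +
		`"params":["0xd4c70007F3F502f212c7e6794b94C06F36173B36","latest"]}`)
	for i := 0; i < 35000; i++ {
		start := time.Now()
		resp, err := http.Post("http://localhost:1234/rpc/v1", "application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		// Drain and close the body so connections are reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		if d := time.Since(start); d > 100*time.Millisecond {
			fmt.Printf("slow call #%d took %s\n", i, d)
		}
	}
}
```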

The latency spikes were gone, which I then confirmed by checking http://localhost:1234/debug/metrics for the getStateNonce metric (getnonce_ms).


Of almost 35k calls, all completed within 8ms (before this fix, there were often outliers in the 5+ second range).

Concurrent stress test

Same setup as the sequential stress test, except that now I called EthGetTransactionCount using ab (ApacheBench) in order to exercise the method concurrently.

This was the setup I used:

```
cat post_data.json
{"method":"eth_getTransactionCount","params":["0xd4c70007F3F502f212c7e6794b94C06F36173B36", "latest"],"id":1,"jsonrpc":"2.0"}
```

```
ab -p post_data.json -T application/json -c 10 -n 1000000 http://localhost:1234/rpc/v1
This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests


Server Software:
Server Hostname:        localhost
Server Port:            1234

Document Path:          /rpc/v1
Document Length:        41 bytes

Concurrency Level:      10
Time taken for tests:   41.408 seconds
Complete requests:      1000000
Failed requests:        0
Total transferred:      158000000 bytes
Total body sent:        268000000
HTML transferred:       41000000 bytes
Requests per second:    24149.82 [#/sec] (mean)
Time per request:       0.414 [ms] (mean)
Time per request:       0.041 [ms] (mean, across all concurrent requests)
Transfer rate:          3726.24 [Kbytes/sec] received
                        6320.46 kb/s sent
                        10046.70 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    0   0.9      0      74
Waiting:        0    0   0.9      0      74
Total:          0    0   0.9      0      74

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      1
  95%      1
  98%      1
  99%      1
 100%     74 (longest request)
```

During this benchmark there was a single request taking about 100ms, but otherwise the results look good.

@Stebalien (Member) commented:

I can confirm that this fixes the performance issue (or, at least, we don't get 10s pauses).

@fridrik01 fridrik01 force-pushed the 10538-opt-ethgettransactioncount branch 2 times, most recently from d63ddec to 61cbb2c Compare April 20, 2023 13:21
@fridrik01 fridrik01 marked this pull request as ready for review April 20, 2023 13:22
@fridrik01 fridrik01 requested a review from a team as a code owner April 20, 2023 13:22
Review threads on chain/messagepool/messagepool.go (now outdated/resolved):
```diff
@@ -371,7 +371,7 @@ func (ms *msgSet) toSlice() []*types.SignedMessage {
 func New(ctx context.Context, api Provider, ds dtypes.MetadataDS, us stmgr.UpgradeSchedule, netName dtypes.NetworkName, j journal.Journal) (*MessagePool, error) {
 	cache, _ := lru.New2Q[cid.Cid, crypto.Signature](build.BlsSignatureCacheSize)
 	verifcache, _ := lru.New2Q[string, struct{}](build.VerifSigCacheSize)
-	noncecache, _ := lru.New[nonceCacheKey, uint64](256)
+	noncecache, _ := lru.New[nonceCacheKey, uint64](32768) // 32k * ~200 bytes = 6MB
```
A reviewer (Member) commented on this change:

Does it make sense to impose this memory tax on every node, when only hosted RPC nodes would truly benefit, as they are public deployments? Besides that, I'm not convinced this increase is useful, since the cache will be practically stale every 30 seconds. (The most popular usage pattern is getting the nonce of the pending tipset.)
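To make that staleness point concrete, here is a small illustrative sketch; the exact key shape in lotus may differ, but the nonce cache is keyed per tipset, so each ~30 second epoch effectively starts the cache cold for the new head:

```go
package mpoolsketch

import (
	"github.com/filecoin-project/go-address"
	"github.com/filecoin-project/lotus/chain/types"
)

// Assumed key shape: cache entries are scoped to one tipset. When a new
// epoch (~30s on Filecoin) produces a new head, nonce queries against the
// pending tipset form new keys, so earlier entries stop being hit no
// matter how large the cache is.
type nonceCacheKey struct {
	tsk  types.TipSetKey
	addr address.Address
}
```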

```diff
@@ -58,6 +60,23 @@ func (mpp *mpoolProvider) IsLite() bool {
 	return mpp.lite != nil
 }
 
+func (mpp *mpoolProvider) getActorLite(addr address.Address, ts *types.TipSet) (*types.Actor, error) {
```
A reviewer (Contributor) commented:

It's a bit easier for me to follow if refactoring like this (this only extracts the function without any functional changes, right?) is done in a separate commit.

@maciejwitowski maciejwitowski self-requested a review April 21, 2023 11:15
@maciejwitowski (Contributor) left a comment:

One more hotspot down! 💪

We have observed that EthGetTransactionCount is one of the hotspots
on Glif production nodes, and we are seeing regular 10-20 second
latencies when calling this RPC method.

I tracked the high latency spikes and found they were correlated
with runs of ExecuteTipSet while following the chain.

To address this, we should not rely on tipset computation to get
the nonce and instead look at the parent tipset and then count the
messages sent from 'addr'.
Bumped the nonce cache from 256 to 32k entries, which should be
about 6MB of cached entries given an average nonceCacheKey of 200 bytes.
@fridrik01 fridrik01 force-pushed the 10538-opt-ethgettransactioncount branch from 61cbb2c to 553da39 Compare April 21, 2023 11:56
@Stebalien (Member) left a comment:

I agree with https://github.com/filecoin-project/lotus/pull/10700/files#r1173643973, but it's not a high priority.

We can definitely optimize this more as discussed (e.g., iterating over messages once per tipset and caching the results) but this fixes the main performance issue.
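For illustration, a minimal sketch of that suggested follow-up (not implemented in this PR; the type and method names are assumptions): iterate over a tipset's messages once, cache a per-sender count, and answer subsequent nonce queries with a map lookup.

```go
package mpoolsketch

import (
	"sync"

	"github.com/filecoin-project/go-address"
	"github.com/filecoin-project/lotus/chain/types"
)

// tipsetSenderCounts caches per-sender message counts for the most recent
// tipset, so messages are scanned once per tipset rather than once per
// nonce query.
type tipsetSenderCounts struct {
	lk     sync.Mutex
	tsk    types.TipSetKey
	counts map[address.Address]uint64
}

func (c *tipsetSenderCounts) countFor(ts *types.TipSet, msgs []*types.Message, addr address.Address) uint64 {
	c.lk.Lock()
	defer c.lk.Unlock()
	if c.tsk != ts.Key() {
		// New head: scan its messages once and record a count per sender.
		c.counts = make(map[address.Address]uint64, len(msgs))
		for _, m := range msgs {
			c.counts[m.From]++
		}
		c.tsk = ts.Key()
	}
	return c.counts[addr]
}
```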

@fridrik01 fridrik01 merged commit 9d5d6a5 into master Apr 21, 2023
@fridrik01 fridrik01 deleted the 10538-opt-ethgettransactioncount branch April 21, 2023 20:13
@fridrik01 (Contributor, Author) commented:

@raulk Updated the test plan with results from the concurrent stress test; overall it looks good!

Successfully merging this pull request may close: Eth RPC: Address performance of EthGetTransactionCount (#10538)