
perf: Address performance of EthGetTransactionCount #10700

Merged: 3 commits into master on Apr 21, 2023

Conversation

@fridrik01 (Contributor) commented Apr 19, 2023

Fixes: #10538

Context

We have observed that EthGetTransactionCount is one of the hotspots on Glif production nodes, where we see regular 10-20 second latencies when calling this RPC method. We have also been able to replicate the issue on local mainnet nodes on our devboxes by calling the method repeatedly.

The fix

I tracked the high latency spikes and found they were correlated with runs of ExecuteTipSet while following the chain.

To address this, we should not rely on tipset computation to get the nonce. Instead, as suggested by steb, we look at the parent tipset and then count the messages sent from `addr`.
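As a minimal sketch of that approach (not the exact lotus code; `stateNonceAt` and `messagesIn` are hypothetical stand-ins for the message pool's provider calls):

```go
package mpoolsketch

import (
	"context"

	"github.com/filecoin-project/go-address"
	"github.com/filecoin-project/lotus/chain/types"
)

// stateNonceAt and messagesIn are hypothetical placeholders, defined only
// so the sketch is self-contained; the real lookups go through the message
// pool's Provider.
func stateNonceAt(ctx context.Context, addr address.Address, parents types.TipSetKey) (uint64, error) {
	return 0, nil // would load the actor's nonce from the parent state tree
}

func messagesIn(ctx context.Context, ts *types.TipSet) ([]*types.Message, error) {
	return nil, nil // would load all messages included in ts's blocks
}

// nonceForAddr reads the nonce from the parent tipset's state, which is
// already computed while following the chain, then bumps it once for every
// message in the current tipset sent from addr. This avoids triggering
// ExecuteTipSet on the current tipset just to answer a nonce query.
func nonceForAddr(ctx context.Context, addr address.Address, ts *types.TipSet) (uint64, error) {
	nonce, err := stateNonceAt(ctx, addr, ts.Parents())
	if err != nil {
		return 0, err
	}
	msgs, err := messagesIn(ctx, ts)
	if err != nil {
		return 0, err
	}
	for _, m := range msgs {
		if m.From == addr {
			nonce++
		}
	}
	return nonce, nil
}
```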

Test plan

Sequential stress test

I started and synced my lotus node on mainnet, then called the EthGetTransactionCount RPC method in a loop.
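A rough Go reproduction of that loop, assuming a local node serving the JSON-RPC API on port 1234 (the address and the slow-call threshold are just example values):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	payload := []byte(`{"jsonrpc":"2.0","id":1,"method":"eth_getTransactionCount",` +
		`"params":["0xd4c70007F3F502f212c7e6794b94C06F36173B36","latest"]}`)
	for i := 0; i < 35000; i++ {
		start := time.Now()
		resp, err := http.Post("http://localhost:1234/rpc/v1", "application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		// Drain and close the body so connections are reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		if d := time.Since(start); d > 100*time.Millisecond {
			fmt.Printf("slow call #%d took %s\n", i, d)
		}
	}
}
```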

The latency spikes were gone, which I then confirmed by checking http://localhost:1234/debug/metrics for the getStateNonce metric (getnonce_ms).


Of almost 35k calls, all completed within 8ms (before this fix, there were often outliers in the 5+ second range).

Concurrent stress test

Same setup as the sequential stress test, except that now I called EthGetTransactionCount using ab (ApacheBench) in order to exercise the method concurrently.

This was the setup I used:

```
cat post_data.json
{"method":"eth_getTransactionCount","params":["0xd4c70007F3F502f212c7e6794b94C06F36173B36", "latest"],"id":1,"jsonrpc":"2.0"}
```

```
ab -p post_data.json -T application/json -c 10 -n 1000000 http://localhost:1234/rpc/v1
This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests


Server Software:
Server Hostname:        localhost
Server Port:            1234

Document Path:          /rpc/v1
Document Length:        41 bytes

Concurrency Level:      10
Time taken for tests:   41.408 seconds
Complete requests:      1000000
Failed requests:        0
Total transferred:      158000000 bytes
Total body sent:        268000000
HTML transferred:       41000000 bytes
Requests per second:    24149.82 [#/sec] (mean)
Time per request:       0.414 [ms] (mean)
Time per request:       0.041 [ms] (mean, across all concurrent requests)
Transfer rate:          3726.24 [Kbytes/sec] received
                        6320.46 kb/s sent
                        10046.70 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    0   0.9      0      74
Waiting:        0    0   0.9      0      74
Total:          0    0   0.9      0      74

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      1
  95%      1
  98%      1
  99%      1
 100%     74 (longest request)
```

During this benchmark there was a single request taking about 100ms, but otherwise the results look good.

@Stebalien (Member) commented:

I can confirm that this fixes the performance issue (or, at least, we don't get 10s pauses).

@fridrik01 fridrik01 force-pushed the 10538-opt-ethgettransactioncount branch 2 times, most recently from d63ddec to 61cbb2c Compare April 20, 2023 13:21
@fridrik01 fridrik01 marked this pull request as ready for review April 20, 2023 13:22
@fridrik01 fridrik01 requested a review from a team as a code owner April 20, 2023 13:22
Review threads on chain/messagepool/messagepool.go (now outdated/resolved):
```diff
@@ -371,7 +371,7 @@ func (ms *msgSet) toSlice() []*types.SignedMessage {
 func New(ctx context.Context, api Provider, ds dtypes.MetadataDS, us stmgr.UpgradeSchedule, netName dtypes.NetworkName, j journal.Journal) (*MessagePool, error) {
 	cache, _ := lru.New2Q[cid.Cid, crypto.Signature](build.BlsSignatureCacheSize)
 	verifcache, _ := lru.New2Q[string, struct{}](build.VerifSigCacheSize)
-	noncecache, _ := lru.New[nonceCacheKey, uint64](256)
+	noncecache, _ := lru.New[nonceCacheKey, uint64](32768) // 32k * ~200 bytes = 6MB
```
A reviewer (Member) commented on this change:

Does it make sense to impose this memory tax on every node, when only hosted RPC nodes would truly benefit, as they are public deployments? Besides that, I'm not convinced this increase is useful, since the cache will be practically stale every 30 seconds. (The most popular usage pattern is getting the nonce of the pending tipset.)
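To make that staleness point concrete, here is a small illustrative sketch; the exact key shape in lotus may differ, but the nonce cache is keyed per tipset, so each ~30 second epoch effectively starts the cache cold for the new head:

```go
package mpoolsketch

import (
	"github.com/filecoin-project/go-address"
	"github.com/filecoin-project/lotus/chain/types"
)

// Assumed key shape: cache entries are scoped to one tipset. When a new
// epoch (~30s on Filecoin) produces a new head, nonce queries against the
// pending tipset form new keys, so earlier entries stop being hit no
// matter how large the cache is.
type nonceCacheKey struct {
	tsk  types.TipSetKey
	addr address.Address
}
```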

```diff
@@ -58,6 +60,23 @@ func (mpp *mpoolProvider) IsLite() bool {
 	return mpp.lite != nil
 }
 
+func (mpp *mpoolProvider) getActorLite(addr address.Address, ts *types.TipSet) (*types.Actor, error) {
```
A reviewer (Contributor) commented:

It's a bit easier for me to follow if refactoring like this (this only extracts the function without any functional changes, right?) is done in a separate commit.

@maciejwitowski maciejwitowski self-requested a review April 21, 2023 11:15
@maciejwitowski (Contributor) left a comment:

One more hotspot down! 💪

We have observed that EthGetTransactionCount is one of the hotspots
on Glif production nodes, and we are seeing regular 10-20 second
latencies when calling this RPC method.

I tracked the high latency spikes and found they were correlated
with runs of ExecuteTipSet while following the chain.

To address this, we should not rely on tipset computation to get
the nonce and instead look at the parent tipset and then count the
messages sent from 'addr'.
Bumped the nonce cache from 256 to 32k entries, which should be
about 6MB of cached entries given an average nonceCacheKey of 200 bytes.
@fridrik01 fridrik01 force-pushed the 10538-opt-ethgettransactioncount branch from 61cbb2c to 553da39 Compare April 21, 2023 11:56
@Stebalien (Member) left a comment:

I agree with https://github.com/filecoin-project/lotus/pull/10700/files#r1173643973, but it's not a high priority.

We can definitely optimize this more as discussed (e.g., iterating over messages once per tipset and caching the results) but this fixes the main performance issue.
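For illustration, a minimal sketch of that suggested follow-up (not implemented in this PR; the type and method names are assumptions): iterate over a tipset's messages once, cache a per-sender count, and answer subsequent nonce queries with a map lookup.

```go
package mpoolsketch

import (
	"sync"

	"github.com/filecoin-project/go-address"
	"github.com/filecoin-project/lotus/chain/types"
)

// tipsetSenderCounts caches per-sender message counts for the most recent
// tipset, so messages are scanned once per tipset rather than once per
// nonce query.
type tipsetSenderCounts struct {
	lk     sync.Mutex
	tsk    types.TipSetKey
	counts map[address.Address]uint64
}

func (c *tipsetSenderCounts) countFor(ts *types.TipSet, msgs []*types.Message, addr address.Address) uint64 {
	c.lk.Lock()
	defer c.lk.Unlock()
	if c.tsk != ts.Key() {
		// New head: scan its messages once and record a count per sender.
		c.counts = make(map[address.Address]uint64, len(msgs))
		for _, m := range msgs {
			c.counts[m.From]++
		}
		c.tsk = ts.Key()
	}
	return c.counts[addr]
}
```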

@fridrik01 fridrik01 merged commit 9d5d6a5 into master Apr 21, 2023
@fridrik01 fridrik01 deleted the 10538-opt-ethgettransactioncount branch April 21, 2023 20:13
@fridrik01 (Contributor, Author) commented:

@raulk Updated the test plan with results from the concurrent stress test; overall it looks good!

Successfully merging this pull request may close: Eth RPC: Address performance of EthGetTransactionCount (#10538)