
server/eth: Monitor RPC provider health #2125

Merged: 6 commits into decred:master on Feb 17, 2023

Conversation

@martonp (Contributor) commented Feb 11, 2023:

Previously, the ETH backend would not start if any of the providers failed to connect or was outdated. Now the backend starts as long as at least one provider connects and reports a recent header. After connecting, a goroutine periodically checks the health of the RPC providers and sorts the list, putting non-outdated providers first, outdated ones after, and those that fail to respond last. Requests then try the RPC providers in that order.

Closes #2122
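
For illustration, a minimal sketch of the ordering just described; the ethConn stub and the check callback are simplifications for this example, not the PR's exact code:

	// Sketch only: ethConn is reduced to a stub, and the status values
	// mirror the healthy/outdated/failing buckets described above.
	type ethConn struct {
		endpoint string
	}

	type connStatus int

	const (
		healthy connStatus = iota
		outdated
		failing
	)

	// sortByHealth buckets connections and returns them healthy first,
	// outdated next, and failing last, the order requests will try them in.
	func sortByHealth(conns []*ethConn, check func(*ethConn) connStatus) []*ethConn {
		var healthyConns, outdatedConns, failingConns []*ethConn
		for _, ec := range conns {
			switch check(ec) {
			case healthy:
				healthyConns = append(healthyConns, ec)
			case outdated:
				outdatedConns = append(outdatedConns, ec)
			default:
				failingConns = append(failingConns, ec)
			}
		}
		sorted := append(healthyConns, outdatedConns...)
		return append(sorted, failingConns...)
	}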

@norwnd (Contributor) left a comment:

If you don't mind me reviewing this.

@JoeGruffins (Member) left a comment:

A small, loosely related thing: the simnet harness continually appends to the file at $ETH_CONFIG_PATH in the dcrdex harness. It would be good if it just wrote a new file every time.

>> to >, I believe: cat > $ETH_CONFIG_PATH <<EOF

cat << EOF >> $ETH_CONFIG_PATH
ws://localhost:38557
# comments are respected
# http://localhost:38556
${ETH_IPC_FILE}
EOF

continue
}

healthyConnections = append(healthyConnections, ec)
Member:

The client makes an effort to randomize providers, I presume to keep down requests per provider. Do you think the server should also worry about this? Or consider an operator with a preferred provider, for example an ETH node they set up themselves, who just wants a fallback for emergencies. Doesn't have to be done in this PR, just for conversation.

@martonp (Contributor, Author) replied:

As a client I'd probably sign up for the free tier of multiple providers and want them randomized, but as a server operator, like you said, I'd probably want to pay for a provider I trust, have it generally serve everything, and keep a backup. I think it would be useful if the providers file allowed users to specify a priority along with each provider.
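
A hypothetical providers file with per-endpoint priority might look like this; purely illustrative, since the PR's actual format is just one endpoint per line:

	# lower priority value = tried first; ties could be randomized
	https://trusted-node.example.com priority=0
	ws://backup-provider.example.com priority=1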

@JoeGruffins (Member) commented Feb 13, 2023:

Maybe we could check that endpoints are not the same on initial startup? The harness copies the same address over and over, so I noticed 9 connections to the same node. It could be reduced to just two.

2023-02-13 12:33:39.888 [DBG] ASSET[eth]: Parsed 9 endpoints from the ETH config file
2023-02-13 12:33:39.888 [DBG] ASSET[eth][RPC]: API endpoints supported by /home/joe/dextest/eth/alpha/node/geth.ipc: admin:1.0 clique:1.0 debug:1.0 engine:1.0 eth:1.0 les:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.892 [DBG] ASSET[eth][RPC]: API endpoints supported by ws://localhost:38557: admin:1.0 eth:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.894 [DBG] ASSET[eth][RPC]: API endpoints supported by /home/joe/dextest/eth/alpha/node/geth.ipc: admin:1.0 clique:1.0 debug:1.0 engine:1.0 eth:1.0 les:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.896 [DBG] ASSET[eth][RPC]: API endpoints supported by ws://localhost:38557: admin:1.0 eth:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.897 [DBG] ASSET[eth][RPC]: API endpoints supported by /home/joe/dextest/eth/alpha/node/geth.ipc: admin:1.0 clique:1.0 debug:1.0 engine:1.0 eth:1.0 les:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.899 [DBG] ASSET[eth][RPC]: API endpoints supported by ws://localhost:38557: admin:1.0 eth:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.899 [DBG] ASSET[eth][RPC]: API endpoints supported by /home/joe/dextest/eth/alpha/node/geth.ipc: admin:1.0 clique:1.0 debug:1.0 engine:1.0 eth:1.0 les:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.901 [DBG] ASSET[eth][RPC]: API endpoints supported by ws://localhost:38557: admin:1.0 eth:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.902 [DBG] ASSET[eth][RPC]: API endpoints supported by /home/joe/dextest/eth/alpha/node/geth.ipc: admin:1.0 clique:1.0 debug:1.0 engine:1.0 eth:1.0 les:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.903 [INF] ASSET[eth][RPC]: number of connected ETH providers: 9
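
The eventual fix lands in the config parser (see the scanner diff further down). A self-contained sketch of de-duplicating while parsing, with a hypothetical parseEndpoints name:

	import (
		"bufio"
		"io"
		"strings"
	)

	// parseEndpoints reads one endpoint per line, skipping blank lines,
	// comments, and any endpoint already seen.
	func parseEndpoints(r io.Reader) ([]string, error) {
		seen := make(map[string]bool)
		var endpoints []string
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			line := strings.TrimSpace(scanner.Text())
			if line == "" || strings.HasPrefix(line, "#") || seen[line] {
				continue
			}
			seen[line] = true
			endpoints = append(endpoints, line)
		}
		return endpoints, scanner.Err()
	}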

@chappjc added this to the 0.6 milestone Feb 13, 2023
@chappjc self-requested a review February 14, 2023 13:26
@JoeGruffins (Member) left a comment:

Working well.

case <-ctx.Done():
return
case <-ticker.C:
c.sortConnectionsByHealth(ctx)
Member:

Log a warning if false?
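
For context, the loop with the suggested warning might look like this; healthCheckInterval and the log message are assumptions, not the PR's exact code:

	func (c *rpcclient) monitorConnectionsHealth(ctx context.Context) {
		ticker := time.NewTicker(healthCheckInterval) // interval name is an assumption
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// sortConnectionsByHealth returns true if at least one
				// connection is healthy; warn when none are.
				if !c.sortConnectionsByHealth(ctx) {
					c.log.Warnf("no healthy RPC providers after health check")
				}
			}
		}
	}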

Comment on lines -86 to +264
c.log.Errorf("Unpropagated error from %q: %v", c.endpoints[idx], err)
// Try the next client.
c.idxMtx.Lock()
// Only advance it if another thread hasn't.
if c.endpointIdx == idx && len(c.endpoints) > 0 {
c.endpointIdx = (c.endpointIdx + 1) % len(c.endpoints)
c.log.Infof("Switching RPC endpoint to %q", c.endpoints[c.endpointIdx])
}
c.idxMtx.Unlock()

c.log.Errorf("Unpropagated error from %q: %v", ec.endpoint, err)
Member:

Not "failing" the client in any way means we'll continue to try this client until the next sortConnectionsByHealth catches it.

@martonp (Contributor, Author) replied:

True, I'll move the provider to the end if there's an error.
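
A sketch of that demotion, assuming a clientsMtx guarding the clients slice (field names are assumptions):

	// moveToEnd demotes a failing connection so healthier providers are
	// tried first on subsequent requests.
	func (c *rpcclient) moveToEnd(failed *ethConn) {
		c.clientsMtx.Lock()
		defer c.clientsMtx.Unlock()
		for i, ec := range c.clients {
			if ec == failed {
				// Rebuild the slice without the failed entry, then append it.
				c.clients = append(append(c.clients[:i:i], c.clients[i+1:]...), ec)
				return
			}
		}
	}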

Comment on lines 353 to 356
if err != nil {
return err
}
return err
return nil
Member:

You can just return err.

scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := strings.Trim(scanner.Text(), " ")
if line == "" || strings.HasPrefix(line, "#") {
if line == "" || strings.HasPrefix(line, "#") || endpointsMap[line] {
Member:

In testing I commented one out with a ';', like a lot of .conf files allow. Would you please add that HasPrefix check?
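
With the ';' prefix handled, the condition might read:

	if line == "" || strings.HasPrefix(line, "#") ||
		strings.HasPrefix(line, ";") || endpointsMap[line] {
		continue
	}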

cl.tokens[assetID] = tkn
}
return nil
}

func (c *rpcclient) withTokener(assetID uint32, f func(*tokener) error) error {
func (c *rpcclient) withTokener(ctx context.Context, assetID uint32, f func(*tokener) error) error {
Member:

Looks like ctx is unused; swap gives it to the closure directly.

Comment on lines 202 to 214
getEndpoints := func(clients []*ethConn) []string {
endpoints := make([]string, 0, len(clients))
for _, c := range clients {
endpoints = append(endpoints, c.endpoint)
}
return endpoints
}

ethClient.idxMtx.RLock()
idx = ethClient.endpointIdx
ethClient.idxMtx.RUnlock()
if idx == 0 {
t.Fatalf("endpoint index not advanced")
fmt.Println("Original clients:", getEndpoints(originalClients))
fmt.Println("Updated clients:", getEndpoints(updatedClients))

if originalClients[0].endpoint != updatedClients[len(updatedClients)-1].endpoint {
t.Fatalf("failing client was not moved to the end. got %s, expected %s", updatedClients[len(updatedClients)-1].endpoint, originalClients[0].endpoint)
Member:

I think if you make ethConn into a Stringer then it'll print a slice of pointers to them as intended. Also, would be good to print the failing and outdated slices with higher severity.

	c.log.Debugf("healthy connections: %v", healthyConnections)
	if len(outdatedConnections) > 0 {
		c.log.Warnf("outdated connections: %v", outdatedConnections)
	}
	if len(failingConnections) > 0 {
		c.log.Warnf("failing connections: %v", failingConnections)
	}

with

func (ec *ethConn) String() string {
	return ec.endpoint
}

Comment on lines 161 to 165
if c.headerIsOutdated(hdr) {
c.log.Warnf("header fetched from %q appears to be outdated (time %s). If you continue to see this message, you might need to check your system clock",
conn.endpoint, time.Unix(int64(hdr.Time), 0))
return outdated
}
Member:

Minor tweak to show age could be nice:

	if c.headerIsOutdated(hdr) {
		hdrTime := time.Unix(int64(hdr.Time), 0)
		c.log.Warnf("header fetched from %q appears to be outdated (time %s is %v old). "+
			"If you continue to see this message, you might need to check your system clock",
			conn.endpoint, hdrTime, time.Since(hdrTime))
		return outdated
	}

Comment on lines 74 to 265
c.idxMtx.RLock()
idx := c.endpointIdx
ec := c.clients[idx]
c.idxMtx.RUnlock()
clients := c.clientsCopy()

for _, ec := range clients {
Member:

nbd but for _, ec := range c.clientsCopy() { would be fine

return len(healthyConnections) > 0
}

// monitorConnectionsHealth starts a goroutine that checks the health of all connections
Member:

Can re-wrap.

}
ec.swapContract = &swapSourceV0{es}
ec.caller = client
go c.monitorConnectionsHealth(ctx)
@chappjc (Member) commented Feb 15, 2023:

I think it's fine to have this unsupervised given the caller's pattern of connect's context cancellation followed by calling the shutdown() method, but it's a little confusing. Does it seem like anything could go wrong?

Fundamentally, I don't quite get why there's both a context for connect and also a shutdown method. It's as if the context on connect was intended to apply only to the initial connection action (e.g. a timeout), but this use makes it apply to longer-running processes, which is compatible with how (*ETHBackend).Connect works...

Does this all seem fine, or are things a bit mixed up?

@martonp (Contributor, Author) replied:

I got rid of the shutdown method. I guess the reason for it was to make sure the ETH backend was shut down before closing the client connections, but that can be handled with a new context as well.
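
A sketch of the context-only pattern, where connect derives a cancelable context that supervises the monitor goroutine (the cancel field and function body are illustrative assumptions):

	func (c *rpcclient) connect(ctx context.Context) error {
		// Derive a context the client owns; cancelling it (or the parent)
		// stops the health monitor and any other background work.
		ctx, c.cancel = context.WithCancel(ctx)
		// ... dial providers, set up contracts ...
		go c.monitorConnectionsHealth(ctx)
		return nil
	}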

// never been successfully connected will be checked. True is returned if
// there is at least one healthy connection.
func (c *rpcclient) sortConnectionsByHealth(ctx context.Context) bool {
c.log.Tracef("sorting connections by health counter = %d", c.healthCheckCounter)
Member:

Let's remove this trace now that we're getting closer to a final iteration.

}

if c.healthCheckCounter == 0 && len(c.neverConnectedEndpoints) > 0 {
c.log.Tracef("number of never connected endpoints: %d", len(c.neverConnectedEndpoints))
@chappjc (Member) commented Feb 16, 2023:

This can be an Infof or even a Warnf. Actually, it can be removed since it's redundant with the two outcome logs below, but the "successfully connected" log can be an Info because it's nice to know something recovered.
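
Following that suggestion, the outcome logs might become something like this; connectToEndpoint is a hypothetical name for the PR's connection attempt:

	for _, endpoint := range c.neverConnectedEndpoints {
		if err := c.connectToEndpoint(ctx, endpoint); err != nil {
			c.log.Warnf("failed to connect to %q: %v", endpoint, err)
			continue
		}
		// Info level: it's nice to know something recovered.
		c.log.Infof("successfully connected to %q", endpoint)
	}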

@chappjc merged commit 6bdd2f1 into decred:master Feb 17, 2023
@chappjc added the ETH label Feb 20, 2023
Successfully merging this pull request may close these issues:

server/eth: server needs to support startup with failing providers in the list