
server/eth: Monitor RPC provider health #2125

Merged: 6 commits into decred:master on Feb 17, 2023

Conversation

@martonp (Contributor) commented Feb 11, 2023:

Previously, the ETH backend would not start if any of the providers failed to connect or was outdated. Now the backend starts as long as at least one provider connects and reports a recent header. After connecting, a goroutine periodically checks the health of the RPC providers and sorts the list, putting non-outdated providers first, outdated ones after, and those that fail to respond last. Requests then try the RPC providers in that order.

Closes #2122
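
For illustration, a minimal sketch of the ordering just described; the ethConn stub and the check callback are simplifications for this example, not the PR's exact code:

	// Sketch only: ethConn is reduced to a stub, and the status values
	// mirror the healthy/outdated/failing buckets described above.
	type ethConn struct {
		endpoint string
	}

	type connStatus int

	const (
		healthy connStatus = iota
		outdated
		failing
	)

	// sortByHealth buckets connections and returns them healthy first,
	// outdated next, and failing last, the order requests will try them in.
	func sortByHealth(conns []*ethConn, check func(*ethConn) connStatus) []*ethConn {
		var healthyConns, outdatedConns, failingConns []*ethConn
		for _, ec := range conns {
			switch check(ec) {
			case healthy:
				healthyConns = append(healthyConns, ec)
			case outdated:
				outdatedConns = append(outdatedConns, ec)
			default:
				failingConns = append(failingConns, ec)
			}
		}
		sorted := append(healthyConns, outdatedConns...)
		return append(sorted, failingConns...)
	}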

@norwnd (Contributor) left a comment:

If you don't mind me reviewing this.

@JoeGruffins (Member) left a comment:

A small, loosely related thing: the simnet harness continually appends to the file at $ETH_CONFIG_PATH in the dcrdex harness. It would be good if it just wrote a new file every time.

>> to >, I believe: cat > $ETH_CONFIG_PATH <<EOF

cat << EOF >> $ETH_CONFIG_PATH
ws://localhost:38557
# comments are respected
# http://localhost:38556
${ETH_IPC_FILE}
EOF

continue
}

healthyConnections = append(healthyConnections, ec)
Member:

The client makes an effort to randomize providers, I presume to keep down requests per provider. Do you think the server should also worry about this? Or consider an operator with a preferred provider, for example an ETH node they set up themselves, who just wants a fallback for emergencies. Doesn't have to be done in this PR, just for conversation.

@martonp (Contributor, Author) replied:

As a client I'd probably sign up for the free tier of multiple providers and want them randomized, but as a server operator, like you said, I'd probably want to pay for a provider I trust, have it generally serve everything, and keep a backup. I think it would be useful if the providers file allowed users to specify a priority along with each provider.
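
A hypothetical providers file with per-endpoint priority might look like this; purely illustrative, since the PR's actual format is just one endpoint per line:

	# lower priority value = tried first; ties could be randomized
	https://trusted-node.example.com priority=0
	ws://backup-provider.example.com priority=1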

@JoeGruffins (Member) commented Feb 13, 2023:

Maybe we could check that endpoints are not the same on initial startup? The harness copies the same address over and over, so I noticed 9 connections to the same node. It could be reduced to just two.

2023-02-13 12:33:39.888 [DBG] ASSET[eth]: Parsed 9 endpoints from the ETH config file
2023-02-13 12:33:39.888 [DBG] ASSET[eth][RPC]: API endpoints supported by /home/joe/dextest/eth/alpha/node/geth.ipc: admin:1.0 clique:1.0 debug:1.0 engine:1.0 eth:1.0 les:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.892 [DBG] ASSET[eth][RPC]: API endpoints supported by ws://localhost:38557: admin:1.0 eth:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.894 [DBG] ASSET[eth][RPC]: API endpoints supported by /home/joe/dextest/eth/alpha/node/geth.ipc: admin:1.0 clique:1.0 debug:1.0 engine:1.0 eth:1.0 les:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.896 [DBG] ASSET[eth][RPC]: API endpoints supported by ws://localhost:38557: admin:1.0 eth:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.897 [DBG] ASSET[eth][RPC]: API endpoints supported by /home/joe/dextest/eth/alpha/node/geth.ipc: admin:1.0 clique:1.0 debug:1.0 engine:1.0 eth:1.0 les:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.899 [DBG] ASSET[eth][RPC]: API endpoints supported by ws://localhost:38557: admin:1.0 eth:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.899 [DBG] ASSET[eth][RPC]: API endpoints supported by /home/joe/dextest/eth/alpha/node/geth.ipc: admin:1.0 clique:1.0 debug:1.0 engine:1.0 eth:1.0 les:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.901 [DBG] ASSET[eth][RPC]: API endpoints supported by ws://localhost:38557: admin:1.0 eth:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.902 [DBG] ASSET[eth][RPC]: API endpoints supported by /home/joe/dextest/eth/alpha/node/geth.ipc: admin:1.0 clique:1.0 debug:1.0 engine:1.0 eth:1.0 les:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
2023-02-13 12:33:39.903 [INF] ASSET[eth][RPC]: number of connected ETH providers: 9
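
The eventual fix lands in the config parser (see the scanner diff further down). A self-contained sketch of de-duplicating while parsing, with a hypothetical parseEndpoints name:

	import (
		"bufio"
		"io"
		"strings"
	)

	// parseEndpoints reads one endpoint per line, skipping blank lines,
	// comments, and any endpoint already seen.
	func parseEndpoints(r io.Reader) ([]string, error) {
		seen := make(map[string]bool)
		var endpoints []string
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			line := strings.TrimSpace(scanner.Text())
			if line == "" || strings.HasPrefix(line, "#") || seen[line] {
				continue
			}
			seen[line] = true
			endpoints = append(endpoints, line)
		}
		return endpoints, scanner.Err()
	}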

@chappjc added this to the 0.6 milestone Feb 13, 2023
@chappjc self-requested a review February 14, 2023 13:26
@JoeGruffins (Member) left a comment:

Working well.

case <-ctx.Done():
return
case <-ticker.C:
c.sortConnectionsByHealth(ctx)
Member:

Log a warning if false?
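
For context, the loop with the suggested warning might look like this; healthCheckInterval and the log message are assumptions, not the PR's exact code:

	func (c *rpcclient) monitorConnectionsHealth(ctx context.Context) {
		ticker := time.NewTicker(healthCheckInterval) // interval name is an assumption
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// sortConnectionsByHealth returns true if at least one
				// connection is healthy; warn when none are.
				if !c.sortConnectionsByHealth(ctx) {
					c.log.Warnf("no healthy RPC providers after health check")
				}
			}
		}
	}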

Comment on lines -86 to +264
c.log.Errorf("Unpropagated error from %q: %v", c.endpoints[idx], err)
// Try the next client.
c.idxMtx.Lock()
// Only advance it if another thread hasn't.
if c.endpointIdx == idx && len(c.endpoints) > 0 {
c.endpointIdx = (c.endpointIdx + 1) % len(c.endpoints)
c.log.Infof("Switching RPC endpoint to %q", c.endpoints[c.endpointIdx])
}
c.idxMtx.Unlock()

c.log.Errorf("Unpropagated error from %q: %v", ec.endpoint, err)
Member:

Not "failing" the client in any way means we'll continue to try this client until the next sortConnectionsByHealth catches it.

@martonp (Contributor, Author) replied:

True, I'll move the provider to the end if there's an error.
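
A sketch of that demotion, assuming a clientsMtx guarding the clients slice (field names are assumptions):

	// moveToEnd demotes a failing connection so healthier providers are
	// tried first on subsequent requests.
	func (c *rpcclient) moveToEnd(failed *ethConn) {
		c.clientsMtx.Lock()
		defer c.clientsMtx.Unlock()
		for i, ec := range c.clients {
			if ec == failed {
				// Rebuild the slice without the failed entry, then append it.
				c.clients = append(append(c.clients[:i:i], c.clients[i+1:]...), ec)
				return
			}
		}
	}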

Comment on lines 353 to 356
if err != nil {
return err
}
return err
return nil
Member:

You can just return err.

scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := strings.Trim(scanner.Text(), " ")
if line == "" || strings.HasPrefix(line, "#") {
if line == "" || strings.HasPrefix(line, "#") || endpointsMap[line] {
Member:

In testing I commented one out with a ';', like a lot of .conf files allow. Would you please add that HasPrefix check?
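
With the ';' prefix handled, the condition might read:

	if line == "" || strings.HasPrefix(line, "#") ||
		strings.HasPrefix(line, ";") || endpointsMap[line] {
		continue
	}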

cl.tokens[assetID] = tkn
}
return nil
}

func (c *rpcclient) withTokener(assetID uint32, f func(*tokener) error) error {
func (c *rpcclient) withTokener(ctx context.Context, assetID uint32, f func(*tokener) error) error {
Member:

Looks like ctx is unused; swap gives it to the closure directly.

Comment on lines 202 to 214
getEndpoints := func(clients []*ethConn) []string {
endpoints := make([]string, 0, len(clients))
for _, c := range clients {
endpoints = append(endpoints, c.endpoint)
}
return endpoints
}

ethClient.idxMtx.RLock()
idx = ethClient.endpointIdx
ethClient.idxMtx.RUnlock()
if idx == 0 {
t.Fatalf("endpoint index not advanced")
fmt.Println("Original clients:", getEndpoints(originalClients))
fmt.Println("Updated clients:", getEndpoints(updatedClients))

if originalClients[0].endpoint != updatedClients[len(updatedClients)-1].endpoint {
t.Fatalf("failing client was not moved to the end. got %s, expected %s", updatedClients[len(updatedClients)-1].endpoint, originalClients[0].endpoint)
Member:

I think if you make ethConn into a Stringer then it'll print a slice of pointers to them as intended. Also, would be good to print the failing and outdated slices with higher severity.

	c.log.Debugf("healthy connections: %v", healthyConnections)
	if len(outdatedConnections) > 0 {
		c.log.Warnf("outdated connections: %v", outdatedConnections)
	}
	if len(failingConnections) > 0 {
		c.log.Warnf("failing connections: %v", failingConnections)
	}

with

func (ec *ethConn) String() string {
	return ec.endpoint
}

Comment on lines 161 to 165
if c.headerIsOutdated(hdr) {
c.log.Warnf("header fetched from %q appears to be outdated (time %s). If you continue to see this message, you might need to check your system clock",
conn.endpoint, time.Unix(int64(hdr.Time), 0))
return outdated
}
Member:

Minor tweak to show age could be nice:

	if c.headerIsOutdated(hdr) {
		hdrTime := time.Unix(int64(hdr.Time), 0)
		c.log.Warnf("header fetched from %q appears to be outdated (time %s is %v old). "+
			"If you continue to see this message, you might need to check your system clock",
			conn.endpoint, hdrTime, time.Since(hdrTime))
		return outdated
	}

Comment on lines 74 to 265
c.idxMtx.RLock()
idx := c.endpointIdx
ec := c.clients[idx]
c.idxMtx.RUnlock()
clients := c.clientsCopy()

for _, ec := range clients {
Member:

nbd but for _, ec := range c.clientsCopy() { would be fine

return len(healthyConnections) > 0
}

// monitorConnectionsHealth starts a goroutine that checks the health of all connections
Member:

Can re-wrap.

}
ec.swapContract = &swapSourceV0{es}
ec.caller = client
go c.monitorConnectionsHealth(ctx)
@chappjc (Member) commented Feb 15, 2023:

I think it's fine to have this unsupervised given the caller's pattern of connect's context cancellation followed by calling the shutdown() method, but it's a little confusing. Does it seem like anything could go wrong?

Fundamentally, I don't quite get why there's both a context for connect and also a shutdown method. It's as if the context on connect was intended to apply only to the initial connection action (e.g. a timeout), but this use makes it apply to longer-running processes, which is compatible with how (*ETHBackend).Connect works...

Does this all seem fine, or are things a bit mixed up?

@martonp (Contributor, Author) replied:

I got rid of the shutdown method. I guess the reason for it was to make sure the ETH backend was shut down before closing the client connections, but that can be handled with a new context as well.
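
A sketch of the context-only pattern, where connect derives a cancelable context that supervises the monitor goroutine (the cancel field and function body are illustrative assumptions):

	func (c *rpcclient) connect(ctx context.Context) error {
		// Derive a context the client owns; cancelling it (or the parent)
		// stops the health monitor and any other background work.
		ctx, c.cancel = context.WithCancel(ctx)
		// ... dial providers, set up contracts ...
		go c.monitorConnectionsHealth(ctx)
		return nil
	}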

// never been successfully connected will be checked. True is returned if
// there is at least one healthy connection.
func (c *rpcclient) sortConnectionsByHealth(ctx context.Context) bool {
c.log.Tracef("sorting connections by health counter = %d", c.healthCheckCounter)
Member:

Let's remove this trace now that we're getting closer to a final iteration.

}

if c.healthCheckCounter == 0 && len(c.neverConnectedEndpoints) > 0 {
c.log.Tracef("number of never connected endpoints: %d", len(c.neverConnectedEndpoints))
@chappjc (Member) commented Feb 16, 2023:

This can be an Infof or even a Warnf. Actually, it can be removed since it's redundant with the two outcome logs below, but the "successfully connected" log can be an Info because it's nice to know something recovered.
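
Following that suggestion, the outcome logs might become something like this; connectToEndpoint is a hypothetical name for the PR's connection attempt:

	for _, endpoint := range c.neverConnectedEndpoints {
		if err := c.connectToEndpoint(ctx, endpoint); err != nil {
			c.log.Warnf("failed to connect to %q: %v", endpoint, err)
			continue
		}
		// Info level: it's nice to know something recovered.
		c.log.Infof("successfully connected to %q", endpoint)
	}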

@chappjc merged commit 6bdd2f1 into decred:master Feb 17, 2023
@chappjc added the ETH label Feb 20, 2023
Successfully merging this pull request may close these issues:

server/eth: server needs to support startup with failing providers in the list