Vald panics and crashes at startup if one of its configured endpoints does not connect and return an expected response #1015

Open
Pete-LunaNova opened this issue Nov 26, 2021 · 0 comments

Description/Reasoning

Whilst it is critical that validators maintain their RPC endpoints so that they can participate in signing for all the chains they are signalling on, an endpoint can fail due to events outside their control. For example, a chain halt may cause RPC nodes to stop responding or to respond incorrectly. Once the network grows to the point where validators maintain dozens of chains, it would be unfortunate if a failure on a single RPC endpoint prevented a validator from restarting vald without disrupting every chain they support.

Current Behaviour

Vald panics and crashes at startup if any one of its configured endpoints fails to connect or does not return the expected response.

Expected Behaviour

Vald should handle an issue with one or more endpoints gracefully and produce clear log output highlighting which RPC endpoints have problems (repeated regularly until the error is resolved). This would allow a validator to keep participating in keygens on its other chains. If the fault was due to an external event such as a chain halt, the validator would also be able to participate again as soon as RPC functionality was restored. The validator would not have to alter configs on the fly whenever there is an issue with an external chain, and would not suffer any delay in resuming service once the chain is active again.
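To illustrate the requested behaviour, here is a minimal sketch in Go. It is not vald's actual code: the `Endpoint` type, the `Check` probe, `checkAll`, `monitor`, and the chain name `example-chain` are all hypothetical names chosen for this example. The idea is only that a failing endpoint is recorded and logged on every interval instead of triggering a panic.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// Endpoint is a hypothetical description of one configured chain RPC endpoint.
type Endpoint struct {
	Chain string
	// Check stands in for whatever connectivity probe vald performs (assumption).
	Check func() error
}

// checkAll probes every endpoint and returns the failures keyed by chain.
// A failing endpoint never aborts the loop, so the other chains stay usable.
func checkAll(eps []Endpoint) map[string]error {
	failed := make(map[string]error)
	for _, ep := range eps {
		if err := ep.Check(); err != nil {
			failed[ep.Chain] = err
		}
	}
	return failed
}

// monitor checks the endpoints immediately and then on every tick,
// logging each unhealthy one until ctx is cancelled.
func monitor(ctx context.Context, eps []Endpoint, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		for chain, err := range checkAll(eps) {
			log.Printf("rpc endpoint for chain %q is unhealthy: %v", chain, err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// demoEndpoints returns one healthy and one failing endpoint for illustration.
func demoEndpoints() []Endpoint {
	return []Endpoint{
		{Chain: "ethereum", Check: func() error { return nil }},
		{Chain: "example-chain", Check: func() error { return errors.New("connection refused") }},
	}
}

func main() {
	eps := demoEndpoints()
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	monitor(ctx, eps, 100*time.Millisecond)
	fmt.Println("healthy chains:", len(eps)-len(checkAll(eps)))
}
```

Because each pass re-runs the probe, an endpoint that recovers (for instance after a chain halt ends) simply drops out of the failure map on the next tick, with no config change or restart.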

It would be useful to augment this with a Prometheus metric that reports the status of all configured endpoints, e.g.:
axelar_vald_external_chain_status{chain="ethereum"} 1
with one label value per configured chain, returning 0 if there is an issue and 1 if the connection is healthy.
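As a rough sketch of what that metric output could look like, the following Go snippet renders a health map in the Prometheus text exposition format using only the standard library. The function name `renderChainStatus` and the chain `example-chain` are assumptions for illustration; a real implementation would more likely register a gauge through a Prometheus client library than format the text by hand.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderChainStatus formats endpoint health as Prometheus text exposition lines:
// 1 for a healthy connection, 0 if there is an issue.
func renderChainStatus(healthy map[string]bool) string {
	chains := make([]string, 0, len(healthy))
	for chain := range healthy {
		chains = append(chains, chain)
	}
	sort.Strings(chains) // stable output regardless of map iteration order

	var b strings.Builder
	b.WriteString("# TYPE axelar_vald_external_chain_status gauge\n")
	for _, chain := range chains {
		value := 0
		if healthy[chain] {
			value = 1
		}
		// %q quotes the label value; sufficient for plain chain names.
		fmt.Fprintf(&b, "axelar_vald_external_chain_status{chain=%q} %d\n", chain, value)
	}
	return b.String()
}

func main() {
	fmt.Print(renderChainStatus(map[string]bool{
		"ethereum":      true,
		"example-chain": false,
	}))
}
```

A gauge of this shape lets an operator alert on `axelar_vald_external_chain_status == 0` per chain rather than discovering the problem from a crashed process.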

Steps to reproduce (for bugs)

Relevant Logs or Files
