
RLY fails if one node is unavailable #1268

Open
dylanschultzie opened this issue Aug 28, 2023 · 5 comments

Comments

@dylanschultzie

When a node is unreachable, the entire rly process restarts even though there are other channels being covered. It seems like that channel should be passed on, rather than the entire service being ended.

2023-08-28T03:11:56.839633Z	error	Failed to query node status	{"chain_name": "lumnetwork", "chain_id": "lum-network-1", "attempt": 5, "max_attempts": 5, "error": "failed to query node status: post failed: Post \"http://ip_here\": context deadline exceeded"}
2023-08-28T03:11:56.839657Z	error	Failed to query latest height after max attempts	{"chain_name": "lumnetwork", "chain_id": "lum-network-1", "attempts": 5, "error": "failed to query node status: post failed: Post \"http://ip_here\": context deadline exceeded"}
2023-08-28T03:11:56.840208Z	error	Failed to query latest height after max attempts	{"chain_name": "nolus", "chain_id": "pirin-1", "attempts": 5, "error": "context canceled"}
rly.service: Deactivated successfully.
rly.service: Consumed 1min 8.125s CPU time.
rly.service: Scheduled restart job, restart counter is at 1.
Stopped RLY IBC relayer for mainnet.
rly.service: Consumed 1min 8.125s CPU time.
Started RLY IBC relayer for mainnet.
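
For context, my rough mental model of the failure mode is below. This is a hypothetical sketch, not the relayer's actual code (the errgroup layout, queryLatestHeight, and the attempt limit are assumptions for illustration): one chain's height query gives up after max attempts, that error cancels the group running every chain processor, the other chains log "context canceled", the process exits, and systemd restarts the unit.

package main

import (
    "context"
    "errors"
    "fmt"
    "log"
    "time"

    "golang.org/x/sync/errgroup"
)

// queryLatestHeight stands in for the per-chain status query; the
// "lumnetwork" endpoint is simulated as permanently unreachable.
func queryLatestHeight(ctx context.Context, chain string) error {
    if chain == "lumnetwork" {
        return errors.New("failed to query node status: post failed: context deadline exceeded")
    }
    return nil
}

// runChain loops forever, giving up only after several failures in a row,
// mirroring the "attempts": 5 in the logs above.
func runChain(ctx context.Context, chain string) error {
    const maxAttempts = 5
    for {
        var err error
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            if err = queryLatestHeight(ctx, chain); err == nil {
                break
            }
            log.Printf("error: Failed to query node status chain_name=%s attempt=%d", chain, attempt)
        }
        if err != nil {
            // Returning the error here is what tears down the whole group below.
            return fmt.Errorf("%s: failed to query latest height after max attempts: %w", chain, err)
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Second):
        }
    }
}

func main() {
    eg, ctx := errgroup.WithContext(context.Background())
    for _, chain := range []string{"lumnetwork", "nolus", "osmosis"} {
        chain := chain
        eg.Go(func() error { return runChain(ctx, chain) })
    }
    // One chain's error cancels ctx for every other chain (the "context
    // canceled" seen for nolus above) and the process exits; systemd then
    // restarts the unit per its restart policy.
    if err := eg.Wait(); err != nil {
        log.Fatal(err)
    }
}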
@jtieri
Member

jtieri commented Oct 11, 2023

Thanks for opening this issue. I agree that restarting because one node is unreachable doesn't seem like desirable behavior. I'll discuss this internally and see how the team wants to prioritize it; I may be able to take this on in our next sprint.

@tiagocmachado

Is there any update on this? Or is there any config that can be set to prevent it from restarting?

@jtieri
Member

jtieri commented Jul 17, 2024

Is there any update on this? Or is there any config that can be set to prevent it from restarting?

I started working on a PoC for this a while back, but I got pulled away to work on some other things.

Recently, one of the engineers on our team revisited this issue but struggled to get the rly process to crash due to one node being unavailable. He said, "I am having a very difficult time figuring out how to make the Chain Processor error out and crash the application. Even if the chain is configured with an invalid node endpoint, it will just keep trying and trying; it never crashes. I've looked into the code, and as of right now, the only time it will fully error out is when there is a stuck packet that doesn't get resolved:

if stuckPacket != nil &&
    ccp.chainProvider.ChainId() == stuckPacket.ChainID &&
    newLatestQueriedBlock == int64(stuckPacket.EndHeight) {
    i = persistence.latestHeight
    ccp.log.Debug("Parsed stuck packet height, skipping to current")
    newLatestQueriedBlock, err = ccp.latestHeightWithRetry(ctx)
    if err != nil {
        ccp.log.Error("Failed to query node height after max attempts. Consider checking endpoint and retyring for stuck packets")
        return err
    }
}
"

@joelsmith-2019 does this sound correct?


If we can confirm that this behavior is still present and can replicate it locally in testing, then we should be able to find someone to take this on sooner rather than later and refactor things into a state with more desirable behavior.
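
For the record, the rough direction I have in mind is sketched below. It's just an illustration of the idea, not the relayer's real API (runChain, the query callback, and the backoff values are made up): log and back off inside the per-chain loop instead of returning the error, so an unreachable node only stalls its own chain.

package main

import (
    "context"
    "errors"
    "log"
    "math/rand"
    "time"
)

// runChain is a hypothetical per-chain processor loop: query failures are
// logged and retried with exponential backoff instead of being returned,
// so a single unreachable node only stalls its own chain.
func runChain(ctx context.Context, chain string, query func(context.Context) error) error {
    backoff := time.Second
    for {
        if err := query(ctx); err != nil {
            if ctx.Err() != nil {
                return ctx.Err() // the relayer itself is shutting down
            }
            log.Printf("warn: %s query failed, backing off %s: %v", chain, backoff, err)
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(backoff):
            }
            if backoff < time.Minute {
                backoff *= 2
            }
            continue
        }
        backoff = time.Second // node is healthy again, reset the delay
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Second):
        }
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Simulate a node that is unreachable about half the time; the loop keeps
    // the chain alive instead of crashing the whole process.
    _ = runChain(ctx, "lumnetwork", func(context.Context) error {
        if rand.Intn(2) == 0 {
            return errors.New("post failed: context deadline exceeded")
        }
        return nil
    })
}

Where exactly that loop should live (chain processor vs. higher up) is something we'd work out once we can reproduce the crash locally.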

@joelsmith-2019
Contributor

@jtieri - Yes, that does sound correct.

@tiagocmachado


This could be related to the number of chains and paths we relay.

We faced the issue with 25 chains and 88 paths.
