
RLY fails if one node is unavailable #1268

Open
dylanschultzie opened this issue Aug 28, 2023 · 5 comments

Comments

@dylanschultzie

When a node is unreachable, the entire rly process restarts even though there are other channels being covered. It seems like that channel should be passed on, rather than the entire service being ended.

2023-08-28T03:11:56.839633Z	error	Failed to query node status	{"chain_name": "lumnetwork", "chain_id": "lum-network-1", "attempt": 5, "max_attempts": 5, "error": "failed to query node status: post failed: Post \"http://ip_here\": context deadline exceeded"}
2023-08-28T03:11:56.839657Z	error	Failed to query latest height after max attempts	{"chain_name": "lumnetwork", "chain_id": "lum-network-1", "attempts": 5, "error": "failed to query node status: post failed: Post \"http://ip_here\": context deadline exceeded"}
2023-08-28T03:11:56.840208Z	error	Failed to query latest height after max attempts	{"chain_name": "nolus", "chain_id": "pirin-1", "attempts": 5, "error": "context canceled"}
rly.service: Deactivated successfully.
rly.service: Consumed 1min 8.125s CPU time.
rly.service: Scheduled restart job, restart counter is at 1.
Stopped RLY IBC relayer for mainnet.
rly.service: Consumed 1min 8.125s CPU time.
Started RLY IBC relayer for mainnet.
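
For context, my rough mental model of the failure mode is below. This is a hypothetical sketch, not the relayer's actual code (the errgroup layout, queryLatestHeight, and the attempt limit are assumptions for illustration): one chain's height query gives up after max attempts, that error cancels the group running every chain processor, the other chains log "context canceled", the process exits, and systemd restarts the unit.

package main

import (
    "context"
    "errors"
    "fmt"
    "log"
    "time"

    "golang.org/x/sync/errgroup"
)

// queryLatestHeight stands in for the per-chain status query; the
// "lumnetwork" endpoint is simulated as permanently unreachable.
func queryLatestHeight(ctx context.Context, chain string) error {
    if chain == "lumnetwork" {
        return errors.New("failed to query node status: post failed: context deadline exceeded")
    }
    return nil
}

// runChain loops forever, giving up only after several failures in a row,
// mirroring the "attempts": 5 in the logs above.
func runChain(ctx context.Context, chain string) error {
    const maxAttempts = 5
    for {
        var err error
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            if err = queryLatestHeight(ctx, chain); err == nil {
                break
            }
            log.Printf("error: Failed to query node status chain_name=%s attempt=%d", chain, attempt)
        }
        if err != nil {
            // Returning the error here is what tears down the whole group below.
            return fmt.Errorf("%s: failed to query latest height after max attempts: %w", chain, err)
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Second):
        }
    }
}

func main() {
    eg, ctx := errgroup.WithContext(context.Background())
    for _, chain := range []string{"lumnetwork", "nolus", "osmosis"} {
        chain := chain
        eg.Go(func() error { return runChain(ctx, chain) })
    }
    // One chain's error cancels ctx for every other chain (the "context
    // canceled" seen for nolus above) and the process exits; systemd then
    // restarts the unit per its restart policy.
    if err := eg.Wait(); err != nil {
        log.Fatal(err)
    }
}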
@jtieri
Member

jtieri commented Oct 11, 2023

Thanks for opening this issue. I agree that restarting because one node is unreachable doesn't seem like desirable behavior. I'll discuss this internally and see how the team wants to prioritize it; I may be able to take this on in our next sprint.

@tiagocmachado

Is there any update on this? Or is there any config that can be set to prevent it from restarting?

@jtieri
Member

jtieri commented Jul 17, 2024

Is there any update on this? Or is there any config that can be set to prevent it from restarting?

I started working on a PoC for this a while back, but I got pulled away to work on some other things.

Recently, one of the engineers on our team revisited this issue but struggled to get the rly process to crash due to one node being unavailable. He said, "I am having a very difficult time figuring out how to make the Chain Processor error out and crash the application. Even if the chain is configured with an invalid node endpoint, it will just keep trying and trying; it never crashes. I've looked into the code, and as of right now, the only time it will fully error out is when there is a stuck packet that doesn't get resolved:

if stuckPacket != nil &&
    ccp.chainProvider.ChainId() == stuckPacket.ChainID &&
    newLatestQueriedBlock == int64(stuckPacket.EndHeight) {
    i = persistence.latestHeight
    ccp.log.Debug("Parsed stuck packet height, skipping to current")
    newLatestQueriedBlock, err = ccp.latestHeightWithRetry(ctx)
    if err != nil {
        ccp.log.Error("Failed to query node height after max attempts. Consider checking endpoint and retyring for stuck packets")
        return err
    }
}
"

@joelsmith-2019 does this sound correct?


If we can confirm that this behavior is still present and can replicate it locally in testing, then we should be able to find someone to take this on sooner rather than later and refactor things into a state with more desirable behavior.
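
For the record, the rough direction I have in mind is sketched below. It's just an illustration of the idea, not the relayer's real API (runChain, the query callback, and the backoff values are made up): log and back off inside the per-chain loop instead of returning the error, so an unreachable node only stalls its own chain.

package main

import (
    "context"
    "errors"
    "log"
    "math/rand"
    "time"
)

// runChain is a hypothetical per-chain processor loop: query failures are
// logged and retried with exponential backoff instead of being returned,
// so a single unreachable node only stalls its own chain.
func runChain(ctx context.Context, chain string, query func(context.Context) error) error {
    backoff := time.Second
    for {
        if err := query(ctx); err != nil {
            if ctx.Err() != nil {
                return ctx.Err() // the relayer itself is shutting down
            }
            log.Printf("warn: %s query failed, backing off %s: %v", chain, backoff, err)
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(backoff):
            }
            if backoff < time.Minute {
                backoff *= 2
            }
            continue
        }
        backoff = time.Second // node is healthy again, reset the delay
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Second):
        }
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Simulate a node that is unreachable about half the time; the loop keeps
    // the chain alive instead of crashing the whole process.
    _ = runChain(ctx, "lumnetwork", func(context.Context) error {
        if rand.Intn(2) == 0 {
            return errors.New("post failed: context deadline exceeded")
        }
        return nil
    })
}

Where exactly that loop should live (chain processor vs. higher up) is something we'd work out once we can reproduce the crash locally.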

@joelsmith-2019
Contributor

@jtieri - Yes, that does sound correct.

@tiagocmachado


This could be related to the number of chains and paths we relay.

We faced the issue with 25 chains and 88 paths.
