[providers] Fallback provider hangs #2030

LayneHaber · 2021-09-11T02:23:50Z

Describe the bug
When using a FallbackProvider if you have an RPC that hangs, the entire request will hang even if other RPCs are returning quickly without issue.

Important usage note: this issue was reproduced with a quorum of 1 (either because only 2 provider URLs are used, or because quorum is explicitly set to 1).

Reproduction steps

import { JsonRpcProvider, FallbackProvider } from "@ethersproject/providers";

// Must use a list of URLs where one does not return quickly.
// This is the exhibited error behavior for some invalid or overloaded RPCs :(
const urls = ["http://....", "http://...."];
const providers = urls.map(url => new JsonRpcProvider(url));
const fallback = new FallbackProvider(providers, 1);
const promise = fallback.getBlockNumber();
await promise; // Hangs!

Environment:
Reproducible in a Node REPL; the difficult part is finding a URL that hangs! Using "ethers": "^5.4.6"


LayneHaber · 2021-09-11T04:00:06Z

It seems to be happening due to the Promise.all within detectNetwork(), here. Presumably it could also happen in other places if an RPC simply stopped responding after the network was set, but I haven't been able to test this.
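
The failure mode can be seen without ethers at all: Promise.all settles only when every input settles, so one promise that never resolves keeps the aggregate pending forever. The sketch below is a minimal illustration, not the library's code; `withTimeout` and `neverSettles` are hypothetical names introduced here. It shows one way to bound each input so the hang becomes a rejection the caller can handle.

```typescript
// Hypothetical helper (not an ethers API): reject if `promise` has not
// settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); }
    );
  });
}

// Stand-in for an RPC endpoint that accepts the request but never answers.
const neverSettles = new Promise<number>(() => {});
// Stand-in for a healthy endpoint that answers immediately.
const healthy = Promise.resolve(56);

// Promise.all([healthy, neverSettles]) would stay pending forever.
// With each slow input bounded, the aggregate rejects after 50 ms instead.
const bounded = Promise.all([healthy, withTimeout(neverSettles, 50)]);
bounded.catch((error: Error) => console.log(error.message)); // prints "timed out after 50ms"
```

Whether a timed-out provider should fail the whole call or simply be skipped is a separate design decision; this only shows that the wait can be bounded.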

gitpusha · 2021-09-15T08:23:01Z

We noticed the same issue

zemse · 2021-09-16T14:38:41Z

Can you try using the stallTimeout config (docs)?

const provider = new FallbackProvider([
  { provider: provider1, stallTimeout: 1000 }, 
  { provider: provider2, stallTimeout: 1500 }
])

timaiv · 2021-11-12T16:06:02Z

For me, the problem was in await waitForSync(config, currentBlockNumber) in FallbackProvider.
A provider that hangs can't sync to the latest block number, which was taken from the other providers, and no timeout parameter is passed.
Reproducible 100% of the time by turning off the network adapter while paused on a breakpoint.

timaiv · 2021-11-17T11:48:53Z

How I can reproduce it, and my explanation for getLogs (found by logging everywhere).
I reproduced it in production and locally with different scenarios; the explanation below is simplified (in this case I actually turned the network adapter off and restarted it).

I have 6 JSON RPC providers

GOOD, weight 1, stallTimeout 2 (for example, assume it rarely returns an error)
GOOD, weight 1, stallTimeout 2
GOOD, weight 1, stallTimeout 2
GOOD, weight 1, stallTimeout 2
BAD, weight 1, stallTimeout 2 (always returns 404), can't sync block number.
BAD, weight 1, stallTimeout 2 (always returns 404), can't sync block number.
Fallback quorum is 1.

After shuffling

Bad provider
Good provider
-...

Iterations (simplified); in reality they depend on each provider's state (blockNumber) at that moment.

Promise.race(waiting) contains 2 entries (first runner and first staller); the staller completes (timeout).
Promise.race(waiting) contains 3 entries (first runner, second runner and second staller); the second runner completes (SERVER ERROR).
Promise.race(waiting) contains 1 entry (first runner; the first staller is nulled and the second runner is done).
waitForSync waits forever.
No new provider is added here, because inflightWeight is 1 (from the second provider, which is done BUT only recently), so
(c.runner && ((t0 - c.start) < c.stallTimeout)) is still true

With contract event subscriptions, all subsequent getLogs calls (which are raised on the polling interval and don't wait for the previous one to complete) get stuck without returning a result or an error.

Update
I think we have 2 bugs (in my case):

The inflightWeight calculation.
Fixed in the perform function by:
configs.filter((c) => (c.runner && ((t0 - c.start) < c.stallTimeout) && !c.done))
(the time check could actually be deleted as well)

waitForSync waiting forever.
Fixed by adding syncTimeout to FallbackProviderConfig and passing it through as a parameter.
In the perform function:
getRunner(config, currentBlockNumber, method, params, config.syncTimeout)
In the waitForSync function:
async function waitForSync(config: RunningConfig, blockNumber: number, timeout: number): Promise {
, { oncePoll: provider, timeout: timeout });
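
To make the first bug concrete, here is a minimal model of the in-flight weight filter. It is only a sketch under assumptions: `RunningConfigSketch` is a hypothetical, trimmed-down stand-in for the per-provider state, with field names taken from the snippets quoted in this comment, and the timestamps are made up for illustration.

```typescript
// Hypothetical, simplified stand-in for the state tracked per provider.
interface RunningConfigSketch {
  weight: number;
  start: number;        // timestamp (ms) at which the runner was started
  stallTimeout: number; // ms after which the runner counts as stalled
  runner: Promise<unknown> | null;
  done: boolean;        // true once the runner has settled (result or error)
}

// Filter as quoted above: a runner that has already failed still counts as
// in flight until its stallTimeout elapses.
function inflightWeightOriginal(configs: RunningConfigSketch[], t0: number): number {
  return configs
    .filter((c) => c.runner !== null && t0 - c.start < c.stallTimeout)
    .reduce((accum, c) => accum + c.weight, 0);
}

// Proposed filter: settled runners no longer count as in flight.
function inflightWeightPatched(configs: RunningConfigSketch[], t0: number): number {
  return configs
    .filter((c) => c.runner !== null && !c.done && t0 - c.start < c.stallTimeout)
    .reduce((accum, c) => accum + c.weight, 0);
}

const t0 = 10000;
const configs: RunningConfigSketch[] = [
  // First provider: hung, and already past its stallTimeout.
  { weight: 1, start: t0 - 5000, stallTimeout: 2000, runner: new Promise<unknown>(() => {}), done: false },
  // Second provider: settled (with an error) 100 ms ago.
  { weight: 1, start: t0 - 100, stallTimeout: 2000, runner: Promise.resolve(null), done: true },
  // Third provider: not started yet.
  { weight: 1, start: 0, stallTimeout: 2000, runner: null, done: false },
];

console.log(inflightWeightOriginal(configs, t0)); // 1
console.log(inflightWeightPatched(configs, t0));  // 0
```

With a quorum of 1, the original value of 1 means a check of the form `inflightWeight < quorum` is false, so the third provider is never started even though nothing useful is still running. With the extra `!c.done` condition the weight drops to 0 and the next provider can be tried.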

Full logs from one of the tests
Inside while loop quorum/1, i/0
inflightWeight/0,quorum/1, i/0
Inside quorum loop, i/1, config url/https://data-seed-prebsc-2-s2.binance.org:8545/
configs.forEach
Config, url/https://data-seed-prebsc-2-s2.binance.org:8545/
Staller added
Config, url/https://data-seed-prebsc-1-s3.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-1-s2.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-1-s1.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s3.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s1.binance.org:8545/
Config doesnot have runnner
Waiting any result, config length/6, length/2
Config, url/https://data-seed-prebsc-2-s2.binance.org:8545/, set null staller, staller now is [object Object]
Waiting any done
Inside while loop quorum/1, i/1
inflightWeight/0,quorum/1, i/1
Inside quorum loop, i/2, config url/https://data-seed-prebsc-1-s3.binance.org:8545/
configs.forEach
Config, url/https://data-seed-prebsc-2-s2.binance.org:8545/
Staller is null
Config, url/https://data-seed-prebsc-1-s3.binance.org:8545/
Staller added
Config, url/https://data-seed-prebsc-1-s2.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-1-s1.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s3.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s1.binance.org:8545/
Config doesnot have runnner
Waiting any result, config length/6, length/3
Config, url/https://data-seed-prebsc-1-s3.binance.org:8545/, set null staller, staller now is [object Object]
Waiting any done
Inside while loop quorum/1, i/2
inflightWeight/0,quorum/1, i/2
Inside quorum loop, i/3, config url/https://data-seed-prebsc-1-s2.binance.org:8545/
configs.forEach
Config, url/https://data-seed-prebsc-2-s2.binance.org:8545/
Staller is null
Config, url/https://data-seed-prebsc-1-s3.binance.org:8545/
Staller is null
Config, url/https://data-seed-prebsc-1-s2.binance.org:8545/
Staller added
Config, url/https://data-seed-prebsc-1-s1.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s3.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s1.binance.org:8545/
Config doesnot have runnner
Waiting any result, config length/6, length/4
getRunner on Error, url/https://data-seed-prebsc-1-s2.binance.org:8545/
Waiting any done
Inside while loop quorum/1, i/3
Inflight weight has config, url/https://data-seed-prebsc-1-s2.binance.org:8545/
inflightWeight/1,quorum/1, i/3
configs.forEach
Config, url/https://data-seed-prebsc-2-s2.binance.org:8545/
Staller is null
Config, url/https://data-seed-prebsc-1-s3.binance.org:8545/
Staller is null
Config, url/https://data-seed-prebsc-1-s2.binance.org:8545/
Config is done
Config, url/https://data-seed-prebsc-1-s1.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s3.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s1.binance.org:8545/
Config doesnot have runnner
Waiting any result, config length/6, length/2
Config, url/https://data-seed-prebsc-1-s2.binance.org:8545/, set null staller, staller now is [object Object]
stuck forever

Jonas121 · 2021-11-19T14:44:49Z

The issue exists. If one of the providers hangs completely or returns 404, the other fallback providers don't do the job, even though they are up and running and could deliver a result. What I expect from my setup is to send the request to 3 backends with the same weight; if any of them delivers, that positive result should be returned.

let provider1=new ethers.providers.StaticJsonRpcProvider(prov[0])
let provider2=new ethers.providers.StaticJsonRpcProvider(prov[1])
let provider3=new ethers.providers.StaticJsonRpcProvider(prov[2])

let provider_fallback=new ethers.providers.FallbackProvider([
{provider: provider1, priority: 1, weight: 1, stallTimeout: 0},
{provider: provider2, priority: 1, weight: 1, stallTimeout: 0},
{provider: provider3, priority: 1, weight: 1, stallTimeout: 0}
] , 1)

Error:
Error: could not detect network (event="noNetwork", code=NETWORK_ERROR, version=providers/5.5.0)
at Logger.makeError (/home/testuser/node_modules/@ethersproject/logger/lib/index.js:199:21)
at Logger.throwError (/home/testuser/node_modules/@ethersproject/logger/lib/index.js:208:20)
at StaticJsonRpcProvider.<anonymous> (/home/testuser/node_modules/@ethersproject/providers/lib/json-rpc-provider.js:517:54)
at step (/home/testuser/node_modules/@ethersproject/providers/lib/json-rpc-provider.js:48:23)
at Object.throw (/home/testuser/node_modules/@ethersproject/providers/lib/json-rpc-provider.js:29:53)
at rejected (/home/testuser/node_modules/@ethersproject/providers/lib/json-rpc-provider.js:21:65)
at processTicksAndRejections (node:internal/process/task_queues:96:5) {
reason: 'could not detect network',
code: 'NETWORK_ERROR',
event: 'noNetwork'

timaiv · 2021-11-29T12:08:01Z

Found 1 more hang, not actually related to the fallback provider but to specific logic in _getInternalBlockNumber.

if (maxAge > 0) {

    // While there are pending internal block requests...
    while (this._internalBlockNumber) {

        // ..."remember" which fetch we started with
        const internalBlockNumber = this._internalBlockNumber;

        try {
            // Check the result is not too stale
            const result = await internalBlockNumber;
            if ((getTime() - result.respTime) <= maxAge) {
                return result.blockNumber;
            }

            // Too old; fetch a new value
            break;

        } catch(error) {

            // The fetch rejected; if we are the first to get the
            // rejection, drop through so we replace it with a new
            // fetch; all others blocked will then get that fetch
            // which won't match the one they "remembered" and loop
            if (this._internalBlockNumber === internalBlockNumber) {
                break;
            }
        }
    }
}

Two problems are reproducible (I reproduced them with sendTransaction through FallbackProvider, which currently waits for all providers to complete and only then returns any success).

if (method === "sendTransaction") {
            const results: Array<string | Error> = await Promise.all(this.providerConfigs.map((c) => {

sendTransaction hangs for several seconds/minutes/hours (depending on the number of _getInternalBlockNumber requests).

Problems:

ethers.js waits for any active eth_blockNumber request to complete, and only after it fails does it start its own.
We can't work through the old waiting requests as fast as new ones keep arriving, so the app eventually goes down from memory exhaustion.
After any new real request with maxAge = 0, the old spectators wait on this new request (the while loop can be almost infinite).

Details of the 1st problem.
List of function call IDs on the bad provider, with the maxAge parameter in parentheses:
0 (maxAge = 0), 1 (0), 2 (2100), 3 (2100), 4 (2100), 5 (2100)

0 called, is active
1 called, is active
2 called, waits 1 to complete
3 called, waits 1 to complete
1 failed, 2 is active, 3 now waits 2
4 called, waits 2 to complete
2 failed, 3 is active,
..
50 called, waits 10 to complete
After a minute,
40 spectators are waiting on previously active requests.

eramosr16 · 2021-12-07T23:38:16Z

I'm having this very same issue. The NETWORK_ERROR is raised as a result of calling a contract function rather than from within the provider; there should be some kind of try/catch in between so the FallbackProvider could catch this and switch to a different provider. This is my FallbackProvider setup:

let prvs = []
for (let i = 0; i < settings.nodes.length; i++) {
  const node = settings.nodes[i]
  let prv = new ethers.providers.JsonRpcProvider(
    { url: node, timeout: 1000 },
    {
      name: 'binance',
      chainId: 56,
    },
  )

  await prv.ready
  prvs.push({
    provider: prv,
    priority: 1,
    weight: i + 1,
    stallTimeout: 1000,
  })
}
const provider = new ethers.providers.FallbackProvider(prvs)

A quick update: with this code I was able to avoid the NETWORK_ERROR, but now I'm getting this:

Error: failed to meet quorum (method="getBlockNumber", params={}, results=[{"weight":5,"start":1639593840435,"result":13503917},{"weight":4,"start":1639593840435,"result":13503917},{"weight":3,"start":1639593841435,"result":13503917},{"weight":1,"start":1639593841435,"result":13503917},{"weight":2,"start":1639593841435,"result":13503917}], provider="[object Object]", code=SERVER_ERROR, version=providers/5.5.1

ryandgoulding · 2022-02-18T13:51:17Z

I am also running into this issue. I am hosting two of my own nodes, and I can replicate this issue by rerouting the destination provider url for one of the nodes using an IP route or a firewall rule. Is it safe to say we should avoid this feature for now in favor of a Promise.any() or similar?
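
For reference, a "first success wins" helper in the spirit of Promise.any() could look like the sketch below. It is deliberately independent of ethers: `firstSuccess` and the three task functions are hypothetical names, and in practice each task would wrap one provider call (for example `() => provider.getBlockNumber()`). It is written without Promise.any itself so that it also runs on older targets.

```typescript
// Hypothetical helper: resolve with the first task that succeeds within
// `timeoutMs`; reject only after every task has failed or timed out.
function firstSuccess<T>(tasks: Array<() => Promise<T>>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let remaining = tasks.length;
    if (remaining === 0) {
      reject(new Error("no tasks supplied"));
      return;
    }
    // Called once per task that fails or times out.
    const fail = (): void => {
      remaining -= 1;
      if (remaining === 0) {
        reject(new Error(`all ${tasks.length} tasks failed or timed out`));
      }
    };
    for (const task of tasks) {
      let settled = false;
      const timer = setTimeout(() => {
        if (!settled) { settled = true; fail(); }
      }, timeoutMs);
      task().then(
        (value) => {
          if (settled) { return; }
          settled = true;
          clearTimeout(timer);
          resolve(value);
        },
        () => {
          if (settled) { return; }
          settled = true;
          clearTimeout(timer);
          fail();
        }
      );
    }
  });
}

// Stand-ins for three endpoints: one hangs, one errors, one answers.
const hung = () => new Promise<number>(() => {});
const broken = () => Promise.reject<number>(new Error("404"));
const healthy = () => Promise.resolve(13503917);

const winner = firstSuccess([hung, broken, healthy], 100);
winner.then((blockNumber) => console.log(blockNumber)); // prints 13503917
```

Note the trade-off: this returns whatever the fastest healthy endpoint says, so it gives up the cross-provider agreement that a quorum is meant to provide.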

humblecoder · 2022-09-08T22:23:19Z

Any updates on this? I'm having the same issue. Also, more detailed information overall (and perhaps even a quality recovery mechanism) would be nice.

I receive the call that timed out (e.g. method:eth_blockNumber), but no indication as to why the timeout occurred. I realize you can really only provide whatever info the node sends back, but perhaps some info on why every fallback provider failed would be useful.

Timing out (when you have 3 endpoints configured) then receiving noNetwork afterward (perpetually) while still being able to manually geth attach to every single endpoint listed (or run separate simultaneous scripts that use the same nodes) is 🤯 to say the least.

travisbotello · 2023-02-10T17:29:18Z

I am facing the same problem. Seems like right now it's better to just stick to one solid provider instead of using 4 fallback providers that include a black sheep that regularly times out/hangs...

Just one provider gives me more uptime.

drptbl · 2024-05-27T19:40:10Z

Still occurs on ethers@v6. Sadly, this is a huge blocker which makes all of us create custom solutions for fallbacks.

LayneHaber added the investigate Under investigation and may be a bug. label Sep 11, 2021

ricmoo added the on-deck This Enhancement or Bug is currently being worked on. label Dec 13, 2021

dawsbot mentioned this issue Feb 4, 2022

Add FallthroughProvider dawsbot/essential-eth#22

Closed

asaj mentioned this issue Jul 29, 2022

agents should be able to use a quorum of multiple providers hyperlane-xyz/hyperlane-monorepo#832

Closed
