[providers] Fallback provider hangs #2030

Open
LayneHaber opened this issue Sep 11, 2021 · 12 comments
Labels
investigate Under investigation and may be a bug. on-deck This Enhancement or Bug is currently being worked on.

Comments

@LayneHaber

LayneHaber commented Sep 11, 2021

Describe the bug
When using a FallbackProvider, if one RPC hangs, the entire request hangs, even if the other RPCs are returning quickly without issue.

Important usage note: this issue was reproduced with a quorum of 1 (either because only 2 provider URLs are used, or because quorum is explicitly set to 1).

Reproduction steps

import { JsonRpcProvider, FallbackProvider } from "@ethersproject/providers";

// Must use a list of URLs where one does not return quickly.
// This is the exhibited error behavior for some invalid or overloaded RPCs :(
const urls = ["http://....", "http://...."];
const providers = urls.map(url => new JsonRpcProvider(url));
const fallback = new FallbackProvider(providers, 1);
const promise = fallback.getBlockNumber();
await promise; // Hangs!

Environment:
Reproducible in a Node REPL; the difficult part is finding a URL that hangs! Using "ethers": "^5.4.6".

@LayneHaber LayneHaber added the investigate Under investigation and may be a bug. label Sep 11, 2021
@LayneHaber
Author

LayneHaber commented Sep 11, 2021

It seems to be happening due to the Promise.all within detectNetwork(). Presumably it could happen in other places if an RPC simply stopped responding after the network was set, but I haven't been able to test this.
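A minimal sketch of why a single unsettled promise stalls Promise.all, and of one way to bound it. This is not the ethers implementation; `withTimeout`, `healthy`, `hanging` and `detectAll` are hypothetical names used only for illustration.

```typescript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Hypothetical stand-ins for two backends: one answers, one never settles.
const healthy = (): Promise<number> => Promise.resolve(56);
const hanging = (): Promise<number> => new Promise<number>(() => { /* never settles */ });

// Promise.all([healthy(), hanging()]) on its own would never settle.
// With a per-call deadline, the hang turns into an Error the caller can inspect.
function detectAll(ms: number): Promise<Array<number | Error>> {
  const calls = [healthy(), hanging()].map((call, index) =>
    withTimeout(call, ms, `backend ${index}`).catch((error: Error) => error));
  return Promise.all(calls);
}
```

With this wrapper, `detectAll(1000)` settles after roughly one second with the healthy result and an Error for the hanging backend, instead of waiting indefinitely.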

@gitpusha

We noticed the same issue

@zemse
Collaborator

zemse commented Sep 16, 2021

Can you try using the stallTimeout config (docs)?

const provider = new FallbackProvider([
  { provider: provider1, stallTimeout: 1000 }, 
  { provider: provider2, stallTimeout: 1500 }
])

@timaiv

timaiv commented Nov 12, 2021

For me, the problem was in await waitForSync(config, currentBlockNumber) in FallbackProvider.
A provider that hangs can't sync to the latest block number, which was taken from the other providers, and no timeout parameter is passed.
Reproducible 100% of the time by turning off the network adapter while paused on a breakpoint.

@timaiv

timaiv commented Nov 17, 2021

How I can reproduce it, and my explanation for getLogs (found by logging everywhere).
I reproduced it in production and locally under different scenarios; the explanation below is simplified (in this case I actually turned the network adapter off and restarted it).

I have 6 JSON-RPC providers:

  • GOOD, weight 1, stallTimeout 2 (for example, assume it rarely returns an error)
  • GOOD, weight 1, stallTimeout 2
  • GOOD, weight 1, stallTimeout 2
  • GOOD, weight 1, stallTimeout 2
  • BAD, weight 1, stallTimeout 2 (always returns 404), can't sync block number.
  • BAD, weight 1, stallTimeout 2 (always returns 404), can't sync block number.

Fallback quorum is 1.

After shuffling

  1. Bad provider
  2. Good provider
    -...

Iterations (simplified; in practice they depend on each provider's state (blockNumber) at that moment):

  1. Promise.race(waiting) contains 2 waiting (first runner and first staller); the staller completes (timeout).
  2. Promise.race(waiting) contains 3 waiting (first runner, second runner and second staller); the second runner completes (SERVER ERROR).
  3. Promise.race(waiting) contains 1 waiting (the first runner; the first staller has been nulled and the second runner is done).
    waitForSync waits forever.
    No new provider is added here, because inflightWeight is 1 (from the second provider, which is done, BUT only recently):
    (c.runner && ((t0 - c.start) < c.stallTimeout)) evaluates to true
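The accounting in step 3 can be sketched as follows. This is a simplified model, not the ethers source: `RunningConfigSketch` and `inflightWeight` are hypothetical, `runner` is reduced to a boolean, and the timings (in milliseconds) are made-up values chosen to mirror the scenario above.

```typescript
// Simplified, hypothetical model of one provider's per-request state.
interface RunningConfigSketch {
  weight: number;
  stallTimeout: number; // ms before the provider is considered stalled
  start: number;        // time (ms) at which its request was started
  runner: boolean;      // true once a request has been started
  done: boolean;        // true once that request has settled (result or error)
}

// Sum the weight of providers considered "in flight" at time `now`.
// With skipDone = false, a provider that already failed still counts
// until its stallTimeout elapses.
function inflightWeight(
  configs: RunningConfigSketch[],
  now: number,
  skipDone: boolean,
): number {
  return configs
    .filter((c) => c.runner && (now - c.start) < c.stallTimeout && !(skipDone && c.done))
    .reduce((total, c) => total + c.weight, 0);
}

const configs: RunningConfigSketch[] = [
  // First runner: started at t=0, hanging, already past its stall timeout.
  { weight: 1, stallTimeout: 2000, start: 0, runner: true, done: false },
  // Second runner: started at t=2000, already failed with SERVER ERROR.
  { weight: 1, stallTimeout: 2000, start: 2000, runner: true, done: true },
  // Third provider: not started yet.
  { weight: 1, stallTimeout: 2000, start: 0, runner: false, done: false },
];

const withoutDoneCheck = inflightWeight(configs, 2500, false); // 1
const withDoneCheck = inflightWeight(configs, 2500, true);     // 0
```

With a quorum of 1, an in-flight weight of 1 means no further provider is started even though the only counted provider has already failed; excluding finished providers brings the weight back to 0 so the next provider can be tried.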

With contract event subscriptions, all subsequent getLogs calls (which are triggered by the polling interval and don't wait for the previous one to complete) get stuck without returning a result or an error.


Update
I think we have 2 bugs (in my case):

  1. inflightWeight calculation.
    Fixed by changing the filter in the perform function to:
    configs.filter((c) => (c.runner && ((t0 - c.start) < c.stallTimeout) && !c.done))
    In my opinion, the time check could actually be removed.
  2. waitForSync waiting forever.
    Fixed by adding syncTimeout to FallbackProviderConfig and passing a syncTimeout parameter through.
    In the perform function:
    getRunner(config, currentBlockNumber, method, params, config.syncTimeout)
    In the waitForSync function:
    async function waitForSync(config: RunningConfig, blockNumber: number, timeout: number): Promise<BaseProvider> {
    , { oncePoll: provider, timeout: timeout });
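The second fix amounts to giving the sync wait a deadline. A minimal standalone sketch of that idea, assuming a hypothetical `BlockSource` stand-in for a provider and a plain polling loop rather than the library's own poll helper:

```typescript
// Hypothetical stand-in for a provider that exposes its latest known block.
interface BlockSource {
  blockNumber: number;
}

// Resolve once `source` has reached `target`; reject once `timeoutMs` has elapsed.
// Without the deadline check, a source that never advances is polled forever.
function waitForSyncSketch(
  source: BlockSource,
  target: number,
  timeoutMs: number,
  pollMs: number = 5,
): Promise<BlockSource> {
  return new Promise<BlockSource>((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    const check = (): void => {
      if (source.blockNumber >= target) {
        resolve(source);
        return;
      }
      if (Date.now() >= deadline) {
        reject(new Error(`not synced to block ${target} within ${timeoutMs}ms`));
        return;
      }
      setTimeout(check, pollMs);
    };
    check();
  });
}
```

A caller can then treat the rejection like any other provider failure and move on to the next provider instead of waiting on one that cannot catch up.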

Full logs from one of the tests
Inside while loop quorum/1, i/0
inflightWeight/0,quorum/1, i/0
Inside quorum loop, i/1, config url/https://data-seed-prebsc-2-s2.binance.org:8545/
configs.forEach
Config, url/https://data-seed-prebsc-2-s2.binance.org:8545/
Staller added
Config, url/https://data-seed-prebsc-1-s3.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-1-s2.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-1-s1.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s3.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s1.binance.org:8545/
Config doesnot have runnner
Waiting any result, config length/6, length/2
Config, url/https://data-seed-prebsc-2-s2.binance.org:8545/, set null staller, staller now is [object Object]
Waiting any done
Inside while loop quorum/1, i/1
inflightWeight/0,quorum/1, i/1
Inside quorum loop, i/2, config url/https://data-seed-prebsc-1-s3.binance.org:8545/
configs.forEach
Config, url/https://data-seed-prebsc-2-s2.binance.org:8545/
Staller is null
Config, url/https://data-seed-prebsc-1-s3.binance.org:8545/
Staller added
Config, url/https://data-seed-prebsc-1-s2.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-1-s1.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s3.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s1.binance.org:8545/
Config doesnot have runnner
Waiting any result, config length/6, length/3
Config, url/https://data-seed-prebsc-1-s3.binance.org:8545/, set null staller, staller now is [object Object]
Waiting any done
Inside while loop quorum/1, i/2
inflightWeight/0,quorum/1, i/2
Inside quorum loop, i/3, config url/https://data-seed-prebsc-1-s2.binance.org:8545/
configs.forEach
Config, url/https://data-seed-prebsc-2-s2.binance.org:8545/
Staller is null
Config, url/https://data-seed-prebsc-1-s3.binance.org:8545/
Staller is null
Config, url/https://data-seed-prebsc-1-s2.binance.org:8545/
Staller added
Config, url/https://data-seed-prebsc-1-s1.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s3.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s1.binance.org:8545/
Config doesnot have runnner
Waiting any result, config length/6, length/4
getRunner on Error, url/https://data-seed-prebsc-1-s2.binance.org:8545/
Waiting any done
Inside while loop quorum/1, i/3
Inflight weight has config, url/https://data-seed-prebsc-1-s2.binance.org:8545/
inflightWeight/1,quorum/1, i/3
configs.forEach
Config, url/https://data-seed-prebsc-2-s2.binance.org:8545/
Staller is null
Config, url/https://data-seed-prebsc-1-s3.binance.org:8545/
Staller is null
Config, url/https://data-seed-prebsc-1-s2.binance.org:8545/
Config is done
Config, url/https://data-seed-prebsc-1-s1.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s3.binance.org:8545/
Config doesnot have runnner
Config, url/https://data-seed-prebsc-2-s1.binance.org:8545/
Config doesnot have runnner
Waiting any result, config length/6, length/2
Config, url/https://data-seed-prebsc-1-s2.binance.org:8545/, set null staller, staller now is [object Object]
stuck forever

@Jonas121

The issue exists. If one of the providers hangs completely with a 404, the other fallback providers don't do the job, even though they are up and running and could deliver a result. My setup is expected to send the request to 3 backends with the same weight; if any of them delivers, the positive result should be returned.

let provider1 = new ethers.providers.StaticJsonRpcProvider(prov[0])
let provider2 = new ethers.providers.StaticJsonRpcProvider(prov[1])
let provider3 = new ethers.providers.StaticJsonRpcProvider(prov[2])

let provider_fallback = new ethers.providers.FallbackProvider([
  { provider: provider1, priority: 1, weight: 1, stallTimeout: 0 },
  { provider: provider2, priority: 1, weight: 1, stallTimeout: 0 },
  // The original snippet repeated provider2 here; provider3 appears to be intended.
  { provider: provider3, priority: 1, weight: 1, stallTimeout: 0 }
], 1)

Error:
Error: could not detect network (event="noNetwork", code=NETWORK_ERROR, version=providers/5.5.0)
at Logger.makeError (/home/testuser/node_modules/@ethersproject/logger/lib/index.js:199:21)
at Logger.throwError (/home/testuser/node_modules/@ethersproject/logger/lib/index.js:208:20)
at StaticJsonRpcProvider.<anonymous> (/home/testuser/node_modules/@ethersproject/providers/lib/json-rpc-provider.js:517:54)
at step (/home/testuser/node_modules/@ethersproject/providers/lib/json-rpc-provider.js:48:23)
at Object.throw (/home/testuser/node_modules/@ethersproject/providers/lib/json-rpc-provider.js:29:53)
at rejected (/home/testuser/node_modules/@ethersproject/providers/lib/json-rpc-provider.js:21:65)
at processTicksAndRejections (node:internal/process/task_queues:96:5) {
reason: 'could not detect network',
code: 'NETWORK_ERROR',
event: 'noNetwork'

@timaiv

timaiv commented Nov 29, 2021

Found one more hang (not actually related to the fallback provider, but to specific logic in _getInternalBlockNumber).

if (maxAge > 0) {

            // While there are pending internal block requests...
            while (this._internalBlockNumber) {

                // ..."remember" which fetch we started with
                const internalBlockNumber = this._internalBlockNumber;

                try {
                    // Check the result is not too stale
                    const result = await internalBlockNumber;
                    if ((getTime() - result.respTime) <= maxAge) {
                        return result.blockNumber;
                    }

                    // Too old; fetch a new value
                    break;

                } catch(error) {

                    // The fetch rejected; if we are the first to get the
                    // rejection, drop through so we replace it with a new
                    // fetch; all others blocked will then get that fetch
                    // which won't match the one they "remembered" and loop
                    if (this._internalBlockNumber === internalBlockNumber) {
                        break;
                    }
                }
            }
        }

Two problems are reproducible (I reproduced them with sendTransaction through FallbackProvider, which currently waits for all providers to complete and then returns any success).

if (method === "sendTransaction") {
            const results: Array<string | Error> = await Promise.all(this.providerConfigs.map((c) => {

sendTransaction hangs for several seconds/minutes/hours (depending on the number of _getInternalBlockNumber requests).

Problems:

  1. ethers.js waits for any active eth_blockNumber request to complete, and only after it fails starts its own.
    If old requests cannot be cleared as fast as new ones keep arriving, the app will run out of memory.
  2. After any new real request with maxAge = 0, the old spectators wait on this new request (the while loop can be almost infinite).

Details of the 1st problem.
List of function call ids on the bad provider, with the maxAge parameter of each:
0 (maxAge = 0), 1 (0), 2 (2100), 3 (2100), 4 (2100), 5 (2100)

  • 0 called, is active
  • 1 called, is active
  • 2 called, waits 1 to complete
  • 3 called, waits 1 to complete
  • 1 failed, 2 is active, 3 now waits 2
  • 4 called, waits 2 to complete
  • 2 failed, 3 is active,
    ..
  • 50 called, waits 10 to complete
    After a minute,
    40 spectators are waiting on previous active requests.
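The chaining described above can be modelled with a small simulation. This is a deliberately simplified sketch, not the ethers code: it ignores the staleness check on respTime, and `BlockNumberCacheSketch` and `simulateWaiters` are hypothetical names. Every fetch fails after 10 ms, one caller uses maxAge = 0 and two callers use maxAge = 2100.

```typescript
// Simplified model of a shared in-flight block number request.
class BlockNumberCacheSketch {
  private pending: Promise<number> | null = null;
  private readonly fetch: () => Promise<number>;

  constructor(fetch: () => Promise<number>) {
    this.fetch = fetch;
  }

  async get(maxAge: number): Promise<number> {
    if (maxAge > 0) {
      // While there is a pending request, wait on it instead of starting one.
      while (this.pending) {
        const remembered = this.pending;
        try {
          return await remembered;
        } catch {
          // First waiter to see the rejection replaces the request;
          // the others loop and wait on the replacement.
          if (this.pending === remembered) { break; }
        }
      }
    }
    const mine = this.fetch();
    this.pending = mine;
    return mine;
  }
}

// Returns, for each caller, how many failed fetches had happened
// by the time that caller finally received its rejection.
function simulateWaiters(): Promise<number[]> {
  let failures = 0;
  const failingFetch = (): Promise<number> =>
    new Promise<number>((_resolve, reject) => {
      setTimeout(() => { failures += 1; reject(new Error("backend down")); }, 10);
    });
  const cache = new BlockNumberCacheSketch(failingFetch);
  const calls = [0, 2100, 2100].map((maxAge) =>
    cache.get(maxAge).catch(() => failures));
  return Promise.all(calls);
}
```

In this model the three callers see 1, 2 and 3 failed fetches respectively, so each additional waiter sits through one more full failure before it gets an answer, which matches the growing queue of spectators described in the list.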

@eramosr16

eramosr16 commented Dec 7, 2021

I'm having this very same issue. The NETWORK_ERROR is raised as a result of calling a contract function rather than from within the provider; there should be some kind of try/catch in between so the FallbackProvider can catch this and switch to a different provider. This is my FallbackProvider setup:

let prvs = []
for (let i = 0; i < settings.nodes.length; i++) {
  const node = settings.nodes[i]
  let prv = new ethers.providers.JsonRpcProvider(
    { url: node, timeout: 1000 },
    {
      name: 'binance',
      chainId: 56,
    },
  )

  await prv.ready
  prvs.push({
    provider: prv,
    priority: 1,
    weight: i + 1,
    stallTimeout: 1000,
  })
}
const provider = new ethers.providers.FallbackProvider(prvs)

A quick update: with this code I was able to avoid the NETWORK_ERROR, but now I'm getting this:

Error: failed to meet quorum (method="getBlockNumber", params={}, results=[{"weight":5,"start":1639593840435,"result":13503917},{"weight":4,"start":1639593840435,"result":13503917},{"weight":3,"start":1639593841435,"result":13503917},{"weight":1,"start":1639593841435,"result":13503917},{"weight":2,"start":1639593841435,"result":13503917}], provider="[object Object]", code=SERVER_ERROR, version=providers/5.5.1

@ricmoo ricmoo added the on-deck This Enhancement or Bug is currently being worked on. label Dec 13, 2021
@ryandgoulding

I am also running into this issue. I am hosting two of my own nodes, and I can replicate this issue by rerouting the destination provider url for one of the nodes using an IP route or a firewall rule. Is it safe to say we should avoid this feature for now in favor of a Promise.any() or similar?
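A minimal sketch of the "first successful answer wins" approach mentioned above, written without Promise.any so it does not depend on an ES2021 runtime. `firstFulfilled` is a hypothetical helper, not part of ethers; each entry in `calls` is assumed to wrap one backend request, for example `() => provider.getBlockNumber()`.

```typescript
// Resolve with the first backend that answers; reject only when every
// backend has either failed or exceeded `timeoutMs`.
function firstFulfilled<T>(
  calls: Array<() => Promise<T>>,
  timeoutMs: number,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    if (calls.length === 0) {
      reject(new Error("no backends configured"));
      return;
    }
    const errors: Error[] = [];
    const fail = (error: Error): void => {
      errors.push(error);
      if (errors.length === calls.length) {
        const reasons = errors.map((e) => e.message).join("; ");
        reject(new Error(`all ${calls.length} backends failed: ${reasons}`));
      }
    };
    calls.forEach((call, index) => {
      let settled = false;
      // A hanging backend is counted as failed once the deadline passes.
      const timer = setTimeout(() => {
        if (!settled) {
          settled = true;
          fail(new Error(`backend ${index} timed out after ${timeoutMs}ms`));
        }
      }, timeoutMs);
      call().then(
        (value) => {
          if (!settled) { settled = true; clearTimeout(timer); resolve(value); }
        },
        (error) => {
          if (!settled) {
            settled = true;
            clearTimeout(timer);
            fail(error instanceof Error ? error : new Error(String(error)));
          }
        },
      );
    });
  });
}
```

Note the trade-off: this returns whatever the fastest backend says and performs no quorum or consistency check, so it gives up the agreement guarantees that FallbackProvider is designed to offer.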

@humblecoder

humblecoder commented Sep 8, 2022

Any updates on this? I'm having the same issue. Also, more detailed information overall (and perhaps even a quality recovery mechanism) would be nice.

I receive the call that timed out (e.g. method:eth_blockNumber), but no indication as to why the timeout occurred. I realize you can really only provide whatever info the node sends back, but perhaps some info on why every fallback provider failed would be useful.

Timing out (when you have 3 endpoints configured) then receiving noNetwork afterward (perpetually) while still being able to manually geth attach to every single endpoint listed (or run separate simultaneous scripts that use the same nodes) is 🤯 to say the least.

@travisbotello

I am facing the same problem. Seems like right now it's better to just stick to one solid provider instead of using 4 fallback providers that include a black sheep that regularly times out/hangs...

Just one provider gives me more uptime.

@drptbl

drptbl commented May 27, 2024

Still occurs on ethers@v6. Sadly, this is a huge blocker which makes all of us create custom solutions for fallbacks.
