
[repro, don't merge]: peer comms going crazy on random fake network error #4250

Closed
dpc wants to merge 9 commits from the 24-02-06-peer-comms-going-bonkers branch

Conversation

dpc (Contributor) commented Feb 6, 2024

Repro easily with:

env RUST_LOG=debug TEST_ARGS="out_of_band_cancel --no-capture" ./scripts/tests/backend-test.sh

(might need a few tries if you're unlucky).

What you will see:

When things go OK, the logs are relatively quiet.

When the randomized error is introduced, comms go crazy. I added IN and OUT logs, and it seems like communication between peers is working; there's just tons of traffic and things keep getting stuck, in particular given that the error has a higher chance of being introduced the more traffic goes through the connection, possibly re-introducing the cause of the problems periodically.
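
A per-message injector of roughly the following shape produces exactly that traffic-proportional failure pattern. This is a minimal sketch, not the actual test code; the function name and probability constant are illustrative:

```rust
use rand::Rng;
use std::io;

// Illustrative only: each message independently has a fixed chance of
// triggering a simulated connection error, so the busier a connection is,
// the more often it fails.
const FAKE_FAILURE_PROBABILITY: f64 = 0.01;

fn maybe_inject_network_error() -> io::Result<()> {
    if rand::thread_rng().gen_bool(FAKE_FAILURE_PROBABILITY) {
        return Err(io::Error::new(
            io::ErrorKind::ConnectionReset,
            "simulated random network error",
        ));
    }
    Ok(())
}
```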

Notably, without "refactor: improve client executor loop" the problem seems gone. Which is weird: it should be a strictly client-side change. It's unclear to me whether there's a bug there introducing the problem, or whether the more aggressive state machine execution is overloading consensus, or it's something else entirely.

Edit:

In c82edbf I made sure that after the initial failures the peers have time to recover communication without further failures being injected. But it doesn't help; the test fails like all the other times anyway.

I wonder/suspect that either something about the aleph item period/timeouts is wrong, or it has a bug.

@dpc dpc requested a review from a team as a code owner February 6, 2024 08:40
@dpc dpc changed the title from "[repro, don't merge]: peers going crazy on random fake network error" to "[repro, don't merge]: peer comms going crazy on random fake network error" on Feb 6, 2024
@dpc dpc force-pushed the 24-02-06-peer-comms-going-bonkers branch from 7794872 to e2838f0 on February 6, 2024 08:44
@dpc dpc requested a review from a team as a code owner February 6, 2024 08:44
@dpc dpc force-pushed the 24-02-06-peer-comms-going-bonkers branch from e2838f0 to 36231f7 on February 6, 2024 22:01
dpc (Contributor, Author) commented Feb 6, 2024

@joschisan @elsirion Any thoughts? It seems to me like an alephbft issue, and I don't know where to take it from here.

dpc (Contributor, Author) commented Feb 6, 2024

Hmmm....

2024-02-06T22:27:34.396063Z  INFO fedimint_server::consensus::server: Delay delay=50.0 round_index=54
2024-02-06T22:27:34.464473Z  INFO fedimint_client: Last client reference dropped, shutting down client task group
2024-02-06T22:27:34.464516Z  INFO fedimint_client::sm::executor: Shutting down state machine executor runner due to shutdown signal
2024-02-06T22:27:34.464624Z DEBUG fedimint_client::sm::executor: Executor already stopped, ignoring stop request
2024-02-06T22:27:34.466670Z ERROR AlephBFT-backup-saver: couldn't respond with saved unit to runway
thread 'tokio-runtime-worker' panicked at /home/dpc/lab/fedimint/fedimint-server/src/atomic_broadcast/spawner.rs:30:29:
We own the rx.: ()
stack backtrace:
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt

dpc (Contributor, Author) commented Feb 6, 2024

Ohhh....

2024-02-06T22:34:02.580873Z DEBUG net::peer: Could not send outgoing message since the channel is full
2024-02-06T22:34:02.581322Z DEBUG net::peer: Could not send outgoing message since the channel is full
2024-02-06T22:34:02.651356Z  INFO fedimint_server::consensus::server: Delay delay=50.0 round_index=486 expected_rounds=2700 exponential_slowdown_offset=8100 BASE=1.005
2024-02-06T22:34:02.665184Z DEBUG net::peer: Could not send outgoing message since the channel is full
2024-02-06T22:34:02.690723Z DEBUG net::peer: Could not send outgoing message since the channel is full
2024-02-06T22:34:02.716511Z DEBUG net::peer: Could not send outgoing message since the channel is full
2024-02-06T22:34:02.728908Z  INFO fedimint_server::consensus::server: Delay delay=50.0 round_index=487 expected_rounds=2700 exponential_slowdown_offset=8100 BASE=1.005

I think I know... :D

dpc (Contributor, Author) commented Feb 6, 2024

Ignore this comment.

2024-02-06T22:39:41.064127Z  INFO task{name="fedimintd"}: fedimint_server::consensus::server: Delay delay=9.967362535478008e20 round_index=9010 expected_rounds=2700 exponential_slowdown_offset=8100 BASE=1.005
thread 'tokio-runtime-worker' panicked at library/core/src/time.rs:914:31:
overflow when adding durations
   5: core::option::Option<T>::expect
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/option.rs:898:21
   6: <core::time::Duration as core::ops::arith::Add>::add
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/time.rs:914:31
   7: <core::time::Duration as core::ops::arith::AddAssign>::add_assign
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/time.rs:921:17
   8: fedimint_aleph_bft::config::time_to_reach_round
             at /home/dpc/.cargo/registry/src/index.crates.io-6f17d22bba15001f/fedimint-aleph-bft-0.30.0/src/config.rs:206:9
   9: fedimint_aleph_bft::config::create_config
             at /home/dpc/.cargo/registry/src/index.crates.io-6f17d22bba15001f/fedimint-aleph-bft-0.30.0/src/config.rs:127:8
  10: fedimint_server::consensus::server::ConsensusServer::run_session::{{closure}}
             at /home/dpc/lab/fedimint/fedimint-server/src/consensus/server.rs:368:22
  11: fedimint_server::consensus::server::ConsensusServer::run_consensus::{{closure}}
             at /home/dpc/lab/fedimint/fedimint-server/src/consensus/server.rs:290:45
  12: fedimint_server::consensus::server::ConsensusServer::run::{{closure}}
  203  fn time_to_reach_round(round: Round, delay_schedule: &DelaySchedule) -> Duration {
  204      let mut total_time: Duration = Duration::from_millis(0);
  205      for r in 0..round {
  206          total_time += delay_schedule(r as usize);
  207      }
  208      total_time
  209  }


  127      if time_to_reach_round(max_round, &delay_config.unit_creation_delay) < time_to_reach_max_round {
  128          error!(
  129              target: "AlephBFT-config",
  130              "Reaching max_round will happen too fast with the given Config. Consider increasing max_round or lowering time_to_reach_max_round."
  131          );
  132          return Err(InvalidConfigError);
  133      }

~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/fedimint-aleph-bft-0.30.0/src/config.rs

So I decreased exponential_slowdown_offset somewhat, and that makes the check there overflow: time_to_reach_round sums the per-round delay for every round up to max_round, and with the exponential slowdown kicking in earlier, the running total overflows Duration. I see.

I guess we should submit a PR. Overflow should be checked.

And we should also put some cap on that exponentiation.
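
For the upstream fix, a minimal sketch of what checked/capped versions could look like. This is illustrative only: the DelaySchedule alias, the schedule shape, and the MAX_ROUND_DELAY cap are assumptions, not the real alephbft code:

```rust
use std::time::Duration;

type Round = u16;
type DelaySchedule = Box<dyn Fn(usize) -> Duration + Send + Sync + 'static>;

// Saturate instead of panicking: with an exponentially growing schedule the
// running total can blow past Duration::MAX long before `round` is reached.
fn time_to_reach_round(round: Round, delay_schedule: &DelaySchedule) -> Duration {
    let mut total_time = Duration::ZERO;
    for r in 0..round {
        total_time = total_time.saturating_add(delay_schedule(r as usize));
    }
    total_time
}

// Cap the exponentiation itself (schedule shape and cap value assumed here):
const MAX_ROUND_DELAY: Duration = Duration::from_secs(60 * 60);

fn round_delay(round: usize, offset: usize, base: f64, base_delay_ms: f64) -> Duration {
    let exponent = round.saturating_sub(offset) as i32;
    // A float-to-int `as` cast saturates on overflow/infinity, so this cannot panic.
    let delay_ms = (base_delay_ms * base.powi(exponent)) as u64;
    Duration::from_millis(delay_ms).min(MAX_ROUND_DELAY)
}
```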

dpc (Contributor, Author) commented Feb 6, 2024

I tried increasing the outgoing channel size by a ton to make sure messages are not being dropped, to check if alephbft somehow doesn't like that. Didn't help.

There's still a message that gets dropped on a simulated disconnection, but then... a BFT protocol can't assume a complete lack of failures. Are we supposed to be running some kind of acknowledgement protocol and keep messages in the buffer until the peer confirms it has seen them? Or is the fire-and-forget that we do OK? Just grasping at straws and double-checking.
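
For reference, the fire-and-forget pattern in question looks roughly like this. A minimal tokio sketch; the message type, function name, and channel wiring are assumed rather than copied from fedimint:

```rust
use tokio::sync::mpsc;

struct PeerMessage(Vec<u8>);

fn send_fire_and_forget(tx: &mpsc::Sender<PeerMessage>, msg: PeerMessage) {
    // try_send never blocks: if the bounded channel is full, the message is
    // dropped on the floor, producing the "channel is full" debug line above.
    // There is no acknowledgement or retransmission at this layer.
    if let Err(mpsc::error::TrySendError::Full(_)) = tx.try_send(msg) {
        tracing::debug!("Could not send outgoing message since the channel is full");
    }
}
```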

dpc added a commit to dpc/fedimint that referenced this pull request Feb 8, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 9, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 9, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 10, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 11, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 12, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 12, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 12, 2024
dpc (Contributor, Author) commented Feb 15, 2024

It's possible that the alephbft behavior I'm seeing here is actually normal, and the root of the problems I was seeing was #4329.

@dpc dpc closed this Feb 15, 2024