
[repro, don't merge]: peer comms going crazy on random fake network error #4250

Closed
dpc wants to merge 9 commits from the 24-02-06-peer-comms-going-bonkers branch

Conversation

dpc (Contributor) commented Feb 6, 2024

Repro easily with:

env RUST_LOG=debug TEST_ARGS="out_of_band_cancel --no-capture" ./scripts/tests/backend-test.sh

(might need a few tries if you're unlucky).

What you will see:

When things go OK, the logs are relatively quiet.

When the randomized error is introduced, comms go crazy. I added IN and OUT logs, and it seems like communication between peers is working; there's just tons of traffic and things keep getting stuck, in particular given that the error has a higher chance of being introduced the more traffic goes through the connection, possibly re-introducing the cause of the problems periodically.
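
A per-message injector of roughly the following shape produces exactly that traffic-proportional failure pattern. This is a minimal sketch, not the actual test code; the function name and probability constant are illustrative:

```rust
use rand::Rng;
use std::io;

// Illustrative only: each message independently has a fixed chance of
// triggering a simulated connection error, so the busier a connection is,
// the more often it fails.
const FAKE_FAILURE_PROBABILITY: f64 = 0.01;

fn maybe_inject_network_error() -> io::Result<()> {
    if rand::thread_rng().gen_bool(FAKE_FAILURE_PROBABILITY) {
        return Err(io::Error::new(
            io::ErrorKind::ConnectionReset,
            "simulated random network error",
        ));
    }
    Ok(())
}
```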

Notably, without "refactor: improve client executor loop" the problem seems gone. Which is weird: it should be a strictly client-side change. It's unclear to me whether there's a bug there introducing the problem, or whether the more aggressive state machine execution is overloading consensus, or it's something else entirely.

Edit:

In c82edbf I made sure that after the initial failures the peers have time to recover communication without further failures being injected. But it doesn't help; the test fails like all the other times anyway.

I wonder/suspect that either something about the aleph item period/timeouts is wrong, or it has a bug.

@dpc dpc requested a review from a team as a code owner February 6, 2024 08:40
@dpc dpc changed the title from "[repro, don't merge]: peers going crazy on random fake network error" to "[repro, don't merge]: peer comms going crazy on random fake network error" on Feb 6, 2024
@dpc dpc force-pushed the 24-02-06-peer-comms-going-bonkers branch from 7794872 to e2838f0 on February 6, 2024 08:44
@dpc dpc requested a review from a team as a code owner February 6, 2024 08:44
@dpc dpc force-pushed the 24-02-06-peer-comms-going-bonkers branch from e2838f0 to 36231f7 on February 6, 2024 22:01
dpc (Contributor, Author) commented Feb 6, 2024

@joschisan @elsirion Any thoughts? It seems to me like an alephbft issue, and I don't know where to take it from here.

dpc (Contributor, Author) commented Feb 6, 2024

Hmmm....

2024-02-06T22:27:34.396063Z  INFO fedimint_server::consensus::server: Delay delay=50.0 round_index=54
2024-02-06T22:27:34.464473Z  INFO fedimint_client: Last client reference dropped, shutting down client task group
2024-02-06T22:27:34.464516Z  INFO fedimint_client::sm::executor: Shutting down state machine executor runner due to shutdown signal
2024-02-06T22:27:34.464624Z DEBUG fedimint_client::sm::executor: Executor already stopped, ignoring stop request
2024-02-06T22:27:34.466670Z ERROR AlephBFT-backup-saver: couldn't respond with saved unit to runway
thread 'tokio-runtime-worker' panicked at /home/dpc/lab/fedimint/fedimint-server/src/atomic_broadcast/spawner.rs:30:29:
We own the rx.: ()
stack backtrace:
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt

dpc (Contributor, Author) commented Feb 6, 2024

Ohhh....

2024-02-06T22:34:02.580873Z DEBUG net::peer: Could not send outgoing message since the channel is full
2024-02-06T22:34:02.581322Z DEBUG net::peer: Could not send outgoing message since the channel is full
2024-02-06T22:34:02.651356Z  INFO fedimint_server::consensus::server: Delay delay=50.0 round_index=486 expected_rounds=2700 exponential_slowdown_offset=8100 BASE=1.005
2024-02-06T22:34:02.665184Z DEBUG net::peer: Could not send outgoing message since the channel is full
2024-02-06T22:34:02.690723Z DEBUG net::peer: Could not send outgoing message since the channel is full
2024-02-06T22:34:02.716511Z DEBUG net::peer: Could not send outgoing message since the channel is full
2024-02-06T22:34:02.728908Z  INFO fedimint_server::consensus::server: Delay delay=50.0 round_index=487 expected_rounds=2700 exponential_slowdown_offset=8100 BASE=1.005

I think I know... :D

dpc (Contributor, Author) commented Feb 6, 2024

Ignore this comment.

2024-02-06T22:39:41.064127Z  INFO task{name="fedimintd"}: fedimint_server::consensus::server: Delay delay=9.967362535478008e20 round_index=9010 expected_rounds=2700 exponential_slowdown_offset=8100 BASE=1.005
thread 'tokio-runtime-worker' panicked at library/core/src/time.rs:914:31:
overflow when adding durations
   5: core::option::Option<T>::expect
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/option.rs:898:21
   6: <core::time::Duration as core::ops::arith::Add>::add
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/time.rs:914:31
   7: <core::time::Duration as core::ops::arith::AddAssign>::add_assign
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/time.rs:921:17
   8: fedimint_aleph_bft::config::time_to_reach_round
             at /home/dpc/.cargo/registry/src/index.crates.io-6f17d22bba15001f/fedimint-aleph-bft-0.30.0/src/config.rs:206:9
   9: fedimint_aleph_bft::config::create_config
             at /home/dpc/.cargo/registry/src/index.crates.io-6f17d22bba15001f/fedimint-aleph-bft-0.30.0/src/config.rs:127:8
  10: fedimint_server::consensus::server::ConsensusServer::run_session::{{closure}}
             at /home/dpc/lab/fedimint/fedimint-server/src/consensus/server.rs:368:22
  11: fedimint_server::consensus::server::ConsensusServer::run_consensus::{{closure}}
             at /home/dpc/lab/fedimint/fedimint-server/src/consensus/server.rs:290:45
  12: fedimint_server::consensus::server::ConsensusServer::run::{{closure}}
  203  fn time_to_reach_round(round: Round, delay_schedule: &DelaySchedule) -> Duration {
  204      let mut total_time: Duration = Duration::from_millis(0);
  205      for r in 0..round {
  206          total_time += delay_schedule(r as usize);
  207      }
  208      total_time
  209  }


  127      if time_to_reach_round(max_round, &delay_config.unit_creation_delay) < time_to_reach_max_round {
  128          error!(
  129              target: "AlephBFT-config",
  130              "Reaching max_round will happen too fast with the given Config. Consider increasing max_round or lowering time_to_reach_max_round."
  131          );
  132          return Err(InvalidConfigError);
  133      }

~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/fedimint-aleph-bft-0.30.0/src/config.rs

So I decreased exponential_slowdown_offset somewhat, and that makes the check there overflow: time_to_reach_round sums the per-round delay for every round up to max_round, and with the exponential slowdown kicking in earlier, the running total overflows Duration. I see.

I guess we should submit a PR. Overflow should be checked.

And we should also put some cap on that exponentiation.
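
For the upstream fix, a minimal sketch of what checked/capped versions could look like. This is illustrative only: the DelaySchedule alias, the schedule shape, and the MAX_ROUND_DELAY cap are assumptions, not the real alephbft code:

```rust
use std::time::Duration;

type Round = u16;
type DelaySchedule = Box<dyn Fn(usize) -> Duration + Send + Sync + 'static>;

// Saturate instead of panicking: with an exponentially growing schedule the
// running total can blow past Duration::MAX long before `round` is reached.
fn time_to_reach_round(round: Round, delay_schedule: &DelaySchedule) -> Duration {
    let mut total_time = Duration::ZERO;
    for r in 0..round {
        total_time = total_time.saturating_add(delay_schedule(r as usize));
    }
    total_time
}

// Cap the exponentiation itself (schedule shape and cap value assumed here):
const MAX_ROUND_DELAY: Duration = Duration::from_secs(60 * 60);

fn round_delay(round: usize, offset: usize, base: f64, base_delay_ms: f64) -> Duration {
    let exponent = round.saturating_sub(offset) as i32;
    // A float-to-int `as` cast saturates on overflow/infinity, so this cannot panic.
    let delay_ms = (base_delay_ms * base.powi(exponent)) as u64;
    Duration::from_millis(delay_ms).min(MAX_ROUND_DELAY)
}
```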

dpc (Contributor, Author) commented Feb 6, 2024

I tried increasing the outgoing channel size by a ton to make sure messages are not being dropped, to check if alephbft somehow doesn't like that. Didn't help.

There's still a message that gets dropped on a simulated disconnection, but then... a BFT protocol can't assume a complete lack of failures. Are we supposed to be running some kind of acknowledgement protocol and keep messages in the buffer until the peer confirms it has seen them? Or is the fire-and-forget that we do OK? Just grasping at straws and double-checking.
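
For reference, the fire-and-forget pattern in question looks roughly like this. A minimal tokio sketch; the message type, function name, and channel wiring are assumed rather than copied from fedimint:

```rust
use tokio::sync::mpsc;

struct PeerMessage(Vec<u8>);

fn send_fire_and_forget(tx: &mpsc::Sender<PeerMessage>, msg: PeerMessage) {
    // try_send never blocks: if the bounded channel is full, the message is
    // dropped on the floor, producing the "channel is full" debug line above.
    // There is no acknowledgement or retransmission at this layer.
    if let Err(mpsc::error::TrySendError::Full(_)) = tx.try_send(msg) {
        tracing::debug!("Could not send outgoing message since the channel is full");
    }
}
```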

dpc added a commit to dpc/fedimint that referenced this pull request Feb 8, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 9, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 9, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 10, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 11, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 12, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 12, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Feb 12, 2024
dpc (Contributor, Author) commented Feb 15, 2024

It's possible that the alephbft behavior I'm seeing here is actually normal, and the root of the problems I was seeing was #4329.

@dpc dpc closed this Feb 15, 2024