
refactor: peer connection handling #4929

Merged
6 commits merged into fedimint:master from 24-04-11-fix-4837 on Apr 12, 2024
Conversation

dpc (Contributor) commented Apr 11, 2024

See commit messages

Fix #4837

Summary:

To avoid spamming peers with reconnection attempts, track connection attempts and add a delay between them.

Since `request` is cancellable, the future could get dropped in the middle of connecting. Do the connecting in a background task (Jit) instead.
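
A minimal sketch of the delay-between-attempts idea, assuming a shared per-peer struct along the lines of the FederationPeerClientShared that appears in the diff below (the field names and the helper are assumptions, not the exact implementation):

use std::time::{Duration, Instant};

/// Per-peer state shared between requests; remembers when we last tried to connect.
struct FederationPeerClientShared {
    last_connection_attempt: Instant,
}

impl FederationPeerClientShared {
    fn new() -> Self {
        Self {
            last_connection_attempt: Instant::now(),
        }
    }

    /// Space out reconnection attempts, then record the new attempt.
    async fn wait_and_note_connection_attempt(&mut self, min_delay: Duration) {
        let since_last_connect = self.last_connection_attempt.elapsed();
        let sleep_duration = min_delay.saturating_sub(since_last_connect);
        if sleep_duration > Duration::ZERO {
            tokio::time::sleep(sleep_duration).await;
        }
        self.last_connection_attempt = Instant::now();
    }
}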

@dpc dpc requested a review from a team as a code owner April 11, 2024 07:04
@dpc dpc force-pushed the 24-04-11-fix-4837 branch 2 times, most recently from bb6b2a7 to 8000eb9 on April 11, 2024 07:08
Since `request` is used in query strategies, it will often be canceled,
possibly cancelling the connection attempts. That's not great.

By using Jit, connecting to peers happens in the background,
which avoids the problem.

While at it, keep attempting to reconnect. The strategy will cancel
the request anyway.
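
The gist of the Jit approach, sketched here with a plain tokio::spawn'ed JoinHandle rather than fedimint's actual Jit type (the connect function is a hypothetical stand-in for the real websocket client setup):

use std::sync::Arc;
use tokio::task::JoinHandle;

// Hypothetical placeholder for dialing the peer's websocket endpoint.
async fn connect(url: String) -> anyhow::Result<Arc<str>> {
    Ok(Arc::from(url.as_str()))
}

/// The connection attempt lives in its own task, so dropping a cancelled
/// `request` future no longer aborts connecting half-way through.
fn new_jit_client(url: String) -> JoinHandle<anyhow::Result<Arc<str>>> {
    tokio::spawn(async move { connect(url).await })
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let handle = new_jit_client("wss://peer.example".to_owned());
    // Only await the result once a request actually needs the connection.
    let _client = handle.await??;
    Ok(())
}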
@dpc dpc requested a review from a team as a code owner April 12, 2024 02:59
let shared: Arc<_> = tokio::sync::Mutex::new(FederationPeerClientShared::new()).into();

Self {
client: Self::new_jit_client(peer_id, url, shared.clone()),
dpc (Contributor, Author) commented:

I'm on the fence here about whether a new FederationPeerClient should connect in the background immediately, or start in an Err("Not connected") Jit state.

By connecting immediately, the connection can already be ready when it's actually needed, which might make things slightly more snappy.

On the downside, if no request is ever made, the connection is unnecessary.

Contributor replied:

Just connect ASAP. If it's the CLI, we don't care about wasted resources IMO as long as it doesn't impact latency. For anything longer-running we'll want a connection anyway 99% of the time.

dpc (Contributor, Author) commented Apr 12, 2024

I would rather land this and follow up with any fixes due to #4940.

@elsirion (Contributor) left a comment:

I strongly dislike reducing our unit testing coverage, but there's nothing broken about the code afaik. So ACK under the condition that we add back some testing for all that reconnect logic (especially concurrency-safety).

}

#[test_log::test(tokio::test)]
async fn concurrent_requests() {
Contributor commented:

Testing if concurrency works seems kinda important.

dpc (Contributor, Author) replied:

The previous test was just testing implementation details of the previous implementation; that's why it no longer worked after the implementation changed. I don't really want to add another test like that.

I'm thinking about a test that creates an Api, calls it in a loop from N tasks for some time, and checks that everything holds up.
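
One possible shape for that kind of stress test (make_test_api is a hypothetical helper standing in for however the fake, flaky peer API gets constructed; it does not exist in the codebase):

#[test_log::test(tokio::test)]
async fn concurrent_requests_stress() {
    // Hypothetical helper building an Api backed by an in-process, flaky server.
    let api = std::sync::Arc::new(make_test_api().await);

    let mut tasks = Vec::new();
    for _ in 0..16 {
        let api = api.clone();
        tasks.push(tokio::spawn(async move {
            for _ in 0..100 {
                // Individual failures are fine (the fake peer may be "down");
                // panics, deadlocks or hangs are what this is meant to catch.
                let _ = api.request("ping", &[]).await;
            }
        }));
    }
    for task in tasks {
        task.await.expect("request task panicked");
    }
}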

@@ -1516,13 +1516,26 @@ where
 {
     #[instrument(level = "trace", fields(peer = %self.peer_id, %method), skip_all)]
     pub async fn request(&self, method: &str, params: &[Value]) -> JsonRpcResult<Value> {
-        loop {
+        for attempts in 0.. {
             let rclient = self.client.read().await;
Contributor commented:
Could assert attempts <= 1 here. Took me a bit to figure out that invariant.
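
For instance, something along these lines inside the loop body (a sketch; the assertion message is made up):

// `request` performs at most one reconnection attempt per call,
// so the loop body should never run more than twice.
debug_assert!(attempts <= 1, "more than one reconnection attempt per request");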

.unwrap_or_default();

let sleep_duration = desired_timeout.saturating_sub(since_last_connect);
if Duration::from_millis(0) < sleep_duration {
@douglaz (Contributor) commented Apr 12, 2024:
Suggested change:
- if Duration::from_millis(0) < sleep_duration {
+ if sleep_duration > Duration::ZERO {

Err(e) => {
// Strategies using timeouts often depend on failing requests returning quickly,
// so every request gets only one reconnection attempt.
if 0 < attempts {
Contributor commented:

Else log something?
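
For example, the else branch could note the first failure before falling through to the reconnect path (a sketch; the wording and log level are assumptions):

} else {
    // First failure for this request: log it and fall through to reconnect once.
    debug!(%err, "Request failed, attempting to reconnect to peer");
}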

%err, "Unable to connect to peer");
return Err(err)?;
}
_ => {
Contributor commented:

Is this `Ok(_client)`? Better to be explicit about it.

return Err(err)?;
}
_ => {
if 0 < attempts {
Contributor commented:

Else log something?

let mut wclient = self.client.write().await;
match wclient.client.get_try().await {
Ok(client) if client.is_connected() => {
// someone else connected, just loop again
Contributor commented:

debug!/trace! ?
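
For instance (a sketch; assumes the tracing macros already used in this module):

Ok(client) if client.is_connected() => {
    // Another task already re-established the connection; just retry the request.
    trace!(peer = %self.peer_id, "Peer reconnected by another task, retrying request");
}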

@douglaz (Contributor) left a comment:

Minor stuff, can be implemented later

@dpc dpc added this pull request to the merge queue Apr 12, 2024
dpc added a commit to dpc/fedimint that referenced this pull request Apr 12, 2024
dpc (Contributor, Author) commented Apr 12, 2024

All but the extra tests are in #4946.

Merged via the queue into fedimint:master with commit 7c513d5 Apr 12, 2024
21 checks passed
@dpc dpc deleted the 24-04-11-fix-4837 branch April 12, 2024 18:31

Successfully merging this pull request may close these issues:

Mass reconnections when one mint is down (#4837)
3 participants