Skip to content

connmgr: Overhaul to use wrapped conns plus ctx.#3695

Open
davecgh wants to merge 12 commits into
decred:masterfrom
davecgh:connmgr_wrapped_conns
Open

connmgr: Overhaul to use wrapped conns plus ctx.#3695
davecgh wants to merge 12 commits into
decred:masterfrom
davecgh:connmgr_wrapped_conns

Conversation

@davecgh
Copy link
Copy Markdown
Member

@davecgh davecgh commented May 17, 2026

This requires #3692.

The existing connection manager code was written well before contexts were introduced. Further, due to the old async model that has now been converted to a synchronous model, it is based around connection requests that have their state atomically updated asynchronously as various things happen.

While it has undoubtedly worked well enough for over a decade, it has always been a challenge to add new functionality to it and requires the use of a lot of less than ideal and highly outdated techniques such as polling for state changes. It is also rather brittle in terms of requiring output connections to be manually disconnected in the connection manager after they've been closed to avoid things like leaking goroutines and failing to update target outbound counts.

Moreover, it only tracks outgoing connections which ultimately forces a lot of connection-related tasks to be split across different layers instead of residing in the connection manager itself where they more naturally belong. Notably, that split, for all intents and purposes, prevents implementing some desirable more advanced features such as immediate connection shedding, different connection types, and listeners tied to specific network types.

With the primary goal of addressing all of the aforementioned points and providing a solid base to work on for adding new features moving forward, this significantly reworks the connection manager to completely get rid of the notion of exposed connection requests in favor of a new custom connection type that wraps the underlying net.Conn.

The new wrapped connections automatically handle cleanup when closed and have an associated connection type enum that allows easily distinguishing inbound, outbound, and manual connections as well as supporting new connection types in the future.

Another nice feature of the new wrapped connections is they provide efficient access to concrete parsed address types which paves the way for avoiding a lot of constant reparsing, repeated host/port splitting and joining, and generally much more ergonomic immutable address types.

Since changing to wrapped connections basically required a rather significant rewrite of large portions of the connection manager anyway, this also takes the opportunity to improve several other aspects of the connection manager in the process such as implementing full context support, full tracking of all connection types by the manager itself, much more robust semaphore-based automatic connection limiting, cleaner persistent connection handling with independent limits, prevention of multiple connections of any type to the same address:port, more useful debug logging, and cleanly closing all connections at during shutdown.

It is also important to note that the following overall semantics have been intentionally been changed versus the existing connection manager:

  • A maximum of 8 persistent connections is now imposed and they no longer count toward the configured target number of automatic outbound peers to maintain
  • Duplicate addresses (including port) are now rejected by the connection manager for all types (inbound, outbound, manual, persistent)
    • Note that inbound conns from the same IP will necessarily have different ports, so the same max IP limits apply in that case
  • RPC 'node connect' for all connection attempts now:
    • Supports the RPC connection and server contexts
    • Properly handles duplicate address rejection including pending attempts
  • RPC 'node connect' for non-persistent conn attempts now:
    • Waits for the connection attempt result before returning
    • Returns an error if the connection attempt fails
    • Cancels the connection attempt if the RPC connection is closed before it succeeds
  • RPC 'node remove' now supports removing a pending connection by its persistent connection ID (since no peer ID exists before a valid connection is established)
  • It is no longer possible for state transitions to allow things like duplicate addresses or failed cancellation

Closes #3273

davecgh added 3 commits May 13, 2026 05:27
This moves the logic related to requesting and advertising addresses to
after the handshake completes as opposed to doing it during the
handshake in the version message handler.

In practice, the overall effect is the same, but moving it to happen
after the handshake has at least the following benefits:

- It helps ensure no other messages are queued up or sent between the
  version and verack messages
- It keeps the version handler focused on its primary purpose of
  negotiation and rejection of peers deemed to be unsuitable
- It is easier to reason about the sequence of events
This adds the Network method to the addrmgr.NetAddress type so that it
can be used as a stdlib net.Addr.
This adds a new context-aware semaphore type with Acquire and Release
methods for use in upcoming changes that aim to simplify connection
limiting by making use of semaphores for blocking until permits become
available.
@davecgh davecgh added this to the 2.2.0 milestone May 17, 2026
This adds tests for the new context-aware semaphore to ensure the
acquire, release, and context cancel semantics work as expected.
@davecgh davecgh force-pushed the connmgr_wrapped_conns branch 2 times, most recently from 15748ea to e3f13d7 Compare May 17, 2026 12:50
@davecgh
Copy link
Copy Markdown
Member Author

davecgh commented May 17, 2026

Example of the debug logging (IP addrs and dates redacted, but IP addr matching kept):

It's fairly clear to see that 8 simultaneous attempts are made as expected and retries happen until all 8 are established (12:13:10.018). Then no more are tried until one goes away (12:46:31.699). While there are no inbounds to showcase in the log, the connection id and type are clearly logged as well.

2026-xx-xx 12:12:23.461 [DBG] CMGR: Attempting to connect to <ipv4a>:9108 (id: 2, type: outbound)
2026-xx-xx 12:12:23.461 [DBG] CMGR: Attempting to connect to <ipv4b> (id: 1, type: outbound)
2026-xx-xx 12:12:23.461 [DBG] CMGR: Attempting to connect to [ipv6a]:9108 (id: 3, type: outbound)
2026-xx-xx 12:12:23.461 [DBG] CMGR: Attempting to connect to <ipv4c>:9108 (id: 4, type: outbound)
2026-xx-xx 12:12:23.461 [DBG] CMGR: Attempting to connect to <ipv4d>:9108 (id: 6, type: outbound)
2026-xx-xx 12:12:23.461 [DBG] CMGR: Attempting to connect to [ipv6b]:9108 (id: 5, type: outbound)
2026-xx-xx 12:12:23.461 [DBG] CMGR: Attempting to connect to <ipv4e>:9108 (id: 7, type: outbound)
2026-xx-xx 12:12:23.461 [DBG] CMGR: Attempting to connect to <ipv4f>:9108 (id: 8, type: outbound)
2026-xx-xx 12:12:23.518 [DBG] CMGR: Connected to <ipv4a>:9108 (id: 2, type: outbound)
2026-xx-xx 12:12:23.524 [DBG] CMGR: Connected to <ipv4d>:9108 (id: 6, type: outbound)
2026-xx-xx 12:12:23.527 [DBG] CMGR: Connected to [ipv6a]:9108 (id: 3, type: outbound)
...

2026-xx-xx 12:12:25.688 [DBG] CMGR: Failed to connect to <ipv4f>:9108: dial tcp <ipv4f>:9108: connectex: No connection could be made because the target machine actively refused it.
2026-xx-xx 12:12:25.690 [DBG] CMGR: Attempting to connect to <ipv4g>:9108 (id: 9, type: outbound)
2026-xx-xx 12:12:44.463 [DBG] CMGR: Failed to connect to <ipv4b>: dial tcp <ipv4b>: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2026-xx-xx 12:12:44.463 [DBG] CMGR: Attempting to connect to <ipv4h>:9108 (id: 10, type: outbound)
2026-xx-xx 12:12:44.463 [DBG] CMGR: Failed to connect to <ipv4c>:9108: dial tcp <ipv4c>:9108: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2026-xx-xx 12:12:44.464 [DBG] CMGR: Attempting to connect to <ipv4i>:9108 (id: 11, type: outbound)
2026-xx-xx 12:12:44.466 [DBG] CMGR: Failed to connect to <ipv4e>:9108: dial tcp <ipv4e>:9108: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2026-xx-xx 12:12:44.466 [DBG] CMGR: Failed to connect to [ipv6b]:9108: dial tcp [ipv6b]:9108: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2026-xx-xx 12:12:44.466 [DBG] CMGR: Attempting to connect to [ipv6c]:9108 (id: 12, type: outbound)
2026-xx-xx 12:12:44.466 [DBG] CMGR: Attempting to connect to <ipv4j>:9108 (id: 13, type: outbound)
2026-xx-xx 12:12:44.507 [DBG] CMGR: Connected to <ipv4i>:9108 (id: 11, type: outbound)
...
2026-xx-xx 12:12:44.608 [DBG] CMGR: Connected to <ipv4h>:9108 (id: 10, type: outbound)
2026-xx-xx 12:12:44.617 [DBG] CMGR: Connected to <ipv4j>:9108 (id: 13, type: outbound)
...
2026-xx-xx 12:12:46.694 [DBG] CMGR: Failed to connect to <ipv4g>:9108: dial tcp <ipv4g>:9108: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2026-xx-xx 12:12:46.696 [DBG] CMGR: Attempting to connect to <ipv4k>:9108 (id: 14, type: outbound)
2026-xx-xx 12:12:47.086 [DBG] CMGR: Failed to connect to [ipv6c]:9108: dial tcp [ipv6c]:9108: connectex: No connection could be made because the target machine actively refused it.
2026-xx-xx 12:12:47.087 [DBG] CMGR: Attempting to connect to <ipv4l>:9108 (id: 15, type: outbound)
2026-xx-xx 12:13:07.698 [DBG] CMGR: Failed to connect to <ipv4k>:9108: dial tcp <ipv4k>:9108: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2026-xx-xx 12:13:07.699 [DBG] CMGR: Attempting to connect to <ipv4m>:9108 (id: 16, type: outbound)
2026-xx-xx 12:13:08.091 [DBG] CMGR: Failed to connect to <ipv4l>:9108: dial tcp <ipv4l>:9108: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2026-xx-xx 12:13:08.091 [DBG] CMGR: Attempting to connect to <ipv4m>:9108 (id: 17, type: outbound)
2026-xx-xx 12:13:08.136 [DBG] CMGR: Connected to <ipv4m>:9108 (id: 17, type: outbound)
...
2026-xx-xx 12:13:09.967 [DBG] CMGR: Failed to connect to <ipv4m>:9108: dial tcp <ipv4m>:9108: connectex: No connection could be made because the target machine actively refused it.
2026-xx-xx 12:13:09.967 [DBG] CMGR: Attempting to connect to <ipv4n>:9108 (id: 18, type: outbound)
2026-xx-xx 12:13:10.018 [DBG] CMGR: Connected to <ipv4n>:9108 (id: 18, type: outbound)
...
2026-xx-xx 12:46:31.699 [DBG] CMGR: Disconnected from <ipv4d>:9108 (id: 6, type: outbound)
2026-xx-xx 12:46:31.724 [DBG] CMGR: Attempting to connect to <ipv4o>:9108 (id: 19, type: outbound)
2026-xx-xx 12:46:52.732 [DBG] CMGR: Failed to connect to <ipv4o>:9108: dial tcp <ipv4o>:9108: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2026-xx-xx 12:46:52.733 [DBG] CMGR: Attempting to connect to <ipv4p>:9108 (id: 20, type: outbound)
2026-xx-xx 12:46:52.808 [DBG] CMGR: Connected to <ipv4p>:9108 (id: 20, type: outbound)

Comment thread rpcadaptors.go
})
if !found {
break
outbounds := make([]*serverPeer, 0, 1)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can technically be simplified now since it's no longer possible to have duplicate addrs. It's already changing a lot in this PR though, so I decided to just convert it to keep the same semantics for now.

Comment thread internal/connmgr/connmanager.go
Comment thread internal/connmgr/connmanager.go Outdated
Comment thread internal/connmgr/connmanager.go Outdated
@davecgh davecgh force-pushed the connmgr_wrapped_conns branch from e3f13d7 to 3ced8cb Compare May 17, 2026 23:18
Comment thread internal/connmgr/connmanager.go Outdated
Comment thread internal/connmgr/connmanager.go Outdated
@davecgh davecgh force-pushed the connmgr_wrapped_conns branch 2 times, most recently from 26edc8d to 21a975e Compare May 18, 2026 19:28
@davecgh davecgh force-pushed the connmgr_wrapped_conns branch 2 times, most recently from 3a4b0b5 to 16aa5dd Compare May 19, 2026 19:08
Comment thread internal/connmgr/connmanager.go Outdated
Comment thread internal/connmgr/connmanager.go Outdated
Comment thread internal/connmgr/connmanager.go Outdated
Comment thread internal/connmgr/connmanager_test.go Outdated
@davecgh davecgh force-pushed the connmgr_wrapped_conns branch 2 times, most recently from 7b007af to 8dbfa0c Compare May 19, 2026 19:56
@davecgh davecgh force-pushed the connmgr_wrapped_conns branch from 8dbfa0c to ad8e6d9 Compare May 20, 2026 07:20
Comment thread internal/connmgr/connmanager_test.go Outdated
@davecgh davecgh force-pushed the connmgr_wrapped_conns branch 2 times, most recently from e59d872 to 40545b2 Compare May 20, 2026 14:12
Comment thread internal/connmgr/connmanager.go Outdated
davecgh added 4 commits May 20, 2026 20:06
The existing connection manager code was written well before contexts
were introduced.  Further, due to the old async model that has now been
converted to a synchronous model, it is based around connection requests
that have their state atomically updated asynchronously as various
things happen.

While it has undoubtedly worked well enough for over a decade, it has
always been a challenge to add new functionality to it and requires the
use of a lot of less than ideal and highly outdated techniques such as
polling for state changes.  It is also rather brittle in terms of
requiring output connections to be manually disconnected in the
connection manager after they've been closed to avoid things like
leaking goroutines and failing to update target outbound counts.

Moreover, it only tracks outgoing connections which ultimately forces a
lot of connection-related tasks to be split across different layers
instead of residing in the connection manager itself where they more
naturally belong.  Notably, that split, for all intents and purposes,
prevents implementing some desirable more advanced features such as
immediate connection shedding, different connection types, and listeners
tied to specific network types.

With the primary goal of addressing all of the aforementioned points and
providing a solid base to work on for adding new features moving
forward, this significantly reworks the connection manager to completely
get rid of the notion of exposed connection requests in favor of a new
custom connection type that wraps the underlying net.Conn.

The new wrapped connections automatically handle cleanup when closed and
have an associated connection type enum that allows easily
distinguishing inbound, outbound, and manual connections as well as
supporting new connection types in the future.

Another nice feature of the new wrapped connections is they provide
efficient access to concrete parsed address types which paves the way
for avoiding a lot of constant reparsing, repeated host/port splitting
and joining, and generally much more ergonomic immutable address types.

Since changing to wrapped connections basically required a rather
significant rewrite of large portions of the connection manager anyway,
this also takes the opportunity to improve several other aspects of the
connection manager in the process such as implementing full context
support, full tracking of all connection types by the manager itself,
much more robust semaphore-based automatic connection limiting, cleaner
persistent connection handling with independent limits, prevention of
multiple connections of any type to the same address:port, more useful
debug logging, and cleanly closing all connections at during shutdown.

It is also important to note that the following overall semantics have
been intentionally been changed versus the existing connection manager:

- A maximum of 8 persistent connections is now imposed and they no
  longer count toward the configured target number of automatic outbound
  peers to maintain
- Duplicate addresses (including port) are now rejected by the
  connection manager for all types (inbound, outbound, manual,
  persistent)
  - Note that inbound conns from the same IP will necessarily have
    different ports, so the same max IP limits apply in that case
- RPC 'node connect' for all connection attempts now:
  - Supports the RPC connection and server contexts
  - Properly handles duplicate address rejection including pending
    attempts
- RPC 'node connect' for non-persistent conn attempts now:
  - Waits for the connection attempt result before returning
  - Returns an error if the connection attempt fails
  - Cancels the connection attempt if the RPC connection is closed
    before it succeeds
- RPC 'node remove' now supports removing a pending connection by its
  persistent connection ID (since no peer ID exists before a valid
  connection is established)
- It is no longer possible for state transitions to allow things like
  duplicate addresses or failed cancellation
The max retry duration is currently an unexported global variable that
the tests override at init time.  At least one of the tests also
additionally overrides it for that specified test too.  While this
works, it is somewhat brittle and prevents the tests from being run in
parallel.

This improves the situation by making the max retry duration a field on
the connection manager instead of a global variable and adding a test
helper for creating a new connection manager that overrides it by
default.  Then any tests that need a different value can simply override
it on their local instance.

It also makes the tests parallel since they can no longer clobber one
another.
This adds tests to ensure closing a connection multiple times works as
intended.
This adds tests to ensure duplication connections are rejected for all
possible states.
davecgh added 4 commits May 20, 2026 20:06
This adds tests to ensure attempts to add more than the maximum allowed
number of persistent are rejected.
This adds tests to ensure the Disconnect method properly disconnects
pending and established connections for both non-persistent and
persistent connections.
This adds tests to ensure the Remove method properly disconnects and
removes pending and established connections for both non-persistent and
persistent connections.
This updates the connmgr package README.md to match the new design and
capabilities.
@davecgh davecgh force-pushed the connmgr_wrapped_conns branch from 40545b2 to 307eb44 Compare May 21, 2026 01:49
// Perform each sequence of acquires and releases as specified by the
// per semaphore tests.
for _, psTest := range test.perSemTests {
const timeout = 10 * time.Millisecond
Copy link
Copy Markdown
Member

@jholdstock jholdstock May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly 10ms is negligible, but I think these tests could be implemented with zero sleeping if synctest is used

Copy link
Copy Markdown
Member Author

@davecgh davecgh May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't messed with synctest much because most modules don't require a new enough version of Go and therefore can't use it. This is internal code now that does require a new enough version though, so it might make sense.

That said, this is intentionally testing context cancelation behavior with the semaphore too, so I'm not sure how synctest would interplay with that.

Also, there is a non-blocking variant added a couple of PRs from this one that overrides the value to 0 to ensure the canceled context both before and after it's called cause it fail the acquire that I'm not sure would play well with synctest.

numAcquires: 5,
numReleases: 1,
}, {
name: "cap 2 (4 acquired): acquire 2, release 5",
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cap 5

// Pause automatic outbound connections for a retry timeout after too
// many failed connection attempts. The network very likely has become
// temporarily unreachable.
if cm.failedAttempts.Load() >= maxFailedAttempts {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cm.failedAttempts is only used in this handler and thus it would be better as a local. It would also be a bit more robust to always increment it regardless of the failure instead of doing it in the dialer only under certain circumstances which is easier to forget about new cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Examine manual node disconnect for correct functionality on multiple peer types with the same IP.

2 participants