
connections: put conns safely #319

Closed · wants to merge 2 commits

Conversation

@liamstask (Contributor) commented Aug 31, 2020

❓ What kind of change does this PR introduce?

  • 🐞 bug fix
  • 🐣 feature
  • 📋 docs update
  • 📋 tests/coverage improvement
  • 📋 refactoring
  • 💥 other

📋 What is the related issue number (starting with #)?

Fixes #317

❓ What is the current behavior? (You can also link to an open issue here)

Seeing "dictionary changed size during iteration" errors while iterating over the map of connections within the selector when a put() call overlaps an expire() call.

❓ What is the new behavior (if this is a feature change)?

instead of registering sockets with the selector from other threads, ensure they're registered from within the thread running the selector.

📋 Other information:

This introduces https://pypi.org/project/backports.socketpair in order to provide a pure-Python implementation of socket.socketpair() for platforms that don't have it (notably Windows before Python 3.5).
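For reference, a minimal sketch of how such a backport is typically wired in, based on the package's documented usage rather than on code from this PR; importing the module patches socket.socketpair into the stdlib socket module on platforms that lack it:

import socket

# If the platform already provides socket.socketpair(), the backport
# is unnecessary; otherwise importing it patches the socket module.
try:
    socket.socketpair
except AttributeError:
    import backports.socketpair  # noqa: F401

# Both ends behave like ordinary connected sockets.
reader, writer = socket.socketpair()
writer.send(b'\x00')
assert reader.recv(1) == b'\x00'
reader.close()
writer.close()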

We could possibly use some of the enhanced tests from #318 to validate this change.

📋 Checklist:

  • I think the code is well written
  • I wrote good commit messages
  • I have squashed related commits together after the changes have been approved
  • Unit tests for the changes exist
  • Integration tests for the changes exist (if applicable)
  • I used the same coding conventions as the rest of the project
  • The new code doesn't generate linter offenses
  • Documentation reflects the changes
  • The PR relates to only one subject with a clear title
    and description in grammatically correct, complete sentences


@@ -137,27 +155,15 @@ def get_conn(self): # noqa: C901 # FIXME
                in self._selector.select(timeout=0.01)
            ]
        except OSError:
            # Mark any connection which no longer appears valid
            invalid_entries = []
@liamstask (Author) commented Aug 31, 2020


factoring this out in order to address the noqa FIXME above

@@ -1804,6 +1804,7 @@ def serve(self):
                    traceback=True,
                )

        self._connections.close()
@liamstask (Author):

moving this call inside the thread that is running the selector to ensure the selector is only ever accessed from a single thread

Member:

This probably needs a code comment

lgtm-com bot commented Sep 1, 2020

This pull request introduces 4 alerts when merging 6600715 into d2f28f2 - view on LGTM.com

new alerts:

  • 3 for Unused import
  • 1 for Module is imported with 'import' and 'import from'


def _get_selector_conns(self):
    # helper to retrieve conns registered with the selector
    for _, (_, sock_fd, _, conn) in self._selector.get_map().items():
Member:

Wouldn't it be safer to copy the get_map return value?

@liamstask (Author):

the strategy in this PR is to ensure that only a single thread is ever touching the selector, so from a thread safety perspective it should not be necessary. do you have another safety case in mind?

Member:

Because this is in a separate method, the callers would need to account for the implementation details. Hence this is unsafe and will probably introduce future bugs if somebody attempts to refactor the calling code (turn two loops into one) without having the context.

@liamstask (Author):

given that this is marked as a 'private' helper, i'd just as soon indicate in the docstring that this is returning references and callers should not store them, etc. if you feel strongly, it should be ok to copy them, but this is mainly intended just to provide a filter on the iteration of the map items.

Member:

I understand this, I'm mostly thinking about preventing future regressions caused by the human factor.

@cyraxjoe (Contributor):

@webknjaz about the idea of copying the get_map return value: the internal map is not a dictionary; it is generated by making __getitem__ calls, and there is no copy() method (the copy module would basically iterate over the content anyway). The class of the object that is returned as a map is https://github.com/python/cpython/blob/master/Lib/selectors.py#L61-L78; if you look at the implementation of the Mapping abstract class, you'll see that get_map().items() ends up in https://github.com/python/cpython/blob/master/Lib/_collections_abc.py#L764, and it works by iterating the keys and making a __getitem__ call for each one.

The alternative is to use the private dictionaries of that class, but I think it would be a very bad idea to break that interface contract. That's why I preferred to manage an external set of keys instead of having the overhead of iterating all the keys to make the copy (register/unregister returns the key) and the additional method calls, plus a lock to make sure we have a copy that can be iterated over in a threaded context.
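To illustrate the point being discussed, a minimal sketch (not code from either PR): taking a snapshot of the selector map means materializing the live Mapping view, which performs exactly that per-key __getitem__ iteration.

import selectors
import socket

sel = selectors.DefaultSelector()
a, b = socket.socketpair()
sel.register(a, selectors.EVENT_READ, data='conn-a')

# get_map() returns a live view (selectors._SelectorMapping) with no
# copy() method; list() iterates its keys and calls __getitem__ for
# each one, yielding SelectorKey namedtuples.
snapshot = list(sel.get_map().values())

for key in snapshot:
    # SelectorKey fields: fileobj, fd, events, data
    print(key.fd, key.data)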

@liamstask (Author):

i agree that if the additional locking & copying are acceptable, the approach in #318 from @cyraxjoe makes sense to me. this PR is predicated on the assumption that those are undesirable.

given the request to copy these to account for developer error in the future, it sounds like that might be preferable? either way is ok by me.

@webknjaz (Member) commented Sep 1, 2020

@cyraxjoe I'd like to hear your take on this PR

@liamstask oh, and looks like this needs rebasing, I've added a few ignores for those cryptography warnings in master

@liamstask (Author):

rebased

@liamstask (Author):

Looks like some sporadic failures on Windows at the moment - will need to look more closely to see what's going on.

@webknjaz (Member) commented Sep 1, 2020

Looks like test_keepalive_conn_management became flakier.

@cyraxjoe (Contributor) left a comment:

I have the following issues with this implementation:

  • Adding a new dependency for not a good enough reason (internal communication, within the same instance).
  • Mixing the file descriptors to be selected just for the purpose of internal communication (why not just a queue.Queue? Because of the overhead of qsize() or exception handling?).
  • Delaying the addition of keep-alive connections from the worker until get_conn is called; in effect we lose one "tick" that could be used to handle an event in the socket.
  • By delaying the addition of the connections we introduce a side effect in get_conn: it doesn't really just get you a connection, it's more like updating the internal state and then giving you a connection.
  • To manage the new additional communication, the class now has to keep a connection counter (and make sure we keep this in mind for the future), a new pair of sockets (one of them added to the internal fds to be selected), a lock, and a new deque, all of them loosely coupled in the implementation, which is prone to errors when the next person comes along. I would advise factoring those out into another class, in case you really are convinced by this approach.
  • We open the gate on the idea of using something in the very core of cheroot for message passing; later someone could come along and add an additional special message to be included and then carefully excluded, because we are no longer sure if this is a real connection or internal communication.

@liamstask (Author):

@cyraxjoe thanks for taking a look.

Adding a new dependency for not a good enough reason (internal communication, within the same instance).

given that this is a backport, it seems not too controversial, but i'll defer to the project maintainers.

Mixing the file descriptors to be selected just for the purpose of internal communication (why not just a queue.Queue?)

a queue is not useful in this case because there's no execution context to wait on it - we're already waiting on the selector, so sending a message on a control socket is a standard way to wake it up.

Delaying the addition of keep-alive connections from the worker until get_conn is called; in effect we lose one "tick" that could be used to handle an event in the socket.
By delaying the addition of the connections we introduce a side effect in get_conn: it doesn't really just get you a connection, it's more like updating the internal state and then giving you a connection.

i agree, but also think the fact that the server is driven by a tick() is undesirable in the first place - i've removed this in #311 in favor of processing sockets as they become active in the selector rather than needlessly polling via tick(). get_conn() is removed altogether in this strategy.

To manage the new additional communication, the class now has to keep a connection counter (and make sure we keep this in mind for the future), a new pair of sockets (one of them added to the internal fds to be selected), a lock, and a new deque, all of them loosely coupled in the implementation, which is prone to errors when the next person comes along. I would advise factoring those out into another class, in case you really are convinced by this approach.

definitely open to suggestions here - we could conceivably break out the concept of the control socket to a separate class, which would include both control sockets and the lock.

We open the gate on the idea of using something in the very core of cheroot for message passing; later someone could come along and add an additional special message to be included and then carefully excluded, because we are no longer sure if this is a real connection or internal communication.

i'm not sure i understand this point - we always know whether it's internal or external comms based on the socket it arrived on, same as we distinguish the server socket from other sockets today.
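To make the mechanism under discussion concrete, here is a minimal, self-contained sketch of the control-socket wake-up pattern (illustrative names only; this is not the PR's actual code): worker threads queue the connection and write one byte to a control socket, and the single selector thread drains the queue when that socket becomes readable.

import selectors
import socket
import threading

sel = selectors.DefaultSelector()

# One end stays registered with the selector; the other is written to
# by worker threads to interrupt a blocking select() call.
wakeup_recv, wakeup_send = socket.socketpair()
wakeup_recv.setblocking(False)
sel.register(wakeup_recv, selectors.EVENT_READ, data='control')

pending = []                     # connections queued by worker threads
pending_lock = threading.Lock()

def put(conn):
    """Called from worker threads: queue conn, then wake the selector."""
    with pending_lock:
        pending.append(conn)
    wakeup_send.send(b'\x00')

def run_selector_loop():
    """The single thread that is allowed to touch the selector."""
    while True:
        for key, _events in sel.select():
            if key.data == 'control':
                wakeup_recv.recv(512)        # drain wake-up bytes
                with pending_lock:
                    conns = pending[:]
                    del pending[:]
                for conn in conns:           # register from this thread
                    sel.register(conn, selectors.EVENT_READ, data=conn)
            else:
                pass  # handle a ready client connection here

Because registration happens inside run_selector_loop(), the selector map is only ever mutated and iterated from one thread, which is the invariant this PR is after.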

@cyraxjoe (Contributor) commented Sep 2, 2020

@liamstask addressing some of your comments:

Mixing the file descriptors to be selected just for the purpose of internal communication (why not just a queue.Queue?)

a queue is not useful in this case because there's no execution context to wait on it - we're already waiting on the selector, so sending a message on a control socket is a standard way to wake it up.

From my perspective, the instance of the ConnectionManager class is the shared execution context. It controls all the calls that it receives from any thread and shares the state in self.

For example (a sketch follows the list):

  1. Add to a queue via self.expire_queue.put_nowait(conn) from the ConnectionManager.put method.
  2. Have a new method called by tick(), like self.connections.sync_keep_alive_conns(), before the get_conn() method call.
  3. sync_keep_alive_conns consumes and registers the connections.
  4. get_conn works without any additional verification.
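A minimal sketch of that alternative, using the hypothetical names from the steps above (this is an illustration of the proposal, not code from #318):

import queue
import selectors

class ConnectionManager:
    def __init__(self, selector):
        self._selector = selector
        self.expire_queue = queue.Queue()

    def put(self, conn):
        # Step 1: worker threads only touch the queue, never the selector.
        self.expire_queue.put_nowait(conn)

    def sync_keep_alive_conns(self):
        # Steps 2-3: called by tick() in the selector thread, before
        # get_conn(); drains the queue and registers the connections.
        while True:
            try:
                conn = self.expire_queue.get_nowait()
            except queue.Empty:
                break
            self._selector.register(conn, selectors.EVENT_READ, data=conn)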

Delaying the addition of keep-alive connections from the worker until get_conn is called; in effect we lose one "tick" that could be used to handle an event in the socket.
By delaying the addition of the connections we introduce a side effect in get_conn: it doesn't really just get you a connection, it's more like updating the internal state and then giving you a connection.

i agree, but also think the fact that the server is driven by a tick() is undesirable in the first place - i've removed this in #311 in favor of processing sockets as they become active in the selector rather than needlessly polling via tick(). get_conn() is removed altogether in this strategy.

This is an architectural decision, not a matter of handling connections. It does change the fundamentals of how the server works (also make sure you update the docstrings that refer to the workings of the server; the one in server.py is already outdated given the recent changes). Before considering PR #311 we should have as a priority to stabilize cheroot (currently we have a broken cheroot that emits error messages at the first connection it receives), to get better test coverage of connections.py and server.py (especially of the error-handling paths), and to have some sort of basic performance benchmark to compare, and only then consider the architecture of Server in conjunction with the ConnectionManager. I'm not in favor of the current architecture, I'm in favor of stability, especially in the very core of cheroot.

We open the gate on the idea of using something in the very core of cheroot for message passing; later someone could come along and add an additional special message to be included and then carefully excluded, because we are no longer sure if this is a real connection or internal communication.

i'm not sure i understand this point - we always know whether it's internal or external comms based on the socket it arrived on, same as we distinguish the server socket from other sockets today.

I mean opening up the conceptual pattern that is laid out. I'm not saying that it is technically backwards or that you are not using the APIs as they should be used; I'm saying that, from my perspective, they are not required in this case. It seems like overkill and mixes up the conceptual structure of what we are selecting on. The _selector property becomes a message bus, not a client/server connection manager.

We are no longer only selecting on sockets dedicated to managing the server and client connections, but also passing messages between the server and worker threads via the ConnectionManager methods, and it isn't even used outside the same instance. If it were used from a loosely integrated object running in another thread (like a plugin), it might make more sense.

@liamstask (Author):

From my perspective, the instance of the ConnectionManager class is the shared execution context. It controls all the calls that it receives from any thread and shares the state in self.

I'm referring more specifically to a thread of execution - there is one, and it is responsible for waiting on the selector & processing events as they occur.

Add to a queue via self.expire_queue.put_nowait(conn) from the ConnectionManager.put method.
Have a new method called by tick(), like self.connections.sync_keep_alive_conns(), before the get_conn() method call.
sync_keep_alive_conns consumes and registers the connections.
get_conn works without any additional verification.

This is undesirable because the next tick() call is already delayed by _selector.select(), which blocks. The control message is required in order to wake up the select call and reduce this delay (and to ensure that the timeout specified to select() does not limit performance, as was the case in #305).

This is an architectural decision, not a matter of handling connections. It does change the fundamentals of how the server works (also make sure you update the docstrings that refer to the workings of the server; the one in server.py is already outdated given the recent changes). Before considering PR #311 we should have as a priority to stabilize cheroot (currently we have a broken cheroot that emits error messages at the first connection it receives), to get better test coverage of connections.py and server.py (especially of the error-handling paths), and to have some sort of basic performance benchmark to compare, and only then consider the architecture of Server in conjunction with the ConnectionManager.

I would propose that the control flow for connection handling is an architectural decision - it informs whether the system is polling or event driven.

I'm not in favor of the current architecture, I'm in favor of stability, specially in the very core of cheroot.

These are separate concerns - the proposed change is not inherently unstable. I agree stability should be improved.

I mean opening up the conceptual pattern that is laid out. I'm not saying that it is technically backwards or that you are not using the APIs as they should be used; I'm saying that, from my perspective, they are not required in this case. It seems like overkill and mixes up the conceptual structure of what we are selecting on. The _selector property becomes a message bus, not a client/server connection manager.

That is a fair perspective; I just think it doesn't account for the fact that this is a well-understood pattern, as evidenced by the presence of APIs like eventfd and signalfd, which address this specific issue. Those happen to be Linux-specific and as such are not cross-platform, but this takes a similar approach.

Again, if the maintainers prefer another approach, that's totally fine - I think this one is preferable since it reduces contention (i.e. within put(), the only part of the API that is called from other threads) and means data is not copied unnecessarily.

@cyraxjoe (Contributor) commented Sep 2, 2020

I see... thank you for your feedback @liamstask, OK then! I suppose it's up to @webknjaz or @jaraco; it's been a while since I've been actively involved with the project, so I believe I no longer have enough street cred to make those decisions. It has to be made by one of those two.

My involvement in all of this was just due to the recent unstable releases. 😅

@liamstask (Author):

closing this out in favor of a simpler approach

liamstask closed this Nov 21, 2020
Linked issue: dictionary changed size during iteration (#317)
3 participants