KEP-987: Refactor node to re-use sessions. remove raft. #228
9ab3ab0
to
841fb79
Compare
First quick pass.
I don't see how stale sessions are cleaned up.
Holding a strong shared pointer will not allow asio to destroy the session and clean up.
node/node.cpp
Outdated
this->weak_priv_protobuf_handler =
    [weak_self = weak_from_this()](auto msg, auto session)
    {
        auto strong_self = weak_self.lock();
Why a weak ptr?
To break the cyclical reference between session and node. It could be a shared pointer if node were changed to hold weak pointers instead.
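For illustration, here is a minimal sketch of that pattern. The `session` and `node` classes below are simplified, hypothetical stand-ins (not the PR's actual types), and `attach`, `deliver`, and `received` are names invented for the example. The point is only that node owns the session, while the handler stored in the session observes node through a weak pointer, so no ownership cycle forms.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a session that stores a message handler.
class session
{
public:
    using handler_t = std::function<void(const std::string&)>;

    void set_handler(handler_t h) { this->handler = std::move(h); }

    // Pretend a message arrived on the wire.
    void deliver(const std::string& msg)
    {
        if (this->handler)
        {
            this->handler(msg);
        }
    }

private:
    handler_t handler;
};

// Hypothetical stand-in for node. It must be owned by a shared_ptr.
class node : public std::enable_shared_from_this<node>
{
public:
    void attach(std::shared_ptr<session> s)
    {
        // Capture a weak pointer: the handler lives inside the session,
        // and the session lives inside node, so a strong capture here
        // would keep both alive forever.
        s->set_handler(
            [weak_self = this->weak_from_this()](const std::string& msg)
            {
                if (auto strong_self = weak_self.lock())
                {
                    strong_self->received.push_back(msg);
                }
                // Otherwise node is already gone and the message is dropped.
            });
        this->sessions.push_back(std::move(s));
    }

    std::vector<std::string> received;

private:
    std::vector<std::shared_ptr<session>> sessions;
};
```

With a strong capture instead, dropping the last external pointer to node would not destroy it, because the handler inside its own session would still hold a reference.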
Unless you track the session in node, you will always create a new connection when pbft could have used an existing one.
I don't understand this? We do track the session in node.
    this->chaos->reschedule_message(std::bind(&node::send_message_str, shared_from_this(), std::move(ep_copy), std::move(msg), close_session));
    return;
}
std::lock_guard<std::mutex> lock(this->session_map_mutex);
I'd look at a shared lock which allows for reader/writer access.
My thinking is that the lock is never held for long (only looking up a session, or at worst initiating an async operation), so I didn't want to prematurely optimize.
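For reference, the reader/writer variant being suggested could look roughly like the sketch below. It uses `std::shared_mutex` from C++17; the `session_table` class, its method names, and the string endpoint key are assumptions made for the example, not code from this PR. Whether it beats a plain `std::mutex` for such short critical sections is something only a profiler can settle.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

struct session {};  // hypothetical placeholder for the real session type

class session_table
{
public:
    // Lookups take a shared lock, so many readers can proceed at once.
    std::shared_ptr<session> find(const std::string& endpoint) const
    {
        std::shared_lock<std::shared_mutex> lock(this->mutex);
        auto it = this->sessions.find(endpoint);
        if (it == this->sessions.end())
        {
            return nullptr;
        }
        return it->second;
    }

    // Mutations take an exclusive lock, which waits for readers to finish.
    void insert_or_replace(const std::string& endpoint, std::shared_ptr<session> s)
    {
        std::unique_lock<std::shared_mutex> lock(this->mutex);
        this->sessions[endpoint] = std::move(s);
    }

private:
    mutable std::shared_mutex mutex;
    std::map<std::string, std::shared_ptr<session>> sessions;
};
```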
The answer to "how do sessions get cleaned up" is lazily: when node tries to send a message on a session and discovers that it has closed, it replaces the session, which removes the last shared pointer. The case where it doesn't get cleaned up is where no message is ever sent to that endpoint (say it was a one-off client). This still doesn't leak a file descriptor (the socket times out and closes), but it does leak some memory.

Changing it to a weak pointer makes it a bit cleaner, but it doesn't solve the issue: the weak pointer itself won't be removed from node's session map in this case, and it adds overhead on every message send. I think the solution with either kind of pointer is to regularly sweep the map and remove dead sessions; my intent was to delay implementing that because this PR is too large already.

(The other comments I agree with, and I'll put a commit for them in later; I just wanted to unblock this conversation promptly.)
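A rough sketch of what such a sweep might look like, under stated assumptions: `session_registry`, `add`, `sweep`, and `size` are hypothetical names, and `is_open()` stands in for however the real session reports that its socket has closed. Scheduling the sweep (for example from a periodic timer) is left out.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical session: only tracks whether it is still open.
struct session
{
    bool open = true;
    bool is_open() const { return this->open; }
};

class session_registry
{
public:
    void add(const std::string& endpoint, std::shared_ptr<session> s)
    {
        std::lock_guard<std::mutex> lock(this->mutex);
        this->sessions[endpoint] = std::move(s);
    }

    // Erase every entry whose session is missing or closed.
    // Returns the number of entries removed.
    std::size_t sweep()
    {
        std::lock_guard<std::mutex> lock(this->mutex);
        std::size_t removed = 0;
        for (auto it = this->sessions.begin(); it != this->sessions.end();)
        {
            if (!it->second || !it->second->is_open())
            {
                it = this->sessions.erase(it);  // erase returns the next iterator
                ++removed;
            }
            else
            {
                ++it;
            }
        }
        return removed;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(this->mutex);
        return this->sessions.size();
    }

private:
    mutable std::mutex mutex;
    std::map<std::string, std::shared_ptr<session>> sessions;
};
```

Once the map entry is erased, the session is destroyed as soon as no other owner holds it, which covers the one-off client case described above.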
How? The member variable containing the websocket will never be destroyed, because session's destructor will never be called or until you replace the entry. A timeout does not close the FD as far as I understand networking. I could be wrong. Have you tested this? Does Beast do this for you? Edit: OK I see you are closing on error in the completion handlers.
I imagine any overhead would be nothing next to what it takes to sign the data or invoke any of the locks we are using. It's premature to worry about this and I'd rather let a profiler guide us instead.
This is how subscription manager deals with it.
Yeah, I'll try to finish my review ASAP.
There are quite a few indentation code style errors when lambdas are being used as function args.
I see that you've removed the ability to auto-close the session after a message is sent. I was wondering what the rationale for that is?
Also, in the case of an error on session::write, you re-queue the message then close the socket. Is there any intention to try and restart communication? If not, what's the point in re-pushing the data (apart from logging a warning when the session is destroyed)?
The sender of the message won't (and shouldn't) know if the message is sent over an existing session or a new one, and in the former case closing the session would potentially interfere with other messages. Ideally I'd like to have sessions be contained within node, so other stuff doesn't have to reason about them.
Yes, my intent is for the session to be re-used with a new connection automatically.
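To make the "sender shouldn't know" point concrete, here is a small sketch of a send path that hides session reuse inside node. Everything in it is hypothetical and simplified: this `node`, `session`, `peek`, and the string endpoint key are illustration only, and real sends would be asynchronous. It shows reuse of an open session and lazy replacement of a closed one.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical session that just records what was sent on it.
struct session
{
    bool open = true;
    std::vector<std::string> sent;

    bool is_open() const { return this->open; }
    void send(std::string msg) { this->sent.push_back(std::move(msg)); }
};

class node
{
public:
    // The caller names an endpoint and never sees the session.
    void send_message(const std::string& endpoint, std::string msg)
    {
        std::shared_ptr<session> target;
        {
            std::lock_guard<std::mutex> lock(this->session_map_mutex);
            auto& slot = this->sessions[endpoint];
            if (!slot || !slot->is_open())
            {
                // Lazily replace a missing or closed session. Overwriting
                // the slot drops the map's reference to the old one.
                slot = std::make_shared<session>();
            }
            target = slot;
        }
        // Send outside the lock so the map is not held during I/O.
        target->send(std::move(msg));
    }

    // Test helper only: look at the current session for an endpoint.
    std::shared_ptr<session> peek(const std::string& endpoint)
    {
        std::lock_guard<std::mutex> lock(this->session_map_mutex);
        auto it = this->sessions.find(endpoint);
        if (it == this->sessions.end())
        {
            return nullptr;
        }
        return it->second;
    }

private:
    std::mutex session_map_mutex;
    std::map<std::string, std::shared_ptr<session>> sessions;
};
```

Because the caller cannot tell whether a session was reused, a per-message "close after send" flag would risk closing a session that other messages still depend on.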
node/session.hpp
Outdated
@@ -71,7 +71,7 @@ namespace bzn
         std::list<std::shared_ptr<bzn::encoded_message>> write_queue;

         bzn::protobuf_handler proto_handler;
-        bzn::session_death_handler death_handler;
+        bzn::session_shutdown_handler death_handler;
Sorry, I should have also mentioned renaming the member variable as well.
c0643b7
to
c9ade56
Compare
c9ade56
to
951a3b9
Compare
You may want to let Monty know that subscriptions will close after 5 min of inactivity. He will need to reconnect or periodically call status or something else to keep the connection alive.
The interesting files are
swarm.cpp
node.cpp
session.cpp
node_test_common.hpp
node_test.cpp
session_test.cpp
pretty much everything else is just collateral damage of the interface changes. I also want to make send_message operate by uuid instead of endpoint, but that requires some design thinking and this was too large already.