WORKAROUND: Disable PING/PONG on SSH connections, as it would just immediately drop the session #5727
Conversation
…sult in the terminal session being torn down. This is a bandaid solution until a proper fix can be implemented by somebody who actually understands how this code was meant to work.
The ping/pong requests in those functions are only called IF certain things are set in your environment.
I didn't need to do anything special in my configuration to trigger this bug, but I did admittedly turn on agentPing/agentPong in the configuration file. That seemed to help with making my "Desktop" tab work reliably across various proxies and larger-scale internet connections. The documentation doesn't say that turning these keep-alive packets on would make agentless SSH impossible to use; and if that hypothetically was the intended behavior, then the UI should remove SSH entirely whenever either of those options has been set in config.json. But it seems unlikely that the intention was to offer either Desktop or agentless SSH, but never both in the same instance of MeshCentral.

As is, as soon as you enable agentPing/agentPong, MeshCentral will mix unencrypted JSON into the same WebSocket that carries encrypted SSH traffic. The two are fundamentally incompatible, and mixing them causes all agentless SSH connections to drop.

You might very well be right, though, that "noping" should be set for this particular relay. But I have no idea how I would modify the MeshCentral source code so that it does this whenever creating a WebSocket that carries SSH traffic. Can you advise?
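To make concrete what I mean by "set noping for this particular relay": conceptually, I imagine a guard like the one below somewhere in the relay's keepalive path. All names here (`session`, `protocol`, `maybeSendPing`) are hypothetical placeholders of my own invention, not actual MeshCentral identifiers — this is only a sketch of the idea:

```javascript
// Hypothetical sketch: decide whether in-band JSON ping/pong frames are safe
// to inject into a relay session. For raw binary protocols such as SSH, any
// injected JSON would corrupt the stream framing, so pings must be suppressed.
// The session shape used here is an assumption, not MeshCentral's real API.
function pingAllowed(session) {
    // Protocols whose payload is an opaque binary stream; mixing JSON into
    // these breaks framing (e.g. the SSH packet-length field, RFC 4253).
    const rawBinaryProtocols = new Set(['ssh', 'rdp', 'vnc']);
    return !rawBinaryProtocols.has((session.protocol || '').toLowerCase());
}

// Example use inside a hypothetical keepalive timer:
function maybeSendPing(session, ws) {
    if (!pingAllowed(session)) return false;     // behaves like "noping"
    ws.send(JSON.stringify({ action: 'ping' })); // in-band JSON ping
    return true;
}
```

Whether the real code has a single choke point where such a check could live is exactly what I can't figure out.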
I'm confused. The websocket is encrypted with an SSL certificate, so why would the JSON be unencrypted? And all relaying done in the browser uses a separate websocket connection/file than the normal websocket which handles all the events, notifications, etc.
Maybe post a video of the problem, and provide info on the network configuration of admin / mesh server / agent, the network links in play, and any network security/filtering devices in the mix. Also note which OS you are coming from and going to. To troubleshoot bugs you have to set up a repeatable test that has a failure.
[ TL;DR: Try turning on agent ping/pong messages in config.json and then open an agent-less SSH connection. See if it gets torn down and disconnects when the first ping is sent. ]
You and I are in full agreement here. I have been staring at the code and instrumenting it for two days now, and it is just really puzzling. I wish I had a good way to share a virtualized cluster of machines with you. It's so difficult to come up with a good reproducible test case when dealing with distributed systems; and that's doubly so, as I am not quite clear on the low-level implementation details of all the components. I don't even know exactly how data flows in an agent-less configuration.

Let me try walking you through things as best as I can, and you can tell me if any of it makes sense or if you need more details from me. I configured MeshCentral with a configuration file that looks essentially like this (stripped of the bits dealing with user authentication):
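Roughly along these lines — this is a reconstruction from memory using the option names discussed in this thread, so the hostname and exact values are illustrative, and I've omitted the authentication sections:

```json
{
  "settings": {
    "cert": "mesh.example.com",
    "port": 480,
    "tlsOffload": "10.0.0.80",
    "agentPort": 443,
    "agentPortTls": true,
    "redirPort": 80,
    "agentPing": 60,
    "agentPong": 60
  }
}
```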
There is an NGINX reverse proxy running on 10.0.0.80 that encrypts the WebSocket between MeshCentral and the user's web browser. It connects to MeshCentral over HTTP (not HTTPS!) on port 480 when establishing WebSockets or downloading static resources. Agents communicate with wss:// on port 443. Port 80 is used for redirection to the canonical URL. Except for the TLS offloading, this is pretty standard for how MeshCentral would work out of the box. I am 95% certain this problem also happens without the reverse proxy, as that's how I originally noticed it. But configuring TLS offloading should make debugging easier, so I kept it enabled.

If you want to just run a quick-n-dirty test, I think you can omit all of Port, AgentPort, AgentPortTls, and TLSOffLoad, and you should still be able to see the bug. It'll just be harder to debug, as the relevant payload is then encrypted, obscuring the root cause. You can reduce all the ping/pong times in config.json to something like 5s to make things easier to debug; the error triggers much faster that way.

BTW, this is a copy of MeshCentral that I installed on a dedicated (virtual) machine running Ubuntu. I installed it using "npm". I then connect to MeshCentral from my Chrome browser and add a device group for agent-less local devices. In this device group, I add a Linux/SSH device. I open the Terminal tab for this device and click the SSH Connect button. Up to this point, everything works just fine. I get a terminal with an SSH session that works as expected (at least for a couple of seconds).

Also, just for the record, I checked that all the other usual features work. For instance, if I have a device that uses MeshAgent, I can successfully connect to it using both Desktop and Terminal. Things obviously work fine when using agents, but they seem to fail for me when using agent-less connections. Everything appears to be fine until a PING request is generated by MeshCentral.
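For completeness, the reverse-proxy side looks roughly like this. Hostname, upstream address, and timeout values are illustrative, not my exact production settings:

```nginx
# Illustrative NGINX block for TLS offloading in front of MeshCentral.
server {
    listen 443 ssl;
    server_name mesh.example.com;

    location / {
        proxy_pass http://127.0.0.1:480;   # MeshCentral's plain-HTTP port
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;    # WebSocket upgrade
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Without keepalive traffic, idle WebSockets die at this timeout:
        proxy_read_timeout 3600s;
    }
}
```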
At that point, the connection gets torn down, and I see an error message from MeshCentral (printed to the console; this required starting MeshCentral in debug mode, I think) that the SSH packet is malformed and doesn't have a valid length field. That agrees with RFC 4253, which says that each SSH data packet starts with its length. I tried printing out the chunks that I see traveling over this WebSocket, and it is immediately obvious that most of the data is binary, until there suddenly is an unencrypted plain-text JSON fragment.

When I instrument the source code, I see that obj.ws in meshrelay.js passes what looks like SSH-encrypted data back and forth. But it is possible that I am misreading this and it actually is TLS-encrypted; it's a little hard to tell those two apart by mere visual inspection of the console.log() statements that I sprinkled into the code. I am also not 100% certain that this is the connection between the browser and the central server. As the central server is acting as a relay here, this could also be the connection from the central server to the device that I log into over SSH, or an internal connection between parts of the system that I don't fully understand. And I suspect I am not the only one who is confused by the different sockets involved in this relaying process.

It is fairly obvious that MeshCentral shouldn't mix JSON into an SSH data stream. But it is not clear to me that this is even the correct socket to send ping/pong packets on. Quite possibly, it's trying to do the right thing and just using the wrong socket? On the other hand, if this actually is the WebSocket between the browser and MeshCentral, then ideally we should figure out how to send ping/pong-style keepalive packets that are compatible with the rest of the payload. I can see that there is code that switches the wire protocol between an initial authentication state and a later state when the connection is fully established.
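As an aside, the instrumentation I sprinkled into the relay amounts to a rough classifier like the one below. This is purely my own debugging helper, not MeshCentral code, and the sanity bounds are arbitrary. Note also that with modern SSH ciphers the length field itself is encrypted, so this only reliably classifies chunks at cleartext packet boundaries; the stray JSON, however, stands out unambiguously:

```javascript
// Debugging helper: guess whether a relay chunk is SSH binary-packet data or
// a stray plain-text JSON control frame. Per RFC 4253, every SSH binary
// packet begins with a 4-byte big-endian packet_length field.
function classifyChunk(buf) {
    if (buf.length === 0) return 'empty';
    // JSON control frames start with '{' (0x7b) and decode as UTF-8.
    if (buf[0] === 0x7b) {
        try { JSON.parse(buf.toString('utf8')); return 'json'; }
        catch (e) { /* fall through: binary data that merely starts with { */ }
    }
    if (buf.length < 4) return 'unknown';
    const packetLength = buf.readUInt32BE(0);
    // Sanity bounds: RFC 4253 requires implementations to handle packets of
    // up to 35000 bytes; anything far outside that range is likely not a
    // packet boundary (or the length field itself is encrypted).
    if (packetLength >= 5 && packetLength <= 35000) return 'maybe-ssh-packet';
    return 'unknown';
}
```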
Maybe that protocol-switching logic was added later and is now incompatible with ping/pong messages on the same socket? Reverse proxies are commonly used in these types of settings, though, and they tend to work more reliably if ping/pong-style keepalive data can be passed back and forth every so often. So it would be nice to fix this correctly. For now, I cranked up the timeouts on the reverse proxy to ridiculously high values and then disabled ping/pong. That fixes my immediate problem, but it's not very satisfactory. In particular, it sucks because ping/pong works really well for connections that go to actual MeshAgents.
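One direction that might reconcile the two: WebSocket itself has protocol-level ping/pong control frames (RFC 6455) that are handled below the application payload, so they can never get mixed into the relayed SSH bytes; Node's widely used `ws` library exposes them as `ws.ping()`. A sketch of what I have in mind — the keepalive wrapper and its wiring are my invention, not MeshCentral code:

```javascript
// Sketch: keepalive via WebSocket protocol-level ping frames instead of
// in-band JSON messages. Control frames live below the application payload
// (RFC 6455), so they cannot corrupt a relayed binary stream such as SSH.
// The socket only needs a ws-library-style API: ping(), readyState, OPEN.
function makeKeepalive(socket, intervalMs) {
    let timer = null;
    return {
        // One keepalive beat; exposed separately so it can be unit-tested.
        tick() {
            if (socket.readyState === socket.OPEN) socket.ping();
        },
        start() { if (!timer) timer = setInterval(() => this.tick(), intervalMs); },
        stop()  { if (timer) { clearInterval(timer); timer = null; } },
    };
}
```

With a real `ws` connection this would be `const ka = makeKeepalive(ws, 60000); ka.start();`, and the peer (or an intermediate proxy) answers the ping frames automatically without the relayed payload ever seeing them. Whether MeshCentral's relay path can adopt this, I can't say.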
This is not a proper and complete fix! It's merely a workaround for SSH connections dropping immediately whenever PING/PONG is enabled in the configuration file.
Feel free to completely disregard this pull request if you have a better and more complete solution. I spent a good while trying to fix things properly, but I don't understand how all the parts are supposed to play together. So this was the best I could come up with, and it does provide some relief from the buggy behavior that I observed.