
WORKAROUND: Disable PING/PONG on SSH connections, as it would just immediately drop the session #5727

Closed
wants to merge 1 commit into from

Conversation

gutschke

This is not a proper and complete fix! It's merely a work-around for SSH connections dropping immediately whenever
PING/PONG is enabled in the configuration file.

Feel free to disregard this pull request entirely if you have a better and more complete solution. I spent a good
while trying to fix things properly, but I don't understand how all the parts are supposed to play together, so this was
the best I could come up with. It does provide some relief from the buggy behavior that I observed.

…sult in

the terminal session being torn down. This is a bandaid solution until a proper
fix can be implemented by somebody who actually understands how this code was
meant to work.
@si458
Collaborator

si458 commented Jan 24, 2024

the ping/pong requests in those functions are only sent IF certain things are set in your environment,
i.e. you have it set in the HTML request headers AND you have agentping or agentpong set in your config.json
Line 1258

        // Setup the agent PING/PONG timers unless requested not to
        if (obj.req.query.noping != 1) {
            if ((typeof parent.parent.args.agentping == 'number') && (obj.pingtimer == null)) { obj.pingtimer = setInterval(sendPing, parent.parent.args.agentping * 1000); }
            else if ((typeof parent.parent.args.agentpong == 'number') && (obj.pongtimer == null)) { obj.pongtimer = setInterval(sendPong, parent.parent.args.agentpong * 1000); }
        }
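If that query-string check is indeed what gates the timers, a client could in principle opt a relay connection out of PING/PONG by appending noping=1 to the relay WebSocket URL. A minimal sketch (the helper below is hypothetical; the actual relay URL construction lives elsewhere in MeshCentral):

```javascript
// Hypothetical sketch: opt a relay WebSocket out of the agent
// PING/PONG timers. The server-side check quoted above tests
// obj.req.query.noping != 1, so adding noping=1 skips the timers.
function addNoPing(relayUrl) {
    const url = new URL(relayUrl);
    url.searchParams.set('noping', '1');
    return url.toString();
}

console.log(addNoPing('wss://example.com/meshrelay.ashx?id=abc'));
// wss://example.com/meshrelay.ashx?id=abc&noping=1
```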

@si458 si458 closed this Jan 24, 2024
@gutschke
Author

I didn't need to do anything special in my HTML configuration to trigger this bug, but I did admittedly turn on agentping/pong in the configuration file. That seemed to help with making my "Desktop" tab work reliably across various proxies and larger-scale internet connections.

The documentation doesn't say that turning these keep-alive packets on would make agentless SSH impossible to use; and if that hypothetically were the intended behavior, then the UI should remove SSH entirely whenever either of those options has been set in config.json. But it seems unlikely that the intention was to offer either Desktop or agentless SSH, but never both in the same instance of MeshCentral.

As it stands, as soon as you enable agentping/pong, MeshCentral mixes unencrypted JSON into the same WebSocket that carries encrypted SSH traffic. The two are fundamentally incompatible, and every agentless SSH connection drops as a result.

You might very well be right though, that "noping" should be set for this particular relay. But I have no idea how I would modify the MeshCentral source code so that it does this whenever creating a WebSocket that carries SSH traffic. Can you advise?

@si458
Collaborator

si458 commented Jan 24, 2024

I'm confused? The WebSocket is encrypted with an SSL certificate, so why would the JSON be unencrypted? And all relaying done in the browser uses a separate WebSocket connection/file from the normal WebSocket that handles all the events, notifications, etc.
I would have to do some testing, but can you share your config.json where you are having an issue with SSH?
I'm also assuming you are using the 'SSH Connect' button in the Terminal tab rather than the 'Connect' button in the web browser?

@silversword411
Contributor

Maybe post a video of the problem, and provide info on your network configuration of admin/mesh server/agent, the network links in play, and any network security/filtering devices in the mix. Also, what OS are you coming from and going to?

To troubleshoot bugs you have to set up a repeatable test that produces the failure.

@gutschke
Author

gutschke commented Jan 24, 2024

[ TL;DR: Try turning on agent ping/pong messages in config.json and then open an agent-less SSH connection. See if it gets torn down and disconnects when the first ping is sent. ]

I'm confused?

You and I are in full agreement here. I have been staring at the code and instrumenting it for two days now, and it is just really puzzling.

I wish I had a good way to share a virtualized cluster of machines with you. It's so difficult to come up with a good reproducible test case when dealing with distributed systems; and that's doubly so, as I am not quite clear on the low-level implementation details of all the components. I don't even know exactly how data flows in an agent-less configuration.

Let me try walking you through things as best as I can, and you can tell me if any of it makes sense or if you need more details from me.

I configured MeshCentral with a configuration file that looks essentially like this (stripped of the bits dealing with user authentication):

{
    "$schema": "https://raw.githubusercontent.com/Ylianst/MeshCentral/master/meshcentral-config-schema.json",
    "settings": {
        "Port": 480,
        "AgentPort": 443,
        "AgentPortTls": true,
        "AllowLoginToken": true,
        "AgentPing": 50,
        "AgentPong": 50,
        "BrowserPing": 50,
        "BrowserPong": 50,
        "TLSOffLoad": "10.0.0.80"
    }
}

There is an NGINX reverse proxy running on 10.0.0.80 that encrypts the WebSocket between MeshCentral and the user's web browser. It connects to MeshCentral with HTTP (not HTTPS!) on port 480 when establishing WebSockets or downloading static resources.

Agents communicate with wss:// on port 443. Port 80 is used for redirection to the canonical URL. Except for TLS offloading, this is pretty standard for how MeshCentral would work out of the box.
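For context, the browser-facing side of that proxy looks roughly like the following (a sketch from memory, not my exact configuration; the server name, upstream address, and certificate paths are placeholders):

```nginx
# Sketch of the TLS-offloading reverse proxy (placeholder values).
server {
    listen 443 ssl;
    server_name mesh.example.com;                       # placeholder

    ssl_certificate     /etc/ssl/mesh.example.com.crt;  # placeholder
    ssl_certificate_key /etc/ssl/mesh.example.com.key;  # placeholder

    location / {
        proxy_pass http://meshcentral-host:480;  # MeshCentral's plain-HTTP port
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;      # pass WebSocket upgrades
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```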

I am 95% certain this problem also happens without the reverse proxy, as that's how I originally noticed it. But configuring TLS offloading should make debugging easier, so I kept it enabled. If you want to run a quick-n-dirty test, I think you can omit Port, AgentPort, AgentPortTls, and TLSOffLoad and still see the bug. It'll just be harder to debug, as the relevant payload is then encrypted, obfuscating the root cause.

You can reduce all the ping/pong times in config.json to something like 5s to make things easier to debug. It'll trigger the error much faster that way.
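In other words, my guess at a minimal repro config would keep only the keys that matter for the bug, with the 5-second intervals mentioned above:

```json
{
    "settings": {
        "AgentPing": 5,
        "AgentPong": 5,
        "BrowserPing": 5,
        "BrowserPong": 5
    }
}
```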

BTW, this is a copy of MeshCentral that I installed on a dedicated (virtual) machine running Ubuntu. I installed it using "npm".

I then connect to MeshCentral from my Chrome browser and add a device group for agent-less local devices. In this device group, I add a Linux/SSH device. I open the Terminal tab for this device and click the SSH Connect button. Up to this point, everything works just fine. I get a terminal with an SSH session that works as expected (at least for a couple of seconds).

Also, just for the record, I checked that all the other usual features work. For instance, if I have a device that uses MeshAgent, I can successfully connect to it using both Desktop and Terminal. Things obviously work fine when using agents, but they seem to fail for me when using agent-less connections.

Everything appears to be fine until a PING request is generated by MeshCentral. At that point, the connection gets torn down, and I see an error message from MeshCentral (printed to the console; this required starting MeshCentral in debug mode, I think) that the SSH packet is malformed and doesn't have a valid length field. That agrees with RFC 4253, which says that each SSH data packet starts with its length. I tried printing out the chunks that I see traveling over this WebSocket, and it stands out really obviously that most of the data is binary, and then there suddenly is an unencrypted plain-text JSON fragment.
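To illustrate why a stray JSON frame breaks the stream: per RFC 4253, every SSH binary packet begins with a 4-byte big-endian packet_length field. If a plain-text keepalive (the JSON shape below is just an example, not MeshCentral's actual message) lands in the middle of that stream, the receiver interprets its first four bytes as a packet length and gets garbage:

```javascript
// Illustration only (not MeshCentral code): how a JSON frame spliced
// into an SSH binary stream produces a bogus packet length.

// A toy SSH packet: 4-byte big-endian length (12), then 12 payload bytes.
const sshPacket = Buffer.concat([
    Buffer.from([0x00, 0x00, 0x00, 0x0c]),
    Buffer.alloc(12, 0xaa),
]);
console.log(sshPacket.readUInt32BE(0)); // 12 -- a plausible length

// A JSON keepalive (hypothetical shape) read as if it were a packet:
// the first four bytes '{"ac' decode to an absurdly large "length",
// which matches the malformed-packet error described above.
const jsonFrame = Buffer.from('{"action":"ping"}');
console.log(jsonFrame.readUInt32BE(0)); // 2065850723 -- garbage
```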

When I instrument the source code, I see that obj.ws in meshrelay.js passes what looks like SSH encrypted data back and forth. But it is possible that I am misreading this and it actually is TLS encrypted. It's a little hard to tell those two apart by mere visual inspection of the console.log() statements that I sprinkled into the code. I am not 100% certain that this is the connection between the browser and the central agent. As the central agent is acting as a relay here, this could also be the connection from the central agent to the device that I log into over SSH. It could also be an internal connection between parts of the system that I don't fully understand.

And I suspect I am not the only one who is confused by the different sockets involved in this relaying process. It is fairly obvious that MeshCentral shouldn't mix JSON into an SSH data stream. But it is not clear to me that this is even the correct socket to send ping/pong packets on. Quite possibly, it's trying to do the right thing and just using the wrong socket? On the other hand, if this actually is the WebSocket between the browser and MeshCentral, then ideally we should figure out how to send ping/pong-style keepalive packets that are compatible with the rest of the payload. I can see that there is code that switches the wire protocol between an initial authentication state and a later state when the connection is fully established. Maybe that logic was added later and is now incompatible with ping/pong messages on the same socket?

Reverse proxies are commonly used in these types of settings, though, and they tend to work more reliably if ping/pong-style keepalive data can be passed back and forth every so often. So it would be nice to fix this correctly. For now, I cranked up the timeouts on the reverse proxy to ridiculously high values and then disabled ping/pong. That fixes my immediate problem, but it's not very satisfactory. In particular, it sucks because ping/pong works really well for connections that go to actual MeshAgents.
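For anyone wanting the same stopgap, the proxy-side half of it amounts to something like this in the NGINX location block (directive values are just the idea; tune them to taste):

```nginx
# Workaround sketch: keep idle relay WebSockets open at the proxy
# instead of relying on MeshCentral's ping/pong (example values).
location / {
    proxy_pass http://meshcentral-host:480;  # placeholder upstream
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 24h;  # effectively disable the idle read timeout
    proxy_send_timeout 24h;
}
```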

@si458 si458 mentioned this pull request Feb 9, 2024