-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move multi host handling to the bridge instead of cockpit-ws #6102
Conversation
931aec3
to
b2508e5
Compare
@@ -157,6 +157,7 @@ WEBPACK_PACKAGES = \ | |||
selinux \ | |||
shell \ | |||
sosreport \ | |||
ssh \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this in a new directory? It seems like it unfortunately shares so much code with cockpit-ws. Worth keeping in the same directory? In addition the manifest.json stuff could go in the dashboard manifest.json if it needs a home.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes we want this functionality without the dashboard. EI the kubernetes container. That's why I pulled out it. But we can always just mess with it in the container setup scripts if you think that's better.
src/ssh/cockpitsshservice.c
Outdated
session->thawing++; | ||
for (l = frozen->head; l != NULL; l = g_list_next (l)) | ||
{ | ||
g_printerr ("THAWING: %s\n", (gchar *)l->data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, I accidentally left this in here.
G_PARAM_WRITABLE | G_PARAM_CONSTRUCT_ONLY | G_PARAM_STATIC_STRINGS)); | ||
g_object_class_install_property (object_class, PROP_PASSWORD, | ||
g_param_spec_string ("password", NULL, NULL, NULL, | ||
G_PARAM_READWRITE | G_PARAM_CONSTRUCT_ONLY | G_PARAM_STATIC_STRINGS)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that password is a string means it's copied throughout memory here. We should be using a g_param_spec_boxed of type G_TYPE_BYTES here. However, this could be a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently copy it once here (https://github.com/cockpit-project/cockpit/blob/master/src/ws/cockpitwebservice.c#L1222) in this case I'm only copying it once as well here (b2508e5#diff-d05217b82cb7f915e32af94329806ef5R531). I could be missing something but i think it's the same number of copies as what we do currently. Is that good enough for now.
@@ -749,12 +731,14 @@ cockpit_ssh_transport_class_init (CockpitSshTransportClass *klass) | |||
CockpitTransport * | |||
cockpit_ssh_transport_new (const gchar *host, | |||
guint port, | |||
CockpitCreds *creds) | |||
const gchar *user, | |||
const gchar *password) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, this should be a GBytes. However this could be a follow up.
@@ -476,16 +261,6 @@ cockpit_web_service_dispose (GObject *object) | |||
|
|||
cockpit_sockets_close (&self->sockets, NULL); | |||
|
|||
g_hash_table_iter_init (&iter, self->sessions.by_transport); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we should replace this code with:
if (!self->sent_done)
{
self->sent_done = TRUE;
cockpit_transport_close (self->transport, NULL);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That happens a couple lines above. Before the sockets gets closed. Or was the point that closes the transport should happen after the sockets?
I pushed a fixup for the test-sshtransport regression (due to using SSH_ASKPASS now). Please double-check that this makes sense. I now ran the test case in a loop and it's stable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found some minor issues. I'll push a FIXUP for review.
{ | ||
"match": { "host": null }, | ||
"spawn": [ "@libexecdir@/cockpit-ssh" ], | ||
"environ": [ "SSH_ASKPASS=@libexecdir@/cockpit-askpass" ], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we use libssh and call the askpass program ourselves (ssh_askpass()
in cockpitsshrelay.c), I wonder if we should really call that $SSH_ASKPASS
; as the cockpit-ssh bridge gets called under that environment, won't that propagate to commands run as the bridge too? I don't think we want to use that askpass helper for running ssh
commands through the bridge on the remote system. I. e. use $COCKPIT_ASKPASS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The environment doesn't get propagated across ssh by default ... that's both a blessing and a curse. Or have you seen this happen in real life? In which case it's not Cockpit specific: this would be a case with any SSH_ASKPASS
usage anywhere (including stock ssh).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, if we don't propagate the var then it's all good, and everything else is stuff that we can easily change later on if necessary. Thanks!
src/ssh/cockpitsshrelay.c
Outdated
@@ -924,12 +919,9 @@ cockpit_ssh_authenticate (CockpitSshData *data) | |||
int methods_to_try = SSH_AUTH_METHOD_INTERACTIVE | | |||
SSH_AUTH_METHOD_GSSAPI_MIC; | |||
|
|||
#ifdef HAVE_SSH_SET_AGENT_SOCKET | |||
methods_to_try = methods_to_try | SSH_AUTH_METHOD_PUBLICKEY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this like should be dropped. We already conditionally set it below (line 924) if the auth type is private-key
. OTOH, if we always want to try PUBLIC_KEY (which can't hurt IMHO -- if the user has a matching key, this is better than password authentication), then the following two lines can go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove the #ifdef but keep flag. Now we rely on the standard SSH_AUTH_SOCK
whereas before we had to setup an fd for the agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
(g_strcmp0 (data->auth_options->auth_type, "basic") == 0 || | ||
g_strcmp0 (data->auth_options->auth_type, | ||
auth_method_description (method)) == 0); | ||
has_creds = data->in_bridge || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if in_bridge
is really sufficient? As far as I can see, data->initial_auth_data
is set in lines 687 and 748 when called through the bridge, but ssh_askpass()
can fail, and return NULL
. I suppose if it does then we'll just get an auth failure later and there's nothing that we can do anyway, as we don't do "basic" auth through the bridge? It's just a bit confusing to get has_creds == TRUE
here even if that's not so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know enough yet to provide any feedback on this one.
case PROP_PASSWORD: | ||
password = g_value_get_string (value); | ||
if (password) | ||
self->password = g_bytes_new_take (g_strdup (password), strlen (password)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why none of the previous values get freed here (in the existing code too). Can/do we assume that we only ever set the properties once?
Update: yes, we apparently use CONSTRUCT_ONLY
for those, so nevermind.
goto out; | ||
} | ||
|
||
if (argc > 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be < 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually not, we do support running without arguments. The error message is just wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/ssh/test-sshservice.c
Outdated
g_free (cmd); | ||
g_object_unref (service); | ||
|
||
stop_mock_sshd (test->mock_sshd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case also uses teardown()
which stops the mock sshd again., so I think this can be dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
src/ssh/test-sshservice.c
Outdated
gconstpointer data) | ||
{ | ||
stop_mock_sshd (test->mock_sshd); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan to add something else here still? Seems easier to move the stop_mock_sshd()
code into here and only have one teardown function, particulary as it doesn't appear to be completely reentrant (closing the pid if already closed), see below.
I pushed some fixups from my review (simplify One fails on a cockpit-pcp SIGSEGV:
And a troubleshooting regression: The PNG shows a "Your session has been terminated [Reconnect]" page instead of the usual machine trouble shooting page.
Same fate (apparently) in
|
Wrt. the pcp crash: I got a symbolic backtrace and then found a two year old bug report about the same issue. I attached it there, did a minimal analysis and asked for reopening/updating (I can't do that myself). So this is nothing new, but the bridge rearrangement apparently changed the timing slightly: This is a race condition, sometimes the test passes because the "bridge was killed: 11" message does not make it into the journal fast enough. But I always got it to crash in my manual runs with a As we are under a time constraint here and this is an old bug which apparently doesn't visually affect the user experience, I propose this for now: --- a/test/verify/check-multi-machine
+++ b/test/verify/check-multi-machine
@@ -517,6 +517,8 @@ class TestMultiMachine(MachineCase):
# kill admin, lock account
m2.execute('passwd -l admin')
kill_user_admin(m2)
+ # this step causes a crash in PCP (https://bugzilla.redhat.com/show_bug.cgi?id=1235962)
+ self.allow_journal_messages(".*cockpit-pcp: bridge was killed: 11")
b.wait_present(".curtains-ct");
b.wait_text(".curtains-ct h1", "Couldn't connect to the machine") Sanity-checking appreciated. I propose to not push the above right away so that the tests have a chance to finish once and we get a full overview of remaining problems. @stefwalter is working on the troubleshooting regression, we can push this together with his fix? |
What about a known issue? I believe it would be more appropriate: This is a failure that affects cockpit users on certain operating systems and is a bug of those operating systems. We can also get data on when this happens and/or stops happening. |
In addition, this also does not mask an actual test running through until the end. If it did, that would be a reason to avoid a known issue. |
Reworked the pcp crash ignoring to be a known issue #6108: Preliminary patch is https://paste.fedoraproject.org/paste/dT-d8b-CAFt4NzggOlCunl5M1UNdIGYhyRLivL9gydE=/ but I suppose it will affect the Atomics too, and possibly Debian/Ubuntu. Waiting for more test results to come in. |
Removing needswork ... we need real test feedback |
src/bridge/cockpitrouter.c
Outdated
@@ -420,6 +421,15 @@ process_kill (CockpitRouter *self, | |||
g_warning ("received invalid \"group\" field in kill command"); | |||
return; | |||
} | |||
else if (!cockpit_json_get_string (options, "host", NULL, &host)) | |||
{ | |||
g_warning ("received invalid \"group\" field in kill command"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should say hosts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup this should say "host" ... are you going to make the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
pcp crash test failure is independent, now part of PR #6109. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed the "move to /etc/ssh/ssh_known_hosts" changes on top of this. It depends on this rework to avoid having to introduce a temporary ws capability, but it should also not land after this as it would then require a new capability. Tests pass locally.
Also, with the fixups I pushed earlier all my (relevant) remarks got addressed.
This is an old bug (https://bugzilla.redhat.com/show_bug.cgi?id=1235962) but the cockpit-ssh rework (PR cockpit-project#6102) now changes the timing to exihibit this much more often. This affects at least check-multi-machine and check-dashboard, so keep the quirk generic to match unexpected journal messages from any test. Closes cockpit-project#6109
Built/verified the fix manually, works great now. Thanks @petervo! |
This is an old bug (https://bugzilla.redhat.com/show_bug.cgi?id=1235962) but the cockpit-ssh rework (PR #6102) now changes the timing to exihibit this much more often. This affects at least check-multi-machine and check-dashboard, so keep the quirk generic to match unexpected journal messages from any test. Closes #6109 Reviewed-by: Stef Walter <stefw@redhat.com>
cockpit_router_ban_hosts (CockpitRouter *self) | ||
{ | ||
RouterRule *rule; | ||
JsonObject *match = json_object_new (); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't an appropriate "bridges"
entry in src/base1/manifest.json.in
solve this? If not, it's worth documenting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially that's what I had. But cockpit-stub initially didn't load bridges. So hosts were falling through to the regular channel processing, leading to weird hard to debug results. Since hosts is so fundamental to our protocol, i figured that we should just refuse by default the way we used to even if no bridges get loaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Looks like we'll have to add checksum handling here too ...
315bae3
to
2d46021
Compare
It will be up to the bridges to handle connecting to additional hosts
Only implement "kill" with a "host" when the "host" matches the one we received in our "init" message. For other cases the "kill" command is forwarded to a peer bridge.
When an ssh session is private only for specific channel(s) then when those channels are gone, close the ssh session immediately. No need to wait for the timeout in this case.
There is no real reason for maintaining our own /var/lib/cockpit/known_hosts file, as ssh itself already has a global one in /etc/ssh/ssh_known_hosts. Use that by default, but fallback to the legacy file for (1) lookups if a host is not already known in the former but known in the latter; and (2) for writing if the ws we talk to is still an old version (by checking if ws still has the "ssh" capability). Move the determination and setting of the known hosts file into a new set_knownhosts_file() function, as it is now reasonably complex, will be extended further in the future with more sources of known hosts, and avoids handling SSH_OPTIONS_KNOWNHOSTS in multiple different places. Adjust the integration tests to the new path and add new tests for covering the fallback to the legacy file.
Remove key tests from atomic
kubernetes is are only use case for this for now. Let's just distribute there to keep it out of cockpit ws. Added a provides so we can split it out later if needed.
This ensures that a router will refuse to process a open command with a host unless specifially configured to do so.
Commit 1dab8a5 hardcoded the install path of cockpit-pcp, but libexecdir is different in Debianish distros. We don't care about the particular location, so use pattern matching instead.
The message is cockpit-ssh admin@10.111.112.135:22: -1 couldn't connect: Connection refused '10.111.112.135' '22'' So we were missing the "-1" after the colon, and missed a period for the '*'.
Fixed tests harder |
check-loopback passes on debian-testing failure is intermittent |
Closes #6102 Reviewed-by: Stef Walter <stefw@redhat.com>
Depends on: