Move multi host handling to the bridge instead of cockpit-ws #6102

petervo · 2017-03-14T16:34:37Z

Depends on:

Routing and Peer cleanup #6104

stefwalter · 2017-03-14T21:55:03Z

Makefile.am

@@ -157,6 +157,7 @@ WEBPACK_PACKAGES = \
 	selinux \
 	shell \
 	sosreport \
+	ssh \


Why do we need this in a new directory? It seems like it unfortunately shares so much code with cockpit-ws. Worth keeping in the same directory? In addition the manifest.json stuff could go in the dashboard manifest.json if it needs a home.

Sometimes we want this functionality without the dashboard, e.g. in the kubernetes container. That's why I pulled it out. But we can always just mess with it in the container setup scripts if you think that's better.

stefwalter · 2017-03-14T22:15:23Z

src/ssh/cockpitsshservice.c

+      session->thawing++;
+      for (l = frozen->head; l != NULL; l = g_list_next (l))
+        {
+g_printerr ("THAWING: %s\n", (gchar *)l->data);


Whoops, I accidentally left this in here.

stefwalter · 2017-03-14T22:16:58Z

src/ssh/cockpitsshtransport.c

-                             G_PARAM_WRITABLE | G_PARAM_CONSTRUCT_ONLY | G_PARAM_STATIC_STRINGS));
+  g_object_class_install_property (object_class, PROP_PASSWORD,
+         g_param_spec_string ("password", NULL, NULL, NULL,
+                              G_PARAM_READWRITE | G_PARAM_CONSTRUCT_ONLY | G_PARAM_STATIC_STRINGS));


The fact that password is a string means it's copied throughout memory here. We should be using a g_param_spec_boxed of type G_TYPE_BYTES here. However, this could be a follow up.

We currently copy it once here (https://github.com/cockpit-project/cockpit/blob/master/src/ws/cockpitwebservice.c#L1222); in this case I'm only copying it once as well, here (b2508e5#diff-d05217b82cb7f915e32af94329806ef5R531). I could be missing something, but I think it's the same number of copies as what we do currently. Is that good enough for now?

stefwalter · 2017-03-14T22:17:14Z

src/ssh/cockpitsshtransport.c

@@ -749,12 +731,14 @@ cockpit_ssh_transport_class_init (CockpitSshTransportClass *klass)
 CockpitTransport *
 cockpit_ssh_transport_new (const gchar *host,
                           guint port,
-                           CockpitCreds *creds)
+                           const gchar *user,
+                           const gchar *password)


As above, this should be a GBytes. However this could be a follow up.

stefwalter · 2017-03-14T22:20:48Z

src/ws/cockpitwebservice.c

@@ -476,16 +261,6 @@ cockpit_web_service_dispose (GObject *object)

  cockpit_sockets_close (&self->sockets, NULL);

-  g_hash_table_iter_init (&iter, self->sessions.by_transport);


Seems like we should replace this code with:

if (!self->sent_done) { self->sent_done = TRUE; cockpit_transport_close (self->transport, NULL); }

That happens a couple of lines above, before the sockets get closed. Or was the point that closing the transport should happen after the sockets?

martinpitt · 2017-03-15T07:47:55Z

I pushed a fixup for the test-sshtransport regression (due to using SSH_ASKPASS now). Please double-check that this makes sense. I now ran the test case in a loop and it's stable.

martinpitt

Found some minor issues. I'll push a FIXUP for review.

martinpitt · 2017-03-15T08:52:25Z

pkg/ssh/manifest.json.in

+        {
+            "match": { "host": null },
+            "spawn": [ "@libexecdir@/cockpit-ssh" ],
+            "environ": [ "SSH_ASKPASS=@libexecdir@/cockpit-askpass" ],


As we use libssh and call the askpass program ourselves (ssh_askpass() in cockpitsshrelay.c), I wonder if we should really call that $SSH_ASKPASS; as the cockpit-ssh bridge gets called under that environment, won't that propagate to commands run as the bridge too? I don't think we want to use that askpass helper for running ssh commands through the bridge on the remote system. I.e. use $COCKPIT_ASKPASS?

The environment doesn't get propagated across ssh by default ... that's both a blessing and a curse. Or have you seen this happen in real life? In which case it's not Cockpit specific: this would be a case with any SSH_ASKPASS usage anywhere (including stock ssh).

Ack, if we don't propagate the var then it's all good, and everything else is stuff that we can easily change later on if necessary. Thanks!
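
For the local half of this concern, a minimal Python sketch: a variable set through an "environ" entry is inherited by every process that the spawned program starts on the same machine. The helper path is a hypothetical stand-in for @libexecdir@/cockpit-askpass, and the sketch says nothing about what sshd does on the remote side.

```python
import os
import subprocess
import sys

# Hypothetical helper path, standing in for @libexecdir@/cockpit-askpass.
ASKPASS = "/usr/libexec/cockpit-askpass"


def child_sees_askpass(env_name):
    """Spawn a local child with env_name set and report what it observes."""
    env = dict(os.environ)
    env[env_name] = ASKPASS
    # The child only prints the value of the variable it inherited.
    code = "import os; print(os.environ.get(%r, ''))" % env_name
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode().strip()
```

Any local child inherits the variable regardless of its name, so renaming it to COCKPIT_ASKPASS would only matter for programs that specifically look for SSH_ASKPASS.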

martinpitt · 2017-03-15T08:59:02Z

src/ssh/cockpitsshrelay.c

@@ -924,12 +919,9 @@ cockpit_ssh_authenticate (CockpitSshData *data)
  int methods_to_try = SSH_AUTH_METHOD_INTERACTIVE |
                       SSH_AUTH_METHOD_GSSAPI_MIC;

-#ifdef HAVE_SSH_SET_AGENT_SOCKET
  methods_to_try = methods_to_try | SSH_AUTH_METHOD_PUBLICKEY;


I think this line should be dropped. We already conditionally set it below (line 924) if the auth type is private-key. OTOH, if we always want to try PUBLIC_KEY (which can't hurt IMHO -- if the user has a matching key, this is better than password authentication), then the following two lines can go.

We should remove the #ifdef but keep the flag. Now we rely on the standard SSH_AUTH_SOCK, whereas before we had to set up an fd for the agent.
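
The flag handling discussed here can be sketched in Python as a plain bitmask. The numeric values are assumed to mirror libssh's SSH_AUTH_METHOD_* constants and should be treated as illustrative; the function names are hypothetical.

```python
from enum import IntFlag


class AuthMethod(IntFlag):
    # Assumed to mirror libssh's SSH_AUTH_METHOD_* values; illustrative only.
    PASSWORD = 0x0002
    PUBLICKEY = 0x0004
    INTERACTIVE = 0x0010
    GSSAPI_MIC = 0x0020


def methods_to_try():
    """Always include PUBLICKEY; the agent is located via SSH_AUTH_SOCK."""
    return AuthMethod.INTERACTIVE | AuthMethod.GSSAPI_MIC | AuthMethod.PUBLICKEY


def should_try(server_methods, method):
    """Try a method only if we want it and the server offers it."""
    return bool(methods_to_try() & server_methods & method)
```

With the flag set unconditionally, public key authentication is attempted whenever the server offers it, without any build-time switch.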

martinpitt · 2017-03-15T09:05:27Z

src/ssh/cockpitsshrelay.c

-                      (g_strcmp0 (data->auth_options->auth_type, "basic") == 0 ||
-                       g_strcmp0 (data->auth_options->auth_type,
-                                 auth_method_description (method)) == 0);
+          has_creds = data->in_bridge ||


I wonder if in_bridge is really sufficient? As far as I can see, data->initial_auth_data is set in lines 687 and 748 when called through the bridge, but ssh_askpass() can fail, and return NULL. I suppose if it does then we'll just get an auth failure later and there's nothing that we can do anyway, as we don't do "basic" auth through the bridge? It's just a bit confusing to get has_creds == TRUE here even if that's not so.

I don't know enough yet to provide any feedback on this one.

martinpitt · 2017-03-15T09:30:02Z

src/ssh/cockpitsshtransport.c

+    case PROP_PASSWORD:
+      password = g_value_get_string (value);
+      if (password)
+        self->password = g_bytes_new_take (g_strdup (password), strlen (password));


I wonder why none of the previous values get freed here (in the existing code too). Can/do we assume that we only ever set the properties once?

Update: yes, we apparently use CONSTRUCT_ONLY for those, so nevermind.

martinpitt · 2017-03-15T09:36:46Z

src/ssh/ssh.c

+      goto out;
+    }
+
+  if (argc > 2)


This should be < 2

Actually not, we do support running without arguments. The error message is just wrong.
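
A minimal Python sketch of the intended check, assuming the program accepts zero or one argument. The wording of the error message is hypothetical.

```python
def check_args(argv):
    """Return an error message, or None if the argument count is acceptable.

    argv[0] is the program name, as in C, so len(argv) plays the role of argc.
    """
    if len(argv) > 2:
        # More than one argument after the program name.
        return "%s: too many arguments" % argv[0]
    return None
```

Running without arguments stays valid; only the message reported for the "> 2" case needs to describe the actual problem.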

martinpitt · 2017-03-15T09:49:00Z

src/ssh/test-sshservice.c

+  g_free (cmd);
+  g_object_unref (service);
+
+  stop_mock_sshd (test->mock_sshd);


This test case also uses teardown(), which stops the mock sshd again, so I think this can be dropped.

Sounds good.

martinpitt · 2017-03-15T09:50:42Z

src/ssh/test-sshservice.c

+                    gconstpointer data)
+{
+  stop_mock_sshd (test->mock_sshd);
+


Do you plan to add something else here still? Seems easier to move the stop_mock_sshd() code into here and only have one teardown function, particularly as it doesn't appear to be completely reentrant (closing the pid if already closed), see below.
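
The reentrancy concern can be illustrated with a stop function that is safe to call twice. This is a Python sketch with a hypothetical fixture class, not the GLib test code itself.

```python
import subprocess
import sys


class MockSshd:
    """Stand-in for the mock sshd fixture; the class name is hypothetical."""

    def __init__(self):
        # Any long-running child works for the illustration.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"])

    def stop(self):
        """Stop the child once; later calls are harmless no-ops."""
        if self.proc is None:
            return False
        self.proc.terminate()
        self.proc.wait()
        self.proc = None  # remember that the child is already gone
        return True
```

Clearing the handle after the first stop is what makes a second call from a shared teardown harmless.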

martinpitt · 2017-03-15T10:30:05Z

I pushed some fixups from my review (simplify SSH_AUTH_METHOD_PUBLICKEY flag setting, clean up mock sshd stopping in tests, fix cockpit-ssh error message with too many args). Recording the relevant integration test failures here, they reproduce locally too:

One fails on a cockpit-pcp SIGSEGV:

not ok 3 testFrameNavigation (__main__.TestMultiMachine) duration: 27s
Traceback (most recent call last):
  File "test/verify/check-multi-machine", line 202, in tearDown
    MachineCase.tearDown(self)
  File "/home/martin/upstream/cockpit/test/common/testlib.py", line 533, in tearDown
    self.check_journal_messages()
  File "/home/martin/upstream/cockpit/test/common/testlib.py", line 689, in check_journal_messages
    raise Error(first)
Error: /usr/libexec/cockpit-pcp: bridge was killed: 11

And a troubleshooting regression: the PNG shows a "Your session has been terminated [Reconnect]" page instead of the usual machine troubleshooting page.

not ok 24 testTroubleshooting (check_multi_machine.TestMultiMachine) duration: 107s
Traceback (most recent call last):
  File "./verify/check-multi-machine", line 726, in testTroubleshooting
    b.wait_not_visible(".curtains-ct")
  File "/build/cockpit/test/common/testlib.py", line 240, in wait_not_visible
    return self.wait_js_func('!ph_is_visible', selector)
  File "/build/cockpit/test/common/testlib.py", line 210, in wait_js_func
    return self.phantom.wait("%s(%s)" % (func, ','.join(map(jsquote, args))))
  File "/build/cockpit/test/common/testlib.py", line 736, in <lambda>
    return lambda *args: self._invoke(name, *args)
  File "/build/cockpit/test/common/testlib.py", line 762, in _invoke
    raise Error(res['error'])
Error: timeout
No such file or directory

Same fate (apparently) in

not ok 15 testBasic (check_multi_machine_key.TestMultiMachineKeyAuth) duration: 169s
Traceback (most recent call last):
  File "./verify/check-multi-machine-key", line 154, in testBasic
    b.wait_in_text('#troubleshoot-dialog', "Fingerprint")
  File "/build/cockpit/test/common/testlib.py", line 243, in wait_in_text
    return self.wait_js_func('ph_in_text', selector, text)
  File "/build/cockpit/test/common/testlib.py", line 210, in wait_js_func
    return self.phantom.wait("%s(%s)" % (func, ','.join(map(jsquote, args))))
  File "/build/cockpit/test/common/testlib.py", line 736, in <lambda>
    return lambda *args: self._invoke(name, *args)
  File "/build/cockpit/test/common/testlib.py", line 762, in _invoke
    raise Error(res['error'])
Error: timeout

martinpitt · 2017-03-15T11:42:47Z

Wrt. the pcp crash: I got a symbolic backtrace and then found a two year old bug report about the same issue. I attached it there, did a minimal analysis and asked for reopening/updating (I can't do that myself).

So this is nothing new, but the bridge rearrangement apparently changed the timing slightly: This is a race condition, sometimes the test passes because the "bridge was killed: 11" message does not make it into the journal fast enough. But I always got it to crash in my manual runs with a sit() at the end.

As we are under a time constraint here and this is an old bug which apparently doesn't visually affect the user experience, I propose this for now:

--- a/test/verify/check-multi-machine
+++ b/test/verify/check-multi-machine
@@ -517,6 +517,8 @@ class TestMultiMachine(MachineCase):
         # kill admin, lock account
         m2.execute('passwd -l admin')
         kill_user_admin(m2)
+        # this step causes a crash in PCP (https://bugzilla.redhat.com/show_bug.cgi?id=1235962)
+        self.allow_journal_messages(".*cockpit-pcp: bridge was killed: 11")
 
         b.wait_present(".curtains-ct");
         b.wait_text(".curtains-ct h1", "Couldn't connect to the machine")

Sanity-checking appreciated. I propose to not push the above right away so that the tests have a chance to finish once and we get a full overview of remaining problems. @stefwalter is working on the troubleshooting regression, we can push this together with his fix?

stefwalter · 2017-03-15T11:45:17Z

What about a known issue? I believe it would be more appropriate: This is a failure that affects cockpit users on certain operating systems and is a bug of those operating systems.

We can also get data on when this happens and/or stops happening.

stefwalter · 2017-03-15T11:46:33Z

In addition, this also does not mask an actual test running through until the end. If it did, that would be a reason to avoid a known issue.

martinpitt · 2017-03-15T12:02:06Z

Reworked the pcp crash ignoring to be a known issue #6108: Preliminary patch is https://paste.fedoraproject.org/paste/dT-d8b-CAFt4NzggOlCunl5M1UNdIGYhyRLivL9gydE=/ but I suppose it will affect the Atomics too, and possibly Debian/Ubuntu. Waiting for more test results to come in.

stefwalter · 2017-03-15T13:12:10Z

Removing needswork ... we need real test feedback

petervo · 2017-03-15T13:13:12Z

src/bridge/cockpitrouter.c

@@ -420,6 +421,15 @@ process_kill (CockpitRouter *self,
      g_warning ("received invalid \"group\" field in kill command");
      return;
    }
+  else if (!cockpit_json_get_string (options, "host", NULL, &host))
+    {
+      g_warning ("received invalid \"group\" field in kill command");


Should say hosts?

Yup this should say "host" ... are you going to make the change?

martinpitt · 2017-03-15T13:16:26Z

pcp crash test failure is independent, now part of PR #6109.

martinpitt

I pushed the "move to /etc/ssh/ssh_known_hosts" changes on top of this. It depends on this rework to avoid having to introduce a temporary ws capability, but it should also not land after this as it would then require a new capability. Tests pass locally.

Also, with the fixups I pushed earlier all my (relevant) remarks got addressed.

This is an old bug (https://bugzilla.redhat.com/show_bug.cgi?id=1235962) but the cockpit-ssh rework (PR cockpit-project#6102) now changes the timing to exhibit this much more often. This affects at least check-multi-machine and check-dashboard, so keep the quirk generic to match unexpected journal messages from any test. Closes cockpit-project#6109

martinpitt · 2017-03-15T17:24:14Z

Built/verified the fix manually, works great now. Thanks @petervo!

This is an old bug (https://bugzilla.redhat.com/show_bug.cgi?id=1235962) but the cockpit-ssh rework (PR #6102) now changes the timing to exhibit this much more often. This affects at least check-multi-machine and check-dashboard, so keep the quirk generic to match unexpected journal messages from any test. Closes #6109 Reviewed-by: Stef Walter <stefw@redhat.com>

stefwalter · 2017-03-15T19:23:06Z

src/bridge/cockpitrouter.c

+cockpit_router_ban_hosts (CockpitRouter *self)
+{
+  RouterRule *rule;
+  JsonObject *match = json_object_new ();


Wouldn't an appropriate "bridges" entry in src/base1/manifest.json.in solve this? If not, it's worth documenting.

Initially that's what I had. But cockpit-stub initially didn't load bridges, so hosts were falling through to the regular channel processing, leading to weird, hard-to-debug results. Since hosts is so fundamental to our protocol, I figured that we should just refuse by default the way we used to, even if no bridges get loaded.

I agree. Looks like we'll have to add checksum handling here too ...

There is no real reason for maintaining our own /var/lib/cockpit/known_hosts file, as ssh itself already has a global one in /etc/ssh/ssh_known_hosts. Use that by default, but fall back to the legacy file (1) for lookups if a host is not already known in the former but known in the latter; and (2) for writing if the ws we talk to is still an old version (by checking if ws still has the "ssh" capability). Move the determination and setting of the known hosts file into a new set_knownhosts_file() function, as it is now reasonably complex, will be extended further in the future with more sources of known hosts, and avoids handling SSH_OPTIONS_KNOWNHOSTS in multiple different places. Adjust the integration tests to the new path and add new tests for covering the fallback to the legacy file.
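
The two fallback cases can be sketched in Python as a single selection function. This is a simplification under stated assumptions: it collapses lookup and writing into one choice, which the real set_knownhosts_file() may not do, and it takes the already-known hosts as sets instead of parsing the files. The function and parameter names are hypothetical.

```python
GLOBAL_FILE = "/etc/ssh/ssh_known_hosts"
LEGACY_FILE = "/var/lib/cockpit/known_hosts"


def choose_knownhosts_file(host, global_hosts, legacy_hosts, ws_has_ssh_capability):
    """Pick the known hosts file following the two fallback cases.

    global_hosts and legacy_hosts are sets of host names already present in
    each file; a real implementation would parse the files instead.
    """
    # (2) An old cockpit-ws still advertises "ssh": keep using the legacy file.
    if ws_has_ssh_capability:
        return LEGACY_FILE
    # (1) Host only known in the legacy file: look it up there.
    if host not in global_hosts and host in legacy_hosts:
        return LEGACY_FILE
    return GLOBAL_FILE
```

Keeping the decision in one place is the point of the refactoring: callers ask once and never touch SSH_OPTIONS_KNOWNHOSTS themselves.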

martinpitt · 2017-03-15T21:03:41Z

Fixed tests harder

stefwalter · 2017-03-15T21:48:32Z

check-loopback passes on debian-testing; the failure is intermittent.

Closes #6102 Reviewed-by: Stef Walter <stefw@redhat.com>

petervo added the bot label Mar 14, 2017

petervo force-pushed the bridge-ssh branch 2 times, most recently from 931aec3 to b2508e5 Compare March 14, 2017 17:06

stefwalter added blocked Don't land until something else happens first (see task list) needswork labels Mar 14, 2017

stefwalter suggested changes Mar 14, 2017

petervo force-pushed the bridge-ssh branch from fb1e6e8 to e45db51 Compare March 15, 2017 05:46

petervo removed the blocked Don't land until something else happens first (see task list) label Mar 15, 2017

petervo force-pushed the bridge-ssh branch from 7a97e57 to 6a3191c Compare March 15, 2017 08:05

stefwalter added release-blocker Targetted for next release and removed bot labels Mar 15, 2017

martinpitt requested changes Mar 15, 2017

stefwalter removed the needswork label Mar 15, 2017

petervo commented Mar 15, 2017

martinpitt mentioned this pull request Mar 15, 2017

Test and build fixes #6109

Closed

martinpitt approved these changes Mar 15, 2017

martinpitt mentioned this pull request Mar 15, 2017

ws: Move to /etc/ssh/ssh_known_hosts #6025

Closed

3 tasks

stefwalter approved these changes Mar 15, 2017

petervo force-pushed the bridge-ssh branch from c504a99 to 5e1a994 Compare March 15, 2017 14:20

petervo force-pushed the bridge-ssh branch from fa006ad to c45a7c4 Compare March 15, 2017 19:12

stefwalter approved these changes Mar 15, 2017

petervo force-pushed the bridge-ssh branch 2 times, most recently from 315bae3 to 2d46021 Compare March 15, 2017 20:20

stefwalter added bot and removed bot labels Mar 15, 2017

petervo and others added 11 commits March 15, 2017 21:56

ws: Remove all multi host and ssh handling from ws

ba9bc23

It will be up to the bridges to handle connecting to additional hosts

ssh: Launch cockpit-ssh from the bridge

1c9ce33

ssh: Use SSH_ASKPASS when we need a password in cockpit-ssh

62c9173

bridge: Defer to other bridges for "kill" command with "host"

959c678

Only implement "kill" with a "host" when the "host" matches the one we received in our "init" message. For other cases the "kill" command is forwarded to a peer bridge.
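
A minimal Python sketch of that routing decision. Treating a "kill" without any "host" as a local operation is an assumption of the sketch, and the function name and return values are hypothetical.

```python
def route_kill(options, init_host):
    """Decide where a "kill" control message is handled.

    options is the decoded control message; init_host is the "host" value
    this bridge received in its "init" message.
    """
    host = options.get("host")
    if host is not None and not isinstance(host, str):
        raise ValueError('received invalid "host" field in kill command')
    # No host (assumed local) or our own host: handle it here.
    if host is None or host == init_host:
        return "local"
    # Any other host belongs to a peer bridge.
    return "peer"
```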

ssh: When an ssh session is private, cleanup immediately

7b4d1e5

When an ssh session is private only for specific channel(s) then when those channels are gone, close the ssh session immediately. No need to wait for the timeout in this case.
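
The cleanup rule can be modelled with a toy session object in Python. All names and state strings are hypothetical; the real code works with GLib timeouts and transports.

```python
class SshSession:
    """Toy model of an ssh session shared by channels."""

    def __init__(self, private):
        self.private = private
        self.channels = set()
        self.state = "open"

    def add_channel(self, channel_id):
        self.channels.add(channel_id)
        self.state = "open"

    def remove_channel(self, channel_id):
        self.channels.discard(channel_id)
        if self.channels:
            return
        # Private sessions cannot be reused by other channels, so there is
        # nothing to gain from keeping them around for an idle timeout.
        self.state = "closed" if self.private else "idle-timeout-pending"
```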

test: Update key tests to run on debian

0f761fb

Remove key tests from atomic

tools: Remove cockpit-stub from cockpit-ws

c4f03fb

Kubernetes is our only use case for this for now. Let's just distribute it there to keep it out of cockpit-ws. Added a provides so we can split it out later if needed.

bridge: Setup a no hosts allowed rule by default.

8a3cc49

This ensures that a router will refuse to process an open command with a host unless specifically configured to do so.
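
A Python sketch of such a default rule. It assumes that a null match value means "field present with any value", as the "match": { "host": null } manifest entry suggests; the rule representation and action names are hypothetical.

```python
def rule_matches(match, options):
    """Assumed semantics: a None value means "field present with any value"."""
    for key, expected in match.items():
        if key not in options:
            return False
        if expected is not None and options[key] != expected:
            return False
    return True


# Rules are checked in order; the only built-in rule is the default host ban.
DEFAULT_RULES = [
    ({"host": None}, "reject"),
]


def route_open(options, rules=DEFAULT_RULES):
    """Return the action of the first matching rule, else handle locally."""
    for match, action in rules:
        if rule_matches(match, options):
            return action
    return "local"
```

A bridge configured for multi-host use would place its own host rule ahead of the default one, so the ban only applies when nothing else claims the host.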

bridge: Allow cockpit-stub to open bridges

7844b5b

test: Generalize cockpit-pcp path in known issue

10cfb0b

Commit 1dab8a5 hardcoded the install path of cockpit-pcp, but libexecdir is different in Debianish distros. We don't care about the particular location, so use pattern matching instead.
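
The difference between the two patterns can be checked with a few lines of Python. The sketch assumes that allowed journal messages are matched against the whole line, and that Debian installs the helper under /usr/lib/cockpit; both are assumptions, and the helper name is hypothetical.

```python
import re

# Hard-coded path: only matches distributions that use /usr/libexec.
HARDCODED = r"/usr/libexec/cockpit-pcp: bridge was killed: 11"
# Generalized: any install prefix in front of cockpit-pcp.
GENERAL = r".*cockpit-pcp: bridge was killed: 11"


def allowed(pattern, message):
    """True if the pattern matches the entire journal message."""
    return re.match(pattern + "$", message) is not None
```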

martinpitt added the needswork label Mar 15, 2017

test: Fix regexp for allowed journal message

2f900ff

The message is cockpit-ssh admin@10.111.112.135:22: -1 couldn't connect: Connection refused '10.111.112.135' '22'' So we were missing the "-1" after the colon, and missed a period for the '*'.
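
Both defects can be reproduced in Python. The patterns below are hypothetical reconstructions for illustration, not the ones from the test suite, and whole-line matching is assumed.

```python
import re

MESSAGE = ("cockpit-ssh admin@10.111.112.135:22: -1 couldn't connect: "
           "Connection refused '10.111.112.135' '22'")

# Missing the "-1" after the colon, and " *" only matches spaces.
BROKEN = r"cockpit-ssh .*: couldn't connect: Connection refused *"
# "-1" added, but the trailing part still cannot match the quoted host.
HALF_FIXED = r"cockpit-ssh .*: -1 couldn't connect: Connection refused *"
# "-1" added and ".*" used for the variable tail.
FIXED = r"cockpit-ssh .*: -1 couldn't connect: Connection refused .*"


def allowed(pattern, message=MESSAGE):
    """True if the pattern matches the entire journal message."""
    return re.match(pattern + "$", message) is not None
```

A bare "*" repeats the preceding character, so it never stands for "anything" the way it does in a shell glob; the period is what makes it a wildcard.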

martinpitt force-pushed the bridge-ssh branch from 2d46021 to 2f900ff Compare March 15, 2017 21:03

martinpitt removed the needswork label Mar 15, 2017

stefwalter closed this in 082da24 Mar 15, 2017

stefwalter pushed a commit that referenced this pull request Mar 15, 2017

ssh: Launch cockpit-ssh from the bridge

b3a463b

Closes #6102 Reviewed-by: Stef Walter <stefw@redhat.com>

petervo deleted the bridge-ssh branch March 15, 2017 21:53

martinpitt mentioned this pull request Oct 1, 2019

RHEL, CentOS, Fedora, Debian, Ubuntu: PCP libraries crash in __pmFindProfile() cockpit-project/bots#65

Closed

mvollmer mentioned this pull request Oct 7, 2020

shell: Allow editing hosts without admin privs #14688

Merged

6 tasks
