Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move multi host handling to the bridge instead of cockpit-ws #6102

Closed
wants to merge 12 commits into from

Conversation

petervo
Copy link
Contributor

@petervo petervo commented Mar 14, 2017

@petervo petervo added the bot label Mar 14, 2017
@petervo petervo force-pushed the bridge-ssh branch 2 times, most recently from 931aec3 to b2508e5 Compare March 14, 2017 17:06
@stefwalter stefwalter added blocked Don't land until something else happens first (see task list) needswork labels Mar 14, 2017
@@ -157,6 +157,7 @@ WEBPACK_PACKAGES = \
selinux \
shell \
sosreport \
ssh \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this in a new directory? It seems like it unfortunately shares so much code with cockpit-ws. Worth keeping in the same directory? In addition the manifest.json stuff could go in the dashboard manifest.json if it needs a home.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes we want this functionality without the dashboard. EI the kubernetes container. That's why I pulled out it. But we can always just mess with it in the container setup scripts if you think that's better.

session->thawing++;
for (l = frozen->head; l != NULL; l = g_list_next (l))
{
g_printerr ("THAWING: %s\n", (gchar *)l->data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I accidentally left this in here.

G_PARAM_WRITABLE | G_PARAM_CONSTRUCT_ONLY | G_PARAM_STATIC_STRINGS));
g_object_class_install_property (object_class, PROP_PASSWORD,
g_param_spec_string ("password", NULL, NULL, NULL,
G_PARAM_READWRITE | G_PARAM_CONSTRUCT_ONLY | G_PARAM_STATIC_STRINGS));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that password is a string means it's copied throughout memory here. We should be using a g_param_spec_boxed of type G_TYPE_BYTES here. However, this could be a follow up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently copy it once here (https://github.com/cockpit-project/cockpit/blob/master/src/ws/cockpitwebservice.c#L1222) in this case I'm only copying it once as well here (b2508e5#diff-d05217b82cb7f915e32af94329806ef5R531). I could be missing something but i think it's the same number of copies as what we do currently. Is that good enough for now.

@@ -749,12 +731,14 @@ cockpit_ssh_transport_class_init (CockpitSshTransportClass *klass)
CockpitTransport *
cockpit_ssh_transport_new (const gchar *host,
guint port,
CockpitCreds *creds)
const gchar *user,
const gchar *password)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, this should be a GBytes. However this could be a follow up.

@@ -476,16 +261,6 @@ cockpit_web_service_dispose (GObject *object)

cockpit_sockets_close (&self->sockets, NULL);

g_hash_table_iter_init (&iter, self->sessions.by_transport);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should replace this code with:

if (!self->sent_done)
  {
    self->sent_done = TRUE;
    cockpit_transport_close (self->transport, NULL);
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That happens a couple lines above. Before the sockets gets closed. Or was the point that closes the transport should happen after the sockets?

@petervo petervo removed the blocked Don't land until something else happens first (see task list) label Mar 15, 2017
@martinpitt
Copy link
Member

I pushed a fixup for the test-sshtransport regression (due to using SSH_ASKPASS now). Please double-check that this makes sense. I now ran the test case in a loop and it's stable.

@stefwalter stefwalter added release-blocker Targetted for next release and removed bot labels Mar 15, 2017
Copy link
Member

@martinpitt martinpitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found some minor issues. I'll push a FIXUP for review.

{
"match": { "host": null },
"spawn": [ "@libexecdir@/cockpit-ssh" ],
"environ": [ "SSH_ASKPASS=@libexecdir@/cockpit-askpass" ],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we use libssh and call the askpass program ourselves (ssh_askpass() in cockpitsshrelay.c), I wonder if we should really call that $SSH_ASKPASS; as the cockpit-ssh bridge gets called under that environment, won't that propagate to commands run as the bridge too? I don't think we want to use that askpass helper for running ssh commands through the bridge on the remote system. I. e. use $COCKPIT_ASKPASS?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The environment doesn't get propagated across ssh by default ... that's both a blessing and a curse. Or have you seen this happen in real life? In which case it's not Cockpit specific: this would be a case with any SSH_ASKPASS usage anywhere (including stock ssh).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, if we don't propagate the var then it's all good, and everything else is stuff that we can easily change later on if necessary. Thanks!

@@ -924,12 +919,9 @@ cockpit_ssh_authenticate (CockpitSshData *data)
int methods_to_try = SSH_AUTH_METHOD_INTERACTIVE |
SSH_AUTH_METHOD_GSSAPI_MIC;

#ifdef HAVE_SSH_SET_AGENT_SOCKET
methods_to_try = methods_to_try | SSH_AUTH_METHOD_PUBLICKEY;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this like should be dropped. We already conditionally set it below (line 924) if the auth type is private-key. OTOH, if we always want to try PUBLIC_KEY (which can't hurt IMHO -- if the user has a matching key, this is better than password authentication), then the following two lines can go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove the #ifdef but keep flag. Now we rely on the standard SSH_AUTH_SOCK whereas before we had to setup an fd for the agent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

(g_strcmp0 (data->auth_options->auth_type, "basic") == 0 ||
g_strcmp0 (data->auth_options->auth_type,
auth_method_description (method)) == 0);
has_creds = data->in_bridge ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if in_bridge is really sufficient? As far as I can see, data->initial_auth_data is set in lines 687 and 748 when called through the bridge, but ssh_askpass() can fail, and return NULL. I suppose if it does then we'll just get an auth failure later and there's nothing that we can do anyway, as we don't do "basic" auth through the bridge? It's just a bit confusing to get has_creds == TRUE here even if that's not so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know enough yet to provide any feedback on this one.

case PROP_PASSWORD:
password = g_value_get_string (value);
if (password)
self->password = g_bytes_new_take (g_strdup (password), strlen (password));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why none of the previous values get freed here (in the existing code too). Can/do we assume that we only ever set the properties once?

Update: yes, we apparently use CONSTRUCT_ONLY for those, so nevermind.

goto out;
}

if (argc > 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be < 2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually not, we do support running without arguments. The error message is just wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

g_free (cmd);
g_object_unref (service);

stop_mock_sshd (test->mock_sshd);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case also uses teardown() which stops the mock sshd again., so I think this can be dropped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

gconstpointer data)
{
stop_mock_sshd (test->mock_sshd);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan to add something else here still? Seems easier to move the stop_mock_sshd() code into here and only have one teardown function, particulary as it doesn't appear to be completely reentrant (closing the pid if already closed), see below.

@martinpitt
Copy link
Member

martinpitt commented Mar 15, 2017

I pushed some fixups from my review (simplify SSH_AUTH_METHOD_PUBLICKEY flag setting, clean up mock sshd stopping in tests, fix cockpit-ssh error message with too many args). Recording the relevant integration test failures here, they reproduce locally too:

One fails on a cockpit-pcp SIGSEGV:

not ok 3 testFrameNavigation (__main__.TestMultiMachine) duration: 27s
Traceback (most recent call last):
  File "test/verify/check-multi-machine", line 202, in tearDown
    MachineCase.tearDown(self)
  File "/home/martin/upstream/cockpit/test/common/testlib.py", line 533, in tearDown
    self.check_journal_messages()
  File "/home/martin/upstream/cockpit/test/common/testlib.py", line 689, in check_journal_messages
    raise Error(first)
Error: /usr/libexec/cockpit-pcp: bridge was killed: 11

And a troubleshooting regression: The PNG shows a "Your session has been terminated [Reconnect]" page instead of the usual machine trouble shooting page.

not ok 24 testTroubleshooting (check_multi_machine.TestMultiMachine) duration: 107s
Traceback (most recent call last):
  File "./verify/check-multi-machine", line 726, in testTroubleshooting
    b.wait_not_visible(".curtains-ct")
  File "/build/cockpit/test/common/testlib.py", line 240, in wait_not_visible
    return self.wait_js_func('!ph_is_visible', selector)
  File "/build/cockpit/test/common/testlib.py", line 210, in wait_js_func
    return self.phantom.wait("%s(%s)" % (func, ','.join(map(jsquote, args))))
  File "/build/cockpit/test/common/testlib.py", line 736, in <lambda>
    return lambda *args: self._invoke(name, *args)
  File "/build/cockpit/test/common/testlib.py", line 762, in _invoke
    raise Error(res['error'])
Error: timeout
No such file or directory

Same fate (apparently) in

not ok 15 testBasic (check_multi_machine_key.TestMultiMachineKeyAuth) duration: 169s
Traceback (most recent call last):
  File "./verify/check-multi-machine-key", line 154, in testBasic
    b.wait_in_text('#troubleshoot-dialog', "Fingerprint")
  File "/build/cockpit/test/common/testlib.py", line 243, in wait_in_text
    return self.wait_js_func('ph_in_text', selector, text)
  File "/build/cockpit/test/common/testlib.py", line 210, in wait_js_func
    return self.phantom.wait("%s(%s)" % (func, ','.join(map(jsquote, args))))
  File "/build/cockpit/test/common/testlib.py", line 736, in <lambda>
    return lambda *args: self._invoke(name, *args)
  File "/build/cockpit/test/common/testlib.py", line 762, in _invoke
    raise Error(res['error'])
Error: timeout

@martinpitt
Copy link
Member

martinpitt commented Mar 15, 2017

Wrt. the pcp crash: I got a symbolic backtrace and then found a two year old bug report about the same issue. I attached it there, did a minimal analysis and asked for reopening/updating (I can't do that myself).

So this is nothing new, but the bridge rearrangement apparently changed the timing slightly: This is a race condition, sometimes the test passes because the "bridge was killed: 11" message does not make it into the journal fast enough. But I always got it to crash in my manual runs with a sit() at the end.

As we are under a time constraint here and this is an old bug which apparently doesn't visually affect the user experience, I propose this for now:

--- a/test/verify/check-multi-machine
+++ b/test/verify/check-multi-machine
@@ -517,6 +517,8 @@ class TestMultiMachine(MachineCase):
         # kill admin, lock account
         m2.execute('passwd -l admin')
         kill_user_admin(m2)
+        # this step causes a crash in PCP (https://bugzilla.redhat.com/show_bug.cgi?id=1235962)
+        self.allow_journal_messages(".*cockpit-pcp: bridge was killed: 11")
 
         b.wait_present(".curtains-ct");
         b.wait_text(".curtains-ct h1", "Couldn't connect to the machine")

Sanity-checking appreciated. I propose to not push the above right away so that the tests have a chance to finish once and we get a full overview of remaining problems. @stefwalter is working on the troubleshooting regression, we can push this together with his fix?

@stefwalter
Copy link
Contributor

What about a known issue? I believe it would be more appropriate: This is a failure that affects cockpit users on certain operating systems and is a bug of those operating systems.

We can also get data on when this happens and/or stops happening.

@stefwalter
Copy link
Contributor

In addition, this also does not mask an actual test running through until the end. If it did, that would be a reason to avoid a known issue.

@martinpitt
Copy link
Member

Reworked the pcp crash ignoring to be a known issue #6108: Preliminary patch is https://paste.fedoraproject.org/paste/dT-d8b-CAFt4NzggOlCunl5M1UNdIGYhyRLivL9gydE=/ but I suppose it will affect the Atomics too, and possibly Debian/Ubuntu. Waiting for more test results to come in.

@stefwalter
Copy link
Contributor

Removing needswork ... we need real test feedback

@@ -420,6 +421,15 @@ process_kill (CockpitRouter *self,
g_warning ("received invalid \"group\" field in kill command");
return;
}
else if (!cockpit_json_get_string (options, "host", NULL, &host))
{
g_warning ("received invalid \"group\" field in kill command");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should say hosts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup this should say "host" ... are you going to make the change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

@martinpitt martinpitt mentioned this pull request Mar 15, 2017
@martinpitt
Copy link
Member

pcp crash test failure is independent, now part of PR #6109.

Copy link
Member

@martinpitt martinpitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed the "move to /etc/ssh/ssh_known_hosts" changes on top of this. It depends on this rework to avoid having to introduce a temporary ws capability, but it should also not land after this as it would then require a new capability. Tests pass locally.

Also, with the fixups I pushed earlier all my (relevant) remarks got addressed.

martinpitt added a commit to martinpitt/cockpit that referenced this pull request Mar 15, 2017
This is an old bug (https://bugzilla.redhat.com/show_bug.cgi?id=1235962)
but the cockpit-ssh rework (PR cockpit-project#6102) now changes the timing to exihibit
this much more often.

This affects at least check-multi-machine and check-dashboard, so keep
the quirk generic to match unexpected journal messages from any test.

Closes cockpit-project#6109
@martinpitt
Copy link
Member

Built/verified the fix manually, works great now. Thanks @petervo!

stefwalter pushed a commit that referenced this pull request Mar 15, 2017
This is an old bug (https://bugzilla.redhat.com/show_bug.cgi?id=1235962)
but the cockpit-ssh rework (PR #6102) now changes the timing to exihibit
this much more often.

This affects at least check-multi-machine and check-dashboard, so keep
the quirk generic to match unexpected journal messages from any test.

Closes #6109
Reviewed-by: Stef Walter <stefw@redhat.com>
cockpit_router_ban_hosts (CockpitRouter *self)
{
RouterRule *rule;
JsonObject *match = json_object_new ();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't an appropriate "bridges" entry in src/base1/manifest.json.in solve this? If not, it's worth documenting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially that's what I had. But cockpit-stub initially didn't load bridges. So hosts were falling through to the regular channel processing, leading to weird hard to debug results. Since hosts is so fundamental to our protocol, i figured that we should just refuse by default the way we used to even if no bridges get loaded.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Looks like we'll have to add checksum handling here too ...

@petervo petervo force-pushed the bridge-ssh branch 2 times, most recently from 315bae3 to 2d46021 Compare March 15, 2017 20:20
@stefwalter stefwalter added bot and removed bot labels Mar 15, 2017
petervo and others added 11 commits March 15, 2017 21:56
It will be up to the bridges to handle connecting to
additional hosts
Only implement "kill" with a "host" when the "host" matches
the one we received in our "init" message. For other cases
the "kill" command is forwarded to a peer bridge.
When an ssh session is private only for specific channel(s)
then when those channels are gone, close the ssh session
immediately. No need to wait for the timeout in this case.
There is no real reason for maintaining our own
/var/lib/cockpit/known_hosts file, as ssh itself already has a global
one in /etc/ssh/ssh_known_hosts. Use that by default, but fallback to
the legacy file for (1) lookups if a host is not already known in the
former but known in the latter; and (2) for writing if the ws we talk to
is still an old version (by checking if ws still has the "ssh"
capability).

Move the determination and setting of the known hosts file into a new
set_knownhosts_file() function, as it is now reasonably complex, will
be extended further in the future with more sources of known hosts, and
avoids handling SSH_OPTIONS_KNOWNHOSTS in multiple different places.

Adjust the integration tests to the new path and add new tests for
covering the fallback to the legacy file.
Remove key tests from atomic
kubernetes is are only use case for this for now.
Let's just distribute there to keep it out of cockpit
ws. Added a provides so we can split it out later
if needed.
This ensures that a router will refuse to process
a open command with a host unless specifially configured
to do so.
Commit 1dab8a5 hardcoded the install path of cockpit-pcp, but
libexecdir is different in Debianish distros. We don't care about the
particular location, so use pattern matching instead.
The message is

  cockpit-ssh admin@10.111.112.135:22: -1 couldn't connect: Connection refused '10.111.112.135' '22''

So we were missing the "-1" after the colon, and missed a period for the
'*'.
@martinpitt
Copy link
Member

Fixed tests harder

@stefwalter
Copy link
Contributor

check-loopback passes on debian-testing failure is intermittent

stefwalter pushed a commit that referenced this pull request Mar 15, 2017
Closes #6102
Reviewed-by: Stef Walter <stefw@redhat.com>
@petervo petervo deleted the bridge-ssh branch March 15, 2017 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-blocker Targetted for next release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants