common: Fix deadlock in channel pressure handling #14079

martinpitt · 2020-05-12T12:53:34Z

When a channel applies back-pressure to its input because its output
queue is getting too large, it relies on the consumer to acknowledge the
reads by ponging to the pings that the channel sent out.

However, it can happen that the last ping was sent for a sequence number
before the one that the channel expects for lifting the pressure. Then
the receiver could drain the output queue and dutifully respond to all
pings, but the last sequence number would still be smaller than the
threshold for lifting the pressure. But the channel isn't sending
anything else either and thus never generates new pings.

Fix this by sending an additional ping when going into pressure mode, to
ensure that the last pong sequence number is within the out_window
again.

This can also be reproduced/investigated with this test-spawn-proc unit
test:

```js
QUnit.only("stream large output - find", function (assert) {
    const done = assert.async();
    assert.expect(2);

    cockpit.spawn(["find", "/"], { err: "ignore" })
            .stream(function(resp) {
                console.log("got block", resp.length);
            })
            .then(function(resp) {
                assert.equal(resp, "", "no done data");
            })
            .always(function() {
                assert.equal(this.state(), "resolved", "didn't fail");
                done();
            });
});
```

But this is not appropriate for actually committing -- the find often
fails at the end, and takes too many IO resources for a unit test. The
integration test confirms the fix, though.

Fixes #12580
https://bugzilla.redhat.com/show_bug.cgi?id=1751783

Builds on top of various unit test fixes: PR #14075 (Various unit test/debugging fixes)

- Update PR reference in integration test
- Revert the order again and beautify the code

martinpitt · 2020-05-12T16:05:51Z

I talked this over with @stefwalter, and he confirmed that this is a valid fix. The new flooding test in check-terminal itself succeeds everywhere; I just botched the subsequent checks. I can't run this locally due to chromium changes. I have a fix prepared locally, but I also still want to reformat the code a bit.

martinpitt · 2020-05-12T18:29:00Z

PR #14075 landed, and I applied the code cleanup. @stefwalter, do you want to officially review that?

mvollmer · 2020-05-13T07:12:35Z

Nice! This makes sense to me, for what it's worth.

martinpitt · 2020-05-13T07:59:36Z

@mvollmer brought up that it's not clear that we avoid ping/pong loops with the same sequence number. I changed the pressure check from level triggering to edge triggering to ensure this.

stefwalter · 2020-05-13T08:49:09Z

src/common/cockpitchannel.c

-      if (out_sequence / CHANNEL_FLOW_PING != priv->out_sequence / CHANNEL_FLOW_PING)
+      /* If we've sent more than the window, we just got under pressure;
+       * do an edge trigger instead of level trigger to avoid ping/signal loops */
+      under_pressure = (priv->out_sequence <= priv->out_window) && (out_sequence > priv->out_window);


If this is edge triggered then the variable name needs to change. The variable is named for a level.

Fair enough. I renamed it to trigger_pressure.

stefwalter · 2020-05-13T08:50:03Z

src/common/cockpitchannel.c

      priv->out_sequence = out_sequence;
-      if (priv->out_sequence > priv->out_window)
+
+      if (under_pressure)
        {
          g_debug ("%s: sent too much data without acknowledgement, emitting back pressure until %"
                   G_GINT64_FORMAT, priv->id, priv->out_window);


That means this emit_pressure call here has also changed from level triggered to edge triggered. Is that in line with all the other uses of this function call? Worth doing a check.

src/bridge/cockpitpacketchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/bridge/cockpitpacketchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

Edge triggered. ^^

src/bridge/cockpitstream.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/bridge/cockpitstream.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

Edge triggered ^^

src/common/cockpitchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);

This other call in CockpitChannel is edge triggered, but it may have an off-by-one error on line 248. I need to double check this.

src/common/cockpitchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

This is the one we're changing.

src/common/cockpitpipe.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/common/cockpitpipe.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

Edge triggered ^^

src/common/cockpitwebresponse.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/common/cockpitwebresponse.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

Edge triggered ^^

src/websocket/websocketconnection.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/websocket/websocketconnection.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

Edge triggered ^^

As we just figured out, all other pressure calls use edge triggering already, so that gets cockpitchannel.c in line with the others.

When a channel applies back-pressure to its input because its output
queue is getting too large, it relies on the consumer to acknowledge the
reads by ponging to the pings that the channel sent out.

However, it can happen that the last ping was sent for a sequence number
*before* the one that the channel expects for lifting the pressure. Then
the receiver could drain the output queue and dutifully respond to all
pings, but the last sequence number would still be smaller than the
threshold for lifting the pressure. But the channel isn't sending
anything else either and thus never generates new pings.

Fix this by sending an additional ping when going into pressure mode, to
ensure that the last pong sequence number is within the out_window
again. Ensure that this only happens on edge trigger, so that we never
send out the extra ping more than once. This also avoids raising the
pressure-on signal more than once as a side effect.

This can also be reproduced/investigated with this test-spawn-proc unit
test:

```js
QUnit.only("stream large output - find", function (assert) {
    const done = assert.async();
    assert.expect(2);

    cockpit.spawn(["find", "/"], { err: "ignore" })
            .stream(function(resp) {
                console.log("got block", resp.length);
            })
            .then(function(resp) {
                assert.equal(resp, "", "no done data");
            })
            .always(function() {
                assert.equal(this.state(), "resolved", "didn't fail");
                done();
            });
});
```

But this is not appropriate for actually committing -- the find often
fails at the end, and takes too many IO resources for a unit test. The
integration test confirms the fix, though.

Fix the integration test to happen before resetting the terminal, as the
subsequent checks expect it to be empty.

Fixes cockpit-project#12580
https://bugzilla.redhat.com/show_bug.cgi?id=1751783

Closes cockpit-project#14079

martinpitt force-pushed the spawn-flooding branch from 2ac59dc to fd057bd Compare May 12, 2020 12:55

martinpitt mentioned this pull request May 12, 2020

spawned processes with lots of output get stuck #12580

Closed

martinpitt requested a review from stefwalter May 12, 2020 12:56

martinpitt force-pushed the spawn-flooding branch 2 times, most recently from 61555e9 to c933802 Compare May 12, 2020 18:27

martinpitt marked this pull request as ready for review May 12, 2020 18:27

martinpitt requested a review from allisonkarlitskaya May 12, 2020 18:27

martinpitt added the release-blocker Targetted for next release label May 12, 2020

martinpitt force-pushed the spawn-flooding branch from c933802 to af3e1d6 Compare May 13, 2020 07:58

martinpitt requested review from mvollmer and removed request for allisonkarlitskaya May 13, 2020 08:00

stefwalter reviewed May 13, 2020

martinpitt force-pushed the spawn-flooding branch from af3e1d6 to fe309c8 Compare May 13, 2020 09:48

martinpitt requested a review from stefwalter May 13, 2020 09:49

mvollmer approved these changes May 13, 2020

martinpitt merged commit 7eafa1a into cockpit-project:master May 13, 2020

martinpitt deleted the spawn-flooding branch May 13, 2020 14:31
