Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

common: Fix deadlock in channel pressure handling #14079

Merged
merged 1 commit into from May 13, 2020

Conversation

martinpitt
Copy link
Member

@martinpitt martinpitt commented May 12, 2020

When a channel applies back-pressure to its input because its output
queue is getting too large, it relies on the consumer to acknowledge the
reads by ponging to the pings that the channel sent out.

However, it can happen that the last ping was sent for a sequence number
before the one that the channel expects for lifting the pressure. Then
the receiver could drain the output queue and dutifully respond to all
pings, but the last sequence number would still be smaller than the
treshold for lifting the pressure. But the channel isn't sending
anything else either and thus never generates new pings.

Fix this by sending an additional ping when going into pressure mode, to
ensure that the last pong sequence number is within the out_window
again.

This can also be reproduced/investigated with this test-spawn-proc unit
test:

QUnit.only("stream large output - find", function (assert) {
    const done = assert.async();
    assert.expect(2);

    cockpit.spawn(["find", "/"], { err: "ignore" })
            .stream(function(resp) {
                console.log("got block", resp.length);
            })
            .then(function(resp) {
                assert.equal(resp, "", "no done data");
            })
            .always(function() {
                assert.equal(this.state(), "resolved", "didn't fail");
                done();
            });
});

But this is not appropriate for actually committing -- the find often
fails at the end, and takes too many IO resources for a unit test. The
integration test confirms the fix, though.

Fixes #12580
https://bugzilla.redhat.com/show_bug.cgi?id=1751783

@martinpitt
Copy link
Member Author

I talked this over with @stefwalter , and he confirmed that this is a valid fix. The actual new flooding test in check-terminal succeeds everywhere, I just botched the subsequent checks. I can't run this locally due to chromium changes. I have a fix prepared locally, but I also still want to reformat the code a bit.

@martinpitt martinpitt force-pushed the spawn-flooding branch 2 times, most recently from 61555e9 to c933802 Compare May 12, 2020 18:27
@martinpitt martinpitt marked this pull request as ready for review May 12, 2020 18:27
@martinpitt martinpitt added the release-blocker Targetted for next release label May 12, 2020
@martinpitt
Copy link
Member Author

PR #14075 landed, and I applied the code clenaup. @stefwalter, do you want to officially review that?

@mvollmer
Copy link
Member

Nice! This makes sense to me, for what it's worth.

@martinpitt
Copy link
Member Author

@mvollmer brought up that it's not clear that we avoid ping/pong loops with the same sequence number. I changed the level to edge triggering to ensure this.

@martinpitt martinpitt requested review from mvollmer and removed request for allisonkarlitskaya May 13, 2020 08:00
if (out_sequence / CHANNEL_FLOW_PING != priv->out_sequence / CHANNEL_FLOW_PING)
/* If we've sent more than the window, we just got under pressure;
* do an edge trigger instead of level trigger to avoid ping/signal loops */
under_pressure = (priv->out_sequence <= priv->out_window) && (out_sequence > priv->out_window);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is edge triggered then the variable name needs to change. The variable is named for a level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I renamed it to trigger_pressure.

priv->out_sequence = out_sequence;
if (priv->out_sequence > priv->out_window)

if (under_pressure)
{
g_debug ("%s: sent too much data without acknowledgement, emitting back pressure until %"
G_GINT64_FORMAT, priv->id, priv->out_window);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means this emit_pressure call here has also changed from level triggered to edge triggered. Is that in line with all the other uses of this function call? Worth doing a check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/bridge/cockpitpacketchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/bridge/cockpitpacketchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

Edge triggered. ^^

src/bridge/cockpitstream.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/bridge/cockpitstream.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

Edge triggered ^^

src/common/cockpitchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);

This other call in CockpitChannel is edge triggered. But may have an off by one on line 248. I need to double check this.

src/common/cockpitchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

This is the one we're changing.

src/common/cockpitpipe.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/common/cockpitpipe.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

Edge triggered ^^

src/common/cockpitwebresponse.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/common/cockpitwebresponse.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

Edge triggered ^^

src/websocket/websocketconnection.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/websocket/websocketconnection.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);

Edge triggered ^^

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we just figured out, all other pressure calls use edge triggering already, so that gets cockpitchannel.c in line with the others.

When a channel applies back-pressure to its input because its output
queue is getting too large, it relies on the consumer to acknowledge the
reads by ponging to the pings that the channel sent out.

However, it can happen that the last ping was sent for a sequence number
*before* the one that the channel expects for lifting the pressure. Then
the receiver could drain the output queue and dutifully respond to all
pings, but the last sequence number would still be smaller than the
treshold for lifting the pressure. But the channel isn't sending
anything else either and thus never generates new pings.

Fix this by sending an additional ping when going into pressure mode, to
ensure that the last pong sequence number is within the out_window
again. Ensure that this only happens on edge trigger, so that we never
send out the extra ping more than once. This also avoids raising the
pressure-on signal more than once as a side effect.

This can also be reproduced/investigated with this test-spawn-proc unit
test:
```js
QUnit.only("stream large output - find", function (assert) {
    const done = assert.async();
    assert.expect(2);

    cockpit.spawn(["find", "/"], { err: "ignore" })
            .stream(function(resp) {
                console.log("got block", resp.length);
            })
            .then(function(resp) {
                assert.equal(resp, "", "no done data");
            })
            .always(function() {
                assert.equal(this.state(), "resolved", "didn't fail");
                done();
            });
});
```
But this is not appropriate for actually committing -- the find often
fails at the end, and takes too many IO resources for a unit test. The
integration test confirms the fix, though.

Fix the integration test to happen before resetting the terminal, as the
subsequent checks expect it to be empty.

Fixes cockpit-project#12580
https://bugzilla.redhat.com/show_bug.cgi?id=1751783

Closes cockpit-project#14079
@martinpitt martinpitt merged commit 7eafa1a into cockpit-project:master May 13, 2020
@martinpitt martinpitt deleted the spawn-flooding branch May 13, 2020 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-blocker Targetted for next release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

spawned processes with lots of output get stuck
3 participants