New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
common: Fix deadlock in channel pressure handling #14079
Conversation
2ac59dc
to
fd057bd
Compare
I talked this over with @stefwalter , and he confirmed that this is a valid fix. The actual new flooding test in check-terminal succeeds everywhere, I just botched the subsequent checks. I can't run this locally due to chromium changes. I have a fix prepared locally, but I also still want to reformat the code a bit. |
61555e9
to
c933802
Compare
PR #14075 landed, and I applied the code clenaup. @stefwalter, do you want to officially review that? |
Nice! This makes sense to me, for what it's worth. |
c933802
to
af3e1d6
Compare
@mvollmer brought up that it's not clear that we avoid ping/pong loops with the same sequence number. I changed the level to edge triggering to ensure this. |
src/common/cockpitchannel.c
Outdated
if (out_sequence / CHANNEL_FLOW_PING != priv->out_sequence / CHANNEL_FLOW_PING) | ||
/* If we've sent more than the window, we just got under pressure; | ||
* do an edge trigger instead of level trigger to avoid ping/signal loops */ | ||
under_pressure = (priv->out_sequence <= priv->out_window) && (out_sequence > priv->out_window); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is edge triggered then the variable name needs to change. The variable is named for a level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I renamed it to trigger_pressure
.
priv->out_sequence = out_sequence; | ||
if (priv->out_sequence > priv->out_window) | ||
|
||
if (under_pressure) | ||
{ | ||
g_debug ("%s: sent too much data without acknowledgement, emitting back pressure until %" | ||
G_GINT64_FORMAT, priv->id, priv->out_window); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means this emit_pressure call here has also changed from level triggered to edge triggered. Is that in line with all the other uses of this function call? Worth doing a check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/bridge/cockpitpacketchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/bridge/cockpitpacketchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);
Edge triggered. ^^
src/bridge/cockpitstream.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/bridge/cockpitstream.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);
Edge triggered ^^
src/common/cockpitchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
This other call in CockpitChannel is edge triggered. But may have an off by one on line 248. I need to double check this.
src/common/cockpitchannel.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);
This is the one we're changing.
src/common/cockpitpipe.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/common/cockpitpipe.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);
Edge triggered ^^
src/common/cockpitwebresponse.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/common/cockpitwebresponse.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);
Edge triggered ^^
src/websocket/websocketconnection.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), FALSE);
src/websocket/websocketconnection.c: cockpit_flow_emit_pressure (COCKPIT_FLOW (self), TRUE);
Edge triggered ^^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we just figured out, all other pressure calls use edge triggering already, so that gets cockpitchannel.c in line with the others.
When a channel applies back-pressure to its input because its output queue is getting too large, it relies on the consumer to acknowledge the reads by ponging to the pings that the channel sent out. However, it can happen that the last ping was sent for a sequence number *before* the one that the channel expects for lifting the pressure. Then the receiver could drain the output queue and dutifully respond to all pings, but the last sequence number would still be smaller than the treshold for lifting the pressure. But the channel isn't sending anything else either and thus never generates new pings. Fix this by sending an additional ping when going into pressure mode, to ensure that the last pong sequence number is within the out_window again. Ensure that this only happens on edge trigger, so that we never send out the extra ping more than once. This also avoids raising the pressure-on signal more than once as a side effect. This can also be reproduced/investigated with this test-spawn-proc unit test: ```js QUnit.only("stream large output - find", function (assert) { const done = assert.async(); assert.expect(2); cockpit.spawn(["find", "/"], { err: "ignore" }) .stream(function(resp) { console.log("got block", resp.length); }) .then(function(resp) { assert.equal(resp, "", "no done data"); }) .always(function() { assert.equal(this.state(), "resolved", "didn't fail"); done(); }); }); ``` But this is not appropriate for actually committing -- the find often fails at the end, and takes too many IO resources for a unit test. The integration test confirms the fix, though. Fix the integration test to happen before resetting the terminal, as the subsequent checks expect it to be empty. Fixes cockpit-project#12580 https://bugzilla.redhat.com/show_bug.cgi?id=1751783 Closes cockpit-project#14079
af3e1d6
to
fe309c8
Compare
When a channel applies back-pressure to its input because its output
queue is getting too large, it relies on the consumer to acknowledge the
reads by ponging to the pings that the channel sent out.
However, it can happen that the last ping was sent for a sequence number
before the one that the channel expects for lifting the pressure. Then
the receiver could drain the output queue and dutifully respond to all
pings, but the last sequence number would still be smaller than the
treshold for lifting the pressure. But the channel isn't sending
anything else either and thus never generates new pings.
Fix this by sending an additional ping when going into pressure mode, to
ensure that the last pong sequence number is within the out_window
again.
This can also be reproduced/investigated with this test-spawn-proc unit
test:
But this is not appropriate for actually committing -- the find often
fails at the end, and takes too many IO resources for a unit test. The
integration test confirms the fix, though.
Fixes #12580
https://bugzilla.redhat.com/show_bug.cgi?id=1751783