onResponseHeaders not called on continuous sendData #2213

carloseltuerto · 2022-04-26T17:35:55Z

Pushing more ByteBuffers always has priority over performing the onResponseHeaders callback. You could have received the ResponseHeaders one hour ago - still no callback indicating so if you keep pushing more data.

This problem is systematic on H2 with explicit flow control when using the EM Network Thread to perform the callback logic.

There is a specific test for this in Cronet: https://source.chromium.org/chromium/chromium/src/+/main:components/cronet/android/test/javatests/src/org/chromium/net/BidirectionalStreamTest.java;l=465
This test is being ported, but is currently failing due to this issue.

And there is a PR with a simple test demonstrating the failure: #2212

RyanTheOptimist · 2022-04-26T17:57:55Z

Interesting bug! I'm slightly unclear about something. In the test, data is sent via setOnSendWindowAvailable() which, I assume, takes a callback which is executed when new send window is available. In the test, is there ever a period when no send window is available?

carloseltuerto · 2022-04-26T18:14:24Z

Here is the buggy sequence:

create stream
send headers
send first buffer: EM will then send the first OnSendWindowAvailable once ready
EM receives the response headers: here, the onResponseHeaders is not invoked
EM invokes the OnSendWindowAvailable callback: that callback writes one more buffer
EM invokes the OnSendWindowAvailable callback: that callback writes one more buffer
... one trillion times
EM invokes the OnSendWindowAvailable callback: that callback writes the last buffer
EM invokes the onResponseHeaders callback which invokes cancel: this is too late
EM invokes onCancel: the stream has ended

Here is how this should go, more or less:

create stream
send headers
send first buffer: EM will then send the first OnSendWindowAvailable once ready
EM invokes the OnSendWindowAvailable callback: that callback writes one more buffer
EM receives the response headers
EM invokes the onResponseHeaders callback which invokes cancel
EM invokes the onCancel callback: the stream has ended

RyanTheOptimist · 2022-04-26T18:16:44Z

Right, that makes sense. But I'm confused why we seem to have infinite send window available. Do you understand why we're never hitting a limit?

(Independent of the issue you're raising that onResponseHeaders should have priority over OnSendWindowAvailable)

RyanTheOptimist · 2022-04-26T18:40:21Z

Looking into this a bit, it looks like Client::sendData() calls onSendWindowAvailable() immediately if there is still window available:

void Client::sendData(envoy_stream_t stream, envoy_data data, bool end_stream) {
...
  if (direct_stream) {
...
    if (direct_stream->explicit_flow_control_ && !end_stream) {
      if (direct_stream->read_disable_count_ == 0) {
        // If there is still buffer space after the write, notify the sender                                                 
        // that send window is available.                                                                                    
        direct_stream->wants_write_notification_ = false;
        direct_stream->callbacks_->onSendWindowAvailable();
      } else {
...
      }
    }
  }
}

@alyssawilk it looks like this code came from #1545. Should we consider doing something async here?

carloseltuerto · 2022-04-26T18:43:20Z

This is not an infinite window. In fact this is very sequential, one ByteBuffer at a time.

stream.sendData(buffer, false); // false means more to come
EM callback with OnSendWindowAvailable
stream.sendData(buffer, false);
EM callback with OnSendWindowAvailable
stream.sendData(buffer, false);
... one trillion times
EM callback with OnSendWindowAvailable
stream.sendData(buffer, true); // true means this was the last buffer.

Look at the Cronet test:
https://source.chromium.org/chromium/chromium/src/+/main:components/cronet/android/test/javatests/src/org/chromium/net/BidirectionalStreamTest.java;l=465
You will notice that each time the onWriteCompleted is called, it sends one more ByteBuffer: this is an infinite loop. The 'cancel' is supposed to break that loop. But the 'cancel' is executed on the onResponseHeaders callback which never occurs with EM.

RyanTheOptimist · 2022-04-26T18:50:15Z

Hm. I'm not sure we're on the same page. With, say, HTTP/2 or HTTP/3 there is a per-stream limit on the amount of data the client is allowed to send to the peer. With HTTP/1 this limit is in the OS at the socket layer. I would have expected that we would eventually exhaust this limit, but it doesn't look like that is happening in this test and that confuses me.

I agree with you that onResponseHeaders should be called and the fact that it is not called is a problem. I'm just trying to understand the test setup.

RyanTheOptimist · 2022-04-26T19:09:03Z

So to follow up from the offline discussion, the test sends only 1000 bytes (albeit one byte at a time) which explains why we never hit the limit. Thanks, this explains it!

but scheduling repeated onSendWindowAvailable calls in the next dispatcher iteration. Adds the regression test Charles cooked up in envoyproxy#2212 Fixes envoyproxy#2213

by scheduling repeated onSendWindowAvailable calls in the next dispatcher iteration. Adds the regression test Charles cooked up in envoyproxy#2212 Fixes envoyproxy#2213 Signed-off-by: Ryan Hamilton <rch@google.com>

RyanTheOptimist · 2022-05-02T03:57:55Z

After cooking up a fix for this issue in #2221 and discussing the behavior with @goaway, I think this is an artifact of how the Cronet executor is implemented. the onSendWindowAvailable() callback is handed off to the executor to schedule. In non-cronet executors, this happens on a different thread from the Envoy networking thread. As a result, the thread hopping allows the read loop to be pumped and the headers to be delivered. However, assuming we understand the Cronet executor correctly, with Cronet the callback happens on the same thread as the networking thread and so the read loop is never pumped.

While we could land a fix like #2221, it's not clear that is the right approach since it is at odds with how other callbacks work in Envoy Mobile. I think that probably a change to the Cronet executor is probably in order. @carloseltuerto, what do you think?

carloseltuerto · 2022-05-02T10:32:56Z

In this case, the user specifically requested to use the Network Thread to run all the business logic.

https://source.chromium.org/chromium/chromium/src/+/main:components/cronet/android/test/javatests/src/org/chromium/net/BidirectionalStreamTest.java;l=469

Chromium/Cronet behaves as expected under this constraint: Cronet runs all the user code under the Network Thread, in the expected order.

In other words, the sole Executor provided to Cronvoy is a Direct Executor: there is nothing else to choose from. Unless Cronvoy unilaterally spawns a "private" Executor Thread, ignoring the user's prescription.

Also, please keep in mind that Cronet was designed to let the savvy user to use the Network Thread:

https://source.chromium.org/chromium/chromium/src/+/main:components/cronet/android/api/src/org/chromium/net/UrlRequest.java;l=109?q=allow%20direct%20executor

This is obviously an optimisation - this avoids context switching between each request body ByteBuffer. We could take the decision to ignore this optimisation, but this comes probably at a price: for some performance tests, Cronvoy won't be able to surpass Cronet.

RyanTheOptimist · 2022-05-02T14:19:08Z

Oh, interesting. Thanks! I had forgotten about Cronet's direct executor mode. Hm. Ok, I'll do some more investigation. (I'm also curious about the buffering differences between Cronet and Envoy Mobile and if there are semantic differences between Cronet's onWriteComplete() and Envoy Mobile's onSendWindowAvailable())

carloseltuerto · 2022-05-02T17:00:52Z

The semantic is the opposite at the surface. Cronet tells you when it is finished, and EM tells you when it is ready for next ByteBuffer. The Cronvoy implementation needs to fill the gap here. Since there is no final callback for the "last write", Cronvoy takes the liberty of immediately scheduling the Cronet's onWriteComplete last callback upon writing the last ByteBuffer. This could look like a "lie" - the last ByteBuffer might not in reality have hit the wire when onWriteComplete gets called. Still, for all intent and purposes, this does not matter at all - if there is a problem, onError will get called anyway. And I do not know how Cronet implemented this: I would not be surprised if the Cronet C++ layer does exactly the same.

RyanTheOptimist · 2022-05-02T20:54:05Z

OK, I've spend the day coming up to speed on the behavior of BidirectionalStreamTest#testCancelWhileWriteDataPending in Cronet and at this point I think I understand what is happening. As @carloseltuerto said, the OnWriteCompleted() is being executed synchronously on the network thread. However, there is a significant difference between when Cronet's OnWriteCompleted() method is invoked and when Envoy Mobile's onSendWindowAvailable() method is. Namely, Cronet's OnWriteCompleted() is invoked immediately when data is written to the socket's send buffer (roughly) whereas Envoy Mobile's onSendWindowAvailable() method is always invoked, so long as send window is available. This means that mapping Envoy Mobile's onSendWindowAvailable() to Cronet's OnWriteCompleted() does not yield the same semantics. In particular, the reason Cronet's test passes is that the OnWriteCompleted() is only able to starve the read loop until the socket's send buffer fills up. However, the test fails in Envoy Mobile because onSendWindowAvailable() will block until the entire send window is exhausted. It looks like in the test setup the send window is about 1024000 bytes, whereas in the cronet test the send buffer doubles with each RTT but is at about 12K by the time the response headers arrive.

Ok, so I think that explains what is happening. But I'm not entirely sure what we should do to fix this. @alyssawilk how feasible would it be for Envoy/Envoy Mobile to expose a signal more like Cronet's OnWriteComplete? Maybe this is really hard due to the communication / buffering that happens between upstream and downstream?

RyanTheOptimist · 2022-05-05T17:20:13Z

Ok, let me summarize where we are here. Cronet direct executor mode is an important performance optimization in which writes happen on the network thread. It seems like we should preserve this behavior. PR #2221 attempts to resolve this issue by scheduling the onSendWindowAvailable() on the next iteration of the dispatcher. However, Envoy Mobile callbacks are typically invoked on the current dispatcher loop. There is concern that moving to the next iteration of dispatcher will be problematic, based on previous experience with deferredDelete(). At the same time, it sounds like from conversations in the Envoy Mobile meeting last week that it would generally desirable to move callback to the next iteration of the message loop.

After taking with @alyssawilk, maybe the right approach would be to try to land #2221 and see what it reveals. If we are able to sort out any issues then we can use this as a first step towards moving all Envoy Mobile callbacks.

@goaway, how does that sound to you?

goaway · 2022-05-05T17:51:48Z

@RyanTheOptimist it looks like at least in the current version of #2221 the new logic is fully guarded by explicit_flow_control_ being enabled. Given that's off by default, I'm comfortable with moving forward with that approach.

…g, (#2221) Description: Fix Envoy Mobile bug where writing prevents the read loop from running, by scheduling repeated onSendWindowAvailable calls in the next dispatcher iteration. Adds the regression test Charles cooked up in #2212 Fixes #2213 Risk Level: Low Testing: New regression test Docs Changes: N/A Release Notes: N/A Signed-off-by: Ryan Hamilton <rch@google.com>

carloseltuerto · 2022-05-16T18:00:42Z

Unfortunately, this fix crashes EM.

Trying to remove this @Ignore ends up with this stack trace:

https://github.com/envoyproxy/envoy-mobile/blob/main/test/java/org/chromium/net/BidirectionalStreamTest.java#L453

---------------  T H R E A D  ---------------

Current thread (0x000000013ea0e800):  JavaThread "EnvoyMain" [_thread_in_vm, id=173835, stack(0x000000016c360000,0x000000016c3e3000)]

Stack: [0x000000016c360000,0x000000016c3e3000],  sp=0x000000016c3e1c80,  free space=519k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x37bd6c]  jni_CallObjectMethodV+0x104
C  [libndk_envoy_jni.jnilib+0x7550]  _ZN7JNIEnv_16CallObjectMethodEP8_jobjectP10_jmethodIDz+0x40
C  [libndk_envoy_jni.jnilib+0x60a4]  _ZL28jvm_on_send_window_available18envoy_stream_intelPv+0xa8
C  [libndk_envoy_jni.jnilib+0xe15c74]  _ZN5Envoy4Http6Client21DirectStreamCallbacks21onSendWindowAvailableEv+0x2b0
C  [libndk_envoy_jni.jnilib+0xe3b274]  _ZZN5Envoy4Http6Client8sendDataEl10envoy_databENK3$_3clEv+0x24
C  [libndk_envoy_jni.jnilib+0xe3b230]  _ZNSt3__1L8__invokeIRZN5Envoy4Http6Client8sendDataEl10envoy_databE3$_3JEEEDTclscT_fp_spscT0_fp0_EEOS7_DpOS8_+0x18
C  [libndk_envoy_jni.jnilib+0xe3b1e8]  _ZNSt3__128__invoke_void_return_wrapperIvLb1EE6__callIJRZN5Envoy4Http6Client8sendDataEl10envoy_databE3$_3EEEvDpOT_+0x1c
C  [libndk_envoy_jni.jnilib+0xe3b1c0]  _ZNSt3__110__function12__alloc_funcIZN5Envoy4Http6Client8sendDataEl10envoy_databE3$_3NS_9allocatorIS6_EEFvvEEclEv+0x1c
C  [libndk_envoy_jni.jnilib+0xe39b28]  _ZNSt3__110__function6__funcIZN5Envoy4Http6Client8sendDataEl10envoy_databE3$_3NS_9allocatorIS6_EEFvvEEclEv+0x1c
C  [libndk_envoy_jni.jnilib+0x2879c0]  _ZNKSt3__110__function12__value_funcIFvvEEclEv+0x3c
C  [libndk_envoy_jni.jnilib+0x287978]  _ZNKSt3__18functionIFvvEEclEv+0x18
C  [libndk_envoy_jni.jnilib+0x1ff9c20]  _ZZN5Envoy5Event14DispatcherImpl25createSchedulableCallbackENSt3__18functionIFvvEEEENK3$_6clEv+0x2c
C  [libndk_envoy_jni.jnilib+0x1ff9bd4]  _ZNSt3__1L8__invokeIRZN5Envoy5Event14DispatcherImpl25createSchedulableCallbackENS_8functionIFvvEEEE3$_6JEEEDTclscT_fp_spscT0_fp0_EEOS9_DpOSA_+0x18
C  [libndk_envoy_jni.jnilib+0x1ff9b8c]  _ZNSt3__128__invoke_void_return_wrapperIvLb1EE6__callIJRZN5Envoy5Event14DispatcherImpl25createSchedulableCallbackENS_8functionIFvvEEEE3$_6EEEvDpOT_+0x1c
C  [libndk_envoy_jni.jnilib+0x1ff9b64]  _ZNSt3__110__function12__alloc_funcIZN5Envoy5Event14DispatcherImpl25createSchedulableCallbackENS_8functionIFvvEEEE3$_6NS_9allocatorIS8_EES6_EclEv+0x1c
C  [libndk_envoy_jni.jnilib+0x1ff86d8]  _ZNSt3__110__function6__funcIZN5Envoy5Event14DispatcherImpl25createSchedulableCallbackENS_8functionIFvvEEEE3$_6NS_9allocatorIS8_EES6_EclEv+0x1c
C  [libndk_envoy_jni.jnilib+0x2879c0]  _ZNKSt3__110__function12__value_funcIFvvEEclEv+0x3c
C  [libndk_envoy_jni.jnilib+0x287978]  _ZNKSt3__18functionIFvvEEclEv+0x18
C  [libndk_envoy_jni.jnilib+0x2917588]  _ZZN5Envoy5Event23SchedulableCallbackImplC1ERNS_9CSmartPtrI10event_baseXadL_Z15event_base_freeEEEENSt3__18functionIFvvEEEENK3$_0clEisPv+0x30
C  [libndk_envoy_jni.jnilib+0x291754c]  _ZZN5Envoy5Event23SchedulableCallbackImplC1ERNS_9CSmartPtrI10event_baseXadL_Z15event_base_freeEEEENSt3__18functionIFvvEEEEN3$_08__invokeEisPv+0x2c
C  [libndk_envoy_jni.jnilib+0x29ebb24]  event_process_active_single_queue+0x434
C  [libndk_envoy_jni.jnilib+0x29e5de4]  event_process_active+0x158
C  [libndk_envoy_jni.jnilib+0x29e4c24]  event_base_loop+0x4b0
C  [libndk_envoy_jni.jnilib+0x2912a34]  _ZN5Envoy5Event17LibeventScheduler3runENS0_10Dispatcher7RunTypeE+0x80
C  [libndk_envoy_jni.jnilib+0x1fd6a48]  _ZN5Envoy5Event14DispatcherImpl3runENS0_10Dispatcher7RunTypeE+0x54
C  [libndk_envoy_jni.jnilib+0x2c5fd8]  _ZN5Envoy6Server12InstanceImpl3runEv+0x3b0
C  [libndk_envoy_jni.jnilib+0xf9654]  _ZN5Envoy14MainCommonBase3runEv+0x64
C  [libndk_envoy_jni.jnilib+0x946c0]  _ZN5Envoy12EngineCommon3runEv+0x1c
C  [libndk_envoy_jni.jnilib+0x93be8]  _ZN5Envoy6Engine4mainENSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEES7_+0xff0
C  [libndk_envoy_jni.jnilib+0xb585c]  _ZNSt3__1L8__invokeIMN5Envoy6EngineEF14envoy_status_tNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEES9_EPS2_JS9_S9_EvEEDTcldsdescT0_fp0_fp_spscT1_fp1_EEOT_OSD_DpOSE_+0xa8
C  [libndk_envoy_jni.jnilib+0xb5768]  _ZNSt3__1L16__thread_executeINS_10unique_ptrINS_15__thread_structENS_14default_deleteIS2_EEEEMN5Envoy6EngineEF14envoy_status_tNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESE_EJPS7_SE_SE_EJLm2ELm3ELm4EEEEvRNS_5tupleIJT_T0_DpT1_EEENS_15__tuple_indicesIJXspT2_EEEE+0x60
C  [libndk_envoy_jni.jnilib+0xb4d8c]  _ZNSt3__1L14__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN5Envoy6EngineEF14envoy_status_tNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESF_EPS8_SF_SF_EEEEEPvSK_+0x54
C  [libsystem_pthread.dylib+0x726c]  _pthread_start+0x94

…cancelled. Fixes: envoyproxy#2213 Signed-off-by: Ryan Hamilton <rch@google.com>

…cancelled (#2292) Cancel the scheduled onSendWindowAvailable callback when a stream is cancelled. Fixes: #2213 Signed-off-by: Ryan Hamilton <rch@google.com>

carloseltuerto added bug Something isn't working cronvoy-beta labels Apr 26, 2022

carloseltuerto assigned RyanTheOptimist Apr 26, 2022

RyanTheOptimist mentioned this issue Apr 27, 2022

Fix Envoy Mobile bug where writing prevents the read loop from running, #2221

Merged

goaway closed this as completed in #2221 May 10, 2022

carloseltuerto reopened this May 16, 2022

RyanTheOptimist added a commit to RyanTheOptimist/envoy-mobile that referenced this issue May 18, 2022

Cancel the scheduled onSendWindowAvailable callback when a stream is …

041129c

…cancelled. Fixes: envoyproxy#2213 Signed-off-by: Ryan Hamilton <rch@google.com>

RyanTheOptimist mentioned this issue May 18, 2022

Cancel the scheduled onSendWindowAvailable callback when a stream is cancelled #2292

Merged

RyanTheOptimist closed this as completed in #2292 May 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

onResponseHeaders not called on continuous sendData #2213

onResponseHeaders not called on continuous sendData #2213

carloseltuerto commented Apr 26, 2022

RyanTheOptimist commented Apr 26, 2022

carloseltuerto commented Apr 26, 2022

RyanTheOptimist commented Apr 26, 2022

RyanTheOptimist commented Apr 26, 2022

carloseltuerto commented Apr 26, 2022 •

edited

RyanTheOptimist commented Apr 26, 2022

RyanTheOptimist commented Apr 26, 2022

RyanTheOptimist commented May 2, 2022

carloseltuerto commented May 2, 2022 •

edited

RyanTheOptimist commented May 2, 2022 •

edited

carloseltuerto commented May 2, 2022

RyanTheOptimist commented May 2, 2022

RyanTheOptimist commented May 5, 2022

goaway commented May 5, 2022

carloseltuerto commented May 16, 2022

onResponseHeaders not called on continuous sendData #2213

onResponseHeaders not called on continuous sendData #2213

Comments

carloseltuerto commented Apr 26, 2022

RyanTheOptimist commented Apr 26, 2022

carloseltuerto commented Apr 26, 2022

RyanTheOptimist commented Apr 26, 2022

RyanTheOptimist commented Apr 26, 2022

carloseltuerto commented Apr 26, 2022 • edited

RyanTheOptimist commented Apr 26, 2022

RyanTheOptimist commented Apr 26, 2022

RyanTheOptimist commented May 2, 2022

carloseltuerto commented May 2, 2022 • edited

RyanTheOptimist commented May 2, 2022 • edited

carloseltuerto commented May 2, 2022

RyanTheOptimist commented May 2, 2022

RyanTheOptimist commented May 5, 2022

goaway commented May 5, 2022

carloseltuerto commented May 16, 2022

carloseltuerto commented Apr 26, 2022 •

edited

carloseltuerto commented May 2, 2022 •

edited

RyanTheOptimist commented May 2, 2022 •

edited