Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

onResponseHeaders not called on continuous sendData #2213

Closed
carloseltuerto opened this issue Apr 26, 2022 · 15 comments · Fixed by #2221 or #2292
Closed

onResponseHeaders not called on continuous sendData #2213

carloseltuerto opened this issue Apr 26, 2022 · 15 comments · Fixed by #2221 or #2292
Assignees
Labels
bug Something isn't working cronvoy-beta

Comments

@carloseltuerto
Copy link
Contributor

cc: @alyssawilk @goaway

Pushing more ByteBuffers always has priority over performing the onResponseHeaders callback. You could have received the ResponseHeaders one hour ago - still no callback indicating so if you keep pushing more data.

This problem is systematic on H2 with explicit flow control when using the EM Network Thread to perform the callback logic.

There is a specific test for this in Cronet: https://source.chromium.org/chromium/chromium/src/+/main:components/cronet/android/test/javatests/src/org/chromium/net/BidirectionalStreamTest.java;l=465
This test is being ported, but is currently failing due to this issue.

And there is a PR with a simple test demonstrating the failure: #2212

@RyanTheOptimist
Copy link
Contributor

Interesting bug! I'm slightly unclear about something. In the test, data is sent via setOnSendWindowAvailable() which, I assume, takes a callback which is executed when new send window is available. In the test, is there ever a period when no send window is available?

@carloseltuerto
Copy link
Contributor Author

Here is the buggy sequence:

  • create stream
  • send headers
  • send first buffer: EM will then send the first OnSendWindowAvailable once ready
  • EM receives the response headers: here, the onResponseHeaders is not invoked
  • EM invokes the OnSendWindowAvailable callback: that callback writes one more buffer
  • EM invokes the OnSendWindowAvailable callback: that callback writes one more buffer
  • ... one trillion times
  • EM invokes the OnSendWindowAvailable callback: that callback writes the last buffer
  • EM invokes the onResponseHeaders callback which invokes cancel: this is too late
  • EM invokes onCancel: the stream has ended

Here is how this should go, more or less:

  • create stream
  • send headers
  • send first buffer: EM will then send the first OnSendWindowAvailable once ready
  • EM invokes the OnSendWindowAvailable callback: that callback writes one more buffer
  • EM receives the response headers
  • EM invokes the onResponseHeaders callback which invokes cancel
  • EM invokes the onCancel callback: the stream has ended

@RyanTheOptimist
Copy link
Contributor

Right, that makes sense. But I'm confused why we seem to have infinite send window available. Do you understand why we're never hitting a limit?

(Independent of the issue you're raising that onResponseHeaders should have priority over OnSendWindowAvailable)

@RyanTheOptimist
Copy link
Contributor

Looking into this a bit, it looks like Client::sendData() calls onSendWindowAvailable() immediately if there is still window available:

void Client::sendData(envoy_stream_t stream, envoy_data data, bool end_stream) {
...
  if (direct_stream) {
...
    if (direct_stream->explicit_flow_control_ && !end_stream) {
      if (direct_stream->read_disable_count_ == 0) {
        // If there is still buffer space after the write, notify the sender                                                 
        // that send window is available.                                                                                    
        direct_stream->wants_write_notification_ = false;
        direct_stream->callbacks_->onSendWindowAvailable();
      } else {
...
      }
    }
  }
}

@alyssawilk it looks like this code came from #1545. Should we consider doing something async here?

@carloseltuerto
Copy link
Contributor Author

carloseltuerto commented Apr 26, 2022

This is not an infinite window. In fact this is very sequential, one ByteBuffer at a time.

  • stream.sendData(buffer, false); // false means more to come
  • EM callback with OnSendWindowAvailable
  • stream.sendData(buffer, false);
  • EM callback with OnSendWindowAvailable
  • stream.sendData(buffer, false);
    ... one trillion times
  • EM callback with OnSendWindowAvailable
  • stream.sendData(buffer, true); // true means this was the last buffer.

Look at the Cronet test:
https://source.chromium.org/chromium/chromium/src/+/main:components/cronet/android/test/javatests/src/org/chromium/net/BidirectionalStreamTest.java;l=465
You will notice that each time the onWriteCompleted is called, it sends one more ByteBuffer: this is an infinite loop. The 'cancel' is supposed to break that loop. But the 'cancel' is executed on the onResponseHeaders callback which never occurs with EM.

@RyanTheOptimist
Copy link
Contributor

Hm. I'm not sure we're on the same page. With, say, HTTP/2 or HTTP/3 there is a per-stream limit on the amount of data the client is allowed to send to the peer. With HTTP/1 this limit is in the OS at the socket layer. I would have expected that we would eventually exhaust this limit, but it doesn't look like that is happening in this test and that confuses me.

I agree with you that onResponseHeaders should be called and the fact that it is not called is a problem. I'm just trying to understand the test setup.

@RyanTheOptimist
Copy link
Contributor

So to follow up from the offline discussion, the test sends only 1000 bytes (albeit one byte at a time) which explains why we never hit the limit. Thanks, this explains it!

RyanTheOptimist added a commit to RyanTheOptimist/envoy-mobile that referenced this issue Apr 27, 2022
but scheduling repeated onSendWindowAvailable calls in the next
dispatcher iteration.

Adds the regression test Charles cooked up in envoyproxy#2212

Fixes envoyproxy#2213
RyanTheOptimist added a commit to RyanTheOptimist/envoy-mobile that referenced this issue Apr 27, 2022
by scheduling repeated onSendWindowAvailable calls in the next
dispatcher iteration.

Adds the regression test Charles cooked up in envoyproxy#2212

Fixes envoyproxy#2213

Signed-off-by: Ryan Hamilton <rch@google.com>
@RyanTheOptimist
Copy link
Contributor

After cooking up a fix for this issue in #2221 and discussing the behavior with @goaway, I think this is an artifact of how the Cronet executor is implemented. the onSendWindowAvailable() callback is handed off to the executor to schedule. In non-cronet executors, this happens on a different thread from the Envoy networking thread. As a result, the thread hopping allows the read loop to be pumped and the headers to be delivered. However, assuming we understand the Cronet executor correctly, with Cronet the callback happens on the same thread as the networking thread and so the read loop is never pumped.

While we could land a fix like #2221, it's not clear that is the right approach since it is at odds with how other callbacks work in Envoy Mobile. I think that probably a change to the Cronet executor is probably in order. @carloseltuerto, what do you think?

@carloseltuerto
Copy link
Contributor Author

carloseltuerto commented May 2, 2022

In this case, the user specifically requested to use the Network Thread to run all the business logic.

https://source.chromium.org/chromium/chromium/src/+/main:components/cronet/android/test/javatests/src/org/chromium/net/BidirectionalStreamTest.java;l=469

Chromium/Cronet behaves as expected under this constraint: Cronet runs all the user code under the Network Thread, in the expected order.

In other words, the sole Executor provided to Cronvoy is a Direct Executor: there is nothing else to choose from. Unless Cronvoy unilaterally spawns a "private" Executor Thread, ignoring the user's prescription.

Also, please keep in mind that Cronet was designed to let the savvy user to use the Network Thread:

https://source.chromium.org/chromium/chromium/src/+/main:components/cronet/android/api/src/org/chromium/net/UrlRequest.java;l=109?q=allow%20direct%20executor

This is obviously an optimisation - this avoids context switching between each request body ByteBuffer. We could take the decision to ignore this optimisation, but this comes probably at a price: for some performance tests, Cronvoy won't be able to surpass Cronet.

@RyanTheOptimist
Copy link
Contributor

RyanTheOptimist commented May 2, 2022

Oh, interesting. Thanks! I had forgotten about Cronet's direct executor mode. Hm. Ok, I'll do some more investigation. (I'm also curious about the buffering differences between Cronet and Envoy Mobile and if there are semantic differences between Cronet's onWriteComplete() and Envoy Mobile's onSendWindowAvailable())

@carloseltuerto
Copy link
Contributor Author

The semantic is the opposite at the surface. Cronet tells you when it is finished, and EM tells you when it is ready for next ByteBuffer. The Cronvoy implementation needs to fill the gap here. Since there is no final callback for the "last write", Cronvoy takes the liberty of immediately scheduling the Cronet's onWriteComplete last callback upon writing the last ByteBuffer. This could look like a "lie" - the last ByteBuffer might not in reality have hit the wire when onWriteComplete gets called. Still, for all intent and purposes, this does not matter at all - if there is a problem, onError will get called anyway. And I do not know how Cronet implemented this: I would not be surprised if the Cronet C++ layer does exactly the same.

@RyanTheOptimist
Copy link
Contributor

OK, I've spend the day coming up to speed on the behavior of BidirectionalStreamTest#testCancelWhileWriteDataPending in Cronet and at this point I think I understand what is happening. As @carloseltuerto said, the OnWriteCompleted() is being executed synchronously on the network thread. However, there is a significant difference between when Cronet's OnWriteCompleted() method is invoked and when Envoy Mobile's onSendWindowAvailable() method is. Namely, Cronet's OnWriteCompleted() is invoked immediately when data is written to the socket's send buffer (roughly) whereas Envoy Mobile's onSendWindowAvailable() method is always invoked, so long as send window is available. This means that mapping Envoy Mobile's onSendWindowAvailable() to Cronet's OnWriteCompleted() does not yield the same semantics. In particular, the reason Cronet's test passes is that the OnWriteCompleted() is only able to starve the read loop until the socket's send buffer fills up. However, the test fails in Envoy Mobile because onSendWindowAvailable() will block until the entire send window is exhausted. It looks like in the test setup the send window is about 1024000 bytes, whereas in the cronet test the send buffer doubles with each RTT but is at about 12K by the time the response headers arrive.

Ok, so I think that explains what is happening. But I'm not entirely sure what we should do to fix this. @alyssawilk how feasible would it be for Envoy/Envoy Mobile to expose a signal more like Cronet's OnWriteComplete? Maybe this is really hard due to the communication / buffering that happens between upstream and downstream?

@RyanTheOptimist
Copy link
Contributor

Ok, let me summarize where we are here. Cronet direct executor mode is an important performance optimization in which writes happen on the network thread. It seems like we should preserve this behavior. PR #2221 attempts to resolve this issue by scheduling the onSendWindowAvailable() on the next iteration of the dispatcher. However, Envoy Mobile callbacks are typically invoked on the current dispatcher loop. There is concern that moving to the next iteration of dispatcher will be problematic, based on previous experience with deferredDelete(). At the same time, it sounds like from conversations in the Envoy Mobile meeting last week that it would generally desirable to move callback to the next iteration of the message loop.

After taking with @alyssawilk, maybe the right approach would be to try to land #2221 and see what it reveals. If we are able to sort out any issues then we can use this as a first step towards moving all Envoy Mobile callbacks.

@goaway, how does that sound to you?

@goaway
Copy link
Contributor

goaway commented May 5, 2022

@RyanTheOptimist it looks like at least in the current version of #2221 the new logic is fully guarded by explicit_flow_control_ being enabled. Given that's off by default, I'm comfortable with moving forward with that approach.

goaway pushed a commit that referenced this issue May 10, 2022
…g, (#2221)

Description: Fix Envoy Mobile bug where writing prevents the read loop from running,
by scheduling repeated onSendWindowAvailable calls in the next
dispatcher iteration.

Adds the regression test Charles cooked up in #2212

Fixes #2213 

Risk Level: Low
Testing: New regression test
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Ryan Hamilton <rch@google.com>
@carloseltuerto
Copy link
Contributor Author

Unfortunately, this fix crashes EM.

Trying to remove this @Ignore ends up with this stack trace:

https://github.com/envoyproxy/envoy-mobile/blob/main/test/java/org/chromium/net/BidirectionalStreamTest.java#L453

---------------  T H R E A D  ---------------

Current thread (0x000000013ea0e800):  JavaThread "EnvoyMain" [_thread_in_vm, id=173835, stack(0x000000016c360000,0x000000016c3e3000)]

Stack: [0x000000016c360000,0x000000016c3e3000],  sp=0x000000016c3e1c80,  free space=519k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x37bd6c]  jni_CallObjectMethodV+0x104
C  [libndk_envoy_jni.jnilib+0x7550]  _ZN7JNIEnv_16CallObjectMethodEP8_jobjectP10_jmethodIDz+0x40
C  [libndk_envoy_jni.jnilib+0x60a4]  _ZL28jvm_on_send_window_available18envoy_stream_intelPv+0xa8
C  [libndk_envoy_jni.jnilib+0xe15c74]  _ZN5Envoy4Http6Client21DirectStreamCallbacks21onSendWindowAvailableEv+0x2b0
C  [libndk_envoy_jni.jnilib+0xe3b274]  _ZZN5Envoy4Http6Client8sendDataEl10envoy_databENK3$_3clEv+0x24
C  [libndk_envoy_jni.jnilib+0xe3b230]  _ZNSt3__1L8__invokeIRZN5Envoy4Http6Client8sendDataEl10envoy_databE3$_3JEEEDTclscT_fp_spscT0_fp0_EEOS7_DpOS8_+0x18
C  [libndk_envoy_jni.jnilib+0xe3b1e8]  _ZNSt3__128__invoke_void_return_wrapperIvLb1EE6__callIJRZN5Envoy4Http6Client8sendDataEl10envoy_databE3$_3EEEvDpOT_+0x1c
C  [libndk_envoy_jni.jnilib+0xe3b1c0]  _ZNSt3__110__function12__alloc_funcIZN5Envoy4Http6Client8sendDataEl10envoy_databE3$_3NS_9allocatorIS6_EEFvvEEclEv+0x1c
C  [libndk_envoy_jni.jnilib+0xe39b28]  _ZNSt3__110__function6__funcIZN5Envoy4Http6Client8sendDataEl10envoy_databE3$_3NS_9allocatorIS6_EEFvvEEclEv+0x1c
C  [libndk_envoy_jni.jnilib+0x2879c0]  _ZNKSt3__110__function12__value_funcIFvvEEclEv+0x3c
C  [libndk_envoy_jni.jnilib+0x287978]  _ZNKSt3__18functionIFvvEEclEv+0x18
C  [libndk_envoy_jni.jnilib+0x1ff9c20]  _ZZN5Envoy5Event14DispatcherImpl25createSchedulableCallbackENSt3__18functionIFvvEEEENK3$_6clEv+0x2c
C  [libndk_envoy_jni.jnilib+0x1ff9bd4]  _ZNSt3__1L8__invokeIRZN5Envoy5Event14DispatcherImpl25createSchedulableCallbackENS_8functionIFvvEEEE3$_6JEEEDTclscT_fp_spscT0_fp0_EEOS9_DpOSA_+0x18
C  [libndk_envoy_jni.jnilib+0x1ff9b8c]  _ZNSt3__128__invoke_void_return_wrapperIvLb1EE6__callIJRZN5Envoy5Event14DispatcherImpl25createSchedulableCallbackENS_8functionIFvvEEEE3$_6EEEvDpOT_+0x1c
C  [libndk_envoy_jni.jnilib+0x1ff9b64]  _ZNSt3__110__function12__alloc_funcIZN5Envoy5Event14DispatcherImpl25createSchedulableCallbackENS_8functionIFvvEEEE3$_6NS_9allocatorIS8_EES6_EclEv+0x1c
C  [libndk_envoy_jni.jnilib+0x1ff86d8]  _ZNSt3__110__function6__funcIZN5Envoy5Event14DispatcherImpl25createSchedulableCallbackENS_8functionIFvvEEEE3$_6NS_9allocatorIS8_EES6_EclEv+0x1c
C  [libndk_envoy_jni.jnilib+0x2879c0]  _ZNKSt3__110__function12__value_funcIFvvEEclEv+0x3c
C  [libndk_envoy_jni.jnilib+0x287978]  _ZNKSt3__18functionIFvvEEclEv+0x18
C  [libndk_envoy_jni.jnilib+0x2917588]  _ZZN5Envoy5Event23SchedulableCallbackImplC1ERNS_9CSmartPtrI10event_baseXadL_Z15event_base_freeEEEENSt3__18functionIFvvEEEENK3$_0clEisPv+0x30
C  [libndk_envoy_jni.jnilib+0x291754c]  _ZZN5Envoy5Event23SchedulableCallbackImplC1ERNS_9CSmartPtrI10event_baseXadL_Z15event_base_freeEEEENSt3__18functionIFvvEEEEN3$_08__invokeEisPv+0x2c
C  [libndk_envoy_jni.jnilib+0x29ebb24]  event_process_active_single_queue+0x434
C  [libndk_envoy_jni.jnilib+0x29e5de4]  event_process_active+0x158
C  [libndk_envoy_jni.jnilib+0x29e4c24]  event_base_loop+0x4b0
C  [libndk_envoy_jni.jnilib+0x2912a34]  _ZN5Envoy5Event17LibeventScheduler3runENS0_10Dispatcher7RunTypeE+0x80
C  [libndk_envoy_jni.jnilib+0x1fd6a48]  _ZN5Envoy5Event14DispatcherImpl3runENS0_10Dispatcher7RunTypeE+0x54
C  [libndk_envoy_jni.jnilib+0x2c5fd8]  _ZN5Envoy6Server12InstanceImpl3runEv+0x3b0
C  [libndk_envoy_jni.jnilib+0xf9654]  _ZN5Envoy14MainCommonBase3runEv+0x64
C  [libndk_envoy_jni.jnilib+0x946c0]  _ZN5Envoy12EngineCommon3runEv+0x1c
C  [libndk_envoy_jni.jnilib+0x93be8]  _ZN5Envoy6Engine4mainENSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEES7_+0xff0
C  [libndk_envoy_jni.jnilib+0xb585c]  _ZNSt3__1L8__invokeIMN5Envoy6EngineEF14envoy_status_tNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEES9_EPS2_JS9_S9_EvEEDTcldsdescT0_fp0_fp_spscT1_fp1_EEOT_OSD_DpOSE_+0xa8
C  [libndk_envoy_jni.jnilib+0xb5768]  _ZNSt3__1L16__thread_executeINS_10unique_ptrINS_15__thread_structENS_14default_deleteIS2_EEEEMN5Envoy6EngineEF14envoy_status_tNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESE_EJPS7_SE_SE_EJLm2ELm3ELm4EEEEvRNS_5tupleIJT_T0_DpT1_EEENS_15__tuple_indicesIJXspT2_EEEE+0x60
C  [libndk_envoy_jni.jnilib+0xb4d8c]  _ZNSt3__1L14__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEMN5Envoy6EngineEF14envoy_status_tNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESF_EPS8_SF_SF_EEEEEPvSK_+0x54
C  [libsystem_pthread.dylib+0x726c]  _pthread_start+0x94

RyanTheOptimist added a commit to RyanTheOptimist/envoy-mobile that referenced this issue May 18, 2022
…cancelled.

Fixes: envoyproxy#2213

Signed-off-by: Ryan Hamilton <rch@google.com>
RyanTheOptimist added a commit that referenced this issue May 19, 2022
…cancelled (#2292)

Cancel the scheduled onSendWindowAvailable callback when a stream is cancelled.

Fixes: #2213

Signed-off-by: Ryan Hamilton <rch@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment