
Error { kind: Failed, description: "Received a new call on in-use question id 0" } #359

Closed
edevil opened this issue Dec 7, 2022 · 9 comments

Comments

@edevil

edevil commented Dec 7, 2022

Hello.

I'm seeing an error happen sporadically when I run my tests on CI, and I've been trying to reproduce it locally with little luck.

Basically I have two components: F (capnproto-rust) and E (pycapnp). F calls E and passes it, among other things, a capability for E to call back into F. The test where this error happens is one where the F->E call times out and has to be retried. RPC calls start failing with "Error { kind: Failed, description: "Received a new call on in-use question id 0" }", and shortly afterwards the RpcSystem promise also returns with the same error.

I know I don't have a lot of information to give, but I was wondering if someone could explain under which circumstances the above error can occur, so that I can reproduce it more easily. It seems like a bug in the rpc library that should be fixed. From looking at the code, I get the impression that what causes the error is the E->F call, even though I only notice the issue afterwards, when I try to make more RPC calls from F.

@dwrensha
Member

dwrensha commented Dec 7, 2022

The test where this error happens is one where the F->E call times out and has to be retried

Is this a timeout deliberately triggered by you, for testing purposes? How are you causing the timeout?

@edevil
Author

edevil commented Dec 7, 2022

I have a sleep on the E component, and on the F side I'm using tokio::time::timeout when awaiting the send promise.
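
Roughly like this on the F side (just a sketch, not my actual code; the method and type names are made up):

```rust
use std::time::Duration;

// `client` is the capnp client obtained from rpc_system.bootstrap(...);
// `ping` stands in for the real method whose handler E delays with a sleep.
let request = client.ping_request();
match tokio::time::timeout(Duration::from_secs(30), request.send().promise).await {
    // The RPC completed before the deadline.
    Ok(Ok(_response)) => println!("got response"),
    // The RPC itself failed.
    Ok(Err(e)) => eprintln!("RPC error: {e}"),
    // The deadline elapsed; the send promise is dropped here, but the
    // underlying question may still be outstanding on the connection.
    Err(_elapsed) => eprintln!("RPC timed out"),
}
```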

@edevil
Author

edevil commented Dec 21, 2022

I have observed more CI runs that produce this same error. Usually the pattern is:

1. F calls E and passes it a capability.
2. E uses the capability to call F, which succeeds.
3. E finishes handling the original call from F.
4. F never gets the response back from E and times out (no sleep involved here; maybe some bug in pycapnp).
5. F tries to call E again.
6. E uses the capability to call F; the call does not succeed and the connection is terminated.
7. On F, all capnp-related calls return the "Received a new call on in-use question id 0" error.

@edevil
Author

edevil commented Dec 21, 2022

I should add that I'm using tokio::time::timeout and then reusing the same capnp object to re-issue the request.
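
i.e. roughly this (again just a sketch with made-up names):

```rust
use std::time::Duration;

// The same `client` capability object is reused to re-issue the request.
let response = loop {
    let request = client.ping_request(); // fresh request, same client object
    match tokio::time::timeout(Duration::from_secs(30), request.send().promise).await {
        Ok(result) => break result?, // completed (ok or with an RPC error) in time
        Err(_elapsed) => eprintln!("timed out, retrying on the same connection"),
    }
};
```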

@edevil
Author

edevil commented Jan 2, 2023

Just adding a few more details. I sprinkled a few debug statements in ConnectionState::handle_message().

This is a normal interaction that finishes successfully:

Jan 02 22:40:14.819 handle_message
Jan 02 22:40:14.819 handle_message return question_id 0
Jan 02 22:40:14.819 handle_message return END question_id 0
Jan 02 22:40:14.819 handle_message
Jan 02 22:40:14.819 handle_message return question_id 1
Jan 02 22:40:14.819 handle_message return END question_id 1
Jan 02 22:40:14.819 handle_message
Jan 02 22:40:14.819 handle_message return question_id 2
Jan 02 22:40:14.819 handle_message return END question_id 2
Jan 02 22:40:14.819 handle_message
Jan 02 22:40:14.819 handle_message return question_id 3
Jan 02 22:40:14.819 handle_message return END question_id 3
Jan 02 22:40:14.819 handle_message
Jan 02 22:40:14.819 handle_message call question_id 0 interface_id 12987388290453739558 method_id 0 cap_table_array 0
Jan 02 22:40:14.819 handle_message call END
Jan 02 22:40:14.904 handle_message
Jan 02 22:40:14.904 handle_message finish question_id 0
Jan 02 22:40:14.904 handle_message
Jan 02 22:40:14.904 handle_message return question_id 4
Jan 02 22:40:14.904 handle_message return END question_id 4
Jan 02 22:40:14.904 handle_message
Jan 02 22:40:14.904 handle_message release id 0

Now this is an interaction that includes a timeout:

Jan 02 17:26:48.997 handle_message
Jan 02 17:26:48.997 handle_message return question_id 0
Jan 02 17:26:48.997 handle_message return END question_id 0
Jan 02 17:26:48.997 handle_message
Jan 02 17:26:48.998 handle_message return question_id 1
Jan 02 17:26:48.998 handle_message return END question_id 1
Jan 02 17:26:48.998 handle_message
Jan 02 17:26:48.998 handle_message return question_id 2
Jan 02 17:26:48.999 handle_message return END question_id 2
Jan 02 17:26:48.999 handle_message
Jan 02 17:26:48.999 handle_message return question_id 3
Jan 02 17:26:48.999 handle_message return END question_id 3
Jan 02 17:26:49.000 handle_message
Jan 02 17:26:49.000 handle_message call question_id 0 interface_id 12987388290453739558 method_id 0 cap_table_array 0
Jan 02 17:26:49.000 handle_message call END
Jan 02 17:27:23.994 handle_message
Jan 02 17:27:23.994 handle_message return question_id 1
Jan 02 17:27:23.994 handle_message return END question_id 1
Jan 02 17:27:23.994 handle_message
Jan 02 17:27:23.994 handle_message return question_id 2
Jan 02 17:27:23.994 handle_message return END question_id 2
Jan 02 17:27:23.994 handle_message
Jan 02 17:27:23.994 handle_message return question_id 3
Jan 02 17:27:23.994 handle_message return END question_id 3
Jan 02 17:27:23.994 handle_message
Jan 02 17:27:23.994 handle_message call question_id 0 interface_id 12987388290453739558 method_id 0 cap_table_array 0

When the timeout happens I reuse the runtime object to create and send a new request. It seems the other party (pycapnp) reuses the same question_id when calling us back on the same connection while a previous question is still in play, which I guess is a bug.

What I'm trying now is to destroy the current connection and set up a new one whenever a timeout happens. For this I'm using the disconnector object we can get from the RpcSystem, but the RpcSystem future itself never returns, even after the disconnector finds the state to be "Disconnected". I inspected the TaskSet and it appears to have in_progress Pending futures that never complete.
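
For reference, the teardown I'm attempting looks roughly like this (a sketch only; connection setup is omitted and my_service is a made-up name):

```rust
use capnp_rpc::{rpc_twoparty_capnp, RpcSystem};

// `network` setup omitted; `my_service` is a placeholder for my generated interface.
let mut rpc_system = RpcSystem::new(Box::new(network), None);
let disconnector = rpc_system.get_disconnector();
let client: my_service::Client = rpc_system.bootstrap(rpc_twoparty_capnp::Side::Server);
let rpc_task = tokio::task::spawn_local(rpc_system); // inside a tokio LocalSet

// ... later, when tokio::time::timeout fires:
disconnector.await?;    // reports the state as "Disconnected"
let _ = rpc_task.await; // expectation: the RpcSystem future resolves, but it never does
```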

I thought the disconnector would force the termination of whatever questions were still running. Is this not the case, @dwrensha?

@dwrensha
Member

dwrensha commented Jan 3, 2023

Yeah, that sounds like a bug. I expect that once the Disconnector future finishes, the RpcSystem future should finish too. Can you isolate a testcase where that happens so that I can reproduce it?

@edevil
Author

edevil commented Jan 3, 2023

I'm trying to extract some code to replicate this issue, but it hasn't been easy. I think it's related to losing some packets due to the way I was reading the socket on the Python side. Meanwhile, by manually dropping some capnp responses on the Python side, I was able to trigger a panic on the Rust side:

thread 'main' panicked at 'not implemented', /Users/acruz/work/vc/capnp-tester/client/capnp-rpc/src/rpc.rs:1065:41
stack backtrace:
   0: rust_begin_unwind
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
   2: core::panicking::panic
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:115:5
   3: capnp_rpc::rpc::ConnectionState<VatId>::handle_message
             at ./capnp-rpc/src/rpc.rs:1065:41
   4: capnp_rpc::rpc::ConnectionState<VatId>::message_loop::{{closure}}

Apparently the return_::Canceled(_) message is sent in some cases, but its handling is not yet implemented in the Rust module.

@edevil
Author

edevil commented Jan 4, 2023

Ok, I finally tracked this down.

The "Received a new call on in-use question id 0" error happened because my Python server code had a race condition when reading data and could drop messages to the client. I was misled by https://capnproto.github.io/pycapnp/quickstart.html#starting-a-server-asyncio

The issue with the disconnector happened because I was calling get_disconnector() before calling bootstrap() on the RPC system. This caused the disconnector to be initialized in an already-disconnected state, so when awaited it would not do anything. Inverting the order of the two operations made it work correctly.
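
i.e. this order works (same sketch as above, names made up):

```rust
let mut rpc_system = RpcSystem::new(Box::new(network), None);
// bootstrap() first...
let client: my_service::Client = rpc_system.bootstrap(rpc_twoparty_capnp::Side::Server);
// ...and only then grab the disconnector.
let disconnector = rpc_system.get_disconnector();
tokio::task::spawn_local(rpc_system);

// ... on timeout:
disconnector.await?; // now actually shuts the connection down when awaited
```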

So, I guess it was all my fault all along. 😅

edevil closed this as completed Jan 4, 2023
@dwrensha
Member

I've eliminated those panics on the Rust side: 6087383
