
Error { kind: Failed, description: "Received a new call on in-use question id 0" } #359

Closed
edevil opened this issue Dec 7, 2022 · 9 comments

Comments

@edevil

edevil commented Dec 7, 2022

Hello.

I'm seeing an error happen sporadically when I run my tests on CI, and I've been trying to reproduce it locally with little luck.

Basically I have two components: F (capnproto-rust) and E (pycapnp). F calls E and passes it, among other things, a capability for E to call back into F. The test where this error happens is one where the F->E call times out and has to be retried. RPC calls start failing with "Error { kind: Failed, description: "Received a new call on in-use question id 0" }", and shortly afterwards the RpcSystem promise also returns with the same error.

I know I don't have a lot of information to give, but I was wondering if someone could explain under which circumstances the above error can occur, so that I can reproduce it more easily. It seems like a bug in the rpc library that should be fixed. From looking at the code, I get the impression that what causes the error is the E->F call, even though I only notice the issue afterwards, when I try to make more RPC calls from F.

@dwrensha
Member

dwrensha commented Dec 7, 2022

The test where this error happens is one where the F->E call times out and has to be retried

Is this a timeout deliberately triggered by you, for testing purposes? How are you causing the timeout?

@edevil
Author

edevil commented Dec 7, 2022

I have a sleep on the E component, and on the F side I'm using tokio::time::timeout when awaiting the send promise.
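
Roughly like this on the F side (just a sketch, not my actual code; the method and type names are made up):

```rust
use std::time::Duration;

// `client` is the capnp client obtained from rpc_system.bootstrap(...);
// `ping` stands in for the real method whose handler E delays with a sleep.
let request = client.ping_request();
match tokio::time::timeout(Duration::from_secs(30), request.send().promise).await {
    // The RPC completed before the deadline.
    Ok(Ok(_response)) => println!("got response"),
    // The RPC itself failed.
    Ok(Err(e)) => eprintln!("RPC error: {e}"),
    // The deadline elapsed; the send promise is dropped here, but the
    // underlying question may still be outstanding on the connection.
    Err(_elapsed) => eprintln!("RPC timed out"),
}
```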

@edevil
Author

edevil commented Dec 21, 2022

I have observed more CI runs that produce this same error. Usually the pattern is:

1. F calls E and passes it a capability.
2. E uses the capability to call F, which succeeds.
3. E finishes handling the original call from F.
4. F never gets the response back from E and times out (no sleep involved here; maybe some bug in pycapnp).
5. F tries to call E again.
6. E uses the capability to call F; the call does not succeed and the connection is terminated.
7. On F, all capnp-related calls return the "Received a new call on in-use question id 0" error.

@edevil
Author

edevil commented Dec 21, 2022

I should add that I'm using tokio::time::timeout and then reusing the same capnp object to re-issue the request.
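
i.e. roughly this (again just a sketch with made-up names):

```rust
use std::time::Duration;

// The same `client` capability object is reused to re-issue the request.
let response = loop {
    let request = client.ping_request(); // fresh request, same client object
    match tokio::time::timeout(Duration::from_secs(30), request.send().promise).await {
        Ok(result) => break result?, // completed (ok or with an RPC error) in time
        Err(_elapsed) => eprintln!("timed out, retrying on the same connection"),
    }
};
```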

@edevil
Author

edevil commented Jan 2, 2023

Just adding a few more details. I sprinkled a few debug statements in ConnectionState::handle_message().

This is a normal interaction that finishes successfully:

Jan 02 22:40:14.819 handle_message
Jan 02 22:40:14.819 handle_message return question_id 0
Jan 02 22:40:14.819 handle_message return END question_id 0
Jan 02 22:40:14.819 handle_message
Jan 02 22:40:14.819 handle_message return question_id 1
Jan 02 22:40:14.819 handle_message return END question_id 1
Jan 02 22:40:14.819 handle_message
Jan 02 22:40:14.819 handle_message return question_id 2
Jan 02 22:40:14.819 handle_message return END question_id 2
Jan 02 22:40:14.819 handle_message
Jan 02 22:40:14.819 handle_message return question_id 3
Jan 02 22:40:14.819 handle_message return END question_id 3
Jan 02 22:40:14.819 handle_message
Jan 02 22:40:14.819 handle_message call question_id 0 interface_id 12987388290453739558 method_id 0 cap_table_array 0
Jan 02 22:40:14.819 handle_message call END
Jan 02 22:40:14.904 handle_message
Jan 02 22:40:14.904 handle_message finish question_id 0
Jan 02 22:40:14.904 handle_message
Jan 02 22:40:14.904 handle_message return question_id 4
Jan 02 22:40:14.904 handle_message return END question_id 4
Jan 02 22:40:14.904 handle_message
Jan 02 22:40:14.904 handle_message release id 0

Now this is an interaction that includes a timeout:

Jan 02 17:26:48.997 handle_message
Jan 02 17:26:48.997 handle_message return question_id 0
Jan 02 17:26:48.997 handle_message return END question_id 0
Jan 02 17:26:48.997 handle_message
Jan 02 17:26:48.998 handle_message return question_id 1
Jan 02 17:26:48.998 handle_message return END question_id 1
Jan 02 17:26:48.998 handle_message
Jan 02 17:26:48.998 handle_message return question_id 2
Jan 02 17:26:48.999 handle_message return END question_id 2
Jan 02 17:26:48.999 handle_message
Jan 02 17:26:48.999 handle_message return question_id 3
Jan 02 17:26:48.999 handle_message return END question_id 3
Jan 02 17:26:49.000 handle_message
Jan 02 17:26:49.000 handle_message call question_id 0 interface_id 12987388290453739558 method_id 0 cap_table_array 0
Jan 02 17:26:49.000 handle_message call END
Jan 02 17:27:23.994 handle_message
Jan 02 17:27:23.994 handle_message return question_id 1
Jan 02 17:27:23.994 handle_message return END question_id 1
Jan 02 17:27:23.994 handle_message
Jan 02 17:27:23.994 handle_message return question_id 2
Jan 02 17:27:23.994 handle_message return END question_id 2
Jan 02 17:27:23.994 handle_message
Jan 02 17:27:23.994 handle_message return question_id 3
Jan 02 17:27:23.994 handle_message return END question_id 3
Jan 02 17:27:23.994 handle_message
Jan 02 17:27:23.994 handle_message call question_id 0 interface_id 12987388290453739558 method_id 0 cap_table_array 0

When the timeout happens I reuse the runtime object to create and send a new request. It seems the other party (pycapnp) reuses the same question_id when calling us back on the same connection while a previous question is still in play, which I guess is a bug.

What I'm trying now is to destroy the current connection and set up a new one whenever a timeout happens. For this I'm using the disconnector object we can get from the RpcSystem, but the RpcSystem future itself never returns, even after the disconnector finds the state to be "Disconnected". I inspected the TaskSet and it appears to have in_progress Pending futures that never complete.
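
For reference, the teardown I'm attempting looks roughly like this (a sketch only; connection setup is omitted and my_service is a made-up name):

```rust
use capnp_rpc::{rpc_twoparty_capnp, RpcSystem};

// `network` setup omitted; `my_service` is a placeholder for my generated interface.
let mut rpc_system = RpcSystem::new(Box::new(network), None);
let disconnector = rpc_system.get_disconnector();
let client: my_service::Client = rpc_system.bootstrap(rpc_twoparty_capnp::Side::Server);
let rpc_task = tokio::task::spawn_local(rpc_system); // inside a tokio LocalSet

// ... later, when tokio::time::timeout fires:
disconnector.await?;    // reports the state as "Disconnected"
let _ = rpc_task.await; // expectation: the RpcSystem future resolves, but it never does
```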

I thought the disconnector would force the termination of whatever questions were still running. Is this not the case, @dwrensha?

@dwrensha
Member

dwrensha commented Jan 3, 2023

Yeah, that sounds like a bug. I expect that once the Disconnector future finishes, the RpcSystem future should finish too. Can you isolate a testcase where that happens so that I can reproduce it?

@edevil
Author

edevil commented Jan 3, 2023

I'm trying to extract some code to replicate this issue, but it hasn't been easy. I think it's related to losing some packets due to the way I was reading the socket on the Python side. Meanwhile, by manually dropping some capnp responses on the Python side, I was able to trigger a panic on the Rust side:

thread 'main' panicked at 'not implemented', /Users/acruz/work/vc/capnp-tester/client/capnp-rpc/src/rpc.rs:1065:41
stack backtrace:
   0: rust_begin_unwind
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
   2: core::panicking::panic
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:115:5
   3: capnp_rpc::rpc::ConnectionState<VatId>::handle_message
             at ./capnp-rpc/src/rpc.rs:1065:41
   4: capnp_rpc::rpc::ConnectionState<VatId>::message_loop::{{closure}}

Apparently the return_::Canceled(_) message is sent in some cases, but its handling is not yet implemented in the Rust module.

@edevil
Author

edevil commented Jan 4, 2023

Ok, I finally tracked this down.

The "Received a new call on in-use question id 0" error happened because my Python server code had a race condition when reading data and could drop messages to the client. I was misled by https://capnproto.github.io/pycapnp/quickstart.html#starting-a-server-asyncio

The issue with the disconnector happened because I was calling get_disconnector() before calling bootstrap() on the RPC system. This caused the disconnector to be initialized in an already-disconnected state, so when awaited it would not do anything. Inverting the order of the two operations made it work correctly.
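
i.e. this order works (same sketch as above, names made up):

```rust
let mut rpc_system = RpcSystem::new(Box::new(network), None);
// bootstrap() first...
let client: my_service::Client = rpc_system.bootstrap(rpc_twoparty_capnp::Side::Server);
// ...and only then grab the disconnector.
let disconnector = rpc_system.get_disconnector();
tokio::task::spawn_local(rpc_system);

// ... on timeout:
disconnector.await?; // now actually shuts the connection down when awaited
```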

So, I guess it was all my fault all along. 😅

edevil closed this as completed Jan 4, 2023
@dwrensha
Member

I've eliminated those panics on the Rust side: 6087383
