Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[5.0] always set OC's sandbox socket to blocking mode fixing random OC errors seen in CI & maybe elsewhere #2163

Merged
merged 1 commit into from
Jan 31, 2024

Conversation

spoonincode
Copy link
Member

It has been relatively rare, but sometimes we've seen an unexplained failed to compile wasm error in EOS VM OC unittests. We've also seen at least one report from a user with a similar error. Up until now we've chalked this up to OC's subjective limits being inappropriately enabled during testing, as subjective limits could certainly explain this particular failure case where OC's sandbox fails to produce a response. These subjective limits are now disabled on release/5.0 and main so we can no longer attribute failures to that.

After upgrading the CI image to the latest Linux kernel -- 6.7.1 at this time -- we've seen an explosion in the number of failed to compile wasm errors. I can also reproduce this issue locally with 6.7.x as well, as long as tests are run with a high -j value; i.e. high system load.1

The problem is occurring in OC's compile monitor component in its sandbox. When a compile is started an asio local datagram socket is created from a socketpair,

void kick_compile_off(const code_tuple& code_id, const eosvmoc::config& eosvmoc_config, wrapped_fd&& wasm_code) {
//prepare a requst to go out to the trampoline
int socks[2];
socketpair(AF_UNIX, SOCK_SEQPACKET | SOCK_CLOEXEC, 0, socks);
local::datagram_protocol::socket response_socket(_ctx);
response_socket.assign(local::datagram_protocol(), socks[0]);

the compile monitor will then async_wait() on it,
void read_message_from_compile_task(std::list<std::tuple<code_tuple, local::datagram_protocol::socket>>::iterator current_compile_it) {
auto& [code, socket] = *current_compile_it;
socket.async_wait(local::datagram_protocol::socket::wait_read, [this, current_compile_it](auto ec) {

and then once the socket is indicated as ready, call read_message_with_fds()
auto [success, message, fds] = read_message_with_fds(socket);

which ultimately calls recvmsg,
int red;
do {
red = recvmsg(fd, &msg, 0);
} while(red == -1 && errno == EINTR);
if(red < 1 || static_cast<unsigned>(red) >= sizeof(buff))
return {false, message, std::move(fds)};

read_message_with_fds() is expecting recvmsg() to be a blocking call: no where does OC anywhere set up a socket to be non-blocking. However, asio's async_wait() will always set the underlying file descriptor to be non-blocking. It doesn't seem like that would be a problem though: we only call recvmsg() after the kernel (through the asio abstraction) has indicated EPOLLIN so the "naive" (as defined shortly below) expectation is that any call to recvmsg() will succeed with a proper message or indication of a connection failure.

But on 6.7.1 sometimes the async_wait() indicates readiness but the recvmsg() will return with an EAGAIN. The current code considers that a fatal communication error with the compilation process and bails, thus giving us the failed to compile wasm error. And this is where some spec lawyering comes in to play. It's not clear to me what exactly is guaranteed via this interface: when select returns indicating a socket is readable, is it a guarantee that a non blocking read must return a message on the next call? afaict no: it's only a guarantee that a blocking read will not block. It seems reasonable to expect a non blocking read would always return the message but I'm not convinced from the specification I've read that's guaranteed.

Of course, asio isn't using select, it's using the linux specific epoll... in edge triggered mode too. This further obscures what the rules ought to be, to me.

It's also possible we've stumbled on a defect introduced in Linux. Unfortunately without either a small test case or bisecting where this issue started to become rampant, neither of which is a small effort, it's hard to approach the developers with it.

This PR now asks the asio socket to have its underlying file descriptor set to blocking mode after the async_wait() and before the recvmsg(). This effectively removes the possibility of EAGAIN creeping up on us here. I also entertained read_message_with_fds() looping on EAGAIN, as it already does for EINTR. That also does work in testing. It's not immediately clear which way is better.

One negative aspect of this change is that a read_message_with_fds(asiosocket.native_handle()) will subvert the code that sets blocking mode. At the moment the only usages of read_message_with_fds(int) pass file descriptors that were never "asio-ified"

Footnotes

  1. I don't mean to make 6.7.x appear to be culprit even though I explicitly mention it by value multiple times in this PR. The CI image hasn't been touched in nearly a year (so a handful of kernel versions ago), and I don't run the tests with a high -j value locally regularly enough to know if this failure has only recently started occurring with such frequency. But it does seem very clear that within the last year some kernel behavior has shifted.

@ericpassmore
Copy link
Contributor

Note:start
group: STABILITY
category: TEST
summary: Changes to socket handling to improve continuous integration test passes.
Note:end

@spoonincode spoonincode merged commit b010e7e into release/5.0 Jan 31, 2024
57 checks passed
@spoonincode spoonincode deleted the oc_blocking_socket_50 branch January 31, 2024 03:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants