Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc Cancel and grpc Deadline #1412

Closed
DarkMatterV opened this issue Jul 29, 2023 · 9 comments
Closed

grpc Cancel and grpc Deadline #1412

DarkMatterV opened this issue Jul 29, 2023 · 9 comments

Comments

@DarkMatterV
Copy link

hi, i test aosp rbe, but sometimes I run into "io grpc cancel and deadline_exceed" problems. like as:

Jul 29, 2023 3:12:59 AM build.buildfarm.instance.shard.ShardInstance$3 onError
WARNING: DEADLINE_EXCEEDED: read({hash}/{size}) on worker bazel-buildfarm-shard-worker.bazel-buildfarm-shard-worker-service:8981 after 1114112 bytes of content
Jul 29, 2023 3:12:59 AM build.buildfarm.common.services.ByteStreamService$1 onError
WARNING: error reading {hash}/{size}(shard) at offset 0 after 17 responses and 1114112 bytes of content
io.grpc.StatusRuntimeException: CANCELLED: RPC cancelled
	at io.grpc.Status.asRuntimeException(Status.java:539)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:736)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: io.grpc.StatusRuntimeException: CANCELLED: RPC cancelled

because the timeout period I set in config.yml is long.
maybe it about io.grpc.Context io.grpc.Deadline? Whether the parameter value of the withDeadlineAfter function in the ShardInstance.java file is affected?

@DarkMatterV
Copy link
Author

log also have this:


Jul 29, 2023 3:16:49 AM build.buildfarm.common.services.ByteStreamService$1 onError
WARNING: error reading {hash1}/{hash1_size}(shard) at offset 0 after 276 responses and 18087936 bytes of content
io.grpc.StatusRuntimeException: CANCELLED: Cancelled by client with StreamObserver.onError()
	at io.grpc.Status.asRuntimeException(Status.java:539)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:736)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: io.grpc.StatusException: CANCELLED
	at io.grpc.Status.asException(Status.java:547)
	at build.buildfarm.instance.stub.StubInstance$ReadBlobInterchange.lambda$beforeStart$0(StubInstance.java:545)
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:783)
	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49)
	at build.buildfarm.common.grpc.DelegateServerCallStreamObserver.lambda$new$0(DelegateServerCallStreamObserver.java:22)
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:783)
	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49)
	at build.buildfarm.common.grpc.DelegateServerCallStreamObserver.lambda$new$0(DelegateServerCallStreamObserver.java:22)
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:783)
	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49)
	at build.buildfarm.common.grpc.DelegateServerCallStreamObserver.lambda$new$0(DelegateServerCallStreamObserver.java:22)
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:783)
	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49)
	at build.buildfarm.common.grpc.DelegateServerCallStreamObserver.lambda$new$0(DelegateServerCallStreamObserver.java:22)
	at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onCancel(ServerCalls.java:195)
	at io.grpc.PartialForwardingServerCallListener.onCancel(PartialForwardingServerCallListener.java:40)
	at io.grpc.ForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:23)
	at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:40)
	at io.grpc.util.TransmitStatusRuntimeExceptionInterceptor$1.onCancel(TransmitStatusRuntimeExceptionInterceptor.java:83)
	at io.grpc.PartialForwardingServerCallListener.onCancel(PartialForwardingServerCallListener.java:40)
	at io.grpc.ForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:23)
	at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:40)
	at io.grpc.Contexts$ContextualizedServerCallListener.onCancel(Contexts.java:96)
	at io.grpc.PartialForwardingServerCallListener.onCancel(PartialForwardingServerCallListener.java:40)
	at io.grpc.ForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:23)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.closedInternal(ServerCallImpl.java:378)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.closed(ServerCallImpl.java:365)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1Closed.runInContext(ServerImpl.java:923)
	... 5 more

@werkt
Copy link
Collaborator

werkt commented Jul 29, 2023

The request was cancelled. The client closed their connection, or they explicitly cancelled the operation. Is this correlated with any actual client errors?

The level is a WARNING because this occurs during the normal operation of clients, who typically do not clean up after their own outstanding request mess.

@DarkMatterV
Copy link
Author

The preceding log is the log of the buildfarm server.
The levels of DEADLINE_EXCEEDED and io.grpc.StatusRuntimeException: CANCELLED are WARNING, so the exception stack does not need to be concerned, right?

@werkt
Copy link
Collaborator

werkt commented Jul 31, 2023

The DEADLINE_EXCEEDED is an unfortunate misleading piece of hardcoded copy: the only status here is a CANCELLED, for which we need to fix the error output. There is no DEADLINE_EXCEEDED, and yes, this is all just warnings, so it's a result of clients cancelling their reads.

I'll put up a diff that makes the error less confusing when the status isn't DEADLINE_EXCEEDED and it fails the SHARD_IS_RETRIABLE test.

@DarkMatterV
Copy link
Author

okay, thanks.

It seems that the aosp client can't receive the complete blob file from the buildfarm server. Therefore, the aosp client parses the file and reports the error "error: unterminated conditional direct" and then "fatal error: too many errors committed, stopping now" to terminate the request.

@DarkMatterV
Copy link
Author

When client NINJA_REMOTE_NUM_JOBS=500(The number of concurrent requests is 500.), it can build success.

But client NINJA_REMOTE_NUM_JOBS=1000, it build failed, and reports the error "error: unterminated conditional direct" and "fatal error: too many errors committed, stopping now".

Do you know why the client can't receive the complete blob file (because aosp rbe isn't opensource, I can't locate)?

@werkt
Copy link
Collaborator

werkt commented Jul 31, 2023

You're probably running out of bandwidth to receive the blobs in the timeouts of each transfer. I'm not familiar with the remote execution client you're referring to, but the math is constrained by delivery regardless, and if the client is not designed to accept any progress (and even this breaks down eventually) for each transfer allowing it to continue beyond the deadline (bazel originally had this problem), you will not be able to request concurrency beyond a limit:

For a number of concurrent (i.e. your JOBS count) activities, your bandwidth is divided by your jobs. If there are jobs downloads happening, they're each only getting a portion of the bandwidth. Your transfer time is then your largest blob size divided by this rate. If this time is greater than your timeout, you will never complete all of your transfers within the timeout.

This all assumes naive behavior by the client, which, as you've mentioned, is not transparent to our investigation, so this is the most likely failure case I can guess from my side. Is there even a link to this client? I assume this is Android's Open Source Project? Is there a tool that I can see without paying for something?

@DarkMatterV
Copy link
Author

DarkMatterV commented Jul 31, 2023

I think you are right.
Do I have the fastest way to query the largest blob size? I want to compute transfer time

https://source.android.com/docs/setup/download/downloading this is android open source project, but it's rbe(Google's Remote Build Execution(RBE) service)isn't open source, it just calls the binary in the prebuilts/remoteexecution-client/live directory.

@werkt
Copy link
Collaborator

werkt commented Aug 4, 2023

Since this is client centric, it is usually up to a client to log what it is sending to the remote side. I am not familiar with how to do it on this client, and since client support has been available for enumeration in bazel since its inception, I have never implemented it on the server.

Closing this as Cancelled is a recognized condition with client exhaustion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants