grpc Cancel and grpc Deadline #1412

DarkMatterV · 2023-07-29T10:34:53Z

hi, i test aosp rbe， but sometimes I run into "io grpc cancel and deadline_exceed" problems. like as:

Jul 29, 2023 3:12:59 AM build.buildfarm.instance.shard.ShardInstance$3 onError
WARNING: DEADLINE_EXCEEDED: read({hash}/{size}) on worker bazel-buildfarm-shard-worker.bazel-buildfarm-shard-worker-service:8981 after 1114112 bytes of content
Jul 29, 2023 3:12:59 AM build.buildfarm.common.services.ByteStreamService$1 onError
WARNING: error reading {hash}/{size}(shard) at offset 0 after 17 responses and 1114112 bytes of content
io.grpc.StatusRuntimeException: CANCELLED: RPC cancelled
	at io.grpc.Status.asRuntimeException(Status.java:539)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:736)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: io.grpc.StatusRuntimeException: CANCELLED: RPC cancelled

because the timeout period I set in config.yml is long.
maybe it about io.grpc.Context io.grpc.Deadline? Whether the parameter value of the withDeadlineAfter function in the ShardInstance.java file is affected？

The text was updated successfully, but these errors were encountered:

DarkMatterV · 2023-07-29T10:37:56Z

log also have this:


Jul 29, 2023 3:16:49 AM build.buildfarm.common.services.ByteStreamService$1 onError
WARNING: error reading {hash1}/{hash1_size}(shard) at offset 0 after 276 responses and 18087936 bytes of content
io.grpc.StatusRuntimeException: CANCELLED: Cancelled by client with StreamObserver.onError()
	at io.grpc.Status.asRuntimeException(Status.java:539)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:576)
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:757)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:736)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: io.grpc.StatusException: CANCELLED
	at io.grpc.Status.asException(Status.java:547)
	at build.buildfarm.instance.stub.StubInstance$ReadBlobInterchange.lambda$beforeStart$0(StubInstance.java:545)
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:783)
	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49)
	at build.buildfarm.common.grpc.DelegateServerCallStreamObserver.lambda$new$0(DelegateServerCallStreamObserver.java:22)
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:783)
	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49)
	at build.buildfarm.common.grpc.DelegateServerCallStreamObserver.lambda$new$0(DelegateServerCallStreamObserver.java:22)
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:783)
	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49)
	at build.buildfarm.common.grpc.DelegateServerCallStreamObserver.lambda$new$0(DelegateServerCallStreamObserver.java:22)
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:783)
	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49)
	at build.buildfarm.common.grpc.DelegateServerCallStreamObserver.lambda$new$0(DelegateServerCallStreamObserver.java:22)
	at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onCancel(ServerCalls.java:195)
	at io.grpc.PartialForwardingServerCallListener.onCancel(PartialForwardingServerCallListener.java:40)
	at io.grpc.ForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:23)
	at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:40)
	at io.grpc.util.TransmitStatusRuntimeExceptionInterceptor$1.onCancel(TransmitStatusRuntimeExceptionInterceptor.java:83)
	at io.grpc.PartialForwardingServerCallListener.onCancel(PartialForwardingServerCallListener.java:40)
	at io.grpc.ForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:23)
	at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:40)
	at io.grpc.Contexts$ContextualizedServerCallListener.onCancel(Contexts.java:96)
	at io.grpc.PartialForwardingServerCallListener.onCancel(PartialForwardingServerCallListener.java:40)
	at io.grpc.ForwardingServerCallListener.onCancel(ForwardingServerCallListener.java:23)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.closedInternal(ServerCallImpl.java:378)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.closed(ServerCallImpl.java:365)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1Closed.runInContext(ServerImpl.java:923)
	... 5 more

werkt · 2023-07-29T19:47:56Z

The request was cancelled. The client closed their connection, or they explicitly cancelled the operation. Is this correlated with any actual client errors?

The level is a WARNING because this occurs during the normal operation of clients, who typically do not clean up after their own outstanding request mess.

DarkMatterV · 2023-07-31T00:59:33Z

The preceding log is the log of the buildfarm server.
The levels of DEADLINE_EXCEEDED and io.grpc.StatusRuntimeException: CANCELLED are WARNING, so the exception stack does not need to be concerned, right?

werkt · 2023-07-31T01:10:36Z

The DEADLINE_EXCEEDED is an unfortunate misleading piece of hardcoded copy: the only status here is a CANCELLED, for which we need to fix the error output. There is no DEADLINE_EXCEEDED, and yes, this is all just warnings, so it's a result of clients cancelling their reads.

I'll put up a diff that makes the error less confusing when the status isn't DEADLINE_EXCEEDED and it fails the SHARD_IS_RETRIABLE test.

DarkMatterV · 2023-07-31T02:05:31Z

okay， thanks.

It seems that the aosp client can't receive the complete blob file from the buildfarm server. Therefore, the aosp client parses the file and reports the error "error: unterminated conditional direct" and then "fatal error: too many errors committed, stopping now" to terminate the request.

DarkMatterV · 2023-07-31T02:11:09Z

When client NINJA_REMOTE_NUM_JOBS=500（The number of concurrent requests is 500.）， it can build success.

But client NINJA_REMOTE_NUM_JOBS=1000， it build failed, and reports the error "error: unterminated conditional direct" and "fatal error: too many errors committed, stopping now".

Do you know why the client can't receive the complete blob file （because aosp rbe isn't opensource, I can't locate）？

werkt · 2023-07-31T02:21:56Z

You're probably running out of bandwidth to receive the blobs in the timeouts of each transfer. I'm not familiar with the remote execution client you're referring to, but the math is constrained by delivery regardless, and if the client is not designed to accept any progress (and even this breaks down eventually) for each transfer allowing it to continue beyond the deadline (bazel originally had this problem), you will not be able to request concurrency beyond a limit:

For a number of concurrent (i.e. your JOBS count) activities, your bandwidth is divided by your jobs. If there are jobs downloads happening, they're each only getting a portion of the bandwidth. Your transfer time is then your largest blob size divided by this rate. If this time is greater than your timeout, you will never complete all of your transfers within the timeout.

This all assumes naive behavior by the client, which, as you've mentioned, is not transparent to our investigation, so this is the most likely failure case I can guess from my side. Is there even a link to this client? I assume this is Android's Open Source Project? Is there a tool that I can see without paying for something?

DarkMatterV · 2023-07-31T06:14:00Z

I think you are right.
Do I have the fastest way to query the largest blob size? I want to compute transfer time

https://source.android.com/docs/setup/download/downloading this is android open source project, but it's rbe（Google's Remote Build Execution(RBE) service）isn't open source, it just calls the binary in the prebuilts/remoteexecution-client/live directory.

werkt · 2023-08-04T12:18:06Z

Since this is client centric, it is usually up to a client to log what it is sending to the remote side. I am not familiar with how to do it on this client, and since client support has been available for enumeration in bazel since its inception, I have never implemented it on the server.

Closing this as Cancelled is a recognized condition with client exhaustion.

werkt closed this as completed Aug 4, 2023

monaka mentioned this issue May 5, 2024

Many WARNING: DEADLINE_EXCEEDED: write logs from buildfarm-server. #1724

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

grpc Cancel and grpc Deadline #1412

grpc Cancel and grpc Deadline #1412

DarkMatterV commented Jul 29, 2023

DarkMatterV commented Jul 29, 2023

werkt commented Jul 29, 2023

DarkMatterV commented Jul 31, 2023

werkt commented Jul 31, 2023

DarkMatterV commented Jul 31, 2023

DarkMatterV commented Jul 31, 2023

werkt commented Jul 31, 2023

DarkMatterV commented Jul 31, 2023 •

edited

Loading

werkt commented Aug 4, 2023

grpc Cancel and grpc Deadline #1412

grpc Cancel and grpc Deadline #1412

Comments

DarkMatterV commented Jul 29, 2023

DarkMatterV commented Jul 29, 2023

werkt commented Jul 29, 2023

DarkMatterV commented Jul 31, 2023

werkt commented Jul 31, 2023

DarkMatterV commented Jul 31, 2023

DarkMatterV commented Jul 31, 2023

werkt commented Jul 31, 2023

DarkMatterV commented Jul 31, 2023 • edited Loading

werkt commented Aug 4, 2023

DarkMatterV commented Jul 31, 2023 •

edited

Loading