grpc Cancel and grpc Deadline #1412
The log also contains this:
The request was cancelled. The client closed their connection, or they explicitly cancelled the operation. Is this correlated with any actual client errors? The level is a WARNING because this occurs during the normal operation of clients, who typically do not clean up after their own outstanding request mess.
The preceding log is from the buildfarm server.
The DEADLINE_EXCEEDED is an unfortunate, misleading piece of hardcoded copy: the only status here is CANCELLED, and we need to fix the error output for it. There is no DEADLINE_EXCEEDED, and yes, these are all just warnings, so this is the result of clients cancelling their reads. I'll put up a diff that makes the error less confusing when the status isn't DEADLINE_EXCEEDED and it fails the SHARD_IS_RETRIABLE test.
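The distinction being discussed can be sketched in a few lines. This is a hedged illustration only: buildfarm's actual `SHARD_IS_RETRIABLE` predicate is internal, and the enum below is a stand-in for `io.grpc.Status`, not the real class. The point is that CANCELLED (the client went away) should not be reported as DEADLINE_EXCEEDED, and should not be retried on the client's behalf.

```java
// Stand-in for gRPC status codes; not io.grpc.Status.
enum Code { OK, CANCELLED, DEADLINE_EXCEEDED, UNAVAILABLE }

class RetryCheck {
    // CANCELLED means the client abandoned the call; retrying is pointless.
    // DEADLINE_EXCEEDED or UNAVAILABLE may be worth retrying on another shard.
    static boolean isRetriable(Code code) {
        switch (code) {
            case DEADLINE_EXCEEDED:
            case UNAVAILABLE:
                return true;
            default:
                return false;
        }
    }

    // Report the actual status instead of hardcoding DEADLINE_EXCEEDED,
    // which is the misleading copy described above.
    static String describe(Code code) {
        return "request failed with " + code
                + (isRetriable(code) ? " (retriable)" : " (not retriable)");
    }
}
```

With this shape, a CANCELLED status produces "request failed with CANCELLED (not retriable)" rather than a spurious deadline message.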
Okay, thanks. It seems that the AOSP client can't receive the complete blob file from the buildfarm server. The client then parses the truncated file, reports "error: unterminated conditional direct" followed by "fatal error: too many errors committed, stopping now", and terminates the request.
With NINJA_REMOTE_NUM_JOBS=500 (500 concurrent requests) the build succeeds, but with NINJA_REMOTE_NUM_JOBS=1000 it fails and reports "error: unterminated conditional direct" and "fatal error: too many errors committed, stopping now". Do you know why the client can't receive the complete blob file? (Because the AOSP RBE client isn't open source, I can't debug it myself.)
You're probably running out of bandwidth to receive the blobs within the timeout of each transfer. I'm not familiar with the remote execution client you're referring to, but the math is constrained by delivery regardless, and if the client is not designed to accept any progress on each transfer as a reason to continue past the deadline (bazel originally had this problem, and even that breaks down eventually), you will not be able to request concurrency beyond a limit.

For a number of concurrent activities (i.e. your JOBS count), your bandwidth is divided by your jobs: if N downloads are happening, each only gets a portion of the bandwidth. Your transfer time is then your largest blob size divided by this per-job rate. If this time is greater than your timeout, you will never complete all of your transfers within the timeout.

This all assumes naive behavior by the client, which, as you've mentioned, is not transparent to our investigation, so this is the most likely failure case I can guess from my side. Is there even a link to this client? I assume this is Android's Open Source Project? Is there a tool that I can see without paying for something?
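The math above can be made concrete with a back-of-the-envelope model. All numbers here are assumptions for illustration (a 1 Gbit/s link, a 50 MB largest blob, a 300 s per-transfer deadline), not measurements from this setup:

```java
// Illustrative model: total bandwidth is split evenly across concurrent
// downloads, so the time to fetch the largest blob grows linearly with jobs.
class TransferMath {
    static double transferSeconds(double bandwidthBytesPerSec, int jobs,
                                  double largestBlobBytes) {
        double perJobRate = bandwidthBytesPerSec / jobs; // each job's share
        return largestBlobBytes / perJobRate;            // time for largest blob
    }

    public static void main(String[] args) {
        double bandwidth = 125_000_000.0; // assume 1 Gbit/s link = 125 MB/s
        double blob = 50_000_000.0;       // assume a 50 MB largest blob
        double timeout = 300.0;           // assume a 300 s per-transfer deadline

        // 500 jobs: 50 MB / (125 MB/s / 500) = 200 s -> inside the deadline.
        System.out.println(TransferMath.transferSeconds(bandwidth, 500, blob));
        // 1000 jobs: 400 s -> exceeds the deadline; transfers get cancelled.
        System.out.println(TransferMath.transferSeconds(bandwidth, 1000, blob));
    }
}
```

Under these assumed numbers, 500 jobs fit within the deadline and 1000 do not, which matches the observed success/failure pattern qualitatively.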
I think you are right. https://source.android.com/docs/setup/download/downloading is the Android Open Source Project, but its RBE (Google's Remote Build Execution service) isn't open source; the build just calls the binary in the prebuilts/remoteexecution-client/live directory.
Since this is client centric, it is usually up to the client to log what it sends to the remote side. I am not familiar with how to do that with this client, and since client support has been available in bazel since its inception, I have never implemented it on the server. Closing this, as Cancelled is a recognized condition of client exhaustion.
Hi, I'm testing AOSP RBE, but sometimes I run into io.grpc Cancel and DEADLINE_EXCEEDED problems, like the above,
even though the timeout period I set in config.yml is long.
Maybe it's related to io.grpc.Context or io.grpc.Deadline? Could the parameter value passed to the withDeadlineAfter function in ShardInstance.java be affecting this?