GCS 503s should be retried #2749
Comments
@kcibul Was this run with JP's retry update or without?
without -- once that's merged and part of GATK4 we can try again
They are retried. You're seeing this message because it failed more than the allowed number of retries. The new version, which you really want to use for any test at this point, is described there. Among other things, this new version emits a 'retry failed' message when it runs out of retries, to eliminate the very confusion that you ran into. GenomicsDBImport opens a large number of parallel connections and as a result is getting throttled fairly heavily (by the host GCE machine, if nothing else). This results in timeouts and dropped connections. One way forward is to increase the retry delays; another is to find a way to do the same work with fewer parallel open connections.
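The behavior described above (retry transient 503s with increasing delays, then surface an explicit "retry failed" error once attempts are exhausted) can be sketched as a plain exponential-backoff loop. This is an illustrative sketch only; `BackoffRetry`, `Op`, and `RetryExhaustedException` are hypothetical names, not GATK or google-cloud-java code:

```java
/**
 * Minimal sketch of retry-with-exponential-backoff for transient errors
 * such as HTTP 503. All names here are illustrative, not part of GATK
 * or google-cloud-java.
 */
public class BackoffRetry {
    /** Thrown once all retry attempts are used up, so the failure is explicit. */
    public static class RetryExhaustedException extends RuntimeException {
        public RetryExhaustedException(String msg, Throwable cause) { super(msg, cause); }
    }

    /** A retryable operation that may fail with an exception. */
    public interface Op<T> { T run() throws Exception; }

    public static <T> T withRetries(Op<T> op, int maxAttempts, long initialDelayMillis)
            throws Exception {
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.run();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) break;
                Thread.sleep(delay); // back off before the next attempt
                delay *= 2;          // double the delay each time (exponential backoff)
            }
        }
        // Emit an explicit "retry failed" error instead of the bare 503,
        // to avoid the confusion described in this thread.
        throw new RetryExhaustedException("retry failed after " + maxAttempts + " attempts", last);
    }
}
```

Longer initial delays or a lower connection count would both reduce the chance of exhausting the retry budget under heavy throttling.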
Thanks for the clarification @jean-philippe-martin. Based on JP's analysis @kcibul, it might help decrease the frequency of these errors to run with a smaller batch size. Before resorting to that, though, we should probably evaluate the combined effects of JP's recently merged #2750 as well as his jp_aggressive_reopen branch in gcloud. We hope to move to a newer gcloud release or snapshot soon, but in a pinch we could provide you with a custom-built jar built from the unmerged gcloud branch.
We'll test whether this is resolved by #2822
@jean-philippe-martin In our initial tests with the latest gatk, we're still getting errors like this at a rate of ~2%:
Now, I know that you put in an explicit retry for 503s, so I'm wondering what could be going on. I've asked the person running the tests to check that they're using an up-to-date GATK jar, but I'm also wondering whether we're setting all the right retry options on the GATK side. E.g., your PR googleapis/google-cloud-java#2083 says "but only when OptionMaxChannelReopens is set" -- are we setting this properly? Any other thoughts on things we could try? Thanks!
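Retry options like the one referenced above generally hinge on classifying which failures are transient. As a rough illustration only (not google-cloud-java's actual implementation; `RetryPolicy` is a hypothetical name), a policy typically treats HTTP 5xx responses and dropped-connection exceptions as retryable and 4xx client errors as permanent:

```java
/**
 * Illustrative sketch of classifying failures as retryable. This is not
 * google-cloud-java's actual logic; the class and method names are
 * hypothetical.
 */
public class RetryPolicy {
    /**
     * 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable,
     * 504 Gateway Timeout: the server may recover, so retrying is worthwhile.
     * 4xx errors indicate a problem with the request itself and will not
     * succeed on retry.
     */
    public static boolean isRetryable(int httpStatus) {
        return httpStatus >= 500 && httpStatus < 600;
    }

    /**
     * Dropped connections surface as IOExceptions (e.g. SSL errors,
     * "connection reset"), often wrapped several causes deep; walk the
     * cause chain and treat them like 503s.
     */
    public static boolean isRetryable(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof java.io.IOException) return true;
        }
        return false;
    }
}
```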
Actually, I think the jar they're using might be wrong -- will confirm in a bit (hope that's what it is :) |
I'm 90% certain at this point that the jar they were using for testing was built incorrectly.
Got a 503 that doesn't have the
@jean-philippe-martin This one is interesting, since it happens in a `CloudStorageReadChannel.fetchSize()` call during channel creation. What are the retries/reopens set to by default in `CloudStorageReadChannel`?
@jean-philippe-martin It might be good if we could set the retry settings globally on startup in google-cloud-java instead of on a per-Path basis -- thoughts?
Correct. You're asking about the default: my understanding is that the default retry count is 6. Global settings would be useful, yes, I agree.
@jean-philippe-martin Would it make sense to change the defaults? Would it be easy to add the ability to configure these settings globally rather than per-Path? Should we create a ticket for that?
I don't think so. Network clients aren't normally expected to re-open files when connections are dropped, those are usually user-facing errors. Our re-opening the files is only legit because we know that the files aren't being written to but that's not an assumption we can make in general. Yes please file a ticket - most likely it'll be me implementing it but this way we'll remember. We have to think about whether we want a truly global setting or per-bucket, and what we want to do if libraries start changing the global setting... |
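The "truly global versus per-bucket" question raised above could be sketched as a registry where a per-bucket override falls back to a global default. All of these names are hypothetical (not google-cloud-java API), and the default of 6 is carried from JP's comment earlier in this thread:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of global retry settings with per-bucket overrides.
 * None of these names exist in google-cloud-java; the default of 6 is the
 * retry count mentioned earlier in this thread.
 */
public class RetrySettingsRegistry {
    private static volatile int globalMaxReopens = 6;
    private static final Map<String, Integer> perBucket = new ConcurrentHashMap<>();

    /** Change the process-wide default (the "truly global setting" case). */
    public static void setGlobalMaxReopens(int n) { globalMaxReopens = n; }

    /** Override the setting for one bucket only (the "per-bucket" case). */
    public static void setBucketMaxReopens(String bucket, int n) { perBucket.put(bucket, n); }

    /** A per-bucket override wins; otherwise fall back to the global default. */
    public static int maxReopensFor(String bucket) {
        return perBucket.getOrDefault(bucket, globalMaxReopens);
    }
}
```

One design wrinkle this makes concrete: if multiple libraries in the same JVM call `setGlobalMaxReopens`, the last writer wins, which is exactly the concern raised above about libraries changing a global setting.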
Move to google-cloud-java snapshot with more robust retries, and set number of retries/reopens globally

This moves us to a snapshot of google-cloud-java based off of a branch in my fork here: https://github.com/droazen/google-cloud-java/tree/dr_retry_CloudStorageReadChannel_fetchSize. This patch wraps many more operations within retries, and in our tests resolves the intermittent 503/SSL errors completely when running at scale. This PR also migrates us from setting retry settings per-Path to setting them globally, using a new API from that google-cloud-java branch. This fixes an issue where the number of reopens was getting set to 0 deep in the google-cloud-java library.

Resolves #2749 Resolves #2685 Resolves #3118 Resolves #3120 Resolves #3253

Move to google-cloud-java snapshot with more robust retries, and set number of retries/reopens globally (#3295)

This moves us to a snapshot of google-cloud-java based off of a branch in my fork here: https://github.com/droazen/google-cloud-java/tree/dr_better_nio_retries. This patch wraps many more operations within retries, and in our tests resolves the intermittent 503/SSL errors completely when running at scale. This PR also migrates us from setting retry settings per-Path to setting them globally, using a new API from that google-cloud-java branch. This fixes an issue where the number of reopens was getting set to 0 deep in the google-cloud-java library.

Resolves #2749 Resolves #2685 Resolves #3118 Resolves #3120 Resolves #3253
503 (Service Unavailable) errors are transient and should be retried. Saw this today running GenomicsDBImport:
```
htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: 503 Service Unavailable Service Unavailable, for input source: gs://fc-4c1c7765-2de2-4214-ac41-dc10bbcbb55b/1e300bb3-6990-4342-8959-118826efb3dd/PairedEndSingleSampleWorkflow/d0aac891-4d32-43b9-8adb-edb6b5204af9/call-GatherVCFs/S76-1-2.g.vcf.gz
	at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:102)
	at htsjdk.tribble.TabixFeatureReader.<init>(TabixFeatureReader.java:86)
	at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:106)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getReaderFromVCFUri(GenomicsDBImport.java:443)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getFeatureReaders(GenomicsDBImport.java:425)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:349)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:747)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:116)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:174)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:193)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:121)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:142)
	at org.broadinstitute.hellbender.Main.main(Main.java:220)
Caused by: com.google.cloud.storage.StorageException: 503 Service Unavailable Service Unavailable
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:186)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:332)
	at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:186)
	at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:183)
	at shaded.cloud_nio.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:92)
	at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:47)
	at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:183)
	at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:197)
	at com.google.cloud.storage.contrib.nio.CloudStorageReadChannel.fetchSize(CloudStorageReadChannel.java:194)
	at com.google.cloud.storage.contrib.nio.CloudStorageReadChannel.<init>(CloudStorageReadChannel.java:72)
	at com.google.cloud.storage.contrib.nio.CloudStorageReadChannel.create(CloudStorageReadChannel.java:62)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newReadChannel(CloudStorageFileSystemProvider.java:268)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newByteChannel(CloudStorageFileSystemProvider.java:229)
	at java.nio.file.Files.newByteChannel(Files.java:361)
	at java.nio.file.Files.newByteChannel(Files.java:407)
	at htsjdk.samtools.seekablestream.SeekablePathStream.<init>(SeekablePathStream.java:41)
	at htsjdk.tribble.util.ParsingUtils.openInputStream(ParsingUtils.java:110)
	at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:99)
	... 12 more
Caused by: shaded.cloud_nio.com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable Service Unavailable
	at shaded.cloud_nio.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
	at shaded.cloud_nio.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:330)
	... 28 more
```