
GCS 503s should be retried #2749

Closed
kcibul opened this issue May 24, 2017 · 14 comments · Fixed by #3295

@kcibul (Contributor) commented May 24, 2017

503 (Service Unavailable) errors are transient and should be retried. I saw this today while running GenomicsDBImport:

htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: 503 Service Unavailable Service Unavailable, for input source: gs://fc-4c1c7765-2de2-4214-ac41-dc10bbcbb55b/1e300bb3-6990-4342-8959-118826efb3dd/PairedEndSingleSampleWorkflow/d0aac891-4d32-43b9-8adb-edb6b5204af9/call-GatherVCFs/S76-1-2.g.vcf.gz
	at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:102)
	at htsjdk.tribble.TabixFeatureReader.<init>(TabixFeatureReader.java:86)
	at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:106)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getReaderFromVCFUri(GenomicsDBImport.java:443)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getFeatureReaders(GenomicsDBImport.java:425)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:349)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:747)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:116)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:174)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:193)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:121)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:142)
	at org.broadinstitute.hellbender.Main.main(Main.java:220)
Caused by: com.google.cloud.storage.StorageException: 503 Service Unavailable Service Unavailable
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:186)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:332)
	at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:186)
	at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:183)
	at shaded.cloud_nio.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:92)
	at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:47)
	at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:183)
	at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:197)
	at com.google.cloud.storage.contrib.nio.CloudStorageReadChannel.fetchSize(CloudStorageReadChannel.java:194)
	at com.google.cloud.storage.contrib.nio.CloudStorageReadChannel.<init>(CloudStorageReadChannel.java:72)
	at com.google.cloud.storage.contrib.nio.CloudStorageReadChannel.create(CloudStorageReadChannel.java:62)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newReadChannel(CloudStorageFileSystemProvider.java:268)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newByteChannel(CloudStorageFileSystemProvider.java:229)
	at java.nio.file.Files.newByteChannel(Files.java:361)
	at java.nio.file.Files.newByteChannel(Files.java:407)
	at htsjdk.samtools.seekablestream.SeekablePathStream.<init>(SeekablePathStream.java:41)
	at htsjdk.tribble.util.ParsingUtils.openInputStream(ParsingUtils.java:110)
	at htsjdk.tribble.TabixFeatureReader.readHeader(TabixFeatureReader.java:99)
	... 12 more
Caused by: shaded.cloud_nio.com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable Service Unavailable
	at shaded.cloud_nio.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
	at shaded.cloud_nio.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:330)
	... 28 more

@lbergelson (Member)

@kcibul Was this run with JP's retry update or without?

@kcibul (Contributor, Author) commented May 25, 2017 via email

@droazen droazen added this to the beta milestone May 25, 2017
@jean-philippe-martin (Contributor)

They are retried. You're seeing this message because it failed more than maxChannelReopens times.

The new version, which you really want to use for any test at this point, is described here:
#2685 (comment)

Among other things, this new version emits a 'retry failed' message when it runs out of retries, precisely to eliminate the confusion you ran into.

GenomicsDBImport opens a large number of parallel connections and as a result is getting throttled fairly heavily (by the host GCE machine if nothing else). This results in timeouts and dropped connections. One way forward is to increase the retry delays, another is to find a way to do the same work with fewer parallel open connections.
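For context, the retry scheme under discussion is essentially exponential backoff around each storage call: on a transient failure such as a 503, wait, double the delay, and try again until a retry budget is exhausted. Here is a minimal, self-contained sketch of that idea; all class and method names (`RetryDemo`, `withRetries`, `TransientStorageException`) are illustrative, not the actual google-cloud-nio implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class RetryDemo {
    /** Thrown by the fake storage call below to simulate a transient HTTP 503. */
    static class TransientStorageException extends RuntimeException {
        TransientStorageException(String msg) { super(msg); }
    }

    /**
     * Runs the action, retrying up to maxRetries times on transient errors,
     * sleeping with exponential backoff between attempts. Rethrows (wrapped)
     * once the retry budget is exhausted.
     */
    static <T> T withRetries(Supplier<T> action, int maxRetries,
                             long initialDelayMillis) throws InterruptedException {
        long delay = initialDelayMillis;
        for (int attempt = 0; ; attempt++) {
            try {
                return action.get();
            } catch (TransientStorageException e) {
                if (attempt >= maxRetries) {
                    throw new RuntimeException(
                        "retry failed after " + maxRetries + " retries", e);
                }
                Thread.sleep(delay);
                delay *= 2;  // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        // Fake GCS read that returns 503 twice, then succeeds on the third try.
        String header = withRetries(() -> {
            if (calls.incrementAndGet() <= 2) {
                throw new TransientStorageException("503 Service Unavailable");
            }
            return "##fileformat=VCFv4.2";
        }, 5, 10);
        System.out.println(header + " after " + calls.get() + " attempts");
    }
}
```

The failure mode described above is simply this loop running out of budget: with many parallel connections being throttled, even a generous maxRetries can be exhausted, which is why increasing the delays or reducing the number of open connections both help.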

@droazen (Collaborator) commented May 25, 2017

Thanks for the clarification @jean-philippe-martin.

Based on JP's analysis @kcibul, running GenomicsDBImport with a smaller --batchSize value might decrease the frequency of these errors, since that flag controls the number of simultaneously open GCS connections. This would come at the cost of greater fragmentation within GenomicsDB itself, however, which might force us to run with --consolidate.

Before resorting to running with a smaller batch size, we should probably evaluate the combined effects of JP's recently merged #2750 as well as his jp_aggressive_reopen branch in gcloud. We hope to move to a newer gcloud release or snapshot soon, but in a pinch we could provide you with a custom-built jar built on the unmerged gcloud branch.

@droazen (Collaborator) commented Jun 6, 2017

We'll test whether this is resolved by #2822

@droazen (Collaborator) commented Jun 7, 2017

@jean-philippe-martin In our initial tests with the latest GATK, we're still getting errors like this at a rate of ~2%:

com.google.cloud.storage.StorageException: 503 Service Unavailable
java.lang.IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment.  Please set a project ID using the builder.
com.google.cloud.storage.StorageException: 503 Service Unavailable
com.google.cloud.storage.StorageException: 503 Service Unavailable
java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.google.cloud.storage.StorageException: Connection has been shutdown: javax.net.ssl.SSLException: java.net.SocketException: Connection reset
com.google.cloud.storage.StorageException: 503 Service Unavailable
com.google.cloud.storage.StorageException: 503 Service Unavailable
com.google.cloud.storage.StorageException: 503 Service Unavailable
com.google.cloud.storage.StorageException: 503 Service Unavailable
com.google.cloud.storage.StorageException: 503 Service Unavailable
java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.google.cloud.storage.StorageException: Connection has been shutdown: javax.net.ssl.SSLException: java.net.SocketException: Connection reset
htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Remote host closed connection during handshake, for input source:

Now, I know that you put in an explicit retry for 503s, so I'm wondering what could be going on. I've asked the person running the tests to confirm that they're using an up-to-date GATK jar, but I'm also wondering whether we're setting all the right retry options on the GATK side. E.g., your PR googleapis/google-cloud-java#2083 says "but only when OptionMaxChannelReopens is set" -- are we setting this properly? Any other thoughts on things we could try?

Thanks!

@droazen (Collaborator) commented Jun 7, 2017

Actually, I think the jar they're using might be wrong -- I'll confirm in a bit (I hope that's what it is :)

@droazen (Collaborator) commented Jun 7, 2017

I'm 90% certain at this point that the jar they were using for testing was built incorrectly. The CloudStorageReadChannel.class file in the latest google-cloud-nio-0.19.0-alpha-shaded.jar should be 6169 bytes, but in their jar that class is 5401 bytes. And that's the primary class JP patched in the latest release. So I'm pretty hopeful this was just a simple build error.

@Horneth commented Jun 14, 2017

Got a 503 that doesn't have the "All reopens failed" message:

com.google.cloud.storage.StorageException: 503 Service Unavailable
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:189)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:335)
	at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:191)
	at com.google.cloud.storage.StorageImpl$5.call(StorageImpl.java:188)
	at shaded.cloud_nio.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:93)
	at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:49)
	at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:188)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.checkAccess(CloudStorageFileSystemProvider.java:550)
	at java.nio.file.Files.exists(Files.java:2385)
	at htsjdk.tribble.util.ParsingUtils.resourceExists(ParsingUtils.java:428)
	at htsjdk.tribble.AbstractFeatureReader.isTabix(AbstractFeatureReader.java:217)
	at htsjdk.tribble.AbstractFeatureReader$ComponentMethods.isTabix(AbstractFeatureReader.java:223)
	at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:103)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getReaderFromVCFUri(GenomicsDBImport.java:454)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.getFeatureReaders(GenomicsDBImport.java:436)
	at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:358)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:779)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:122)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:143)
	at org.broadinstitute.hellbender.Main.main(Main.java:221)
Caused by: shaded.cloud_nio.com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
	at shaded.cloud_nio.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
	at shaded.cloud_nio.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
	at shaded.cloud_nio.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
	at com.google.cloud.storage.spi.v1.HttpStorageRpc.get(HttpStorageRpc.java:333)
	... 21 more

@droazen (Collaborator) commented Jun 14, 2017

@jean-philippe-martin This one is interesting, since it happens in a Files.exists() check in ParsingUtils.resourceExists() in htsjdk, which creates its own Path object, bypassing the generous retries set in the GATK.

What are the retries/reopens set to by default in CloudStorageReadChannel? It looks to me like maxChannelReopens might default to 0, even though maxRetries defaults to at least 3. And what are the default values for the settings we configure via StorageOptions.Builder.setRetrySettings() in GATK?
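For readers following along, the GATK-side configuration being referred to looks roughly like the fragment below. This is a hedged sketch of the gax/google-cloud-java builder APIs of this era, not GATK's actual values; the specific numbers are illustrative, and exact method signatures may differ by library version:

```java
import com.google.api.gax.retrying.RetrySettings;
import com.google.cloud.storage.StorageOptions;
import org.threeten.bp.Duration;

// Illustrative per-client retry configuration (values are made up here).
// Note that this governs request-level retries inside the library; the
// maxChannelReopens behavior discussed in this thread is configured
// separately, which is exactly the gap being debated.
RetrySettings retrySettings = RetrySettings.newBuilder()
    .setMaxAttempts(15)                             // total attempts, not just retries
    .setInitialRetryDelay(Duration.ofMillis(1000))
    .setRetryDelayMultiplier(2.0)                   // exponential backoff
    .setMaxRetryDelay(Duration.ofSeconds(32))
    .setTotalTimeout(Duration.ofMinutes(5))
    .build();

StorageOptions options = StorageOptions.newBuilder()
    .setRetrySettings(retrySettings)
    .build();
```

The key point in the discussion is that these settings only apply to Storage clients (and Paths) built from this options object; a Path constructed elsewhere, e.g. inside htsjdk, would not pick them up.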

@droazen (Collaborator) commented Jun 14, 2017

@jean-philippe-martin It might be good if we could set the retry settings globally on startup in google-cloud-java instead of on a per-Path basis -- thoughts?

@jean-philippe-martin (Contributor)

Correct, maxChannelReopens defaults to 0. So this path used only the underlying retry mechanism (which doesn't emit a special message when it's done, hence we didn't see "All retries failed").

You're asking about the default: my understanding is that the default retry count is 6.

Global settings would be useful, yes I agree.

@droazen (Collaborator) commented Jun 15, 2017

@jean-philippe-martin Would it make sense to change the default maxChannelReopens to 3? There will inevitably be code paths that construct their own Path objects.

Would it be easy to add the ability to configure these settings globally rather than per-Path? Should we create a ticket for that?

@droazen droazen modified the milestones: Engine-4.0, Engine-beta Jun 15, 2017
@jean-philippe-martin (Contributor)

I don't think so. Network clients aren't normally expected to re-open files when connections are dropped; those are usually user-facing errors. Our re-opening of files is only legitimate because we know the files aren't being written to, but that's not an assumption we can make in general.

Yes, please file a ticket - most likely I'll be the one implementing it, but this way we'll remember. We have to think about whether we want a truly global setting or a per-bucket one, and what we want to do if libraries start changing the global setting...
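A global setting of the kind being requested might look something like the following. This is purely a hypothetical sketch based on the discussion; the class and method names (`CloudStorageConfiguration.builder()`, `setDefaultCloudStorageConfiguration`) are assumptions about what such an API could look like, not a confirmed API of the releases discussed in this thread:

```java
import com.google.cloud.storage.contrib.nio.CloudStorageConfiguration;
import com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider;

// Hypothetical process-wide default, so that Paths created outside our
// control (e.g. htsjdk's own Files.exists() checks) would still inherit
// the retry/reopen settings instead of falling back to maxChannelReopens = 0.
CloudStorageConfiguration config = CloudStorageConfiguration.builder()
    .maxChannelReopens(3)   // recover from dropped connections by reopening the channel
    .build();
CloudStorageFileSystemProvider.setDefaultCloudStorageConfiguration(config);
```

The trade-off JP raises is real: a process-wide mutable default is convenient, but any library that changes it affects every other user of the provider in the same JVM.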

@droazen droazen modified the milestones: Engine-beta2, Engine-4.0 Jul 10, 2017
droazen added a commit that referenced this issue Jul 17, 2017
Move to google-cloud-java snapshot with more robust retries, and set number of retries/reopens globally

This moves us to a snapshot of google-cloud-java based off of a branch in my fork here:
https://github.com/droazen/google-cloud-java/tree/dr_retry_CloudStorageReadChannel_fetchSize.
This patch wraps many more operations within retries, and in our tests
resolves the intermittent 503/SSL errors completely when running at scale.

This PR also migrates us from setting retry settings per-Path to setting it globally,
using a new API from that google-cloud-java branch. This fixes an issue where the
number of reopens was getting set to 0 deep in the google-cloud-java library.

Resolves #2749
Resolves #2685
Resolves #3118
Resolves #3120
Resolves #3253
droazen added a commit that referenced this issue Jul 20, 2017
Move to google-cloud-java snapshot with more robust retries, and set number of retries/reopens globally (#3295)

Move to google-cloud-java snapshot with more robust retries, and set number of retries/reopens globally

This moves us to a snapshot of google-cloud-java based off of a branch in my fork here:

https://github.com/droazen/google-cloud-java/tree/dr_better_nio_retries

This patch wraps many more operations within retries, and in our tests
resolves the intermittent 503/SSL errors completely when running at scale.

This PR also migrates us from setting retry settings per-Path to setting it globally,
using a new API from that google-cloud-java branch. This fixes an issue where the
number of reopens was getting set to 0 deep in the google-cloud-java library.

Resolves #2749
Resolves #2685
Resolves #3118
Resolves #3120
Resolves #3253