
ApiCallTimeoutException fails to include exception of last failure; loses root cause and so breaks recovery logic of applications. #4738

Open
steveloughran opened this issue Dec 4, 2023 · 7 comments
Labels
bug This issue is a bug. p2 This is a standard priority issue

Comments

@steveloughran

Describe the bug

We see this when the S3 client is trying to use the S3Express CreateSession call and is configured with so many retries (10) that the call times out before the retry limit is reached. Rather than including the underlying exception that triggered the retries, a plain "call timed out" exception is raised, with only a suppressed "java.lang.RuntimeException: Task failed."

I believe this is a regression from the v1 SDK.
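For context, this is roughly the kind of application-side recovery check that breaks when the cause chain is stripped; the helper name and the focus on UnknownHostException are illustrative, not from the SDK:

```java
// Illustrative sketch of application recovery logic (hypothetical helper).
// It walks the cause chain looking for a non-recoverable configuration failure;
// with the root cause dropped from ApiCallTimeoutException, it can never match.
static boolean isConfigurationFailure(Throwable thrown) {
    for (Throwable t = thrown; t != null; t = t.getCause()) {
        if (t instanceof java.net.UnknownHostException) {
            return true;   // endpoint/DNS misconfiguration: retrying will not help
        }
    }
    return false;
}
```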

Expected Behavior

ApiCallTimeoutException should include, as its cause, the exception that triggered the retries internally.

Current Behavior

See HADOOP-19000 for this surfacing when connecting to S3Express buckets.

Stack trace on timeouts contains no root cause information.

Caused by: software.amazon.awssdk.core.exception.ApiCallTimeoutException: Client execution did not complete before the specified timeout configuration: 10000 millis
        at software.amazon.awssdk.core.exception.ApiCallTimeoutException$BuilderImpl.build(ApiCallTimeoutException.java:97)
        at software.amazon.awssdk.core.exception.ApiCallTimeoutException.create(ApiCallTimeoutException.java:38)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.generateApiCallTimeoutException(ApiCallTimeoutTrackingStage.java:151)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.handleInterruptedException(ApiCallTimeoutTrackingStage.java:139)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.translatePipelineException(ApiCallTimeoutTrackingStage.java:107)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:62)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
        at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
        at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
        at software.amazon.awssdk.services.s3.DefaultS3Client.createSession(DefaultS3Client.java:1589)
        at software.amazon.awssdk.services.s3.S3Client.createSession(S3Client.java:2505)
        at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.getCredentials(S3ExpressIdentityCache.java:88)
        at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.lambda$getCachedCredentials$0(S3ExpressIdentityCache.java:73)
        at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.refreshResult(CachedS3ExpressCredentials.java:91)
        at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.lambda$new$0(CachedS3ExpressCredentials.java:70)
        at software.amazon.awssdk.utils.cache.CachedSupplier.lambda$jitteredPrefetchValueSupplier$8(CachedSupplier.java:300)
        at software.amazon.awssdk.utils.cache.NonBlocking.fetch(NonBlocking.java:151)
        at software.amazon.awssdk.utils.cache.CachedSupplier.refreshCache(CachedSupplier.java:208)
        at software.amazon.awssdk.utils.cache.CachedSupplier.get(CachedSupplier.java:135)
        at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.get(CachedS3ExpressCredentials.java:85)
        at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.get(S3ExpressIdentityCache.java:61)
        at software.amazon.awssdk.services.s3.internal.s3express.DefaultS3ExpressIdentityProvider.lambda$resolveIdentity$0(DefaultS3ExpressIdentityProvider.java:56)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
        at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
        at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
        at software.amazon.awssdk.services.s3.internal.s3express.DefaultS3ExpressIdentityProvider.resolveIdentity(DefaultS3ExpressIdentityProvider.java:49)
        at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.trySelectAuthScheme(S3AuthSchemeInterceptor.java:142)
        at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.selectAuthScheme(S3AuthSchemeInterceptor.java:81)
        at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.beforeExecution(S3AuthSchemeInterceptor.java:61)
        at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.lambda$beforeExecution$1(ExecutionInterceptorChain.java:62)
        at java.util.ArrayList.forEach(ArrayList.java:1259)
        at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.beforeExecution(ExecutionInterceptorChain.java:62)
        at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.runInitialInterceptors(AwsExecutionContextBuilder.java:239)
        at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.invokeInterceptorsAndCreateExecutionContext(AwsExecutionContextBuilder.java:130)
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.invokeInterceptorsAndCreateExecutionContext(AwsSyncClientHandler.java:67)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:76)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
        at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
        at software.amazon.awssdk.services.s3.DefaultS3Client.listObjectsV2(DefaultS3Client.java:7323)
        at software.amazon.awssdk.services.s3.DelegatingS3Client.lambda$listObjectsV2$63(DelegatingS3Client.java:5856)
        at software.amazon.awssdk.services.s3.internal.crossregion.S3CrossRegionSyncClient.invokeOperation(S3CrossRegionSyncClient.java:73)
        at software.amazon.awssdk.services.s3.DelegatingS3Client.listObjectsV2(DelegatingS3Client.java:5856)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listObjects$13(S3AFileSystem.java:2963)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
        ... 8 more
        Suppressed: java.lang.RuntimeException: Task failed.
                at software.amazon.awssdk.utils.CompletableFutureUtils.joinLikeSync(CompletableFutureUtils.java:254)
                at software.amazon.awssdk.auth.signer.AwsSignerExecutionAttribute.awsCredentialsReadMapping(AwsSignerExecutionAttribute.java:201)
                at software.amazon.awssdk.core.interceptor.ExecutionAttribute$DerivationValueStorage.get(ExecutionAttribute.java:260)
                at software.amazon.awssdk.core.interceptor.ExecutionAttributes.getAttribute(ExecutionAttributes.java:53)
                at software.amazon.awssdk.core.interceptor.ExecutionAttributes.getOptionalAttribute(ExecutionAttributes.java:68)
                at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.invokeInterceptorsAndCreateExecutionContext(AwsExecutionContextBuilder.java:144)
                ... 22 more

Reproduction Steps

  • configure more retries than can complete before the API call timeout (a configuration sketch follows this list)
  • try an operation which fails with a retryable exception (connectivity, etc.)
  • wait for the failure
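A minimal configuration sketch that sets up that mismatch, assuming the v2 ClientOverrideConfiguration/RetryPolicy APIs; the timeout and retry values are illustrative:

```java
import java.time.Duration;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.services.s3.S3Client;

// 10 retries with backoff cannot complete inside a 10s api-call timeout,
// so the call fails with ApiCallTimeoutException before retries are exhausted.
S3Client s3 = S3Client.builder()
        .overrideConfiguration(ClientOverrideConfiguration.builder()
                .apiCallTimeout(Duration.ofSeconds(10))
                .retryPolicy(RetryPolicy.builder().numRetries(10).build())
                .build())
        .build();
```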

Possible Solution

We really need that underlying exception for our own decision-making about what to do next. I fear we are going to have to change the S3A retry policies so that we have special handling for the first failure of any S3 operation, on the basis that it is a configuration problem that retries will not recover from. That will add the overhead of a needless S3 call.
Would it be possible to do something like save the innermost exception and add it as the root cause when throwing an ApiCallTimeoutException?
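One possible shape for that, sketched only and assuming the last attempt's failure is still in scope in the timeout-handling stage (the lastFailure and timeoutMillis variables are hypothetical):

```java
// Sketch: preserve the last attempt's failure as the cause of the timeout exception.
// SdkException builders already expose cause(Throwable), so something like:
throw ApiCallTimeoutException.builder()
        .message("Client execution did not complete before the specified timeout "
                + "configuration: " + timeoutMillis + " millis")
        .cause(lastFailure)   // hypothetical: the exception from the final retry attempt
        .build();
```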

Additional Information/Context

No response

AWS Java SDK version used

2.21.33

JDK version used

openjdk version "1.8.0_362" OpenJDK Runtime Environment (Zulu 8.68.0.21-CA-macos-aarch64) (build 1.8.0_362-b09) OpenJDK 64-Bit Server VM (Zulu 8.68.0.21-CA-macos-aarch64) (build 25.362-b09, mixed mode)

Operating System and version

macos 13.4.1

@steveloughran steveloughran added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Dec 4, 2023
@debora-ito debora-ito added the needs-review This issue or PR needs review from the team. label Dec 13, 2023
@debora-ito debora-ito added p0 This issue is the highest priority p2 This is a standard priority issue and removed p0 This issue is the highest priority needs-triage This issue or PR still needs to be triaged. labels Dec 27, 2023
@debora-ito
Member

@steveloughran acknowledged. We'll think of a way to surface the underlying exception.

@debora-ito debora-ito added p1 This is a high priority issue p2 This is a standard priority issue and removed needs-review This issue or PR needs review from the team. p2 This is a standard priority issue p1 This is a high priority issue labels Jan 9, 2024
@debora-ito
Member

@steveloughran after the change released in #4779, do you still see this issue?

The idea is that since the timeouts are propagated now, this should be mitigated.

@debora-ito debora-ito added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Jan 24, 2024
@steveloughran
Author

I'd still like the reason for internal retry failures to be passed up. For example:

- client retries on all IOEs
- endpoint is generating UnknownHostException
- which is retried
- timeout exception will say timeout; inner cause is lost

We deal with that in our code by having a very limited set of retries. If it weren't for the transfer manager, I'd set the SDK retry count to 0 and handle it all ourselves.
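For reference, disabling SDK-level retries so the application layer sees every failure directly is a one-liner; this is a sketch, assuming the RetryPolicy-based configuration:

```java
// Sketch: let the application (e.g. the S3A Invoker) own all retry decisions.
ClientOverrideConfiguration noSdkRetries = ClientOverrideConfiguration.builder()
        .retryPolicy(RetryPolicy.none())
        .build();
```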

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Jan 26, 2024
@james-s-w-clark

This could be really helpful for us. We hit the 60-second timeout (even for tiny files in S3), and still have the same issue with a 120-second timeout.

We don't know the root cause, and it seems like #4057: random failures that cause all future S3 interactions to fail.

I can't share a reproducible snippet unfortunately.

@steveloughran
Author

@IdiosApps if the HTTP connection is stale/dead and gets returned to the pool, you can get recurrent failures. Check your TTLs, and if your read() gets errors, abort() the connection.
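A sketch of both suggestions, assuming the Apache HTTP client is in use and an existing S3Client named s3; the TTL value, bucket, and key are placeholders:

```java
import java.io.IOException;
import java.time.Duration;
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

// Cap how long a pooled connection may be reused.
SdkHttpClient httpClient = ApacheHttpClient.builder()
        .connectionTimeToLive(Duration.ofSeconds(30))
        .build();

// On a read error, abort() so the broken connection is discarded, not returned to the pool.
ResponseInputStream<GetObjectResponse> in = s3.getObject(
        GetObjectRequest.builder().bucket("my-bucket").key("my-key").build());
try {
    // ... read from the stream ...
} catch (IOException e) {
    in.abort();
    throw e;
} finally {
    in.close();
}
```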

@pilardi

pilardi commented Apr 11, 2024

Not to say this is the root cause of the problem (there's probably still a bug in the code), but have you checked the throughput parameter in the client? It might be slowing things down by default.
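If the "throughput parameter" refers to the CRT-based client, this is where it is set; the value is illustrative, and whether it applies depends on which client is actually in use:

```java
import software.amazon.awssdk.services.s3.S3AsyncClient;

// Sketch: the CRT-based async client sizes its connection pool and parallelism
// from this throughput target.
S3AsyncClient s3Async = S3AsyncClient.crtBuilder()
        .targetThroughputInGbps(10.0)
        .build();
```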

@steveloughran
Author

Our problems weren't with the throughput manager; we saw this for unknown host exceptions, e.g. trying to use FIPS in a region without it.
