
ApiCallTimeoutException fails to include exception of last failure; loses root cause and so breaks recovery logic of applications. #4738

Open
steveloughran opened this issue Dec 4, 2023 · 7 comments
Labels
bug This issue is a bug. p2 This is a standard priority issue

Comments

@steveloughran

Describe the bug

We see this when the S3 client is trying to use the S3Express CreateSession call and is configured with so many retries (10) that the call times out before the retry limit is reached. Rather than including the underlying exception that triggered the retries, a plain "call timed out" exception is raised, with only a suppressed "java.lang.RuntimeException: Task failed."

I believe this is a regression from the v1 SDK.
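For context, this is roughly the kind of application-side recovery check that breaks when the cause chain is stripped; the helper name and the focus on UnknownHostException are illustrative, not from the SDK:

```java
// Illustrative sketch of application recovery logic (hypothetical helper).
// It walks the cause chain looking for a non-recoverable configuration failure;
// with the root cause dropped from ApiCallTimeoutException, it can never match.
static boolean isConfigurationFailure(Throwable thrown) {
    for (Throwable t = thrown; t != null; t = t.getCause()) {
        if (t instanceof java.net.UnknownHostException) {
            return true;   // endpoint/DNS misconfiguration: retrying will not help
        }
    }
    return false;
}
```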

Expected Behavior

ApiCallTimeoutException should include, as its cause, the exception that triggered the retries internally.

Current Behavior

See HADOOP-19000 for this surfacing when connecting to S3Express buckets.

Stack trace on timeouts contains no root cause information.

Caused by: software.amazon.awssdk.core.exception.ApiCallTimeoutException: Client execution did not complete before the specified timeout configuration: 10000 millis
        at software.amazon.awssdk.core.exception.ApiCallTimeoutException$BuilderImpl.build(ApiCallTimeoutException.java:97)
        at software.amazon.awssdk.core.exception.ApiCallTimeoutException.create(ApiCallTimeoutException.java:38)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.generateApiCallTimeoutException(ApiCallTimeoutTrackingStage.java:151)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.handleInterruptedException(ApiCallTimeoutTrackingStage.java:139)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.translatePipelineException(ApiCallTimeoutTrackingStage.java:107)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:62)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
        at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
        at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
        at software.amazon.awssdk.services.s3.DefaultS3Client.createSession(DefaultS3Client.java:1589)
        at software.amazon.awssdk.services.s3.S3Client.createSession(S3Client.java:2505)
        at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.getCredentials(S3ExpressIdentityCache.java:88)
        at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.lambda$getCachedCredentials$0(S3ExpressIdentityCache.java:73)
        at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.refreshResult(CachedS3ExpressCredentials.java:91)
        at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.lambda$new$0(CachedS3ExpressCredentials.java:70)
        at software.amazon.awssdk.utils.cache.CachedSupplier.lambda$jitteredPrefetchValueSupplier$8(CachedSupplier.java:300)
        at software.amazon.awssdk.utils.cache.NonBlocking.fetch(NonBlocking.java:151)
        at software.amazon.awssdk.utils.cache.CachedSupplier.refreshCache(CachedSupplier.java:208)
        at software.amazon.awssdk.utils.cache.CachedSupplier.get(CachedSupplier.java:135)
        at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.get(CachedS3ExpressCredentials.java:85)
        at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.get(S3ExpressIdentityCache.java:61)
        at software.amazon.awssdk.services.s3.internal.s3express.DefaultS3ExpressIdentityProvider.lambda$resolveIdentity$0(DefaultS3ExpressIdentityProvider.java:56)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
        at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
        at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
        at software.amazon.awssdk.services.s3.internal.s3express.DefaultS3ExpressIdentityProvider.resolveIdentity(DefaultS3ExpressIdentityProvider.java:49)
        at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.trySelectAuthScheme(S3AuthSchemeInterceptor.java:142)
        at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.selectAuthScheme(S3AuthSchemeInterceptor.java:81)
        at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.beforeExecution(S3AuthSchemeInterceptor.java:61)
        at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.lambda$beforeExecution$1(ExecutionInterceptorChain.java:62)
        at java.util.ArrayList.forEach(ArrayList.java:1259)
        at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.beforeExecution(ExecutionInterceptorChain.java:62)
        at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.runInitialInterceptors(AwsExecutionContextBuilder.java:239)
        at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.invokeInterceptorsAndCreateExecutionContext(AwsExecutionContextBuilder.java:130)
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.invokeInterceptorsAndCreateExecutionContext(AwsSyncClientHandler.java:67)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:76)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
        at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
        at software.amazon.awssdk.services.s3.DefaultS3Client.listObjectsV2(DefaultS3Client.java:7323)
        at software.amazon.awssdk.services.s3.DelegatingS3Client.lambda$listObjectsV2$63(DelegatingS3Client.java:5856)
        at software.amazon.awssdk.services.s3.internal.crossregion.S3CrossRegionSyncClient.invokeOperation(S3CrossRegionSyncClient.java:73)
        at software.amazon.awssdk.services.s3.DelegatingS3Client.listObjectsV2(DelegatingS3Client.java:5856)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listObjects$13(S3AFileSystem.java:2963)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
        ... 8 more
        Suppressed: java.lang.RuntimeException: Task failed.
                at software.amazon.awssdk.utils.CompletableFutureUtils.joinLikeSync(CompletableFutureUtils.java:254)
                at software.amazon.awssdk.auth.signer.AwsSignerExecutionAttribute.awsCredentialsReadMapping(AwsSignerExecutionAttribute.java:201)
                at software.amazon.awssdk.core.interceptor.ExecutionAttribute$DerivationValueStorage.get(ExecutionAttribute.java:260)
                at software.amazon.awssdk.core.interceptor.ExecutionAttributes.getAttribute(ExecutionAttributes.java:53)
                at software.amazon.awssdk.core.interceptor.ExecutionAttributes.getOptionalAttribute(ExecutionAttributes.java:68)
                at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.invokeInterceptorsAndCreateExecutionContext(AwsExecutionContextBuilder.java:144)
                ... 22 more

Reproduction Steps

  • configure more retries than can complete before the API call timeout (a configuration sketch follows this list)
  • try an operation which fails with a retryable exception (connectivity, etc.)
  • wait for the failure
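A minimal configuration sketch that sets up that mismatch, assuming the v2 ClientOverrideConfiguration/RetryPolicy APIs; the timeout and retry values are illustrative:

```java
import java.time.Duration;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.services.s3.S3Client;

// 10 retries with backoff cannot complete inside a 10s api-call timeout,
// so the call fails with ApiCallTimeoutException before retries are exhausted.
S3Client s3 = S3Client.builder()
        .overrideConfiguration(ClientOverrideConfiguration.builder()
                .apiCallTimeout(Duration.ofSeconds(10))
                .retryPolicy(RetryPolicy.builder().numRetries(10).build())
                .build())
        .build();
```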

Possible Solution

We really need that underlying exception for our own decision-making about what to do next. I fear we are going to have to change the S3A retry policies so that we have special handling for the first failure of any S3 operation, on the basis that it is a configuration problem that retries will not recover from. That will add the overhead of a needless S3 call.
Would it be possible to do something like save the innermost exception and add it as the root cause when throwing an ApiCallTimeoutException?
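One possible shape for that, sketched only and assuming the last attempt's failure is still in scope in the timeout-handling stage (the lastFailure and timeoutMillis variables are hypothetical):

```java
// Sketch: preserve the last attempt's failure as the cause of the timeout exception.
// SdkException builders already expose cause(Throwable), so something like:
throw ApiCallTimeoutException.builder()
        .message("Client execution did not complete before the specified timeout "
                + "configuration: " + timeoutMillis + " millis")
        .cause(lastFailure)   // hypothetical: the exception from the final retry attempt
        .build();
```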

Additional Information/Context

No response

AWS Java SDK version used

2.21.33

JDK version used

openjdk version "1.8.0_362" OpenJDK Runtime Environment (Zulu 8.68.0.21-CA-macos-aarch64) (build 1.8.0_362-b09) OpenJDK 64-Bit Server VM (Zulu 8.68.0.21-CA-macos-aarch64) (build 25.362-b09, mixed mode)

Operating System and version

macos 13.4.1

@steveloughran steveloughran added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Dec 4, 2023
@debora-ito debora-ito added the needs-review This issue or PR needs review from the team. label Dec 13, 2023
@debora-ito debora-ito added p0 This issue is the highest priority p2 This is a standard priority issue and removed p0 This issue is the highest priority needs-triage This issue or PR still needs to be triaged. labels Dec 27, 2023
@debora-ito
Member

@steveloughran acknowledged. We'll think of a way to surface the underlying exception.

@debora-ito debora-ito added p1 This is a high priority issue p2 This is a standard priority issue and removed needs-review This issue or PR needs review from the team. p2 This is a standard priority issue p1 This is a high priority issue labels Jan 9, 2024
@debora-ito
Member

@steveloughran after the change released in #4779, do you still see this issue?

The idea is that since the timeouts are propagated now, this should be mitigated.

@debora-ito debora-ito added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Jan 24, 2024
@steveloughran
Author

I'd still like the reason for internal retry failures to be passed up. For example:

- client retries on all IOEs
- endpoint is generating UnknownHostException
- which is retried
- timeout exception will say timeout; inner cause is lost

We deal with that in our code by having a very limited set of retries. If it weren't for the transfer manager, I'd set the SDK retry count to 0 and handle it all ourselves.
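For reference, disabling SDK-level retries so the application layer sees every failure directly is a one-liner; this is a sketch, assuming the RetryPolicy-based configuration:

```java
// Sketch: let the application (e.g. the S3A Invoker) own all retry decisions.
ClientOverrideConfiguration noSdkRetries = ClientOverrideConfiguration.builder()
        .retryPolicy(RetryPolicy.none())
        .build();
```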

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Jan 26, 2024
@james-s-w-clark

This could be really helpful for us. We hit the 60-second timeout (even for tiny files in S3), and still have the same issue with a 120-second timeout.

We don't know the root cause, and it seems like #4057: random failures that cause all future S3 interactions to fail.

I can't share a reproducible snippet unfortunately.

@steveloughran
Author

@IdiosApps if the HTTP connection is stale/dead and gets returned to the pool, you can get recurrent failures. Check your TTLs, and if your read() gets errors, abort() the connection.
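A sketch of both suggestions, assuming the Apache HTTP client is in use and an existing S3Client named s3; the TTL value, bucket, and key are placeholders:

```java
import java.io.IOException;
import java.time.Duration;
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

// Cap how long a pooled connection may be reused.
SdkHttpClient httpClient = ApacheHttpClient.builder()
        .connectionTimeToLive(Duration.ofSeconds(30))
        .build();

// On a read error, abort() so the broken connection is discarded, not returned to the pool.
ResponseInputStream<GetObjectResponse> in = s3.getObject(
        GetObjectRequest.builder().bucket("my-bucket").key("my-key").build());
try {
    // ... read from the stream ...
} catch (IOException e) {
    in.abort();
    throw e;
} finally {
    in.close();
}
```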

@pilardi

pilardi commented Apr 11, 2024

Not to say this is the root cause of the problem (there's probably still a bug in the code), but have you checked the throughput parameter in the client? It might be slowing things down by default.
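If the "throughput parameter" refers to the CRT-based client, this is where it is set; the value is illustrative, and whether it applies depends on which client is actually in use:

```java
import software.amazon.awssdk.services.s3.S3AsyncClient;

// Sketch: the CRT-based async client sizes its connection pool and parallelism
// from this throughput target.
S3AsyncClient s3Async = S3AsyncClient.crtBuilder()
        .targetThroughputInGbps(10.0)
        .build();
```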

@steveloughran
Author

Our problems weren't with the throughput manager; we saw this for unknown host exceptions, e.g. trying to use FIPS in a region without it.
