ApiCallTimeoutException fails to include exception of last failure; loses root cause and so breaks recovery logic of applications. #4738
Comments
@steveloughran acknowledged. We'll think of a way to surface the underlying exception.
@steveloughran after the change released in #4779, do you still see this issue? The idea is that since the timeouts are propagated now, this should be mitigated.
I'd still like the reason for internal retry failures to be passed up. For example, we deal with that in our code by having a very limited set of retries. If it weren't for the transfer manager, I'd set the SDK retry count to 0 and handle it all ourselves.
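As a sketch of the workaround described in the comment above (turning off the SDK's internal retries so the application sees every failure directly), assuming the standard v2 client-builder APIs; the class name and timeout value are illustrative, not prescriptive:

```java
import java.time.Duration;

import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.services.s3.S3Client;

public class NoRetryS3 {
    public static S3Client build() {
        return S3Client.builder()
                .overrideConfiguration(ClientOverrideConfiguration.builder()
                        // Let the application's own retry logic see every
                        // failure instead of the SDK retrying internally.
                        .retryPolicy(RetryPolicy.none())
                        // The overall deadline still applies; with retries
                        // off, the first failure surfaces before it expires.
                        .apiCallTimeout(Duration.ofSeconds(60))
                        .build())
                .build();
    }
}
```

With retries disabled, the root cause reaches the caller as the thrown exception itself rather than being lost inside the timeout. As the comment notes, the transfer manager's own internal calls make this an incomplete fix.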
This could be really helpful for us. We hit the 60-second timeout (even for tiny files in S3), and still have the same issue with a 120-second timeout. We don't know the root cause, and it seems like #4057: random failures that cause all future S3 interactions to fail. I can't share a reproducible snippet, unfortunately.
@IdiosApps if the HTTP connection is stale/dead and gets returned to the pool, you can get recurrent failures. Check your TTLs, and if your read() gets errors, abort() the connection.
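A sketch of the pooling advice above, assuming the Apache-based HTTP client from the v2 SDK; the durations are illustrative placeholders to tune for your environment, not recommended values:

```java
import java.time.Duration;

import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.s3.S3Client;

public class PooledS3 {
    public static S3Client build() {
        return S3Client.builder()
                .httpClientBuilder(ApacheHttpClient.builder()
                        // Retire pooled connections before a server or
                        // middlebox silently drops them.
                        .connectionTimeToLive(Duration.ofMinutes(1))
                        // Evict idle connections rather than reusing ones
                        // that may already be dead.
                        .connectionMaxIdleTime(Duration.ofSeconds(30)))
                .build();
    }
}
```

On a failed read(), calling abort() on the ResponseInputStream discards the underlying connection instead of returning it to the pool, which prevents the recurrent-failure pattern described above.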
Not to say this is the root cause of the problem (there's probably still a bug in the code), but have you checked the throughput parameter in the client? It might be slowing things down by default.
Our problems weren't with the throughput manager; we saw it for unknown host exceptions, e.g. trying to use FIPS in a region without it.
Describe the bug
We see this when the S3 client is trying to use S3Express CreateSession and is configured to do so many retries (10) that the call times out before the retry limit is reached. Rather than include the underlying exception triggering the retries, a simpler "call timed out" exception is raised, with only the suppressed exception "java.lang.RuntimeException: Task failed."
I believe this is a regression from the v1 SDK.
Expected Behavior
ApiCallTimeoutException should include the exception that triggered the retries internally.
Current Behavior
See HADOOP-19000 for this surfacing connecting to S3Express buckets.
Stack trace on timeouts contains no root cause information.
Reproduction Steps
Possible Solution
We really need that underlying exception for our own decision-making about what to do next. I fear we are going to have to change the S3A retry policies to add special handling for the first failure of any S3 operation, on the basis that it is a configuration problem that retries will not recover from. That will add the overhead of a needless S3 call.
Would it be possible to save the innermost exception and attach it as the root cause when throwing an ApiCallTimeoutException?
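Until the SDK does this, an application-side sketch of the recovery logic is possible: walk the cause chain, falling back to suppressed exceptions, to dig out the innermost failure. The class and method names here are hypothetical, not SDK API:

```java
// Application-side sketch: recover the innermost failure from an exception
// whose real cause is hidden behind suppressed exceptions, as happens when
// an ApiCallTimeoutException carries only a suppressed
// "java.lang.RuntimeException: Task failed." wrapper.
public class RootCauses {

    private static final int MAX_DEPTH = 50; // guard against cause cycles

    /** Follow getCause(), falling back to the first suppressed exception
     *  when a link carries no cause, and return the deepest throwable. */
    public static Throwable findRootCause(Throwable t) {
        Throwable current = t;
        for (int depth = 0; depth < MAX_DEPTH; depth++) {
            Throwable next = current.getCause();
            if (next == null) {
                Throwable[] suppressed = current.getSuppressed();
                next = suppressed.length > 0 ? suppressed[0] : null;
            }
            if (next == null || next == current) {
                return current;
            }
            current = next;
        }
        return current;
    }
}
```

Retry policies could then branch on the recovered type (e.g. treat an UnknownHostException as unrecoverable) instead of on the opaque timeout, which only works if the SDK actually attaches the inner failure somewhere on the chain.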
Additional Information/Context
No response
AWS Java SDK version used
2.21.33
JDK version used
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (Zulu 8.68.0.21-CA-macos-aarch64) (build 1.8.0_362-b09)
OpenJDK 64-Bit Server VM (Zulu 8.68.0.21-CA-macos-aarch64) (build 25.362-b09, mixed mode)
Operating System and version
macOS 13.4.1