I acknowledge the upcoming end-of-support for AWS SDK for Java v1 was announced, and migration to AWS SDK for Java v2 is recommended.
Describe the bug
Hello,
We have an intermittent problem in our environment that causes S3 PUT operations (HTTP PUT) to stall for 17 to 19 minutes. Let me try to describe this in detail.
First, the environment details. We are running OSS Spark and Hadoop on EKS with Karpenter.
JDK version : 11.0.19
Spark Version: 3.4.1
Hadoop Version: 3.3.4
EKS Version: 1.26
Hudi Version: 0.14.x
OS: Verified on both Bottlerocket & AL2
Issue Details:
Occasionally, a Spark stage and a few of its tasks stall for about 17 minutes; the delay is consistent whenever it happens. We traced this to a socket write that stalls during a close() inside the AWS SDK, which uses Apache HTTP Client. When a TLS connection goes bad, we expect the underlying socket to be terminated eagerly so the request can be retried, but that does not happen. Instead, the socket is left open until the OS terminates it. This seems to be related to the socket linger option (SO_LINGER), which is disabled by default in the JDK (getSoLinger() returns -1). Setting linger to 0 makes close() discard a bad connection immediately, but neither the AWS SDK nor Apache HTTP Client sets this option to change the JDK default.
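For readers unfamiliar with the linger option, here is a minimal sketch of the default described above, using a plain java.net.Socket over loopback. The class and method names (LingerDemo, lingerBeforeAndAfter) are made up for illustration and are not part of the AWS SDK or Apache HTTP Client; the sketch only shows what the JDK reports, not a fix.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class: inspects SO_LINGER on a loopback connection.
class LingerDemo {
    /** Returns {default SO_LINGER, SO_LINGER after setSoLinger(true, 0)}. */
    static int[] lingerBeforeAndAfter() throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            int before = client.getSoLinger(); // -1 means SO_LINGER is disabled
            client.setSoLinger(true, 0);       // linger enabled with a zero timeout
            int after = client.getSoLinger();
            return new int[] {before, after};
        }
    }

    public static void main(String[] args) throws IOException {
        int[] values = lingerBeforeAndAfter();
        System.out.println("default SO_LINGER = " + values[0]
                + ", after setSoLinger(true, 0) = " + values[1]);
    }
}
```

With linger enabled and a zero timeout, close() on a plain TCP socket typically discards unsent data and resets the connection rather than waiting for it to drain.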
Attached are DEBUG-level logs for the AWS SDK, Hadoop S3A, and Apache HTTP Client, captured when the issue occurred; they show slightly different errors.
We tried forking the AWS SDK, adding a LINGER option with a default of 0 in here and applying it to the SSL socket options here. That did not fix the issue, possibly because of how this JDK version treats the socket options.
Expected Behavior
The socket file descriptor should close non-gracefully/"prematurely", forcing the write to terminate immediately.
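As a rough illustration of this expected behavior, the sketch below uses a plain java.net.Socket (not an SSLSocket, so it does not reproduce the bug): a writer thread blocks once the socket buffers fill up, and a close() from another thread ends the write with an IOException. The names CloseUnblocksWrite and closeWhileWriting are hypothetical, and the 500 ms pause is an arbitrary assumption.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class: close() from another thread ends a blocked write.
class CloseUnblocksWrite {
    /** Returns the exception that ended the writer, or null if it never failed. */
    static IOException closeWhileWriting() throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) { // the peer never reads, so buffers fill up
            AtomicReference<IOException> failure = new AtomicReference<>();
            Thread writer = new Thread(() -> {
                byte[] chunk = new byte[64 * 1024];
                try {
                    OutputStream out = client.getOutputStream();
                    while (true) {
                        out.write(chunk); // blocks once the socket buffers are full
                    }
                } catch (IOException e) {
                    failure.set(e); // raised once the socket is closed
                }
            });
            writer.setDaemon(true);
            writer.start();
            Thread.sleep(500);   // give the writer time to fill the buffers
            client.close();      // expected to terminate the pending write
            writer.join(5_000);
            return failure.get();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("writer ended with: " + closeWhileWriting());
    }
}
```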
Current Behavior
close() blocks until the OS forces the socket closed at the transport layer, at which point the socket write fails.
Reproduction Steps
1. Establish a connection between two hosts/VMs. Have the client side perform sizable writes (enough to fill up the socket buffers etc.) while the server just reads and discards.
2. Introduce a null route on either side (or otherwise prevent transmission of TCP ACKs from the server to the client) to force the client to attempt retransmits.
3. Wait until you're stuck in a write() (check stack dumps), then call close() on the client-side socket.
Possible Solution
Assuming the JDK implements LINGER correctly, it would be good to let users set this option in ClientConfiguration.java, which feeds into the Apache HTTP Client settings.
Thank you for reporting the issue. I see that you have also opened a Support ticket for the same.
We will continue to provide support through the ticket you have opened. I will mark this as Closing-soon, to avoid duplicate efforts. Kindly send your questions through the Support case. Thanks.
Posting a recap of the issue and available workarounds for others encountering the same problem.
We investigated the reported issue of intermittent high request latency when uploading files to S3.
The root cause is that S3 began deploying TLS 1.3 this year, which triggered a known Java TLS 1.3 issue related to half-closed connections: if a connection is half-closed (outbound closed) on the server side while the client is still writing to the socket, the Java SDK is not able to detect and handle the half-closed connection properly, causing the request to hang.
Below are the options available as a workaround to this issue:
On TLSv1.3, set Java system property jdk.tls.acknowledgeCloseNotify to true OR
Force TLS 1.2, for example by setting the jdk.tls.client.protocols system property: jdk.tls.client.protocols=TLSv1.2 OR
Use AWS SDK for Java 2.x with non-Apache HTTP client.
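A minimal sketch of the first two workarounds applied programmatically. The property names and values come from the list above; the class and method names (TlsWorkaround, acknowledgeCloseNotify, forceTls12) are made up for illustration. This assumes the properties are set before the JVM opens its first TLS connection, since JSSE appears to read them when it initializes; passing them as -D flags on the JVM command line avoids that ordering concern.

```java
// Hypothetical helper class: applies one of the system-property workarounds.
class TlsWorkaround {
    /** Workaround 1: stay on TLS 1.3 and acknowledge close_notify. */
    static void acknowledgeCloseNotify() {
        System.setProperty("jdk.tls.acknowledgeCloseNotify", "true");
    }

    /** Workaround 2: restrict client connections to TLS 1.2. */
    static void forceTls12() {
        System.setProperty("jdk.tls.client.protocols", "TLSv1.2");
    }

    public static void main(String[] args) {
        // Call this before any TLS connection is opened (assumption, see above).
        acknowledgeCloseNotify();
        System.out.println("jdk.tls.acknowledgeCloseNotify = "
                + System.getProperty("jdk.tls.acknowledgeCloseNotify"));
    }
}
```

For Spark jobs like the one in this report, the equivalent flag (for example -Djdk.tls.acknowledgeCloseNotify=true) would normally go into spark.driver.extraJavaOptions and spark.executor.extraJavaOptions rather than into application code.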
The Java SDK team will work on a permanent fix for this issue in both the v1 and v2 SDKs.
Regards,
Chaitanya
bhoradc added the p1 label (This is a high priority issue) and removed the third-party label (This issue is related to third-party libraries or applications.) on Jun 7, 2024
After further investigation we have found this JDK bug : https://bugs.openjdk.org/browse/JDK-8241239. This perfectly describes and reproduces the issue we are having.
Reproduction Steps
As mentioned in https://bugs.openjdk.org/browse/JDK-8241239
Additional Information/Context
No response
AWS Java SDK version used
1.12.26
JDK version used
OpenJDK Runtime Environment Temurin-11.0.19+7
Operating System and version
BottleRocket & AL2
Logs & Other Attachments
aws+sdk+httpclient+debug.log