Either accidentally or maliciously, a node can be made to run out of file descriptors on a Unix-type system in any of the following ways (a minimal reproduction sketch follows the list):

- Create a cache or caches whose number of partitions exceeds the number of remaining file descriptors (native persistence has to be on).
- Keep opening socket connections to the server (no SSL certificate required) without ever closing them.
- Use a commercial piece of software such as 3DNS that periodically polls the Ignite discovery port to check the port's liveness; this appears to cause Ignite to leak open files.
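For the second case, a hypothetical reproducer can be as small as the loop below; the hostname is a placeholder, and 47500 is the default TcpDiscoverySpi port:

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Opens TCP connections to a node's discovery port and never closes them, so
// the server side accumulates open file descriptors until the limit is hit.
public class FdLeakReproducer {
    public static void main(String[] args) throws Exception {
        List<Socket> sockets = new ArrayList<>(); // hold references so nothing gets closed
        while (true) {
            Socket s = new Socket();
            s.connect(new InetSocketAddress("ignite-host", 47500), 5_000); // placeholder host
            sockets.add(s); // intentionally never closed
        }
    }
}
```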
Any of the above will ultimately cause the node to run out of file descriptors. A message send to another node then fails because no new socket connection can be opened, yet the node waits indefinitely for a reply that will never be received because the original message was never sent.
ServerImpl -> SocketReader -> body() calls unmarshal(), which ultimately attempts to read from a socket that has no socket timeout set.
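A minimal sketch of the behaviour the fix would need, not Ignite's actual code; readWithTimeout and the byte-buffer read are stand-ins for the read inside unmarshal():

```java
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Give the socket a read timeout before the blocking read so the reader
// thread wakes up with an exception instead of waiting forever.
public class BoundedRead {
    static int readWithTimeout(Socket sock, int timeoutMs, byte[] buf) throws Exception {
        sock.setSoTimeout(timeoutMs); // blocking reads now throw after timeoutMs
        InputStream in = sock.getInputStream();
        try {
            return in.read(buf); // stands in for the read performed by unmarshal()
        }
        catch (SocketTimeoutException e) {
            sock.close(); // abort the exchange instead of stalling the thread
            throw e;
        }
    }
}
```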
A handshake timeout is eventually triggered by the system timer thread, which attempts to close the socket to break the stalemate. The socket close path invokes GridNioSslFilter -> onSessionClose(), which tries to acquire the sslHandler lock, but the lock is already held by the socket reader or another related thread; the result is a deadlock.
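A simplified model of the deadlock; the names are illustrative, not Ignite's actual fields:

```java
import java.util.concurrent.locks.ReentrantLock;

// Thread A (the reader) holds the SSL handler lock across a blocking read;
// thread B (the system timer) needs the same lock to close the session, so it
// blocks forever and stops updating its heartbeat.
public class SslCloseDeadlockModel {
    private static final ReentrantLock sslHandlerLock = new ReentrantLock();

    static void readerThread() throws InterruptedException {
        sslHandlerLock.lock();
        try {
            Thread.sleep(Long.MAX_VALUE); // stands in for a socket read with no timeout
        }
        finally {
            sslHandlerLock.unlock(); // never reached
        }
    }

    static void timerThread() {
        sslHandlerLock.lock(); // models onSessionClose(): blocks here indefinitely
        try {
            // would shut down the SSL session under the lock
        }
        finally {
            sslHandlerLock.unlock();
        }
    }
}
```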
A separate watchdog thread notices that the system timer thread has stopped updating its heartbeat timestamp, reports "Blocked system-critical thread has been detected", and triggers the failure handler.
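For readers unfamiliar with the mechanism, the heartbeat check amounts to something like the following; this is a model of the idea, not Ignite's implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// The timer thread refreshes a timestamp on every loop iteration; the watchdog
// flags the thread as blocked once the timestamp goes stale.
public class HeartbeatWatchdogModel {
    private final AtomicLong lastHeartbeat = new AtomicLong(System.currentTimeMillis());

    void onTimerTick() { // called by the system timer thread on each iteration
        lastHeartbeat.set(System.currentTimeMillis());
    }

    boolean isBlocked(long thresholdMs) { // polled by the watchdog thread
        return System.currentTimeMillis() - lastHeartbeat.get() > thresholdMs;
    }
}
```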
If the failure handler is set to restart, the restart process tries to create a marker file but fails because there are no free file descriptors, and the restart stalls. The node is now in an invalid state, and if you try to stop the JVM with a SIGTERM, the shutdown hook handler deadlocks too.
If the failure handler is set to stop, it first attempts to cleanly close all existing connections. Eventually it tries to close the deadlocked connection, but before doing so GridNioSslFilter again tries to acquire the sslHandler lock, deadlocking the stop process as well.
Suggested fix(es): set a socket timeout before calling unmarshal(), and/or add a time limit in GridNioSslFilter when waiting to acquire the sslHandler lock (sketched below). Also update the restart process so it doesn't need to create a flag file.
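The second suggestion could look roughly like this; it is a sketch of the idea, assuming the sslHandler lock were a ReentrantLock, not a patch against Ignite's actual code:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Bound the wait for the SSL handler lock on the close path, so the timer
// thread can escalate instead of blocking indefinitely.
public class BoundedClose {
    static void closeSession(ReentrantLock sslHandlerLock, long timeoutMs) throws InterruptedException {
        if (!sslHandlerLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            // Could not get the lock in time: force-close the channel or hand
            // the problem to the failure handler rather than deadlock.
            throw new IllegalStateException("Timed out waiting for sslHandler lock");
        }
        try {
            // perform the normal SSL shutdown under the lock
        }
        finally {
            sslHandlerLock.unlock();
        }
    }
}
```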
@daniverltd, hi! Could you fix the JIRA ticket link, please?
Discovery and Communication are different protocols, and their threads should not lock each other. Can you clarify the steps so we can reproduce this problem?
The reproduction steps are the ones listed at the top of this issue: run the node out of file descriptors on Linux, either by creating caches whose partition count exceeds the remaining file descriptors (with native persistence on), by repeatedly opening socket connections to the server without closing them (no SSL certificate required), or by running a poller such as 3DNS against the Ignite discovery port, which appears to make Ignite leak open files.
Once descriptors are exhausted, a message send to another node fails because no new socket connection can be opened, yet the node waits indefinitely for a reply that will never arrive because the original message was never sent. This then triggers the handshake timeout, the blocked-system-thread detection, and finally the failure of the restart handler.
Killing the node with a SIGTERM causes it to log "shutdown hook invoked" and then nothing else; it never exits, and only a SIGKILL will break the deadlock.
I've documented this issue in https://issues.apache.org/jira/browse/IGNITE-20940
SSL has to be enabled to trigger this deadlock.
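For completeness, enabling SSL on the node is just the standard keystore configuration; the paths and passwords below are placeholders, not from my setup:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.ssl.SslContextFactory;

// Start a node with SSL enabled for the discovery/communication layer.
public class SslNode {
    public static void main(String[] args) {
        SslContextFactory ssl = new SslContextFactory();
        ssl.setKeyStoreFilePath("/path/to/keystore.jks");      // placeholder path
        ssl.setKeyStorePassword("changeit".toCharArray());     // placeholder password
        ssl.setTrustStoreFilePath("/path/to/truststore.jks");  // placeholder path
        ssl.setTrustStorePassword("changeit".toCharArray());   // placeholder password

        Ignition.start(new IgniteConfiguration().setSslContextFactory(ssl));
    }
}
```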