
Ignite SSL filter can cause internal node deadlock on inter-node communication #11271

Open
daniverltd opened this issue Mar 10, 2024 · 3 comments

Comments

@daniverltd

daniverltd commented Mar 10, 2024

I've documented this issue in https://issues.apache.org/jira/browse/IGNITE-20940

SSL has to be enabled to trigger this deadlock.
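
For reference, SSL is typically enabled on a node through an `SslContextFactory` in the `IgniteConfiguration`; a minimal sketch with placeholder keystore paths and passwords:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.ssl.SslContextFactory;

public class SslNodeStart {
    public static void main(String[] args) {
        // Placeholder keystore/truststore paths and passwords; adjust to your environment.
        SslContextFactory sslFactory = new SslContextFactory();
        sslFactory.setKeyStoreFilePath("/opt/ignite/config/node.jks");
        sslFactory.setKeyStorePassword("changeit".toCharArray());
        sslFactory.setTrustStoreFilePath("/opt/ignite/config/trust.jks");
        sslFactory.setTrustStorePassword("changeit".toCharArray());

        IgniteConfiguration cfg = new IgniteConfiguration();
        // Enabling SSL puts GridNioSslFilter into the communication pipeline,
        // which is the filter involved in the reported deadlock.
        cfg.setSslContextFactory(sslFactory);

        Ignition.start(cfg);
    }
}
```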

A node can be made to run out of file descriptors on a Unix-type system, either accidentally or maliciously, in several ways:

- create a cache or caches whose partition count exceeds the number of remaining file descriptors (native persistence has to be on; see the sketch below);
- repeatedly open socket connections to the server (no SSL certificate required) without ever closing them;
- use a commercial piece of software such as 3DNS that periodically polls the Ignite discovery port to check its liveness; this seems to cause Ignite to leak open files.
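
A rough sketch of the first variant, assuming a hypothetical cache name and an intentionally oversized partition count (the exact number only needs to exceed the remaining file-descriptor limit of the process):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ManyPartitionsRepro {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Native persistence must be on: each partition ends up with its own partition file.
        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
        cfg.setDataStorageConfiguration(storage);

        Ignite ignite = Ignition.start(cfg);
        ignite.cluster().state(ClusterState.ACTIVE);

        // Deliberately oversized partition count (hypothetical value) so that the
        // number of open partition files exceeds the process's file-descriptor limit.
        CacheConfiguration<Integer, Integer> cacheCfg = new CacheConfiguration<>("fdHog");
        cacheCfg.setAffinity(new RendezvousAffinityFunction(false, 65_000));

        ignite.getOrCreateCache(cacheCfg);
    }
}
```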

Doing any of the above ultimately causes the node to run out of file descriptors. A message send to another node then fails because a new socket connection cannot be opened, yet the node waits indefinitely for a reply that will never arrive because the original message was never sent.

ServerImpl -> SocketReader -> body() calls unmarshal(), which ultimately attempts to read from a socket that has no socket timeout set.

A handshake timeout is eventually triggered by the system timer thread, which attempts to close the socket to break the stalemate. Closing the socket invokes GridNioSslFilter -> onSessionClose(), which tries to acquire the sslHandler lock, but that lock is already held by the socket-read (or another related) thread; the result is a deadlock.
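
To make the interaction easier to follow, here is a deliberately simplified Java sketch of the pattern described above. It is not Ignite's actual code; the lock and method names only mirror the report:

```java
import java.io.InputStream;
import java.net.Socket;

// Simplified illustration of the reported lock-up: the reader thread holds the
// SSL handler lock while blocked on an untimed socket read, and the timer thread
// needs the same lock before it can close the session.
public class SslCloseDeadlockSketch {
    private final Object sslHandlerLock = new Object();

    void readerThread(Socket socket) throws Exception {
        synchronized (sslHandlerLock) {
            InputStream in = socket.getInputStream();
            // No socket timeout set: if the peer never answers, this blocks forever
            // while the lock is still held.
            in.read();
        }
    }

    void timerThreadOnHandshakeTimeout(Socket socket) throws Exception {
        // Analogue of GridNioSslFilter.onSessionClose(): it must take the same lock
        // to close the session, so it blocks behind the stuck reader indefinitely.
        synchronized (sslHandlerLock) {
            socket.close();
        }
    }
}
```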

A separate watchdog thread spots that the system timer thread has stopped updating its heartbeat time value, reports "Blocked system-critical thread has been detected", and triggers the failure handler.

If the failure handler is set to restart, the restart process tries to create a marker file but fails because there are no free file descriptors, and the restart stalls. The node is now in an invalid state. If you try to stop the JVM with SIGTERM, the shutdown hook handler deadlocks too.

If the failure handler is set to stop, it first attempts to cleanly close all existing connections; eventually it tries to close the deadlocked connection, but before doing so GridNioSslFilter again attempts to acquire the sslHandler lock, deadlocking the stop process too.
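
For reference, the two failure-handler modes discussed above are selected via `IgniteConfiguration.setFailureHandler`; a minimal sketch:

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.RestartProcessFailureHandler;
import org.apache.ignite.failure.StopNodeFailureHandler;

public class FailureHandlerChoices {
    static IgniteConfiguration restartOnFailure() {
        // "Restart" variant from the report: the restart path creates a marker file,
        // which fails once file descriptors are exhausted.
        return new IgniteConfiguration().setFailureHandler(new RestartProcessFailureHandler());
    }

    static IgniteConfiguration stopOnFailure() {
        // "Stop" variant: closing existing connections goes back through
        // GridNioSslFilter and blocks on the same sslHandler lock.
        return new IgniteConfiguration().setFailureHandler(new StopNodeFailureHandler());
    }
}
```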

Suggested fix(es): Add a socket timeout before calling unmarshal() and/or add a time limit in GridNioSslFilter when waiting to acquire the sslHandler lock. Update the restart process so it does not have to create a flag file.
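
As a rough illustration of what those fixes would look like in plain Java (the timeout values are arbitrary, and Ignite's real code uses its own NIO/session abstractions rather than raw sockets and ReentrantLock):

```java
import java.net.Socket;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the two suggested mitigations, not a patch against Ignite itself.
public class SuggestedFixSketch {
    private final ReentrantLock sslHandlerLock = new ReentrantLock();

    void readWithTimeout(Socket socket) throws Exception {
        // 1. Bound the blocking read that precedes unmarshalling, so the reader
        //    thread cannot hold resources forever when the peer never replies.
        socket.setSoTimeout(10_000);
        socket.getInputStream().read();
    }

    void closeWithBoundedWait(Socket socket) throws Exception {
        // 2. Bound the wait for the SSL handler lock on session close, so the
        //    closing thread can give up (or escalate) instead of deadlocking.
        if (sslHandlerLock.tryLock(10, TimeUnit.SECONDS)) {
            try {
                socket.close();
            } finally {
                sslHandlerLock.unlock();
            }
        } else {
            throw new IllegalStateException("Timed out waiting for sslHandler lock");
        }
    }
}
```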

@shishkovilja
Contributor

@daniverltd, hi! Could you fix the JIRA ticket link, please?

Discovery and Communication are different protocols, and their threads should not block each other. Could you clarify the steps needed to reproduce this problem?

@daniverltd
Author

Either accidentally or maliciously cause a node to run out of file descriptors on Linux by:

- creating a cache or caches whose partition count exceeds the number of remaining file descriptors (native persistence has to be on);
- repeatedly opening socket connections to the server (no SSL certificate required) without ever closing them (see the sketch below);
- using a commercial piece of software such as 3DNS that periodically polls the Ignite discovery port to check its liveness, which seems to cause Ignite to leak open files.
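
A throwaway client along these lines covers the "open connections without ever closing them" variant (the host name and delay are placeholders, and the default discovery port 47500 is assumed; run it as a separate process so the leaked descriptors accumulate on the server side):

```java
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Keep opening connections to the server's discovery port and never close them,
// until the server runs out of file descriptors.
public class ConnectionLeakRepro {
    public static void main(String[] args) throws Exception {
        List<Socket> leaked = new ArrayList<>();
        while (true) {
            leaked.add(new Socket("ignite-server.example.com", 47500));
            Thread.sleep(10);
        }
    }
}
```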

Doing any of the above ultimately causes a message send to another node to fail because a new socket connection cannot be opened, yet the node then waits indefinitely for a reply that will never arrive because the original message was never sent. This in turn triggers the handshake timeout, the blocked-system-thread detection, and finally the failure of the restart handler.

Killing the node with SIGTERM causes it to log "shutdown hook invoked" and then nothing else; it never exits. Only SIGKILL breaks the deadlock.

@daniverltd
Author

Oh, and the correct JIRA ticket link: https://issues.apache.org/jira/browse/IGNITE-20940
