sftp upload request hangs in disconnect after callback-abort #3650

Closed
bagder opened this issue Mar 6, 2019 · 16 comments
@bagder
Member

bagder commented Mar 6, 2019

I did this

Ketul Barot reported this on the mailing list, Feb 14 2019: "sftp upload request hangs with a high latency server using multi interface"

It seems there is a large latency (>500 ms RTT) and packet loss connection over which data is uploaded. Implementing a custom timeout logic, it returns an error from the progress timeout.

If that timeout happens, it seems libssh2 can get stuck on the libssh2_session_disconnect call.
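For illustration only (this code is not from the report): the callback-driven timeout described above would look roughly like this minimal sketch, assuming the application uses libcurl's xferinfo progress callback to enforce its own deadline; the callback name and deadline variable are made up here.

#include <curl/curl.h>
#include <time.h>

/* Hypothetical progress callback: returning non-zero makes libcurl abort
   the transfer with CURLE_ABORTED_BY_CALLBACK, after which the application
   tears down its handles and curl ends up in libssh2_session_disconnect. */
static int xferinfo_cb(void *clientp, curl_off_t dltotal, curl_off_t dlnow,
                       curl_off_t ultotal, curl_off_t ulnow)
{
  time_t deadline = *(time_t *)clientp;   /* deadline chosen by the app */
  (void)dltotal; (void)dlnow; (void)ultotal; (void)ulnow;
  return (time(NULL) > deadline) ? 1 : 0; /* non-zero == abort */
}

/* typical setup on the easy handle ('easy' and 'deadline' exist elsewhere): */
curl_easy_setopt(easy, CURLOPT_XFERINFOFUNCTION, xferinfo_cb);
curl_easy_setopt(easy, CURLOPT_XFERINFODATA, &deadline);
curl_easy_setopt(easy, CURLOPT_NOPROGRESS, 0L);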

I expected the following

It should disconnect the transfer just fine.

curl/libcurl version

Libcurl version: 7.61.1
curl 7.61.1 (arm-none-linux-gnueabi) libcurl/7.61.1 GnuTLS/3.5.13 zlib/1.2.8
libssh2/1.7.0
Release-Date: 2018-09-05
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp scp sftp smb
smbs smtp smtps telnet tftp
Features: AsynchDNS Debug TrackMemory IPv6 Largefile NTLM NTLM_WB SSL libz UnixSockets
HTTPS-proxy
Operating system: Linux (OpenWRT)

operating system

Probably independent

@bagder bagder added the SCP/SFTP label Mar 6, 2019
@ketulbarot

The description looks good to me but we can edit it a bit:

It seems there is a large latency (>500 ms RTT) and packet loss connection over which data is
uploaded. Implementing a custom timeout logic, it returns an error from the progress timeout.

There is a connection with large latency (500 ms RTT) and packet loss (30%) over which data is uploaded. We implemented the timeout logic using the libcurl progress callback. Once the timeout is hit, we clean up the curl handles in the following way:

curl_multi_remove_handle
curl_easy_cleanup
curl_multi_cleanup (If this was the last easy handle)

and the hang happens in curl_multi_remove_handle. There, it seems libssh2 can get stuck in the libssh2_session_disconnect call.
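For clarity, a minimal sketch of that teardown order (the variable names are illustrative; per the report above, the hang shows up inside curl_multi_remove_handle, which is where curl ends up calling libssh2_session_disconnect):

/* after the progress callback has aborted the transfer
   (CURLE_ABORTED_BY_CALLBACK): */
curl_multi_remove_handle(multi, easy);   /* hang reported here */
curl_easy_cleanup(easy);
if(last_easy_handle)                     /* if this was the last easy handle */
  curl_multi_cleanup(multi);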

@ketulbarot

Answering your question asked over email:

I think that sounds as if we can leave libssh2 in a bad state so when we
timeout and call another libssh2 function to cleanup, something from the
previous situation somehow still lingers.

Yes, that sums up the situation properly.

When this happens, as it is during an SFTP upload, do you think
libssh2_sftp_write() was the last libssh2 function call before the timeout
triggered (it is called from the lib/ssh.c:sftp_send() function)? If so, can
you log the return code from that and see if you can detect a pattern in the
hang and what the last return code from that is?

I believe this happens before the actual upload starts. From the debugging so far, what I have understood is that due to the latency and packet loss it takes a lot of time to just establish an SSH session, exchange the auth keys and then begin the actual transfer of the data (payload). This initial SSH setup sometimes completes quickly and sometimes takes > 5 minutes. So we happen to hit the progress timeout and call libssh2_session_disconnect, which doesn't complete because libssh2 is still stuck in the initial setup.

I don't remember if you told us, but you are using a recent libssh2 version,
right? Have you checked the version from the libssh2 git repo?

The libssh2 version I am using is 1.7.0; the latest available is 1.8.0.

I am not sure if this explains the problem fully, but I am uploading the log here, which might shed some more light on the issue.
libssh2_log.txt

@bagder
Member Author

bagder commented Mar 7, 2019

This log seems to pretty accurately pinpoint either a bad use of libssh2 or a bug in libssh2. I'm looking at the lines that keep repeating at the end:

Mar  4 21:29:13 UniTask: * Entering libssh2_session_disconnect
Mar  4 21:29:13 UniTask: [libssh2] 1692.183196 Transport: Redirecting into the key re-exchange from _libssh2_transport_send
Mar  4 21:29:14 UniTask: * Entering libssh2_session_disconnect
Mar  4 21:29:14 UniTask: [libssh2] 1693.183388 Transport: Redirecting into the key re-exchange from _libssh2_transport_send

They come from this code in libssh2:

https://github.com/libssh2/libssh2/blob/3f927a8439948fcd698dd7efcdd9a1e4c11f51b3/src/transport.c#L716

The comment there says:

/* Don't write any new packets if we're still in the middle of a key exchange. */

... which of course is crazy, since we've given up on the key exchange and just want to shut down and now we're stuck! That session->state should probably be cleared in the session disconnect function...

Can you try this simple libssh2 patch?

diff --git a/src/session.c b/src/session.c
index 1aee429..aaa7103 100644
--- a/src/session.c
+++ b/src/session.c
@@ -1154,11 +1154,11 @@ session_disconnect(LIBSSH2_SESSION *session, int reason,
 LIBSSH2_API int
 libssh2_session_disconnect_ex(LIBSSH2_SESSION *session, int reason,
                               const char *desc, const char *lang)
 {
     int rc;
-
+    session->state = 0;
     BLOCK_ADJUST(rc, session,
                  session_disconnect(session, reason, desc, lang));
 
     return rc;
 }

@ketulbarot

Tried this patch. The hang frequency has drastically reduced; now I am seeing a hang once every ~3 hours. But I am still seeing the hang sometimes and am not exactly sure what the reason is, although the hang is still at curl_multi_remove_handle. I need some more time to debug it. I will share more details soon.

@bagder
Member Author

bagder commented Mar 7, 2019

The hang frequency has drastically reduced.

Cool, a step in the right direction at least. But what kind of (libssh2) log output do you get now when it hangs? Surely it's not the same repeated logs as before?

bagder added a commit to libssh2/libssh2 that referenced this issue Mar 7, 2019
If authentication is started but not completed before the application
gives up and instead wants to shut down the session, the '->state' field
might still be set and thus effectively dead-lock session_disconnect.

This happens because both _libssh2_transport_send() and
_libssh2_transport_read() refuse to do anything as long as state is set
without the LIBSSH2_STATE_KEX_ACTIVE bit.

Reported in curl bug curl/curl#3650
@ketulbarot

But what kind of (libssh2) log output do you get now when it hangs?

Since this happens randomly now, I am not able to capture the logs at the particular instant when it hangs. I am looking for a way to trigger it and capture the libssh2/libcurl logs.

Surely it's not the same repeated logs as before?

I am sure it is not the same repeated logs. Will update as soon as I have something.

@ketulbarot

Hi,

I saw a hang for some time, around 10-12 minutes, and then everything went back to normal. Sharing the log:

39281:Mar 13 17:53:00 UniTask: * Callback aborted
39282:Mar 13 17:53:00 UniTask: * Marked for [closure]: Aborted by callback
39283:Mar 13 17:53:00 UniTask: * STATE: PROTOCONNECT => DONE handle 0x1c3c3a0; line 2128 (connection #0)
39284:Mar 13 17:53:00 UniTask: * multi_done
39285:Mar 13 17:53:00 UniTask: * SSH DISCONNECT starts now
39286:Mar 13 17:53:00 UniTask: * SFTP 0x1c4b708 state change from SSH_SFTP_INIT to SSH_SFTP_SHUTDOWN
39287:Mar 13 17:53:00 UniTask: * SFTP 0x1c4b708 state change from SSH_SFTP_SHUTDOWN to SSH_SESSION_DISCONNECT
39288:Mar 13 17:53:00 UniTask: [libssh2] 12007.882157 Transport: Disconnecting: reason=11, desc=Shutdown, lang=
39289:Mar 13 17:53:00 UniTask: => libssh2_transport_write plain (21 bytes)
39292:Mar 13 17:53:00 UniTask: => libssh2_transport_write plain2 (0 bytes)
39293:Mar 13 17:53:00 UniTask: [libssh2] 12007.882553 Socket: Sent 32/32 bytes at 0x1c2e504
39294:Mar 13 17:53:00 UniTask: => libssh2_transport_write send() (32 bytes)
39297:Mar 13 17:53:00 UniTask: * SFTP 0x1c4b708 state change from SSH_SESSION_DISCONNECT to SSH_SESSION_FREE
39298:Mar 13 17:53:00 UniTask: [libssh2] 12007.882692 Transport: Freeing session resource
39299:Mar 13 17:53:00 UniTask: [libssh2] 12007.882706 Conn: Freeing channel 0/0 resources
39300:Mar 13 17:53:00 UniTask: [libssh2] 12007.882718 Conn: Sending EOF on channel 0/0
39301:Mar 13 17:53:00 UniTask: => libssh2_transport_write plain (5 bytes)
39303:Mar 13 17:53:00 UniTask: [libssh2] 12007.883090 Socket: Sent 16/16 bytes at 0x1c2e504
39304:Mar 13 17:53:00 UniTask: => libssh2_transport_write send() (16 bytes)
39306:Mar 13 17:53:00 UniTask: [libssh2] 12007.883135 Conn: Closing channel 0/0
39307:Mar 13 17:53:00 UniTask: => libssh2_transport_write plain (5 bytes)
39309:Mar 13 17:53:00 UniTask: [libssh2] 12007.883383 Socket: Sent 16/16 bytes at 0x1c2e504
39310:Mar 13 17:53:00 UniTask: => libssh2_transport_write send() (16 bytes)

It got stuck here for a good ~12 minutes and then recovered by itself. I have not yet caught the hang exactly, but this might shed some light on where it could be hanging.

bagder added a commit to libssh2/libssh2 that referenced this issue Mar 15, 2019

Closes #310
@ketulbarot

This hang happens only when there is a state change from SSH_SFTP_INIT to SSH_SFTP_SHUTDOWN (see the log above) in case of an abort. At SSH_SFTP_INIT all the authentication has been done and we have established a valid session. After SSH_SFTP_SHUTDOWN comes SSH_SESSION_DISCONNECT and then SSH_SESSION_FREE, and it is there that it hangs. In the session_free call it tries to close the channels associated with the transfer, where it gets stuck in this while loop:

if (channel->close_state == libssh2_NB_state_sent) {
    /* We must wait for the remote SSH_MSG_CHANNEL_CLOSE message */

    while (!channel->remote.close && !rc &&
           (session->socket_state != LIBSSH2_SOCKET_DISCONNECTED))
        rc = _libssh2_transport_read(session);
}

This is in the src/channel.c file; the function is _libssh2_channel_close, which gets called from _libssh2_channel_free, which in turn is called by libssh2_session_free. Here it expects to read from the socket and just waits there (in _libssh2_transport_read).
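One way to narrow this down further would be to log what _libssh2_transport_read() returns on every iteration of that loop, for example with a debugging sketch like this against libssh2's src/channel.c (this is only a diagnostic aid, not a fix, and assumes libssh2 is rebuilt with it):

if (channel->close_state == libssh2_NB_state_sent) {
    /* We must wait for the remote SSH_MSG_CHANNEL_CLOSE message */

    while (!channel->remote.close && !rc &&
           (session->socket_state != LIBSSH2_SOCKET_DISCONNECTED)) {
        rc = _libssh2_transport_read(session);
        /* diagnostic: shows whether the loop spins on rc == 0 or blocks
           inside the read waiting for the peer's close message */
        fprintf(stderr, "channel close: _libssh2_transport_read rc=%d\n", rc);
    }
}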

The log in the comment above is from the hang. Below is the log from when there is no hang and everything is cleaned up fine:

Mar 13 14:58:05 UniTask: * SFTP 0x1c3a9e0 state change from SSH_SFTP_INIT to SSH_SFTP_SHUTDOWN
Mar 13 14:58:05 UniTask: * SFTP 0x1c3a9e0 state change from SSH_SFTP_SHUTDOWN to SSH_SESSION_DISCONNECT
Mar 13 14:58:05 UniTask: [libssh2] 1512.319901 Transport: Disconnecting: reason=11, desc=Shutdown, lang=
Mar 13 14:58:05 UniTask: => libssh2_transport_write plain (21 bytes)
Mar 13 14:58:05 UniTask: 0000: 01 00 00 00 0B 00 00 00 08 53 68 75 74 64 6F 77 : .........Shutdow
Mar 13 14:58:05 UniTask: 0010: 6E 00 00 00 00 : n....
Mar 13 14:58:05 UniTask: => libssh2_transport_write plain2 (0 bytes)
Mar 13 14:58:05 UniTask: [libssh2] 1512.320122 Socket: Sent 32/32 bytes at 0x1c2e504
Mar 13 14:58:05 UniTask: => libssh2_transport_write send() (32 bytes)
Mar 13 14:58:05 UniTask: 0000: 00 00 00 1C 06 01 00 00 00 0B 00 00 00 08 53 68 : ..............Sh
Mar 13 14:58:05 UniTask: 0010: 75 74 64 6F 77 6E 00 00 00 00 87 E5 2C D7 55 52 : utdown......,.UR
Mar 13 14:58:05 UniTask: * SFTP 0x1c3a9e0 state change from SSH_SESSION_DISCONNECT to SSH_SESSION_FREE
Mar 13 14:58:05 UniTask: [libssh2] 1512.321060 Transport: Freeing session resource
Mar 13 14:58:05 UniTask: [libssh2] 1512.321077 Conn: Freeing channel 0/0 resources
Mar 13 14:58:05 UniTask: [libssh2] 1512.321088 Conn: Sending EOF on channel 0/0
Mar 13 14:58:05 UniTask: => libssh2_transport_write plain (5 bytes)
Mar 13 14:58:05 UniTask: 0000: 60 00 00 00 00 : `....
Mar 13 14:58:05 UniTask: [libssh2] 1512.321549 Socket: Sent 16/16 bytes at 0x1c2e504
Mar 13 14:58:05 UniTask: => libssh2_transport_write send() (16 bytes)
Mar 13 14:58:05 UniTask: 0000: 00 00 00 0C 06 60 00 00 00 00 85 3C 6E F5 5D 27 : .....`.....<n.]'
Mar 13 14:58:05 UniTask: [libssh2] 1512.321613 Conn: Closing channel 0/0
Mar 13 14:58:05 UniTask: => libssh2_transport_write plain (5 bytes)
Mar 13 14:58:05 UniTask: 0000: 61 00 00 00 00 : a....
Mar 13 14:58:05 UniTask: [libssh2] 1512.321672 Socket: Sent 16/16 bytes at 0x1c2e504
Mar 13 14:58:05 UniTask: => libssh2_transport_write send() (16 bytes)
Mar 13 14:58:05 UniTask: 0000: 00 00 00 0C 06 61 00 00 00 00 5C 72 F9 B6 9D 50 : .....a....\r...P
Mar 13 14:58:44 UniTask: [libssh2] 1551.455306 Socket: Error recving 16384 bytes (got 104)
Mar 13 14:58:44 UniTask: [libssh2] 1551.455347 Transport: Looking for packet of type: 94
Mar 13 14:58:44 UniTask: [libssh2] 1551.455361 Transport: Looking for packet of type: 95
Mar 13 14:58:44 UniTask: [libssh2] 1551.455382 Transport: packet left with id 53
Mar 13 14:58:44 UniTask: [libssh2] 1551.455393 Transport: Extra packets left 1
Mar 13 14:58:44 UniTask: * SFTP 0x1c3a9e0 state change from SSH_SESSION_FREE to SSH_STOP
Mar 13 14:58:44 UniTask: * SSH DISCONNECT is done
Mar 13 14:58:44 UniTask: * Closing connection 0
Mar 13 14:58:44 UniTask: * The cache now contains 0 members

@bagder
Member Author

bagder commented Mar 18, 2019

More problems that are not curl bugs but libssh2 bugs. This issue doesn't belong here.

In the session_free call it tries to close the channels associated with the transfer, where it gets stuck in this while loop

If it gets stuck there it must mean that _libssh2_transport_read() keeps returning 0, right? How can it do that if there's no data to return? Or if there is data, why does it get stuck?

@Sammydov30

You can register a shutdown function. This will be called when the script finishes, when the user aborts or when you call exit() or die(). http://php.net/manual/en/function.register-shutdown-function.php

@bagder
Member Author

bagder commented Mar 20, 2019

@Sammydov30 you speak PHP, we don't.

@ketulbarot

ketulbarot commented Mar 20, 2019

More problems that are not curl bugs but libssh2 bugs. This issue doesn't belong here.

I agree with you. I have raised this issue on the libssh2 forum a while back but no one is responding to it.

If it gets stuck there it must mean that _libssh2_transport_read() keeps returning 0, right? How can it
do that if there's no data to return? Or if there is data, why does it get stuck?

I am not sure how or why it is getting stuck there, but from the logs that seems to be the only place where there is a while loop.

@bagder
Member Author

bagder commented Mar 20, 2019

I have raised this issue on the libssh2 forum a while back but no one is responding to it.

While that is sad and unfortunate, it is still not a good enough reason to discuss libssh2 bugs in the curl issue tracker. curl seems to do the right thing and the problem looks like it is within libssh2. I must insist that you continue this in the libssh2 project.

@ketulbarot

I must insist that you continue this in the libssh2 project.

I totally understand. Is there a way we can move this existing issue to the libssh2 project? Since @bagder opened this issue, maybe you can do it. Otherwise it will be hard to properly carry over all the data points from this issue to a new issue under the libssh2 project.

Thanks for helping out.

@jay
Member

jay commented Mar 21, 2019

Issue transfer requirements are pretty strict: you need to be an admin on both repos, so it would only be possible if @bagder is an admin there as well and agrees. The meat of this seems to be #3650 (comment), so if it doesn't work out and you have to manually post to the libssh2 tracker, I suggest you stick with that and then reference this issue and the one you previously made through a different channel with libssh2.

@bagder
Member Author

bagder commented May 12, 2019

This is a libssh2 bug, closing here.

@bagder bagder closed this as completed May 12, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Aug 10, 2019