Skip to content

reconnect FTP sessions after timeout-poisoned transfers#508

Merged
crazy-max merged 2 commits intomasterfrom
ftp-retr
Apr 17, 2026
Merged

reconnect FTP sessions after timeout-poisoned transfers#508
crazy-max merged 2 commits intomasterfrom
ftp-retr

Conversation

@crazy-max
Copy link
Copy Markdown
Owner

fixes #507

The FTP client now keeps the state needed to reconnect lazily after timeout-shaped transfer failures. Retrieve also stops ignoring resp.Close() failures, because that close path is where the trailing FTP control response timeout is surfaced by jlaffaye/ftp module.

#507 describes a real failure mode where one stalled transfer poisons the control connection and turns a short run into minutes of repeated failures. FTP has no reliable in-band resync after that state is corrupted, so the correct recovery path is to discard the connection and reconnect before the next operation.

@com6056
Copy link
Copy Markdown

com6056 commented Apr 15, 2026

Tested the branch against the same setup from #507 (no local patches beyond building from this branch, stock 5s timeout, DISABLEEPSV/DISABLEMLSD removed, LOG_LEVEL=debug + LOG_CALLER=true + FTPGRAB_SERVER_FTP_LOGTRACE=true). 4h20m of runtime, 222 completed runs.

Result

The multi-minute cascade is gone.

Metric Stock v7.12.0 This PR
Stalls ≥ 2 min ~32 / day 0
Timeout events ~22 / day ~49 / day (similar rate in absolute terms)
Affected-run duration 5m – 12m12s 17s – 50s
Next cron tick degraded for the rest of the run clean, fresh AUTH TLS / 234 AUTH TLS successful

Every one of the 9 timeout events observed followed the same shape: LIST times out on both the data port and control port (2 errors occurred: ... i/o timeout), QUIT is issued immediately, the remaining directories in that cron tick fail fast, and the next cron tick opens a brand-new session that completes in the usual ~1 minute. This is exactly the behavior #507 described as "expected."

Secondary observation

Within the same run that triggered the timeout, the first ensureConnected() reconnect attempt reliably fails with tls: first record does not look like a TLS handshake on my setup, which then surfaces as the error for every subsequent directory in that run (fast, sub-second each — hence the 17–50s total instead of just ~1s). The very next scheduled tick ~60s later does a clean AUTH TLS and succeeds normally.

Representative snippet:

22:29:06 DBG ftp.go:16 > 150 Here comes the directory listing.
22:29:16 ERR file.go:42 > Cannot read directory <dir-A> error="2 errors occurred:
  * read tcp <client-ip>:49152-><server-ip>:26961: i/o timeout
  * read tcp <client-ip>:60412-><server-ip>:21: i/o timeout
"
22:29:16 DBG ftp.go:16 > QUIT
22:29:16 ERR file.go:42 > Cannot read directory <dir-B> error="tls: first record does not look like a TLS handshake"
22:29:16 DBG ftp.go:16 > QUIT
22:29:16 ERR file.go:42 > Cannot read directory <dir-C> error="tls: first record does not look like a TLS handshake"
...
22:29:17 INF ftpgrab.go:102 > Finished duration=17.597s

22:30:00 DBG ftp.go:16 > AUTH TLS
22:30:00 DBG ftp.go:16 > 234 AUTH TLS successful
22:31:00 INF ftpgrab.go:102 > Finished duration=1m0.579s

My best guess is that the bouncer in front of the FTP server needs a moment to release state after the poisoned connection is torn down, and it's not fully ready to negotiate a fresh AUTH TLS immediately. It could also be that the TLS handshake on the newly-dialed connection is racing with trailing bytes the server is still draining from the previous session. This isn't covered by the existing fix because ensureConnected()'s dial/login error isn't treated as a timeout by handleTransferError(), so c.ftp stays nil and subsequent ReadDir() calls just keep retrying the same failing reconnect within the run.

Not a blocker from my side — total wasted time went from hours/day to tens of seconds/day — but wanted to flag it in case it's worth letting the initial reconnect attempt happen after a small delay, or not attempting reconnect again within the same run after it fails once. Happy to test any variant you'd like.

Thanks for the quick turnaround on this one.

@crazy-max
Copy link
Copy Markdown
Owner Author

@com6056 Thanks a lot for this report!

I pushed a follow-up for the second issue you found.

After a timeout poisons the session and ftpgrab closes it, the client now allows one reconnect attempt in the same run. If that reconnect fails, for example with tls: first record does not look like a TLS handshake, it caches that reconnect error and returns it for the rest of the run instead of retrying the same failing handshake on every later directory.

I chose that over adding an arbitrary sleep before reconnect. The expensive part here was the repeated reconnect thrash, so this change removes that behavior without guessing at a delay that may or may not help on other servers.

If you get a chance to retest, the useful thing to confirm is that the poisoned session still gets discarded after the timeout, and the rest of the same cron tick no longer repeatedly attempts the broken TLS reconnect.

@crazy-max crazy-max marked this pull request as ready for review April 15, 2026 22:46
@com6056
Copy link
Copy Markdown

com6056 commented Apr 17, 2026

Tested the follow-up commit (2f7e82d) for 24 hours, same setup, default 5s timeout.

Results

863 runs, 45 timeout events, 0 desync cascades. The two runs over 2 minutes were legitimate large-file downloads (3.97 GB), not hangs.

The connectErr caching works correctly. Here's what a timeout event looks like now:

04:38:12 DBG ftp.go:16 > EPSV
04:38:23 DBG ftp.go:16 > QUIT
04:38:23 ERR file.go:42 > Cannot read directory <dir-A> error="2 errors occurred:
  * read tcp <client>:51008-><server>:28397: i/o timeout
  * read tcp <client>:40114-><server>:21: i/o timeout
"
04:38:23 DBG ftp.go:16 > QUIT                          ← one reconnect attempt
04:38:23 ERR file.go:42 > Cannot read directory <dir-B> error="tls: first record does not look like a TLS handshake"
04:38:23 ERR file.go:42 > Cannot read directory <dir-C> error="tls: first record does not look like a TLS handshake"
04:38:23 ERR file.go:42 > Cannot read directory <dir-D> error="tls: first record does not look like a TLS handshake"
...  (all remaining dirs return cached error on the same second)
04:38:23 INF ftpgrab.go:102 > Finished duration=21.45s

Only one QUIT after the initial timeout (the reconnect attempt), then all remaining directories return the cached error instantly — no repeated dials. Compared to stock v7.12.0 where this same event would cascade for 5–12 minutes.

Next scheduled tick reconnects cleanly and completes in the usual ~1 minute.

Looks good to merge from my end. Thanks for the quick fix — this has been a significant quality-of-life improvement.

@crazy-max
Copy link
Copy Markdown
Owner Author

Awesome thanks a lot for your feedback on this ❤️

@crazy-max crazy-max merged commit 37d09cc into master Apr 17, 2026
29 checks passed
@crazy-max crazy-max deleted the ftp-retr branch April 17, 2026 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FTP control connection becomes desynchronized after List() i/o timeout, causing minutes-long stalls

2 participants