reconnect FTP sessions after timeout-poisoned transfers by crazy-max · Pull Request #508 · crazy-max/ftpgrab

crazy-max · 2026-04-14T08:24:33Z

fixes #507

The FTP client now keeps the state needed to reconnect lazily after timeout-shaped transfer failures. Retrieve also stops ignoring resp.Close() failures, because that close path is where the trailing FTP control response timeout is surfaced by jlaffaye/ftp module.

#507 describes a real failure mode where one stalled transfer poisons the control connection and turns a short run into minutes of repeated failures. FTP has no reliable in-band resync after that state is corrupted, so the correct recovery path is to discard the connection and reconnect before the next operation.

com6056 · 2026-04-15T03:39:20Z

Tested the branch against the same setup from #507 (no local patches beyond building from this branch, stock 5s timeout, DISABLEEPSV/DISABLEMLSD removed, LOG_LEVEL=debug + LOG_CALLER=true + FTPGRAB_SERVER_FTP_LOGTRACE=true). 4h20m of runtime, 222 completed runs.

Result

The multi-minute cascade is gone.

Metric	Stock v7.12.0	This PR
Stalls ≥ 2 min	~32 / day	0
Timeout events	~22 / day	~49 / day (similar rate in absolute terms)
Affected-run duration	5m – 12m12s	17s – 50s
Next cron tick	degraded for the rest of the run	clean, fresh `AUTH TLS` / `234 AUTH TLS successful`

Every one of the 9 timeout events observed followed the same shape: LIST times out on both the data port and control port (2 errors occurred: ... i/o timeout), QUIT is issued immediately, the remaining directories in that cron tick fail fast, and the next cron tick opens a brand-new session that completes in the usual ~1 minute. This is exactly the behavior #507 described as "expected."

Secondary observation

Within the same run that triggered the timeout, the first ensureConnected() reconnect attempt reliably fails with tls: first record does not look like a TLS handshake on my setup, which then surfaces as the error for every subsequent directory in that run (fast, sub-second each — hence the 17–50s total instead of just ~1s). The very next scheduled tick ~60s later does a clean AUTH TLS and succeeds normally.

Representative snippet:

22:29:06 DBG ftp.go:16 > 150 Here comes the directory listing.
22:29:16 ERR file.go:42 > Cannot read directory <dir-A> error="2 errors occurred:
  * read tcp <client-ip>:49152-><server-ip>:26961: i/o timeout
  * read tcp <client-ip>:60412-><server-ip>:21: i/o timeout
"
22:29:16 DBG ftp.go:16 > QUIT
22:29:16 ERR file.go:42 > Cannot read directory <dir-B> error="tls: first record does not look like a TLS handshake"
22:29:16 DBG ftp.go:16 > QUIT
22:29:16 ERR file.go:42 > Cannot read directory <dir-C> error="tls: first record does not look like a TLS handshake"
...
22:29:17 INF ftpgrab.go:102 > Finished duration=17.597s

22:30:00 DBG ftp.go:16 > AUTH TLS
22:30:00 DBG ftp.go:16 > 234 AUTH TLS successful
22:31:00 INF ftpgrab.go:102 > Finished duration=1m0.579s

My best guess is that the bouncer in front of the FTP server needs a moment to release state after the poisoned connection is torn down, and it's not fully ready to negotiate a fresh AUTH TLS immediately. It could also be that the TLS handshake on the newly-dialed connection is racing with trailing bytes the server is still draining from the previous session. This isn't covered by the existing fix because ensureConnected()'s dial/login error isn't treated as a timeout by handleTransferError(), so c.ftp stays nil and subsequent ReadDir() calls just keep retrying the same failing reconnect within the run.

Not a blocker from my side — total wasted time went from hours/day to tens of seconds/day — but wanted to flag it in case it's worth letting the initial reconnect attempt happen after a small delay, or not attempting reconnect again within the same run after it fails once. Happy to test any variant you'd like.

Thanks for the quick turnaround on this one.

crazy-max · 2026-04-15T22:46:21Z

@com6056 Thanks a lot for this report!

I pushed a follow-up for the second issue you found.

After a timeout poisons the session and ftpgrab closes it, the client now allows one reconnect attempt in the same run. If that reconnect fails, for example with tls: first record does not look like a TLS handshake, it caches that reconnect error and returns it for the rest of the run instead of retrying the same failing handshake on every later directory.

I chose that over adding an arbitrary sleep before reconnect. The expensive part here was the repeated reconnect thrash, so this change removes that behavior without guessing at a delay that may or may not help on other servers.

If you get a chance to retest, the useful thing to confirm is that the poisoned session still gets discarded after the timeout, and the rest of the same cron tick no longer repeatedly attempts the broken TLS reconnect.

com6056 · 2026-04-17T05:05:32Z

Tested the follow-up commit (2f7e82d) for 24 hours, same setup, default 5s timeout.

Results

863 runs, 45 timeout events, 0 desync cascades. The two runs over 2 minutes were legitimate large-file downloads (3.97 GB), not hangs.

The connectErr caching works correctly. Here's what a timeout event looks like now:

04:38:12 DBG ftp.go:16 > EPSV
04:38:23 DBG ftp.go:16 > QUIT
04:38:23 ERR file.go:42 > Cannot read directory <dir-A> error="2 errors occurred:
  * read tcp <client>:51008-><server>:28397: i/o timeout
  * read tcp <client>:40114-><server>:21: i/o timeout
"
04:38:23 DBG ftp.go:16 > QUIT                          ← one reconnect attempt
04:38:23 ERR file.go:42 > Cannot read directory <dir-B> error="tls: first record does not look like a TLS handshake"
04:38:23 ERR file.go:42 > Cannot read directory <dir-C> error="tls: first record does not look like a TLS handshake"
04:38:23 ERR file.go:42 > Cannot read directory <dir-D> error="tls: first record does not look like a TLS handshake"
...  (all remaining dirs return cached error on the same second)
04:38:23 INF ftpgrab.go:102 > Finished duration=21.45s

Only one QUIT after the initial timeout (the reconnect attempt), then all remaining directories return the cached error instantly — no repeated dials. Compared to stock v7.12.0 where this same event would cascade for 5–12 minutes.

Next scheduled tick reconnects cleanly and completes in the usual ~1 minute.

Looks good to merge from my end. Thanks for the quick fix — this has been a significant quality-of-life improvement.

crazy-max · 2026-04-17T09:45:53Z

Awesome thanks a lot for your feedback on this ❤️

reconnect FTP sessions after timeout-poisoned transfers

f0330d5

crazy-max mentioned this pull request Apr 14, 2026

FTP control connection becomes desynchronized after List() i/o timeout, causing minutes-long stalls #507

Closed

3 tasks

avoid repeated FTP reconnect attempts after handshake failure

2f7e82d

crazy-max marked this pull request as ready for review April 15, 2026 22:46

crazy-max merged commit 37d09cc into master Apr 17, 2026
29 checks passed

crazy-max deleted the ftp-retr branch April 17, 2026 09:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

reconnect FTP sessions after timeout-poisoned transfers#508

reconnect FTP sessions after timeout-poisoned transfers#508
crazy-max merged 2 commits intomasterfrom
ftp-retr

crazy-max commented Apr 14, 2026

Uh oh!

com6056 commented Apr 15, 2026

Uh oh!

crazy-max commented Apr 15, 2026

Uh oh!

com6056 commented Apr 17, 2026

Uh oh!

crazy-max commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

crazy-max commented Apr 14, 2026

Uh oh!

com6056 commented Apr 15, 2026

Result

Secondary observation

Uh oh!

crazy-max commented Apr 15, 2026

Uh oh!

com6056 commented Apr 17, 2026

Results

Uh oh!

crazy-max commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants