connection closed by remote #636

Open
ThomasWaldmann opened this Issue Feb 6, 2016 · 18 comments

ThomasWaldmann commented Feb 6, 2016

Sometimes a borg backup gets interrupted with "connection closed by remote".

It is unclear why this happens, or whether the problem is in borg, in the ssh configuration, or on the network layer.

Suspicions:

  • ssh / tcp keepalive configuration (this was pointed out on the atticmatic site, but it did not help in an experiment a borg user did), so it either does not help or is sometimes not sufficient (from IRC: Azelphur$ FYI, TCPKeepAlive yes, ServerAliveInterval 60, ServerAliveCountMax 5 didn't work :( )
    See also: http://www.cyberciti.biz/tips/open-ssh-server-connection-drops-out-after-few-or-n-minutes-of-inactivity.html
  • borg: EINTR handling and exception handling in general could maybe be improved in remote.py (though it is unclear what would cause EINTR in our case, and it would not terminate the process anyway). See http://250bpm.com/blog:12 for details.
  • there might be longer periods with no or only little data transfer (e.g. if borg is reading over a lot of zeros in a sparse file, or if all the chunks are already stored in the repo - e.g. when backing up VMs)
  • a router might be configured in its security/DoS settings to do something special when a lot of packets arrive - could this be triggered by borg?
  • also (trivial), it might just be an unstable network connection that breaks down
ThomasWaldmann commented Feb 6, 2016

Idea for maybe getting more info from the server side: instead of relying on the log messages that come over the wire (or rather don't come, if the connection gets killed), configure the borg on the server to log to a local file.

logging.conf: https://paste.thinkmo.de/Nu7rdXvz#logging.conf

export BORG_LOGGING_CONF=/path/to/that/logging.conf
^ that needs to be put into the environment that the server-side borg process (started via ssh) gets.
(not sure whether we need a borg wrapper script here or a sshd config change)
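
A minimal sketch of such a wrapper (the wrapper path and logging.conf location are examples, not from the original post): it exports BORG_LOGGING_CONF and then hands over to borg, so the server-side process picks up the local logging configuration.

#!/bin/sh
# Hypothetical wrapper, e.g. saved as /usr/local/bin/borg-logged on the server:
# force a local file logging configuration for the borg process that sshd starts.
export BORG_LOGGING_CONF=/home/azelphur/logging.conf
exec borg "$@"

On the client, pointing --remote-path at that wrapper (or using a command= entry in the server's authorized_keys) would make ssh start the wrapper instead of borg directly.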

Azelphur commented Feb 8, 2016

The borg log sadly doesn't actually contain any useful information, it just says

2016-02-08 19:14:54,477 borg.logger DEBUG using logging configuration read from "/home/azelphur/logging.conf"

Here is the log from the client side, running with BORG_RSH="ssh -v"

# BORG_RSH="ssh -v" /home/azelphur/.borg-backup
Backing up to ssh://azelphur@backup.azelphur.com:23/home/azelphur/darth-vader.borg
Remote: OpenSSH_6.9p1 Ubuntu-2ubuntu0.1, OpenSSL 1.0.2d 9 Jul 2015
Remote: debug1: Reading configuration data /etc/ssh/ssh_config
Remote: debug1: /etc/ssh/ssh_config line 19: Applying options for *
Remote: debug1: Connecting to backup.azelphur.com [176.9.24.240] port 23.
Remote: debug1: Connection established.
Remote: debug1: permanently_set_uid: 0/0
Remote: debug1: identity file /root/.ssh/id_rsa type 1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_rsa-cert type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_dsa type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_dsa-cert type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ecdsa type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ecdsa-cert type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ed25519 type -1
Remote: debug1: key_load_public: No such file or directory
Remote: debug1: identity file /root/.ssh/id_ed25519-cert type -1
Remote: debug1: Enabling compatibility mode for protocol 2.0
Remote: debug1: Local version string SSH-2.0-OpenSSH_6.9p1 Ubuntu-2ubuntu0.1
Remote: debug1: Remote protocol version 2.0, remote software version OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4
Remote: debug1: match: OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 pat OpenSSH_6.6.1* compat 0x04000000
Remote: debug1: Authenticating to backup.azelphur.com:23 as 'azelphur'
Remote: debug1: SSH2_MSG_KEXINIT sent
Remote: debug1: SSH2_MSG_KEXINIT received
Remote: debug1: kex: server->client chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: kex: client->server chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
Remote: debug1: Server host key: ecdsa-sha2-nistp256 SHA256:TZNuG4BKyQfl8EO5Jv+Ld7D0CWeS5Jzr5P/16f4svxI
Remote: debug1: Host '[backup.azelphur.com]:23' is known and matches the ECDSA host key.
Remote: debug1: Found key in /root/.ssh/known_hosts:1
Remote: debug1: SSH2_MSG_NEWKEYS sent
Remote: debug1: expecting SSH2_MSG_NEWKEYS
Remote: debug1: SSH2_MSG_NEWKEYS received
Remote: debug1: SSH2_MSG_SERVICE_REQUEST sent
Remote: debug1: SSH2_MSG_SERVICE_ACCEPT received
Remote: debug1: Authentications that can continue: publickey,password
Remote: debug1: Next authentication method: publickey
Remote: debug1: Offering RSA public key: /root/.ssh/id_rsa
Remote: debug1: Server accepts key: pkalg ssh-rsa blen 279
Remote: debug1: Authentication succeeded (publickey).
Remote: Authenticated to backup.azelphur.com ([176.9.24.240]:23).
Remote: debug1: channel 0: new [client-session]
Remote: debug1: Requesting no-more-sessions@openssh.com
Remote: debug1: Entering interactive session.
Remote: debug1: Sending environment.
Remote: debug1: Sending env LANG = en_GB.UTF-8
Remote: debug1: Sending command: borg serve --umask=077 --info
reading files cache
processing files
Remote: debug1: need rekeying
Remote: debug1: SSH2_MSG_KEXINIT sent
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: SSH2_MSG_KEXINIT received
Remote: debug1: kex: server->client chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: kex: client->server chacha20-poly1305@openssh.com <implicit> none
Remote: debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
Remote: debug1: rekeying in progress
Remote: debug1: rekeying in progress
Remote: debug1: Server host key: ecdsa-sha2-nistp256 SHA256:TZNuG4BKyQfl8EO5Jv+Ld7D0CWeS5Jzr5P/16f4svxI
Remote: debug1: set_newkeys: rekeying
Remote: debug1: SSH2_MSG_NEWKEYS sent
Remote: debug1: expecting SSH2_MSG_NEWKEYS
Remote: debug1: set_newkeys: rekeying
Remote: debug1: SSH2_MSG_NEWKEYS received
Remote: packet_write_wait: Connection to 176.9.24.240: Broken pipe
Connection closed by remote host
ThomasWaldmann commented Feb 8, 2016

Ok, if that one line is all you have in the server log, we at least know it is not the borg process on the server side that runs into some exception or other error - that would show up in that log.

So I guess you have to go through the list of suspicions from my first post, especially the ones relating to ssh itself and the network layer.

Another idea is to enhance logging on both sides with a logging.conf that includes timestamps, so we get timing information (a sketch of such a config is below).

It would also be interesting to add another piece of information to the protocol of the experiments:

  • if you do a first backup, there is a lot of traffic going over the network.
  • if you do a 2nd+ backup of the same or similar data, there is way less data going over the network.
    That difference might be important for debugging this problem. It would be interesting whether the issue happens in both cases or not.
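
As a sketch, a logging.conf along these lines could be used on both sides (file names, path and levels are only examples, assuming the usual INI-style Python logging configuration that BORG_LOGGING_CONF expects); %(asctime)s adds a timestamp to every line:

[loggers]
keys=root

[handlers]
keys=logfile

[formatters]
keys=timestamped

[logger_root]
level=DEBUG
handlers=logfile

[handler_logfile]
class=FileHandler
level=DEBUG
formatter=timestamped
args=('/tmp/borg-server.log',)

[formatter_timestamped]
format=%(asctime)s %(name)s %(levelname)s %(message)s
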
Azelphur commented Feb 12, 2016

> ssh / tcp keepalive configuration (this was pointed out on the atticmatic site, but it did not help in an experiment a borg user did), so it either does not help or is sometimes not sufficient (from IRC: Azelphur$ FYI, TCPKeepAlive yes, ServerAliveInterval 60, ServerAliveCountMax 5 didn't work :( )
> See also: http://www.cyberciti.biz/tips/open-ssh-server-connection-drops-out-after-few-or-n-minutes-of-inactivity.html

To confirm, I do indeed have

Host *
    ServerAliveInterval 60
    ServerAliveCountMax 5
    TCPKeepAlive yes

in my /etc/ssh/ssh_config file, but the issue still occurs.

> there might be longer times with no or only little data transfer (e.g. if borg is reading over a lot of zeros in a sparse file or if all the chunks are already stored in the repo - e.g. when backing up VMs)

> also (trivial), it might just be an unstable network connection that breaks down

> if you do a 2nd+ backup of same or similar data, there is way less data going over the network. That difference might be important for debugging this problem. It would be interesting whether the issue happens in both cases or not.

To address these questions, I created 10 random files and then tried to back up these files (twice) while running mtr; here are the results:

azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample2.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample3.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample4.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample5.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample6.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample7.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample8.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample9.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ openssl rand -out sample0.txt -base64 $(( 2**30 * 3/4 ))
azelphur@darth-vader:~/tmp/pointless-file$ du -sh *
1.1G    sample0.txt
1.1G    sample2.txt
1.1G    sample3.txt
1.1G    sample4.txt
1.1G    sample5.txt
1.1G    sample6.txt
1.1G    sample7.txt
1.1G    sample8.txt
1.1G    sample9.txt
1.1G    sample.txt
azelphur@darth-vader:~/tmp/pointless-file$ borg create --compression lz4 -sv --exclude-caches ssh://azelphur@backup.azelphur.com:23/home/azelphur/test.borg::`hostname`-`date "+%Y-%m-%d_%H:%M:%S"` /home/azelphur/tmp/pointless-file
reading files cache
processing files
Remote: packet_write_wait: Connection to 176.9.24.240: Broken pipe
Connection closed by remote host
azelphur@darth-vader:~/tmp/pointless-file$ borg create --compression lz4 -sv --exclude-caches ssh://azelphur@backup.azelphur.com:23/home/azelphur/test.borg::`hostname`-`date "+%Y-%m-%d_%H:%M:%S"` /home/azelphur/tmp/pointless-file
reading files cache
processing files
Remote: packet_write_wait: Connection to 176.9.24.240: Broken pipe
Connection closed by remote host

and the output from mtr

                             My traceroute  [v0.85]
darth-vader (0.0.0.0)                                  Fri Feb 12 16:40:12 2016
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. router.asus.com                   0.0%   411    0.3   0.3   0.2   0.9   0.0
 2. lns12.the.dsl.enta.net            0.0%   411   10.9  13.3   9.0  58.5   7.0
 3. te3-3.the.dist.dsl.enta.net       0.0%   411   11.3  13.7   9.2  91.9   7.9
 4. te2-2.telehouse-east3.dsl.enta.n  0.0%   411   10.0  13.7   9.3 116.0   8.6
 5. te5-6.telehouse-east2.core.enta.  1.2%   411   11.4  13.9   9.7  56.0   6.9
 6. te0-1-0-1.telehouse-east4.core.e  0.0%   411   11.5  13.8  10.1  38.1   6.2
 7. te5-3.amsterdam.core.enta.net     0.0%   411   16.5  19.3  15.3  54.8   6.7
 8. amsix-gw.hetzner.de               0.0%   411   24.7  28.3  24.1  72.4   7.0
 9. core1.hetzner.de                  0.2%   411   24.5  28.0  23.9  65.2   6.9
10. core22.hetzner.de                 0.0%   411   29.4  32.6  28.9  56.3   6.4
11. juniper1.rz15.hetzner.de          0.0%   411   30.2  32.9  28.7  82.8   7.4
12. hos-tr1.ex3k2.rz15.hetzner.de     0.0%   410   30.2  33.7  29.2  66.2   6.4
13. backup.azelphur.com               0.0%   410   29.1  32.2  28.7  62.9   6.1

> a router might be configured in its security/DoS settings to do something special in case a lot of packets arrive - could this be triggered by borg?

My router is an Asus RT-AC87R running AsusWRT-Merlin, DoS protection is disabled and QoS is also off.

ThomasWaldmann commented Feb 12, 2016

@Azelphur just as a weird idea: could you remove these ssh options (both on server and on client side, restart sshd afterwards) and run your random-files test again?

ThomasWaldmann commented Feb 19, 2016

jungle-boogie commented Feb 19, 2016

Yes, those settings can help, but if there's no response the session can still be terminated. I would also say: remove the settings, try leaving an ssh session connected outside of borg, and see what the results are.
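
One way to run that test, as a sketch (host and port taken from the logs above, the sleep duration is arbitrary): open a session that stays idle for a long time and see whether it survives without borg being involved.

ssh -p 23 azelphur@backup.azelphur.com 'sleep 7200; echo session survived'

If that idle session also gets dropped, the problem is in ssh or the network path rather than in borg.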

Azelphur commented Aug 1, 2016

Ok...here I am, 6 months later. I finally solved it.

There was an issue with my router, I was running an app on the router which does speedtests every 15 minutes. After disabling this app, the problem no longer occurs. Not sure exactly why the issue occurred beyond that, but if anyone else gets this issue and happens to be running some services on their router - disable those services and see if you still get it. :)

ThomasWaldmann commented Aug 1, 2016

@Azelphur thanks for the feedback.

drzraf commented Oct 5, 2016

Adding to this topic (unless you prefer a new issue): killing the ssh/gpg-agent makes borg create (borg 1.0.7) hang indefinitely (strace shows "select() Timeout") with no warning (at least in non-verbose mode).
Reproducer: run a borg create, then kill the ssh-agent and wait a bit.

enkore commented Oct 5, 2016

That's strange. We tested this by killing the ssh client, which works as expected (i.e. it terminates with an error message).

ThomasWaldmann commented Oct 5, 2016

@drzraf please give the precise command you use to do the kill.

drzraf commented Oct 5, 2016

Sorry for the false alarm. The timeouts I saw were not a sign of hanging (it was indeed uploading): the process was still ongoing. In case these "select() timeout" messages are of interest anyway, here is my command:

BORG_PASSPHRASE=xxx borg create --remote-path=/usr/local/bin/borg1/borg1 --exclude-from xxx --exclude-caches --one-file-system --compression zlib --stats --debug "remote:main-backup::2016-10-04 04:24:03.438923" mydir

(I use such a backup name in order to "resume" from the previous borg create, where I initially used {now})

What I observed was that the remote repository size was not growing after multiple resumed borg create runs.
But the real reason was that the run was always interrupted (screenlocker / killed ssh-agent) before the next file (~4 GB in size) could finish successfully, and since borg can't resume a file upload (as of 1.0.7), it was always "removing" the partial file and starting it over.
My hypothesis about borg <-> ssh-agent was wrong. Sorry for the noise.

ThomasWaldmann commented Oct 5, 2016

@drzraf you do not need to use a special name to "resume" a past backup. just create a new backup with any name you like and it will use what is already in the repo (so this feels like a resume operation, but technically it is not).

rugk commented Oct 31, 2016

IMHO you cannot prevent this type of error. As one user pointed out, it may have to do with the router or - e.g. - with VPNs/proxies, whatever...

In any case, I would much prefer it if borg were able to work around this issue by automatically retrying the backup (maybe 3, or a configurable number of, times).
That way this error would be much less serious and backups should still work.

acwatkins commented Mar 11, 2017

I am getting this error consistently while directly connected to a gigabit switch. It doesn't seem to happen on the same file each time, though (I just reran it, it redid a bunch of files and went past the one that it hung on before).

rugk commented Jul 22, 2017

BTW I actually made a small shell wrapper script for borg, which circumvents this issue by just retrying the backup (as I suggested before). See: https://github.com/rugk/borg-cron-helper

0x72 commented Apr 22, 2018

Thank you @rugk. I am also affected by randomly dropped connections, so your script is very helpful. If anyone wants to automate their backups without an additional dependency, the most barebones approach would be:
$ until borg create <repository> <files>; do :; done
which restarts the command until borg exits successfully with exit code 0. This makes it impossible to cleanly cancel a running backup with Ctrl-C (borg exits non-zero and the loop just restarts it), so use Ctrl-Z followed by a $ kill %1 (or whatever job id the loop has).

If you want to control how often to restart the backup or to include some exponential backoff time between attempts, look at https://github.com/kadwanev/retry, which is a very small bash retry script that is generally useful.
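
For reference, a self-contained sketch of such a retry loop with a capped number of attempts and exponential backoff (repository URL, paths and limits are placeholders, not anything borg ships):

#!/usr/bin/env bash
# Retry "borg create" a limited number of times, doubling the wait between attempts.
max_tries=5
delay=60
for (( try=1; try<=max_tries; try++ )); do
    # placeholder repository / archive name / paths - adjust to your setup
    borg create --stats ssh://user@host/path/to/repo::'{hostname}-{now}' /data && exit 0
    echo "borg create failed (attempt $try/$max_tries), retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
done
echo "giving up after $max_tries attempts" >&2
exit 1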
