
IntegrityError: Invalid segment entry size #2099

Closed
bket opened this issue Jan 23, 2017 · 10 comments

Comments

@bket
Contributor

bket commented Jan 23, 2017

In an attempt to expunge multiple checkpoints I used a script similar to:

#!/bin/sh
set -eu
borg list :: | grep checkpoint | cut -d ' ' -f1 | while read l; do
    borg delete ::${l}
done

The client is running borg 1.0.9 on OSX and the server (OpenBSD) is also running 1.0.9. Connection is via SSH.
After some time the client errored with:

Remote: Borg 1.0.9: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/usr/local/lib/python3.6/site-packages/borg/remote.py", line 113, in serve
Remote:     res = f(*args)
Remote:   File "/usr/local/lib/python3.6/site-packages/borg/repository.py", line 200, in commit
Remote:     self.compact_segments(save_space=save_space)
Remote:   File "/usr/local/lib/python3.6/site-packages/borg/repository.py", line 300, in compact_segments
Remote:     for tag, key, offset, data in self.io.iter_objects(segment, include_data=True):
Remote:   File "/usr/local/lib/python3.6/site-packages/borg/repository.py", line 705, in iter_objects
Remote:     (TAG_PUT, TAG_DELETE, TAG_COMMIT))
Remote:   File "/usr/local/lib/python3.6/site-packages/borg/repository.py", line 770, in _read
Remote:     size, segment, offset))
Remote: borg.helpers.IntegrityError: Invalid segment entry size 1195725663 - too big [segment 131098, offset 5300305]
Remote:
Remote: Platform: OpenBSD aurora.lan 6.0 GENERIC#136 i386 i386
Remote: Borg: 1.0.9  Python: CPython 3.6.0
Remote: PID: 44224  CWD: /home/snapshot
Remote: sys.argv: ['/usr/local/bin/borg', 'serve', '--restrict-to-path', '/storage/d66e1f9b5845327b.a/borg/macbook-n_nancy.mac']
Remote: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Remote:
Data integrity error: b'Remote Exception (see remote log for the traceback)'
Traceback (most recent call last):
  File "borg/archiver.py", line 2052, in main
  File "borg/archiver.py", line 1997, in run
  File "borg/archiver.py", line 90, in wrapper
  File "borg/archiver.py", line 503, in do_delete
  File "borg/remote.py", line 403, in commit
  File "borg/remote.py", line 288, in call
  File "borg/remote.py", line 327, in call_many
  File "borg/remote.py", line 309, in handle_error
borg.helpers.IntegrityError: b'Remote Exception (see remote log for the traceback)'

Platform: Darwin macbook-n 16.3.0 Darwin Kernel Version 16.3.0: Thu Nov 17 20:23:58 PST 2016; root:xnu-3789.31.2~1/RELEASE_X86_64 x86_64 i386
Borg: 1.0.9  Python: CPython 3.5.2
PID: 29277  CWD: /Users/n.mac/.snapshot/keys
sys.argv: ['borg', 'delete', '-v', '::macbook-n_nancy.mac_1485099919.checkpoint']
SSH_ORIGINAL_COMMAND: None

Subsequent delete (or create) attempts resulted in the same error. The error was resolved by running borg check --repair.

Output of the specific segment file 131098 at offset 5300305 is included (dd if=131098 of=output bs=1 skip=5300305 count=1024). What is interesting is that both offset 0 and offset 5300305 start with "BORG_SEG".

output.zip
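
For reference, the same check done in Python instead of dd (a minimal sketch with a hypothetical helper, not part of borg):

MAGIC = b'BORG_SEG'

def has_magic_at(path, offset):
    with open(path, 'rb') as f:
        f.seek(offset)
        return f.read(len(MAGIC)) == MAGIC

# With the file and offset from this report:
# has_magic_at('131098', 0)        -> True (expected: segment files start with MAGIC)
# has_magic_at('131098', 5300305)  -> True (unexpected: MAGIC in the middle of the file)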

Issue has been discussed on IRC:

2017-01-23 19:16:21     ThomasWaldmann  if that superfluous BORG_SEG magic header was the only problem, it should be able to recover all data
2017-01-23 19:16:51     bket    other strange thing is the file size: ~2 * 5MB
2017-01-23 19:17:15     bket    I expected a segment to be ~5M
2017-01-23 19:17:29     ThomasWaldmann  as these were checkpoints, there likely was some issue interrupting the backup. otherwise they would not exist.
2017-01-23 19:18:02     ThomasWaldmann  so dime segments were written to, but no final commit entry.
2017-01-23 19:18:08     ThomasWaldmann  some*
2017-01-23 19:19:15     ThomasWaldmann  the bug might be an "ab" file mode. if a seg file already exists, that appends...
2017-01-23 19:20:12     bket    that explains the double BORG_SEG I guess?
2017-01-23 19:20:15     ThomasWaldmann              self._write_fd = open(self.segment_filename(self.segment), 'ab')
2017-01-23 19:20:17     ThomasWaldmann              self._write_fd.write(MAGIC)
2017-01-23 19:20:25     ThomasWaldmann  that looks suspicious
2017-01-23 19:21:16     ThomasWaldmann  will to some digging into code and history
2017-01-23 19:21:39     ThomasWaldmann  but as i said, recover_segment() can cope with that, so no data loss, right?
2017-01-23 19:22:32     ThomasWaldmann  it skips forward bytewise until crc matches again
2017-01-23 19:23:50     ThomasWaldmann  if you look into that seg file, it also starts with BORG_SEG at the beginning (offset 0), right?
2017-01-23 19:24:04     bket    yes
2017-01-23 19:26:14     ThomasWaldmann  19:21  ThomasWaldmann$ but as i said, recover_segment() can cope with that, so no data loss, right?
2017-01-23 19:28:12     bket    I have not observed data loss
2017-01-23 19:29:30     bket    Both offests start with BORG_SEG
2017-01-23 19:33:38     ThomasWaldmann  e22fcdc0 darc/repository.py  (Jonas Borgström  2013-06-20 12:48:54 +0200 673)             self._write_fd = open(self.segment_filename(self.segment), 'ab')
2017-01-23 19:33:43     ThomasWaldmann  b718a443 attic/repository.py (Jonas Borgström  2013-07-08 23:38:27 +0200 674)             self._write_fd.write(MAGIC)
2017-01-23 19:33:52     ThomasWaldmann  hah, ancient code
2017-01-23 19:37:32     ThomasWaldmann  about that time, Jonas renamed from "darc" to "attic"
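
To illustrate the suspicion from the log above: if a segment file already exists, opening it with mode 'ab' appends, so a second MAGIC ends up in the middle of the file. A minimal sketch (plain Python, not borg code):

import os, tempfile

MAGIC = b'BORG_SEG'
path = os.path.join(tempfile.mkdtemp(), '131098')

with open(path, 'ab') as fd:        # first writer: file does not exist yet, MAGIC lands at offset 0
    fd.write(MAGIC + b'...entries...')

with open(path, 'ab') as fd:        # second writer reuses the same segment number
    fd.write(MAGIC + b'...more entries...')

with open(path, 'rb') as fd:
    print(fd.read().count(MAGIC))   # 2 -- a stray BORG_SEG mid-file, as observed in this report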
@ThomasWaldmann
Member

I tagged it "corruption" as BORG_SEG is not expected to be in the middle of a segment file.

It is not data loss (AFAICS) as recover_segment() can deal with it.
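
For context, a rough illustration of what "skips forward bytewise until crc matches again" means. This is not borg's actual recover_segment(); it assumes the (crc32, size, tag) little-endian header layout discussed below, with the crc covering everything after the crc field up to the end of the entry:

import struct
from zlib import crc32

header_fmt = struct.Struct('<IIB')          # crc32, size, tag (assumed layout)
MAX_OBJECT_SIZE = 20 * 1024 * 1024          # assumption; the exact limit is not important here

def find_next_valid_entry(buf, start=0):
    # Scan forward one byte at a time until a header is found whose size is
    # plausible and whose crc32 matches the bytes it covers.
    for pos in range(start, len(buf) - header_fmt.size + 1):
        crc, size, tag = header_fmt.unpack_from(buf, pos)
        if header_fmt.size <= size <= MAX_OBJECT_SIZE and pos + size <= len(buf):
            if crc32(buf[pos + 4:pos + size]) & 0xffffffff == crc:
                return pos
    return None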

@ThomasWaldmann
Member

ThomasWaldmann commented Jan 23, 2017

This is likely a duplicate of #1977 - it shows precisely the same bad segment entry size, which is likely taken from the MAGIC bytes.

>>> header_fmt.unpack(b'BORG_SEGx')
(1196576578, 1195725663, 120)   # crc, size, tag

Keeping this one as it has more details.
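
A self-contained version of the snippet above (assuming header_fmt is the little-endian segment entry header struct from repository.py, i.e. crc32: u32, size: u32, tag: u8):

import struct

header_fmt = struct.Struct('<IIB')

crc, size, tag = header_fmt.unpack(b'BORG_SEGx')
print(crc, size, tag)                       # 1196576578 1195725663 120
print(int.from_bytes(b'_SEG', 'little'))    # 1195725663
# The reported "size" 1195725663 is just b'_SEG' read as a little-endian uint32:
# the stray MAGIC in the middle of the segment was parsed as if it were an entry header.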

@enkore
Contributor

enkore commented Jan 23, 2017

The O_EXCL change in the master branch would "trip" on this.

So there is definitely a bug in the transaction recovery here.

@enkore enkore added this to the 1.0.10rc1 milestone Jan 23, 2017
@ThomasWaldmann
Member

Strange: I googled for the size value and only found #1977 - no post elsewhere on the web, not for borg, not for attic.

@ThomasWaldmann
Member

See also jborg/attic#394.

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Jan 23, 2017

"ab" seems to make no sense here (if there is already a (crap, but non-empty) segment file, we would write a MAGIC right into the middle of the resulting file) and cause borgbackup#2099.

"wb" would overwrite existing content (not sure if that would be ok).
@ThomasWaldmann
Member

ThomasWaldmann commented Jan 24, 2017

@bket did you have "repo disk full" issues?

Also, the specific commands run against that repo recently would be interesting, like borg commands in your scripts or from shell history.

@bket
Contributor Author

bket commented Jan 24, 2017

@ThomasWaldmann "repo disk full" issues are unlikely, as there is plenty of free space (~230GB). Average backup size of that repo, after compression and deduplication, is <100MB.

For removing checkpoints I used a loop as mentioned in this issue. The specific borg commands:

borg list ::
borg delete -v ::${checkpoint}

This loop is used incidentally, as a cleanup measure for specific machines (laptops).

I have one script that I run a couple of times per day. This script does:

borg create -e "${borg_dir}/cache" -x -C zlib,6 ::${repo}_{now:%s} ${source}
borg prune --keep-within=24H --keep-daily=7 --keep-weekly=4 --keep-monthly=12

This script is run on wired machines as well as on laptops (road warriors). The latter regularly experience poor internet connectivity (e.g. in a train), resulting in dropped SSH connections and thus in interrupted borg operations. When this happens while creating a backup, borg creates a checkpoint. I'm not sure what happens when a connection is dropped while pruning; until now I have not experienced any issues.
As a result of dropped connections I sometimes need to break locks. This is done on the specific client:

borg break-lock ::

I'm not sharing repositories, so it is one client per repository.

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Jan 24, 2017
"ab" seems to make no sense here (if there is already a (crap, but non-empty) segment file,
we would write a MAGIC right into the middle of the resulting file) and cause borgbackup#2099.
@ThomasWaldmann
Member

The current solution to this, see PR #2100, is to use "xb" as the file mode instead of "ab".

This would fail early if there already is a file, rather than just appending to it, so the corruption causing this issue would never happen. Maybe that would also give us more / other information about the root cause.
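
A minimal sketch of the behavioral difference (plain Python open modes, not borg code): 'ab' silently appends to an existing file, while 'xb' raises FileExistsError, so a leftover segment file gets reported instead of being corrupted further:

import os, tempfile

path = os.path.join(tempfile.mkdtemp(), 'segment')

with open(path, 'xb') as fd:        # creates the file; would fail if it already existed
    fd.write(b'BORG_SEG')

try:
    open(path, 'xb')                # second attempt on the now-existing file
except FileExistsError as e:
    print('refused:', e)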

ThomasWaldmann added a commit that referenced this issue Jan 28, 2017
creating a new segment: use "xb" mode, fixes #2099
@ThomasWaldmann
Member

ThomasWaldmann commented Jan 28, 2017

Fixed by #2100.

@bket: @enkore suggested that maybe using break-lock could have caused this malfunction.

If a borg create (or another op that writes to the repo) is still active, and one breaks the lock and starts another repo-writing op, I guess it would try to write to already existing segment files. So, did this maybe happen accidentally?

@enkore
Contributor

enkore commented Jan 28, 2017

When I worked on the compaction logic and the transaction self-recovery, I also reviewed this code quite a lot and can't remember finding anything suspicious. While I certainly don't intend to make things "cantreproduce wontfix closed", the possibility of break-lock might have to be considered?

