
IntegrityError: Invalid segment entry size #2099

Closed
bket opened this issue Jan 23, 2017 · 10 comments

Comments

@bket
Contributor

bket commented Jan 23, 2017

In an attempt to expunge multiple checkpoints I used a script similar to:

#!/bin/sh
set -eu
borg list :: | grep checkpoint | cut -d ' ' -f1 | while read l; do
    borg delete ::${l}
done

The client is running borg 1.0.9 on OSX and the server (OpenBSD) is also running 1.0.9. Connection is via SSH.
After some time the client errored with:

Remote: Borg 1.0.9: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/usr/local/lib/python3.6/site-packages/borg/remote.py", line 113, in serve
Remote:     res = f(*args)
Remote:   File "/usr/local/lib/python3.6/site-packages/borg/repository.py", line 200, in commit
Remote:     self.compact_segments(save_space=save_space)
Remote:   File "/usr/local/lib/python3.6/site-packages/borg/repository.py", line 300, in compact_segments
Remote:     for tag, key, offset, data in self.io.iter_objects(segment, include_data=True):
Remote:   File "/usr/local/lib/python3.6/site-packages/borg/repository.py", line 705, in iter_objects
Remote:     (TAG_PUT, TAG_DELETE, TAG_COMMIT))
Remote:   File "/usr/local/lib/python3.6/site-packages/borg/repository.py", line 770, in _read
Remote:     size, segment, offset))
Remote: borg.helpers.IntegrityError: Invalid segment entry size 1195725663 - too big [segment 131098, offset 5300305]
Remote:
Remote: Platform: OpenBSD aurora.lan 6.0 GENERIC#136 i386 i386
Remote: Borg: 1.0.9  Python: CPython 3.6.0
Remote: PID: 44224  CWD: /home/snapshot
Remote: sys.argv: ['/usr/local/bin/borg', 'serve', '--restrict-to-path', '/storage/d66e1f9b5845327b.a/borg/macbook-n_nancy.mac']
Remote: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --info'
Remote:
Data integrity error: b'Remote Exception (see remote log for the traceback)'
Traceback (most recent call last):
  File "borg/archiver.py", line 2052, in main
  File "borg/archiver.py", line 1997, in run
  File "borg/archiver.py", line 90, in wrapper
  File "borg/archiver.py", line 503, in do_delete
  File "borg/remote.py", line 403, in commit
  File "borg/remote.py", line 288, in call
  File "borg/remote.py", line 327, in call_many
  File "borg/remote.py", line 309, in handle_error
borg.helpers.IntegrityError: b'Remote Exception (see remote log for the traceback)'

Platform: Darwin macbook-n 16.3.0 Darwin Kernel Version 16.3.0: Thu Nov 17 20:23:58 PST 2016; root:xnu-3789.31.2~1/RELEASE_X86_64 x86_64 i386
Borg: 1.0.9  Python: CPython 3.5.2
PID: 29277  CWD: /Users/n.mac/.snapshot/keys
sys.argv: ['borg', 'delete', '-v', '::macbook-n_nancy.mac_1485099919.checkpoint']
SSH_ORIGINAL_COMMAND: None

Subsequent delete (or create) attempts resulted in the same error. The error was resolved by running borg check --repair.

Output of the specific segment file 131098 at offset 5300305 is included (dd if=131098 of=output bs=1 skip=5300305 count=1024). What is interesting is that both offset 0 and offset 5300305 start with "BORG_SEG".

output.zip
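
For reference, the same check done in Python instead of dd (a minimal sketch with a hypothetical helper, not part of borg):

MAGIC = b'BORG_SEG'

def has_magic_at(path, offset):
    with open(path, 'rb') as f:
        f.seek(offset)
        return f.read(len(MAGIC)) == MAGIC

# With the file and offset from this report:
# has_magic_at('131098', 0)        -> True (expected: segment files start with MAGIC)
# has_magic_at('131098', 5300305)  -> True (unexpected: MAGIC in the middle of the file)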

Issue has been discussed on IRC:

2017-01-23 19:16:21     ThomasWaldmann  if that superfluous BORG_SEG magic header was the only problem, it should be able to recover all data
2017-01-23 19:16:51     bket    other strange thing is the file size: ~2 * 5MB
2017-01-23 19:17:15     bket    I expected a segment to be ~5M
2017-01-23 19:17:29     ThomasWaldmann  as these were checkpoints, there likely was some issue interrupting the backup. otherwise they would not exist.
2017-01-23 19:18:02     ThomasWaldmann  so dime segments were written to, but no final commit entry.
2017-01-23 19:18:08     ThomasWaldmann  some*
2017-01-23 19:19:15     ThomasWaldmann  the bug might be an "ab" file mode. if a seg file already exists, that appends...
2017-01-23 19:20:12     bket    that explains the double BORG_SEG I guess?
2017-01-23 19:20:15     ThomasWaldmann              self._write_fd = open(self.segment_filename(self.segment), 'ab')
2017-01-23 19:20:17     ThomasWaldmann              self._write_fd.write(MAGIC)
2017-01-23 19:20:25     ThomasWaldmann  that looks suspicious
2017-01-23 19:21:16     ThomasWaldmann  will to some digging into code and history
2017-01-23 19:21:39     ThomasWaldmann  but as i said, recover_segment() can cope with that, so no data loss, right?
2017-01-23 19:22:32     ThomasWaldmann  it skips forward bytewise until crc matches again
2017-01-23 19:23:50     ThomasWaldmann  if you look into that seg file, it also starts with BORG_SEG at the beginning (offset 0), right?
2017-01-23 19:24:04     bket    yes
2017-01-23 19:26:14     ThomasWaldmann  19:21  ThomasWaldmann$ but as i said, recover_segment() can cope with that, so no data loss, right?
2017-01-23 19:28:12     bket    I have not observed data loss
2017-01-23 19:29:30     bket    Both offests start with BORG_SEG
2017-01-23 19:33:38     ThomasWaldmann  e22fcdc0 darc/repository.py  (Jonas Borgström  2013-06-20 12:48:54 +0200 673)             self._write_fd = open(self.segment_filename(self.segment), 'ab')
2017-01-23 19:33:43     ThomasWaldmann  b718a443 attic/repository.py (Jonas Borgström  2013-07-08 23:38:27 +0200 674)             self._write_fd.write(MAGIC)
2017-01-23 19:33:52     ThomasWaldmann  hah, ancient code
2017-01-23 19:37:32     ThomasWaldmann  about that time, Jonas renamed from "darc" to "attic"
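
To illustrate the suspicion from the log above: if a segment file already exists, opening it with mode 'ab' appends, so a second MAGIC ends up in the middle of the file. A minimal sketch (plain Python, not borg code):

import os, tempfile

MAGIC = b'BORG_SEG'
path = os.path.join(tempfile.mkdtemp(), '131098')

with open(path, 'ab') as fd:        # first writer: file does not exist yet, MAGIC lands at offset 0
    fd.write(MAGIC + b'...entries...')

with open(path, 'ab') as fd:        # second writer reuses the same segment number
    fd.write(MAGIC + b'...more entries...')

with open(path, 'rb') as fd:
    print(fd.read().count(MAGIC))   # 2 -- a stray BORG_SEG mid-file, as observed in this report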
@ThomasWaldmann
Member

I tagged it "corruption" as BORG_SEG is not expected to be in the middle of a segment file.

It is not data loss (AFAICS) as recover_segment() can deal with it.
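
For context, a rough illustration of what "skips forward bytewise until crc matches again" means. This is not borg's actual recover_segment(); it assumes the (crc32, size, tag) little-endian header layout discussed below, with the crc covering everything after the crc field up to the end of the entry:

import struct
from zlib import crc32

header_fmt = struct.Struct('<IIB')          # crc32, size, tag (assumed layout)
MAX_OBJECT_SIZE = 20 * 1024 * 1024          # assumption; the exact limit is not important here

def find_next_valid_entry(buf, start=0):
    # Scan forward one byte at a time until a header is found whose size is
    # plausible and whose crc32 matches the bytes it covers.
    for pos in range(start, len(buf) - header_fmt.size + 1):
        crc, size, tag = header_fmt.unpack_from(buf, pos)
        if header_fmt.size <= size <= MAX_OBJECT_SIZE and pos + size <= len(buf):
            if crc32(buf[pos + 4:pos + size]) & 0xffffffff == crc:
                return pos
    return None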

@ThomasWaldmann
Member

ThomasWaldmann commented Jan 23, 2017

This is likely a duplicate of #1977 - it shows precisely the same bad segment entry size, which is likely taken from the MAGIC bytes.

>>> header_fmt.unpack(b'BORG_SEGx')
(1196576578, 1195725663, 120)   # crc, size, tag

Keeping this one as it has more details.
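
A self-contained version of the snippet above (assuming header_fmt is the little-endian segment entry header struct from repository.py, i.e. crc32: u32, size: u32, tag: u8):

import struct

header_fmt = struct.Struct('<IIB')

crc, size, tag = header_fmt.unpack(b'BORG_SEGx')
print(crc, size, tag)                       # 1196576578 1195725663 120
print(int.from_bytes(b'_SEG', 'little'))    # 1195725663
# The reported "size" 1195725663 is just b'_SEG' read as a little-endian uint32:
# the stray MAGIC in the middle of the segment was parsed as if it were an entry header.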

@enkore
Contributor

enkore commented Jan 23, 2017

The O_EXCL change in the master branch would "trip" on this.

So there is definitely a bug in the transaction recovery here.

@enkore enkore added this to the 1.0.10rc1 milestone Jan 23, 2017
@ThomasWaldmann
Member

Strange: I googled for the size value and only found #1977 - no post elsewhere on the web, not for borg, not for attic.

@ThomasWaldmann
Member

See also jborg/attic#394.

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Jan 23, 2017

"ab" seems to make no sense here (if there is already a (crap, but non-empty) segment file, we would write a MAGIC right into the middle of the resulting file) and cause borgbackup#2099.

"wb" would overwrite existing content (not sure if that would be ok).
@ThomasWaldmann
Member

ThomasWaldmann commented Jan 24, 2017

@bket did you have "repo disk full" issues?

Also, the specific commands run against that repo recently would be interesting, like borg commands in your scripts or from shell history.

@bket
Contributor Author

bket commented Jan 24, 2017

@ThomasWaldmann "repo disk full" issues are unlikely, as there is plenty of free space (~230GB). Average backup size of that repo, after compression and deduplication, is <100MB.

For removing checkpoints I used a loop as mentioned in this issue. The specific borg commands:

borg list ::
borg delete -v ::${checkpoint}

This loop is used incidentally, as a cleanup measure for specific machines (laptops).

I have one script that I run a couple of times per day. This script does:

borg create -e "${borg_dir}/cache" -x -C zlib,6 ::${repo}_{now:%s} ${source}
borg prune --keep-within=24H --keep-daily=7 --keep-weekly=4 --keep-monthly=12

This script is run on wired machines as well as on laptops (road warriors). The latter regularly experience poor internet connectivity (e.g. in a train), resulting in dropped SSH connections and thus in interrupted borg operations. When this happens while creating a backup, borg creates a checkpoint. I'm not sure what happens when a connection is dropped while pruning; until now I have not experienced any issues.
As a result of dropped connections I sometimes need to break locks. This is done on the specific client:

borg break-lock ::

I'm not sharing repositories, so it is one client per repository.

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Jan 24, 2017
"ab" seems to make no sense here (if there is already a (crap, but non-empty) segment file,
we would write a MAGIC right into the middle of the resulting file) and cause borgbackup#2099.
@ThomasWaldmann
Member

The current solution to this, see PR #2100, is to use "xb" as the file mode instead of "ab".

This would fail early if there already is a file, rather than just appending to it, so the corruption causing this issue would never happen. Maybe that would also give us more / other information about the root cause.
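
A minimal sketch of the behavioral difference (plain Python open modes, not borg code): 'ab' silently appends to an existing file, while 'xb' raises FileExistsError, so a leftover segment file gets reported instead of being corrupted further:

import os, tempfile

path = os.path.join(tempfile.mkdtemp(), 'segment')

with open(path, 'xb') as fd:        # creates the file; would fail if it already existed
    fd.write(b'BORG_SEG')

try:
    open(path, 'xb')                # second attempt on the now-existing file
except FileExistsError as e:
    print('refused:', e)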

ThomasWaldmann added a commit that referenced this issue Jan 28, 2017
creating a new segment: use "xb" mode, fixes #2099
@ThomasWaldmann
Member

ThomasWaldmann commented Jan 28, 2017

Fixed by #2100.

@bket: @enkore suggested that maybe using break-lock could have caused this malfunction.

If a borg create (or another op that writes to the repo) is still active, and one breaks the lock and starts another repo-writing op, I guess it would try to write to already existing segment files. So, did this maybe happen accidentally?

@enkore
Contributor

enkore commented Jan 28, 2017

When I worked on the compaction logic and the transaction self-recovery, I also reviewed this code quite a lot and can't remember finding anything suspicious. While I certainly don't intend to make things "cantreproduce wontfix closed", the possibility of break-lock might have to be considered?

