borg check --repair hangs #4243
Comments
Here is the traceback on sigint when running check --repair locally:
Hmm, maybe the power failure is not the only problem there. borg tries hard to make sure that data is synced to disk before a commit is written (and synced to disk). So, if the power failure happened before the commit, there is just uncommitted data and borg should simply discard it, rolling back to a previous consistent state. If the power failure happened after the commit, the state should also be valid, as all data is on disk as it should be.
@colmode what's your desired goal? Reporting a potential bug in borg, so we can fix it? Getting the repo into a valid state so you can access past backups?
My guess is that the cause of this is you hitting Ctrl-C while it is in the crc32 computation, so it is kind of a cosmetic issue. The memoryview + its release looks like it should be done with a contextmanager.
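For illustration, a minimal sketch of that contextmanager idea (not the actual borg code): wrap the memoryview so its release() always runs, even if the crc32 computation is interrupted by Ctrl-C.

```python
from contextlib import contextmanager
import zlib

@contextmanager
def released(buf):
    """Yield a memoryview over buf and guarantee it is released afterwards."""
    view = memoryview(buf)
    try:
        yield view
    finally:
        view.release()

# Example use: even if crc32() is interrupted by Ctrl-C, the view is released.
with released(bytearray(b'example segment data')) as view:
    crc = zlib.crc32(view) & 0xffffffff
```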
As a performance note: the code in …
My goal is to get the repo into a valid state, primarily so that I can continue making daily backups without having to re-sync all 300 GB of the data, which I would have to do if I blew away the repo and started over. At this point I don't care that much about past backups. The current state of the data is ok. It would be nice to preserve past backups though.
The first hang persisted for 10 days (I was traveling). strace and ltrace showed not much activity, so apparently borg was looping over some internal state.
It's also curious that borg advances to the next segment on each restart. I haven't studied your internals.
Doing manual operations on a repo can be dangerous and risks total loss of the repo, so the usual advice is to make a copy first. That said, you could maybe just throw away all corrupt segments (I assume these are only at the end, i.e. the highest segment file numbers [maybe 4356+?], while everything below some specific number might be uncorrupted). Or, rather than throwing them away: move them to a directory outside the repo. Can you give a … Also: please post your …
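A rough sketch of that "move suspect segments out of the repo" suggestion. The paths and the cut-off segment number are placeholders, not values from this thread, and making a copy of the repo first is strongly preferable.

```python
import os
import shutil

REPO_DATA = '/backup/borg-repo/data'   # hypothetical: your repo's data directory
QUARANTINE = '/backup/quarantine'      # somewhere outside the repo
FIRST_BAD = 4356                       # first segment number you believe is corrupt

os.makedirs(QUARANTINE, exist_ok=True)
for dirpath, _dirnames, filenames in os.walk(REPO_DATA):
    for name in filenames:
        base = name.split('.')[0]      # covers both NNNN and NNNN.beforerecover
        if base.isdigit() and int(base) >= FIRST_BAD:
            shutil.move(os.path.join(dirpath, name),
                        os.path.join(QUARANTINE, name))
```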
The first attempted backup that failed was on Dec 23. There are some files in data/4 that are dated newer than that, including today. There were some attempted backups in the intervening time, but the lockfile was held by the ongoing check --repair each time. |
Can you post a hexdump of these two?
I have no space to make a repo backup btw. |
hexdump data/4/4371
hexdump data/4/4371.beforerecover
The …
So, can you go back, starting from the last file, until you find a valid-looking 17-byte file? If there is NNNN and also NNNN.beforerecover, borg detected corruption, so look somewhere before that (a lower segment file number). 4355 maybe?
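A small sketch of that backwards scan. The repo path is a placeholder, and reading 17 bytes as "segment magic plus a single commit entry" is my assumption about the 1.1 on-disk format, not something stated in this thread.

```python
import os

REPO_DATA = '/backup/borg-repo/data'   # hypothetical path

segments = []
for dirpath, _dirnames, filenames in os.walk(REPO_DATA):
    for name in filenames:
        if name.isdigit():
            segments.append((int(name), os.path.join(dirpath, name)))

# List segments from highest number down, flagging the 17-byte candidates.
for number, path in sorted(segments, reverse=True):
    size = os.path.getsize(path)
    note = '  <-- 17 bytes, possibly a commit-only segment' if size == 17 else ''
    print('%6d  %10d bytes%s' % (number, size, note))
```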
hexdump -C 4355 |
The segment files that were under repair when I interrupted have funny sizes:
-rw------- 1 bryan bryan 10598338 Dec 30 08:22 4403
Seems valid. OK, maybe try this (no guarantees):
borg check will then read all the 4000 segment files, crc-check them, and rebuild a new repo index from what it still finds valid. If possible, start borg from the server and inside a … On the client side, it might be required / advisable to delete the local cache: … After the repo check has completed successfully, you could also do …
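For a rough idea of what that per-segment CRC check involves, here is a simplified sketch (not the real borg implementation; prefer the actual `borg check`). It assumes the borg 1.1 segment layout of an 8-byte `BORG_SEG` magic followed by entries of crc32 (4 bytes) | size (4 bytes, including the 9-byte header) | tag (1 byte) | data, with the CRC computed over everything after the crc field; treat those layout details as assumptions.

```python
import struct
import sys
import zlib

MAGIC = b'BORG_SEG'                 # assumed 8-byte segment magic
HEADER = struct.Struct('<IIB')      # assumed entry header: crc32, size, tag

def check_segment(path):
    with open(path, 'rb') as fd:
        data = fd.read()
    if not data.startswith(MAGIC):
        return '%s: bad segment magic' % path
    offset = len(MAGIC)
    while offset + HEADER.size <= len(data):
        crc, size, tag = HEADER.unpack_from(data, offset)
        if size < HEADER.size or offset + size > len(data):
            return '%s: implausible entry size %d at offset %d' % (path, size, offset)
        if zlib.crc32(data[offset + 4:offset + size]) & 0xffffffff != crc:
            return '%s: checksum mismatch at offset %d' % (path, offset)
        offset += size
    return '%s: all entry checksums ok' % path

if __name__ == '__main__':
    for path in sys.argv[1:]:
        print(check_segment(path))
```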
Not sure what you mean by "funny sizes"?
Maybe also talk to the hoster and ask why you see lots of zeros where synced-to-disk data should be. (Before that, maybe look into some of the bigger …
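One quick way to look into a segment file along those lines is to measure how much of it is NUL bytes. This is a standalone sketch, not part of borg.

```python
import sys

def zero_ratio(path, blocksize=1 << 20):
    """Return the fraction of bytes in the file that are zero."""
    total = zeros = 0
    with open(path, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            total += len(block)
            zeros += block.count(0)
    return zeros / total if total else 0.0

if __name__ == '__main__':
    for path in sys.argv[1:]:
        print('%s: %.1f%% zero bytes' % (path, 100 * zero_ratio(path)))
```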
Ok; thanks. Will report results. |
passes:
However, borg -p -v check --archives-only fails similarly when run either locally or client-side, probably due to another corrupted file:
I'll try delete -n on that first archive and see what it says. I'm OK with delete --force-ing stuff if that might get me back to a clean state.
borg -p -v delete -n ::wall-2018-01-31T01:45:52
I did borg delete --cache-only everywhere, and ~/.cache/borg doesn't contain a directory that matches the id hex string (UUID?) for the current repo. There are some other/older caches though.
It might have been finding my .cache because I was running under sudo. After deleting that, I made some progress:
borg -p -v delete -n ::wall-2018-01-31T01:45:52
Synchronizing chunks cache...
Platform: Linux wall 4.15.0-42-generic #45-Ubuntu SMP Thu Nov 15 19:32:57 UTC 2018 x86_64
And still:
borg -p -v check --archives-only
Analyzing archive wall-2018-01-31T01:45:52 (1/31)
File "/usr/lib/python3/dist-packages/borg/remote.py", line 248, in serve
File "/usr/lib/python3/dist-packages/borg/repository.py", line 312, in get_free_nonce
File "/usr/lib/python3.7/codecs.py", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 1: invalid start byte
Check if the contents of the …
borg remembers some security-related stuff locally in ~/.config/borg/security. Also, there seems to be some object in the repo really missing (…
nonce looks plausible: 73 bytes non-0. Ok, you hadn't mentioned --archives-only with --repair previously. Will try that. All the files in .config/borg/security seem to have been generated today.
Does it make sense to delete --force the broken archives if check --repair --archives-only doesn't work? |
borg -p -v check --repair --archives-only has the same crash:
Analyzing archive wall-2018-01-31T01:45:52 (1/31)
Shouldn't the nonce be ok since I can borg mount and access the earlier archives? |
This is where it reads / decodes the repo's nonce file. The contents of that should be pure ascii (a long hex number).
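A simplified sketch of roughly what that failing code path does (compare the get_free_nonce frame in the traceback above; this is not the verbatim borg code): the nonce file is read as text and unhexlified, so a non-ASCII byte in it makes the decode blow up.

```python
from binascii import unhexlify

def read_nonce(nonce_path):
    # Text-mode read: a non-ASCII byte here raises UnicodeDecodeError,
    # matching the codecs.py frame in the traceback above.
    with open(nonce_path, 'r') as fd:
        hex_nonce = fd.read().strip()
    # unhexlify() expects an even number of ASCII hex digits.
    return int.from_bytes(unhexlify(hex_nonce), byteorder='big')
```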
Btw, if you post tracebacks or stuff you entered as a command, use github markdown to mark it as "code" (triple backticks on a line above and below it). |
Sorry, it's 16 bytes; was looking at the wrong file. It's not ascii hex, just 16 non-0 bytes. |
Write … into the nonce file. That will forward it to a likely safe value; you'll lose about half the nonce-space, but usually that is no issue.
Is 0x80 the first byte (byte 0) of the file there, or is 0x01 the first byte? |
Oic, maybe. You are calling unhexlify(), so the file is supposed to be ascii hex. |
Your string …
Oops. Yeah, it should be 16. Also, I got it slightly wrong, it should be …
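One plausible reconstruction of the suggested write, assuming the forwarded value is 0x8000000000000000 written as 16 ASCII hex characters (the exact value is cut off in the comments above, so treat both the value and the path here as assumptions):

```python
from binascii import hexlify

# High bit set: the upper half of the 64-bit nonce space, so the new counter
# is almost certainly past whatever the old (now unreadable) one was.
forwarded = 0x8000000000000000
hex_nonce = hexlify(forwarded.to_bytes(8, 'big')).decode('ascii')
assert len(hex_nonce) == 16            # "should be 16" hex characters

NONCE_PATH = '/backup/borg-repo/nonce'  # hypothetical path to the repo's nonce file
with open(NONCE_PATH, 'w') as fd:
    fd.write(hex_nonce)
```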
Do the zero chunks get replaced properly in the new archive if the file mtime is old? I'm not concerned with older archives that now have bad chunks, but I'd like to ensure that the latest backup is good. |
When I borg mount and try to cat files that had zero-replacement chunks, I get an i/o error. |
ls shows the correct file size, but borg list shows size 0:
borg mount --help |
Oic, that is actually helpful behavior. I deleted two of the compromised archives to make sure pruning would work. Thanks for helping me recover!
keep "data" as is, use "d" for slices so that the data.release() call is on the original memoryview and also we can delete the last reference to a slice of it first.
correctly release memoryview, see #4243
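A minimal illustration of the pattern described in that commit message (not the verbatim borg code): keep the original memoryview bound to `data`, do all work on slices bound to `d`, drop the slice reference, and only then call release() on the original view.

```python
import zlib

def crc_entries(buf, entry_size):
    data = memoryview(buf)     # the original view: the only thing we release
    crcs = []
    try:
        for offset in range(0, len(data) - entry_size + 1, entry_size):
            d = data[offset:offset + entry_size]   # slices always bound to `d`
            crcs.append(zlib.crc32(d) & 0xffffffff)
            del d              # drop the slice reference before moving on
    finally:
        data.release()         # release() runs on the original memoryview
    return crcs
```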
Have you checked borgbackup docs, FAQ, and open GitHub issues?
Yes
Is this a BUG / ISSUE report or a QUESTION?
bug
System information. For client/server mode post info for both machines.
Your borg version (borg -V).
1.1.8
Operating system (distribution) and version.
Linux vps 4.18.0-2-amd64 #1 SMP Debian 4.18.10-2 (2018-10-07) x86_64 GNU/Linux
Hardware / network configuration, and filesystems used.
ext4/lvm2
How much data is handled by borg?
/dev/mapper/vps-backup 468G 395G 69G 86% /backup
Full borg commandline that led to the problem (leave out excludes and passwords)
borg -p -v check --repair
Describe the problem you're observing.
check --repair fails to make progress after a certain corrupted segment. On kill/restart it hangs similarly on the next bad segment.
Details:
The borg repo was corrupted by a power failure on the storage cluster where this VM has its vdisks. The hardware/VM are ok now and all filesystems pass a forced fsck. All borg operations are usually done in server mode. I see the same hang when running check --repair locally, though.
The initial error was:
borg check showed dozens of 'Remote: Data integrity error: Invalid segment magic ...' or 'Remote: Data integrity error: Segment entry checksum mismatch' and ran to completion.
borg check --repair worked its way through most of these, then hung (for 10 days) on 'Remote: Data integrity error: Segment entry checksum mismatch [segment 4403, offset 8319395]'
The borg process was pinning the CPU. I attached to it with strace; it wasn't making any syscalls. ltrace showed that it was memsetting two different addresses to 0, with no other intervening library calls.
I killed/restarted borg check --repair (with -p -v this time). Nothing about segment 4403 this time; instead, it hung for hours at:
Another kill/restart gets us to the same point at segment 4405.
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
See above.