
Problem with verification or repair when there are a lot of duplicate or null blocks and file/s are split into a lot of slices #36

Closed
NilEinne opened this issue Jun 14, 2021 · 7 comments


I have encountered a weird bug in MultiPAR with verification or repair that happens when your file/s have a lot of duplicate or null blocks and you create PAR2 files with a lot of blocks.

For a simple, non-real-world test, create a file filled with null bytes, e.g.:

`fsutil file createnew 1000000000 1000000000`

(will likely require administrator privileges)
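If fsutil isn't convenient, or you want a fill byte other than 00 (like the 10h variant mentioned below), a short C helper along these lines will generate the same kind of test file. This is only an illustrative sketch, not part of MultiPar:

```c
/* make_fill.c - create a test file of a given size filled with a single byte value.
 * Illustrative helper only; it is not part of MultiPar.
 * Example: make_fill 1000000000 1000000000 00   (a file named "1000000000", all nulls)
 *          make_fill testfile 1000000000 10     (same size, filled with 10h)
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <path> <size in bytes> <hex fill byte>\n", argv[0]);
        return 1;
    }
    unsigned long long size = strtoull(argv[2], NULL, 10);
    int fill = (int)strtoul(argv[3], NULL, 16);

    FILE *fp = fopen(argv[1], "wb");
    if (fp == NULL) { perror("fopen"); return 1; }

    static unsigned char buf[1 << 20];      /* 1 MiB write buffer */
    memset(buf, fill, sizeof(buf));

    while (size > 0) {
        size_t chunk = size < sizeof(buf) ? (size_t)size : sizeof(buf);
        if (fwrite(buf, 1, chunk, fp) != chunk) { perror("fwrite"); fclose(fp); return 1; }
        size -= chunk;
    }
    fclose(fp);
    return 0;
}
```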

Then create a PAR2 file slicing the "data" into the maximum number of blocks (32766). Next create a second PAR2 file slicing the data file into 5000 blocks. You can verify both, and they'll both verify fine.

Then corrupt the data file slightly, i.e. change a few bytes somewhere to something besides 00. It can be right at the beginning/first block, although that may make it less clear what is going on. Possibly not the last block. I just use a hex editor. Try to verify and you will find that with the 32766 block PAR, the verification breaks and won't continue to verify after it finds the corrupt block. But with the 5000 block PAR, it will verify and say it can be repaired/rejoined, as expected since all blocks are the same.
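A hex editor is fine for this step; for scripted tests, an illustrative C helper like the following does the same thing (overwrites one byte at a chosen offset):

```c
/* corrupt_byte.c - overwrite one byte at a given offset to simulate damage.
 * Illustrative helper only; the report simply used a hex editor for this step.
 * Example: corrupt_byte 1000000000 123456 ff
 * Note: fseek takes a long offset, so for files larger than 2 GB use a 64-bit
 * seek function (e.g. _fseeki64 on Windows).
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <path> <offset> <hex byte>\n", argv[0]);
        return 1;
    }
    long offset = strtol(argv[2], NULL, 10);
    int value = (int)strtoul(argv[3], NULL, 16);

    FILE *fp = fopen(argv[1], "r+b");
    if (fp == NULL) { perror("fopen"); return 1; }
    if (fseek(fp, offset, SEEK_SET) != 0) { perror("fseek"); fclose(fp); return 1; }
    if (fputc(value, fp) == EOF) { perror("fputc"); fclose(fp); return 1; }
    fclose(fp);
    return 0;
}
```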

Since MultiPAR does recognise null blocks, you may also want to create a file filled with some other byte; I used 10h. You will find the same behaviour. (I used the 00 example because it's fairly simple to create.)

If you try to repair, you will have the same problem with 32766 blocks. An exception is if you use the file filled with null blocks: when you repair a second time it will succeed, because the first repair creates a temporary "blank" file, so now it just has to rename that. This can't happen if the file is filled with some other byte, so repair is impossible.

If you use a larger file, e.g. 11010101010 bytes (over 10x the previous size), you will find the same problem. (I'm sure smaller files do as well.) I also tried adding non-duplicate blocks to the file. You get the same problem even if the corruption is in the non-duplicate data (and you therefore need recovery blocks).

Additional details:

  • Most of my testing was with 1.3.1.8 but I also tried 1.3.0.7. Version 1.3.0.7 (really 1.3.0.6 since it's par2j/64 that's the problem) is worse than 1.3.1.8. par2j is better than par2j64 but both can have the problem. See my reply for details.

  • Most of my testing was with an AMD A10-5800K with 32GiB of RAM, but I also tried on an Intel i5-3470 with 24GiB RAM. Both computers were running Windows 10 x86-64.

  • I tried fiddling around with memory settings (e.g. 1/8 or 7/8), limiting the thread count to a single thread, disabling or enabling the GPU, and finally changing verification levels. These made little or no difference to the problem. (I didn't try the SSD setting, in part because the problem occurs on 1.3.0.6 and in part because all my testing was on an HDD.)

  • The reason I found the bug weird is that there is a middle point where verification will sometimes work and sometimes fail. To be clear, I mean with the exact same files. Just repeat verification 10 times and you should find that sometimes it stops after finding the broken block, and sometimes it keeps going and confirms recovery is possible. (Or recovers, if you run it in repair mode.) This happens even when I use -lc513 to limit par2j64 to one thread, unless that doesn't completely eliminate multi-threading? See this file for a sample output: Sample of output for GitHub.txt
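Because the failure is intermittent even with identical inputs, rerunning the same verification many times and tallying the results is the quickest way to see the effect. A minimal C sketch of such a rerun loop (illustrative only: it assumes par2j64.exe is reachable on PATH, "test.par2" is a placeholder recovery file name, and no meaning is assumed for particular exit code values, they are simply counted):

```c
/* rerun_verify.c - run the same par2j64 verification repeatedly and count exit codes.
 * Illustrative sketch only. It assumes par2j64.exe is on PATH (otherwise give the
 * full path), "test.par2" is a placeholder recovery file name, and no meaning is
 * assumed for specific exit code values - they are simply tallied.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *cmd = "par2j64.exe v -uo -m7 -vl2 -lc544 test.par2";
    int counts[256] = {0};

    for (int i = 0; i < 20; i++) {          /* 10-20 repeats, as in the report */
        int status = system(cmd);
        if (status < 0 || status > 255)     /* lump unusual values together */
            status = 255;
        counts[status]++;
    }
    for (int code = 0; code < 256; code++)
        if (counts[code] > 0)
            printf("exit code %d: seen %d time(s)\n", code, counts[code]);
    return 0;
}
```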

@NilEinne (Author) commented Jun 14, 2021

I think the above may be enough to diagnose the issue, but if not, some more findings/notes:

  • Although the above example isn't something you're ever likely to do in the real world, I encountered the problem myself in a probably rare but not unheard-of use case. I had an uncompressed whole-disk image. The disk wasn't very full, so there was a lot of blank space. (Actually most of it was FF.) While it would normally be better to compress the image and then create PAR2 files for the compressed file, I wanted to create PAR2 files for the uncompressed disk image for reasons I won't go into, and encountered the perplexing behaviour where my PAR2 files didn't seem to work.

  • I originally found the problem while using the GUI, but later moved on to testing with the command line only. Most of my tests used this command line:
    `"C:\Program Files (x86)\MultiPar\par2j64.exe" v -uo -m7 -vl2 -lc544`

  • I needed to exclude AVX2 on my computers because neither CPU supports it.

  • The number of recovery blocks or recovery files available makes no difference, AFAICT.

About the middle point:

  • It seems to be roughly the same no matter the file size, provided nearly all blocks are duplicates. Null bytes or 10h didn't seem to make any difference. I say roughly because it's not a single point; however, even a 100-block difference is enough that verification will always or nearly always succeed or fail. (I only generally tested 10-20 times and not automated, so I don't know if it failed 1/100 times or something.) AFAICT, the failure only happens once, when MultiPAR finds there is corruption and switches to seeing if it can recover. Even if you have multiple corrupt blocks in different areas, it will work fine for all of them if it works. If it fails, it fails when it first detects corruption.

  • For a file filled with duplicate blocks (including null blocks):

  1. The middle point is roughly 15957 blocks if you use par2j64.exe 1.3.1.8.
  2. For par2j 1.3.1.8 the middle point is roughly 23311 blocks. This is why I said par2j is better, as you're less likely to have the problem. (Anything from 8 or whatever up to approaching 23311 will be okay.)
  3. For par2j64 1.3.0.6 the middle point is somewhere between 12000 and 13000 blocks. Hence 1.3.0.6 is worse.
  4. For par2j 1.3.0.6 the middle point is somewhere between 21000 and 22000 blocks.
  • If you have multiple files with the exact same blocks, the middle point seems to be the same, just as with different file sizes. (I combined the 1000000000 and 11010101010 files.)

  • If you have fewer duplicate blocks, the middle point changes but may still be there. For example, I added 483,878,688 bytes to roughly the middle of the 1000000000 byte file, so nearly half the file had non-duplicate data. The mid point with par2j64 1.3.1.8 is roughly 30910. The mid point for the 32-bit (non-64) 1.3.1.8 doesn't exist, i.e. you can go up to 32766 blocks without a problem.
    If you have multiple blocks of different duplicative data it's the same thing. I didn't try a single file, but I tried two 1000000000 files. One was filled with 00h, another with 10h. When I combined them into a single parity file, the mid point for 1.3.1.8 seems to be between 31002 and 31998 blocks.

@JohnLGalt

> * I tried fiddling around with memory settings (e.g. 1/8 or 7/8), limiting the thread count to a single thread, disabling or enabling the GPU, and finally changing verification levels. These made little or no difference to the problem. (I didn't try the SSD setting, in part because the problem occurs on 1.3.0.6 and in part because all my testing was on an HDD.)
>
> * The reason I found the bug weird is that there is a middle point where verification will sometimes work and sometimes fail. To be clear, I mean with the exact same files. Just repeat verification 10 times and you should find that sometimes it stops after finding the broken block, and sometimes it keeps going and confirms recovery is possible. (Or recovers, if you run it in repair mode.) This happens even when I use -lc513 to limit par2j64 to one thread, unless that doesn't completely eliminate multi-threading? See this file for a sample output: [Sample of output for GitHub.txt](https://github.com/Yutaka-Sawada/MultiPar/files/6644920/Sample.of.output.for.GitHub.txt)

Got a question - when using the GUI, did you try all of these together in a single session?

> I think the above may be enough to diagnose the issue, but if not, some more findings/notes:
>
> * Although the above example isn't something you're ever likely to do in the real world, I encountered the problem myself in a probably rare but not unheard-of use case. I had an uncompressed whole-disk image. The disk wasn't very full, so there was a lot of blank space. (Actually most of it was FF.) While it would normally be better to compress the image and then create PAR2 files for the compressed file, I wanted to create PAR2 files for the uncompressed disk image for reasons I won't go into, and encountered the perplexing behaviour where my PAR2 files didn't seem to work.
>
> * I originally found the problem while using the GUI, but later moved on to testing with the command line only. Most of my tests used this command line:
>   `"C:\Program Files (x86)\MultiPar\par2j64.exe" v -uo -m7 -vl2 -lc544`
>
> * I needed to exclude AVX2 on my computers because neither CPU supports it.

Does the CL version fail if you don't explicitly disable AVX2?

> * The number of recovery blocks or recovery files available makes no difference, AFAICT.
>
> About the middle point:
>
> * It seems to be roughly the same no matter the file size, provided nearly all blocks are duplicates. Null bytes or 10h didn't seem to make any difference. I say roughly because it's not a single point; however, even a 100-block difference is enough that verification will always or nearly always succeed or fail. (I only generally tested 10-20 times and not automated, so I don't know if it failed 1/100 times or something.) AFAICT, the failure only happens once, when MultiPAR finds there is corruption and switches to seeing if it can recover. Even if you have multiple corrupt blocks in different areas, it will work fine for all of them if it works. If it fails, it fails when it first detects corruption.
>
> * For a file filled with duplicate blocks (including null blocks):
>
>   1. The middle point is roughly 15957 blocks if you use par2j64.exe 1.3.1.8.
>   2. For par2j 1.3.1.8 the middle point is roughly 23311 blocks. This is why I said par2j is better, as you're less likely to have the problem. (Anything from 8 or whatever up to approaching 23311 will be okay.)
>   3. For par2j64 1.3.0.6 the middle point is somewhere between 12000 and 13000 blocks. Hence 1.3.0.6 is worse.
>   4. For par2j 1.3.0.6 the middle point is somewhere between 21000 and 22000 blocks.
>
> * If you have multiple files with the exact same blocks, the middle point seems to be the same, just as with different file sizes. (I combined the 1000000000 and 11010101010 files.)
>
> * If you have fewer duplicate blocks, the middle point changes but may still be there. For example, I added 483,878,688 bytes to roughly the middle of the 1000000000 byte file, so nearly half the file had non-duplicate data. The mid point with par2j64 1.3.1.8 is roughly 30910. The mid point for the 32-bit (non-64) 1.3.1.8 doesn't exist, i.e. you can go up to 32766 blocks without a problem.
>   If you have multiple blocks of different duplicative data it's the same thing. I didn't try a single file, but I tried two 1000000000 files. One was filled with 00h, another with 10h. When I combined them into a single parity file, the mid point for 1.3.1.8 seems to be between 31002 and 31998 blocks.

Yikes. Since you're using roughly 1 GB in size, I can try this on all types of drives: Rusty, SATA SSD and NVMe SSD.

Does the approx 1 GB size matter? What if I make it exactly 1 GB (1073741824 bytes)? Of course, my CPU does support AVX2, but I could still use your exact CL parameters when testing the CL version of 1.3.1.8/7/6/5 (those are the 4 I have downloaded already, so I can easily test).

@Yutaka-Sawada (Owner)

> I have encountered a weird bug in MultiPAR with verification or repair that happens when your file/s have a lot of duplicate or null blocks and you create PAR2 files with a lot of blocks.

Thanks, NilEinne, for the bug report. I made a dummy file with null bytes and tested it on my PC. I found a bug and fixed it. I made a sample version, which can verify such a case. When there are too many identical CRCs, my implemented quick sort function failed. So I replaced the function with qsort (from the C runtime library), and it seems to work now.
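For reference, this is roughly what such a switch looks like: sorting an array of CRC32 values with the C runtime's qsort instead of a hand-rolled quick sort. The type and function names below are illustrative, not par2j's actual ones:

```c
/* Illustrative sketch only - the names are not par2j's.
 * qsort copes with arrays where most or all keys are identical, which is
 * exactly the "many duplicate blocks -> many identical CRCs" case here.
 */
#include <stdint.h>
#include <stdlib.h>

static int compare_crc(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    /* Return -1/0/1 rather than x - y, which would wrap for unsigned values. */
    return (x > y) - (x < y);
}

void sort_crc_list(uint32_t *crc_list, size_t count)
{
    qsort(crc_list, count, sizeof(uint32_t), compare_crc);
}
```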

I put the sample (par2j_sample_2021-06-15.zip) in the "MultiPar_sample" folder on OneDrive. Please test it with the files that gave wrong results.

@NilEinne (Author)

> Got a question - when using the GUI, did you try all of these together in a single session?

No, I generally just reopened the PAR2, especially when changing settings, as I didn't know if they would take effect without a restart. I didn't really test much with the GUI once I realised what was going on with the CLI, since it didn't seem that useful. Especially given the way the CLI would sometimes fail and sometimes work with everything exactly the same, just by rerunning the command.

> Does the CL version fail if you don't explicitly disable AVX2?

I hadn't tested that until now, but it doesn't fail beyond the same behaviour reported above. The CLI is smart enough to recognise the lack of AVX2, as it should, and doesn't try to use it. This is on my A10-5800K; I didn't try the i5-3470. I actually expected this; I just never changed it because I originally started by copying the command line from the GUI log, and then when I looked into what the settings did I just kept the 512 one.

> Yikes. Since you're using roughly 1 GB in size, I can try this on all types of drives: Rusty, SATA SSD and NVMe SSD.
>
> Does the approx 1 GB size matter? What if I make it exactly 1 GB (1073741824 bytes)? Of course, my CPU does support AVX2, but I could still use your exact CL parameters when testing the CL version of 1.3.1.8/7/6/5 (those are the 4 I have downloaded already, so I can easily test).

1073741824 bytes should be fine. Sorry I didn't explain this so well, but I'm fairly sure the size doesn't matter. I did test 11010101010 bytes, and I have now also tested 300000000. I'm sure you can go smaller if necessary.
When I first encountered this it was with a disk image that was 64GB (not all duplicate), so it definitely happens with large files as well.

The only thing that matters is that nearly all blocks need to be null or duplicates of each other. You can do this with a big chunk of the file not duplicate, but it changes the midway point. (Except eventually the midway point goes beyond 32768 blocks and you won't have the problem.) The midway point for all blocks (except maybe the last) being duplicate seems to be slicing a file of any size into about 15957 blocks with par2j64 1.3.1.8.

The midway point is the most perplexing part, at least to me, given the way you can just re-run the command and sometimes it fails to verify and sometimes it succeeds. Yet it doesn't seem to depend on how many threads or how much memory PAR2 can use, which to my non-programmer eyes seem the obvious areas that would cause such inconsistent behaviour between runs.

BTW, when I tested 300000000 I confirmed the same thing for a null-byte file and a file with AD repeating. Finally, I did a new test where I generated a file with a 16-byte repeating pattern and confirmed the same behaviour when the block size was 16-byte aligned. I used this PowerShell script to create it if anyone wants to do the same, but it's probably not necessary as, for this part, it's no different from the others.
PowerShellFileCreator.ps1.txt

@NilEinne (Author)

> Thanks, NilEinne, for the bug report. I made a dummy file with null bytes and tested it on my PC. I found a bug and fixed it. I made a sample version, which can verify such a case. When there are too many identical CRCs, my implemented quick sort function failed. So I replaced the function with qsort (from the C runtime library), and it seems to work now.
>
> I put the sample (par2j_sample_2021-06-15.zip) in the "MultiPar_sample" folder on OneDrive. Please test it with the files that gave wrong results.

Thanks! I tried the sample version with a few different files I used for testing the problem, including the original disk image where I first encountered it. It seems to be fixed in both the 64-bit and 32-bit versions wherever it used to occur.

@JohnLGalt commented Jun 15, 2021

Ahh, sorting. The bane of any data. Lol.

Glad you got it fixed, @Yutaka-Sawada. Should we just download the Sample and replace the existing par2j executable files in the existing 1.3.1.8 install?

As for the file size, @NilEinne - I have 3x 1 TB NVMe SSDs, 2x 1 TB Rusty drives (currently in RAID 1) and a 960 GB SATA III SSD.

Space is not an issue; I was just curious if having a file at exactly 1 GB made a difference versus 'around' 1 GB. Obviously, it did not.

Glad to know this is all fixed.

@Yutaka-Sawada (Owner)

> Should we just download the Sample and replace the existing par2j executable files in the existing 1.3.1.8 install?

Yes, you should, if you have a recovery set with over 20000 identical blocks. Because I hadn't tested so many blocks of uniform file data, I could not find the bug. As I changed only a sorting function, the sample is fine for daily usage. Anyway, the next version will include the fix.
