TarReader throws on archive that other tools accept #93763
Comments
The magic number in this data is "ustar  " (with two trailing spaces), which is apparently OLDGNU_MAGIC. 7z reports the size field (12 bytes at offset 124) as 8000000000000011E7310C9D, and parsing fails on the first byte. The code expects octal digits encoded as ASCII characters (i.e., '0' = 0x30 through '7' = 0x37), which this is not. Instead, the high bit of the first byte is set and the remaining bytes are base 256: 0x11 0xE7 0x31 0x0C 0x9D is 76,893,195,421, which is what 7z reports. As to why the high bit is set, I cannot find a mention in the GNU tar spec, but Wikipedia says "2001 star introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field.[citation needed] GNU-tar and BSD-tar followed this idea". I think the tar reader just can't handle the format used for archives larger than 8GB? This isn't my area, deferring to @carlossanlop |
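To make the encoding described above concrete, here is a minimal sketch of a parser that accepts both forms of a numeric header field. `TarNumericSketch` and `ParseNumericField` are hypothetical names used only for illustration; they are not part of System.Formats.Tar, and the sketch assumes non-negative values only.

```csharp
using System;

// Hypothetical helper for illustration; not the System.Formats.Tar implementation.
static class TarNumericSketch
{
    public static long ParseNumericField(ReadOnlySpan<byte> field)
    {
        if (field.IsEmpty)
            throw new ArgumentException("Field is empty.", nameof(field));

        if ((field[0] & 0x80) != 0)
        {
            // Base-256 form: clear the marker bit, then read the bytes big-endian.
            // Assumption: negative values are out of scope for this sketch.
            ulong value = (ulong)(field[0] & 0x7F);
            for (int i = 1; i < field.Length; i++)
            {
                // Refuse values that would not fit in a signed 64-bit integer.
                if (value > (ulong)(long.MaxValue >> 8))
                    throw new OverflowException("Value does not fit in a long.");
                value = (value << 8) | field[i];
            }
            return (long)value;
        }

        // Classic form: ASCII octal digits, optionally padded with spaces or NULs.
        int start = 0, end = field.Length;
        while (start < end && field[start] == (byte)' ') start++;
        while (end > start && (field[end - 1] == (byte)' ' || field[end - 1] == 0)) end--;

        long result = 0;
        for (int i = start; i < end; i++)
        {
            int digit = field[i] - (byte)'0';
            if (digit < 0 || digit > 7)
                throw new FormatException("Invalid octal digit.");
            result = result * 8 + digit;
        }
        return result;
    }
}
```

Passing the 12 bytes `8000000000000011E7310C9D` from this issue yields 76893195421, and an ordinary field such as `00000001750` followed by a NUL yields 1000.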
Tagging subscribers to this area: @dotnet/area-system-formats-tar
|
Ah here's another doc mentioning the high bit. https://manpages.debian.org/testing/libarchive-dev/tar.5.en.html
|
Ah it seems we support the PAX extended attributes including "size". If present (it's not here) then we overwrite the size from the header. So I guess we support > 8GB so long as it's PAX with the "size=.." extended attribute AND the size in the header is (any) valid octal chars. |
A minimal fix here may be to just recognize this form of size encoding, but it seems that any(?) numeric field may use this encoding so perhaps that should be recognized using the high bit. |
@carlossanlop do you have a preferred fix? perhaps we can make this help wanted. just recognize this encoding for the header's size field only? |
Hi - In the main TAR spec that we used to implement the .NET version, there's this section:
I originally interpreted the use of the word "attempts" as "unsuccessful additions to the spec". But I see that other tools implemented it anyway.
Yes, @danmoseley, I think we could take the change for the size field for now, as it is the only field that has been reported to cause issues. |
@carlossanlop Do you have a preference? |
Oops disregard, I missed your reply somehow |
@asos-martinsmith any interest in offering a PR as described? |
@danmoseley
long size = (long)TarHelpers.ParseOctal<ulong>(buffer.Slice(FieldLocations.Size, FieldLengths.Size));
Debug.Assert(size <= TarHelpers.MaxSizeLength, "size exceeded the max value possible with 11 octal digits. Actual size " + size);
I would introduce one more method like
Does this sound like a correct solution?
Debug.Assert(size <= TarHelpers.MaxSizeLength, "size exceeded the max value possible with 11 octal digits. Actual size " + size); The
So, if we accept larger sizes, this assert would not be valid. Should |
I'm a data engineer rather than a .NET developer so will have to pass on that. In the meantime I have switched to using SharpCompress to get over this. Thanks |
@ilabutin apologies I missed your message. I don't own this code. @carlossanlop can give pointers so you can get started. |
@danmoseley No worries, I'll be waiting for @carlossanlop to react. |
My tar also has 0x80 in its size header, so it seems that checking for it and then just treating the rest of the field as a big-endian long is fine. I have written a simple tar listing function myself here: #96209. If my understanding is correct, then MaxSizeLength should be long.MaxValue (or ulong). |
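The "check for 0x80, then read a big-endian long" idea can be sketched as follows. This is only an illustration under stated assumptions: `TarSizeSketch` and `TryReadBase256Size` are hypothetical names, the field is assumed to be the 12-byte size field, and only values that fit in a non-negative `long` are accepted.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration; not taken from #96209 or from the runtime.
static class TarSizeSketch
{
    // Returns true and sets 'size' when the 12-byte field uses the base-256 form
    // and the value fits in a non-negative long; returns false otherwise so the
    // caller can fall back to octal parsing.
    public static bool TryReadBase256Size(ReadOnlySpan<byte> sizeField, out long size)
    {
        size = 0;

        // The marker byte must be exactly 0x80; any extra bits here, or in the
        // next three bytes, would mean a value wider than 64 bits.
        if (sizeField.Length != 12 || sizeField[0] != 0x80)
            return false;
        if (sizeField[1] != 0 || sizeField[2] != 0 || sizeField[3] != 0)
            return false;

        // The last 8 bytes are the value in big-endian order.
        long value = BinaryPrimitives.ReadInt64BigEndian(sizeField.Slice(4));
        if (value < 0)
            return false;

        size = value;
        return true;
    }
}
```

With this approach the largest readable size is `long.MaxValue`, which is why the existing 11-octal-digit limit would have to be raised.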
I ran into this issue while trying to extract a tar file that has a gid of 1001080000 which is stored as The fix should be applied to all numeric fields and timestamps. https://www.gnu.org/software/tar/manual/html_section/Extensions.html has this description:
@ivanjx are you still looking to make a PR with a fix?
Something like that. You could rename
Throwing seems appropriate. The exception type can be decided on the PR.
Yes, it should be increased. Probably, we should also do some work on the writing side. |
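For the writing side mentioned above, one possible shape is sketched below: keep the octal form whenever the value fits in 11 octal digits and switch to base-256 only when it does not. `TarSizeWriterSketch`, `WriteSizeField`, and `MaxOctalSize` are hypothetical names, and the choice of a NUL terminator after the octal digits is an assumption, not something this thread specifies.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration; not the System.Formats.Tar writer.
static class TarSizeWriterSketch
{
    // Largest value that fits in 11 octal digits: 8^11 - 1 = 2^33 - 1.
    public const long MaxOctalSize = (1L << 33) - 1;

    public static void WriteSizeField(Span<byte> field, long size)
    {
        if (field.Length != 12)
            throw new ArgumentException("The size field is 12 bytes long.", nameof(field));
        if (size < 0)
            throw new ArgumentOutOfRangeException(nameof(size));

        field.Clear();

        if (size <= MaxOctalSize)
        {
            // Classic form: 11 zero-padded ASCII octal digits; byte 12 stays NUL (assumption).
            string octal = Convert.ToString(size, 8).PadLeft(11, '0');
            for (int i = 0; i < 11; i++)
                field[i] = (byte)octal[i];
        }
        else
        {
            // Base-256 form: marker bit in the first byte, value big-endian in the last 8 bytes.
            field[0] = 0x80;
            BinaryPrimitives.WriteInt64BigEndian(field.Slice(4), size);
        }
    }
}
```

Writing 76893195421 with this sketch produces the same 12 bytes reported in this issue, 8000000000000011E7310C9D.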
@ivanjx do you still plan to work on a fix for this? If not, I can pick it up. |
hi @tmds, sorry for the late response. please feel free to make a pr as i am not really used to dotnet code. i just provided some sample code and possible workaround above. |
Description
I have a *.tar.gz file that I am trying to process in C# by passing the GZipStream to TarReader.
This fails at the line
TarHelpers.ParseOctal(buffer.Slice(124, 12))
The
if (num >= 8)
check inside ParseOctal is getting to the ThrowInvalidNumber part as the num being tested is 80.
I have seen a similar issue where non-conformance to the spec is blamed for this type of thing.
But Azure Data Factory can process this *.tar.gz fine and additionally Windows Explorer seems able to understand the contents of the *.tar.gz file - showing it as a compressed archive - clicking through shows the details of the single file contained within.
The complete 512 bytes of the header is
So the problematic part is
8000000000000011E7310C9D
The underlying file size is 76893195421 bytes.
Reproduction Steps
Not practical for me to share the actual file but it consists of the 512 byte header already given.
Followed by 76893195421 bytes of arbitrary data that is the actual file contents
Followed by a new line and 8547 null bytes.
You can use
And then open the archive in 7zip and see that this utility can cope with it as one example
Expected behavior
These types of files have been generated by a system in my company for years.
Every utility used to read them has apparently had no issues with them, except for the new .NET framework classes.
I have no idea whether or not this file format does in fact violate any spec, but I would expect to be able to open it, as other utilities can cope (maybe by setting a mode parameter to ignore such issues as long as they aren't fatal to the extraction).
Actual behavior
System.IO.InvalidDataException: 'Unable to parse number.' exception thrown on initial tarReader.GetNextEntry() call
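The call pattern that hits this exception looks roughly like the sketch below. `TarRepro` and `DescribeFirstEntry` are hypothetical names; the stream is assumed to contain a gzip-compressed tar archive. With an archive whose size field uses the base-256 form, the `GetNextEntry()` call is where the exception described above surfaces.

```csharp
using System;
using System.Formats.Tar;
using System.IO;
using System.IO.Compression;

// Hypothetical helper for illustration only.
static class TarRepro
{
    // Reads the first entry of a .tar.gz stream and describes it.
    public static string DescribeFirstEntry(Stream tarGzStream)
    {
        using var gzip = new GZipStream(tarGzStream, CompressionMode.Decompress, leaveOpen: true);
        using var reader = new TarReader(gzip);

        // For the archive in this issue, this call throws
        // System.IO.InvalidDataException: 'Unable to parse number.'
        var entry = reader.GetNextEntry();

        return entry is null ? "(no entries)" : $"{entry.Name} ({entry.Length} bytes)";
    }
}
```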
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
No response