TarReader throws on archive that other tools accept #93763

asos-martinsmith · 2023-10-20T02:10:55Z

Description

I have a *.tar.gz file that I am trying to process in C# by passing the GZipStream to TarReader.

This fails at the line

TarHelpers.ParseOctal(buffer.Slice(124, 12))

The

if (num >= 8)

check inside ParseOctal is getting to the ThrowInvalidNumber part as the num being tested is 80.

I have seen a similar issue where non conformance to the spec is blamed for this type of thing.

But Azure Data Factory can process this *.tar.gz fine and additionally Windows Explorer seems able to understand the contents of the *.tar.gz file - showing it as a compressed archive - clicking through shows the details of the single file contained within.

The complete 512 bytes of the header is

6E625F726D73655F727061735F696E76656E746F72792E64617400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003030303036363600303135323036310030313532303631008000000000000011E7310C9D3134353134303736373333003031363332350020300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000075737461722020006F7261636C6500000000000000000000000000000000000000000000000000006F696E7374616C6C000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

So the problematic part is

8000000000000011E7310C9D

The underlying file size is 76893195421 bytes.

Reproduction Steps

Not practical for me to share the actual file but it consists of the 512 byte header already given.

Followed by 76893195421 bytes of arbitrary data that is the actual file contents

Followed by a new line and 8547 null bytes.

You can use

var header = Convert.FromHexString(
    "6E625F726D73655F727061735F696E76656E746F72792E64617400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003030303036363600303135323036310030313532303631008000000000000011E7310C9D3134353134303736373333003031363332350020300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000075737461722020006F7261636C6500000000000000000000000000000000000000000000000000006F696E7374616C6C000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000");


System.IO.File.WriteAllBytes(@"C:\somefile.tar", header
    );

And then open the archive in 7zip and see that this utility can cope with it as one example

Expected behavior

These types of file have been generated by a system in my company for years.

Every utility used to read them has apparently had no issues with them except for the new .NET framework classes.

I have no idea whether or not this file format does in fact violate any spec but I would expect to be able to open it as other utilities can cope (maybe by setting a mode parameter to ignore such issues as long as they aren't fatal to the extraction)

Actual behavior

System.IO.InvalidDataException: 'Unable to parse number.' exception thrown on initial tarReader.GetNextEntry() call

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

The text was updated successfully, but these errors were encountered:

danmoseley · 2023-10-20T04:26:04Z

 	System.Formats.Tar.dll!System.Formats.Tar.TarHelpers.ThrowInvalidNumber() Line 199	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarHelpers.ParseOctal<ulong>(System.ReadOnlySpan<byte> buffer) Line 191	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarHeader.TryReadCommonAttributes(System.Span<byte> buffer, System.Formats.Tar.TarEntryFormat initialFormat) Line 390	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarHeader.TryReadAttributes(System.Formats.Tar.TarEntryFormat initialFormat, System.Span<byte> buffer) Line 170	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarHeader.TryGetNextHeader(System.IO.Stream archiveStream, bool copyData, System.Formats.Tar.TarEntryFormat initialFormat, bool processDataBlock) Line 145	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarReader.TryGetNextEntryHeader(bool copyData) Line 212	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarReader.GetNextEntry(bool copyData) Line 92	C#

The magic number in this data is "ustar " (with 2 spaces) which is apparently OLDGNU_MAGIC. 7z reports Characteristics: 0 GNU ASCII bin_psize bin_size

The size (12 bytes at offset 124) is 8000000000000011E7310C9D and it is failing on the first byte.

The code expects this to be octal coded as chars (ie., 0=0x30 through 7=0x37) which it isn't.

Instead the high bit of the first byte is set, then the remaining is base 256. 0x11 0xE7 0x31 0x0C 0x9D is 76 893 195 421 which is what 7z reports.

As to why the high bit is set, I cannot find mention in the GNU tar spec, but Wikipedia says "2001 star introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field.[citation needed] GNU-tar and BSD-tar followed this idea"

I think the tar reader just can't handle the format used for archives larger than 8GB?

This isn't my area, deferring to @carlossanlop

ghost · 2023-10-20T04:27:50Z

Tagging subscribers to this area: @dotnet/area-system-formats-tar
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I have a *.tar.gz file that I am trying to process in C# by passing the GZipStream to TarReader.

This fails at the line

TarHelpers.ParseOctal(buffer.Slice(124, 12))

The

if (num >= 8)

check inside ParseOctal is getting to the ThrowInvalidNumber part as the num being tested is 80.

I have seen a similar issue where non conformance to the spec is blamed for this type of thing.

But Azure Data Factory can process this *.tar.gz fine and additionally Windows Explorer seems able to understand the contents of the *.tar.gz file - showing it as a compressed archive - clicking through shows the details of the single file contained within.

The complete 512 bytes of the header is

6E625F726D73655F727061735F696E76656E746F72792E64617400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003030303036363600303135323036310030313532303631008000000000000011E7310C9D3134353134303736373333003031363332350020300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000075737461722020006F7261636C6500000000000000000000000000000000000000000000000000006F696E7374616C6C000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

So the problematic part is

8000000000000011E7310C9D

The underlying file size is 76893195421 bytes.

Reproduction Steps

Not practical for me to share the actual file but it consists of the 512 byte header already given.

Followed by 76893195421 bytes of arbitrary data that is the actual file contents

Followed by a new line and 8547 null bytes.

You can use

var header = Convert.FromHexString(
    "6E625F726D73655F727061735F696E76656E746F72792E64617400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003030303036363600303135323036310030313532303631008000000000000011E7310C9D3134353134303736373333003031363332350020300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000075737461722020006F7261636C6500000000000000000000000000000000000000000000000000006F696E7374616C6C000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000");


System.IO.File.WriteAllBytes(@"C:\somefile.tar", header
    );

And then open the archive in 7zip and see that this utility can cope with it as one example

Expected behavior

These types of file have been generated by a system in my company for years.

Every utility used to read them has apparently had no issues with them except for the new .NET framework classes.

I have no idea whether or not this file format does in fact violate any spec but I would expect to be able to open it as other utilities can cope (maybe by setting a mode parameter to ignore such issues as long as they aren't fatal to the extraction)

Actual behavior

System.IO.InvalidDataException: 'Unable to parse number.' exception thrown on initial tarReader.GetNextEntry() call

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

Author:	asos-martinsmith
Assignees:	-
Labels:	`untriaged`, `area-System.Formats.Tar`
Milestone:	-

danmoseley · 2023-10-20T04:32:34Z

Ah here's another doc mentioning the high bit.

https://manpages.debian.org/testing/libarchive-dev/tar.5.en.html

Another extension, utilized by GNU tar, star, and other newer tar implementations, permits binary numbers in the standard numeric fields. This is flagged by setting the high bit of the first byte. The remainder of the field is treated as a signed twos-complement value.

danmoseley · 2023-10-20T04:39:17Z

Ah it seems we support the PAX extended attributes including "size". If present (it's not here) then we overwrite the size from the header. So I guess we support > 8GB so long as it's PAX with the "size=.." extended attribute AND the size in the header is (any) valid octal chars.

danmoseley · 2023-10-20T17:08:56Z

A minimal fix here may be to just recognize this form of size encoding, but it seems that any(?) numeric field may use this encoding so perhaps that should be recognized using the high bit.

danmoseley · 2023-10-27T05:46:54Z

@carlossanlop do you have a preferred fix? perhaps we can make this help wanted.

just recognize this encoding for the header's size field only?

carlossanlop · 2023-10-27T15:40:55Z

Hi -

In the main TAR spec that we used to implement the .NET version, there's this section:

Numeric Extensions
   There  have been	several	attempts to extend the range of	sizes or times
   supported by modifying how numbers are stored in	the header.

   One obvious extension to	increase the size of files is to eliminate the
   terminating characters from the various numeric fields.	 For  example,
   the standard only allows	the size field to contain 11 octal digits, re-
   serving the twelfth byte	for a trailing NUL character.  Allowing	12 oc-
   tal digits allows file sizes up to 64 GB.

   Another	extension,  utilized by	GNU tar, star, and other newer tar im-
   plementations, permits binary numbers in	the standard  numeric  fields.
   This is flagged by setting the high bit of the first byte.  The remain-
   der  of	the  field is treated as a signed twos-complement value.  This
   permits 95-bit values for the length and	time fields and	63-bit	values
   for  the	 uid, gid, and device numbers.	In particular, this provides a
   consistent way to handle	negative time values.  GNU tar	supports  this
   extension  for  the  length,  mtime,  ctime,  and  atime	fields.	 Joerg
   Schilling's star	program	and the	libarchive library support this	exten-
   sion for	all numeric fields.  Note that this extension is largely obso-
   leted by	the extended attribute record provided by the pax  interchange
   format.

   Another	early  GNU extension allowed base-64 values rather than	octal.
   This extension was short-lived and is no	longer supported by any	imple-
   mentation.

I originally interpreted the use of the word "attempts" as "unsuccessful additions to the spec". But I see that other tools implemented it anyway.

do you have a preferred fix? perhaps we can make this help wanted.
just recognize this encoding for the header's size field only?

Yes, @danmoseley, I think we could take the change for the size field for now, as it is the only field that we've got a report that it is causing issues.

danmoseley · 2023-10-31T02:19:48Z

@carlossanlop Do you have a preference?

danmoseley · 2023-10-31T02:20:14Z

Oops disregard, I missed your reply somehow

danmoseley · 2023-10-31T02:21:22Z

@asos-martinsmith any interest in offering a PR as described?

ilabutin · 2023-11-24T20:13:08Z

@danmoseley
I would be happy to try to contribute the fix for this issue.
After looking through the code, I have few questions.

I assume the fix should be around those lines in TryReadCommonAttributes:

long size = (long)TarHelpers.ParseOctal<ulong>(buffer.Slice(FieldLocations.Size, FieldLengths.Size));
Debug.Assert(size <= TarHelpers.MaxSizeLength, "size exceeded the max value possible with 11 octal digits. Actual size " + size);

I would introduce one more method like TarHelpers.ParseNumeric<T>(...) that would check the highest bit of the first byte and do binary parsing or fallback to ParseOctal<T>(...).

Does this sound like a correct solution?

The maximum possible size encoded in a binary form can be up to 2^95, which is way more than long size can fit. Would it be correct to only accept size up to 2^63 (to fit long) and throw if larger size was read?
What would be the best way to deal with the following line?

Debug.Assert(size <= TarHelpers.MaxSizeLength, "size exceeded the max value possible with 11 octal digits. Actual size " + size);

The TarHelpers.MaxSizeLength is defined as:

internal const long MaxSizeLength = (1L << 33) - 1; // Max value of 11 octal digits = 2^33 - 1 or 8 Gb.

So, if we accept larger sizes, this assert would not be valid. Should MaxSizeLength field be increased or is there a better solution?

asos-martinsmith · 2023-12-15T14:29:19Z

@asos-martinsmith any interest in offering a PR as described?

I'm a data engineer rather than a .NET developer so will have to pass on that.

In the meantime I have switched to using SharpCompress to get over this. Thanks

danmoseley · 2023-12-15T14:31:27Z

@ilabutin apologies I missed your message. I don't own this code. @carlossanlop can give pointers so you can get started.

ilabutin · 2023-12-15T15:27:08Z

@danmoseley No worries, I'll be waiting for @carlossanlop to react.
@carlossanlop please take a look at my thoughts above (#93763 (comment)) to verify I'm thinking about steps into the right direction.

ivanjx · 2023-12-21T04:16:35Z

my tar also has 0x80 in its size header so it seems that checking it and just treat the header as big endian long is fine. i have written a simple tar listing function myself here #96209. if my understanding is correct then the MaxSizeLength should be long.MaxValue (or ulong).

tmds · 2024-03-25T19:31:10Z

I ran into this issue while trying to extract a tar file that has a gid of 1001080000 which is stored as 80 00 00 00 3b ab 44 c0 in the tar file.

The fix should be applied to all numeric fields and timestamps.

https://www.gnu.org/software/tar/manual/html_section/Extensions.html has this description:

For fields containing numbers or timestamps that are out of range for the basic format, the GNU format uses a base-256 representation instead of an ASCII octal number. If the leading byte is 0xff (255), all the bytes of the field (including the leading byte) are concatenated in big-endian order, with the result being a negative number expressed in two’s complement form. If the leading byte is 0x80 (128), the non-leading bytes of the field are concatenated in big-endian order, with the result being a positive number expressed in binary form. Leading bytes other than 0xff, 0x80 and ASCII octal digits are reserved for future use, as are base-256 representations of values that would be in range for the basic format.

@ivanjx are you still looking to make a PR with a fix?

I would introduce one more method like TarHelpers.ParseNumeric(...) that would check the highest bit of the first byte and do binary parsing or fallback to ParseOctal(...).

Something like that. You could rename ParseOctal to ParseNumeric and add an argument like: octalOnly. Only the call for reading the checksum should set limit to octalOnly. The others should accept the base-256 range.

The maximum possible size encoded in a binary form can be up to 2^95, which is way more than long size can fit. Would it be correct to only accept size up to 2^63 (to fit long) and throw if larger size was read?

Throwing seems appropriate. The exception type can be decided on the PR.

Should MaxSizeLength field be increased

Yes, it should be increased.

Probably, we should also do some work on the writing side.

tmds · 2024-04-16T07:56:09Z

@ivanjx do you still plan to work on a fix for this? If not, I can pick it up.

ivanjx · 2024-04-16T08:07:34Z

hi @tmds, sorry for the late response. please feel free to make a pr as i am not really used to dotnet code. i just provided some sample code and possible workaround above.

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 20, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Oct 20, 2023

danmoseley added area-System.Formats.Tar and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Oct 20, 2023

carlossanlop removed the untriaged New issue has not been triaged by the area owner label Oct 25, 2023

ViktorHofer added this to the 9.0.0 milestone Oct 25, 2023

carlossanlop added the help wanted [up-for-grabs] Good issue for external contributors label Oct 27, 2023

ivanjx mentioned this issue Dec 21, 2023

Support Tar with large entry size #96209

Closed

am11 added the bug label Apr 13, 2024

tmds linked a pull request Apr 17, 2024 that will close this issue

Tar: support GNU numeric format. #101172

Open

dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Apr 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TarReader throws on archive that other tools accept #93763

TarReader throws on archive that other tools accept #93763

asos-martinsmith commented Oct 20, 2023 •

edited

danmoseley commented Oct 20, 2023

ghost commented Oct 20, 2023

Description

Reproduction Steps

Expected behavior

Actual behavior

Regression?

Known Workarounds

Configuration

Other information

danmoseley commented Oct 20, 2023

danmoseley commented Oct 20, 2023

danmoseley commented Oct 20, 2023

danmoseley commented Oct 27, 2023 •

edited

carlossanlop commented Oct 27, 2023

danmoseley commented Oct 31, 2023

danmoseley commented Oct 31, 2023

danmoseley commented Oct 31, 2023

ilabutin commented Nov 24, 2023 •

edited

asos-martinsmith commented Dec 15, 2023

danmoseley commented Dec 15, 2023

ilabutin commented Dec 15, 2023

ivanjx commented Dec 21, 2023 •

edited

tmds commented Mar 25, 2024

tmds commented Apr 16, 2024

ivanjx commented Apr 16, 2024

TarReader throws on archive that other tools accept #93763

TarReader throws on archive that other tools accept #93763

Comments

asos-martinsmith commented Oct 20, 2023 • edited

Description

Reproduction Steps

Expected behavior

Actual behavior

Regression?

Known Workarounds

Configuration

Other information

danmoseley commented Oct 20, 2023

ghost commented Oct 20, 2023

Description

Reproduction Steps

Expected behavior

Actual behavior

Regression?

Known Workarounds

Configuration

Other information

danmoseley commented Oct 20, 2023

danmoseley commented Oct 20, 2023

danmoseley commented Oct 20, 2023

danmoseley commented Oct 27, 2023 • edited

carlossanlop commented Oct 27, 2023

danmoseley commented Oct 31, 2023

danmoseley commented Oct 31, 2023

danmoseley commented Oct 31, 2023

ilabutin commented Nov 24, 2023 • edited

asos-martinsmith commented Dec 15, 2023

danmoseley commented Dec 15, 2023

ilabutin commented Dec 15, 2023

ivanjx commented Dec 21, 2023 • edited

tmds commented Mar 25, 2024

tmds commented Apr 16, 2024

ivanjx commented Apr 16, 2024

asos-martinsmith commented Oct 20, 2023 •

edited

danmoseley commented Oct 27, 2023 •

edited

ilabutin commented Nov 24, 2023 •

edited

ivanjx commented Dec 21, 2023 •

edited