New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Tar: Fix PAX regression when handling the size of really long unseekable data streams #88280

Merged

carlossanlop merged 2 commits into dotnet:main from carlossanlop:TarUnseekableStreamPaxSizeBugFix

Jul 17, 2023

Member

carlossanlop commented Jun 30, 2023

Fixes: #88271
Fixes: #87359

Regression introduced by #84279

That fix refactored the code so that unseekable data streams could be properly handled. Unfortunately, I didn't execute Outerloop locally, and apparently they also didn't get executed in my PR. If that had happened, we would've seen a failure happening in the test named System.Formats.Tar.Tests.TarWriter_WriteEntry_LongFile_Tests.WriteEntry_LongFileSize.

That test creates a >8GB tar file on disk (no other way since we cannot create MemoryStream instances that large) with PAX entries, then tries to load it in a TarReader, expecting the size of the large entry to have the correct size.

In debug runs, the test fails in a Debug.Assert that checks both the existence of the size key in the extended attributes dictionary, and also that it gets converted to octal to the right value.

In release runs, since that Debug.Assert is skipped, the test times out because the reader is unable to advance to the next entry due to the fact that the data stream size is detected as zero.

The root cause is that in PAX, the size standard header field was written too late: We were first collecting all the extended attributes that needed to be obtained from standard fields, then we were writing the extended attributes entry, then we were reading the unseekable data stream to retrieve the size (too late), and then we were writing the actual entry and its data stream.

The code refactoring reuses code for all the entry formats, but for PAX the order of actions should be different.

The fix consists in separate the writing of entries much more than in the original refactoring: one path for seekable data streams, and another path for unseekable data streams.

For V7 and UStar, choosing one path or the other is very similar and very simple.

For GNU, choosing the path only varies in that the long link or long name entry needs to be written first, and then we choose a path like we do with V7 and UStar.

But with PAX, we need to first collect the size of the data stream, then collect the extended attributes (which will include the too-long size value), then write the extended attributes entry, and finally write the actual entry with its data stream.

I ran all tests locally, including the Outerloop ones, and they all pass.

NOTE: The WriteEntry_LongFileSizeAsync is showing up as Long Running Test locally, but it passes (2-4 min to finish). I am unsure how to speed it up, considering that we need to test an >8GB file on disk. If someone knows how to do this in-memory, please let me know. If the test fails in my PR, I'd like to log a separate issue and investigate it separately.

carlossanlop added the area-System.Formats.Tar label

carlossanlop added this to the 8.0.0 milestone

carlossanlop requested review from adamsitnik, ViktorHofer and Jozkee

June 30, 2023 23:25

carlossanlop self-assigned this

ghost commented Jun 30, 2023

Tagging subscribers to this area: @dotnet/area-system-formats-tar
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes: #88271
Fixes: #87359

Regression introduced by #84279

That fix refactored the code so that unseekable data streams could be properly handled. Unfortunately, I didn't execute Outerloop locally, and apparently they also didn't get executed in my PR. If that had happened, we would've seen a failure happening in the test named System.Formats.Tar.Tests.TarWriter_WriteEntry_LongFile_Tests.WriteEntry_LongFileSize.

That test creates a >8GB tar file on disk (no other way since we cannot create MemoryStream instances that large) with PAX entries, then tries to load it in a TarReader, expecting the size of the large entry to have the correct size.

In debug runs, the test fails in a Debug.Assert that checks both the existence of the size key in the extended attributes dictionary, and also that it gets converted to octal to the right value.

In release runs, since that Debug.Assert is skipped, the test times out because the reader is unable to advance to the next entry due to the fact that the data stream size is detected as zero.

The root cause is that in PAX, the size standard header field was written too late: We were first collecting all the extended attributes that needed to be obtained from standard fields, then we were writing the extended attributes entry, then we were reading the unseekable data stream to retrieve the size (too late), and then we were writing the actual entry and its data stream.

The code refactoring reuses code for all the entry formats, but for PAX the order of actions should be different.

The fix consists in separate the writing of entries much more than in the original refactoring: one path for seekable data streams, and another path for unseekable data streams.

For V7 and UStar, choosing one path or the other is very similar and very simple.

For GNU, choosing the path only varies in that the long link or long name entry needs to be written first, and then we choose a path like we do with V7 and UStar.

But with PAX, we need to first collect the size of the data stream, then collect the extended attributes (which will include the too-long size value), then write the extended attributes entry, and finally write the actual entry with its data stream.

I ran all tests locally, including the Outerloop ones, and they all pass.

NOTE: The WriteEntry_LongFileSizeAsync is showing up as Long Running Test locally, but it passes (2-4 min to finish). I am unsure how to speed it up, considering that we need to test an >8GB file on disk. If someone knows how to do this in-memory, please let me know. If the test fails in my PR, I'd like to log a separate issue and investigate it separately.

Author:	carlossanlop
Assignees:	carlossanlop
Labels:	`area-System.Formats.Tar`
Milestone:	8.0.0

Member

ViktorHofer commented Jul 5, 2023

/azp run runtime-libraries-coreclr outerloop

dotnet deleted a comment from azure-pipelines bot

azure-pipelines bot commented Jul 5, 2023

Azure Pipelines successfully started running 1 pipeline(s).

ViktorHofer approved these changes

View reviewed changes

This was referenced Jul 5, 2023

LibraryImportGenerator.Unit.Tests crashed in CI #87951

Closed

Timeout on System.Formats.Tar.Tests.TarWriter_WriteEntry_LongFile_Tests.WriteEntry_LongFileSize #88271

Closed

Member Author

carlossanlop commented Jul 5, 2023

The long running test failure is still happening, but that's expected, I can even repro locally. Do we have a unit test attribute to indicate the test is long running?

I am not sure what other alternatives we have. MemoryStream does not allow handling 8GB. I am unsure if we are allowed to sture a file that large in runtime-assets.

Member

danmoseley commented Jul 5, 2023

no other way since we cannot create MemoryStream instances that large

We can make a custom flavor MemoryStream presumably, it could even fake its content.

adamsitnik reviewed

View reviewed changes

Member

adamsitnik left a comment

If someone knows how to do this in-memory, please let me know.

MemoryStream is backed by array of bytes, which has the size limit.

The alternative is to use a stream backed by native memory: UnmanagedMemoryStream which is exposed by System.Runtime (so it should work fine here).

Or as @danmoseley has suggested, you could fake it by for example implementing a stream that repeats the same data N times when reading from it.

Member

danmoseley commented Jul 7, 2023

this already does something similar, perhaps it can be adapted/copied.
https://github.com/dotnet/runtime/blob/main/src/libraries/System.Formats.Tar/tests/SimulatedDataStream.cs

if it's a general purpose test utility you could choose to put it in src\libraries\Common\tests\System\IO

carlossanlop added 2 commits

July 17, 2023 08:41


          Fix regression introduced by dotnet#84279 preventing PAX entries with…

280b393

… really long data streams to get its size correctly stored in the extended attributes when the data stream is unseekable.


          Move tests for large files to a new manual tests project.

carlossanlop force-pushed the TarUnseekableStreamPaxSizeBugFix branch from 868a661 to 5482008 Compare

July 17, 2023 19:20

carlossanlop mentioned this pull request

Add tar test with custom stream that simulates reading/writing >8GB file entry #89037

Open

Member Author

carlossanlop commented Jul 17, 2023

@ericstj @jeffhandley @ViktorHofer as discussed in our meeting, I moved the long file tests to a manual tests project. I opened #89037 to track adding a similar test that runs on CI and simulates writing/reading 8GB file entries.

carlossanlop mentioned this pull request

LibraryImportGenerator unit test segfault illegal memory access #89054

Closed

carlossanlop merged commit 2268fb3 into dotnet:main

101 of 103 checks passed

carlossanlop deleted the TarUnseekableStreamPaxSizeBugFix branch

July 17, 2023 22:19

carlossanlop mentioned this pull request

System.Formats.Tar.Tests.TarWriter_WriteEntry_LongFile_Tests.WriteEntry_LongFileSize times out on Mono #88777

Closed

akoeplinger added a commit that referenced this pull request


          Fix outerloop tests in System.Formats.Tar

eeb540e

The change from #88280 caused an issue when discovering the test:

```
System.InvalidOperationException : An appropriate member 'ManualTestsEnabled' could not be found. The conditional method needs to be a static method, property, or field on the type System.Formats.Tar.Tests.ManualTestsAsync or any ancestor, of any visibility, accepting zero arguments, and having a return type of Boolean.
```

akoeplinger mentioned this pull request

Fix outerloop tests in System.Formats.Tar #89172

Merged

akoeplinger added a commit that referenced this pull request


          Fix outerloop tests in System.Formats.Tar (#89172)

924668c

The change from #88280 caused an issue when discovering the test:

```
System.InvalidOperationException : An appropriate member 'ManualTestsEnabled' could not be found. The conditional method needs to be a static method, property, or field on the type System.Formats.Tar.Tests.ManualTestsAsync or any ancestor, of any visibility, accepting zero arguments, and having a return type of Boolean.
```

ghost locked as resolved and limited conversation to collaborators

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.