Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COMPRESS-542: Lazy allocation of SevenZArchiveEntry to prevent OOM on corrupt files #120

Merged
merged 2 commits into from
Aug 15, 2020

Conversation

theobisproject
Copy link
Contributor

This PR is nearly the same as the patch I attached to the Jira ticket. I made an additional null check before passing the initialized files to the archive to avoid NullPointerExceptions later on.
As said in the ticket it would be nice to check if the headers are ok before initializing the files but with my current knowledge of 7z this is out of scope for me. This patch works for the files I encountered even if I have to admit the patch is not very beautiful.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 87.267% when pulling 80e5a46 on theobisproject:COMPRESS-542 into b70af20 on apache:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 87.267% when pulling 80e5a46 on theobisproject:COMPRESS-542 into b70af20 on apache:master.

@PeterAlfredLee
Copy link
Member

Hi @theobisproject
I'm a little interested about the amount of SevenZArchiveEntry. How many entries do you have in your 7z archive that take so much memory?

@theobisproject
Copy link
Contributor Author

Hi @PeterAlfredLee
as explained in the Jira Ticket it was a corrupted archive where the numFiles variable read from the header in readFilesInfo was about 138 million entries. I don't know what would be the correct number for the archive but this was the number commons-compress was reading.

@PeterAlfredLee
Copy link
Member

I see.
This PR looks good now. Will look more into it these days.

@garydgregory
Copy link
Member

garydgregory commented Aug 4, 2020

My biggest issue with this PR is that it is missing a unit test. Without a failing unit test, there is no way to know that this fixes anything or that a future change will not regress to the previous behavior. Please add a test ;-)

I also see a lot of duplicate code like (5 times!)

if (files[i] == null) {
      files[i] = new SevenZArchiveEntry();
}

which should be easily refactored.

@theobisproject
Copy link
Contributor Author

As you are a bit more familiar with the 7z archive maybe you have an idea how to avoid the entry allocation completely before we are sure the header is not corrupted. It was more a hack from me to process the data somehow.

@theobisproject
Copy link
Contributor Author

My biggest issue with this PR is that it is missing a unit test. Without a failing unit test, there is no way to know that this fixes anything or that a future change will not regress to the previous behavior. Please add a test ;-)

As this was encountered with production data there is no way I can share those and creating an artifical archive which is reproducing this is way beyond my current understanding of the format. The only way I see to currently add a test is when making the readFilesInfo method protected.

@akelday
Copy link

akelday commented Aug 5, 2020

avoid the entry allocation completely before we are sure the header is not corrupted

Probably not possible with the current code... tryToLocateEndHeader is the real cause because it does no CRC check and cannot, because by definition it's already a corrupt file.

I have crafted a 233 byte malformed 7z which would attempt to allocate 268,435,455 files but I'm not certain it's wise to post it here. This is in some way related to my own problems with a very large 7z because the "kName" section allocates an enormous buffer for filenames (fixable by streaming the bytes instead).

@theobisproject
Copy link
Contributor Author

Just a short update. I will try to generate a reproducting file via fuzzing which is hopefully successful. This might take some days but I think at the end of the weekend I can share somthing.

@theobisproject
Copy link
Contributor Author

The test files @akelday attached to the jira issue are replicating the issue. Since this issue is dependend on the available Java heap it could be difficult to add this as unit test since I don't know a way to force the Java heap for a single test in JUnit. As far as I know it could only be set at the maven-surefire-plugin.

if (files[nextFile] == null) {
files[nextFile] = new SevenZArchiveEntry();
}
files[nextFile].setName(fName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do not need a variable fName here :

if (files[nextFile] == null) {
    files[nextFile] = new SevenZArchiveEntry();
}
files[nextFile].setName(new String(names, nextName, i - nextName, StandardCharsets.UTF_16LE));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I got rid of the variable in the latest commit

@@ -997,6 +1000,9 @@ private void readFilesInfo(final ByteBuffer header, final Archive archive) throw
throw new IOException("Unimplemented");
}
for (int i = 0; i < files.length; i++) {
if (files[i] == null) {
files[i] = new SevenZArchiveEntry();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @garydgregory about this duplicate code. We can have a simple refactor here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code is refactored as part of the latest commit

@@ -1090,7 +1108,13 @@ private void readFilesInfo(final ByteBuffer header, final Archive archive) throw
++emptyFileCounter;
}
}
archive.files = files;
List<SevenZArchiveEntry> entries = new ArrayList<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a final here?

final List<SevenZArchiveEntry> entries = new ArrayList<>();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
I use final as much as possible to make it obvious when reading or debugging whether I need to even think about a given local variable, for more details on my reasoning, please see https://garygregory.wordpress.com/2013/01/26/the-final-kiss-in-java/

@theobisproject
Copy link
Contributor Author

In the files I got from the fuzzing I noticed that even the array creation can lead to an OOM exception. I tried to replace it with a list and let it grow as needed but it didn't worked well. The current solution with the Map works (all tests pass on a Raspberry Pi 3) but has high memory allocation because of autoboxing the integer.

@PeterAlfredLee
Copy link
Member

Looks good. Thank you for your work @theobisproject .
BTW how did you do the fuzzing? I was thinking if we could have some fuzz tests in Compress.

@PeterAlfredLee PeterAlfredLee merged commit 464ba19 into apache:master Aug 15, 2020
@theobisproject
Copy link
Contributor Author

I am using this project https://github.com/rohanpadhye/jqf to do the fuzzing via the AFL bridge provided. This was the fuzz method implementation I used

@Fuzz
public void sevenzOom(@From(InputStreamGenerator.class) InputStream in) {
    try (SevenZFile sevenZFile = new SevenZFile(new SeekableInMemoryByteChannel(IOUtils.toByteArray(in)))) {
        Assume.assumeTrue(sevenZFile.getNextEntry() != null);
    } catch (IOException e) {
        // Ignore
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants