Avoid reading unused bytes from PNG streams #535

Merged
drewnoakes merged 4 commits into drewnoakes:master from Draczech:draczech/PngChunkSanityCheck on May 4, 2021

Conversation

Draczech (Contributor)

Trying to address #534

@Draczech (Contributor, Author)

The check is added in PngChunkReader, as allocating space in StreamReader may be legitimate and there is generally nothing wrong with it; allocating enormous byte buffers, however, should not be allowed in the first place.

Tested only with the file attached to #534:

    @Test(expected = PngProcessingException.class)
    public void testCorruptedFile() throws Exception
    {
        processFile("Tests/Data/favicon_better_essay_service_com.png");
    }

I do not intend to mock the StreamReader, so this is the only test; a better test might be desirable, though.

@@ -86,7 +86,7 @@
     // Process the next chunk.
     int chunkDataLength = reader.getInt32();

-    if (chunkDataLength < 0)
+    if ((chunkDataLength < 0) || (chunkDataLength > reader.available()))
Contributor

I'm afraid that this isn't a valid use of available(). Here's the JavaDoc for the method:

Returns an estimate of the number of bytes that can be read (or skipped over) from this SequentialReader without blocking by the next invocation of a method for this input stream. A single read or skip of this many bytes will not block, but may read or skip fewer bytes.

Note that while some implementations of SequentialReader like SequentialByteArrayReader will return the total remaining number of bytes in the stream, others will not. It is never correct to use the return value of this method to allocate a buffer intended to hold all data in this stream.

Returns:
an estimate of the number of bytes that can be read (or skipped over) from this SequentialReader without blocking or 0 when it reaches the end of the input stream.

This is a more fundamental "problem": the nature of a stream is that the total size might be unknown. For SequentialByteArrayReader, using available() would yield the correct result, while for StreamReader (and potentially other implementations) it would not.

I don't know how a "sane maximum value" could be deduced, but I guess some more finesse is needed.

@Draczech (Contributor, Author) commented Apr 28, 2021

OK, you are right about this. There are implementations of InputStream which might return unexpected results for available(). I didn't realize that when I implemented my fix; I just thought available() was the best chance there is for estimating stream size, and I only thought of metadata-extractor as a tool for bounded streams.

What do you suggest then? Using available() might fail for some crazy streams. And technically there is nothing wrong with allocating an array up to Integer.MAX_VALUE; it can cause some exceptions which are impossible to catch or prevent from outside the library.

@Nadahar (Contributor) commented Apr 28, 2021

I'm not sure I have a suggestion off the top of my head. I guess a check could be made with instanceof and then use your logic if the reader is a SequentialByteArrayReader, but this still leaves the question of what to do with all the other cases.

Regarding available(), it's not just "crazy streams" where this fails. For FileInputStream it generally returns the remaining bytes of the file, and I think that's the reason why many seem to think it reports "the remainder of the stream". For network streams, for example, available() will generally just report the number of bytes currently in the buffer, which will probably be a relatively small value.

@Draczech (Contributor, Author)

It is true that this might be an unnecessary optimization, as allocating a large buffer is generally OK and reading more bytes from a stream than it actually contains should result in java.io.EOFException: End of data reached.

Let me know, though, if you come up with any solution or trick to handle this issue. I realize it is not caused by the library, but such a threat of OOM is a really undesirable one.

@@ -86,7 +86,7 @@
     // Process the next chunk.
     int chunkDataLength = reader.getInt32();

-    if (chunkDataLength < 0)
+    if ((chunkDataLength < 0) || (chunkDataLength > reader.available()))
         throw new PngProcessingException("PNG chunk length exceeds maximum");

     PngChunkType chunkType = new PngChunkType(reader.getBytes(4));
Contributor

For some reason that I'm sure makes a lot of sense to GitHub, I can't place a comment any further down than here. I wanted to comment on line 96 below, though.

I'm not saying that this will solve it, but I see one possibility to at least make this less likely to happen. As far as I can tell, chunkData is only used if willStoreChunk is true. It seems very wasteful to me to call getBytes() even if the data isn't to be kept; it will allocate memory and read data that is then just handed off to the GC. I would suggest something like this instead:

    byte[] chunkData;
    if (willStoreChunk) {
        chunkData = reader.getBytes(chunkDataLength);
    } else {
        chunkData = null; // To satisfy the compiler
        reader.skip(chunkDataLength);
    }

With that in place, I guess some sane maximum sizes could be deduced based on the chunk type. I'm no PNG expert, but I would assume that only the IDAT chunk could reasonably contain really large data, and I don't think that chunk is ever relevant for Metadata-extractor to read. For the chunk types of interest to Metadata-extractor, it should be possible to come up with some reasonable maximum value where everything above that would be rejected before calling getBytes() (which is where the allocation occurs)?

@Draczech (Contributor, Author)

OK, this would be a nice workaround! In the case of this specific file, the problematic chunk is not read, only skipped. So in the end int chunkDataLength = reader.getInt32(); is negative in the next loop iteration and "End of data reached." is thrown. Nice!

I can see a minor complication in the fact that new PngChunkReader().extract(new StreamReader(inputStream), Collections.emptySet()) and new PngChunkReader().extract(new StreamReader(inputStream), null) can yield completely different results if your code snippet is used. Which is expected, I guess, but should be mentioned in the documentation/comments (see the sketch below).
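
For illustration only, here is a minimal sketch of those two calls, assuming (as the discussion above implies) that an empty set means "store no chunks" while null means "store every chunk type"; the class name and file path are hypothetical:

    import com.drew.imaging.png.PngChunkReader;
    import com.drew.imaging.png.PngChunkType;
    import com.drew.lang.StreamReader;

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Collections;

    public class PngChunkReadingSketch
    {
        public static void main(String[] args) throws Exception
        {
            // Empty set: no chunk type is "desired", so with the proposed change
            // every chunk body is skipped rather than read into memory.
            try (InputStream in = new FileInputStream("example.png")) {
                new PngChunkReader().extract(new StreamReader(in), Collections.<PngChunkType>emptySet());
            }

            // Null: all chunk types are desired, so every chunk body is still
            // read via getBytes(), including an absurdly large one.
            try (InputStream in = new FileInputStream("example.png")) {
                new PngChunkReader().extract(new StreamReader(in), null);
            }
        }
    }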

On the other hand, PngMetadataReader.readMetadata(@NotNull InputStream inputStream) has a static set of PngChunkTypes defined, so no changes are required there :)

@Draczech (Contributor, Author)

I've implemented the changes you proposed, added a short comment regarding the chunk data reading and also included the corrupted file in the tests. Let me know if this solution is OK with you. Thanks!

- Reverted stream available bytes check
- Updated tests using corrupted png file
@drewnoakes (Owner)

@Draczech thanks for following up with this. The change looks sensible.

One funny thing: in other repos I often ask for unit tests to be added; in this repo, I usually ask for them to be removed :) Well, the test is fine, but the image file -- not so much. Rather than add lots of binary files to this repo, we maintain a separate collection of test images and regularly run regression tests against them. We also use these files to drive parity between the Java and .NET implementations. Please remove the file from this PR (and the tests, if they cannot be updated to work without the file) and we will add the file to the image repo instead:

https://github.com/drewnoakes/metadata-extractor-images

I can add it, if you don't want to clone that repo. It's quite large.

@Draczech (Contributor, Author)

Draczech commented May 4, 2021

OK, I can see your point. The image file is removed now, and so are the tests (they don't make much sense without the file). So the pull request now has only one changed file.

Please add the file to your image repo. Thank you!

@Nadahar (Contributor) commented May 4, 2021

While I think this is a good change, I'd also think that it should be possible to define an absurdly large maximum size for at least some chunk types, so that encountering something larger would abort parsing/declare the file invalid.

@drewnoakes Is that something worth considering (the "work" is of course to come up with these upper thresholds)?

To explain what I mean, I took a quick look at the PNG spec and can see that:

  • IHDR should AFAICT have a fixed size of 13 bytes.
  • PLTE has a maximum size of 768 bytes and must be divisible by 3.
  • IEND should have a fixed size of 0 bytes.
  • tRNS should AFAICT have a maximum size of 256 bytes.

I'm pretty sure that similar thresholds can be found for many other chunk types; I just read the first ones in the standard above. The maximum thresholds don't have to be as narrow as the standard defines; they could leave in a lot of "slack" and still provide reasonable protection against memory reservation issues.

tEXt, zTXt and iTXt are in theory limited only by the size of a 32-bit integer, but in reality they are used for things like Title, Author, Description, Copyright, Comment, etc. I don't think that setting a maximum size of something ridiculous, like say 10 MB, would be "unreasonable" for such chunks?

Maybe PngChunkType should include a new field holding this threshold, where for example a negative value (like -1) would indicate that there is no threshold for that chunk type. PngChunkReader could then use this information to reject chunks that are above the threshold, if one is defined, before trying to read them (see the sketch below).
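
Purely as an illustration of that idea, here is a minimal sketch; the class, the helper name and the exact values are invented for this example and are not part of the library's API:

    // Hypothetical per-chunk-type size thresholds; names and values are
    // illustrative only and do not exist in metadata-extractor today.
    final class ChunkSizeLimits
    {
        /** Returns the maximum plausible data length for a chunk type, or -1 if no threshold is defined. */
        static int maxLengthFor(String chunkTypeIdentifier)
        {
            switch (chunkTypeIdentifier) {
                case "IHDR": return 13;                // fixed size per the PNG spec
                case "PLTE": return 768;               // at most 256 entries of 3 bytes
                case "IEND": return 0;                 // always empty
                case "tRNS": return 256;               // at most one byte per palette entry
                case "tEXt":
                case "zTXt":
                case "iTXt": return 10 * 1024 * 1024;  // generous "ridiculous" 10 MB cap
                default:     return -1;                // unknown/unbounded, e.g. IDAT
            }
        }

        // How PngChunkReader might use it before allocating the chunk body
        // (assuming the chunk's four-character identifier is at hand):
        //
        //     int maxLength = ChunkSizeLimits.maxLengthFor(identifier);
        //     if (maxLength >= 0 && chunkDataLength > maxLength)
        //         throw new PngProcessingException("PNG chunk length exceeds maximum for its type");

        private ChunkSizeLimits() {}
    }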

Or, all this might just be overkill 😉

@drewnoakes (Owner)

> To explain what I mean, I took a quick look at the PNG spec and can see that:
>
>   • IHDR should AFAICT have a fixed size of 13 bytes.
>   • PLTE has a maximum size of 768 bytes and must be divisible by 3.
>   • IEND should have a fixed size of 0 bytes.
>   • tRNS should AFAICT have a maximum size of 256 bytes.

These seem like reasonable safety rails, though it would be a shame to reject a file that could be parsed just because it had some zero padding that would otherwise be ignored, for example.

> tEXt, zTXt and iTXt are in theory limited only by the size of a 32-bit integer, but in reality they are used for things like Title, Author, Description, Copyright, Comment, etc. I don't think that setting a maximum size of something ridiculous, like say 10 MB, would be "unreasonable" for such chunks?

Coming up with a number N such that we can assume no chunk could ever be larger is a tricky problem. These chunks can store arbitrary data (such as XMP).

I tend to prefer the approach of using the length of the whole data stream when it is available, and not otherwise.

Another approach would be to change how we read large byte arrays from the stream. That is, we read in chunks of some amount (say 1 MB), so that it's more likely the stream hits EOF than that we fail to allocate a crazy large array in the StreamReader. Of course there's a downside to this in that the happy path now has to concatenate a bunch of large arrays together to produce the same result we have today, in cases where the data actually is large.
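
A rough sketch of that chunked-read idea, assuming a plain InputStream; this is not how StreamReader is currently implemented, and the class name and 1 MB slice size are illustrative only:

    import java.io.ByteArrayOutputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;

    final class ChunkedReadSketch
    {
        private static final int SLICE_SIZE = 1024 * 1024; // read at most 1 MB at a time

        /** Reads totalLength bytes in bounded slices so that a bogus length hits EOF
         *  long before a multi-gigabyte allocation is ever attempted. */
        static byte[] readBytes(InputStream in, int totalLength) throws IOException
        {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] slice = new byte[Math.min(SLICE_SIZE, Math.max(totalLength, 1))];
            int remaining = totalLength;
            while (remaining > 0) {
                int read = in.read(slice, 0, Math.min(slice.length, remaining));
                if (read == -1)
                    throw new EOFException("End of data reached."); // fail before exhausting memory
                buffer.write(slice, 0, read);
                remaining -= read;
            }
            // Happy-path downside: the slices are copied once more into a single array here.
            return buffer.toByteArray();
        }
    }

With something like this, a corrupt multi-gigabyte length costs at most the bytes actually present in the stream plus one 1 MB slice before the EOFException is raised.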

@drewnoakes (Owner) left a comment

Thanks very much for this. It's a good change that will help in a range of scenarios, not just the one you originally reported. It's also generated some good discussion for future work.

I'll squash merge this so that the image doesn't end up in the history.

@drewnoakes drewnoakes merged commit d56b5a5 into drewnoakes:master May 4, 2021
@drewnoakes drewnoakes changed the title Comparing PNG chunk size to available bytes in buffer Avoid reading unused bytes from PNG streams May 4, 2021
@Nadahar (Contributor) commented May 5, 2021

> I tend to prefer the approach of using the length of the whole data stream when it is available, and not otherwise.

That would be great, but how do you intend to decide if the data stream size is "known" or not? I've tried to solve this in another project, and the result is anything but pretty (I'm even somewhat embarrassed to post it):

	public static long getInputStreamSize(@Nullable InputStream inputStream) {
		// This method uses whatever means necessary to acquire the source size
		// of known InputStream implementations. If possible acquisition from
		// other InputStream implementations is found, the code to acquire it
		// should be added to this method.
		if (inputStream == null) {
			return -1L;
		}
		if (inputStream instanceof FakeInputStream) {
			return ((FakeInputStream) inputStream).getSize();
		}
		if (inputStream instanceof FilterInputStream) {
			try {
				Field in = FilterInputStream.class.getDeclaredField("in");
				in.setAccessible(true);
				return getInputStreamSize((InputStream) in.get(inputStream));
			} catch (ReflectiveOperationException | IllegalArgumentException e) {
				return -1L;
			}
		}
		if (inputStream instanceof PartialInputStream) {
			PartialInputStream partial = (PartialInputStream) inputStream;
			long result = getInputStreamSize(partial.source);
			if (result < 1L) {
				return result;
			}
			if (partial.getEnd() >= 0L) {
				result = Math.min(result, partial.getEnd());
			}
			return partial.getStart() > 0L ? Math.max(result - partial.getStart(), 0L) : result;
		}
		if (inputStream instanceof SizeLimitInputStream) {
			SizeLimitInputStream limited = (SizeLimitInputStream) inputStream;
			return limited.in == null ? -1L : Math.min(getInputStreamSize(limited.in), limited.maxBytesToRead);
		}
		if (inputStream instanceof DLNAImageInputStream) {
			return ((DLNAImageInputStream) inputStream).getSize();
		}
		if (inputStream instanceof DLNAThumbnailInputStream) {
			return ((DLNAThumbnailInputStream) inputStream).getSize();
		}
		if (inputStream instanceof ByteArrayInputStream) {
			try {
				Field count = ByteArrayInputStream.class.getDeclaredField("count");
				count.setAccessible(true);
				return count.getInt(inputStream);
			} catch (ReflectiveOperationException | IllegalArgumentException e) {
				return -1L;
			}
		}
		if (inputStream instanceof FileInputStream) {
			try {
				return ((FileInputStream) inputStream).getChannel().size();
			} catch (IOException e) {
				return -1L;
			}
		}
		if ("com.sun.imageio.plugins.common.InputStreamAdapter".equals(inputStream.getClass().getName())) {
			try {
				Field stream = inputStream.getClass().getDeclaredField("stream");
				stream.setAccessible(true);
				return ((ImageInputStream) stream.get(inputStream)).length();
			} catch (ReflectiveOperationException | IllegalArgumentException | IOException e) {
				return -1L;
			}
		}
		return -1L;
	}

The above has several problems. There's the use of reflection that isn't allowed in more recent versions of Java, and the fundamental "logic" is that you have to treat each InputStream implementation as a special case - which means that you will end up deeming the size "unknown" for custom implementations where it might in fact be known.

Chunked reading would solve it at the cost of some loss of performance, which of course raises the question of whether it's "worth it" to lose performance for every valid stream to protect against issues with a very few invalid streams. On the other hand, such invalid streams could be specially crafted to trigger exactly such a problem, I guess.

@kwhopper (Collaborator) commented May 5, 2021

> Another approach would be to change how we read large byte arrays from the stream.

Shameless plug for drewnoakes/metadata-extractor-dotnet#250, or #361 as a starting point for Java -- only because it's (potentially) a stepping stone towards what you're after. It creates a single place to hold most buffers instead of splattering them throughout the library, which should make it easier to branch off under certain conditions.

For example, a future optimization could be writing out certain types of stream content directly to disk transparently.

@Draczech Draczech deleted the draczech/PngChunkSanityCheck branch May 5, 2021 11:58
drewnoakes added a commit to drewnoakes/metadata-extractor-dotnet that referenced this pull request May 7, 2021
Previously the code would read a byte[] for every PNG chunk, irrespective of whether it was going to be stored. With this change, these arrays are only allocated when they will actually be needed.

This is the .NET port of drewnoakes/metadata-extractor#535