[COMPRESS-726] Add optional decompression bomb protection to CompressorStreamFactory by tomasilluminati · Pull Request #777 · apache/commons-compress

tomasilluminati · 2026-06-28T18:43:49Z

Adds output-counting guard BombGuardCompressorInputStream that caps decompressed size and an optional mean compression-ratio check with a configurable grace. Off by default, opt-in via CompressorStreamFactory setters. CompressorInputStream now implements InputStreamStatistics. Tests and a changes.xml entry are included.

…orStreamFactory Adds output-counting guard BombGuardCompressorInputStream that caps decompressed size and an optional mean compression-ratio check with a configurable grace. Off by default, opt-in via CompressorStreamFactory setters. CompressorInputStream now implements InputStreamStatistics. Tests and a changes.xml entry are included.

tomasilluminati · 2026-06-28T18:44:13Z

@ppkarwasz COMPRESS-726 PR Opened

garydgregory

Hi All,
I have some initial thoughts on this line of work:

Let's loose the whole "bomb" FUD terminology. I might just want to enforce limits. I should be able to do this in one place that also happens to catch possible attacks if an app decides to accept untrusted input. The goal is to enforce a limit, regardless of the input's intent. I have a door on my house to avoid people, animals, weather, and so on from coming in, I don't have a "thief-guard".
Use a builder for construction with a private ctor(Builder), otherwise, this will be another class that will get one constructor after another over time.
Why no pull up the feature in the super class or move it to IO's BoundedInputStream? The code already uses a BoundedInputStream after all.

tomasilluminati · 2026-06-28T22:36:39Z

@garydgregory
Thanks, this is really helpful, and honestly it points at a simpler shape than what I pushed.

If all we want is a decompressed-size limit, IO's BoundedInputStream already does it (setMaxCount plus an onMaxCount that throws). So we don't really need a compress-specific class or a new exception. I'm happy to drop BombGuardCompressorInputStream and DecompressionBombException and just have CompressorStreamFactory wrap the stream in a BoundedInputStream when a max size is set. That's the single place you're describing, it reuses IO, and the bomb wording goes away. The builder point sorts itself out too, since there's no class left to build.

The only bit that doesn't fit there is the compression-ratio check, since BoundedInputStream has no idea about ratios and it needs getCompressedCount() from the compressor side. So really the only open question is scope. Just the size limit, or the ratio guard too?

Size only, I drop the class and the exception and wire BoundedInputStream into the factory. Smallest change.
Ratio too, I keep it but move it onto CompressorInputStream where both counts already live, still exposed through the factory.

My hunch from your note is size-only, but just say the word and I'll redo it that way. Either way the InputStreamStatistics change stays, the compressed count is handy on its own.

Thanks again for the steer.

garydgregory · 2026-06-28T23:27:05Z

@tomasilluminati
My initial reaction to the ratio check feature is that it's nice but I doubt folks will know how to use it properly without shooting themselves in the foot. "Our app blew up when we allowed users to import files of type foo but everything was fine with bar files" kind of thing. That said, there is a way to implement a ratio check I think using org.apache.commons.io.input.ProxyInputStream.AbstractBuilder.setAfterRead(IOIntConsumer) where you'd do the check in the consumer (BoundedInputStream extends ProxyInputStream).
This one can simmer... ;) maybe @ppkarwasz has some thoughts.

tomasilluminati · 2026-06-29T00:08:12Z

@garydgregory
Yeah, I'm with you on the footgun. The ratio is genuinely hard to tune and easy to get wrong, exactly the "works for bar, blows up on foo" trap, so I'd rather not hand people a sharp tool they can't aim.

Nice pointer on setAfterRead, hadn't thought of it. That actually means neither piece needs a new class. The size cap is setMaxCount plus a throwing onMaxCount, and if the ratio ever comes back it can ride along as a setAfterRead consumer on the same BoundedInputStream.

So my lean is to keep it minimal, wire an absolute decompressed-size limit into CompressorStreamFactory via BoundedInputStream, drop the bomb class and the exception, and leave the ratio out for now. We can always document it as a recipe later.

Happy to let it simmer and hear what @ppkarwasz thinks before I change anything.

tomasilluminati · 2026-07-02T04:42:30Z

@garydgregory , I've got the minimal version ready, absolute decompressed-size limit composed over IO's BoundedInputStream, dropping the extra class and the exception. One note, since the cap is on the real decompressed output, it already covers the faked-payload-size case Piotr raised (it ignores any declared size). @ppkarwasz, given that, do you still want the mean-ratio guard and an on-by-default limit? The ratio fires earlier and needs no tuned number, so I'm happy to fold it back as a second, opt-in check if you want it. Otherwise I'll push the size-only version.

ppkarwasz · 2026-07-02T07:18:59Z

My initial reaction to the ratio check feature is that it's nice but I doubt folks will know how to use it properly without shooting themselves in the foot. "Our app blew up when we allowed users to import files of type foo but everything was fine with bar files" kind of thing.

As with most features, we probably need to provide sensible defaults, so users don't need to modify them. AI gives the following usual compression rates:

Binaries: 2-3×
Text/source code: 3-5×
XML/JSON: 5-15×
Log files: 10-30×

Higher compression rates are impossible, unless you compress zeros, so 100× looks like a safe default. When I chose the default ratio for apache/mina#52, I tried to copy what HTTP Server does: mod_deflate, has these knobs:

DeflateInflateRatioLimit defaults to 200,
Rather than a grace size, mod_deflate only rejects a stream once the cumulative ratio stays above the limit for DeflateInflateRatioBurst + 1 (default 4) consecutive output buffers. With the default 8 KiB buffer, that corresponds to roughly 32 KiB of decompressed output.

ppkarwasz · 2026-07-02T07:28:52Z

@ppkarwasz, given that, do you still want the mean-ratio guard and an on-by-default limit?

I guess, we can deliver a mean-ratio guard in a separate wrapper: only a limited number of users need it. Basically only if the compressed stream comes primarily from untrusted sources (like in a server), users might want to add a mean-ratio check as a fail-fast mechanism to prevent a DoS.

However, I think it would be nice if every CompressorInputStream inherited from BoundedInputStream: for archive formats that contain compressed data (ZIP, 7z), we usually have a (declared) decompressed size, so we could set that size as limit and prevent corrupted archives to fill the disk. Malicious archives, of course, would need the user to set a hard maximum decompressed size.

garydgregory · 2026-07-02T11:46:06Z

To make it clear, I'll move this PR draft since it is not ready as there are a few aspects still being discussed.

tomasilluminati · 2026-07-02T12:54:00Z

@ppkarwasz thanks, that lines up well.

A separate opt-in wrapper for the mean-ratio guard sounds right, it keeps the cost at zero for the common case and puts the fail-fast check where untrusted-source users actually want it. Happy to add it that way, following the mod_deflate shape you described (a burst tolerance in output buffers rather than a byte grace), with 100x as the default once it is opt-in.

On making every CompressorInputStream inherit from BoundedInputStream to use the declared ZIP/7z size: appealing for catching corrupt archives, but it is a much bigger change than this PR, it touches the base type every format extends and the japicmp surface, and the declared size lives at the archive layer rather than the codec layer, so wiring it cleanly is its own design. I'd keep it out of 726 and open a separate ticket if you want to pursue it.

So where I think this lands: absolute decompressed-size cap as the core here (composed over IO's BoundedInputStream), the mean-ratio guard as a separate opt-in wrapper with a 100x default, and the inherit-from-BoundedInputStream idea as its own follow-up. Does that match what you and @garydgregory have in mind before I rework the PR?

garydgregory · 2026-07-02T14:47:34Z

Giving defaults like:

Binaries: 2-3×

Text/source code: 3-5×

XML/JSON: 5-15×

Log files: 10-30×

is why this should NOT be added as a feature to this class. It's based on guesses at best and will certainly shoot someone in the foot or a more useful body part.

garydgregory · 2026-07-02T15:30:17Z

@ppkarwasz, given that, do you still want the mean-ratio guard and an on-by-default limit?

I guess, we can deliver a mean-ratio guard in a separate wrapper: only a limited number of users need it. Basically only if the compressed stream comes primarily from untrusted sources (like in a server), users might want to add a mean-ratio check as a fail-fast mechanism to prevent a DoS.

However, I think it would be nice if every CompressorInputStream inherited from BoundedInputStream: for archive formats that contain compressed data (ZIP, 7z), we usually have a (declared) decompressed size, so we could set that size as limit and prevent corrupted archives to fill the disk. Malicious archives, of course, would need the user to set a hard maximum decompressed size.

I think it would be nice if every CompressorInputStream inherited from BoundedInputStream:

I think I like that idea.

What we are saying today is: If you want to bound decompression, you must compose your CompressorInputStream with a BoundedInputStream yourself and configure it.

Instead, what we would be saying is: Bounding decompression is so important to all possible call sites, that we are no longer offering composition as an option, we are now hard-coding the feature by inheriting it and making it available to all call sites, all the time. You still need to set a boundary in you call site.

Doing so would require breaking binary compatibility because the deprecated method CompressorInputStream.getCount() returns an int and BoundedInputStream.getCount() returns a long. Food for thought...

tomasilluminati · 2026-07-02T21:19:32Z

@ppkarwasz, given that, do you still want the mean-ratio guard and an on-by-default limit?

I guess, we can deliver a mean-ratio guard in a separate wrapper: only a limited number of users need it. Basically only if the compressed stream comes primarily from untrusted sources (like in a server), users might want to add a mean-ratio check as a fail-fast mechanism to prevent a DoS.
However, I think it would be nice if every CompressorInputStream inherited from BoundedInputStream: for archive formats that contain compressed data (ZIP, 7z), we usually have a (declared) decompressed size, so we could set that size as limit and prevent corrupted archives to fill the disk. Malicious archives, of course, would need the user to set a hard maximum decompressed size.

I think it would be nice if every CompressorInputStream inherited from BoundedInputStream:

I think I like that idea.

What we are saying today is: If you want to bound decompression, you must compose your CompressorInputStream with a BoundedInputStream yourself and configure it.

Instead, what we would be saying is: Bounding decompression is so important to all possible call sites, that we are no longer offering composition as an option, we are now hard-coding the feature by inheriting it and making it available to all call sites, all the time. You still need to set a boundary in you call site.

Doing so would require breaking binary compatibility because the deprecated method CompressorInputStream.getCount() returns an int and BoundedInputStream.getCount() returns a long. Food for thought…

I like the intent, one honest catch beyond the getCount int/long break you flagged. BoundedInputStream bounds the bytes it reads from its wrapped stream, which for a decompressor is the compressed input side, whereas bomb protection needs to bound the decompressed output. So the inheritance only does the right thing if every format routes its output through the proxy, which is the big part of the change. Before anyone builds it, worth pinning down what the inherited bound counts, decompressed output or the compressed input it proxies. Happy to prototype whichever way you and @ppkarwasz settle on.

garydgregory requested changes Jun 28, 2026

View reviewed changes

tomasilluminati requested a review from garydgregory June 28, 2026 22:33

garydgregory mentioned this pull request Jun 29, 2026

docs: document security model #763

Open

7 tasks

garydgregory marked this pull request as draft July 2, 2026 11:46

Uh oh!

Conversation

tomasilluminati commented Jun 28, 2026

Uh oh!

tomasilluminati commented Jun 28, 2026

Uh oh!

garydgregory left a comment

Choose a reason for hiding this comment

Uh oh!

tomasilluminati commented Jun 28, 2026

Uh oh!

garydgregory commented Jun 28, 2026

Uh oh!

tomasilluminati commented Jun 29, 2026

Uh oh!

tomasilluminati commented Jul 2, 2026

Uh oh!

ppkarwasz commented Jul 2, 2026

Uh oh!

ppkarwasz commented Jul 2, 2026

Uh oh!

garydgregory commented Jul 2, 2026

Uh oh!

tomasilluminati commented Jul 2, 2026

Uh oh!

garydgregory commented Jul 2, 2026

Uh oh!

garydgregory commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tomasilluminati commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

garydgregory commented Jul 2, 2026 •

edited

Loading