Fix Issue 20290 - std.zip: file(name) and directory issues #7292

Closed · wants to merge 1 commit

Conversation

@ghost commented Nov 21, 2019

I needed the decompression method in ArchiveMember, and as it should be there anyway, I moved it from ZipArchive to avoid code duplication.

This PR does not address two of the issues mentioned in issue 20290, because I think they are invalid (see the origin from which the issue took them):

Rejects compressed symlinks.

I see no reason why they should be rejected outright. Instead, they are decompressed here and checked for path traversal, just like the uncompressed ones. In the original place, the reason given for rejecting them is merely "simplicity", in the sense of laziness.

Rejects symlinks that exceed 1024 bytes.

As the original place admits, this is an "arbitrary upper bound to prevent runaway string decoding". Being arbitrary, it might reject valid files. Furthermore, it is not explained what is meant by "runaway string decoding", and I couldn't find anything about it on the internet.

@ghost requested a review from CyberShadow as a code owner on November 21, 2019, 13:15
@dlang-bot (Contributor)

Thanks for your pull request and interest in making D better, @berni44! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Bugzilla 20290 (severity: normal) - std.zip: file(name) and directory issues

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub fetch digger
dub run digger -- build "master + phobos#7292"

@CyberShadow (Member) left a comment
Such checks would be useful when extracting files (i.e. extract one or all archive members to disk with their names as they appear in the archive), but we should not break manipulation of archives even with nonsensical / "invalid" filenames.

enforce!ZipException(splitter(link, "/").all!(a => a != "..")
    && (link.length == 0 || link[0] != '/')
    && (link.length < 2 || !isAlpha(link[0]) || link[1] != ':'),
    "symlink with directory traversal found");
Member:

I don't see a problem with arbitrary symlinks. tar allows them, and so should ZIP files, unless there is a good reason otherwise, like being forbidden by the standard or overwhelmingly common conventions. The important bit is to not allow "overwriting" symlinks, as then a ZIP file could contain a symlink with a path followed by a file with the same (or canonically same) path which would then get written to the symlink target location. And, since there is no reason to check them, there is no need to decompress them.

Author:

The important bit is to not allow "overwriting" symlinks, as then a ZIP file could contain a symlink with a path followed by a file with the same (or canonically same) path which would then get written to the symlink target location.

But that's again the big question mentioned above. I created such a zip file. unzip delays the creation of the symlink, which is some sort of protection. But if you run unzip twice, the file is written where it should not be.

Member:

I'm not sure that's the best way to handle it. I think an unlink, ignoring errors, before creating the file would be more appropriate. This is what e.g. gcc does, at least.
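
A minimal sketch of that "unlink first" approach (not code from this PR; the function name is made up for illustration): remove whatever already exists at the destination, ignoring errors, before writing the member data, so the write can no longer be funneled through a pre-existing symlink.

    import std.file : remove, write, FileException;

    void writeMemberData(string destPath, const(ubyte)[] data)
    {
        try
        {
            // std.file.remove deletes a symlink itself, not its target
            remove(destPath);
        }
        catch (FileException)
        {
            // ignored: most likely the destination simply does not exist yet
        }
        write(destPath, data);
    }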

Member:

unzip delays the creation of the symlink, which is some sort of protection.

Does it attempt to canonicalize the path (which is generally trouble)? Or does it merely handle symlinks after files, and not create them if something already exists at that path?

Author:

ditto?

// Most filesystems do not allow for more than 255 characters in a file name.
// Longer names are probably malware and we reject them therefore.
enforce!ZipException(name.length - (name.lastIndexOf('/') + 1) <= 255,
    "contains filename with length > 255");
Member:

No arbitrary limits please. Let the OS do its job.

Author:

It's not that arbitrary. I searched for the maximum allowed filename lengths: all popular POSIX filesystems are limited to 255 characters, and so is Windows. There are a few that allow less, but I couldn't find any that allow longer names.

The source of these checks is IMHO overdoing it a little, which is why I rejected the truly arbitrary limit of 1024 characters for symlinks.

Member:

It's not that arbitrary. I searched for the maximum allowed filename lengths: all popular POSIX filesystems are limited to 255 characters, and so is Windows. There are a few that allow less, but I couldn't find any that allow longer names.

We don't know that about all filesystems that have been or ever will be written.

Is there some performance or security advantage of doing this check here, when the OS will already be doing its own check?

Member:

There is NAME_MAX from libc, which defines the OS limit, so we could use that... though, if someone wants to use ZIP files as a key-value store with compressed values and arbitrary keys (including arbitrary length), who are we to stop them?

Author:

With the change to a validate function, I think it's OK to keep the limit of 255?

// ASCII characters < 0x20 and 0x7F can be used to hide path traversal attacks.
// Therefore we reject them. Backslash is not allowed according to the specs.
enforce!ZipException(name.all!(c => c >= 0x20 && c != 0x7F && c != '\\'),
    "filename contains invalid symbol");
Member:

UNIX filenames can contain ANY character except NUL and /, so I'm not sure about this.

Author:

I'm not sure about this either. The source (see above) says this:

// CVE-2003-0282 (aka "JELMER")
// Some zip implementations filter control characters at the least.
// This behavior can be exploited to mask ".." in a directory traversal.
// https://www.securityfocus.com/archive/1/321090

Member:

Hmm, that looks like a bug in the unzip utility (it filters out the control characters) or the operating system (canonicalizes them out and interprets .. after this canonicalization). If it's the former, then we don't need to do anything. If it's the latter, then this solution is inadequate - we need to run the path through the same algorithm the OS uses for canonicalization and check the result.

Author:

ditto? By the way, ISO rejects some of these characters. I'll address that in a separate PR, because here it's only about validating, not rejecting.

@ghost (Author) commented Nov 21, 2019

Such checks would be useful when extracting files (i.e. extract one or all archive members to disk with their names as they appear in the archive), but we should not break manipulation of archives even with nonsensical / "invalid" filenames.

With the same argument, we should allow zip-bombs. The specs don't forbid overlapping content...

Having said this, the big question is how much a library should do and how much should be left to the programmer using it. See issue 18950 for some arguments. Maybe we want to go a middle way and add a safety-belt flag to the constructor, which defaults to safe behavior but can be overridden?
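
Purely as an illustration of that safety-belt idea (the type name and flag are hypothetical, not part of std.zip): the constructor would run the strict checks by default, and callers who really want to accept dubious archives would have to say so explicitly.

    class CheckedZipArchive
    {
        this(void[] data, bool strictChecks = true)
        {
            // ... parse the archive as std.zip.ZipArchive does today ...
            if (strictChecks)
            {
                // path-traversal, filename and overlap checks would run here
                // and throw a ZipException on the first violation
            }
        }
    }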

@CyberShadow (Member)

Can you define "safety" precisely? If you can't, then this is a fool's errand.

@CyberShadow (Member)

With the same argument, we should allow zip-bombs.

My understanding is that things are as follows:

  1. Data is not decompressed until requested
  2. It's possible to manipulate archive members without decompressing them, incl. copying / moving them to other ZIP archives
  3. ZIP bombs are checked for if / when the data is decompressed, but not otherwise

In which case I don't see a problem. If things are not as above, then perhaps they should be?

@ghost (Author) commented Nov 26, 2019

Can you define "safety" precisely? If you can't, then this is a fool's errand.

I don't know if you know Postfix - it's a mail transfer agent. This program checks for some common settings that would produce a so-called open relay and prevents them, because an open relay would open the door for spammers. If you really want an open relay, you need to add one more setting, effectively saying "Yes, I want this". These checks protect users who might not be aware of what they are doing from accidentally producing an open relay. Of course one can say it's the users' fault for not getting the settings right, but the software helps them avoid such traps.

And that's exactly what I want to achieve here. For example, some users might just want to replace a system call to unzip with a D routine that reads an archive and writes all contained files to disk. Of course it's ultimately up to the user to check for malware, but the library can help avoid the trap. This is the sort of safety I mean.

3. ZIP bombs are checked for if / when the data is decompressed, but not otherwise

No, zip bombs are detected before decompression, directly when parsing the archive. One reason is that they can already cause harm at that stage, because the data read in is much larger than the file size (due to overlaps) and may make the application crash when it runs out of memory. Another reason is that I see no way to detect them later; each individual compressed file of a zip bomb is a valid file.
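
As a minimal sketch of how such a parse-time check can work (this is an assumption about the approach, not the PR's actual code; the struct and function names are invented): record the byte range that each member's data occupies, sort by offset, and reject the archive if any two ranges intersect.

    import std.algorithm.sorting : sort;
    import std.exception : enforce;

    struct MemberSpan { ulong offset; ulong length; }

    void rejectOverlappingMembers(MemberSpan[] spans)
    {
        sort!((a, b) => a.offset < b.offset)(spans);
        foreach (i; 1 .. spans.length)
            enforce(spans[i - 1].offset + spans[i - 1].length <= spans[i].offset,
                    "overlapping member data found; possible zip bomb");
    }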

@CyberShadow (Member)

This program

Yep, but this would not be appropriate for an IMAP server library.

For example, some users might just want to replace a system call to unzip with a D routine,

There would have to be a separate API for that purpose. Maybe an unzip function?
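
A hedged sketch of such an unzip convenience function (the name and signature are hypothetical): extract every member of an archive into a directory, which is where the safety checks discussed in this thread would naturally live.

    import std.file : mkdirRecurse, write;
    import std.path : buildPath, dirName;
    import std.zip : ZipArchive;

    void unzip(ZipArchive archive, string destDir)
    {
        foreach (name, member; archive.directory)
        {
            // name validation (path traversal, invalid characters, ...) goes here
            if (name.length && name[$ - 1] == '/')
                continue;                       // directory entry: nothing to write
            const target = buildPath(destDir, name);
            mkdirRecurse(dirName(target));
            write(target, archive.expand(member));
        }
    }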

No, zip bombs are detected before decompression, directly when parsing the archive.

Come to think of it.

Wouldn't the same technique be usable, in good faith, to achieve a higher compression ratio for ZIP archives?

And, what is the distinction between a ZIP bomb as you would define it, and merely a very large and uniform file (say, 100 GB of zeroes) compressed tightly into a normal ZIP file? Just the use of overlapping areas and the higher compression ratio that achieves?

@ghost (Author) commented Nov 27, 2019

During all the other review processes we went through in the last weeks, I had the feeling we were working together to reach something better than what I would have managed alone (and maybe also better than what you would have managed alone, but as I don't know what that is, I can't really say). For me it felt like having some sort of mentor.

This time it's different: calling me a fool and arguing in favour of protecting hacks (saving AAs as zip files, using overlapping data to achieve higher compression) over protecting inexperienced users. This feels somehow unhealthy to me.

Did I do something wrong? Did I accidentally hit a sore point of yours? Or is it just a misperception on my side?

@CyberShadow (Member)

Sorry, I don't know why you're coming to such conclusions.

During all the other review processes we went through in the last weeks, I had the feeling we were working together to reach something better than what I would have managed alone (and maybe also better than what you would have managed alone, but as I don't know what that is, I can't really say).

I feel the same way, so your message is surprising to me.

calling me a fool

I wrote that phrase under the impression that "fool's errand" was a common expression meaning futile or impossible task. No offense was intended. However, English is not my native tongue (it is my third language), and sometimes I misuse expressions. Apologies for that.

saving AAs as zip files

Using the filesystem as a database is quite common. Since ZIP files are like compressed filesystems, I don't see a problem with this idea.

using overlapping data to achieve higher compression

You mentioned yourself that such ZIP files are completely valid. This was actually contrary to my previous impression of them; I thought they were invalid and worked only due to a lack of the necessary checks in common decompression utilities.

If they are valid according to the standards we choose to support, then they need to be supported, otherwise we are not compliant with the standard. This is a black-and-white matter.

over protecting inexperienced users

Footgun safeties generally do not belong in libraries, or only as additional layers. This is why I suggested an unzip function which could then include these checks (maybe optionally).

@ghost (Author) commented Nov 28, 2019

I wrote that phrase under the impression that "fool's errand" was a common expression meaning futile or impossible task. No offense was intended. However, English is not my native tongue (it is my third language), and sometimes I misuse expressions. Apologies for that.

Oh, I'm sorry, I did not know that and got it wrong. Now I understand...

using overlapping data to achieve higher compression

You mentioned yourself that such ZIP files are completely valid. This was actually contrary to my previous impression of them; I thought they were invalid and worked only due to a lack of the necessary checks in common decompression utilities.

Meanwhile I have looked through the docs once more. It's not clearly stated. There are two specs: the one from ISO, which refers to the other one from PKWARE. ISO adds several restrictions, and as far as I understand it, we use ISO here. ISO does not mention overlapping data. PKWARE does not directly mention overlapping data either, but gives an overview of a ZIP file in section 4.3.6, from which one could deduce that data was never intended to overlap:

4.3.6 Overall .ZIP file format:

  [local file header 1]
  [encryption header 1]
  [file data 1]
  [data descriptor 1]
  . 
  .
  .
  [local file header n]
  [encryption header n]
  [file data n]
  [data descriptor n]
  [archive decryption header] 
  [archive extra data record] 
  [central directory header 1]
  .
  .
  .
  [central directory header n]
  [zip64 end of central directory record]
  [zip64 end of central directory locator] 
  [end of central directory record]

And there are places where they refer to other data records with "before" or "after", which IMHO suggests the same.

At least they clearly state that only one end-of-central-directory record is allowed, which rules chameleon files out.

If they are valid according to the standards we choose to support, then they need to be supported, otherwise we are not compliant with the standard. This is a black-and-white matter.

Well, as the standard isn't really clear here, it seems to be some shade of gray...

Footgun safeties generally do not belong in libraries, or only as additional layers.

For me, libraries are a tool that helps avoid reinventing the wheel. So libraries should contain things that are needed over and over again (and are not trivial). IMHO checking for malware is such a thing. I'm not sure whether making such checks the default is the right thing to do, but they will definitely be useful and should therefore be part of the library.

This is why I suggested an unzip function which could then include these checks (maybe optionally).

I don't think that this is a good solution. Users might want to filter some files or change some things (file attributes, time stamps, etc.). We cannot cover all of the use cases. IMHO they should write such functions themselves. But we can provide the checks so they can use them.

I thought of replacing the checks in the constructor with a public function that can be called. But that raises the question of how to return the result. Just a boolean might not be satisfying, because the user might want to know why a file was rejected. Throwing again? Or a large enum? Or returning a string? I'm not convinced by any of them. If you've got an idea of how to do that, that might be a solution.

@CyberShadow (Member)

Reviewing the https://en.wikipedia.org/wiki/Zip_bomb article, I understand that ZIP bombs (of the kind which make use of overlapping sections) are only effective if an archive is unpacked recursively, i.e. any ZIP files within the ZIP file are also unpacked, and so on. This also needs to be done in a naive way, i.e. keeping all previously unpacked files in memory. It seems to me that this is an extraordinary use case against which it doesn't make sense to protect by default, which, combined with the fact that overlapping sections are not technically forbidden, leads to the conclusion that protection against such possible attacks ought to be opt-in.

Well, as the standard isn't really clear here, it seems to be some shade of gray...

It doesn't seem that way to me. If a standard does not forbid something, it is allowed. The illustration does show a typical ZIP file, and is a good guideline for how to create one, but not for how software parsing ZIP files should expect them to look.

BTW, overlapping sections appear in other formats as well (such as PE files), where they are also perfectly legal. This is made use of by PE compressors such as Crinkler to achieve a higher compression ratio.

I don't think that this is a good solution. Users might want to filter some files or change some things (file attributes, time stamps, etc.). We cannot cover all of the use cases. IMHO they should write such functions themselves. But we can provide the checks so they can use them.

Providing an easy-to-use function for the most common use cases and an advanced API for the less common ones seems like a perfectly reasonable compromise to me, but I agree it's not the only possible approach.

I thought of replacing the checks in the constructor with a public function that can be called. But that raises the question of how to return the result. Just a boolean might not be satisfying, because the user might want to know why a file was rejected. Throwing again? Or a large enum? Or returning a string? I'm not convinced by any of them. If you've got an idea of how to do that, that might be a solution.

Exceptions seem reasonable. They have the advantage that the exception class can contain additional information if required, and this can be added later without breaking the API. With this approach, the function could have a void validate() signature (or maybe taking an enum of which checks to perform).
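
A sketch of that exception-based reporting (the class name and fields are hypothetical): a ZipException subclass can carry which member failed and why, and such fields can be added later without breaking a void validate() signature.

    import std.zip : ZipException;

    class ZipValidationException : ZipException
    {
        string memberName;   // the member that failed validation
        string failedCheck;  // e.g. "path traversal" or "filename length"

        this(string msg, string memberName, string failedCheck)
        {
            super(msg);
            this.memberName = memberName;
            this.failedCheck = failedCheck;
        }
    }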

@ghost (Author) commented Nov 28, 2019

Reviewing the https://en.wikipedia.org/wiki/Zip_bomb article, I understand that ZIP bombs (of the kind which make use of overlapping sections) are only effective if an archive is unpacked recursively,

No, that information is outdated. The zip bombs given on https://www.bamsoftware.com/hacks/zipbomb/ show that this is possible without recursion too. (These zip bombs were the reason why Walter initially added the bug report that started all these changes.)

It doesn't seem that way to me. If a standard does not forbid something, it is allowed. The illustration does show a typical ZIP file, and is a good guideline for how to create one, but not for how software parsing ZIP files should expect them to look.

I think we can agree that we disagree here...

BTW, overlapping sections appear in other formats as well (such as PE files), where they are also perfectly legal. This is made use of by PE compressors such as Crinkler to achieve a higher compression ratio.

If these overlaps were of any use other than malware I would agree, but I do not see a reasonable use case here.

Providing an easy-to-use function for the most common use cases and an advanced API for the less common ones seems like a perfectly reasonable compromise to me, but I agree it's not the only possible approach.

I'm not sure that unzip-to-disk would be the most common use case. That is probably the reason why I'm so reluctant about this approach.

Exceptions seem reasonable. They have the advantage that the exception class can contain additional information if required, and this can be added later without breaking the API. With this approach, the function could have a void validate() signature (or maybe taking an enum of which checks to perform).

I like the idea with the enum, although I think we should postpone it until all the checks are ready. Then the parameter can be added with a reasonable default value and the enum members will be less chaotic.

@CyberShadow (Member)

The zip bombs given on https://www.bamsoftware.com/hacks/zipbomb/ show that this is possible without recursion too.

OK, carefully reading through that page, it doesn't seem like there is a reasonable way to take advantage of this peculiarity of the format to achieve higher compression ratios for valid use cases. At least with the approach described there, either the local/central file names won't match, or an additional header needs to be faked and encapsulated in a Deflate quoted block, which "corrupts" one of the files. So, no issue with scanning and rejecting such files, as there exist no valid use cases for them. (Sorry it took me so long to reach this conclusion :) )

@ghost (Author) commented Dec 8, 2019

I added two validate functions, one in ArchiveMember and one in ZipArchive. The latter is for convenience and just calls the former for all members.

I did not change anything concerning the checks themselves. I hope they are acceptable now that the user can choose whether to call them.
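
A usage sketch of the two functions described above (the names follow this description; the exact signatures are those of the PR, not of released std.zip): validate a whole archive in one call, or individual members one by one.

    import std.zip : ZipArchive;

    void checkBeforeExtraction(ZipArchive archive)
    {
        // either validate the whole archive at once...
        archive.validate();

        // ...or validate members individually, e.g. to skip some of them
        foreach (name, member; archive.directory)
            member.validate();
    }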

@ghost (Author) commented Jan 17, 2020

Is there anything I could do about this?

@CyberShadow (Member) left a comment

I think this is OK, but it needs @atilaneves' approval as it adds a new public symbol.

@RazvanN7 added the Review:Orphaned label ("The author of the PR is no longer available and this PR can be adopted by anyone.") on Jan 28, 2021
@RazvanN7 (Collaborator)

This PR is orphaned, but it has a link in the Bugzilla tracker. Closing for now; anyone is free to adopt it.

@RazvanN7 closed this on Jan 28, 2021
@PetarKirov (Member)

CC @berni44

Labels: Review:Orphaned (the author of the PR is no longer available and this PR can be adopted by anyone), Severity:Bug Fix

4 participants