Fix Issue 20290 - std.zip: file(name) and directory issues #7292
Conversation
Thanks for your pull request and interest in making D better, @berni44! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please see CONTRIBUTING.md for more information. If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Testing this PR locally
If you don't have a local development environment set up, you can use Digger to test this PR:

dub fetch digger
dub run digger -- build "master + phobos#7292"
Such checks would be useful when extracting files (i.e. extract one or all archive members to disk with their names as they appear in the archive), but we should not break manipulation of archives even with nonsensical / "invalid" filenames.
enforce!ZipException(splitter(link, "/").all!(a => a != "..")
    && (link.length == 0 || link[0] != '/')
    && (link.length < 2 || !isAlpha(link[0]) || link[1] != ':'),
    "symlink with directory traversal found");
I don't see a problem with arbitrary symlinks. tar allows them, and so should ZIP files, unless there is a good reason otherwise, like being forbidden by the standard or overwhelmingly common conventions. The important bit is to not allow "overwriting" symlinks, as then a ZIP file could contain a symlink with a path followed by a file with the same (or canonically same) path which would then get written to the symlink target location. And, since there is no reason to check them, there is no need to decompress them.
The important bit is to not allow "overwriting" symlinks, as then a ZIP file could contain a symlink with a path followed by a file with the same (or canonically same) path which would then get written to the symlink target location.
But that's again the big question mentioned. I created such a zip file: unzip delays the creation of the symlink, which is some sort of protection. But if you run unzip twice, the file is written where it should not be.
Not sure that's the best way to handle that. I think an unlink, ignoring errors, before creating the file would be more appropriate. This is what e.g. gcc does, at least.
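For illustration, a minimal sketch of that "unlink first, ignoring errors" approach using std.file; the function name and extraction flow are hypothetical, not part of this PR:

import std.exception : collectException;
import std.file : remove, write;

// Hypothetical extraction helper: drop any pre-existing file or symlink
// at the destination (ignoring errors, as suggested above), then write.
void writeExtracted(string path, const(ubyte)[] data)
{
    collectException(remove(path)); // unlink first; ignore "does not exist" etc.
    write(path, data);              // now the path cannot be a stale symlink
}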
unzip delays the creation of the symlink, which is some sort of protection.
Does it attempt to canonicalize the path (which is generally trouble)? Or it merely handles symlinks after files, and does not create them if something already exists at that path?
ditto?
// Most filesystems do not allow more than 255 characters in a file name.
// Longer names are probably malware, so we reject them.
enforce!ZipException(name.length - (name.lastIndexOf('/') + 1) <= 255,
    "contains filename with length > 255");
No arbitrary limits please. Let the OS do its job.
It's not that arbitrary. I searched for the maximum allowed filename lengths: all popular POSIX filesystems are limited to 255 chars, and so is Windows. There are a few that allow less, but I couldn't find any that allow longer names.
The source of these checks is IMHO overdoing it a little, which is why I rejected its really arbitrary limit of 1024 chars for symlinks.
It's not that arbitrary. I searched for the maximum allowed filename lengths: all popular POSIX filesystems are limited to 255 chars, and so is Windows. There are a few that allow less, but I couldn't find any that allow longer names.
We don't know that about all filesystems that have or will ever be written.
Is there some performance or security advantage of doing this check here, when the OS will already be doing its own check?
There is NAME_MAX from libc which defines the OS limit, so we could use that... though, if someone wants to use ZIP files as key-value stores with compressed values and arbitrary (incl. length) keys, who are we to stop them?
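For what it's worth, a sketch of querying the limit at run time instead of hard-coding it. This is POSIX-only, and the numeric value of the _PC_NAME_MAX selector below is the Linux one, assumed here purely for illustration:

version (Posix)
{
    import core.sys.posix.unistd : pathconf;
    import std.string : toStringz;

    // Ask the filesystem hosting `dir` for its own name-length limit,
    // falling back to the common 255 if it does not report one.
    size_t nameMax(string dir = ".")
    {
        enum _PC_NAME_MAX = 3; // Linux value; other systems may differ
        immutable limit = pathconf(dir.toStringz, _PC_NAME_MAX);
        return limit > 0 ? cast(size_t) limit : 255;
    }
}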
With the change to a validate function I think it's ok to keep the limit of 255?
// ASCII characters < 0x20 and 0x7F can be used to hide path traversal attacks.
// Therefore we reject them. Backslash is not allowed according to the specs.
enforce!ZipException(name.all!(c => c >= 0x20 && c != 0x7F && c != '\\'),
    "filename contains invalid symbol");
UNIX filenames can contain ANY characters except NUL and /, so not sure about this.
I'm not sure about this either. The source (see above) says:
// CVE-2003-0282 (aka "JELMER")
// Some zip implementations filter control characters at the least.
// This behavior can be exploited to mask ".." in a directory traversal.
// https://www.securityfocus.com/archive/1/321090
Hmm, that looks like a bug in the unzip utility (it filters out the control characters) or the operating system (canonicalizes them out and interprets .. after this canonicalization). If it's the former, then we don't need to do anything. If it's the latter, then this solution is inadequate - we need to run the path through the same algorithm the OS uses for canonicalization and check the result.
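A rough approximation of the "canonicalize, then check" idea using std.path; a sketch only, since the OS's own canonicalization can still differ, which is exactly the concern raised above:

import std.algorithm.searching : startsWith;
import std.path : buildNormalizedPath, dirSeparator;

// Normalize the would-be destination and verify it is still inside the
// target directory. Note: a bare prefix test could be fooled by sibling
// directories like "target2", hence the appended separator.
bool escapesTarget(string targetDir, string memberName)
{
    immutable root = buildNormalizedPath(targetDir);
    immutable dest = buildNormalizedPath(targetDir, memberName);
    return dest != root && !dest.startsWith(root ~ dirSeparator);
}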
ditto? By the way, ISO rejects some of these characters. I'll address this in a separate PR, because here it's only about validating, not rejecting.
With the same argument, we should allow zip bombs. The specs don't forbid overlapping content... Having said this, the big question is how much a library should do and how much should be left to the programmer using the library. See issue 18950 for some arguments. Maybe we want to go a middle way and add a safety-belt flag to the c'tor, which defaults to safety but can be overridden?
Can you define "safety" precisely? If you can't, then this is a fool's errand.
My understanding is that things are as follows:
In which case I don't see a problem. If things are not as above, then perhaps they should be?
I don't know if you know Postfix - it's a mail transfer agent. This program checks for some common settings that produce a so-called open relay and prevents that, because it would open the door for spammers. If you really want an open relay, you need to add some more settings, effectively saying "Yes, I want this". These checks protect users who might not be aware of what they are doing from accidentally producing an open relay. Of course, one can say it's the users' fault, because they did not get the settings right. But the software helps avoid such traps. And that's exactly what I want to achieve here: for example, some users might just want to replace a system call to unzip with some D routine which reads an archive and writes all contained files to disk. Of course it's ultimately on the user to check for malware, but the library can help avoid the trap. This is the sort of safety I mean.
No, zip bombs are detected before decompression, directly when parsing the archive. One reason is that they can already cause harm at that stage, because the data read in is much larger than the file size (due to overlaps) and may make the application crash because it runs out of memory. Another reason is that I see no way to detect them later; each individual compressed file of a zip bomb is a valid file.
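For concreteness, such an overlap check can run on the central directory alone, before any decompression. A minimal sketch; the Member struct and its field names are illustrative, not the actual Phobos types:

import std.algorithm.sorting : sort;
import std.exception : enforce;

struct Member { ulong offset; ulong compressedSize; }

// Sort members by where their data starts in the archive and reject
// any member whose data begins before the previous one's data ends.
void rejectOverlaps(Member[] members)
{
    members.sort!((a, b) => a.offset < b.offset);
    ulong end = 0;
    foreach (m; members)
    {
        enforce(m.offset >= end, "overlapping members - possible zip bomb");
        end = m.offset + m.compressedSize;
    }
}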
Yep, but this would not be appropriate for an IMAP server library.
There would have to be a separate API for that purpose. Maybe an
Come to think of it: wouldn't the same technique be usable to achieve a higher compression ratio for ZIP archives in good faith? And what is the distinction between a ZIP bomb as you would define it, and merely a very large and uniform file (say, 100 GB of zeroes) compressed tightly into a normal ZIP file? Just the use of overlapping areas and the higher compression ratio that achieves?
During all the other review processes we went through in the last weeks, I had the feeling we were working together to reach something better than what I would have managed alone (and maybe also better than what you would have managed alone, but as I don't know what that is, I cannot really say anything about it). For me it felt like having some sort of a mentor. This time is different: being called a fool, and arguments favouring the protection of hacks (saving AAs as zip files, using overlapping data to achieve a higher compression ratio) over protecting inexperienced users. This feels somehow unhealthy to me. Did I do something wrong? Did I accidentally hit a sore point? Or is it just misperception on my side?
Sorry, I don't know why you're coming to such conclusions.
I feel the same way, so your message is surprising to me.
I wrote that phrase under the impression that "fool's errand" was a common expression meaning futile or impossible task. No offense was intended. However, English is not my native tongue (it is my third language), and sometimes I misuse expressions. Apologies for that.
Using the filesystem as a database is quite common. Since ZIP files are like compressed filesystems, I don't see a problem with this idea.
You mentioned yourself that such ZIP files are completely valid. This was actually contrary to my previous impression of them; I thought they were invalid but worked only due to a lack of necessary checks in common decompression utilities. If they are valid according to the standards we choose to support, then they need to be supported, otherwise we are not compliant with the standard. This is a black-and-white matter.
Footgun safeties generally do not belong in libraries, or only as additional layers. This is why I suggested an
Oh, I'm sorry, I did not know that and got it wrong. Now I understand...
Meanwhile I looked through the docs once more. It's not clearly stated. There are two specs: the one from ISO, which refers to the other from PKWARE. ISO adds several restrictions, and as far as I can tell, we use ISO here. ISO does not mention overlapping data. PKWARE does not directly mention overlapping data either, but gives an overview of a ZIP file in 4.3.6 from which one could deduce that data was never intended to overlap. And there are places where they refer to other data records with "before" or "after", which IMHO suggests the same. At least they clearly state that only one end-of-central-directory record is allowed, which rules chameleon files out.
Well, as the standard isn't really clear here, it seems to be some shade of gray...
For me, libraries are a tool that helps avoid reinventing the wheel. So libraries should contain stuff that's needed over and over again (and is not trivial). IMHO checking for malware is such a thing. I'm not sure if making such checks the default is the right thing to do, but they will definitely be useful and therefore should be part of the library.
I don't think that this is a good solution. Users might want to filter some files or change some stuff (file attributes, time stamps etc.); we cannot cover all of the use cases. IMHO, they should write such functions themselves. But we can provide the checks, so they can use them. I thought of replacing the checks in the c'tor with a public function that can be called. But that raises the question of how to return the result. Just a boolean might not be satisfying, because the user might like to know why a file was rejected. Again throwing? Or a large enum? Or returning a string? I'm not convinced by any of them. If you've got an idea on how to do that, that might be a solution.
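One possible shape for such a non-throwing check is sketched below; all names are hypothetical, and the checks are abbreviated versions of the ones discussed in this thread:

import std.algorithm.iteration : splitter;
import std.algorithm.searching : all, canFind;
import std.string : lastIndexOf;

enum NameCheck
{
    ok,
    pathTraversal,
    nameTooLong,
    invalidCharacter,
}

// Non-throwing validation: returns *why* a name was rejected,
// so the caller can decide how to react.
NameCheck checkMemberName(string name)
{
    if (name.splitter('/').canFind(".."))
        return NameCheck.pathTraversal;
    if (name.length - (name.lastIndexOf('/') + 1) > 255)
        return NameCheck.nameTooLong;
    if (!name.all!(c => c >= 0x20 && c != 0x7F && c != '\\'))
        return NameCheck.invalidCharacter;
    return NameCheck.ok;
}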
Reviewing the https://en.wikipedia.org/wiki/Zip_bomb article, I understand that ZIP bombs (of the kind which make use of overlapping sections) only take effect if an archive is unpacked recursively, i.e. any ZIP files within the ZIP file are also unpacked, and so on. This also needs to be done in a naive way, i.e. keeping all previously unpacked files in memory. It seems to me that this is an extraordinary use case against which it doesn't make sense to protect by default, which, combined with the fact that overlapping sections are not technically forbidden, leads to the conclusion that protection against such possible attacks ought to be opt-in.
It doesn't seem that way to me. If a standard does not forbid something, it is allowed. The illustration does show a typical ZIP file, and is a good guideline for how to create one, but not for how software parsing ZIP files should expect them to look. BTW, overlapping sections appear in other formats as well (such as PE files), where they are also perfectly legal. This is made use of by PE compressors such as Crinkler to achieve a higher compression ratio.
Providing an easy-to-use function for the most common use cases and an advanced API for the less common ones seems like a perfectly reasonable compromise to me, but I agree it's not the only possible approach.
Exceptions seem reasonable. They have the advantage that the exception class can contain additional information if required, and this can be added later without breaking the API. With this approach, the function could have a
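A sketch of how an exception could carry that extra information; a hypothetical class, not the existing std.zip.ZipException:

// Carries the rejected member's name alongside the message, so callers
// can report or skip the offending member. Fields added later this way
// do not break the API.
class ZipValidationException : Exception
{
    string memberName;

    this(string msg, string memberName,
         string file = __FILE__, size_t line = __LINE__)
    {
        super(msg, file, line);
        this.memberName = memberName;
    }
}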
No, that information is outdated. The zip bombs given on https://www.bamsoftware.com/hacks/zipbomb/ show that this is possible without recursion too. (These zip bombs were the reason why Walter initially added the bug report that started all these changes here.)
I think we can agree that we disagree here...
If these overlaps were of any use other than malware I would agree, but I do not see a reasonable use case here.
I'm not sure if unzip-to-disk would be the most common use case. This is probably the reason why I'm so reluctant with this approach.
I like the idea with that enum, although I think we should postpone it until all checks are ready. Then the param can be added with a reasonable default value and the enum members will be less chaotic.
OK, carefully reading through that page, it doesn't seem like there is a reasonable way to take advantage of this peculiarity of the format to achieve higher compression ratios for valid use cases. At least with the approach described there, either the local/central file names won't match, or an additional header needs to be faked and encapsulated in a Deflate quoted block, which "corrupts" one of the files. So, no issue with scanning and rejecting such files, as there exist no valid use cases for them. (Sorry it took me so long to reach this conclusion :) )
I added two validate functions: one in ArchiveMember and one in ZipArchive. The latter is for convenience and just calls the former for all members. I did not change anything concerning the checks themselves. I hope they are OK now that the user can choose to call them.
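If I read that right, the structure is roughly as below; heavily simplified, with the actual checks elided - see the diff for the real code:

class ArchiveMember
{
    string name;

    // Per-member checks (path traversal, name length, ...).
    void validate()
    {
        // ... the checks discussed in this thread ...
    }
}

class ZipArchive
{
    ArchiveMember[string] directory;

    // Convenience: validate every member of the archive.
    void validate()
    {
        foreach (member; directory)
            member.validate();
    }
}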
Is there anything I could do about this?
I think this is OK but needs @atilaneves' approval as it adds a new public symbol.
This is orphaned, but it has a link in the Bugzilla tracker. Closing for now; anyone is free to adopt it.
CC @berni44 |
I needed the decompression method in ArchiveMember, and as it should be there anyway, I moved it from ZipArchive to avoid code duplication.
This PR does not address two of the issues mentioned in issue 20290, because I think they are invalid. (Origin, where the PR took the issues from):
I see no reason why they should be rejected completely. Instead, here they are decompressed and checked for path traversal, like the non-compressed ones. In the original place, the reason given for rejecting them is just "simplicity", in the sense of laziness.
As the original place admits, this is an "arbitrary upper bound to prevent runaway string decoding". Being arbitrary, it might reject valid files. Furthermore, it's not explained what is meant by "runaway string decoding", and I couldn't find anything about it on the internet.