System.IO.Compression handles extended characters incorrectly #43231
Comments
I looked into this issue, and the encoding of ZIP archive entry names is not handled correctly. Each entry in an archive carries a general purpose bit flag that indicates whether its name is encoded as UTF-8. When this flag is off, meaning the name is not UTF-8, we do not pick the right encoding. The following comment has more details: runtime/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs Line 356 in b75f8d9
Usually, when UTF-8 encoding is not used, we should consider using code page 437 instead, or investigate the details described in https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT. This issue is quite visible to users because the Windows shell does not create archives using UTF-8; from what I am seeing, it uses code page 437. |
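To make the flag concrete, here is a minimal sketch in Python (not .NET code) that reads general purpose bit 11 for every entry of an archive. The helper names `utf8_flags` and `build_sample` are made up for illustration. The sample archive is written by Python's own `zipfile` module, which sets the bit only for non-ASCII names; that behavior belongs to Python's writer and says nothing about what the Windows shell writes.

```python
import io
import zipfile

# General purpose bit 11 (0x0800): the entry name and comment are UTF-8.
UTF8_NAME_FLAG = 0x0800


def utf8_flags(archive_bytes: bytes) -> dict[str, bool]:
    """Map each entry name to whether its UTF-8 flag (bit 11) is set."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
            for info in zf.infolist()
        }


def build_sample() -> bytes:
    """Build an in-memory archive with one ASCII and one non-ASCII name."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("plain.txt", b"ascii name")
        zf.writestr("tämä.txt", b"non-ascii name")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, is_utf8 in utf8_flags(build_sample()).items():
        print(f"{name}: utf8_flag={is_utf8}")
```

When the flag is off for a non-ASCII name, the bytes alone do not say which code page was used, which is the case this thread is about.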
Thank you. Now I know what is happening and that this will be fixed. In the meantime I could do this and force code page 437 to get the correct value from _storedEntryNameBytes. I still have to test whether this works correctly in all cases.
|
ZipFile.Open(String, ZipArchiveMode, Encoding) Now I just need to figure out what encoding has been used |
This is what I came up with: minimal encoding detection, and after quick testing it works.
|
And no, it didn't work as intended. If the zip file has these entries in it, they appear in this order, and my code takes the encoding from the first entry.
And then the second file name is actually encoded with code page 437. |
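The failure described above comes from choosing one encoding for the whole archive. A minimal sketch of the alternative, deciding per entry from that entry's own flag, could look like this in Python. The function name `decode_entry_name` and the `fallback` parameter are hypothetical, and the code page 437 fallback is an assumption that will be wrong for archives written with other code pages.

```python
# General purpose bit 11 (0x0800): the entry name and comment are UTF-8.
UTF8_NAME_FLAG = 0x0800


def decode_entry_name(raw_name: bytes, flag_bits: int, fallback: str = "cp437") -> str:
    """Decode one entry name using that entry's own UTF-8 flag.

    The fallback code page is a guess; the archive does not record it.
    """
    if flag_bits & UTF8_NAME_FLAG:
        return raw_name.decode("utf-8")
    return raw_name.decode(fallback)


if __name__ == "__main__":
    # Two entries in one archive: the first UTF-8 (flag set), the second
    # stored with code page 437 (flag clear).
    entries = [
        ("tämä.txt".encode("utf-8"), UTF8_NAME_FLAG),
        ("tämä.txt".encode("cp437"), 0),
    ]
    for raw_name, flag_bits in entries:
        print(decode_entry_name(raw_name, flag_bits))
```

Reusing the first entry's encoding (UTF-8) for the second entry would fail here, because the byte 0x84 is not valid UTF-8 on its own.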
For #43231 (comment), this may work in some cases but not in general, as you indicated later. If you look at pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT it says
|
After giving this some thought, exposing the header for entries would be a suitable solution for this. Then it would be up to the programmer to check those values if they are interested in them. |
I also had ZIP entry encoding problems recently and this is what I figured out:
I think the runtime should handle all the encoding detection work. |
This is exactly why the language encoding bit should be exposed. If the encoding is anything other than UTF-8, the runtime has no way of knowing what the code page should be. As a developer, I can "guess" which code page to use by knowing where the file is coming from. |
How do standalone archiving tools like 7-Zip handle that case then? 7-Zip can extract all archives that I tested with correct encodings. If 7-Zip can do it, then surely the .NET runtime can do it too? |
If you create an archive that uses any encoding other than UTF-8 or IBM437, it doesn't. 7-Zip will extract this ")=.txt" as "]~Kúºú". So I need to know where the file came from and which code page was used.
|
Yeah ok, but I meant archives that were created with "standard" tools. .NET should be able to handle those automatically. |
I just found this comment in the .NET source code: https://source.dot.net/#System.IO.Compression/System/IO/Compression/ZipArchive.cs,356 What's weird about that is that it seems to be wrong. Specifically the part where it says
In my experience this is not the case, see #43231 (comment). The Windows Shell zip tool seems to use "IBM850" (not "IBM437"!) encoding on Windows 10. Because .NET does not default to that (it defaults to "the current system default code page" instead), those Windows zip files cannot be extracted correctly by .NET. Maybe .NET just understands "the current system default code page" differently than Windows does. As written, Windows seems to use "IBM850" encoding, which is available in .NET as Anyway, even if it worked by default for zip files created on Windows, we would still have problems with zip files created on macOS. |
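One possible reason the thread disagrees on IBM437 versus IBM850 is that the two code pages assign the same bytes to letters such as å, ä and ö, so file names using only those letters cannot tell them apart. The sketch below checks this with Python's built-in `cp437` and `cp850` codecs. The helper name `differing_bytes` is made up for illustration, and the code says nothing about which code page the Windows shell actually uses.

```python
def differing_bytes(codec_a: str, codec_b: str) -> list[int]:
    """Return the byte values that the two single-byte codecs decode differently."""
    return [
        value
        for value in range(256)
        if bytes([value]).decode(codec_a) != bytes([value]).decode(codec_b)
    ]


if __name__ == "__main__":
    diff = differing_bytes("cp437", "cp850")
    print(f"{len(diff)} byte values differ between cp437 and cp850")

    # Nordic letters get identical bytes in both code pages.
    print("åäö".encode("cp437") == "åäö".encode("cp850"))

    # A letter such as ø exists in cp850 but is read differently by cp437.
    raw = "ø".encode("cp850")
    print(raw.decode("cp850"), raw.decode("cp437"))
```

A test file named with ä therefore round-trips under either code page, while a name containing ø would expose the difference.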
If you are running on the full framework, the default code page is picked up from the Windows API |
I tested on .NET Framework 4.8. But as far as I understand it .NET Core/5/6 will have the same problem, right? Because neither uses the same ZIP entry name encoding by default that Windows uses. |
Yes. I was just explaining how Encoding picks the default. |
yep! I reached this page having the same issue and it fails for .NET Core 7 |
Commenting here since I found the thread while googling the same issue. For .NET Framework 4.8 it works for me using Encoding.GetEncoding(System.Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage). For .NET Core I used a somewhat ugly hack where I replace the "illegal characters", in this case the Swedish åäö.
I can't say this is the way to go and that I recommend it, but it solved the issue for me. |
This issue has been moved from a ticket on Developer Community.
[severity:It's more difficult to complete my work]
Create a text file named "tämä.txt"
Send it to a compressed folder
When extracting this file, the library reads the name as "t„m„.txt"
Windows Explorer expands the zip fine, as does other compression software
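The garbled name in the steps above is consistent with one specific mismatch: the name stored in an OEM code page, where ä is the byte 0x84 (true for both IBM437 and IBM850), and then decoded with Windows-1252, where 0x84 is „. The Python sketch below reproduces that under those assumptions. The report does not state which code pages were in effect on the reporter's machine, and the function name `reproduce_mojibake` is made up for illustration.

```python
def reproduce_mojibake(name: str, stored_as: str, decoded_as: str) -> str:
    """Encode a name with one code page and decode the bytes with another."""
    return name.encode(stored_as).decode(decoded_as)


if __name__ == "__main__":
    # Assumption: written with an OEM code page, read back as Windows-1252.
    garbled = reproduce_mojibake("tämä.txt", stored_as="cp437", decoded_as="cp1252")
    print(garbled)  # t„m„.txt

    # Decoding with the code page that was used for writing restores the name.
    print(reproduce_mojibake("tämä.txt", stored_as="cp437", decoded_as="cp437"))
```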
Original Comments
Feedback Bot on 9/20/2020, 11:00 PM:
We have directed your feedback to the appropriate engineering team for further evaluation. The team will review the feedback and notify you about the next steps.
Original Solutions
Tarek Mahmoud Sayed [MSFT] solved on 9/30/2020, 01:44 PM, 0 votes:
Thanks for sending the issue. I tried the same thing and was not able to reproduce it. My guess is that the problem is not really the ZipArchiveEntry.Name content; it could be the Visual Studio Locals window displaying the string differently, which can depend on the configuration of your machine. I usually see this depend on the default code page of your system.
Here is the code I tried:
And this printed the output:
which is correct.
I suggest you try the same, either by manually printing the characters' ordinal values or by checking your system configuration, such as the default locale and code page.
on 10/8/2020, 05:05 AM:
(private comment, text removed)
Tarek Mahmoud Sayed [MSFT] on 10/8/2020, 10:10 AM:
Thanks for your reply. Unfortunately I cannot reach the files that you have attached. Could you please try to upload them again and let me know.
Also, please attach the code that you used to create the zip file again, so I can see how you included the file tämä.txt.
on 10/9/2020, 02:29 AM:
(private comment, text removed)
Tarek Mahmoud Sayed [MSFT] on 10/9/2020, 11:01 AM:
Thanks for the details. I'll take a look.