Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.IO.Compression handles extended characters incorrectly #43231

Open
vsfeedback opened this issue Oct 9, 2020 · 18 comments
Open

System.IO.Compression handles extended characters incorrectly #43231

vsfeedback opened this issue Oct 9, 2020 · 18 comments
Labels
area-System.IO.Compression bug help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@vsfeedback
Copy link

This issue has been moved from a ticket on Developer Community.


[severity:It's more difficult to complete my work]
create a text file name "tämä.txt"
send to compressed folder
Extracting this file it is read by library as "t„m„.txt"

Windows explorer expands the zip fine, as other compression software's

using System.IO;
using System.IO.Compression;

namespace ZIPtest
{
    class Program
    {
        static void Main(string[] args)
        {
            string zipFile = @"C:\temp\tämä.zip";

using (ZipArchive paketti = ZipFile.OpenRead(zipFile))
            {
                foreach (ZipArchiveEntry entry in paketti. Entries)
                {
                    string nimi = Path.Combine(@"c:\temp", entry. Name);
                    entry. ExtractToFile(nimi);
                }

}
        }
    }
}

Original Comments

Feedback Bot on 9/20/2020, 11:00 PM:

We have directed your feedback to the appropriate engineering team for further evaluation. The team will review the feedback and notify you about the next steps.


Original Solutions

Tarek Mahmoud Sayed [MSFT] solved on 9/30/2020, 01:44 PM, 0 votes:

Thanks for sending the issue. I have tried it the same thing and I was not able to reproduce the issue. My guess here is the problem is not really the ZipArchiveEntry.Name content but it could be the Visual Studio locals Window displaying the string differently and this can be depending on the configuration of your machine. Usually I am seeing this maybe depending on the default codepage on your system.

Here is the code I tried:

        using (ZipArchive paketti = ZipFile.OpenRead(@"F:\temp\release.zip"))
        {
            foreach (ZipArchiveEntry entry in paketti.Entries)
            {
                Console.WriteLine($"{entry.Name} ... {DumpString(entry.Name)}");
            }
        }

And this printed the output:

  tämä.txt ... \u0074\u00e4\u006d\u00e4\u002e\u0074\u0078\u0074

which is correct.

I suggest you can try the same either manually printing the characters ordinal values and check your system configuration like default locale and codepage in the system.

on 10/8/2020, 05:05 AM:

(private comment, text removed)

Tarek Mahmoud Sayed [MSFT] on 10/8/2020, 10:10 AM:

Thanks for your reply. Unfortunately I cannot reach the files that you have attached. Could you please try to upload them again and let me know.

Also, please attach the code that you used to create the zip file again to see how did you include the file tämä.txt too.

on 10/9/2020, 02:29 AM:

(private comment, text removed)

Tarek Mahmoud Sayed [MSFT] on 10/9/2020, 11:01 AM:

Thanks for the details. I'll take a look.

@tarekgh
Copy link
Member

tarekgh commented Oct 9, 2020

I have looked at this issue and I am seeing the Zip archive entry names encoding is not handled correctly. When creating a new archive file there will be a generic flag telling if the archive is encoded using UTF-8 or not. if this flag is off, means the archive is not encoded using UTF-8, we don't handle the right encoding at that time. the following comment has more details:

// The Zip file spec [http://www.pkware.com/documents/casestudies/APPNOTE.TXT] specifies a bit in the entry header

Usually when UTF-8 encoding is not used, we should consider using the 437 encoding instead. Or try to investigate more the details as described in https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT in Appendix D.

This issue is kind of obvious to the users because when using Windows shell to create a new Archive, it doesn't create the archive using UTF-8 and I am seeing it is using encoding 437.

@jyrijh
Copy link

jyrijh commented Oct 13, 2020

Thank you. Now I know what is happening and this will be fixed.

But in the mean time I could do this, and force codepage 437 to get correct value from _storedEntryNameBytes.
entry.Name stays incorrect.

I have to test if this works correctly in all cases

string zipFile = @"C:\temp\tämä.zip";            
var ibm437 = Encoding.GetEncoding("IBM437");

using (ZipArchive paketti = ZipFile.OpenRead(zipFile))
{
    var entryEncoding = paketti.GetType().
        GetProperties(BindingFlags.NonPublic | BindingFlags.Instance).
        Single(pi => pi.Name == "EntryNameEncoding");

    entryEncoding.SetValue(paketti, ibm437);

    foreach (ZipArchiveEntry entry in paketti.Entries)
    {
        var name = entry.GetType().GetFields(BindingFlags.NonPublic | BindingFlags.Instance).Single(pi => pi.Name == "_storedEntryNameBytes");
        var nameValue = name.GetValue(entry);
        var result = ibm437.GetString((byte[])nameValue);

        Console.WriteLine($"{entry.Name} ... {DumpString(entry.Name)}");
        Console.WriteLine($"{result} ... {DumpString(result)}");
    }
}

@jyrijh
Copy link

jyrijh commented Oct 13, 2020

ZipFile.Open(String, ZipArchiveMode, Encoding)

Now I just need to figure out what encoding has been used

@jyrijh
Copy link

jyrijh commented Oct 13, 2020

With this I came up. Minimal detection for encoding and after quick testing it works.

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Runtime.InteropServices;
using System.Text;

namespace ZIPtest
{
    class Program
    {
        [Flags]
        public enum GeneralPurpose : ushort
        {
            None = 0,
            bit00 = 1 << 0,
            bit01 = 1 << 1,
            bit02 = 1 << 2,
            bit03 = 1 << 3,
            bit04 = 1 << 4,
            bit05 = 1 << 5,
            bit06 = 1 << 6,
            bit07 = 1 << 7,
            bit08 = 1 << 8,
            bit09 = 1 << 9,
            bit10 = 1 << 10,
            languageEncoding = 1 << 11,
            bit12 = 1 << 12,
            bit13 = 1 << 13,
            bit14 = 1 << 14,
            bit15 = 1 << 15
        }

        [StructLayout(LayoutKind.Sequential, Pack = 1)]
        struct Header
        {
            [MarshalAs(UnmanagedType.ByValArray, SizeConst = 4)]
            public byte[] signature;
            public ushort version;
            public GeneralPurpose generalPurposeBit;
        }

        static void Main()
        {
            string filename = @"C:\temp\tämä.zip";
            var encoding = DetectEncoding(filename);
            CheckNames(filename, encoding);

            filename = @"C:\temp\漢字.zip";
            encoding = DetectEncoding(filename);
            CheckNames(filename, encoding);

            void CheckNames(string zipName, Encoding zipEncoding)
            {
                using (ZipArchive paketti = ZipFile.Open(zipName, ZipArchiveMode.Read, zipEncoding))
                {
                    foreach (ZipArchiveEntry entry in paketti.Entries)
                    {
                        Console.WriteLine($"{entry.Name} ... {DumpString(entry.Name)}");
                    }
                }
            }

            string DumpString(string name) => string.Join(", ", name.ToArray().Select(x => $"{(int)x:X4}"));
        }


        static Encoding DetectEncoding(string file)
        {
            var data = new byte[8];
            using (FileStream sr = new FileStream(file, FileMode.Open))
                sr.Read(data, 0, 8);

            GCHandle gch = GCHandle.Alloc(data, GCHandleType.Pinned);
            Header header = (Header)Marshal.PtrToStructure(gch.AddrOfPinnedObject(), typeof(Header));
            gch.Free();

            if (header.generalPurposeBit.HasFlag(GeneralPurpose.languageEncoding))
                return Encoding.UTF8;
            else
                return Encoding.GetEncoding("IBM437");
        }
    }
}

@jyrijh
Copy link

jyrijh commented Oct 13, 2020

And no it didn't work as intended.

if zip file has these in it, they will appear in this order, and my code will get encoding from first entry.

aa 漢字.txt
äÄöÖåÅüÜéÉ.txt
漢字.txt

And then the second filename is actually 437 encoded.
IO.Compression will handle everything with same encoding when it should handle it per entry.

@tarekgh
Copy link
Member

tarekgh commented Oct 13, 2020

ZipFile.Open(String, ZipArchiveMode, Encoding) can be used to choose the specific encoding and should help if you know the encoding.

For #43231 (comment), this may work in some cases but not in general cases as you indicated later. if you look at pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT it says

D.3 Applications MAY choose to supplement this file name storage through the use 
of the 0x0008 Extra Field.  Storage for this optional field is currently 
undefined, however it will be used to allow storing extended information 
on source or target encoding that MAY further assist applications with file 
name, or file content encoding tasks.  Please contact PKWARE with any
requirements on how this field SHOULD be used.

@adamsitnik adamsitnik modified the milestones: 6.0.0, 7.0.0 Jul 26, 2021
@jyrijh
Copy link

jyrijh commented Nov 29, 2021

After giving this some thought, exposing the header for entries would be suitable solution for this. Then it would be up to programmer to check those values if they are interested in them.

@cremor
Copy link

cremor commented Jan 10, 2022

I also had ZIP entry encoding problems recently and this is what I figured out:

  • Archives created by .NET itself can be extracted without specifying an encoding just fine, so the automatically detected encoding is correct for those.
  • Archives created by the Windows shell extension or by 7-Zip (on Windows) can not correctly be extracted with .NET without specifying an encoding manually.
    More information here: https://stackoverflow.com/a/32443735/631802
  • Archives created on macOS also can not be correctly extracted with .NET. This is also true if the Windows workaround is applied.
    This means that it is not possible to extract both Windows and macOS archives with the same encoding workaround.

I think the runtime should handle all the encoding detection work.

@jyrijh
Copy link

jyrijh commented Jan 10, 2022

This is exactly why the language encoding bit should be exposed. if it is other that UTF-8 then runtime has no way of knowing what the codepage should be. Me as developer can "guess" what codepage to use by knowing where the file is coming from.

@cremor
Copy link

cremor commented Jan 10, 2022

if it is other that UTF-8 then runtime has no way of knowing what the codepage should be

How are standalone archiving tools like 7-Zip are handling that case then? 7-Zip can extract all archives that I tested with correct encodings. If 7-Zip can do it then the .NET runtime can do it for sure too?

@jyrijh
Copy link

jyrijh commented Jan 12, 2022

If you create an archive that has any other encoding that UTF-8 or IBM437 it doesn't.

7-zip will extract this ")=.txt" as "]~Kúºú"

So I need to know where the file came from and have knowledge what codepage they used.
Usually it should be UTF-8 or IBM437, but even then I need to know then per entry what the language encoding bit is

var zipFile = @"c:\temp\test.zip";
var name = ")=.txt";
var encoding = Encoding.GetEncoding("IBM297");

using (FileStream stream = new FileStream(zipFile, FileMode.OpenOrCreate))
{
    using (ZipArchive archive = new ZipArchive(stream, ZipArchiveMode.Update, false, encoding))
    {
        var entry = archive.CreateEntry(name);
        using (StreamWriter writer = new StreamWriter(entry.Open()))
        {
            writer.WriteLine("text");
        }
    }
}

@cremor
Copy link

cremor commented Jan 12, 2022

Yeah ok, but I meant archives that were created with "standard" tools. .NET should be able to handle those automatically.

@cremor
Copy link

cremor commented Jan 27, 2022

I just found this comment in the .NET source code: https://source.dot.net/#System.IO.Compression/System/IO/Compression/ZipArchive.cs,356

What's weird about that is that it seems to be wrong. Specifically the part where it says

For instance, the Windows Shell Zip tool takes "something else" to mean "the local system codepage". We default to the same behaviour, [...]

In my experience this is not the case, see #43231 (comment). The Windows Shell Zip tool seems to use "IBM850" (not "IBM437"!) encoding (on Windows 10). And because .NET does not default to that (but instead defaults to "the current system default code page"), those Windows zip files can not be correctly extracted by .NET.

Maybe .NET just understands "the current system default code page" different than Windows. As written, Windows seems to use "IBM850" encoding, which is available in .NET as CultureInfo.CurrentCulture.TextInfo.OEMCodePage. But .NET seems to think that the default is "Windows-1252" (available in Encoding.Default).

Anyway, even if it would work by default for zip files created on Windows, we would still have problems with zip files created on macOS.

@tarekgh
Copy link
Member

tarekgh commented Jan 27, 2022

Maybe .NET just understands "the current system default code page" different than Windows. As written, Windows seems to use "IBM850" encoding, which is available in .NET as CultureInfo.CurrentCulture.TextInfo.OEMCodePage. But .NET seems to think that the default is "Windows-1252" (available in Encoding.Default).

If you are running with full framework, default code page would be picked up from Windows API GetACP. If running with .NET or .NET Core, the default codepage would be UTF-8. to get the default system codepage in .NET, need to register the codepages encoding as described in https://docs.microsoft.com/en-us/dotnet/api/system.text.codepagesencodingprovider?view=net-6.0#remarks and then call Encoding.GetEncoding(0).

@cremor
Copy link

cremor commented Jan 27, 2022

I tested on .NET Framework 4.8.

But as far as I understand it .NET Core/5/6 will have the same problem, right? Because neither uses the same ZIP entry name encoding by default that Windows uses.

@tarekgh
Copy link
Member

tarekgh commented Jan 27, 2022

But as far as I understand it .NET Core/5/6 will have the same problem, right?

Yes. I was just explaining how Encoding picks the default.

@jeffhandley jeffhandley modified the milestones: 7.0.0, 8.0.0 Jul 9, 2022
@carlossanlop carlossanlop added the help wanted [up-for-grabs] Good issue for external contributors label May 26, 2023
@ViktorHofer ViktorHofer modified the milestones: 8.0.0, Future Jul 12, 2023
@caverna
Copy link

caverna commented Jul 14, 2023

I tested on .NET Framework 4.8.

But as far as I understand it .NET Core/5/6 will have the same problem, right? Because neither uses the same ZIP entry name encoding by default that Windows uses.

yep! I reached this page having the same issue and it fails for .NET Core 7

@Stallind
Copy link

Stallind commented Jan 24, 2024

Commeting here since I found the thread googling about the same issue.

For .net framework 4.8 it works for me using Encoding.GetEncoding(System.Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage).

For .net core I used a somewhat ugly hack where I replace the "illegal characters". In this case swedish åäö.

var replacements = new Dictionary<string,string> { 
    { "\u0086", "å" },
    { "\u008f", "Å"},
    { "\u0084", "ä"},
    { "\u008e", "Ä"},
    { "\u0094", "ö"},
    { "\u0099", "Ö"} 
};
using (ZipArchive source = System.IO.Compression.ZipFile.Open(@"C:\New Folder\testing.zip", ZipArchiveMode.Read, Encoding.GetEncoding("ISO-8859-1")))
{
    foreach (ZipArchiveEntry entry in source.Entries)
    {
        var name = entry.Name;
        foreach (var replacement in replacements)
        {
            name = name.Replace(replacement.Key, replacement.Value);
        }
        Console.WriteLine(name);
    }
}

I can't say this is the way to go and that I recommend it, but it solved the issue for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.IO.Compression bug help wanted [up-for-grabs] Good issue for external contributors
Development

No branches or pull requests

10 participants