Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.IO.Packaging.Package should have a way to handle common errors with malformed packages #26084

Open
twsouthwick opened this issue May 4, 2018 · 17 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Compression
Milestone

Comments

@twsouthwick
Copy link
Member

twsouthwick commented May 4, 2018

System.IO.Packaging.Package provides a simple abstraction over a container to hold multiple objects in a single entity. This is commonly used in nupkg packages, vsix extensions, appx, and also Office documents. The package format makes heavy use of URIs to maintain relationships between various components. However, Office will allow a user to generate an invalid URI for one of these relationships, and then opening via System.IO.Packaging fails with no easy way of recovery besides manually updating the file, which necessitates a deeper understanding of the file format that should be abstracted by the library.

This is to work around a breaking change that occurred between 4.0->4.5. This proposal is to provide a way to work around that change.

Example

There are multiple issues filed on the OpenXml SDK project (which abstracts the Office format to strongly typed classes) and related projects where users have been hit by this. For example:

A simple Office document that will fail can be created by the following:

  1. Create a new Word document
  2. Add a hyperlink to an invalid URI (ie mailto:one@)
  3. Try to load with Package:
using (var package = Package.Open(pathToDoc))
{
    foreach (var part in package.GetParts())
    {
        part.GetRelationships();
    }
}

This will fail with the following exception:

System.UriFormatException
  HResult=0x80131537
  Message=Invalid URI: The hostname could not be parsed.
  Source=System.Private.Uri
  StackTrace:
   at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
   at System.Uri..ctor(String uriString, UriKind uriKind)
   at System.IO.Packaging.InternalRelationshipCollection.ProcessRelationshipAttributes(XmlCompatibilityReader reader)
   at System.IO.Packaging.InternalRelationshipCollection.ParseRelationshipPart(PackagePart part)
   at System.IO.Packaging.InternalRelationshipCollection..ctor(Package package, PackagePart part)
   at System.IO.Packaging.PackagePart.EnsureRelationships()
   at System.IO.Packaging.PackagePart.GetRelationshipsHelper(String filterString)
   at PackagingExample.Program.Main(String[] args) in c:\users\tasou\source\repos\PackagingExample\PackagingExample\Program.cs:line 15

Proposed API

namespace System.IO.Packaging
{
    public abstract class Package // Existing class
    {
        public static Package Open(string path, FileShare packageShare, PackageSettings settings);
        
        public static Package Open(string path, PackageSettings settings);

        public static Package Open(Stream stream, PackageSettings settings);
    }

    public class PackageSettings
    {
        public FileMode PackageMode { get; set; }

        public FileAccess PackageAccess { get; set; }

        public IUriProvider UriProvider { get; set; }
    }

    public interface IUriProvider
    {
        Uri CreateUri(string uriString, UriKind uriKind);
    }
}

Details

  • The PackageSettings class would allow for easily adding additional settings at a later point without adding more factory methods (i.e. Package.Open(...)). Since the FileMode and FileAccess properties are used on all factory methods, these are added to the settings object.
  • The Package.Open(...) methods listed extend the current factory methods that take a Stream or a path with an optional FileShare.
  • The default implementations of IUriProvider would essentially be this line: https://source.dot.net/#System.IO.Packaging/System/IO/Packaging/InternalRelationshipCollection.cs,363
@twsouthwick
Copy link
Member Author

@karelz How can I get this proposal approved? I'm willing to implement it - many users of the Open XML SDK would appreciate this

@karelz
Copy link
Member

karelz commented Aug 20, 2018

@JeremyKuhne @pjanotti is it something you can help shepherd further? If it is out of scope of your knowledge, let's just ask @twsouthwick to bring it to API review and discuss it there ...

@JeremyKuhne
Copy link
Member

What I would like to see is how this proposal would be used to solve the given problem. How does an IUriProvider get context about the string it is given? Is context important/useful? What does the provider give back for a Uri when the input is garbage?

@twsouthwick
Copy link
Member Author

twsouthwick commented Jul 22, 2019

@JeremyKuhne I don't envision the context being that important. In .NET 4.5 Uri started throwing exceptions on malformed Uris instead of just passing them through. As people start moving to .NET Core, they'll hit this more and more on office documents in DocumentFormat.OpenXml that used to work. This allows them to either sanitize the entry, or potentially subclass Uri to hold the original and also sanitize it so it'll open the package correctly.

So, this could be used by the following:

Package.Open("path", new PackageSettings { UriProvider = new MalformedProvider() });

class MalformedProvider : IUriProvider
{
  Uri CreateUri(string uriString, UriKind uriKind)
  {
    if(Uri.TryCreate(uriString, uriKind, out var result))
    {
        return result;
    }
    else
    {
        return new MalformedUri(uriString, uriKind);
    }
  }
}

internal class MalformedUri : Uri
{
    public MalformedUri(string uriString, UriKind uriKind)
        : base("/", uriKind)
    {
        Original = uriString;
    }

    public string Original { get; }
}

Another route that could be taken to fix this would be to allow Uri to have non-conformant strings so it does not throw on creation. This could be triggered from some sort of options to Uri. That would make it act like pre-.NET 4.5, but not sure what kind of impact that would have on Uri.

@ThomasBarnekow
Copy link

@JeremyKuhne, I support @twsouthwick's statement that context would not be important. In practice, this often happens because Excel, for example, interprets some text containing an "@" as a mailto URI and happily creates a hyperlink with a malformed URI. Therefore, it would be totally fine to replace those malformed URIs with pretty much any well-formed URI.

For example, since we have been waiting for a fix for quite some time now, I implemented my own workaround in which I am using http://was-invalid-hyperlink/do-not-click to replace whatever invalid hyperlink was found. Once those hyperlinks are fixed, the document can be opened with the Open XML SDK.

@ThomasBarnekow
Copy link

ThomasBarnekow commented Dec 23, 2019

@JeremyKuhne and @twsouthwick, I thought about this a little more and came up with the idea to encode the malformed URI in the MalformedUri instead of using "/". I've done this in my own workaround as follows:

public static Uri FromMalformedUriString([NotNull] string malformedUri)
{
    if (malformedUri == null) throw new ArgumentNullException(nameof(malformedUri));

    string unescapedMalformedUri = Uri.UnescapeDataString(malformedUri);
    string escapedMalformedUri = Uri.EscapeDataString(unescapedMalformedUri);

    return new Uri("http://malformed-uri/?value=" + escapedMalformedUri);
}

The original malformed URI can be retrieved as follows:

public static string GetMalformedUriString([NotNull] Uri uri)
{
    if (uri == null) throw new ArgumentNullException(nameof(uri));

    if (uri.Scheme != "http") throw new ArgumentException();
    if (uri.Host != "malformed-uri") throw new ArgumentException();
    if (uri.AbsolutePath != "/") throw new ArgumentException();

    string query = uri.Query;
    if (string.IsNullOrEmpty(query)) throw new ArgumentException();
    if (!query.StartsWith("?value=")) throw new ArgumentException();

    string escapedDataValue = query.Substring(7);
    string unescapeDataValue = Uri.UnescapeDataString(escapedDataValue);
    return Uri.EscapeUriString(unescapeDataValue);
}

And here's a unit test that tests this with some sample malformed URIs from actual Excel workbooks (noting that I created the third one to test unencoded strings).

[Theory]
[InlineData("mailto:CM@SAP:%20Create/Change%20Work%20Center")]
[InlineData("mailto:CM@SAP:%20User%20Assignement")]
[InlineData("mailto:CM@SAP: Create/Change Work Center")]
public void CanCreateWellFormedUris(string malformedUri)
{
    Assert.Throws<UriFormatException>(() => new Uri(malformedUri));

    // Create a valid Uri that encodes the malformed URI.
    Uri validUri = MalformedUriFixer.FromMalformedUriString(malformedUri);
    string malformedUriString = MalformedUriFixer.GetMalformedUriString(validUri);

    _output.WriteLine("Original malformed URI: " + malformedUri);
    _output.WriteLine("Encoded malformed URI:  " + validUri.AbsoluteUri);
    _output.WriteLine("Restored malformed URI: " + malformedUriString);

    Assert.Equal(
        Uri.EscapeUriString(Uri.UnescapeDataString(malformedUri)),
        malformedUriString);
}

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@lindexi
Copy link
Contributor

lindexi commented May 10, 2020

I think we can add the IPackage interface like this code 00a2d82

And we can custom handle exception with the custom class that inherit the IPackage interface

See dotnet/Open-XML-SDK#715 (comment)

@twsouthwick
Copy link
Member Author

@JeremyKuhne I've taken another look at this since OpenXml users have been hitting it. After playing around with usage, I've updated my proposal to the following new APIs for System.IO.Packaging. I'm happy to take this throw the API proposal process, but haven't done that before, so any guidance would be helpful.

namespace System.IO.Packaging
{
    public abstract partial class Package
    {
        protected Package(PackageSettings settings) { }
        public static System.IO.Packaging.Package Open(System.IO.Stream stream, PackageSettings settings) { throw null; }
        public static System.IO.Packaging.Package Open(string path, PackageSettings settings) { throw null; }
    }

    public partial class PackageSettings
    {
        public FileMode PackageMode { get { throw null; } set { throw null; } }
        public FileAccess PackageAccess { get { throw null; } set { throw null; } }
        public FileShare PackageShare { get { throw null; } set { throw null; } }
        public IUriFactory UriFactory { get { throw null; } set { throw null; } }
    }

    public interface IUriFactory
    {
        System.Uri CreateUri(string url, System.UriKind kind);
        string GetOriginalString(System.Uri uri);
    }

    public class DefaultUriFactory : IUriFactory
    {
        public static IUriFactory Instance => throw null;

        protected DefaultUriFactory() { }

        public virtual Uri CreateUri(string url, UriKind kind) => throw null;

        public virtual string GetOriginalString(Uri uri) => throw null;
    }
}

The PackageSettings class is to minimize the number of overloads of Open(...) that need to be added and allow for simpler additions in the future.

This would allow someone to intercept the creation and writing of the Uri instances so that they can do whatever they want with it. For example, here are some test cases using it:

[Fact]
public void MalformedHyperlinkThrows()
{
    using (var package = Package.Open("malformed_hyperlink.xlsx"))
    {
        var part = package.GetPart(new Uri("/xl/worksheets/sheet1.xml", UriKind.Relative));
        Assert.Throws<UriFormatException>(() => part.GetRelationships());
    }
}

[Fact]
public void MalformedHyperlinkHandle()
{
    const string ExpectedUri = "mailto:mailto@one@";
    var settings = new PackageSettings
    {
        UriFactory = new MalformedUriFactory()
    };

    using (var package = Package.Open("malformed_hyperlink.xlsx", settings))
    {
        var part = package.GetPart(new Uri("/xl/worksheets/sheet1.xml", UriKind.Relative));
        var relationship = Assert.Single(part.GetRelationships());

        var malformed = Assert.IsType<MalformedUri>(relationship.TargetUri);
        Assert.Equal(ExpectedUri, malformed.Uri);
    }
}

[Fact]
public void MalformedHyperlinkRoundtrip()
{
    const string InvalidUri = "mailto:mailto@one@";

    var settings = new PackageSettings
    {
        UriFactory = new MalformedUriFactory(),
        PackageMode = FileMode.OpenOrCreate,
        PackageAccess = FileAccess.ReadWrite
    };

    using (var stream = new MemoryStream())
    {
        using (var package = Package.Open(stream, settings))
        {
            Assert.Empty(package.GetRelationships());
            package.CreateRelationship(new MalformedUri(InvalidUri), TargetMode.External, "relationship");
        }

        using (var package = Package.Open(stream, settings))
        {
            var relationship = Assert.Single(package.GetRelationships());
            var malformed = Assert.IsType<MalformedUri>(relationship.TargetUri);
            Assert.Equal(InvalidUri, malformed.Uri);
        }
    }
}

private class MalformedUriFactory : IUriFactory
{
    public Uri CreateUri(string url, UriKind kind)
    {
        if (Uri.TryCreate(url, kind, out var result))
        {
            return result;
        }

        return new MalformedUri(url);
    }

    public string GetOriginalString(Uri uri)
    {
        if (uri is MalformedUri malformed)
        {
            return malformed.Uri;
        }

        return uri.OriginalString;
    }


internal class MalformedUri : Uri
{
    public MalformedUri(string uriString)
        : base("http://unknown")
    {
        Uri = uriString;
    }

    public string Uri { get; }

    public override string ToString() => Uri;
}

@JeremyKuhne
Copy link
Member

@twsouthwick Thanks for putting time into this. Note that I'm no longer on the libraries team, @carlossanlop can help move this along.

@lindexi
Copy link
Contributor

lindexi commented May 22, 2020

Is there any updated news?

@carlossanlop carlossanlop modified the milestones: 5.0.0, Future Jun 18, 2020
@twsouthwick
Copy link
Member Author

@carlossanlop Any thoughts on this API? I haven't done an API review, but happy to help move something like this along. As more users are moving to .NET Core/.NET 5, this will become a problem for many LOB apps that rely on DocumentFormat.OpenXml.

@twsouthwick
Copy link
Member Author

@danmosemsft Similar #1544, this one is blocking people from moving to .NET Core/.NET 5 who use DocumentFormat.OpenXml as documents that would open just fine using System.IO.Packaging when targeting < .NET 4.5 but now fail to open with no way to handle it.

@danmoseley
Copy link
Member

cc @ericstj whose team owns these API's and can evaluate. Unless it's bug fix level, it would be 6.0 work at this point.

@ericstj
Copy link
Member

ericstj commented Jul 27, 2020

Correct, the soonest this could go in is 6.0.

Some thoughts on the proposal. It sounds like the primary purpose of this issue is to add an IUriFactory that can be specified when opening a package. This seems like a very advanced scenario, and I don't see the benefit of adding so much API to support it. I'd recommend starting with a more minimal proposal that minimally enables the scenario. Maybe highlight the minimal solution then contrast a bigger solution.

For example: what if you just added a 1 overload to Open with default parameters? EG: Open(Stream stream, FileMode packageMode = FileMode.OpenOrCreate, FileAccess packageAccess = FileAccess.ReadWrite, IUriFactory = null). Also consider what is minimally needed in the interface. The method which goes from URI back to string seems unusual to me since URI already has an "OriginalString". The implications of mapping back from the Uri imply you're returning types derived from Uri, for round tripping?

Another concern I have: why are we adding API to deal with malformed packages? Can you instead just convert those packages so that they are no longer malformed? For example, go through and replace any malformed URL with a URL-encoded string. That seems like the typical thing we'd recommend folks to do when we discovered a bad format and want them to get off of it. /cc @dotnet/ncl on the Uri details.

@twsouthwick
Copy link
Member Author

Thanks @ericstj. I've added a tool into the OpenXml library to handle sanitizing it (dotnet/Open-XML-SDK#793). Downside is that this requires rewriting the document, which may have only been read only before. However, it is interesting that the format itself appears to handle malformed Uris just fine and don't have the restrictions that System.Uri has. ie Word/Excel/etc happily generate files with these values.

The roundtripping here is so that the values can be rewritten so the document can be read without modifying anything. OriginalString can't be overwritten to output the original, malformed Uri, so a derived type could be used to ensure the roundtripping. This would allow the OpenXml library to transparently handle the change so people don't have to modify (potentially old) LOB apps to handle this case.

Having a way to parse a Uri with a relaxed mode similar to pre-.NET 4.5 would be ideal, so hopefully @dotnet/ncl can give some feedback for that.

@karelz
Copy link
Member

karelz commented Aug 20, 2020

Regarding Uri class:

  • It is fairly complicated code. It is sensitive code and it was source of quite a few security issues since Uri class was born. As a result, our goal is to drive it towards maintainability (using more of safe code and simpler code).
  • There were dozens of bug fixes (usually required by many customers) over the years in .NET Framework and in .NET Core to have stricter Uri parsing, compliant with RFC instead of lax parsing (as suggested here).
    • Note: Some of them broke existing customers, for which we usually had opt-out switches. We had 2 customers approaching us in last couple of years, asking us to bring some of the opt-out switches back into .NET Core (they both had large databases with Uri which they apparently couldn't easily migrate to be compliant - similar to this case). After analysis, we concluded it would make the code frozen and we couldn't innovate or improve perf in the code base, therefore we rejected the idea.

In summary, I would strongly urge against making Uri more lax in parsing (pre-.NET 4.5 times), or complicate the code significantly.
On the other hand, there is slowly growing demand for non-compliant Uri representation and I wonder if we could achieve that via other ways -- either the suggested IUriFactory, or support custom parsing, and/or custom error handling / fallback parsing in the Uri class itself as an advanced extension mechanism. Higher-level libraries (e.g. System.IO.Packaging) would have to then pass through the extension mechanism to allow such customization by the callers (OpenXML in this case).
@MihaZupan do you have any thoughts on this?

@petarpetrovt
Copy link
Contributor

Is there any updated news?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Compression
Development

No branches or pull requests

10 participants