Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DllImportGenerator] Update GeneratedDllImportAttribute handling of character set / encoding #61326

Closed
Tracked by #60595
elinor-fung opened this issue Nov 8, 2021 · 17 comments

Comments

@elinor-fung
Copy link
Member

elinor-fung commented Nov 8, 2021

For p/invokes, the character set is encoded into the metadata for the method. As a result, adding anything, like UTF-8, is complex and far-reaching. The current experience (ANSI means UTF-8 on Unix) is odd and confusing. The p/invoke source generator should be used to improve this experience.

We’d like to:

  • Avoid proliferating the pattern of ‘use CharSet.Ansi on Unix to get UTF-8
  • Allow specifying the character set to use for all parameters a method
    • Instead of needing MarshalAs on each parameter
  • Avoid adding to the CharSet enumeration
    • Don’t want inconsistent support and don’t want to implement new support in all the places that currently use it

Our current thinking is to:

Example:

// UTF-8 - equivalent to explicitly specifying [MarshalAs(UnmanagedType.LPUTF8Str)] on string parameters
[GeneratedDllImport("lib", MarshalStringsUsing = typeof(System.Runtime.InteropServices.Encoding.Utf8StringMarshalling))]
public static partial int Method(string s);

// UTF-16 - equivalent to CharSet.Unicode behaviour in built-in
[GeneratedDllImport("lib", MarshalStringsUsing = typeof(System.Runtime.InteropServices.Encoding.Utf16StringMarshalling))]
public static partial int Method(string s);

// Error - invalid encoding
[GeneratedDllImport("lib", MarshalStringsUsing = typeof(int))]
public static partial int Method(string s);

// User-defined marshalling
[GeneratedDllImport("lib", MarshalStringsUsing = typeof(MyCustomMarshal.Wtf8String))]
public static partial int Method(string s);

Where:

// .NET can provide:
namespace System.Runtime.InteropServices.Encoding
{
    // UTF-16 with endianness based on the current platform
    struct Utf16StringMarshalling { ... }

    // UTF-8
    struct Utf8StringMarshalling { ... }

    // ANSI
    [SupportedOSPlatform("windows")]
    struct AnsiStringMarshalling { ... }

    ...
}

// User can define:
namespace MyCustomMarshal
{
    struct Wtf8String { ... }
}

Other considerations:

  • Naming: Unicode vs Utf16
    • .NET has usually used the (Windows-centric) term Unicode to refer to UTF-16. Naming the struct ...Utf16StringMarshalling would be correct and in line with our cross-platform focus, but UnicodeStringMarshalling would be more consistent with existing APIs.
  • Auto (UTF-8 on Unix, UTF-16 on Windows);
    • We expect usage to be low. If necessary, users can define different p/invokes and call the desired one conditionally (for example, using the OperatingSystem APIs)
  • Defaults:
    • The source generator requires specifying marshalling information for string/char.
      • Requires the intention to be made clear and removes hidden assumptions, but can make declarations more verbose
    • The source generator does not check / reconcile higher level settings like DefaultCharSetAttribute.
  • ExactSpelling: uses CharSet to probe for entry point on Windows, doesn’t mean anything on Unix
    • The source generator could require exact spelling for entry point names
      • Would be in the spirit of avoiding propagating some of the Windows-centric aspects of DllImport

@AaronRobinsonMSFT @jkoritzinsky @jkotas @stephentoub

@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Nov 8, 2021
@elinor-fung elinor-fung added area-System.Runtime.InteropServices and removed untriaged New issue has not been triaged by the area owner labels Nov 8, 2021
@elinor-fung elinor-fung changed the title Update GeneratedDllImportAttribute handling of character set / encoding [DllImportGenerator] Update GeneratedDllImportAttribute handling of character set / encoding Nov 8, 2021
@elinor-fung elinor-fung added this to the 7.0.0 milestone Nov 8, 2021
@jkotas
Copy link
Member

jkotas commented Nov 8, 2021

the character set is encoded into the metadata for the method. As a result, adding anything, like UTF-8, is complex and far-reaching

That is correct for the built-in system. However, we can add new values into CharSet that are only recognized by the source-generators, and ignored by the built-in system.

[GeneratedDllImport("lib", Encoding = nameof(System.Text.Encoding.UTF8))]

Why not make UTF8 the default? UTF8 encoding won the world. It should be the default.

Windows has been moving away from ANSI for awhile

Yes, for new APIs. Existing APIs are stuck on the ANSI/Unicode plan. Do we have any data on how many Windows APIs are ANSI-only and thus the ANSI support is required to invoke them?

What is going to be the encoding property that maps to CharSet.Unicode? Encoding.Unicode does not quite make sense since it means little-endian UTF-16, but CharSet.Unicode means machine-endian UTF-16.

It would be useful to enumerate encodings that we expect people are typically going to use and what the suggested replacement is going to be. Here is my list:

  • UTF-8
  • Machine endian UTF-16 (CharSet.Unicode)
  • ANSI (Windows)
  • C++ wchar_t: Machine endian UTF-32 on Unix / machine endian UTF-16 on Windows: Marshaling wchar_t on Linux uses wrong size #30906
  • Machine endian UTF-32
  • Varying strictness of all of the above.

@AaronRobinsonMSFT
Copy link
Member

Yes, for new APIs. Existing APIs are stuck on the ANSI/Unicode plan. Do we have any data on how many Windows APIs are ANSI-only and thus the ANSI support is required to invoke them?

This would be good to confirm, but my gut tells me it is either stable at a small number or declining. Based on this assumption, I'd argue for using the MarshalUsing logic that @jkoritzinsky has been championing and provide a more verbose solution. We could provide an ANSIStringMarshal public API that would allow people to move onto the source generator without compromising the Encoding class approach.

The list that @jkotas provided could also be used to either dictate a new public APIs using MarshalUsing or add to Encoding's static properties.

@elinor-fung
Copy link
Member Author

add new values into CharSet that are only recognized by the source-generators, and ignored by the built-in system

The thought was that it would be confusing, since I think the natural reaction to a new value on CharSet would be that it should work with the built-in system. We could do something like add an analyzer to warn/error, but it seems like it would still be confusing.

a new public APIs using MarshalUsing or add to Encoding's static properties

Machine endian ones seem like they might be odd on Encoding's static properties, since we already point at BitConverter.IsLittleEndian normally to get/create the desired Encoding instance. To still have a shorthand (instead of needing MarshalUsing on everything) for common ones like machine endian UTF-16, we could also emit some constant (similar to what we're saying for Auto) for them.

@jkotas
Copy link
Member

jkotas commented Nov 9, 2021

Machine endian ones seem like they might be odd on Encoding's static properties,

Agree. The properties on Encoding type are focused on wire formats. Interop marshaling is about in-memory formats and so the common set is going to be different. UTF8 seems to be the only one that is in both sets.

we could also emit some constant (similar to what we're saying for Auto) for them.

How do we decide when these things should be magic constants vs. proper types?

It makes me wonder how custom encodings would work. For example, I would like to write something like [GeneratedDllImport("lib", Encoding = nameof(MyEncodings.Wtf8))]. The standard nameof returns just the last part of the name that is just Wtf8 in this example. Is the source generator going to crack the nameof expression to figure out the type name?

@svick
Copy link
Contributor

svick commented Nov 9, 2021

The "string with fairly complicated rules for which values are allowed" approach for Encoding sounds very confusing to me. I'd much rather see an enum used instead. If the existing CharSet is not a good match, then I think it should be a new enum, specific to GeneratedDllImport. Is there a reason why this approach should not be used?

@AaronRobinsonMSFT
Copy link
Member

AaronRobinsonMSFT commented Nov 9, 2021

approach for Encoding sounds very confusing
Is there a reason why this approach should not be used?

@svick It is definitely complicated, which is one of the reasons the CharSet is nice. However, the enum approach doesn't work very well in practice because it is so complicated and hamstrings us into not easily supporting new options. Basically we need to keep adding to the enum and responding to those values. Instead we are looking to provide a more extensible approach involving the declaration of a type with a specific shape that will do what is desired and anyone can extend.

One of the mitigations to the complexity will be a series of predefined types that will handle what is desired. Implementing your own would require care but we will have public APIs that handle the most common cases. An example could be something like the following:

namespace System.Runtime.InteropServices.Encoding
{
    // Selects the default encoding for the current platform
    struct AutoPlatformEncoding { ... }

    // Selects the UTF-16 endianness based on the current platform
    struct AutoUtf16Encoding { ... }
    ...
}

A series of these types would be provided by .NET. Then users could define their own or use one of the built-in ones.

[GeneratedDllImport(..., Encoding = typeof(AutoPlatformEncoding))]
public static partial int Len(string s);

Another approach would be to have built-in subclasses of the System.Text.Encoding which are similar in spirit to the above. There are several options but your concern about complexity is noted and that is one of the reasons we are discussing the ideas like this. We want an extensible solution that is also easy for the most common cases.

@elinor-fung
Copy link
Member Author

UTF8 seems to be the only one that is in both sets.
How do we decide when these things should be magic constants vs. proper types?
string with fairly complicated rules for which values are allowed

One of the initial reasons we gravitated towards the string over a type was the thought that we could just use the static properties on Encoding. Given that isn't the case (and as previously noted, those properties have a different focus from interop marshalling, so it will likely never be the case), I think an actual type does make more sense.

In the example @AaronRobinsonMSFT has above, I would expect those pre-defined types would be the same as what one could use with the MarshalUsing atttribute. I do like the parallel that it would be using the same type whether you want to use the shorthand field on GeneratedDllImport or explicit attributes on each parameter.

@elinor-fung
Copy link
Member Author

Update the description to reflect the string -> Type thinking.

@jkotas
Copy link
Member

jkotas commented Nov 10, 2021

Random thoughts:

  • Encoding property is the default string marshaler type to use. Would it make sense to give it a name that describes what it actually does? Maybe MarshalStringsUsing?
  • The marshaler names need to have common unique prefix or suffix to distinguish them from ordinary types. Utf8String in particular has very high chance of causing conflicts.
    • Naming nit: AutoUtf16String... can be UnicodeString... so that it is more consistent with the existing names.
  • AutoPlatformString: Do we believe that it is common enough to be included in the core library? I know we have a few of these on the host, but I would expect it to be a rare outlier that can be solved by declaring the PInvoke twice and if-switch based on the platform.
  • Thoughts about making Utf8 to be the default if nothing is specified?

@elinor-fung
Copy link
Member Author

name that describes what it actually does? Maybe MarshalStringsUsing?

I do like that it matches the MarshalUsing attribute. I did like having Encoding somewhere - maybe it is enough to have the types we provide be in an Encoding namespace or have it in their name (something like Utf8EncodingStringMarshal)

common unique prefix or suffix

Maybe Marshal or Marshalling a suffix? Marshalling would kind of flow better when reading it with the MarshalUsing attribute.

AutoPlatformString: Do we believe that it is common enough to be included in the core library?

Definitely fair. If I recall from when we were looking at all our own p/invokes, the host ones were the only real uses and everything else that specified it was only used on one platform and didn't need auto.

Thoughts about making Utf8 to be the default if nothing is specified?

My aversion to a default was around not wanting to match the (confusing) built-in one and wanting a clear error if the char set would end up different when switching an existing p/invoke to generated. This was influenced by our (original) mindset of trying to make GeneratedDllImport be a straight match for DllImport though. Given that we're not doing that and planning to introduce a different (and more verbose) field, I'm not against a UTF-8 default and noting that as a compat difference. Thoughts, @AaronRobinsonMSFT / @jkoritzinsky?

@AaronRobinsonMSFT
Copy link
Member

Given that we're not doing that and planning to introduce a different (and more verbose) field, I'm not against a UTF-8 default and noting that as a compat difference.

I think we can all agree that UTF-8 won—thank heavens. Making UTF-8 the default seems okay but does make getting it wrong annoying if consuming a native binary without being able to easily debug it. It also would mean we are taking sides with respect to platforms, which I'd prefer not to do in practice. We could also loosen the contract in V2. This would make the first wave transition largely avoid an annoying mistake.

Naming nit: AutoUtf16String... can be UnicodeString... so that it is more consistent with the existing names.

I'd prefer adopting a platform agnostic focus for .NET interop. For example, the UnicodeString... seems wrong because it is dragging the past forward instead of being explicit and stating ...Utf16String...—which is what it is, not Unicode. I can see it would be nice to have but the fact that we would be perpetuating a Windows centric view of "Unicode" seems to be against some of our goals.

Self-serving plug: It is also against the strong guidance I recently gave at a CppCon talk around interop with C++.

@hez2010
Copy link
Contributor

hez2010 commented Nov 18, 2021

Why not use Generic Attributes to apply type constraints to the encoding property? So that the invalid encoding error can be observed at compile time instead of run time.

class GeneratedDllImportAttribute<TEncoding> : Attribute where TEncoding : Encoding
{

}
// Error observed at compile time - invalid encoding
[GeneratedDllImport<int>("lib")]

@AaronRobinsonMSFT
Copy link
Member

Why not use Generic Attributes to apply type constraints to the encoding property?

@hez2010 That is a good question. The general problem with Generic Attributes is they are much more difficult to evolve without breaking changes. The above works for Encoding now but when feature X comes in we would have to break the type by adding another Generic type to follow that pattern. Instead the idea here is to force the consuming analyzer to do the analysis. Since the analyzer is part of the compilation step, users can be get a very similar "invalid encoding" experience.

@elinor-fung
Copy link
Member Author

Updated the attribute proposal - #46822 - to have the MarshalStringsUsing property from this discussion.

@jkoritzinsky
Copy link
Member

Since it looks like we're going the route of removing ExactSpelling on the source-generated interop attribute, it might be worth adding some additional options to our "convert to source generated DllImport" code-fix that hard-codes in the suffix into the EntryPoint field on the attribute.

@elinor-fung
Copy link
Member Author

Attribute properties for string marshalling are covered by #46822
Marshaller implementations are proposed in #66623

@ghost ghost locked as resolved and limited conversation to collaborators Apr 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants