Add UTF8 (and maybe UTF16) to System.Runtime.InteropServices.CharSet + Marshal class #4257

Closed
migueldeicaza opened this issue May 17, 2015 · 25 comments

Comments

@migueldeicaza
Contributor

Currently marshaling in .NET has three modes: Ansi, Unicode (platform dependent) and "Auto", which picks a good default between those two. The meaning of Unicode is closely associated with Windows' UTF16.

There is today no convenient and reliable way to do UTF8 encoding, and at best we have an ambiguous definition of what Unicode is.

There are enough bits on the metadata tables to add these two values.

People can resort to custom marshalers (slow, cumbersome, everyone has to do it), or manual marshaling, or hope that the platform does the right thing.

Anecdotally: this also happens to be the oldest Mono bug that is still open.

The world has spoken, and UTF8 is the standard. We should have first-class support for it, both in P/Invoke signatures and in the various helper methods in Marshal.
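
To illustrate the manual marshaling mentioned above, here is a minimal sketch in C#. The class and method names (ManualUtf8, StringToNativeUtf8, NativeUtf8ToString) are hypothetical and do not come from any project in this thread; only the Encoding and Marshal calls are standard .NET APIs. It assumes the native side expects a NUL-terminated UTF-8 buffer.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helper: hand-written UTF-8 marshaling without runtime support.
public static class ManualUtf8
{
    // Encodes a managed string as NUL-terminated UTF-8 in unmanaged memory.
    // The caller owns the buffer and must release it with Marshal.FreeHGlobal.
    public static IntPtr StringToNativeUtf8(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        IntPtr buffer = Marshal.AllocHGlobal(bytes.Length + 1);
        Marshal.Copy(bytes, 0, buffer, bytes.Length);
        Marshal.WriteByte(buffer, bytes.Length, 0); // trailing NUL
        return buffer;
    }

    // Reads a NUL-terminated UTF-8 buffer back into a managed string.
    public static string NativeUtf8ToString(IntPtr buffer)
    {
        int length = 0;
        while (Marshal.ReadByte(buffer, length) != 0)
        {
            length++;
        }

        byte[] bytes = new byte[length];
        Marshal.Copy(buffer, bytes, 0, length);
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Every P/Invoke call that takes or returns text then needs this allocate, copy, and free sequence around it, which is the repetition the issue describes.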

@masonwheeler
Contributor

The world has spoken, and UTF8 is the standard

Hmm. Now, don't get me wrong, I like UTF8 as much as anyone. Its technical superiority over UTF-16 is obvious, and I'd definitely like to see it replace UTF-16. But there's more than a little exaggeration in that claim as stated!

The platform native string type in both Windows and OSX is UTF-16. The platform native string type in both the CLR and the JVM is UTF-16. Take those out... and there's precious little left of "the world"!

@josteink
Member

@masonwheeler: That may be, but I would argue your perspective is too narrow. You're looking at things from a framework-internal perspective.

When using P/Invoke you are typically (or at least often) interacting with libraries hosted outside the .NET framework, often written in plain C.

Most of those libraries will expect traditional 8-bit ANSI/ASCII text, and in those cases UTF8 is the only reliable way of getting Unicode content across. And if we're going to start P/Invoking on platforms other than Windows, UTF8 is de facto the only way to represent Unicode at the system level.

@masonwheeler
Contributor

@josteink Ah, you're right. That's a very good point.

@MikePopoloski
Contributor

+9000

I've worked with internal applications that pinvoked into custom C DLLs, and we had to maintain tons of custom marshaling code because using UTF-16 was out of the question due to the memory requirements. Automatic UTF-8 marshaling would be so nice to have.

@stephentoub
Member

While the name is misleading, on UNIX CoreCLR's ANSI marshaling is UTF8.

@migueldeicaza
Contributor Author

@stephentoub The name is not only misleading, it is incompatible with the behavior on Windows.

While this is what Mono has done, it causes problems for people that want to target both Unix and Windows, as evidenced by people that have to resort over and over to use the two approaches described above.

I do not need the extra work in Mono (just like I assume you guys do not want the extra work), but it makes the platform very unpleasant to work with for everyone who has to consider more than one platform. This is a request on behalf of the users of what is a half-baked and broken setup.

@jkotas
Member

jkotas commented May 18, 2015

cc @yizhang82 - maybe a good candidate for MCG (Marshaling Code Generator)

@ghost

ghost commented May 30, 2015

If this ever happens, would it mean we no longer need to use MarshalString() between native C++ and CLR strings, because direct assignment would be possible? That would also mean no more effort spent inventing better (managed) variants of System::Interop like the one described in @Cygon's blog: http://blog.nuclex-games.com/mono-dotnet/cxx-cli-string-marshaling/.
In that case 💯 👍

@yizhang82
Contributor

Supporting UTF-8 marshalling does sound like something we might want to support in the future, most likely in MCG, which @jkotas mentioned earlier. MCG is the new interop technology we have in .NET Native, and is a great place for experimenting with newer features like this. Implementing marshalling support in the CLR itself is rather painful: you need to write code generation code that emits IL, while in MCG you write code that spits out C#. Eventually we'd like to see all the CLR runtimes (.NET Native being one of them) use the same underlying interop technology so that we don't have to implement features twice and maintain two separate code bases.

@jasonwilliams200OK It's unlikely we'll be able to support marshalling directly to C++ std::strings. We don't necessarily want to tie to a particular layout of a C++ string implementation. I think C++/CLI libraries should provide a good way to marshal between managed string and C++ string.

@whoisj

whoisj commented Jun 2, 2015

@yizhang82 C++ std::string is happy with a C string, as is C#. Why not just assume that native code and C# will be using char[] or wchar_t[] for string import/export?

That said, the UTF-8 marshaller is critical. The Libgit2Sharp project has a fairly good implementation without too many contributors (one really), might be worth speaking with them about making their implementation part of the coreclr, or at least a good starting point.

/CC @nulltoken

@nulltoken

👍 for me

@paulcbetts @phkelley @tclem @dahlbyk AFAIR you all contributed to it. Thoughts?

@dahlbyk
Contributor

dahlbyk commented Jun 10, 2015

First-class UTF8 seems reasonable to me. First-class Encoding support seems like it could also be useful, but I'm not exactly sure what that would look like.

Relevant LG2S classes:

@anaisbetts

^^ this would be great, there's really no reason to assume text is ASCII when marshaling strings, the only two options in modern software today are UTF-16 (i.e. Windows / OS X native style) and UTF-8 (basically every OSS lib). +1 for this feature.

@yizhang82
Contributor

yizhang82 commented Apr 19, 2016

We have the API proposal out for review: https://github.com/dotnet/corefx/issues/7804

@myungjoo
Contributor

CC: @leemgs This seems to be related to the unexplained issue that depends on the LOCALE environment variables (not working if the locale is ko_KR.UTF8).

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@Serentty
Contributor

@migueldeicaza Has anything happened related to this in the past few years? Especially now that .NET is focusing on better Unicode support with stuff like System.Text.Rune, it seems like this would be a logical step.

@danmoseley
Member

cc @eerhardt @GrabYourPitchforks

@migueldeicaza
Contributor Author

Nothing has happened. It is funny because this issue was the oldest bug we kept open for Mono - filed by Red Hat sometime around 2002-2003.

We tried ECMA at the time, we tried every contact we had, and even now it seems to be a part of the code that nobody wants to touch and is scared of changing. After Xamarin was acquired I had various discussions in person about it.

At this point developers have resorted to a spectrum of workarounds, hacks and band aids, depending on just how badly you need this and how much performance you require.

In the end, the existence of suboptimal workarounds prevents this from being considered.

The world continues, but it remains unnecessarily complex for newcomers.

@jkotas
Member

jkotas commented May 31, 2020

We tried to add CharSet.UTF8, but it turned out to be pretty difficult and far-reaching. The issue #17000 on this is still open. You can see some of the difficulties in closed PR dotnet/coreclr#18186. At this point, the most likely way we are going to fully address this is via interop source generators that we are starting to prototype. cc @elinor-fung @AaronRobinsonMSFT

In the meantime, the following related things happened that made the UTF8 interop story better:

  • Performance of UTF8 encoding / decoding has improved by about an order of magnitude since the .NET Core project started.
  • Span<T> interop is easier to write, more performant and more secure.
  • UnmanagedType.LPUTF8Str was added.
  • Marshal.PtrToStringUTF8, Marshal.StringToHGlobalUTF8 and Marshal.StringToCoTaskMemUTF8 APIs were added.
  • We have started using these new functions in the core framework, e.g. here:
    return Marshal.PtrToStringUTF8(interiorPointer)!;
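
A minimal sketch of how the additions listed above might be used, assuming .NET Core 3.0 or later. The class name Utf8InteropSketch, the library name "nativetext" and the entry point "text_length" are hypothetical placeholders for illustration; only UnmanagedType.LPUTF8Str, Marshal.StringToCoTaskMemUTF8, Marshal.PtrToStringUTF8 and Marshal.FreeCoTaskMem are real .NET APIs.

```csharp
using System;
using System.Runtime.InteropServices;

public static class Utf8InteropSketch
{
    // Hypothetical native function: "nativetext" and "text_length" are assumed names.
    // UnmanagedType.LPUTF8Str asks the runtime to pass the string as NUL-terminated UTF-8
    // on every platform, independent of what CharSet.Ansi means there.
    [DllImport("nativetext", EntryPoint = "text_length")]
    public static extern int TextLength([MarshalAs(UnmanagedType.LPUTF8Str)] string text);

    // Copies a managed string to unmanaged memory as UTF-8, reads it back,
    // and frees the buffer. No native library is needed for this part.
    public static string RoundTrip(string text)
    {
        IntPtr buffer = Marshal.StringToCoTaskMemUTF8(text);
        try
        {
            return Marshal.PtrToStringUTF8(buffer) ?? string.Empty;
        }
        finally
        {
            Marshal.FreeCoTaskMem(buffer);
        }
    }
}
```

The per-parameter MarshalAs attribute covers individual signatures, but unlike the CharSet value requested in this issue it has to be repeated on every string parameter.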

@migueldeicaza
Contributor Author

migueldeicaza commented May 31, 2020

Those APIs and similar ones are the sort of bandaids I was referring to. They exist in assorted ways across cross platform projects because of this missing marshaling capability.

I looked at the issue and could not figure out what part of it was the blocking one?

@jkotas
Member

jkotas commented May 31, 2020

There was nothing blocking.

It is just a lot of work to get the design proposed here pushed through the system. In addition to implementing it in both the CoreCLR and Mono runtimes for all situations where interop does string marshaling, there is also changing Roslyn, changing ilasm/ildasm and a number of other tools and libraries that operate on IL, etc.

The interop source generators will make it much easier. Proposed design: https://github.com/dotnet/runtime/blob/master/docs/design/features/source-generator-pinvokes.md

@Serentty
Contributor

So you're saying that if you use DllImportAttribute with source generators, it will be possible to specify UTF-8 in a portable way, and that people should use that going forward instead of the old DllImport?

@AaronRobinsonMSFT
Member

So you're saying that if you use DllImportAttribute with source generators, it will be possible to specify UTF-8 in a portable way, and that people should use that going forward instead of the old DllImport?

@Serentty It is a little more nuanced than that. With the source generators for P/Invokes proposal, we will be able to support UTF-8 in a portable and real way that doesn't cost anywhere near as much as the referenced PR (dotnet/coreclr#18186). The goal of source generators for P/Invokes is to enable support for many language and cross-platform features (e.g. Span<T>) much more cheaply than updating the built-in marshaling system. I say more nuanced because the exact shape and primary focus of v1 of the feature isn't defined yet. This conversation does seem to warrant making it a higher priority, but it is still a bit away from being released.

For now, as @migueldeicaza stated, it is "unnecessarily complex" but reducing that complexity is going to be possible with the source generator approach.

@Serentty
Contributor

I see, thanks.

@AaronRobinsonMSFT
Member

Now that the LibraryImport generator (i.e., source-generated DllImports) has been exposed as a publicly consumable tool, this issue should be explored relative to that effort. This specific issue is being closed due to first-class support for UTF-8 in the LibraryImport generator. See #67635.
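
For reference, a declaration using the LibraryImport generator with UTF-8 string marshalling might look like the sketch below. It assumes .NET 7 or later and a project with AllowUnsafeBlocks enabled, which the generator requires. The class name NativeText, the library name "nativetext" and the entry point "text_length" are hypothetical; LibraryImportAttribute and StringMarshalling.Utf8 are the real API names.

```csharp
using System.Runtime.InteropServices;

internal static partial class NativeText
{
    // Hypothetical native function with the assumed C signature:
    //   size_t text_length(const char* utf8);
    // StringMarshalling.Utf8 makes the generated stub pass the string as
    // NUL-terminated UTF-8 on every platform.
    [LibraryImport("nativetext", EntryPoint = "text_length", StringMarshalling = StringMarshalling.Utf8)]
    internal static partial nuint TextLength(string text);
}
```

The marshalling code is emitted as C# at compile time, so the encoding choice lives in the attribute rather than in new metadata bits or runtime changes.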

@ghost ghost locked as resolved and limited conversation to collaborators Jun 6, 2022