Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Share UTF8 converters between coreclr and mono #85558

Merged
merged 9 commits into from
Jun 22, 2023

Conversation

am11
Copy link
Member

@am11 am11 commented Apr 29, 2023

Fix #78490

src/coreclr/vm/object.cpp Outdated Show resolved Hide resolved
src/coreclr/pal/inc/pal.h Outdated Show resolved Hide resolved
src/native/minipal/utf8converter.c Outdated Show resolved Hide resolved
src/native/minipal/utf8converter.h Outdated Show resolved Hide resolved
@am11
Copy link
Member Author

am11 commented May 2, 2023

@janvorli this is a draft PR, not ready for review yet..

@janvorli
Copy link
Member

janvorli commented May 2, 2023

@am11 ah, I am sorry for that, I was mislead by the fact that @marek-safar has added me as a reviewer.

@am11
Copy link
Member Author

am11 commented May 11, 2023

The fallback logic in coreclr and mono is a bit different. e.g. mono's eglib does not handle these error cases: https://github.com/dotnet/runtime/blob/25a4e045aa975656a96e3edd8380093177fe841d/src/coreclr/pal/tests/palsuite/locale_info/MultiByteToWideChar/test4/test4.cpp#L62-L114 Similarly, these Reflection.Emit tests are disabled on mono and now failing on coreclr: https://github.com/dotnet/runtime/blob/25a4e045aa975656a96e3edd8380093177fe841d/src/libraries/System.Reflection.Emit/tests/TypeBuilder/TypeBuilderDefineEvent.cs#L23-L25

I tried to implement the fallback logic in eglib but it turned out to be non-trivial given how we traverse the character stream. It requires look-ahead and look-behind mechanics which is currently missing in eglib. Accommodating the incomplete surrogate cases, where more than one character is missing e.g. \xF0\x90\x80 -> \xFFFD , gets rough.

With some refactoring in eglib, I think we can feature-complete it (such that it handles the list of known error cases). It can be done in a separate PR and we can continue unification work in this one.

If we entertain the alternative; port coreclr's utf8.cpp to mono, we will lose utf32/ucs4 transformation functionality needed by mono public API for embedded apps

MONO_API_FUNCTION(MONO_RT_EXTERNAL_ONLY MonoString*, mono_string_new_utf32, (MonoDomain *domain, const mono_unichar4 *text, int32_t len))
I'm not sure how important those are, or if we can break that public contract? Are external consumers relying on these utf32 APIs?

cc @jkotas, @lambdageek

@jkotas
Copy link
Member

jkotas commented May 11, 2023

the fallback logic

I assume that this fallback logic matches the fallback logic of managed UTF8 encoders/decoders. Is that correct? We have intentionally tried to match the behavior of managed and unmanaged UTF8 converters, so that we do not need to deal with behavior changes every time we move UTF8 conversion from unmanaged code to managed code.

utf32/ucs4 transformation functionality

I do not think we need utf32/ucs4 transformation in the minipal. It should stay Mono-only if these embedding APIs are still relevant.

@am11
Copy link
Member Author

am11 commented May 11, 2023

I assume that this fallback logic matches the fallback logic of managed UTF8 encoders/decoders. Is that correct?

Yes, coreclr's utf8.cpp was initially based on corefx implementation: dotnet/coreclr#3809, but corefx implementation has evolved/diverged since (it is using vectorized APIs these days). Going by test results, we have kept the fallback logic in sync with managed implementation. There may be some corner cases which are not handled by utf8.cpp that managed implementation supports.

mono's eglib implementation is based on public documentation of GNOME's glib, per the README: https://github.com/mono/eglib; it is not a copy of the original source https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gutf8.c. The original source seem to support some of these fallback cases. I haven't tested it if that is the case.

I do not think we need utf32/ucs4 transformation in the minipal. It should stay Mono-only if these embedding APIs are still relevant.

Sounds good. I will try to port utf8.cpp to mono (without C++ runtime dependency) and replace eglibc implementation; modulo the utf32 bits.

src/native/minipal/utf8.h Outdated Show resolved Hide resolved
src/mono/mono/eglib/giconv.c Outdated Show resolved Hide resolved
src/native/minipal/utf8.h Outdated Show resolved Hide resolved
src/native/minipal/utf8.c Outdated Show resolved Hide resolved
@jkotas
Copy link
Member

jkotas commented Jun 20, 2023

cc @GrabYourPitchforks This PR is refactoring the unmanaged utf8 encoding/decoding so that it can be shared between all runtimes. The unmanaged utf8 encoding/decoding is used by the loader that assumes trusted input, it should have low security risk profile. The code matches the old managed implementation that you have redone some time ago.

Please let us know if you have concerns about this change, or if you would like to see more to be done as a follow up.

@am11
Copy link
Member Author

am11 commented Jun 22, 2023

Depending on the input, PAL's MultiByteToWideChar / WideCharToMultiByte wrappers are now 20-22% faster. Perhaps we can vectorize this code by adding SIMD wrappers in minipal (e.g. once https://github.com/dotnet/runtime/pull/87865/files#diff-e0b84b85c730fa4b4a840dba229a4d630e88fcd33f17208eb35a905796e89f3c is in), similar to how managed code was updated.

These APIs can are called on hot paths as well.

@jkotas
Copy link
Member

jkotas commented Jun 22, 2023

These APIs can are called on hot paths as well.

I would rather spend time on pushing these hot paths to C# where they can take advantage of the vectorized managed implementation.

@am11
Copy link
Member Author

am11 commented Jun 22, 2023

@jkotas, e.g. this PrestubMethodFrame calls PEImage::TryOpenFile:

OS Thread Id: 0x7f4db74 (1)
        Child SP               IP Call Site
000000016FDFBB40 0000000101E40F3C libcoreclr.dylib!minipal_convert_utf16_to_utf8
000000016FDFBB40 0000000101E40138 libcoreclr.dylib!WideCharToMultiByte + 232
000000016FDFBB80 0000000101E37F20 libcoreclr.dylib!CreateFileW + 236
000000016FDFBD20 0000000101F165E4 libcoreclr.dylib!PEImage::TryOpenFile(bool) + 156
000000016FDFBD60 0000000101F93288 libcoreclr.dylib!BinderAcquirePEImage + 172
000000016FDFBDF0 000000010215FF80 libcoreclr.dylib!BINDER_SPACE::AssemblyBinderCommon::GetAssembly(SString&, int, BINDER_SPACE::Assembly**, BundleFileLocation) + 124
000000016FDFBE50 00000001021613AC libcoreclr.dylib!BINDER_SPACE::AssemblyBinderCommon::BindByTpaList(BINDER_SPACE::ApplicationContext*, BINDER_SPACE::AssemblyName*, bool, BINDER_SPACE::BindResult*) + 2400
000000016FDFC1F0 00000001021605A0 libcoreclr.dylib!BINDER_SPACE::AssemblyBinderCommon::BindLocked(BINDER_SPACE::ApplicationContext*, BINDER_SPACE::AssemblyName*, bool, bool, BINDER_SPACE::BindResult*) + 404
000000016FDFC260 000000010215F5CC libcoreclr.dylib!BINDER_SPACE::AssemblyBinderCommon::BindByName(BINDER_SPACE::ApplicationContext*, BINDER_SPACE::AssemblyName*, bool, bool, bool, BINDER_SPACE::BindResult*) + 208
000000016FDFC3E0 000000010215F084 libcoreclr.dylib!BINDER_SPACE::AssemblyBinderCommon::BindAssembly(AssemblyBinder*, BINDER_SPACE::AssemblyName*, bool, BINDER_SPACE::Assembly**) + 208
000000016FDFC680 00000001021662D8 libcoreclr.dylib!DefaultAssemblyBinder::BindUsingAssemblyName(BINDER_SPACE::AssemblyName*, BINDER_SPACE::Assembly**) + 96
000000016FDFC730 0000000101E7CBF0 libcoreclr.dylib!AssemblyBinder::BindAssemblyByName(AssemblyNameData*, BINDER_SPACE::Assembly**) + 100
000000016FDFC760 0000000101F92F30 libcoreclr.dylib!AssemblySpec::Bind(AppDomain*, BINDER_SPACE::Assembly**) + 796
000000016FDFCC40 0000000101E7118C libcoreclr.dylib!AppDomain::BindAssemblySpec(AssemblySpec*, int) + 644
000000016FDFD240 0000000101F14030 libcoreclr.dylib!PEAssembly::LoadAssembly(unsigned int) + 192
000000016FDFD2D0 0000000101E8A214 libcoreclr.dylib!Module::LoadAssemblyImpl(unsigned int) + 168
000000016FDFD370 0000000101E798A0 libcoreclr.dylib!Assembly::FindModuleByTypeRef(ModuleBase*, unsigned int, Loader::LoadFlag, int*) + 528
000000016FDFD3C0 0000000101E99888 libcoreclr.dylib!ClassLoader::LoadTypeDefOrRefThrowing(ModuleBase*, unsigned int, ClassLoader::NotFoundAction, ClassLoader::PermitUninstantiatedFlag, unsigned int, ClassLoadLevel) + 484
000000016FDFD490 0000000101F2CBD4 libcoreclr.dylib!SigPointer::GetTypeHandleThrowing(ModuleBase*, SigTypeContext const*, ClassLoader::LoadTypesFlag, ClassLoadLevel, int, Substitution const*, ZapSig::Context const*, MethodTable*, SigPointer::HandleRecursiveGenericsForFieldLayoutLoad*) const + 1448
000000016FDFD5A0 0000000101F62878 libcoreclr.dylib!ZapSig::DecodeType(Module*, ModuleBase*, unsigned char const*, ClassLoadLevel, unsigned char const**) + 100
000000016FDFD620 0000000101EEDB88 libcoreclr.dylib!LoadDynamicInfoEntry(Module*, unsigned int, unsigned long*, int) + 320
000000016FDFDB20 0000000101E8BF54 libcoreclr.dylib!Module::FixupNativeEntry(READYTORUN_IMPORT_SECTION*, unsigned long, unsigned long*, int) + 92
000000016FDFDB50 0000000101F67450 libcoreclr.dylib!Module::FixupDelayList(unsigned long long, int) + 204
000000016FDFDBD0 0000000101F67150 libcoreclr.dylib!ReadyToRunInfo::GetEntryPoint(MethodDesc*, PrepareCodeConfig*, int) + 872
000000016FDFDC80 0000000101F1EDDC libcoreclr.dylib!MethodDesc::GetPrecompiledR2RCode(PrepareCodeConfig*) + 60
000000016FDFDCC0 0000000101F1EA94 libcoreclr.dylib!MethodDesc::PrepareILBasedCode(PrepareCodeConfig*) + 172
000000016FDFDD20 0000000101EA1074 libcoreclr.dylib!CodeVersionManager::PublishVersionableCodeIfNecessary(MethodDesc*, CallerGCMode, bool*, bool*) + 1336
000000016FDFDE60 0000000101F22994 libcoreclr.dylib!MethodDesc::DoPrestub(MethodTable*, CallerGCMode) + 212
000000016FDFDF10 0000000101F224EC libcoreclr.dylib!PreStubWorker + 524
000000016FDFDF50                  [PrestubMethodFrame: 000000016fdfdf50] Microsoft.AspNetCore.dll!Microsoft.AspNetCore.Builder.WebApplicationBuilder..ctor(Microsoft.AspNetCore.Builder.WebApplicationOptions, Boolean, System.Action`1<Microsoft.Extensions.Hosting.IHostBuilder>)
000000016FDFDFD0 00000001020F7874 libcoreclr.dylib!ThePreStub + 80
000000016FDFE100 000000010467582C Microsoft.AspNetCore.dll!Microsoft.AspNetCore.Builder.WebApplication.CreateSlimBuilder(System.String[]) + 108 [/_/src/DefaultBuilder/src/WebApplication.cs @ 122]
000000016FDFE120 00000001045619D8 api1.dll!Program.<Main>$(System.String[]) + 112 [/Users/am11/projects/api1/Program.cs @ 4]
FFFFFFFFFFFFFFFF 00000001045619D8 
FFFFFFFFFFFFFFFF 00000001020F82B4 libcoreclr.dylib!CallDescrWorkerInternal + 132
000000016FDFE1D0 0000000101F75E94 libcoreclr.dylib!MethodDescCallSite::CallTargetWorker(unsigned long long const*, unsigned long long*, int) + 856
000000016FDFE440 0000000101E7AE8C libcoreclr.dylib!RunMain(MethodDesc*, short, int*, PtrArray**) + 632
000000016FDFE650 0000000101E7B1D8 libcoreclr.dylib!Assembly::ExecuteMainMethod(PtrArray**, int) + 260
000000016FDFE8E0 0000000101EA3218 libcoreclr.dylib!CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*) + 616
000000016FDFE9E0 0000000101E68658 libcoreclr.dylib!coreclr_execute_assembly + 204
000000016FDFEA40 0000000100005BC0 corerun!run(configuration const&) + 6696 at /Users/am11/projects/runtime/src/coreclr/hosts/corerun/corerun.cpp:0
000000016FDFF0C0 0000000100002EC8 corerun!main + 1372 at /Users/am11/projects/runtime/src/coreclr/hosts/corerun/corerun.cpp:0
000000016FDFF2B0 00000001968B3E50 dyld!start + 2544

To accomplish "managed UTF-8 conversion" here, should we make it so PEImage::TryOpenFile has UTF-8 string at that point and refactor all the up to the managed caller? In some cases, those strings are cached on runtime side in UTF-16 format, e.g. when (as a result of an FCall or QCall) name of a loaded assembly or type is passed down to a PAL API for some syscall.

@jkotas
Copy link
Member

jkotas commented Jun 22, 2023

It should be possible to move the whole assembly loader to C#. We would need a special unmanaged path to load CoreLib to bootstrap, but the rest can be managed. It is too big to do it all at once. I think a good place to start with this refactoring would be BINDER_SPACE::*.

(The UTF8 conversion under PEImage::TryOpenFile does not look perf sensitive. It is done once per loaded assembly. Assembly loading does a lot of expensive operations, non-vectorized UTF8 conversion as part of it is not a big deal.)

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this one. Thank you!

@jkotas jkotas merged commit 2f2aaae into dotnet:main Jun 22, 2023
@am11 am11 deleted the feature/minipal/utf8 branch June 22, 2023 13:48
@ghost ghost locked as resolved and limited conversation to collaborators Jul 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-PAL-coreclr community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consolidate UTF8 character encoding implementations
6 participants