Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

does it work with wchar_t? #39

Closed
roozbehid-ic opened this issue Oct 1, 2021 · 20 comments
Closed

does it work with wchar_t? #39

roozbehid-ic opened this issue Oct 1, 2021 · 20 comments

Comments

@roozbehid-ic
Copy link

I am getting an error with wchar_t.
I also included #include <stddef.h> to bypass clang error but I do get following

PANIC: [Panic] String cannot be of zero length. (Parameter 'oldValue')
   at System.String.Replace(String oldValue, String newValue)
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.Location(CXCursor cursor, Nullable`1 type)
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.VisitTranslationUnit(CXTranslationUnit translationUnit)
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.AbstractSyntaxTree(CXTranslationUnit translationUnit, Int32 bitness)
   at C2CS.UseCases.AbstractSyntaxTreeC.UseCase.Explore(CXTranslationUnit translationUnit, ImmutableArray`1 includeDirectories, ImmutableArray`1 ignoredFiles, ImmutableArray`1 opaqueTypes, ImmutableArray`1 whitelistFunctionNames, Int32 bitness)
   at C2CS.UseCases.AbstractSyntaxTreeC.UseCase.Execute(Request request, Response response)
   at C2CS.UseCase`2.Execute(TRequest request)

@lithiumtoast
Copy link
Member

Do you have a header / repo I could use to reproduce?

@lithiumtoast
Copy link
Member

@roozbehid-ic But yes, there is support for wchar_t. It's likely you have just hit a corner case that wasn't accounted for.

@roozbehid-ic
Copy link
Author

roozbehid-ic commented Oct 2, 2021

This is the header

#include <stddef.h>
#define WINAPI
#define NSCUBE_EXPORTS
#define NSCUBE_API __declspec(dllexport)


typedef unsigned long	nscCHANID;
typedef void *			nscHANDLE;
typedef void *			nscHSRV;
typedef int	nscRESULT;

NSCUBE_API nscRESULT WINAPI nscAddTextW
(
	nscHANDLE hTTS,
	const wchar_t *pwszInputTxt,
	void *pUserText
);

and how I run it and what I get

C2CS.exe ast -i nscube.h -o nscube.json
AbstractSyntaxTreeC: Started.
        Started step (1/3) 'Parse C code from disk'
libclang: Parsing 'nscube.h' with the following arguments...
        --language=c --std=c11 -Wno-pragma-once-outside-header -fno-blocks -m64 --include-directory= -isystemC:\Program Files (x86)\Windows Kits\10\Include\10.0.17763.0\ucrt -isystemC:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30133\include
        Finished step (1/3) 'Parse C code from disk' in 278.1678 ms
        Started step (2/3) 'Extract C abstract syntax tree'
AbstractSyntaxTreeC: Finished unsuccessfully. Review the following diagnostics for reason(s) why:
PANIC: [Panic] String cannot be of zero length. (Parameter 'oldValue')
   at System.String.Replace(String oldValue, String newValue)
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.Location(CXCursor cursor, Nullable`1 type)
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.VisitTranslationUnit(CXTranslationUnit translationUnit)
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.AbstractSyntaxTree(CXTranslationUnit translationUnit, Int32 bitness)
   at C2CS.UseCases.AbstractSyntaxTreeC.UseCase.Explore(CXTranslationUnit translationUnit, ImmutableArray`1 includeDirectories, ImmutableArray`1 ignoredFiles, ImmutableArray`1 opaqueTypes, ImmutableArray`1 whitelistFunctionNames, Int32 bitness)
   at C2CS.UseCases.AbstractSyntaxTreeC.UseCase.Execute(Request request, Response response)
   at C2CS.UseCase`2.Execute(TRequest request)

@lithiumtoast
Copy link
Member

Issue was the directory was not specified for input and output. I always used it like this before:

C2CS.exe ast -i .\nscube.h -o .\nscube.json

I fixed it where if the directory is not specified then the current directory would be used.

Let me know if you have any other issues or things are still persisting.

@roozbehid-ic
Copy link
Author

Thanks I got past that one but getting a new error.

c:\Utils\C2CS.exe  ast -i .\nscube.h -o .\nscube.json
AbstractSyntaxTreeC: Started.
        Started step (1/3) 'Parse C code from disk'
libclang: Parsing '.\nscube.h' with the following arguments...
        --language=c --std=c11 -Wno-pragma-once-outside-header -fno-blocks -m64 --include-directory=. -isystemC:\Program Files (x86)\Windows Kits\10\Include\10.0.17763.0\ucrt -isystemC:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30133\include
        Finished step (1/3) 'Parse C code from disk' in 236.3814 ms
        Started step (2/3) 'Extract C abstract syntax tree'
AbstractSyntaxTreeC: Finished unsuccessfully. Review the following diagnostics for reason(s) why:
PANIC: [Panic] Type name was not found for 'FunctionPointer'.
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.TypeName(String parentTypeName, CKind kind, CXType type, CXCursor cursor)
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.CreateFunctionPointer(String typeName, CXCursor cursor, Node parentNode, CXType originalType, CXType type, ClangLocation location)
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.ExploreFunctionPointer(String typeName, CXCursor cursor, CXType type, CXType originalType, ClangLocation location, Node parentNode)
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.ExploreTypedef(Node node, Node parentNode)
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.ExploreNode(Node node)
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.Explore()
   at C2CS.UseCases.AbstractSyntaxTreeC.ClangTranslationUnitExplorer.AbstractSyntaxTree(CXTranslationUnit translationUnit, Int32 bitness)
   at C2CS.UseCases.AbstractSyntaxTreeC.UseCase.Explore(CXTranslationUnit translationUnit, ImmutableArray`1 includeDirectories, ImmutableArray`1 ignoredFiles, ImmutableArray`1 opaqueTypes, ImmutableArray`1 whitelistFunctionNames, Int32 bitness)
   at C2CS.UseCases.AbstractSyntaxTreeC.UseCase.Execute(Request request, Response response)
   at C2CS.UseCase`2.Execute(TRequest request)

You can see the header file here.
https://gist.github.com/roozbehid-ic/03197d9db0735454ef893132c5b30dd6

@lithiumtoast
Copy link
Member

lithiumtoast commented Oct 5, 2021

Problem was:

typedef int PNSC_FNSPEECH_DATA (const unsigned char *pData, unsigned int cbDataSize, PNSC_SOUND_DATA pSoundData, void *pAppInstanceData);

This is the first time I have encountered a function proto inlined as a typedef in this fashion. Usually I encounter function protos in this fashion:

typedef SDL_HitTestResult (SDLCALL *SDL_HitTest)(SDL_Window *win, const SDL_Point *area, void *data);

Anyways just a small couple line fix in logic: d1c74f0

Note that the API in question looks to be quite intensive on strings. Please advise that casting CString8U to string calls Runtime.CString, to which that function hashes the string if the pointer representing the string (often the case of pointer to first element) is not in the cache. While djb2 is extremely fast, this is technically a O(n) where n is the length of the string; I have not found a more elegant way to deal with strings. Ideas and feedback is much appreciated.

@roozbehid-ic
Copy link
Author

Thank you so much for all the help.
Two things:

Would you mind telling how you discovered what went wrong. At least it helps future people to know how to diagnose the problems.
Also it seems you are using special classes for unicode, etc. Why not use marshaling that comes with C#? It seems you are targeting .net 5 then it should also have utf8, utf16 and all those marshaling and looks more native to c#.

@lithiumtoast
Copy link
Member

Why not use marshaling that comes with C#?

Last time I looked into it it it was implementation details: each C string (char*) when converted to a C# string (string) would be a new allocation on the GC heap which is not ideal for my use cases. I do it manually so I can have the opportunity to cache the strings using a hashing algorithm or do something else entirely in the future. It also ensures that the bindings are exactly 1-1 in terms of bit layout.

There is a whole GitHub issue on .NET about this problem: dotnet/corefxlab#2350

@lithiumtoast
Copy link
Member

Would you mind telling how you discovered what went wrong.

I just run the program with the debugger enabled and observe the stack trace. I then unravel the stack and observe the state of variables and other memory which cause the program to enter the particular state. The only magical pieces are libclang and Rosyln; the rest is 100% just ordinary C# code. However libclang can be a little hard to debug sometimes and I find myself entering throw away code to get the name of a type / cursor in some cases to help with debugging.

The tests are mostly done via smoke tests which I use the bindings generated by C2CS in my own libraries. I know immediately if something doesn't work right because I use the bindings all the time.

@roozbehid-ic
Copy link
Author

Makes sense.
I use following for char* which is potentially utf8 to string (although C# comes with ascii char* to string marshaller)

    public class UTF8Marshaler : ICustomMarshaler
    {
        static UTF8Marshaler static_instance;

        public IntPtr MarshalManagedToNative(object managedObj)
        {
            if (managedObj == null)
                return IntPtr.Zero;
            if (!(managedObj is string))
                throw new MarshalDirectiveException(
                       "UTF8Marshaler must be used on a string.");

            // not null terminated
            byte[] strbuf = Encoding.UTF8.GetBytes((string)managedObj);
            IntPtr buffer = Marshal.AllocHGlobal(strbuf.Length + 1);
            Marshal.Copy(strbuf, 0, buffer, strbuf.Length);

            // write the terminating null
            Marshal.WriteByte(buffer + strbuf.Length, 0);
            return buffer;
        }

        public unsafe object MarshalNativeToManaged(IntPtr pNativeData)
        {
            byte* walk = (byte*)pNativeData;

            // find the end of the string
            while (*walk != 0)
            {
                walk++;
            }
            int length = (int)(walk - (byte*)pNativeData);

            // should not be null terminated
            byte[] strbuf = new byte[length];
            // skip the trailing null
            Marshal.Copy((IntPtr)pNativeData, strbuf, 0, length);
            string data = Encoding.UTF8.GetString(strbuf);
            return data;
        }

        public void CleanUpNativeData(IntPtr pNativeData)
        {
            Marshal.FreeHGlobal(pNativeData);
        }

        public void CleanUpManagedData(object managedObj)
        {
        }

        public int GetNativeDataSize()
        {
            return -1;
        }

        public static ICustomMarshaler GetInstance(string cookie)
        {
            if (static_instance == null)
            {
                return static_instance = new UTF8Marshaler();
            }
            return static_instance;
        }
    }

Which probably ends up being similar to what you do.
But on delegates and dllimport it looks more natural and you end up using string class.

        public delegate int ActionReply(Int64 Id, [MarshalAs(UnmanagedType.CustomMarshaler, MarshalTypeRef = typeof(UTF8Marshaler))] string nameValueParameters);

or

        [DllImport("mycode.dll",  EntryPoint = "channelIssueActionRequest")]
        public static extern int ChannelIssueActionRequest(Int64 Id,int requestId, [MarshalAs(UnmanagedType.LPStr)] string actionTypeName, [MarshalAs(UnmanagedType.CustomMarshaler, MarshalTypeRef=typeof(UTF8Marshaler))] string nameValueParameters);

in This one actionTypeName is from ascii char* to string, and nameValueParameters is from char* utf8 to string
So technically in this approach you always have string in C# which might not be desirable for many cases as you lose original data. But also totally makes sense for many other cases.

@roozbehid-ic
Copy link
Author

I am wondering why are you caching strings?
Could this be used instead https://docs.microsoft.com/en-us/dotnet/api/system.string.intern?view=net-5.0 ?

@lithiumtoast
Copy link
Member

Consider the case where a C library uses C strings (char*) for identifiers that are most likely less than 25 characters in length. E.g. "PlayerEntity", "WeaponComponent", "RenderingSystem", etc. These strings are used frequently and usually out of context to the global C functions as part of the API to do something significant such as getting an entity or updating a component, etc, and then returning a callback. I do not want the garbage collector in C# to allocate memory in heap 0 for these strings from the callback. Instead I rather run a quick hash over the C string and use a hashtable to know if that C string's contents has already has an analogous memory allocated in the GC and use that reference instead.

System.String.Intern(string str) would work if it accepted a raw pointer to a C style string. Otherwise, it accepts a string as input but that string instance still has to be allocated first which is not great. Also, according to the docs the memory would be retained until the application exits or restarts which is not great either.

@lithiumtoast
Copy link
Member

lithiumtoast commented Oct 5, 2021

I use following for char* which is potentially utf8 to string (although C# comes with ascii char* to string marshaller)

This is interesting, I'll have to try it out sometime and see if it makes more sense. Right now implicit casting makes CString8U and CString16U "behave" and be "accepted" into functions and methods which use string. I do the same technique for CBool and bool.

@roozbehid-ic
Copy link
Author

Also FYI, I pasted the code to show custom marshallers could potentially be used (although I don't think they can be used in unmanaged function pointers)
New C# already has [MarshalAs(UnmanagedType.LPUTF8Str)] which can convert char* as utf8 to string

What do you think about a special mode of c2cs program that would generate managed C# code and also uses these marshaling?

@lithiumtoast
Copy link
Member

@roozbehid-ic
Copy link
Author

I can give it a try and see if I can send a PR for you. Thanks

@lithiumtoast
Copy link
Member

Just a note of warning, I have StyleCop ruleset turned on for this project. You can disable it by adding the following to the .csproj in question:

<EnableAnalyzers>false</EnableAnalyzers>

@lithiumtoast
Copy link
Member

Closing as the original problem was resolved

@lithiumtoast
Copy link
Member

lithiumtoast commented Dec 11, 2021

@roozbehid-ic I have just encountered a grave mistake on my part. wchar_t is 2 bytes on Windows but actually 4 bytes on Linux. I'll need to re-think this.

@lithiumtoast lithiumtoast reopened this Dec 11, 2021
@lithiumtoast
Copy link
Member

I'm going to close this actually a create a new issue and link the related issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants