This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

i386: Fixed definition with declaration in eetoprofinterfaceimpl.cpp #18792

Merged
merged 4 commits into from Sep 1, 2018

sergign60
@sergign60 sergign60 changed the title i386: Fixed declaration in eetoprofinterfaceimpl.cpp with definition i386: Fixed definition with declaration in eetoprofinterfaceimpl.cpp Jul 5, 2018
@@ -6,17 +6,17 @@

extern "C"
{
-void ProfileEnterNaked(FunctionIDOrClientID functionIDOrClientID)
+void __stdcall ProfileEnterNaked(FunctionIDOrClientID functionIDOrClientID)
Member

Is the JIT on Linux x86 going to call these with __stdcall calling convention? I thought we have converted everything to cdecl.

Member

This one appears documented as though it was __stdcall: https://github.com/dotnet/coreclr/blob/master/src/jit/codegencommon.cpp#L6723
and from my admittedly limited understanding of the JIT code, that is how it's currently implemented.

@dotnet/jit-contrib - does that sound right?

Assuming it is, LGTM.

Author

@sergign60 sergign60 Jul 6, 2018

We have the following in the file src/vm/eetoprofinterfaceimpl.cpp:

// Declarations for asm wrappers of profiler callbacks
EXTERN_C void __stdcall ProfileEnterNaked(FunctionIDOrClientID functionIDOrClientID);
EXTERN_C void __stdcall ProfileLeaveNaked(FunctionIDOrClientID functionIDOrClientID);
EXTERN_C void __stdcall ProfileTailcallNaked(FunctionIDOrClientID functionIDOrClientID);

So when we try to do some profiling work (for example, measuring memory usage) in the optimized version of coreclr, we get an inconsistent stack after calling one of these stubs on x86. We don't need assembler variants of these stubs right now.

Member

Would it be better to fix this in the JIT to use cdecl, so that we are using cdecl across the board on Linux x86? Or is there a good reason why this should be an outlier on Linux x86?

Author

@jkotas Do you mean that I need to add #ifdef x86 && linux with cdecl declarations in the file src/vm/eetoprofinterfaceimpl.cpp?

Member

You can delete the stdcall there and let it use the default (i.e. stdcall on Windows and cdecl on Linux). It is what was done in a number of places during Linux x86 bring up.

The key place to get in sync is the JIT though.

Member

I don't know what the rationale was to use cdecl on Linux so I don't know what advantage we'd get by switching it (other than the knowledge we are consistent). From the standpoint of codesize, changing this to cdecl would require adding a pop instruction to the prologue of every managed method so probably a net performance loss.

Member

@jkotas jkotas Jul 7, 2018

There was a long discussion and stream of fixes about this during the Linux x86 bring up. Here are a few related PRs:

#9928
#9977
#10410

The summary is:

  • ESP is 16-byte aligned on Linux x86. It makes use of stdcall awkward and inefficient. For example, when we were still using stdcall, the JIT ended up generating code like this:
sub esp, 12 // necessary to maintain stack alignment
push <argument>
call method
add esp, 12
  • cdecl is the native calling convention on Linux x86. We have found that stdcall is handled poorly throughout the system. For example, unwinders do not understand it, and the C/C++ compiler does not generate good code for it: it calls the method as cdecl and then compensates for callee-popped arguments like this:
call method_with_stdcall_convention
sub esp, <size of arguments>

It is true that stdcall code w/o the stack alignment requirement can be smaller. However, stdcall is one of the reasons why x86 is different from most other platforms out there. I guess that the Linux folks decided a long time ago that it is not worth the ongoing pain; it is better to take a small hit and just use cdecl for simplicity.

We have not optimized the JIT for cdecl. The focus of Linux x86 port has been functionality, not performance so far. The JIT does not generate as good code as it can on Linux x86. https://github.com/dotnet/coreclr/issues/10012 is the key part of that.
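The padding arithmetic behind the two snippets above can be sketched as follows. This is a toy model, not JIT code: the helper names are hypothetical, and it assumes esp is already 16-byte aligned before the call sequence starts and that every argument occupies a 4-byte slot.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; none of these names exist in coreclr.
constexpr uint32_t kStackAlign = 16; // alignment required at the call instruction on Linux x86

// Bytes to subtract from esp before pushing `argBytes` of arguments,
// assuming esp is 16-byte aligned before the sequence starts.
inline uint32_t paddingBeforePush(uint32_t argBytes)
{
    assert(argBytes % 4 == 0); // x86 pushes are 4 bytes each
    return (kStackAlign - argBytes % kStackAlign) % kStackAlign;
}

// Bytes the caller must add back to esp after the call returns.
inline uint32_t callerCleanupBytes(uint32_t argBytes, bool calleePopsArgs)
{
    // stdcall: the callee already popped the arguments, so only the padding remains.
    // cdecl: the caller frees both the padding and the arguments.
    return paddingBeforePush(argBytes) + (calleePopsArgs ? 0 : argBytes);
}
```

For a single 4-byte argument this gives 12 bytes of padding in both cases; the caller then frees 12 bytes under stdcall and 16 bytes under cdecl.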

@AaronRobinsonMSFT
Member

cc @noahfalk

Member

@jkotas jkotas left a comment

Does the JIT need adjustment as well?

@sergign60
Author

@jkotas We've checked this fix with our profiler on x86 device emulator for tizen. It works well now.

@jkotas
Member

jkotas commented Jul 9, 2018

It may work by accident. We should check that the JIT generates correct code with the right calling convention and stack alignment. Could you please share a Linux x86 disassembly for a simple managed method with these callbacks?

@noahfalk
Member

noahfalk commented Jul 9, 2018

Thanks for all the info Jan!

> For example, unwinders do not understand it

Do you mean libunwind simply fails to unwind from a __stdcall callee to caller in all circumstances (and thus lldb/gdb fail to create stacktraces) or something less dire? If you aren't sure we can do some testing so no worries.

The existing API on windows x86 expects the callee to preserve all registers, including the registers __stdcall/__cdecl would normally be allowed to trash. We expect that a profiler writer will need to hand author assembly code to achieve this. There are some older and simpler ELT variants for profiler writers that care more about ease-of-use than performance, so if the profiler author is using ELT3 callbacks they probably value not having an inefficient ABI. If we aren't ruining the ability to debug/profile inside these callbacks I remain interested to declare the ABI is callee popped register, non-16 byte aligned. On the other hand if we are ruining the ability to debug it then I'd agree with you: caller popped, 16 byte aligned stack.

> The focus of Linux x86 port has been functionality, not performance so far.

Makes sense. This ABI is part of a public contract with profilers, so once set I don't expect we'd ever change it. If we conclude we don't have time to implement the right long-term solution, I think the short-term move would be to leave the feature disabled. The __stdcall-like API appears to be what we've got now in the JIT, so implementation-wise it's easy, whereas switching to the __cdecl-like convention would take some work.

-EXTERN_C void __stdcall ProfileTailcallNaked(FunctionIDOrClientID functionIDOrClientID);
+EXTERN_C void ProfileEnterNaked(FunctionIDOrClientID functionIDOrClientID);
+EXTERN_C void ProfileLeaveNaked(FunctionIDOrClientID functionIDOrClientID);
+EXTERN_C void ProfileTailcallNaked(FunctionIDOrClientID functionIDOrClientID);
Member

@noahfalk noahfalk Jul 9, 2018

Did these just become __cdecl on windows x86 now? Sorry I don't recall what the windows compiler emits when you aren't specific. (on windows they must be __stdcall for correctness with existing profilers)

Member

This change has no impact on Windows. We compile with stdcall as the default convention on Windows. Many places depend on this default.

Member

Thoughts on adding a comment above these declarations that the default calling convention is used to dictate these? Something along the lines of:

// The calling convention should not be set explicitly for these callbacks. The calling convention is defined implicitly by the default set during compilation (i.e. Windows => stdcall).

Member

@noahfalk noahfalk left a comment

I want to make sure we've reached an agreement on what the ABI should be before this goes ahead. (If there is any short-term need that I am holding up, do let me know. I don't want to be needlessly slowing things down.)

@jkotas
Member

jkotas commented Jul 9, 2018

> Do you mean libunwind simply fails to unwind from a __stdcall callee to caller in all circumstances

It depends on the shape of the callsite:

  • It works when the stdcall callsite compensates for the callee-popped arguments by subtracting them from esp again, like what you see in code generated by the C/C++ compiler on Linux x86.
  • It does not work when the stdcall callsite is the regular push+push+...+call callsite like what you see on Windows, or what the JIT seems to be generating for the callback right now.

> declare the ABI is callee popped register, non-16 byte aligned

Note that this is more efficient only if the callbacks are 100% assembly. If these callbacks ever call into C/C++ code, they have to re-align the stack to 16 bytes. Re-aligning a misaligned stack is a more severe efficiency hit than having the stack aligned by the ABI.

@sergign60
Author

sergign60 commented Jul 9, 2018

@jkotas Sorry, I was wrong. It does not work: it fails with SIGSEGV after the first call of ProfileLeaveNaked executes and the method containing that call returns. I guess that retl is not enough.

Here are some fragments of the disassembled code:

libcoreclr.so`::ProfileEnterNaked(FunctionIDOrClientID):
->  0xb5065820 <+0>: retl

   0xb32e33cb: subl   $0x8, %esp
   0xb32e33ce: xorl   %eax, %eax
   0xb32e33d0: movl   %eax, -0x4(%ebp)
   0xb32e33d3: pushl  $0xb32d9198               ; imm = 0xB32D9198
   0xb32e33d8: calll  0xb5065820                ; ::ProfileEnterNaked(FunctionIDOrClientID) at unixstubs.cpp:17
>  0xb32e33dd: movl   %ecx, -0x8(%ebp)
   0xb32e33e0: xorl   %edx, %edx
   0xb32e33e2: movl   %edx, -0x4(%ebp)
   0xb32e33e5: cmpb   $0x0, -0x4(%ebp)
   0xb32e33e9: jne    0xb32e33f8


libcoreclr.so`::ProfileLeaveNaked(FunctionIDOrClientID):
->  0xb5065830 <+0>: retl

    0xb32e3672: leal   -0x1(%eax), %esi
    0xb32e3675: subl   $0x8, %esp
    0xb32e3678: pushl  %esi
    0xb32e3679: pushl  %eax
    0xb32e367a: calll  0xb52b2b40                ; COMString::LastIndexOfCharArray at stringnative.cpp:379
    0xb32e367f: addl   $0x10, %esp
    0xb32e3682: pushl  $0xb32d769c               ; imm = 0xB32D769C
    0xb32e3687: calll  0xb5065830                ; ::ProfileLeaveNaked(FunctionIDOrClientID) at unixstubs.cpp:22
->  0xb32e368c: popl   %ecx
    0xb32e368d: popl   %esi
    0xb32e368e: popl   %ebp
    0xb32e368f: retl

Maybe this is wrong:

===>    0xb32e367f: addl   $0x10, %esp
    0xb32e3682: pushl  $0xb32d769c               ; imm = 0xB32D769C
    0xb32e3687: calll  0xb5065830                ; ::ProfileLeaveNaked(FunctionIDOrClientID) at unixstubs.cpp:22
   0xb32e368c: popl   %ecx
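One way to read the fragments above, offered as an interpretation of the disassembly and not as something stated in this comment: the callsite pushes one argument and does no cleanup, as a stdcall caller would, while the stub is a bare retl that pops only the return address. The toy model below uses hypothetical names and tracks only how far esp drifts across such a call.

```cpp
#include <cassert>
#include <cstdint>

// Toy model with hypothetical names: tracks esp drift, in bytes, relative to
// its value before the call sequence. Negative means bytes were left on the stack.
enum class Callee
{
    PlainRet,    // cdecl-style stub: `retl` pops only the return address
    RetPopsArgs  // stdcall-style stub: `retl $N` also pops the arguments
};

inline int32_t espDriftAfterCall(uint32_t argBytes, Callee callee, uint32_t callerAddEsp)
{
    assert(argBytes % 4 == 0); // x86 pushes are 4 bytes each
    int32_t esp = 0;
    esp -= static_cast<int32_t>(argBytes);     // pushl <argument(s)>
    esp -= 4;                                  // calll pushes the return address
    esp += 4;                                  // retl pops the return address
    if (callee == Callee::RetPopsArgs)
        esp += static_cast<int32_t>(argBytes); // retl $N pops the arguments too
    esp += static_cast<int32_t>(callerAddEsp); // optional `addl $N, %esp` at the callsite
    return esp;
}
```

With one argument, a plain-retl stub, and no caller cleanup, the model reports a drift of -4: the argument is still on the stack. Under that reading, the following popl and retl instructions would each consume a slot one position off from the intended one, which is consistent with the crash described above.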

@noahfalk
Member

noahfalk commented Jul 9, 2018

> It does not work when the stdcall callsite is the regular push+push+...+call callsite like what you see on Windows, or what the JIT seems to be generating for the callback right now.

OK then I'm on board for __cdecl-style calling convention. Sounds like we'll need to work in the JIT, in the asm stubs, and in the documentation to get this working E2E. Thanks!

@jkotas
Member

jkotas commented Jul 10, 2018

The callbacks callsites in the JIT should look like this:

subl   $0xC, %esp // Allocate padding before pushing the argument (required to keep stack 16-byte aligned)
pushl  $0xb32d9198 // Push argument
calll  0xb52b2b40
addl   $0x10, %esp // Free stack space occupied by the argument

@sergign60 Could you please fix the JIT accordingly?

@sergign60
Author

@jkotas To be honest, I don't quite understand your code fragment. Could you please add some explanation?

Should I delete this instruction before your code?

===>    0xb32e367f: addl   $0x10, %esp

Thanks in advance

@jkotas
Member

jkotas commented Jul 10, 2018

I have added a few comments.

> Should I delete this instruction before your code?
> ===> 0xb32e367f: addl $0x10, %esp

This instruction belongs to the previous call; it should stay there.

The code emitted for Linux x86 currently is not very efficient. It follows the right calling convention, but it has a lot of unnecessary sub esp, XXX and add esp, YYY. #10012 is about fixing that. You should not need to worry about the unnecessary instructions for this change.

@sergign60
Author

sergign60 commented Jul 10, 2018

@jkotas Thanks! I've found genAlignStackBeforeCall and genRemoveAlignmentAfterCall in codegenxarch.cpp. Is your proposal about them? I don't mean to use them 'as is'.

@jkotas
Member

jkotas commented Jul 10, 2018

Yes, it is the idea.

The prolog/epilog that the profiler callbacks are part of are emitted directly as instructions. You may just emit sub esp,0xC and add esp, 0x10 instructions directly around the call for Linux x86.

@sergign60
Author

@jkotas Please review. Unfortunately, it's still not working. I'm trying to understand why.

@jkotas
Member

jkotas commented Jul 11, 2018

The change looks good to me. I do not see anything obviously wrong in it.

@sergign60
Author

sergign60 commented Jul 11, 2018

@jkotas it fails now in emitter::emitStackPushLargeStk

                if (level.IsOverflow() || !FitsIn<unsigned short>(level.Value()))
                {
                    IMPL_LIMITATION("Too many/too big arguments to encode GC information");
                }

because of level.IsOverflow(). I don't think that this fix is the cause, but we don't see this failure with stdcall. The failure arises with our profiling; we'll investigate it, so this PR can be merged now.

@jkotas
Member

jkotas commented Jul 11, 2018

add/sub esp, ... instructions interact with stack level tracking. Look for emitCurStackLvl or emitAdjustStackDepthPushPop. The stack level tracking is not accurate in the prolog/epilog anyway. The code comment "argSize. Again, we have to lie about it" touches on this.

Try adjusting emitCurStackLvl before or after emitting the instructions to stop hitting this assert.

@sergign60
Author

sergign60 commented Jul 17, 2018

@jkotas I've just found two places where emitCurStackLvl needs to be adjusted:

diff --git i/src/jit/emitxarch.cpp w/src/jit/emitxarch.cpp
index 8614069..85c12b9 100644
--- i/src/jit/emitxarch.cpp
+++ w/src/jit/emitxarch.cpp
@@ -5603,6 +5603,19 @@ void emitter::emitIns_Call(EmitCallType          callType,
     }
 
 #endif // !FEATURE_FIXED_OUT_ARGS
+
+#if defined(UNIX_X86_ABI)
+    if (isNoGC)
+    {
+        unsigned helper = Compiler::eeGetHelperNum(methHnd);
+        if (helper == CORINFO_HELP_PROF_FCN_LEAVE
+            || helper == CORINFO_HELP_PROF_FCN_TAILCALL)
+        {
+            emitCurStackLvl += sizeof(int);
+        }
+    }
+#endif
+
 }
 
 #ifdef DEBUG
@@ -11180,6 +11193,16 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp)
                 break;
 
             default:
+#if defined(UNIX_X86_ABI)
+                if (ins == INS_call)
+                {
+                    if (id->idIsNoGC())
+                    {
+                        // How can I determine here that I meet helper
+                        //        helper == CORINFO_HELP_PROF_FCN_LEAVE
+                        //     || helper == CORINFO_HELP_PROF_FCN_TAILCALL ???
+                        emitCurStackLvl += sizeof(int);
+                    }
+                }
+#endif
                 break;
         }
     }

Could you help me with this question:

+                         // How can I determine here that I meet helper
+                        //        helper == CORINFO_HELP_PROF_FCN_LEAVE
+                        //     || helper == CORINFO_HELP_PROF_FCN_TAILCALL ???

Thanks in advance!

@jkotas
Member

jkotas commented Jul 17, 2018

Would it work to use emitCntStackDepth for this? If I am reading the code correctly, it should be 0 during prolog/epilog. It is used to suppress the stack level tracking during prolog/epilog.
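The suppression idea in this comment can be sketched with a toy tracker. All names are hypothetical; the real emitter state (emitCurStackLvl, emitCntStackDepth) involves more bookkeeping than is modeled here, and the only assumption taken from the thread is that a per-push increment of 0 turns tracking off.

```cpp
#include <cassert>

// Toy model with hypothetical names; not the emitter's actual data structures.
struct StackLevelTrackerSketch
{
    unsigned bytesPerPush = sizeof(int); // 0 suppresses tracking (prolog/epilog)
    unsigned curLevel = 0;               // bytes currently tracked as pushed
    unsigned maxLevel = 0;               // high-water mark of curLevel

    void onPush()
    {
        curLevel += bytesPerPush;
        if (curLevel > maxLevel)
            maxLevel = curLevel;
    }

    void onPop(unsigned count = 1)
    {
        unsigned bytes = count * bytesPerPush;
        assert(curLevel >= bytes); // popping more than was tracked would underflow
        curLevel -= bytes;
    }
};

// Simulates `push arg; call; add esp, 4` and returns the recorded high-water mark.
inline unsigned maxLevelForOneArgCall(bool inPrologOrEpilog)
{
    StackLevelTrackerSketch tracker;
    if (inPrologOrEpilog)
        tracker.bytesPerPush = 0; // tracking suppressed
    tracker.onPush();
    tracker.onPop();
    return tracker.maxLevel;
}
```

With tracking suppressed, pushes and pops inside the prolog or epilog leave both the current level and the high-water mark untouched, so extra esp adjustments there cannot trip a depth check.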

@sergign60
Author

sergign60 commented Jul 18, 2018

@jkotas As I see it, emitCntStackDepth is not set to 0 in an epilog. This is the problem. I'm trying to find out how to fix it, but the fix could cause other unpredictable problems, because the current code works without setting emitCntStackDepth appropriately. Please take a look at the end of emitter::emitOutputInstr in emitxarch.cpp.

#if !FEATURE_FIXED_OUT_ARGS
    bool updateStackLevel = !emitIGisInProlog(ig) && !emitIGisInEpilog(ig);

#if FEATURE_EH_FUNCLETS
    updateStackLevel = updateStackLevel && !emitIGisInFuncletProlog(ig) && !emitIGisInFuncletEpilog(ig);
#endif // FEATURE_EH_FUNCLETS

    // Make sure we keep the current stack level up to date
    if (updateStackLevel)
    {

emitCntStackDepth is equal to 4 and updateStackLevel is 1 here in epilog

@sergign60
Author

@jkotas @Dmitri-Botcharnikov @alpencolt

clang5.0 emits the following error on x86

[  253s] /home/abuild/rpmbuild/BUILD/coreclr-2.1.1/src/vm/eetoprofinterfaceimpl.cpp:2120:21: error: cast between incompatible calling conventions 'cdecl' and 'stdcall'; calls through this pointer may abort at runtime [-Werror,-Wcast-calling-convention]
[  253s]                     reinterpret_cast<FunctionEnter3 *>(PROFILECALLBACK(ProfileEnter)) :
[  253s]                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[  253s] /home/abuild/rpmbuild/BUILD/coreclr-2.1.1/src/vm/eetoprofinterfaceimpl.cpp:2056:15: note: consider defining 'ProfileEnterNaked' with the 'stdcall' calling convention
[  253s] EXTERN_C void ProfileEnterNaked(FunctionIDOrClientID functionIDOrClientID);
[  253s]               ^
[  253s]               STDCALL 

I guess that it's because we have in src/pal/prebuilt/inc/coreprof.h


typedef void __stdcall __stdcall FunctionEnter3( 
    FunctionIDOrClientID functionIDOrClientID);

typedef void __stdcall __stdcall FunctionLeave3( 
    FunctionIDOrClientID functionIDOrClientID);

typedef void __stdcall __stdcall FunctionTailcall3( 
    FunctionIDOrClientID functionIDOrClientID);

@jkotas
Member

jkotas commented Jul 19, 2018

You can change these __stdcall to STDMETHODCALLTYPE. STDMETHODCALLTYPE is a macro that defines the default calling convention.
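A minimal sketch of the macro approach, under stated assumptions: SKETCH_CALLTYPE, FunctionIdSketch, and the other names below are hypothetical stand-ins, and the real STDMETHODCALLTYPE definition in the platform headers may differ. The point is only that the typedef and the function share one macro, so their calling conventions cannot disagree.

```cpp
#include <cassert>

// Hypothetical macro for illustration; not the real STDMETHODCALLTYPE definition.
#if defined(_MSC_VER) && defined(_M_IX86)
#define SKETCH_CALLTYPE __stdcall // Windows x86: callee pops arguments
#else
#define SKETCH_CALLTYPE           // elsewhere: the compiler's default convention
#endif

using FunctionIdSketch = unsigned long; // stand-in for FunctionIDOrClientID

// The typedef and the function below both use the same macro.
typedef void SKETCH_CALLTYPE EnterCallbackSketch(FunctionIdSketch id);

static FunctionIdSketch g_lastSeenId = 0;

void SKETCH_CALLTYPE SampleEnter(FunctionIdSketch id)
{
    g_lastSeenId = id;
}

// No cast between calling conventions is needed: SampleEnter already has
// exactly the type named by the typedef.
inline FunctionIdSketch callThroughTypedef(EnterCallbackSketch* callback, FunctionIdSketch id)
{
    assert(callback != nullptr);
    callback(id);
    return g_lastSeenId;
}
```

Because the pointer type and the function type match on every platform, a warning like -Wcast-calling-convention has nothing to flag in this arrangement.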

@jkotas
Member

jkotas commented Aug 21, 2018

Will close & reopen to pick up current CI definitions

@jkotas jkotas closed this Aug 21, 2018
@jkotas jkotas reopened this Aug 21, 2018
@sergign60
Author

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test please

@sergign60
Author

sergign60 commented Aug 24, 2018

@BruceForstall You're absolutely right. Thanks very much for your help!
cc: @jkotas @noahfalk
Please review it one more time

@BruceForstall
Member

This looks wrong. Why did you use AddNestedAlignment() instead of the AddStackLevel() / SubtractStackLevel() that I suggested?

@sergign60
Author

sergign60 commented Aug 26, 2018

@BruceForstall I saw it in CodeGen::genAlignStackBeforeCall when I tried to avoid the assert in CodeGen::genGenerateCode:

#if EMIT_TRACK_STACK_DEPTH
    /* Check our max stack level. Needed for fgAddCodeRef().
       We need to relax the assert as our estimation won't include code-gen
       stack changes (which we know don't affect fgAddCodeRef()) */
    {
        unsigned maxAllowedStackDepth = compiler->fgPtrArgCntMax +    // Max number of pointer-sized stack arguments.
                                        compiler->compHndBBtabCount + // Return address for locally-called finallys
                                        genTypeStSz(TYP_LONG) +       // longs/doubles may be transferred via stack, etc
                                        (compiler->compTailCallUsed ? 4 : 0); // CORINFO_HELP_TAILCALL args
#if defined(UNIX_X86_ABI)
        maxAllowedStackDepth += maxNestedAlignment;
#endif
        noway_assert(getEmitter()->emitMaxStackDepth <= maxAllowedStackDepth);  <====!!!!
    }
#endif // EMIT_TRACK_STACK_DEPTH

https://github.com/dotnet/coreclr/blob/master/src/jit/codegencommon.cpp#L2446

Your method isn't working and triggers exactly this assert when I use SubtractStackLevel(0xC); SubtractStackLevel(0x10) triggers another assert, assert(genStackLevel >= adjustment).

With my fix, the tests run successfully.

@BruceForstall
Member

@sergign60 Check out #19700. I tweaked your change to include my suggestion, and it seems to work for me now. (I can't see how to get GitHub to just show me the difference between your last change and mine.)

Note that I found that the assert you were encountering was far too generous for Linux/x86 due to misunderstanding between count of bytes and count of ints, so I corrected that as well. It's possible that will lead to new asserts that would need to be investigated.
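The kind of unit mismatch described here can be illustrated generically. The sketch below is not the actual fix from #19700, and it makes no claim about which unit each quantity in the real assert uses; the helper names and the 4-byte slot size are assumptions for illustration.

```cpp
#include <cassert>

// Hypothetical names; a generic illustration of a bytes-versus-slots mix-up.
constexpr unsigned kSlotBytes = 4; // assumed size of one x86 stack slot

inline unsigned bytesToSlots(unsigned bytes)
{
    assert(bytes % kSlotBytes == 0); // alignment padding is a whole number of slots
    return bytes / kSlotBytes;
}

// Compares like with like: every term is a slot count.
inline bool depthWithinLimit(unsigned depthSlots, unsigned baseLimitSlots, unsigned alignmentBytes)
{
    return depthSlots <= baseLimitSlots + bytesToSlots(alignmentBytes);
}

// Mixed-unit variant kept for contrast: adds a byte count to a slot count,
// which makes the limit several times more generous than intended.
inline bool depthWithinLimitMixedUnits(unsigned depthSlots, unsigned baseLimitSlots, unsigned alignmentBytes)
{
    return depthSlots <= baseLimitSlots + alignmentBytes;
}
```

With a base limit of 2 slots and 12 bytes of alignment, the consistent check rejects a depth of 6 slots while the mixed-unit check accepts it.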

@sergign60
Author

sergign60 commented Aug 28, 2018

@BruceForstall I've checked your variant #19700 with COMPlus_GCStress=3 (https://github.com/dotnet/coreclr/issues/11043) on the tizen emulator for x86 platform, we have no x86 hardware here. It's ok, many thanks!

@sergign60
Author

@BruceForstall I've added your changes

Member

@BruceForstall BruceForstall left a comment

LGTM if all the tests pass

@@ -125,6 +125,12 @@ import "wtypes.idl";
import "unknwn.idl";
#endif

#ifdef PLATFORM_UNIX
Member

@jkotas Does this section need to be under defined(_TARGET_X86_) && defined(PLATFORM_UNIX)?

Member

I think this should be just:

#define STDMETHODCALLTYPE

without any ifdefs to make the MIDL compiler happy. Defines in .idl files work in a weird way.

@@ -438,50 +438,50 @@ typedef struct _COR_PRF_METHOD
mdMethodDef methodId;
} COR_PRF_METHOD;

-typedef void __stdcall __stdcall FunctionEnter(
+typedef void STDMETHODCALLTYPE STDMETHODCALLTYPE FunctionEnter(
Member

@jkotas jkotas Aug 28, 2018


MIDL compiler is duplicating this for some reason. There is nothing to fix.

@sergign60
Author

@BruceForstall @jkotas is it ok?

@jkotas jkotas merged commit a1757ce into dotnet:master Sep 1, 2018
@jkotas
Member

jkotas commented Sep 1, 2018

Thanks

@sergign60 sergign60 deleted the fix branch September 3, 2018 06:59