Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LTTNG-UST 2.13 changes soname breaking loading of libcoreclrtraceptprovider #57784

Open
hoyosjs opened this issue Aug 19, 2021 · 39 comments
Open

Comments

@hoyosjs
Copy link
Member

hoyosjs commented Aug 19, 2021

There's an ABI breaking change between LTTNG-ust 2.12 and 2.13, so they've reved their soname from liblttng-ust.so.0 to liblttng-ust.so.1. We link against the former. Given some distros will start supporting newer versions, customers will likely end up upgrading packages and realize they can't load libcoreclrtraceptprovider.so. We should investigate avenues to support multiple versions as needed using compilation tricks like the ones used in System.Security for openssl. Other possibility is to only support 2.13 and that brings us come code sanity advantages - this would only happen in .NET 7+ realistically.

@brianrob @dotnet/dotnet-diag

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Aug 19, 2021
@ghost
Copy link

ghost commented Aug 19, 2021

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

Issue Details

There's an ABI breaking change between LTTNG-ust 2.12 and 2.13, so they've reved their soname from liblttng-ust.so.0 to liblttng-ust.so.1. We link against the former. Given some distros will start supporting newer versions, customers will likely end up upgrading packages and realize they can't load libcoreclrtraceptprovider.so. We should investigate avenues to support multiple versions as needed using compilation tricks like the ones used in System.Security for openssl. Other possibility is to only support 2.13 and that brings us come code sanity advantages - this would only happen in .NET 7+ realistically.

@brianrob @dotnet/dotnet-diag

Author: hoyosjs
Assignees: -
Labels:

area-Diagnostics-coreclr

Milestone: -

@am11
Copy link
Member

am11 commented Aug 19, 2021

using compilation tricks like the ones used in System.Security for openssl

Note, same (shim) tricks are used for libnuma.so, which is closer to where coreclrtraceptprovider lives. 🙂

@hoyosjs
Copy link
Member Author

hoyosjs commented Aug 19, 2021

Thanks. The shim approach is tempting. We sort of codegen the trace points, so this one will be interesting if we decide to support both.

@am11
Copy link
Member

am11 commented Oct 16, 2021

After some digging, it turned out lttng/tracepoint.h header (the one we use) does things quite differently than libssl, libnuma ones. It differs in the following manner:

  • We use TACEPOINT_* macros which (during the macro expansion) binds to __tracepoints__init method and the related code chunks.
    • We stay consistent and do not directly use any non-macro APIs.
  • The __tracepoints__init itself uses dlopen(3) and uses the ABI-versioned flavor liblttng-ust.so.0, liblttng-ust.so.1 (For 2.13).

Here is the expansion of macros for our generated tpdotnetruntimemonoprofiler.h:

# on Ubuntu x64
$ ./build.sh clr
$ clang-9 -E \
    artifacts/obj/coreclr/Linux.x64.Debug/pal/src/eventprovider/lttngprovider/lttng/tpdotnetruntimemonoprofiler.h \
    -Isrc/coreclr/pal/inc/rt -Isrc/coreclr/pal/inc > dump

the dump looks like this: http://sprunge.us/J5rsv3 (and file tpdotnetruntimemonoprofiler.h: http://sprunge.us/LPmIuJ). Search for __tracepoints__init or dlopen("liblttng. It is quite noisy, so I made a copy, removed #include "palrt.h" etc. and kept only one probe, here is the shorter version of this expansion: http://sprunge.us/BK0q1V, at the very end of this file is the code of interest.

I will defer to @janvorli as to the best approach for lttngshim. One option is that we install both versions on the official build machine, and compile both v0 and v1 implementations in our code with our pseudo init; which will probe for .so.1, and if not found, .so.0, then select the implementation based on the result. This approach will prevent us from stamping out symbols of liblttng-ust{-tracepoint}.so and we will still be using their init mechanism (which seem to take care of "shared access" scenarios).

@janvorli
Copy link
Member

@am11 thank you for the analysis! I'll take a look into the possibilities of shimming lttng and consider your suggestion.

@am11
Copy link
Member

am11 commented Oct 18, 2021

Thanks. I actually found about dlopen/dlsym pattern in their code when I commented out these lines (to see what breaks)

find_library(LTTNG NAMES lttng-ust)
target_link_libraries(coreclrtraceptprovider
${LTTNG}
)

just to find out it had no effect on the build result (still succeeds) other than we (unnecessarily) link to liblttng-ust{-tracepoint}.so.0. 🙂 All public APIs in their header are already shim-like macro (without multi-version support):

@janvorli
Copy link
Member

It seems that we will need to create and distribute two flavors of the libcoreclrtraceptprovider.so, one compiled with one version of lttng and the other with the other one. Then we would try to load one and if we fail, try to load the other. It seems that's essentially what @am11 has suggested, right @am11?

@hoyosjs
Copy link
Member Author

hoyosjs commented Oct 18, 2021

Yes, and that also seems to be what needs to happen from my understanding.

@am11
Copy link
Member

am11 commented Oct 18, 2021

@janvorli, yes, that's what I was thinking. It will be cleaner, as it will save us from using macro post-expansion calls. The quickest way to get there is to duplicate python code (as macro names are changed in 2.13).

Part II - python free implementation

However, in a separate project, we can replace the python scripts with lttngshim which will dynamically define those macros (at build time) and we can expand it without needing to build code. It is more work as python scripts read the XML (ETL manifest) file and generates multiple header and code files. If we think replacing python is a good idea, then I was imagining for:

  • input: we define a static header with our events and groups and keep it in sync with src/coreclr/vm/ClrEtwAll.man (manually at first, as it's only a few times a year we add/modify events; if we have more time, we can use XmlPeek msbuild task to update the header files when msbuild is present .. because cmake doesn't support xml parsing)
  • output, instead of creating too many header and code files (this is half of what we produce: https://gist.github.com/am11/404d99384eaae7259bcabdd06ff5e307), we can make one and ifdef out the flavors: coreclr, mono, rundom, private and stress.

@brianrob
Copy link
Member

One thing to add here: Using src\coreclr\vm\ClrEtwAll.man as the one source of truth means that all other tools that consume events only have to look at one source of truth. For example, TraceEvent-based tools use the manifest to generate C# code that gets used to parse the events.

@am11
Copy link
Member

am11 commented Oct 18, 2021

One thing to add here: Using src\coreclr\vm\ClrEtwAll.man as the one source of truth means that all other tools that consume events only have to look at one source of truth

I understand the rationale, and using XmlPeek -based parsing to dynamically generate the header is an option. However, in varied parts of runtime code base, we do manually keep lists in-sync which infrequently change, a few examples:

// **** NOTE: Keep these in sync with pal/inc/pal.h ****

/// These value names MUST be kept in sync with those in GetDriveType (moved to Interop.MountPoints.FormatInfo.cs),

* These values should be kept in sync with System.Globalization.CalendarId

@brianrob
Copy link
Member

Agree that we have precedent for duplicate lists that have to be kept in sync, I'd just rather not add another one. :)

@hoyosjs
Copy link
Member Author

hoyosjs commented Oct 18, 2021

Yes, but eventing is one of those things that often doesn't get updated much, but it gets updated by more than their known maintainers. Maintaining it outside the manifest is brittle. That's the main reason the python scripts have changed. It could easily be a little C# app that we stick in clr.prereqs and reads the manifest; we could stub them for bring up or generate them in a separate machine if needed if all we are trying to do is get rid of the Python dependency. There's prior art for this with the DAC table generation.

@am11
Copy link
Member

am11 commented Oct 19, 2021

but it gets updated by more than their known maintainers

Looking at git history of ClrEtwAll.man, it doesn't seem that way.

It could easily be a little C# app

Why would the C# app better than a simple msbuild XmlPeek like

<XmlPeek XmlContent="$(CrossTargetXml)" Query="toolchain-info/target-triple/text()">
?

@janvorli
Copy link
Member

Both C# app and msbuild would be problematic for bring-up of new platforms where we need to be able to build runtime native parts without pre-existing .NET runtime. So it seems it would be preferable to use a shell script instead if we wanted to drop python.

@hoyosjs
Copy link
Member Author

hoyosjs commented Oct 19, 2021

@janvorli for bring up, do you consider LTTNG crucial? Eventing in general is important for breadcrumbs, but I don't see why on bring up the source can't be checked in for the time being or stubbed out until it's deemed usable. We didn't want bash - we'd need bash + cmd. That's why it's python right now. I don't see all that much issue in python at the moment, but I know @am11 and @jkotas have a different opinion around having it as a dependency.

@am11
Copy link
Member

am11 commented Oct 19, 2021

I was thinking about a header containing a list of macros with event definitions (similar toecalllist.h), checked into the source control.

That header file containing definitions could be generated/update with msbuild task when msbuild is available. On new platforms, where msbuild is not available and we pass -skipmanaged etc. to build-runtime.sh (i.e. not a common scenario or something used by any CI leg), chances of someone adding a new event that updates ETL manifest in the same commit where they are porting runtime on a new platform does not sound likely.

Regarding the shell script option, it would require an additional dependency on xml parsing tool which I was avoiding.

@janvorli
Copy link
Member

janvorli commented Oct 19, 2021

for bring up, do you consider LTTNG crucial?

No, however we would need to enable building runtime without eventing. However, it would be nicer to be able to build all native parts even on a new platform. In the past, we've done bringup using an existing .NET SDK as a basis where we've replaced all native binaries be ones built on the new platform. So having all of them makes things complete. Also, I am not sure how would the source build that is expected to be used by distros handle this case.

@jkotas
Copy link
Member

jkotas commented Oct 19, 2021

I was thinking about a header containing a list of macros with event definition

This looks similar to what I have suggested in #39052 (comment)

Also, I am not sure how would the source build that is expected to be used by distros handle this case.

I believe that source build assumes pre-built SDK. It does not bootstrap from nothing.

@janvorli
Copy link
Member

This looks similar to what I have suggested in #39052 (comment)

Ah, I've forgotten about this comment, that sounds like the best solution to take care of my concern.

@omajid
Copy link
Member

omajid commented Dec 1, 2021

FWIW, Fedora 36 has picked up this version of lttng-ust and I am seeing crashes (in crossgen2!) there while trying to build .NET 6. Here's the backtrace from lldb:

* thread #1, name = 'crossgen2', stop reason = signal SIGSEGV: invalid address (fault address: 0x10)
    frame #0: 0x00007ffff71dcb82 liblttng-ust.so.1`check_event_provider + 162
liblttng-ust.so.1`check_event_provider:
->  0x7ffff71dcb82 <+162>: movq   0x10(%rax), %rdi
    0x7ffff71dcb86 <+166>: callq  0x7ffff71dca40            ; check_type_provider
    0x7ffff71dcb8b <+171>: movl   %eax, %r12d
    0x7ffff71dcb8e <+174>: testb  %al, %al
(lldb) bt
* thread #1, name = 'crossgen2', stop reason = signal SIGSEGV: invalid address (fault address: 0x10)
  * frame #0: 0x00007ffff71dcb82 liblttng-ust.so.1`check_event_provider + 162
    frame #1: 0x00007ffff71e24d1 liblttng-ust.so.1`lttng_ust_probe_register + 33
    frame #2: 0x00007ffff72957b5 libcoreclrtraceptprovider.so`lttng_ust__events_init__DotNETRuntime() at ust-tracepoint-event.h:1198:14

Is there a fix for this planned?

@am11
Copy link
Member

am11 commented Dec 1, 2021

Is there a fix for this planned?

Current workaround is to disable event trace by deleting the following block before build:

if(NOT DEFINED FEATURE_EVENT_TRACE)
set(FEATURE_EVENT_TRACE 1)
endif(NOT DEFINED FEATURE_EVENT_TRACE)

Supporting both ABI versions of liblttng-ust in parallel would require more work than a few liner fix.

@omajid
Copy link
Member

omajid commented Dec 2, 2021

Disabling feature like this:

diff --git a/src/coreclr/clrfeatures.cmake b/src/coreclr/clrfeatures.cmake
index f82ff1aa4e7..dfab148dbc3 100644
--- a/src/coreclr/clrfeatures.cmake
+++ b/src/coreclr/clrfeatures.cmake
@@ -3,7 +3,7 @@ if(CLR_CMAKE_TARGET_TIZEN_LINUX)
 endif()
 
 if(NOT DEFINED FEATURE_EVENT_TRACE)
-  set(FEATURE_EVENT_TRACE 1)
+  set(FEATURE_EVENT_TRACE 0)
 endif(NOT DEFINED FEATURE_EVENT_TRACE)
 
 if(NOT DEFINED FEATURE_PERFTRACING AND FEATURE_EVENT_TRACE)

Or even

diff --git a/src/coreclr/clrfeatures.cmake b/src/coreclr/clrfeatures.cmake
index f82ff1aa4e7..54a61c808e6 100644
--- a/src/coreclr/clrfeatures.cmake
+++ b/src/coreclr/clrfeatures.cmake
@@ -3,7 +3,6 @@ if(CLR_CMAKE_TARGET_TIZEN_LINUX)
 endif()
 
 if(NOT DEFINED FEATURE_EVENT_TRACE)
-  set(FEATURE_EVENT_TRACE 1)
 endif(NOT DEFINED FEATURE_EVENT_TRACE)
 
 if(NOT DEFINED FEATURE_PERFTRACING AND FEATURE_EVENT_TRACE)

Doesn't seem to work too well out of the box. A ./build.sh produces:

  /home/omajid/devel/dotnet/runtime/src/coreclr/inc/profilepriv.h:388:34: error: unknown type name 'EventPipeProvider'
      void EventPipeEventDelivered(EventPipeProvider *provider, DWORD eventId, DWORD eventVersion, ULONG cbMetadataBlob, LPCBYTE metadataBlob, ULONG cbEventData,                                                                           
                                   ^                                                                                    /home/omajid/devel/dotnet/runtime/src/coreclr/inc/profilepriv.h:390:35: error: unknown type name 'EventPipeProvider'
      void EventPipeProviderCreated(EventPipeProvider *provider);                                                     
                                    ^
  /home/omajid/devel/dotnet/runtime/src/coreclr/vm/eetoprofinterfaceimpl.h:462:9: error: unknown type name 'EventPipeProvider'                                                                                                              
          EventPipeProvider *provider,                                                                                
          ^                                                                                                           
  /home/omajid/devel/dotnet/runtime/src/coreclr/vm/eetoprofinterfaceimpl.h:475:38: error: unknown type name 'EventPipeProvider'                                                                                                             
      HRESULT EventPipeProviderCreated(EventPipeProvider *provider);                                                  
                                       ^
  /home/omajid/devel/dotnet/runtime/src/coreclr/inc/profilepriv.h:388:34: error: unknown type name 'EventPipeProvider'
      void EventPipeEventDelivered(EventPipeProvider *provider, DWORD eventId, DWORD eventVersion, ULONG cbMetadataBlob, LPCBYTE metadataBlob, ULONG cbEventData,                                                                           
                                   ^                                                                                  
  /home/omajid/devel/dotnet/runtime/src/coreclr/inc/profilepriv.h:390:35: error: unknown type name 'EventPipeProvider'
      void EventPipeProviderCreated(EventPipeProvider *provider);                                                     
                                    ^                                           

And many similar errors.

@omajid
Copy link
Member

omajid commented Dec 2, 2021

Is disabling loading LTTNG support enough for now? COMPlus_LTTng=0

It's good enough for me, personally. And it's enough for the source-build builds I care about. But end-users who upgrade to distributions that have lttng-ust (Fedora >= 36, Arch Linux, probably more soon) will run into crashes at runtime and will not even know about this workaround.

@am11
Copy link
Member

am11 commented Dec 2, 2021

Unless someone is already working on it, I think we should stop pursuing "convert event generator to shim" part for now and focus on multi-version support first. It includes the following:

  • keeping lttng-ust headers in source tree (src/native/external/lttng perhaps? as @jkotas proposed for zlib etc.)
  • building two versions of our existing provider: libcoreclrtraceptprovider.so.0 and libcoreclrtraceptprovider.so.1 for each liblttng-ust ABI/SO version.
  • adding a mechanism to load suitable lttng, we have prior art for this in src/coreclr/pal, the initialization logic for libcoreclrtraceptprovider (but there might be some fallouts due to the dynamic linkage which appreantly is not needed LTTNG-UST 2.13 changes soname breaking loading of libcoreclrtraceptprovider #57784 (comment)?)

Conversion to shim would be nice, but it has many moving parts; even if we do the conversion first, it will probably not exempt the above steps.

@tmds
Copy link
Member

tmds commented Dec 3, 2021

Fedora 36 has picked up this version of lttng-ust and I am seeing crashes (in crossgen2!) there while trying to build .NET 6. Here's the backtrace from lldb:

I'm reproducing this issue.

  * frame #0: 0x00007f8885a47b82 liblttng-ust.so.1`check_event_provider + 162
    frame #1: 0x00007f8885a4d4d1 liblttng-ust.so.1`lttng_ust_probe_register + 33
    frame #2: 0x00007f8885b007b5 libcoreclrtraceptprovider.so`lttng_ust__events_init__DotNETRuntime() at ust-tracepoint-event.h:1198:14
    frame #3: 0x00007f888683fa2e ld-linux-x86-64.so.2`call_init(l=<unavailable>, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:70:3
    frame #4: 0x00007f888683fb1c ld-linux-x86-64.so.2`_dl_init(main_map=0x0000556bd608a290, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:117:5
    frame #5: 0x00007f88864534c5 libc.so.6`_dl_catch_exception + 229
    frame #6: 0x00007f88868437de ld-linux-x86-64.so.2`dl_open_worker at dl-open.c:821:5
    frame #7: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #8: 0x00007f8886843b5c ld-linux-x86-64.so.2`_dl_open at dl-open.c:896:17
    frame #9: 0x00007f888638294c libc.so.6`dlopen_doit + 92
    frame #10: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #11: 0x00007f8886453533 libc.so.6`_dl_catch_error + 51
    frame #12: 0x00007f888638244e libc.so.6`_dlerror_run + 142
    frame #13: 0x00007f88863829d8 libc.so.6`dlopen@GLIBC_2.2.5 + 72
    frame #14: 0x00007f8885fd6893 libcoreclr.so`PAL_InitializeTracing() at tracepointprovider.cpp:116:9
    frame #15: 0x00007f888683fa2e ld-linux-x86-64.so.2`call_init(l=<unavailable>, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:70:3
    frame #16: 0x00007f888683fb1c ld-linux-x86-64.so.2`_dl_init(main_map=0x0000556bd6060050, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:117:5
    frame #17: 0x00007f88864534c5 libc.so.6`_dl_catch_exception + 229
    frame #18: 0x00007f88868437de ld-linux-x86-64.so.2`dl_open_worker at dl-open.c:821:5
    frame #19: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #20: 0x00007f8886843b5c ld-linux-x86-64.so.2`_dl_open at dl-open.c:896:17
    frame #21: 0x00007f888638294c libc.so.6`dlopen_doit + 92
    frame #22: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #23: 0x00007f8886453533 libc.so.6`_dl_catch_error + 51
    frame #24: 0x00007f888638244e libc.so.6`_dlerror_run + 142
    frame #25: 0x00007f88863829d8 libc.so.6`dlopen@GLIBC_2.2.5 + 72
    frame #26: 0x00007f8886274ead libhostpolicy.so`pal::load_library(path="/home/tmds/rpmbuild/BUILD/dotnet-9e8b04bbff820c93c142f99a507a46b976f5c14c-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/package-cache/microsoft.netcore.app.crossgen2.linux-x64/6.0.0/tools/libcoreclr.so", dll=0x00007f888629e0a0) at pal.unix.cpp:230:12
...

I think the build works fine while using the prebuilts that link against liblttng-ust.so.0: It silently fails to load libcoreclrtraceptprovider.so.

In the stacktrace we see liblttng-ust.so.1 is loaded. I think these are binaries that got built against that. The SIGSEGV occurs in check_event_provider which is meant to validate the tracepoints. This may a regression in LTTng.

@tmds
Copy link
Member

tmds commented Dec 3, 2021

I can reproduce the issue by invoking corerun on a hello world console app.

The crash happens at this line: https://github.com/lttng/lttng-ust/blob/4c155a06d838e1ab5d385abd1d73ae56e71b7d5e/src/lib/lttng-ust/lttng-probes.c#L153.

The field is null.

(gdb) p *tp_class
$3 = {struct_size = 48, fields = 0x7ffff73ab2e0 <lttng_ust__event_fields___DotNETRuntime___GCStart>, nr_fields = 2, 
  probe_callback = 0x7ffff7364820 <lttng_ust__event_probe__DotNETRuntime___GCStart(void*, unsigned int, unsigned int)>, 
  signature = 0x7ffff738e720 <__tp_event_signature___DotNETRuntime___GCStart> "const unsigned int, Count, const unsigned int, Reason", probe_desc = 0x7ffff73a1470 <lttng_ust__probe_desc___DotNETRuntime>}
(gdb) p tp_class->fields[0]
$4 = (const struct lttng_ust_event_field * const) 0x0
(gdb) p tp_class->fields[1]
$5 = (const struct lttng_ust_event_field * const) 0x0

GCStart has 2 fields as we see in definition below, but the array has null pointers for them.

#define GCStart_TRACEPOINT_ARGS \
TP_ARGS( \
        const unsigned int, Count, \
        const unsigned int, Reason \
)
TRACEPOINT_EVENT_CLASS(
    DotNETRuntime,
    GCStart,
    GCStart_TRACEPOINT_ARGS,
     TP_FIELDS(
        ctf_integer(unsigned int, Count, Count)
        ctf_integer(unsigned int, Reason, Reason)
    )
)

@tmds
Copy link
Member

tmds commented Dec 3, 2021

The field is null.

These fields get initialized dynamically.

static const struct lttng_ust_event_field * const lttng_ust__event_fields___DotNETRuntime___GCStart[] = { new (const struct lttng_ust_event_field) { .struct_size = sizeof(struct lttng_ust_event_field), .name = "Count", .type = ((struct lttng_ust_type_common *) new (struct lttng_ust_type_integer) { .parent = { .type = lttng_ust_type_integer, }, .struct_size = sizeof(struct lttng_ust_type_integer), .size = sizeof(unsigned int) * 8, .alignment = 1 * 8, .signedness = (std::is_signed<unsigned int>::value), .reverse_byte_order = 1234 != 1234, .base = 10, }), .nowrite = 0, .nofilter = 0, }, new (const struct lttng_ust_event_field) { .struct_size = sizeof(struct lttng_ust_event_field), .name = "Reason", .type = ((struct lttng_ust_type_common *) new (struct lttng_ust_type_integer) { .parent = { .type = lttng_ust_type_integer, }, .struct_size = sizeof(struct lttng_ust_type_integer), .size = sizeof(unsigned int) * 8, .alignment = 1 * 8, .signedness = (std::is_signed<unsigned int>::value), .reverse_byte_order = 1234 != 1234, .base = 10, }), .nowrite = 0, .nofilter = 0, }, new (const struct lttng_ust_event_field) { .struct_size = sizeof(struct lttng_ust_event_field), .name = "dummy", .type = ((struct lttng_ust_type_common *) new (struct lttng_ust_type_integer) { .parent = { .type = lttng_ust_type_integer, }, .struct_size = sizeof(struct lttng_ust_type_integer), .size = sizeof(int) * 8, .alignment = 1 * 8, .signedness = (std::is_signed<int>::value), .reverse_byte_order = 1234 != 1234, .base = 10, }), .nowrite = 0, .nofilter = 0, }, }; static const struct lttng_ust_tracepoint_class lttng_ust__event_class___DotNETRuntime___GCStart = { .struct_size = sizeof(struct lttng_ust_tracepoint_class), .fields = lttng_ust__event_fields___DotNETRuntime___GCStart, .nr_fields = (sizeof(lttng_ust__event_fields___DotNETRuntime___GCStart) / sizeof((lttng_ust__event_fields___DotNETRuntime___GCStart)[0])) - 1, .probe_callback = (void (*)(void)) &lttng_ust__event_probe__DotNETRuntime___GCStart, .signature = __tp_event_signature___DotNETRuntime___GCStart, .probe_desc = &lttng_ust__probe_desc___DotNETRuntime, };

It seems they have not been initialized (yet):

(gdb) p lttng_ust__event_fields___DotNETRuntime___GCStart
$1 = {0x0, 0x0, 0x0}

@am11
Copy link
Member

am11 commented Dec 3, 2021

I think the build works fine

binary built against liblttng-ust.so.0 can not work against .so.1 without running into validation errors; per the release notes of ABI breaking change.

@tmds
Copy link
Member

tmds commented Dec 3, 2021

binary built against liblttng-ust.so.0 can not work against .so.1 without running into validation errors; per the release notes of ABI breaking change.

Yes, but it doesn't get used.

Because libcoreclrtraceptprovider.so is linked against liblttng-ust.so.0, and that is not available, the library silently fails to load here:

// Load the tracepoint provider.
// It's OK if this fails - that just means that tracing dependencies aren't available.
dlopen(tpProvPath, RTLD_NOW | RTLD_GLOBAL);

@am11
Copy link
Member

am11 commented Dec 3, 2021

Yes, in addition to updating PAL (adding version probing at run-time), I think we would also need two versions of libcoreclrtraceptprovider .so.0 and .so.0 to cover the supported matrix. The version selection part is yet to be implemented. It is the same situation as we once had with openssl, where we used to require 1.0 until later version support came along.

lttng is a bit tricky as we use their macro based APIs and their headers expand those macros to dlopen (with hardcoded SO version) calls at build time, so we would need both set of headers on the build machine.

That's why I think best way is to have those headers in our source tree under src/native/external/lttng (where we were thinking to eventually move rest of the similar vendor code in the repo).

@tmds
Copy link
Member

tmds commented Dec 4, 2021

This issue is about adding support for both versions.
I've created a separate issue for the crash we see on Fedora 36: #62398.

@am11
Copy link
Member

am11 commented Dec 4, 2021

This issue is about adding support for both versions.

Reason for adding this support is because .NET apps started to crash on systems with newer lttng library.
If #62398 is about a temporary fix like if NOT dlopen(liblttng-ust.so.0) then do what DOTNET_LTTng=0 does to disable eventing, then it makes sense we can probably backport that to 6.0 until the multi-version support arrives.

@tmds
Copy link
Member

tmds commented Dec 6, 2021

If #62398 is about

That is about .NET crashing when built against liblttng-ust.so.1 on Fedora 36.

.NET apps started to crash

From what I see, .NET built against liblttng-ust.so.0 doesn't crash on systems that have liblttng-ust.so.1.

$ dnf list installed lttng-ust
Installed Packages
lttng-ust.x86_64                                        2.13.0-1.fc36                                         @rawhide
$ ls /usr/lib64/liblttng-ust.so.1
/usr/lib64/liblttng-ust.so.1
$ which dotnet
~/dotnet-60/dotnet
$ ldd ~/dotnet-60/shared/Microsoft.NETCore.App/6.0.0/libcoreclrtraceptprovider.so
	linux-vdso.so.1 (0x00007ffe6f571000)
	liblttng-ust.so.0 => not found
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f6e3571d000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f6e3563d000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f6e35622000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f6e35419000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f6e359fb000)
$ dotnet new console -o /tmp/console
The template "Console App" was created successfully.

Processing post-creation actions...
Running 'dotnet restore' on /tmp/console/console.csproj...
  Determining projects to restore...
  Restored /tmp/console/console.csproj (in 129 ms).
Restore succeeded.
$ dotnet run --project /tmp/console
Hello, World!

@am11
Copy link
Member

am11 commented Dec 6, 2021

@tmds, point is liblttng-ust.so.1 is not supported, needs work, we would need to support both versions..

if NOT dlopen(liblttng-ust.so.0) then do what DOTNET_LTTng=0 does to disable eventing

this should fix the crash you are seeing. Besides, the crash you are seeing is same as #60407 which was closed as a duplicate.

@tmds
Copy link
Member

tmds commented Dec 6, 2021

For source-built .NET we care more about #62398 than we do about this issue.
And the fix for #62398 should be backported to 6.0, while this feature might not.

@am11
Copy link
Member

am11 commented Dec 6, 2021

Source build has patching mechanism in place, you can disable lttng feature at build time as discussed above.

ajlennon added a commit to DynamicDevices/meta-mono that referenced this issue Mar 31, 2022
See: dotnet/runtime#57784

Signed-off-by: Alex J Lennon <ajlennon@dynamicdevices.co.uk>
@tommcdon tommcdon modified the milestones: 7.0.0, Future May 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants