.NET built against LTTng 2.13 crashes while initializing LTTng #62398

Closed
tmds opened this issue Dec 4, 2021 · 16 comments
Labels
area-Tracing-coreclr tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly

@tmds
Member

tmds commented Dec 4, 2021

From #57784 (comment).

.NET built on Fedora Rawhide (36) segfaults on application start while initializing LTTng.

  * frame #0: 0x00007f8885a47b82 liblttng-ust.so.1`check_event_provider + 162
    frame #1: 0x00007f8885a4d4d1 liblttng-ust.so.1`lttng_ust_probe_register + 33
    frame #2: 0x00007f8885b007b5 libcoreclrtraceptprovider.so`lttng_ust__events_init__DotNETRuntime() at ust-tracepoint-event.h:1198:14
    frame #3: 0x00007f888683fa2e ld-linux-x86-64.so.2`call_init(l=<unavailable>, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:70:3
    frame #4: 0x00007f888683fb1c ld-linux-x86-64.so.2`_dl_init(main_map=0x0000556bd608a290, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:117:5
    frame #5: 0x00007f88864534c5 libc.so.6`_dl_catch_exception + 229
    frame #6: 0x00007f88868437de ld-linux-x86-64.so.2`dl_open_worker at dl-open.c:821:5
    frame #7: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #8: 0x00007f8886843b5c ld-linux-x86-64.so.2`_dl_open at dl-open.c:896:17
    frame #9: 0x00007f888638294c libc.so.6`dlopen_doit + 92
    frame #10: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #11: 0x00007f8886453533 libc.so.6`_dl_catch_error + 51
    frame #12: 0x00007f888638244e libc.so.6`_dlerror_run + 142
    frame #13: 0x00007f88863829d8 libc.so.6`dlopen@GLIBC_2.2.5 + 72
    frame #14: 0x00007f8885fd6893 libcoreclr.so`PAL_InitializeTracing() at tracepointprovider.cpp:116:9
    frame #15: 0x00007f888683fa2e ld-linux-x86-64.so.2`call_init(l=<unavailable>, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:70:3
    frame #16: 0x00007f888683fb1c ld-linux-x86-64.so.2`_dl_init(main_map=0x0000556bd6060050, argc=10, argv=0x00007ffcd00cfd88, env=0x00007ffcd00cfde0) at dl-init.c:117:5
    frame #17: 0x00007f88864534c5 libc.so.6`_dl_catch_exception + 229
    frame #18: 0x00007f88868437de ld-linux-x86-64.so.2`dl_open_worker at dl-open.c:821:5
    frame #19: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #20: 0x00007f8886843b5c ld-linux-x86-64.so.2`_dl_open at dl-open.c:896:17
    frame #21: 0x00007f888638294c libc.so.6`dlopen_doit + 92
    frame #22: 0x00007f8886453468 libc.so.6`_dl_catch_exception + 136
    frame #23: 0x00007f8886453533 libc.so.6`_dl_catch_error + 51
    frame #24: 0x00007f888638244e libc.so.6`_dlerror_run + 142
    frame #25: 0x00007f88863829d8 libc.so.6`dlopen@GLIBC_2.2.5 + 72
    frame #26: 0x00007f8886274ead libhostpolicy.so`pal::load_library(path="/home/tmds/rpmbuild/BUILD/dotnet-9e8b04bbff820c93c142f99a507a46b976f5c14c-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/package-cache/microsoft.netcore.app.crossgen2.linux-x64/6.0.0/tools/libcoreclr.so", dll=0x00007f888629e0a0) at pal.unix.cpp:230:12
...

The crash happens at this line: https://github.com/lttng/lttng-ust/blob/4c155a06d838e1ab5d385abd1d73ae56e71b7d5e/src/lib/lttng-ust/lttng-probes.c#L153.

The field is null.

(gdb) p *tp_class
$3 = {struct_size = 48, fields = 0x7ffff73ab2e0 <lttng_ust__event_fields___DotNETRuntime___GCStart>, nr_fields = 2, 
  probe_callback = 0x7ffff7364820 <lttng_ust__event_probe__DotNETRuntime___GCStart(void*, unsigned int, unsigned int)>, 
  signature = 0x7ffff738e720 <__tp_event_signature___DotNETRuntime___GCStart> "const unsigned int, Count, const unsigned int, Reason", probe_desc = 0x7ffff73a1470 <lttng_ust__probe_desc___DotNETRuntime>}
(gdb) p tp_class->fields[0]
$4 = (const struct lttng_ust_event_field * const) 0x0
(gdb) p tp_class->fields[1]
$5 = (const struct lttng_ust_event_field * const) 0x0

These fields get initialized dynamically.

static const struct lttng_ust_event_field * const lttng_ust__event_fields___DotNETRuntime___GCStart[] = {
    new (const struct lttng_ust_event_field) {
        .struct_size = sizeof(struct lttng_ust_event_field),
        .name = "Count",
        .type = ((struct lttng_ust_type_common *) new (struct lttng_ust_type_integer) {
            .parent = { .type = lttng_ust_type_integer, },
            .struct_size = sizeof(struct lttng_ust_type_integer),
            .size = sizeof(unsigned int) * 8,
            .alignment = 1 * 8,
            .signedness = (std::is_signed<unsigned int>::value),
            .reverse_byte_order = 1234 != 1234,
            .base = 10,
        }),
        .nowrite = 0,
        .nofilter = 0,
    },
    new (const struct lttng_ust_event_field) {
        .struct_size = sizeof(struct lttng_ust_event_field),
        .name = "Reason",
        .type = ((struct lttng_ust_type_common *) new (struct lttng_ust_type_integer) {
            .parent = { .type = lttng_ust_type_integer, },
            .struct_size = sizeof(struct lttng_ust_type_integer),
            .size = sizeof(unsigned int) * 8,
            .alignment = 1 * 8,
            .signedness = (std::is_signed<unsigned int>::value),
            .reverse_byte_order = 1234 != 1234,
            .base = 10,
        }),
        .nowrite = 0,
        .nofilter = 0,
    },
    new (const struct lttng_ust_event_field) {
        .struct_size = sizeof(struct lttng_ust_event_field),
        .name = "dummy",
        .type = ((struct lttng_ust_type_common *) new (struct lttng_ust_type_integer) {
            .parent = { .type = lttng_ust_type_integer, },
            .struct_size = sizeof(struct lttng_ust_type_integer),
            .size = sizeof(int) * 8,
            .alignment = 1 * 8,
            .signedness = (std::is_signed<int>::value),
            .reverse_byte_order = 1234 != 1234,
            .base = 10,
        }),
        .nowrite = 0,
        .nofilter = 0,
    },
};

static const struct lttng_ust_tracepoint_class lttng_ust__event_class___DotNETRuntime___GCStart = {
    .struct_size = sizeof(struct lttng_ust_tracepoint_class),
    .fields = lttng_ust__event_fields___DotNETRuntime___GCStart,
    .nr_fields = (sizeof(lttng_ust__event_fields___DotNETRuntime___GCStart) / sizeof((lttng_ust__event_fields___DotNETRuntime___GCStart)[0])) - 1,
    .probe_callback = (void (*)(void)) &lttng_ust__event_probe__DotNETRuntime___GCStart,
    .signature = __tp_event_signature___DotNETRuntime___GCStart,
    .probe_desc = &lttng_ust__probe_desc___DotNETRuntime,
};

It seems they have not been initialized (yet):

(gdb) p lttng_ust__event_fields___DotNETRuntime___GCStart
$1 = {0x0, 0x0, 0x0}
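
The numbers in the gdb output line up with the expanded macro: the array holds three pointers (Count, Reason, and a trailing dummy entry), while nr_fields is the element count minus one, which gives 2. A minimal sketch of that arithmetic follows. The `field` type, the object names, and `field_name` are hypothetical stand-ins, not LTTng-UST definitions, and the sketch takes addresses of statically allocated objects, so unlike the heap-allocating expansion above it is fully initialized at load time.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for struct lttng_ust_event_field.
struct field {
    const char *name;
};

static const field count_field{"Count"};
static const field reason_field{"Reason"};
static const field dummy_field{"dummy"};  // trailing placeholder entry

// Three pointers, matching the three-element array printed by gdb.
static const field *const fields[] = {&count_field, &reason_field, &dummy_field};

// Same arithmetic as the expanded macro: element count minus the placeholder.
constexpr std::size_t nr_fields = sizeof(fields) / sizeof(fields[0]) - 1;
static_assert(nr_fields == 2, "only Count and Reason are real fields");

// Returns the name of a real field; the placeholder is out of range.
const char *field_name(std::size_t i)
{
    assert(i < nr_fields);
    return fields[i]->name;
}
```

With the heap-allocating form, the same `fields[i]->name` access made before the dynamic initializers have run would dereference a null pointer, which is consistent with the crash reported here.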

cc @omajid @janvorli @hoyosjs @am11 @brianrob @dotnet/dotnet-diag

@dotnet-issue-labeler dotnet-issue-labeler bot added area-Tracing-coreclr untriaged New issue has not been triaged by the area owner labels Dec 4, 2021
@tommcdon tommcdon added this to the .NET 7.0 milestone Dec 6, 2021
@tommcdon tommcdon added bug and removed untriaged New issue has not been triaged by the area owner labels Dec 6, 2021
@karelz karelz modified the milestones: .NET 7.0, 7.0.0 Dec 6, 2021
@tmds
Member Author

tmds commented Dec 6, 2021

afaik we should be able to recompile against the new version of LTTng and that should 'just work'.

I've created a ticket in the LTTng bugtracker: https://bugs.lttng.org/issues/1339.

@janvorli
Member

janvorli commented Dec 6, 2021

I think that initializing a static variable using dynamic new is a tricky business. The order of static initialization in a program is not defined, so if the new operator uses a static variable that also needs to be initialized before it can operate, that could be the reason for this crash.

@compudj

compudj commented Dec 6, 2021

As indicated in the LTTng bug tracker, please try https://review.lttng.org/c/lttng-ust/+/6870 and let us know if it improves the situation after rebuilding the tracepoint probe provider.

@tmds
Member Author

tmds commented Dec 7, 2021

I've rebuilt lttng-ust with that patch, and it works:

$ ./artifacts/bin/testhost/net7.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/7.0.0/corerun /tmp/console/bin/Debug/net6.0/console.dll
Hello, World!

Thank you, @compudj!

compudj pushed a commit to lttng/lttng-ust that referenced this issue Dec 9, 2021
Observed issue
==============

Applications which transitively dlopen() a library which, in turn,
dlopen() providers crash when they are compiled with clang or
if LTTNG_UST_ALLOCATE_COMPOUND_LITERAL_ON_HEAP is defined.

  Core was generated by `././myapp.exe'.
  Program terminated with signal SIGSEGV, Segmentation fault.
  #0  0x00007fa94f860bc2 in check_event_provider (probe_desc=<optimized out>) at lttng-probes.c:153
  153				if (!check_type_provider(field->type)) {
  [Current thread is 1 (Thread 0x7fa94fcbc740 (LWP 511754))]

  (gdb) bt
  #0  0x00007fa94f860bc2 in check_event_provider (probe_desc=<optimized out>) at lttng-probes.c:153
  #1  lttng_ust_probe_register (desc=0x7fa94fe9dc80 <lttng_ust__probe_desc___embedded_sys>)
      at lttng-probes.c:242
  #2  0x00007fa94fe9ba3c in lttng_ust__tracepoints__ptrs_destroy ()
      at /usr/include/lttng/tracepoint.h:590
  #3  0x00007fa94fedfe2e in call_init () from /lib64/ld-linux-x86-64.so.2
  #4  0x00007fa94fedff1c in _dl_init () from /lib64/ld-linux-x86-64.so.2
  #5  0x00007fa94fdf7d45 in _dl_catch_exception () from /usr/lib/libc.so.6
  #6  0x00007fa94fee420a in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
  #7  0x00007fa94fdf7ce8 in _dl_catch_exception () from /usr/lib/libc.so.6
  #8  0x00007fa94fee39bb in _dl_open () from /lib64/ld-linux-x86-64.so.2
  #9  0x00007fa94fe8d36c in ?? () from /usr/lib/libdl.so.2
  #10 0x00007fa94fdf7ce8 in _dl_catch_exception () from /usr/lib/libc.so.6
  #11 0x00007fa94fdf7db3 in _dl_catch_error () from /usr/lib/libc.so.6
  #12 0x00007fa94fe8db99 in ?? () from /usr/lib/libdl.so.2
  #13 0x00007fa94fe8d3f8 in dlopen () from /usr/lib/libdl.so.2
  #14 0x00007fa94fecc647 in mon_constructeur () at mylib.cpp:20
  #15 0x00007fa94fedfe2e in call_init () from /lib64/ld-linux-x86-64.so.2
  #16 0x00007fa94fedff1c in _dl_init () from /lib64/ld-linux-x86-64.so.2
  #17 0x00007fa94fdf7d45 in _dl_catch_exception () from /usr/lib/libc.so.6
  #18 0x00007fa94fee420a in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
  #19 0x00007fa94fdf7ce8 in _dl_catch_exception () from /usr/lib/libc.so.6
  #20 0x00007fa94fee39bb in _dl_open () from /lib64/ld-linux-x86-64.so.2
  #21 0x00007fa94fe8d36c in ?? () from /usr/lib/libdl.so.2
  #22 0x00007fa94fdf7ce8 in _dl_catch_exception () from /usr/lib/libc.so.6
  #23 0x00007fa94fdf7db3 in _dl_catch_error () from /usr/lib/libc.so.6
  #24 0x00007fa94fe8db99 in ?? () from /usr/lib/libdl.so.2
  #25 0x00007fa94fe8d3f8 in dlopen () from /usr/lib/libdl.so.2
  #26 0x00005594f478c18c in main ()

Cause
=====

Building tracepoint instrumentation as C++ using clang causes
LTTNG_UST_ALLOCATE_COMPOUND_LITERAL_ON_HEAP to be defined due to a
compiler version detection problem addressed by another patch.

However, building with LTTNG_UST_ALLOCATE_COMPOUND_LITERAL_ON_HEAP
defined still results in the crash.

When LTTNG_UST_ALLOCATE_COMPOUND_LITERAL_ON_HEAP is defined, the
lttng_ust_event_field lttng_ust__event_fields__[...] structure is
initialized by dynamically-allocating field structures for the various
fields.

As the initialization can't be performed statically, it is performed at
run-time _after_ the execution of the library constructors has
completed.

Moreover, the generated initialization function of the provider
(lttng_ust__events_init__[...]) is declared as being a library
constructor. Hence, it runs before the tracepoint field structures
have a chance to be initialized.

This all results in a NULL pointer dereference during the validation of
the fields.
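
The sequence described in this section can be sketched in a few lines of C++. This is a minimal illustration, not LTTng-UST code: `field`, `fields`, `events_init`, and `table_ready_now` are hypothetical stand-ins, and `__attribute__((constructor))` is the GCC/Clang extension for library constructors. Whether the constructor function observes a populated table depends on how the toolchain orders it relative to C++ dynamic initializers, so the sketch records the result instead of assuming it.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for struct lttng_ust_event_field.
struct field {
    const char *name;
};

// Entries allocated with `new` cannot be constant-initialized: the array is
// zero-filled at load time and only populated when the dynamic initializers
// of this translation unit run.
static const field *const fields[] = {
    new field{"Count"},
    new field{"Reason"},
};

// Zero-initialized at load time and never dynamically initialized, so a
// value written by the constructor function below is not overwritten later.
static bool table_ready_in_ctor;

// Hypothetical stand-in for lttng_ust__events_init__[...].
__attribute__((constructor))
static void events_init()
{
    // A real provider would validate and dereference the entries here;
    // this sketch only records whether they were populated yet.
    table_ready_in_ctor = (fields[0] != nullptr && fields[1] != nullptr);
}

// True once the dynamic initializers of this translation unit have run.
bool table_ready_now()
{
    return fields[0] != nullptr && fields[1] != nullptr;
}

// Prints what the constructor function saw versus the state at call time.
void report()
{
    assert(table_ready_now());
    std::printf("populated in constructor: %s, populated now: %s\n",
                table_ready_in_ctor ? "yes" : "no",
                table_ready_now() ? "yes" : "no");
}
```

Calling `report()` from `main` shows the table populated by then. A "no" in the first column corresponds to the window in which the field validation dereferences a null pointer.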

Solution
========

When building providers as C++, the initialization function is defined
as the constructor of a class. This class is, in turn, instantiated in
an anonymous namespace.

For the purposes of this patch, the use of an anonymous namespace is
equivalent to declaring the instance as 'static', but it is preferred in
C++11.
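
A minimal sketch of the approach described in this section, under stated assumptions: `field`, `fields`, `probe_registrar`, and `registrar_saw_populated_table` are hypothetical names, not the identifiers generated by LTTng-UST. It relies on the C++ rule that non-inline, non-template variables with static storage duration defined in one translation unit are dynamically initialized in the order of their definitions, so an object defined after the table sees it populated.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for struct lttng_ust_event_field.
struct field {
    const char *name;
};

// Dynamically initialized table, as in the heap-allocating expansion.
static const field *const fields[] = {
    new field{"Count"},
    new field{"Reason"},
};

// Zero-initialized at load time; written once by the registrar constructor.
static bool registrar_saw_fields;

namespace {

// Hypothetical registrar: the provider's initialization function becomes
// the constructor of a class instead of a library constructor function.
class probe_registrar {
public:
    probe_registrar()
    {
        // A real provider would register its probes here.
        registrar_saw_fields = (fields[0] != nullptr && fields[1] != nullptr);
    }
};

// Defined after `fields` in the same translation unit, so its constructor
// runs after the table's dynamic initializer.
probe_registrar registrar_instance;

}  // namespace

// Reports whether the registrar observed a fully populated table.
bool registrar_saw_populated_table()
{
    assert(registrar_saw_fields);  // follows from in-order initialization
    return registrar_saw_fields;
}
```

The anonymous namespace gives `registrar_instance` internal linkage, so several providers can each define their own instance without name clashes.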

Known drawbacks
===============

None.

References
==========

A reproducer is available:
https://github.com/jgalar/ust-clang-reproducer

Problem initially reported on dotnet/runtime's issue tracker:
dotnet/runtime#62398

Relevant LTTng-UST issue:
https://bugs.lttng.org/issues/1339

Fixes: #1339
Change-Id: I51cfbe74729bd45e2613a30bc8de17e08ea8233d
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
@tmds
Member Author

tmds commented Dec 10, 2021

The fix was merged: https://bugs.lttng.org/projects/lttng-ust/repository/lttng-ust/revisions/05bfa3dc3a6e6b2ece3686a5f384b6645c2a5010.

As @janvorli suspected, the issue was in the initialization order.

@tmds tmds added tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly and removed bug labels Dec 10, 2021
@tmds tmds changed the title .NET built on Fedora 36 crashes while initializing LTTng .NET built against LTTng 2.13 crashes while initializing LTTng Dec 10, 2021
@josalem
Contributor

josalem commented Jan 5, 2022

@tmds is there any work needed in the runtime to consume this, or can we close this issue? It sounds like the change that fixed the issue was already merged into LTTng.

@compudj

compudj commented Jan 5, 2022

Indeed, the fixes are present in the lttng-ust 2.13.1 release. It is fixed in two ways:

First, we made sure that clang does not allocate compound literals on the heap in C++ in a typical build. This is fixed by commit a11ff47e2a6 ("fix: allocating C++ compound literal on heap with Clang"). However, if someone builds with LTTNG_UST_ALLOCATE_COMPOUND_LITERAL_ON_HEAP defined, the constructor ordering is still an issue. This is why we have also fixed the underlying constructor order issue. This is fixed by commit 90fe47efbc1 ("Fix: generate probe registration constructor as a C++ constuctor").

@josalem
Contributor

josalem commented Jan 5, 2022

Thanks for the details, @compudj! I'll go ahead and close this issue. If there does end up being work needed in the runtime, we can reopen it to track that.

@josalem josalem closed this as completed Jan 5, 2022
@compudj

compudj commented Jan 5, 2022

I confirm that the issue sat squarely within lttng-ust, so I don't expect anything to be needed in the .NET runtime to fix this, except rebuilding the .NET runtime probe providers against a fixed lttng-ust.

@josalem
Contributor

josalem commented Jan 5, 2022

In that case, I'll reopen this issue to track making sure we are building against the correct version of LTTng in our infrastructure. Thanks!

@Livius90

Can you summarize the final solution? Do we need to use the new lttng-ust 2.13.1 (or any newer version) for this issue to be solved?

@compudj

compudj commented Jan 17, 2022

> Can you summarize the final solution? Do we need to use the new lttng-ust 2.13.1 (or any newer version) for this issue to be solved?

You need to upgrade to the new lttng-ust 2.13.1 (or any newer version) and rebuild the .NET runtime probe providers against that upgraded lttng-ust to correct the problem.

@Livius90

Livius90 commented Jan 19, 2022

Which .NET 6 SDK version will contain this rebuild fix? I would like to use a pre-built SDK, for example by downloading .NET SDK 6.0.101 for Linux Arm.

@tmds
Member Author

tmds commented Jan 19, 2022

@Livius90 the issue occurs when .NET was compiled on a system with lttng-ust 2.13.0.
I don't think there are pre-built SDKs that are built on such a system.
Are you running into an issue?

@Livius90

When building a Linux image with the Yocto Project, a pre-built .NET SDK is installed. There is an issue here about .NET 6.0.100 vs. lttng-ust 2.13.0. So, is it possible that the pre-built .NET 6.0.100 already needs a new lttng-ust?

@omajid
Member

omajid commented Jan 20, 2022

I left a comment at intel-iot-devkit/meta-iot-cloud#106, explaining what I think the correct short term fix is.

@dotnet dotnet locked as resolved and limited conversation to collaborators Feb 20, 2022