Minidump files get placed in the wrong directory and aren't uploaded to Sentry #833

mntt-SSE opened this issue Apr 17, 2023 · 19 comments

@mntt-SSE

mntt-SSE commented Apr 17, 2023

Description

I'm using the Native SDK with the bundled crashpad_handler, and I noticed that crashes for some users wouldn't get uploaded to Sentry. After some digging, I realized that for the users whose crashes didn't get uploaded, the minidump (dmp) files got placed in a different directory from the one configured with sentry_options_set_database_path. The directory they end up in is [user]/AppData/Local/CrashDumps (i.e. %LOCALAPPDATA%\CrashDumps). It only happens for some users, but for those it is consistent.

Does anyone know what could be the cause of this? Or what I could do to debug it further? I've looked in the documentation and source code, but I haven't been able to figure out how I get the crashpad_handler to log stuff.

Thanks in advance!

When does the problem happen

  • During build
  • During run-time
  • When capturing a hard crash

Environment

  • OS: Microsoft Windows 10 Business, 64 bit, Version 10.0.19045 Build 19045
  • Compiler: Visual Studio 2022, 17.3.6
  • CMake version and config: 3.25.2, -DBUILD_SHARED_LIBS=OFF -DSENTRY_BUILD_RUNTIMESTATIC=ON -DSENTRY_BACKEND=crashpad

Steps To Reproduce

It's only reproducible on some machines, but this is how I set up the crash handler in my app:

	sentry_options_t* options = sentry_options_new();
	sentry_options_set_dsn(options, "<a dsn value>");
	sentry_options_set_debug(options, true);
	sentry_options_set_environment(options, "Development");
	sentry_options_set_release(options, "0");
	sentry_options_set_handler_path(options, "<working directory>");
	sentry_options_set_database_path(options, "<working directory>/sentry-db");
	sentry_init(options);
	sentry_set_transaction("<app name>");
	sentry_set_level(SENTRY_LEVEL_WARNING);

Log output

I've compared the Sentry Native output between a computer where everything is working fine and one where it is not, and they are identical. I don't know how to get output from the crashpad_handler.

@supervacuus
Collaborator

Hi @mntt-SSE! I could not look into this on any of my Windows machines yet, but it seems you can access both types of machines to verify behavior and configuration.

Can you look in the registry to see whether the machines differ in the settings for HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps? I could imagine some unexpected interaction with the crashpad WER module.
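
A minimal sketch of doing that check programmatically (illustrative only; it assumes the standard Win32 registry API and merely reports whether the key exists):

#include <windows.h>
#include <cstdio>

int main()
{
    HKEY key = nullptr;
    const char *subKey =
        "SOFTWARE\\Microsoft\\Windows\\Windows Error Reporting\\LocalDumps";
    // KEY_WOW64_64KEY: read the 64-bit registry view even from a 32-bit process.
    LONG rc = RegOpenKeyExA(HKEY_LOCAL_MACHINE, subKey, 0,
                            KEY_READ | KEY_WOW64_64KEY, &key);
    if (rc == ERROR_SUCCESS) {
        printf("LocalDumps exists: WER user-mode dump collection is configured.\n");
        RegCloseKey(key);
    } else if (rc == ERROR_FILE_NOT_FOUND) {
        printf("LocalDumps does not exist.\n");
    } else {
        printf("RegOpenKeyExA failed with error %ld\n", rc);
    }
    return 0;
}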

@mntt-SSE
Author

mntt-SSE commented Apr 18, 2023

Thanks for your answer! On a machine where the problem exists, the LocalDumps entry looks like this:
[screenshot of the LocalDumps registry key and its per-application sub-keys]

And each sub-key has a DumpFolder value with the data %PROGRAMDATA%\NVIDIA Corporation\CrashDumps. They are all the same, none of them seem to point to the application I'm using, and that DumpFolder is also different from the directory where I get the crash dumps (that folder also exists, but it is empty).

On a machine where Sentry works as expected, no LocalDumps entry exists at all under Windows Error Reporting. I verified with 2 machines that don't work and 1 that does work, and they fit the pattern.

@supervacuus
Collaborator

Thanks for verifying, @mntt-SSE; that is what I expected. LocalDumps bypasses application-local handling in favor of a centralized dump approach. It should be possible to deactivate it for your application by creating a new key (similar to the ones for the other apps below the LocalDumps key), but I haven't tested it yet. Here are the docs for the key: https://learn.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps

@supervacuus
Collaborator

And each sub-key has a DumpFolder value with the data %PROGRAMDATA%\NVIDIA Corporation\CrashDumps. They are all the same, none of them seem to point to the application I'm using, and that DumpFolder is also different from the directory where I get the crash dumps (that folder also exists, but it is empty).

Maybe I should also say that what is interesting are the values directly in LocalDumps, not the sub-keys (which probably just provide a per-application counter-config), because the top-level values affect all applications that are not listed below it.

@supervacuus
Collaborator

For further root-causing, it would certainly help if you could also gather the debug output of the Native SDK on the systems where the dump happens in %LOCALAPPDATA%\CrashDumps; maybe something obvious is already visible there.
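
As an illustration, that debug output can be captured to a file roughly like this (a sketch assuming the SDK's sentry_options_set_logger API; the log file path is just an example):

#include <sentry.h>
#include <cstdarg>
#include <cstdio>

// Illustrative log sink: append every SDK log line to a file so it can be
// collected from the machines where the dump ends up in %LOCALAPPDATA%\CrashDumps.
static void log_to_file(sentry_level_t level, const char *message, va_list args, void *userdata)
{
    FILE *f = static_cast<FILE *>(userdata);
    fprintf(f, "[sentry][%d] ", static_cast<int>(level));
    vfprintf(f, message, args);
    fprintf(f, "\n");
    fflush(f);
}

void init_sentry_with_file_logging()
{
    FILE *log_file = fopen("sentry-native.log", "a"); // example path only
    sentry_options_t *options = sentry_options_new();
    sentry_options_set_debug(options, 1);
    sentry_options_set_logger(options, log_to_file, log_file);
    // ... DSN, handler path, database path as in the original report ...
    sentry_init(options);
}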

@mntt-SSE
Author

mntt-SSE commented Apr 19, 2023

The LocalDumps key itself does not contain any values other than (Default), which has no data.

I have so far tried to:

  1. Create a key under LocalDumps with the name of the application, and set its DumpType value to 0.
  2. Create a key under Windows Error Reporting called ExcludedApplications, and add a DWORD value named after the application's executable, with 1 as its data.
  3. Under the ExcludedApplications key, create another key with the name of the application.
  4. Call the Windows API function WerAddExcludedApplication (a sketch of that call follows below). This did the same as 2.
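
For reference, step 4 might look roughly like the following sketch (illustrative only; it did not resolve the issue, and the executable name is a placeholder):

#include <windows.h>
#include <werapi.h> // WerAddExcludedApplication
#include <cstdio>
#pragma comment(lib, "wer.lib") // MSVC: pull in the WER import library

// Exclude the application from Windows Error Reporting.
// bAllUsers = FALSE registers the exclusion for the current user only.
void exclude_app_from_wer()
{
    HRESULT hr = WerAddExcludedApplication(L"YourApp.exe", FALSE);
    if (FAILED(hr)) {
        printf("WerAddExcludedApplication failed: 0x%08lx\n", static_cast<unsigned long>(hr));
    }
}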

I tried restarting the computer after each step, but nothing worked :/ I also checked the corresponding keys under HKEY_CURRENT_USER, but there is nothing there.

I tried attaching WinDbg to crashpad_handler.exe but I don't seem to get any logging from it in the Logs widget. Do I have to build it with a certain configuration or flag? I should also mention that I haven't used WinDbg before so I might be missing something.

@supervacuus
Collaborator

I tried attaching WinDbg to crashpad_handler.exe but I don't seem to get any logging from it in the Logs widget. Do I have to build it with a certain configuration or flag? I should also mention that I haven't used WinDbg before so I might be missing something.

What I meant was that even though you have compared the two logs, which were identical on both types of machines, it would be great to see them.

I am asking because the only other reason I can imagine this happening is if our backend didn't start correctly (and, for some reason, sentry_init() was still successful). Can you check whether sentry_init() returns successfully (i.e., returns 0) on those machines? From what you described, I don't think the crashpad_handler is involved in the crash process. It looks like your crash escapes our local exception filter (or rather: ours wasn't installed, and instead the OS handles the crash).
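
An illustrative check could be as simple as this sketch:

#include <sentry.h>
#include <cstdio>

// sentry_init() returns 0 on success; anything else means the
// transport/backend did not start and crashes will not be handled by the SDK.
int init_sentry_checked(sentry_options_t *options)
{
    int rv = sentry_init(options);
    printf("sentry_init() returned %d\n", rv);
    return rv;
}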

The LocalDumps key itself does not contain any values other than (Default), which has no data.

That means that it is configured to use the default values.

@mntt-SSE
Author

What I meant was that even though you have compared the two logs, which were identical on both types of machines, it would be great to see them.

Ah, sorry, here is what's being printed on startup from a computer that doesn't work (which is the same as when it works):

[2023-04-24 16:30:29] Info: Sentry: "received response:
HTTP/1.1 200 OK 
Connection: keep-alive 
Date: Mon, 24 Apr 2023 14:30:28 GMT 
Content-Length: 2 
Content-Type: application/json 
Server: nginx 
Vary: origin,access-control-request-method,access-control-request-headers 
access-control-allow-origin: * 
access-control-expose-headers: x-sentry-error,x-sentry-rate-limits,retry-after 
x-envoy-upstream-service-time: 0 
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload 
" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "request handled in 520m" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "executing task on worker threa" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "sending request using winhttp to <address>:
x-sentry-auth:<auth> sentry_key=<key>, sentry_version=7, sentry_client=sentry.native/0.5.3 
content-type:application/x-sentry-envelope 
content-length:17594 " (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "received response:
HTTP/1.1 200 OK 
Connection: keep-alive 
Date: Mon, 24 Apr 2023 14:30:28 GMT 
Content-Length: 41 
Content-Type: application/json 
Server: nginx 
Vary: origin,access-control-request-method,access-control-request-headers 
access-control-allow-origin: * 
access-control-expose-headers: x-sentry-error,x-sentry-rate-limits,retry-after 
x-envoy-upstream-service-time: 0 
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload 
" (level=-1)
[2023-04-24 16:30:29] Info: Sentry: "request handled in 243m" (level=-1)

I am asking because the only other reason I can imagine this happening is if our backend didn't start correctly (and, for some reason, sentry_init() was still successful). Can you check whether sentry_init() returns successfully (i.e., returns 0) on those machines? From what you described, I don't think the crashpad_handler is involved in the crash process. It looks like your crash escapes our local exception filter (or rather: ours wasn't installed, and instead the OS handles the crash).

sentry_init() returns 0 on all machines I've tested it on. And I can also see that the crashpad_handler.exe process is starting when the application is started, and quits when the application crashes (or when it exits nicely).

@supervacuus
Collaborator

Ah, sorry, here is what's being printed on startup from a computer that doesn't work (which is the same as when it works):

Okay, the logs are wrapped, interleaved, and incomplete. They also show some other transmission (maybe a manual event) but not the crash (because with crashpad, you would not see a transmission from within our worker_thread).

Maybe it is easier to debug this issue with the plain sentry_example.exe. In any case, I would expect a program configured as in your initial description to print all of the following, at some point during its execution, before it crashes:

[sentry] INFO using database path "your_database_path"
[sentry] DEBUG starting transport
[sentry] DEBUG starting background worker thread
[sentry] DEBUG starting backend
[sentry] DEBUG background worker thread started
[sentry] DEBUG starting crashpad backend with handler "your_handler_path"
[sentry] DEBUG using minidump url "your_dsn_url"
[sentry] DEBUG registering crashpad WER handler "your_wer_module_dll"
[sentry] INFO started crashpad client handler
[sentry] DEBUG processing and pruning old runs
[sentry] INFO flushing session and queue before crashpad handler
[sentry] INFO handing control over to crashpad

Since sentry_init() succeeds (at least according to the return value), I am especially interested in whether you see the last two lines. If those never appear, then the client-side crash handler is never invoked, which means that the crashpad_handler will not be triggered (even if it is running).

You can also verify this externally by deleting your database path, executing your program, provoking a crash, and looking into (the newly created) path; there must be a last_crash marker file in there.
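
Concretely, that check can be as simple as this sketch (assuming C++17 and that database_path is the directory passed to sentry_options_set_database_path):

#include <filesystem>
#include <cstdio>

// Check for the last_crash marker that the SDK writes into the database
// directory when its crash handler was actually invoked.
bool crash_handler_was_invoked(const std::filesystem::path &database_path)
{
    const auto marker = database_path / "last_crash";
    const bool present = std::filesystem::exists(marker);
    printf("last_crash marker %s\n", present ? "found" : "missing");
    return present;
}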

@Cyriuz
Contributor

Cyriuz commented May 13, 2023

I have been plagued by the same issue: some machines just refuse to send any reports. After reading this discussion and talking to @supervacuus on Discord, I spent a day trying a bunch of things:

To start by answering the last question about the last two lines of the Sentry output: those lines do appear on the working machines and do not appear on the machines that don't work. This indicates that the UnhandledExceptionFilter is not run, so I started trying to figure out why.

The first thing that proved it was not a Windows issue (as a lot of blogs and discussions around the web suggest it could be) was that calling sentry_reinstall_backend() just before triggering the crash made the reports go through.
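
Roughly, that diagnostic looks like this (an illustrative sketch, assuming sentry_reinstall_backend() is available in the SDK version in use):

#include <sentry.h>

// If a forced crash is only reported when the backend is re-installed right
// before it, something overwrote the SDK's unhandled-exception filter after
// sentry_init().
void trigger_test_crash()
{
    sentry_reinstall_backend(); // re-registers the crash handler
    volatile int *p = nullptr;
    *p = 42;                    // deliberate access violation
}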

So something in the startup process overwrites our exception handler. I'm pretty sure it is not one of our dependencies, since in that case it should behave the same on all machines, and I couldn't find any reference to SetUnhandledExceptionFilter in them. I'm not sure what else it could be, some driver perhaps? I tried inspecting the loaded DLLs but didn't see anything significant, though I assume it could be something that gets loaded and then released again during startup.

I tried the sentry_example.exe, which seems to work fine on all machines, so I would assume it has to do with our program being a GUI application.

I finally tried replacing SetUnhandledExceptionFilter with a simple return by overwriting the function body, and it works perfectly, but it feels very sad. Searching a bit on GitHub, though, this seems to be the way people do it. Maybe it is something sentry-native should do to increase its reliability? I assume that if you use sentry-native you want it to have full control of these things. But at the same time it is a bit harsh...

@mntt-SSE
Author

Thank you @Cyriuz and @supervacuus for looking into this!

I have also done some more investigation. Firstly, the reason I didn't get the full Sentry logs was that I initialized Sentry before the rest of the log handling. When I switched the order, I got output identical to yours, @supervacuus, except for the registering crashpad WER handler "your_wer_module_dll" line; instead, it says that no WER handler is registered.

Secondly, I have found a workaround. I'm building a Qt application (Qt 5.15.8), and I noticed that if I initialize Sentry after the first window is shown (using QWidget::show()), then everything works as expected, which strengthens your theory about it having to do with the UI, @Cyriuz. So now I'm flashing a tiny window before the main one, which is far from ideal, but at least I'm getting the events. It would still be good to have another solution, though.
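
In rough outline, the workaround looks like this (an illustrative sketch of the ordering only, not the actual application code):

#include <QApplication>
#include <QWidget>
#include <sentry.h>

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);

    // Show a tiny window first; on the affected machines the SDK's exception
    // filter only survives if Sentry is initialized after the first window
    // has been shown.
    QWidget splash;
    splash.show();

    sentry_options_t *options = sentry_options_new();
    // ... DSN, handler path, database path as in the original report ...
    sentry_init(options);

    splash.hide();
    QWidget mainWindow;
    mainWindow.show();

    const int rc = app.exec();
    sentry_close();
    return rc;
}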

@Cyriuz
Contributor

Cyriuz commented May 15, 2023

If you run this after sentry_init it works for me:

#include <windows.h>
#include <cstdio>

// Overwrites the body of kernel32!SetUnhandledExceptionFilter with a stub that
// just returns NULL, so nothing can replace the filter installed by sentry_init().
void lockUnhandledExceptionFilter()
{
	// Use the ANSI variant explicitly so this also compiles in UNICODE builds.
	HMODULE kernel = GetModuleHandleA("kernel32.dll");
	if (!kernel)
	{
		printf("[Warning] LockUnhandledExceptionFilter: Failed locating kernel32 module.\n");
		return;
	}

	FARPROC setUnhandledExceptionFilter = GetProcAddress(kernel, "SetUnhandledExceptionFilter");
	if (!setUnhandledExceptionFilter)
	{
		printf("[Warning] LockUnhandledExceptionFilter: Failed locating SetUnhandledExceptionFilter.\n");
		return;
	}

	#ifdef Q_OS_WIN64
	// x64: return NULL (no previous filter) and do nothing else.
	static const BYTE newBody[] = {
		0x33, 0xC0, // xor eax,eax
		0xC3        // ret
	};
	#else
	// x86 (__stdcall): return NULL and pop the single 4-byte argument.
	static const BYTE newBody[] = {
		0x33, 0xC0,       // xor eax,eax
		0xC2, 0x04, 0x00, // ret 4
	};
	#endif
	SIZE_T bytes = 0;
	// Patch the first bytes of the function in place; WriteProcessMemory is used
	// (rather than memcpy) because the code page is not normally writable.
	if (WriteProcessMemory(GetCurrentProcess(), (LPVOID)setUnhandledExceptionFilter, (LPCVOID)newBody, sizeof(newBody), &bytes) == 0)
	{
		printf("[Warning] LockUnhandledExceptionFilter: Failed writing process memory.\n");
	}
}

I tried to simplify the other examples I found as much as possible but I'm no Windows programmer so take it with a grain of salt.

The best way to get to the bottom of what it is that clears the filter would be to make a proper hook and print the call stack in there, but it's somewhat involved, and I'm not sure what I would do with the info anyway, since I guess it could theoretically be multiple things.

To get the WER module, you have to build sentry-native with CRASHPAD_WER_ENABLED.

@Swatinem
Member

That is some very good feedback, thanks for the investigation!

We have indeed heard reports before that Qt / display drivers are constantly overwriting the UnhandledExceptionFilter, similar to managed language runtimes.

While it would be possible to hook / lock overwriting it, we decided against that, as it is a very deep "incision" that we don’t want to do by default.

@Cyriuz
Contributor

Cyriuz commented May 15, 2023

Fair enough. Although I wonder if it can still be considered somewhat standard, since so many other projects seem to do the same? And considering that sentry-native is, in my understanding at least, supposed to take ownership of the entire crash-reporting pipeline, I feel there might be an argument for including it; I guess it could even be optional. Or maybe it could just be a good topic for a page in the Sentry docs as a workaround, so people can find it more easily. I assume it potentially affects most people that use sentry-native on Windows.

@Swatinem
Member

IMO hooking the UnhandledExceptionFilter completely is a really big hammer. We might be breaking apps in surprising ways that we can’t predict yet.

There are legitimate uses for temporarily installing a filter and then restoring the original handler afterwards. We don't want to break those use cases. Although, as the examples show, restoring the original handler is often not done correctly by other tools either.
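
For illustration, a well-behaved temporary hook would chain to and later restore the previous filter, roughly like this sketch:

#include <windows.h>

static LPTOP_LEVEL_EXCEPTION_FILTER g_previousFilter = nullptr;

static LONG WINAPI temporary_filter(EXCEPTION_POINTERS *info)
{
    // ... do whatever the tool needs with the exception ...
    // Then hand it on to whoever was registered before us (e.g. the crash reporter).
    if (g_previousFilter) {
        return g_previousFilter(info);
    }
    return EXCEPTION_CONTINUE_SEARCH;
}

void install_temporary_filter()
{
    // Remember the previous filter so it can be chained to and restored.
    g_previousFilter = SetUnhandledExceptionFilter(temporary_filter);
}

void remove_temporary_filter()
{
    // Restore whatever was installed before us instead of clearing the filter.
    SetUnhandledExceptionFilter(g_previousFilter);
}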

We might want to offer this either as a compile-time flag or an explicit runtime API though, wdyt @supervacuus? Then the responsibility of (not) breaking any unpredictable use cases is up to the app author.

supervacuus self-assigned this May 23, 2023
@W2Wizard

W2Wizard commented Jun 14, 2023

Just going to throw in my 2 cents here. I had a very similar issue with regard to Qt (5.15.8): the moment it was initialized, the entire crash reporting ceased to function. There were no registry keys or misplaced dumps; in fact, there was absolutely nothing.

A quick test basically confirmed what I suspected: I would throw an exception right after Sentry initialized, and it worked. I did the same right after calling QApplication a(argc, argv); and it was back to being broken.

Weirdly enough, it worked just fine on some Windows machines (e.g. a server, ...), but on others basically nothing worked. No dumps, no logs (only the one indicating that Sentry had initialized), no indication whatsoever that something went wrong.

@rodolfoBee
Member

This issue is also affecting a user of the Unreal SDK v0.16.0. It happens with both auto and manual initialisation:

FConfigureSettingsDelegate SettingsDelegate;
SettingsDelegate.BindDynamic(this, &USentryCrashReporterIntegration::HandleSettingsDelegate);

USentrySubsystem* SentrySubsystem = GEngine->GetEngineSubsystem<USentrySubsystem>();
SentrySubsystem->InitializeWithSettings(SettingsDelegate);

Are there any updates on this issue?

@supervacuus
Collaborator

supervacuus commented Jun 4, 2024

Are there any updates on this issue?

We must understand that we cannot solve this problem within the Native SDK.

While a solution like the one proposed here: #833 (comment) can work to mitigate the issue, we have to be clear that this kind of solution is the reason why the problem exists in the first place.

We already had support topics that took weeks to resolve. In the end, we discovered that a customer code dependency overwrote SetUnhandledExceptionFilter() in the same way described above. This means that it is not only wrong inside the Native SDK but also wrong as an "officially" suggested mitigation from Sentry's perspective.

Also, sentry_reinstall_backend() is not a solution to the problem but rather a way to identify and debug the issue. If it is used as a permanent mitigation, then only as a last resort, for instance, when a closed source dependency with an unresponsive dev team misbehaves.

IMO, the correct approach for Sentry is to extend the "Advanced Usage" documentation and provide users of the Native SDK with steps to identify the issue, isolate potential culprits, and implement solid mitigations. I did this with multiple users in an ad hoc fashion over the last two years, and while all users had different setups and root causes, there were a lot of common steps in between. Another aspect is to collect known culprits and potential mitigations, if any.

Since the focus over the last six to nine months has been primarily on supporting downstream SDKs to improve quality on mobile platforms, these Desktop documentation topics have been put on the back burner. But we will tackle them.

cc: @kahest

@Cyriuz
Contributor

Cyriuz commented Jun 5, 2024

While I respect your decision and stance on the issue, mostly for discussion's sake I would argue that, because of the nature of this library (its only purpose being catching crashes), the argument that the "solution is the reason why the problem exists in the first place" is not really helpful. For sure, it is not good that many dependencies and drivers use this API incorrectly or hook it, but that doesn't change the fact that this is the situation we're in.

We want to take ownership of the exception handling but can't possibly do it any other way. If it were just dependencies, it would be one thing: it is easy enough to debug who is resetting the hook, and we can complain upstream. But for drivers that do it on customer PCs, it's just not feasible.

Is it really that big of a deal to say that we don't want anyone else to be able to handle exceptions, given the scope of this lib? My proposed fix doesn't involve using the APIs incorrectly, just making sure no one else is allowed to. I agree that it feels wrong in the context of a library, and maybe it is also wrong to do it here, but I see no other option, and I think it would save a lot of users of this lib a lot of time debugging things.

In any case, easy-to-find docs covering all of this would be great, so people can make their own decisions :)
