Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chromium Legacy occasionally induces a kernel panic: kqueue_scan_continue: - invalid wait_result #44

Open
Wowfunhappy opened this issue Jan 11, 2022 · 96 comments

Comments

@Wowfunhappy
Copy link
Sponsor

Wowfunhappy commented Jan 11, 2022

I've been avoiding reporting this, because I feel like it's going to be impossible to fix. :(

Occasionally while using Chromium Legacy, OS X kernel panics with a distinctive backtrace. The backtrace implicates Chromium Helper, and I've never seen the panic occur when I'm not using Chromium. I've observed this on my desktop and laptop running 10.9, as have others on MacRumors. I think it may happen more often on Macs with less memory?

A log is attached. Notably, I have so far never managed to capture a panic while I had keepsyms enabled, I'd like to capture one at some point.

Kernel_2022-01-11-182654_Jonathans-MacBook-Air.panic.zip


Everything beyond this point is speculation, and probably useless, due to my limited understanding of C. But, I'm going to include it anyway. :)

I'm technically running XNU version 2422.115.15, which doesn't appear to be open source, but the closest version with source available is 2422.115.4. The file being referenced in the log is here: https://opensource.apple.com/source/xnu/xnu-2422.115.4/bsd/kern/kern_event.c.auto.html

The line numbers don't match up, possibly because of the tiny version mismatch, but I assume this is the referenced function:

static void
kqueue_scan_continue(void *data, wait_result_t wait_result)
{
	thread_t self = current_thread();
	uthread_t ut = (uthread_t)get_bsdthread_info(self);
	struct _kqueue_scan * cont_args = &ut->uu_kevent.ss_kqueue_scan;
	struct kqueue *kq = (struct kqueue *)data;
	int error;
	int count;

	/* convert the (previous) wait_result to a proper error */
	switch (wait_result) {
	case THREAD_AWAKENED:
		kqlock(kq);
		error = kqueue_process(kq, cont_args->call, cont_args, &count,
		    current_proc());
		if (error == 0 && count == 0) {
			wait_queue_assert_wait((wait_queue_t)kq->kq_wqs,
			    KQ_EVENT, THREAD_ABORTSAFE, cont_args->deadline);
			kq->kq_state |= KQ_SLEEP;
			kqunlock(kq);
			thread_block_parameter(kqueue_scan_continue, kq);
			/* NOTREACHED */
		}
		kqunlock(kq);
		break;
	case THREAD_TIMED_OUT:
		error = EWOULDBLOCK;
		break;
	case THREAD_INTERRUPTED:
		error = EINTR;
		break;
	default:
		panic("%s: - invalid wait_result (%d)", __func__,
		    wait_result);
		error = 0;
	}

	/* call the continuation with the results */
	assert(cont_args->cont != NULL);
	(cont_args->cont)(kq, cont_args->data, error);
}

So, the switch statement expects the wait_result parameter to be either THREAD_AWAKENED, THREAD_TIMED_OUT, or THREAD_INTERRUPTED. If it's none of these—as appears to be the case for us—it panics with our error, "invalid wait result".

So, what is being passed in as wait_result? Well, it's printing as the number "3", but I don't know enough C to understand how these integers are mapped to their human-readable names. However, note that a fourth option has been added to this switch statement in newer versions of XNU: https://opensource.apple.com/source/xnu/xnu-7195.81.3/bsd/kern/kern_event.c.auto.html

	case THREAD_RESTART:
		error = EBADF;
		break;

I'm going to hazard a guess that when the kernel panics, kqueue_scan_continue is getting called with a value of THREAD_RESTART in the wait_resultparameter, which is understood on new versions of XNU but not the older ones we're using. Is there any way to catch and handle this?

@pjpreilly
Copy link

pjpreilly commented Jan 12, 2022

When I've triggered this...by chance.. I seem to always be switching windows either through hot corners invoking Mission Control or clicking the Chromium Dock Icon to switch windows..... FWIW.... or maybe switching tabs.... I've noticed sometimes when switching tabs the pointer does not take focus of the new tab until a subsequent click in the tab being switched to...seems to occur after the browser has been running for a good amount of time.

@krackers
Copy link

krackers commented Jan 17, 2022

@Wowfunhappy Mapping is given in https://opensource.apple.com/source/xnu/xnu-4570.61.1/osfmk/kern/kern_types.h

You're right in that 3 corresponds to THREAD_RESTART. Not too familiar with kqueue internals but just based on statically analyzing the call hierarchy, seems to be

kevent_internal -> kqueue_scan -> kqueue_scan_continue

It's really interesting that kevent syscall was only recently updated to return EBADF though. Seems like this is returned whenever the fd becomes invalid (e.g. if it's closed), which would definitely explain why it's sporadic and hard to reproduce. I wonder if this bug was just never noticed on older versions of osx because nothing else other than chrome seems to make such heavy use of IPC, and older versions of Chromium probably used some other method for synchronization/ipc [1]?

There are only a few places in chromium source that make the kevent syscall (search for
"kevent" "#include <sys/event.h>"). Might be interesting to add some debugging code around that to see if there's any pattern.

[1] https://source.chromium.org/chromium/chromium/src/+/2096391cdcc3bdf963e68977e9015aa8cdbe85c6

https://source.chromium.org/chromium/chromium/src/+/c96c632d5501d5ebba14e1130fdea8b15ac29756

@Wowfunhappy
Copy link
Sponsor Author

Thanks! I wonder if this was the underlying project which introduced the problem. The timing lines up, 9–12 months after 10.9 support was dropped from Chrome.

Aside, I assume it's not possible to just replace kqueue_scan_continue with a version that handles THREAD_RESTART, right? Because it's an internal kernel function, or something like that?

@krackers
Copy link

krackers commented Jan 19, 2022

@Wowfunhappy Excellent question. It is possible if you are willing to install a kext. Basically since all the kexts will live in the same address space as the kernel code, you can hot-patch the kernel by modifying the instructions in memory.

Now to actually do this patch you'll need to be a bit clever:

  • Load the kernel mach-o in ghidra/hopper/ida and find the address of the function that we want to modify (kqueue_scan_continue). This will be our magic number that is the offset to patch from base.
  • Then in your kext, you will have to find the base location the kernel is loaded at (due to kaslr this is not fixed), and add that offset from step (1) to find the address of the desired function.
  • Add an extra offset to get to the specific instruction we want to patch. In this case, it's not a trivial modification since we want to insert a block of code, so the best way to do this is to insert a jump to code in your kext that will handle the switch(), then jump back to the original code.
  • Do the patch as described above, and it should all just work.

Conceptually not too difficult, and not that different from code injection in userspace. Only difference is that stakes are higher. If you have a VM and are willing to play around with it, I can give you some code pointers. Of course at this point it might just be easier to recompile the kernel itself since we have the source anyway...

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Jan 19, 2022

Thanks, makes sense it would need a kext. Probably not ideal if it can be helped.

(Just for my own use, I am a bit tempted to recompile XNU. But I run into this panic so rarely anyway... Some people on MacRumors were saying it happens daily for them.)

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Jan 19, 2022

MacBiter over on MacRumors captured a backtrace with keepsyms enabled.

panic(cpu 6 caller 0xffffff80055c7ab4): "kqueue_scan_continue: - invalid wait_result (3)"@/SourceCache/xnu/xnu-2422.115.15/bsd/kern/kern_event.c:2167
Backtrace (CPU 6), Frame : Return Address
0xffffff81f05a3ef0 : 0xffffff8005223139 mach_kernel : _panic + 0xc9
0xffffff81f05a3f70 : 0xffffff80055c7ab4 mach_kernel : _kqueue_scan + 0x8f4
0xffffff81f05a3fb0 : 0xffffff80052d7c67 mach_kernel : _call_continuation + 0x17

I assume this is _call_continuation? Still an internal kernel function, I naively thought it might give us Chromium's initial syscall.

Edit: And I think this is the definition... I wasn't expecting to find ASM...


Edit 2: I don't actually understand most of what it's saying, but pages 418 – 420 of Jonathan Levin's Mac OS X and iOS Internals may be relevant here. I think it's saying that wait_result is set by the thread_block_reason() function.

Screen Shot 2022-01-19 at 7 10 46 PM

I'm just trying to figure out where that wait_result value could be coming from. I find it so weird that XNU is returning a value that didn't exist yet. It would make sense if Chromium was the one setting wait_result, since Google is only targeting modern kernels. But this is kernel code calling kernel code. I guess the obvious answer is that it's a bug, but it seems like something that wouldn't have gone unnoticed for years.

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Jan 22, 2022

...well, for what it's worth, I decided to build a custom kernel. As it turns out, compiling XNU is shockingly easy and the build itself takes only a few minutes to complete. The only tricky part was that I had to install Xcode 5.0.2; later versions did not appear to work.

The only source change is that I added the THREAD_RESTART case to kqueue_scan_continue, which sets error = EBADF as in newer versions of XNU.

I am (a bit nervously) going to try using this on my main machine. A half hour in, nothing catastrophic has happened so far.

The kernel is attached in case anyone else on Mavericks wants to try it, but I am not necessarily recommending that anyone do so. If you do use it, make sure at minimum to have a secondary partition available to restore your original kernel should the need arise. Remember that you will need to clear the kernel cache on your first boot with the new kernel.

@krackers Do you have any idea why this kernel would be only 8,257,152 bytes, when the original kernel and every custom Hackintosh kernel I've checked is 8,394,688 bytes? It's weirding me out.

XNU-2422.115.4-THREAD_RESTART add-JA-v2022..01.22.zip

@krackers
Copy link

@Wowfunhappy sorry for the late response.

I'm just trying to figure out where that wait_result value could be coming from. I find it so weird that XNU is returning a value that didn't exist yet

Note that even in the older xnu version, THREAD_RESTART was still a valid defined value for wait_result_t. In fact, you can see that it is returned in wait_queue_assert_wait64 of osfmk/kern/wait_queue.c when the queue becomes invalid. The difference is that in this older kernel version, kqueue scan did not properly handle this, and just panicked instead of returning EBADF.

I agree that this seems way too obvious an omission to be a bug. But I can't think of any other explanation at the moment (that said, my knowledge here is very slim). I think the first step might be to see if we can create a scenario that can reliably reproduce this. Based on my current understanding, it seems like the conditions for triggering would look like:

  • Have one process call kqueue to create a new fd for event queue, and then call kevent to wait upon it

  • While that happens, another process closes the fd

  • Based on what's described above, this would generate a THREAD_RESTART condition in the wait queue code which would bubble up to kqueue_scan and cause the panic.

But this seems way too obvious and easy to trigger, so I'm still skeptical that this is actually what's happening.

As it turns out, compiling XNU is shockingly easy and the build itself takes only a few minutes to complete

I see you're following the "Kernel Shaman's" build instructions? I'm not quite sure why there would be a difference in size assuming that both are release builds, but it's possible that apple internal builds have some other ifdefs set during compile, or the kernel version is different? This wouldn't explain why the hackintosh kernels match in size though...

That's also why I'm personally partial to the "hotpatch via kext" approach. That way you introduce only the minimal difference necessary.

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Jan 24, 2022

@krackers I'm not ignoring your suggestions of things to try, they're just mostly outside of my skill level. 🙂 I only barely understand explanations of what a file descriptor is...

Anyway, the plot thickens. I neglected to check something earlier:

https://github.com/apple-oss-distributions/xnu/blame/cc9a63552421e7db025ad771974f3e9f479948ad/bsd/kern/kern_event.c

That THREAD_RESTART case was only added to XNU in 3789.1.32, aka macOS 10.12 Sierra.

But, Google Chrome officially supports macOS 10.11. So, why don't Chrome users on El Capitan experience kernel panics?

Could this indicate the problem was introduced by Chromium Legacy? If I'm not missing anything, I think the only Chromium Legacy changes that involve kevent come from this commit. e0cff58

Google's code review comments for this change are interesting...

blueboxd pushed a commit that referenced this issue Feb 3, 2022
The frame management mechanism expects that there are no more than
25 submitted frames at the same time. However, this number can
grow bigger if Wayland compositor doesn't send feedbacks (although
the protocol is supported). One of such examples is Sway that
runs on top of gnome/mutter, which doesn't expose wp_presentation
if the underlying system doesn't support that.

See bug reported to wlroots -
https://gitlab.freedesktop.org/wlroots/wlroots/-/issues/3340

PS also the number of the threshold has been lowered to avoid
FATAL:compositor_frame_sink_support.cc(734)] Check failed: pending_received_frame_times_.size() <= 25u (26 vs. 25)
errors.

(cherry picked from commit c0d0144)

Bug: 1253566, 1187077, 1243133, 1254273
Change-Id: Ib2b222f40588212342dd88b2cb880647024788f4
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3322619
Reviewed-by: Robert Kroeger <rjkroege@chromium.org>
Reviewed-by: Kramer Ge <fangzhoug@chromium.org>
Commit-Queue: Maksim Sisov <msisov@igalia.com>
Cr-Original-Commit-Position: refs/heads/main@{#950548}
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3338605
Auto-Submit: Kramer Ge <fangzhoug@chromium.org>
Commit-Queue: Robert Kroeger <rjkroege@chromium.org>
Cr-Commit-Position: refs/branch-heads/4758@{#44}
Cr-Branched-From: 4a2cf4b-refs/heads/main@{#950365}
@pjpreilly
Copy link

On latest canary..Version 100.0.4889.0 (Official Build) (x86_64)
two finger scrolling in a facebook messenger chat....

Tue Feb 15 06:40:24 2022
panic(cpu 0 caller 0x55bb8e): "kevent_scan_cont() - invalid wait_result (3)"@/SourceCache/xnu/xnu-1699.32.7/bsd/kern/kern_event.c:1982
Backtrace (CPU 0), Frame : Return Address (4 potential args on stack)
0xe2bf68 : 0x2203de (0x6b08cc 0xe2bf88 0x229fb0 0x0)
0xe2bf98 : 0x55bb8e (0x7006a4 0x3 0x6 0xa4c39d4)
0xe2bfc8 : 0x2c7d0c (0x9f03320 0x3 0x10 0xa4e9190)

BSD process name corresponding to current thread: Chromium Helper

Mac OS version:
11G63

Kernel version:
Darwin Kernel Version 11.4.2: Thu Aug 23 16:26:45 PDT 2012; root:xnu-1699.32.7~1/RELEASE_I386
Kernel UUID: 859B45FB-14BB-35ED-B823-08393C63E13B
System model name: MacBook4,1 (Mac-F22788A9)

System uptime in nanoseconds: 71096824466362
last loaded kext at 27108605406: com.apple.driver.AudioAUUC 1.59 (addr 0x146b000, size 24576)
last unloaded kext at 115835844270: com.apple.driver.USBCameraFirmwareLoader 1.1.0 (addr 0xd35000, size 24576)
loaded kexts:
com.apple.driver.AudioAUUC 1.59
com.apple.filesystems.autofs 3.0
com.apple.driver.AppleHWSensor 1.9.5d0
com.apple.driver.AppleHDA 2.2.5a5
com.apple.driver.AppleUpstreamUserClient 3.5.9
com.apple.driver.AppleBacklight 170.2.2
com.apple.driver.AppleMCCSControl 1.0.33
com.apple.driver.AppleIntelGMAX3100 7.0.4
com.apple.driver.AppleIntelGMAX3100FB 7.0.4
com.apple.driver.SMCMotionSensor 3.0.2d6
com.apple.iokit.IOUserEthernet 1.0.0d1
com.apple.iokit.IOBluetoothSerialManager 4.0.8f17
com.apple.Dont_Steal_Mac_OS_X 7.0.0
com.apple.driver.AudioIPCDriver 1.2.3
com.apple.driver.ApplePolicyControl 3.1.33
com.apple.driver.ACPI_SMC_PlatformPlugin 5.0.0d8
com.apple.driver.AppleLPC 1.6.0
com.apple.driver.AppleSMCPDRC 5.0.0d8
com.apple.driver.CSRUSBBluetoothHCIController 4.0.8f17
com.apple.AppleFSCompression.AppleFSCompressionTypeDataless 1.0.0d1
com.apple.AppleFSCompression.AppleFSCompressionTypeZlib 1.0.0d1
com.apple.BootCache 33
com.apple.driver.AppleUSBTrackpad 227.6
com.apple.driver.AppleUSBTCKeyEventDriver 227.6
com.apple.driver.AppleUSBTCKeyboard 227.6
com.apple.driver.AppleIRController 312
com.apple.iokit.SCSITaskUserClient 3.2.1
com.apple.driver.XsanFilter 404
com.apple.iokit.IOAHCIBlockStorage 2.1.0
com.apple.driver.AirPortBrcm43224 501.36.15
com.apple.driver.AppleFWOHCI 4.9.0
com.apple.driver.AppleHPET 1.7
com.apple.driver.AppleUSBHub 5.1.0
com.apple.driver.AppleAHCIPort 2.3.1
com.apple.driver.AppleIntelPIIXATA 2.5.1
com.apple.driver.AppleUSBEHCI 5.1.0
com.apple.driver.AppleUSBUHCI 5.1.0
com.apple.driver.AppleEFINVRAM 1.6.1
com.apple.driver.AppleRTC 1.5
com.apple.iokit.AppleYukon2 3.2.2b1
com.apple.driver.AppleSmartBatteryManager 161.0.0
com.apple.driver.AppleACPIButtons 1.5
com.apple.driver.AppleSMBIOS 1.9
com.apple.driver.AppleACPIEC 1.5
com.apple.driver.AppleAPIC 1.6
com.apple.driver.AppleIntelCPUPowerManagementClient 195.0.0
com.apple.nke.applicationfirewall 3.2.30
com.apple.security.quarantine 1.4
com.apple.security.TMSafetyNet 8
com.apple.driver.AppleIntelCPUPowerManagement 195.0.0
com.apple.kext.triggers 1.0
com.apple.driver.DspFuncLib 2.2.5a5
com.apple.driver.AppleBacklightExpert 1.0.4
com.apple.driver.AppleSMBusController 1.0.10d0
com.apple.driver.AppleHDAController 2.2.5a5
com.apple.iokit.IOHDAFamily 2.2.5a5
com.apple.iokit.IOSurface 80.0.2
com.apple.iokit.IOFireWireIP 2.2.5
com.apple.iokit.IOAudioFamily 1.8.6fc18
com.apple.kext.OSvKernDSPLib 1.3
com.apple.driver.AppleGraphicsControl 3.1.33
com.apple.iokit.IONDRVSupport 2.3.4
com.apple.iokit.IOGraphicsFamily 2.3.4
com.apple.iokit.IOSerialFamily 10.0.5
com.apple.driver.AppleSMC 3.1.3d10
com.apple.driver.IOPlatformPluginLegacy 5.0.0d8
com.apple.driver.IOPlatformPluginFamily 5.1.1d6
com.apple.driver.AppleUSBBluetoothHCIController 4.0.8f17
com.apple.iokit.IOBluetoothFamily 4.0.8f17
com.apple.driver.AppleUSBMergeNub 5.1.0
com.apple.iokit.IOUSBHIDDriver 5.0.0
com.apple.driver.AppleUSBComposite 5.0.0
com.apple.iokit.IOSCSIMultimediaCommandsDevice 3.2.1
com.apple.iokit.IOBDStorageFamily 1.7
com.apple.iokit.IODVDStorageFamily 1.7.1
com.apple.iokit.IOCDStorageFamily 1.7.1
com.apple.iokit.IOATAPIProtocolTransport 3.0.0
com.apple.iokit.IOSCSIArchitectureModelFamily 3.2.1
com.apple.iokit.IO80211Family 420.3
com.apple.iokit.IOFireWireFamily 4.4.8
com.apple.iokit.IOUSBUserClient 5.0.0
com.apple.iokit.IOAHCIFamily 2.0.8
com.apple.iokit.IOATAFamily 2.5.1
com.apple.iokit.IOUSBFamily 5.1.0
com.apple.driver.AppleEFIRuntime 1.6.1
com.apple.iokit.IONetworkingFamily 2.1
com.apple.iokit.IOHIDFamily 1.7.1
com.apple.iokit.IOSMBusFamily 1.1
com.apple.security.sandbox 177.11
com.apple.kext.AppleMatch 1.0.0d1
com.apple.driver.DiskImages 331.7
com.apple.iokit.IOStorageFamily 1.7.2
com.apple.driver.AppleKeyStore 28.18
com.apple.driver.AppleACPIPlatform 1.5
com.apple.iokit.IOPCIFamily 2.7
com.apple.iokit.IOACPIFamily 1.4

@jgrisham
Copy link

But, Google Chrome officially supports macOS 10.11. So, why don't Chrome users on El Capitan experience kernel panics?

Do those (macOS 10.11 Chrome users) actually exist? ;)

blah...
  • Those still running 10.11 may be using Safari/Firefox/Opera, or blame their kernel panics on running an old OS version.
  • (I also suspect many are headless or server systems that haven't been updated - these types of systems probably don't see much interactive web browsing.)
  • Personally, my MacBook Pro still is running Mavericks, but I seem to remember thinking (for some long-lost rationale) that if I were ever to bump it forward, I'd be stopping at 10.10, 10.12, 10.13, or 10.15.

AFAIK, there aren't many (any?) macs where that is the latest installable version.

Systems tend to max out on 10.4, 10.6.8, 10.7.5, 10.11, 10.13.x, etc...

  • Having a max supported OS of macOS 10.11 seems mostly limited to MacMini3,1s and other similar late 2008/early 2009 Penryn/Nehalem models ... the late 2009 models tend to support up to 10.13.x ... I don't remember exactly, but the demarcation point may have been 'Metal' framework support by the integrated GPU.

more blah...

We recently identified a MS Outlook / Windows MAPI bug that now silently eats emails... but only for IMAP mailboxes, and only for maybe one out of a thousand. It has existed for at least a decade, likely two. (It was likely dormant for most of that time; surfacing only as mail processors began to support ARC in late 2019/2020.)

  • From the public reporting, it wouldn't be at all surprising if Apple had overlooked intermittent kernel bugs for just as long, especially for one that didn't generally cause panics on iOS / TVOS.
  • The larger the company (and the older/more stable the codebase), the harder it is for even well-meaning and talented engineering teams to find out edge cases exist - there are unintentional gatekeepers at all levels.

Tl;dr: it's plausible that it had been broken all along (i.e. not backported to earlier versions when it was finally discovered / fixed in XNU 3789.1.32).

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Feb 17, 2022

Do those (macOS 10.11 Chrome users) actually exist? ;)

I mean, anecdotally, I've seen several of them in the wild! I even helped a third grade schoolteacher add the Let's Encrypt certificate to Keychain Access last fall.

This may well be an kernel bug, but I don't think it's happening in official versions of Chrome. A kernel panic is more difficult to overlook than a missing email, and this one appears to happen to all Chromium Legacy users sooner or later, albeit very infrequently. I don't see how Google, with all of their users and analytics, wouldn't know about this problem if it existed.

It seems much more likely to me that e0cff58 is culpable, somehow...


† Technically, my understanding is that any instance of a userspace application triggering a kernel panic is generally considered a kernel bug, since userspace shouldn't be able to take down the kernel.

@jgrisham
Copy link

{discussion clutter}

Technically, my understanding is that any instance of a userspace application triggering a kernel panic is generally considered a kernel bug, since userspace shouldn't be able to take down the kernel.

That's what I always thought too...

  • anything else seems like users and userspace devs are just being gaslit by kernel devs (I guess it's technically only gaslighting if it's intentional)

A kernel panic is more difficult to overlook ...

Indeed.

this one appears to happen to all Chromium Legacy users sooner or later, albeit very infrequently

Do we actually know that? I mean, how could we without a major analytics operation? (I'm honestly curious - not trying to be argumentative!)

... yet, someone who recompiles XNU for fun in their spare time also said:

I've been avoiding reporting this ...

Chromium is one of the most resource-intensive users of a heterogenous variety of OS resources that many users are likely to run these days.

  • It's also liable to be used in very different ways on each launch; even users visiting the same website day after day will ultimately be running different JS from one month to the next.

with all of their users and analytics

I really want to believe, but for something that doesn't significantly impact PR or ad revenue, causing occasional and non-reproducible crashes for users of older versions of only one operating system?

  • Simple things break complex systems, and yet only seem simple in retrospect.
  • It would not be at all surprising if they had much more impactful bugs languishing in their backlog, both then and now. Knowledge without action and all...

Consider:

  • Apple fixed this at the kernel level (after a few years)
  • You fixed this at the kernel level (after a few weeks)

Also:

  • None of that actually matters 😒

Neither one of those fixed kernels will get to the majority of affected users here, so we do have to continue trying to mitigate this kernel bug by rolling back or modifying our userspace code.


Tl;dr: Thank you for trying to hunt it down!

@krackers
Copy link

Looking again at the kernel kevent code, it seems there are very few places where THREAD_RESTART is actually signaled. The most likely source for the one received in kqueue_scan seems to be

https://github.com/aosm/xnu/blob/os-x-1095/osfmk/kern/wait_queue.c#L1335

	/* If it is an invalid wait queue, you cant wait on it */
	if (!wait_queue_is_valid(wq))
		return (thread->wait_result = THREAD_RESTART);

and in the kevent code kq->kq_wqs is the wait queue as part of the kq struct. This again makes me think that somehow the kqueue is being closed by some other process and we're not handling that.

@Wowfunhappy can you add some log statements to each of the sites in kern_event.c that call into wait_queue_assert, and then an additional log statement in the place where we would have panicked before? To avoid too much data guard each log statement by !wait_queue_is_valid(wq). (Let me know if you need clarification here). Collecting more data might be useful to try to recreate the exact conditions under which this is triggered.

As for userspace workarounds, assuming that hypothesis is correct then what I can think of is to try to ensure that no thread is waiting on a kqueue that is to be closed. So we'd have to send some sort of interrupt signal to break the wait on kqueue before closing the fd? Not too sure since I'm not too familiar with these parts...

As for the reverted commit you linked, at first glance I don't see anything too suspicious there. Why did that commit need to be reverted for 10.9 compatibility though? I don't see anything that's 10.11+ specific.

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Feb 20, 2022

@Wowfunhappy can you add some log statements to each of the sites in kern_event.c that call into wait_queue_assert, and then an additional log statement in the place where we would have panicked before? To avoid too much data guard each log statement by !wait_queue_is_valid(wq). (Let me know if you need clarification here). Collecting more data might be useful to try to recreate the exact conditions under which this is triggered.

I'll try next week!

As for the reverted commit you linked, at first glance I don't see anything too suspicious there.

I don't understand this code, so I'm just looking at the timing of when things were added and the differences between codebases. I'm assuming:

  1. The offending code has to make use of the kevent system call.

  2. There has to be something different in Chromium Legacy's codebase and mainline Chromium's codebase, which is causing Chromium Legacy to kernel panic on 10.11 and below but not mainline Chromium. (This is based on a separate assumption that Chromium Legacy would kernel panic if used on 10.11, since 10.11 contains the offending XNU code.)

The commit I linked contains the only difference I could find between the two codebases which uses kevent.

@krackers
Copy link

@Wowfunhappy Assumption 1 isn't necessarily true. If we use the working hypothesis that the kpanic is caused by attempting to use an invalid (freed) kqueue, then you'd also have to search for references for code which uses MessagePump, since when that instance goes out of scope its destructor will free the kqueue fd.

There's also exactly 1 other place where THREAD_RESTART is signalled, and that's in some sephamore related method. If the kpanic is triggered by that, then we're really in a pickle since that would mean the root cause might not even be strictly related to kqueue operations.

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Feb 21, 2022

@krackers I ran into some stupid trouble guarding the logging statements. wait_queue_is_valid doesn't exist in kern_event.c.

I tried to copy the two defines from: https://github.com/aosm/xnu/blob/0a13c7a4dcb9ab989727c2b0e08f7861457ed069/osfmk/kern/wait_queue.h#L161, adding:

#define _WAIT_QUEUE_inited		0x2
#define wait_queue_is_valid(wq)	\
	(((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)

But the compiler complained:
no member named 'wq_type' in 'struct wait_queue'

The wait_queue struct is defined in kern/wait_queue.h, and wq_type is ostensibly public, but adding #include <kern/wait_queue.h> to kern_event.c had no effect. I don't want to copy-paste the struct into kern_event.c because I don't know if defining it in a new place will do something weird.

I'm sure I'm doing something dumb. 🙂

For now, I've attached a kernel with unguarded logging statements before each call of wait_queue_assert in kern_event.c. The source is this revision of: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558d/440e40ec82da87f4240f22981a020e0f169ea915. It works in my VM, but it's noisy enough that I don't want to leave it running on my real system. It seems wait_queue_assert_wait_with_leeway is called from kqueue_scan many times per second.

Also for reference, the guarded version I can't compile: https://gist.github.com/Wowfunhappy/a214411bb52d39bcee9497363a97558
XNU-2422.115.4-THREAD_RESTART add-JA-v2022.02.21.zip
d

Edit: Issue was closed by accident, please disregard.

@krackers
Copy link

krackers commented Feb 21, 2022

@Wowfunhappy

No you're not doing something wrong, the way their includes work is tricky and took me a good 10 minutes to figure out.

Very roughly, when all the headers for are put together the net result is something like


#include <stdio.h>


typedef struct wait_queue		*wait_queue_t;
#ifndef MACH_KERNEL_PRIVATE

struct wait_queue { unsigned int opaque[2]; int *opaquep[2]; } ; // from kern_types

#else



typedef struct wait_queue { // from wait_queue.h
	unsigned long int                 
	/* boolean_t */	wq_type:2,		/* only public field */
					wq_fifo:1,		/* fifo wakeup policy? */
					wq_prepost:1,	/* waitq supports prepost? set only */
					wq_eventmask:2; 
	int	wq_interlock;	/* interlock */
	int	wq_queue;		/* queue of elements */
} WaitQueue;

#define wait_queue_is_valid(wq)	\
	(((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)
	
#endif
	

In particular, note that we actually have two different declarations of the wait_queue struct that are selected based on whether MACH_KERNEL_PRIVATE. I assume one is intended be used akin to an opaque-type (think "private" in an OO language) to hide internal members, while the other is used only within the wait_queue.c. Interestingly for opaque types you normally just leave the declaration without any fields, but here they explicitly initialize with some dummy fields of the proper size, which I assume was done so sizeof() would work propery.

This is also why just #including didn't work, since the proper #define wasn't set, so it effectively never made it past the preprocessor.

Now let's work around this. You're right in that we don't want to just directly copy-paste the struct since we'll run into duplicate definition error. Instead let's do it in a slightly smarter way where we create our own struct with identical layout and then cast to that to access fields.


typedef struct wait_queue		*wait_queue_t;

struct wait_queue { unsigned int opaque[2]; int *opaquep[2]; } ; // from kern_types


#define EVENT_MASK_BITS  

typedef struct wowfunhappy_wait_queue { // happy little wait queue
	unsigned long int                 
	/* boolean_t */	wq_type:2,		/* only public field */
					wq_fifo:1,		/* fifo wakeup policy? */
					wq_prepost:1,	/* waitq supports prepost? set only */
					wq_eventmask:((sizeof(long) * 8) - 4); 
} WowFunHappyWaitQueue;

#define _WAIT_QUEUE_inited		0x2
#define wait_queue_is_valid(wq)	\
	(((wq)->wq_type & ~1) == _WAIT_QUEUE_inited)

void buzz(wait_queue_t wq) {
	printf("%d", wait_queue_is_valid((WowFunHappyWaitQueue*) wq));
}

** Note 2 important things. 1) This is technically undefined behavior I think, although if you were to use a union instead of a cast to do the type-punning then it might be standards compliant? But who cares about standards so long as it works (and casting should indeed work).

  1. Note that I only included the initial member of the struct here, and left out the other 2. This should be fine because IIRC type-punning is safe for any prefix of shared fields as well. **

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Feb 22, 2022

Unfortunately, I have been forced to revert back to a vanilla XNU kernel on my personal machine.

I woke up this morning and found that my Mac had been logged out of iMessage. When I tried to log back in, the app complained of an unknown error. After a lot of reboots, password changes, network changes, and other dead ends, I finally tried reverting back to stock XNU, and iMessage instantly began working again.

If anyone else is using one of the kernels I attached, and iMessage stops working for you, you know how to fix it. And for what it's worth, I experienced zero panics during my month with the custom kernel.

Stuff related to iMessage and custom kernels

I'm not entirely sure what happened, since I was using my custom kernel for nearly a month without problems. It may be that because I was already logged into iMessage when I switched kernels, the app kept working up until Apple sent some kind of heartbeat message. The imessage_debug utility reported failures when retrieving the values of Gq3489ugfi, Fyp98tpgj, kbjfrfpoJU, oycqAZloTNDm, and abKPld1EcMni on the custom kernel (and success with the stock kernel), but the software's authors neglected to include any documentation as to what these values mean, where they come from, or why they're important. 🤷‍♂️

Regardless, I personally value iMessage compatibility, and the system is enough of a crapshoot that I'd rather not mess with it and potentially get my account flagged. Perhaps I'll see if I can figure out a kext patch instead.

@krackers
Copy link

krackers commented Feb 22, 2022

That's interesting. Based on what I found online it seems cause is

Publicly-available XNU sources lack the necessary functions needed for generating the IOPower data (just look at IORootParent::initialize() in the kernel binary vs. what's in the source).

I assume that the hackintosh community has worked around this by just doing patching instead. There seems to be a robust patching framework they've built up (Lilu). Another alternative might have been to create a kext that does populate these values in the io registry, but I don't know why they didn't go that route. I'm not too familiar with the hackintosh scene so I don't know how they obtain these values in the first place if they don't have an existing mac to copy from (basedo n https://dortania.github.io/OpenCore-Post-Install/universal/iservices.html#fixing-en0 looks like they generate random SNs??)

I personally don't like the idea of using a monolithic framework to do the patching (and Lilu is bloated since it tries to handle both userspace and kernel patching). Looking at ([Onyx the black cat[(https://github.com/gdbinit/onyx-the-black-cat)) is a good reference since it contains some useful functions we might want (especially the write enable on kernel pages). Most of those patches are in-place binary patches though, whereas for a slightly more involved patch like this I'd usually want to insert a trampoline.

I tried to find any existing examples of kexts which have a trampoline insert that we can steal (there are many userspace ones like mach_override that we can port if really needed since the core logic is the same, but I'm lazy and don't want to think too much). I found

https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/inline-hook/inline_hook.c

https://github.com/seclab-ucr/SyzGen_setup/blob/5a86c59e3e26ba7c259fbc19de89197dc4edd674/kcov/kcov/hook.c

https://github.com/SilverMoonSecurity/PassiveFuzzFrameworkOSX/blob/be29a4df6e7b2394774a8cc3019bde03d816158c/src/kernel_fuzzer/the_flying_circus/InlineHook/inline_hook.c

(interestingly all same to be the same-ish file, but don't have any attribution...)

@krackers
Copy link

krackers commented Feb 22, 2022

Er actually doing a trampoline is going to be a bit annoying because we need to ensure the return value is actually used. We can either implement the trampoline'd function in assembly so we don't need to worry about C messing things up, try to fuss around with the proper syntax to ensure that wait_result gets mapped to the function param and the return value gets mapped to the register that contains error, or just forget about doing this properly and patch the default section to return EBADF directly and hope that this doesn't mask any other bug.

In terms of hacker spirit I still like the trampoline though, and if we set up the framework for that it will be easy to insert the logging I mentioned. After going through the candidates I like the approach taken by

https://github.com/seclab-ucr/SyzGen_setup/blob/5a86c59e3e26ba7c259fbc19de89197dc4edd674/kcov/kcov/hook.c

since it is the easiest for me to understand and flexible enough to coax into doing what we want.

@Wowfunhappy Either way (whether or not we opt to use trampoline) let's start off with creating a simple kext that just reads the contents of kqueue_scan_continue function address to make sure we are accessing the correct address. https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/solve_kernel_symbols/kernel_info.c has a nifty solve_kernel_symbol function but I don't think we can use it because kqueue_scan_continue is not exported function. Instead, we need to find that corresponding block in the disassembled binary to get the relative address (relative to kernel base) and then add the kaslr offset to it to get the absolute address in memory.

Once you have the relative address, can you create a basic "hello world" kext (https://craftware.xyz/tips/Building-kernel-extension.html) that uses the kaslr_slide function from https://github.com/lilang-wu/EasyBP/blob/d2d9591417df52b94d21945efb8cea393dc46a9b/makedebugpoint/solve_kernel_symbols/kernel_info.c to dump out some bytes from the absolute address to see if they match the assembly we expect? Once we have that then we can take the next steps...

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Feb 23, 2022

The Hackintosh community has some good reasons for avoiding custom kernels. Apple releases source code on a delay, and if you're running a non-ancient version of macOS, every update will potentially overwrite your custom kernel with a stock one, and leave you with a non-bootable system if you aren't careful. The more recent OSS releases also leave out really critical things like sleep and power management.

I'd thought that on an EOL OS, there was a certain purity to just fixing an obvious bug in an open-source component and recompiling—after all, what good is code that's never built? But it was never a great solution anyway, and it appears that even Apple's older code releases aren't as complete as I thought.

I am a Hackintosh user (on one of my two machines) and I'm already using Lilu, so it might make sense for me to go that route; it mostly depends on what I can actually manage to do. Downside: AIUI, Lilu has to be injected via a Hackintosh bootloader, because it needs to be loaded before the kexts it's patching. (Although, wait a minute, we're not trying to patch a kext, we're trying to patch the kernel—will this work at all? I know you said it's all the same address space, but then why does the loading order matter...?)

But, I also need to step back for a moment—I'm not sure what we're trying to accomplish here. Even without Lilu, I wouldn't feel comfortable recommending a patch like this to most Chromium Legacy users. The custom kernel had the advantage of being easy, I was done in a couple hours.

How useful do we think this logging data will be? Do we actually have a chance of fixing this properly in userspace?

Edit: I started writing this before your more recent replies/edits @krackers. Okay, if I attempt to keep going with this I'll follow from your starting point, thank you! I would still like to step back for a minute and consider whether this a smart avenue to keep exploring—what are we hoping to find out?

@krackers
Copy link

@Wowfunhappy

Yes, Lilu will work since it has support for patching kernel functions as well. But I don't like it here because I don't think using Lilu is going to give us any strong benefit that we could not have ourselves by copying a few header files for useful functions. And the Lilu only takes care of boilerplate, but we must still bring core logic. And that too we'd have to learn the Lilu API so the benefits of having boilerplate taking care of is negated. I think it's much simpler and more understandable to just make vanilla kext?

But, I also need to step back for a moment—I'm not sure what we're trying to accomplish here. Even without Lilu, I wouldn't feel comfortable recommending a patch like this to most Chromium Legacy users.

End result will be to create a kext that users can just place in /Library/Extensions or whatever to hotpatch kernel on the fly. Will only work for users with same kernel version that you based offsets on though, other kernel versions people will have to recompile kext with their own offset. It's also fun for learning, but yes it is a bit more work to do than just recompiling xnu. Up to you whether the pain of occasional kernel panic is worth spending a few weekends trying to create a kext though. Happy to provide guidance along the way. I edited previous repose #44 (comment) with some concrete starting steps if you've decided to take this route.

How useful do we think this logging data will be? Do we actually have a chance of fixing this properly in user-space?

To be honest, I don't know. If we can determine for sure that it is caused by a freed kqueue fd, then we need to take next step to figure out why the fd is being freed. This would involve adding some logging code in Chromium (maybe in destructor for messagepump?). If we verify that is indeed the cause, then next step would be to create some work around where we try to only free the kqueue after we ensure no thread is blocked waiting on one. We could try to make that change now itself but to me seems like the effort required to figure out how to implement this (would have to study exactly what type of event to send to interrupt existing kqueue wait) combined with effort to recompile chromium is much more than doing the kext.

Other option is if you can try to find commits involving messagepump we can try to see if one of them might be responsible, but this is more of a shot in the dark compared to the kernel patch which we know will work. If you're optimizing for time spent vs. reward then the kernel patch approach seems to be a surer bet.

@krackers
Copy link

krackers commented Feb 27, 2022

It's also possible that chromium is calling into a system framework/api which then makes use of kqueue in this manner. Can we run chrome under gdb or with dtrace to figure out where the kevent syscall is being made?

Tangential thing Completely unrelated, but if you like reading tales about old versions of chrome and patching things, did you know that newer version of chrome actually fixed (well more like worked-around) a coretext bug in 10.9? Under very rare and infrequent conditions (usually when rendering CJK text + emoji, but not deterministically) coretext shaper will get stuck when using AAT font, and just go in an infinite loop.

You can see in
https://github.com/harfbuzz/harfbuzz/blob/main/src/hb-coretext.cc#L187

that google actually discovered this issue and added a workaround to coretext in v58 or so. But this workaround is not perfect, and there are still conditions which cause this (seems to be a combination of emoji and kanji characters from what I've seen). I actually went a wild chase some time back trying to see if I could fix this myself by hooking and hotpatching that function to determine which parameters caused coretext to crash. I could have recompiled Chromium but figuring out how to build chromium seemed harder than just hotpatching. In fact even figuring out how to parse the crash dump format from non-debug builds (minidump format) was quite an exercise.

As you know, for security chrome divides the main browser process from the helper processes, which are spawned from a different binary. SIMBL works fine to inject into the main chrome process (surprisingly, Chrome is alright with the injection. I would have thought that they do crazy tricks to avoid process injection like in windows, but seems like not.), but since chrome helper does not load any cocoa libraries we cannot simbl inject into that. We can't add dylib to the helper binary because it is codesigned with CS_RESTRICT (and removing codesign here seemed like a bad idea) since there are some entitlements it needs I think, but helper binary also loads chrome framework dylib (manually via dlopen actually, not via usual dynamic linking), so we can actually modify that chrome framework without tripping any security issues. I tried this and it worked, but changing signature does prevent some keychain related things from working which I didn't like.

I finally realized that what I really want is to use mach_inject automatically for every helper process spawned. Newer versions of macforge do this, but for 10.9 mysimbl only does traditional osax injection.

(Btw, @Wowfunhappy I noticed many of your plugins usually done via insert_dylib approach, you should consider packaging as a simbl plugin instead since it's much easier to enable/disable and you don't break signatures).

Continue tangential thing So I thought some more and realized that to do `mach_inject` we only really a way to look for pid of spawned process. Turns out that chrome launches the helper via `posix_spawnp`, and we can hook this call via standard patching techniques. So in my mind I constructed this wacky chain where I'd simble inject into the main chrome process, then hotpatch to interpose `posix_spawnp`, then take the pid result from that call and call `mach_inject` to inject a blob into the helper, and the injected blob would finally hotpatch `harfbuzz_shape`.

_Of course after more thinking I realized this was absolutely insane and I just removed the problematic emoji that was triggering the crash from my webpage. _

EDIT: Fwiw I just realized a way to do this is via DYLD_INSERT_LIBRARIES, it automatically gets propagated to all spawned child processes, and does not break code signing. You can then use mach_override or any other plt hooking library.

Oh so back to how newer version of Chromium fixed it. Turns out that newer versions of harfbuzz don't call into CoreText at all to shape AAT fonts, so we sidestep this problem entirely. Very fortunate, and it probably also avoids a whole class of security vulnerabilities. Anyway, thank you for reading.

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Feb 27, 2022

@krackers I'm going to continue the tangential discussion a bit here to try to leave this issue as focused as possible.

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Feb 27, 2022

It's also possible that chromium is calling into a system framework/api which then makes use of kqueue in this manner.

The crash log is pretty explicit though:

BSD process name corresponding to current thread: Chromium Helper

If it was a system framework, wouldn't that be showing up instead?

I also took a quick look in Instruments, and can confirm that Chromium Helper is certainly making BSC_kevent64 syscalls. Whether these are the syscalls that sometimes lead to a panic, I have no idea.

Edit: All coming from one place:
Screen Shot 2022-02-27 at 10 46 30 AM

@krackers
Copy link

If it was a system framework, wouldn't that be showing up instead

I think even if call into system framework, it's still part of same chrome helper process?

But yes based on your trace seems like it is indeed chromium message pump. Can you try recompiling chromium add adding some logging in the destructor for messagepumpkqueue (assuming that is the class which holds the kqueue reference)? I assume that it's one messagepump per chrome helper process, which means it might not be too noisy to run it and see if object destroy is related to kernel panic trigger.

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Jun 18, 2022

Quick note that the kext doesn't work in 32-bit Lion. I briefly looked into adding support today, but from what I can tell, my method of finding the KASLR slide doesn't work, and I don't want to rewrite the entire thing.

@krackers
Copy link

krackers commented Jun 19, 2022

@Wowfunhappy Wouldn't you have to rewrite (or at least recompile) anyway because in 32-bit mode the mach-o magic is different and header struct layout probably also differs? The approach of scanning memory to look for 0xfeedface should still work though...

Btw I was surprised that 64-bit chromium worked on 32-bit xnu, but then I remember I watched a presentation on how 32-bit xnu had support for running 64-bit userspace processes, so maybe less surprising. I don't remember the presentation though, I'll try to find it since I forget exactly how they supported this.

Edit: Found it, it was a CCC talk: https://www.youtube.com/watch?v=-7GMHB3Plc8
Also found this article: https://appleinsider.com/articles/08/10/28/road_to_mac_os_x_snow_leopard_64_bit_to_the_kernel

@Wowfunhappy
Copy link
Sponsor Author

@krackers ...y'know what, that might be all it is, I forgot I was literally searching for a header called "MH_MAGIC_64" 🤦‍♂️. I'll take another look.

Recompiling isn't a problem, I can just lipo together two different binaries that technically have different code. (It does make the built cycle a bit more annoying unless I also take the time to automate that.)

blueboxd pushed a commit that referenced this issue Jun 22, 2022
Use a new tertiary background color, which is not the same as the old
one.

(cherry picked from commit f0fa977)

Fixed: 1324975
Change-Id: Ibb5d93d67c7546d7f46a8f04cc07feb052dd9956
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3648185
Auto-Submit: Stepan Khapugin <stkhapugin@chromium.org>
Commit-Queue: Stepan Khapugin <stkhapugin@chromium.org>
Reviewed-by: Robbie Gibson <rkgibson@google.com>
Cr-Original-Commit-Position: refs/heads/main@{#1003276}
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3650743
Reviewed-by: Stepan Khapugin <stkhapugin@chromium.org>
Cr-Commit-Position: refs/branch-heads/5060@{#44}
Cr-Branched-From: b83393d-refs/heads/main@{#1002911}
@Wowfunhappy
Copy link
Sponsor Author

The kext has been updated with compatibility for 32-bit Lion.

https://gist.github.com/Wowfunhappy/8212f5bea4c601ac9a6297789f232321

(The strategy for finding the kernel base address works fine in 32 bit, but is pointless because Lion lacks KASLR.)

blueboxd pushed a commit that referenced this issue Aug 31, 2022
When ContentSuggestionsUIModuleRefresh is enabled, the gradient
background view of the NTP should still be the visible color when
the feed is disabled.

(cherry picked from commit f6e4ba5)

Fixed: 1347108
Change-Id: I5588b9cff65b5f796dbebe5ce5a026f9ff10a89b
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3784533
Auto-Submit: Chris Lu <thegreenfrog@chromium.org>
Commit-Queue: Sergio Collazos <sczs@chromium.org>
Reviewed-by: Sergio Collazos <sczs@chromium.org>
Commit-Queue: Chris Lu <thegreenfrog@chromium.org>
Cr-Original-Commit-Position: refs/heads/main@{#1027824}
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3787570
Cr-Commit-Position: refs/branch-heads/5195@{#44}
Cr-Branched-From: 7aa3f07-refs/heads/main@{#1027018}
@fhgwright
Copy link

@Wowfunhappy

If I don't discover any new issues in the next week or so, I'm going to add this to my Preference Pane, so it gets installed automatically for anyone who uses that. I don't know if it makes sense to incorporate this into the Chromium Legacy app directly, but you are of course welcome to it.

Please don't. Kernel patches should never be installed without the user's express permission. Patching the kernel, either in an application, or in something which represents itself as an application updater, is a really extreme violation of the principle of least surprise.

While it's true that there's a kernel bug at work here, as a practical matter, coming up with a userspace workaround would be much more user-friendly, given that:

  1. AFAIK this has never been triggered by any application other than Chromium Legacy.
  2. It was never triggered by any supported release of Chrome.
  3. AIUI, the kernel fix still leads to an application crash, so an application-level fix is also needed, anyway.

Given item 3, the kernel patch on test systems could be useful in identifying the combination of syscalls that provokes the bug. And assuming that the offending code doesn't provide any new capability, there must be something in Chromium which changed from an "old way" to a "new way" of doing something. Once that difference can be figured out, the next questions are:

  1. How much of an advantage is there in the "new way"? I.e., how important is it to use the "new way" when it doesn't cause trouble, rather than reverting unconditionally?
  2. How wide is the scope of the difference? I.e., if there were a patch to revert to the "old way", how maintainable would such a patch be?

It would certainly be useful to also have a kernel fix for users who want a more robust kernel and are willing to install a kernel update, but it would be best to avoid making that a requirement for avoiding this bug. And a "proper" kernel update should involve building from source, rather than applying some kludgy on-the-fly binary patch. Some work would be needed to identify the correct sources to use as a starting point, though fortunately that needs to be done only once for any given abandoned OS version.

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Sep 18, 2022

@fhgwright

Please don't. Kernel patches should never be installed without the user's express permission. Patching the kernel, either in an application, or in something which represents itself as an application updater, is a really extreme violation of the principle of least surprise.

The preference pane does ask for permission!

Screen Shot 2022-09-17 at 8 34 57 PM

The installation is "automatic" in the sense that, if the user says yes, it's done in one click.

Let me know if you think the above message isn't clear enough. I am trying to be respectful of the user. However, I don't think leaving users with kernel panics is particularly respectful either!

While it's true that there's a kernel bug at work here, as a practical matter, coming up with a userspace workaround would be much more user-friendly

I 100% agree! At the moment, however, I don't think anyone working on Chromium Legacy has the time or expertise to come up with a userspace fix. This is the best that krackers and I can do. If you have the expertise to fix the problem in userspace, please contribute, that would be wonderful!

there must be something in Chromium which changed from an "old way" to a "new way" of doing something.

My current best guess (which could definitely still be wrong) is that it's https://bugs.chromium.org/p/chromium/issues/detail?id=932175

How much of an advantage is there in the "new way"? I.e., how important is it to use the "new way" when it doesn't cause trouble, rather than reverting unconditionally?

It sounds like it's not such a huge advantage. Most users are not exhausting file descriptors. However...

How wide is the scope of the difference? I.e., if there were a patch to revert to the "old way", how maintainable would such a patch be?

Wide enough that Bluebox wasn't able to do it. #44 (comment)

And a "proper" kernel update should involve building from source, rather than applying some kludgy on-the-fly binary patch.

Already done!

#44 (comment)

This kernel is for 10.9.5, but you could easily apply the fix to whichever kernel you'd like. It's literally a three line change.

The problem is that the open source release of XNU is incomplete. Most notably, custom kernels are unable to log into iMessage. This is why we created a kext to patch the memory at runtime instead.

@fhgwright
Copy link

@fhgwright

Please don't. Kernel patches should never be installed without the user's express permission. Patching the kernel, either in an application, or in something which represents itself as an application updater, is a really extreme violation of the principle of least surprise.

The preference pane does ask for permission!

Screen Shot 2022-09-17 at 8 34 57 PM

The installation is "automatic" in the sense that, if the user says yes, it's done in one click.

Let me know if you think the above message isn't clear enough. I am trying to be respectful of the user. However, I don't think leaving users with kernel panics is particularly respectful either!

I've never seen that dialog box (except here), nor is the kext present here, so I assumed that you hadn't yet implemented the preference-pane hack. But maybe the preference pane doesn't update itself.

It should let you know whether the patch is already present, and offer an uninstall option as well. But both goals could be mostly served by giving the full path of the extension in the text.

While it's true that there's a kernel bug at work here, as a practical matter, coming up with a userspace workaround would be much more user-friendly

I 100% agree! At the moment, however, I don't think anyone working on Chromium Legacy has the time or expertise to come up with a userspace fix. This is the best that krackers and I can do. If you have the expertise to fix the problem in userspace, please contribute, that would be wonderful!

there must be something in Chromium which changed from an "old way" to a "new way" of doing something.

My current best guess (which could definitely still be wrong) is that it's https://bugs.chromium.org/p/chromium/issues/detail?id=932175

How much of an advantage is there in the "new way"? I.e., how important is it to use the "new way" when it doesn't cause trouble, rather than reverting unconditionally?

It sounds like it's not such a huge advantage. Most users are not exhausting file descriptors. However...

How wide is the scope of the difference? I.e., if there were a patch to revert to the "old way", how maintainable would such a patch be?

Wide enough that Bluebox wasn't able to do it. #44 (comment)

But if it's true that the change involved moving away from using FDs for signaling due to FD exhaustion issues, then given that Chrome also runs on Linux, the "old way" must still be present (and fully supported) in the Linux build.

I doubt that I'd have trouble with FD exhaustion here anyway, since I crank up the limit for other reasons, and I'm not the sort to keep hundreds of browser tabs open simultaneously.

And a "proper" kernel update should involve building from source, rather than applying some kludgy on-the-fly binary patch.

Already done!

#44 (comment)

But that didn't match the release kernel, and the discrepancy was never investigated.

This kernel is for 10.9.5, but you could easily apply the fix to whichever kernel you'd like. It's literally a three line change.

The problem is that the open source release of XNU is incomplete. Most notably, custom kernels are unable to log into iMessage. This is why we created a kext to patch the memory at runtime instead.

Until a matching kernel can be built from source, I wouldn't blame the iMessage issue on custom kernels per se.

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Sep 21, 2022

I've never seen that dialog box (except here), nor is the kext present here, so I assumed that you hadn't yet implemented the preference-pane hack. But maybe the preference pane doesn't update itself.

Correct, it does not update itself.

It should let you know whether the patch is already present, and offer an uninstall option as well.

While this does not currently happen, uninstalling the PrefPane (via the provided uninstall script) will also uninstall the kext. In addition, the kext is installed in /Library/Extensions/, the standard location for third-party kernel extensions.

But that didn't match the release kernel, and the discrepancy was never investigated.

We did investigate. It does not match the release kernel because the release kernel is not open source. Which is why we went with a kernel extension instead.

Until a matching kernel can be built from source, I wouldn't blame the iMessage issue on custom kernels per se.

The open source version of XNU is explicitly missing a function required by iMessage. This was actually documented by the Hackintosh community, I just wasn't aware previously.

given that Chrome also runs on Linux, the "old way" must still be present (and fully supported) in the Linux build.

Yes, but the Linux build uses Linux-specific code that won't work on XNU, so it's harder than it seems. But again, by all means, please give it a shot. I don't know if it will work, but it might, and I'd love to see this fixed within Chromium Legacy itself. It's just beyond my own ability.

blueboxd pushed a commit that referenced this issue Sep 29, 2022
If Chrome is started and the FRE is not completed yet - tag crashes
happening in this session with "first_run": "yes". This is consistent
with iOS, where crashes from the first run are tagged with
"first-run": "yes". This information is helpful to analyze stability
of FRE-related launches and should give us more info about the flow
that led to a crash.

(cherry picked from commit da740b4)

Bug: 1354558
Change-Id: Ia37f322f4b01d9b4e4bef34765c221bba96cf5bc
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3841762
Commit-Queue: Boris Sazonov <bsazonov@chromium.org>
Reviewed-by: Sam Maier <smaier@chromium.org>
Cr-Original-Commit-Position: refs/heads/main@{#1037752}
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3851321
Auto-Submit: Boris Sazonov <bsazonov@chromium.org>
Commit-Queue: Sam Maier <smaier@chromium.org>
Cr-Commit-Position: refs/branch-heads/5249@{#44}
Cr-Branched-From: 4f7bea5-refs/heads/main@{#1036826}
blueboxd pushed a commit that referenced this issue Nov 30, 2022
Ensure GuestOsSessionTracker is started to capture all VM events.

(cherry picked from commit 0e1aa7b)

Bug: b/253838039
Change-Id: I31bbb94331347e795f22bfb2863139145ef729a2
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3958457
Commit-Queue: Xiyuan Xia <xiyuan@chromium.org>
Auto-Submit: Joel Hockey <joelhockey@chromium.org>
Reviewed-by: Xiyuan Xia <xiyuan@chromium.org>
Reviewed-by: David Munro <davidmunro@google.com>
Cr-Original-Commit-Position: refs/heads/main@{#1060011}
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3962228
Commit-Queue: Joel Hockey <joelhockey@chromium.org>
Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com>
Cr-Commit-Position: refs/branch-heads/5359@{#44}
Cr-Branched-From: 27d3765-refs/heads/main@{#1058933}
This was referenced Dec 29, 2022
@hoangbv15
Copy link

Not sure if this will be valuable for you but I reproduced this on 10.9.4 and here are the logs.
I was leaving https://piped.mha.fi on, intending to type something but haven't decided, so no interaction with the machine yet. And my Macbook Pro 2012 kernel paniced.

If it's not helpful please let me know and I will remove the comment.

Kernel_2023-01-14-184339_Hoangs-MacBook-Pro.panic.log

Anonymous UUID:       58645209-3F22-8B42-01B7-0C62519750C7

Sat Jan 14 18:43:39 2023
panic(cpu 0 caller 0xffffff80071c6bb4): "kqueue_scan_continue: - invalid wait_result (3)"@/SourceCache/xnu/xnu-2422.110.17/bsd/kern/kern_event.c:2167
Backtrace (CPU 0), Frame : Return Address
0xffffff81f2053ef0 : 0xffffff8006e22f79 
0xffffff81f2053f70 : 0xffffff80071c6bb4 
0xffffff81f2053fb0 : 0xffffff8006ed7417 

BSD process name corresponding to current thread: Chromium Helper 
Boot args: kext-dev-mode=1

Mac OS version:
13E28

Kernel version:
Darwin Kernel Version 13.3.0: Tue Jun  3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64
Kernel UUID: BBFADD17-672B-35A2-9B7F-E4B12213E4B8
Kernel slide:     0x0000000006c00000
Kernel text base: 0xffffff8006e00000
System model name: MacBookPro9,1 (Mac-4B7AC7E43945597E)

System uptime in nanoseconds: 4066641044880
last loaded kext at 3901636049533: com.apple.driver.AppleIntelMCEReporter	104 (addr 0xffffff7f8924c000, size 49152)
last unloaded kext at 3966662167467: com.apple.driver.AppleIntelMCEReporter	104 (addr 0xffffff7f8924c000, size 32768)
loaded kexts:
...

@krackers
Copy link

krackers commented Jan 14, 2023

Just to confirm, is this with Wowfunhappy's kext installed? That should that should hot-patch the kernel to avoid these panics (instead it will "merely" crash just the chromium renderer instead of bringing down your entire system).

Also fyi 10.9.4 is not the latest version of mavericks, you should probably be on 10.9.5.

@Wowfunhappy
Copy link
Sponsor Author

Wowfunhappy commented Jan 14, 2023

Just to confirm, is this with Wowfunhappy's kext installed? That should hot-patch the kernel to avoid these panics

Not if they're on 10.9.4 it won't! I'd need to add memory addresses for that kernel.

Is there a reason this system is on 10.9.4 instead of 10.9.5? If there's something "special" about 10.9.4 in particular, I could add support for it.

@hoangbv15
Copy link

hoangbv15 commented Jan 15, 2023 via email

@Wowfunhappy
Copy link
Sponsor Author

I was on 10.9.4 because I never realised that it wasn't the latest Mavericks 😂😂😂 I kind a skimmed through the end of the issue and didn't spot the workaround, I'll update to 10.9.5 and try it, will report back after some testing.

Quick heads up that Apple also released further updates after the base 10.9.5 which don't change the version number, you want 10.9.5 build 13F1911. I'm pretty sure the updater will prompt you, but if not, install: https://support.apple.com/kb/dl1886?locale=en_US

@hoangbv15
Copy link

hoangbv15 commented Jan 16, 2023 via email

blueboxd pushed a commit that referenced this issue Feb 8, 2023
Change-Id: Ib3ae438e517a34bbc01ee7fd3bf7090a75d1c2e8
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4121693
Bot-Commit: Chrome Release Bot (LUCI) <chrome-official-brancher@chops-service-accounts.iam.gserviceaccount.com>
Cr-Commit-Position: refs/branch-heads/5481@{#44}
Cr-Branched-From: 130f3e4-refs/heads/main@{#1084008}
blueboxd pushed a commit that referenced this issue Mar 7, 2023
The existing privacySandboxEnabled Chrome Extension API is being
depreciated in favour of three new k-APIs - topicsEnabled, fledgeEnabled
and adMeasurementEnabled in M111.

This CL enables the existing privacySandboxEnabled API to be respected
and compatible with the new k-APIs during the migration period.
If an extension has disabled the privacySandboxEnabled API, then the
new k-APIs must also be disabled.
If an extension clears the privacySandboxEnabled API, then the new
k-APIs are also cleared.

See a screen recording of an extension disabling the privacySandbox API:
http://dr/file/d/1MoLhOVmmh4bqW18lkdDctQRmsTo-Hjdy/view
Test extension for Privacy Sandbox Extension APIs used:
http://dr/file/d/1Wlk3TzECpU7j_23H9UUZxUdSY2UDTMQc/view
Experiment #privacy-sandbox-settings-4 was enabled.

(cherry picked from commit 5fc51e7)

Bug: 1378703, b/254414152
Change-Id: Id5e711149e85ceb84217e2b59d6b2ebaffef8997
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4133664
Commit-Queue: Filipa Senra <fsenra@google.com>
Reviewed-by: Devlin Cronin <rdevlin.cronin@chromium.org>
Cr-Original-Commit-Position: refs/heads/main@{#1097899}
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4202368
Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com>
Cr-Commit-Position: refs/branch-heads/5563@{#44}
Cr-Branched-From: 3ac59a6-refs/heads/main@{#1097615}
blueboxd pushed a commit that referenced this issue Apr 14, 2023
Seems like the start and end offsets for ime text spans are being
incorrectly computed for Japanese special characters, causing a crash
when trying to access char16_offsets[end].

Need to investigate further, but in the meantime, add a check for valid
start and end offsets and ignore the composition text span if the
offsets are out of bounds.

(cherry picked from commit 01bc7f6)

Bug: b/270390666
Change-Id: I47d531c1f2155a8b01394d4d455c474afadc7046
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4289048
Commit-Queue: Michelle Chen <michellegc@google.com>
Reviewed-by: Darren Shen <shend@chromium.org>
Cr-Original-Commit-Position: refs/heads/main@{#1109496}
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4290836
Cr-Commit-Position: refs/branch-heads/5615@{#44}
Cr-Branched-From: 9c6408e-refs/heads/main@{#1109224}
blueboxd pushed a commit that referenced this issue Apr 25, 2023
(cherry picked from commit 20d8b7a)

(cherry picked from commit 13c7be2)

Bug: 1417955
Change-Id: I85d6a3f50491f276d0fce9dad50850b8a82e3499
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4367804
Commit-Queue: Daniel Murphy <dmurph@chromium.org>
Reviewed-by: Daniel Murphy <dmurph@chromium.org>
Reviewed-by: Dibyajyoti Pal <dibyapal@chromium.org>
Cr-Original-Original-Commit-Position: refs/heads/main@{#1121969}
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4374061
Commit-Queue: Dibyajyoti Pal <dibyapal@chromium.org>
Cr-Original-Commit-Position: refs/branch-heads/5672@{#44}
Cr-Original-Branched-From: 5f2a724-refs/heads/main@{#1121455}
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4416085
Auto-Submit: Daniel Murphy <dmurph@chromium.org>
Cr-Commit-Position: refs/branch-heads/5615@{#1226}
Cr-Branched-From: 9c6408e-refs/heads/main@{#1109224}
blueboxd pushed a commit that referenced this issue May 9, 2023
(cherry picked from commit 20d8b7a)

Bug: 1417955
Change-Id: I85d6a3f50491f276d0fce9dad50850b8a82e3499
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4367804
Commit-Queue: Daniel Murphy <dmurph@chromium.org>
Reviewed-by: Daniel Murphy <dmurph@chromium.org>
Reviewed-by: Dibyajyoti Pal <dibyapal@chromium.org>
Cr-Original-Commit-Position: refs/heads/main@{#1121969}
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4374061
Commit-Queue: Dibyajyoti Pal <dibyapal@chromium.org>
Cr-Commit-Position: refs/branch-heads/5672@{#44}
Cr-Branched-From: 5f2a724-refs/heads/main@{#1121455}
blueboxd pushed a commit that referenced this issue May 27, 2023
This CL adds metrics to track scrolling actions in grid views:
- events when the user scrolls;
- cumulative time spent scrolling.

(cherry picked from commit 63e25d1)

Fixed: 1439234
Change-Id: Id5e5c3bb0739359ddd6670f4de10789c3c25ca35
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4466313
Auto-Submit: Louis Romero <lpromero@google.com>
Reviewed-by: Aliona Dangla <alionadangla@chromium.org>
Reviewed-by: Gauthier Ambard <gambard@chromium.org>
Commit-Queue: Louis Romero <lpromero@google.com>
Reviewed-by: Olivier Robin <olivierrobin@chromium.org>
Cr-Original-Commit-Position: refs/heads/main@{#1136168}
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4484223
Commit-Queue: Gauthier Ambard <gambard@chromium.org>
Cr-Commit-Position: refs/branch-heads/5735@{#44}
Cr-Branched-From: 2f562e4-refs/heads/main@{#1135570}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants