Kernel hang with CONFIG_DEBUG_NMI_SELFTEST (Alpine Linux) #797
Comments
It's any kernel built with |
Just out of curiosity can you share the latest version of alpine compiled with |
As far as I can tell the last version that currently works is 3.5.3 which is super old (slightly newer versions seem to have a different problem, then all versions start to have this issue). How much work will it be to implement this so versions of Alpine Linux from within the last 5 to 6 years can run in v86? I think the alternative is compiling a custom kernel while running Alpine in QEMU and that seems like a lengthy process that requires a very large image size. It would be really nice to have a useful OS to run in v86. @copy has suggested using Alpine, but how are we supposed to do that? |
The Arch Linux profile on copy.sh/v86 is usable to some extent
|
@spetterman66, thank you for your suggestion. Arch 32-bit works well enough but it lacks current versions of most popular developer tooling packages (at least that I checked) like Node and NPM. Then it runs into various issues when trying to install popular NPM-based tooling like CRA. I tried creating a custom image of Arch with QEMU (so I could try building Node from source) but got stuck at the keyboard support part. I think v86 is fast enough that it could be used for some pretty cool things but not without a useful OS (for a developer). Sorry if this is getting off topic but I am more than willing to help out in the effort. I just need to know where to best start. |
@edwillard In what sense does this bug prevent you from using Alpine on v86? From what I can tell, it doesn't cause any problems besides a 5 second delay in the boot process. Besides that, Alpine seems to work fine. |
When I create an image using the latest Alpine Virtual 3.19.1 it outputs a message similar to "watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:1]" from the OP over and over with the seconds number growing each time and never gets to anything usable. Maybe I missed something? @spetterman66 also seemed to confirm that only old versions work in an earlier message... |
To be thorough I also tried Alpine Standard 3.19.1 and it has the same issue. I can successfully use Alpine versions 3.5.3 and older, quickly and without any problem. Current versions never get past the soft lockup messages even if left to run for half an hour. |
Are you using http://copy.sh? Try https://copy.sh or localhost. Timer resolution is reduced on non-https (see https://developer.mozilla.org/en-US/docs/Web/API/Performance/now#security_requirements) |
I use localhost but have tried https://copy.sh/ (never http) and it has the same issue in both cases. Although your point about timer resolution made me consider that it could be a Firefox (included by default on my fresh install of Ubuntu) issue or something related to Firefox settings so I installed Chrome and it does indeed work in Chrome. It would be nice to get it working in Firefox too if anyone has any ideas. I can provide an update here or open a new issue if I find anything relevant. Thank you very much for your support and excellent work on v86. Edit: Just to note that Alpine versions 3.5.3 and below do work with Firefox regardless. |
To be thorough I tried on localhost with a fresh install of Firefox Developer Edition and it has the same issue. It seems like it must be something from the combination of this bug and Firefox. I would like to test on Safari as well when I have a chance and will update on the results if no one else gets to that sooner. |
I was able to get on a Mac sooner than expected and it works in Safari and Chrome but not in Firefox. Not sure if I can get Windows set up and test there too, but my guess would be that Chrome works but not Firefox. Happy to help with Firefox however I can; just let me know. |
You can run this code to determine the resolution of performance.now(): On Firefox (developer edition), I get 1ms on http sites and 0.02ms on https (with COOP/COEP). The NMI test suite calls On Chromium, the resolution is higher (0.1ms and 0.005ms respectively). Maybe your Firefox has a different resolution or doesn't accept the COOP/COEP headers on https://copy.sh? Now regarding fixes, there are a couple of options:
Or, since this is not really v86's fault:
|
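The snippet referred to in the comment above did not survive in this copy of the thread. As a rough stand-in (not the original code), the observable step of performance.now() can be estimated by spinning until the clock value changes and recording the smallest jump. The function name measureTimerResolution and its parameters are made up for this sketch.

```javascript
// Hypothetical helper, not taken from v86: estimates the smallest
// observable step of a clock by busy-waiting until its value changes.
function measureTimerResolution(now = () => performance.now(), samples = 50) {
    let smallest = Infinity;
    let last = now();
    for (let i = 0; i < samples; i++) {
        let current = now();
        // Spin until the clock visibly advances
        while (current === last) {
            current = now();
        }
        smallest = Math.min(smallest, current - last);
        last = current;
    }
    return smallest; // same unit as the clock (milliseconds for performance.now())
}

// Usage: paste into the browser console on the page that runs v86
if (typeof performance !== "undefined") {
    console.log("approx. resolution (ms):", measureTimerResolution());
    // true only when the page was served with working COOP/COEP headers
    console.log("crossOriginIsolated:", globalThis.crossOriginIsolated);
}
```

With a coarse clock (for example around 17ms) the default 50 samples take roughly a second, since each sample waits for one clock tick.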
@copy, thank you so much for your detailed investigation, this is great and very helpful. I found that running the provided code to determine the timer resolution in my standard Firefox resulted in almost 17ms regardless of localhost/HTTP/HTTPS! Based on your example that might be expected to take 4 to 5 hours instead of 16 to 17 minutes! I thought the almost 17ms value suspiciously matches up nearly perfectly with the "16.67ms frame budget" so it has probably been tuned with that in mind. After finding that Firefox Developer Edition behaves as you stated (1ms and 0.02ms) I figured out that setting the config value (by going to about:config) for privacy.resistFingerprinting to true (default false) was responsible for the ~16.67ms value (note that the browser needs to be restarted after changing this for the timer resolution to be changed). With privacy.resistFingerprinting set to false I see a timer resolution of around 0.02ms on https://copy.sh/v86/ but other HTTPS sites and localhost are still 1ms which I suppose is due to the lack of COOP/COEP headers but have not looked into that deeper. I did find that setting the config value for privacy.reduceTimerPrecision to false (default true) makes everything (localhost/HTTP/HTTPS) 0.02ms and also that setting the config value for privacy.reduceTimerPrecision.unconditional to false (default true) in addition to privacy.reduceTimerPrecision makes everything below 0.001ms, often below 0.0005ms (note that the browser does not need to be restarted after changing these for the timer resolution to be changed). 
Non-default config settings are difficult to require of users so I think the quickest workaround to the issue for Firefox support is to either wait about 17 minutes for Alpine (or any other distro that has the same problem) to finish booting (assuming privacy.resistFingerprinting set to false), or change the config settings in Firefox, or just load it in Chrome, then save the state and simply load that which nicely bypasses the issue in Firefox (even if the user has privacy.resistFingerprinting set to true). I plan on digging into everything you wrote more (if only for my own education) and will update with any additional results. Thanks again for your help, I really appreciate it. Hopefully this helps any other Firefox users that are having the same trouble. Edit: Just to note that I think the reasoning for the timer resolution decisions in Firefox has to do with Spectre and Meltdown mitigations so it might be undesirable to set privacy.reduceTimerPrecision and privacy.reduceTimerPrecision.unconditional to false in general unless someone more knowledgeable says otherwise. |
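The "4 to 5 hours" estimate above can be reproduced with a short calculation. This is only a sketch under the assumption, implied by the comment, that the delay scales linearly with the performance.now() step; the function name scaledDelayMinutes is invented for illustration.

```javascript
// Assumption: the boot delay grows linearly with the timer step.
// scaledDelayMinutes is a hypothetical helper, not part of v86.
function scaledDelayMinutes(baselineMinutes, baselineResolutionMs, actualResolutionMs) {
    return baselineMinutes * (actualResolutionMs / baselineResolutionMs);
}

const frameMs = 1000 / 60; // the ~16.67ms step seen with privacy.resistFingerprinting
for (const minutes of [16, 17]) {
    const hours = scaledDelayMinutes(minutes, 1, frameMs) / 60;
    console.log(`${minutes} min at 1ms -> about ${hours.toFixed(1)} h at ${frameMs.toFixed(2)}ms`);
}
```

Under that assumption, 16 to 17 minutes at a 1ms step becomes roughly 4.4 to 4.7 hours at a 16.67ms step, which is consistent with the range quoted in the comment.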
I found that the timer resolution can have a large impact on overall performance. Taking the example of running |
It's used in roughly three places:
To narrow it down, I'd suggest:
|
I recompiled Alpine Linux edge kernel without Notes:
|
@edwillard I don't have a fix for the nmi selftest yet, but I pushed a fix for slow disk IO into the wip branch (dfacd61). Please test and let me know if this fixes (some of) your performance issues. |
I pushed a fix for the nmi selftest (especially with |
I pushed a better fix in e644f89 (in the wip branch) |
Did you enable SB16 support when compiling this? My installation of Alpine doesn't seem to detect it so just wondering if installing this kernel would help with that |
Can you try |
I don’t seem to have a module called snd-sb16. I have snd-sb-common and snd-sb16-dsp, but no snd-sb16. Installing linux-firmware-sb16 doesn’t add the missing module, either. |
Sorry, I confused sb16 with sb16-dsp |
That’s okay. I’ve created an issue on the GitLab page for Alpine requesting that the CONFIG_SND_SB16 flag be enabled on linux-lts for x86 Alpine Linux v3.20, which would include the module in all future linux-lts package releases. As long as the issue is deemed worthy of a fix by the developers, sound on Alpine might soon be fixed for everyone :) |
Great news! linux-virt should now have the necessary network and sound drivers: https://gitlab.alpinelinux.org/alpine/aports/-/commit/af54cf99c3ace522d51b0aac1681e6edfbb56098 |
When booting Alpine Linux in v86 (default kernel compiled with CONFIG_DEBUG_NMI_SELFTEST=y), the kernel soft-locks during nmi_selftest:
The test is supposed to time out, but occasionally the watchdog shows that it's hitting udelay:
In v86, IOAPIC_DELIVERY_NMI is stubbed (v86/src/apic.js, lines 436 to 440 in 17a6b3b):
So it seems like a layered issue:
IOAPIC_DELIVERY_NMI should probably be implemented (at least trivially)