
Add yield() to spinlock #1127

Merged: 1 commit merged into master from kk/delayprecise-yield-1 on Jul 9, 2021
Conversation

@kklobe (Collaborator) commented Jul 6, 2021

A fair amount of time is spent in the spinlock on systems that use DelayPrecise, so add the yield hint for systems that can take advantage of it. Example: https://en.cppreference.com/w/cpp/thread/yield.

The yield was causing problems on Win32, so I had removed it initially, but after adding CanDelayPrecise(), which effectively bypasses Win32, we can add it back in and test.

This probably won't make much difference on many-core systems, but since spinlocking is relatively evil to begin with, the yield makes it a better-behaved citizen, especially on lower-core systems.
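For reference, the shape of the change is roughly the following (a minimal sketch, not the exact code in this PR):

```cpp
// Minimal sketch of a busy-wait with the yield hint (illustrative;
// not the exact PR code).
#include <chrono>
#include <thread>

void busy_wait_until(const std::chrono::steady_clock::time_point deadline)
{
	while (std::chrono::steady_clock::now() < deadline) {
		// Hint to the implementation to reschedule, letting other
		// threads run; effectively a no-op where unsupported.
		std::this_thread::yield();
	}
}
```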

I tested the number of loop iterations in the spinlock (M1 MacBook):

without yield():
spinlock loop iterations: 862
spinlock loop iterations: 6233
spinlock loop iterations: 872
spinlock loop iterations: 1386
spinlock loop iterations: 3868
spinlock loop iterations: 586
spinlock loop iterations: 467
spinlock loop iterations: 5286
spinlock loop iterations: 1158
spinlock loop iterations: 4391


with yield():
spinlock loop iterations: 866
spinlock loop iterations: 881
spinlock loop iterations: 984
spinlock loop iterations: 872
spinlock loop iterations: 876
spinlock loop iterations: 798
spinlock loop iterations: 924
spinlock loop iterations: 862
spinlock loop iterations: 904
spinlock loop iterations: 803

@dreamer (Member) commented Jul 6, 2021

Wait, C++ STL uses spinlock?! That's… weird to say the least.

We need comparison tests on Linux and Windows as well.

As for now: fix the formatting, please.

@kcgen (Member) left a comment

Will test this on the Pi and x86 Linux systems set to limited speeds. Thanks, @kklobe.

@kklobe (Collaborator, Author) commented Jul 6, 2021

> Wait, C++ STL uses spinlock?! That's… weird to say the least.

The DelayPrecise function I added spin-waits for the remaining fraction of a millisecond that std::this_thread::sleep_for isn't accurate enough to hit exactly.

@dreamer (Member) left a comment

Reminder about squashing commits and writing a good commit message. :)

@kklobe (Collaborator, Author) commented Jul 6, 2021

> Reminder about squashing commits and writing a good commit message. :)

I thought that was a pretty comprehensive commit message; please advise on desired changes :)

A fair amount of time is spent in the spinlock on systems that use DelayPrecise, so add the yield() hint for systems that can take advantage of it. Example: https://en.cppreference.com/w/cpp/thread/yield
@dreamer (Member) commented Jul 6, 2021

Ehh… I had no idea we added spinlocks to our code… @kklobe spinlocks in userspace code are evil, I don't care what somebody wrote on some random blog. We need to find a way to avoid it - preferably using platform-specific APIs. But in my opinion we should stick to what's provided to us by SDL.

@kklobe (Collaborator, Author) commented Jul 6, 2021

> Ehh… I had no idea we added spinlocks to our code… @kklobe spinlocks in userspace code are evil, I don't care what somebody wrote on some random blog. We need to find a way to avoid it - preferably using platform-specific APIs. But in my opinion we should stick to what's provided to us by SDL.

The change increased framerate and also reduced CPU usage by ~50% on at least my M1 Mac (#1084).

I am happy to revert those changes if they are unwanted.

EDIT: I should clarify, I'm probably using the term spinlock incorrectly. I am not attempting to acquire a lock at all, I am simply busy-waiting for a very short (on average 50-100us) period of time.

@dreamer (Member) commented Jul 6, 2021

@kklobe we need to make sure your increased framerate on M1 Mac did not break e.g. GUS or SB emulation on Linux or Windows (or other things)… Generally: spinlocks should never be used in userspace, because they create an indirect, hidden dependency on the behaviour of the OS kernel. Maybe today you get increased framerate, but then an OS update happens and you get a regression instead.

Basically we need to use proper APIs for sleeping, like POSIX nanosleep - this way we cooperate with OS kernel rather than fight against it.
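For illustration, a cooperative sleep along these lines might look like the following sketch (illustrative only; not code from this PR, and the helper name is an assumption):

```cpp
// Sketch of the cooperative POSIX approach suggested above.
// nanosleep() is POSIX, declared in <time.h>.
#include <errno.h>
#include <time.h>

static void sleep_us(long usec)
{
	timespec req = {};
	req.tv_sec  = usec / 1000000;
	req.tv_nsec = (usec % 1000000) * 1000;

	timespec rem = {};
	// If a signal interrupts the sleep, retry with the remainder.
	while (nanosleep(&req, &rem) == -1 && errno == EINTR)
		req = rem;
}
```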

@kklobe (Collaborator, Author) commented Jul 6, 2021

> @kklobe we need to make sure your increased framerate on M1 Mac did not break e.g. GUS or SB emulation on Linux or Windows (or other things)… Generally: spinlocks should never be used in userspace, because they create an indirect, hidden dependency on the behaviour of the OS kernel. Maybe today you get increased framerate, but then an OS update happens and you get a regression instead.
>
> Basically we need to use proper APIs for sleeping, like POSIX nanosleep - this way we cooperate with OS kernel rather than fight against it.

That is why I switched the base code to sleep_for, which I saw translate down to nanosleep on the two Linux systems (Fedora 33 and 34, arm64 and x64), but it still wasn't precise enough to give an accurate 1ms sleep; hence the algorithm in DelayPrecise, which busy-waits for the exact remainder of the millisecond duration after doing a number of sleep_for calls to get close.

EDIT: if, on the other hand, we don't want to trust std::this_thread::sleep_for as a cross-platform Thread.Sleep(), that's a different discussion I'm also happy to have.

@kcgen (Member) commented Jul 6, 2021

PR 1084 was a net benefit based on actual measurements: holding performance at par or better on all supported platforms, while simultaneously improving the 1ms timing delay without the typical 10% overage caused by SDL's coarse delay timer.

Chrono is vastly superior to SDL when it comes to timing: it's type- and unit-safe, offers finer resolutions, and is directly supported by the language. Compiler implementations continue to improve it (as has happened for Chrono from C++11 to C++14).

If the code was an unmaintainable mess, or had functional corner cases or performance regressions, then I would fully agree we should hold to SDL.

If we can refine it further with more OS-friendly mechanisms like nanosleep, then I'm all for that!

@kklobe (Collaborator, Author) commented Jul 6, 2021

I'm not sure why GitHub isn't letting me edit my post -

We did extensive function and performance testing to ensure we were in even better shape than before - but if we can refine it further with more OS-friendly mechanisms like nanosleep, then I'm all for that.

Just a reminder: std::this_thread::sleep_for does punch through to e.g. nanosleep on the *nix systems I tested, which was the main reason I did the first timer.h refactor to use the STL instead of SDL, so we'd be using a standard cross-platform sleep mechanism.

@kklobe (Collaborator, Author) commented Jul 6, 2021

Let me back up for a minute and refresh the context for these changes.

The first change I looked at was replacing the SDL-specific GetTicks and Delay functions with equivalent STL ones. I used std::this_thread::sleep_for as a direct replacement for Delay. There isn't a direct GetTicks replacement in std::chrono that I could see, so I used std::chrono::steady_clock::now().time_since_epoch() as the closest match.
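In code, that replacement looks something like this (a sketch; the actual helper names in timer.h may differ):

```cpp
#include <chrono>
#include <cstdint>

// Closest STL stand-in for SDL_GetTicks(): milliseconds since the
// (arbitrary) epoch of the monotonic clock.
static int64_t get_ticks_ms()
{
	using namespace std::chrono;
	return duration_cast<milliseconds>(
	               steady_clock::now().time_since_epoch())
	        .count();
}
```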

After that was done, @kcgen and I were discussing the potential benefits of increaseticks operating with a precise 1ms delay, instead of the measured actual delay of ~1.2ms. After testing on all three major platforms, sleep_for was not able to give us an exact 1ms sleep duration, as might be expected. Even on systems where sleep_for uses nanosleep, the accuracy was +/- 50%, as I would request sleep_for(100us) and the measured sleep was between 50-150us. It's even worse on Windows, since sleep_for just uses Thread.Sleep, so sleep_for(100us) took ~1200us.

So the algorithm in DelayPrecise estimates how accurate the sleep_for(100us) is and offsets the difference with the busy-wait you see. The busy-wait isn't acquiring any locks; it simply waits until the remaining leftover of the millisecond duration from the sleep_for calls is exactly complete, by testing std::chrono::steady_clock::now() via GetTicksUsSince().

I added the quick heuristic CanDelayPrecise() rather than hard-coding #ifndef WIN32, trying to stick to generic STL functions, and if Visual Studio does at some point update their STL to have a finer-grained sleep_for, we'll be ready.
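Put together, the approach described above is roughly the following (a sketch only; the slice size, margin, and names are assumptions, and the real DelayPrecise also calibrates the measured cost of sleep_for(100us) and gates the busy-wait behind CanDelayPrecise()):

```cpp
#include <chrono>
#include <thread>

// Sketch of the DelayPrecise idea: coarse cooperative sleeping,
// then a short busy-wait to land exactly on the deadline.
static void delay_precise(const std::chrono::milliseconds requested)
{
	using namespace std::chrono;
	const auto deadline = steady_clock::now() + requested;

	// Coarse phase: sleep in ~100us slices while safely short of the
	// deadline, leaving the inaccurate tail to the busy-wait below.
	while (steady_clock::now() + microseconds(200) < deadline)
		std::this_thread::sleep_for(microseconds(100));

	// Fine phase: busy-wait (with the yield hint) for the remainder.
	while (steady_clock::now() < deadline)
		std::this_thread::yield();
}
```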

All of this was done while carefully measuring CPU usage and performance on Windows, Linux (Fedora 33/34 for arm64 and x64), and macOS (M1).

@kcgen added the 'cleanup' label (Non-functional changes that simplify, improve maintainability, or squash warnings) on Jul 7, 2021
@kcgen (Member) commented Jul 8, 2021

Quake timedemo at 120k cycles running on an 800 MHz Linux x86-64 PC

Both sets of runs produced identical FPS (39.8, all 10 runs)

Branch w/ yield:

wall  user system
35.132 28.959 0.914
35.118 27.981 0.975
35.106 28.341 0.940
35.129 28.701 0.982
35.149 28.452 0.986
-----------------
35.126 28.486 0.959  <- column averages

Main branch w/ spinlock:

wall  user system
35.175 29.246 1.029
35.134 28.013 1.036
35.129 28.606 0.951
35.151 28.839 1.049
35.126 28.420 1.033
-----------------
35.143 28.624 1.019

Branch delivers the same runtime performance with ~0.34% lower host-side cost.

@kcgen self-requested a review on July 8, 2021 01:17
@kcgen (Member) left a comment

@kklobe , thanks for confirming that yield() boils down to platform-specific cooperative calls (like nanosleep) that @dreamer mentioned.

Benchmarks on my side are all good: at-par or slightly lower CPU overhead to deliver the same performance.

Formatting and commit message are good.

I think this is OK to merge. @dreamer, what do you think?

Will plan to give it another day or so and then go ahead.

@kcgen (Member) commented Jul 9, 2021

Thanks for this improvement, @kklobe!

@kcgen merged commit a85b39f into master on Jul 9, 2021
@dreamer (Member) commented Jul 9, 2021

@kklobe Thanks! Next time, please follow this guideline when it comes to commit messages: https://chris.beams.io/posts/git-commit/

@kcgen gentle reminder that reviewing the commit message is part of the review

Aside from that - my previous worries about spinlock were a bit overblown - this code does not use a spinlock as a thread-synchronization mechanism after all - it's more akin to active waiting.

We should still find a good solution using a proper high-resolution timer and an appropriate API (maybe clock_nanosleep with the CLOCK_REALTIME flag, I don't know). Slapping yield back and forth seems to me like cargo-cult programming, where the changes introduced are not fully understood. At least I do not understand why yield changed something in this case and what the repercussions will be. What if by introducing yield we're increasing the likelihood that the kernel will unbind our process from a particular core? I am seriously worried that this active-waiting code introduced some regressions in time-sensitive code (e.g. in PC speaker emulation or maybe VGA emulation) and the changes were simply not tested well enough for regressions.

@kklobe (Collaborator, Author) commented Jul 9, 2021

> @kklobe Thanks! Next time, please follow this guideline when it comes to commit messages: https://chris.beams.io/posts/git-commit/

I would appreciate specific feedback on the commit message, because I thought I followed the guidelines. I believe the subject line meets #1-#5, and I described what I did and why in the body. I'm interested in good messages and would like to know how I could have made that better.

> @kcgen gentle reminder that reviewing the commit message is part of the review
>
> Aside from that - my previous worries about spinlock were a bit overblown - this code does not use a spinlock as a thread-synchronization mechanism after all - it's more akin to active waiting.
>
> We should still find a good solution using a proper high-resolution timer and an appropriate API (maybe clock_nanosleep with the CLOCK_REALTIME flag, I don't know). Slapping yield back and forth seems to me like cargo-cult programming, where the changes introduced are not fully understood. At least I do not understand why yield changed something in this case and what the repercussions will be. What if by introducing yield we're increasing the likelihood that the kernel will unbind our process from a particular core? I am seriously worried that this active-waiting code introduced some regressions in time-sensitive code (e.g. in PC speaker emulation or maybe VGA emulation) and the changes were simply not tested well enough for regressions.

These are legitimate concerns, and this was me correcting an oversight from my original changes. In general (and especially in embedded/firmware, which is most of my dev work), you never want to "busy wait" without giving the CPU a chance to do something else. For example, on a Cortex M4 you use WFE or WFI to avoid eating 100% CPU while you wait, so this wasn't some cargo-cult change I slapped in for no reason, but was based on a reasonable amount of past experience.

From the cppreference link I included in the original commit, this type of situation is one big reason why yield() was added to the standard, so I had no reason to distrust it, but I do also like that this is a single commit, so we can bisect if we do run into issues.

Thank you - I appreciate the detailed feedback!

@dreamer (Member) commented Jul 9, 2021

> > @kklobe Thanks! Next time, please follow this guideline when it comes to commit messages: https://chris.beams.io/posts/git-commit/
>
> I would appreciate specific feedback on the commit message, because I thought I followed the guidelines. I believe the subject line meets #1-#5, and I described what I did and why in the body. I'm interested in good messages and would like to know how I could have made that better.

It's rule 6 - wrapping your commit message body at 72 columns. When you're writing your commit message in vim, the editor will wrap the lines automatically for you :) Avoid using the GitHub interface for commit message editing.

@dreamer (Member) commented Jul 9, 2021

> > @kcgen gentle reminder that reviewing the commit message is part of the review
> > Aside from that - my previous worries about spinlock were a bit overblown - this code does not use a spinlock as a thread-synchronization mechanism after all - it's more akin to active waiting.
> > We should still find a good solution using a proper high-resolution timer and an appropriate API (maybe clock_nanosleep with the CLOCK_REALTIME flag, I don't know). Slapping yield back and forth seems to me like cargo-cult programming, where the changes introduced are not fully understood. At least I do not understand why yield changed something in this case and what the repercussions will be. What if by introducing yield we're increasing the likelihood that the kernel will unbind our process from a particular core? I am seriously worried that this active-waiting code introduced some regressions in time-sensitive code (e.g. in PC speaker emulation or maybe VGA emulation) and the changes were simply not tested well enough for regressions.
>
> These are legitimate concerns, and this was me correcting an oversight from my original changes. In general (and especially in embedded/firmware, which is most of my dev work), you never want to "busy wait" without giving the CPU a chance to do something else. For example, on a Cortex M4 you use WFE or WFI to avoid eating 100% CPU while you wait, so this wasn't some cargo-cult change I slapped in for no reason, but was based on a reasonable amount of past experience.
>
> From the cppreference link I included in the original commit, this type of situation is one big reason why yield() was added to the standard, so I had no reason to distrust it, but I do also like that this is a single commit, so we can bisect if we do run into issues.
>
> Thank you - I appreciate the detailed feedback!

Do you think we should cherry-pick this change to release branch for 0.77.1 release?

@kcgen (Member) commented Jul 9, 2021

Sorry @kklobe, I thought the commit only had a title.

GitHub didn't give me the usual [...] expansion dots to indicate the commit has a message body. This could be related to some recent problems I've been having with the GitHub UI lately, too.

At least it's currently indicating properly:

[screenshot: the commit entry now showing its expansion dots]

I was happy with the commit title - and certainly reviewed that aspect. Next time I won't trust the presence of the dots (or I'll inspect the branch locally with gitk).

@kcgen (Member) commented Jul 9, 2021

On that topic - does anyone know of a work-around to make vscodium wrap commit messages at the 72-column boundary? (There is an open ticket, microsoft/vscode#2718, but no progress.)

I currently do my git operations and builds on the console using first-letter aliases (gists: git & meson), and use nano and Ctrl+J to do smart wrapping.

@kcgen (Member) commented Jul 9, 2021

> I am seriously worried that this active waiting code introduced some regressions

The DOSBox authors (and the code) expect ongoing slippage, where each 1ms tick through the emulator actually takes much more than 1ms.

  • The time-cost of running the core isn't accounted for or deducted from the sleep time
  • The host-lag of polling events (a serial task after running the core) isn't deducted from the sleep time
  • The SDL-1ms sleep itself is highly inaccurate, often sleeping for ~1.2ms

The existing DOSBox code expects this slippage by measuring and adjusting for it after all of the above have happened. It then adds a small number of ticks (between 2 and 5, if I recall), to force DOSBox to always stay ahead of the ragged-edge by just a couple ms.

This PR only addresses the third bullet - trying to make the 1ms sleep really 1ms instead of 1.2ms.

It's a bit like running DOSBox on an RT-patched Linux kernel and giving the process FIFO scheduling (which I've done; timing becomes highly accurate).

Given DOSBox already handles large variations in delay inaccuracies, this move toward more accurate 1-ms delaying just moves the error bars inward a bit more.
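Schematically, that slippage compensation works something like the following (a loose sketch with hypothetical helper names, not the actual DOSBox increaseticks code):

```cpp
#include <cstdint>

// Hypothetical helpers standing in for the real DOSBox internals.
int64_t get_ticks_ms();                    // monotonic milliseconds
void    run_core_for_one_emulated_ms();
void    poll_host_events();
void    delay_ms(int ms);                  // often overshoots (~1.2ms)
void    credit_extra_ticks(int64_t ticks); // let the emulator catch up

void tick_loop_sketch(const bool &running)
{
	int64_t last = get_ticks_ms();
	while (running) {
		run_core_for_one_emulated_ms();
		poll_host_events();
		delay_ms(1);

		// Measure how far real time slipped past the 1ms tick and
		// credit it back, plus a couple of ticks of safety margin
		// to stay ahead of the ragged edge.
		const int64_t now = get_ticks_ms();
		credit_extra_ticks((now - last - 1) + 2);
		last = now;
	}
}
```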

> time-sensitive code (e.g. in PC speaker emulation or maybe VGA emulation)

I did extensive testing of all audio devices, all cores, all machine-types prior to 0.77, including running tests for all of the video modes (vgatest.exe.zip).

> Do you think we should cherry-pick this change to release branch for 0.77.1 release?

Couldn't hurt! I'll add it.

I think there are a couple more little fixes we've done that could go into 0.77.1 now too.

@kcgen added the 'backport' label (Important fixes that should be backported) on Jul 9, 2021
@kcgen (Member) commented Jul 9, 2021

> Do you think we should cherry-pick this change to release branch for 0.77.1 release?

In progress: #1138

Will merge when it passes CI.

@kcgen deleted the kk/delayprecise-yield-1 branch on December 4, 2021
@kcgen added this to the 0.78 release milestone on Dec 21, 2021