
CoreTiming: Throttle Before Every Event Using std::chrono #11348

Merged
merged 5 commits into from Jan 14, 2023

Conversation

Sam-Belliveau
Copy link
Contributor

@Sam-Belliveau Sam-Belliveau commented Dec 15, 2022

What it does

  • Throttles before every single event with std::this_thread::sleep_until
    • If the sleep function oversleeps, the next few calls to sleep are skipped until the timings are maintained again, so the whole system remains relatively stable regardless of how sleep is implemented.
    • This means that events scheduled in the CoreTimingManager now happen the instant they are supposed to
      • New timings: +/- 10µs (tested on macOS)
      • Old timings: +/- 1000µs (on every operating system, due to design)
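The catch-up behaviour described above can be sketched roughly like this (illustrative only — names and the fallback tolerance are hypothetical, not the PR's actual code):

```cpp
#include <chrono>
#include <thread>

// The running deadline advances by the emulated duration of each event.
// If the OS overslept earlier, the deadline is already in the past, so
// sleep_until() returns immediately and subsequent events skip sleeping
// until real time catches back up to the deadline.
class EventThrottle
{
public:
  using Clock = std::chrono::steady_clock;

  explicit EventThrottle(Clock::duration max_fallback)
      : m_deadline(Clock::now()), m_max_fallback(max_fallback)
  {
  }

  void ThrottleUntilNextEvent(Clock::duration emulated_duration)
  {
    m_deadline += emulated_duration;

    const Clock::time_point now = Clock::now();
    // After a long stall, resync rather than racing ahead uncapped.
    if (m_deadline < now - m_max_fallback)
      m_deadline = now - m_max_fallback;

    std::this_thread::sleep_until(m_deadline);  // no-op if deadline passed
  }

private:
  Clock::time_point m_deadline;
  Clock::duration m_max_fallback;
};
```

Because sleep_until with a time point in the past returns immediately, an oversleep self-corrects over the next few events without any explicit error accounting.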

Comparisons

This commit can turn these frame times (Wii Sports Resort):
image

Into these frame times (Wii Sports Resort - Immediate XMB ON): (just stunning)
image

Wii Sports Resort - Immediate XMB OFF: (notice that vblanks have an std of 0ms)
image


Here is with the metal frame time viewer:

Before:
image

After:
image

@JMC47
Copy link
Contributor

JMC47 commented Dec 16, 2022

Does this cause any kind of increased latency?

@Sam-Belliveau
Copy link
Contributor Author

@JMC47 I could be wrong, but I do believe that it reduces latency.

Basically, the old throttling method ran every 1ms, and it would either wait 1ms or not in order to maintain speed. This effectively made the timing random.

The new throttling looks at how long each event should take; if it is faster than it should be, it waits, and if it is slower, it eats into the next event's time. This basically means that each event happens at the EXACT time that it should, rather than being spread out by the weird throttle function.

@Sam-Belliveau Sam-Belliveau force-pushed the improved-pacing branch 2 times, most recently from 759621f to 0b5547a Compare December 23, 2022 05:05
@Sam-Belliveau Sam-Belliveau force-pushed the improved-pacing branch 3 times, most recently from 3b0c9b9 to 9416a95 Compare December 24, 2022 02:55
@Pokechu22
Copy link
Contributor

It looks like this fixes #11326 (comment), but it would be good to extract that fix into a standalone pull request so that it can be included in a beta release.

@Sam-Belliveau
Copy link
Contributor Author

This branch also contains additions that fit the performance information on the screen more optimally. It will now stack the information vertically when not using the performance graph.

It also has a new counter, just as important as Speed, called Max Speed. This compares real CPU time to emulated CPU time and tells you what your theoretical max speed would be if running at an uncapped framerate. This lets you create a performance buffer by seeing how much lag your system could endure before you saw a stutter. This information was calculated before, but was less useful because of the sloppiness in the throttling.

image

image

image

image

image

// timings on its own and we are safe to engage in VSync if we want.
if (m_throttle_disable_vsync)
{
WARN_LOG_FMT(COMMON, "System Lag Resolved! VSync Capability Reenabled!");


Shouldn't this be an INFO_LOG?

Copy link
Contributor Author


I'm not sure, but I can definitely update it.

@Sam-Belliveau Sam-Belliveau force-pushed the improved-pacing branch 4 times, most recently from bc0cbca to 667806a Compare December 28, 2022 00:53
@AdmiralCurtiss
Copy link
Contributor

Can you please split out the refactoring from the actual change here? I have a hard time telling what is relevant and what is just noise from changing size_t to u64 or whatever (why did you do that stuff, anyway?).

Source/Core/Core/Core.cpp Outdated Show resolved Hide resolved
Source/Core/Core/CoreTiming.cpp Outdated Show resolved Hide resolved
@@ -305,6 +310,8 @@ void CoreTimingManager::Advance()
Event evt = std::move(m_event_queue.front());
std::pop_heap(m_event_queue.begin(), m_event_queue.end(), std::greater<Event>());
m_event_queue.pop_back();

Throttle(evt.time, emulation_speed);
Copy link
Contributor


I don't think this is the right place for the Throttle() call. The way CoreTiming works -- from what I remember, anyway -- is that CoreTiming::Advance() is called periodically by the JIT, and by the time it gets control multiple events could have logically already happened but haven't been executed yet (which this loop then executes in order to 'catch up' the event state). So by the time this code runs, we're already at m_globals.global_timer, which makes it odd to only throttle until evt.time, which may be earlier than current -- and also means you may throttle multiple times in quick succession when once would have sufficed.

It probably makes more sense to put a Throttle(m_globals.global_timer, emulation_speed) after this loop instead.

This is also most likely the most significant change here, so you should probably mention it in the PR title and description.

e: Actually... well, logically from the CPU we're at m_globals.global_timer, but in terms of real-world time we're likely ahead of where we want to be, so I can see the logic of throttling here if the event itself is user-observable in some way (like, say, presentation of the rendered image). So I guess this is probably fine as-is.

In the future it might also make sense to introduce a flag for an event whether it wants to execute as close to the correct real time as possible or not, since I suspect most events don't actually care.

Copy link
Contributor Author


From what I understand, the CPU doesn't actually need to throttle, but throttling here just makes sure that every event happens exactly when it's supposed to. The way the old throttler worked is that it created an event every GetTicksPerSecond() / 1000 ticks that would sleep for about 1ms depending on the timer. This one is a little more precise. (Having it here is also used by a future PR, which uses this throttler to detect lag and skip VIs if needed.)

Source/Core/Core/CoreTiming.h Outdated Show resolved Hide resolved
Source/Core/VideoCommon/PerformanceMetrics.h Outdated Show resolved Hide resolved
Source/Core/VideoCommon/PerformanceTracker.h Outdated Show resolved Hide resolved
Source/Core/VideoCommon/PerformanceTracker.h Outdated Show resolved Hide resolved
Source/Core/VideoCommon/PerformanceTracker.h Outdated Show resolved Hide resolved
@Sam-Belliveau Sam-Belliveau force-pushed the improved-pacing branch 3 times, most recently from 163506e to 73c5d25 Compare December 30, 2022 19:56
@phire
Copy link
Member

phire commented Dec 31, 2022

> sleep_until is guaranteed to return BEFORE the provided time, so it will not introduce any overhead and takes full advantage of the timing system the operating system provides.

Wait... as far as I can tell, sleep_until actually makes the opposite guarantee. Both libcxx and libstdc++ implement sleep_until as a call to sleep_for(time - clock::now()), which uses nanosleep internally on both.

And the nanosleep man page guarantees that it will sleep for AT LEAST that time. If anything, it provides a guarantee that it will oversleep whenever the OS's scheduler feels like it.

> Throttles before every single event with

Throttling before every single event is extremely excessive, will probably result in worse latency, and will not improve actual timing accuracy. It might actually be very detrimental to performance on lower-end devices.

Your current implementation tries to make every single event happen somewhat close to the correct time. But that's kind of pointless when most events don't have any side effects which the user can even observe.

Ideally, you only want to throttle the input/output events. It doesn't matter if the emulator races through completing rendering of half the frame as fast as possible, as long as it stops before reading input or presenting the result to screen (or audio, though we delay that).

@Sam-Belliveau
Copy link
Contributor Author

> > sleep_until is guaranteed to return BEFORE the provided time, so it will not introduce any overhead and takes full advantage of the timing system the operating system provides.
>
> Wait... as far as I can tell, sleep_until actually makes the opposite guarantee. Both libcxx and libstdc++ implement sleep_until as a call to sleep_for(time - clock::now()), which uses nanosleep internally on both.
>
> And the nanosleep man page guarantees that it will sleep for AT LEAST that time. If anything, it provides a guarantee that it will oversleep whenever the OS's scheduler feels like it.
>
> > Throttles before every single event with
>
> Throttling before every single event is extremely excessive, will probably result in worse latency, and will not improve actual timing accuracy. It might actually be very detrimental to performance on lower-end devices.
>
> Your current implementation tries to make every single event happen somewhat close to the correct time. But that's kind of pointless when most events don't have any side effects which the user can even observe.
>
> Ideally, you only want to throttle the input/output events. It doesn't matter if the emulator races through completing rendering of half the frame as fast as possible, as long as it stops before reading input or presenting the result to screen (or audio, though we delay that).

Sorry for the delay in response. I modeled this after the original throttling behavior.

The previous throttle ran at 1000Hz and would sleep in 1ms increments. It didn't matter whether there was an observable event or not; it would just sleep at different intervals at 1000Hz. However, because it could only sleep in increments of 1ms, it ended up sleeping in a jittery, unpredictable way.

What I meant by "guaranteed to return before" is that, if I request to sleep for 10µs and Windows is only able to sleep in 1ms increments, it will skip sleeping.

I understand the concerns about performance, but from testing (the SkipVI branch uses this branch) I believe there has not been a substantial drop in performance. This is likely because, if you are lagging, this whole process of sleeping is already skipped.

It is probably smart to only throttle if the event requires it, but that is a little out of scope for this PR.

I could, however, add a "MIN_SLEEP" variable that skips the sleep call if it is less than, say, 1ms, which would end up calling sleep at similar increments to the original throttling system, with the added benefits that this PR also brings.

I was also considering completely disabling this throttling system when VSync is active, to use that for timing instead, but I don't know if there is a good way to ensure that VSync is running at the right speed.
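A minimal sketch of that "MIN_SLEEP" idea (hypothetical names, not code from this PR): skip the sleep call entirely when the remaining time is below a threshold, letting the deficit be absorbed by a later, longer sleep against the same deadline.

```cpp
#include <chrono>
#include <thread>

using ThrottleClock = std::chrono::steady_clock;

// Sleeps only when at least min_sleep of real time remains before the
// deadline; shorter waits are skipped, so on a system with a coarse
// scheduler this degrades to roughly the old 1ms-granularity behaviour.
inline void SleepUntilCoarse(ThrottleClock::time_point deadline,
                             ThrottleClock::duration min_sleep)
{
  if (deadline - ThrottleClock::now() >= min_sleep)
    std::this_thread::sleep_until(deadline);
}
```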

@Sam-Belliveau Sam-Belliveau changed the title from "Core: Improve Throttle Consistency Through Use of std::chrono" to "CoreTiming: Throttle Before Every Event Using std::chrono" Jan 5, 2023
@AdmiralCurtiss AdmiralCurtiss force-pushed the improved-pacing branch 2 times, most recently from 0a459b1 to 8364f4c Compare January 6, 2023 20:02
dvessel added a commit to dvessel/dolphin that referenced this pull request Jan 6, 2023
@Sam-Belliveau Sam-Belliveau force-pushed the improved-pacing branch 2 times, most recently from 968dd29 to 93d428b Compare January 6, 2023 21:09
@Sam-Belliveau
Copy link
Contributor Author

Sam-Belliveau commented Jan 6, 2023

> > What I meant by guaranteed to return before is that, if I request to sleep for 10µs, and windows is only able to sleep in 1ms increments, it will skip sleeping.
>
> How exactly did you arrive at that conclusion? On pretty much every operating system I have ever seen, sleeping a thread guarantees that the thread will stay asleep for at least the requested time, but makes no guarantees on when the thread will awake again afterwards. So I'm with phire in that this will, on average, oversleep rather than undersleep.
>
> Also, maybe this is just Windows being less precise with sleeping than macOS, but I've tested this a bit and I don't see any real improvements in frametimes with this change. It's still pretty jittery. Do I have to enable/disable a specific setting for this to work?
>
> e: If I replace the sleep_until with spinning, the frametimes do get better, so I'm concluding that this is indeed an operating system thing.

I thought I had read that somewhere, but looking into it further, it will actually not affect this PR at all. If the throttling oversleeps by, say, 5ms, this system will simply not call sleep again until it has caught up to the timings. The tolerance it has for this is +/- 40ms, so it will dynamically adapt to the best resolution that the system can provide.

It is an operating system thing, and while it might not provide the same level of consistency as macOS, having throttling in this part of the code makes life a lot easier, especially for the VISkip branch. It is also able to provide a much more accurate estimated performance speed, which it was unable to do previously.

It also opens the door to using multimedia timers if timing is that important, but I find this unlikely.

On systems like macOS and Linux that support nanosleep, calling sleep more often will probably give the scheduler more appropriate times to hand control back, though the effect on performance is unknown.

If calling sleep a lot does cause issues, it is very simple to add a 1ms minimum sleep requirement to limit the sleep calls to 1000Hz at the expense of accuracy.

@AdmiralCurtiss
Copy link
Contributor

Okay, how about this:

  • Make CoreTimingManager::Throttle() public.
  • Remove the call from the event loop.
  • Add a call to it here, here, and maybe here (unsure if this matters?).

Does this still give you the 'good' frame times?

@Sam-Belliveau
Copy link
Contributor Author

I'm going to test this right now, but I worry that it might create issues that didn't exist before due to it being too different.

But I will comment with the results of doing what you said.

@Sam-Belliveau
Copy link
Contributor Author

@AdmiralCurtiss

If you check the Discord you'll see that the timings are worse if we only throttle in those places, and it seems like the performance difference is negligible. (It seems to show higher performance when throttling before every event.)

JMC47 added a commit that referenced this pull request Jan 6, 2023
…actorings

Performance tracker refactorings extracted from PR #11348.
@AdmiralCurtiss
Copy link
Contributor

shrug Weird, but if it's like that then it's like that. I'm not sure what else to do then.

if (stack_vertically)
  window_y += window_height + window_padding;
else
  window_x -= window_width + window_padding;
Copy link
Contributor Author


These commits are mainly just moving things around, since the improved pacing allows us to get some new data points.

Source/Core/Core/CoreTiming.cpp Outdated Show resolved Hide resolved
Source/Core/VideoCommon/PerformanceMetrics.cpp Outdated Show resolved Hide resolved
Source/Core/VideoCommon/PerformanceMetrics.h Show resolved Hide resolved
Source/Core/VideoCommon/PerformanceMetrics.h Outdated Show resolved Hide resolved
Source/Core/VideoCommon/PerformanceTracker.h Outdated Show resolved Hide resolved
@AdmiralCurtiss
Copy link
Contributor

Anyway, I remain unconvinced that this is a positive change in all instances, but at the same time I'm not gonna block @JMC47 or whoever from merging this if they really want it. It's not like we can't change it back if it turns out to cause problems...

@Sam-Belliveau
Copy link
Contributor Author

I agree that it shouldn't be regarded as final, but it is certainly more flexible than previously. (The changes you suggested were not possible with the old timing system.)

@phire
Copy link
Member

phire commented Jan 12, 2023

Overall, I'm happy with the code. It's a good cleanup and an improvement over what was there previously.

> What I meant by guaranteed to return before is that, if I request to sleep for 10µs, and windows is only able to sleep in 1ms increments, it will skip sleeping.

This is a dubious claim, and certainly not a guarantee. The only reason Windows will skip sleeping is that 10µs rounds down to 0ms, which skips sleeping.

On other operating systems with nanosleep (and Windows, if you ask for more than 1ms of sleep), the OS will always suspend your process. And once the process is suspended, there is no guarantee that it will return anywhere near your requested time, because another process might be scheduled to run instead and might massively delay the return to Dolphin. Hell, by default, Windows will actually sleep for up to 15ms when you Sleep(1).

Relying on rounding down to 0 is a weird edge case. If you actually want that behaviour, perhaps you should implement it in Throttle instead?

Measure how long calls to nanosleep/sleep actually take (compared to the requested time) to determine what timer resolution the OS is actually giving you, and then only call sleep_until if the requested sleep is longer than that time.
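Such a measurement might be sketched as follows (illustrative only, not part of this PR; the sample count and probe duration are arbitrary):

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

using ProbeClock = std::chrono::steady_clock;

// Time a few short sleeps and record the worst overshoot relative to the
// requested duration. That overshoot approximates the scheduler resolution
// the OS is actually providing, and can serve as the minimum duration
// worth handing to sleep_until() at all.
inline ProbeClock::duration EstimateSleepOvershoot(int samples = 8)
{
  ProbeClock::duration worst{0};
  const auto requested = std::chrono::microseconds(100);
  for (int i = 0; i < samples; ++i)
  {
    const auto start = ProbeClock::now();
    std::this_thread::sleep_for(requested);
    worst = std::max(worst, (ProbeClock::now() - start) - requested);
  }
  return worst;
}
```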

Maybe on Windows we should switch to CREATE_WAITABLE_TIMER_HIGH_RESOLUTION and CreateWaitableTimerEx when available, but I have no idea what resolution it actually supports; it seems pretty undocumented, and this blog post seems to be the best source. I saw someone on Reddit claiming it supports 0.5ms resolution, but I suspect it might actually vary from machine to machine.

The actual code is fine, but can you remove any mention of sleep_until providing a guarantee from your PR description, and perhaps add comments to the code that explicitly point out the issue with OS scheduler resolution?

> If you check the Discord you'll see that the timings are worse if we only throttle in those places, and it seems like the performance difference is negligible. (It seems to show higher performance when throttling before every event.)

Sigh. Discord isn't good for findability. In the future, it would be helpful to duplicate any relevant information into the PR, rather than making me dig through Discord history for it:

image

That result doesn't feel right to me. When are you actually measuring frametime here? Is it in VI's present, or when the GPU finishes rendering the frame?

The reason I'm a bit concerned is that there is no guarantee that Clock::now() is actually a cheap call. In modern operating systems it's often implemented mostly in userspace to make it fast, but it might actually result in an expensive system call. And the one thing your changes do is make Dolphin very dependent on Clock::now() being fast.

So, because of concern about potential performance regressions, it's worth trying to reduce the number of calls to Clock::now(). And there is absolutely no reason to give every single event accurate timing. Accurate timing only matters for input and video/sound output.

Throttling before every single event seems like massive overkill, and I worry it might hurt performance on certain OS configurations, or in cases where the hardware was already marginal.


Actual requests:

  • Please update the PR summary to remove claims about sleep_until guaranteeing an early return
  • Consider having Throttle implement the minimum-sleep behaviour itself.
  • Look into why throttling on only the relevant events doesn't produce pacing as smooth as throttling on every call. It might be fixable, but at the very least it would be helpful to know why.

@Sam-Belliveau
Copy link
Contributor Author

@phire I updated the PR to reflect that the guarantee does not really exist; I must have misread that somewhere. However, if it does happen to oversleep, it quickly compensates by not sleeping until the timings catch up. It can account for ~40ms of oversleep before there is a slowdown. I believe the reason the timings are more accurate when sleeping before every event just comes down to the fact that it's sleeping by small amounts more often, and is able to correct for the small errors in the sleep.

I implemented the minimum sleep feature, but did it in a way that only calls Clock::now if enough time has passed. The number I settled on was 8192, because that let me set the minimum clocks to sleep to clocks_per_sec >> 13, and because it reduced the number of sleep calls from 25k to 4k, at least according to the macOS Activity Monitor. It also maintained the frame pacing at < +/- 0.2ms, which I consider to be perfect. It also seemed to fix an issue where VSync would cause lag that would prevent the throttler from catching up and would run the game at 60Hz.

Hopefully this resolves your concerns!

@JMC47
Copy link
Contributor

JMC47 commented Jan 14, 2023

@phire and @AdmiralCurtiss said I could merge this. We'll hopefully get some more testing on a wider variety of machines with this now.
