[d3d11] Use host memory in deferred context for "small" updates. #1805

Closed

rbernon
Contributor

@rbernon rbernon commented Nov 4, 2020

I'm not sure of the lifetime of things involved here, so I'm opening the PR to get some feedback. Dark Souls III performance is clearly hitting a bottleneck here, with multiple threads fighting for slices and thrashing the CPU cache in several places: the free slice spinlock, but atomic operations on buffer refcount as well.

I tried a few games with it and it seems to be working alright, but I didn't really test it extensively.

Instead of spinning like crazy in the slice allocator, thrashing the
CPU caches while fighting other deferred context threads.

Using the heap here gets us ~15 more fps in Dark Souls III.
@doitsujin
Owner

doitsujin commented Nov 4, 2020

What kind of hardware is this a problem on? I've never seen Dark Souls 3 perform poorly in comparison to Windows, even on my 10-year-old Phenom II X6 back in the day.

I really don't like the idea of introducing memory allocations and memcpys on a hot path that we're going to be hitting every single time in the vast majority of games using Deferred Contexts. In fact, doing that is what made some other games (like Diablo III iirc) run very slowly, which is why the change was made in the first place.

This looks like it does exactly the same thing as the code path in case m_csFlags.test(DxvkCsChunkFlag::SingleUse) is false anyway, and in my testing, that has pretty much always been worse in relevant games and can be triggered via a config option if necessary. Note that most games do not use deferred contexts at all.

@rbernon
Contributor Author

rbernon commented Nov 4, 2020

I'm running on a laptop with an Intel i7-8565U CPU (and an NVIDIA 1070 eGPU, but it's not the bottleneck here), and perf reports 30% CPU usage on the related atomic ops. I haven't tried moving everything through the other code path; I thought AllocUpdateBufferSlice was also allocating a slice, but after a closer look it seems that it could fit.

However, I initially didn't limit the size of the updates, and Sekiro then performed very badly for instance. I think the issue is only with "small" updates that apparently are used a lot in DS3.
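
To illustrate the size limit described above, here is a minimal sketch of the idea, not the actual patch. `SliceAllocator`, `stageUpdate` and the 256-byte `kSmallUpdateLimit` are hypothetical names and values chosen for illustration (the cutoff used in the PR is not stated in this thread), and plain `std::vector` storage stands in for both host memory and buffer slices.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical cutoff; the value used by the PR is not stated in the thread.
constexpr std::size_t kSmallUpdateLimit = 256;

// Stand-in for the shared slice allocator that deferred contexts contend on.
struct SliceAllocator {
  std::size_t allocations = 0;

  std::vector<std::uint8_t> allocSlice(std::size_t size) {
    ++allocations;
    return std::vector<std::uint8_t>(size);
  }
};

enum class Staging { HostMemory, BufferSlice };

struct StagedUpdate {
  Staging kind;
  std::vector<std::uint8_t> data;
};

// Copies small updates into private host memory and only goes through the
// shared allocator for larger ones.
StagedUpdate stageUpdate(SliceAllocator& allocator, const void* src, std::size_t size) {
  StagedUpdate update;

  if (size <= kSmallUpdateLimit) {
    update.kind = Staging::HostMemory;
    update.data.resize(size);
  } else {
    update.kind = Staging::BufferSlice;
    update.data = allocator.allocSlice(size);
  }

  if (size != 0)
    std::memcpy(update.data.data(), src, size);

  return update;
}
```

In this model, a 64-byte update never touches the allocator while a 4096-byte update does, which matches the observation that only the "small" updates benefit.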

@doitsujin
Owner

doitsujin commented Nov 4, 2020

The problem in general is that these small updates also tend to be made very frequently if a game doesn't make use of D3D11_MAP_WRITE_NO_OVERWRITE (most unfortunately don't), so the total amount of data being copied around is often going to be fairly large and will at some point end up being the bottleneck.
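
As a rough illustration of why the map mode matters, the toy model below counts how many backing allocations a dynamic buffer consumes under the two write patterns. It is neither D3D11 nor DXVK code: `DynamicBufferModel` and both helper functions are hypothetical, and the model only assumes that a discard-style map hands out fresh storage while a no-overwrite map returns the existing storage at an offset.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Toy model of a dynamic buffer; names are hypothetical.
class DynamicBufferModel {
public:
  explicit DynamicBufferModel(std::size_t size) : m_size(size) {
    m_slices.push_back(std::make_unique<std::uint8_t[]>(m_size));
  }

  // Models a discard-style map: older contents may still be in use,
  // so the caller gets fresh backing storage.
  std::uint8_t* mapDiscard() {
    m_slices.push_back(std::make_unique<std::uint8_t[]>(m_size));
    return m_slices.back().get();
  }

  // Models a no-overwrite map: the caller promises not to touch data that
  // is still in use, so the current storage is returned at an offset.
  std::uint8_t* mapNoOverwrite(std::size_t offset) {
    return m_slices.back().get() + offset;
  }

  std::size_t sliceCount() const { return m_slices.size(); }

private:
  std::size_t m_size;
  std::vector<std::unique_ptr<std::uint8_t[]>> m_slices;
};

// Performs `updates` writes of `bytes` each, discarding every time.
std::size_t updateWithDiscard(std::size_t updates, std::size_t bytes) {
  DynamicBufferModel buffer(bytes);
  for (std::size_t i = 0; i < updates; i++)
    std::memset(buffer.mapDiscard(), 0xAB, bytes);
  return buffer.sliceCount();
}

// Performs the same writes into consecutive regions of one larger buffer.
std::size_t updateWithNoOverwrite(std::size_t updates, std::size_t bytes) {
  DynamicBufferModel buffer(updates * bytes);
  for (std::size_t i = 0; i < updates; i++)
    std::memset(buffer.mapNoOverwrite(i * bytes), 0xAB, bytes);
  return buffer.sliceCount();
}
```

Under these assumptions, the discard pattern needs one new backing allocation per update, while the no-overwrite pattern keeps using the single allocation made up front.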

I'm also not sure how much of a problem lock contention really is for most games, it seems like DS3 just uses one single dynamic constant buffer for just about everything but that's not the norm either.

@rbernon
Contributor Author

rbernon commented Nov 4, 2020

I think the issue is not so much lock contention but rather cache thrashing. For instance using the heap makes the threads fight each other as well but it ends up being nicer to the cache. Using a std::mutex instead of a spinlock for the slice allocator also helps a bit but then there's still some thrashing visible on buffer refcount.
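
The spinlock versus `std::mutex` comparison can be sketched as follows. This is an assumption-laden illustration rather than DXVK's allocator: `SpinLock`, `FreeList` and `hammer` are hypothetical names, and the free list simply hands out slice indices.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Minimal test-and-set spinlock. Every failed attempt is still an atomic
// write to the shared cache line, so waiting threads keep pulling that
// line back and forth between cores.
class SpinLock {
public:
  void lock()   { while (m_flag.test_and_set(std::memory_order_acquire)) { } }
  void unlock() { m_flag.clear(std::memory_order_release); }
private:
  std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
};

// Free list of slice indices, guarded by whichever lock type is plugged in.
template<typename Lock>
class FreeList {
public:
  explicit FreeList(std::size_t count) {
    for (std::size_t i = 0; i < count; i++)
      m_free.push_back(i);
  }

  // Returns false when no slice is left.
  bool alloc(std::size_t& slice) {
    std::lock_guard<Lock> guard(m_lock);
    if (m_free.empty())
      return false;
    slice = m_free.back();
    m_free.pop_back();
    return true;
  }

  void release(std::size_t slice) {
    std::lock_guard<Lock> guard(m_lock);
    m_free.push_back(slice);
  }

  std::size_t available() {
    std::lock_guard<Lock> guard(m_lock);
    return m_free.size();
  }

private:
  Lock m_lock;
  std::vector<std::size_t> m_free;
};

// Lets several threads allocate and release slices in a tight loop and
// returns the number of free slices left at the end.
template<typename Lock>
std::size_t hammer(std::size_t threads, std::size_t iterations) {
  FreeList<Lock> list(threads);
  std::vector<std::thread> workers;

  for (std::size_t t = 0; t < threads; t++) {
    workers.emplace_back([&list, iterations] {
      for (std::size_t i = 0; i < iterations; i++) {
        std::size_t slice = 0;
        if (list.alloc(slice))
          list.release(slice);
      }
    });
  }

  for (auto& worker : workers)
    worker.join();
  return list.available();
}
```

Timing `hammer<SpinLock>` against `hammer<std::mutex>` with the same thread count is one way to compare the two on a given machine; results will depend heavily on the CPU and scheduler, so no numbers are claimed here.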

@rbernon
Contributor Author

rbernon commented Nov 4, 2020

I tried the option, and disabling single use mode for this game improves performance in the same way on my hardware. Would it make sense to add it to the default config?

@doitsujin
Owner

We can do that, but I'd like to get some numbers from different hardware configs. This game is unfortunately very annoying to benchmark because it's locked to 60 FPS.

What kind of frame rates are you getting before and after enabling it?

@rbernon
Contributor Author

rbernon commented Nov 4, 2020

Right at the start of the game, all settings to low and 1280x720, without moving the camera it was hardly above 45fps before with ~25% GPU usage (reported on the HUD), and enabling it makes it reach 60fps with ~40% GPU usage. It's not obviously capped though, and moving the camera changes the numbers too, in particular it's better in both cases with less geometry on screen.

Although it's much better with the single use mode disabled, perf still shows 11% CPU usage in dxvk::DxvkContext::invalidateBuffer(dxvk::Rc<dxvk::DxvkBuffer> const&, dxvk::DxvkBufferSliceHandle const&), caused by some incRef cache miss.

Note that using the heap like I did here didn't change the perf report much, although the 10% CPU time is then spent waiting on the global heap cs (or on thread-local heap atomic ops if I use that instead).

@ViNi-Arco

Hi Rémi, the modification you made improved the fps stability in Dark Souls Remastered for me; I now get a constant 60 fps. Here is a before and after of Dark Souls Remastered in Full HD with a 720p resolution scale and only Motion Blur and FXAA enabled. Firelink Shrine was one of the areas where the fps varied the most for me. Thank you.

@rbernon
Contributor Author

rbernon commented Nov 18, 2020

Nice to know. As discussed above there's already a d3d11.dcSingleUseMode = False config setting that enables a similar code path and that could be used without any change.
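
For reference, the setting quoted above would look like this in a DXVK configuration file. This assumes a `dxvk.conf` placed next to the game executable; the file location depends on your setup, and only the option name and value come from this thread.

```
d3d11.dcSingleUseMode = False
```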

@doitsujin
Owner

@ViNi-Arco what's your hardware configuration?

I wonder if (besides contesting the one constant buffer) this has anything to do with latencies involved when writing data to external GPUs directly. DXVK is known to not do particularly well in such configurations.

@ViNi-Arco

ViNi-Arco commented Nov 21, 2020

Hello Mr. Philip, I have an Intel Q9650 with DDR3@1333 with Tight Timings with Command Rate in 1T.

With Rémi's correction, CPU usage is lower.
With stock DXVK, CPU usage is slightly higher.

Stock DXVK also stutters more often; as you can see in the image above, it is already stuttering.

Edit-2: This higher CPU usage was probably a MuQSS anomaly that this patch helped with for some reason. Now, with the new Project C PDS, the usage is lower regardless of Rémi's patch, and using d3d11.dcSingleUseMode = False gives a small gain equal to the patch.

Edit: I tested it on Far Cry 4, and there was no difference, that must be specific in some games, maybe. I only have these two games in DX11 to test at the moment.

@rbernon
Contributor Author

rbernon commented Feb 8, 2021

I'm closing this, as it seems it's hardly useful. People using it as a patch probably should not.

@rbernon rbernon closed this Feb 8, 2021