
DxvkMemoryAllocator: Memory allocation failed #747

Closed
kakra opened this issue Nov 4, 2018 · 66 comments

kakra commented Nov 4, 2018

Software information

The Witcher 3, all settings maxed out, full HD, Nvidia Hairworks all characters + AA4

System information

  • GPU: NVIDIA 1050 Ti 4GB
  • Driver: 396.54.9 vulkan beta
  • Wine version: 3.19 master with Proton patches
  • DXVK version: v0.90-46-g963bd66

Log files

  • d3d11.log: cannot reproduce with debug enabled
  • dxgi.log: cannot reproduce with debug enabled

After loading a saved game, the game sometimes freezes just milliseconds after the screen starts to fade in. Since it freezes mid-fade, everything is still dark, but it looks like everything is rendered correctly: no models or textures missing, and NV Hairworks is working too.

Looking at the logs I see

terminate called after throwing an instance of 'Dxvk::DxvkError'

Turning on full debug logging in DXVK eliminates the issue; it's no longer reproducible.

The frozen game can be successfully and instantly killed with SIGKILL.

doitsujin (Owner) commented:

Please apply the following patch to DXVK to get more descriptive error output:
dxvk-error.patch.txt

I've never seen this problem or anything like it, and I test Witcher 3 a lot. With the current set of information I won't be able to do anything, though.

doitsujin added the bug label Nov 4, 2018
kakra (Author) commented Nov 4, 2018

Thanks, I'll try over the next few days. I never saw this behavior with the v0.80 series of DXVK.

SveSop (Contributor) commented Nov 4, 2018

I haven't noticed this with the "Beta 3.16" Proton version though. AFAIK that uses dxvk-0.90...

ATM that's the 3.16-4 Beta, which I would guess is the release called proton-3.16beta-20181031.

kakra (Author) commented Nov 4, 2018

@SveSop I'm currently working with bleeding-edge builds here... Proton rebased to 3.19, including some code to optimize process scheduler priorities to reduce priority-inversion effects, and bleeding-edge DXVK from git built as a winelib. This boosts SOTTR performance from 19 to 33 fps for me (even 35 fps with the latest wine-3.19), and it reduces stutter and fps dips in TW3 and PoE. Also, intermittent freezes in SOTTR are fixed. I'm also working on some avrt patches so that native xaudio can properly gain realtime priority (currently only the built-in xaudio does that, and only with the staging patchset). I'm going to push these updates to my repository soon, but I'm not yet satisfied with them and want to test quality a little more. Also, wine recently had some commits breaking compatibility with esync and the d3d-related patches from Proton, which I need to iron out (I think I've fixed most by now).

I don't think the wine version has anything to do with it; and if it does, it's something that'll show up here as soon as Proton is officially based on a newer wine version.

kakra changed the title "The Witcher 3 sometimes freezes with 'Dxvk::Error'" → "The Witcher 3 sometimes freezes with 'dxvk::DxvkError'" Nov 5, 2018
kakra (Author) commented Nov 7, 2018

@doitsujin Is it possible that the patch you attached just prints a bunch of newlines? I currently cannot reproduce the issue in Witcher 3, but it now occurs in SOTTR.

doitsujin (Owner) commented:

Ah yeah, sorry. This one should work:
dxvk-error.patch.txt

Again, SOTTR works fine on my end.

kakra (Author) commented Nov 7, 2018

SOTTR also radically dropped in performance for me during one of my last rebases, from 30 fps to 10 fps (with vsync + triple buffering). But I don't know whether this is due to code changes in wine-master or in DXVK. I'm currently trying to figure out if my wine-master rebase went wrong. There are many conflicting changes going on at the moment, and I'm reintegrating patches from their updated sources now. As a first step I already reverted my own code changes, but that didn't help, so there seems to be nothing wrong with those. Ah well... sigh

doitsujin (Owner) commented:

Can you test things with a clean wine-tkg setup (if you're on Arch) or something similar, to rule out issues with your own wine build?

kakra (Author) commented Nov 7, 2018

@doitsujin Okay, something strange is going on. Out of desperation, I zapped the shader caches in $STEAMAPPS/shadercache/$GAMEID (both DXVK and Nvidia), and the crash in SOTTR is gone, plus it's back to normal performance (it even feels smoother now). The first benchmark run was clearly full of stutters, as expected; subsequent runs are fine now. Also, the graphical distortions in SOTTR are gone (like Lara missing her clothes or hair).

Does this make sense to you? I wonder if TW3 benefits from a cache clear, too. Let me try...

PS: Don't try to reproduce Lara missing clothes expecting some fun; the developers seem to have thought of this. :-)

doitsujin (Owner) commented:

That's weird and should probably not happen, but yeah, might be worth trying for TW3 as well.

kakra (Author) commented Nov 7, 2018

Does the cache depend on the DXVK version somehow? And are there safeguards against broken shader caches?

Or: s/shader cache/state cache/

kakra (Author) commented Nov 7, 2018

Okay, I already found that there's a safeguard: a SHA-1 sum for each state cache entry, plus a version header. So how did it break for me?

kakra changed the title "The Witcher 3 sometimes freezes with 'dxvk::DxvkError'" → "State cache corruption?" Nov 8, 2018
doitsujin (Owner) commented:

Not sure. Did you manage to confirm whether it was DXVK's state cache or the Nvidia driver cache that was causing issues?

kakra (Author) commented Nov 8, 2018

I nuked both, and only then discovered that this wasn't the best idea for finding out which one actually caused the problem. :-(

kakra (Author) commented Nov 8, 2018

Okay, I got TW3 to crash again, and this time logging worked (that logging patch should be in mainline, shouldn't it?):

0029:err:clipboard:convert_selection Timed out waiting for SelectionNotify event
0029:err:clipboard:convert_selection Timed out waiting for SelectionNotify event
DxvkMemoryAllocator: Memory allocation failed
terminate called after throwing an instance of 'dxvk::DxvkError'
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding c4
004d:fixme:seh:dwarf_get_ptr unsupported encoding 7d
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding c4
004d:fixme:seh:dwarf_get_ptr unsupported encoding 7d
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding 4a
004d:fixme:seh:dwarf_get_ptr unsupported encoding a9
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding 4a
004d:fixme:seh:dwarf_get_ptr unsupported encoding a9
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding 4a
004d:fixme:seh:dwarf_get_ptr unsupported encoding a9
004d:fixme:seh:dwarf_get_ptr unsupported encoding 9b
004d:fixme:seh:dwarf_get_ptr unsupported encoding 4a
004d:fixme:seh:dwarf_get_ptr unsupported encoding a9
004d:err:seh:call_stack_handlers invalid frame 3986f519 (0x39672000-0x39870000)
004d:err:seh:NtRaiseException Exception frame is not in stack limits => unable to dispatch exception.

kakra (Author) commented Nov 8, 2018

Looking at the code, it seems like I should somehow manage to reproduce this error even with DXVK logging turned on...

doitsujin (Owner) commented:

"DxvkMemoryAllocator: Memory allocation failed" indicates that you're running out of memory (not necessarily VRAM).

kakra changed the title "State cache corruption?" → "The Witcher 3: DxvkMemoryAllocator: Memory allocation failed" Nov 9, 2018
kakra (Author) commented Nov 9, 2018

Okay, I renamed the issue title to reflect the original problem. I think the "cache corruption" in SOTTR is really a different issue, and I should report it separately if it occurs again.

It's strange that this can happen even very early after starting the game, i.e. right after loading a saved game for the first time after starting The Witcher 3. I'll report back with new findings.

Actually, my system was loaded with some development applications that like to take a good amount of RAM when this issue last occurred. But it still had plenty of RAM left, around 8 GB. After all, TW3 is usually not THAT memory hungry (being an older game).

kakra (Author) commented Nov 9, 2018

Here's an update:

err:   DxvkMemoryAllocator: Memory allocation failed
  Size:      134217728
  Alignment: 256
  Mem flags: 0x7
  Mem types: 0x681
DxvkMemoryAllocator: Memory allocation failed
terminate called after throwing an instance of 'dxvk::DxvkError'

# free -m
              total        used        free      shared  buff/cache   available
Mem:          15931        8063        1415         184        6452        7092
Swap:         67583        1304       66279

# nvidia-smi after the crash
Fri Nov  9 21:03:35 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54.09              Driver Version: 396.54.09                 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0  On |                  N/A |
| 54%   43C    P5    N/A /  75W |   1697MiB /  4006MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1040      G   /usr/libexec/Xorg                           1120MiB |
|    0      2537      G   /usr/bin/kwin_x11                             47MiB |
|    0      2547      G   /usr/bin/krunner                               1MiB |
|    0      2549      G   /usr/bin/plasmashell                         241MiB |
|    0      4762      G   ...quest-channel-token=6856907186180940681   256MiB |
|    0      6145      G   ...ra/.local/share/Steam/ubuntu12_32/steam    20MiB |
|    0      6153      G   ./steamwebhelper                               1MiB |
|    0      6172      G   ./steamwebhelper                               4MiB |
+-----------------------------------------------------------------------------+

kakra (Author) commented Nov 9, 2018

I played this game for extended hours (sometimes 12 in a row, yes, I'm an addict of this game) with previous versions of DXVK. So I wonder why I'm seeing this now.

doitsujin (Owner) commented:

Does it still work on older versions?

I don't see why allocating a 128MB buffer in system memory would suddenly fail when it previously didn't, especially since the memory allocator hasn't been touched in a long time.

kakra (Author) commented Nov 10, 2018

After trying some games, I see that multiple games are affected... Skyrim SE freezes on loading screens or in the middle of the game, and looking at the logs I see DXVK complain about memory at that very moment.

It looks like Chrome hogs a lot of GPU memory; Xorg was holding almost 3 GB of it. Restarting Chrome fixes that, and stopping Chrome gets rid of the issues in Skyrim SE. I don't think it has anything to do with the DXVK version; the common factor is that other processes occupy GPU memory. Shouldn't such memory swap out to system memory? Maybe something changed in the NVIDIA driver?

doitsujin (Owner) commented:

System memory that needs to be made visible to the GPU cannot be swapped out, as far as I'm aware, and the Nvidia driver might have further limitations (probably for a good reason). FWIW, I've seen similar issues on amdgpu under low-memory conditions, although not directly related to DXVK.

In any case, if you consider this issue resolved by closing third-party applications, please close the issue.

doitsujin removed the bug label Nov 10, 2018
kakra (Author) commented Nov 11, 2018

There's definitely a bug somewhere leaking memory... If I play long enough, VRAM eventually fills up to 3.7 of 4 GB, and then games either freeze, crash, or behave strangely (flickering, missing textures/models). Something changed; I'm just not sure what. I've been using the same NV driver version for some time now, so it's not likely that the graphics driver changed something. I followed DXVK master closely. Maybe something new in DXVK triggers such a bug?

It's quite possible that the bug was there earlier, but something triggers it much earlier now. I've seen similar problems on very rare occasions before, but only after very long gaming sessions.

doitsujin (Owner) commented Nov 11, 2018

You can monitor DXVK's memory consumption (both VRAM and mapped system RAM) with DXVK_HUD=memory. I haven't seen any behaviour that would indicate a leak.
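
For example, in the Steam launch options (assuming the usual %command% placeholder convention, where %command% stands for the game's own command line):

DXVK_HUD=memory %command%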

kakra (Author) commented Nov 14, 2018

@doitsujin Yes, this is what the code says, but I'm sure there's still sysmem available, or the system could just swap stuff out to disk to make a small 128M allocation possible. Could this be a driver bug? After all, you're not allocating through standard C/C++ functions but through Vulkan functions.

doitsujin (Owner) commented:

I don't know what the issue is, but it seems that something eats unusual amounts of memory on your end. Have you tried running those games in a simple WM (like Fluxbox) without any applications running in the background?

SveSop (Contributor) commented Nov 14, 2018

Stupid question from my end: is the problem here that there is a memory leak eating more and more VRAM until the game crashes in e.g. TW3?
If so, I don't really see the same problem on my end: the 2.8GB of allocated VRAM shown happened in the first 2-3 minutes of playing TW3 yesterday and did not increase over the course of 2 hours of playing, loading/saving games several times without change. "In use" (committed? don't remember the wording) hovered around 2.2GB-2.5GB mostly.

I did not try to crash the game on purpose by overloading VRAM in some other manner though.

Just trying to troubleshoot on a different system than yours, to weed out any possible non-DXVK issues.

SveSop (Contributor) commented Nov 14, 2018

Did a wee bit of testing back and forth, and I can't really say I'm able to make something eat that much VRAM... Opening 10 Chrome windows did not chunk out a huge deal of VRAM either, TBH, but for all I know you could be running 200+ windows while editing a 4K movie in the background :)

What I DID notice, however (no difference between Proton 3.16 w/dxvk 0.90 vs. building my own DXVK from git), was that the "Memory Allocated" value in the DXVK HUD only ever increased and never went down, even if I loaded a saved game with less "Memory used".

That might be intended, and TBH it SHOULD not be an issue as long as it stays <4GB, I guess?
E.g.

nVidia-SMI: Witcher3: 1488
nVidia-SMI: (Total): 1880
DXVK Memory Allocated: 2030
DXVK Memory Used:      1837

Loading save games from different spots + running around and so on pushed "Memory Allocated" upwards, even though "Memory Used" goes up/down as needed. I'm not really sure what explains the discrepancy between nVidia-SMI (which I would deem "accurate", reading usage directly from the driver) and "Memory Used"?
nVidia-SMI's "Total" was 1880, somewhat more in line with the DXVK numbers, I guess, but the nVidia number includes Xorg, gnome-shell and stuff like that, so I would not think DXVK would be able to "read" that?

I did not test hours upon hours of gameplay, but after the 2 hours I played yesterday (mentioned above), I had 2.8GB "Memory Allocated", so I guess it MIGHT be something that just grows and grows until it becomes a problem? Is the memory allocation something that SOMETIMES gets cleared out? (Or rather SHOULD?)

kakra (Author) commented Nov 14, 2018

DXVK uses a chunk allocator, so it usually doesn't clean up, because some bit of information will always be left in a chunk. Chunks are probably allocated in 64 MB blocks. Within each chunk there is a free list of blocks, and DXVK allocates into the biggest block available (unless a free block matches exactly in size), provided the allocation request type matches the chunk type. It's similar to how btrfs manages its device space. If a chunk becomes completely free, it could be de-allocated, but that doesn't make much sense because you'd probably request a new chunk of memory just moments later. If no free block can be found, a new chunk is allocated from the device. Thus, it's normal that memory usage only increases until it peaks at some value.

A chunking allocator is pretty much the best thing you can do if you need to handle different and incompatible types of allocations; you just need to tune the chunk size so all allocation types fit without too much overhead and without too much wasted space. The "allocated" counter is probably what's been allocated as chunks; the "used" counter is what's actually used across all chunks. The difference is wasted space which wasn't used or couldn't be used due to incompatible memory type flags.

As far as I understand, chunks are allocated from the driver or the Vulkan layer, which in turn decides whether to allocate from the device or from system memory (depending on the flags given). Within each chunk, memory is managed by DXVK itself by keeping lists of free blocks (offset/size pairs), roughly like the sketch below.
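
To illustrate, here is a minimal C++ sketch of such a chunk with a free list. All names are hypothetical, alignment and memory-type matching are left out, and this is not DXVK's actual code:

// Hypothetical sketch of a chunk + free-list allocator, not DXVK's real code.
#include <cstdint>
#include <optional>
#include <vector>

struct FreeBlock { uint64_t offset, size; };

struct Chunk {
  std::vector<FreeBlock> freeList;  // free regions inside this chunk

  // Allocate out of the biggest free block, unless one matches exactly.
  std::optional<uint64_t> alloc(uint64_t size) {
    FreeBlock* best = nullptr;
    for (auto& b : freeList) {
      if (b.size == size) { best = &b; break; }  // exact fit wins
      if (b.size > size && (!best || b.size > best->size))
        best = &b;                               // otherwise prefer the biggest
    }
    if (!best)
      return std::nullopt;  // caller must allocate a new chunk from the device
    uint64_t off = best->offset;
    best->offset += size;   // shrink the block from the front
    best->size   -= size;   // (a real allocator would erase now-empty blocks)
    return off;
  }
};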

What happens in my case seems to be: DXVK asks Vulkan for a new chunk of device-local memory, Vulkan says "no", and DXVK tries again without the "device-local" flag, which allows using non-local memory (slower, because it's accessed over the PCI bus); roughly the fallback pattern sketched below. But Vulkan says "no" again, even though there's plenty of system RAM available to back such a chunk. I can only guess why. Maybe Vulkan cannot find system memory that would be mappable by the GPU. Not all of your physical address space may be available to the GPU because of chipset limitations, or because other devices already mapped it, i.e. another GPU, or I don't know what.
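
In Vulkan terms, that fallback looks roughly like this. A sketch only: device and memProps are assumed to exist (from vkCreateDevice and vkGetPhysicalDeviceMemoryProperties), and real code would also honor VkMemoryRequirements::memoryTypeBits:

// Sketch of "try device-local first, then fall back to any memory type".
VkMemoryAllocateInfo info = { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO };
info.allocationSize = 64ull << 20;  // one 64 MB chunk

VkDeviceMemory chunk = VK_NULL_HANDLE;
VkResult vr = VK_ERROR_OUT_OF_DEVICE_MEMORY;

// Pass 1: only DEVICE_LOCAL memory types.
for (uint32_t i = 0; i < memProps.memoryTypeCount && vr != VK_SUCCESS; i++) {
  if (memProps.memoryTypes[i].propertyFlags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) {
    info.memoryTypeIndex = i;
    vr = vkAllocateMemory(device, &info, nullptr, &chunk);
  }
}
// Pass 2: any memory type at all (the slower, non-local path).
for (uint32_t i = 0; i < memProps.memoryTypeCount && vr != VK_SUCCESS; i++) {
  info.memoryTypeIndex = i;
  vr = vkAllocateMemory(device, &info, nullptr, &chunk);
}
// If vr != VK_SUCCESS here, this is exactly the "Memory allocation failed" case.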

Overcommitting "solves" this because it lets Vulkan pretend that unused chunk memory isn't going to be used any time soon; such memory stays available to other allocations. The Linux kernel does a similar thing: allocated memory is only mapped to real memory once something writes to it; otherwise it stays idle. It's accounted as allocated RAM but not used RAM; that's the "virt" counter you see in top. But things start crashing if an application now actually wants to use its allocated-but-unused memory: the GPU won't find any space to satisfy the request, it fails, crash. Linux solves this by swapping to disk. The GPU could ask the driver to swap to sysmem, but as I understand it, Vulkan leaves that completely to the application, so DXVK would be in charge of doing so. DXVK doesn't implement this; it's complicated, and it should be avoided as long as you can.

So in turn that means: overcommitting does not crash for me, thus a lot of VRAM is only allocated but never used. Chrome (or Xorg) seems to allocate a lot of VRAM just because it can, but it never uses it.

To the experts: Does this make sense?

I'm running with two monitors: the left one is a full-HD TV (which I actually use for gaming from the couch, with a wireless controller), and the right one is a 4k PC monitor. I do no video editing, but some browser tabs may host paused or finished YouTube videos (which tend to be streamed in 4k quality). I also have multiple Gmail tabs open. At least back in 2014 there was a bug in Chrome where it would slowly eat away your VRAM if you had Gmail open over longer periods of time. But that has been fixed since then.

So overall I'm probably running a virtual framebuffer of (1920+3840)x2160 pixels at 32-bit color depth (I think it doesn't use a 24-bit buffer representation, but the color space is 24 bit). With triple buffering, that's (1920+3840) x 2160 pixels x 4 bytes x 3 buffers, which comes to about 142 MB of screen buffer. Probably there's some padding and alignment, but nothing to worry about...

Or a little less technical and abstract:

Think of your desktop (the real wooden one you put your keyboard and mouse on) as your VRAM. Every time you want to do something with the GPU, arrange a piece of coloured paper on your desktop. Put your information on the paper sheets. Different types of information use different coloured paper. At some point, either your desktop fills up and you can only use the space left on existing sheets, or the space left on the sheets is enough to work with. If your desktop fills up, you could start putting sheets elsewhere... on the floor, or into folders. But accessing those is much slower. Overcommitting is like using scissors to cut off unused parts of a sheet and replacing them with a different colour. But if the other application now has to put information there and there's no space left for the cut-off snippets, things will crash.

SveSop (Contributor) commented Nov 14, 2018

Thanks for a thorough explanation :)

I use 2x1080p monitors, but I have rarely seen VRAM use go past 2GB in the cruddy old games I play... save for TW3 (probably old as well), which after a while of playing is up toward 2.8GB allocated mem.
Now... I don't do 12+ hour gaming sessions without logging off, nor do I have many many Chrome tabs open while I game. I DO however sometimes watch a video of some quest, or read some shit WHEN I play, but nowhere near going OOM on VRAM. This COULD of course be worse if I played for a lot longer; as I said (and per your explanation), the game COULD be allocating chunks until VRAM is all spent? Dunno.

How long does it take you, from a clean boot, just loading up Steam and starting TW3, until you get errors? Because troubleshooting stuff in the realm of "Oh... yeah, you need to do a 12-hour playing session before that happens" is kinda... uhm... Well :)

As I said, Chrome seemed hard pressed to really use much VRAM for me, so I'm looking for something different, perhaps... some example code that can be run repeatedly until VRAM is spent? I found some references to GLSLHacker (GeeXLab) and some 4GB VRAM test thingy, but was not able to find that anymore. Opening a 4K video on YouTube seems to use a whopping 70MB of VRAM for me, so I dunno...

doitsujin (Owner) commented:

@kakra

The GPU could ask the driver to swap to sysmem, but as I understand it, Vulkan leaves that completely to the application, so DXVK would be in charge of doing so.

Actually no, it isn't. Once a memory chunk is allocated, Vulkan apps don't really have to bother with it; residency is managed by the driver. Even for device-local memory types, there is no guarantee that memory allocated from them is actually located in VRAM; it can be paged out if necessary.

kakra (Author) commented Nov 14, 2018

@doitsujin So we are back to "that doesn't seem to happen here". It only strengthens the theory that this is a driver / graphics stack issue on my side... Maybe related to configuration or the hardware memory layout...

kakra (Author) commented Nov 15, 2018

Okay, I managed to let SOTTR allocate more than 4GB of memory now without a crash, and TW3 also allocated around 3GB without a crash, with Chrome and some other windows open. I have a theory of what was going wrong on my system, but I need to test it a little more. The "slow down over time" issue in SOTTR also seems to be gone, but since my system is doing a lot of background activity currently, I'd like to defer the performance testing a little. Currently, stuttering is a lot more apparent, and I'm not sure whether it comes from background activity or changed settings. But the games seem to cope well now with Xorg/Chrome allocating a lot of memory, and the overall memory footprint of those seems a little lower now. I'll report back.

Fun fact: sometimes it helps to write elaborate texts explaining things to get a clue where a problem is. :-)

kakra (Author) commented Nov 16, 2018

Apparently, while testing various kernel configurations, I managed to crash my filesystem the hard way. I probably lost some important changes to the wine code, one of which is hard to recreate. Replacement drives are ordered, because I want to keep the broken file system around for recovery attempts. This throws me back about 1-2 weeks, so I'm going to pause working on this for a few days.

But to recap what I found out so far: Vulkan (or NVIDIA) seems to interact very badly with THP (transparent huge pages), at least in combination with wine. It wasn't able to allocate more RAM because there was just no mappable memory block left to allocate. This is probably a memory fragmentation issue; usually the kernel would defer huge page creation then. But I also noticed that my kernel didn't properly enable the IOMMU (which seems to be important for NVIDIA). I was still testing that part when the crash occurred.

Since THP can be a pretty nice performance-enhancing feature, I wanted to work out a proper configuration and document it. First tests showed that it makes a difference in performance: overall fps was mostly identical, but I did notice audio dropouts every now and then which I hadn't noticed before.

So I'll probably take the chance to rebase my work onto wine 3.21. I had just finished preparing and cleaning up the 3.20 release when everything went down the virtual drain. :-(

I think I'll be back up and running by next weekend. Thank you, Murphy: of course it was that very same day that I discovered my daily backup wasn't working.

Note to self: don't mix zswap with some workloads. Push often, even if still WIP.

Conclusion: if someone else is seeing this issue too, it may be due to THP being enabled but not fully and/or correctly configured. Could you check? grep ^ /sys/kernel/mm/transparent_hugepage/*

lieff commented Nov 16, 2018

Nvidia has a separate nvidia-uvm driver component for Unified Memory support. Not sure if it's CUDA-only, but maybe all CPU-visible allocations go through it, and maybe it has some tunable options.
Anyway, DXVK does reuse allocated blocks, and games allocate/free at runtime, so a fragmentation issue can't be resolved by the driver/system alone; a GPU memory defragmenter would be needed. Otherwise, ~2x GPU memory is needed in the worst-case scenario (i.e. a game could only use 2GB of a 4GB card).

kakra (Author) commented Nov 17, 2018

During my tests I enabled HMM, which is probably needed for full UVM support. But the uvm module is not used (its use count is 0), and it would probably be auto-loaded if something accessed the API; games don't, it seems. CUDA probably would, but I'm not sure how to test that.

The chunk allocator of DXVK seems to do a pretty good job. I've not seen 2x waste, usually just a few hundred MB. If the factor rises to 2 or more, it's usually only for a short time, because a game discards and reloads stuff. The strategy of filling the largest free block in a chunk first seems to work very well: it leaves space for bigger allocations, and if a big allocation cannot fit, a new chunk is allocated and that allocation will already take a good part of it, so overall slack is reduced.

But in my case I think it's not a problem of GPU memory; rather, the Linux kernel wasn't able to find a contiguous block of system memory. Usually a THP defrag would happen then, which introduces latency. Because of that latency, I turned on deferred defrag: an allocation is broken down into 4k pages and defragmented later. My guess at what happens here: such a broken-up allocation cannot be mapped for the GPU, so the allocation request is rejected, resulting in the out-of-memory error. If you turn off defrag, you easily run out of memory for applications early, even while top still shows enough memory available. THP uses a 2M page size here; it reduces pressure on the TLB, which can make up to a 10% performance difference.

The chunk size of DXVK is way above the THP page size, so it should not introduce additional fragmentation issues with GPU memory. I think the issue is purely within system memory.

Maybe the memory stats should show the largest block currently on the free list. I'm not sure whether the Linux kernel can tell you something similar. But displaying this could be difficult, as there are many different chunk types.

SveSop (Contributor) commented Nov 17, 2018

@kakra
A couple of questions, because I'm a total n00b:

  1. Why would you need the IOMMU for wine? (IOMMU being PCI passthrough for use with KVM/QEMU)
  2. Isn't nvidia-uvm for CUDA?
  3. Would you need to compile the nVidia kernel module with hmm=1 to get that support, given that your kernel of course has:
CONFIG_ARCH_HAS_HMM=y
CONFIG_HMM=y
CONFIG_HMM_MIRROR=y

Sorry if my questions are stupid... just trying to see the connection between DXVK and these options :)

kakra (Author) commented Nov 17, 2018

The IOMMU can be used for proper DMA remapping, according to the NVIDIA driver README. And when the crash happened, I was just getting started with HMM. I didn't check yet whether Gentoo automatically picks up the HMM setting; the crash came first. But modinfo stated HMM is available, as far as I remember. About UVM/CUDA: you are probably right.

I think the IOMMU and DMA remapping could work around the fragmentation issue, but I'm not sure. Without a hardware IOMMU, the driver uses software bounce buffers, which can be slower and disables some other features.

SveSop (Contributor) commented Nov 17, 2018

IOMMU is a virtualization feature that AFAIK requires your BIOS and hardware to support Intel VT-d (it also needs to be enabled in the motherboard's UEFI BIOS).

The IOMMU is most commonly used for mapping DMA/memory of your GPU and other devices for use in a VM (PCI passthrough). Or possibly to gain the use of 4GB of GPU memory in a 32-bit OS?? I don't understand what would need the IOMMU in wine or DXVK, TBH...

HMM on the other hand might be interesting... if the nVidia driver actually supports it?

SveSop (Contributor) commented Nov 17, 2018

All pre-PCI Express GPUs and non-Native PCI Express GPUs (often known as bridged GPUs) are limited to 32 bits of physical address space, which corresponds to 4 GB of memory. On a system with greater than 4 GB of memory, allocating usable DMA buffers can be a problem. Native PCI Express GPUs are capable of addressing greater than 32 bits of physical address space and do not experience the same problems.

https://us.download.nvidia.com/XFree86/Linux-x86_64/331.79/README/dma_issues.html

This seems to be talking about pre-PCI-Express cards... So from that I would understand it's a benefit if you have a PCI nVidia GPU (with only 32-bit address space support). Probably not useful at all for a GTX 970 PCI-E card... no?

kakra (Author) commented Nov 17, 2018

It's not a feature used by DXVK directly; it's used by the driver. My BIOS doesn't have an IOMMU setting, so I needed to put intel_iommu=on on the kernel cmdline, and dmesg stated it is enabled now.

But that's all beside the point here; I was trying to fix problems introduced by THP. Nothing I wrote is about the IOMMU introducing issues in DXVK, so there's hardly a point in discussing where the connection to DXVK is in the initial problem.

SveSop (Contributor) commented Nov 17, 2018

There usually is no "IOMMU setting" in the BIOS. If I want to enable the IOMMU in the kernel, I have to enable the BIOS setting called "Intel Virtualization - VT-d". If that is disabled (the default), I can't use the IOMMU (or I get an error). There are two parts to this: one is the CPU, the other is the motherboard.

From what I gather from the nVidia documents, it is not something you should need other than in special cases. I guess if you have a mining rig with 20 8GB 1080 cards or something, this is needed to overcome some memory boundaries? Dunno.
https://download.nvidia.com/XFree86/Linux-x86_64/396.54/README/README.txt
Chapter 35 says:

For example, it is common for a system with 512 GB of RAM installed to have
physical addresses up to ~513 GB. In this scenario, a GPU with an addressing
capability of 512 GB would force the driver to fall back to the 4 GB DMA zone
for this GPU.

And I get:

cat /proc/driver/nvidia/gpus/0000\:01\:00.0/information 
Model: 		 GeForce GTX 970
IRQ:   		 150
GPU UUID: 	 GPU-6d548f8e-645a-6933-c892-f86a2f2fe8aa
Video BIOS: 	 84.04.36.00.f1
Bus Type: 	 PCIe
DMA Size: 	 40 bits
DMA Mask: 	 0xffffffffff
Bus Location: 	 0000:01:00.0
Device Minor: 	 0
That is, 40 bits of DMA addressing corresponds to 1 terabyte.

I don't have >1TB of memory, so I would think I'm in the "safe zone".

PS: To enable the IOMMU by default and not have to use intel_iommu=on, you need to compile the kernel with CONFIG_INTEL_IOMMU_DEFAULT_ON=y.

IMO: Not something you "need" unless you (possibly) have >1TB of memory or a large number of GPUs... or unless you use KVM/QEMU to run a VM with PCI passthrough. (And if you DO, you usually want to pass the nVidia GPU through to the VM, and need to disable it on the host completely, with no driver.)

kakra (Author) commented Nov 19, 2018

@SveSop I think the IOMMU does not depend on VT-d; rather, it's the other way around if you need features like PCI passthrough.

You are citing the manual. It only says that the driver falls back to the 4 GB DMA zone if you have more than 512 GB of memory, because the GPU cannot address more than that.

The IOMMU allows bypassing the DMA zone because address translation and process isolation are done in the north bridge.

My idea was that the driver could utilize the IOMMU to overcome fragmentation issues introduced by THP: a contiguous mapping of GPU memory would no longer need to align with system memory in the same way... The IOMMU could simply remap memory accesses when THP deferred page merging allocates memory in 4k pages first.

It's not that I want to overcome memory addressing issues; I don't have those, my system has 16 GB of RAM. And I don't want to remap memory for VM passthrough. I just wanted to try to break the constraints between memory mapping within the GPU and memory mapping in sysmem.

BTW: My BIOS says "VT-d not available" because my CPU does not support that feature, but it should still have IOMMU support. I get no error when I put intel_iommu=on on the kernel cmdline, and dmesg says DMAR uses the IOMMU now. Surprisingly, it's not enough to set CONFIG_INTEL_IOMMU_DEFAULT_ON=y; that simply doesn't generate the same DMAR message in dmesg.

If you think you don't need it: I didn't ask you to enable it on your side. I'm still researching whether I can make use of it, and will document it if I find something useful. Current findings: the DXVK error occurs when I use THP=always. With THP=madvise the error no longer shows up, but I get audio dropouts; not sure why or what happens yet. With THP=never some workloads on my system are slower, probably also with THP=madvise, because there are many more TLB cache misses then. I had been working with THP=always successfully for a long time (since deferred defrag was introduced to overcome latency spikes). Only lately did DXVK start to show errors. That's probably not a DXVK issue; something else changed. But since switching THP modes definitely changes things, I'm probably on the right track.

I'm only guessing what the problem may be: DXVK asks for 64 MB of memory. The GPU allocates contiguous memory, and the driver asks the kernel to find 64 MB of contiguous memory in sysmem. It cannot (or doesn't want to) do that because deferred defrag is turned on, so it allocates the 64 MB as 4k pages scattered around physical memory instead of 2M pages. The GPU cannot map that because it's not contiguous yet, the allocation fails, DXVK goes boom. But without deferred defrag I get very noticeable latency spikes throughout the whole system, without even looking at graphics performance in DXVK. In theory, with an IOMMU it no longer matters what the physical layout of GPU-mapped memory is in sysmem. In practice, I'm not sure whether the IOMMU could handle such big allocations; from reading the manuals, it's usually meant to be used with much smaller memory blocks.

Old-school DMA is limited to a 4 GB address space without address translation support, as far as I've figured out so far, so DMA buffers can only be placed in the first 4 GB of memory. The IOMMU can bypass this. DMA also doesn't support process isolation wrt memory access by a device; the IOMMU does. That's why it's an integral, useful feature for virtualization, but its use is not limited to that; it's just one small piece of virtualization technology. DMA allows PCI(e) devices to address system memory; the IOMMU allows those addresses to be transparently translated to other physical locations in the north bridge.

kakra (Author) commented Nov 19, 2018

@SveSop, I wrote:

The IOMMU can be used for proper DMA remapping, according to the NVIDIA driver README. And when the crash happened, I was just getting started with HMM. I didn't check yet whether Gentoo automatically picks up the HMM setting; the crash came first. But modinfo stated HMM is available, as far as I remember. About UVM/CUDA: you are probably right.

BTW: I enabled HMM and verified that the driver actually compiled with HMM enabled. But my system is unstable with it (freezing more or less randomly), so I stopped pursuing this idea and reverted the driver change.

SveSop (Contributor) commented Nov 19, 2018

There is a lot I don't understand about this, so I guess it's kinda silly for me to comment on it.
I do not really think you would gain much benefit from the IOMMU, but that could of course be a faulty assumption. There is also a BIOS option (on my MB) that reads "Above 4G decoding", with some notion of "remapping PCIe cards above the 4GB limit". That could be relevant.
This setting might vary among manufacturers, or possibly not be an option at all, for all I know. I mention it because on my motherboard it is a separate option that has something to do with remapping this above 4GB instead of the "normal" <4GB address. Whether Asus does this "right" or not, I have no idea... I do have it enabled though.

When it comes to HMM, it does look interesting, so some experimenting with it would be warranted. What option did you set, or how did you compile the nVidia binary driver to get HMM support?

kakra (Author) commented Nov 19, 2018

@SveSop

In Gentoo it's easy: create /etc/portage/package.env with this line:

x11-drivers/nvidia-drivers hmm

Then create /etc/portage/env/hmm with this line:

NVIDIA_BUILD_SUPPORTS_HMM=1

Now reinstall the driver, add a modprobe.conf entry with "options nvidia-uvm hmm=1", and reboot.

If your system becomes unstable, just revert the changed line in /etc/portage/package.env and rebuild.

I cannot say how to do this on other distributions. If you use the original NVIDIA installer, it's probably a matter of running it with the variable exported:

$ sudo NVIDIA_BUILD_SUPPORTS_HMM=1 /path/to/installer/NVIDIA-Linux-...run

The header files in the installer explicitly state that this needs kernel patches upstreamed before it becomes the default; thus it is currently opt-in only, via this variable.

SveSop (Contributor) commented Nov 19, 2018

It seems HMM is part of nvidia-uvm. And nvidia-uvm is the "CUDA" module, is it not?

lsmod | grep -i nvidia
nvidia_drm            45056  4
nvidia_modeset      1093632  10 nvidia_drm
nvidia             14090240  425 nvidia_modeset
drm_kms_helper       208896  1 nvidia_drm
drm                   532480  7 drm_kms_helper,nvidia_drm
ipmi_msghandler      110592  2 ipmi_devintf,nvidia

So, no nvidia-uvm module loaded. I do not use CUDA.

I could try adding NVIDIA_BUILD_SUPPORTS_HMM=1 when building the kernel module through dkms, but I don't really see the point if it only adds HMM support to the nvidia-uvm module, as I don't use CUDA.

Starting a CUDA test through wine did load the module though:
[ 895.239558] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 511
nvidia_uvm 917504 0

But I'm not sure about the relevance of that to DXVK?

kakra (Author) commented Nov 19, 2018

No, UVM is not CUDA. CUDA is for accessing the compute units of the GPU; UVM is the unified memory management, which is not exclusively reserved for use by CUDA. But I don't think anything else currently uses it except CUDA, so there's no relevance to DXVK.

UVM enables applications (and also CUDA) to simplify memory access; an application no longer needs to manually copy data back and forth. It could theoretically be used for any GPU acceleration, not just CUDA but also 3D. As far as I can tell, neither Vulkan nor the driver itself currently uses it; probably only the CUDA API does.

But enabling HMM seems to have some effect, otherwise my system wouldn't become unstable with it. At this stage, though, performance tests are impossible for me because of that instability.

But please stop thinking in terms of CUDA and virtualization. UVM, HMM, IOMMU, DMAR etc. are just building blocks of those; they are not the implementation of CUDA or VT-d. You aren't forced to build a house out of those bricks; you could also build a tower or a bridge. I think CUDA is used by NVAPI, which in turn is used by some games. But apparently the wine implementation of that is incomplete (though it forwards calls to CUDA, as far as I understand the code).

SveSop (Contributor) commented Nov 19, 2018

I know "UVM is not CUDA", but so far CUDA uses UVM, and i don't really know anything else that does. FFMPEG acceleration... cuda. Bitcoin mining.. Cuda.

Mostly the point being that dxvk does so far not use cuda, or load the nvidia-uvm module. Why HMM enabled should create problems outside of the uvm module, i cant really say tbh. I might give it a whirl to see if i get instability i guess.

Now, getting this to work could be interesting if it provided benefits, but that in turn depends if dxvk needed to support something that only exist for nVidia? Since HMM is supported in kernel, could it be used "outside" of needing nvidia driver support? Seen talks about nouveau and HMM support.. but so far its a useless project when it comes to performance due to nVidia's lack of open-source support.

Loading the nvidia-uvm module by a simple modprobe wont really DO anything tho...

kakra (Author) commented Dec 12, 2018

I'm closing this with the following results: disabling transparent huge pages in the kernel fixes the problem, at the cost of a quite noticeable performance loss. But there's a compromise: using transparent huge pages in "madvise" mode keeps the memory optimizations for software that explicitly requests them, so TLB misses have less impact on DXVK performance. The advantage ranges from slight to none, though, depending on your system configuration.
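
"madvise" mode means the kernel only uses huge pages for memory ranges an application explicitly opts into. A minimal sketch of that opt-in from the application side (error handling omitted):

#include <sys/mman.h>

// Map 256 MB anonymously, then ask the kernel to back it with
// transparent huge pages (only honored in "madvise" or "always" mode).
void* buf = mmap(nullptr, 256u << 20, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
madvise(buf, 256u << 20, MADV_HUGEPAGE);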

I'm instead experimenting with a patch to wine to support large-page mode for applications. "Middle-earth: Shadow of War" is supposed to support it, and I can successfully enable it in the game now; but tracing wine shows that the game still does not make the needed system calls.
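
For reference, a large-page-aware Windows game would roughly make this call, which such a wine patch has to translate. This is just a sketch; it additionally requires the process to hold SeLockMemoryPrivilege:

#include <windows.h>

SIZE_T large = GetLargePageMinimum();   // 0 if the system doesn't support large pages
void*  mem   = VirtualAlloc(nullptr, 64 * large,
                            MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                            PAGE_READWRITE);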

If anyone wants to help with that, here's my wine-proton project:
https://github.com/kakra/wine-proton

Another issue discovered during the course of this report: web browsers tend to allocate a lot of VRAM, directly and through the X server. This has a direct impact on game performance. I doubt there's anything DXVK could do about it; shuffling the idle VRAM of foreign processes between sysmem and VRAM is probably not DXVK's concern. Such VRAM shortage has direct consequences under transparent huge pages: the game may crash with memory allocation failures despite plenty of memory being available, as initially outlined in this report. It's probably some hidden memory fragmentation problem, and there's probably nothing to be done about it unless the Vulkan/graphics driver becomes aware of transparent huge pages. The fix for now is to not force transparent huge pages but use "madvise" mode instead; I recommend combining it with deferred defragmentation (see the commands below).
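
For reference, the THP policy can be switched at runtime through sysfs (as root; the exact set of accepted values varies between kernel versions):

echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
echo defer > /sys/kernel/mm/transparent_hugepage/defrag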
