DXVK takes large portions of RAM and never frees them, causing system freezes #632
Comments
Does this mean that the memory doesn't get freed even when you terminate the game? Is this a regression / do older DXVK versions work? If so, which is the last one that doesn't have the issue? |
Yes, it is never freed. I close the game and the system still can't use it. I open the game and it can't use that memory again. Edit: while I used older versions of DXVK, I was never able to play really long sessions for a multitude of issues (GPU hangs, very low performance, Wine issues), so I can't say if older versions were good or not. I discovered this problem when I was doing tests about this bug. |
@FurretUber What happens when you clear PageCache, dentries and inodes? |
While this sometimes happens, usually Steam doesn't allow launching a second copy of a game. But you can check to be sure. |
@edmondo I use
The cache does not drop below 2 GB so, basically, 25% of the system memory is unusable. I played DiRT 3 today and the cache only kept growing. I wrote a simple script that writes the date, time and memory usage every 2 seconds, and the caches went from 100 MB to 2 GB. When I played Euro Truck Simulator 2 (native, so no DXVK) after playing DiRT 3, the cache increased to up to 4,5 GB but, as I closed the game, it went back down to 2 GB. Later I played DiRT 3 again and the cache was 2,6 GB. Even after using the command above to drop the caches multiple times, the minimum it can reach is 2,6 GB. This ~600 MB growth in cache that is impossible to drop happened after 35 minutes of playing. |
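A minimal sketch of the kind of logging script described in the comment above, not the script that was actually used. It assumes the numbers come from /proc/meminfo; the function names and the choice of fields (MemFree, Cached, Slab, SUnreclaim) are illustrative assumptions.

```python
import time
from datetime import datetime

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in KiB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def format_sample(values, now, fields=("MemFree", "Cached", "Slab", "SUnreclaim")):
    """Build one log line: timestamp followed by the chosen fields in MiB."""
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    cols = " ".join(f"{f}={values.get(f, 0) // 1024}MiB" for f in fields)
    return f"{stamp} {cols}"

def log_forever(path="/proc/meminfo", interval=2.0):
    """Print one sample every `interval` seconds until interrupted."""
    while True:
        with open(path) as fh:
            print(format_sample(parse_meminfo(fh.read()), datetime.now()), flush=True)
        time.sleep(interval)
```

Calling `log_forever()` and redirecting the output to a file gives a timeline that can be compared before, during and after a game session. Logging the slab fields alongside Cached helps tell reclaimable page cache apart from kernel allocations.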
@FurretUber Thanks. I have some more questions, if you have time to check. Is your /tmp mounted in memory (tmpfs)? Can you provide the meminfo content before and after? |
/tmp is tmpfs, but the tmpfs filesystems usage is pretty small:
The amount is around 90 MB from filesystems in RAM, so it wouldn't justify this cache. Right now, the cache that is impossible to drop is close to 6,4 GB; it is pretty uncomfortable to use the computer with so little usable memory. I limited the CPU clock to 50% (1,2 GHz) and left Forsaken Castle running; the game ran at 49 to 65 FPS. With the CPU clock at 100% (2,3 GHz) the game would run at 90 to 130 FPS. If I leave the CPU clock at 100%, Forsaken Castle makes the system freeze due to memory usage. Limiting the CPU clock to 50% prevented the freeze, but the cache still increased significantly, from 2,6 GB to 6,4 GB in 7 hours. It looks like the higher the framerate, the faster the cache fills. About the Before: After: |
Can you provide tests without DXVK? |
This is very strange, because it's normal behaviour for cache not to be freed after the application using the cached pages exits. The Linux kernel doesn't garbage collect cached objects, as doing so costs more resources than just leaving the cache be; it is essentially free memory for any application that needs it. That you reach a system freeze while the cache is still in use, as in your screenshots, makes me think that the problem lies elsewhere. Even if the Wine process somehow "zombified" and kept holding part of the memory, those cached pages would still be freed if the kernel or another process required memory. My own system has frozen when using DXVK, but I suspected a GPU hang rather than a memory pressure problem. It would be interesting to look at the shader cache size. Also, can you make sure Wine is not running in DEBUG mode? Proton dumps debug output to a file in your home directory. It's quite easy for such a file to grow significantly in a short time (perhaps as fast as the game's graphics calls are issued, which would explain why the cache grows faster at higher framerates), so it may be worth checking. One other thing I noticed is that you're using an Intel iGPU with Mesa. Intel iGPUs use system memory, shared with all other applications. Could it be that somehow the iGPU can't free VRAM from cache? I do not have enough expertise to investigate that though... |
@FurretUber Hm, following is intriguing:
Can you post the output of slabtop please? |
@notfood After 5 hours of Forsaken Castle running with Wine's d3d11, the cache amount was restored to the value it had on boot after closing it: cache rose up to 1022 MB with the game open and, after closing, it reduced to 505 MB. Dropping caches reduced it further to 369 MB. Stopping LightDM and dropping caches reduced it to 252 MB. With DXVK, in 7 hours the cache increased by 3,8 GB. @DistantThunder While cache is useful to speed up applications, it should be freed when the system requests more memory. This cache created by DXVK is not being freed, leaving 1,45 GB of usage, 1,15 GB of swap and 6,4 GB of cache not freed; this is not normal behavior. I have checked and there are no processes related to steam, dxvk, exe, dirt3, castle, wine after I close the games. If
As the Intel GPU has unified memory, video memory and RAM memory share the same space. What I know is that with OpenGL applications memory is freed properly. The remaining thing to test would be Edit: @edmondo Use this command after DXVK takes the memory? I rebooted because the system was in a pretty bad state with only 1,45 of 7,76 GiB usable. If it happens again I'll use that command. |
Applications can indeed lock part or all of their allocated VM pages, but that is an explicit mechanism (mlock) for which I can't find any calls in the DXVK code. What you can also do with your iGPU is test whether reducing the maximum VRAM amount in the BIOS/EFI has any influence on the problem. |
@FurretUber We don't know right now what's "using" the slab cache, so maybe slabtop sorted by cache size could give us a hint. |
Possibly unrelated but I just crashed out of Arkham Knight due to an OOM condition. Found just over a thousand instances of Killing them worked but it wasn't immediately obvious because they were each only a tiny fraction of the consumed RAM. |
anyone tested with wined3d? |
Yes, I tested: no memory increase was noticed after 5 hours with Wine's d3d11. "After 5 hours of Forsaken Castle running with Wine's d3d11, the cache amount was restored to the value it had on boot after closing it: cache rose up to 1022 MB with the game open and, after closing, it reduced to 505 MB. Dropping caches reduced further to 369 MB. Stopping LightDM and dropping caches reduced to 252 MB. With DXVK in 7 hours the cache increased 3,8 GB." Dota 2 download finished. Hopefully, within the next 12 hours, I'll have an answer on whether this issue is a Vulkan problem (which would be an Intel Mesa issue) or exclusive to DXVK. For now, from my tests: Native OpenGL: memory is deallocated properly; Is somebody else testing to see if this issue happens? It should be with a GPU that uses unified memory. |
what about Wine Vulkan but without DXVK? |
How can I test this? |
just run Windows Vulkan applications in Wine |
OK, so I'll run the Vulkan demos from here to test Wine Vulkan. DiRT 3 filled memory with cache (~140 MB), and the command @edmondo suggested shows i915_request as the resource/program/thing using most of the cache, at 148528 K. It was not even appearing before. I played Dota 2 for around the same time as I played DiRT 3, and Dota 2 did not fill any memory that wasn't deallocated when it was closed. The question is: which test is the most important now? I thought of doing one of these two things for approximately six hours, with the CPU limited to 50% to reduce noise: -Leave Dota 2 running with native Vulkan in the tutorial or watching matches, if possible If you have an idea for another test that could give more meaningful results, please suggest it (and explain if it's hard to do 😛 ). |
Dump /proc/slabinfo: http://man7.org/linux/man-pages/man5/slabinfo.5.html Running |
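As a rough sketch of what to look for in a /proc/slabinfo dump, the snippet below parses the version 2.1 text format described in the man page linked above and sorts caches by an approximate size. The function names are hypothetical, and `num_objs * objsize` is only an estimate that ignores per-slab overhead. Reading /proc/slabinfo usually requires root.

```python
def parse_slabinfo(text):
    """Parse /proc/slabinfo text and return caches sorted by approximate size."""
    caches = []
    for line in text.splitlines():
        # Skip the version banner, the column header and blank lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name = fields[0]
        active_objs, num_objs, objsize = (int(x) for x in fields[1:4])
        caches.append({
            "name": name,
            "active_objs": active_objs,
            "num_objs": num_objs,
            "objsize": objsize,
            # Approximate footprint in KiB: object count times object size.
            "kib": num_objs * objsize // 1024,
        })
    return sorted(caches, key=lambda c: c["kib"], reverse=True)

def objects_in(caches, name):
    """Return the object count of one named cache, or None if it is absent."""
    for cache in caches:
        if cache["name"] == name:
            return cache["num_objs"]
    return None
```

Taking one dump before a game session and one after, then comparing `objects_in(caches, "i915_request")` between the two, shows whether that cache keeps growing.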
What happens if you run Doom 2016 in Wine? Doom 2016 uses Vulkan. |
@FurretUber It seems almost certain that we have a memory leak in the i915 kernel driver. memleak would be perfect for figuring this out, if someone implemented kernel support in the -O option. The easiest way of doing this is to run memleak for a really long time and have it filter by the age of the allocations. Alternatively, there is this: https://www.kernel.org/doc/html/v4.10/dev-tools/kmemleak.html If you dig around enough, you might find a script someone wrote for using ftrace output to identify memory leaks. Another OSS developer who I know used it a few weeks ago on a kernel memory leak, but it is likely a pain to use. My suggestion is to rely on memleak from iovisor/bcc. |
@setsunati Rather than ask what happens when he runs X, Y and Z, it might be better for him to profile the kernel's memory allocations to identify the call stacks that are allocating the memory that leaks. My guess is that the kernel is not calling |
My system seems to be suffering from this too. I can get the number of objects allocated in the i915_request slab cache to rise by running Rise of Nations: Extended Edition (Direct3D 10). It is currently at 442064 objects, which I suspect to be abnormal. Restarting Xorg doesn't have any effect on it. I won't have time to identify the call stack until Wednesday at the earliest. I am running Linux 4.18.0-rc8 (and really ought to update to a newer kernel, but that is for another day). My CPU/GPU is a Xeon E3-1276v3. |
@ryao For now (I have not started any game yet)
I suppose I have to play some game to see this in action. If this is an i915 issue, why does it seem to affect only DXVK? Native Vulkan and Wine Vulkan both seem fine. |
@FurretUber the command that I gave you will only display outstanding allocations that are at least 1 hour old. It occurs to me that you might need to leave it running for a few hours before the leak shows up in the top 10 being printed. As for why only DXVK is affected, it could be triggering a buggy code path inside the kernel. If we know how the leaked memory was allocated, we should be able to work backward to find out how it could fail to be freed. Anyway, i915_request is part of the i915 kernel driver, so it is responsible for freeing this memory. |
@FurretUber Actually, the command that I gave you will only display outstanding allocations that are 10 hours old. I put 1 too many zeros in the cutoff. You want |
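The arithmetic behind the "one too many zeros" slip can be sketched as follows. This assumes the age cutoff is given in milliseconds, which is an assumption here because the original command is not shown. The helper names and the allocation record layout are hypothetical.

```python
def hours_to_ms(hours):
    """Convert hours to milliseconds."""
    return int(hours * 60 * 60 * 1000)

def older_than(allocations, now_ms, min_age_ms):
    """Keep only outstanding allocations that are at least min_age_ms old."""
    return [a for a in allocations if now_ms - a["t_ms"] >= min_age_ms]

one_hour = hours_to_ms(1)    # 3600000
ten_hours = hours_to_ms(10)  # 36000000, i.e. one extra zero
```

With a 10 hour cutoff nothing is reported until allocations have been outstanding for 10 hours, which would explain why a short run appears to show no leak.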
"Good" news, this appeared |
@FurretUber That narrows it down to the allocation made in It is possible to instrument specific lines using the The right command depends on your kernel version because line numbers vary. See the example on line numbers here: http://www.brendangregg.com/perf.html#DynamicTracingEg The example also includes local variables. We might not need that, but it would not hurt to capture the values of fences, in_fence and out_fence if they are available. This could take some trial and error to get right. It also requires having your kernel sources installed in addition to debuginfo for it to list the function definition. Unfortunately, I do not have time to probe the i915 driver today, but there should be enough information at that link (and in the man pages) for the uninitiated in kernel debugging to get started. By the way, you can check if the issue occurred by doing |
I had reported this issue upstream: https://bugs.freedesktop.org/show_bug.cgi?id=107899 |
To add a bit more information, the normal execution path does https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/i915_gem_execbuffer.c#L2219 Presumably, the allocation is eventually freed when this is called. When an error occurs that executes Someone might wonder why I am not interested in any of the other goto statements when they also jump over |
It was reported on FreeDesktop? It's a bit funny because I went GPU hang -> FreeDesktop report -> test patch -> find memory allocation error -> GitHub report -> discover the problem is in kernel. I have to build perf for my kernel version (4.17.18, as 4.18 and 4.19 have a regression hitting me, I hope developers see the kernel bugzilla before 4.19 is released). I tried using
Looks like I'll have to build the kernel. Don't wait for me, this will take a long time. |
You could ask your distribution’s developers how to install the kernel sources for debugging purposes. |
@ryao I have built the kernel and got this from using Edit: I tried one and it has only: Edit 2: I used the following commands to inspect:
Edit 3: I built a perf with both x86 and x86_64 options enabled. Unfortunately Edit 4: Good news! Investigating line 70 printed relevant log information. The file is compressed because it's too big. I'll test other lines to see if there is further information, but this shows that only Forsaken Castle has a value set for |
Line 70 is not what I meant. I meant the goto statements around line 192, but perf is not showing those as being instrumentable. Line 70 will be executed every single time to check the condition. Unfortunately, it does not yield useful information. ZFS has a SET_ERROR macro that was ported into the Linux driver. I think we might need to patch this function to use it, which involves getting it from the ZFSOnLinux source code and adding it to the kernel headers. I have plenty of offline stuff to do, so I doubt that I could do that tonight, but that is probably the next step here unless one of the Intel guys who understands the i915 driver is willing to eyeball it or reproduces it on his end. |
Investigating line 8 I found something pretty relevant. In line 8 there was
rolc.exe is a Wine OpenGL game, Xorg is... Xorg and Forsaken_Castle is Wine with DXVK. Notice the huge difference between the values of the |
It occurs to me that it is possible to get much of the same data that SET_ERROR would give by tracing lines 191, 197, 203, 220 and 222 simultaneously. We would need to get the value of err on line 222. |
I can do this check, I think. Looks like |
Printing out in_fence on line 8 is not that useful because it is set further down based on the args->flags variable. There is a way to access the args->flags variable in a perf probe, but it is annoying to do. I think Brendan Gregg's site has an example. It requires calculating the offset of the struct member, adding it to the variable and then dereferencing that. The values for fences are interesting. This code is being passed user space pointers for Xorg and rolc.exe while it is passed a kernel pointer for Forsaken_Castle: https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt That is in the KASAN shadow memory, which means that something is wrong. I am not sure offhand why the pointer would go into the KASAN shadow memory region (I assume that you are not running KASAN). I would need to look at the code further when I have time. |
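The "offset of the struct member, add it to the pointer, dereference" step can be illustrated in user space with ctypes. This is only a sketch of the idea, not a perf probe: `DemoArgs` is a hypothetical struct and does not reproduce the real i915 argument layout.

```python
import ctypes

class DemoArgs(ctypes.Structure):
    """Hypothetical stand-in for an ioctl argument struct (not the i915 layout)."""
    _fields_ = [
        ("buffers_ptr", ctypes.c_uint64),
        ("buffer_count", ctypes.c_uint32),
        ("pad", ctypes.c_uint32),
        ("flags", ctypes.c_uint64),
    ]

def read_u64_member(base_address, offset):
    """Dereference a 64-bit value located at base_address + offset."""
    return ctypes.c_uint64.from_address(base_address + offset).value

args = DemoArgs(buffers_ptr=0x1000, buffer_count=3, flags=0x30000)

# Step 1: find the byte offset of the member inside the struct.
flags_offset = DemoArgs.flags.offset

# Step 2: add it to the struct's address and dereference the result.
flags_value = read_u64_member(ctypes.addressof(args), flags_offset)
```

In a probe, the same two numbers are needed: the base pointer held in the variable and the member's offset, which depends on the exact struct definition in the kernel being traced.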
Output from those lines was empty, which is pretty strange. Maybe one by one? Edit: Really, line 191 output is empty, there is nothing in it. Edit 2: line 222 has the following:
I'll test the others you suggested Edit 3: line 220:
|
@FurretUber You actually need to profile all of them and dump the output. Just 1 line individually won't be helpful. The idea is to see what path is executed. I can infer it by seeing which probes fire in what order. The output from some lines being empty is expected. To be more clear, you can just do Also, to make my previous comment about capturing args->flags more useful, if you install the dwarves package, you can do |
Thank you, I was not aware this was possible to do, as today is the first time I'm using
The lines of code are here: linhasdocodigo.txt The output of the script is here: |
@FurretUber you are doing well for the first time that you have ever used it. I did not do better with perf on my first time using it. I will need to look at that output later. I have some offline stuff to do. If it helps, I can say offhand that by looking at the thread IDs and at which probes fired in sequence in the same thread, you can work out which goto statement executed. If the probes before a goto statement fire and then nothing fires until the probe on the last statement, it means that the goto right after the last statement whose probe fired was the one executed. |
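The inference described above can be sketched as follows. It assumes the probe hits have already been extracted into time-ordered `(thread_id, line)` pairs; that input format and the function names are hypothetical, and the line numbers are the ones suggested earlier in the thread.

```python
from collections import defaultdict

def paths_per_thread(events, last_line):
    """Split (tid, line) probe hits into per-call sequences ending at last_line."""
    open_calls = defaultdict(list)
    finished = defaultdict(list)
    for tid, line in events:
        open_calls[tid].append(line)
        if line == last_line:
            # The final probe fired, so this call through the function is complete.
            finished[tid].append(open_calls.pop(tid))
    return dict(finished)

def skipped_probes(path, probe_lines):
    """Return the probe lines that never fired in one call, i.e. code jumped over."""
    return [p for p in probe_lines if p not in path]

PROBES = [191, 197, 203, 220, 222]
events = [(100, 191), (200, 191), (100, 197), (100, 203),
          (100, 220), (100, 222), (200, 222)]
paths = paths_per_thread(events, last_line=222)
```

Here thread 100 hits every probe, while thread 200 goes from 191 straight to 222, so `skipped_probes(paths[200][0], PROBES)` lists 197, 203 and 220 as jumped over. If no call skips any probe, no early exit was taken.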
@FurretUber I looked at the output. Contrary to my expectation, the goto statements are not being executed, which means this is not a bug in error handling. |
Jason Ekstrand wrote a fix for the issue. It is available from the freedesktop.org bug tracker: https://bugs.freedesktop.org/show_bug.cgi?id=107899 I have reviewed and tested it. It appears to resolve the issue entirely in Rise of Nations: Extended Edition. The patch itself does a fairly good job of explaining what went wrong (and also why I missed the issue when I tried to debug it). @FurretUber reported a small number of long-lived allocations even with the patch, but further testing by him shows that they get garbage collected by an OpenGL game. I am unable to reproduce them. It might have something to do with me using KDE 5. After this goes into mainline, someone should ping the linux-stable mailing list so that it is properly backported to Linux 4.14.y and 4.18.y. We probably should ping Canonical so that they apply it to their kernel too. |
I've submitted the kernel patch to the mailing list. Hopefully, it will land fairly soon and we'll make sure it gets back-ported as far as needed. |
The fix has landed in drm-misc-fixes; it will propagate to a kernel release near you shortly. |
@FurretUber is it fixed now for you? |
Yes, it is since 4.19-rc6. The fix was backported to 4.14.76 and 4.18.14 too. |
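A small sketch for checking a kernel release string against the versions named in the comment above. It only knows what the thread states (4.19-rc6, 4.14.76 and 4.18.14) and returns `None` for branches the thread does not cover. The function name is hypothetical, and distribution kernels may carry the patch under a different version number.

```python
import re

# Versions taken from the comment above; other branches are not covered.
FIXED_STABLE = {(4, 14): 76, (4, 18): 14}
FIXED_MAINLINE = (4, 19)
FIXED_RC = 6

def has_fix(release):
    """Return True or False when the thread's data decides it, otherwise None."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(?:-rc(\d+))?", release)
    if not m:
        return None
    series = (int(m.group(1)), int(m.group(2)))
    patch = int(m.group(3) or 0)
    rc = m.group(4)
    if series in FIXED_STABLE:
        return patch >= FIXED_STABLE[series]
    if series == FIXED_MAINLINE and rc is not None:
        return int(rc) >= FIXED_RC
    if series >= FIXED_MAINLINE:
        return True
    return None
```

On a running system the input could come from `platform.release()` in the standard library, for example `has_fix(platform.release())`.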
Since this was a kernel bug that has been fixed upstream, I'm going to close this. |
wow, this just worked |
I am still experiencing this. Latest Proton and DXVK:
I am running a game through proton: saints row 3 remastered, it occupies ~14 GB of RAM. It cannot properly exit, so I have to kill it each time. Afterwards I do not see any orphaned processes occupying my RAM -- |
And the conclusion that this isn't our bug hasn't changed. Please make sure you're running up-to-date kernel and Mesa versions for your Intel GPU. Also, what exactly are the first 5 lines of vulkaninfo output supposed to tell us? You literally cut out all useful info. |
I have noticed that when playing games using DXVK the amount of RAM available to the system slowly decreases. Initially the value is not significant, but in longer sessions (2 hours, for example), the amount of RAM taken is significant, such as 1,3 GB.
This RAM is never freed, reducing the available memory for the entire system. Once DiRT 3 Complete Edition took 5,2 GB of RAM from the system, and I had to use the computer with only 2,5 GB usable (before I discovered what the problem was).
Even longer sessions take all RAM and the system freezes, sometimes with INFO messages in dmesg, sometimes with a General Protection Fault. The messages are only readable on the next boot.
One strange thing is that the memory is shown as "free" in htop, so while it shows a large amount of free memory, the system swaps until it freezes. The screenshot below shows htop when the system froze with all memory used while playing DiRT 3 Complete Edition. Notice the large amount of memory "free" as cache (the yellow):
Software information
Two games were tested:
DiRT 3 Complete Edition (using Steam Play)
Forsaken Castle (Win64 build, using Wine 3.15 and DXVK git-e48c27ac30ee92df6f378a11030fd1ee980d0017)
System information
Log files
d3d11.log:
Forsaken_Castle_d3d11.log
dxgi.log:
Forsaken_Castle_dxgi.log