DXVK takes large portions of RAM and never frees them, causing system freezes #632

Closed
FurretUber opened this issue Sep 9, 2018 · 54 comments

Comments

@FurretUber

I have noticed that when playing games with DXVK, the amount of RAM available to the system slowly shrinks. At first the loss is insignificant, but over longer sessions (2 hours, for example) it becomes substantial, e.g. 1.3 GB.

memory usage after playing DiRT 3 for 2 hours

This RAM is never freed, reducing the memory available to the entire system. On one occasion DiRT 3 Complete Edition took 5.2 GB of RAM from the system, and I had to use the computer with only 2.5 GB usable (before I discovered what the problem was).

In even longer sessions all RAM is consumed and the system freezes, sometimes with INFO messages in dmesg, sometimes with a general protection fault. The messages are only readable on the next boot.

One strange thing is that the memory is shown as "free" in htop: even while it reports a large amount of free memory, the system swaps until it freezes. The screenshot below shows htop at the moment the system froze with all memory used while playing DiRT 3 Complete Edition. Note the large amount of memory counted as "free" cache (the yellow):

htop memory on freeze

Software information

Two games were tested:
  • DiRT 3 Complete Edition (using Steam Play)
  • Forsaken Castle (Win64 build, using Wine 3.15 and DXVK git-e48c27ac30ee92df6f378a11030fd1ee980d0017)

System information

  • GPU: Intel HD Graphics 520
  • Driver: Mesa 18.3.0-devel (git-07a2098a70)
  • Wine version: Proton 3.7 and Wine 3.15
  • DXVK version: v0.70-16-g07b4d3c and git-e48c27ac30ee92df6f378a11030fd1ee980d0017

Log files

@doitsujin
Owner

doitsujin commented Sep 9, 2018

Does this mean that the memory doesn't get freed even when you terminate the game?

Is this a regression / do older DXVK versions work? If so, which is the last one that doesn't have the issue?

@FurretUber
Author

FurretUber commented Sep 9, 2018

Yes, it is never freed. I close the game and the system still can't use that memory; if I open the game again, it can't reuse that memory either.

Edit: while I used older versions of DXVK, I was never able to play really long sessions due to a multitude of issues (GPU hangs, very low performance, Wine issues), so I can't say whether older versions were good or not.

I discovered this problem while running tests for this bug.

@edmondo
Contributor

edmondo commented Sep 9, 2018

@FurretUber What happens when you clear PageCache, dentries and inodes?
echo 3 >/proc/sys/vm/drop_caches

@pchome
Contributor

pchome commented Sep 9, 2018

$ ps aux | grep -i defu    # check whether the game failed to quit properly (defunct)
$ ps aux | grep -i exe     # manually check all running executables
$ kill -9 <process id>     # manually kill a stalled game, if any

While this sometimes happens, Steam usually doesn't allow launching a second copy of a game. But you can check to be sure.
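
A roughly equivalent check, as a sketch (assuming the procps versions of ps and pgrep):

$ ps aux | awk '$8 ~ /Z/'    # defunct (zombie) processes carry state Z; their cmdline is empty, so grepping by name can miss them
$ pgrep -a -f '\.exe'        # list every surviving Windows executable together with its arguments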

@FurretUber
Author

@edmondo I use sudo sh -c 'free && sync && echo 3 > /proc/sys/vm/drop_caches && free' to drop the caches.
The problem is that the system is reporting cache that is impossible to drop. free -wm currently shows:

              total        used        free      shared     buffers       cache   available
Mem:           7946        3573        2300          94           9        2063        3976
Swap:          9999         505        9494

The cache will not drop below 2 GB, so basically 25% of the system memory is unusable. I played DiRT 3 today and the cache only kept growing. I wrote a simple script that logs the date, time, and memory usage every 2 seconds (see the sketch below), and the cache went from 100 MB to 2 GB.
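
A minimal sketch of such a logger (hypothetical, not the exact script I used):

while true; do
    echo "$(date '+%F %T') $(free -m | awk '/^Mem/ {print $6}') MB buff/cache" >> memlog.txt
    sleep 2    # sample every 2 seconds
done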

When I played Euro Truck Simulator 2 (native, so no DXVK) after playing DiRT 3, the cache increased to as much as 4.5 GB, but when I closed the game it dropped back to 2 GB.

Later I played DiRT 3 again and the cache reached 2.6 GB. Even after dropping the caches multiple times with the command above, the minimum it reaches is 2.6 GB.

This ~600 MB of undroppable cache growth happened after 35 minutes of playing.

@edmondo
Contributor

edmondo commented Sep 10, 2018

@FurretUber Thanks. I have some more questions, if you have time to check.

Is your /tmp mounted in memory (tmpfs)?
grep /tmp /proc/mounts

Can you provide the meminfo content before and after?
cat /proc/meminfo
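
For example, a sketch that keeps only the counters that changed:

cat /proc/meminfo > meminfo_before.txt
# ... reproduce the problem, then:
cat /proc/meminfo > meminfo_after.txt
diff -y meminfo_before.txt meminfo_after.txt | grep '|'    # '|' marks lines that differ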

@FurretUber
Author

FurretUber commented Sep 10, 2018

/tmp is tmpfs, but tmpfs filesystem usage is pretty small:

df -h | grep tmpfs
tmpfs           795M  1,6M  794M   1% /run
tmpfs           3,9G   59M  3,9G   2% /dev/shm
tmpfs           5,0M  4,0K  5,0M   1% /run/lock
tmpfs           3,9G     0  3,9G   0% /sys/fs/cgroup
tmpfs           3,9G   28M  3,9G   1% /tmp
tmpfs            96M  1,4M   95M   2% /tmp/dumps
tmpfs           100K     0  100K   0% /var/lib/lxd/shmounts
tmpfs           100K     0  100K   0% /var/lib/lxd/devlxd
tmpfs           795M   40K  795M   1% /run/user/1000

The filesystems in RAM account for around 90 MB, which doesn't explain this cache. Right now the undroppable cache is close to 6.4 GB; it is quite uncomfortable to use the computer with so little usable memory.

screenshot 2018-09-10 10-05-39

I limited the CPU clock to 50% (1.2 GHz) and left Forsaken Castle running; the game ran at framerates from 49 to 65. At 100% CPU clock (2.3 GHz) it would run at framerates from 90 to 130.

With the CPU clock at 100%, Forsaken Castle freezes the system through memory usage, but limiting the clock to 50% kept the system from freezing; the cache still grew significantly, though, from 2.6 GB to 6.4 GB in 7 hours. It looks like the higher the framerate, the faster the cache fills.
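
For reference, a sketch of one way to cap the clock like that, assuming the cpupower tool is installed:

sudo cpupower frequency-set -u 1.2GHz    # cap the maximum frequency at roughly 50%
sudo cpupower frequency-set -u 2.3GHz    # restore the full limit afterwards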

About the cat /proc/meminfo:

Before:
meminfoantes.txt

After:
meminfodepois.txt

@notfood

notfood commented Sep 10, 2018

Can you provide tests without DXVK?

@DistantThunder

This is very strange, because it is normal behaviour for the cache not to be freed after the application using the cached pages exits.

The Linux kernel doesn't garbage-collect cached objects, as doing so costs more, resource-wise, than simply leaving the cache be; the cache is essentially free memory for any application that needs it.

That you reach a system freeze while the cache is still in use, as in your screenshots, makes me think the problem lies elsewhere.

Even if the Wine process somehow "zombified" and kept holding part of the memory, those cached pages would still be freed if the kernel or another process required memory.

I have had my own system freeze when using DXVK, but I suspected a GPU hang rather than a memory-pressure problem.

It would be interesting to look at the shader cache size.

Also, can you make sure Wine is not running in DEBUG mode?

Proton dumps debug output to a file in your home directory. It's quite easy for such a file to grow significantly in a short time (perhaps as fast as whatever the graphics functions of the game require, which would explain why the more frames there are, the more the system cache grows), so it may be worth checking.

One other thing I noticed is that you're using an Intel iGPU with Mesa. Intel iGPUs use system memory, shared with all other applications.

Could it be that, somehow, the iGPU can't free VRAM from the cache? I don't have enough expertise to investigate that, though...

@edmondo
Contributor

edmondo commented Sep 10, 2018

@FurretUber Hm, the following is intriguing:

Slab:            6466252 kB
SReclaimable:    6336504 kB

Can you post the output of slabtop please?
slabtop -s c
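
For a non-interactive snapshot that is easy to paste here, something like this should work:

sudo slabtop -o -s c | head -n 15    # print once, sorted by cache size, top entries only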

@FurretUber
Author

FurretUber commented Sep 10, 2018

@notfood After 5 hours of Forsaken Castle running with Wine's d3d11, the cache returned to its boot value after I closed the game: it rose to 1022 MB while the game was open and dropped to 505 MB after closing. Dropping caches reduced it further to 369 MB; stopping LightDM and dropping caches again reduced it to 252 MB. With DXVK, the cache grew by 3.8 GB in 7 hours.

@DistantThunder While cache is useful to speed up applications, it should be freed when the system requests more memory. The cache created by DXVK is not being freed: with 1.45 GB in use, 1.15 GB of swap, and 6.4 GB of cache that cannot be freed, this is not normal behavior.

I have checked and there are no processes related to steam, dxvk, exe, dirt3, castle, wine after I close the games. If htop is to be believed, everything is properly closed.

$HOME/.cache/mesa_shader_cache is 51.7 MB and the Steam log for DiRT 3 is 2 MB.

As the Intel GPU has unified memory, video memory and system RAM share the same space. What I do know is that with OpenGL applications the memory is freed properly.

The remaining thing to test would be to play Dota 2 with Vulkan (and probably ruin some matches), because I don't see this problem with the Vulkan demos from https://github.com/SaschaWillems/Vulkan

Edit: @edmondo Should I use that command after DXVK takes the memory? I rebooted because the system was in a pretty bad state, with only 1.45 of 7.76 GiB usable. If it happens again I'll run it.

@DistantThunder

While cache is useful to speed up applications, it should be freed when the system requests more memory. The cache created by DXVK is not being freed

Applications can indeed lock part or all of their allocated VM pages, but that is an explicit mechanism (mlock) for which I find no calls in the DXVK code.
Furthermore, the lock disappears with the process once it dies.

What you can also do with your iGPU is test whether reducing the maximum VRAM amount in the BIOS/EFI has any influence on the problem.

@edmondo
Contributor

edmondo commented Sep 10, 2018

@FurretUber We don't know right now what's "using" the slab cache, so maybe slabtop sorted by cache size could give us a hint.

@oliwarner

Possibly unrelated, but I just crashed out of Arkham Knight due to an OOM condition. I found just over a thousand instances of explorer.exe nibbling away at the available RAM.

Killing them worked, but the cause wasn't immediately obvious because each instance held only a tiny fraction of the consumed RAM.

@libcg
Contributor

libcg commented Sep 10, 2018

Has anyone tested with wined3d?

@FurretUber
Author

Yes, I tested; no memory increase was observed after 5 hours with d3d11.

"After 5 hours of Forsaken Castle running with Wine's d3d11, the cache amount was restored to the value it had on boot after closing it: cache rose up to 1022 MB with the game open and, after closing, it reduced to 505 MB. Dropping caches reduced further to 369 MB. Stopping LightDM and dropping caches reduced to 252 MB. With DXVK in 7 hours the cache increased 3,8 GB."

The Dota 2 download has finished. Within roughly 12 hours I should hopefully know whether this issue is a Vulkan problem (which would make it an Intel Mesa issue) or exclusive to DXVK.

For now, from my tests:

• Native OpenGL: memory is deallocated properly;
• Wine OpenGL (d3d11): memory is deallocated properly;
• Native Vulkan: what I tested (the Vulkan demos) deallocated memory properly; I'll test with Dota 2;
• Wine Vulkan (DXVK): fails to deallocate until the system freezes.

Is anybody else testing to see whether this issue happens for them? It should be tested with a GPU that uses unified memory.

@libcg
Contributor

libcg commented Sep 11, 2018

What about Wine Vulkan, but without DXVK?

@FurretUber
Author

How can I test this?

@libcg
Contributor

libcg commented Sep 11, 2018

Just run Windows Vulkan applications in Wine.

@FurretUber
Author

OK, so I'll run the Vulkan demos from here to test Wine Vulkan.

DiRT 3 filled memory with cache (~140 MB), and the command @edmondo suggested shows i915_request as the resource/program/thing using most of the cache, at 148528 K. It wasn't even appearing before.

I played Dota 2 around the same time as DiRT 3, and Dota 2 did not fill any memory that wasn't deallocated when it was closed.

The question is: which test is the most important now? I thought of doing one of these two things for approximately six hours, with the CPU limited to 50% to reduce noise:

- Let Dota 2 run with native Vulkan, in the tutorial or watching matches, if possible
- Let the Vulkan demos run with Wine Vulkan; I would run more than one demo at the same time to try to simulate a significant load

If you have an idea for another test that could give more meaningful results, please suggest it (and explain how, if it's hard to do 😛).

@ryao
Contributor

ryao commented Sep 11, 2018

Dump /proc/slabinfo:

http://man7.org/linux/man-pages/man5/slabinfo.5.html

Running memleak -o 36000000 in the background during an hour-long session would probably show us where the long-lived allocations are being made. memleak is from iovisor/bcc on GitHub. This assumes that your kernel has debuginfo available for bcc; it might not work without it.
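
A sketch of the invocation (the cutoff is in milliseconds, and with no PID argument memleak traces kernel allocations; see the corrected cutoff value further down):

# the tool may be packaged as memleak-bpfcc or live under /usr/share/bcc/tools, depending on the distribution
sudo memleak -o 36000000 60    # report outstanding allocations every 60 seconds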

@setsunati

What happens if you run Doom 2016 in Wine? Doom 2016 uses Vulkan.

@ryao
Contributor

ryao commented Sep 11, 2018

@FurretUber It seems almost certain that we have a memory leak in the i915 kernel driver. memleak would be perfect for figuring this out if someone implemented kernel support in its -O option. The easiest way for now is to run memleak for a really long time and have it filter by the age of the allocations.

Alternatively, there is this:

https://www.kernel.org/doc/html/v4.10/dev-tools/kmemleak.html

If you dig around enough, you might find a script someone wrote that uses ftrace output to identify memory leaks. Another OSS developer I know used it a few weeks ago on a kernel memory leak, but it is likely a pain to use. My suggestion is to rely on memleak from iovisor/bcc.

@ryao
Contributor

ryao commented Sep 11, 2018

@setsunati Rather than ask what happens when he runs X, Y and Z, it might be better for him to profile the kernel's memory allocations to identify the call stacks that are allocating the memory that leaks. My guess is that the kernel is not calling i915_fence_release() on memory allocated in i915_request_alloc(). There are too many code paths to eyeball it, so profiling it would be helpful.

@ryao
Contributor

ryao commented Sep 11, 2018

My system seems to be suffering from this too. I can get the number of objects allocated in the i915_request slab cache to rise by running Rise of Nations: Extended Edition (Direct3D 10). It is currently at 442064 objects, which I suspect to be abnormal. Restarting Xorg doesn't have any effect on it. I won't have time to identify the call stack until Wednesday at the earliest.

I am running Linux 4.18.0-rc8 (and really ought to update to a newer kernel, but that is for another day). My CPU/GPU is a Xeon E3-1276v3.

@FurretUber
Author

@ryao For now (I have not started any game yet) memleak is showing only:

[23:58:46] Top 10 stacks with outstanding allocations:
[23:58:56] Top 10 stacks with outstanding allocations:
[23:59:05] Top 10 stacks with outstanding allocations:

I suppose I have to play some game to see this in action. If this is an i915 issue, why does it seem to affect only DXVK? Native Vulkan and Wine Vulkan both seem fine.

@ryao
Contributor

ryao commented Sep 11, 2018

@FurretUber the command that I gave you will only display outstanding allocations that are at least 1 hour old. It occurs to me that you might need to leave it running for a few hours before the leak shows up in the top 10 being printed.

As for why only DXVK is affected, it could be triggering a buggy code path inside the kernel. If we know how the leaked memory was allocated, we should be able to work backward to find out how it could fail to be freed. Anyway, i915_request is part of the i915 kernel driver, so it is responsible for freeing this memory.

@ryao
Contributor

ryao commented Sep 11, 2018

@FurretUber Actually, the command that I gave you will only display outstanding allocations that are at least 10 hours old. I put one too many zeros in the cutoff; you want memleak -o 3600000. Sorry about that.

@FurretUber
Author

"Good" news, this appeared

@ryao
Contributor

ryao commented Sep 11, 2018

@FurretUber That narrows it down to the allocation made in i915_gem_do_execbuffer(). My first instinct is to scrutinize the error processing path by instrumenting each of the 3 goto err_request statements. Checking to see if the error paths are taken on DXVK vs a native Vulkan app that does not exhibit this bug should help narrow this down further.

It is possible to instrument specific lines using the perf probe command and then record a trace using perf record to see what execution paths are being taken. You can do multiple probe commands and then state all of the probes in a single record command.

The right command depends on your kernel version because line numbers vary. See the example on line numbers here:

http://www.brendangregg.com/perf.html#DynamicTracingEg

The example also includes local variables. We might not need that, but it would not hurt to capture the values of fences, in_fence and out_fence if they are available. This could take some trial and error to get right. It also requires having your kernel sources installed in addition to debuginfo for it to list the function definition.

Unfortunately, I do not have time to probe the i915 driver today, but there should be enough information at that link (and in the man pages) for the uninitiated in kernel debugging to get started.

By the way, you can check whether the issue occurred by running grep i915_request /proc/slabinfo and seeing if the first number (the number of active allocations) increased by, say, a few thousand over 5 to 10 minutes. There is no need for a long-running reproducer anymore, since we know the execution path doing the allocation. This should make probing the i915 driver easier.
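
As a sketch, a quick watch loop for that check (reading /proc/slabinfo requires root):

while true; do
    echo "$(date '+%T') $(sudo grep '^i915_request ' /proc/slabinfo | awk '{print $2}') active objects"
    sleep 30
done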

@yurikoles

I have reported this issue upstream: https://bugs.freedesktop.org/show_bug.cgi?id=107899

@ryao
Contributor

ryao commented Sep 11, 2018

To add a bit more information, the normal execution path does eb_submit:

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/i915_gem_execbuffer.c#L2219

Presumably, the allocation is eventually freed when this is called. When an error occurs that executes goto err_request;, we skip over eb_submit. At a glance there does not appear to be anything handling that, although we need to scrutinize things in greater detail. Knowing whether the error path is taken when the issue occurs versus when it does not would help confirm my conjecture.

Someone might wonder why I am not interested in any of the other goto statements when they also jump over eb_submit. For any other goto statement, the allocation either never happened or failed, so jumping over eb_submit (which I believe does the free) should be okay.

@FurretUber
Author

It was reported on FreeDesktop? That's a bit funny, because I went GPU hang -> FreeDesktop report -> test patch -> find memory allocation error -> GitHub report -> discover the problem is in the kernel.

I have to build perf for my kernel version (4.17.18, as 4.18 and 4.19 have a regression that hits me; I hope the developers see the kernel Bugzilla report before 4.19 is released).

I tried using sudo ./perf probe -L i915_gem_do_execbuffer and the output was:

Failed to find the path for kernel: No such file or directory
  Error: Failed to show lines.

Looks like I'll have to build the kernel. Don't wait for me, this will take a long time.

@ryao
Contributor

ryao commented Sep 11, 2018

You could ask your distribution’s developers how to install the kernel sources for debugging purposes.

@FurretUber
Author

FurretUber commented Sep 11, 2018

@ryao I have built the kernel and got this from using sudo ./perf probe -L i915_gem_do_execbuffer -m i915. Which lines should I inspect?

Edit: I tried one and it has only: unwind: target platform=x86 is not supported

Edit 2: I used the following commands to inspect:

sudo ./perf probe -m i915 --add 'i915_gem_do_execbuffer:229 in_fence out_fence fences'
sudo ./perf record -e probe:i915_gem_do_execbuffer -a
sudo ./perf script

Edit 3: I built perf with both the x86 and x86_64 options enabled. Unfortunately, the sudo ./perf script output is empty, so I suppose the code at line 229 never runs?

Edit 4: Good news! Investigating line 70 printed relevant log information. The file is compressed because it's too big.
saidalinha70.txt.gz

I'll test other lines to see if there is further information, but this shows that only Forsaken Castle has a value set for fences.

@ryao
Contributor

ryao commented Sep 12, 2018

Line 70 is not what I meant; I meant the goto statements around line 192, but perf is not showing those as instrumentable. Line 70 will be executed every single time to check the condition, so unfortunately it does not yield useful information. ZFS has a SET_ERROR macro that was ported into the Linux driver. I think we might need to patch this function to use it, which involves getting it from the ZFSOnLinux source code and adding it to the kernel headers. I have plenty of offline stuff to do, so I doubt I can do that tonight, but that is probably the next step here, unless one of the Intel guys who understands the i915 driver is willing to eyeball it or reproduces it on his end.

@FurretUber
Author

Investigating line 8 I found something pretty relevant. At line 8 there was @<i915_gem_do_execbuffer+3647>, and I investigated that particular part as it had both fences and in_fence. The values are pretty notable. I separated out a small part, showing three applications:

 Forsaken_Castle 16943 [003] 17521.309896: probe:i915_gem_do_execbuffer: (ffffffffc07690af) in_fence=0x0 fences=0xffff952366b7bb00
            Xorg  2215 [000] 17521.309986: probe:i915_gem_do_execbuffer: (ffffffffc07690af) in_fence=0x0 fences=0x2d28
        rolc.exe 15796 [001] 17521.311725: probe:i915_gem_do_execbuffer: (ffffffffc07690af) in_fence=0x0 fences=0x1d8

rolc.exe is a Wine OpenGL game, Xorg is... Xorg, and Forsaken_Castle is Wine with DXVK. Notice the huge difference in the values of the fences variable.

@ryao
Contributor

ryao commented Sep 12, 2018

It occurs to me that it is possible to get much of the same data that SET_ERROR would give by tracing lines 191, 197, 203, 220 and 222 simultaneously. We would need to get the value of err on line 222.

@FurretUber
Author

I can do this check, I think. Looks like perf was problematic before. Give me a moment.

@ryao
Contributor

ryao commented Sep 12, 2018

Printing out in_fence on line 8 is not that useful because it is set further down based on the args->flags variable. There is a way to access the args->flags variable in a perf probe, but it is annoying to do; I think Brendan Gregg's site has an example. It requires calculating the offset of the struct member, adding it to the variable, and then dereferencing that.

The values for fences are interesting. This code is being passed user space pointers for Xorg and rolc.exe while it is passed a kernel pointer for Forsaken_Castle:

https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt

That is in the KASAN shadow memory, which means that something is wrong. I am not sure offhand why the pointer would land in the KASAN shadow memory region (I assume that you are not running KASAN). I will need to look at the code further when I have time.

@FurretUber
Author

FurretUber commented Sep 12, 2018

The output from those lines was empty, which is pretty strange. Maybe one by one?

Edit: Indeed, the line 191 output is empty; there is nothing in it.

Edit 2: line 222 has the following:

 Forsaken_Castle 22969 [000] 19285.169559: probe:i915_gem_do_execbuffer: (ffffffffc07689e8) in_fence=0x0
            Xorg  2215 [000] 19285.170637: probe:i915_gem_do_execbuffer: (ffffffffc07689e8) in_fence=0x0
        rolc.exe 15796 [003] 19285.171467: probe:i915_gem_do_execbuffer: (ffffffffc07689e8) in_fence=0x0

I'll test the others you suggested.

Edit 3: line 220:

 Forsaken_Castle 24521 [003] 19487.476694: probe:i915_gem_do_execbuffer: (ffffffffc0768b89) in_fence=0x0 fences=0xffff952366b7bae0
        rolc.exe 15796 [000] 19487.476778: probe:i915_gem_do_execbuffer: (ffffffffc0768b89) in_fence=0x0 fences=0x33a0
            Xorg  2215 [002] 19487.476814: probe:i915_gem_do_execbuffer: (ffffffffc0768b89) in_fence=0x0 fences=0x2d70

@ryao
Contributor

ryao commented Sep 12, 2018

@FurretUber You actually need to profile all of them and dump the output. Just 1 line individually won't be helpful. The idea is to see what path is executed. I can infer it by seeing which probes fire in what order. The output from some lines being empty is expected.

To be more clear, you can just do perf probe -a ... each time. It will give you a unique probe name. Then in perf record, you specify -e $NAME for each $NAME that you want to include in the trace in the same perf record command. Finally, you can dump it via perf script to yield output that a kernel developer can interpret. You don't need to capture much data to get a useful trace from that in this situation.

Also, to make my previous comment about capturing args->flags more useful, if you install the dwarves package, you can do pahole -C drm_i915_gem_execbuffer2 $(modinfo -n i915) to get the offset of the flags variable. On my system, the offset is 40. Then you would do something like flags=+40(args):u64 as part of the perf probe command to dump it. Basically, you would specify that as one of the variables you want to capture.
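
Putting that together, a sketch (the probed line number and the 40-byte offset come from my build and will differ on other kernels):

pahole -C drm_i915_gem_execbuffer2 $(modinfo -n i915) | grep flags    # confirm the member offset
sudo perf probe -m i915 --add 'i915_gem_do_execbuffer:220 in_fence fences flags=+40(args):u64'
sudo perf record -e probe:i915_gem_do_execbuffer -a -- sleep 30       # use the event name perf probe actually printed
sudo perf script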

@FurretUber
Author

Thank you, I was not aware this was possible; today is the first time I have ever used perf. The list of the probes added:

  probe:i915_gem_do_execbuffer (on i915_gem_do_execbuffer:190@i915/i915_gem_execbuffer.c in i915 with in_fence)
  probe:i915_gem_do_execbuffer_1 (on i915_gem_do_execbuffer:191@i915/i915_gem_execbuffer.c in i915 with in_fence)
  probe:i915_gem_do_execbuffer_2 (on i915_gem_do_execbuffer:196@i915/i915_gem_execbuffer.c in i915 with in_fence)
  probe:i915_gem_do_execbuffer_3 (on i915_gem_do_execbuffer:197@i915/i915_gem_execbuffer.c in i915 with in_fence)
  probe:i915_gem_do_execbuffer_4 (on i915_gem_do_execbuffer:202@i915/i915_gem_execbuffer.c in i915 with in_fence fences)
  probe:i915_gem_do_execbuffer_5 (on i915_gem_do_execbuffer:203@i915/i915_gem_execbuffer.c in i915 with in_fence fences)
  probe:i915_gem_do_execbuffer_6 (on i915_gem_do_execbuffer:203@i915/i915_gem_execbuffer.c in i915 with in_fence out_fence)
  probe:i915_gem_do_execbuffer_7 (on i915_gem_do_execbuffer:220@i915/i915_gem_execbuffer.c in i915 with in_fence fences err)
  probe:i915_gem_do_execbuffer_8 (on i915_gem_do_execbuffer:222@i915/i915_gem_execbuffer.c in i915 with in_fence)

The lines of code are here: linhasdocodigo.txt

The output of the script is here:
saidalinhasdiversas.txt.gz

@ryao
Contributor

ryao commented Sep 12, 2018

@FurretUber You are doing well for the first time you have ever used it; I did no better with perf my first time.

I will need to look at that output later; I have some offline stuff to do. If it helps, I can say offhand that by looking at the thread IDs and at which probes fired in sequence within the same thread, you can work out which goto statement executed. If the probes before a goto statement fire and then nothing fires until the last statement, the goto statement right after the last probe that fired is the one that executed.

@ryao
Contributor

ryao commented Sep 12, 2018

@FurretUber I looked at the output. Contrary to my expectation, the goto statements are not being executed, which means this is not a bug in error handling.

@ryao
Contributor

ryao commented Sep 19, 2018

Jason Ekstrand wrote a fix for the issue. It is available from the freedesktop bug tracker:

https://bugs.freedesktop.org/show_bug.cgi?id=107899
https://bugs.freedesktop.org/attachment.cgi?id=141619

I have reviewed and tested it. It appears to resolve the issue entirely in Rise of Nations: Extended Edition. The patch itself does a fairly good job of explaining what went wrong (and also why I missed the issue when I tried to debug it).

@FurretUber reported a small number of long-lived allocations even with the patch, but his further testing shows that they get garbage-collected by an OpenGL game. I am unable to reproduce them; it might have something to do with me using KDE 5.

After this goes into mainline, someone should ping the linux-stable mailing list so that it is properly backported to Linux 4.14.y and 4.18.y. We probably should ping Canonical so that they apply it to their kernel too.

@gfxstrand

I've submitted the kernel patch to the mailing list. Hopefully, it will land fairly soon and we'll make sure it gets back-ported as far as needed.

@gfxstrand

The fix has landed in drm-misc-fixes; it will propagate to a kernel release near you shortly.

@yurikoles

@FurretUber is it fixed now for you?

@FurretUber
Author

Yes, it has been fixed for me since 4.19-rc6. The fix was backported to 4.14.76 and 4.18.14 as well.

@doitsujin
Owner

Since this was a kernel bug that has been fixed upstream, I'm going to close this.

@Demetrio92

@FurretUber What happens when you clear PageCache, dentries and inodes?
echo 3 >/proc/sys/vm/drop_caches

wow, this just worked

@Demetrio92

Demetrio92 commented Aug 3, 2021

Since this was a kernel bug that has been fixed upstream, I'm going to close this.

I am still experiencing this. Latest Proton and DXVK:

$ /usr/bin/vulkaninfo | head -n 5
WARNING: [Loader Message] Code 0 : loader_icd_scan: Can not find 'ICD' object in ICD JSON file /usr/share/vulkan/icd.d/nvidia_layers.json.  Skipping ICD JSON
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
==========
VULKANINFO
==========

Vulkan Instance Version: 1.2.182
$ uname -srm
Linux 5.8.0-63-generic x86_64

I am running a game through Proton (Saints Row 3 Remastered); it occupies ~14 GB of RAM. It cannot exit properly, so I have to kill it each time. Afterwards I do not see any orphaned processes occupying my RAM: htop tells me my RAM is blocked, but no memory-intensive process is present. echo 3 >/proc/sys/vm/drop_caches immediately releases those blocked 14 GB.

@doitsujin
Owner

doitsujin commented Aug 3, 2021

And the conclusion that this isn't our bug hasn't changed. Please make sure you're running up-to-date kernel and Mesa versions for your Intel GPU.

Also, what exactly are the first 5 lines of vulkaninfo output supposed to tell us? You literally cut out all useful info.
