regression on linux 6.3.1: 'vmalloc error' during crawl #257
Disabling the memory cgroup controller (and using the generational LRU instead) somewhat mitigated that for me, and setting transparent huge pages to madvise also helped, but it is still not fully fixed for me. User-space pages are movable (because they are addressed indirectly through page table lookups). Buffers for hardware are usually not movable. Page cache is, I think, not movable but reclaimable. Your error indicates the kernel is trying to get an order 9 allocation (4k * 2^9 = 2M) and failing, so something bad is going on with memory allocations in the kernel since 6.1.
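For reference, a minimal way to check this yourself (a sketch, assuming a typical x86-64 system with 4k pages; the number of columns depends on the kernel's MAX_ORDER):

cat /proc/buddyinfo
# Each column N counts free blocks of 2^N contiguous pages, order 0 on the left.
# With 4k pages, an order 9 block is 4k * 2^9 = 2M - the size the failed
# allocation in the log asked for. An order 9 column at or near 0 means such an
# allocation cannot be satisfied without compaction.
watch -n 5 cat /proc/buddyinfo   # watch how the columns develop under load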
Sorry for the delay.
/proc/buddyinfo before:
/proc/buddyinfo after:
That indicates that both before and after the event the memory is already very fragmented - and that it is fragmented before is probably why this is happening in the first place. Could you look at it after a fresh reboot, then watch how it develops while using the system? Maybe you can identify an action or behavior on your system that is causing this. As a first countermeasure you could try disabling huge pages after a fresh reboot (while buddyinfo still shows low numbers on the left side and high numbers on the right side):

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

If this helps but you feel like you want to use huge pages (because they lower TLB cache misses and can increase performance for some workloads by up to 10%), try this as a next step (I am using these settings; they cost around 1 GB of unused memory on my desktop system under memory pressure when memory is partially fragmented, instead of 4-8 GB with huge pages always turned on):

echo 64 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
echo 8 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
echo 32 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
echo within_size | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled
echo defer+madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

This tells transparent huge pages to only be created for madvise'd memory regions (i.e. when an application explicitly asks for it; bees does this for the hash table). It also defers defragmenting huge pages for better latency (but this tends to delay seeing the immediate effects of bad memory layout). Depending on your workload, you may get better results with different values for these knobs.

Background: 2M pages leave less, and more fragmented, free memory because the holes available for buddy allocation tend to be smaller. They also create more memory pressure, often causing the kernel to flush out cache early and create "seemingly free" memory which actually cannot be used because it is too fragmented. I think exactly this is what you initially described. Using and tuning huge pages is a question of cost vs benefit: memory loads become up to 10% faster at the cost of reduced usable memory. If your system allocates memory in bad patterns, the cost easily becomes very high, in which case you may want to disable huge pages completely or identify the process that is causing it. Btrfs itself seems to spike buddy memory allocations quite often, which increases the cost of huge pages.

BTW: 2M pages cannot be swapped. They need to be broken back up into 4k pages for swapping. I'm not sure if the kernel does this by default or if there's a tunable for when this should happen.
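As a quick sanity check (a sketch; these are the standard sysfs/procfs paths, but whether they all exist depends on your kernel configuration), you can verify which mode is active and how much memory is actually backed by huge pages:

cat /sys/kernel/mm/transparent_hugepage/enabled   # the active mode is shown in [brackets]
cat /sys/kernel/mm/transparent_hugepage/defrag
grep -E 'AnonHugePages|ShmemHugePages' /proc/meminfo   # memory currently backed by THP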
my setup
Tests
How can we determine whether this is a regression in bees when using newer kernels, or in the kernel's btrfs code used by crawl/dedup itself? A new traceback:
By definition, user-space software must never be able to cause a kernel oops or trace [1] - so this is a kernel regression. Does it work fine with an older kernel, then?

[1]: bees makes some effort to work around such issues, though - but that doesn't make it bees's fault
It does work fine with kernel 6.2.13 (no kernel traces). The buddyinfo behaves identically to the buddyinfo in the non-working 6.3.1 case. I do agree that userspace should never be able to cause an oops or trace, so indeed it should not be bees's fault.
Actually, I feel like memory fragmentation is becoming a bigger issue with btrfs in every kernel cycle. I'm currently running 6.1 and see very high order 0 values in buddyinfo, and get oopses or IO thrashing - while it worked fine in the previous LTS kernel (and thus I never looked at buddyinfo). Using memory cgroups seems to worsen the problem, but that may be an effect of using bees and of how cache ownership works in memory cgroups. One of our servers running 6.1 had buddyinfo order 0 counts in the millions - and adding RAM only worsened the problem for some reason. This hadn't been an issue with the previous 5.19. With transparent huge pages completely turned off it now behaves mostly as expected, but the order 0 numbers are still very high. There's another metric you could look at:
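buddyinfo is not the only view of fragmentation; as a general illustration (a sketch - not necessarily the metric meant above; which of these files exist depends on kernel config, and both need root):

sudo cat /proc/pagetypeinfo
# Free blocks per order, split by migrate type (Unmovable, Movable,
# Reclaimable, ...) - shows which kind of pages are eating the large blocks.
sudo cat /sys/kernel/debug/extfrag/extfrag_index
# Per-order fragmentation index (needs debugfs mounted); values approaching 1
# mean failed allocations are due to fragmentation rather than lack of memory.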
I have noticed that, even without bees running, vmalloc errors occur after some time due to other services.
If I do an echo 1 > /proc/sys/vm/drop_caches, the low-order numbers jump up and come back down after some time. The order 9 numbers increase, and the vmalloc errors disappear for some time, until the order 9 numbers are reduced to 0 again. Does anybody have a contact in the btrfs development community where this could be raised? These are the errors I get:

Jun 13 15:00:30 ltytgat-desktop kernel: [110969.198519] kded5: vmalloc error: size 10485760, page order 9, failed to allocate pages, mode:0x400cc2(GFP_KERNEL_ACCOUNT|__GFP_HIGHMEM), nodemask=(null),cpuset=user.slice,mems_allowed=0
Jun 13 15:00:30 ltytgat-desktop kernel: [110969.198530] CPU: 7 PID: 147280 Comm: kded5 Tainted: G W OE 6.3.7-060307-generic #202306090936
Jun 13 15:24:24 ltytgat-desktop kernel: [112402.962588] bash (161844): drop_caches: 3
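A minimal way to reproduce that observation and to explicitly ask the kernel to rebuild high-order blocks (a sketch, assuming CONFIG_COMPACTION is enabled; both knobs are standard /proc/sys/vm interfaces and need root):

cat /proc/buddyinfo                            # fragmentation before
echo 3 | sudo tee /proc/sys/vm/drop_caches     # 1 = page cache only, 3 = page cache + slab
echo 1 | sudo tee /proc/sys/vm/compact_memory  # trigger compaction of all zones
cat /proc/buddyinfo                            # the order 9 counts should rise, at least temporarily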
This is a good find. Maybe hop over to IRC #btrfs then; quite a few btrfs devs are active there.
So kernels 6.3.10 and 6.4 are good to go, but now the LTS kernels are broken.
I was just going to mention that in 6.5-rc1 this issue seems to have been tackled, but then I saw the comment above from @Zygo. Do I understand it correctly that in essence this was not an error, only an unfortunate kernel log message?
Yes. The underlying error condition behind the message is expected, and the btrfs code already handles the error cases. The recent kernel code changes are all related to when the message should appear in the log.
There's still a problem with memory fragmentation, no matter the error log.
But in #260 (comment) you explicitly mention that this is a bug that was backported but the fix wasn't backported yet. Unless I got something wrong...
Also here in #257 (comment). That means current LTS kernels 5.15 and 6.1 now have the 'vmalloc error' kernel messages that were fixed in 6.4. The kernel changes would not affect any VM behavior, other than emitting the log message or not.
Dear
I upgraded my system from linux 6.2.13 to 6.3.1. This resulted in the error messages shown below in my logs. There are no crashes.
This has also been reported here
In the reply to the above it is mentioned that an out-of-memory condition could trigger the issue. In my setup I have about 30 GB of free RAM, so this shouldn't be the case.
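Note that having lots of free memory does not guarantee that contiguous 2M blocks are available. A quick way to see both at once (a sketch using standard tools, assuming 4k pages):

free -h              # total free/available memory
cat /proc/buddyinfo  # the same memory split by contiguous block size; the rightmost columns are the large blocks
# Tens of GB free with the order 9 column at 0 can still make an order 9 (2M)
# allocation attempt like the one in the log fail and emit this message.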
Feel free to contact me for more info/tests.