Slab leak in Android Kernel #57

Closed
grassead opened this issue Apr 14, 2020 · 26 comments

@grassead

Hi,

When running Android 8 on a Sabrelite, I hit an issue that appears to be located in the kernel (4.9.88, commit 2324d06).

The "Slab memory" is leaking when the graphics system (SurfaceFlinger) does composition.

For example, I wrote an application that pops Toasts as fast as possible.

Before running the test, cat /proc/meminfo gives:
Slab: 42092 kB
SReclaimable: 15076 kB
SUnreclaim: 27016 kB

After running the application for 10-15 minutes:
Slab: 78960 kB
SReclaimable: 14828 kB
SUnreclaim: 64132 kB

After killing the application:
Slab: 80956 kB
SReclaimable: 14960 kB
SUnreclaim: 65996 kB

The "SUnreclaim" is not freed.
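For reference, the change between the first and last snapshots above can be computed with a short Python sketch. The helper name `parse_slab` is hypothetical and only for illustration; the input is the /proc/meminfo text quoted in this report.

```python
import re

def parse_slab(meminfo_text):
    """Return the Slab, SReclaimable and SUnreclaim values (in kB) from meminfo-style text."""
    pattern = r"^(Slab|SReclaimable|SUnreclaim):\s+(\d+) kB"
    return {name: int(value)
            for name, value in re.findall(pattern, meminfo_text, re.MULTILINE)}

# Snapshot taken before running the test application.
before = parse_slab("""Slab: 42092 kB
SReclaimable: 15076 kB
SUnreclaim: 27016 kB""")

# Snapshot taken after killing the test application.
after_kill = parse_slab("""Slab: 80956 kB
SReclaimable: 14960 kB
SUnreclaim: 65996 kB""")

# Per-field growth in kB between the two snapshots.
delta = {name: after_kill[name] - before[name] for name in before}
print(delta)
```

Almost all of the Slab growth shows up in SUnreclaim, while SReclaimable stays roughly flat.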

Could you please help me fix this issue?

Thanks,

@gibsson
Member

gibsson commented Apr 16, 2020

Hi Adrien,

Thanks for this report. We've seen issues with Android 8 indeed.

We'll look into it as soon as possible but we have many ongoing projects at this time.

Regards,
Gary

@JeansHH

JeansHH commented Aug 12, 2020

+1 Are there any updates?

@ep-skolberg

+1
For us, this issue is blocking an upgrade to Android 8.

@gibsson
Member

gibsson commented Aug 17, 2020

Sorry we haven't had a chance to look at it yet.
We'll keep you posted as soon as we have an update.

@JeansHH

JeansHH commented Aug 20, 2020

I looked into the code and it looks like the fences created in viv_fence_create (os/linux/kernel/gc_hal_kernel_sync.c) are never released. Maybe I am wrong or this is not relevant. But if this is the problem, is there a way to release the fence?

@gibsson
Member

gibsson commented Aug 21, 2020

Quick update, now looking into this issue.
@JeansHH unfortunately as far as I can tell the userspace should be the one releasing it.
This would point to the Vivante binary blobs, which wouldn't be too surprising.
Good news is that I was able to reproduce the issue 100% of the time, will keep you guys posted.

@gibsson
Member

gibsson commented Aug 25, 2020

Another update:

  • @JeansHH I confirm that there is a leak in the fence code, but unfortunately it is not the main issue.
    Enabling KMEMLEAK showed the fence leak:
unreferenced object 0xd3e19080 (size 128):
  comm "surfaceflinger", pid 301, jiffies 4294948055 (age 102.340s)
  hex dump (first 32 bytes):
    01 00 00 00 74 7b 6e c1 00 00 00 00 00 00 00 00  ....t{n.........
    90 90 e1 d3 90 90 e1 d3 c0 90 e1 d3 00 00 00 00  ................
  backtrace:
    [<c0291a58>] kmem_cache_alloc_trace+0x1a4/0x2b0
    [<c09945ac>] viv_fence_create+0x3c/0x1ec
    [<c0962560>] gckOS_CreateNativeFence+0x7c/0x11c
    [<c096cfe8>] gckKERNEL_Dispatch+0x468/0x12d4
    [<c096e124>] gckDEVICE_Dispatch+0x2d0/0x2d4
    [<c0966614>] drv_ioctl+0x140/0x320
    [<c02b3c9c>] do_vfs_ioctl+0xc0/0x9c4
    [<c02b461c>] SyS_ioctl+0x7c/0x8c
    [<c01088c0>] ret_fast_syscall+0x0/0x48
    [<ffffffff>] 0xffffffff

However, after updating part of the driver to p9, that fence leak is gone but the main leak isn't.
See my patch here if you want to give it a try:
https://gist.github.com/gibsson/45921590256a95c5b868e37387e1909d

  • I confirm that this issue is gone with the p9.0.0_2.2.0_ga release. If you're interested, our partner Kynetics has a release for Nitrogen platforms: https://www.kynetics.com/android-bsp/boundary-devices
  • I tried backporting p9 Vivante binaries to o8 but it's more difficult than expected as:
    1- memory allocation mechanism changed
    2- new binaries depend on libraries that don't exist in i.MX Oreo (libdrm_vivante, libdrm_android)
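
A kmemleak report like the one quoted above can get long once many objects are listed. As a rough sketch, the Python below totals leaked bytes per process and per first non-allocator backtrace symbol. The function name `summarize` and the `ALLOCATORS` skip list are hypothetical, and the sample is a shortened copy of the report above.

```python
import re
from collections import Counter

# Shortened copy of the kmemleak entry quoted in this comment.
REPORT = """\
unreferenced object 0xd3e19080 (size 128):
  comm "surfaceflinger", pid 301, jiffies 4294948055 (age 102.340s)
  backtrace:
    [<c0291a58>] kmem_cache_alloc_trace+0x1a4/0x2b0
    [<c09945ac>] viv_fence_create+0x3c/0x1ec
    [<c0962560>] gckOS_CreateNativeFence+0x7c/0x11c
"""

# Generic allocator entry points to skip (illustrative, not exhaustive).
ALLOCATORS = {"kmem_cache_alloc_trace", "kmem_cache_alloc", "__kmalloc"}

def summarize(report):
    """Total leaked bytes per (comm, first non-allocator backtrace symbol)."""
    totals = Counter()
    size = comm = None
    for line in report.splitlines():
        m = re.match(r"unreferenced object 0x[0-9a-f]+ \(size (\d+)\):", line)
        if m:
            size, comm = int(m.group(1)), None
            continue
        m = re.match(r'\s+comm "([^"]*)"', line)
        if m:
            comm = m.group(1)
            continue
        m = re.match(r"\s+\[<[0-9a-f]+>\] ([\w.]+)\+", line)
        if m and size is not None and m.group(1) not in ALLOCATORS:
            totals[(comm, m.group(1))] += size
            size = None  # attribute each object only to its first real caller
    return totals

print(summarize(REPORT))
```

On a kernel built with kmemleak enabled, the full report is normally read from /sys/kernel/debug/kmemleak and could be passed to the same function.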

@JeansHH

JeansHH commented Aug 26, 2020

@gibsson Thank you so much for your effort. I really appreciate it. I will give your patch a try.

I will check with my colleagues whether we can migrate to Android 9. Will you keep working on the issue?

Once again: Thank you so much!

@JeansHH

JeansHH commented Aug 26, 2020

The behaviour is much better with this patch.

@RomainNaour

Hello @gibsson,

I just tested with the latest kernel for Android 8 (boundary-imx-o8.0.0_1.0.0-ga branch, commit 5f75170),
but the bug is still present.

In the end, the NXP firmware o8.0.0_1.0.0-ga is buggy... What about the o8.1.0_1.3.0 firmware?
Can we use it instead?

Best regards,
Romain

@gibsson
Member

gibsson commented Sep 23, 2020

Hi Romain,

Yes, there were 2 leaks before, and that patch only fixes 1 of them (the one in the kernel).
The issue has been reproduced on the NXP EVK platform with their prebuilt images, so it clearly comes from the NXP release.
I'm not sure about the other Oreo releases, as none of them were GA for i.MX6Q.
To be honest I've tried porting the libraries from o8.1.0_1.3.0 to o8.0.0 but it's not as straightforward as it sounds since NXP changed its graphics libs to depend on libdrm which wasn't the case in o8.0.0.
All I can say is that the issue doesn't occur on Nougat nor Pie so we now strongly recommend not to use Oreo.

Regards,
Gary

@RomainNaour

Hi Gary,

Thank you for your quick reply!
What about convincing NXP to publish a fix release of the o8.0.0_1.0.0-ga firmware?

Best regards,
Romain

@gibsson
Member

gibsson commented Sep 23, 2020

Hi Romain,

I've tried, with no luck. They also recommended switching to another release.
You can try as well; it doesn't hurt, and maybe it will make them change their mind.
Sorry for the inconvenience.

Regards,
Gary

@RomainNaour

Hi Gary,

No problem, my customer asked me to ask this question.
We'll try on our side as well.

Thanks,
Romain

@RomainNaour

Hi Gary,

Here is the link to the NXP community forum where the request was posted yesterday:
https://community.nxp.com/t5/i-MX-Graphics/Memory-leak-spotted-on-i-MX6-with-Android-8/m-p/1157843#M15

Best regards,
Romain

@gibsson
Member

gibsson commented Sep 24, 2020

Hi Romain,
Thanks for creating that post. I've replied making sure to mention it was reproduced on SabreSD, otherwise the answer will be "it's because you don't use NXP platform".
Now, I think it would be best for you to share the apk there as well as repro steps and procedure to see the memory leak increase.
Thanks,
Gary

@RomainNaour

Hello @JeansHH @ep-skolberg,

Can you add a comment about your issue on the NXP forum?
It would help to convince NXP to take a look at our issue.

Thanks,
Romain

@JeansHH

JeansHH commented Sep 25, 2020

done

@RomainNaour

Hello,

NXP ran a test on a Pixel phone and reproduced the issue using our app.

https://community.nxp.com/t5/i-MX-Graphics/Memory-leak-spotted-on-i-MX6-with-Android-8/m-p/1161467/highlight/true#M21

So it's not clear whether it's really an imx6 issue or not.

@gibsson
Member

gibsson commented Oct 8, 2020

Has anyone other than NXP been able to verify that claim?

@RomainNaour

Not yet. I was looking at testing Android 8 on a Le Potato board or a Raspberry Pi.

https://libre.computer/2018/09/27/android-release-for-tritium-and-le-potato

If the issue is really not related to the NXP firmware but to the Android AOSP part, shouldn't it be reproducible with an emulator?

Thanks again Gary!

@RomainNaour

Hi Gary,

I tried to reproduce the leak using two different boards: a Raspberry Pi 3 with a LineageOS image (15.1, based on Android 8.0 but with a 4.4 kernel) and a Le Potato from the link above (Android 8.0.0, kernel 4.9.61).

I'm unable to reproduce the issue so far, even on the Le Potato board using an Android image very close to the image provided by Boundary Devices for the Sabrelite board.

I'm not sure how NXP is able to reproduce the issue on an Android Pixel phone...

Best regards,
Romain

@gibsson
Member

gibsson commented Oct 14, 2020

Hi,
Please share those findings on the community forum. To be honest, I don't believe the testing done on Pixel was correct.
Regards

@grassead
Author

Hi,

I reproduced this issue on my Pixel 2 (to a lesser extent) on android-8.0.0_r34 (the last AOSP release for Pixel 2) running kernel 4.4.56-g594d847d09a1.

18h12
walleye:/ # cat /proc/meminfo
Slab: 135940 kB
SReclaimable: 47344 kB
SUnreclaim: 88596 kB

20h24
walleye:/ # cat /proc/meminfo
Slab: 137952 kB
SReclaimable: 47604 kB
SUnreclaim: 90348 kB
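
As a rough indication of what "to a lesser extent" means, the SUnreclaim growth rate between the two snapshots above can be estimated with a small Python sketch. The function name is hypothetical, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime

def growth_rate_kb_per_hour(t0, kb0, t1, kb1, fmt="%Hh%M"):
    """Average growth in kB per hour between two same-day timestamps such as '18h12'."""
    elapsed = datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)
    hours = elapsed.total_seconds() / 3600
    return (kb1 - kb0) / hours

# SUnreclaim values from the two Pixel 2 snapshots.
rate = growth_rate_kb_per_hour("18h12", 88596, "20h24", 90348)
print(f"{rate:.0f} kB/h")
```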

Thanks,

@gibsson
Member

gibsson commented Nov 13, 2020

I am closing this issue as the kernel no longer has any leak.
Also, it has been shown that updating the Vivante libraries can fix the issue.

gibsson closed this as completed Nov 13, 2020
gibsson pushed a commit that referenced this issue Apr 8, 2021
…le_activate

[ Upstream commit 5808fec ]

In case if isi.nr_pages is 0, we are making sis->pages (which is
unsigned int) a huge value in iomap_swapfile_activate() by assigning -1.
This could cause a kernel crash in kernel v4.18 (with below signature).
Or could lead to unknown issues on latest kernel if the fake big swap gets
used.

Fix this issue by returning -EINVAL in case of nr_pages is 0, since it
is anyway a invalid swapfile. Looks like this issue will be hit when
we have pagesize < blocksize type of configuration.

I was able to hit the issue in case of a tiny swap file with below
test script.
https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/scripts/swap-issue.sh

kernel crash analysis on v4.18
==============================
On v4.18 kernel, it causes a kernel panic, since sis->pages becomes
a huge value and isi.nr_extents is 0. When 0 is returned it is
considered as a swapfile over NFS and SWP_FILE is set (sis->flags |= SWP_FILE).
Then when swapoff was getting called it was calling a_ops->swap_deactivate()
if (sis->flags & SWP_FILE) is true. Since a_ops->swap_deactivate() is
NULL in case of XFS, it causes below panic.

Panic signature on v4.18 kernel:
=======================================
root@qemu:/home/qemu# [ 8291.723351] XFS (loop2): Unmounting Filesystem
[ 8292.123104] XFS (loop2): Mounting V5 Filesystem
[ 8292.132451] XFS (loop2): Ending clean mount
[ 8292.263362] Adding 4294967232k swap on /mnt1/test/swapfile.  Priority:-2 extents:1 across:274877906880k
[ 8292.277834] Unable to handle kernel paging request for instruction fetch
[ 8292.278677] Faulting instruction address: 0x00000000
cpu 0x19: Vector: 400 (Instruction Access) at [c0000009dd5b7ad0]
    pc: 0000000000000000
    lr: c0000000003eb9dc: destroy_swap_extents+0xfc/0x120
    sp: c0000009dd5b7d50
   msr: 8000000040009033
  current = 0xc0000009b6710080
  paca    = 0xc00000003ffcb280   irqmask: 0x03   irq_happened: 0x01
    pid   = 5604, comm = swapoff
Linux version 4.18.0 (riteshh@xxxxxxx) (gcc version 8.4.0 (Ubuntu 8.4.0-1ubuntu1~18.04)) #57 SMP Wed Mar 3 01:33:04 CST 2021
enter ? for help
[link register   ] c0000000003eb9dc destroy_swap_extents+0xfc/0x120
[c0000009dd5b7d50] c0000000025a7058 proc_poll_event+0x0/0x4 (unreliable)
[c0000009dd5b7da0] c0000000003f0498 sys_swapoff+0x3f8/0x910
[c0000009dd5b7e30] c00000000000bbe4 system_call+0x5c/0x70
Exception: c01 (System Call) at 00007ffff7d208d8

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
[djwong: rework the comment to provide more details]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
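
The wraparound described in the commit message can be illustrated with a short Python sketch that emulates 32-bit unsigned arithmetic. `PAGE_SHIFT = 16` (64 KiB pages) is an assumption, not something stated in the commit, and the way the two printed sizes are derived here is only one reading that happens to match the "Adding ... swap" log line above. The helper names are hypothetical.

```python
import errno

U32_MASK = 0xFFFFFFFF

def as_u32(value):
    """Truncate a Python int to 32 bits, mimicking storage in a C unsigned int."""
    return value & U32_MASK

PAGE_SHIFT = 16  # assumption: 64 KiB pages

# With nr_pages == 0, assigning nr_pages - 1 to an unsigned int wraps around.
nr_pages = 0
sis_pages = as_u32(nr_pages - 1)

# Size in kB when computed in 32-bit arithmetic, and span in kB when computed in 64-bit.
size_kb_u32 = as_u32(sis_pages << (PAGE_SHIFT - 10))
span_kb = sis_pages << (PAGE_SHIFT - 10)
print(sis_pages, size_kb_u32, span_kb)

def checked_pages(nr_pages):
    """Sketch of the guard described in the commit: reject an empty swapfile."""
    if nr_pages == 0:
        return -errno.EINVAL
    return nr_pages - 1
```

Under these assumptions the two sizes come out as 4294967232 and 274877906880, the same values that appear in the log line.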