free() invalid pointer #4761
Unmaintained hardware
Helios64 suffers (at least) from an overwritten DTS during rockchip64 patching, and is in dire need of a maintainer to clean it up. My efforts to find one have failed: people either have an old kernel running in a "production" helios64, or are not willing to put in the effort.
I have some questions. It would be great if you could help me understand the following:
Cross-linking https://forum.armbian.com/topic/26295-free-invalid-pointer-when-installing-python3-setuptools/ for those looking.
General instructions:
Check the upstream code and what our patches do. Perhaps they are not needed at all anymore (the maintainer, you, has to know that), or we have some features enabled that are not upstream, or vice versa. Resolve those patch differences.
Only prior to an upcoming release, or by request (rarely). But checking functionality by booting from an SD card is already a lot better than no checking at all, and there is no need to mess up your existing setup.
I just checked the upstream DTS code: it has been unchanged since January 2022, so I am not sure an upstream change is the reason. But I will try to look into it more tomorrow.
That sounds promising.
For the DTS overwrite: basically, thanks to the Kobol guys, we had a DTS patched into Armbian before it was available in mainline. Then it became available in mainline, and the Armbian patch never got revised or removed. All the helios64-specific patches need to be rebased on the mainline device tree.
To build it, I am assuming I need to change the patches in the 5.15 folder and compile Armbian again. Let me know if that is the way. I will try it and report whether it worked.
Yes. The patch added a new file back then. Now that file already exists in mainline, but it is completely removed by the bash patching done in the
The fact it even boots is... surprising.
So, I tried removing the patch
Also, a few more logs from dpkg
❯ cat /var/lib/dpkg/info/python3-pkg-resources.postinst
#!/bin/sh
set -e
# Automatically added by dh_python3
if command -v py3compile >/dev/null 2>&1; then
py3compile -p python3-pkg-resources
fi
if command -v pypy3compile >/dev/null 2>&1; then
pypy3compile -p python3-pkg-resources || true
fi
# End automatically added section
Nice. Probably picking parts of the previous patch can result in a good-enough new patch.
Did you try switching userspace (RELEASE=jammy, etc.)?
Likely.
I tried jammy, impish, kinetic and Debian; all suffer from this. This makes me wonder whether this is a kernel/syscall issue 🤔
I also have this issue when running Ansible against the helios64's python3. I was able to reproduce your py3compile crash. Both the py3compile crash and the crashing Ansible rule are random. I sometimes get a kernel oops. It is unlikely to be the DTS; more likely a kernel code patch. It might even affect other rockchip64 devices, since most apps run fine on the helios64. I always end up with a kernel memory error (also random) and need to reboot.
I tried searching for
@blmhemu free is a userspace call; there is no such call as free in the kernel. I was able to reproduce on the helios64 with linux-image-edge 22.11.4 (kernel 6.1.7-rockchip64), though I was able to do two runs of Ansible before the "free(): invalid pointer" bug triggered. I also got "munmap_chunk(): invalid pointer" from running
About the lack of maintainership, I would like to help with that, but until I find out why my board is unstable, I am focusing on debugging the stability issue, be it the kernel crash due to memory corruption (half of the time the board hangs without even rebooting on panic, and most of the time it hangs early on when rebooting after a panic) or the rk3399 HS400ES breakage that I nailed down to a commit in https://forum.armbian.com/topic/18855-upgrading-to-bullseye-troubleshooting-armbian-21081/page/3/#comment-128793.
https://docs.armbian.com/Board_Maintainers_Procedures_and_Guidelines/
Next suggestion: remove kernel patches, maybe all of them, and try a build with mostly mainline-only code. Does the problem persist? If not (my bet...), bisect the evil patch out of rockchip64. Another idea: update u-boot and/or the blobs.
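The bisection suggested above can be sketched as a binary search over an ordered patch series. This is a minimal sketch, assuming mainline alone is good, that exactly one patch introduces the bug, and that the bug stays present once that patch is applied. `build_and_test` is a hypothetical callback you would write yourself (apply the given patches, build, boot, run the reproducer); it is not part of the Armbian build system.

```python
from typing import Callable, Optional, Sequence


def first_bad_patch(
    patches: Sequence[str],
    build_and_test: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Return the first patch whose inclusion makes the reproducer fail.

    build_and_test(prefix) is a hypothetical callback: it applies only the
    patches in `prefix` on top of mainline, builds and boots the kernel, and
    returns True if the reproducer passes.
    """
    if not patches or build_and_test(patches):
        return None  # full series is good (or empty): nothing to bisect
    # Invariant: a prefix of length lo is good, a prefix of length hi is bad.
    lo, hi = 0, len(patches)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if build_and_test(patches[:mid]):
            lo = mid
        else:
            hi = mid
    # Adding patches[hi - 1] to a good prefix is what made it fail.
    return patches[hi - 1]
```

With N patches this needs about log2(N) build-and-boot cycles plus the initial full-series check, which matters when each cycle means flashing and rebooting the board.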
I changed vin-supply to pwm-supply in the vdd-log section of the helios64 board DTS. vdd-log is known for stability issues if not powered properly, and without this fix it was assigned the dummy regulator. It is a stretch, but I am still not sure whether the kernel memory errors I have on the helios64 are due to a driver that corrupts memory or to a wrong setting that makes the CPU unstable.
@blmhemu do you run OMV on top of the helios64? OMV has a few optimizations (sysctl), tools that stress memory (folder2ram), and probably others that might make the bug from the unknown source more visible, though it could be that the issue is not helios64-specific. Sorry, it is unlikely to be an OMV-related issue: from the above, I understand you ran vanilla jammy, impish, kinetic and Debian and reproduced it.
Can anyone reproduce this issue with another RK3399 board?
No. Plain Ubuntu.
I do not have another RK3399 board to test. If anyone could, I made a simpler test case:
I also get "free(): invalid pointer", "double free or corruption (out)" or other errors. I tried with the python3 debugger:
This is with the python3 debugger with the tracemalloc flag. With the python3 debugger without flags:
Note that if I do the loop in Python code instead of invoking the Python runtime in a loop, the crash does not occur.
works. I also wonder why only python3 is affected on my system. Maybe running other python3 setups in Docker containers on the helios64 could confirm whether this is due to the userspace setup (or it could also be that this particular python3.9 setup stress-tests a specific issue with the kernel or hardware, and another setup would just hide the issue).
@blmhemu could you paste the output of /proc/buddyinfo when python3 starts to report the invalid free? It seems that python3 does not cope well when an allocation fails and tries to free the memory even though it was never allocated. That may explain our issue. What is not clear is why we get these errors even though all was fine before; it may in fact be another issue. In the process of debugging this invalid free I turned off zswap, so maybe I produced a page allocation failure with my debugging attempts. I tested in a Docker container on the helios64 running bullseye (still with the latest master edge kernel), with Debian bookworm's python3.11 in the container, and was able to reproduce the invalid free.
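To make a /proc/buddyinfo snapshot taken at the moment of failure easier to compare, a small parser helps. This is a minimal sketch: each line lists, per node and zone, the number of free blocks of order n (2^n contiguous pages); the helper names are illustrative, and a small highest free order only hints that larger contiguous allocations could fail at that moment.

```python
from typing import Dict, List, Tuple

Zone = Tuple[int, str]  # (NUMA node, zone name)


def parse_buddyinfo(text: str) -> Dict[Zone, List[int]]:
    """Parse /proc/buddyinfo into {(node, zone): [free blocks per order]}.

    Lines look like "Node 0, zone   Normal   12   7   3 ...", where the
    n-th number counts free blocks of 2**n contiguous pages.
    """
    result: Dict[Zone, List[int]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        head, _, rest = line.partition("zone")
        node = int(head.replace("Node", "").strip().rstrip(","))
        fields = rest.split()
        result[(node, fields[0])] = [int(n) for n in fields[1:]]
    return result


def free_pages(counts: List[int]) -> int:
    """Total free pages in a zone: sum of count * 2**order."""
    return sum(count << order for order, count in enumerate(counts))


def highest_free_order(counts: List[int]) -> int:
    """Largest order with at least one free block, or -1 if none."""
    orders = [order for order, count in enumerate(counts) if count > 0]
    return max(orders, default=-1)


def snapshot(path: str = "/proc/buddyinfo") -> Dict[Zone, List[int]]:
    """Read and parse the live buddy allocator state."""
    with open(path) as f:
        return parse_buddyinfo(f.read())
```

Calling `snapshot()` right after the reproducer fails, and again when the system is idle, would show whether the failing moment coincides with heavy fragmentation.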
@blmhemu sorry to bother you again. Do your tests with different releases (jammy, kinetic, etc.) all run with a different "current" Armbian kernel, or did the builds run with the latest kernel (6.1?)?
I made a mistake, and thus u-boot booted my old eMMC install (which I had left untouched since at least July 2022). Then I tried the SD card install (up-to-date bullseye) with the current Armbian kernel
@blmhemu I confirm that just installing the linux-image-current-rockchip64 package and its matching linux-dtb-current-rockchip64 package at version 21.08.2 (kernel 5.10.63-rockchip64) fixes this issue.
five times without an issue. I can even run the test case fine with the latest kernels if I disable cpufreq with the kernel boot parameter cpufreq.off=1.
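After adding cpufreq.off=1, it is worth confirming that the running kernel actually received it. A minimal sketch that parses the kernel command line; on Armbian, extra kernel parameters are usually set through the `extraargs=` line of /boot/armbianEnv.txt, but treat that location as an assumption and check your boot script.

```python
import shlex
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {param: value}.

    Bare flags (e.g. "quiet") map to None. For repeated parameters the
    last occurrence wins, which is a simplification.
    """
    params: Dict[str, Optional[str]] = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def cpufreq_disabled(cmdline: str) -> bool:
    """True if the command line carries cpufreq.off=1."""
    return parse_cmdline(cmdline).get("cpufreq.off") == "1"


def running_cmdline(path: str = "/proc/cmdline") -> str:
    """Command line of the currently running kernel."""
    with open(path) as f:
        return f.read().strip()
```

On the board, `cpufreq_disabled(running_cmdline())` should return True after the reboot if the parameter was picked up.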
I have been able to reproduce the python3 invalid free with linux-image-legacy-rk3399, which is 4.4.213. I believe I did not encounter the issue before because my Ansible setup was previously using the python2 installed on the helios64, not python3. @blmhemu, I need to retry, but I believe with cpufreq disabled ( Also, I tried the latest 6.1.12 with cpufreq enabled and all Armbian patches removed except the helios64 board patch add-board-helios64.patch, board-helios64-remove-pcie-ep-gpios.patch, my eMMC HS400ES patch to read the eMMC at HS400, and rk3399-enable-dwc3-xhci-usb-trb-quirk.patch, and I can still reproduce the issue (I also tried with
Hello! @prahal I can confirm that the
I built the image with the
Does this mean it could be a bug in the upstream kernel?
Done, still stable. Updated u-boot (ddrbin now shows 1.25); see https://pastebin.mozilla.org/M1XXJnLn
@blmhemu then I believe that when you did the latest bullseye install, you somehow modified the installed bootloader, probably on the 21st of February 2023, when you told us you had the issue fixed (sorry, I forgot you already had a fixed setup; I thought you were still suffering from this issue):
So the Rockchip DDR blob 1.24 is likely fine too.
Unfortunately, I do not have those logs :(
@prahal I diffed both logs, and here are my findings:
The unstable build
vs. the stable build
Link to diff https://www.diffchecker.com/3D0UDOHx/
UPDATE (Again):
I have compiled Armbian with the above option and flashed it. Maybe we found the root cause? (u-boot TPL)
Observations
UPDATE 3: Ran the Python loop.
Observations
@blmhemu about the DDR frequencies: I added the ddrbin frequencies to the blobless u-boot (keeping all other DDR parameters the same, which is probably not right) and forced them. Still the same issue (I should post the hack so others can reproduce this, but for one thing I am away for a few weeks). Note that v2023.04 has a fix to do the training at 400MHz instead of 50MHz, but this did not help with our issue. About SATA/NVMe, maybe look at the Kobol wiki; I am confident this was answered there, probably in the comments. (I also made a u-boot 2023.04 build that seems to have pretty good SATA support, but it requires migrating to the new APIs (bootflow, bootdev, bootmeth). I wanted to spend time sharing this hack of a build, but it turned out not to help with the DDR stability issue, so it became lower priority. I believe instructions to achieve SATA boot are already available in the Kobol wiki. If not, tell me and I will try to share my u-boot v2023.04 build for the Helios64. Mind that this build had an issue where it can boot-loop in u-boot (I managed to stop the loop but have not investigated the cause yet), so it is pretty experimental.) And u-boot v2022.10, I believe, was the version that was not buildable, as it had partially migrated to binman-based binary builds while still being half Makefile-based, so I was not able to build both the idbloader.img and u-boot.itb binaries. All in all, I tried those versions for their new DDR-related fixes, which ended up not being related to this invalid free bug. About the eth0 error "Net: dw_dm_mdio_init", I always thought it had always been so. I will take a look at whether I can get this working, but probably not soon (unless it turns out to be an easy catch).
2020.10 - https://pastebin.com/MmtpS7F9
I have upgraded the system (apt update && apt upgrade) and now it cannot boot.
Do you mean u-boot loads the kernel and then nothing happens, or do you get an error on the serial console? By the way, this looks like another issue; to avoid this thread becoming unreadable, I guess it requires a thread of its own on the Armbian forum. Feel free to tag me in your forum thread so I get a notification by email.
If anyone encounters this: it is due to the Armbian-provided linux-libc-dev (use the Debian one instead by giving the Armbian repo a lower priority).
Thanks for the follow-up and workaround. Feel free to open another bug report to track this issue! I am currently trying a few ideas, as I was able to get into a state where the raid10 resync always crashes the kernel on the helios64 (a crash I have randomly had since I received the unit). I will try to sort out which of the ideas are useless against this issue. (I even had hints that it could be related to the firmware of the HDDs behind the SATA/PCIe bridge; the rk3399 PCIe is known to have bugs, but I suspect, at the very least, that it is not the known issue of PCIe devices being slow to enumerate.) Or it could be another DDR memory corruption that the mdadm raid10 resync stresses, the resync being the only test case that reliably reproduces these crashes.
Hi there, this problem is solved when I add cpufreq.off=1, but then the CPU is really slow.
@prahal: Is there anything wrong with opening a pull request until a better solution is found?
@prahal Thank you! Your fix solved my helios64 problem.
Rejoiced too soon!
kernel:[47341.023705] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
Message from syslogd@helios64 at Aug 22 10:59:53 ...
Message from syslogd@helios64 at Aug 22 10:59:53 ...
@snakekick indeed, the kernel crashes are not fixed. It seems to me this is memory corruption: it affects random kernel code. But I have left the helios64 down for a while, as I have a way to reproduce the kernel corruption fast: booting when the raid10 array had a bad crash and is healing at boot. It might still be memory-related. We should discuss this matter in the forum, though, as I believe it is not the same as the current issue, for which we have a workaround.
Hi @prahal and @snakekick, I'm also running a Helios64, which freezes randomly every few days now :/
@d3473r I've downgraded the bootloader as well; so far, no more freezes. Here is how to do it:
cd /tmp
wget --content-disposition https://imola.armbian.com/apt/pool/main/l/linux-u-boot-helios64-edge/linux-u-boot-edge-helios64_22.02.1_arm64.deb
dpkg -x linux-u-boot-edge-helios64_22.02.1_arm64.deb linux-u-boot-edge-helios64_22.02.1_arm64/
vi /usr/lib/u-boot/platform_install.sh
While in the file, comment out the existing DIR line and point DIR at the extracted package:
#DIR=/usr/lib/linux-u-boot-current-helios64
DIR=/tmp/linux-u-boot-edge-helios64_22.02.1_arm64/usr/lib/linux-u-boot-edge-helios64_22.02.1_arm64
Then launch the bootloader installation. I strongly suggest dumping your current bootloader first, just in case:
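The suggestion to dump the current bootloader first can be sketched as copying the start of the boot device to a file. This is a minimal sketch under stated assumptions: the loader lies within the first 16 MiB, which matches the usual Rockchip layout (idbloader at sector 64, u-boot.itb at sector 16384) but should be verified for your image; the device path (for example /dev/mmcblk1) is a placeholder, and reading a block device needs root.

```python
import hashlib

BOOT_AREA_BYTES = 16 * 1024 * 1024  # assumption: covers idbloader + u-boot.itb
CHUNK = 1024 * 1024


def dump_boot_area(device: str, out_path: str, size: int = BOOT_AREA_BYTES) -> str:
    """Copy the first `size` bytes of `device` to `out_path`.

    Returns the SHA-256 of what was written, so two cards (or one card
    before and after an upgrade) can be compared without diffing images.
    """
    digest = hashlib.sha256()
    remaining = size
    with open(device, "rb") as src, open(out_path, "wb") as dst:
        while remaining > 0:
            chunk = src.read(min(CHUNK, remaining))
            if not chunk:  # source shorter than `size`
                break
            dst.write(chunk)
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()
```

Restoring means writing the file back to the same device from offset 0, which is destructive, so double-check the device name before doing it.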
Hi @bcecchinato, I installed the bootloader following your instructions. It ran for a while after a reboot but eventually froze again after a few hours :(
@d3473r yep, it crashed on my side this morning as well :( Depending on which storage you are using (eMMC/SD card), the
I'm trying another bootloader here:
Since I don't really know what this changes, maybe this attempt is completely useless :D and unfortunately I'm not an expert with Armbian, bootloaders, etc. However, I can run some tests if other users in this topic want.
Are you running Debian 11 or 12?
I'm on Debian 12. The free issue started with bookworm; no issues with bullseye and the latest bootloader (but I can't say whether it was up to date or not).
@d3473r the free issue is not the same as the freeze issue. What I mean is that you can fix the free issue but still have the freezes, as I do. Still, I am chasing the freeze issue too. I have had the helios64 down for weeks, since it is in a state where I can reproduce the freeze, that is, raid10 resyncing at boot. I would like to have a bug report to centralize the freeze reports; as of now, they are scattered across various threads on the Armbian forum. Maybe you could open a new one there and give the link here? Note that I have had freezes since I got the helios64. At one point I blamed the rk3399 PCIe... but I am unsure now. Mind that my raid10 stresses the PCIe in the SoC and the SATA controller, so details of other setups could help, especially what the setups that were or are working look like.
@prahal I don't know if the free issue and the freezes are related, but since I downgraded the bootloader to 21.08.9, I haven't encountered either the free error or freezes. I'm running on a uSD card, with the bootloader installed on the uSD card as well (the eMMC is completely blank; I've written zeroes to it with dd to be sure not to boot from it). My case is a bit different: I had a uSD card with Debian bullseye, and made a fresh install of bookworm on a second uSD card. The trouble started from this point. I haven't wiped the old card, so I can make some diffs between the two in case this might help. Both cards have the same bootloader version installed (23.08.1), but I can't say whether the bootloader actually written on the old uSD is 23.08.1 or 21.08.9. I wish I could help more, but like you, I have no skills with bootloaders :( The only sure thing is: bookworm with the 21.08.9 bootloader works fine and has no free issues at all.
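To tell which bootloader is actually written on a card, one option is to search the raw boot area for the version banner U-Boot embeds in its binaries (strings such as "U-Boot SPL 2021.07-..." or "U-Boot 2022.04-armbian ..."). A minimal sketch, assuming the loader lies within the first 16 MiB and the banner has this common format; the exact text depends on how the image was built, and the Armbian package version (e.g. 21.08.9) does not necessarily appear in it.

```python
import re
from typing import List

BOOT_AREA_BYTES = 16 * 1024 * 1024  # assumption: loader lives in the first 16 MiB
# Matches banners such as b"U-Boot 2022.04-armbian (Feb 10 2022 - ...)" or
# b"U-Boot SPL 2022.04-..."; the match stops at NUL or a line break.
BANNER = re.compile(rb"U-Boot (?:SPL |TPL )?\d{4}\.\d{2}[^\x00\r\n]*")


def find_uboot_banners(path: str, size: int = BOOT_AREA_BYTES) -> List[str]:
    """Return the distinct U-Boot version banners in the first `size` bytes."""
    with open(path, "rb") as f:
        data = f.read(size)
    seen: List[str] = []
    for match in BANNER.finditer(data):
        text = match.group().decode("ascii", errors="replace")
        if text not in seen:
            seen.append(text)
    return seen
```

Running it against both uSD cards (as root, on the card's block device or on a dump of it) would show whether they carry the same U-Boot build.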
@prahal I have a pretty good idea of when the freezes started, but not why. My helios64 is used as a Time Machine backup target, and the backups have been failing since the beginning of September. I'm certain about this because the root filesystem is encrypted, and any freeze or reboot would have forced me to unlock the root fs via SSH to boot the helios64 up again. So my guess is: I updated something at the beginning of September (presumably kernel updates; I have not done a dist-upgrade), and since then the freezes have been occurring.
@d3473r you have a history of the upgrades in /var/log/apt/history.log (and its rotated history.log.N.gz files). Note that knowing the previous working versions is even more interesting than knowing the new broken ones. Also, it could be that the new version is only more efficient and thus stresses the hardware more (or even enables a new hardware component). When you say it ran for over a year without a freeze, do you mean there were also freezes before that? Were they rare before that time? I bet you had never upgraded the bootloader before you did recently. Do you know from which image you initially installed to the eMMC or SD card? One might be able to guess the older bootloader from that. Also, it could be that the load on the hardware changed over time, and even without any upgrade you would have ended up with this freeze. Also, do you get small static discharges when touching the helios64 enclosure? I am pretty sure this is unrelated nowadays, but who knows (I get them when my helios64 power adapter is close to my UPS and a set of other chargers; I have not yet sorted out which one).
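Going through the apt history can be partly automated. A minimal sketch, assuming the usual history.log layout (a `Start-Date:` line followed by lines such as `Upgrade: pkg:arch (old, new), ...`) and that the interesting Armbian packages (kernel, DTB, u-boot) all start with `linux-`; it lists each upgrade with both the old and the new version, so the last known-good versions show up too.

```python
import glob
import gzip
import re
from typing import Iterator, Tuple

# One "pkg:arch (old, new)" entry on an Upgrade: line.
UPGRADE_ENTRY = re.compile(r"([^\s,()]+) \(([^,()]+), ([^,()]+)\)")


def parse_history(text: str, prefix: str = "linux-") -> Iterator[Tuple[str, str, str, str]]:
    """Yield (start_date, package, old_version, new_version) for upgrades
    of packages whose name starts with `prefix`."""
    start = ""
    for line in text.splitlines():
        if line.startswith("Start-Date:"):
            start = line.split(":", 1)[1].strip()
        elif line.startswith("Upgrade:"):
            for pkg, old, new in UPGRADE_ENTRY.findall(line[len("Upgrade:"):]):
                if pkg.startswith(prefix):
                    yield start, pkg, old, new


def read_history(pattern: str = "/var/log/apt/history.log*") -> str:
    """Concatenate the current and rotated (.gz) apt history logs."""
    parts = []
    for path in glob.glob(pattern):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as f:
            parts.append(f.read())
    return "\n".join(parts)
```

For example, `for row in sorted(parse_history(read_history())): print(*row)` prints the kernel and bootloader upgrades in date order.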
On the Helios64, random memory errors happen when using the U-Boot DDR initialization code for rk3399. Switching to the rkbin rk33 933MHz v1.25 DDR blob allows this test case to run more than once without a memory error:
for i in $(seq 1 100); do python3 -c "import pkg_resources" || break; done
Could be LPDDR4-specific. Workaround for armbian#4761 "free() invalid pointer".
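The test case above can also be driven from a small script that records which run failed and whether glibc reported heap corruption. A minimal sketch: each iteration starts a fresh interpreter, because this thread reports that looping inside a single Python process does not trigger the crash. The marker strings are the messages quoted in this issue; LIBC_FATAL_STDERR_ is set because some glibc versions send these fatal messages to the terminal rather than to stderr.

```python
import os
import subprocess
from typing import Optional, Sequence, Tuple

# glibc heap-corruption messages quoted in this issue.
CORRUPTION_MARKERS = (
    "free(): invalid pointer",
    "double free or corruption",
    "munmap_chunk(): invalid pointer",
)


def run_until_failure(
    cmd: Sequence[str], iterations: int = 100
) -> Optional[Tuple[int, int, str]]:
    """Run `cmd` in a fresh process up to `iterations` times.

    Returns (run_number, returncode, stderr) for the first failing run, or
    None if every run exited with status 0. A process killed by a signal
    (e.g. SIGABRT after a glibc abort) has a negative returncode.
    """
    env = dict(os.environ, LIBC_FATAL_STDERR_="1")
    for run in range(1, iterations + 1):
        proc = subprocess.run(
            cmd, capture_output=True, text=True, errors="replace", env=env
        )
        if proc.returncode != 0:
            return run, proc.returncode, proc.stderr
    return None


def looks_like_heap_corruption(stderr: str) -> bool:
    """True if stderr contains one of the glibc corruption messages."""
    return any(marker in stderr for marker in CORRUPTION_MARKERS)
```

Usage: `run_until_failure(["python3", "-c", "import pkg_resources"])`. Checking `looks_like_heap_corruption` on the returned stderr separates the memory errors from ordinary failures such as a missing pkg_resources module.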
What happened?
Getting a "free() invalid pointer" error when installing/using python3. Could be a dpkg issue as well!
Board: Helios64 (I know, I know CSC)
Chipset: RK3399
When I ran
sudo apt update && sudo apt upgrade
it happened, and hence I tried to reinstall. This issue also occurs when trying to manage the board with Ansible. I think something might be wrong with the latest Python.
Branch
master (main development branch)
On which host OS are you observing this problem?
Jammy
Relevant log output