free() invalid pointer #4761

Closed
1 task done
blmhemu opened this issue Jan 28, 2023 · 78 comments

Comments

@blmhemu

blmhemu commented Jan 28, 2023

What happened?

Getting a free(): invalid pointer error when installing / using python3. Could be a dpkg issue as well!
Board: Helios64 (I know, I know CSC)
Chipset: RK3399

Screenshot 2023-01-28 at 9 44 08 AM

  ansible_facts: {}
  failed_modules:
    ansible.legacy.setup:
      ansible_facts:
        discovered_interpreter_python: /usr/bin/python3
      failed: true
      module_stderr: |-
        free(): invalid pointer
        Aborted
      module_stdout: ''
      msg: |-
        MODULE FAILURE
        See stdout/stderr for the exact error
      rc: 134
  msg: |-
    The following modules failed to execute: ansible.legacy.setup

When I did sudo apt update && sudo apt upgrade, it happened and hence I tried to reinstall.
This issue occurs when trying to manage it with ansible as well. I think something might be wrong with the latest python.

Branch

master (main development branch)

On which host OS are you observing this problem?

Jammy

Relevant log output

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
0 upgraded, 0 newly installed, 2 reinstalled, 0 to remove and 0 not upgraded.
Need to get 0 B/472 kB of archives.
After this operation, 0 B of additional disk space will be used.
(Reading database ... 44574 files and directories currently installed.)
Preparing to unpack .../python3-pkg-resources_59.6.0-1.2ubuntu0.22.04.1_all.deb ...
double free or corruption (out)
Aborted
dpkg: warning: old python3-pkg-resources package pre-removal script subprocess returned error exit status 134
dpkg: trying script from the new package instead ...
dpkg: ... it looks like that went OK
Unpacking python3-pkg-resources (59.6.0-1.2ubuntu0.22.04.1) over (59.6.0-1.2ubuntu0.22.04.1) ...
Preparing to unpack .../python3-setuptools_59.6.0-1.2ubuntu0.22.04.1_all.deb ...
Unpacking python3-setuptools (59.6.0-1.2ubuntu0.22.04.1) over (59.6.0-1.2ubuntu0.22.04.1) ...
Setting up python3-pkg-resources (59.6.0-1.2ubuntu0.22.04.1) ...
Setting up python3-setuptools (59.6.0-1.2ubuntu0.22.04.1) ...
free(): invalid pointer
Aborted
dpkg: error processing package python3-setuptools (--configure):
 installed python3-setuptools package post-installation script subprocess returned error exit status 134
Errors were encountered while processing:
 python3-setuptools
E: Sub-process /usr/bin/dpkg returned an error code (1)
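
A note on the `exit status 134` (and Ansible's `rc: 134`) seen above: shells report a process killed by a signal as 128 plus the signal number, and glibc aborts the process (SIGABRT, signal 6 on Linux) when it detects a corrupted heap. A minimal sketch of that decoding follows; `describe_exit_status` is a hypothetical helper name used for illustration, not part of dpkg or Ansible.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status (hypothetical helper, for illustration)."""
    if status > 128:
        # Shell convention: 128 + N means the process was killed by signal N.
        signum = status - 128
        try:
            return f"killed by {signal.Signals(signum).name} (signal {signum})"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited normally with code {status}"


print(describe_exit_status(134))
```

So the status only says that the Python process aborted; it does not say whether the corruption came from userspace, the kernel, or the hardware.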

Code of Conduct

  • I agree to follow this project's Code of Conduct
@EvilOlaf
Member

Unmaintained hardware
Not a building issue. Please use forums for issues when running Armbian.

@EvilOlaf closed this as not planned on Jan 28, 2023
@rpardini
Member

Helios64 suffers (at least) from an overwritten DTS during rockchip64 patching, and is in dire need of a maintainer to clean it up. My efforts to find one have failed: people either have an old kernel running on a "production" helios64, or are not willing to put in the effort.

@blmhemu
Author

blmhemu commented Jan 28, 2023

Have some questions. Would be great if you can help me understand the following

  • What would it take to maintain it ?
    • Does it need intimate code knowledge or is it mostly rewriting existing patches ?
    • Do you have some pointers / code references / prs on fixing this DTS overwrite issues ?
    • Should we constantly flash the device and lose "production" persistent settings / config ?

Cross-linking https://forum.armbian.com/topic/26295-free-invalid-pointer-when-installing-python3-setuptools/ for those looking.

@igorpecovnik
Member

General instructions:
https://docs.armbian.com/Board_Maintainers_Procedures_and_Guidelines/

Do you have some pointers / code references / prs on fixing this DTS overwrite issues ?

Check the upstream code and what our patches do. Perhaps they are not needed at all anymore (the maintainer, you, has to know that), or we have some features enabled that are not upstream, or vice versa. Then resolve those patch diffs.

Should we constantly flash the device and lose "production" persistent settings / config ?

Only prior to an upcoming release or by request (rarely). But checking functionality by booting from an SD card is already a lot better than no checking at all. No need to mess up your existing setup.

@blmhemu
Author

blmhemu commented Jan 28, 2023

Check the upstream code and what our patches do. Perhaps they are not needed at all anymore (the maintainer, you, has to know that), or we have some features enabled that are not upstream, or vice versa. Then resolve those patch diffs.

I just checked the upstream DTS: it has been unchanged since Jan 2022. Not sure if an upstream change is the reason, but I will try to look more into it tomorrow.

Only prior to an upcoming release or by request (rarely). But checking functionality by booting from an SD card is already a lot better than no checking at all. No need to mess up your existing setup.

That sounds promising.

@Tonymac32
Member

For the DTS overwrite, basically we had a DTS patched into Armbian thanks to the Kobol guys before it was available in mainline. Then it became available in mainline and the Armbian patch never got revised/removed. All the helios64 specific patches need rebased on the mainline device tree.

@blmhemu
Author

blmhemu commented Jan 28, 2023

For the DTS overwrite, basically we had a DTS patched into Armbian thanks to the Kobol guys before it was available in mainline. Then it became available in mainline and the Armbian patch never got revised/removed. All the helios64 specific patches need rebased on the mainline device tree.

To build it, I am assuming I need to make changes to the patches in the 5.15 folder and compile armbian again. Let me know if that is the way.

Will try to do it and report if it worked or not.

@rpardini
Member

basically we had a DTS patched into Armbian thanks to the Kobol guys before it was available in mainline

Yes. The patch added a new file back then. Now, that file already exists in mainline, but is completely removed by the bash patching done in the master branch, thus becoming this monstrosity: rpardini/linux@dc718a4

The fact it even boots is... surprising.

@blmhemu
Author

blmhemu commented Jan 29, 2023

So, I tried removing the patch add-board-kobol-helios64... in 5.15 and compiled armbian.
It did compile fine and even booted. Things I observed:

  • The device booted ! Yay !
  • The fans are running at full speed unlike before.
  • The memory double free issue persists. Still getting the invalid pointer or double free issue. It honestly feels like something wrong with userspace / ubuntu / debian.

Also, a few more logs from dpkg

D000001: ensure_diversions: new, (re)loading
D000001: process queue pkg python3-pkg-resources:all queue.len 0 progress 1, try 1
D000040: checking dependencies of python3-pkg-resources:all (- <none>)
D000400:   checking group ...
D000400:     checking possibility  -> python3
D000400:       checking non-provided pkg python3:arm64
D000400:       is installed, ok and found
D000400:     found 3
D000400:   found 3 matched 0 possfixbytrig -
D000040: ok 2 msgs >><<
D000040:     checking Breaks
Setting up python3-pkg-resources (59.6.0-1.2ubuntu0.22.04.1) ...
D000002: fork/exec /var/lib/dpkg/info/python3-pkg-resources.postinst ( configure 59.6.0-1.2ubuntu0.22.04.1 )
free(): invalid pointer
Aborted
dpkg: error processing package python3-pkg-resources (--configure):
 installed python3-pkg-resources package post-installation script subprocess returned error exit status 134
D000001: ensure_diversions: same, skipping
Errors were encountered while processing:
 python3-pkg-resources
❯ cat /var/lib/dpkg/info/python3-pkg-resources.postinst
#!/bin/sh
set -e

# Automatically added by dh_python3
if command -v py3compile >/dev/null 2>&1; then
	py3compile -p python3-pkg-resources
fi
if command -v pypy3compile >/dev/null 2>&1; then
	pypy3compile -p python3-pkg-resources  || true
fi

# End automatically added section

May have found the root cause
Screenshot 2023-01-29 at 4 45 08 PM

Digging even further
Screenshot 2023-01-29 at 5 48 55 PM

Related searches

@rpardini
Member

  • The device booted ! Yay !
  • The fans are running at full speed unlike before.

Nice, probably some picking from the previous patch can result in a good-enough new patch.

  • The memory double free issue persists. Still getting the invalid pointer or double free issue. It honestly feels like something wrong with userspace / ubuntu / debian.

Did you try switching userspace? RELEASE=jammy etc?

@blmhemu
Author

blmhemu commented Jan 30, 2023

Nice, probably some picking from the previous patch can result in a good-enough new patch.

Likely.

Did you try switching userspace? RELEASE=jammy etc?

I tried with jammy, impish, kinetic, debian. All suffer from this. This makes me wonder whether this is a kernel / syscall issue 🤔

@prahal
Collaborator

prahal commented Feb 6, 2023

I also have this issue running Ansible against the helios64 python3. I was able to reproduce your py3compile crash. Both the py3compile crash and the Ansible rule that crashes are random. I sometimes get a kernel oops. Unlikely DTS. Likely a kernel code patch. It might even affect other rockchip64 devices, because most apps run fine on the helios64. I always end up with a kernel memory error (also random) and need to reboot.

@blmhemu
Author

blmhemu commented Feb 6, 2023

I also have this issue running Ansible against the helios64 python3. I was able to reproduce your py3compile crash. Both the py3compile crash and the Ansible rule that crashes are random. I sometimes get a kernel oops. Unlikely DTS. Likely a kernel code patch. It might even affect other rockchip64 devices, because most apps run fine on the helios64. I always end up with a kernel memory error (also random) and need to reboot.

I tried searching for free() function calls, but could not find in patches 🤔

@prahal
Collaborator

prahal commented Feb 6, 2023

@blmhemu free is a userspace call. There is no such call as free in the kernel.
All in all, I do not see how a python3 call can end up in kernel memory corruption. But I am unable to reproduce the python3 free invalid pointer issue on another armbian box also running bullseye with linux-image current 22.11.4 kernel 6.1.7-meson64. So userspace python3 is identical on both.
It could still be that the issue is only harder to reproduce on armbian meson, hard to ascertain.

I was able to reproduce on helios64 with linux image edge 22.11.4 kernel 6.1.7-rockchip64 but I was able to do two runs of ansible before the bug "free(): invalid pointer" triggered. I also got "munmap_chunk(): invalid pointer" from running sudo py3compile -p python3-pkg-resources in a loop.

About the lack of maintainership, I would like to help with that, but until I find out why my board is unstable I focus on debugging the stability issue. Be it the kernel crash due to memory corruption (and half of the time the board hangs without even rebooting on panic, and most of the time at reboot after panic it hangs early on) or the rk3399 hs400es breakage that I nailed down to a commit in https://forum.armbian.com/topic/18855-upgrading-to-bullseye-troubleshooting-armbian-21081/page/3/#comment-128793.

@igorpecovnik
Member

About the lack of maintainership, I would like to help with that

https://docs.armbian.com/Board_Maintainers_Procedures_and_Guidelines/
tl;dr: scan the text, send contact details (the link is in the text), and be around to run some tests and fix (or at least log) problems prior to major releases, when the kernel version changes, or around similar bigger changes like the upcoming build system upgrade.

@rpardini
Member

rpardini commented Feb 7, 2023

Next suggestion: remove kernel patches, maybe all of them, and try a build with mostly mainline-only stuff. Does the problem persist? If not (my bet...), bisect the evil patch out of rockchip64... Another idea: update u-boot and/or blobs.

@prahal
Collaborator

prahal commented Feb 7, 2023

I changed vin-supply to pwm-supply in the vdd-log section of the helios64 board DTS. vdd-log is known for stability issues if not powered properly, and without this fix it was assigned the dummy regulator.

It is a long shot, but I am still undecided whether the helios64 kernel memory errors I have are due to a driver that corrupts memory or a wrong setting that makes the CPU unstable.

@prahal
Collaborator

prahal commented Feb 8, 2023

@blmhemu do you run OMV on top of the helios64?

OMV has a few optimizations (sysctl), tools that stress the memory (folder2ram), and probably others that might make the bug from the unknown source more visible, but it could be that the issue is not helios64 specific.

Sorry unlikely an OMV-related issue as from above I understand you ran vanilla jammy, impish, kinetic, debian and reproduced.

@Tonymac32
Member

can anyone reproduce this issue with another RK3399 board?

@blmhemu
Author

blmhemu commented Feb 13, 2023

@blmhemu do you run OMV on top of the helios64?

No. Plain ubuntu.

@prahal
Collaborator

prahal commented Feb 16, 2023

I do not have another RK3399 board to test. If anyone could, I made a simpler test case:

$ for i in $(seq 1 100);do python3 -c "import pkg_resources; pkg_resources.parse_version('1')" || break;done
double free or corruption (out)
Aborted (core dumped)

I also get "free(): invalid pointer", "double free or corruption (out)", or other variants.
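
For anyone who wants to run the same test from a script, here is a minimal Python sketch of the shell loop above. It starts a fresh interpreter for each attempt (as the shell loop does) and stops at the first failure. `run_until_failure` is a hypothetical helper name, and the sketch assumes `pkg_resources` is installed for the interpreter under test.

```python
import subprocess
import sys


def run_until_failure(cmd, attempts=100):
    """Run cmd in a fresh process up to `attempts` times.

    Returns (iteration, returncode) for the first failing run, or None if
    every run succeeded. Hypothetical helper name, for illustration only.
    """
    for i in range(1, attempts + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            # A negative returncode means the child was killed by a signal
            # (for example -6 for SIGABRT).
            return i, proc.returncode
    return None


if __name__ == "__main__":
    snippet = "import pkg_resources; pkg_resources.parse_version('1')"
    print(run_until_failure([sys.executable, "-c", snippet]))
```

Starting a new process per attempt matters here, since each run goes through interpreter startup and module import again.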

I tried with the python3 debug build (python3-dbg):

$ for i in $(seq 1 100);do python3-dbg -X tracemalloc -c "import pkg_resources; pkg_resources.parse_version('1')" || break;done
Debug memory block at address p=0x800ffffb7b29ab0: API ''
    18302063728016752640 bytes originally requested
    The 7 pad bytes at p-7 are not all FORBIDDENBYTE (0xfd):
        at p-7: 0x00 *** OUCH
        at p-6: 0x00 *** OUCH
        at p-5: 0x00 *** OUCH
        at p-4: 0x00 *** OUCH
        at p-3: 0x00 *** OUCH
        at p-2: 0x00 *** OUCH
        at p-1: 0x00 *** OUCH
    Because memory is corrupted at the start, the count of bytes requested
       may be bogus, and checking the trailing pad bytes may segfault.
    The 8 pad bytes at tail=0x5fefdfdb4b29ab0 are Segmentation fault (core dumped)

This is with the python3 debug build and the tracemalloc flag.
Note that I have a hard time getting an error with the tracemalloc flag: most of the time all hundred test runs proceed fine with it. As the flag slows down the process quite a lot, this could point to a timing-dependent issue.

With the python3 debug build without flags:

for i in $(seq 1 100);do python3-dbg -c "import pkg_resources; pkg_resources.parse_version('1')" || break;done
Debug memory block at address p=0x4000ffffbb8f58d0: API '�'
    3508782105221857280 bytes originally requested
    The 7 pad bytes at p-7 are FORBIDDENBYTE, as expected.
    The 8 pad bytes at tail=0x70b2b2bbbb8e58d0 are Segmentation fault (core dumped)

Note that if I do the loop in python code instead of calling the python runtime in a loop the crash does not occur.

import pkg_resources

for x in range(0, 10000):
    pkg_resources.parse_version("1")

works.

I also wonder why only python3 is affected on my system. Maybe running other python3 setups in docker containers on the helios64 could confirm whether this is due to the userspace setup (or it could be that this particular python3.9 setup stress-tests a specific issue with the kernel or hardware, and another setup would just hide the issue).

@prahal
Collaborator

prahal commented Feb 20, 2023

@blmhemu could you paste the output of /proc/buddyinfo when python3 starts to output invalid free? It seems that python3 does not cope well when an allocation fails and tries to free it even if it was not allocated. That may explain our issue.
I had order-7 page allocation failures in my kernel log (but not always).

What is not clear is why we get these even though all was fine before. The fact is it may be another issue. In the process of debugging this invalid free I turned off my zswap so maybe I produced a page allocation failure with my debug attempts.
Also, it is not that I have no memory left, but that memory is fragmented and there are no higher-order pages left.
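
To make the fragmentation point concrete: each line of /proc/buddyinfo lists, per memory zone, the number of free blocks of order 0, 1, 2, and so on, where an order-n block is 2^n contiguous pages. An order-7 allocation can only succeed if a block of order 7 or higher is free. The sketch below parses that layout; the sample text is hypothetical (not captured from an affected board) and the function names are illustrative.

```python
def parse_buddyinfo(text):
    """Map (node, zone) to a list of free-block counts indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected layout: "Node 0, zone DMA c0 c1 c2 ..." (one count per order).
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(c) for c in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, order):
    """Count free blocks large enough to satisfy an allocation of `order`."""
    return sum(counts[order:])


# Hypothetical sample: plenty of small blocks, nothing of order 7 or above.
SAMPLE = (
    "Node 0, zone      DMA    812    403    120     31"
    "      9      2      1      0      0      0      0\n"
)

if __name__ == "__main__":
    for key, counts in parse_buddyinfo(SAMPLE).items():
        print(key, "free blocks of order >= 7:", free_blocks_at_or_above(counts, 7))
```

On a live system the input would be the contents of /proc/buddyinfo, read at the moment the errors start.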

I tested in a docker container on the helios64 bullseye (still with latest master edge kernel) with Debian bookworm python3.11 in the container and was able to reproduce the invalid free.

@prahal
Collaborator

prahal commented Feb 20, 2023

@blmhemu sorry to bother you again. Do your tests with different releases (jammy, kinetic, etc) all run with a different "current" armbian kernel, or did the build run with the latest kernel (6.1 ?)?
The issue is that python3.9 on my board was not upgraded for months and it was working (a long time ago, but I cannot tell when it broke; still, I doubt it was broken a year ago). So this leaves the libraries below python3.9 or the kernel. Though if you already tested a lot of stable kernels, this makes the kernel an unlikely target and can save time in debugging.
I tried the latest armbian kernel 6.1.12 without armbian patches (but still with helios64 armbian dts, aufs and wifi patches) and I still get the python3 free invalid.

@prahal
Collaborator

prahal commented Feb 21, 2023

I made a mistake and thus u-boot booted my old eMMC install (which I had left untouched since at least July 2022).
The python3 invalid free issue is reproducible there: 5.15.48-rockchip64 #22.05.3. I do not know how I did not notice this ansible python3 issue back then (I already had ansible set up for the helios64, but I had issues with my roles and thus tended to run the playbooks on a targeted system, so maybe I did not try the helios64 for a long time, or at all).

Then I tried on the SD card install (up to date bullseye) with current armbian kernel Linux helios64 5.15.89-rockchip64 #22.11.4 and the issue is also reproducible.

@blmhemu
Author

blmhemu commented Feb 21, 2023

Hey @prahal ! I am currently a bit busy with life and will likely start testing from next week. In the meanwhile, I did install the latest bullseye from this armbian mirror and it works smoothly with ansible and could not see the free issue.

@prahal
Collaborator

prahal commented Feb 22, 2023

@blmhemu I confirm that just installing the linux-image-current-rockchip64 package and its matching linux-dtb-current-rockchip64 package at version 21.08.2, which is 5.10.63-rockchip64, fixes this issue.
I was able to run:

for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done

five times without an issue.

I can even run the test case fine with the latest kernels if I disable cpufreq with the kernel boot parameter cpufreq.off=1.

@prahal
Collaborator

prahal commented Feb 27, 2023

I have been able to reproduce the python3 invalid free with linux-image-legacy-rk3399, that is 4.4.213. I believe I did not encounter the issue before because my ansible setup was previously using the python2 installed on the helios64, not python3.

@blmhemu, I need to retry, but I believe that with cpufreq disabled (cpufreq.off=1 on the linux kernel command line) I was not able to reproduce the python3 invalid free with the latest kernel (and probably current too). If you could confirm, that would help. You can add cpufreq.off=1 to the extraargs= line in /boot/armbianEnv.txt.
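
A minimal sketch of that edit as a pure text transformation, for those who script their setup. It only computes the new file content and does not touch /boot; `add_extraarg` is a hypothetical helper name, and the sample file content in the comments below is assumed, not copied from a real board.

```python
def add_extraarg(env_text, arg):
    """Return env_text with `arg` present on the extraargs= line.

    If no extraargs= line exists, one is appended. Existing arguments are
    kept, and the argument is not added twice.
    """
    lines = env_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("extraargs="):
            current = line[len("extraargs="):].split()
            if arg not in current:
                current.append(arg)
            lines[i] = "extraargs=" + " ".join(current)
            break
    else:
        # No extraargs= line found: add a new one at the end.
        lines.append("extraargs=" + arg)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file content, for illustration only.
    print(add_extraarg("verbosity=1\nextraargs=foo=bar\n", "cpufreq.off=1"), end="")
```

The parameter takes effect on the next boot; `cat /proc/cmdline` shows whether the kernel actually received it.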

Also, I tried the latest 6.1.12 with cpufreq enabled and all armbian patches removed except the helios64 board patch add-board-helios64.patch, board-helios64-remove-pcie-ep-gpios.patch, my emmc hs400es patch to read the emmc in hs400, and rk3399-enable-dwc3-xhci-usb-trb-quirk.patch, and I can still reproduce the issue (I also tried with EXTRAWIFI=no AUFS=no).

@thomas-maurice

Hello !

@prahal I can confirm that cpufreq.off=1 worked on the current bullseye build; I was able to successfully install OMV on my helios64 without any issues.

root@helios64:~# uname -a
Linux helios64 5.15.104-rockchip64 #3 SMP PREEMPT Wed Mar 22 12:31:37 UTC 2023 aarch64 GNU/Linux

I built the image with the ./compile.sh script of the toolchain without any customisation.

@blmhemu
Author

blmhemu commented Jul 6, 2023

Also, I tried the latest 6.1.12 with cpufreq enabled and all armbian patches removed except the helios64 board patch add-board-helios64.patch, board-helios64-remove-pcie-ep-gpios.patch, my emmc hs400es patch to read the emmc in hs400, and rk3399-enable-dwc3-xhci-usb-trb-quirk.patch, and I can still reproduce the issue (I also tried with EXTRAWIFI=no AUFS=no).

Does this mean it could be a bug in the upstream kernel?

@blmhemu
Author

blmhemu commented Jul 27, 2023

Could you try running this command at least six times before upgrading the u-boot?

Done - still stable

Updated the uboot (ddrbin now shows 1.25) - see https://pastebin.mozilla.org/M1XXJnLn
Could not repro the error ! (Ran 6 times + multiple ansible runs)

@prahal
Collaborator

prahal commented Jul 28, 2023

@blmhemu then I believe when you did the latest bullseye install somehow you modified the installed bootloader.
I believe you do not have a log of the u-boot output from when you had the invalid free issue, else could you give it?

Probably the 21st of February 2023, when you told us you got the issue fixed (sorry, I forgot you already had a fixed setup; I thought you were still suffering from this issue):

In the meanwhile, I did install the latest bullseye from this armbian mirror and it works smoothly with ansible and could not see the free issue.

So likely the rockchip DDR blob 1.24 is fine too.

@blmhemu
Author

blmhemu commented Jul 28, 2023

I believe you do not have a log of the u-boot output from when you had the invalid free issue, else could you give it?

Unfortunately, I do not have those logs :(

@blmhemu
Author

blmhemu commented Jul 29, 2023

@prahal
Update: I was able to flash the recently compiled build (unstable with free issue) - Here are the logs you asked for https://pastebin.com/zGjvrvux

I diffed both the logs and here are my findings

  • The unstable build logs do not output any sort of ddrbin logs
  • The RAM frequencies differ between the two logs

The unstable build

lpddr4_set_rate: change freq to 400000000 mhz 0, 1	
lpddr4_set_rate: change freq to 800000000 mhz 1, 0	
Trying to boot from BOOTROM	
Returning to boot ROM...

vs

The stable build

ddr_set_rate to 328MHZ
ddr_set_rate to 666MHZ
ddr_set_rate to 928MHZ
channel 0, cs 0, advanced training done
channel 1, cs 0, advanced training done
ddr_set_rate to 416MHZ, ctl_index 0
ddr_set_rate to 856MHZ, ctl_index 1
support 416 856 328 666 928 MHz, current 856MHz

Link to diff https://www.diffchecker.com/3D0UDOHx/
Left is unstable. Right is stable.
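
A small Python sketch for pulling the DDR rates out of both log styles so they can be compared side by side. The two regular expressions are written against the lines quoted above only. Treating the `lpddr4_set_rate` value as Hz (despite its "mhz" label) is an assumption based on its magnitude.

```python
import re

# Patterns for the two log styles quoted above.
UBOOT_TPL = re.compile(r"lpddr4_set_rate: change freq to (\d+) mhz")
DDR_BLOB = re.compile(r"ddr_set_rate to (\d+)MHZ")


def ddr_rates_mhz(log_text):
    """Return the DDR rates (in MHz) in the order they appear in the log."""
    rates = []
    for line in log_text.splitlines():
        m = UBOOT_TPL.search(line)
        if m:
            # Assumption: the number is in Hz even though the label says "mhz".
            rates.append(int(m.group(1)) // 1_000_000)
            continue
        m = DDR_BLOB.search(line)
        if m:
            rates.append(int(m.group(1)))
    return rates


UNSTABLE = """lpddr4_set_rate: change freq to 400000000 mhz 0, 1
lpddr4_set_rate: change freq to 800000000 mhz 1, 0
"""
STABLE = """ddr_set_rate to 328MHZ
ddr_set_rate to 666MHZ
ddr_set_rate to 928MHZ
ddr_set_rate to 416MHZ, ctl_index 0
ddr_set_rate to 856MHZ, ctl_index 1
"""

if __name__ == "__main__":
    print("unstable:", ddr_rates_mhz(UNSTABLE))
    print("stable:  ", ddr_rates_mhz(STABLE))
```

With the excerpts above, the unstable build only steps through 400 and 800 MHz, while the stable build trains at 328, 666 and 928 MHz before settling on 416 and 856 MHz.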

@blmhemu
Author

blmhemu commented Jul 29, 2023

UPDATE (Again):

Setting BOOT_SCENARIO=tpl-blob-atf-mainline in config/board/helios64.csc and:

I have compiled armbian with the above option and flashed it.
I could NOT reproduce the free issue now. 🥳 🥳 🥳 🥳 I also see the ddrbin logs in the serial console.

Maybe we found the root cause? (u-boot TPL)
Link to serial console logs - https://pastebin.com/zvhsyF2R

Observations

@blmhemu
Author

blmhemu commented Jul 30, 2023

UPDATE 3:
I was able to boot fedora !!! Using the above u-boot and idbloader and following the steps at https://fedoraproject.org/wiki/Architectures/ARM/Installation

Ran the python loop for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done and no free error.

Observations

  • In fedora the fan is quite loud and always runs on full capacity. This is fixable by following the steps at https://wiki.kobol.io/helios64/pwm/
  • Not sure how to transfer the installed OS onto the nvme / sata. Any suggestions would be appreciated.

@prahal
Collaborator

prahal commented Aug 2, 2023

@blmhemu about the DDR frequencies, I added the ddrbin frequencies to the blob-less u-boot (keeping all other DDR parameters the same, which is probably not fine) and forced them. Still the same issue (I should post the hack so that others can reproduce it, but I am away for a few weeks).
Changing the frequencies is not enough. At the least, one has to tweak the DDR parameters in the u-boot lpddr4 inc files. But those are pretty cryptic to me.
And it seems we do not have these parameters from rockchip (I believe they are in the DDR binary blob). Or maybe it is just that the DDR blob does two trainings, one before setting the frequency to 416MHz and one after, with an added advanced training afterwards.

Mind that v2023.04 has a fix to do the training at 400MHz instead of 50MHz, but this did not help with our issue.

About the SATA/nvme, maybe look on the Kobol wiki, probably in the comments; I am confident this was answered. I also made a u-boot 2023.04 build that seems to have pretty good support for SATA, but it requires migrating to the new APIs (bootflow, bootdev, bootmeth). I want to spend time sharing this hack of a build, but it turned out it did not help with the DDR stability issue, so it became lower priority. I believe instructions to achieve SATA boot are already available in the Kobol wiki, though. If not, tell me and I will try to share my u-boot v2023.04 build for Helios64. Mind that this build has an issue where it can boot loop in u-boot (I managed to stop the loop but did not investigate the cause yet), so it is pretty experimental. And u-boot v2022.10, I believe, was the version that was not buildable, as it had partially migrated to a binman-based build while still being half Makefile based, so I was not able to build both the idbloader.img and u-boot.itb binaries. All in all, I attempted those builds to try the new DDR-related fixes in these versions, which ended up not being related to this invalid free bug.
Either way you have to keep u-boot on the eMMC and set it up to boot the kernel from SATA (mind that the M.2 slot on the Helios64 is SATA, not PCIe).

About the eth0 error "Net: dw_dm_mdio_init", I always thought it had always been so. I will take a look at whether I can get this working, but not ASAP I believe (unless it turns out to be an easy catch).
Do you know with which u-boot was it working?

@blmhemu
Author

blmhemu commented Aug 3, 2023

Do you know with which u-boot was it working?

2020.10 - https://pastebin.com/MmtpS7F9

@blmhemu
Author

blmhemu commented Aug 3, 2023

dw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busNo ethernet found.

I have upgraded the system (apt update && apt upgrade) and could not boot now.

@prahal
Collaborator

prahal commented Aug 3, 2023

dw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busNo ethernet found.

I have upgraded the system (apt update && apt upgrade) and could not boot now.

Do you mean u-boot loads the kernel and then nothing happens, or is there an error on the serial console?

By the way, this looks like another issue, and to avoid this thread becoming unreadable I guess it requires a thread of its own on the armbian forum. Feel free to tag me in your forum thread so I get a notification by email.

@blmhemu
Author

blmhemu commented Aug 15, 2023

dw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busdw_dm_mdio_init: mdio node is missing, registering legacy mdio busNo ethernet found.

I have upgraded the system (apt update && apt upgrade) and could not boot now.

If anyone encounters this - this is due to the armbian provided linux-libc-dev (use the debian one instead by giving a lower priority to the armbian repo)

@prahal
Collaborator

prahal commented Aug 19, 2023

I have upgraded the system (apt update && apt upgrade) and could not boot now.

If anyone encounters this - this is due to the armbian provided linux-libc-dev (use the debian one instead by giving a lower priority to the armbian repo)

Thanks for the follow-up and workaround. Feel free to open another bug report to track this issue!
It might even be an armbian rockchip64 family issue recently introduced.

I am currently trying a few ideas, as I was able to reproduce the raid10 resync kernel crash on the helios64 that I have randomly had since I received the unit. I will try to sort out which of the ideas are useless against this issue. I even had hints that it could be related to the firmware of the HDDs behind the SATA/PCIe bridge (rk3399 PCIe is known to have bugs, but I suspect that at the very least it is not the known issue which affects PCIe devices that are slow to enumerate). Or it could be another DDR memory corruption that the mdadm raid10 resync stresses, with that resync being the only test case that reliably reproduces its crashes.
At least my old issue predates our current one, which requires the rockchip DDR blob to avoid memory corruption from python3, as the initial u-boot from kobol already had this rockchip DDR blob.
I believe I could work around this crasher, but I would really like to sort out the cause of this issue (even if it is hardware related). In the meantime, I cannot boot my helios64.

@snakekick

snakekick commented Aug 21, 2023

Hi there,
I have the same (free(): invalid pointer) problem.
I noticed this after upgrading my helios64 from debian 11 to 12.
I also have kernel problems when I run snapraid sync.

This problem is solved when I add cpufreq.off=1, but then the cpu is really slow.
Is it possible to share the new, working armbian-u-boot dpkg?
thanks
my current uboot log
https://pastebin.pl/view/80f4c9e7

@jmue
Collaborator

jmue commented Aug 21, 2023

@prahal : Is there anything wrong with opening a pull request until a better solution is found?

@snakekick

snakekick commented Aug 22, 2023

@prahal Thank you! your fix solved my helios64 problem.
Best of all, I can now run my Helios at full speed 400>1800MHz on demand.
This was not possible before and it looks very stable (which I can say 12h later).
But after installing your fix, I am able to run
for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done
6 or more times and start a snapraid sync that crashed before.
Thank you very much.
//edit :

Rejoiced too soon!


kernel:[47341.023705] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP

Message from syslogd@helios64 at Aug 22 10:59:53 ...
kernel:[47341.023705] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP

Message from syslogd@helios64 at Aug 22 10:59:53 ...
kernel:[47341.045273] Code: aa1c03e0 93407c62 2a0803e1 9400819e (a9408261)

@prahal
Collaborator

prahal commented Sep 5, 2023

@snakekick indeed, the kernel crashes are not fixed.
I do not know if I can sort out this instability issue on my side.
I have not worked much on it for a month due to life, and probably will not for a while.

It seems to me this is memory corruption. It affects random kernel code.

I have left the helios64 down for a while because I have a way to reproduce the kernel corruption quickly: booting when the raid10 has had a bad crash and is resyncing at boot.

It might still be memory related.
I was even able to reproduce the crash with cpufreq turned off, though far less often. It seems cpufreq raises the risk of triggering the bug but is not the cause.

We should discuss this matter in the forum, though, as I believe it is not the same as the current issue, for which we have a workaround.
I also believe we should push this fix to the armbian repo, but I cannot do it right away.

@d3473r

d3473r commented Sep 25, 2023

Hi @prahal and @snakekick, I'm also running a Helios64 which freezes randomly every few days now :/
Can you please explain how to compile and flash the fixed uboot?

@bcecchinato

@d3473r I've downgraded the bootloader as well, and so far no more freezes. Here is the way to do it:

cd /tmp
wget --content-disposition https://imola.armbian.com/apt/pool/main/l/linux-u-boot-helios64-edge/linux-u-boot-edge-helios64_22.02.1_arm64.deb
dpkg -x linux-u-boot-edge-helios64_22.02.1_arm64.deb linux-u-boot-edge-helios64_22.02.1_arm64/
vi /usr/lib/u-boot/platform_install.sh

In /usr/lib/u-boot/platform_install.sh, copy the first line, comment out the original, and point the copy at the newly extracted directory:

#DIR=/usr/lib/linux-u-boot-current-helios64
DIR=/tmp/linux-u-boot-edge-helios64_22.02.1_arm64/usr/lib/linux-u-boot-edge-helios64_22.02.1_arm64

Then launch armbian-install to update the bootloader.

I strongly suggest dumping your current bootloader first, just in case: dd if=/dev/mmcblk0 of=bootloader-backup.img bs=512 count=65535. If the update fails, you might brick your device.
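
A backup is only useful if it matches what is on the device, so a cautious variant of the dd command above could verify the copy right away. This is a sketch under assumptions: `backup_boot_area` is a hypothetical helper name, the device path has to be adapted to the actual boot medium, and `cmp -n` is assumed to be available (it is in GNU diffutils).

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): copy the first 65535 sectors of
# 512 bytes from a device and check that the copy matches the source.
# Both arguments are assumptions to adapt, for example /dev/mmcblk1 instead
# of /dev/mmcblk0 depending on the boot medium.
backup_boot_area() {
    src=$1
    out=$2
    dd if="$src" of="$out" bs=512 count=65535 2>/dev/null || return 1
    # Compare only the region that was copied (65535 * 512 bytes).
    cmp -n $((65535 * 512)) "$src" "$out"
}

# Usage (as root; adjust the device name first):
# backup_boot_area /dev/mmcblk0 bootloader-backup.img && echo "backup verified"
```

The function returns non-zero if either the copy or the comparison fails, so it can gate the later armbian-install step.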

@d3473r

d3473r commented Sep 29, 2023

Hi @bcecchinato, I installed the bootloader with your instructions. /dev/mmcblk0 didn't exist on my machine, so I dumped /dev/mmcblk1.

It ran for a while after a reboot but eventually froze again after a few hours :(

@bcecchinato

@d3473r yep, it crashed on my side this morning as well :( Indeed, the /dev/mmcblk device number changes depending on which storage you boot from (eMMC or SD card).

I'm now trying another bootloader: wget --content-disposition https://imola.armbian.com/apt/pool/main/l/linux-u-boot-helios64-current/linux-u-boot-current-helios64_21.08.9_arm64.deb

Since I don't really know what this changes, maybe this attempt is completely useless :D Unfortunately I'm not an expert with armbian, bootloaders, etc. I can run some tests if other users in this thread want, however.

@d3473r

d3473r commented Sep 29, 2023

Are you running Debian 11 or 12?
I'm on Debian 11 with OMV 6

@bcecchinato

I'm on Debian 12. The free issue started with bookworm; I had no issues with Bullseye and the latest bootloader (but I can't say whether it was up to date or not).

@prahal
Collaborator

prahal commented Oct 6, 2023

@d3473r the free issue is not the same as the freeze one. What I mean is that you can fix the free issue but still have the freezes as I do.

Still, I am chasing the freeze issue too. I have had the helios64 down for weeks, since it is in a state where I can reproduce the freeze: the raid10 resyncing at boot.

I would like to have a bug report to centralize the freeze issue reports. As of now, they are scattered in various threads on the Armbian forum. Maybe you could open a new one there and give the link here?

Note that I have had freezes since I got the helios64.
I have not changed my setup much since then (raid10 with WD Red drives).
Could you elaborate on when this started for you?

At one point I blamed the rk3399 PCIe, but I am unsure now. Mind that my raid10 stresses both the PCIe in the SoC and the SATA controller.
Or it could be memory timings.
It is still not diagnosed, but I am learning in the process (like the role of the ATF firmware).

So having details of other setups could help, especially of the setups that were or are working.

@bcecchinato

@prahal I don't know if the free error and the freezes are related, but since my downgrade of the bootloader to 21.08.9 I haven't encountered either the free error or the freezes. I'm running from a uSD card, with the bootloader installed on the uSD card as well (the eMMC is completely blank; I've written zeroes to it with dd to be sure not to boot from it).

My case is a bit different: I had a uSD card with Debian bullseye, and made a fresh install on a second uSD card with bookworm. The troubles started from that point. I haven't erased the old card, so I can diff the two in case that helps. Both cards have the same bootloader version installed (23.08.1), but I can't say whether the bootloader actually written to the old uSD is 23.08.1 or 21.08.9.

I wish I could help more, but like you, I've no skills on bootloaders :(

The only sure thing is : bookworm with 21.08.9 bootloader is working fine and has no free issues at all.

@d3473r

d3473r commented Oct 6, 2023

@prahal I have a pretty good idea of when the freezes started, but not why.
I have to investigate the system log.

My helios64 is used as a Time Machine backup target, and the backups have been failing since the beginning of September.
These backups ran for over a year (since August 2022) without any freeze.

I'm certain about this as the root filesystem is encrypted and any freeze or reboot would have forced me to unlock the root fs via ssh to boot the helios64 up again.

So my guess is: I updated something at the beginning of September (presumably kernel updates; I have not done a dist-upgrade), and the freezes have been occurring since then.

@prahal
Collaborator

prahal commented Oct 6, 2023

@d3473r you have a history of the upgrades in /var/log/apt/history.log<.n.gz>.
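
One possible way to skim those logs, including the rotated and compressed ones, is sketched below. `apt_history_summary` is a hypothetical helper name, and the field names it matches (Start-Date, Install, Upgrade, Remove) are assumed to be the ones apt writes to its history log.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the date and package lines
# of each apt transaction from the current and rotated history logs.
# The log directory is a parameter so the function can be pointed at a copy.
apt_history_summary() {
    dir=${1:-/var/log/apt}
    # zgrep reads both plain and gzip-compressed files.
    zgrep -h -E '^(Start-Date|Install|Upgrade|Remove):' "$dir"/history.log*
}

# Usage:
# apt_history_summary | less
```

Each Upgrade line lists the old and new version of every package in that transaction, which is where the last working version can be read off.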

Note that knowing the previous working versions is even more interesting than the new broken one.

Also, it could be the new version is only more efficient and thus stresses the hardware more (or even enables a new hardware component).

When you say they ran for over a year without a freeze, do you mean there were freezes before that? Were they rare at that time?

I bet you never upgraded the bootloader before you did recently. Do you know from which image you installed the EMMC or SD card initially? One might be able to guess the older bootloader from that.
If you have a log of the previous u-boot output on boot that would tell but it is unlikely you have one stored.

Also, it could be that the load on the hardware changed over time, and even without any upgrade you would have ended up with this freeze.
Could you tell me your storage layout (FS, LUKS, raid or not, which raid level, brand and model of the hard drives, and maybe the smartctl output for them, i.e. the firmware version; probably smartctl -a /dev/sd<x> for each drive)?

Also, do you get small static discharges when touching the helios64 enclosure? I am pretty sure this is unrelated nowadays, but who knows (I get them when my helios64 power adapter is close to my UPS and a set of other chargers; I have not sorted out which one yet).
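
To collect the drive details requested above without pasting the full smartctl output, something like the following filter might do. It is a sketch: `drive_identity` is a hypothetical helper name, and the field labels are assumed to be the ones `smartctl -i` prints for ATA drives.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): keep only the model, serial and
# firmware lines from `smartctl -i` output read on standard input.
# The labels below are an assumption about how ATA drives are reported.
drive_identity() {
    grep -E '^(Model Family|Device Model|Serial Number|Firmware Version):'
}

# Usage (as root), once per drive:
# for dev in /dev/sd?; do echo "== $dev"; smartctl -i "$dev" | drive_identity; done
```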

prahal added a commit to prahal/build that referenced this issue Dec 20, 2023
On the Helios64, random memory errors happen when using the
U-Boot DDR initialization code for rk3399.
Switching to the rkbin rk33 933MHz v1.25 allows this testcase to
run more than once without a memory error:
for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done

Could be LPDDR4 specific.

Workaround armbian#4761
"free() invalid pointer".
igorpecovnik pushed a commit that referenced this issue Dec 24, 2023
On the Helios64, random memory errors happen when using the
U-Boot DDR initialization code for rk3399.
Switching to the rkbin rk33 933MHz v1.25 allows this testcase to
run more than once without a memory error:
for i in $(seq 1 100);do python3 -c "import pkg_resources" || break;done

Could be LPDDR4 specific.

Workaround #4761
"free() invalid pointer".
@d3473r

d3473r commented Feb 8, 2024

Hi @prahal, I did a complete new installation with kernel Linux helios64 6.1.63-current-rockchip64 after this fix: #6066
The helios64 NAS has now been stable for one straight week, so this issue seems fixed :)
Thank you so much
