
kola: kdump.crash failure on kola-aws (for aarch64) and upstream CI for coreos-installer/afterburn repos #1075

Closed
gursewak1997 opened this issue Jan 23, 2022 · 5 comments


gursewak1997 commented Jan 23, 2022

Bug
Kola test failure: kdump.crash fails when enabled in upstream CI for the coreos-installer and afterburn repos.
The ext.config.kdump.crash test also fails for the aarch64 architecture on AWS.

Expected behavior
The kola test should pass on aarch64 and in the upstream repos' CI, given that it passes for other repos such as coreos-assembler.

Test failure
So far, the test has only been observed to fail in the coreos-installer and afterburn repos.
kdump.crash failure in coreos-installer: Jenkins job
aarch64 failure: multiarch pipeline
Error:

00:25:29.225  Jan 14 01:36:34 qemu0 systemd[1]: Started kola-runext.service.
00:25:29.225  Jan 14 01:36:34 qemu0 kola-runext-crash[1403]: + case "${AUTOPKGTEST_REBOOT_MARK:-}" in
00:25:29.225  Jan 14 01:36:34 qemu0 kola-runext-crash[1404]: ++ find /var/crash -type f -name vmcore
00:25:29.225  Jan 14 01:36:34 qemu0 kola-runext-crash[1403]: + kcore=
00:25:29.225  Jan 14 01:36:34 qemu0 kola-runext-crash[1403]: + test -z ''
00:25:29.225  Jan 14 01:36:34 qemu0 kola-runext-crash[1403]: + fatal 'No kcore found in /var/crash'
00:25:29.225  Jan 14 01:36:34 qemu0 kola-runext-crash[1403]: + echo 'No kcore found in /var/crash'
00:25:29.225  Jan 14 01:36:34 qemu0 kola-runext-crash[1403]: No kcore found in /var/crash
00:25:29.225  Jan 14 01:36:34 qemu0 kola-runext-crash[1403]: + exit 1
00:25:29.225  Jan 14 01:36:34 qemu0 systemd[1]: kola-runext.service: Main process exited, code=exited, status=1/FAILURE
00:25:29.225  Jan 14 01:36:34 qemu0 systemd[1]: kola-runext.service: Failed with result 'exit-code'. 
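
This corresponds to the tail end of the external test script: after the kdump-triggered reboot it looks for a captured vmcore under /var/crash and fails if none is found. A minimal sketch of that check, reconstructed from the trace above (not the literal test source, which also contains the pre-crash setup):

# Post-reboot verification step, as suggested by the trace above (sketch only).
fatal() { echo "$@"; exit 1; }
kcore=$(find /var/crash -type f -name vmcore)
if test -z "${kcore}"; then
    fatal "No kcore found in /var/crash"
fi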

kola-aws failure for kdump.crash:

[2022-04-20T22:24:17.697Z] --- FAIL: ext.config.kdump.crash (1388.73s)
[2022-04-20T22:24:17.697Z]         harness.go:958: kolet failed: : Waiting for reboot: machine "i-05a1c99faec3e27f1" failed to start: ssh journalctl failed: time limit exceeded
[2022-04-20T22:24:17.697Z]         harness.go:103: TIMEOUT[10m0s]: ssh: journalctl -t kola-runext-crash
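
For local debugging, the same test can be run outside of the pipelines, roughly as follows (flag names from memory, so double-check kola run --help; the AMI ID and instance type below are placeholders):

# Against a locally built image, from a coreos-assembler workdir:
cosa kola run ext.config.kdump.crash
# Against AWS, pointing kola at an uploaded aarch64 AMI (placeholders):
kola run -p aws --aws-ami <ami-id> --aws-type m6g.large ext.config.kdump.crash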

For now, we have disabled/skipped the kdump.crash kola test in upstream CI for the repositories above. The test is also disabled for aarch64.

Test disabling PRs:
coreos/coreos-installer#750
coreos/afterburn#686

coreos/fedora-coreos-config#1450

gursewak1997 added a commit to gursewak1997/afterburn that referenced this issue Jan 23, 2022
gursewak1997 added a commit to gursewak1997/coreos-installer that referenced this issue Jan 23, 2022
gursewak1997 changed the title from "kola: kdump.crash investigate failure in upstream CI for coreos-installer and afterburn repos" to "kola: kdump.crash failure on aarch64 and upstream CI for coreos-installer/afterburn repos" on Jan 26, 2022
@dustymabe

I reproduced this on aarch64 by doing a cosa run and manually running through the steps in the test (sketched below, after the log). At the very end, /var/crash/ contains two files:

# ls -lh /var/crash/127.0.0.1-2022-01-27-03\:03\:14/
total 68K
-rw-------. 1 root root 68K Jan 27 03:03 kexec-dmesg.log
-rw-------. 1 root root   0 Jan 27 03:03 vmcore-dmesg-incomplete.txt

The vmcore-dmesg-incomplete.txt is empty, and the end of kexec-dmesg.log looks like:

Jan 27 03:03:14 localhost systemd[1]: Starting Kdump Vmcore Save Service...
Jan 27 03:03:14 localhost kdump[527]: Kdump is using the default log level(3).
Jan 27 03:03:15 localhost kdump[562]: saving to /sysroot/ostree/deploy/fedora-coreos/var/crash/127.0.0.1-2022-01-27-03:03:14/
Jan 27 03:03:15 localhost kdump[565]: saving vmcore-dmesg.txt to /sysroot/ostree/deploy/fedora-coreos/var/crash/127.0.0.1-2022-01-27-03:03:14/
Jan 27 03:03:15 localhost kdump.sh[566]: No program header covering vaddr 0x4f434d5600000000found kexec bug?
Jan 27 03:03:15 localhost kdump[569]: saving vmcore-dmesg.txt failed
Jan 27 03:03:15 localhost kdump[571]: saving vmcore
Jan 27 03:03:15 localhost kdump.sh[572]: readpage_elf: Attempt to read non-existent page at 0x0.
Jan 27 03:03:15 localhost kdump.sh[572]: readmem: type_addr: 1, addr:e48, size:8
Jan 27 03:03:15 localhost kdump.sh[572]: vaddr_to_paddr_arm64: Can't read pmd
Jan 27 03:03:15 localhost kdump.sh[572]: readmem: Can't convert a virtual address(ffffaa1af9341728) to physical address.
Jan 27 03:03:15 localhost kdump.sh[572]: readmem: type_addr: 0, addr:ffffaa1af9341728, size:390
Jan 27 03:03:15 localhost kdump.sh[572]: check_release: Can't get the address of system_utsname.
Jan 27 03:03:15 localhost kdump.sh[572]: makedumpfile Failed.
Jan 27 03:03:15 localhost kdump[574]: saving vmcore failed, _exitcode:1
Jan 27 03:03:15 localhost kdump[576]: saving the /run/initramfs/kexec-dmesg.log to /sysroot/ostree/deploy/fedora-coreos/var/crash/127.0.0.1-2022-01-27-03:03:14//
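
For anyone else who wants to walk through it by hand, the in-guest portion is essentially the standard kdump flow (a sketch of the usual steps, not the literal test script; the crashkernel size is just an example value):

# Reserve crash memory and enable kdump, then reboot so the reservation takes effect.
sudo rpm-ostree kargs --append='crashkernel=300M'
sudo systemctl enable kdump.service
sudo systemctl reboot
# After the reboot, force a kernel panic; kdump should capture a vmcore.
echo 1 | sudo tee /proc/sys/kernel/sysrq
echo c | sudo tee /proc/sysrq-trigger
# After the kdump-triggered reboot, inspect what was saved.
ls -lh /var/crash/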

@dustymabe

Tried it with the very latest kexec-tools in rawhide and I'm still seeing a failure, but a different one. Opened a bug for that: https://bugzilla.redhat.com/show_bug.cgi?id=2046617

@dustymabe

We're also seeing a failure of kdump.crash on ppc64le, which needs to be investigated.

@dustymabe

> Tried it with the very latest kexec-tools in rawhide and I'm still seeing a failure, but a different one. Opened a bug for that: https://bugzilla.redhat.com/show_bug.cgi?id=2046617

According to the BZ there are some reports of success in F36. Maybe we should try this again on the next-devel stream and rawhide.
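
One way to re-check would be a scratch build from the next-devel branch in a coreos-assembler environment, roughly (commands from memory; adjust as needed):

cosa init --branch next-devel https://github.com/coreos/fedora-coreos-config
cosa fetch && cosa build
cosa kola run ext.config.kdump.crash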

gursewak1997 changed the title from "kola: kdump.crash failure on aarch64 and upstream CI for coreos-installer/afterburn repos" to "kola: kdump.crash failure on kola-aws (for aarch64) and upstream CI for coreos-installer/afterburn repos" on Apr 26, 2022
@dustymabe

Since aarch64 was re-enabled but aarch64 AWS instances are still failing the test, I decided to open a new issue with more specific details on that particular failure:

#1187
