
CI: LVH VM provisioning failed (kex_exchange_identification: read: Connection reset by peer) #26012

Closed
giorio94 opened this issue Jun 8, 2023 · 19 comments
Labels: area/CI (Continuous Integration testing issue or flake), ci/flake (a known failure that occurs in the tree), stale (the stale bot thinks this issue is old)

Comments

@giorio94 (Member) commented Jun 8, 2023

CI failure

Hit a few times, for instance: https://github.com/cilium/cilium/actions/runs/5200516663/jobs/9379487447

Run ssh -p 2222 -o "StrictHostKeyChecking=no" root@localhost << EOF
Pseudo-terminal will not be allocated because stdin is not a terminal.
kex_exchange_identification: read: Connection reset by peer
Connection reset by 127.0.0.1 port 2222

My guess is that this is due to a race condition, with the "Set DNS resolver" step attempting to connect over SSH before the SSH server in the VM has fully started.
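
To make the race concrete, here is a minimal readiness-probe sketch (hypothetical, not what the workflow actually runs): with forwarded ports, the host-side TCP connection on 2222 can be accepted before the guest's sshd is listening and then gets reset during key exchange, which matches the kex_exchange_identification error above. Retrying the SSH command itself until it succeeds avoids that window:

```bash
# Hypothetical readiness probe: retry the SSH command itself, since the
# forwarded port may accept TCP connections before the guest's sshd is up.
for _ in $(seq 1 30); do
  if ssh -p 2222 -o StrictHostKeyChecking=no -o ConnectTimeout=5 \
      root@localhost true 2>/dev/null; then
    echo "sshd is reachable"
    break
  fi
  sleep 1
done
```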

giorio94 added the area/CI and ci/flake labels on Jun 8, 2023
@giorio94 (Member Author)

Hit another couple of times: https://github.com/cilium/cilium/actions/runs/5240409273/attempts/1

@dhawton (Contributor) commented Jun 13, 2023

I have submitted a PR to LVH that should prevent the race condition with SSH server startup: cilium/little-vm-helper#75

@mhofstetter (Member)

Merging #26425 (updating the LVH GitHub Action to version 0.0.7) should have fixed this issue.

LVH 0.0.7 corresponds to cilium/little-vm-helper#77, which includes the actual fix, cilium/little-vm-helper#75.

@mhofstetter (Member) commented Jun 27, 2023

Update: it looks like there are cases where retrying 5 times isn't enough:

kex_exchange_identification: read: Connection reset by peer
Connection reset by 127.0.0.1 port 2222
kex_exchange_identification: read: Connection reset by peer
Connection reset by 127.0.0.1 port 2222
kex_exchange_identification: read: Connection reset by peer
Connection reset by 127.0.0.1 port 2222
kex_exchange_identification: read: Connection reset by peer
Connection reset by 127.0.0.1 port 2222
kex_exchange_identification: read: Connection reset by peer
Connection reset by 127.0.0.1 port 2222
Error: Process completed with exit code 42.

-> Going to re-open. Either there is an actual issue preventing SSH connections, or the total wait time of 5s (5 retries * 1s wait) is too short.

example: https://github.com/cilium/cilium/actions/runs/5383703702/jobs/9770799136
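
For reference, a hedged sketch of what a longer, configurable wait could look like (MAX_WAIT and SLEEP are illustrative names, not options of the actual action; exit code 42 mirrors the failure above):

```bash
# Illustrative: retry until an overall deadline rather than a fixed
# 5 attempts x 1s, which gives slow-booting VMs more headroom.
MAX_WAIT="${MAX_WAIT:-300}"  # seconds
SLEEP="${SLEEP:-5}"
start=$SECONDS
until ssh -p 2222 -o StrictHostKeyChecking=no root@localhost true; do
  if (( SECONDS - start >= MAX_WAIT )); then
    echo "VM not reachable via SSH within ${MAX_WAIT}s" >&2
    exit 42
  fi
  sleep "$SLEEP"
done
```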

mhofstetter reopened this on Jun 27, 2023
learnitall self-assigned this on Jul 5, 2023
@gandro (Member) commented Jul 6, 2023

Hit again in #26662

@learnitall (Contributor)

I'm working on a PR to LVH to allow dmesg logs to be exported to a file, which we can then tail to get SSH status and collect more information on what is going on.

learnitall added a commit to learnitall/little-vm-helper that referenced this issue Jul 12, 2023
This commit updates the GHA action to watch for sshd status within the VM's console log file. Success and failure messages for sshd are continually searched for in the console log file over the course of 30s. If a failure is observed, the step fails. If success is observed, the step moves on to attempting to SSH into the VM. Continuing to check whether SSH access is available is important, because that is ultimately the functionality needed to proceed; parsing the console log file is an added step to help gain insight into what is going on in the VM.

Regex match-alls are used for the whitespace between keywords when searching, as console color characters may be printed in between. Additionally, '\.service' is not appended to the end of 'ssh' in these search strings, as this part of the string sometimes gets truncated in the log file.

Ref: cilium/cilium#26012

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
brb pushed a commit to cilium/little-vm-helper that referenced this issue on Jul 17, 2023 (same commit, after several force-pushes between Jul 12 and Jul 14).
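
In shell terms, the search described in the commit message boils down to whitespace-tolerant patterns over the exported console log; roughly (the exact marker strings are assumptions inferred from the commit message, not copied from the action):

```bash
CONSOLE_LOG="console.log"  # hypothetical path to the exported VM console log

# '.*' between keywords tolerates interleaved console color characters;
# '\.service' is deliberately not appended, as it may be truncated in the log.
if grep -qE 'Failed.*to.*start.*ssh' "$CONSOLE_LOG"; then
  echo "sshd failed to start" >&2
  exit 1
elif grep -qE 'Started.*ssh' "$CONSOLE_LOG"; then
  echo "sshd reported started; proceeding to the SSH attempt"
fi
```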
@mhofstetter (Member) commented Jul 18, 2023

> I'm working on a PR to LVH to allow dmesg logs to be exported to a file, which we can then tail to get SSH status and collect more information on what is going on.

This has been implemented with cilium/little-vm-helper#89 and integrated with #26819 (currently on main only; the other Renovate PRs are pending).

Thanks @learnitall

Please continue reporting detected failures related to this issue so we can analyze the logs 🙏

@gandro (Member) commented Jul 19, 2023

Observed here again: https://github.com/cilium/cilium/actions/runs/5589813647/jobs/10218698894

dmesg output:

2023-07-18T16:22:33.8832181Z kex_exchange_identification: read: Connection reset by peer
2023-07-18T16:22:33.8832741Z Connection reset by 127.0.0.1 port 2222
2023-07-18T16:23:49.8198874Z kex_exchange_identification: read: Connection reset by peer
2023-07-18T16:23:49.8199300Z Connection reset by 127.0.0.1 port 2222
2023-07-18T16:25:06.1304607Z kex_exchange_identification: read: Connection reset by peer
2023-07-18T16:25:06.1305143Z Connection reset by 127.0.0.1 port 2222
2023-07-18T16:26:22.0658190Z kex_exchange_identification: read: Connection reset by peer
2023-07-18T16:26:22.0658759Z Connection reset by 127.0.0.1 port 2222
2023-07-18T16:27:38.0014735Z kex_exchange_identification: read: Connection reset by peer
2023-07-18T16:27:38.0015225Z Connection reset by 127.0.0.1 port 2222
2023-07-18T16:27:39.0035247Z [    0.000000] Linux version 5.4.243 (root@buildkitsandbox) (gcc version 12.2.0 (Ubuntu 12.2.0-17ubuntu1)) #1 SMP Fri May 26 10:56:55 UTC 2023
2023-07-18T16:27:39.0035905Z [    0.000000] Command line: BOOT_IMAGE=/vmlinuz initrd=initrd.img root=/dev/sda rw console=ttyS0
2023-07-18T16:27:39.0036634Z [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
2023-07-18T16:27:39.0037296Z [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
2023-07-18T16:27:39.0037888Z [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
2023-07-18T16:27:39.0038344Z [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
2023-07-18T16:27:39.0039016Z [    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
2023-07-18T16:27:39.0039606Z [    0.000000] BIOS-provided physical RAM map:
2023-07-18T16:27:39.0040188Z [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
2023-07-18T16:27:39.0040798Z [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
2023-07-18T16:27:39.0041369Z [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
2023-07-18T16:27:39.0041939Z [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
2023-07-18T16:27:39.0042912Z [    0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
2023-07-18T16:27:39.0043885Z [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
2023-07-18T16:27:39.0044483Z [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
2023-07-18T16:27:39.0045064Z [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000033fffffff] usable
2023-07-18T16:27:39.0045489Z [    0.000000] NX (Execute Disable) protection: active
2023-07-18T16:27:39.0045903Z [    0.000000] SMBIOS 2.8 present.
2023-07-18T16:27:39.0046463Z [    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
2023-07-18T16:27:39.0046930Z [    0.000000] Hypervisor detected: KVM
2023-07-18T16:27:39.0047501Z [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
2023-07-18T16:27:39.0048042Z [    0.000000] kvm-clock: cpu 0, msr 1fd566001, primary cpu clock
2023-07-18T16:27:39.0048590Z [    0.000000] kvm-clock: using sched offset of 5429414262 cycles
2023-07-18T16:27:39.0049325Z [    0.000001] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
2023-07-18T16:27:39.0049821Z [    0.000002] tsc: Detected 2445.432 MHz processor
2023-07-18T16:27:39.0050216Z [    0.000444] last_pfn = 0x340000 max_arch_pfn = 0x400000000
2023-07-18T16:27:39.0050781Z [    0.000746] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
2023-07-18T16:27:39.0051224Z [    0.000862] last_pfn = 0xbffe0 max_arch_pfn = 0x400000000
2023-07-18T16:27:39.0051773Z [    0.002772] found SMP MP-table at [mem 0x000f5b90-0x000f5b9f]
2023-07-18T16:27:39.0052228Z [    0.002884] check: Scanning 1 areas for low memory corruption
2023-07-18T16:27:39.0052647Z [    0.003027] Using GB pages for direct mapping
2023-07-18T16:27:39.0053154Z [    0.003165] RAMDISK: [mem 0x7dd9d000-0x7fffffff]
2023-07-18T16:27:39.0053570Z [    0.003180] ACPI: Early table checksum verification disabled
2023-07-18T16:27:39.0053987Z [    0.003189] ACPI: RSDP 0x00000000000F59B0 000014 (v00 BOCHS )
2023-07-18T16:27:39.0054424Z [    0.003198] ACPI: RSDT 0x00000000BFFE19C1 000034 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
2023-07-18T16:27:39.0054898Z [    0.003201] ACPI: FACP 0x00000000BFFE185D 000074 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
2023-07-18T16:27:39.0055366Z [    0.003204] ACPI: DSDT 0x00000000BFFE0040 00181D (v01 BOCHS  BXPC     00000001 BXPC 00000001)
2023-07-18T16:27:39.0055778Z [    0.003205] ACPI: FACS 0x00000000BFFE0000 000040
2023-07-18T16:27:39.0056195Z [    0.003207] ACPI: APIC 0x00000000BFFE18D1 000090 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
2023-07-18T16:27:39.0056881Z [    0.003208] ACPI: HPET 0x00000000BFFE1961 000038 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
2023-07-18T16:27:39.0057348Z [    0.003209] ACPI: WAET 0x00000000BFFE1999 000028 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
2023-07-18T16:27:39.0057980Z [    0.003211] ACPI: Reserving FACP table memory at [mem 0xbffe185d-0xbffe18d0]
2023-07-18T16:27:39.0058621Z [    0.003211] ACPI: Reserving DSDT table memory at [mem 0xbffe0040-0xbffe185c]
2023-07-18T16:27:39.0059258Z [    0.003212] ACPI: Reserving FACS table memory at [mem 0xbffe0000-0xbffe003f]
2023-07-18T16:27:39.0059873Z [    0.003212] ACPI: Reserving APIC table memory at [mem 0xbffe18d1-0xbffe1960]
2023-07-18T16:27:39.0060487Z [    0.003213] ACPI: Reserving HPET table memory at [mem 0xbffe1961-0xbffe1998]
2023-07-18T16:27:39.0061099Z [    0.003213] ACPI: Reserving WAET table memory at [mem 0xbffe1999-0xbffe19c0]
2023-07-18T16:27:39.0061524Z [    0.003707] No NUMA configuration found
2023-07-18T16:27:39.0062074Z [    0.003707] Faking a node at [mem 0x0000000000000000-0x000000033fffffff]
2023-07-18T16:27:39.0062639Z [    0.003709] NODE_DATA(0) allocated [mem 0x33fffc000-0x33fffffff]
2023-07-18T16:27:39.0062992Z [    0.004034] Zone ranges:
2023-07-18T16:27:39.0063474Z [    0.004035]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
2023-07-18T16:27:39.0064166Z [    0.004036]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
2023-07-18T16:27:39.0064907Z [    0.004036]   Normal   [mem 0x0000000100000000-0x000000033fffffff]
2023-07-18T16:27:39.0065293Z [    0.004037] Movable zone start for each node
2023-07-18T16:27:39.0065702Z [    0.004037] Early memory node ranges
2023-07-18T16:27:39.0066198Z [    0.004038]   node   0: [mem 0x0000000000001000-0x000000000009efff]
2023-07-18T16:27:39.0066726Z [    0.004038]   node   0: [mem 0x0000000000100000-0x00000000bffdffff]
2023-07-18T16:27:39.0067242Z [    0.004039]   node   0: [mem 0x0000000100000000-0x000000033fffffff]
2023-07-18T16:27:39.0067670Z [    0.004348] Zeroed struct page in unavailable ranges: 130 pages
2023-07-18T16:27:39.0068271Z [    0.004348] Initmem setup node 0 [mem 0x0000000000001000-0x000000033fffffff]
2023-07-18T16:27:39.0068782Z [    0.043837] ACPI: PM-Timer IO Port: 0x608
2023-07-18T16:27:39.0069195Z [    0.043844] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
2023-07-18T16:27:39.0069791Z [    0.043904] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23
2023-07-18T16:27:39.0070260Z [    0.043906] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
2023-07-18T16:27:39.0070718Z [    0.043907] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
2023-07-18T16:27:39.0071181Z [    0.043908] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
2023-07-18T16:27:39.0071624Z [    0.043908] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
2023-07-18T16:27:39.0072084Z [    0.043909] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
2023-07-18T16:27:39.0072551Z [    0.043912] Using ACPI (MADT) for SMP configuration information
2023-07-18T16:27:39.0072983Z [    0.043913] ACPI: HPET id: 0x8086a201 base: 0xfed00000
2023-07-18T16:27:39.0073373Z [    0.043916] TSC deadline timer available
2023-07-18T16:27:39.0073764Z [    0.043922] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
2023-07-18T16:27:39.0074137Z [    0.043962] KVM setup pv sched yield
2023-07-18T16:27:39.0074689Z [    0.043968] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
2023-07-18T16:27:39.0075292Z [    0.043969] PM: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
2023-07-18T16:27:39.0075877Z [    0.043970] PM: Registered nosave memory: [mem 0x000a0000-0x000effff]
2023-07-18T16:27:39.0076455Z [    0.043970] PM: Registered nosave memory: [mem 0x000f0000-0x000fffff]
2023-07-18T16:27:39.0077057Z [    0.043971] PM: Registered nosave memory: [mem 0xbffe0000-0xbfffffff]
2023-07-18T16:27:39.0077629Z [    0.043971] PM: Registered nosave memory: [mem 0xc0000000-0xfeffbfff]
2023-07-18T16:27:39.0078211Z [    0.043971] PM: Registered nosave memory: [mem 0xfeffc000-0xfeffffff]
2023-07-18T16:27:39.0078959Z [    0.043972] PM: Registered nosave memory: [mem 0xff000000-0xfffbffff]
2023-07-18T16:27:39.0079556Z [    0.043972] PM: Registered nosave memory: [mem 0xfffc0000-0xffffffff]
2023-07-18T16:27:39.0080130Z [    0.043974] [mem 0xc0000000-0xfeffbfff] available for PCI devices
2023-07-18T16:27:39.0080568Z [    0.043974] Booting paravirtualized kernel on KVM
2023-07-18T16:27:39.0081276Z [    0.043976] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
2023-07-18T16:27:39.0081851Z [    0.121593] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:4 nr_node_ids:1
2023-07-18T16:27:39.0082332Z [    0.121830] percpu: Embedded 56 pages/cpu s191064 r8192 d30120 u524288
2023-07-18T16:27:39.0082744Z [    0.121862] KVM setup async PF for cpu 0
2023-07-18T16:27:39.0083231Z [    0.121876] kvm-stealtime: cpu 0, msr 333a1b1c0
2023-07-18T16:27:39.0083677Z [    0.121880] Built 1 zonelists, mobility grouping on.  Total pages: 3096425
2023-07-18T16:27:39.0084072Z [    0.121880] Policy zone: Normal
2023-07-18T16:27:39.0084558Z [    0.121881] Kernel command line: BOOT_IMAGE=/vmlinuz initrd=initrd.img root=/dev/sda rw console=ttyS0
2023-07-18T16:27:39.0085099Z [    0.123539] Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes, linear)
2023-07-18T16:27:39.0085733Z [    0.124310] Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes, linear)
2023-07-18T16:27:39.0086504Z [    0.124359] mem auto-init: stack:off, heap alloc:off, heap free:off
2023-07-18T16:27:39.0087337Z [    0.143486] Memory: 12228256K/12582392K available (12291K kernel code, 1363K rwdata, 6412K rodata, 1584K init, 1480K bss, 354136K reserved, 0K cma-reserved)
2023-07-18T16:27:39.0088000Z [    0.143689] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
2023-07-18T16:27:39.0088447Z [    0.143695] ftrace: allocating 37254 entries in 146 pages
2023-07-18T16:27:39.0088882Z [    0.155079] rcu: Hierarchical RCU implementation.
2023-07-18T16:27:39.0089288Z [    0.155080] rcu: 	RCU event tracing is enabled.
2023-07-18T16:27:39.0089741Z [    0.155081] rcu: 	RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=4.
2023-07-18T16:27:39.0090400Z [    0.155083] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
2023-07-18T16:27:39.0090904Z [    0.155083] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
2023-07-18T16:27:39.0091362Z [    0.156827] NR_IRQS: 4352, nr_irqs: 456, preallocated irqs: 16
2023-07-18T16:27:39.0091758Z [    0.157016] random: crng init done
2023-07-18T16:27:39.0092113Z [    0.157179] Console: colour *CGA 80x25
2023-07-18T16:27:39.0092476Z [    0.293648] printk: console [ttyS0] enabled
2023-07-18T16:27:39.0092853Z [    0.294516] ACPI: Core revision 20190816
2023-07-18T16:27:39.0093323Z [    0.295514] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
2023-07-18T16:27:39.0093809Z [    0.297461] APIC: Switch to symmetric I/O mode setup
2023-07-18T16:27:39.0094181Z [    0.298539] KVM setup pv IPIs
2023-07-18T16:27:39.0094709Z [    0.301481] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
2023-07-18T16:27:39.0095167Z [    0.302763] tsc: Marking TSC unstable due to TSCs unsynchronized
2023-07-18T16:27:39.0095673Z [    0.303997] Calibrating delay loop (skipped) preset value.. 4890.86 BogoMIPS (lpj=2445432)
2023-07-18T16:27:39.0096123Z [    0.304997] pid_max: default: 32768 minimum: 301
2023-07-18T16:27:39.0096526Z [    0.306012] LSM: Security Framework initializing
2023-07-18T16:27:39.0096905Z [    0.307003] SELinux:  Initializing.
2023-07-18T16:27:39.0097478Z [    0.307728] Mount-cache hash table entries: 32768 (order: 6, 262144 bytes, linear)
2023-07-18T16:27:39.0098127Z [    0.308010] Mountpoint-cache hash table entries: 32768 (order: 6, 262144 bytes, linear)
2023-07-18T16:27:39.0098619Z [    0.309806] x86/cpu: User Mode Instruction Prevention (UMIP) activated
2023-07-18T16:27:39.0099069Z [    0.310265] Last level iTLB entries: 4KB 512, 2MB 255, 4MB 127
2023-07-18T16:27:39.0099495Z [    0.310997] Last level dTLB entries: 4KB 512, 2MB 255, 4MB 127, 1GB 0
2023-07-18T16:27:39.0100168Z [    0.312003] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
2023-07-18T16:27:39.0100659Z [    0.313998] Spectre V2 : Mitigation: Retpolines
2023-07-18T16:27:39.0101138Z [    0.314946] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
2023-07-18T16:27:39.0101647Z [    0.314997] Spectre V2 : Spectre v2 / SpectreRSB : Filling RSB on VMEXIT
2023-07-18T16:27:39.0102094Z [    0.315997] Speculative Store Bypass: Vulnerable
2023-07-18T16:27:39.0102499Z [    0.317232] Freeing SMP alternatives memory: 40K
2023-07-18T16:27:39.0103865Z ##[error]Process completed with exit code 42.

logs_1728891.zip

@mhofstetter (Member)

@gandro Thanks! The console.log output doesn't include any actual information besides the fact that the LVH VM wasn't ready after 30s.

I addressed some of the action's issues (proper waiting, a configurable wait duration, an increased default wait time of 300s, ...) with cilium/little-vm-helper#96.

The next integration into cilium contains more information: https://github.com/cilium/cilium/actions/runs/5619475924/job/15226671415

[    0.263773] smpboot: CPU0: AMD EPYC 7763 64-Core Processor (family: 0x19, model: 0x1, stepping: 0x1)
[ 3648.930780] cblist_init_generic: Setting adjustable number of callback queues.
[ 3660.061773] cblist_init_generic: Setting shift to 2 and lim to 1.
[ 4361.134773] cblist_init_generic: Setting shift to 2 and lim to 1.
[ 5542.672780] cblist_init_generic: Setting shift to 2 and lim to 1.
[ 6015.907773] Performance Events: AMD PMU driver.
[ 6067.739773] ... version:                0
[ 6071.804773] ... bit width:              48
[ 6071.807773] ... generic registers:      4
[ 6092.966773] ... value mask:             0000ffffffffffff
[ 6155.489774] ... max period:             00007fffffffffff
[ 6270.705773] ... fixed-purpose events:   0
[ 6312.172774] ... event mask:             000000000000000f
[ 6835.082773] kworker/dying (11) used greatest stack depth: 15080 bytes left
[ 7107.386773] rcu: Hierarchical SRCU implementation.
[ 7115.268773] rcu: 	Max phase no-delay instances is 400.
[10531.659780] smp: Bringing up secondary CPUs ...
[11251.030784] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[11251.031767] rcu: 	(detected by 0, t=21002 jiffies, g=-1171, q=1 ncpus=1)
[11251.031767] rcu: All QSes seen, last rcu_preempt kthread activity 20999 (4305918106-4305897107), jiffies_till_next_fqs=3, root ->qsmask 0x0
[11251.031767] rcu: rcu_preempt kthread starved for 20999 jiffies! g-1171 f0x2 RCU_GP_CLEANUP(7) ->state=0x0 ->cpu=0
[11251.031767] rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[11251.031767] rcu: RCU grace-period kthread stack dump:
[11251.031767] task:rcu_preempt     state:R  running task     stack:15512 pid:20    ppid:2      flags:0x00004000
[11251.031767] Call Trace:
[11251.031767]  <TASK>
[11251.031767]  __schedule+0x2f7/0x850
[11251.031767]  __cond_resched+0x36/0x50
[11251.031767]  rcu_gp_cleanup+0x156/0x520
[11251.031767]  ? __pfx_rcu_gp_kthread+0x10/0x10
[11251.031767]  rcu_gp_kthread+0x1b9/0x260
[11251.031767]  kthread+0xe8/0x120
[11251.031767]  ? __pfx_kthread+0x10/0x10
[11251.031767]  ret_from_fork+0x2c/0x50
[11251.031767]  </TASK>
[11251.031767] rcu: Stack dump where RCU GP kthread last ran:

So far no better news than this; it might be related to the kernel version.

@brb (Member) commented Jul 21, 2023

> So far no better news than this; it might be related to the kernel version.

Do we know on which kernels it has happened so far?

@mhofstetter (Member)

Mostly 5.4, from what I've analyzed so far:

https://github.com/cilium/cilium/actions/runs/5279198432	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5580840015	bpf-next-20230526.105339@sha256:4133d4e09b1e86ac175df8d899873180281bb4220dc43e2566c47b0241637411
https://github.com/cilium/cilium/actions/runs/5589813647	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5317564597	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5569324460	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5418362930	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5523898017	bpf-next-20230526.105339@sha256:4133d4e09b1e86ac175df8d899873180281bb4220dc43e2566c47b0241637411
https://github.com/cilium/cilium/actions/runs/5551449501	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5587283205	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5264084261	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5359677112	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5474252275	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5617616061	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5599415659	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5473457440	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5473493891	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5295602769	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5522235306	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5240555020	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5447784248	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5418362930	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
https://github.com/cilium/cilium/actions/runs/5551141862	5.4-20230526.105339@sha256:523ff0c81ed9be73a2d0d02a64c72655225efdd32e35220f530ca3eff3eada3c
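
For anyone auditing further occurrences, a sketch of collecting candidate runs with the GitHub CLI (illustrative only; the kernel image tag still has to be read out of each run's logs, e.g. via gh run view <id> --log):

```bash
# List recent failed workflow runs in cilium/cilium as URL + workflow name.
gh run list -R cilium/cilium --status failure --limit 50 \
  --json url,workflowName \
  --jq '.[] | "\(.url)\t\(.workflowName)"'
```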

@aanm (Member) commented Jul 31, 2023

@brb can you take a look at this? 🥺

@brb (Member) commented Aug 1, 2023

> @brb can you take a look at this? 🥺

Sorry, no cycles, as I am involved in other testing projects.

@github-actions bot commented Oct 1, 2023

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

github-actions bot added the stale label on Oct 1, 2023
@github-actions bot

This issue has not seen any activity since it was marked stale.
Closing.

github-actions bot closed this as not planned (stale) on Oct 16, 2023