Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vt_console:Skip waiting screen when login via console #3948

Merged
merged 1 commit into from
Jul 25, 2024

Conversation

chloerh
Copy link
Contributor

@chloerh chloerh commented Jul 15, 2024

To skip the UEFI guest reset process and prevent the forever waiting

Test with previously failed cases:

 (1/2) type_specific.io-github-autotest-libvirt.virtual_network.passt.connectivity_between_2vms.ip_portfw.root_user: PASS (202.79 s)
 (2/2) type_specific.io-github-autotest-libvirt.virtual_network.passt.connectivity_between_2vms.ip_portfw.non_root_user: PASS (67.56 s)
RESULTS    : PASS 2 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
 (1/6) type_specific.io-github-autotest-libvirt.virtual_devices.filesystem_device_unprivileged.one_fs.with_hugepages.two_guests.coldplug: PASS (86.80 s)
 (2/6) type_specific.io-github-autotest-libvirt.virtual_devices.filesystem_device_unprivileged.one_fs.with_memfd.two_guests.coldplug: PASS (83.90 s)
 (3/6) type_specific.io-github-autotest-libvirt.virtual_devices.filesystem_device_unprivileged.one_fs.with_shm.two_guests.coldplug: PASS (88.41 s)
 (4/6) type_specific.io-github-autotest-libvirt.virtual_devices.filesystem_device_unprivileged.two_fs.with_hugepages.two_guests.coldplug: PASS (92.72 s)
 (5/6) type_specific.io-github-autotest-libvirt.virtual_devices.filesystem_device_unprivileged.two_fs.with_memfd.two_guests.coldplug: PASS (92.89 s)
 (6/6) type_specific.io-github-autotest-libvirt.virtual_devices.filesystem_device_unprivileged.two_fs.with_shm.two_guests.coldplug: PASS (90.67 s)
RESULTS    : PASS 6 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0

To skip the UEFI guest reset process and prevent the forever waiting

Signed-off-by: Haijiao Zhao <haizhao@redhat.com>
Copy link
Contributor

@Yingshun Yingshun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

if not is_responsive:
raise ConsoleNotResponsiveError("Console is not responsive.")
self._console.read_nonblocking()
latest_output = "\n".join(self._console.get_output().splitlines()[-50:])
Copy link
Contributor

@chunfuwen chunfuwen Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here , we get output starting from [-50:], can we ensure it is the same in different arches, such as s390x , arm etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Thank you for helping check this on s390. I'll get help to check on arm too.

@chunfuwen
Copy link
Contributor

@smitterl Can you help check this from s390x perspective?

@chloerh
Copy link
Contributor Author

chloerh commented Jul 17, 2024

Run arm acceptance test, got no failure.
@chunfuwen @smitterl anything on s390?

Copy link
Contributor

@dzhengfy dzhengfy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@smitterl
Copy link
Contributor

smitterl commented Jul 22, 2024

IIUC this is trying to handle a very specific UEFI boot up screen that is only shown on first boot resp. when there's no nvram file. If so, in #3849 the proposed solution is to handle that file. Would this also be helpful here? @chloerh It is used to avoid the system reset in these tests: autotest/tp-libvirt#5465

if reset_str in latest_output:
time.sleep(1)
continue
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chloerh Sorry I don't understand what you are achieving here. IIUC, the code at login reads the last lines in the console output and waits as long as the UEFI system reset string is shown? When is this else branch relevant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is:
line 87- 89
first check the latest output, if 'reset' is currently on the screen, we'll redo the loop after sleep(1) aiming to skip the 'reset' screen.

line 90-101
After we cannot find 'reset' in the latest output, it means:

  • the 'reset' has passed
  • or it never appeared
    if it's the 1st: we will stop the loop and try the login on line 103
    if it's the 2nd: we're not sure if the 'reset' is going to appear or not. But we know that if it will appear, it will be at the beginning part of the output, so I set the limit to 500 lines. If no 'reset' after 500 lines, the 'reset' will not appear at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for explaining. I have run some tests on s390x to confirm there's no negative impact for our systems that don't use that system reset.

May I ask why the current implementation can't be used to wait for the message to disappear?

This looks like quite a specific solution that only applies to UEFI on x86_64 with a specific (qemu?)firmware. I wonder if there could be a general solution that would allow for waiting for any message to disappear first (make the line configurable). However, from #3948 (comment) I understand you're considering an alternative solution, so I'll wait for the outcome of that.

Copy link
Contributor Author

@chloerh chloerh Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smitterl Thanks so much for checking this pr with 390x tests!

Actually, the current implementation was added to workaround the same issue 2523255
Unfortunately, it reduced the number of failed cases but not to 0. Therefore we made another workaround to extend internal timeout of login (10s to 20s) which was configured in our ci.

But lately we found some login timeout errors and they could be workaround by time.sleep(x). I have even merged one with passt tests, but I don't think after 2 ways of workaround, adding a 3rd workaround is the best way to deal with this issue.

This looks like quite a specific solution that only applies to UEFI on x86_64 with a specific (qemu?)firmware.

About this, I'm not sure about that, but the 'reset' happens quite often after we do xml.sync(). Do you mean that it doesn't happen to other archs? If so, I could check with arm qes to get the info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the reset should happen only on x86_64 AFAIK but it could make the code clearer and reusable in case other setups will have a similar timeout, I don't know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that filesystem_device_unprivileged.two_fs..two_guests test cases reproduce the 'infinite wait for login' with reset message on UEFI guests. But with this patch they pass. I don't really understand how but I will approve this request.

JOB ID     : fbb46d18520a1a07a1fc72907fcc27e9f9b63e5b
JOB LOG    : /var/log/avocado/job-results/job-2024-07-24T08.08-fbb46d1/job.log
 (1/1) type_specific.io-github-autotest-libvirt.virtual_devices.filesystem_device_unprivileged.two_fs.with_shm.two_guests.coldplug:  PASS (109.02 s)
RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /var/log/avocado/job-results/job-2024-07-24T08.08-fbb46d1/results.html
JOB TIME   : 110.63 s

@smitterl
Copy link
Contributor

@smitterl Can you help check this from s390x perspective?

From s390x perspective: The reset string won't be shown, so I am worried that taking away the is_responsive part will have a negative impact. What do you think?

@chloerh
Copy link
Contributor Author

chloerh commented Jul 22, 2024

IIUC this is trying to handle a very specific UEFI boot up screen that is only shown on first boot resp. when there's no nvram file. If so, in #3849 the proposed solution is to handle that file. Would this also be helpful here? @chloerh It is used to avoid the system reset in these tests: autotest/tp-libvirt#5465

I found it could be helpful to the problem I'm trying to fix with the current pr. The problem is what's being described in the comment below. I also think we need to have a discussion with the team before we make the change to sync() function.

Thank you for bringing this up, I believe this could solve from the root.


        # Backup the EFI vars file before sync to prevent reset/reboot issue
        # Updating a VM's XML using vmxml.sync() deletes the nvram file, which on a UEFI VM
        # causes an unwanted reset/reboot.

@chunfuwen
Copy link
Contributor

Moreover, xml.sync() is commonly and widely used in many job features, such as virtual disk , detach and attach related cases.
From 9.5 rhel jobs test results(all cases pass in multiple round), I can not conclude there is reset screen appearance.
So I wonder whether it is specific to VM xml ,for example having some special device in VM.
We may need dig more details about this

Therefore, since this is specific issue, so I am still concerned about necessity of this PR change.
It looks like we overturn previous general solution in order to resolve specific issue

@chunfuwen
Copy link
Contributor

chunfuwen commented Jul 25, 2024

Additionally, Can we consider to address this issue in specific tp-libvirt file? such as autotest/tp-libvirt#5465.

If reset/reboot happen in some specific cases, we need reconsider whether the test case manipulate VM life cycle in correct way.

@chloerh
Copy link
Contributor Author

chloerh commented Jul 25, 2024

Additionally, Can we consider to address this issue in specific tp-libvirt file? such as autotest/tp-libvirt#5465.

If reset/reboot happen in some specific cases, we need reconsider whether the test case manipulate VM life cycle in correct way.

  • Actually, it's not just specific cases, it happens to lots of network cases when we need to use wait_for_serial_login()

  • For test results: I have got arm acceptance tests with the help of Yingshun and no negative affect on some s390 tests according to Sebas' comments. I also have run network job of a few hundres cases with this pr too.

  • About that you didn't find the reset screen, you can only encounter this issue when using wait_for_serial_login()

@chunfuwen
Copy link
Contributor

vmxml.sync() support pass in "--keep-nvram", did we ever try to use vmxml.sync("--keep-nvram") ?
In this way, nvram file can be kept.

@chloerh
Copy link
Contributor Author

chloerh commented Jul 25, 2024

vmxml.sync() support pass in "--keep-nvram", did we ever try to use vmxml.sync("--keep-nvram") ? In this way, nvram file can be kept.

Yes, we're going to set --keep-nvram as default option, which is another issue to be discussed.

@chunfuwen
Copy link
Contributor

Why this issue appear to our testing? Most possibility is that we rudely virsh undefine --nram firstly, then define VM wth the same image. This is not correct way from user's perspective.
In this situation, the forward solution is to correct the way we define and undefine Vm, rather that use workaround solution to hide the incorrect usage way.

@dzhengfy
Copy link
Contributor

Althrough after we use --keep-nvram in sync(), we will not happen to "reset" message in most cases when booting the guest, current PR still can solve some specific cases when --nvram is used which leads to "reset" message and this PR will not impact --keep-nvram cases. So I will merge it and see if any improvement for cases' stability.

@dzhengfy dzhengfy merged commit 289525e into avocado-framework:master Jul 25, 2024
50 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants