[nova-compute / ovn-chassis] upgrade nova-compute failed during zed -> 2023.1 #494

Open
chanchiwai-ray opened this issue Jul 15, 2024 · 23 comments

Comments

@chanchiwai-ray
Contributor

When COU refreshes nova-compute from 'zed/stable' to '2023.1/stable' during the Jammy/Zed -> Jammy/2023.1 upgrade, the following substep of 'Upgrade plan for units: nova-compute/0' failed:

Upgrade plan for unit 'nova-compute/0': ActionFailed('Run of action \'openstack-upgrade\' with parameters \'<not-set>\' on \'nova-compute/0\' failed with \'upgrade callback resulted in an unexpected error\' …)
unit-nova-compute-1: 03:50:34 WARNING unit.nova-compute/1.openstack-upgrade E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).
unit-nova-compute-1: 03:50:34 INFO unit.nova-compute/1.juju-log Couldn't acquire DPKG lock. Will retry in 10 seconds
unit-nova-compute-1: 03:50:44 DEBUG unit.nova-compute/1.openstack-upgrade Reading package lists...
unit-nova-compute-1: 03:50:44 DEBUG unit.nova-compute/1.openstack-upgrade Building dependency tree...
unit-nova-compute-1: 03:50:44 DEBUG unit.nova-compute/1.openstack-upgrade Reading state information...
unit-nova-compute-1: 03:50:44 DEBUG unit.nova-compute/1.openstack-upgrade You might want to run 'apt --fix-broken install' to correct these.
unit-nova-compute-1: 03:50:44 DEBUG unit.nova-compute/1.openstack-upgrade The following packages have unmet dependencies:
unit-nova-compute-1: 03:50:44 DEBUG unit.nova-compute/1.openstack-upgrade  openvswitch-switch : Depends: openvswitch-common (= 3.0.3-0ubuntu0.22.10.3~cloud3) but 3.1.3-0ubuntu0.23.04.1~cloud0 is installed
unit-nova-compute-1: 03:50:44 WARNING unit.nova-compute/1.openstack-upgrade E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).

Environment: deployed with stsstack-bundle using ./generate-bundle.sh --name cou -r yoga -s jammy --ceph --run

Failed: Upgrade the unit: 'nova-compute/0' substep failed

Note

  • ovn-chassis is a subordinate to nova-compute
  • Running sudo apt --fix-broken install on the nova-compute/0 unit fixes this issue
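
When scanning unit logs for this failure, it can help to reduce the apt "unmet dependencies" line to its four useful fields. The following is a minimal sketch, not part of COU or the charms; the function name parse_unmet_dep is hypothetical, and it assumes the line has the same shape as the one in the log above.

```shell
# Hypothetical helper: extract package, dependency, required version and
# installed version from an apt "unmet dependencies" line.
parse_unmet_dep() {
    # Reads apt or juju debug-log output on stdin.
    # Prints "package dependency required-version installed-version" per match.
    sed -n -E 's/^(.*[[:space:]])?([^[:space:]]+) : Depends: ([^[:space:]]+) \(= ([^)]+)\) but ([^[:space:]]+) is installed.*$/\2 \3 \4 \5/p'
}
```

For example, piping the output of juju debug-log through parse_unmet_dep would list every package pair that is out of step, which makes it easier to see that the mismatch is between openvswitch-switch and openvswitch-common.
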
@samuelallan72
Contributor

This appears related to LP: #2068109 - same broken packages for apt, same workaround, and similarly the broken packages relate to a colocated charm.

@Pjack

Pjack commented Aug 6, 2024

The workaround solution may break the configuration of nova-compute. We need to find some other solution.

@jneo8
Contributor

jneo8 commented Aug 6, 2024

Problem Description

Initially, I deployed an OpenStack Zed cloud with ovn-chassis channel 22.09/stable and nova-compute zed/stable. The upgrade path was: ovn-chassis → ovn-central → nova-compute. The ovn-chassis should be upgraded before ovn-central, and the OpenStack control plane should be upgraded before the hypervisor.

Upgrade Process

  1. ovn-chassis Upgrade:

    • Upgraded from 22.09/stable to 23.03/stable.
    • However, the workload version remained at Zed because /etc/apt/sources.list.d/cloud-archive.list still pointed to deb http://ubuntu-cloud.archive.canonical.com/ubuntu jammy-updates/zed main, keeping openvswitch-switch and ovn-common packages at 22.09.
  2. ovn-central Upgrade:

    • Upgraded on another node without issues.
  3. nova-compute Upgrade:

    • With action-managed-upgrade=True, the openstack-upgrade action changed /etc/apt/sources.list.d/cloud-archive.list to deb http://ubuntu-cloud.archive.canonical.com/ubuntu jammy-updates/antelope main. This upgraded ovn-common to 23.04, breaking the dependency on openvswitch-switch which was still at 22.09.
$ sudo apt-get --option Dpkg::Options::=--force-confnew --option Dpkg::Options::=--force-confdef dist-upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
You might want to run 'apt --fix-broken install' to correct these.
The following packages have unmet dependencies:
 openvswitch-switch : Depends: openvswitch-common (= 3.0.3-0ubuntu0.22.10.3~cloud3) but 3.1.3-0ubuntu0.23.04.1~cloud0 is installed
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).
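
The steps above hinge on which Ubuntu Cloud Archive pocket the unit's apt sources point at before and after each step. A minimal sketch for checking that follows; the function name uca_pocket is hypothetical, the default path is the one quoted above, and the pattern only recognises "<series>-updates/<release>" entries.

```shell
# Print the UCA release pocket (e.g. "zed" or "antelope") named in an apt
# sources file. Commented-out lines are ignored.
uca_pocket() {
    list_file="${1:-/etc/apt/sources.list.d/cloud-archive.list}"
    # Matches lines such as: deb http://... jammy-updates/zed main
    sed -n -E 's|^deb[[:space:]].*[[:space:]][a-z]+-updates/([a-z0-9.]+)[[:space:]].*$|\1|p' "$list_file"
}
```

Running uca_pocket on the hypervisor after the ovn-chassis refresh and again after the openstack-upgrade action should show the switch from zed to antelope described in steps 1 and 3.
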

Conclusion

This presents a chicken-and-egg problem: the ovn-chassis must be upgraded before nova-compute, but the necessary OVN version isn't available until after the nova-compute upgrade.

A workaround is to run sudo DEBIAN_FRONTEND=noninteractive apt --fix-broken install -y -o Dpkg::Options::="--force-confold" -o Dpkg::Options::="--force-confdef" after the upgrade action fails for the first time; this updates the ovn-chassis workload to 23.03.
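
Since the workaround changes package state on a live hypervisor, it may be worth previewing it first. The sketch below uses apt-get's simulate mode, which lists the packages that would be installed, upgraded or removed without changing anything; it does not show how configuration files would be handled. The function name and the APT_GET override are hypothetical and exist only so the call can be exercised without touching the package database.

```shell
# Preview what "--fix-broken install" would do, without applying it.
# APT_GET is a hypothetical override; it defaults to the real apt-get.
preview_fix_broken() {
    "${APT_GET:-apt-get}" --simulate --fix-broken install
}
```

On the affected unit this would be run as root (for example via sudo sh -c), and the real workaround applied only if the simulated changes look as expected.
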

@fnordahl
Member

fnordahl commented Aug 6, 2024

Initially, I deployed an OpenStack Zed cloud with ovn-chassis channel 22.09/stable and nova-compute zed/stable. The upgrade path was: ovn-chassis → ovn-central → nova-compute. The ovn-chassis should be upgraded before ovn-central, and the OpenStack control plane should be upgraded before the hypervisor.

So far I'm with you.

This presents a chicken-and-egg problem: the ovn-chassis must be upgraded before nova-compute, but the necessary OVN version isn't available until after the nova-compute upgrade.

Why exactly does ovn-chassis need to be upgraded before nova-compute?

What prevents you from upgrading ovn-chassis and nova-compute together, leaving the ovn-central upgrade to the end when all chassis have been upgraded?

@jneo8
Contributor

jneo8 commented Aug 6, 2024

Hi @fnordahl, thanks for your response.

Why exactly does ovn-chassis need to be upgraded before nova-compute?

So these are the reasons we have this order: ovn-chassis -> ovn-central -> nova-compute

What prevents you from upgrading ovn-chassis and nova-compute together, leaving the ovn-central upgrade to the end when all chassis have been upgraded?

The main cause of this issue is that, whatever the order of upgrades on the hypervisor node, something will go wrong:

@fnordahl
Member

fnordahl commented Aug 6, 2024

It is possible to upgrade the charm without upgrading its payload. Would that not solve the RuntimeError problem?

@samuelallan72
Contributor

What prevents you from upgrading ovn-chassis and nova-compute together,

@fnordahl could you explain what this looks like practically - ie. if you were to do this manually? I'm curious because I'm not aware of any charm mechanism that synchronises updates between charms.

@jneo8
Contributor

jneo8 commented Aug 7, 2024

It is possible to upgrade the charm without upgrading its payload. Would that not solve the RuntimeError problem?

I'm almost sure that won't resolve the issue, because the RuntimeError comes from charm-helpers not having the OPENSTACK_CODENAME entry, for example zed missing from charm-helpers in the yoga release. If I understand correctly how charms work and are packaged, that can't be changed without a refresh.

(Or I misunderstand what you are trying to do here.)

@fnordahl
Member

fnordahl commented Aug 7, 2024

As laid out in the documentation, it is expected to first upgrade charms and then upgrade the payload. There is nothing manual or special about this.

So if you upgrade the charms first and then proceed with the payload upgrade, is this an issue?

@jneo8
Contributor

jneo8 commented Aug 7, 2024

There is also an issue with the workaround:

# run work-around command on nova-compute unit

sudo DEBIAN_FRONTEND=noninteractive apt --fix-broken install -y -o Dpkg::Options::="--force-confold" -o Dpkg::Options::="--force-confdef"

# on juju client

# This succeeds
juju run nova-compute/0 openstack-upgrade

# This fails
juju run nova-compute/0 resume

# This succeeds
juju run nova-compute/0 enable

This is the error message from the resume action:

Running operation 73 with 1 task
  - task 74 on unit-nova-compute-0

Waiting for task 74...

failed
inactive
inactive
Filesystem                                             1K-blocks   Used Available Use% Mounted on
/dev/mapper/crypt-143335ca-8be8-4124-a149-05156760c184  52386824 398292  51988532   1% /var/lib/nova/instances
Filesystem                                             1K-blocks   Used Available Use% Mounted on
/dev/mapper/crypt-143335ca-8be8-4124-a149-05156760c184  52386824 398292  51988532   1% /var/lib/nova/instances
Filesystem                                             1K-blocks   Used Available Use% Mounted on
/dev/mapper/crypt-143335ca-8be8-4124-a149-05156760c184  52386824 398292  51988532   1% /var/lib/nova/instances
Filesystem                                             1K-blocks   Used Available Use% Mounted on
/dev/mapper/crypt-143335ca-8be8-4124-a149-05156760c184  52386824 398292  51988532   1% /var/lib/nova/instances
active
active
active
Removed /etc/systemd/system/nova-compute.service.
Synchronizing state of nova-compute.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable nova-compute
Created symlink /etc/systemd/system/multi-user.target.wants/nova-compute.service → /lib/systemd/system/nova-compute.service.
Removed /etc/systemd/system/nova-api-metadata.service.
Synchronizing state of nova-api-metadata.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable nova-api-metadata
Created symlink /etc/systemd/system/multi-user.target.wants/nova-api-metadata.service → /lib/systemd/system/nova-api-metadata.service.
Removed /etc/systemd/system/qemu-kvm.service.
Created symlink /etc/systemd/system/multi-user.target.wants/qemu-kvm.service → /lib/systemd/system/qemu-kvm.service.
ERROR no relation id specified

@jneo8
Contributor

jneo8 commented Aug 7, 2024

As laid out in the documentation, it is expected to first upgrade charms and then upgrade the payload. There is nothing manual or special about this.

So if you upgrade the charms first and then proceed with the payload upgrade, is this an issue?

Hi @fnordahl

You mean, for example, that as a first step we upgrade all the charms without changing any source or openstack-origin Juju application configuration, and then change the configuration or run the upgrade action later?

@fnordahl
Member

fnordahl commented Aug 7, 2024

You mean, for example, that as a first step we upgrade all the charms without changing any source or openstack-origin Juju application configuration, and then change the configuration or run the upgrade action later?

Yes.

@fnordahl
Member

fnordahl commented Aug 7, 2024

@fnordahl could you explain what this looks like practically - ie. if you were to do this manually? I'm curious because I'm not aware of any charm mechanism that synchronises updates between charms.

On this topic, there is precedent in the charms for having subordinates make their principal aware of their packages:
https://review.opendev.org/q/topic:%22bug/1806111%22

Not sure if it is required here though; it might be that simply doing the charm upgrade before the payload upgrade already makes the nova-compute payload upgrade work.
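
One reading of "charm upgrade before payload upgrade" for a single hypervisor is sketched below. It is an illustration, not a tested procedure: the channels are the ones named in this thread, the openstack-origin value cloud:jammy-antelope is an assumption for a Jammy/2023.1 target, and the JUJU override is hypothetical so that the sequence can be printed with JUJU=echo rather than executed. Waiting for the model to settle between steps, and any pause/resume of the unit, are left out.

```shell
# Sketch: refresh charm code first, then upgrade the payload on one unit.
# JUJU is a hypothetical override; it defaults to the real juju client.
upgrade_charms_then_payload() {
    unit="${1:-nova-compute/0}"
    juju_cmd="${JUJU:-juju}"

    # 1. Charm code only: move both charms to their target channels.
    "$juju_cmd" refresh ovn-chassis --channel 23.03/stable || return 1
    "$juju_cmd" refresh nova-compute --channel 2023.1/stable || return 1

    # 2. Payload: point nova-compute at the new archive (assumed value),
    #    then run the per-unit upgrade action.
    "$juju_cmd" config nova-compute action-managed-upgrade=true \
        openstack-origin=cloud:jammy-antelope || return 1
    "$juju_cmd" run "$unit" openstack-upgrade
}
```

Calling JUJU=echo upgrade_charms_then_payload nova-compute/0 prints the four juju invocations in order, which is a cheap way to review the plan before running it against a real model.
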

@javacruft

This is a packaging upgrade bug:

Preparing to unpack .../0-openvswitch-switch_3.1.3-0ubuntu0.23.04.1~cloud0_amd64.deb ...
Unpacking openvswitch-switch (3.1.3-0ubuntu0.23.04.1~cloud0) over (3.0.3-0ubuntu0.22.10.3~cloud3) ...
dpkg: error processing archive /tmp/apt-dpkg-install-e4dbGT/0-openvswitch-switch_3.1.3-0ubuntu0.23.04.1~cloud0_amd64.deb (--unpack):
 trying to overwrite '/usr/share/openvswitch/local-config.ovsschema', which is also in package openvswitch-common 3.0.3-0ubuntu0.22.10.3~cloud3

openvswitch-switch needs a versioned Breaks/Replaces in the antelope UCA to deal with this file moving, if that was the intent.
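
To spot the same class of file conflict in other dpkg logs, the "trying to overwrite" message can be reduced to the contested path and the package that currently owns it. This is a minimal sketch assuming the message is on a single line, as dpkg emits it; the function name parse_overwrite_conflict is hypothetical.

```shell
# Hypothetical helper: extract the contested file and its current owner
# from dpkg's "trying to overwrite" error message.
parse_overwrite_conflict() {
    # Reads dpkg output on stdin.
    # Prints "path owning-package owning-version" per match.
    sed -n -E "s/^.*trying to overwrite '([^']+)', which is also in package ([^[:space:]]+) ([^[:space:]]+).*\$/\1 \2 \3/p"
}
```

On a live unit, dpkg -S with the printed path shows which installed package currently owns that file.
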

@javacruft

javacruft commented Aug 7, 2024

Due to the packaging history (merging with Debian) I think the openvswitch-switch binary package needs:

Breaks: openvswitch-common (<< 3.0.1-1-)
Replaces: openvswitch-common (<< 3.0.1-1-)

as that's when the local-config.ovsschema file was added to the -switch package.

@fnordahl are you OK to pick up a fix for this?

@javacruft

javacruft commented Aug 7, 2024

Reproducer in a focal LXD container

root@literate-goshawk:~# history
    1  add-apt-repository cloud-archive:zed
    2  apt install openvswitch-switch
    3  add-apt-repository cloud-archive:antelope
    4  apt dist-upgrade --assume-yes
    5  history

@samuelallan72
Contributor

As laid out in the documentation, it is expected to first upgrade charms and then upgrade the payload. There is nothing manual or special about this.

So if you upgrade the charms first and then proceed with the payload upgrade, is this an issue?

For "charms upgrade" though, it's not always clear what this refers to, as there are a couple of options:

  1. upgrade the charm to the latest revision in the same channel - eg. juju refresh <charm>. Some parts of the documentation refer to the older method where all charms were on latest/stable, while others refer to the newer channel-based charms (eg. ussuri/stable channel), so this has a different connotation for each.
  2. upgrade the charm to the next (target openstack) channel - eg. juju refresh <charm> --channel victoria/stable

Also, how does this work for subordinate charms? It's not possible to split upgrading the charm and the payload as with principal charms, because they don't have a separate openstack-origin or source config.

@fnordahl
Member

fnordahl commented Aug 8, 2024

Due to the packaging history (merging with Debian) I think the openvswitch-switch binary package needs:

Breaks: openvswitch-common (<< 3.0.1-1-)
Replaces: openvswitch-common (<< 3.0.1-1-)

as that's when the local-config.ovsschema file was added to the -switch package.

@fnordahl are you OK to pick up a fix for this?

@javacruft sure! Do you think this is a requirement all the way from tip of the packages, or is this a stable only package fix?

@fnordahl
Member

fnordahl commented Aug 8, 2024

As laid out in the documentation, it is expected to first upgrade charms and then upgrade the payload. There is nothing manual or special about this.
So if you upgrade the charms first and then proceed with the payload upgrade, is this an issue?

For "charms upgrade" though, it's not always clear what this refers to, as there are a couple of options:

1. upgrade the charm to the latest revision in the same channel - eg. `juju refresh <charm>`. Some parts of the documentation refer to the older method where all charms were on latest/stable, while others refer to the newer channel-based charms (eg. ussuri/stable channel), so this has a different connotation for each.

2. upgrade the charm to the next (target openstack) channel - eg. `juju refresh <charm> --channel victoria/stable`

Regardless of which of these methods is used, the basic reality is that the charm code needs to be upgraded before the payload so that it understands the new payload.

Also, how does this work for subordinate charms? It's not possible to split upgrading the charm and the payload as with principal charms, because they don't have a separate openstack-origin or source config.

Indeed, subordinate charms typically do not deal with package upgrades for this reason. The chassis charm has an ovn-source configuration option solely to deal with the special Focal OVN 22.03 UCA pocket.

Some subordinate charms with package upgrade requirements exchange this information with their principal as mentioned in #494 (comment), but again I don't think explicit exchange is required for this relationship.

For ovn-chassis, the OpenStack upgrade performed by nova-compute should take care of the required package upgrades.

@jneo8
Contributor

jneo8 commented Aug 20, 2024

@samuelallan72
Contributor

samuelallan72 commented Oct 3, 2024

@fnordahl thanks for your explanation 🙂

@samuelallan72
Contributor

@jneo8 can you confirm if this bug is fixed? I see you confirmed that https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/2077406 is resolved, but I'm not sure if that alone fixes this; there are many comments on this issue.

@jneo8
Contributor

jneo8 commented Oct 7, 2024

@jneo8 can you confirm if this bug is fixed? I see you confirmed that https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/2077406 is resolved, but I'm not sure if that alone fixes this; there are many comments on this issue.

Thanks for following up, Samuel.

The fixed version 3.1.3-0ubuntu0.23.04.1~cloud0 is in staging now. Once it is released to Ubuntu, we can resolve this one.

Check on: https://openstack-ci-reports.ubuntu.com/reports/cloud-archive/antelope_versions.html
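
Besides the report page, the version a given unit would actually receive can be read from apt-cache policy on that unit. The sketch below extracts the "Candidate:" field; the function name and the APT_CACHE override are hypothetical, the latter only so the parsing can be tried against canned output.

```shell
# Print the candidate version apt would install for a package.
# APT_CACHE is a hypothetical override; it defaults to the real apt-cache.
candidate_version() {
    "${APT_CACHE:-apt-cache}" policy "$1" |
        sed -n -E 's/^[[:space:]]*Candidate:[[:space:]]*//p'
}
```

For instance, candidate_version openvswitch-switch on a hypervisor shows whether the fixed package has reached the pocket that unit is configured for.
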
