Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 27 additions & 8 deletions crates/lib/src/bootc_composefs/rollback.rs
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,18 @@ use anyhow::{Context, Result, anyhow};
use cap_std_ext::cap_std::fs::Dir;
use cap_std_ext::dirext::CapStdExtDirExt;
use fn_error_context::context;
use ocidir::cap_std::ambient_authority;
use rustix::fs::{AtFlags, RenameFlags, fsync, renameat_with};

use crate::bootc_composefs::boot::{
BootType, FILENAME_PRIORITY_PRIMARY, FILENAME_PRIORITY_SECONDARY, primary_sort_key,
secondary_sort_key, type1_entry_conf_file_name,
};
use crate::bootc_composefs::status::{get_composefs_status, get_sorted_type1_boot_entries};
use crate::composefs_consts::TYPE1_ENT_PATH_STAGED;
use crate::spec::Bootloader;
use crate::composefs_consts::{
COMPOSEFS_STAGED_DEPLOYMENT_FNAME, COMPOSEFS_TRANSIENT_STATE_DIR, TYPE1_ENT_PATH_STAGED,
};
use crate::spec::{Bootloader, Host};
use crate::store::{BootedComposefs, Storage};
use crate::{
bootc_composefs::{boot::get_efi_uuid_source, status::get_sorted_grub_uki_boot_entries},
Expand Down Expand Up @@ -124,7 +127,7 @@ fn rollback_grub_uki_entries(boot_dir: &Dir) -> Result<()> {
/// a. Here we assume that rollback is queued as there's no way to differentiate between this
/// case and Case 1-b. This is what ostree does as well
#[context("Rolling back {bootloader} entries")]
fn rollback_composefs_entries(boot_dir: &Dir, bootloader: Bootloader) -> Result<()> {
fn rollback_composefs_entries(host: &Host, boot_dir: &Dir, bootloader: Bootloader) -> Result<()> {
// Get all boot entries sorted in descending order by sort-key
let mut all_configs = get_sorted_type1_boot_entries(&boot_dir, false)?;

Expand All @@ -143,6 +146,24 @@ fn rollback_composefs_entries(boot_dir: &Dir, bootloader: Bootloader) -> Result<
// OR if rollback was queued, it would become secondary
all_configs[1].sort_key = Some(secondary_sort_key(os_id));

// Ostree will drop any staged deployment on rollback
// We follow the same approach for now
if let Some(..) = &host.status.staged {
println!("Removing currently staged deployment");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to try to centralize reporting more in the future and get away from just random println!. This isn't the only case of course.

One thing we could probably do instead here is log to the journal, which would have other benefits.


boot_dir
.remove_dir_all(TYPE1_ENT_PATH_STAGED)
.context("Removing staged entries")?;

let transient_dir =
Dir::open_ambient_dir(COMPOSEFS_TRANSIENT_STATE_DIR, ambient_authority())
.context("Opening transient dir")?;

transient_dir
.remove_file(COMPOSEFS_STAGED_DEPLOYMENT_FNAME)
.context("Removing staged deployment file")?;
}
Comment on lines +149 to +165
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Robustness & Correctness Issue

  1. Stale Staged Entries: If loader/entries.staged (TYPE1_ENT_PATH_STAGED) already exists on disk (e.g., from a previously interrupted update or rollback) but host.status.staged is None, we do not clean it up before writing the new rollback BLS configs. This can cause stale entries to be carried over and swapped into the active loader/entries directory during rename_exchange_bls_entries.
  2. Ignorable Deletion Failure: If the staged deployment file COMPOSEFS_STAGED_DEPLOYMENT_FNAME is missing or already deleted, remove_file will fail with NotFound, unnecessarily failing the entire rollback operation.

Recommendation

Always clean up TYPE1_ENT_PATH_STAGED if it exists using remove_all_optional, and ignore NotFound errors when deleting the staged deployment file.

    // Clean up any existing staged entries directory to avoid carrying over stale files
    boot_dir
        .remove_all_optional(TYPE1_ENT_PATH_STAGED)
        .context("Removing staged entries")?;

    // Ostree will drop any staged deployment on rollback
    // We follow the same approach for now
    if let Some(..) = &host.status.staged {
        println!("Removing currently staged deployment");

        let transient_dir =
            Dir::open_ambient_dir(COMPOSEFS_TRANSIENT_STATE_DIR, ambient_authority())
                .context("Opening transient dir")?;

        if let Err(e) = transient_dir.remove_file(COMPOSEFS_STAGED_DEPLOYMENT_FNAME) {
            if e.kind() != std::io::ErrorKind::NotFound {
                return Err(anyhow::Error::from(e)).context("Removing staged deployment file");
            }
        }
    }

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    boot_dir
        .remove_all_optional(TYPE1_ENT_PATH_STAGED)
        .context("Removing staged entries")?;

I think we would definitely want to return an error if we have a staged deployment and this doesn't exist. Basically means the system is in a weird state. It is the same with COMPOSEFS_STAGED_DEPLOYMENT_FNAME, host.staged is only set if COMPOSEFS_STAGED_DEPLOYMENT_FNAME exists. Also, in the staged op, writing COMPOSEFS_STAGED_DEPLOYMENT_FNAME is the very last operation we do

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm a few principles here. First off we should be able to recover from most unexpected state ideally.

It'd actually be a good exercise (LLM assisted even) to look at just deleting files in the storage or in /run state and ensure that at least bootc switch is able to recover.

I didn't dig into this but on the staged state bits ensuring we e.g. log to the journal and continue seems like a good idea instead of just using ?.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again, I think Gemini's suggestion makes sense. We could've failed writing the file in /run which would leave host.staged as None, but entries.staged will still exist, re-triggering this bug.

It'd actually be a good exercise (LLM assisted even) to look at just deleting files in the storage or in /run state and ensure that at least bootc switch is able to recover.

I believe this is true as of now even. I'll incorporate this scenario into one of the tests

I didn't dig into this but on the staged state bits ensuring we e.g. log to the journal and continue seems like a good idea instead of just using ?.

sounds good. I'll update it


// Write these
boot_dir
.create_dir_all(TYPE1_ENT_PATH_STAGED)
Expand Down Expand Up @@ -213,11 +234,9 @@ pub(crate) async fn composefs_rollback(
let rollback_status = host
.status
.rollback
.as_ref()
.ok_or_else(|| anyhow!("No rollback available"))?;

// TODO: Handle staged deployment
// Ostree will drop any staged deployment on rollback but will keep it if it is the first item
// in the new deployment list
let Some(rollback_entry) = &rollback_status.composefs else {
anyhow::bail!("Rollback deployment not a composefs deployment")
};
Expand All @@ -227,7 +246,7 @@ pub(crate) async fn composefs_rollback(
match &rollback_entry.bootloader {
Bootloader::Grub => match rollback_entry.boot_type {
BootType::Bls => {
rollback_composefs_entries(boot_dir, rollback_entry.bootloader.clone())?;
rollback_composefs_entries(&host, boot_dir, rollback_entry.bootloader.clone())?;
}
BootType::Uki => {
rollback_grub_uki_entries(boot_dir)?;
Expand All @@ -236,7 +255,7 @@ pub(crate) async fn composefs_rollback(

Bootloader::Systemd => {
// We use BLS entries for systemd UKI as well
rollback_composefs_entries(boot_dir, rollback_entry.bootloader.clone())?;
rollback_composefs_entries(&host, boot_dir, rollback_entry.bootloader.clone())?;
}

Bootloader::None => unreachable!("Checked at install time"),
Expand Down
Loading