Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coreos: update aleph version handling #585

Merged
merged 3 commits into from
Dec 18, 2023

Conversation

dustymabe
Copy link
Member

In osbuild/osbuild#1475 we are updating
aleph version to have more information and some of the fields
are changing slightly. Notably here imgid is no longer going to
be populated and build -> version now. Let's make these tweaks
and also just drop any fields from the definition here that we
aren't using to tolerate changes better.

@dustymabe
Copy link
Member Author

I'm sure someone that knows rust better than I do can show me a much more elegant way to implement the 2nd commit here. What I have right now is kind of ugly IMO.

@dustymabe dustymabe force-pushed the dusty-aleph-versiono-updates branch 2 times, most recently from c87a979 to b3cf258 Compare December 16, 2023 05:10
Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so, the only (non-informational) use of the aleph data is for adoption on very old coreos systems. Think about a case of updating from e.g. RHCOS 4.3 (RHEL 8.x) all the way to a modern version. We still need to parse the old aleph file for that; we can't just hard switch.

Probably simplest is to change every field to Option<>.

Or...honestly (and this relates to a larger topic of bootupd versioning) we could just extract the btime of the aleph file...or failing that, hmm we could probably rely on the btime of /boot or /sysroot/ostree. Basically our use of the aleph file here is just to have a "pretty" version but...who cares, we can just use timestamp-as-version.

@dustymabe
Copy link
Member Author

OK so, the only (non-informational) use of the aleph data is for adoption on very old coreos systems. Think about a case of updating from e.g. RHCOS 4.3 (RHEL 8.x) all the way to a modern version. We still need to parse the old aleph file for that; we can't just hard switch.

I agree that we can't hard switch. I thought the changes I was doing here were backwards compatible so thank you for pointing out that they aren't! I thought by making it:

pub(crate) struct Aleph {      
    #[serde(alias = "build")]  
    pub(crate) version: String,
}                              

Would mean that if build: was in the file then it would pick up that as version, but if version was in there (the new files) then it would use that. Am I missing something?

Probably simplest is to change every field to Option<>.

I removed all the other values from the struct Aleph because we don't really use them here in bootupd that I can tell. Do you think that will cause errors? I tested running bootupctl status on current FCOS, 4.15 RHCOS, and also OSBuild built FCOS via deploy-via-container and ostree deployed and it didn't seem to complain that there are fields it didn't know about.

Or...honestly (and this relates to a larger topic of bootupd versioning) we could just extract the btime of the aleph file...or failing that, hmm we could probably rely on the btime of /boot or /sysroot/ostree. Basically our use of the aleph file here is just to have a "pretty" version but...who cares, we can just use timestamp-as-version.

I kind of like having the version information, but I guess we could cut it out. Either way I'm mostly trying to not break things when we switch to OSBuild.

@dustymabe
Copy link
Member Author

I realized we were not failing if the file didn't exist before and my new code was failing if it didn't exist. Updated to not fail if no file exists.

I feel like the second commit is still really ugly and there is probably a much more elegant way to say "read this file, or the file this symlink points to if it is a symlink".

@cgwalters
Copy link
Member

Would mean that if build: was in the file then it would pick up that as version, but if version was in there (the new files) then it would use that. Am I missing something?

Ah sorry...yes, I think that will work, however what would increase my confidence here a lot is to leave the existing unit test unchanged to prove we continue to parse the old format, and add a new unit test instead.

@cgwalters
Copy link
Member

Let's canonicalize and find the actual file before dropping into
sysroot.open_file_optional() because the underlying code for that uses
O_NOFOLLOW and will give an ELOOP (too many symbolic links) error.

I think that's a deficiency of the openat crate, the cap-std library (which we want to port to) shouldn't behave that way. I'm good with just a TODO here like:
// TODO: When porting to cap-std we should be able to just automatically follow the link here

src/coreos.rs Outdated Show resolved Hide resolved
@dustymabe
Copy link
Member Author

Would mean that if build: was in the file then it would pick up that as version, but if version was in there (the new files) then it would use that. Am I missing something?

Ah sorry...yes, I think that will work, however what would increase my confidence here a lot is to leave the existing unit test unchanged to prove we continue to parse the old format, and add a new unit test instead.

Good idea! Done!

@dustymabe
Copy link
Member Author

Changes LGTM.

I'm trying to figure out the CI failure now.

@cgwalters
Copy link
Member

Ah yes, "no components are adoptable" means we messed up something with the aleph detection for "this is a coreos system".

@cgwalters
Copy link
Member

Ah I see the error is

[2023-12-18T15:42:12.526Z] # error: duplicate field version at line 3 column 11

And that's because the old aleph data has both build and imgid:

{
	"build": "44.81.202004260825-0",
	"ref": "2062bce64e4932160feb58ce4976a885172d3f1017dc01f09177504bd55e035b",
	"ostree-commit": "2062bce64e4932160feb58ce4976a885172d3f1017dc01f09177504bd55e035b",
	"imgid": "rhcos-44.81.202004260825-0-qemu.x86_64.qcow2"
}

@cgwalters
Copy link
Member

For my and others' future reference, a handy guestfish interaction session to get this data from an old qcow2:

$ guestfish --ro -a rhcos-4.4.3-x86_64-qemu.x86_64.qcow2
><fs> run
><fs> debug sh "blockdev --getsize /dev/sda4"
32503775
><fs> debug sh "echo 0 32471007 linear /dev/sda4 32768 | dmsetup create noluks-root"
><fs> mount /dev/mapper/noluks-root /
><fs> cat /.coreos-aleph-version.json 
{
	"build": "44.81.202004260825-0",
	"ref": "2062bce64e4932160feb58ce4976a885172d3f1017dc01f09177504bd55e035b",
	"ostree-commit": "2062bce64e4932160feb58ce4976a885172d3f1017dc01f09177504bd55e035b",
	"imgid": "rhcos-44.81.202004260825-0-qemu.x86_64.qcow2"
}

><fs> 

@dustymabe
Copy link
Member Author

Ah I see the error is

[2023-12-18T15:42:12.526Z] # error: duplicate field version at line 3 column 11

And that's because the old aleph data has both build and imgid:

{
	"build": "44.81.202004260825-0",
	"ref": "2062bce64e4932160feb58ce4976a885172d3f1017dc01f09177504bd55e035b",
	"ostree-commit": "2062bce64e4932160feb58ce4976a885172d3f1017dc01f09177504bd55e035b",
	"imgid": "rhcos-44.81.202004260825-0-qemu.x86_64.qcow2"
}

How is that different from our current test_parse_old_aleph() test?

@dustymabe
Copy link
Member Author

IIUC the error is happening when bootupctl status is run:

bootupctl status | tee out.txt

@dustymabe

This comment was marked as outdated.

@dustymabe

This comment was marked as outdated.

@dustymabe
Copy link
Member Author

oh actually I think I might know the issue here:

coreos/coreos-assembler@c2d37f4

I was trying to be helpful with that but I think it caused a problem.

@cgwalters
Copy link
Member

Ah. Yes, that commit created an aleph format which is neither the "v1" nor "v2"; probably should just revert.

@dustymabe
Copy link
Member Author

oh actually I think I might know the issue here:

coreos/coreos-assembler@c2d37f4

I was trying to be helpful with that but I think it caused a problem.

An easy solution to the problem is to revert that commit and scratch the builds that we started today since we haven't released them yet.

dustymabe added a commit to coreos/coreos-assembler that referenced this pull request Dec 18, 2023
This reverts commit c2d37f4.

We found an issue with this and need to revert it for now:
coreos/bootupd#585 (comment)
dustymabe and others added 3 commits December 18, 2023 17:28
In osbuild/osbuild#1475 we are updating
aleph version to have more information and some of the fields
are changing slightly. Notably here `imgid` is no longer going to
be populated and `build` -> `version` now. Let's make these tweaks
and also just drop any fields from the definition here that we
aren't using to tolerate changes better.
Once osbuild/osbuild#1475 gets picked up and
starts getting used the .coreos-aleph-version.json will be a symlink.
Let's canonicalize and find the actual file before dropping into
sysroot.open_file_optional() because the underlying code for that uses
O_NOFOLLOW and will give an ELOOP (too many symbolic links) error.
This makes it amenable to unit testing, and also since the
previous code was always operating on `/` there's no good
reason to use the `openat` APIs which don't follow symlinks
by default, which breaks ergonomics.
@cgwalters cgwalters merged commit c1bf5e7 into coreos:main Dec 18, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants