mkinitcpio: Produce reproducible initramfs images #1

Merged

Conversation

Contributor

esotericnonsense commented Sep 5, 2019

We achieve this by stripping timestamps from within the filesystem,
and by using a pipeline to strip inodes from the cpio archive.

It functions for at least the 'gzip', 'xz', 'bzip2', 'lz4' and 'cat'
compressors. The 'lzop' compressor embeds a runtime timestamp.

Motivation: https://reproducible-builds.org

Signed-off-by: Daniel Edgecumbe <git@esotericnonsense.com>

@grazzolini grazzolini self-requested a review Sep 6, 2019
Contributor Author

esotericnonsense commented Sep 7, 2019

It may make sense to hold off on this for now, as I think a SOURCE_DATE_EPOCH-based solution[1] may be better: it could allow us to instantiate pacman with the variable set and thus automatically use reproducible builds where necessary.

I'm in the process of adjusting the archiso releng scripts and jamming the --reproducible flag in everywhere is quite clunky.

[1] https://reproducible-builds.org/docs/source-date-epoch/

@esotericnonsense esotericnonsense force-pushed the esotericnonsense:esotericnonsense/reproducible branch 2 times, most recently from f7d011c to 0ad05b4 Sep 7, 2019
Contributor Author

esotericnonsense commented Sep 7, 2019

OK, @grazzolini this should be good for review now.

The change here means that neither linux.preset nor every invocation of mkinitcpio needs to be hacked, as long as the env variable is set at some point (e.g. when invoking archiso build scripts).

It's slightly cheeky to use SOURCE_DATE_EPOCH to toggle the cpio --reproducible flag.

However, I can't see any reason not to use cpio --reproducible in the general case, other than the additional dependency. I don't think anyone will be negatively affected or surprised by this change.

@esotericnonsense esotericnonsense changed the title mkinitcpio: Add --reproducible flag mkinitcpio: Use SOURCE_DATE_EPOCH for reproducible builds Sep 7, 2019
Member

falconindy commented Sep 7, 2019

FWIW, I originally chose bsdcpio over cpio because everyone on Arch already has libarchive installed. cpio is an added dependency that will need to be bumped up from [extra] to [core].

Would be great if libarchive could add support for this sort of thing, but I don't expect any progress on that front (libarchive/libarchive#975 for one of the problems).

Contributor Author

esotericnonsense commented Sep 7, 2019

For the most part this was just a lazy fix. I can investigate the suggested solution there actually - sounds like using an interim step to filter out the inodes could work.

Member

grazzolini commented Sep 8, 2019

@esotericnonsense, I think the code itself is nice, but I think adding cpio as a dependency will probably never fly, as @falconindy mentioned. To be honest, that was the only red flag I saw. I didn't test it yet, but I'll test this over this weekend.

Contributor Author

esotericnonsense commented Sep 8, 2019

I've had some limited success using a double bsdtar step (as stated in the issue @falconindy linked) to strip the inode numbers. It produces an initramfs that boots fine etc and works with just libarchive.

For some reason though, at the moment it ends up producing a 90MiB cpio that decompresses to 27MiB. Tinkering continues.

Contributor Author

esotericnonsense commented Sep 9, 2019

Right, OK, I've cracked it. Needed a --norecurse flag on bsdtar to prevent it from adding directories multiple times (because the find step would specify them multiple times).
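
The double-bsdtar approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the function name `make_newc` and its arguments are hypothetical, and it assumes GNU find and sort plus libarchive's bsdtar. The first bsdtar stage reads a sorted, NUL-separated file list and writes a tar stream, which carries no inode numbers; the second stage re-encodes that stream as a newc cpio archive.

```bash
#!/usr/bin/env bash
# Sketch only: make_newc is a hypothetical name, not a function from mkinitcpio.
# Assumes GNU find/sort and libarchive's bsdtar are installed.
make_newc() {
    local root=$1 out=$2
    (
        cd "$root" || exit 1
        # List every path exactly once, relative to the root, in a stable order.
        find . -mindepth 1 -printf '%P\0' |
            LC_ALL=C sort -z |
            # Stage 1: tar stream with fixed ownership. -n (--norecurse) keeps
            # bsdtar from re-adding directory contents that find already listed.
            LANG=C bsdtar --uid 0 --gid 0 --null -cnf - -T - |
            # Stage 2: read the tar stream from stdin (@-) and write newc cpio.
            LANG=C bsdtar -cf - --format=newc @-
    ) > "$out"
}
```

Under these assumptions, something like `make_newc /path/to/buildroot image.cpio` would write an uncompressed archive. File timestamps are not touched here and would still need to be normalised separately.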

mkinitcpio Outdated
cpio_bin="bsdcpio"
fi
cpio_opts=('--null' '-cf-' '--format=newc')
# TODO (( _optquiet )) && cpio_opts+=('--quiet')

esotericnonsense Sep 9, 2019

Author Contributor

RFC. On my box, bsdtar is silent anyway.
I suppose we want this to append -v if not present?

Contributor Author

esotericnonsense commented Sep 9, 2019

Need to adjust the PIPESTATUS/pipesave bits as the pipeline has more steps now

Contributor Author

esotericnonsense commented Sep 9, 2019

PIPESTATUS and verbosity fixed, this should be good to go now with just libarchive required.
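
With more stages in the pipeline, a failure in any one of them has to be noticed, not only a failure in the last. A minimal bash sketch of the general technique follows; `first_failed_stage` and `demo_pipeline` are hypothetical names and this is not the code from the PR.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from mkinitcpio.

# Print the 1-based index of the first non-zero exit status, or 0 if all are zero.
first_failed_stage() {
    local i=1 rc
    for rc in "$@"; do
        if (( rc != 0 )); then
            echo "$i"
            return 0
        fi
        i=$(( i + 1 ))
    done
    echo 0
}

demo_pipeline() {
    local -a pipesave
    true | false | true
    # PIPESTATUS is overwritten by the next command, so copy it immediately.
    pipesave=("${PIPESTATUS[@]}")
    first_failed_stage "${pipesave[@]}"
}
```

Calling `demo_pipeline` prints 2, because `false` is the second stage. Every extra stage adds one more element to the saved array, which is why the status checks have to be revisited whenever the pipeline grows.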

Member

falconindy commented Sep 9, 2019

  1. you've still got a check for requiring cpio.
  2. I see no reason to make this non-default behavior. Two possibilities. The easy route of assuming SOURCE_DATE_EPOCH=0, or the slightly less easy route of assuming SOURCE_DATE_EPOCH=$(time_of_mkinitcpio_build)
  3. Please don't warn on using lzop. Instead, this feature should be documented in the manpage in a new section on reproducibility. mkinitcpio's support for compression is fully arbitrary and the user could pass something that isn't quite lzop, but eventually uses lzop.

This all should be squashed into a single commit.

Contributor Author

esotericnonsense commented Sep 9, 2019

  1. ack, amusingly I caught this just before you commented. :)

  2. SOURCE_DATE_EPOCH=$(time_of_mkinitcpio_build) is the way to go.

  3. ack.

I'll work on those and squash.

@esotericnonsense esotericnonsense force-pushed the esotericnonsense:esotericnonsense/reproducible branch from 8624010 to d679029 Sep 9, 2019
Contributor Author

esotericnonsense commented Sep 9, 2019

Cleaned up and squashed. Your comments should be addressed now.

Turns out gzip doesn't require -n when reading from stdin, so we can skip that as well.

It's actually a lot cleaner assuming we can default this behaviour.

I'm testing the functionality at the moment using a full archiso build.

@@ -100,6 +100,19 @@ Options
*-z, \--compress* 'compress'::
Override the compression method with the 'compress' program.

Environment
-----------
The following environment variables influence the program behavior:

falconindy Sep 9, 2019

Member

But the whole point was to make this enabled by default, without the need to set SOURCE_DATE_EPOCH.

esotericnonsense Sep 9, 2019

Author Contributor

As per your earlier comment:

The easy route of assuming SOURCE_DATE_EPOCH=0, or the slightly less easy route of assuming SOURCE_DATE_EPOCH=$(time_of_mkinitcpio_build)

The former can produce reproducible builds by default. If we go with that approach, we don't really need SOURCE_DATE_EPOCH at all. If no-one cares about timestamps in initramfs (I don't know of a reason to care about them) then we can just do that.

If we want the default to be the time of the mkinitcpio build (the latest commit has this behaviour), then the default behaviour is not a reproducible build by definition, right? If you run it twice, you'll have different embedded timestamps.

esotericnonsense Sep 9, 2019

Author Contributor

To distill this: is there any reason to care about timestamps at all here? If not, let's just touch -hcd @1 or whatever, strip all references to S_D_E and be done with it.

falconindy Sep 9, 2019

Member

That's what I was trying to get at, but I wrote this hastily and my intent wasn't clear. I suggested something like the build date of mkinitcpio rather than 0 for two reasons:

  1. lack of user surprise. This is wishy washy, but some people might be oddly concerned about timestamps from 1970.
  2. encoding the build timestamp might give a hint as to what version of mkinitcpio created the archive. However, that's rather redundant -- we already have a VERSION file in the cpio.

I guess I really don't care what the value is -- I just don't see the point of making this some sort of opt-in feature. If we can make the initramfs reproducible in the common case without side effects, I see no reason not to.

So yes, IMO, strip everything related to SOURCE_DATE_EPOCH and document the fact that mkinitcpio tries to make reproducible archives by default. You can call out compression as something that might make the archive not reproducible and mention "known good recipes".

esotericnonsense Sep 9, 2019

Author Contributor

Gotcha. I hadn't realised you meant the build date of mkinitcpio itself rather than the runtime date. Probably just a misunderstanding on my part.

Member

eli-schwartz commented Sep 9, 2019

On the topic of "default reproducible or not", note that makepkg sets up the variable SOURCE_DATE_EPOCH in early runtime, if it does not already exist (using $(date +%s)), then simply makes use of that variable when doing internal accounting or touching files for bsdtar -c to consume.

In the case of makepkg, we want to push the variable into the environment for build systems to use and unify on, so I don't know if that's directly applicable here. Is there anything else which needs the variable set for reproducibility in mkinitcpio (other than that one bsdcpio invocation)?

(The idea is that if you run makepkg without SOURCE_DATE_EPOCH set, you can consult the embedded builddate and reproduce the package by, now, setting SOURCE_DATE_EPOCH.)
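
The makepkg behaviour described here amounts to a default-if-unset pattern. A minimal sketch, assuming bash; it is not makepkg's actual source, and `ensure_source_date_epoch` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch only: ensure_source_date_epoch is a hypothetical helper name.
ensure_source_date_epoch() {
    # Keep a caller-supplied value; otherwise fall back to the current time.
    if [[ -z ${SOURCE_DATE_EPOCH:-} ]]; then
        SOURCE_DATE_EPOCH=$(date +%s)
    fi
    # Export so that child build tools see the same value.
    export SOURCE_DATE_EPOCH
}
```

Recording the chosen value alongside the build output is what would let a later run reproduce the result by setting the variable explicitly.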

Contributor Author

esotericnonsense commented Sep 9, 2019

@eli-schwartz nothing actually needs the SOURCE_DATE_EPOCH variable itself here. The bsdcpio invocation doesn't need it - mkinitcpio itself is modifying the timestamps based on its presence.

libarchive doesn't mention it at all (bsdtar, bsdcpio, ...).
GNU gzip has one mention of S_D_E, in dfltcc.c, where it is used to disable hardware compression on a specific platform; that is not relevant for us.

@esotericnonsense esotericnonsense force-pushed the esotericnonsense:esotericnonsense/reproducible branch from d679029 to ca8f13e Sep 9, 2019
@esotericnonsense esotericnonsense changed the title mkinitcpio: Use SOURCE_DATE_EPOCH for reproducible builds mkinitcpio: Produce reproducible initramfs images Sep 9, 2019
Contributor Author

esotericnonsense commented Sep 9, 2019

Right.

The latest commit sets all timestamps to 0 / 1970-01-01 and includes a comment on reproducibility in the manpage.

All references to SOURCE_DATE_EPOCH have been stripped.
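
Setting every timestamp in the build root to the epoch can be sketched as below. This is an illustration, not the code merged here: `zero_timestamps` is a hypothetical name, and GNU find and touch are assumed.

```bash
#!/usr/bin/env bash
# Sketch only: zero_timestamps is a hypothetical name. Assumes GNU find and touch.
zero_timestamps() {
    local root=$1
    # -h: act on symlinks themselves rather than their targets
    # -c: never create a file that does not exist
    # -d @0: set the time to the Unix epoch (1970-01-01 00:00:00 UTC)
    find "$root" -mindepth 1 -exec touch -hcd @0 {} +
}
```

Since the build root is a scratch directory that is thrown away afterwards, rewriting the timestamps in place has no lasting effect on the installed system.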

I don't think it's necessary to futz around with the build date of mkinitcpio (as @falconindy states, the VERSION file is there anyway). Similarly, unless someone explicitly expresses a desire to change the timestamps within the file I don't think we need to support that as an option.

I'm now testing this again with a full archiso build, but it works on my machine as is. There may be additional sources of irreproducibility across different machines; we can solve those when we find them.

Member

eli-schwartz commented Sep 9, 2019

Okay, great, if we don't actually need any time-based modifications other than to totally suppress file timestamps in bsdcpio then that makes things a lot simpler. If they don't serve a purpose then they... don't serve a purpose. :D

Member

grazzolini left a comment

I have done a few tests and this works like a charm. It works even when changing compression options. Of course, depending on the compression used, some options might affect reproducibility. I have not tested all options exhaustively, but the ones I did test (compression level and some others) didn't affect it.
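
A check of this kind can be scripted by running the same command twice and comparing checksums of its output. The helper below is a generic sketch, not something from mkinitcpio: `same_output` is a hypothetical name, it assumes `sha256sum` from coreutils, and it only looks at data written to stdout.

```bash
#!/usr/bin/env bash
# Sketch only: same_output is a hypothetical helper, not part of mkinitcpio.
# Runs the given command twice and succeeds only if both runs write
# byte-identical data to stdout.
same_output() {
    local first second
    first=$("$@" | sha256sum) || return 2
    second=$("$@" | sha256sum) || return 2
    [[ $first == "$second" ]]
}
```

For example, `same_output sh -c 'cat image.cpio | gzip'` (with `image.cpio` standing in for any existing file) would show whether a given compressor and option set produces identical bytes on repeated runs on one machine. It says nothing about differences between machines.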

@grazzolini grazzolini merged commit 74d5acf into archlinux:master Sep 14, 2019