components: add filemap backend for provenance-aware build systems #112
hanthor wants to merge 1 commit into coreos:main
Add a new `filemap` backend that auto-detects an exact file→component
map baked into the rootfs at `usr/lib/chunkah/filemap.json`. This is
analogous to how the RPM backend auto-detects the rpmdb: no CLI flag is
needed, and no changes are required to individual build element files.
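A minimal sketch of what such auto-detection could look like, using only the standard library. The helper names `filemap_path` and `detect_filemap` are hypothetical and are not taken from the PR; only the relative path comes from the text above.

```rust
use std::path::{Path, PathBuf};

/// Location of the map relative to the rootfs, as given in the commit message.
const FILEMAP_PATH: &str = "usr/lib/chunkah/filemap.json";

/// Hypothetical helper: where the filemap would live under `rootfs`.
fn filemap_path(rootfs: &Path) -> PathBuf {
    rootfs.join(FILEMAP_PATH)
}

/// Hypothetical helper: returns `Some(path)` only when the file is present,
/// so a backend could enable itself without any CLI flag.
fn detect_filemap(rootfs: &Path) -> Option<PathBuf> {
    let candidate = filemap_path(rootfs);
    candidate.is_file().then_some(candidate)
}
```

A caller would try `detect_filemap` on the unpacked rootfs and fall through to the other backends when it returns `None`.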
The primary use-case is BuildStream (BST) based images such as GNOME OS
/ Bluefin / Dakota, where:
- xattrs are stripped on OCI export so the xattr backend cannot be used
- The image has no package manager database (RPM/dpkg absent)
- BST knows exactly which element installed which file
A companion script (`tools/gen-filemap/gen_filemap.py`) queries the
BST artifact cache after `bst build oci/layers/<target>.bst` and writes
`files/filemap.json`, which is then baked into the OCI image as a BST
`import` element.
File format:
    {
      "element-name": {
        "interval": "weekly|monthly",
        "files": ["/usr/bin/foo", "/usr/lib/foo.so"]
      }
    }
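To illustrate how this component→files layout might be turned into a per-path lookup, here is a rough sketch. It builds the example entry by hand instead of parsing JSON so that it stays within the standard library; the struct is a simplified stand-in and `invert` / `example_entries` are hypothetical names, not code from the PR.

```rust
use std::collections::{BTreeMap, HashMap};

/// Simplified stand-in for one entry of the format above.
#[allow(dead_code)]
struct FileMapEntry {
    interval: String,
    files: Vec<String>,
}

/// Invert component -> files into file -> component for per-path lookups.
fn invert(entries: &BTreeMap<String, FileMapEntry>) -> HashMap<String, String> {
    let mut path_to_component = HashMap::new();
    for (name, entry) in entries {
        for file in &entry.files {
            path_to_component.insert(file.clone(), name.clone());
        }
    }
    path_to_component
}

/// The example entry from the format description, constructed by hand.
fn example_entries() -> BTreeMap<String, FileMapEntry> {
    let mut entries = BTreeMap::new();
    entries.insert(
        "element-name".to_string(),
        FileMapEntry {
            interval: "weekly".to_string(),
            files: vec!["/usr/bin/foo".to_string(), "/usr/lib/foo.so".to_string()],
        },
    );
    entries
}
```

With the inverted map, looking up `/usr/bin/foo` yields `element-name`, and a path that no element lists stays unclaimed.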
Benchmark vs 120-layer Dakota image (7.4 GiB, 197 609 files):
    baseline (no map):     16.2 s ±0.75 s    0 components
    component-map (127):   22.1 s ±1.29 s   54 components, 6.0 GiB covered
    filemap (BST, 846):    17.5 s ±0.20 s  713 components, 7.3 GiB covered
The filemap reaches 99% coverage, against 81% for the manual component-map,
and adds only +8% overhead over baseline, against +36% for the
component-map's prefix scan.
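The percentages follow from the benchmark rows. As a quick check, the arithmetic can be written out as below; the function names are made up for this illustration.

```rust
/// Relative slowdown of a run over the baseline, in percent.
fn overhead_pct(baseline_s: f64, run_s: f64) -> f64 {
    (run_s - baseline_s) / baseline_s * 100.0
}

/// Share of the image attributed to components, in percent.
fn coverage_pct(covered_gib: f64, total_gib: f64) -> f64 {
    covered_gib / total_gib * 100.0
}
```

Plugging in the figures: `overhead_pct(16.2, 17.5)` is about 8.0 and `overhead_pct(16.2, 22.1)` about 36.4; `coverage_pct(7.3, 7.4)` is about 98.6 and `coverage_pct(6.0, 7.4)` about 81.1, which round to the quoted 8%, 36%, 99% and 81%.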
Priority 5: lower than RPM (10) so RPM wins if both are present.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Code Review
This pull request introduces a new filemap component repository that enables rechunking based on a JSON file mapping components to file paths. It includes a Rust implementation for loading and querying this map, alongside a Python-based tool with a BuildStream adapter to generate the mapping. The review feedback suggests using BTreeMap for deterministic processing, correctly storing and applying the mtime_clamp for reproducibility, ensuring file paths are handled as absolute, and avoiding name collisions in the BuildStream adapter.
    let entries: HashMap<String, FileMapEntry> =
        serde_json::from_str(content).context("deserializing filemap JSON")?;
Using HashMap for deserializing the filemap entries leads to non-deterministic iteration order. If multiple components in the JSON claim the same file, the "winner" (the last one inserted into path_to_component) will be random across different runs. This breaks the reproducibility of the rechunking process. Using BTreeMap ensures a stable processing order based on component names.
Suggested change:
-    let entries: HashMap<String, FileMapEntry> =
-        serde_json::from_str(content).context("deserializing filemap JSON")?;
+    let entries: std::collections::BTreeMap<String, FileMapEntry> =
+        serde_json::from_str(content).context("deserializing filemap JSON")?;
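To make the ordering argument concrete, here is a small sketch of "last writer wins" resolution over a `BTreeMap`. The component names and the `resolve_claims` / `overlapping_entries` helpers are hypothetical; the sketch only shows that sorted iteration fixes the winner, and is not the PR's loader.

```rust
use std::collections::{BTreeMap, HashMap};

/// Resolve overlapping claims with "last writer wins", visiting components
/// in sorted-name order so the winner is the same on every run.
fn resolve_claims(entries: &BTreeMap<String, Vec<String>>) -> HashMap<String, String> {
    let mut path_to_component = HashMap::new();
    // BTreeMap iterates in ascending key order; HashMap gives no such guarantee.
    for (component, files) in entries {
        for file in files {
            path_to_component.insert(file.clone(), component.clone());
        }
    }
    path_to_component
}

/// Two hypothetical components that both claim /usr/bin/foo.
fn overlapping_entries() -> BTreeMap<String, Vec<String>> {
    let mut entries = BTreeMap::new();
    entries.insert("zlib".to_string(), vec!["/usr/bin/foo".to_string()]);
    entries.insert("bash".to_string(), vec!["/usr/bin/foo".to_string()]);
    entries
}
```

Here `bash` is visited before `zlib` regardless of insertion order, so `zlib` always ends up owning `/usr/bin/foo`. Whether the alphabetically last claimant is the desired winner is a separate policy question.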
    pub struct FilemapRepo {
        components: Vec<ComponentMeta>,
        path_to_component: HashMap<Utf8PathBuf, ComponentId>,
    }
The FilemapRepo should store the default_mtime_clamp provided during initialization. This ensures that components can return a valid mtime clamp for reproducibility, similar to how the RpmRepo and XattrRepo handle it.
Suggested change:
     pub struct FilemapRepo {
         components: Vec<ComponentMeta>,
         path_to_component: HashMap<Utf8PathBuf, ComponentId>,
+        default_mtime_clamp: u64,
     }
    let repo = Self::load_from_str(&content, files)
        .with_context(|| format!("parsing {FILEMAP_PATH}"))?;
The default_mtime_clamp passed to load is currently ignored. It should be assigned to the repo instance after loading.
Suggested change:
-    let repo = Self::load_from_str(&content, files)
+    let mut repo = Self::load_from_str(&content, files)
         .with_context(|| format!("parsing {FILEMAP_PATH}"))?;
+    repo.default_mtime_clamp = _default_mtime_clamp;
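A stripped-down sketch of the idea behind the two mtime-clamp comments: keep the clamp on the repo and report it per component. The types below are simplified stand-ins with the same names as in the diff, not chunkah's real definitions, and passing the clamp through a constructor (rather than assigning it after loading) is just one possible shape.

```rust
/// Simplified stand-in for the repo; the real struct holds more fields.
struct FilemapRepo {
    components: Vec<String>,
    default_mtime_clamp: u64,
}

/// Simplified stand-in for the per-component info handed to callers.
struct ComponentInfo<'a> {
    name: &'a str,
    mtime_clamp: u64,
}

impl FilemapRepo {
    /// Keep the clamp passed in at load time instead of discarding it.
    fn load(components: Vec<String>, default_mtime_clamp: u64) -> Self {
        Self { components, default_mtime_clamp }
    }

    /// Report the stored clamp rather than a hard-coded 0.
    fn component_info(&self, idx: usize) -> ComponentInfo<'_> {
        ComponentInfo {
            name: &self.components[idx],
            mtime_clamp: self.default_mtime_clamp,
        }
    }
}
```

With this shape, every component reports the clamp the caller supplied at load time.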
    let comp_id = ComponentId(comp_idx);

    for path_str in &entry.files {
        let path = Utf8PathBuf::from(path_str);
The Scanner typically produces absolute paths (starting with /). If the JSON contains relative paths, files.contains_key(&path) will fail to match, and the file will remain unclaimed. It is safer to ensure that paths are treated as absolute or validated against the expected format.
Suggested change:
    let path = if path_str.starts_with('/') {
        Utf8PathBuf::from(path_str)
    } else {
        Utf8PathBuf::from(format!("/{path_str}"))
    };

    let meta = &self.components[id.0];
    ComponentInfo {
        name: &meta.name,
        mtime_clamp: 0,
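Returning to the relative-path comment, the following sketch shows why an unnormalized path goes unclaimed when the scanned paths are absolute. It uses plain `String` paths instead of `Utf8PathBuf` to stay within the standard library, and `normalize` / `count_claimed` are hypothetical helpers.

```rust
use std::collections::HashSet;

/// Treat every filemap path as absolute by prepending '/' when it is missing.
fn normalize(path_str: &str) -> String {
    if path_str.starts_with('/') {
        path_str.to_string()
    } else {
        format!("/{}", path_str)
    }
}

/// Count how many filemap paths match the scanned (absolute) paths,
/// with or without normalization applied first.
fn count_claimed(scanned: &HashSet<String>, filemap_paths: &[&str], fix: bool) -> usize {
    filemap_paths
        .iter()
        .map(|p| if fix { normalize(p) } else { p.to_string() })
        .filter(|p| scanned.contains(p))
        .count()
}
```

Given a scan containing `/usr/bin/foo`, the filemap entry `usr/bin/foo` matches nothing as written and matches once after normalization.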
    @staticmethod
    def _component_name(element: str) -> str:
        """Convert 'bluefin/gnome-shell.bst' → 'bluefin-gnome-shell'."""
        return element.replace("/", "-").removesuffix(".bst")
Replacing / with - in BuildStream element names can lead to collisions (e.g., core/bash.bst and core-bash.bst both becoming core-bash). Since JSON keys can safely contain slashes, it is better to preserve the original element path or use a more robust escaping mechanism to avoid overwriting component metadata in the output filemap.
Suggested change:
-        return element.replace("/", "-").removesuffix(".bst")
+        return element.removesuffix(".bst")
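The collision is easy to reproduce. The sketch below re-expresses the two naming rules in Rust purely for illustration (the adapter itself is Python); `flattened_name` and `preserved_name` are hypothetical names for the current and the suggested behaviour.

```rust
/// Current rule, rendered in Rust: '/' becomes '-', then ".bst" is dropped.
fn flattened_name(element: &str) -> String {
    let replaced = element.replace('/', "-");
    replaced.strip_suffix(".bst").unwrap_or(&replaced).to_string()
}

/// Suggested rule: only drop ".bst" and keep the slashes.
fn preserved_name(element: &str) -> String {
    element.strip_suffix(".bst").unwrap_or(element).to_string()
}
```

Under the first rule both `core/bash.bst` and `core-bash.bst` map to `core-bash`, so one entry would overwrite the other in the output JSON. Under the second they stay distinct as `core/bash` and `core-bash`.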
Closing in favour of #113. The filemap backend works well (99% coverage, +1.3 s overhead) but adds a BST-specific JSON schema to chunkah. The cleaner solution is #113: if chunkah's xattr scan falls back to libc, an LD_PRELOAD sidecar can serve
Closes #110.
Adds a `filemap` backend: auto-detects `usr/lib/chunkah/filemap.json` in the rootfs (no CLI flag), analogous to how the RPM backend finds the rpmdb. Intended for BuildStream-based images where xattrs are stripped on OCI export.
How it works: `tools/gen-filemap/gen_filemap.py` queries `bst artifact list-contents` after build and emits an exact file→element map; that JSON is baked into the OCI image; chunkah picks it up automatically.
Benchmark (Dakota 7.4 GiB, 197k files, 120 layers):
`--component-map` (PR #111, 127 rules)
Priority 5 (below xattr=0, above rpm=10). 58 tests pass.
Assisted-by: Claude Sonnet 4.6