components: add filemap backend for provenance-aware build systems #112

Closed
hanthor wants to merge 1 commit into coreos:main from hanthor:feat/bst-filemap

@hanthor hanthor commented Apr 15, 2026

Closes #110.

Adds a filemap backend: auto-detects usr/lib/chunkah/filemap.json in the rootfs (no CLI flag), analogous to how the RPM backend finds the rpmdb. Intended for BuildStream-based images where xattrs are stripped on OCI export.

How it works: tools/gen-filemap/gen_filemap.py queries bst artifact list-contents after build and emits an exact file→element map; that JSON is baked into the OCI image; chunkah picks it up automatically.

Benchmark (Dakota, 7.4 GiB, 197k files, 120 layers):

| | Time | Components | Coverage | Unclaimed |
|---|---|---|---|---|
| baseline | 16.2 s ±0.75 s | 0 | | 7.4 GiB |
| --component-map (PR #111, 127 rules) | 22.1 s ±1.29 s | 54 | 81% | 274 MiB |
| this PR (846 BST elements) | 17.5 s ±0.20 s | 713 | 99% | 13 MiB |

Priority 5 (below xattr=0, above rpm=10). 58 tests pass.

Assisted-by: Claude Sonnet 4.6

Add a new `filemap` backend that auto-detects an exact file→component
map baked into the rootfs at `usr/lib/chunkah/filemap.json`.  This is
analogous to how the RPM backend auto-detects the rpmdb: no CLI flag is
needed, and no changes are required to individual build element files.

The primary use-case is BuildStream (BST) based images such as GNOME OS
/ Bluefin / Dakota, where:
- xattrs are stripped on OCI export so the xattr backend cannot be used
- The image has no package manager database (RPM/dpkg absent)
- BST knows exactly which element installed which file

A companion script (`tools/gen-filemap/gen_filemap.py`) queries the
BST artifact cache after `bst build oci/layers/<target>.bst` and writes
`files/filemap.json`, which is then baked into the OCI image as a BST
`import` element.

File format:
  {
    "element-name": {
      "interval": "weekly|monthly",
      "files": ["/usr/bin/foo", "/usr/lib/foo.so"]
    }
  }
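
A minimal sketch of how a consumer might read this format and invert it into
a path→component lookup. The sample element names and paths are made up, and
treating `interval` as restricted to `weekly` or `monthly` is an assumption
drawn from the format shown above, not a confirmed rule of the Rust backend.

```python
import json

# Hypothetical sample in the format above; names and paths are illustrative.
SAMPLE = """
{
  "core-bash": {"interval": "monthly", "files": ["/usr/bin/bash"]},
  "bluefin-gnome-shell": {
    "interval": "weekly",
    "files": ["/usr/bin/gnome-shell", "/usr/lib/libshell.so"]
  }
}
"""

# Assumption: only these two values are valid, per "weekly|monthly" above.
ALLOWED_INTERVALS = {"weekly", "monthly"}


def invert_filemap(text: str) -> dict[str, str]:
    """Build a path -> component-name lookup from filemap JSON text."""
    entries = json.loads(text)
    path_to_component: dict[str, str] = {}
    for name, entry in entries.items():
        if entry["interval"] not in ALLOWED_INTERVALS:
            raise ValueError(f"{name}: unexpected interval {entry['interval']!r}")
        for path in entry["files"]:
            path_to_component[path] = name
    return path_to_component


lookup = invert_filemap(SAMPLE)
print(lookup["/usr/bin/bash"])
```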

Benchmark vs 120-layer Dakota image (7.4 GiB, 197 609 files):
  baseline (no map):   16.2 s ±0.75 s   — 0 components
  component-map (127): 22.1 s ±1.29 s   — 54 components, 6.0 GiB covered
  filemap (BST, 846):  17.5 s ±0.20 s   — 713 components, 7.3 GiB covered

The filemap reaches 99% coverage vs 81% for the manual component-map, with
only +8% overhead vs baseline (vs +36% for the prefix-scan pathmap).
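
The percentages can be re-derived from the benchmark figures quoted above.
This is only an arithmetic check; the helper names are hypothetical.

```python
# Figures copied from the benchmark above.
BASELINE_S = 16.2
COMPONENT_MAP_S = 22.1
FILEMAP_S = 17.5
TOTAL_GIB = 7.4


def overhead_pct(seconds: float, baseline: float) -> float:
    """Extra wall time relative to the baseline, in percent."""
    return (seconds - baseline) / baseline * 100


def coverage_pct(covered_gib: float, total_gib: float) -> float:
    """Share of the image claimed by components, in percent."""
    return covered_gib / total_gib * 100


filemap_overhead = round(overhead_pct(FILEMAP_S, BASELINE_S))
component_map_overhead = round(overhead_pct(COMPONENT_MAP_S, BASELINE_S))
filemap_coverage = round(coverage_pct(7.3, TOTAL_GIB))
component_map_coverage = round(coverage_pct(6.0, TOTAL_GIB))

print(filemap_overhead, component_map_overhead)   # 8 36
print(filemap_coverage, component_map_coverage)   # 99 81
```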

Priority 5: lower than RPM (10) so RPM wins if both are present.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new filemap component repository that enables rechunking based on a JSON file mapping components to file paths. It includes a Rust implementation for loading and querying this map, alongside a Python-based tool with a BuildStream adapter to generate the mapping. The review feedback suggests using BTreeMap for deterministic processing, correctly storing and applying the mtime_clamp for reproducibility, ensuring file paths are handled as absolute, and avoiding name collisions in the BuildStream adapter.

Comment thread src/components/filemap.rs
Comment on lines +130 to +131
let entries: HashMap<String, FileMapEntry> =
    serde_json::from_str(content).context("deserializing filemap JSON")?;

high

Using HashMap for deserializing the filemap entries leads to non-deterministic iteration order. If multiple components in the JSON claim the same file, the "winner" (the last one inserted into path_to_component) will be random across different runs. This breaks the reproducibility of the rechunking process. Using BTreeMap ensures a stable processing order based on component names.

Suggested change
- let entries: HashMap<String, FileMapEntry> =
-     serde_json::from_str(content).context("deserializing filemap JSON")?;
+ let entries: std::collections::BTreeMap<String, FileMapEntry> =
+     serde_json::from_str(content).context("deserializing filemap JSON")?;
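
To illustrate the concern, here is a small Python sketch of the last-writer-wins behaviour. Python dicts preserve insertion order, so the varying iteration order of a Rust `HashMap` is emulated here by feeding the same entries in two different orders; the element names and the `build_lookup` helper are hypothetical. Sorting by component name plays the role of `BTreeMap`.

```python
def build_lookup(entries: dict, sort_names: bool) -> dict[str, str]:
    """Map each path to a component; a later component overwrites an earlier one."""
    names = sorted(entries) if sort_names else list(entries)
    lookup: dict[str, str] = {}
    for name in names:
        for path in entries[name]["files"]:
            lookup[path] = name  # last writer wins
    return lookup


# Two components claim the same file (hypothetical names).
order_a = {
    "zlib": {"files": ["/usr/lib/libz.so"]},
    "compat": {"files": ["/usr/lib/libz.so"]},
}
# Same content, opposite iteration order.
order_b = dict(reversed(list(order_a.items())))

unsorted_a = build_lookup(order_a, sort_names=False)
unsorted_b = build_lookup(order_b, sort_names=False)
sorted_a = build_lookup(order_a, sort_names=True)
sorted_b = build_lookup(order_b, sort_names=True)

print(unsorted_a, unsorted_b)  # winners differ with iteration order
print(sorted_a, sorted_b)      # winners agree once names are sorted
```

With sorted names the winner is always the component that sorts last, regardless of how the input was ordered.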

Comment thread src/components/filemap.rs
Comment on lines +93 to +96
pub struct FilemapRepo {
    components: Vec<ComponentMeta>,
    path_to_component: HashMap<Utf8PathBuf, ComponentId>,
}

medium

The FilemapRepo should store the default_mtime_clamp provided during initialization. This ensures that components can return a valid mtime clamp for reproducibility, similar to how the RpmRepo and XattrRepo handle it.

Suggested change
- pub struct FilemapRepo {
-     components: Vec<ComponentMeta>,
-     path_to_component: HashMap<Utf8PathBuf, ComponentId>,
- }
+ pub struct FilemapRepo {
+     components: Vec<ComponentMeta>,
+     path_to_component: HashMap<Utf8PathBuf, ComponentId>,
+     default_mtime_clamp: u64,
+ }

Comment thread src/components/filemap.rs
Comment on lines +115 to +116
let repo = Self::load_from_str(&content, files)
    .with_context(|| format!("parsing {FILEMAP_PATH}"))?;

medium

The default_mtime_clamp passed to load is currently ignored. It should be assigned to the repo instance after loading.

Suggested change
- let repo = Self::load_from_str(&content, files)
-     .with_context(|| format!("parsing {FILEMAP_PATH}"))?;
+ let mut repo = Self::load_from_str(&content, files)
+     .with_context(|| format!("parsing {FILEMAP_PATH}"))?;
+ repo.default_mtime_clamp = _default_mtime_clamp;

Comment thread src/components/filemap.rs
let comp_id = ComponentId(comp_idx);

for path_str in &entry.files {
    let path = Utf8PathBuf::from(path_str);

medium

The Scanner typically produces absolute paths (starting with /). If the JSON contains relative paths, files.contains_key(&path) will fail to match, and the file will remain unclaimed. It is safer to ensure that paths are treated as absolute or validated against the expected format.

                let path = if path_str.starts_with('/') {
                    Utf8PathBuf::from(path_str)
                } else {
                    Utf8PathBuf::from(format!("/{path_str}"))
                };
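
The same normalization, sketched in Python to show its effect on matching. The scanned path set and the `normalize` helper are hypothetical stand-ins for the scanner output and the Rust logic above.

```python
def normalize(path_str: str) -> str:
    """Return the path with a leading slash, adding one if it is missing."""
    return path_str if path_str.startswith("/") else f"/{path_str}"


# Hypothetical scanner output: absolute paths only.
scanned = {"/usr/bin/foo", "/usr/lib/foo.so"}

# A filemap entry that mixes relative and absolute paths.
listed = ["usr/bin/foo", "/usr/lib/foo.so"]

claimed_raw = [p for p in listed if p in scanned]
claimed_normalized = [normalize(p) for p in listed if normalize(p) in scanned]

print(claimed_raw)         # the relative entry is missed
print(claimed_normalized)  # both entries match after normalization
```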

Comment thread src/components/filemap.rs
let meta = &self.components[id.0];
ComponentInfo {
    name: &meta.name,
    mtime_clamp: 0,

medium

The mtime_clamp is currently hardcoded to 0. It should use the default_mtime_clamp stored in the repo to ensure consistency with other backends and support reproducible builds.

Suggested change
- mtime_clamp: 0,
+ mtime_clamp: self.default_mtime_clamp,

@staticmethod
def _component_name(element: str) -> str:
    """Convert 'bluefin/gnome-shell.bst' → 'bluefin-gnome-shell'."""
    return element.replace("/", "-").removesuffix(".bst")

medium

Replacing / with - in BuildStream element names can lead to collisions (e.g., core/bash.bst and core-bash.bst both becoming core-bash). Since JSON keys can safely contain slashes, it is better to preserve the original element path or use a more robust escaping mechanism to avoid overwriting component metadata in the output filemap.

Suggested change
- return element.replace("/", "-").removesuffix(".bst")
+ return element.removesuffix(".bst")
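
A short sketch of the collision described above, using the two naming schemes side by side. The element list and the `find_collisions` helper are hypothetical; `str.removesuffix` requires Python 3.9 or later.

```python
def flattened_name(element: str) -> str:
    """Original scheme: replace '/' with '-' and drop the '.bst' suffix."""
    return element.replace("/", "-").removesuffix(".bst")


def preserved_name(element: str) -> str:
    """Suggested scheme: keep the element path and drop only '.bst'."""
    return element.removesuffix(".bst")


def find_collisions(elements: list[str], namer) -> dict[str, list[str]]:
    """Group elements by generated name and keep names claimed more than once."""
    by_name: dict[str, list[str]] = {}
    for element in elements:
        by_name.setdefault(namer(element), []).append(element)
    return {name: group for name, group in by_name.items() if len(group) > 1}


# Hypothetical element list containing the pair from the review comment.
elements = ["core/bash.bst", "core-bash.bst", "bluefin/gnome-shell.bst"]

flattened = find_collisions(elements, flattened_name)
preserved = find_collisions(elements, preserved_name)

print(flattened)  # {'core-bash': ['core/bash.bst', 'core-bash.bst']}
print(preserved)  # {}
```

A check like this could also run inside the generator to fail early rather than silently overwrite one component's entry.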


hanthor commented Apr 15, 2026

Closing in favour of #113. The filemap backend works well (99% coverage, +1.3 s overhead) but adds a BST-specific JSON schema to chunkah. The cleaner solution is #113: if chunkah's xattr scan falls back to libc, an LD_PRELOAD sidecar can serve user.component from the same file→element data with no chunkah-side schema at all. Will reopen if #113 doesn't gain traction.

@hanthor hanthor closed this Apr 15, 2026

Development

Successfully merging this pull request may close these issues.

Support rechunking BuildStream images
