Skip to content

refactor(cli): single anchored walk seam — library globsets are dead and files are double-stat'd after #489 #495

@dekobon

Description

@dekobon

Summary

#489 moved include/exclude filtering to the CLI walk seam:
expand_seed_paths (big-code-analysis-cli/src/lib.rs) now walks each
directory seed with ignore::WalkBuilder, filters each candidate at the
walk root via walk_seed::match_path_for + WalkFilters::passes, and
passes the resulting flat file list to the library with
empty include/exclude globsets (FilesData { include: GlobSet::empty(), exclude: GlobSet::empty() }). This fixed the
path-form bugs (#488/#489) but left two rough edges worth a
follow-up.

1. Dead library globsets + double per-file stat

The library explore() (src/concurrent_files.rs) still runs
WalkDir::new(path) on every entry it is handed and still applies its
own globset filter (Filters::matches). With a pre-expanded file list
and empty globsets, that filter is a guaranteed no-op and each file is
re-stated a second time (once by the ignore walk in
expand_seed_paths, once by the library WalkDir). Not a full second
tree traversal (single-file WalkDir yields one entry), but O(N)
redundant stats and two filtering mechanisms for one job.

2. Latent re-introduction of the #488 path-form bug

The empty-globset hand-off is load-bearing and undocumented at the
FilesData construction site beyond a comment: the library globset
matcher matches against the walker's emitted (possibly absolute) path
form — exactly the path-form dependence #488 removed. Any future code
path that injects files into FilesData.paths after expand_seed_paths
(or sets non-empty library globsets expecting them to apply) silently
re-inherits the flood-the-offender-set bug. [check.exclude] (#493) is
the existing instance of a filter matching the emitted path instead of
the anchored form.

Suggested direction

Consolidate to a single anchored walk seam: have the library walk accept
an already-resolved file set as terminal (no re-walk, no globset params),
or push the anchored match_path_for filtering into the one library
walk so the emitted path is canonical root-relative once and every
consumer (include/exclude, [check.exclude] #493, baseline keys) matches
that single invariant — removing the per-filter/per-mechanism anchoring
(reanchor_seed for emission + match_path_for for matching + the dead
library globsets).

Acceptance

  • One walk + one filtering mechanism; the library globset params are
    either removed or genuinely used.
  • No O(N) redundant per-file stat.
  • A new path-consuming filter cannot silently re-inherit the #488
    path-form dependence.

Found during the final /code-review of the gate-consolidation epic
(#482). Introduced by the #488/#489 rework; not a correctness bug
today (the CLI pre-filters correctly), but an architecture/perf
follow-up. Related: #493.


Status (updated)

Scope: 2.0-deferred (API reshape). Investigation showed the
structural acceptance criteria need a SemVer-breaking FilesData reshape
and/or moving gitignore-aware walking into the library, and that
canonical emission conflicts with the #488 PyO3-parity contract — see the
analysis comment below. The 1.x-safe part (documenting the path-form
invariant at the public API surface, item 2) landed in 997720dc. The
structural single-seam consolidation remains open as a 2.0 item.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions