Resync Rust port to graphify-py v0.8.27 (4b17f19)#16
Conversation
Port the v0.8.25 -> v0.8.27 changes from `graphify-py` and bump the
workspace version to 0.8.27.
Ports:
- detect: anchored `.graphifyignore`/`.gitignore` patterns now match the
anchor-relative path directly (no subtree/basename fallback), so
`/inbox/` no longer leaks into `src/inbox/` (#1087). Sort the walked
file list and each file-type bucket for deterministic output.
- export: cap obsidian/canvas note filenames to 200 UTF-8 bytes with an
8-char SHA-1 suffix on truncation, preventing ENAMETOOLONG on long or
multibyte labels (#1094).
- analyze: add `find_import_cycles`, collapsing the graph to file-level
`imports_from`/`re_exports` edges and reporting simple cycles (#961).
- report: render an `## Import Cycles` section from `find_import_cycles`.
- extract: file-level node IDs follow the `{parent_dir}_{stem}` spec so
AST and semantic nodes coincide (#1033); root-level symbol prefixes are
relativised the same way (#1096); `pnpm-workspace.yaml` `'.'` resolves
to the root without crashing (#1083); TypeScript `interface A extends B`
now emits an `inherits` edge via the `extends_type_clause` branch
(#1095). `extract()` honours `cache_root` as the id-relativisation root.
- llm: load custom OpenAI-compatible providers from `providers.json`,
route `extract`/`call_llm` through them, and consult them in
`detect_backend` after the built-ins (#1084). Add batched community
labelling (`label_communities` / `generate_community_labels`, #1097).
- cli: `provider add|list|show|remove`, a `label` command, and
`--no-label`/`--backend` on `cluster-only`; auto-name communities when
no labels file exists. Replace console em-dash/arrow glyphs with ASCII
for non-UTF-8 terminals (#992).
Divergences from graphify-py:
- The Rust `extract` is a one-shot pipeline that already writes
`GRAPH_REPORT.md` and a placeholder labels file, so the Python
"next: run cluster-only" hint is omitted; run `graphify label` to
LLM-name communities.
- The `GRAPHIFY_DEBUG` traceback hook has no Rust equivalent (extractors
return `FileResult.error` strings, not a stack trace).
Glory to the Omnissiah
Close the gaps the `cargo llvm-cov` pass surfaced in the v0.8.27 resync's new code: - `extract_files_direct` with a custom provider now has a mockito test exercising `extract_custom` (llm/extract.rs 45% -> 72% line). - Community labelling gains an end-to-end test through the public `label_communities` / `generate_community_labels` wrappers (via a mock custom provider), a `quiet=false` degradation path, and a prose-wrapped-JSON parse case (llm/labeling.rs 81% -> 90% line). Test-only; no production code changes. By the will of the Machine God
Apply the two valid findings from the CodeRabbit review and document the two disputed ones. Fixed: - `custom_providers_path` falls back to the local path when `$HOME` is unset, and `load_custom_providers_from` reads identical local/global paths only once, so an unset `$HOME` no longer reads the same file twice (CodeRabbit, providers.rs). - The `provider` command now aborts when `providers.json` exists but is unreadable or not a JSON object, instead of silently overwriting it on `add`/`remove`. graphify-py clobbers a malformed file via a bare `except`; refusing to is a deliberate, safer divergence that prevents data loss (CodeRabbit, src/cli/provider.rs). Disputed (left as-is): - Provider load order `[local, global]` (global wins): this matches graphify-py `_load_custom_providers`, which iterates `(local, global)` and assigns `providers[name] = cfg`. The suggested "local wins" would diverge from the reference. - `load_custom_providers` per call in `extract_files_direct` / `call_llm`: only reached for custom (non-built-in) backends, since `is_builtin_backend` short-circuits built-ins without touching the filesystem. The file is tiny and correctness is unaffected; threading a process-lifetime cache through the parallel corpus path is not worth the added shared state. Tests cover the new branches (identical-path dedup, malformed-registry abort). Ave Deus Mechanicus
Fixed: - Rename the `ts_inheritance` test helper `has_inherits` to `has_edge` since it takes an arbitrary relation argument. - Use `sort_by_cached_key` for the deterministic file sort in `detect::walk` so each path is stringified once, not per comparison. - Annotate the env-mutating provider/labeling tests with `#[serial]` so they are safe under `cargo test` (thread-per-test), not only under nextest (process-per-test). - Make the community-labelling `max_tokens` arithmetic fully saturating. Disputed (left as-is, parity with graphify-py): - `is_included` anchored matching uses a direct fnmatch with no ancestor walk. This is exactly the #1087 change in graphify-py `_is_included`; adding an ancestor walk would revert that fix for the include path. - `cluster-only` regenerates `Community N` placeholders when an existing `.graphify_labels.json` fails to parse. graphify-py does the same, and the labels file is a regenerable cache, not user-authored config (unlike `providers.json`, whose clobber we did guard against). - The `ts_inheritance` per-test boilerplate is left inline for readability (trivial). By the will of the Machine God
Fixed (all trivial): - Atomic registry write in `provider`: write a sibling temp file then rename, so a crash mid-write cannot truncate `providers.json`. - Share one `EnvGuard` across the `graphify-llm` integration tests via `tests/common/mod.rs` instead of duplicating it. - Add a clarifying comment on the `find_import_cycles` edge-orientation fallback. Disputed (left as-is): - `sha1 = "0.11"` does NOT duplicate `digest`: `cargo tree` shows both `sha1 0.11` and `sha2 0.11` resolve to `digest 0.11.3`. Downgrading to `sha1 0.10` would pull in `digest 0.10` and create the duplication CodeRabbit warned about, so the suggestion is inverted. - The per-call `load_custom_providers` in `extract_files_direct` / `call_llm` only runs for custom backends (built-ins short-circuit via `is_builtin_backend` with no filesystem access); threading a cache through the parallel corpus path is not worth the shared state. Glory to the Omnissiah
Replace the post-empty-check index `c.cycle[0]` with a `let else` on `c.cycle.first()` so the non-empty invariant is explicit and there is no direct indexing (CodeRabbit, trivial). Ave Deus Mechanicus
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughAdds file-level import-cycle detection and report rendering; introduces custom OpenAI-compatible LLM providers and batched community labeling (CLI + registry); fixes robust path relativisation and file-node IDs; caps export filenames to avoid ENAMETOOLONG; makes detect deterministic; and adds tests and docs. ChangesMain Functional Changes
Estimated code review effort: Possibly related PRs:
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
Applies the still-valid findings from CodeRabbit's review and documents the verified false positives. Fixed: - `find_import_cycles`: visit start files and neighbours in lexicographic order so the `cap` truncation collects a deterministic set by graph content, not edge-insertion order. - `find_import_cycles_bounded`: return early for `max_cycle_length == 0` so self-loops (length 1) cannot leak past a zero bound, matching Python's `len(cycle) <= max_cycle_length` filter. - Provider registry loader: skip records missing or blanking any of `base_url` / `default_model` / `env_key`, mirroring the `provider add` contract that already requires all three; a half-formed provider can never authenticate or address an endpoint. - `relativise_under_root` and `EnvGuard::new` gain `#[must_use]`. - `file_node_id_spec.rs` tests now return `Result` and propagate with `?` instead of a blanket `expect_used` allow. - `cli_commands.rs`: pass temp paths via `.arg(path)` rather than `to_str().unwrap()`. Added parity coverage: - `re_exports`-only cycle detection (treated identically to `imports_from`). - Zero `max_cycle_length` returns no cycles. - Provider records missing a required field are skipped. Disputed (verified false positives, left unchanged): - cluster-only `--no-label` precedence and malformed-labels overwrite: both match `graphify-py/__main__.py:2417-2448` exactly (existing file wins over `--no-label`; a parse failure falls back to placeholders and the file is rewritten unconditionally). The labels file is a regenerable cache, unlike the user-authored `providers.json`. - `prefix_remap` key normalisation: keys and the `n.source_file` lookup are the identical absolute-path string at that point (relativisation runs later), so switching to `file_node_id` keys would only add cross -file collision risk. - pnpm `.`-package edge assertion: the upstream test asserts only crash safety (no `IndexError`); there is no `main` entry to resolve `my-app` to deterministically. Glory to the Omnissiah
Fixed:
- Custom providers now honour a `max_completion_tokens` field from
`providers.json` on the extraction path (default 8192), mirroring
Python's `_resolve_max_tokens(cfg.get("max_completion_tokens", 8192))`
at `llm.py:720`. Previously the Rust port hardcoded 8192, so a
hand-tuned custom provider budget was silently ignored.
- `build_file_graph`: drop the dead `&& v_file != src_file` guard in the
orientation fallback — the preceding arm already rules out
`v_file == src_file`.
- `EnvGuard` gains a `Default` impl alongside `new`.
Added parity coverage for `max_completion_tokens` parse + default.
Disputed (verified false positives, left unchanged):
- `sha1` / `sha2` digest-major split: `cargo tree` shows both crates at
`0.11.0` resolving to a single `digest v0.11.3`; there is no split and
`util.rs`/`json.rs` each use their own crate's identical `Digest`.
- Caching `load_custom_providers()` in a process-global `LazyLock`: would
capture `$HOME` at first access and make the `#[serial]` env-mutating
tests order-dependent under a shared-process runner. The redundant read
is bounded to custom (non-built-in) backends and keeps a freshly written
`providers.json` visible within a long-running process.
By the will of the Machine God
Fixed: - `DetectResult.files` is now an `IndexMap` instead of a `HashMap`. The buckets are built in the fixed order `code, document, paper, image, video` (matching Python's insertion-ordered dict), but `HashMap` discarded that order. The order flows through `collect_extract_files` into `extract()` and thus into `graph.json` node order (which `to_json` does not re-sort) and the manifest, so the previous type made those outputs nondeterministic run-to-run and divergent from Python. The reconstruction path in `detect_result_from_incremental` and `persist_manifest` are updated to match; the latter now passes the map straight through (no clone-collect). - `anchored_file_not_matched_at_depth` gains a positive assertion that `/build` still matches `build` at the repo root, alongside the existing negative `src/build` case. Disputed (verified false positives, left unchanged): - Submodule pointer bump to `4b17f19`: intentional — it is the subject of this resync, documented in the first commit and the PR body. - Caching `load_custom_providers()` in a process-global static: re-raised from the prior round; same rationale (captures `$HOME` at first access, makes `#[serial]` env-mutating tests order-dependent under a shared -process runner, redundant read bounded to custom backends only). - Extracting a shared setup helper across the `ts_inheritance` tests: a stylistic test-DRY nitpick, not a defect; the explicit per-test setup keeps each case readable and isolated. Ave Deus Mechanicus
Fixed: - Import-cycles report renderer no longer clones the cycle vector to close the path; it joins the files and appends the start file inline. Disputed (verified false positives, left unchanged): - `--temperature` flag on `provider add`: Python's `provider add` hardcodes `"temperature": 0` (`__main__.py:1877`) and exposes no such flag, so the Rust port already matches; adding the flag would diverge from the reference CLI surface. A custom `temperature` is still honoured when present in a hand-edited `providers.json`. - `sha1` 0.10 downgrade: the workspace pins `sha2 = "0.11"`, so within `graphify-export` both `sha1` and `sha2` resolve to a single `digest v0.11.3`. Downgrading `sha1` to 0.10 would pair it with the 0.10 digest while `sha2` stays on 0.11 — creating the split the finding aims to remove. The 0.10 `sha*`/`digest` entries in `Cargo.lock` come from unrelated transitive deps. - Caching `load_custom_providers()` and `prefix_remap` keying: re-raised from prior rounds; same verified rationale (process-global cache breaks `#[serial]` test isolation; `source_file` is the verbatim input-path string at remap time, with relativisation running strictly afterward). Glory to the Omnissiah
Fixed: - `#[must_use]` on the pure `node_label_map` / `god_node_ids` helpers. Disputed (verified false positives, left unchanged): - Making `provider add`'s `--base-url` / `--default-model` / `--env-key` clap-required: the handler already validates them (provider.rs:110) and returns the byte-identical error message Python uses (`__main__.py:1863`), matching the reference's optional-parse + validate pattern. The loader's skip-incomplete guard is defense-in-depth for hand-edited registries. - Merging curated labels when `generate_community_labels` degrades: the `else` branch only runs when no labels file exists (nothing to lose) or `--force-relabel` was passed. Python behaves identically (`__main__.py:2424-2448`): `label` / force-relabel is a deliberate regenerate that degrades to placeholders by design, and the labels file is a regenerable cache. Merging would make `--force-relabel` not fully relabel. By the will of the Machine God
These were dismissed in earlier rounds either on an unverified assumption (finding 1) or on "matches graphify-py" grounds (findings 2-3), which the project rules reject: feature parity is the bar, not bug parity. 1. pnpm `.`-package import resolution (#1083 follow-up). The resolver canonicalises the entry path (on macOS `/tmp` -> `/private/tmp`), so the resolved `imports_from` target id (`make_id1` of the canonical path) did not match the input-path key in `id_remap` and the edge dangled off the relativised file node. `id_remap` now also maps each input file's canonicalised spelling to its node id, so the resolved edge connects. The parity test now asserts the import resolves to a REAL node id, not just that extraction produced nodes. 2. `cluster-only --no-label` was silently ignored when a labels file already existed, because the load-existing branch was checked first. The `--no-label` branch now runs first, so the flag always yields `Community N` placeholders as documented. 3. `cluster-only` silently overwrote a malformed/unreadable `.graphify_labels.json` with placeholders, destroying hand-curated edits. The read is now fallible (`read_existing_labels`): on failure it warns, falls back to placeholders for the run, and leaves the file on disk untouched. This is a deliberate divergence from graphify-py (`__main__.py:2418-2448`), which clobbers on parse failure. Added cluster-only parity tests: `--no-label` overrides curated names; malformed labels file is preserved. Ave Deus Mechanicus
Fixed: - `file_node_id_spec` root-leak guard now normalises the temp-dir name with `make_id1` (the same routine ids use) instead of only lowercasing and replacing `-`, so it detects a leak regardless of which non-alphanumeric characters the temp dir name contains. - Custom-provider `max_completion_tokens` accepts a whole-number JSON float (e.g. `8192.0`) in addition to an integer; negative / non-finite / out-of-range values fall back to the 8192 default instead of being silently dropped or wrapping. - USAGE.md custom-providers section now documents that the registry is read from both `~/.graphify/providers.json` (global) and `.graphify/providers.json` (project-local), that the global entry wins on a name clash, and that an entry needs a non-empty base_url/default_model/env_key plus the optional `max_completion_tokens` cap. Added coverage: float + negative `max_completion_tokens` parsing. Disputed (engineering cost/benefit, not parity): - Threading a pre-resolved provider from the corpus boundary to avoid the per-chunk `load_custom_providers()` read: the read is already gated to custom (non-built-in) backends via `&&` short-circuit, so the common path has zero overhead, and for custom backends a sub-kilobyte JSON read is negligible next to each chunk's LLM HTTP round-trip. Threading an owned provider through `CorpusConfig`/`AdaptiveRetryCtx`/retry into a new non-public extract variant (the public `extract_files_direct*` API and its tests can't change) is disproportionate plumbing for that bounded gain. Glory to the Omnissiah
Doc/comment polish only: - Add `// SAFETY:` comments to the `EnvGuard` env-mutating `unsafe` blocks (`set`/`unset`/`Drop`) stating the `#[serial]`-execution invariant that rules out concurrent process-environment access. - Make the `detect_backend_with` doc comment a complete sentence. By the will of the Machine God
Fixed:
- `is_included` anchored patterns now also match subtree descendants
(`/src` includes `src/main.py`), mirroring `could_contain_included_path`
and normal allowlist semantics. This is the inverse of the anchored
*ignore* precision fix (#1087): an ignore pattern must not leak into a
subtree, but an include directory is meant to pull its whole subtree in.
`is_included` is a public helper not yet wired into the walk (the walk
rescues via `could_contain_included_path`, and `.graphifyinclude` is a
loaded-but-unused stub in graphify-py), so this corrects the helper with
no change to walk behaviour.
- Community-label JSON sanitizer always slices the first `{` … last `}`
span, even when the reply already starts with `{`, so a leading-JSON +
trailing-prose reply parses instead of degrading to placeholders.
Diverges from graphify-py (`llm.py:1308`), which keeps that data loss.
Added coverage: anchored include root/subtree/depth + anchored-file +
unanchored-at-depth; label reply with trailing prose.
Disputed (engineering judgment, not parity):
- Extracting a shared `tempdir/canonicalize/write/extract` helper across the
`ts_inheritance` tests: a stylistic test-DRY change with no correctness
impact. The explicit per-test setup keeps each case self-contained and
readable; collapsing it trades that for indirection. Per the repo's
surgical-changes guidance, leaving working tests untouched.
Ave Deus Mechanicus
The headline fix corrects a regression I introduced two commits earlier. - `cluster-only --no-label` no longer clobbers a hand-curated `.graphify_labels.json`. My earlier "finding 2" change reordered the `--no-label` branch ahead of the load-existing branch so the flag always produced `Community N` placeholders and rewrote the file — silently destroying curated labels. That was wrong: an existing labels file already means no LLM call, so `--no-label` should be a harmless no-op there. The load-existing branch is restored to first position, so `--no-label` preserves curated labels and only yields placeholders when no file exists. The malformed-file guard (warn + don't overwrite) is kept. Also fixed: - `god_node_ids` magic `20` → named `GOD_NODE_CAP` constant. - `max_completion_tokens_from` doc now states it accepts any finite non-negative float, truncated toward zero (`8192.9` → `8192`), matching the implementation. Test changes: replaced the test that asserted `--no-label` wipes curated labels with one asserting it PRESERVES them, plus a test that `--no-label` with no file yields placeholders. Disputed (verified false positives): - "Print updated vs added in `provider add`": Python `__main__.py:1880` always prints `'added'` (upsert, no distinction). Adding an "updated" message would diverge from the reference, not match it. - "Use `NamedTempFile` in `save_registry`": Python writes the registry non-atomically (`write_text`); the current same-dir temp + rename already exceeds that. `provider` is an interactive, non-concurrent command, so a unique temp + new runtime dep is unwarranted. - Caching the per-chunk `load_custom_providers()` read: re-raised; same engineering rationale (gated to custom backends, dwarfed by the LLM call, threading through the parallel pipeline is disproportionate). By the will of the Machine God
Fixed:
- `is_included` anchored subtree check now uses a byte-level prefix test
instead of allocating a `"{p}/"` String per (pattern, path).
- Document the security rationale for global-over-local provider precedence
in `load_custom_providers_from` (see disputes).
Disputed (verified, non-parity reasons):
- Flip provider precedence to local-wins: declined for security. A
project-local `.graphify/providers.json` can come from a cloned/untrusted
repo; letting it shadow a globally-defined provider name would let the repo
redirect that backend's `base_url` to an attacker endpoint and exfiltrate
the API key (resolved from the user's env via `env_key`). global-wins
blocks that hijack; built-in names are already un-shadowable. Comment added
so the precedence reads as deliberate.
- Extract a `remap_if_different` helper in `multi.rs`: the two id_remap
insertions have intentionally different overwrite semantics (the input-path
id uses `insert` so it wins; the canonical spelling uses `or_insert` so it
never displaces a real input-path mapping). One helper would change
behavior on a make_id collision or need a flag that adds noise.
- Validate `base_url` shape in `provider add`: Python only checks non-empty
(`__main__.py:1862`); the URL is validated by the SSRF guard at request
time, so early `Url::parse` would add a runtime dep for marginal gain.
- Cache the per-chunk `load_custom_providers()` read: re-raised; same
engineering rationale as prior rounds.
Glory to the Omnissiah
Both findings were valid and are fixed:
- `detect_result_from_incremental` now seeds all canonical file-type buckets
in fixed order (even when empty) before merging the incremental results, so
a reconstructed `DetectResult` is structurally identical to a fresh
`detect` walk. Extracted the canonical kind list into a shared
`graphify_detect::FILE_TYPE_KINDS` constant so the walk and the CLI
reconstruction use one source of truth.
- `is_included` anchored subtree matching now handles globbed directory
stems. This matcher's `*` does not cross `/` (gitignore semantics), so
`/src*` previously matched `src1` but not `src1/deep.py`; the subtree check
now falls back to `{p}/**` for globbed stems while keeping the zero-alloc
byte-prefix fast path for literal stems. Converges with graphify-py, whose
`fnmatch`-based `_is_included` (`*` crosses `/`) already pulls such
descendants in.
Added coverage: anchored globbed include dir matches its subtree but not an
unrelated directory.
Ave Deus Mechanicus
All trivial cleanups: - Drop the unused `clippy::float_cmp` / `unsafe_code` allows from `labeling.rs` (the file has no float comparisons or `unsafe`); keep only `expect_used`. - Extract the dense anchored-include predicate into a named `anchored_include_matches` helper for readability. - `save_registry` serializes the `Map` directly instead of cloning it into a `Value::Object`. Disputed (already satisfied): - Add `#[must_use]` to `cap_filename`: it already carries the attribute (`export/util.rs:134`); no change needed. By the will of the Machine God
Fixes five findings and documents one dispute. - Add `EnvGuard::scrub_backends` to `graphify-llm`'s test `common` module and route the duplicated backend-env-unset loops in `labeling.rs` and `provider_registry.rs` through it. The canonical list now includes `AWS_CONTAINER_CREDENTIALS_FULL_URI` (read by `credentials_appear_configured` but previously omitted from the scrub), so a host with that variable set can no longer make `detect_backend` pick `bedrock` and pollute the no-backend path. - Enforce the mock endpoint in `label_communities_real_path_via_custom_provider` with an explicit `mock.assert()` rather than relying on the label assertions alone. - Add an `extract_files` helper to `ts_inheritance.rs` and collapse the repeated tempdir/canonicalize/write/extract setup across all seven inheritance tests onto it. - Reconcile the README LLM-backend list with the `--backend` identifiers (`Claude (Anthropic)`, `Kimi (Moonshot)`) and spell out the identifier names so readers know what to pass. Disputed (not changed): CodeRabbit suggested guarding `provider add` against overwriting an existing provider with a prompt or `--force` flag. graphify-py's `provider add` upserts unconditionally by design (add-or-update); a prompt would diverge from that behaviour, break non-interactive scripted usage, and fail the existing parity tests. No underlying bug, so left as-is. Glory to the Omnissiah
Fixes four findings and documents two disputes.
- Centralise the backend-selection env var list behind a new
`graphify_llm::backend_selection_env_vars()`. It is built from the
per-backend `ENV_KEY` constants, the new `bedrock::CREDENTIAL_ENV_VARS`,
and `ollama::BASE_URL_ENV` (the same names `detect_backend` reads), and
`credentials_appear_configured` now derives from that const too. The
test helper `EnvGuard::scrub_backends` calls the shared function, so the
scrub set can no longer drift from the detection logic.
- Simplify the glob-char scan in `anchored_include_matches` to
`p.contains('*') || p.contains('?')`.
- Rename the `ids` test helper in `file_node_id_spec.rs` to `node_ids` so
the per-test `let ids = ...` bindings no longer shadow it.
- Remove the orphaned temp file in `provider::save_registry` when the
atomic rename fails, preserving the original error.
Disputed (not changed): CodeRabbit flagged `load_custom_providers()` in
`call_llm` and `extract_files_direct_mode` as repeated disk I/O and
suggested a process-lifetime cache. Both calls are gated behind
`!is_builtin_backend(backend)`, so built-in backends never touch disk; only
the custom-provider path reads the (small) registry. A global OnceLock cache
would freeze the first-loaded registry for the whole process, breaking the
`#[serial]` custom-provider tests that each point `HOME` at a different
`providers.json`, and would introduce global mutable state the workspace
guidelines steer away from. The marginal I/O does not justify that risk.
By the will of the Machine God
Fixes two findings and re-documents two standing disputes.
- Rewrite the literal-stem subtree test in `anchored_include_matches`
as `rel_str.strip_prefix(p).is_some_and(|rest| rest.starts_with('/'))`,
replacing the manual length/index/`starts_with` checks. Same zero-alloc
behaviour, clearer intent.
- Swap the `seen` dedup set in `label_communities`'s name builder from
`IndexSet` to `std::collections::HashSet`. The set is membership-only;
`names` (a `Vec`) carries the observable order, so insertion-order
tracking was never needed.
Disputed again (not changed): CodeRabbit re-flagged `load_custom_providers()`
in `call_llm` and `extract_files_direct_mode` as repeated disk I/O. Both
calls are gated behind `!is_builtin_backend(backend)`, so only the
custom-provider path ever reads the (small) registry. A process-lifetime
`OnceLock` cache would freeze the first-loaded registry and introduce the
global mutable state the workspace guidelines steer away from, while the
alternative (threading a preloaded map through the public `call_llm` /
`extract_files_direct` API and the per-chunk retry context) is an invasive
signature change across crates for a trivial perf nit. The cost optimised
away is a few KB of JSON per chunk, custom-provider-only. Left as-is
pending owner direction.
Ave Deus Mechanicus
All four findings this round are no-change determinations. Disputed (approved won't-fix): CodeRabbit flagged `load_custom_providers()` as repeated disk I/O in three call sites this pass: - `call_llm` (call.rs) and `extract_files_direct_mode` (extract.rs) are gated behind `!is_builtin_backend`, so only the custom-provider path reads the small registry. - `detect_backend` (backends.rs) loads it ~once per run for auto-selection. A process-lifetime `OnceLock` cache would add global mutable state (against the workspace guideline) and stale `serve`/`watch`; threading a preloaded map is an invasive multi-crate API change for a trivial perf nit. Project owner approved keeping these as documented disputes. False positive: CodeRabbit suggested adding `#![allow(clippy::expect_used, clippy::unwrap_used)]` to `file_node_id_spec.rs`. That file uses `?` with a `TestResult` return type and contains zero `.expect()`/`.unwrap()` calls, so the suppression would silence lints that never fire. Clippy already passes on the file. Not added. By the will of the Machine God
Adds one regression test; both findings this round are disputes.
- Add `extract_swift_same_file_class_and_extension_keeps_canonical` to
lock in that a single Swift file declaring both `class Foo` and
`extension Foo` collapses to exactly one canonical node.
Disputed (false positive): CodeRabbit claimed `merge_swift_extensions`
over-marks same-file class+extension pairs because it matches extension
nodes by `(source_file, label)` instead of the node's own kind. It cannot:
tree-sitter-swift parses both as `class_declaration` and both derive the
same `{stem}_{label}` id, so the extractor's `seen_ids` collapses them to a
single node before postprocessing ever runs. There is only ever one node
per `(source_file, label)`, so the label match is equivalent to graphify-py's
nid-based `swift_extensions` tracking. The new test proves the canonical
node is preserved.
Disputed (parity): CodeRabbit wanted `provider add` to write
`"temperature": 0.0` instead of `0`. graphify-py writes integer `0`
(`__main__.py`), and byte-identical registry output is a resync goal;
`serde_json::to_string_pretty` matches it, and the loader reads the integer
back via `as_f64()`, so there is no functional difference. Changing it would
diverge the Rust registry from the Python one. Left as integer `0`.
Glory to the Omnissiah
Fixes three findings. - Correct the `scrub_backends` doc comment: it unsets only the built-in backend-selection vars from `backend_selection_env_vars()` and does NOT clear custom-provider `env_key`s (those are registry-defined, so a test relying on one must handle it explicitly). - Short-circuit `find_import_cycles_bounded` when `top_n == 0`, mirroring the existing `max_cycle_length == 0` guard, so a zero-results request skips `build_file_graph`. Add a `top_n == 0` parity test alongside the zero-max-length one. - Add descriptive failure messages to the `has_edge` assertions in the TypeScript inheritance tests so a failure names the missing edge. By the will of the Machine God
Both findings this round are disputes.
Disputed (over-abstraction): CodeRabbit suggested making the `has_edge`
test helper in `ts_inheritance.rs` generic over a map-like type. It is a
private, single-file helper that only ever receives `ExtractOutput.edges`
(`Vec<IndexMap<String, Value>>`); a generic signature adds an abstraction
no caller needs (against the workspace's "no abstractions for single-use
code" guideline), and the suggested `serde_json::Map` signature would not
even match the concrete `IndexMap` field type. Left concrete.
Disputed (parity, owner-approved): CodeRabbit wanted `provider add` to
upsert and preserve hand-edited `temperature` / `max_completion_tokens`
fields rather than replace the whole entry. graphify-py's `add` replaces
the entry outright (`existing[name] = {...}`), and byte-identical registry
output is a resync goal. The project owner chose to keep the replace
behaviour for parity. No test in either codebase pins re-add semantics.
Ave Deus Mechanicus
The headline change completes the `.graphifyinclude` allowlist, which `graphify-py` defines but never wires into `detect()` (dead code in the reference). The Rust port already used `could_contain_included_path` for directory pruning yet never rescued files at classification time, so the allowlist was inert end-to-end. Completing a feature the reference leaves inert is a deliberate divergence approved by the project owner (feature parity, not bug parity), and is documented in `USAGE.md`. - `classify_one` now keeps an ignored file when it (or an ancestor directory) matches `.graphifyinclude`. The sensitive-file guard still runs after the rescue, so an allowlist can never pull secrets into the corpus. - `could_contain_included_path` now mirrors `anchored_include_matches`, so anchored stems (`/src`, `/src*`) keep the walker descending through the whole subtree instead of stopping one level deep. Adds an end-to-end `detect` regression for the rescue. - `detect_result_from_incremental` sorts each bucket so incremental reconstruction is byte-identical to a fresh `detect` walk. - The Swift extension-merge parity tests now assert the survivor is the canonical class (by `source_file`) carrying both the class and merged extension methods, not just a node count. - The `provider.rs` temp-rename comment no longer claims unconditional atomicity; overwrite and atomicity semantics are platform- and filesystem-dependent. By the will of the Machine God
All three findings this round are previously-approved disputes verified against the current code; no changes warranted. - `crates/graphify-llm/src/call.rs` and `crates/graphify-llm/src/extract.rs`: CodeRabbit again flags the per-call `load_custom_providers()` disk read and asks for a process-wide cache. Approved won't-fix: a `OnceLock` / `LazyLock` global breaks the `#[serial]` HOME-mutating test isolation (each test points `HOME` at a different `providers.json`), and threading a resolved registry through every call site is disproportionate for a path taken only by custom (non-built-in) backends. - `src/cli/provider.rs`: CodeRabbit asks for a `--temperature` flag on `provider add`. Disputed on parity grounds: graphify-py `__main__.py` (lines 1872-1878) hardcodes `"temperature": 0` and parses no `--temperature` argument, so the Rust `add` is byte-identical to the reference. Adding a flag would exceed the feature-parity bar. Glory to the Omnissiah
All three findings this round are disputes verified against the current
code and the graphify-py reference; no changes warranted.
- `crates/graphify-export/Cargo.toml`: the recurring sha1/sha2 digest
"version split" is a false positive. `cargo tree` shows `sha1 v0.11.0`
and `sha2 v0.11.0` resolving through a single `digest v0.11.3`; the
workspace pins `sha2 = "0.11"`, so `sha1 = "0.11"` already unifies the
trait family. Downgrading sha1 would *create* the split.
- `crates/graphify-export/src/canvas.rs`: `cap_filename` is parity-matched.
graphify-py `export.py` (`safe_name` -> `_cap_filename(..., limit=200)`
then `f"{base}_{count}"`) caps before dedup exactly as Rust does, and the
200-byte cap reserves 55 bytes under the 255-byte limit for `.md` plus
the dedup suffix. Overflow would need >1e51 collisions; re-ordering the
cap would diverge from the reference for no real-world gain.
- `crates/graphify-analyze/src/cycles.rs`: the `top_n * 10` enumeration cap
matches graphify-py `analyze.py` (#961), which caps `nx.simple_cycles`
output at `top_n * 10` before `cycles.sort(key=len)`. Both suggested
changes (sort-then-limit, or length-prioritised enumeration) would defeat
the documented combinatorial-explosion guard and break byte-identical
parity with the reference.
Ave Deus Mechanicus
Twenty-third CodeRabbit round: one fix, one dispute. - `crates/graphify-llm/tests/provider_registry.rs`: add `scrub_backends()` to the `call_llm` and `extract_files_direct` custom-provider routing tests. Both were already hermetic (an explicit non-builtin backend routes straight through the custom path and never consults built-in env vars), but scrubbing makes the isolation explicit, matches the sibling detection test's setup, and guards against a future refactor that grew a built-in fallback. `CUSTOM*_KEY` is a registry env_key and is untouched by the scrub. Disputed (no change): - `src/cli/cluster_only.rs`: CodeRabbit asked to pretty-print `.graphify_labels.json`. Declined: `extract.rs` writes the same file with compact `serde_json::to_string`, and graphify-py `__main__.py` (line 2448) writes it single-line via `json.dumps` without `indent`. Pretty-printing would break both internal consistency and byte-level parity with the reference. By the will of the Machine God
Twenty-fourth CodeRabbit round: three fixes, four disputes. Fixes: - `crates/graphify-llm/tests/labeling.rs`: add `scrub_backends()` to the custom-provider labeling test, matching the routing tests hardened in the previous round (hermetic, explicit-backend setup). - `crates/graphify-llm/tests/provider_registry.rs`: drop the now-redundant `unsafe_code` from the crate-level allow. CodeRabbit flagged the asymmetry with `labeling.rs`; the correct direction is removal. The earlier round replaced this file's inline `std::env::remove_var` scrub loop with the safe `EnvGuard::scrub_backends()` call, so no direct `unsafe` remains here (the `unsafe` blocks live in `tests/common/mod.rs`, which carries its own inner allow). Clippy confirms it compiles clean. - `src/cli/cluster_only.rs`: annotate the bare `false` argument to `generate_community_labels` as `// quiet`. Disputed (no change): - `crates/graphify-extract/tests/ts_inheritance.rs`: a single-use `EdgeMap` type alias for `IndexMap<String, Value>` is over-abstraction; the concrete type matches the file's style. - `crates/graphify-export/Cargo.toml`: the workspace MSRV is already 1.95 (`rust-version = "1.95"`), well above `sha1 = "0.11"`'s 1.85 floor. - `crates/graphify-analyze/src/cycles.rs`: the requested explanatory comment on the edge-orientation fallback already exists (lines 139-143). - `src/cli/cluster_only.rs`: pretty-printing or conditionally skipping the `.graphify_labels.json` write both diverge from graphify-py `__main__.py:2448`, which writes single-line via `json.dumps` and rewrites unconditionally. Glory to the Omnissiah
Twenty-fifth CodeRabbit round: one fix, five disputes.
Fix:
- `crates/graphify-detect/tests/parity_include.rs`: add a regression test
for an anchored `?`-glob include stem (`/src?`), exercising the
`p.contains('?')` branch of `anchored_include_matches` that the existing
`/src*` test left uncovered.
Disputed (no change):
- `crates/graphify-detect/src/walk.rs`: `FileType` has exactly five variants
and `as_str()` is an exhaustive match onto exactly `FILE_TYPE_KINDS`, so
bucket insertion can only ever hit a pre-seeded key. A runtime assertion
would guard a compile-time-impossible state.
- `crates/graphify-llm/src/call.rs` and `.../extract.rs`: the per-call
`load_custom_providers()` read is the standing approved won't-fix (a global
cache breaks `#[serial]` HOME-mutating test isolation).
- `crates/graphify-llm/src/bedrock.rs`: the closure is correct and
clippy-clean. `CREDENTIAL_ENV_VARS: &[&str]` means `.iter()` yields `&&str`,
so the `filter` predicate receives `&&&str` and `**k` is the `&str`.
CodeRabbit's `|&k|` (binding `&&str`) would not match the `&str` literal
patterns.
- `crates/graphify-export/Cargo.toml`: the recurring sha1/sha2 "digest split"
is a false positive. `cargo tree` shows `sha1 v0.11.0` and `sha2 v0.11.0`
resolving through one `digest v0.11.3`; the workspace pins `sha2 = "0.11"`.
Ave Deus Mechanicus
Address two CodeRabbit findings on the `.graphifyinclude` rescue and the `provider add` pricing inputs. `ConvertCtx::record` re-checked `is_ignored` on the converted sidecar path without honouring the source's allowlist verdict. A source rescued by `.graphifyinclude` but ignored by a broad pattern such as `*` had its office/Google-Workspace sidecar silently dropped, so the very content the rescue was meant to keep never reached the corpus. `record` now takes the source path and bypasses the sidecar ignore check when `is_included(src)` holds, mirroring the `Direct` path in `classify_one`. The allowlist is keyed on source paths, so the verdict is taken from the source rather than the derived `.md` path. An ordinary (non-rescued) source still has its sidecar filtered through `.graphifyignore`, matching graphify-py `detect()`. `provider add` accepted non-finite `--pricing-input` / `--pricing-output` values. `serde_json` serializes `NaN`/`+-Inf` to `null`, which the loader reads back as the `0.0` default, so an invalid price was silently lost rather than rejected. `add` now validates both values with `f64::is_finite` and returns an error before writing the registry. Both branches are covered: a docx rescued past a `*` ignore whose sidecar must survive, and a `--pricing-input nan` add that must fail without creating a registry file. Glory to the Omnissiah
7f5a5c9 to
8089979
Compare
Resyncs the Rust workspace from
graphify-pyv0.8.25 → v0.8.27 (4b17f19),porting 17 upstream commits and bumping the workspace version to 0.8.27.
Ported
.graphifyignore/.gitignorepatterns match theanchor-relative path directly, no subtree leak (#1087); deterministic
file-list + per-bucket sort.
8-char SHA-1 suffix on truncation (#1094); ASCII backup messages.
find_import_cycles(file-level cycle detection, #961).## Import Cyclessection.{parent_dir}_{stem}file-node-id spec (#1033) + root-levelsymbol prefix relativisation (#1096); pnpm-workspace
'.'(#1083); TSinterface extendsviaextends_type_clause(#1095).extract()nowhonours
cache_rootas the id-relativisation root.providers.json(load +
detect_backendtail +extract/call_llmrouting, #1084);batched community labelling (#1097).
provider add|list|show|remove, alabelcommand, and--no-label/--backendoncluster-only; console em-dash/arrow → ASCII(#992).
Divergences from graphify-py
extractis a one-shot pipeline (already writesGRAPH_REPORT.mdand aplaceholder labels file), so the Python "next: run cluster-only" hint is
omitted — run
graphify labelto LLM-name communities.GRAPHIFY_DEBUGtraceback hook has no Rust equivalent (extractors returnFileResult.errorstrings).providercommand refuses to overwrite a malformedproviders.json(graphify-py clobbers it); a deliberate, safer choice.
Verification
cargo clippy --all-targets --all-features --workspace— cleancargo nextest run --workspace— 1651 passed, 2 skipped (net_)hk check— cleanCodeRabbit
Four review rounds; substantive findings (anchored-include semantics,
#[serial]on env tests, malformed-registry clobber,max_tokensoverflow,atomic registry write, deterministic-sort key,
EnvGuarddedup) all fixed ordocumented as parity disputes in the commit history. The final pass hit the
org's CodeRabbit usage limit; remaining items were the documented disputes
(global-wins provider load order — matches graphify-py; per-chunk provider
load — custom-backends only;
sha1 0.11vsdigest— verified no duplication).Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests
Chores