fix(sync): preserve canonical markdown in single-file sync #746

phernandez merged 4 commits into main from
Conversation
Signed-off-by: phernandez <paul@basicmachines.co>
Claude finished @phernandez's task in 2m 39s — PR Review: fix(sync): preserve canonical markdown in single-file sync
Summary

This PR adds a dedicated single-file markdown sync primitive (

Code Quality & Standards

Testing & Documentation

Architecture

Issues Found

1. Redundant
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ecaf247168
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
content=file_content,
mtime=file_metadata.modified_at.timestamp(),
ctime=file_metadata.created_at.timestamp(),
initial_checksum = await compute_checksum(initial_markdown_content)
```
Compute single-file checksum from file bytes
Using compute_checksum(initial_markdown_content) hashes the decoded text, not the actual file bytes. For markdown files with CRLF (or any encoding/newline normalization), this produces a different checksum than FileService.compute_checksum(path), so sync_one_markdown_file can persist a checksum that does not match disk when no frontmatter rewrite occurs. That breaks checksum-based move/modify detection and can cause repeated unnecessary sync work.
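The divergence is easy to demonstrate in isolation. A minimal sketch of the mismatch (the `checksum_from_*` helpers below are illustrative, not the project's `compute_checksum`):

```python
import hashlib


def checksum_from_text(text: str) -> str:
    # Hashes the decoded string: newline normalization changes the digest.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def checksum_from_bytes(raw: bytes) -> str:
    # Hashes exactly what is on disk, matching a path-based checksum.
    return hashlib.sha256(raw).hexdigest()


raw = b"# Title\r\nBody\r\n"  # CRLF file as stored on disk
text = raw.decode("utf-8").replace("\r\n", "\n")  # typical normalized text read

assert checksum_from_bytes(raw) != checksum_from_text(text)
```

Hashing the raw bytes is what keeps the stored checksum in agreement with any later path-based rehash.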
```python
if (
    self.app_config.disable_permalinks
    or entity.permalink is None
    or entity.permalink == prepared.markdown.frontmatter.permalink
):
```
Skip permalink rewrite when frontmatter enforcement is off
This reconciliation branch rewrites frontmatter whenever the persisted entity permalink differs, but it does not check whether the source file had frontmatter. If ensure_frontmatter_on_sync is False, syncing an existing frontmatterless note still triggers write_frontmatter because the DB entity has a permalink and parsed frontmatter permalink is None, which unexpectedly mutates files and violates the no-frontmatter-enforcement behavior.
Signed-off-by: phernandez <paul@basicmachines.co>
Addressed the two actionable bits from this review:
I left the mutation/value-semantics note alone since it stays inside the prepared-file normalization path, and skipped the extra
Claude finished @phernandez's task in 3m 17s — PR Review: fix(sync): preserve canonical markdown in single-file sync
Follow-up Review

Good follow-through on both actionable items from the first pass:
Codex Automated Review Findings

P1 — Checksum computed from decoded text, not raw bytes

```python
initial_checksum = await compute_checksum(initial_markdown_content)
```

That said, this is pre-existing behavior inherited from the old

P2 —
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ee9281209f
```python
await self.search_service.index_entity_data(
    updated_entity,
    content=remove_frontmatter(final_markdown_content),
)
```
Avoid strict frontmatter stripping in one-file search indexing
sync_one_markdown_file(..., index_search=True) now builds search content with remove_frontmatter(final_markdown_content), but remove_frontmatter raises ParseError when text starts with --- and has no closing delimiter. That can happen for valid markdown that begins with a thematic break (or malformed frontmatter when enforcement is disabled), so this path now fails hard after the entity upsert instead of completing sync. Using parser-derived body content (or handling this parse error) would keep one-file sync robust for these inputs.
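For context, a minimal stand-in shows why a strict stripper fails on this input (this `strict_remove_frontmatter` is a hypothetical illustration, not the project's `remove_frontmatter`):

```python
class ParseError(ValueError):
    pass


def strict_remove_frontmatter(text: str) -> str:
    # Mirrors the strict behavior: a leading '---' must have a closing '---'.
    if not text.startswith("---"):
        return text
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "".join(lines[i + 1:])
    raise ParseError("unclosed frontmatter delimiter")


# A file that merely starts with a thematic break has no closing '---':
doc = "---\nBody content without frontmatter\n"
try:
    strict_remove_frontmatter(doc)
except ParseError:
    pass  # hard failure: valid markdown rejected
```

A strict parser cannot distinguish an unclosed frontmatter block from a thematic break, so the safe options are falling back to the raw content or reusing the parser-derived body.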
```python
updated = await self.entity_repository.update(
    entity.id,
    self._entity_metadata_updates(prepared.file, prepared.final_checksum),
)
```
Refresh metadata after permalink reconciliation writes
When _reconcile_persisted_permalink rewrites frontmatter, the on-disk file mtime/size can change, but _persist_markdown_file still calls _entity_metadata_updates(prepared.file, ...) using the original IndexInputFile metadata from before that write. In the index_files batch path this leaves stale updated_at/mtime/size in the entity row, which causes future scans to repeatedly treat the file as metadata-changed and re-hash it unnecessarily (and can skew recency ordering).
Signed-off-by: phernandez <paul@basicmachines.co>
Claude finished @phernandez's task in 2m 9s — PR Review: fix(sync): preserve canonical markdown in single-file sync
Follow-up Review (commit
| Previous Issue | Status |
|---|---|
| P1 — Checksum from decoded text, not raw bytes | ✅ Fixed in 8ef292d — `compute_checksum(initial_markdown_bytes)` |
| P2 — `_reconcile_persisted_permalink` ignores `ensure_frontmatter_on_sync` | ❌ Still open |
| Codex P2 — `remove_frontmatter` raises `ParseError` for thematic-break files | ❌ Still open |
| Codex P2 — Stale mtime/size after `_reconcile_persisted_permalink` rewrites | ❌ Still open |
Remaining Open Issues

1. `_reconcile_persisted_permalink` ignores `ensure_frontmatter_on_sync` — blocking

src/basic_memory/indexing/batch_indexer.py L565–570

```python
if (
    self.app_config.disable_permalinks
    or entity.permalink is None
    or entity.permalink == prepared.markdown.frontmatter.permalink
):
    return prepared  # skip rewrite
```

When `ensure_frontmatter_on_sync=False` and a file has no frontmatter block, `_normalize_markdown_file` correctly leaves the file untouched — but the upsert still generates a permalink for the DB entity. `_reconcile_persisted_permalink` then sees `entity.permalink != None` and `prepared.markdown.frontmatter.permalink == None`, evaluates the condition as False, and calls `write_frontmatter`, silently injecting frontmatter that the user opted out of.
The one-line fix (same as flagged last review):

```python
if (
    self.app_config.disable_permalinks
    or (not prepared.file_contains_frontmatter and not self.app_config.ensure_frontmatter_on_sync)
    or entity.permalink is None
    or entity.permalink == prepared.markdown.frontmatter.permalink
):
    return prepared
```

2. `remove_frontmatter` raises `ParseError` for thematic-break files — blocking
src/basic_memory/sync/sync_service.py L1131

```python
await self.search_service.index_entity_data(
    updated_entity,
    content=remove_frontmatter(final_markdown_content),
)
```

`remove_frontmatter` raises `ParseError` when content starts with `---` but has no closing `---` delimiter (file_utils.py L392–393). Valid markdown files can start with a thematic break (`---`), and malformed frontmatter that slipped through with `ensure_frontmatter_on_sync=False` would also trigger this. The result is a hard failure after the entity upsert and DB write — leaving the entity in the DB without a search index entry.
The fix is to catch `ParseError` from `remove_frontmatter` and fall back to the raw content, or use the parsed body from the markdown model instead:

```python
try:
    search_body = remove_frontmatter(final_markdown_content)
except ParseError:
    search_body = final_markdown_content
await self.search_service.index_entity_data(updated_entity, content=search_body)
```

Alternatively, use `indexed.search_content` if it's accessible on the `_PreparedMarkdownFile` result, which already went through the body extraction path.
3. Stale mtime/size after `_reconcile_persisted_permalink` rewrites — low priority

src/basic_memory/indexing/batch_indexer.py L552–555

```python
prepared = await self._reconcile_persisted_permalink(prepared, entity)
updated = await self.entity_repository.update(
    entity.id,
    self._entity_metadata_updates(prepared.file, prepared.final_checksum),
)
```

When `_reconcile_persisted_permalink` rewrites frontmatter, the file's mtime/size on disk change. But `_entity_metadata_updates(prepared.file, ...)` uses `prepared.file.size` and `prepared.file.last_modified` from the original `IndexInputFile` — the pre-rewrite values. This means follow-up scans see a DB mtime that doesn't match the filesystem and re-hash the file on every subsequent scan until the next full sync overwrites it.
This is lower priority than #1 and #2 (it doesn't corrupt data, just causes extra work), but worth tracking. The sync_one_markdown_file path already solves this correctly by re-reading metadata after indexing (L1108). The batch path would need a similar refresh, or the _reconcile_persisted_permalink return value could carry the new mtime/size from the write result.
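One possible shape for that refresh, sketched with stdlib calls only (the `IndexInputFile` fields and the `refresh_after_rewrite` helper are assumptions for illustration, not the project's actual types):

```python
import os
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class IndexInputFile:
    # Illustrative stand-in for the metadata the batch path persists.
    path: str
    size: int
    last_modified: float


def refresh_after_rewrite(file: IndexInputFile) -> IndexInputFile:
    # Re-stat the file after a frontmatter rewrite so the persisted
    # mtime/size match what is actually on disk.
    st = os.stat(file.path)
    return replace(file, size=st.st_size, last_modified=st.st_mtime)
```

Calling something like this between the rewrite and `_entity_metadata_updates` would keep the entity row in agreement with the filesystem and stop the repeated re-hashing.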
Checklist

Code Quality & Standards
- Coding conventions (async, dataclasses, type annotations)
- Python 3.12+ types
- SQLAlchemy 2.0
- 100-char line length
- No commented-out code
- Literate Trigger/Why/Outcome comments

Testing & Documentation
- Unit tests for new functions
- Integration tests use real services
- Edge cases covered
- Regression test for frontmatter-only search content (ee92812)
- [⚠️] No test for `ensure_frontmatter_on_sync=False` + permalink reconciliation path (issue #1)
- [⚠️] No test for thematic-break `---` file through `index_search=True` (issue #2)

Architecture
- `IndexFileWriter` protocol boundary clean
- `SyncedMarkdownFile` in `indexing/models.py`
- Knowledge graph consistency maintained
- Intentional double DB write documented

Security & Performance
- No hardcoded secrets
- Proper error handling
- Checksum now computed from raw bytes (P1 fixed)

Compatibility
- `Path` objects used for file ops
- Text-based vs binary checksum divergence resolved
Verdict
Issues #1 and #2 are both correctness bugs in new code and worth fixing before merge. #1 (frontmatter injection when enforcement is off) is a one-line guard. #2 (ParseError on thematic-break files in the search indexing path) needs a try/except. #3 is real but low priority and can ship as a known follow-up. Overall the PR is in good shape — these are targeted, fixable gaps.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8ef292de29
```python
if existing_permalink_by_path is None:
    existing_permalink_by_path = {
        path: permalink
        for path, permalink in (
            await self.entity_repository.get_file_path_to_permalink_map()
```
Avoid full permalink scans in one-file sync
index_markdown_file rebuilds existing_permalink_by_path from get_file_path_to_permalink_map() whenever the caller does not pass a map, and sync_one_markdown_file invokes this path for each markdown file sync. In incremental/watch workloads (which call sync_file per changed file), this adds a full-entity-table scan to every single-file update, creating an O(N) DB read per event and causing avoidable slowdowns on larger projects. A single-file fast path (or a shared cached map across calls) would prevent this regression.
Signed-off-by: phernandez <paul@basicmachines.co>
Addressed the two remaining correctness issues from the latest follow-up review:
Also tightened the no-rewrite assertions to compare against the exact on-disk text so the Windows newline cases stop flapping. I left the stale batch mtime/size note alone for now since it looks like extra work rather than bad data.
Claude finished @phernandez's task in 1m 45s — PR Review: fix(sync): preserve canonical markdown in single-file sync
Follow-up Review (commit
| Previous Issue | Status |
|---|---|
| P2 — `_reconcile_persisted_permalink` ignores `ensure_frontmatter_on_sync` | ✅ Fixed — batch_indexer.py L569–577 |
| P2 — `remove_frontmatter` raises `ParseError` for thematic-break files | ✅ Fixed — sync_service.py L1128–1140 |
| Stale mtime/size after batch `_reconcile_persisted_permalink` rewrite | |
| No-rewrite assertions flapping on Windows CRLF | ✅ Fixed — assertions now compare against `read_bytes().decode()` |
Fix Quality

`_reconcile_persisted_permalink` guard (batch_indexer.py L565–577)

```python
if (
    self.app_config.disable_permalinks
    or (
        not prepared.file_contains_frontmatter
        and not self.app_config.ensure_frontmatter_on_sync
    )
    or entity.permalink is None
    or entity.permalink == prepared.markdown.frontmatter.permalink
):
    return prepared
```

The guard is correct and the Trigger/Why/Outcome comment explains the intent. Test `test_batch_indexer_does_not_inject_frontmatter_when_sync_enforcement_is_disabled` covers the key path: an existing entity with a DB permalink whose file has no frontmatter and enforcement is off — no write occurs.
`ParseError` fallback in `sync_one_markdown_file` (sync_service.py L1128–1140)

```python
try:
    search_content = remove_frontmatter(final_markdown_content)
except ParseError:
    search_content = final_markdown_content
```

Clean and well-commented. Test `test_sync_one_markdown_file_indexes_thematic_break_content_without_frontmatter` covers the `---\nBody content...` case end-to-end, asserting both that the file is not mutated and that the raw content is passed to the search index.
New Codex Finding: O(N) DB Scan Per One-File Sync

src/basic_memory/indexing/batch_indexer.py L193–200

```python
if existing_permalink_by_path is None:
    existing_permalink_by_path = {
        path: permalink
        for path, permalink in (
            await self.entity_repository.get_file_path_to_permalink_map()
        ).items()
    }
```

`sync_one_markdown_file` → `index_markdown_file` always hits this branch (no map is passed), triggering a full entity table scan on every single-file watch event. For a project with thousands of files, each watcher event incurs an O(N) query.
This is a real performance concern for watch workloads. _index_changed_files already caches this map once per sync pass (L539–544), but the one-file path has no equivalent optimization.
Severity: Performance regression, not a correctness bug. The watch/single-file path was also present before this PR — this PR makes it more prominent by funneling through index_markdown_file. Worth a follow-up issue but not blocking.
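One possible follow-up shape: a shared cache that builds the map once and patches it incrementally on writes (the `PermalinkMapCache` class and its method names are hypothetical; only `get_file_path_to_permalink_map` comes from the PR):

```python
class PermalinkMapCache:
    """Shared path-to-permalink map, built once and patched on writes."""

    def __init__(self, repository):
        self._repository = repository
        self._map: dict[str, str] | None = None

    async def get(self) -> dict[str, str]:
        if self._map is None:
            # One full table scan per cache lifetime, not one per watch event.
            self._map = dict(await self._repository.get_file_path_to_permalink_map())
        return self._map

    def record_write(self, path: str, permalink: str) -> None:
        # Patch incrementally instead of invalidating the whole map.
        if self._map is not None:
            self._map[path] = permalink

    def invalidate(self) -> None:
        # Full reset only on bulk operations (e.g. a full sync pass).
        self._map = None
```

With something like this injected into both the batch and one-file paths, each watch event does an O(1) dict lookup instead of an O(N) query.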
Checklist

Code Quality & Standards
- Coding conventions (async, dataclasses, type annotations)
- Python 3.12+ types
- SQLAlchemy 2.0
- 100-char line length
- No commented-out code
- Literate Trigger/Why/Outcome comments

Testing & Documentation
- Unit tests for new functions
- Integration tests use real services
- Edge cases covered (`ensure_frontmatter_on_sync=False`, thematic-break content)
- No-rewrite assertions hardened against Windows CRLF
- Regression test for frontmatter-only search content

Architecture
- `IndexFileWriter` protocol boundary clean
- `SyncedMarkdownFile` in `indexing/models.py`
- Knowledge graph consistency maintained
- Intentional double DB write documented

Security & Performance
- No hardcoded secrets
- Proper error handling
- Checksum computed from raw bytes (P1 fixed in prior pass)
- [⚠️] O(N) scan per one-file sync event — performance, not correctness; follow-up issue recommended

Compatibility
- `Path` objects used for file ops
- Binary `read_bytes().decode()` for Windows CRLF robustness
Verdict
Both blocking correctness bugs are fixed and tested. The PR is clean — ready to merge. The O(N) permalink scan in the one-file path is worth a follow-up issue but does not block this fix landing.
Summary
Testing