fix(sync): preserve canonical markdown in single-file sync #746

Merged — phernandez merged 4 commits into main from more-write-fixes on Apr 16, 2026
Conversation

@phernandez (Member)

Summary

  • add a one-file markdown sync path that returns the canonical persisted markdown, checksum, and entity state
  • reconcile permalink conflicts after markdown upserts so frontmatter is updated to the accepted permalink
  • cover the single-file sync and batch indexer behavior with focused regression tests

Testing

  • user ran tests locally

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions (Contributor)

github-actions bot commented Apr 16, 2026

Claude finished @phernandez's task in 2m 39s


PR Review: fix(sync): preserve canonical markdown in single-file sync

  • Gather context and understand the request
  • Read changed files
  • Review against team checklist
  • Post full review findings

Summary

This PR adds a dedicated single-file markdown sync primitive (sync_one_markdown_file) that returns the canonical, post-rewrite content/checksum/entity state, and adds a permalink reconciliation step after upsert to handle late DB-level conflicts. The design is clean and the test coverage targets the right regression scenarios.
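The returned shape described above can be pictured as a small dataclass. This is an illustrative sketch only — the field names are inferred from the review discussion, not copied from the actual SyncedMarkdownFile in indexing/models.py:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncedMarkdownFileSketch:
    """Illustrative stand-in for the result of sync_one_markdown_file."""

    content: str    # canonical markdown as persisted (post any frontmatter rewrite)
    checksum: str   # checksum of the persisted file
    entity_id: int  # reference to the upserted entity row


result = SyncedMarkdownFileSketch(content="# Note\n", checksum="abc123", entity_id=1)
assert result.content.endswith("\n")
```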


Code Quality & Standards

| Item | Status | Notes |
| --- | --- | --- |
| Coding conventions | ✅ | async patterns, dataclasses, type annotations throughout |
| Python 3.12+ types | ✅ | |
| SQLAlchemy 2.0 | ✅ | |
| 100-char line length | ✅ | |
| No commented-out code | ✅ | |
| Literate comments | ✅ | Trigger/Why/Outcome pattern used consistently |

Testing & Documentation

| Item | Status | Notes |
| --- | --- | --- |
| Unit tests for new methods | ✅ | test_sync_one_markdown_file.py, test_batch_indexer.py additions |
| Edge cases covered | ✅ | No-rewrite path, permalink conflict, checksum source |
| 100% coverage intent | ⚠️ | See notes below |
| Integration tests | ✅ | Tests use real services |

Architecture

| Item | Status | Notes |
| --- | --- | --- |
| Protocol-based IndexFileWriter | ✅ | Clean abstraction boundary |
| Adapter _FileServiceIndexWriter | ✅ | Prevents indexing layer from importing FileService |
| SyncedMarkdownFile placement | ✅ | Lives in indexing/models.py with TYPE_CHECKING guard for Entity |
| Knowledge graph consistency | ✅ | Relations resolved before returning |
| Markdown file integrity | ✅ | Frontmatter reconciled post-upsert |

Issues Found

1. Redundant entity_repository.update() in sync_one_markdown_file — missing comment

src/basic_memory/sync/sync_service.py L1099–1114

batch_indexer.index_markdown_file() calls _persist_markdown_file() which already calls entity_repository.update() with metadata from the initial IndexInputFile. Then sync_one_markdown_file() immediately calls entity_repository.update() again with freshly re-read file metadata. This second write is intentional — it captures the post-rewrite mtime/size — but there's no comment explaining why. Without it, the pattern looks like an accidental double-write.

```python
# Sync one markdown file, apply metadata from rewritten file on disk
updated_entity = await self.entity_repository.update(
    refreshed_entities[0].id,
    {
        "checksum": indexed.checksum,
        "created_at": file_metadata.created_at,
        ...
    },
)
```

Suggest adding a Trigger/Why/Outcome comment: the initial metadata came from the pre-rewrite file; this second update captures mtime/size from the file as it exists after any frontmatter rewrite.


2. _reconcile_persisted_permalink mixes mutation and value semantics

src/basic_memory/indexing/batch_indexer.py L560–591

The method mutates prepared.markdown.frontmatter.metadata["permalink"] in-place (L578) while also returning a new _PreparedMarkdownFile wrapper. Since the markdown field is shared by reference, callers holding a reference to the original prepared will also see the mutation. The same pattern appears in _normalize_markdown_file (L319). This is harmless in the current call graph but is a subtle API footgun — either mutate in-place exclusively, or always return a fully new object. A comment noting the intentional mutation would help.
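The shared-reference footgun reduces to a minimal sketch (class names here are illustrative stand-ins, not the project's actual _PreparedMarkdownFile types):

```python
from dataclasses import dataclass


@dataclass
class Markdown:
    metadata: dict


@dataclass
class Prepared:
    markdown: Markdown  # shared by reference between old and new wrappers


original = Prepared(markdown=Markdown(metadata={"permalink": "old"}))

# "Return a new wrapper" while mutating the shared markdown in place:
reconciled = Prepared(markdown=original.markdown)
reconciled.markdown.metadata["permalink"] = "accepted"

# A caller still holding the original wrapper observes the mutation.
assert original.markdown.metadata["permalink"] == "accepted"
```

Either copying `markdown` before mutating, or mutating `original` in place and returning it, would remove the ambiguity.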

3. _build_prepared_entity can pass frontmatter YAML as search_content

src/basic_memory/indexing/batch_indexer.py L604–609

```python
search_content=(
    prepared.markdown.content
    if prepared.markdown.content is not None
    else prepared.content  # full file including frontmatter
),
```

When prepared.markdown.content is None (body-less file), prepared.content (the full markdown with frontmatter YAML) is sent to the search index. This is pre-existing, but the new _build_prepared_entity helper makes it more prominent. Consider using remove_frontmatter(prepared.content) as the fallback to avoid indexing YAML metadata. This would match the explicit remove_frontmatter(final_markdown_content) used in sync_one_markdown_file (L1119).


4. Minor: pragma: no cover for the outer exception in update_frontmatter_with_result

src/basic_memory/services/file_service.py L507–515

The outer except Exception as e: # pragma: no cover is correct (hard to trigger outside of platform I/O failures). No action needed, just noting it is appropriately guarded.

5. Test gap: index_search=True path in test_sync_one_markdown_file_writes_missing_frontmatter_and_returns_canonical_content

The test patches index_entity_data and asserts it is called with content=remove_frontmatter(final_content). This is good. The only missing assertion is that result.content_type equals "text/markdown". Minor, not blocking.


Checklist Results

Code Quality & Standards

  • [x] Follows Basic Memory coding conventions
  • [x] Python 3.12+ type annotations and async patterns
  • [x] SQLAlchemy 2.0 best practices
  • [x] FastAPI/Typer conventions followed (N/A here)
  • [x] 100-character line length maintained
  • [x] No commented-out code

Testing & Documentation

  • [x] Unit tests for new functions/methods
  • [x] Integration tests (real services, no mocks)
  • [x] Edge cases covered
  • [⚠️] 100% coverage — update_frontmatter_with_result outer except is pragma: no cover; defensible
  • [x] Documentation updated — no README/docstring changes needed for this fix

Basic Memory Architecture

  • [x] Atomic, composable MCP design maintained
  • [x] No database schema changes (no migration needed)
  • [x] Local-first architecture preserved
  • [x] Knowledge graph consistency maintained
  • [x] Markdown file integrity preserved

Security & Performance

  • [x] No hardcoded secrets
  • [x] Input validation present
  • [x] Proper error handling
  • [⚠️] Double DB write per markdown sync — intentional but undocumented
  • [x] No sensitive data in logs

Compatibility

  • [x] Windows: Path objects used for file ops; as_posix() used in scan paths
  • [x] No emoji/unicode in log output

Verdict

The core fix is sound and the test coverage is focused and well-targeted. Issues #1 and #3 above are the most worth addressing: the missing comment on the double-update makes the code harder to reason about, and the frontmatter fallback in search_content is a latent correctness issue. Issue #2 is low risk given the current call graph but worth a comment.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ecaf247168

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread on src/basic_memory/sync/sync_service.py (outdated):

```python
content=file_content,
mtime=file_metadata.modified_at.timestamp(),
ctime=file_metadata.created_at.timestamp(),
initial_checksum = await compute_checksum(initial_markdown_content)
```

P1 — Compute single-file checksum from file bytes

Using compute_checksum(initial_markdown_content) hashes the decoded text, not the actual file bytes. For markdown files with CRLF (or any encoding/newline normalization), this produces a different checksum than FileService.compute_checksum(path), so sync_one_markdown_file can persist a checksum that does not match disk when no frontmatter rewrite occurs. That breaks checksum-based move/modify detection and can cause repeated unnecessary sync work.
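The divergence is easy to demonstrate: a CRLF file read in text mode yields LF-normalized text, so a checksum of the decoded text disagrees with a checksum of the bytes on disk. A minimal illustration (hashlib sha256 stands in here; the project's actual checksum function may differ):

```python
import hashlib

raw = b"---\r\ntitle: note\r\n---\r\nbody\r\n"    # CRLF bytes as stored on disk
text = raw.decode("utf-8").replace("\r\n", "\n")  # what a universal-newline text read yields

binary_digest = hashlib.sha256(raw).hexdigest()                  # hash of raw file bytes
text_digest = hashlib.sha256(text.encode("utf-8")).hexdigest()   # hash of decoded text

assert binary_digest != text_digest  # same file, two different checksums
```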


Comment on lines +565 to +569:

```python
if (
    self.app_config.disable_permalinks
    or entity.permalink is None
    or entity.permalink == prepared.markdown.frontmatter.permalink
):
```

P2 — Skip permalink rewrite when frontmatter enforcement is off

This reconciliation branch rewrites frontmatter whenever the persisted entity permalink differs, but it does not check whether the source file had frontmatter. If ensure_frontmatter_on_sync is False, syncing an existing frontmatterless note still triggers write_frontmatter because the DB entity has a permalink and parsed frontmatter permalink is None, which unexpectedly mutates files and violates the no-frontmatter-enforcement behavior.


Signed-off-by: phernandez <paul@basicmachines.co>
@phernandez (Member, Author)

Addressed the two actionable bits from this review:

  • Added the Trigger/Why/Outcome comment around the post-rewrite metadata refresh in sync_one_markdown_file.
  • Switched the _build_prepared_entity() fallback to strip frontmatter before indexing search content, and added a regression for a frontmatter-only note.

I left the mutation/value-semantics note alone since it stays inside the prepared-file normalization path, and skipped the extra content_type assertion suggestion.

@github-actions (Contributor)

github-actions bot commented Apr 16, 2026

Claude finished @phernandez's task in 3m 17s


PR Review: fix(sync): preserve canonical markdown in single-file sync

  • Gather context and understand the request
  • Read changed files
  • Evaluate previous review items and new Codex findings
  • Post full review findings

Follow-up Review

Good follow-through on both actionable items from the first pass:

Previous Issue Status

| Issue | Status |
| --- | --- |
| #1 — Missing Trigger/Why/Outcome on double entity_repository.update() | ✅ Added in ee92812 |
| #3 — Frontmatter YAML in search_content fallback | ✅ Fixed + regression test added in ee92812 |
| #2 — Mutation/value semantics note | Intentionally deferred — acceptable |

Codex Automated Review Findings

P1 — Checksum computed from decoded text, not raw bytes

src/basic_memory/sync/sync_service.py L1080

```python
initial_checksum = await compute_checksum(initial_markdown_content)
```

compute_checksum(str) hashes the decoded text, and text-mode reads have already normalized CRLF → LF (universal newlines) before that hash is taken. FileService.compute_checksum(path) hashes raw binary chunks. On Windows with CRLF files, the two produce different digests.

sync_one_markdown_file stores the text-based checksum into the DB; the watcher/scanner uses file_service.compute_checksum(path) (binary). If the two diverge, every scan iteration would flag the file as "modified" and re-sync it unnecessarily.

That said, this is pre-existing behavior inherited from the old sync_markdown_file and the project primarily targets macOS/Linux. Not blocking, but worth a note in the sync_one_markdown_file docstring or a follow-up issue.

P2 — _reconcile_persisted_permalink ignores ensure_frontmatter_on_sync

src/basic_memory/indexing/batch_indexer.py L565–591

```python
if (
    self.app_config.disable_permalinks
    or entity.permalink is None
    or entity.permalink == prepared.markdown.frontmatter.permalink
):
    return prepared  # skip rewrite
```

_normalize_markdown_file correctly guards both write paths:

```python
# batch-normalize path
if not prepared.file_contains_frontmatter and self.app_config.ensure_frontmatter_on_sync:
    ...
elif prepared.file_contains_frontmatter and not self.app_config.disable_permalinks:
    ...
```

But _reconcile_persisted_permalink only checks disable_permalinks. When ensure_frontmatter_on_sync=False and a file has no frontmatter:

  1. _normalize_markdown_file correctly leaves the file unchanged.
  2. The DB entity receives a generated permalink.
  3. _reconcile_persisted_permalink sees entity.permalink != None and prepared.markdown.frontmatter.permalink == None → condition is False → proceeds to call write_frontmatter.
  4. Frontmatter is injected into a file the user configured to be frontmatter-free.

This is a bug introduced by this PR (new code path). The fix is one extra guard:

```python
if (
    self.app_config.disable_permalinks
    or (not prepared.file_contains_frontmatter and not self.app_config.ensure_frontmatter_on_sync)
    or entity.permalink is None
    or entity.permalink == prepared.markdown.frontmatter.permalink
):
    return prepared
```



Checklist Results

Code Quality & Standards

  • [x] Coding conventions (async, dataclasses, type annotations)
  • [x] Python 3.12+ types
  • [x] SQLAlchemy 2.0
  • [x] 100-char line length
  • [x] No commented-out code
  • [x] Literate Trigger/Why/Outcome comments

Testing & Documentation

  • [x] Unit tests for new functions
  • [x] Integration tests use real services
  • [x] Edge cases covered
  • [x] Regression test for frontmatter-only search content

Architecture

  • [x] IndexFileWriter protocol boundary clean
  • [x] SyncedMarkdownFile in indexing/models.py
  • [x] Knowledge graph consistency maintained

Security & Performance

  • [x] No hardcoded secrets
  • [x] Proper error handling
  • [x] Intentional double DB write is now documented

Compatibility

  • [x] Path objects used for file ops
  • [⚠️] Text-based checksum diverges from binary on Windows/CRLF (P1 — pre-existing, non-blocking)
  • [⚠️] _reconcile_persisted_permalink ignores ensure_frontmatter_on_sync (P2 — blocking)

Verdict

The two requested fixes from the first review are clean. The P2 Codex finding is a real correctness bug in new code: _reconcile_persisted_permalink can inject frontmatter into files that the user configured to be frontmatter-free when ensure_frontmatter_on_sync=False. That one-line guard is worth adding before merge. P1 is pre-existing and can be tracked as a follow-up.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ee9281209f


Comment on lines +1120 to +1123:

```python
await self.search_service.index_entity_data(
    updated_entity,
    content=remove_frontmatter(final_markdown_content),
)
```

P2 — Avoid strict frontmatter stripping in one-file search indexing

sync_one_markdown_file(..., index_search=True) now builds search content with remove_frontmatter(final_markdown_content), but remove_frontmatter raises ParseError when text starts with --- and has no closing delimiter. That can happen for valid markdown that begins with a thematic break (or malformed frontmatter when enforcement is disabled), so this path now fails hard after the entity upsert instead of completing sync. Using parser-derived body content (or handling this parse error) would keep one-file sync robust for these inputs.


Comment on lines +552 to +555:

```python
updated = await self.entity_repository.update(
    entity.id,
    self._entity_metadata_updates(prepared.file, prepared.final_checksum),
)
```

P2 — Refresh metadata after permalink reconciliation writes

When _reconcile_persisted_permalink rewrites frontmatter, the on-disk file mtime/size can change, but _persist_markdown_file still calls _entity_metadata_updates(prepared.file, ...) using the original IndexInputFile metadata from before that write. In the index_files batch path this leaves stale updated_at/mtime/size in the entity row, which causes future scans to repeatedly treat the file as metadata-changed and re-hash it unnecessarily (and can skew recency ordering).


Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions (Contributor)

github-actions bot commented Apr 16, 2026

Claude finished @phernandez's task in 2m 9s


PR Review: fix(sync): preserve canonical markdown in single-file sync

  • Gather context and understand the request
  • Read changed files and evaluate new commits
  • Assess open issues from previous reviews
  • Post full review findings

Follow-up Review (commit 8ef292d)

Good progress on the Codex P1 finding — 8ef292d now passes raw bytes to compute_checksum so the checksum in the DB matches what the watcher/scanner sees on disk.

Previous Issue Status

| Issue | Status |
| --- | --- |
| P1 — Checksum from decoded text, not raw bytes | ✅ Fixed in 8ef292d — compute_checksum(initial_markdown_bytes) |
| P2 — _reconcile_persisted_permalink ignores ensure_frontmatter_on_sync | ❌ Still open |
| Codex P2 — remove_frontmatter raises ParseError for thematic-break files | ❌ Still open |
| Codex P2 — Stale mtime/size after _reconcile_persisted_permalink rewrites | ❌ Still open |

Remaining Open Issues

1. _reconcile_persisted_permalink ignores ensure_frontmatter_on_sync — blocking

src/basic_memory/indexing/batch_indexer.py L565–570

```python
if (
    self.app_config.disable_permalinks
    or entity.permalink is None
    or entity.permalink == prepared.markdown.frontmatter.permalink
):
    return prepared  # skip rewrite
```

When ensure_frontmatter_on_sync=False and a file has no frontmatter block, _normalize_markdown_file correctly leaves the file untouched — but the upsert still generates a permalink for the DB entity. _reconcile_persisted_permalink then sees entity.permalink != None and prepared.markdown.frontmatter.permalink == None, evaluates the condition as False, and calls write_frontmatter, silently injecting frontmatter that the user opted out of.

The one-line fix (same as flagged last review):

```python
if (
    self.app_config.disable_permalinks
    or (not prepared.file_contains_frontmatter and not self.app_config.ensure_frontmatter_on_sync)
    or entity.permalink is None
    or entity.permalink == prepared.markdown.frontmatter.permalink
):
    return prepared
```


2. remove_frontmatter raises ParseError for thematic-break files — blocking

src/basic_memory/sync/sync_service.py L1131

```python
await self.search_service.index_entity_data(
    updated_entity,
    content=remove_frontmatter(final_markdown_content),
)
```

remove_frontmatter raises ParseError when content starts with --- but has no closing --- delimiter (file_utils.py L392–393). Valid markdown files can start with a thematic break (---), and malformed frontmatter that slipped through with ensure_frontmatter_on_sync=False would also trigger this. The result is a hard failure after the entity upsert and DB write — leaving the entity in the DB without a search index entry.

The fix is to catch ParseError from remove_frontmatter and fall back to the raw content, or use the parsed body from the markdown model instead:

```python
try:
    search_body = remove_frontmatter(final_markdown_content)
except ParseError:
    search_body = final_markdown_content
await self.search_service.index_entity_data(updated_entity, content=search_body)
```

Alternatively, use indexed.search_content if it's accessible on the _PreparedMarkdownFile result, which already went through the body extraction path.
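The failure mode itself can be reproduced with a minimal strict stripper of the same shape — a hypothetical sketch, not the project's remove_frontmatter, whose behavior is only described above:

```python
class ParseError(Exception):
    pass


def strip_frontmatter(text: str) -> str:
    # Strict variant: a leading --- must be closed by another --- line.
    if text.startswith("---"):
        end = text.find("\n---", 3)
        if end == -1:
            raise ParseError("frontmatter opened but never closed")
        return text[end + len("\n---"):].lstrip("\n")
    return text


assert strip_frontmatter("---\ntitle: x\n---\nbody") == "body"
# A document that merely starts with a thematic break trips the strict parser:
# strip_frontmatter("---\nJust a horizontal rule, then prose.")  raises ParseError
```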


3. Stale mtime/size after _reconcile_persisted_permalink rewrites — low priority

src/basic_memory/indexing/batch_indexer.py L552–555

```python
prepared = await self._reconcile_persisted_permalink(prepared, entity)
updated = await self.entity_repository.update(
    entity.id,
    self._entity_metadata_updates(prepared.file, prepared.final_checksum),
)
```

When _reconcile_persisted_permalink rewrites frontmatter, the file's mtime/size on disk change. But _entity_metadata_updates(prepared.file, ...) uses prepared.file.size and prepared.file.last_modified from the original IndexInputFile — the pre-rewrite values. This means follow-up scans see a DB mtime that doesn't match the filesystem and re-hash the file on every subsequent scan until the next full sync overwrites it.

This is lower priority than #1 and #2 (it doesn't corrupt data, just causes extra work), but worth tracking. The sync_one_markdown_file path already solves this correctly by re-reading metadata after indexing (L1108). The batch path would need a similar refresh, or the _reconcile_persisted_permalink return value could carry the new mtime/size from the write result.


Checklist

Code Quality & Standards

  • [x] Coding conventions (async, dataclasses, type annotations)
  • [x] Python 3.12+ types
  • [x] SQLAlchemy 2.0
  • [x] 100-char line length
  • [x] No commented-out code
  • [x] Literate Trigger/Why/Outcome comments

Testing & Documentation

  • [x] Unit tests for new functions
  • [x] Integration tests use real services
  • [x] Edge cases covered
  • [x] Regression test for frontmatter-only search content (ee92812)
  • [⚠️] No test for ensure_frontmatter_on_sync=False + permalink reconciliation path (issue #1 above)
  • [⚠️] No test for thematic-break --- file through index_search=True (issue #2 above)

Architecture

  • [x] IndexFileWriter protocol boundary clean
  • [x] SyncedMarkdownFile in indexing/models.py
  • [x] Knowledge graph consistency maintained
  • [x] Intentional double DB write documented

Security & Performance

  • [x] No hardcoded secrets
  • [x] Proper error handling
  • [x] Checksum now computed from raw bytes (P1 fixed)

Compatibility

  • [x] Path objects used for file ops
  • [x] Text-based vs binary checksum divergence resolved

Verdict

Issues #1 and #2 are both correctness bugs in new code and worth fixing before merge. #1 (frontmatter injection when enforcement is off) is a one-line guard. #2 (ParseError on thematic-break files in the search indexing path) needs a try/except. #3 is real but low priority and can ship as a known follow-up. Overall the PR is in good shape — these are targeted, fixable gaps.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8ef292de29


Comment on lines +194 to +198 (excerpt as shown in the diff):

```python
if existing_permalink_by_path is None:
    existing_permalink_by_path = {
        path: permalink
        for path, permalink in (
            await self.entity_repository.get_file_path_to_permalink_map()
```

P2 — Avoid full permalink scans in one-file sync

index_markdown_file rebuilds existing_permalink_by_path from get_file_path_to_permalink_map() whenever the caller does not pass a map, and sync_one_markdown_file invokes this path for each markdown file sync. In incremental/watch workloads (which call sync_file per changed file), this adds a full-entity-table scan to every single-file update, creating an O(N) DB read per event and causing avoidable slowdowns on larger projects. A single-file fast path (or a shared cached map across calls) would prevent this regression.


Signed-off-by: phernandez <paul@basicmachines.co>
@phernandez (Member, Author)

Addressed the two remaining correctness issues from the latest follow-up review:

  • Added the _reconcile_persisted_permalink() guard so frontmatterless files stay untouched when ensure_frontmatter_on_sync=False, including the existing-entity-with-permalink case.
  • Made sync_one_markdown_file() fall back to raw markdown content when strict frontmatter stripping rejects leading --- text, and added a regression for thematic-break content.

Also tightened the no-rewrite assertions to compare against the exact on-disk text so the Windows newline cases stop flapping. I left the stale batch mtime/size note alone for now since it looks like extra work rather than bad data.

@github-actions (Contributor)

github-actions bot commented Apr 16, 2026

Claude finished @phernandez's task in 1m 45s


PR Review: fix(sync): preserve canonical markdown in single-file sync

  • Gather context and understand the request
  • Read changed files (batch_indexer.py, sync_service.py, tests)
  • Evaluate fixes for previously identified issues
  • Check new Codex finding (permalink scan performance)
  • Post full review findings

Follow-up Review (commit 1bb42df)

Both blocking issues from the previous review have been addressed cleanly.

Previous Issue Status

| Issue | Status |
| --- | --- |
| P2 — _reconcile_persisted_permalink ignores ensure_frontmatter_on_sync | ✅ Fixed — batch_indexer.py L569–577 |
| P2 — remove_frontmatter raises ParseError for thematic-break files | ✅ Fixed — sync_service.py L1128–1140 |
| Stale mtime/size after batch _reconcile_persisted_permalink rewrite | ↗️ Deferred — acknowledged as non-critical |
| No-rewrite assertions flapping on Windows CRLF | ✅ Fixed — assertions now compare against read_bytes().decode() |

Fix Quality

_reconcile_persisted_permalink guard (batch_indexer.py L565–577)

```python
if (
    self.app_config.disable_permalinks
    or (
        not prepared.file_contains_frontmatter
        and not self.app_config.ensure_frontmatter_on_sync
    )
    or entity.permalink is None
    or entity.permalink == prepared.markdown.frontmatter.permalink
):
    return prepared
```

The guard is correct and the Trigger/Why/Outcome comment explains the intent. Test test_batch_indexer_does_not_inject_frontmatter_when_sync_enforcement_is_disabled covers the key path: an existing entity with a DB permalink whose file has no frontmatter and enforcement is off — no write occurs.

ParseError fallback in sync_one_markdown_file (sync_service.py L1128–1140)

```python
try:
    search_content = remove_frontmatter(final_markdown_content)
except ParseError:
    search_content = final_markdown_content
```

Clean and well-commented. Test test_sync_one_markdown_file_indexes_thematic_break_content_without_frontmatter covers the ---\nBody content... case end-to-end, asserting both that the file is not mutated and that the raw content is passed to the search index.


New Codex Finding: O(N) DB Scan Per One-File Sync

src/basic_memory/indexing/batch_indexer.py L193–200

```python
if existing_permalink_by_path is None:
    existing_permalink_by_path = {
        path: permalink
        for path, permalink in (
            await self.entity_repository.get_file_path_to_permalink_map()
        ).items()
    }
```

sync_one_markdown_file calls index_markdown_file without passing a map, so every single-file watch event hits this branch and triggers a full entity-table scan. For a project with thousands of files, each watcher event incurs an O(N) query.

This is a real performance concern for watch workloads. _index_changed_files already caches this map once per sync pass (L539–544), but the one-file path has no equivalent optimization.

Severity: Performance regression, not a correctness bug. The watch/single-file path was also present before this PR — this PR makes it more prominent by funneling through index_markdown_file. Worth a follow-up issue but not blocking.
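The per-pass caching that _index_changed_files already does could be generalized across watch events with a small cache object. A sketch under assumed names — this is not the project's API, just the shape of the fix:

```python
class PermalinkMapCache:
    """Cache the path→permalink map across single-file sync events."""

    def __init__(self, load_map):
        self._load_map = load_map  # the expensive full-table scan
        self._map = None

    def get(self):
        if self._map is None:       # scan at most once until invalidated
            self._map = self._load_map()
        return self._map

    def record(self, path, permalink):
        if self._map is not None:   # keep the cache coherent after upserts
            self._map[path] = permalink

    def invalidate(self):
        self._map = None            # e.g. after a full sync pass


scans = []

def load():
    scans.append(1)
    return {"notes/a.md": "notes/a"}


cache = PermalinkMapCache(load)
cache.get()
cache.record("notes/b.md", "notes/b")
cache.get()
assert len(scans) == 1  # two events, one table scan
```

A design like this trades a staleness risk for O(1) lookups; invalidation on external writes would still be needed.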



Checklist

Code Quality & Standards

  • [x] Coding conventions (async, dataclasses, type annotations)
  • [x] Python 3.12+ types
  • [x] SQLAlchemy 2.0
  • [x] 100-char line length
  • [x] No commented-out code
  • [x] Literate Trigger/Why/Outcome comments

Testing & Documentation

  • [x] Unit tests for new functions
  • [x] Integration tests use real services
  • [x] Edge cases covered (ensure_frontmatter_on_sync=False, thematic-break content)
  • [x] No-rewrite assertions hardened against Windows CRLF
  • [x] Regression test for frontmatter-only search content

Architecture

  • [x] IndexFileWriter protocol boundary clean
  • [x] SyncedMarkdownFile in indexing/models.py
  • [x] Knowledge graph consistency maintained
  • [x] Intentional double DB write documented

Security & Performance

  • [x] No hardcoded secrets
  • [x] Proper error handling
  • [x] Checksum computed from raw bytes (P1 fixed in prior pass)
  • [⚠️] O(N) scan per one-file sync event — performance, not correctness; follow-up issue recommended

Compatibility

  • [x] Path objects used for file ops
  • [x] Binary read_bytes().decode() for Windows CRLF robustness

Verdict

Both blocking correctness bugs are fixed and tested. The PR is clean — ready to merge. The O(N) permalink scan in the one-file path is worth a follow-up issue but does not block this fix landing.

@phernandez phernandez merged commit b3d5448 into main Apr 16, 2026
39 of 41 checks passed
@phernandez phernandez deleted the more-write-fixes branch April 16, 2026 01:04