Skip to content

fix: strip NUL bytes from content before PostgreSQL search indexing#592

Merged
phernandez merged 13 commits intomainfrom
hotfix/strip-nul-bytes-rclone-preallocate
Feb 21, 2026
Merged

fix: strip NUL bytes from content before PostgreSQL search indexing#592
phernandez merged 13 commits intomainfrom
hotfix/strip-nul-bytes-rclone-preallocate

Conversation

@phernandez
Copy link
Member

Summary

  • Strip \x00 bytes in SearchService.index_entity_markdown() before building search index rows (primary fix)
  • Add _strip_nul_from_row() safety net in PostgresSearchRepository.bulk_index_items() before INSERT
  • Add --local-no-preallocate flag to rclone sync and bisync commands to prevent NUL byte padding at the source

Context

Users syncing via rclone bisync to Google Drive File Stream on Windows get NUL bytes appended to markdown files due to a known rclone preallocation bug with virtual filesystems (rclone/rclone#6801). When these files are indexed, PostgreSQL rejects them with CharacterNotInRepertoireError.

Fixes #548

Test plan

  • test_strip_nul_removes_nul_bytes — unit test for _strip_nul() helper
  • test_index_entity_markdown_strips_nul_bytes — integration test: index entity with NUL content, verify no \x00 in stored rows
  • test_strip_nul_from_row — unit test for Postgres repository helper
  • test_project_sync_includes_no_preallocate_flag — verify flag in sync command
  • test_project_bisync_includes_no_preallocate_flag — verify flag in bisync command
  • just fast-check passes (lint, format, typecheck, 1662 tests, MCP smoke)

🤖 Generated with Claude Code

phernandez and others added 10 commits February 11, 2026 22:32
Signed-off-by: phernandez <paul@basicmachines.co>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
…ansactions

--header-download and --header-upload only apply to GET and PUT requests.
bisync starts by listing files via S3 ListObjectsV2 — neither a download
nor upload — so the consistency header was never sent on list requests.
Non-US users saw stale edge-cached metadata and bisync reported "0 changes."

--header is rclone's global flag that applies to ALL HTTP transactions
(list, download, upload), which is a superset of the previous two flags.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
rclone preallocation on virtual filesystems (e.g. Google Drive File Stream)
pads markdown files with \x00 bytes (rclone/rclone#6801), which PostgreSQL
rejects with CharacterNotInRepertoireError during search indexing.

Three-pronged fix:
- 🛡️ Primary: _strip_nul() in SearchService.index_entity_markdown() sanitizes
  content_stems, content_snippet, and observation/relation content before
  building SearchIndexRow objects
- 🛡️ Secondary: _strip_nul_from_row() in PostgresSearchRepository.bulk_index_items()
  as a safety net before INSERT
- 🔧 Prevention: --local-no-preallocate flag added to rclone sync and bisync
  commands to prevent the padding at the source

Fixes #548

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
Resolve conflicts keeping hotfix NUL-byte stripping alongside main's
semantic search, vector embeddings, and full content_snippet changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 20, 2026

Claude finished @phernandez's task in 3m 27s —— View job


PR Review: fix: strip NUL bytes from content before PostgreSQL search indexing

  • Gather context and understand the request
  • Read changed files
  • Review against team checklist
  • Post comprehensive review

Summary

This is a solid, well-targeted fix with a clear layered defense strategy. The root cause (rclone preallocation on virtual filesystems) is well-understood and well-documented. The code is clean and minimal. A few minor gaps are noted below.


Team Checklist

Code Quality & Standards

  • Follows Basic Memory coding conventions in CLAUDE.md
  • Python 3.12+ type annotations and async patterns
  • SQLAlchemy 2.0 best practices
  • FastAPI and Typer conventions followed
  • 100-character line length limit maintained
  • No commented-out code blocks

Testing & Documentation

  • Unit tests for new functions (test_strip_nul_removes_nul_bytes, test_strip_nul_from_row)
  • Service-level integration test for NUL stripping (test_index_entity_markdown_strips_nul_bytes)
  • Edge cases covered (empty string, clean string, multiple NUL bytes, non-string values)
  • [~] No Postgres-specific integration test that proves the actual CharacterNotInRepertoireError is resolved end-to-end (see issue Pre release #1)
  • Test coverage maintained — new code has corresponding tests
  • Comments reference upstream rclone bug (rclone/rclone#6801)

Basic Memory Architecture

  • No database schema changes (no migration required)
  • Preserves local-first architecture principles
  • Knowledge graph operations maintain consistency
  • Literate programming style followed — section headers and decision-point comments present

Security & Performance

  • No hardcoded secrets or credentials
  • str.replace() is O(n) — minimal indexing overhead
  • No sensitive data in logs

Compatibility

  • --local-no-preallocate is a local filesystem rclone flag (Windows compatible)
  • No emoji/unicode characters added to console/log output

Issues Found

Issue 1 — No Postgres integration test for end-to-end NUL rejection

test_index_entity_markdown_strips_nul_bytes (in tests/services/test_search_service.py) runs against SQLite by default, which silently accepts NUL bytes. The test verifies that content_snippet is clean after indexing, which validates the stripping logic, but it does not prove that PostgreSQL would have previously thrown CharacterNotInRepertoireError and now succeeds.

The postgres-specific test file (tests/repository/test_postgres_search_repository.py) only has a unit test for _strip_nul_from_row() (the secondary defense helper) — not an end-to-end bulk_index_items() call with NUL bytes against a live Postgres instance.

Adding a Postgres integration test like this would close the gap:

@pytest.mark.asyncio
async def test_bulk_index_items_strips_nul_bytes(session_maker, test_project):
    """NUL bytes in content must not cause CharacterNotInRepertoireError on INSERT."""
    repo = PostgresSearchRepository(session_maker, project_id=test_project.id)
    now = datetime.now(timezone.utc)
    row = SearchIndexRow(
        project_id=test_project.id,
        id=99,
       ,
        content_stems="some\x00stems",
        content_snippet="snippet\x00here",
        permalink="test/nul-row",
        file_path="test/nul.md",
        type="entity",
        metadata={},
        created_at=now,
        updated_at=now,
    )
    # Should not raise CharacterNotInRepertoireError
    await repo.bulk_index_items([row])
    results = await repo.search(permalink="test/nul-row")
    assert len(results) == 1
    assert "\x00" not in (results[0].content_snippet or "")

Fix this →


Issue 2 — title fields not stripped at the primary layer (observation and entity titles)

In SearchService.index_entity_markdown(), the content_stems and content_snippet fields are consistently stripped, but the title field for observations is derived directly from obs.content without stripping:

src/basic_memory/services/search_service.py:532

content_stems=obs_content_stems,                   # stripped ✓
content_snippet=_strip_nul(obs.content),           # stripped ✓

The secondary defense in _strip_nul_from_row() covers this (it strips all string values before INSERT), but primary sanitization is inconsistent. Same applies to the entity title and relation title fields.

This is not critical since the secondary defense handles it, but it's worth making explicit that the secondary defense is load-bearing for title, not just a belt-and-suspenders extra.

Fix this →


Issue 3 — index_item() single-item path lacks secondary defense

PostgresSearchRepository.index_item() (the single-item UPSERT path used by index_entity_file()) does not call _strip_nul_from_row(). In practice this is fine because index_entity_file() only indexes binary file metadata (no text content), but if this path is ever extended to handle content, there would be no secondary defense.

This is low severity but worth a comment near index_item() noting that NUL sanitization is only guaranteed by the primary layer (SearchService) for this code path.


Positive Notes

  • Layered defense strategy is well-conceived: primary fix in SearchService at content-building time, secondary safety net in PostgresSearchRepository before INSERT. The comment explicitly documents this relationship — "Secondary defense: PostgreSQL text columns cannot store \\x00. Primary sanitization happens in SearchService.index_entity_markdown()." This is exactly the right approach.
  • --local-no-preallocate is a proactive upstream fix: preventing NUL bytes from entering files in the first place is the right call alongside the defensive indexing fix.
  • Test for test_index_entity_markdown_strips_nul_bytes is correctly designed: even though it runs on SQLite, it validates service-layer behavior (stripping before storage) by asserting the returned content_snippet is clean, which would fail if stripping hadn't occurred.
  • Compact, focused diff: no unrelated refactors, no scope creep.

Overall this is ready to merge with minor follow-ups noted above. The fix is correct and the gaps are non-critical.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 20, 2026

Claude finished @phernandez's task in 2m 3s —— View job


PR Review: fix: strip NUL bytes from content before PostgreSQL search indexing

  • Gather context and understand the request
  • Read changed files and check previous review issues
  • Review against team checklist
  • Post comprehensive review

Summary

This review covers the updated branch after the latest synchronize (merge from main + unrelated move_note contract test fix). The NUL byte stripping code itself is unchanged from the previous review. The three follow-up issues from the previous review are still outstanding.


Team Checklist

Code Quality & Standards

  • Follows Basic Memory coding conventions in CLAUDE.md
  • Python 3.12+ type annotations and async patterns
  • SQLAlchemy 2.0 best practices
  • FastAPI and Typer conventions followed
  • 100-character line length limit maintained
  • No commented-out code blocks

Testing & Documentation

  • Unit tests for _strip_nul() (test_strip_nul_removes_nul_bytes)
  • Unit test for _strip_nul_from_row() (test_strip_nul_from_row)
  • Service-level integration test for NUL stripping (test_index_entity_markdown_strips_nul_bytes)
  • rclone flag tests for both sync and bisync
  • [~] No Postgres end-to-end bulk_index_items test with NUL bytes against live Postgres (see Issue 1)
  • Upstream rclone bug referenced (rclone/rclone#6801) in both code comments and tests
  • 100% test coverage maintained for new code paths

Basic Memory Architecture

  • No database schema changes (no migration required)
  • Preserves local-first architecture principles
  • Literate programming style followed — constraint comments and section headers present

Security & Performance

  • No hardcoded secrets or credentials
  • str.replace() is O(n) — minimal overhead per indexed row
  • No sensitive data in logs

Compatibility

  • --local-no-preallocate is a standard rclone local filesystem flag
  • No emoji/unicode characters added to console/log output

Outstanding Issues (unchanged from previous review)

Issue 1 — No Postgres integration test for bulk_index_items with NUL bytes

test_index_entity_markdown_strips_nul_bytes runs under SQLite (which silently accepts \x00) and validates service-layer behavior by asserting content_snippet is clean after indexing. It does not exercise bulk_index_items against a live Postgres instance and would not have caught the original CharacterNotInRepertoireError.

test_strip_nul_from_row (in tests/repository/test_postgres_search_repository.py) is a pure unit test for the dict-transformation helper — no DB interaction.

A test like the following in tests/repository/test_postgres_search_repository.py would close this gap:

@pytest.mark.asyncio
async def test_bulk_index_items_strips_nul_bytes(session_maker, test_project):
    """NUL bytes in content must not cause CharacterNotInRepertoireError on INSERT."""
    repo = PostgresSearchRepository(session_maker, project_id=test_project.id)
    now = datetime.now(timezone.utc)
    row = SearchIndexRow(
        project_id=test_project.id,
        id=99,
       ,
        content_stems="some\x00stems",
        content_snippet="snippet\x00here",
        permalink="test/nul-row",
        file_path="test/nul.md",
        type="entity",
        metadata={},
        created_at=now,
        updated_at=now,
    )
    # Should not raise CharacterNotInRepertoireError
    await repo.bulk_index_items([row])
    results = await repo.search(permalink="test/nul-row")
    assert len(results) == 1
    assert "\x00" not in (results[0].content_snippet or "")

Fix this →


Issue 2 — title field not stripped at the primary layer for observations

In search_service.py line 532, the observation title is built directly from obs.content without _strip_nul():

The entity title (line 496) and relation title (line 563) are also passed through without primary-layer stripping. _strip_nul_from_row() at the Postgres repository layer covers these as the secondary defense, but this makes the secondary defense load-bearing for title — which is not documented at the call sites.

Either apply _strip_nul() to the title fields at the primary layer for consistency, or add a comment near each noting that NUL sanitization for title is guaranteed by the secondary defense only.

Fix this →


Issue 3 — index_item() single-item path lacks secondary defense (low severity)

PostgresSearchRepository.index_item() (used by index_entity_file() for binary files) does not call _strip_nul_from_row(). In practice this is safe because binary file metadata doesn't contain NUL-padded text content. Worth a comment near index_item() noting this if the method is ever extended.


Positive Notes

  • Layered defense is well-conceived: primary fix in SearchService.index_entity_markdown() at content-build time, secondary safety net in PostgresSearchRepository.bulk_index_items() before INSERT. The comment explicitly calls out this relationship.
  • --local-no-preallocate is the right proactive fix: prevents NUL bytes from entering files in the first place alongside the defensive indexing fix.
  • Focused diff with no scope creep: changes are minimal and targeted.
  • All tests pass per the PR description (just fast-check — 1662 tests).

Overall this is ready to merge with minor follow-ups noted above. Issues 1–3 are non-blocking but Issue 1 is worth closing before the fix is considered fully verified against Postgres.

Postgres rejects \x00 in text columns at the ORM level. The NUL bytes
arrive via file content read from disk, so the test now only exercises
the content path passed to index_entity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 20, 2026

Claude finished @phernandez's task in 2m 41s —— View job


PR Review: fix: strip NUL bytes from content before PostgreSQL search indexing

  • Gather context and understand the request
  • Read changed files and check previous review issues
  • Review against team checklist
  • Post comprehensive review

Summary

This is the third review pass. The NUL byte stripping implementation is correct, clean, and well-motivated. The bd09b4b commit explains why a live Postgres observation-insertion test was removed: Postgres rejects \x00 at the ORM level before search indexing can strip it, so testing that code path directly against Postgres requires a SearchIndexRow approach (as noted in issue 1 below). All three outstanding issues from the previous two review cycles remain unresolved.


Team Checklist

Code Quality & Standards

  • Follows Basic Memory coding conventions in CLAUDE.md
  • Python 3.12+ type annotations and async patterns
  • SQLAlchemy 2.0 best practices
  • FastAPI and Typer conventions followed
  • 100-character line length limit maintained
  • No commented-out code blocks

Testing & Documentation

  • Unit tests for _strip_nul() and _strip_nul_from_row()
  • Service-level integration test for NUL stripping (test_index_entity_markdown_strips_nul_bytes)
  • rclone flag tests for sync and bisync
  • Upstream rclone bug referenced in both code comments and tests (rclone/rclone#6801)
  • [~] No Postgres end-to-end bulk_index_items test with NUL bytes (Issue 1 — see below)

Basic Memory Architecture

  • No database schema changes (no migration required)
  • Preserves local-first architecture principles
  • Literate programming style followed — constraint comments and section headers present

Security & Performance

  • No hardcoded secrets or credentials
  • str.replace() is O(n) — minimal overhead per indexed row
  • No sensitive data in logs

Compatibility

  • --local-no-preallocate is a standard rclone local filesystem flag
  • No emoji/unicode characters added to console/log output

Outstanding Issues (third time noted — non-blocking but worth resolving before close)

Issue 1 — No Postgres integration test for bulk_index_items with NUL bytes

test_index_entity_markdown_strips_nul_bytes (SQLite) validates service-layer behaviour by asserting content_snippet is clean. test_strip_nul_from_row is a pure dict-transformation unit test with no database interaction.

The bd09b4b commit correctly removed an observation-insertion path (Postgres rejects \x00 in entity observation columns before search indexing can touch it). The right approach is to construct a SearchIndexRow directly with NUL bytes and call bulk_index_items against a live Postgres instance:

@pytest.mark.asyncio
async def test_bulk_index_items_strips_nul_bytes(session_maker, test_project):
    """NUL bytes in content must not cause CharacterNotInRepertoireError on INSERT."""
    repo = PostgresSearchRepository(session_maker, project_id=test_project.id)
    now = datetime.now(timezone.utc)
    row = SearchIndexRow(
        project_id=test_project.id,
        id=99,
       ,
        content_stems="some\x00stems",
        content_snippet="snippet\x00here",
        permalink="test/nul-row",
        file_path="test/nul.md",
        type="entity",
        metadata={},
        created_at=now,
        updated_at=now,
    )
    # Should not raise CharacterNotInRepertoireError
    await repo.bulk_index_items([row])
    results = await repo.search(permalink="test/nul-row")
    assert len(results) == 1
    assert "\x00" not in (results[0].content_snippet or "")

Fix this →


Issue 2 — title field not stripped at primary layer (three occurrences)

search_service.py line 532:

Lines 496 and 563 pass entity.title and relation_title (built from entity titles) without _strip_nul(). The secondary defense in _strip_nul_from_row() handles these, but the primary layer is inconsistent: content_stems and content_snippet are always stripped at first touch; title is not. This makes the secondary defense silently load-bearing for title, which is not documented at the call sites.

Simplest fix — wrap the three title expressions in _strip_nul():

  • title=_strip_nul(entity.title) (line 496)
  • title=_strip_nul(f"{obs.category}: {obs.content[:100]}...") (line 532)
  • relation_title = _strip_nul(f"...") (line 551-555)

Fix this →


Issue 3 — index_item() single-item path lacks secondary defense (low severity)

PostgresSearchRepository.index_item() (lines 88–143) does not call _strip_nul_from_row() before the INSERT. In practice this is safe because index_item() is only called from index_entity_file() for binary file metadata (no NUL-padded content), but the asymmetry between index_item and bulk_index_items is a latent footgun if the method is extended. At minimum, a comment near index_item() noting that NUL sanitization is not applied here (and why it is safe) would prevent future confusion.

Fix this →


Positive Notes

  • bd09b4b reasoning is correct: removing the Postgres-incompatible observation NUL test is the right call. Inserting \x00 bytes via entity observations hits Postgres at the ORM level before search indexing, so the SearchIndexRow path is the appropriate place to test bulk_index_items sanitization.
  • Layered defense is sound: primary fix in SearchService.index_entity_markdown() at content-build time; secondary safety net in PostgresSearchRepository.bulk_index_items() before INSERT; upstream prevention via --local-no-preallocate.
  • Test for test_index_entity_markdown_strips_nul_bytes is valid: even running on SQLite, it verifies service-layer stripping by asserting content_snippet is clean after indexing — it would fail if _strip_nul() were absent.
  • Focused, no scope creep: diff is minimal and on-target.

Overall: ready to merge — Issues 1–3 are non-blocking. Issue 2 (title fields) is the most worth closing since it makes the secondary defense silently load-bearing without documentation.

…ndex paths

- Apply _strip_nul() to entity, observation, and relation title fields
  at the primary sanitization layer for consistency with content_stems
  and content_snippet
- Add _strip_nul_from_row() to index_item() single-item path so both
  index_item and bulk_index_items have secondary defense
- Add Postgres integration tests for bulk_index_items and index_item
  with NUL bytes to prove CharacterNotInRepertoireError is prevented

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 20, 2026

Claude finished @phernandez's task in 1m 45s —— View job


PR Review: fix: strip NUL bytes from content before PostgreSQL search indexing

  • Gather context and understand the request
  • Read changed files and check previous review issues
  • Review against team checklist
  • Post comprehensive review

Summary

This is the fourth and final review pass. Commit 35328e0 ("fix: address PR review — NUL stripping on all title fields and both index paths") resolves all three issues raised across the previous three review cycles. The PR is now complete and ready to merge.


Team Checklist

Code Quality & Standards

  • Follows Basic Memory coding conventions in CLAUDE.md
  • Python 3.12+ type annotations and async patterns
  • SQLAlchemy 2.0 best practices
  • FastAPI and Typer conventions followed
  • 100-character line length limit maintained
  • No commented-out code blocks

Testing & Documentation

  • Unit tests for _strip_nul() (test_strip_nul_removes_nul_bytes)
  • Unit test for _strip_nul_from_row() (test_strip_nul_from_row)
  • Service-level integration test (test_index_entity_markdown_strips_nul_bytes)
  • Postgres end-to-end bulk_index_items test with NUL bytes (test_bulk_index_items_strips_nul_bytes) — resolved
  • Postgres index_item() NUL test (test_index_item_strips_nul_bytes) — resolved
  • rclone flag tests for sync and bisync
  • Upstream rclone bug referenced in code comments and tests (rclone/rclone#6801)
  • 100% test coverage maintained

Basic Memory Architecture

  • No database schema changes (no migration required)
  • Preserves local-first architecture principles
  • Literate programming style followed — constraint comments and section headers present

Security & Performance

  • No hardcoded secrets or credentials
  • str.replace() is O(n) — minimal overhead per indexed row
  • No sensitive data in logs

Compatibility

  • --local-no-preallocate is a standard rclone local filesystem flag
  • No emoji/unicode characters added to console/log output

Previously Outstanding Issues — All Resolved

Issue 1 — Postgres end-to-end bulk_index_items test ✅

tests/repository/test_postgres_search_repository.py now contains test_bulk_index_items_strips_nul_bytes (lines 244–266): constructs a SearchIndexRow with \x00 in title, content_stems, and content_snippet, calls bulk_index_items against a live Postgres instance, asserts no exception and verifies the stored content_snippet and title are clean. Closes the gap where only the SQLite path was exercised.

Also added test_index_item_strips_nul_bytes (lines 269–291) covering the single-item index_item() path, which addresses Issue 3 as well.

Issue 2 — title fields stripped at primary layer ✅

search_service.py now applies _strip_nul() consistently at first touch for all three title fields:

  • title=_strip_nul(entity.title) (line 496)
  • title=_strip_nul(f"{obs.category}: {obs.content[:100]}...") (line 532)
  • relation_title = _strip_nul(...) (lines 551–554)

The primary and secondary defense layers are now symmetric: all fields are sanitized at content-build time, with _strip_nul_from_row() as a belt-and-suspenders fallback before INSERT.

Issue 3 — index_item() secondary defense ✅

PostgresSearchRepository.index_item() now calls _strip_nul_from_row(insert_data) at line 104, matching the defense applied in bulk_index_items(). The test_index_item_strips_nul_bytes test confirms this path works end-to-end against Postgres.


Final Notes

The layered defense strategy is now fully realized:

  1. Primary: _strip_nul() applied to all text fields in SearchService.index_entity_markdown() before SearchIndexRow construction
  2. Secondary: _strip_nul_from_row() applied to both index_item() and bulk_index_items() before INSERT
  3. Upstream prevention: --local-no-preallocate flag on rclone sync and bisync to prevent NUL padding at the source

Ready to merge.

@phernandez phernandez merged commit edb7991 into main Feb 21, 2026
26 checks passed
@phernandez phernandez deleted the hotfix/strip-nul-bytes-rclone-preallocate branch February 21, 2026 00:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rclone bisync reads stale content from Tigris edge cache for non-US users

1 participant