fix: strip NUL bytes from content before PostgreSQL search indexing #592
phernandez merged 13 commits into main
Signed-off-by: phernandez <paul@basicmachines.co> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
…ansactions

--header-download and --header-upload only apply to GET and PUT requests. bisync starts by listing files via S3 ListObjectsV2, which is neither a download nor an upload, so the consistency header was never sent on list requests. Non-US users saw stale edge-cached metadata and bisync reported "0 changes."

--header is rclone's global flag that applies to ALL HTTP transactions (list, download, upload), which is a superset of the previous two flags.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
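To illustrate the flag change described in this commit, here is a minimal sketch of a command builder that passes the header through rclone's global `--header` flag. The function name `build_bisync_command` and the header value are hypothetical: the commit message does not name the actual header or the project's helper.

```python
from typing import List

# Hypothetical header: the commit message does not name the real one.
CONSISTENCY_HEADER = "X-Example-Consistency: strong"


def build_bisync_command(
    local_path: str,
    remote_path: str,
    header: str = CONSISTENCY_HEADER,
) -> List[str]:
    """Return an rclone bisync argv that sends `header` on every HTTP request."""
    return [
        "rclone",
        "bisync",
        local_path,
        remote_path,
        # --header covers list, download, and upload requests alike,
        # unlike --header-download / --header-upload.
        "--header",
        header,
    ]


cmd = build_bisync_command("/notes", "remote:bucket/notes")
```

Returning an argv list rather than a shell string keeps the header value, which contains a space, as a single argument.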
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
rclone preallocation on virtual filesystems (e.g. Google Drive File Stream) pads markdown files with \x00 bytes (rclone/rclone#6801), which PostgreSQL rejects with CharacterNotInRepertoireError during search indexing.

Three-pronged fix:
- 🛡️ Primary: _strip_nul() in SearchService.index_entity_markdown() sanitizes content_stems, content_snippet, and observation/relation content before building SearchIndexRow objects
- 🛡️ Secondary: _strip_nul_from_row() in PostgresSearchRepository.bulk_index_items() as a safety net before INSERT
- 🔧 Prevention: --local-no-preallocate flag added to rclone sync and bisync commands to prevent the padding at the source

Fixes #548

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
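The two sanitization layers named in this commit could look roughly like the sketch below. It is not the PR's implementation: `Row` is a hypothetical stand-in for SearchIndexRow with only a few fields, and the helper names drop the leading underscore to mark them as illustrative.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


def strip_nul(value: Optional[str]) -> Optional[str]:
    """Remove NUL characters, which PostgreSQL text columns reject."""
    if value is None:
        return None
    return value.replace("\x00", "")


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for SearchIndexRow (assumed fields only)."""

    title: Optional[str]
    content_stems: Optional[str]
    content_snippet: Optional[str]
    permalink: Optional[str] = None


def strip_nul_from_row(row: Row) -> Row:
    """Return a copy of `row` with NUL characters removed from every string field."""
    cleaned = {
        f.name: strip_nul(getattr(row, f.name))
        for f in fields(row)
        if isinstance(getattr(row, f.name), str)
    }
    return replace(row, **cleaned)


dirty = Row(title="a\x00b", content_stems="stems\x00", content_snippet=None)
clean = strip_nul_from_row(dirty)
```

Cleaning once where the row is built and again just before the INSERT means a row created by some other code path still gets sanitized.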
Signed-off-by: phernandez <paul@basicmachines.co>
Resolve conflicts keeping hotfix NUL-byte stripping alongside main's semantic search, vector embeddings, and full content_snippet changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
Claude finished @phernandez's task in 3m 27s
PR Review: fix: strip NUL bytes from content before PostgreSQL search indexing
Summary
This is a solid, well-targeted fix with a clear layered defense strategy. The root cause (rclone preallocation on virtual filesystems) is well understood and well documented. The code is clean and minimal. A few minor gaps are noted below.
Team Checklist
Code Quality & Standards
Testing & Documentation
Basic Memory Architecture
Security & Performance
Compatibility
Issues Found
Issue 1: No Postgres integration test for end-to-end NUL rejection
The postgres-specific test file (
Adding a Postgres integration test like this would close the gap:

```python
from datetime import datetime, timezone

import pytest

# PostgresSearchRepository and SearchIndexRow are assumed to be imported
# from the project's own repository modules.


@pytest.mark.asyncio
async def test_bulk_index_items_strips_nul_bytes(session_maker, test_project):
    """NUL bytes in content must not cause CharacterNotInRepertoireError on INSERT."""
    repo = PostgresSearchRepository(session_maker, project_id=test_project.id)
    now = datetime.now(timezone.utc)
    row = SearchIndexRow(
        project_id=test_project.id,
        id=99,
        content_stems="some\x00stems",
        content_snippet="snippet\x00here",
        permalink="test/nul-row",
        file_path="test/nul.md",
        type="entity",
        metadata={},
        created_at=now,
        updated_at=now,
    )
    # Should not raise CharacterNotInRepertoireError
    await repo.bulk_index_items([row])
    results = await repo.search(permalink="test/nul-row")
    assert len(results) == 1
    assert "\x00" not in (results[0].content_snippet or "")
```

Issue 2:
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
Claude finished @phernandez's task in 2m 3s
PR Review: fix: strip NUL bytes from content before PostgreSQL search indexing
Summary
This review covers the updated branch after the latest synchronize (merge from
Team Checklist
Code Quality & Standards
Testing & Documentation
Basic Memory Architecture
Security & Performance
Compatibility
Outstanding Issues (unchanged from previous review)
Issue 1: No Postgres integration test for
Postgres rejects \x00 in text columns at the ORM level. The NUL bytes arrive via file content read from disk, so the test now only exercises the content path passed to index_entity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
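As a rough illustration of the content path this commit refers to, the sketch below writes a markdown file padded with NUL bytes, as preallocation might leave it, then reads it back and sanitizes it. The helper name `read_markdown_without_nul` is hypothetical and not taken from the project.

```python
import tempfile
from pathlib import Path


def read_markdown_without_nul(path: Path) -> str:
    """Read a UTF-8 text file and drop any NUL characters it contains."""
    # NUL is valid UTF-8, so decoding succeeds; only PostgreSQL rejects it later.
    text = path.read_text(encoding="utf-8")
    return text.replace("\x00", "")


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.md"
    # Simulate preallocation padding: real content followed by NUL bytes.
    note.write_bytes(b"# Title\n\nBody text\n" + b"\x00" * 16)
    content = read_markdown_without_nul(note)
```

Because the padding decodes without error, nothing fails until the database INSERT, which is why the content has to be sanitized before indexing.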
Claude finished @phernandez's task in 2m 41s
PR Review: fix: strip NUL bytes from content before PostgreSQL search indexing
Summary
This is the third review pass. The NUL byte stripping implementation is correct, clean, and well-motivated. The
Team Checklist
Code Quality & Standards
Testing & Documentation
Basic Memory Architecture
Security & Performance
Compatibility
Outstanding Issues (third time noted; non-blocking but worth resolving before close)
Issue 1: No Postgres integration test for
…ndex paths

- Apply _strip_nul() to entity, observation, and relation title fields at the primary sanitization layer for consistency with content_stems and content_snippet
- Add _strip_nul_from_row() to index_item() single-item path so both index_item and bulk_index_items have secondary defense
- Add Postgres integration tests for bulk_index_items and index_item with NUL bytes to prove CharacterNotInRepertoireError is prevented

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
Claude finished @phernandez's task in 1m 45s
PR Review: fix: strip NUL bytes from content before PostgreSQL search indexing
Summary
This is the fourth and final review pass. Commit
Team Checklist
Code Quality & Standards
Testing & Documentation
Basic Memory Architecture
Security & Performance
Compatibility
Previously Outstanding Issues: All Resolved
Issue 1: Postgres end-to-end
Summary
- Strip \x00 bytes in SearchService.index_entity_markdown() before building search index rows (primary fix)
- Add _strip_nul_from_row() safety net in PostgresSearchRepository.bulk_index_items() before INSERT
- Add --local-no-preallocate flag to rclone sync and bisync commands to prevent NUL byte padding at the source

Context
Users syncing via rclone bisync to Google Drive File Stream on Windows get NUL bytes appended to markdown files due to a known rclone preallocation bug with virtual filesystems (rclone/rclone#6801). When these files are indexed, PostgreSQL rejects them with CharacterNotInRepertoireError.

Fixes #548
Test plan
- test_strip_nul_removes_nul_bytes: unit test for _strip_nul() helper
- test_index_entity_markdown_strips_nul_bytes: integration test that indexes an entity with NUL content and verifies no \x00 in stored rows
- test_strip_nul_from_row: unit test for Postgres repository helper
- test_project_sync_includes_no_preallocate_flag: verify flag in sync command
- test_project_bisync_includes_no_preallocate_flag: verify flag in bisync command
- just fast-check passes (lint, format, typecheck, 1662 tests, MCP smoke)

🤖 Generated with Claude Code