Conversation
refactor(version-control): remove last_pruned column fix(tests): update assertions for external files
Jorgedanisc
There was a problem hiding this comment.
Pull request overview
Updates the DUC schema and supporting libraries to separate external-file revision blobs from revision metadata (enabling metadata-only reads/lazy loading), while also removing pruning-related fields and refreshing README asset links.
Changes:
- Bump schema to
3000002and splitexternal_file_revisions.datainto a newexternal_file_revision_datatable. - Update Rust/JS/Python types, parsing, and serialization to use a separate
filesData/external_files_datablob map keyed by revision id. - Remove pruning-level / last-pruned fields across schema + SDKs, and switch README images to jsDelivr CDN.
Reviewed changes
Copilot reviewed 36 out of 36 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| schema/version_control.sql | Removes last_pruned column from version_graph definition. |
| schema/migrations/3000001_to_3000002.sql | Migration adds external_file_revision_data and recreates external_file_revisions without blob column. |
| schema/duc.sql | Bumps user_version to 3000002; updates external file tables; removes pruning_level from global state. |
| packages/ducsvg/README.md | Switches logo URL to jsDelivr CDN. |
| packages/ducrs/src/types.rs | Removes pruning types/fields; introduces revision metadata type + ExternalFileLoaded; adds filesData to exported state. |
| packages/ducrs/src/serialize.rs | Writes external file metadata + separate blob table; removes pruning/global-state fields from SQL writes. |
| packages/ducrs/src/parse.rs | Reads external file blobs into separate map; updates get_external_file to return metadata + blob map. |
| packages/ducrs/src/lib.rs | Updates tests to account for external_files_data and get_external_file return type. |
| packages/ducrs/src/api/version_control.rs | Removes last_pruned reads and pruning write APIs. |
| packages/ducrs/README.md | Switches logo URL to jsDelivr CDN. |
| packages/ducpy/src/tests/src/test_pdf_image_elements.py | Adjusts SQL test inserts to new external file blob table. |
| packages/ducpy/src/tests/src/test_create_image_element_with_external_file.py | Adjusts SQL inserts/selects to join external_file_revision_data. |
| packages/ducpy/src/tests/src/test_a_duc_with_everything.py | Updates global state insert + external file blob inserts for new schema. |
| packages/ducpy/src/tests/src/test_CSPMDS_local_and_global_states.py | Removes pruning_level usage in global state SQL insert. |
| packages/ducpy/src/ducpy/serialize.py | Emits files (meta) + filesData (blobs) payload split for Rust serializer. |
| packages/ducpy/src/ducpy/enums.py | Removes PRUNING_LEVEL enum. |
| packages/ducpy/src/ducpy/classes/DataStateClass.py | Removes pruning fields; adds files_data to exported state. |
| packages/ducpy/src/ducpy/builders/state_builders.py | Removes pruning builder APIs; attaches _data_blobs for split blob serialization. |
| packages/ducpy/docs/index.rst | Switches logo URL to jsDelivr CDN. |
| packages/ducpy/README.md | Switches logo URL to jsDelivr CDN. |
| packages/ducpdf/src/duc2pdf/src/utils/svg_to_pdf.rs | Updates SVG conversion helper to accept active revision bytes explicitly. |
| packages/ducpdf/src/duc2pdf/src/streaming/stream_resources.rs | Updates resource processing to require separate revision bytes rather than revision-embedded blobs. |
| packages/ducpdf/src/duc2pdf/src/builder.rs | Passes active revision bytes from external_files_data into resource processing. |
| packages/ducpdf/src/duc2pdf/README.md | Switches logo URL to jsDelivr CDN. |
| packages/ducpdf/README.md | Switches logo URL to jsDelivr CDN. |
| packages/ducjs/src/utils/state/index.ts | Removes default pruningLevel from global state. |
| packages/ducjs/src/utils/normalize.ts | Removes unused grid normalization helpers. |
| packages/ducjs/src/utils/constants.ts | Removes unused DEFAULT_GRID_STEP. |
| packages/ducjs/src/types/index.ts | Introduces filesData + revision-meta types; removes pruning types/fields. |
| packages/ducjs/src/transform.ts | Stops normalizing pruningLevel for Rust. |
| packages/ducjs/src/restore/restoreDataState.ts | Restores file metadata separately from file blobs via filesData; removes pruning restore. |
| packages/ducjs/src/parse.ts | Initializes filesData in lazy parse path. |
| packages/ducjs/src/lazy-files.ts | Returns ExternalFileLoaded from WASM; exposes active revision bytes via ResolvedFileData; adds export/merge for filesData. |
| packages/ducjs/src/enums.ts | Removes PRUNING_LEVEL enum. |
| packages/ducjs/README.md | Switches logo URL to jsDelivr CDN. |
| README.md | Switches logo URL to jsDelivr CDN. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -479,36 +480,34 @@ impl DucToPdfBuilder { | |||
| Some(rev) => rev.mime_type.clone(), | |||
| None => continue, | |||
| }; | |||
|
|
|||
| let rev_data = files_data | |||
| .as_ref() | |||
| .and_then(|d| d.get(&file.active_revision_id)) | |||
| .map(|b| b.as_ref()); | |||
There was a problem hiding this comment.
files_data is cloned from self.context.exported_data.external_files_data before the loop. Since this map can contain large blobs, the clone can double peak memory usage for big documents. Prefer borrowing (as_ref()) and looking up blobs by reference (or cloning only the single active blob you need per file).
| def _convert_external_files( | ||
| entries: Optional[Any], | ||
| ) -> Optional[Dict[str, Any]]: | ||
| """Convert DucExternalFile dict (or legacy list) to camelCase dicts for serialization.""" | ||
| ) -> Tuple[Optional[Dict[str, Any]], Optional[Dict[str, bytes]]]: | ||
| """Convert DucExternalFile dict (or legacy list) to camelCase dicts for serialization. | ||
|
|
||
| Returns ``(files_meta, files_data)`` where *files_meta* contains revision | ||
| metadata (no blobs) and *files_data* maps revision-id → raw bytes for the | ||
| separate ``filesData`` key expected by the Rust serializer. | ||
| """ | ||
| if not entries: | ||
| return None | ||
| return None, None | ||
|
|
||
| data_blobs: Dict[str, bytes] = {} | ||
|
|
||
| def _extract_blobs(entry: Any) -> None: | ||
| blobs = getattr(entry, "_data_blobs", None) | ||
| if blobs: | ||
| data_blobs.update(blobs) | ||
|
|
||
| # New format: already a dict mapping id → DucExternalFile | ||
| if isinstance(entries, dict): | ||
| result: dict = {} | ||
| for key, value in entries.items(): | ||
| _extract_blobs(value) | ||
| if is_dataclass(value): | ||
| value = asdict(value) | ||
| result[key] = deep_snake_to_camel(value) if isinstance(value, dict) else value | ||
| return result if result else None | ||
| return (result if result else None), (data_blobs if data_blobs else None) |
There was a problem hiding this comment.
_convert_external_files() only extracts binary blobs from a private _data_blobs attribute. If callers pass external files as plain dicts (or legacy objects) with inline revision data/dataURL, those bytes will be silently dropped because the Rust structs no longer include data in revision metadata. Consider also extracting inline revision data from dict inputs (and removing it from the metadata payload), or raising an explicit error when blobs are present but no filesData can be produced.
| -- Migration: 3000001 → 3000002 | ||
| -- Splits binary data out of `external_file_revisions` into a dedicated | ||
| -- `external_file_revision_data` table so metadata can be read without | ||
| -- loading heavy blobs. | ||
|
|
There was a problem hiding this comment.
This migration bumps user_version to 3000002 but only modifies the external file tables. Since the base schema for 3000002 also removes duc_global_state.pruning_level and version_graph.last_pruned, upgrading from 3000001 will leave those columns in place, making the on-disk schema dependent on upgrade path. Consider updating this migration to also drop/recreate those tables/columns (or keep them in the base schema if intentional).
| // Write data blobs from the separate filesData map | ||
| if let Some(data_map) = files_data { | ||
| for (rev_id, blob) in data_map { | ||
| data_stmt.execute(params![rev_id, blob.as_ref()])?; | ||
| } |
There was a problem hiding this comment.
files_data is written wholesale into external_file_revision_data without verifying that each blob corresponds to an existing revision (and FK checks are disabled for the whole export). This can produce orphaned data rows or revisions with no data rows, depending on the caller, yielding an internally inconsistent .duc. Consider (a) filtering files_data to only revision IDs present in files, and (b) erroring (or at least warning) when an active revision is missing a blob.
| for (const revKey in rawRevisions) { | ||
| if (!Object.prototype.hasOwnProperty.call(rawRevisions, revKey)) continue; | ||
| if (result[revKey]) continue; // filesData takes precedence | ||
| const rev = rawRevisions[revKey]; | ||
| if (!rev || typeof rev !== "object") continue; | ||
| const r = rev as Record<string, unknown>; | ||
| const dataSource = r.data ?? (r as any).dataURL; | ||
| const data = isValidUint8Array(dataSource); | ||
| if (data) { | ||
| result[revKey] = data; | ||
| } |
There was a problem hiding this comment.
restoreFilesData() uses revKey (the revisions map key) as the key into filesData, but the canonical key for blobs is the revision id (r.id). If the map key differs from r.id (or if legacy exports key by something else), the restored blobs won't be found by consumers that index by revision id (e.g., activeRevisionId). Use the validated revId/r.id for both the result[...] key and the precedence check.
📋 Release Preview
|
|
🎉 This PR is included in version 3.2.0 🎉 The release is available on:
Your semantic-release bot 📦🚀 |
|
🎉 This PR is included in version 3.2.0 🎉 The release is available on:
Your semantic-release bot 📦🚀 |
|
🎉 This PR is included in version 3.2.0 🎉 The release is available on:
Your semantic-release bot 📦🚀 |
|
🎉 This PR is included in version 3.2.0 🎉 The release is available on:
Your semantic-release bot 📦🚀 |
|
🎉 This PR is included in version 3.7.0 🎉 The release is available on:
Your semantic-release bot 📦🚀 |
|
🎉 This PR is included in version 3.6.0 🎉 The release is available on:
Your semantic-release bot 📦🚀 |
No description provided.