fix(oxml): keep generated ids within signed int32 range #36

Merged: airmang merged 1 commit into airmang:main from seonghoony:fix/id-generators-clear-bit31 on Apr 27, 2026

@seonghoony

Closes #34.

Root cause

_paragraph_id, _object_id, and _memo_id in src/hwpx/oxml/document.py masked uuid4().int with 0xFFFFFFFF. A version-4 UUID carries 122 random bits, and its low 32 bits are all random (the fixed version and variant bits sit higher up), so the masked output is uniform over the full unsigned 32-bit range. About 50% of generated ids therefore have bit 31 set (i.e. are >= 2^31).

Downstream consumers that parse the id attribute as a signed 32-bit integer interpret those values as negative, which has caused interop issues.
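To make the failure mode concrete, here is a minimal sketch of how an id with bit 31 set looks to a signed reader. It assumes the consumer does a plain two's-complement reinterpretation of the 32-bit pattern; real consumers may instead reject or clamp the value. The helper name as_signed_int32 is hypothetical, and the sample value is the one from the failing assertion quoted in this PR.

```python
import struct


def as_signed_int32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a two's-complement int32.

    Hypothetical helper for illustration; not part of hwpx.
    """
    return struct.unpack("<i", struct.pack("<I", value))[0]


sample = 3521445198  # 0xd1e4fd4e, bit 31 is set

print(as_signed_int32(sample))               # -773522098
print(as_signed_int32(sample & 0x7FFFFFFF))  # 1373961550
```

Clearing bit 31 leaves the value unchanged under both the signed and the unsigned reading, which is the property the fix relies on.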

Fix

Mask one bit lower (& 0x7FFFFFFF) so the result fits in [0, 2^31) and survives any consumer that uses signed int32. Three call sites updated:

 def _paragraph_id() -> str:
     """Generate an identifier for a new paragraph element."""
-    return str(uuid4().int & 0xFFFFFFFF)
+    return str(uuid4().int & 0x7FFFFFFF)


 def _object_id() -> str:
     """Generate an identifier suitable for table and shape objects."""
-    return str(uuid4().int & 0xFFFFFFFF)
+    return str(uuid4().int & 0x7FFFFFFF)


 def _memo_id() -> str:
     """Generate a lightweight identifier for memo elements."""
-    return str(uuid4().int & 0xFFFFFFFF)
+    return str(uuid4().int & 0x7FFFFFFF)

Collision risk roughly doubles because the id space halves: by the birthday approximation, the probability of at least one collision among N=10_000 ids rises from ~1.2e-2 (32-bit) to ~2.3e-2 (31-bit). For documents with a few hundred ids it stays below 1e-4 in either case.
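The birthday estimate can be checked for any document size with the standard approximation p ≈ 1 - exp(-N(N-1) / (2 * 2^bits)). The sketch below assumes ids are drawn independently and uniformly from the masked range; collision_probability is a hypothetical helper name.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Approximate P(at least one collision) among n uniform ids in [0, 2**bits)."""
    space = 2 ** bits
    # expm1 keeps precision when the exponent is close to zero.
    return -math.expm1(-n * (n - 1) / (2 * space))


for bits in (32, 31):
    for n in (500, 10_000):
        print(f"bits={bits} n={n}: {collision_probability(n, bits):.2e}")
```

Halving the id space doubles the exponent, so for small probabilities the 31-bit figure is about twice the 32-bit one at the same N.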

Regression test

tests/test_id_generator_range.py (new):

  • test_id_generators_stay_within_signed_int32 — samples 200 values from each of the three generators and asserts 0 <= v < 2**31.
  • test_id_generators_use_full_31_bit_range — samples 4,000 values from each generator and asserts at least one is >= 2^30, guarding against accidental over-restriction (e.g. a future change that masks too many bits).
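A rough sketch of what the two tests described above could look like. The actual test file is not reproduced here, so the structure is an assumption; to stay self-contained, the generators are redefined locally as stand-ins copied from the diff, whereas the real tests would import them from hwpx.oxml.document.

```python
from uuid import uuid4


# Stand-ins mirroring the patched generators; assumed equivalent to the
# functions in src/hwpx/oxml/document.py after this PR.
def _paragraph_id() -> str:
    return str(uuid4().int & 0x7FFFFFFF)


def _object_id() -> str:
    return str(uuid4().int & 0x7FFFFFFF)


def _memo_id() -> str:
    return str(uuid4().int & 0x7FFFFFFF)


GENERATORS = (_paragraph_id, _object_id, _memo_id)


def test_id_generators_stay_within_signed_int32() -> None:
    for gen in GENERATORS:
        for _ in range(200):
            value = int(gen())
            assert 0 <= value < 2**31, (
                f"{gen.__name__} produced {value} ({value:#x}); "
                "must be in [0, 2^31)"
            )


def test_id_generators_use_full_31_bit_range() -> None:
    # Each sample is >= 2^30 with probability 1/2, so 4,000 draws
    # missing every time is astronomically unlikely.
    for gen in GENERATORS:
        assert any(int(gen()) >= 2**30 for _ in range(4000))
```

The second test is probabilistic by design; its false-failure rate is 2^-4000 per generator, so it is stable in practice.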

Without the fix, test_id_generators_stay_within_signed_int32 fails immediately:

AssertionError: _paragraph_id produced 3521445198 (0xd1e4fd4e); must be in [0, 2^31)

With the fix:

$ pytest tests/test_id_generator_range.py
tests/test_id_generator_range.py ..                                      [100%]
============================== 2 passed in 0.06s ===============================

Wider regression check

$ pytest tests/test_id_generator_range.py tests/test_oxml_parsing.py tests/test_document_save_api.py
14 passed, 2 skipped in 0.12s

(2 skips are pre-existing, due to absent sample HWPX fixtures in this checkout.)

Notes

  • This PR is intentionally minimal: only the three random ID generators are touched.
  • The _allocate_*_id helpers (_allocate_char_property_id, _allocate_border_fill_id, _allocate_bin_item_id) propagate max(numeric_ids) + 1. They will start producing in-range values once the input document also has in-range ids — that is a follow-up topic and not part of this PR.
  • src/hwpx/data/Skeleton.hwpx ships with <hp:p id="3121190098"> (>= 2^31). That is tracked separately in #35 and patched separately so the two fixes can land independently.
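To illustrate the note about the _allocate_*_id helpers: a max-plus-one allocator inherits whatever range its input ids occupy. The sketch below is a simplified, hypothetical stand-in (next_numeric_id is not a function in hwpx, and the real helpers may differ in detail); it only shows why an out-of-range id in the input document keeps producing out-of-range ids.

```python
INT32_LIMIT = 2**31


def next_numeric_id(existing_ids: list[str]) -> str:
    """Return max(numeric ids) + 1 as a string, or "0" if none are numeric.

    Hypothetical simplification of the max-plus-one pattern.
    """
    numeric = [int(v) for v in existing_ids if v.isdigit()]
    return str(max(numeric) + 1) if numeric else "0"


in_range = next_numeric_id(["0", "1", "7"])
out_of_range = next_numeric_id(["0", "1", str(INT32_LIMIT + 5)])

print(in_range, int(in_range) < INT32_LIMIT)          # 8 True
print(out_of_range, int(out_of_range) < INT32_LIMIT)  # 2147483654 False
```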

_paragraph_id, _object_id, and _memo_id used uuid4().int & 0xFFFFFFFF, which
yields values >= 2^31 about 50% of the time. Downstream consumers that parse
the id attribute as a signed 32-bit integer treat those values as negative,
which has caused interop failures. Mask one bit lower so the result stays
in [0, 2^31) while keeping the practical collision rate unchanged.

Adds a regression test that asserts the new range across all three generators.
@airmang merged commit 3e5e348 into airmang:main on Apr 27, 2026
0 of 3 checks passed

Successfully merging this pull request may close these issues.

_paragraph_id / _object_id / _memo_id produce values that overflow signed 32-bit
