Fix block reference handling in ByteArrayBackedDataSource by lurongjiang · Pull Request #1044 · apache/poi

lurongjiang · 2026-04-08T01:03:22Z

Modify the ByteArrayBackedDataSource.read method to handle positions beyond the file size.
Return a zero-padded buffer instead of throwing an exception when the position exceeds the file size.

…f the file - Modify the ByteArrayBackedDataSource.read method to handle positions beyond the file size. - Return a zero-padded buffer instead of throwing an exception when the position exceeds the file size.

lurongjiang · 2026-04-08T01:15:00Z

side effect

- Remove redundant zero-padding logic in ByteArrayBackedDataSource. - Modify FileBackedDataSource to return an empty buffer instead of throwing an exception when the position exceeds the file size. - Add test cases for HWPFParser targeting WPS and Office 97-2003 document formats.

lurongjiang · 2026-04-08T15:59:27Z

Changes Made

Removed zero-padding logic
- During standard .doc file parsing, zero-padding introduced unexpected exceptions due to invalid buffer content.
- Added test cases for both standard (Microsoft Office 97-2003 .doc) and non-standard (WPS-compatible) formats to ensure robustness.
Enhanced out-of-bounds handling in FileBackedDataSource
- Discovered that file-based operations rely on FileBackedDataSource, which previously threw exceptions when position >= size().
- Now returns an empty ByteBuffer.allocate(length) instead of throwing an exception, ensuring consistent behavior with ByteArrayBackedDataSource.
- Added corresponding test cases to validate boundary conditions for file-backed reads.

Impact Assessment

Since the original implementation threw exceptions on out-of-bounds access, the new behavior (returning an empty buffer) maintains backward compatibility for valid cases.
For invalid cases (e.g., corrupted files or incomplete reads), the change prevents abrupt exceptions, allowing callers to handle incomplete data gracefully (e.g., failing silently or logging errors).
No functional regression expected, as the new logic aligns with the previous exception-throwing behavior in terms of outcome (failure to parse invalid data).

Testing Validation

Verified that:
- Standard .doc files parse correctly (no zero-padding artifacts).
- Non-standard WPS files remain supported.
- Out-of-bounds reads (both byte-array and file-backed) now return empty buffers instead of exceptions.

pjfanning · 2026-04-08T16:08:05Z

target main branch - not 5.5.x

pjfanning · 2026-04-08T16:10:45Z

    public ByteBuffer read(int length, long position) throws IOException {
        if (position >= size()) {
-            throw new IndexOutOfBoundsException("Position " + position + " past the end of the file");
+           return ByteBuffer.allocate(length);


add a comment about why this is allowed - like in the other class above

pjfanning · 2026-04-08T16:13:23Z

+            WordExtractor extractor = new WordExtractor(doc);
+            String text = extractor.getText();
+            assertNotNull(doc);
+            assertNotNull(text);


must test the actual text - it being not null is not enough

pjfanning · 2026-04-08T16:17:29Z

I must admit that I'm basically -1 on this.

I don't see why we should be returning empty ByteBuffers.

I think we should fail when we read corrupt files.

Feel free to fork POI and hack it how you please but I can't in good conscience agree to a change like this.

If it was an option where this behaviour only kicks in if you set a flag on the extractor to say ignore corrupt blocks - I would possibly agree.

pjfanning · 2026-04-08T16:31:15Z

relates to #1041

…ding - Added strict mode controlled by system properties in ByteArrayBackedDataSource and FileBackedDataSource. - By default, throw an IndexOutOfBoundsException instead of returning a zero-filled buffer when the position exceeds EOF (end of file). - Allow enabling tolerance mode by setting the system property `org.apache.poi.poifs.allowCorruptBlocks`. - Improved assertion logic in test code to verify actual text content rather than just checking for non-emptiness.

lurongjiang · 2026-04-09T00:25:47Z

Key Changes:

Added Configurable Tolerance Mode (org.apache.poi.poifs.allowCorruptBlocks)
- Modified ByteArrayBackedDataSource.java and FileBackedDataSource.java
- Default behavior: Strict mode - throws IndexOutOfBoundsException when encountering blocks beyond EOF
- Optional tolerant mode: Set system property to true to allow reading corrupt files with missing blocks
- System property is checked dynamically at runtime, allowing flexible configuration
Improved Test Coverage
- Updated TestHWPFParser.java with comprehensive tests:
  - testDocRead(): Verifies file can be read in tolerant mode with actual text content validation
  - testDocReadStrictMode(): Verifies strict mode properly rejects corrupt files
  - testWpsDocByFs(): Tests file system-based reading with content validation
  - testOffice97_2003DocRead(): Validates normal document reading
- All tests now verify actual text content (not just non-null), checking:
  - Text is not null
  - Text is not empty
  - Text is not blank (after trimming)
Clear Error Messages
- Exception messages guide users on how to enable tolerant mode if needed
- Example: "Position X is beyond EOF (Y). Set system property 'org.apache.poi.poifs.allowCorruptBlocks' to true to allow reading corrupt files with missing blocks."

Design Rationale:

Fail-fast by default: Aligns with Apache POI's principle of strict validation
Opt-in tolerance: Users who need to handle damaged documents can enable it via system property
No API changes: Backward compatible, uses standard Java system properties
Follows existing patterns: Similar to other POI configuration options like org.apache.poi.ss.ignoreMissingFontSystem

Testing:

All tests pass successfully, verifying both strict and tolerant modes work as expected.

@pjfanning

pjfanning · 2026-04-09T07:13:37Z

This is a WPS bug so I don't think we should make hacks in fundamental parts of POI code:
#1041 (comment)

Report the issue to WPS

lurongjiang · 2026-04-09T12:06:09Z

Thank you for the review and feedback. I understand your concerns.

My original intention was that since POI aims to support various Office document formats, I thought it should be able to handle WPS-generated files as well. When I encountered parsing failures with certain WPS documents, I attempted to add a tolerance mechanism as a workaround.

However, I now understand your point. This is not really a WPS bug, but rather a limitation of POI's current implementation - it cannot handle certain non-standard file formats generated by WPS. Adding workarounds in POI's core components (like ByteArrayBackedDataSource and FileBackedDataSource) would compromise the integrity of the codebase.

I'll close this PR as suggested. Users who need to handle such files can implement workarounds at the application level rather than modifying the library itself.

pjfanning · 2026-04-09T12:08:01Z

Thank you for the review and feedback. I understand your concerns.

My original intention was that since POI aims to support various Office document formats, I thought it should be able to handle WPS-generated files as well. When I encountered parsing failures with certain WPS documents, I attempted to add a tolerance mechanism as a workaround.

However, I now understand your point. This is not really a WPS bug, but rather a limitation of POI's current implementation - it cannot handle certain non-standard file formats generated by WPS. Adding workarounds in POI's core components (like ByteArrayBackedDataSource and FileBackedDataSource) would compromise the integrity of the codebase.

I'll close this PR as suggested. Users who need to handle such files can implement workarounds at the application level rather than modifying the library itself.

I can't open the file using Microsoft Word. So it is a WPS bug.

Fix (HWPF): Address the issue of block references exceeding the end o…

2bac6b5

…f the file - Modify the ByteArrayBackedDataSource.read method to handle positions beyond the file size. - Return a zero-padded buffer instead of throwing an exception when the position exceeds the file size.

lurongjiang closed this Apr 8, 2026

lurongjiang reopened this Apr 8, 2026

pjfanning reviewed Apr 8, 2026

View reviewed changes

lurongjiang closed this Apr 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix block reference handling in ByteArrayBackedDataSource#1044

Fix block reference handling in ByteArrayBackedDataSource#1044
lurongjiang wants to merge 3 commits intoapache:5.5.xfrom
lurongjiang:fix_doc

lurongjiang commented Apr 8, 2026

Uh oh!

lurongjiang commented Apr 8, 2026

Uh oh!

lurongjiang commented Apr 8, 2026

Uh oh!

pjfanning commented Apr 8, 2026

Uh oh!

pjfanning Apr 8, 2026

Uh oh!

pjfanning Apr 8, 2026

Uh oh!

pjfanning commented Apr 8, 2026

Uh oh!

pjfanning commented Apr 8, 2026

Uh oh!

lurongjiang commented Apr 9, 2026 •

edited

Loading

Uh oh!

pjfanning commented Apr 9, 2026 •

edited

Loading

Uh oh!

lurongjiang commented Apr 9, 2026

Uh oh!

pjfanning commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lurongjiang commented Apr 8, 2026

Uh oh!

lurongjiang commented Apr 8, 2026

Uh oh!

lurongjiang commented Apr 8, 2026

Changes Made

Impact Assessment

Testing Validation

Uh oh!

pjfanning commented Apr 8, 2026

Uh oh!

pjfanning Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

pjfanning Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

pjfanning commented Apr 8, 2026

Uh oh!

pjfanning commented Apr 8, 2026

Uh oh!

lurongjiang commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Key Changes:

Design Rationale:

Testing:

Uh oh!

pjfanning commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lurongjiang commented Apr 9, 2026

Uh oh!

pjfanning commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lurongjiang commented Apr 9, 2026 •

edited

Loading

pjfanning commented Apr 9, 2026 •

edited

Loading