Skip to content

Fix block reference handling in ByteArrayBackedDataSource#1044

Closed
lurongjiang wants to merge 3 commits intoapache:5.5.xfrom
lurongjiang:fix_doc
Closed

Fix block reference handling in ByteArrayBackedDataSource#1044
lurongjiang wants to merge 3 commits intoapache:5.5.xfrom
lurongjiang:fix_doc

Conversation

@lurongjiang
Copy link
Copy Markdown

  • Modify the ByteArrayBackedDataSource.read method to handle positions beyond the file size.
  • Return a zero-padded buffer instead of throwing an exception when the position exceeds the file size.

…f the file

- Modify the ByteArrayBackedDataSource.read method to handle positions beyond the file size.
- Return a zero-padded buffer instead of throwing an exception when the position exceeds the file size.
@lurongjiang lurongjiang closed this Apr 8, 2026
@lurongjiang
Copy link
Copy Markdown
Author

side effect

- Remove redundant zero-padding logic in ByteArrayBackedDataSource.
- Modify FileBackedDataSource to return an empty buffer instead of throwing an exception when the position exceeds the file size.
- Add test cases for HWPFParser targeting WPS and Office 97-2003 document formats.
@lurongjiang lurongjiang reopened this Apr 8, 2026
@lurongjiang
Copy link
Copy Markdown
Author

Changes Made

  1. Removed zero-padding logic

    • During standard .doc file parsing, zero-padding introduced unexpected exceptions due to invalid buffer content.
    • Added test cases for both standard (Microsoft Office 97-2003 .doc) and non-standard (WPS-compatible) formats to ensure robustness.
  2. Enhanced out-of-bounds handling in FileBackedDataSource

    • Discovered that file-based operations rely on FileBackedDataSource, which previously threw exceptions when position >= size().
    • Now returns an empty ByteBuffer.allocate(length) instead of throwing an exception, ensuring consistent behavior with ByteArrayBackedDataSource.
    • Added corresponding test cases to validate boundary conditions for file-backed reads.

Impact Assessment

  • Since the original implementation threw exceptions on out-of-bounds access, the new behavior (returning an empty buffer) maintains backward compatibility for valid cases.
  • For invalid cases (e.g., corrupted files or incomplete reads), the change prevents abrupt exceptions, allowing callers to handle incomplete data gracefully (e.g., failing silently or logging errors).
  • No functional regression expected, as the new logic aligns with the previous exception-throwing behavior in terms of outcome (failure to parse invalid data).

Testing Validation

  • Verified that:
    • Standard .doc files parse correctly (no zero-padding artifacts).
    • Non-standard WPS files remain supported.
    • Out-of-bounds reads (both byte-array and file-backed) now return empty buffers instead of exceptions.

@pjfanning
Copy link
Copy Markdown
Member

target main branch - not 5.5.x

public ByteBuffer read(int length, long position) throws IOException {
if (position >= size()) {
throw new IndexOutOfBoundsException("Position " + position + " past the end of the file");
return ByteBuffer.allocate(length);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment about why this is allowed - like in the other class above

WordExtractor extractor = new WordExtractor(doc);
String text = extractor.getText();
assertNotNull(doc);
assertNotNull(text);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must test the actual text - it being not null is not enough

@pjfanning
Copy link
Copy Markdown
Member

I must admit that I'm basically -1 on this.

I don't see why we should be returning empty ByteBuffers.

I think we should fail when we read corrupt files.

Feel free to fork POI and hack it how you please but I can't in good conscience agree to a change like this.

If it was an option where this behaviour only kicks in if you set a flag on the extractor to say ignore corrupt blocks - I would possibly agree.

@pjfanning
Copy link
Copy Markdown
Member

relates to #1041

…ding

- Added strict mode controlled by system properties in ByteArrayBackedDataSource and FileBackedDataSource.
- By default, throw an IndexOutOfBoundsException instead of returning a zero-filled buffer when the position exceeds EOF (end of file).
- Allow enabling tolerance mode by setting the system property `org.apache.poi.poifs.allowCorruptBlocks`.
- Improved assertion logic in test code to verify actual text content rather than just checking for non-emptiness.
@lurongjiang
Copy link
Copy Markdown
Author

lurongjiang commented Apr 9, 2026

Key Changes:

  1. Added Configurable Tolerance Mode (org.apache.poi.poifs.allowCorruptBlocks)

    • Modified ByteArrayBackedDataSource.java and FileBackedDataSource.java
    • Default behavior: Strict mode - throws IndexOutOfBoundsException when encountering blocks beyond EOF
    • Optional tolerant mode: Set system property to true to allow reading corrupt files with missing blocks
    • System property is checked dynamically at runtime, allowing flexible configuration
  2. Improved Test Coverage

    • Updated TestHWPFParser.java with comprehensive tests:
      • testDocRead(): Verifies file can be read in tolerant mode with actual text content validation
      • testDocReadStrictMode(): Verifies strict mode properly rejects corrupt files
      • testWpsDocByFs(): Tests file system-based reading with content validation
      • testOffice97_2003DocRead(): Validates normal document reading
    • All tests now verify actual text content (not just non-null), checking:
      • Text is not null
      • Text is not empty
      • Text is not blank (after trimming)
  3. Clear Error Messages

    • Exception messages guide users on how to enable tolerant mode if needed
    • Example: "Position X is beyond EOF (Y). Set system property 'org.apache.poi.poifs.allowCorruptBlocks' to true to allow reading corrupt files with missing blocks."

Design Rationale:

  • Fail-fast by default: Aligns with Apache POI's principle of strict validation
  • Opt-in tolerance: Users who need to handle damaged documents can enable it via system property
  • No API changes: Backward compatible, uses standard Java system properties
  • Follows existing patterns: Similar to other POI configuration options like org.apache.poi.ss.ignoreMissingFontSystem

Testing:

All tests pass successfully, verifying both strict and tolerant modes work as expected.

@pjfanning

@pjfanning
Copy link
Copy Markdown
Member

pjfanning commented Apr 9, 2026

This is a WPS bug so I don't think we should make hacks in fundamental parts of POI code:
#1041 (comment)

Report the issue to WPS

@lurongjiang
Copy link
Copy Markdown
Author

Thank you for the review and feedback. I understand your concerns.

My original intention was that since POI aims to support various Office document formats, I thought it should be able to handle WPS-generated files as well. When I encountered parsing failures with certain WPS documents, I attempted to add a tolerance mechanism as a workaround.

However, I now understand your point. This is not really a WPS bug, but rather a limitation of POI's current implementation - it cannot handle certain non-standard file formats generated by WPS. Adding workarounds in POI's core components (like ByteArrayBackedDataSource and FileBackedDataSource) would compromise the integrity of the codebase.

I'll close this PR as suggested. Users who need to handle such files can implement workarounds at the application level rather than modifying the library itself.

@lurongjiang lurongjiang closed this Apr 9, 2026
@pjfanning
Copy link
Copy Markdown
Member

Thank you for the review and feedback. I understand your concerns.

My original intention was that since POI aims to support various Office document formats, I thought it should be able to handle WPS-generated files as well. When I encountered parsing failures with certain WPS documents, I attempted to add a tolerance mechanism as a workaround.

However, I now understand your point. This is not really a WPS bug, but rather a limitation of POI's current implementation - it cannot handle certain non-standard file formats generated by WPS. Adding workarounds in POI's core components (like ByteArrayBackedDataSource and FileBackedDataSource) would compromise the integrity of the codebase.

I'll close this PR as suggested. Users who need to handle such files can implement workarounds at the application level rather than modifying the library itself.

I can't open the file using Microsoft Word. So it is a WPS bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants