Fix block reference handling in ByteArrayBackedDataSource #1044
lurongjiang wants to merge 3 commits into apache:5.5.x
lurongjiang commented on Apr 8, 2026
- Modify the ByteArrayBackedDataSource.read method to handle positions beyond the file size.
- Return a zero-padded buffer instead of throwing an exception when the position exceeds the file size.
side effect
- Remove redundant zero-padding logic in ByteArrayBackedDataSource.
- Modify FileBackedDataSource to return an empty buffer instead of throwing an exception when the position exceeds the file size.
- Add test cases for HWPFParser targeting WPS and Office 97-2003 document formats.
Changes Made
Impact Assessment
Testing Validation
Target the main branch, not 5.5.x.
public ByteBuffer read(int length, long position) throws IOException {
    if (position >= size()) {
-        throw new IndexOutOfBoundsException("Position " + position + " past the end of the file");
+        return ByteBuffer.allocate(length);
add a comment about why this is allowed - like in the other class above
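For readers following the diff, here is a minimal sketch of what the tolerant branch does. `TolerantReadSketch` is a hypothetical stand-in written for illustration, not POI's actual `ByteArrayBackedDataSource`; only the past-EOF branch mirrors the change under review, and the in-range clamping is an assumption.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a byte-array-backed source; not POI's actual class.
class TolerantReadSketch {
    private final byte[] data;

    TolerantReadSketch(byte[] data) {
        this.data = data;
    }

    long size() {
        return data.length;
    }

    ByteBuffer read(int length, long position) {
        if (position >= size()) {
            // Tolerant branch from the diff: a block reference past EOF yields a
            // zero-filled buffer instead of an IndexOutOfBoundsException.
            // ByteBuffer.allocate initializes every element to zero.
            return ByteBuffer.allocate(length);
        }
        // Assumption: a read that starts in range but runs past EOF is clamped.
        int toRead = (int) Math.min(length, size() - position);
        return ByteBuffer.wrap(data, (int) position, toRead);
    }

    public static void main(String[] args) {
        TolerantReadSketch src = new TolerantReadSketch(new byte[] {1, 2, 3, 4});
        System.out.println(src.read(2, 1).remaining());   // bytes available in range
        System.out.println(src.read(4, 100).remaining()); // zero-filled, past EOF
    }
}
```

Note that a caller cannot tell the zero-filled result apart from a block that really contains zeros, which is why an explanatory comment on that branch matters.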
WordExtractor extractor = new WordExtractor(doc);
String text = extractor.getText();
assertNotNull(doc);
assertNotNull(text);
must test the actual text - it being not null is not enough
I must admit that I'm basically -1 on this. I don't see why we should be returning empty ByteBuffers. I think we should fail when we read corrupt files. Feel free to fork POI and hack it how you please but I can't in good conscience agree to a change like this. If it was an option where this behaviour only kicks in if you set a flag on the extractor to say ignore corrupt blocks - I would possibly agree.
relates to #1041
…ding
- Added strict mode controlled by system properties in ByteArrayBackedDataSource and FileBackedDataSource.
- By default, throw an IndexOutOfBoundsException instead of returning a zero-filled buffer when the position exceeds EOF (end of file).
- Allow enabling tolerance mode by setting the system property `org.apache.poi.poifs.allowCorruptBlocks`.
- Improved assertion logic in test code to verify actual text content rather than just checking for non-emptiness.
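The strict/tolerant switch described in this commit could look roughly like the sketch below. `StrictModeSketch` is a hypothetical class for illustration; only the property name comes from the commit message. Reading the property on every call, rather than caching it once, is an assumption and may differ from the actual commit.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; POI's real data source classes are not reproduced here.
class StrictModeSketch {
    // Property name taken from the commit message.
    static final String ALLOW_CORRUPT_BLOCKS = "org.apache.poi.poifs.allowCorruptBlocks";

    private final byte[] data;

    StrictModeSketch(byte[] data) {
        this.data = data;
    }

    ByteBuffer read(int length, long position) {
        if (position >= data.length) {
            // Boolean.getBoolean is true only when the system property exists
            // and equals "true" (case-insensitive), so strict mode is the default.
            if (Boolean.getBoolean(ALLOW_CORRUPT_BLOCKS)) {
                // Tolerant mode: zero-filled buffer for a reference past EOF.
                return ByteBuffer.allocate(length);
            }
            throw new IndexOutOfBoundsException(
                    "Position " + position + " past the end of the file");
        }
        // Assumption: in-range reads are clamped to the bytes that exist.
        int toRead = (int) Math.min(length, data.length - position);
        return ByteBuffer.wrap(data, (int) position, toRead);
    }
}
```

With this shape, tolerance would be enabled by starting the JVM with `-Dorg.apache.poi.poifs.allowCorruptBlocks=true`. A system property applies to the whole JVM, whereas the review asked for a flag set on the extractor, so the two approaches differ in scope.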
Key Changes:
Design Rationale:
Testing: All tests pass successfully, verifying both strict and tolerant modes work as expected.
This is a WPS bug so I don't think we should make hacks in fundamental parts of POI code. Report the issue to WPS.
Thank you for the review and feedback. I understand your concerns.
My original intention was that, since POI aims to support various Office document formats, it should be able to handle WPS-generated files as well. When I encountered parsing failures with certain WPS documents, I attempted to add a tolerance mechanism as a workaround.
However, I now understand your point. This is not really a WPS bug, but rather a limitation of POI's current implementation: it cannot handle certain non-standard file formats generated by WPS. Adding workarounds in POI's core components (like
I'll close this PR as suggested. Users who need to handle such files can implement workarounds at the application level rather than modifying the library itself.
I can't open the file using Microsoft Word. So it is a WPS bug.