@scanny scanny commented Aug 13, 2024

Summary
Do not assume MSG format when an OLE "container" file cannot be differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based identification in that case.

Additional Context
DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly, a Microsoft-proprietary Zip format which "contains" a filesystem of discrete files and directories.

An OLE "container" is easily identified by inspecting the first 8 bytes of the file, so all we need to do is differentiate between the four subtypes we can process. The filetype module does a good job of this but is not perfect and does not identify MSG files.
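
For illustration, the container check described above could look roughly like the sketch below. The eight-byte value is the standard OLE compound-file signature; the `is_ole_container` name is hypothetical and is not necessarily how `unstructured` implements the check.

```python
import io
from typing import IO

# Standard 8-byte signature found at the start of every OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def is_ole_container(file: IO[bytes]) -> bool:
    """Return True when `file` starts with the OLE signature (hypothetical helper)."""
    pos = file.tell()  # remember the caller's position
    file.seek(0)
    try:
        return file.read(8) == OLE_MAGIC
    finally:
        file.seek(pos)  # restore the position so later readers are unaffected
```

Passing this check only says the file is some OLE container; it says nothing yet about which of DOC, MSG, PPT, or XLS it is.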

Previously we assumed MSG format when none of DOC, PPT, or XLS was detected, but we discovered that filetype is not completely reliable at detecting these types.

Change the behavior to remove the assumption of MSG format. _OleFileDifferentiator returns None in this case and file-type detection falls back to using the filename extension.

Note that a file with no filename and no metadata_filename, or with an incorrect extension, will not be correctly identified in this case. We're assuming for now that this will be rare in practice.
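
A minimal sketch of the fallback order described above, under stated assumptions: `detect_ole_subtype` and `OLE_EXTENSIONS` are hypothetical names used only for illustration, and the `differentiated` argument stands in for whatever the content-based differentiator returned (`None` when it could not decide).

```python
import os
from typing import Optional

# Hypothetical extension map, limited to the four OLE subtypes discussed here.
OLE_EXTENSIONS = {".doc": "DOC", ".msg": "MSG", ".ppt": "PPT", ".xls": "XLS"}


def detect_ole_subtype(
    differentiated: Optional[str], filename: Optional[str]
) -> Optional[str]:
    """Prefer the content-based result; otherwise fall back to the extension."""
    if differentiated is not None:
        return differentiated
    if not filename:
        # No name to inspect, so the subtype stays unknown.
        return None
    ext = os.path.splitext(filename)[1].lower()
    return OLE_EXTENSIONS.get(ext)
```

In this sketch, `detect_ole_subtype(None, "mail.msg")` yields `"MSG"` from the extension alone, while `detect_ole_subtype(None, None)` yields `None`, which mirrors the unnamed-file limitation noted above.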

@sentry
sentry bot commented Aug 13, 2024

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: unstructured/file_utils/filetype.py

Function: file_type
Unhandled Issue: BadZipFile: Bad magic number for central directory /general/v0/gener...
Event Count: 2

@scanny scanny force-pushed the scanny/no-default-OLE-subtype branch from 68d6aba to 8669f06 Compare August 21, 2024 19:34
@scanny scanny requested a review from Coniferish August 21, 2024 19:35
@scanny scanny force-pushed the scanny/no-default-OLE-subtype branch from 8669f06 to 05b4167 Compare August 21, 2024 20:53
@Coniferish Coniferish left a comment

LGTM

@scanny scanny added this pull request to the merge queue Aug 22, 2024
Merged via the queue into main with commit 32bb77a Aug 22, 2024
@scanny scanny deleted the scanny/no-default-OLE-subtype branch August 22, 2024 20:08