Install python-magic-bin instead of python-magic for Windows #234

Closed
MthwRobinson opened this issue Feb 16, 2023 · 1 comment
Labels
python Pull requests that update Python code

Comments

@MthwRobinson
Contributor

Currently, Windows users have difficulty with file type detection because python-magic-bin must be installed on Windows instead of python-magic. The goal of this issue is to see whether we can install python-magic-bin instead of python-magic when the user's OS is Windows.
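
One way to approach this is with PEP 508 environment markers, which let pip choose a dependency based on the platform at install time. The following is only a sketch: the helper names `magic_requirements` and `selected_magic_package` are hypothetical, and how the project actually declares its dependencies is not shown in this issue.

```python
import platform


def magic_requirements():
    """Return requirement strings using PEP 508 markers (hypothetical helper).

    pip evaluates the `platform_system` marker at install time, so only one
    of the two packages would be installed on any given machine.
    """
    return [
        'python-magic; platform_system != "Windows"',
        'python-magic-bin; platform_system == "Windows"',
    ]


def selected_magic_package(system=None):
    """Mirror the marker logic in plain Python, for illustration only."""
    system = system or platform.system()
    return "python-magic-bin" if system == "Windows" else "python-magic"
```

A list like this could be passed to `install_requires` in a setup.py, assuming the project uses setuptools.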

See this comment for details.


@tomaarsen
Contributor

tomaarsen commented Feb 25, 2023

@MthwRobinson
I can confirm that python-magic-bin must be installed on Windows. However, the tests do not pass with it. Notably:

FAILED test_unstructured/partition/test_auto.py::test_auto_partition_email_from_file - ValueError: Invalid file. File type not support in partition.
FAILED test_unstructured/partition/test_auto.py::test_auto_partition_html_from_file - ValueError: Invalid file. File type not support in partition.
FAILED test_unstructured/partition/test_auto.py::test_auto_partition_text_from_file - ValueError: Invalid file. File type not support in partition.
FAILED test_unstructured/staging/test_base_staging.py::test_convert_to_isd_serializes_with_posix_paths - NotImplementedError: cannot instantiate 'PosixPath' on your system

(Note: Not an exhaustive list of test failures)
For the first three failures, fake-text.txt, fake-html.html and fake-email.eml are all detected as the application/octet-stream MIME type by libmagic, after which unstructured checks whether the file might be a docx, xlsx or pptx. When that check fails, it assigns the unknown filetype.
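
A possible mitigation for the octet-stream results, sketched here purely as an illustration, is to fall back to the file extension when libmagic only reports the generic type. The function `refine_mime_type` is hypothetical and is not part of unstructured; it uses the standard library's `mimetypes` module, whose extension table can vary between systems.

```python
import mimetypes

GENERIC_MIME = "application/octet-stream"


def refine_mime_type(filename, detected):
    """Fall back to an extension-based guess when detection is generic.

    `detected` is assumed to be the MIME type reported by libmagic.
    """
    if detected != GENERIC_MIME:
        # libmagic gave a specific answer, so keep it.
        return detected
    guessed, _encoding = mimetypes.guess_type(filename)
    # If the extension is unknown too, keep the generic type.
    return guessed or detected
```

With this sketch, a file such as fake-text.txt that libmagic reports as application/octet-stream would be reclassified from its .txt extension, while a specific libmagic result is left untouched.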

Lastly, a PosixPath simply can't be instantiated on Windows.

I'll open a PR for the last issue.
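
To illustrate the PosixPath failure: `pathlib.PosixPath` is a concrete path class tied to the host OS, whereas the pure path classes can be instantiated anywhere. A minimal sketch, assuming the goal is only to produce forward-slash path strings (the helper `to_posix_string` is hypothetical, and this is not necessarily what the PR does):

```python
from pathlib import PureWindowsPath


def to_posix_string(path_str):
    """Return the path with forward slashes, on any operating system.

    PureWindowsPath accepts both separators and, being a pure path class,
    can be created on non-Windows systems as well.
    """
    return PureWindowsPath(path_str).as_posix()
```

Because pure paths never touch the filesystem, a helper like this behaves the same on Windows and on POSIX systems.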

  • Tom Aarsen
