bug/macos-arm64-and-lxml-import-error #1707

Closed
liamvdv opened this issue Oct 11, 2023 · 5 comments
Labels: bug (Something isn't working), packaging (Issues with building and installing `unstructured`)

liamvdv commented Oct 11, 2023

Describe the bug
Cannot use `unstructured` on a macOS M2 Pro machine because `from unstructured.partition.html import partition_html` throws:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/partition/html.py", line 7, in <module>
    from unstructured.documents.html import HTMLDocument
  File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/documents/html.py", line 11, in <module>
    from lxml import etree
ImportError: dlopen(/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace '___cyg_profile_func_enter'

To Reproduce

brew install python@3.9
python3.9 -m venv .venv
source .venv/bin/activate
python --version # should show 3.9 now
which python # should be .../.venv/bin/....
pip install unstructured
python
# in interactive shell
from unstructured.partition.html import partition_html

On my machine, this throws:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/partition/html.py", line 7, in <module>
    from unstructured.documents.html import HTMLDocument
  File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/documents/html.py", line 11, in <module>
    from lxml import etree
ImportError: dlopen(/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace '___cyg_profile_func_enter'
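
For anyone trying to confirm they are hitting the same failure, here is a minimal diagnostic sketch (not part of `unstructured`; the function name `probe_lxml` and the report keys are made up for illustration). It attempts the same `lxml.etree` import that fails in the traceback above and records the platform details next to the outcome instead of raising.

```python
import importlib
import platform


def probe_lxml():
    """Try to import lxml.etree and report what happened, without raising."""
    report = {
        "system": platform.system(),          # "Darwin" on macOS
        "machine": platform.machine(),        # "arm64" on Apple Silicon
        "python": platform.python_version(),
        "ok": False,
        "lxml_version": None,
        "error": None,
    }
    try:
        etree = importlib.import_module("lxml.etree")
    except ImportError as exc:
        # Covers both "lxml is not installed" and the dlopen failure above.
        report["error"] = str(exc)
        return report
    report["ok"] = True
    # LXML_VERSION is a tuple of ints exposed by lxml.etree.
    report["lxml_version"] = ".".join(str(n) for n in etree.LXML_VERSION)
    return report


if __name__ == "__main__":
    for key, value in probe_lxml().items():
        print(f"{key}: {value}")
```

Pasting the printed report into a comment gives maintainers the architecture, Python version, and exact error text in one place.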

Expected behavior
Normal import to then parse HTML/XML files.

Environment Info

/Users/liamvdv/src/github.com/REDACT/collect.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
OS version:  macOS-13.4.1-arm64-arm-64bit
Python version:  3.9.18
unstructured version:  0.10.19
unstructured-inference is not installed
pytesseract is not installed
Torch is not installed
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
Traceback (most recent call last):
  File "/Users/liamvdv/src/github.com/REDACT/collect.py", line 242, in <module>
    main()
  File "/Users/liamvdv/src/github.com/REDACT/collect.py", line 234, in main
    libreoffice_version = get_libreoffice_version()
  File "/Users/liamvdv/src/github.com/REDACT/collect.py", line 163, in get_libreoffice_version
    result = subprocess.run(
  File "/opt/homebrew/Cellar/python@3.9/3.9.18/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/opt/homebrew/Cellar/python@3.9/3.9.18/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/homebrew/Cellar/python@3.9/3.9.18/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'
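
Side note on the crash above: the environment script stops with `FileNotFoundError` because `libreoffice` is not on `PATH`, so the rest of the report is never printed. The source of `collect.py` is not shown in this thread, so the following is only a sketch of how such a lookup could be guarded. The helper name `get_tool_version` and the `--version` argument are assumptions for illustration, not the script's actual code.

```python
import shutil
import subprocess


def get_tool_version(command, args=("--version",)):
    """Return the tool's version output, or None if the tool is not on PATH."""
    executable = shutil.which(command)
    if executable is None:
        # Same situation that raised FileNotFoundError in the traceback above.
        return None
    result = subprocess.run(
        [executable, *args],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit should not abort the whole report
    )
    # Some tools print their version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output or None
```

With a guard like this, a report line could be produced with `get_tool_version("libreoffice") or "not installed"`, matching the "is not installed" lines printed for the other optional dependencies.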

Additional context
I'm on a Mac M2 Pro with macOS Version 13.4.1.

Thank you for your help.

liamvdv added the bug label on Oct 11, 2023
@badGarnet
Collaborator

Hi @liamvdv, sorry for the late response; we are tracking this and reviewing the problem. We will keep this thread updated.

badGarnet self-assigned this on Oct 20, 2023
@MikeRecognex

This thread hasn't been updated. Is it fixed?

@zihaolam

having same issue too

@scanny
Collaborator

scanny commented Apr 28, 2024

After getting an Apple Silicon Mac I was finally able to reproduce this error.

I believe the problem is that arm64 wheels are not available for the latest versions of lxml. The solution that worked for me is the following:

$ pip install lxml==4.9.2

The later versions of lxml have "universal" macOS wheels and for some reason those don't seem to work.
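
To make the workaround easier to apply conditionally (for example in a setup script), here is a small sketch of the rule as stated in this thread: macOS on arm64 with an `lxml` newer than 4.9.2 is the combination reported as failing. The names `parse_version`, `LAST_KNOWN_GOOD`, and `should_pin_lxml` are hypothetical, and the check is a heuristic drawn only from the reports here, not a statement about every `lxml` release.

```python
def parse_version(text):
    """Turn '4.9.2' into (4, 9, 2); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


# The version pinned in this thread (assumption: anything newer is suspect).
LAST_KNOWN_GOOD = (4, 9, 2)


def should_pin_lxml(system, machine, lxml_version):
    """Heuristic from this thread only: macOS + arm64 + lxml newer than 4.9.2."""
    on_apple_silicon = system == "Darwin" and machine == "arm64"
    # Compare only major.minor.patch so '4.9.2.0' is treated like '4.9.2'.
    return on_apple_silicon and parse_version(lxml_version)[:3] > LAST_KNOWN_GOOD
```

The inputs can come from `platform.system()`, `platform.machine()`, and `importlib.metadata.version("lxml")`. If the function returns `True`, the `pip install lxml==4.9.2` command above is the fix to try.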

scanny self-assigned this and unassigned badGarnet on Apr 28, 2024
scanny added the packaging label on Apr 28, 2024
scanny added a commit that referenced this issue Apr 30, 2024
`unstructured` uses table features added in the most recent version of
`python-docx`.

Also update the `lxml` version constraint because `lxml>4.9.2` will not
install on Apple Silicon (#1707).

`python-docx` requires `lxml` although other file formats require it as
well.
github-merge-queue bot pushed a commit that referenced this issue May 1, 2024
**Summary**
`unstructured` will use table features added in the most recent version
of `python-docx`.

Also update the `lxml` version constraint because `lxml>4.9.2` will not
install on Apple Silicon
(#1707).

`python-docx` requires `lxml` although other file formats require it as
well.
@MthwRobinson
Contributor

Closing this issue. You can try @scanny's suggestion from above if you run into this issue.
