bug/partition.auto.partition raises ValueError: Invalid file (FileType.UNK) caused by semicolon including Content-Type without libmagic #2257

Closed
ftnext opened this issue Dec 13, 2023 · 3 comments · Fixed by #2950 or cohere-ai/unstructured#3
Labels
bug Something isn't working

Comments

@ftnext

ftnext commented Dec 13, 2023

Describe the bug

unstructured.partition.auto.partition raises ValueError

>>> from unstructured.partition.auto import partition
>>> html_elements = partition(url="https://www.arxiv-vanity.com/papers/2305.14283/")
libmagic is unavailable but assists in filetype detection on file-like objects. Please consider installing libmagic for better results.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../venv/lib/python3.11/site-packages/unstructured/partition/auto.py", line 490, in partition
    raise ValueError(f"{msg}. The {filetype} file type is not supported in partition.")
ValueError: Invalid file. The FileType.UNK file type is not supported in partition.

partition_html works (but partition does not)

>>> from unstructured.partition.html import partition_html
>>> html_elements = partition_html(url="https://www.arxiv-vanity.com/papers/2305.14283/")

To Reproduce
(See above snippet)

Expected behavior
No error (same as partition_html example)

Environment Info

OS version:  macOS-12.6.6-arm64-arm-64bit
Python version:  3.11.4
unstructured version:  0.11.2
unstructured-inference version:  0.7.15
pytesseract version:  0.3.10
Torch version:  2.1.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
Traceback (most recent call last):
  ...
  File "/.../collect_env.py", line 234, in main
    libreoffice_version = get_libreoffice_version()
                          ^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'

Additional context
Workaround: Specify content_type

>>> html_elements = partition(url="https://www.arxiv-vanity.com/papers/2305.14283/", content_type="text/html")

Cause:
Content-Type of the response header is 'text/html; charset=utf-8'

content_type = content_type or response.headers.get("Content-Type")

detect_filetype(content_type='text/html; charset=utf-8') returns FileType.UNK

# first check (content_type)
if content_type:
    filetype = STR_TO_FILETYPE.get(content_type)
    if filetype:
        return filetype
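
To make the failure mode concrete, here is a minimal sketch of the exact-match lookup. The mapping `STR_TO_FILETYPE_SKETCH` and the helper `lookup_filetype` are hypothetical stand-ins, not the library's real table or function; only the dictionary `.get()` behaviour is taken from the snippet above.

```python
# Hypothetical stand-in for the library's mapping; only the exact-match
# lookup behaviour is being illustrated, not the real table contents.
STR_TO_FILETYPE_SKETCH = {"text/html": "HTML", "application/pdf": "PDF"}


def lookup_filetype(content_type):
    """Exact-match dictionary lookup, mirroring the snippet above."""
    return STR_TO_FILETYPE_SKETCH.get(content_type)


# A bare media type is a key in the mapping, so it is found.
print(lookup_filetype("text/html"))
# The full header value, parameters included, is not a key, so .get()
# returns None and the content_type check falls through.
print(lookup_filetype("text/html; charset=utf-8"))
```

When this first check falls through and libmagic is unavailable, the thread's observations suggest nothing else identifies the response as HTML, which would explain the FileType.UNK result.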

Idea: strip everything from the semicolon onward before the lookup

    content_type = content_type or response.headers.get("Content-Type")
+    if content_type:
+        content_type = content_type.split(";")[0]
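
A slightly more defensive version of the same idea could look like the sketch below. The function name `normalize_content_type` is hypothetical, and the whitespace stripping and lowercasing go beyond the two-line diff above; they are assumptions about what a robust fix might want, not what the library does.

```python
def normalize_content_type(header_value):
    """Return the bare media type from a Content-Type header value, or None."""
    if not header_value:
        return None
    # Everything after the first ";" is a parameter such as charset.
    media_type = header_value.split(";", 1)[0]
    # Media types are case-insensitive; also drop stray whitespace.
    # An empty result (e.g. the header was just ";") is treated as missing.
    return media_type.strip().lower() or None


for value in ("text/html; charset=utf-8", "TEXT/HTML ;charset=UTF-8", "application/pdf", None):
    print(repr(value), "->", repr(normalize_content_type(value)))
```

The normalized value can then be used as the lookup key in place of the raw header.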
ftnext added the bug label Dec 13, 2023
@christinestraub
Contributor

christinestraub commented Dec 19, 2023

Hi @ftnext

What version of the unstructured library did you use? I tested both partition_html() and partition() with unstructured==0.11.5. Neither raised an error, and both worked as expected.

from unstructured.partition.html import partition_html
html_elements = partition_html(url="https://www.arxiv-vanity.com/papers/2305.14283/")
from unstructured.partition.auto import partition
html_elements = partition(url="https://www.arxiv-vanity.com/papers/2305.14283/")

@ftnext
Author

ftnext commented Dec 20, 2023

@christinestraub Thanks for your reply.

> What version of the unstructured library did you use?

I used unstructured version 0.11.2, as mentioned above.

I also tried unstructured==0.11.5:

>>> from unstructured.partition.auto import partition
>>> html_elements2 = partition(url="https://www.arxiv-vanity.com/papers/2305.14283/")
libmagic is unavailable but assists in filetype detection on file-like objects. Please consider installing libmagic for better results.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../venv/lib/python3.11/site-packages/unstructured/partition/auto.py", line 490, in partition
    raise ValueError(f"{msg}. The {filetype} file type is not supported in partition.")
ValueError: Invalid file. The FileType.UNK file type is not supported in partition.

I noticed the message "libmagic is unavailable", so I ran brew install libmagic. After that, the same call no longer raises an error:

>>> from unstructured.partition.auto import partition
>>> html_elements2 = partition(url="https://www.arxiv-vanity.com/papers/2305.14283/")

So it seems that partition.auto.partition raises ValueError: Invalid file (FileType.UNK) when the Content-Type header includes a semicolon and libmagic is not installed.

ftnext changed the title Dec 20, 2023
from: bug/partition.auto.partition raises ValueError: Invalid file (FileType.UNK) caused by semicolon including Content-Type
to: bug/partition.auto.partition raises ValueError: Invalid file (FileType.UNK) caused by semicolon including Content-Type without libmagic
@hackgoofer

I installed libmagic, and then the error disappeared.

cragwolfe pushed a commit that referenced this issue Apr 30, 2024
Currently, `file_and_type_from_url()` does not correctly handle the
`Content-Type` header. Specifically, it assumes that the header contains
only the mime-type (e.g. `text/html`). However, [RFC
9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows
additional parameters, such as `charset`, to follow the mime-type in the
header. This leads to a `ValueError` when loading a URL whose response
has a Content-Type header such as `text/html; charset=UTF-8`.

To reproduce the issue:

```python
from unstructured.partition.auto import partition

url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/"
partition(url=url)
```

Which will result in the following exception:

```text
{
	"name": "ValueError",
	"message": "Invalid file. The FileType.UNK file type is not supported in partition.",
	"stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 4
      1 from unstructured.partition.auto import partition
      3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\"
----> 4 partition(url=url)

File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs)
    539 else:
    540     msg = \"Invalid file\" if not filename else f\"Invalid file {filename}\"
--> 541     raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\")
    543 for element in elements:
    544     element.metadata.url = url

ValueError: Invalid file. The FileType.UNK file type is not supported in partition."
}
```

This PR fixes the issue by parsing the mime-type out of the
`Content-Type` header string.
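
The PR's diff is not reproduced in this thread. As an illustration only, one way to parse the mime-type (and the `charset` parameter) out of a header value with the Python standard library is sketched below; the helper name `parse_content_type` is hypothetical and this is not necessarily how the PR does it.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header value into (mime_type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() returns the lowercased "type/subtype" without
    # parameters. Note that it falls back to "text/plain" when the value
    # is malformed, so callers may want to validate the input first.
    # get_content_charset() returns the lowercased charset, or None.
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type("text/html; charset=UTF-8"))
print(parse_content_type("application/pdf"))
```

Compared with splitting on `;` by hand, this also handles quoted parameter values, at the cost of the `text/plain` fallback noted in the comment.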


Closes #2257