bug/partition.auto.partition raises ValueError: Invalid file (FileType.UNK) caused by semicolon including Content-Type without libmagic #2257
Comments
Hi @ftnext What version of the
@christinestraub Thanks for your reply.
I used
I also tried
>>> from unstructured.partition.auto import partition
>>> html_elements2 = partition(url="https://www.arxiv-vanity.com/papers/2305.14283/")
libmagic is unavailable but assists in filetype detection on file-like objects. Please consider installing libmagic for better results.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../venv/lib/python3.11/site-packages/unstructured/partition/auto.py", line 490, in partition
    raise ValueError(f"{msg}. The {filetype} file type is not supported in partition.")
ValueError: Invalid file. The FileType.UNK file type is not supported in partition.
I noticed the message "libmagic is unavailable", so I ran
>>> from unstructured.partition.auto import partition
>>> html_elements2 = partition(url="https://www.arxiv-vanity.com/papers/2305.14283/")
So it seems that
I installed libmagic, and then the error disappeared.
Currently, `file_and_type_from_url()` does not correctly handle the `Content-Type` header. Specifically, it assumes that the header contains only the mime-type (e.g. `text/html`). However, [RFC 9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows additional parameters, such as `charset`, to be included in the header. This leads to a `ValueError` when loading a URL whose response has a Content-Type header such as `text/html; charset=UTF-8`.

To reproduce the issue:

```python
from unstructured.partition.auto import partition

url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/"
partition(url=url)
```

This results in the following exception:

```python
{
  "name": "ValueError",
  "message": "Invalid file. The FileType.UNK file type is not supported in partition.",
  "stack": "--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[1], line 4 1 from unstructured.partition.auto import partition 3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\" ----> 4 partition(url=url) File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs) 539 else: 540 msg = \"Invalid file\" if not filename else f\"Invalid file {filename}\" --> 541 raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\") 543 for element in elements: 544 element.metadata.url = url ValueError: Invalid file. The FileType.UNK file type is not supported in partition."
}
```

This PR fixes the issue by parsing the mime-type out of the `Content-Type` header string.

Closes #2257
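As a rough illustration of the kind of parsing described above, here is a minimal sketch that keeps only the media type and drops any parameters after the first semicolon. The helper name `strip_content_type_params` is hypothetical and is not taken from the unstructured codebase or from the PR itself.

```python
from typing import Optional


def strip_content_type_params(content_type: Optional[str]) -> Optional[str]:
    """Return only the media type from a Content-Type header value.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if content_type is None:
        return None
    # Everything after the first ';' is a parameter list (e.g. charset=UTF-8).
    mime_type, _, _params = content_type.partition(";")
    # Media types are case-insensitive, so normalize to lowercase.
    return mime_type.strip().lower()


print(strip_content_type_params("text/html; charset=UTF-8"))
```

With the parameters removed, a value such as `text/html; charset=UTF-8` reduces to `text/html`, which is the form a mime-type lookup would expect.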
Describe the bug
`unstructured.partition.auto.partition` raises ValueError
`partition_html` works (but `partition` does not)
To Reproduce
(See above snippet)
Expected behavior
No error (same as the `partition_html` example)
Environment Info
Additional context
Workaround: Specify `content_type`
Cause: The Content-Type of the response header is `'text/html; charset=utf-8'`
unstructured/unstructured/partition/auto.py, Line 516 in 039ae17
`detect_filetype(content_type='text/html; charset=utf-8')` returns `FileType.UNK`
unstructured/unstructured/file_utils/filetype.py, Lines 238 to 242 in 039ae17
Idea: Add semicolon parse logic
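One possible shape for such parse logic, sketched with Python's standard-library `email.message.Message` rather than manual string splitting. The function name `split_content_type` is an assumption made for illustration and is not part of unstructured.

```python
from email.message import Message
from typing import Optional, Tuple


def split_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type header value into (media type, charset).

    Hypothetical helper; relies on the stdlib header parser.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() returns the lowercased "maintype/subtype" without
    # parameters; get_content_charset() returns the lowercased charset
    # parameter, or None when the header has none.
    return msg.get_content_type(), msg.get_content_charset()


mime_type, charset = split_content_type("text/html; charset=utf-8")
print(mime_type, charset)
```

Only the first element of the result would be needed for file type detection; the charset is returned here in case the caller wants it for decoding.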