Skip to content

Conversation

@tomaarsen
Copy link
Contributor

Hello!

Pull Request overview

  • Introduce an exactly_one function to simplify the partitioning functions.

Details

I noticed that many partitioning functions follow this outline:

if not any([filename, file, text, url]):
raise ValueError("One of filename, file, or text must be specified.")
if filename is not None and not file and not text and not url:
document = HTMLDocument.from_file(filename, parser=parser)
elif file is not None and not filename and not text and not url:
file_content = file.read()
if isinstance(file_content, bytes):
file_text = file_content.decode("utf-8")
else:
file_text = file_content
document = HTMLDocument.from_string(file_text, parser=parser)
elif text is not None and not filename and not file and not url:
_text: str = str(text)
document = HTMLDocument.from_string(_text, parser=parser)
elif url is not None and not filename and not file and not text:
response = requests.get(url)
if not response.ok:
raise ValueError(f"URL return an error: {response.status_code}")
content_type = response.headers.get("Content-Type", "")
if not content_type.startswith("text/html"):
raise ValueError(f"Expected content type text/html. Got {content_type}.")
document = HTMLDocument.from_string(response.text, parser=parser)
else:
raise ValueError("Only one of filename, file, or text can be specified.")

I.e. first checking if a non-zero amount of arguments are used, then a number of really large conditionals ensuring that only one is used, and then a final else-branch to raise virtually the same error as before, but now stating to use a maximum of one argument.

I figured this could easily be simplified by introducing a function that verifies that exactly one of the arguments are non-None. This function is called once, and if it doesn't result in an error, then the partitioning function can assume that only one of the arguments is used, allowing the simplification of the conditionals (e.g. no more if file is not None and not filename and not text and not url, but just if file).

Just like before, if a method is called incorrectly, an error is raised:

>>> from unstructured.partition.html import partition_html
>>> elements = partition_html(url=url, text="hello")
ValueError: Exactly one of filename, file, text and url must be specified.

But the partitioning functions themselves are simplified a lot.

Let me know if you need anything else from me at this point.

  • Tom Aarsen

Copy link
Contributor

@qued qued left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments and suggestions. I did some digging on what it would take to get the new function to do the type verification in a way mypy would accept. I didn't come up with anything conclusive, but I got the impression there might be a more pythonic way of doing this using TypeGuards or overloads. I'm ok with the current implementation with some changes though.

In favor of unreachable Exceptions, as discussed in the review comments
@tomaarsen
Copy link
Contributor Author

@qued Thank you for the very detailed review! I love those. I've tackled each of the concerns in separate commits, so they can be verified more easily.

Copy link
Contributor

@qued qued left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@qued qued merged commit a915231 into Unstructured-IO:main Mar 7, 2023
@tomaarsen tomaarsen deleted the refactor/partition_utils branch March 7, 2023 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants