
Conversation

@amastilovic
Contributor

Method `_ls_real` tries to download the whole `r.text()` of a link regardless of the content type. Prevent this download in all cases except when the Content-Type header is not set, or it is set to text/html.
@martindurant
Member

Another possible approach would be to stream-open the URL and download only the first chunk, looking for `<html>` or `<!DOCTYPE html>`. That would also give us the opportunity to set an upper limit on the size of the download, even for URLs with no type.

So, how about this: only allow "text/html" for an immediate full read, skip all other types when a type is given (as you are doing), but for the case of no header, read in chunks up to a maximum size?
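A minimal sketch of that streaming idea (my illustration, not code from the PR; `probably_html`, the 2 KB cap, and the session handling are assumptions):

```python
import aiohttp

MAX_SNIFF = 2048  # hypothetical cap; an agreed maximum was left for future work

async def probably_html(session: aiohttp.ClientSession, url: str) -> bool:
    # Stream-open the URL and read only the first chunk instead of the
    # whole body, then look for an HTML signature near the start.
    async with session.get(url) as resp:
        head = await resp.content.read(MAX_SNIFF)
        start = head.lstrip().lower()
        return start.startswith(b"<!doctype html") or b"<html" in start
```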

            links = [u[2] for u in ex.findall(text)]
    except UnicodeDecodeError:
        links = []  # binary, not HTML
    url_info = await self._info(url, **kwargs)
@martindurant
Member

This is an extra call; it would be better to look at `r`'s headers.
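A sketch of what that would look like, assuming `r` is the aiohttp response already in hand:

```python
# The header may carry a charset suffix, e.g. "text/html; charset=utf-8",
# so compare only the media type itself.
content_type = r.headers.get("Content-Type", "").partition(";")[0].strip()
should_read = content_type in ("", "text/html")
```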

            links = ex2.findall(text) + [u[2] for u in ex.findall(text)]
        else:
            links = [u[2] for u in ex.findall(text)]
    except UnicodeDecodeError:
@martindurant
Member

According to https://www.w3schools.com/html/html_charset.asp, although UTF-8 and ASCII are by far the most common, other encodings are allowed and still in use.

@amastilovic
Contributor Author

That exception block was already there in the original code. While I'm certainly not an expert on Python's string decoding, it seems that UnicodeDecodeError is thrown whenever a string or sequence of bytes can't be decoded with the given encoding (https://wiki.python.org/moin/UnicodeDecodeError), so the catch seems appropriate.

@martindurant
Member

Using `errors='ignore'` should be the right thing to do here.

@amastilovic
Contributor Author

The stated purpose of catching the exception is to determine whether the content is binary or HTML; would ignoring errors make sense in that case?

@martindurant
Member

The data could be text, with HTML and links, but not UTF-8.
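A sketch of the decoding change being suggested, assuming `r` is the aiohttp response; dropping undecodable bytes keeps the ASCII-compatible link markup scannable even when the page is not UTF-8:

```python
# Rather than letting a non-UTF-8 page raise UnicodeDecodeError,
# drop the undecodable bytes and keep scanning the rest for links.
text = await r.text(errors="ignore")
```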

@amastilovic
Contributor Author

amastilovic commented Jul 11, 2025

Another possible approach would be to stream-open the URL and download only the first chunk, looking for `<html>` or `<!DOCTYPE html>`. That would also give us the opportunity to set an upper limit on the size of the download, even for URLs with no type.

While I agree that this approach would be preferable, I unfortunately do not see a way to read only a chunk of bytes using `aiohttp.client_reqrep.ClientResponse`: https://github.com/aio-libs/aiohttp/blob/v3.12.13/aiohttp/client_reqrep.py#L682

Would it be OK to keep the current approach?

@martindurant
Member

`response.content.iter_chunks()` or `.read()` with a fixed, not-too-large number of bytes.
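For example (a sketch only; `resp` and the 2048-byte figure are assumptions):

```python
# Read just the first bytes of the body rather than the whole payload.
head = await resp.content.read(2048)

# Or consume the stream chunk by chunk and stop early:
async for data, end_of_http_chunk in resp.content.iter_chunks():
    break  # the first chunk is enough for sniffing
```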

@amastilovic
Contributor Author

`response.content.iter_chunks()` or `.read()` with a fixed, not-too-large number of bytes.

OK, that worked. The unit test for `isdir()` failed, though, because the test-case HTML content simply lists `<a href>` links with no `<html>` or `<!DOCTYPE html>`. It seems determining whether content is HTML might be slightly more complicated :-) I could use a simple regex like `(<([^>]+)>)`, which just looks for something between `<` and `>`, although even that might fail when the content is plain text with some `<a href>` links within it.

Either that, or we decide not to get into the business of parsing the content to determine whether it's HTML.
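For illustration, the tag-sniffing heuristic being floated (hypothetical helper, not part of the PR; as noted above, plain text containing a few tags would also match):

```python
import re

# Any angle-bracketed run counts as "looks like HTML" under this heuristic.
TAG_RE = re.compile(rb"<[^>]+>")

def looks_like_html(first_chunk: bytes) -> bool:
    return TAG_RE.search(first_chunk) is not None
```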

* Get Content-Type from headers instead of another `.info()` call
* Use `r.text(errors="ignore")`
* Add a `test_isdir` case for when MIME type is present
@martindurant
Member

OK, we'll leave the possibility of streaming, looking specifically for HTML, and a maximum download size for future work; this is already an improvement.

@martindurant merged commit b5f9391 into fsspec:master on Jul 14, 2025
10 checks passed