
Easy way to retrieve just the first X bytes from a URL #1392

Closed
simonw opened this issue Nov 16, 2020 · 10 comments
Labels
question (Further information is requested) · user-experience (Ensuring that users have a good experience using the library)

Comments

@simonw
Contributor

simonw commented Nov 16, 2020

I'm writing code that retrieves HTML from a URL provided by an (untrusted) user.

httpx has excellent support for timeouts already, but I'm also worried about people giving me a URL to a giant resource - I don't want to naively load the entire thing into process memory.

It looks like I can achieve this using httpx.stream - but it's not instantly obvious how to close the connection after a certain number of bytes. I'm figuring that out at the moment.

I'd love to be able to do this using the simpler httpx.get() interface. Maybe something like this:

response = httpx.get("https://example.com/", truncate_after=2048)
potentially_truncated_content = response.text
if response.truncated:
    print("Only retrieved first 2048 bytes")
else:
    print("Got everything")

I don't like truncate_after here but it's the first thing that came to mind.

@simonw
Contributor Author

simonw commented Nov 16, 2020

(This is intended as a feature request, not a bug report)

@florimondmanca florimondmanca added the question and user-experience labels Nov 16, 2020
@simonw
Contributor Author

simonw commented Nov 16, 2020

I'm actually having trouble figuring out the right way to do this with the existing httpx.stream interface. It looks like I can start calling iter_bytes() or iter_raw() and then stop iterating once I get enough content - but I'm not clear on how many bytes I would get back in each of those chunks.

@florimondmanca
Member

florimondmanca commented Nov 16, 2020

It looks like I can achieve this using httpx.stream - but it's not instantly obvious how to close the connection after a certain number of bytes. I'm figuring that out at the moment.

@simonw

HTTPCore reads from the network in chunks of 64kB:

https://github.com/encode/httpcore/blob/12a25fdb311cc3f0feb2b7ca7f0afab46989fe19/httpcore/_async/http11.py#L27

So you won't be able to download in chunks smaller than that, but 64kB sounds like an okay minimum number of bytes to download before hanging up, right?

Then, I think the missing piece for you might be response.num_bytes_downloaded, introduced in 0.15.

Uvicorn server:

BODY = b"*" * 1024 * 1024  # 1MB, i.e. 16 * 64 kB (16 HTTPCore-sized chunks)


async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send(
        {
            "type": "http.response.start",
            "status": 200,
            "headers": [[b"content-type", b"text/plain"]],
        }
    )
    await send({"type": "http.response.body", "body": BODY})

Client:

import httpx

url = "http://localhost:8000"
truncate_after = 32 * 1024  # 32kB

with httpx.stream("GET", url) as response:
    body = b""
    for chunk in response.iter_bytes():
        body += chunk
        if response.num_bytes_downloaded >= truncate_after:
            break

assert len(body) == response.num_bytes_downloaded
print(len(body))  # Actually 64kB (due to HTTPCore's chunk size)

Can also be rewritten using itertools.takewhile:

import itertools

import httpx

with httpx.stream("GET", url) as response:
    truncated_iter_bytes = itertools.takewhile(
        lambda _: response.num_bytes_downloaded < truncate_after,
        response.iter_bytes(),
    )
    body = b"".join(truncated_iter_bytes)

I'm not sure this will guarantee that data won't actually be fully read from the server, though. E.g. in this case we can see that the server sends the huge body in one go, so it may try to push it through and have it accumulate in internal socket buffers that we'd just happen not to read entirely on the client side. This is all supposition, though; I'm not enough of a networking junkie to tell. :-) Point is, you may need to verify this does do what you want (limit memory usage) in case the server tries to send a huge chunk in one go. Otherwise I think that should do the trick? :-)
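One complementary guard, sketched here rather than something httpx gives you out of the box: when the server declares a Content-Length up front, you can refuse an oversized response before reading any of the body, while keeping the byte-counting loop as the backstop for servers that omit or misreport the header.

import httpx

url = "http://localhost:8000"  # the same local test server as above
truncate_after = 32 * 1024  # 32kB

with httpx.stream("GET", url) as response:
    # Fast-path rejection only: Content-Length may be absent or wrong,
    # so the byte-counting loop below remains the real limit.
    declared = response.headers.get("Content-Length")
    if declared is not None and int(declared) > truncate_after:
        raise ValueError(f"Response declares {declared} bytes, over the limit")

    body = b""
    for chunk in response.iter_bytes():
        body += chunk
        if response.num_bytes_downloaded >= truncate_after:
            break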

@florimondmanca
Member

We've got some docs about response.num_bytes_downloaded here: https://www.python-httpx.org/advanced/#monitoring-download-progress

@simonw
Contributor Author

simonw commented Nov 16, 2020

Thanks! The 64KB thing was the bit I was missing - so it looks like I'll be safe if I open a stream and then consume just the first chunk of the iterator.

I do think a useful feature for httpx would be the ability to more finely control the amount of data that is read back from a response before terminating that connection early. I'm writing code that just needs to look in the <head> section of the HTML (for some meta tags) so I only want to retrieve the first couple of KBs.
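As a rough sketch of that use case (the stdlib html.parser, the UTF-8 fallback, and the local test URL are all illustrative assumptions, not settled code):

import httpx
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta> tag attributes until </head> is reached."""

    def __init__(self):
        super().__init__()
        self.metas = []
        self.in_head = True

    def handle_starttag(self, tag, attrs):
        if self.in_head and tag == "meta":
            self.metas.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


with httpx.stream("GET", "http://localhost:8000") as response:
    first_chunk = next(response.iter_bytes())  # at most ~64kB, per the above

# html.parser is tolerant of truncated input, so feeding a partial
# document is fine for pulling out whatever meta tags made the cut.
parser = MetaCollector()
parser.feed(first_chunk.decode("utf-8", errors="replace"))
print(parser.metas)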

@florimondmanca
Member

florimondmanca commented Nov 16, 2020

Yes, if you roughly know how much you'd need to read and that's below the 64kB limit, you could probably simplify it down to:

import httpx

with httpx.stream("GET", "http://localhost:8000") as response:
    head = next(response.iter_bytes())

assert len(head) == response.num_bytes_downloaded
print(len(head))  # 65,528 bytes

I do think a useful feature for httpx would be the ability to more finely control the amount of data that is read back from a response before terminating that connection early. I'm writing code that just needs to look in the <head> section of the HTML (for some meta tags) so I only want to retrieve the first couple of KBs.

What do you think of #1277? It adds a bit more control over the chunking behavior (at least on the user side), and would allow doing something like:

with httpx.stream("GET", url) as response:
    head = next(response.iter_bytes(chunk_size=1024))  # Only the 1st kB.

HTTPCore would still read in 64kB-sized chunks (we figured that was an optimal size for kernel-side vs Python-side processing, see encode/httpcore#135), but those chunks would be further sliced and diced on the HTTPX side so that users see them at the expected size.
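So, assuming the iter_bytes(chunk_size=...) parameter from #1277, a byte cap could then be accurate to within one chunk rather than 64kB, e.g.:

import httpx

url = "http://localhost:8000"
limit = 4 * 1024  # cap the body at 4kB

with httpx.stream("GET", url) as response:
    body = b""
    # Chunks are re-sliced to the requested 1kB on the HTTPX side,
    # regardless of HTTPCore's 64kB network reads.
    for chunk in response.iter_bytes(chunk_size=1024):
        body += chunk
        if len(body) >= limit:
            break  # exiting the `with` block closes the connection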

@simonw
Contributor Author

simonw commented Nov 16, 2020

Oh I hadn't seen #1277 - looks like exactly what I want.

I still think it's worth considering a higher-level interface for this. I'm a big fan of HTTP libraries that make it easy to deal with potentially hostile inputs, and the key features needed for that are (a sketch combining them follows the list):

  • Control over the number of redirects followed
  • Control over read/connect timeouts (HTTPX is fantastic on this front)
  • Control over how many bytes are retrieved, in case a user provides a URL to a multi-GB file
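
Putting those three together with what exists today might look roughly like this (a sketch only; the URL and limits are illustrative, and the byte cap is the manual streaming loop from earlier in the thread):

import httpx

url = "http://localhost:8000"  # stand-in for the untrusted user-supplied URL
max_bytes = 64 * 1024

client = httpx.Client(
    timeout=httpx.Timeout(5.0),  # connect/read/write/pool timeouts
    max_redirects=3,             # TooManyRedirects is raised beyond this
)

with client.stream("GET", url) as response:
    body = b""
    for chunk in response.iter_bytes():
        body += chunk
        if len(body) >= max_bytes:
            break  # stop reading; closing the stream drops the connection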

@simonw simonw mentioned this issue Nov 16, 2020
@florimondmanca
Member

Control over the number of redirects followed

I believe Client(max_redirects=...) is enough here (it raises TooManyRedirects if that number is exceeded), or is there something else you're thinking of?

@simonw
Contributor Author

simonw commented Nov 16, 2020

No that's exactly right, it's just the response size limit that's not obvious at the moment.

@tomchristie
Member

Closed via #1277
