
Easy way to retrieve just the first X bytes from a URL #1392

Closed
simonw opened this issue Nov 16, 2020 · 10 comments
Labels
question (Further information is requested) · user-experience (Ensuring that users have a good experience using the library)

Comments

@simonw
Contributor

simonw commented Nov 16, 2020

I'm writing code that retrieves HTML from a URL provided by an (untrusted) user.

httpx has excellent support for timeouts already, but I'm also worried about people giving me a URL to a giant resource - I don't want to naively load the entire thing into process memory.

It looks like I can achieve this using httpx.stream - but it's not instantly obvious how to close the connection after a certain number of bytes. I'm figuring that out at the moment.

I'd love to be able to do this using the simpler httpx.get() interface. Maybe something like this:

response = httpx.get("https://example.com/", truncate_after=2048)
potentially_truncated_content = response.text
if response.truncated:
    print("Only retrieved first 2048 bytes")
else:
    print("Got everything")

I don't like truncate_after here but it's the first thing that came to mind.

@simonw
Contributor Author

simonw commented Nov 16, 2020

(This is intended as a feature request, not a bug report)

@florimondmanca florimondmanca added the question and user-experience labels Nov 16, 2020
@simonw
Contributor Author

simonw commented Nov 16, 2020

I'm actually having trouble figuring out the right way to do this with the existing httpx.stream interface. It looks like I can start calling iter_bytes() or iter_raw() and then stop iterating once I get enough content - but I'm not clear on how many bytes I would get back in each of those chunks.

@florimondmanca
Member

florimondmanca commented Nov 16, 2020

It looks like I can achieve this using httpx.stream - but it's not instantly obvious how to close the connection after a certain number of bytes. I'm figuring that out at the moment.

@simonw

HTTPCore reads from the network in chunks of 64kB:

https://github.com/encode/httpcore/blob/12a25fdb311cc3f0feb2b7ca7f0afab46989fe19/httpcore/_async/http11.py#L27

So you won't be able to download in chunks smaller than that, but 64kB sounds like an okay minimum number of bytes to download before hanging up, right?

Then, I think the missing piece for you might be response.num_bytes_downloaded, introduced in 0.15.

Uvicorn server:

BODY = b"*" * 1024 * 1024  # 1MB, i.e. 16 * 64 kB (16 HTTPCore-sized chunks)


async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send(
        {
            "type": "http.response.start",
            "status": 200,
            "headers": [[b"content-type", b"text/plain"]],
        }
    )
    await send({"type": "http.response.body", "body": BODY})

Client:

import httpx

url = "http://localhost:8000"
truncate_after = 32 * 1024  # 32kB

with httpx.stream("GET", url) as response:
    body = b""
    for chunk in response.iter_bytes():
        body += chunk
        if response.num_bytes_downloaded >= truncate_after:
            break

assert len(body) == response.num_bytes_downloaded
print(len(body))  # Actually 64kB (due to HTTPCore's chunk size)

Can also be rewritten using itertools.takewhile:

import itertools

import httpx

with httpx.stream("GET", url) as response:
    truncated_iter_bytes = itertools.takewhile(
        lambda _: response.num_bytes_downloaded < truncate_after,
        response.iter_bytes(),
    )
    body = b"".join(truncated_iter_bytes)

I'm not sure this will guarantee that data won't actually be fully read from the server, though. E.g. in this case we can see that the server sends the huge body in one go, so it may try to push it through and have it accumulate in internal socket buffers that we'd just happen not to read entirely on the client side. This is all supposition, though; I'm not enough of a networking junkie to tell. :-) Point is, you may need to verify this does do what you want (limit memory usage) in case the server tries to send a huge chunk in one go. Otherwise I think that should do the trick? :-)
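One complementary guard, sketched here rather than something httpx gives you out of the box: when the server declares a Content-Length up front, you can refuse an oversized response before reading any of the body, while keeping the byte-counting loop as the backstop for servers that omit or misreport the header.

import httpx

url = "http://localhost:8000"  # the same local test server as above
truncate_after = 32 * 1024  # 32kB

with httpx.stream("GET", url) as response:
    # Fast-path rejection only: Content-Length may be absent or wrong,
    # so the byte-counting loop below remains the real limit.
    declared = response.headers.get("Content-Length")
    if declared is not None and int(declared) > truncate_after:
        raise ValueError(f"Response declares {declared} bytes, over the limit")

    body = b""
    for chunk in response.iter_bytes():
        body += chunk
        if response.num_bytes_downloaded >= truncate_after:
            break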

@florimondmanca
Member

We've got some docs about response.num_bytes_downloaded here: https://www.python-httpx.org/advanced/#monitoring-download-progress

@simonw
Contributor Author

simonw commented Nov 16, 2020

Thanks! The 64KB thing was the bit I was missing - so it looks like I'll be safe if I open a stream and then consume just the first chunk of the iterator.

I do think a useful feature for httpx would be the ability to more finely control the amount of data that is read back from a response before terminating that connection early. I'm writing code that just needs to look in the <head> section of the HTML (for some meta tags) so I only want to retrieve the first couple of KBs.
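As a rough sketch of that use case (the stdlib html.parser, the UTF-8 fallback, and the local test URL are all illustrative assumptions, not settled code):

import httpx
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta> tag attributes until </head> is reached."""

    def __init__(self):
        super().__init__()
        self.metas = []
        self.in_head = True

    def handle_starttag(self, tag, attrs):
        if self.in_head and tag == "meta":
            self.metas.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


with httpx.stream("GET", "http://localhost:8000") as response:
    first_chunk = next(response.iter_bytes())  # at most ~64kB, per the above

# html.parser is tolerant of truncated input, so feeding a partial
# document is fine for pulling out whatever meta tags made the cut.
parser = MetaCollector()
parser.feed(first_chunk.decode("utf-8", errors="replace"))
print(parser.metas)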

@florimondmanca
Member

florimondmanca commented Nov 16, 2020

Yes, if you roughly know how much you'd need to read and that's below the 64kB limit, you could probably simplify it down to:

import httpx

with httpx.stream("GET", "http://localhost:8000") as response:
    head = next(response.iter_bytes())

assert len(head) == response.num_bytes_downloaded
print(len(head))  # 65,528 bytes

I do think a useful feature for httpx would be the ability to more finely control the amount of data that is read back from a response before terminating that connection early. I'm writing code that just needs to look in the <head> section of the HTML (for some meta tags) so I only want to retrieve the first couple of KBs.

What do you think of #1277? It adds a bit more control over the chunking behavior (at least on the user side), and would allow doing something like:

with httpx.stream("GET", url) as response:
    head = next(response.iter_bytes(chunk_size=1024))  # Only the 1st kB.

HTTPCore would still read in 64kB-sized chunks (we figured that was an optimal size for kernel-side vs Python-side processing, see encode/httpcore#135), but those chunks would be further sliced and diced on the HTTPX side so that users see them at the expected size.
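So, assuming the iter_bytes(chunk_size=...) parameter from #1277, a byte cap could then be accurate to within one chunk rather than 64kB, e.g.:

import httpx

url = "http://localhost:8000"
limit = 4 * 1024  # cap the body at 4kB

with httpx.stream("GET", url) as response:
    body = b""
    # Chunks are re-sliced to the requested 1kB on the HTTPX side,
    # regardless of HTTPCore's 64kB network reads.
    for chunk in response.iter_bytes(chunk_size=1024):
        body += chunk
        if len(body) >= limit:
            break  # exiting the `with` block closes the connection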

@simonw
Contributor Author

simonw commented Nov 16, 2020

Oh I hadn't seen #1277 - looks like exactly what I want.

I still think it's worth considering a higher-level interface for this. I'm a big fan of HTTP libraries that make it easy to deal with potentially hostile inputs, and the key features needed for that are (a sketch combining them follows the list):

  • Control over the number of redirects followed
  • Control over read/connect timeouts (HTTPX is fantastic on this front)
  • Control over how many bytes are retrieved, in case a user provides a URL to a multi-GB file
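
Putting those three together with what exists today might look roughly like this (a sketch only; the URL and limits are illustrative, and the byte cap is the manual streaming loop from earlier in the thread):

import httpx

url = "http://localhost:8000"  # stand-in for the untrusted user-supplied URL
max_bytes = 64 * 1024

client = httpx.Client(
    timeout=httpx.Timeout(5.0),  # connect/read/write/pool timeouts
    max_redirects=3,             # TooManyRedirects is raised beyond this
)

with client.stream("GET", url) as response:
    body = b""
    for chunk in response.iter_bytes():
        body += chunk
        if len(body) >= max_bytes:
            break  # stop reading; closing the stream drops the connection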

@simonw simonw mentioned this issue Nov 16, 2020
@florimondmanca
Member

Control over the number of redirects followed

I believe Client(max_redirects=...) is enough here (it raises TooManyRedirects if that number is exceeded), or is there something else you're thinking of?

@simonw
Contributor Author

simonw commented Nov 16, 2020

No that's exactly right, it's just the response size limit that's not obvious at the moment.

@tomchristie
Member

Closed via #1277
