HTTP Tools
==========

Helper functions for parsing raw HTTP payloads.

Read Chunked HTTP Payloads
--------------------------

Contrary to WARCIO, Resiliparse's :ref:`FastWARC <fastwarc-manual>` does not automatically decode chunked HTTP responses. This is a deliberate design decision in favour of simplicity, since decoding chunked HTTP payloads is really the crawler's job. In the Common Crawl, for example, all chunked payloads are already decoded and the original ``Transfer-Encoding`` header is preserved as ``X-Crawler-Transfer-Encoding: chunked``. We do acknowledge, however, that in some cases it is still necessary to decode chunked payloads, which is why Resiliparse provides :func:`~.parse.http.read_http_chunk` as a helper function for this.
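Whether a payload still needs de-chunking can thus usually be read off its HTTP headers. Below is a minimal sketch of such a check; the helper name and the plain dict-style header access are purely illustrative and not part of the Resiliparse API:

.. code-block:: python

    def is_chunked(http_headers):
        # A surviving "Transfer-Encoding: chunked" header means the payload was
        # stored as-is; Common Crawl instead rewrites it to
        # "X-Crawler-Transfer-Encoding: chunked" after decoding the payload.
        return 'chunked' in http_headers.get('Transfer-Encoding', '').lower()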

The function accepts a buffered reader (either a :class:`fastwarc.stream_io.BufferedReader` or a file-like Python object that implements ``readline()``, such as :class:`io.BytesIO`) and is meant to be called repeatedly until no further output is produced. Each call returns a single chunk, which can be concatenated with the previous chunks:

.. code-block:: python

    from fastwarc.stream_io import BufferedReader, BytesIOStream
    from resiliparse.parse.http import read_http_chunk

    # Chunked payload: each chunk is a hexadecimal size line followed by the
    # chunk data, terminated by a zero-size chunk.
    chunked = (b'c\r\nResiliparse \r\n'
               b'6\r\nis an \r\n'
               b'8\r\nawesome \r\n'
               b'5\r\ntool.\r\n'
               b'0\r\n\r\n')

    reader = BufferedReader(BytesIOStream(chunked))
    decoded = b''
    while chunk := read_http_chunk(reader):
        decoded += chunk

    # b'Resiliparse is an awesome tool.'
    print(decoded)

For convenience, you can also use :func:`~.parse.http.iterate_http_chunks`, which is a generator that wraps :func:`~.parse.http.read_http_chunk` and fully consumes the chunked stream:

.. code-block:: python

    from resiliparse.parse.http import iterate_http_chunks

    # Recreate the reader, since the loop above has already consumed it:
    reader = BufferedReader(BytesIOStream(chunked))

    # b'Resiliparse is an awesome tool.'
    print(b''.join(iterate_http_chunks(reader)))
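In practice, the buffered reader will typically come from a FastWARC record rather than an in-memory buffer. The following is a rough sketch of how the pieces could be combined; the WARC file name is a placeholder, and the header lookup assumes the record's HTTP headers still carry the original ``Transfer-Encoding`` value:

.. code-block:: python

    from fastwarc.warc import ArchiveIterator, WarcRecordType
    from resiliparse.parse.http import iterate_http_chunks

    # 'crawl.warc.gz' is a placeholder for an actual WARC file.
    with open('crawl.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream, record_types=WarcRecordType.response):
            if record.http_headers is None:
                continue

            transfer_encoding = record.http_headers.get('Transfer-Encoding') or ''
            if 'chunked' in transfer_encoding.lower():
                # Payload was stored in chunked encoding: decode it manually.
                payload = b''.join(iterate_http_chunks(record.reader))
            else:
                payload = record.reader.read()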