r.text can make calls to chardet which are very slow #564

mlissner · 2022-09-29T21:17:34Z

We have a lot of code in Juriscraper that looks like this:

r = requests.get(some_link)
if "some phrase" in r.text:
    # throw an error

The thing is, when we're downloading large PDFs, r.text is expensive to produce. If the response doesn't declare an encoding, requests calls out to chardet to guess one, and chardet is slow and uses a lot of memory on big files.

From models.py in the requests library, here's r.text:

    @property
    def text(self):
        """Content of the response, in unicode.

        If Response.encoding is None, encoding will be guessed using
        ``charset_normalizer`` or ``chardet``.

        The encoding of the response content is determined based solely on HTTP
        headers, following RFC 2616 to the letter. If you can take advantage of
        non-HTTP knowledge to make a better guess at the encoding, you should
        set ``r.encoding`` appropriately before accessing this property.
        """

        # Try charset from content-type
        content = None
        encoding = self.encoding

        if not self.content:
            return ""

        # Fallback to auto-detected encoding.
        if self.encoding is None:
            encoding = self.apparent_encoding  # <----  HERE'S THE CALL

        # Decode unicode from given encoding.
        try:
            content = str(self.content, encoding, errors="replace")
        except (LookupError, TypeError):
            # A LookupError is raised if the encoding was not found which could
            # indicate a misspelling or similar mistake.
            #
            # A TypeError can be raised if encoding is None
            #
            # So we try blindly encoding.
            content = str(self.content, errors="replace")

        return content

apparent_encoding is NOT cached (but probably should be), and thus each time we call r.text we're hitting the CPU and memory pretty hard.
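To make the cost of the missing cache concrete, here is a minimal sketch. FakeResponse and CachedResponse are hypothetical stand-ins, not the requests classes, and _detect is a placeholder for an expensive detector such as chardet. It only illustrates how an uncached property repeats the work on every r.text access, and how memoizing it would avoid that.

```python
from functools import cached_property


class FakeResponse:
    """Hypothetical stand-in for a response object (not requests.Response)."""

    def __init__(self, content: bytes):
        self.content = content
        self.encoding = None  # no charset declared in the headers
        self.detect_calls = 0

    def _detect(self) -> str:
        # Placeholder for an expensive detector such as chardet.
        self.detect_calls += 1
        return "utf-8"

    @property
    def apparent_encoding(self) -> str:
        # Uncached: detection runs again on every access.
        return self._detect()

    @property
    def text(self) -> str:
        encoding = self.encoding or self.apparent_encoding
        return str(self.content, encoding, errors="replace")


class CachedResponse(FakeResponse):
    """Same stand-in, but detection is memoized per instance."""

    @cached_property
    def apparent_encoding(self) -> str:
        return self._detect()


plain = FakeResponse(b"hello")
plain.text
plain.text

cached = CachedResponse(b"hello")
cached.text
cached.text

print(plain.detect_calls, cached.detect_calls)
```

With the real library, a comparable effect may be had by setting r.encoding once before reading r.text, as the docstring above suggests.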


mlissner · 2022-09-29T21:18:06Z

This came up because a client is trying to use the Fetch API to get a 36MB PDF. It's crashing the maintenance pod I'm using to debug the issue!

It turns out that r.text makes calls to chardet each time it is called. That's not great because chardet can be slow and use a lot of memory, particularly when checking PDFs. Instead of doing that or checking if things are PDFs all the time, simply use the binary content instead of the text.

Fixes: #564
Relates to: psf/requests#6250
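The switch to binary content described above might look roughly like the sketch below. body_contains is a hypothetical helper, not something from Juriscraper or requests, and the two byte strings stand in for r.content so the example runs without a network call. It assumes the phrase can be encoded the same way the page is (UTF-8 by default here).

```python
def body_contains(content: bytes, phrase: str, encoding: str = "utf-8") -> bool:
    """Return True if `phrase`, encoded to bytes, occurs in the raw body."""
    return phrase.encode(encoding) in content


# Stand-ins for r.content; with requests the call would be
# body_contains(r.content, "some phrase").
html_body = b"<html><body>some phrase</body></html>"
pdf_body = b"%PDF-1.7\n" + bytes(range(256)) * 1024  # arbitrary binary payload

print(body_contains(html_body, "some phrase"))
print(body_contains(pdf_body, "some phrase"))
```

Because the check runs on bytes, no decoding happens and no encoding detection is triggered, whatever the size of the download.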

mlissner mentioned this issue Sep 29, 2022

564 chardet performance binary content #565

Merged

mlissner closed this as completed in #565 Sep 30, 2022
