r.text can make calls to chardet which are very slow #564

Closed
mlissner opened this issue Sep 29, 2022 · 1 comment · Fixed by #565

Comments

@mlissner
Member

We have a lot of code in Juriscraper that looks like this:

r = requests.get(some_link)
if "some phrase" in r.text:
    # throw an error

The thing is, when we're downloading large PDFs, r.text is expensive to produce. When the response doesn't declare an encoding, requests calls out to chardet to guess one, which is slow and uses a lot of memory on big files.

From models.py in the requests library, here's r.text:

    @property
    def text(self):
        """Content of the response, in unicode.

        If Response.encoding is None, encoding will be guessed using
        ``charset_normalizer`` or ``chardet``.

        The encoding of the response content is determined based solely on HTTP
        headers, following RFC 2616 to the letter. If you can take advantage of
        non-HTTP knowledge to make a better guess at the encoding, you should
        set ``r.encoding`` appropriately before accessing this property.
        """

        # Try charset from content-type
        content = None
        encoding = self.encoding

        if not self.content:
            return ""

        # Fallback to auto-detected encoding.
        if self.encoding is None:
            encoding = self.apparent_encoding  # <----  HERE'S THE CALL

        # Decode unicode from given encoding.
        try:
            content = str(self.content, encoding, errors="replace")
        except (LookupError, TypeError):
            # A LookupError is raised if the encoding was not found which could
            # indicate a misspelling or similar mistake.
            #
            # A TypeError can be raised if encoding is None
            #
            # So we try blindly encoding.
            content = str(self.content, errors="replace")

        return content

apparent_encoding is NOT cached (though it probably should be), so each time we call r.text on a response with no declared encoding, detection runs again and we hit the CPU and memory pretty hard.
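For cases where the decoded text really is needed, the docstring quoted above suggests setting `r.encoding` before touching `r.text`, which skips the `apparent_encoding` branch. A minimal sketch of that idea follows; the helper name `text_with_known_encoding` and the UTF-8 default are assumptions for illustration, not anything from requests or Juriscraper:

```python
def text_with_known_encoding(response, encoding="utf-8"):
    """Return response.text without triggering charset detection.

    `response` is assumed to behave like a requests.Response: it has a
    writable `encoding` attribute and a `text` property. The default
    encoding is a guess and should be replaced with whatever is actually
    known about the server.
    """
    if response.encoding is None:
        # With an encoding set, .text never consults apparent_encoding.
        response.encoding = encoding
    return response.text
```

This still decodes the whole body, so for a large PDF it only removes the detection cost, not the cost of building a huge string.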

@mlissner
Member Author

This came up because a client is trying to use the Fetch API to get a 36MB PDF. It's crashing the maintenance pod I'm using to debug the issue!

mlissner added a commit that referenced this issue Sep 29, 2022
It turns out that r.text makes calls to chardet each time it is called. That's
not great because chardet can be slow and use a lot of memory, particularly
when checking PDFs.

Instead of doing that or checking if things are PDFs all the time, simply use
the binary content instead of the text.

Fixes: #564
Relates to: psf/requests#6250
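The approach described in the commit message, checking the binary content instead of the text, might look roughly like the sketch below. The helper name `body_contains` is hypothetical, and it assumes the phrase appears in the body in the given encoding (UTF-8 here), which holds for plain ASCII phrases in typical error pages:

```python
def body_contains(response, phrase, encoding="utf-8"):
    """Check for `phrase` in the raw response bytes, never touching .text.

    `response` is assumed to expose a bytes `content` attribute, as
    requests.Response does. `phrase` may be str or bytes.
    """
    needle = phrase.encode(encoding) if isinstance(phrase, str) else phrase
    # A bytes-in-bytes search needs no decoding and no charset detection.
    return needle in response.content
```

With this, the pattern from the top of the issue becomes `if body_contains(r, "some phrase"):`, and the cost no longer depends on whether the response happens to be a large PDF.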