Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make charset auto-detection optional. #2165

Merged
merged 16 commits into from
May 23, 2022
Merged

Conversation

tomchristie
Copy link
Member

@tomchristie tomchristie commented Apr 5, 2022

From discussion #2083

(This is a simpler alternative to a previous attempt at this functionality with #2152)

  • Add Response(..., default_encoding=...)
  • Add Client(..., default_encoding=...)
  • Documentation.
  • Switch from autodetect as the default to utf-8 as the default.
  • Remove charset_normalizer aas a dependency.

Changelog

  • Add support for Client(..., default_encoding=...)
  • Add support for Response(..., default_encoding=...)
  • The default behaviour is now to use "utf-8" instead of character-set autodetection.
  • charset_normalizer is no longer a dependancy.
  • To enable character-set autodetection, install chardet or charset_normalizer and use a callable for the default_encoding, like Client(default_encoding=autodetect). See the docs for an example.

@tomchristie tomchristie marked this pull request as ready for review April 5, 2022 14:32
@tomchristie tomchristie requested a review from a team April 5, 2022 14:33
@adriangb
Copy link
Member

adriangb commented Apr 5, 2022

Overall I am very much 👍 for not having automagic behavior as the default, especially when there is a sane default like "utf-8".

Commenting on #2152 (review):

Given the two options presented, I would actually prefer (1).

I prefer that it doesn't overload the default_encoding type into two different cases, and it's obvious to me from reading the code what the intent is.

I think this particular sort of overloading is quite common and not hard to understand.
I also feel like mixing in strings that are encodings (e.g. "utf-8") with strings that are doing more complex stuff ("autodetect") and are not valid encodings is still "overloading" on a conceptual level, even if they types are technically the same.

I also think it leads to better error handling. With the current setup of "autodetect" an error because of a missing dependency may not bubble up until the response is read, which might be too late. A webserver would end up returning a 500 to it's client.

On the other hand if this was handled via a callback:

from starlette.charsets import autodetect   # error right here

def some_user_code():
    client = Client(default_encoding=autodetect.autodetect_charset)

(probably some better naming is needed for parameters and modules, this is just an example)

@tomchristie
Copy link
Member Author

I would actually prefer (1)

My feeling is that practically the default_encoding="autodetect" just ends up being a neater developer experience. The alternative is slightly more fiddly to setup, is more verbose and less readable as a typing, and means we need an extra module in httpx, with an additional function exposed.

Another tiny little wrinkle is that if we want the import to fast fail, then it'd need to be a public module within httpx, and imported as a module. Eg from httpx.charset_autodetect import charset_autodetect. That's at odds with all the other API in httpx, which is all within private modules, and always accessed as import httpx ... # Use httpx.<...>.

With the current setup of "autodetect" an error because of a missing dependency may not bubble up until the response is read.

Valid point. It'd probably make sense here to attempt importing charset_normalizer at the point of Client(default_encoding="autodetect"), so that any error there is raised immediately.

@adriangb
Copy link
Member

adriangb commented Apr 6, 2022

The point about the extra module/public function is valid. As for fiddleliness/verboseness, I think this is a pretty advanced use case, some verboseness is probably a good thing. One nice thing about it being a function is that developers can easily trace the logic: you can't right click a string and be taken to where it changes behavior.

In any case, I think the change to make the error show up when you instantiate the client would be a good move. In a web framework for example that would move the error from within the request response cycle (500 for their client) to startup time (likely just a failed deploy)

@tomchristie tomchristie mentioned this pull request May 9, 2022
@tomchristie
Copy link
Member Author

I've drafted up the documentation for an alternative, based on @adriangb's feedback here...

"Given the two options presented, I would actually prefer (1)."

  1. Add support for default_encoding=..., and allow it to either be a string, or a callable.

I've written up below how I think the documentation should look if we want to have default_encoding support callables.

Note that the difference here is in the "Using character set auto-detection" section. The rest remains the same.

If we do want it to support callables, then I'd rather we don't include those implementations directly in httpx. This way around we fully remove the dependancy, which just feels neater to me.

Any reviews based on the documentation would be welcome. I'll draft up a full pull request based on these docs, but this seemed like a good starting point in the meantime...


Character set encodings and auto-detection

When accessing response.text, we need to decode the response bytes into a unicode text representation.

By default httpx will use "charset" information included in the response Content-Type header to determine how the response bytes should be decoded into text.

In cases where no charset information is included on the response, the default behaviour is to assume "utf-8" encoding, which is by far the most widely used text encoding on the internet.

Using the default encoding

To understand this better let's start by looking at the default behaviour for text decoding...

import httpx
# Instantiate a client with the default configuration.
client = httpx.Client()
# Using the client...
response = client.get(...)
print(response.encoding)  # This will either print the charset given in
                          # the Content-Type charset, or else "utf-8".
print(response.text)  # The text will either be decoded with the Content-Type
                      # charset, or using "utf-8".

This is normally absolutely fine. Most servers will respond with a properly formatted Content-Type header, including a charset encoding. And in most cases where no charset encoding is included, UTF-8 is very likely to be used, since it is so widely adopted.

Using an explicit encoding

In some cases we might be making requests to a site where no character set information is being set explicitly by the server, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.

import httpx
# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")
# Using the client...
response = client.get(...)
print(response.encoding)  # This will either print the charset given in
                          # the Content-Type charset, or else "shift-jis".
print(response.text)  # The text will either be decoded with the Content-Type
                      # charset, or using "shift-jis".

Using character set auto-detection

In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.

To use auto-detection you need to set the default_encoding argument to a callable instead of a string. This callable should be a function which takes the input bytes as an argument and returns the character set to use for decoding those bytes to text.

There are two widely used Python packages which both handle this functionality:

Let's take a look at installing autodetection using one of these packages...

$ pip install httpx
$ pip install chardet

Once chardet is installed, we can configure a client to use character-set autodetection.

import httpx

def autodetect(content):
    return chardet.detect(content).get("encoding")

# Using a client with character-set autodetection enabled.
client = httpx.Client(default_encoding=autodetect)
response = client.get(...)
print(response.encoding)  # This will either print the charset given in
                            # the Content-Type charset, or else the auto-detected
                            # character set.
print(response.text)

@bratao
Copy link

bratao commented May 9, 2022

Just an +1 from me. Debugging httpx on a high speed crawler, the decode (httpx/_decoders.py:76) is a bottleneck. I would like the option to not decode it automatically to speedup my code.

@adriangb
Copy link
Member

adriangb commented May 9, 2022

That looks nice Tom, thank you for writing it up this clearly. I think the example makes it easy to understand. And like you say, completely removing the dependency would be very neat.

Is the content parameter to autodetect the entire content body, in bytes? I suppose it only gets called if the user called .text, so it wouldn't read the body into memory unless the user requested the entire body anyway, right?

@tomchristie
Copy link
Member Author

@bratao Gotcha - that's a different consideration, which I've opened a discussion for... #2220

@tomchristie
Copy link
Member Author

Is the content parameter to autodetect the entire content body, in bytes?

Yes.

so it wouldn't read the body into memory unless the user requested the entire body anyway, right?

Yes.

The one additional consideration here is what to do when the user calls iter_text. In that case we don't have the entire content available before we need to take a decision on which text encoding to use. Currently we don't do autodetect in this case and just fallback to utf-8 always. We probably just want to stick with that behaviour. (At least for now)

@tomchristie
Copy link
Member Author

Now updated based on @adriangb's feedback.

@tomchristie tomchristie merged commit 1c33a28 into master May 23, 2022
@tomchristie tomchristie deleted the add-default-encoding-support branch May 23, 2022 15:27
archlinux-github pushed a commit to archlinux/svntogit-community that referenced this pull request May 23, 2022
…d upstream

This is pushed to [community-testing] due to behavior changes [1] in
this version. More testing needed.

* charset-normalizer is no longer needed since [1]
* rich is optional - used for CLI only
* Fill optdepends per namcap reports
* Remove the CVE fix, which is included in this version
* Workaround test failures from newer pytest-asyncio

[1] encode/httpx#2165


git-svn-id: file:///srv/repos/svn-community/svn@1210123 9fca08f4-af9d-4005-b8df-a31f2cc04f65
archlinux-github pushed a commit to archlinux/svntogit-community that referenced this pull request May 23, 2022
…d upstream

This is pushed to [community-testing] due to behavior changes [1] in
this version. More testing needed.

* charset-normalizer is no longer needed since [1]
* rich is optional - used for CLI only
* Fill optdepends per namcap reports
* Remove the CVE fix, which is included in this version
* Workaround test failures from newer pytest-asyncio

[1] encode/httpx#2165

git-svn-id: file:///srv/repos/svn-community/svn@1210123 9fca08f4-af9d-4005-b8df-a31f2cc04f65
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants