Make charset auto-detection optional. #2165
Overall I am very much 👍 for not having automagic behavior as the default, especially when there is a sane default like

Commenting on #2152 (review): Given the two options presented, I would actually prefer (1).
I think this particular sort of overloading is quite common and not hard to understand. I also think it leads to better error handling. With the current setup of

On the other hand if this was handled via a callback:

from starlette.charsets import autodetect  # error right here

def some_user_code():
    client = Client(default_encoding=autodetect.autodetect_charset)

(probably some better naming is needed for parameters and modules, this is just an example) |
My feeling is that practically the Another tiny little wrinkle is that if we want the import to fast fail, then it'd need to be a public module within
Valid point. It'd probably make sense here to attempt importing |
The point about the extra module/public function is valid. As for fiddleliness/verboseness, I think this is a pretty advanced use case, some verboseness is probably a good thing. One nice thing about it being a function is that developers can easily trace the logic: you can't right click a string and be taken to where it changes behavior. In any case, I think the change to make the error show up when you instantiate the client would be a good move. In a web framework for example that would move the error from within the request response cycle (500 for their client) to startup time (likely just a failed deploy) |
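To make the startup-time versus request-time point concrete, here is a minimal sketch of eager loading. It is not httpx code: `load_detector` and `startup` are hypothetical names, and the sketch assumes the detection package exposes a chardet-style `detect(bytes)` function that returns a dict.

```python
import importlib
from typing import Callable, Optional


def load_detector(module_name: str) -> Callable[[bytes], Optional[str]]:
    """Import a charset-detection package right away and wrap its detect().

    Assumption: the package exposes a chardet-style detect(bytes) -> dict.
    """
    # A missing package raises ImportError here, at setup time,
    # rather than later while a response is being decoded.
    module = importlib.import_module(module_name)

    def autodetect(content: bytes) -> Optional[str]:
        return module.detect(content).get("encoding")

    return autodetect


def startup(module_name: str) -> str:
    """Hypothetical application startup hook that builds the detector."""
    try:
        load_detector(module_name)
    except ImportError as exc:
        return f"startup failed: {exc.name} is not installed"
    return "startup ok"
```

With this shape, a deployment that lacks the detection package would report the failure from `startup(...)` before any request is served.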
I've drafted up the documentation for an alternative, based on @adriangb's feedback here...
I've written up below how I think the documentation should look if we want to have

Note that the difference here is in the "Using character set auto-detection" section. The rest remains the same.

If we do want it to support callables, then I'd rather we don't include those implementations directly in

Any reviews based on the documentation would be welcome. I'll draft up a full pull request based on these docs, but this seemed like a good starting point in the meantime...

Character set encodings and auto-detection

When accessing

By default

In cases where no charset information is included on the response, the default behaviour is to assume "utf-8" encoding, which is by far the most widely used text encoding on the internet.

Using the default encoding

To understand this better let's start by looking at the default behaviour for text decoding...

import httpx
# Instantiate a client with the default configuration.
client = httpx.Client()
# Using the client...
response = client.get(...)
print(response.encoding) # This will either print the charset given in
# the Content-Type charset, or else "utf-8".
print(response.text) # The text will either be decoded with the Content-Type
# charset, or using "utf-8".

This is normally absolutely fine. Most servers will respond with a properly formatted Content-Type header, including a charset encoding. And in most cases where no charset encoding is included, UTF-8 is very likely to be used, since it is so widely adopted.

Using an explicit encoding

In some cases we might be making requests to a site where no character set information is being set explicitly by the server, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.

import httpx
# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")
# Using the client...
response = client.get(...)
print(response.encoding) # This will either print the charset given in
# the Content-Type charset, or else "shift-jis".
print(response.text) # The text will either be decoded with the Content-Type
# charset, or using "shift-jis".

Using character set auto-detection

In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.

To use auto-detection you need to set the

There are two widely used Python packages which both handle this functionality:
Let's take a look at installing autodetection using one of these packages...

$ pip install httpx
$ pip install chardet

Once

import httpx
import chardet

def autodetect(content):
    return chardet.detect(content).get("encoding")
# Using a client with character-set autodetection enabled.
client = httpx.Client(default_encoding=autodetect)
response = client.get(...)
print(response.encoding) # This will either print the charset given in
# the Content-Type charset, or else the auto-detected
# character set.
print(response.text) |
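As a rough illustration of the decoding rules described in the draft above, here is a stdlib-only sketch of how an encoding might be chosen. It is not httpx's implementation: `charset_from_content_type`, `resolve_encoding`, and `bom_autodetect` are hypothetical names. Two behaviours are assumptions rather than anything stated in the draft: falling back to "utf-8" when the callable returns nothing, and ignoring a header charset that Python does not recognise.

```python
import codecs
from email.message import Message
from typing import Callable, Optional, Union

# Hypothetical alias: an encoding name, or a callable that inspects the bytes.
DefaultEncoding = Union[str, Callable[[bytes], Optional[str]]]


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header, if present."""
    if not content_type:
        return None
    message = Message()
    message["content-type"] = content_type
    return message.get_content_charset()


def resolve_encoding(
    content_type: Optional[str],
    content: bytes,
    default_encoding: DefaultEncoding = "utf-8",
) -> str:
    """Pick a text encoding: header charset first, then the default."""
    charset = charset_from_content_type(content_type)
    if charset is not None:
        try:
            codecs.lookup(charset)
            return charset
        except LookupError:
            pass  # Assumption: an unknown charset is treated as missing.
    if callable(default_encoding):
        # Assumption: fall back to "utf-8" if the detector has no answer.
        return default_encoding(content) or "utf-8"
    return default_encoding


def bom_autodetect(content: bytes) -> Optional[str]:
    """Toy detector that only recognises byte order marks."""
    if content.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if content.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None
```

The toy `bom_autodetect` stands in for the chardet-based callable in the draft; anything that takes bytes and returns an encoding name or None fits the same slot.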
Just a +1 from me. While debugging httpx on a high-speed crawler, I found that the decode step (httpx/_decoders.py:76) is a bottleneck. I would like the option to not decode automatically, to speed up my code. |
That looks nice Tom, thank you for writing it up this clearly. I think the example makes it easy to understand. And like you say, completely removing the dependency would be very neat. Is the |
Yes.
Yes. The one additional consideration here is what to do when the user calls |
Now updated based on @adriangb's feedback. |
…d upstream

This is pushed to [community-testing] due to behavior changes [1] in this version. More testing needed.

* charset-normalizer is no longer needed since [1]
* rich is optional - used for CLI only
* Fill optdepends per namcap reports
* Remove the CVE fix, which is included in this version
* Workaround test failures from newer pytest-asyncio

[1] encode/httpx#2165

git-svn-id: file:///srv/repos/svn-community/svn@1210123 9fca08f4-af9d-4005-b8df-a31f2cc04f65
From discussion #2083

(This is a simpler alternative to a previous attempt at this functionality with #2152)

* Add Response(..., default_encoding=...)
* Add Client(..., default_encoding=...)
* Switch from autodetect as the default to utf-8 as the default.
* Drop charset_normalizer as a dependency.

Changelog

* Added Client(..., default_encoding=...) and Response(..., default_encoding=...).
* Text decoding now defaults to "utf-8" instead of character-set autodetection. charset_normalizer is no longer a dependency. To use autodetection, install chardet or charset_normalizer and use a callable for the default_encoding, like Client(default_encoding=autodetect). See the docs for an example.