Skip to content

Cleanup response.json() method.#2911

Merged
lovelydinosaur merged 3 commits into
masterfrom
cleanup-json-decoding
Oct 31, 2023
Merged

Cleanup response.json() method.#2911
lovelydinosaur merged 3 commits into
masterfrom
cleanup-json-decoding

Conversation

@lovelydinosaur
Copy link
Copy Markdown
Contributor

The Response.json() method appears unnecessarily complicated, I think this is a historical appendix that can be surgically tidied up. The stdlib implementation of json.loads already includes the charset detection that we're currently providing with our internal guess_json_utf() function.

  • Switch test_utils to use a public API for testing JSON utf-8 encodings.
  • Remove the internal guess_json_utf() function.
  • Simpliy the Response.json() method.

@lovelydinosaur lovelydinosaur added the refactor Issues and PRs related to code refactoring label Oct 30, 2023
@lovelydinosaur lovelydinosaur marked this pull request as ready for review October 30, 2023 14:36
Copy link
Copy Markdown
Contributor

@T-256 T-256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with this change.
No more need _guess_content_json_utf at #2851.

@T-256
Copy link
Copy Markdown
Contributor

T-256 commented Oct 30, 2023

#2851 now included this change, can you review that? I think we can merge #2851 instead of this PR.

@lovelydinosaur
Copy link
Copy Markdown
Contributor Author

can you review that?

👍

@lovelydinosaur lovelydinosaur merged commit 1d73150 into master Oct 31, 2023
@lovelydinosaur lovelydinosaur deleted the cleanup-json-decoding branch October 31, 2023 10:10
@whitehatboxer
Copy link
Copy Markdown

Sorry to bother all, but in my situation, changes in response.json() from json.loads(self.text) to json.loads(self.content) have caused UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 351: invalid start byte.

It cames from that I found some test failed in my new system. I diffed it with my old one, and found there was a difference between httpx from 0.27.0 to 0.22.0. The new system had the new version.

So I had downgraded httpx from 0.27.0 to 0.22.0, and really resloved this bug.

After that, I tried to found the reason. I delved into code, and thought it was guess_json_utf, so I searched this keyword and found this pr.
I tried to add some debug cod in _models.py, it turned out that no deal with guess_json_utf.

    def json(self, **kwargs: typing.Any) -> typing.Any:
        if self.charset_encoding is None and self.content and len(self.content) > 3:
            encoding = guess_json_utf(self.content)
            print("==== encoding was: ", encoding)
            if encoding is not None:
                return jsonlib.loads(self.content.decode(encoding), **kwargs)
        print("==== no encoding")
        return jsonlib.loads(self.text, **kwargs)

Then I guessed that using self.text is necessary in some uncommon situation.

@lovelydinosaur
Copy link
Copy Markdown
Contributor Author

Then I guessed that using self.text is necessary in some uncommon situation.

The server is likely returning non-compliant JSON.

https://datatracker.ietf.org/doc/html/rfc8259#section-8.1

JSON text exchanged between systems that are not part of a closed
ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8
when transmitting JSON text. However, the vast majority of JSON-
based software implementations have chosen to use the UTF-8 encoding,
to the extent that it is the only encoding that achieves
interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the
beginning of a networked-transmitted JSON text. In the interests of
interoperability, implementations that parse JSON texts MAY ignore
the presence of a byte order mark rather than treating it as an
error.

Take a look at what Content-Type is being returned.

Also save response.content and determine from the prior version of guess_json_utf what encoding is being used there.

If you're able to resolve the issue on the server-side, then do that because it appears to be a server-side issue. Otherwise use json.loads(response.text) or json.loads(response.content.decode(whatever)).

Comment thread httpx/_models.py
encoding = guess_json_utf(self.content)
if encoding is not None:
return jsonlib.loads(self.content.decode(encoding), **kwargs)
return jsonlib.loads(self.text, **kwargs)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a random thing I noticed looking through the code, not a problem I've seen in the wild: isn't using self.text like this this better than the new version when there's a non-utf charset?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON is spec'ed as always UTF encoded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

refactor Issues and PRs related to code refactoring

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants