Accept-Encoding header parsing and interpretation #8104

steverep · 2024-01-29T22:51:46Z

Is your feature request related to a problem?

Forgive me - parts of this might be considered bug fix and parts feature request, but they are highly related and intertwined in the same code. These are things I noticed while working on #8063.

Dynamic compression via enable_compression() will compare Accept-Encoding case-insensitively, but retrieving static compressed files does not it should be case-insensitive per RFC 9110).
When the force parameter is used in enable_compression(), it does not even consider Accept-Encoding. I fail to see any real use case for this and feel it should be treated like a preference instead (i.e. override the default preference to use deflate to something else). If Brotli were implemented, it may very well be more performant than deflate, so preferring it might be desirable, but then the server no longer supports legacy browsers without Brotli support.
AFAICT, there is no proper parsing of the Accept-Encoding field per RFC 9110. There is simply an in operator test of the string. For example:
- ❌ If the header is not set, it indicates any encoding is okay, but the server currently will only send uncompressed. Technically that's fine, but if a static compressed version exists but the uncompressed version does not (e.g. to save disk space), then the compressed version should be sent.
- ❌ "*" in the field selects any encoding not explicitly listed.
- ❌ Each coding is permitted to have a quality factor (e.g. "br;q=1, deflate;q=0, *;q=0). Higher quality factors should be preferred over lower ones if supported by the server, and ";q=0" says not to send that encoding.
- ✔️ If the header is empty ("") then no compression should be sent.

Describe the solution you'd like

Accept-Encoding should be treated case-insensitively when returning static compressed files.
The force parameter to enable_compression() should really act like a preference, not an override (i.e. it should not be forced if the encoding is not supported per the Accept-Encoding request header).
Parse and interpret the header per RFC 9110 with quality factors, "*", etc. Send a 415 response if merited.

Describe alternatives you've considered

None for the first 2, but admittedly the parsing issue is more or less a moot point since most clients always send the header with an explicit list and no quality factors (i.e. "gzip, deflate, br").

Related component

Server

Additional context

n/a

Code of Conduct

I agree to follow the aio-libs Code of Conduct

The text was updated successfully, but these errors were encountered:

steverep · 2024-01-30T18:37:24Z

@bdraco I think handling the 1st task here for case-sensitivity is quite easy to do before #8063 so I'll submit that.

However, the 3rd task to completely parsing it correctly per the RFC is less a priority to backport IMO. And implementing it raises a larger question for me...

Although it's relatively easy in python, shouldn't that parsing be done by llhttp, and probably the conversion to lowercase too? Meaning, by the time it hits the code to use it for picking or compressing the file, it should be lowered and in a form like {"<encoding>": <quality>, ...} instead of just the raw string. There are other headers that use this format -- Accept, Accept-Language, TE -- but I don't see any of those parsed anywhere in the code.

@Dreamsorcerer is this a question for you?

steverep · 2024-01-31T20:38:21Z

The 3rd task would be partially accomplished by #7679, which is drafted for the 4.0 milestone.

steverep · 2024-02-02T20:17:33Z

For the 2nd task regarding the behavior of the force parameter in enable_compression(), the change would be pretty easy; however, a decision is needed on intent and backporting. Either:

Bug fix back to 3.9: In other words, the intent of force was never to ignore Accept-Encoding.
Feature for 3.10: Perhaps add a new prefer parameter to make the intent clear, and potentially deprecate force in version 4.
Breaking for v4: I don't recommend this as it would be problematic for users if Brotli were implemented for 3.10 unless it was the default over deflate.

Thoughts?

bdraco · 2024-02-02T20:58:04Z

I think we need to do some git archeology to figure out why force was added in the first place before a path forward can be determined.

Dreamsorcerer · 2024-02-02T21:17:30Z

Although it's relatively easy in python, shouldn't that parsing be done by llhttp, and probably the conversion to lowercase too? Meaning, by the time it hits the code to use it for picking or compressing the file, it should be lowered and in a form like {"<encoding>": <quality>, ...} instead of just the raw string. There are other headers that use this format -- Accept, Accept-Language, TE -- but I don't see any of those parsed anywhere in the code.

My impression is that llhttp is just passing us header values. Meaning that it has parsed it as HTTP, but not converted anything based on the particular header names, which is probably out-of-scope. So, such changes would likely go into _http_parser.pyx instead, if you want the compiled performance.

However, is there actually any merit in implementing this quality check? I've seen the same syntax for locales, and I think the general consensus is that this was over-engineered and basically pointless. The quality numbers fail to achieve anything more than defining an order of preference (i.e. the actual numbers are never used for anything, they're just sorted on). Therefore, browsers don't provide any way for users to customise these quality values, they just pick evenly spaced values according to order of user preference. Because of all this, browsers will also always order them with the highest quality number at the start, so in reality, you can just read the values left-to-right while ignoring the quality numbers and you'll get the same result. Parsing the quality numbers would therefore just be a waste of CPU-time.

steverep · 2024-02-02T23:59:11Z

I think we need to do some git archeology to figure out why force was added in the first place before a path forward can be determined.

It was a boolean before #403 added gzip, and the code before that made it clear that it was actually a "force" to ignore Accept-Encoding. The boolean was originally added in #239 but there's no comments to explain any use case for it.

Dreamsorcerer · 2024-02-03T01:26:56Z

I think the force parameter is pretty clear in that it forces a given compression. I'm not clear why you'd want to do this, so I'm not clear that turning it into a preference makes sense without understanding the use case (can we find any projects using the parameter?).

Clearly it's not correct behaviour to use compression that the client claims not to support, but it also seems weird to try and use a fixed compression method in the first place...

Dreamsorcerer · 2024-02-03T01:38:21Z

I've seen the same syntax for locales, and I think the general consensus is that this was over-engineered and basically pointless.

OK, there seems to be an important difference between Accept-Encoding and Accept-Language. The former can disallow encodings with q=0, so maybe we need to atleast check for that.

Edit: OK, not a difference, but it's explicitly mentioned under Accept-Encoding, whereas it'd be very weird to try and exclude languages.

steverep · 2024-02-03T02:17:26Z

My impression is that llhttp is just passing us header values. Meaning that it has parsed it as HTTP, but not converted anything based on the particular header names, which is probably out-of-scope. So, such changes would likely go into _http_parser.pyx instead, if you want the compiled performance.

Okay, makes sense 👍🏻

However, is there actually any merit in implementing this quality check? I've seen the same syntax for locales, and I think the general consensus is that this was over-engineered and basically pointless. The quality numbers fail to achieve anything more than defining an order of preference (i.e. the actual numbers are never used for anything, they're just sorted on).

That's certainly correct - the numbers mean nothing but a sort order. I think the reason it's not just a list in preference order is that then it becomes impossible to communicate equal preference for some or all items. That's why the default quality if not specified is 1.

To a lesser extent it's also a way to have negations with a wildcard, e.g. "compress;q=0, *" says I'll take any encoding except compress. Although even the spec admits this pattern has little to no practical value.

FWIW, Apache does use the qualities in its content negotiation.

Therefore, browsers don't provide any way for users to customise these quality values, they just pick evenly spaced values according to order of user preference.

True they are not user-facing for sure (nor would they have any reason to be?), and the values are just picked to set the browser's preference.

Because of all this, browsers will also always order them with the highest quality number at the start, so in reality, you can just read the values left-to-right while ignoring the quality numbers and you'll get the same result. Parsing the quality numbers would therefore just be a waste of CPU-time.

This is definitely not true. The spec is clear that values without a quality default to 1 and browsers take advantage of that. For example, both Chrome and Firefox send Accept-Encoding: gzip, deflate, br, but that certainly does not mean they prefer gzip over brotli. Decoding time is inconsequential so the best interest for a client is smallest transfer size. Google even makes this clear in their Lighthouse audit for compression.

What they send for the Accept header also makes this clear:

Firefox = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,/;q=0.8"
Chrome = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,/;q=0.8,application/signed-exchange;v=b3;q=0.7"

Dreamsorcerer · 2024-02-03T02:21:55Z

This is definitely not true.

I was referring to the real world behaviour of Accept-Language, and wondering if the same applies. Clearly that's not the case, so you are probably right that we should use the quality value here.

steverep · 2024-02-03T21:51:58Z

Clearly it's not correct behaviour to use compression that the client claims not to support, but it also seems weird to try and use a fixed compression method in the first place...

I completely agree - neither makes sense to me. I think there is a use case for changing the server's default encoding order though. For example, a user might benchmark and find that sending large files over long distances is likely faster with gzip than deflate (i.e. the larger encoding time is offset by the faster transfer time). So they might want the order to be ["br", "gzip", ..others].

If I were to redesign the current API, it might look something like this to allow a set of encodings to be preferred first but then fallback to others supported:

    def __init__(self, ...) -> None:
        ...
        self._ordered_codings = deque(ContentCoding, len(ContentCoding))

    def enable_compression(self, *prefer: ContentCoding) -> None:
        """Enables response compression with preferred encodings."""
        self._compression = True
        for encoding in prefer:
            self._ordered_codings.remove(encoding)
        self._ordered_codings.extend(prefer)
        self._ordered_codings.rotate(len(prefer))

    async def _start_compression(self, request: "BaseRequest") -> None:
        # Encoding comparisons should be case-insensitive
        # https://www.rfc-editor.org/rfc/rfc9110#section-8.4.1
        accept_encoding = request.headers.get(hdrs.ACCEPT_ENCODING, "").lower()
        for coding in self._ordered_codings:
            if coding.value in accept_encoding:
                await self._do_start_compression(coding)
                return

Dreamsorcerer · 2024-02-04T00:12:51Z

Yeah, I'd be open to something like that. I wonder if it's worth adding this as a new feature (and potentially removing force if we can't find any uses for it).

steverep · 2024-02-05T04:25:03Z

Yeah, I'd be open to something like that. I wonder if it's worth adding this as a new feature (and potentially removing force if we can't find any uses for it).

Here's my attempt at a search for enable_compression() uses with an argument, after a lot of filtering to get rid of just copies of aiohttp itself.

There are some results using the old boolean, some oddly specifying "identity" 😕, some just specifying "gzip" that clearly would have no harm by turning to a preference (e.g. ignore the Home Assistant results), and even one result that is doing the job aiohttp should be doing and checking the Accept-Encoding header 😉.

Passing "identity" is concerning because currently that would just do nothing, but under my proposal, it would be skipped because it's not going to be in the Accept-Encoding header and then the algorithm would go on to deflate. I could either put an assertion in that disallows it, or just adjust the header check because it's implied as included (unless explicitly excluded).

Dreamsorcerer · 2024-02-05T12:35:32Z

Here's my attempt at a search for enable_compression() uses with an argument, after a lot of filtering to get rid of just copies of aiohttp itself.

That gives me no results for some reason. I did create a script previously to search dependents, though it can be slow to run (it searches 1 repo every ~7 seconds due to rate limit): https://gist.github.com/Dreamsorcerer/70285fac0a11c3d9c26b577f7dd989a7

There are some results using the old boolean

Might be interesting to see if any of them mention any rationale in the commit they were added...

Passing "identity" is concerning because currently that would just do nothing, but under my proposal, it would be skipped because it's not going to be in the Accept-Encoding header and then the algorithm would go on to deflate. I could either put an assertion in that disallows it, or just adjust the header check because it's implied as included (unless explicitly excluded).

My counter-proposal is to just add a new prefer parameter which takes a sequence of strings. Then add a deprecation warning to the force parameter and remove it from v4. If we get complaints from the deprecation notice, then we can reconsider removing it.

steverep · 2024-02-05T14:36:14Z

That gives me no results for some reason.

The markdown sanitizer keeps removing all the escapes from the regex parts. The search term should look like this with 20 results for me:

aiohttp AND /\.enable_compression$.+$/ language:python NOT is:archived NOT is:fork NOT repo:aio-libs/aiohttp NOT is:vendored NOT is:generated NOT path:/(^|\/)aiohttp\/(multipart|web_response|client_reqrep)\.py$/ NOT path:/tests\/test_(web_response|client_functional|http_writer)\.py$/ NOT repo:cndn/intelligent-code-completion

Might be interesting to see if any of them mention any rationale in the commit they were added...

I didn't check, but one of the false arguments did have a comment mentioning it was a workaround for the double compression issue recently fixed by @bdraco.

My counter-proposal is to just add a new prefer parameter which takes a sequence of strings. Then add a deprecation warning to the force parameter and remove it from v4. If we get complaints from the deprecation notice, then we can reconsider removing it.

Not as elegant, but safer of course. Also, consider that either way this becomes a "breaking" change. For example, the Home Assistant use is definitely not expecting a force, so a change would be required to specify prefer=.

steverep added the enhancement label Jan 29, 2024

steverep mentioned this issue Jan 29, 2024

Add server capability to check for Brotli compressed static files #8063

Merged

5 tasks

webknjaz assigned steverep Jan 30, 2024

steverep mentioned this issue Feb 1, 2024

Treat Accept-Encoding header as case-insensitive for gzip file check #8127

Merged

5 tasks

This was referenced Feb 2, 2024

[PR #8127/b97d9a97 backport][3.9] Treat Accept-Encoding header as case-insensitive for gzip file check #8130

Merged

[PR #8127/b97d9a97 backport][3.10] Treat Accept-Encoding header as case-insensitive for gzip file check #8131

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accept-Encoding header parsing and interpretation #8104

Accept-Encoding header parsing and interpretation #8104

steverep commented Jan 29, 2024 •

edited

steverep commented Jan 30, 2024 •

edited

steverep commented Jan 31, 2024

steverep commented Feb 2, 2024

bdraco commented Feb 2, 2024

Dreamsorcerer commented Feb 2, 2024 •

edited

steverep commented Feb 2, 2024

Dreamsorcerer commented Feb 3, 2024

Dreamsorcerer commented Feb 3, 2024 •

edited

steverep commented Feb 3, 2024

Dreamsorcerer commented Feb 3, 2024

steverep commented Feb 3, 2024

Dreamsorcerer commented Feb 4, 2024

steverep commented Feb 5, 2024 •

edited

Dreamsorcerer commented Feb 5, 2024

steverep commented Feb 5, 2024 •

edited

Accept-Encoding header parsing and interpretation #8104

Accept-Encoding header parsing and interpretation #8104

Comments

steverep commented Jan 29, 2024 • edited

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Related component

Additional context

Code of Conduct

steverep commented Jan 30, 2024 • edited

steverep commented Jan 31, 2024

steverep commented Feb 2, 2024

bdraco commented Feb 2, 2024

Dreamsorcerer commented Feb 2, 2024 • edited

steverep commented Feb 2, 2024

Dreamsorcerer commented Feb 3, 2024

Dreamsorcerer commented Feb 3, 2024 • edited

steverep commented Feb 3, 2024

Dreamsorcerer commented Feb 3, 2024

steverep commented Feb 3, 2024

Dreamsorcerer commented Feb 4, 2024

steverep commented Feb 5, 2024 • edited

Dreamsorcerer commented Feb 5, 2024

steverep commented Feb 5, 2024 • edited

steverep commented Jan 29, 2024 •

edited

steverep commented Jan 30, 2024 •

edited

Dreamsorcerer commented Feb 2, 2024 •

edited

Dreamsorcerer commented Feb 3, 2024 •

edited

steverep commented Feb 5, 2024 •

edited

steverep commented Feb 5, 2024 •

edited