Inconsistent support of IDNA hostnames in Client #1444

Closed
@Martiusweb

Description

Long story short

aiohttp's client handles IDNA hostnames in a way that seems inconsistent: the Host header always contains a decoded UTF-8 value, which seems problematic.

For instance:

  • session.get("http://éé.com/") makes a request with Host: éé.com
  • session.get("http://xn--9caa.com/") also makes a request with Host: éé.com.
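
To make the relationship between the two spellings concrete, here is a minimal sketch using only Python's built-in `idna` codec (which implements the older IDNA 2003 rules). It is not aiohttp or yarl code; it only illustrates that both URLs above name the same host.

```python
# Both spellings from the examples above name the same host.
unicode_host = "éé.com"
ascii_host = "xn--9caa.com"

# Unicode form -> ASCII-compatible (IDNA) form.
encoded = unicode_host.encode("idna").decode("ascii")

# ASCII-compatible form -> Unicode form.
decoded = ascii_host.encode("ascii").decode("idna")

print(encoded == ascii_host)
print(decoded == unicode_host)
```

The conversion is lossless in both directions for this name, so the client can choose which form to put on the wire.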

While it's unclear to me whether a Unicode hostname should always be IDNA-encoded (see below), it should at least not be decoded when the caller explicitly encoded it.

IDNA or not?

The latest HTTP/1.1 RFCs don't specify the encoding of header values, but recommend treating them as US-ASCII only, for security reasons (see: https://tools.ietf.org/html/rfc7230#section-3, especially the last paragraph of 3.2.4).
Most of the resources I read from the W3C or the IETF (normative or not) say that the hostname should always be encoded. For instance, https://www.w3.org/International/articles/idn-and-iri/#resolvedomain says:

Finally the user agent sends the request for the page. Since punycode contains no characters outside those normally allowed for protocols such as HTTP, there is no issue with the transmission of the address. This should simply match against a registered domain name.

Browsers I tested (Firefox, Chromium) always encode the hostname in IDNA.

I ran some tests against a hostname with Unicode characters served by nginx. Nginx doesn't care about the encoding and applies the virtual host rules that match the exact string. I.e., with xn--9caa.com I see the right website, while éé.com returns a 404, probably because only the IDNA-encoded version is specified in the configuration.

Expected behaviour

  • session.get("http://xn--9caa.com/") must make a request with Host: xn--9caa.com (encoded host).
  • session.get("http://éé.com/") should make a request with Host: xn--9caa.com (encoded host)
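
The expected behaviour can be sketched with the standard library alone. `expected_host_header` is a hypothetical helper written for this illustration, not part of aiohttp or yarl; it ignores ports and relies on Python's built-in `idna` codec (IDNA 2003), so treat it as an approximation of the rule rather than a proposed patch.

```python
from urllib.parse import urlsplit


def expected_host_header(url: str) -> str:
    """Hypothetical helper: the Host value this issue expects for `url`."""
    hostname = urlsplit(url).hostname  # lowercased host, port stripped
    if hostname is None:
        raise ValueError(f"no host in URL: {url!r}")
    # A name that is already ASCII (e.g. xn--9caa.com) is left unchanged.
    return hostname.encode("idna").decode("ascii")


for url in ("http://xn--9caa.com/", "http://éé.com/"):
    host = expected_host_header(url)
    # Whatever the caller passed in, the header value is pure ASCII.
    print(host, host.isascii())
```

Both URLs yield the same ASCII value, which is what a server configured with only the IDNA-encoded name would match.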

Actual behaviour

  • session.get("http://xn--9caa.com/") makes a request with a decoded host: Host: éé.com (UTF-8 encoded host).
  • session.get("http://éé.com/") makes a request with Host: éé.com too.

Suggested fix

It seems that self.url.raw_host should be used rather than self.url.host in ClientRequest:
https://github.com/KeepSafe/aiohttp/blob/master/aiohttp/client_reqrep.py#L168
(According to my quick test, yarl.URL.raw_host always returns the IDNA-encoded version, regardless of the encoding of the input URL.)
