drafts/url.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<!--<!DOCTYPE rfc SYSTEM "rfc2629.dtd">-->
<rfc ipr="trust200902" docName="draft-abarth-url-01" category="std">
  <front>
    <title abbrev="How Browsers Process URLs">
      Parsing URLs for Fun and Profit
    </title>
    <author initials="A." surname="Barth" fullname="Adam Barth">
      <organization abbrev="Google, Inc.">
        Google, Inc.
      </organization>
      <address>
        <email>ietf@adambarth.com</email>
        <uri>http://www.adambarth.com/</uri>
      </address>
    </author>
    <date month="April" year="2011"/>
    <workgroup>iri</workgroup>
    <abstract>
      <t>This document contains a precise specification of how browsers process
      URLs.  The behavior specified in this document might or might not match
      any particular browser, but browsers might be well-served by adopting the
      behavior defined herein.</t>
    </abstract>
    <note title="Editorial Note (To be removed by RFC Editor)">
      <t>If you have suggestions for improving this document, please send
      email to <eref target="mailto:public-iri@w3.org" />.  Further Working
      Group information is available from <eref
      target="https://tools.ietf.org/wg/iri/"/>.</t>
    </note>
  </front>
  <middle>
    <section title="Open Issues">
      <t>Browsers parse URLs differently depending on which operating system
      they're running on.  The problem is that they want to do sensible things
      for file paths, but file paths look different on Windows and Unix
      systems.</t>

      <t>How should we handle cases where browsers disaggree with the regular
      expression in RFC 3986?  Currently, this document aims to describe how
      browsers behave, but we'll likely need to compare that to RFC 3986 at
      some point.  Some specific differences that have been brought up on the
      mailing list:
      <list style="symbols">
        <t>http:///example.com/</t>
        <t>http://example.com;</t>
      </list>
      </t>
    </section>
    <section title="Definitions">
      <t>A control character is a character whose value is less than or equal
      to U+0020 (" ").</t>

      <t>A slash character is either U+???? ("/") or U+???? ("\"). TODO:
      There's some question as to whether this is necessary for non-file
      URLs.</t>

      <t>An authority terminating character is either a slash charcter,
      U+???? ("?"), U+???? ("#"), or U+???? (";").  TODO: Why is ";" on this
      list?</t>

      <t>During a parsing algorithm, the remaining string is the characters of
      the input that have not yet been consumed.</t>
    </section>
    <section title="Parsing a URL">
      <t>Given a string of characters, consume all leading and trailing control
      characters.</t>

      <t>Find the scheme, as described in Section ??.</t>

      <t>If the algorithm for finding the scheme determines that the URL is
      invalid:
      <list>
        <t>-> Abort these steps.</t>
      </list>
      </t>

      <t>If the scheme is a single upper or lower case ASCII character (TODO:
      Just ALPHA?):
      <list>
        <t>-> TODO: Windows drive specs!</t>
      </list>
      </t>

      <t>If the scheme is a ASCII case-insensitive match for "file":
      <list>
        <t>-> TODO: File URLs!</t>
      </list>
      </t>

      <t>If the scheme is a ASCII case-insensitive match for "mailto":
      <list>
        <t>-> TODO: I think mailto URLs are special, but more testing is
        required.</t>
      </list>
      </t>

      <t>If the scheme is hierarchical:
      <list>
        <t>-> In the after-scheme, if any, find the authority, path, query, and
        fragment, as described in Section ??.</t>

        <t>-> In the authority, if any, find the user-info, host, and port, as
        described in Section ??.</t>

        <t>-> In the user-info, if any, find the user name and password, as
        described in Section ??.</t>

        <t>-> Abort these steps.</t>
      </list>
      </t>

      <t>The remaining string is the path. TODO: This might not be the best
      approach.  We need to do more testing of data and javascript URLs.</t>

      <section title="Finding the scheme">
        <t>If the remaining string does not contain a ":" character:
        <list>
          <t>-> The URL is invalid.</t>

          <t>-> Abort these steps.</t>
        </list>
        </t>

        <t>Consume characters up to, but not including, the first ":"
        character.  These characters are the scheme.</t>

        <t>Consume the ":" character.</t>

        <t>The remaining characters are the after-scheme.</t>
      </section>
      <section title="Finding the authority, path, query, and fragment">
        <t>Consume any number of slash characters.</t>

        <t>If the remaining string does not contain any authority terminating
        characters:
        <list>
          <t>-> The remaining string is the authority.</t>

          <t>-> Abort these steps.</t>
        </list>
        </t>

        <t>Consume characters up to, but not including, the first authority
        terminating character.  The consumed characters are authority.</t>

        <t>If the remaining string does not contain a "?" character or a "#"
        character:
        <list>
          <t>-> The remaining string is the path.</t>

          <t>-> Abort these steps.</t>
        </list>
        </t>

        <t>Consume characters up to, but not including, the first "?" or "#"
        charcter.  The consumed characters are the path.</t>

        <t>If the first character of the remaining string is a "?" character:
        <list>
          <t>-> Consume the "?" character.</t>

          <t>-> If the remaining string does not contain a "#" character:
          <list>
            <t>-> The remaining string is the query.</t>

            <t>-> Abort these steps.</t>
          </list>
          </t>

          <t>-> Consume characters up to, but not including, the first "#"
          charcter.  The consumed characters are the query.</t>
        </list>
        </t>

        <t>Consume the "#" character.</t>

        <t>The remaining string is the fragment.</t>
      </section>
      <section title="Finding the user-info, host, and port">
        <t>If the remaining string contains an "@" character:
        <list>
          <t>-> Consume characters up to, but not including the *last* "@"
          character.  The consumed characters are the user-info.</t>

          <t>-> Consume the "@" character.</t>
        </list>
        </t>

        <t>If the remaining string does not contain an ":" character:
        <list>
          <t>-> The remaining string is the host.</t>

          <t>-> Abort these steps.</t>
        </list>
        </t>

        <t>If the first character of the remaining string is a "[" character,
        the remaining string contains a "]" character, and the last ":"
        character in the remaining string occurs before the last "]" character
        in the remaining string:
        <list>
          <t>-> The remaining string is the host.</t>

          <t>-> Abort these steps.</t>
        </list>
        </t>

        <t>Consume characters up to, but not including, the last ":" character.
        The consumed characters are the host.</t>

        <t>Consume the ":" character.</t>

        <t>The remaining string is the port.</t>
      </section>
      <section title="Find the user name and password">
        <t>If the remaining string does not contain a ":" character:
        <list>
          <t>-> The remaining string is the user name.</t>

          <t>-> Abort these steps.</t>
        </list>
        </t>

        <t>Consume characters up to, but not including, the first ":"
        character.  The consumed characters are the user name.</t>

        <t>Consume the ":" character.</t>

        <t>The remaining string is the password.</t>
      </section>
    </section>
    <section title="Resolving a string relative to a base URL">
      <t>Given a string relative-url and a ParsedURL base-url, find the scheme
      of relative-url.</t>

      <t>TODO: We probably need to trim leading and trailing control
      characters.</t>

      <t>If relative-url is an invalid URL:
      <list>
        <t>-> The resolved URL is relative-url resolved as relative URL.</t>

        <t>-> Abort these steps.</t>
      </list>
      </t>

      <t>If relative-url's scheme contains any characters which are not "valid scheme
      characters" (TODO: Define valid scheme characters):
      <list>
        <t>-> The resolved URL is relative-url resolved as relative URL.</t>

        <t>-> Abort these steps.</t>
      </list>
      </t>

      <t>If base-url's scheme is an ASCII case insensitive match for relative-url's
      scheme and the shared scheme is hierarchical:
      <list>
        <t>-> The resolved URL is relative-url's after-scheme resolved as a relative
        URL.</t>

        <t>-> Abort these steps.</t>
      </list>
      </t>

      <t>The resolved URL is relative-url parsed as an absolute URL.</t>

      <section title="Resolving a string as a relative URL">
        <t>Given a string relative-url and a ParsedURL base-url, determine the
        resolved URL as follows:</t>

        <t>TODO: If base-url's scheme is not hierarchical, we can't resolve as
        a relative URL.  We'll probably want to return an invalid URL.  Check
        what happens when resolving an empty string as a relative URL with a
        non-hierarchical base.</t>

        <t>If relative-url is empty:
        <list>
          <t>-> The resolved URL is identical to base-url, with the fragment,
          if any, removed.</t>

          <t>-> Abort these steps.</t>
        </list>
        </t>

        <t>If the first character of relative-url is a slash character:
        <list>
          <t>-> If relative-url has at least two characters and the second character is
          also a slash character:
          <list>
            <t>-> The resolved URL is relative-url resolved as a scheme-relative URL.</t>
          </list>
          Otherwise:
          <list>
            <t>-> The resolved URL is relative-url resolved as an authority-relative URL.</t>
          </list>
          </t>

          <t>-> Abort these steps.</t>
        </list>
        </t>

        <t>If the first character of relative-url is a "?" character:
        <list>
          <t>-> The resolved URL is relative-url resolved as a query-relative URL.</t>

          <t>-> Abort these steps.</t>
        </list>
        </t>

        <t>If the first character of relative-url is a "#" character:
        <list>
          <t>-> The resolved URL is relative-url resolved as a fragment-relative
          URL.</t>

          <t>-> Abort these steps.</t>
        </list>
        </t>

        <t>TODO: Think about the case where the relative-url is empty.</t>

        <t>The resolved URL is relative-url resolved as a path-relative URL.</t>
      </section>
      <section title="Resolving a string as a scheme-relative URL">
        <t>Given a string relative-url and a ParsedURL base-url, let
        resolved-url be
        <list style="symbols">
          <t>base-url's scheme</t>
          <t>concatenated with ":",</t>
          <t>concatenated with relative-url.</t>
        </list>
        </t>

        <t>The resolved URL is resolved-url parsed as an absolute URL.</t>
      </section>
      <section title="Resolving a string as an authority-relative URL">
        <t>Given a string relative-url and a ParsedURL base-url, let
        resolved-url be
        <list style="symbols">
          <t>base-url's scheme</t>
          <t>concatenated with "://",</t>
          <t>concatenated with base-url's authority,</t>
          <t>concatenated with relative-url.</t>
        </list>
        </t>

        <t>The resolved URL is resolved-url parsed as an absolute URL.</t>
      </section>
      <section title="Resolving a string as a path-relative URL">
        <t>TODO: Can the first character of relative-url be a slash character
        at this point?</t>

        <t>TODO: Can we assume base-url is canonicalized here so that it always
        has at least one "/" character?</t>

        <t>Let the directory-name be the characters of the base-url's path up
        to and including the last slash character.</t>

        <t>Let resolved-url be
        <list style="symbols">
          <t>base-url's scheme</t>
          <t>concatenated with "://",</t>
          <t>concatenated with base-url's authority,</t>
          <t>concatenated with directory-name.</t>
          <t>concatenated with relative-url.</t>
        </list>
        </t>

        <t>The resolved URL is resolved-url parsed as an absolute URL.</t>
      </section>
      <section title="Resolving a string as a query-relative URL">
        <t>Given a string relative-url and a ParsedURL base-url, let
        resolved-url be
        <list style="symbols">
          <t>base-url's scheme</t>
          <t>concatenated with "://",</t>
          <t>concatenated with base-url's authority,</t>
          <t>concatenated with base-url's path,</t>
          <t>concatenated with relative-url.</t>
        </list>
        </t>

        <t>The resolved URL is resolved-url parsed as an absolute URL.</t>
      </section>
      <section title="Resolving a string as a fragment-relative URL">
        <t>Given a string relative-url and a ParsedURL base-url, let
        resolved-url be
        <list style="symbols">
          <t>base-url's scheme</t>
          <t>concatenated with "://",</t>
          <t>concatenated with base-url's authority,</t>
          <t>concatenated with base-url's path,</t>
          <t>concatenated with "?",</t>
          <t>concatenated with base-url's query,</t>
          <t>concatenated with relative-url.</t>
        </list>
        </t>

        <t>The resolved URL is resolved-url parsed as an absolute URL.</t>
      </section>
    </section>
    <section title="Canonicalizing a URL">
      <t>This section describes how to construct a canonical version of a
      parsed URL string. TODO: We probably should mention somewhere that there
      is *not* a unique canonicalization for every URL.</t>

      <t>Given parsed URL original-url, if original-url is invalid:
      <list>
        <t>-> Abort these steps.</t>
      </list>
      </t>

      <t>TODO: Handle file URLs.</t>

      <t>If the scheme is hierarchical:
      <list>
        <t>Output the canonicalized scheme (as described in Section ??).</t>

        <t>Output "://".</t>

        <t>If the user-info is non-empty:
        <list>
          <t>Output the canonicalized user-info (as described in Section ??).</t>

          <t>Output "@".</t>
        </list>
        </t>

        <t>Output the canonicalized host (as described in Section ??).</t>

        <t>Let the canonicalized-port be the canonicalized port (as described
        in Section ??).</t>

        <t>If the canonicalized-port is non-empty and is not the default port
        for the scheme:
        <list>
          <t>Output ":".</t>

          <t>Output the canonicalized-port.</t>
        </list>
        </t>

        <t>Output the canonicalized path (as described in Section ??).</t>

        <t>Let the canonicalized-query be the canonicalized query (as described
        in Section ??).</t>

        <t>If the canonicalized-query is non-empty (TODO: Distinguish between
        empty and non-existent queries):
        <list>
          <t>Output "?".</t>

          <t>Output the canonicalized-query.</t>
        </list>
        </t>

        <t>Let the canonicalized-fragment be the canonicalized fragment (as
        described in Section ??).</t>

        <t>If the canonicalized-fragment is non-empty (TODO: Distinguish
        between empty and non-existent fragments):
        <list>
          <t>Output "#".</t>

          <t>Output the canonicalized-fragment.</t>
        </list>
        </t>
      </list>
      </t>
      <section title="Canonicalizing a Scheme">
        <t>If the first character of the scheme is not in ALPHA, the scheme is
        invalid.</t>

        <t>Process each character of the scheme in sequence:
        <list>
          <t>If the current character is among ALPHA, DIGIT, "+", "-", and ".":
          <list>
            <t>-> Output the current character.</t>
          </list>
          Otherwise, if the current character is "%":
          <list>
            <t>-> The scheme is invalid.</t>

            <t>-> Output the current character.</t>
          </list>
          Otherwise:
          <list>
            <t>-> The scheme is invalid.</t>

            <t>-> Output the utf8-percent-escaping of the current
            character.</t>
          </list>
          </t>
        </list>
        </t>
      </section>
      <section title="Canonicalizing a User-Info">
        <t>Process each character of the username in sequence:
        <list>
          <t>If the current character is among TODO:
          <list>
            <t>-> Output the current character.</t>
          </list>
          Otherwise:
          <list>
            <t>-> Output the utf8-percent-escaping of the current
            character.</t>
          </list>
          </t>
        </list>
        </t>

        <t>If there is no password or if the password is empty:
        <list>
          <t>-> Abort these steps.</t>
        </list>
        </t>

        <t>Output ":".</t>

        <t>Process each character of the password in sequence:
        <list>
          <t>If the current character is among TODO:
          <list>
            <t>-> Output the current character.</t>
          </list>
          Otherwise:
          <list>
            <t>-> Output the utf8-percent-escaping of the current
            character.</t>
          </list>
          </t>
        </list>
        </t>
      </section>
      <section title="Canonicalizing a Host">
        <t>TODO: Handle IP addresses.</t>

        <t>Let unicode-host be the host-escape-normalized host (see Section ??).</t>

        <t>Output result of applying the IDNA to-ascii algorithm to the
        unicode-host. TODO: Properly reference IDNA's to-ascii algorith (we
        might need a wrapper like we do in the cookie spec).</t>

        <section title="Host Escape Normalization">

          <figure>
            <artwork type="abnf">
host-escaped   = U+0000-U+002A / U+002C / U+002F / U+003B-U+0040 / U+005C /
                 U+005E / U+0060 / U+007B-U+007F
            </artwork>
          </figure>

          <t>Process each character of the host in sequence:
          <list>
            <t>If the current character is "%":
            <list>
              <t>-> TODO: Handle percent-unescaping.</t>
            </list>
            </t>

            <t>If the current character matches host-escaped:
            <list>
              <t>-> Output the utf8-percent-escaping of the current
              character.</t>
            </list>
            Otherwise, if the current character matches ALPHA:
            <list>
              <t>-> Output the current character converted to lower case.</t>
            </list>
            Otherwise:
            <list>
              <t>-> Output the current character.</t>
            </list>
            </t>
          </list>
          </t>
        </section>
      </section>
      <section title="Canonicalizing a Path">
        <t>TODO: Do we need to ensure that path's always start with a slash
        character?</t>

        <t>If the path is empty:
        <list>
          <t>-> Ouput "/" and abort these steps.</t>
        </list>
        </t>

        <figure>
          <artwork type="abnf">
path-escaped   = U+0000-U+0020 / U+0022-U+0023 / U+0025 / U+003C / U+003E /
                 U+003F / U+005C / U+005E / U+0060 / U+007B-U+007D / U+007F
path-unescaped = "-" / DIGIT / ALPHA / "_" / "~"
          </artwork>
        </figure>

        <t>Process each character of the path in sequence:
        <list>
          <t>If the current character matches path-escaped or is greater
          than or equal to U+0080:
          <list>
            <t>-> Output the utf8-percent-escaping of the current
            character.</t>
          </list>
          Otherwise, if the current character is ".":
          <list>
            <t>-> TODO: Handle "." collapsing.</t>
          </list>
          Otherwise, if the current character is "\":
          <list>
            <t>-> Output "/".</t>
          </list>
          Otherwise, if the current character is "%":
          <list>
            <t>-> TODO: Handle percent-unescaping.</t>
          </list>
          Otherwise:
          <list>
            <t>-> Output the current character.</t>
          </list>
          </t>
        </list>
        </t>
      </section>
      <section title="Canonicalizing a Query">
        <t>TODO: Handle the ambient encoding case.</t>

        <t>Process each character of the query in sequence:
        <list>
          <t>If the current character is among TODO:
          <list>
            <t>-> Output the current character.</t>
          </list>
          Otherwise:
          <list>
            <t>-> Output the utf8-percent-escaping of the current
            character. TODO: We need to handle the goofy query escaping
            format.</t>
          </list>
          </t>
        </list>
        </t>
      </section>
      <section title="Canonicalizing a Fragment">
        <t>Process each character of the fragment in sequence:
        <list>
          <t>If the current character has a Unicode value greater than or equal
          to U+0020:
          <list>
            <t>-> Output the current character.</t>
          </list>
          Otherwise:
          <list>
            <t>-> Output the utf8-percent-escaping of the current
            character.</t>
          </list>
          </t>
        </list>
        </t>

        <t>Note: The above algorithm results in the canonicalized fragment
        containing non-US-ASCII characters.</t>
      </section>
    </section>
  </middle>
  <back>
    <section title="Acknowledgements">
      <t>TODO</t>
    </section>
  </back>
</rfc>