<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<!--<!DOCTYPE rfc SYSTEM "rfc2629.dtd">-->
<rfc ipr="trust200902" docName="draft-abarth-url-01" category="std">
<front>
<title abbrev="How Browsers Process URLs">
Parsing URLs for Fun and Profit
</title>
<author initials="A." surname="Barth" fullname="Adam Barth">
<organization abbrev="Google, Inc.">
Google, Inc.
</organization>
<address>
<email>ietf@adambarth.com</email>
<uri>http://www.adambarth.com/</uri>
</address>
</author>
<date month="April" year="2011"/>
<workgroup>iri</workgroup>
<abstract>
<t>This document contains a precise specification of how browsers process
URLs. The behavior specified in this document might or might not match
any particular browser, but browsers might be well-served by adopting the
behavior defined herein.</t>
</abstract>
<note title="Editorial Note (To be removed by RFC Editor)">
<t>If you have suggestions for improving this document, please send
email to <eref target="mailto:public-iri@w3.org" />. Further Working
Group information is available from <eref
target="https://tools.ietf.org/wg/iri/"/>.</t>
</note>
</front>
<middle>
<section title="Open Issues">
<t>Browsers parse URLs differently depending on which operating system
they're running on. The problem is that they want to do sensible things
for file paths, but file paths look different on Windows and Unix
systems.</t>
<t>How should we handle cases where browsers disagree with the regular
expression in RFC 3986? Currently, this document aims to describe how
browsers behave, but we'll likely need to compare that to RFC 3986 at
some point. Some specific differences that have been brought up on the
mailing list:
<list style="symbols">
<t>http:///example.com/</t>
<t>http://example.com;</t>
</list>
</t>
</section>
<section title="Definitions">
<t>A control character is a character whose value is less than or equal
to U+0020 (" ").</t>
<t>A slash character is either U+???? ("/") or U+???? ("\"). TODO:
There's some question as to whether this is necessary for non-file
URLs.</t>
<t>An authority terminating character is either a slash character,
U+???? ("?"), U+???? ("#"), or U+???? (";"). TODO: Why is ";" on this
list?</t>
<t>During a parsing algorithm, the remaining string is the characters of
the input that have not yet been consumed.</t>
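<t>As a non-normative illustration, the character classes defined above
might be written as Python predicates. The function names are hypothetical,
each argument is assumed to be a single-character string, and ";" is kept
in the terminator list only because the text currently lists it.</t>
<figure>
<artwork type="python"><![CDATA[
```python
def is_control(ch: str) -> bool:
    # A control character is any value up to and including U+0020.
    return ord(ch) <= 0x20


def is_slash(ch: str) -> bool:
    # A slash character is either a forward slash or a backslash.
    return ch in ("/", "\\")


def is_authority_terminator(ch: str) -> bool:
    # A slash character, "?", "#", or ";" ends the authority.
    return is_slash(ch) or ch in ("?", "#", ";")
```
]]></artwork>
</figure>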
</section>
<section title="Parsing a URL">
<t>Given a string of characters, consume all leading and trailing control
characters.</t>
<t>Find the scheme, as described in Section ??.</t>
<t>If the algorithm for finding the scheme determines that the URL is
invalid:
<list>
<t>-> Abort these steps.</t>
</list>
</t>
<t>If the scheme is a single upper or lower case ASCII character (TODO:
Just ALPHA?):
<list>
<t>-> TODO: Windows drive specs!</t>
</list>
</t>
<t>If the scheme is an ASCII case-insensitive match for "file":
<list>
<t>-> TODO: File URLs!</t>
</list>
</t>
<t>If the scheme is an ASCII case-insensitive match for "mailto":
<list>
<t>-> TODO: I think mailto URLs are special, but more testing is
required.</t>
</list>
</t>
<t>If the scheme is hierarchical:
<list>
<t>-> In the after-scheme, if any, find the authority, path, query, and
fragment, as described in Section ??.</t>
<t>-> In the authority, if any, find the user-info, host, and port, as
described in Section ??.</t>
<t>-> In the user-info, if any, find the user name and password, as
described in Section ??.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>The remaining string is the path. TODO: This might not be the best
approach. We need to do more testing of data and javascript URLs.</t>
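<t>The first step of this section, consuming leading and trailing control
characters, could look like the following Python sketch. The function name
is hypothetical, and the sketch assumes the definition of a control
character given in the Definitions section.</t>
<figure>
<artwork type="python"><![CDATA[
```python
def trim_control_characters(url: str) -> str:
    # Drop leading and trailing characters at or below U+0020.
    start, end = 0, len(url)
    while start < end and ord(url[start]) <= 0x20:
        start += 1
    while end > start and ord(url[end - 1]) <= 0x20:
        end -= 1
    return url[start:end]
```
]]></artwork>
</figure>
<t>Python's built-in str.strip() with no arguments is not equivalent: it
keeps U+0000 and removes some characters above U+0020, such as U+00A0.</t>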
<section title="Finding the scheme">
<t>If the remaining string does not contain a ":" character:
<list>
<t>-> The URL is invalid.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>Consume characters up to, but not including, the first ":"
character. These characters are the scheme.</t>
<t>Consume the ":" character.</t>
<t>The remaining characters are the after-scheme.</t>
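<t>A minimal Python sketch of the steps above, for illustration only. The
function name and the use of None to signal an invalid URL are assumptions
of this sketch, not part of the algorithm.</t>
<figure>
<artwork type="python"><![CDATA[
```python
from typing import Optional, Tuple


def find_scheme(remaining: str) -> Optional[Tuple[str, str]]:
    # Returns (scheme, after_scheme), or None when the URL is invalid.
    colon = remaining.find(":")
    if colon == -1:
        return None  # no ":" character, so the URL is invalid
    scheme = remaining[:colon]  # up to, but not including, the first ":"
    after_scheme = remaining[colon + 1:]  # the ":" itself is consumed
    return scheme, after_scheme
```
]]></artwork>
</figure>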
</section>
<section title="Finding the authority, path, query, and fragment">
<t>Consume any number of slash characters.</t>
<t>If the remaining string does not contain any authority terminating
characters:
<list>
<t>-> The remaining string is the authority.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>Consume characters up to, but not including, the first authority
terminating character. The consumed characters are the authority.</t>
<t>If the remaining string does not contain a "?" character or a "#"
character:
<list>
<t>-> The remaining string is the path.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>Consume characters up to, but not including, the first "?" or "#"
character. The consumed characters are the path.</t>
<t>If the first character of the remaining string is a "?" character:
<list>
<t>-> Consume the "?" character.</t>
<t>-> If the remaining string does not contain a "#" character:
<list>
<t>-> The remaining string is the query.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>-> Consume characters up to, but not including, the first "#"
character. The consumed characters are the query.</t>
</list>
</t>
<t>Consume the "#" character.</t>
<t>The remaining string is the fragment.</t>
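<t>The steps above can be sketched in Python as follows. This is an
illustration under stated assumptions: the function name and the dictionary
keys are hypothetical, a component that the algorithm never assigns is
simply absent from the result, and the slash and authority terminating
characters are taken from the Definitions section.</t>
<figure>
<artwork type="python"><![CDATA[
```python
from typing import Dict

SLASHES = "/\\"
AUTHORITY_TERMINATORS = SLASHES + "?#;"


def first_index(s: str, chars: str) -> int:
    # Index of the first character of s that occurs in chars, or -1.
    for i, c in enumerate(s):
        if c in chars:
            return i
    return -1


def split_after_scheme(after_scheme: str) -> Dict[str, str]:
    parts: Dict[str, str] = {}
    # Consume any number of slash characters.
    remaining = after_scheme.lstrip(SLASHES)
    end = first_index(remaining, AUTHORITY_TERMINATORS)
    if end == -1:
        parts["authority"] = remaining
        return parts
    parts["authority"], remaining = remaining[:end], remaining[end:]
    end = first_index(remaining, "?#")
    if end == -1:
        parts["path"] = remaining
        return parts
    parts["path"], remaining = remaining[:end], remaining[end:]
    if remaining[0] == "?":
        remaining = remaining[1:]  # consume the "?"
        end = remaining.find("#")
        if end == -1:
            parts["query"] = remaining
            return parts
        parts["query"], remaining = remaining[:end], remaining[end:]
    # The remaining string now starts with "#"; consume it.
    parts["fragment"] = remaining[1:]
    return parts
```
]]></artwork>
</figure>
<t>With this sketch, the after-scheme "///example.com" yields only the
authority "example.com", and "//example.com;" yields the authority
"example.com" and the path ";".</t>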
</section>
<section title="Finding the user-info, host, and port">
<t>If the remaining string contains an "@" character:
<list>
<t>-> Consume characters up to, but not including, the *last* "@"
character. The consumed characters are the user-info.</t>
<t>-> Consume the "@" character.</t>
</list>
</t>
<t>If the remaining string does not contain a ":" character:
<list>
<t>-> The remaining string is the host.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>If the first character of the remaining string is a "[" character,
the remaining string contains a "]" character, and the last ":"
character in the remaining string occurs before the last "]" character
in the remaining string:
<list>
<t>-> The remaining string is the host.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>Consume characters up to, but not including, the last ":" character.
The consumed characters are the host.</t>
<t>Consume the ":" character.</t>
<t>The remaining string is the port.</t>
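<t>As an illustration, the steps above might be written in Python like
this. The function name and dictionary keys are hypothetical; the bracket
check mirrors the rule that keeps the colons of a bracketed host from being
mistaken for a port separator.</t>
<figure>
<artwork type="python"><![CDATA[
```python
from typing import Dict


def split_authority(authority: str) -> Dict[str, str]:
    parts: Dict[str, str] = {}
    remaining = authority
    at = remaining.rfind("@")  # the *last* "@" ends the user-info
    if at != -1:
        parts["user-info"] = remaining[:at]
        remaining = remaining[at + 1:]
    colon = remaining.rfind(":")
    if colon == -1:
        parts["host"] = remaining
        return parts
    # Bracketed host whose last ":" lies before its last "]": no port.
    # rfind returns -1 when there is no "]", so the test is then false.
    if remaining.startswith("[") and colon < remaining.rfind("]"):
        parts["host"] = remaining
        return parts
    parts["host"] = remaining[:colon]
    parts["port"] = remaining[colon + 1:]
    return parts
```
]]></artwork>
</figure>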
</section>
<section title="Finding the user name and password">
<t>If the remaining string does not contain a ":" character:
<list>
<t>-> The remaining string is the user name.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>Consume characters up to, but not including, the first ":"
character. The consumed characters are the user name.</t>
<t>Consume the ":" character.</t>
<t>The remaining string is the password.</t>
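<t>A short Python sketch of this split, for illustration. The function name
is hypothetical, and returning None for a missing password is an assumption
made so that an absent password can be told apart from an empty one.</t>
<figure>
<artwork type="python"><![CDATA[
```python
from typing import Optional, Tuple


def split_user_info(user_info: str) -> Tuple[str, Optional[str]]:
    # Split at the first ":"; everything after it is the password.
    user_name, separator, password = user_info.partition(":")
    return user_name, (password if separator else None)
```
]]></artwork>
</figure>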
</section>
</section>
<section title="Resolving a string relative to a base URL">
<t>Given a string relative-url and a ParsedURL base-url, find the scheme
of relative-url.</t>
<t>TODO: We probably need to trim leading and trailing control
characters.</t>
<t>If relative-url is an invalid URL:
<list>
<t>-> The resolved URL is relative-url resolved as a relative URL.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>If relative-url's scheme contains any characters which are not "valid scheme
characters" (TODO: Define valid scheme characters):
<list>
<t>-> The resolved URL is relative-url resolved as a relative URL.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>If base-url's scheme is an ASCII case-insensitive match for relative-url's
scheme and the shared scheme is hierarchical:
<list>
<t>-> The resolved URL is relative-url's after-scheme resolved as a relative
URL.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>The resolved URL is relative-url parsed as an absolute URL.</t>
<section title="Resolving a string as a relative URL">
<t>Given a string relative-url and a ParsedURL base-url, determine the
resolved URL as follows:</t>
<t>TODO: If base-url's scheme is not hierarchical, we can't resolve as
a relative URL. We'll probably want to return an invalid URL. Check
what happens when resolving an empty string as a relative URL with a
non-hierarchical base.</t>
<t>If relative-url is empty:
<list>
<t>-> The resolved URL is identical to base-url, with the fragment,
if any, removed.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>If the first character of relative-url is a slash character:
<list>
<t>-> If relative-url has at least two characters and the second character is
also a slash character:
<list>
<t>-> The resolved URL is relative-url resolved as a scheme-relative URL.</t>
</list>
Otherwise:
<list>
<t>-> The resolved URL is relative-url resolved as an authority-relative URL.</t>
</list>
</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>If the first character of relative-url is a "?" character:
<list>
<t>-> The resolved URL is relative-url resolved as a query-relative URL.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>If the first character of relative-url is a "#" character:
<list>
<t>-> The resolved URL is relative-url resolved as a fragment-relative
URL.</t>
<t>-> Abort these steps.</t>
</list>
</t>
<t>TODO: Think about the case where the relative-url is empty.</t>
<t>The resolved URL is relative-url resolved as a path-relative URL.</t>
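<t>The dispatch above depends only on the first one or two characters of
relative-url. A Python sketch of that classification follows; the function
name and the returned labels are hypothetical, and "empty" stands for the
case in which the resolved URL is base-url without its fragment.</t>
<figure>
<artwork type="python"><![CDATA[
```python
SLASHES = "/\\"


def classify_relative(relative_url: str) -> str:
    if relative_url == "":
        return "empty"
    first = relative_url[0]
    if first in SLASHES:
        # Two leading slash characters select scheme-relative resolution.
        if len(relative_url) >= 2 and relative_url[1] in SLASHES:
            return "scheme-relative"
        return "authority-relative"
    if first == "?":
        return "query-relative"
    if first == "#":
        return "fragment-relative"
    return "path-relative"
```
]]></artwork>
</figure>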
</section>
<section title="Resolving a string as a scheme-relative URL">
<t>Given a string relative-url and a ParsedURL base-url, let
resolved-url be
<list style="symbols">
<t>base-url's scheme</t>
<t>concatenated with ":",</t>
<t>concatenated with relative-url.</t>
</list>
</t>
<t>The resolved URL is resolved-url parsed as an absolute URL.</t>
</section>
<section title="Resolving a string as an authority-relative URL">
<t>Given a string relative-url and a ParsedURL base-url, let
resolved-url be
<list style="symbols">
<t>base-url's scheme</t>
<t>concatenated with "://",</t>
<t>concatenated with base-url's authority,</t>
<t>concatenated with relative-url.</t>
</list>
</t>
<t>The resolved URL is resolved-url parsed as an absolute URL.</t>
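<t>The concatenations for scheme-relative and authority-relative
resolution can be sketched in Python as below. The ParsedURL tuple and the
function names are hypothetical stand-ins, and the final step of parsing
the result as an absolute URL is left out.</t>
<figure>
<artwork type="python"><![CDATA[
```python
from typing import NamedTuple


class ParsedURL(NamedTuple):
    # Hypothetical minimal stand-in for a parsed base URL.
    scheme: str
    authority: str


def scheme_relative(relative_url: str, base: ParsedURL) -> str:
    # relative_url still carries its leading slash characters.
    return base.scheme + ":" + relative_url


def authority_relative(relative_url: str, base: ParsedURL) -> str:
    return base.scheme + "://" + base.authority + relative_url
```
]]></artwork>
</figure>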
</section>
<section title="Resolving a string as a path-relative URL">
<t>TODO: Can the first character of relative-url be a slash character
at this point?</t>
<t>TODO: Can we assume base-url is canonicalized here so that it always
has at least one "/" character?</t>
<t>Let the directory-name be the characters of the base-url's path up
to and including the last slash character.</t>
<t>Let resolved-url be
<list style="symbols">
<t>base-url's scheme</t>
<t>concatenated with "://",</t>
<t>concatenated with base-url's authority,</t>
<t>concatenated with directory-name,</t>
<t>concatenated with relative-url.</t>
</list>
</t>
<t>The resolved URL is resolved-url parsed as an absolute URL.</t>
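<t>A Python sketch of path-relative resolution, for illustration. The
ParsedURL tuple and function name are hypothetical. When the base path
contains no slash character, this sketch uses an empty directory-name,
which is an assumption; the text leaves that case as an open question.</t>
<figure>
<artwork type="python"><![CDATA[
```python
from typing import NamedTuple


class ParsedURL(NamedTuple):
    # Hypothetical minimal stand-in for a parsed base URL.
    scheme: str
    authority: str
    path: str


def path_relative(relative_url: str, base: ParsedURL) -> str:
    # Position of the last slash character ("/" or "\") in the base path.
    last_slash = max(base.path.rfind("/"), base.path.rfind("\\"))
    # Up to and including that slash; empty when there is none.
    directory_name = base.path[:last_slash + 1]
    return (base.scheme + "://" + base.authority
            + directory_name + relative_url)
```
]]></artwork>
</figure>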
</section>
<section title="Resolving a string as a query-relative URL">
<t>Given a string relative-url and a ParsedURL base-url, let
resolved-url be
<list style="symbols">
<t>base-url's scheme</t>
<t>concatenated with "://",</t>
<t>concatenated with base-url's authority,</t>
<t>concatenated with base-url's path,</t>
<t>concatenated with relative-url.</t>
</list>
</t>
<t>The resolved URL is resolved-url parsed as an absolute URL.</t>
</section>
<section title="Resolving a string as a fragment-relative URL">
<t>Given a string relative-url and a ParsedURL base-url, let
resolved-url be
<list style="symbols">
<t>base-url's scheme</t>
<t>concatenated with "://",</t>
<t>concatenated with base-url's authority,</t>
<t>concatenated with base-url's path,</t>
<t>concatenated with "?",</t>
<t>concatenated with base-url's query,</t>
<t>concatenated with relative-url.</t>
</list>
</t>
<t>The resolved URL is resolved-url parsed as an absolute URL.</t>
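<t>The query-relative and fragment-relative concatenations might look like
this in Python. The ParsedURL tuple and function names are hypothetical.
The sketch follows the lists literally, so the fragment-relative case
always emits "?" even when the base query is empty.</t>
<figure>
<artwork type="python"><![CDATA[
```python
from typing import NamedTuple


class ParsedURL(NamedTuple):
    # Hypothetical minimal stand-in for a parsed base URL.
    scheme: str
    authority: str
    path: str
    query: str


def query_relative(relative_url: str, base: ParsedURL) -> str:
    # relative_url starts with "?" and replaces the base query.
    return base.scheme + "://" + base.authority + base.path + relative_url


def fragment_relative(relative_url: str, base: ParsedURL) -> str:
    # relative_url starts with "#"; the base query is kept.
    return (base.scheme + "://" + base.authority + base.path
            + "?" + base.query + relative_url)
```
]]></artwork>
</figure>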
</section>
</section>
<section title="Canonicalizing a URL">
<t>This section describes how to construct a canonical version of a
parsed URL string. TODO: We probably should mention somewhere that there
is *not* a unique canonicalization for every URL.</t>
<t>Given parsed URL original-url, if original-url is invalid:
<list>
<t>-> Abort these steps.</t>
</list>
</t>
<t>TODO: Handle file URLs.</t>
<t>If the scheme is hierarchical:
<list>
<t>Output the canonicalized scheme (as described in Section ??).</t>
<t>Output "://".</t>
<t>If the user-info is non-empty:
<list>
<t>Output the canonicalized user-info (as described in Section ??).</t>
<t>Output "@".</t>
</list>
</t>
<t>Output the canonicalized host (as described in Section ??).</t>
<t>Let the canonicalized-port be the canonicalized port (as described
in Section ??).</t>
<t>If the canonicalized-port is non-empty and is not the default port
for the scheme:
<list>
<t>Output ":".</t>
<t>Output the canonicalized-port.</t>
</list>
</t>
<t>Output the canonicalized path (as described in Section ??).</t>
<t>Let the canonicalized-query be the canonicalized query (as described
in Section ??).</t>
<t>If the canonicalized-query is non-empty (TODO: Distinguish between
empty and non-existent queries):
<list>
<t>Output "?".</t>
<t>Output the canonicalized-query.</t>
</list>
</t>
<t>Let the canonicalized-fragment be the canonicalized fragment (as
described in Section ??).</t>
<t>If the canonicalized-fragment is non-empty (TODO: Distinguish
between empty and non-existent fragments):
<list>
<t>Output "#".</t>
<t>Output the canonicalized-fragment.</t>
</list>
</t>
</list>
</t>
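<t>The output order for a hierarchical URL can be sketched in Python as
follows. The function assumes every component has already been
canonicalized and that a missing component is passed as an empty string.
The function name and the DEFAULT_PORTS table are assumptions of this
sketch; this document does not define the set of default ports.</t>
<figure>
<artwork type="python"><![CDATA[
```python
# Assumed default ports, for illustration only.
DEFAULT_PORTS = {"http": "80", "https": "443", "ftp": "21"}


def assemble(scheme: str, user_info: str, host: str, port: str,
             path: str, query: str, fragment: str) -> str:
    out = [scheme, "://"]
    if user_info:
        out += [user_info, "@"]
    out.append(host)
    # Omit an empty port and the default port for the scheme.
    if port and port != DEFAULT_PORTS.get(scheme):
        out += [":", port]
    out.append(path)
    if query:
        out += ["?", query]
    if fragment:
        out += ["#", fragment]
    return "".join(out)
```
]]></artwork>
</figure>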
<section title="Canonicalizing a Scheme">
<t>If the first character of the scheme is not in ALPHA, the scheme is
invalid.</t>
<t>Process each character of the scheme in sequence:
<list>
<t>If the current character is among ALPHA, DIGIT, "+", "-", and ".":
<list>
<t>-> Output the current character.</t>
</list>
Otherwise, if the current character is "%":
<list>
<t>-> The scheme is invalid.</t>
<t>-> Output the current character.</t>
</list>
Otherwise:
<list>
<t>-> The scheme is invalid.</t>
<t>-> Output the utf8-percent-escaping of the current
character.</t>
</list>
</t>
</list>
</t>
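<t>An illustrative Python sketch of scheme canonicalization. Several points
are assumptions of the sketch rather than statements of this document: the
function names, returning validity as a flag while still producing output,
treating an empty scheme as invalid, and writing the
utf8-percent-escaping as "%" followed by two uppercase hexadecimal digits
per UTF-8 byte.</t>
<figure>
<artwork type="python"><![CDATA[
```python
import string
from typing import Tuple

ALPHA = string.ascii_letters
DIGIT = string.digits


def utf8_percent_escape(ch: str) -> str:
    # Assumed form: "%XX" for each byte of the UTF-8 encoding.
    return "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))


def canonicalize_scheme(scheme: str) -> Tuple[str, bool]:
    # Returns (output, valid).
    valid = bool(scheme) and scheme[0] in ALPHA
    out = []
    for ch in scheme:
        if ch in ALPHA or ch in DIGIT or ch in "+-.":
            out.append(ch)
        elif ch == "%":
            valid = False
            out.append(ch)  # "%" is output unchanged
        else:
            valid = False
            out.append(utf8_percent_escape(ch))
    return "".join(out), valid
```
]]></artwork>
</figure>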
</section>
<section title="Canonicalizing a User-Info">
<t>Process each character of the user name in sequence:
<list>
<t>If the current character is among TODO:
<list>
<t>-> Output the current character.</t>
</list>
Otherwise:
<list>
<t>-> Output the utf8-percent-escaping of the current
character.</t>
</list>
</t>
</list>
</t>
<t>If there is no password or if the password is empty:
<list>
<t>-> Abort these steps.</t>
</list>
</t>
<t>Output ":".</t>
<t>Process each character of the password in sequence:
<list>
<t>If the current character is among TODO:
<list>
<t>-> Output the current character.</t>
</list>
Otherwise:
<list>
<t>-> Output the utf8-percent-escaping of the current
character.</t>
</list>
</t>
</list>
</t>
</section>
<section title="Canonicalizing a Host">
<t>TODO: Handle IP addresses.</t>
<t>Let unicode-host be the host-escape-normalized host (see Section ??).</t>
<t>Output the result of applying the IDNA to-ascii algorithm to the
unicode-host. TODO: Properly reference IDNA's to-ascii algorithm (we
might need a wrapper like we do in the cookie spec).</t>
<section title="Host Escape Normalization">
<figure>
<artwork type="abnf">
host-escaped = U+0000-U+002A / U+002C / U+002F / U+003B-U+0040 / U+005C /
U+005E / U+0060 / U+007B-U+007F
</artwork>
</figure>
<t>Process each character of the host in sequence:
<list>
<t>If the current character is "%":
<list>
<t>-> TODO: Handle percent-unescaping.</t>
</list>
</t>
<t>If the current character matches host-escaped:
<list>
<t>-> Output the utf8-percent-escaping of the current
character.</t>
</list>
Otherwise, if the current character matches ALPHA:
<list>
<t>-> Output the current character converted to lower case.</t>
</list>
Otherwise:
<list>
<t>-> Output the current character.</t>
</list>
</t>
</list>
</t>
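<t>A Python sketch of host escape normalization, for illustration. The
function names and the "%XX" uppercase form of the utf8-percent-escaping
are assumptions. Because percent-unescaping is still a TODO above, the
sketch refuses hosts that contain "%" instead of guessing at the intended
behavior.</t>
<figure>
<artwork type="python"><![CDATA[
```python
import string

# The host-escaped set from the ABNF above, as inclusive ranges.
HOST_ESCAPED = [(0x00, 0x2A), (0x2C, 0x2C), (0x2F, 0x2F), (0x3B, 0x40),
                (0x5C, 0x5C), (0x5E, 0x5E), (0x60, 0x60), (0x7B, 0x7F)]


def utf8_percent_escape(ch: str) -> str:
    # Assumed form: "%XX" for each byte of the UTF-8 encoding.
    return "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))


def is_host_escaped(ch: str) -> bool:
    return any(lo <= ord(ch) <= hi for lo, hi in HOST_ESCAPED)


def host_escape_normalize(host: str) -> str:
    out = []
    for ch in host:
        if ch == "%":
            raise NotImplementedError("percent-unescaping is a TODO")
        if is_host_escaped(ch):
            out.append(utf8_percent_escape(ch))
        elif ch in string.ascii_letters:
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)
```
]]></artwork>
</figure>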
</section>
</section>
<section title="Canonicalizing a Path">
<t>TODO: Do we need to ensure that paths always start with a slash
character?</t>
<t>If the path is empty:
<list>
<t>-> Output "/" and abort these steps.</t>
</list>
</t>
<figure>
<artwork type="abnf">
path-escaped = U+0000-U+0020 / U+0022-U+0023 / U+0025 / U+003C / U+003E /
U+003F / U+005C / U+005E / U+0060 / U+007B-U+007D / U+007F
path-unescaped = "-" / DIGIT / ALPHA / "_" / "~"
</artwork>
</figure>
<t>Process each character of the path in sequence:
<list>
<t>If the current character matches path-escaped or is greater
than or equal to U+0080:
<list>
<t>-> Output the utf8-percent-escaping of the current
character.</t>
</list>
Otherwise, if the current character is ".":
<list>
<t>-> TODO: Handle "." collapsing.</t>
</list>
Otherwise, if the current character is "\":
<list>
<t>-> Output "/".</t>
</list>
Otherwise, if the current character is "%":
<list>
<t>-> TODO: Handle percent-unescaping.</t>
</list>
Otherwise:
<list>
<t>-> Output the current character.</t>
</list>
</t>
</list>
</t>
</section>
<section title="Canonicalizing a Query">
<t>TODO: Handle the ambient encoding case.</t>
<t>Process each character of the query in sequence:
<list>
<t>If the current character is among TODO:
<list>
<t>-> Output the current character.</t>
</list>
Otherwise:
<list>
<t>-> Output the utf8-percent-escaping of the current
character. TODO: We need to handle the goofy query escaping
format.</t>
</list>
</t>
</list>
</t>
</section>
<section title="Canonicalizing a Fragment">
<t>Process each character of the fragment in sequence:
<list>
<t>If the current character has a Unicode value greater than or equal
to U+0020:
<list>
<t>-> Output the current character.</t>
</list>
Otherwise:
<list>
<t>-> Output the utf8-percent-escaping of the current
character.</t>
</list>
</t>
</list>
</t>
<t>Note: The above algorithm results in the canonicalized fragment
containing non-US-ASCII characters.</t>
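<t>Fragment canonicalization can be sketched in Python as below. The
function names and the "%XX" uppercase form of the utf8-percent-escaping
are assumptions of this sketch.</t>
<figure>
<artwork type="python"><![CDATA[
```python
def utf8_percent_escape(ch: str) -> str:
    # Assumed form: "%XX" for each byte of the UTF-8 encoding.
    return "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))


def canonicalize_fragment(fragment: str) -> str:
    # Characters at or above U+0020 pass through unchanged,
    # including non-US-ASCII characters; lower values are escaped.
    return "".join(
        ch if ord(ch) >= 0x20 else utf8_percent_escape(ch)
        for ch in fragment
    )
```
]]></artwork>
</figure>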
</section>
</section>
</middle>
<back>
<section title="Acknowledgements">
<t>TODO</t>
</section>
</back>
</rfc>