|
| 1 | +// |
| 2 | +// Copyright (c) 2023 Alan de Freitas (alandefreitas@gmail.com) |
| 3 | +// |
| 4 | +// Distributed under the Boost Software License, Version 1.0. (See accompanying |
| 5 | +// file LICENSE_1_0.txt or copy at https://www.boost.org/LICENSE_1_0.txt) |
| 6 | +// |
| 7 | +// Official repository: https://github.com/boostorg/url |
| 8 | +// |
| 9 | + |
| 10 | += Design Rationale |
| 11 | +:navtitle: Design Rationale |
| 12 | + |
| 13 | +This section documents the rationale behind design decisions in Boost.URL that are not obvious from the API alone. |
| 14 | +For a general overview of the library's goals and features, see the xref:index.adoc[introduction]. |
| 15 | + |
| 16 | +== Character Type |
| 17 | + |
| 18 | +Boost.URL uses `char` as its character type. |
| 19 | +The library does not provide class templates parameterized on character type (e.g. `basic_url_view<CharT>`). |
| 20 | + |
| 21 | +URLs are sequences of ASCII octets as defined by https://tools.ietf.org/html/rfc3986[RFC 3986,window=blank_]. |
| 22 | +In practice, URLs are almost always handled as byte-oriented strings: in HTTP headers, in JSON, in configuration files, and in the URL libraries of most major programming languages. |
| 23 | +Wide character types (`wchar_t`, `char16_t`, `char32_t`) are rarely used for URLs outside a few platform-specific APIs, so supporting them would add complexity with little practical benefit. |
| 24 | + |
| 25 | +This also means the library does not provide a `char8_t` (C++20) instantiation. |
| 26 | +While `char8_t` is portably correct for ASCII/UTF-8 text, its adoption in the C++ ecosystem remains limited: the standard library does not fully support it for I/O or formatting, and no major framework has adopted it in public APIs. |
| 27 | +Using `char` means Boost.URL interoperates directly with `std::string`, `std::string_view`, string literals, and the rest of the ecosystem without conversion. |
| 28 | + |
| 29 | +=== EBCDIC |
| 30 | + |
| 31 | +The C++ standard does not require that `char` use an ASCII-compatible encoding. |
| 32 | +On EBCDIC platforms (primarily IBM z/OS), the character literal `'/'` does not have the value `0x2F`, so a URL parser that compares `char` values against ASCII constants would malfunction. |
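The failure mode described above can be sketched in a few lines. This is an illustration only, not Boost.URL code: `literals_are_ascii`, `is_ascii_alpha`, and `count_slashes` are hypothetical names. The helpers classify octets by numeric ASCII value, which is what RFC 3986 specifies, and the constant records the assumption that stops holding under an EBCDIC literal encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helpers for illustration; these are not Boost.URL APIs.

// True when the compiler's ordinary literal encoding agrees with ASCII for
// the characters checked here. On an EBCDIC build without an ASCII
// compilation mode this evaluates to false.
constexpr bool literals_are_ascii =
    '/' == 0x2F && ':' == 0x3A && 'A' == 0x41 && 'a' == 0x61;

// Classifies an octet by its numeric ASCII value, so the result does not
// depend on how the compiler encodes character literals.
constexpr bool is_ascii_alpha(unsigned char c) noexcept
{
    return (c >= 0x41 && c <= 0x5A) || (c >= 0x61 && c <= 0x7A);
}

// Counts path separators by comparing against the octet 0x2F rather than
// the literal '/'.
constexpr std::size_t count_slashes(std::string_view s) noexcept
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if (c == 0x2F)
            ++n;
    return n;
}
```

A parser written against literals such as `'/'` gives the same results as these helpers only when `literals_are_ascii` is true, which is the case on every mainstream toolchain and on z/OS in ASCII mode.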
| 33 | + |
| 34 | +In practice, this is not a concern for Boost.URL: |
| 35 | + |
| 36 | +* z/OS is the only remaining platform where EBCDIC is relevant for C++ compilation. |
| 37 | +* The z/OS C++ compilers support an ASCII compilation mode (`-qascii` or `-fzos-le-char-mode=ascii`) that makes `char` literals use ASCII values. This mode exists specifically for open-source software that assumes ASCII. |
| 38 | +* Real-world C++ libraries that handle URLs and HTTP on z/OS (such as cpp-httplib and DuckDB) use this ASCII mode rather than adding EBCDIC transcoding. |
| 39 | +* The z/OS REST and web services ecosystem is almost entirely Java-based. No evidence was found of C++ code parsing RFC 3986 URIs in EBCDIC `char` encoding. |
| 40 | +* WG21 is moving in this direction as well: P3688 (ASCII character utilities) proposes `char`-based functions that treat input as ASCII regardless of literal encoding. |
| 41 | + |
| 42 | +On EBCDIC platforms where ASCII mode is not used, `char8_t` provides a portably correct alternative since it is guaranteed to use UTF-8 (an ASCII superset). |
| 43 | +A future extension to support `char8_t` constructor overloads on the concrete `char`-based types could address this without requiring templates, since both `char` and `char8_t` are single-byte types and the conversion between them is trivial for ASCII content. |
| 44 | + |
| 45 | +== No Dynamic Allocation by Default |
| 46 | + |
| 47 | +The library is designed so that most operations do not require dynamic memory allocation. |
| 48 | + |
| 49 | +cpp:url_view[] does not retain ownership of the underlying string buffer and does not allocate memory. |
| 50 | +Like a cpp:string_view[], it references the original string directly. |
| 51 | +As long as the contents of the original string are unmodified, constructed URL views always contain a valid URL in its correctly serialized form. |
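The non-owning idea can be sketched with the standard library alone. The helper `scheme_of` below is hypothetical and is not how the library's URL view is implemented; it only shows that a component can be returned as a view into the caller's buffer without copying or allocating.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper for illustration; not part of Boost.URL.
// Returns the text before the first ':' as a view into the caller's buffer,
// or an empty view when there is no ':'. Nothing is copied or allocated.
// The scheme grammar is not validated here.
inline std::string_view scheme_of(std::string_view url) noexcept
{
    auto const pos = url.find(':');
    if (pos == std::string_view::npos)
        return {};
    return url.substr(0, pos);
}

// Usage sketch: the returned view aliases the original string's storage.
inline void scheme_of_example()
{
    std::string const s = "https://example.com/path";
    std::string_view const scheme = scheme_of(s);
    assert(scheme == "https");
    assert(scheme.data() == s.data()); // same memory, no copy
}
```

As with any view, the result is valid only while the original string is alive and unmodified.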
| 52 | + |
| 53 | +Accessor functions return views referring to substrings and sub-ranges of the underlying URL. |
| 54 | +By referencing the relevant portion of the URL string internally, components can represent percent-decoded strings and be converted to other types without allocation. |
| 55 | +cpp:decode_view[] and its decoding functions perform no memory allocations unless the result needs to be stored in another container. |
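One way to picture allocation-free decoding is a comparison that decodes one octet at a time. The sketch below is an assumption-laden illustration, not the library's decoding code: `hex_value` and `decoded_equals` are hypothetical names, and the digit checks assume an ASCII-compatible literal encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helpers for illustration; these are not Boost.URL APIs.

// Converts one hexadecimal digit to its value, or returns -1 if the
// character is not a hexadecimal digit.
constexpr int hex_value(char c) noexcept
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Compares a percent-encoded string with a plain string by decoding one
// octet at a time. No buffer for the decoded text is allocated.
// Returns false for malformed escapes such as "%4" or "%zz".
constexpr bool decoded_equals(std::string_view encoded,
                              std::string_view plain) noexcept
{
    std::size_t i = 0; // position in `encoded`
    std::size_t j = 0; // position in `plain`
    while (i < encoded.size())
    {
        char c = encoded[i];
        if (c == '%')
        {
            if (encoded.size() - i < 3)
                return false; // truncated escape
            int const hi = hex_value(encoded[i + 1]);
            int const lo = hex_value(encoded[i + 2]);
            if (hi < 0 || lo < 0)
                return false; // not hexadecimal
            c = static_cast<char>(hi * 16 + lo);
            i += 3;
        }
        else
        {
            ++i;
        }
        if (j == plain.size() || plain[j] != c)
            return false;
        ++j;
    }
    return j == plain.size();
}
```

A caller that needs the decoded text as an owning string still pays for one allocation, but comparisons and lookups like this one do not.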
| 56 | +Objects can be recycled to reuse their memory, deferring allocations until the application actually needs them. |
| 57 | + |
| 58 | +This makes the library suitable for performance-sensitive network programs and embedded devices. |
| 59 | + |
| 60 | +== Error Handling |
| 61 | + |
| 62 | +The library uses error codes rather than exceptions as its primary error reporting mechanism. |
| 63 | +If input does not match the URL grammar, an error code is reported through cpp:result[] rather than throwing. |
| 64 | +This allows the library to be used in environments that disable exceptions (for example, with `-fno-exceptions`); the library detects this configuration automatically. |
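The error-code style can be sketched without the library. The type `port_result` and the function `parse_port` below are hypothetical stand-ins written for this illustration; they are not the library's result type or parsing functions. The point is only that failure is reported through a value the caller inspects, so nothing is thrown.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <string_view>
#include <system_error>

// Hypothetical result-like type for illustration; not the library's own
// result type.
struct port_result
{
    std::uint16_t value = 0;
    std::error_code ec; // default-constructed means success

    explicit operator bool() const noexcept { return !ec; }
};

// Parses a decimal port number without throwing. Any failure is reported
// through the `ec` member of the returned object.
inline port_result parse_port(std::string_view s) noexcept
{
    port_result r;
    std::uint16_t parsed = 0;
    char const* const first = s.data();
    char const* const last = first + s.size();
    auto const [ptr, err] = std::from_chars(first, last, parsed);
    if (err != std::errc())
        r.ec = std::make_error_code(err); // no digits, or out of range
    else if (ptr != last)
        r.ec = std::make_error_code(std::errc::invalid_argument); // trailing text
    else
        r.value = parsed;
    return r;
}
```

A caller would typically write `if (auto r = parse_port(text)) use(r.value);` and handle `r.ec` otherwise, which works the same whether or not exceptions are enabled.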
| 65 | + |
| 66 | +== URL Validity Invariant |
| 67 | + |
| 68 | +All modifications to a cpp:url[] leave it in a valid state. |
| 69 | +It is not possible for a cpp:url[] to hold syntactically illegal text. |
| 70 | +All modifying functions perform validation on their input: attempting to set the scheme or port to an invalid string results in an exception, while other components are automatically percent-encoded as needed. |
| 71 | +All non-const operations offer the strong exception safety guarantee. |
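The validate-then-commit pattern behind such a guarantee can be sketched as follows. The class `tiny_url` is hypothetical and greatly simplified; it is not the library's implementation, and its character checks assume an ASCII-compatible literal encoding. It shows one common way to keep an object valid and unchanged when a modification fails.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical class for illustration; not Boost.URL's implementation.
class tiny_url
{
    std::string scheme_ = "http";
    std::string rest_ = "//example.com/";

    // scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) per RFC 3986.
    static bool is_valid_scheme(std::string_view s) noexcept
    {
        auto const alpha = [](char c) {
            return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        };
        auto const digit = [](char c) { return c >= '0' && c <= '9'; };
        if (s.empty() || !alpha(s[0]))
            return false;
        for (char c : s)
            if (!alpha(c) && !digit(c) && c != '+' && c != '-' && c != '.')
                return false;
        return true;
    }

public:
    // Validates first, builds the new value second, commits last.
    // If validation or allocation throws, *this is left unchanged.
    void set_scheme(std::string_view s)
    {
        if (!is_valid_scheme(s))
            throw std::invalid_argument("invalid scheme");
        std::string next(s); // may throw std::bad_alloc; *this untouched
        scheme_.swap(next);  // non-throwing commit
    }

    std::string str() const { return scheme_ + ":" + rest_; }
};
```

Because every step that can fail happens before the swap, a rejected input such as `"1bad"` leaves the object holding its previous, valid text.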
| 72 | + |
| 73 | +== No IRIs |
| 74 | + |
| 75 | +The library does not handle https://www.rfc-editor.org/rfc/rfc3987.html[Internationalized Resource Identifiers,window=blank_] (IRIs). |
| 76 | +IRIs are different from URLs: they are built from Unicode characters rather than the restricted ASCII repertoire that URLs allow, and they are covered by a separate specification. |