Skip to content

Latest commit

 

History

History
226 lines (155 loc) · 13.3 KB

0243-codepoint-and-character-literals.md

File metadata and controls

226 lines (155 loc) · 13.3 KB

Integer-convertible character literals

Introduction

Swift’s String type is designed for Unicode correctness and abstracts away the underlying binary representation of the string to model it as a Collection of grapheme clusters. This is an appropriate string model for human-readable text, as to a human reader, the atomic unit of a string is (usually) the extended grapheme cluster. When treated this way, many logical string operations “just work” the way users expect.

However, it is also common in programming to need to express values which are intrinsically numeric, but have textual meaning, when taken as an ASCII value. We propose adding a new literal syntax takes single-quotes ('), and is transparently convertible to Swift’s integer types. This syntax, but not the behavior, will extend to all “single element” text literals, up to and including Character, and will become the preferred literal syntax these types.

Motivation

A pain point of using characters in Swift is they lack a first-class literal syntax. Users have to manually coerce string literals to a Character or Unicode.Scalar type using as Character or as Unicode.Scalar, respectively. Having the collection share the same syntax as its element also harms code clarity and makes it difficult to tell if a double-quoted literal is being used as a string or character in some cases.

Additional challenges arise when using ASCII scalars in Swift. Swift currently provides no static mechanism to assert that a unicode scalar literal is restricted to the ASCII range, and lacks a readable literal syntax for such values as well. In C, 'a' is a uint8_t literal, equivalent to 97. Swift has no such equivalent, requiring awkward spellings like UInt8(ascii: "a"), or spelling out the values in hex or decimal directly. This harms readability of code, and makes bytestring processing in Swift painful.

static char const hexcodes[16] = {
    '0', '1', '2', '3', '4' ,'5', '6', '7', '8', '9', 
    'a', 'b', 'c', 'd', 'e', 'f'
};
let hexcodes = [
    UInt8(ascii: "0"), UInt8(ascii: "1"), UInt8(ascii: "2"), UInt8(ascii: "3"),
    UInt8(ascii: "4"), UInt8(ascii: "5"), UInt8(ascii: "6"), UInt8(ascii: "7"),
    UInt8(ascii: "8"), UInt8(ascii: "9"), UInt8(ascii: "a"), UInt8(ascii: "b"),
    UInt8(ascii: "c"), UInt8(ascii: "d"), UInt8(ascii: "e"), UInt8(ascii: "f")
]    

Higher-level constructs can regain some readability,

let hexcodes = [
    "0", "1", "2", "3",
    "4", "5", "6", "7",
    "8", "9", "a", "b",
    "c", "d", "e", "f"
].map{ UInt8(ascii: $0) }

but may not be familiar to all users, and can come at a runtime cost.

In addition, the init(ascii:) initializer only exists on UInt8. If you're working with other types like Int8 (common when dealing with C APIs that take char), it is much more awkward. Consider scanning through a char* buffer as an UnsafeBufferPointer<Int8>:

for scalar in int8buffer {
    switch scalar {
    case Int8(UInt8(ascii: "a")) ... Int8(UInt8(ascii: "f")):
        // lowercase hex letter
    case Int8(UInt8(ascii: "A")) ... Int8(UInt8(ascii: "F")):
        // uppercase hex letter
    case Int8(UInt8(ascii: "0")) ... Int8(UInt8(ascii: "9")):
        // hex digit
    default:
        // something else
    }
}

Transforming Unicode.Scalar literals also sacrifices compile-time guarantees. The statement let char: UInt8 = 1989 is a compile time error, whereas let char: UInt8 = .init(ascii: "߅") is a run time error.

ASCII scalars are inherently textual, so it should be possible to express them with a textual literal directly. Just as applying the String APIs runs counter to Swift’s stated design goals of safety and efficiency, requiring users to express basic data values in such a verbose way runs counter to our design goal of expressiveness.

Integer character literals would provide benefits to String users. One of the future directions for String is to provide performance-sensitive or low-level users with direct access to code units. Having numeric character literals for use with this API is highly motivating. Furthermore, improving Swift’s bytestring ergonomics is an important part of our long term goal of expanding into embedded platforms.

Proposed solution

The most straightforward solution is to conform Swift’s integer types to ExpressibleByUnicodeScalarLiteral. Due to ABI constraints, it is not currently possible to add this conformance, so we will add the conformance implementations to the standard library, and allow users to “enable” to the feature by declaring this conformance in user code, for example:

extension Int8: ExpressibleByUnicodeScalarLiteral { }

Once the Swift ABI supports retroactive conformances, this conformance can be declared in the standard library, making it available by default.

These integer conversions will only be valid for the ASCII range U+0 ..< U+128; unicode scalar literals outside of that range will be invalid and will generate compile-time errors similar to the way we currently diagnose overflowing integer literals. This is a conservative approach, as allowing transparent unicode conversion to integer types carries encoding pitfalls users may not anticipate or easily understand.

Because it is currently possible to call literal initializers at run-time, a run-time precondition failure will occur if a non-ASCII value is passed to the integer initializer. (We expect the compiler to elide this precondition check for “normal” invocations.)

let u: Unicode.Scalar = '\u{FF}'
let i1: Int = '\u{FF}'                          // compile-time error
let i2: Int = .init(unicodeScalarLiteral: u)    // run-time error
ExpressibleBy UnicodeScalarLiteral ExtendedGraphemeClusterLiteral StringLiteral
UInt8:, … , Int: yes* (initially opt-in) no no
Unicode.Scalar: yes no no
Character: yes (inherited) yes no
String: yes yes yes
StaticString: yes yes yes

Cells marked with an asterisk * indicate behavior that is different from the current language behavior.

The ASCII range restriction will only apply to single-quote literals coerced to a Unicode.Scalar and (either statically or dynamically) converted to an integer type. Any valid Unicode.Scalar can be written as a single-quoted unicode scalar literal, and any valid Character can be written as a single-quoted character literal.

'a' 'é' 'β' '𓀎' '👩‍✈️' "ab"
:String "a" "é" "β" "𓀎" "👩‍✈️" "ab"
:Character 'a' 'é' 'β' '𓀎' '👩‍✈️'
:Unicode.Scalar U+0061 U+00E9 U+03B2 U+1300E
:UInt32 97
:UInt16 97
:UInt8 97
:Int8 97

With these changes, the hex code example can be written much more naturally:

let hexcodes: [UInt8] = [
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 
    'a', 'b', 'c', 'd', 'e', 'f'
]

for scalar in int8buffer {
    switch scalar {
    case 'a' ... 'f':
        // lowercase hex letter
    case 'A' ... 'F':
        // uppercase hex letter
    case '0' ... '9':
        // hex digit
    default:
        // something else
    }
}

Choice of single quotes

We propose to adopt the 'x' syntax for all textual literal types up to and including ExtendedGraphemeClusterLiteral, but not including StringLiteral. These literals will be used to express integer types, Character, Unicode.Scalar, and types like UTF16.CodeUnit in the standard library.

The default inferred literal type for let x = 'a' will be Character, following the principle of least surprise. This also allows for a natural user-side syntax for differentiating methods overloaded on both Character and String.

Use of single quotes for character/scalar literals is precedented in other languages, including C, Objective-C, C++, Java, and Rust, although different languages have slightly differing ideas about what a “character” is. We choose to use the single quote syntax specifically because it reinforces the notion that strings and character values are different: the former is a sequence, the later is an element. Character types also don't support string literal interpolation, which is another reason to move away from double quotes.

Single quotes in Swift, a historical perspective

In Swift 1.0, single quotes were reserved for some yet-to-be determined syntactical purpose. Since then, pretty much all of the things that might have used single quotes have already found homes in other parts of the Swift syntactical space:

  • syntax for multi-line string literals uses triple quotes (""")

  • string interpolation syntax uses standard double quote syntax.

  • raw-mode string literals settled into the #""# syntax.

  • In current discussions around regex literals, most people seem to prefer slashes (/).

Given that, and the desire for lightweight syntax for single chararcter syntax, and the precedent in other languages for characters, it is natural to use single quotes for this purpose.

Existing double quote initializers for characters

We propose deprecating the double quote literal form for Character and Unicode.Scalar types and slowly migrating them out of Swift.

let c2 = 'f'               // preferred
let c1: Character = "f"   // deprecated

Detailed Design

The only standard library changes will be to extend FixedWidthInteger to have the init(unicodeScalarLiteral:) initializer required by ExpressibleByUnicodeScalarLiteral. Static ASCII range checking will be done in the type checker, dynamic ASCII range checking will be done by a runtime precondition.

extension FixedWidthInteger {
    init(unicodeScalarLiteral: Unicode.Scalar)
}

The default inferred type for all single-quoted literals will be Character. This addresses a language pain point where declaring a Character requires type context.

Source compatibility

This proposal could be done in a way that is strictly additive, but we feel it is best to deprecate the existing double quote initializers for characters, and the UInt8.init(ascii:) initializer.

Here is a specific sketch of a deprecation policy:

  • Continue accepting these in Swift 5 mode with no change.

  • Introduce the new syntax support into Swift 5.1.

  • Swift 5.1 mode would start producing deprecation warnings (with a fixit to change double quotes to single quotes.)

  • The Swift 5 to 5.1 migrator would change the syntax (by virtue of applying the deprecation fixits.)

  • Swift 6 would not accept the old syntax.

During the transition period, "a" will remain a valid unicode scalar literal, but attempting to initialize integer types with double-quoted ASCII literals will produce an error.

let ascii:Int8 = "a" // error

However, as this will only be possible in new code, and will produce a deprecation warning from the outset, this should not be a problem.

Effect on ABI stability

All changes except deprecating the UInt8.init(ascii:) initializer are either additive, or limited to the type checker, parser, or lexer.

Removing UInt8.init(ascii:) would break ABI, but this is not necessary to implement the proposal, it’s merely housekeeping.

Effect on API resilience

None.

Alternatives considered

Integer initializers

Some have proposed extending the UInt8(ascii:) initializer to other integer types (Int8, UInt16, … , Int). However, this forgoes compile-time validity checking, and entails a substantial increase in API surface area for questionable gain.

Lifting the ASCII range restriction

Some have proposed allowing any unicode scalar literal whose codepoint index does not overflow the target integer type to be convertible to that integer type. Consensus was that this is an easy source of unicode encoding bugs, and provides little utility to the user. If people change their minds in the future, this restriction can always be lifted in a source and ABI compatible way.

Single-quoted ASCII strings

Some have proposed allowing integer array types to be expressible by multi-character ASCII strings such as 'abcd'. We consider this to be out of scope of this proposal, as well as unsupported by precedent in C and related languages.