String Revision: Collection Conformance, C Interop, Transcoding

Introduction

This proposal is to implement a subset of the changes from the Swift 4 String Manifesto.

Specifically:

  • Make String conform to BidirectionalCollection
  • Make String conform to RangeReplaceableCollection
  • Create a Substring type for String.SubSequence
  • Create a StringProtocol protocol to allow for generic operations over both types.
  • Consolidate on a concise set of C interop methods.
  • Revise the transcoding infrastructure.
  • Sink Unicode-specific functionality into a Unicode namespace.

Other existing aspects of String remain unchanged for the purposes of this proposal.

Motivation

This proposal follows up on a number of recommendations found in the manifesto:

Collection conformance was dropped from String in Swift 2. After reevaluation, the feeling is that the minor discrepancies with required RangeReplaceableCollection semantics (the fact that some characters may merge when Strings are concatenated) are outweighed by the significant benefits of restoring these conformances. For more detail on the reasoning, see here

While it is not a collection, the Swift 3 string does have slicing operations. String is currently serving as its own subsequence, allowing substrings to share storage with their "owner". This can lead to memory leaks when small substrings of larger strings are stored long-term (see here for more detail on this problem). Introducing a separate Substring type to serve as String.SubSequence is recommended to resolve this issue, in a similar fashion to ArraySlice.

As noted in the manifesto, support for interoperation with nul-terminated C strings in Swift 3 is scattered and incoherent, with six ways to transform a C string into a String and four ways to do the inverse. These APIs should be replaced with a simpler set of methods on String.

Proposed solution

A new type, Substring, will be introduced. Similar to ArraySlice it will be documented as only for short- to medium-term storage:

Important

Long-term storage of Substring instances is discouraged. A substring holds a reference to the entire storage of a larger string, not just to the portion it presents, even after the original string’s lifetime ends. Long-term storage of a substring may therefore prolong the lifetime of elements that are no longer otherwise accessible, which can appear to be memory leakage.

Aside from minor differences, such as having a SubSequence of Self and a larger size to describe the range of the subsequence, Substring will be near-identical to String from a user perspective.
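As a minimal sketch of the intended usage pattern, the following slices a string and then copies the slice into a fresh String before storing it. The variable names are illustrative only, and `firstIndex(of:)` is the spelling used by later Swift releases rather than something this proposal defines.

```swift
// A long string of which only a small part is worth keeping.
let line = "key=value; a long tail that is no longer needed"

// Slicing yields a Substring that shares storage with `line`.
let separator = line.firstIndex(of: "=")!
let key: Substring = line[line.startIndex..<separator]

// Copying into a String drops the dependency on `line`'s storage,
// which is the recommended step before long-term storage.
let storedKey = String(key)
```

Keeping `key` as a Substring is fine for short-lived work; the explicit `String(key)` conversion is the point at which the copy happens.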

In order to be able to write extensions across both String and Substring, a new StringProtocol protocol to which the two types will conform will be introduced. For the purposes of this proposal, StringProtocol will be defined as a protocol to be used whenever you would previously extend String. It should be possible to substitute extension StringProtocol { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the "memory leak" problems described above.
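The substitution described above might look like the following sketch. Both `register(_:)` and `registeredLength()` are hypothetical names invented for illustration; only the `String(self)` conversion reflects the rule stated in the text.

```swift
// Hypothetical API that requires a concrete String.
func register(_ name: String) -> Int {
  return name.count
}

extension StringProtocol {
  // Hypothetical method, written once for both String and Substring.
  func registeredLength() -> Int {
    // `self` may be a Substring, so it is converted explicitly
    // before being passed to an API that takes a String.
    return register(String(self))
  }
}

let full = "hello world"
let firstWord = full.prefix(5)  // a Substring
```

Calling `full.registeredLength()` and `firstWord.registeredLength()` both compile, since the extension applies to either type.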

The exact nature of the protocol, such as which methods should be protocol requirements and which can be implemented as protocol extensions, is considered an implementation detail and so is not covered in this proposal.

StringProtocol will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future StringProtocol-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).
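A short sketch of what range-replaceable conformance on the two concrete types allows, assuming the standard RangeReplaceableCollection methods. The last lines illustrate the character-merging caveat mentioned in the motivation: appending a combining mark does not increase the character count.

```swift
var text = "Hello, world"
let hello = text.startIndex..<text.index(text.startIndex, offsetBy: 5)
text.replaceSubrange(hello, with: "Howdy")

// A Substring is range-replaceable too; mutating the slice
// leaves the string it was taken from untouched.
var tail = text.dropFirst(7)
tail.append("!")

// Concatenation can merge characters: "e" followed by a combining
// acute accent forms a single Character.
var accented = "e"
accented.append("\u{301}")
```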

The C string interop methods will be updated to a variant of those described here: two withCString operations and two init(cString:) constructors, one each for UTF8 and for arbitrary encodings. The primary change is to remove "non-repairing" variants of construction from nul-terminated C strings. In both of the construction APIs, any invalid encoding sequence detected will have its longest valid prefix replaced by U+FFFD, the Unicode replacement character, per the Unicode specification. This covers the common case. The replacement can be done physically in the underlying storage, and the validity of the result can be recorded in the String's encoding, so that future accesses are not slowed down by having to repair errors again. Construction that is aborted when encoding errors are detected can be accomplished using APIs on the encoding.
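The repairing behavior can be sketched as follows, assuming a hand-built buffer that contains one byte (0xFF) that is never valid in UTF-8. The buffer contents are made up for illustration.

```swift
// "Hi", then an invalid byte, then the NUL terminator.
let rawBytes: [CChar] = [0x48, 0x69, CChar(truncatingIfNeeded: 0xFF), 0]

// The repairing initializer substitutes U+FFFD for the invalid byte
// instead of failing.
let repaired = rawBytes.withUnsafeBufferPointer { buffer in
  String(cString: buffer.baseAddress!)
}
```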

Additionally, an init that takes a collection of code units and an encoding will allow for construction of a String from arbitrary collections – for example, an UnsafeBufferPointer containing a non-nul-terminated C string.
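For example, under the `init(decoding:as:)` signature given in the detailed design, construction from code units without a terminator might look like this sketch. The byte values are illustrative.

```swift
// UTF-8 code units for "café", with no NUL terminator.
let utf8Bytes: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9]
let fromArray = String(decoding: utf8Bytes, as: Unicode.UTF8.self)

// The same initializer accepts any collection of code units,
// including an UnsafeBufferPointer over existing memory.
let fromBuffer = utf8Bytes.withUnsafeBufferPointer { buffer in
  String(decoding: buffer, as: Unicode.UTF8.self)
}

// Other encodings use the same entry point.
let utf16Units: [UInt16] = [0x0068, 0x0069]
let fromUTF16 = String(decoding: utf16Units, as: Unicode.UTF16.self)
```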

The current transcoding support will be updated to improve usability and performance. The primary changes will be:

  • to allow transcoding directly from one encoding to another without having to triangulate through an intermediate scalar value
  • to add the ability to transcode an input collection in reverse, allowing the different views on String to be made bi-directional
  • to ensure that the APIs can be used to create performant bidirectional decoded and transcoded views of underlying code units.
  • to replace the UnicodeCodec with a stateless Unicode.Encoding protocol having associated ForwardParser and ReverseParser types for decoding.

The standard library currently lacks a Latin1 codec, so an enum Latin1: Unicode.Encoding type will be added.

Detailed design

The Unicode Namespace

A Unicode “namespace” will be added for components related to low-level Unicode operations such as transcoding and grapheme breaking. Absent more direct language support, Unicode will, for the time being, be implemented as a caseless enum. [The caseless enum technique is precedented by CommandLine, which vends the equivalent of argc and argv for command-line applications.]

enum Unicode {
  enum ASCII : Unicode.Encoding { ... }
  enum UTF8 : Unicode.Encoding { ... }
  enum UTF16 : Unicode.Encoding { ... }
  enum UTF32 : Unicode.Encoding { ... }
  ...
  enum ParseResult<T> { ... }
  struct Scalar { ... }
}

The names UTF8, UTF16, UTF32, and Scalar correspond to entities that exist in Swift 3. For backward compatibility they will be exposed to Swift 3 programs with their legacy spellings:

@available(swift, obsoleted: 4.0, renamed: "Unicode.UTF8")
public typealias UTF8 = Unicode.UTF8
@available(swift, obsoleted: 4.0, renamed: "Unicode.UTF16")
public typealias UTF16 = Unicode.UTF16
@available(swift, obsoleted: 4.0, renamed: "Unicode.UTF32")
public typealias UTF32 = Unicode.UTF32
@available(swift, obsoleted: 4.0, renamed: "Unicode.Scalar")
public typealias UnicodeScalar = Unicode.Scalar

Unicode-specific protocols will be presented as members of this namespace. Pending the addition of more direct language support, typealiases will be used to bring them in from underscored names in the Swift namespace. The intention is that diagnostics and documentation will display the nested, non-underscored names.

protocol _UnicodeEncoding { ... }
protocol _UnicodeParser { ... }
extension Unicode {
  typealias Encoding = _UnicodeEncoding
  typealias Parser = _UnicodeParser
}

UnicodeCodec will be updated to refine Unicode.Encoding, and deprecated for Swift 4. Existing models of UnicodeCodec such as UTF8 will inherit Unicode.Encoding conformance for Swift 3.

As noted below we anticipate adding many more Unicode-specific components to the Unicode namespace in the near future.

String, Substring, and StringProtocol

The following additions will be made to the standard library:

protocol StringProtocol : BidirectionalCollection {
  // Implementation detail as described above
}

extension String : StringProtocol, RangeReplaceableCollection {
  typealias SubSequence = Substring
  subscript(bounds: Range<String.Index>) -> Substring { 
    ...
  }
}

struct Substring : StringProtocol, RangeReplaceableCollection {
  typealias SubSequence = Substring
  // near-identical API surface area to String
}
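One consequence of these conformances is that generic collection algorithms apply to both types. The following sketch uses a hypothetical `isPalindrome` helper, written against BidirectionalCollection rather than against any string type, to show this.

```swift
// Hypothetical generic helper; nothing here is string-specific.
func isPalindrome<C: BidirectionalCollection>(_ c: C) -> Bool
  where C.Element: Equatable {
  return c.elementsEqual(c.reversed())
}

let whole = "level"                           // String
let slice = "xlevelx".dropFirst().dropLast()  // Substring

let wholeResult = isPalindrome(whole)
let sliceResult = isPalindrome(slice)
let negativeResult = isPalindrome("swift")
```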

The slicing operations on String will be amended to return Substring:

struct String {
  subscript(bounds: Range<Index>) -> Substring { ... }
}

Note that properties or methods that, by their nature, create new String storage (such as lowercased()) will continue to return String.

C string interoperability will be consolidated on the following methods:

extension String {
  /// Constructs a `String` having the same contents as `codeUnits`.
  ///
  /// - Parameter codeUnits: a collection of code units in
  ///   the given `encoding`.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<C: Collection, Encoding: Unicode.Encoding>(
    decoding codeUnits: C, as encoding: Encoding.Type
  )
    where C.Iterator.Element == Encoding.CodeUnit

  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded 
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
  
  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: Unicode.Encoding>(
    decodingCString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    as: Encoding.Type)
    
  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of code units in the given encoding.
  func withCString<Result, Encoding: Unicode.Encoding>(
    encodedAs: Encoding.Type,
    _ body: (UnsafePointer<Encoding.CodeUnit>) throws -> Result
  ) rethrows -> Result
}

Additionally, the current ability to pass a Swift String directly into methods that take a C string (UnsafePointer<CChar>) will remain as-is.
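A sketch of how the consolidated methods might be used together, assuming the signatures listed above. The closures count or re-decode the code units by hand so that no C library import is needed.

```swift
let greeting = "h\u{E9}llo"  // "héllo"

// UTF-8 view: count the bytes before the terminator.
let utf8Length = greeting.withCString { pointer -> Int in
  var count = 0
  while pointer[count] != 0 { count += 1 }
  return count
}

// Arbitrary encoding: expose the string as nul-terminated UTF-16
// and decode it again with the matching initializer.
let roundTrip = greeting.withCString(encodedAs: Unicode.UTF16.self) { pointer in
  String(decodingCString: pointer, as: Unicode.UTF16.self)
}
```

The pointers are only valid for the duration of each closure, so any result that must outlive the call should be copied out, as both closures here do.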

Low-level Unicode Processing

A new protocol, Unicode.Encoding, will be added to replace the current UnicodeCodec protocol.

extension Unicode { typealias Encoding = _UnicodeEncoding }

public protocol _UnicodeEncoding {
  /// The basic unit of encoding
  associatedtype CodeUnit : UnsignedInteger, FixedWidthInteger
  
  /// A valid scalar value as represented in this encoding
  associatedtype EncodedScalar : BidirectionalCollection
    where EncodedScalar.Iterator.Element == CodeUnit

  /// A unicode scalar value to be used when repairing
  /// encoding/decoding errors, as represented in this encoding.
  ///
  /// If the Unicode replacement character U+FFFD is representable in this
  /// encoding, `encodedReplacementCharacter` encodes that scalar value.
  static var encodedReplacementCharacter : EncodedScalar { get }

  /// Converts from encoded to encoding-independent representation
  static func decode(_ content: EncodedScalar) -> Unicode.Scalar

  /// Converts from encoding-independent to encoded representation, returning
  /// `nil` if the scalar can't be represented in this encoding.
  static func encode(_ content: Unicode.Scalar) -> EncodedScalar?

  /// Converts a scalar from another encoding's representation, returning
  /// `nil` if the scalar can't be represented in this encoding.
  ///
  /// A default implementation of this method will be provided 
  /// automatically for any conforming type that does not implement one.
  static func transcode<FromEncoding : Unicode.Encoding>(
    _ content: FromEncoding.EncodedScalar, from _: FromEncoding.Type
  ) -> EncodedScalar?

  /// A type that can be used to parse `CodeUnit`s into
  /// `EncodedScalar`s.
  associatedtype ForwardParser : Unicode.Parser
    where ForwardParser.Encoding == Self
    
  /// A type that can be used to parse a reversed sequence of
  /// `CodeUnit`s into `EncodedScalar`s.
  associatedtype ReverseParser : Unicode.Parser
    where ReverseParser.Encoding == Self
}
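The static requirements above can be exercised as in the following sketch, which encodes one scalar, decodes it again, and transcodes it directly from UTF-8 to UTF-16 without an intermediate scalar value. The force unwraps are a shortcut that is safe only because U+00E9 is known to be representable in both encodings.

```swift
let scalar: Unicode.Scalar = "\u{E9}"  // é

// Scalar -> UTF-8 code units, and back again.
let utf8Form = Unicode.UTF8.encode(scalar)!
let utf8Units = Array(utf8Form)
let decodedScalar = Unicode.UTF8.decode(utf8Form)

// UTF-8 -> UTF-16 in one step.
let utf16Form = Unicode.UTF16.transcode(utf8Form, from: Unicode.UTF8.self)!
let utf16Units = Array(utf16Form)

// Encoding returns nil when the target encoding cannot represent
// the scalar, as is the case for é in ASCII.
let asciiForm = Unicode.ASCII.encode(scalar)
```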

Parsing CodeUnits into EncodedScalars, in either direction, is done with models of Unicode.Parser:

extension Unicode {  typealias Parser = _UnicodeParser }

/// Types that separate streams of code units into encoded Unicode
/// scalar values.
public protocol _UnicodeParser {
  /// The encoding with which this parser is associated
  associatedtype Encoding : Unicode.Encoding

  /// Constructs an instance that can be used to begin parsing `CodeUnit`s at
  /// any Unicode scalar boundary.
  init()

  /// Parses a single Unicode scalar value from `input`.
  mutating func parseScalar<I : IteratorProtocol>(
    from input: inout I
  ) -> Unicode.ParseResult<Encoding.EncodedScalar>
  where I.Element == Encoding.CodeUnit
}

extension Unicode {
  /// The result of attempting to parse a `T` from some input.
  public enum ParseResult<T> {
    /// A `T` was parsed successfully
    case valid(T)

    /// The input was entirely consumed.
    case emptyInput

    /// An encoding error was detected.
    ///
    /// `length` is the number of underlying code units consumed by this
    /// error (when decoding, the length of the longest prefix that
    /// could be recognized of a valid encoding sequence).
    case error(length: Int)
  }
}
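A decoding loop built on these pieces might look like the sketch below. It feeds a hand-made byte sequence containing one invalid byte through the UTF-8 forward parser and substitutes U+FFFD on errors; the repair policy shown is one possible choice made by the caller, not something the parser imposes.

```swift
var parser = Unicode.UTF8.ForwardParser()

// "A", an invalid byte, then the two-byte encoding of "é".
var input = ([0x41, 0xFF, 0xC3, 0xA9] as [UInt8]).makeIterator()

var scalars: [Unicode.Scalar] = []
var errorCount = 0

parsing: while true {
  switch parser.parseScalar(from: &input) {
  case .valid(let encoded):
    // Convert the encoded form to an encoding-independent scalar.
    scalars.append(Unicode.UTF8.decode(encoded))
  case .error:
    // Caller-chosen repair: record the error and substitute U+FFFD.
    errorCount += 1
    scalars.append("\u{FFFD}")
  case .emptyInput:
    break parsing
  }
}
```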

Higher-Level Unicode Processing

The Unicode processing APIs proposed here are intentionally extremely low-level. We have proven that they are sufficient to implement higher-level constructs, but those designs are still baking and not yet ready for review. We expect to propose generic Iterator, Sequence, and Collection views that expose transcoded or segmented views of arbitrary underlying storage, as separate components in the Unicode namespace.

Source compatibility

Adding collection conformance to String should not materially impact source stability as it is purely additive: Swift 3's String interface currently fulfills all of the requirements for a bidirectional range replaceable collection.

Altering String's slicing operations to return a different type is source breaking. The following mitigating steps are proposed:

  • Add a deprecated subscript operator that will run in Swift 3 compatibility mode and which will return a String not a Substring.

  • Add deprecated versions of all current slicing methods to similarly return a String.

i.e.:

extension String {
  @available(swift, obsoleted: 4)
  subscript(bounds: Range<Index>) -> String {
    return String(characters[bounds])
  }

  @available(swift, obsoleted: 4)
  subscript(bounds: ClosedRange<Index>) -> String {
    return String(characters[bounds])
  }
}

In a review of 77 popular Swift projects found on GitHub, these changes resolved any build issues in the 12 projects that assumed an explicit String type returned from slicing operations.

Because these compatibility subscripts copy the slice into a new String, they will be O(n) rather than O(1). This is not expected to be a major concern, based on experience with a similar change made to Java. Projects will be able to work around performance issues without upgrading to Swift 4 by explicitly typing slices as Substring, which will call the Swift 4 variant; that variant will be available, but not invoked by default, in Swift 3 mode.

The C string interoperability methods outside the ones described in the detailed design will remain in Swift 3 mode, be deprecated in Swift 4 mode, and be removed in a subsequent release. UnicodeCodec will be similarly deprecated.

Effect on ABI stability

As a fundamental currency type for Swift, it is essential that the String type (and its associated subsequence) is in a good long-term state before being locked down when Swift declares ABI stability. Shrinking the size of String to be 64 bits is an important part of the story. As full ABI stability is not planned for Swift 4, it is currently unclear when the transition to a 64-bit memory layout will occur.

Effect on API resilience

Decisions about the API resilience of the String type are still to be determined, but are not adversely affected by this proposal.

Alternatives considered

For a more in-depth discussion of some of the trade-offs in string design, see the manifesto and associated evolution thread.

This proposal does not yet introduce an implicit conversion from Substring to String. The decision on whether to add this will be deferred pending feedback on the initial implementation. The intention is to make a preview toolchain available for feedback, including on whether this implicit conversion is necessary, prior to the release of Swift 4.