- Proposal: SE-0163
- Authors: Ben Cohen, Dave Abrahams
- Review Manager: John McCall
- Status: Implemented (Swift 4.0)
- Revision: 2
- Previous Revision: 1
- Decision Notes: Rationale #1, Rationale #2
This proposal is to implement a subset of the changes from the Swift 4 String Manifesto.
Specifically:
- Make String conform to BidirectionalCollection
- Make String conform to RangeReplaceableCollection
- Create a Substring type for String.SubSequence
- Create a StringProtocol protocol to allow for generic operations over both types.
- Consolidate on a concise set of C interop methods.
- Revise the transcoding infrastructure.
- Sink Unicode-specific functionality into a Unicode namespace.
Other existing aspects of String remain unchanged for the purposes of this proposal.
This proposal follows up on a number of recommendations found in the manifesto:
Collection conformance was dropped from String in Swift 2. After reevaluation, the feeling is that the minor discrepancies with required RangeReplaceableCollection semantics (the fact that some characters may merge when Strings are concatenated) are outweighed by the significant benefits of restoring these conformances. For more detail on the reasoning, see the manifesto.
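To make the "characters may merge" caveat concrete, here is a small sketch. It is not taken from the proposal, and the variable names are illustrative. It shows that character counts are not always additive when strings are concatenated.

```swift
// Illustrative only: why String deviates slightly from the usual
// expectation that element counts add up when collections are appended.
let base = "e"                 // one Character
let accent = "\u{301}"         // COMBINING ACUTE ACCENT, one Character on its own
let combined = base + accent   // the two merge into a single grapheme cluster

print(base.count, accent.count, combined.count)
```

Each operand has a count of 1, yet the concatenation also has a count of 1 rather than 2, because the accent combines with the preceding letter.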
While it is not a collection, the Swift 3 string does have slicing operations. String is currently serving as its own subsequence, allowing substrings to share storage with their "owner". This can lead to memory leaks when small substrings of larger strings are stored long-term (see the manifesto for more detail on this problem). Introducing a separate type of Substring to serve as String.SubSequence is recommended to resolve this issue, in a similar fashion to ArraySlice.
As noted in the manifesto, support for interoperation with nul-terminated C strings in Swift 3 is scattered and incoherent, with six ways to transform a C string into a String and four ways to do the inverse. These APIs should be replaced with a simpler set of methods on String.
A new type, Substring, will be introduced. Similar to ArraySlice, it will be documented as only for short- to medium-term storage:
Important
Long-term storage of Substring instances is discouraged. A substring holds a reference to the entire storage of a larger string, not just to the portion it presents, even after the original string’s lifetime ends. Long-term storage of a substring may therefore prolong the lifetime of elements that are no longer otherwise accessible, which can appear to be memory leakage.
Aside from minor differences, such as having a SubSequence of Self and a larger size to describe the range of the subsequence, Substring will be near-identical to String from a user perspective.
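The storage guidance above can be sketched as follows. This is a minimal illustration with made-up names (line, keySlice, key), not code from the proposal: work with the slice while it is short-lived, then copy it into a String before keeping it around.

```swift
// Illustrative only; the names are hypothetical.
let line = "name=value   # a long trailing comment we do not need to keep"

// Slicing yields a Substring that shares storage with `line`.
let keySlice: Substring = line.prefix(while: { $0 != "=" })

// Copy into a String before storing long-term, so that the whole
// original buffer is not kept alive by the small slice.
let key = String(keySlice)

print(key)
```

The explicit String(...) call is where the copy happens, which keeps the cost visible at the point where a slice escapes into long-term storage.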
In order to be able to write extensions across both String and Substring, a new StringProtocol protocol to which the two types will conform will be introduced. For the purposes of this proposal, StringProtocol will be defined as a protocol to be used whenever you would previously extend String. It should be possible to substitute extension StringProtocol { ... } in Swift 4 wherever extension String { ... } was written in Swift 3, with one exception: any passing of self into an API that takes a concrete String will need to be rewritten as String(self). If Self is a String then this should effectively optimize to a no-op, whereas if Self is a Substring then this will force a copy, helping to avoid the "memory leak" problems described above.
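A minimal sketch of the substitution described above, assuming a hypothetical String-only function named remember(_:in:). The extension members are illustrative and not part of the proposed API.

```swift
// Hypothetical API that only accepts a concrete String.
func remember(_ name: String, in names: inout [String]) {
    names.append(name)
}

extension StringProtocol {
    /// True when the characters read the same in both directions.
    var isPalindrome: Bool {
        return elementsEqual(reversed())
    }

    /// Passing `self` to a String-only API needs an explicit conversion.
    func remembered(in names: inout [String]) {
        remember(String(self), in: &names)
    }
}

var names: [String] = []
let padded = "xlevelx"
let inner = padded.dropFirst().dropLast()   // a Substring

"level".remembered(in: &names)   // called on a String
inner.remembered(in: &names)     // called on a Substring, copied by String(self)
```

Because the extension is written against StringProtocol, the same members are available on both String and Substring without duplication.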
The exact nature of the protocol (such as which methods should be protocol requirements and which can be implemented as protocol extensions) is considered an implementation detail and so is not covered in this proposal.
StringProtocol will conform to BidirectionalCollection. RangeReplaceableCollection conformance will be added directly onto the String and Substring types, as it is possible future StringProtocol-conforming types might not be range-replaceable (e.g. an immutable type that wraps a const char *).
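As a rough illustration of what the two conformances provide, the following sketch uses only standard collection API on a String. The variable names are illustrative, and firstIndex(of:) is the spelling in later Swift releases (an assumption relative to the Swift 4.0 era of this proposal).

```swift
var greeting = "Hello, world"

// BidirectionalCollection: access and traversal from the end.
let lastCharacter = greeting.last
let backwards = String(greeting.reversed())

// RangeReplaceableCollection: in-place edits through the protocol's API.
if let comma = greeting.firstIndex(of: ",") {
    greeting.replaceSubrange(greeting.startIndex..<comma, with: "Goodbye")
}

print(greeting, backwards)
```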
The C string interop methods will be updated to a variant of those described in the manifesto: two withCString operations and two init(cString:) constructors, one each for UTF8 and for arbitrary encodings. The primary change is to remove "non-repairing" variants of construction from nul-terminated C strings. In both of the construction APIs, any invalid encoding sequence detected will have its longest valid prefix replaced by U+FFFD, the Unicode replacement character, per the Unicode specification. This covers the common case. The replacement can be done physically in the underlying storage and the validity of the result can be recorded in the String's encoding such that future accesses need not be slowed down by possible error repair separately. Construction that is aborted when encoding errors are detected can be accomplished using APIs on the encoding.
Additionally, an init that takes a collection of code units and an encoding will allow for construction of a String from arbitrary collections – for example, an UnsafeBufferPointer containing a non-nul-terminated C string.
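The repairing behaviour and the collection-based initializer can be sketched as below. The byte arrays and variable names are illustrative assumptions; the calls are the init(decoding:as:), init(cString:) and withCString entry points named in this proposal.

```swift
// "café" as UTF-8 code units, with no nul terminator.
let cafeBytes: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9]
let cafe = String(decoding: cafeBytes, as: Unicode.UTF8.self)

// 0xFF can never appear in valid UTF-8, so it is repaired to U+FFFD.
let damaged: [UInt8] = [0x61, 0xFF, 0x62]
let repaired = String(decoding: damaged, as: Unicode.UTF8.self)

// The same initializer works for other encodings.
let hi = String(decoding: [0x0068, 0x0069] as [UInt16], as: Unicode.UTF16.self)

// Round trip through a nul-terminated UTF-8 C string.
let roundTripped = "h\u{E9}llo".withCString { String(cString: $0) }

print(cafe, repaired, hi, roundTripped)
```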
The current transcoding support will be updated to improve usability and performance. The primary changes will be:
- to allow transcoding directly from one encoding to another without having to triangulate through an intermediate scalar value
- to add the ability to transcode an input collection in reverse, allowing the different views on String to be made bi-directional
- to ensure that the APIs can be used to create performant bidirectional decoded and transcoded views of underlying code units.
- to replace the UnicodeCodec with a stateless Unicode.Encoding protocol having associated ForwardParser and ReverseParser types for decoding.
The standard library currently lacks a Latin1 codec, so an enum Latin1: Unicode.Encoding type will be added.
A Unicode “namespace” will be added for components related to low-level Unicode operations such as transcoding and grapheme breaking. Absent more direct language support, Unicode will, for the time being, be implemented as a caseless enum. [The caseless enum technique is precedented by CommandLine, which vends the equivalent of argc and argv for command-line applications.]
enum Unicode {
  enum ASCII : Unicode.Encoding { ... }
  enum UTF8 : Unicode.Encoding { ... }
  enum UTF16 : Unicode.Encoding { ... }
  enum UTF32 : Unicode.Encoding { ... }
  ...
  enum ParseResult<T> { ... }
  struct Scalar { ... }
}
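For readers unfamiliar with the caseless enum technique, here is a small self-contained sketch. Geometry and its members are made-up names used purely for illustration and have nothing to do with the standard library.

```swift
// A caseless enum used purely as a namespace (hypothetical example).
enum Geometry {
    struct Point: Equatable {
        var x: Double
        var y: Double
    }

    static let origin = Point(x: 0, y: 0)
}

// Nested declarations are reached through the namespace.
let p = Geometry.Point(x: 3, y: 4)

// `Geometry()` would not compile: an enum with no cases has no values,
// which is what makes it suitable as a namespace.
print(p == Geometry.origin)
```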
The names UTF8, UTF16, UTF32, and Scalar correspond to entities that exist in Swift 3. For backward compatibility they will be exposed to Swift 3 programs with their legacy spellings:
@available(swift, obsoleted: 4.0, renamed: "Unicode.UTF8")
public typealias UTF8 = Unicode.UTF8
@available(swift, obsoleted: 4.0, renamed: "Unicode.UTF16")
public typealias UTF16 = Unicode.UTF16
@available(swift, obsoleted: 4.0, renamed: "Unicode.UTF32")
public typealias UTF32 = Unicode.UTF32
@available(swift, obsoleted: 4.0, renamed: "Unicode.Scalar")
public typealias UnicodeScalar = Unicode.Scalar
Unicode-specific protocols will be presented as members of this namespace. Pending the addition of more direct language support, typealiases will be used to bring them in from underscored names in the Swift namespace. The intention is that diagnostics and documentation will display the nested, non-underscored names.
protocol _UnicodeEncoding { ... }
protocol _UnicodeParser { ... }
extension Unicode {
  typealias Encoding = _UnicodeEncoding
  typealias Parser = _UnicodeParser
}
UnicodeCodec will be updated to refine Unicode.Encoding, and deprecated for Swift 4. Existing models of UnicodeCodec such as UTF8 will inherit Unicode.Encoding conformance for Swift 3.
As noted below, we anticipate adding many more Unicode-specific components to the Unicode namespace in the near future.
The following additions will be made to the standard library:
protocol StringProtocol : BidirectionalCollection {
  // Implementation detail as described above
}

extension String : StringProtocol, RangeReplaceableCollection {
  typealias SubSequence = Substring
  subscript(bounds: Range<String.Index>) -> Substring {
    ...
  }
}

struct Substring : StringProtocol, RangeReplaceableCollection {
  typealias SubSequence = Substring
  // near-identical API surface area to String
}
The slicing operations on String will be amended to return Substring:
struct String {
  subscript(bounds: Range<Index>) -> Substring { ... }
}
Note that properties or methods that due to their nature create new String storage (such as lowercased()) will not change.
C string interoperability will be consolidated on the following methods:
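The distinction can be checked directly. In this sketch (illustrative names, not from the proposal), a slicing operation produces a Substring, while an operation that must build new storage produces a String even when called on a slice.

```swift
let title = "Swift Evolution"

let firstWord = title.prefix(5)            // slicing: a Substring
let shouted = title.uppercased()           // new storage: a String
let shoutedWord = firstWord.uppercased()   // new storage again: a String

print(type(of: firstWord), type(of: shouted), type(of: shoutedWord))
```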
extension String {
  /// Constructs a `String` having the same contents as `codeUnits`.
  ///
  /// - Parameter codeUnits: a collection of code units in
  ///   the given `encoding`.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<C: Collection, Encoding: Unicode.Encoding>(
    decoding codeUnits: C, as encoding: Encoding.Type
  )
    where C.Iterator.Element == Encoding.CodeUnit

  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: Unicode.Encoding>(
    decodingCString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    as: Encoding.Type)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of code units in the given encoding.
  func withCString<Result, Encoding: Unicode.Encoding>(
    encodedAs: Encoding.Type,
    _ body: (UnsafePointer<Encoding.CodeUnit>) throws -> Result
  ) rethrows -> Result
}
Additionally, the current ability to pass a Swift String directly into methods that take a C string (UnsafePointer<CChar>) will remain as-is.
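The implicit bridging mentioned above can be sketched without importing any C library. Here byteLength is a hypothetical stand-in for a C function such as strlen; it is written in Swift only so the example is self-contained.

```swift
// Hypothetical stand-in for a C API taking a nul-terminated string.
func byteLength(_ cString: UnsafePointer<CChar>) -> Int {
    var count = 0
    while cString[count] != 0 {
        count += 1
    }
    return count
}

// A String can be passed where UnsafePointer<CChar> is expected; the
// pointer refers to a nul-terminated UTF-8 buffer valid for the call.
let text = "h\u{E9}llo"
let utf8Length = byteLength(text)

print(utf8Length, text.utf8.count)
```

The pointer must not be stored or used after the call returns, which is the same constraint that applies inside a withCString closure.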
A new protocol, Unicode.Encoding, will be added to replace the current UnicodeCodec protocol.
extension Unicode { typealias Encoding = _UnicodeEncoding }

public protocol _UnicodeEncoding {
  /// The basic unit of encoding
  associatedtype CodeUnit : UnsignedInteger, FixedWidthInteger

  /// A valid scalar value as represented in this encoding
  associatedtype EncodedScalar : BidirectionalCollection
    where EncodedScalar.Iterator.Element == CodeUnit

  /// A unicode scalar value to be used when repairing
  /// encoding/decoding errors, as represented in this encoding.
  ///
  /// If the Unicode replacement character U+FFFD is representable in this
  /// encoding, `encodedReplacementCharacter` encodes that scalar value.
  static var encodedReplacementCharacter : EncodedScalar { get }

  /// Converts from encoded to encoding-independent representation
  static func decode(_ content: EncodedScalar) -> Unicode.Scalar

  /// Converts from encoding-independent to encoded representation, returning
  /// `nil` if the scalar can't be represented in this encoding.
  static func encode(_ content: Unicode.Scalar) -> EncodedScalar?

  /// Converts a scalar from another encoding's representation, returning
  /// `nil` if the scalar can't be represented in this encoding.
  ///
  /// A default implementation of this method will be provided
  /// automatically for any conforming type that does not implement one.
  static func transcode<FromEncoding : Unicode.Encoding>(
    _ content: FromEncoding.EncodedScalar, from _: FromEncoding.Type
  ) -> EncodedScalar?

  /// A type that can be used to parse `CodeUnits` into
  /// `EncodedScalar`s.
  associatedtype ForwardParser : Unicode.Parser
    where ForwardParser.Encoding == Self

  /// A type that can be used to parse a reversed sequence of
  /// `CodeUnits` into `EncodedScalar`s.
  associatedtype ReverseParser : Unicode.Parser
    where ReverseParser.Encoding == Self
}
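A short sketch of how the static requirements above might be used with the standard library's UTF-8, UTF-16 and ASCII encodings. The variable names are illustrative; the point is that transcode goes directly from one encoded form to another.

```swift
let eAcute: Unicode.Scalar = "\u{E9}"   // LATIN SMALL LETTER E WITH ACUTE

// Encode a scalar into UTF-8 code units.
let utf8Scalar = Unicode.UTF8.encode(eAcute)!
let utf8Units = Array(utf8Scalar)

// Transcode the UTF-8 form directly into UTF-16.
let utf16Scalar = Unicode.UTF16.transcode(utf8Scalar, from: Unicode.UTF8.self)!
let utf16Units = Array(utf16Scalar)

// Decoding recovers the encoding-independent scalar.
let decoded = Unicode.UTF16.decode(utf16Scalar)

// Encoding fails with nil when the scalar is not representable.
let asciiIsNil = Unicode.ASCII.encode(eAcute) == nil

print(utf8Units, utf16Units, decoded, asciiIsNil)
```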
Parsing CodeUnits into EncodedScalars, in either direction, is done with models of Unicode.Parser:
extension Unicode { typealias Parser = _UnicodeParser }

/// Types that separate streams of code units into encoded Unicode
/// scalar values.
public protocol _UnicodeParser {
  /// The encoding with which this parser is associated
  associatedtype Encoding : Unicode.Encoding

  /// Constructs an instance that can be used to begin parsing `CodeUnit`s at
  /// any Unicode scalar boundary.
  init()

  /// Parses a single Unicode scalar value from `input`.
  mutating func parseScalar<I : IteratorProtocol>(
    from input: inout I
  ) -> Unicode.ParseResult<Encoding.EncodedScalar>
    where I.Element == Encoding.CodeUnit
}

extension Unicode {
  /// The result of attempting to parse a `T` from some input.
  public enum ParseResult<T> {
    /// A `T` was parsed successfully
    case valid(T)

    /// The input was entirely consumed.
    case emptyInput

    /// An encoding error was detected.
    ///
    /// `length` is the number of underlying code units consumed by this
    /// error (when decoding, the length of the longest prefix that
    /// could be recognized of a valid encoding sequence).
    case error(length: Int)
  }
}
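The parser protocol and its three result cases can be exercised as in the sketch below, which drives the standard library's UTF-8 forward parser over a small byte array. The input bytes and variable names are illustrative assumptions.

```swift
// "A", then "é" as two UTF-8 code units, then an invalid byte.
let bytes: [UInt8] = [0x41, 0xC3, 0xA9, 0xFF]

var parser = Unicode.UTF8.ForwardParser()
var input = bytes.makeIterator()

var scalars: [Unicode.Scalar] = []
var invalidCodeUnits = 0

parsing: while true {
    switch parser.parseScalar(from: &input) {
    case .valid(let encoded):
        // Convert the encoded form into an encoding-independent scalar.
        scalars.append(Unicode.UTF8.decode(encoded))
    case .error(let length):
        // `length` is the number of code units consumed by the error.
        invalidCodeUnits += length
    case .emptyInput:
        break parsing
    }
}

print(scalars, invalidCodeUnits)
```

A repairing decoder would typically substitute encodedReplacementCharacter in the error case instead of merely counting, whereas a validating one would stop at the first error.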
The Unicode processing APIs proposed here are intentionally extremely low-level. We have proven that they are sufficient to implement higher-level constructs, but those designs are still baking and not yet ready for review. We expect to propose generic Iterator, Sequence, and Collection views that expose transcoded or segmented views of arbitrary underlying storage, as separate components in the Unicode namespace.
Adding collection conformance to String should not materially impact source stability as it is purely additive: Swift 3's String interface currently fulfills all of the requirements for a bidirectional range replaceable collection.
Altering String's slicing operations to return a different type is source breaking. The following mitigating steps are proposed:
- Add a deprecated subscript operator that will run in Swift 3 compatibility mode and which will return a String, not a Substring.
- Add deprecated versions of all current slicing methods to similarly return a String.
i.e.:
extension String {
  @available(swift, obsoleted: 4)
  subscript(bounds: Range<Index>) -> String {
    return String(characters[bounds])
  }

  @available(swift, obsoleted: 4)
  subscript(bounds: ClosedRange<Index>) -> String {
    return String(characters[bounds])
  }
}
In a review of 77 popular Swift projects found on GitHub, these changes resolved any build issues in the 12 projects that assumed an explicit String type returned from slicing operations.
Due to the change in internal implementation, these operations will be O(n) rather than O(1). This is not expected to be a major concern, based on experiences from a similar change made to Java, but projects will be able to work around performance issues without upgrading to Swift 4 by explicitly typing slices as Substring, which will call the Swift 4 variant, and which will be available but not invoked by default in Swift 3 mode.
The C string interoperability methods outside the ones described in the detailed design will remain in Swift 3 mode, be deprecated in Swift 4 mode, and be removed in a subsequent release. UnicodeCodec will be similarly deprecated.
As a fundamental currency type for Swift, it is essential that the String type (and its associated subsequence) is in a good long-term state before being locked down when Swift declares ABI stability. Shrinking the size of String to be 64 bits is an important part of the story. As full ABI stability is not planned for Swift 4, it is currently unclear when the transition to a 64-bit memory layout will occur.
Decisions about the API resilience of the String type are still to be determined, but are not adversely affected by this proposal.
For a more in-depth discussion of some of the trade-offs in string design, see the manifesto and associated evolution thread.
This proposal does not yet introduce an implicit conversion from Substring to String. The decision on whether to add this will be deferred pending feedback on the initial implementation. The intention is to make a preview toolchain available for feedback, including on whether this implicit conversion is necessary, prior to the release of Swift 4.