String Index Overhaul
- Proposal: SE-0180
- Author: Dave Abrahams
- Review Manager: Ted Kremenek
- Status: Implemented (Swift 4)
- Decision Notes: Rationale
- Implementation: apple/swift#9806
- Previous Revision: 1
Introduction
Today String shares an Index type with its CharacterView but not with its UTF8View, UTF16View, or UnicodeScalarView. This proposal redefines String.UTF8View.Index, String.UTF16View.Index, and String.UnicodeScalarView.Index as typealiases for String.Index, and exposes a public encodedOffset property and initializer that can be used to serialize and deserialize positions in a String or Substring.
Swift-evolution thread: Pitch: String Index Overhaul
Motivation
The different index types are supported by a set of Index
initializers, which are failable whenever the source index might not
correspond to a position in the target view:
if let j = String.UnicodeScalarView.Index(
someUTF16Position, within: s.unicodeScalars) {
...
}
The current API is as follows:
public extension String.Index {
init?(_: String.UnicodeScalarIndex, within: String)
init?(_: String.UTF16Index, within: String)
init?(_: String.UTF8Index, within: String)
}
public extension String.UTF16View.Index {
init?(_: String.UTF8Index, within: String.UTF16View)
init(_: String.UnicodeScalarIndex, within: String.UTF16View)
init(_: String.Index, within: String.UTF16View)
}
public extension String.UTF8View.Index {
init?(_: String.UTF16Index, within: String.UTF8View)
init(_: String.UnicodeScalarIndex, within: String.UTF8View)
init(_: String.Index, within: String.UTF8View)
}
public extension String.UnicodeScalarView.Index {
init?(_: String.UTF16Index, within: String.UnicodeScalarView)
init?(_: String.UTF8Index, within: String.UnicodeScalarView)
init(_: String.Index, within: String.UnicodeScalarView)
}
These initializers are supplemented by a corresponding set of convenience conversion methods:
if let j = someUTF16Position.samePosition(in: s.unicodeScalars) {
...
}
with the following API:
public extension String.Index {
func samePosition(in: String.UTF8View) -> String.UTF8View.Index
func samePosition(in: String.UTF16View) -> String.UTF16View.Index
func samePosition(
in: String.UnicodeScalarView) -> String.UnicodeScalarView.Index
}
public extension String.UTF16View.Index {
func samePosition(in: String) -> String.Index?
func samePosition(in: String.UTF8View) -> String.UTF8View.Index?
func samePosition(
in: String.UnicodeScalarView) -> String.UnicodeScalarView.Index?
}
public extension String.UTF8View.Index {
func samePosition(in: String) -> String.Index?
func samePosition(in: String.UTF16View) -> String.UTF16View.Index?
func samePosition(
in: String.UnicodeScalarView) -> String.UnicodeScalarView.Index?
}
public extension String.UnicodeScalarView.Index {
func samePosition(in: String) -> String.Index?
func samePosition(in: String.UTF8View) -> String.UTF8View.Index
func samePosition(in: String.UTF16View) -> String.UTF16View.Index
}
The result is a great deal of API surface area for apparently little gain in ordinary code, which normally only interchanges indices among views when the positions match up exactly (i.e. when the conversion is going to succeed). Also, the resulting code is needlessly awkward.
Finally, the opacity of these index types makes it difficult to record String or Substring positions in files or other archival forms, and reconstruct the original positions with respect to a deserialized String or Substring.
Proposed solution
All String views will use a single index type (String.Index), so that positions can be interchanged without awkward explicit conversions:
let html: String = "See <a href=\"http://swift.org\">swift.org</a>"
// Search the UTF16, instead of characters, for performance reasons:
let open = "<".utf16.first!, close = ">".utf16.first!
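// index(of:) returns an optional; force-unwrapping is safe here because
// both delimiters are known to be present in html.
let tagStart = html.utf16.index(of: open)!
let tagEnd = html.utf16[tagStart...].index(of: close)!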
// Slice the String with the UTF-16 indices to retrieve the tag.
let tag = html[tagStart...tagEnd]
A property and an initializer will be added to String.Index, exposing the offset of the index in code units (currently only UTF-16) from the beginning of the string:
let n: Int = html.endIndex.encodedOffset
let end = String.Index(encodedOffset: n)
assert(end == html.endIndex)
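The round trip through an integer offset can be sketched as follows. This is a minimal illustration, not part of the proposal: the names text, dot, savedOffset, and restored are hypothetical, and firstIndex(of:) is the Swift 4.2 and later spelling of index(of:). It assumes a toolchain where encodedOffset is still available (later Swift releases deprecate it), and it uses an ASCII-only string so the offset does not depend on the string's underlying encoding.

```swift
// Sketch: persist a position as a plain Int and rebuild it later.
// ASCII-only text keeps the offset identical whatever the underlying
// code-unit encoding is (an assumption of this example).
let text = "swift.org"
let dot = text.utf16.firstIndex(of: 0x2E)!   // position of "."

// Serialize: an Int can be written to a file or archive.
let savedOffset: Int = dot.encodedOffset

// Deserialize: rebuild the index from the stored offset.
let restored = String.Index(encodedOffset: savedOffset)
let domainSuffix = String(text[restored...])

print(savedOffset)    // 5
print(domainSuffix)   // ".org"
```

Storing only the integer keeps the archival format independent of String.Index's in-memory representation, which is the use case the Motivation section describes.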
Comparison and Subscript Semantics
When two indices being compared correspond to positions that are valid in any single String view, comparison semantics are already fully specified by the Collection requirements. The other cases occur when indices fall between Unicode scalar boundaries in views having distinct encodings. For example, the string "\u{1f773}" (“🝳”) is encoded as 0xD83D, 0xDF73 in UTF-16 and 0xF0, 0x9F, 0x9D, 0xB3 in UTF-8, and there is no obvious way to compare the second positions in each of those sequences. The proposed rule is that such indices are compared by comparing their encodedOffsets. Such index values are not totally ordered but do satisfy strict weak ordering requirements, which is sufficient for algorithms such as sort to exhibit sensible behavior. We might consider loosening the specified requirements on these algorithms and on Comparable to support strict weak ordering, but for now we can treat such index pairs as being formally outside the domain of comparison, like any other indices from completely distinct collections.
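The encodings quoted above can be checked directly. The sketch below uses hypothetical names (alchemical, midUTF16, midUTF8) and shows only that the second position in each view lies inside the scalar. It deliberately does not compare the two interior indices with each other, since that is the case the proposal places outside the domain of comparison.

```swift
let alchemical = "\u{1F773}"

let utf16Units = Array(alchemical.utf16)   // [0xD83D, 0xDF73]
let utf8Units = Array(alchemical.utf8)     // [0xF0, 0x9F, 0x9D, 0xB3]

// The second UTF-16 position sits between the two surrogates...
let midUTF16 = alchemical.utf16.index(after: alchemical.utf16.startIndex)
// ...while the second UTF-8 position sits after the first of four bytes.
let midUTF8 = alchemical.utf8.index(after: alchemical.utf8.startIndex)

// Neither interior position is a Unicode scalar boundary, so converting
// either one to the unicodeScalars view fails.
print(midUTF16.samePosition(in: alchemical.unicodeScalars) == nil)   // true
print(midUTF8.samePosition(in: alchemical.unicodeScalars) == nil)    // true
```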
With respect to subscripts, an index that does not fall on an exact boundary in a given String or Substring view will be treated as falling at its encodedOffset in the underlying code units, with the actual contents of the result being an emergent property of applying the usual Unicode rules for decoding those code units. For example, when slicing a string with an index i that falls between two Character boundaries, i.encodedOffset is treated as a position in the string's underlying code units, and the Characters of the result are determined by performing standard Unicode grapheme breaking on the resulting sequence of code units.
let s = "e\u{301}galite\u{301}" // "égalité"
let i = Array(s.unicodeScalars.indices)
print(s[i[1]...]) // "◌́galité"
print(s[..<i.last!]) // "égalite"
print(s[i[1]]) // "◌́"
Similarly, assignment to a slice of a string is performed by replacing the corresponding code units, and again the resulting Characters are determined by re-applying standard grapheme breaking rules.
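The effect of re-applying grapheme breaking after a mutation can be sketched through the unicodeScalars view, whose replaceSubrange works at scalar granularity. This is an illustrative assumption rather than the proposal's own example: it edits scalars instead of assigning to a String slice, and the names word and scalarIndices are hypothetical.

```swift
var word = "e\u{301}galite\u{301}"             // "égalité"
let scalarIndices = Array(word.unicodeScalars.indices)

let charactersBefore = word.count              // 7
let firstBefore = word.first!                  // "e" plus combining accent

// Remove only the combining accent that follows the first "e".
word.unicodeScalars.replaceSubrange(
    scalarIndices[1]..<scalarIndices[2], with: "".unicodeScalars)

// Grapheme breaking is re-applied to the edited scalars: the first
// Character is now a bare "e", and the final "é" is unaffected.
let charactersAfter = word.count               // 7
let firstAfter = word.first!                   // "e"
```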
Replacing the failable APIs listed above that detect whether an index represents a valid position in a given view, and enhancements that explicitly round index positions to nearby boundaries in a given view, are left to a later proposal. For now, we do not propose to remove the existing index conversion APIs.
Detailed design
String.Index acquires an encodedOffset property and initializer:
public extension String.Index {
/// Creates a position corresponding to the given offset in a
/// `String`'s underlying (UTF-16) code units.
init(encodedOffset: Int)
/// The position of this index expressed as an offset from the
/// beginning of the `String`'s underlying (UTF-16) code units.
var encodedOffset: Int
}
Index types of String.UTF8View, String.UTF16View, and String.UnicodeScalarView are replaced by String.Index:
public extension String.UTF8View {
typealias Index = String.Index
}
public extension String.UTF16View {
typealias Index = String.Index
}
public extension String.UnicodeScalarView {
typealias Index = String.Index
}
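A short sketch of what the shared index type allows in practice. The names site and fromUTF16 are hypothetical, and firstIndex(of:) is the Swift 4.2 and later spelling of index(of:). The string is ASCII-only, so the position found in the UTF-16 view is also a valid boundary in every other view.

```swift
let site = "swift.org"

// Every view's Index is the same type, so one variable can hold any of them.
let fromUTF16: String.Index = site.utf16.firstIndex(of: 0x2E)!   // "."
let sameType = String.UTF8View.Index.self == String.Index.self
    && String.UTF16View.Index.self == String.Index.self
    && String.UnicodeScalarView.Index.self == String.Index.self

// An index found in one view can subscript another view directly.
let byte = site.utf8[fromUTF16]               // 0x2E
let scalar = site.unicodeScalars[fromUTF16]   // "."
let rest = site[fromUTF16...]                 // ".org"
```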
Because the index types are collapsing, index conversion methods and initializers are reduced to the following:
public extension String.Index {
init?(_: String.Index, within: String)
init?(_: String.Index, within: String.UTF8View)
init?(_: String.Index, within: String.UTF16View)
init?(_: String.Index, within: String.UnicodeScalarView)
func samePosition(in: String) -> String.Index?
func samePosition(in: String.UTF8View) -> String.Index?
func samePosition(in: String.UTF16View) -> String.Index?
func samePosition(in: String.UnicodeScalarView) -> String.Index?
}
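The following sketch shows when the remaining conversions succeed and when they return nil. The names cafe and accent are hypothetical. The example assumes only the initializers and samePosition(in:) methods listed above.

```swift
let cafe = "cafe\u{301}"   // "café", spelled with a combining acute accent

// The accent is its own Unicode scalar but not its own Character.
let accent = cafe.unicodeScalars.index(before: cafe.unicodeScalars.endIndex)

// A scalar boundary is also a valid position in the code-unit views...
let inUTF8 = accent.samePosition(in: cafe.utf8)
let inUTF16 = String.Index(accent, within: cafe.utf16)

// ...but it falls inside a Character, so converting to the String fails.
let inString = accent.samePosition(in: cafe)

print(inUTF8 != nil, inUTF16 != nil, inString == nil)   // true true true
```

Because the source and target index types are now identical, a successful conversion returns the same position. The optional result serves only to report whether that position is valid in the target view.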
Source compatibility
Because of the collapse of index types, existing non-failable APIs become failable. To avoid breaking Swift 3 code, the following overloads of existing functions are added, allowing the resulting optional indices to be used where previously non-optional indices were used. These overloads were driven by making the new APIs work with existing code, including the Swift source compatibility test suite, and should be viewed as migration aids only, rather than additions to the Swift 3 API.
extension Optional where Wrapped == String.Index {
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional indices")
public static func ..<(
lhs: String.Index?, rhs: String.Index?
) -> Range<String.Index> {
return lhs! ..< rhs!
}
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional indices")
public static func ...(
lhs: String.Index?, rhs: String.Index?
) -> ClosedRange<String.Index> {
return lhs! ... rhs!
}
}
// backward compatibility for index interchange.
extension String.UTF16View {
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional index")
public func index(after i: Index?) -> Index {
return index(after: i!)
}
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional index")
public func index(
_ i: Index?, offsetBy n: IndexDistance) -> Index {
return index(i!, offsetBy: n)
}
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional indices")
public func distance(from i: Index?, to j: Index?) -> IndexDistance {
return distance(from: i!, to: j!)
}
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional index")
public subscript(i: Index?) -> Unicode.UTF16.CodeUnit {
return self[i!]
}
}
extension String.UTF8View {
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional index")
public func index(after i: Index?) -> Index {
return index(after: i!)
}
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional index")
public func index(_ i: Index?, offsetBy n: IndexDistance) -> Index {
return index(i!, offsetBy: n)
}
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional indices")
public func distance(
from i: Index?, to j: Index?) -> IndexDistance {
return distance(from: i!, to: j!)
}
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional index")
public subscript(i: Index?) -> Unicode.UTF8.CodeUnit {
return self[i!]
}
}
// backward compatibility for index interchange.
extension String.UnicodeScalarView {
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional index")
public func index(after i: Index?) -> Index {
return index(after: i!)
}
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional index")
public func index(_ i: Index?, offsetBy n: IndexDistance) -> Index {
return index(i!, offsetBy: n)
}
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional indices")
public func distance(from i: Index?, to j: Index?) -> IndexDistance {
return distance(from: i!, to: j!)
}
@available(
swift, deprecated: 3.2, obsoleted: 4.0,
message: "Any String view index conversion can fail in Swift 4; please unwrap the optional index")
public subscript(i: Index?) -> Unicode.Scalar {
return self[i!]
}
}
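The deprecation messages above ask callers to unwrap the optional indices themselves. One way to do that, sketched with a hypothetical helper utf16Distance(in:from:to:) that is not part of the proposal:

```swift
// Hypothetical helper: the number of UTF-16 code units between two
// positions of `s`, or nil if either position has no exact counterpart
// in the UTF-16 view.
func utf16Distance(
    in s: String, from a: String.Index, to b: String.Index
) -> Int? {
    // Each conversion may fail, so unwrap both results before use
    // instead of relying on the deprecated optional-taking overloads.
    guard let start = a.samePosition(in: s.utf16),
          let end = b.samePosition(in: s.utf16)
    else { return nil }
    return s.utf16.distance(from: start, to: end)
}

let greeting = "hi \u{1F773}!"
let width = utf16Distance(
    in: greeting, from: greeting.startIndex, to: greeting.endIndex)
print(width as Any)   // Optional(6)
```

Returning an optional pushes the failure case to the caller, whereas the migration overloads force-unwrap and trap on a failed conversion.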
- Q: Will existing correct Swift 3 applications stop compiling due to this change?
  A: It is possible but unlikely. The existing index conversion APIs are relatively rarely used, and the overloads listed above handle the common cases in Swift 3 compatibility mode.
- Q: Will applications still compile but produce different behavior than they used to?
  A: No.
- Q: Is it possible to automatically migrate from the old syntax to the new syntax?
  A: Yes, although usages of these APIs may be rare enough that it isn't worth the trouble.
- Q: Can Swift applications be written in a common subset that works both with Swift 3 and Swift 4 to aid in migration?
  A: Yes, the Swift 4 APIs will all be available in Swift 3 mode.
Effect on ABI stability
This proposal changes the ABI of the standard library.
Effect on API resilience
This proposal makes no changes to the resilience of any APIs.
Alternatives considered
The only alternative considered was no action.