[SR-7602] UTF8 should be (one of) the fastest String encoding(s) #50144

Closed
weissi opened this issue May 4, 2018 · 22 comments

Comments

@weissi
Member

@weissi weissi commented May 4, 2018

Previous ID SR-7602
Radar None
Original Reporter @weissi
Type Bug
Status Resolved
Resolution Done
Additional Detail from JIRA
Votes 26
Component/s Standard Library
Labels Bug, AffectsABI
Assignee @milseman
Priority Medium

md5: f681e7f0741f98e436f811971add77c3

Sub-Tasks:

  • SR-7725 [String] New validity model

Issue Description:

I believe that there is really only one (and a half) encoding that matters today: UTF8 (and its subset ASCII).
Therefore it's important that Swift's fastest String encoding is UTF8.

From what I can tell today the fastest String encodings are UTF16 and ASCII. Everything else will have worse performance.

This also seems to be ABI-relevant, so AFAIK this needs to be fixed very soon.

Requirements:

  1. being able to copy UTF-8 encoded bytes from a String into a pre-allocated raw buffer must be allocation-free and as fast as memcpy can copy them

  2. creating a String from UTF-8 encoded bytes should just validate the encoding and store the bytes as they are

  3. slightly softer but still very strong requirement: currently (even with ASCII) only the stdlib seems to be able to get a pointer to the contiguous ASCII representation (if it is stored in that form at all). That works fine if you just want to copy the bytes (`UnsafeMutableBufferPointer(start: destinationStart, count: destinationLength).initialize(from: string.utf8)`, which will use memcpy if the string is in ASCII representation), but it doesn't allow you to implement your own algorithms that are only performant on a contiguously stored [UInt8]
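The three requirements above can be illustrated with a small sketch. It assumes a toolchain newer than the one this issue was filed against (Swift 5.1 or later), where `String.withUTF8` is available; the helper names `copyUTF8(of:into:)` and `checksum(of:)` are hypothetical and exist only for this example.

```swift
// Requirement 1: copy a String's UTF-8 bytes into a pre-allocated buffer.
// Returns the number of bytes written. (Hypothetical helper.)
func copyUTF8(of string: String, into destination: UnsafeMutableBufferPointer<UInt8>) -> Int {
    precondition(destination.count >= string.utf8.count, "destination too small")
    let (_, end) = destination.initialize(from: string.utf8)
    return end
}

// Requirement 3: run a custom algorithm over the contiguous UTF-8 bytes.
// `withUTF8` is mutating, so we work on a local copy. (Hypothetical helper.)
func checksum(of string: String) -> Int {
    var copy = string
    return copy.withUTF8 { (bytes: UnsafeBufferPointer<UInt8>) -> Int in
        bytes.reduce(0) { $0 &+ Int($1) }
    }
}

let greeting = "h\u{E9}llo" // "é" takes two bytes in UTF-8
let destination = UnsafeMutableBufferPointer<UInt8>.allocate(capacity: greeting.utf8.count)
let written = copyUTF8(of: greeting, into: destination)

// Requirement 2: create a String from UTF-8 bytes.
// Note: this initializer repairs invalid input instead of rejecting it.
let roundTripped = String(decoding: destination[..<written], as: UTF8.self)
destination.deallocate()
```

Whether the copy compiles down to a single memcpy depends on the string's storage representation and the standard library version, so this shows the shape of the API rather than a performance guarantee.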

@swift-ci
Collaborator

@swift-ci swift-ci commented May 4, 2018

Comment by tanner0101 (JIRA)

Huge +1 to this.

To give some additional insights here, `String` has been a major burden for us in developing Vapor (server side Swift framework). At this point it's basically forbidden to use String in any internal code and only our public APIs will accept it. We resort to using things like `[UInt8]` and `UnsafeBufferPointer<UInt8>` internally instead.

Even with our internal optimizations, there is still a lot of friction across modules and for our end users wherever `String`s are used.

If `String` were more performant when dealing with UTF-8, that would greatly improve the speed of our framework and clean up a lot of our internal code.

@swift-ci
Collaborator

@swift-ci swift-ci commented May 4, 2018

Comment by Alex Reilly (JIRA)

+1

@jdmcd

@jdmcd jdmcd commented May 4, 2018

Server side swift needs this big time. Would love to see this added.

@swift-ci
Collaborator

@swift-ci swift-ci commented May 4, 2018

Comment by Francisco Rivas (JIRA)

+1

@swift-ci
Collaborator

@swift-ci swift-ci commented May 5, 2018

Comment by Mikhail Isaev (JIRA)

+1

@swift-ci
Collaborator

@swift-ci swift-ci commented May 5, 2018

Comment by Anthony Castelli (JIRA)

+1

@swift-ci
Collaborator

@swift-ci swift-ci commented May 5, 2018

Comment by Damir Stuhec (JIRA)

+1

@tkrajacic

@tkrajacic tkrajacic commented May 5, 2018

+2

@swift-ci
Collaborator

@swift-ci swift-ci commented May 6, 2018

Comment by Petro Rovenskyy (JIRA)

+1

@swift-ci
Collaborator

@swift-ci swift-ci commented May 6, 2018

Comment by Helge Heß (JIRA)

+1

@swift-ci
Collaborator

@swift-ci swift-ci commented May 6, 2018

Comment by Lucian Boboc (JIRA)

+1

@belkadan
Contributor

@belkadan belkadan commented May 7, 2018

cc @milseman

Aside: please stop with the +1 comments. There's a "Vote for this issue" button right over there.

@milseman
Mannequin

@milseman milseman mannequin commented May 7, 2018

@belkadan +1 😛 . I think this is vital to Swift's long-term health on any platform or domain outside of its current niche (and honestly, even within it).

Thank you for sharing your experience tannernelson (JIRA User). Do you have anything with more detail here, such as how much performance and code bloat is due to this? How does this friction manifest for your users and how could storing UTF-8 without transcoding remove it?

Is anyone else from this thread able to share their experience? Reports like this really help the project to prioritize effectively and push for the right thing.

(As for the fear concerning ABI stability, it's a little complicated and there are degrees to which we can reserve the ability to support this in the future.)

@swift-ci
Collaborator

@swift-ci swift-ci commented May 8, 2018

Comment by tanner0101 (JIRA)

@milseman Virtually all of it comes down to `String(data: myData, encoding: .utf8)` and `myString.data(using: .utf8)`.

When parsing protocols such as HTTP, Redis, MySQL, PostgreSQL, etc., we read data from the OS into an `UnsafeBufferPointer<UInt8>`. This is almost always via NIO's [`ByteBuffer`](https://apple.github.io/swift-nio/docs/current/NIO/Structs/ByteBuffer.html) type. We sometimes grab a `String` from that directly, or grab `Data` if we want to iterate over the bytes for additional parsing. [Here is an example of common byte buffer usage](https://github.com/vapor/mysql/blob/master/Sources/MySQL/Protocol/MySQLBinaryResultsetRow.swift#L39-L40).

In other words, from `UnsafePointer<UInt8>` we commonly read `FixedWidthInteger`, `BinaryFloatingPoint`, `Data`, and `String`. All are very performant except `String`, which is the concern, since the vast majority of bytes end up being `String`s. Considering the DB use case specifically, the data transferred is usually emails, names, bios, comments, etc. Very few bytes are actually dedicated to binary numbers or data blobs. Strings everywhere.

To summarize, the faster we can get from `Swift.Unsafe...Pointer<UInt8>` or `Foundation.Data` to `String` the better. That will affect (for the better!) quite literally our entire framework.
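As a rough illustration of the access pattern described here, the sketch below reads length-prefixed UTF-8 strings out of a raw byte buffer. The wire format (a one-byte length followed by the payload) and the function name `readLengthPrefixedString(from:at:)` are assumptions made up for this example; they are not taken from Vapor, NIO, or any of the protocols mentioned.

```swift
// Reads a one-byte length followed by that many UTF-8 bytes.
// Returns the decoded string and the offset of the next field,
// or nil if the buffer is too short. (Hypothetical helper and format.)
func readLengthPrefixedString(
    from buffer: UnsafeRawBufferPointer,
    at offset: Int
) -> (value: String, nextOffset: Int)? {
    guard offset >= 0, offset < buffer.count else { return nil }
    let length = Int(buffer[offset])
    let start = offset + 1
    guard start + length <= buffer.count else { return nil }
    // The String initializer copies the bytes into its own storage.
    let value = String(decoding: buffer[start..<start + length], as: UTF8.self)
    return (value, start + length)
}

// Build a small example buffer: [5]"hello"[2]"hi"
var wire: [UInt8] = [5]
wire.append(contentsOf: "hello".utf8)
wire.append(2)
wire.append(contentsOf: "hi".utf8)

let first = wire.withUnsafeBytes { readLengthPrefixedString(from: $0, at: 0) }
let second = wire.withUnsafeBytes {
    readLengthPrefixedString(from: $0, at: first?.nextOffset ?? 0)
}
```

The cost this thread is concerned with sits in the `String(decoding:as:)` call: with UTF-8 native storage it can validate and copy, whereas a UTF-16 backed `String` has to transcode non-ASCII input.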

@weissi
Member Author

@weissi weissi commented May 8, 2018

just to add to tannernelson (JIRA User)'s great comment (thanks!): If the String comes from a ByteBuffer through get/readString(length:) then we'll construct it using the stdlib's String decoding mechanism:

String(decoding: UnsafeBufferPointer(...), as: UTF8.self)

@milseman
Mannequin

@milseman milseman mannequin commented May 8, 2018

How do you manage the lifetimes of the storage? I think that String should also express the ability to share storage, but that is yet to be designed and potentially separable. By default, String should allocate new storage and copy in the bytes.

edit: For `String(data: myData, encoding: .utf8)`, where did you get `myData` from? For `myString.data(using: .utf8)`, where do you typically send the result?

@weissi
Member Author

@weissi weissi commented May 8, 2018

@milseman agreed, that'd be awesome! And also agreed that that's potentially a separate issue. FYI, in the NIOFoundationCompat module we have this for Foundation.Data which is unusably slow for other reasons 😉: https://github.com/apple/swift-nio/blob/master/Sources/NIOFoundationCompat/ByteBuffer-foundation.swift#L73-L78

@milseman
Mannequin

@milseman milseman mannequin commented May 8, 2018

Along the lines of potentially separable issues, what is your validation story? If the stream of bytes contains invalid UTF-8, do you want:

1) The initializer to fail resulting in nil
2) The initializer to fail producing an error
3) The invalid bytes to be replaced with U+FFFD
4) The bytes verbatim, and experience the emergent behavior / unspecified results / security hazard from those bytes.

For reference, I think [Rust's model](https://doc.rust-lang.org/std/string/struct.String.html) is pretty good:

`from_utf8` produces an error explaining why the code units were invalid
`from_utf8_lossy` replaces encoding errors with U+FFFD
`from_utf8_unchecked` takes the bytes verbatim; if there's an encoding error, then memory safety has been violated

I'm not entirely sure if accepting invalid bytes requires voiding memory safety (assuming bounds checking always happens), but it is totally a security hazard if used improperly. We may want to be very cautious about if/how we expose it.

I think that trying to do read-time validation is dubious for UTF-16, and totally bananas for UTF-8.
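To make the validation options concrete, here is a minimal Swift sketch of two of them: a throwing initializer-style function (option 2) and lossy repair with U+FFFD (option 3). `InvalidUTF8Error` and `makeStringValidatingUTF8(_:)` are hypothetical names invented for this example, not standard library API; the lossy path uses the existing `String(decoding:as:)`. The throwing variant decodes scalar by scalar for clarity and is not meant to be fast.

```swift
// Hypothetical error type for this sketch.
struct InvalidUTF8Error: Error {}

// Option 2: fail with an error when the bytes are not valid UTF-8.
// (Hypothetical helper built on the stdlib's UTF8 codec.)
func makeStringValidatingUTF8<C: Collection>(_ bytes: C) throws -> String
where C.Element == UInt8 {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var scalars = String.UnicodeScalarView()
    decoding: while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            scalars.append(scalar)
        case .emptyInput:
            break decoding
        case .error:
            throw InvalidUTF8Error()
        }
    }
    return String(scalars)
}

let valid: [UInt8] = [0x68, 0x69]          // "hi"
let invalid: [UInt8] = [0x68, 0x69, 0xFF]  // 0xFF never appears in valid UTF-8

// Option 3: invalid bytes are replaced with U+FFFD.
let lossy = String(decoding: invalid, as: UTF8.self)

let strict = try? makeStringValidatingUTF8(valid)
let rejected = try? makeStringValidatingUTF8(invalid)
```

Both variants validate at creation time, which matches the point above that deferring validation to read time is not attractive for UTF-8.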

@swift-ci
Collaborator

@swift-ci swift-ci commented May 8, 2018

Comment by tanner0101 (JIRA)

From a high-level user perspective, I would love a throwing variant of the `String(..., encoding:)` initializer and friends. When I see people using Vapor, the failable (nil-returning) one is almost always force-unwrapped (in a context where throwing would be handled much better, I should add).

In terms of copying, I would expect that the String initializer from `Unsafe...Pointer<UInt8>` would copy the bytes into its own storage. And that in turn is how it would operate with NIO's ByteBuffer type. Which seems fine since the buffer is potentially going to get re-used and filled in with new bytes. I forget whether NIO actually does re-use the unsafe pointers, but it's a method I've used before.
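The copy-on-init expectation can be checked with a short sketch: a `String` created from a buffer keeps its contents after the buffer is overwritten and reused. The buffer and variable names are made up for this example.

```swift
// A scratch buffer standing in for a reusable network read buffer.
let scratch = UnsafeMutableBufferPointer<UInt8>.allocate(capacity: 3)
_ = scratch.initialize(from: "abc".utf8)

// The initializer copies the bytes into the String's own storage.
let firstRead = String(decoding: scratch, as: UTF8.self)

// Reuse the buffer for the next "read".
for (index, byte) in "xyz".utf8.enumerated() {
    scratch[index] = byte
}
let secondRead = String(decoding: scratch, as: UTF8.self)
scratch.deallocate()

// firstRead is unaffected by the overwrite and by the deallocation.
```

Sharing storage instead of copying, as discussed for `Data`, would need an explicit lifetime story and is not what this initializer does.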

In terms of what to do initializing from `Data`, it would be great if they could do some intelligent COW sharing of the internal storage to minimize copies, but idk if that's possible.

@swift-ci
Collaborator

@swift-ci swift-ci commented May 8, 2018

Comment by Helge Heß (JIRA)

For the NIOFoundationCompat thing I filed SR-7378.

@milseman
Mannequin

@milseman milseman mannequin commented May 9, 2018

tannernelson (JIRA User) when you or your users do `String(data: myData, encoding: .utf8)`, where did `myData` come from? Similarly, for `myString.data(using: .utf8)`, where does the resulting `Data` go, or what do you do with it?

edit: the reason I ask is that this work becomes much more compelling if we're able to not only skip transcoding overhead, but also eliminate an intermediary allocation.

@milseman
Mannequin

@milseman milseman mannequin commented Nov 5, 2018

It's now the fastest encoding.

https://forums.swift.org/t/string-s-abi-and-utf-8/17676/1
#20315

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
This issue was closed.