[se-0405] Implement API additions #68419

glessard · 2023-09-09T22:56:03Z

API additions (and tests) from SE-0405, namely:

String.init?<Encoding>(
    validating codeUnits: some Sequence<Encoding.CodeUnit>,
    as encoding: Encoding.Type
  ) where Encoding: Unicode.Encoding

String.init?<Encoding>(
    validating codeUnits: some Sequence<Int8>,
    as encoding: Encoding.Type
  ) where Encoding: Unicode.Encoding, Encoding.CodeUnit == UInt8
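
As a rough usage sketch of the two initializers above (assuming a toolchain and OS whose standard library already ships SE-0405; the variable names and byte values are illustrative only):

```swift
// "Café" encoded as UTF-8 code units.
let validUTF8: [UInt8] = [0x43, 0x61, 0x66, 0xC3, 0xA9]
let valid = String(validating: validUTF8, as: UTF8.self)
print(valid ?? "nil")

// Truncated input: the two-byte sequence for "é" is cut short,
// so validation fails and the initializer returns nil.
let truncatedUTF8: [UInt8] = [0x43, 0x61, 0x66, 0xC3]
let invalid = String(validating: truncatedUTF8, as: UTF8.self)
print(invalid == nil)

// The second overload accepts Int8 code units (as in C strings)
// for encodings whose CodeUnit is UInt8.
let signedUnits: [Int8] = [67, 97, 102, -61, -87]
let fromSigned = String(validating: signedUnits, as: UTF8.self)
print(fromSigned ?? "nil")
```

Unlike String(decoding:as:), which repairs invalid input, these initializers fail on it.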

The API renaming is in #68423.

rdar://114999766

glessard · 2023-09-09T22:56:14Z

@swift-ci please test

glessard · 2023-09-10T00:30:13Z

@swift-ci please smoke test linux platform

glessard · 2023-09-10T04:53:22Z

@swift-ci please test linux platform

stdlib/public/core/String.swift

glessard · 2023-09-10T18:27:14Z

@swift-ci please smoke test

karwa

I think this needs another look - particularly when it comes to the stack buffer.

karwa · 2023-09-11T10:23:17Z

stdlib/public/core/String.swift

This allocates an Array (which will likely require resizing as transcoded code-units are appended to it -- the ._validate function used by the contiguous path estimates 3-4x as many bytes of UTF8 out as code-units in), then allocates separate String storage and writes the result to it.

Have you tried performing a dry-run of the transcoding and measuring the required capacity, then allocating String storage and writing directly to it?

Ohhh we can't do a dry run because this takes a Sequence, not a Collection 😔.

But it's pointless - we're not going to do this in a single pass; we're just going to copy internally anyway. This API should've taken a Collection, IMO.

We can dispatch based on whether or not the type is a collection, but the optimiser does a poor job specialising it (#62264). I'd still suggest we do that, though, and let the optimiser catch up. The existential wrapping and unwrapping overhead is likely less than allocating and copying everything to an array, and one day the compiler will just eliminate that overhead entirely.

There is quite a bit of improvement to be done to the internal transcoding machinery, and this slowest path will then be adapted to use that. It would be better to allocate a string buffer directly and resize that, but it is not possible at the moment.

My understanding is that it is fairly common in C APIs (e.g. wcsrtombs, WideCharToMultiByte, etc.) to perform a dry run by specifying the output buffer as NULL, getting the length, and then converting into an appropriately-sized buffer. That's why I suggested it.

Resizing is less than ideal because it's quadratic. Array mitigates this somewhat with an over-allocation strategy that scales geometrically, but that is also wasteful, and I don't think String employs the same strategy (?).

For Sequence, where we can only make one pass, we unfortunately have to transcode into some kind of resizing buffer. For Collection, we can just do two passes, guaranteeing no resizing and a lower memory high-water mark.
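
To make the two-pass idea concrete, here is a minimal sketch using only public API (the free function transcode and String(unsafeUninitializedCapacity:initializingUTF8With:)). The function name makeString(twoPassValidating:as:) is hypothetical, and this is not how the standard library implements the initializer; it only illustrates measuring first and writing second for a Collection.

```swift
// Hypothetical helper: validate and transcode a Collection in two passes.
func makeString<C: Collection, E: Unicode.Encoding>(
  twoPassValidating codeUnits: C, as encoding: E.Type
) -> String? where C.Element == E.CodeUnit {
  // Pass 1: dry run. Count the UTF-8 code units and detect errors.
  var utf8Count = 0
  let hadError = transcode(
    codeUnits.makeIterator(), from: encoding, to: UTF8.self,
    stoppingOnError: true, into: { _ in utf8Count += 1 })
  if hadError { return nil }

  // Pass 2: write directly into string storage of the measured size.
  return String(unsafeUninitializedCapacity: utf8Count) { buffer in
    var written = 0
    _ = transcode(
      codeUnits.makeIterator(), from: encoding, to: UTF8.self,
      stoppingOnError: true, into: { unit in
        // Guard against an encoding that behaves differently on the second pass.
        precondition(written < buffer.count, "encoder produced more output than measured")
        buffer[written] = unit
        written += 1
      })
    return written
  }
}

let utf16Units: [UInt16] = [0x43, 0x61, 0x66, 0xE9]
print(makeString(twoPassValidating: utf16Units, as: UTF16.self) ?? "nil")
```

The trade-off is decoding the input twice in exchange for a single, exactly-sized allocation; whether that is a net win would need benchmarking.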

karwa · 2023-09-11T10:30:32Z

stdlib/public/core/String.swift

I wonder if it's worth checking this later (after writing). Checking whether contiguous UTF8 is ASCII is super-cheap.

Checking after writing would potentially mean re-loading parts of a large string buffer that has already been expunged from cache, and that wouldn't be cheap. A chunked transcoding API would be much better.

Well, modern processors have several megabytes of cache -- far larger than most UTF8 strings (and if your string is several megabytes, this isn't going to be significant either way). The ASCII check on contiguous data can process 8 bytes in a single instruction (more if we SIMDize it), so I figured it may (or may not; the only way to know is to test it) be better to keep the transcoding loop tighter and perform a separate pass for this analysis.

Chunked transcoding might be a good middle-ground, though. That's a fair point.
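
For illustration, a separate ASCII pass over contiguous bytes that checks 8 bytes per iteration could look roughly like the following. The function name isAllASCII is made up for this sketch and it is not the standard library's internal _allASCII; it assumes Swift 5.7 or later for loadUnaligned.

```swift
// Hypothetical sketch: test contiguous bytes for ASCII, one 64-bit word at a time.
func isAllASCII(_ bytes: UnsafeRawBufferPointer) -> Bool {
  // A byte is non-ASCII exactly when its high bit is set.
  let highBits: UInt64 = 0x8080_8080_8080_8080
  var offset = 0
  // Main loop: 8 bytes per iteration.
  while offset + 8 <= bytes.count {
    let word = bytes.loadUnaligned(fromByteOffset: offset, as: UInt64.self)
    if word & highBits != 0 { return false }
    offset += 8
  }
  // Tail: remaining 0 to 7 bytes, checked individually.
  while offset < bytes.count {
    if bytes[offset] & 0x80 != 0 { return false }
    offset += 1
  }
  return true
}

print(Array("hello, world!".utf8).withUnsafeBytes(isAllASCII))
```

Whether this beats tracking ASCII-ness inside the transcoding loop depends on input size and cache behaviour, which is the open question in the comments above.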

karwa · 2023-09-11T10:40:02Z

stdlib/public/core/String.swift

This is difficult to read. Might I suggest:

let contiguousResult: String?? = ...
switch contiguousResult {
case .some(.some(let newString)):
  self = newString
  return
case .some(.none):
  return nil
default:
  break // source is non-contiguous.
}

I don't find the switch alternative to be a particularly easy read either. I tried to improve readability by using fastidiously named bindings.
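
For readers unfamiliar with where the double optional comes from: withContiguousStorageIfAvailable wraps the closure's result in an extra Optional, so a closure that itself returns String? yields String??. A self-contained sketch follows; the describe function and its ASCII-only "validation" are placeholders, not the PR's code.

```swift
// Hypothetical demo of the three states of a String?? result.
func describe<S: Sequence>(_ codeUnits: S) -> String where S.Element == UInt8 {
  // Outer nil: no contiguous storage. Inner nil: validation failed.
  let contiguousResult: String?? = codeUnits.withContiguousStorageIfAvailable { buffer in
    // Stand-in validation: accept ASCII only.
    buffer.allSatisfy { $0 < 0x80 } ? String(decoding: buffer, as: UTF8.self) : nil
  }
  switch contiguousResult {
  case .some(.some(let newString)):
    return "contiguous, valid: \(newString)"
  case .some(.none):
    return "contiguous, invalid"
  case .none:
    return "non-contiguous"
  }
}

print(describe([0x68, 0x69] as [UInt8]))
print(describe([0xFF] as [UInt8]))
print(describe(UInt8(0x68)...UInt8(0x69)))
```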

karwa · 2023-09-11T11:19:39Z

stdlib/public/core/StringCreate.swift

This is a generic function - transcoding will ultimately be performed by a user-defined type, and if that type were to produce more than 4 bytes of UTF8 for its custom code-units, this would over-write the buffer and corrupt the stack.

Toy example, showing that the Unicode.Encoding API allows this. This encoding is UTF8, except that every time the real UTF8 would parse a single scalar from its code-units, this encoding parses 5 repetitions of that scalar:

https://gist.github.com/karwa/ece9cdcf8c66613fdeea85bcd8b2cea8

Good point. We should bounds-check here. We should write directly into a string buffer, reallocating when appropriate.
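
A minimal sketch of the kind of bounds check being discussed, assuming a fixed 4-byte scratch buffer. The names checkedWrite and writeToScratch are hypothetical and do not appear in the PR.

```swift
// Hypothetical helper: copy encoded bytes into a scratch buffer only if they fit.
func checkedWrite<C: Collection>(
  _ encoded: C, into scratch: UnsafeMutableBufferPointer<UInt8>
) -> Int? where C.Element == UInt8 {
  // A user-defined encoding may yield more bytes than UTF-8's 4-byte maximum.
  guard encoded.count <= scratch.count else { return nil }
  var index = 0
  for byte in encoded {
    scratch[index] = byte
    index += 1
  }
  return index
}

// Returns the number of bytes written, or nil if the output would overflow.
func writeToScratch<C: Collection>(_ encoded: C) -> Int? where C.Element == UInt8 {
  withUnsafeTemporaryAllocation(of: UInt8.self, capacity: 4) { scratch in
    checkedWrite(encoded, into: scratch)
  }
}

// A well-behaved encoder: "é" is 2 bytes of UTF-8.
print(writeToScratch(UTF8.encode("\u{E9}")!) as Any)
// A misbehaving encoder producing 5 bytes is rejected instead of overrunning the stack.
print(writeToScratch([1, 2, 3, 4, 5] as [UInt8]) as Any)
```

In the real slow path the failure case would presumably trigger a reallocation rather than returning nil, as the reply above suggests.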

karwa · 2023-09-11T11:31:48Z

stdlib/public/core/StringCreate.swift

This is the only part that benefits from contiguous storage, because validateUTF8 and _allASCII require an unsafe buffer pointer. The second part of the function (the slow path, which transcodes) doesn't need contiguous storage at all, and is basically duplicated code from the initialiser.

I would suggest splitting this into separate functions, and making the call to the contiguous UTF8/ASCII one directly from the initialiser. Since the initialiser is inlinable, I'd expect the compiler could specialise this to a direct call to the UTF8/ASCII fast-path in most cases.

The comment that I made above about performing a dry run and writing directly to string storage would then apply to the second function (the slow path), and we could remove the transcoding in the initialiser.

Sorry for the delay, I forgot to respond to this earlier. It is superficially true that the two slow paths are similar at this time, but we have intentionally made the _validate function non-inlinable in order to have room to improve it using API that is not yet public. In particular, the current public API doesn't allow for chunked decoding, which would be a significant speedup in a generic context because it amortizes the function call overhead. Getting there is future work.

glessard · 2023-12-21T06:41:12Z

@swift-ci please test

Co-authored-by: Ben Rimmington <me@benrimmington.com>

glessard · 2023-12-21T20:02:37Z

Rebased to pick up the new ABI checker

glessard · 2023-12-22T18:48:58Z

@swift-ci please test

test/abi/macOS/arm64/stdlib-asserts.swift

glessard · 2024-01-03T22:25:41Z

@swift-ci please test

benrimmington · 2024-01-04T01:02:42Z

stdlib/public/core/String.swift

When printing an optional, or when debugPrinting, the \0 character will be output:

"Ca\0fé"

All examples will have a compiler warning:

Expression implicitly coerced from 'String?' to 'Any'

The examples don't intend to debugPrint, but it slipped my mind that printing through Optional would do that. Perhaps I should change them to something like print(valid ?? "nil")?

print(valid == "Ca\0fé") // Prints "true"

print(invalid == nil) // Prints "true"
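
A small sketch of the difference being described, using a plain optional string rather than the new initializer so that it runs on any toolchain:

```swift
let valid: String? = "Ca\0f\u{E9}"

// Printing through Optional uses the debug representation of the payload,
// so the NUL character is shown as an escape sequence inside quotes.
let throughOptional = String(describing: valid)
print(throughOptional)

// Unwrapping first prints the string itself, embedded NUL included.
print(valid ?? "nil")

// Comparing avoids both the escaping and the 'String?' to 'Any' warning.
print(valid == "Ca\0f\u{E9}")
```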

stdlib/public/core/StringCreate.swift

glessard · 2024-01-04T19:17:52Z

@swift-ci please smoke test

glessard · 2024-01-08T18:58:23Z

@swift-ci please test

milseman

Doesn't have to hold up this PR, but it would make sense to add some benchmarks so that if we can improve the code in the future (e.g. skipping an intermediary allocation) we'd see the impact.

milseman · 2024-01-10T20:14:31Z

stdlib/public/core/StringStorage.swift

At one point in time we were worried about the heap size of these objects. Does this change it? If so, would it make sense to put in _countAndFlags?

I didn't consider changing _countAndFlags, that is a good thought. As is, 1 byte is added to the class's stored properties, so it goes from 32 to 33 bytes on 64-bit platforms.
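
As an analogy for the size concern, here is a sketch of folding a one-bit flag into spare high bits of a count word instead of storing a separate Bool. The bit positions and the names PackedCountAndFlags and ownsPointer are assumptions for illustration; the real _countAndFlags layout may differ.

```swift
// Hypothetical layout: low 48 bits hold the count, bit 63 holds a flag.
struct PackedCountAndFlags {
  private var raw: UInt64 = 0
  private static let countMask: UInt64 = (1 << 48) - 1
  private static let ownsPointerBit: UInt64 = 1 << 63

  var count: Int {
    get { Int(raw & Self.countMask) }
    set { raw = (raw & ~Self.countMask) | (UInt64(newValue) & Self.countMask) }
  }

  var ownsPointer: Bool {
    get { raw & Self.ownsPointerBit != 0 }
    set {
      if newValue { raw |= Self.ownsPointerBit } else { raw &= ~Self.ownsPointerBit }
    }
  }
}

// For comparison: a separate Bool after four 8-byte fields adds a 33rd byte.
struct SeparateFlag {
  var a: UInt64, b: UInt64, c: UInt64, d: UInt64
  var ownsPointer: Bool
}

var packed = PackedCountAndFlags()
packed.count = 5
packed.ownsPointer = true
print(packed.count, packed.ownsPointer, MemoryLayout<PackedCountAndFlags>.size)
print(MemoryLayout<SeparateFlag>.size, MemoryLayout<SeparateFlag>.stride)
```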

milseman · 2024-01-10T20:17:44Z

stdlib/public/core/StringCreate.swift

Would it make sense to allocate a normal __StringStorage instance of appropriate size and write into that?

We might want to do that as a next step. The __StringStorage class doesn't currently have anything resembling an appropriate initializer, though.

Was off by one.

glessard · 2024-01-17T01:22:19Z

@swift-ci please test

glessard mentioned this pull request Sep 9, 2023

[se-0405] Implement API additions #68418

Closed

glessard added the "swift evolution approved" and "swift evolution implemented" labels Sep 9, 2023

glessard requested review from allevato, lorentey, milseman and stephentyrone September 10, 2023 04:53

benrimmington reviewed Sep 10, 2023

stdlib/public/core/String.swift (three outdated, resolved comments)

glessard mentioned this pull request Sep 11, 2023

[se-0405] rename String.init(validatingUTF8:) #68423

Merged

karwa requested changes Sep 11, 2023

glessard marked this pull request as draft September 11, 2023 23:48

glessard force-pushed the se0405-part1 branch from bbe0524 to db516fc December 21, 2023 02:34

glessard and others added 8 commits December 21, 2023 10:44

[se-0405] adapt implementation from staging package

f700688

[test] se-0405 input-validating String initializers

7648210

Apply suggestions from code review

92df9b4

Co-authored-by: Ben Rimmington <me@benrimmington.com>

[se-0405] update availability to a realistic release target

566fbf4

[stdlib] make __SharedStringStorage able to own a pointer

148a7e2

[se-0405] improve fast path

0ba58de

[se-0405] improve slow path

4617553

[se-0405] improve readability of double-optional unwrapping

b869a3c

glessard force-pushed the se0405-part1 branch from db516fc to c36b79a December 21, 2023 19:58

glessard force-pushed the se0405-part1 branch 2 times, most recently from 49d6ed2 to 7a60811 December 22, 2023 18:48

glessard marked this pull request as ready for review December 23, 2023 01:35

glessard requested a review from a team as a code owner December 23, 2023 01:35

Azoy reviewed Dec 24, 2023

test/abi/macOS/arm64/stdlib-asserts.swift (outdated, resolved)

glessard force-pushed the se0405-part1 branch 3 times, most recently from d2244d7 to 80961df January 3, 2024 22:25

glessard requested review from benrimmington and karwa January 3, 2024 23:13

benrimmington reviewed Jan 4, 2024

glessard commented Jan 4, 2024

stdlib/public/core/StringCreate.swift (outdated, resolved)

milseman approved these changes Jan 10, 2024

glessard and others added 4 commits January 10, 2024 14:32

[test] round out testing for String.init?(validating:as:)

fa9c80a

Update stdlib/public/core/StringCreate.swift

98273aa

Was off by one.

[se-0405] improve examples in documentation

8d0991f

[abi] additions from SE-0405

ac47533

glessard force-pushed the se0405-part1 branch from 08b750f to ac47533 January 10, 2024 22:34

glessard merged commit d8c809c into swiftlang:main Jan 22, 2024

glessard deleted the se0405-part1 branch January 22, 2024 17:27
