Skip to content

Conversation

glessard
Copy link
Contributor

@glessard glessard commented Sep 9, 2023

API additions (and tests) from SE-0405, namely:

String.init?<Encoding>(
    validating codeUnits: some Sequence<Encoding.CodeUnit>,
    as encoding: Encoding.Type
  ) where Encoding: Unicode.Encoding

String.init?<Encoding>(
    validating codeUnits: some Sequence<Int8>,
    as encoding: Encoding.Type
  ) where Encoding: Unicode.Encoding, Encoding.CodeUnit == UInt8

The API renaming is in #68423.

rdar://114999766

@glessard
Copy link
Contributor Author

glessard commented Sep 9, 2023

@swift-ci please test

@glessard glessard added swift evolution approved Flag → feature: A feature that was approved through the Swift evolution process swift evolution implemented Flag → feature: A feature that was approved through the Swift evolution process and implemented labels Sep 9, 2023
@glessard
Copy link
Contributor Author

@swift-ci please smoke test linux platform

@glessard
Copy link
Contributor Author

@swift-ci please test linux platform

@glessard
Copy link
Contributor Author

@swift-ci please smoke test

Copy link
Contributor

@karwa karwa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs another look - particularly when it comes to the stack buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allocates an Array (which will likely require resizing as transcoded code-units are appended to it -- the ._validate function used by the contiguous path estimates 3-4x as many bytes of UTF8 out as code-units in), then allocates separate String storage and writes the result to it.

Have you tried performing a dry-run of the transcoding and measuring the required capacity, then allocating String storage and writing directly to it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh we can't do a dry run because this takes a Sequence, not a Collection 😔.

But it's pointless - we're not going to do this in a single pass; we're just going to copy internally anyway. This API should've taken a Collection, IMO.

We can dispatch based on whether or not the type is a collection, but the optimiser does a poor job specialising it (#62264). I'd still suggest we do that, though, and let the optimiser catch up. The existential wrapping and unwrapping overhead is likely less than allocating and copying everything to an array, and one day the compiler will just eliminate that overhead entirely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is quite a bit of improvement to be done to the internal transcoding machinery, and this slowest path will then be adapted to use that. It would be better to allocate a string buffer directly and resize that, but it is not possible at the moment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that it is fairly common in C APIs (e.g. wcsrtombs, WideCharToMultiByte, etc) to perform a dry run by specifying the output buffer as NULL, getting the length, and converting in to an appropriately-sized buffer. That's why I suggested it.

Resizing is less than ideal because it's quadratic. Array mitigates this somewhat with an over-allocation strategy that scales geometrically, but that is also wasteful, and I don't think String employs the same strategy (?).

For Sequence, where we can only make one pass, we unfortunately have to transcode in to some kind of resizing buffer. For Collection, we can just do two passes, guaranteeing no resizing and lower memory water-mark.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's worth checking this later (after writing). Checking whether contiguous UTF8 is ASCII is super-cheap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking after writing would potentially mean re-loading parts of a large string buffer that has already been expunged from cache, and that wouldn't be cheap. A chunked transcoding API would be much better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, modern processors have several megabytes of cache -- far larger than most UTF8 strings (and if your string is several megabytes, this isn't going to be significant either way). The ASCII check on contiguous data can process 8 bytes in a single instruction (more if we SIMDize it), so I figured it may (or may not; the only way to know is to test it) be better to keep the transcoding loop tighter and perform a separate pass for this analysis.

Chunked transcoding might be a good middle-ground, though. That's a fair point.

Comment on lines 526 to 535
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is difficult to read. Might I suggest:

let contiguousResult: String?? = ...

switch contiguousResult {
case .some(.some(let newString)):
  self = newString
  return
case .some(.none):
  return nil
default:
  break // source is non-contiguous.
}

Copy link
Contributor Author

@glessard glessard Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find the switch alternative to be a particularly easy read either. I tried to improve readability by using fastidiously named bindings.

Comment on lines 325 to 329
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a generic function - transcoding will ultimately be performed by a user-defined type, and if that type were to produce more than 4 bytes of UTF8 for its custom code-units, this would over-write the buffer and corrupt the stack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Toy example, showing that the Unicode.Encoding API allows this. This encoding is UTF8, except that every time the real UTF8 would parse a single scalar from its code-units, this encoding parses 5 repetitions of that scalar:

https://gist.github.com/karwa/ece9cdcf8c66613fdeea85bcd8b2cea8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We should bounds-check here. We should write directly into a string buffer, reallocating when appropriate.

Comment on lines 309 to 317
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only part that benefits from contiguous storage, because validateUTF8 and _allASCII require an unsafe buffer pointer. The second part of the function (the slow path, which transcodes) doesn't need contiguous storage at all, and is basically duplicated code from the initialiser.

I would suggest splitting in to separate functions, and making the call to the contiguous UTF8/ASCII one directly from the initialiser. Since the initialiser is inlinable, I'd expect the compiler could specialise this to a direct call to the UTF8/ASCII fast-path in most cases.

The comment that I made above about performing a dry run and writing directly to string storage would then apply to the second function (the slow path), and we could remove the transcoding in the initialiser.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, I forgot to respond to this earlier. It is superficially true that the two slow paths are similar at this time, but we have intentionally made the _validate function non-inlinable in order to have room to improve it using API that is not yet public. In particular, the current public API doesn't allow for chunked decoding, which would be a significant speedup in a generic context, by significantly amortizing the function call overhead. Getting there from here will be in future work.

@glessard glessard marked this pull request as draft September 11, 2023 23:48
@glessard
Copy link
Contributor Author

@swift-ci please test

@glessard
Copy link
Contributor Author

Rebased to pick up the new ABI checker

@glessard glessard force-pushed the se0405-part1 branch 2 times, most recently from 49d6ed2 to 7a60811 Compare December 22, 2023 18:48
@glessard
Copy link
Contributor Author

@swift-ci please test

@glessard glessard marked this pull request as ready for review December 23, 2023 01:35
@glessard glessard requested a review from a team as a code owner December 23, 2023 01:35
@glessard glessard force-pushed the se0405-part1 branch 3 times, most recently from d2244d7 to 80961df Compare January 3, 2024 22:25
@glessard
Copy link
Contributor Author

glessard commented Jan 3, 2024

@swift-ci please test

Comment on lines 506 to 509
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When printing an optional, or when debugPrinting, the \0 character will be output:

"Ca\0fé"

All examples will have a compiler warning:

Expression implicitly coerced from 'String?' to 'Any'

Copy link
Contributor Author

@glessard glessard Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The examples don't intend to debugPrint, but it slipped my mind that printing through Optional would do that. Perhaps I should change them to something like print(valid ?? "nil")?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print(valid == "Ca\0")
// Prints "true"
print(invalid == nil)
// Prints "true"

@glessard
Copy link
Contributor Author

glessard commented Jan 4, 2024

@swift-ci please smoke test

@glessard
Copy link
Contributor Author

glessard commented Jan 8, 2024

@swift-ci please test

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't have to hold up this PR, but it would make sense to add some benchmarks so that if we can improve the code in the future (e.g. skipping an intermediary allocation) we'd see the impact.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At one point in time we were worried about the heap size of these objects. Does this change it? If so, would it make sense to put in _countAndFlags?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't consider changing _countAndFlags, that is a good thought. As is, 1 byte is added to the class's stored properties, so it goes from 32 to 33 bytes on 64-bit platforms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to allocate a normal _StringStorage instance of appropriate size and write into that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to do that as a next step. The __StringStorage class doesn't currently have anything resembling an appropriate initializer, though.

@glessard
Copy link
Contributor Author

@swift-ci please test

@glessard glessard merged commit d8c809c into swiftlang:main Jan 22, 2024
@glessard glessard deleted the se0405-part1 branch January 22, 2024 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

swift evolution approved Flag → feature: A feature that was approved through the Swift evolution process swift evolution implemented Flag → feature: A feature that was approved through the Swift evolution process and implemented

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants