👷🏼♂️[stdlib] Experiment with alternative ways to do grapheme breaking #63051
Draft: lorentey wants to merge 7 commits into apple:main from lorentey:character-recognizer-optimizations
+568 −285
Grapheme breaking currently works by essentially looking up the grapheme break properties of each scalar twice: once when it is on the right side of a potential break position, and once more when it is on the left side of the next. This is wasteful.

- Incorporate the last scalar value as well as its grapheme break property into state of the grapheme breaking state machine.
- Remove the old grapheme breaking speedup implemented by `_hasGraphemeBreakBetween`. Having it as a separate phase used to be useful when we delegated to ICU, but now that we’ve a Swift native implementation, this is no longer worth it. However, still prevent dispatching to `_swift_stdlib_getGraphemeBreakProperty` for these characters by checking for them as part of the property lookup code in Swift.
- Add a quick check for a guaranteed break / no break case between two specified scalars, with no information about any preceding scalars.

This should measurably speed up `_CharacterRecognizer`, but I expect it will do the opposite for `String`, as it cannot currently preserve grapheme breaking state across Character boundaries. (Indices will need to be reworked to allow this.)

# Conflicts:
# stdlib/public/core/StringGraphemeBreaking.swift
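To make the first bullet concrete, here is a minimal sketch of a breaking state machine that remembers the property of the last scalar it saw, so each scalar is classified once. All names here (`SketchProperty`, `sketchProperty(of:)`, `SketchRecognizer`, `sketchCharacterCount`) are hypothetical and are not the stdlib's types; the rule set is reduced to CR/LF, controls, and one block of combining marks, which is an assumption made for brevity.

```swift
// Simplified stand-in for the real grapheme break properties (assumption:
// only a handful of cases, enough to show the shape of the state machine).
enum SketchProperty {
  case cr, lf, control, extend, any
}

// Hypothetical classifier; the real lookup covers far more of Unicode.
func sketchProperty(of scalar: Unicode.Scalar) -> SketchProperty {
  switch scalar.value {
  case 0x0D: return .cr
  case 0x0A: return .lf
  case 0x00...0x1F, 0x7F...0x9F: return .control
  case 0x0300...0x036F: return .extend  // combining diacritical marks
  default: return .any
  }
}

// Hypothetical recognizer that keeps the property of the last scalar in its
// state, so feeding a scalar costs exactly one property lookup.
struct SketchRecognizer {
  private var last: SketchProperty? = nil

  mutating func hasBreak(before next: Unicode.Scalar) -> Bool {
    let current = sketchProperty(of: next)
    defer { self.last = current }
    guard let previous = last else { return true }  // start of text
    switch (previous, current) {
    case (.cr, .lf): return false                            // never split CR LF
    case (.cr, _), (.lf, _), (.control, _): return true      // break after controls
    case (_, .cr), (_, .lf), (_, .control): return true      // break before controls
    case (_, .extend): return false                          // keep marks attached
    default: return true
    }
  }
}

// Counts clusters under the simplified rules above.
func sketchCharacterCount(_ s: String) -> Int {
  var recognizer = SketchRecognizer()
  var count = 0
  for scalar in s.unicodeScalars {
    if recognizer.hasBreak(before: scalar) { count += 1 }
  }
  return count
}
```

Because the previous property lives in the recognizer, the caller never has to re-derive it when moving to the next potential break position.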
I hope to see some improvements in CharacterRecognizer benchmarks at least. (String walks probably won't like the new indices yet, but we'll see...)

@swift-ci benchmark
@swift-ci benchmark
…ingState
This is mostly just moving the code as is; we’ll need to introduce a separate type to meaningfully speed up backward breaking. However, it seems better to keep all parts of core grapheme breaking logic together in _GraphemeBreakingState rather than keeping parts of it in _StringGuts.
…cter
Don’t start a whole grapheme breaking session back & forth if it is immediately obvious that there is a break at the current position.
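A quick check of this kind could look roughly like the sketch below. The function name `quickBreakCheck` and its three-way result are assumptions made for illustration; the only facts it relies on are that CR followed by LF is never split and that any other pair of ASCII scalars always has a break between them.

```swift
// Hypothetical fast path: answers from the two scalars alone when it can.
// Returns `true` (guaranteed break), `false` (guaranteed no break), or `nil`
// when the full grapheme breaking algorithm has to decide.
func quickBreakCheck(_ lhs: Unicode.Scalar, _ rhs: Unicode.Scalar) -> Bool? {
  // CR followed by LF is never split.
  if lhs.value == 0x0D && rhs.value == 0x0A { return false }
  // ASCII contains no combining marks or joiners, so any other ASCII pair
  // always has a break between its two scalars.
  if lhs.isASCII && rhs.isASCII { return true }
  // Anything else needs the full rules (and possibly earlier context).
  return nil
}
```

A caller would only start a full breaking session when the result is `nil`.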
…reak property of current scalar
This makes for far fewer surprises, as we’ll no longer invalidate indices in a native String that precede any region affected by a range replacement. It also allows us to speed up the grapheme breaking process by not looking up grapheme breaking properties for each scalar twice. Note: Character views are still assuming the former behavior, so this change significantly slows down string operations within them. More work is coming to fix this.
This makes full use of the new grapheme break property in String.Index, so we should perhaps start seeing some benefits.
Rewrite the current iterator’s `next` implementation to use the new grapheme breaking primitives instead of forwarding to `_opaqueCharacterStride(startingAt:)`. Unfortunately `String.Iterator` is a frozen type and its members used to be fully inlinable, so we can’t (easily) remember the grapheme breaking state across `next` invocations. Frustrating! 😖
lorentey force-pushed the character-recognizer-optimizations branch from b8c40b6 to 4c2fa27 (January 16, 2023 18:45)
@swift-ci benchmark
Yay, it works:
There are lots of regressions elsewhere, as expected. Substrings and backwards index steps haven't been updated yet.
This is a draft attempt at eliminating redundant lookups of Unicode properties during the grapheme breaking process. My hope is that this will (eventually) considerably speed up String's character views, but that will take a bit of work -- String indexing operations currently have no easy way to preserve information about scalars that lie on a Character boundary. (Unfortunately, most scalars tend to lie on a Character boundary...)
When finding grapheme breaks in a string such as `"Cafe\u{301}"`, operations such as `String.count` currently execute the following independent steps:

Each `shouldBreak(a, b)` invocation individually looks up grapheme breaking properties for both scalar values, which includes calling nontrivial functions such as `hasBreakWhenPaired` and `Unicode._GraphemeBreakProperty(from:)`.

This means that during the course of the string operation, we’re retrieving this information twice for each scalar, so we are duplicating nontrivial work. We should be able to significantly speed up grapheme breaking by caching this metadata directly in the state, and preserving it across character boundaries.
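The duplicated work can be made visible with a toy model that counts property lookups. Everything in this sketch is hypothetical (`ToyProperty`, `CountingLookup`, `toyShouldBreak`, `countPairwise`, `countSinglePass`), and the breaking rule is deliberately reduced to "no break before a combining mark", which is an assumption and not the real algorithm.

```swift
// Toy property set: just enough to handle "Cafe\u{301}".
enum ToyProperty { case extend, other }

// Hypothetical lookup that counts how often it is called.
struct CountingLookup {
  private(set) var lookups = 0

  mutating func property(of scalar: Unicode.Scalar) -> ToyProperty {
    lookups += 1
    return (0x0300...0x036F).contains(scalar.value) ? .extend : .other
  }
}

// Simplified rule (assumption): break everywhere except before a mark.
func toyShouldBreak(_ a: ToyProperty, _ b: ToyProperty) -> Bool {
  return b != .extend
}

// Pairwise style: every potential break position classifies both neighbors.
func countPairwise(_ s: String) -> (characters: Int, lookups: Int) {
  var lookup = CountingLookup()
  let scalars = Array(s.unicodeScalars)
  guard !scalars.isEmpty else { return (0, 0) }
  var characters = 1
  for i in 1..<scalars.count {
    let a = lookup.property(of: scalars[i - 1])
    let b = lookup.property(of: scalars[i])
    if toyShouldBreak(a, b) { characters += 1 }
  }
  return (characters, lookup.lookups)
}

// Single pass: the previous scalar's property is carried along in state.
func countSinglePass(_ s: String) -> (characters: Int, lookups: Int) {
  var lookup = CountingLookup()
  var previous: ToyProperty? = nil
  var characters = 0
  for scalar in s.unicodeScalars {
    let current = lookup.property(of: scalar)
    if let p = previous {
      if toyShouldBreak(p, current) { characters += 1 }
    } else {
      characters = 1  // the first scalar always starts a character
    }
    previous = current
  }
  return (characters, lookup.lookups)
}
```

For `"Cafe\u{301}"` (five scalars), the pairwise version performs 8 lookups and the single-pass version performs 5; both arrive at 4 characters.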
To make this work well, I think we need to change `String.Index` to cache the grapheme breaking property of the addressed scalar, rather than the stride of the addressed Character, as it currently does. This will somewhat slow down code that iterates over strings using indices, but I suspect the improved grapheme breaking throughput will largely make up for this. (If it doesn't, we'll "just" need to also replace the huge switch statements in the current core grapheme breaking algorithms with an artisanal, small batch lookup table.)

Additionally, having indices only store information about their addressed scalar would eliminate a large source of index invalidation surprises: currently range replacements in native Swift strings sometimes invalidate indices way below the slice that they affected, which usually comes as a shock to people.
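As a rough illustration of the lookup-table idea mentioned in the parenthetical above, the sketch below replaces a pairwise switch with a small two-dimensional table indexed by property raw values. `TableProperty` and `BreakTable` are hypothetical names, and the three-property rule set is an assumption; the full algorithm also needs extra state for cases such as regional indicator pairs and emoji sequences, so a pairwise table alone would not cover it.

```swift
// Hypothetical reduced property set; raw values double as table indices.
enum TableProperty: Int {
  case control = 0, extend = 1, any = 2
}

enum BreakTable {
  // rows[left][right] is true when there is a break between the two scalars.
  // Column order: control, extend, any.
  static let rows: [[Bool]] = [
    [true, true,  true],  // left: control (always break after a control)
    [true, false, true],  // left: extend
    [true, false, true],  // left: any
  ]

  static func shouldBreak(_ left: TableProperty, _ right: TableProperty) -> Bool {
    return rows[left.rawValue][right.rawValue]
  }
}
```

The decision becomes two array subscripts instead of a chain of pattern matches, at the cost of keeping the table in sync with the rules by hand.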
In case this proves impractical to do in String itself, at minimum we could still use this to speed up `_CharacterRecognizer`.

rdar://103970243