👷🏼‍♂️[stdlib] Experiment with alternative ways to do grapheme breaking #63051

lorentey · 2023-01-16T09:06:52Z

This is a draft attempt at eliminating redundant lookups of Unicode properties during the grapheme breaking process. My hope is that this will (eventually) considerably speed up String's character views, but that will take a bit of work -- String indexing operations currently have no easy way to preserve information about scalars that lie on a Character boundary. (Unfortunately, most scalars tend to lie on a Character boundary...)

When finding grapheme breaks in a string such as ”Cafe\u{301}”, operations such as String.count currently execute the following independent steps:

var state = _GraphemeBreakingState()
state.shouldBreak(“C”, “a”) // 2 lookups, true
state.shouldBreak(“a”, “f”) // 2 lookups, true
state.shouldBreak(“f”, “e”) // 2 lookups, true
state.shouldBreak(“e”, “\u{301}”) // 2 lookups, false
// Total: 8 property lookups

Each shouldBreak(a, b) invocation individually looks up grapheme breaking properties for both scalar values, which includes calling nontrivial functions such as hasBreakWhenPaired and Unicode._GraphemeBreakProperty(from:).

This means that during the course of the string operation, we’re retrieving this information twice for each scalar, so we are duplicating nontrivial work. We should be able to significantly speed up grapheme breaking by caching this metadata directly in the state, and preserving it across character boundaries.

var state = _GraphemeBreakingState()
state.hasBreak(before: “C”) // 1 lookup, true
state.hasBreak(before: “a”) // 1 lookup, true
state.hasBreak(before: “f”) // 1 lookup, true
state.hasBreak(before: “e”) // 1 lookup, true
state.hasBreak(before: “\u{301}”) // 1 lookup, false
// Total: 5 property lookups

To make this work well, I think we need to change String.Index to cache the grapheme breaking property of the addressed scalar, rather than the stride of the addressed Character, as it currently does. This will somewhat slow down code that iterates over strings using indices, but I suspect the improved grapheme breaking throughput will largely make up for this. (If it doesn't, we'll "just" need to also replace the huge switch statements in the current core grapheme breaking algorithms with an artisanal, small batch lookup table.)

Additionally, having indices only store information about their addressed scalar would eliminate a large source of index invalidation surprises: currently range replacements in native Swift strings sometimes invalidate indices way below the slice that they affected, which usually comes as a shock to people.

In case this proves impractical to do in String itself, at minimum we could still use this to speed up _CharacterRecognizer.

rdar://103970243

Grapheme breaking currently works by essentially looking up the grapheme break properties of each scalar twice — once when it is on the right side of a potential break position, and once more when it is on the left side of the next. This is wasteful. - Incorporate the last scalar value as well as its grapheme break property into state of the grapheme breaking state machine. - Remove the old grapheme breaking speedup implemented by `_hasGraphemeBreakBetween` — having it as a separate phase used to be useful when we delegated to ICU, but now that we’ve a Swift native implementation, this is no longer worth it. However, still prevent dispatching to `_swift_stdlib_getGraphemeBreakProperty` for these characters by checking for them as part of the property lookup code in Swift. - Add a quick check for a guaranteed break / no break case between two specified scalars, with no information about any preceding scalars. This should measurably speed up `_CharacterRecognizer`, but I expect it will do the opposite for `String`, as it cannot currently preserve grapheme breaking state across Character boundaries. (Indices will need to be reworked to allow this.) # Conflicts: # stdlib/public/core/StringGraphemeBreaking.swift

lorentey · 2023-01-16T09:12:17Z

I hope to see some improvements in CharacterRecognizer benchmarks at least. (String walks probably won't like the new indices yet, but we'll see...)

@swift-ci benchmark

lorentey · 2023-01-16T09:19:48Z

@swift-ci benchmark

…ingState This is mostly just moving the code as is — we’ll need to introduce a separate type to meaningfully speed up backward breaking. However, it seems better to keep all parts of core grapheme breaking logic together in _GraphemeBreakingState rather than keeping parts of it in _StringGuts.

…cter Don’t start a whole grapheme breaking session back & forth if it is immediately obvious that there is a break at the current position.

…reak property of current scalar This makes for far fewer surprises, as we’ll no longer invalidate indices in a native String that precede any region affected by a range replacement. It also allows us to speed up the grapheme breaking process by not looking up grapheme breaking properties for each scalar twice. Note: Character views are still assuming the former behavior, so this change significantly slows down string operations within them. More work is coming to fix this.

This makes full use of the new grapheme break property in String.Index, so we should perhaps start seeing some benefits.

Rwrite the current iterator’s `next` implementation to use the new grapheme breaking primitives instead of forwarding to `_opaqueCharacterStride(startingAt:)`. Unfortunately `String.Iterator` is a frozen type and its members used to be fully inlinable, so we can’t (easily) remember the grapheme breaking state across `next` invocations. Frustrating! 😖

lorentey · 2023-01-16T18:45:41Z

@swift-ci benchmark

lorentey · 2023-01-16T21:45:50Z

Yay, it works:

CharacterRecognizer.mixed                   21.156      13.096     -38.1%   **1.62x**
StringDistance.characters.mixed             3726.0      2568.0     -31.1%   **1.45x**

There are lots of regressions elsewhere, as expected. Substrings and backwards index steps haven't been updated yet.

------- Performance (x86_64): -O -------

REGRESSION                                  OLD         NEW        DELTA    RATIO    
FindString.Loop1.Substring                  300.25      481.4      +60.3%   **0.62x**
FindString.Rec3.Substring                   115.235     179.25     +55.6%   **0.64x**
FindString.Rec3.String                      119.111     181.455    +52.3%   **0.66x**
CSVParsing.Char                             142.563     211.889    +48.6%   **0.67x**
SubstringTrimmingASCIIWhitespace            135.833     199.875    +47.1%   **0.68x**
StringMatch                                 5450.0      7881.25    +44.6%   **0.69x**
StringHasSuffixUnicode                      76111.111   102187.5   +34.3%   **0.74x (?)**
RomanNumbers2                               465.0       624.0      +34.2%   **0.75x**
RemoveWhereQuadraticString                  221.778     274.125    +23.6%   **0.81x**
Calculator                                  137.308     158.417    +15.4%   **0.87x (?)**
CharacterRecognizer.ascii                   63.444      71.3       +12.4%   **0.89x (?)**
LineSink.characters.alpha                   67.714      76.063     +12.3%   **0.89x (?)**
DropLastSequenceLazy                        315.5       351.833    +11.5%   **0.90x (?)**
NormalizedIterator_fastPrenormal            486.17      541.429    +11.4%   **0.90x (?)**
DropLastSequence                            315.0       350.5      +11.3%   **0.90x (?)**
NormalizedIterator_nonBMPSlowestPrenormal   390.469     433.429    +11.0%   **0.90x (?)**
DistinctClassFieldAccesses                  40.371      44.714     +10.8%   **0.90x (?)**
DropLastAnySequence                         326.667     360.667    +10.4%   **0.91x (?)**
LuhnAlgoEager                               183.556     200.625    +9.3%    **0.91x (?)**
Chars2                                      3062.0      3342.857   +9.2%    **0.92x (?)**
OpenClose                                   50.08       54.435     +8.7%    **0.92x (?)**
LuhnAlgoLazy                                184.8       200.667    +8.6%    **0.92x (?)**
NormalizedIterator_slowerPrenormal          271.778     294.545    +8.4%    **0.92x (?)**

IMPROVEMENT                                 OLD         NEW        DELTA    RATIO    
CharacterRecognizer.mixed                   21.156      13.096     -38.1%   **1.62x**
StringDistance.characters.mixed             3726.0      2568.0     -31.1%   **1.45x**
EqualSubstringSubstringGenericEquatable     30.692      24.787     -19.2%   **1.24x (?)**
LessSubstringSubstringGenericComparable     30.719      24.941     -18.8%   **1.23x (?)**
EqualStringSubstring                        30.917      25.267     -18.3%   **1.22x (?)**
EqualSubstringString                        30.923      25.353     -18.0%   **1.22x (?)**
LessSubstringSubstring                      30.813      25.865     -16.1%   **1.19x (?)**
StringComparison_longSharedPrefix           241.2       206.5      -14.4%   **1.17x (?)**
EqualSubstringSubstring                     31.36       26.923     -14.1%   **1.16x (?)**
FlattenListFlatMap                          4520.0      3926.0     -13.1%   **1.15x (?)**
Breadcrumbs.UTF16ToIdxRange.longASCII       15.258      13.536     -11.3%   **1.13x (?)**
StringDistance.characters.ascii             2669.0      2379.0     -10.9%   **1.12x (?)**
Set.isSubset.Int.Empty                      50.08       45.714     -8.7%    **1.10x (?)**
Set.isDisjoint.Box.Empty                    54.465      50.109     -8.0%    **1.09x (?)**
Set.isDisjoint.Seq.Box.Empty                88.222      81.667     -7.4%    **1.08x (?)**
Breadcrumbs.UTF16ToIdx.longASCII            42.491      39.339     -7.4%    **1.08x (?)**
StringUTF16SubstringBuilder                 1506.364    1398.462   -7.2%    **1.08x (?)**
Set.isStrictSubset.Int.Empty                45.731      42.464     -7.1%    **1.08x (?)**

------- Code size: -O -------

IMPROVEMENT        OLD     NEW     DELTA   RATIO  
StringDistance.o   13199   12663   -4.1%   **1.04x**

lorentey added 2 commits January 16, 2023 00:47

[stdlib] String: Mark indexing operations as releasenone

5d956ab

lorentey added 5 commits January 16, 2023 10:45

[stdlib] Add a fast path to _StringGuts. _slowRoundDownToNearestChara…

05b52d9

…cter Don’t start a whole grapheme breaking session back & forth if it is immediately obvious that there is a break at the current position.

[stdlib] String: Rewrite forward index operations

9b4631c

This makes full use of the new grapheme break property in String.Index, so we should perhaps start seeing some benefits.

lorentey force-pushed the character-recognizer-optimizations branch from b8c40b6 to 4c2fa27 Compare January 16, 2023 18:45

lorentey requested a review from Azoy January 25, 2023 22:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

👷🏼‍♂️[stdlib] Experiment with alternative ways to do grapheme breaking #63051

👷🏼‍♂️[stdlib] Experiment with alternative ways to do grapheme breaking #63051

lorentey commented Jan 16, 2023 •

edited

lorentey commented Jan 16, 2023

lorentey commented Jan 16, 2023

lorentey commented Jan 16, 2023

lorentey commented Jan 16, 2023

👷🏼‍♂️[stdlib] Experiment with alternative ways to do grapheme breaking #63051

Are you sure you want to change the base?

👷🏼‍♂️[stdlib] Experiment with alternative ways to do grapheme breaking #63051

Conversation

lorentey commented Jan 16, 2023 • edited

lorentey commented Jan 16, 2023

lorentey commented Jan 16, 2023

lorentey commented Jan 16, 2023

lorentey commented Jan 16, 2023

lorentey commented Jan 16, 2023 •

edited