Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

👷🏼‍♂️[stdlib] Experiment with alternative ways to do grapheme breaking #63051

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

lorentey
Copy link
Member

@lorentey lorentey commented Jan 16, 2023

This is a draft attempt at eliminating redundant lookups of Unicode properties during the grapheme breaking process. My hope is that this will (eventually) considerably speed up String's character views, but that will take a bit of work -- String indexing operations currently have no easy way to preserve information about scalars that lie on a Character boundary. (Unfortunately, most scalars tend to lie on a Character boundary...)

When finding grapheme breaks in a string such as ”Cafe\u{301}”, operations such as String.count currently execute the following independent steps:

var state = _GraphemeBreakingState()
state.shouldBreak(“C”, “a”) // 2 lookups, true
state.shouldBreak(“a”, “f”) // 2 lookups, true
state.shouldBreak(“f”, “e”) // 2 lookups, true
state.shouldBreak(“e”, “\u{301}) // 2 lookups, false
// Total: 8 property lookups

Each shouldBreak(a, b) invocation individually looks up grapheme breaking properties for both scalar values, which includes calling nontrivial functions such as hasBreakWhenPaired and Unicode._GraphemeBreakProperty(from:).

This means that during the course of the string operation, we’re retrieving this information twice for each scalar, so we are duplicating nontrivial work. We should be able to significantly speed up grapheme breaking by caching this metadata directly in the state, and preserving it across character boundaries.

var state = _GraphemeBreakingState()
state.hasBreak(before: “C”) // 1 lookup, true
state.hasBreak(before: “a”) // 1 lookup, true
state.hasBreak(before: “f”) // 1 lookup, true
state.hasBreak(before: “e”) // 1 lookup, true
state.hasBreak(before: “\u{301}) // 1 lookup, false
// Total: 5 property lookups

To make this work well, I think we need to change String.Index to cache the grapheme breaking property of the addressed scalar, rather than the stride of the addressed Character, as it currently does. This will somewhat slow down code that iterates over strings using indices, but I suspect the improved grapheme breaking throughput will largely make up for this. (If it doesn't, we'll "just" need to also replace the huge switch statements in the current core grapheme breaking algorithms with an artisanal, small batch lookup table.)

Additionally, having indices only store information about their addressed scalar would eliminate a large source of index invalidation surprises: currently range replacements in native Swift strings sometimes invalidate indices way below the slice that they affected, which usually comes as a shock to people.

In case this proves impractical to do in String itself, at minimum we could still use this to speed up _CharacterRecognizer.

rdar://103970243

Grapheme breaking currently works by essentially looking up the
grapheme break properties of each scalar twice — once when it is on
the right side of a potential break position, and once more when
it is on the left side of the next. This is wasteful.

- Incorporate the last scalar value as well as its grapheme break
   property into state of the grapheme breaking state machine.
- Remove the old grapheme breaking speedup implemented by
   `_hasGraphemeBreakBetween` — having it as a separate phase used
   to be useful when we delegated to ICU, but now that we’ve a
   Swift native implementation, this is no longer worth it.
   However, still prevent dispatching to
   `_swift_stdlib_getGraphemeBreakProperty` for these characters
   by checking for them as part of the property lookup code in Swift.
- Add a quick check for a guaranteed break / no break case between
   two specified scalars, with no information about any preceding
   scalars.

This should measurably speed up `_CharacterRecognizer`, but I expect
it will do the opposite for `String`, as it cannot currently preserve
grapheme breaking state across Character boundaries. (Indices will
need to be reworked to allow this.)

# Conflicts:
#	stdlib/public/core/StringGraphemeBreaking.swift
@lorentey
Copy link
Member Author

I hope to see some improvements in CharacterRecognizer benchmarks at least. (String walks probably won't like the new indices yet, but we'll see...)

@swift-ci benchmark

@lorentey
Copy link
Member Author

@swift-ci benchmark

…ingState

This is mostly just moving the code as is — we’ll need to introduce
a separate type to meaningfully speed up backward breaking.

However, it seems better to keep all parts of core grapheme breaking
logic together in _GraphemeBreakingState rather than keeping parts of
it in _StringGuts.
…cter

Don’t start a whole grapheme breaking session
back & forth if it is immediately obvious that
there is a break at the current position.
…reak property of current scalar

This makes for far fewer surprises, as we’ll no
longer invalidate indices in a native String
that precede any region affected by a range replacement.
It also allows us to speed up the grapheme breaking
process by not looking up grapheme breaking properties
for each scalar twice.

Note: Character views are still assuming the former behavior,
so this change significantly slows down string operations
within them. More work is coming to fix this.
This makes full use of the new grapheme break property in
String.Index, so we should perhaps start seeing some benefits.
Rwrite the current iterator’s `next` implementation
to use the new grapheme breaking primitives instead of
forwarding to `_opaqueCharacterStride(startingAt:)`.

Unfortunately `String.Iterator` is a frozen type and its
members used to be fully inlinable, so we can’t (easily)
remember the grapheme breaking state across `next`
invocations. Frustrating! 😖
@lorentey lorentey force-pushed the character-recognizer-optimizations branch from b8c40b6 to 4c2fa27 Compare January 16, 2023 18:45
@lorentey
Copy link
Member Author

@swift-ci benchmark

@lorentey
Copy link
Member Author

Yay, it works:

CharacterRecognizer.mixed                   21.156      13.096     -38.1%   **1.62x**
StringDistance.characters.mixed             3726.0      2568.0     -31.1%   **1.45x**

There are lots of regressions elsewhere, as expected. Substrings and backwards index steps haven't been updated yet.

------- Performance (x86_64): -O -------

REGRESSION                                  OLD         NEW        DELTA    RATIO    
FindString.Loop1.Substring                  300.25      481.4      +60.3%   **0.62x**
FindString.Rec3.Substring                   115.235     179.25     +55.6%   **0.64x**
FindString.Rec3.String                      119.111     181.455    +52.3%   **0.66x**
CSVParsing.Char                             142.563     211.889    +48.6%   **0.67x**
SubstringTrimmingASCIIWhitespace            135.833     199.875    +47.1%   **0.68x**
StringMatch                                 5450.0      7881.25    +44.6%   **0.69x**
StringHasSuffixUnicode                      76111.111   102187.5   +34.3%   **0.74x (?)**
RomanNumbers2                               465.0       624.0      +34.2%   **0.75x**
RemoveWhereQuadraticString                  221.778     274.125    +23.6%   **0.81x**
Calculator                                  137.308     158.417    +15.4%   **0.87x (?)**
CharacterRecognizer.ascii                   63.444      71.3       +12.4%   **0.89x (?)**
LineSink.characters.alpha                   67.714      76.063     +12.3%   **0.89x (?)**
DropLastSequenceLazy                        315.5       351.833    +11.5%   **0.90x (?)**
NormalizedIterator_fastPrenormal            486.17      541.429    +11.4%   **0.90x (?)**
DropLastSequence                            315.0       350.5      +11.3%   **0.90x (?)**
NormalizedIterator_nonBMPSlowestPrenormal   390.469     433.429    +11.0%   **0.90x (?)**
DistinctClassFieldAccesses                  40.371      44.714     +10.8%   **0.90x (?)**
DropLastAnySequence                         326.667     360.667    +10.4%   **0.91x (?)**
LuhnAlgoEager                               183.556     200.625    +9.3%    **0.91x (?)**
Chars2                                      3062.0      3342.857   +9.2%    **0.92x (?)**
OpenClose                                   50.08       54.435     +8.7%    **0.92x (?)**
LuhnAlgoLazy                                184.8       200.667    +8.6%    **0.92x (?)**
NormalizedIterator_slowerPrenormal          271.778     294.545    +8.4%    **0.92x (?)**

IMPROVEMENT                                 OLD         NEW        DELTA    RATIO    
CharacterRecognizer.mixed                   21.156      13.096     -38.1%   **1.62x**
StringDistance.characters.mixed             3726.0      2568.0     -31.1%   **1.45x**
EqualSubstringSubstringGenericEquatable     30.692      24.787     -19.2%   **1.24x (?)**
LessSubstringSubstringGenericComparable     30.719      24.941     -18.8%   **1.23x (?)**
EqualStringSubstring                        30.917      25.267     -18.3%   **1.22x (?)**
EqualSubstringString                        30.923      25.353     -18.0%   **1.22x (?)**
LessSubstringSubstring                      30.813      25.865     -16.1%   **1.19x (?)**
StringComparison_longSharedPrefix           241.2       206.5      -14.4%   **1.17x (?)**
EqualSubstringSubstring                     31.36       26.923     -14.1%   **1.16x (?)**
FlattenListFlatMap                          4520.0      3926.0     -13.1%   **1.15x (?)**
Breadcrumbs.UTF16ToIdxRange.longASCII       15.258      13.536     -11.3%   **1.13x (?)**
StringDistance.characters.ascii             2669.0      2379.0     -10.9%   **1.12x (?)**
Set.isSubset.Int.Empty                      50.08       45.714     -8.7%    **1.10x (?)**
Set.isDisjoint.Box.Empty                    54.465      50.109     -8.0%    **1.09x (?)**
Set.isDisjoint.Seq.Box.Empty                88.222      81.667     -7.4%    **1.08x (?)**
Breadcrumbs.UTF16ToIdx.longASCII            42.491      39.339     -7.4%    **1.08x (?)**
StringUTF16SubstringBuilder                 1506.364    1398.462   -7.2%    **1.08x (?)**
Set.isStrictSubset.Int.Empty                45.731      42.464     -7.1%    **1.08x (?)**

------- Code size: -O -------

IMPROVEMENT        OLD     NEW     DELTA   RATIO  
StringDistance.o   13199   12663   -4.1%   **1.04x**

@lorentey lorentey requested a review from Azoy January 25, 2023 22:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant