
various improvements #31

Closed · wants to merge 53 commits

Conversation

@jnqnfe commented Nov 4, 2018

please don't request breaking them into separate pull requests, that would be horrendous :)

I have not yet found time to review with respect to #30

not everything needs to be in one file
testing public interfaces, so no need to be within lib
redundant now with big test block moved out of lib
 - shorter first para gives tidier documentation
 - hyperlinks added that can be clicked on when reading documentation
 - quoted literals
utterly pointless to be doing the matching
use helper function for doing the comparison

this also gives a nice `assert_eq` style output on failure, enabling you to
see the actual value generated, if ever of interest
much clearer to have opening brace on next line with multi-line conditions
better clarity of purpose - these are char counts, not to be confused with
a string's len() value
can return early if one is an empty string, no need to waste effort
counting chars
only count chars where needed

this could be written in fewer lines, but this seems much more readable
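A minimal sketch of the shape this takes, assuming a classic single-row Levenshtein rather than the crate's exact code:

```rust
/// Sketch only: Levenshtein distance with the early returns described
/// above; the crate's actual implementation may differ.
fn levenshtein(a: &str, b: &str) -> usize {
    if a == b {
        return 0;
    }
    // Early returns: if either string is empty, the distance is simply
    // the char count of the other, so chars are only counted as needed.
    if a.is_empty() {
        return b.chars().count();
    }
    if b.is_empty() {
        return a.chars().count();
    }
    let b_count = b.chars().count();
    let mut prev: Vec<usize> = (0..=b_count).collect();
    for (i, a_char) in a.chars().enumerate() {
        let mut prev_diag = prev[0];
        prev[0] = i + 1;
        for (j, b_char) in b.chars().enumerate() {
            let cost = if a_char == b_char { 0 } else { 1 };
            let next = (prev_diag + cost).min(prev[j] + 1).min(prev[j + 1] + 1);
            prev_diag = prev[j + 1];
            prev[j + 1] = next;
        }
    }
    prev[b_count]
}
```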
char counts are not used if strings are equal, so count after comparison
as just done with levenshtein
as per levenshtein
to be used shortly for optimisation
use new split_on_common_prefix() helper

In cases of two non-equal strings with equal byte length, some portion of
them would be compared in the equality check before the check fails and
it moves on to a more thorough calculation. Unfortunately, the information
about how much of the start of the strings did match was thrown away
(not made available from the basic equality check), which would have been
useful, since we could simply skip past it (a common prefix contributes
nothing to the result).

The new helper allows precisely this. It finds the common prefix, splitting
the strings after it. We can then ignore it, and proceed on with just the
suffixes that remain (if any).
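A minimal sketch of what such a helper could look like (the PR names it `split_on_common_prefix`; the exact signature here is assumed):

```rust
/// Sketch of a prefix-splitting helper: returns the remaining suffixes
/// of both strings after their common prefix has been removed.
fn split_on_common_prefix<'a>(a: &'a str, b: &'a str) -> (&'a str, &'a str) {
    // Find the byte length of the common prefix, comparing char by char
    // so we never split inside a multi-byte UTF-8 sequence.
    let mut split = 0;
    for (a_char, b_char) in a.chars().zip(b.chars()) {
        if a_char != b_char {
            break;
        }
        split += a_char.len_utf8();
    }
    (&a[split..], &b[split..])
}
```

The returned suffixes can then be fed straight into the main algorithm, since the shared prefix contributes nothing to the distance.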
The new 'normalised' form does char counting on top of the standard
algorithm it calls. This change avoids unnecessary recalculation by
moving the main algorithm to a private function which takes `Option`
wrapped counts, allowing the normalised form to pass in the values it
calculates, thus letting this be done only once.
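A sketch of the described split, with names and signatures assumed:

```rust
/// Sketch (names assumed): private function taking optional pre-computed
/// char counts so callers that already have them avoid a recount.
fn levenshtein_impl(a: &str, b: &str, counts: Option<(usize, usize)>) -> usize {
    let (_a_count, _b_count) =
        counts.unwrap_or_else(|| (a.chars().count(), b.chars().count()));
    // ... main algorithm, using the counts as needed ...
    unimplemented!()
}

pub fn levenshtein(a: &str, b: &str) -> usize {
    levenshtein_impl(a, b, None) // counts computed internally, once
}

pub fn normalized_levenshtein(a: &str, b: &str) -> f64 {
    let a_count = a.chars().count();
    let b_count = b.chars().count();
    if a_count == 0 && b_count == 0 {
        return 1.0; // two empty strings match exactly
    }
    // Pass the counts through so they are only ever calculated once.
    let distance = levenshtein_impl(a, b, Some((a_count, b_count)));
    1.0 - (distance as f64) / (a_count.max(b_count) as f64)
}
```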
avoid floats for simply counting, convert to float at end where it needs
to actually be a float for the final calculation
the strings are already `&str`
can't expect users to have read the readme
as per levenshtein optimisation rapidfuzz#2
as per levenshtein optimisation rapidfuzz#2
`usize` implements copy, thus we should prefer copying rather than cloning,
as recommended by the std documentation
rather than copy prev → prev2 then curr → prev, we can simplify and
optimise by just switching the prev & prev2 vecs, and copying curr → prev.
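Roughly, with assumed variable names, the rotation inside the row loop becomes:

```rust
use std::mem;

// Sketch of the row rotation described above (variable names assumed).
// Old approach: copy prev -> prev2, then copy curr -> prev (two copies).
// New approach: swap the prev/prev2 buffers (O(1)), then one real copy.
fn rotate_rows(prev2: &mut Vec<usize>, prev: &mut Vec<usize>, curr: &[usize]) {
    mem::swap(prev2, prev);     // prev's old contents now live in prev2
    prev.copy_from_slice(curr); // single copy: curr -> prev (equal-length rows)
}
```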
avoid clone() on a usize, it's copy-able so let's just copy it
simpler, and construction with vec! here should be more efficient than
a push() loop (see the implementation, which uses the private extend_with()
function)
use vec! for `curr_distances` construction, as with jaro optimisation rapidfuzz#3
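As a sketch (the row length expression is assumed):

```rust
// Sketch only: rather than a push() loop,
//     let mut curr_distances = Vec::with_capacity(len);
//     for _ in 0..len { curr_distances.push(0); }
// construct directly with vec!, which uses an optimised fill internally.
fn make_row(len: usize) -> Vec<usize> {
    vec![0; len]
}
```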
it's a little hard to walk through all of the std code, but I'm pretty certain
two separate range collect()s are more efficient than the push() loop.

it's also simpler
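A sketch of the idea; the initial row values here are assumed, not taken from the actual implementation:

```rust
// Sketch only: two range collect()s rather than one push() loop
// filling both rows.
fn init_rows(b_count: usize) -> (Vec<usize>, Vec<usize>) {
    let prev_distances: Vec<usize> = (0..=b_count).collect(); // 0, 1, ..., b_count
    let curr_distances: Vec<usize> = (0..=b_count).collect();
    (prev_distances, curr_distances)
}
```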
same as with levenshtein optimisation rapidfuzz#3 - avoid unnecessary repeated char
counting with the 'normalised' form.
`RangeInclusive` would have been most appropriate, but it gives no direct
access to its inner attributes
the benefit of being a macro is that test failure output will show the
file and line number of the failed assertion within the test, rather
than pointing inside the helper; this pinpoints the location better
than just having the name of the test function
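A hypothetical sketch of such a macro (the real helper's name and arguments are not shown in this thread; a `levenshtein` function is assumed to be in scope):

```rust
// Hypothetical sketch: the actual macro in the PR may differ.
macro_rules! assert_distance {
    ($func:ident, $a:expr, $b:expr, $expected:expr) => {
        // assert_eq! expands at the call site, so a failure reports the
        // file and line inside the test, not inside a shared helper fn.
        assert_eq!(
            $func($a, $b),
            $expected,
            "{}({:?}, {:?})",
            stringify!($func),
            $a,
            $b
        );
    };
}

#[test]
fn kitten_sitting() {
    assert_distance!(levenshtein, "kitten", "sitting", 3);
}
```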
This optimisation addresses the fact that in the Jaro-Winkler
implementation, any common prefix is processed to determine its
char count in addition to this portion of the strings also being
processed within the use of the standard Jaro algorithm.

Here we move the Jaro implementation to an 'inner' function, which
takes an additional parameter - the char count of a common prefix
that has been removed from the provided pair of strings. The
algorithm can then take this into account in calculating the metric.

The standard Jaro public API function calls this with both full
strings and a value of zero for this new param.

The Jaro-Winkler implementation now uses the prefix-splitting
helper, passing the pair of suffixes in to the 'inner' Jaro algorithm,
along with the char-count of the prefix. The adjustment is then
performed as before. This avoids wasting unnecessary effort on
the prefix portion within the use of the Jaro algorithm.
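A sketch of this shape, with names assumed and the Jaro body itself elided:

```rust
/// Sketch of the restructuring described above (names assumed; the full
/// Jaro body is elided for brevity). `prefix_count` is the char count of
/// a common prefix already stripped from `a` and `b`, which the
/// calculation folds back in.
fn jaro_inner(a: &str, b: &str, prefix_count: usize) -> f64 {
    // ... standard Jaro over the suffixes, with lengths and match
    // counts offset by `prefix_count` (prefix chars always match) ...
    let _ = (a, b, prefix_count);
    unimplemented!()
}

/// The public Jaro function simply passes full strings and a zero count.
pub fn jaro(a: &str, b: &str) -> f64 {
    jaro_inner(a, b, 0)
}

pub fn jaro_winkler(a: &str, b: &str) -> f64 {
    // Strip the common prefix once, tracking its char count and byte end.
    let mut prefix_count = 0;
    let mut split = 0;
    for (a_char, b_char) in a.chars().zip(b.chars()) {
        if a_char != b_char {
            break;
        }
        prefix_count += 1;
        split += a_char.len_utf8();
    }
    // Only the suffixes go through the (expensive) Jaro calculation.
    let jaro = jaro_inner(&a[split..], &b[split..], prefix_count);
    // Winkler adjustment as before; prefix weight capped at 4 chars.
    let l = prefix_count.min(4) as f64;
    jaro + l * 0.1 * (1.0 - jaro)
}
```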
While `get_diverge_indice` retrieves the index as before, it now keeps
track of the char count while it is looping over chars, so this can be
used where wanted (jaro-winkler) without a separate `.chars().count()`.

Hopefully, with `inline(always)` the counting will be optimised away in
the implementations that do not use the count.
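A sketch of the updated helper (the name comes from the commit; the exact signature is assumed):

```rust
/// Sketch: returns the byte index at which the two strings diverge plus
/// the char count of the common prefix, gathered in the same pass.
#[inline(always)]
fn get_diverge_indice(a: &str, b: &str) -> (usize, usize) {
    let mut char_count = 0;
    for ((index, a_char), b_char) in a.char_indices().zip(b.chars()) {
        if a_char != b_char {
            return (index, char_count);
        }
        char_count += 1;
    }
    // One string is a prefix of the other (or they are equal); the
    // divergence point is the end of the shorter string.
    (a.len().min(b.len()), char_count)
}
```

Callers that do not need the count can simply ignore it; per the commit, `inline(always)` should let that counting be optimised away.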
use the char count returned from the updated helper
Taking the j-w optimisations further, this makes use of the prefix
splitting helper within the inner Jaro algorithm.

The function has been modified such that instead of taking a char-count of
the size of the common prefix removed from the pair of strings, it now
optionally takes an out-parameter through which to return the count, obtaining it within the
function through use of the helper internally.

Using the prefix splitting helper within the function means that we avoid
doing a `.chars().count()` iteration over the prefix twice, once going
over `a` and once going over `b`. It also then allows the main part of the
algorithm to completely avoid processing the common prefix portion of the
strings.
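Roughly, with the signature assumed and the Jaro body again elided:

```rust
/// Sketch of the further change described above (signature assumed).
/// Instead of receiving a prefix char count, the inner function now
/// strips the prefix itself and optionally reports its char count back
/// through an out-parameter.
fn jaro_inner(a: &str, b: &str, prefix_count_out: Option<&mut usize>) -> f64 {
    // Strip the common prefix in a single pass over both strings.
    let mut prefix_count = 0;
    let mut split = 0;
    for (a_char, b_char) in a.chars().zip(b.chars()) {
        if a_char != b_char {
            break;
        }
        prefix_count += 1;
        split += a_char.len_utf8();
    }
    // Report the count to callers that want it (jaro_winkler).
    if let Some(out) = prefix_count_out {
        *out = prefix_count;
    }
    // ... Jaro over `&a[split..]` / `&b[split..]`, folding in the
    // always-matching `prefix_count` chars, as in the earlier sketch ...
    let _ = (a, b, split);
    unimplemented!()
}

pub fn jaro(a: &str, b: &str) -> f64 {
    jaro_inner(a, b, None) // count not needed here
}

pub fn jaro_winkler(a: &str, b: &str) -> f64 {
    let mut prefix_count = 0;
    let jaro = jaro_inner(a, b, Some(&mut prefix_count));
    let l = prefix_count.min(4) as f64;
    jaro + l * 0.1 * (1.0 - jaro)
}
```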
@dguo (Member) commented Apr 8, 2019

Hey, I'm sorry for taking so long to get to this. I do appreciate all the work you put into it. I know you requested that I not ask you to break it up into smaller pull requests, but it is just too big. Would you be open to me helping you do so? I could go through the commits and make a checklist of discrete changes. So at least you wouldn't have to do the work of figuring out how to split it up.

@jnqnfe (Author) commented Apr 11, 2019

Sure, a plan would be great :)

vlc-mirrorer pushed a commit to videolan/vlc that referenced this pull request Dec 18, 2021
to be used with suggestion matching.

this implementation is based upon the implementation from the `strsim` Rust
crate, authored by Danny Guo, available at [1]; more specifically the (as
yet un-merged) optimised copy authored by myself, available at [2]. the
code is available under the MIT license.

one implementation difference is that we use floats rather than doubles
since we only intend to use this for suggestion matching in unknown option
error messages and we don't need the extra precision.

[1]: https://github.com/dguo/strsim-rs
[2]: rapidfuzz/strsim-rs#31
@maxbachmann (Member) commented

It has been quite a while since then, and a few things have changed in the meantime. One of the major changes is that going forward we will focus on implementing algorithms with reasonable performance while keeping the binary size small. People who care about maximum performance will be recommended to use https://github.com/rapidfuzz/rapidfuzz-rs, which focuses more on performance but has a larger binary size.

From a quick look over the changes they can be grouped into a couple of buckets:

  • update formatting: we do now use cargo fmt to keep consistent formatting

  • move tests: in Rust it appears to be common to have the tests next to the code, while only placing integration tests outside. We will need to decide where we want which parts in the future

  • improve tests: We certainly need to improve them. Currently the test output is not particularly helpful

  • improve algorithms:

    • avoid calculating lengths multiple times: that probably makes sense and is already done e.g. in reimplement damerau levenshtein distance #61
    • remove common prefix: This is generally an improvement and for this reason is done in https://github.com/rapidfuzz/rapidfuzz-rs. In fact it is often possible to remove both the common prefix and postfix (see the sketch after this list). However, this is likely nothing we want here, since it can affect the binary size quite a bit.
    • use inclusive ranges: I opened evaluate inclusive ranges #59 for this, since while this improves readability a bit, it can negatively affect the binary size.
  • some documentation improvements, like mentioning that the algorithms are not based on grapheme clusters. This could be extended to mention how to match using grapheme clusters, which should be possible when using the generic versions of the metrics
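For illustration, a minimal sketch of combined prefix and postfix trimming (illustrative only, not rapidfuzz-rs's actual code):

```rust
/// Sketch: trim both the common prefix and the common postfix before
/// running the main algorithm on what remains.
fn trim_common_affixes<'a>(a: &'a str, b: &'a str) -> (&'a str, &'a str) {
    // Common prefix: compare char by char from the front.
    let mut start = 0;
    for (a_char, b_char) in a.chars().zip(b.chars()) {
        if a_char != b_char {
            break;
        }
        start += a_char.len_utf8();
    }
    let (a, b) = (&a[start..], &b[start..]);
    // Common postfix: compare char by char from the back of the remainders.
    let mut end = 0;
    for (a_char, b_char) in a.chars().rev().zip(b.chars().rev()) {
        if a_char != b_char {
            break;
        }
        end += a_char.len_utf8();
    }
    (&a[..a.len() - end], &b[..b.len() - end])
}
```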

I agree with @dguo that if we want to have even part of this in, we would need to split this up into multiple separate PRs. In the current form it's simply not possible to review this properly.

@jnqnfe are you still interested in working on these things?

@jnqnfe (Author) commented Jan 5, 2024

Currently I have far too much on my plate to do any further work on this and I don't see that changing any time soon, sorry. It's taken me three days to even find time to provide this reply.

@maxbachmann (Member) commented

That's completely understandable. I just wanted to give you the option to continue working on this, since I know you did put quite a bit of work into this back then.

I did look through the changes a bit more:

  • The algorithmic improvements are probably nothing we would want. Not because they are useless, but because the project does focus on binary size nowadays and they just do not provide enough of an improvement to warrant the binary size increase.
  • Moving the tests is nothing we would want. I opened improve error message on test failure #67, which just improves the failure message, since the current one is not really helpful
  • the benchmark cleanup is something we do want. I opened remove superfluous module around benchmarks #66, which just cleans up the unnecessary bits. I will probably change this to use criterion soon though.
  • the documentation will need to be tackled separately, since we certainly need to improve it.

Not sure whether I missed anything small, but finding out would require this to be rebased, which would be a significant amount of work. For this reason I will close this PR.

Thanks again for the work you did put into this. I know the review has taken us quite a while 😅

@maxbachmann maxbachmann closed this Jan 5, 2024