Add support for ReadOnlySpan<char> #28

deeprobin · 2021-07-28T13:14:54Z

Benchmark: https://www.stevejgordon.co.uk/an-introduction-to-optimising-code-using-span-t

paulirwin · 2021-07-30T13:55:19Z

Thanks for the feedback!

At first glance I like this, however I have a few concerns that I want to make sure we think through before doing this.

This has the potential to cause a noticeable divergence from the upstream Java code, which we've been trying to keep to a minimum. I'm not against this change strictly for this reason, and I doubt it will be significant enough to cause issues, but it is a concern. A prototype could resolve any concerns I might have in that regard.

I'd also like to better understand the use cases that warrant this change. In my experience, this library is most often used for small strings, such as a street address. I suppose that one case would be if you are analyzing millions or billions of small strings against one to get similarity scores, where you might start to see a noticeable benefit here. If there is truly significant benefit to this change, it might be worth putting in some benchmarks with BenchmarkDotNet. But premature optimization being a problem and all... I don't want to gold-plate this library if it benefits no one or if it just shaves off a few nanoseconds here and there for microbenchmark points. I'd like to know the real benefit, especially given the divergence from the upstream code.

I think a good starting point would be the ShingleBased.GetProfile method, changing its parameter to ReadOnlySpan<char> and see what happens. We'll look at prototyping it and see how it goes. Thanks again for the interest!

deeprobin · 2021-10-27T18:01:27Z

What is the current status of this 😄?

paulirwin · 2021-10-27T18:09:39Z

@deeprobin I hadn't heard any response so to be honest it fell off my radar 😄 I should have phrased my post earlier in the form of a question - do you have real world examples of where this change would be a worthwhile benefit? I'd like to better understand the need to justify the diversion from the upstream code. Thanks!

deeprobin · 2021-10-27T18:13:10Z

@paulirwin Well, when it comes to bulk operations, for example, you want to go through thousands of entries and have the string similarity.

paulirwin · 2021-10-27T18:18:13Z

@deeprobin Mind sharing some performance numbers and/or an example data set as a baseline? Or is this just hypothetical at this point?

deeprobin · 2021-11-14T09:33:17Z

@paulirwin I found this 😄 https://adamsitnik.com/Span/

paulirwin · 2021-11-15T15:11:56Z

@deeprobin Thanks, but I was meaning about actual performance problems using this library that Span would help with. If there's no feedback from the community that this is a real problem, I'll leave this open to track it, but I'm not in a rush to deviate from upstream code if there's no compelling reason to.

SingThatSong · 2023-01-11T02:39:20Z

So I got curious about usefulness of spans in this project and that's what I've found:

Project is targeting .net standart 2.0, so it needs new reference to be able to use spans - System.Memory
Out of all measurers I found actually only one that can actually benefit from Span class - Ratcliff-Obershelp, which make huge use of substrings and IndexOf

I made a new class for RatcliffObershelp version, replacing strings with spans, the only non-technical change I had to make - Contains inside FrontMaxMatch must have StringComparison set, I used Ordinal, as it used in other places in this class. This new class passed all existing tests.

So, to benchmarks. I come up with three comparisons: non-random strings (I took them from the test, "My string" / "My tsring") and two test on random strings, one 10 characters long and another one 1000 characters long. Results are as follows:

Method	Mean	Error	StdDev	Gen0	Allocated
RatcliffObershelp_Static	1,133.0 ns	20.54 ns	19.21 ns	0.1469	2472 B
RatcliffObershelp_Static_Span	679.2 ns	12.86 ns	13.20 ns	0.0305	520 B
RatcliffObershelp	1,031.3 ns	19.70 ns	21.90 ns	0.1678	2824 B
RatcliffObershelp_Span	802.8 ns	15.67 ns	16.09 ns	0.0210	360 B
RatcliffObershelp_Large	94,435,206.7 ns	1,635,644.68 ns	1,529,983.05 ns	65200.0000	1090605024 B
RatcliffObershelp_Large_Span	37,851,482.9 ns	334,134.79 ns	312,549.89 ns	-	14995 B

It's a few times faster and consumes less memory (in case of long random strings - MUCH less memory, but i'm not sure if anyone uses the library this way)

paulirwin · 2023-01-11T15:37:18Z

@SingThatSong Thanks for the analysis! I do like limiting the scope of this change just to where it is the most impactful. I'd love to see a PR with your changes. The performance improvement is good, but the allocation improvement is dramatic!

paulirwin · 2023-01-16T15:15:57Z

This is completed for Ratcliff-Obsershelp via #33, please add new issues with benchmarks for other types if needed. Thanks!

SingThatSong mentioned this issue Jan 14, 2023

Optimization for Ratcliff-Obershelp using ReadOnlySpan #33

Merged

paulirwin closed this as completed Jan 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for ReadOnlySpan<char> #28

Add support for ReadOnlySpan<char> #28

deeprobin commented Jul 28, 2021

paulirwin commented Jul 30, 2021

deeprobin commented Oct 27, 2021

paulirwin commented Oct 27, 2021

deeprobin commented Oct 27, 2021

paulirwin commented Oct 27, 2021

deeprobin commented Nov 14, 2021

paulirwin commented Nov 15, 2021

SingThatSong commented Jan 11, 2023

paulirwin commented Jan 11, 2023

paulirwin commented Jan 16, 2023

Add support for ReadOnlySpan<char> #28

Add support for ReadOnlySpan<char> #28

Comments

deeprobin commented Jul 28, 2021

paulirwin commented Jul 30, 2021

deeprobin commented Oct 27, 2021

paulirwin commented Oct 27, 2021

deeprobin commented Oct 27, 2021

paulirwin commented Oct 27, 2021

deeprobin commented Nov 14, 2021

paulirwin commented Nov 15, 2021

SingThatSong commented Jan 11, 2023

paulirwin commented Jan 11, 2023

paulirwin commented Jan 16, 2023