Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for ReadOnlySpan<char> #28

Closed
deeprobin opened this issue Jul 28, 2021 · 10 comments
Closed

Add support for ReadOnlySpan<char> #28

deeprobin opened this issue Jul 28, 2021 · 10 comments

Comments

@deeprobin
Copy link

Benchmark: https://www.stevejgordon.co.uk/an-introduction-to-optimising-code-using-span-t

@paulirwin
Copy link
Member

Thanks for the feedback!

At first glance I like this, however I have a few concerns that I want to make sure we think through before doing this.

This has the potential to cause a noticeable divergence from the upstream Java code, which we've been trying to keep to a minimum. I'm not against this change strictly for this reason, and I doubt it will be significant enough to cause issues, but it is a concern. A prototype could resolve any concerns I might have in that regard.

I'd also like to better understand the use cases that warrant this change. In my experience, this library is most often used for small strings, such as a street address. I suppose that one case would be if you are analyzing millions or billions of small strings against one to get similarity scores, where you might start to see a noticeable benefit here. If there is truly significant benefit to this change, it might be worth putting in some benchmarks with BenchmarkDotNet. But premature optimization being a problem and all... I don't want to gold-plate this library if it benefits no one or if it just shaves off a few nanoseconds here and there for microbenchmark points. I'd like to know the real benefit, especially given the divergence from the upstream code.

I think a good starting point would be the ShingleBased.GetProfile method, changing its parameter to ReadOnlySpan<char> and see what happens. We'll look at prototyping it and see how it goes. Thanks again for the interest!

@deeprobin
Copy link
Author

What is the current status of this 😄?

@paulirwin
Copy link
Member

@deeprobin I hadn't heard any response so to be honest it fell off my radar 😄 I should have phrased my post earlier in the form of a question - do you have real world examples of where this change would be a worthwhile benefit? I'd like to better understand the need to justify the diversion from the upstream code. Thanks!

@deeprobin
Copy link
Author

@paulirwin Well, when it comes to bulk operations, for example, you want to go through thousands of entries and have the string similarity.

@paulirwin
Copy link
Member

@deeprobin Mind sharing some performance numbers and/or an example data set as a baseline? Or is this just hypothetical at this point?

@deeprobin
Copy link
Author

@paulirwin I found this 😄 https://adamsitnik.com/Span/

@paulirwin
Copy link
Member

@deeprobin Thanks, but I was meaning about actual performance problems using this library that Span would help with. If there's no feedback from the community that this is a real problem, I'll leave this open to track it, but I'm not in a rush to deviate from upstream code if there's no compelling reason to.

@SingThatSong
Copy link
Contributor

So I got curious about usefulness of spans in this project and that's what I've found:

  1. Project is targeting .net standart 2.0, so it needs new reference to be able to use spans - System.Memory
  2. Out of all measurers I found actually only one that can actually benefit from Span class - Ratcliff-Obershelp, which make huge use of substrings and IndexOf

I made a new class for RatcliffObershelp version, replacing strings with spans, the only non-technical change I had to make - Contains inside FrontMaxMatch must have StringComparison set, I used Ordinal, as it used in other places in this class. This new class passed all existing tests.

So, to benchmarks. I come up with three comparisons: non-random strings (I took them from the test, "My string" / "My tsring") and two test on random strings, one 10 characters long and another one 1000 characters long. Results are as follows:

Method Mean Error StdDev Gen0 Allocated
RatcliffObershelp_Static 1,133.0 ns 20.54 ns 19.21 ns 0.1469 2472 B
RatcliffObershelp_Static_Span 679.2 ns 12.86 ns 13.20 ns 0.0305 520 B
RatcliffObershelp 1,031.3 ns 19.70 ns 21.90 ns 0.1678 2824 B
RatcliffObershelp_Span 802.8 ns 15.67 ns 16.09 ns 0.0210 360 B
RatcliffObershelp_Large 94,435,206.7 ns 1,635,644.68 ns 1,529,983.05 ns 65200.0000 1090605024 B
RatcliffObershelp_Large_Span 37,851,482.9 ns 334,134.79 ns 312,549.89 ns - 14995 B

It's a few times faster and consumes less memory (in case of long random strings - MUCH less memory, but i'm not sure if anyone uses the library this way)

@paulirwin
Copy link
Member

@SingThatSong Thanks for the analysis! I do like limiting the scope of this change just to where it is the most impactful. I'd love to see a PR with your changes. The performance improvement is good, but the allocation improvement is dramatic!

@paulirwin
Copy link
Member

This is completed for Ratcliff-Obsershelp via #33, please add new issues with benchmarks for other types if needed. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants