-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for ReadOnlySpan<char> #28
Comments
Thanks for the feedback! At first glance I like this, however I have a few concerns that I want to make sure we think through before doing this. This has the potential to cause a noticeable divergence from the upstream Java code, which we've been trying to keep to a minimum. I'm not against this change strictly for this reason, and I doubt it will be significant enough to cause issues, but it is a concern. A prototype could resolve any concerns I might have in that regard. I'd also like to better understand the use cases that warrant this change. In my experience, this library is most often used for small strings, such as a street address. I suppose that one case would be if you are analyzing millions or billions of small strings against one to get similarity scores, where you might start to see a noticeable benefit here. If there is truly significant benefit to this change, it might be worth putting in some benchmarks with BenchmarkDotNet. But premature optimization being a problem and all... I don't want to gold-plate this library if it benefits no one or if it just shaves off a few nanoseconds here and there for microbenchmark points. I'd like to know the real benefit, especially given the divergence from the upstream code. I think a good starting point would be the |
What is the current status of this 😄? |
@deeprobin I hadn't heard any response so to be honest it fell off my radar 😄 I should have phrased my post earlier in the form of a question - do you have real world examples of where this change would be a worthwhile benefit? I'd like to better understand the need to justify the diversion from the upstream code. Thanks! |
@paulirwin Well, when it comes to bulk operations, for example, you want to go through thousands of entries and have the string similarity. |
@deeprobin Mind sharing some performance numbers and/or an example data set as a baseline? Or is this just hypothetical at this point? |
@paulirwin I found this 😄 https://adamsitnik.com/Span/ |
@deeprobin Thanks, but I was meaning about actual performance problems using this library that Span would help with. If there's no feedback from the community that this is a real problem, I'll leave this open to track it, but I'm not in a rush to deviate from upstream code if there's no compelling reason to. |
So I got curious about usefulness of spans in this project and that's what I've found:
I made a new class for RatcliffObershelp version, replacing strings with spans, the only non-technical change I had to make - So, to benchmarks. I come up with three comparisons: non-random strings (I took them from the test, "My string" / "My tsring") and two test on random strings, one 10 characters long and another one 1000 characters long. Results are as follows:
It's a few times faster and consumes less memory (in case of long random strings - MUCH less memory, but i'm not sure if anyone uses the library this way) |
@SingThatSong Thanks for the analysis! I do like limiting the scope of this change just to where it is the most impactful. I'd love to see a PR with your changes. The performance improvement is good, but the allocation improvement is dramatic! |
This is completed for Ratcliff-Obsershelp via #33, please add new issues with benchmarks for other types if needed. Thanks! |
Benchmark: https://www.stevejgordon.co.uk/an-introduction-to-optimising-code-using-span-t
The text was updated successfully, but these errors were encountered: