Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

please reconsider the addition of ndarray and hashbrown as dependencies #34

Closed
BurntSushi opened this issue Apr 26, 2019 · 8 comments
Closed

Comments

@BurntSushi
Copy link

I left a comment here on the commit in which these deps were introduced, but I'm not sure if it was seen: d6717db#r33186236

So I'd like to open an issue to increase its visibility. In particular, it would be great if strsim could remain fairly lightweight since it is depended on by a bunch of crates via clap. I'm hoping that we, as an ecosystem, can be a bit more judicious in adding dependencies to crates.

For hashbrown, which is awesome, it is going to be included in the standard library very soon. Folks upgrading their Rust version will automatically get this optimization soon enough. I don't think it's worth adding it as an explicit dependency for a small gain for users on older versions of Rust, because it impacts compilation times for everyone.

For ndarray, it looks like it was just used for some small convenience. I don't think it's worth bringing ndarray in (as excellent as it is) along with all of its dependencies just for that.

dguo referenced this issue Apr 26, 2019
* [performance] Improve the performance of damerau_levenshtein

My tests show the new version is approximately 2.5 times faster than
the old ones. Most of the gains come from using hashbrown for the
hashmap.

* Switch back to rust 2015 to fix CI

* Increase minimum required rust version to 1.31

* Simplify the levenshtein function
@dguo
Copy link
Member

dguo commented Apr 26, 2019

Hi, thanks for bringing this up. I did consider the usual trade offs with taking on dependencies. The performance gain seemed worth it to me, but I didn't know that hashbrown was going to be merged into the standard library until I saw it on Reddit recently. Based on that, I agree it doesn't need to be a dependency here.

For ndarray, that was laziness on my part. I should have separated it out to see if it alone would improve performance significantly. Based on the first comment in #32, it probably won't. I will try it for myself this weekend to verify. Apologies for not doing so before merging. After that, I will pull hashbrown out and ndarray as well, barring any surprising results from my testing.

@lovasoa, please let me know if you have any objections or concerns.

@BurntSushi, I'll try to be more diligent in the future, as I agree that lower level libraries should require a correspondingly higher bar for taking on dependencies.

@BurntSushi
Copy link
Author

Thank you so much! :-)

@dguo
Copy link
Member

dguo commented Apr 30, 2019

From my testing, it appears that using ndarray does result in a speedup of about 17%.

0.9.0:

test benches::bench_damerau_levenshtein            ... bench:      20,601 ns/iter (+/- 474)

with ndarray:

test benches::bench_damerau_levenshtein            ... bench:      17,160 ns/iter (+/- 576)

17% is a bigger value than I expected, but it still doesn't seem large enough to justify all the sub-dependencies, especially considering only one of the metrics is affected. I'm going to sleep on it.

@BurntSushi
Copy link
Author

Can you point me in the right direction to re-run those benchmarks? I can take a look. It is surprising to me that ndarray is responsible for that speedup.

@lovasoa
Copy link
Contributor

lovasoa commented Apr 30, 2019

A compromise between adding a dependency to ndarray and using a vector of vectors may be to use a single fixed size array, and make the necessary computations at array access time, replacing

distances[i][j]

by

distances[i*width + j]

dguo added a commit that referenced this issue May 5, 2019
Instead of representing a 2x2 grid with a vector of vectors, just use a single
vector to improve performance. We can do this since the dimensions are fixed.

This method was suggested by @lovasoa as an alternative to adding a dependency
on the ndarray crate.

In my benchmark testing, the new approach is about as fast using ndarray. On my
machine, the original approach takes about 22,000 ns/iter, whereas the new
approach takes about 17,000 ns/iter.

See #34 for more context.
@dguo
Copy link
Member

dguo commented May 5, 2019

@BurntSushi, you can run cargo +nightly bench in the ndarray branch that I just pushed up and compare it to 0.9.0.

@lovasoa, nice suggestion. I just tried using a single, flat vector (see the flatten-vectors branch), and it's as fast as using ndarray. Looks like that's the way to go, and we can drop ndarray while retaining the performance boost.

I suppose ndarray is doing something similar under the hood, rather than the naive, vector of vectors approach that I did.

dguo added a commit that referenced this issue May 9, 2019
dguo added a commit that referenced this issue May 9, 2019
We can drop a dependency while retaining the performance boost.

Related to #34
@dguo
Copy link
Member

dguo commented May 9, 2019

Ok, I've published v0.9.2, which removes hashbrown and switches out ndarray for the one vector approach. Thanks everyone!

@dguo dguo closed this as completed May 9, 2019
@BurntSushi
Copy link
Author

Thank you! :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants