[FEATURE REQUEST] Adding simhash to --filter-similar-to #711

Closed
Luoooio opened this issue Nov 21, 2022 · 3 comments · Fixed by #794
Labels: enhancement (New feature or request), pinned

Comments

Luoooio commented Nov 21, 2022

I wondered why --filter-similar-to didn't work well until I read the article and realized that it uses SSDeep, which does a terrible job of handling text with little content.

> SSDeep was selected as it does a good job of identifying near-duplicate pages once content-length reaches a certain size, while remaining performant. Other algorithms were tested but resulted in huge performance hits (orders of magnitude slower on requests/second).

But this is a necessary test, especially for API endpoints that return only a small amount of information. So why not add simhash as an additional option? I find it works well when dealing with short texts. I'm looking forward to such an update.
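
For readers unfamiliar with simhash, here is a minimal sketch of the idea in Python. It is an illustration only, not feroxbuster's implementation. The function names, the character-shingle size, the 64-bit width, the choice of BLAKE2b as the feature hash, and the `max_distance` threshold are all assumptions made for the example.

```python
import hashlib


def simhash(text: str, shingle_size: int = 3, bits: int = 64) -> int:
    """Return a `bits`-wide SimHash fingerprint built from character shingles.

    `bits` is assumed to be a multiple of 8 and at most 512 (BLAKE2b's limit).
    """
    # Normalise case and whitespace so formatting differences are ignored.
    normalized = " ".join(text.lower().split())
    # Overlapping character shingles; very short inputs yield a single feature.
    count = max(len(normalized) - shingle_size + 1, 1)
    shingles = [normalized[i:i + shingle_size] for i in range(count)]

    # Each shingle votes +1 or -1 on every bit position of the fingerprint.
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1

    # A bit is set in the fingerprint when its votes sum to a positive value.
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_similar(a: str, b: str, max_distance: int = 8) -> bool:
    """Treat two texts as near-duplicates if their fingerprints are close.

    The default threshold is an arbitrary assumption and would need tuning.
    """
    return hamming_distance(simhash(a), simhash(b)) <= max_distance
```

A filter built on this would fingerprint the reference response once and then compare every later response against it, for example `is_similar(reference_body, response_body)`. Because every input produces a fixed-width fingerprint regardless of its length, the comparison still works for short API responses.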

Luoooio added the enhancement label Nov 21, 2022

epi052 (Owner) commented Nov 25, 2022

Good morning, and thanks for the suggestion!

I haven't heard of, nor tried, simhash. When I get some time, I'll experiment with it and see how it goes. I'm definitely not opposed to a better algo, if it works out that way.

Thanks again!

epi052 (Owner) commented Feb 28, 2023

@all-contributors add @Luoooio for ideas

allcontributors (Contributor) commented

@epi052

I've put up a pull request to add @Luoooio! 🎉
