Speed up duplication detection process #1
Comments
How come is it
Given a document, if we process I think hashing (more specifically, LSH) can help reduce the complexity of
OK, I'll try to find a solution. Also, can you tell me where this findDuplication function is called?
Thanks. Call path:
Levenshtein automata are said to be useful when the character distribution is poor. I'm not sure whether that works here, so I'm just noting it: https://www.npmjs.com/package/node-levenshtein-automata
Please note that we are moving away from Airtable very soon, and this issue will become trivial after the shift. After that, the seed script no longer needs to ask for similarities: we can just provide a static set of seed data in which all rumors and answers are already unique. See https://www.facebook.com/groups/1847232902175197/1896817880550032/ for the shift; the derivative issues are #2, #3 and #4.
We have shifted to Elasticsearch, so this is no longer required.
Currently it's an O(n^2) scan over all rumor entries: https://github.com/MrOrz/rumors-db/blob/master/scripts/csvToElasticSearch.js#L108
Locality-sensitive hashing (LSH) should help, but it requires an effective design for the hashing bins.
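
To make the LSH idea concrete, here is a minimal sketch of one possible bin design: MinHash signatures over character trigrams, split into bands, with an exact Jaccard check on the shortlisted pairs. Nothing here is taken from rumors-db. The function names, the seeded hash, and the `bands`, `rows` and `threshold` values are all assumptions chosen for illustration.

```javascript
// Hypothetical sketch, not code from rumors-db: MinHash + banded LSH to
// shortlist near-duplicate texts instead of comparing every pair.

// Character n-gram shingles; whitespace and case are ignored.
function shingles(text, n = 3) {
  const chars = Array.from(text.toLowerCase().replace(/\s+/g, ''));
  const set = new Set();
  for (let i = 0; i + n <= chars.length; i++) {
    set.add(chars.slice(i, i + n).join(''));
  }
  return set;
}

// Seeded 32-bit hash (FNV-1a style loop plus a mixing step).
// Illustrative only; it is not a vetted hash family.
function hash32(str, seed) {
  let h = (0x811c9dc5 ^ Math.imul(seed + 1, 0x9e3779b1)) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  return h >>> 0;
}

// MinHash signature: for each seed, keep the smallest hash over all shingles.
function minhash(shingleSet, numHashes) {
  const sig = new Array(numHashes).fill(0xffffffff);
  for (const s of shingleSet) {
    for (let k = 0; k < numHashes; k++) {
      const h = hash32(s, k);
      if (h < sig[k]) sig[k] = h;
    }
  }
  return sig;
}

// Exact Jaccard similarity, used only to verify shortlisted pairs.
function jaccard(a, b) {
  let inter = 0;
  for (const x of a) if (b.has(x)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

// Returns index pairs [i, j] (i < j) whose Jaccard similarity >= threshold.
function findDuplicatePairs(texts, { bands = 16, rows = 4, threshold = 0.8 } = {}) {
  const sets = texts.map((t) => shingles(t));
  const buckets = new Map();
  sets.forEach((set, id) => {
    const sig = minhash(set, bands * rows);
    // Each band of the signature is one "bin"; equal bands share a bucket.
    for (let b = 0; b < bands; b++) {
      const key = b + ':' + sig.slice(b * rows, (b + 1) * rows).join(',');
      if (!buckets.has(key)) buckets.set(key, []);
      buckets.get(key).push(id);
    }
  });
  const seen = new Set();
  const pairs = [];
  for (const ids of buckets.values()) {
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        const pairKey = ids[i] + ',' + ids[j];
        if (seen.has(pairKey)) continue;
        seen.add(pairKey);
        if (jaccard(sets[ids[i]], sets[ids[j]]) >= threshold) {
          pairs.push([ids[i], ids[j]]);
        }
      }
    }
  }
  return pairs;
}
```

With idealized hash functions, a pair with Jaccard similarity s lands in at least one shared bucket with probability 1 - (1 - s^rows)^bands. More rows per band cuts false candidates, and more bands cuts missed duplicates. Identical texts always share every bucket, while near-duplicates are only found with high probability, so the parameters would need tuning against real rumor data.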