Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up duplication detection process #1

Closed
MrOrz opened this issue Jan 11, 2017 · 7 comments
Closed

Speed up duplication detection process #1

MrOrz opened this issue Jan 11, 2017 · 7 comments

Comments

@MrOrz
Copy link
Member

MrOrz commented Jan 11, 2017

Currently it's O(n^2) scan for all rumor entries.

https://github.com/MrOrz/rumors-db/blob/master/scripts/csvToElasticSearch.js#L108

Locality sensitive hashing should help, but requires effective design for the hashing bins.

@AnshulMalik
Copy link

How come is it O(n^2) ?

@MrOrz
Copy link
Member Author

MrOrz commented Jan 11, 2017

Given a document, findDuplication scans through each processed document to calculate similarity between them (O(n)).

If we process n documents, findDuplication is invoked 1+2+...+n times, which derives O(n^2).

I think hashing (more specifically, LSH) can help reduce the complexity of findDuplication to constant. But I have no idea how to implement LSH now.

@AnshulMalik
Copy link

AnshulMalik commented Jan 11, 2017

Ok.

I'll try to find some solution, also, can you tell me where is this findDuplication function is called?

@MrOrz
Copy link
Member Author

MrOrz commented Jan 11, 2017

@MrOrz
Copy link
Member Author

MrOrz commented Jan 11, 2017

It is said that Levenshtein Automata is useful when the character distribution is poor.

Not sure if it works, I just note it here: https://www.npmjs.com/package/node-levenshtein-automata

@MrOrz
Copy link
Member Author

MrOrz commented Jan 14, 2017

Please note that we are moving away from Airtable very soon. This issue will be trivial after the shift.

After that, in the seed script we no longer need to ask for similarities -- we can just provide static set of seed data, in which all rumors and answers are already unique.

See https://www.facebook.com/groups/1847232902175197/1896817880550032/ for the shift, and the derivative issues are #2 , #3 and #4 .

@MrOrz
Copy link
Member Author

MrOrz commented Mar 18, 2017

We have shifted to elastic search and this is no longer required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants